ARM processors for servers



The LeMaker Cello has an AMD Opteron A1120 quad-core Cortex-A57 processor running at 1.7GHz, 2x DDR3 SO-DIMM slots, two Serial ATA 3.0 ports, Gigabit Ethernet, two USB 3.0 ports, one PCI Express x16 3.0 slot, etc. The AMD A1120 SoC has a 25 Watt TDP so it will also need active cooling. The LeMaker Cello should be able to mount in a mini-ITX / micro-ATX chassis.

http://phoronix.com/scan.php?page=news_item&px=AMD-LeMaker-Cello

Very interesting, and considering it's CPU + board, the price isn't bad. And this one at least has a real ARM CPU. No A53 here.
 
(...) And this one at least has a real ARM CPU. No A53 here.

But 25W without a GPU? You have to understand this is aimed at other markets, and I'm not sure it won't take a serious beating from the new Kirin 950, for example, which consumes far less. It's true there's more I/O and the like here, but even so I'm not expecting anything revolutionary, imho.
 
But 25W without a GPU? You have to understand this is aimed at other markets, and I'm not sure it won't take a serious beating from the new Kirin 950, for example, which consumes far less. It's true there's more I/O and the like here, but even so I'm not expecting anything revolutionary, imho.

25W for a quad-core ARM does seem like a lot, yes, but I'll wait for the benchmarks. It may turn out to be a disappointment, but for AMD this processor is also the start of something.
But let's imagine the Kirin 950 is faster. Great, but nobody is going to use it in the enterprise market this AMD part is aimed at.
 
This Facebook article, for me, explains the current situation with Intel's SoCs and the problem ARM is going to have breaking into this market.

One of the reasons given is that CPUs with very simple cores (Atom-like) don't suit Facebook's infrastructure, and that rules out the current server ARM parts from the start.

But on to the post. Facebook developed the Xeon-D with Intel; in Facebook's version these are servers with just 1 CPU (instead of the usual 2), a 100% SoC with no chipset, 16 cores at 2.3 GHz, consuming only 65W.
These are the servers:
[image]


They are placed on a board holding 4 of them plus a NIC with two 2x25 Gbit links. Each server has its own IP and there is no link between the servers (no NUMA):
[image]


They went from a server with two CPUs and one NIC to a board with 4 single-CPU servers and one NIC shared between them:
[image]


This is the difference in performance per watt:
[image]


Right now the Xeon-D is the ARM killer, with few ARM servers on the market.
 
Demo of RHEL 7.2 with an upstream kernel on a 96Boards HuskyBoard, with the AMD CPU (at the 28:00 mark):

The importance of standards in the ARM world, and a demo of the Qualcomm CPU with RHEL 7.2:

As I've been saying, standards in the ARM world are going to start showing up in servers, making it possible to run an upstream kernel. It remains to be seen whether these standards will also reach the consumer space.
Look up SBSA/SBBR, which is what is being implemented in ARM server CPUs.
 
Benchmarks of a dual Cavium ThunderX, 96 ARMv8 cores in total:
http://www.servethehome.com/exclusi...rx-dual-48-core-96-core-total-arm-benchmarks/

More or less predictable. At this point nobody can build a 48-core processor whose cores are competitive with an Intel Haswell or Skylake.
What we have here is an Atom-class core, which at first glance may look bad, but in a dual-CPU system you get 96 cores. In other words, it's a CPU you can forget about for single-thread work, but for multi-threaded loads, or lots of independent threads, it's still interesting.
It's also very well equipped on the networking side, with several links of 10 Gbit and up, and it can take 512 GB of RAM per CPU.
Interesting for a system running a huge number of processes at once, or for something like Redis or Elasticsearch that doesn't use much CPU but needs a lot of RAM.

Still, it's a CPU for niche markets, and a Xeon E5 remains far better for general-purpose computing by a wide margin.
Another point is that Intel could easily build a processor with the same characteristics as this ThunderX if it wanted to. The Intel Xeon Phi Knights Landing has 72 Atom cores plus two vector units per core; Intel could "easily" make a version of that processor without the vector units.

I'm not expecting this ThunderX to seriously threaten Intel's Xeons, but it's still interesting.
 
AppliedMicro has presented the X-Gene3 and, in part, the X-Gene3XL

[image: APM X-Gene 3 block diagram]


The X-Gene3 is a single-socket processor with 32 cores, 8 memory channels, 32 MB of L3 and 42 PCI Express 3.0 lanes. It has no integrated accelerators or LAN controllers. In simulations it appears to be competitive in SPECint with the Intel Xeon E5-2680 v4. The catch is that the X-Gene3 only ships in the second half of 2017, and by then Intel will have Skylake-EP on the market and AMD will have Zen.

The X-Gene3XL will come later, with 64 cores and 2-socket support.

Source:
http://semiaccurate.com/2016/04/25/appliedmicros-x-gene-3-aims-for-intels-e5-xeons/
http://www.linleygroup.com/cms_builder/uploads/x-gene-3-white-paper-final.pdf

It's interesting. What I don't know is whether it's coming too late.
 
Update to the Cavium ThunderX test

CAVIUM THUNDERX BENCHMARKS PART II: WHY ENTERPRISE ARM DEVELOPERS NEED THESE MACHINES

Final Words
There is a lot more content coming on the Cavium side. At the end of the day, the x86 software ecosystem still has many more man hours behind optimizations of compilers and software. On the other hand, there are some big applications out there (nginx, memcached, redis, HAproxy, Ceph and others) that already have a lot of ARM optimization work behind them.

If you are a data center ARM developer in the summer of 2016, the Cavium ThunderX machines are the 64-bit ARM platform to get. While there are less expensive options out there, those do not compete with the Intel Xeon line-up at this point. In the STH lab we are seeing the impact of developers' efforts bringing huge performance increases to the Cavium platform. The Cavium ThunderX is the only game in town when it comes to having a generally available data center ARM platform that not only meets Intel's platform in some price/performance areas but exceeds it in several.
http://www.servethehome.com/cavium-...arks-enterprise-arm-developers-need-machines/


And while we're at it

Huaxintong Semiconductor Licenses ARMv8-A Architecture
Huaxintong Semiconductor Technology is a joint venture between China’s Guizhou province and a subsidiary of Qualcomm. The venture is registered in Guizhou province, the first region to build an industrial cluster for big data development in China. The area is already home to a data center cluster of more than 2.5 million servers for companies including China Mobile, China Telecom and China Unicom.
https://www.hpcwire.com/off-the-wire/huaxintong-semiconductor-licenses-armv8-architecture/
 
ARM Announces ARM v8-A with Scalable Vector Extensions: Aiming for HPC and Data Center

[image]


Scalable Vector Extensions (SVE) will be a flexible addition to the ISA, and support from 128-bit to 2048-bit. ARM has included the extensions in a way that if included in the hardware, the hardware is scalable: it doesn’t matter if the code being run calls for 128-bit, 512-bit or 2048-bit, the scheduler will arrange the calculations to compensate for the hardware that is available. Thus a 2048-bit code run on a 128-bit SVE core will manage the instructions in such a way to complete the calculation, or a 128-bit code on a 2048-bit core will attempt to improve IPC by bundling 128-bit calculations together.
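
To make the vector-length-agnostic idea concrete, here is a minimal sketch (my own illustration, not from the article) of a loop written with the ACLE SVE intrinsics in <arm_sve.h>; the saxpy function, names and compile flags are assumptions for the example. The same source, and even the same binary, works whether the hardware implements 128-bit or 2048-bit vectors, because svcntw() reports the lane count at run time and the whilelt predicate masks off the tail lanes (per-lane predication):

/* Minimal sketch (illustration only): vector-length-agnostic y = a*x + y with SVE.
 * Assumed build: gcc -O2 -march=armv8-a+sve saxpy_sve.c */
#include <arm_sve.h>
#include <stdint.h>

void saxpy(float *restrict y, const float *restrict x, float a, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {     /* svcntw(): 32-bit lanes per vector */
        svbool_t    pg = svwhilelt_b32_s64(i, n);   /* predicate: lanes still in bounds  */
        svfloat32_t vx = svld1_f32(pg, &x[i]);      /* predicated loads                  */
        svfloat32_t vy = svld1_f32(pg, &y[i]);
        vy = svmla_n_f32_x(pg, vy, vx, a);          /* y += a * x                        */
        svst1_f32(pg, &y[i], vy);                   /* predicated store                  */
    }
}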

[image]


This is different to NEON, which works on 64-bit and 128-bit vectors. ARM is soon submitting patches to GCC and LLVM to support auto-vectorization for SVE, either by directives or by detecting applicable command sequences.

Performance metrics from ARM's labs already show significant speed-ups for certain data sets, and ARM expects that over time more code paths will be able to take advantage of SVE. ARM is encouraging semiconductor architecture licensees that need fine-grained HPC control to adopt SVE in both hardware and code, such that as the nature of the platform adapts over time, both sides will see a benefit as the instructions are scalable.

[image]


http://www.anandtech.com/show/10586...tor-extensions-aiming-for-hpc-and-data-center
 
ARM Puts Some Muscle Into Vector Number Crunching

With the ARMv8 chasing new architectures – and in particular getting some footing in the HPC space in Europe, China, and Japan – Stephens and his team at ARM wanted to have a new vector architecture that was a bit more radical and would last for a long time. Specifically, they determined that this SVE approach would need to support gather load and scatter store operations, have per lane predication, and support even longer vectors so that more fine-grained parallelism could be extracted from the processor (and not have to be offloaded to a GPU or DSP or FPGA coprocessor). But the issue was, how long do you make the vectors?

[image]


To give a sense of how much better SVE vectors are for HPC workloads compared to NEON vectors, Stephens showed the following chart, which shows the increase in the vectorization of various HPC codes and the level of speedup when pushing that increased vectorized code through an SVE unit with 2,048-bit vectors:

[image]


As you can see, many of the HPC codes tested by ARM using simulators for both SVE and NEON do not see any increased vectorization or performance boost at all, while those on the right of the chart show increasing vectorization and performance speedups, particularly as the vectors get increasingly large in terms of their bitness. The 512-bit limit is significant because Fujitsu has chosen a 512-bit length for the ARM-based processor to be used at the heart of the Post-K machine; longer vectors are possible and would presumably show an even larger speedup than is shown for certain applications.
http://www.nextplatform.com/2016/08/23/arm-puts-muscle-vector-number-crunching/


- Technology Update: The Scalable Vector Extension (SVE) for the ARMv8-A architecture

- ARM Reaches for Supercomputers
 
- ARM Research Summit 2016

ARM's first Research Summit is happening today at Churchill College, Cambridge. We have near-front row seats and are expecting some details on future HPC plans today.
http://www.anandtech.com/show/10680/arm-research-summit-2016-keynote-live-blog

The 2nd part isn't dedicated to servers/HPC
http://www.anandtech.com/show/10684/arm-research-summit-research-roadmap-keynote-live-blog


- Details Emerge On China’s 64-Core ARM Chip

the first engineering samples of the two Earth ARM chips, the FT-1500A/4 and the FT-1500A/16, as well as the one Mars ARM chip, the FT-2000/64, are back from TSMC, and we saw systems running the Kylin Linux operating system (a variant of Canonical’s Ubuntu) at the Hot Chips event.
...
Both the Earth and Mars chips are etched with the mature 28 nanometer processes from TSMC

The low-end Earth chip is the FT-1500A/4, which has four FTC660 generation Xiaomi cores on its die. The chip runs at between 1.5 GHz and 2 GHz, and has 2 MB of L2 cache spread across the cores and an additional 8 MB of L3 cache shared by the cores. The entry Earth chip, which is aimed at desktops, laptops, and lightweight server workloads like web serving, email serving, storage arrays and clusters, has two DDR3 memory controllers running at 1.6 GHz, which deliver an aggregate of 25.6 GB/sec of memory bandwidth. The chip, which has 1,150 pins, has one 1 Gb/sec Ethernet interface and a PCI-Express 3.0 controller that can express itself as two x16 or four x8 interfaces to peripherals. The FT1500A/4 has a maximum power draw of a mere 15 watts.

The FT-1500A/16 variant of the Earth chip uses the same cores, and with four times the number of Xiaomi FTC660 cores, it has four times the L2 cache spread across those cores at 8 MB. The L3 cache on this fatter Earth chip stays the same at 8 MB of total capacity. This bigger Earth chip has four DDR3 memory controllers running at 1.6 GHz, for a total of 51.2 GB/sec of memory bandwidth, and it has two 1 Gb/sec Ethernet ports coming off the on-die network controller and the same PCI-Express controller. The FT-1500A/16 chip has 1,944 pins, and it has a maximum power of 35 watts.

The Mars FT-2000/64 chip is based on the FTC661 generation of Xiaomi cores, and has the same 512 KB L2 cache per core, but delivers a whopping 128 MB of L3 cache across the 64 cores on the die. The cores on the Mars chip have a design frequency of between 1.5 GHz and 2 GHz, just like the cores in the Earth chips. The Mars processor has sixteen DDR3 memory controllers, which deliver a total of 204.8 GB/sec of memory bandwidth running at 1.6 GHz. The whole shebang needs 2,892 pins and has a maximum power draw of 100 watts.
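
As a back-of-envelope check (mine, not the article's), the quoted aggregate figures follow directly from the channel counts: a DDR3-1600 channel moves 1600 MT/s over an 8-byte bus, i.e. 12.8 GB/s, so 2, 4 and 16 channels give 25.6, 51.2 and 204.8 GB/s respectively. A trivial sketch of that arithmetic:

/* Back-of-envelope check of the quoted DDR3-1600 bandwidth figures
 * (my own illustration; assumes the usual 64-bit / 8-byte channel width). */
#include <stdio.h>

static double ddr_bw_gbs(double mega_transfers, int bytes_per_beat, int channels)
{
    return mega_transfers * bytes_per_beat * channels / 1000.0;  /* GB/s */
}

int main(void)
{
    printf("FT-1500A/4  (2 ch): %.1f GB/s\n", ddr_bw_gbs(1600, 8, 2));   /* 25.6  */
    printf("FT-1500A/16 (4 ch): %.1f GB/s\n", ddr_bw_gbs(1600, 8, 4));   /* 51.2  */
    printf("FT-2000/64 (16 ch): %.1f GB/s\n", ddr_bw_gbs(1600, 8, 16));  /* 204.8 */
    return 0;
}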

Mars FTC661 Xiaomi Core

[image]


The LIUs interface to the cache and memory chip (CMC) units, which include the memory controllers and interfaces to the cache memory;
...
interface between the Xiaomi core panels and this CMC is proprietary, and it looks like it will have multiple uses. It provides 19.2 GB/sec of read and write bandwidth per interface.

[image]


Special Purpose Accelerators (SPA)

[image]


Here is the prototype FT-2000/64 system that Phytium was showing off:

[image]


[image]

http://www.nextplatform.com/2016/09/01/details-emerge-chinas-64-core-arm-chip/
 
Xiaomi? The Xiaomi that makes phones and gadgets? Are they making CPUs now too? Or is Xiaomi just the name/ID of the cores, with nothing to do with the company? Yes, I'm confused :P
 
Images from the Hot Chips presentation

Xiaomi is the name of the cores; there are two variants, FTC660 (Earth) and FTC661 (Mars). The company is Phytium, although it is apparently a subsidiary of China Electronics Corporation.

The only phone company that makes its own chips is Huawei, with HiSilicon, but they are also a giant in servers/cloud/networking/storage.
 
Right, that's exactly what was confusing me, but since Xiaomi the company had been sticking its nose into everything it could, I did wonder whether they had some division I didn't know about lol.

Let's see if I have the patience to go read a bit of the articles later :)
 
This is starting to come together, at least on the hardware side.

New ARM IP Launched: CMN-600 Interconnect for 128 Cores and DMC-620, an 8Ch DDR4 IMC

[image]


The idea behind a coherent mesh between cores as it stands in the ARM Server SoC space is that you can put a number of CPU clusters (e.g. four lots of 4xA53) and accelerators (custom or other IP) into one piece of silicon. Each part of the SoC has to work with everything else, and for that ARM offers a variety of interconnect licences for users who want to choose from ARM's IP range. For ARM licensees who pick multiple ARM parts, this makes it easier to combine high core counts and accelerators in one large SoC.

[image]


The previous generation interconnect, the CCN-512, could support 12 clusters of 4 cores and maintain coherency, allowing for large 48-core chips. The new CMN-600 can support up to 128 cores (32 clusters of 4). As part of the announcement, there is also an agile system cache, which is a way for I/O devices to allocate memory and cache lines directly into the L3, reducing the latency of I/O without having to touch the cores.

[image]


Also in the announcement is a new memory controller. The old DMC-520, which was limited to four channels of DDR3, is being superseded by the DMC-620 controller, which supports eight channels of DDR4. Each DMC-620 channel can hold up to 1 TB of DDR4, giving potential SoC support for 8 TB.

[image]

http://www.anandtech.com/show/10711/arm-cmn-600-dmc-620-128-cores-8-channel-ddr4

ARM adds CMN-600 interconnect and DMC-620 memory controllers
http://semiaccurate.com/2016/09/27/arm-adds-cmn-600-interconnect-dmc-620-memory-controllers/


Note that the 2nd image already shows a Coherent Multichip Link, with CCIX support
https://forum.zwame.pt/threads/cache-coherent-interconnect-for-accelerators-ccix.959776/
 
This market has been in turmoil, from the "disappearance" of Broadcom's Vulcan, which is suspected to have been cancelled after the Avago acquisition, to MACOM's takeover bid for Applied, whose immediate result will be the sale of its Computing division, where the X-Gene line sat, leaving some doubts about its future.


Meanwhile, Qualcomm has officially presented its Centriq 2400 server SoC with the custom Falkor ARM CPU
Today, we’re excited to announce the Qualcomm Centriq 2400, the world’s first server processor built on a 10-nanometer process node.

This revolutionary processor is purpose-built for performance-oriented datacenter applications. As the first in the Qualcomm Centriq product family, the Qualcomm Centriq 2400 series has up to 48 cores. It features a Qualcomm Falkor CPU — our custom ARMv8 CPU core that’s optimized for server-class workloads. The Falkor CPU is the result of generations of CPU design expertise combined with talent from throughout the server industry.

We understand the importance of a healthy software ecosystem in the datacenter. As such, Falkor was designed from Day 1 to be SBSA compliant, which ensures that software that runs on any ARMv8 server platform would also be able to run on a Qualcomm Centriq 2400-based server platform (and vice-versa).

Earlier this week, I spoke at an industry event where we demonstrated our new processor. Our software development platform ran a typical datacenter application configured with Linux, Java, and Apache Spark.
https://www.qualcomm.com/news/onq/2...00-worlds-first-10-nanometer-server-processor


- Does Qualcomm Have A Real Shot At Competing In The Datacenter?
 
Update: it has since emerged that Cavium has apparently acquired the IP for Broadcom's Vulcan (ARM server) chip. Although nothing has been officially announced, there is a mail with a patch for GCC (the default compiler on most Linux distributions):

This patch adds -mcpu=thunderx2t99. Cavium has acquired the Vulcan
IP from Broadcom. I am keeping the old -mcpu=vulcan as backwards
compatible but renaming all of the structures to be based on the new
name of the chip. In the next few weeks, I am auditing the current
tuning and will be posting some changes too.

OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.
Also tested -mcpu=native on a ThunderX2 CN99xx machine.
https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01986.html
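
As a quick illustration of how that patch would be used once it lands in a GCC build (the flag names are taken from the mail above; the file and compile lines are just a hypothetical example):

/* Illustration only: any C file works, the point is the -mcpu switch.
 * Per the patch above, both spellings should select the same tuning:
 *   gcc -O3 -mcpu=thunderx2t99 hello.c    (new name)
 *   gcc -O3 -mcpu=vulcan       hello.c    (kept as a backwards-compatible alias)
 *   gcc -O3 -mcpu=native       hello.c    (on the ThunderX2 machine itself, as in the mail) */
#include <stdio.h>

int main(void)
{
    puts("built with ThunderX2 CN99xx tuning");
    return 0;
}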

HP Enterprise has also put out another update on "The Machine" project, which was using an unspecified ARM CPU

HPE is not yet divulging whose ARM SoC it designed into The Machine’s initial sled design.
....
Does The Machine’s workload processor have to be ARM-based? HPE Labs says no, and adds that The Machine is intended to explore how data-centric architecture impacts code optimization for various processor and coprocessor architectures, including APUs, GPUs, FPGAs, and perhaps hardware accelerators. HPE chose the ARM architecture for this first design iteration to learn about enterprise-grade ARM and ARM’s rapidly maturing server software development ecosystem – a benefit of the above mentioned philosophy of simultaneously learning about multiple aspects of system behavior.
https://www.nextplatform.com/2017/01/09/hpe-powers-machine-architecture/
 