Nvidia Grace Processor (ARM Neoverse V2)

Supposedly based on an ARM Neoverse core.

The real novelty is in the interconnects: using NVLink to hook up the Nvidia A100 GPUs.

For now there isn't much more information beyond the GTC 2021 presentations.

The first product will arrive sometime in 2023, and its first announced customer will be ETH Zürich.

And it includes a roadmap:

[GTC 2021 roadmap slide]
https://www.anandtech.com/show/1661...-keynote-live-blog-starts-at-830am-pt1630-utc

Let's see if more details turn up.

EDIT: it turns out this Grace CPU will later also be part of an SoC for the automotive market (autonomous driving?), since it mentions Grace Next.


NVIDIA Unveils Grace: A High-Performance Arm Server CPU For Use In Big AI Systems​

https://www.anandtech.com/show/1661...formance-arm-server-cpu-for-use-in-ai-systems
 
Even the use of LPDDR5X makes it clear that this processor is designed to keep the GPUs and other accelerators (BlueField, in this case) company.
IO capability will matter more than raw compute capability.

It's still interesting, since in servers they stop depending on outside companies (Intel, AMD, IBM) and can build the processor to measure, with lots of NVLink links, lots of bandwidth to RAM, coherency between CPU and GPU, etc. However, it doesn't look like a general-purpose server processor.
 
And this is why Nvidia wants to buy ARM: to have even more and better cores and not depend on others.

But at the same time, with no intention of blocking anyone else's access, it keeps striking partnerships left and right with other ARM platforms.

All business is good business..
 
It seems to me that the only point of the CPU is to feed the GPUs; in any case, from what is known, the ARM Neoverse (N2) cores will already have, besides DDR5, PCIe 5 plus CCIX 2.0 and CXL 2.0, which I believe is the "link" between the CPUs and the memory.

https://www.anandtech.com/show/16073/arm-announces-neoverse-v1-n2

[CXL 2.0 press briefing slides]

https://www.anandtech.com/show/16227/compute-express-link-cxl-2-0-switching-pmem-security

By the way, the recent news that Micron is selling the fab it acquired from Intel, which makes the 3D XPoint (a.k.a. Optane) modules, is precisely so it can start making memory to be used over CXL 2.0 😎
 
One more system has been announced, "Venado", for Los Alamos National Lab (USA), but oddly enough the system announced in the first post, "Alps" for CSCS (Switzerland), is bigger, judging by the 2x AI flops. The strangest part is that, with both systems based on HPE (Cray) "Shasta", they will apparently use its "Slingshot 11" interconnect instead of Nvidia's NVSwitch.

Opening Up The Future “Venado” Grace-Hopper Supercomputer At Los Alamos​


A year ago, when these deals were announced, we figured that the 20 exaflops was gauged using mixed precision floating point because there is no way anyone is installing a 20 exaflops machine as measured in 64-bit floating point precision. We have had enough trouble getting machines into the field with 2 exaflops peak, to be frank. Nvidia has confirmed that the 10 exaflops rating on Venado is using eighth-precision FP8 floating point math with sparsity support turned on to come up with that number.
[image: Los Alamos "Venado"]

As far as we can tell, neither the Alps nor Venado systems are using the NVSwitch 3 extended switch fabric in the new DGX H100 SuperPODs as the basis of their machines, unlike what Nvidia is doing with its future “Eos” kicker to the Selene supercomputer, which will be rated at 275 petaflops at FP64 on its vector engines without sparsity and 18 exaflops at FP8 on its matrix engines with sparsity. By picking the Shasta design, that means a more traditional clustering at the node level, not at the GPU memory level, and at only 256 GPUs, that is not enough scale for either Alps or Venado and it is too different from the machines that CSCS and Los Alamos are used to building. To be fair, Eos is using big fat InfiniBand pipes to lash together 18 DGX H100 SuperPODs, which both CSCS and Los Alamos could have done but decided not to.

Venado may be at that inflection point, particularly since Nvidia and Los Alamos have a longer deal that aimed to research the use of DPUs to create what Nvidia calls “cloud native supercomputing,” which uses less expensive DPUs to offload storage, virtualization, and certain routines such as MPI protocol processing from CPUs and put them on specialized accelerators to let the more expensive CPUs do more algorithmic work in a simulation. The Los Alamos and Nvidia deal looks to accelerate HPC application by 30X over the next several years, and as far as we know from Nvidia, the DPUs are not necessarily part of the Venado system. But they could be.
The trouble is that a BlueField-3 DPU from Nvidia bundles together a 400 Gb/sec ConnectX-7 network interface card that speaks both InfiniBand and Ethernet plus 16 “Hercules” Armv8.2+ A78 cores and, as an option for even further acceleration, an Nvidia GPU. It does not, strictly speaking, speak Slingshot 11 Ethernet in the same way that HPE’s own “Cassini” NIC ASIC does. (Although we do know that the Slingshot 10 interface cards were actually ConnectX-6 cards from Nvidia.) There probably is some way to make Slingshot 11 work with 400 Gb/sec BlueField-3 DPUs, but that would also mean gearing down their throughput to 200 Gb/sec, which is the fastest speed that Slingshot currently supports. (HPE has 400 Gb/sec and 800 Gb/sec switch ASICs and adapters in the pipeline, as we have previously reported.) With BlueField-4, due in early 2023 according to the roadmap, the ConnectX-7 ASIC, the Arm DPU, and the Nvidia GPU will be all put into one package. That is the one that Los Alamos probably wants to play with in Venado.

The exact compute ratio of CPUs and GPUs and the node count for the Venado system has not been divulged, but Lujan did give us some hints.

“We are looking at an 80/20 split, where 80 percent of the cycles in the nodes come from the GPUs and then the other 20 percent are coming from the CPUs only,” Lujan says. “That way we can facilitate a significant amount of research on the CPU forefront, yet still provide significant amount of cycles that we can get out of the GPUs as well. But we are mindful that we have a unique capability with Grace-Hopper because of this close coupling of the CPU and the GPU.”

That is a statement we can work with. In modern GPU-accelerated systems that are completely loaded up on each node with GPUs, the ratio of CPU compute to GPU compute in FP64 precision – which is what HPC cares about most – is on the order of 5 percent to 95 percent.
If you do the math backwards, to get 10 exaflops at FP8 precision using the Hopper GPUs would require 3,125 Hopper GH100 accelerators. If you work that back to FP64 on the vector cores on the Hopper without sparsity on, there is 30 teraflops per H100 so a total of 93.75 petaflops from the Hopper units. Using an 80-20 ratio of GPU to CPU compute, that would still be 23.45 petaflops on an all-CPU portion of Venado, based on what Lujan said.
https://www.nextplatform.com/2022/0...ado-grace-hopper-supercomputer-at-los-alamos/
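A quick back-of-the-envelope check of that arithmetic (the per-GPU figures below are the ones implied or quoted in the article, not official spec-sheet numbers):

```python
# Sanity check of the Venado numbers quoted above.
# Per-GPU figures are taken from / implied by the article, not from official spec sheets.

fp8_target_exaflops = 10          # Venado's quoted AI rating (FP8, sparsity on)
fp8_per_h100_teraflops = 3200     # implied by the article: 10 EF spread over 3,125 GPUs
fp64_per_h100_teraflops = 30      # H100 FP64 vector rate without sparsity, per the article
gpu_share = 0.80                  # Lujan's 80/20 GPU-to-CPU cycle split

gpus = fp8_target_exaflops * 1e6 / fp8_per_h100_teraflops
fp64_gpu_petaflops = gpus * fp64_per_h100_teraflops / 1e3
cpu_petaflops = fp64_gpu_petaflops * (1 - gpu_share) / gpu_share

print(f"Hopper GPUs needed:     {gpus:,.0f}")                  # ~3,125
print(f"FP64 from the GPUs:     {fp64_gpu_petaflops:.2f} PF")  # ~93.75 PF
print(f"Implied CPU-only share: {cpu_petaflops:.2f} PF")       # ~23.4 PF (article rounds to 23.45)
```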
 
Apparently yet another system... the Spaniards at BSC, after 3 years of mess selecting the proposals for MareNostrum 5, went with Grace + Hopper; this one has the particularity of being built by Atos instead of HPE.

The funny thing is that the system was announced by... Nvidia, given that the official EuroHPC statement only mentions that an Atos system was selected and says nothing about the hardware.


Nvidia Superchips to Grace Atos-Built MareNostrum 5 Supercomputer​


Some brief history: back in June 2019, BSC was one of eight sites selected for EuroHPC’s inaugural octet of supercomputers—specifically, one of the three selected to host a “pre-exascale” system measuring in the hundreds of petaflops. News of procurements—and even installations—of its siblings came and went over the ensuing years until, abruptly, the procurement for MareNostrum 5 was canceled in May 2021.
Now, Atos has been selected as the vendor for MareNostrum 5
Nvidia told HPCwire that MareNostrum 5 will be powered by Nvidia’s Arm-based Grace CPU Superchips, which feature tightly paired CPUs and which are slated to debut on the Venado system at Los Alamos National Laboratory. The system will further feature Nvidia’s H100 Tensor Core GPUs and its Quantum-2 (aka NDR) InfiniBand networking.
BSC’s press release says that MareNostrum 5 will have “a peak performance of 314 [petaflops] and more than 200 petabytes of storage and 400 petabytes of active archive,” and Nvidia adds that the system is expected to deliver 18 exaflops of AI performance, making it the fastest AI supercomputer in the European Union. BSC elaborated that the system will be “fully powered with green energy, and will utilize heat reuse technology.”
No other system specifications for MareNostrum 5 have been announced, but it’s interesting to note that the call closed just a few days after Atos’ announcement of its next-generation BullSequana XH3000 supercomputers. (The XH3000 also specifically features support for Nvidia’s Grace CPU, and the XH2000 does not support Quantum-2 “NDR” InfiniBand.)
https://www.hpcwire.com/2022/06/16/...grace-atos-built-marenostrum-5-supercomputer/
 

New NVIDIA Grace Arm CPU Details Ahead of HC34​


NVIDIA says that the new Grace CPU will be based on Arm v9.0 and will have features like SVE2, virtualization/nested virtualization, S-EL2 support (Secure EL2), RAS v1.1, Generic Interrupt Controller (GIC) v4.1, Memory Partitioning and Monitoring (MPAM), System Memory Management Unit (SMMU) v3.1, and more.
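Some of the ISA-level items in that list can be checked from userspace on an aarch64 Linux host; a minimal sketch that parses the standard "Features" flags the kernel exposes in /proc/cpuinfo (this only covers instruction-set flags such as SVE/SVE2, not system-level pieces like the GIC, MPAM or SMMU):

```python
# Minimal check for a few of the ISA features listed above on an aarch64 Linux host.
# Reads the "Features" flags that the kernel publishes in /proc/cpuinfo.

def cpu_features():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("Features"):
                return set(line.split(":", 1)[1].split())
    return set()

feats = cpu_features()
for flag in ("sve", "sve2", "i8mm", "bf16"):   # flag names as the aarch64 kernel reports them
    print(f"{flag:5s}: {'yes' if flag in feats else 'no'}")
```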
Here is the basic diagram for a NVIDIA Grace CPU:
[diagram: NVIDIA Grace IO (HC34)]

NVIDIA also shared that it expects the new 72-core CPU to hit SPEC CPU2017 Integer Rate scores of around 370 per full Grace CPU using GCC. For some context, a 64-core AMD chip will have official scores in the 390-450 range (e.g. an AMD EPYC 7773X). With two Grace CPUs on a module, we get the Grace Superchip, which would effectively double these numbers.
That is really only part of the equation when it comes to Grace. Grace is not designed for all-out CPU performance. Instead, it is designed for memory bandwidth, power efficiency, and coherency with the company's GPU products. One can see the 16x memory channels on the Grace diagram above. The basic goal of Grace is to use LPDDR5X to get an HPC/AI-usable capacity. Then NVIDIA will use many channels to get a lot of bandwidth.
[chart: NVIDIA Grace bandwidth, HBM2e vs DDR5 vs LPDDR5X (HC34)]
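Peak bandwidth here is just channels × bytes per transfer × transfer rate; a small sketch of that estimate (the per-channel width and data rate used below are illustrative assumptions, not confirmed Grace figures):

```python
# Rough memory-bandwidth estimate for a many-channel LPDDR5X configuration.
# Channel width and data rate are illustrative assumptions, not confirmed Grace specs.

def peak_bandwidth_gbs(channels: int, bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth in GB/s: channels * bytes per transfer * transfers per second."""
    return channels * (bus_width_bits / 8) * transfer_rate_mts * 1e6 / 1e9

# Example: 16 channels, 32 bits each, at 8533 MT/s (hypothetical numbers)
print(f"{peak_bandwidth_gbs(16, 32, 8533):.0f} GB/s")   # ~546 GB/s
```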

The big benefit is really the memory bandwidth. Here are the estimated memory bandwidth results for a Grace CPU (single CPU, not the Grace Superchip):
[chart: NVIDIA Grace estimated memory performance (HC34)]


NVIDIA also showed a bit more on how the company plans to scale designs. Using the NVIDIA Scalable Coherency Fabric, NVIDIA hopes to get up to 4-socket coherency and 72 cores / 117MB of L3 cache (1.625MB per core).
[diagram: NVIDIA Scalable Coherency Fabric (HC34)]

Final Words​


There are really two reasons NVIDIA is building its own CPU. First, NVIDIA wants something that is a more efficient co-processor to the company’s GPUs in large systems. NVIDIA owns the hardware/ software stack much like Apple, so this is an attempt to become a truly full-stack provider. NVIDIA does not need Intel/ AMD’s new AI accelerators, it has its own. Likewise, it can build a chip with the right performance attributes to augment a GPU, rather than duplicating other parts.
https://www.servethehome.com/new-nvidia-grace-arm-cpu-details-ahead-of-hc34/
 
Nvidia Grace live, and not just in renders:

Dual CPU Package:

CPU + GPU Package:

Partners:

Specs:

Benchmarks:

Both the chip and the package are quite large.
Note that in the benchmarks the figures are "throughput" values...
 

NVIDIA GH200 CPU Performance Benchmarks Against AMD EPYC Zen 4 & Intel Xeon Emerald Rapids​


NVIDIA's GH200 combines the 72-core Grace CPU with H100 Tensor Core GPU and support for up to 480GB of LPDDR5 memory and 96GB of HBM3 or 144GB of HBM3e memory. The Grace CPU employs Arm Neoverse-V2 cores with 1MB of L2 cache per core and 117MB of L3 cache.

For the purposes of this testing Ubuntu 23.10 with Linux 6.5 was used for having an up-to-date kernel as well as the GCC 13 stock compiler. The toolchain versions are close to what will be found in Ubuntu 24.04 LTS in April and using Ubuntu 23.10 is worthwhile for a leading-edge look at the NVIDIA GH200 Linux performance as well as against the other Intel Xeon Scalable, AMD EPYC, and Ampere Altra Max processors for this comparison.
Unfortunately there are no power consumption numbers for today's article. The NVIDIA GH200 doesn't appear to currently expose any RAPL/PowerCap/HWMON interface under Linux for being able to read just the GH200 power/energy use.
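For anyone wanting to reproduce that check, these are the standard Linux locations for power/energy readouts; a small sketch that simply lists whatever powercap zones and hwmon sensors the kernel exposes (per the article, on the GH200 box nothing usable shows up):

```python
# List the standard Linux power/energy readout interfaces:
# RAPL-style powercap zones and hwmon sensors.
import glob, os

def read_name(path):
    name_file = os.path.join(path, "name")
    return open(name_file).read().strip() if os.path.exists(name_file) else "?"

for zone in sorted(glob.glob("/sys/class/powercap/*")):
    print(f"powercap: {os.path.basename(zone)} ({read_name(zone)})")

for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
    print(f"hwmon:    {os.path.basename(hwmon)} ({read_name(hwmon)})")
```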

[screenshot: NVIDIA GH200 CPU benchmark results vs AMD EPYC Zen 4 and Intel Xeon Emerald Rapids]

On a geo mean basis across all the benchmarks conducted, the GH200 Grace CPU performance nearly matched the Intel Xeon Platinum 8592+ Emerald Rapids processor. The Arm Neoverse-V2 based Grace CPU tended to be much faster than the 128-core Ampere Altra Max AArch64 server. It will be interesting to see how AmpereOne can compete albeit no hardware available yet for testing. (Unfortunately no AMD MI300A hardware either for testing right now.) The NVIDIA ARM CPU performance has certainly come a long way from benchmarking the NVIDIA Tegra early days for ARM performance.
https://www.phoronix.com/review/nvidia-gh200-gptshop-benchmark
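For reference, the "geo mean" summary is just the n-th root of the product of the per-test relative scores; a minimal sketch of how such a summary number is computed (the values are placeholders, not Phoronix's data):

```python
# Geometric mean as used for summarizing a benchmark suite.
# The scores below are placeholders, not Phoronix's actual results.
from math import prod

def geomean(values):
    return prod(values) ** (1.0 / len(values))

# Relative scores vs. a baseline (1.0 = baseline performance), hypothetical:
relative_scores = [1.10, 0.95, 1.25, 0.88, 1.05]
print(f"geo mean: {geomean(relative_scores):.3f}")
```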
 