AMD CDNA GPU Architecture: Dedicated GPU for Data Centers

Dark Kaeser

At first glance this should be Arcturus. I'm keeping this post reserved to gather the info.

EDIT:

AMD CDNA ARCHITECTURE

Communication and Scaling

The AMD CDNA architecture uses standards-based, high-speed AMD Infinity Fabric technology to connect to other GPUs. The Infinity Fabric links operate at 23 GT/s and are 16 bits wide, similar to the previous generation, but the MI100 adds a third link for full connectivity in quad-GPU configurations, offering greater bisection bandwidth and enabling highly scalable systems. Unlike PCIe®, the AMD Infinity Fabric links support coherent GPU memory, which enables multiple GPUs to share an address space and tightly cooperate on a single problem.
amd-cdna-whitepaper-pdf-1.png
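Below is a minimal sketch (not from the whitepaper; device IDs and sizes are illustrative) of how this coherent, shared address space is exercised from HIP on a multi-GPU ROCm node: once peer access is enabled, one GPU can access a buffer that physically resides on another.

```cpp
// Hypothetical 2-GPU example: enable peer access and move data GPU-to-GPU.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    hipDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) { printf("no peer access between GPU 0 and GPU 1\n"); return 0; }

    hipSetDevice(0);
    hipDeviceEnablePeerAccess(1, 0);            // GPU 0 may now access GPU 1's memory

    hipSetDevice(1);
    float* buf1;
    hipMalloc(&buf1, 1024 * sizeof(float));     // allocation physically on GPU 1
    hipMemset(buf1, 0, 1024 * sizeof(float));

    hipSetDevice(0);
    float* buf0;
    hipMalloc(&buf0, 1024 * sizeof(float));
    // A kernel running on GPU 0 could dereference buf1 directly; here we simply
    // issue a peer-to-peer copy, which uses Infinity Fabric links when present.
    hipMemcpyPeer(buf0, 0, buf1, 1, 1024 * sizeof(float));
    hipDeviceSynchronize();

    hipFree(buf0);
    hipSetDevice(1);
    hipFree(buf1);
    return 0;
}
```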

As Figure 3 illustrates, the additional AMD Infinity Fabric links enable a fully connected 4-GPU building block, whereas the Radeon Instinct™ MI50 GPU could use only a ring topology. The fully connected topology boosts performance for common communication patterns such as all-reduce and scatter/gather. These communication primitives are widely used in HPC and ML, e.g., in the weight-update communication phase of training neural networks such as DLRM.
amd-cdna-whitepaper-pdf-2.png
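For context, here is a hedged sketch of the all-reduce primitive mentioned above, written against RCCL (AMD's NCCL-compatible collectives library); the single-process, four-GPU layout mirrors the fully connected quad, but the buffer sizes and header path are assumptions rather than anything taken from the whitepaper.

```cpp
// Illustrative in-place all-reduce (sum) across a 4-GPU hive using RCCL.
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>        // older ROCm releases install this as <rccl.h>
#include <vector>

int main() {
    const int nGpus = 4;                          // e.g. one fully connected MI100 quad
    const size_t count = 1 << 20;                 // per-GPU gradient-sized buffer
    std::vector<int> devs = {0, 1, 2, 3};
    std::vector<ncclComm_t> comms(nGpus);
    ncclCommInitAll(comms.data(), nGpus, devs.data());

    std::vector<float*> buf(nGpus);
    std::vector<hipStream_t> stream(nGpus);
    for (int i = 0; i < nGpus; ++i) {
        hipSetDevice(i);
        hipMalloc(&buf[i], count * sizeof(float));
        hipMemset(buf[i], 0, count * sizeof(float));
        hipStreamCreate(&stream[i]);
    }

    // Sum every GPU's buffer into every GPU's buffer; on a fully connected quad
    // this traffic can stay on the GPU-to-GPU Infinity Fabric links.
    ncclGroupStart();
    for (int i = 0; i < nGpus; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], stream[i]);
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(stream[i]);
        hipFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```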


AMD CDNA Architecture Compute Units

The command processor and scheduling logic translate higher-level API commands into compute tasks. These compute tasks in turn are implemented on the compute arrays and managed by the Asynchronous Compute Engines (ACEs). Each of the four ACEs maintains an independent stream of commands and can dispatch wavefronts to the compute units. The impressive 120 CUs of the AMD CDNA architecture are organized into four arrays of CUs. The CUs are derived from the earlier GCN architecture and execute wavefronts that contain 64 work-items. However, the CUs are enhanced with new Matrix Core Engines that are optimized for operating on matrix datatypes, boosting compute throughput and power efficiency.
The AMD CDNA architecture builds on GCN's foundation of scalars and vectors and adds matrices as a first-class citizen, while simultaneously adding support for new numerical formats for machine learning and preserving backwards compatibility with any software written for the GCN architecture. The Matrix Core Engines add a new family of wavefront-level instructions, the Matrix Fused Multiply-Add (MFMA). The MFMA family performs mixed-precision arithmetic and operates on KxN matrices using four different types of input data: 8-bit integer (INT8), 16-bit half-precision FP (FP16), 16-bit brain FP (bf16), and 32-bit single-precision (FP32). All MFMA instructions produce either 32-bit integer (INT32) or FP32 output, which reduces the likelihood of overflowing during the final accumulation stages of a matrix multiplication. The different numerical formats all have different recommended applications.
amd-cdna-whitepaper-pdf-5.png

As Figure 5 illustrates, the CUs are augmented with new matrix engines to handle the MFMA instructions and boost throughput and energy efficiency. The matrix execution unit has several advantages over the traditional vector pipelines in GCN. First, the execution unit reduces the number of register file reads, since in a matrix multiplication many input values are re-used. Second, the narrower datatypes create a huge opportunity for workloads that do not require full FP32 precision, e.g., machine learning. Generally speaking, the energy consumed by a multiply-accumulate operation scales roughly with the square of the width of the input datatypes, so shifting from FP32 to FP16 or bf16 (half the bits) can cut that energy by roughly a factor of four.
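To make the "narrow inputs, FP32 accumulation" point concrete, below is a hedged sketch using hipBLAS, whose GEMM kernels map onto the MFMA units on CDNA hardware; the matrix sizes are arbitrary and the hipblasGemmEx call shown is the classic signature, so verify it against the headers shipped with your ROCm release.

```cpp
// Illustrative FP16 x FP16 -> FP32 GEMM, i.e. FP16 inputs with FP32 accumulation.
#include <hip/hip_runtime.h>
#include <hipblas.h>          // newer ROCm releases also install <hipblas/hipblas.h>

int main() {
    const int m = 1024, n = 1024, k = 1024;
    hipblasHandle_t handle;
    hipblasCreate(&handle);

    hipblasHalf *dA, *dB;                                     // FP16 inputs
    float *dC;                                                // FP32 accumulator/output
    hipMalloc(&dA, size_t(m) * k * sizeof(hipblasHalf));
    hipMalloc(&dB, size_t(k) * n * sizeof(hipblasHalf));
    hipMalloc(&dC, size_t(m) * n * sizeof(float));
    hipMemset(dA, 0, size_t(m) * k * sizeof(hipblasHalf));    // real code would upload data
    hipMemset(dB, 0, size_t(k) * n * sizeof(hipblasHalf));
    hipMemset(dC, 0, size_t(m) * n * sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;                    // scalars use the FP32 compute type
    hipblasGemmEx(handle, HIPBLAS_OP_N, HIPBLAS_OP_N, m, n, k,
                  &alpha,
                  dA, HIPBLAS_R_16F, m,                       // A: FP16, column-major, lda = m
                  dB, HIPBLAS_R_16F, k,                       // B: FP16, ldb = k
                  &beta,
                  dC, HIPBLAS_R_32F, m,                       // C: FP32, matching MFMA accumulation
                  HIPBLAS_R_32F,                              // accumulate in single precision
                  HIPBLAS_GEMM_DEFAULT);
    hipDeviceSynchronize();

    hipFree(dA); hipFree(dB); hipFree(dC);
    hipblasDestroy(handle);
    return 0;
}
```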

amd-cdna-whitepaper-pdf-6.png

Oak Ridge National Laboratory tested its exascale science codes on the MI100 as it ramps up users to take advantage of the upcoming exascale Frontier system. Performance results ranged from 1.4x to 3x faster compared to a node with V100 GPUs. In the case of CHOLLA, an astrophysics application, the code was ported from CUDA to AMD ROCm™ in just an afternoon while delivering a 1.4x performance boost over the V100.

AMD ROCm™ Open Software Ecosystem

amd-cdna-whitepaper-pdf-7.png

AMD ROCm is built on three core philosophical observations about heterogeneous computing. First, GPUs and CPUs are equally important computing resources; they are optimized for different workloads and should work together effectively. Second, code should be naturally portable and high-performance using a combination of libraries and optimized code generators. Third, building an open-source toolchain empowers customers to fully optimize their applications, eases deployment, and enables writing code that is easily portable to multiple platforms.
amd-cdna-whitepaper-pdf-8.png

Figure 8 illustrates the robust nature of the AMD ROCm™ open ecosystem for several example production applications. In most cases, existing workloads can be migrated from proprietary architectures using the ROCm toolchain in a matter of days. The resulting codebase is portable with virtually no performance degradation. These examples highlight the robust quality of the ROCm ecosystem for cutting-edge exascale systems that will form the basis of scientific computing for the foreseeable future.
https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf
 
The videos from the "Supercomputing" conference dedicated to the Oil & Gas sector, held at Rice University, have already been put online. I watched a few, and during HPE's (or rather Cray's) presentation on Shasta there was an image I find curious, referring to the Frontier system to be installed at ORNL, which will be an AMD system (AMD EPYC "custom after Milan" + GPU, probably Arcturus):

Shasta.jpg


The presenter, who should know what he is talking about since he is one of the VPs and the CTO for the HPC & AI area, says that because the GPUs have more PCIe lanes, they can be connected directly to their interconnect (Slingshot) without having to go through the CPU.
 
And it comes out ahead in single precision because they are only considering "traditional" FP32, setting aside the new format NVIDIA introduced (TF32), which ends up being quite close in terms of "quality".

For FP64 and FP16 they can use Tensor Cores, which completely blows the competition away.
 
Porting to AMD GPUs in the Corona Age
The early days of GPUs brought some challenges, but dedication from developers and Nvidia to make sure as many HPC codes as possible were ported and CUDA-ready over the years eased the transition to acceleration. However, things are getting more complicated as new GPUs from other vendors hit the datacenter, most notably (at least for now, with Intel still lagging in this department for HPC) with AMD coming on strong.
The “Corona” supercomputer at Lawrence Livermore National Lab has an interesting story that goes beyond its untimely name (it was dubbed during the solar eclipse a few years ago). It represents the lab’s future foray into a massive-scale all-AMD system, El Capitan, which will be installed within the next three years. That will mean future generation AMD CPUs and GPUs, the latter of which will be a primary driver behind its expected exascale performance.
For its part, AMD took the CUDA MD code and ported it via HIP (a lightweight runtime API and kernel language designed for portability across AMD and Nvidia GPUs from a single source code). We've heard much about ROCm performance and other abstraction-layer hits over the years, but Karlin says that the performance overhead was not bad at all. "We're getting competitive performance."

The LLNL team’s machine learning porting effort took some work in getting the environment to work correctly, Karlin says, “but once we figured out all those tricks porting machine learning codes for other applications it went smoothly, it was just a matter of figuring out what the formula was, what the challenges were. But AMD has been very receptive in helping us out, assisting us with getting things going and prioritizing key bugs to make sure important applications are running well.”
https://www.nextplatform.com/2020/10/06/porting-to-amd-gpus-in-the-corona-age/
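The article above describes HIP as a single-source path off CUDA. As a minimal, hypothetical illustration (not taken from the LLNL or CHOLLA codebases), this is what a trivial kernel looks like after such a port; most of what the hipify tools do is 1:1 renaming of the runtime API (cudaMalloc becomes hipMalloc, and so on), while the kernel and launch syntax carry over unchanged.

```cpp
// A trivial "ported" SAXPY kernel; builds with hipcc for AMD or NVIDIA targets.
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // same indexing as the CUDA original
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *dx, *dy;
    hipMalloc(&dx, n * sizeof(float));               // was cudaMalloc
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy); // CUDA-style launch works under hipcc

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                     // expect 4.0
    hipFree(dx);
    hipFree(dy);
    return 0;
}
```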
 
Two more systems with CPUs (Epyc 3 - Milan) and GPUs (Instinct, based on whatever comes out of this):

- HPE to Build Australia’s Most Powerful Supercomputer for Pawsey
The new system – as-yet unnamed – will use the HPE Cray EX architecture, next-generation AMD Epyc CPUs and AMD Instinct GPUs and the Cray ClusterStor E1000 data storage system.
https://www.hpcwire.com/2020/10/20/hpe-to-build-australias-most-powerful-supercomputer-for-pawsey/

- HPE, AMD and EuroHPC Partner for Pre-Exascale LUMI Supercomputer
LUMI is based on the HPE Cray EX supercomputer architecture, and will harness next-generation AMD Epyc CPUs and AMD Instinct GPUs. Storage will include 7 PB of accelerated flash-based storage (LUMI-F, utilizing a Cray ClusterStor E1000 storage system); an 80 PB Lustre file system (LUMI-P); and 30 PB of encrypted object storage (LUMI-O). LUMI’s primary GPU-driven partition (LUMI-G) will be supplemented by a data analytics partition with 32 TB of memory and additional GPUs (LUMI-D), as well as a CPU partition featuring around 200,000 AMD Epyc CPU cores. LUMI will use HPE Slingshot networking.
LUMI-slide-1000x.png


https://www.hpcwire.com/2020/10/21/hpe-amd-and-eurohpc-partner-for-550-petaflops-lumi-supercomputer/
 
But I doubt those "V" parts will be based on CDNA.

I don't know which architecture it will be, but AMD has a "V" series, with the V340 aimed at the virtualization market. That card is two Vega 56 GPUs with 16 GB of HBM each, but it has no video outputs, which is normal for that market.
That is, if those "V" parts target the same market, they don't even need to use a "full" GPU. Assuming, that is, that the CDNA parts aren't "full" GPUs.
 
Yes, it could be that: using dies that don't qualify for the MI100.

But there is still the mystery of the MI100 and MI200 designations.
Although in this case it may well just be the successor.
 
AMD has announced the MI100, the so-called "Arcturus".

[MI100 launch presentation slides]

AMD is a few quarters behind NVIDIA launching the MI100 but well ahead of Intel’s Xe HPC part. With the big headliners of two major exascale supercomputers moving to AMD, we can imagine other smaller installations will look at what AMD is offering more seriously. It is good to see that AMD has a plan to close the software and capability gap with NVIDIA.

https://www.servethehome.com/amd-radeon-instinct-mi100-32gb-cdna-gpu-launched/

By "coincidence", NVIDIA also launched a version of the A100 today with 80 GB of HBM2e, raising the total bandwidth to 2 TB/s.

[A100 80GB announcement slides]


They also presented a new version of the DGX Station A100 workstation, with 4 A100s with 80 GB each + an Epyc CPU + 512 GB of RAM + a 2 TB SSD for the OS + 8 TB for data + dual 10 Gbit networking + remote management (IPMI), etc.
All of it using some kind of refrigerant so that the workstation stays quiet. :)

[DGX Station A100 image]


https://www.anandtech.com/show/16250/nvidia-announces-a100-80gb-ampere-gets-hbm2e-memory-upgrade

This workstation must be interesting to see in person. :D
 
@Nemesis11 post that in the Ampere thread, please.

Well, AMD is ambitious, going after HPC hard!

4 MI100s packed together, good and pure hardware p0rn.

In the European winter it would be nice to build a rig with 4 of them to mine bitcoin :D
 
I think the main news about the MI100 isn't the performance. From now on we'll start to see the software ecosystem grow as the tools in ROCm get adopted. It's an important milestone for AMD.
 
It was already known that AMD had won 2 of the 3 exascale systems in the US with the so-called A+A combination (AMD CPU + GPU); Intel got the other one (I+I).

Gigabyte, among others, has already announced systems, and as noted in the first post (edited in the meantime), the fact that the MI100 has 3 IF links allows 4 GPUs to be connected directly to each other, combined with the "new boards":

GIGABYTE Releases GPU-Centric Servers for 8+ Accelerators and 160 PCIe Lanes

Improved Accelerator Performance:

All three new servers can run all their PCIe Gen 4 lanes at x16, which means there can be as many as 160 PCIe lanes available, with each x16 link offering a theoretical bandwidth of 32 GB/s for fast communication between CPU and GPU. This is made possible by the AMD EPYC 7002 architecture, which allows a single socket-to-socket Infinity Fabric link (one of four) to reallocate its bandwidth to PCIe lanes for other purposes. This works because a dual AMD EPYC 7002 system normally uses 64 PCIe lanes from each CPU to link to the adjacent CPU. By freeing up one of these four links, 16 lanes per CPU can be redirected to fast PCIe Gen 4 connectivity for storage, accelerators, or networking.
G482-series-768x340.png

Incredibly, the G482-Z53 and G482-Z54 are built for 8 GPUs, such as the AMD Instinct MI50, slotted next to each other and linked via the AMD Infinity Fabric link. These servers have been tested and designed for the AMD Instinct MI50 and other accelerators. The servers utilize dual AMD EPYC 7002 processors operating with 128+ PCIe Gen 4 lanes. Recognizing the full potential of high-bandwidth PCIe Gen 4 lanes, the 2nd Gen AMD EPYC processors are capable of reallocating bandwidth dedicated to CPU-to-CPU connectivity toward accelerators and networking. What is even more impressive is that every GPU gets its own dedicated PCIe Gen 4 x16 lanes, without a PCIe switch forcing two GPUs to share a single x16 connection.
https://www.hpcwire.com/off-the-wir...ervers-for-8-accelerators-and-160-pcie-lanes/
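As a back-of-the-envelope check of the 160-lane and 32 GB/s figures quoted above, here is a sketch that uses only the numbers from the article (nothing measured):

```cpp
// Reconstructing the lane math for a dual-socket EPYC 7002 system as described above.
#include <cstdio>

int main() {
    const int lanes_per_cpu      = 128;  // PCIe Gen 4 lanes per EPYC 7002 socket
    const int xgmi_links_per_cpu = 4;    // socket-to-socket Infinity Fabric links
    const int lanes_per_link     = 16;   // each link repurposes 16 of those lanes

    // Default dual-socket config: 64 lanes per CPU feed the four xGMI links.
    int io_default = 2 * (lanes_per_cpu - xgmi_links_per_cpu * lanes_per_link);  // 128
    // Freeing one xGMI link per CPU returns 16 lanes per socket to PCIe.
    int io_3link   = io_default + 2 * lanes_per_link;                            // 160

    printf("I/O lanes, 4-link xGMI: %d; 3-link xGMI: %d\n", io_default, io_3link);
    // At roughly 2 GB/s per Gen 4 lane in each direction, an x16 link is ~32 GB/s.
    printf("theoretical x16 Gen 4 bandwidth: ~%d GB/s per direction\n", 16 * 2);
    return 0;
}
```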



AMD At A Tipping Point With Instinct MI100 GPU Accelerators
The big change with the Arcturus GPU is that AMD is forking its graphics card GPUs aimed at gamers, where the processing of frames per second is paramount, from its GPU accelerators aimed at HPC and AI compute, where floating point and integer operations per second is key. This is the split between RDNA and CDNA chips, in the AMD lingo, and the Arcturus chip is the first instantiation of the CDNA architecture.
Specifically, the Arcturus chip takes all of the graphics-related circuits out of the streaming processors, such as the graphics caches and display engines as well as the rasterization, tessellation, and blending features. However, because of workloads that chew on multimedia data – such as object detection in machine learning applications – the dedicated logic for HEVC, H.264, and VP9 decoding is left in. This freed up die space to add more stream processors and compute units.
At some point in the future, the Epyc CPUs and the Instinct GPUs will have enough Infinity Fabric ports to cross couple a single CPU to a quad of GPUs, all with coherent memory across the devices. IBM has supported such coherence between Power9 processors and Nvidia V100 GPU accelerators for the past three years, and it is one reason that Big Blue won the contracts to build the “Summit” hybrid supercomputer at Oak Ridge National Laboratories and its companion “Sierra” supercomputer at Lawrence Livermore National Laboratories. For whatever reason, this coherence between CPU and GPU will not be available with the Power10 processors and the current Ampere GPUs and we presume future Nvidia GPUs because IBM wants to use OpenCAPI and Nvidia wants to use NVLink, and this may be one reason why Big Blue didn’t win the contracts for the follow-on “Frontier” and “El Capitan” exascale-class systems at these two labs in the United States. That said, the fallout over OpenCAPI and NVLink could be one result of losing the deal, not necessarily an effect.
amd-arcturus-server-oems.jpg

Pricing on the Instinct MI100 was buried in the footnotes, and is $6,400 a pop in single unit quantities. We are going to try to see what the OEMs are charging for it. Here are the initial OEM vendors and their machines that will support the MI100 GPU accelerator:
https://www.nextplatform.com/2020/11/16/amd-at-a-tipping-point-with-instinct-mi100-gpu-accelerators/


AMD Courts HPC with 11.5 Teraflops Instinct MI100 GPU
HPC market watcher Addison Snell, CEO of Intersect360 Research, remarked on AMD’s HPC focus and the implementation of its datacenter-centric CDNA architecture, distinct from the gaming-oriented RDNA (Radeon DNA) architecture.

“With the MI100 GPU, AMD is staying pure to its corporate focus on HPC,” said Snell. “While Nvidia’s messaging and benchmarking have been AI-heavy, AMD is hitting HPC hard, with 11.5 teraflops of double-precision performance as the marquee stat.”

“AMD is also emphasizing its new CDNA architecture as the focus for computing versus graphics; that’s where we find the GPU-to-GPU communication on the second-generation Infinity architecture.”
Prominent HPC sites Oak Ridge National Laboratory, the University of Pittsburgh and Pawsey Supercomputing Center evaluated the new GPUs along with AMD’s software frameworks. Their reports are positive.

“We’ve received early access to the MI100 accelerator, and the preliminary results are very encouraging. We’ve typically seen significant performance boosts, up to 2-3x compared to other GPUs,” said Bronson Messer, director of science, Oak Ridge Leadership Computing Facility. “What’s also important to recognize is the impact software has on performance. The fact that the ROCm open software platform and HIP developer tool are open source and work on a variety of platforms, it is something that we have been absolutely almost obsessed with since we fielded the very first hybrid CPU/GPU system.”
https://www.hpcwire.com/2020/11/16/amd-courts-hpc-with-11-5-teraflops-instinct-gpu/
 