"Nehalem" EX processor: 8 native cores (16 threads)

According to DailyTech, this "Nehalem-EX" no longer uses Fully Buffered DDR2 DIMMs, but rather normal registered DDR3 DIMMs with ECC (Error Checking and Correcting).
Good news, both in terms of latency and in terms of power consumption (the buffer chip on each FB-DIMM could draw almost 6 W by itself; imagine several dozen DIMMs in a single blade and you get the least desirable scenario).

The buffer is still there, but it sits on the board, one per channel. Using all four channels, that makes 4 buffers per CPU.
The advantage is that there are fewer buffers and standard DDR3 DIMMs can be used.
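
As a rough back-of-the-envelope comparison (a minimal Python sketch; the ~6 W per-AMB figure is the one quoted above, everything else is an illustrative assumption, not a published spec):

# Rough comparison of buffer counts/power: FB-DIMM (one AMB per DIMM)
# versus Nehalem-EX style on-board buffers (one per memory channel).
# The ~6 W AMB figure comes from the post above; the per-buffer power
# for the on-board parts is an assumed placeholder, not a real number.

DIMMS_PER_SOCKET = 16          # e.g. 4 channels x 4 DIMMs, as in the photos below
CHANNELS_PER_SOCKET = 4
AMB_POWER_W = 6.0              # per FB-DIMM buffer (quoted above)
BOARD_BUFFER_POWER_W = 6.0     # assumed similar per-buffer power

fbdimm_buffers = DIMMS_PER_SOCKET      # one AMB on every DIMM
onboard_buffers = CHANNELS_PER_SOCKET  # one buffer per channel

print("FB-DIMM buffers per socket:", fbdimm_buffers,
      "->", fbdimm_buffers * AMB_POWER_W, "W just for buffering")
print("On-board buffers per socket:", onboard_buffers,
      "->", onboard_buffers * BOARD_BUFFER_POWER_W, "W")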

AMD had a similar solution, G3MX.
http://arstechnica.com/hardware/news/2007/07/amd-announces-fb-dimm-killer.ars
 
This computing world evolves faster and faster :P It looks like something great, although for us mere mortals it is something whose existence we can merely admire.

By the way, wasn't nVidia also going to enter the race for server CPUs, or is it going to bet only on clusters/supercomputers in this area?
 

I don't think nVidia ever talked about entering this market, not even when the rumours appeared that nVidia was going to produce an x86 CPU.
 
They had talked about GPGPU (General-Purpose computing on Graphics Processing Units), a GPU that also executes the CPU's instructions :)

GPGPU is not a GPU that executes CPU instructions. It is a GPU running something that would normally be executed by the CPU. The GPU does not run x86 instructions, though.

A few pictures of the "beast", in its simple version with only 4 sockets.

intel7500512gb675.jpg

intel7500server675.jpg

intel7500dimm675.jpg

Each processor connects to two memory "boards". Each group of 4 DIMMs is a memory channel. At the front sits the buffer, that chip with the heatsink.
 
:drooling: :drooling: :drooling: :drooling: :drooling:

That is within reach of the gods only :D

8 x 8 = 64 memory slots x 4 GB = 256 GB in a single computer :wow:

But look at the power cables going into the board :wow: it looks like the equivalent of 2 EPS12V connectors,

so it is going to draw a lot of amps :002:
 

It supports 8 GB DIMMs. That is, with 4 sockets the limit is 512 GB of RAM and with 8 sockets, 1 TB of RAM.
With a chipset like IBM's EX5, with support for more sockets, the limit will be higher than 1 TB of RAM in an x86 system.
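
Spelling the arithmetic out (a quick sketch using the numbers from this thread: 16 DIMM slots per socket, i.e. 64 slots across the 4 sockets pictured, and 8 GB DIMMs):

# Capacity limits quoted above: 16 DIMM slots per socket, 8 GB per DIMM.
DIMM_SLOTS_PER_SOCKET = 16   # 64 slots / 4 sockets in the pictured board
DIMM_SIZE_GB = 8

for sockets in (4, 8):
    total_gb = sockets * DIMM_SLOTS_PER_SOCKET * DIMM_SIZE_GB
    print(sockets, "sockets ->", total_gb, "GB")   # 512 GB and 1024 GB (1 TB)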

I'd like to see the layout of a board with 8 sockets, whether it is preferable to have a board with "two storeys" or whether they do the same as with the memory and put each CPU on a slot.

Another detail is the number of PCI Express slots. That board alone seems to have 10 of them.
 
A very interesting article on how the Nehalem-EX differs from the "normal" Nehalem.

The biggest difference between the two is the ring bus. At 24MB, the cache is far too big to run at an acceptable speed, and making 24MB 8-ported fast cache RAM was a basically impossible task. Instead, Intel split the cache up into eight 3MB chunks called slices, and assigned one per core. That size cache is easy enough to design, and they ended up as inclusive with 4-ports.

Eight independent caches are not all that useful compared to a single large 24MB cache, so Intel put a large, bidirectional ring bus in the middle of the cache to shuttle the data around. If any core needs a byte from any other cache, it is no more than 4 ring hops to the right cache slice.
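
A tiny sketch of that "no more than 4 hops" claim, assuming 8 ring stops (one per cache slice) on a bidirectional ring:

# On a bidirectional ring with 8 stops, the worst-case distance between
# any two stops is 8 // 2 = 4 hops, since you can always go the shorter way.
STOPS = 8

def ring_hops(src, dst, stops=STOPS):
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)

worst = max(ring_hops(a, b) for a in range(STOPS) for b in range(STOPS))
print("worst-case hops:", worst)   # -> 4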

The ring bus is actually four rings, with the data ring being 32 bytes wide in each direction. It looks a lot like the ring in Larrabee, but Intel has not announced the width of that part yet. That said, it is 1024 bits wide, 512b times two directions. There are eight stops on the Becton ring, and data moves across it at one stop per clock.

To pull data off the ring, each stop can only pull one request off per clock. With two rings, this could be problematic if data comes in from both directions at the same time. The data flow would have to stop or the packets would have to go around again. Neither scenario is acceptable for a chip like this.

Intel solved this by putting some smarts into the ring stops, and added polarity to the rings. Each ring stop has a given polarity, odd or even, and can only pull from the correct ring that matches that polarity. The ring changes polarity once per clock, so which stop can read from which ring on a given cycle is easy to figure out.

Since a ring stop knows which other stop it is going to talk to, what the receiver polarity is, and what the hop count is between them, the sender can figure out when to send so that the receiver can actually read it. By delaying for a maximum of one cycle, a sender can assure the receiver can read it, and the case of two packets arriving at the same time never occurs.
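
A toy model of the polarity trick as described above (my own sketch of the scheme, not Intel's actual logic): each stop has a fixed polarity, arrival-cycle polarity alternates every clock, and the sender waits at most one cycle so that the packet arrives on a cycle its receiver is allowed to read.

# Toy model of the ring polarity scheme described above.
# Assumptions for the sketch: a stop's polarity is simply its index parity,
# and a stop may only read a packet whose arrival-cycle parity matches it.

STOPS = 8

def hops(src, dst):
    cw = (dst - src) % STOPS
    return min(cw, STOPS - cw)

def send_cycle(src, dst, now):
    """Earliest cycle >= now at which src can inject so dst can read it."""
    for delay in (0, 1):                      # at most one cycle of delay
        arrival = now + delay + hops(src, dst)
        if arrival % 2 == dst % 2:            # arrival parity matches receiver polarity
            return now + delay
    raise AssertionError("unreachable: one of two consecutive cycles must match")

# Every (src, dst, cycle) combination finds a legal injection slot
# within one cycle of waiting, so nothing is ever sent that can't be read.
for src in range(STOPS):
    for dst in range(STOPS):
        for now in range(4):
            assert send_cycle(src, dst, now) - now <= 1
print("sender never waits more than one cycle")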

In the end, the ring has four times the bandwidth of a similar width unidirectional ring, half the latency, and never sends anything that the receiver can't read. The raw bandwidth available is over 250GBps, and that scales nicely with the number of stops. You could safely speculate that Eagleton will have a 375GBps ring bus if the clocks don't change much.

Moving on to QPI, there is a second controller to enable four links per socket. In addition to allowing Becton to scale to eight sockets gluelessly, the chip can do two independent transactions over QPI at the same time. There are two functional blocks to assist with this, and Intel calls them QPI Home Agents (HA).

The home agents have much deeper caches and request queues than a normal QPI controller on a Lynnfield or Bloomfield part. The HAs support 256 outstanding requests, with up to 48 from one single source. For an eight socket system, this is not just nice but somewhat mandatory for scaling.
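
A minimal sketch of what those limits mean in practice (hypothetical bookkeeping, just to illustrate the 256-total / 48-per-source figures quoted above; not Intel's actual implementation):

# Hypothetical tracker for the quoted HA limits: up to 256 outstanding
# requests in total, at most 48 of them from any single source.
TOTAL_LIMIT = 256
PER_SOURCE_LIMIT = 48

class HomeAgentQueue:
    def __init__(self):
        self.outstanding = {}          # source id -> in-flight request count

    def try_accept(self, source):
        total = sum(self.outstanding.values())
        if total >= TOTAL_LIMIT or self.outstanding.get(source, 0) >= PER_SOURCE_LIMIT:
            return False               # request must stall or be retried
        self.outstanding[source] = self.outstanding.get(source, 0) + 1
        return True

    def complete(self, source):
        self.outstanding[source] -= 1

ha = HomeAgentQueue()
accepted = sum(ha.try_accept("socket-0") for _ in range(60))
print("accepted from one source:", accepted)   # -> 48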

HAs don't just track QPI requests, they can also track memory requests, and do some prefetching and write posting. On top of that, they control a lot of the cache coherency between sockets, something Intel calls a hybrid coherency protocol.

Augmenting the HAs are QPI Caching Agents, two per core, one per HA. The Caching Agents do what they sound like they do: cache QPI requests and data. Additionally, they can go directly to local memory, not just QPI, and send results directly to the correct core as well. QPI handling in Becton is not just more intelligent, but also much better buffered.

The Nehalem family is the first modern Intel part to have memory controllers on die, so the memory controller count scales with socket count. Becton has two memory controllers per die, two channels per die, and two memory buffers per channel. With four DDR3 DIMMs per channel, that means 2 X 2 X 2 X 4, or 32 for the math averse, per socket. On an eight socket system, that means 512 DIMMs, 4TB of memory per box. That is almost enough for running Vista at tolerable speeds.

In case you didn't notice, there was something new in the memory hierarchy: memory buffers. The idea is simple, the earlier FB-DIMMs put a complex buffer onto the DIMM itself. It was expensive, hot, and generally unloved, but brought a ton of useful features to the memory subsystem.

Intel was a bit shy when it came to talking about what these new buffers do, but we hear it started with FB-DIMM AMB buffers and evolved things from there. If the new buffers kept the RAS features and other similar technologies, they will be a net plus for the Nehalem EX platform.

http://www.semiaccurate.com/2009/08/25/intel-details-becton-8-cores-and-all/
 