The biggest difference between the two is the ring bus. At 24MB, a monolithic cache would be far too big to run at an acceptable speed, and building 24MB of fast, 8-ported cache RAM was a basically impossible task. Instead, Intel split the cache up into eight 3MB chunks called slices, and assigned one per core. A cache of that size is easy enough to design, and the slices ended up inclusive, with four ports each.
Eight independent caches are not all that useful compared to a single large 24MB cache, so Intel put a large, bidirectional ring bus in the middle of the cache to shuttle the data around. If any core needs a byte from any other cache, it is no more than 4 ring hops to the right cache slice.
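For the curious, the four hop figure falls straight out of the ring geometry. Here is a minimal sketch, assuming stops are numbered 0 through 7 around the ring; the numbering and the function name `ring_hops` are ours for illustration, not Intel's.

```python
def ring_hops(src: int, dst: int, stops: int = 8) -> int:
    """Fewest hops between two stops on a bidirectional ring."""
    clockwise = (dst - src) % stops          # hops going one way round
    counter_clockwise = (src - dst) % stops  # hops going the other way
    return min(clockwise, counter_clockwise)


# Worst case from stop 0 to any stop on an eight stop ring.
worst_case = max(ring_hops(0, dst) for dst in range(8))
print(worst_case)
```

A packet takes whichever direction is shorter, so the stop directly across the ring is the farthest away, and on eight stops that is four hops.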
The ring bus is actually four rings, with the data ring being 32 bytes wide in each direction. It looks a lot like the ring in Larrabee, though Intel has not announced the width of that part's ring yet. That said, Larrabee's ring is 1024 bits wide, 512b times two directions. There are eight stops on the Becton ring, and data moves across it at one stop per clock.
Each stop can only pull one request off the ring per clock. With data running in both directions, this could be problematic if packets come in from both sides at the same time. The data flow would have to stop or the packets would have to go around again. Neither scenario is acceptable for a chip like this.
Intel solved this by putting some smarts into the ring stops and adding polarity to the rings. Each ring stop has a given polarity, odd or even, and can only pull from the ring that matches that polarity. The ring changes polarity once per clock, so which stop can read from which ring on a given cycle is easy to figure out.
Since a ring stop knows which other stop it is going to talk to, what the receiver polarity is, and what the hop count is between them, the sender can figure out when to send so that the receiver can actually read the packet. By delaying for a maximum of one cycle, a sender can ensure the receiver can read it, and the case of two packets arriving at the same time never occurs.
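To see why one cycle of delay is always enough, here is a toy model of the scheme. It is a sketch under our own assumptions, not Intel's actual logic: stops are numbered 0 through 7, ring 0 runs clockwise and ring 1 counter-clockwise, and a stop may read ring 0 when its number plus the cycle count is even and ring 1 when it is odd. The names `can_pull`, `route` and `send_delay` are hypothetical.

```python
STOPS = 8  # eight ring stops, one hop per clock


def can_pull(stop: int, ring: int, cycle: int) -> bool:
    """Assumed polarity rule: exactly one ring is readable per stop per cycle."""
    return (stop + cycle) % 2 == ring


def route(src: int, dst: int) -> tuple[int, int]:
    """Pick the shorter direction; returns (ring, hops). Ties go clockwise."""
    clockwise = (dst - src) % STOPS
    counter_clockwise = (src - dst) % STOPS
    if clockwise <= counter_clockwise:
        return 0, clockwise
    return 1, counter_clockwise


def send_delay(src: int, dst: int, cycle: int) -> int:
    """Cycles the sender waits so the packet lands when dst can read its ring."""
    ring, hops = route(src, dst)
    arrival = cycle + hops
    # Polarity flips every clock, so if this cycle is wrong the next one is right.
    return 0 if can_pull(dst, ring, arrival) else 1
```

Because the readable ring alternates every clock, a packet that would arrive on the wrong cycle is fixed by holding it back exactly one clock, and a stop is never asked to read both rings at once.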
In the end, the ring has four times the bandwidth of a similar width unidirectional ring, half the latency, and never sends anything that the receiver can't read. The raw bandwidth available is over 250GBps, and that scales nicely with the number of stops. You could safely speculate that Eagleton will have a 375GBps ring bus if the clocks don't change much.
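The speculation is simple linear scaling. A quick sketch, assuming bandwidth grows in direct proportion to the number of stops at the same clock; note that the stop count it produces for a 375GBps ring is only implied by the ratio of the two figures, not something Intel has stated, and both function names are ours.

```python
def scaled_bandwidth(base_gbps: float, base_stops: int, new_stops: int) -> float:
    """Ring bandwidth if it scales linearly with stop count at the same clock."""
    return base_gbps * new_stops / base_stops


def implied_stops(base_gbps: float, base_stops: int, target_gbps: float) -> float:
    """Stop count a target bandwidth would imply under the same assumption."""
    return base_stops * target_gbps / base_gbps


print(implied_stops(250, 8, 375))    # stops implied by a 375GBps ring
print(scaled_bandwidth(250, 8, 12))  # bandwidth of a 12 stop ring
```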
Moving on to QPI, there is a second controller to enable four links per socket. In addition to allowing Becton to scale to eight sockets gluelessly, the chip can do two independent transactions over QPI at the same time. There are two functional blocks to assist with this, and Intel calls them QPI Home Agents (HA).
The home agents have much deeper caches and request queues than a normal QPI controller on a Lynnfield or Bloomfield part. The HAs support 256 outstanding requests, with up to 48 from one single source. For an eight socket system, this is not just nice but somewhat mandatory for scaling.
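Those two limits behave like a simple admission policy. The sketch below is a software analogy only, assuming a tracker that counts in-flight requests per source; the class name `RequestTracker` and its methods are hypothetical, and the real hardware queues are certainly organized differently.

```python
from collections import Counter


class RequestTracker:
    """Toy model: 256 outstanding requests in total, at most 48 per source."""

    def __init__(self, total_limit: int = 256, per_source_limit: int = 48):
        self.total_limit = total_limit
        self.per_source_limit = per_source_limit
        self.in_flight = Counter()  # source -> requests currently outstanding

    def try_accept(self, source: str) -> bool:
        """Admit a request unless either limit has been reached."""
        if sum(self.in_flight.values()) >= self.total_limit:
            return False
        if self.in_flight[source] >= self.per_source_limit:
            return False
        self.in_flight[source] += 1
        return True

    def complete(self, source: str) -> None:
        """Retire one outstanding request from the given source."""
        if self.in_flight[source] > 0:
            self.in_flight[source] -= 1
```

The per-source cap keeps one busy socket from hogging the whole queue, which matters a lot more with eight sockets than with two.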
HAs don't just track QPI requests, they can also track memory requests, and do some prefetching and write posting. On top of that, they control a lot of the cache coherency between sockets, something Intel calls a hybrid coherency protocol.
Augmenting the HAs are QPI Caching Agents, with two per core, one per HA. The Caching Agents do what they sound like they do, cache QPI requests and data. Additionally, they can go directly to local memory, not just QPI, and send results directly to the correct core as well. QPI handling in Becton is not just more intelligent, but also much better buffered.
The Nehalem family is the first modern Intel part to have memory controllers on die, so the memory controller count scales with socket count. Becton has two memory controllers per die, two channels per controller, and two memory buffers per channel. With four DDR3 DIMMs per buffer, that means 2 X 2 X 2 X 4, or 32 for the math averse, per socket. On an eight socket system, that means 256 DIMMs, 4TB of memory per box. That is almost enough for running Vista at tolerable speeds.
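For anyone who wants to check the multiplication, here it is spelled out. The variable names reflect the reading that makes the 2 X 2 X 2 X 4 product work (channels counted per controller, DIMMs counted per buffer), and the per-DIMM capacity is simply what 4TB divided by the DIMM count implies, not a figure Intel gave.

```python
controllers_per_socket = 2
channels_per_controller = 2
buffers_per_channel = 2
dimms_per_buffer = 4

dimms_per_socket = (
    controllers_per_socket
    * channels_per_controller
    * buffers_per_channel
    * dimms_per_buffer
)

sockets = 8
dimms_per_box = dimms_per_socket * sockets

# Capacity per DIMM implied by a 4TB box (assumption: 1TB = 1024GB).
total_memory_gb = 4 * 1024
gb_per_dimm = total_memory_gb / dimms_per_box

print(dimms_per_socket, dimms_per_box, gb_per_dimm)
```

Eight sockets at 32 DIMMs each comes to 256 DIMMs, which lines up with 4TB if every slot holds a 16GB stick.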
In case you didn't notice, there was something new in the memory hierarchy: memory buffers. The idea is simple. The earlier FB-DIMMs put a complex buffer onto the DIMM itself. It was expensive, hot, and generally unloved, but brought a ton of useful features to the memory subsystem.
Intel was a bit shy when it came to talking about what these new buffers do, but we hear the design started with the FB-DIMM AMB and evolved from there. If the new buffers keep the RAS features and other similar technologies, they will be a net plus for the Nehalem EX platform.