PS3's Cell CPU tops high-performance computing benchmark
By Jon Stokes | Published: April 30, 2008 - 10:40PM CT
A UC Berkeley paper [PDF] recently submitted to the IEEE International Parallel and Distributed Processing Symposium manages to highlight two common and seemingly unrelated themes that have come up a number of times over the past few years in my reporting on the high-performance computing (HPC) space: 1) IBM's Cell is really good at HPC workloads when you invest the time to write custom code for it, and 2) Intel's Xeon platform is perennially bandwidth starved and not very power-efficient.
The researchers used a common HPC benchmark apparently known primarily by its initials, LBMHD, to test the following dual-socket, multicore platforms: Intel's "Clovertown" Xeon, AMD's Opteron X2, Sun's Niagara2, and IBM's Cell, along with a single-socket, single-core Itanium2. The main focus of the research was uncovering the bottlenecks behind the LBMHD benchmark and exploring the use of auto-tuners to optimize the benchmark for different multicore systems. Along the way, though, the researchers uncovered some interesting details that support the two conclusions above.
The LBMHD benchmark is easily parallelizable and scales well with the number of threads. Because it uses complex data structures with irregular memory access patterns, and because thread count affects its performance far more than per-thread performance does, the benchmark was widely thought to be memory-bandwidth constrained.
In looking for bottlenecks for their auto-tuner to optimize, the researchers discovered that memory bandwidth was not, in fact, the main constraint holding back the benchmark on the platforms under examination—at least not initially. Rather, translation look-aside buffer (TLB) resources, cache bandwidth, memory latency, and code scheduling were holding the unoptimized versions of LBMHD back on the different platforms.
The researchers then optimized for the above factors on each platform and ran the benchmark again, finding that, in the case of Intel's Clovertown, the bottleneck had now shifted to the memory subsystem: the chipset can't move enough data from the single DRAM bus into both front-side buses. This seriously constrains Clovertown's ability to scale with the number of threads, despite the other optimizations.
In contrast, the Opteron X2's NUMA design is not bandwidth-bottlenecked and scales nearly linearly once the TLB optimization is enabled.
At the top of the benchmark heap in both scaling and raw performance was IBM's Cell, which achieves near-perfect linear scaling for a couple of reasons. First, the LBMHD code had to be written specifically for Cell: the architecture is unusual enough that the researchers couldn't simply compile the code and auto-tune it as they did for the other architectures, so Cell's code was hand-optimized from the get-go. An even bigger factor in the platform's superior performance, though, is Cell's XDR (Rambus) memory subsystem, which delivers plenty of bandwidth directly into the Cell socket.
http://arstechnica.com/news.ars/pos...ops-high-performance-computing-benchmark.html