AMD - HSA and the evolution of the APU

Dark Kaeser

This year's AMD Fusion Developer Summit (AFDS 2012) brought some interesting announcements:


- Creation of the HSA Foundation

AMD, ARM, Imagination Technologies, MediaTek Inc., and Texas Instruments (TI) are the initial founding members of the HSA Foundation.

About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a not-for-profit consortium for SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easy to program for parallel computing. HSA members are building a heterogeneous compute ecosystem, rooted in industry standards, for combining scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance at low power consumption. HSA defines interfaces for parallel computation utilizing CPU, GPU and other programmable and fixed function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general purpose computing.


About Heterogeneous System Architecture (HSA)
Developers will benefit from the open standard programming of HSA for both the CPU and GPU, which allows the two processors to work cooperatively and directly in system memory. Additionally, HSA provides a single architecture across multiple operating systems and hardware designs. By maximizing the full compute capabilities of systems with both CPUs and GPUs, users can see performance and energy efficiency boosts across a variety of applications.
http://hsafoundation.com/hello-hsa-foundation/
http://semiaccurate.com/2012/06/12/amd-and-arm-joined-by-imagination-ti-and-mediatek/


- AMD 2013 APUs To Include ARM Cortex-A5 Processor For TrustZone Capabilities

In order to implement a hardware security platform on their future APUs, AMD has chosen to enter into a strategic partnership with ARM for the purpose of gaining access to ARM’s TrustZone technology. By licensing TrustZone, AMD gains a hardware security platform that’s already in active use, which means they avoid fragmenting the market and the risks that would bring. Furthermore AMD saves on the years of work – both technical and evangelical – that they would have needed had they rolled their own solution. Or more simply put, given their new willingness to integrate 3rd-party IP, licensing was the easy solution to getting a hardware security platform quickly.

But because TrustZone is an ARM technology (both in name and ISA) AMD needs an ARM CPU to execute it. So the key to all of this will be the integration of an ARM processor into an AMD APU, specifically ARM’s Cortex-A5 CPU. The Cortex-A5 is ARM’s simplest ARMv7 application processor, and while it’s primarily designed for entry-level and other lower-performance devices, as it turns out it fits AMD’s needs quite nicely since it won’t be used as a primary application processor.
http://www.anandtech.com/show/6007/...rtexa5-processor-for-trustzone-capabilities/1

Van Doorn said AMD plans to offer its first APUs incorporating the Cortex-A5 core next year. The scheme will first be implemented in G-Series APUs for tablets and ultrathin PCs in 2013, then implemented across the firm's APU product line in 2014, he said.
http://www.eetimes.com/electronics-news/4375240/AMD-to-integrate-ARM-core-into-APUs




- AMD Announces CodeXL At AFDS 12: A Unified Debugging and Profiling Tool for Heterogeneous Applications

In its simplest form AMD’s CodeXL brings together AMD’s GPU compute tools and AMD’s CPU compute tools to enable faster and more robust development of OpenCL and GPU accelerated applications. CodeXL does not replace AMD’s traditional tools for CPU-specific applications or for GPU rendering applications...
One of the biggest features of AMD’s CodeXL is support for both OpenGL and OpenCL debugging. It also includes tools for doing both CPU and GPU code profiling.

http://semiaccurate.com/2012/06/12/amd-announces-codexl-at-afds/


This last announcement may be related to the HSA roadmap, which calls for tighter integration in 2013.

[slide: HSA roadmap]


http://www.tomshardware.com/news/amd-fusion-trinity-apu-liveblog,15986.html
 
Since, despite the views, this topic doesn't seem to have sparked much interest, I was trying to compile some articles to explain in more detail the implications of what was presented during the conference, especially regarding the evolution of the APU and its relation to the creation of the HSA Foundation, but someone at Tom's Hardware decided to get there first and write a more complete article:

AMD Fusion: How It Started, Where It’s Going, And What It Means
 
From the Hot Chips conference, where AMD's bet and its direction with HSA start to become clearer:

Starting with the last slide makes it easier to follow:

[slide image]



the road ahead, etc.:

[slide images]



all of this with this final goal:

[slide: the Surround Computing vision]

http://www.techpowerup.com/171255/A...-Outlines-Vision-for-Surround-Computing-.html

as can be seen from the HSA Roadmap slide, which calls for the integration (or fusion) of the CPU and GPU in 2014, and from the goals behind the creation of the HSA Foundation:

HSA members are building a heterogeneous compute ecosystem, rooted in industry standards, for combining scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance at low power consumption.
http://hsafoundation.com/hello-hsa-foundation/

All this talk of "Fusion", APU, HSA and so on is not innocent: the bet on it is a kind of attempt to "flank" Intel by going after performance on the GPU. Don't forget that AMD's current GPU architecture, GCN, has more compute power than the previous architectures.
Also announced today: the hiring of John Gustafson as Chief Graphics Product Architect.
Gustafson is a 35-year veteran of the computing industry. He joins AMD from Intel, where he headed the company's eXtreme Technologies Lab, conducting cutting-edge research on energy-efficient computing and memory, as well as optical, energy and storage technologies. Prior to that, he served as CEO at Massively Parallel Technologies and CTO at ClearSpeed Technology, a high-performance computing company. Gustafson has also held key management and research positions at numerous companies including Sun Microsystems, Ames Laboratory and Sandia National Laboratories.

In 1988, Gustafson wrote Reevaluating Amdahl's Law to address limitations of Amdahl's Law, which models the maximum potential performance improvement from parallel processing. Gustafson proved that processors working in parallel can solve larger problems, marking a change in how the industry viewed parallel processing. Today, Gustafson's Law is widely accepted among academia as the standard for parallel processing education.
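The two laws contrasted in that paragraph are easy to state concretely. A minimal sketch in Python (the 5% serial fraction below is an arbitrary illustration, not a figure from the article):

```python
def amdahl_speedup(n_procs, serial_frac):
    """Amdahl's Law: fixed problem size, so the serial fraction caps
    speedup at 1/serial_frac no matter how many processors are added."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_procs)

def gustafson_speedup(n_procs, serial_frac):
    """Gustafson's Law: the problem grows with the machine, so scaled
    speedup keeps rising almost linearly with the processor count."""
    return n_procs - serial_frac * (n_procs - 1)

# With a 5% serial fraction, Amdahl saturates near 20x
# while the Gustafson scaled speedup keeps climbing.
for n in (2, 16, 256):
    print(n, round(amdahl_speedup(n, 0.05), 2),
             round(gustafson_speedup(n, 0.05), 2))
```

This is exactly the shift the quote describes: under the scaled-workload view, adding parallel hardware keeps paying off instead of running into Amdahl's ceiling.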

"I look forward to working with my teams to expand the AMD graphics technology roadmap," said Gustafson. "The next decade will serve as a watershed era for GPUs in graphics rendering power and compute capabilities, creating the opportunity for multi-teraFLOPS APUs. In terms of raw performance, the evolution of discrete graphics has far exceeded that of the CPU, and the programmable characteristics of today's GPUs have thrown open a door that could very well see it rival the CPU as the most critical element of computer performance in the near future."
http://www.techpowerup.com/171256/AMD-Hires-John-Gustafson-as-Chief-Graphics-Product-Architect.html


AMD has a plan/strategy, and having won the support of ARM and some of its partners for this undertaking is always better than going down the road alone. But when it comes to plans/strategies, as that great philosopher and thinker of our time would say...
 
Heterogeneous System Architecture (HSA) Foundation announces six new members

Supporter

  • Arteris – a leading supplier of network-on-chip (NoC) interconnect IP solutions
    “HSA will usher in a new era of advanced processing capabilities,” said K. Charles Janac, president and chief executive officer of Arteris. “To optimize for power and performance, we will provide network-on-chip interconnect IP and system IP that will make HSA systems-on-chip more efficient, minimizing power consumption and size for consumer electronics, mobile, automotive and other applications.”


  • MulticoreWare – a leading software tool and library provider
    “MulticoreWare has been at the forefront of providing developer tools and libraries that leverage heterogeneous computing,” said AGK Karunakaran, president and chief executive officer of MulticoreWare. “As HSA takes hold as an industry standard and becomes the interface for parallel computing, we will provide the tools, libraries and support to semiconductor vendors with their HSA supported SDKs and developers at ISVs that want to optimize their applications for the next era of computing performance.”

Contributor

  • Apical – a leader in advanced image processing technology
    “We are living in a world of screens where visuals have become the preferred choice of how we communicate,” said Michael Tusch, chief executive officer of Apical. “As a new member of the HSA Foundation, we look forward to leveraging heterogeneous computing to deliver advanced digital imaging and display technologies that will improve the user experience.”


  • Sonics – a leading supplier of system IP for cloud-scale SoCs
    “Technology companies are rushing to deliver products that satisfy the growing appetite for connected devices and content,” said Jack Browne, vice president of marketing at Sonics. “Our broad portfolio of system IP, which includes network, memory, power and security subsystems, helps leading SoC vendors build better chips, faster and at lower cost. Sonics’ support of HSA will further accelerate SoC and OEM vendors’ time-to-market and put the next-generation of connected devices in consumers’ hands sooner.”


  • Symbio – a leading provider of R&D innovation services and outsourced product development solutions
    "Symbio believes HSA heterogeneous architecture is a game changer that will unbound the limitations of traditional processor and graphics architectures,” said Jacob Hsu, chief executive officer of Symbio. “We’re just scratching at the surface of all the possibilities as performance of many of the algorithms and usage cases will be significantly improved."

Associate

  • Vivante – a worldwide leader in graphics and GPU Compute technologies for handheld, consumer and embedded devices
    “The heterogeneous computing revolution has taken a huge step forward with the formation of an open standard driven by HSA Foundation. As a global innovator in graphics and GPU technologies, Vivante is excited to join the foundation as it defines a hybrid platform architecture that takes full advantage of the massively parallel processing cores in our GPUs,” said Wei-Jin Dai, president and chief executive officer of Vivante. “We look forward to collaborating with ecosystem partners as we bring exciting hybrid computing initiatives to future mobile, consumer, and embedded devices.”
http://www.techpowerup.com/171469/HSA-Foundation-Announces-Six-New-Members.html


EDIT: Samsung Joins The HSA Foundation

This morning AMD announced that Samsung has officially joined the HSA Foundation as a founding member of the consortium.

But more than that, it's becoming ever clearer how AMD is trying to unify all of the companies that have something to gain from moving past Intel's monopoly on the market. This is why we're seeing big ARM-based companies like Samsung, TI, Vivante, and of course ARM proper, joining up with the HSA Foundation. It's still too early to say how this one is going to turn out, but an HSA-enabled future almost sounds too good to pass up.
http://semiaccurate.com/2012/08/31/samsung-joins-the-hsa-foundation/
 
AMD has released the CodeXL beta

With CodeXL essentially anyone can profile both the CPU and GPU code execution of their application. Additionally, CodeXL allows developers to debug both OpenCL and OpenGL code, and has the ability to do static kernel analysis, which can accurately estimate how your code is going to perform without compiling it.

CodeXL is aimed at helping developers create applications that are both CPU and GPU accelerated, rather than just GPU accelerated or purely CPU based. In the simplest terms AMD’s CodeXL is the best tool available for profiling and debugging OpenCL code on AMD platforms. Outside of the merits of its basic functionality, AMD is setting CodeXL up for success by developing three different versions of the tool. There will be a standalone version for Windows, another standalone version for Linux, and a third version that is a plugin for Microsoft’s Visual Studio development environment.

Up until now the OpenCL and GPU acceleration melody that AMD has been playing to developers has been handicapped by a lack of good tools to get the job done.
We are still a ways off from having a high-level language for systems with a unified memory address space as Mr. Malloy envisioned, but CodeXL does address the here and now aspect of enabling more developers to take advantage of GPU acceleration.
http://semiaccurate.com/2012/09/25/amd-releases-codexl-public-beta/

A video shot during the AFDS 2012 conference:
http://www.youtube.com/watch?v=EtiAWf_lufE
 
AMD and Oracle to Explore Heterogeneous Computing for Java

During the JavaOne 2012 Strategy Keynote, AMD announced its participation in OpenJDK Project "Sumatra" in collaboration with Oracle and other members of the OpenJDK community to help bring heterogeneous computing capabilities to Java for server and cloud environments.

The OpenJDK Project "Sumatra" will explore how the Java Virtual Machine (JVM), as well as the Java language and APIs, might be enhanced to allow applications to take advantage of graphics processing unit (GPU) acceleration, either in discrete graphics cards or in high-performance graphics processor cores such as those found in AMD accelerated processing units (APUs).
As emerging server and cloud platforms tap into the heterogeneous compute capabilities of APUs and discrete GPUs to achieve enhanced power/performance capabilities, developers are requiring mainstream programming models such as Java to help them harness the advantages of GPU acceleration. Project "Sumatra" may also provide guidance on enabling heterogeneous compute support for other JVM-based languages such as Scala, JRuby and Jython.


"Affirming our plans to contribute to the OpenJDK Project represents the next step towards bringing heterogeneous computing to millions of Java developers and can potentially lead to future developments of new hardware models, as well as server and cloud programming paradigms," said Manju Hegde, corporate vice president, Heterogeneous Applications and Developer Solutions at AMD. "AMD has an established track record of collaboration with open-software development communities from OpenCL™ to the Heterogeneous System Architecture (HSA) Foundation, and with this initiative we will help further the development of graphics acceleration within the Java community."


"We expect our work with AMD and other OpenJDK participants in Project 'Sumatra' will eventually help provide Java developers with the ability to quickly leverage GPU acceleration for better performance," said Georges Saab, vice president, Software Development, Java Platform Group at Oracle. "We hope individuals and other organizations interested in this exciting development will follow AMD's lead by joining us in Project 'Sumatra.'"
http://www.techpowerup.com/173011/AMD-and-Oracle-to-Explore-Heterogeneous-Computing-for-Java.html
 
HSA Foundation Announces Qualcomm as Newest Founder Member

The Heterogeneous System Architecture (HSA) Foundation today announced that Qualcomm Incorporated has joined as a Founder Member. Qualcomm's commitment reinforces HSA as the next technological underpinning in computing for a broad range of platforms and devices. Since its formation in June, the HSA Foundation has more than doubled its membership with new Founder, Supporter, Contributor and Associate members that have joined the consortium.

Qualcomm joins AMD, ARM, Imagination Technologies, MediaTek Inc., Samsung Electronics Ltd. and Texas Instruments as founder members of the HSA Foundation. The companies are working together to drive a single architecture specification, which simplifies the programming model for software developers on modern platforms and devices. The HSA Foundation will unlock the performance and power efficiency of the parallel computing engines found in heterogeneous processors.




"It's great to see an innovative company like Qualcomm, which has revolutionized the wireless communications market, placing their support behind HSA," said Phil Rogers, HSA Foundation President and AMD Corporate Fellow. "With HSA, computing becomes much more power efficient, enabling member companies like Qualcomm, to create unique and compelling experiences for the consumer."

"Future Snapdragon processors from Qualcomm will contain substantially more computing performance and integrated parallel processing technology in order to meet the high performance, low power needs of our mobile customers," said Jim Thompson, senior vice president of engineering at Qualcomm. "We believe that developers will be able to deliver faster and more innovative applications on future Snapdragon processors if certain aspects of heterogeneous computing are standardized, so we are pleased to join the HSA Foundation to help define open standards."

HSA Foundation continues to build momentum throughout the industry and will be delivering technical presentations at the IEEE International Conference on Computer Design, Sept. 30 - Oct. 3, 2012; and at the 2012 ARM TechCon, Oct. 30 - Nov. 1.

http://www.techpowerup.com/173133/HSA-Foundation-Announces-Qualcomm-as-Newest-Founder-Member.html
 
But current CPUs and GPUs have been designed as separate processing elements and do not work together efficiently – and are cumbersome to program. Each has a separate memory space, requiring an application to explicitly copy data from CPU to GPU and then back again.

A program running on the CPU queues work for the GPU using system calls through a device driver stack managed by a completely separate scheduler. This introduces significant dispatch latency, with overhead that makes the process worthwhile only when the application requires a very large amount of parallel computation. Further, if a program running on the GPU wants to directly generate work-items, either for itself or for the CPU, it is impossible today!


The HSA team at AMD analyzed the performance of Haar Face Detect, a commonly used multi-stage video analysis algorithm used to identify faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model.

HW Configuration

  • 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1
  • APU: AMD A10 4600M with Radeon™ HD Graphics
  • CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz)
  • GPU: AMD Radeon HD 7660G, 6 compute units, 685MHz
http://developer.amd.com/Resources/hc/heterogeneous-systems-architecture/Pages/default.aspx
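A toy model of the difference described above (all names here are hypothetical; this only mimics the data movement, not the actual Haar cascade): the discrete-GPU path must copy the frame into device memory and copy the result back, while the HSA path hands the very same buffer to the accelerator.

```python
def process_stage(frame):
    # Stand-in for one detection stage; a real Haar cascade is far more complex.
    return [x * 0.5 for x in frame]

def run_discrete(frame, copy_log):
    """Discrete-GPU model: explicit host<->device copies around the kernel."""
    device_buf = list(frame)        # host -> device copy
    copy_log.append("h2d")
    out = process_stage(device_buf)
    result = list(out)              # device -> host copy
    copy_log.append("d2h")
    return result

def run_hsa(frame, copy_log):
    """HSA model: CPU and GPU share one coherent address space; the kernel
    reads the host buffer in place, so nothing is ever logged as a copy."""
    return process_stage(frame)

frame = [float(i) for i in range(8)]
discrete_log, hsa_log = [], []
discrete_out = run_discrete(frame, discrete_log)
hsa_out = run_hsa(frame, hsa_log)
```

Eliminating those copies (and the associated cache flushes and dispatch overhead) is part of where the quoted 2.3x gain comes from; the point of the sketch is only that the shared-memory path produces the same result with zero buffer copies.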


For anyone interested in some of the more technical details of HSA, from both the hardware and the software points of view:
http://developer.amd.com/Resources/hc/heterogeneous-systems-architecture/Asset/hsa10.pdf
 
APU: AMD A10 4600M with Radeon™ HD Graphics
So HSA works on Trinity? Isn't it only Steamroller that will bring that "true" integration of CPU and GPU memory access?
Double the performance vs. OpenCL?? When is x264 HSA coming out? ;)
 
HSA may work on "Trinity" in a limited way, or only in certain situations. HSA defines some hardware requirements that "Trinity" does not meet.

Others are reaching this same conclusion about the APU and HSA, so the idea is not that far-fetched.

4 October, 2012
Kim, a University of Wisconsin-Madison assistant professor of electrical and computer engineering, seeks significant energy savings by optimizing the processors that integrate a central processing unit (CPU) that handles complex operations with a graphic processing unit (GPU) ... These processors — known as accelerated processing units (APUs) — can process a large amount of complex information more efficiently than either a CPU or GPU alone. Tightly integrating a CPU and a GPU on a single chip considerably reduces power and performance overhead normally wasted as CPUs and GPUs communicate over long electrical connections. “Letting the two processors work together at a close distance increases their efficiency,” says Kim.

“Even while the CPU or the GPU is working, there are many components that tap into the processor, but aren’t used at the same time,” says Kim. “They can be put into sleep mode or low-power mode as well. The overall power budget can be used more efficiently for higher computing performance.”
Even tiny power savings could make a big difference at the scale of a modern data center. “It costs billions of dollars in electricity to keep servers running each year,” says Kim. “If we can reduce the power consumption by 10 percent, that can translate to millions, potentially billions in savings. It has a huge economic impact.”
Kim’s research funding includes a $450,000 grant from the National Science Foundation and $900,000 from the Defense Advanced Research Projects Agency (DARPA) as part of a $2.7 million collaboration with Josep Torrellas of the University of Illinois at Urbana-Champaign and Radu Teodorescu of Ohio State University, where Kim will apply his work to support a DARPA initiative aiming to maximize processing capability in unmanned aerial vehicles with limited energy resources.
“The program focus is on how much more you can do per second, and how much power is needed to do it,” says Kim.
http://gpuscience.com/news/cpu-gpu-optimization-could-offer-big-power-savings-for-data-centers/
 
AMD CodeXL 1.0: Final Release Now Available

A quick recap on AMD CodeXL – it is a unified developer tool suite that enables you to quickly and easily identify performance issues and programming errors in applications, without requiring source code modifications. It enables you to debug, profile, and analyze applications to help you achieve maximum performance on AMD APUs, GPUs and CPUs. The tool suite includes

  • GPU Debugger – a comprehensive debugging tool for AMD APUs/GPUs with OpenCL™, OpenGL API calls and OpenCL™ kernels.
  • CPU Profiler – a profiling suite that helps you to identify, investigate and tune application performance on AMD CPUs.
  • GPU Profiler - a complete GPU profiler that you can use to discover bottlenecks in your OpenCL and DirectCompute applications, and find ways to improve performance on AMD APUs/GPUs.
  • Static Analyzer – a handy utility to analyze your OpenCL application statically and estimate performance of your OpenCL kernels without having to run on the actual hardware.
http://blogs.amd.com/developer/2012...h-amd-codexl-1-0-final-release-now-available/
 
- AMD Fellow, Mike Houston’s Keynote at China’s SDCC 2012 (Software Developer Conference China)

http://amddevcentral.com/Lists/AmdLibrary/HSA - Platform Of The Future.pdf
Presentation Video




- Bolt Architecture

The primary goal of Bolt is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing (hc). In this first version, we’ve delivered an introductory set of common compute-optimized routines including sort, scan, transform, and reduce. Compared to writing the equivalent functionality in OpenCL™, you’ll find that Bolt requires significantly fewer lines-of-code and less developer effort. The preview version also provides good acceleration when compared to stock CPU implementations, and we have already identified additional optimizations to be included in future releases. One of our primary goals has been to make the interfaces easy to use, and we have included comprehensive documentation for the library routines, memory management, control interfaces, and host/device code sharing.

Path to Heterogeneous System Architecture (HSA):

Bolt’s device_vector class provides a convenient vector-like interface for managing device memory. On today’s heterogeneous computing devices, managing device memory (and the associated copies to and from host space) is often necessary to obtain good performance from discrete GPUs, and the device_vector makes this as painless as possible. However, in the future, the Heterogeneous Systems Architecture will provide a single, shared, coherent heap of memory with fast access from both the CPU and Accelerated Compute Units – this will eliminate the need for developers to manage the separate device memory. HSA devices will be able to directly access host memory with good performance, and with full support for pageable virtual memory and associated large-memory footprints. This is a powerful concept which will revolutionize the way we program heterogeneous computers and transform it from the domain of special-purpose “compute” languages to being a standard part of popular programming languages. Bolt includes the forward-looking feature of direct access to host memory – i.e. you can use std::vector<> or pointers (i.e. int*) as arguments to Bolt template functions. This will become even more important as hc programming evolves to enable pointers to be shared across host and device, including support for complex pointer-containing data structures such as lists or trees. Bolt’s productivity-oriented development environment enables developers to get good performance on today’s hc platforms. And, the exact same code will run on future HSA Platforms at increased performance and reduced power.
We are excited to take this first step in making heterogeneous compute programming more accessible. And there is much more to come as we expand the Bolt functionality and prepare for HSA devices.
http://blogs.amd.com/developer/2012/12/04/bolt-architecture/
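The "fused" idea behind Bolt's transform_reduce can be sketched in plain Python (this mirrors only the shape of the C++ interface; it is not the Bolt API):

```python
from functools import reduce
import operator

def transform_reduce(seq, unary_op, init, binary_op):
    """Fused transform+reduce: apply unary_op to each element and fold
    the results as they are produced, with no intermediate list."""
    return reduce(binary_op, (unary_op(x) for x in seq), init)

def transform_then_reduce(seq, unary_op, init, binary_op):
    """Unfused version: materializes the whole transformed sequence first."""
    transformed = [unary_op(x) for x in seq]
    return reduce(binary_op, transformed, init)

# Sum of squares of 1..4: both give the same answer,
# but the fused version never builds the [1, 4, 9, 16] intermediate.
squares_sum = transform_reduce(range(1, 5), lambda x: x * x, 0, operator.add)
```

Avoiding the materialized intermediate is the same reason a fused GPU implementation beats separate transform and reduce passes: one kernel launch and no temporary array in device memory.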



- Monte Carlo Sample in Bolt

With Bolt, we are removing many of the challenges that in the past have discouraged mainstream developers from leveraging heterogeneous computing (HC) in their solutions, and in the process we wish to change the perception that developing for a heterogeneous platform is only for select code ninjas.
The Bolt template library is compatible with the C++ Standard Template Library (STL), providing a development environment familiar to most C++ developers and at the same time enabling developers to easily unlock the exceptional performance potential of HC. Bolt doesn’t require knowledge of specialized HC programming APIs such as OpenCL, C++ AMP, etc. Furthermore, it simplifies code development and maintenance by having a single code path that will execute efficiently on both the CPU and the compute accelerator.

The sample I’m using here implements an estimator of π using the Monte Carlo method. Monte Carlo methods solve computation problems by observing the outcome of a large number of random samples in a system. To estimate the value of π, it uses the knowledge that the area of a circle and the area of its bounding square have a ratio of π/4. Then, it runs an experiment by randomly scattering a large number of points over the area of the bounding square and observing the probability that a point will be dropped inside the circle. Figure 1 shows an example of such an experiment performed by the program.
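For reference, the same estimator is only a few lines of plain Python (a sequential CPU sketch of the method, not the Bolt code from the article; the seed parameter is mine, added for reproducibility):

```python
import random

def estimate_pi(n_points, seed=42):
    """Monte Carlo estimate of pi: scatter random points over the unit
    square and count the fraction landing inside the quarter circle,
    whose area ratio to the square is pi/4."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_points)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_points
```

The per-point transform (is this point inside the circle?) followed by a global count is exactly the transform-plus-reduce shape that the article's accelerated implementations exploit.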
[Figure 1: random points scattered over the bounding square]


I’ve tested all the different implementations discussed here on my notebook, which is powered by an A10-4600M APU with an integrated Radeon HD 7660G providing the heterogeneous compute acceleration. The following chart shows the relative speed-up compared to the CPU, transform & accumulate baseline implementation. A higher number indicates better performance:

[chart: relative speed-up vs. the CPU baseline implementation]


From the chart, you can see that the performance gain delivered by Bolt is very impressive and the Bolt fused transform_reduce implementation almost achieves a 50X speed up in the 10 million-point scenario!
http://developer.amd.com/blog/monte-carlo-sample-in-bolt/
 
So far the news has almost always come from AMD's side, but ARM has announced that it is also working on HSA.

"We're in the last gasps of setting the hardware specification," Jem Davies, vice president of technology at the media processing division of ARM, told EE Times. HSA hardware specifications are expected to accompany software and run time standards including HSAIL [HSA intermediate language].


HSA is about a mix of hardware, software, runtime systems and performance, Davies explained. "But the hardware is the furthest upstream. We have to set the specifications so the RTL can get written, licensed out and eventually get manufactured in chips," Davies said. It also explains why he feels that it will be a couple of years before true HSA demonstrations can occur.

Davies explained the HSA Foundation is working on a number of fronts. "That the caches are fully coherent is one of the truths we hold to be self-evident. You can manage the memory without but it is less easy and less efficient," Davies told EE Times. So the hardware specifications would likely include mechanisms for sharing of page tables between processors. "I don't want to copy data I want to share pointers," said Davies pointing out that copying memory is costly in terms of power consumption and especially so when memory fetches have to go off chip.

The first work is likely to be supporting CPU-GPU exchange of compute data, Davies said. The specific nature of that exchange is an advantage, but the HSA does want to produce solutions that are scalable both in terms of numbers of cores and in terms of varieties of cores. So these could also include DSPs and application-specific instruction processors.
http://www.eetimes.com/design/eda-design/4410131/HSA-close-to-setting-hardware-specs




OpenCL Face Detection on Mali-T604
This video, shot on ARM’s stand at Mobile World Congress, looks at the performance of an OpenCL-based face detection algorithm in terms of “detects per second”. The demo compares the performance of running across the quad-core Mali-T604 GPU with that of running on the dual Cortex-A15s. When running on the GPU the algorithm is seen to be running at around 14 dps, with 20% loading on the CPUs. When running on the CPU, performance drops to around 3 dps, with the CPUs running at 100% and 60%.
http://leapconf.com/2013/02/opencl-face-detection-on-arm-mali-t604/
 
- EARLY EXPERIENCES WITH HETEROGENEOUS COMPUTE

Using heterogeneous compute for HPC
  • Heterogeneous computing allows us to partition work to the “right” type of processor, to get work done faster and to meet these demands

GPUs alone are not enough, CPUs alone use too much power


THE IMPORTANCE OF HETEROGENEOUS COMPUTING


  • Sandia is working with AMD and the wider HPC community to:

– Get tools to work with OpenCL including profilers, debuggers, MPI etc
– Acting as a test center for how OpenCL and heterogeneous computing can be used in production HPC
– Giving feedback on tools, hardware, drivers etc when really pushed to the absolute limit
– Want to look at OpenACC and grow community expertise

http://www.penguincomputing.com/files/AMD_Fusion_Developer_Summit_2012.pdf


-The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures
We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.
...
Furthermore, an APU-like design may only be beneficial if the application has a substantial parallel fraction and that fraction can be run on the throughput-oriented cores. The difference in performance of Llano's CPU cores and the Sandy Bridge CPU reflects the costs of not utilizing the appropriate core type, or the potential penalty for devoting too many resources to a set of cores which won't be fully used by an application. When moving to a fused heterogeneous platform, an effective performance model and characterization of the instruction mix will be critical for choosing the correct core type for a kernel.

http://ft.ornl.gov/~dol/papers/cf12_llano.pdf
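The paper's point that an APU design only pays off when the application has a substantial parallel fraction is essentially Amdahl's law. A minimal sketch (the function name is illustrative, not from the paper):

```c
/* Amdahl's law: overall speedup when a fraction p of the workload
 * is accelerated by a factor s (e.g. offloaded to throughput cores),
 * while the remaining (1 - p) stays serial on the latency cores. */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.5, even an infinitely fast GPU caps the whole-application speedup at 2x, which is why the instruction mix characterization the authors call for matters so much.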


-
How setting up our AMD Fusion12 Developer Summit Demo made me appreciate AMD's HSA announcement

The point is that I learned first-hand that leveraging a heterogeneous APU architecture with an OpenCL-based library package can be challenging. With the JIT concept, the diagnostic output for compiles is only as good as the implementation allows for. One also has to be very aware of the underlying hardware limitations. While the GPU cores on the APU and the processing cores 'see' the same physical memory, there is still a clear separation between CPU core and GPU core memory. A simple mechanism for automatically detecting the GPU characteristics to help 'autotune' OpenCL code is also missing. In short … while APUs and OpenCL make it possible to exploit heterogeneous hardware efficiently, it would be great if it was easier and more transparent. From this perspective I was excited about the announcement of the Heterogeneous System Architecture at AMD FDS. While the details have not been released, the notion of a truly unified virtual memory space for GPUs and CPUs should make application development (and debugging) a lot easier. With big organizations such as Sandia National Labs trying to utilize APU technology for HPC, AMD will hopefully move away from the strict desktop/laptop focus for its APU technology and start delivering server-class processors that support features that matter to HPC customers in production environments, such as ECC and two- and four-way processing.
http://www.penguincomputing.com/Blo...mo-made-me-appreciate-AMD's-HSA-announcement-


Uma série de vídeos das conferências realizadas durante a AFDS 2012, para quem tiver paciência :D
http://www.youtube.com/watch?v=uHvvDx9Pjk0&list=PLA5581E4E4FF05061
 
AMD introduces heterogeneous Uniform Memory Access

http://www.tweaktown.com/news/30038/amd-introduces-heterogeneous-uniform-memory-access/index.html


AMD sheds light on Kaveri's uniform memory architecture
Current APUs have non-uniform memory access (NUMA) between the processor and graphics logic. In those solutions, the CPU cores and IGP are both tied to system memory, but they each have their own separate memory pools. The processor cores must jump through hoops to access memory being used by the graphics hardware, and vice versa. Different heaps and different address spaces are involved, and when data needs to be shared, it has to be copied back and forth between the CPU and IGP pools. There is, as you'd expect, a performance cost to all those intermediate steps.

In Kaveri, hUMA takes away the hoops: the processor cores and integrated graphics have a shared address space, and they share both physical and virtual memory. Also, data is kept coherent between the CPU and IGP caches, so there are no cycles lost to synchronization like in current, NUMA-based solutions. All of this should translate into higher performance (and lower power utilization) in general-purpose GPU compute applications. Those applications tap into both the CPU cores and the IGP shaders and must pass data back and forth between them, which would require extra steps without hUMA. AMD said Kaveri's hUMA architecture has been implemented entirely in hardware, so it should support any operating systems and programming models. Virtualization is supported, as well.

http://techreport.com/news/24737/amd-sheds-light-on-kaveri-uniform-memory-architecture




Introducing hUMA, Sophisticated New Memory Architecture from AMD


In order for HSA to be powerful and power-efficient, we still have two major obstacles to overcome in hardware:

  1. Unlock the GPU compute performance and
  2. Remove the bottlenecks of the GPU when accessing system memory.
Enabling high bandwidth access to memory will arguably be important in our quest for unlocking this compute performance. Breaking down bottlenecks in how the GPU accesses memory is important to the future of programming because it allows apps to efficiently move the right tasks to the best suited processing element.

Heterogeneous Uniform Memory Access, or hUMA, signifies the first step in bringing a heterogeneous compute ecosystem to life. hUMA is a highly sophisticated shared memory architecture used in APUs (Accelerated Processing Units). In a hUMA architecture, the CPU and GPU (inside the APU) have full access to the entire system memory: all processing cores in a true UMA system share a single memory address space. hUMA's main features include:



  1. Access to entire system memory space: CPU and GPU processes can dynamically allocate memory from the entire memory space
  2. Pageable Memory: GPU can take page faults, and is no longer restricted to page locked memory
  3. Bi-directional coherent memory: Any updates made by one processing element will be seen by all other processing elements – GPU or CPU
http://blogs.amd.com/fusion/2013/04...phisticated-new-memory-architecture-from-amd/
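The "bi-directional coherent memory" feature is the same contract CPU code already gets from C11 release/acquire atomics: a write published by one processing element becomes visible to the others. A minimal single-chip sketch of that publish/consume pattern (the names are illustrative; this is not the HSA API):

```c
#include <stdatomic.h>

static int result;           /* data written by one "processing element" */
static atomic_int ready;     /* coherent flag signalling the data is out */

/* Producer side: write the data, then publish it with release semantics
 * so that the write to `result` is visible before the flag flips. */
void publish(int v) {
    result = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer side: wait for the flag with acquire semantics, after which
 * reading `result` is guaranteed to observe the producer's write. */
int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until published */
    return result;
}
```

Under hUMA the promise is that this kind of handoff works across the CPU/GPU boundary through ordinary system memory, instead of requiring an explicit buffer copy and driver synchronization.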
 
It's very interesting, but this will mean that current software has to be adapted to take advantage of it, no?
If that's the case it's a shame. Still, Kaveri is starting to sound more and more interesting... A pity that there are almost no AMD laptops on sale in Portugal.
 
To take full advantage of it, some changes to applications will be needed, but even without those it is already possible to reap benefits from the architecture at the driver and/or compiler level.

For example, when you have a vertex buffer object (VBO, a list of vertices defining a 3D model that is loaded into memory for fast rendering), there is an extra parameter that defines the target of the upload: system memory, video memory, or video cache memory (the old PCI/AGP memory). Since we now have a unified memory space, that extra parameter stops being relevant, and the intermediate calls that download/upload the VBO to and from video memory can be skipped.

Unfortunately the same does not apply to OpenCL and the like, because it can create concurrency problems between the CPU and GPU, which may assume they are working on different arrays. In that case, sharing pointers would be disastrous, with unpredictable results.
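That hazard is just pointer aliasing: code written on the assumption that two buffers are distinct misbehaves once they turn out to be the same memory. A minimal CPU-only C sketch of the failure mode (the function name is made up for the example):

```c
#include <stddef.h>

/* A copy "kernel" written on the assumption that src and dst never
 * alias, just as CPU and GPU code often assume about their buffers.
 * When the ranges overlap, earlier iterations overwrite data that
 * later iterations still need to read, silently corrupting it. */
void copy_assuming_disjoint(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With `buf = {1, 2, 3, 4}`, calling `copy_assuming_disjoint(buf + 1, buf, 3)` smears the first element across the array instead of shifting it (a `memmove`-style copy would handle the overlap correctly). In a unified address space, a GPU kernel and CPU code handed the same pointer can collide in exactly this way unless the programming model synchronizes them.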
 
Yes, some changes to existing languages will be needed.

@muddymind
regarding OpenCL, and to avoid the problem you describe, AMD had already made tools available, namely CodeXL

STATIC KERNEL ANALYSIS – KEY FEATURES AND BENEFITS
  • Compile, analyze and disassemble the OpenCL kernel, with support for multiple GPU device targets.
  • View any kernel compilation errors and warnings generated by the OpenCL runtime.
  • View the AMD Intermediate Language (IL) code generated by the OpenCL runtime.
  • View the ISA code generated by the AMD Shader Compiler.
  • View various statistics generated by analyzing the ISA code.
  • View General Purpose Registers and spill registers allocated for the kernel.

SYSTEM REQUIREMENTS

  • Microsoft® Windows 7® (32 or 64-bit) or Microsoft® Windows 8® (32 or 64-bit)
  • Linux®:
    • Red Hat® Enterprise Linux® 64-bit 6.*
    • Ubuntu® 64-bit 12.04 or later
  • Microsoft® Visual Studio® 2010 or 2012 (applies to Microsoft® Visual Studio® Plugin Only)
  • The latest AMD Catalyst driver
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
 
Publication of the reference manual for programmers.

The Programmer's Reference Manual provides a standardized method of accessing all available computing resources in HSA-compliant systems. This enables a wide range of system resources to cooperate on parallelizable tasks. It has been specifically designed to perform in the most energy efficient way without compromising on performance. The goal is to enable a heterogeneous architecture that is easy to program, opens up new and rich user experiences and improves performance and quality of service, whilst reducing energy consumption.


The programming architecture detailed in the HSA Programmer's Reference Manual calls out features specifically exposed to programmers of the HSA architecture. HSA devices will typically include a broad class of devices, including GPUs and DSPs, and support a number of key hardware features that enable easier developer programmability. These include shared coherent virtual memory, platform atomics, user mode queuing and GPU self-queuing.
http://hsafoundation.com/hsa-foundation-announces-first-specification/


The ultimate mission of HSA is to advance Parallel Computing with GPUs or any other kind of programmable devices, to the next level in terms of ease of programming and power efficiency. We needed to repeatedly remind ourselves to strike a balance between the current state of the art, and forward-looking ideas beyond the current, conventional way of programming GPUs, or for that matter any SIMD style processors. Also, by looking at use cases that do not yet exist in the marketplace, we needed to revisit some common themes in computing, such as precision, cache coherency, and memory consistency, again and again. The goal is to create a standard that is practical not only for wide industry-wide adoption, but also for future innovation and differentiation.

For middleware, library and compiler developers, HSAIL is a perfect target due to its low-level nature, and its stability and universality compared to native hardware ISAs. They can invest in R&D on top of HSAIL, and be sure that they would get the return through the HSAIL ecosystem. Application developers can optimize their code manually in HSAIL, and/or leverage third-party HSAIL development tools or environments, and be confident that the real-world performance and efficiency of the applications developed this way will match their expectations. Such an assurance is achieved through hardware vendors striving to optimize their HSA-compliant devices for HSAIL. Since HSAIL defines a virtual machine, not a physical one, hardware companies can innovate and differentiate in their native ISAs and micro-architectures. One of the coolest things about HSAIL is that it can potentially enable an ecosystem in which advances in Parallel Computing can happen independently and synergistically between software and hardware companies.
http://hsafoundation.com/hsa-programer-reference-the-formation-of-the-new-specification/
 