Inside XboxONE: HSA·102 - LCUs & TCUs

Post by BiG Porras » 30 Apr 2016, 12:37

Welcome to the second article in the HSA series. Before going deeper into the HSA architecture, it is important to make clear what kinds of compute units (CUs) HSA provides.


As we said in the previous article, heterogeneous computing refers to systems containing multiple kinds of compute units (such as CPUs, GPUs, DSPs, or various ASICs) and to how the memory of these units is merged into a unified pool that all of them can access simultaneously.

Well, as you all know, a CPU does its work quite differently from a GPU. CPU processing is serial, also called scalar processing, while the GPU performs parallel processing, also called vector processing.

Different problems, different solutions:
Serial, or SCALAR, processing reads the program instructions one at a time and applies them one at a time to different data. This mode is also called SISD (Single Instruction, Single Data): a single instruction affects a single datum, or rather, produces a single output. A scalar instruction would be along the lines of 'to the value 14, add 34'. In a scalar CU, the most important factor is latency, that is, the delay in getting data to and from memory. Because the work done inside the CU per instruction is small, latency is what limits the performance of a scalar CU. As an analogy, if you are the only one going to work, the ideal is a sports car that can go very fast.

Parallel, or VECTOR, processing reads the instructions one at a time, just like scalar processing, but each instruction affects several data items simultaneously. This mode is also called SIMD (Single Instruction, Multiple Data), since a single instruction is applied to different values at the same time. A vector instruction would be 'to the set of values 12, 45, 67 and 34, add 12'. In a vector CU, the limiting factor is the available bandwidth. Since each instruction here does four times the work of the scalar one, the important thing is to have a large bandwidth so that the bus is freed as soon as possible. In this case the sports car is not the most appropriate vehicle. A minibus that takes more people to work is more interesting because, although it goes slower than a sports car, in the end it moves more people at the same time.
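To make the two example instructions concrete, here is a minimal Python sketch. The function names `scalar_add` and `vector_add` are invented for this illustration, and plain Python does not emit real SIMD instructions: the list version only models the idea of one operation being applied to every lane.

```python
def scalar_add(value, addend):
    """SISD style: one instruction, one datum, one result."""
    return value + addend


def vector_add(values, addend):
    """SIMD style (modelled only): the same 'add' is applied to every lane."""
    return [v + addend for v in values]


print(scalar_add(14, 34))                # 48
print(vector_add([12, 45, 67, 34], 12))  # [24, 57, 79, 46]
```

On real hardware the four additions of the second call would be issued as a single vector instruction, which is why the data has to arrive four values at a time.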

And now the crux of the matter:
Very well, in HSA these two types of CUs have names. CUs that are sensitive to latency, that is, CUs specialized in working with low memory latency and therefore good at scalar work, are called LCUs (Latency 'optimized' Compute Units), or latency-optimized CUs. CUs specialized in working with a large available memory bandwidth, and therefore good at vector work, are called TCUs (Throughput 'optimized' Compute Units), or bandwidth-sensitive CUs.

Of course, you can run vector (and therefore bandwidth-sensitive) work on an LCU. It would be the equivalent of 50 people using the sports car to go to work: it can be done, but it takes 50 trips, which takes longer than a single bus trip.

In the same way, you can use a TCU to perform scalar (that is, latency-sensitive) work, but it is like being alone on the bus: you will not get there any faster by riding alone. The bus takes a fixed time and will not take less for carrying fewer passengers.
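The sports car and bus analogy can be put into numbers with a small sketch. Everything here is an assumption made up for illustration: the helper `commute_time`, the seat counts, and the trip times (10 minutes for the car, 30 for the bus). Return legs are ignored to keep the arithmetic simple.

```python
import math


def commute_time(people, seats, trip_minutes):
    """Total minutes when one vehicle makes as many trips as needed."""
    trips = math.ceil(people / seats)
    return trips * trip_minutes


# Hypothetical figures: the car is fast but carries one person,
# the bus is slow but carries fifty.
CAR = {"seats": 1, "trip_minutes": 10}   # plays the role of an LCU
BUS = {"seats": 50, "trip_minutes": 30}  # plays the role of a TCU

print(commute_time(50, **CAR))  # 500: vector-like work on an LCU
print(commute_time(50, **BUS))  # 30:  vector-like work on a TCU
print(commute_time(1, **CAR))   # 10:  scalar-like work on an LCU
print(commute_time(1, **BUS))   # 30:  scalar-like work on a TCU
```

With these made-up figures the bus wins by a wide margin when 50 people travel, and the car wins when there is a single passenger, which is the trade-off between the two kinds of CU.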

Each type of CU is different, yet all are equal:
In HSA, all CUs are counted equally and added together, since all CUs can run all kinds of work. Obviously, LCUs will be more efficient running scalar code and TCUs more efficient running vector code, and HSA allows the programmer to 'advise' which type of CU the code should run on. The final decision on where the code executes, however, belongs to the HSA Runtime, which we will discuss later.

For now, as a preview of what is to come, note that the HSA Runtime can decide, at its own discretion, to execute code that the programmer has 'advised' to run on a TCU on an LCU instead, because TCU occupancy is very high and the LCUs are idle. The 'TCU-recommended' code will surely take longer to run on an LCU, but it would take even longer if it had to wait in the queue.

Obviously, the opposite is also true. The HSA Runtime may decide to run code recommended for an LCU on a TCU instead, for the same reason as above: the LCU would take less time to run it, but once the wait in the LCU queue is counted, the idle TCU will finish earlier.
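The reasoning in the two cases above boils down to comparing estimated finish times. The following is only a toy model of that reasoning, not the HSA Runtime API or its actual policy: the function `choose_cu`, its parameters, and all the timings are hypothetical.

```python
def choose_cu(advised, run_time, queue_wait):
    """Pick the CU type with the earliest estimated finish time.

    advised:    "LCU" or "TCU", the programmer's hint (wins ties)
    run_time:   dict with the estimated execution time on each CU type
    queue_wait: dict with the estimated wait in each CU type's queue
    """
    finish = {cu: queue_wait[cu] + run_time[cu] for cu in ("LCU", "TCU")}
    other = "LCU" if advised == "TCU" else "TCU"
    return advised if finish[advised] <= finish[other] else other


# Vector code advised for a TCU, but the TCUs are busy and the LCUs idle.
print(choose_cu("TCU", {"TCU": 2, "LCU": 8}, {"TCU": 20, "LCU": 0}))  # LCU

# Same code when the TCU queue is empty: the hint is honoured.
print(choose_cu("TCU", {"TCU": 2, "LCU": 8}, {"TCU": 0, "LCU": 0}))   # TCU

# Scalar code advised for an LCU, but the LCUs are busy and the TCUs idle.
print(choose_cu("LCU", {"LCU": 3, "TCU": 5}, {"LCU": 10, "TCU": 0}))  # TCU
```

In the first call the code runs four times slower on the LCU (8 versus 2), yet it still finishes at time 8 instead of 22, so skipping the queue pays off.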

The more, the merrier:
In x86 systems, in case anyone still has doubts, the LCUs are the CPU cores, and the TCUs are the GCN CUs of the graphics card or of the GPU integrated into the APU. Of course, it is difficult to add more LCUs to an x86 system, mainly because once you have bought an HSA-compatible APU (such as Kaveri or Carrizo), increasing the number of LCUs means replacing the APU entirely.

This does not happen with TCUs, where all you need to do is buy an HSA-compatible GPU, and you get a battery of TCUs ready to integrate and work within the HSA architecture of your current machine.

After this interlude, we will continue with memory configuration in the next article.
...Always BiG!