
Posted on Sep 06, 2024

III Excerpt of article

The preproduction Intel Xeon Phi coprocessor chips can provide well over one teraflop of floating-point performance. Developers can reach this supercomputing level of number-crunching power via one of several routes:

Using pragmas to augment existing codes so they offload work from the host processor to the Intel Xeon Phi coprocessor(s)
Recompiling source code to run directly on the coprocessor as a separate many-core Linux SMP compute node
Accessing the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library)
Using each coprocessor as a node in an MPI cluster or, alternatively, as a device containing a cluster of MPI nodes.

From this list, experienced programmers will recognize that the Phi coprocessors support the full gamut of modern and legacy programming models. Most developers will quickly find that they can program the Phi in much the same manner that they program existing x86 systems. The challenge lies in expressing sufficient parallelism and vector capability to achieve high floating-point performance, as the Intel Xeon Phi coprocessors provide more than an order of magnitude increase in core count over the current generation of quad-core processors. Massive vector parallelism is the path to realizing that high performance. In a follow-up paper, we will discuss how programming the Phi compares with CUDA programming.

The Xeon Phi Hardware Model from a Software Perspective

The Intel Xeon Phi KNC processor is essentially a 60-core SMP chip where each core has a dedicated 512-bit-wide SSE (Streaming SIMD Extensions) vector unit. All the cores are connected via a 512-bit bidirectional ring interconnect (Figure 1). Currently, the Phi coprocessor is packaged as a separate PCIe device, external to the host processor. Each Phi contains 8 GB of RAM that provides all the memory and file-system storage that every user process, the Linux operating system, and ancillary daemon processes will use. The Phi can mount an external host file system, which should be used for all file-based activity to conserve device memory for user applications. Even though Linux on Intel Xeon Phi provides a conventional SMP virtual memory environment, the coprocessor cards do not support paging to an external device.
A preproduction card using a Knights Corner chip achieved a score of 189 GB/s on the STREAM triad benchmark with ECC (Error Correcting Code) enabled. It is expected that the production Intel cards shipping next month will deliver higher performance. The theoretical maximum bandwidth of the Intel Xeon Phi memory system is 352 GB/s (5.5 GTransfers/s × 16 channels × 4 B/transfer), but internal bandwidth limitations inside the KNC chips (specifically the ring interconnect) plus the overhead of ECC memory limit achievable performance to 200 GB/s or less.
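
The bandwidth arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the text; the function names are illustrative and do not come from the article.

```python
def peak_bandwidth_gb_s(giga_transfers_per_s, channels, bytes_per_transfer):
    """Theoretical peak memory bandwidth in GB/s."""
    return giga_transfers_per_s * channels * bytes_per_transfer


def efficiency(measured_gb_s, peak_gb_s):
    """Fraction of the theoretical peak that a measurement reaches."""
    return measured_gb_s / peak_gb_s


if __name__ == "__main__":
    # Figures quoted in the text: 5.5 GT/s, 16 channels, 4 bytes per transfer.
    peak = peak_bandwidth_gb_s(5.5, 16, 4)
    # 189 GB/s is the preproduction STREAM triad score quoted in the text.
    triad = efficiency(189.0, peak)
    print(f"peak = {peak:.0f} GB/s, triad efficiency = {triad:.1%}")
```

With the quoted numbers the peak works out to 352 GB/s, so the measured triad score sits at a little over half of the theoretical maximum.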
Figure 2: High-performance Xeon Phi applications exploit both parallelism and vector processing.

Each Intel Xeon Phi core is based on a modified Pentium processor design that supports hyperthreading and some new x86 instructions created for the wide vector unit. As illustrated in Figure 2, developers need to utilize both parallelism and vector processing to achieve high performance. Programmers are free to work with their preferred programming languages and parallelism models so long as the application can scale to match Phi capabilities.

The current PCIe packaging complicates the offload programming model as external data breaks an assumption made by the SMP execution model that any thread can access any data in a shared memory system without paying a significant performance penalty. As can be seen in Figure 3, the PCIe bandwidth is significantly lower than that of the on-board memory.

So achieving high offload computational performance with external coprocessors requires that developers:
Transfer the data across the PCIe bus to the coprocessor and keep it there
Give the coprocessor enough work to do
Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks and moving data back and forth to the host processor.
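
The three requirements above can be illustrated with a rough cost model that compares the time spent moving data over PCIe with the time spent computing on it. This is a simplified sketch, not a model from the article: the function names are hypothetical, and the PCIe bandwidth and sustained GFLOP/s values in the usage section are placeholder assumptions chosen only to show the shape of the trade-off.

```python
def offload_times_s(bytes_moved, pcie_gb_s, flops, device_gflops):
    """Return (transfer_time, compute_time) in seconds for one offload."""
    transfer = bytes_moved / (pcie_gb_s * 1e9)
    compute = flops / (device_gflops * 1e9)
    return transfer, compute


def breakeven_flops_per_byte(pcie_gb_s, device_gflops):
    """Flops per transferred byte at which compute time equals transfer time.

    Below this ratio the offload is dominated by the PCIe transfer.
    """
    return device_gflops / pcie_gb_s


if __name__ == "__main__":
    # ASSUMED values for illustration only (not taken from the article):
    assumed_pcie_gb_s = 6.0         # hypothetical sustained PCIe bandwidth
    assumed_device_gflops = 1000.0  # "about one teraflop", rounded

    ratio = breakeven_flops_per_byte(assumed_pcie_gb_s, assumed_device_gflops)
    print(f"break-even reuse: about {ratio:.0f} flops per byte transferred")

    # 1 GB moved once, with 10 flops per byte: transfer time dominates.
    transfer, compute = offload_times_s(
        1e9, assumed_pcie_gb_s, 10 * 1e9, assumed_device_gflops
    )
    print(f"transfer = {transfer:.3f} s, compute = {compute:.3f} s")
```

Under these assumed numbers, data that is touched only a few times per byte spends far longer on the bus than in the vector units, which is why keeping data resident and reusing it matters.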

Be aware that the preproduction Intel Xeon Phi cards have only one DMA engine, so any communications (network file system, MPI, sockets, ssh, and so forth) between the coprocessor and host can interfere with offload data transfers and thereby worsen application performance.
The key to Intel Xeon Phi floating-point performance is the efficient use of the per-core vector unit. To access the vector unit, the compiler must be able to recognize SSE-compatible constructs so it can generate the special Intel Xeon Phi vector instructions. Developers with legacy code can test whether their applications will benefit from Xeon Phi floating-point capability by simply telling the compiler to use the SSE instructions on the current x86 processor (through the GNU -msse or another compiler switch). Applications that run faster with SSE (or, conversely, slow down when the use of SSE instructions is disabled) will likely benefit from the Intel Xeon Phi wide vector unit. Applications that don't benefit from the SSE instruction set will be limited to the performance of the individual Pentium-based cores. Although this means Intel Xeon Phi will probably not be a performance star for non-vector applications, these coprocessors can still be used as support devices that provide many-core parallelism and high memory bandwidth.
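
One way to see why non-vector code gains so little is to apply Amdahl's law to the vector unit. The article does not use this model; it is borrowed here as an idealized sketch that ignores memory bandwidth, instruction latency, and threading effects. The function names are illustrative.

```python
def vector_lanes(vector_bits, element_bits):
    """Number of elements processed by one vector instruction."""
    return vector_bits // element_bits


def vector_speedup(vector_fraction, lanes):
    """Idealized Amdahl speedup when a fraction of the work is vectorized."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


if __name__ == "__main__":
    # A 512-bit vector unit holds 16 single-precision or 8 double-precision values.
    sp_lanes = vector_lanes(512, 32)
    dp_lanes = vector_lanes(512, 64)
    print(f"lanes: {sp_lanes} single precision, {dp_lanes} double precision")

    for fraction in (0.0, 0.5, 0.9, 1.0):
        gain = vector_speedup(fraction, sp_lanes)
        print(f"{fraction:.0%} vectorized -> {gain:.2f}x over scalar")
```

In this model, code with no vectorizable work stays at 1x, which matches the statement that such applications are limited to the performance of the individual cores.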

Figure 3: PCIe and memory bandwidths.

While the aggregate Intel Xeon Phi computational performance is high, each core is slow and has limited floating-point performance when compared with a modern Sandy Bridge processor. High performance can be achieved only when a large number of parallel threads (at least 120) are utilized and they issue instructions to the wide vector units quickly enough to keep the vector pipeline full. The current generation of coprocessor cores supports up to four concurrent threads of execution via hyperthreading. Most developers will rely on the compiler to recognize when the special Intel Xeon Phi wide vector instructions can be issued to the per-core vector units. (More adventurous programmers can use compiler intrinsics or assembly language to access the vector units.) This means that existing libraries and applications must be recompiled to run well on the Phi. In general, the best floating-point performance will be realized when each core is running two threads that actively issue instructions to the vector unit. For a 61-core coprocessor, this means that the programmer must be able to effectively utilize 120 threads inside the application (two threads on each of the 60 cores that remain after one core is reserved for the operating system).
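
The thread-count arithmetic above is simple enough to write down directly. The helper below is a sketch with an illustrative name; the core counts and threads-per-core values are the ones quoted in the text.

```python
def application_threads(total_cores, reserved_cores, threads_per_core):
    """Threads an application should supply to cover the usable cores."""
    usable_cores = total_cores - reserved_cores
    if usable_cores < 0 or threads_per_core < 1:
        raise ValueError("invalid core or thread configuration")
    return usable_cores * threads_per_core


if __name__ == "__main__":
    # 61 cores, one reserved for the operating system, two threads per core.
    print(application_threads(61, 1, 2))  # 120
    # Upper bound with all four hardware threads per core in use.
    print(application_threads(61, 1, 4))  # 240
```

Writing the count as a function of the core count, instead of hard-coding 120, makes it easier to adapt an application to parts with more cores.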

Empirically, it appears that the internal Pentium cores are not fast enough to keep their associated per-core vector unit busy when running only one thread. Two threads per core appears to be both the practical minimum thread count and the best-performance sweet spot. This is only a general rule of thumb, as much depends on the type and amount of work performed by each thread before it issues a vector operation. (Note that future Intel Xeon Phi products will likely support greater parallelism, so the ability to support higher application thread counts is highly encouraged.)
Questions:
In the Flynn classification, where would you place the new Xeon Phi? (2 pts)
What do you think will be the main performance limitation of this hardware? (1 pt)
If you compare the Xeon Phi approach to what you have seen with GPUs, what will be similar and what will be different? (2 pts)
Prepare an abstract for this short paper (approx. 10 lines). (5 pts)