Ciao Mondo 3!



In this article we present AMD's new server architecture based on the Bulldozer core. We describe the 4200 series CPUs, for 1- and 2-socket systems, and the 6200 series, for 2- and 4-socket systems with a maximum of 64 cores.


 

The growing demand for web and cloud services has increased the need for server platforms and systems. The most efficient and cheapest solution currently in use is virtualization: a single physical machine can simulate many machines, even hundreds, provided it has enough RAM and cores, typically at least one core and 1GB of RAM per virtual machine. It has therefore become necessary to fit the greatest possible number of cores and DIMM slots into a single system. A server farm today is composed of towers, each with a few dozen rack slots, and each rack hosts a complete system. It goes without saying that the more cores and DIMMs a rack can accommodate while staying within acceptable power consumption and heat limits, the greater the number of virtual servers it can host.
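The rule of thumb above (at least one core and 1GB of RAM per virtual machine) can be turned into a rough capacity estimate. The following is only a minimal sketch: the function name and the example RAM sizes are illustrative assumptions, not figures from AMD.

```python
def max_virtual_machines(cores, ram_gb, cores_per_vm=1, ram_gb_per_vm=1):
    """Estimate how many VMs fit on one host under per-VM minimums.

    The defaults follow the rule of thumb in the text: one core and
    1GB of RAM per virtual machine.
    """
    if cores_per_vm <= 0 or ram_gb_per_vm <= 0:
        raise ValueError("per-VM requirements must be positive")
    by_cores = cores // cores_per_vm
    by_ram = ram_gb // ram_gb_per_vm
    # The scarcer of the two resources sets the limit.
    return min(by_cores, by_ram)


# Hypothetical hosts: the RAM sizes below are illustrative only.
print(max_virtual_machines(cores=64, ram_gb=256))  # limited by cores
print(max_virtual_machines(cores=64, ram_gb=48))   # limited by RAM
```

Whichever resource runs out first caps the VM count, which is why both the core count and the number of DIMM slots matter.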

AMD_Interlagos_1

The requirements for a server platform are therefore low cost for the cloud, scalability for virtualization (the ability to host many virtual machines without sacrificing performance) and processing power for HPC (high performance computing).

AMD_Interlagos_2


The CPUs

With the 4200 and 6200 series AMD claims to have improved all three aspects, compared both with the previous generation and with the competition.

The following tests were built with the best compiler and linked with the best library for each platform. On the AMD side this means they were compiled to use AMD's new exclusive FMA4 and XOP instructions and linked with the AMD ACML 5.0 library, optimized for the Bulldozer architecture. This explains, in part, the excellent results. The good results in the memory tests, however, are entirely due to the hardware: the efficiency of the RAM controller and of the HT links between CPUs with HT Assist.

AMD_Interlagos_3

HPC performance: two Opteron 6276 CPUs delivered 84% more performance in the LINPACK benchmark than two Xeon 5670 CPUs.

Memory bandwidth: two Opteron 6276 CPUs delivered 73% more memory bandwidth in the STREAM 2P test than two Xeon 5670 CPUs.

Efficiency and economy for the cloud and virtualization: less than half the power per core of the best Xeon (4.375W against 10W), about 2/3 less space for the same number of cores (a 2P rack system can accommodate 32 AMD cores against 12 Intel cores) and, finally, a retail price 1/3 to 2/3 lower for top of the range AMD complete server solutions than for Intel's.

AMD has chosen to make the comparison with the Xeon 5670 (the server counterpart of the six-core Gulftown) because it accounts for the vast majority of Intel's server sales.

AMD_Interlagos_4

The Bulldozer CPU architecture is based on modules; a detailed description can be found in our previous article.

The 4200 series solutions are similar to the desktop configurations, with 4 to 8 cores arranged in modules. Each pair of cores shares a 64KB L1 instruction cache and 2MB of L2 cache, each core has a 16KB L1 data cache, and the whole CPU has 8MB of L3 cache. Opteron 4200-based systems have up to two sockets, each with dual-channel DDR3 at 1600MHz and three x16 HT 3.0 links at up to 6.4GT/s. Base frequencies reach 3.3GHz, rising to 3.7GHz at maximum Turbo Core, and TDPs range from 35 to 95W, consistent with the previous 4100 series, so the CPUs can be installed in existing systems after a BIOS update.


AMD_Interlagos_5

The 6200 series doubles the features of the 4200 series, being based on two dice in an MCM configuration, with 4 to 16 cores arranged in modules. Each pair of cores shares a 64KB L1 instruction cache and 2MB of L2 cache, each core has a 16KB L1 data cache, and each socket has 16MB of L3 cache in total. Systems have two to four sockets, each with four DDR3 1600MHz channels and four x16 HT 3.0 links at up to 6.4GT/s. Base frequencies reach 3.3GHz, rising to 3.6GHz at maximum Turbo Core, and TDPs range from 85 to 140W, consistent with the previous 6100 series, so these CPUs too can be installed in existing systems after a BIOS update.
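To see how the per-module and per-core figures add up to the per-socket and per-system totals, here is a small sketch. The class and constant names are our own, and the split into four modules and 8MB of L3 per die is inferred from the figures in the text (8 cores and 8MB of L3 on one die, 16 cores and 16MB on two), so treat it as an assumption.

```python
from dataclasses import dataclass

# Per-module and per-core figures quoted in the article.
CORES_PER_MODULE = 2
L1I_KB_PER_MODULE = 64
L1D_KB_PER_CORE = 16
L2_MB_PER_MODULE = 2
L3_MB_PER_DIE = 8  # inferred: 8MB on one die, 16MB on two


@dataclass(frozen=True)
class SocketConfig:
    """Hypothetical description of one CPU package."""
    dies: int
    modules_per_die: int

    @property
    def modules(self):
        return self.dies * self.modules_per_die

    @property
    def cores(self):
        return self.modules * CORES_PER_MODULE

    @property
    def l2_mb(self):
        return self.modules * L2_MB_PER_MODULE

    @property
    def l3_mb(self):
        return self.dies * L3_MB_PER_DIE


def system_cores(socket, sockets):
    """Total cores in a system built from identical sockets."""
    return socket.cores * sockets


top_4200 = SocketConfig(dies=1, modules_per_die=4)  # 8-core part
top_6200 = SocketConfig(dies=2, modules_per_die=4)  # 16-core MCM part

print(top_4200.cores, top_4200.l2_mb, top_4200.l3_mb)
print(top_6200.cores, top_6200.l2_mb, top_6200.l3_mb)
print(system_cores(top_6200, sockets=4))
```

With four 16-core sockets the sketch arrives at the 64-core maximum mentioned for the 6200 series.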


CPU features

AMD's server solutions are flexible, offering:

  • Turbo Core;
  • Core Select, to choose via BIOS the number of visible cores (for example to reduce software license costs);
  • various operating modes for the Flex FP unit;
  • cache partitioning;
  • new HPC instructions: FMA4 and XOP, unique to AMD, plus AVX and cryptographic acceleration, shared with Intel;
  • TDP Power Cap, to limit the power dissipated by the CPU with 1W granularity;
  • the C6 state, for lower idle consumption;
  • 6 TDP classes, to choose the CPU that best suits your needs;
  • support for low and ultra low voltage DIMMs;
  • support for 4 to 64 cores per rack with the same chipset (and drivers) for all systems from 1 to 4 sockets.

AMD_Interlagos_6

Turbo Core technology works in two ways: with all cores active it allows an increase of 300 to 500MHz depending on the model, and with at most half of the cores active it allows an increase of up to 1.2GHz (for the top 16-core model).

AMD_Interlagos_7

As we saw in the Bulldozer architecture presentation, each module has a shared FPU, capable of executing 128- and 256-bit instructions from two threads.

AMD_Interlagos_8

The supported FPU instruction sets are:

  • x87, MMX, SSE1, SSE2, SSE3, the legacy FPU and integer instruction sets, both scalar and SIMD, supported by both AMD and Intel;
  • SSSE3, SSE4.1, SSE4.2, the FPU and integer SIMD sets, supported by both AMD and Intel, which speed up video and biometric algorithms and intensive text processing;
  • AESNI, PCLMULQDQ, the instruction sets, common to both AMD and Intel, that accelerate cryptographic algorithms, AES in particular;
  • AVX, the new instruction set, shared by AMD and Intel, with a new extensible encoding that allows both 128- and 256-bit FPU and integer SIMD instructions, and that serves to accelerate compute-intensive applications such as HPC;
  • FMA4, an instruction set unique to AMD, which performs four-operand multiply-accumulate operations in a single instruction, greatly speeding up algorithms that rely on them, such as matrix multiplication and many scientific calculations;
  • XOP, an instruction set unique to AMD, which contains instructions for accelerating multimedia applications, such as vector sums, fraction extraction and conversion of the 16-bit FP numbers used in video cards.

 

These instruction sets were designed to increase the computational density of instructions, to reduce the need to copy registers (FMA4 only) and to allow automatic vectorization by compilers.
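Software that wants to use the optional instruction sets listed above first has to check that the CPU reports them. As a minimal sketch, assuming a Linux system where `/proc/cpuinfo` exposes a `flags` line, the following parses that text; the function names and the sample excerpt are hypothetical.

```python
# Article name -> flag name as reported by Linux in /proc/cpuinfo.
INSTRUCTION_SETS = {
    "SSE4.2": "sse4_2",
    "AESNI": "aes",
    "PCLMULQDQ": "pclmulqdq",
    "AVX": "avx",
    "FMA4": "fma4",
    "XOP": "xop",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_sets(cpuinfo_text):
    """Map each instruction set name to True/False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in INSTRUCTION_SETS.items()}


# Hypothetical excerpt; on Linux pass open("/proc/cpuinfo").read() instead.
sample = "model name\t: example CPU\nflags\t\t: fpu sse sse2 sse4_2 aes avx fma4 xop\n"
for name, present in supported_sets(sample).items():
    print(f"{name:10s} {'yes' if present else 'no'}")
```

A program can use such a check to choose between an FMA4/XOP code path and a generic SSE or AVX one at start-up.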


Energy saving

Interlagos CPUs support low voltage (1.35V) and ultra low voltage (1.25V) DDR3 for a lower overall system consumption.

AMD_Interlagos_9

They also support the C6 power-saving core state, which switches off the clock and power to idle modules. Compared to the previous-generation Opteron processors this reduces idle power dissipation by 46%: the 6174 (12 cores at 2.2GHz) consumes 11.7W in the active idle C1E state, while the 6276 (16 cores at 2.3GHz) consumes 6.4W in the same state with the new C6 power gating enabled in the BIOS.

AMD_Interlagos_10

TDP Power Cap technology lets you set the target power of the APM unit with a granularity of 1 watt.

As seen in the technical article on the Bulldozer architecture, the APM (Advanced Power Management) module calculates at each instant which units are active and from this derives, with an accuracy of 2%, an upper limit on the power consumed by the CPU at that instant. It can then decide whether and for how long the CPU can stay in a higher Turbo Core state, so as to consume an amount of power as close as possible to, but below, the specified limit. In desktop CPUs the limit is usually set by the TDP class (95W or 125W). With Interlagos CPUs you can instead force a TDP lower than that of the class (from 35W to 140W) with a granularity of 1 watt, to reduce consumption and heat dissipation and, for example, fit more racks into a data cabinet with limited cooling and power. Given how the APM works and the characteristics of the CPU and the software, a given load may not even lose performance when the power is capped at a lower value, for example if it does not already fully exploit the CPU.

Finally, all CPU units use advanced clock gating and voltage gating to keep active only the parts of the CPU needed at that moment.

This leads to a consumption per core up to 56% lower than Intel's (8-core 35W AMD vs. 6-core 60W Intel). This is possible thanks to the very low leakage of the GlobalFoundries SOI process, so that many transistors powered at low voltage and running at a low clock consume very little.
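The interplay between a power cap and the workload can be illustrated with a toy model. This is not AMD's APM algorithm: the operating points, the linear scaling of power with activity and the function name are all assumptions made for illustration only.

```python
# Hypothetical operating points: (frequency in MHz, power in W at full activity).
OPERATING_POINTS = [
    (1400, 70.0),
    (1800, 90.0),
    (2300, 115.0),
    (2600, 130.0),
    (3200, 160.0),
]


def pick_frequency(operating_points, activity, power_cap_w):
    """Pick the highest frequency whose estimated power fits under the cap.

    activity is the fraction (0..1) of the CPU the load keeps busy; the
    estimate simply scales full-activity power by it (a crude assumption).
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must be between 0 and 1")
    allowed = [freq for freq, power in operating_points
               if power * activity <= power_cap_w]
    if not allowed:
        # Nothing fits: fall back to the lowest available frequency.
        return min(freq for freq, _ in operating_points)
    return max(allowed)


# A fully loaded CPU slows down when the cap drops from 115W to 100W...
print(pick_frequency(OPERATING_POINTS, activity=1.0, power_cap_w=115))
print(pick_frequency(OPERATING_POINTS, activity=1.0, power_cap_w=100))
# ...while a lighter load keeps the same frequency under both caps.
print(pick_frequency(OPERATING_POINTS, activity=0.6, power_cap_w=115))
print(pick_frequency(OPERATING_POINTS, activity=0.6, power_cap_w=100))
```

In this model the lighter load runs at the same frequency under both caps, which is the situation described above where capping the power costs no performance.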
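The 56% figure follows directly from the two TDP and core counts quoted above. A quick check, with helper names of our own choosing:

```python
def watts_per_core(tdp_w, cores):
    """Rated power divided evenly across the cores."""
    return tdp_w / cores


def relative_saving(ours, theirs):
    """Fraction by which 'ours' is lower than 'theirs'."""
    return 1 - ours / theirs


amd = watts_per_core(35, 8)    # 8-core 35W AMD part
intel = watts_per_core(60, 6)  # 6-core 60W Intel part
saving = relative_saving(amd, intel)

print(f"{amd:.3f} W/core vs {intel:.1f} W/core: {saving:.0%} lower")
```

The same division gives the 4.375W and 10W per-core values used elsewhere in the article.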


Comparison with INTEL

In the highest-volume market segment, the new generation AMD Opteron achieves an 89% performance increase in the SPEC tests at the same CPU cost (Opteron 6276 against Xeon E5640).

AMD_Interlagos_11

At the same system cost, it instead offers 25% more CPU power, more memory and better hard drive specs.

AMD_Interlagos_12


Virtualization requires the greatest possible number of cores, addressable memory and L3 cache, together with the lowest consumption. The AMD solutions provide a cost per VM up to 77% lower and allow 2.6 times as many VMs per rack.

AMD_Interlagos_13

The cloud requires maximum density, made possible by low power consumption: 5.3W/core for the 6200 series and 4.375W/core for the 4200 series, against 10W/core for the best Intel CPU. In addition, AMD's top of the range (6282 SE) scores 25% higher in web-based benchmarks than its Intel competitor (Xeon 5690).

AMD_Interlagos_14

Finally, HPC requires scalable, high-performance computing, high memory bandwidth, many cores and SIMD extensions for numerical computation. AMD gets 73% more memory bandwidth than Intel, has more cores per rack, more FLOPS per rack and up to 80% lower cost. To match the raw power of one rack of 16-core AMD CPUs, you need two full racks of Intel CPUs.

AMD_Interlagos_16

 


Software Ecosystem

To take advantage of the Bulldozer architecture, software must take its peculiarities into account:

  • Optimize the software for the module architecture: the operating system or hypervisor must recognize Bulldozer and act accordingly;
  • Exploit the new Bulldozer instructions, in particular the SIMD instructions: SSEn, AVX, cryptographic acceleration and the exclusive FMA4 and XOP;
  • Exploit the performance counters to monitor and improve the software;
  • Exploit the C6 state to consume less and/or enable a higher Turbo Core state (e.g. core parking);
  • Exploit the new virtualization acceleration instructions to achieve performance almost identical to a native machine.

AMD_Interlagos_17

Among the various instructions supported by AMD Bulldozer CPUs, we describe the FMA4 and XOP sets, which are exclusive to AMD.

FMA4 implements the fused multiply-accumulate, d = a + b × c, and allows great flexibility in programming because all 4 registers can be chosen independently. The instruction is executed in a pipelined FPU, and a Bulldozer module can perform one 256-bit or two 128-bit floating point FMAs and, simultaneously, one 256-bit or two 128-bit integer FMAs in a single clock cycle.

This class of instructions accelerates applications that require this type of calculation. IBM, SPARC and Itanium CPUs have instructions of this type. Intel's x86 CPUs will implement them in 2013, but in the more limited FMA3 version, in which the result overwrites one of the source registers. This is due to the internal microarchitecture of the Intel CPUs, which cannot handle more than 3 registers per instruction.
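Besides saving an instruction, a fused multiply-accumulate rounds only once, after the addition, instead of once after the multiplication and again after the addition. The sketch below emulates this in software with exact rational arithmetic; it does not use the hardware instruction, and the function names are our own.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Return a + b * c computed exactly and rounded once to a double."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)  # single, correctly rounded conversion


def separate_multiply_add(a, b, c):
    """Return a + b * c with two roundings: one per operation."""
    return a + b * c


a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30  # b * c is exactly 1 - 2**-60

print(separate_multiply_add(a, b, c))  # product rounds to 1.0 first
print(fused_multiply_add(a, b, c))     # keeps the tiny difference
```

In this example the separate version loses the result entirely, while the fused version returns the exact value, which is one reason fused operations matter in scientific code.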
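The practical difference between the four-operand and three-operand forms can be shown with a toy register file. This is a simplified model with hypothetical function and register names; in particular, which source a three-operand form overwrites is an assumption here.

```python
def fma_four_operand(regs, dst, a, b, c):
    """dst = a + b * c; all three source registers are left untouched."""
    regs[dst] = regs[a] + regs[b] * regs[c]


def fma_three_operand(regs, a, b, c):
    """a = a + b * c; the result overwrites the first source register."""
    regs[a] = regs[a] + regs[b] * regs[c]


# Four operands: one instruction, sources preserved.
regs = {"r0": 0.0, "r1": 1.0, "r2": 2.0, "r3": 3.0}
fma_four_operand(regs, "r0", "r1", "r2", "r3")
print(regs)

# Three operands: to keep r1 we must first copy it, costing an extra move.
regs = {"r0": 0.0, "r1": 1.0, "r2": 2.0, "r3": 3.0}
regs["r0"] = regs["r1"]            # extra register copy
fma_three_operand(regs, "r0", "r2", "r3")
print(regs)
```

Both sequences leave the same values in the registers, but the three-operand one needs the additional copy whenever the overwritten source is still required afterwards.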

AMD_Interlagos_18

XOP includes 128- and 256-bit instructions with 3 or 4 operands for horizontal addition and subtraction, comparison, shift, rotation, permutation, integer multiply-accumulate, fraction extraction, and conversion to and from the 16-bit floating point format used in video cards.

To fully exploit the power of the Bulldozer CPU, software must be compiled with the SSE, 128-bit AVX and FMA4 options and linked to the ACML library version 5.x.

If the software uses only the instructions in common with Intel and was not built with the Intel compiler (which checks that the CPU is an Intel one before enabling support for the new instructions), it does not need to be recompiled. To make use of the FMA4 or XOP instructions, you must recompile the software or link it with the ACML 5.x libraries.

Support for the new virtualization instructions is being implemented in all recent software and kernels.

Bulldozer support is present in all the latest compilers except Intel's, in which XOP and FMA4 are not supported and the use of AVX has to be forced by hand with the -mAVX switch.

Using libraries:

  • the ACML 4.0 library is compatible with AMD Bulldozer, while version 5.0 is optimized for it. It contains basic linear algebra routines (BLAS), advanced algebra routines (LAPACK), and routines for FFTs and random numbers. Version 5.1, which contains the same routines extended to double precision and complex numbers, is in development;
  • the AMD libm library, version 3.0, contains the standard math functions optimized for Bulldozer;
  • finally, the Intel IPP library, which is limited to its SSE3 version on AMD CPUs.

 


Conclusions

The AMD Bulldozer CPUs in their server incarnation, called Interlagos, are a drop-in replacement for the previous generation of Opteron CPUs after a BIOS update.

With the appropriate compilers and libraries, software can make the most of their features, including the new instructions unique to the architecture.

AMD_Interlagos_19

This task is much easier in the professional field, where code is generally optimized for the architecture on which the software will run.

In this case there is no doubt: as shown in the AMD Interlagos presentation, the Bulldozer architecture provides higher performance, a smaller footprint, higher core and memory density, lower power consumption and greater flexibility in relation to price.

AMD_Interlagos_20

The new AMD Interlagos platforms launch officially on 14 November. AMD has announced not only that these CPUs will be available immediately, but also that many units were delivered some time ago to important AMD partners. Bulldozer did not shine on the desktop, where it was greeted with some disappointment; in servers, however, it is received with very different expectations and, at least for the moment, holds a distinct advantage over the competition.

Marco Comerci.
