ARM's battle for the datacentre: The contenders

– Nick Heath, chief reporter for TechRepublic UK, says:

How will the first generation of enterprise-ready ARM servers stack up against traditional datacentre boxes?

As the first enterprise-ready, ARM-based servers get nearer to release, more details are emerging on what these energy-sipping systems will be capable of.

The upcoming 64-bit machines are being designed to tackle a far broader range of tasks than the few 32-bit ARM-based servers tested out by a handful of companies this year.

Rather than just web serving, these systems are also being built to power data analytics on Hadoop clusters, fetch and put data in NoSQL data stores, stream media and handle high-performance computing, sharing processing duties with GPUs, FPGAs or ASICs.

Jobs like these can be split into computationally light workloads and processed in parallel by clusters of thousands of wimpy-core processors. These dense clusters of low-power servers can handle parallelisable tasks more efficiently than a smaller number of powerful chips, delivering better performance per watt and per square foot of datacentre space, both important measures for driving down the cost of running a large server estate.
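As a rough illustration of that split-and-combine pattern, the sketch below counts words by cutting the input into small chunks, handing each chunk to a separate worker and summing the partial results. It is only a toy: the function names are hypothetical, and threads on a single machine stand in for the server nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """One computationally light unit of work: count the words in a few lines."""
    return sum(len(line.split()) for line in chunk)


def split_into_chunks(items, chunk_size):
    """Cut the full job into pieces small enough for a low-power worker."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def parallel_word_count(lines, chunk_size=2, workers=4):
    """Process the chunks in parallel, then combine the partial counts."""
    chunks = split_into_chunks(lines, chunk_size)
    # Threads stand in for cluster nodes here (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    sample = ["a b c", "d e", "f", "g h i j"]
    print(parallel_word_count(sample))
```

The key property is that no chunk depends on any other, which is what lets many modest cores share the work.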

Hence the interest in taking small, energy-thrifty ARM-based chipsets, today more commonly found in mobile phones and tablets, and using them in tightly packed server clusters.

A fair proportion of the software needed to handle these web serving, data analytics, streaming media and other jobs is on track to be ready for production use on ARM-based servers. But what about the hardware?

Powering these servers will be chipsets from a range of companies – but the major players in the nascent ARM-based server space will likely be Applied Micro, with its X-Gene boards, and AMD, which is branching out beyond x86 with its Opteron A1100 processor.

These forthcoming chips are based on the ARM v8 architecture, which introduces support for features considered critical by business. Not only is v8 the first ARM architecture to support 64-bit cores, it also brings additional enterprise-class features, such as error-correcting code (ECC) memory.

The companies behind these server chipsets were at the Hot Chips conference in Cupertino this week to detail the capabilities of their chips and the servers they will power.

Applied Micro X-Gene

When is it out?

Three generations of X-Gene systems on a chip are planned. The first to hit the market in servers will be the X-Gene 1 processor, which is expected to be available in production systems this autumn. The X-Gene processor is already being tested in HP Moonshot servers and has been demoed in HPC and enterprise-targeted systems from Eurotech, E4 and Mitac.

Its successor, the X-Gene 2, is available for sampling now and X-Gene 3 is due to be released for sampling in 2015.

The specs

The X-Gene 1 has eight cores running at 2.4 GHz. It is made to a 40nm process – the smaller the process, the more transistors can be crammed onto the chip’s surface, allowing for better processing power per watt. The chip’s superscalar architecture allows it to handle more than one instruction per processor cycle, with a four-instruction-wide processing pipeline that is capable of out-of-order execution, an optimisation that reduces delays in handling instructions. Applied Micro says the chip can handle “more than 100 instructions in flight”.

Each pair of processor cores shares L1 instruction and data cache, as well as L2 cache. Connected to the cores via a network link that keeps data coherent between caches are 8MB of L3 cache and two dual-channel DDR3 memory controllers. The chipset can support up to 128GB of DDR3 memory capable of 1,600 MT/s.
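To put the memory figures in context, a theoretical peak bandwidth can be worked out from the channel count and transfer rate. The sketch below assumes the standard 64-bit (8-byte) data width of a DDR3 channel, which the article does not state, and the function name is hypothetical. It gives an upper bound on paper, not a measured result.

```python
def peak_bandwidth_gb_s(controllers, channels_per_controller,
                        transfers_mt_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    bus_width_bits=64 is an assumption: the usual data width of a DDR3 channel.
    """
    bytes_per_transfer = bus_width_bits // 8
    channels = controllers * channels_per_controller
    megabytes_per_second = channels * transfers_mt_s * bytes_per_transfer
    return megabytes_per_second / 1000


if __name__ == "__main__":
    # Two dual-channel controllers at 1,600 MT/s, as described above.
    print(peak_bandwidth_gb_s(2, 2, 1600))
```

Under those assumptions, four channels at 1,600 MT/s work out to 51.2 GB/s per chip.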

The chipset integrates networking hardware, removing the need for discrete components such as an I/O controller hub, NIC and baseboard management controller – reducing additional cost and power draw.

For I/O, the chipset supports four 10 gigabit Ethernet connections and six PCI-E 3.0 slots, as well as multiple SATA 3 ports.

Future releases of the X-Gene will bring further performance improvements and allow servers based on the board to tackle workloads where low application latency is necessary. The X-Gene 2 will add RDMA over Converged Ethernet, or RoCE. RoCE is an important feature in distributed systems as it reduces latency between servers in the cluster. This feature allows one server node in an X-Gene cluster to transfer data directly to and from the memory of another node over 10 Gbps Ethernet, reducing the work carried out by each node’s CPU and improving data transfer speed. Using RoCE, the X-Gene 2 has shown itself capable of reducing application latency to about 5 microseconds, up to ten times faster than the X-Gene 1, according to Applied Micro.

X-Gene 2 will be made to a 28nm process, have up to 16 cores clocked at a maximum of 2.8 GHz and support four channels of memory. Architectural changes will be made to the processor core to boost performance.

Performance

What is important for the types of workloads suited to being handled in parallel on a cluster of low-energy servers – the likes of web front ends, search engines, NoSQL data stores, data analytics work like Hadoop, and media serving – are factors beyond clock speed. Applied Micro believes the X-Gene delivers on core metrics for these workloads, such as instruction issue width, the number of tiers in the processor cache hierarchy, the size of the cache per CPU and the memory bandwidth of the processor.

The graph shows how the X-Gene 2 compares to competitors on these measures – from left to right are the ThunderX ARM SoC from Cavium, Intel’s microserver-targeted eight-core C2000 Atom processor and, in green, the X-Gene 2. On the far right is the Intel Xeon E5-2600 v2 processor, which, while higher performing, costs more.

In the SPEC2006_rate processor benchmarks the X-Gene 2 delivers 55 percent better performance per watt than the X-Gene 1 and a 25 percent performance boost in ApacheBench web serving score.

Compared to Intel servers the X-Gene will be competing against, Applied Micro claims the first generation chipset can deliver the performance of an Ivy Bridge or Haswell Xeon, while the X-Gene 2 will offer greater performance at lower power and be suited to latency-sensitive clustered applications.

Applied Micro says a rack of X-Gene 2 systems will burn about 30 kilowatts and pack 6,480 threads running at 2.8 GHz. The cluster will provide 50 TB of memory and 48 TBps of memory bandwidth. It will handle 750 million transactions per second on the memcached test, with 95 percent of the transactions coming in at under 40 milliseconds. A cluster of 80 two-socket machines based on Intel’s Xeon E5-2630 v2 processors, with six cores and twelve threads per socket, delivers 1,920 threads and around 400 million transactions per second on the same memcached test in the same power envelope of around 30 kW. These benchmarks are provided by Applied Micro, however, so need to be treated with the appropriate level of skepticism until verified.
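The arithmetic behind that comparison is easy to check. The sketch below takes Applied Micro's unverified figures at face value and works out the thread count of the Xeon cluster and each rack's memcached throughput per kilowatt. The function and variable names are made up for illustration.

```python
def threads_in_cluster(machines, sockets_per_machine, threads_per_socket):
    """Total hardware threads across a cluster of identical machines."""
    return machines * sockets_per_machine * threads_per_socket


def per_kilowatt(value, kilowatts):
    """Normalise a rack-level figure by the rack's power draw."""
    return value / kilowatts


RACK_POWER_KW = 30  # both racks are quoted at roughly 30 kW

# Vendor-supplied figures quoted in the article, not independent measurements.
xgene_threads = 6480
xeon_threads = threads_in_cluster(80, 2, 12)

xgene_tps_per_kw = per_kilowatt(750e6, RACK_POWER_KW)
xeon_tps_per_kw = per_kilowatt(400e6, RACK_POWER_KW)

throughput_ratio = xgene_tps_per_kw / xeon_tps_per_kw
thread_ratio = xgene_threads / xeon_threads

if __name__ == "__main__":
    print(f"Xeon threads: {xeon_threads}")
    print(f"X-Gene 2 transactions/s per kW: {xgene_tps_per_kw:,.0f}")
    print(f"Xeon transactions/s per kW: {xeon_tps_per_kw:,.0f}")
    print(f"Throughput ratio: {throughput_ratio:.3f}")
    print(f"Thread ratio: {thread_ratio:.3f}")
```

On these numbers the X-Gene 2 rack offers a little under twice the memcached throughput per kilowatt, from more than three times as many threads.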

Intel said Applied Micro’s performance estimates are impossible to verify as “no-one has ever seen X-Gene 1-based system benchmarked using industry standard applications” and indicated the Xeon setup used in the comparison could be weighted in the X-Gene’s favour.

Intel has its own range of energy-sipping, less powerful SoCs aimed at the server market, the Avoton series in its Intel Atom family, and for its part Intel claims these are more power efficient.

“X-Gene 1 is based on 40nm process and has 8 cores and roughly 35 – 40W TDP [which reflects the maximum power consumption of the machine]. For comparison, Atom C2000 (Avoton) has 8 cores as well with 20W TDP,” said an Intel spokeswoman.

“X-Gene is expected to have 35 -40 W TDP for 8 cores, node power 59W, vs 8-cores, 20W Avoton and 28-35W node power. Best case scenario for them – same performance for twice as power.”
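Intel's "twice the power" claim can be checked against the figures it quotes. The sketch below computes the lowest and highest power ratios implied by the stated ranges. The helper name is hypothetical, and the inputs are Intel's own estimates rather than measured values.

```python
def ratio_range(numerator_range, denominator_range):
    """Smallest and largest ratio implied by two (low, high) ranges."""
    low_num, high_num = numerator_range
    low_den, high_den = denominator_range
    return low_num / high_den, high_num / low_den


# Figures as quoted by Intel, in watts.
XGENE_TDP_W = (35, 40)
AVOTON_TDP_W = (20, 20)
XGENE_NODE_W = (59, 59)
AVOTON_NODE_W = (28, 35)

tdp_ratio = ratio_range(XGENE_TDP_W, AVOTON_TDP_W)
node_ratio = ratio_range(XGENE_NODE_W, AVOTON_NODE_W)

if __name__ == "__main__":
    print(f"TDP ratio: {tdp_ratio[0]:.2f}x to {tdp_ratio[1]:.2f}x")
    print(f"Node power ratio: {node_ratio[0]:.2f}x to {node_ratio[1]:.2f}x")
```

By Intel's figures the X-Gene 1 draws between roughly 1.7 and 2.1 times the power of Avoton, depending on which ends of the ranges are compared.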

By the time the X-Gene 2 hits production servers, Intel is also likely to have refreshed its server chip line-up with its Broadwell-EP and Broadwell-EX Xeon chips – further improving its performance per watt.

X-Gene 3 will increase the core count to a maximum of 64, increase the clock speed to 3GHz and introduce 2nd generation RoCE. It will move the X-Gene to a 16nm manufacturing process, with FinFET transistors.

What can you use them for?

Applied Micro says the X-Gene family can be used for “pretty much anything that runs in the datacentre today”.

That includes hosting large-scale web sites and services; web search services such as data serving and harvesting; NoSQL data storage and retrieval; data analytics services such as information classification and filtering and extraction; and hosting and streaming of media.

The X-Gene 2 will be suited to a wider range of cloud and HPC applications than its predecessor, due to the low-latency, inter-server data transfer enabled by RoCE.

The X-Gene 1 has already been demoed tackling HPC and other datacentre workloads when paired with Nvidia Tesla K20 GPU accelerators. The X-Gene/Nvidia Tesla accelerator pairing is being used in servers from Cirrascale, E4 and Eurotech. Each server is designed to specialise in different workloads: the Cirrascale focuses on HPC and enterprise workloads, while the E4 is focused on seismic, signal and image processing, as well as running jobs against big data sets using map-reduce.