AMD’s moment of Zen: Finally, an architecture that can compete

What makes it tick

The basic building block of Zen is the Core Complex (CCX): a unit containing four cores, capable of running eight simultaneous threads. True to AMD's belief in high core and thread counts for desktop processors, the first Ryzen 7 series processors include two CCXes, for a total of eight cores and 16 threads. Three versions are launching: the 1800X, a 3.6-4.0GHz part at $499/£490; the 1700X, at 3.4-3.8GHz for $399/£390; and the 1700, at 3.0-3.7GHz for $329/£320.

In the second quarter, these will be joined by Ryzen 5. The R5 1600X will be a six-core, 12-thread chip running at 3.6-4.0GHz (two CCXes, with one core from each disabled), and the 1500X will be a four-core, eight-thread chip at 3.5-3.7GHz (just a single CCX).

Zen scales up, too. At some point this year, AMD will launch server processors, codenamed "Naples," containing eight CCXes for a total of 32 cores and 64 threads.
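The core and thread counts above all follow from the same CCX arithmetic. Here is a minimal sketch of that arithmetic in Python; the function name and the `disabled_per_ccx` parameter are illustrative, not anything AMD publishes.

```python
CORES_PER_CCX = 4      # from the text: four cores per CCX
THREADS_PER_CORE = 2   # SMT: eight threads per four-core CCX


def core_and_thread_count(ccx_count, disabled_per_ccx=0):
    """Return (cores, threads) for a chip built from ccx_count CCXes.

    disabled_per_ccx is a hypothetical knob for parts that ship with
    some cores in each CCX switched off.
    """
    cores = ccx_count * (CORES_PER_CCX - disabled_per_ccx)
    return cores, cores * THREADS_PER_CORE


ryzen7 = core_and_thread_count(2)                         # Ryzen 7: two CCXes
r5_1600x = core_and_thread_count(2, disabled_per_ccx=1)   # one core off per CCX
r5_1500x = core_and_thread_count(1)                       # a single CCX
naples = core_and_thread_count(8)                         # server part

for name, counts in [("Ryzen 7", ryzen7), ("R5 1600X", r5_1600x),
                     ("R5 1500X", r5_1500x), ("Naples", naples)]:
    print(f"{name}: {counts[0]} cores, {counts[1]} threads")
```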

This design decision already sets AMD apart from Intel. Intel's processor range is strangely bifurcated. The company's latest core is Kaby Lake, but Kaby Lake is only available in two- and four-core versions, some with simultaneous multithreading (SMT), others without. To go beyond four cores, you have to switch to a previous-generation architecture: Broadwell.

Bigger, beefier cores

Each of those new cores is equipped with many more execution resources than Bulldozer. On the integer side, Zen has four ALUs and two AGUs. On the floating point side, the shared floating point unit concept has been scrapped: each core now has a pair of 128-bit FMA units of its own. The floating point units are organized as separate add and multiply pipes to handle a more diverse instruction mix when not performing multiply-accumulate operations. But 256-bit AVX instructions have to be split up across the two FMA units and tie up all the floating point units.

This is a big step up from Bulldozer, essentially doubling the integer and floating point resources available to each core. Compared to Broadwell and Skylake, however, things are murkier. AMD's four ALUs are similar to each other though not identical; some instructions have to be processed on a particular unit (only one has a full multiplier, only one has a divider), and they can't be run on other units even if they're available. Intel's are a bit more diverse, so for some instruction mixes, Intel's four ALUs may effectively be fewer than AMD's.

Complicating this further, AMD says that six instructions total can be dispatched per cycle across the 10 pipelines (four ALU, two AGU, four FP) in the core. Broadwell and Skylake can both issue eight instructions per cycle. Four of those go to AGUs—Skylake has two general-purpose AGUs and two more specialized ones. The other four perform arithmetic of some kind, either integer or floating point.

Intel groups its functional units behind four dispatch ports, numbered 0, 1, 5, and 6. All four ports include a regular integer ALU, but port 0 also has an AVX FMA unit, a divider, and a branch unit. Port 1 has a second AVX FMA unit but no divider. Ports 5 and 6 have neither an FMA nor a divider. This means that in one cycle, the processor can schedule two AVX FMA operations or one divide and one AVX FMA. But the processor can't do a divide and two FMAs.
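The port constraints above can be checked mechanically. The following Python sketch is a toy model built only from the description in this article (one operation per port per cycle, with the listed units on each port); it is not a faithful simulation of Intel's scheduler, and the operation names are made up for illustration.

```python
from itertools import permutations

# Units behind each dispatch port, as described in the text.
PORTS = {
    0: {"alu", "fma", "div", "branch"},
    1: {"alu", "fma"},
    5: {"alu"},
    6: {"alu"},
}


def can_issue_in_one_cycle(ops):
    """True if every op can be given its own port that supports it."""
    if len(ops) > len(PORTS):
        return False
    # Brute force over port assignments; with four ports this is cheap.
    return any(
        all(op in PORTS[port] for op, port in zip(ops, assignment))
        for assignment in permutations(PORTS, len(ops))
    )


print(can_issue_in_one_cycle(["fma", "fma"]))         # two FMAs
print(can_issue_in_one_cycle(["div", "fma"]))         # one divide, one FMA
print(can_issue_in_one_cycle(["div", "fma", "fma"]))  # a divide and two FMAs
```

In this model the third case fails because the divider and one of the two FMA units both sit behind port 0, so the divide leaves only port 1 free for an FMA.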

In principle, this means that in one cycle, Zen could dispatch four integer operations and two floating point operations. Skylake could dispatch four integer operations, but this would tie up all four ports, leaving it unable to also dispatch any floating point operations. On the other hand, Skylake and Broadwell could both dispatch four integer operations and four address operations in a single cycle. Zen would only manage two address operations.
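To make the comparison concrete, here is a deliberately simplified Python sketch of the per-cycle limits quoted above. The function names are illustrative, and the model ignores everything except the pipe counts and dispatch widths given in this article, so treat it as a way of checking the arithmetic rather than as a description of either scheduler.

```python
def zen_can_dispatch(int_ops, addr_ops, fp_ops):
    """Zen per the text: 4 ALU, 2 AGU, 4 FP pipes, at most 6 ops per cycle."""
    return (int_ops <= 4 and addr_ops <= 2 and fp_ops <= 4
            and int_ops + addr_ops + fp_ops <= 6)


def skylake_can_dispatch(int_ops, addr_ops, fp_ops):
    """Skylake/Broadwell per the text: integer and FP work share four
    arithmetic ports (two of which carry FMA units), plus four address ops."""
    return int_ops + fp_ops <= 4 and fp_ops <= 2 and addr_ops <= 4


# Four integer ops plus two floating point ops in one cycle.
print(zen_can_dispatch(4, 0, 2), skylake_can_dispatch(4, 0, 2))

# Four integer ops plus four address ops in one cycle.
print(zen_can_dispatch(4, 4, 0), skylake_can_dispatch(4, 4, 0))
```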

Bulldozer's foibles aren't quite a thing of the past

Isolating these differences to measure the impact of the designs is nigh impossible. Nonetheless, there are a few areas where the advantage of one design or the other is clear. Both Intel chips have two full 256-bit AVX FMA units that can be used simultaneously, where Zen has two 128-bit units. For code that can take advantage of this, Skylake and Broadwell's per-cycle throughput should be as much as double Zen's. For many years, AMD has been pushing the GPU as the best place to perform this kind of floating point-intensive parallel workload, so on some level this discrepancy makes sense. But programs that depend heavily on AVX instructions are likely to strongly favor the Intel chips.

This becomes visible in, for example, one of Geekbench's floating point tests, SGEMM. This is a matrix multiplication test that uses, when available, AVX and FMA instructions for best performance. On a single thread, the 6900K manages about 90 billion single precision floating point operations per second (gigaflops). The 1800X, by contrast, only offers 53 gigaflops. Although the 1800X's higher clock speed helps a little, the Intel chip is doing roughly twice as much work with each cycle. A few hundred megahertz isn't enough to offset this architectural difference.
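The "twice as much work per cycle" figure falls out of the FMA unit widths. As a rough sketch in Python, assuming a theoretical peak where every FMA unit retires a full vector each cycle (real code rarely sustains this), and counting each fused multiply-add as two floating point operations:

```python
def peak_sp_flops_per_cycle(fma_units, vector_bits):
    """Theoretical single precision flops per cycle for one core."""
    lanes = vector_bits // 32          # 32-bit floats per vector
    return fma_units * lanes * 2       # FMA = one multiply + one add per lane


zen_peak = peak_sp_flops_per_cycle(fma_units=2, vector_bits=128)
intel_peak = peak_sp_flops_per_cycle(fma_units=2, vector_bits=256)
print(zen_peak, intel_peak, intel_peak / zen_peak)

# Measured single-thread SGEMM results quoted in the text, in gigaflops.
measured_ratio = 90 / 53
print(round(measured_ratio, 2))
```

The measured gap comes out somewhat below the theoretical 2x, which is consistent with the 1800X's higher clock speed clawing back a little of the difference.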

Of course, this is the kind of workload that in some ways proves AMD's point: GPU-accelerated versions of the same matrix multiply operation can hit 800 or more gigaflops. If your computational requirements include a substantial number of matrix multiplications, you're not going to want to do that work on your CPU.

The long-standing difficulty for AMD, and for general-purpose GPU computation as a whole, is how to cope when only some of your workload is a good match for the GPU. Moving data back and forth between CPU and GPU imposes overheads and often requires developers to switch between development tools and programming languages. There are solutions to this, such as AMD's Heterogeneous System Architecture (HSA) and OpenCL, but they're still awaiting widespread industry adoption.

One particular Geekbench subtest showed a strong advantage in the other direction. Geekbench includes tests of the cryptographic instructions found in all mainstream processors these days. In a test of single-threaded performance, the Ryzen trounces the Broadwell-E, encrypting at 4.5GB/s compared to 2.7GB/s. Ryzen has two AES units that both reside within the floating point portion of the processor. Broadwell only has one, giving the AMD chip a big lead.

This situation is reversed when moving from one thread to 16: the Intel system can do 24.4GB/s while the AMD only does 10.2GB/s. This suggests that the test becomes bandwidth-limited with high thread counts, allowing the 6900K's quad-channel memory to give it a lead over the 1800X's dual-channel memory. Even though the Ryzen has more computational resources to throw at this particular problem, that doesn't help when the processor sits twiddling its thumbs waiting for data.
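The scaling behind that inference is easy to work out from the quoted figures. This short Python sketch uses only the throughput numbers above; the variable names are illustrative.

```python
# AES throughput from the text, in GB/s.
ryzen_1t, ryzen_16t = 4.5, 10.2   # 1800X: one thread, 16 threads
intel_1t, intel_16t = 2.7, 24.4   # 6900K: one thread, 16 threads


def scaling(single, multi):
    """How many times faster the multithreaded run is than one thread."""
    return multi / single


ryzen_scaling = scaling(ryzen_1t, ryzen_16t)
intel_scaling = scaling(intel_1t, intel_16t)
single_thread_lead = ryzen_1t / intel_1t

print(f"Ryzen 1 -> 16 threads: {ryzen_scaling:.1f}x")
print(f"Intel 1 -> 16 threads: {intel_scaling:.1f}x")
print(f"Ryzen single-thread lead: {single_thread_lead:.2f}x")
```

The Intel chip scales about ninefold going to 16 threads, while the Ryzen scales a little over twofold. A compute-bound test should scale similarly on both, so the Ryzen appears to run into a limit outside the cores themselves.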

And when it comes to single-threaded performance, the Kaby Lake i7-7700K pulls well ahead of both platforms. Against its combination of superior IPC and faster clock speeds, neither Broadwell-E nor Zen can keep up.

Intel's architecture is still better, but AMD has significantly narrowed the gap.
