AMD’s moment of Zen: Finally, an architecture that can compete

Intel's architecture is still better, but AMD has significantly narrowed the gap.

What makes it tick

The basic building block of Zen is the Core Complex (CCX): a unit containing four cores, capable of running eight simultaneous threads. True to AMD's belief in high core and thread counts for desktop processors, the first Ryzen 7 series processors include two CCXes, for a total of eight cores and 16 threads. Three versions are launching: the 1800X, a 3.6-4.0GHz part at $499/£490, the 1700X, 3.4-3.8GHz at $399/£390, and the $329/£320 1700 at 3.0-3.7GHz.

A Zen Core Complex.
AMD

In the second quarter, these will be joined by Ryzen 5. The R5 1600X will be a six-core, 12-thread chip running at 3.6-4.0GHz (two CCXes, with one core from each disabled), and the 1500X will be a four-core, eight-thread chip at 3.5-3.7GHz (just a single CCX).

Zen scales up, too. At some point this year, AMD will launch server processors, codenamed "Naples," containing eight CCXes for a total of 32 cores and 64 threads.
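The core and thread counts quoted above follow directly from the CCX layout. As a rough sketch, assuming every CCX holds four cores with two threads apiece and that cut-down parts disable the same number of cores in each CCX, the arithmetic looks like this. The `zen_config` helper and its parameters are illustrative names, not anything AMD publishes.

```python
CORES_PER_CCX = 4      # per the article: each Core Complex holds four cores
THREADS_PER_CORE = 2   # SMT gives each core two hardware threads


def zen_config(ccx_count, disabled_per_ccx=0):
    """Return (cores, threads) for a hypothetical Zen part.

    Assumes the same number of cores is disabled in every CCX.
    """
    if not 0 <= disabled_per_ccx <= CORES_PER_CCX:
        raise ValueError("disabled_per_ccx must be between 0 and 4")
    cores = ccx_count * (CORES_PER_CCX - disabled_per_ccx)
    return cores, cores * THREADS_PER_CORE


# Configurations described in the text.
parts = {
    "Ryzen 7 (two CCXes)": zen_config(2),
    "Ryzen 5 1600X (two CCXes, one core off in each)": zen_config(2, 1),
    "Ryzen 5 1500X (one CCX)": zen_config(1),
    "Naples (eight CCXes)": zen_config(8),
}

for name, (cores, threads) in parts.items():
    print(f"{name}: {cores} cores, {threads} threads")
```
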

This design decision already sets AMD apart from Intel. Intel's processor range is strangely bifurcated. The company's latest core is Kaby Lake, but Kaby Lake is only available in two- and four-core versions, some with simultaneous multithreading (SMT), others without. To go beyond four cores, you have to switch to a previous-generation architecture: Broadwell.

Broadwell, a 14nm shrink of the previous Haswell architecture, was first introduced in September 2014. Currently, every processor that's "bigger" than a four-core, eight-thread mainstream desktop or mobile part is built using the Broadwell core. This includes not just the enthusiast-oriented Broadwell-E parts that offer six, eight, or 10 cores and 12, 16, or 20 threads, but also the Broadwell-EP and Broadwell-EX server parts, right up to the Xeon E7-8894V4 that was launched just two weeks ago. This is an 8-socket-capable 24-core, 48-thread chip that won't give you much change from $9,000.

These first Ryzen processors straddle that discontinuity in the Intel line-up. The R7 1700 is going more or less head to head with the Kaby Lake i7-7700K. The latter uses Intel's refined 14nm process and newest architecture with the best single-threaded performance, running at 4.2-4.5GHz. But the 1700X and 1800X are competing against Broadwell designs: the six-core, 12-thread 3.6-3.8GHz i7-6850K ($620/£580 or so) and the eight-core, 16-thread 3.2-3.7GHz i7-6900K ($1,049/£1,000 or so), respectively. In upping the core count, Intel is forcing you to sacrifice its latest core and most power-efficient process, and to switch to older, less frequently updated chipsets to boot (the current X99 dates back to late 2014).

Bigger, beefier cores

Zen core block diagram.
AMD

Each of those new cores is equipped with many more execution resources than Bulldozer. On the integer side, Zen has four ALUs and two AGUs. On the floating point side, the shared floating point unit concept has been scrapped: each core now has a pair of 128-bit FMA units of its own. The floating point units are organized as separate add and multiply pipes to handle a more diverse instruction mix when not performing multiply-accumulate operations. But 256-bit AVX instructions have to be split up across the two FMA units and tie up all the floating point units.

This is a big step up from Bulldozer, essentially doubling the integer and floating point resources available to each core. Compared to Broadwell and Skylake, however, things are murkier. AMD's four ALUs are similar to each other though not identical; some instructions have to be processed on a particular unit (only one has a full multiplier, only one has a divider), and they can't be run on other units even if those units are idle. Intel's are a bit more diverse, so for some instruction mixes, Intel's four ALUs may effectively deliver less throughput than AMD's.

Complicating this further, AMD says that six instructions total can be dispatched per cycle across the 10 pipelines (four ALU, two AGU, four FP) in the core. Broadwell and Skylake can both issue eight instructions per cycle. Four of those go to AGUs—Skylake has two general-purpose AGUs and two more specialized ones. The other four perform arithmetic of some kind, either integer or floating point.

Intel groups its functional units behind four dispatch ports, numbered 0, 1, 5, and 6. All four ports include a regular integer ALU, but port 0 also has an AVX FMA unit, a divider, and a branch unit. Port 1 has a second AVX FMA unit but no divider. Ports 5 and 6 have neither an FMA nor a divider. This means that in one cycle, the processor can schedule two AVX FMA operations or one divide and one AVX FMA. But the processor can't do a divide and two FMAs.

In principle, this means that in one cycle, Zen could dispatch four integer operations and two floating point operations. Skylake could dispatch four integer operations, but this would tie up all four ports, leaving it unable to also dispatch any floating point operations. On the other hand, Skylake and Broadwell could both dispatch four integer operations and four address operations in a single cycle. Zen would only manage two address operations.
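The dispatch limits described above can be written down as a toy model. This is only a sketch of the constraints as the article states them: it treats Zen's ALUs as interchangeable, ignores everything that happens before and after dispatch, and lumps all divides together. The function names and the operation categories are assumptions made for illustration, not vendor terminology.

```python
def zen_can_dispatch(int_ops=0, fp_ops=0, addr_ops=0):
    """Toy one-cycle check for Zen: 4 ALU, 2 AGU, and 4 FP pipes, 6 ops total."""
    return (
        int_ops <= 4
        and addr_ops <= 2
        and fp_ops <= 4
        and int_ops + fp_ops + addr_ops <= 6
    )


def intel_can_dispatch(int_ops=0, fma_ops=0, div_ops=0, addr_ops=0):
    """Toy one-cycle check for Broadwell/Skylake as described in the text.

    Ports 0, 1, 5, and 6 share integer and floating point work. FMA units
    sit on ports 0 and 1 only, and the divider sits on port 0 only.
    """
    arithmetic = int_ops + fma_ops + div_ops
    return (
        arithmetic <= 4               # four arithmetic ports in total
        and div_ops <= 1              # a single divider, on port 0
        and fma_ops + div_ops <= 2    # FMAs and divides compete for ports 0 and 1
        and addr_ops <= 4             # four address-generation slots
        and arithmetic + addr_ops <= 8
    )


# Scenarios taken from the paragraphs above.
print("Zen, 4 int + 2 FP:", zen_can_dispatch(int_ops=4, fp_ops=2))
print("Intel, 4 int + 1 FMA:", intel_can_dispatch(int_ops=4, fma_ops=1))
print("Intel, 4 int + 4 address:", intel_can_dispatch(int_ops=4, addr_ops=4))
print("Zen, 4 int + 4 address:", zen_can_dispatch(int_ops=4, addr_ops=4))
print("Intel, 1 divide + 2 FMA:", intel_can_dispatch(fma_ops=2, div_ops=1))
```

Real schedulers are far messier than this, so the model is useful only for checking the specific one-cycle combinations discussed in the text.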

Bulldozer's foibles aren't quite a thing of the past

Isolating these differences to measure the impact of the designs is nigh impossible. Nonetheless, there are a few areas where the advantage of one design or the other is clear. Both of Intel's chips have two full 256-bit AVX FMA units that can be used simultaneously. For code that can take advantage of this, Skylake and Broadwell's performance should be up to double Zen's. For many years, AMD has been pushing the GPU as the best place to perform this kind of floating point-intensive parallel workload, so on some level this discrepancy makes sense. But programs that depend heavily on AVX instructions are likely to strongly favor the Intel chips.

This becomes visible in, for example, one of Geekbench's floating point tests, SGEMM. This is a matrix multiplication test that uses, when available, AVX and FMA instructions for best performance. On a single thread, the 6900K manages about 90 billion single precision floating point operations per second (gigaflops). The 1800X, by contrast, only offers 53 gigaflops. Although the 1800X's higher clock speed helps a little, the Intel chip is doing roughly twice as much work with each cycle. A few hundred megahertz isn't enough to offset this architectural difference.
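The size of that gap can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below counts one FMA as two floating point operations per 32-bit lane, and it assumes each benchmark thread ran at the chip's top boost clock (4.0GHz for the 1800X, 3.7GHz for the 6900K), which the Geekbench figures don't confirm. The function names are purely illustrative.

```python
def peak_sp_flops_per_cycle(fma_units, vector_bits):
    """Theoretical single precision flops per cycle for one core."""
    lanes = vector_bits // 32        # 32-bit floats per vector register
    return fma_units * lanes * 2     # each FMA is one multiply plus one add


def measured_flops_per_cycle(gigaflops, clock_ghz):
    """Flops per cycle implied by a benchmark score at an assumed clock."""
    return gigaflops / clock_ghz


zen_peak = peak_sp_flops_per_cycle(fma_units=2, vector_bits=128)
intel_peak = peak_sp_flops_per_cycle(fma_units=2, vector_bits=256)

# Scores from the article; clock speeds are assumed boost clocks.
zen_measured = measured_flops_per_cycle(53, 4.0)
intel_measured = measured_flops_per_cycle(90, 3.7)

print(f"Zen:   peak {zen_peak}/cycle, measured {zen_measured:.1f}/cycle")
print(f"Intel: peak {intel_peak}/cycle, measured {intel_measured:.1f}/cycle")
print(f"Per-cycle ratio, Intel over Zen: {intel_measured / zen_measured:.2f}")
```

Under those assumptions, the measured per-cycle ratio lands a little under the two-to-one ratio of the theoretical peaks, which is consistent with the Intel chip doing roughly twice the work per cycle.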

Of course, this is the kind of workload that in some ways proves AMD's point: GPU-accelerated versions of the same matrix multiply operation can hit 800 or more gigaflops. If your computational requirements include a substantial number of matrix multiplications, you're not going to want to do that work on your CPU.

The long-standing difficulty for AMD, and for general-purpose GPU computation more broadly, is how to cope when only some of your workload is a good match for the GPU. Moving data back and forth between CPU and GPU imposes overheads and often requires developers to switch between development tools and programming languages. There are solutions to this, such as AMD's Heterogeneous System Architecture and OpenCL, but they're still awaiting widespread industry adoption.

One particular Geekbench subtest showed a strong advantage in the other direction. Geekbench includes tests of the cryptographic instructions found in all mainstream processors these days. In a test of single-threaded performance, the Ryzen trounces the Broadwell-E, encrypting at 4.5GB/s compared to 2.7GB/s. Ryzen has two AES units that both reside within the floating point portion of the processor. Broadwell only has one, giving the AMD chip a big lead.

This situation is reversed when moving from one thread to 16: the Intel system can do 24.4GB/s while the AMD system manages only 10.2GB/s. This suggests that the test becomes bandwidth-limited with high thread counts, allowing the 6900K's quad-channel memory to give it a lead over the 1800X's dual-channel memory. Even though the Ryzen has more computational resources to throw at this particular problem, that doesn't help when the processor sits twiddling its thumbs waiting for data.
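The scaling implied by those figures is easy to work out. This is a minimal sketch using only the four throughput numbers quoted above; the `scaling` helper is an illustrative name, and "efficiency" here simply means speedup divided by thread count.

```python
def scaling(single_gbps, multi_gbps, threads):
    """Return (speedup, efficiency) when going from one thread to `threads`."""
    speedup = multi_gbps / single_gbps
    return speedup, speedup / threads


THREADS = 16  # both chips run 16 threads on eight cores

results = {
    "Ryzen 7 1800X": scaling(4.5, 10.2, THREADS),
    "Core i7-6900K": scaling(2.7, 24.4, THREADS),
}

for chip, (speedup, efficiency) in results.items():
    print(f"{chip}: {speedup:.1f}x speedup, {efficiency:.0%} per-thread efficiency")
```

The Intel chip's throughput grows by roughly nine times while the AMD chip's little more than doubles, which fits the reading that the Ryzen system runs out of memory bandwidth long before it runs out of AES units.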

And when it comes to single-threaded performance, the Kaby Lake i7-7700K pulls well ahead of both platforms. Its combination of superior IPC and faster clock speeds means that neither Broadwell-E nor Zen can keep up.
