Arm Updates CSS Designs for Hyperscalers’ Custom Chips

By  02.21.2024

Arm recently upgraded its Arm Neoverse Compute Subsystem (CSS) designs with new CPU cores, aimed at companies building their own custom chips for the data center.

The market for custom chips in the data center is significant, according to Mohamed Awad, senior VP and general manager for Arm’s infrastructure line of business.

“[Hyperscalers] are redesigning systems from the ground up, starting with custom specs,” he said. “This works because they know their workloads better than anyone else, which means they can fine-tune every aspect of the system, including the networking acceleration, and even general-purpose compute, specifically, to optimize for efficiency, performance and ultimately TCO.”

Hyperscale data center operators have developed multiple generations of their own custom Arm-based CPUs, including AWS Graviton, and Arm has also been adopted for data center CPUs by companies including Ampere and Nvidia.


CSSes are Arm’s oven-ready designs that combine key SoC elements to give customers a head start when designing custom SoCs. Arm also has an ecosystem of design partners to help with implementation, if required. The overall aim is to make the path to custom silicon faster and more accessible. The recently announced Microsoft Cobalt 100 is based on second-gen CSS (specifically, CSS-N2).

Mohamed Awad (Source: Arm)

Awad said hyperscalers choose Arm because the availability of CSSes means custom solutions can be created quickly and combined with Arm’s robust ecosystem.

“What we’re hearing from everywhere is that, generally speaking for hyperscalers and many of these OEMs, general-purpose compute is just not keeping up, meaning an off-the-shelf SoC is not keeping up,” he said. “We’re really optimistic about [CSSes] and we’ve seen tremendous traction with these platforms.”

The driver for hyperscalers’ desire to build their own chips is undoubtedly AI.

Awad said Arm has customers running AI inference at scale on Arm-based CPUs, in part because of the cost of custom accelerators, and in part because of their limited availability. (The market-leading data center GPU, Nvidia H100, is in notoriously short supply.) CPUs are widely available and very affordable compared with other options, he said.

“The decision to offload an inference job to an accelerator, whether that’s a GPU or something else, comes down to the granularity of compute you’re dealing with,” he said. “At certain granularities of compute, within the context of the workload, it makes a lot more sense to keep it on the CPU…from a performance perspective, which ultimately will translate to cost.”
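
To make that trade-off concrete, here is a deliberately simplified sketch: an accelerator typically pays a fixed per-call cost (kernel launch plus data transfer) before its higher per-item throughput wins out, so fine-grained batches can finish sooner on the CPU. All constants below are hypothetical placeholders, not Arm or Nvidia figures.

# Illustrative back-of-the-envelope model of the CPU-vs-accelerator
# offload decision described above. All constants are hypothetical;
# real costs depend on the workload, interconnect, and hardware.

def total_latency(batch, per_item_compute, offload_overhead=0.0):
    """Fixed offload overhead plus linear compute time for a batch."""
    return offload_overhead + batch * per_item_compute

CPU_PER_ITEM = 2.0e-3   # seconds per item on the CPU (hypothetical)
ACC_PER_ITEM = 0.1e-3   # seconds per item on the accelerator (hypothetical)
ACC_OVERHEAD = 20e-3    # seconds per offload: launch + transfer (hypothetical)

for batch in (1, 4, 16, 64, 256):
    cpu = total_latency(batch, CPU_PER_ITEM)
    acc = total_latency(batch, ACC_PER_ITEM, ACC_OVERHEAD)
    winner = "CPU" if cpu < acc else "accelerator"
    print(f"batch={batch:4d}  cpu={cpu*1e3:7.2f} ms  acc={acc*1e3:7.2f} ms  -> {winner}")

With these placeholder numbers the break-even batch is roughly 11 items: below that, the fixed offload cost dominates and the CPU finishes first, which is the granularity effect Awad describes.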

Arm’s vision is for a “significant percentage, if not the vast majority” of AI inference to run on CPUs eventually, particularly as models become more optimized for CPU hardware.

“There’s a lot of work going on in that sphere—the market is evolving, so it’s like, what can I throw at it to get the job done as quickly as possible,” he said. “So that’s where we’re probably seeing some GPUs or AI accelerators being used [today]. In the future, a lot of that may end up on CPUs as the market matures.”

Arm also expects its CSS designs to be used in tightly coupled CPU-plus-accelerator designs, analogous to Nvidia Grace Hopper, which is optimized for memory capacity and bandwidth, Awad said.

CSSes don’t just work for hyperscalers; they can also support smaller companies, particularly through the Arm Total Design ecosystem of design partners, he said.

“[Smaller] companies are looking to get to market as quickly as possible to launch their solutions to capture market share, to establish themselves,” he said. “They’re also looking for a level of flexibility so that they can focus their innovation, and then they obviously need the performance to run some of these workloads.”

Collaborative relationship

With CSS, Arm takes responsibility for configuring, optimizing and validating a compute subsystem so the hyperscaler can focus on the system-level, workload-specific differentiation they care about, whether that’s software tuning, custom acceleration, or something else, said Dermot O’Driscoll, vice president of product solutions for Arm’s infrastructure line of business.

Dermot O’Driscoll (Source: Arm)

“They get faster time to market, they reduce the cost of engineering, and yet they take advantage of the same leading edge processor technology,” he said. “We created the CSS program to give customers the same kind of control of the silicon stack as they have over their software and system stacks today. This is a close collaborative relationship and our partners push us really hard to raise our game.”

Hyperscalers are highly focused on optimizing every layer of their infrastructure to get the best performance, especially performance per Watt, on diverse workloads, he said.

“This drives the need to understand and tune for each use,” he said. “The old cycle of software and hardware being developed in separate companies no longer keeps up with customer performance needs, or the complexity of either the software or the hardware. Customers want to see the hardware they deploy, even down to the microarchitecture, optimized to run their software workloads. This type of co-optimization is hard to do and requires significant commitment on both sides to make it work.”

Arm allows its customers to run workloads on simulations of its IP as it’s being developed, with customer feedback directly influencing how Arm evolves its architecture, O’Driscoll said.

Third-generation cores

The new Arm Neoverse CPU cores are the third generation of the N series (optimized for performance per Watt) and the V series (optimized for performance). CSS designs are available for the new N3 and V3 cores.

The CSS-N3 offers a 20% performance-per-Watt improvement per core over the CSS-N2. This CSS design comes with between 8 and 32 cores, with the 32-core version using as little as 40 W. It’s intended for telecoms, networking, DPU, and cloud applications and can be used with on-chip or separate AI accelerators. The new N3 core is based on Armv9.2 and includes 2 MB private L2 cache per core. It supports the latest versions of PCIe, CXL and UCIe.

The performance-tuned version, the V3 core, is Arm’s highest performance Neoverse core to date. CSS-V3 offers more than 50% better performance per socket compared to CSS-N2 (because this is the first CSS for V-series cores, a performance comparison to earlier CSS-V designs isn’t available). CSS-V3 can scale to 128 cores per socket for cloud, HPC and AI workloads. It supports DDR5/LPDDR5 and HBM3 memories with PCIe Gen5 and CXL 3.0 support.

Arm Neoverse V3 and N3 performance on various workloads versus previous generation cores (measured on simulation). (Source: Arm)

While N3 has been further optimized for tasks like compression, which bring down cloud operator costs, V3’s optimizations include better performance for workloads like protocol buffers. Both show big improvements for AI data analytics; the figures in the graph above are for XGBoost, a widely used machine learning (ML) library for regression, classification and ranking applications. Improvements to branch prediction, better management of last-level cache and associated memory bandwidth, and bigger L2 cache almost doubled XGBoost performance on N3 cores versus N2.
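
For reference, an XGBoost run of the kind behind these figures takes only a few lines; the sketch below is illustrative only (Arm has not published its exact benchmark configuration) and uses synthetic data with XGBoost’s standard histogram-based CPU training path.

# Minimal XGBoost classification sketch, illustrative of the kind of
# workload benchmarked above; not Arm's actual benchmark configuration.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))        # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # histogram-based training, the usual CPU path
}
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(xgb.DMatrix(X[:5])))  # predicted probabilities

By default XGBoost uses all available CPU cores, so runs like this scale with core count, which is one reason per-core cache and branch-prediction improvements show up so directly in such benchmarks.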

Arm’s N3 core features better branch prediction accuracy and bigger L2 cache, which contribute to almost doubling performance for the XGBoost library versus N2 (Source: Arm)
Arm’s preliminary results for Llama2-7B inference on Neoverse V1 and V2 cores (Source: Arm)

Arm has also been looking at generative AI, ready for the shift to inference at scale that O’Driscoll says is coming. Arm showed preliminary results for Llama2-7B running on Neoverse V1 and V2 performance-optimized cores (no figures are available yet for the third-gen V3 core announced today).

Part of the equation for cost-efficient inference is throughput, O’Driscoll said, adding that token generation throughput on deployed Arm silicon is already “very good.”

“CPUs are widely available and can flexibly be used for ML or other workloads,” he said. “They are easy to deploy, support a variety of software frameworks, and are cost and energy efficient. So we know CPU inference will be a key part of the genAI computing footprint, and we can see these workloads already benefiting from ML-specific Neoverse features like BFloat16, MatMul, SVE [Scalable Vector Extension] and SVE2 as well as our microarchitectural optimizations, and that trend will continue.”
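
On a Linux/AArch64 server, the presence of these features can be verified from userspace by reading the Features line of /proc/cpuinfo. A minimal sketch, assuming Linux on an Arm machine; the flag tokens (sve, sve2, bf16, i8mm) are the kernel’s standard names for SVE, SVE2, BFloat16 and int8 matrix-multiply support.

# Check for the Neoverse ML features mentioned above on Linux/AArch64
# by parsing the "Features" line of /proc/cpuinfo. On non-Arm hosts
# (where the line is named differently) every flag simply reports "no".

def cpu_features(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.lower().startswith("features"):
                return set(line.split(":", 1)[1].split())
    return set()

feats = cpu_features()
for flag in ("asimd", "sve", "sve2", "bf16", "i8mm"):
    print(f"{flag:6s} {'yes' if flag in feats else 'no'}")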
