AI & Compute

AI model using AMD GPUs for training hits milestone

Last updated: November 25, 2025 2:56 am
Published November 25, 2025

Zyphra, AMD, and IBM spent a year testing whether AMD's GPUs and platform can support large-scale AI model training, and the result is ZAYA1.

In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they present as proof that the market doesn't have to rely on NVIDIA to scale AI.

The model was trained on AMD's Instinct MI300X chips, Pensando networking, and ROCm software, all running on IBM Cloud's infrastructure. What's notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster – just without NVIDIA's components.

Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For companies frustrated by supply constraints or spiralling GPU pricing, that amounts to something rare: a second option that doesn't require compromising on capability.

How Zyphra used AMD GPUs to cut costs without gutting AI training performance

Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput.

The MI300X's 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.
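A rough back-of-envelope calculation shows why that headroom matters. The byte counts below are common defaults (bf16 weights and gradients, fp32 Adam moments), not figures from Zyphra's report:

```python
# Back-of-envelope memory budget for one MI300X (192 GB HBM).
# Byte counts are illustrative assumptions, not Zyphra's numbers.
def training_mem_gb(params_b, bytes_weights=2, bytes_grads=2, bytes_optim=8):
    """Rough per-GPU memory if the full model is replicated:
    bf16 weights + bf16 grads + fp32 Adam moments (m and v)."""
    total_bytes = params_b * 1e9 * (bytes_weights + bytes_grads + bytes_optim)
    return total_bytes / 1e9

full = training_mem_gb(8.3)   # ZAYA1's 8.3B total parameters
print(f"~{full:.0f} GB for weights + grads + optimizer state")  # ~100 GB
# That fits under 192 GB with room left for activations, so early
# runs can skip tensor/pipeline parallelism entirely.
```

On an 80GB-class GPU the same model would already force sharding; on the MI300X it fits replicated, which is exactly the simplification described above.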

Zyphra built each node with eight MI300X GPUs linked over InfinityFabric and paired each one with its own Pollara network card. A separate network handles dataset reads and checkpointing. It's an unfussy design, but that seems to be the point: the simpler the wiring and network layout, the lower the switching costs and the easier it is to keep iteration times steady.

ZAYA1: An AI model that punches above its weight

ZAYA1-base activates 760 million parameters out of a total of 8.3 billion and was trained on 12 trillion tokens in three phases. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.
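The "activated vs total parameters" split is the defining property of a Mixture-of-Experts layer: a router scores every token and only the top-k experts run for it. The sketch below is a generic top-2 MoE forward pass for illustration, not ZAYA1's actual routing network (whose refinements Zyphra describes separately):

```python
import numpy as np

# Minimal top-k Mixture-of-Experts sketch (illustrative, not ZAYA1's
# router): each token activates only top_k of n_experts, so a small
# fraction of total parameters does work per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):                           # x: (tokens, d_model)
    logits = x @ router_w                     # (tokens, n_experts)
    top = np.argsort(logits, axis=1)[:, -top_k:]      # chosen experts
    gates = np.take_along_axis(logits, top, axis=1)
    gates = np.exp(gates) / np.exp(gates).sum(1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):               # combine expert outputs
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * (x[t] @ experts[e])
    return out

y = moe_forward(rng.normal(size=(4, d_model)))
print(y.shape)   # (4, 64) -- only 2 of 8 experts ran per token
```

With 2 of 8 experts active, per-token compute is roughly a quarter of a dense layer of the same total size, which mirrors ZAYA1's 760M-active-of-8.3B ratio.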

The model uses a mix of the Muon and AdamW optimisers. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn't dominate each iteration. Batch sizes were increased over time, but that depends heavily on having storage pipelines that can deliver tokens quickly enough.
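Muon's distinctive (and kernel-heavy) step is orthogonalising each weight matrix's momentum via a Newton-Schulz iteration, which is why fused kernels pay off. The sketch below uses the classic cubic iteration for clarity; Zyphra's ROCm kernels implement a tuned fused variant, and the step count here is an assumption:

```python
import numpy as np

# Sketch of Muon's core operation (assumption: plain cubic
# Newton-Schulz; production Muon uses a tuned polynomial, fused into
# few kernels). It pushes all singular values of the momentum matrix
# toward 1, so every update direction gets roughly equal scale.
# AdamW would still handle non-matrix params (embeddings, norms).
def newton_schulz_orth(G, steps=40):
    X = G / (np.linalg.norm(G) + 1e-7)    # scale singular values below 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # singular values -> 1
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 32))             # stand-in momentum matrix
O = newton_schulz_orth(G)
s = np.linalg.svd(O, compute_uv=False)
print(round(s.min(), 2), round(s.max(), 2))   # both close to 1.0
```

Each iteration is just matrix multiplies, so on a GPU the cost is a handful of GEMMs per weight matrix; the memory-traffic trimming mentioned above targets exactly these intermediate products.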

All of this results in an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.

A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X's memory headroom gives engineers room to iterate, while ZAYA1's compressed attention cuts prefill time during evaluation.

Making ROCm behave with AMD GPUs

Zyphra didn't hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how the AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit the MI300X's preferred compute ranges.

InfinityFabric performs best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
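"Sizing fusion buffers" means packing many small gradient tensors into a few large buffers before each all-reduce, since RDMA fabrics like Pollara only hit peak bandwidth on large messages. The bucketing logic can be sketched in a few lines; the 256MB bucket size is an illustrative assumption, not Zyphra's setting:

```python
# Greedy gradient bucketing sketch: pack per-tensor gradients into
# large fusion buffers so each collective sends one big message.
# bucket_mb=256 is an illustrative assumption.
def bucket_grads(grad_sizes_mb, bucket_mb=256):
    buckets, current, filled = [], [], 0.0
    for size in grad_sizes_mb:
        if filled + size > bucket_mb and current:
            buckets.append(current)          # flush the full bucket
            current, filled = [], 0.0
        current.append(size)
        filled += size
    if current:
        buckets.append(current)
    return buckets

sizes = [3, 120, 48, 200, 7, 90, 64]         # per-tensor grads, MB
buckets = bucket_grads(sizes)
print([sum(b) for b in buckets])             # [171, 207, 154]
```

Seven small all-reduces become three large ones, trading a little latency (waiting for a bucket to fill) for much better link utilisation.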

Storage concerns were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed checkpoint recovery, which is vital during long runs where rewinds are inevitable.
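Shard bundling turns thousands of tiny per-sample reads into a handful of large sequential ones. Real pipelines use formats like WebDataset tar shards or indexed binary files; the toy version below just shows the idea with in-memory tar archives:

```python
import io
import tarfile

# Toy dataset shard bundling (illustrative): group small sample
# files into larger tar shards so readers issue sequential reads
# instead of scattered small-file IOPS.
def bundle_samples(samples, shard_size=3):
    """samples: list of (name, bytes). Yields in-memory tar shards."""
    for i in range(0, len(samples), shard_size):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, data in samples[i:i + shard_size]:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        yield buf.getvalue()

samples = [(f"doc{i}.txt", f"token stream {i}".encode()) for i in range(7)]
shards = list(bundle_samples(samples))
print(len(shards))   # 3 shards instead of 7 small files
```

The same principle scales up: with millions of documents per shard, a data loader streams each shard once instead of seeking per sample.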

Keeping clusters on their feet

Training jobs that run for weeks rarely behave perfectly. Zyphra's Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes straightforward corrective actions automatically. The team also increased RCCL timeouts to keep brief network interruptions from killing entire jobs.
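The shape of such a service is simple even if the production rules are not: classify a failure signature from logs, then map it to a corrective action rather than letting the job die. The rules and actions below are invented for illustration and are not Aegis's actual rule set:

```python
# Toy fault classifier in the spirit of Zyphra's Aegis service.
# Rules and actions are invented for illustration.
RULES = [
    ("ECC",      "uncorrectable ecc", "cordon GPU, restart job"),
    ("NIC_FLAP", "link down",         "retry collective, raise RCCL timeout"),
    ("OOM",      "out of memory",     "reload checkpoint, shrink microbatch"),
]

def diagnose(log_line):
    """Match a log line against known failure signatures."""
    line = log_line.lower()
    for name, pattern, action in RULES:
        if pattern in line:
            return name, action
    return "UNKNOWN", "page the on-call engineer"

fault, action = diagnose("rank 12: NIC reports Link Down, retrying")
print(fault, "->", action)   # NIC_FLAP -> retry collective, raise RCCL timeout
```

The value is in the "takes action automatically" part: a NIC flap triggers a retry with a longer timeout instead of wasting a week of accumulated GPU hours.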

Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.
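The speedup comes from parallelism: every rank writes its own shard at once instead of funnelling all state through rank 0. A minimal sketch, with threads standing in for ranks and JSON standing in for tensor state (neither is what a real training job would use):

```python
import json
import pathlib
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Sketch of distributed checkpointing: each rank writes its own
# shard concurrently. With N ranks and similar shard sizes, the
# wall-clock save time drops by roughly a factor of N versus
# serialising everything through one writer.
def save_sharded(state_per_rank, out_dir):
    out = pathlib.Path(out_dir)
    def write(rank):
        path = out / f"shard_{rank:04d}.json"
        path.write_text(json.dumps(state_per_rank[rank]))
        return path.name
    with ThreadPoolExecutor() as pool:       # stands in for N processes
        return list(pool.map(write, range(len(state_per_rank))))

states = [{"rank": r, "weights": [r] * 4} for r in range(8)]
with tempfile.TemporaryDirectory() as d:
    files = save_sharded(states, d)
print(files[:2])   # ['shard_0000.json', 'shard_0001.json']
```

Restores work the same way in reverse, which is why the page-cache tuning mentioned earlier pays off when a run has to rewind.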

What the ZAYA1 AMD training milestone means for AI procurement

The report draws a clean line between NVIDIA's ecosystem and AMD's equivalents: NVLink vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.

None of this means enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for phases that benefit from the memory capacity of MI300X GPUs and ROCm's openness. That spreads supplier risk and increases total training volume without major disruption.

All of this leads to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.

It's not a manifesto, just a practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without relying solely on one vendor, it's a potentially useful blueprint.

See also: Google commits to 1000x more AI infrastructure in the next 4-5 years

AI News is powered by TechForge Media.
