Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platform can support large-scale AI model training, and the result is ZAYA1.
In partnership, the three companies trained ZAYA1 – described as the first major Mixture-of-Experts foundation model built entirely on AMD GPUs and networking – which they see as proof that the market doesn’t have to rely on NVIDIA to scale AI.
The model was trained on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all running across IBM Cloud’s infrastructure. What’s notable is how conventional the setup looks. Instead of experimental hardware or obscure configurations, Zyphra built the system much like any enterprise cluster, just without NVIDIA’s parts.
Zyphra says ZAYA1 performs on par with, and in some areas ahead of, well-established open models in reasoning, maths, and code. For businesses frustrated by supply constraints or spiralling GPU pricing, it amounts to something rare: a second option that doesn’t require compromising on capability.
How Zyphra used AMD GPUs to cut costs without gutting AI training performance
Most organisations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput.
The MI300X’s 192GB of high-bandwidth memory per GPU gives engineers some breathing room, allowing early training runs without immediately resorting to heavy parallelism. That tends to simplify projects that are otherwise fragile and time-consuming to tune.
Zyphra built each node with eight MI300X GPUs linked over InfinityFabric and paired each one with its own Pollara network card. A separate network handles dataset reads and checkpointing. It’s an unfussy design, but that seems to be the point; the simpler the wiring and network layout, the lower the switch costs and the easier it is to keep iteration times steady.
ZAYA1: An AI model that punches above its weight
ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens in three phases. The architecture leans on compressed attention, a refined routing system to steer tokens to the right experts, and lighter-touch residual scaling to keep deeper layers stable.
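The report doesn’t include ZAYA1’s routing code, but a minimal sketch of the generic top-k Mixture-of-Experts pattern shows why so few parameters are active per token. The expert count, dimensions, and top-k value below are illustrative assumptions, not ZAYA1’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Generic top-k MoE layer: each token is routed to k of E experts,
    so only a small fraction of the layer's parameters run per token."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=32, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model), batch dims flattened
        weights = F.softmax(self.router(x), dim=-1)  # routing probabilities per token
        topw, topi = weights.topk(self.k, dim=-1)    # keep only the k best experts per token
        out = torch.zeros_like(x)
        # Plain loops for clarity; production kernels batch this dispatch.
        for rank in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, rank] == e            # tokens whose rank-th choice is expert e
                if mask.any():
                    out[mask] += topw[mask, rank].unsqueeze(-1) * expert(x[mask])
        return out
```

With 32 experts and k=2, only a sixteenth of the expert parameters touch any given token; the same mechanism is what lets ZAYA1 activate 760 million of its 8.3 billion parameters at a time.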
The model uses a mix of Muon and AdamW as its optimisers. To make Muon efficient on AMD hardware, Zyphra fused kernels and trimmed unnecessary memory traffic so the optimiser wouldn’t dominate each iteration. Batch sizes were increased over time, but that depends heavily on having storage pipelines that can deliver tokens quickly enough.
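The report doesn’t publish the fused optimiser code, but the conventional way to combine the two is to hand 2-D weight matrices to Muon and leave embeddings, norms, and biases on AdamW. A minimal sketch follows; the `Muon` import is a stand-in for whichever implementation is used, and the learning rates are illustrative.

```python
import torch
from torch.optim import AdamW
from muon import Muon  # hypothetical import: stands in for any open-source Muon implementation

def build_optimisers(model: torch.nn.Module):
    # Muon's orthogonalised updates target 2-D weight matrices; embeddings
    # and 1-D tensors (norms, biases) conventionally stay on AdamW.
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [
        Muon(muon_params, lr=2e-2),                      # illustrative LR
        AdamW(adamw_params, lr=3e-4, weight_decay=0.1),  # illustrative LR
    ]
```

Each training step then calls `step()` on both optimisers; Zyphra’s contribution was making the Muon step cheap on MI300X by fusing its kernels.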
All of this results in an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a sliver of the model runs at once, which helps manage inference memory and reduces serving cost.
A bank, for example, could train a domain-specific model for investigations without needing convoluted parallelism early on. The MI300X’s memory headroom gives engineers room to iterate, while ZAYA1’s compressed attention cuts prefill time during evaluation.
Making ROCm behave with AMD GPUs
Zyphra didn’t hide the fact that moving a mature NVIDIA-based workflow onto ROCm took work. Instead of porting components blindly, the team spent time measuring how the AMD hardware behaved and reshaping model dimensions, GEMM patterns, and microbatch sizes to suit the MI300X’s preferred compute ranges.
InfinityFabric performs best when all eight GPUs in a node participate in collectives, and Pollara tends to reach peak throughput with larger messages, so Zyphra sized fusion buffers accordingly. Long-context training, from 4k up to 32k tokens, relied on ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
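Exact buffer sizes aren’t given in the report, but in a PyTorch-style stack the equivalent knob is the gradient bucket size, which fuses many small all-reduce messages into fewer large ones. A minimal sketch, with an invented 256MB bucket standing in for “larger messages suit Pollara”:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")     # served by RCCL on ROCm builds of PyTorch
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)           # ROCm reuses the torch.cuda API

model = nn.Linear(4096, 4096).cuda()        # stand-in for the real model
# DDP fuses per-parameter gradients into buckets before launching
# collectives; larger buckets mean fewer, bigger messages, which suits
# fabrics that peak at large message sizes. 256MB is an assumption,
# not a reported value (the DDP default is 25MB).
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=256)
```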
Storage considerations were equally practical. Smaller models hammer IOPS; larger ones need sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed up checkpoint recovery, which is vital during long runs where rewinds are inevitable.
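The bundling format isn’t specified, but the idea is straightforward to sketch: pack many small sample files into a few large shards plus a byte-offset index, so the filesystem sees sequential reads instead of scattered ones. Paths and the shard size below are invented for illustration.

```python
import json
import os

SHARD_BYTES = 512 * 1024 * 1024  # illustrative 512MB target per shard

def bundle(sample_paths, out_dir):
    """Concatenate small sample files into large shards and record a
    byte-offset index so each sample can still be located later."""
    os.makedirs(out_dir, exist_ok=True)
    shard_id, offset, index, out = -1, 0, [], None
    for path in sample_paths:
        if out is None or offset >= SHARD_BYTES:  # start a fresh shard
            if out:
                out.close()
            shard_id, offset = shard_id + 1, 0
            out = open(os.path.join(out_dir, f"shard-{shard_id:05d}.bin"), "wb")
        data = open(path, "rb").read()
        index.append({"shard": shard_id, "offset": offset, "length": len(data)})
        out.write(data)
        offset += len(data)
    if out:
        out.close()
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(index, f)  # readers seek straight to (shard, offset)
```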
Keeping clusters on their feet
Training jobs that run for weeks rarely behave perfectly. Zyphra’s Aegis service monitors logs and system metrics, identifies failures such as NIC glitches or ECC blips, and takes straightforward corrective actions automatically. The team also increased RCCL timeouts to keep short network interruptions from killing entire jobs.
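The article doesn’t name the exact setting, but in a PyTorch-on-ROCm stack the collective timeout is set when the process group is created, and ROCm’s RCCL answers to the same “nccl” backend name. The 60-minute value below is an assumption, simply something longer than the default.

```python
from datetime import timedelta

import torch.distributed as dist

# On ROCm builds of PyTorch, the "nccl" backend is backed by RCCL.
# A generous timeout lets collectives ride out brief NIC flaps or link
# retrains instead of aborting a multi-week job. 60 minutes is an
# illustrative value, not one reported by Zyphra.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=60))
```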
Checkpointing is distributed across all GPUs rather than forced through a single chokepoint. Zyphra reports more than ten-fold faster saves compared with naïve approaches, which directly improves uptime and cuts operator workload.
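Zyphra’s checkpoint code isn’t public, but PyTorch’s distributed checkpoint module illustrates the pattern the article describes: every rank writes its own shards in parallel instead of funnelling the full state through rank 0. The model, optimiser, and path below are placeholders.

```python
import torch
import torch.distributed.checkpoint as dcp
import torch.nn as nn

model = nn.Linear(8, 8)                        # stand-in for the real model
optim = torch.optim.AdamW(model.parameters())  # stand-in optimiser

# Each rank saves only the shards it owns, so writes fan out across all
# GPUs rather than through a single chokepoint. This shows the generic
# pattern, not Zyphra's actual implementation.
state = {"model": model.state_dict(), "optim": optim.state_dict()}
dcp.save(state, checkpoint_id="/checkpoints/step_10000")  # illustrative path

# Recovery mirrors the save: each rank reads back its own shards.
dcp.load(state, checkpoint_id="/checkpoints/step_10000")
```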
What the ZAYA1 AMD training milestone means for AI procurement
The report draws a clean line between NVIDIA’s ecosystem and AMD’s equivalents: NVLink vs InfinityFabric, NCCL vs RCCL, cuBLASLt vs hipBLASLt, and so on. The authors argue the AMD stack is now mature enough for serious large-scale model development.
None of this suggests enterprises should tear out existing NVIDIA clusters. A more realistic path is to keep NVIDIA for production while using AMD for phases that benefit from the memory capacity of MI300X GPUs and ROCm’s openness. That spreads supplier risk and increases total training volume without major disruption.
This all leads to a set of recommendations: treat model shape as adjustable, not fixed; design networks around the collective operations your training will actually use; build fault tolerance that protects GPU hours rather than merely logging failures; and modernise checkpointing so it no longer derails training rhythm.
It’s not a manifesto, just our practical takeaway from what Zyphra, AMD, and IBM learned by training a large MoE AI model on AMD GPUs. For organisations looking to expand AI capacity without depending solely on one vendor, it’s a potentially useful blueprint.
See also: Google commits to 1000x more AI infrastructure in next 4-5 years

