Zyphra, AMD, and IBM spent a year testing whether AMD’s GPUs and platforms could support training AI models at scale, and the result is ZAYA1.
The three companies have partnered to train ZAYA1, described as the first large-scale Mixture-of-Experts (MoE) foundation model built entirely on AMD GPUs and networking. The result is an argument that the market doesn't need to rely on NVIDIA to scale AI.
The model was trained on AMD's Instinct MI300X chips, Pensando networking, and ROCm software, all running on IBM Cloud infrastructure. What's notable is how conventional the setup looks: rather than experimental hardware and exotic configurations, Zyphra built something close to a standard enterprise cluster, just without NVIDIA components.
According to Zyphra, ZAYA1 performs on par with, and in some areas outperforms, established open models in reasoning, mathematics, and code. For companies struggling with supply constraints and rising GPU prices, this provides a rare second option that doesn’t require compromising functionality.
How Zyphra used AMD GPUs to reduce costs without sacrificing AI training performance
Most organizations follow the same logic when planning training budgets: memory capacity, communication speed, and predictable iteration times matter more than raw theoretical throughput.
The MI300X's 192GB of high-bandwidth memory per GPU gives engineers the headroom to run early training without immediately reaching for heavy parallelism, which is fragile to set up and takes time to coordinate.
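To see why the headroom matters, here is a back-of-envelope memory estimate. The precision layout below (bf16 weights and gradients, fp32 Adam moments and master weights) is a common mixed-precision recipe, not Zyphra's published configuration, and the 80% usable-memory fraction is an assumption.

```python
# Rough estimate of per-GPU training memory: weights + gradients +
# optimizer state per parameter, under a typical mixed-precision layout.
# These byte counts are a common convention, NOT Zyphra's actual setup.

GIB = 1024 ** 3

def training_bytes_per_param(weight=2, grad=2, optimizer=8, master=4):
    """Bytes consumed per trainable parameter during training."""
    return weight + grad + optimizer + master

def fits_on_gpu(n_params: float, gpu_mem_gib: float, usable: float = 0.8):
    """True if the training state fits in `usable` fraction of GPU memory
    (the rest is left for activations and workspace)."""
    needed = n_params * training_bytes_per_param()
    return needed <= gpu_mem_gib * GIB * usable

# An 8.3B-parameter model's full training state (~133 GB) fits on one
# 192 GB MI300X, but not on an 80 GB-class card without sharding.
print(fits_on_gpu(8.3e9, 192))  # True
print(fits_on_gpu(8.3e9, 80))   # False
```

Under these assumptions, the whole training state of an 8.3B-parameter model fits on a single MI300X, which is what lets early runs skip complex parallelism.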
Zyphra built each node from eight MI300X GPUs connected over InfinityFabric, and paired each GPU with its own Pollara network card. A separate network handles dataset reads and checkpointing. It's a simple design, and that seems to be the point: the plainer the wiring and network layout, the cheaper the switching layer and the easier it is to keep iteration times consistent.
ZAYA1: An AI model that punches above its weight
ZAYA1-base activates 760 million of its 8.3 billion total parameters and was trained on 12 trillion tokens across three phases. The architecture relies on compressed attention, a routing system that steers tokens to the appropriate experts, and lighter-touch residual scaling to keep deeper layers stable.
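The active-versus-total parameter gap comes from the MoE routing mechanism. The sketch below shows the generic idea, a router scores experts per token and only the top-k actually run, using an invented expert count and toy experts; it is not ZAYA1's actual router design.

```python
# Minimal sketch of MoE routing: score all experts, run only the top-k,
# and mix their outputs with softmax weights. Everything here (8 experts,
# top-2, linear router) is illustrative, not ZAYA1's architecture.
import math
import random

random.seed(0)
N_EXPERTS, TOP_K, DIM = 8, 2, 16

# Random router weights; each "expert" is a toy scaling function.
router_w = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
experts = [lambda x, s=e: [v * (1 + 0.1 * s) for v in x] for e in range(N_EXPERTS)]

def route(token):
    """Return the MoE output for one token, executing only top-k experts."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_w]
    top = sorted(range(N_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    # Softmax over the selected experts' logits only.
    mx = max(logits[e] for e in top)
    ws = [math.exp(logits[e] - mx) for e in top]
    z = sum(ws)
    out = [0.0] * DIM
    for w, e in zip(ws, top):
        for i, v in enumerate(experts[e](token)):
            out[i] += (w / z) * v
    return out, top

token = [random.gauss(0, 1) for _ in range(DIM)]
y, chosen = route(token)
print(f"ran {len(chosen)} of {N_EXPERTS} experts")  # ran 2 of 8 experts
```

Because only `TOP_K` of the experts execute per token, compute and inference memory traffic scale with the active parameters, not the total, which is the property the article describes.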
The model trains with a combination of the Muon and AdamW optimizers. To make Muon efficient on AMD hardware, Zyphra fused its kernels and trimmed unnecessary memory traffic so the optimizer step doesn't dominate each iteration. Batch sizes grew over the course of training, which depends on a storage pipeline that can deliver tokens fast enough.
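Published Muon recipes typically apply Muon to the 2-D weight matrices and leave embeddings, norms, and other parameters to AdamW. The partition rule sketched below follows that common convention; Zyphra's exact grouping is not public, and the parameter names are invented.

```python
# Sketch of a Muon + AdamW parameter split: Muon handles 2-D weight
# matrices, AdamW handles embeddings and 1-D parameters. This mirrors
# common Muon recipes, not Zyphra's (unpublished) configuration.

def partition_params(named_shapes):
    """Split parameters into Muon (2-D matrices) and AdamW (the rest)."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        # Embeddings stay on AdamW even though they are 2-D.
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

shapes = {
    "embed.weight":      (50_000, 4096),
    "layer0.attn.wq":    (4096, 4096),
    "layer0.mlp.w_in":   (4096, 14_336),
    "layer0.norm.scale": (4096,),
}
muon, adamw = partition_params(shapes)
print(muon)   # ['layer0.attn.wq', 'layer0.mlp.w_in']
print(adamw)  # ['embed.weight', 'layer0.norm.scale']
```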
The result is a model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a fraction of the model executes per token, which keeps inference memory manageable and reduces serving costs.
For example, banks could train domain-specific research models without needing complex parallelism early on. The MI300X's memory headroom gives engineers room to iterate, and ZAYA1's compressed attention reduces prefill time during evaluation.
Porting workflows to ROCm on AMD GPUs
Zyphra has made no secret of the fact that migrating mature NVIDIA-based workflows to ROCm takes time. Rather than blindly porting components, the team measured how the AMD hardware behaved and reshaped the model's dimensions, GEMM shapes, and microbatch sizes to match the MI300X's preferred compute ranges.
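Matching model dimensions to the hardware usually means rounding them to multiples the GPU's matrix pipelines tile efficiently. The multiple of 256 below is a generic rule of thumb for illustration, not the MI300X-specific guidance Zyphra followed.

```python
# Illustration of shaping model dimensions for GEMM efficiency: round
# each dimension up to a hardware-friendly multiple. The choice of 256
# is an assumed rule of thumb, not MI300X-specific tuning advice.

def round_to_multiple(dim: int, multiple: int = 256) -> int:
    """Round a model dimension up to the nearest multiple."""
    return ((dim + multiple - 1) // multiple) * multiple

print(round_to_multiple(4000))    # 4096
print(round_to_multiple(14208))   # 14336
print(round_to_multiple(4096))    # 4096 (already aligned)
```

In practice teams sweep nearby aligned sizes and benchmark the actual kernels, since the best shapes depend on the GPU's tile sizes and the GEMM library's heuristics.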
InfinityFabric works best when all eight GPUs in a node participate in a collective, and Pollara approaches peak throughput only as messages get larger, so Zyphra sized its fusion buffers accordingly. Long-context training from 4k to 32k tokens relied on ring attention over sharded sequences, and on tree attention during decoding, to avoid bottlenecks.
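The fusion-buffer idea can be sketched simply: pack many small gradient tensors into a few large buckets before each all-reduce so messages stay in the NIC's high-throughput regime. The 64 MiB target below is an illustrative figure, not Pollara's documented sweet spot.

```python
# Sketch of gradient fusion for collectives: greedily pack per-tensor
# gradient byte counts into buckets near a target size, so each
# all-reduce sends one large message instead of many small ones.
# The 64 MiB bucket target is an assumption for illustration.

BUCKET_BYTES = 64 * 1024 * 1024

def bucket_gradients(grad_sizes):
    """Pack gradient byte counts into fusion buckets of <= BUCKET_BYTES."""
    buckets, current, filled = [], [], 0
    for size in grad_sizes:
        if filled + size > BUCKET_BYTES and current:
            buckets.append(current)
            current, filled = [], 0
        current.append(size)
        filled += size
    if current:
        buckets.append(current)
    return buckets

# 100 gradient tensors of 2 MiB each become 4 large messages
# (three full 64 MiB buckets plus one partial bucket).
sizes = [2 * 1024 * 1024] * 100
buckets = bucket_gradients(sizes)
print(len(buckets))  # 4
```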
Storage considerations were practical as well: smaller models stress IOPS, while larger ones demand sustained bandwidth. Zyphra bundled dataset shards to reduce scattered distributed reads and increased per-node page caching to speed up checkpoint recovery, which is essential during long runs where rewinding is unavoidable.
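Shard bundling can be illustrated as concatenating many small shard payloads into a few large files with an offset index, so reading any shard becomes one seek into one large file instead of a scattered small read. The bundle size and shard names below are invented.

```python
# Sketch of dataset shard bundling: concatenate small shard payloads
# into larger bundles and keep an index of (bundle_id, offset, length)
# per shard. Sizes and names here are illustrative only.

def bundle_shards(shards, bundle_limit):
    """Pack (name, payload) shards into bundles; return (bundles, index)."""
    bundles, index = [b""], {}
    for name, payload in shards:
        if len(bundles[-1]) + len(payload) > bundle_limit and bundles[-1]:
            bundles.append(b"")
        bid = len(bundles) - 1
        index[name] = (bid, len(bundles[bid]), len(payload))
        bundles[bid] += payload
    return bundles, index

# Ten 300-byte shards with a 1000-byte bundle limit -> 4 bundles.
shards = [(f"shard-{i:03d}", bytes(300)) for i in range(10)]
bundles, index = bundle_shards(shards, bundle_limit=1000)

# Reading a shard back is a single ranged read into one large file.
bid, off, ln = index["shard-007"]
assert bundles[bid][off:off + ln] == bytes(300)
print(len(bundles))  # 4
```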
Keep the cluster stable
Training jobs that run for weeks rarely go perfectly. Zyphra's Aegis service monitors logs and system metrics, detects faults such as NIC flaps and ECC blips, and automatically takes simple corrective actions. The team also increased the RCCL timeout so that brief network interruptions don't kill the entire job.
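The core pattern of such a service can be sketched as a table of known failure signatures mapped to automatic remedies, escalating only when nothing matches. The signatures and action names below are invented for illustration; Zyphra has not published Aegis internals.

```python
# Toy fault classifier in the spirit of a service like Aegis: match
# known failure signatures in log lines and map each to an automatic
# remedy. Signatures and remedies are hypothetical, not Aegis's own.

REMEDIES = [
    ("NIC link down",     "restart_nic"),
    ("uncorrectable ECC", "cordon_node"),
    ("RCCL timeout",      "retry_collective"),
]

def diagnose(log_line: str) -> str:
    """Return the automatic remedy for the first matching signature."""
    for signature, action in REMEDIES:
        if signature in log_line:
            return action
    return "escalate_to_operator"

print(diagnose("eth2: NIC link down, resetting"))    # restart_nic
print(diagnose("GPU3 reported uncorrectable ECC"))   # cordon_node
print(diagnose("something unprecedented happened"))  # escalate_to_operator
```

The value of this pattern is less the matching itself than the policy it encodes: routine faults burn seconds of automation instead of hours of idle GPU time waiting for an operator.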
Checkpoints are written in parallel across all GPUs instead of being funneled through a single chokepoint. Zyphra reports checkpoint saves more than 10x faster than simpler approaches, which directly improves uptime and reduces operator workload.
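The speedup from parallel checkpoint writes follows directly from bandwidth scaling with the number of writers. The timing model below is a simple illustration with assumed figures (checkpoint size, per-writer bandwidth, rank count), not a benchmark of Zyphra's system.

```python
# Simple timing model for sharded checkpointing: total write time is
# the per-writer share of the checkpoint divided by per-writer
# bandwidth. All numbers are assumptions for illustration.

def checkpoint_time(total_gib: float, per_writer_gib_s: float, writers: int) -> float:
    """Seconds to write `total_gib` split evenly across parallel writers."""
    return (total_gib / writers) / per_writer_gib_s

total = 1024  # assumed checkpoint size in GiB
bw = 2.0      # assumed sustained GiB/s per writer

single = checkpoint_time(total, bw, writers=1)    # 512.0 s through one chokepoint
sharded = checkpoint_time(total, bw, writers=64)  # 8.0 s across 64 ranks
print(f"speedup: {single / sharded:.0f}x")        # speedup: 64x
```

In practice the gain is capped by the storage backend's aggregate bandwidth and metadata overhead, which is why reported speedups (10x and up) are lower than the ideal linear scaling.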
What the ZAYA1 AMD training milestone means for AI procurement
The report draws a clear line between NVIDIA's ecosystem and AMD's equivalents: NVLink and InfinityFabric, NCCL and RCCL, cuBLASLt and hipBLASLt. The authors argue that the AMD stack is now mature enough for serious large-scale model development.
None of this suggests that companies should rip out their existing NVIDIA clusters. A more practical approach is to keep NVIDIA for production and use AMD for stages that benefit from the MI300X's memory capacity and ROCm's openness. That spreads supplier risk and expands total training capacity without major disruption.
All of this leads to a series of recommendations. Treat the model's shape as adjustable rather than fixed. Design the network around the collective operations training will actually use. Build fault tolerance that protects GPU time rather than merely logging failures. And modernize checkpointing to keep the training rhythm consistent.
This is not a manifesto, just practical lessons from what Zyphra, AMD, and IBM have learned by training large-scale MoE AI models on AMD GPUs. This is a potentially useful blueprint for organizations looking to expand their AI capabilities without relying on just one vendor.
See also: Google commits to increasing AI infrastructure by 1,000x over next 4-5 years


