Zyphra Unveils ZAYA1-8B: AMD Hardware-Trained MoE Model That Punches Far Above Its Weight Class

Zyphra AI released ZAYA1-8B, a small Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained end-to-end on AMD hardware, the model outperforms open-source models many times its size on math and code benchmarks, and is now available under the Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud.
With fewer than 1 billion active parameters, ZAYA1-8B scores competitively with frontier reasoning models such as DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet on challenging mathematical reasoning tasks. With its novel test-time compute method, Markovian RSA, it outperforms Claude 4.5 Sonnet and GPT-5-High on HMMT'25 (89.6 vs 88.3) and is on par with large open-weight models like DeepSeek-V3.2 on math benchmarks.
What Is a Mixture of Experts Model, and Why Do Active Parameters Matter?
The difference between 'active' and 'total' parameters matters here. In a standard dense model, every parameter participates in processing every input token. In a Mixture of Experts model, only a small subset of the network's parameters – the 'experts' selected by a router – is activated for each token. ZAYA1-8B has 8.4B total parameters, but only 760M are active per forward pass. This dramatically reduces compute and memory-bandwidth requirements while retaining much of the representational capacity of a far larger model. The sketch below illustrates the routing mechanism.
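Here is a minimal, illustrative top-k MoE layer in PyTorch. It is not Zyphra's implementation – the dimensions, expert count, and top-k value are placeholders – but it shows why only a fraction of the total parameters run per token.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer; all sizes are hypothetical placeholders."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the top_k selected experts run per token, so "active" parameters
        # are a small fraction of "total" parameters.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

With 16 experts and top-2 routing, only 2/16 of the expert parameters execute for any given token – the same principle behind ZAYA1-8B's 760M-active / 8.4B-total split.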
Because of this, ZAYA1-8B can run on-device for local LLM applications, performs well inside test-time compute harnesses, and serves with lower latency than dense models of similar benchmark performance.

Architecture: MoE++ and Three Key Innovations
ZAYA1-8B is built on Zyphra's MoE++ architecture, which introduces three specific changes on top of standard MoE designs. Together, these serve Zyphra's stated design objective for the model: maximizing intelligence per parameter and per FLOP.
- Compressed Convolutional Attention (CCA), a sequence-mixing method developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression relative to conventional attention. The KV cache is the memory used at inference time to store attention key/value states – the 8× reduction directly lowers runtime memory requirements and allows longer active sequences within the same hardware envelope (see the sizing arithmetic after this list).
- An MLP-based router with PID-controlled balancing. Standard MoE routers typically use a simple linear projection to decide which expert processes a given token. Zyphra replaces this with an MLP-based router and adds a PID-controller bias adjustment to improve routing stability – actively preventing load imbalance across experts, a known failure mode in MoE training (sketched after this list).
- Learned residual scaling, which controls the growth of residual-stream activations with depth at negligible parameter and FLOP cost. In deep networks, residual activations can grow unstably layer upon layer; learned scaling addresses this without adding meaningful overhead.
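To make the KV-cache claim concrete, here is back-of-the-envelope sizing in Python. The dimensions are hypothetical placeholders, not ZAYA1's published configuration; only the 8× factor comes from the announcement.

```python
# Rough KV-cache sizing for conventional attention (hypothetical dimensions).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store (seq_len, n_kv_heads, head_dim) per layer, in fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

baseline = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)
print(f"conventional: {baseline / 2**30:.1f} GiB, with 8x CCA: {baseline / 8 / 2**30:.1f} GiB")
# -> conventional: 16.0 GiB, with 8x CCA: 2.0 GiB
```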
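And here is a hedged sketch of what an MLP router with PID-controlled balancing could look like. Zyphra has not published its controller gains or update rule, so every constant below is a placeholder; the point is the structure: routing logits plus a bias term that a PID loop nudges toward uniform expert load.

```python
import torch
import torch.nn as nn

class PIDBalancedRouter(nn.Module):
    """Sketch only: MLP routing logits with a PID-adjusted per-expert bias."""
    def __init__(self, d_model=1024, n_experts=16, kp=0.5, ki=0.01, kd=0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                                 nn.Linear(d_model, n_experts))
        self.kp, self.ki, self.kd = kp, ki, kd
        self.register_buffer("bias", torch.zeros(n_experts))
        self.register_buffer("err_integral", torch.zeros(n_experts))
        self.register_buffer("prev_err", torch.zeros(n_experts))

    def forward(self, x, top_k=2):  # x: (tokens, d_model)
        logits = self.mlp(x) + self.bias  # bias steers tokens toward underused experts
        _, idx = logits.topk(top_k, dim=-1)
        if self.training:
            with torch.no_grad():
                # Error = each expert's load share minus the uniform target 1/n.
                load = torch.zeros_like(self.bias)
                load.scatter_add_(0, idx.flatten(),
                                  torch.ones(idx.numel(), device=x.device))
                err = load / idx.numel() - 1.0 / len(self.bias)
                self.err_integral += err
                self.bias -= (self.kp * err + self.ki * self.err_integral
                              + self.kd * (err - self.prev_err))
                self.prev_err = err.clone()
        return idx  # a real router also returns gating weights for the experts
```

The proportional term reacts to the current imbalance, the integral term removes persistent drift, and the derivative term damps oscillation – standard PID behavior applied to expert load.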
Training Infrastructure: Fully Built on AMD
ZAYA1-8B was pretrained, mid-trained, and fine-tuned end-to-end on the AMD Instinct MI300 stack. The full training pipeline ran on a cluster of 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara networking, on a custom training cluster built with IBM.
Reasoning-First Pretraining and a Five-Stage Post-Training Pipeline
The performance of ZAYA1-8B reflects innovation across the full stack: Zyphra's MoE++ architecture, the reasoning-first pretraining recipe, the cascaded RL methodology, and the novel Markovian RSA test-time compute method.
Zyphra's post-training pipeline consists of five consecutive stages (recapped in the config sketch after the list):
- The first is a standard SFT stage covering basic conversation, instruction following, coding, math, and test-time compute (TTC) skills.
- The second is a reasoning warm-up with math, logic, and puzzle-solving tasks, plus TTC data that trains the model to aggregate candidate solutions.
- The third is a large RLVE-Gym stage with dynamically adjusted puzzle difficulty to train core reasoning circuits.
- The fourth is a large math and code RL stage to deepen capability in these two key domains.
- Finally, a light RLHF/RLAIF stage improves conversational behavior, instruction following, and writing style.
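For readability, the same pipeline as a schematic config list – the field names and wording are illustrative, not Zyphra's actual configuration format:

```python
# Schematic recap of the five-stage post-training pipeline (illustrative only).
POST_TRAINING_STAGES = [
    {"stage": 1, "kind": "SFT",           "focus": ["chat", "instructions", "code", "math", "TTC"]},
    {"stage": 2, "kind": "SFT warm-up",   "focus": ["math", "logic", "puzzles", "candidate aggregation"]},
    {"stage": 3, "kind": "RL (RLVE-Gym)", "focus": ["puzzles with dynamic difficulty"]},
    {"stage": 4, "kind": "RL",            "focus": ["math", "code"]},
    {"stage": 5, "kind": "RLHF/RLAIF",    "focus": ["conversation", "instructions", "style"]},
]
```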
Zyphra's research team saw significant improvements in math and coding skills during RL, with small but meaningful gains on multiple-choice benchmarks (MMLU and GPQA-Diamond) and on non-verifiable tasks such as creative writing.
Markovian RSA: A Novel Test-Time Compute Method
The most important technical contribution aside from the model itself is Markovian RSA, a test-time compute (TTC) scheme that combines two earlier ideas in a new way.
The first is Recursive Self-Aggregation (RSA), which generates multiple reasoning traces in parallel and repeatedly aggregates them at every iteration. The second is Markovian Thinking, which constrains reasoning to fixed-size chunks – only the tail end of the previous chunk is carried into the next one, keeping the context window bounded no matter how long the model reasons.
Markovian RSA combines the two: at each step, multiple traces are generated in parallel; fixed-length tail segments are extracted from each trace; new aggregation prompts are built from small samples of the population; and these combined prompts seed the next round of parallel generations. The result has favorable inference characteristics – generation is batched and parallel, and the Markovian chunking strategy ensures that no sequence ever exceeds a fixed context window size.
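A hedged sketch of that loop, assuming a generic `generate(prompt, max_tokens)` sampling call; the trace count, round count, chunk and tail sizes, and aggregation prompt wording are all invented for illustration and are not Zyphra's published settings:

```python
import random

def markovian_rsa(problem, generate, n_traces=8, n_rounds=4,
                  chunk_tokens=4096, tail_tokens=512, group_size=3):
    prompts = [problem] * n_traces
    for _ in range(n_rounds):
        # 1) Generate fixed-size reasoning chunks in parallel (batched in practice).
        traces = [generate(p, max_tokens=chunk_tokens) for p in prompts]
        # 2) Keep only each trace's tail: the "Markovian" state carried forward.
        #    (Approximated by a character slice here; a real harness slices tokens.)
        tails = [t[-tail_tokens:] for t in traces]
        # 3) Build aggregation prompts from small random subsets of the population.
        prompts = []
        for _ in range(n_traces):
            sample = random.sample(tails, k=min(group_size, len(tails)))
            prompts.append(
                problem + "\n\nCandidate reasoning tails:\n" + "\n---\n".join(sample)
                + "\n\nAggregate these and continue reasoning."
            )
    return traces  # final round's traces; an answer-extraction step would follow
```

Because each round restarts from short tails rather than full histories, the per-sequence context stays bounded regardless of how many rounds the harness runs.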
An important finding is that co-design between the post-training method and the reasoning harness matters. ZAYA1-8B was explicitly trained to understand and respond to Markovian RSA aggregation prompts, with aggregation phases included in SFT and continued through RL. When Zyphra applied the same harness to Qwen3-4B-Thinking-2507, which lacks this co-training, the performance gain was much smaller – meaning the harness and the post-training must be developed together to realize the benefits.
With Markovian RSA at a test-time compute budget of 5.5 million tokens per problem, ZAYA1-8B outperforms DeepSeek-V3.2 and GPT-OSS-High on the challenging APEX math shortlist benchmark.
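For a sense of scale, here is one way such a budget could decompose under the harness sketched above – every number below is an assumption, since only the 5.5M total is reported:

```python
# Hypothetical decomposition of the ~5.5M-token test-time budget.
n_traces, n_rounds, tokens_per_chunk = 16, 8, 43_000
total = n_traces * n_rounds * tokens_per_chunk
print(f"{total:,}")  # 5,504,000 ~= 5.5M tokens per problem
```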
Benchmark Results
In an in-class comparison against similarly sized models, ZAYA1-8B scores 89.1 on AIME'26, 71.6 on HMMT Feb '26, 59.3 on IMO-AnswerBench, 32.2 on the APEX Shortlist, and 65.8 on LiveCodeBench-v6, beating Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it in all math and coding categories.
Compared against large open-weight models, ZAYA1-8B with 760M active parameters directly outperforms Mistral-Small-4-119B (6B active, 119B total) on math and code benchmarks – 89.1 vs 86.4 on AIME'26, 71.6 vs 60 on HMMT Feb '26, and 65.8 vs 57.9 on LiveCodeBench-v6. Mistral-Small-4-119B keeps the advantage on GPQA-Diamond (77.2 vs 71.0) and MMLU-Pro (81.6 vs 74.2), where breadth of knowledge matters more than depth of mathematical reasoning.

Key Takeaways
- ZAYA1-8B delivers frontier-level math and code performance with only 760M active parameters, beating open-weight models many times its size.
- Its MoE++ architecture introduces three innovations – CCA with 8× KV-cache compression, an MLP-based router with PID-controlled balancing, and learned residual scaling – to maximize intelligence per parameter and per FLOP.
- A novel test-time compute method, Markovian RSA, which combines Recursive Self-Aggregation and Markovian chunking, pushes ZAYA1-8B past DeepSeek-V3.2 and GPT-OSS-High on the APEX shortlist at 5.5M tokens per problem.
- ZAYA1-8B is the first MoE model pretrained, mid-trained, and fully post-trained on AMD Instinct MI300 hardware – on a 1,024-GPU MI300X cluster built with IBM.
- Released under Apache 2.0, it is available on Hugging Face and Zyphra Cloud.
Check out the Paper, Model weights, and Technical details.



