Meet AntAngelMed: A 103B-Parameter Open-Source Medical Language Model Built on 1/32 Activation-Ratio MoE Architecture

A group of researchers from China has released AntAngelMed, an open-source medical language model that they describe as the largest and most capable of its kind currently available.
What is AntAngelMed?
AntAngelMed is a medical-domain language model with 103 billion parameters in total, but it does not use all of them for every computation. Instead, it uses a Mixture-of-Experts (MoE) architecture with an activation ratio of 1/32, meaning only 6.1 billion parameters are active whenever a query is processed.
It helps to know how MoE architectures work. In a typical dense model, every parameter participates in processing every token. In an MoE model, the network is divided into 'expert' sub-networks, and a routing mechanism selects a small subset of them to handle each input. This lets a model carry a very large total parameter count – often associated with stronger capability – while keeping the actual compute cost proportional to the much smaller number of active parameters.
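The routing idea above can be sketched in a few lines. This is a generic illustrative top-k softmax router, not AntAngelMed's actual routing code; the expert count, dimensions, and `moe_route` helper are assumptions for the sketch.

```python
import numpy as np

def moe_route(token: np.ndarray, experts: list, router_w: np.ndarray, k: int = 2):
    """Route one token through the top-k of n experts.

    Only k expert sub-networks run per token, so compute scales with
    roughly k/n of the total expert parameters.
    """
    logits = router_w @ token                      # one score per expert
    topk = np.argsort(logits)[-k:]                 # indices of the k best experts
    gates = np.exp(logits[topk])
    gates /= gates.sum()                           # softmax over the selected experts
    # Weighted sum of only the chosen experts' outputs
    return sum(g * experts[i](token) for g, i in zip(gates, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 32                               # toy sizes, mirroring a 1/16 ratio here
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
out = moe_route(rng.normal(size=d), experts, router_w, k=2)  # only 2 of 32 experts execute
```

Scaling this picture up, a 1/32 activation ratio means that for every token only a small fixed fraction of the expert parameters is touched, which is how a 103B-parameter model runs at roughly 6.1B-parameter cost.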
AntAngelMed inherits this design from Ling-flash-2.0, a base model developed by inclusionAI and guided by what the team calls the Ling Scaling Laws. Architectural refinements on top include refined expert granularity, a shared-expert ratio, balanced attention design, aux-loss-free sigmoid routing, an MTP (Multi-Token Prediction) layer, QK-Norm, and Partial-RoPE (Rotary Position Embedding applied to only a subset of each attention head's dimensions). According to the research team, these design choices together let small-activation MoE models deliver 7× the efficiency of dense architectures of comparable cost, meaning that with only 6.1B activated parameters, AntAngelMed can nearly match the performance of a 40B dense model. Separately, as output length grows during generation, the relative speed advantage can reach 7× or more over dense models of the same size.
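Of the refinements listed above, Partial-RoPE is easy to show concretely: rotary position embedding is applied to only the leading fraction of a query/key head's dimensions, while the remaining dimensions pass through untouched. The sketch below is a minimal illustration under assumed sizes; the `rope_frac` split and `partial_rope` helper are not taken from the model's actual implementation.

```python
import numpy as np

def partial_rope(q: np.ndarray, pos: int, rope_frac: float = 0.5,
                 base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to only the first `rope_frac`
    of a head's dimensions; the tail passes through unrotated."""
    d = q.shape[-1]
    d_rope = int(d * rope_frac)          # dims that receive RoPE (kept even)
    half = d_rope // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = q[:half], q[half:d_rope]    # rotate dimension pairs (x1, x2)
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rotated, q[d_rope:]])  # tail dims unchanged

q = np.ones(8)
out = partial_rope(q, pos=3, rope_frac=0.5)
```

Only the first four of the eight dimensions are rotated here; the last four come out identical to the input, which is the defining property of the partial variant.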

Training Pipeline
AntAngelMed uses a three-stage training process designed to layer deep medical specialization on top of general language understanding.
The first stage is continued pre-training on large medical corpora, including encyclopedias, web text, and academic publications. This stage starts from the Ling-flash-2.0 checkpoint, which gives the model a solid foundation of general reasoning before medical specialization begins.
The second stage is Supervised Fine-Tuning (SFT), where the model is trained on a multi-source instruction dataset. This dataset mixes general reasoning tasks – math, programming, logic – to preserve reasoning ability, alongside clinical content such as patient Q&A, diagnostic reasoning, and safety-and-ethics cases.
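A multi-source SFT mixture like the one described is typically assembled by sampling sources according to fixed weights. The sketch below shows the idea; the source names and ratios are hypothetical, not the paper's actual data recipe.

```python
import random

# Hypothetical source names and mixing weights -- illustrative only,
# not AntAngelMed's actual SFT data ratios.
sources = {
    "general_math":     0.15,
    "code":             0.15,
    "logic":            0.10,
    "patient_qa":       0.30,
    "diagnostic_cases": 0.20,
    "safety_ethics":    0.10,
}

def sample_batch(sources: dict, batch_size: int, seed: int = 0) -> list:
    """Draw an SFT batch whose composition follows the mixing weights,
    keeping general reasoning data alongside clinical data."""
    rng = random.Random(seed)
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=batch_size)

batch = sample_batch(sources, batch_size=8)
```

Keeping general-domain sources at a nonzero weight is what guards against the model losing math and coding ability while it specializes.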
The third stage is Reinforcement Learning using the GRPO (Group Relative Policy Optimization) algorithm, combined with task-specific reward models. GRPO, originally introduced in the DeepSeekMath paper, is a variant of PPO that estimates the advantage baseline from group scores rather than from a separate critic model, making it computationally simpler. Here, reward signals are designed to shape the model's behavior toward empathy, structured clinical responses, safety guardrails, and evidence-based reasoning – all with the goal of reducing bias in clinical question answering.
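The critic-free part of GRPO is just a normalization over a group of sampled responses to the same prompt: each response's reward is scored relative to the group's mean and standard deviation. A minimal sketch:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: each sampled response is scored against
    the mean and std of its own group, with no learned critic network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for one group of responses sampled for the same prompt
group = np.array([0.2, 0.9, 0.5, 0.4])
adv = grpo_advantages(group)  # positive for above-average responses
```

Responses scoring above the group mean get positive advantages and are reinforced; below-average ones are pushed down, all without the separate value model PPO normally maintains.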
Inference Performance
On H20 hardware, AntAngelMed exceeds 200 tokens per second, which the research team reports is about 3× faster than a 36-billion-parameter dense model. With YaRN (Yet Another RoPE Extension), it supports a 128K context length – long enough to handle full clinical documents, extended patient histories, or multi-turn medical consultations.
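In practice, YaRN-style context extension is usually expressed as a small RoPE-scaling entry in the model config, as exposed by the Hugging Face `transformers` config format. The fragment below is illustrative: the native window and scaling factor are assumptions, not AntAngelMed's published values.

```python
# Illustrative config fragment for extending a model's context window
# with YaRN RoPE scaling (Hugging Face `transformers`-style fields).
# The native window (32K) and factor (4.0) are assumed numbers.
config_update = {
    "max_position_embeddings": 131072,  # 128K target context
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,                  # 32K native -> 128K extended
        "original_max_position_embeddings": 32768,
    },
}

scaling = config_update["rope_scaling"]
extended = int(scaling["factor"] * scaling["original_max_position_embeddings"])
```

The scaling factor times the original training window gives the extended context length, which is why the two fields must be kept consistent with `max_position_embeddings`.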
The research team also released an FP8 version of the model. When this quantization is combined with EAGLE3 speculative decoding, throughput at a concurrency of 32 improves markedly over FP8 alone: 71% on HumanEval, 45% on GSM8K, and 94% on Math-500. These benchmarks measure coding and math tasks – not medical tasks specifically – but serve as proxies for how the speedup holds up across different output types.
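Why speculative decoding speeds things up can be shown with a toy draft-and-verify loop. EAGLE3 trains a lightweight draft head rather than a separate model, but the accept/verify logic is conceptually like this sketch; the `speculative_step` helper and token sets are invented for illustration.

```python
def speculative_step(draft_tokens, target_accepts):
    """One speculative-decoding step: a cheap draft proposes several
    tokens, the full model verifies them in a single pass, and the step
    keeps the longest accepted prefix plus one corrected token.
    `target_accepts(tok)` stands in for the target model's check."""
    accepted = []
    for tok in draft_tokens:
        if target_accepts(tok):
            accepted.append(tok)          # verified draft token, nearly free
        else:
            accepted.append("corrected")  # fall back to the target's own token
            break
    return accepted

draft = ["the", "patient", "reports", "xyz", "pain"]
target_vocab_ok = {"the", "patient", "reports", "pain"}
out = speculative_step(draft, lambda t: t in target_vocab_ok)
# "xyz" is rejected, so this step still emits 4 tokens instead of 1
```

When most draft tokens are accepted, each target-model pass yields several tokens instead of one, which is where the reported throughput gains come from.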
Benchmark results
On HealthBench, an open-source medical evaluation benchmark from OpenAI that uses simulated multi-turn medical conversations to measure real-world clinical performance, AntAngelMed ranks first among all open-source models and outperforms several top proprietary models, with a particularly notable advantage on the HealthBench-Hard subset.
On MedAIBench, a benchmark maintained by the China National Artificial Intelligence Medical Industry Pilot Facility, AntAngelMed ranks near the top, with particularly strong scores in the medical knowledge Q&A and medical ethics and safety categories.
On MedBench, a benchmark for Chinese healthcare LLMs that includes 36 independently curated datasets and nearly 700,000 samples across five dimensions – medical knowledge question answering, medical language comprehension, medical language generation, complex medical reasoning, and safety and ethics – AntAngelMed ranks first overall.
Key Takeaways
- AntAngelMed is a 103B-parameter open medical LLM that activates only 6.1B parameters at inference time, using the 1/32 activation-ratio MoE design inherited from Ling-flash-2.0.
- It uses a three-stage training pipeline: continued pre-training on medical corpora, SFT on mixed general and clinical instruction data, and GRPO-based reinforcement learning for safety and diagnostic reasoning.
- On H20 hardware, the model exceeds 200 tokens/s and supports 128K context lengths using YaRN extrapolation, about 3× faster than a comparable 36B dense model.
- AntAngelMed ranks first among open-source models on OpenAI's HealthBench, surpassing several proprietary models, and tops both the MedAIBench and MedBench leaderboards.
- The model is available on Hugging Face, ModelScope, and GitHub; the model weights are Apache 2.0-licensed, the code is MIT-licensed, and a quantized FP8 version has also been released.
Check out the model weights on Hugging Face, the GitHub repo, and the technical details.



