
TII Abu Dhabi Releases Falcon-H1R-7B: A New Efficient Reasoning Model That Excels at Math with Only 7B Params and a 256K Context Window

The Technology Innovation Institute (TII), Abu Dhabi, has released Falcon-H1R-7B, a specialized 7B-parameter reasoning model that matches or surpasses much larger 14B to 47B reasoning models on math, code and general benchmarks, while remaining compact and efficient. It builds on Falcon-H1-7B Base and is available on Hugging Face under the Falcon-H1R collection.

Falcon-H1R-7B is interesting because it combines three design choices in one system: a hybrid Transformer and Mamba2 core, a very long context window that reaches 256K tokens in a standard vLLM deployment, and a training recipe that combines supervised long-form reasoning with reinforcement learning via GRPO.

Hybrid Transformer and Mamba2 architecture with long context

Falcon-H1R-7B is a decoder-only model with a hybrid architecture that combines Transformer layers and Mamba2 state space (SSM) blocks. The Transformer blocks provide full attention over the context, while the Mamba2 blocks provide linear-time sequence modeling and better memory scaling as the context length grows. The design targets three axes of reasoning efficiency defined by the team: speed, token efficiency and accuracy.
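To make the layer-mixing idea concrete, here is a toy PyTorch sketch of a decoder stack that alternates attention blocks with an SSM-style block. This is not the Falcon-H1R architecture: the dimensions, layer counts and the simple gated linear recurrence standing in for Mamba2 are illustrative placeholders.

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        # Causal mask: each position may only attend to itself and the past.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return x + out

class GatedRecurrenceBlock(nn.Module):
    # Stand-in for a Mamba2 block: a per-channel gated linear recurrence,
    # processed with an O(T) scan instead of O(T^2) attention.
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        state = torch.zeros(x.size(0), x.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):
            state = self.decay * state + h[:, t]   # constant memory per step
            outs.append(state)
        y = torch.stack(outs, dim=1) * torch.sigmoid(gate)
        return x + self.out_proj(y)

# Alternate the two block types, as a hybrid decoder might.
d_model, n_layers = 256, 6
stack = nn.Sequential(*[
    AttentionBlock(d_model) if i % 2 == 0 else GatedRecurrenceBlock(d_model)
    for i in range(n_layers)
])
tokens = torch.randn(2, 128, d_model)   # (batch, sequence length, hidden size)
print(stack(tokens).shape)              # torch.Size([2, 128, 256])

The point of the sketch is only the cost profile: the attention blocks pay a quadratic price in sequence length, while the recurrent blocks keep a fixed-size state, which is why mixing the two helps at very long context lengths.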

The model defaults to a --max-model-len of 262144 when served via vLLM, which corresponds to a 256K-token context window. This allows very long chains of thought, multi-step tool execution logs and large multi-document prompts in a single pass. The hybrid backbone helps keep memory usage under control at this sequence length and improves throughput compared with a pure Transformer 7B baseline on the same hardware.
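As a rough illustration of that serving setup, the sketch below loads the model at the full context length with vLLM's offline Python API. The repo id tiiuae/Falcon-H1R-7B is an assumption based on the collection name, so check the Falcon-H1R collection on Hugging Face for the exact identifier, and note that a 262144-token window needs substantial GPU memory.

from vllm import LLM, SamplingParams

llm = LLM(
    model="tiiuae/Falcon-H1R-7B",   # assumed repo id, verify on Hugging Face
    max_model_len=262144,           # the 256K-token context window
)

params = SamplingParams(temperature=0.6, max_tokens=32768)
outputs = llm.generate(
    ["Prove that the sum of the first n odd numbers is n^2. Think step by step."],
    params,
)
print(outputs[0].outputs[0].text)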

Training recipe for reasoning tasks

Falcon-H1R-7B uses a two-stage training pipeline:

In the first stage, the team performs supervised cold-start fine-tuning on top of Falcon-H1-7B Base. The SFT (supervised fine-tuning) data covers step-by-step reasoning traces in three large domains, math, coding and science, alongside non-reasoning domains such as conversation, safety and security. A difficulty-aware filter upweights hard problems and downweights trivial ones. Target sequences can be up to 48K tokens, so the model sees long outputs and full solution paths during training.

In the second stage, the SFT checkpoint is refined with GRPO, a group relative policy optimization method for reinforcement learning. Rewards are granted when the generated reasoning chain is verified as correct: for math problems, the system applies a symbolic check to the final answer, and for code, it runs the generated program against unit tests. This RL phase pushes the model to keep useful intermediate steps while staying within the token budget.
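A minimal sketch of what such verifiable rewards can look like is shown below. It is not TII's reward implementation: the boxed-answer convention and the helper names are assumptions, and a production math check would test symbolic equivalence (for example with SymPy) rather than the plain string match used here.

import re
import subprocess
import sys
import tempfile

def math_reward(completion, reference_answer):
    # Reward 1.0 if the final boxed answer matches the reference, else 0.0.
    # A real symbolic check would verify mathematical equivalence instead of
    # the exact string match used in this sketch.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def code_reward(program, test_code, timeout_s=10):
    # Reward 1.0 if the generated program passes its unit tests, else 0.0.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Toy usage: one verified math completion and one verified code completion.
print(math_reward(r"... so the answer is \boxed{42}", "42"))          # 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  "assert add(2, 3) == 5"))                           # 1.0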

The result is a 7B model that is explicitly tuned for long chain-of-thought reasoning rather than general conversation.

Benchmarks in math, coding and general reasoning

Falcon-H1R-7B's reported benchmark scores cover math, coding and agentic tasks, as well as general reasoning tasks.

In the math group, Falcon-H1R-7B achieves an average score of 73.96%, ahead of Apriel-1.5-15B at 69.32% and larger models such as Qwen3-32B and Nemotron-H-47B. On individual benchmarks:

  • AIME 24: 88.1%, ahead of Apriel-1.5-15B at 86.2%
  • AIME 25: 83.1%, versus 80% for Apriel-1.5-15B
  • HMMT 25: 64.9%, above all listed baselines
  • AMO Bench: 36.3%, compared to 23.3% for DeepSeek-R1-0528-Qwen3-8B

On coding and agentic tasks, the model reaches a group score of 33.95%. On LiveCodeBench v6, Falcon-H1R-7B scores 68.6%, higher than Qwen3-32B and the other baselines. It also scores 28.3% on the easier SciCode subset and 4.9% on Terminal Bench Hard, where it ranks second behind Apriel-1.5-15B but ahead of several 8B and 32B models.

On general reasoning, Falcon-H1R-7B reaches a group score of 49.48%. It records 61.3% on GPQA Diamond, close to other 8B models, 72.1% on MMLU Pro, higher than all other 8B models in the comparison, 11.1% on HLE, and 53.4% on IFBench, where it is second behind Apriel-1.5-15B.

The bottom line is that a 7B model can sit in the same performance band as much larger 14B to 47B reasoning models, provided the architecture and training pipeline are tuned for reasoning tasks.

Inference throughput and test time scaling

The team also evaluated Falcon-H1R-7B on inference throughput and test-time scaling under realistic serving settings.

With 512-token inputs and 32K-token outputs, Falcon-H1R-7B achieves about 1,000 tokens per second per GPU at a batch size of 32 and about 1,500 tokens per second per GPU at a batch size of 64, nearly doubling the throughput of Qwen3-8B in the same configuration. For 8K-token inputs and 16K-token outputs, Falcon-H1R-7B reaches around 1,800 tokens per second per GPU, while Qwen3-8B stays below 900. The hybrid Transformer and Mamba architecture is the main factor in this scaling behavior, because it reduces the quadratic cost of attention over long sequences.

Falcon-H1R-7B is also designed for test-time scaling using Deep Think with Confidence, known as DeepConf. The idea is to run multiple reasoning chains in parallel, then use the model's own token-level confidence to filter out noisy traces and keep only high-quality candidates.
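The sketch below illustrates that filtering idea in a few lines of Python. It is a simplified stand-in, not the exact DeepConf algorithm or TII's implementation: it uses the mean token log-probability of each sampled trace as the confidence signal and a plain majority vote over the surviving answers.

from collections import Counter

def deepconf_answer(traces, keep_ratio=0.5):
    # traces: (final_answer, mean_token_logprob) pairs from parallel samples.
    # Keep the most confident fraction of traces, then majority-vote.
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]

# Six parallel samples: the low-confidence noisy traces are filtered out
# before voting, so the surviving majority answer is "42".
samples = [("42", -0.12), ("42", -0.15), ("41", -0.90),
           ("42", -0.20), ("17", -1.30), ("41", -0.85)]
print(deepconf_answer(samples))   # 42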

On AIME 24 and AIME 25, Falcon-H1R-7B reaches 96.7% accuracy with fewer than 100 million generated tokens, which places it on the favorable side of the Pareto frontier of accuracy versus token cost compared with other 8B, 14B and 32B reasoning models. On the verified subset of AMO Bench, it achieves 35.9% accuracy with 217 million tokens, ahead of comparison models of the same or larger scale.

Key Takeaways

  • Falcon-H1R-7B is a 7B-parameter reasoning model that uses a hybrid Transformer and Mamba2 architecture and supports a 256K-token context window for long chains of reasoning.
  • The model is trained in two stages: supervised fine-tuning on long reasoning traces in math, code and science of up to 48K tokens, followed by GRPO-based reinforcement learning with verifiable math and code rewards.
  • Falcon-H1R-7B achieves strong math performance, including 88.1% on AIME 24, 83.1% on AIME 25 and a math average of 73.96%, which is competitive with or better than the larger 14B to 47B models.
  • On coding and agentic tasks, Falcon-H1R-7B gets a group score of 33.95% and 68.6% on LiveCodeBench v6, and it is also competitive on general reasoning benchmarks such as MMLU Pro and GPQA Diamond.
  • The hybrid design improves throughput, reaching approximately 1,000 to 1,800 tokens per second per GPU in the reported settings, and the model supports Deep Think with Confidence test-time scaling to improve accuracy from multiple reasoning samples under a controlled token budget.

Check out the technical details and the model weights on Hugging Face.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound and easily understood by a wide audience. The platform boasts more than 2 million monthly views, reflecting its popularity among readers.
