PokeeResearch-7B: An Open-Source 7B Deep Research Agent Trained with Reinforcement Learning from AI Feedback (RLAIF) and a Robust Reasoning Scaffold

Pokee AI has open-sourced PokeeResearch-7B, a 7B-parameter deep research agent that carries out full research loops: it decomposes the question, issues web searches and reads the returned pages, verifies candidate answers against the evidence, and synthesizes multiple research threads into a final answer.
The agent runs a research-and-verification loop. In research, it calls external tools for web search and page reading, or proposes an interim answer. In verification, it checks the answer against the returned evidence and either accepts it or resumes research. This structure reduces brittle trajectories and catches obvious errors before completion. The research team pairs this loop with a test-time synthesis stage that combines several independent research threads.
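To make the control flow concrete, the sketch below mirrors the described loop; `policy`, `web_search`, `read_page`, and `verify` are hypothetical stand-ins, not Pokee AI's released interfaces.

```python
from dataclasses import dataclass

# Hypothetical sketch of the research-and-verification loop; the injected
# callables are stand-ins for the real policy model and tool stack.

@dataclass
class Action:
    kind: str      # "search" | "read" | "answer"
    payload: str   # query, URL, or draft answer

MAX_TURNS = 100    # the evaluation caps tool-call turns at 100

def research_episode(question, policy, web_search, read_page, verify):
    context = [f"Question: {question}"]
    for _ in range(MAX_TURNS):
        action = policy(context)                  # propose a tool call or an answer
        if action.kind == "search":
            context.append(web_search(action.payload))
        elif action.kind == "read":
            context.append(read_page(action.payload))
        else:  # "answer": switch to the verification phase
            accepted, feedback = verify(context, action.payload)
            if accepted:
                return action.payload             # verified answer
            context.append(f"Verification failed: {feedback}")  # resume research
    return None  # turn budget exhausted without a verified answer
```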
Training recipe: RLAIF with RLOO
PokeeResearch-7B is fine-tuned from Qwen2.5-7B-Instruct using annotation-free Reinforcement Learning from AI Feedback, called RLAIF, with the REINFORCE Leave-One-Out algorithm, called RLOO. The reward targets semantic accuracy, citation faithfulness, and instruction adherence rather than token overlap. The model card lists a batch size of 64, 8 RL research threads per question, a 3e-6 learning rate, 14 training steps, BF16 precision, and a checkpoint of about 13 GB. The research team emphasizes that RLOO provides an unbiased on-policy, critic-free policy gradient, distinguishing it from the PPO family, which is approximately on-policy and relies on a learned critic.
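To illustrate the estimator, the snippet below computes RLOO's leave-one-out advantages for one question's sampled threads; only the 8-threads-per-question figure comes from the model card, and the reward values are invented.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for the K research threads sampled on one
    question: each thread's reward minus the mean reward of the other K-1
    threads. This is the critic-free, on-policy baseline RLOO uses."""
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)  # mean of the other threads
    return rewards - baselines

# 8 threads per question as on the model card; these rewards are made up.
print(rloo_advantages(np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])))
```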

Reasoning scaffold and Research Threads Synthesis
The scaffold includes three mechanisms. In self-correction, the agent detects malformed tool calls and recovers. In self-verification, the agent checks its answer against the gathered evidence. In Research Threads Synthesis, the agent runs several independent research threads per question, then synthesizes them into the final answer. The research team reports that the synthesis step improves accuracy on the hardest benchmarks.
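A minimal sketch of Research Threads Synthesis, again with hypothetical `run_thread` and `synthesize` stand-ins:

```python
def research_threads_synthesis(question, run_thread, synthesize, n_threads=4):
    """Launch independent research threads, then have the model read all of
    their answers and evidence and write one reconciled final answer."""
    threads = [run_thread(question) for _ in range(n_threads)]
    return synthesize(question, threads)  # cross-thread synthesis step
```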

Experimental protocol
The research team evaluates text-only questions from 10 benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity's Last Exam (HLE). They sample 125 questions per dataset, except GAIA with 103, for 1,228 questions in total. For each question they run 4 research threads and report mean@4 accuracy, using Gemini-2.5-Flash-Lite to judge correctness. The maximum number of tool-call turns is set to 100.
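mean@4 averages the judge's verdicts over a question's 4 threads and then over questions; a toy computation with invented verdicts:

```python
def mean_at_k(verdicts_per_question):
    """mean@k: average the judge's binary verdicts over the k threads of
    each question, then average over questions."""
    per_question = [sum(v) / len(v) for v in verdicts_per_question]
    return sum(per_question) / len(per_question)

# Toy example with k = 4 threads per question (verdicts are illustrative):
print(mean_at_k([[1, 1, 0, 1], [0, 0, 1, 0]]))  # -> 0.5
```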

Results at the 7B scale
PokeeResearch-7B reports state-of-the-art mean@4 accuracy among 7B deep research agents across the 10 datasets. On HLE the model scores 15.2 without RTS and 17.6 with RTS. On GAIA it scores 36.9 without RTS and 41.3 with RTS. On BrowseComp it scores 5.4 without RTS and 8.4 with RTS. On the seven QA benchmarks, Bamboogle, 2WikiMultiHopQA, TriviaQA, NQ, PopQA, Musique, and HotpotQA, the model improves over recent 7B baselines. The gains from RTS are largest on HLE, GAIA, and BrowseComp, and smaller on the QA benchmarks.
Key Takeaways
- Training recipe: PokeeResearch-7B fine-tunes Qwen2.5-7B-Instruct with RLAIF using the RLOO estimator, optimizing rewards for factual accuracy, citation faithfulness, and instruction adherence rather than token overlap.
- Reasoning scaffold: The agent runs a research-and-verification loop with self-correction and uses Research Threads Synthesis, launching multiple independent threads and then verifying the synthesized final answer.
- Experimental protocol: Benchmarks span 10 datasets with 125 queries each, except GAIA with 103; 4 threads per query; mean@4 accuracy judged by Gemini-2.5-Flash-Lite; and a 100-turn tool-call cap.
- Results and release: PokeeResearch-7B reports state-of-the-art results among 7B deep research agents, for example HLE 17.6 with RTS and BrowseComp 8.4 with RTS, and is released under Apache-2.0 with code and weights publicly available.
PokeeResearch-7B is a useful step for practical deep research agents. It aligns training with RLAIF using RLOO, so the objective targets semantic accuracy, citation faithfulness, and instruction adherence. The reasoning scaffold adds self-verification and Research Threads Synthesis, which lift the hardest benchmarks. Evaluation uses mean@4 with Gemini-2.5-Flash-Lite as the judge across 10 datasets. The release ships Apache-2.0 code and weights with a clear tool stack built on Serper and Jina, and the setup runs on an A100 80 GB GPU.
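For readers assembling a similar stack, here is a minimal sketch of Serper search and Jina Reader page fetching; the endpoint shapes follow the two services' public documentation, and how PokeeResearch wires them internally is an assumption rather than confirmed detail.

```python
import os
import requests

def serper_search(query: str) -> list:
    """Web search via Serper's Google Search API (per its public docs)."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("organic", [])  # title/link/snippet entries

def jina_read(url: str) -> str:
    """Fetch an LLM-friendly markdown rendering of a page via Jina Reader."""
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    return resp.text
```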
Check out the Paper, the model on Hugging Face, and the GitHub repo.