A new study shows how to train high-performing software AI agents with only 78 examples

Can a small set of carefully selected demonstrations teach stronger software agency than broad masses of regular training data? A group of researchers from Shanghai Jiao Tong University and the SII Generative AI Research Lab (GAIR) propose LIMI ("Less Is More for Agency"), a supervised fine-tuning method that turns a base model into a capable software/research agent using just 78 samples. LIMI scores 73.5% on average on AgencyBench (FTFC 71.7%, RC@3 74.2%, SR@3 74.6%), beating strong baselines (Qwen3-235B-A22B 27.5%, DeepSeek-V3.1 11.9%) and variants trained on 10,000 samples, with 128× less data.

What's new?
- The Agency Efficiency Principle: the authors argue that agentic skill scales with data quality and structure far more than with raw sample count. They fine-tune GLM-4.5 / GLM-4.5-Air on 78 long-horizon, tool-use trajectories (samples) and report leading results on AgencyBench and on generalization suites (TAU2-Bench, EvalPlus HumanEval/MBPP, DS-1000, SciCode).
- Minimal data, dense supervision: each trajectory (~132.4k tokens on average) captures the complete arc of multi-turn agentic work: model reasoning, tool calls, and environment observations, collected in the SII CLI execution environment. Tasks span "vibe coding" (interactive software development) and research workflows (search, analysis, experiment design); a sketch of one such record follows this list.
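To make the data shape concrete, here is a minimal sketch of what one trajectory record could look like. The schema, field names, and contents are hypothetical illustrations for intuition, not the paper's actual format.

```python
# Hypothetical record for one LIMI-style training sample: a single user query
# plus the full multi-turn arc of reasoning, tool calls, and observations.
trajectory = {
    "query": "Fix the failing unit test in the payments module",
    "source": "practitioner",  # or "github_pr" for the PR-derived queries
    "turns": [
        {"role": "assistant", "type": "reasoning",
         "content": "The failure looks like a rounding bug; run the test first."},
        {"role": "assistant", "type": "tool_call",
         "tool": "cli", "args": {"cmd": "pytest tests/test_payments.py -x"}},
        {"role": "environment", "type": "observation",
         "content": "FAILED tests/test_payments.py::test_rounding - AssertionError"},
        # ...many more turns of edits, re-runs, and verification...
        {"role": "environment", "type": "observation", "content": "1 passed"},
    ],
}
```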


How does this work?
- Base models: GLM-4.5 (355B) and GLM-4.5-Air (106B). Training uses the slime SFT framework with identical training configurations across comparisons (isolating data effects); a toy sketch follows this list.
- Data construction: 60 real queries from practitioners + 18 synthesized from high-star GitHub pull requests (with annotator QA). For each query, LIMI logs the complete agent trajectory through to successful completion in the SII CLI environment.
- Evaluation: AgencyBench (R = 3 rounds) with FTFC, SR@3, and RC@3, plus generalization suites (TAU2-airline/retail Pass^4, EvalPlus HumanEval+/MBPP+, DS-1000, SciCode); a rough sketch of the metrics also appears below.
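For intuition only, here is a toy supervised fine-tuning sketch using TRL's SFTTrainer. Everything here is an assumption for illustration (the checkpoint id, hyperparameters, and one-example dataset); the paper's actual run fine-tunes GLM-4.5 at 355B scale with its own training stack, which requires multi-node infrastructure rather than a script like this.

```python
# Toy SFT sketch (assumptions throughout; not the paper's pipeline).
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Stand-in for the 78 curated trajectories, in chat format.
toy_data = Dataset.from_list([{
    "messages": [
        {"role": "user", "content": "Fix the failing unit test in payments."},
        {"role": "assistant", "content": "Plan: run pytest, inspect the failure, patch the rounding bug, re-run to verify."},
    ],
}])

trainer = SFTTrainer(
    model="zai-org/GLM-4.5-Air",      # assumed checkpoint id on Hugging Face
    train_dataset=toy_data,
    args=SFTConfig(
        output_dir="limi-sft-sketch",
        num_train_epochs=3,           # assumed, not from the paper
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,           # assumed, not from the paper
    ),
)
trainer.train()
```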
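And a rough mental model of the three AgencyBench metrics. This aggregation is an assumption for illustration; the benchmark defines its own official scoring.

```python
from statistics import mean

# Toy episodes: per round (R = 3), the fraction of task requirements met.
episodes = [
    {"req_completion": [1.0, 1.0, 1.0]},  # solved on the first turn
    {"req_completion": [0.4, 0.8, 1.0]},  # solved by round 3
    {"req_completion": [0.2, 0.5, 0.7]},  # never fully solved
]

# FTFC: first-turn functional completeness (full success on turn 1).
ftfc = mean(ep["req_completion"][0] == 1.0 for ep in episodes)
# SR@3: success rate within 3 rounds (full success by the final round).
sr_at_3 = mean(ep["req_completion"][-1] == 1.0 for ep in episodes)
# RC@3: average requirement-completion ratio after 3 rounds.
rc_at_3 = mean(ep["req_completion"][-1] for ep in episodes)

print(f"FTFC={ftfc:.2f}  SR@3={sr_at_3:.2f}  RC@3={rc_at_3:.2f}")
```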


Results
- AgencyBench (avg.): 73.5% for LIMI vs. 45.1% for GLM-4.5 (+28.4 pts); FTFC 71.7% vs 37.8%; SR@3 74.6% vs 47.4%.
- Data efficiency: LIMI (78 samples) outperforms GLM-4.5 trained with AFM-CodeAgent SFT (10,000 samples): 73.5% vs 47.8%, a +53.7% relative gain with 128× less data (arithmetic sanity-checked after this list). Similar gaps hold vs AFM-WebAgent (7,610 samples) and CC-Bench-Traj (260 samples).
- Generalization: across tool-use / coding / scientific-computing suites, LIMI averages ~57%, surpassing GLM-4.5 and the other baselines; without tool access, LIMI still leads slightly (50.0% vs 48.7% for GLM-4.5), indicating gains intrinsic to the model rather than dependent on environment tooling.
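The headline efficiency figures follow directly from the reported numbers; a quick arithmetic check:

```python
limi, sft_baseline = 73.5, 47.8   # reported AgencyBench averages
n_limi, n_baseline = 78, 10_000   # training-sample counts

abs_gain = limi - sft_baseline               # 25.7 points absolute
rel_gain = abs_gain / sft_baseline * 100     # ~53.7% relative gain, as reported
data_ratio = n_baseline / n_limi             # ~128x less data

print(f"+{abs_gain:.1f} pts absolute, ~{rel_gain:.0f}% relative, {data_ratio:.0f}x less data")
```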


Key Takeaways
- Data efficiency dominates. LIMI reaches a 73.5% average on AgencyBench using just 78 curated trajectories, surpassing GLM-4.5 (45.1%) and showing a +53.7% relative gain over a 10k-sample SFT baseline with 128× fewer samples.
- Trajectory quality over bulk. The training data consists of long-horizon, tool-grounded workflows in collaborative software development and scientific research, collected via the SII CLI stack referenced in the paper.
- Metric gains across the board. On AgencyBench, LIMI reports FTFC 71.7%, SR@3 74.6%, and strong RC@3, with detailed tables showing large margins over baselines; the generalization suites (TAU2, EvalPlus HumanEval/MBPP, DS-1000, SciCode) average 57.2%.
- It works across scales. Fine-tuning both GLM-4.5 (355B) and GLM-4.5-Air (106B) yields large deltas over their respective bases, indicating the approach transfers across model sizes.
The research team fine-tunes GLM-4.5 variants on 78 curated, long-horizon, tool-use trajectories captured in the SII CLI environment, spanning software-engineering and research tasks. It reports 73.5% on AgencyBench with the FTFC, RC@3, and SR@3 metrics; the GLM-4.5 baseline is reported at 45.1%. A comparison against a 10,000-sample AFM-CodeAgent SFT baseline shows 73.5% vs 47.8%, and tool-free evaluations show a modest edge (≈50.0% for LIMI vs 48.7% for GLM-4.5). The trajectories are multi-turn and emphasize planning, tool orchestration, and verification.
Check out the Paper, GitHub page, and model card on Hugging Face for code and further details.



