
Hexo Labs Open-Sources SIA: A Self-Improving Agent That Updates Both Harness and Model Weights

Most AI agents stop improving when people stop tuning them. The model is frozen. The scaffolding around it is fixed. Hexo Labs wants to improve both at the same time. It released SIA (Self-Improving AI) this week as an open-source framework under the MIT license.

The paper's core claim is narrow but strong. SIA tunes both the agent harness and the model weights within a single self-improving loop.

What Is SIA (Self-Improving AI)

SIA divides the task-specific agent into two parts. The first is the harness, also called the scaffold. It includes the system prompt, tool-dispatch logic, retry policy, and response-extraction code. The second part is the model weights themselves.

Three LLM components drive the loop. The Meta-Agent writes the initial scaffold from the task specification and any reference code. A Task-Specific Agent runs the task and logs every step. The Feedback-Agent then reads that full trajectory and decides what to change.

That decision is the main idea. After each run, the Feedback-Agent chooses one of two actions. It can rewrite the scaffold while the weights stay fixed. Or it can trigger a weight update while the scaffold stays unchanged.
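The control flow described above can be sketched in a few lines of Python. This is a toy illustration, not the SIA implementation: `AgentState`, `self_improve`, `run_task`, and `choose_action` are hypothetical names, and the version counters stand in for a real scaffold rewrite and a real LoRA training step. In SIA the choice is made by an LLM reading the trace; here it is a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical stand-in for the two things the loop can change."""
    harness_version: int = 0  # bumped when the scaffold is rewritten
    weights_version: int = 0  # bumped when the weights are updated
    history: List[str] = field(default_factory=list)


def self_improve(
    state: AgentState,
    run_task: Callable[[AgentState], str],
    choose_action: Callable[[str, AgentState], str],
    generations: int,
) -> AgentState:
    """Run the task, read the trace, then change exactly one lever."""
    for _ in range(generations):
        trace = run_task(state)               # task agent runs and logs
        action = choose_action(trace, state)  # feedback step picks a lever
        if action == "harness":
            state.harness_version += 1        # weights stay fixed
        elif action == "weights":
            state.weights_version += 1        # scaffold stays fixed
        else:
            raise ValueError(f"unknown action: {action!r}")
        state.history.append(action)
    return state


# Stub policy for illustration only: two harness rewrites, then weights.
def demo_run(state: AgentState) -> str:
    return f"trace@h{state.harness_version}w{state.weights_version}"


def demo_choose(trace: str, state: AgentState) -> str:
    return "harness" if state.harness_version < 2 else "weights"


final = self_improve(AgentState(), demo_run, demo_choose, generations=5)
print(final.history)
```

The point of the sketch is the shape of the loop: each generation changes one lever and leaves the other untouched, and nothing forces the two actions into a fixed order.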

The base model is openai/gpt-oss-120b. Weight updates use LoRA, a low-rank adapter, at rank 32. The Meta-Agent and Feedback-Agent both run on Claude Sonnet 4.6. Training runs on H100 GPUs on Modal, the platform the team uses for RL.
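To see why a rank-32 adapter is cheap to train, recall that LoRA leaves a weight matrix W frozen and learns a low-rank update B·A, where A is rank × d_in and B is d_out × rank. The sketch below counts the trainable parameters this adds to one matrix. The layer width of 4096 is a made-up example, not a dimension of gpt-oss-120b, and the function names are illustrative.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense matrix itself."""
    return d_in * d_out


d = 4096  # hypothetical layer width, chosen only for illustration
added = lora_param_count(d, d, rank=32)
fraction = added / full_param_count(d, d)
print(f"adapter params: {added}, fraction of dense layer: {fraction:.4%}")
```

For a square layer of this size, the adapter is under 2% of the dense matrix, which is what makes repeated weight updates inside an improvement loop practical.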

The research team labels its two operating points SIA-H and SIA-W+H. SIA-H uses harness updates only. SIA-W+H adds weight updates on top.

The Benchmark Case

The research team deliberately tested SIA in three different domains. The pattern holds in all three: weight updates added gains beyond what scaffolding alone achieved. “Initial” is the base model using the first Meta-Agent scaffold, before any feedback.

Task | Initial | Prev. SOTA | SIA-H (harness only) | SIA-W+H (harness + weights)
LawBench (top-1 acc) | 13.5% | 45.0% | 50.0% | 70.1%
AlphaEvolve TriMul (reward) | 0.105 | 1.292 | 0.120 | 1.475
Denoising (mse_norm) | 0.048 | 0.240 | 0.241 | 0.289

In LawBench, the task is classifying Chinese criminal cases into 191 categories. Harness iteration built a TF-IDF and LinearSVC pipeline and plateaued at 50.0%. Weight updates with PPO then pushed accuracy to 70.1%. That's a gain of 20.1 percentage points over the harness-only best.

The TriMul task asks for a custom CUDA kernel on the H100 GPU. The kernel computes a core operation in the Evoformer module of AlphaFold2. Harness tuning reached a 1.14× speedup over baseline. Weight updates then cut the runtime from 12,483 to 1,017 microseconds. That's a 91.9% reduction relative to the harness-only best.

One honest caveat comes from the same table. The coding agent Claude Code reached 1.50× on TriMul unaided, beating SIA-H's 1.14×. SIA-W+H still leads at 14.02×.
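The TriMul figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the article; the variable names are illustrative. It computes the runtime reduction and compares three ratios that should roughly agree if the rewards, runtimes, and speedups describe the same improvement.

```python
# Runtimes quoted for TriMul, in microseconds.
harness_runtime_us = 12_483   # harness-only best
weights_runtime_us = 1_017    # after weight updates

# Fractional reduction in runtime relative to the harness-only best.
reduction = 1 - weights_runtime_us / harness_runtime_us

# Three ways of expressing "how much better is SIA-W+H than SIA-H".
runtime_ratio = harness_runtime_us / weights_runtime_us  # from runtimes
speedup_ratio = 14.02 / 1.14                             # from speedups
reward_ratio = 1.475 / 0.120                             # from table rewards

print(f"runtime reduction: {reduction:.1%}")
print(f"ratios: {runtime_ratio:.2f}, {speedup_ratio:.2f}, {reward_ratio:.2f}")
```

All three ratios land near 12.3, so the table rewards, the runtimes, and the speedup multiples are mutually consistent.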

For denoising, the agent tunes MAGIC, a single-cell RNA-seq imputation method. Harness-side hyperparameter sweeps plateaued at 0.241 mse_norm. The first checkpoint after a weight update added a two-line step that the scaffold had never produced: it rounded the imputed values to non-negative integers, raising the score to 0.289.
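The post-processing step described above is small enough to sketch directly. The article does not say which rounding rule the agent used, so round-half-up is an assumption here, and `round_to_counts` is a hypothetical name. The motivation is plausible on its face: raw single-cell expression counts are non-negative integers, while a smoothing method produces real-valued output.

```python
import math
from typing import List


def round_to_counts(imputed: List[List[float]]) -> List[List[int]]:
    """Round imputed expression values and clip them at zero.

    Uses round-half-up (an assumption; the source does not specify the
    rule), then clamps negatives to 0 so the output looks like counts.
    """
    return [
        [max(0, math.floor(value + 0.5)) for value in row]
        for row in imputed
    ]


# Toy cells-by-genes matrix with fractional and slightly negative values.
smoothed = [[0.4, 1.6], [-0.7, 2.5]]
print(round_to_counts(smoothed))  # [[0, 2], [0, 3]]
```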

How the Feedback-Agent Chooses Its Training Algorithm

SIA does not use one fixed RL recipe. The Feedback-Agent chooses a training algorithm based on the reward signal it perceives.

On LawBench, the reward was purely outcome-based, so it used PPO with GAE. On TriMul, most kernels failed, so it used an entropic advantage estimator, which up-weights rare, high-reward samples. For denoising, it used GRPO, which drops the value network entirely.
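Two of these choices can be illustrated on a single group of sampled rewards. The GRPO-style function below normalizes each reward against its own group, which is why no value network is needed. The article does not give the formula for the entropic estimator, so the exponential tilting shown here is one common reading and should be treated as an assumption; `beta` and both function names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def entropic_weights(rewards: List[float], beta: float = 1.0) -> List[float]:
    """Exponentially tilted sample weights (assumed form, not from the paper).

    Larger beta concentrates weight on the highest-reward samples.
    """
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


# A sparse group: three failed samples and one success.
group = [0.0, 0.0, 0.0, 1.0]
print([round(a, 3) for a in grpo_advantages(group)])
print([round(w, 3) for w in entropic_weights(group, beta=5.0)])
```

With a sparse group like this one, the tilted weights put almost all of the mass on the single success, which matches the article's description of up-weighting rare, high-reward samples.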

The research team also lists REINFORCE with KL-to-base, DPO, and best-of-N behavioral cloning. Each maps to a different reward shape and failure risk.

Strengths and What to Watch

Strengths:

  • The first system to tune both scaffolding and weights in one loop, according to the authors' comparison table.
  • Consistent gains over previous SOTA in three unrelated domains.
  • Open source under MIT, installable as sia-agent, with four bundled tasks.
  • Algorithm selection is based on observed rewards, not a fixed schedule.

What to watch:

  • The paper reports three tasks; a broader study of the algorithm-selection step is deferred.
  • Both levers optimize against the same fixed verifier, which risks compounding Goodhart effects.
  • The paper warns that a joint fixed point may be fragile under perturbation.

Marktechpost Visual Explainer

Hexo Labs · Open Source (MIT)

SIA: Self-Improving AI

Harness + Weight Updates

A self-improving loop that tunes both the agent's scaffolding and its model weights, without further human intervention.

gpt-oss-120b
LoRA rank 32
3 benchmarks
Claude Sonnet 4.6 agents

The Gap

Two silos, working independently

Harness optimization

Tune the scaffolding

The meta-agent rewrites prompts, tools, and retry logic. The model weights stay fixed.

Test-time training

Tune the weights

The RL pipeline updates the model from task rewards. The harness stays fixed.


SIA bridges the gap by moving both levers within a single loop.

Anatomy

What SIA actually is

  • Harness (scaffold): system prompt, tool-dispatch logic, retry policy, and response-extraction code.
  • Weights: model parameters, updated with LoRA at rank 32.
  • Three LLM components drive the loop: Meta-Agent, Task-Specific Agent, and Feedback-Agent.

The loop

One loop, two levers

After each run, the Feedback-Agent reads the full trace and selects one action.

Action A

Harness update

Rewrite the scaffolding. Weights are held fixed.

Action B

Weight update

Train LoRA weights. The scaffolding is held steady.


The two levers interleave freely, not in fixed alternating phases.

Evidence

Benchmark results

Task | Initial | Prev. SOTA | SIA-H | SIA-W+H
LawBench (top-1 acc) | 13.5% | 45.0% | 50.0% | 70.1%
AlphaEvolve TriMul (reward) | 0.105 | 1.292 | 0.120 | 1.475
Denoising (mse_norm) | 0.048 | 0.240 | 0.241 | 0.289

SIA-W+H (harness + weights) beats SIA-H (harness only) in all three tasks.

Method

How the Feedback-Agent chooses its moves

  • LawBench: purely outcome-based reward, so it used PPO with GAE. Accuracy reached 70.1%.
  • TriMul: most kernels fail, so it used an entropic advantage estimator. Runtime reached 1,017 µs.
  • Denoising: it used GRPO, which drops the value network. The score rose to 0.289.
  • Also available: REINFORCE + KL-to-base, DPO, and best-of-N behavioral cloning.

RQ2

What each lever changes

Harness

External changes

Software-engineering improvements: new tools, more robust parsers, retry logic.

Weights

Internalized knowledge

Domain knowledge that prompting cannot reach: H100 kernel patterns, the integer-rounding step.


The harness shapes how the agent searches; weight updates change what the model knows.

Critical Reading

Limits to keep in mind

  • Both levers optimize against the same fixed verifier, which risks compounding Goodhart effects as they co-evolve.
  • Fixed points may look solid to the verifier but remain fragile under perturbation.
  • The paper reports three tasks; a broader study of the algorithm-selection step is deferred.
  • Notably, the 350× "superintelligence" claim made in the release does not appear in the paper.

Get started

Run it yourself

Open source under MIT at hexo-ai/sia. Built on gpt-oss-120b with LoRA rank 32.

# install the Claude backend
pip install 'sia-agent[claude]'
export ANTHROPIC_API_KEY="..."

# run 5 self-improvement generations on a bundled task
sia --task lawbench --max_gen 5 --run_id 1

Four bundled tasks ship in the box: gpqa, lawbench, longcot-chess, spaceship-titanic.



Key Takeaways

  • SIA is the first self-improving loop that tunes both the agent scaffold and its model weights, according to the authors.
  • The Feedback-Agent reads the full trace of each run and chooses a harness rewrite or a weight update.
  • Combining both levers beat scaffold-only across three tasks: LawBench, TriMul kernels, scRNA-seq denoising.
  • Harness updates add software-engineering hygiene; weight updates add domain knowledge that prompting cannot reach.
  • Open source under MIT (hexo-ai/sia), built on gpt-oss-120b with LoRA rank 32.
