How to build a model-native agent that learns internal planning, memory, and multi-tool reasoning through end-to-end reinforcement learning

In this tutorial, we explore how an agent can internalize planning, memory, and tool use within a single neural model rather than relying on external controllers. We design a compact, model-native agent that learns to solve arithmetic reasoning tasks through reinforcement learning. By combining an actor-critic network with a curriculum of increasingly complex tasks, we enable the agent to discover how to use internal "tools" to reach the correct answer end-to-end. We work step by step and observe how these behaviors emerge from simple reward signals. Check out the full codes here.
import math, random, torch, torch.nn as nn, torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"; torch.manual_seed(0); random.seed(0)
V = 18; CTX = 10; MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17
tok2str = {**{i: str(i) for i in range(10)}, CTX:"[CTX]", MUL:"[MUL]", ADD:"[ADD]", SUB:"[SUB]", ANS:"[ANS]", STO:"[STO]", RCL:"[RCL]", EOS:"[EOS]"}
class ToolEnv:
    """Tiny symbolic world: each action token acts as an internal 'tool'."""
    def __init__(self, max_steps=7):
        self.max_steps = max_steps
    def sample(self, stage):
        a, b, c, d, e = [random.randint(0, 9) for _ in range(5)]
        if stage == 0: ctx = [a, b, c]; target = a*b + c
        elif stage == 1: ctx = [a, b, c, d]; target = (a*b + c) - d
        else: ctx = [a, b, c, d, e]; target = (a*b + c) - (d*e)
        return ctx, target, (a, b, c, d, e)
    def step_seq(self, actions, abc, stage):
        a, b, c, d, e = abc; last = None; mem = None; steps = 0; shaped = 0.0
        goal0 = a*b; goal1 = goal0 + c; goal2 = goal1 - d; goal3 = d*e; goal4 = goal1 - goal3
        for act in actions:
            steps += 1
            if act == MUL: last = (a*b if last is None else last*(d if stage > 0 else 1))
            elif act == ADD and last is not None: last += c
            elif act == SUB and last is not None:
                last -= (e if stage == 2 and mem == "use_d" else (d if stage > 0 else 0))
            elif act == STO: mem = "use_d" if stage >= 1 else "ok"
            elif act == RCL and mem is not None:
                last = (d*e) if (stage == 2 and mem == "use_d") else (last if last else 0)
            elif act == ANS:
                target = [goal1, goal2, goal4][stage]  # final answer per stage, consistent with sample()
                correct = (last == target)
                if stage == 0: shaped += 0.25*(last == goal0) + 0.5*(last == goal1)
                if stage == 1: shaped += 0.25*(last == goal0) + 0.5*(last == goal1) + 0.75*(last == goal2)
                if stage == 2: shaped += 0.2*(last == goal0) + 0.4*(last == goal1) + 0.6*(last == goal4) + 0.6*(last == goal3)
                return (1.0 if correct else 0.0) + 0.2*shaped, steps
            if steps >= self.max_steps: break
        return 0.0, steps
We start by setting up the environment and defining the symbolic tools our agent can use. We create a small synthetic world where each action, such as multiplication, addition, or subtraction, acts as an internal tool. This environment lets us simulate reasoning tasks in which the agent must plan a sequence of tool calls to arrive at the correct answer. Check out the full codes here.
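To make the tool semantics concrete, here is a small sanity check of our own (not part of the original walkthrough): we sample a stage-0 task and score a hand-written action plan with the ToolEnv defined above.

# Illustrative sanity check: score a manual plan for a stage-0 task (a*b + c).
env_demo = ToolEnv()
ctx, target, abc = env_demo.sample(0)
manual_plan = [MUL, ADD, ANS]            # multiply a and b, add c, then answer
reward, steps = env_demo.step_seq(manual_plan, abc, 0)
print("ctx:", ctx, "target:", target, "reward:", reward, "steps:", steps)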
class ActorCritic(nn.Module):
    def __init__(self, V, d=96, nstage=3):
        super().__init__()
        self.emb = nn.Embedding(V, d); self.stage_emb = nn.Embedding(nstage, d)
        self.rnn = nn.GRU(d, d, 1, batch_first=True); self.pi = nn.Linear(d, V); self.v = nn.Linear(d, 1)
    def forward(self, ctx, stage, max_len=6, greedy=False):
        B = ctx.shape[0]
        ce = self.emb(ctx) + self.stage_emb(stage).unsqueeze(1)  # (B, L, d): token + stage embeddings
        h = torch.tanh(ce.mean(1)).unsqueeze(0)                  # pooled context initializes the GRU state
        inp = self.emb(torch.full((B, 1), CTX, device=device))
        acts, logps, ents, vals = [], [], [], []
        for _ in range(max_len):
            out, h = self.rnn(inp, h); val = self.v(out[:, -1]); logits = self.pi(out[:, -1])
            pi = F.log_softmax(logits, dim=-1).exp(); ent = -(pi*torch.log(pi + 1e-9)).sum(1)
            a = torch.argmax(logits, 1) if greedy else torch.distributions.Categorical(pi).sample()
            logp = F.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
            inp = self.emb(a.unsqueeze(1))
            acts.append(a); logps.append(logp); ents.append(ent); vals.append(val.squeeze(1))
        return torch.stack(acts, 1), torch.stack(logps, 1), torch.stack(ents, 1), torch.stack(vals, 1)
We then design our model-native policy as a GRU-based actor-critic network. We embed both the task tokens and the curriculum stage, allowing the network to adapt the depth of its reasoning to the complexity of the task. This setup enables the agent to learn, in context, when and how to use its internal tools within a single unified model. Check out the full codes here.
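Before training, it can help to confirm the policy's interface. The sketch below is our own addition: it runs a single greedy forward pass on one toy context and shows that the network returns an action, log-probability, entropy, and value estimate for each of the six decoding steps.

# Illustrative shape check on an untrained policy (assumes the classes and tokens above).
demo_net = ActorCritic(V).to(device)
demo_ctx = torch.tensor([[3, 4, 2, CTX]], device=device)   # stage-0 context: a=3, b=4, c=2
demo_stage = torch.tensor([0], device=device)
with torch.no_grad():
    acts, logps, ents, vals = demo_net(demo_ctx, demo_stage, max_len=6, greedy=True)
print(acts.shape, logps.shape, ents.shape, vals.shape)      # each is (1, 6)
print([tok2str[t] for t in acts[0].tolist()])               # untrained rollout, so the plan is arbitrary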
env = ToolEnv(); net = ActorCritic(V).to(device)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

def pad_batch(ctxs):
    # Right-pad each context with EOS and append the [CTX] marker token.
    L = max(len(c)+1 for c in ctxs)
    out = torch.full((len(ctxs), L), EOS, dtype=torch.long, device=device)
    for i, c in enumerate(ctxs): out[i, :len(c)+1] = torch.tensor(c + [CTX], device=device)
    return out

def run_batch(stage, batch=128, train=True, greedy=False):
    ctxs = []; metas = []
    for _ in range(batch):
        c, t, abc = env.sample(stage); ctxs.append(c); metas.append((t, abc))
    ctx = pad_batch(ctxs); stage_t = torch.full((batch,), stage, device=device, dtype=torch.long)
    acts, logps, ents, vals = net(ctx, stage_t, max_len=6, greedy=greedy)
    rewards = []
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r, _ = env.step_seq(traj, abc, stage)
        rewards.append(r)
    R = torch.tensor(rewards, device=device).float()
    adv = (R - vals.sum(1)).detach()  # advantage: episode reward minus the summed value baseline
    if not train: return R.mean().item(), 0.0
    pg = -(logps.sum(1)*adv).mean(); vloss = F.mse_loss(vals.sum(1), R); ent = -ents.mean()
    loss = pg + 0.5*vloss + 0.01*ent  # policy gradient + value regression + entropy bonus
    opt.zero_grad(); loss.backward(); nn.utils.clip_grad_norm_(net.parameters(), 1.0); opt.step()
    return R.mean().item(), loss.item()
We implement the reinforcement learning loop with an advantage actor-critic (A2C) update. We train the agent end-to-end on batches of synthetic tasks, updating the policy and value networks simultaneously. We also add an entropy bonus to encourage exploration and prevent premature convergence. Check out the full codes here.
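To make the objective explicit, here is a toy illustration of our own (made-up numbers, not real rollouts) that reproduces the three loss terms used inside run_batch: the policy-gradient term weighted by the advantage, the value regression, and the entropy bonus.

# Toy A2C loss on dummy tensors (illustrative only; mirrors the terms in run_batch).
R_demo = torch.tensor([1.1, 0.0, 1.0])          # episode rewards for three fake rollouts
V_demo = torch.tensor([0.8, 0.3, 0.9])          # summed value predictions (vals.sum(1))
logp_demo = torch.tensor([-1.2, -2.0, -0.7])    # summed log-probs of the sampled actions
ent_demo = torch.tensor([1.5, 1.4, 1.6])        # policy entropies
adv_demo = (R_demo - V_demo).detach()           # advantage: reward minus learned baseline
pg = -(logp_demo * adv_demo).mean()             # reinforce actions with positive advantage
vloss = F.mse_loss(V_demo, R_demo)              # fit the value baseline to the rewards
loss_demo = pg + 0.5*vloss + 0.01*(-ent_demo.mean())  # entropy bonus keeps the policy exploratory
print(float(loss_demo))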
print("Training…")
stages=[0,0,0,1,1,2]
for ep in range(1,61):
stage=stages[min((ep-1)//10,len(stages)-1)]
acc,loss=run_batch(stage,batch=192,train=True)
if ep%5==0:
with torch.no_grad():
evals=[run_batch(s,train=False,greedy=True)[0] for s in [0,1,2]]
print(f"ep={ep:02d} stage={stage} acc={acc:.3f} | eval T0={evals[0]:.3f} "
f"T1={evals[1]:.3f} T2={evals[2]:.3f} loss={loss:.3f}")
We run the main training loop using a curriculum strategy in which tasks gradually increase in difficulty. As we train, we evaluate the agent on every stage to check how well it composes simple steps into more complex reasoning. The printed metrics show how its internal planning improves over time. Check out the full codes here.
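For reference, the short sketch below (our own addition) prints how the schedule above maps epochs to curriculum stages: epochs 1–30 stay on stage 0, epochs 31–50 move to stage 1, and epochs 51–60 train on stage 2.

# Illustrative check of the epoch-to-stage mapping used by the training loop.
stages_demo = [0, 0, 0, 1, 1, 2]
schedule = {}
for ep in range(1, 61):
    s = stages_demo[min((ep - 1)//10, len(stages_demo) - 1)]
    schedule.setdefault(s, []).append(ep)
for s, eps in schedule.items():
    print(f"stage {s}: epochs {eps[0]}-{eps[-1]}")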
def explain(stage):
    c, t, abc = env.sample(stage)
    ctx = pad_batch([c]); stage_t = torch.tensor([stage], device=device)
    with torch.no_grad(): a, _, _, _ = net(ctx, stage_t, greedy=True)
    seq = [tok2str[x] for x in a[0].tolist()]
    r, _ = env.step_seq(a[0].tolist(), abc, stage)
    return dict(stage=stage, ctx=c, target=t, actions=" ".join(seq), reward=round(float(r), 2))

with torch.no_grad():
    for s in [0, 1, 2]:
        print(f"\nStage {s} samples:")
        for _ in range(5): print(explain(s))

with torch.no_grad():
    finals = [run_batch(s, train=False, greedy=True, batch=1000)[0] for s in [0, 1, 2]]
print(f"\nFinal greedy accuracies → T0={finals[0]:.3f}, T1={finals[1]:.3f}, T2={finals[2]:.3f}")
We conclude by probing the trained agent and printing example trajectories. We inspect the sequence of tokens the model chooses and verify whether it reaches the desired result. Finally, we evaluate the overall performance, showing that the model successfully integrates planning, memory, and internal tool use into a single reasoning process.
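If you want to keep the trained agent for later inspection, a minimal checkpointing sketch (our own addition, using standard PyTorch save/load utilities; the filename is arbitrary) could look like this:

# Optional: persist and reload the trained policy (illustrative; path is arbitrary).
ckpt_path = "model_native_agent.pt"
torch.save(net.state_dict(), ckpt_path)
restored = ActorCritic(V).to(device)
restored.load_state_dict(torch.load(ckpt_path, map_location=device))
restored.eval()
with torch.no_grad():
    demo_ctx = pad_batch([[3, 4, 2]]); demo_stage = torch.tensor([0], device=device)
    demo_acts, _, _, _ = restored(demo_ctx, demo_stage, greedy=True)
print([tok2str[t] for t in demo_acts[0].tolist()])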
In conclusion, we see that even a small neural network can learn internal planning and tool use when trained with reinforcement signals. We successfully move beyond the traditional pipeline architecture, in which memory, planning, and execution are separate components, toward a model-native agent that absorbs these capabilities into its own learned behavior. This approach points toward agentic AI in which end-to-end learning produces coherent, self-organized decision-making without the need for manually engineered controllers.



