Using Reinforcement Learning for Vibe Proving

“The development of mathematics toward greater precision has led, as is well known, to the formalization of large tracts of it, so that one can prove any theorem using nothing but a few mechanical rules.”
– K. Gödel
In part 1, we built a proof checker and developed a mental model of why we should trust proofs coming from an LLM: as long as we have valid assumptions and sound verifiers, “a few mechanical rules” are all we need. So how do we do it? How do we train an LLM to produce valid proofs?
As DeepSeek has amply demonstrated, the same logic behind an AI learning the game of Go works for an AI learning how to think, as long as thinking can be verified (and we now know it can). In this second part we put our checker to work and build an end-to-end RL training loop to fine-tune open-source models to generate proofs in the language we introduced in part 1: at a glance, the following figure shows the basic ingredients of the flow.
TL;DR: after a human-machine interaction to generate the dataset (using our checker as a sanity check on the LLM-generated examples), we use the Tinker RL loop to perform LoRA-style fine-tuning of open-source models. We tell the model (1) how our language works, (2) how to use rules to construct proofs, and (3) how to format responses for easy parsing. All generated proofs are then run through the proof checker, and the reward is propagated back to improve the model's skills: ideally, the model will start with many failed proof attempts, then get progressively better as training progresses.
Note that although the series directly addresses propositional reasoning, formal proofs are important for building confidence in distributed software systems. As some experts argue, AI may be the missing ingredient to prove the correctness of software at scale!
As usual, clone the repo and code along. If you skipped the first part, you can read it here!
Dataset preparation
“People think math is complicated. Math is a simple thing. It's something we can understand. Cats are complicated.” – J. Conway
To reward our model during training, we need examples of proofs in the first place: ideally, we would like a mix of easy and hard proofs, written in our logic language. We can't just generate random strings from our alphabet, because we'd like the model to attempt proofs we know can be completed in the first place! How do we bootstrap the process?
Our training mix is a combination of three sources:
- A manual translation of exercises (premises -> conclusion) taken from forallx, which we thus know are provable;
- A manual translation of exercises (premises -> conclusion) taken from Language, Proof and Logic, which we likewise assume are provable;
- A corpus of proofs produced by a powerful LLM (Anthropic's Sonnet). Since we cannot assume the (premises -> conclusion) tuples produced by the LLM are correct, we ask the LLM for a full proof, which gets (you guessed it!) checked by our proof checker before being added to the training set.
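The third source can be sketched as a simple generate-then-verify filter. This is a minimal, illustrative sketch: `ask_llm_for_proof` and `check_proof` are hypothetical stand-ins for the real Sonnet call and the checker from part 1.

```python
import json

def ask_llm_for_proof(premises, conclusion):
    # Hypothetical stand-in for the Sonnet API call: here we just
    # return a canned one-step-style proof for the toy example.
    return ["P (premise)", "Q (premise)", "P and Q (and-intro 1,2)"]

def check_proof(premises, conclusion, proof_steps):
    # Toy verifier (the real one is the proof checker from part 1):
    # accept a proof only if its last line states the conclusion.
    return bool(proof_steps) and proof_steps[-1].startswith(conclusion)

def build_dataset(candidate_tuples):
    """Keep only (premises -> conclusion) tuples for which the LLM
    produced a proof that the checker accepts."""
    rows = []
    for premises, conclusion in candidate_tuples:
        proof = ask_llm_for_proof(premises, conclusion)
        if check_proof(premises, conclusion, proof):
            rows.append({
                "premises": premises,
                "conclusion": conclusion,
                "num_steps": len(proof),
            })
    return rows

dataset = build_dataset([(["P", "Q"], "P and Q")])
print(json.dumps(dataset[0]))
```

Because the checker gates every example, a hallucinated or invalid LLM proof simply never makes it into the training set.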
A row of the dataset looks like the following:
{"premises": ["P", "Q"], "conclusion": "P and Q", "num_steps": 1}
that is, the set of premises, the conclusion, and how many steps Sonnet took to produce a valid proof: the premises and the conclusion will end up in the prompt during RL (as we will ask the model to find a proof of the conclusion from the premises), and num_steps is a handy value for printing some statistics on the perceived difficulty of the training set (assuming for simplicity that proof length is loosely correlated with difficulty).
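For example, a few lines of standard-library Python are enough to turn num_steps into a rough difficulty histogram; the JSONL rows below are made up for illustration (only the first matches the example above).

```python
import json
from collections import Counter

# Hypothetical JSONL dataset: one training example per line.
raw = """{"premises": ["P", "Q"], "conclusion": "P and Q", "num_steps": 1}
{"premises": ["P"], "conclusion": "P or Q", "num_steps": 1}
{"premises": ["not A or not B"], "conclusion": "not (A and B)", "num_steps": 9}"""

rows = [json.loads(line) for line in raw.splitlines()]

# Proof length as a (loose) proxy for difficulty.
difficulty = Counter(r["num_steps"] for r in rows)
print(sorted(difficulty.items()))  # -> [(1, 2), (9, 1)]
```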
Reinforcement Learning in Tinker
“The best way to have a good idea is to have a lot of ideas.”
– L. Pauling
Now we are ready to train our own, small, open-source LLM for Vibe Proving. There are many recipes and services for doing RL on open-source models, but we chose Tinker, as it promises to abstract away the infrastructure and most of the boilerplate needed (it's also the new kid on the block, so it was time to check it out!).
The training loop itself doesn't have many surprises:
- Sample: given the prompt and a (premises -> conclusion) tuple, we ask the model to generate multiple proof attempts.
- Verify: we run each attempt through the proof checker.
- Reward: valid proofs (i.e. proofs that fully parse and are logically sound) get reward 1, everything else gets 0 ('Do or do not', really). Note that we also check that the generated proof targets the same (premises -> conclusion) tuple as our prompt, to avoid the LLM easily gaming the system by always generating some other, partially relevant proof.
- Update: we adjust the model weights to make valid proofs more likely.
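The reward step can be sketched as an all-or-nothing function. This is a minimal sketch under stated assumptions: `check_proof` stands in for the real checker from part 1, and the attempt/prompt shapes are hypothetical.

```python
def reward(prompt_premises, prompt_conclusion, attempt, check_proof):
    """All-or-nothing reward: 1.0 only if the attempt proves exactly
    the (premises -> conclusion) tuple we asked for AND the checker
    validates it; anything else gets 0.0, so the model cannot game
    the system by proving something easier."""
    if attempt["premises"] != prompt_premises:
        return 0.0
    if attempt["conclusion"] != prompt_conclusion:
        return 0.0
    if not check_proof(attempt["premises"], attempt["conclusion"], attempt["steps"]):
        return 0.0
    return 1.0

# Toy checker: accepts any non-empty proof (the real one is from part 1).
toy_checker = lambda premises, conclusion, steps: len(steps) > 0

print(reward(["P", "Q"], "P and Q",
             {"premises": ["P", "Q"], "conclusion": "P and Q", "steps": ["..."]},
             toy_checker))  # -> 1.0
print(reward(["P", "Q"], "P and Q",
             {"premises": ["P"], "conclusion": "P and Q", "steps": ["..."]},
             toy_checker))  # -> 0.0
```

The binary signal keeps credit assignment simple: any deviation from the requested tuple, however small, is treated the same as a failed proof.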
Following Tinker's guidelines, we chose to try MoE reasoning models in several sizes: gpt-oss-20b, gpt-oss-120b and Qwen3-30B-A3B-Instruct-2507. During training, logs and proofs are saved locally in the training_logs folder: finally, our app (vibe coded!) can be used to visualize metric trends and inspect the generated proofs.

If you use an AI assistant to monitor training (which I tried for the first time with this project), an interesting piece of data to track is the textbook proofs, because they are designed to be tricky. For example, the following is a status update from Claude Code:

How good is our vibe proving?
Within a few runs and a little parameter tweaking, we consistently get models that can prove most of the generated examples, but struggle with some of the textbook proofs. It is instructive, and a little amusing, to examine the proofs produced.
On the success side, this is a successful attempt to prove De Morgan's law, i.e. to show how to go from ['not A or not B'] to 'not (A and B)', by assuming 'A and B' and deriving a contradiction:
- not A or not B (premise)
- | A and B (subproof)
- | A (2)
- | B (2)
- || not A (nested subproof, from 1)
- || ~ (3,5)
- || not B (nested subproof)
- || ~ (4,7)
- | ~ (1, 5-6, 7-8)
- END
On the failure side, no model managed to prove 'C or D' from ['A or B', 'not A or C', 'not B or D']: they struggle to properly handle nested subproofs and use the explosion rule, as shown in this attempt:
- A or B (premise)
- not A or C (premise)
- not B or D (premise)
- | A (subproof)
- || not A (nested subproof)
- || ~ (4,5)
- | C (5-6) ← ERROR
- …
How easy was Tinker?
Our small proof of concept is by no means a stress test of the training service at scale, but it is sufficient to form some first impressions of the system.
The combination of good community examples, Claude-friendly documentation and hardware abstraction made for a nice, gentle introduction to RL, at a reasonable cost (all the experiments for this blog post cost $60 or so, including the first run which – in retrospect! – was clearly a waste of time and money!).
Once you get the hang of it and start running several jobs in parallel, the lack of monitoring and observability becomes a problem: sometimes my runs hang on try_again responses for a long time (as if the system is overloaded), and some jobs fail at some point for unclear reasons (but, of course, you can restart from the previous checkpoint). Considering the reasonable price and the prototype nature of my work, none of these problems outweighed the benefits, and I walked away with enough Tinker knowledge to use it again in a future project.
So long, RL cowboys!
“We don't do these things because they are easy, but because we thought they would be easy.” – Anonymous
While Tinker makes the training process (mostly) seamless, the devil is in the (RL) details: we have barely scratched the surface, as our goal was to go from zero to a Vibe Proving stack, not to advance RL itself.
The good news is that the flow is quite modular, so that all components can be developed and improved (somewhat) independently:
- model selection: model type, model size, provider, …
- training parameters: learning rate, batch size, LoRA rank, …
- code abstractions: rewriting the code with RL Envs, …
- prompt optimization: better instructions, simpler formatting, in-context examples, …
- dataset preparation: more diverse examples, curriculum learning (not only varying proof difficulty, but e.g. starting with proofs missing one step, then proofs missing two steps, etc., until the model needs to complete entire proofs), …
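The curriculum idea in the last bullet is easy to prototype: given a complete proof, generate tasks that hide progressively more of it. The function below is a hypothetical sketch, not code from the repo.

```python
def curriculum(premises, conclusion, full_proof):
    """Given a complete proof, emit progressively harder tasks:
    first a proof with one step removed, then two, and so on,
    until the model must write the whole proof from scratch."""
    tasks = []
    for k in range(1, len(full_proof) + 1):
        tasks.append({
            "premises": premises,
            "conclusion": conclusion,
            "partial_proof": full_proof[:-k],  # drop the last k steps
            "missing_steps": k,
        })
    return tasks

proof = ["P (premise)", "Q (premise)", "P and Q (and-intro 1,2)"]
tasks = curriculum(["P", "Q"], "P and Q", proof)
print([t["missing_steps"] for t in tasks])  # -> [1, 2, 3]
```

Early tasks give the model almost all the scaffolding (so rewards are frequent), and later tasks remove it, which is the standard curriculum-learning trick for sparse binary rewards.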
Similarly, our custom proof language will never be enough to get interesting results: we can keep improving it, but getting to something actually usable would require an incredible amount of work. For these reasons, you are better off moving to a purpose-built language, such as Lean: importantly, now that you think of proofs as mechanical rule-following, the same mental model extends to a real (proof) language. In addition, Lean has a similar style of writing proofs, i.e. rules for introducing and eliminating logical operators.
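To see the resemblance, here is the same De Morgan direction from the success example above, written as a standard Lean 4 tactic proof (this is ordinary Lean, not output from our models):

```lean
-- From ¬A ∨ ¬B, derive ¬(A ∧ B): assume A ∧ B, reach a contradiction.
example (A B : Prop) (h : ¬A ∨ ¬B) : ¬(A ∧ B) := by
  intro hab                       -- assume A ∧ B
  cases h with
  | inl hna => exact hna hab.1    -- case ¬A: contradicts A
  | inr hnb => exact hnb hab.2    -- case ¬B: contradicts B
```

The shape is the same as our nested subproofs: one assumption, a case split on the disjunction, and a contradiction in each branch.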
In other words, once we've dialed in the math behind Vibe Proving and built the first RL harness, what's left is good ol' engineering.
Thank you
Thanks to Patrick John Chia, Federico Bianchi, Ethan Rosenthal, Ryan Vilim, Davis Treybig for valuable feedback on earlier versions of this draft.
If you are interested in the intersection of GenAI, reasoning about distributed systems, and verification, you can also check out our research at Bauplan.
AI coding assistants were used to write the companion repository, but no assistant was used to write this post (other than for proofreading and typo correction).



