
“Where is Marta?”: How we remove uncertainty from AI reasoning

Are “stochastic parrots” winning math competitions? While doubts remain about whether LLMs are really thinking at the PhD level, as advertised, their progress on difficult reasoning tasks cannot be denied.

A popular recipe has been to mix and match the generative skills of LLMs with formal methods. The main insight is that LLMs are good at translating fuzzy, complex requirements into a precise format, while formal tools are good at finding solutions that satisfy the given constraints. By combining them, we get a system that can understand what you want and make sure it delivers exactly that: recently, AWS has been using this approach as a core building block in a real-time setting.

How does this work? Unfortunately, these basic ideas are usually explained in heavy, complicated settings, such as reinforcement learning or mathematical proofs. Today, we will demonstrate this hybrid method using Alloy, a simple language that is easy to read, even for beginners. Instead of the usual math-y papers and standard benchmarks, we will solve a much more relatable challenge, inspired by a crossword publication:

The real puzzle: 5 cars (1-5), 5 girls (A-E), 5 names, and 4 statements: who is Marta and which is her car? [ The original puzzle from “Settimana Enigmistica” was first discussed in this LinkedIn post; the image was then modified and translated by the author. As we show in the prompts below, a purely textual representation of the situation is of course possible. ]

We have 5 cars (1-5) parked in front of 5 girls (A-E), and 5 names (Laura, Giovanna, Bianca, Franca, Marta). We do not know which car belongs to which girl, but the girls each say something about the situation. Our job is to answer this simple question: which girl is named Marta, and which is her car?

While it is more beach-level than PhD-level, the puzzle sits in a sweet spot of difficulty. It can serve as a primer on combining LLMs and formal methods that is not cluttered by other topics and does not require extensive domain knowledge: we keep all the basic ingredients of real-world use cases, but make them easy to digest.

Prompts, screenshots, and Alloy code are available in this open-source repository (all tests were run in August 2025; the main reasoning loop was run in Claude Desktop).

AIs and humans struggle on their own

A fun fact about our puzzle is that, even though it only requires “beach-level thinking”, frontier models are not obviously good at it. When we upload the original picture and prompt Opus 4.1 for the solution, the model misreads the image: how can we trust its conclusion, namely that Marta is girl 5?

Things get interesting when we compare models. We turned the puzzle into a textual description, but the LLMs still cannot agree: DeepSeek's answer (A and 2) is different from the one given by Opus 4.1; Opus's own answer from textual reasoning (A and 2) is different from its answer above; and ChatGPT 5 has yet another idea (A and 5).

This is what makes the puzzle a good teaching example: humans struggle with this kind of combinatorial reasoning (homework question: how long did it take you to solve it?), but it is not clear how much better the models are. How do we build confidence in any of the answers above? How can we reason together with AI instead of delegating the process entirely?

Reasoning as “eliminating possibilities”

Complex reasoning challenges are often best tackled by following the advice of a popular investor: instead of trying to solve the whole problem at once, we can think of our puzzle as a combination of three main ingredients:

  • The initial state, an arbitrary assignment of girls to cars and names.
  • The set of constraints, in the form of statements made by the girls themselves: these statements make specific assignments impossible.
  • The final state, where the girls have been re-mapped to their actual names and cars.

Our initial knowledge is compatible with this scenario:

A possible assignment of girls and cars: Franca (A, 1), Laura (B, 2), etc. [ image by the author ]

But also with this one (and many more):

Another assignment of girls and cars: Franca (A, 1), Marta (B, 2), etc. [ image by the author ]

We can imagine that every time we add a girl's statement, we remove some possibilities from the final state. In other words, we grow our knowledge by progressively excluding sets of candidate solutions (this basic insight is at the heart of epistemic logic and information theory). Indeed, the first statement, made by Girl A, says that Laura is not next to her (together with a claim about a car and Bianca), which rules out our first scenario, because there Laura is next to Girl A.

Enumerating scenarios is tedious and error-prone work, even for LLMs. The magic of Alloy is its declarative nature. Instead of writing the reasoning code ourselves, we state what we know (premises in a traditional proof, statements in our case) and what we want to find.

The division of labor should now be clear: instead of having the LLM (or us) solve the puzzle directly, we have Claude translate English into Alloy code, then use Alloy to generate solutions and finally, as humans, evaluate them.

From LLM to Alloy and back: the reasoning loop

Our prompting strategy is now more subtle. We don't ask Claude for a specific solution; instead, we prompt it to produce Alloy code based on our initial setup. Instead of a “one-shot” solution, we are now in a virtuous loop, producing increasingly sophisticated code and using Alloy's output to verify that we are getting closer:

Reasoning as a loop, with Claude and a human working together [ image by the author ]

The result is our first code, which contains the basic ingredients but none of the girls' statements yet. It is easy to scroll through the definitions and see that a faithful translation has been completed: Girl, Car and Name are our main signatures, and the initial condition on car ownership is stated explicitly, since we do not yet know who owns which car:

 // No girl is initially standing in front of her own car
 // Girl A (position 1) does not own Car1, B does not own Car2, etc.
 A.owns != Car1
 B.owns != Car2
 C.owns != Car3
 D.owns != Car4
 E.owns != Car5
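
The constraints above rely on declarations that are not reproduced in this post. A minimal sketch of what they might look like follows. The signature and field names are inferred from the snippets shown here (`A.owns`, `A.name`, `Car1`, `Franca`, `Marta`), while the `OneToOne` fact is an assumption about how the puzzle setup would be encoded, not the author's actual code.

```alloy
// Hypothetical declarations, reconstructed from the identifiers used in the post.
abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}

abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}

abstract sig Girl {
    name: one Name,  // each girl has exactly one name
    owns: one Car    // each girl owns exactly one car
}
one sig A, B, C, D, E extends Girl {}

// Assumption: no two girls share a name or a car.
fact OneToOne {
    all n: Name | one g: Girl | g.name = n
    all c: Car  | one g: Girl | g.owns = c
}

// Ask the analyzer for any instance satisfying the facts.
run {} for 5
```

In a model shaped like this, the five `owns` inequalities shown above would sit inside a `fact` block, so that every instance the analyzer returns respects them.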

Let's pause here to highlight two nice Alloy features. First, the code maps clearly to logical statements, even for those who have never seen Alloy before. Second, the built-in UI is useful to visualize our progress, because it shows one instance chosen among all the possible ones that satisfy the constraints: for example, this is a possible assignment (Giovanna is C):

First instance in the Alloy UI [ screenshot from the author ]

By asking for the next instance, we could get another one, and then another: since our knowledge is still limited at this stage, many assignments are possible. Time to start eliminating some!
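
To get a feel for how many assignments are still possible at this stage, here is a back-of-the-envelope count. It assumes (this is not stated explicitly in the post) that each girl has exactly one name and owns exactly one car, with no sharing, and that the only constraint so far is that no girl stands in front of her own car. Names can then be assigned in any order, while car ownership must be a derangement of the five positions:

```latex
\[
!5 \;=\; 5!\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \frac{1}{5!}\right) \;=\; 44,
\qquad
\underbrace{5!}_{\text{names}} \times \underbrace{!5}_{\text{car ownership}} \;=\; 120 \times 44 \;=\; 5280 .
\]
```

Under these assumptions, each girl's statement has to carve this set of 5280 candidates down until a single assignment is left.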

Let us ask Claude to modify our initial code and add the statement from Girl A. The good thing about this loop is that we can iterate and arrive at incremental but convincing reasoning. Not just LLMs, but human intelligence too benefits from “incremental” checks: being able to verify “local” constraints is the equivalent of unit testing for our Alloy model of the puzzle.

Now let's add Girl A's statement as a constraint. We also add a check to make sure that the following assignment is no longer allowed: Franca (A, 1), Laura (B, 2). If we now run the code, no counterexample is found, proving that we have successfully ruled out the unwanted configuration:

pred InvalidConfiguration {
    // Girl A is named Franca and owns Car1
    A.name = Franca
    A.owns = Car1
    
    // Girl B is named Laura and owns Car2
    B.name = Laura
    B.owns = Car2
}

check { not InvalidConfiguration } for 5 Int
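
For the check above to come back without a counterexample, the model has to contain Girl A's statement as a fact. The sketch below is a partial and hypothetical encoding, not the author's code: it covers only the “Laura is not next to me” half of the statement, and it assumes the girls stand in a row from A to E, so that A's only neighbor is B.

```alloy
// Partial, hypothetical encoding of Girl A's statement.
// Assumes girls stand in a row A..E, so A's only neighbor is B.
fact GirlAStatementPartial {
    // "Laura is not next to me"
    B.name != Laura
    // The second half of the statement (about a car and Bianca)
    // is intentionally left out of this sketch.
}
```

Even this partial fact is enough to exclude `InvalidConfiguration`, since that predicate requires `B.name = Laura`.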

Now that we know the trick, our AI helper can produce the code for all the girls' statements. When we run it, this is the instance we get:

Final instance in the Alloy UI (Marta is Girl A, with Car 5) [ screenshot from the author ]

Thanks to a few iterations and an explicit, visual representation, we are now able to settle the discussion: Marta is Girl A with Car 5, as ChatGPT claimed above, but this time the answer comes with a proof.

Reasoning outside the box

The great by-product of having an independent symbolic representation of the underlying concepts is that we can now reason in the symbolic space of the Alloy model of the puzzle, instead of depending entirely on opaque mappings in latent space.

For example, we can confirm that the solution is unique: in the Alloy UI, if we try to get a new instance, a warning says that no other instance exists. But we can also probe further, relaxing existing constraints and removing all the dress details: does the solution change? (Try to answer before running it!) It turns out that the correct solution is still a valid instance (homework question: why must this be the case?).

The symbolic space is something we can easily manipulate, and it is also a good place to check the AI's own work, which should not be taken at face value. The first case in point is the original Opus solution, obtained by parsing the image incorrectly. We can easily change Girl C (i.e. `C.Es`s) and try again: it turns out that the Opus conclusion is wrong, and wrong “for the wrong reason”.

The second example comes from what Claude added as a uniqueness check (i.e. Marta has only one valid configuration). In theory that is a nice addition, but in practice this check does not do the job:

assert MartaUniqueSolution {
    // In all valid configurations, Marta is 
    // always the same girl with the same car
    all g1, g2: Girl | 
        (g1.name = Marta and g2.name = Marta) implies 
        (g1 = g2)  // Marta is always at the same position
}

The mismatch is clear, and it is easy to spot thanks to Alloy's transparent syntax: “in all valid configurations” is a quantification over Alloy instances (made in the “meta-language”, so to speak), while “all g1, g2: Girl” quantifies over the girls inside a single instance.
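
One way to phrase the check so that it really ranges over instances is to assert the candidate solution directly. A `check` command searches every instance that satisfies the facts for a counterexample, so the “in all valid configurations” part is supplied by the command itself rather than by a quantifier inside the assertion. A sketch, assuming the signatures used in the snippets above and the solution reported earlier (Marta is Girl A, with Car 5); the assertion name is made up for illustration:

```alloy
// Hypothetical assertion: in every instance, Girl A is Marta and owns Car5.
assert MartaIsAWithCar5 {
    A.name = Marta
    A.owns = Car5
}

// A counterexample would be a valid configuration where Marta
// is a different girl or owns a different car.
check MartaIsAWithCar5 for 5
```

If Alloy reports no counterexample, no valid configuration within the given scope places Marta anywhere else or gives her another car. This pins down Marta's position and car, not necessarily the rest of the assignment.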

Timeo Claude et dona ferentes!

See you, Space Cowboys

Similarly to cutting-edge systems such as AlphaGeometry, we solved a deductive problem (in effect, produced a proof) by reasoning together with Claude, instead of delegating the process entirely.

The LLM does the mapping between English and a formal language: Alloy is easy to read, but sometimes tedious to write, so Claude's code generation skills come in handy. Humans, on the other hand, can focus on checking that the formal setup is correct (checking is often easier than doing it in the first place!). Both Claude and the humans then delegate the combinatorial reasoning to a powerful solver, which is guaranteed to perform actual deduction.

While beach-level proofs may seem unimportant, and the back and forth with Claude can get tedious, this simple example gives a taste of what formal methods can do when combined with code generation and some (human or agentic) supervision. Real-world systems use more expressive languages, run tighter loops and target less trivial proofs, but many of today's insights carry over.

Of course, solving logic puzzles on the beach is not the only use case for hybrid systems like this. Languages like Alloy are well known for modeling software programs and, as such, open the door to a future in which systems are verified before any implementation. As very practical examples, AWS famously invests in verifying its cloud products, and Bauplan provides a formal model for its data catalog.

Taking a very different path than many could have predicted even 50 years ago, it seems that, day after day, we are finally getting closer to Leibniz's dream:

If controversies were to arise, there would be no more need of disputation between two philosophers than between two calculators. For it would suffice for them to take their pencils in their hands and to sit down at the abacus, and say to each other: Let us calculate.

Acknowledgments

Thanks to Federico Bianchi, Aldrin Montana and Patrick John Chia for early feedback on a draft of this article. No LLM was used or harmed to write the English parts of this blog post.

If you care about verification, simulation, and building AI systems and infrastructure, you will love working at Bauplan: we are hiring!
