Stop Using LLMs as Big Problem Solvers

I recently worked on a feature where I had to convert 100 messy compliance PDFs into structured JSON rules.

The brute-force approach was obvious: give the agent the source text, define the schema, provide examples, and ask it to generate the rules. Since it was low-hanging fruit, I tried it first.

At first glance, the output looked good. The JSON was valid and similar to what I expected.

But as I manually sampled the results to check accuracy, cracks appeared. Some rules were too broad, and others were missing entirely. Some failed to retain the nuances of the original text. I tried using another agent to catch and correct the errors, but with such a large corpus it was not possible to verify the output with confidence.

That was the frustrating part. The errors were not visible, and the pipeline was too opaque to measure and too fragile to rely on.

While I can't share the exact implementation details, what I can share are the architectural lessons I learned and how I ultimately applied them. Hopefully, this information will be useful if you are building AI systems that need to scale, remain reliable, and deal with messy data. And if you have better ways of doing things, reach out to chat!

Okay, let's get to it.

The problem

The 100s of PDFs I was working with had already been parsed and stripped before they got to me, but the raw content was still dirty. There were bullet points, tables, OCR artifacts, translated paragraphs, partially formatted titles, footers, headings, inconsistent formatting, and other document-specific issues.

I chose to use an agent because determining what is important requires semantic judgment. The documents did not follow one consistent pattern, so relevance could not be determined by simple rules alone.

You had to understand the surrounding context. None of this was difficult on a small sample of data. The challenge was to do it reliably at scale.

The resulting rules were then consumed by another program that used them to evaluate decisions.

What ended up working

After a few experiments, I realized that the big improvement didn't come from a better prompt, a new tool, an MCP server, or a complex agent harness.

It came from changing the nature of the problem.

Instead of trying to make the agent smarter, I made the agent's job smaller.

The first change was to prepare the source data in advance. Instead of asking the agent to query the database, find the records, determine whether it had the right input, and then perform the extraction, I gave it a more controlled starting point.

In my case, that meant temporarily storing the relevant raw data locally.

This may not always be possible. But the basic goal is to reduce the amount of retrieval uncertainty the agent has to handle. If the agent's job is to reason about the content, don't also make it responsible for figuring out whether it got the right content.

Another option would be to prepare the query in advance.

I also used a script to remove unnecessary metadata and fields before passing the raw content to the agent. Less irrelevant content meant fewer distractions, fewer chances for the agent to get stuck on the wrong details, and cleaner reasoning overall.
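A minimal sketch of what such a pre-cleaning script could look like, assuming each parsed document arrives as a dictionary. The field names (`doc_id`, `title`, `body`) and the `KEEP_FIELDS` allow-list are hypothetical, not the ones from my project.

```python
import re

# Hypothetical allow-list: only these fields are passed on to the agent.
KEEP_FIELDS = ("doc_id", "title", "body")


def clean_record(record: dict) -> dict:
    """Keep allow-listed fields and normalise whitespace in string values."""
    cleaned = {}
    for field in KEEP_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        if isinstance(value, str):
            # Collapse runs of spaces/tabs, trim each line, and drop empty lines.
            lines = (re.sub(r"[ \t]+", " ", line).strip() for line in value.splitlines())
            value = "\n".join(line for line in lines if line)
        cleaned[field] = value
    return cleaned


# Example record with parser metadata the agent does not need (made-up values).
raw = {
    "doc_id": "doc-001",
    "title": "  Retention   Policy ",
    "body": "Records must be kept\t for 5 years.\n\n\n  Page 3  ",
    "parser_version": "2.1",
    "crawl_timestamp": "2026-01-01T00:00:00Z",
}
print(clean_record(raw))
```

An allow-list is used here rather than a block-list so that any new metadata field added upstream is dropped by default instead of leaking into the agent's context.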

But the most important change was the unit of work.

Instead of processing everything at once, I worked iteratively and processed one document at a time.

That made each task smaller, easier to inspect, easier to retry, and easier to test. I set up five subagents to process the documents in parallel, each one logging its progress to a file.

If one document failed, I could retry only that document. If one output had formatting problems, I could fix that case without restarting the whole batch. If the pipeline stopped midway, the persisted cache meant it could resume from the last successful checkpoint.
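Here is a rough sketch of that per-document loop, not my actual implementation. `extract_rules` is a hypothetical stand-in for the agent call, and the JSON checkpoint file, the retry count, and the thread pool are assumptions chosen to keep the example self-contained.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def extract_rules(doc_id: str, text: str) -> list[dict]:
    """Hypothetical stand-in for the agent call; replace with the real one."""
    if not text.strip():
        raise ValueError(f"{doc_id}: empty document")
    return [{"source_doc": doc_id, "rule": text.strip()}]


def load_checkpoint(path: Path) -> dict:
    """Return previously saved results keyed by document ID."""
    return json.loads(path.read_text()) if path.exists() else {}


def run_batch(docs: dict[str, str], checkpoint: Path, workers: int = 5, retries: int = 2):
    done = load_checkpoint(checkpoint)
    failed = {}

    def process(doc_id: str):
        last_error = None
        for _ in range(retries + 1):
            try:
                return doc_id, extract_rules(doc_id, docs[doc_id]), None
            except Exception as exc:  # retry only this document
                last_error = exc
        return doc_id, None, str(last_error)

    # Skip anything already in the checkpoint so a rerun resumes where it stopped.
    pending = [d for d in docs if d not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, d) for d in pending]
        for future in as_completed(futures):
            doc_id, rules, error = future.result()
            if error is None:
                done[doc_id] = rules
                # Persist after every success; a crash only loses in-flight documents.
                checkpoint.write_text(json.dumps(done))
            else:
                failed[doc_id] = error
    return done, failed


docs = {"doc-001": "Keep records for 5 years.", "doc-002": "   ", "doc-003": "Report breaches."}
checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
done, failed = run_batch(docs, checkpoint)
print(sorted(done), sorted(failed))
```

Calling `run_batch` again with the same checkpoint path only reprocesses documents that are not yet in the file, so fixing one bad input and rerunning does not repeat the work that already succeeded.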

This is where the division of responsibilities became clear.

The agent handled the semantic work: understanding the content, identifying the relevant components and writing the JSON output.

The surrounding code handled the mechanical parts: running tasks in parallel, enforcing the schema, generating IDs, writing files, persisting the cache, validating references, and checking whether the output could be traced back to the original source.

I also had an orchestrator to monitor the script's progress.
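To make the code side of that split concrete, here is a small sketch of schema enforcement and ID generation done outside the agent. The rule shape (`source_doc`, `source_passage`, `quote`, `rule`) is a hypothetical example, not my real schema, and hashing the source pointer is just one way to get stable IDs.

```python
import hashlib

# Hypothetical schema: required field name -> expected Python type.
RULE_SCHEMA = {"source_doc": str, "source_passage": str, "quote": str, "rule": str}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the rule passes."""
    problems = []
    for field, expected in RULE_SCHEMA.items():
        if field not in rule:
            problems.append(f"missing field: {field}")
        elif not isinstance(rule[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
        elif not rule[field].strip():  # every field in this sketch is a string
            problems.append(f"{field} is empty")
    unknown = set(rule) - set(RULE_SCHEMA)
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems


def rule_id(rule: dict) -> str:
    """Deterministic ID derived from the source pointer and the quoted text."""
    key = "|".join((rule["source_doc"], rule["source_passage"], rule["quote"]))
    return "rule-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Made-up agent output for illustration.
agent_output = {
    "source_doc": "doc-001",
    "source_passage": "p-4",
    "quote": "kept for 5 years",
    "rule": "Retain records for five years.",
}
print(validate_rule(agent_output), rule_id(agent_output))
```

Because the ID is computed by code from the source pointer, the agent never has to invent identifiers, and rerunning the same document yields the same IDs.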

Making the output auditable

A useful design decision was to add reference IDs to every generated rule. This meant that each output pointed back to a specific source.

This made the output easier to audit. Instead of asking, “Does this generated rule look correct?”, I could ask more specific questions such as: does the referenced source passage exist? Does the quoted source text actually exist in that passage?

I could also have another agent selectively audit large and complex documents to ensure that important nuances were preserved.
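Those two questions are cheap to check in plain code. A minimal sketch, assuming each rule carries a hypothetical `source_passage` ID and a `quote` field, and that passages are available as a dictionary keyed by ID:

```python
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_traceability(rule: dict, passages: dict[str, str]) -> list[str]:
    """Check both questions for one rule; an empty list means both pass."""
    passage = passages.get(rule["source_passage"])
    if passage is None:
        return [f"passage {rule['source_passage']} does not exist"]
    if normalise(rule["quote"]) not in normalise(passage):
        return [f"quote not found in passage {rule['source_passage']}"]
    return []


# Made-up passage and rules for illustration.
passages = {"p-4": "Records must be kept\nfor 5 years after closure."}
good = {"source_passage": "p-4", "quote": "kept for 5 years"}
wrong_quote = {"source_passage": "p-4", "quote": "kept for 7 years"}
wrong_pointer = {"source_passage": "p-9", "quote": "kept for 5 years"}

for rule in (good, wrong_quote, wrong_pointer):
    print(check_traceability(rule, passages))
```

An exact substring match after whitespace normalisation is deliberately strict. It will flag paraphrased quotes, which is the point: anything flagged goes to a human or a second agent for review.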

In addition, I built a lightweight version of evals. I ran a small set of raw documents through the workflow and manually reviewed the results for coverage and accuracy. A complete gold dataset was out of scope for this project, but I still needed a way to show that the workflow was working.

My goal was not to build a perfect benchmark but to make the system inspectable enough to test the output, catch failures, and iterate toward a high accuracy bar.
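As an illustration only, a lightweight review loop like this can be as simple as a reproducible sample plus a tally of manual verdicts. The function names and verdict labels below (`correct`, `too_broad`, `missed_nuance`) are hypothetical.

```python
import random
from collections import Counter


def pick_review_sample(doc_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Reproducible random sample of documents to review by hand."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(doc_ids), min(k, len(doc_ids))))


def summarise_review(verdicts: dict[str, str]) -> dict:
    """Tally manual verdicts; 'correct' is the only passing label in this sketch."""
    counts = Counter(verdicts.values())
    total = sum(counts.values())
    return {
        "reviewed": total,
        "counts": dict(counts),
        "accuracy": counts["correct"] / total if total else None,
    }


doc_ids = [f"doc-{i:03d}" for i in range(1, 101)]
sample = pick_review_sample(doc_ids, k=10)

# Made-up reviewer verdicts keyed by rule ID.
verdicts = {
    "rule-a": "correct",
    "rule-b": "correct",
    "rule-c": "too_broad",
    "rule-d": "missed_nuance",
}
print(sample)
print(summarise_review(verdicts))
```

Fixing the seed means the same documents are reviewed after every change to the workflow, so a shift in the tally reflects the change rather than a different sample.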

If you have ideas on how I could do this better, let me know!

My biggest takeaway

The pattern that worked was to stop treating the LLM as the whole program.

The system became more reliable not because the agent was perfect, but because the workflow made outputs easier to trace, validate, and recover from.

Coincidentally, I was building this just before attending the first AI Engineer Singapore conference, held from 15 to 17 May 2026.

On the last day, JJ Geewax, Director of Applied AI at Google DeepMind, shared a framing that captured what I had been learning the hard way: we need to stop using LLMs as big problem solvers.

That resonated with me because it is an easy trap to fall into. It's easy to hand the model the data, the schema, the business rules, the edge cases, and the responsibility for validating itself. Then you get frustrated when the result is inconsistent.

But for reliable production systems, the best pattern is usually a hybrid. Let the agent handle the parts that require semantic judgment, and let the code handle the parts that require structure, validation, and control.

I will be sharing more thoughts from AI Engineer Singapore and the workshops I attended. YouTube excerpt of JJ's speech here.

That's all from me. Hope this helped, and see you in the next article 🙂
