The Next AI Bottleneck Isn't the Model: It's the System Around It

Working with enterprise AI teams, I've noticed a pattern: when something goes wrong, they almost always blame the model. That instinct is understandable, but it is often wrong, and it ends up being expensive.
The usual story goes like this. The output is inconsistent, and the first reaction is to blame the model: maybe it needs more training data, more fine-tuning, or a different base model. After weeks of work the issue is unchanged, or barely improved. The real problem, which usually lives in the retrieval layer, the context window, or how requests are routed, was never examined.
I've seen this happen often enough that I think it's worth writing about.
Fine-tuning is useful, but it's overused
To be clear, fine-tuning still has its place. If you need domain adaptation, tone alignment, or safety calibration, it belongs in the workflow. I'm not arguing against it.
The problem is that it has become the default answer to every problem, even when it's not the right tool. Part of the reason is that it feels productive. You kick off a fine-tuning run, something visibly happens, and there is a clear before and after. It looks like fixing the problem even when it isn't.
A contract analysis system I watched a team debug is a good example. Results were unreliable on complex documents, and the first assumption was that the model lacked legal reasoning ability. So they ran round after round of fine-tuning. The problem didn't go away. Eventually someone noticed that the retrieval layer was pulling in the same passages multiple times and stuffing them all into the context window. The model was wading through large amounts of low-value, heavily duplicated text. They fixed the retrieval layer to drop duplicates, added context compression, and the results improved substantially.
The model itself was never changed. And again, this pattern is common.
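The article can't show the team's actual code, but a minimal sketch of that kind of fix might look like the following, assuming retrieved chunks arrive as plain strings with relevance scores. The names and data shapes here are hypothetical, not the team's implementation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever


def dedupe_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Drop near-verbatim duplicates, keeping the highest-scoring copy."""
    seen: set[str] = set()
    unique: list[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = " ".join(chunk.text.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def compress_context(chunks: list[Chunk], max_chars: int = 8000) -> str:
    """Keep the most relevant unique chunks until a rough size budget is hit."""
    parts: list[str] = []
    used = 0
    for chunk in dedupe_chunks(chunks):
        if used + len(chunk.text) > max_chars:
            break
        parts.append(chunk.text)
        used += len(chunk.text)
    return "\n\n".join(parts)
```

The output of compress_context becomes the prompt context instead of the raw retrieval dump; the model sees each passage once, within a bounded budget.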
What happens at inference
For a long time, inference was simply the step where you ran the model; training was where all the interesting decisions happened. That is changing.
One reason is that some models now allocate more compute at generation time instead of baking everything into training. Another is that research has shown behaviors such as self-evaluation and revising one's own output can be learned through reinforcement learning. Both point to inference as a place where performance can still be won.
What I'm seeing now is that engineering teams are starting to treat inference as something you can design around, rather than a fixed step you simply accept. How much reasoning depth does this task need? How is memory managed? How is retrieval staged? These become real design questions rather than defaults you never think about.
The resource allocation problem
What often gets overlooked is that many AI systems treat every query the same way. A simple question about account status goes through the same pipeline as a multi-step compliance review that has to reconcile several conflicting documents. Same path, same cost, same compute.
This doesn't hold up when you think about it. In every other kind of engineering, resources are allocated according to the work required. Some teams are starting to do the same with AI: routing simple requests to lightweight paths and reserving the heavy compute path for tasks that genuinely need it. The economics improve, and the quality of the hard cases improves too, since the expensive path is no longer spread thin across everything.
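As a rough illustration (not anyone's production router), the routing decision can start as a simple heuristic in front of two model tiers. The model names, thresholds, and keywords below are placeholders.

```python
def call_model(model_name: str, query: str) -> str:
    """Placeholder for whatever client your stack uses to invoke a model."""
    return f"[{model_name}] answer to: {query}"


def estimate_tier(query: str, num_documents: int) -> str:
    """Crude heuristic: multi-step language or many source documents means the heavy path."""
    multi_step = any(w in query.lower() for w in ("compare", "reconcile", "audit", "across"))
    return "heavy" if (multi_step or num_documents > 3 or len(query) > 400) else "light"


def route(query: str, num_documents: int = 0) -> str:
    """Send cheap lookups to a small model; reserve the expensive path for hard tasks."""
    if estimate_tier(query, num_documents) == "light":
        return call_model("small-fast-model", query)
    return call_model("large-reasoning-model", query)
```

Real systems often use a small classifier model rather than keyword heuristics, but the principle is the same: match the compute to the difficulty of the task.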
These systems have more layers than people realize
Look inside a production AI system today and it's rarely just one model answering questions. There is usually a retrieval step, a ranking step, often a verification step, and a summarization step: several stages working in sequence to produce the final output. Quality depends not just on the capability of the underlying model but on how all those pieces fit together.
If the ranking step is poorly tuned, it produces failures that look exactly like model errors. A context window that grows without restraint quietly degrades output quality without anything visibly failing. These are system problems, not model problems, and they need to be addressed with systems thinking.
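To make the layering concrete, here is a deliberately simplified sketch of such a pipeline. Every stage is a stub standing in for whatever component a real system would use; none of this is a specific product's architecture.

```python
def retrieve(query: str) -> list[str]:
    """Stub: fetch candidate passages from a search index or vector store."""
    return ["passage A, directly about the query", "passage B, loosely related"]


def rerank(query: str, passages: list[str]) -> list[str]:
    """Stub: order passages by relevance. A weak reranker here shows up downstream
    as what looks like a 'dumb model'."""
    return sorted(passages, key=len)  # placeholder scoring, not real relevance


def summarize(query: str, passages: list[str]) -> str:
    """Stub: call the base model with the selected context to draft an answer."""
    return f"Draft answer to '{query}' based on {len(passages)} passages."


def verify(answer: str) -> bool:
    """Stub: a cheaper check (rules or a small model) that gates the draft."""
    return answer.startswith("Draft answer")


def answer(query: str) -> str:
    passages = rerank(query, retrieve(query))
    draft = summarize(query, passages[:3])
    return draft if verify(draft) else "escalate for review"
```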
Speculative decoding is one example of this kind of thinking in action: a smaller model drafts candidate output and a larger model verifies it. It started as a latency optimization, but it's really a case of distributing the work across multiple components rather than expecting a single model to do everything. Two teams using the same base model but different inference architectures can end up with very different results in production.
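Speculative decoding proper involves probability-ratio acceptance tests, but the core control flow, shown here in a simplified greedy form with caller-supplied draft_step and target_step functions, looks roughly like this:

```python
def speculative_decode(prompt, draft_step, target_step, k=4, max_new=32):
    """Simplified greedy speculative decoding sketch.

    draft_step(tokens, k) -> list of k proposed next tokens from the small model
    target_step(tokens)   -> the large model's single greedy next token

    Both are placeholders for real model calls; this shows the control flow,
    not a production implementation.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = draft_step(tokens, k)
        accepted = []
        for tok in draft:
            # The large model checks each drafted token in order.
            if tok == target_step(tokens + accepted):
                accepted.append(tok)
            else:
                break
        tokens += accepted
        # Always take one token from the large model so each loop makes progress;
        # this is also the correction token when a draft token is rejected.
        tokens.append(target_step(tokens))
    return tokens
```

When the draft model is usually right, most tokens are accepted in bulk and the expensive model is mostly verifying rather than generating, which is where the speedup comes from.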

Memory is becoming a real problem
Larger context windows have helped, but past a point, more context doesn't improve reasoning; it degrades it. Retrieval gets noisier, the model loses track of what matters, and compute costs climb. Teams running AI at scale are spending real time on things like paged attention and context compression, which aren't glamorous to talk about but matter a great deal in practice.
The goal is to get the right context into the window: enough, but not too much, and well curated.
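One common pattern, sketched below with made-up names and defaults, is to keep recent turns verbatim and fold older history into a running summary so the window stays bounded:

```python
def trim_history(turns: list[str], keep_recent: int = 6, summarizer=None) -> list[str]:
    """Keep the last few turns verbatim; collapse everything older into one summary entry."""
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    if summarizer is None:
        # Fallback so the sketch runs without a model: crude truncation as a stand-in summary.
        summary = "Summary of earlier conversation: " + " ".join(older)[:500]
    else:
        summary = summarizer(older)  # e.g. a small model call that condenses older turns
    return [summary] + recent
```

The specifics vary, but the shared idea is that the context window is a managed budget, not a dumping ground.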
Takeaway
Model selection matters less than it used to. Capable base models are now available from several providers, and the capability gaps have narrowed for many use cases. What actually determines whether a deployment succeeds is the infrastructure around the model: how retrieval is designed, how compute is allocated, and how the system handles edge cases over time.
The teams that will be in good shape a few years from now are the ones treating inference design as something that deserves careful engineering, rather than assuming a good enough model will take care of everything else. In my experience, it usually doesn't.



