The Black Box Problem: Why AI-Generated Code Stops Being Maintainable

A Pattern Across Teams
I keep seeing the same pattern across engineering teams that adopted AI coding tools in the past year. The first month is a honeymoon: velocity doubles, features ship faster, everyone is happier. By the third month, a different metric starts to rise: the time it takes to safely change anything that was built.
The code itself keeps getting better. Stronger models, cleaner output, larger context windows. And yet the teams producing the most code are increasingly asking for rewrites.
It makes no sense until you look at the structure.
A developer opens a module generated in a single AI session. It could be 200 lines, maybe 600; the length doesn't matter. They realize the only thing that ever understood the relationships in this code was the context window that generated it. Function signatures don't record their assumptions. Three services call each other in a specific order, but the reason for that order exists nowhere in the codebase. Every change requires rebuilding that understanding from scratch. That's the black box problem.
What Makes AI-Generated Code a Black Box
Code generated by AI is not bad code. But it has structural tendencies that quickly make it opaque:
- Everything in one place. The AI has a strong bias toward monoliths, the fastest path to a working result. Request a “checkout page” and you'll get cart rendering, payment processing, form validation, and API calls in one file. It works, but it's a single unit. You cannot update, test, or replace any part without dealing with the whole.
- Implicit and circular dependencies. The AI wires things together based on what it saw in the context window. Service A calls service B because they were in the same session. That coupling is not declared anywhere. Worse, AI tends to create circular dependencies, A depends on B depends on A, because it doesn't track the dependency graph across files. A few weeks later, removing B breaks A, and no one knows why.
- No contracts. Well-designed systems have explicit interfaces, API schemas, clear boundaries. AI skips this. The “contract” is whatever the current implementation happens to be. Everything works until you need to change one piece.
- Documentation describes implementation, not usage. AI generates thorough descriptions of what the code does internally. What's missing: usage examples, how the component is consumed, what it depends on, how it connects to the rest of the system. A developer who reads the documentation can understand the implementation but still doesn't know how to use the component or what breaks when they change the interface.
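To make the “no contracts” point above concrete, here is a minimal TypeScript sketch. The `PaymentGateway` interface, `TestGateway` class, and `checkout` function are invented names for illustration, not from any real codebase. The point is that consumers depend on a declared interface, so implementations can change without silently breaking callers.

```typescript
// A declared contract: this interface, not any implementation, is what
// consumers import. Swapping a provider cannot silently break callers
// as long as the interface is honored.
interface PaymentGateway {
  charge(amountCents: number, token: string): { ok: boolean; id: string };
}

// One implementation among many (a test double here). Replacing it with a
// real provider touches no consumer code.
class TestGateway implements PaymentGateway {
  charge(amountCents: number, token: string) {
    return { ok: amountCents > 0, id: `test-${token}` };
  }
}

// A consumer written against the contract only.
function checkout(gateway: PaymentGateway, amountCents: number): boolean {
  return gateway.charge(amountCents, "tok_123").ok;
}
```

When no such interface exists, the “contract” is whatever `TestGateway` happens to do today, and every consumer is coupled to it.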
A concrete example
Consider two ways AI can generate the same user notification system:
An unstructured generation produces one monolithic module:
```
notifications/
├── index.ts      # 600 lines: templates, sending logic,
│                 # user preferences, delivery tracking,
│                 # retry logic, analytics events
├── helpers.ts    # Shared utilities (used by... everything?)
└── types.ts      # 40 interfaces, unclear which are public
```
Result: one file to understand everything. One file to change anything.
Every dependency is a direct import. Changing the email provider means editing the same file that holds the push notifications. Testing requires mocking the entire system. A new developer has to read all 600 lines to understand any single behavior.
A structured generation decomposes the same functionality:
```
notifications/
├── templates/    # Template rendering (pure functions, independently testable)
├── channels/     # Email, push, SMS, each with a declared interface
├── preferences/  # User preference storage and resolution
├── delivery/     # Send logic with retry, depends on channels/
└── tracking/     # Delivery analytics, depends on delivery/
```
Result: five bounded areas. Change one without reading the others.
Each subdomain declares its dependencies explicitly. Consumers import typed interfaces, not implementations. You can test, replace, or modify each piece independently. A new developer can understand preferences/ without opening delivery/. The dependency graph is auditable, so you don't have to reconstruct it from scattered import statements.
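As a sketch of what the channels/ boundary might declare (the `Channel` interface and these implementations are illustrative, not the actual API of any tool mentioned here), note that `deliver` depends only on the interface:

```typescript
// channels/ declares one typed interface; delivery/ imports only this.
interface Channel {
  readonly name: string;
  send(to: string, body: string): boolean;
}

class EmailChannel implements Channel {
  readonly name = "email";
  send(to: string, body: string): boolean {
    // Stand-in for a real provider call.
    return to.includes("@") && body.length > 0;
  }
}

class SmsChannel implements Channel {
  readonly name = "sms";
  send(to: string, body: string): boolean {
    // Stand-in validation: phone-number-shaped recipient, SMS-length body.
    return /^\+?\d+$/.test(to) && body.length <= 160;
  }
}

// delivery/ is written against the contract: adding or swapping a channel
// (push, Slack, ...) requires no changes here.
function deliver(channel: Channel, to: string, body: string): boolean {
  return channel.send(to, body);
}
```

Changing the email provider now means editing `EmailChannel` alone; delivery/, preferences/, and tracking/ never see the difference.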
Both implementations produce the same runtime behavior. The difference is entirely structural. And that structural difference is what determines whether a program is still maintainable a few months after release.
The Composability Principle
What separates these two results is composability: building systems from components with well-defined interfaces, declared dependencies, and isolated tests.
There is nothing new about this idea. Component-based architecture, microservices, microfrontends, plugin systems, module patterns: they all embody some version of composability. What's new is the scale: AI generates code faster than anyone can structure it manually.
Composable systems have specific, measurable characteristics:
| ✨ Property | ✅ Composable (Structured) | 🛑 Black Box (Unstructured) |
|---|---|---|
| Boundaries | Explicit (declared per component) | Implicit (convention, if any) |
| Dependencies | Declared and verified at build time | Hidden in import chains |
| Testability | Each part tested in isolation | Must mock the world |
| Replacement | Safe (interface contract preserved) | Risky (unknown ripple effects) |
| Onboarding | Read the structure | Requires archaeology |
Here's the bottom line: composability is not a quality attribute you add after generation. It is a constraint that must exist during generation. If the AI's target is a flat directory with no boundaries, the output will not be composable regardless of how good the model is.
Most current AI coding falls short here. The model is capable, but the target environment provides no structural feedback. So you get code that works but has no architectural intent.
What Does Structural Feedback Look Like?
So what would it take for AI-generated code to be composable by default?
The answer mostly comes down to structural feedback from the target environment during generation, not after.
When a developer writes code, they get signals: type errors, test failures, lint violations, CI checks. Those signals push the output toward correctness. AI-generated code often gets none of this during generation. It is produced in one pass and evaluated after the fact, if at all.
What changes when the generation target provides structural signals in real time?
- “This component has undeclared dependencies”, forcing explicit dependency graphs
- “This interface doesn't match what its consumers expect”, enforcing contracts
- “This test fails in isolation”, exposing hidden coupling
- “This module exceeded its declared boundary”, preventing scope creep and cyclic dependencies
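Checks like these don't require heavy machinery. Here is a hedged sketch, with invented module names and a toy graph representation, of the first and last signals: comparing a module's actual imports against its declared dependencies, and detecting cycles in the resulting graph.

```typescript
// A dependency graph: module name -> list of modules it depends on.
type Graph = Record<string, string[]>;

// Signal 1: imports that a module uses but never declared.
function undeclared(declared: Graph, actual: Graph): string[] {
  const issues: string[] = [];
  for (const [mod, imports] of Object.entries(actual)) {
    for (const dep of imports) {
      if (!(declared[mod] ?? []).includes(dep)) issues.push(`${mod} -> ${dep}`);
    }
  }
  return issues;
}

// Signal 2: cycle detection via depth-first search.
// Returns one cycle as a path (e.g. ["a", "b", "a"]), or null if acyclic.
function findCycle(graph: Graph): string[] | null {
  const visiting = new Set<string>();
  const done = new Set<string>();
  const stack: string[] = [];
  const dfs = (node: string): string[] | null => {
    if (done.has(node)) return null;
    if (visiting.has(node)) return [...stack.slice(stack.indexOf(node)), node];
    visiting.add(node);
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = dfs(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  };
  for (const node of Object.keys(graph)) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}
```

Run at generation time, either check can reject a module before it lands, which is exactly the feedback loop a human developer gets from a failing build.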
Tools like Bit and Nx already give these signals to human developers. The shift is providing them at generation time, so the AI can correct course before the structure is broken.
In my work at Bit Cloud, we built this feedback loop into the generation process itself. When our AI generates components, each one is validated against the platform's structural constraints in real time: interfaces, dependencies, tests, typed links. The AI doesn't get to produce a 600-line module with hidden coupling, because the environment rejects it before it is committed. That is architecture enforced during generation.
Structure should be a first-class constraint during generation, not something you retrofit later.
The Real Question: How Fast Can You Ship and Stay in Control
We often measure AI productivity by generation speed. But the key question is: how quickly can you get AI-generated code to production and still be able to change it next week?
That breaks down into a few concrete questions. Can you review what the AI produced? Not just read it, actually review it, the way you review a pull request. Can you understand its boundaries, dependencies, purpose? Can your teammates do the same?
Then: can you ship it? Does it have tests? Are the contracts clear enough to give you confidence in production? Or is there a gap between “works locally” and “we can ship this”?
And after it's live: can you keep changing it? Can you add a feature without relearning the entire module? Can a new team member make a safe change without an excavation?
If AI saves you 10 hours writing code but you spend 40 getting it to production quality, or you ship fast but lose control of the code a month later, you've gained nothing. The real cost starts on day two and compounds.
The teams actually moving fast with AI are the ones that can answer yes to all three: reviewable, shippable, changeable. That's not about the model. It's about the environment the code lands in.
Practical Implications
For the code you're generating now
Treat every AI generation as a boundary decision. Before prompting, specify: what is this piece responsible for? What does it depend on? What is its public interface? Constraints in the prompt produce better output than open-ended prompts. You give the AI architectural intent, not just functional requirements.
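As a sketch, a boundary-constrained prompt for the notification example above might look like this (the subdomain names and function signatures are illustrative):

```
Generate the preferences/ subdomain of a notification system.
Responsibility: store and resolve per-user notification preferences.
Dependencies: only the shared types module; no imports from delivery/ or channels/.
Public interface: getPreferences(userId), setPreference(userId, channel, enabled).
Constraints: pure functions where possible; unit-testable without a database.
```

Each line closes off a degree of freedom the model would otherwise fill with whatever was nearest in its context window.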
For systems you've already built
Audit for implicit coupling. The most dangerous code isn't code that doesn't work; it's code that works but can't be maintained. Look for modules with tangled dependencies, circular imports, and components that can't be tested without spinning up the full system. Pay special attention to code generated in a single AI session. You can also use AI itself to run focused reviews on the specific concerns you care about.
When choosing tools and platforms
Evaluate AI coding tools by what happens after generation. Can you modify the output structurally? Are dependencies declared or assumed? Can you test a single generated unit in isolation? Can you audit the dependency graph? The answers determine whether you get to production fast and stay in control, or get there fast and lose it.
The conclusion
AI-generated code is not the problem. Unstructured AI-generated code is.
The black box problem is solvable, but not with better prompting alone. It requires generation environments that enforce architecture: explicit component boundaries, verifiable dependency graphs, isolated component tests, and interface contracts.
What that looks like in practice: one product description in, hundreds of tested, governed components out. That's the subject of the next article.
The black box is real. But it's an environment problem, not an AI problem. Fix the environment, and AI generates code you can ship and maintain.
Yonatan Sason is the founder of Bit Cloud, where his team builds infrastructure for AI-assisted composable development. Yonatan has spent the last decade working on component-based architecture and the last two years on AI generation platforms. The patterns in this article come from that work.
Bit is open source. For more on composable architecture and structured AI generation, visit bit.dev.
Disclosure: Towards Data Science's owner, Insight Partners, is also an investor in Bit Cloud. Bit Cloud appears here as a contributor in light of that relationship.




