From 4 Weeks to 45 Minutes: Designing a Document Output System for 4,700+ PDFs

A client got in touch and asked if I could help extract revision numbers from over 4,700 engineering drawing PDFs. They were migrating to a new asset management system and needed the current REV value of each drawing, a small field buried in the title block of every document. The alternative was a team of engineers opening the PDFs one by one, finding the title block, and manually typing the value into a spreadsheet. At two minutes per drawing, that's about 160 hours: four weeks of engineering time. At fully loaded rates of around £50 an hour, that's over £8,000 in labor costs for work that produces no engineering value beyond filling a spreadsheet column.

This was not an AI problem. It was a system design problem with real constraints: budget, accuracy requirements, mixed file formats, and a team that needed results they could trust. AI was part of the solution. The surrounding engineering decisions are what make the system work.

The Hidden Complexity of “Simple” PDFs

Engineering drawings are not ordinary PDFs. Some are produced directly from CAD software and exported as text-based PDFs, where the text can be extracted programmatically. Others, especially legacy drawings from the 1990s and early 2000s, were scanned from paper originals and saved as image-based PDFs: the entire page is a flat raster image with no text layer at all.

📷 [FIGURE 1: Annotated engineering drawing] Representative engineering drawing with title block (bottom right), revision history table (top right), and grid reference characters (border) highlighted. The REV “E” value resides in the title block next to the drawing number, but the revision history table and grid characters are common sources of false positives.

Our corpus was approximately 70-80% text-based and 20-30% image-based. But even the text-based subset was tricky. REV values appear in at least four formats: dash-separated numeric revisions such as 1-0, 2-0, or 5-1; single letters like A, B, C; double letters like AA or AB; and sometimes empty or missing fields. Some drawings were rotated 90 or 270 degrees. Many had revision history tables (multi-row change logs) sitting near the current REV field, an obvious false-positive trap. And the grid reference characters on the drawing border can easily be mistaken for single-letter revision values.
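These formats are regular enough to validate mechanically. A minimal sketch of such a check (the regex and function name are illustrative, not the production code):

```python
import re

# Matches the four observed formats: dash-separated numerics ("1-0", "5-1"),
# single letters ("A"), and double letters ("AA", "AB").
# An empty field is rejected here and handled separately by the caller.
REV_PATTERN = re.compile(r"^(?:\d+-\d+|[A-Z]{1,2})$")

def is_valid_rev(value: str) -> bool:
    """Return True if the candidate text looks like a current REV value."""
    return bool(REV_PATTERN.match(value.strip().upper()))
```

Anchoring the pattern to the whole string is what keeps multi-word text such as "SECTION C-C" out of the candidate pool.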

Why the Full AI Approach Was the Wrong Decision

You could throw every document at GPT-4 Vision and call it a day, but at about $0.01 per image and 10 seconds per call, that's $47 and roughly 100 minutes of API time. More importantly, you'd be paying for expensive inference on documents where a few lines of Python can produce the answer in milliseconds.

The logic was simple: if a document has extractable text and the REV value follows predictable patterns, there is no reason to invoke an LLM. Save the model for the cases where the heuristics fail.
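That one sentence is the whole architecture. As a sketch, with the two stages passed in as plain callables (the names here are placeholders, not the production API):

```python
from pathlib import Path
from typing import Callable, Optional

def route_extraction(
    pdf_path: Path,
    stage1: Callable[[Path], Optional[str]],  # rule-based, free, milliseconds
    stage2: Callable[[Path], str],            # LLM, ~$0.01 and ~10 s per call
) -> str:
    """Try the deterministic pass first; pay for the model only on a miss."""
    rev = stage1(pdf_path)
    if rev is not None:
        return rev
    return stage2(pdf_path)
```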

A Functional Hybrid Architecture

📷 [FIGURE 2: Pipeline architecture diagram] A two-stage hybrid pipeline: every PDF enters Stage 1 (rule-based PyMuPDF extraction). If a confident result comes back, it goes straight to the CSV output. Otherwise, the PDF falls through to Stage 2 (GPT-4 Vision via Azure OpenAI).

Stage 1: PyMuPDF extraction (deterministic, zero cost). For every PDF, we first attempt rule-based extraction with PyMuPDF. The heuristic focuses on the bottom-right quadrant of the page, where title blocks live, and searches for text around known anchors such as “REV”, “DWG NO”, “SHEET”, and “SCALE”. A scoring function ranks candidates by proximity to these anchors and conformance to known REV formats.

from pathlib import Path
from typing import Optional

def extract_native_pymupdf(pdf_path: Path) -> Optional[RevResult]:
    """Try native PyMuPDF text extraction with spatial filtering."""
    try:
        best = process_pdf_native(
            pdf_path,
            brx=DEFAULT_BR_X,      # bottom-right X threshold
            bry=DEFAULT_BR_Y,      # bottom-right Y threshold
            blocklist=DEFAULT_REV_2L_BLOCKLIST,
            edge_margin=DEFAULT_EDGE_MARGIN
        )
        if best and best.value:
            value = _normalize_output_value(best.value)
            return RevResult(
                file=pdf_path.name,
                value=value,
                engine=f"pymupdf_{best.engine}",
                confidence="high" if best.score > 100 else "medium",
                notes=best.context_snippet
            )
        return None
    except Exception:
        return None

The blocklist filters out common false positives: section markers, grid references, page references. Constraining the search to the title block area cuts false matches close to zero.
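The production blocklist (DEFAULT_REV_2L_BLOCKLIST in the snippet above) is tuned to the client's drawings, but the shape of the filter is simple. An illustrative version covering a few of the false-positive classes (the specific rules are examples, not the real list):

```python
import re

def is_false_positive(candidate: str, context: str) -> bool:
    """Reject candidates whose surrounding text marks them as something
    other than a title-block REV. Rules are illustrative examples."""
    ctx = context.upper()
    # Two-letter fragments of nearby anchors ("DWG NO", "1 OF 2").
    if candidate.strip().upper() in {"NO", "OF"}:
        return True
    # Section markers: "SECTION C-C" would otherwise yield a letter pair.
    if re.search(r"\bSECTION\b", ctx):
        return True
    # Sheet/page references like "1 OF 3" are not revisions.
    if re.search(r"\b\d+\s+OF\s+\d+\b", ctx):
        return True
    return False
```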

Stage 2: GPT-4 Vision (for everything Stage 1 misses). If native extraction comes back empty, perhaps because the PDF is image-based or its text structure is too complex, we render the first page as a PNG and send it to GPT-4 Vision via Azure OpenAI.

import base64
from pathlib import Path
from typing import Tuple

import fitz  # PyMuPDF

def pdf_to_base64_image(self, pdf_path: Path, page_idx: int = 0,
                        dpi: int = 150) -> Tuple[str, int, bool]:
    """Convert PDF page to base64 PNG with smart rotation handling."""
    rotation, should_correct = detect_and_validate_rotation(pdf_path)
    
    with fitz.open(pdf_path) as doc:
        page = doc[page_idx]
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi/72, dpi/72), alpha=False)
        
        if rotation != 0 and should_correct:
            img_bytes = correct_rotation(pix, rotation)
            return base64.b64encode(img_bytes).decode(), rotation, True
        else:
            return base64.b64encode(pix.tobytes("png")).decode(), rotation, False

We settled on 150 DPI after testing. Higher resolutions bloated the payloads and slowed API calls without improving accuracy; lower resolutions lost detail in marginal scans.
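With the page rendered, the Vision call itself is a standard chat-completions request with the PNG inlined as a data URL. A sketch of the message payload (following the OpenAI chat-completions schema; prompt text abbreviated):

```python
def build_vision_messages(image_b64: str, system_prompt: str) -> list:
    """Assemble a GPT-4 Vision chat payload with the page as a data URL."""
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the current REV value from the title block."},
                {"type": "image_url",
                 "image_url": {
                     "url": f"data:image/png;base64,{image_b64}",
                     "detail": "high",  # title-block text is small
                 }},
            ],
        },
    ]
```

The same payload works against an Azure OpenAI deployment; only the client configuration differs.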

What Broke in Production

Two categories of problems emerged only when we ran through the full corpus of 4,700 documents.

Rotation ambiguity. Engineering drawings are usually laid out in landscape orientation, but how PDF metadata encodes that orientation varies widely. Some files set /Rotate correctly. Others physically rotate the content but leave the metadata at zero. We solved this with a heuristic: if PyMuPDF can extract more than ten text blocks from the unrotated page, the orientation is probably correct regardless of the metadata. Otherwise, we apply the correction before sending the image to GPT-4 Vision.
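The decision reduces to two numbers: the /Rotate metadata and how much text the unrotated page yields. A sketch of that heuristic (threshold from the article; the function name is illustrative):

```python
def should_correct_rotation(rotate_metadata: int, n_text_blocks: int,
                            block_threshold: int = 10) -> bool:
    """Decide whether to physically rotate the page before rendering.

    If the unrotated page already yields plenty of text blocks, trust the
    content and ignore /Rotate; otherwise fall back to the metadata.
    """
    if n_text_blocks > block_threshold:
        return False                 # content reads fine as stored
    return rotate_metadata != 0      # apply the metadata rotation, if any
```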

Few-shot anchoring. The model sometimes parroted values from the prompt's few-shot examples instead of reading the actual drawing. If every example showed a REV of “2-0”, the model developed a bias toward outputting “2-0” even when the drawing clearly showed “A” or “3-0”. We fixed this in two ways: we spread the examples across all valid formats, with explicit warnings not to copy them, and we added explicit instructions distinguishing the revision history table (a multi-row change log) from the current REV field (a single value in the title block).

CRITICAL RULES - AVOID THESE:
✗ DO NOT extract from REVISION HISTORY TABLES
   (columns: REV | DESCRIPTION | DATE)
   - We want the CURRENT REV from title block (single value)
 
✗ DO NOT extract grid reference letters (A, B, C along edges)
✗ DO NOT extract section markers ("SECTION C-C", "SECTION B-B")
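The companion fix for few-shot anchoring was mechanical: make sure the examples cover every valid format so no single style dominates. An illustrative way to build that section of the prompt (the examples themselves are invented):

```python
# One example per valid REV format, so the model cannot anchor on one style.
FEW_SHOT_EXAMPLES = [
    ("title block shows REV 2-0", "2-0"),    # dash-separated numeric
    ("title block shows REV A", "A"),        # single letter
    ("title block shows REV AB", "AB"),      # double letter
    ("title block REV field is blank", ""),  # empty/missing
]

def build_examples_section() -> str:
    """Render the few-shot examples as a prompt section."""
    lines = ["EXAMPLES (formats vary - never assume one style):"]
    for drawing, answer in FEW_SHOT_EXAMPLES:
        lines.append(f"- {drawing} -> output: {answer or '(empty)'}")
    return "\n".join(lines)
```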

Results and Trade-offs

We evaluated the pipeline against a sample of 400 files with manually verified ground truth.

Metric | Hybrid (PyMuPDF + GPT-4) | GPT-4 Only
Accuracy (n=400) | 96% | 98%
Processing time (n=4,730) | ~45 minutes | ~100 minutes
API fees | ~$10-15 | ~$47 (all files)
Human review rate | ~5% | ~1%

The 2% accuracy gap was the price of a 55-minute cut in processing time and roughly two-thirds off the API cost. In a data migration where engineers would spot-check values anyway, 96% accuracy with ~5% of files flagged for review was acceptable. If the use case had been compliance, we would have run GPT-4 on every file.
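The API-fee line in the table follows directly from the corpus split: only the 20-30% of files that miss Stage 1 ever reach the paid model. A quick sanity check:

```python
n_files = 4_730
vision_share = (0.20, 0.30)   # fraction of the corpus that falls to Stage 2
cost_per_call = 0.01          # USD per GPT-4 Vision image, from the article

low, high = (n_files * share * cost_per_call for share in vision_share)
print(f"${low:.0f}-${high:.0f} vs ${n_files * cost_per_call:.0f} all-Vision")
```

That lands at roughly $9-14, in line with the ~$10-15 in the table, against $47 for sending every file to Vision.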

We later benchmarked newer models, including GPT-5, against the same 400-file validation set. Accuracy plateaued at 98%, matching GPT-4.1. The newer models offered no meaningful lift on this extraction task, at higher cost per call and slower speeds, so we shipped GPT-4.1. When the task is a spatially constrained pattern match over a well-defined document region, the ceiling is prompting and preprocessing, not the model's reasoning power.

In production work, the “right” accuracy target is not always the highest you can hit. It's the one that balances cost, latency, and the downstream workflows that depend on your output.

From Script to System

The first delivery was a command line tool: feed a folder of PDFs, get a CSV of the results. It worked within our Microsoft Azure environment, using Azure OpenAI endpoints for GPT-4 Vision calls.
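The CLI surface was deliberately thin. A sketch of its argument handling (the program name and flags are illustrative, not the shipped tool):

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    """CLI shape of the first delivery: a folder of PDFs in, a CSV out."""
    parser = argparse.ArgumentParser(
        prog="rev-extract",
        description="Extract current REV values from engineering drawing PDFs",
    )
    parser.add_argument("input_dir", type=Path,
                        help="folder of PDFs to process")
    parser.add_argument("-o", "--output", type=Path,
                        default=Path("revisions.csv"),
                        help="destination CSV")
    return parser
```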

After the first migration succeeded, stakeholders asked whether other teams could use it. We wrapped the pipeline in a lightweight internal web application with a file-upload interface, so non-technical users can run extractions without touching a terminal. The system has since been adopted by engineering teams at multiple sites across the organization, each running their own drawing archives through it for migration and research tasks. I can't share screenshots for confidentiality reasons, but the core extraction logic is exactly what I've described here.

Lessons Learned

Start with the cheapest method that works. The instinct when working with LLMs is to use them for everything. Resist it. A domain-specific heuristic handled 70-80% of our corpus at zero cost. The LLM added value precisely because we kept it focused on the cases where the rules failed.

Validate at scale, not on cherry-picked samples. Rotation ambiguity, revision-history-table confusion, grid-reference false positives: none of these appeared in our first test set of 20 files. Your validation set needs to represent the actual distribution of edge cases you will see in production.

Prompt engineering is software engineering. The system prompt went through many iterations, with structured examples, explicit negative cases, and self-validation checklists. Treating it as a throwaway string instead of a carefully versioned artifact is how you end up with unexplained output regressions.

Measure what matters to stakeholders. They didn't care whether the pipeline used PyMuPDF, GPT-4, or carrier pigeons. They cared that 4,700 drawings were processed in 45 minutes instead of four weeks, for $50-70 in API calls instead of £8,000+ in engineering time, and that the results were accurate enough to proceed with confidence.

The full pipeline is about 600 lines of Python. It saved four weeks of engineering time, cost less than a team lunch in API fees, and has since been used as a production tool on multiple sites. We tested the latest models. They were no better at this job. Sometimes the most impactful AI work isn't using the most powerful model available. It's about knowing where the model belongs in the system, and keeping it there.

Obinna is a Senior AI/Data Engineer based in Leeds, UK, specializing in document intelligence and production AI systems. He writes about practical AI engineering as @DataSenseiObi on X and Wisabi Analytics on YouTube.
