Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas

0 0 8 minutes read

Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas

Datalab has released lift, a 9B open-weights vision model for structured extraction. You pass it a JSON schema, and it returns a JSON object that matches. The model reads PDFs and images directly, then decodes against your schema.

This is Datalab’s first model built purely for extraction. The team already ships open-source OCR tools: chandra, marker, and surya. lift extends that work into schema-driven field extraction.

lift scores 90.2% field accuracy on Datalab’s 225-document benchmark. The research team reports it as the strongest small self-hostable model they tested. It runs at a median of 9.5 seconds per document.

What is Datalab lift?

lift is a 9B-parameter vision model for structured extraction. It accepts standard JSON Schema as input. It returns valid JSON of that shape as output.

The model handles multi-page documents in a single pass. It can read values that span across pages. Whole documents go in at once, not page by page.

Two inference modes ship with the package. Local inference runs through HuggingFace. Remote inference runs through a vLLM server, which Datalab recommends for production.

The code is Apache 2.0. The weights use a modified OpenRAIL-M license.

lift enters a small but growing field of open extraction models. Some are purpose-built, like the NuExtract family. Others are general vision-language models pressed into extraction, like Qwen3.5-9B. It pairs a vision-language base with schema-constrained decoding and trained abstention. On Datalab’s benchmark, it leads that open group on field accuracy.

Schema-Constrained Decoding: The Core Mechanism

The main design choice is schema-constrained decoding. lift decodes its output directly against your schema. The result is always valid JSON of the correct shape.

Here is what happens under the hood. lift first turns your JSON Schema into a Pydantic model. It then normalizes that into a strict JSON Schema. The schema is passed to the vLLM server as a response_format constraint.

During generation, the server compiles the schema into a grammar. At each step, the model assigns a probability to every possible next token. The grammar defines which tokens are valid continuations. Tokens that would break the schema are masked out. The model can only sample from what remains.

This is why the output is always valid JSON of the right shape. The structure is enforced token by token, not checked afterward.

There is a sharp limit to this guarantee. Constrained decoding governs structure and types, not meaning. A field typed as number will hold a number. Whether it holds the correct number is a separate question. The model can emit a valid value that is simply wrong. Validity is not correctness.

lift also widens every field to allow null. Each scalar leaf in the compiled schema accepts its type or null. So the model can abstain on any field without breaking the structure. Abstention is both trained behavior and a property of the constraint.

You write standard JSON Schema. Supported types include string, number, integer, boolean, arrays of those, arrays of objects, and nested objects. A field description guides the model when a name is ambiguous.

This is also where a quiet failure mode lives. Some constructs cannot be compiled: enum, anyOf/oneOf, $ref, and additionalProperties. When lift cannot compile your schema, it does not stop. It logs a warning and generates without the constraint. The structural guarantee is gone for that run, with no hard error. Output may then fail to match your schema at all.

The practical rule is simple. Keep schemas inside the supported subset. Validate the returned JSON against your schema downstream. Do not assume valid output just because the call returned.

Here is a simple invoice schema:

{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string", "description": "Invoice identifier"},
    "total": {"type": "number", "description": "Total amount due"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "amount": {"type": "number"}
        }
      }
    }
  },
  "required": ["invoice_number", "total"]
}

Abstention by Default

Real extraction is hard for a non-obvious reason. Beyond reading fields that exist, the real challenge is not inventing fields that are absent.

A model that hallucinates a tax ID is worse than one returning nothing. The error is silent and hard to catch downstream. lift is trained to leave genuinely missing fields null.

Mark a field required only when it must appear. Fields absent from a document come back null. This gives you an extractor that can report a value is not present.

Benchmark

Datalab evaluated lift on a 225-document extraction benchmark. Documents ran 6 to 64 pages each, with roughly 11,000 scored fields. Adversarial cases were planted throughout the set.

Those cases include cross-page values and exhaustive lists. They also include fields that must be left null and near-miss distractors. Multi-source aggregation was tested as well.

Every model received the same rendered page images. Each extracted every document in a single pass. Scoring was a deterministic exact-match against ground truth, with numeric tolerance and normalized strings.

Model	Size	Field accuracy	Full-document accuracy	Median latency*	Features
Datalab API	—	95.9%	44.4%	30.8s	Citations + Verification
Gemini Flash 3.5	—	91.3%	40.0%	28.1s
lift	9B	90.2%	20.9%	9.5s
Azure Content Understanding	—	83.4%	22.2%	73.7s	Citations
NuExtract3	4B	81.5%	8.4%	8.3s
Qwen3.5-9B	9B	76.32%	24.0%	16.8s

* Per document, 8 concurrent requests. Local models (lift, Qwen3.5-9B, NuExtract3) were served with vLLM on a single GPU. Gemini, Datalab, and Azure ran via API. Latency varies with hardware and load; treat it as relative.

Two details matter here. Field accuracy is the fraction of individual fields extracted correctly. Full-document accuracy is the fraction of documents where every field is correct.

On field accuracy, lift leads the self-hostable models. It sits ahead of NuExtract3 and the Qwen3.5-9B base. It is also the fastest of the accurate models in the table.

At 9.5s median, lift is roughly 3x faster than Gemini Flash 3.5. It stays within about a point of that model’s field accuracy. Full-document accuracy is a harder metric: every field must be correct. Here lift scores 20.9%, ahead of only NuExtract3. The hosted APIs lead, at 44.4% and 40.0%.

A note on reading these numbers. This is Datalab’s own benchmark, so treat it as a vendor result. Its adversarial design rewards models tuned to abstain, which lift is. Full-document accuracy is low for every model, topping out at 44.4%. That reflects how hard single-pass extraction is on long documents. The numbers are also a snapshot; models change.

This is the reality of single-pass, single-model extraction on hard documents. It tells you where lift fits. It is excellent for field-level extraction that feeds a human-in-the-loop review or aggregate analytics. It is not yet a drop-in for zero-touch, every-field-must-be-perfect automation. For that last mile, Datalab’s hosted API adds per-field verification, citations, and confidence scores on the same approach.

A Practitioner Workflow: From Schema to Reviewed Data

Three use cases show the shape of the work. Invoice processing: define invoice_number, total, and line_items, and a missing tax_id returns null. Contract review: a two-page agreement carries a value across pages, which single-pass extraction stitches together. Document pipelines: an accounts-payable queue trusts that absent due dates return null, avoiding silent errors.

Here is one of them as an end-to-end workflow. The goal is a clean, reviewed dataset, not raw model output.

1. Define the schema. Add a description to any field whose name is not obvious. Mark only truly mandatory fields as required.

2. Run extraction. Pass the schema and the file to lift. Use a dict, a file path, or a saved schema name.

3. Branch on the result. A failed call or a null extraction goes to review. A missing required value also goes to review, since null is abstention, not an error.

4. Validate before you trust. Check the returned JSON against your schema. This catches the silent fallback when a schema could not be compiled.

from lift import extract

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total": {"type": "number", "description": "Total amount due"},
        "due_date": {"type": "string", "description": "Payment due date, ISO 8601"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["invoice_number", "total"]
}

result = extract("invoice.pdf", schema)

if result.error or result.extraction is None:
    queue_for_review("invoice.pdf", reason="extraction_failed")
else:
    data = result.extraction
    # A required field can still be null. That is abstention, not a crash.
    if data.get("total") is None:
        queue_for_review("invoice.pdf", reason="missing_total")
    else:
        save(data)

A few schema-design tips that pay off in practice:

Write a description for ambiguous fields; it is your main lever on accuracy.
Keep schemas inside the supported subset, and validate output downstream.
Prefer flat, shallow schemas; deep nesting is harder to extract reliably.
Mark fields required sparingly, so genuine gaps can return null.
Use –page-range (CLI) or page_range (Python) to limit long PDFs.
Reuse one InferenceManager across calls to amortize model load.

Self-Host vs. Hosted: Which to Use

lift ships as open weights, and Datalab runs a hosted API on the same approach. The choice is about constraints, not prestige.

Choose	When
Self-hosted lift (open weights)	Data residency or on-prem rules apply; you need cost control at high volume; you want latency control on your own GPUs; runs must work offline.
Hosted Datalab API	You need per-field verification, citations, and confidence scores; you want the highest accuracy; you would rather not manage infrastructure; volume is low or bursty.

One caveat for self-hosting. Commercial use needs a license under the modified OpenRAIL-M terms. It is free for research, personal use, and startups under $5M in funding or revenue, and not for use in competition with Datalab’s API.

Getting Started

The fastest path is the CLI. lift-pdf requires Python 3.12 or newer.

pip install lift-pdf

# Serve the model with vLLM (recommended)
lift_vllm

# Extract against a schema
lift_extract input.pdf ./output --schema schema.json

Each file produces two outputs. .json holds the extraction matching your schema. _metadata.json holds page count, token count, and error info for debugging.

The Python API is equally small:

from lift import extract

# schema: a dict, a path, an inline JSON string, or a library name
result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching the schema

Check result.extraction for the dict matching your schema. A null extraction signals a failure you can inspect. The HuggingFace backend uses –method hf and needs pip install lift-pdf[hf].

Schema Studio ships as a Streamlit app. It lets you build, save, and test schemas against your own documents. Install it with pip install lift-pdf[app], then run lift_app.

For production, lift_vllm launches a Docker container with batch size scaled to your GPU. Supported GPUs are: h100, a100-80, a100/a100-40, l40s, a10, l4, 4090, 3090, t4.

Interactive Explainer

{ current=k; render(); }; tabsEl.appendChild(b); }); } function buildFields(){ const sel = ensureSel(current); fieldsEl.innerHTML=””; DOCS[current].fields.forEach(f=>{ const row=document.createElement(‘label’); row.className=”f”; const cb=document.createElement(‘input’); cb.type=”checkbox”; cb.checked=sel.has(f.name); cb.onchange=()=>{ cb.checked ? sel.add(f.name) : sel.delete(f.name); }; const meta=document.createElement(‘div’); meta.className=”meta”; let tag = f.absent ? `not in document` : f.span ? `spans pages` : “”; meta.innerHTML = ` ${f.name} : ${f.type}${f.req?’ · required’:”} ${f.desc} ${tag}`; row.appendChild(cb); row.appendChild(meta); fieldsEl.appendChild(row); }); } function jval(v, indent){ const pad = ‘ ‘.repeat(indent); const pad2 = ‘ ‘.repeat(indent+1); if(v===null) return `null`; if(typeof v===’number’) return `${v}`; if(typeof v===’boolean’) return `${v}`; if(typeof v===’string’) return `“${v}”`; if(Array.isArray(v)){ if(v.length===0) return ‘[]’; const items=v.map(it=>pad2+jval(it,indent+1)).join(‘,n’); return `[n${items}n${pad}]`; } // object const keys=Object.keys(v); const body=keys.map(k=>`${pad2}“${k}”: ${jval(v[k],indent+1)}`).join(‘,n’); return `{n${body}n${pad}}`; } function extract(){ const sel = ensureSel(current); const fields = DOCS[current].fields.filter(f=>sel.has(f.name)); if(fields.length===0){ jsonEl.innerHTML = `// No fields selected — nothing to extract`; statsEl.innerHTML=””; noteEl.innerHTML=””; emitSize(); return; } // Build object honoring schema shape; absent -> null (abstention) const obj={}; let found=0, nulls=0, spans=0; fields.forEach(f=>{ if(f.absent){ obj[f.name]=null; nulls++; } else { obj[f.name]=f.val; found++; if(f.span) spans++; } }); const lines = Object.keys(obj).map(k=>{ const f = fields.find(x=>x.name===k); const badge = f.absent ? `null · abstained` : `found`; return ` “${k}”: ${jval(obj[k],1)}${badge}`; }); jsonEl.innerHTML = `{n${lines.join(‘,n’)}n}`; statsEl.innerHTML = `${found} found${nulls} null${fields.length} fields`; let n = `Output is valid JSON of exactly the shape you selected — that is schema-constrained decoding. `; if(nulls>0) n += `Fields not present in the document came back <code>null</code> instead of a guess (abstention). `; if(spans>0) n += `<code>total_value</code> is resolved from values that span two pages, read in a single pass.`; noteEl.innerHTML = n; emitSize(); } function render(){ buildTabs(); docEl.innerHTML = DOCS[current].render; buildFields(); jsonEl.innerHTML = `// Press Extract to run the selected schema`; statsEl.innerHTML=””; noteEl.innerHTML=””; emitSize(); } document.getElementById(‘run’).onclick = extract; document.getElementById(‘all’).onclick = ()=>{ const sel=ensureSel(current); const all = DOCS[current].fields; const allOn = all.every(f=>sel.has(f.name)); all.forEach(f=> allOn ? sel.delete(f.name) : sel.add(f.name)); buildFields(); emitSize(); }; /* auto-resize: post component offsetHeight + 40 to parent */ function emitSize(){ requestAnimationFrame(()=>{ const h = (document.querySelector(‘.wrap’).offsetHeight || document.body.offsetHeight) + 40; parent.postMessage({type:’lift-demo-height’, height:h}, ‘*’); }); } window.addEventListener(‘load’, ()=>{ render(); emitSize(); }); window.addEventListener(‘resize’, emitSize);

Source link

nimda 4 hours ago

0 0 8 minutes read