
Building Clean, Schema-Enforced Pipelines with Tensorlake + Outlines
TL;DR
This blog explores how to combine Tensorlake Document AI with Outlines to build schema-enforced pipelines for document processing. We show how parsing documents with Tensorlake and constraining model outputs with Outlines leads to structured data that is guaranteed to match your schema. The example we walk through is an invoice parser that produces valid JSON on every run.
Even with OCR, the data extracted from parsed documents contains noise, missing values, and misaligned fields.
Adding a large language model helps with reasoning but introduces its own issues. Models return malformed JSON, mix up date formats, or hallucinate values that were never in the source. It is not unusual to see outputs with mismatched braces or extra tokens that prevent the JSON from parsing at all.
In a production pipeline, these errors lead to broken downstream jobs, failed integrations, and costly manual review. And while regex cleanups, validation scripts, or retry loops can help for a handful of documents, these band-aid solutions quickly break down at scale. What is missing is a reliable way to go from messy documents to structured, validated data that downstream systems can trust.
Schema-Enforced Pipelines
A robust pipeline for document AI has two distinct requirements:
- Parsing that preserves structure. Documents are more than raw text. They have tables, headers, signatures, and layouts that provide meaning. Tensorlake Document AI captures this structure by converting PDFs, scans, and images into clean fragments enriched with metadata such as page numbers, bounding boxes, and element types. This step ensures that the model starts from a faithful representation of the source.
- Generation that obeys a schema. Even with high-quality input, an unconstrained model can produce unpredictable output. Outlines addresses this problem by constraining the decoding process itself. Instead of hoping a model will follow instructions, Outlines enforces rules during generation so that every result conforms to a JSON Schema or Pydantic model.
When combined, the two form a pipeline that is resilient from end to end. Tensorlake ensures that no information is lost at the parsing stage. Outlines guarantees that the output matches the exact shape expected by downstream systems.
The result is clean, verifiable data that can flow into databases, APIs, or analytics pipelines without brittle post-processing.
How Outlines Enforces Schema Constraints
A language model normally works like this: at each step, it looks at the context and outputs a probability distribution over its vocabulary for the next token. The model picks one token (sampled or greedy), appends it to the sequence, and repeats. Nothing in this process stops the model from outputting invalid JSON or an incorrect type.
Outlines changes the decoding loop. Instead of sampling from the model’s full vocabulary, it builds a finite state machine (FSM) that represents all valid outputs for a given schema. At each decoding step, it masks out any tokens that would violate the schema and only allows those that keep the output valid. For example:
- If your schema says "total_amount" must be a number, then at the point where the model is about to generate that value, the FSM prunes away every token that is not a digit, decimal point, or valid number continuation.
- If your schema says the output must be valid JSON, the FSM ensures braces, commas, and quotes are placed in the right order. It prevents the model from ever outputting a stray ",]" or an unclosed string.
At each decoding step, Outlines intersects the model’s probability distribution with the set of tokens allowed by the FSM. The model is still free to choose among the allowed tokens, but it can never choose an invalid one.
Think of it as autocomplete with guardrails. The model can still “write the story,” but Outlines makes sure every sentence is grammatically correct according to the schema.
The benefit is that the constraint happens during generation, not after. That is why every output from Outlines is guaranteed to be valid JSON, match the schema, and respect type definitions like enums or dates.
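To make the mechanics concrete, here is a minimal, self-contained sketch of the masking step in Python. It is not Outlines’ implementation, only an illustration of the idea: take the model’s logits, discard everything the schema’s FSM disallows, and sample from what remains.

import numpy as np

def constrained_step(logits, allowed_token_ids):
    # Mask every token the FSM disallows, then sample from what is left.
    masked = np.full_like(logits, -np.inf)
    allowed = list(allowed_token_ids)
    masked[allowed] = logits[allowed]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Toy vocabulary: pretend ids 0-9 are digit tokens and id 10 is "}".
logits = np.random.randn(11)
next_id = constrained_step(logits, allowed_token_ids=range(10))
print(next_id)  # always one of the digit ids, never "}"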
A Hands-On Example: Invoice Processing
To make the benefits concrete, let’s build a small pipeline that extracts structured data from invoices. The goal is to parse a PDF invoice, then enforce a schema so that every run produces valid JSON.
Step 1: Parse the Document with Tensorlake
from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI()
file_id = doc_ai.upload("invoice.pdf")      # upload the PDF, get back a file id
result = doc_ai.parse_and_wait(file_id)     # block until parsing completes

assert result.status == ParseStatus.SUCCESSFUL
for frag in result.chunks[:5]:
    print(frag.page_number, frag.type, frag.content[:80])
Tensorlake converts the document into fragments such as headers, table rows, and text blocks. Each fragment comes with metadata (page number, bounding box, fragment type). This preserves the structure of the source, rather than flattening everything into a single text blob.
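Because this metadata travels with each fragment, you can shape the model’s context before extraction instead of dumping the whole document into the prompt. A minimal sketch, reusing the result object from above and assuming the invoice fields appear on the first page:

# Keep only fragments from page 1 before building the prompt.
# page_number, type, and content are the same attributes printed above;
# inspect frag.type on your own documents before filtering on specific labels.
first_page = [frag for frag in result.chunks if frag.page_number == 1]
invoice_text = "\n".join(frag.content for frag in first_page)
print(f"Kept {len(first_page)} of {len(result.chunks)} fragments")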
Step 2: Define the Schema
We use a Pydantic model to describe the fields we expect.
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice number on the invoice")
    issue_date: str
    due_date: str
    vendor_name: str
    total_amount: float
This schema becomes the contract for our pipeline. Downstream code can rely on this shape without extra validation.
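The same Pydantic model can also be exported as a JSON Schema, which is handy when you want to inspect or version the contract independently of the extraction code:

import json

# Pydantic v2 emits the JSON Schema for the model defined above.
print(json.dumps(Invoice.model_json_schema(), indent=2))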
Step 3: Run Extraction with Outlines
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# Wrap a local Hugging Face model so Outlines can control its decoding loop.
model_name = "microsoft/Phi-3-mini-4k-instruct"
omodel = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_name, device_map="auto"),
    AutoTokenizer.from_pretrained(model_name),
)

prompt = f"""
Extract the invoice fields from the following text:
{' '.join([frag.content for frag in result.chunks[:10]])}
"""

# Passing the Pydantic model constrains generation to the Invoice schema.
raw = omodel(prompt, Invoice)
invoice = Invoice.model_validate_json(raw)
print(invoice.model_dump())
Outlines constrains the decoding process so that the output must satisfy the Invoice schema. Even if the model tries to produce something malformed, the FSM blocks it and only allows valid tokens.
Step 4: Outputs
Without schema enforcement, LLMs often return output that does not adhere to the Pydantic model: malformed JSON, mismatched braces, or fields with the wrong type.
With Tensorlake + Outlines, the output becomes:
{
  "invoice_number": "INV-2387",
  "issue_date": "2025-05-10",
  "due_date": "2025-06-10",
  "vendor_name": "Acme Supplies Ltd.",
  "total_amount": 3000.0
}
The result is clean, type-safe, and ready for downstream systems. No regex cleanup, no retries, no manual corrections.
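As a sketch of what "ready for downstream systems" means in practice, the validated object can be written straight to a store with no defensive parsing. The SQLite table and column names below are illustrative, not part of Tensorlake or Outlines:

import sqlite3

conn = sqlite3.connect("invoices.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS invoices (
           invoice_number TEXT PRIMARY KEY,
           issue_date TEXT,
           due_date TEXT,
           vendor_name TEXT,
           total_amount REAL
       )"""
)
# invoice is the validated Invoice instance from Step 3.
conn.execute(
    "INSERT OR REPLACE INTO invoices VALUES "
    "(:invoice_number, :issue_date, :due_date, :vendor_name, :total_amount)",
    invoice.model_dump(),
)
conn.commit()
conn.close()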
Best Practices and Takeaways
When building schema-enforced pipelines, there are a few patterns that make them more robust in production:
- Design schemas carefully. Keep them as simple as possible, prefer low nesting levels, and experiment with different schema keys and their descriptions.
- Filter before you extract. Tensorlake page classification and fragment typing let you discard irrelevant sections (like footers or signatures) before passing text to the model. This reduces noise and improves accuracy.
- Validate twice. Outlines guarantees schema validity during decoding, but it is good practice to validate again downstream with Pydantic. This provides an extra safety net before writing to a database.
- Handle missing values explicitly. Instead of letting models hallucinate, define optional fields in the schema so the absence of data is captured cleanly (see the sketch after this list).
- Benchmark cost and latency. Constrained decoding has overhead, especially for large schemas. Measure the trade-offs between schema complexity and generation speed.
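As an example of handling missing values, here is a sketch of the Step 2 schema with the due date made optional; which fields you relax depends on your documents:

from typing import Optional
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str
    issue_date: str
    # None when the document simply does not state a due date.
    due_date: Optional[str] = Field(
        default=None, description="Due date, or null if the invoice omits it"
    )
    vendor_name: str
    total_amount: float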
Conclusion
Schema enforcement changes the reliability of document pipelines. By combining Tensorlake Document AI with Outlines, you get LLM-ready inputs and guaranteed-valid outputs in a single workflow. Instead of patching errors with regex or manual review, every document is parsed into structured fragments and then decoded into a schema that downstream systems can trust.
We showed this on invoices, but the same pattern applies to contracts, claims, financial filings, or any document where correctness matters. The benefit is simple: fewer failures, lower review costs, and pipelines that scale.
We have published a demo notebook that walks through the entire workflow. Clone it, run it on your own PDFs, and see how schema-enforced pipelines change the way you build with document AI.

Antaripa Saha
DevRel Engineer at Tensorlake
Machine Learning Engineer with 4 years of experience. Love building with AI, with interest in VLMs, search, and memory.