
Page Classification: Smarter, Safer Structured Extraction
TL;DR
Mixed‑format, multi‑page documents often combine wildly different page types, like personal data, legal terms, and signatures, that don’t all belong in the same extraction schema. Running extraction across every page wastes compute, adds noise, and lowers accuracy. Tensorlake’s Page Classification solves this by labeling pages up front, letting you target extraction only where it’s relevant. In a single API call, you get page labels, structured JSON per page, full layout, and markdown chunks—ready for use in RAG pipelines, automation, databases, or ETL workflows. The result is cleaner data, fewer errors, and simpler pipelines.
If you’ve ever tried to run structured extraction over a 100‑page PDF, you know the pain: pages with irrelevant legalese pollute your output, OCR burns cycles on tables you don’t care about, and your downstream logic drowns in noise. Most real‑world document bundles are a Frankenstein mix of formats and page types (loan forms, annexes, signatures, appendices) and dumping them all into your schema is like feeding your LLM a junk drawer.
Page Classification fixes this. With a single API call, you can label pages by type, extract structured data only where it matters, and keep multiple repeated data blocks neatly partitioned. No multi‑stage pipelines. No brittle regex gymnastics. Just clean, page‑aware JSON ready to drop into your RAG, agents, or ETL workflows.
Why Page Classification matters
Modern document workflows often involve mixed formats (PDFs, Word, PPTX, Excel, images) with varying page types. Consider these common scenarios:
- Multi-page loan applications that mix applicant data pages with legal terms and appendices
- Contract bundles that include redacted sections, signature pages, and technical annexes
- Insurance files combining personal information, claim details, and supporting documentation
- Research reports with executive summaries, data tables, methodology sections, and references
Running structured extraction across every page wastes compute cycles, introduces noise, and reduces accuracy. You end up with polluted schemas where personal data gets mixed with legal boilerplate, or where signature detection runs on pages that contain only text.
Page Classification solves this by letting you:
- Classify pages upfront using simple, rule-based descriptions
- Target extraction only to relevant page types
- Partition data cleanly with multiple instances of the same schema across different pages
- Maintain traceability knowing exactly which pages contributed to each extracted record
How Page Classification works
Page Classification operates on a simple principle: define your page types once, then let Tensorlake handle the classification and targeted extraction automatically. The process involves three straightforward steps that all happen within a single API call.
Instead of building complex multi-stage pipelines or writing brittle page-detection logic, you simply describe what each page type looks like in natural language. Tensorlake's AI models then classify each page and apply the appropriate extraction schema only where it makes sense.
Here's how it works in practice:
1. Define Your Page Classes
Define the types of pages you want to classify by providing a name and description for each:
# Describe each page type in natural language; Tensorlake uses these
# descriptions to classify every page in the document.
page_classifications = [
    PageClassConfig(
        name="applicant_info",
        description="Page containing personal info: name, address, SSN"
    ),
    PageClassConfig(
        name="contract_terms",
        description="Pages with legal contract terms and definitions"
    )
]
2. Target Structured Extraction by Page Class
Each structured extraction request pairs a schema with an optional list of page classes, so data is extracted only from pages of those classes:
structured_extraction_options = [
    # Extract applicant data only from pages classified as "applicant_info"
    StructuredExtractionOptions(
        schema_name="ApplicantInfo",
        json_schema=applicant_schema,
        page_classifications=["applicant_info"]
    ),
    # Extract contract terms only from pages classified as "contract_terms"
    StructuredExtractionOptions(
        schema_name="Terms",
        json_schema=terms_schema,
        page_classifications=["contract_terms"]
    )
]
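The applicant_schema and terms_schema referenced above are ordinary JSON Schemas, here assumed to be plain Python dicts. A minimal sketch of what they might contain; the field names are illustrative assumptions, not a required shape (the terms fields mirror the sample output later in this post):

# Hypothetical JSON Schemas for the snippet above; field names are
# illustrative assumptions, not a required shape.
applicant_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"},
        "ssn": {"type": "string"}
    }
}

terms_schema = {
    "type": "object",
    "properties": {
        "terms": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "term_name": {"type": "string"},
                    "term_description": {"type": "string"}
                }
            }
        }
    }
}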
3. One endpoint, everything delivered
By calling the single /parse endpoint, you’ll get:
- page_classes: pages grouped by classification
- structured_data: a list of records per page
- A full document_layout
- Markdown chunks
All in one response:
import json

from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import StructuredExtractionOptions, PageClassConfig

doc_ai = DocumentAI(api_key="YOUR_API_KEY")

# Classification and targeted extraction happen in a single parse call
parse_id = doc_ai.parse(
    file="application_bundle.pdf",
    page_classifications=page_classifications,
    structured_data_extraction=structured_extraction_options
)

result = doc_ai.wait_for_completion(parse_id)

# Pages grouped by the classes defined above
print("\nPage Classifications:")
for page_classification in result.page_classes:
    print(f"- {page_classification.page_class}: {page_classification.page_numbers}")

# One record per schema, with the pages it was extracted from
print("\nStructured Data:")
for structured_data in result.structured_data:
    print(f"\n=== {structured_data.schema_name} ===")
    data = structured_data.data
    pages = structured_data.page_numbers
    print(json.dumps(data, indent=2, ensure_ascii=False))
    print("Extracted from pages:", pages)
Page Class Output (from a sample two-page agreement classified with terms_and_conditions and signature_page classes):
Page Classifications:
- terms_and_conditions: [1]
- signature_page: [2]
Structured Data by Page Class Output:
Structured Data:

=== TermsAndConditions ===
{
  "terms": [
    {
      "term_description": "You agree to only use WizzleCorp services for imaginary purposes. Any attempt to apply our products or services to real-world situations will result in strong disapproval.",
      "term_name": "Use of Services"
    },
    {
      "term_description": "Users must behave in a whimsical, respectful, and sometimes rhyming manner while interacting with WizzleCorp platforms. No trolls allowed. Literal or figurative.",
      "term_name": "User Conduct"
    },
    {
      "term_description": "All ideas, dreams, and unicorn thoughts shared through WizzleCorp become the temporary property of the Dream Bureau, a subdivision of the Ministry of Make-Believe.",
      "term_name": "Imaginary Ownership"
    },
    {
      "term_description": "By using this site, you consent to the use of cookies - both digital and chocolate chip. We cannot guarantee the availability of milk.",
      "term_name": "Cookies and Snacks"
    },
    {
      "term_description": "We reserve the right to revoke your imaginary license to access WizzleCorp should you fail to smile at least once during your visit.",
      "term_name": "Termination of Use"
    },
    {
      "term_description": "These terms may be updated every lunar eclipse. We are not responsible for any confusion caused by ancient prophecies or time travel.",
      "term_name": "Modifications"
    }
  ]
}
Extracted from pages: [1, 2]

=== Signatures ===
{
  "signature_date": "January 13, 2026",
  "signature_present": true,
  "signer_name": "April Snyder"
}
Extracted from pages: [1, 2]
Context Engineering for RAG, Agents, and ETL
Structured extraction isn’t useful in isolation; it needs to plug seamlessly into your workflows. Whether you’re building a retrieval-augmented generation (RAG) system, automating agents, or feeding data into a database or ETL pipeline, Tensorlake outputs are designed to slot in cleanly with the tools you’re already using.
Here’s how structured extraction with per-page context and JSON output enhances every part of your stack:
RAG Workflows
Feed high-fidelity, page-anchored JSON directly into your retrieval pipelines. By anchoring each field to the correct page and context, you can extract data that respects both structure and semantics—improving retrieval precision.
Say goodbye to hallucinated content and hello to grounded generation.
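As a sketch, each extracted record can be packaged with its page provenance before indexing. This builds on the result and json names from the parse snippet above; the document dict shape is an assumption, so adapt it to your vector store’s ingest format:

# Sketch: package each record with its page provenance for indexing.
# The dict shape is an assumption; adapt it to your vector store's format.
retrieval_docs = []
for record in result.structured_data:
    retrieval_docs.append({
        "id": record.schema_name,
        "text": json.dumps(record.data, ensure_ascii=False),
        "metadata": {"pages": record.page_numbers, "source": "application_bundle.pdf"}
    })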
Agents & Automation
Trigger agents or workflow steps based on what was found—on which page and in what context. With every page classified and parsed into clean JSON and markdown chunks, your automations can take action with confidence.
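A minimal sketch of that kind of routing, reusing the parse result from above and the signature_page class from the example output:

# Sketch: route follow-up work based on which page classes were found.
# "signature_page" comes from the example output above.
classified = {pc.page_class: pc.page_numbers for pc in result.page_classes}

if "signature_page" in classified:
    print(f"Signature page(s) on {classified['signature_page']}; queueing verification.")
else:
    print("No signature page found; flagging for manual review.")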
Databases & ETL
Each structured extraction is a self-contained, traceable entity. You know what was extracted, where it came from, and how it maps to your data model. Use this to build ETL pipelines that are both accurate and auditable, or create page-aware payloads for indexing and querying with pinpoint precision.
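A minimal sketch of that landing step, with SQLite and the table layout standing in for whatever your pipeline actually uses:

import json
import sqlite3

# Sketch: persist each record with the pages it came from, so every
# row stays traceable back to the source document.
conn = sqlite3.connect("extractions.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS extractions (schema_name TEXT, payload TEXT, pages TEXT)"
)
for record in result.structured_data:
    conn.execute(
        "INSERT INTO extractions VALUES (?, ?, ?)",
        (record.schema_name,
         json.dumps(record.data, ensure_ascii=False),
         json.dumps(record.page_numbers))
    )
conn.commit()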
Try Page Classification for Precise Structured Data Extraction
Ready to streamline your document pipelines?
Explore Page Classification today with this Colab Notebook or dig into the docs.
Got feedback or want to show us what you built? Join the conversation in our Slack Community!

Dr Sarah Guthals
Founding DevRel Engineer at Tensorlake
I blend deep technical expertise with a decade of experience leading developer engagement at companies like GitHub, Microsoft, and Sentry. With a PhD in Computer Science and a background in founding developer education startups, I focus on building tools, content, and communities that help engineers work smarter with AI and data.