Blog

Benchmarking the Most Reliable Document Parsing API

Learn how Tensorlake built the most reliable document parsing API by measuring what actually matters: structural preservation, reading order accuracy, and downstream usability. See benchmark results comparing Tensorlake to Azure, AWS Textract, and open-source solutions on real enterprise documents.

January 16, 2025
Changelog

New: Vision Language Models for Document Processing

Tensorlake now uses Vision Language Models (VLMs) across multiple features including page classification, figure/table summarization, and structured extraction, enabling faster and more intelligent document understanding.

  • VLM-powered page classification for efficient large document processing
  • Direct visual understanding for figures, tables, and structured data extraction
  • Skip OCR entirely with VLM-based extraction for more accurate results from hard-to-parse documents
October 9, 2025
Changelog

New: Tracked Changes Parsing for Word Documents

Tensorlake now preserves tracked changes (insertions, deletions, and comments) from Word documents as structured HTML, enabling programmatic access to document revision history.

  • Preserve insertions, deletions, and comments from .docx tracked changes
  • Structured HTML output with semantic tags (<ins>, <del>, <span class='comment'>)
  • Extract author metadata and comment text programmatically
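
As a rough illustration of what programmatic access to this output could look like, the sketch below collects insertions, deletions, and comments from an HTML fragment using only Python's standard library. The sample fragment and the `extract_changes` helper are hypothetical; in particular, it assumes the comment text sits inside the `<span class='comment'>` element, which the changelog does not confirm, and it does not attempt to read author metadata.

```python
from html.parser import HTMLParser


class TrackedChangeParser(HTMLParser):
    """Collects (kind, text) pairs for <ins>, <del>, and comment spans."""

    KINDS = {"ins": "insertion", "del": "deletion"}

    def __init__(self):
        super().__init__()
        self.changes = []  # finished (kind, text) pairs, in closing order
        self._open = []    # stack of [tag, kind or None, text parts]

    def handle_starttag(self, tag, attrs):
        if tag in self.KINDS:
            self._open.append([tag, self.KINDS[tag], []])
        elif tag == "span":
            # Track every span so end tags pair up; only comment spans get a kind.
            classes = (dict(attrs).get("class") or "").split()
            kind = "comment" if "comment" in classes else None
            self._open.append([tag, kind, []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, kind, parts = self._open.pop()
            if kind:
                self.changes.append((kind, "".join(parts)))


def extract_changes(html: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return tracked changes found in an HTML string."""
    parser = TrackedChangeParser()
    parser.feed(html)
    parser.close()
    return parser.changes


# Hand-written sample, not real Tensorlake output.
sample = (
    "<p>The fee is <del>$100</del><ins>$120</ins>"
    "<span class='comment'>Confirm with finance</span>.</p>"
)
changes = extract_changes(sample)
```

Because the output is described as semantic HTML, any HTML parser should work here; the standard-library parser is used only to keep the sketch dependency-free.
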
September 30, 2025
Changelog

New: Header Detection and Correction for accurate document hierarchy

Tensorlake now detects and corrects document headers across pages, maintaining proper hierarchy even when OCR misidentifies header levels.

  • Automatic header hierarchy correction based on numbering patterns and visual structure
  • Cross-page header detection without fragmentation
  • Section headers include accurate level attributes (0 for #, 1 for ##, etc.)
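
The zero-based level convention is easy to get wrong by one, so here is a minimal sketch of turning a level attribute back into a Markdown heading. The `markdown_heading` function and its inputs are hypothetical; only the mapping (0 for #, 1 for ##) comes from the changelog entry.

```python
def markdown_heading(level: int, text: str) -> str:
    """Render a heading from a zero-based level (0 -> '#', 1 -> '##')."""
    # Markdown supports six heading depths, so valid zero-based levels are 0..5.
    if not 0 <= level <= 5:
        raise ValueError(f"level must be between 0 and 5, got {level}")
    return "#" * (level + 1) + " " + text.strip()


top = markdown_heading(0, "1. Introduction")
sub = markdown_heading(1, "1.1 Scope")
```
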
Blog

Precise Data Extraction: Pattern-Based Partitioning for Structured Extraction

Stop wrestling with brittle document extraction pipelines that break when layouts change. Learn how to use Tensorlake's pattern-based partitioning to extract data from specific document sections, eliminating positional dependencies and parsing noise for consistent structured outputs.

Blog

Building Clean, Schema-Enforced Pipelines with Tensorlake + Outlines

Learn how to build bulletproof document AI pipelines by combining Tensorlake's structured parsing with Outlines' schema-enforced generation. This technical guide shows how to eliminate malformed JSON, validation errors, and downstream failures by constraining LLM outputs during decoding rather than hoping for valid results.

September 19, 2025
Changelog

Fixed: Citation filtering now respects page classification limits

Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you're actually extracting from.

  • Citations now correctly respect page classification boundaries
  • Cleaner results with no citations pointing to irrelevant page content
  • Better RAG pipeline accuracy with properly scoped citations
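
To make the scoping rule concrete, the sketch below filters a list of citations down to the pages that survived classification. The `Citation` type, its fields, and `scope_citations` are hypothetical names chosen for illustration; they are not the Tensorlake API, which applies this filtering for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: a page number plus the cited text."""
    page: int
    text: str


def scope_citations(citations: list[Citation], allowed_pages: set[int]) -> list[Citation]:
    """Keep only citations that point at pages selected for extraction."""
    return [c for c in citations if c.page in allowed_pages]


citations = [
    Citation(page=1, text="Account holder: J. Doe"),
    Citation(page=4, text="Marketing insert"),
    Citation(page=2, text="Closing balance: 1,250.00"),
]
# Assume page classification kept pages 1 and 2 only.
scoped = scope_citations(citations, allowed_pages={1, 2})
```
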
Blog

Citation-Aware RAG: How to add Fine Grained Citations in Retrieval and Response Synthesis

Learn how to build citation-aware RAG systems that link AI responses back to exact source locations in documents. This technical guide covers document parsing with spatial metadata, chunking strategies for preserving citations, and implementing verifiable AI responses with page numbers and bounding box coordinates. Includes code examples using Tensorlake's Document AI for parsing complex documents and generating audit-ready citations in production RAG applications.

September 17, 2025
Changelog

Fixed token limit issues with large CSV/Excel tables

Fixed token limit issues with large, dense CSV and Excel tables through automatic splitting and intelligent result merging.

  • Handles 500+ row spreadsheets and extensive financial reports that previously failed
  • Automatic table splitting preserves relationships and maintains extraction accuracy
  • Transparent processing - no configuration changes or manual preprocessing required
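
The general split-then-merge idea can be sketched as follows: break a large table into row chunks that each repeat the header, run extraction per chunk, and concatenate the results in order. This is a generic illustration under assumed names (`split_table`, `extract_in_chunks`, and the `extract` callback are hypothetical), not a description of how Tensorlake implements its splitting or merging.

```python
from typing import Callable, Iterator, Sequence

Row = Sequence[str]


def split_table(header: Row, rows: Sequence[Row], max_rows: int) -> Iterator[list[Row]]:
    """Yield chunks of at most max_rows data rows, each prefixed with the header."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for start in range(0, len(rows), max_rows):
        yield [header, *rows[start:start + max_rows]]


def extract_in_chunks(
    header: Row,
    rows: Sequence[Row],
    extract: Callable[[list[Row]], list[dict]],
    max_rows: int = 100,
) -> list[dict]:
    """Run a caller-supplied extract step on each chunk and merge results in order."""
    merged: list[dict] = []
    for chunk in split_table(header, rows, max_rows):
        merged.extend(extract(chunk))
    return merged


def rows_to_dicts(chunk: list[Row]) -> list[dict]:
    """Stand-in for a model call: map each data row onto the header names."""
    head, *body = chunk
    return [dict(zip(head, row)) for row in body]


header = ["id", "amount"]
rows = [[str(i), str(i * 10)] for i in range(250)]
records = extract_in_chunks(header, rows, rows_to_dicts, max_rows=100)
```

Repeating the header in every chunk is what lets each piece be interpreted on its own; a real merge step may also need to reconcile values that span chunk boundaries.
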
September 15, 2025
Changelog

Page classification now includes reasoning explanations

Page classification results now include the model's reasoning for each decision to help with debugging and prompt engineering.

  • Detailed explanations for why pages received specific classifications
  • Helps identify prompt engineering opportunities and debug classification errors
  • Automatically included in all classification results with no performance impact
September 12, 2025
Changelog

Page classification now defaults to multi-label (multiple classes per page)

Page classification now defaults to multi-label mode, allowing pages to receive multiple classification labels simultaneously.

  • Single pages can be classified as multiple page types (e.g., account_info AND transactions)
  • Better handling of complex documents like bank statements and legal docs
  • Backward compatible - multi-class mode still available via configuration
September 10, 2025
Changelog

Summaries now include optional full-page image context

Optionally reference the full page during figure and table summarization to preserve spatial context in complex layouts.

  • Full-page image context for better spatial relationship understanding
  • Reduces hallucinations in multi-column and form-based documents
  • Optional setting - maintains existing fragment-level behavior as default
Blog

Parse and Retrieve Dense Tables Accurately with Tensorlake

Learn how Tensorlake preserves structure in dense, multi-page tables—returning DataFrames with summaries and bounding boxes for accurate, explainable retrieval.

September 8, 2025
Changelog

Document Ingestion now supports XML, DOC, and Markdown files

Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.

  • Native XML parsing for config files and structured data exports
  • Legacy DOC file support for older document repositories
  • Markdown processing for documentation and technical specs
Blog

Verify Structured Output with Field-Level Citations

Tensorlake now supports citations in Structured Extraction. Every extracted field can be traced back to its bounding box and page number—unlocking auditing, compliance, and verification workflows.

Blog

Fix Broken Context in RAG with Tensorlake + Chonkie

RAG pipelines fail when contracts, financial reports, or research papers are split into meaningless chunks. Learn how Tensorlake’s parsing and Chonkie’s chunking work together to deliver faithful, retrieval-ready context.

Blog

Accelerate Advanced RAG with Tensorlake

Advanced RAG that survives production: keep context fresh, preserve structure, and plan retrieval using Tensorlake to turn messy PDFs into traceable answers. We demonstrate it by fact-checking Tesla news against SEC filings.

August 13, 2025
Changelog

Table Recognition now parses ~1,500-cell tables (with structure preserved)

A new model is live. It reliably extracts very large, dense tables from PDFs (including scans) while preserving header hierarchy, row/column spans, and cell boundaries, with fast HTML/CSV export and bounding boxes for citations.

  • Robust on ~1,500-cell tables; resilient to complex layouts and scanned documents.
  • Preserves header hierarchy and row/column spans; faithful HTML outputs.
  • Improved cell boundary detection and multi-row/multi-col header parsing.
Blog

AI Tagging for Page-Level Metadata with Tensorlake Page Classification

Learn how AI Tagging with Tensorlake’s Page Classification turns unstructured documents into page-level metadata for CRMs, vector databases, RAG pipelines, and compliance workflows—enabling precise search, automation, and structured data extraction.

August 10, 2025
Changelog

DocumentAI API v2

V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.

  • Unified Parse and Jobs API
  • Advanced Structured Extraction with JSON Schema
  • Page Classification and Signature Detection
Blog

Page Classification: Smarter, Safer Structured Extraction

Extract the *right* structured data *from the right pages*, with zero extra complexity

Blog

Unlocking Smarter RAG with Qdrant + Tensorlake: Structured Filters Meet Semantic Search

A modern RAG stack demands more than vectors. In this post, we show how to combine Qdrant and Tensorlake to build smarter retrieval pipelines with structured filters, figure/table summaries, and markdown chunks enriched with document metadata. Learn how to parse research papers, create embeddings, and answer nuanced queries using real-world document structure, no fragile pipelines required.

Blog

LangChain + Tensorlake: Unlocking Document Understanding for Agents

LangChain and Tensorlake join forces to enhance agent-driven workflows with reliable document parsing and understanding.

Blog

Signature Detection in Tensorlake: Catch what’s missing, trigger what’s next

Signature Detection is now available in Tensorlake. Automatically identify whether a document has been signed—and use that signal to power intelligent automations.

Blog

Tensorlake Cloud: Ingest, Structure, Orchestrate Without Losing a Byte

Tensorlake Cloud is a fully managed platform for turning unstructured documents into structured, AI-ready data. With human-like document parsing and code-first workflow orchestration, it delivers the accuracy and durability needed for high-stakes applications in finance, healthcare, and more.

March 15, 2024
Changelog

Advanced Schema Extraction

Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support

  • Research paper metadata extraction
  • Pydantic schema support
  • Multi-format document support
