Changelog

Stay up to date with the latest changes and improvements to Tensorlake.

October 16, 2025

New: Vision Language Models for Document Processing

Tensorlake now uses Vision Language Models (VLMs) across multiple features including page classification, figure/table summarization, and structured extraction, enabling faster and more intelligent document understanding.

•VLM-powered page classification for efficient large document processing
•Direct visual understanding for figures, tables, and structured data extraction
•Skip OCR entirely with VLM-based extraction for more accurate results from hard-to-parse documents

October 10, 2025

New: Tracked Changes Parsing for Word Documents

Tensorlake now preserves tracked changes (insertions, deletions, and comments) from Word documents as structured HTML, enabling programmatic access to document revision history.

•Preserve insertions, deletions, and comments from .docx tracked changes
•Structured HTML output with semantic tags (<ins>, <del>, <span class='comment'>)
•Extract author metadata and comment text programmatically
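As a rough illustration of consuming this output, the sketch below collects tracked changes from an HTML fragment using only the Python standard library. The sample HTML is invented for the example, and the class name `TrackedChangeParser` is hypothetical; only the `<ins>`, `<del>`, and `<span class='comment'>` tags come from the entry above. How author metadata is encoded is not stated here, so the sketch does not try to read it.

```python
from html.parser import HTMLParser

# Tags that can carry a tracked change, per the changelog entry above.
TRACKED_TAGS = {"ins", "del", "span"}


class TrackedChangeParser(HTMLParser):
    """Collect text inside <ins>, <del>, and <span class='comment'>."""

    def __init__(self):
        super().__init__()
        self.changes = []  # (kind, text) tuples in document order
        self._stack = []   # kind of each open tracked tag (None for plain spans)

    def handle_starttag(self, tag, attrs):
        if tag not in TRACKED_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ins":
            kind = "insertion"
        elif tag == "del":
            kind = "deletion"
        elif "comment" in classes:
            kind = "comment"
        else:
            kind = None  # an ordinary <span>, not a comment
        self._stack.append(kind)

    def handle_endtag(self, tag):
        if tag in TRACKED_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Use the innermost open tag that represents a tracked change.
        kind = next((k for k in reversed(self._stack) if k), None)
        if kind and data.strip():
            self.changes.append((kind, data.strip()))


# Invented sample fragment, for illustration only.
sample = (
    "<p>The fee is <del>10%</del><ins>12%</ins> per year."
    "<span class='comment'>Confirm with finance</span></p>"
)
parser = TrackedChangeParser()
parser.feed(sample)
parser.close()
changes = parser.changes
```

Here `changes` holds one `(kind, text)` pair per insertion, deletion, or comment, in the order they appear in the fragment.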

September 30, 2025

New: Header Detection and Correction for accurate document hierarchy

Tensorlake now detects and corrects document headers across pages, maintaining proper hierarchy even when OCR misidentifies header levels.

•Automatic header hierarchy correction based on numbering patterns and visual structure
•Cross-page header detection without fragmentation
•Section headers include accurate level attributes (0 for #, 1 for ##, etc.)
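The level convention above is zero-based, so a level of 0 corresponds to a single `#`. A minimal sketch of converting a level back to a Markdown heading follows; the helper name `heading_to_markdown` and the sample headers are hypothetical and not part of the Tensorlake SDK.

```python
def heading_to_markdown(level: int, text: str) -> str:
    """Render a heading using the zero-based level convention (0 -> '#')."""
    if level < 0:
        raise ValueError("header level must be non-negative")
    return "#" * (level + 1) + " " + text


# Invented section headers, for illustration only.
headers = [(0, "1 Introduction"), (1, "1.1 Scope"), (2, "1.1.1 Definitions")]
outline = [heading_to_markdown(level, text) for level, text in headers]
```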

September 19, 2025

Fixed: Citation filtering now respects page classification limits

Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you're actually extracting from.

•Citations now correctly respect page classification boundaries
•Cleaner results with no citations pointing to irrelevant page content
•Better RAG pipeline accuracy with properly scoped citations
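The scoping behavior described above can be pictured as a simple filter. This is a conceptual sketch, not Tensorlake's implementation: the function `scope_citations`, the `page` and `field` keys, and the sample data are all assumptions made for the example.

```python
def scope_citations(citations, page_classes, wanted_classes):
    """Keep only citations that point at pages carrying a wanted class."""
    allowed_pages = {
        page
        for page, labels in page_classes.items()
        if wanted_classes & set(labels)
    }
    return [c for c in citations if c["page"] in allowed_pages]


# Hypothetical classification result: page number -> labels.
page_classes = {
    1: ["cover"],
    2: ["transactions"],
    3: ["account_info", "transactions"],
    4: ["terms"],
}
# Hypothetical citations attached to extracted fields.
citations = [
    {"field": "total", "page": 3},
    {"field": "bank_name", "page": 1},
    {"field": "first_txn", "page": 2},
]
scoped = scope_citations(citations, page_classes, {"transactions"})
```

Only the citations on pages 2 and 3 survive, because the cover page was never part of the extraction scope.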

September 17, 2025

Fixed: Token limit issues with large CSV/Excel tables

Fixed token limit issues with large, dense CSV and Excel tables through automatic splitting and intelligent result merging.

•Handles 500+ row spreadsheets and extensive financial reports that previously failed
•Automatic table splitting preserves relationships and maintains extraction accuracy
•Transparent processing - no configuration changes or manual preprocessing required
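The entry does not describe how the splitting and merging work internally. As a generic illustration of the idea, the sketch below splits a table into row chunks that each repeat the header, processes every chunk independently, and concatenates the results. The helper names, the chunk size, and the placeholder `extract_chunk` step are assumptions for the example.

```python
import csv
import io


def split_rows(rows, max_rows):
    """Split a table into chunks of at most max_rows data rows, repeating the header."""
    header, body = rows[0], rows[1:]
    return [[header] + body[i:i + max_rows] for i in range(0, len(body), max_rows)]


def extract_chunk(chunk):
    """Placeholder for per-chunk extraction: map each row to a dict keyed by header."""
    header, *body = chunk
    return [dict(zip(header, row)) for row in body]


def merge_results(chunk_results):
    """Concatenate per-chunk results, preserving the original row order."""
    merged = []
    for result in chunk_results:
        merged.extend(result)
    return merged


# Small invented table; a real spreadsheet would have hundreds of rows.
data = (
    "date,amount\n"
    "2025-01-01,10\n2025-01-02,20\n2025-01-03,30\n"
    "2025-01-04,40\n2025-01-05,50\n"
)
rows = list(csv.reader(io.StringIO(data)))
chunks = split_rows(rows, max_rows=2)
merged = merge_results(extract_chunk(chunk) for chunk in chunks)
```

Repeating the header in every chunk is what lets each piece be interpreted on its own before the results are merged.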

September 15, 2025

Page classification now includes reasoning explanations

Page classification results now include the model's reasoning for each decision to help with debugging and prompt engineering.

•Detailed explanations for why pages received specific classifications
•Helps identify prompt engineering opportunities and debug classification errors
•Automatically included in all classification results with no performance impact

September 12, 2025

Page classification now defaults to multi-label (multiple classes per page)

Page classification now defaults to multi-label mode, allowing pages to receive multiple classification labels simultaneously.

•Single pages can be classified as multiple page types (e.g., account_info AND transactions)
•Better handling of complex documents like bank statements and legal docs
•Backward compatible - multi-class mode still available via configuration
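To make the difference concrete, the sketch below inverts a hypothetical classification result (class name mapped to page numbers) into a per-page label set and lists the pages that carry more than one label. The result shape is an assumption for illustration; consult the API reference for the actual response format.

```python
from collections import defaultdict

# Hypothetical multi-label result: class name -> pages assigned to it.
classification = {
    "account_info": [1, 2],
    "transactions": [2, 3, 4],
}

# Invert to page -> set of labels so multi-label pages are easy to spot.
labels_by_page = defaultdict(set)
for label, pages in classification.items():
    for page in pages:
        labels_by_page[page].add(label)

multi_label_pages = sorted(
    page for page, labels in labels_by_page.items() if len(labels) > 1
)
```

In this sample, page 2 is both `account_info` and `transactions`, which a single-label mode could not express.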

September 10, 2025

Summaries now include optional full-page image context

Optionally reference the full page during figure and table summarization to preserve spatial context in complex layouts.

•Full-page image context for better spatial relationship understanding
•Reduces hallucinations in multi-column and form-based documents
•Optional setting - maintains existing fragment-level behavior as default

September 8, 2025

Document Ingestion now supports XML, DOC, and Markdown files

Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.

•Native XML parsing for config files and structured data exports
•Legacy DOC file support for older document repositories
•Markdown processing for documentation and technical specs

August 13, 2025

Table Recognition now parses ~1,500-cell tables (with structure preserved)

A new model is live. It reliably extracts very large, dense tables from PDFs (including scans) while preserving header hierarchy, row/column spans, and cell boundaries, with fast HTML/CSV export and bounding boxes for citations.

•Robust on ~1,500-cell tables; resilient to complex layouts and scanned documents
•Preserves header hierarchy and row/column spans; faithful HTML outputs
•Improved cell boundary detection and multi-row/multi-column header parsing

August 11, 2025

DocumentAI API v2

V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.

•Unified Parse and Jobs API
•Advanced Structured Extraction with JSON Schema
•Page Classification and Signature Detection

March 15, 2024

Advanced Schema Extraction

Extract structured data from any document using Pydantic schemas, with improved accuracy and multi-format support.

•Research paper metadata extraction
•Pydantic schema support
•Multi-format document support
+1 more...
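For research paper metadata, a schema might look like the sketch below. The model and field names (`PaperMetadata`, `Author`, `keywords`, and so on) are invented for illustration, and the snippet only shows defining and validating a schema with Pydantic. It does not show how the schema is passed to Tensorlake.

```python
from typing import List, Optional

from pydantic import BaseModel


class Author(BaseModel):
    name: str
    affiliation: Optional[str] = None


class PaperMetadata(BaseModel):
    """Hypothetical target schema for research paper metadata."""

    title: str
    authors: List[Author]
    year: Optional[int] = None
    keywords: List[str] = []


# Validate an invented extraction result against the schema.
record = PaperMetadata(
    title="An Example Paper",
    authors=[{"name": "A. Researcher"}],
    year=2024,
)
```

Nested dictionaries are validated into `Author` instances, and fields that are missing fall back to their declared defaults.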