Changelog
Stay up to date with the latest changes and improvements to Tensorlake
Fixed: Citation filtering now respects page classification limits
Fixed a bug where citations ignored page classification filtering. Citations now reference only the pages you are actually extracting from.
- Citations now correctly respect page classification boundaries
- Cleaner results with no citations pointing to irrelevant page content
- Better RAG pipeline accuracy with properly scoped citations
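As a rough illustration of what "scoped" citations mean, the sketch below filters a list of citations down to those whose page carries an allowed classification. It is not Tensorlake's implementation or API: the `Citation` type, the `filter_citations` helper, and the shape of `page_classes` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical citation record (not a Tensorlake type)."""
    field: str  # name of the extracted field the citation supports
    page: int   # 1-based page number the citation points at


def filter_citations(citations, page_classes, allowed_classes):
    """Keep only citations whose page carries at least one allowed class."""
    allowed = set(allowed_classes)
    return [c for c in citations if allowed & set(page_classes.get(c.page, ()))]


# Assumed input shape: page number -> list of classification labels.
page_classes = {1: ["cover"], 2: ["transactions"], 3: ["account_info", "transactions"]}
citations = [Citation("total", 1), Citation("total", 2), Citation("owner", 3)]

kept = filter_citations(citations, page_classes, ["transactions"])
print([c.page for c in kept])  # [2, 3]
```

The citation on page 1 is dropped because that page is classified only as `cover`, which is outside the extraction scope.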
Fixed token limit issues with large CSV/Excel tables
Fixed token limit issues with large, dense CSV and Excel tables through automatic splitting and intelligent result merging.
- Handles 500+ row spreadsheets and extensive financial reports that previously failed
- Automatic table splitting preserves relationships and maintains extraction accuracy
- Transparent processing - no configuration changes or manual preprocessing required
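The general split-then-merge idea can be sketched as follows. This is only an assumption about the shape of the technique, not the actual Tensorlake logic: `split_rows`, `extract`, and `merge_results` are hypothetical helpers, and `extract` stands in for whatever per-chunk extraction call is really used. Repeating the header in every chunk is one simple way to keep column meaning intact across chunks.

```python
def split_rows(header, rows, max_rows):
    """Yield chunks of at most max_rows data rows, each repeating the header."""
    for start in range(0, len(rows), max_rows):
        yield [header] + rows[start:start + max_rows]


def extract(chunk):
    """Stand-in for a per-chunk extraction call: maps each row to a dict."""
    header, *rows = chunk
    return [dict(zip(header, row)) for row in rows]


def merge_results(chunk_results):
    """Concatenate per-chunk record lists, preserving the original row order."""
    merged = []
    for records in chunk_results:
        merged.extend(records)
    return merged


header = ["id", "amount"]
rows = [[str(i), str(i * 10)] for i in range(1, 8)]  # 7 data rows

chunks = list(split_rows(header, rows, max_rows=3))
records = merge_results(extract(c) for c in chunks)
print(len(chunks), len(records))  # 3 7
```

A real splitter would size chunks by token count rather than by a fixed row count; the fixed `max_rows` here just keeps the example small.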
Page classification now includes reasoning explanations
Page classification results now include the model's reasoning for each decision to help with debugging and prompt engineering.
- Detailed explanations for why pages received specific classifications
- Helps identify prompt engineering opportunities and debug classification errors
- Automatically included in all classification results with no performance impact
Page classification now defaults to multi-label (multiple classes per page)
Page classification now defaults to multi-label mode, allowing pages to receive multiple classification labels simultaneously.
- Single pages can be classified as multiple page types (e.g., account_info AND transactions)
- Better handling of complex documents like bank statements and legal docs
- Backward compatible - multi-class mode still available via configuration
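To make the difference between the two modes concrete, here is a minimal sketch. It assumes each page has a per-class score and that multi-label mode applies a threshold; both the scores and the threshold are illustrative assumptions, since the changelog does not describe how labels are assigned internally.

```python
def multi_class(scores):
    """Multi-class: return exactly one label, the highest-scoring class."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Multi-label: return every label whose score meets the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


# Hypothetical scores for a bank statement page that mixes two kinds of content.
scores = {"account_info": 0.81, "transactions": 0.77, "terms": 0.05}

print(multi_class(scores))  # account_info
print(multi_label(scores))  # ['account_info', 'transactions']
```

In multi-class mode the `transactions` label is lost even though the page clearly contains transactions, which is the case the new default is meant to handle.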
Page summaries now include optional full-page image context
Optionally reference the full-page image during figure and table summarization to preserve spatial context in complex layouts.
- Full-page image context for better spatial relationship understanding
- Reduces hallucinations in multi-column and form-based documents
- Optional setting - maintains existing fragment-level behavior as default
Document Ingestion now supports XML, DOC, and Markdown files
Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.
- Native XML parsing for config files and structured data exports
- Legacy DOC file support for older document repositories
- Markdown processing for documentation and technical specs
Table Recognition now parses ~1,500-cell tables (with structure preserved)
A new model is live. It reliably extracts very large, dense tables from PDFs (including scans) while preserving header hierarchy, row/column spans, and cell boundaries, with fast HTML/CSV export and bounding boxes for citations.
- Robust on ~1,500-cell tables; resilient to complex layouts and scanned documents.
- Preserves header hierarchy and row/column spans; faithful HTML outputs.
- Improved cell boundary detection and multi-row/multi-column header parsing.
DocumentAI API v2
V2 of the DocumentAI API is fully in production in the Python SDK and on the Playground, offering unified document processing with advanced structured extraction, page classification, and enrichment capabilities.
- Unified Parse and Jobs API
- Advanced Structured Extraction with JSON Schema
- Page Classification and Signature Detection
Advanced Schema Extraction
Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support.
- Research paper metadata extraction
- Pydantic schema support
- Multi-format document support
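As an illustration of what a Pydantic schema for research paper metadata might look like, the sketch below defines one and validates a sample payload against it. The `Author` and `PaperMetadata` models, their fields, and the sample data are hypothetical; how a schema is passed to Tensorlake is not shown here because the changelog does not specify it.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Author(BaseModel):
    """Hypothetical author record."""
    name: str
    affiliation: Optional[str] = None


class PaperMetadata(BaseModel):
    """Hypothetical target schema for research paper metadata."""
    title: str = Field(description="Full title of the paper")
    authors: List[Author] = Field(description="Authors in listed order")
    year: Optional[int] = Field(default=None, description="Publication year")


# Placeholder payload standing in for an extraction result.
raw = {"title": "An Example Paper", "authors": [{"name": "A. Author"}], "year": "2024"}
paper = PaperMetadata(**raw)
print(paper.title, paper.year)  # the year string is coerced to an int

# A payload missing a required field is rejected.
try:
    PaperMetadata(authors=[])
    missing_title_rejected = False
except ValidationError:
    missing_title_rejected = True
```

Field descriptions are worth writing carefully, since a schema-driven extractor typically has little else to tell it what each field means.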