API v2 Summary#

Tensorlake API v2 provides a unified interface for extracting structured data from documents. It combines document parsing, structured extraction, and enrichment into a single endpoint that can handle complex document workflows.

Core Capabilities#

Document Ingestion: Upload and process files up to 1GB in size, supporting PDF, Word documents (DOCX), Excel spreadsheets (XLS, XLSX, XLSM), PowerPoint presentations (PPTX), images (PNG, JPG, JPEG), CSV files, HTML, and plain text.

Unified Processing: Submit documents via file upload, public URL, or raw text content with a single API endpoint that handles all processing operations.

Flexible Output: Convert documents to markdown with intelligent chunking strategies, extract structured data using custom schemas, and classify pages into categories.

Structured Data Extraction#

We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using JSON Schema definitions.

Invoice Processing Example#

Define schemas for extracting structured information from business documents:

{
"title": "Invoice", 
"type": "object",
"properties": {
  "invoice_number": {"type": "string"},
  "date": {"type": "string", "format": "date"},
  "vendor": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "address": {"type": "string"}
    }
  },
  "line_items": {
    "type": "array",
    "items": {
      "type": "object", 
      "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "unit_price": {"type": "number"},
        "total": {"type": "number"}
      }
    }
  },
  "total_amount": {"type": "number"}
}
}
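
Once an invoice has been extracted with a schema like the one above, the numeric fields can be cross-checked before the data is used downstream. The following is a minimal sketch, not part of the Tensorlake SDK: the helper name `check_invoice_totals`, the tolerance, and the sample data are all hypothetical, and it assumes the invoice carries no tax or discount lines (the schema above has no fields for them).

```python
import math


def check_invoice_totals(invoice: dict, tol: float = 0.01) -> list[str]:
    """Return human-readable inconsistencies found in an extracted invoice."""
    problems = []
    running_sum = 0.0
    for i, item in enumerate(invoice.get("line_items", [])):
        # Each line total should equal quantity * unit_price.
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tol):
            problems.append(f"line_items[{i}]: {expected:.2f} != {item['total']:.2f}")
        running_sum += item["total"]
    # The invoice total should equal the sum of the line totals.
    if not math.isclose(running_sum, invoice["total_amount"], abs_tol=tol):
        problems.append(f"total_amount: {running_sum:.2f} != {invoice['total_amount']:.2f}")
    return problems


# Hypothetical extraction result shaped like the Invoice schema.
sample_invoice = {
    "invoice_number": "INV-2024-001",
    "line_items": [
        {"description": "Widget", "quantity": 10, "unit_price": 100.0, "total": 1000.0},
        {"description": "Setup fee", "quantity": 1, "unit_price": 250.0, "total": 250.0},
    ],
    "total_amount": 1250.0,
}

print(check_invoice_totals(sample_invoice))  # prints []
```

An empty list means the line items and the invoice total agree within the tolerance.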

Contract Analysis Example#

Extract key terms and parties from legal documents:

{
"title": "Contract",
"type": "object", 
"properties": {
  "parties": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "role": {"type": "string"},
        "address": {"type": "string"}
      }
    }
  },
  "effective_date": {"type": "string", "format": "date"},
  "expiration_date": {"type": "string", "format": "date"},
  "key_terms": {
    "type": "array",
    "items": {"type": "string"}
  },
  "governing_law": {"type": "string"},
  "signatures_required": {"type": "boolean"}
}
}
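
The `"format": "date"` annotation in JSON Schema denotes a full date written as `YYYY-MM-DD`, so the extracted dates can be parsed with the standard library. The sketch below assumes the extraction actually returns dates in that form; the helper name `contract_is_active` and the sample values are hypothetical.

```python
from datetime import date


def contract_is_active(contract: dict, today: date) -> bool:
    """Check whether `today` falls within the contract's effective period."""
    effective = date.fromisoformat(contract["effective_date"])
    expiration = date.fromisoformat(contract["expiration_date"])
    if expiration < effective:
        # An inverted range usually signals an extraction error.
        raise ValueError("expiration_date precedes effective_date")
    return effective <= today <= expiration


# Hypothetical extraction result shaped like the Contract schema.
sample_contract = {
    "effective_date": "2024-01-01",
    "expiration_date": "2024-12-31",
    "governing_law": "Delaware",
    "signatures_required": True,
}

print(contract_is_active(sample_contract, date(2024, 6, 1)))  # prints True
```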

API Usage#

Extract structured data using the unified parse endpoint:

curl -X POST https://api.tensorlake.ai/documents/v2/parse \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "file_id": "file_12345",
  "structured_extraction_options": [{
    "schema_name": "invoice_data",
    "json_schema": {
      "type": "object",
      "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"}
      }
    }
  }]
}'
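
The same request can be assembled in Python using only the standard library. This is a sketch that mirrors the curl call above rather than the official SDK: the function name `build_parse_request` is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

PARSE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_parse_request(api_key: str, file_id: str, schema_name: str,
                        json_schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    body = {
        "file_id": file_id,
        "structured_extraction_options": [
            {"schema_name": schema_name, "json_schema": json_schema}
        ],
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_parse_request(
    api_key="YOUR_API_KEY",
    file_id="file_12345",
    schema_name="invoice_data",
    json_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
    },
)
print(request.get_method(), request.full_url)
```

To send it, pass `request` to `urllib.request.urlopen` and decode the JSON response body.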

Page Classification#

Classify document pages into categories for better organization and processing:

{
"page_classifications": [
  {
    "name": "invoice",
    "description": "Pages containing invoice information with line items and totals"
  },
  {
    "name": "contract_terms", 
    "description": "Pages containing contract terms and conditions"
  },
  {
    "name": "signature_page",
    "description": "Pages containing signatures and execution information"
  }
]
}

Document Enhancement#

Table and Chart Summarization#

Automatically generate summaries of complex tables and visual elements:

{
"enrichment_options": {
  "table_summarization": true,
  "table_summarization_prompt": "Provide a concise summary of the key data points in this table",
  "figure_summarization": true, 
  "figure_summarization_prompt": "Describe the main insights from this chart or diagram"
}
}

Signature Detection#

Detect and locate signatures within documents using specialized computer vision models:

{
"parsing_options": {
  "signature_detection": true
}
}
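
The option fragments shown for page classification, summarization, and signature detection can be assembled programmatically. The sketch below assumes that `page_classifications`, `enrichment_options`, and `parsing_options` may appear together as top-level keys of one parse request body, which the examples above suggest but do not state outright; the helper name `build_parse_options` is hypothetical.

```python
def build_parse_options(page_classes: dict[str, str],
                        summarize_tables: bool = False,
                        summarize_figures: bool = False,
                        detect_signatures: bool = False) -> dict:
    """Assemble option fragments into one dict, omitting unused sections."""
    body: dict = {}
    if page_classes:
        body["page_classifications"] = [
            {"name": name, "description": description}
            for name, description in page_classes.items()
        ]
    enrichment = {}
    if summarize_tables:
        enrichment["table_summarization"] = True
    if summarize_figures:
        enrichment["figure_summarization"] = True
    if enrichment:
        body["enrichment_options"] = enrichment
    if detect_signatures:
        body["parsing_options"] = {"signature_detection": True}
    return body


options = build_parse_options(
    {"signature_page": "Pages containing signatures and execution information"},
    summarize_tables=True,
    detect_signatures=True,
)
print(sorted(options))
```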

Advanced Features#

Document Layout Analysis#

Get detailed document structure information including bounding boxes for all elements:

Page Fragments: Text blocks, tables, images, charts with precise coordinates
Layout Detection: Automatic identification of document structure and hierarchy
Cross-Page Headers: Detection of headers that span multiple pages

Flexible Input Methods#

File Upload: Upload documents directly to Tensorlake storage (up to 1GB)
URL Processing: Process documents from public URLs with automatic download
Raw Text: Extract structured data from text content, emails, HTML, or CSV

Intelligent Chunking#

Multiple chunking strategies for different use cases:

None: Return full document content
Semantic: Chunk by logical document sections
Fixed-size: Split into consistent token lengths
Custom: Define your own chunking parameters
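
To make the fixed-size strategy concrete, here is an illustrative sketch of the general idea. It is not Tensorlake's implementation: it treats whitespace-separated words as tokens, which is an assumption, and the function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most `max_tokens` whitespace-separated tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


print(fixed_size_chunks("one two three four five", 2))
# prints ['one two', 'three four', 'five']
```

Every chunk except possibly the last has exactly `max_tokens` tokens, which keeps chunk sizes predictable for downstream indexing.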

Response Format#

A successful parse returns the parse ID and status, the document chunks, any extracted structured data, the document layout, and the page classifications:

{
"parse_id": "parse_abcd1234",
"status": "successful", 
"chunks": [
  {"content": "Document text chunk 1"},
  {"content": "Document text chunk 2"}
],
"structured_data": {
  "invoice_data": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00
  }
},
"document_layout": {
  "pages": [{
    "page_number": 1,
    "page_fragments": [{
      "fragment_type": "text",
      "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250}
    }]
  }]
},
"page_classes": [
  {"page": 1, "classification": "invoice"}
]
}
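
A response of this shape can be summarized with a few lines of Python. The sketch below assumes the field names shown in the example response and makes no claim about bounding box units, which the example does not specify; the helper name `summarize_parse_result` is hypothetical.

```python
from collections import defaultdict


def summarize_parse_result(result: dict) -> dict:
    """Group pages by classification and compute each fragment's bounding box area."""
    pages_by_class = defaultdict(list)
    for entry in result.get("page_classes", []):
        pages_by_class[entry["classification"]].append(entry["page"])

    fragment_areas = []
    for page in result.get("document_layout", {}).get("pages", []):
        for fragment in page.get("page_fragments", []):
            box = fragment["bbox"]
            area = (box["x2"] - box["x1"]) * (box["y2"] - box["y1"])
            fragment_areas.append((page["page_number"], fragment["fragment_type"], area))

    return {
        "status": result.get("status"),
        "chunk_count": len(result.get("chunks", [])),
        "pages_by_class": dict(pages_by_class),
        "fragment_areas": fragment_areas,
    }


# The example response from above, as a Python dict.
sample_result = {
    "parse_id": "parse_abcd1234",
    "status": "successful",
    "chunks": [
        {"content": "Document text chunk 1"},
        {"content": "Document text chunk 2"},
    ],
    "structured_data": {
        "invoice_data": {"invoice_number": "INV-2024-001", "total_amount": 1250.00}
    },
    "document_layout": {
        "pages": [{
            "page_number": 1,
            "page_fragments": [{
                "fragment_type": "text",
                "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250},
            }],
        }]
    },
    "page_classes": [{"page": 1, "classification": "invoice"}],
}

summary = summarize_parse_result(sample_result)
print(summary["pages_by_class"])   # prints {'invoice': [1]}
print(summary["fragment_areas"])   # prints [(1, 'text', 15000)]
```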

Migration and Compatibility#

API v2 maintains backward compatibility while introducing powerful new capabilities:

Unified Endpoint: Single /documents/v2/parse endpoint replaces multiple v1 endpoints
Enhanced Error Handling: Detailed error messages and status tracking
Improved Performance: Faster processing with optimized document analysis
Better Scaling: Handle larger documents and more complex schemas

API v2 is available now in the Python SDK and Playground, ready for production workloads requiring sophisticated document understanding and structured data extraction.

from typing import List

from pydantic import BaseModel, Field

# Author and Conference are Pydantic models defined alongside this one;
# their definitions are not shown here.
class ResearchPaperMetadata(BaseModel):
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated with a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()

Usage Example#

Extract structured data from documents using your custom schemas:

from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
  document_id="doc_123",
  schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"

Supported Formats#

PDF documents
Word documents (DOCX, DOC)
Spreadsheets (XLSX, XLSM, XLS, CSV)
Images (PNG, JPG)
Presentations (PPTX, Keynote)
HTML pages
Plain text files

API Reference#

# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "document_id": "doc_123",
  "schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "authors": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "affiliation": {"type": "string"}
          }
        }
      }
    }
  }
}'

DocumentAI API v2

Key Highlights