API v2 Summary#
Tensorlake API v2 is a major update to the document processing API. It provides a unified interface for extracting structured data from the supported document formats, combining document parsing, structured extraction, and enrichment into a single endpoint that can handle complex document workflows.
Core Capabilities#
Document Ingestion: Upload and process files up to 1GB in size, supporting PDF, Word documents (DOCX), Excel spreadsheets (XLS, XLSX, XLSM), PowerPoint presentations (PPTX), images (PNG, JPG, JPEG), CSV files, HTML, and plain text.
Unified Processing: Submit documents via file upload, public URL, or raw text content with a single API endpoint that handles all processing operations.
Flexible Output: Convert documents to markdown with intelligent chunking strategies, extract structured data using custom schemas, and classify pages into categories.
Structured Data Extraction#
We're excited to introduce advanced schema extraction capabilities that allow you to extract structured data from any document using JSON Schema definitions.
Invoice Processing Example#
Define schemas for extracting structured information from business documents:
{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string", "format": "date"},
    "vendor": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"}
      }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    },
    "total_amount": {"type": "number"}
  }
}
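Once an invoice has been extracted with a schema like the one above, a quick arithmetic check can flag likely extraction errors. The following is a minimal sketch and not part of the Tensorlake SDK: `check_invoice`, the tolerance, and the sample data are hypothetical, and it assumes `total_amount` equals the sum of the line totals (real invoices may add tax or discounts).

```python
import math


def check_invoice(invoice: dict, tolerance: float = 0.01) -> list[str]:
    """Return human-readable problems found in extracted invoice data."""
    problems = []
    line_sum = 0.0
    for i, item in enumerate(invoice.get("line_items", [])):
        expected = item["quantity"] * item["unit_price"]
        # Each line total should match quantity * unit_price.
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(
                f"line {i}: expected total {expected:.2f}, got {item['total']:.2f}"
            )
        line_sum += item["total"]
    # Assumption: the invoice total is the plain sum of line totals.
    if not math.isclose(line_sum, invoice["total_amount"], abs_tol=tolerance):
        problems.append(
            f"total_amount {invoice['total_amount']:.2f} != sum of lines {line_sum:.2f}"
        )
    return problems


# Hypothetical extracted data shaped like the Invoice schema.
sample_invoice = {
    "invoice_number": "INV-2024-001",
    "line_items": [
        {"description": "Widget", "quantity": 10, "unit_price": 100.0, "total": 1000.0},
        {"description": "Shipping", "quantity": 1, "unit_price": 250.0, "total": 250.0},
    ],
    "total_amount": 1250.0,
}

problems = check_invoice(sample_invoice)
```

An empty list means the numbers are internally consistent; any entries point at the fields worth re-checking against the source document.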
Contract Analysis Example#
Extract key terms and parties from legal documents:
{
  "title": "Contract",
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "role": {"type": "string"},
          "address": {"type": "string"}
        }
      }
    },
    "effective_date": {"type": "string", "format": "date"},
    "expiration_date": {"type": "string", "format": "date"},
    "key_terms": {
      "type": "array",
      "items": {"type": "string"}
    },
    "governing_law": {"type": "string"},
    "signatures_required": {"type": "boolean"}
  }
}
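The two date fields above use the JSON Schema `date` format, which denotes an ISO 8601 calendar date such as `2024-01-31`. As a small illustration of working with such values downstream, the sketch below checks whether a contract is in force on a given day. The helper name and sample values are hypothetical, and it assumes the extracted strings actually follow the declared format.

```python
from datetime import date


def contract_is_active(contract: dict, today: date) -> bool:
    """Return True if `today` falls within the contract's date range (inclusive)."""
    # date.fromisoformat raises ValueError if a value is not YYYY-MM-DD.
    start = date.fromisoformat(contract["effective_date"])
    end = date.fromisoformat(contract["expiration_date"])
    return start <= today <= end


# Hypothetical extracted data shaped like the Contract schema.
sample_contract = {
    "effective_date": "2024-01-01",
    "expiration_date": "2024-12-31",
    "governing_law": "Delaware",
    "signatures_required": True,
}

active_mid_year = contract_is_active(sample_contract, date(2024, 6, 1))
active_next_year = contract_is_active(sample_contract, date(2025, 1, 1))
```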
API Usage#
Extract structured data using the unified parse endpoint:
curl -X POST https://api.tensorlake.ai/documents/v2/parse \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "file_12345",
    "structured_extraction_options": [{
      "schema_name": "invoice_data",
      "json_schema": {
        "type": "object",
        "properties": {
          "invoice_number": {"type": "string"},
          "total_amount": {"type": "number"}
        }
      }
    }]
  }'
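The same request can be assembled from Python with only the standard library. This is a rough equivalent of the curl call above, not the official SDK: `build_parse_request` is a hypothetical helper, and the URL, headers, and body fields are copied from the curl example. The request is built but not sent.

```python
import json
import urllib.request

PARSE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_parse_request(api_key, file_id, schema_name, json_schema):
    """Build (but do not send) a POST request mirroring the curl example."""
    body = {
        "file_id": file_id,
        "structured_extraction_options": [
            {"schema_name": schema_name, "json_schema": json_schema}
        ],
    }
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

request = build_parse_request("YOUR_API_KEY", "file_12345", "invoice_data", invoice_schema)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```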
Page Classification#
Classify document pages into categories for better organization and processing:
{
  "page_classifications": [
    {
      "name": "invoice",
      "description": "Pages containing invoice information with line items and totals"
    },
    {
      "name": "contract_terms",
      "description": "Pages containing contract terms and conditions"
    },
    {
      "name": "signature_page",
      "description": "Pages containing signatures and execution information"
    }
  ]
}
Document Enhancement#
Table and Chart Summarization#
Automatically generate summaries of complex tables and visual elements:
{
  "enrichment_options": {
    "table_summarization": true,
    "table_summarization_prompt": "Provide a concise summary of the key data points in this table",
    "figure_summarization": true,
    "figure_summarization_prompt": "Describe the main insights from this chart or diagram"
  }
}
Signature Detection#
Detect and locate signatures within documents using specialized computer vision models:
{
  "parsing_options": {
    "signature_detection": true
  }
}
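The option fragments in the sections above each use a different top-level key (`page_classifications`, `enrichment_options`, `parsing_options`). Assuming they can be combined in one body for the unified parse endpoint, which this page implies but does not state explicitly, a request body could be assembled as in the sketch below. `build_parse_body` is a hypothetical helper, not an SDK function.

```python
import json


def build_parse_body(file_id, *option_fragments):
    """Merge option fragments into one request body, rejecting duplicate keys."""
    body = {"file_id": file_id}
    for fragment in option_fragments:
        overlap = body.keys() & fragment.keys()
        if overlap:
            raise ValueError(f"duplicate option keys: {sorted(overlap)}")
        body.update(fragment)
    return body


classification = {
    "page_classifications": [
        {
            "name": "signature_page",
            "description": "Pages containing signatures and execution information",
        }
    ]
}
enrichment = {"enrichment_options": {"table_summarization": True}}
parsing = {"parsing_options": {"signature_detection": True}}

body = build_parse_body("file_12345", classification, enrichment, parsing)
payload = json.dumps(body, indent=2)  # JSON text to send as the request body
```

Rejecting duplicate keys keeps one fragment from silently overwriting another when options are built in different parts of a program.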
Advanced Features#
Document Layout Analysis#
Get detailed document structure information including bounding boxes for all elements:
- Page Fragments: Text blocks, tables, images, charts with precise coordinates
- Layout Detection: Automatic identification of document structure and hierarchy
- Cross-Page Headers: Detection of headers that span multiple pages
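Bounding boxes make simple geometric post-processing possible. The sketch below sorts fragments into an approximate reading order and computes a fragment's area. It assumes the `bbox` shape shown in the Response Format example (`x1`, `y1`, `x2`, `y2`) and a top-left origin; the origin, the helper names, and the sample fragments are assumptions made for illustration.

```python
def reading_order(fragments):
    """Sort fragments top-to-bottom, then left-to-right, by their top-left corner."""
    return sorted(fragments, key=lambda f: (f["bbox"]["y1"], f["bbox"]["x1"]))


def bbox_area(bbox):
    """Area of a bounding box; degenerate boxes yield 0."""
    width = max(0, bbox["x2"] - bbox["x1"])
    height = max(0, bbox["y2"] - bbox["y1"])
    return width * height


# Hypothetical fragments in arbitrary order.
fragments = [
    {"fragment_type": "table", "bbox": {"x1": 100, "y1": 300, "x2": 500, "y2": 600}},
    {"fragment_type": "text", "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250}},
]

ordered_types = [f["fragment_type"] for f in reading_order(fragments)]
text_area = bbox_area(fragments[1]["bbox"])
```

This ordering is only a heuristic: multi-column layouts need column detection before a top-to-bottom sort gives sensible results.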
Flexible Input Methods#
- File Upload: Upload documents directly to Tensorlake storage (up to 1GB)
- URL Processing: Process documents from public URLs with automatic download
- Raw Text: Extract structured data from text content, emails, HTML, or CSV
Intelligent Chunking#
Multiple chunking strategies for different use cases:
- None: Return full document content
- Semantic: Chunk by logical document sections
- Fixed-size: Split into consistent token lengths
- Custom: Define your own chunking parameters
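To make the fixed-size strategy concrete, here is a client-side illustration of the general idea. It is not Tensorlake's implementation: whitespace-separated words stand in for tokens, and the function name and `overlap` parameter are hypothetical.

```python
def fixed_size_chunks(text, max_tokens, overlap=0):
    """Split text into chunks of at most `max_tokens` whitespace-separated words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


plain = fixed_size_chunks("a b c d e", max_tokens=2)
overlapped = fixed_size_chunks("a b c d e", max_tokens=3, overlap=1)
```

Overlap repeats the tail of one chunk at the head of the next, which can help retrieval when a sentence straddles a chunk boundary.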
Response Format#
Successful parse operations return comprehensive results:
{
  "parse_id": "parse_abcd1234",
  "status": "successful",
  "chunks": [
    {"content": "Document text chunk 1"},
    {"content": "Document text chunk 2"}
  ],
  "structured_data": {
    "invoice_data": {
      "invoice_number": "INV-2024-001",
      "total_amount": 1250.00
    }
  },
  "document_layout": {
    "pages": [{
      "page_number": 1,
      "page_fragments": [{
        "fragment_type": "text",
        "bbox": {"x1": 100, "y1": 200, "x2": 400, "y2": 250}
      }]
    }]
  },
  "page_classes": [
    {"page": 1, "classification": "invoice"}
  ]
}
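A response of this shape can be reduced to the parts most applications need. The sketch below relies only on the field names in the example above; `summarize_parse_result` is a hypothetical helper, and treating every status other than `"successful"` as an error is an assumption, since other status values are not listed here.

```python
from collections import defaultdict


def summarize_parse_result(result):
    """Collect text, structured data, and pages grouped by class from a parse result."""
    if result.get("status") != "successful":
        raise RuntimeError(f"parse not successful: status={result.get('status')!r}")
    pages_by_class = defaultdict(list)
    for entry in result.get("page_classes", []):
        pages_by_class[entry["classification"]].append(entry["page"])
    return {
        "text": "\n\n".join(chunk["content"] for chunk in result.get("chunks", [])),
        "structured_data": result.get("structured_data", {}),
        "pages_by_class": dict(pages_by_class),
    }


# Sample result copied from the response example (layout omitted for brevity).
sample_result = {
    "parse_id": "parse_abcd1234",
    "status": "successful",
    "chunks": [
        {"content": "Document text chunk 1"},
        {"content": "Document text chunk 2"},
    ],
    "structured_data": {
        "invoice_data": {"invoice_number": "INV-2024-001", "total_amount": 1250.00}
    },
    "page_classes": [{"page": 1, "classification": "invoice"}],
}

summary = summarize_parse_result(sample_result)
```

Note that results in `structured_data` are keyed by the `schema_name` sent in the request (`invoice_data` in this example).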
Migration and Compatibility#
API v2 maintains backward compatibility while introducing powerful new capabilities:
- Unified Endpoint: Single /documents/v2/parse endpoint replaces multiple v1 endpoints
- Enhanced Error Handling: Detailed error messages and status tracking
- Improved Performance: Faster processing with optimized document analysis
- Better Scaling: Handle larger documents and more complex schemas
The API v2 is available now in the Python SDK and Playground, ready for production workloads requiring sophisticated document understanding and structured data extraction.
Schemas can also be defined as Pydantic models and converted to JSON Schema:
from typing import List

from pydantic import BaseModel, Field

# Author and Conference are Pydantic models whose definitions are not shown in this excerpt.
class ResearchPaperMetadata(BaseModel):
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated to a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()
Usage Example#
Extract structured data from documents using your custom schemas:
from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"
Supported Formats#
- PDF documents
- Word documents (.docx, .doc)
- Spreadsheets (XLSX, XLSM, XLS, CSV)
- Images (PNG, JPG)
- Presentations (PPTX, Keynote)
- HTML pages
- Plain text files
API Reference#
# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "affiliation": {"type": "string"}
            }
          }
        }
      }
    }
  }'