Minor Release
v2.2.0

Advanced Schema Extraction

Extract structured data from any document using Pydantic schemas with improved accuracy and multi-format support

Key Highlights

  • Research paper metadata extraction
  • Pydantic schema support
  • Multi-format document support
  • Improved accuracy with structured outputs

Structured Data Extraction

This release adds schema extraction: define a Pydantic schema, and Tensorlake extracts matching structured data from a document in any of the supported formats.

Research Paper Schema Example

Define complex schemas for extracting structured information from research papers:

    from pydantic import BaseModel, Field
    from typing import List

    class Author(BaseModel):
        """Author information for a research paper"""
        name: str = Field(description="Full name of the author")
        affiliation: str = Field(description="Institution or organization affiliation")

    class Conference(BaseModel):
        """Conference or journal information"""
        name: str = Field(description="Name of the conference or journal")
        year: str = Field(description="Year of publication")
        location: str = Field(description="Location of the conference or journal publication")

    class ResearchPaperMetadata(BaseModel):
        """Complete schema for extracting research paper information"""
        authors: List[Author] = Field(
            description=(
                "List of authors with their affiliations. "
                "Authors will be listed below the title and above the main text of the paper. "
                "Authors will often be in multiple columns and there may be multiple authors "
                "associated to a single affiliation."
            )
        )
        conference_journal: Conference = Field(description="Conference or journal information")
        title: str = Field(description="Title of the research paper")

    # Convert to JSON schema for Tensorlake
    json_schema = ResearchPaperMetadata.model_json_schema()
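To see what that conversion produces, the sketch below runs `model_json_schema()` on a trimmed-down stand-in for the models above and inspects the result. It assumes Pydantic v2; `PaperSummary` is an illustrative name, not part of the release. Note that nested models are emitted once under `$defs` and referenced by `$ref`, not inlined.

```python
from typing import List

from pydantic import BaseModel, Field


class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")


class PaperSummary(BaseModel):
    """Illustrative, trimmed-down stand-in for ResearchPaperMetadata"""
    authors: List[Author] = Field(description="List of authors with their affiliations")
    title: str = Field(description="Title of the research paper")


json_schema = PaperSummary.model_json_schema()

# Top-level fields appear under "properties"; fields without defaults are required.
top_level_fields = sorted(json_schema["properties"])
required_fields = json_schema["required"]

# The nested Author model is defined under "$defs" and referenced from the array items.
nested_models = sorted(json_schema["$defs"])
author_ref = json_schema["properties"]["authors"]["items"]["$ref"]

print(top_level_fields)
print(required_fields)
print(nested_models)
print(author_ref)
```

The `description` strings on each field are carried into the generated schema, which is why the field descriptions in the example above are written as extraction hints.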

Usage Example

Extract structured data from documents using your custom schemas:

    from tensorlake import Client

    client = Client(api_key="your-api-key")

    # Extract metadata from a research paper
    result = client.extract_schema(
        document_id="doc_123",
        schema=ResearchPaperMetadata
    )

    print(result.title)
    # "Deep Learning for Natural Language Processing"

    print(result.authors[0].name)
    # "John Doe"

    print(result.conference_journal.name)
    # "NeurIPS 2024"
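The attribute access in the usage example suggests the result behaves like an instance of the schema model. As a rough sketch of what that typing buys you, the code below validates raw dictionaries against a schema with Pydantic v2's `model_validate` and reports mismatches. The payloads, the `parse_paper` helper, and the `PaperSummary` model are hypothetical and invented for illustration; this is not a description of how the Tensorlake client builds its result.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Author(BaseModel):
    name: str
    affiliation: str


class PaperSummary(BaseModel):
    # Illustrative stand-in for ResearchPaperMetadata
    title: str
    authors: List[Author]


def parse_paper(payload: dict) -> Optional[PaperSummary]:
    """Return a validated model, or None if the payload does not match the schema."""
    try:
        return PaperSummary.model_validate(payload)
    except ValidationError as exc:
        # Each error records the location of the offending field and the error type.
        for error in exc.errors():
            print("invalid field:", error["loc"], error["type"])
        return None


# Hypothetical payloads, invented for this example.
good = parse_paper({
    "title": "Deep Learning for Natural Language Processing",
    "authors": [{"name": "John Doe", "affiliation": "Example University"}],
})
bad = parse_paper({"title": "Untitled", "authors": [{"name": "John Doe"}]})  # no affiliation

print(good.authors[0].name if good else None)
print(bad)
```

Because both fields on `Author` are required, the second payload is rejected instead of silently producing an author with a missing affiliation.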

Supported Formats

  • PDF documents
  • Word documents (.docx, .doc)
  • Markdown files
  • HTML pages
  • Plain text files
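If you want to check a file locally before submitting it, a small helper along these lines may be enough. The extension mapping is an assumption: only `.docx` and `.doc` are named in the list above, and the other extensions (`.pdf`, `.md`, `.html`, `.txt`) are common conventions, not values confirmed by this release. The `supported_format` helper is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to the formats listed above.
# Only .docx and .doc are stated explicitly in the release notes.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF document",
    ".docx": "Word document",
    ".doc": "Word document",
    ".md": "Markdown file",
    ".html": "HTML page",
    ".txt": "Plain text file",
}


def supported_format(filename: str) -> Optional[str]:
    """Return the format label for a filename, or None if its extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())


print(supported_format("paper.PDF"))
print(supported_format("scan.png"))
```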

API Reference

    # Extract data using a custom schema
    curl -X POST https://api.tensorlake.ai/v2/extract-schema \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "document_id": "doc_123",
        "schema": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "authors": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "affiliation": {"type": "string"}
                }
              }
            }
          }
        }
      }'
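The same request can be assembled in Python with only the standard library. This is a minimal sketch that mirrors the curl call above (same endpoint, headers, and body); the `build_extract_request` helper is a hypothetical name, and the request is only constructed here, not sent, since the response format is not described in these notes.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.tensorlake.ai/v2/extract-schema"


def build_extract_request(api_key: str, document_id: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    body = json.dumps({"document_id": document_id, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same JSON schema as in the curl example.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "affiliation": {"type": "string"},
                },
            },
        },
    },
}

request = build_extract_request("YOUR_API_KEY", "doc_123", schema)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen(request).
```

If you already have a Pydantic model, the dictionary returned by `model_json_schema()` could be passed as `schema` in place of the hand-written one, assuming the endpoint accepts `$defs`/`$ref` references.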
