Advanced Schema Extraction

Structured Data Extraction

We're excited to introduce advanced schema extraction: describe the fields you need as a Pydantic schema, and Tensorlake extracts matching structured data from documents in any supported format.

Research Paper Schema Example

Define complex schemas for extracting structured information from research papers:

from pydantic import BaseModel, Field
from typing import List

class Author(BaseModel):
  """Author information for a research paper"""
  name: str = Field(description="Full name of the author")
  affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
  """Conference or journal information"""
  name: str = Field(description="Name of the conference or journal")
  year: str = Field(description="Year of publication")
  location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
  """Complete schema for extracting research paper information"""
  authors: List[Author] = Field(
      description="List of authors with their affiliations. Authors are listed below the title and above the main text of the paper. They often appear in multiple columns, and several authors may be associated with a single affiliation."
  )
  conference_journal: Conference = Field(description="Conference or journal information")
  title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()
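It can help to look at what the conversion produces before sending it anywhere. The sketch below assumes Pydantic v2 (where `model_json_schema()` is available) and uses a cut-down, hypothetical `PaperSketch` model in place of the full `ResearchPaperMetadata` so that it runs on its own:

```python
import json
from typing import List

from pydantic import BaseModel, Field


class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")


class PaperSketch(BaseModel):
    """Hypothetical, shortened stand-in for ResearchPaperMetadata"""
    title: str = Field(description="Title of the research paper")
    authors: List[Author] = Field(description="List of authors with their affiliations")


schema = PaperSketch.model_json_schema()

# Nested models are emitted once under "$defs" and referenced with "$ref".
print(sorted(schema["$defs"]))                   # ['Author']
print(schema["properties"]["authors"]["items"])  # {'$ref': '#/$defs/Author'}

# Field descriptions and class docstrings are carried into the schema.
print(schema["$defs"]["Author"]["description"])

# Fields without a default are listed as required.
print(schema["required"])

# Full schema, pretty-printed for inspection.
print(json.dumps(schema, indent=2))
```

Because the `Field` descriptions end up in the generated schema, they are worth writing carefully, as the `authors` description above does.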

Usage Example

Extract structured data from documents using your custom schemas:

from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
  document_id="doc_123",
  schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"
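The same Pydantic models can also check extracted data you receive as plain JSON, for example from the REST endpoint. The following is a minimal sketch, assuming Pydantic v2; the `PaperSketch` model, the payload, and its values are hypothetical and are not real service output:

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Author(BaseModel):
    name: str
    affiliation: str


class PaperSketch(BaseModel):
    """Hypothetical, shortened stand-in for ResearchPaperMetadata"""
    title: str
    authors: List[Author]


# Hypothetical payload; the actual response shape is not documented here.
payload = {
    "title": "Deep Learning for Natural Language Processing",
    "authors": [{"name": "John Doe", "affiliation": "Example University"}],
}

# model_validate turns the dict into typed objects with attribute access.
paper = PaperSketch.model_validate(payload)
print(paper.authors[0].name)

# A payload that is missing a required field raises ValidationError.
try:
    PaperSketch.model_validate({"title": "Untitled"})
except ValidationError as exc:
    missing = [err["loc"] for err in exc.errors()]
    print(missing)
```

Validating on the client side surfaces incomplete extractions early instead of failing later on a missing attribute.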

Supported Formats

PDF documents
Word documents (.docx, .doc)
Markdown files
HTML pages
Plain text files

API Reference

# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "document_id": "doc_123",
  "schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "authors": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "affiliation": {"type": "string"}
          }
        }
      }
    }
  }
}'
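The same request can be assembled from Python using only the standard library. This is a sketch rather than an official client: the endpoint, headers, and body mirror the curl call above, while the helper name `build_extract_request` is hypothetical and the response format is not assumed, so the network call is left commented out.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.tensorlake.ai/v2/extract-schema"


def build_extract_request(api_key: str, document_id: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"document_id": document_id, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same JSON schema as in the curl example.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "affiliation": {"type": "string"},
                },
            },
        },
    },
}

req = build_extract_request("YOUR_API_KEY", "doc_123", schema)
print(req.get_method(), req.full_url)

# To send it, uncomment the lines below (requires a valid API key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Instead of writing the schema by hand, you can pass the dict returned by `model_json_schema()` on your Pydantic model as the `schema` argument.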