Structured Data Extraction
We're excited to introduce schema extraction, which lets you extract structured data from documents in any supported format by describing the fields you want as a Pydantic schema.
Research Paper Schema Example
Define complex schemas for extracting structured information from research papers:
from pydantic import BaseModel, Field
from typing import List

class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    """Conference or journal information"""
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class ResearchPaperMetadata(BaseModel):
    """Complete schema for extracting research paper information"""
    authors: List[Author] = Field(
        description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated with a single affiliation."
    )
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaperMetadata.model_json_schema()
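One detail worth knowing about the conversion step: Pydantic v2's `model_json_schema()` places nested models such as `Author` and `Conference` under a top-level `$defs` key and points to them with `$ref`, whereas the schema in the API Reference section spells nested objects out inline. The source does not say whether the API resolves `$ref` pointers, so the following is only a sketch of one way to inline them if that turns out to be necessary. The helper name `inline_refs` is hypothetical, and the example schema is hand-written in the shape Pydantic produces so that the snippet runs without extra dependencies.

```python
import copy
from typing import Any


def inline_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with local "#/$defs/..." references inlined.

    Hypothetical helper. Assumes the schema is non-recursive
    (no definition refers back to itself).
    """
    defs = schema.get("$defs", {})

    def resolve(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                name = ref[len("#/$defs/"):]
                # Replace the pointer with a resolved copy of its target,
                # keeping sibling keys such as "description".
                target = resolve(copy.deepcopy(defs[name]))
                siblings = {k: resolve(v) for k, v in node.items() if k != "$ref"}
                return {**target, **siblings}
            # Drop the "$defs" table itself; everything it held is inlined.
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)


# Hand-written schema in the shape Pydantic v2 emits for nested models.
example = {
    "$defs": {
        "Author": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "affiliation": {"type": "string"},
            },
            "required": ["name", "affiliation"],
        }
    },
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"$ref": "#/$defs/Author"}},
    },
    "required": ["title", "authors"],
}

inlined = inline_refs(example)
print(inlined["properties"]["authors"]["items"]["type"])  # object
```

The helper builds a new dictionary and leaves its input untouched, so the original `json_schema` remains available if the referenced form is accepted as-is.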
Usage Example
Extract structured data from documents using your custom schemas:
from tensorlake import Client

client = Client(api_key="your-api-key")

# Extract metadata from a research paper
result = client.extract_schema(
    document_id="doc_123",
    schema=ResearchPaperMetadata
)

print(result.title)
# "Deep Learning for Natural Language Processing"

print(result.authors[0].name)
# "John Doe"

print(result.conference_journal.name)
# "NeurIPS 2024"
Supported Formats
- PDF documents
- Word documents (.docx, .doc)
- Markdown files
- HTML pages
- Plain text files
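A quick local check before submitting a file can catch unsupported inputs early. The sketch below maps the formats listed above to file extensions. Only `.docx` and `.doc` are named in the list; the other extensions are conventional guesses, and the names `SUPPORTED_EXTENSIONS` and `detect_format` are hypothetical, not part of any SDK.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from the listed formats to common file extensions.
# Only .docx and .doc appear in the list; the rest are conventional guesses.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF document",
    ".docx": "Word document",
    ".doc": "Word document",
    ".md": "Markdown file",
    ".html": "HTML page",
    ".htm": "HTML page",
    ".txt": "Plain text file",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format label for `path`, or None if the extension is not recognised."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


print(detect_format("paper.PDF"))    # PDF document
print(detect_format("slides.pptx"))  # None
```

The check looks only at the file extension, so it says nothing about whether the file contents are actually valid for that format.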
API Reference
# Extract data using a custom schema
curl -X POST https://api.tensorlake.ai/v2/extract-schema \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_id": "doc_123",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "authors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "affiliation": {"type": "string"}
            }
          }
        }
      }
    }
  }'
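For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch: the endpoint, headers, and body fields are taken from the curl example above, while the function name `build_extract_request` is hypothetical. The response format is not documented in this section, so the sending step is left commented out and makes no assumptions about it beyond a JSON body.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
EXTRACT_URL = "https://api.tensorlake.ai/v2/extract-schema"


def build_extract_request(document_id: str, schema: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"document_id": document_id, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    "doc_123",
    {"type": "object", "properties": {"title": {"type": "string"}}},
    "YOUR_API_KEY",
)
print(request.get_method(), request.full_url)

# To send it (assumes the API replies with a JSON body):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.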