What's new
Tensorlake now automatically detects and corrects header hierarchy in parsed documents. Enable cross_page_header_detection=True to get properly structured section headers with accurate level attributes, even when OCR engines misidentify header depths.
Why it matters
- Accurate document structure - preserve logical hierarchy of research papers, technical docs, and reports
- Cross-page headers - detect headers spanning page breaks without fragmentation
- Better RAG quality - improved chunking boundaries and context preservation for retrieval
- Knowledge graphs - build accurate document trees with proper parent-child relationships
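To illustrate the chunking point above, here is a minimal sketch of header-aware chunking. It assumes fragments have already been flattened into plain dictionaries with hypothetical "type", "level", and "text" keys; this is not the SDK's object model, and the body text is placeholder data. Each chunk records the path of headers above it, so a wrong level on "2.2" would silently drop its parent section from the path.

```python
def chunk_by_headers(fragments):
    """Group fragments into chunks, one per section header.

    fragments: iterable of dicts with "type" and "text" keys, plus a
    "level" key for headers (a hypothetical shape, for illustration only).
    """
    path = []    # (level, title) pairs for the headers currently open
    chunks = []
    for frag in fragments:
        if frag["type"] == "section_header":
            # Close any open header at the same or a deeper level.
            while path and path[-1][0] >= frag["level"]:
                path.pop()
            path.append((frag["level"], frag["text"]))
            chunks.append({"path": [title for _, title in path], "text": []})
        else:
            if not chunks:
                # Text that appears before the first header.
                chunks.append({"path": [], "text": []})
            chunks[-1]["text"].append(frag["text"])
    return chunks


example_fragments = [
    {"type": "section_header", "level": 1, "text": "2. Materials and Methods"},
    {"type": "text", "text": "(placeholder body text)"},
    {"type": "section_header", "level": 2, "text": "2.1. Subjects"},
    {"type": "text", "text": "(placeholder body text)"},
    {"type": "section_header", "level": 2, "text": "2.2. Statistical Analysis"},
    {"type": "text", "text": "(placeholder body text)"},
    {"type": "section_header", "level": 1, "text": "3. Results"},
    {"type": "text", "text": "(placeholder body text)"},
]

for chunk in chunk_by_headers(example_fragments):
    print(" > ".join(chunk["path"]))
# 2. Materials and Methods
# 2. Materials and Methods > 2.1. Subjects
# 2. Materials and Methods > 2.2. Statistical Analysis
# 3. Results
```

Storing the header path with each chunk is one possible design; it lets a retriever show where a passage sits in the document without re-parsing it.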
The problem
OCR engines frequently misidentify header hierarchy. A subsection numbered "2.2" might be marked as a section-level header (##) instead of a nested subsection header (###):
    # Effectiveness of ω-3 Polyunsaturated Fatty Acids...
    ## 1. Introduction
    ## 2. Materials and Methods
    ### 2.1. Subjects
    ## 2.2. Statistical Analysis   ← Wrong level (should be ###)
    ## 3. Results

Section 2.2 should be level 2 (nested under section 2), not level 1 (peer to section 2).
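The mismatch above can be spotted mechanically when headers carry section numbers. The sketch below is a simple numbering-based check written for illustration only; it is not a description of how Tensorlake corrects hierarchy, and the helper names `depth_from_numbering` and `find_mismatches` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches a leading section number such as "2", "2.", or "2.2." before the title.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s")


def depth_from_numbering(title: str) -> Optional[int]:
    """Return 1 for "2. Methods", 2 for "2.2. Analysis", None if unnumbered."""
    match = _NUMBERING.match(title)
    if match is None:
        return None
    return match.group(1).count(".") + 1


def find_mismatches(headers: List[Tuple[str, str]]) -> List[Tuple[str, int, int]]:
    """Compare markdown markers against the depth implied by numbering.

    headers: (marker, title) pairs, e.g. ("##", "2.2. Statistical Analysis").
    Returns (title, reported_level, expected_level) for each disagreement,
    using the convention 0 = #, 1 = ##, 2 = ###.
    """
    mismatches = []
    for marker, title in headers:
        expected = depth_from_numbering(title)
        reported = len(marker) - 1
        if expected is not None and expected != reported:
            mismatches.append((title, reported, expected))
    return mismatches


ocr_headers = [
    ("##", "1. Introduction"),
    ("##", "2. Materials and Methods"),
    ("###", "2.1. Subjects"),
    ("##", "2.2. Statistical Analysis"),
    ("##", "3. Results"),
]

print(find_mismatches(ocr_headers))
# [('2.2. Statistical Analysis', 1, 2)]
```

A check like this only works for numbered headers and would misread titles that merely start with a number, which is one reason document-wide analysis is more robust than a per-header rule.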
How it works
Tensorlake analyzes header patterns across the entire document to correct hierarchy:
    from tensorlake.documentai import DocumentAI, ParsingOptions

    doc_ai = DocumentAI()

    result = doc_ai.parse_and_wait(
        file="https://tlake.link/docs/gong-16-research-paper",
        parsing_options=ParsingOptions(
            cross_page_header_detection=True  # Enable header correction
        )
    )

    # Access corrected headers
    for page in result.pages:
        for page_fragment in page.page_fragments:
            if page_fragment.fragment_type == "section_header":
                print(f"Level {page_fragment.content.level}: {page_fragment.content.content}")

Corrected output:
    level=0, content='Article'
    level=0, content='Effectiveness of ω-3 Polyunsaturated Fatty Acids...'
    level=1, content='1. Introduction'
    level=1, content='2. Materials and Methods'
    level=2, content='2.1. Subjects'
    level=2, content='2.2. Statistical Analysis'  # Corrected to level 2
    level=1, content='3. Results'

What you get
Section headers now include:
- level: Integer representing header depth (0 = #, 1 = ##, 2 = ###, etc.)
- content: Clean header text without markdown formatting
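Because each header carries an integer level and its text, a nested tree with parent-child relationships can be rebuilt from the flat header sequence. The sketch below is one way to do that, assuming the headers have been collected as plain (level, content) tuples in document order; `HeaderNode` and `build_tree` are hypothetical names, not part of the Tensorlake SDK.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class HeaderNode:
    level: int
    content: str
    children: List["HeaderNode"] = field(default_factory=list)


def build_tree(headers: Iterable[Tuple[int, str]]) -> HeaderNode:
    """Build a tree from (level, content) pairs given in document order."""
    root = HeaderNode(level=-1, content="<root>")  # synthetic root above level 0
    stack = [root]
    for level, content in headers:
        node = HeaderNode(level, content)
        # Pop until the top of the stack is a strictly shallower header.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


corrected = [
    (0, "Article"),
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]

tree = build_tree(corrected)
methods = tree.children[1].children[1]
print(methods.content, "->", [child.content for child in methods.children])
# 2. Materials and Methods -> ['2.1. Subjects', '2.2. Statistical Analysis']
```

With the uncorrected levels, "2.2. Statistical Analysis" would have been attached as a sibling of "2. Materials and Methods" rather than as its child.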
Build document outlines programmatically:
    for page in result.pages:
        for page_fragment in page.page_fragments:
            if page_fragment.fragment_type == "section_header":
                # Indent two spaces per header level
                indent = "  " * page_fragment.content.level
                print(f"{indent}• {page_fragment.content.content}")

    # Output:
    # • Article
    # • Effectiveness of ω-3 Polyunsaturated Fatty Acids...
    #   • 1. Introduction
    #   • 2. Materials and Methods
    #     • 2.1. Subjects
    #     • 2.2. Statistical Analysis
    #   • 3. Results

Try it
Colab Notebook: Header Detection Example
Documentation: Parsing Options Reference
Enable cross_page_header_detection=True in your parse requests to get corrected document hierarchy automatically.
Status
✅ Live now in the API, SDK, and on cloud.tensorlake.ai.
Add cross_page_header_detection=True to your ParsingOptions to enable.
