New: Header Detection and Correction for accurate document hierarchy

What's new

Tensorlake can now detect and correct header hierarchy in parsed documents. Set cross_page_header_detection=True in ParsingOptions to get section headers with accurate level attributes, even when the OCR engine misidentifies header depths.

Comparison of OCR vs header correction showing improved structured extraction with Tensorlake DocumentAI.

Why it matters

Accurate document structure - preserve logical hierarchy of research papers, technical docs, and reports
Cross-page headers - detect headers spanning page breaks without fragmentation
Better RAG quality - improved chunking boundaries and context preservation for retrieval
Knowledge graphs - build accurate document trees with proper parent-child relationships

The problem

OCR engines frequently misidentify header hierarchy. A subsection labeled "2.2" might get marked as a section-level header (##), a peer of section 2, instead of a nested subsection header (###):

# Effectiveness of ω-3 Polyunsaturated Fatty Acids...
## 1. Introduction
## 2. Materials and Methods
### 2.1. Subjects
## 2.2. Statistical Analysis  ← Wrong level (should be ###)
## 3. Results

With zero-based levels (0 = #, 1 = ##, 2 = ###), Section 2.2 should be level 2 (nested under section 2), not level 1 (peer to section 2).
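To make the mismatch concrete, here is a minimal sketch that reads the intended depth off a header's numeric prefix. It is an illustration only and not Tensorlake's correction algorithm; the helper name `level_from_numbering` and its regular expression are assumptions introduced for this example.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; it is not part of the Tensorlake SDK.
_NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")


def level_from_numbering(title: str) -> Optional[int]:
    """Return the depth implied by a numeric prefix such as '2.2.', or None."""
    match = _NUMBERING.match(title)
    if match is None:
        return None  # unnumbered header: the numbering gives no hint
    # '2' has one component -> level 1 (##); '2.2' has two -> level 2 (###)
    return len(match.group(1).split("."))


for title in ["1. Introduction", "2.1. Subjects", "2.2. Statistical Analysis"]:
    print(level_from_numbering(title), title)
```

Under this heuristic "2.1." and "2.2." both imply level 2, which is why marking "2.2. Statistical Analysis" as ## is inconsistent with its sibling.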

How it works

Tensorlake analyzes header patterns across the entire document to correct hierarchy:

from tensorlake.documentai import DocumentAI, ParsingOptions

doc_ai = DocumentAI()

result = doc_ai.parse_and_wait(
    file="https://tlake.link/docs/gong-16-research-paper",
    parsing_options=ParsingOptions(
        cross_page_header_detection=True  # Enable header correction
    )
)

# Access corrected headers: each section_header fragment carries a level and its text
for page in result.pages:
    for page_fragment in page.page_fragments:
        if page_fragment.fragment_type == "section_header":
            header = page_fragment.content
            print(f"level={header.level}, content={header.content!r}")

Corrected output:

level=0, content='Article'
level=0, content='Effectiveness of ω-3 Polyunsaturated Fatty Acids...'
level=1, content='1. Introduction'
level=1, content='2. Materials and Methods'
level=2, content='2.1. Subjects'
level=2, content='2.2. Statistical Analysis'  # Corrected to level 2
level=1, content='3. Results'

What you get

Section headers now include:

level: Integer representing header depth (0 = #, 1 = ##, 2 = ###, etc.)
content: Clean header text without markdown formatting
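Because content carries no markdown markers, the level is what lets you rebuild a heading. A small sketch of the mapping described above; the function name `to_markdown_heading` is hypothetical and not part of the SDK:

```python
def to_markdown_heading(level: int, content: str) -> str:
    """Render a header using the 0 = '#', 1 = '##', 2 = '###' convention."""
    if level < 0:
        raise ValueError("level must be non-negative")
    # Level is zero-based, so the number of '#' characters is level + 1.
    return f"{'#' * (level + 1)} {content}"


print(to_markdown_heading(2, "2.2. Statistical Analysis"))
# prints: ### 2.2. Statistical Analysis
```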

Build document outlines programmatically:

for page in result.pages:
    for page_fragment in page.page_fragments:
        if page_fragment.fragment_type == "section_header":
            indent = "  " * page_fragment.content.level
            print(f"{indent}• {page_fragment.content.content}")

# Output:
# • Article
# • Effectiveness of ω-3 Polyunsaturated Fatty Acids...
#   • 1. Introduction
#   • 2. Materials and Methods
#     • 2.1. Subjects
#     • 2.2. Statistical Analysis
#   • 3. Results
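The same level values can drive a parent-child tree, for example as input to a knowledge graph. The sketch below is one possible approach, not a Tensorlake API: the `Node` class and `build_tree` function are hypothetical, and the header list is copied by hand from the corrected output above rather than read from a live parse result.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    level: int
    content: str
    children: List["Node"] = field(default_factory=list)


def build_tree(headers: List[Tuple[int, str]]) -> Node:
    """Nest each header under the nearest preceding header with a smaller level."""
    root = Node(level=-1, content="<document>")  # synthetic root above level 0
    stack = [root]
    for level, content in headers:
        # Pop until the top of the stack is strictly shallower than this header.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(level, content)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# (level, content) pairs taken from the corrected output; levels are non-negative.
headers = [
    (0, "Article"),
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]

tree = build_tree(headers)
methods = tree.children[1].children[1]  # "2. Materials and Methods"
print([child.content for child in methods.children])
```

With the uncorrected levels, "2.2. Statistical Analysis" would be attached as a sibling of "2. Materials and Methods" instead of as its child, which is the kind of error the corrected hierarchy avoids.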

Try it

Colab Notebook: Header Detection Example

Documentation: Parsing Options Reference

Enable cross_page_header_detection=True in your parse requests to get corrected document hierarchy automatically.

Status

✅ Live now in the API, SDK, and on cloud.tensorlake.ai.

Add cross_page_header_detection=True to your ParsingOptions to enable.

Key Highlights