New: Header Detection and Correction for accurate document hierarchy

Tensorlake now detects and corrects document headers across pages, maintaining proper hierarchy even when OCR misidentifies header levels.

Key Highlights

  • Automatic header hierarchy correction based on numbering patterns and visual structure
  • Cross-page header detection without fragmentation
  • Section headers include accurate level attributes (0 for #, 1 for ##, etc.)

What's new

Tensorlake now automatically detects and corrects header hierarchy in parsed documents. Enable cross_page_header_detection=True to get properly structured section headers with accurate level attributes, even when OCR engines misidentify header depths.

Comparison of OCR vs header correction showing improved structured extraction with Tensorlake DocumentAI.

Why it matters

  • Accurate document structure - preserve logical hierarchy of research papers, technical docs, and reports
  • Cross-page headers - detect headers spanning page breaks without fragmentation
  • Better RAG quality - improved chunking boundaries and context preservation for retrieval
  • Knowledge graphs - build accurate document trees with proper parent-child relationships

The problem

OCR engines frequently misidentify header hierarchy. A subsection numbered "2.2" might be marked as a section-level header (##), a peer of section 2, instead of a nested subsection header (###):

    # Effectiveness of ω-3 Polyunsaturated Fatty Acids...
    ## 1. Introduction
    ## 2. Materials and Methods
    ### 2.1. Subjects
    ## 2.2. Statistical Analysis    ← Wrong level (should be ###)
    ## 3. Results

Section 2.2 should be level 2 (nested under section 2), not level 1 (peer to section 2).
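To see why the numbering alone is enough to flag this case, here is a minimal sketch of a numbering-based check. It is an illustration only, not Tensorlake's actual correction logic: the function name `level_from_numbering` and the rule "one level per numeric component, with the document title at level 0" are assumptions made for this example.

```python
import re
from typing import Optional

# Matches a leading dotted section number such as "2", "2.2." or "3.1.4".
_NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s")


def level_from_numbering(header_text: str) -> Optional[int]:
    """Guess a header level from its section number (hypothetical heuristic).

    Uses the level convention 0 = #, 1 = ##, 2 = ### and assumes the
    document title occupies level 0, so "2." maps to 1 and "2.2." to 2.
    Returns None when the header carries no leading number.
    """
    match = _NUMBERING.match(header_text)
    if match is None:
        return None
    return len(match.group(1).split("."))


# (level reported by OCR, header text) for the numbered headers above.
ocr_headers = [
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (1, "2.2. Statistical Analysis"),  # level misread by OCR
    (1, "3. Results"),
]

# Keep only headers whose numbering disagrees with the OCR level.
mismatches = [
    (text, ocr_level, level_from_numbering(text))
    for ocr_level, text in ocr_headers
    if level_from_numbering(text) not in (None, ocr_level)
]
print(mismatches)  # [('2.2. Statistical Analysis', 1, 2)]
```

A real corrector would also need to handle unnumbered headers and visual cues, which this sketch deliberately ignores.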

How it works

Tensorlake analyzes header patterns across the entire document to correct hierarchy:

    from tensorlake.documentai import DocumentAI, ParsingOptions

    doc_ai = DocumentAI()

    result = doc_ai.parse_and_wait(
        file="https://tlake.link/docs/gong-16-research-paper",
        parsing_options=ParsingOptions(
            cross_page_header_detection=True  # Enable header correction
        )
    )

    # Access corrected headers
    for page in result.pages:
        for page_fragment in page.page_fragments:
            if page_fragment.fragment_type == "section_header":
                print(f"Level {page_fragment.content.level}: {page_fragment.content.content}")

Corrected headers (the level and content fields of each section header):

    level=0, content='Article'
    level=0, content='Effectiveness of ω-3 Polyunsaturated Fatty Acids...'
    level=1, content='1. Introduction'
    level=1, content='2. Materials and Methods'
    level=2, content='2.1. Subjects'
    level=2, content='2.2. Statistical Analysis'  # Corrected to level 2
    level=1, content='3. Results'
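With reliable levels, the flat header list can be turned into a tree with parent-child relationships. The sketch below shows one common way to do this with a stack. It works on plain (level, content) tuples copied from the corrected headers above rather than on SDK objects, and `HeaderNode` and `build_tree` are hypothetical names, not part of the Tensorlake SDK.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HeaderNode:
    level: int
    content: str
    children: List["HeaderNode"] = field(default_factory=list)


def build_tree(headers: List[Tuple[int, str]]) -> List[HeaderNode]:
    """Nest each header under the nearest preceding shallower header."""
    roots: List[HeaderNode] = []
    stack: List[HeaderNode] = []  # chain of currently open ancestors
    for level, content in headers:
        node = HeaderNode(level, content)
        # Close every open header at the same depth or deeper.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


# (level, content) pairs taken from the corrected headers.
headers = [
    (0, "Article"),
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "1. Introduction"),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]

tree = build_tree(headers)
methods = tree[1].children[1]
print(methods.content, "->", [child.content for child in methods.children])
# 2. Materials and Methods -> ['2.1. Subjects', '2.2. Statistical Analysis']
```

Had section 2.2 kept its uncorrected level of 1, it would have ended up as a sibling of "2. Materials and Methods" instead of its child.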

What you get

Section headers now include:

  • level: Integer representing header depth (0 = #, 1 = ##, 2 = ###, etc.)
  • content: Clean header text without markdown formatting
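Because content comes without markdown markers, a heading has to be re-rendered from the two fields when markdown is needed. A minimal sketch, assuming only the level mapping listed above (the helper name `to_markdown_heading` is hypothetical):

```python
def to_markdown_heading(level: int, content: str) -> str:
    """Render a header with the mapping level 0 -> '#', level 1 -> '##', and so on."""
    return "#" * (level + 1) + " " + content


print(to_markdown_heading(2, "2.2. Statistical Analysis"))  # ### 2.2. Statistical Analysis
```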

Build document outlines programmatically:

    for page in result.pages:
        for page_fragment in page.page_fragments:
            if page_fragment.fragment_type == "section_header":
                # Two spaces of indentation per header level
                indent = "  " * page_fragment.content.level
                print(f"{indent}• {page_fragment.content.content}")

    # Output:
    # • Article
    # • Effectiveness of ω-3 Polyunsaturated Fatty Acids...
    #   • 1. Introduction
    #   • 2. Materials and Methods
    #     • 2.1. Subjects
    #     • 2.2. Statistical Analysis
    #   • 3. Results
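The same levels can also label retrieval chunks with the section they belong to. The following sketch derives a breadcrumb path for each header. It uses plain (level, content) tuples rather than SDK objects, and `header_paths` is a hypothetical helper, not a Tensorlake function.

```python
from typing import Iterable, Iterator, List, Tuple


def header_paths(headers: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Yield the full ancestor path for each (level, content) header."""
    trail: List[Tuple[int, str]] = []
    for level, content in headers:
        # Drop headers at the same depth or deeper; what remains are ancestors.
        while trail and trail[-1][0] >= level:
            trail.pop()
        trail.append((level, content))
        yield " > ".join(text for _, text in trail)


headers = [
    (0, "Effectiveness of ω-3 Polyunsaturated Fatty Acids..."),
    (1, "2. Materials and Methods"),
    (2, "2.1. Subjects"),
    (2, "2.2. Statistical Analysis"),
    (1, "3. Results"),
]

for path in header_paths(headers):
    print(path)
```

Storing such a path alongside each chunk is one way to keep section context available at retrieval time.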

Try it

Colab Notebook: Header Detection Example

Documentation: Parsing Options Reference

Enable cross_page_header_detection=True in your parse requests to get corrected document hierarchy automatically.

Status

✅ Live now in the API, SDK, and on cloud.tensorlake.ai.

Add cross_page_header_detection=True to your ParsingOptions to enable.
