
Fix Broken Context in RAG with Tensorlake + Chonkie

TL;DR

Naïve RAG pipelines break on real-world documents because parsers flatten structure and chunkers cut blindly. Tensorlake parses documents into hierarchy-aware, enriched outputs. Chonkie turns that into coherent, retrieval-ready chunks. Together, they reduce hallucinations, improve retrieval precision, and give your RAG system the faithful context it needs.

If you’ve ever tried building a RAG pipeline on top of real-world documents, you’ve probably seen it fail in familiar ways:

  • Contracts where one clause is cut off mid-sentence and lands in a different chunk.
  • Financial statements where a table gets separated from the explanation that makes it meaningful.
  • Research papers where methods, results, and figures are all flattened into the same block of text.

The result is bad context, which leads to bad retrieval and causes models to hallucinate. As discussed in our Advanced RAG blog, the fix isn’t just bigger context windows; it’s better context. That means parsing documents faithfully and chunking them intelligently.

That’s where Tensorlake and Chonkie come together. Tensorlake provides clean, hierarchy-aware parsing. Chonkie is an open-source library that many teams now use as a default for chunking in RAG pipelines. It takes Tensorlake’s structured output and turns it into retrieval-ready chunks. The two are complementary: parsing alone does not make documents retrievable, and chunking without structure just rearranges noise. Together, they form a solid foundation for high-quality RAG.

Why Parsing + Chunking Must Work Together#

The ideal chunk is one that contains enough information to answer questions about a specific topic without pulling in unrelated noise. To get there, semantic boundaries have to be preserved across chunks: methods separated from results, tables linked with their explanations, and definitions kept with the terms they define.

Most PDF parsers do not expose enough signals to detect those boundaries. They flatten the document and leave chunkers to guess where sections begin and end.
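
To see the failure concretely, here’s a toy sketch (contrived text, not real parser output): a fixed-size splitter cuts wherever the character budget runs out, even mid-word.

```python
# Toy illustration: blind fixed-size chunking severs a clause from its context.
text = (
    "## Termination. Either party may terminate this agreement with 30 days "
    "notice. Termination does not affect accrued payment obligations."
)
chunk_size = 60  # characters, purely for illustration
blind_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for c in blind_chunks:
    print(repr(c))
# Both breaks land mid-word, splitting the notice period and the
# accrued-obligations rule across chunks, so retrieval can surface
# half a clause with no surrounding context.
```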

A hierarchy-aware parser changes that. It maps section starts and ends, even when they continue across pages, and keeps tables and figures intact. Tensorlake also supports cross-page header detection, which keeps header levels consistent when an H2 flows into the next page and is followed by several H3s. This is where many commodity parsers fall short.

Tensorlake goes beyond OCR. It detects:

  • Section hierarchy across pages (##, ###, etc.)
  • Tables preserved in HTML/Markdown with merged cells intact
  • Figures automatically summarized alongside their captions

That means instead of flat text, you get a structured blueprint of the document.

Chonkie can then split along real boundaries, keep related content together, and apply semantic chunking where it matters most. The result is chunks that are retrieval-ready and far more reliable in RAG.

Smarter Chunking with Chonkie#

Chonkie is a lightweight, high-performance chunking framework for RAG applications. When paired with Tensorlake’s hierarchy-aware parsing, it can make full use of the preserved structure to produce coherent, context-rich chunks. Some of the key strategies it supports include:

  • Recursive chunking (common default): Fast and cost-efficient, which is why it is the industry default. It splits text with a series of hierarchical rules and separators (double newlines, paragraphs, sentences, etc.) until the chunk size limit is reached. It works well for simpler documents, but it has no semantic awareness of topic shifts, so in complex documents like research papers it can still split ideas down the middle.
  • Semantic chunking (recommended choice): The most effective strategy for structured documents. It first splits text into sentences or paragraphs, then uses embeddings to measure semantic similarity between adjacent units; a new chunk begins when similarity drops below a threshold, so each chunk represents a complete idea (see the sketch after this list). For a research paper, that might mean an entire subsection with all its supporting sentences rather than an arbitrary slice of text. Applied to Tensorlake’s structured Markdown output, semantic chunking can align with section boundaries and avoid splitting concepts across chunks.
  • Late chunking (higher computation): A different approach altogether. Instead of chunking first and embedding each piece separately, it embeds the entire document at once using a long-context model, then divides and pools the token-level embeddings into chunk representations. Because the embedding step sees the whole document, each chunk vector retains more global context. This can improve retrieval for queries that span multiple sections, but it demands more compute, longer runtimes, and models with large context windows.
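
To make the semantic strategy concrete, here is a minimal sketch of threshold-based splitting, referenced in the bullet above (our own illustration, not Chonkie’s internals; the sentence-transformers model and the 0.5 threshold are assumptions):

```python
# Minimal sketch of similarity-threshold chunking (illustrative, not Chonkie's code).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_split(sentences: list[str], threshold: float = 0.5) -> list[str]:
    # Embed every sentence, then start a new chunk wherever the similarity
    # between adjacent sentences drops below the threshold (a topic shift).
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(vecs[i - 1], vecs[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Chonkie’s SemanticChunker layers sentence grouping, similarity windows, and token budgets on top of this basic idea; you’ll see its parameters in Step 2.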

Theory only takes us so far. Next, let’s build a real RAG pipeline and test Tensorlake + Chonkie on a research paper.

Hands-on: Better RAG for Research Papers with Tensorlake + Chonkie#

Let’s walk through an example using a research paper. The same pattern applies to contracts, financial filings, or technical manuals, but research papers make the structural complexity obvious.

Step 1: Parse with Tensorlake#

The first step is to parse the research paper into a format that preserves its structure. Instead of flattening everything into plain text, Tensorlake Document AI gives you clean Markdown with heading levels, tables in HTML, and figures summarized with captions.

Since we’re using Chonkie to handle chunking, we set chunking_strategy to NONE. This ensures Tensorlake only parses the document into Markdown, leaving the actual splitting to Chonkie. One key feature of Tensorlake is cross-page header detection. Research papers often have a section header (say, an H2) on one page, followed by multiple H3 subsections that continue onto the next page. Standard parsers, such as Azure’s or AWS’s, often misclassify those as new sections. Tensorlake correctly maintains the hierarchy across pages, which is critical for meaningful chunking.

chonkie_demo.py
```python
from tensorlake.documentai import (
    DocumentAI,
    ParsingOptions,
    EnrichmentOptions,
    ChunkingStrategy,
    PageFragmentType,
)

# Initialize Tensorlake client
doc_ai = DocumentAI(api_key="your_tensorlake_api_key")

# Using an arXiv-hosted research paper as input
file_id = "https://tlake.link/docs/sota-research-paper"

# Configure Tensorlake for optimal research paper parsing
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.NONE,  # Parse into Markdown, let Chonkie handle chunking
    cross_page_header_detection=True,         # Keeps headers consistent across page breaks
)

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize this figure in the context of the research findings.",
    table_summarization=True,
    table_summarization_prompt="Summarize this table's significance to the paper's results.",
)

result = doc_ai.parse_and_wait(
    file_id,
    parsing_options=parsing_options,
    enrichment_options=enrichment_options,
)
```

For research papers, we can go further by defining a custom JSON schema to pull exactly the metadata we care about: title, authors, abstract, keywords, and a list of section headings with their nesting levels. This produces a searchable blueprint of the document alongside the raw content.
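
For illustration, such a schema might look like the sketch below (the field names and nesting are our own assumptions, not a fixed Tensorlake contract; we don’t use structured extraction in this walkthrough):

```python
# Hypothetical JSON schema for structured extraction (illustrative only).
paper_metadata_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
        "abstract": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "level": {"type": "integer"},  # 1 = H1, 2 = H2, ...
                },
            },
        },
    },
}
```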

Tensorlake’s parsing pipeline is driven by three main option sets:

  1. Parsing Options control the segmentation strategy.
  2. Structured Extraction Options define a JSON schema for extracting specific fields so they’re machine-readable from the start (not used in this walkthrough).
  3. Enrichment Options, when enabled, automatically summarize tables and figures with custom prompts to capture context-specific insights.

Once the document is parsed, we have clean Markdown plus summaries for tables and figures. All of this data is returned by that single parse API call and stored in the result:

chonkie_demo.py
```python
# Section-based content chunks from parsed sections
sections_with_content = []
for chunk in result.chunks:
    sections_with_content.append({
        'page_number': chunk.page_number,
        'content': chunk.content
    })
print(f"Tensorlake extracted {len(sections_with_content)} structured sections.")
print(sections_with_content)

# Summaries of tables and figures
for page in result.pages:
    for i, fragment in enumerate(page.page_fragments):
        if fragment.fragment_type == PageFragmentType.TABLE:
            print(f"Table {i} on Page {page.page_number}: {fragment.content.summary} ---")
        elif fragment.fragment_type == PageFragmentType.FIGURE:
            print(f"Figure {i} on Page {page.page_number}: {fragment.content.summary} ---")
```

Tensorlake outputs Markdown with heading levels (##, ###) so you can see where each section starts and ends. Here’s an example of how a Methods section looks in chunked Markdown:

Screenshot of “Chunk 5”: the chunk opens with “## 2 Methods” and “### 2.1 Definitions” and keeps the distillation definitions together: the student model (the trainable text embedding model producing vector representations), the teacher model (a state-of-the-art embedding model that guides the student but is not trained itself), and the vector notation (s_x and t_x for a single text x from the student and the normalized, concatenated teacher models; S_X and T_X for the corresponding batch matrices).

And here’s a figure alongside Tensorlake’s automatic summarization:

Slide titled “Figure Summaries with Tensorlake,” with two panels. Left: the Jasper model architecture diagram, in which an image path (Siglip Vision Encoder → AvgPool2d) and a text path (Stella Input Embedding) feed a Stella Encoder, followed by mean pooling and four fully connected heads (FC1–FC4) producing vectors of sizes 12,288, 1,024, 512, and 256; captioned “Figure 1: The model architecture of Jasper model.” Right: Tensorlake’s figure summarization output, which explains the architecture in bullet points covering multi-modal input processing, unified encoder representation, multi-output heads for teacher distillation, and the goal of aligning student outputs with the teacher models.

Step 2: Chunk with Chonkie#

Now comes the critical part: turning Tensorlake’s structured output into retrieval-ready chunks. With plain text, a recursive chunker might split in the middle of a method description or separate a table from its caption. But with Tensorlake’s hierarchy, Chonkie can align chunk boundaries with real document boundaries.

Here’s how we apply semantic chunking:

chonkie_demo.py
```python
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    threshold=0.5,      # Similarity threshold
    chunk_size=1024,    # Maximum tokens per chunk
    min_sentences=2,    # Initial sentences per chunk
    mode="window"
)

semantic_chunks = []
for section in sections_with_content:
    # Chonkie expects a single text string; it returns SemanticChunk objects
    chunks = chunker.chunk(section["content"])
    for ch in chunks:
        # ch.text, ch.token_count, ch.sentences[...] available
        if ch.text.strip():
            semantic_chunks.append({
                "text": ch.text,
                "token_count": ch.token_count,
                "page_hint": section["page_number"]
            })

print(f"SemanticChunker produced {len(semantic_chunks)} chunks")
```
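
A note on the parameters above (our reading of the knobs, not official guidance): threshold controls how readily a similarity drop starts a new chunk, so lowering it merges more material into fewer, longer chunks; chunk_size caps chunks in tokens so they fit your embedding model’s window; and min_sentences prevents degenerate one-line chunks. For dense research papers, a moderate threshold like 0.5 is a reasonable starting point to tune against the evaluation in Step 3.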

Instead of arbitrary slices, these chunks respect sections and sub-sections, keeping tables, figures, and their explanations together.

Step 3: Evaluate the Chunks#

A good chunk should:

  • Respect boundaries: Chunk breaks should land on the paper’s natural structure: sections, subsections, and other explicit boundaries. We compare predicted boundaries against true section headers using WindowDiff and Pk from the topic segmentation literature; lower scores mean chunk breaks occur where readers expect them.
  • Stay coherent: Within a chunk, sentences should “stick together”; across boundaries, they should diverge. Using embeddings, we measure:
    • Intra-chunk similarity (higher = coherent chunks)
    • Boundary dissimilarity (higher = cleaner separations)
    Together, these ensure that chunks maximize relevance internally while minimizing noise from unrelated material.
  • Preserve integrity: Academic papers and technical reports rely on figures, tables, and definitions. For chunks to be useful in retrieval, these elements must stay with their explanatory context. We track this with simple heuristics:
    • Section fragmentation: how often a single section is broken into multiple pieces. Ideally, sections remain intact or split minimally.
    • Figure/table proximity: whether references like “Figure 2” or “Table 1” stay grouped with at least a few of the explanatory sentences that follow, preserving flow and preventing loss of meaning.

By measuring boundary accuracy, semantic cohesion, and structural integrity, you can see whether the pipeline is producing retrieval units that an LLM can faithfully use.
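
To make these checks concrete, here is a minimal sketch of the boundary and cohesion signals (our own illustration: the embedding model, the nltk scoring helpers, and the toy boundary strings are assumptions, not part of Tensorlake or Chonkie):

```python
# Sketch of the evaluation signals described above.
import numpy as np
from nltk.metrics.segmentation import windowdiff, pk  # standard topic-segmentation metrics
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Boundary accuracy: encode true and predicted breaks as per-sentence
#    boundary strings ("1" = a chunk starts at this sentence), then score.
true_breaks = "1000100001000"  # toy example; derive from Tensorlake's section headers
pred_breaks = "1000100000100"  # toy example; derive from Chonkie's chunk starts
k = 2  # window size; often set to about half the mean true segment length
print(f"WindowDiff: {windowdiff(true_breaks, pred_breaks, k):.3f}, "
      f"Pk: {pk(true_breaks, pred_breaks, k):.3f}")  # lower is better

# 2) Semantic cohesion: mean similarity inside chunks (higher = coherent)
#    and dissimilarity across adjacent chunks (higher = cleaner separations).
def cohesion(chunks):
    vecs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    intra = []
    for c in chunks:
        sents = [s for s in c["text"].split(". ") if s.strip()]
        if len(sents) < 2:
            continue
        sv = model.encode(sents, normalize_embeddings=True)
        sims = sv @ sv.T  # pairwise cosine similarities (vectors are normalized)
        intra.append((sims.sum() - len(sents)) / (len(sents) ** 2 - len(sents)))
    boundary = [1 - float(vecs[i] @ vecs[i + 1]) for i in range(len(vecs) - 1)]
    return (float(np.mean(intra)) if intra else 0.0,
            float(np.mean(boundary)) if boundary else 0.0)

intra_sim, boundary_dissim = cohesion(semantic_chunks)
print(f"intra-chunk similarity: {intra_sim:.3f}, boundary dissimilarity: {boundary_dissim:.3f}")
```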

Here is a side-by-side comparison of recursive vs. semantic chunking on the same research paper. You’ll see how semantic chunking avoids cutting mid-sentence, respects section headers, and produces more coherent context for RAG queries.

Quality comparison between chunks created by Recursive Chunking and Semantic Chunking

At the top, you’ll see recursive chunking. Notice how:

  • Sentences are often cut in the middle.
  • Section boundaries (Abstract, Introduction, etc.) get ignored.
  • The resulting chunks are noisy and incomplete.

Below, you’ll see semantic chunking applied to Tensorlake’s structured Markdown. Here:

  • Boundaries align cleanly with sections and subsections.
  • Sentences stay intact.
  • Each chunk captures a full idea, making it easier for retrieval to surface the right context.

In short: bad chunks → bad retrieval. Good, semantic chunks → reliable RAG.

Faithful Context, Reliable RAG#

RAG systems don’t fail because embeddings are weak or context windows are too small. They fail because the input chunks are broken.

  • Tensorlake ensures documents are parsed into structured, faithful representations.
  • Chonkie ensures those representations are chunked into coherent retrieval units.

Together, they reduce hallucinations, increase retrieval precision, and build trust in AI systems.

This is context engineering in action: not just more context, but better context.

Try It Yourself#

Want to see Tensorlake + Chonkie in action?

Start building RAG pipelines that actually deliver faithful, retrieval-ready context.

Antaripa Saha

DevRel Engineer at Tensorlake

Machine Learning Engineer with 4 years of experience. Love building with AI, with interest in VLMs, search, and memory.
