
Table Recognition now parses ~1,500-cell tables (with structure preserved)

The new model is live. It reliably extracts very large, dense tables from PDFs (including scans) while preserving header hierarchy, row/column spans, and cell boundaries, with fast HTML/CSV export and per-cell bounding boxes (bbox) for citations.

Key Highlights

  • Robust on ~1,500-cell tables; resilient to complex layouts and scanned documents.
  • Preserves header hierarchy and row/column spans; faithful HTML outputs.
  • Improved cell boundary detection and multi-row/multi-col header parsing.
  • Per-cell bounding boxes (bbox) for page-level citations and overlays.
  • Faster long-table parsing with fewer fallback OCR passes.
  • Works out of the box via TableOutputMode/TableParsingFormat (no breaking changes).

What’s new

Our Table Recognition model got a significant upgrade. It now reliably parses very large, dense tables (e.g., ~1,500 cells in a single table) that typically break VLMs and most OCR pipelines. The sample we used internally is a healthcare report table (patient safety indicators in California) with multi-row headers and wide column spans.

Why it matters

  • Big, busy tables are common in regulatory, healthcare, and finance PDFs.
  • Losing structure leads to bad retrieval, broken joins, and wrong analytics.
  • This release preserves the cell grid, header hierarchy, and spans, so you can export clean HTML and keep bbox coordinates for citeable answers.
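To illustrate why preserved spans matter downstream, here is a minimal sketch that expands `rowspan`/`colspan` cells from an HTML table into a rectangular grid and writes it as CSV. It uses only the Python standard library. The helper names (`TableGridParser`, `expand_spans`, `to_csv`) and the sample values are made up for illustration and are not part of the product API; the sketch also assumes a single, non-nested table.

```python
import csv
import io
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collects the cells of one table as rows of (text, rowspan, colspan)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # (text parts, rowspan, colspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            rowspan = max(1, int(a.get("rowspan") or 1))
            colspan = max(1, int(a.get("colspan") or 1))
            self._cell = ([], rowspan, colspan)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())  # normalize whitespace
            self.rows[-1].append((text, rowspan, colspan))
            self._cell = None


def expand_spans(html):
    """Return a rectangular grid in which spanned cells are repeated."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def to_csv(grid):
    """Serialize the grid as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


# Made-up sample with one rowspan and one colspan in the header.
sample = """
<table>
  <tr><th rowspan="2">County</th><th colspan="2">Rate</th></tr>
  <tr><th>2021</th><th>2022</th></tr>
  <tr><td>A</td><td>1.2</td><td>1.4</td></tr>
</table>
"""
grid = expand_spans(sample)
print(to_csv(grid))
```

If the spans were lost, the second header row would shift left by one column and every value under "Rate" would line up with the wrong year, which is the kind of broken join the bullets above describe.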

Highlights

  • Better header detection (multi-row, multi-col headers)
  • Robust row/col span recovery on wide tables
  • Improved cell boundary accuracy on scans
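Once multi-row headers are recovered, a common next step is to collapse them into one name per column for CSV export or joins. The sketch below shows one way to do that, assuming the table is already available as a rectangular grid in which spanned header cells are repeated. The function name `flatten_headers`, the separator, and the sample values are illustrative assumptions, not part of the product.

```python
def flatten_headers(grid, n_header_rows, sep=" / "):
    """Join the header rows of a rectangular grid into one name per column.

    Returns (column_names, body_rows).
    """
    header_rows = grid[:n_header_rows]
    names = []
    for column in zip(*header_rows):
        levels = []
        for label in column:
            label = label.strip()
            # Skip blanks and labels repeated vertically by a row span.
            if label and (not levels or levels[-1] != label):
                levels.append(label)
        names.append(sep.join(levels))
    return names, grid[n_header_rows:]


# Made-up grid: "County" spans two header rows, "Rate" spans two columns.
grid = [
    ["County", "Rate", "Rate"],
    ["County", "2021", "2022"],
    ["A", "1.2", "1.4"],
]
columns, body = flatten_headers(grid, n_header_rows=2)
print(columns)  # ['County', 'Rate / 2021', 'Rate / 2022']
```

Keeping the parent label in each column name means the hierarchy survives even in flat formats such as CSV.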

How to use

This will just work out of the box when parsing any document with tables.

You can see an example in this Colab notebook.

page-classes.py
from io import StringIO

import pandas as pd

# DocumentAI and PageFragmentType come from the parsing SDK;
# import them as its documentation describes.
doc_ai = DocumentAI()

result = doc_ai.parse_and_wait(
    file="https://tlake.link/blog/dense-tables",
)

for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == PageFragmentType.TABLE:
            table = fragment.content.html
            # pandas.read_html can parse a single table string;
            # wrap it in StringIO so it is read as literal HTML
            df = pd.read_html(StringIO(table), flavor="lxml")[0].fillna("")
            print(f"Table found on page {page.page_number} at {fragment.bbox}:")
            print(df)

Tips

  • Keep tables atomic when chunking (don’t split a single table across chunks).
  • Attach page number + bbox to chunk metadata so you can render page previews in answers.
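Both tips can be sketched together as follows. This is a minimal illustration rather than a reference implementation: `Fragment` and `Chunk` are hypothetical stand-ins (not the SDK's types), the `max_chars` budget is an arbitrary assumption, and the bbox tuples are placeholder values whose format is not assumed to match the parser's output.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Hypothetical stand-in for a parsed page fragment."""
    kind: str          # "table" or "text"
    text: str          # table HTML or plain text
    page_number: int
    bbox: tuple        # passed through unchanged into chunk metadata


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_fragments(fragments, max_chars=1000):
    """Pack text fragments up to max_chars; emit each table as one unsplit chunk."""
    chunks, buffer = [], []

    def flush():
        # Turn the pending text fragments into a single chunk.
        if buffer:
            chunks.append(Chunk(
                text="\n".join(f.text for f in buffer),
                metadata={"sources": [
                    {"page": f.page_number, "bbox": f.bbox} for f in buffer
                ]},
            ))
            buffer.clear()

    for frag in fragments:
        if frag.kind == "table":
            flush()  # close pending text so the table stays in one piece
            chunks.append(Chunk(
                text=frag.text,
                metadata={
                    "type": "table",
                    "sources": [{"page": frag.page_number, "bbox": frag.bbox}],
                },
            ))
        else:
            pending = sum(len(f.text) for f in buffer)
            if buffer and pending + len(frag.text) > max_chars:
                flush()
            buffer.append(frag)
    flush()
    return chunks


# Placeholder fragments for illustration only.
fragments = [
    Fragment("text", "Intro paragraph.", 1, (0, 0, 100, 20)),
    Fragment("table", "<table><tr><td>1</td></tr></table>", 1, (0, 30, 100, 90)),
    Fragment("text", "Closing note.", 2, (0, 0, 100, 20)),
]
chunks = chunk_fragments(fragments)
for chunk in chunks:
    print(chunk.metadata)
```

A table larger than `max_chars` still becomes a single chunk here; that is deliberate, since splitting it would separate cells from their headers.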

Known limitations

  • Extremely degraded scans (low DPI, heavy skew) may still need pre-deskewing.
  • Rotated tables are supported, but nested tables inside footnotes may require a second pass.

Status

✅ Live now. No config changes required beyond TableOutputMode / TableParsingFormat.

We’d love reports

Send tricky tables (wide spans, nested headers, tiny fonts). They directly drive our next round of improvements.
