Table Recognition now parses ~1,500-cell tables (with structure preserved)

What’s new

Our Table Recognition model got a significant upgrade. It now reliably parses very large, dense tables (e.g., ~1,500 cells in a single table) that typically break vision-language models (VLMs) and most OCR pipelines. The sample we used internally is a healthcare report table (patient safety indicators in California) with multi-row headers and wide column spans.

Why it matters

Big, busy tables are common in regulatory, healthcare, and finance PDFs.
Losing structure → bad retrieval, broken joins, and wrong analytics.
This release preserves the cell grid, header hierarchy, and spans, so you can export clean HTML and keep bbox coordinates for citeable answers.
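
To illustrate why preserved spans matter downstream, here is a minimal sketch that expands `rowspan`/`colspan` attributes in a table's HTML into a rectangular grid, using only the Python standard library. The names `TableGridParser` and `expand_spans` are hypothetical helpers, not part of any SDK, and the sample table is made up for illustration.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect <td>/<th> cells row by row, keeping their rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of (text, rowspan, colspan)
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell[0] = self._cell[0].strip()
            self.rows[-1].append(tuple(self._cell))
            self._cell = None


def expand_spans(html):
    """Return a list of rows where every spanned slot repeats the cell text."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots already filled by a rowspan above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Made-up sample with a two-row header
sample = """
<table>
  <tr><th rowspan="2">Hospital</th><th colspan="2">Rate</th></tr>
  <tr><th>Observed</th><th>Expected</th></tr>
  <tr><td>A</td><td>1.2</td><td>1.0</td></tr>
</table>
"""
grid = expand_spans(sample)
for row in grid:
    print(row)
```

If the spans were lost during parsing, the second header row would shift left by one column and every value below it would be attributed to the wrong header.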

Highlights

Better header detection (multi-row, multi-col headers)
Robust row/col span recovery on wide tables
Improved cell boundary accuracy on scans

How to use

The upgrade applies automatically whenever you parse a document that contains tables.

You can see an example in this Colab notebook.

page-classes.py

from io import StringIO

import pandas as pd
# DocumentAI and PageFragmentType come from the SDK; import them as shown in its docs.

doc_ai = DocumentAI()

result = doc_ai.parse_and_wait(
  file="https://tlake.link/blog/dense-tables",
)

for page in result.pages:
  for fragment in page.page_fragments:
    if fragment.fragment_type == PageFragmentType.TABLE:
      table = fragment.content.html
      # pandas.read_html returns a list of DataFrames; this HTML holds one table
      df = pd.read_html(StringIO(str(table)), flavor="lxml")[0].fillna('')
      print(f"Table found on page {page.page_number} at {fragment.bbox}:")
      print(df)
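
For tables with multi-row headers, `pandas.read_html` typically returns multi-level columns, where each column label is a tuple with one entry per header row. The sketch below shows one possible way to flatten such tuples into single strings for export. `flatten_header` is a hypothetical helper, and the sample labels are made up for illustration.

```python
def flatten_header(levels, sep=" / "):
    """Join the levels of one multi-row header into a single column label.

    Skips empty levels, pandas-style "Unnamed: ..." placeholders, and
    consecutive repeats (a header cell spanning several rows repeats its text).
    """
    parts = []
    for level in levels:
        text = str(level).strip()
        if not text or text.startswith("Unnamed:"):
            continue
        if parts and parts[-1] == text:
            continue
        parts.append(text)
    return sep.join(parts)


# Column tuples shaped like those of a two-row header
columns = [("Hospital", "Hospital"), ("Rate", "Observed"), ("Rate", "Expected")]
flat = [flatten_header(col) for col in columns]
print(flat)
```

When `df.columns` is multi-level, the same helper can be applied with `df.columns = [flatten_header(c) for c in df.columns]`.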

Tips

Keep tables atomic when chunking (don’t split a single table across chunks).
Attach page number + bbox to chunk metadata so you can render page previews in answers.
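
Both tips can be sketched together as a small chunker. This is a minimal illustration under stated assumptions, not the SDK's chunking logic: `Fragment`, `Chunk`, and `chunk_fragments` are hypothetical stand-ins, and the bbox is assumed to be a plain 4-tuple.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Stand-in for a parsed page fragment (hypothetical, not the SDK type)."""
    kind: str         # "table" or "text"
    content: str      # HTML for tables, plain text otherwise
    page_number: int
    bbox: tuple       # assumed (x1, y1, x2, y2)


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_fragments(fragments, max_chars=500):
    """Group text fragments up to max_chars; emit each table as its own chunk."""
    chunks = []
    buffer = []  # consecutive text fragments waiting to be emitted

    def flush():
        if buffer:
            chunks.append(Chunk(
                text="\n".join(f.content for f in buffer),
                metadata={"type": "text",
                          "pages": sorted({f.page_number for f in buffer})},
            ))
            buffer.clear()

    for frag in fragments:
        if frag.kind == "table":
            flush()
            # A table is never split, however large; keep page + bbox for previews
            chunks.append(Chunk(
                text=frag.content,
                metadata={"type": "table",
                          "page_number": frag.page_number,
                          "bbox": frag.bbox},
            ))
        else:
            size = sum(len(f.content) for f in buffer)
            if buffer and size + len(frag.content) > max_chars:
                flush()
            buffer.append(frag)
    flush()
    return chunks


fragments = [
    Fragment("text", "Intro paragraph.", 1, (0, 0, 100, 20)),
    Fragment("table", "<table>" + "<tr><td>x</td></tr>" * 50 + "</table>", 1, (0, 30, 100, 400)),
    Fragment("text", "Footnote.", 2, (0, 0, 100, 20)),
]
chunks = chunk_fragments(fragments, max_chars=200)
for chunk in chunks:
    print(chunk.metadata, len(chunk.text))
```

The table chunk exceeds `max_chars` on purpose: the size limit applies only to text, so a table's rows and headers always stay together.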

Known limitations

Extremely degraded scans (low DPI, heavy skew) may still need pre-deskewing.
Rotated tables are supported, but nested tables inside footnotes may require a second pass.

Status

✅ Live now. No config changes required beyond TableOutputMode / TableParsingFormat.

We’d love reports

Send tricky tables (wide spans, nested headers, tiny fonts). They directly drive our next round of improvements.
