What’s new
Our Table Recognition model got a significant upgrade. It now reliably parses very large, dense tables (e.g., ~1,500 cells in a single table) that typically break VLMs and most OCR pipelines. The sample we used internally is a healthcare report table (patient safety indicators in California) with multi-row headers and wide column spans.
Why it matters
- Big, busy tables are common in regulatory, healthcare, and finance PDFs.
- Losing structure means bad retrieval, broken joins, and wrong analytics.
- This release preserves the cell grid, header hierarchy, and spans, so you can export clean HTML and keep bbox coordinates for citeable answers.
Highlights
- Better header detection (multi-row, multi-col headers)
- Robust row/col span recovery on wide tables
- Improved cell boundary accuracy on scans
How to use
This works out of the box when parsing any document that contains tables.
You can see an example in this Colab notebook.
from io import StringIO

import pandas as pd

# DocumentAI and PageFragmentType come from the document parsing SDK; import them from your client library.
doc_ai = DocumentAI()

result = doc_ai.parse_and_wait(
    file="https://tlake.link/blog/dense-tables",
)

for page in result.pages:
    for fragment in page.page_fragments:
        if fragment.fragment_type == PageFragmentType.TABLE:
            table = fragment.content.html
            # pandas.read_html parses a single-table HTML string into a DataFrame
            df = pd.read_html(StringIO(table), flavor="lxml")[0].fillna("")
            print(f"Table found on page {page.page_number} at {fragment.bbox}:")
            print(df)
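One note on the multi-row headers this release recovers: when the exported HTML has more than one header row, pandas.read_html typically returns the columns as a MultiIndex (exact behaviour varies by pandas version). The flatten_columns helper below is ours, not part of the SDK; it simply joins the header levels into flat column names:

import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Join MultiIndex column levels (from multi-row headers) into single flat names."""
    if isinstance(df.columns, pd.MultiIndex):
        df = df.copy()
        df.columns = [" / ".join(str(level) for level in col).strip(" /") for col in df.columns]
    return df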
Tips
- Keep tables atomic when chunking (don’t split a single table across chunks).
- Attach page number + bbox to chunk metadata so you can render page previews in answers (a minimal sketch follows this list).
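To make both tips concrete, here is a minimal sketch of a chunk record built from the page and fragment objects used in the snippet above; the table_to_chunk helper and its metadata schema are hypothetical, not part of the SDK:

def table_to_chunk(page, fragment) -> dict:
    """Build one retrieval chunk per table, keeping the whole table in a single chunk."""
    return {
        "text": fragment.content.html,  # the table stays atomic: never split across chunks
        "metadata": {
            "page_number": page.page_number,  # lets you render a page preview in answers
            "bbox": fragment.bbox,            # lets you highlight the cited region
            "type": "table",
        },
    }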
Known limitations
- Extremely degraded scans (low DPI, heavy skew) may still need pre-deskewing; one way to do that is sketched after this list.
- Rotated tables are supported, but nested tables inside footnotes may require a second pass.
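If you need to deskew a scan first, one option (our suggestion, not something the parser requires) is the open-source deskew package together with scikit-image:

import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.transform import rotate
from deskew import determine_skew

# Estimate the page's skew angle and rotate it upright before sending the scan for parsing.
image = io.imread("scan.png")
angle = determine_skew(rgb2gray(image))
rotated = rotate(image, angle, resize=True) * 255
io.imsave("scan_deskewed.png", rotated.astype(np.uint8))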
Status
✅ Live now. No config changes are required beyond the existing TableOutputMode / TableParsingFormat settings.
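For completeness, this is roughly where those settings plug in. The ParsingOptions wrapper, the parsing_options argument, and the enum member below are assumptions about the SDK surface rather than confirmed API, so check the client reference before copying:

# Hedged sketch: option and enum names assumed, not confirmed.
result = doc_ai.parse_and_wait(
    file="https://tlake.link/blog/dense-tables",
    parsing_options=ParsingOptions(
        table_output_mode=TableOutputMode.HTML,  # HTML output pairs with the pd.read_html flow above
    ),
)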
We’d love reports
Send tricky tables (wide spans, nested headers, tiny fonts). They directly drive our next round of improvements.