September 8, 2025

Document Ingestion now supports XML, DOC, and Markdown files

Document ingestion now supports XML, legacy DOC, and Markdown files with the same parsing capabilities as existing formats.

Key Highlights

Native XML parsing for config files and structured data exports
Legacy DOC file support for older document repositories
Markdown processing for documentation and technical specs

What's new

Document ingestion now handles three additional file formats: XML documents, legacy DOC files (Microsoft Word 97-2003), and native Markdown files. These join our existing support for PDF, DOCX, images, and other formats in a unified parsing pipeline.

Why it matters

XML files are everywhere in enterprise workflows (config files, data exports, structured documents)
DOC files still appear in legacy systems and older document repositories
Markdown files are standard for documentation, README files, and technical specs
Unified processing means fewer custom preprocessing steps in your pipeline

Highlights

Native parsing preserves document structure and metadata
Same extraction and classification capabilities as other formats
Automatic format detection - no manual format specification required
Full compatibility with structured extraction and summarization features

How to use

Works automatically when you upload any of these file types. No configuration changes needed.

```python
# Import path as used by the Tensorlake Python SDK; adjust if your installed version differs.
from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI()

# All of these now work seamlessly
xml_file_id = doc_ai.upload(path="/path/to/config.xml")
xml_result = doc_ai.parse_and_wait(xml_file_id)

doc_file_id = doc_ai.upload(path="/path/to/legacy_report.doc")
doc_result = doc_ai.parse_and_wait(doc_file_id)

md_file_id = doc_ai.upload(path="/path/to/README.md")
md_result = doc_ai.parse_and_wait(md_file_id)
```
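
Because every format goes through the same two calls, a mixed folder can be handled in one loop. The following is a minimal sketch, not part of the SDK: the helper names `find_new_format_files` and `parse_all` and the `NEW_EXTENSIONS` set are hypothetical, and the only client methods assumed are `upload(path=...)` and `parse_and_wait(file_id)` as shown above. The extension filter is purely a local convenience for picking files; format detection itself happens on the service side.

```python
from pathlib import Path

# Hypothetical local filter for the three formats added in this release.
# The service detects formats on its own; this only selects which files to send.
NEW_EXTENSIONS = {".xml", ".doc", ".md"}


def find_new_format_files(root):
    """Return sorted paths under root whose suffix is in NEW_EXTENSIONS."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in NEW_EXTENSIONS
    )


def parse_all(client, paths):
    """Upload and parse each path with the same two calls, keyed by path string."""
    results = {}
    for path in paths:
        # Assumes client exposes upload(path=...) and parse_and_wait(file_id).
        file_id = client.upload(path=str(path))
        results[str(path)] = client.parse_and_wait(file_id)
    return results
```

With a client created as above, `parse_all(doc_ai, find_new_format_files("/path/to/docs"))` would return one parse result per matching file. Passing the client in as an argument keeps the helper easy to exercise with a stub in tests.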

Status

✅ Live now. All existing parsing features work across the new formats.
