What's new
Document ingestion now handles three additional file formats: XML documents, legacy DOC files (Microsoft Word 97-2003), and native Markdown files. These join our existing support for PDF, DOCX, images, and other formats in a unified parsing pipeline.
Why it matters
- XML files are everywhere in enterprise workflows (config files, data exports, structured documents)
- Legacy DOC files still turn up in older systems and document repositories
- Markdown files are standard for documentation, README files, and technical specs
- Unified processing means fewer custom preprocessing steps in your pipeline
Highlights
- Native parsing preserves document structure and metadata
- Same extraction and classification capabilities as other formats
- Automatic format detection - no manual format specification required
- Full compatibility with structured extraction and summarization features
How to use
Works automatically when you upload any of these file types. No configuration changes needed.
doc_ai = DocumentAI()

# All of these now work seamlessly
xml_file_id = doc_ai.upload(path="/path/to/config.xml")
xml_result = doc_ai.parse_and_wait(xml_file_id)

doc_file_id = doc_ai.upload(path="/path/to/legacy_report.doc")
doc_result = doc_ai.parse_and_wait(doc_file_id)

md_file_id = doc_ai.upload(path="/path/to/README.md")
md_result = doc_ai.parse_and_wait(md_file_id)
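If you have a folder of mixed files, the same two calls can be wrapped in a small loop. The following is a minimal sketch, not part of the SDK: `parse_directory` and `NEW_FORMATS` are hypothetical names introduced here for illustration, and the only client behavior assumed is the `upload(path=...)` and `parse_and_wait(file_id)` calls shown above.

```python
from pathlib import Path
from typing import Any, Dict, Iterable

# Extensions for the three newly supported formats (illustrative constant,
# not something the SDK defines).
NEW_FORMATS = (".xml", ".doc", ".md")


def parse_directory(
    doc_ai: Any,
    root: str,
    extensions: Iterable[str] = NEW_FORMATS,
) -> Dict[str, Any]:
    """Upload and parse every file under `root` whose suffix is in `extensions`.

    `doc_ai` is assumed to expose upload(path=...) and parse_and_wait(file_id),
    as in the snippet above. Returns a mapping of file path to parse result.
    """
    wanted = {ext.lower() for ext in extensions}
    results: Dict[str, Any] = {}
    # Walk the tree in a stable order so repeated runs upload in the same sequence.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            file_id = doc_ai.upload(path=str(path))
            results[str(path)] = doc_ai.parse_and_wait(file_id)
    return results
```

With a client created as above, calling `parse_directory(doc_ai, "/path/to/docs")` would return one result per matching file. The extension filter is only a client-side convenience for choosing what to upload. Format detection itself still happens automatically on the service side.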
Status
✅ Live now. All existing parsing features work across the new formats.