TIKA-4770: Add a Markdown parser with structured, lossless XHTML output by krickert · Pull Request #2922 · apache/tika

krickert · 2026-07-02T01:38:40Z

Summary

.md files are already detected as text/markdown (globs in tika-mimetypes.xml), but no parser claims the type, so they fall through to TXTParser and come back as flat text — headings, tables, and code fences all collapse into an undifferentiated string.

This adds a MarkdownParser to tika-parser-text-module using commonmark-java — already a Tika dependency (it backs ToMarkdownContentHandler, TIKA-4730) — that parses the markdown AST and emits structured XHTML:

| Markdown | XHTML |
| --- | --- |
| `#`..`######` / setext | `h1`..`h6` |
| lists (incl. GFM tight lists) | `ul` / `ol` (with `start` when not 1) / `li` |
| fenced / indented code | `pre`/`code` with `class="language-x"` (+ `data-info` for any extra fence info) |
| GFM tables | `table`/`thead`/`tbody`/`tr`/`th`/`td` with `align` |
| emphasis / strong / GFM strikethrough | `em` / `strong` / `del` |
| links, images | `a href title`, `img src alt title` |
| block quotes, thematic breaks | `blockquote`, `hr` |
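
To make the fence-info row concrete, here is a minimal JDK-only sketch of how an info string could be split into `class` and `data-info`. It is not the PR's code: the class and method names are hypothetical, and it assumes `data-info` carries whatever follows the language token (the description could also be read as carrying the full info string).

```java
// Hypothetical illustration only; not the MarkdownParser implementation.
class FenceInfoSketch {

    /** Builds the attribute text for a code element from a fence info string. */
    static String codeAttributes(String info) {
        String trimmed = info == null ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return ""; // indented code or a bare fence: no attributes
        }
        // First whitespace-delimited token is treated as the language.
        String[] parts = trimmed.split("\\s+", 2);
        StringBuilder sb = new StringBuilder();
        sb.append("class=\"language-").append(escape(parts[0])).append('"');
        if (parts.length == 2) {
            // Assumption: the remainder is kept in data-info so nothing is lost.
            sb.append(" data-info=\"").append(escape(parts[1])).append('"');
        }
        return sb.toString();
    }

    /** Minimal attribute-value escaping. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.println(codeAttributes("java title=Foo.java"));
        System.out.println(codeAttributes("python"));
    }
}
```

Under those assumptions, a fence opened with `java title=Foo.java` keeps both the language and the trailing `title=Foo.java` text.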

Fidelity and safety

No content loss: every literal the commonmark AST carries reaches the output — including image alt text with code spans, ordered-list start numbers, and full code-fence info strings. Only markdown syntax presentation (bullet/fence/emphasis delimiter characters) is normalized, identical to commonmark's reference HtmlRenderer.
Raw HTML in the source is emitted as escaped text — preserved, but never injected into the XHTML stream.
Encoding detection via AutoDetectReader, the same idiom as TXTParser; detected charset lands in Content-Type/Content-Encoding.
Registered via @TikaComponent, same as the other text-module parsers. No MIME changes needed.
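
The raw-HTML point can be illustrated with plain JDK SAX. This is a sketch, not the parser's code: the class and method names are hypothetical, and it uses the JDK's identity `TransformerHandler` as a stand-in for whatever content handler Tika is given. The idea is that passing raw HTML through `characters()` lets the serializer escape it, so the markup survives as text but never becomes elements.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration only; not the MarkdownParser implementation.
class RawHtmlSketch {

    /** Serializes raw HTML as a text node inside a p element. */
    static String emitAsText(String rawHtml) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = rawHtml.toCharArray();
            // characters() means "this is text"; the serializer escapes markup.
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(emitAsText("<script>alert(1)</script>"));
    }
}
```

The script tag comes out with its angle brackets escaped, which is the "preserved, but never injected" behaviour described above.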

Because the emitted vocabulary matches what ToMarkdownContentHandler consumes, a markdown document round-trips markdown → XHTML → markdown (there's a test for it).

Relationship to other work

Independent of the gRPC Document-contract PR (#2921) — this is the input direction (.md files into Tika); that PR is the output direction. They share only the commonmark library.

Test plan

MarkdownParserTest: 9 tests — structure, GFM tables with alignment sections, raw-HTML escaping, ordered-list start numbers, code-span alt text, fence info preservation, charset detection with non-ASCII content, markdown round-trip
full tika-parser-text-module test suite green (no regressions in TXT/CSV parsers)
apache-rat:check green
CI

…output

Markdown files (text/markdown, already detected by glob in tika-mimetypes) previously fell through to TXTParser and came back as flat text. This adds a dedicated MarkdownParser using commonmark-java (the library already behind ToMarkdownContentHandler) that emits structured XHTML: h1-h6, ul/ol/li, blockquote, pre/code, GFM tables as table/thead/tbody/tr/th/td with alignment, em/strong/del, links, images, and hr.

Fidelity: every piece of content the commonmark AST carries is preserved -- text/code literals (raw HTML as escaped text, so nothing can be injected), link and image destinations and titles, image alt text including code spans, heading levels, table cell alignment and header cells, ordered-list start numbers (<ol start=...>), the full code-fence info string (class="language-x" plus data-info when the fence carries more than a language token), and hard/soft line breaks. Only markdown syntax presentation (bullet/fence/emphasis delimiter characters, ATX vs setext headings) is normalized, the same normalization as commonmark's reference HtmlRenderer.

Encoding is detected via AutoDetectReader, matching TXTParser; the detected charset lands in Content-Type and Content-Encoding.

tballison · 2026-07-02T11:14:11Z

LGTM. Some requests:

Please tersify comments
Try to use the existing RuntimeSAXException
Use TikaTest instead of handrolling toXhtml (please use the AutoDetectParser not the specific new parser...may need to pass in dummy file name to get detection to work?)

tballison · 2026-07-02T12:04:14Z

And, if your agent has time, may as well add handling for embedded script and data uris like we have in the html parser. We may want to move DataURIScheme/DataURISchemeUtil into tika-core or into a shared *-commons module, probably tika-core?

krickert · 2026-07-02T12:35:30Z

Agent pushed back rather dramatically. But humans aren't as sensitive, so I overrode the pushback. It'll be heavily reviewed - I'll reply when it's ready.

- Use the existing RuntimeSAXException instead of a bespoke wrapper.
- Tersify comments.
- Rewrite MarkdownParserTest on TikaTest, parsing through AutoDetectParser (dummy .md resource name for glob detection) with fixture files under test-documents; this also exercises component registration and routing.
- Extract data: URIs as embedded documents, as the html module does: image/link destinations are parsed directly, and raw HTML blocks/inline (e.g. script tags) are scraped with DataURISchemeUtil.extract. INLINE embedded resource type, gated by EmbeddedDocumentExtractor.
- Move DataURIScheme/DataURISchemeUtil/DataURISchemeParseException (and their test) from tika-parser-html-module to org.apache.tika.utils in tika-core so both parsers share them. tika-core has no commons-codec, so the base64 decode now uses java.util.Base64.getMimeDecoder(), which is equally lenient about whitespace/non-alphabet characters; truly malformed base64 now throws DataURISchemeParseException from parse() and is skipped by extract(), where commons-codec silently best-effort decoded.
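
For readers unfamiliar with the data: URI shape (`data:[<mediatype>][;base64],<data>`, per RFC 2397), here is a rough JDK-only sketch of parsing an image or link destination directly. It is not DataURISchemeUtil: every name below is hypothetical, only the base64 form is decoded, and percent-decoding of the non-base64 form is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration only; not Tika's DataURIScheme classes.
class DataUriSketch {

    /** Result of parsing a data: URI. */
    static final class Parsed {
        final String mediaType;
        final boolean base64;
        final byte[] data;

        Parsed(String mediaType, boolean base64, byte[] data) {
            this.mediaType = mediaType;
            this.base64 = base64;
            this.data = data;
        }
    }

    static Parsed parse(String uri) {
        if (!uri.regionMatches(true, 0, "data:", 0, 5)) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' separator");
        }
        String[] header = uri.substring(5, comma).split(";");
        String payload = uri.substring(comma + 1);

        // RFC 2397 default media type when none is given.
        String mediaType = header[0].isEmpty() ? "text/plain" : header[0];
        boolean base64 = false;
        for (int i = 1; i < header.length; i++) {
            if (header[i].equalsIgnoreCase("base64")) {
                base64 = true;
            }
        }
        byte[] data = base64
                ? Base64.getMimeDecoder().decode(payload)
                // Simplification: percent-decoding is omitted in this sketch.
                : payload.getBytes(StandardCharsets.UTF_8);
        return new Parsed(mediaType, base64, data);
    }

    public static void main(String[] args) {
        Parsed p = parse("data:image/png;base64,aGVsbG8=");
        System.out.println(p.mediaType + " " + p.data.length + " bytes");
    }
}
```

In the PR the decoded bytes are then handed to EmbeddedDocumentExtractor as an INLINE embedded document; that step needs Tika on the classpath and is not shown here.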

krickert · 2026-07-02T12:56:30Z

All four done in 175b9e6:

Comments — tersified. Class javadoc is down to four lines.
RuntimeSAXException — swapped in for the bespoke wrapper; visitor throws it, parse() unwraps, same idiom as the MP4/ASM parsers.
Tests — rewritten on TikaTest, and every parse now goes through AUTO_DETECT_PARSER with a dummy .md resource name (glob-only detection, no magic for markdown). Fixtures live under test-documents/. Nice side effect: the tests now exercise component registration and MIME routing end to end, not just the parser class.
Data URIs / embedded scripts — (agent) took you up on the move: DataURIScheme, DataURISchemeUtil, and DataURISchemeParseException (plus their test) are now in org.apache.tika.utils in tika-core, and the html module imports them from there. MarkdownParser mirrors HtmlHandler: data: image/link destinations are parsed directly, raw HTML blocks/inline (script tags included) are scraped with extract(), and results flow through EmbeddedDocumentExtractor as INLINE embedded docs. There's a recursive-metadata test showing a markdown file with a data-URI image + a script-embedded data URI yielding three documents.
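
For anyone skimming, the "visitor throws it, parse() unwraps" idiom looks roughly like this. This is a JDK-only sketch: `UncheckedSax` is a hypothetical stand-in for Tika's RuntimeSAXException, and `emit`/`parse` are made-up names, not the parser's methods.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not the MarkdownParser implementation.
class UnwrapSketch {

    /** Stand-in for an unchecked wrapper such as Tika's RuntimeSAXException. */
    static final class UncheckedSax extends RuntimeException {
        UncheckedSax(SAXException cause) {
            super(cause);
        }
    }

    /** Pretend SAX emit that can fail with a checked exception. */
    static void emit(String text) throws SAXException {
        if (text.isEmpty()) {
            throw new SAXException("empty text node");
        }
    }

    /** The callback cannot throw checked exceptions, so wrap inside and unwrap outside. */
    static void parse(List<String> nodes) throws SAXException {
        try {
            nodes.forEach(node -> {
                try {
                    emit(node);
                } catch (SAXException e) {
                    throw new UncheckedSax(e); // tunnel through the visitor
                }
            });
        } catch (UncheckedSax e) {
            throw (SAXException) e.getCause(); // restore the checked exception
        }
    }

    public static void main(String[] args) {
        try {
            parse(Arrays.asList("a", ""));
        } catch (SAXException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Callers of `parse` see an ordinary SAXException and never learn that a wrapper was involved.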

One behavior change I'd like you to weigh in on:

tika-core doesn't have commons-codec, so the base64 decode now uses java.util.Base64.getMimeDecoder(). Same leniency for whitespace/newlines/backslash-continuations (existing tests pass unchanged), but truly malformed base64 now throws DataURISchemeParseException from parse() (both callers already catch it) and gets skipped by extract() - where commons-codec used to silently best-effort decode.
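
A small JDK-only demonstration of the decoder behaviour described above. The class and helper names are hypothetical, and how the PR maps the JDK's `IllegalArgumentException` onto DataURISchemeParseException is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of java.util.Base64.getMimeDecoder() behaviour.
class MimeDecoderSketch {

    static String decode(String payload) {
        return new String(Base64.getMimeDecoder().decode(payload), StandardCharsets.UTF_8);
    }

    /** True when the MIME decoder refuses the payload. */
    static boolean rejects(String payload) {
        try {
            Base64.getMimeDecoder().decode(payload);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Line breaks and other non-alphabet characters are skipped.
        System.out.println(decode("aGVs\r\nbG8="));   // hello
        System.out.println(decode("aGVs\\\nbG8="));   // hello (backslash continuation)
        // A single dangling character cannot form a byte, so the decoder throws.
        System.out.println(rejects("aGVsb"));          // true
    }
}
```

Note that in the last case the JDK throws without returning the three bytes it could already have produced from `aGVs`.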

I think failing loudly there is the way, do you agree?

CI should pass; I verified the tests locally: tika-core 4/4 on the moved test, html module 59/0, text module green, checkstyle + RAT clean on all three modules.

tballison · 2026-07-02T16:05:15Z

Ah, right commons-codec....

My memory is that commons-codec is more robust against noisy data than the jdk. Sometimes, we could get some bytes out before jdk would throw.

My claude just spent 4 tries arguing for commons-codec and then jdk and then commons-codec again.

On this try, claude agreed with my confirmation bias.

Commons-codec is never strictly worse for extraction and is sometimes much better; the JDK is never better and is sometimes much worse.

So, let's move this util to a small -commons module in tika-parsers-standard and rely on commons-codec there.

tballison · 2026-07-02T16:09:15Z

> Agent pushed back rather dramatically

try a different model?

What were the concerns?

Per review: extraction wants commons-codec's lenient base64 (salvage bytes from noisy data) rather than the JDK decoder's strictness, and tika-core has no commons-codec. So the DataURIScheme classes move out of tika-core into a new tika-parser-datauri-commons module (same pattern as digest/jdbc/mail/xmp/zip-commons), package org.apache.tika.parser.datauri, with the original commons-codec decode restored verbatim. The html and text modules depend on it (replacing their direct commons-codec deps, which are otherwise unused); listed in tika-bom.

krickert · 2026-07-02T17:19:02Z

> Agent pushed back rather dramatically try a different model?
>
> What were the concerns?

It wasn't dramatic, just minor, and probably because of me.

What changed

First:

The requested changes are done in 57bf60c

tika-parser-datauri-commons has the same shape as the digest/zip/mail-commons siblings.

Package org.apache.tika.parser.datauri, commons-codec decode restored verbatim. html and text modules depend on it (their direct commons-codec deps were otherwise unused, so those are gone); it's in tika-bom. Transitive into the standard package via both consumers, same as zip-commons.

The concern

My take is nearly always to err on the side of "fail fast". Claude probably caught onto this pattern.

AI response:

The strict-decode case was: silently best-effort-decoding malformed base64 hands downstream consumers garbage bytes labeled with a confident content-type, with no signal anything was wrong — in a validation context that's a bug factory. The counter (yours) is that Tika isn't a validation context: it's salvage. A truncated data URI in a crawled page still has a decodable prefix, and for extraction, partial bytes beat a clean exception every time. Once you frame it as "who is the consumer of the failure," extraction wins and commons-codec is the right tool. I'd still want strictness if this util ever guards an ingest boundary, but that's not what Tika is for — conceded.

My opinion - go either way. I'd vote strict because I've worked with a lot of bad URLs and none of the parsers are 100% right. There are always some strange URLs, and some have even given me security headaches in the past. But that can be solved with my own validator, so it's not a blocker.

tballison · 2026-07-02T17:34:48Z

"fail fast" makes sense in most circumstances. However, for parsing, my personal preference is to get as much as we possibly (reliably) can out of files. If we're able to get anything useful out of a byte[] even if truncated, we should try.
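
To illustrate the "get as much as we reliably can" idea, here is a JDK-only sketch of salvaging the decodable prefix of a noisy or truncated base64 payload. It is not commons-codec's algorithm and not what the PR ships (the PR restores the commons-codec decode); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of salvage-style decoding; not Tika or commons-codec code.
class SalvageBase64Sketch {

    /** Decodes whatever whole bytes can be recovered from the payload. */
    static byte[] salvage(String payload) {
        StringBuilder clean = new StringBuilder();
        for (int i = 0; i < payload.length(); i++) {
            char c = payload.charAt(i);
            if (c == '=') {
                break; // padding marks the end of the data
            }
            boolean inAlphabet = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '+' || c == '/';
            if (inAlphabet) {
                clean.append(c); // drop whitespace and other noise
            }
        }
        if (clean.length() % 4 == 1) {
            // A lone trailing character carries fewer than 8 bits; discard it.
            clean.setLength(clean.length() - 1);
        }
        // The basic decoder accepts input without padding.
        return Base64.getDecoder().decode(clean.toString());
    }

    public static void main(String[] args) {
        System.out.println(new String(salvage("aGVsb"), StandardCharsets.UTF_8));      // hel
        System.out.println(new String(salvage("aGVs\nbG8="), StandardCharsets.UTF_8)); // hello
    }
}
```

The truncated input `aGVsb` yields the three recoverable bytes instead of an exception, which is the trade-off being argued for here.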

krickert · 2026-07-02T17:52:32Z

On that note - anything else needed for a merge?

krickert force-pushed the markdown-parser branch from 7db1835 to b7e0f89 on July 2, 2026 02:28

tballison merged commit aca20dc into apache:main Jul 2, 2026
5 checks passed

tballison merged 3 commits into apache:main from ai-pipestream:markdown-parser
