TIKA-4770: Add a Markdown parser with structured, lossless XHTML output #2922
…output
Markdown files (text/markdown, already detected by glob in tika-mimetypes) previously fell through to TXTParser and came back as flat text. This adds a dedicated MarkdownParser using commonmark-java (the library already behind ToMarkdownContentHandler) that emits structured XHTML: h1-h6, ul/ol/li, blockquote, pre/code, GFM tables as table/thead/tbody/tr/th/td with alignment, em/strong/del, links, images, and hr.
Fidelity: every piece of content the commonmark AST carries is preserved -- text/code literals (raw HTML as escaped text, so nothing can be injected), link and image destinations and titles, image alt text including code spans, heading levels, table cell alignment and header cells, ordered-list start numbers (<ol start=...>), the full code-fence info string (class="language-x" plus data-info when the fence carries more than a language token), and hard/soft line breaks. Only markdown syntax presentation (bullet/fence/emphasis delimiter characters, ATX vs setext headings) is normalized, the same normalization as commonmark's reference HtmlRenderer.
Encoding is detected via AutoDetectReader, matching TXTParser; the detected charset lands in Content-Type and Content-Encoding.
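For readers skimming the thread, the fence-info rule described above can be sketched with the JDK alone. This is only an illustration under stated assumptions: `FenceInfoSketch` and `codeAttributes` are hypothetical names, not the PR's code, and the sketch assumes `data-info` carries the full info string (the description could also be read as "only the extra part").

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the MarkdownParser implementation.
final class FenceInfoSketch {

    /** Maps a code-fence info string to attributes for the emitted code element. */
    static Map<String, String> codeAttributes(String info) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String trimmed = info == null ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return attrs; // bare fence: no attributes at all
        }
        // The first whitespace-delimited token is treated as the language.
        String[] tokens = trimmed.split("\\s+", 2);
        attrs.put("class", "language-" + tokens[0]);
        if (tokens.length > 1) {
            // Assumption: keep the whole info string so nothing is lost.
            attrs.put("data-info", trimmed);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(codeAttributes("java"));
        System.out.println(codeAttributes("java title=\"Demo.java\""));
    }
}
```

A fence that names only a language yields a single `class` attribute; anything beyond the first token adds `data-info`, so the info string survives the conversion.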
|
LGTM. Some requests:
|
|
Agent pushed back rather dramatically. But humans aren't as sensitive, so I overrode the pushback. It'll be heavily reviewed - I'll reply when it's ready. |
- Use the existing RuntimeSAXException instead of a bespoke wrapper.
- Tersify comments.
- Rewrite MarkdownParserTest on TikaTest, parsing through AutoDetectParser (dummy .md resource name for glob detection) with fixture files under test-documents; this also exercises component registration and routing.
- Extract data: URIs as embedded documents, as the html module does: image/link destinations are parsed directly, and raw HTML blocks/inline (e.g. script tags) are scraped with DataURISchemeUtil.extract. INLINE embedded resource type, gated by EmbeddedDocumentExtractor.
- Move DataURIScheme/DataURISchemeUtil/DataURISchemeParseException (and their test) from tika-parser-html-module to org.apache.tika.utils in tika-core so both parsers share them. tika-core has no commons-codec, so the base64 decode now uses java.util.Base64.getMimeDecoder(), which is equally lenient about whitespace/non-alphabet characters; truly malformed base64 now throws DataURISchemeParseException from parse() and is skipped by extract(), where commons-codec silently best-effort decoded.
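To make the parse-throws / extract-skips split in this commit concrete, here is a minimal JDK-only sketch. `DataUriSketch`, `parse`, and `parseOrSkip` are hypothetical stand-ins and do not mirror the real DataURIScheme or DataURISchemeUtil signatures; percent-decoding of non-base64 payloads is deliberately omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical stand-in for illustration; not Tika's DataURIScheme API.
final class DataUriSketch {
    final String mediaType;
    final byte[] data;

    private DataUriSketch(String mediaType, byte[] data) {
        this.mediaType = mediaType;
        this.data = data;
    }

    /** Parses "data:[mediatype][;base64],payload"; throws IllegalArgumentException when malformed. */
    static DataUriSketch parse(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' separator");
        }
        String header = uri.substring("data:".length(), comma);
        String payload = uri.substring(comma + 1);
        boolean base64 = header.endsWith(";base64");
        String mediaType = base64
                ? header.substring(0, header.length() - ";base64".length())
                : header;
        if (mediaType.isEmpty()) {
            mediaType = "text/plain"; // RFC 2397 default media type
        }
        // The MIME decoder ignores characters outside the base64 alphabet,
        // but still throws on an incomplete final unit.
        byte[] data = base64
                ? Base64.getMimeDecoder().decode(payload)
                : payload.getBytes(StandardCharsets.UTF_8); // percent-decoding omitted
        return new DataUriSketch(mediaType, data);
    }

    /** Extraction-style wrapper: malformed URIs are skipped rather than propagated. */
    static Optional<DataUriSketch> parseOrSkip(String uri) {
        try {
            return Optional.of(parse(uri));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

With this shape, a caller that wants to fail loudly uses `parse`, while a scraper walking raw HTML uses `parseOrSkip` and simply moves on.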
|
All four done in 175b9e6:
One behavior change I'd like you to weigh in on:
I think failing loudly there is the way to go; do you agree? CI should work fine. I verified the tests locally: tika-core 4/4 on the moved test, html module 59/0, text module green, checkstyle + RAT clean on all three modules. |
|
Ah, right, commons-codec... My memory is that commons-codec is more robust against noisy data than the JDK. Sometimes we could get some bytes out where the JDK would throw. My Claude just spent 4 tries arguing for commons-codec, then the JDK, then commons-codec again. On this try, Claude agreed with my confirmation bias.
So, let's move this util to a small -commons module in tika-parsers-standard and rely on commons-codec there. |
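For reference, the JDK side of this comparison can be checked without any extra dependencies. The sketch below only exercises `java.util.Base64`; it says nothing about what commons-codec returns for the same inputs. `JdkBase64Behaviour` and `tryDecode` are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustration of JDK decoder behaviour only; commons-codec is not involved.
final class JdkBase64Behaviour {

    /** Decodes with the given decoder; returns null when the JDK rejects the input. */
    static String tryDecode(Base64.Decoder decoder, String input) {
        try {
            return new String(decoder.decode(input), StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String noisy = "aGVs\n bG8=";  // "hello" with a line break and a space inside
        String truncated = "aGVsb";    // a full 4-character unit plus one dangling character

        // Basic decoder: any character outside the alphabet is an error.
        System.out.println("basic/noisy:     " + tryDecode(Base64.getDecoder(), noisy));
        // MIME decoder: characters outside the alphabet are ignored.
        System.out.println("mime/noisy:      " + tryDecode(Base64.getMimeDecoder(), noisy));
        // Both decoders reject a final unit that cannot hold a full byte.
        System.out.println("basic/truncated: " + tryDecode(Base64.getDecoder(), truncated));
        System.out.println("mime/truncated:  " + tryDecode(Base64.getMimeDecoder(), truncated));
    }
}
```

The last case is the one under discussion: the JDK throws and returns nothing, even though the first three bytes were decodable.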
|
What were the concerns? |
Per review: extraction wants commons-codec's lenient base64 (salvage bytes from noisy data) rather than the JDK decoder's strictness, and tika-core has no commons-codec. So the DataURIScheme classes move out of tika-core into a new tika-parser-datauri-commons module (same pattern as digest/jdbc/mail/xmp/zip-commons), package org.apache.tika.parser.datauri, with the original commons-codec decode restored verbatim. The html and text modules depend on it (replacing their direct commons-codec deps, which are otherwise unused); listed in tika-bom.
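The "salvage bytes from noisy data" idea can be approximated on top of the JDK decoder, which may help when weighing the two options. This is a rough sketch of the concept only: `SalvageBase64Sketch` is a hypothetical name, and the rules below (drop non-alphabet characters, stop at padding, discard a lone trailing character) are assumptions for illustration, not a reproduction of commons-codec's decoding rules.

```java
import java.util.Base64;

// Hypothetical best-effort decoder; does not claim to match commons-codec.
final class SalvageBase64Sketch {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /** Returns whatever whole bytes can be recovered; never throws on noisy or truncated input. */
    static byte[] salvage(String input) {
        StringBuilder clean = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '=') {
                break; // padding marks the end of the data
            }
            if (ALPHABET.indexOf(c) >= 0) {
                clean.append(c); // keep alphabet characters, drop everything else
            }
        }
        // A single leftover character carries only 6 bits, less than one byte.
        if (clean.length() % 4 == 1) {
            clean.setLength(clean.length() - 1);
        }
        // The basic decoder accepts input without trailing padding.
        return Base64.getDecoder().decode(clean.toString());
    }
}
```

For the truncated input `aGVsb`, this returns the three bytes of `hel` instead of throwing, which is the trade-off being debated: partial output versus a clear failure signal.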
It wasn't dramatic, just minor, and probably because of me.
What changed
First: The requested changes are done in 57bf60c
Package
The concern
My take is nearly always to err on the side of "fail fast". Claude probably caught on to this pattern. AI response:
My opinion - go either way. I'd vote strict because I've worked with a lot of bad URLs and none of the parsers are 100% right. There are always some strange URLs, and some have even given me security headaches in the past. But that can be solved with my own validator - so not a blocker. |
|
"fail fast" makes sense in most circumstances. However, for parsing, my personal preference is to get as much as we possibly (reliably) can out of files. If we're able to get anything useful out of a byte[] even if truncated, we should try. |
|
On that note - anything else needed for a merge? |
Summary
.md files are already detected as text/markdown (globs in tika-mimetypes.xml), but no parser claims the type, so they fall through to TXTParser and come back as flat text: headings, tables, and code fences all collapse into an undifferentiated string.
This adds a MarkdownParser to tika-parser-text-module using commonmark-java, already a Tika dependency (it backs ToMarkdownContentHandler, TIKA-4730), that parses the markdown AST and emits structured XHTML:
- #..###### / setext → h1..h6
- ul / ol (with start when not 1) / li
- pre/code with class="language-x" (+ data-info for any extra fence info)
- table/thead/tbody/tr/th/td with align
- em / strong / del
- a href title, img src alt title
- blockquote, hr
Fidelity and safety
- Only markdown syntax presentation is normalized, the same normalization as commonmark's reference HtmlRenderer.
- Encoding is detected via AutoDetectReader, the same idiom as TXTParser; detected charset lands in Content-Type / Content-Encoding.
- Registered via @TikaComponent, same as the other text-module parsers. No MIME changes needed.
Because the emitted vocabulary matches what ToMarkdownContentHandler consumes, a markdown document round-trips markdown → XHTML → markdown (there's a test for it).
Relationship to other work
Independent of the gRPC Document-contract PR (#2921): this is the input direction (.md files into Tika); that PR is the output direction. They share only the commonmark library.
Test plan
- MarkdownParserTest: 9 tests: structure, GFM tables with alignment sections, raw-HTML escaping, ordered-list start numbers, code-span alt text, fence info preservation, charset detection with non-ASCII content, markdown round-trip
- tika-parser-text-module test suite green (no regressions in TXT/CSV parsers)
- apache-rat:check green
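The raw-HTML escaping item in the test plan checks a simple property: markup in the source must come out as text, never as live elements. A JDK-only sketch of that property follows. `EscapeSketch` and `escapeXml` are hypothetical names; in a SAX-based parser the serializer would normally perform this escaping when character data is written, so this is not a claim about how MarkdownParser does it.

```java
// Hypothetical illustration of the escaping property; not the parser's code.
final class EscapeSketch {

    /** Escapes the characters that could otherwise be read as XML markup. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A raw HTML block from a markdown file ends up as inert text.
        System.out.println(escapeXml("<script>alert(1)</script>"));
    }
}
```

A test along these lines would assert that the serialized output contains the escaped form and no literal `<script` substring.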