Convert saved HTML articles and exported webpages into clean, portable Markdown — preserving editorial structure, metadata, code blocks, tables, figures, captions, and local images.
It ships in three forms: a Claude Code skill (/html-to-markdown), a standalone CLI
(html-to-markdown), and a reusable Python library (html_to_markdown).
"Preserving formatting" here means preserving the semantic structure that Markdown can represent (headings, lists, quotes, code, tables, figures) — not reproducing the page's CSS layout pixel-by-pixel.
You save an article as a "complete webpage" (HTML + a _files/ assets folder). The HTML is full of
site chrome — navigation, share buttons, theme toggles, table-of-contents widgets, copy buttons,
icon SVGs — wrapped around the actual article. Pasting that into a notes app gives you a mess.
html-to-markdown finds the real editorial content, drops the chrome, and emits a single Markdown
note with valid YAML frontmatter and a local assets/ folder whose images render in your vault.
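
As a rough illustration of finding the editorial content first, the sketch below tries `<article>` and then `<main>` / `[role=main]`, using only the Python standard library. It is not the tool's implementation: the names `_TierCollector` and `pick_content` are hypothetical, it assumes reasonably balanced markup, and it leaves out the later tiers (semantic selectors, text-density fallback).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _TierCollector(HTMLParser):
    """Collect the text inside the first element that matches each tier."""

    def __init__(self):
        super().__init__()
        self.text = {"article": [], "main": []}
        self._open = {}     # tier -> current nesting depth inside its element
        self._done = set()  # tiers whose first element has already closed

    @staticmethod
    def _tier_for(tag, attrs):
        if tag == "article":
            return "article"
        if tag == "main" or dict(attrs).get("role") == "main":
            return "main"
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        for tier in self._open:
            self._open[tier] += 1
        tier = self._tier_for(tag, attrs)
        if tier and tier not in self._open and tier not in self._done:
            self._open[tier] = 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        for tier in list(self._open):
            self._open[tier] -= 1
            if self._open[tier] == 0:
                del self._open[tier]
                self._done.add(tier)

    def handle_data(self, data):
        for tier in self._open:
            self.text[tier].append(data)


def pick_content(html):
    """Return (tier, text) for the highest-priority tier that has any text."""
    collector = _TierCollector()
    collector.feed(html)
    collector.close()
    for tier in ("article", "main"):
        text = " ".join(" ".join(collector.text[tier]).split())
        if text:
            return tier, text
    return None, ""
```

Trying the tiers in a fixed priority order keeps the behaviour generic: navigation and footer text outside the chosen element never reaches the result.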
- Layered content extraction: `<article>` → `<main>` / `[role=main]` → semantic selectors → text-density fallback. Generic, not hardcoded to any single site.
- Site-chrome removal: nav, header, footer, share/social, TOC, ads, comments, related, `data-nosnippet`, hidden elements, and icon-only SVGs.
- Metadata in confidence order: JSON-LD → Open Graph → Twitter Cards → `<meta>` → `<time datetime>` → header heuristics. Never invents missing values.
- Faithful Markdown: headings, bold/italic, links, nested lists, blockquotes, inline code, fenced code blocks with language detection, GFM tables, figures + captions, horizontal rules.
- Smart code blocks: recovers byte-accurate source from copy-button `data-code` attributes, de-duplicates light/dark (shiki) twins, and picks a fence longer than any backtick run inside.
- Image handling: `srcset` (highest res), lazy-load attributes, `data:` URIs, local copy with content-hash de-duplication, and opt-in, SSRF-guarded remote download.
- Two image styles: wikilink embeds `![[assets/x.jpg|alt]]` or standard.
- Offline & deterministic by default: no network, no JavaScript execution, idempotent output.
- Diagnostic report: text or JSON, with selector used, metadata found, image/link counts, dropped elements, and warnings.
- It preserves semantics, not visual layout. Complex multi-column CSS becomes linear Markdown.
- Content embedded only via JavaScript at runtime won't be present in a saved static HTML file.
- Highly bespoke widgets (interactive charts, embedded apps) are dropped, not reconstructed.
- Heuristic extraction can occasionally mis-rank an unusual layout; use `--report` to inspect.
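
The "fence longer than any backtick run inside" rule from the feature list can be sketched in a few lines. This is a minimal illustration, not the tool's own code; `fence_for` and `fenced_block` are hypothetical names.

```python
import re


def fence_for(code, minimum=3):
    """Return a backtick fence one longer than the longest backtick run in code."""
    longest = max((len(run) for run in re.findall(r"`+", code)), default=0)
    return "`" * max(minimum, longest + 1)


def fenced_block(code, language=""):
    """Wrap code in a fence that cannot be closed early by its own content."""
    fence = fence_for(code)
    return f"{fence}{language}\n{code}\n{fence}"
```

A snippet that itself contains a three-backtick fence gets a four-backtick fence, so Markdown examples nested inside code blocks survive conversion intact.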
Copy the skill into your Claude Code skills directory:
cp -r skills/html-to-markdown ~/.claude/skills/

Then invoke it:
/html-to-markdown "saved-article.html"
/html-to-markdown "saved-article.html" "./My Vault/Articles"
This repo includes .claude-plugin/plugin.json. Point your plugin configuration at the repo root,
or install it through your plugin manager of choice.
pip install html-to-markdown # from PyPI (once published)
# or, from a checkout:
pip install .

This installs the html-to-markdown command. The CLI runs fully independently of Claude Code.
Basic:
html-to-markdown "saved-article.html"

Choose an output vault folder:
html-to-markdown "saved-article.html" \
--output-dir "$HOME/Documents/My Vault/Articles"

Full example:
html-to-markdown input.html \
--output-dir "./My Vault/Articles" \
--image-style wikilink \
--assets-dir "assets" \
--include-toc \
--report

From Claude Code:
/html-to-markdown "saved-article.html" "./My Vault/Articles"
| Option | Description |
|---|---|
| `input` | Path to the input `.html` file (positional) |
| `--output-dir PATH` | Where to write the note + assets (default: `<input dir>/output`) |
| `--output-name NAME` | Base name for the note (default: derived from title) |
| `--assets-dir NAME` | Assets subdirectory name (default: `assets`) |
| `--image-style wikilink\|markdown` | Embed style (default: `wikilink`) |
| `--frontmatter` / `--no-frontmatter` | Emit YAML frontmatter (default on) |
| `--include-toc` / `--no-toc` | Prepend a table of contents (default off) |
| `--keep-safe-html` | Allow safe inline HTML (`sup`/`sub`/`details`) where Markdown can't represent it |
| `--download-remote-images` | Download remote images (off by default; SSRF-guarded) |
| `--overwrite` | Overwrite existing output files |
| `--dry-run` | Do everything except write files |
| `--report` | Write a `.conversion.json` report next to the note |
| `--report-format text\|json` | Format of the report printed to stdout |
| `--verbose` | Verbose logging |
| `--version` | Print version and exit |
Exit codes: 0 success · 1 input error · 2 output error · 3 security refusal · 4 unexpected.
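
When calling the CLI from a script, the exit codes above can be mapped back to their meanings. The wrapper below is a sketch that assumes the `html-to-markdown` command is on `PATH`; `describe_exit` and `convert` are hypothetical helper names, not part of the package.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "input error",
    2: "output error",
    3: "security refusal",
    4: "unexpected error",
}


def describe_exit(code):
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def convert(path, *extra_args):
    """Run the CLI on one file and return (exit_code, meaning)."""
    result = subprocess.run(["html-to-markdown", path, *extra_args])
    return result.returncode, describe_exit(result.returncode)
```

A batch script could, for example, retry with `--overwrite` on an output error but stop immediately on a security refusal.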
For an input named My Article.html:
output/
├── My Article.md
├── assets/
│ ├── cover-image.jpg
│ └── diagram.png
└── My Article.conversion.json # only with --report
- Default image embeds use the `![[path|alt]]` wikilink syntax for local assets.
- Frontmatter is valid YAML and uses vault-friendly keys (`title`, `tags`, `source`, ...).
- Table-of-contents links use `[[#Heading]]` so they resolve inside the note.
- Use `--image-style markdown` if you prefer portable CommonMark image syntax.
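
To make the difference between the two embed styles concrete, here is a small sketch that renders one image reference either way. `image_embed` is a hypothetical helper, and wrapping paths that contain spaces in angle brackets is an assumption made to keep the CommonMark form valid, not documented behaviour of the tool.

```python
def image_embed(path, alt="", style="wikilink"):
    """Render one image reference in the requested embed style."""
    if style == "wikilink":
        return f"![[{path}|{alt}]]" if alt else f"![[{path}]]"
    if style == "markdown":
        # Angle brackets keep a destination containing spaces valid in CommonMark.
        target = f"<{path}>" if " " in path else path
        return f"![{alt}]({target})"
    raise ValueError(f"unknown image style: {style!r}")
```

For `assets/x.jpg` with alt text `alt`, the wikilink form is `![[assets/x.jpg|alt]]` and the Markdown form is `![alt](assets/x.jpg)`.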
Remote images are never downloaded unless you pass --download-remote-images. When enabled,
each URL must be http(s), is DNS-resolved, and is rejected if it points at loopback, link-local,
private, reserved, multicast, or unspecified addresses (SSRF defense). Downloads are size-, redirect-,
and timeout-limited, and never carry cookies or credentials. A failed image becomes a warning — it
never aborts the conversion; the original URL is preserved in the note when it is safe to do so.
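
The address screening described above maps closely onto Python's standard `ipaddress` module. The following is a minimal sketch under that assumption, not the tool's actual guard; `url_scheme_allowed`, `is_public_address`, and `host_is_safe` are hypothetical names, and size, redirect, and timeout limits are left out.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def url_scheme_allowed(url):
    """Only http and https URLs are candidates for download."""
    return urlsplit(url).scheme in ("http", "https")


def is_public_address(ip_text):
    """Reject loopback, link-local, private, reserved, multicast, unspecified."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (ip.is_loopback or ip.is_link_local or ip.is_private
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def host_is_safe(host):
    """Resolve host and require every resolved address to be public."""
    infos = socket.getaddrinfo(host, None)
    # Drop any IPv6 scope id ("fe80::1%eth0") before parsing the address.
    return all(is_public_address(info[4][0].split("%")[0]) for info in infos)
```

Checking every resolved address, rather than only the first, avoids accepting a hostname that resolves to both a public and an internal IP.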
See SECURITY.md. The tool defends against hostile HTML, path traversal, SSRF, oversized inputs/images, active SVG, decompression bombs, symlink tricks, and unwanted writes.
Everything runs locally. By default there is zero network activity and no JavaScript from the page is ever executed. The only outbound traffic possible is opt-in remote image download.
See skills/html-to-markdown/references/troubleshooting.md.
Common cases: wrong content block extracted (--report shows the selector used), images not copied
(check the _files/ folder sits next to the HTML), output not written (no-overwrite default — use
--overwrite or --output-name).
| Package | Why |
|---|---|
| `beautifulsoup4` | Tolerant HTML tree parsing and navigation |
| `lxml` | Fast, robust parser backend for imperfect saved pages |
Remote fetching uses the Python standard library (urllib) so SSRF screening can inspect resolved
IPs directly. No other runtime dependencies.
- Optional EPUB / `.webarchive` input adapters
- Configurable per-site extraction profiles (opt-in, still generic core)
- Wikilink-style internal cross-references between converted notes
See CONTRIBUTING.md. Please add a synthetic fixture for any new structural case; never commit copyrighted article bodies or their images.
🇧🇷 Portuguese version: README.pt-BR.md