mrbitsdcf/html-to-markdown-skill

HTML TO MARKDOWN Skill and CLI

Convert saved HTML articles and exported webpages into clean, portable Markdown — preserving editorial structure, metadata, code blocks, tables, figures, captions, and local images.

It can be used three ways: as a Claude Code skill (/html-to-markdown), a standalone CLI (html-to-markdown), or a reusable Python library (html_to_markdown).

"Preserving formatting" here means preserving the semantic structure that Markdown can represent (headings, lists, quotes, code, tables, figures) — not reproducing the page's CSS layout pixel-by-pixel.


The problem it solves

You save an article as a "complete webpage" (HTML + a _files/ assets folder). The HTML is full of site chrome — navigation, share buttons, theme toggles, table-of-contents widgets, copy buttons, icon SVGs — wrapped around the actual article. Pasting that into a notes app gives you a mess.

html-to-markdown finds the real editorial content, drops the chrome, and emits a single Markdown note with valid YAML frontmatter and a local assets/ folder whose images render in your vault.

Features

  • Layered content extraction: <article> → <main>/[role=main] → semantic selectors → text-density fallback. Generic, not hardcoded to any single site.
  • Site-chrome removal — nav, header, footer, share/social, TOC, ads, comments, related, data-nosnippet, hidden elements, and icon-only SVGs.
  • Metadata in confidence order: JSON-LD → Open Graph → Twitter Cards → <meta> → <time datetime> → header heuristics. Never invents missing values.
  • Faithful Markdown — headings, bold/italic, links, nested lists, blockquotes, inline code, fenced code blocks with language detection, GFM tables, figures + captions, horizontal rules.
  • Smart code blocks — recovers byte-accurate source from copy-button data-code attributes, de-duplicates light/dark (shiki) twins, and picks a fence longer than any backtick run inside.
  • Image handling: srcset (highest res), lazy-load attributes, data: URIs, local copy with content-hash de-duplication, and opt-in, SSRF-guarded remote download.
  • Two image styles — wikilink embeds ![[assets/x.jpg|alt]] or standard ![alt](assets/x.jpg).
  • Offline & deterministic by default — no network, no JavaScript execution, idempotent output.
  • Diagnostic report — text or JSON, with selector used, metadata found, image/link counts, dropped elements, and warnings.
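
The fence rule from the "Smart code blocks" feature (pick a fence longer than any backtick run inside the code) can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the names pick_fence and fenced_block are hypothetical, and the three-backtick minimum is an assumption taken from CommonMark rather than from this README.

```python
import re


def pick_fence(code: str, minimum: int = 3) -> str:
    """Return a backtick fence longer than any backtick run inside `code`.

    Hypothetical helper: the name and the `minimum` parameter are
    assumptions for illustration only.
    """
    runs = re.findall(r"`+", code)
    longest = max((len(run) for run in runs), default=0)
    return "`" * max(minimum, longest + 1)


def fenced_block(code: str, language: str = "") -> str:
    """Wrap `code` in a fence that the code itself cannot close early."""
    fence = pick_fence(code)
    body = code if code.endswith("\n") else code + "\n"
    return f"{fence}{language}\n{body}{fence}\n"
```

Counting every backtick run, not only runs at the start of a line, is stricter than CommonMark requires, but it keeps the rule simple and always produces a fence that the content cannot terminate.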

Limitations

  • It preserves semantics, not visual layout. Complex multi-column CSS becomes linear Markdown.
  • Content embedded only via JavaScript at runtime won't be present in a saved static HTML file.
  • Highly bespoke widgets (interactive charts, embedded apps) are dropped, not reconstructed.
  • Heuristic extraction can occasionally mis-rank an unusual layout; use --report to inspect.

Installation

As a Claude Code skill

Copy the skill into your Claude Code skills directory:

cp -r skills/html-to-markdown ~/.claude/skills/

Then invoke it:

/html-to-markdown "saved-article.html"
/html-to-markdown "saved-article.html" "./My Vault/Articles"

As a Claude Code plugin

This repo includes .claude-plugin/plugin.json. Point your plugin configuration at the repo root, or install it through your plugin manager of choice.

As a CLI

pip install html-to-markdown          # from PyPI (once published)
# or, from a checkout:
pip install .

This installs the html-to-markdown command. The CLI runs fully independently of Claude Code.


Usage

Basic:

html-to-markdown "saved-article.html"

Choose an output vault folder:

html-to-markdown "saved-article.html" \
  --output-dir "$HOME/Documents/My Vault/Articles"

Full example:

html-to-markdown input.html \
  --output-dir "./My Vault/Articles" \
  --image-style wikilink \
  --assets-dir "assets" \
  --include-toc \
  --report

From Claude Code:

/html-to-markdown "saved-article.html" "./My Vault/Articles"

Options

Option                            Description
input                             Path to the input .html file (positional)
--output-dir PATH                 Where to write the note + assets (default: <input dir>/output)
--output-name NAME                Base name for the note (default: derived from title)
--assets-dir NAME                 Assets subdirectory name (default: assets)
--image-style wikilink|markdown   Embed style (default: markdown)
--frontmatter / --no-frontmatter  Emit YAML frontmatter (default on)
--include-toc / --no-toc          Prepend a table of contents (default off)
--keep-safe-html                  Allow safe inline HTML (sup/sub/details) where Markdown can't represent it
--download-remote-images          Download remote images (off by default; SSRF-guarded)
--overwrite                       Overwrite existing output files
--dry-run                         Do everything except write files
--report                          Write a .conversion.json report next to the note
--report-format text|json         Format of the report printed to stdout
--verbose                         Verbose logging
--version                         Print version and exit

Exit codes: 0 success · 1 input error · 2 output error · 3 security refusal · 4 unexpected.
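
When the CLI is driven from a script, the exit codes above can be mapped to readable names. The sketch below assumes the html-to-markdown command is on PATH; the ExitCode enum, its member names, and the describe_exit_code and convert helpers are hypothetical and not part of the package. Only the numeric values and their meanings come from the list above.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes listed in this README (member names are illustrative)."""
    SUCCESS = 0
    INPUT_ERROR = 1
    OUTPUT_ERROR = 2
    SECURITY_REFUSAL = 3
    UNEXPECTED = 4


def describe_exit_code(code: int) -> str:
    """Translate a numeric exit code into a short human-readable label."""
    try:
        return ExitCode(code).name.lower().replace("_", " ")
    except ValueError:
        return f"unknown exit code {code}"


def convert(html_path: str, output_dir: str) -> int:
    """Run the CLI once and return its raw exit code (assumes it is installed)."""
    result = subprocess.run(
        ["html-to-markdown", html_path, "--output-dir", output_dir],
        check=False,  # inspect the code ourselves instead of raising
    )
    return result.returncode
```

A batch script could call convert() per file and log describe_exit_code(code) for anything non-zero, treating code 3 (security refusal) differently from a plain input error.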

Output structure

For an input named My Article.html:

output/
├── My Article.md
├── assets/
│   ├── cover-image.jpg
│   └── diagram.png
└── My Article.conversion.json   # only with --report

Wikilink & Markdown compatibility

  • Default image embeds use the ![[path|alt]] wikilink syntax for local assets.
  • Frontmatter is valid YAML and uses vault-friendly keys (title, tags, source, ...).
  • Table-of-contents links use [[#Heading]] so they resolve inside the note.
  • Use --image-style markdown if you prefer portable CommonMark image syntax.
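
The two embed styles differ only in how the path and alt text are arranged. The following sketch shows the shapes involved; format_image is a hypothetical helper, and percent-encoding the path in the markdown style is an assumption made here so that paths with spaces stay valid, not documented behavior of the tool.

```python
from urllib.parse import quote


def format_image(path: str, alt: str = "", style: str = "markdown") -> str:
    """Render an image embed in one of the two styles named in this README.

    Hypothetical helper for illustration; the real converter may differ.
    """
    if style == "wikilink":
        # Wikilink embeds take the raw path; alt text follows a pipe.
        return f"![[{path}|{alt}]]" if alt else f"![[{path}]]"
    if style == "markdown":
        # Assumption: percent-encode the path so spaces do not break the link.
        return f"![{alt}]({quote(path, safe='/')})"
    raise ValueError(f"unknown image style: {style!r}")
```

For example, format_image("assets/x.jpg", "alt", "wikilink") yields ![[assets/x.jpg|alt]], and the same call with style "markdown" yields ![alt](assets/x.jpg).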

Remote images policy

Remote images are never downloaded unless you pass --download-remote-images. When enabled, each URL must be http(s), is DNS-resolved, and is rejected if it points at loopback, link-local, private, reserved, multicast, or unspecified addresses (SSRF defense). Downloads are size-, redirect-, and timeout-limited, and never carry cookies or credentials. A failed image becomes a warning — it never aborts the conversion; the original URL is preserved in the note when it is safe to do so.
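
The address screening described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the tool's code: is_public_ip and screen_url are hypothetical names, and the real implementation may apply additional checks (size, redirect, and timeout limits are not shown).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return False for the address classes the policy above rejects."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_loopback
        or ip.is_link_local
        or ip.is_private
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def screen_url(url: str) -> list[str]:
    """Resolve `url` and return its addresses, or raise ValueError if unsafe."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"refusing non-http(s) URL: {url}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [addr for addr in addresses if not is_public_ip(addr)]
    if blocked:
        raise ValueError(f"refusing {url}: resolves to {blocked}")
    return addresses
```

A screen like this is only effective if the download then connects to one of the screened addresses; resolving the hostname a second time at fetch time would reopen the door to DNS rebinding.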

Security

See SECURITY.md. The tool defends against hostile HTML, path traversal, SSRF, oversized inputs/images, active SVG, decompression bombs, symlink tricks, and unwanted writes.
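
As an illustration of one defense in that list, a path-traversal guard can be written with pathlib. The sketch below is an assumption about the general technique, not a copy of the project's code (SECURITY.md is the authoritative description); safe_join is a hypothetical name.

```python
from pathlib import Path


def safe_join(base_dir: Path, untrusted_name: str) -> Path:
    """Join an untrusted relative name onto base_dir, refusing any escape.

    Hypothetical helper: resolves both paths (following existing symlinks)
    and checks that the result still lies inside base_dir.
    """
    base = base_dir.resolve()
    candidate = (base / untrusted_name).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes output directory: {untrusted_name!r}")
    return candidate
```

Because the check runs on resolved paths, names such as ../escape.png and absolute names such as /etc/passwd are both rejected, as is a name that passes through an existing symlink pointing outside the base directory.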

Privacy

Everything runs locally. By default there is zero network activity and no JavaScript from the page is ever executed. The only outbound traffic possible is opt-in remote image download.

Troubleshooting

See skills/html-to-markdown/references/troubleshooting.md. Common cases: wrong content block extracted (--report shows the selector used), images not copied (check the _files/ folder sits next to the HTML), output not written (no-overwrite default — use --overwrite or --output-name).

Dependencies

Package         Why
beautifulsoup4  Tolerant HTML tree parsing and navigation
lxml            Fast, robust parser backend for imperfect saved pages

Remote fetching uses the Python standard library (urllib) so SSRF screening can inspect resolved IPs directly. No other runtime dependencies.

Roadmap

  • Optional EPUB / .webarchive input adapters
  • Configurable per-site extraction profiles (opt-in, still generic core)
  • Wikilink-style internal cross-references between converted notes

Contributing

See CONTRIBUTING.md. Please add a synthetic fixture for any new structural case; never commit copyrighted article bodies or their images.

License

Apache License 2.0.


🇧🇷 Portuguese version: README.pt-BR.md
