HTML TO MARKDOWN Skill and CLI

Convert saved HTML articles and exported webpages into clean, portable Markdown — preserving editorial structure, metadata, code blocks, tables, figures, captions, and local images.

It ships in three forms: a Claude Code skill (/html-to-markdown), a standalone CLI (html-to-markdown), and a reusable Python library (html_to_markdown).

"Preserving formatting" here means preserving the semantic structure that Markdown can represent (headings, lists, quotes, code, tables, figures) — not reproducing the page's CSS layout pixel-by-pixel.

The problem it solves

You save an article as a "complete webpage" (HTML + a _files/ assets folder). The HTML is full of site chrome — navigation, share buttons, theme toggles, table-of-contents widgets, copy buttons, icon SVGs — wrapped around the actual article. Pasting that into a notes app gives you a mess.

html-to-markdown finds the real editorial content, drops the chrome, and emits a single Markdown note with valid YAML frontmatter and a local assets/ folder whose images render in your vault.

Features

Layered content extraction — <article> → <main>/[role=main] → semantic selectors → text-density fallback. Generic, not hardcoded to any single site.
Site-chrome removal — nav, header, footer, share/social, TOC, ads, comments, related, data-nosnippet, hidden elements, and icon-only SVGs.
Metadata in confidence order — JSON-LD → Open Graph → Twitter Cards → <meta> → <time datetime> → header heuristics. Never invents missing values.
Faithful Markdown — headings, bold/italic, links, nested lists, blockquotes, inline code, fenced code blocks with language detection, GFM tables, figures + captions, horizontal rules.
Smart code blocks — recovers byte-accurate source from copy-button data-code attributes, de-duplicates light/dark (shiki) twins, and picks a fence longer than any backtick run inside.
Image handling — srcset (highest res), lazy-load attributes, data: URIs, local copy with content-hash de-duplication, and opt-in, SSRF-guarded remote download.
Two image styles — wikilink embeds ![[assets/x.jpg|alt]] or standard ![alt](assets/x.jpg).
Offline & deterministic by default — no network, no JavaScript execution, idempotent output.
Diagnostic report — text or JSON, with selector used, metadata found, image/link counts, dropped elements, and warnings.
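
As an illustration of the fence rule mentioned under "Smart code blocks", the sketch below picks a backtick fence one character longer than the longest backtick run in the code. The function names choose_fence and fenced_block are hypothetical and are not taken from the library; this is a minimal sketch of the technique, not the project's implementation.

```python
import re


def choose_fence(code: str, minimum: int = 3) -> str:
    """Return a backtick fence longer than any backtick run inside `code`.

    Hypothetical helper: the name and signature are assumptions.
    """
    # Length of the longest consecutive run of backticks, or 0 if none.
    longest = max((len(run) for run in re.findall(r"`+", code)), default=0)
    return "`" * max(minimum, longest + 1)


def fenced_block(code: str, language: str = "") -> str:
    """Wrap `code` in a fence that cannot be closed early by its own content."""
    fence = choose_fence(code)
    body = code.rstrip("\n")
    return f"{fence}{language}\n{body}\n{fence}"
```

With this rule, code that itself contains a three-backtick run is wrapped in a four-backtick fence, so the embedded run is rendered as text instead of ending the block.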

Limitations

It preserves semantics, not visual layout. Complex multi-column CSS becomes linear Markdown.
Content embedded only via JavaScript at runtime won't be present in a saved static HTML file.
Highly bespoke widgets (interactive charts, embedded apps) are dropped, not reconstructed.
Heuristic extraction can occasionally mis-rank an unusual layout; use --report to inspect.

Installation

As a Claude Code skill

Copy the skill into your Claude Code skills directory:

cp -r skills/html-to-markdown ~/.claude/skills/

Then invoke it:

/html-to-markdown "saved-article.html"
/html-to-markdown "saved-article.html" "./My Vault/Articles"

As a Claude Code plugin

This repo includes .claude-plugin/plugin.json. Point your plugin configuration at the repo root, or install it through your plugin manager of choice.

As a CLI

pip install html-to-markdown          # from PyPI (once published)
# or, from a checkout:
pip install .

This installs the html-to-markdown command. The CLI runs fully independently of Claude Code.

Usage

Basic:

html-to-markdown "saved-article.html"

Choose an output vault folder:

html-to-markdown "saved-article.html" \
  --output-dir "$HOME/Documents/My Vault/Articles"

Full example:

html-to-markdown input.html \
  --output-dir "./My Vault/Articles" \
  --image-style wikilink \
  --assets-dir "assets" \
  --include-toc \
  --report

From Claude Code:

/html-to-markdown "saved-article.html" "./My Vault/Articles"

Options

Option	Description
`input`	Path to the input `.html` file (positional)
`--output-dir PATH`	Where to write the note + assets (default: `<input dir>/output`)
`--output-name NAME`	Base name for the note (default: derived from title)
`--assets-dir NAME`	Assets subdirectory name (default: `assets`)
`--image-style wikilink\|markdown`	Embed style (default: `wikilink`)
`--frontmatter` / `--no-frontmatter`	Emit YAML frontmatter (default on)
`--include-toc` / `--no-toc`	Prepend a table of contents (default off)
`--keep-safe-html`	Allow safe inline HTML (`sup`/`sub`/`details`) where Markdown can't represent it
`--download-remote-images`	Download remote images (off by default; SSRF-guarded)
`--overwrite`	Overwrite existing output files
`--dry-run`	Do everything except write files
`--report`	Write a `.conversion.json` report next to the note
`--report-format text\|json`	Format of the report printed to stdout
`--verbose`	Verbose logging
`--version`	Print version and exit

Exit codes: 0 success · 1 input error · 2 output error · 3 security refusal · 4 unexpected.
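
A wrapper script can branch on these exit codes. The sketch below mirrors the list above in an enum and runs the CLI through subprocess; the names ExitCode, describe, and convert are hypothetical, and the only parts taken from this README are the command name and the numeric codes.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in this README (the enum itself is an assumption)."""
    SUCCESS = 0
    INPUT_ERROR = 1
    OUTPUT_ERROR = 2
    SECURITY_REFUSAL = 3
    UNEXPECTED = 4


def describe(returncode: int) -> str:
    """Map a process return code to a short human-readable label."""
    try:
        return ExitCode(returncode).name.lower().replace("_", " ")
    except ValueError:
        return f"unknown exit code {returncode}"


def convert(path: str) -> str:
    """Run the CLI on one file and report how it ended.

    Requires html-to-markdown to be installed and on PATH.
    """
    result = subprocess.run(["html-to-markdown", path], check=False)
    return describe(result.returncode)
```

Passing check=False keeps a non-zero exit from raising, so a batch job can log a security refusal (3) differently from a missing input file (1).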

Output structure

For an input named My Article.html:

output/
├── My Article.md
├── assets/
│   ├── cover-image.jpg
│   └── diagram.png
└── My Article.conversion.json   # only with --report

Wikilink & Markdown compatibility

Default image embeds use the ![[path|alt]] wikilink syntax for local assets.
Frontmatter is valid YAML and uses vault-friendly keys (title, tags, source, ...).
Table-of-contents links use [[#Heading]] so they resolve inside the note.
Use --image-style markdown if you prefer portable CommonMark image syntax.
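
The two embed styles differ only in how the same local path is written. The following sketch shows both forms; image_embed is a hypothetical helper, not the library's API, and the angle-bracket handling for paths with spaces follows the CommonMark rule for link destinations and is not something this README states.

```python
def image_embed(path: str, alt: str, style: str) -> str:
    """Format a local image reference in one of the two documented styles.

    Hypothetical helper: name, parameters, and error handling are assumptions.
    """
    if style == "wikilink":
        # Wikilink embed: ![[path|alt]], or ![[path]] when there is no alt text.
        return f"![[{path}|{alt}]]" if alt else f"![[{path}]]"
    if style == "markdown":
        # CommonMark destinations containing spaces must be wrapped in <...>.
        destination = f"<{path}>" if " " in path else path
        return f"![{alt}]({destination})"
    raise ValueError(f"unknown image style: {style!r}")
```

For example, image_embed("assets/x.jpg", "alt", "wikilink") yields ![[assets/x.jpg|alt]], and the same call with "markdown" yields ![alt](assets/x.jpg).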

Remote images policy

Remote images are never downloaded unless you pass --download-remote-images. When enabled, each URL must be http(s), is DNS-resolved, and is rejected if it points at loopback, link-local, private, reserved, multicast, or unspecified addresses (SSRF defense). Downloads are size-, redirect-, and timeout-limited, and never carry cookies or credentials. A failed image becomes a warning — it never aborts the conversion; the original URL is preserved in the note when it is safe to do so.
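
The address screening described above can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not the tool's actual code: the function names and the injectable resolve parameter are hypothetical, and size, redirect, and timeout limits are left out.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip: str) -> bool:
    """True only if the address is in none of the rejected categories."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_loopback
        or addr.is_link_local
        or addr.is_private
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def _resolve(hostname: str) -> list:
    """Resolve a hostname to every IP address it maps to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_image_url(url: str, resolve=None) -> bool:
    """Accept only http(s) URLs whose host resolves solely to public addresses.

    `resolve` is a hypothetical hook so the check can run without network access.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = (resolve or _resolve)(parts.hostname)
        # Every resolved address must pass; one internal address rejects the URL.
        return bool(addresses) and all(is_public_address(ip) for ip in addresses)
    except (OSError, ValueError):
        return False
```

A check like this only helps if the download then connects to the address that was vetted; resolving the name a second time at connect time would reopen the door to DNS rebinding.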

Security

See SECURITY.md. The tool defends against hostile HTML, path traversal, SSRF, oversized inputs/images, active SVG, decompression bombs, symlink tricks, and unwanted writes.
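
One common way to guard against path traversal when writing assets is to resolve the target path and confirm it still lies inside the output directory. The sketch below shows that general technique; safe_join is a hypothetical name, and SECURITY.md remains the authority on what the tool actually does. It assumes Python 3.9 or later for Path.is_relative_to.

```python
from pathlib import Path


def safe_join(base_dir: str, relative_name: str) -> Path:
    """Join `relative_name` onto `base_dir`, refusing paths that escape it.

    Hypothetical helper illustrating the check, not the project's code.
    """
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows existing symlinks.
    target = (base / relative_name).resolve()
    if not target.is_relative_to(base):
        raise ValueError(f"refusing to write outside {base}: {relative_name!r}")
    return target
```

Because both paths are resolved before comparison, a name such as ../../etc/passwd or an absolute path is rejected, while assets/cover-image.jpg is accepted.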

Privacy

Everything runs locally. By default there is zero network activity and no JavaScript from the page is ever executed. The only outbound traffic possible is opt-in remote image download.

Troubleshooting

See skills/html-to-markdown/references/troubleshooting.md. Common cases: wrong content block extracted (--report shows the selector used), images not copied (check the _files/ folder sits next to the HTML), output not written (no-overwrite default — use --overwrite or --output-name).

Dependencies

Package	Why
`beautifulsoup4`	Tolerant HTML tree parsing and navigation
`lxml`	Fast, robust parser backend for imperfect saved pages

Remote fetching uses the Python standard library (urllib) so SSRF screening can inspect resolved IPs directly. No other runtime dependencies.

Roadmap

Optional EPUB / .webarchive input adapters
Configurable per-site extraction profiles (opt-in, still generic core)
Wikilink-style internal cross-references between converted notes

Contributing

See CONTRIBUTING.md. Please add a synthetic fixture for any new structural case; never commit copyrighted article bodies or their images.

License

Apache License 2.0.

🇧🇷 Portuguese version: README.pt-BR.md
