kreuzberg-dev/html-to-markdown
High-performance, CommonMark-compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
Implements a visitor pattern in which custom callbacks filter content and rewrite URLs, enabling domain-specific Markdown dialects. The Rust core achieves 150-280 MB/s throughput and is exposed through 12 language bindings (including Python, Node.js, Go, Ruby, PHP, Java, C#, Elixir, R, WASM, and FFI) with identical output across runtimes. Conversion also extracts metadata, including titles, headers, structured data (JSON-LD, Microdata, RDFa), and tabular data, and applies built-in HTML sanitization via ammonia.
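To make the visitor idea concrete, here is a minimal, self-contained sketch built on Python's standard-library `html.parser`. It does not use html-to-markdown and does not reflect its real API: the names `Visitor`, `skip_element`, `rewrite_url`, `SketchConverter`, `DocsVisitor`, and `convert`, as well as the `https://example.com/docs/` base URL, are all hypothetical and chosen only for illustration. The toy converter handles a handful of tags and assumes well-formed input with explicit closing tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Visitor:
    """Hypothetical callback interface (not the library's real API)."""

    def skip_element(self, tag, attrs):
        # Return True to drop this element and everything inside it.
        return False

    def rewrite_url(self, href):
        # Return the URL that should appear in the Markdown link.
        return href


class SketchConverter(HTMLParser):
    """Toy converter: handles only <p>, <a>, <strong>/<b>, and <em>/<i>."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
    MARKERS = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self, visitor):
        super().__init__(convert_charrefs=True)
        self.visitor = visitor
        self.out = []
        self.skip_depth = 0  # > 0 while inside a filtered-out subtree
        self.hrefs = []      # rewritten hrefs for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void tags have no end tag, so they never affect depth
        attrs = dict(attrs)
        if self.skip_depth or self.visitor.skip_element(tag, attrs):
            self.skip_depth += 1
            return
        if tag == "a":
            self.hrefs.append(self.visitor.rewrite_url(attrs.get("href") or ""))
            self.out.append("[")
        elif tag in self.MARKERS:
            self.out.append(self.MARKERS[tag])

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag == "a" and self.hrefs:
            self.out.append(f"]({self.hrefs.pop()})")
        elif tag in self.MARKERS:
            self.out.append(self.MARKERS[tag])
        elif tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def convert(html, visitor):
    parser = SketchConverter(visitor)
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


class DocsVisitor(Visitor):
    """Example dialect: drop navigation and ads, make links absolute."""

    def skip_element(self, tag, attrs):
        return tag == "nav" or "ad" in (attrs.get("class") or "").split()

    def rewrite_url(self, href):
        return urljoin("https://example.com/docs/", href)


html = (
    '<nav><a href="/">Home</a></nav>'
    '<p>See the <a href="guide.html">guide</a> for <strong>details</strong>.</p>'
    '<div class="ad">Buy now</div>'
)
print(convert(html, DocsVisitor()))
# prints: See the [guide](https://example.com/docs/guide.html) for **details**.
```

The design point is that the converter owns parsing and Markdown emission, while the visitor only answers narrow questions (keep this element? what should this URL be?), so a different dialect needs a different visitor rather than a different converter.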
565 stars. Actively maintained with 201 commits in the last 30 days.
Stars: 565
Forks: 50
Language: HTML
License: MIT
Category:
Last pushed: Mar 13, 2026
Commits (30d): 201
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/kreuzberg-dev/html-to-markdown"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Related tools
any4ai/AnyCrawl
AnyCrawl: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...
lightfeed/extractor
Using LLMs and AI browser automation to robustly extract web data
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
paulpierre/markdown-crawler
A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping,...