opendatalab/MinerU-HTML
MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Uses regex-structured output with vLLM or Transformers backends for efficient inference, paired with modular HTML simplification and fallback mechanisms (trafilatura/bypass/empty). Integrates with MinerU-Webkit for downstream conversion to Markdown, JSON, or TXT, and achieves 0.90 ROUGE-N F1 on WebMainBench, which is competitive with GPT-5 while allowing local deployment on compact SLMs.
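The fallback options named above (trafilatura/bypass/empty) could be wired up roughly as in the sketch below. This is an illustration, not the project's code: the function name `extract_main_html`, its parameters, and the meaning given to each mode are assumptions. Here "bypass" is read as returning the input HTML unchanged and "empty" as returning an empty string. Only `trafilatura.extract` is a real library call.

```python
from typing import Callable, Optional

# Mode names come from the project description; their semantics here are assumed.
FALLBACK_MODES = ("trafilatura", "bypass", "empty")


def extract_main_html(
    html: str,
    primary: Callable[[str], Optional[str]],
    fallback: str = "trafilatura",
    trafilatura_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Run the primary (model-based) extractor, falling back if it fails.

    `primary` is a hypothetical stand-in for the SLM extraction step.
    """
    if fallback not in FALLBACK_MODES:
        raise ValueError(f"unknown fallback mode: {fallback!r}")

    try:
        result = primary(html)
    except Exception:
        # Any failure in the primary extractor triggers the fallback path.
        result = None

    if result and result.strip():
        return result

    if fallback == "bypass":
        # Assumed meaning: hand the original page through untouched.
        return html
    if fallback == "empty":
        # Assumed meaning: report that nothing could be extracted.
        return ""

    # fallback == "trafilatura": rule-based extraction.
    if trafilatura_extract is None:
        import trafilatura  # imported lazily so the other modes need no dependency

        # Note: trafilatura.extract returns plain text by default, or None.
        trafilatura_extract = trafilatura.extract
    return trafilatura_extract(html) or ""
```

Passing a stub as `trafilatura_extract` lets the fallback path be exercised without installing trafilatura; in normal use the argument would be left as `None`.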
Stars: 217
Forks: 24
Language: HTML
License: Apache-2.0
Last pushed: Dec 25, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendatalab/MinerU-HTML"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
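The same request can be made from Python with only the standard library. The sketch below is a minimal example under stated assumptions: the helper names `build_quality_url` and `fetch_quality` are hypothetical, the path layout (category, owner, repo) is inferred from the single URL shown above, and the response is returned as raw text because its format is not documented here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example; everything after it is inferred.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the listing data and return the response body as text.

    No API key is sent, so the keyless daily limit applies.
    """
    url = build_quality_url(category, owner, repo)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch_quality("rag", "opendatalab", "MinerU-HTML"))
```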
Higher-rated alternatives
any4ai/AnyCrawl
AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...
kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...
lightfeed/extractor
Using LLMs and AI browser automation to robustly extract web data
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
paulpierre/markdown-crawler
A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...