opendatalab/MinerU-HTML

MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

Score: 45 / 100 (Emerging)

Uses regex-structured output with vLLM or Transformers backends for efficient inference, paired with modular HTML simplification and fallback mechanisms (trafilatura, bypass, or empty). Integrates with MinerU-Webkit for downstream conversion to Markdown, JSON, or TXT, and achieves 0.90 ROUGE-N F1 on WebMainBench, which is competitive with GPT-5 while still allowing local deployment with compact SLMs.
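The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual API: the function name `extract_with_fallback`, its parameters, and the reading of the modes ("bypass" returns the input HTML unchanged, "empty" returns an empty string, "trafilatura" delegates to a second extractor) are all assumptions made for the example.

```python
from typing import Callable, Optional


def extract_with_fallback(
    html: str,
    primary: Callable[[str], str],
    fallback: str = "bypass",
    fallback_extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Try the primary extractor; if it fails or returns nothing, use a fallback.

    Hypothetical helper: the mode names mirror the listing, but their exact
    semantics here are assumptions, not the library's documented behaviour.
    """
    try:
        result = primary(html)
        if result and result.strip():
            return result
    except Exception:
        pass  # any primary failure falls through to the fallback path

    if fallback == "empty":
        return ""  # assumed meaning: give up and return nothing
    if fallback == "bypass":
        return html  # assumed meaning: pass the original HTML through
    if fallback == "trafilatura":
        if fallback_extractor is None:
            raise ValueError("trafilatura mode needs a fallback_extractor callable")
        # The secondary extractor may return None when it finds no content.
        return fallback_extractor(html) or ""
    raise ValueError(f"unknown fallback mode: {fallback!r}")
```

With trafilatura installed, its `extract` function could be passed as `fallback_extractor`; the model-based extractor would be supplied as `primary`.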

No package, no dependents
Maintenance 6 / 25
Adoption 10 / 25
Maturity 13 / 25
Community 16 / 25
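
The four sub-scores add up to the overall 45 / 100 figure. The page does not state the formula, so treating the total as a plain sum of four categories worth 25 points each is an assumption; the short check below only confirms that the listed numbers are consistent with it.

```python
# Sub-scores as listed above, each out of 25.
subscores = {"Maintenance": 6, "Adoption": 10, "Maturity": 13, "Community": 16}
MAX_PER_CATEGORY = 25

# Assumption: the overall score is the unweighted sum of the sub-scores.
total = sum(subscores.values())
max_total = MAX_PER_CATEGORY * len(subscores)

print(f"{total} / {max_total}")  # prints "45 / 100"
```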

Stars: 217
Forks: 24
Language: HTML
License: Apache-2.0
Last pushed: Dec 25, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendatalab/MinerU-HTML"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
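
The same request can be made from Python using only the standard library. This is a sketch under two assumptions that the page does not confirm: that the path after `/quality/` follows a `category/owner/repo` pattern (inferred from the single URL shown), and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the category/owner/repo layout is an assumption."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return "/".join([BASE_URL, *parts])


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("rag", "opendatalab", "MinerU-HTML")` would issue the same GET request as the curl command; the response fields are not documented here, so inspect the returned object before relying on any key. Unauthenticated calls count against the 100 requests/day limit.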