adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Score: 60 / 100 (Established)

Combines feed/sitemap discovery with parallel batch processing of URLs and local HTML files, enabling efficient crawling workflows without external databases. Extraction uses hybrid algorithms (jusText, readability patterns) to balance precision and recall, preserving document structure through semantic markup while filtering boilerplate. Integrates with HuggingFace, IBM, and Microsoft Research pipelines, and supports R bindings alongside Python and CLI interfaces.
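The workflow described above (discover URLs from feeds and sitemaps, then extract main text from many pages in parallel) can be sketched in Python roughly as follows. This is a minimal sketch, not the project's own batch machinery: `fetch_url`, `extract`, `find_feed_urls`, and `sitemap_search` are trafilatura functions, while `discover_urls`, `extract_text`, `extract_many`, and the thread pool are hypothetical helpers introduced here for illustration. The third-party package is assumed to be installed (`pip install trafilatura`), and the calls that use it need network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional


def discover_urls(homepage: str) -> List[str]:
    """Collect candidate page URLs from a site's feeds and sitemaps."""
    # Imported lazily so the module loads even without trafilatura installed.
    from trafilatura.feeds import find_feed_urls
    from trafilatura.sitemaps import sitemap_search

    urls = set(find_feed_urls(homepage)) | set(sitemap_search(homepage))
    return sorted(urls)


def extract_text(url: str) -> Optional[str]:
    """Download one page and return its main text, or None on failure."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    # extract() drops boilerplate and returns None if nothing usable is found.
    return trafilatura.extract(downloaded)


def extract_many(
    urls: Iterable[str],
    worker: Callable[[str], Optional[str]] = extract_text,
    max_workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Apply `worker` to every URL in a thread pool (illustrative helper)."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(url_list, pool.map(worker, url_list)))
```

A typical call would be `extract_many(discover_urls("https://example.org"))`, where the homepage URL is a placeholder. The generic thread pool is a stand-in chosen for brevity. For real crawls, the library's own download utilities and command-line interface are likely a better fit, since a plain pool applies no per-domain rate limiting.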

5,481 stars. Used by 24 other packages. No commits in the last 6 months. Available on PyPI.

Stale 6m
Maintenance 2 / 25
Adoption 15 / 25
Maturity 25 / 25
Community 18 / 25
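
The four category scores above add up to the overall 60 / 100. The short check below makes that arithmetic explicit. That the overall score is a plain sum of the four categories is an inference from the numbers shown here, not something the page states.

```python
# Category scores as listed above, each out of 25.
subscores = {"Maintenance": 2, "Adoption": 15, "Maturity": 25, "Community": 18}
MAX_PER_CATEGORY = 25

total = sum(subscores.values())
maximum = MAX_PER_CATEGORY * len(subscores)

# Assumption: the overall score is the unweighted sum of the categories.
print(f"{total} / {maximum}")
```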


Stars: 5,481
Forks: 348
Language: Python
License: Apache-2.0
Last pushed: Sep 12, 2025
Commits (30d): 0
Dependencies: 7
Reverse dependents: 24

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/adbar/trafilatura"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
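
The same request can be made from Python using only the standard library. This is a sketch under assumptions: the page does not document the response format, so the code assumes the endpoint returns JSON, and `quality_url` and `fetch_quality` are hypothetical helper names. Only the URL pattern comes from the curl command above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for an owner/repo pair."""
    return f"{BASE}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """Fetch the quality record. Assumes a JSON response body."""
    with urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Performs a real network request and counts against the daily limit.
    print(fetch_quality("adbar", "trafilatura"))
```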