adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML


Established

Combines feed and sitemap discovery with parallel batch processing of URLs and local HTML files, so crawling workflows need no external database. Extraction uses hybrid algorithms (jusText, readability patterns) to balance precision and recall, filtering boilerplate while preserving document structure through semantic markup. Integrated into pipelines at HuggingFace, IBM, and Microsoft Research; R bindings are available alongside the Python and CLI interfaces.
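To make the idea of boilerplate filtering and batch processing concrete, here is a minimal standard-library sketch. It is not trafilatura's algorithm or API: it swaps in a plain link-density heuristic (blocks that are short or mostly link text are dropped), and every name in it (`BlockCollector`, `extract_main_text`, `extract_batch`, the thresholds) is hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; real extractors use richer rules.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "nav", "header", "footer", "aside"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect [text, link_chars] pairs, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = [["", 0]]
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks.append(["", 0])  # a block tag starts a new block
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.blocks.append(["", 0])  # text after a block belongs elsewhere
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or self.skip_depth:
            return
        block = self.blocks[-1]
        block[0] = f"{block[0]} {text}".strip()
        if self.link_depth:
            block[1] += len(text)  # count characters that sit inside links


def extract_main_text(html, min_chars=40, max_link_density=0.5):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        if not text or len(text) < min_chars:
            continue
        if link_chars / len(text) <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


def extract_batch(documents, workers=4):
    """Process several HTML strings in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_main_text, documents))
```

Calling `extract_batch` on a list of HTML strings returns one cleaned text per document, in input order. Threads keep the sketch simple; because parsing is CPU-bound, a process pool would probably suit large batches better.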

5,481 stars. Used by 24 other packages. No commits in the last 6 months. Available on PyPI.


Maintenance 2 / 25

Adoption 15 / 25

Maturity 25 / 25

Community 18 / 25


Stars 5,481

Forks 348

Language Python

License Apache-2.0

Related tools

any4ai/AnyCrawl

AnyCrawl 🚀: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

kreuzberg-dev/html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...

lightfeed/extractor

Using LLMs and AI browser automation to robustly extract web data

paulpierre/markdown-crawler

A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...
