adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Combines feed/sitemap discovery with parallel batch processing of URLs and local HTML files, enabling efficient crawling workflows without external databases. Extraction uses hybrid algorithms (jusText, readability patterns) to balance precision and recall, preserving document structure through semantic markup while filtering boilerplate. Integrates with HuggingFace, IBM, and Microsoft Research pipelines, and supports R bindings alongside Python and CLI interfaces.
5,481 stars. Used by 24 other packages. No commits in the last 6 months. Available on PyPI.
Stars
5,481
Forks
348
Language
Python
License
Apache-2.0
Category
Last pushed
Sep 12, 2025
Commits (30d)
0
Dependencies
7
Reverse dependents
24
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/adbar/trafilatura"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
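The same request can be made from Python with only the standard library. This is a minimal sketch: it assumes the endpoint returns a JSON body, which the page does not state, and it makes no assumption about which fields the response contains. The function names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality/rag"


def build_quality_url(owner, repo):
    """Build the endpoint URL for one repository, escaping both path parts."""
    return f"{API_BASE}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch the record for owner/repo and decode it.

    Assumption: the response body is JSON. The field names are not
    documented on this page, so the decoded value is returned unchanged.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a network request, so it is left commented out):
# print(json.dumps(fetch_quality("adbar", "trafilatura"), indent=2))
```

With the keyless limit of 100 requests/day, it is worth caching the decoded result locally instead of re-fetching it on every run.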
Related tools
any4ai/AnyCrawl
AnyCrawl: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts...
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
kreuzberg-dev/html-to-markdown
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the...
lightfeed/extractor
Using LLMs and AI browser automation to robustly extract web data
paulpierre/markdown-crawler
A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file...