flairNLP/fundus
A very simple news crawler with a funny name
Crawls both live publisher websites and the CommonCrawl CC-NEWS archive, using multi-process parallel fetching to support large-scale corpus creation. Parses articles from 150+ international news publishers through a unified interface, extracting text, metadata, and images in structured form from several source types (live sites, sitemaps, web archives). Includes a filter for AI training use that helps identify publishers that have not objected to model training on their content.
443 stars and 3,566 monthly downloads. Available on PyPI.
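The crawling workflow described above might look roughly like the following sketch. It assumes the `fundus` package is installed from PyPI and that `Crawler`, `PublisherCollection.us`, and `crawl(max_articles=...)` behave as in the project's quick-start; check these against the current documentation. The helpers `summarize` and `crawl_summaries` are hypothetical names introduced here, and the `title` and `plaintext` attribute names are assumptions about the article object.

```python
from typing import Any, Iterator


def summarize(article: Any, max_chars: int = 80) -> str:
    """Build a one-line summary from an article-like object.

    Assumes the object exposes `title` and `plaintext`; falls back to
    placeholders when either attribute is missing or empty.
    """
    title = getattr(article, "title", None) or "(untitled)"
    text = getattr(article, "plaintext", None) or ""
    return f"{title}: {text[:max_chars]}"


def crawl_summaries(max_articles: int = 2) -> Iterator[str]:
    """Yield summaries for a few freshly crawled articles (needs network)."""
    # Imported lazily so this module loads even where fundus is not installed.
    from fundus import Crawler, PublisherCollection

    # Assumption: PublisherCollection.us selects publishers based in the US.
    crawler = Crawler(PublisherCollection.us)
    for article in crawler.crawl(max_articles=max_articles):
        yield summarize(article)
```

Calling `for line in crawl_summaries(2): print(line)` would fetch two articles from live sites, so results depend on network access and on what the publishers are serving at the time.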
Stars: 443
Forks: 105
Language: Python
License: MIT
Category:
Last pushed: Mar 17, 2026
Monthly downloads: 3,566
Commits (30d): 0
Dependencies: 18
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/flairNLP/fundus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
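For scripted access, the same request can be made from Python using only the standard library. This is a minimal sketch: the path layout `/quality/<category>/<owner>/<repo>` is inferred from the single URL above and is an assumption, as are the helper names `quality_url` and `fetch_quality`. The response format is not documented here, so the body is returned as raw text.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the path layout is an assumption."""
    # Percent-encode each segment so unusual names cannot break the path.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body for one repository (needs network)."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

`fetch_quality("nlp", "flairNLP", "fundus")` would issue the same request as the `curl` command, and each call counts against the daily quota.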
Related tools
fhamborg/news-please
news-please - an integrated web crawler and information extractor for news that just works
FreeDiscovery/FreeDiscovery
Web Service for E-Discovery Analytics
affjljoo3581/canrevan
A library for collecting large volumes of Naver news articles.
Multiverse-of-Projects/NewsAI
A dynamic NewsAI dashboard that uses NLP to analyze news articles, visualize sentiment trends,...
tirthajyoti/Web-Database-Analytics
Web scraping and related analytics using Python tools