fhamborg/news-please
news-please - an integrated web crawler and information extractor for news that just works
Combines Scrapy, Newspaper4k, and Readability libraries to extract structured article data (headline, author, publication date, body text, main image) from arbitrary news websites via recursive link following and RSS feeds. Offers three usage modes: CLI for batch crawling with JSON/PostgreSQL/ElasticSearch/Redis output, Python library API for programmatic extraction, and direct integration with CommonCrawl's historical news archive with optional filtering by publisher and date range.
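As a rough illustration of the Python library mode, the sketch below fetches one article and copies the structured fields into a plain dict. It assumes the package is installed and imported as `newsplease`, that `NewsPlease.from_url` is the entry point, and that the article object exposes the attribute names listed in `FIELDS`. The helper names `article_to_record` and `extract_from_url` are hypothetical and are not part of the library.

```python
# Attribute names assumed to exist on a news-please article object.
FIELDS = ("title", "authors", "date_publish", "maintext", "image_url")


def article_to_record(article):
    """Copy the fields of interest into a plain dict (missing ones become None)."""
    return {name: getattr(article, name, None) for name in FIELDS}


def extract_from_url(url):
    """Download and parse a single article; requires network access."""
    # Imported lazily so article_to_record works without news-please installed.
    from newsplease import NewsPlease

    return article_to_record(NewsPlease.from_url(url))
```

Calling `extract_from_url` with an article URL would return a dict keyed by the names in `FIELDS`. Fields the extractor could not find are expected to come back as `None`.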
2,402 stars and 195,274 monthly downloads. No commits in the last 6 months. Available on PyPI.
Stars: 2,402
Forks: 450
Language: Python
License: Apache-2.0
Category:
Last pushed: Sep 21, 2025
Monthly downloads: 195,274
Commits (30d): 0
Dependencies: 25
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/fhamborg/news-please"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
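The same request can be made from Python with only the standard library. This is a minimal sketch: the path layout is copied from the curl example, the parameter name `segment` for the `nlp` path component is a guess, and the response is returned as raw text because its format is not described here.

```python
import urllib.request

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(segment, owner, repo):
    """Build the endpoint URL; 'segment' is a hypothetical name for the 'nlp' part."""
    return f"{BASE_URL}/{segment}/{owner}/{repo}"


def fetch_quality(segment, owner, repo, timeout=10):
    """Return the response body as text; requires network access."""
    url = quality_url(segment, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_quality("nlp", "fhamborg", "news-please")` issues the same GET request as the curl command and is subject to the same daily request limit.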
Related tools
flairNLP/fundus
A very simple news crawler with a funny name
FreeDiscovery/FreeDiscovery
Web Service for E-Discovery Analytics
affjljoo3581/canrevan
A library for collecting large volumes of Naver news articles.
Multiverse-of-Projects/NewsAI
A dynamic NewsAI dashboard that uses NLP to analyze news articles, visualize sentiment trends,...
tirthajyoti/Web-Database-Analytics
Web scraping and related analytics using Python tools