fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

Quality score: 71 / 100 (Verified)

Combines Scrapy, Newspaper4k, and Readability libraries to extract structured article data (headline, author, publication date, body text, main image) from arbitrary news websites via recursive link following and RSS feeds. Offers three usage modes: CLI for batch crawling with JSON/PostgreSQL/ElasticSearch/Redis output, Python library API for programmatic extraction, and direct integration with CommonCrawl's historical news archive with optional filtering by publisher and date range.
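
As a rough illustration of the Python library mode, the sketch below collects the fields listed above from an extracted article. It assumes the `news-please` package is installed and relies on its `NewsPlease.from_url` entry point; the helper names `article_summary` and `fetch_summary` and the output keys are illustrative choices, not part of the library.

```python
from types import SimpleNamespace


def article_summary(article):
    """Collect the commonly used fields from an extracted article object.

    `article` is expected to expose news-please style attributes; getattr
    with a default keeps the helper usable when a field is absent.
    """
    return {
        "headline": getattr(article, "title", None),
        "authors": getattr(article, "authors", None) or [],
        "published": getattr(article, "date_publish", None),
        "body": getattr(article, "maintext", None),
        "image": getattr(article, "image_url", None),
    }


def fetch_summary(url):
    """Download and extract one article (needs network and news-please)."""
    from newsplease import NewsPlease  # imported lazily: third-party dependency

    return article_summary(NewsPlease.from_url(url))


# Offline demonstration with a stand-in object instead of a real crawl.
demo = SimpleNamespace(title="Example headline", authors=["A. Writer"],
                       date_publish=None, maintext="Body text.", image_url=None)
summary = article_summary(demo)
```

Calling `fetch_summary` with a real article URL would perform the download and extraction; the stand-in object only shows the shape of the result without network access.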

2,402 stars and 195,274 monthly downloads. No commits in the last 6 months. Available on PyPI.

Status: stale (no commits in 6 months)
Maintenance: 2 / 25
Adoption: 20 / 25
Maturity: 25 / 25
Community: 24 / 25

Stars: 2,402
Forks: 450
Language: Python
License: Apache-2.0
Last pushed: Sep 21, 2025
Monthly downloads: 195,274
Commits (30d): 0
Dependencies: 25

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/fhamborg/news-please"

The API is open to everyone at 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
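
The same request can be sketched in Python using only the standard library. The reading of the path segments as category, owner, and repository is inferred from the curl example, the JSON response format is an assumption, and `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL in the same shape as the curl example."""
    return f"{API_BASE}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch and decode the response (assumed to be JSON; needs network)."""
    with urllib.request.urlopen(quality_url(category, owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = quality_url("nlp", "fhamborg", "news-please")
```

Keeping URL construction separate from the network call makes the request shape easy to check without spending any of the daily request quota.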