fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

71 / 100

Verified

Combines Scrapy, Newspaper4k, and Readability libraries to extract structured article data (headline, author, publication date, body text, main image) from arbitrary news websites via recursive link following and RSS feeds. Offers three usage modes: CLI for batch crawling with JSON/PostgreSQL/ElasticSearch/Redis output, Python library API for programmatic extraction, and direct integration with CommonCrawl's historical news archive with optional filtering by publisher and date range.
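The library mode mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's documented recipe: it assumes the `news-please` package is installed and that `NewsPlease.from_url` returns an article object exposing `title`, `authors`, `date_publish`, `maintext`, and `image_url`. The helper names `article_to_record` and `extract` are hypothetical and exist only for this example.

```python
from types import SimpleNamespace


def article_to_record(article):
    """Collect the fields named in the description into a plain dict.

    Works on any object with news-please style attribute names;
    attributes that are absent come back as None (or an empty list).
    """
    return {
        "headline": getattr(article, "title", None),
        "authors": list(getattr(article, "authors", None) or []),
        "published": getattr(article, "date_publish", None),
        "body": getattr(article, "maintext", None),
        "main_image": getattr(article, "image_url", None),
    }


def extract(url):
    """Fetch and parse one article with news-please (needs network access)."""
    # Imported lazily so the rest of the sketch runs without the package.
    # Assumption: installed via `pip install news-please`.
    from newsplease import NewsPlease

    return article_to_record(NewsPlease.from_url(url))


if __name__ == "__main__":
    # Offline stand-in object with the same attribute names (made-up values).
    sample = SimpleNamespace(
        title="Example headline",
        authors=["A. Writer"],
        date_publish=None,
        maintext="Body text.",
        image_url=None,
    )
    print(article_to_record(sample))
```

Calling `extract("https://example.com/some-article")` with a real article URL would perform the network fetch. The stand-in object in the `__main__` block only demonstrates the shape of the resulting record.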

2,402 stars and 195,274 monthly downloads. No commits in the last 6 months. Available on PyPI.


Maintenance 2 / 25

Adoption 20 / 25

Maturity 25 / 25

Community 24 / 25


Stars: 2,402

Forks: 450

Language: Python

License: Apache-2.0

Related tools

flairNLP/fundus

A very simple news crawler with a funny name

FreeDiscovery/FreeDiscovery

Web Service for E-Discovery Analytics

affjljoo3581/canrevan

A library for collecting large volumes of Naver news articles.

Multiverse-of-Projects/NewsAI

A dynamic NewsAI dashboard that uses NLP to analyze news articles, visualize sentiment trends,...

tirthajyoti/Web-Database-Analytics

Web scraping and related analytics using Python tools
