jamesturk/scrapeghost

👻 Experimental library for scraping websites using OpenAI's GPT API.

Score: 52 / 100 (Established)

Uses GPT to extract structured data from HTML according to schemas defined in Python. Built-in preprocessing covers HTML cleaning, CSS/XPath filtering, and automatic splitting of large pages; postprocessing covers Pydantic validation and hallucination detection. Cost tracking and budget controls help limit spending on API calls, and the library can fall back automatically between GPT-3.5-Turbo and GPT-4. Note: This 2023 project is no longer maintained.
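The budget-control idea in the description can be pictured with a small, self-contained sketch. This is not scrapeghost's implementation or API: `CostTracker`, `BudgetExceeded`, the model names, and the per-1,000-token prices are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a charge would push the running total past the cap."""


# Placeholder prices in USD per 1,000 tokens (assumed values, not real pricing).
PRICE_PER_1K = {"cheap-model": 0.002, "strong-model": 0.06}


@dataclass
class CostTracker:
    max_cost: float            # hard spending cap in USD
    total_cost: float = 0.0    # running total of accepted charges
    calls: list = field(default_factory=list)

    def charge(self, model: str, tokens: int) -> float:
        """Record one call's cost, or raise if it would exceed the cap."""
        cost = PRICE_PER_1K[model] * tokens / 1000
        if self.total_cost + cost > self.max_cost:
            raise BudgetExceeded(
                f"{model}: {cost:.4f} USD would exceed cap of {self.max_cost:.4f} USD"
            )
        self.total_cost += cost
        self.calls.append((model, tokens, cost))
        return cost
```

A caller would create one tracker per scraping run, call `charge` for each request, and treat `BudgetExceeded` as the signal to stop issuing further requests.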


No package. No dependents.

Maintenance: 10 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 16 / 25


Stars: 1,444
Forks: 88
Language: Python
License:
Last pushed: Jan 14, 2026
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jamesturk/scrapeghost"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
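For scripted access, the same request can be made from Python using only the standard library. This is a minimal sketch under two assumptions that the page does not confirm: that the endpoint returns JSON, and that other repositories follow the same `owner/repo` path pattern as the curl example. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def quality_url(owner: str, repo: str) -> str:
    """Build the endpoint URL for one repository (path pattern is assumed)."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner: str, repo: str, timeout: float = 10.0):
    """GET the endpoint and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(quality_url(owner, repo), timeout=timeout) as resp:
        return json.load(resp)


# Example call (performs a network request and counts against the daily limit):
# data = fetch_quality("jamesturk", "scrapeghost")
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` repeatedly.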