harvard-lil/warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Extracts text from HTML and PDF records in WARC files, generates embeddings with automatic context-window splitting, and stores them in a vector database (ChromaDB) for semantic search. Supports multiple LLM backends, including OpenAI, Ollama, and OpenAI-compatible providers such as HuggingFace and vLLM, selected through configurable environment variables. Provides both a REST API and a web UI with chat history support, plus a t-SNE visualization of the embedding space.
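To illustrate what "context-window splitting" can mean in practice, here is a minimal sketch, not WARC-GPT's actual implementation. It assumes whitespace-separated words as a rough stand-in for model tokens; the function name `split_for_context_window` and the parameters `max_tokens` and `overlap` are hypothetical and chosen for illustration only.

```python
from typing import List


def split_for_context_window(
    text: str, max_tokens: int = 256, overlap: int = 32
) -> List[str]:
    """Split text into overlapping chunks of at most `max_tokens` words.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

Each resulting chunk would then be embedded and stored separately, so a long HTML or PDF record never exceeds the embedding model's input limit. The overlap keeps sentences that straddle a chunk boundary retrievable from either side.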
270 stars. No commits in the last 6 months.
Stars: 270
Forks: 25
Language: Python
License: MIT
Category:
Last pushed: Feb 11, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/harvard-lil/warc-gpt"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Kain-90/RAG-Play
An interactive visualization tool for understanding Retrieval-Augmented Generation (RAG) pipelines.
rryam/LumoKit
Swift package for on-device Retrieval-Augmented Generation (RAG)
CoIR-team/coir
(ACL 2025 Main) A Comprehensive Benchmark for Code Information Retrieval.
constacts/ragtacts
RAG (Retrieval-Augmented Generation) for Evolving Data
Global-Witness/augmenta
AI agent for enhancing datasets with information from the internet