Aavache/LLMWebCrawler
A web crawler based on LLMs, implemented with Ray and Hugging Face. The embeddings are saved into a vector database for fast clustering and retrieval. Use it for your RAG.
Implements recursive web crawling with configurable depth limits and stores both the raw text and BERT embeddings in a Milvus vector database for semantic similarity search. Crawling workloads are distributed across Ray workers in a master-worker architecture, and a FastAPI interface lets you query crawled content by vector proximity rather than keyword matching.
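The crawl-then-search flow described above can be illustrated with a small, self-contained sketch. It is not the repository's code, and every name in it (`PAGES`, `crawl`, `embed`, `search`) is hypothetical. An in-memory link graph stands in for real HTTP fetching, an iterative breadth-first traversal stands in for the recursive crawl across Ray workers, bag-of-words vectors stand in for BERT embeddings, and a plain dictionary stands in for Milvus.

```python
import math
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
PAGES = {
    "a": ("ray distributed workers", ["b", "c"]),
    "b": ("vector database search", ["d"]),
    "c": ("cooking pasta recipes", ["a"]),
    "d": ("vector search", []),
}


def crawl(start, max_depth):
    """Visit pages breadth-first, following links at most max_depth hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in PAGES[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def embed(text, vocab):
    """Toy bag-of-words vector (a stand-in for a BERT embedding)."""
    words = text.split()
    return [float(words.count(term)) for term in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def search(query, urls, top_k=2):
    """Rank crawled pages by vector proximity to the query, most similar first."""
    vocab = sorted({word for url in urls for word in PAGES[url][0].split()})
    index = {url: embed(PAGES[url][0], vocab) for url in urls}  # stand-in for a vector store
    query_vec = embed(query, vocab)
    ranked = sorted(urls, key=lambda url: cosine(query_vec, index[url]), reverse=True)
    return ranked[:top_k]
```

In this toy graph, `search("vector search", crawl("a", 2))` ranks page `d` ahead of page `b`, while a depth limit of 1 never reaches `d`, so `b` becomes the top hit. The depth limit therefore bounds what can be retrieved later, which is the trade-off the configurable limit exposes.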
No commits in the last 6 months.
Stars: 98
Forks: 13
Language: Python
License: -
Category
Last pushed: Oct 15, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/Aavache/LLMWebCrawler"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
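The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are hypothetical, and reading the path segments as category, owner, and repository is an assumption based on the single URL shown above. The response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (segment meaning is assumed, not documented here)."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Perform the GET request and return the response body as text, unparsed."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("vector-db", "Aavache", "LLMWebCrawler"))
```

Unauthenticated calls count against the 100 requests/day limit mentioned above, so caching the result locally is sensible if the script runs repeatedly.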
Higher-rated alternatives
yichuan-w/LEANN
[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...
aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation
Advanced document extraction and chunking techniques for retrieval augmented generation that is...
byerlikaya/SmartRAG
Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....
mrutunjay-kinagi/ragsearch
This project aims to build a Retrieval-Augmented Generation (RAG) engine to provide...
Omkar-Wagholikar/adora
Python package that makes it easy to spin up a custom Retrieval-Augmented Generation (RAG) pipeline.