Aavache/LLMWebCrawler

An LLM-based web crawler implemented with Ray and Hugging Face. Embeddings are saved to a vector database for fast clustering and retrieval. Use it for your RAG pipeline.

/ 100

Emerging

Implements recursive web crawling with configurable depth limits and stores both raw text and BERT embeddings in a Milvus vector database for semantic similarity search. Distributes crawling workloads across Ray workers in a master-worker architecture and exposes a FastAPI interface for querying crawled content by vector proximity rather than keyword matching.
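The two core ideas in this description, depth-limited crawling and lookup by vector proximity, can be illustrated with a minimal standard-library sketch. It is not taken from the repository. The `PAGES` link graph, the `crawl`, `embed`, `cosine`, and `search` helpers, and the bag-of-words vectors are all hypothetical stand-ins. A real setup would fetch pages over HTTP, use BERT embeddings, store them in Milvus, and fan the work out to Ray workers. The traversal here is an iterative breadth-first walk rather than literal recursion.

```python
import math
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a": ("ray distributed workers", ["b", "c"]),
    "b": ("milvus vector database", ["d"]),
    "c": ("fastapi query interface", ["a"]),
    "d": ("only reachable at depth two", []),
}

def crawl(start, max_depth):
    """Breadth-first crawl that stops following links at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page's links
        for link in PAGES[url][1]:
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

def embed(text, vocab):
    """Toy bag-of-words vector; an assumed stand-in for a BERT embedding."""
    words = text.split()
    return [float(words.count(term)) for term in vocab]

def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def search(query, index, vocab, top_k=1):
    """Rank indexed pages by vector proximity to the query."""
    q = embed(query, vocab)
    ranked = sorted(index, key=lambda url: cosine(q, index[url]), reverse=True)
    return ranked[:top_k]

# Crawl to depth 1, then build a tiny in-memory "vector index".
urls = crawl("a", max_depth=1)
vocab = sorted({w for u in urls for w in PAGES[u][0].split()})
index = {u: embed(PAGES[u][0], vocab) for u in urls}
```

With this toy data, `crawl("a", max_depth=1)` visits `a`, `b`, and `c` but never reaches `d`, and `search("vector database", index, vocab)` ranks page `b` first. A brute-force sort like this only suits a handful of pages; a vector database replaces it with an approximate nearest-neighbor index.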

No commits in the last 6 months.

No License · Stale 6m · No Package · No Dependents

Maintenance 0 / 25

Adoption 9 / 25

Maturity 8 / 25

Community 15 / 25

Language: Python

License: None

Higher-rated alternatives

yichuan-w/LEANN

[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...

aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation

Advanced document extraction and chunking techniques for retrieval augmented generation that is...

byerlikaya/SmartRAG

Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....

mrutunjay-kinagi/ragsearch

This project aims to build a Retrieval-Augmented Generation (RAG) engine to provide...

Omkar-Wagholikar/adora

Python package that makes it easy to spin up a custom Retrieval-Augmented Generation (RAG) pipeline.
