OceanPresentChao/llm-corpus
Build an LLM RAG knowledge base from scratch (从零搭建大模型知识库)
Implements a complete RAG pipeline with custom Word2Vec embedding training for Chinese corpora, vector persistence in Qdrant, and flexible model backends supporting both local ChatGLM2-6B deployment and OpenAI APIs. The modular architecture separates document ingestion, embedding generation, vector storage, and inference into independent components configurable via JSON, enabling experimentation with different embedding models and LLM providers without code changes.
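The JSON-driven modularity described above can be sketched as a small factory that picks each component from config (a minimal illustration only; the field names, backend labels, and class names here are assumptions, not the project's actual schema):

```python
import json

# Hypothetical config in the spirit of the repo's JSON-driven design.
# All keys and backend names below are illustrative assumptions.
CONFIG = json.loads("""
{
  "embedding":    {"backend": "word2vec", "dim": 128},
  "vector_store": {"backend": "qdrant", "collection": "docs"},
  "llm":          {"backend": "chatglm2-6b"}
}
""")

def build_pipeline(cfg):
    """Assemble the embedder, vector store, and LLM backend from config,
    so swapping providers needs no code changes -- only a config edit."""
    embedder = {"word2vec": "Word2VecEmbedder",
                "openai": "OpenAIEmbedder"}[cfg["embedding"]["backend"]]
    store = {"qdrant": "QdrantStore"}[cfg["vector_store"]["backend"]]
    llm = {"chatglm2-6b": "ChatGLMBackend",
           "openai": "OpenAIBackend"}[cfg["llm"]["backend"]]
    return embedder, store, llm

print(build_pipeline(CONFIG))
```

Changing `"word2vec"` to `"openai"` in the config would select a different embedder without touching the pipeline code, which is the experimentation workflow the description claims.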
No commits in the last 6 months.
Stars: 86
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Oct 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/OceanPresentChao/llm-corpus"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives:
- yichuan-w/LEANN: [MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...
- aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation: Advanced document extraction and chunking techniques for retrieval augmented generation that is...
- byerlikaya/SmartRAG: Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....
- mrutunjay-kinagi/ragsearch: This project aims to build a Retrieval-Augmented Generation (RAG) engine to provide...
- Omkar-Wagholikar/adora: Python package that makes it easy to spin up a custom Retrieval-Augmented Generation (RAG) pipeline.