OceanPresentChao/llm-corpus
Build an LLM RAG knowledge base from scratch (从零搭建大模型知识库)
Implements a complete RAG pipeline with custom Word2Vec embedding training for Chinese corpora, vector persistence in Qdrant, and flexible model backends supporting both local ChatGLM2-6B deployment and OpenAI APIs. The modular architecture separates document ingestion, embedding generation, vector storage, and inference into independent components configurable via JSON, enabling experimentation with different embedding models and LLM providers without code changes.
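The JSON-driven modularity described above can be sketched as a small factory that picks each component from config (a minimal illustration only; the field names, backend labels, and class names here are assumptions, not the project's actual schema):

```python
import json

# Hypothetical config in the spirit of the repo's JSON-driven design.
# All keys and backend names below are illustrative assumptions.
CONFIG = json.loads("""
{
  "embedding":    {"backend": "word2vec", "dim": 128},
  "vector_store": {"backend": "qdrant", "collection": "docs"},
  "llm":          {"backend": "chatglm2-6b"}
}
""")

def build_pipeline(cfg):
    """Assemble the embedder, vector store, and LLM backend from config,
    so swapping providers needs no code changes -- only a config edit."""
    embedder = {"word2vec": "Word2VecEmbedder",
                "openai": "OpenAIEmbedder"}[cfg["embedding"]["backend"]]
    store = {"qdrant": "QdrantStore"}[cfg["vector_store"]["backend"]]
    llm = {"chatglm2-6b": "ChatGLMBackend",
           "openai": "OpenAIBackend"}[cfg["llm"]["backend"]]
    return embedder, store, llm

print(build_pipeline(CONFIG))
```

Changing `"word2vec"` to `"openai"` in the config would select a different embedder without touching the pipeline code, which is the experimentation workflow the description claims.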
No commits in the last 6 months.
Stars: 86
Forks: 9
Language: Python
License: —
Category: —
Last pushed: Oct 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/OceanPresentChao/llm-corpus"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives:
- yichuan-w/LEANN: [MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...
- aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation: Advanced document extraction and chunking techniques for retrieval augmented generation that is...
- byerlikaya/SmartRAG: Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....
- mrutunjay-kinagi/ragsearch: This project aims to build a Retrieval-Augmented Generation (RAG) engine to provide...
- Omkar-Wagholikar/adora: Python package that makes it easy to spin up a custom Retrieval-Augmented Generation (RAG) pipeline.