messkan/rag-chunk
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.
Implements six chunking strategies, including header-aware and embedding-based semantic splitting, with token-accurate chunking via tiktoken for specific LLMs (GPT-3.5, GPT-4, etc.). Evaluates chunk quality with precision, recall, and F1-score, and supports embedding-based semantic retrieval using sentence-transformers as an alternative to lexical matching. Exports results to JSON/CSV and optionally integrates LangChain components for recursive character splitting.
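To make two of the ideas above concrete, here is a minimal standard-library sketch of header-aware splitting and of precision/recall/F1 scoring. It does not use or reproduce rag-chunk's own code or CLI: the function names `split_by_headers` and `precision_recall_f1` are hypothetical, and scoring retrieved chunk indices against a hand-labelled set of relevant indices is an assumption about how such an evaluation could be set up, not a description of the tool's method.

```python
import re


def split_by_headers(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX header.

    Simplification (assumption): '#' lines inside fenced code blocks are
    also treated as headers; a real splitter would need to track fences.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # An ATX header is 1-6 '#' characters followed by whitespace.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (e.g. leading blank lines).
    return [c for c in chunks if c]


def precision_recall_f1(retrieved: set[int], relevant: set[int]) -> tuple[float, float, float]:
    """Score retrieved chunk indices against the indices labelled relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    doc = "# Intro\nRAG splits documents.\n## Setup\nInstall the tool.\n"
    for i, chunk in enumerate(split_by_headers(doc)):
        print(i, repr(chunk))
    # Hypothetical labels: chunks 0 and 2 were retrieved, 0 and 1 are relevant.
    print(precision_recall_f1({0, 2}, {0, 1}))
```

Scoring at the chunk-index level keeps the metric independent of the retrieval method, so the same function would apply whether chunks are ranked by lexical overlap or by embedding similarity.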
Stars: 104
Forks: 5
Language: Python
License: MIT
Category:
Last pushed: Jan 18, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/messkan/rag-chunk"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
chonkie-inc/chonkie
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust...
speedyk-005/chunklet-py
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs,...
andreshere00/Splitter_MR
Chunk your data into markdown text blocks for your LLM applications
chonkie-inc/chonkiejs
🦛 CHONK your texts with Chonkie ✨ Type-friendly, light-weight, fast and super-simple chunking library
jchunk-io/jchunk
JChunk is a lightweight and flexible library designed to provide multiple strategies for text...