ALucek/chunking-strategies
An Overview of the Latest Document Chunking Research
Implements several chunking strategies (character/token-based, recursive, semantic, cluster semantic, and LLM-based) for splitting text in RAG pipelines and vector database ingestion. Based on ChromaDB research comparing chunking methods, it evaluates empirically how different segmentation strategies affect downstream retrieval performance. It integrates with vector databases and embedding models to test end-to-end RAG workflows under various chunking configurations.
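As a rough illustration of the simplest strategy named above, character-based chunking with overlap, here is a minimal sketch. It is not taken from the repository: the function name `chunk_by_characters` and its `chunk_size` and `overlap` parameters are hypothetical, chosen only to show the sliding-window idea.

```python
def chunk_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []

    # Each window starts `step` characters after the previous one.
    step = chunk_size - overlap
    # Stop early enough that no trailing chunk lies entirely inside
    # the previous chunk's overlap region.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_by_characters("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a little shared context between neighbouring chunks, which is one of the knobs a chunking evaluation would typically vary alongside chunk size.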
No commits in the last 6 months.
Stars: 85
Forks: 18
Language: Jupyter Notebook
License: not listed
Category: not listed
Last pushed: Nov 25, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/ALucek/chunking-strategies"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
chonkie-inc/chonkie
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust...
speedyk-005/chunklet-py
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs,...
andreshere00/Splitter_MR
Chunk your data into markdown text blocks for your LLM applications
chonkie-inc/chonkiejs
🦛 CHONK your texts with Chonkie ✨ Type-friendly, light-weight, fast and super-simple chunking library
jchunk-io/jchunk
JChunk is a lightweight and flexible library designed to provide multiple strategies for text...