chonkie and chunking-strategies

chonkie, a production-ready chunking library, and chunking-strategies, a research overview repository, are **complements**. The research overview can inform design and benchmarking decisions for the library, while practitioners using chonkie can consult chunking-strategies to understand the algorithmic tradeoffs behind their chosen chunking strategy.

chonkie
Score: 83 (Verified)
Maintenance 25/25
Adoption 15/25
Maturity 25/25
Community 18/25
Stars: 3,829
Forks: 256
Downloads:
Commits (30d): 53
Language: Python
License: MIT
Risk flags: none

chunking-strategies
Score: 36 (Emerging)
Maintenance 0/25
Adoption 9/25
Maturity 8/25
Community 19/25
Stars: 85
Forks: 18
Downloads:
Commits (30d): 0
Language: Jupyter Notebook
License:
Risk flags: No License, Stale 6m, No Package, No Dependents
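
The headline scores are consistent with a plain sum of the four category scores. This is an inference from the numbers shown on this page, not a documented formula, and the short check below only reproduces that arithmetic.

```python
# Category scores copied from the scorecards above (each out of 25).
category_scores = {
    "chonkie": {"Maintenance": 25, "Adoption": 15, "Maturity": 25, "Community": 18},
    "chunking-strategies": {"Maintenance": 0, "Adoption": 9, "Maturity": 8, "Community": 19},
}

# Assumption: the overall score is the unweighted sum of the four categories.
totals = {name: sum(parts.values()) for name, parts in category_scores.items()}

for name, total in totals.items():
    print(f"{name}: {total}/100")
```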

About chonkie

chonkie-inc/chonkie

🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines

Provides pluggable chunking strategies—recursive, semantic, code-aware, and LLM-based—with composable pipeline workflows that chain multiple chunkers and refineries together. Integrates with 32+ tools across tokenizers (GPT-2, BPE), embeddings (OpenAI, Sentence Transformers), vector databases, and LLMs, while supporting 56 languages out-of-the-box through modular dependency installation.
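
To make the idea of recursive chunking concrete, here is a minimal, dependency-free sketch. It is not chonkie's API or implementation: the function name `recursive_chunk`, the `SEPARATORS` hierarchy, and the character-based size limit are all assumptions chosen for illustration. The text is split on the coarsest separator first, and any piece that is still too large is split again on the next finer one.

```python
from typing import List

# Hypothetical separator hierarchy: paragraphs, then lines, then words.
SEPARATORS = ["\n\n", "\n", " "]


def recursive_chunk(text: str, max_chars: int = 200, level: int = 0) -> List[str]:
    """Split text into chunks of at most max_chars characters."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if level >= len(SEPARATORS):
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep = SEPARATORS[level]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            # Keep packing pieces into the current chunk while they fit.
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with a finer separator.
            chunks.extend(recursive_chunk(piece, max_chars, level + 1))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\n" + "word " * 60
for chunk in recursive_chunk(sample, max_chars=100):
    print(len(chunk), repr(chunk[:30]))
```

A production library would typically measure size in tokens rather than characters and may keep separators attached to chunks; this sketch drops the separator at each chunk boundary for simplicity.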

About chunking-strategies

ALucek/chunking-strategies

An Overview of the Latest Document Chunking Research

Implements multiple chunking strategies—including character/token-based, recursive, semantic, cluster semantic, and LLM-based approaches—to optimize text splitting for RAG pipelines and vector database ingestion. Based on ChromaDB research comparing chunking methods, it provides empirical evaluation of how different segmentation strategies impact downstream retrieval performance. Integrates with vector databases and embedding models to test end-to-end RAG workflows with various chunking configurations.
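
As a point of reference for the simplest strategy in that list, the sketch below shows fixed-size token chunking with overlap. It is not code from the repository: `token_chunk` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
from typing import List


def token_chunk(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of chunk_size tokens that share overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


numbers = " ".join(str(i) for i in range(25))
for chunk in token_chunk(numbers, chunk_size=10, overlap=2):
    print(chunk)
```

The overlap repeats the tail of each chunk at the head of the next, which trades some index size for a lower chance of cutting a relevant passage in half. Comparing that trade-off across strategies is the kind of question the evaluations described above address.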

Scores updated daily from GitHub, PyPI, and npm data.