mirth/chonky

Fully neural approach for text chunking

/ 100

Established

Uses fine-tuned transformer models (ModernBERT, mBERT) that learn semantic boundaries directly from training data, outperforming rule-based and embedding-similarity approaches on standard benchmarks. Integrates with RAG pipelines and supports markup removal for HTML, XML, and Markdown input. Multiple model variants, ranging from 66M to 396M parameters and including multilingual options, are available on Hugging Face.
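The boundary-prediction idea described above can be sketched as follows. This is a minimal illustration, not the project's API, which this page does not show: `split_at_boundaries`, `toy_boundary_predictor`, and `chunk` are hypothetical names, and the toy predictor is a rule-based placeholder for where a fine-tuned token-classification model would supply boundary offsets.

```python
from typing import Callable, List


def split_at_boundaries(text: str, boundaries: List[int]) -> List[str]:
    """Cut text at the given character offsets and drop empty chunks."""
    chunks: List[str] = []
    start = 0
    for end in sorted(set(boundaries)):
        # Ignore offsets that are out of range or would not advance the cursor.
        if start < end <= len(text):
            piece = text[start:end].strip()
            if piece:
                chunks.append(piece)
            start = end
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


def toy_boundary_predictor(text: str) -> List[int]:
    """Placeholder predictor (assumption): a boundary after every blank line.

    A neural chunker would instead label tokens and return the end offsets
    of the tokens it classifies as chunk boundaries.
    """
    offsets: List[int] = []
    idx = text.find("\n\n")
    while idx != -1:
        offsets.append(idx + 2)
        idx = text.find("\n\n", idx + 2)
    return offsets


def chunk(text: str,
          predictor: Callable[[str], List[int]] = toy_boundary_predictor) -> List[str]:
    """Split text into chunks using whichever boundary predictor is passed in."""
    return split_at_boundaries(text, predictor(text))
```

Keeping the predictor as a plain callable means a learned model can replace the placeholder without changing the splitting logic, for example `chunk(document, predictor=my_model_offsets)` where `my_model_offsets` is whatever function wraps the model.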

407 stars and 312 monthly downloads. Available on PyPI.

Maintenance 6 / 25

Adoption 16 / 25

Maturity 18 / 25

Community 10 / 25

Stars: 407
Forks:
Language: Python
License: MIT
Category: document-chunking
Last pushed: Oct 23, 2025
Monthly downloads: 312
Commits (30d):
Dependencies:

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/mirth/chonky"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
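The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: the path pattern `quality/<domain>/<owner>/<repo>` is generalized from the single curl example above, the response is assumed to be JSON, and the function names are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example; everything after it is assumed
# to follow the pattern <domain>/<owner>/<repo>.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(domain: str, owner: str, repo: str) -> str:
    """Build the endpoint URL following the pattern of the curl example."""
    return f"{BASE_URL}/{domain}/{owner}/{repo}"


def fetch_quality(domain: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch one record and parse the body as JSON (assumed response format)."""
    url = build_quality_url(domain, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("nlp", "mirth", "chonky")` requests the same URL as the curl command. Each call counts against the daily request limit, so caching the result locally is worth considering for repeated lookups.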

Related tools

sentencizer/sentencizer

A sentence splitting (sentence boundary disambiguation) library for Go. It is rule-based and...

prajwal10001/semantic-chunker-langchain

Token-aware, LangChain-compatible semantic chunker with PDF, markdown, and layout support
