mirth/chonky

Fully neural approach for text chunking

Score: 50 / 100 (Established)

Uses fine-tuned transformer models (ModernBERT, mBERT) that learn semantic chunk boundaries directly from training data; the project reports better results than rule-based and embedding-similarity approaches on standard benchmarks. Integrates with RAG pipelines and can strip markup from HTML, XML, and Markdown input. Multiple model variants, ranging from 66M to 396M parameters and including multilingual options, are available on Hugging Face.
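To make the boundary-based idea concrete, here is a minimal sketch of the final splitting step only. It assumes a model has already marked the character offsets at which chunks end; the function name `split_at_boundaries` and the offset format are illustrative assumptions, not Chonky's actual API.

```python
from typing import List


def split_at_boundaries(text: str, boundary_ends: List[int]) -> List[str]:
    """Cut `text` after each character offset in `boundary_ends`.

    `boundary_ends` is a hypothetical stand-in for the positions a
    boundary-prediction model would emit; no model is run here.
    """
    chunks: List[str] = []
    start = 0
    for end in sorted(set(boundary_ends)):
        # Skip offsets that are out of range or would yield an empty slice.
        if end <= start or end > len(text):
            continue
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    # Whatever follows the last boundary forms the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks
```

In a real pipeline the offsets would come from the model's predictions rather than being supplied by hand, and each resulting chunk would then be embedded and indexed for retrieval.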

407 stars and 312 monthly downloads. Available on PyPI.

Maintenance 6 / 25
Adoption 16 / 25
Maturity 18 / 25
Community 10 / 25

Stars: 407
Forks: 16
Language: Python
License: MIT
Last pushed: Oct 23, 2025
Monthly downloads: 312
Commits (30d): 0
Dependencies: 1

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/mirth/chonky"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
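The same request can be made from Python with only the standard library. This is a sketch under two assumptions: that the endpoint returns JSON, and that other projects follow the same `category/owner/repo` path pattern as the single URL shown above. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

# Base path taken from the curl example; the path layout beyond that
# one URL is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, owner: str, repo: str) -> str:
    """Build the quality-endpoint URL for one project."""
    return f"{BASE_URL}/{category}/{owner}/{repo}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the body is JSON."""
    url = build_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_quality("nlp", "mirth", "chonky")` issues the same GET request as the curl command. The response schema is not documented here, so inspect the returned object before relying on specific fields.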