agamm/semantic-split
A Python library to chunk/group your texts based on semantic similarity.
Leverages SentenceTransformers for semantic embeddings and spaCy for sentence tokenization to group semantically related sentences while preserving document structure. Designed specifically for RAG pipelines and vector database ingestion, enabling efficient retrieval of contextually relevant chunks for LLM prompts while reducing token costs. Supports pluggable similarity models and sentence splitters, with examples demonstrating integration into question-answering workflows over documents.
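The grouping idea described above can be sketched in a few lines of plain Python. This is an illustration only, not the semantic-split API: the regex splitter stands in for spaCy sentence tokenization, the bag-of-words vector stands in for a SentenceTransformers embedding, and the names `split_sentences`, `embed`, `semantic_chunks`, and the `threshold` default are all hypothetical. Both stand-ins are passed as parameters to mirror the pluggable splitter and similarity model the description mentions.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter (hypothetical stand-in for a spaCy tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector (hypothetical stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    text: str,
    threshold: float = 0.2,  # assumed cutoff; tune for a real embedding model
    splitter: Callable[[str], List[str]] = split_sentences,
    embedder: Callable[[str], Counter] = embed,
) -> List[List[str]]:
    """Group adjacent sentences whose similarity meets the threshold.

    Only neighbouring sentences are merged, so the original sentence
    order of the document is preserved.
    """
    chunks: List[List[str]] = []
    prev_vec = None
    for sentence in splitter(text):
        vec = embedder(sentence)
        if chunks and cosine(prev_vec, vec) >= threshold:
            chunks[-1].append(sentence)  # similar to the previous sentence
        else:
            chunks.append([sentence])  # topic shift: start a new chunk
        prev_vec = vec
    return chunks


text = (
    "Cats chase mice. Cats chase birds. "
    "Stock prices fell sharply. Stock prices rose later."
)
chunks = semantic_chunks(text)
```

With the toy embedding, the two sentences about cats end up in one chunk and the two about stock prices in another. A real pipeline would swap in a sentence-embedding model and a proper sentence tokenizer through the `embedder` and `splitter` parameters.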
103 stars. No commits in the last 6 months.
Stars: 103
Forks: 9
Language: Python
License: —
Category:
Last pushed: Jul 11, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/agamm/semantic-split"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
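The same request can be made from Python with the standard library. This is a minimal sketch: the URL is the one shown in the curl command, while the function name `fetch_quality` and the assumption that the endpoint returns a JSON body are not confirmed by this page.

```python
import json
import urllib.request

# URL copied from the curl example above.
QUALITY_URL = "https://pt-edge.onrender.com/api/v1/quality/embeddings/agamm/semantic-split"


def fetch_quality(url: str = QUALITY_URL, timeout: float = 10.0):
    """Fetch the quality record for the repository.

    Assumes the response body is JSON; if it is not, json.loads
    raises a ValueError.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Performs a real network request; keyless use is limited to 100 requests/day.
    print(fetch_quality())
```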
Higher-rated alternatives
jparkerweb/semantic-chunking
🍱 semantic-chunking ⇢ semantically create chunks from large document for passing to LLM workflows
drittich/SemanticSlicer
🧠✂️ SemanticSlicer — A smart text chunker for LLM-ready documents.
ndgigliotti/afterthoughts
Sentence-aware embeddings using late chunking with transformers.
ReemHal/Semantic-Text-Segmentation-with-Embeddings
Uses GloVe embeddings and greedy sequence segmentation to semantically segment a text document...
smart-models/Normalized-Semantic-Chunker
Cutting-edge tool that unlocks the full potential of semantic chunking