text2vec and Top2Vec

                 text2vec         Top2Vec
Score            73 (Verified)    65 (Established)
Maintenance      10/25            0/25
Adoption         19/25            19/25
Maturity         25/25            25/25
Community        19/25            21/25
Stars            4,950            3,109
Forks            428              377
Downloads        1,922            5,399
Commits (30d)    0                0
Language         Python           Python
License          Apache-2.0       BSD-3-Clause
Risk flags       None             Stale 6m

About text2vec

shibing624/text2vec

text2vec, text to vector. A text-embedding toolkit that converts text into vector matrices; it implements Word2Vec, RankBM25, Sentence-BERT, CoSENT, and other text-representation and text-similarity models, ready to use out of the box.

Supports multi-GPU/multi-CPU batch inference via multiprocessing and includes a command-line interface for scripting bulk text vectorization tasks. Built on PyTorch with implementations of contrastive learning methods (CoSENT's ranking-aware loss, BGE's RetroMAE pretraining with contrastive finetuning) that optimize for semantic matching; includes pre-trained checkpoints on HuggingFace for Chinese, multilingual, and cross-lingual tasks. Integrates with BERT-family models and sentence-transformers architectures, with tooling for supervised fine-tuning on custom NLI and STS datasets.
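The semantic-matching task these models optimize reduces to encode-then-score: embed the query and corpus, then rank by cosine similarity. A minimal sketch of that scoring step with toy vectors (hand-written stand-ins for model output, not real text2vec embeddings):

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for encoder output.
query = [0.9, 0.1, 0.0]
corpus = {
    "close match": [0.8, 0.2, 0.1],
    "unrelated":   [0.0, 0.1, 0.9],
}

# Rank corpus entries by similarity to the query, as semantic search does.
ranked = sorted(corpus, key=lambda k: cos_sim(query, corpus[k]), reverse=True)
```

Contrastive objectives like CoSENT's ranking-aware loss train the encoder so that, under exactly this metric, matched pairs score above mismatched ones.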

About Top2Vec

ddangelov/Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.

Combines Doc2Vec, BERT Sentence Transformers, or Universal Sentence Encoder embeddings with UMAP dimensionality reduction and HDBSCAN clustering to automatically discover topics without predefined counts or stop word lists. The contextual variant uses token-level embeddings to identify multiple topics per document and intra-document topic spans, exposing results through methods for topic distribution, relevance scoring, and token-level topic assignments.
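After clustering, a topic vector is the centroid of one dense cluster of document vectors, and the word vectors nearest that centroid label the topic. A minimal sketch of that step with toy 2-D vectors (the cluster is a stand-in for HDBSCAN output on UMAP-reduced embeddings; the word vectors are hypothetical):

```python
import math

def centroid(vectors):
    """Topic vector = mean of the document vectors in one dense cluster."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy document vectors already grouped by a clusterer.
cluster = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]]
topic_vec = centroid(cluster)

# The word whose vector lies nearest the topic vector labels the topic.
word_vecs = {"finance": [1.0, 0.1], "sports": [0.0, 1.0]}
label = max(word_vecs, key=lambda w: cos_sim(topic_vec, word_vecs[w]))
```

Because documents, words, and topic centroids live in the same embedding space, no predefined topic count or stop-word list is needed: each dense cluster yields one topic.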

Scores updated daily from GitHub, PyPI, and npm data.