text2vec and Top2Vec
About text2vec
shibing624/text2vec
text2vec, text to vector. A text-embedding toolkit that converts text into vector representations, implementing Word2Vec, RankBM25, Sentence-BERT, CoSENT, and other text-representation and text-similarity models, usable out of the box.
Supports multi-GPU/multi-CPU batch inference via multiprocessing and includes a command-line interface for scripting bulk text vectorization tasks. Built on PyTorch with implementations of contrastive learning methods (CoSENT's ranking-aware loss, BGE's RetroMAE pretraining with contrastive finetuning) that optimize for semantic matching; includes pre-trained checkpoints on HuggingFace for Chinese, multilingual, and cross-lingual tasks. Integrates with BERT-family models and sentence-transformers architectures, with tooling for supervised fine-tuning on custom NLI and STS datasets.
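The core workflow the library enables is ranking texts by embedding similarity. A minimal pure-Python sketch of that idea is below; the toy vectors stand in for real sentence embeddings (which in text2vec would come from a model such as `SentenceModel` — treat that name and its usage as an assumption to verify against the docs):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; in practice these would be produced by a
# sentence-embedding model (e.g. one of text2vec's pre-trained checkpoints).
query = [0.1, 0.3, 0.5, 0.1]
candidates = {
    "doc_a": [0.1, 0.29, 0.52, 0.09],  # nearly parallel to the query
    "doc_b": [0.9, 0.05, 0.02, 0.03],  # points in a different direction
}

# Rank candidates by similarity to the query, best match first.
ranked = sorted(candidates,
                key=lambda k: cosine_similarity(query, candidates[k]),
                reverse=True)
```

Semantic search, duplicate detection, and STS-style scoring all reduce to this ranking step once texts are embedded; models trained with a ranking-aware loss like CoSENT are optimized so that these cosine scores order pairs correctly.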
About Top2Vec
ddangelov/Top2Vec
Top2Vec learns jointly embedded topic, document and word vectors.
Combines Doc2Vec, BERT Sentence Transformers, or Universal Sentence Encoder embeddings with UMAP dimensionality reduction and HDBSCAN clustering to automatically discover topics without predefined counts or stop word lists. The contextual variant uses token-level embeddings to identify multiple topics per document and intra-document topic spans, exposing results through methods for topic distribution, relevance scoring, and token-level topic assignments.
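The pipeline above can be illustrated with a toy sketch: in a jointly embedded space, a topic vector is the centroid of a cluster of document vectors, and the topic's words are the word vectors nearest that centroid. The hand-rolled vectors and the fixed cluster assignment below are stand-ins for the Doc2Vec/UMAP/HDBSCAN stages, not Top2Vec's actual implementation:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy jointly embedded space: documents and words share 3 dimensions.
doc_vectors = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.8, 0.2, 0.1],
    "d3": [0.1, 0.9, 0.1],
    "d4": [0.0, 0.8, 0.2],
}
word_vectors = {
    "sports": [0.95, 0.05, 0.0],
    "game":   [0.85, 0.1, 0.05],
    "music":  [0.05, 0.9, 0.1],
    "band":   [0.1, 0.85, 0.15],
}

# Pretend a density-based clusterer (HDBSCAN in Top2Vec) grouped documents:
clusters = {0: ["d1", "d2"], 1: ["d3", "d4"]}

# Topic vector = centroid of the cluster; topic words = nearest word vectors.
topic_words = {}
for topic_id, members in clusters.items():
    topic_vec = centroid([doc_vectors[d] for d in members])
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(topic_vec, word_vectors[w]),
                    reverse=True)
    topic_words[topic_id] = ranked[:2]
```

Because the number of topics falls out of the clustering step rather than being a model parameter, no topic count or stop word list has to be specified up front.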