ncbi-nlp/BioSentVec

BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences


Emerging

This project provides fastText word vectors (BioWordVec, 200-dim) and sent2vec sentence embeddings (BioSentVec, 700-dim) for biomedical NLP tasks, both trained on 4.9 billion tokens from PubMed literature and MIMIC-III clinical notes. The word vectors handle out-of-vocabulary terms through fastText's subword approach. The models are evaluated on domain-specific similarity benchmarks (MayoSRS, BIOSSES, MedSTS), where they outperform general-purpose embeddings such as Universal Sentence Encoder on clinical text.
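To illustrate how a subword approach can produce a vector for an out-of-vocabulary term, here is a minimal toy sketch, not the project's actual loading code. It splits a word into character n-grams, hashes each n-gram into a bucket, and averages the bucket vectors. The bucket count, the random vectors, and the use of CRC32 as the hash are assumptions made for this illustration (fastText itself uses a different hash and learned vectors), and the 3 to 6 n-gram range is fastText's default rather than a confirmed BioWordVec setting.

```python
import zlib
import numpy as np

DIM = 200            # matches the 200-dim word vectors described above
BUCKETS = 10_000     # toy bucket count (assumption for this sketch)
MIN_N, MAX_N = 3, 6  # fastText's default n-gram range (assumed here)

# Random stand-ins for learned n-gram vectors (assumption: real models learn these).
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(BUCKETS, DIM)).astype(np.float32)


def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    """Return character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def subword_vector(word):
    """Average the bucket vectors of the word's n-grams (CRC32 is a stand-in hash)."""
    ids = [zlib.crc32(g.encode("utf-8")) % BUCKETS for g in char_ngrams(word)]
    return bucket_vectors[ids].mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Any string gets a vector, even one never seen during training.
vec = subword_vector("deltaproteobacteria")
print(vec.shape)
```

Because the vector is built only from n-grams, words that share many n-grams (for example a term and its plural) end up sharing most of their components, which is the property that makes the approach useful for rare biomedical terms.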

611 stars. No commits in the last 6 months.

Stale (6 months). No package. No dependents.

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 22 / 25


Stars 611

Language Jupyter Notebook

License —

Higher-rated alternatives

avidale/compress-fasttext

Tools for shrinking fastText models (in gensim format)

dselivanov/text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

vzhong/embeddings

Fast, DB Backed pretrained word embeddings for natural language processing.

dccuchile/spanish-word-embeddings

Spanish word embeddings computed with different methods and from different corpora

ibrahimsharaf/doc2vec

Long(er) text representation and classification using Doc2Vec embeddings
