colonelwatch/abstracts-search
Semantic search engine indexing 110 million academic publications
Generates dense vector embeddings from 110M academic abstracts using the Stella 1.5B model, then builds a FAISS index for fast approximate nearest-neighbor retrieval. All components (embeddings, index, search interface) are published as separate Hugging Face datasets and Spaces, so each can be used on its own. Integrates with OpenAlex for publication metadata and supports incremental syncing against quarterly dataset snapshots to keep the index current. The modular architecture allows running just the search interface without rebuilding anything, or performing a full reindex on commodity hardware (RTX 3060, 32GB RAM plus swap) in under a week.
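To make the retrieval step concrete, here is a minimal sketch of what nearest-neighbor search over embeddings computes. It is not the project's code: it uses exact brute-force cosine similarity in plain Python, whereas the project uses a FAISS index to approximate the same ranking at 110M scale. The toy three-dimensional vectors, the `paper-*` IDs, and the `normalize` and `top_k` helpers are all hypothetical and stand in for real Stella embeddings and OpenAlex work IDs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query.

    This is an exact linear scan over every document. An approximate index
    (such as FAISS) trades a little accuracy to avoid scanning everything.
    """
    q = normalize(query)
    scored = []
    for doc_id, vec in corpus.items():
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))  # cosine similarity
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical toy embeddings; real abstract embeddings have far more dimensions.
corpus = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.0, 1.0, 0.0],
    "paper-c": [0.9, 0.1, 0.0],
}

# A query vector that points mostly along the first axis.
results = top_k([1.0, 0.05, 0.0], corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}\t{score:.4f}")
```

The linear scan costs one dot product per document per query, which is why a corpus of this size calls for an approximate index rather than an exhaustive comparison.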
Stars: 102
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Jan 19, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/colonelwatch/abstracts-search"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Related tools
ahr9n/quranic-search-v2
Quranic Lexical/Semantic Search
VIGINUM-FR/D3lta
A Python implementation of the D3lta algorithm for duplicated textual content detection
geetanjaliapp/geetanjali
RAG-powered ethical decision guidance from Bhagavad Geeta. Analyze dilemmas, get structured...
hazemabdelkawy/SunnahGPT
SunnahGPT is a natural language processing (NLP) project aimed at scraping hadith data from the...
mufaizz/FAIZ-AI
FAIZ AI 🔍 – The search bot that finds what others miss. Searches HTTP, FTP, IPFS & Torrent with...