augustwester/searchthearxiv
The code powering searchthearxiv.com, a simple semantic search engine for more than 300,000 ML papers on arXiv.
Embeds arXiv papers using OpenAI's `text-embedding-ada-002` model and stores them in Pinecone's vector database for semantic search. The implementation separates the data pipeline (automated weekly updates from Kaggle's arXiv dataset) from the web application, and both components are containerized for cloud deployment. Pre-computed embeddings for 300K+ papers are also published on Kaggle for independent use.
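To make the retrieval step concrete, here is a minimal sketch of embedding-based search, not the repository's actual code. It assumes cosine similarity as the ranking metric and uses a plain dictionary in place of Pinecone. The names `cosine_similarity`, `search`, and `toy_index` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 1536-dimensional vectors that `text-embedding-ada-002` produces.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=3):
    """Rank every paper in `index` by similarity to the query vector.

    `index` maps a paper id to its embedding. This brute-force scan is an
    assumption made for illustration; a vector database such as Pinecone
    would normally answer the same question with an approximate index.
    """
    scored = [
        (cosine_similarity(query_vector, vector), paper_id)
        for paper_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(paper_id, score) for score, paper_id in scored[:top_k]]


# Toy 3-dimensional embeddings (hypothetical data, not real paper vectors).
toy_index = {
    "paper-a": [0.9, 0.1, 0.0],
    "paper-b": [0.1, 0.9, 0.0],
    "paper-c": [0.0, 0.2, 0.9],
}

# In a real pipeline the query text would be embedded by the same model
# that embedded the papers; here the query vector is written by hand.
results = search([1.0, 0.0, 0.0], toy_index, top_k=2)
for paper_id, score in results:
    print(f"{paper_id}: {score:.3f}")
```

The same ranking logic should apply to the pre-computed embeddings published on Kaggle, provided the query is embedded with the same model as the papers.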
171 stars. No commits in the last 6 months.
Stars: 171
Forks: 15
Language: Python
License: GPL-3.0
Category
Last pushed: Apr 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/augustwester/searchthearxiv"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
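The same request can be sketched in Python using only the standard library. The path layout (category, then owner, then repository) is inferred from the single URL shown above, and the JSON response format is an assumption, since the listing does not document it. The helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the segment layout is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    segments = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(segments)}"


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the record for one repository.

    Assumes the endpoint returns JSON. No API key is sent, so each call
    counts against the keyless daily quota mentioned in the listing.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_quality_url("embeddings", "augustwester", "searchthearxiv")
print(url)
```

Calling `fetch_quality("embeddings", "augustwester", "searchthearxiv")` performs the actual network request; it is left uncalled here so that the sketch runs offline.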
Higher-rated alternatives
aryn-ai/sycamore
🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
deepset-ai/haystack-tutorials
Here you can find all the Tutorials for Haystack 📓
MaartenGr/PolyFuzz
Fuzzy string matching, grouping, and evaluation.
unum-cloud/USearch
Fast Open-Source Search & Clustering engine × for Vectors & Arbitrary Objects × in C++, C,...
pinecone-io/pinecone-datasets
An open-source dataset library for pre-embedded dataset: create your own data catalog, or use...