augustwester/searchthearxiv

The code powering searchthearxiv.com, a simple semantic search engine for more than 300,000 ML papers on arXiv.

/ 100

Emerging

Embeds arXiv papers using OpenAI's `text-embedding-ada-002` model and stores the vectors in Pinecone's vector database for semantic search. The implementation separates the data pipeline (updated automatically every week from Kaggle's arXiv dataset) from the web application, and both components are containerized for cloud deployment. Pre-computed embeddings for 300K+ papers are also published on Kaggle for independent use.
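The embed, store, and query flow described above can be illustrated with a minimal local sketch. It is not the repository's code: the real service calls OpenAI for embeddings and Pinecone for storage, while this example swaps in a bag-of-words vector and an in-memory index so that it runs without API keys. The names `build_vocab`, `toy_embed`, and `InMemoryIndex`, along with the paper IDs and texts, are hypothetical and exist only for illustration.

```python
import math


def build_vocab(texts):
    """Assign each distinct word in the corpus its own vector dimension."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class InMemoryIndex:
    """Hypothetical stand-in for a vector database: maps paper ID to vector."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, paper_id, vector):
        self._vectors[paper_id] = vector

    def query(self, vector, top_k=3):
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(vector, stored)), paper_id)
            for paper_id, stored in self._vectors.items()
        ]
        scored.sort(reverse=True)
        return [(paper_id, score) for score, paper_id in scored[:top_k]]


# Made-up corpus standing in for paper titles or abstracts.
papers = {
    "p1": "graph neural networks for molecular property prediction",
    "p2": "diffusion models for image generation",
    "p3": "reinforcement learning for robot control",
}

vocab = build_vocab(papers.values())
index = InMemoryIndex()
for paper_id, text in papers.items():
    index.upsert(paper_id, toy_embed(text, vocab))

results = index.query(toy_embed("neural networks on graph data", vocab), top_k=2)
print(results[0][0])  # ID of the best match
```

A learned embedding model would also match queries that share meaning but no words with a paper, which this word-count stand-in cannot do. The index-then-query structure is the same in both cases.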

171 stars. No commits in the last 6 months.

Stale 6m · No Package · No Dependents

Maintenance 2 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 12 / 25


Stars 171

Forks

Language Python

License GPL-3.0

Higher-rated alternatives

aryn-ai/sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.

deepset-ai/haystack-tutorials

Here you can find all the Tutorials for Haystack 📓

MaartenGr/PolyFuzz

Fuzzy string matching, grouping, and evaluation.

unum-cloud/USearch

Fast Open-Source Search & Clustering engine × for Vectors & Arbitrary Objects × in C++, C,...

pinecone-io/pinecone-datasets

An open-source dataset library for pre-embedded dataset: create your own data catalog, or use...
