pinecone-io/pinecone-datasets
An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
Supports both dense and sparse vector embeddings with lazy-loaded pandas DataFrames, enabling efficient local exploration before upserting. The library streams pre-embedded datasets from GCS with built-in batch iteration helpers optimized for Pinecone index ingestion, while also supporting custom data catalogs for organization-specific embeddings.
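The batch iteration described above can be illustrated with a minimal, library-independent sketch. It does not use the pinecone-datasets API: the helper name `iter_batches` and the synthetic records are hypothetical, and the record layout (`id`, `values`, `sparse_values`) is an assumption modeled on the shape commonly used for Pinecone upserts.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_batches(records: Iterable[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield lists of at most `batch_size` records (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(records)
    while True:
        # Pull the next chunk lazily so the full dataset never sits in memory.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Synthetic records with a dense vector and a sparse vector (assumed layout).
records = (
    {
        "id": str(i),
        "values": [0.1 * i, 0.2 * i],
        "sparse_values": {"indices": [i], "values": [1.0]},
    }
    for i in range(250)
)

batch_sizes = [len(batch) for batch in iter_batches(records, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

In a real ingestion loop, each yielded batch would be passed to the index client's upsert call; fixed-size batches keep each request within the service's size limits.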
Stars
34
Forks
14
Language
Python
License
Apache-2.0
Category
Last pushed
Mar 12, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/embeddings/pinecone-io/pinecone-datasets"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
deepset-ai/haystack-tutorials
Here you can find all the Tutorials for Haystack 📓
unum-cloud/USearch
Fast Open-Source Search & Clustering engine × for Vectors & Arbitrary Objects × in C++, C,...
towhee-io/towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
aryn-ai/sycamore
🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.
MaartenGr/PolyFuzz
Fuzzy string matching, grouping, and evaluation.