J0nasW/science-datalake
Unified data lake of 293M scientific papers from 8 scholarly sources + 13 ontologies (960 GB Parquet, queryable via DuckDB)
26 / 100
Experimental
No Package
No Dependents
Maintenance: 13 / 25
Adoption: 4 / 25
Maturity: 9 / 25
Community: 0 / 25
Stars: 8
Forks: —
Language: Jupyter Notebook
License: —
Category:
Last pushed: Mar 12, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/J0nasW/science-datalake"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
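The same request can be made from Python. The following is a minimal sketch, not an official client: the endpoint URL is taken from the curl command above, while the helper names `build_url` and `fetch_quality`, the reading of the path as a category followed by `owner/repo`, and the assumption that the response body is JSON are all illustrative guesses that the page does not confirm.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository.

    Assumption: the path is <category>/<owner>/<repo>, inferred only from
    the single example URL shown on the page.
    """
    # Keep the '/' between owner and repository name unescaped.
    return f"{BASE_URL}/{quote(category)}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Send a GET request and decode the body.

    Assumption: the endpoint returns JSON; the page does not document
    the response format.
    """
    with urllib.request.urlopen(build_url(category, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("llm-tools", "J0nasW/science-datalake")` would issue the same request as the curl command. No API key is sent, so the call counts against the keyless limit of 100 requests/day.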
Higher-rated alternatives
NVIDIA-NeMo/Curator (74): Scalable data preprocessing and curation toolkit for LLMs
MigoXLab/dingo (67): Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool
data-prep-kit/data-prep-kit (64): Open source project for data preparation for GenAI applications
cleanlab/cleanlab-studio (56): Client interface to Cleanlab Studio
TheDataStation/pneuma (46): LLM-Powered Data Discovery System for Tabular Data