malteos/llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
Used by 1 other package. No commits in the last 6 months. Available on PyPI.
Stars: 64
Forks: 6
Language: Python
License: Apache-2.0
Category:
Last pushed: Jul 29, 2024
Monthly downloads: 23
Commits (30d): 0
Dependencies: 9
Reverse dependents: 1
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/malteos/llm-datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
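The same request can be made from Python. The sketch below uses only the standard library and rests on two assumptions that the listing does not confirm: that the three path segments after `/quality/` are a category, a repository owner, and a repository name (inferred from the shape of the URL above), and that the endpoint returns JSON. The function names `build_quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL.

    Assumption: the path is /quality/<category>/<owner>/<repo>,
    inferred from the single example URL in the listing.
    """
    # Percent-encode each segment so unusual names cannot break the path.
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record without an API key (limited to 100 requests/day)."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    # Assumption: the response body is JSON; the listing does not say.
    return json.loads(body)
```

Calling `fetch_quality("transformers", "malteos", "llm-datasets")` issues the same GET request as the curl command. Since the unauthenticated limit is 100 requests/day, it may be worth caching the result rather than calling it in a loop.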
Higher-rated alternatives
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...
willxxy/ECG-Bench
A Unified Framework for Benchmarking Generative Electrocardiogram-Language Models (ELMs)
geobrain-ai/geogalactica
Code and datasets for paper "GeoGalactica: A Scientific Large Language Model in Geoscience"
HaoAreYuDong/MachineLearningLM
Scaling In-context Learning from Few-shot to 1,024-shot on Tabular ML