AmanPriyanshu/Stratified-LLM-Subsets-100K-1M-Scale
Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
No commits in the last 6 months.
Stars
1
Forks
—
Language
HTML
License
—
Category
Last pushed
Oct 04, 2025
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/AmanPriyanshu/Stratified-LLM-Subsets-100K-1M-Scale"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
Goekdeniz-Guelmez/mlx-lm-lora
Train Large Language Models on MLX.
uber-research/PPLM
Plug and Play Language Model implementation. Allows to steer topic and attributes of GPT-2 models.
VHellendoorn/Code-LMs
Guide to using pre-trained large language models of source code
ssbuild/chatglm_finetuning
chatglm 6b finetuning and alpaca finetuning
jarobyte91/pytorch_beam_search
A lightweight implementation of Beam Search for sequence models in PyTorch.