Math Reasoning Datasets Transformer Models

There are 37 math reasoning dataset projects tracked. One scores above 70 (Verified tier). The highest-rated is ExtensityAI/symbolicai at 75/100, with 1,677 stars and 2,722 monthly downloads. One of the top 10 is actively maintained.

Get all 37 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=math-reasoning-datasets&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
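For scripted access, the same request can be made from Python. The following is a minimal sketch using only the standard library; the function names `build_quality_url` and `fetch_quality` are hypothetical, and the only parameters used are the three that appear in the curl command (`domain`, `subcategory`, `limit`). The response schema is not described on this page, so the sketch returns the decoded JSON unchanged. It also does not show how an API key is supplied, since that is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain: str, subcategory: str, limit: int) -> str:
    """Assemble the query URL from the three parameters the curl command uses."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_quality(domain: str, subcategory: str, limit: int, timeout: float = 30.0):
    """Send the GET request and return the decoded JSON body as-is.

    Nothing is assumed about the fields in the response.
    """
    url = build_quality_url(domain, subcategory, limit)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("transformers", "math-reasoning-datasets", 20)` issues the same request as the curl command. Note that the command passes `limit=20`; whether a larger value returns all 37 entries in one call is not stated here. Each call counts against the daily request allowance.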

#	Model	Score	Tier	Stars	Language
1	ExtensityAI/symbolicai A neurosymbolic perspective on LLMs	75	Verified	1,677	Python
2	TIGER-AI-Lab/MMLU-Pro The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task...	49	Emerging	347	Python
3	deep-symbolic-mathematics/LLM-SR [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on...	42	Emerging	216	Python
4	zhudotexe/fanoutqa Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering...	42	Emerging	59	Python
5	microsoft/interwhen A framework for verifiable reasoning with language models.	42	Emerging	13	Python
6	HiThink-Research/MME-Finance [MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning	37	Emerging	44	Python
7	xlang-ai/Binder [ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages"	36	Emerging	325	Python
8	yifanzhang-pro/AutoMathText [ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative...	32	Emerging	90	Python
9	princeton-pli/AdaptMI [COLM 2025] Adaptive Skill-based In-context Math Instruction for Small...	31	Emerging	9	Python
10	SeekingDream/DyCodeEval Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning...	30	Emerging	255	Python
11	TIGER-AI-Lab/StructLM Code and data for "StructLM: Towards Building Generalist Models for...	30	Emerging	76	Python
12	AlphaPav/mem-kk-logic On Memorization of Large Language Models in Logical Reasoning	30	Emerging	76	Python
13	DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries [ACL 2025] Analyzing LLMs' Multilingual Knowledge Boundary Cognition Across...	30	Emerging	18	Jupyter Notebook
14	TIGER-AI-Lab/LongICLBench Code and Data for "Long-context LLMs Struggle with Long In-context Learning"...	29	Experimental	112	Python
15	declare-lab/LLM-PuzzleTest This repository is maintained to release dataset and models for multimodal...	29	Experimental	113	Python
16	TIGER-AI-Lab/MAmmoTH Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid...	29	Experimental	383	Jupyter Notebook
17	akjindal53244/Arithmo Small and Efficient Mathematical Reasoning LLMs	28	Experimental	73	Python
18	amazon-science/recode Releasing code for "ReCode: Robustness Evaluation of Code Generation Models"	28	Experimental	58	Python
19	google/curie Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long...	27	Experimental	29	Jupyter Notebook
20	martin-wey/CodeUltraFeedback CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)	26	Experimental	73	Python
21	QwenLM/PolyMath [NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath:...	25	Experimental	42	Python
22	bobxwu/learning-from-rewards-llm-papers A comprehensive collection of learning from rewards in the post-training and...	24	Experimental	64	—
23	ryokamoi/llm-self-correction-papers List of papers on Self-Correction of LLMs.	24	Experimental	80	—
24	reasoning-machines/CoCoGen Language Models of Code are Few-Shot Commonsense Learners (EMNLP 2022)	23	Experimental	86	Python
25	conditionWang/FLNK Federated Learning with New Knowledge -- explore to incorporate various new...	23	Experimental	86	—
26	gersteinlab/Struc-Bench [NAACL 2024] Struc-Bench: Are Large Language Models Good at Generating...	23	Experimental	55	Python
27	zjunlp/DynamicKnowledgeCircuits [ACL 2025] How Do LLMs Acquire New Knowledge? A Knowledge Circuits...	22	Experimental	47	Jupyter Notebook
28	kaistAI/LangBridge [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision	21	Experimental	96	Python
29	YangLing0818/SuperCorrect-llm [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought...	20	Experimental	87	Python
30	WooooDyy/MathCritique Implementation for the research paper "Enhancing LLM Reasoning via Critique...	20	Experimental	55	Python
31	merlerm/In-Context-Symbolic-Regression Official code implementation for the ACL 2024 Student Research Workshop...	20	Experimental	17	Python
32	joeljang/continual-knowledge-learning [ICLR 2022] Towards Continual Knowledge Learning of Language Models	20	Experimental	91	Python
33	UCSC-VLAA/vllm-safety-benchmark [ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in...	18	Experimental	87	Python
34	MMStar-Benchmark/MMStar [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on...	17	Experimental	204	Python
35	iiis-ai/IterativeQuestionComposing [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing...	16	Experimental	23	Python
36	TIGER-AI-Lab/TableCoT The code and data for paper "Large Language Models are few(1)-shot Table...	16	Experimental	48	Python
37	Eleanor-H/MUSTARD Code & data for ICLR 2024 spotlight paper: 🍯MUSTARD: Mastering Uniform...	14	Experimental	42	C++
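If the table is copied as text rather than fetched as JSON, its rows can be parsed directly. The sketch below assumes the columns are tab-separated, that the Model column holds the repository name followed by a space and its description, and that a dash in the Language column means no language is listed. The field names in the returned dictionary are hypothetical, and the four sample rows are copied from the table above.

```python
from collections import Counter

# Four rows copied from the table (tab-separated). "\u2014" is the dash
# shown in the Language column when no language is listed.
SAMPLE = (
    "1\tExtensityAI/symbolicai A neurosymbolic perspective on LLMs\t75\tVerified\t1,677\tPython\n"
    "7\txlang-ai/Binder [ICLR 2023] Code for the paper \"Binding Language Models in Symbolic Languages\"\t36\tEmerging\t325\tPython\n"
    "17\takjindal53244/Arithmo Small and Efficient Mathematical Reasoning LLMs\t28\tExperimental\t73\tPython\n"
    "23\tryokamoi/llm-self-correction-papers List of papers on Self-Correction of LLMs.\t24\tExperimental\t80\t\u2014\n"
)


def parse_row(line: str) -> dict:
    """Split one table row into typed fields."""
    rank, model, score, tier, stars, language = line.split("\t")
    # The repository name is the first space-delimited token of the Model column.
    repo, _, description = model.partition(" ")
    return {
        "rank": int(rank),
        "repo": repo,
        "description": description,
        "score": int(score),
        "tier": tier,
        "stars": int(stars.replace(",", "")),  # "1,677" -> 1677
        "language": None if language == "\u2014" else language,
    }


rows = [parse_row(line) for line in SAMPLE.splitlines()]
tier_counts = Counter(row["tier"] for row in rows)
```

With the rows parsed this way, `tier_counts` gives the number of projects per tier for whatever subset was pasted in, and sorting `rows` by `stars` instead of `score` gives an alternative ranking.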