Math Reasoning Datasets LLM Tools

Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.

This category tracks 60 math reasoning dataset tools. Two score above 50 (Established tier). The highest-rated is MMMU-Benchmark/MMMU at 52/100 with 548 stars.

Get all 60 projects as JSON (the example request below sets limit=20)

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
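
The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON body, as the heading above suggests; the shape of that body is not documented here, so the code returns whatever was decoded without interpreting it. The function names `build_url` and `fetch_projects` are illustrative, and whether the API accepts a `limit` larger than 20 is not confirmed by this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and query parameters copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the request URL; `limit` defaults to the value used in the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "math-reasoning-datasets",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Send the keyless GET request and decode the body as JSON.

    Assumption: the response is valid JSON. Its structure is not
    specified on this page, so it is returned unmodified.
    """
    with urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)


# Example usage (performs a network request, so it is left commented out):
# data = fetch_projects()
# print(json.dumps(data, indent=2)[:500])
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the 100 keyless requests per day.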

#	Tool	Score	Tier	Stars	Language
1	MMMU-Benchmark/MMMU This repo contains evaluation code for the paper "MMMU: A Massive...	52	Established	548	Python
2	pat-jj/DeepRetrieval [COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome	51	Established	696	Python
3	lupantech/MathVista MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts	47	Emerging	355	Jupyter Notebook
4	ise-uiuc/magicoder [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct	45	Emerging	2,086	Python
5	x66ccff/liveideabench [Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific...	42	Emerging	23	Jupyter Notebook
6	IAAR-Shanghai/xVerify xVerify: Efficient Answer Verifier for Reasoning Model Evaluations	40	Emerging	144	Jupyter Notebook
7	SuperBruceJia/Awesome-LLM-Self-Consistency Awesome LLM Self-Consistency: a curated list of Self-consistency in Large...	39	Emerging	120	—
8	sherryzyh/physical_reasoning_toolkit A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit...	37	Emerging	3	Python
9	GAIR-NLP/MathPile [NeurIPS D&B 2024] Generative AI for Math: MathPile	37	Emerging	419	Python
10	rxlqn/awesome-llm-self-reflection augmented LLM with self reflection	37	Emerging	139	—
11	killthefullmoon/PhyX PhyX: Does Your Model Have the "Wits" for Physical Reasoning?	36	Emerging	52	Python
12	iiis-ai/AutoMathText-V2 AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset	36	Emerging	6	HTML
13	yecchen/MIRAI Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"	36	Emerging	90	Python
14	bigai-nlco/LooGLE ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models	34	Emerging	195	Python
15	gsarti/verbalized-rebus Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers...	34	Emerging	4	Jupyter Notebook
16	TIGER-AI-Lab/AceCoder The official repo for "AceCoder: Acing Coder RL via Automated Test-Case...	33	Emerging	99	Python
17	microsoft/repoclassbench [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level...	32	Emerging	17	Python
18	artificial-scientist-lab/SciMuse Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:...	32	Emerging	32	Python
19	DAMO-NLP-SG/M3Exam Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel...	31	Emerging	103	Python
20	blacksnail789521/Time-Series-Reasoning-Survey A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models	31	Emerging	38	—
21	TianHongZXY/CoRe [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced...	31	Emerging	50	Python
22	JunyiYe/CreativeMath [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to...	31	Emerging	13	Jupyter Notebook
23	uni-medical/GMAI-MMBench GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards...	29	Experimental	82	—
24	yubol-bobo/MT-Consistency This repo investigates LLMs' tendency to exhibit acquiescence bias in...	29	Experimental	49	Python
25	intuit-ai-research/DCR-consistency DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and...	29	Experimental	25	Python
26	CodeEval-Pro/CodeEval-Pro [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating...	29	Experimental	37	Python
27	lt-asset/REPOCOD For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says...	28	Experimental	26	Python
28	EngineeringSoftware/codeditor Multilingual Code Co-Evolution Using Large Language Models	28	Experimental	13	Python
29	kg-bnu/SciMKG Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for...	27	Experimental	3	Python
30	ehsk/OpenQA-eval ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large...	27	Experimental	47	Python
31	zjunlp/ReCode [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates	27	Experimental	24	Python
32	thehsansaeed/Questions-for-AI-Model-Testing This repository contains a curated set of logical, mathematical, and...	26	Experimental	8	—
33	ai-for-edu/ScratchMath Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking...	25	Experimental	3	Python
34	pinterest/pinpoint-dataset [CVPR '26] - PinPoint: Evaluation of Composed Image Retrieval with Explicit...	25	Experimental	4	Python
35	asaakyan/ngram-creativity Repository for the paper Death of the Novel(ty): Beyond n-Gram Novelty as a...	25	Experimental	4	Jupyter Notebook
36	mismayil/creativity-in-AI Creativity in AI: A Survey of Progresses and Challenges	24	Experimental	4	Jupyter Notebook
37	surrey-nlp/LLM4MT_eval This repository is for our paper "What do large language model need for...	24	Experimental	4	Python
38	cyzhh/MMOS Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two...	23	Experimental	74	Python
39	yifanzhang-pro/BlueMO BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad...	23	Experimental	5	HTML
40	neuro-symbolic-ai/explanation_based_ethical_reasoning Code and data for Paper "Enhancing Ethical Explanations of Large Language...	23	Experimental	6	Python
41	marcusm117/DNA [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and...	22	Experimental	—	Python
42	carlomarxdk/trilemma-of-truth A research project on competing notions of truth in large language models.	22	Experimental	—	Python
43	richardcsuwandi/cake [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE)	21	Experimental	21	Python
44	HarryYancy/SolidGeo SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry	20	Experimental	9	Python
45	MAC-AutoML/SocialOmni Benchmarking Audio-Visual Social Interactivity in Omni Models	20	Experimental	17	Python
46	I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling Optimization of inference time sampling of large language models guided by a...	19	Experimental	—	Python
47	Liz-Atlas/last_frame_whitepaper A Modular Knowledge Transfer System for Large Language Models	19	Experimental	—	—
48	mshin77/mathipy mathipy: Multimodal item feature extraction for K-12 math assessment (Python...	19	Experimental	—	Python
49	LiXin97/WirelessMathLM WirelessMathLM:Teaching Mathematical Reasoning for LLMs in Wireless...	17	Experimental	2	HTML
50	yifanzhang-pro/StackMathQA StackMathQA: A Curated Collection of 2 Million Mathematical Questions and...	15	Experimental	6	—
51	jwallat/temporalrobustness A Study Into Temporal Robustness of LLMs	15	Experimental	2	Jupyter Notebook
52	robertopassaro/tales-of-2-minds Evaluating Creativity in Human and Large Language Model Narratives	13	Experimental	—	Jupyter Notebook
53	yahskapar/LLMs-and-Probabilistic-Reasoning Data and software artifacts for the EMNLP 2024 (Main) paper "What Are the...	13	Experimental	5	Jupyter Notebook
54	GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an...	12	Experimental	1	Python
55	yashmahe2020/math-tutor-research Research on Large Language Model capabilities in mathematics tutoring and...	12	Experimental	1	Jupyter Notebook
56	polymathbenchmark/polymathbenchmark.github.io A Challenging Multi-Modal Mathematical Reasoning Benchmark	11	Experimental	—	JavaScript
57	aauss/temporal-answer-qa Time to Revisit Exact Match (Findings of EMNLP 2025)	11	Experimental	—	Python
58	maxpeeperkorn/creativity-parameter This repository contains the supplementary material / appendix to go with...	11	Experimental	2	Jupyter Notebook
59	kreasof-ai/self-perturbation-learning Imagine "2 truth and a lie", but formalized as ML training objective	10	Experimental	1	Jupyter Notebook
60	sileod/nlp-verbal-probabilities-reasoning Probing handling of verbal probabilities in NLP models	10	Experimental	1	Jupyter Notebook
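
For readers who copy the table as text, the rows can be parsed into records to re-check the summary figures. The sketch below assumes the columns are tab-separated in the order shown in the header, that the first space in the Tool column separates the repository name from its description, and that a missing star count appears as a dash character. `parse_row` is an illustrative helper, and only five rows from the table are included as sample input.

```python
from collections import Counter

# Five rows copied from the table above; "\t" marks the column separators.
SAMPLE_ROWS = """\
1\tMMMU-Benchmark/MMMU This repo contains evaluation code for the paper "MMMU: A Massive...\t52\tEstablished\t548\tPython
2\tpat-jj/DeepRetrieval [COLM’25] DeepRetrieval — 🔥 Training Search Agent by RLVR with Retrieval Outcome\t51\tEstablished\t696\tPython
4\tise-uiuc/magicoder [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct\t45\tEmerging\t2,086\tPython
23\tuni-medical/GMAI-MMBench GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards...\t29\tExperimental\t82\t\u2014
41\tmarcusm117/DNA [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and...\t22\tExperimental\t\u2014\tPython
"""

MISSING = "\u2014"  # the dash the table uses for an absent value


def parse_row(line):
    """Split one tab-separated table row into a dictionary."""
    rank, tool, score, tier, stars, language = line.split("\t")
    # Assumption: the repository name never contains a space.
    repo, _, description = tool.partition(" ")
    return {
        "rank": int(rank),
        "repo": repo,
        "description": description,
        "score": int(score),
        "tier": tier,
        # Star counts may be missing or contain thousands separators.
        "stars": None if stars == MISSING else int(stars.replace(",", "")),
        "language": None if language == MISSING else language,
    }


rows = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]

# Repositories scoring above 50, matching the "Established" count in the summary.
established = [row["repo"] for row in rows if row["score"] > 50]
tier_counts = Counter(row["tier"] for row in rows)

print(established)
print(tier_counts)
```

Run over all 60 rows, the same filter gives a quick check of the "two score above 50" figure quoted in the summary.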