Math Reasoning Datasets LLM Tools
Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.
There are 60 tools tracked in the math reasoning datasets category. Two score above 50, which places them in the Established tier. The highest-rated is MMMU-Benchmark/MMMU at 52/100, with 548 stars.
Get all 60 projects as JSON
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
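The same request can be made from a script. The following is a minimal Python sketch, not an official client: the endpoint and query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The curl example passes `limit=20`; whether the API accepts a larger value such as 60 is an assumption, and the response schema is not documented on this page, so the decoded JSON is returned unchanged.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL; domain and subcategory mirror the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "math-reasoning-datasets",
        "limit": limit,  # assumption: values other than 20 are accepted
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=10):
    """Fetch and decode the JSON body.

    The response shape is not documented here, so the decoded value
    is returned as-is rather than parsed into specific fields.
    """
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_projects()` issues the same request as the curl command and counts against the daily request allowance. Passing `limit=60` would ask for the full list, if the API permits that value.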
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | MMMU-Benchmark/MMMU: This repo contains evaluation code for the paper "MMMU: A Massive... | 52 | Established |
| 2 | pat-jj/DeepRetrieval: [COLM’25] DeepRetrieval - 🔥 Training Search Agent by RLVR with Retrieval Outcome | | Established |
| 3 | lupantech/MathVista: MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts | | Emerging |
| 4 | ise-uiuc/magicoder: [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct | | Emerging |
| 5 | x66ccff/liveideabench: [Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific... | | Emerging |
| 6 | IAAR-Shanghai/xVerify: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | | Emerging |
| 7 | SuperBruceJia/Awesome-LLM-Self-Consistency: Awesome LLM Self-Consistency: a curated list of Self-consistency in Large... | | Emerging |
| 8 | sherryzyh/physical_reasoning_toolkit: A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit... | | Emerging |
| 9 | GAIR-NLP/MathPile: [NeurIPS D&B 2024] Generative AI for Math: MathPile | | Emerging |
| 10 | rxlqn/awesome-llm-self-reflection: augmented LLM with self reflection | | Emerging |
| 11 | killthefullmoon/PhyX: PhyX: Does Your Model Have the "Wits" for Physical Reasoning? | | Emerging |
| 12 | iiis-ai/AutoMathText-V2: AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset | | Emerging |
| 13 | yecchen/MIRAI: Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting" | | Emerging |
| 14 | bigai-nlco/LooGLE: ACL 2024 \| LooGLE: Long Context Evaluation for Long-Context Language Models | | Emerging |
| 15 | gsarti/verbalized-rebus: Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers... | | Emerging |
| 16 | TIGER-AI-Lab/AceCoder: The official repo for "AceCoder: Acing Coder RL via Automated Test-Case... | | Emerging |
| 17 | microsoft/repoclassbench: [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level... | | Emerging |
| 18 | artificial-scientist-lab/SciMuse: Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:... | | Emerging |
| 19 | DAMO-NLP-SG/M3Exam: Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel... | | Emerging |
| 20 | blacksnail789521/Time-Series-Reasoning-Survey: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models | | Emerging |
| 21 | TianHongZXY/CoRe: [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced... | | Emerging |
| 22 | JunyiYe/CreativeMath: [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to... | | Emerging |
| 23 | uni-medical/GMAI-MMBench: GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards... | | Experimental |
| 24 | yubol-bobo/MT-Consistency: This repo investigates LLMs' tendency to exhibit acquiescence bias in... | | Experimental |
| 25 | intuit-ai-research/DCR-consistency: DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and... | | Experimental |
| 26 | CodeEval-Pro/CodeEval-Pro: [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating... | | Experimental |
| 27 | lt-asset/REPOCOD: For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says... | | Experimental |
| 28 | EngineeringSoftware/codeditor: Multilingual Code Co-Evolution Using Large Language Models | | Experimental |
| 29 | kg-bnu/SciMKG: Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for... | | Experimental |
| 30 | ehsk/OpenQA-eval: ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large... | | Experimental |
| 31 | zjunlp/ReCode: [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates | | Experimental |
| 32 | thehsansaeed/Questions-for-AI-Model-Testing: This repository contains a curated set of logical, mathematical, and... | | Experimental |
| 33 | ai-for-edu/ScratchMath: Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking... | | Experimental |
| 34 | pinterest/pinpoint-dataset: [CVPR '26] - PinPoint: Evaluation of Composed Image Retrieval with Explicit... | | Experimental |
| 35 | asaakyan/ngram-creativity: Repository for the paper Death of the Novel(ty): Beyond n-Gram Novelty as a... | | Experimental |
| 36 | mismayil/creativity-in-AI: Creativity in AI: A Survey of Progresses and Challenges | | Experimental |
| 37 | surrey-nlp/LLM4MT_eval: This repository is for our paper "What do large language model need for... | | Experimental |
| 38 | cyzhh/MMOS: Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two... | | Experimental |
| 39 | yifanzhang-pro/BlueMO: BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad... | | Experimental |
| 40 | neuro-symbolic-ai/explanation_based_ethical_reasoning: Code and data for Paper "Enhancing Ethical Explanations of Large Language... | | Experimental |
| 41 | marcusm117/DNA: [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and... | | Experimental |
| 42 | carlomarxdk/trilemma-of-truth: A research project on competing notions of truth in large language models. | | Experimental |
| 43 | richardcsuwandi/cake: [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE) | | Experimental |
| 44 | HarryYancy/SolidGeo: SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry | | Experimental |
| 45 | MAC-AutoML/SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models | | Experimental |
| 46 | I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling: Optimization of inference time sampling of large language models guided by a... | | Experimental |
| 47 | Liz-Atlas/last_frame_whitepaper: A Modular Knowledge Transfer System for Large Language Models | | Experimental |
| 48 | mshin77/mathipy: mathipy: Multimodal item feature extraction for K-12 math assessment (Python... | | Experimental |
| 49 | LiXin97/WirelessMathLM: WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless... | | Experimental |
| 50 | yifanzhang-pro/StackMathQA: StackMathQA: A Curated Collection of 2 Million Mathematical Questions and... | | Experimental |
| 51 | jwallat/temporalrobustness: A Study Into Temporal Robustness of LLMs | | Experimental |
| 52 | robertopassaro/tales-of-2-minds: Evaluating Creativity in Human and Large Language Model Narratives | | Experimental |
| 53 | yahskapar/LLMs-and-Probabilistic-Reasoning: Data and software artifacts for the EMNLP 2024 (Main) paper "What Are the... | | Experimental |
| 54 | GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness: The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an... | | Experimental |
| 55 | yashmahe2020/math-tutor-research: Research on Large Language Model capabilities in mathematics tutoring and... | | Experimental |
| 56 | polymathbenchmark/polymathbenchmark.github.io: A Challenging Multi-Modal Mathematical Reasoning Benchmark | | Experimental |
| 57 | aauss/temporal-answer-qa: Time to Revisit Exact Match (Findings of EMNLP 2025) | | Experimental |
| 58 | maxpeeperkorn/creativity-parameter: This repository contains the supplementary material / appendix to go with... | | Experimental |
| 59 | kreasof-ai/self-perturbation-learning: Imagine "2 truth and a lie", but formalized as ML training objective | | Experimental |
| 60 | sileod/nlp-verbal-probabilities-reasoning: Probing handling of verbal probabilities in NLP models | | Experimental |