Math Reasoning Datasets LLM Tools

Datasets, benchmarks, and training resources specifically for mathematical reasoning tasks in LLMs, including word problems, visual math, problem generation, and mathematical text curation. Does NOT include general math tutoring platforms, creativity evaluation, or non-mathematical reasoning benchmarks.

This category tracks 60 math reasoning dataset tools. Two score above 50 (the Established tier). The highest-rated is MMMU-Benchmark/MMMU, at 52/100 with 548 stars.
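The tier labels can be approximated from the scores with a threshold check. This is a minimal sketch, not the site's actual scoring logic. Only the "above 50" cutoff for Established is stated on this page. The Emerging floor of 30 is an assumption inferred from the listed scores (the lowest Emerging score is 31, the highest Experimental score is 29). The names `tier_for` and `emerging_min` are hypothetical.

```python
from collections import Counter


def tier_for(score, emerging_min=30):
    """Map a 0-100 quality score to a tier label.

    "Above 50 is Established" comes from the page text.
    emerging_min=30 is an ASSUMPTION inferred from the listed scores.
    """
    if score > 50:
        return "Established"
    if score >= emerging_min:
        return "Emerging"
    return "Experimental"


# A few scores taken from the listing, used only as sample input.
sample_scores = [52, 51, 47, 31, 29, 10]
tier_counts = Counter(tier_for(s) for s in sample_scores)
print(tier_counts)
```

If the real Emerging cutoff differs, only `emerging_min` needs to change.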

Get all 60 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=math-reasoning-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
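The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint and the `domain`, `subcategory`, and `limit` query parameters are copied from the curl command above, while the helper names `build_url` and `fetch_projects` are hypothetical. The response schema is not documented here, so the sketch returns the decoded JSON without interpreting it. Whether `limit` values above 20 are accepted is also an assumption to verify against the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken verbatim from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit=20):
    """Build the query URL with the same parameters as the curl example."""
    params = {
        "domain": "llm-tools",
        "subcategory": "math-reasoning-datasets",
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and decode the JSON response (performs a network request)."""
    with urlopen(build_url(limit), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Print the URL only; call fetch_projects() to actually hit the API.
    print(build_url())
```

Calling `fetch_projects()` counts against the daily request quota mentioned above, so cache the result when experimenting.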

# | Tool | Description | Score | Tier
1 | MMMU-Benchmark/MMMU | This repo contains evaluation code for the paper "MMMU: A Massive... | 52 | Established
2 | pat-jj/DeepRetrieval | [COLM’25] DeepRetrieval: 🔥 Training Search Agent by RLVR with Retrieval Outcome | 51 | Established
3 | lupantech/MathVista | MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts | 47 | Emerging
4 | ise-uiuc/magicoder | [ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct | 45 | Emerging
5 | x66ccff/liveideabench | [Nature Communications] 🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific... | 42 | Emerging
6 | IAAR-Shanghai/xVerify | xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | 40 | Emerging
7 | SuperBruceJia/Awesome-LLM-Self-Consistency | Awesome LLM Self-Consistency: a curated list of Self-consistency in Large... | 39 | Emerging
8 | sherryzyh/physical_reasoning_toolkit | A Python toolkit for physical reasoning in LLMs and VLMs. This toolkit... | 37 | Emerging
9 | GAIR-NLP/MathPile | [NeurIPS D&B 2024] Generative AI for Math: MathPile | 37 | Emerging
10 | rxlqn/awesome-llm-self-reflection | augmented LLM with self reflection | 37 | Emerging
11 | killthefullmoon/PhyX | PhyX: Does Your Model Have the "Wits" for Physical Reasoning? | 36 | Emerging
12 | iiis-ai/AutoMathText-V2 | AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset | 36 | Emerging
13 | yecchen/MIRAI | Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting" | 36 | Emerging
14 | bigai-nlco/LooGLE | [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models | 34 | Emerging
15 | gsarti/verbalized-rebus | Materials for "Non Verbis, Sed Rebus: Large Language Models are Weak Solvers... | 34 | Emerging
16 | TIGER-AI-Lab/AceCoder | The official repo for "AceCoder: Acing Coder RL via Automated Test-Case... | 33 | Emerging
17 | microsoft/repoclassbench | [ICML DMLR 2024] Repo that contains code for the paper titled: "Class-Level... | 32 | Emerging
18 | artificial-scientist-lab/SciMuse | Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs:... | 32 | Emerging
19 | DAMO-NLP-SG/M3Exam | Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel... | 31 | Emerging
20 | blacksnail789521/Time-Series-Reasoning-Survey | A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models | 31 | Emerging
21 | TianHongZXY/CoRe | [ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced... | 31 | Emerging
22 | JunyiYe/CreativeMath | [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to... | 31 | Emerging
23 | uni-medical/GMAI-MMBench | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards... | 29 | Experimental
24 | yubol-bobo/MT-Consistency | This repo investigates LLMs' tendency to exhibit acquiescence bias in... | 29 | Experimental
25 | intuit-ai-research/DCR-consistency | DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and... | 29 | Experimental
26 | CodeEval-Pro/CodeEval-Pro | [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating... | 29 | Experimental
27 | lt-asset/REPOCOD | For our ACL25 Paper: Can Language Models Replace Programmers? RepoCod Says... | 28 | Experimental
28 | EngineeringSoftware/codeditor | Multilingual Code Co-Evolution Using Large Language Models | 28 | Experimental
29 | kg-bnu/SciMKG | Source code of AAAI 2026 paper "SciMKG: A Multimodal Knowledge Graph for... | 27 | Experimental
30 | ehsk/OpenQA-eval | ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large... | 27 | Experimental
31 | zjunlp/ReCode | [AAAI 2026] ReCode: Reinforced Code Knowledge Editing for API Updates | 27 | Experimental
32 | thehsansaeed/Questions-for-AI-Model-Testing | This repository contains a curated set of logical, mathematical, and... | 26 | Experimental
33 | ai-for-edu/ScratchMath | Official Repo for Paper "Can MLLMs Read Students' Minds? Unpacking... | 25 | Experimental
34 | pinterest/pinpoint-dataset | [CVPR '26] - PinPoint: Evaluation of Composed Image Retrieval with Explicit... | 25 | Experimental
35 | asaakyan/ngram-creativity | Repository for the paper Death of the Novel(ty): Beyond n-Gram Novelty as a... | 25 | Experimental
36 | mismayil/creativity-in-AI | Creativity in AI: A Survey of Progresses and Challenges | 24 | Experimental
37 | surrey-nlp/LLM4MT_eval | This repository is for our paper "What do large language model need for... | 24 | Experimental
38 | cyzhh/MMOS | Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two... | 23 | Experimental
39 | yifanzhang-pro/BlueMO | BlueMO: A Comprehensive Collection of Challenging Mathematical Olympiad... | 23 | Experimental
40 | neuro-symbolic-ai/explanation_based_ethical_reasoning | Code and data for Paper "Enhancing Ethical Explanations of Large Language... | 23 | Experimental
41 | marcusm117/DNA | [ICLR 2026] Divide and Abstract: Autoformalization via Decomposition and... | 22 | Experimental
42 | carlomarxdk/trilemma-of-truth | A research project on competing notions of truth in large language models. | 22 | Experimental
43 | richardcsuwandi/cake | [NeurIPS 2025] Context-Aware Kernel Evolution (CAKE) | 21 | Experimental
44 | HarryYancy/SolidGeo | SolidGeo: Measuring Multimodal Spatial Math Reasoning in Solid Geometry | 20 | Experimental
45 | MAC-AutoML/SocialOmni | Benchmarking Audio-Visual Social Interactivity in Omni Models | 20 | Experimental
46 | I-Halder/Demystifying-LLM-as-a-Judge-Analytically-Tractable-Model-for-Inference-Time-Scaling | Optimization of inference time sampling of large language models guided by a... | 19 | Experimental
47 | Liz-Atlas/last_frame_whitepaper | A Modular Knowledge Transfer System for Large Language Models | 19 | Experimental
48 | mshin77/mathipy | mathipy: Multimodal item feature extraction for K-12 math assessment (Python... | 19 | Experimental
49 | LiXin97/WirelessMathLM | WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless... | 17 | Experimental
50 | yifanzhang-pro/StackMathQA | StackMathQA: A Curated Collection of 2 Million Mathematical Questions and... | 15 | Experimental
51 | jwallat/temporalrobustness | A Study Into Temporal Robustness of LLMs | 15 | Experimental
52 | robertopassaro/tales-of-2-minds | Evaluating Creativity in Human and Large Language Model Narratives | 13 | Experimental
53 | yahskapar/LLMs-and-Probabilistic-Reasoning | Data and software artifacts for the EMNLP 2024 (Main) paper "What Are the... | 13 | Experimental
54 | GSkuza/Generalized-Theory-of-Mathematical-Indefiniteness | The Generalized Theory of Mathematical Undefiniteness (GTMØ) is an... | 12 | Experimental
55 | yashmahe2020/math-tutor-research | Research on Large Language Model capabilities in mathematics tutoring and... | 12 | Experimental
56 | polymathbenchmark/polymathbenchmark.github.io | A Challenging Multi-Modal Mathematical Reasoning Benchmark | 11 | Experimental
57 | aauss/temporal-answer-qa | Time to Revisit Exact Match (Findings of EMNLP 2025) | 11 | Experimental
58 | maxpeeperkorn/creativity-parameter | This repository contains the supplementary material / appendix to go with... | 11 | Experimental
59 | kreasof-ai/self-perturbation-learning | Imagine "2 truth and a lie", but formalized as ML training objective | 10 | Experimental
60 | sileod/nlp-verbal-probabilities-reasoning | Probing handling of verbal probabilities in NLP models | 10 | Experimental