Math Reasoning Datasets Transformer Models

This page tracks 37 projects in the math reasoning datasets category. One scores above 70 (Verified tier). The highest-rated is ExtensityAI/symbolicai at 75/100, with 1,677 stars and 2,722 monthly downloads. One of the top 10 is actively maintained.

Get all 37 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=transformers&subcategory=math-reasoning-datasets&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
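The same request can be made from a script. The sketch below uses only the Python standard library and rebuilds the URL from the curl command above. The function names `build_quality_url` and `fetch_quality` are hypothetical helpers, not part of any published client. The response schema is not documented on this page, so the decoded JSON is returned as-is. Note that the curl example passes `limit=20` while the page lists 37 projects; whether the API accepts a larger `limit` is an assumption to verify.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_quality_url(domain="transformers",
                      subcategory="math-reasoning-datasets",
                      limit=20):
    """Build the query URL; the defaults mirror the curl example."""
    query = urlencode({
        "domain": domain,
        "subcategory": subcategory,
        "limit": limit,
    })
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=10.0):
    """GET the URL and decode the body as JSON.

    The response structure is not documented here, so the parsed
    object is returned without assuming any field names.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    url = build_quality_url()
    print(url)
    # Uncomment to make the network call (counts toward the daily quota):
    # data = fetch_quality(url)
    # print(type(data))
```

Keeping URL construction separate from the network call makes it easy to check the query string without spending any of the 100 unauthenticated requests per day.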

| # | Model | Description | Score | Tier |
|---|-------|-------------|-------|------|
| 1 | ExtensityAI/symbolicai | A neurosymbolic perspective on LLMs | 75 | Verified |
| 2 | TIGER-AI-Lab/MMLU-Pro | The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task... | 49 | Emerging |
| 3 | deep-symbolic-mathematics/LLM-SR | [ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on... | 42 | Emerging |
| 4 | zhudotexe/fanoutqa | Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering... | 42 | Emerging |
| 5 | microsoft/interwhen | A framework for verifiable reasoning with language models. | 42 | Emerging |
| 6 | HiThink-Research/MME-Finance | [MM 2025] A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning | 37 | Emerging |
| 7 | xlang-ai/Binder | [ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages" | 36 | Emerging |
| 8 | yifanzhang-pro/AutoMathText | [ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative... | 32 | Emerging |
| 9 | princeton-pli/AdaptMI | [COLM 2025] Adaptive Skill-based In-context Math Instruction for Small... | 31 | Emerging |
| 10 | SeekingDream/DyCodeEval | Official repository of the ICML2025 paper “Dynamic Benchmarking of Reasoning... | 30 | Emerging |
| 11 | TIGER-AI-Lab/StructLM | Code and data for "StructLM: Towards Building Generalist Models for... | 30 | Emerging |
| 12 | AlphaPav/mem-kk-logic | On Memorization of Large Language Models in Logical Reasoning | 30 | Emerging |
| 13 | DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries | [ACL 2025] Analyzing LLMs' Multilingual Knowledge Boundary Cognition Across... | 30 | Emerging |
| 14 | TIGER-AI-Lab/LongICLBench | Code and Data for "Long-context LLMs Struggle with Long In-context Learning"... | 29 | Experimental |
| 15 | declare-lab/LLM-PuzzleTest | This repository is maintained to release dataset and models for multimodal... | 29 | Experimental |
| 16 | TIGER-AI-Lab/MAmmoTH | Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid... | 29 | Experimental |
| 17 | akjindal53244/Arithmo | Small and Efficient Mathematical Reasoning LLMs | 28 | Experimental |
| 18 | amazon-science/recode | Releasing code for "ReCode: Robustness Evaluation of Code Generation Models" | 28 | Experimental |
| 19 | google/curie | Code release for "CURIE: Evaluating LLMs On Multitask Scientific Long... | 27 | Experimental |
| 20 | martin-wey/CodeUltraFeedback | CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025) | 26 | Experimental |
| 21 | QwenLM/PolyMath | [NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath:... | 25 | Experimental |
| 22 | bobxwu/learning-from-rewards-llm-papers | A comprehensive collection of learning from rewards in the post-training and... | 24 | Experimental |
| 23 | ryokamoi/llm-self-correction-papers | List of papers on Self-Correction of LLMs. | 24 | Experimental |
| 24 | reasoning-machines/CoCoGen | Language Models of Code are Few-Shot Commonsense Learners (EMNLP 2022) | 23 | Experimental |
| 25 | conditionWang/FLNK | Federated Learning with New Knowledge -- explore to incorporate various new... | 23 | Experimental |
| 26 | gersteinlab/Struc-Bench | [NAACL 2024] Struc-Bench: Are Large Language Models Good at Generating... | 23 | Experimental |
| 27 | zjunlp/DynamicKnowledgeCircuits | [ACL 2025] How Do LLMs Acquire New Knowledge? A Knowledge Circuits... | 22 | Experimental |
| 28 | kaistAI/LangBridge | [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision | 21 | Experimental |
| 29 | YangLing0818/SuperCorrect-llm | [ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought... | 20 | Experimental |
| 30 | WooooDyy/MathCritique | Implementation for the research paper "Enhancing LLM Reasoning via Critique... | 20 | Experimental |
| 31 | merlerm/In-Context-Symbolic-Regression | Official code implementation for the ACL 2024 Student Research Workshop... | 20 | Experimental |
| 32 | joeljang/continual-knowledge-learning | [ICLR 2022] Towards Continual Knowledge Learning of Language Models | 20 | Experimental |
| 33 | UCSC-VLAA/vllm-safety-benchmark | [ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in... | 18 | Experimental |
| 34 | MMStar-Benchmark/MMStar | [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on... | 17 | Experimental |
| 35 | iiis-ai/IterativeQuestionComposing | [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing... | 16 | Experimental |
| 36 | TIGER-AI-Lab/TableCoT | The code and data for paper "Large Language Models are few(1)-shot Table... | 16 | Experimental |
| 37 | Eleanor-H/MUSTARD | Code & data for ICLR 2024 spotlight paper: 🍯MUSTARD: Mastering Uniform... | 14 | Experimental |