LLM Domain Datasets LLM Tools
Datasets, benchmarks, and evaluation tools for domain-specific LLM applications (geoscience, entity matching, information extraction, etc.). Does NOT include general-purpose LLM datasets, training frameworks, or model architecture code.
59 LLM domain dataset tools are tracked. One scores above 50 (Established tier). The highest-rated is monarch-initiative/ontogpt at 56/100 with 811 stars. Two of the top 10 are actively maintained.
Get the projects as JSON (59 in total; the example request sets limit=20)
curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=llm-tools&subcategory=llm-domain-datasets&limit=20"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
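The same request can be made from a script. The following is a minimal Python sketch, not an official client: it uses only the endpoint and the three query parameters that appear in the curl example above. The helper names `build_url` and `fetch_listing` are hypothetical, the response schema is not documented on this page, and whether the API accepts a `limit` larger than 20 is an assumption that should be checked against the service.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example on this page.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(limit: int = 20) -> str:
    """Build the listing URL from the query parameters shown in the curl example."""
    query = urllib.parse.urlencode({
        "domain": "llm-tools",
        "subcategory": "llm-domain-datasets",
        "limit": limit,  # assumption: values other than 20 may or may not be accepted
    })
    return f"{BASE_URL}?{query}"


def fetch_listing(limit: int = 20, timeout: float = 10.0):
    """Fetch the listing and decode it as JSON.

    The structure of the decoded object is not documented here,
    so it is returned unchanged for the caller to inspect.
    """
    with urllib.request.urlopen(build_url(limit), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Printing the URL needs no network access; call fetch_listing() to hit the API.
    print(build_url())
```

Calling `fetch_listing()` counts against the 100 requests/day keyless quota, so it is worth caching the decoded result locally rather than refetching it.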
| # | Tool | Score | Tier |
|---|---|---|---|
| 1 | monarch-initiative/ontogpt: LLM-based ontological extraction tools, including SPIRES | | Established |
| 2 | weAIDB/awesome-data-llm: Official Repository of "LLM × DATA" Survey Paper | | Emerging |
| 3 | open-chinese/poetry-collection: Chinese "Complete Poetry Collection", the most comprehensive and systematic Chinese poetry dataset to date, with unified data modeling. | | Emerging |
| 4 | AXYZdong/AMchat: AM (Advanced Mathematics) Chat is a large language model that integrates... | | Emerging |
| 5 | skywalker023/sodaverse: 🥤🧑🏻🚀Code and dataset for our EMNLP 2023 paper - "SODA: Million-scale... | | Emerging |
| 6 | Y-Research-SBU/TimeSeriesScientist: Official Repository for TimeSeriesScientist | | Emerging |
| 7 | Jeryi-Sun/LLM-and-Law: This repository is dedicated to summarizing papers related to large language... | | Emerging |
| 8 | jd-coderepos/llms4subjects: The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository | | Emerging |
| 9 | SysNetS/SPEC5G: This repository contains the code and data of the paper titled "SPEC5G: A... | | Emerging |
| 10 | davendw49/k2: Code and datasets for paper "K2: A Foundation Language Model for Geoscience... | | Emerging |
| 11 | microsoft/clinical-self-verification: Self-verification for LLMs. | | Emerging |
| 12 | sciknoworg/llms4subjects: The official GermEval 2025 Task - LLMs4Subjects - Shared Task Dataset Repository | | Emerging |
| 13 | falensiazmi/IndoSafety: A dataset for LLM safety evaluation in Indonesian and major local languages... | | Emerging |
| 14 | ewok-core/ewok-paper: Elements of World Knowledge! This repository houses data and code needed to... | | Experimental |
| 15 | night-chen/ToolQA: ToolQA, a new dataset to evaluate the capabilities of LLMs in answering... | | Experimental |
| 16 | marcobombieri/do-LLM-dream-of-ontologies: Repository containing code and dataset of the paper "Do LLM Dream Of Ontologies?" | | Experimental |
| 17 | KRR-Oxford/LLMap-Prelim: A preliminary investigation for ontology alignment (OM) with large language... | | Experimental |
| 18 | jd-coderepos/sota: The official training/validation/test dataset repository for the SOTA? task... | | Experimental |
| 19 | SpursGoZmy/Tabular-LLM: This project aims to collect open-source datasets for table intelligence tasks (such as table question answering and table-to-text generation), convert the raw data into instruction-tuning format, and fine-tune LLMs to strengthen their understanding of tabular data... | | Experimental |
| 20 | paulalesius/llmath: Large Language Math - The Mathematics of LLM Foundational Models - For Beginners | | Experimental |
| 21 | CharlesPikachu/ToolBridge: ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities | | Experimental |
| 22 | bioepic-data/bervo: BERVO, the Biological and Environmental Research Variable Ontology | | Experimental |
| 23 | abcsys/libem-sample-data: Libem sample datasets. | | Experimental |
| 24 | LHHegland/if-llm-behavior-ontology: Instruction-Following LLM Behavior Ontology (IF-LLM-BO) is a lightweight... | | Experimental |
| 25 | infosenselab/frameref: Large-scale dataset and simulation framework for studying information health. | | Experimental |
| 26 | lankamar/pragmatic-llm-alignment: Research on pragmatic alignment of LLMs and an Agent Framework... | | Experimental |
| 27 | dsfsi/edu-assessment-llm-prompt: Educational Assessment using LLMs | | Experimental |
| 28 | zjunlp/Data2Behavior: From Data to Behavior: Predicting Unintended Model Behaviors Before Training | | Experimental |
| 29 | Nkluge-correa/Model-Library: The Model Library is a project that maps the risks associated with modern... | | Experimental |
| 30 | Iamsdt/awesome-bengali-ai: A curated collection of resources for Bengali AI, LLMs, Generative AI, and... | | Experimental |
| 31 | MaheshJakkala/naamapadam-multilingual-ner: Benchmarking NER on Naamapadam across 7 Indic languages. EDA + model... | | Experimental |
| 32 | GS-Uni-Heidelberg/Paper-WhoPlaysWhichRole: 🤖 Phrase-level protagonist detection and role classification in moral discourse. | | Experimental |
| 33 | MIKUAFANS/SciTopic: [IEEE BigData 2025] SciTopic: Enhancing Topic Discovery in Scientific... | | Experimental |
| 34 | nercone-dev/zeta-llm-dataset: Public Datasets for Zeta-Tool | | Experimental |
| 35 | RenzeLou/AAAR-1.0: The source code for running LLMs on the AAAR-1.0 benchmark. | | Experimental |
| 36 | hitz-zentroa/lm-contamination: The LM Contamination Index is a manually created database of contamination... | | Experimental |
| 37 | NLP-Research-Insights/SciTables: Enhance Table-to-Text by LLMs using scientific tables | | Experimental |
| 38 | willxxy/ECG-Byte: [MLHC 2025] ECG-Byte: A Tokenizer for End-to-End Generative... | | Experimental |
| 39 | liyaooi/TAMO: TAMO: reimagine Table representation as an independent Modality for LLMs | | Experimental |
| 40 | mahadi-nahid/NormTab: [EMNLP 2024] NormTab: Improving Symbolic Reasoning in LLMs Through Tabular... | | Experimental |
| 41 | sciknoworg/LLMs4OL-Challenge: LLMs4OL Challenge @ ISWC | | Experimental |
| 42 | nnliu1/sem_annotation: Ontology term recommendation system for semantic annotation | | Experimental |
| 43 | Maryam-Nasseri/SFA-Lexical-Complexity: Supplementary materials for the journal article Structural Factor Analysis... | | Experimental |
| 44 | Mehreen1103/LLMs4OL-2025: Official participation in the 2nd LLMs4OL Challenge @ ISWC 2025, Nara,... | | Experimental |
| 45 | zabir-nabil/bangla-multilingual-llm-eval: Evaluation of Open and Closed-Source Multi-lingual LLMs for Low-Resource... | | Experimental |
| 46 | HES-XPLAIN/mlxplain: An open platform for accelerating the development of eXplainable AI systems | | Experimental |
| 47 | JustinMuecke/GLaMoR: This repository provides a framework for transforming OWL ontologies into a... | | Experimental |
| 48 | alemoraru/exceed-project-overview: Reproduction package for a framework that uses LLMs to generate tailored,... | | Experimental |
| 49 | sefeoglu/llm-examples: LLM examples for the state of the art problems in knowledge graphs | | Experimental |
| 50 | xwang297/metamate-dataset: MetaMate: Large Language Model to the Rescue of Automated Data Extraction... | | Experimental |
| 51 | mhmoslemi2338/Heterogeneity_EM_Survey: Official implementation of the paper "Heterogeneity in Entity Matching: A... | | Experimental |
| 52 | eugeniusms/textgrad-TextualVerifier: TextualVerifier: Verify Step by Step in TextGrad Automated "Differentiation"... | | Experimental |
| 53 | ryang1119/OOMB: Repo for "Can Large Language Models be Effective Online Opinion Miners?"... | | Experimental |
| 54 | MariaSahakyan/peer_review_project: This repository contains the custom code and anonymized data used to produce... | | Experimental |
| 55 | abhishekmaity/BhashaMind: Low-Resource Bengali LLM for Summarization and Classification | | Experimental |
| 56 | Uniquenetra/ml-based-ontology-matching: A project to enhance ontology matching accuracy using Large Language Models... | | Experimental |
| 57 | mehedihasanbijoy/BanglaLLMs: A collection of fine-tuned LLMs for Bangla language processing. | | Experimental |
| 58 | s-m-hashemi/llms4ol-2024-challenge: Data and implementations of the paper "SKH-NLP at LLMs4OL 2024 Task B:... | | Experimental |
| 59 | AbhijitKumarJ/Meta_Abstraction: Meta Abstracting data to utilize emergent patterns | | Experimental |