aqlaboratory/proteinnet
Standardized data set for machine learning of protein structure
Includes multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and secondary and tertiary structure annotations across CASP 7-12. Training splits are time-reset: each reconstructs the historical database state for its CASP edition, so that benchmarking is unbiased. CASP blind prediction targets serve as test sets, and validation subsets span difficulty levels (<10% to >90% sequence identity) to assess generalization across distributional shifts. The data is distributed as human-readable text files and in TensorFlow TFRecord format, and community PyTorch parsers are available.
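To give a feel for what reading the human-readable text files might involve, here is a minimal sketch of a record reader. It assumes a layout this listing does not document: each record is a run of bracketed section headers (for example `[ID]`, `[PRIMARY]`, `[MASK]`) each followed by its data lines, with a blank line between records. The function name `iter_records` and the sample data are hypothetical; check the repository's own documentation for the authoritative format.

```python
import io
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping section name -> raw data lines.

    Assumed layout (not confirmed by this listing): bracketed headers such
    as [ID] introduce a section, and a blank line ends a record.
    """
    record: Dict[str, List[str]] = {}
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: treat as the record separator.
            if record:
                yield record
            record, section = {}, None
        elif line.startswith("[") and line.endswith("]"):
            # New section header, e.g. "[PRIMARY]" -> "PRIMARY".
            section = line[1:-1]
            record[section] = []
        elif section is not None:
            # Data line belonging to the current section.
            record[section].append(line)
    if record:
        # Flush the last record if the input has no trailing blank line.
        yield record


# Made-up sample in the assumed layout (not real ProteinNet data).
SAMPLE = """[ID]
EXAMPLE_1
[PRIMARY]
MKV
[MASK]
+-+

[ID]
EXAMPLE_2
[PRIMARY]
GA
[MASK]
++
"""

records = list(iter_records(io.StringIO(SAMPLE)))
for rec in records:
    print(rec["ID"][0], rec["PRIMARY"][0])
```

Sections are kept as raw strings on purpose, so numeric sections can be converted by the caller once the real column layout is known.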
910 stars. No commits in the last 6 months.
Stars: 910
Forks: 138
Language: Python
License: MIT
Category:
Last pushed: Nov 18, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/aqlaboratory/proteinnet"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
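The same request can be made from Python using only the standard library. This is a sketch under two assumptions that the listing does not confirm: that the path follows a category/owner/repo pattern (inferred from the single URL shown above), and that the response body is JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the path pattern is inferred from one example."""
    parts = (category, owner, repo)
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming (unverified) a JSON body."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL needs no network access.
url = quality_url("ml-frameworks", "aqlaboratory", "proteinnet")
print(url)

# Uncomment to perform the request (counts against the daily quota):
# data = fetch_quality("ml-frameworks", "aqlaboratory", "proteinnet")
# print(data)
```

Keeping URL construction separate from the network call makes it easy to check the request target without spending any of the 100 keyless requests per day.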
Higher-rated alternatives
DeepRank/deeprank2
An open-source deep learning framework for data mining of protein-protein interfaces or...
sacdallago/biotrainer
Biological prediction models made simple.
jonathanking/sidechainnet
An all-atom protein structure dataset for machine learning.
a-r-j/ProteinWorkshop
Benchmarking framework for protein representation learning. Includes a large number of...
BioinfoMachineLearning/DIPS-Plus
The Enhanced Database of Interacting Protein Structures for Interface Prediction