EdinburghNLP/code-docstring-corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Contains 150K+ parallel triples of Python function declarations, docstrings, and bodies extracted from GitHub, with preprocessing via AST normalization and tokenization using Moses scripts. Includes repository-consistent train/validation/test splits, a code-only corpus with synthetically generated docstrings via backtranslation, and baseline NMT results using Nematus. Version 2 extends coverage to class methods, module docstrings, and commit metadata.
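The triple extraction and repository-consistent splitting described above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the corpus authors' pipeline: the function names `extract_triples` and `assign_split`, the hash-based bucketing, and the 5%/5% validation/test proportions are all hypothetical. It also assumes that "repository-consistent" means every function from one repository lands in the same split. Moses tokenization is not reproduced here.

```python
import ast
import hashlib


def extract_triples(source):
    """Return (declaration, docstring, body) triples for documented functions.

    Hypothetical helper: assumes the docstring starts on its own line,
    below the `def` line(s). Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    triples = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        # Skip undocumented functions and functions with no body beyond the docstring.
        if docstring is None or len(node.body) < 2:
            continue
        first_stmt = node.body[0]  # the docstring expression
        if first_stmt.lineno <= node.lineno:
            continue  # one-line definitions are outside this sketch's scope
        # Declaration: source lines from `def` up to the docstring, joined on one line.
        declaration = " ".join(
            line.strip() for line in lines[node.lineno - 1 : first_stmt.lineno - 1]
        )
        # Body: remaining statements, re-rendered from the AST (a crude normalization).
        body = "\n".join(ast.unparse(stmt) for stmt in node.body[1:])
        triples.append((declaration, docstring, body))
    return triples


def assign_split(repo_name, valid_pct=5, test_pct=5):
    """Map a repository name to a split so that all its functions stay together.

    The percentages and the hashing scheme are illustrative assumptions.
    """
    digest = hashlib.sha256(repo_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"
```

Hashing the repository name rather than sampling individual functions keeps near-duplicate code from one project out of both the training and test sets, which is the usual motivation for splitting by repository.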
211 stars. No commits in the last 6 months.
Stars: 211
Forks: 48
Language: Python
License: not listed
Category: not listed
Last pushed: Jul 13, 2020
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ai-coding/EdinburghNLP/code-docstring-corpus"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
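The same request can be made from Python with the standard library alone. This is a sketch under two assumptions: that the URL pattern `/quality/<category>/<owner>/<repo>` generalizes from the single example above, and that the endpoint returns JSON. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, owner, repo):
    """Build the endpoint URL (pattern inferred from the single curl example)."""
    return f"{API_ROOT}/{category}/{owner}/{repo}"


def fetch_quality(category, owner, repo, timeout=10):
    """Fetch and decode the response, assuming the body is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("ai-coding", "EdinburghNLP", "code-docstring-corpus")` issues the same GET request as the curl command and counts against the daily request limit.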
Higher-rated alternatives
k4black/codebleu: Pip-compatible CodeBLEU metric implementation, available for Linux, macOS, and Windows.
LiveCodeBench/LiveCodeBench: Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of...
hendrycks/apps: APPS: Automated Programming Progress Standard (NeurIPS 2021).
alxschwrz/codex_py2cpp: Converts Python code into C++ using OpenAI Codex.
AS-SiliconMind/SiliconMind-V1: Inference engine for SiliconMind-V1 Verilog coding models.