PolyAI-LDN/conversational-datasets
Large datasets for conversational AI
Provides reproducible, deterministically split datasets (Reddit: 654M examples; OpenSubtitles: 286M; Amazon QA: 3.6M) structured as context-response pairs, with earlier conversation turns included as additional context, for pre-training conversational models. Apache Beam pipelines running on Google Dataflow handle the distributed processing and serialize the output as either JSON or TensorFlow record files, so the data can be read directly by TensorFlow training workflows.
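As a rough illustration of what a context-response pair and a deterministic split could look like, here is a minimal Python sketch. The field names (`context`, `response`, `context/0`, ...), the helper names, and the MD5-based split are assumptions made for this example; the listing does not confirm the repository's actual schema or hashing scheme.

```python
import hashlib
import json


def make_examples(turns, max_context=2):
    """Turn an ordered list of utterances into context-response examples.

    Field names here are illustrative assumptions, not a confirmed schema.
    """
    examples = []
    for i in range(1, len(turns)):
        example = {"context": turns[i - 1], "response": turns[i]}
        # Earlier turns, most recent first, stored as context/0, context/1, ...
        history = turns[max(0, i - 1 - max_context):i - 1][::-1]
        for j, turn in enumerate(history):
            example[f"context/{j}"] = turn
        examples.append(example)
    return examples


def assign_split(key, train_percent=90):
    """Map a key to 'train' or 'test' by hashing, so reruns give the same split.

    Hashing with MD5 is an assumption chosen for this sketch.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return "train" if int(digest, 16) % 100 < train_percent else "test"


turns = [
    "Hello, how are you?",
    "I am fine, thanks.",
    "Glad to hear it.",
    "What about you?",
]
examples = make_examples(turns)

# One JSON object per line, mirroring a JSON-lines style output.
json_lines = [json.dumps(e, sort_keys=True) for e in examples]
```

Because the split depends only on the hash of a stable key (for instance a hypothetical thread identifier), the same example lands in the same split on every run, which is what makes such a split reproducible.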
1,387 stars. No commits in the last 6 months.
Stars: 1,387
Forks: 177
Language: Python
License: Apache-2.0
Category:
Last pushed: Nov 16, 2019
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/PolyAI-LDN/conversational-datasets"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
Higher-rated alternatives
Pinafore/qb
QANTA Quiz Bowl AI
KristiyanVachev/Question-Generation
Generating multiple choice questions from text using Machine Learning.
wuba/qa_match
A simple effective ToolKit for short text matching
mcQA-suite/mcQA
🔮 Answering multiple choice questions with Language Models.
dapurv5/awesome-question-answering
Resources, datasets, papers on Question Answering