Koziev/NLP_Datasets

My NLP datasets for Russian language

45 / 100 (Emerging)

Provides curated dialogue corpora (imageboard conversations, literary fiction, jokes, movie subtitles) with quality-scored JSONL annotations for relevance and specificity, plus synthetic Q&A datasets for arithmetic reasoning. Includes shallow-parsed sentence templates with morphologically-tagged noun phrase slots (~21M samples) and collocation/verb-phrase patterns extracted via automatic chunking, designed for training generative Russian chatbots compatible with ruGPT and sentence-BERT models. Datasets are distributed across Hugging Face Hub and GitHub with companion training scripts for fine-tuning dialogue systems and paraphrase detectors.
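As a rough illustration of how quality-scored JSONL annotations might be used, the sketch below keeps only dialogue records whose scores pass a threshold. The field names (`context`, `reply`, `relevance`, `specificity`), the 0 to 1 score range, and the sample records are assumptions made for this example, not the repository's documented schema.

```python
import io
import json


def filter_dialogues(lines, min_relevance=0.5, min_specificity=0.5):
    """Yield JSONL records whose scores meet both thresholds.

    The "relevance" and "specificity" keys are hypothetical names;
    adjust them to match the actual files.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if (record.get("relevance", 0.0) >= min_relevance
                and record.get("specificity", 0.0) >= min_specificity):
            yield record


# Made-up sample standing in for a real JSONL file.
sample = io.StringIO(
    '{"context": "Привет!", "reply": "Привет, как дела?", '
    '"relevance": 0.9, "specificity": 0.7}\n'
    '{"context": "Что нового?", "reply": "Да.", '
    '"relevance": 0.8, "specificity": 0.1}\n'
)

kept = list(filter_dialogues(sample))
print(len(kept))
```

With a real file, the same function can be fed an open file handle, for example `open(path, encoding="utf-8")`, since it only iterates over lines.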

386 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 19 / 25


Stars: 386
Forks: 55
Language: C#
License: CC0-1.0
Last pushed: Feb 18, 2023
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/nlp/Koziev/NLP_Datasets"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
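The curl call above can also be issued from Python using only the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are invented here, the idea that the path generalizes to `<category>/<owner>/<repo>` is inferred from the single URL shown, and the assumption that the endpoint returns JSON is not confirmed by this page.

```python
import json
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category, repo):
    """Build the request URL (hypothetical helper; path layout is inferred)."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category, repo, timeout=10):
    """Fetch the record and decode it, assuming the response body is JSON."""
    with urllib.request.urlopen(quality_url(category, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(quality_url("nlp", "Koziev/NLP_Datasets"))
```

Calling `fetch_quality("nlp", "Koziev/NLP_Datasets")` performs the actual request and counts against the daily request limit.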