GPT2-Chinese and gpt2-ml

These tools are competitors addressing similar NLP tasks involving GPT2 in Chinese or multilingual contexts: GPT2-Chinese provides Chinese GPT2 training code built on a BERT tokenizer, while gpt2-ml offers GPT2 support for multiple languages, including pretrained models.

|                | GPT2-Chinese                         | gpt2-ml                              |
|----------------|--------------------------------------|--------------------------------------|
| Score          | 51 (Established)                     | 51 (Established)                     |
| Maintenance    | 0/25                                 | 0/25                                 |
| Adoption       | 10/25                                | 10/25                                |
| Maturity       | 16/25                                | 16/25                                |
| Community      | 25/25                                | 25/25                                |
| Stars          | 7,598                                | 1,703                                |
| Forks          | 1,694                                | 330                                  |
| Downloads      |                                      |                                      |
| Commits (30d)  | 0                                    | 0                                    |
| Language       | Python                               | Python                               |
| License        | MIT                                  | Apache-2.0                           |
| Flags          | Stale 6m, No Package, No Dependents  | Stale 6m, No Package, No Dependents  |

About GPT2-Chinese

Morizeyao/GPT2-Chinese

Chinese version of GPT2 training code, using BERT tokenizer.

Supports multiple tokenization strategies (character-level, word-level, and BPE) and integrates with HuggingFace Transformers, enabling training on diverse Chinese text domains from classical poetry to novels with configurable model depth and batch sizes. Includes pre-trained models for specialized domains (ancient Chinese, lyrics, couplets) available on Hugging Face Model Hub, alongside utilities for perplexity evaluation and batch text generation with customizable sampling parameters.
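The project's use of a BERT tokenizer matters because Chinese text has no word boundaries, so BERT-style tokenization falls back to splitting text into individual characters. A minimal sketch of that idea, with an illustrative vocabulary that is not the project's actual vocab file:

```python
# Sketch of BERT-style character-level tokenization for Chinese.
# Splitting into single characters sidesteps word segmentation;
# runs of ASCII letters/digits are kept together, as BERT does.

def char_tokenize(text):
    """Split a Chinese string into single-character tokens,
    keeping runs of ASCII alphanumerics together."""
    tokens, buf = [], []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            buf.append(ch)
        else:
            if buf:
                tokens.append("".join(buf))
                buf = []
            if not ch.isspace():
                tokens.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens

def encode(tokens, vocab, unk_id=0):
    """Map tokens to ids, falling back to [UNK] for unseen tokens."""
    return [vocab.get(t, unk_id) for t in tokens]

# Toy vocabulary for illustration only.
vocab = {"[UNK]": 0, "你": 1, "好": 2, "世": 3, "界": 4, "GPT2": 5}
tokens = char_tokenize("你好 世界 GPT2")   # ["你", "好", "世", "界", "GPT2"]
ids = encode(tokens, vocab)                # [1, 2, 3, 4, 5]
```

In the real project, the vocab file and tokenizer come from the HuggingFace Transformers integration rather than a hand-built dictionary.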

About gpt2-ml

imcaspar/gpt2-ml

GPT2 for Multiple Languages, including pretrained models (repo tagline: "GPT2 multilingual support; 1.5-billion-parameter Chinese pretrained model").

Implements simplified training scripts based on the Grover architecture with TPU support, and adapts BERT's tokenizer for multilingual corpus compatibility using the CLUE vocabulary. Provides two 1.5B-parameter Chinese pretrained checkpoints trained on 15-30GB corpora with different tokenization schemes (BERT vs. CLUE tokens), optimized via Cloud TPU Pod for production-ready text generation tasks.
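Both projects expose configurable decoding for generation (the "customizable sampling parameters" above). A hedged sketch of top-k sampling with temperature, the standard scheme behind such knobs; the toy logits are made up, and real use would feed the model's output logits:

```python
import math
import random

def top_k_sample(logits, k=2, temperature=1.0, rng=None):
    """Sample a token index from the k highest-logit entries,
    after temperature scaling and softmax renormalization."""
    rng = rng or random.Random()
    # Scale logits by temperature; lower temperature sharpens the choice.
    scaled = [(logit / temperature, i) for i, logit in enumerate(logits)]
    # Keep only the k most likely candidates.
    top = sorted(scaled, reverse=True)[:k]
    # Softmax over the survivors (subtract max for numerical stability).
    m = max(l for l, _ in top)
    weights = [math.exp(l - m) for l, _ in top]
    # Draw proportionally to the renormalized probabilities.
    r = rng.random() * sum(weights)
    for w, (_, i) in zip(weights, top):
        r -= w
        if r <= 0:
            return i
    return top[-1][1]

# With k=1 this reduces to greedy decoding (always the argmax).
choice = top_k_sample([0.1, 2.0, 0.5], k=1)  # 1
```

This is an illustration of the decoding technique, not code from either repository; in practice the equivalent parameters are passed to the projects' generation scripts.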

Scores are updated daily from GitHub, PyPI, and npm data.