gpt-neox and gpt-neo
These are ecosystem siblings that take different technological approaches to the same goal: GPT-Neo uses mesh-tensorflow for distributed training, while GPT-NeoX uses Megatron and DeepSpeed. NeoX is the more recent evolution, designed to scale to larger models.
About gpt-neox
EleutherAI/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Supports distributed training via 3D parallelism (tensor, pipeline, and data) with ZeRO optimization, enabling efficient scaling across heterogeneous platforms including AWS, supercomputers (Summit, Frontier, LUMI), and AMD MI250X GPUs. Features modern architectural innovations such as rotary and ALiBi positional embeddings, Flash Attention 2, and Mixture-of-Experts, with preset configs for Pythia, PaLM, Falcon, and LLaMA. Integrates with the Hugging Face ecosystem (tokenizers, transformers), supports preference learning (DPO, KTO), and connects to monitoring platforms (WandB, Comet ML) and the Language Model Evaluation Harness.
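Because of that Hugging Face integration, checkpoints trained with GPT-NeoX can be loaded directly in `transformers` via the `GPTNeoXForCausalLM` class. A minimal sketch, using the small `EleutherAI/pythia-70m` checkpoint from the Pythia suite (chosen here only to keep the download small; larger Pythia models load the same way):

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Pythia-70m was trained with the GPT-NeoX library and is published
# on the Hugging Face Hub in transformers-compatible format.
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Greedy generation of a short continuation.
inputs = tokenizer("The GPT-NeoX library", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

For training from scratch or fine-tuning at scale, you would use the GPT-NeoX repo's own launcher and YAML configs rather than `transformers`; the snippet above covers only inference on a released checkpoint.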
About gpt-neo
EleutherAI/gpt-neo
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
Supports diverse attention mechanisms, including local and linear attention variants, as well as mixture-of-experts layers and axial positional embeddings that go beyond the standard GPT architecture. Built on mesh-tensorflow for distributed training across TPU and GPU clusters with both data and model parallelism, enabling efficient scaling to multi-billion parameter models. Includes pre-trained checkpoints (1.3B and 2.7B parameters) trained on The Pile dataset, compatible with Hugging Face Transformers for immediate inference.
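Those released checkpoints can be used for inference through the standard `transformers` pipeline API. A minimal sketch, using the small `EleutherAI/gpt-neo-125m` checkpoint (an assumption made here purely to keep the download light; the 1.3B and 2.7B checkpoints load the same way under their own model names):

```python
from transformers import pipeline

# gpt-neo-125m is the smallest GPT-Neo checkpoint on the Hugging Face Hub.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

# Greedy decoding of a short continuation of the prompt.
result = generator(
    "EleutherAI trained GPT-Neo on",
    max_new_tokens=20,
    do_sample=False,
)
print(result[0]["generated_text"])
```

The pipeline returns a list of dicts whose `generated_text` field contains the prompt followed by the model's continuation.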