gpt-neox and gpt-neo
These are ecosystem siblings that take different technological approaches to the same goal: GPT-Neo uses mesh-tensorflow for distributed training, while GPT-NeoX uses Megatron and DeepSpeed. NeoX is the more recent evolution, designed to scale to larger models.
About gpt-neox
EleutherAI/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Supports distributed training via 3D parallelism (tensor, pipeline, and data) with ZeRO optimization, enabling efficient scaling across heterogeneous platforms including AWS, supercomputers (Summit, Frontier, LUMI), and AMD MI250X GPUs. Features modern architectural innovations such as rotary and ALiBi positional embeddings, Flash Attention 2, and Mixture-of-Experts, with preset configs for Pythia, PaLM, Falcon, and LLaMA. Integrates with the Hugging Face ecosystem (tokenizers, transformers), supports preference learning (DPO, KTO), and connects to monitoring platforms (WandB, Comet ML) and the Language Model Evaluation Harness.
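Because of that Hugging Face integration, checkpoints trained with GPT-NeoX can be loaded directly in `transformers` via the `GPTNeoXForCausalLM` class. A minimal sketch, using the small `EleutherAI/pythia-70m` checkpoint from the Pythia suite (chosen here only to keep the download small; larger Pythia models load the same way):

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Pythia-70m was trained with the GPT-NeoX library and is published
# on the Hugging Face Hub in transformers-compatible format.
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoXForCausalLM.from_pretrained(model_name)

# Greedy generation of a short continuation.
inputs = tokenizer("The GPT-NeoX library", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

For training from scratch or fine-tuning at scale, you would use the GPT-NeoX repo's own launcher and YAML configs rather than `transformers`; the snippet above covers only inference on a released checkpoint.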
About gpt-neo
EleutherAI/gpt-neo
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.
Supports diverse attention mechanisms, including local and linear attention variants, as well as mixture-of-experts layers and axial positional embeddings that go beyond the standard GPT architecture. Built on mesh-tensorflow for distributed training across TPU and GPU clusters with both data and model parallelism, enabling efficient scaling to multi-billion parameter models. Includes pre-trained checkpoints (1.3B and 2.7B parameters) trained on The Pile dataset, compatible with Hugging Face Transformers for immediate inference.
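Those released checkpoints can be used for inference through the standard `transformers` pipeline API. A minimal sketch, using the small `EleutherAI/gpt-neo-125m` checkpoint (an assumption made here purely to keep the download light; the 1.3B and 2.7B checkpoints load the same way under their own model names):

```python
from transformers import pipeline

# gpt-neo-125m is the smallest GPT-Neo checkpoint on the Hugging Face Hub.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

# Greedy decoding of a short continuation of the prompt.
result = generator(
    "EleutherAI trained GPT-Neo on",
    max_new_tokens=20,
    do_sample=False,
)
print(result[0]["generated_text"])
```

The pipeline returns a list of dicts whose `generated_text` field contains the prompt followed by the model's continuation.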