pjlab-sys4nlp/llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)

Quality score: 41 / 100 (Emerging)

Converts dense LLaMA FFN layers into sparse Mixture-of-Experts layers by partitioning FFN neurons into experts (random, clustering, co-activation graph, or gradient-based splits) and routing tokens with top-K gating, so only 3.0–3.5B parameters are activated per token. Supports multiple gating strategies (TopK Noisy Gate, Switch Gating), accelerates training with FlashAttention-v2, and adopts the dynamic batch-sampling weights from Sheared LLaMA. Integrates with the Hugging Face Transformers ecosystem and provides comprehensive training monitoring (gate load/importance, load-balancing metrics, throughput visualization).
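To make the routing idea concrete, here is a minimal sketch of a noisy top-K gate in the general style the description mentions. This is an illustrative reimplementation, not the repository's actual module; the class name NoisyTopKGate and its signature are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Sketch of a noisy top-K router: picks K experts per token."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.w_noise = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # Clean logits plus input-dependent Gaussian noise; the noise
        # encourages exploration so load spreads across experts.
        logits = self.w_gate(x)
        noise = torch.randn_like(logits) * F.softplus(self.w_noise(x))
        noisy_logits = logits + noise
        # Keep only the top-K experts per token and renormalize
        # their gate weights with a softmax.
        top_vals, top_idx = noisy_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)
        return top_idx, weights  # which experts to run, and mixing weights

# Usage: route a (batch, seq, hidden) activation to 4 of 16 experts.
gate = NoisyTopKGate(hidden_size=4096, num_experts=16, top_k=4)
idx, w = gate(torch.randn(2, 8, 4096))
```

Only the selected experts' FFNs run for each token, which is how the activated parameter count stays at 3.0–3.5B while the total parameter count is larger.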

1,002 stars. No commits in the last 6 months.

Stale (6m) · No Package · No Dependents

Maintenance: 0 / 25
Adoption: 10 / 25
Maturity: 16 / 25
Community: 15 / 25


Stars: 1,002
Forks: 62
Language: Python
License: Apache-2.0
Last pushed: Dec 06, 2024
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/pjlab-sys4nlp/llama-moe"

Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.