MoonshotAI/MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

44 / 100

Emerging

Divides the full context into blocks and lets each query token attend only to the most relevant KV blocks, selected by a parameter-less top-k gating mechanism. The model can switch between full and sparse attention modes without architectural changes. Integrates with HuggingFace Transformers and depends on Flash Attention 2.6.3. Ships two implementations: a naive, mask-based one for understanding block selection, and an optimized production one reported to be up to 40x faster than the naive version on long sequences.
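The block selection described above can be illustrated with a minimal sketch in plain Python. This is not the repository's code. It assumes each KV block is summarized by the mean of its key vectors, that blocks are scored by a dot product with the query, that the query's own block is always kept and counts toward the top-k budget, and that future blocks are never selected. The names `block_representatives`, `select_blocks`, and `visible_positions` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def block_representatives(keys: Sequence[Vector], block_size: int) -> List[List[float]]:
    """Summarize each block of keys by the mean of its vectors (assumed pooling)."""
    reps = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        reps.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return reps


def select_blocks(query: Vector, q_pos: int, keys: Sequence[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Return the sorted block indices that the query at position q_pos attends to."""
    reps = block_representatives(keys, block_size)
    current = q_pos // block_size
    # Only blocks strictly before the query's block compete in the gate,
    # so no future block can be chosen.
    scored = [(sum(q * r for q, r in zip(query, reps[b])), b) for b in range(current)]
    # Highest score first; ties go to the earlier block for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    # The query's own block is always kept and uses one slot of the budget.
    chosen = {current} | {b for _, b in scored[:max(top_k - 1, 0)]}
    return sorted(chosen)


def visible_positions(blocks: Sequence[int], q_pos: int, block_size: int) -> List[int]:
    """Expand selected blocks to token positions, dropping positions after q_pos."""
    return [p for b in blocks
            for p in range(b * block_size, (b + 1) * block_size)
            if p <= q_pos]


# Toy input: 8 tokens, 2-dimensional keys, 4 blocks of size 2.
keys = [[1.0, 0.0], [1.0, 0.0],
        [0.0, 1.0], [0.0, 1.0],
        [-1.0, 0.0], [-1.0, 0.0],
        [0.5, 0.5], [0.5, 0.5]]
blocks = select_blocks([0.0, 1.0], q_pos=7, keys=keys, block_size=2, top_k=2)
print(blocks)                           # [1, 3]
print(visible_positions(blocks, 7, 2))  # [2, 3, 6, 7]
```

The gate itself has no weights: the selection depends only on the query and the pooled keys. Setting `top_k` to the number of blocks makes every past block visible, which in this sketch corresponds to ordinary causal attention.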

2,076 stars. No commits in the last 6 months.

Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 18 / 25

Stars: 2,076

Forks: 136

Language: Python

License: MIT

Higher-rated alternatives

EfficientMoE/MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

raymin0223/mixture_of_recursions

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation...

AviSoori1x/makeMoE

From scratch implementation of a sparse mixture of experts language model inspired by Andrej...

thu-nics/MoA

[CoLM'25] The official implementation of the paper

CASE-Lab-UMD/Unified-MoE-Compression

The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study...
