flash-linear-attention and flash_attention_inference
Flash-linear-attention provides optimized implementations of linear attention mechanisms as an alternative to the quadratic attention that flash-attention accelerates, while flash_attention_inference benchmarks flash-attention's C++ interface in LLM inference scenarios. The two projects target different efficiency trade-offs in attention computation and are complementary rather than direct competitors.
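The quadratic-vs-linear trade-off can be illustrated with a minimal NumPy sketch. This is a generic illustration of the kernelized linear-attention idea, not either library's actual kernels; the feature map `phi` is an arbitrary choice made for the example:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializing the (n x n) score matrix
    makes this O(n^2 * d) in time and memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized linear attention: reassociating the product as
    phi(Q) @ (phi(K).T @ V) avoids the n x n matrix, giving O(n * d^2)."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                # (d, d) summary of keys and values
    z = Qp @ Kp.sum(axis=0)      # per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The key point is the change of association order: the intermediate `kv` matrix is `d x d` rather than `n x n`, so cost grows linearly with sequence length.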
About flash-linear-attention
fla-org/flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
Provides PyTorch and Triton kernels for linear attention and related subquadratic architectures (RetNet, GLA, Mamba, RWKV, DeltaNet, and 20+ emerging models), with Triton kernels targeting NVIDIA, AMD, and Intel GPUs. Includes fused operators, hybrid model support, and variable-length sequence handling to reduce memory overhead during training. Integrates with the Hugging Face model hub and the companion `flame` training framework for distributed model development.
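Many of the architectures listed above (e.g. RetNet, GLA) build on a recurrent state formulation of linear attention, which is what makes fused training kernels and cheap autoregressive decoding possible. The following is a hedged NumPy sketch of that recurrence, not FLA's actual Triton implementation; the scalar decay `gamma` is an assumed RetNet-style simplification (GLA uses learned, data-dependent gates):

```python
import numpy as np

def recurrent_linear_attention(Q, K, V, gamma=0.9):
    """Causal linear attention in recurrent form: a (d x d) state S is
    updated per step (S <- gamma * S + k v^T), so producing each token
    costs O(d^2) regardless of sequence length."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[-1]))
    out = np.empty_like(V)
    for t in range(n):
        S = gamma * S + np.outer(K[t], V[t])  # decayed state update
        out[t] = Q[t] @ S                     # read out with the query
    return out

rng = np.random.default_rng(1)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = recurrent_linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Libraries like flash-linear-attention typically compute this same recurrence in parallel "chunks" on the GPU rather than token by token, trading the sequential loop for large matrix multiplications within each chunk.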
About flash_attention_inference
Bruce-Lee-LY/flash_attention_inference
Benchmarks the performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.