omlx and vllm-mlx

omlx and vllm-mlx are competing inference servers with overlapping capabilities (continuous batching, Apple Silicon optimization via MLX) but different feature trade-offs: omlx emphasizes macOS integration, while vllm-mlx prioritizes OpenAI API compatibility and multimodal model support.

                omlx              vllm-mlx
Score           65 (Established)  61 (Established)
Maintenance     25/25             25/25
Adoption        10/25             10/25
Maturity        11/25             5/25
Community       19/25             21/25
Stars           4,057             579
Forks           306               87
Commits (30d)   539               59
Language        Python            Python
License         Apache-2.0        none
Package         none              none
Downloads       none              none
Dependents      none              none

About omlx

jundot/omlx

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

omlx supports multi-model serving with automatic LRU eviction and manual pinning, plus vision-language models and embedding/reranker inference, all via OpenAI-compatible API endpoints. The KV cache persists across hot (RAM) and cold (SSD) tiers using block-based management with prefix sharing, so cached context is restored from disk on subsequent requests even after a server restart. A built-in web dashboard provides real-time monitoring, per-model configuration (sampling, TTL, aliases), and a direct chat interface, with MCP (Model Context Protocol) support for tool integration.
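Because omlx exposes OpenAI-compatible endpoints, any standard OpenAI-style client can talk to it. A minimal sketch of building a chat-completion request body follows; the port, path, and model name are illustrative assumptions, not taken from omlx's documentation.

```python
import json

# Hypothetical local endpoint; omlx's actual default host/port may differ.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, user_message: str, temperature: float = 0.7) -> str:
    """Build a standard OpenAI-style chat-completion request body as JSON."""
    payload = {
        "model": model,  # e.g. a model name or alias configured in the dashboard
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return json.dumps(payload)

body = build_chat_request("my-local-model", "Hello!")
# POST this body to BASE_URL with Content-Type: application/json,
# e.g. requests.post(BASE_URL, data=body, headers=...) against a running server.
```

Because the wire format is the standard OpenAI one, existing SDKs and tools only need their base URL pointed at the local server.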

About vllm-mlx

waybarrios/vllm-mlx

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.


Scores updated daily from GitHub, PyPI, and npm data.