omlx and vllm-mlx
omlx and vllm-mlx are competing inference servers with overlapping capabilities (continuous batching, Apple Silicon optimization via MLX) and different feature trade-offs: omlx emphasizes macOS integration, while vllm-mlx prioritizes OpenAI/Anthropic API compatibility and multimodal model support.
About omlx
jundot/omlx
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
Supports multi-model serving with automatic LRU eviction and manual pinning, alongside vision-language models and embedding/reranker inference, all via OpenAI-compatible API endpoints. The KV cache persists across hot (RAM) and cold (SSD) tiers using block-based management with prefix sharing, so cached context can be restored from disk on subsequent requests even after a server restart. A built-in web dashboard provides real-time monitoring, per-model configuration (sampling parameters, TTL, aliases), and a direct chat interface, with MCP (Model Context Protocol) support for tool integration.
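Because omlx exposes OpenAI-compatible endpoints, any OpenAI-style client can point at it. A minimal sketch of the request shape follows; the port and model alias are assumptions, so substitute whatever omlx reports in its menu-bar UI or dashboard:

```python
import json

# Assumed local endpoint -- the port is a placeholder, not an omlx default.
BASE_URL = "http://localhost:8080/v1"

# Standard OpenAI chat-completions payload; "my-model" stands in for a
# model name or alias configured via the omlx dashboard.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Summarize continuous batching."}],
    "temperature": 0.7,
}

# The request would be sent as POST {BASE_URL}/chat/completions,
# e.g. by an OpenAI SDK configured with base_url=BASE_URL.
body = json.dumps(payload)
```

The same payload works for cache-warm follow-up requests: because omlx shares KV-cache prefixes, repeating the same leading messages lets the server reuse (or restore from SSD) the cached context.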
About vllm-mlx
waybarrios/vllm-mlx
OpenAI- and Anthropic-compatible server for Apple Silicon. Runs LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
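Since vllm-mlx advertises Anthropic compatibility (which is how Claude Code can target it), Anthropic-style clients can send Messages API requests to the local server. A sketch of that request shape, where the port and model name are assumptions rather than vllm-mlx defaults:

```python
import json

# Assumed local endpoint; check the vllm-mlx server output for the real port.
BASE_URL = "http://localhost:8000"

# Anthropic Messages API payload shape; the model name is a placeholder
# for whatever model the server has loaded.
payload = {
    "model": "qwen2-vl",   # example name, an assumption
    "max_tokens": 256,     # required field in the Messages API
    "messages": [{"role": "user", "content": "Explain continuous batching."}],
}

# Sent as POST {BASE_URL}/v1/messages (the Anthropic Messages endpoint path).
body = json.dumps(payload)
```

For OpenAI-style clients the same server would instead be addressed at the `/v1/chat/completions` path with the payload shape shown for omlx above.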