jundot/omlx
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
Supports multi-model serving with automatic LRU eviction and manual pinning, alongside vision-language models and embedding/reranker inference, all via OpenAI-compatible API endpoints. The KV cache persists across hot (RAM) and cold (SSD) tiers using block-based management with prefix sharing, restoring cached context from disk on subsequent requests even after server restarts. Includes a built-in web dashboard for real-time monitoring, per-model configuration (sampling, TTL, aliases), and a direct chat interface, with MCP (Model Context Protocol) support for tool integration.
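Because the endpoints are OpenAI-compatible, any standard OpenAI client can talk to a running omlx server. The sketch below uses the official openai Python package; the base URL, port, and model name are assumptions that depend on your local configuration, not values documented on this page.

from openai import OpenAI

# A minimal sketch, assuming an omlx server is already running locally.
# The port (8080) and model identifier are placeholders; check your
# omlx configuration for the actual values.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain prefix sharing in one sentence."}],
)
print(response.choices[0].message.content)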
4,057 stars. Actively maintained with 539 commits in the last 30 days.
Stars: 4,057
Forks: 306
Language: Python
License: Apache-2.0
Category: llm-tools
Last pushed: Mar 13, 2026
Commits (30d): 539
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jundot/omlx"
Open to everyone: 100 requests/day with no key needed. A free API key raises the limit to 1,000 requests/day.
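For scripted access, the same endpoint can be called from Python. Below is a minimal sketch using the requests library, assuming the endpoint returns JSON as suggested by the curl example above; the response is printed as-is because its schema is not documented in this listing.

import requests

# Fetch the repository metrics shown on this page. No API key is required
# for up to 100 requests/day. The JSON is treated as opaque here because
# the field names are not documented in this listing.
url = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/jundot/omlx"
resp = requests.get(url, timeout=10)
resp.raise_for_status()
print(resp.json())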
Related tools
waybarrios/vllm-mlx
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models...
jordanhubbard/nanolang
A tiny experimental language designed to be targeted by coding LLMs
josStorer/RWKV-Runner
A RWKV management and startup tool with full automation, only 8 MB. Provides an interface...
akivasolutions/tightwad
Pool your CUDA + ROCm GPUs into one OpenAI-compatible API. Speculative decoding proxy gives you...
petrukha-ivan/mlx-swift-structured
Structured output generation in Swift