cubist38/mlx-openai-server
A high-performance API server that exposes OpenAI-compatible endpoints for MLX models. Written in Python on the FastAPI framework, it offers an efficient, scalable, and user-friendly way to run MLX-based vision and language models locally behind an OpenAI-compatible interface.
Supports multimodal inference (text, vision, audio, image generation/editing), with speculative decoding for faster LLM generation and dynamic model swapping via YAML configuration. Built on MLX, which is optimized for Apple Silicon, it features prompt KV caching, per-model request queuing, and LoRA adapter injection for image models, and it can serve multiple models simultaneously, routing each request by model ID.
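Because the server is OpenAI-compatible, a standard chat-completions request should work against it. The sketch below builds such a request; the host, port, and model name are assumptions for illustration, not values documented by the project.

```python
import json

# Assumed local address of a running mlx-openai-server instance.
BASE_URL = "http://localhost:8000/v1"

# Hypothetical model ID; with multiple models loaded, the server
# routes the request to the model matching this field.
payload = {
    "model": "my-local-mlx-model",
    "messages": [
        {"role": "user", "content": "Hello from MLX!"},
    ],
}

# Standard OpenAI-style chat-completions route.
url = f"{BASE_URL}/chat/completions"
body = json.dumps(payload)
print(url)
```

With a server actually running, the official `openai` Python client can target the same endpoint by pointing its `base_url` at the local address.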
263 stars and 19,758 monthly downloads. Available on PyPI.
Stars: 263
Forks: 47
Language: Python
License: MIT
Category:
Last pushed: Mar 18, 2026
Monthly downloads: 19,758
Commits (30d): 0
Dependencies: 24
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/cubist38/mlx-openai-server"
Open to everyone: 100 requests/day with no key needed. Get a free key for 1,000 requests/day.