cubist38/mlx-openai-server

A high-performance API server that exposes OpenAI-compatible endpoints for MLX models. Built in Python on the FastAPI framework, it offers an efficient, scalable, and user-friendly way to run MLX-based vision and language models locally behind an OpenAI-compatible interface.

Score: 79 / 100 (Verified)

Supports multimodal inference (text, vision, audio, image generation/editing) with speculative decoding for faster LLM generation and dynamic model swapping via YAML configuration. Built on MLX's Apple Silicon optimization, it features prompt KV caching, per-model request queuing, LoRA adapter injection for image models, and can run multiple models simultaneously with request routing by model ID.
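To make "OpenAI-compatible with request routing by model ID" concrete, here is a minimal stdlib sketch of a chat-completion request against a local instance. The base URL, port, and model ID are assumptions for illustration, not values documented on this page:

```python
import json
import urllib.request

# Hypothetical base URL -- the actual host/port depend on how you launch the server.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request.

    Routing by model ID means the `model` field selects which of the
    simultaneously loaded MLX models serves this request.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("mlx-community/example-model", "Hello!")
# urllib.request.urlopen(req) would send it to a running server.
```

Any client that speaks the OpenAI chat-completions wire format (including the official `openai` SDK pointed at a custom base URL) should work the same way.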

263 stars and 19,758 monthly downloads. Available on PyPI.

Maintenance: 13 / 25
Adoption: 20 / 25
Maturity: 25 / 25
Community: 21 / 25


Stars: 263
Forks: 47
Language: Python
License: MIT
Last pushed: Mar 18, 2026
Monthly downloads: 19,758
Commits (30d): 0
Dependencies: 24

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/transformers/cubist38/mlx-openai-server"

Open to everyone: 100 requests/day with no key required. A free key raises the limit to 1,000/day.
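The same call can be made from Python with only the standard library. The endpoint URL comes from the curl example above; the structure of the returned JSON is not documented here, so the code only fetches and decodes it:

```python
import json
import urllib.request

# Base endpoint taken from the curl example on this page.
BASE = "https://pt-edge.onrender.com/api/v1/quality"

def quality_url(registry: str, repo: str) -> str:
    """Build the quality-report URL for a repository."""
    return f"{BASE}/{registry}/{repo}"

def fetch_quality(registry: str, repo: str) -> dict:
    """Fetch and decode the JSON quality report (makes a network call)."""
    with urllib.request.urlopen(quality_url(registry, repo)) as resp:
        return json.load(resp)

# Example (requires network access):
# report = fetch_quality("transformers", "cubist38/mlx-openai-server")
```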