runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
This project helps developers deploy and manage large language models (LLMs) as highly performant, serverless API endpoints. It takes a chosen LLM (like Llama-3.1-8B-Instruct or OpenChat-3.5) and serves it through an API that's compatible with OpenAI's format. The primary users are developers who need to integrate custom LLM capabilities into their applications with speed and efficiency.
Use this if you are a developer looking to deploy your own large language models efficiently and scale them as serverless, OpenAI-compatible API endpoints.
Not ideal if you are an end-user without programming experience, as this tool requires familiarity with Docker, API configuration, and development workflows.
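The description above says the worker serves a model through an OpenAI-compatible API. As a minimal sketch of what a client request looks like, the following builds a chat-completions request against a RunPod serverless endpoint; the endpoint ID, API key, and base-URL pattern are placeholders you would substitute with your own deployment's values.

```python
import json
import urllib.request

# Placeholder values -- substitute your own RunPod endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request (URL, headers, JSON body)."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# urllib.request.urlopen(req) would send it (requires a live, authorized endpoint).
```

Because the worker speaks the OpenAI wire format, the same request shape works with the official OpenAI client libraries by pointing their base URL at the endpoint.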
Stars: 406
Forks: 290
Language: Python
License: MIT
Category:
Last pushed: Mar 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/runpod-workers/worker-vllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
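The curl command above can also be issued from Python. This sketch only constructs the request URL and a small fetch helper; the response is assumed to be JSON, and its exact schema is not documented here, so no fields are parsed out.

```python
import json
import urllib.request

OWNER, REPO = "runpod-workers", "worker-vllm"
# Same endpoint as the curl example above.
url = f"https://pt-edge.onrender.com/api/v1/quality/transformers/{OWNER}/{REPO}"

def fetch_quality(endpoint_url: str) -> dict:
    """Fetch the quality record and decode it as JSON (schema assumed, not parsed here)."""
    with urllib.request.urlopen(endpoint_url, timeout=10) as resp:
        return json.loads(resp.read())

# fetch_quality(url) would return the decoded record (requires network access).
```

Anonymous access is rate-limited to 100 requests/day, so a production integration should cache responses or use a free key for the higher limit.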
Related projects
containers/ramalama
RamaLama is an open-source developer tool that simplifies the local serving of AI models from...
eastriverlee/LLM.swift
LLM.swift is a simple and readable library that allows you to interact with large language...
beehive-lab/GPULlama3.java
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.
gitkaz/mlx_gguf_server
This is a FastAPI based LLM server. Load multiple LLM models (MLX or llama.cpp) simultaneously...
Scottcjn/llama-cpp-power8
AltiVec/VSX optimized llama.cpp for IBM POWER8