runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
This project helps developers deploy and manage large language models (LLMs) as highly performant, serverless API endpoints. It takes a chosen LLM (like Llama-3.1-8B-Instruct or OpenChat-3.5) and serves it through an API that's compatible with OpenAI's format. The primary users are developers who need to integrate custom LLM capabilities into their applications with speed and efficiency.
Use this if you are a developer looking to deploy your own large language models efficiently and scale them as serverless, OpenAI-compatible API endpoints.
Not ideal if you are an end-user without programming experience, as this tool requires familiarity with Docker, API configuration, and development workflows.
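The description above says the worker serves a model through an OpenAI-compatible API. As a minimal sketch of what a client request looks like, the following builds a chat-completions request against a RunPod serverless endpoint; the endpoint ID, API key, and base-URL pattern are placeholders you would substitute with your own deployment's values.

```python
import json
import urllib.request

# Placeholder values -- substitute your own RunPod endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request (URL, headers, JSON body)."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# urllib.request.urlopen(req) would send it (requires a live, authorized endpoint).
```

Because the worker speaks the OpenAI wire format, the same request shape works with the official OpenAI client libraries by pointing their base URL at the endpoint.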
Stars: 406
Forks: 290
Language: Python
License: MIT
Category:
Last pushed: Mar 10, 2026
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/runpod-workers/worker-vllm"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
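The curl command above can also be issued from Python. This sketch only constructs the request URL and a small fetch helper; the response is assumed to be JSON, and its exact schema is not documented here, so no fields are parsed out.

```python
import json
import urllib.request

OWNER, REPO = "runpod-workers", "worker-vllm"
# Same endpoint as the curl example above.
url = f"https://pt-edge.onrender.com/api/v1/quality/transformers/{OWNER}/{REPO}"

def fetch_quality(endpoint_url: str) -> dict:
    """Fetch the quality record and decode it as JSON (schema assumed, not parsed here)."""
    with urllib.request.urlopen(endpoint_url, timeout=10) as resp:
        return json.loads(resp.read())

# fetch_quality(url) would return the decoded record (requires network access).
```

Anonymous access is rate-limited to 100 requests/day, so a production integration should cache responses or use a free key for the higher limit.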
Related projects
containers/ramalama
RamaLama is an open-source developer tool that simplifies the local serving of AI models from...
eastriverlee/LLM.swift
LLM.swift is a simple and readable library that allows you to interact with large language...
beehive-lab/GPULlama3.java
GPU-accelerated Llama3.java inference in pure Java using TornadoVM.
gitkaz/mlx_gguf_server
This is a FastAPI based LLM server. Load multiple LLM models (MLX or llama.cpp) simultaneously...
Scottcjn/llama-cpp-power8
AltiVec/VSX optimized llama.cpp for IBM POWER8