phospho-app/fastassert
Dockerized LLM inference server with constrained output (JSON mode), built on top of vLLM and outlines. Faster, cheaper, and without rate limits. Compare the quality and latency to your current LLM API provider.
This project helps developers and MLOps engineers deploy and serve Large Language Models (LLMs) more efficiently. It takes a text prompt and a desired JSON or regex output format as input, and returns structured responses faster, at lower cost, and without provider rate limits. It's designed for technical teams who integrate LLMs into applications and need reliable, structured outputs.
No commits in the last 6 months.
Use this if you are a developer or MLOps engineer looking to self-host LLM inference, guarantee structured JSON or regex outputs, and reduce costs and latency compared to commercial LLM APIs.
Not ideal if you are a non-technical user or lack the infrastructure to run a local inference server (a Linux OS, CUDA 12.1, and at least 16 GB of GPU RAM).
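To illustrate the prompt-plus-schema workflow described above, here is a minimal Python sketch of a constrained-output request against a self-hosted server. The endpoint path (/generate), port, and payload field names (prompt, json_schema) are assumptions for illustration, not fastassert's documented API; check the repository for the actual interface.

import json
import requests

# Assumed local deployment; the real endpoint path and port may differ.
SERVER_URL = "http://localhost:8000/generate"

# A JSON schema constraining the model's output to a fixed structure.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "prompt": "Extract the person described: 'Ada Lovelace, 36 years old.'",
    "json_schema": schema,  # assumed field name for the output constraint
}

resp = requests.post(SERVER_URL, json=payload, timeout=30)
resp.raise_for_status()

# With constrained decoding, the body should parse as schema-valid JSON.
print(json.loads(resp.text))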
Stars: 27
Forks: —
Language: Jupyter Notebook
License: —
Category: llm-tools
Last pushed: Feb 17, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/phospho-app/fastassert"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000 requests/day.
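The same data can be fetched programmatically. Here is a minimal Python sketch using the endpoint above; the response schema is not documented on this page, so the example simply parses and prints the returned JSON.

import requests

URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools/phospho-app/fastassert"

# No API key needed for up to 100 requests/day.
resp = requests.get(URL, timeout=10)
resp.raise_for_status()

# Inspect the parsed JSON body; field names aren't documented here.
print(resp.json())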
Higher-rated alternatives
containers/ramalama
RamaLama is an open-source developer tool that simplifies the local serving of AI models from...
av/harbor
One command brings a complete pre-wired LLM stack with hundreds of services to explore.
RunanywhereAI/runanywhere-sdks
Production-ready toolkit to run AI locally
runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
foldl/chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)