// OpenAI-Compatible GPU Gateway
Multi-tenant, OpenAI-compatible LLM gateway. Route requests to cloud GPU pods (RunPod, Vast.ai, Lambda Labs), a local GPU, or commercial APIs — all behind one endpoint on port 8000. Self-host 7B–70B models on rented GPUs and pay only when running. Drop-in replacement for the OpenAI API — works with Open WebUI, Cursor, and any OpenAI client. Mix providers freely: local GPU for dev, cloud GPU for heavy loads, OpenAI as fallback. Per-user billing, quotas, and tier restrictions for team deployments.
┌──────────────────────────────────────────────────────────────────┐
│ CLIENT (any OpenAI SDK / curl / Open WebUI) │
└────────────────────────────┬─────────────────────────────────────┘
│ POST /v1/chat/completions
▼
┌──────────────────────────────────────────────────────────────────┐
│ BRIDGE API :8000 (FastAPI) │
│ │
│ auth → rate-limit → pipeline → router → instance manager │
└──┬──────────────────────────────────────────────────────────────┘
│
│ ROUTER (bridge/router.py)
│
├─ image_url detected ────────────────────────────────────────── ▶ vision tier (hard stop)
│
├─ X-Tier header / ?tier= / allowed_tiers / budget / complexity
│
├─ [simple / architecture] ──────────────────────── ▶ RunPod RTX 4090 ~$0.69/hr
├─ [maximum] ──────────────────────── ▶ RunPod L40S 48GB ~$1.14/hr
├─ [ultra] ──────────────────────── ▶ RunPod A100 80GB ~$1.89/hr
├─ [vision] ──────────────────────── ▶ Together Dedicated / RunPod MiniCPM-V
├─ [cloud fallback] ──────────────────────── ▶ Vast.ai → Lambda Labs
└─ [commercial] ──────────────────────── ▶ OpenAI / Groq / Cerebras / SambaNova / Together / Mistral / DeepSeek
PREPROCESSOR ─── local Ollama 7B (qwen2.5-coder) rewrites prompt before cloud inference
┌───────────────────────────────────────────────────────────────────┐
│ Postgres :5432 (state, billing, audit) │
│ Redis :6379 (quota, rate-limit, cache) │
└───────────────────────────────────────────────────────────────────┘
| File | Role |
|---|---|
| bridge/main.py | Routes, auth, image URL resolution to base64 |
| bridge/router.py | Tier selection — vision hard-stop, tokens/files/keywords/budget |
| bridge/instance_manager.py | Pod pool, lifecycle, health checks, idle reaper |
| bridge/multi_model.py | WorkflowOrchestrator — named pipelines (llm-visual-html, etc.) |
| providers/base.py | BaseProvider ABC — GPU ranking, fallback order |
| providers/runpod.py / vast.py / lambda_labs.py | Cloud GPU pod providers |
| providers/api_compat.py | OpenAI / Groq / Together / Mistral / DeepSeek pass-through |
| database/models.py | User, Pod, Request, ApiKey, Invoice ORM models |
| dashboard/app.py | Streamlit entry point — Overview, Monitoring, Analytics, Billing |
Clone and copy the environment template
$ git clone https://github.com/infectiousoma/gpu-relay
$ cd gpu-relay
$ cp .env.example .env
Edit .env — set provider keys, secrets, and (optionally) network volume
$ $EDITOR .env

# Required: at least one GPU provider or commercial API key
PROVIDER_PRIORITY=runpod,vast,lambda
RUNPOD_API_KEY=rp_xxxxxxxxxxxxxxxxxxxx

# Required: generate with openssl rand -hex 32
BRIDGE_SECRET_KEY=<random-hex-64>
POSTGRES_PASSWORD=<strong-password>
OPENWEBUI_SECRET_KEY=<random-hex-64>

# Optional: cuts RunPod cold starts from minutes → ~30 s
RUNPOD_NETWORK_VOLUME_ID=<volume-id>
One-shot setup — builds images, starts stack, runs migrations, bootstraps admin user
$ bash scripts/setup.sh
Smoke test
$ curl -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llm-simple","messages":[{"role":"user","content":"hello"}]}' \
http://localhost:8000/v1/chat/completions
Install llmctl shortcut (optional)
# System-wide (requires sudo)
$ sudo ln -sf "$(pwd)/scripts/llmctl" /usr/local/bin/llmctl

# Per-user (no sudo — ensure ~/.local/bin is in $PATH)
$ mkdir -p ~/.local/bin && ln -sf "$(pwd)/scripts/llmctl" ~/.local/bin/llmctl
Add a user and connect Open WebUI
$ llmctl users add you@example.com
$ llmctl users keys-add you@example.com --label "open-webui"
# Copy the sk-llm-... key — displayed ONCE

# In Open WebUI → Admin Panel → Settings → Connections:
#   OpenAI API URL: http://bridge:8000/v1
#   Key: sk-llm-...
MOCK_PROVIDERS=1 ./scripts/smoke_test.sh — routes all GPU requests to local Ollama. No billing, no cold start. All 13 E2E tests pass.
Workflow models: llm-smart llm-code-review llm-refactor llm-arch-design llm-visual-html
# Use a specific tier
$ curl -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llm-architecture","messages":[{"role":"user","content":"review this code"}]}' \
  http://localhost:8000/v1/chat/completions

# Auto-route — router picks tier based on complexity
$ curl -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llm-auto","messages":[{"role":"user","content":"what is 2+2?"}]}' \
  http://localhost:8000/v1/chat/completions

# Force tier via header (bypasses all routing logic)
$ curl -H "Authorization: Bearer $API_KEY" \
  -H "X-Tier: simple" \
  -H "Content-Type: application/json" \
  -d '{"model":"llm-auto","messages":[{"role":"user","content":"hello"}]}' \
  http://localhost:8000/v1/chat/completions

# Force local GPU — bypasses cloud regardless of tier config
$ curl -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llm-local","messages":[{"role":"user","content":"hello"}]}' \
  http://localhost:8000/v1/chat/completions
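The same requests can be issued from Python using only the standard library. The sketch below builds the request without sending it, so it runs without a live bridge. The helper name `build_chat_request` and the example key are illustrative assumptions, not part of the project.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt, tier=None):
    """Build (but do not send) a POST to the bridge's chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if tier is not None:
        # Same effect as `-H "X-Tier: simple"` in the curl example.
        headers["X-Tier"] = tier
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000", "sk-llm-example", "llm-auto", "hello", tier="simple"
)
# To actually send it against a running bridge:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)
```

Because the bridge is OpenAI-compatible, any OpenAI client pointed at http://localhost:8000/v1 should work equally well; the raw request is shown only to make the headers explicit.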
| Model | Tier | Underlying Model | GPU | Est. $/hr |
|---|---|---|---|---|
| llm-simple | simple | Qwen2.5-Coder 7B | RTX 4090 | ~$0.69 |
| llm-architecture | architecture | Qwen2.5 32B (tools) / Qwen2.5-Coder 32B (chat) | RTX 4090 | ~$0.69 |
| llm-maximum | maximum | DeepSeek V3 | L40S 48GB | ~$1.14 |
| llm-ultra | ultra | Qwen2.5 72B | A100 80GB | ~$1.89 |
| llm-vision | vision | Llama-3.2-11B-Vision / MiniCPM-V | L40 / RTX 4090 | ~$0.69–1.49 |
| llm-auto | — | router selects | varies | varies |
| llm-local | local | Ollama (same models) | local GPU / CPU | free |
| # | Signal | Action |
|---|---|---|
| 0 | image_url content parts | → vision tier (hard stop — no fallthrough) |
| 1 | X-Tier header / ?tier= query param | force specific tier |
| 2 | allowed_tiers user whitelist | restrict scope |
| 3 | budget gate | downgrade or HTTP 402 |
| 4 | token count thresholds | route by prompt size |
| 5 | file count thresholds | route by file complexity |
| 6 | complexity keywords | route by detected intent |
| 7 | default | → simple |
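The priority table can be read as a small decision function. The sketch below is one possible reading, not the code in bridge/router.py: the token, file, and keyword thresholds are invented for illustration, and the whitelist and budget gate (steps 2 and 3) are modelled as filters that can downgrade the tier chosen by steps 4 to 7.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TIER_ORDER = ("simple", "architecture", "maximum", "ultra")
# Hypothetical thresholds; the real values are not given in this document.
TOKEN_THRESHOLDS = ((32_000, "ultra"), (8_000, "maximum"), (2_000, "architecture"))
FILE_THRESHOLD = 5
KEYWORDS = ("refactor", "architecture", "design")


class BudgetExceeded(Exception):
    status_code = 402  # the budget gate answers with HTTP 402


@dataclass
class RoutingInput:
    has_image: bool = False                            # step 0
    forced_tier: Optional[str] = None                  # step 1
    allowed_tiers: Optional[Tuple[str, ...]] = None    # step 2
    affordable_tiers: Tuple[str, ...] = TIER_ORDER     # step 3 (assumed input)
    prompt_tokens: int = 0                             # step 4
    file_count: int = 0                                # step 5
    text: str = ""                                     # step 6


def _tier_from_signals(req):
    """Steps 4-7: first matching signal wins, default is simple."""
    for minimum, tier in TOKEN_THRESHOLDS:
        if req.prompt_tokens >= minimum:
            return tier
    if req.file_count >= FILE_THRESHOLD:
        return "architecture"
    if any(word in req.text.lower() for word in KEYWORDS):
        return "architecture"
    return "simple"


def select_tier(req):
    if req.has_image:
        return "vision"          # hard stop, nothing else is consulted
    if req.forced_tier:
        return req.forced_tier   # X-Tier header or ?tier= query param
    wanted = _tier_from_signals(req)
    permitted = [t for t in TIER_ORDER
                 if req.allowed_tiers is None or t in req.allowed_tiers]
    if not permitted:
        raise PermissionError("no routable tier in the user's whitelist")
    permitted = [t for t in permitted if t in req.affordable_tiers]
    if not permitted:
        raise BudgetExceeded("no permitted tier fits the remaining budget")
    limit = TIER_ORDER.index(wanted)
    at_or_below = [t for t in permitted if TIER_ORDER.index(t) <= limit]
    # Downgrade to the highest permitted tier not above the wanted one.
    return at_or_below[-1] if at_or_below else permitted[0]
```

For example, a 40,000-token prompt from a user restricted to simple and architecture would land on architecture under these assumptions.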
| Tier | OpenAI | Groq | Cerebras | SambaNova | Together | Mistral | DeepSeek |
|---|---|---|---|---|---|---|---|
| simple | gpt-4o-mini | llama-3.1-8b-instant | zai-glm-4.7 | Llama-3.1-8B | Llama-3.2-11B | mistral-small | deepseek-chat |
| mid | gpt-4o-mini | llama-3.3-70b | gpt-oss-120b | Llama-3.3-70B | Llama-3.1-70B | mistral-medium | deepseek-chat |
| architecture | gpt-4o | llama-3.3-70b | gpt-oss-120b | Llama-3.3-70B | Llama-3.1-70B | mistral-medium | deepseek-chat |
| maximum | gpt-4o | llama-3.3-70b | — | — | Llama-3.1-405B | mistral-large | deepseek-reasoner |
| ultra | gpt-4o | llama-3.3-70b | — | Llama-3.1-405B | Llama-3.1-405B | mistral-large | deepseek-reasoner |
Override any mapping via env var — e.g. OPENAI_MODEL_ARCHITECTURE=o1-mini
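A minimal sketch of how such an override could be resolved. It assumes the variable name follows the pattern `<PROVIDER>_MODEL_<TIER>`, which is inferred from the single example above and may not hold for every provider. `DEFAULT_MODELS` and `resolve_model` are hypothetical names, and only a few rows of the table are reproduced.

```python
import os

# A few defaults copied from the table above; the real mapping is larger.
DEFAULT_MODELS = {
    ("openai", "simple"): "gpt-4o-mini",
    ("openai", "architecture"): "gpt-4o",
    ("groq", "simple"): "llama-3.1-8b-instant",
    ("mistral", "maximum"): "mistral-large",
}


def resolve_model(provider, tier, env=os.environ):
    """Return the env override if set and non-empty, else the table default."""
    # Assumed naming pattern: OPENAI_MODEL_ARCHITECTURE, GROQ_MODEL_SIMPLE, ...
    override = env.get(f"{provider.upper()}_MODEL_{tier.upper()}")
    return override or DEFAULT_MODELS[(provider, tier)]
```

Passing `env` explicitly keeps the lookup easy to test without touching the process environment.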
Any request with image_url content parts routes exclusively to the vision tier — steps 1–7 are skipped entirely.
downstream_model is set for pipeline use).

| Routing Tier | Vision Model | Hardware |
|---|---|---|
| simple | Qwen3-VL-8B-Instruct | L40 48GB / L40S / A100-40GB |
| vision | Llama-3.2-11B-Vision-Instruct-Turbo | L40 48GB / L40S / A100 |
| architecture / maximum / ultra | Llama-3.2-90B-Vision-Instruct-Turbo | A100-80GB / H100-80GB |
Run claude (Claude Code CLI) backed by your own local or cloud LLMs instead of Anthropic's API.
The claude-code-router (ccr) service sits between Claude Code
and the bridge, converting the Anthropic API format Claude Code expects into the OpenAI format the bridge speaks.
Claude Code → ccr :3456 → Bridge :8000 → RunPod / Groq / local GPU / …
│
├─ Converts Anthropic ↔ OpenAI wire format
├─ Routes by request type: default / background / think / longContext
└─ Passes tier name as model field (e.g. "architecture")
Copy and configure the ccr config
$ cp config/ccr-config.json.example config/ccr-config.json
$ $EDITOR config/ccr-config.json
# Set APIKEY to any secret, set api_key to your bridge sk-llm-... key
Add to .env and start the ccr service
CCR_PORT=3456
CCR_BRIDGE_API_KEY=sk-llm-...   # bridge API key from llmctl users keys-add

$ docker compose up -d ccr
Activate on the host and run Claude Code
$ source scripts/ccr-activate.sh   # exports ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN
$ claude --model architecture      # forces architecture tier
$ claude                           # uses default tier from ccr Router config
| ccr type | When used | Bridge tier |
|---|---|---|
| default | Most requests | architecture |
| background | Low-priority / short tasks | simple |
| think | Extended thinking | maximum |
| longContext | Large context window | ultra |
For Ollama-backed tiers, the bridge automatically selects between two models based on whether the request carries tool definitions:
| Request type | Model | Reason |
|---|---|---|
| Has tools (Claude Code tool-call session) | qwen2.5:32b-instruct-q4_K_M | Instruct variant handles tool call JSON format correctly |
| No tools (plain chat, title generation) | qwen2.5-coder:32b-instruct-q4_K_M | Coder variant is faster for pure text generation |
Both models are pulled at pod startup. If the coder model fails to pull, requests fall back to the primary model automatically.
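The selection rule above can be sketched in a few lines. The function name `pick_ollama_model` and the `available_models` argument are assumptions for illustration, and the sketch takes "primary model" to mean the instruct variant.

```python
TOOL_MODEL = "qwen2.5:32b-instruct-q4_K_M"        # used when tools are present
CHAT_MODEL = "qwen2.5-coder:32b-instruct-q4_K_M"  # used for plain chat


def pick_ollama_model(request_body, available_models):
    """Choose the Ollama model for one OpenAI-style chat request."""
    if request_body.get("tools"):
        return TOOL_MODEL
    if CHAT_MODEL in available_models:
        return CHAT_MODEL
    # Coder model failed to pull: fall back to the primary (instruct) model.
    return TOOL_MODEL
```

An empty `tools` list is treated the same as no tools at all, which seems the natural reading of "carries tool definitions".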
Claude Code sends 26+ tools with verbose descriptions (~64K tokens total). The bridge strips
description fields from all tool schemas before forwarding, and applies this to all providers,
including Ollama/RunPod. This cuts the tool payload to ~5K tokens.
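A minimal sketch of description stripping on OpenAI-style tool definitions. It is an illustration under assumptions, not the bridge's actual code: the function name is hypothetical, and the sketch deliberately keeps a parameter that happens to be named "description" while still removing that parameter's own description text.

```python
def strip_tool_descriptions(node, in_properties=False):
    """Return a copy of a tool schema with every description field removed.

    Keys that sit directly inside a JSON Schema "properties" map are
    parameter names, so a parameter called "description" is kept.
    """
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key == "description" and not in_properties:
                continue  # drop the verbose text
            child_is_properties = key == "properties" and not in_properties
            cleaned[key] = strip_tool_descriptions(value, child_is_properties)
        return cleaned
    if isinstance(node, list):
        return [strip_tool_descriptions(item) for item in node]
    return node


tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File to read"},
            },
            "required": ["path"],
        },
    },
}]
slim = strip_tool_descriptions(tools)
```

The function builds new containers rather than mutating its input, so the original request body stays available for logging or for providers that should receive it unchanged.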
Set PROVIDER_PRIORITY=runpod,vast,lambda to control order and fallback.
Only providers with a configured API key are active — unconfigured providers are skipped silently.
| Provider | Type | Env Key | Notes |
|---|---|---|---|
| RunPod | Cloud GPU Pod | RUNPOD_API_KEY | GPU preference-order fallback. Community cloud fallback. Network volume cache. Per-pod-type concurrency (vision pod never blocks simple pod). |
| Vast.ai | Cloud GPU Pod | VAST_API_KEY | Fallback when RunPod has no capacity. Identical pod lifecycle. |
| Lambda Labs | Cloud GPU Pod | LAMBDA_API_KEY | Secondary fallback after Vast. Same pod lifecycle. |
| Local GPU | Local | none | Add local to PROVIDER_PRIORITY. Routes to Ollama on this machine. Zero cost — budget gate skipped. allow_local must be enabled per user. |
| OpenAI | Commercial API | OPENAI_API_KEY | gpt-4o-mini / gpt-4o. Pay per token. No cold start. Multimodal supported. |
| Groq | Commercial API | GROQ_API_KEY | Llama 3.1/3.3 at high throughput. Pay per token. |
| Cerebras | Commercial API | CEREBRAS_API_KEY | zai-glm-4.7 (simple) / gpt-oss-120b (mid–architecture). Extremely fast inference. Pay per token. |
| SambaNova | Commercial API | SAMBANOVA_API_KEY | Llama 3.1/3.3 8B–405B. High-throughput inference on custom silicon. Pay per token. |
| Together AI | Commercial API | TOGETHER_API_KEY | Serverless (text only). Use together_dedicated provider for vision — spins a reserved GPU endpoint, billed per hour (~$1.49–6.49/hr). |
| Mistral | Commercial API | MISTRAL_API_KEY | mistral-small / medium / large. Pay per token. |
| DeepSeek | Commercial API | DEEPSEEK_API_KEY | deepseek-chat / deepseek-reasoner. Pay per token. |
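The "configured key means active" rule can be sketched as follows. Only the provider names that appear in PROVIDER_PRIORITY examples in this document (runpod, vast, lambda, local) are covered; the function name and the handling of unknown names are assumptions.

```python
# Env key per provider, taken from the table above. "local" needs no key.
PROVIDER_ENV_KEYS = {
    "runpod": "RUNPOD_API_KEY",
    "vast": "VAST_API_KEY",
    "lambda": "LAMBDA_API_KEY",
    "local": None,
}


def active_providers(env):
    """Return providers in PROVIDER_PRIORITY order, skipping unconfigured ones."""
    names = [p.strip() for p in env.get("PROVIDER_PRIORITY", "").split(",")]
    active = []
    for name in names:
        if name not in PROVIDER_ENV_KEYS:
            continue  # empty or unknown entry: skipped silently in this sketch
        key = PROVIDER_ENV_KEYS[name]
        if key is None or env.get(key):
            active.append(name)
    return active
```

With `PROVIDER_PRIORITY=runpod,vast,lambda` and only `RUNPOD_API_KEY` set, this yields just runpod, so a capacity failure there would have no GPU fallback until another key is added.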
| Tier | GPU Preference Order | VRAM Range |
|---|---|---|
| simple | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range | 8–24 GB |
| vision | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range | 10–24 GB |
| architecture | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest ≥20 GB | ≥20 GB |
| maximum | L40S → L40 → A40 → A100 40GB → cheapest ≥38 GB | ≥38 GB |
| ultra | A100 80GB → H100 → cheapest ≥50 GB | ≥50 GB |
VRAM cap on simple/vision prevents landing on A100/H100 when preferred types sell out — avoids 5–10× cost with no quality gain for 7B/13B models.
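The preference order plus VRAM cap can be sketched as a two-stage pick: preferred GPU types first, then the cheapest offer inside the tier's VRAM range. This is an illustrative reading, not the code in providers/base.py. Since A40 and A6000 exceed the 24 GB cap yet appear in the preference list, the sketch applies the range only to the final fallback step. The `Offer` type and all prices in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Offer:
    gpu: str
    vram_gb: int
    usd_per_hr: float


_SMALL = ["RTX 4090", "RTX 3090", "A40", "A6000"]
# tier -> (preference order, min VRAM GB, max VRAM GB or None for no cap)
TIER_RULES = {
    "simple": (_SMALL, 8, 24),
    "vision": (_SMALL, 10, 24),
    "architecture": (_SMALL, 20, None),
    "maximum": (["L40S", "L40", "A40", "A100 40GB"], 38, None),
    "ultra": (["A100 80GB", "H100"], 50, None),
}


def choose_gpu(tier, offers) -> Optional[Offer]:
    preferred, min_vram, max_vram = TIER_RULES[tier]
    # Stage 1: walk the preference list, cheapest offer of the first type found.
    for name in preferred:
        matches = [o for o in offers if o.gpu == name]
        if matches:
            return min(matches, key=lambda o: o.usd_per_hr)
    # Stage 2: cheapest offer within the VRAM range (the cap keeps simple and
    # vision off A100/H100 when the preferred types are sold out).
    in_range = [
        o for o in offers
        if o.vram_gb >= min_vram and (max_vram is None or o.vram_gb <= max_vram)
    ]
    return min(in_range, key=lambda o: o.usd_per_hr) if in_range else None


# Hypothetical market snapshot: preferred types sold out, A100 still available.
offers = [Offer("A100 80GB", 80, 1.89), Offer("RTX 4080", 16, 0.40)]
picked = choose_gpu("simple", offers)
```

Here `picked` is the 16 GB card rather than the A100. With only the A100 on offer, the simple tier gets `None` and the caller would move on to the next provider.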
Without a network volume, models re-download on every cold start (7B: ~3–5 min, 32B: ~10–15 min). With one, cold starts drop to ~30 s. Cost: ~$7–8/month for 100 GB.
# 1. Create 100 GB network volume in RunPod Dashboard → Storage → Network Volumes
# 2. Copy the volume ID
# 3. Add to .env and restart bridge:
RUNPOD_NETWORK_VOLUME_ID=<volume-id>
$ docker compose up -d bridge
If the volume is deleted, remove RUNPOD_NETWORK_VOLUME_ID from .env immediately. A stale ID causes every pod launch to fail.

Per-user volume keys (each user registers their own volume):
POST /v1/user/volume-keys
Authorization: Bearer <user-token>
Content-Type: application/json
{
"provider": "runpod",
"volume_id": "abc12345",
"api_key": "<runpod-api-key>",
"datacenter": "EU-RO-1" // optional — constrains to DC + validates before launch
}
GET /v1/user/volume-keys // list (api_key not decrypted in response)
DELETE /v1/user/volume-keys/{id} // remove
| Policy | Effect when no user volume key found |
|---|---|
| use_env | Use RUNPOD_NETWORK_VOLUME_ID if allow_env=True and var is set; otherwise launch stateless (default) |
| stateless | Launch without any volume — models re-download every cold start |
| block | Fail immediately with HTTP 400 — no fallback to other providers |
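The policy table translates directly into a small resolver. The sketch below follows the table row by row; the function name, argument names, and exception type are hypothetical, and it assumes a registered user volume key always takes precedence over the policy.

```python
class VolumeRequired(Exception):
    status_code = 400  # the block policy rejects the launch with HTTP 400


def resolve_volume(policy, user_volume_id, allow_env, env_volume_id):
    """Return the volume ID to attach, or None for a stateless launch."""
    if user_volume_id:
        return user_volume_id  # the policy only applies when no user key exists
    if policy == "use_env":
        # Shared volume only if the user may use it and the variable is set.
        return env_volume_id if (allow_env and env_volume_id) else None
    if policy == "stateless":
        return None
    if policy == "block":
        raise VolumeRequired("user has no registered volume key")
    raise ValueError(f"unknown storage policy: {policy!r}")
```

A `None` result means the pod launches without a volume and re-downloads its models on every cold start.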
Commands run in the bridge container via the llmctl shortcut.
# Direct (no shortcut installed)
$ docker compose exec bridge python -m cli.llm_ctl <command>

$ llmctl users add <email>                              # create user (prompts for password)
$ llmctl users set-password <email>                     # reset password
$ llmctl users budget <email> --usd 50                  # set monthly spend cap
$ llmctl users credit-add <email> --usd 20              # add prepaid credit
$ llmctl users tiers <email>                            # show allowed tiers
$ llmctl users tiers <email> --set simple               # lock to one tier
$ llmctl users tiers <email> --set simple,architecture  # allow two tiers
$ llmctl users tiers <email> --set all                  # remove restriction
$ llmctl users deactivate <email>                       # soft-delete user
$ llmctl users list                                     # list all users

$ llmctl users keys-add <email> --label "open-webui"    # create key (shown ONCE)
$ llmctl users keys-list <email>                        # list active keys
$ llmctl users keys-revoke <key_id>                     # revoke by ID
$ llmctl users reset-key <email> [--label "name"]       # revoke all + issue fresh key

$ llmctl users add <email> --sync-openwebui             # create bridge user + matching OW account
$ llmctl users keys-add <email> --sync-pipeline         # create key + update pipeline user_key_map
Requires OPENWEBUI_ADMIN_EMAIL, OPENWEBUI_ADMIN_PASSWORD, and PIPELINES_API_KEY in .env.
$ llmctl users storage <email>                     # show policy + registered volume keys
$ llmctl users storage <email> --policy use_env    # use env volume if no user key (default)
$ llmctl users storage <email> --policy stateless  # always launch without volume
$ llmctl users storage <email> --policy block      # require user volume; reject otherwise
$ llmctl users storage <email> --allow-env         # allow shared RUNPOD_NETWORK_VOLUME_ID
$ llmctl users storage <email> --no-allow-env      # prevent shared volume for this user

$ llmctl users local-access <email> --allow        # enable llm-local for this user
$ llmctl users local-access <email> --deny         # disable llm-local for this user

$ llmctl pods ls [--status ready]                  # list pods
$ llmctl pods kill <pod_id>                        # terminate pod immediately
$ llmctl start --tier architecture                 # prewarm a pod
$ llmctl bills run --month 2026-05                 # generate invoices
$ llmctl bills show <email> --month 2026-05        # per-user invoice + breakdown

$ llmctl models [--user-type personal]             # tier table with effective $/hr
$ llmctl status [--tier architecture]              # active pods + running cost
$ llmctl budget [--email u@example.com]            # spend vs cap progress bars
$ llmctl costs [--month 2026-05]                   # per-tier cost breakdown
$ llmctl gain [--month 2026-05]                    # savings vs GPT-4o equivalent
Four ways to deploy — from solo dev to multi-tenant hosted service.
| Mode | Who | Bridge | Gateway | Open WebUI |
|---|---|---|---|---|
| Solo | One user | Local | — | Local, OPENWEBUI_BRIDGE_API_KEY set to your key |
| Hosted multi-user | Admin + users | Shared server | — | Shared. Per-user billing via gpu-relay Pipelines. |
| Gateway client | User of hosted bridge | Remote (host's) | Local | Local, pointed at gateway on port 8080 |
| Full self-hosted | Single operator | Own server | Optional | Own server or local |
The gateway is a lightweight stateless proxy. Users point their OpenAI-compatible client
at http://localhost:8080 and authenticate with their own sk-llm- key.
The gateway forwards requests to the upstream bridge unchanged.
Option A — alongside main stack (add-on overlay):
# Add to .env:
GATEWAY_BRIDGE_URL=http://bridge:8000   # internal; bridge is a sibling service
GATEWAY_PORT=8080

$ docker compose -f docker-compose.yml -f docker-compose.gateway.yml up -d gateway
Option B — standalone (user's machine → remote bridge):
# docker/docker-compose.gateway.yml (no main stack needed)
GATEWAY_BRIDGE_URL=https://your-bridge.example.com
GATEWAY_PORT=8080

$ docker compose -f docker/docker-compose.gateway.yml up -d
Then set your Open WebUI (or any OpenAI client) base URL to
http://localhost:8080 and API key to your sk-llm-... bridge key.
Each user gets their own bridge API key. The gpu-relay Pipelines manifold routes inference
per-user for correct billing attribution. Admin installs the pipeline once; each user's
key is added automatically when created with --sync-openwebui.
# .env — sync settings
OPENWEBUI_ADMIN_EMAIL=admin@example.com
OPENWEBUI_ADMIN_PASSWORD=<ow-admin-password>
PIPELINES_API_KEY=<from Open WebUI Admin → Pipelines → API key>
PIPELINE_ID=gpu-relay   # default; match your installed pipeline ID

# Create user + OW account + pipeline mapping in one step:
$ llmctl users add alice@example.com --sync-openwebui
$ llmctl users keys-add alice@example.com --sync-pipeline

# Or via HTTP (returns plaintext key in response):
$ curl -X POST http://localhost:8000/admin/users \
  -H "Authorization: Bearer <admin-key>" \
  -H "Content-Type: application/json" \
  -d '{"email":"alice@example.com","password":"...","sync_openwebui":true,"sync_pipeline":true}'
Users can view their own spend, budget, and 30-day usage charts without admin access.
Available at http://localhost:8501 → User Portal page.
Login with a sk-llm- key — no email or password needed.
The portal reads from GET /v1/usage on the bridge. It only shows data for the
authenticated key's user — no cross-user visibility.
Open WebUI Tool integration that gives any model persistent file I/O and code execution
in a shared workspace_data/ directory. Add the tool once in Admin Panel —
every model in Open WebUI gains the same capabilities.
| Tool | Description |
|---|---|
| write_file(path, content) | Write text to a file; creates parent dirs automatically |
| read_file(path) | Read a file from the workspace |
| run_bash(command) | Execute a shell command in the workspace directory |
| run_python(code) | Execute a Python code snippet (not a file path) |
| list_tree(path, depth) | Browse directory structure |
| search_files(query, path, glob) | Regex search across workspace files |
| delete_path(path) | Delete file or directory |
| move_path(src, dst) | Move or rename file/directory |
| create_directory(path) | Create directory (and parents) |
| generate_pdf(markdown, output_path) | Render Markdown → PDF in workspace |
# 1. Open WebUI → Admin Panel → Tools → Add Tool
#    Paste contents of: pipelines/openwebui_tool.py
#    Set valve: workspace_tools_url = http://workspace-tools:7000

# 2. Start workspace-tools service (already in docker-compose.yml)
$ docker compose up -d workspace-tools

# 3. In any Open WebUI chat: enable the tool via the tool icon, then chat normally
Without a system prompt, models hallucinate tool calls as Python code instead of invoking them. Use the recommended system prompt to enforce correct behavior.
You have access to Workspace Tools. You MUST use them for ALL file and execution tasks —
never simulate, imagine, or describe results.
Rules:
- Writing code: always call write_file with real file content (actual newlines, not \n)
- Testing code: always call run_bash or run_python — never show "expected output"
- Browsing files: always call list_tree or read_file
- If a tool call fails: read the actual error, fix the actual file, retry
- To run a Python FILE: use run_bash with "python3 <path>" — never pass a file path to run_python
- run_python is for inline code snippets only, e.g. run_python("print('hello')")
- File organization: always create new projects under projects/<project-name>/, not the workspace root
CRITICAL — tool call rules:
- Tool calls are NOT Python code. Never write run_bash(...) in a code block.
- Invoke tools directly — do not write code that calls them.
- "I will call run_bash" followed by a code block = violation. Call it, don't narrate it.
- Fake output is output without a tool call. It is always wrong.
Full prompt with explanations: docs/workspace-tools-system-prompt.md
| Wrong | Right |
|---|---|
| run_python("projects/calc/main.py") | run_bash("python3 projects/calc/main.py") |
| Writing result = run_bash(...) in a code block | Invoking run_bash as a tool call directly |
| Creating files in workspace root | Creating under projects/<name>/ |
| Calling list_tree before write_file | Write first, verify after |