// OpenAI-Compatible GPU Gateway

GPU-RELAY

Multi-tenant, OpenAI-compatible LLM gateway. Route requests to cloud GPU pods (RunPod, Vast.ai, Lambda Labs), a local GPU, or commercial APIs — all behind one endpoint on port 8000. Self-host 7B–70B models on rented GPUs and pay only when running. Drop-in replacement for the OpenAI API — works with Open WebUI, Cursor, and any OpenAI client. Mix providers freely: local GPU for dev, cloud GPU for heavy loads, OpenAI as fallback. Per-user billing, quotas, and tier restrictions for team deployments.

OpenAI-Compatible Wire Format · Multi-Tenant · RunPod / Vast.ai / Lambda Labs · Per-User Billing & Quotas · MIT License

Services

| Service | Address | Purpose |
|---|---|---|
| BRIDGE API | localhost:8000 | OpenAI-compatible chat & embeddings |
| DASHBOARD | localhost:8501 | Streamlit admin: billing, monitoring, users |
| OPEN WEBUI | localhost:3000 | Chat UI; wire to bridge via sk-llm-... key |
| CCR (Claude Code Router) | localhost:3456 | Anthropic ↔ OpenAI adapter for Claude Code |
| OLLAMA (internal) | ollama:11434 | Local preprocessor + embeddings |

Architecture

 ┌──────────────────────────────────────────────────────────────────┐
 │              CLIENT  (any OpenAI SDK / curl / Open WebUI)        │
 └────────────────────────────┬─────────────────────────────────────┘
                              │  POST /v1/chat/completions
                              ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │         BRIDGE API  :8000  (FastAPI)                             │
 │                                                                  │
 │  auth → rate-limit → pipeline → router → instance manager        │
 └──┬───────────────────────────────────────────────────────────────┘
    │
    │  ROUTER  (bridge/router.py)
    │
    ├─ image_url detected ────────────────────────────────────────── ▶ vision tier (hard stop)
    │
    ├─ X-Tier header / ?tier= / allowed_tiers / budget / complexity
    │
    ├─ [simple / architecture] ──────────────────────── ▶ RunPod  RTX 4090   ~$0.69/hr
    ├─ [maximum]               ──────────────────────── ▶ RunPod  L40S 48GB  ~$1.14/hr
    ├─ [ultra]                 ──────────────────────── ▶ RunPod  A100 80GB  ~$1.89/hr
    ├─ [vision]                ──────────────────────── ▶ Together Dedicated / RunPod MiniCPM-V
    ├─ [cloud fallback]        ──────────────────────── ▶ Vast.ai → Lambda Labs
    └─ [commercial]            ──────────────────────── ▶ OpenAI / Groq / Cerebras / SambaNova / Together / Mistral / DeepSeek

 PREPROCESSOR ─── local Ollama 7B (qwen2.5-coder) rewrites prompt before cloud inference

 ┌───────────────────────────────────────────────────────────────────┐
 │  Postgres :5432  (state, billing, audit)                          │
 │  Redis    :6379  (quota, rate-limit, cache)                       │
 └───────────────────────────────────────────────────────────────────┘
      

Key Files

| File | Role |
|---|---|
| bridge/main.py | Routes, auth, image URL resolution to base64 |
| bridge/router.py | Tier selection: vision hard-stop, tokens/files/keywords/budget |
| bridge/instance_manager.py | Pod pool, lifecycle, health checks, idle reaper |
| bridge/multi_model.py | WorkflowOrchestrator: named pipelines (llm-visual-html, etc.) |
| providers/base.py | BaseProvider ABC: GPU ranking, fallback order |
| providers/runpod.py / vast.py / lambda_labs.py | Cloud GPU pod providers |
| providers/api_compat.py | OpenAI / Groq / Together / Mistral / DeepSeek pass-through |
| database/models.py | User, Pod, Request, ApiKey, Invoice ORM models |
| dashboard/app.py | Streamlit entry point: Overview, Monitoring, Analytics, Billing |

Install

Clone and copy the environment template

$ git clone https://github.com/infectiousoma/gpu-relay
$ cd gpu-relay
$ cp .env.example .env

Edit .env — set provider keys, secrets, and (optionally) network volume

$ $EDITOR .env

# Required: at least one GPU provider or commercial API key
PROVIDER_PRIORITY=runpod,vast,lambda
RUNPOD_API_KEY=rp_xxxxxxxxxxxxxxxxxxxx

# Required: generate with openssl rand -hex 32
BRIDGE_SECRET_KEY=<random-hex-64>
POSTGRES_PASSWORD=<strong-password>
OPENWEBUI_SECRET_KEY=<random-hex-64>

# Optional: cuts RunPod cold starts from minutes → ~30 s
RUNPOD_NETWORK_VOLUME_ID=<volume-id>

One-shot setup — builds images, starts stack, runs migrations, bootstraps admin user

$ bash scripts/setup.sh

Smoke test

$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-simple","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions

Install llmctl shortcut (optional)

# System-wide (requires sudo)
$ sudo ln -sf "$(pwd)/scripts/llmctl" /usr/local/bin/llmctl

# Per-user (no sudo — ensure ~/.local/bin is in $PATH)
$ mkdir -p ~/.local/bin && ln -sf "$(pwd)/scripts/llmctl" ~/.local/bin/llmctl

Add a user and connect Open WebUI

$ llmctl users add you@example.com
$ llmctl users keys-add you@example.com --label "open-webui"
# Copy the sk-llm-... key — displayed ONCE

# In Open WebUI → Admin Panel → Settings → Connections:
#   OpenAI API URL: http://bridge:8000/v1
#   Key: sk-llm-...

No GPU account? Run MOCK_PROVIDERS=1 ./scripts/smoke_test.sh, which routes all GPU requests to local Ollama, with no billing and no cold start. All 13 E2E tests pass.

Usage

Model Names

llm-simple llm-architecture llm-maximum llm-ultra llm-vision llm-auto llm-local

Workflow models: llm-smart  llm-code-review  llm-refactor  llm-arch-design  llm-visual-html

Curl Examples

# Use a specific tier
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-architecture","messages":[{"role":"user","content":"review this code"}]}' \
     http://localhost:8000/v1/chat/completions

# Auto-route: router picks tier based on complexity
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-auto","messages":[{"role":"user","content":"what is 2+2?"}]}' \
     http://localhost:8000/v1/chat/completions

# Force tier via header (overrides auto-routing; image_url requests still go to the vision tier)
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "X-Tier: simple" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-auto","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions

# Force local GPU: bypasses cloud regardless of tier config
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-local","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions
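
The same requests can be issued from Python. The following is a minimal sketch using only the standard library; the helper name build_chat_request is hypothetical and not part of the project, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       tier: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if tier is not None:
        # X-Tier forces a specific tier, as in the curl example above.
        headers["X-Tier"] = tier
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body, headers=headers, method="POST",
    )


req = build_chat_request("http://localhost:8000", "sk-llm-...", "llm-auto",
                         "hello", tier="simple")
# To send it: urllib.request.urlopen(req), then json.load() the response.
```

Because the wire format is OpenAI-compatible, an existing OpenAI client pointed at http://localhost:8000/v1 should work equally well; the sketch only shows what goes over the wire.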

Tier Table — Pod Providers

| Model | Tier | Underlying Model | GPU | Est. $/hr |
|---|---|---|---|---|
| llm-simple | simple | Qwen2.5-Coder 7B | RTX 4090 | ~$0.69 |
| llm-architecture | architecture | Qwen2.5 32B (tools) / Qwen2.5-Coder 32B (chat) | RTX 4090 | ~$0.69 |
| llm-maximum | maximum | DeepSeek V3 | L40S 48GB | ~$1.14 |
| llm-ultra | ultra | Qwen2.5 72B | A100 80GB | ~$1.89 |
| llm-vision | vision | Llama-3.2-11B-Vision / MiniCPM-V | L40 / RTX 4090 | ~$0.69–1.49 |
| llm-auto | router selects | varies | varies | |
| llm-local | local | Ollama (same models) | local GPU / CPU | free |

Auto-Routing Priority

| # | Signal | Action |
|---|---|---|
| 0 | image_url content parts | → vision tier (hard stop, no fallthrough) |
| 1 | X-Tier header / ?tier= query param | force specific tier |
| 2 | allowed_tiers user whitelist | restrict scope |
| 3 | budget gate | downgrade or HTTP 402 |
| 4 | token count thresholds | route by prompt size |
| 5 | file count thresholds | route by file complexity |
| 6 | complexity keywords | route by detected intent |
| 7 | default | → simple |
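
The priority order above can be read as a short decision function. The sketch below is illustrative only and is not the code in bridge/router.py: the thresholds, the keywords, the RouteInput fields, and the way the whitelist and budget gate clamp the result are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Cheapest to most expensive, following the tier table.
TIER_ORDER = ["simple", "architecture", "maximum", "ultra"]

# Hypothetical thresholds and keywords, for illustration only.
TOKEN_THRESHOLDS = [(32_000, "ultra"), (8_000, "maximum"), (2_000, "architecture")]
FILE_THRESHOLDS = [(10, "maximum"), (3, "architecture")]
KEYWORD_TIERS = {"refactor": "architecture", "design": "architecture"}


class BudgetExceeded(Exception):
    """Stands in for the HTTP 402 response."""


@dataclass
class RouteInput:
    prompt: str = ""
    has_image: bool = False
    forced_tier: Optional[str] = None          # X-Tier header or ?tier=
    allowed_tiers: Optional[List[str]] = None  # None means unrestricted
    max_affordable: Optional[str] = "ultra"    # None means budget exhausted
    token_count: int = 0
    file_count: int = 0


def _tier_from_signals(r: RouteInput) -> str:
    for limit, tier in TOKEN_THRESHOLDS:       # 4. token count
        if r.token_count >= limit:
            return tier
    for limit, tier in FILE_THRESHOLDS:        # 5. file count
        if r.file_count >= limit:
            return tier
    lowered = r.prompt.lower()
    for word, tier in KEYWORD_TIERS.items():   # 6. complexity keywords
        if word in lowered:
            return tier
    return "simple"                            # 7. default


def select_tier(r: RouteInput) -> str:
    if r.has_image:                            # 0. vision hard stop
        return "vision"
    if r.forced_tier:                          # 1. forced tier
        return r.forced_tier
    if r.max_affordable is None:               # 3. budget gate: reject
        raise BudgetExceeded("budget exhausted (HTTP 402)")
    ceiling = TIER_ORDER.index(r.max_affordable)
    permitted = [t for i, t in enumerate(TIER_ORDER)
                 if i <= ceiling                                      # 3. downgrade
                 and (r.allowed_tiers is None or t in r.allowed_tiers)]  # 2. whitelist
    if not permitted:
        raise BudgetExceeded("no permitted tier within budget (HTTP 402)")
    want = TIER_ORDER.index(_tier_from_signals(r))
    # Highest permitted tier at or below the wanted one, else the cheapest permitted.
    at_or_below = [t for t in permitted if TIER_ORDER.index(t) <= want]
    return at_or_below[-1] if at_or_below else permitted[0]
```

For example, select_tier(RouteInput(token_count=9000, allowed_tiers=["simple", "architecture"])) wants the maximum tier under these made-up thresholds but is clamped to architecture by the whitelist.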

API Provider Model Mapping

| Tier | OpenAI | Groq | Cerebras | SambaNova | Together | Mistral | DeepSeek |
|---|---|---|---|---|---|---|---|
| simple | gpt-4o-mini | llama-3.1-8b-instant | zai-glm-4.7 | Llama-3.1-8B | Llama-3.2-11B | mistral-small | deepseek-chat |
| mid | gpt-4o-mini | llama-3.3-70b | gpt-oss-120b | Llama-3.3-70B | Llama-3.1-70B | mistral-medium | deepseek-chat |
| architecture | gpt-4o | llama-3.3-70b | gpt-oss-120b | Llama-3.3-70B | Llama-3.1-70B | mistral-medium | deepseek-chat |
| maximum | gpt-4o | llama-3.3-70b | | Llama-3.1-405B | | mistral-large | deepseek-reasoner |
| ultra | gpt-4o | llama-3.3-70b | | Llama-3.1-405B | Llama-3.1-405B | mistral-large | deepseek-reasoner |

Override any mapping via env var — e.g. OPENAI_MODEL_ARCHITECTURE=o1-mini
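
One way such an override could be resolved is sketched below. Only OPENAI_MODEL_ARCHITECTURE appears in this document; the general <PROVIDER>_MODEL_<TIER> naming pattern and the resolve_model helper are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the OpenAI column of the mapping table.
OPENAI_DEFAULTS = {
    "simple": "gpt-4o-mini",
    "mid": "gpt-4o-mini",
    "architecture": "gpt-4o",
    "maximum": "gpt-4o",
    "ultra": "gpt-4o",
}


def resolve_model(provider: str, tier: str, defaults: Mapping[str, str],
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Return the env override if one is set, else the table default."""
    env = os.environ if env is None else env
    # Assumed naming pattern, e.g. OPENAI_MODEL_ARCHITECTURE.
    var = f"{provider}_MODEL_{tier}".upper()
    return env.get(var) or defaults[tier]


default_model = resolve_model("openai", "architecture", OPENAI_DEFAULTS, env={})
overridden = resolve_model("openai", "architecture", OPENAI_DEFAULTS,
                           env={"OPENAI_MODEL_ARCHITECTURE": "o1-mini"})
```

Passing env explicitly keeps the function easy to test; in normal use it falls back to os.environ.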

────────────────────────────────────────────────────

Vision Routing

Any request with image_url content parts routes exclusively to the vision tier — steps 1–7 are skipped entirely.

If the vision tier is unavailable, the bridge returns HTTP 503 immediately. If a vision pod is acquired but the request fails, it returns HTTP 400. Images are never silently stripped and re-routed to a text model (unless downstream_model is set for pipeline use).
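
As a rough illustration of the rules above, the sketch below detects image_url content parts in an OpenAI-style messages list and maps the two failure cases to their status codes. The function names are hypothetical, and the downstream_model exception is not modelled.

```python
from typing import Any, Dict, List


def has_image_parts(messages: List[Dict[str, Any]]) -> bool:
    """True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # multimodal content is a list of parts
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def vision_status(tier_available: bool, request_succeeded: bool) -> int:
    """HTTP status for a request that was routed to the vision tier."""
    if not tier_available:
        return 503  # vision tier unavailable: fail immediately
    if not request_succeeded:
        return 400  # pod acquired, but the request failed
    return 200


text_only = [{"role": "user", "content": "hello"}]
with_image = [{"role": "user", "content": [
    {"type": "text", "text": "what is in this picture?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]}]
```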

Together Dedicated Vision Tiers

| Routing Tier | Vision Model | Hardware |
|---|---|---|
| simple | Qwen3-VL-8B-Instruct | L40 48GB / L40S / A100-40GB |
| vision | Llama-3.2-11B-Vision-Instruct-Turbo | L40 48GB / L40S / A100 |
| architecture / maximum / ultra | Llama-3.2-90B-Vision-Instruct-Turbo | A100-80GB / H100-80GB |

Claude Code Integration

Run claude (Claude Code CLI) backed by your own local or cloud LLMs instead of Anthropic's API. The claude-code-router (ccr) service sits between Claude Code and the bridge, converting the Anthropic API format Claude Code expects into the OpenAI format the bridge speaks.

 Claude Code  →  ccr :3456  →  Bridge :8000  →  RunPod / Groq / local GPU / …
                  │
                  ├─ Converts Anthropic ↔ OpenAI wire format
                  ├─ Routes by request type: default / background / think / longContext
                  └─ Passes tier name as model field (e.g. "architecture")
      

Copy and configure the ccr config

$ cp config/ccr-config.json.example config/ccr-config.json
$ $EDITOR config/ccr-config.json
# Set APIKEY to any secret, set api_key to your bridge sk-llm-... key

Add to .env and start the ccr service

CCR_PORT=3456
CCR_BRIDGE_API_KEY=sk-llm-...   # bridge API key from llmctl users keys-add

$ docker compose up -d ccr

Activate on the host and run Claude Code

$ source scripts/ccr-activate.sh   # exports ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN
$ claude --model architecture       # forces architecture tier
$ claude                            # uses default tier from ccr Router config

Request Type → Tier Mapping

| ccr type | When used | Bridge tier |
|---|---|---|
| default | Most requests | architecture |
| background | Low-priority / short tasks | simple |
| think | Extended thinking | maximum |
| longContext | Large context window | ultra |

Two-Model Routing (Ollama tiers)

For Ollama-backed tiers, the bridge automatically selects between two models based on whether the request carries tool definitions:

| Request type | Model | Reason |
|---|---|---|
| Has tools (Claude Code tool-call session) | qwen2.5:32b-instruct-q4_K_M | Instruct variant handles tool call JSON format correctly |
| No tools (plain chat, title generation) | qwen2.5-coder:32b-instruct-q4_K_M | Coder variant is faster for pure text generation |

Both models are pulled at pod startup. If the coder model fails to pull, requests fall back to the primary model automatically.
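
The selection rule can be sketched in a few lines. This is an illustration rather than the bridge's code: the function name and the pulled_models argument are hypothetical, and it assumes the "primary model" is the instruct variant.

```python
from typing import Any, Dict, Iterable

TOOLS_MODEL = "qwen2.5:32b-instruct-q4_K_M"       # assumed to be the primary model
CHAT_MODEL = "qwen2.5-coder:32b-instruct-q4_K_M"


def pick_ollama_model(payload: Dict[str, Any], pulled_models: Iterable[str]) -> str:
    """Choose a model based on whether the request carries tool definitions."""
    if payload.get("tools"):
        return TOOLS_MODEL
    if CHAT_MODEL in set(pulled_models):
        return CHAT_MODEL
    # Coder model failed to pull: fall back to the primary model.
    return TOOLS_MODEL


both = [TOOLS_MODEL, CHAT_MODEL]
tool_call = {"messages": [], "tools": [{"type": "function", "function": {"name": "read_file"}}]}
plain_chat = {"messages": []}
```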

Tool Schema Stripping

Claude Code sends 26+ tools with verbose descriptions (~64K tokens total). The bridge strips description fields from all tool schemas before forwarding, and applies this to all providers, including Ollama/RunPod. This cuts the tool payload to ~5K tokens.
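
A minimal sketch of description stripping follows, assuming OpenAI-style tool definitions. The function name is hypothetical and the bridge's actual implementation may differ. Only string-valued "description" keys are removed, so a parameter that happens to be named description keeps its schema.

```python
from typing import Any


def strip_descriptions(node: Any) -> Any:
    """Return a copy of a tool schema without string-valued "description" keys."""
    if isinstance(node, dict):
        return {
            key: strip_descriptions(value)
            for key, value in node.items()
            if not (key == "description" and isinstance(value, str))
        }
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    return node


tool = {
    "type": "function",
    "function": {
        "name": "write_note",
        "description": "A long explanation of what the tool does...",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Where to write"},
                # A parameter named "description": its schema is a dict, so it is kept.
                "description": {"type": "string", "description": "Note text"},
            },
            "required": ["path"],
        },
    },
}
stripped = strip_descriptions(tool)
```

The input is left unmodified, so the original schemas remain available for logging or for providers that should receive them in full.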

Providers

Set PROVIDER_PRIORITY=runpod,vast,lambda to control order and fallback. Only providers with a configured API key are active — unconfigured providers are skipped silently.

| Provider | Type | Env Key | Notes |
|---|---|---|---|
| RunPod | Cloud GPU Pod | RUNPOD_API_KEY | GPU preference-order fallback. Community cloud fallback. Network volume cache. Per-pod-type concurrency (vision pod never blocks simple pod). |
| Vast.ai | Cloud GPU Pod | VAST_API_KEY | Fallback when RunPod has no capacity. Identical pod lifecycle. |
| Lambda Labs | Cloud GPU Pod | LAMBDA_API_KEY | Secondary fallback after Vast. Same pod lifecycle. |
| Local GPU | Local | none | Add local to PROVIDER_PRIORITY. Routes to Ollama on this machine. Zero cost; budget gate skipped. allow_local must be enabled per user. |
| OpenAI | Commercial API | OPENAI_API_KEY | gpt-4o-mini / gpt-4o. Pay per token. No cold start. Multimodal supported. |
| Groq | Commercial API | GROQ_API_KEY | Llama 3.1/3.3 at high throughput. Pay per token. |
| Cerebras | Commercial API | CEREBRAS_API_KEY | zai-glm-4.7 (simple) / gpt-oss-120b (mid–architecture). Extremely fast inference. Pay per token. |
| SambaNova | Commercial API | SAMBANOVA_API_KEY | Llama 3.1/3.3 8B–405B. High-throughput inference on custom silicon. Pay per token. |
| Together AI | Commercial API | TOGETHER_API_KEY | Serverless (text only). Use the together_dedicated provider for vision; it spins up a reserved GPU endpoint, billed per hour (~$1.49–6.49/hr). |
| Mistral | Commercial API | MISTRAL_API_KEY | mistral-small / medium / large. Pay per token. |
| DeepSeek | Commercial API | DEEPSEEK_API_KEY | deepseek-chat / deepseek-reasoner. Pay per token. |

RunPod GPU Tier Preferences

| Tier | GPU Preference Order | VRAM Range |
|---|---|---|
| simple | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range | 8–24 GB |
| vision | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range | 10–24 GB |
| architecture | RTX 4090 → RTX 3090 → A40 → A6000 → cheapest ≥20 GB | ≥20 GB |
| maximum | L40S → L40 → A40 → A100 40GB → cheapest ≥38 GB | ≥38 GB |
| ultra | A100 80GB → H100 → cheapest ≥50 GB | ≥50 GB |

VRAM cap on simple/vision prevents landing on A100/H100 when preferred types sell out — avoids 5–10× cost with no quality gain for 7B/13B models.
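
The preference order and the VRAM cap can be sketched as follows. The Gpu type, the pick_gpu helper, and the prices in the example are hypothetical. The sketch assumes that named preferences are tried first regardless of VRAM, and that the VRAM range applies only to the "cheapest" fallback.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    usd_per_hr: float


# tier -> (preference order, min VRAM in GB, max VRAM in GB or None for no cap)
PREFERENCES: Dict[str, Tuple[List[str], int, Optional[int]]] = {
    "simple": (["RTX 4090", "RTX 3090", "A40", "A6000"], 8, 24),
    "vision": (["RTX 4090", "RTX 3090", "A40", "A6000"], 10, 24),
    "architecture": (["RTX 4090", "RTX 3090", "A40", "A6000"], 20, None),
    "maximum": (["L40S", "L40", "A40", "A100 40GB"], 38, None),
    "ultra": (["A100 80GB", "H100"], 50, None),
}


def pick_gpu(tier: str, available: List[Gpu]) -> Optional[Gpu]:
    """Pick the first preferred GPU in stock, else the cheapest one in range."""
    preferred, min_vram, max_vram = PREFERENCES[tier]
    by_name = {gpu.name: gpu for gpu in available}
    for name in preferred:
        if name in by_name:
            return by_name[name]
    in_range = [
        gpu for gpu in available
        if gpu.vram_gb >= min_vram and (max_vram is None or gpu.vram_gb <= max_vram)
    ]
    return min(in_range, key=lambda gpu: gpu.usd_per_hr, default=None)


# Made-up stock and prices: the preferred types are sold out.
stock = [Gpu("H100", 80, 2.99), Gpu("RTX A5000", 24, 0.30)]
choice = pick_gpu("simple", stock)
```

With this stock the simple tier lands on the 24 GB card rather than the H100, which is the effect the VRAM cap is meant to have. With only the H100 in stock, pick_gpu returns None for simple, leaving the caller to fall back to another provider.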

Network Volume Cache (RunPod)

Without a network volume, models re-download on every cold start (7B: ~3–5 min, 32B: ~10–15 min). With one, cold starts drop to ~30 s. Cost: ~$7–8/month for 100 GB.
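
To judge whether the volume pays for itself on cost alone, a rough break-even estimate can be computed from the figures above. This is a back-of-the-envelope sketch: it assumes the pod is billed at its hourly rate while the model downloads, uses midpoints of the quoted ranges (4 min for 7B, 12.5 min for 32B, $7.50/month), takes the ~$0.69/hr RTX 4090 rate from the tier table, and ignores the latency benefit entirely.

```python
def cold_start_cost(minutes: float, usd_per_hr: float) -> float:
    """Cost of the GPU time spent waiting for a cold start."""
    return minutes / 60 * usd_per_hr


def break_even_cold_starts(volume_usd_per_month: float, cold_minutes: float,
                           cached_minutes: float, usd_per_hr: float) -> float:
    """Cold starts per month at which the volume fee equals the GPU time saved."""
    saved = (cold_start_cost(cold_minutes, usd_per_hr)
             - cold_start_cost(cached_minutes, usd_per_hr))
    return volume_usd_per_month / saved


VOLUME_USD = 7.50   # midpoint of ~$7-8/month
RTX_4090 = 0.69     # ~$/hr from the tier table
CACHED_MIN = 0.5    # ~30 s cold start with a volume

break_even_7b = break_even_cold_starts(VOLUME_USD, 4.0, CACHED_MIN, RTX_4090)
break_even_32b = break_even_cold_starts(VOLUME_USD, 12.5, CACHED_MIN, RTX_4090)
print(round(break_even_7b), round(break_even_32b))
```

Under these assumptions the break-even is roughly 186 cold starts per month for a 7B model and roughly 54 for a 32B model, so on cost alone the volume matters most for larger models or frequent cold starts.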

# 1. Create 100 GB network volume in RunPod Dashboard → Storage → Network Volumes
# 2. Copy the volume ID
# 3. Add to .env and restart bridge:

RUNPOD_NETWORK_VOLUME_ID=<volume-id>

$ docker compose up -d bridge
If you delete the volume, clear RUNPOD_NETWORK_VOLUME_ID from .env immediately. A stale ID causes every pod launch to fail.

Per-user volume keys (each user registers their own volume). The datacenter field is optional; when set, it constrains pod launches to that datacenter and is validated before launch:

POST /v1/user/volume-keys
Authorization: Bearer <user-token>
Content-Type: application/json

{
  "provider":   "runpod",
  "volume_id":  "abc12345",
  "api_key":    "<runpod-api-key>",
  "datacenter": "EU-RO-1"
}

GET    /v1/user/volume-keys          // list (api_key not decrypted in response)
DELETE /v1/user/volume-keys/{id}     // remove

Volume Storage Policy (Admin)

| Policy | Effect when no user volume key found |
|---|---|
| use_env | Use RUNPOD_NETWORK_VOLUME_ID if allow_env=True and the variable is set; otherwise launch stateless (default) |
| stateless | Launch without any volume; models re-download on every cold start |
| block | Fail immediately with HTTP 400; no fallback to other providers |
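
The three policies can be summarised as a small resolver. This is a sketch under stated assumptions, not the bridge's implementation: the function and exception names are hypothetical, and it assumes (following the table heading) that a registered user volume key always takes precedence over the policy.

```python
from typing import Optional


class VolumeRequired(Exception):
    """Stands in for the HTTP 400 returned under the block policy."""


def resolve_volume(policy: str, user_volume_id: Optional[str],
                   allow_env: bool, env_volume_id: Optional[str]) -> Optional[str]:
    """Return the volume ID to attach, or None for a stateless launch."""
    if user_volume_id:
        return user_volume_id          # assumed: user key wins under every policy
    if policy == "block":
        raise VolumeRequired("user volume key required (HTTP 400)")
    if policy == "use_env" and allow_env and env_volume_id:
        return env_volume_id           # shared RUNPOD_NETWORK_VOLUME_ID
    return None                        # stateless: models re-download on cold start


shared = resolve_volume("use_env", None, allow_env=True, env_volume_id="vol-shared")
stateless = resolve_volume("use_env", None, allow_env=False, env_volume_id="vol-shared")
```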

CLI Reference

Commands run in the bridge container via the llmctl shortcut.

# Direct (no shortcut installed)
$ docker compose exec bridge python -m cli.llm_ctl <command>

User Management

$ llmctl users add <email>                              # create user (prompts for password)
$ llmctl users set-password <email>                     # reset password
$ llmctl users budget <email> --usd 50                  # set monthly spend cap
$ llmctl users credit-add <email> --usd 20              # add prepaid credit
$ llmctl users tiers <email>                            # show allowed tiers
$ llmctl users tiers <email> --set simple               # lock to one tier
$ llmctl users tiers <email> --set simple,architecture  # allow two tiers
$ llmctl users tiers <email> --set all                  # remove restriction
$ llmctl users deactivate <email>                       # soft-delete user
$ llmctl users list                                     # list all users

API Keys

$ llmctl users keys-add <email> --label "open-webui"   # create key (shown ONCE)
$ llmctl users keys-list <email>                       # list active keys
$ llmctl users keys-revoke <key_id>                    # revoke by ID
$ llmctl users reset-key <email> [--label "name"]      # revoke all + issue fresh key

User Sync (Open WebUI)

$ llmctl users add <email> --sync-openwebui        # create bridge user + matching OW account
$ llmctl users keys-add <email> --sync-pipeline    # create key + update pipeline user_key_map

Requires OPENWEBUI_ADMIN_EMAIL, OPENWEBUI_ADMIN_PASSWORD, and PIPELINES_API_KEY in .env.

Volume Storage Policy

$ llmctl users storage <email>                          # show policy + registered volume keys
$ llmctl users storage <email> --policy use_env         # use env volume if no user key (default)
$ llmctl users storage <email> --policy stateless       # always launch without volume
$ llmctl users storage <email> --policy block           # require user volume; reject otherwise
$ llmctl users storage <email> --allow-env              # allow shared RUNPOD_NETWORK_VOLUME_ID
$ llmctl users storage <email> --no-allow-env           # prevent shared volume for this user

Local Provider Access

$ llmctl users local-access <email> --allow   # enable llm-local for this user
$ llmctl users local-access <email> --deny    # disable llm-local for this user

Pods & Billing

$ llmctl pods ls [--status ready]             # list pods
$ llmctl pods kill <pod_id>                   # terminate pod immediately
$ llmctl start --tier architecture            # prewarm a pod
$ llmctl bills run --month 2026-05            # generate invoices
$ llmctl bills show <email> --month 2026-05   # per-user invoice + breakdown

Observability

$ llmctl models [--user-type personal]        # tier table with effective $/hr
$ llmctl status [--tier architecture]         # active pods + running cost
$ llmctl budget [--email u@example.com]       # spend vs cap progress bars
$ llmctl costs [--month 2026-05]              # per-tier cost breakdown
$ llmctl gain  [--month 2026-05]              # savings vs GPT-4o equivalent

Deployment Modes

Four ways to deploy — from solo dev to multi-tenant hosted service.

| Mode | Who | Bridge | Gateway | Open WebUI |
|---|---|---|---|---|
| Solo | One user | Local | | Local, OPENWEBUI_BRIDGE_API_KEY set to your key |
| Hosted multi-user | Admin + users | Shared server | | Shared. Per-user billing via gpu-relay Pipelines. |
| Gateway client | User of hosted bridge | Remote (host's) | Local | Local, pointed at gateway on port 8080 |
| Full self-hosted | Single operator | Own server | Optional | Own server or local |

Gateway — Local Proxy to Remote Bridge

The gateway is a lightweight stateless proxy. Users point their OpenAI-compatible client at http://localhost:8080 and authenticate with their own sk-llm- key. The gateway forwards requests to the upstream bridge unchanged.
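
Conceptually, a stateless forwarder only needs to rewrite the target URL and pass everything else through. The sketch below illustrates that idea with the standard library; it is not the gateway's actual code, the helper name is hypothetical, and dropping hop-by-hop headers and Host is an assumption based on ordinary proxy practice.

```python
import urllib.request
from typing import Mapping, Optional

# Headers that describe a single connection and are normally not forwarded.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade", "host",
}


def build_upstream_request(bridge_url: str, method: str, path: str,
                           headers: Mapping[str, str],
                           body: Optional[bytes] = None) -> urllib.request.Request:
    """Build the request the gateway would send to the upstream bridge."""
    forwarded = {k: v for k, v in headers.items() if k.lower() not in HOP_BY_HOP}
    return urllib.request.Request(
        bridge_url.rstrip("/") + path,
        data=body, headers=forwarded, method=method,
    )


upstream = build_upstream_request(
    "https://your-bridge.example.com", "POST", "/v1/chat/completions",
    {"Host": "localhost:8080", "Authorization": "Bearer sk-llm-...",
     "Content-Type": "application/json"},
    body=b'{"model":"llm-auto","messages":[]}',
)
```

The user's own sk-llm- key travels in the Authorization header untouched, which is why the gateway itself needs no per-user state.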

Option A — alongside main stack (add-on overlay):

# Add to .env:
GATEWAY_BRIDGE_URL=http://bridge:8000   # internal; bridge is a sibling service
GATEWAY_PORT=8080

$ docker compose -f docker-compose.yml -f docker-compose.gateway.yml up -d gateway

Option B — standalone (user's machine → remote bridge):

# docker/docker-compose.gateway.yml (no main stack needed)
GATEWAY_BRIDGE_URL=https://your-bridge.example.com
GATEWAY_PORT=8080

$ docker compose -f docker/docker-compose.gateway.yml up -d

Then set your Open WebUI (or any OpenAI client) base URL to http://localhost:8080 and API key to your sk-llm-... bridge key.

Hosted Multi-User Setup

Each user gets their own bridge API key. The gpu-relay Pipelines manifold routes inference per-user for correct billing attribution. Admin installs the pipeline once; each user's key is added automatically when created with --sync-openwebui.

# .env — sync settings
OPENWEBUI_ADMIN_EMAIL=admin@example.com
OPENWEBUI_ADMIN_PASSWORD=<ow-admin-password>
PIPELINES_API_KEY=<from Open WebUI Admin → Pipelines → API key>
PIPELINE_ID=gpu-relay                   # default; match your installed pipeline ID

# Create user + OW account + pipeline mapping in one step:
$ llmctl users add alice@example.com --sync-openwebui
$ llmctl users keys-add alice@example.com --sync-pipeline

# Or via HTTP (returns plaintext key in response):
$ curl -X POST http://localhost:8000/admin/users \
       -H "Authorization: Bearer <admin-key>" \
       -H "Content-Type: application/json" \
       -d '{"email":"alice@example.com","password":"...","sync_openwebui":true,"sync_pipeline":true}'

User Portal

Users can view their own spend, budget, and 30-day usage charts without admin access. Available at http://localhost:8501 on the User Portal page. Log in with a sk-llm- key; no email or password is needed.

The user portal calls GET /v1/usage on the bridge. It only shows data for the authenticated key's user — no cross-user visibility.

Workspace Tools

Open WebUI Tool integration that gives any model persistent file I/O and code execution in a shared workspace_data/ directory. Add the tool once in Admin Panel — every model in Open WebUI gains the same capabilities.

Available Tools

| Tool | Description |
|---|---|
| write_file(path, content) | Write text to a file; creates parent dirs automatically |
| read_file(path) | Read a file from the workspace |
| run_bash(command) | Execute a shell command in the workspace directory |
| run_python(code) | Execute a Python code snippet (not a file path) |
| list_tree(path, depth) | Browse directory structure |
| search_files(query, path, glob) | Regex search across workspace files |
| delete_path(path) | Delete file or directory |
| move_path(src, dst) | Move or rename file/directory |
| create_directory(path) | Create directory (and parents) |
| generate_pdf(markdown, output_path) | Render Markdown → PDF in workspace |

Setup

# 1. Open WebUI → Admin Panel → Tools → Add Tool
#    Paste contents of: pipelines/openwebui_tool.py
#    Set valve: workspace_tools_url = http://workspace-tools:7000

# 2. Start workspace-tools service (already in docker-compose.yml)
$ docker compose up -d workspace-tools

# 3. In any Open WebUI chat: enable the tool via the tool icon, then chat normally

System Prompt

Without a system prompt, models hallucinate tool calls as Python code instead of invoking them. Use the recommended system prompt to enforce correct behavior.

You have access to Workspace Tools. You MUST use them for ALL file and execution tasks —
never simulate, imagine, or describe results.

Rules:
- Writing code: always call write_file with real file content (actual newlines, not \n)
- Testing code: always call run_bash or run_python — never show "expected output"
- Browsing files: always call list_tree or read_file
- If a tool call fails: read the actual error, fix the actual file, retry
- To run a Python FILE: use run_bash with "python3 <path>" — never pass a file path to run_python
- run_python is for inline code snippets only, e.g. run_python("print('hello')")
- File organization: always create new projects under projects/<project-name>/, not the workspace root

CRITICAL — tool call rules:
- Tool calls are NOT Python code. Never write run_bash(...) in a code block.
- Invoke tools directly — do not write code that calls them.
- "I will call run_bash" followed by a code block = violation. Call it, don't narrate it.
- Fake output is output without a tool call. It is always wrong.

Full prompt with explanations: docs/workspace-tools-system-prompt.md

Common Mistakes

| Wrong | Right |
|---|---|
| run_python("projects/calc/main.py") | run_bash("python3 projects/calc/main.py") |
| Writing result = run_bash(...) in a code block | Invoking run_bash as a tool call directly |
| Creating files in workspace root | Creating under projects/<name>/ |
| Calling list_tree before write_file | Write first, verify after |
