GPU-Relay

Architecture

 ┌──────────────────────────────────────────────────────────────────┐
 │              CLIENT  (any OpenAI SDK / curl / Open WebUI)        │
 └────────────────────────────┬─────────────────────────────────────┘
                              │  POST /v1/chat/completions
                              ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │         BRIDGE API  :8000  (FastAPI)                             │
 │                                                                  │
 │  auth → rate-limit → pipeline → router → instance manager        │
 └──┬───────────────────────────────────────────────────────────────┘
    │
    │  ROUTER  (bridge/router.py)
    │
    ├─ image_url detected ────────────────────────────────────────── ▶ vision tier (hard stop)
    │
    ├─ X-Tier header / ?tier= / allowed_tiers / budget / complexity
    │
    ├─ [simple / architecture] ──────────────────────── ▶ RunPod  RTX 4090   ~$0.69/hr
    ├─ [maximum]               ──────────────────────── ▶ RunPod  L40S 48GB  ~$1.14/hr
    ├─ [ultra]                 ──────────────────────── ▶ RunPod  A100 80GB  ~$1.89/hr
    ├─ [vision]                ──────────────────────── ▶ Together Dedicated / RunPod MiniCPM-V
    ├─ [cloud fallback]        ──────────────────────── ▶ Vast.ai → Lambda Labs
    └─ [commercial]            ──────────────────────── ▶ OpenAI / Groq / Cerebras / SambaNova / Together / Mistral / DeepSeek

 PREPROCESSOR ─── local Ollama 7B (qwen2.5-coder) rewrites prompt before cloud inference

 ┌───────────────────────────────────────────────────────────────────┐
 │  Postgres :5432  (state, billing, audit)                          │
 │  Redis    :6379  (quota, rate-limit, cache)                       │
 └───────────────────────────────────────────────────────────────────┘

Key Files

File	Role
`bridge/main.py`	Routes, auth, image URL resolution to base64
`bridge/router.py`	Tier selection — vision hard-stop, tokens/files/keywords/budget
`bridge/instance_manager.py`	Pod pool, lifecycle, health checks, idle reaper
`bridge/multi_model.py`	WorkflowOrchestrator — named pipelines (llm-visual-html, etc.)
`providers/base.py`	BaseProvider ABC — GPU ranking, fallback order
`providers/runpod.py` / `vast.py` / `lambda_labs.py`	Cloud GPU pod providers
`providers/api_compat.py`	OpenAI / Groq / Together / Mistral / DeepSeek pass-through
`database/models.py`	User, Pod, Request, ApiKey, Invoice ORM models
`dashboard/app.py`	Streamlit entry point — Overview, Monitoring, Analytics, Billing

Install

Clone and copy the environment template

$ git clone https://github.com/infectiousoma/gpu-relay
$ cd gpu-relay
$ cp .env.example .env

Edit .env — set provider keys, secrets, and (optionally) network volume

$ $EDITOR .env

# Required: at least one GPU provider or commercial API key
PROVIDER_PRIORITY=runpod,vast,lambda
RUNPOD_API_KEY=rp_xxxxxxxxxxxxxxxxxxxx

# Required: generate with openssl rand -hex 32
BRIDGE_SECRET_KEY=<random-hex-64>
POSTGRES_PASSWORD=<strong-password>
OPENWEBUI_SECRET_KEY=<random-hex-64>

# Optional: cuts RunPod cold starts from minutes → ~30 s
RUNPOD_NETWORK_VOLUME_ID=<volume-id>
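
If openssl is not installed, the same kind of value can be produced with Python's standard library. This is a minimal sketch; the helper name `new_secret` is made up for illustration, and `secrets.token_hex(32)` yields 64 hex characters, the same shape as `openssl rand -hex 32`.

```python
import secrets


def new_secret(num_bytes: int = 32) -> str:
    """Return a random hex string; 32 bytes gives 64 hex characters."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed lines into .env.
    print(f"BRIDGE_SECRET_KEY={new_secret()}")
    print(f"OPENWEBUI_SECRET_KEY={new_secret()}")
```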

One-shot setup — builds images, starts stack, runs migrations, bootstraps admin user

$ bash scripts/setup.sh

Smoke test

$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-simple","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions

Install llmctl shortcut (optional)

# System-wide (requires sudo)
$ sudo ln -sf "$(pwd)/scripts/llmctl" /usr/local/bin/llmctl

# Per-user (no sudo — ensure ~/.local/bin is in $PATH)
$ mkdir -p ~/.local/bin && ln -sf "$(pwd)/scripts/llmctl" ~/.local/bin/llmctl

Add a user and connect Open WebUI

$ llmctl users add you@example.com
$ llmctl users keys-add you@example.com --label "open-webui"
# Copy the sk-llm-... key — displayed ONCE

# In Open WebUI → Admin Panel → Settings → Connections:
#   OpenAI API URL: http://bridge:8000/v1
#   Key: sk-llm-...

No GPU account? Run MOCK_PROVIDERS=1 ./scripts/smoke_test.sh, which routes all GPU requests to local Ollama: no billing, no cold start. All 13 E2E tests pass in this mode.

Usage

Model Names

llm-simple llm-architecture llm-maximum llm-ultra llm-vision llm-auto llm-local

Workflow models: llm-smart llm-code-review llm-refactor llm-arch-design llm-visual-html

Curl Examples

# Use a specific tier
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-architecture","messages":[{"role":"user","content":"review this code"}]}' \
     http://localhost:8000/v1/chat/completions

# Auto-route: router picks the tier based on complexity
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-auto","messages":[{"role":"user","content":"what is 2+2?"}]}' \
     http://localhost:8000/v1/chat/completions

# Force tier via header (bypasses all routing logic)
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "X-Tier: simple" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-auto","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions

# Force local GPU: bypasses cloud regardless of tier config
$ curl -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model":"llm-local","messages":[{"role":"user","content":"hello"}]}' \
     http://localhost:8000/v1/chat/completions
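
The same requests can be issued from Python without an SDK. The sketch below uses only the standard library; `build_chat_request` and `send` are illustrative helper names, not part of the project, and the response is assumed to be OpenAI-compatible JSON.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       tier: Optional[str] = None) -> urllib.request.Request:
    """Build (without sending) a POST to /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )
    request.add_header("Authorization", f"Bearer {api_key}")
    request.add_header("Content-Type", "application/json")
    if tier is not None:
        # Same effect as `-H "X-Tier: simple"` in the curl example.
        request.add_header("X-Tier", tier)
    return request


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `send(build_chat_request("http://localhost:8000", api_key, "llm-auto", "hello", tier="simple"))` mirrors the forced-tier curl call. A generous timeout is used because a cold pod may take a while to answer.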

Tier Table — Pod Providers

Model	Tier	Underlying Model	GPU	Est. $/hr
`llm-simple`	simple	Qwen2.5-Coder 7B	RTX 4090	~$0.69
`llm-architecture`	architecture	Qwen2.5 32B (tools) / Qwen2.5-Coder 32B (chat)	RTX 4090	~$0.69
`llm-maximum`	maximum	DeepSeek V3	L40S 48GB	~$1.14
`llm-ultra`	ultra	Qwen2.5 72B	A100 80GB	~$1.89
`llm-vision`	vision	Llama-3.2-11B-Vision / MiniCPM-V	L40 / RTX 4090	~$0.69–1.49
`llm-auto`	—	router selects	varies	varies
`llm-local`	local	Ollama (same models)	local GPU / CPU	free

Auto-Routing Priority

#	Signal	Action
0	image_url content parts	→ vision tier (hard stop — no fallthrough)
1	X-Tier header / ?tier= query param	force specific tier
2	allowed_tiers user whitelist	restrict scope
3	budget gate	downgrade or HTTP 402
4	token count thresholds	route by prompt size
5	file count thresholds	route by file complexity
6	complexity keywords	route by detected intent
7	default	→ simple
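
The priority list can be read as a small decision function. The sketch below is not the code in bridge/router.py: the thresholds, keywords, and per-tier minimum budgets are hypothetical placeholders (the table names the signals but not the numbers), and it assumes that signals 4 to 7 propose a tier which allowed_tiers and the budget gate then constrain.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

TIERS = ["simple", "architecture", "maximum", "ultra"]  # cheapest first

# Hypothetical values, chosen only to make the sketch runnable.
TOKEN_THRESHOLDS = [(32_000, "ultra"), (8_000, "maximum"), (2_000, "architecture")]
FILE_THRESHOLDS = [(10, "maximum"), (3, "architecture")]
KEYWORD_TIERS = {"refactor": "architecture", "design": "architecture"}
MIN_BUDGET_USD = {"simple": 0.69, "architecture": 0.69, "maximum": 1.14, "ultra": 1.89}


class BudgetExceeded(Exception):
    """Would be reported to the client as HTTP 402."""


@dataclass
class RouteInput:
    has_image: bool = False
    forced_tier: Optional[str] = None               # X-Tier header or ?tier=
    allowed_tiers: Optional[Sequence[str]] = None   # None means unrestricted
    budget_left_usd: Optional[float] = None         # None means no cap
    prompt_tokens: int = 0
    file_count: int = 0
    prompt_text: str = ""


def _by_complexity(r: RouteInput) -> str:
    for limit, tier in TOKEN_THRESHOLDS:            # 4. token count
        if r.prompt_tokens >= limit:
            return tier
    for limit, tier in FILE_THRESHOLDS:             # 5. file count
        if r.file_count >= limit:
            return tier
    text = r.prompt_text.lower()
    for word, tier in KEYWORD_TIERS.items():        # 6. complexity keywords
        if word in text:
            return tier
    return "simple"                                 # 7. default


def select_tier(r: RouteInput) -> str:
    if r.has_image:                                 # 0. vision hard stop
        return "vision"
    if r.forced_tier:                               # 1. forced tier
        return r.forced_tier
    wanted = _by_complexity(r)
    allowed = [t for t in TIERS if r.allowed_tiers is None or t in r.allowed_tiers]
    if not allowed:
        raise ValueError("no text tier is allowed for this user")
    # 2. allowed_tiers: stay at or below the wanted tier when possible.
    at_or_below = [t for t in allowed if TIERS.index(t) <= TIERS.index(wanted)]
    candidates = at_or_below or allowed[:1]
    # 3. budget gate: downgrade, or give up if nothing is affordable.
    if r.budget_left_usd is not None:
        candidates = [t for t in candidates if MIN_BUDGET_USD[t] <= r.budget_left_usd]
    if not candidates:
        raise BudgetExceeded("remaining budget is below the cheapest usable tier")
    return candidates[-1]
```

Returning the last candidate picks the most capable tier that survives both filters, which is one way to express "downgrade" rather than reject.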

API Provider Model Mapping

Tier	OpenAI	Groq	Cerebras	SambaNova	Together	Mistral	DeepSeek
simple	gpt-4o-mini	llama-3.1-8b-instant	zai-glm-4.7	Llama-3.1-8B	Llama-3.2-11B	mistral-small	deepseek-chat
mid	gpt-4o-mini	llama-3.3-70b	gpt-oss-120b	Llama-3.3-70B	Llama-3.1-70B	mistral-medium	deepseek-chat
architecture	gpt-4o	llama-3.3-70b	gpt-oss-120b	Llama-3.3-70B	Llama-3.1-70B	mistral-medium	deepseek-chat
maximum	gpt-4o	llama-3.3-70b	—	—	Llama-3.1-405B	mistral-large	deepseek-reasoner
ultra	gpt-4o	llama-3.3-70b	—	Llama-3.1-405B	Llama-3.1-405B	mistral-large	deepseek-reasoner

Override any mapping via env var — e.g. OPENAI_MODEL_ARCHITECTURE=o1-mini

────────────────────────────────────────────────────

Vision Routing

Any request with image_url content parts routes exclusively to the vision tier — steps 1–7 are skipped entirely.

If the vision tier is unavailable, the bridge returns HTTP 503 immediately. If a vision pod is acquired but the request then fails, it returns HTTP 400. Images are never silently stripped and re-routed to a text model (unless downstream_model is set for pipeline use).
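
The hard stop amounts to a check on the OpenAI-style message list before any other routing runs. A minimal sketch, assuming the standard chat format where `content` is either a string or a list of typed parts; `has_image_parts`, `route_or_fail`, and `VisionUnavailable` are illustrative names, not the project's own.

```python
from typing import Any, Dict, List


class VisionUnavailable(Exception):
    """Would be reported to the client as HTTP 503."""
    status_code = 503


def has_image_parts(messages: List[Dict[str, Any]]) -> bool:
    """True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def route_or_fail(messages: List[Dict[str, Any]], vision_available: bool) -> str:
    """Return "vision" for image requests, or "text" to continue normal routing."""
    if has_image_parts(messages):
        if not vision_available:
            # Never fall through to a text tier with the images dropped.
            raise VisionUnavailable("vision tier unavailable")
        return "vision"
    return "text"
```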

Together Dedicated Vision Tiers

Routing Tier	Vision Model	Hardware
simple	Qwen3-VL-8B-Instruct	L40 48GB / L40S / A100-40GB
vision	Llama-3.2-11B-Vision-Instruct-Turbo	L40 48GB / L40S / A100
architecture / maximum / ultra	Llama-3.2-90B-Vision-Instruct-Turbo	A100-80GB / H100-80GB

Claude Code Integration

Run claude (Claude Code CLI) backed by your own local or cloud LLMs instead of Anthropic's API. The claude-code-router (ccr) service sits between Claude Code and the bridge, converting the Anthropic API format Claude Code expects into the OpenAI format the bridge speaks.

 Claude Code  →  ccr :3456  →  Bridge :8000  →  RunPod / Groq / local GPU / …
                  │
                  ├─ Converts Anthropic ↔ OpenAI wire format
                  ├─ Routes by request type: default / background / think / longContext
                  └─ Passes tier name as model field (e.g. "architecture")

Copy and configure the ccr config

$ cp config/ccr-config.json.example config/ccr-config.json
$ $EDITOR config/ccr-config.json
# Set APIKEY to any secret, set api_key to your bridge sk-llm-... key

Add to .env and start the ccr service

CCR_PORT=3456
CCR_BRIDGE_API_KEY=sk-llm-...   # bridge API key from llmctl users keys-add

$ docker compose up -d ccr

Activate on the host and run Claude Code

$ source scripts/ccr-activate.sh   # exports ANTHROPIC_BASE_URL + ANTHROPIC_AUTH_TOKEN
$ claude --model architecture       # forces architecture tier
$ claude                            # uses default tier from ccr Router config

Request Type → Tier Mapping

ccr type	When used	Bridge tier
`default`	Most requests	architecture
`background`	Low-priority / short tasks	simple
`think`	Extended thinking	maximum
`longContext`	Large context window	ultra
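
To make the conversion step concrete, here is a rough sketch of turning an Anthropic Messages request into an OpenAI chat request with the tier name in the model field. It is not taken from claude-code-router's source: it handles text content only (no tools, images, or streaming), and the function names are invented for illustration. The mapping dict copies the table above.

```python
from typing import Any, Dict, List, Union

CCR_TYPE_TO_TIER = {
    "default": "architecture",
    "background": "simple",
    "think": "maximum",
    "longContext": "ultra",
}


def _text_of(content: Union[str, List[Dict[str, Any]]]) -> str:
    """Flatten Anthropic content (string or list of blocks) to plain text."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def anthropic_to_openai(request: Dict[str, Any], ccr_type: str = "default") -> Dict[str, Any]:
    """Convert a text-only Anthropic request into an OpenAI chat request."""
    messages: List[Dict[str, str]] = []
    system = request.get("system")
    if system:
        # Anthropic keeps the system prompt outside the message list.
        messages.append({"role": "system", "content": _text_of(system)})
    for message in request.get("messages", []):
        messages.append({"role": message["role"], "content": _text_of(message["content"])})
    converted: Dict[str, Any] = {"model": CCR_TYPE_TO_TIER[ccr_type], "messages": messages}
    if "max_tokens" in request:
        converted["max_tokens"] = request["max_tokens"]
    return converted
```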

Two-Model Routing (Ollama tiers)

For Ollama-backed tiers, the bridge automatically selects between two models based on whether the request carries tool definitions:

Request type	Model	Reason
Has tools (Claude Code tool-call session)	`qwen2.5:32b-instruct-q4_K_M`	Instruct variant handles tool call JSON format correctly
No tools (plain chat, title generation)	`qwen2.5-coder:32b-instruct-q4_K_M`	Coder variant is faster for pure text generation

Both models are pulled at pod startup. If the coder model fails to pull, requests without tools fall back automatically to the primary (instruct) model, `qwen2.5:32b-instruct-q4_K_M`.
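
The selection rule fits in a few lines. This is a sketch under the assumption that the bridge knows which models were pulled successfully; `pick_ollama_model` and the `pulled` set are hypothetical names.

```python
from typing import Any, Dict, Set

TOOL_MODEL = "qwen2.5:32b-instruct-q4_K_M"        # handles tool-call JSON
CHAT_MODEL = "qwen2.5-coder:32b-instruct-q4_K_M"  # used for plain chat


def pick_ollama_model(request: Dict[str, Any], pulled: Set[str]) -> str:
    """Choose a model from the presence of tool definitions in the request."""
    if request.get("tools"):
        return TOOL_MODEL
    if CHAT_MODEL in pulled:
        return CHAT_MODEL
    # Coder model missing: fall back to the instruct model.
    return TOOL_MODEL
```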

Tool Schema Stripping

Claude Code sends 26+ tools with verbose descriptions (~64K tokens total). The bridge strips description fields from all tool schemas before forwarding, for all providers including Ollama/RunPod. This cuts the tool payload to ~5K tokens, which:

- Keeps requests within Groq/Cerebras free-tier TPM limits
- Prevents Cloudflare 524 timeouts (full schemas caused 126s+ generation time on RunPod)
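
One way to strip descriptions is a recursive walk over each tool schema. The sketch below is an illustration, not the bridge's implementation; it assumes OpenAI-style tool definitions with JSON Schema parameters, and it takes care not to delete a parameter that happens to be named `description`.

```python
from typing import Any


def strip_descriptions(node: Any, _in_properties: bool = False) -> Any:
    """Return a copy of a tool schema with all "description" fields removed.

    Keys directly inside a "properties" map are parameter names, so a
    parameter called "description" is kept; only its own description goes.
    """
    if isinstance(node, list):
        return [strip_descriptions(item) for item in node]
    if not isinstance(node, dict):
        return node
    cleaned = {}
    for key, value in node.items():
        if key == "description" and not _in_properties:
            continue
        child_is_properties = key == "properties" and not _in_properties
        cleaned[key] = strip_descriptions(value, _in_properties=child_is_properties)
    return cleaned
```

The function builds new containers instead of editing in place, so the original request can still be logged or retried unchanged.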

Providers

Set PROVIDER_PRIORITY=runpod,vast,lambda to control order and fallback. Only providers with a configured API key are active — unconfigured providers are skipped silently.

Provider	Type	Env Key	Notes
RunPod	Cloud GPU Pod	`RUNPOD_API_KEY`	GPU preference-order fallback. Community cloud fallback. Network volume cache. Per-pod-type concurrency (vision pod never blocks simple pod).
Vast.ai	Cloud GPU Pod	`VAST_API_KEY`	Fallback when RunPod has no capacity. Identical pod lifecycle.
Lambda Labs	Cloud GPU Pod	`LAMBDA_API_KEY`	Secondary fallback after Vast. Same pod lifecycle.
Local GPU	Local	none	Add `local` to PROVIDER_PRIORITY. Routes to Ollama on this machine. Zero cost — budget gate skipped. `allow_local` must be enabled per user.
OpenAI	Commercial API	`OPENAI_API_KEY`	gpt-4o-mini / gpt-4o. Pay per token. No cold start. Multimodal supported.
Groq	Commercial API	`GROQ_API_KEY`	Llama 3.1/3.3 at high throughput. Pay per token.
Cerebras	Commercial API	`CEREBRAS_API_KEY`	zai-glm-4.7 (simple) / gpt-oss-120b (mid–architecture). Extremely fast inference. Pay per token.
SambaNova	Commercial API	`SAMBANOVA_API_KEY`	Llama 3.1/3.3 8B–405B. High-throughput inference on custom silicon. Pay per token.
Together AI	Commercial API	`TOGETHER_API_KEY`	Serverless (text only). Use `together_dedicated` provider for vision — spins a reserved GPU endpoint, billed per hour (~$1.49–6.49/hr).
Mistral	Commercial API	`MISTRAL_API_KEY`	mistral-small / medium / large. Pay per token.
DeepSeek	Commercial API	`DEEPSEEK_API_KEY`	deepseek-chat / deepseek-reasoner. Pay per token.
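
The "skip unconfigured providers" rule for the pod providers can be sketched as follows. The mapping covers only the names shown in PROVIDER_PRIORITY examples above; `active_providers` is an illustrative helper, and treating unknown names as skipped is an assumption.

```python
from typing import Dict, List, Mapping, Optional

# Provider name -> env var holding its API key (None: no key needed).
PROVIDER_ENV_KEYS: Dict[str, Optional[str]] = {
    "runpod": "RUNPOD_API_KEY",
    "vast": "VAST_API_KEY",
    "lambda": "LAMBDA_API_KEY",
    "local": None,
}


def active_providers(env: Mapping[str, str]) -> List[str]:
    """Return providers in priority order, dropping those without a key."""
    raw = env.get("PROVIDER_PRIORITY", "")
    names = [name.strip().lower() for name in raw.split(",") if name.strip()]
    active = []
    for name in names:
        if name not in PROVIDER_ENV_KEYS:
            continue  # assumption: unknown names are ignored
        key_var = PROVIDER_ENV_KEYS[name]
        if key_var is None or env.get(key_var):
            active.append(name)
    return active
```

Passing `os.environ` as `env` evaluates the live configuration; passing a plain dict keeps the function easy to test.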

RunPod GPU Tier Preferences

Tier	GPU Preference Order	VRAM Range
`simple`	RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range	8–24 GB
`vision`	RTX 4090 → RTX 3090 → A40 → A6000 → cheapest in range	10–24 GB
`architecture`	RTX 4090 → RTX 3090 → A40 → A6000 → cheapest ≥20 GB	≥20 GB
`maximum`	L40S → L40 → A40 → A100 40GB → cheapest ≥38 GB	≥38 GB
`ultra`	A100 80GB → H100 → cheapest ≥50 GB	≥50 GB

VRAM cap on simple/vision prevents landing on A100/H100 when preferred types sell out — avoids 5–10× cost with no quality gain for 7B/13B models.
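
The preference order plus VRAM cap can be sketched as "first preferred GPU in stock, else cheapest within the VRAM bounds". This is an illustration only: `pick_gpu` is a made-up helper, and the stock list with its prices is invented sample data.

```python
from typing import Dict, List, Optional, Tuple

SIMPLE_PREFERENCE = ["RTX 4090", "RTX 3090", "A40", "A6000"]


def pick_gpu(available: Dict[str, Tuple[int, float]],
             preference: List[str],
             min_vram_gb: int,
             max_vram_gb: Optional[int] = None) -> Optional[str]:
    """Pick a GPU type; `available` maps name -> (vram_gb, usd_per_hour)."""
    for name in preference:
        if name in available:
            return name
    # No preferred type in stock: cheapest GPU inside the VRAM bounds.
    in_range = [
        (price, name)
        for name, (vram, price) in available.items()
        if vram >= min_vram_gb and (max_vram_gb is None or vram <= max_vram_gb)
    ]
    return min(in_range)[1] if in_range else None


if __name__ == "__main__":
    # Invented stock: preferred types sold out, an A100 and two 16 GB cards left.
    stock = {"A100 80GB": (80, 1.89), "RTX A4000": (16, 0.25), "RTX 4080": (16, 0.40)}
    print(pick_gpu(stock, SIMPLE_PREFERENCE, min_vram_gb=8, max_vram_gb=24))
```

With the upper bound in place the A100 is never chosen for the simple tier; returning `None` lets the caller move on to the next provider.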

Network Volume Cache (RunPod)

Without a network volume, models re-download on every cold start (7B: ~3–5 min, 32B: ~10–15 min). With one, cold starts drop to ~30 s. Cost: ~$7–8/month for 100 GB.

# 1. Create 100 GB network volume in RunPod Dashboard → Storage → Network Volumes
# 2. Copy the volume ID
# 3. Add to .env and restart bridge:

RUNPOD_NETWORK_VOLUME_ID=<volume-id>

$ docker compose up -d bridge

If you delete the volume, clear RUNPOD_NETWORK_VOLUME_ID from .env immediately. A stale ID causes every pod launch to fail.
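
A quick break-even estimate shows when the volume pays for itself. The sketch assumes the pod is billed at its hourly rate while the model downloads, and uses midpoints of the figures above (32B model on an RTX 4090 at ~$0.69/hr, 12.5 min cold start without a volume, 0.5 min with one, $7.50/month for the volume).

```python
def cold_start_cost(minutes: float, usd_per_hour: float) -> float:
    """GPU cost of waiting `minutes` for a pod to become ready."""
    return minutes / 60.0 * usd_per_hour


def break_even_cold_starts(volume_usd_per_month: float,
                           minutes_without: float,
                           minutes_with: float,
                           usd_per_hour: float) -> float:
    """Cold starts per month at which the volume fee equals the GPU time saved."""
    saved_per_start = cold_start_cost(minutes_without - minutes_with, usd_per_hour)
    return volume_usd_per_month / saved_per_start


if __name__ == "__main__":
    starts = break_even_cold_starts(7.50, 12.5, 0.5, 0.69)
    print(f"break-even at about {starts:.0f} cold starts per month")  # about 54
```

Under these assumptions the volume pays off at roughly two cold starts per day on cost alone; the latency improvement is a separate benefit.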

Per-user volume keys (each user registers their own volume):

POST /v1/user/volume-keys
Authorization: Bearer <user-token>
Content-Type: application/json

{
  "provider":   "runpod",
  "volume_id":  "abc12345",
  "api_key":    "<runpod-api-key>",
  "datacenter": "EU-RO-1"     // optional — constrains to DC + validates before launch
}

GET    /v1/user/volume-keys          // list (api_key not decrypted in response)
DELETE /v1/user/volume-keys/{id}     // remove

Volume Storage Policy (Admin)

Policy	Effect when no user volume key found
`use_env`	Default policy. Use `RUNPOD_NETWORK_VOLUME_ID` if `allow_env=True` and the variable is set; otherwise launch stateless
`stateless`	Launch without any volume — models re-download every cold start
`block`	Fail immediately with HTTP 400 — no fallback to other providers
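
The three policies reduce to one lookup. The sketch below follows the table; `resolve_volume` and `VolumeRequired` are hypothetical names, and a return value of `None` stands for a stateless launch.

```python
from typing import Optional


class VolumeRequired(Exception):
    """Would be reported to the client as HTTP 400."""
    status_code = 400


def resolve_volume(policy: str,
                   user_volume_id: Optional[str],
                   env_volume_id: Optional[str],
                   allow_env: bool) -> Optional[str]:
    """Return the volume ID to attach, or None for a stateless launch."""
    if user_volume_id:
        return user_volume_id  # a registered user volume key always wins
    if policy == "use_env":
        return env_volume_id if (allow_env and env_volume_id) else None
    if policy == "stateless":
        return None
    if policy == "block":
        raise VolumeRequired("user has no registered volume key")
    raise ValueError(f"unknown storage policy: {policy!r}")
```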

CLI Reference

Commands run in the bridge container via the llmctl shortcut.

# Direct (no shortcut installed)
$ docker compose exec bridge python -m cli.llm_ctl <command>

User Management

$ llmctl users add <email>                              # create user (prompts for password)
$ llmctl users set-password <email>                     # reset password
$ llmctl users budget <email> --usd 50                  # set monthly spend cap
$ llmctl users credit-add <email> --usd 20              # add prepaid credit
$ llmctl users tiers <email>                            # show allowed tiers
$ llmctl users tiers <email> --set simple               # lock to one tier
$ llmctl users tiers <email> --set simple,architecture  # allow two tiers
$ llmctl users tiers <email> --set all                  # remove restriction
$ llmctl users deactivate <email>                       # soft-delete user
$ llmctl users list                                     # list all users

API Keys

$ llmctl users keys-add <email> --label "open-webui"   # create key (shown ONCE)
$ llmctl users keys-list <email>                       # list active keys
$ llmctl users keys-revoke <key_id>                    # revoke by ID
$ llmctl users reset-key <email> [--label "name"]      # revoke all + issue fresh key

User Sync (Open WebUI)

$ llmctl users add <email> --sync-openwebui        # create bridge user + matching Open WebUI account
$ llmctl users keys-add <email> --sync-pipeline    # create key + update pipeline user_key_map

Requires OPENWEBUI_ADMIN_EMAIL, OPENWEBUI_ADMIN_PASSWORD, and PIPELINES_API_KEY in .env.

Volume Storage Policy

$ llmctl users storage <email>                          # show policy + registered volume keys
$ llmctl users storage <email> --policy use_env         # use env volume if no user key (default)
$ llmctl users storage <email> --policy stateless       # always launch without volume
$ llmctl users storage <email> --policy block           # require user volume; reject otherwise
$ llmctl users storage <email> --allow-env              # allow shared RUNPOD_NETWORK_VOLUME_ID
$ llmctl users storage <email> --no-allow-env           # prevent shared volume for this user

Local Provider Access

$ llmctl users local-access <email> --allow   # enable llm-local for this user
$ llmctl users local-access <email> --deny    # disable llm-local for this user

Pods & Billing

$ llmctl pods ls [--status ready]             # list pods
$ llmctl pods kill <pod_id>                   # terminate pod immediately
$ llmctl start --tier architecture            # prewarm a pod
$ llmctl bills run --month 2026-05            # generate invoices
$ llmctl bills show <email> --month 2026-05   # per-user invoice + breakdown

Observability

$ llmctl models [--user-type personal]        # tier table with effective $/hr
$ llmctl status [--tier architecture]         # active pods + running cost
$ llmctl budget [--email u@example.com]       # spend vs cap progress bars
$ llmctl costs [--month 2026-05]              # per-tier cost breakdown
$ llmctl gain  [--month 2026-05]              # savings vs GPT-4o equivalent

Deployment Modes

Four ways to deploy — from solo dev to multi-tenant hosted service.

Mode	Who	Bridge	Gateway	Open WebUI
Solo	One user	Local	—	Local, `OPENWEBUI_BRIDGE_API_KEY` set to your key
Hosted multi-user	Admin + users	Shared server	—	Shared. Per-user billing via gpu-relay Pipelines.
Gateway client	User of hosted bridge	Remote (host's)	Local	Local, pointed at gateway on port 8080
Full self-hosted	Single operator	Own server	Optional	Own server or local

Gateway — Local Proxy to Remote Bridge

The gateway is a lightweight stateless proxy. Users point their OpenAI-compatible client at http://localhost:8080 and authenticate with their own sk-llm- key. The gateway forwards requests to the upstream bridge unchanged.

Option A — alongside main stack (add-on overlay):

# Add to .env:
GATEWAY_BRIDGE_URL=http://bridge:8000   # internal; bridge is a sibling service
GATEWAY_PORT=8080

$ docker compose -f docker-compose.yml -f docker-compose.gateway.yml up -d gateway

Option B — standalone (user's machine → remote bridge):

# docker/docker-compose.gateway.yml (no main stack needed)
GATEWAY_BRIDGE_URL=https://your-bridge.example.com
GATEWAY_PORT=8080

$ docker compose -f docker/docker-compose.gateway.yml up -d

Then set your Open WebUI (or any OpenAI client) base URL to http://localhost:8080 and API key to your sk-llm-... bridge key.

Hosted Multi-User Setup

Each user gets their own bridge API key. The gpu-relay Pipelines manifold routes inference per-user for correct billing attribution. Admin installs the pipeline once; each user's key is added to the pipeline's user_key_map automatically when the key is created with --sync-pipeline.

# .env — sync settings
OPENWEBUI_ADMIN_EMAIL=admin@example.com
OPENWEBUI_ADMIN_PASSWORD=<ow-admin-password>
PIPELINES_API_KEY=<from Open WebUI Admin → Pipelines → API key>
PIPELINE_ID=gpu-relay                   # default; match your installed pipeline ID

# Create user + Open WebUI account, then a key with its pipeline mapping:
$ llmctl users add alice@example.com --sync-openwebui
$ llmctl users keys-add alice@example.com --sync-pipeline

# Or via HTTP (returns plaintext key in response):
$ curl -X POST http://localhost:8000/admin/users \
       -H "Authorization: Bearer <admin-key>" \
       -H "Content-Type: application/json" \
       -d '{"email":"alice@example.com","password":"...","sync_openwebui":true,"sync_pipeline":true}'

User Portal

Users can view their own spend, budget, and 30-day usage charts without admin access. Available at http://localhost:8501 → User Portal page. Log in with an sk-llm- key; no email or password is needed.

The user portal calls GET /v1/usage on the bridge. It only shows data for the authenticated key's user — no cross-user visibility.

Workspace Tools

Open WebUI Tool integration that gives any model persistent file I/O and code execution in a shared workspace_data/ directory. Add the tool once in Admin Panel — every model in Open WebUI gains the same capabilities.

Available Tools

Tool	Description
`write_file(path, content)`	Write text to a file; creates parent dirs automatically
`read_file(path)`	Read a file from the workspace
`run_bash(command)`	Execute a shell command in the workspace directory
`run_python(code)`	Execute a Python code snippet (not a file path)
`list_tree(path, depth)`	Browse directory structure
`search_files(query, path, glob)`	Regex search across workspace files
`delete_path(path)`	Delete file or directory
`move_path(src, dst)`	Move or rename file/directory
`create_directory(path)`	Create directory (and parents)
`generate_pdf(markdown, output_path)`	Render Markdown → PDF in workspace
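
As an illustration of the `write_file` behaviour described in the table, the sketch below writes under a workspace root and creates parent directories as needed. It is not the code in the workspace-tools service; in particular, the check that rejects paths escaping the workspace is an assumption about what such a tool would want, not something this document states.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, relative: str) -> Path:
    """Resolve `relative` under `workspace`, refusing paths that escape it."""
    root = workspace.resolve()
    target = (root / relative).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes the workspace: {relative}")
    return target


def write_file(workspace: Path, path: str, content: str) -> Path:
    """Write text to a file inside the workspace, creating parent dirs."""
    target = resolve_in_workspace(workspace, path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```

A call such as `write_file(Path("workspace_data"), "projects/calc/main.py", source)` would create `projects/calc/` on first use.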

Setup

# 1. Open WebUI → Admin Panel → Tools → Add Tool
#    Paste contents of: pipelines/openwebui_tool.py
#    Set valve: workspace_tools_url = http://workspace-tools:7000

# 2. Start workspace-tools service (already in docker-compose.yml)
$ docker compose up -d workspace-tools

# 3. In any Open WebUI chat: enable the tool via the tool icon, then chat normally

System Prompt

Without a system prompt, models often write tool calls out as Python code instead of invoking them. Use the recommended system prompt below to enforce correct behavior.

You have access to Workspace Tools. You MUST use them for ALL file and execution tasks —
never simulate, imagine, or describe results.

Rules:
- Writing code: always call write_file with real file content (actual newlines, not \n)
- Testing code: always call run_bash or run_python — never show "expected output"
- Browsing files: always call list_tree or read_file
- If a tool call fails: read the actual error, fix the actual file, retry
- To run a Python FILE: use run_bash with "python3 <path>" — never pass a file path to run_python
- run_python is for inline code snippets only, e.g. run_python("print('hello')")
- File organization: always create new projects under projects/<project-name>/, not the workspace root

CRITICAL — tool call rules:
- Tool calls are NOT Python code. Never write run_bash(...) in a code block.
- Invoke tools directly — do not write code that calls them.
- "I will call run_bash" followed by a code block = violation. Call it, don't narrate it.
- Fake output is output without a tool call. It is always wrong.

Full prompt with explanations: docs/workspace-tools-system-prompt.md

Common Mistakes

Wrong	Right
`run_python("projects/calc/main.py")`	`run_bash("python3 projects/calc/main.py")`
Writing `result = run_bash(...)` in a code block	Invoking `run_bash` as a tool call directly
Creating files in workspace root	Creating under `projects/<name>/`
Calling `list_tree` before `write_file`	Write first, verify after

Services