provocapi

OpenAI-compatible inference API on owned GPU infrastructure. A drop-in replacement for the OpenAI API: point the OpenAI SDK at it and requests are served by open-weight models (Llama, Qwen, Mistral, DeepSeek, BGE, E5) running on your own GPUs.

See PRD.md for product scope and TRD.md for the full architecture. The customer-facing website (MkDocs Material) lives in site/.

What’s in this repo

Component	Language	Purpose
`gateway/`	Python / FastAPI	HTTPS front door: auth (argon2id + Postgres + Redis), rate limiting (Redis sliding-window RPM/TPM per tenant), usage metering (Redpanda → ClickHouse), LoRA adapter + batch endpoints, `/metrics` + `/status`
`router/`	Go	Model dispatch with least-outstanding-requests + weighted-fair scheduling, canary traffic splitting, SSE proxy, Kubernetes pod-watch for worker discovery
`worker-vllm/`	Docker + vLLM	Production GPU worker: `vllm/vllm-openai` with configurable model/dtype/TP/LoRA
`worker-stub/`	Python / FastAPI	CPU-only stand-in that emits canned SSE tokens. Exercises the full plumbing without a GPU
`dashboard/`	Next.js / TypeScript	Admin UI: tenant/key management, model catalog, usage charts, streaming playground, status page
`deploy/operator/`	Python / kopf	Watches `Model` CRs, creates vLLM Deployments with GPU node affinity + canary replicas
`deploy/intake/`	Python / kopf	Mirrors model weights from HuggingFace to MinIO before workers start; gates operator on `status.intakeReady`
`deploy/charts/provocapi/`	Helm	Full k8s chart: gateway, router, operator, intake, ingress+TLS, external-secrets, Envoy, NetworkPolicies, ServiceMonitors, model-warmer DaemonSet
`site/`	Markdown (MkDocs)	Customer-facing site: landing, pricing, privacy, quickstart, OpenAI migration guide, model catalog, API reference, OpenAPI spec

Architecture

                              Customer (OpenAI SDK / curl)
                                        │
                                  HTTPS + Bearer
                                        ▼
                          ┌────────────────────────────┐
                          │  Ingress + cert-manager     │  auto-renewed TLS
                          │  (nginx or Envoy edge)      │
                          └──────────────┬─────────────┘
                                         │
                          ┌──────────────▼─────────────┐
                          │  Gateway  (FastAPI)         │  auth · rate limit ·
                          │                             │  usage metering · /metrics
                          └──────────────┬─────────────┘
                                         │ HTTP/2 + traceparent
                          ┌──────────────▼─────────────┐
                          │  Router  (Go)               │  weighted-fair
                          │                             │  scheduling · LOR ·
                          │                             │  canary split · SSE
                          └───┬───────────────────────┬─┘
           pod-watch /        │                       │
           labels             │                       │
┌──────────▼─────┐  ┌──────────▼────┐        ┌──────────▼────┐
│ vLLM worker    │  │ vLLM worker   │  ...   │ vLLM canary   │
│ Llama 70B FP8  │  │ Qwen 32B      │        │ Llama 70B     │
│ 1x H100        │  │ 1x H100       │        │ revision B    │
└────────────────┘  └───────────────┘        └───────────────┘

         Control plane side-channels:
         · Postgres (tenants, keys, adapters, batch jobs)
         · Redis (key cache, rate limits)
         · MinIO (model weight registry)
         · Redpanda + ClickHouse (usage events)
         · Prometheus + Grafana + Alertmanager
         · OTLP collector (W3C traceparent from gateway through router)
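
The router box above lists LOR (least-outstanding-requests) as one of its dispatch strategies. The router itself is written in Go and its code is not reproduced here; the following is only a minimal Python sketch of the idea, with a hypothetical `Worker` record and made-up worker names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    name: str
    in_flight: int        # requests currently outstanding on this worker
    healthy: bool = True  # as reported by a health check


def pick_least_outstanding(workers: List[Worker]) -> Optional[Worker]:
    """Return the healthy worker with the fewest in-flight requests."""
    candidates = [w for w in workers if w.healthy]
    if not candidates:
        return None
    # min() keeps the first of equally loaded workers, so ties resolve by list order.
    return min(candidates, key=lambda w: w.in_flight)


# Hypothetical pool: vllm-c is idle but unhealthy, so it is skipped.
pool = [
    Worker("vllm-a", in_flight=4),
    Worker("vllm-b", in_flight=1),
    Worker("vllm-c", in_flight=0, healthy=False),
]
chosen = pick_least_outstanding(pool)
```

A real dispatcher would also increment `in_flight` when a request is sent and decrement it when the response (or SSE stream) finishes.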

Deploying the API (control plane)

Three modes, all using the same images and config shape. Pick the one that matches your target:

Mode A: local development

Zero external dependencies. Stub workers on CPU. Perfect for working on the API surface.

make up           # builds images, starts gateway + router + 2 stub workers
make smoke        # end-to-end SSE test
make logs         # tail everything
make down         # stop

The gateway listens at http://localhost:8000 and, since this mode runs without a database, accepts any pk-prov-* key.
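
When poking at the stub stack without the OpenAI SDK, it helps to see what the streamed response looks like on the wire. The sketch below assumes the stub workers emit the standard OpenAI streaming format (`data: {...}` lines ending with `data: [DONE]`); the sample lines are invented for illustration and are not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta.content strings from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Invented sample stream; a real one would come from the HTTP response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
streamed_text = "".join(iter_stream_text(sample))
```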

Mode B: local with full control plane

Adds Postgres, Redis, real auth, rate limiting, admin API, optional dashboard on port 3000.

make up-full              # gateway + router + workers + Postgres + Redis
make seed                 # creates a dev tenant and prints a real API key
make up-dashboard         # adds the Next.js dashboard at localhost:3000

# Optional overlays:
make up-metering          # adds Redpanda + ClickHouse for usage metering

Mode C: Kubernetes (production)

Prerequisites the cluster needs (install once, shared across apps):

Kubernetes (k3s, RKE2, kubeadm, EKS, etc.) with NVIDIA GPU Operator installed
nginx-ingress (or your preferred ingress controller)
cert-manager with a ClusterIssuer for Let’s Encrypt (or internal CA)
external-secrets.io pointed at your secret store (Vault / AWS Secrets Manager / etc.)
prometheus-operator / kube-prometheus-stack for the ServiceMonitor CRDs
MinIO (or any S3-compatible object store) for the model weight registry
Managed Postgres (RDS, CloudSQL, or in-cluster postgres-operator) and Redis (ElastiCache or in-cluster)
Redpanda (Kafka-compatible) and ClickHouse for usage metering (optional)

Deploy the control plane:

# 1. Apply the Model CRD (cluster-scoped, install once).
kubectl apply -f deploy/charts/provocapi/crds/model.yaml

# 2. Populate the secret store with:
#    database-url, redis-url, hf-token, api-key-pepper,
#    s3-access-key, s3-secret-key
# (How you do this is backend-specific — see your Vault / AWS SM docs.)

# 3. Install the Helm chart.
helm install provocapi deploy/charts/provocapi \
  --namespace inference-api --create-namespace \
  --set ingress.enabled=true \
  --set ingress.host=api.yourdomain.com \
  --set ingress.certManagerIssuer=letsencrypt-prod \
  --set externalSecrets.enabled=true \
  --set externalSecrets.storeName=your-vault-store \
  --set externalSecrets.vaultPath=secret/data/provocapi/prod \
  --set intake.enabled=true \
  --set gateway.databaseUrl=set-via-external-secret \
  --set gateway.redisUrl=set-via-external-secret \
  --set gateway.kafkaBootstrap=redpanda.kafka.svc:9092

# 4. Apply the migrations to Postgres (first time only).
#    See migrations/*.sql — apply 001..005 in order.

# 5. Create a tenant and API key via the admin API. Port-forward the
#    gateway service first: admin routes are not publicly exposed.
kubectl -n inference-api port-forward svc/provocapi-gateway 8000:8000 &
python3 scripts/seed.py
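
Step 4 names the migrations but gives no command. One possible way to apply them in order is sketched below. It assumes each file in migrations/ starts with its numeric prefix and that `psql` is on the PATH; the file names in the demo are hypothetical, and the helper functions are not part of this repo.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def ordered_migrations(directory: str) -> List[Path]:
    """Return *.sql files sorted by their leading numeric prefix (001, 002, ...)."""
    def prefix(path: Path) -> int:
        match = re.match(r"(\d+)", path.name)
        if match is None:
            raise ValueError(f"migration without numeric prefix: {path.name}")
        return int(match.group(1))

    return sorted(Path(directory).glob("*.sql"), key=prefix)


def psql_commands(directory: str, database_url: str) -> List[List[str]]:
    """Build one psql invocation per migration; ON_ERROR_STOP aborts on the first error."""
    return [
        ["psql", database_url, "-v", "ON_ERROR_STOP=1", "-f", str(path)]
        for path in ordered_migrations(directory)
    ]


# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("002_keys.sql", "001_tenants.sql", "010_later.sql"):
        (Path(tmp) / name).write_text("-- placeholder\n")
    order = [p.name for p in ordered_migrations(tmp)]
    commands = psql_commands(tmp, "postgresql://user:pass@localhost/provocapi")
```

To actually run them, pass each command to `subprocess.run(cmd, check=True)` with the real database URL.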

GitOps alternative: deploy/argocd/application.yaml points at this chart and reconciles on every push to main.

Adding GPU nodes (data plane)

Workers run the vLLM container on labeled nodes. A node can serve one or many models depending on its GPU class and VRAM.

1. Provision and join the node

Any hardware bring-up workflow works — PXE + Talos/Ubuntu, bare-metal image, cloud GPU instance. After the OS is up:

# Install the NVIDIA driver + container toolkit (handled automatically by
# the NVIDIA GPU Operator once the node joins the cluster).

# Join to the cluster (k3s example; adapt for your flavor):
curl -sfL https://get.k3s.io | K3S_URL=https://<control-plane>:6443 \
    K3S_TOKEN=<token> sh -

# Verify GPUs are visible:
# (quoted so the shell keeps the backslash that escapes the dot in nvidia.com/gpu)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'

2. Label and taint the node

The router and operator use labels to decide which nodes can serve which models.

kubectl label node <node-name> \
  provocapi.io/pool=inference-api \
  provocapi.io/gpu-class=h100 \
  provocapi.io/gpu-count=8

# Taint so only our workers land here (and our workers tolerate it
# via the Helm chart's default tolerations).
kubectl taint node <node-name> \
  provocapi.io/pool=inference-api:NoSchedule

Supported GPU class labels (the operator’s node affinity matches these):

Label	Hardware	Notes
`rtx-5090`	NVIDIA RTX 5090 32GB	Best for 8B models and embeddings
`rtx-pro-6000`	NVIDIA RTX PRO 6000 96GB (Blackwell)	24B–70B FP8 single-card
`a100-40` / `a100-80`	NVIDIA A100 40/80GB	Legacy 24B–32B
`h100`	NVIDIA H100 80GB	Flagship; 70B FP8 single-card or multi-card TP
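
To make the label and taint mechanics concrete, here is a sketch of the scheduling fragment a worker pod spec would need in order to land on these nodes. The field names are standard Kubernetes pod-spec fields; whether the operator emits exactly this structure is an assumption, and `worker_scheduling` is a hypothetical helper, not a function from this repo.

```python
from typing import List


def worker_scheduling(gpu_classes: List[str], pool: str = "inference-api") -> dict:
    """Build the affinity + tolerations fragment of a worker pod spec."""
    return {
        "affinity": {
            "nodeAffinity": {
                # Hard requirement: the pod stays Pending if no node matches.
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            # Expressions within one term are ANDed together.
                            "matchExpressions": [
                                {"key": "provocapi.io/pool",
                                 "operator": "In", "values": [pool]},
                                {"key": "provocapi.io/gpu-class",
                                 "operator": "In", "values": list(gpu_classes)},
                            ]
                        }
                    ]
                }
            }
        },
        # Matches the provocapi.io/pool=inference-api:NoSchedule taint set above.
        "tolerations": [
            {"key": "provocapi.io/pool", "operator": "Equal",
             "value": pool, "effect": "NoSchedule"}
        ],
    }


fragment = worker_scheduling(["h100", "rtx-pro-6000"])
```

The affinity pulls workers toward labeled nodes, while the taint and toleration pair keeps other workloads off them; both halves are needed for a dedicated pool.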

3. Create a `Model` resource

Deploy a model onto the pool by applying a Model CR. Three examples live in deploy/examples/:

kubectl apply -f deploy/examples/model-llama-8b.yaml     # Llama 3.1 8B, RTX 5090
kubectl apply -f deploy/examples/model-llama-70b.yaml    # Llama 3.1 70B FP8, H100
kubectl apply -f deploy/examples/model-bge-m3.yaml       # BGE-M3 embeddings, RTX 5090

What happens:

The intake controller (deploy/intake/) reads spec.source and mirrors the weights from HuggingFace to MinIO (s3://provocapi-models/<repo>/<revision>/). It writes a MANIFEST.json with a SHA256 per file and sets status.intakeReady=true when done.
The operator (deploy/operator/) sees intakeReady and creates a vLLM Deployment with the correct GPU resource requests, node affinity (provocapi.io/gpu-class in spec.serving.gpuClassAllowed), and tolerations for the inference-api pool.
The model-warmer DaemonSet pre-pulls the weights to /var/lib/provocapi/models/ on local NVMe so future pod restarts are fast.
Each vLLM pod carries provocapi.io/model=<model-id> — the router’s pod-watch picks it up and adds it to the pool.
The router’s health check polls /health on the vLLM pod and flips the healthy bit when Ready.
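
The per-file SHA256 in MANIFEST.json allows a mirrored or pre-pulled copy of the weights to be checked for corruption. The manifest schema is not documented in this README, so the sketch below assumes a hypothetical layout of `{"files": {"<relative path>": "<sha256 hex>"}}`; adapt the key names to the real file.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight shards are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(model_dir: Path) -> List[str]:
    """Return relative paths that are missing or whose hash differs from the manifest."""
    manifest = json.loads((model_dir / "MANIFEST.json").read_text())
    bad = []
    for rel_path, expected in manifest["files"].items():  # assumed schema
        target = model_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            bad.append(rel_path)
    return bad


# Demo: one file matches, one listed file is absent.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_bytes(b"{}")
    demo_manifest = {"files": {
        "config.json": hashlib.sha256(b"{}").hexdigest(),
        "model.safetensors": "0" * 64,
    }}
    (root / "MANIFEST.json").write_text(json.dumps(demo_manifest))
    mismatches = verify_manifest(root)
```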

Inspect what’s running:

kubectl -n inference-api get models              # Model CRs + their status
kubectl -n inference-api get pods -l provocapi.io/component=worker
kubectl -n inference-api logs -l provocapi.io/component=operator --tail=50

4. Serve traffic

Once kubectl get model <name> shows status.readyReplicas > 0, the router has live upstream workers and customers can hit the API.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.yourdomain.com/v1",
    api_key="pk-prov-<your-key>",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)
for chunk in resp:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

5. Rollouts and canaries

Update spec.source.revision for a straight rolling update. For a canary:

spec:
  source:
    repo: meta-llama/Llama-3.1-70B-Instruct
    revision: main
  canary:
    enabled: true
    revision: v2-fine-tuned-2025-09
    weight: 5          # send 5% of traffic to the canary replica

The operator creates a parallel <model>-canary Deployment with replicas=1, labeled with provocapi.io/rollout=canary and provocapi.io/canary-weight=5. The router reads those labels and uses weighted random selection for traffic splitting, with automatic fallback if the canary is unhealthy.
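
The weighted split with fallback can be pictured as follows. This is an illustrative Python sketch, not the Go router's code; the `Pool` record and `choose_pool` function are hypothetical, and the weight is read as a percentage to match `weight: 5` above.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    healthy: bool


def choose_pool(stable: Pool, canary: Optional[Pool],
                canary_weight: int, rng: random.Random) -> Pool:
    """Send roughly canary_weight percent of requests to the canary, the rest to stable."""
    if canary is None or not canary.healthy:
        return stable  # fallback: an unhealthy canary receives no traffic
    # rng.random() is uniform on [0, 1); scale it to a percentage.
    if rng.random() * 100 < canary_weight:
        return canary
    return stable


rng = random.Random(0)  # seeded only so the demo is repeatable
stable = Pool("llama-70b", healthy=True)
canary = Pool("llama-70b-canary", healthy=True)
hits = sum(choose_pool(stable, canary, 5, rng) is canary for _ in range(10_000))
```

Over 10,000 simulated requests, `hits` should land near 500. Because selection is random per request, the realized split only approaches 5% as volume grows.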

OpenAI-compatible endpoints

Endpoint	Notes
`POST /v1/chat/completions`	Streaming + non-streaming, tool calling, JSON mode
`POST /v1/completions`	Legacy text completion
`POST /v1/embeddings`	Multilingual and instruct-tuned embedding models
`GET /v1/models`	Models available to the authenticated tenant
`GET /v1/usage`	Per-model, per-day tokens + latency percentiles (requires ClickHouse)

Native extensions

Endpoint	Notes
`POST /v1/adapters`	Upload LoRA adapter from HF repo
`POST /v1/adapters/upload`	Upload adapter weights directly (multipart)
`GET /v1/adapters`	List tenant’s adapters
`POST /v1/batch`	Submit JSONL batch job (50% pricing, 24h SLA)
`GET /v1/batch/{id}/output`	Download batch results
`GET /status`	Public status endpoint (per-model health + latency)

Use adapters in inference via model:adapter-name syntax:

{"model": "llama-3.1-70b-instruct:my-finetuned-v3", "messages": [...]}
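
A client or proxy that needs to separate the two halves of that field could do so as sketched here. This assumes base model IDs never contain a colon; the gateway's actual parsing rules are not shown in this README, and `split_model_adapter` is a hypothetical helper.

```python
from typing import Optional, Tuple


def split_model_adapter(model_field: str) -> Tuple[str, Optional[str]]:
    """Split 'model:adapter-name' into (model, adapter); adapter is None when absent."""
    base, sep, adapter = model_field.partition(":")
    if not base:
        raise ValueError("model name is empty")
    if sep and not adapter:
        raise ValueError("adapter name is empty after ':'")
    return base, (adapter or None)


with_adapter = split_model_adapter("llama-3.1-70b-instruct:my-finetuned-v3")
without_adapter = split_model_adapter("llama-3.1-70b-instruct")
```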

Admin API

Non-OpenAI-compatible endpoints for tenant + key management. Not exposed publicly in production — reachable only from the management VLAN.

Endpoint	Notes
`POST /admin/tenants`	Create a billing tenant
`GET /admin/tenants`	List
`POST /admin/keys`	Issue a new API key
`GET /admin/keys/{tenant_id}`	List keys for a tenant
`POST /admin/keys/{id}/revoke`	Hard-revoke a key
`POST /admin/keys/{id}/rotate`	Issue successor + put the old key in a 24h grace window
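
The difference between revoke and rotate comes down to when the old key stops working. The sketch below shows one way the acceptance check could be expressed. The field names (`rotated_at`, `revoked`) are hypothetical and do not come from gateway/app/auth.py; only the 24h window is taken from the table above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_WINDOW = timedelta(hours=24)


def key_is_accepted(rotated_at: Optional[datetime], revoked: bool, now: datetime) -> bool:
    """Decide whether a key should still authenticate at time `now`."""
    if revoked:
        return False  # hard revoke takes effect immediately
    if rotated_at is None:
        return True   # never rotated: this is the current key
    # Rotated keys stay valid until the grace window closes.
    return now < rotated_at + GRACE_WINDOW


rotated = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
still_ok = key_is_accepted(rotated, revoked=False, now=rotated + timedelta(hours=23))
expired = key_is_accepted(rotated, revoked=False, now=rotated + timedelta(hours=24))
```

Using timezone-aware UTC timestamps throughout avoids off-by-hours errors when the gateway and database run in different zones.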

Observability

Every component exposes a Prometheus /metrics endpoint. Install the deploy/monitoring/ assets once:

kubectl apply -f deploy/monitoring/service-monitors.yaml
kubectl apply -f deploy/monitoring/alerts/inference-alerts.yaml
# Import the Grafana dashboards:
#   deploy/monitoring/dashboards/inference-overview.json
#   deploy/monitoring/dashboards/per-tenant.json

Key SLO metrics (from the TRD §7):

Metric	What it watches
`provocapi_time_to_first_token_seconds`	Histogram; p95 SLO: <1.2s (shared), <600ms (reserved)
`provocapi_request_duration_seconds`	End-to-end latency
`provocapi_rate_limit_rejections_total`	Per tenant, split by rpm/tpm
`vllm:num_requests_running` / `vllm:num_requests_waiting`	Worker queue depth
`vllm:gpu_cache_usage_perc`	KV cache pressure — predictive capacity signal
`DCGM_FI_DEV_GPU_UTIL` / `DCGM_FI_DEV_GPU_TEMP`	GPU health (via NVIDIA DCGM exporter)

Set OTEL_EXPORTER_OTLP_ENDPOINT on the gateway and router to enable end-to-end distributed tracing (W3C traceparent propagation).

Testing

# Gateway (pytest, 47 tests, <1s)
cd gateway && pytest tests/

# Router (go test + race detector, 26 tests, <2s)
cd router && CGO_ENABLED=1 go test -race ./...

# End-to-end against the running compose stack
make smoke

# Full-stack with auth, Postgres, Redis
make up-full && make seed && ./scripts/openai_sdk_check.py

CI (.github/workflows/ci.yml) runs all of the above plus Helm lint, a Trivy filesystem scan, and eslint + tsc on the dashboard. The release workflow (.github/workflows/release.yml) adds a Trivy image scan and a cosign keyless signature on every release tag.

Make targets

Target	What it does
`make up`	Walking skeleton: stub workers, no DB
`make up-full`	+ Postgres + Redis + live auth + rate limiting
`make up-dashboard`	+ Next.js dashboard at :3000
`make up-metering`	+ Redpanda + ClickHouse usage pipeline
`make up-gpu`	Real vLLM workers with GPU passthrough (needs nvidia-container-toolkit)
`make smoke`	End-to-end SSE test
`make seed`	Create dev tenant + API key (for `up-full`)
`make logs`	Tail all container logs
`make down`	Stop the stack
`make clean`	Down + remove volumes

Repo layout

provocapi/
├── PRD.md                          product requirements
├── TRD.md                          technical requirements
├── README.md                       this file
├── Makefile                        up/down/smoke/seed targets
├── docker-compose.yml              walking-skeleton stack
├── docker-compose.full.yml         + Postgres + Redis + live auth
├── docker-compose.dashboard.yml    + Next.js dashboard
├── docker-compose.metering.yml     + Redpanda + ClickHouse
├── docker-compose.gpu.yml          + real vLLM on GPUs
├── ruff.toml                       shared Python lint config
├── .github/workflows/              CI (lint+test+scan) + release (cosign)
├── gateway/                        FastAPI HTTP front door
│   ├── app/
│   │   ├── main.py                   endpoints, error handling, proxy path
│   │   ├── auth.py                   API key resolution + grace-period logic
│   │   ├── keys.py                   argon2id key generation + hashing
│   │   ├── schemas.py                OpenAI request shapes (permissive)
│   │   ├── ratelimit.py              Redis sliding-window RPM/TPM
│   │   ├── adapters.py               /v1/adapters CRUD
│   │   ├── batch.py                  /v1/batch async processor
│   │   ├── admin.py                  /admin/* tenant + key management
│   │   ├── status_api.py             /status public endpoint + rolling sampler
│   │   ├── metrics.py                Prometheus metric definitions
│   │   ├── tracing.py                OTEL setup
│   │   ├── usage.py                  Kafka producer for usage events
│   │   ├── usage_api.py              /v1/usage ClickHouse query
│   │   ├── router_client.py          httpx wrapper for router calls
│   │   └── db/                       Postgres + Redis connection pools
│   └── tests/                        pytest suites
├── router/                         Go model router
│   ├── main.go                       HTTP server, dispatch, pool balancing
│   ├── scheduler.go                  per-tenant weighted-fair DRR admission
│   ├── k8s.go                        client-go informer for pod-watch
│   ├── metrics.go                    Prometheus metrics
│   ├── tracing.go                    OTEL middleware
│   └── *_test.go                     Go test suites
├── worker-vllm/                    production vLLM worker
├── worker-stub/                    CPU-only stub for local dev
├── dashboard/                      Next.js admin UI
├── deploy/
│   ├── charts/provocapi/             Helm chart (11 templates)
│   ├── examples/                     example Model CRs
│   ├── operator/                     kopf-based model operator
│   ├── intake/                       HF → MinIO weight mirror
│   ├── monitoring/                   ServiceMonitors + alerts + dashboards
│   └── argocd/                       GitOps Application resource
├── migrations/                     SQL migrations (001..005)
├── scripts/
│   ├── smoke.py                      stdlib e2e test
│   ├── seed.py                       dev tenant + key seeder
│   └── openai_sdk_check.py           drop-in SDK validation
└── site/                          customer-facing site (MkDocs Material)
    ├── index.md                      landing page
    ├── pricing.md                    pricing
    ├── privacy.md                    privacy policy
    ├── docs/                         developer documentation
    │   ├── index.md                    quickstart
    │   ├── migrating-from-openai.md
    │   ├── lora-adapters.md
    │   ├── batch-inference.md
    │   ├── models.md
    │   └── api.md
    └── openapi.json                  exported OpenAPI spec