LoRA Adapters

Serve your fine-tuned models on provocapi without managing infrastructure. Upload a LoRA adapter and we hot-load it onto the base model's workers, with no restart and no downtime.

How it works

  1. You fine-tune a LoRA adapter against a supported base model (e.g., Llama 3.1 70B) using your own training pipeline.
  2. You upload the adapter to provocapi via the /v1/adapters endpoint.
  3. You reference it in inference requests as model:adapter-name (e.g., llama-3.1-70b-instruct:my-finetuned-v3).
  4. On first use, we hot-load it onto the vLLM workers serving the base model, typically in under 2 seconds.

Supported base models

LoRA adapters are supported on all chat models in the catalog. Embedding models do not support LoRA.

Upload an adapter

From a HuggingFace repo

curl -X POST https://inference.provocative.earth/v1/adapters \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-finetuned-v3",
    "base_model": "llama-3.1-70b-instruct",
    "repo": "my-org/llama-lora-v3",
    "revision": "main"
  }'
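For callers working in Python, the same registration request can be assembled with the standard library. This is a minimal sketch rather than an official client: the helper name build_register_request is hypothetical, the fields mirror the curl example above, and the response format is not assumed.

```python
import json
import urllib.request

API_BASE = "https://inference.provocative.earth/v1"


def build_register_request(api_key, name, base_model, repo, revision="main"):
    """Build (but do not send) the POST /v1/adapters request shown above.

    Hypothetical helper; the field names are taken from the curl example.
    """
    body = json.dumps({
        "name": name,
        "base_model": base_model,
        "repo": repo,
        "revision": revision,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/adapters",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_register_request(
    "pk-prov-YOUR-KEY",
    "my-finetuned-v3",
    "llama-3.1-70b-instruct",
    "my-org/llama-lora-v3",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req).
# The response body is not parsed here because its schema is not documented above.
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.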

Direct file upload

curl -X POST https://inference.provocative.earth/v1/adapters/upload \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -F "name=my-finetuned-v3" \
  -F "base_model=llama-3.1-70b-instruct" \
  -F "file=@adapter_model.safetensors"

Use an adapter in inference

Pass the adapter name after a colon in the model field:

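from openai import OpenAI

# Assumption: the endpoint is OpenAI-compatible, so the OpenAI Python SDK works as the client.
client = OpenAI(base_url="https://inference.provocative.earth/v1", api_key="pk-prov-YOUR-KEY")
response = client.chat.completions.create(
    model="llama-3.1-70b-instruct:my-finetuned-v3",
    messages=[{"role": "user", "content": "hello"}],
)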

The router strips the adapter suffix for model pool lookup and forwards the full model:adapter string to the worker. vLLM hot-loads the adapter if it's not already in memory.
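The suffix handling described above can be sketched in a few lines. This only illustrates the split and is not the router's actual code: the function name split_model_ref is hypothetical, and the sketch assumes the first colon separates the base model from the adapter name.

```python
from typing import Optional, Tuple


def split_model_ref(model: str) -> Tuple[str, Optional[str]]:
    """Split a 'base:adapter' model string into (base_model, adapter_name).

    Hypothetical helper: assumes the first colon is the separator and that
    a string without a colon refers to the base model alone.
    """
    base, sep, adapter = model.partition(":")
    if not base:
        raise ValueError(f"missing base model in {model!r}")
    if sep and not adapter:
        raise ValueError(f"empty adapter name in {model!r}")
    return base, (adapter if sep else None)


# The base model drives pool lookup; the full string is still what reaches the worker.
full_ref = "llama-3.1-70b-instruct:my-finetuned-v3"
base, adapter = split_model_ref(full_ref)
print(base, adapter)
```

A request for the plain base model (no colon) returns None for the adapter, so the same lookup path can serve both cases.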

List adapters

curl https://inference.provocative.earth/v1/adapters \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"

Delete an adapter

curl -X DELETE https://inference.provocative.earth/v1/adapters/adp_abc123 \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"

Limits

Tier        Max adapters   Max adapter size
Shared      20             500 MB
Reserved    200            500 MB
Dedicated   200            500 MB

Technical details

  • Adapters are served using vLLM's native --enable-lora support.
  • Each worker can hold up to 8 adapters in memory simultaneously (--max-loras=8).
  • Max LoRA rank: 64 (--max-lora-rank=64).
  • Hot-load time: <2 seconds for a typical adapter.
  • Adapters are scoped to the tenant that uploaded them; there is no cross-tenant access.
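Because the rank and size limits above are fixed, it can save a failed upload to check an adapter locally first. The following is a rough sketch under stated assumptions: it expects a PEFT-style directory (adapter_config.json with the rank stored under "r", next to adapter_model.safetensors), it treats "500 MB" as decimal megabytes, and the name check_adapter is hypothetical. The server-side validation may differ.

```python
import json
import os

MAX_LORA_RANK = 64                 # from --max-lora-rank=64
MAX_ADAPTER_BYTES = 500 * 1000**2  # assumption: "500 MB" means decimal megabytes


def check_adapter(adapter_dir: str) -> list:
    """Return a list of problems found; an empty list means the checks passed.

    Assumes a PEFT-style layout: adapter_config.json (rank under "r")
    next to adapter_model.safetensors.
    """
    problems = []

    with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
        rank = json.load(f).get("r")
    if not isinstance(rank, int):
        problems.append("adapter_config.json has no integer 'r' field")
    elif rank > MAX_LORA_RANK:
        problems.append(f"rank {rank} exceeds {MAX_LORA_RANK}")

    weights = os.path.join(adapter_dir, "adapter_model.safetensors")
    size = os.path.getsize(weights)
    if size > MAX_ADAPTER_BYTES:
        problems.append(f"adapter is {size} bytes, over {MAX_ADAPTER_BYTES}")

    return problems
```

Calling check_adapter("./my-lora") on a local adapter directory returns an empty list when both limits are met, or a human-readable message for each limit that is exceeded.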