LoRA Adapters
Serve your fine-tuned models on provocapi without managing infrastructure. Upload a LoRA adapter, and we hot-load it onto the base model's workers — no restart, no downtime.
How it works
- You fine-tune a LoRA adapter against a supported base model (e.g., Llama 3.1 70B) using your own training pipeline.
- You upload the adapter to provocapi via the /v1/adapters endpoint.
- You reference it in inference requests as model:adapter-name (e.g., llama-3.1-70b-instruct:my-finetuned-v3).
- We hot-load it onto the vLLM workers serving the base model in <2 seconds.
Supported base models
LoRA adapters are supported on all chat models in the catalog. Embedding models do not support LoRA.
Upload an adapter
From a HuggingFace repo
curl -X POST https://inference.provocative.earth/v1/adapters \
-H "Authorization: Bearer pk-prov-YOUR-KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "my-finetuned-v3",
"base_model": "llama-3.1-70b-instruct",
"repo": "my-org/llama-lora-v3",
"revision": "main"
}'
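The same request can be assembled from Python. The following is a minimal sketch using only the standard library, not an official client: the helper names build_adapter_request and register_adapter are hypothetical, and since the response schema is not described here, the body is returned as raw text rather than parsed into assumed fields.

```python
import json
import urllib.request

API_BASE = "https://inference.provocative.earth/v1"


def build_adapter_request(api_key, name, base_model, repo, revision="main"):
    """Build (but do not send) the POST /v1/adapters request for a HuggingFace repo.

    Hypothetical helper: it mirrors the fields of the curl example above.
    """
    payload = {
        "name": name,
        "base_model": base_model,
        "repo": repo,
        "revision": revision,
    }
    return urllib.request.Request(
        f"{API_BASE}/adapters",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def register_adapter(api_key, name, base_model, repo, revision="main"):
    """Send the request and return the raw response body as text."""
    req = build_adapter_request(api_key, name, base_model, repo, revision)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling register_adapter("pk-prov-YOUR-KEY", "my-finetuned-v3", "llama-3.1-70b-instruct", "my-org/llama-lora-v3") would issue the same call as the curl command. Keeping request construction separate from sending makes the payload easy to inspect without a live key.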
Direct file upload
curl -X POST https://inference.provocative.earth/v1/adapters/upload \
-H "Authorization: Bearer pk-prov-YOUR-KEY" \
-F "name=my-finetuned-v3" \
-F "base_model=llama-3.1-70b-instruct" \
-F "file=@adapter_model.safetensors"
Use an adapter in inference
Pass the adapter name after a colon in the model field:
response = client.chat.completions.create(
model="llama-3.1-70b-instruct:my-finetuned-v3",
messages=[{"role": "user", "content": "hello"}],
)
The router strips the adapter suffix for model pool lookup and forwards the full model:adapter string to the worker. vLLM hot-loads the adapter if it's not already in memory.
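To make the suffix handling concrete, here is a rough illustration of how a model field could be split into a base model and an optional adapter name. It is not the router's actual code: ModelRef and split_model_field are hypothetical names, and splitting at the first colon is an assumption, since the naming rules for adapters are not stated here.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    base_model: str          # used for model pool lookup
    adapter: Optional[str]   # None when no adapter suffix is present


def split_model_field(model: str) -> ModelRef:
    """Split 'base:adapter' at the first colon (assumed convention)."""
    base, sep, adapter = model.partition(":")
    if not base:
        raise ValueError(f"missing base model in {model!r}")
    if sep and not adapter:
        raise ValueError(f"empty adapter name in {model!r}")
    return ModelRef(base, adapter if sep else None)


ref = split_model_field("llama-3.1-70b-instruct:my-finetuned-v3")
print(ref.base_model)  # llama-3.1-70b-instruct
print(ref.adapter)     # my-finetuned-v3
```

A plain model name such as llama-3.1-70b-instruct yields an adapter of None, so the same function covers requests with and without an adapter.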
List adapters
curl https://inference.provocative.earth/v1/adapters \
-H "Authorization: Bearer pk-prov-YOUR-KEY"
Delete an adapter
curl -X DELETE https://inference.provocative.earth/v1/adapters/adp_abc123 \
-H "Authorization: Bearer pk-prov-YOUR-KEY"
Limits
| Tier | Max adapters | Max adapter size |
|---|---|---|
| Shared | 20 | 500 MB |
| Reserved | 200 | 500 MB |
| Dedicated | 200 | 500 MB |
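A client-side pre-check against these limits can catch an oversized upload before any bytes are sent. The sketch below encodes the table as constants. It makes two assumptions: that 1 MB means 1,000,000 bytes (the table does not say whether MB is decimal or binary), and that the adapter count is per tenant. The function name check_upload is hypothetical.

```python
# Assumption: 1 MB = 1,000,000 bytes; the limits table does not specify.
MAX_ADAPTER_BYTES = 500 * 1000 * 1000

# Max adapters per tier, copied from the table above.
MAX_ADAPTERS = {"shared": 20, "reserved": 200, "dedicated": 200}


def check_upload(size_bytes: int, tier: str, current_count: int) -> list:
    """Return a list of problems; an empty list means the upload fits the limits."""
    tier_key = tier.lower()
    if tier_key not in MAX_ADAPTERS:
        raise ValueError(f"unknown tier: {tier!r}")
    problems = []
    if size_bytes > MAX_ADAPTER_BYTES:
        problems.append(
            f"adapter is {size_bytes} bytes, limit is {MAX_ADAPTER_BYTES}"
        )
    if current_count >= MAX_ADAPTERS[tier_key]:
        problems.append(
            f"tier {tier_key} already holds {current_count} adapters, "
            f"limit is {MAX_ADAPTERS[tier_key]}"
        )
    return problems
```

In practice size_bytes would come from something like os.path.getsize("adapter_model.safetensors"). The server remains the authority on limits, so this check only saves a round trip.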
Technical details
- Adapters are served using vLLM's native --enable-lora support.
- Each worker can hold up to 8 adapters in memory simultaneously (--max-loras=8).
- Max LoRA rank: 64 (--max-lora-rank=64).
- Hot-load time: <2 seconds for a typical adapter.
- Adapters are scoped to the tenant that uploaded them — no cross-tenant access.
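Because the workers cap the LoRA rank at 64, it can be worth checking an adapter's rank before uploading. The sketch below assumes the adapter was saved with Hugging Face PEFT, which writes an adapter_config.json whose r field holds the rank. Adapters produced by other tooling may store it differently, and rank_within_limit is a hypothetical helper name.

```python
import json

MAX_LORA_RANK = 64  # matches --max-lora-rank=64 in the list above


def load_adapter_config(path: str) -> dict:
    """Read a PEFT-style adapter_config.json from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def rank_within_limit(config: dict, max_rank: int = MAX_LORA_RANK) -> bool:
    """Return True if the adapter's rank ('r') does not exceed max_rank."""
    rank = config.get("r")
    if not isinstance(rank, int):
        raise ValueError("adapter config has no integer 'r' (LoRA rank) field")
    return rank <= max_rank


print(rank_within_limit({"peft_type": "LORA", "r": 16}))   # True
print(rank_within_limit({"peft_type": "LORA", "r": 128}))  # False
```

For a local adapter directory, rank_within_limit(load_adapter_config("adapter_config.json")) performs the same check against the saved config.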