API Reference
Base URL: https://inference.provocative.earth/v1
All requests require a Bearer token in the Authorization header: Authorization: Bearer pk-prov-YOUR-KEY
The full OpenAPI spec is available at /openapi.json.
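As a minimal sketch of the authentication scheme above, the following Python builds an authenticated request with the standard library only. The helper name `build_request` is illustrative, and the sketch assumes endpoint paths are appended to the base URL without repeating the `/v1` prefix.

```python
import json
import urllib.request

BASE_URL = "https://inference.provocative.earth/v1"


def build_request(path, api_key, body=None):
    """Return a urllib Request for `path` under BASE_URL with the Bearer header set.

    Assumption: `path` is relative to BASE_URL (e.g. "/models"), so the /v1
    prefix is not repeated. A JSON body switches the method to POST.
    """
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Authorization": f"Bearer {api_key}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if body is None else "POST"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# Building the request performs no network I/O.
req = build_request("/models", "pk-prov-YOUR-KEY")
```

Passing `req` to `urllib.request.urlopen` would send it; substitute a real key for the placeholder first.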
Endpoints
Chat Completions
POST /v1/chat/completions
Creates a chat completion. Supports streaming via Server-Sent Events (SSE).
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID (e.g., llama-3.1-70b-instruct). Append :adapter-name for LoRA. |
| messages | array | Yes | Array of {role, content} objects. Roles: system, user, assistant, tool. |
| stream | boolean | No | Default false. Set true for Server-Sent Events streaming. |
| max_tokens | integer | No | Max output tokens. Shared tier cap: 8192. |
| temperature | float | No | Sampling temperature, 0-2. Default 1.0. |
| top_p | float | No | Nucleus sampling. Default 1.0. |
| tools | array | No | Tool/function definitions for function calling. |
| response_format | object | No | {"type": "json_object"} for JSON-only output. |
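The fields above can be assembled into a request body along these lines. This is a sketch, not a required client: `build_chat_payload` and the adapter name `my-adapter` are hypothetical, and the client-side checks simply mirror the roles and temperature range listed in the table.

```python
def build_chat_payload(model, messages, adapter=None, stream=False,
                       max_tokens=None, temperature=None, top_p=None):
    """Assemble a body for POST /v1/chat/completions, omitting unset optional fields."""
    allowed_roles = {"system", "user", "assistant", "tool"}
    for message in messages:
        if message.get("role") not in allowed_roles:
            raise ValueError(f"unknown role: {message.get('role')!r}")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    payload = {
        # A LoRA adapter is selected by appending ":adapter-name" to the model ID.
        "model": f"{model}:{adapter}" if adapter else model,
        "messages": messages,
    }
    optional = {"max_tokens": max_tokens, "temperature": temperature, "top_p": top_p}
    payload.update({key: value for key, value in optional.items() if value is not None})
    if stream:
        payload["stream"] = True  # the server default is false, so only send when enabled
    return payload


payload = build_chat_payload(
    "llama-3.1-70b-instruct",
    [{"role": "user", "content": "Hello"}],
    adapter="my-adapter",  # hypothetical adapter name
    max_tokens=256,
)
```

Leaving unset fields out of the payload lets the server apply its documented defaults.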
Response (non-streaming):
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama-3.1-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}
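Reading the reply out of this response shape might look like the following sketch. The helper name `extract_reply` is illustrative, and it only looks at the first choice.

```python
import json


def extract_reply(response):
    """Pull the assistant text, finish reason, and token total from a parsed response."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# The sample body shown above, parsed from JSON.
sample = json.loads("""
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama-3.1-70b-instruct",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
}
""")
reply = extract_reply(sample)
```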
Response (streaming): Server-Sent Events where each data: line is a chunk:
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"role":"assistant"},"index":0}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"!"},"index":0,"finish_reason":"stop"}],"usage":{...}}
data: [DONE]
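A client can rebuild the full message by concatenating the content deltas until the [DONE] sentinel arrives. The sketch below works on any iterable of text lines; `collect_stream` is an illustrative name, and the sample omits the usage object because its contents are elided above.

```python
import json


def collect_stream(lines):
    """Concatenate content deltas from an iterable of SSE lines until [DONE]."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # The first chunk carries only the role, so content may be absent.
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


sample_lines = [
    'data: {"id":"chatcmpl-abc123","choices":[{"delta":{"role":"assistant"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Hello"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"!"},"index":0,"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_stream(sample_lines)
```

With a live connection, `lines` would be the decoded lines of the HTTP response body.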
Completions
POST /v1/completions
Legacy text completion. Accepts the same parameters as chat completions, but takes prompt instead of messages.
Embeddings
POST /v1/embeddings
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Embedding model ID (e.g., bge-m3). |
| input | string or array | Yes | Text(s) to embed. |
Response:
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.123, -0.456, ...]}
  ],
  "model": "bge-m3",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
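When input is an array, each returned item carries an index. The sketch below sorts on it before pairing vectors with the original texts, on the assumption that index refers to the position in input. The helper name and the short sample vectors are made up for illustration.

```python
def vectors_by_index(response):
    """Return embedding vectors ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


texts = ["first text", "second text"]
request_body = {"model": "bge-m3", "input": texts}

# Illustrative response with items deliberately out of order.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.5, 0.25]},
        {"object": "embedding", "index": 0, "embedding": [0.125, -0.5]},
    ],
    "model": "bge-m3",
}
pairs = list(zip(texts, vectors_by_index(sample_response)))
```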
Models
GET /v1/models
List all models available to your tenant.
Usage
GET /v1/usage?start_date=2025-01-01&end_date=2025-01-31
Returns per-model, per-day token counts and latency percentiles. Requires the ClickHouse backend.
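The query string can be built from date objects rather than hand-formatted strings. A minimal sketch, with `usage_url` as an illustrative helper and the start-before-end check as a client-side assumption:

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://inference.provocative.earth/v1"


def usage_url(start, end):
    """Build the usage URL for a date range given as datetime.date objects."""
    if start > end:
        raise ValueError("start date must not be after end date")
    query = urlencode({
        "start_date": start.isoformat(),  # formatted as YYYY-MM-DD
        "end_date": end.isoformat(),
    })
    return f"{BASE_URL}/usage?{query}"


url = usage_url(date(2025, 1, 1), date(2025, 1, 31))
```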
Adapters
POST /v1/adapters — create from HF repo
POST /v1/adapters/upload — upload weights directly
GET /v1/adapters — list
GET /v1/adapters/{id} — get details
DELETE /v1/adapters/{id} — delete
See the LoRA Adapters guide.
Batch
POST /v1/batch — submit JSONL file
GET /v1/batch — list jobs
GET /v1/batch/{id} — get status
GET /v1/batch/{id}/output — download results
POST /v1/batch/{id}/cancel — cancel job
Error format
All errors follow the OpenAI error shape:
{
  "error": {
    "message": "Human-readable description",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
| HTTP Status | Meaning |
|---|---|
| 400 | Invalid request (bad JSON, missing field) |
| 401 | Invalid or missing API key |
| 404 | Model not found |
| 413 | Request too large (batch file, adapter) |
| 429 | Rate limit exceeded. Check Retry-After header. |
| 502 | Upstream worker error |
| 503 | No healthy workers for the requested model |
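One way a client might turn an error response into a retry decision is sketched below. Treating 429, 502, and 503 as retryable is an assumption of this sketch, not something the API specifies, and `describe_error` is a hypothetical helper. It expects headers as a plain dict with the exact name `Retry-After`.

```python
import json

# Assumption: rate limits and worker-side failures are worth retrying.
RETRYABLE_STATUSES = {429, 502, 503}


def describe_error(status, body, headers=None):
    """Summarize an error response and suggest whether and when to retry."""
    headers = headers or {}
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body was empty or not JSON
    error = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(error, dict):
        error = {}

    retry_after = None
    if status == 429 and "Retry-After" in headers:
        retry_after = float(headers["Retry-After"])  # seconds until the limit resets

    return {
        "message": error.get("message"),
        "code": error.get("code"),
        "retryable": status in RETRYABLE_STATUSES,
        "retry_after": retry_after,
    }


auth_failure = describe_error(
    401,
    '{"error": {"message": "Human-readable description", '
    '"type": "invalid_request_error", "code": "invalid_api_key"}}',
)
rate_limited = describe_error(429, "", {"Retry-After": "3"})
```

A caller would sleep for `retry_after` seconds when it is set, and fall back to its own backoff for 502 and 503.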
Response headers
| Header | Description |
|---|---|
| X-Request-Id | Unique request ID |
| X-Provocapi-Worker | Worker that served the request |
| X-Provocapi-Queue-Ms | Queue wait time in milliseconds |
| X-Provocapi-Model | Resolved model ID |
| Retry-After | Seconds until rate limit resets (on 429) |
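These headers are useful to log alongside each request. The sketch below collects them case-insensitively from any header mapping; the function name and the sample values are made up for illustration.

```python
def diagnostics(headers):
    """Collect the diagnostic response headers into a dict, ignoring header-name case."""
    lowered = {name.lower(): value for name, value in headers.items()}
    queue_ms = lowered.get("x-provocapi-queue-ms")
    return {
        "request_id": lowered.get("x-request-id"),
        "worker": lowered.get("x-provocapi-worker"),
        "queue_ms": float(queue_ms) if queue_ms is not None else None,
        "model": lowered.get("x-provocapi-model"),
    }


# Sample header values are invented for this example.
info = diagnostics({
    "X-Request-Id": "req-123",
    "x-provocapi-worker": "worker-7",
    "X-Provocapi-Queue-Ms": "12",
    "X-Provocapi-Model": "llama-3.1-70b-instruct",
})
```

Including `request_id` in bug reports makes a single request easier to trace.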