Batch Inference
Process large volumes of requests asynchronously at 50% of the standard per-token rate. Batch jobs are designed for eval harnesses, data pipelines, and bulk processing where you don't need real-time responses.
How it works
- Prepare a JSONL file where each line is a request.
- Upload it to POST /v1/batch.
- Poll GET /v1/batch/{id} for status.
- Download results from GET /v1/batch/{id}/output when complete.
SLA: batches complete within 24 hours.
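The first step above, preparing the JSONL input, can also be done programmatically. The following is a minimal Python sketch, assuming each line carries a custom_id and a body holding a chat-completions payload, as in the example under "Submit a batch". The helper names build_batch_lines and write_batch_file, and the eval-N ID scheme, are illustrative and not part of the API.

```python
import json


def build_batch_lines(prompts, model, max_tokens=10, id_prefix="eval"):
    """Return one JSON string per prompt, ready to be written as JSONL.

    build_batch_lines and id_prefix are illustrative names, not API features.
    """
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"{id_prefix}-{index}",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps escapes newlines inside strings, so each request
        # stays on a single physical line.
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path, lines):
    """Write the requests to disk, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Calling write_batch_file("batch.jsonl", build_batch_lines(prompts, "llama-3.1-70b-instruct")) with the three prompts from the submit example should produce an equivalent input file.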
Submit a batch
# Create input file
cat > batch.jsonl << 'EOF'
{"custom_id": "eval-1", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 10}}
{"custom_id": "eval-2", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Capital of France?"}], "max_tokens": 10}}
{"custom_id": "eval-3", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Translate 'hello' to Spanish."}], "max_tokens": 10}}
EOF
# Submit
curl -X POST https://inference.provocative.earth/v1/batch \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -F "file=@batch.jsonl"
Response:
{
  "id": "batch_abc123",
  "status": "validating",
  "endpoint": "/v1/chat/completions",
  "total_requests": 3,
  "completed_count": 0,
  "failed_count": 0
}
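The counters in this response can be turned into a progress figure. Here is a small sketch, assuming completed_count and failed_count never count the same request twice (this guide does not say either way). The name batch_progress is illustrative.

```python
def batch_progress(batch):
    """Return the fraction of requests that have finished (succeeded or failed).

    Assumes completed_count and failed_count are disjoint; that is an
    assumption of this sketch, not a documented guarantee.
    """
    total = batch["total_requests"]
    if total == 0:
        return 0.0
    finished = batch["completed_count"] + batch["failed_count"]
    return finished / total
```

For the response shown above, which has 3 requests and none finished, the function returns 0.0.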
Check status
curl https://inference.provocative.earth/v1/batch/batch_abc123 \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"
Status progression: validating → in_progress → completed (or failed).
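The status check is usually wrapped in a polling loop. The sketch below is one way to do that in Python with only the standard library. It treats completed and failed as the only terminal states because those are the only end states listed here; a cancelled batch may report a status this guide does not name. The function names, the 30-second interval, and the 24-hour timeout are assumptions of this sketch, not service requirements.

```python
import json
import time
import urllib.request

# End states taken from the status progression described above.
TERMINAL_STATUSES = {"completed", "failed"}


def make_http_fetcher(api_key, base_url="https://inference.provocative.earth"):
    """Build a fetch_status(batch_id) function backed by GET /v1/batch/{id}."""
    def fetch_status(batch_id):
        request = urllib.request.Request(
            f"{base_url}/v1/batch/{batch_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_status


def wait_for_batch(fetch_status, batch_id, interval_s=30.0, timeout_s=24 * 3600,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the batch reaches a terminal status; return the last payload.

    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        batch = fetch_status(batch_id)
        if batch["status"] in TERMINAL_STATUSES:
            return batch
        if clock() >= deadline:
            raise TimeoutError(
                f"batch {batch_id} still {batch['status']} after {timeout_s}s"
            )
        sleep(interval_s)
```

A typical call would be wait_for_batch(make_http_fetcher("pk-prov-YOUR-KEY"), "batch_abc123"). Passing the fetcher in as an argument keeps the loop independent of any particular HTTP client.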
Download results
curl https://inference.provocative.earth/v1/batch/batch_abc123/output \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -o results.jsonl
Each line of the output file is a JSON object of the following form (pretty-printed here for readability):
{
  "id": "resp_xyz",
  "custom_id": "eval-1",
  "response": {
    "status_code": 200,
    "body": { "id": "chatcmpl-...", "choices": [...], "usage": {...} }
  }
}
The custom_id lets you correlate results with your input rows.
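The correlation can be sketched as a join keyed on custom_id. This Python example does not rely on the order of the output lines, which this guide does not specify. The function names are illustrative, and the only fields it assumes are the ones shown above (custom_id and body on input lines, custom_id on output lines).

```python
import json


def index_by_custom_id(jsonl_text):
    """Map custom_id -> decoded JSON object for every non-empty line."""
    records = {}
    # Split on "\n" only: JSONL uses newline as the record separator.
    for line in jsonl_text.split("\n"):
        if line.strip():
            record = json.loads(line)
            records[record["custom_id"]] = record
    return records


def join_results(input_text, output_text):
    """Pair each input request with its result line.

    Returns (custom_id, request_body, result_or_None) tuples in input order;
    the result is None when the output has no line for that custom_id.
    """
    outputs = index_by_custom_id(output_text)
    pairs = []
    for custom_id, request in index_by_custom_id(input_text).items():
        pairs.append((custom_id, request["body"], outputs.get(custom_id)))
    return pairs
```

Reading batch.jsonl and results.jsonl as text and passing both to join_results gives one tuple per input row, which makes rows without a result easy to spot.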
Cancel a batch
curl -X POST https://inference.provocative.earth/v1/batch/batch_abc123/cancel \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"
Limits
- Max file size: 100 MB
- Max requests per batch: 50,000
- Max concurrent batch jobs per tenant: 10 (shared), unlimited (reserved/dedicated)
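The file-size and request-count limits can be checked locally before uploading. The sketch below assumes "100 MB" means 100,000,000 bytes, the more conservative reading, and counts each non-empty line as one request. The constants and function names are illustrative; the service performs its own validation regardless.

```python
import os

# Assumption: "100 MB" is read as decimal megabytes (the stricter reading).
MAX_FILE_BYTES = 100 * 1000 * 1000
MAX_REQUESTS = 50_000


def check_batch_limits(size_bytes, request_count):
    """Return a list of limit violations (empty when the batch fits)."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if request_count > MAX_REQUESTS:
        problems.append(f"{request_count} requests, limit is {MAX_REQUESTS}")
    return problems


def check_batch_file(path):
    """Measure a JSONL file on disk and check it against the limits."""
    size_bytes = os.path.getsize(path)
    with open(path, "rb") as handle:
        # Each non-empty line is counted as one request.
        request_count = sum(1 for line in handle if line.strip())
    return check_batch_limits(size_bytes, request_count)
```

An input that exceeds either limit would need to be split into several batches, keeping in mind the cap on concurrent jobs for shared tenants.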
Pricing
Batch inference is billed at 50% of the standard per-token rate for the model used. See Pricing.
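The discount is simple to work out. The sketch below assumes a single per-million-token rate for the model; actual rates, and any split between input and output tokens, come from the Pricing page and are not reproduced here. The function name and the example rate are hypothetical.

```python
BATCH_RATE_FACTOR = 0.5  # batch is billed at 50% of the standard per-token rate


def batch_cost(total_tokens, standard_rate_per_million):
    """Estimate the batch cost for a token count at a given standard rate.

    standard_rate_per_million is supplied by the caller; this sketch does
    not know any real prices.
    """
    standard_cost = total_tokens / 1_000_000 * standard_rate_per_million
    return standard_cost * BATCH_RATE_FACTOR
```

With a hypothetical standard rate of 1.00 per million tokens, 2,000,000 tokens would cost 2.00 at the standard rate and batch_cost(2_000_000, 1.00) returns 1.0.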