Batch Inference

Process large volumes of requests asynchronously at 50% of the standard per-token rate. Batch jobs are designed for eval harnesses, data pipelines, and bulk processing where you don't need real-time responses.

How it works

  1. Prepare a JSONL file where each line is a request.
  2. Upload it with POST /v1/batch.
  3. Poll GET /v1/batch/{id} for status.
  4. Download results from GET /v1/batch/{id}/output when complete.

SLA: batches complete within 24 hours.

Submit a batch

# Create input file
cat > batch.jsonl << 'EOF'
{"custom_id": "eval-1", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 10}}
{"custom_id": "eval-2", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Capital of France?"}], "max_tokens": 10}}
{"custom_id": "eval-3", "body": {"model": "llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Translate 'hello' to Spanish."}], "max_tokens": 10}}
EOF

# Submit
curl -X POST https://inference.provocative.earth/v1/batch \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -F "file=@batch.jsonl"

Response:

{
  "id": "batch_abc123",
  "status": "validating",
  "endpoint": "/v1/chat/completions",
  "total_requests": 3,
  "completed_count": 0,
  "failed_count": 0
}
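
For larger inputs, the JSONL file is usually easier to generate than to write by hand. The following is a minimal sketch that produces lines in the same shape as batch.jsonl above. The helper names (`build_batch_lines`, `write_batch_file`) and the `prefix` parameter are hypothetical, not part of the API.

```python
import json


def build_batch_lines(prompts, model="llama-3.1-70b-instruct",
                      max_tokens=10, prefix="eval"):
    """Return one JSON string per request, mirroring the batch.jsonl format."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"{prefix}-{i}",  # hypothetical naming scheme
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps escapes embedded newlines, so each request stays on one line.
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


def write_batch_file(path, prompts, **kwargs):
    """Write the requests to `path` and return how many were written."""
    lines = build_batch_lines(prompts, **kwargs)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `write_batch_file("batch.jsonl", ["What is 2+2?", "Capital of France?"])` writes a two-line file that can be submitted with the curl command above.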

Check status

curl https://inference.provocative.earth/v1/batch/batch_abc123 \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"

Status progression: validating → in_progress → completed (or failed).
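
A polling loop for the status endpoint might look like the sketch below. It treats validating and in_progress as the only non-terminal statuses and returns as soon as any other status appears. The polling interval, the timeout (set here to match the 24-hour SLA), and the function names are assumptions, not documented behavior.

```python
import json
import time
import urllib.request

BASE_URL = "https://inference.provocative.earth/v1/batch"
PENDING = {"validating", "in_progress"}  # statuses that mean "keep waiting"


def fetch_status(batch_id, api_key):
    """GET /v1/batch/{id} and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/{batch_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_batch(batch_id, api_key, fetch=fetch_status, interval=30.0,
                   timeout=24 * 3600, sleep=time.sleep):
    """Poll until the status leaves PENDING; raise TimeoutError after `timeout` seconds.

    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    waited = 0.0
    while True:
        batch = fetch(batch_id, api_key)
        if batch["status"] not in PENDING:
            return batch  # completed, failed, or any other terminal status
        if waited >= timeout:
            raise TimeoutError(f"batch {batch_id} still {batch['status']}")
        sleep(interval)
        waited += interval
```

The caller should inspect the returned `status` field, since the function returns on failed as well as completed.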

Download results

curl https://inference.provocative.earth/v1/batch/batch_abc123/output \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -o results.jsonl

Each line of the output is a JSON object with the following shape (pretty-printed here for readability):

{
  "id": "resp_xyz",
  "custom_id": "eval-1",
  "response": {
    "status_code": 200,
    "body": { "id": "chatcmpl-...", "choices": [...], "usage": {...} }
  }
}

The custom_id lets you correlate results with your input rows.
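
As an illustration, results can be indexed by custom_id so that lookups do not depend on the order of the output lines. This sketch assumes that unsuccessful requests appear in the same file with a non-200 status_code, which the format above suggests but does not state. `index_results` is a hypothetical helper name.

```python
import json


def index_results(lines):
    """Split output rows into {custom_id: body} for successes and
    {custom_id: response} for everything else."""
    ok, failed = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        response = record.get("response") or {}
        if response.get("status_code") == 200:
            ok[record["custom_id"]] = response["body"]
        else:
            failed[record["custom_id"]] = response
    return ok, failed
```

With the downloaded file, this could be used as `ok, failed = index_results(open("results.jsonl", encoding="utf-8"))`, after which `ok["eval-1"]` holds the response body for the input row with that custom_id.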

Cancel a batch

curl -X POST https://inference.provocative.earth/v1/batch/batch_abc123/cancel \
  -H "Authorization: Bearer pk-prov-YOUR-KEY"

Limits

  • Max file size: 100 MB
  • Max requests per batch: 50,000
  • Max concurrent batch jobs per tenant: 10 (shared), unlimited (reserved/dedicated)
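
Workloads that exceed the per-batch limits need to be split across several files before upload. A minimal sketch follows, assuming "100 MB" means 100,000,000 bytes (the more conservative reading) and that each request occupies one line plus a newline. `split_into_batches` is a hypothetical helper.

```python
MAX_REQUESTS = 50_000
MAX_BYTES = 100 * 1000 * 1000  # assumption: 100 MB read as decimal megabytes


def split_into_batches(lines, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Group JSONL lines into chunks that respect both per-batch limits."""
    batches, current, current_bytes = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if size > max_bytes:
            raise ValueError("a single request exceeds the file size limit")
        # Start a new chunk when adding this line would break either limit.
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

On the shared tier, a split that produces more than 10 files would have to be submitted in waves to stay within the concurrent job limit.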

Pricing

Batch inference is billed at 50% of the standard per-token rate for the model used. See Pricing.
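
The discount is straightforward to apply when estimating a job's cost. The sketch below uses a single placeholder rate quoted per million tokens. The rate value is hypothetical, and models that price input and output tokens differently would need the calculation applied to each separately.

```python
BATCH_DISCOUNT = 0.5  # batch is billed at 50% of the standard rate


def batch_cost(total_tokens, standard_rate_per_million):
    """Estimated batch cost, in the same currency unit as the rate."""
    return total_tokens / 1_000_000 * standard_rate_per_million * BATCH_DISCOUNT
```

For example, `batch_cost(2_000_000, 1.00)` returns `1.0`: two million tokens at a hypothetical standard rate of 1.00 per million would cost 2.00 in real time and 1.00 as a batch.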