Pricing
Per-token pricing (shared tier)
Pay-as-you-go. No minimum commitment.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| llama-3.1-8b-instruct | $0.10 | $0.15 |
| mistral-small-3-24b | $0.20 | $0.30 |
| qwen-2.5-coder-32b | $0.30 | $0.45 |
| llama-3.1-70b-instruct | $0.55 | $0.75 |
| qwen-2.5-72b-instruct | $0.55 | $0.75 |
| deepseek-v3 | $0.50 | $1.20 |
| llama-3.1-405b-instruct | $2.50 | $3.50 |
| bge-m3 | $0.02 | n/a |
| e5-mistral-7b-instruct | $0.05 | n/a |
| nomic-embed-v1.5 | $0.02 | n/a |
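
As a rough illustration of how per-token billing adds up, the following Python sketch estimates the cost of a workload from the rates in the table above. The `RATES` dictionary and the `estimate_cost` helper are hypothetical names made up for this example and are not part of any provocapi SDK. Only a few models are included, and actual invoices may round differently.

```python
from decimal import Decimal

# Shared-tier rates in USD per 1M tokens, copied from the pricing table.
# Embedding models have no output rate, so it is None here.
RATES = {
    "llama-3.1-8b-instruct": (Decimal("0.10"), Decimal("0.15")),
    "llama-3.1-70b-instruct": (Decimal("0.55"), Decimal("0.75")),
    "deepseek-v3": (Decimal("0.50"), Decimal("1.20")),
    "bge-m3": (Decimal("0.02"), None),
}

PER_MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> Decimal:
    """Return the estimated USD cost of a workload at shared-tier rates."""
    input_rate, output_rate = RATES[model]
    cost = input_rate * input_tokens / PER_MILLION
    if output_rate is not None:
        cost += output_rate * output_tokens / PER_MILLION
    return cost


# Example: 2M input tokens and 500k output tokens on the 70B model.
print(estimate_cost("llama-3.1-70b-instruct", 2_000_000, 500_000))
```

`Decimal` is used instead of `float` so that small per-token amounts do not pick up binary rounding error when summed over many requests.
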
How we compare
| Model class (70B) | provocapi | Together AI | Fireworks | AWS Bedrock |
|---|---|---|---|---|
| Input / 1M tokens | $0.55 | $0.88 | $0.90 | $2.65 |
| Output / 1M tokens | $0.75 | $0.88 | $0.90 | $3.50 |
We undercut cloud inference by 30-50% because we own the GPUs: our marginal cost is electricity and amortization, not a cloud provider's markup.
Batch pricing
Batch inference (via POST /v1/batch) is billed at 50% of the standard per-token rate. SLA: 24-hour completion.
| Model | Batch input (per 1M) | Batch output (per 1M) |
|---|---|---|
| llama-3.1-70b-instruct | $0.28 | $0.38 |
| llama-3.1-8b-instruct | $0.05 | $0.08 |
| (all others) | 50% of standard | 50% of standard |
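
The listed batch rates are consistent with halving the standard rate and rounding half up to the nearest cent. The sketch below reproduces that arithmetic for the two models shown. The `batch_rate` helper is a hypothetical name. Whether billing applies the rounded rate or the exact 50% figure is an assumption here, not something this page states.

```python
from decimal import Decimal, ROUND_HALF_UP

# Batch is billed at 50% of the standard per-token rate.
BATCH_MULTIPLIER = Decimal("0.5")


def batch_rate(standard_rate: str) -> Decimal:
    """Halve a standard per-1M-token rate and round half up to whole cents."""
    half = Decimal(standard_rate) * BATCH_MULTIPLIER
    return half.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Standard (input, output) rates per 1M tokens from the shared-tier table.
for model, inp, out in [
    ("llama-3.1-70b-instruct", "0.55", "0.75"),
    ("llama-3.1-8b-instruct", "0.10", "0.15"),
]:
    print(model, batch_rate(inp), batch_rate(out))
```

Rounding is done with `Decimal.quantize` rather than the built-in `round`, because `round` on floats uses round-half-to-even and is affected by binary representation.
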
Reserved capacity
Lock in dedicated model replicas at a flat monthly rate. No per-token charges. Best for predictable, high-volume workloads.
| Commitment | Discount vs. on-demand |
|---|---|
| 1 month | 20% |
| 3 months | 35% |
| 12 months | 50% |
Example: Llama 70B reserved
One replica of Llama 3.1 70B on a dedicated H100 80GB (FP8):
| Term | Monthly cost | Breakeven vs. on-demand |
|---|---|---|
| 1 month | $4,500 | ~6B input + 6B output tokens/month |
| 3 months | $3,900/mo | ~5B input + 5B output tokens/month |
| 12 months | $3,000/mo | ~4B input + 4B output tokens/month |
Reserved capacity includes:
- Dedicated vLLM processes (no noisy neighbors)
- No rate limits
- Priority support
- Custom model requests
Dedicated nodes
Lease entire physical GPU nodes for full isolation:
| Node type | GPUs | Monthly rate |
|---|---|---|
| RTX 5090 node | 4x RTX 5090 32GB | Contact sales |
| RTX PRO 6000 node | 4x RTX PRO 6000 96GB | Contact sales |
| A100 node | 8x A100 80GB | Contact sales |
| H100 node | 8x H100 80GB | Contact sales |
Dedicated nodes include:
- Physical isolation (dedicated taints, namespace)
- Full admin access to deploy any model in supported architectures
- Custom LoRA adapter limits
- Direct Prometheus metrics export
Rate limits (shared tier defaults)
| Metric | Limit |
|---|---|
| Requests per minute (RPM) | 200 |
| Tokens per minute (TPM) | 500,000 |
| Max output tokens per request | 8,192 |
Limits are per API key. Contact us to increase.
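
A client can stay under these defaults by tracking its own usage before sending each request. The following is a minimal sketch of a client-side sliding-window limiter. The class name, the rolling 60-second window, and the idea of counting a request's estimated total tokens against TPM are all assumptions made for illustration. The server may count differently, for example with fixed one-minute windows.

```python
import time
from collections import deque

MAX_OUTPUT_TOKENS = 8_192  # shared-tier cap on output tokens per request


class SlidingWindowLimiter:
    """Track requests and tokens over a rolling window (60 seconds by default)."""

    def __init__(self, rpm=200, tpm=500_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each accepted request
        self.tokens_used = 0

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_used -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both limits, else False."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_used + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_used += tokens
        return True


def clamp_max_tokens(requested):
    """Keep a requested output length within the per-request cap."""
    return min(requested, MAX_OUTPUT_TOKENS)


limiter = SlidingWindowLimiter()
if limiter.try_acquire(tokens=1_200 + clamp_max_tokens(10_000)):
    print("ok to send")
else:
    print("back off and retry later")
```

When `try_acquire` returns `False`, the caller would typically sleep briefly and retry. The injectable `clock` parameter exists only to make the limiter easy to test.
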
Billing
- Metering: near-realtime (60-second lag). Visible in the dashboard at /usage.
- Invoicing: end-of-month, net-30.
- Currency: USD.