Pricing

Per-token pricing (shared tier)

Pay-as-you-go. No minimum commitment.

Model	Input (per 1M tokens)	Output (per 1M tokens)
`llama-3.1-8b-instruct`	$0.10	$0.15
`mistral-small-3-24b`	$0.20	$0.30
`qwen-2.5-coder-32b`	$0.30	$0.45
`llama-3.1-70b-instruct`	$0.55	$0.75
`qwen-2.5-72b-instruct`	$0.55	$0.75
`deepseek-v3`	$0.50	$1.20
`llama-3.1-405b-instruct`	$2.50	$3.50
`bge-m3`	$0.02	—
`e5-mistral-7b-instruct`	$0.05	—
`nomic-embed-v1.5`	$0.02	—
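To make the table concrete, here is a minimal sketch of estimating a charge from token counts. The `RATES` dictionary and the `estimate_cost` helper are hypothetical names used only for illustration, and only a few models are copied in. How the billing system rounds partial cents is not stated here, so treat the result as an estimate.

```python
# Shared-tier rates in USD per 1M tokens, copied from the table above.
# Embedding models have no output rate, so it is stored as None.
RATES = {
    "llama-3.1-8b-instruct": (0.10, 0.15),
    "llama-3.1-70b-instruct": (0.55, 0.75),
    "deepseek-v3": (0.50, 1.20),
    "bge-m3": (0.02, None),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Return the estimated charge in USD for the given token counts."""
    input_rate, output_rate = RATES[model]
    cost = input_tokens / 1_000_000 * input_rate
    if output_rate is not None:
        cost += output_tokens / 1_000_000 * output_rate
    return cost


# 2M input tokens and 500k output tokens on the 70B model:
# 2 * 0.55 + 0.5 * 0.75 = 1.475 USD
print(f"${estimate_cost('llama-3.1-70b-instruct', 2_000_000, 500_000):.3f}")
```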

How we compare

Model class (70B)	provocapi	Together AI	Fireworks	AWS Bedrock
Input / 1M tokens	$0.55	$0.88	$0.90	$2.65
Output / 1M tokens	$0.75	$0.88	$0.90	$3.50

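For the 70B class, our input price is about 38 to 39% below Together AI and Fireworks, and our output price is about 15 to 17% below. Against AWS Bedrock, both are about 79% lower. We can price this way because we own the GPUs: our marginal cost is electricity and amortization, not an AWS markup.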

Batch pricing

Batch inference (via POST /v1/batch) is billed at 50% of the standard per-token rate. SLA: 24-hour completion.

Model	Batch input (per 1M)	Batch output (per 1M)
`llama-3.1-70b-instruct`	$0.28	$0.38
`llama-3.1-8b-instruct`	$0.05	$0.08
(all others)	50% of standard	50% of standard
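The 50% rule can be sketched as a small calculation. The function name `batch_cost` is hypothetical. The table above shows batch rates rounded to the nearest cent, and it is not stated whether billing uses the rounded or the exact half rate; this sketch assumes the exact half rate.

```python
BATCH_MULTIPLIER = 0.5  # batch is billed at 50% of the standard per-token rate


def batch_cost(input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Estimate the USD cost of a batch job.

    input_rate and output_rate are the standard (non-batch) rates
    in USD per 1M tokens.
    """
    standard = (input_tokens / 1_000_000 * input_rate
                + output_tokens / 1_000_000 * output_rate)
    return standard * BATCH_MULTIPLIER


# 100M input and 20M output tokens on llama-3.1-70b-instruct:
# standard cost is 100 * 0.55 + 20 * 0.75 = 70 USD, so batch is 35 USD.
print(f"${batch_cost(100_000_000, 20_000_000, 0.55, 0.75):.2f}")
```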

Reserved capacity

Lock in dedicated model replicas at a flat monthly rate. No per-token charges. Best for predictable, high-volume workloads.

Commitment	Discount vs. on-demand
1 month	20%
3 months	35%
12 months	50%

Example: Llama 70B reserved

One replica of Llama 3.1 70B on a dedicated H100 80GB (FP8):

Term	Monthly cost	Breakeven vs. on-demand
1 month	$4,500/mo	~3.5B input + 3.5B output tokens/month
3 months	$3,900/mo	~3.0B input + 3.0B output tokens/month
12 months	$3,000/mo	~2.3B input + 2.3B output tokens/month
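A breakeven point can be derived from the shared-tier rates for `llama-3.1-70b-instruct` ($0.55 input, $0.75 output per 1M tokens). The sketch below assumes equal input and output volume each month, which is a simplification; the helper name `breakeven_millions` is hypothetical.

```python
INPUT_RATE = 0.55   # USD per 1M input tokens (shared tier, 70B)
OUTPUT_RATE = 0.75  # USD per 1M output tokens (shared tier, 70B)


def breakeven_millions(monthly_cost: float,
                       input_rate: float = INPUT_RATE,
                       output_rate: float = OUTPUT_RATE) -> float:
    """Return N such that N million input tokens plus N million output
    tokens cost the same on the shared tier as the flat monthly fee."""
    return monthly_cost / (input_rate + output_rate)


for term, monthly in [("1 month", 4500), ("3 months", 3900), ("12 months", 3000)]:
    billions = breakeven_millions(monthly) / 1000
    print(f"{term}: about {billions:.1f}B input + {billions:.1f}B output tokens/month")
```

Workloads with a different input-to-output ratio shift the breakeven: output-heavy traffic reaches it sooner because output tokens cost more.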

Reserved capacity includes:
- Dedicated vLLM processes (no noisy neighbors)
- No rate limits
- Priority support
- Custom model requests

Dedicated nodes

Lease entire physical GPU nodes for full isolation:

Node type	GPUs	Monthly rate
RTX 5090 node	4x RTX 5090 32GB	Contact sales
RTX PRO 6000 node	4x RTX PRO 6000 96GB	Contact sales
A100 node	8x A100 80GB	Contact sales
H100 node	8x H100 80GB	Contact sales

Dedicated nodes include:
- Physical isolation (dedicated taints, namespace)
- Full admin access to deploy any model in supported architectures
- Custom LoRA adapter limits
- Direct Prometheus metrics export

Rate limits (shared tier defaults)

Metric	Limit
Requests per minute (RPM)	200
Tokens per minute (TPM)	500,000
Max output tokens per request	8,192

Limits are per API key. Contact us to increase.
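Clients that want to stay under these defaults can throttle themselves before sending. The sketch below is a client-side guard only. The class name `MinuteWindowLimiter` is hypothetical, and it assumes a sliding 60-second window in which TPM counts prompt tokens plus requested output tokens. How the server actually measures the window is not specified here.

```python
import time
from collections import deque


class MinuteWindowLimiter:
    """Tracks admitted requests and tokens over a sliding 60-second window."""

    def __init__(self, rpm: int = 200, tpm: int = 500_000, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock            # injectable so the limiter can be tested
        self.events = deque()         # (timestamp, tokens) per admitted request
        self.tokens_in_window = 0

    def _expire(self, now: float) -> None:
        # Drop requests that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60.0:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Return True and record the request if it fits both limits."""
        now = self.clock()
        self._expire(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


limiter = MinuteWindowLimiter()
# Budget the prompt size plus the maximum output the request may produce.
if limiter.try_acquire(1_200 + 8_192):
    print("ok to send")
```

When `try_acquire` returns False, the caller can sleep briefly and retry, or queue the request.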

Billing

Metering: near-realtime (60-second lag). Visible in the dashboard at /usage.
Invoicing: end-of-month, net-30.
Currency: USD.