Pricing

Per-token pricing (shared tier)

Pay-as-you-go. No minimum commitment.

Model                     Input (per 1M tokens)   Output (per 1M tokens)
llama-3.1-8b-instruct     $0.10                   $0.15
mistral-small-3-24b       $0.20                   $0.30
qwen-2.5-coder-32b        $0.30                   $0.45
llama-3.1-70b-instruct    $0.55                   $0.75
qwen-2.5-72b-instruct     $0.55                   $0.75
deepseek-v3               $0.50                   $1.20
llama-3.1-405b-instruct   $2.50                   $3.50
bge-m3                    $0.02                   n/a
e5-mistral-7b-instruct    $0.05                   n/a
nomic-embed-v1.5          $0.02                   n/a
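
To make the per-token arithmetic concrete, here is a minimal sketch of a cost estimator. The rates are copied from the table above; the helper name `estimate_cost` and the `PRICES` mapping are hypothetical, and the sketch assumes embedding models bill input tokens only.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from the pricing table.
# Assumption: embedding models are billed on input tokens only.
PRICES = {
    "llama-3.1-8b-instruct": (Decimal("0.10"), Decimal("0.15")),
    "llama-3.1-70b-instruct": (Decimal("0.55"), Decimal("0.75")),
    "deepseek-v3": (Decimal("0.50"), Decimal("1.20")),
    "bge-m3": (Decimal("0.02"), Decimal("0")),
}

MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int = 0) -> Decimal:
    """Estimate the shared-tier charge in USD for one model (hypothetical helper)."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / MILLION


if __name__ == "__main__":
    # 40M input tokens and 10M output tokens on the 70B model.
    cost = estimate_cost("llama-3.1-70b-instruct", 40_000_000, 10_000_000)
    print(f"${cost:.2f}")
```

`Decimal` is used instead of floats so that small per-token amounts add up without binary rounding drift.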

How we compare

Model class (70B)    provocapi   Together AI   Fireworks   AWS Bedrock
Input / 1M tokens    $0.55       $0.88         $0.90       $2.65
Output / 1M tokens   $0.75       $0.88         $0.90       $3.50

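On the 70B-class figures above, our input price is roughly 38-39% below Together AI and Fireworks and about 79% below AWS Bedrock; our output price is roughly 15-17% and 79% lower, respectively. We can price this way because we own the GPUs: our marginal cost is electricity and amortization, not an AWS markup.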

Batch pricing

Batch inference (via POST /v1/batch) is billed at 50% of the standard per-token rate. Listed batch rates are rounded to the nearest cent. Batch jobs carry a 24-hour completion SLA.

Model                    Batch input (per 1M)   Batch output (per 1M)
llama-3.1-70b-instruct   $0.28                  $0.38
llama-3.1-8b-instruct    $0.05                  $0.08
(all others)             50% of standard        50% of standard
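
For models not listed explicitly, the batch rate can be derived from the standard rate. The sketch below halves a standard rate and rounds to whole cents; rounding half up is an assumption that happens to reproduce the listed figures, and `batch_rate` is a hypothetical helper, not part of any SDK. Whether invoices use the rounded or the exact half rate is not stated here, so treat the result as an estimate.

```python
from decimal import Decimal, ROUND_HALF_UP

BATCH_MULTIPLIER = Decimal("0.5")  # batch is billed at 50% of the standard rate


def batch_rate(standard_rate: str) -> Decimal:
    """Halve a standard per-1M-token rate and round to whole cents.

    Assumption: half-cent values round up, which matches the listed
    batch rates (for example $0.15 -> $0.08).
    """
    exact = Decimal(standard_rate) * BATCH_MULTIPLIER
    return exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # Standard input and output rates for llama-3.1-70b-instruct.
    print(batch_rate("0.55"), batch_rate("0.75"))
```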

Reserved capacity

Lock in dedicated model replicas at a flat monthly rate. No per-token charges. Best for predictable, high-volume workloads.

Commitment   Discount vs. on-demand
1 month      20%
3 months     35%
12 months    50%

Example: Llama 70B reserved

One replica of Llama 3.1 70B on a dedicated H100 80GB (FP8):

Term        Monthly cost   Breakeven vs. on-demand
1 month     $4,500         ~6B input + 6B output tokens/month
3 months    $3,900/mo      ~5B input + 5B output tokens/month
12 months   $3,000/mo      ~4B input + 4B output tokens/month
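
Whether reserved capacity pays off depends on your monthly volume and your input/output mix, so it is worth recomputing for your own traffic. The following is a rough sketch of that comparison. It assumes the shared-tier Llama 3.1 70B rates ($0.55 input, $0.75 output per 1M tokens) as the on-demand baseline and ignores the throughput ceiling of a single replica; the function names are hypothetical.

```python
def on_demand_cost(input_tokens: int, output_tokens: int,
                   input_rate: float = 0.55, output_rate: float = 0.75) -> float:
    """Monthly shared-tier cost in USD; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cheaper_option(input_tokens: int, output_tokens: int,
                   reserved_monthly: float) -> tuple[str, float]:
    """Return the cheaper plan name and its monthly cost in USD.

    Assumption: one reserved replica can actually serve the given volume.
    """
    on_demand = on_demand_cost(input_tokens, output_tokens)
    if reserved_monthly < on_demand:
        return ("reserved", reserved_monthly)
    return ("on-demand", on_demand)


if __name__ == "__main__":
    # 1B input + 1B output tokens per month against the 12-month reserved rate.
    print(cheaper_option(1_000_000_000, 1_000_000_000, 3000.0))
```

At 1B input and 1B output tokens per month, the shared tier costs about $1,300 under these assumptions, so on-demand remains cheaper than the $3,000 reserved rate at that volume.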

Reserved capacity includes:

  • Dedicated vLLM processes (no noisy neighbors)
  • No rate limits
  • Priority support
  • Custom model requests

Dedicated nodes

Lease entire physical GPU nodes for full isolation:

Node type           GPUs                    Monthly rate
RTX 5090 node       4x RTX 5090 32GB        Contact sales
RTX PRO 6000 node   4x RTX PRO 6000 96GB    Contact sales
A100 node           8x A100 80GB            Contact sales
H100 node           8x H100 80GB            Contact sales

Dedicated nodes include:

  • Physical isolation (dedicated taints, namespace)
  • Full admin access to deploy any model in supported architectures
  • Custom LoRA adapter limits
  • Direct Prometheus metrics export

Rate limits (shared tier defaults)

Metric                          Limit
Requests per minute (RPM)       200
Tokens per minute (TPM)         500,000
Max output tokens per request   8,192

Limits apply per API key. Contact us to raise them.
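
To stay under these defaults, a client can throttle itself before sending requests. The sketch below is one possible client-side guard using a 60-second sliding window. How the service actually counts tokens and windows is not specified here, so the window model, the class name `SlidingWindowLimiter`, and its parameters are all assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for per-key RPM and TPM limits (illustrative only)."""

    def __init__(self, rpm=200, tpm=500_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.events = deque()  # (timestamp, tokens) for requests still in the window
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that are at least one full window old.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both limits, else False."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Estimated prompt plus completion tokens for the next request.
    if limiter.try_acquire(tokens=1_200):
        print("ok to send")
    else:
        print("wait and retry")
```

A caller that gets `False` back can sleep briefly and retry. The server-side limits remain authoritative, so requests can still be rejected even when the local check passes.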

Billing

  • Metering: near-realtime (60-second lag). Visible in the dashboard at /usage.
  • Invoicing: end-of-month, net-30.
  • Currency: USD.