Quickstart

Get your first API response in 60 seconds. The examples below use the placeholder key pk-prov-YOUR-KEY; replace it with your own key, which starts with pk-prov-.

1. Make a request

curl

curl https://inference.provocative.earth/v1/chat/completions \
  -H "Authorization: Bearer pk-prov-YOUR-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [{"role": "user", "content": "Explain inference-as-a-service in one sentence."}],
    "max_tokens": 100
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.provocative.earth/v1",
    api_key="pk-prov-YOUR-KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Explain inference-as-a-service in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
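
Hard-coding the key works for a first test, but for anything you commit it is safer to read it from the environment. A minimal sketch of that, where the variable name PROVOCATIVE_API_KEY and the helper client_settings are illustrative assumptions, not names the service defines:

```python
import os

BASE_URL = "https://inference.provocative.earth/v1"
KEY_PREFIX = "pk-prov-"  # key prefix stated at the top of this quickstart


def client_settings(env_var: str = "PROVOCATIVE_API_KEY") -> dict:
    """Build keyword arguments for OpenAI(...) from an environment variable.

    The variable name is a hypothetical choice made for this sketch.
    """
    api_key = os.environ.get(env_var, "")
    if not api_key.startswith(KEY_PREFIX):
        raise RuntimeError(f"Set {env_var} to a key starting with {KEY_PREFIX}")
    return {"base_url": BASE_URL, "api_key": api_key}


# Usage (needs the openai package and a real key):
#   from openai import OpenAI
#   client = OpenAI(**client_settings())
```

Failing early on a missing or malformed key gives a clearer error than a rejected request later.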

JavaScript (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.provocative.earth/v1",
  apiKey: "pk-prov-YOUR-KEY",
});

const response = await client.chat.completions.create({
  model: "llama-3.1-70b-instruct",
  messages: [{ role: "user", content: "Explain inference-as-a-service in one sentence." }],
  max_tokens: 100,
});
console.log(response.choices[0].message.content);

2. Stream tokens

Set stream to true (stream=True in the Python SDK) to receive tokens as they are generated, delivered as Server-Sent Events:

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
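for chunk in stream:
    # Some chunks may carry no choices or no text; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # end the line after the last token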
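
If you need the complete reply as well as the live output, the deltas can be accumulated as they arrive. A sketch, assuming chunks shaped like the ones in the loop above; collect_stream is a hypothetical helper, and the check for an empty choices list is a defensive assumption rather than documented behavior of this API:

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print each text delta as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        # Skip chunks with no choices (assumed possible, e.g. usage-only chunks).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # chunks without text (None or "") are ignored
            print(delta, end="", flush=True)
            parts.append(delta)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Usage with a stream created as shown above:
#   full_text = collect_stream(stream)
```

A stream can only be consumed once, so call the helper on a fresh stream and not on one the loop above has already read.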

3. Generate embeddings

The input list can hold several strings; the response carries one embedding per string in embeddings.data:

embeddings = client.embeddings.create(
    model="bge-m3",
    input=["search query", "document to compare"],
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")
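
The two inputs above suggest a query-versus-document comparison. One common way to score that is cosine similarity between the two vectors. A minimal pure-Python sketch follows; the helper name cosine_similarity is illustrative, and treating cosine as the right metric for bge-m3 on this service is an assumption:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Usage with the response above:
#   query_vec = embeddings.data[0].embedding
#   doc_vec = embeddings.data[1].embedding
#   print(f"Similarity: {cosine_similarity(query_vec, doc_vec):.3f}")
```

Values near 1 indicate vectors pointing in nearly the same direction; for large collections a vectorized library would be faster than this loop-based version.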

Next steps

Migrating from OpenAI — what changes, what doesn't
Model Catalog — all available models with specs
LoRA Adapters — serve your fine-tuned models
Batch Inference — async bulk processing at 50% cost