Inference that refuses to run outside an enclave.
An OpenAI-compatible API where every request is attested on hardware before it executes. No non-TEE fallback. No silent downgrade. If the upstream ever swaps a model off its confidential compute, the API rejects the call server-side with model_not_confidential.
Base URL: https://api.voltagegpu.com/v1

Why confidential inference?
The standard LLM API assumes you trust the inference operator with your prompts, your data, and the weights they claim to be running. VoltageGPU replaces that trust with hardware attestation. Here is what changes.
Host can't read your prompts
Intel TDX encrypts the pod's memory with a key held inside the CPU. The host OS, the hypervisor and even a privileged VoltageGPU operator cannot dump your prompt or the model's KV-cache from outside the enclave.
Every request proves it ran in an enclave
Each TDX quote is bound to the upstream provider's confidential_compute: true flag. The VoltageGPU gateway validates that flag on every call — requests to a model that lost its attestation are refused with a 400, not silently downgraded.
GPU traffic is sealed too
Protected PCIe between the CPU enclave and the NVIDIA GPU means model weights and activations never leave the trust boundary in clear text — even if an attacker is sitting on the host bus.
Non-TEE models are rejected server-side
POST /v1/chat/completions calls assertConfidentialModel() before touching upstream. Any model ID without confidential_compute: true in the live catalog returns model_not_confidential — the request never reaches upstream.
No non-TEE fallback, ever. Embeddings, image generation, audio, video, moderations and fine-tuning are returned as 503 not_available until a hardware-attested variant is offered upstream. If you can't attest it, we don't serve it.
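The gate described above can be sketched in a few lines. This is an illustrative model of the behavior, not the actual server code: the catalog contents, the Python function name, and the exception type are assumptions standing in for the real `assertConfidentialModel()`.

```python
# Illustrative sketch of the server-side TEE gate (not the real implementation).
CATALOG = {
    "deepseek-ai/DeepSeek-R1-0528-TEE": {"confidential_compute": True},
    "deepseek-ai/DeepSeek-V3": {"confidential_compute": False},
}

def assert_confidential_model(model_id: str) -> None:
    """Reject any model that is not hardware-attested in the live catalog."""
    entry = CATALOG.get(model_id)
    if entry is None or not entry.get("confidential_compute"):
        # Mirrors the API's 400 model_not_confidential rejection
        raise ValueError("model_not_confidential")
```

The key property is that the check runs before any upstream traffic: an unattested model ID fails here and never leaves the gateway.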
Available TEE models
- Stabilised V3.1 release with long-context tool calling. Good drop-in for agent pipelines.
- Mainline V3.1 weights. Stronger reasoning than V3-0324, slightly slower.
- Original V3 checkpoint. Cheapest entry point for DeepSeek-class quality.
- DeepSeek-R1 reasoning model. Best-in-class for math, code and multi-step planning.
- Distilled DeepSeek-R1 chimera. Reasoning-first, noticeably faster than R1-0528.
- 397B MoE with 17B active. Flagship Qwen3.5 with reasoning and agentic tools.
- 235B MoE with 22B active parameters. Balanced quality/throughput for most chat tasks.
- 32B dense reasoning model. Cheapest TEE entry point for latency-sensitive chat.
- Qwen3 specialised for code generation and repo-scale editing. Long context.
- Kimi K2.5 long-context reasoning model. 262K window for whole-repo and whole-book workflows.
- MiniMax M2.5 reasoning model with 196K context. Strong reasoning + tool use at low input cost.
- OpenAI GPT-OSS 120B weights running confidentially. Solid all-round chat with tools.
- GLM-4.7 reasoning model with strong agentic tool use. Chinese + English bilingual.
- Latest GLM 5.1. Flagship tier, highest quality in the Zhipu family.
- Xiaomi MiMo V2 Flash. Fastest TEE model in the catalog, optimised for high QPS.
This grid is a snapshot. The canonical live feed is GET /v1/models — it returns exactly what the gate will accept right now and includes live per-token pricing. Always fetch it before caching model IDs or rates in your application.
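A minimal refresh pattern looks like this. The response shape is an assumption based on the OpenAI-compatible list format (`{"data": [{"id": ...}, ...]}`); the helper name is illustrative.

```python
# Hedged sketch: derive accepted model IDs from a /v1/models-style response
# instead of hard-coding them, so your cache never drifts from the gate.
def live_model_ids(models_response: dict) -> set:
    """Return the set of model IDs the gate will accept right now."""
    return {m["id"] for m in models_response.get("data", [])}

# Example snapshot of a response body (shape assumed, IDs from this doc):
snapshot = {"data": [{"id": "deepseek-ai/DeepSeek-R1-0528-TEE"},
                     {"id": "Qwen/Qwen3-235B-A22B-Instruct-2507-TEE"}]}
```

Re-run this against a fresh `GET /v1/models` response on a timer or at startup, rather than baking IDs into config.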
Pricing
USD per 1M tokens with the VoltageGPU 1.85× markup already applied. Same numbers the API returns in its live catalog — no surprise math at billing time.
Billing is metered in real time from the live inference catalog. Streaming responses are pre-charged against a conservative estimate and then reconciled against the final usage block — if the upstream never returns usage, a fallback based on streamed choices[].delta.content bytes is used so long completions cannot be free-loaded.
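The arithmetic above can be sketched as follows. The 1.85× markup is from this page; the 4-bytes-per-token ratio in the fallback is an illustrative assumption, not the gateway's actual estimator.

```python
MARKUP = 1.85  # VoltageGPU markup, already applied in GET /v1/models pricing

def billed_price_per_mtok(upstream_usd_per_mtok: float) -> float:
    """Price shown in the live catalog: upstream rate with markup applied."""
    return upstream_usd_per_mtok * MARKUP

def fallback_completion_tokens(delta_contents: list) -> int:
    """If upstream never returns a usage block, estimate completion tokens
    from streamed choices[].delta.content bytes.
    The 4-bytes-per-token ratio here is an illustrative assumption."""
    total_bytes = sum(len(c.encode("utf-8")) for c in delta_contents)
    return max(1, total_bytes // 4)
```

The point of the fallback is the floor: even a stream that drops before its usage chunk arrives is billed for something, so long completions cannot ride for free.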
Quick start
Get your first confidential chat completion in under a minute. The API is 100% OpenAI-compatible — drop in the base URL and your vgpu_* key.
# Chat Completions — OpenAI-compatible, TEE-gated
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528-TEE",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 1024,
"temperature": 0.7
}'

Tip: existing OpenAI SDKs work as-is. Change base_url to https://api.voltagegpu.com/v1 and every request automatically runs inside an Intel TDX enclave. If you pass a non-TEE model ID by mistake, the API returns a clear 400 model_not_confidential.
Authentication
All API requests require a Bearer token. Generate one from the Dashboard Settings. Keys start with vgpu_.
# Bearer auth
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer vgpu_sk_xxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-ai/DeepSeek-R1-0528-TEE", "messages": [...]}'
# Or with the OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
api_key="vgpu_sk_xxxxxxxxxxxx",
base_url="https://api.voltagegpu.com/v1",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528-TEE",
messages=[{"role": "user", "content": "Hello!"}],
)

Rotate keys regularly from your dashboard. Never embed them in client-side code — every token spent on a leaked key is billed to your account.
API reference
Chat completions
Generate conversational responses using TEE-attested LLMs. Drop-in replacement for POST /v1/chat/completions on api.openai.com.
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| POST | /v1/chat/completions | Create a confidential chat completion | Yes |
Request body parameters
| Parameter | Required | Description |
|---|---|---|
| model | required | TEE model ID from GET /v1/models. Non-TEE IDs are rejected with 400 model_not_confidential. |
| messages | required | OpenAI-style array of {role, content} objects. |
| max_tokens | optional | Max output tokens. Defaults to 1024. |
| temperature | optional | Sampling temperature 0–2. Defaults to 0.7. |
| stream | optional | Stream tokens as SSE. Usage reconciliation runs automatically at end of stream. |
| top_p | optional | Nucleus sampling. Defaults to 1. |
| tools / tool_choice | optional | OpenAI-style function calling. Works on every TEE model tagged tools above. |
| response_format | optional | { type: "json_object" } or JSON schema. Requires a model tagged json-mode or structured. |
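As a concrete illustration of the optional parameters, here are two request bodies. The `get_weather` function is a hypothetical example tool, and both payloads assume the chosen model carries the relevant tools / json-mode tag.

```python
# JSON mode: constrain output to a single JSON object.
json_mode_request = {
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [{"role": "user", "content": "List three primes as JSON."}],
    "max_tokens": 1024,   # the documented default
    "temperature": 0.7,   # the documented default
    "response_format": {"type": "json_object"},
}

# Function calling: OpenAI-style tools array (get_weather is hypothetical).
tool_request = {
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}}},
        },
    }],
    "tool_choice": "auto",
}
```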
Models
List the live catalog. Only TEE-attested models are returned — this is the same filtered feed the gate uses.
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| GET | /v1/models | List all available TEE models with live pricing | Yes |
| GET | /v1/models/:id | Get a single TEE model's full metadata | Yes |
Why only chat completions?
VoltageGPU only exposes workloads that run end-to-end inside an Intel TDX enclave with hardware attestation. /v1/embeddings, /v1/images/generations, /v1/audio, /v1/video, /v1/moderations and /v1/fine-tuning all return 503 not_available — the upstream catalog offers zero confidential variants for those modalities today. As soon as a TEE embedding or diffusion model ships, we'll open the corresponding endpoint.
SDK integration
Use any OpenAI-compatible SDK by overriding the base URL. Your existing code doesn't change — the TEE enforcement happens server-side.
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
api_key="vgpu_sk_xxxxxxxxxxxx",
base_url="https://api.voltagegpu.com/v1",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528-TEE",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing"},
],
max_tokens=1024,
)
print(response.choices[0].message.content)

TypeScript (OpenAI SDK)
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'vgpu_sk_xxxxxxxxxxxx',
baseURL: 'https://api.voltagegpu.com/v1',
});
const response = await client.chat.completions.create({
model: 'Qwen/Qwen3-235B-A22B-Instruct-2507-TEE',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Explain quantum computing' },
],
max_tokens: 1024,
});
console.log(response.choices[0].message.content);

Streaming
Set stream: true to receive tokens as Server-Sent Events. The VoltageGPU gateway automatically injects stream_options: { include_usage: true } upstream, parses the final usage chunk on the fly, and reconciles the pre-charge against the real token count when the stream closes. Long completions cannot escape billing even if the upstream drops the connection.
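On the client side, consuming such a stream amounts to reading `data:` lines, accumulating deltas, and keeping the final usage chunk. A minimal sketch, assuming chunks follow the OpenAI streaming format shown below:

```python
import json

def parse_sse_stream(lines):
    """Minimal SSE reader: collect delta content and the final usage chunk."""
    content, usage = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore comments and blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                content.append(piece)
        if "usage" in chunk:  # final chunk carries the token counts
            usage = chunk["usage"]
    return "".join(content), usage
```

Note that the usage chunk may have no `choices` entry at all, which is why the reader treats the two fields independently.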
# Enable streaming with stream: true
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528-TEE",
"messages": [{"role": "user", "content": "Write a poem about the stars"}],
"stream": true
}'
# Response is a standard OpenAI SSE stream
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"The"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sun"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sets"}}]}
...
data: {"id":"chatcmpl-123","usage":{"prompt_tokens":8,"completion_tokens":42}}
data: [DONE]

Errors
All errors follow a consistent JSON envelope:
{
"error": {
"message": "Model 'deepseek-ai/DeepSeek-V3' is not Confidential Compute. Use 'deepseek-ai/DeepSeek-V3-0324-TEE' instead.",
"type": "invalid_request_error",
"code": "model_not_confidential",
"status": 400
}
}

Common status codes
| Status | Meaning |
|---|---|
| 200 | Success — request completed inside the enclave. |
| 400 | Bad request — invalid params, or model_not_confidential for a non-TEE ID. |
| 401 | Missing or invalid API key. |
| 402 | Insufficient balance — top up at voltagegpu.com/billing. |
| 429 | Rate limit exceeded. Check the X-RateLimit-* headers. |
| 503 | not_available — you hit /v1/embeddings, images, audio, video, moderations or fine-tuning. No TEE variant exists yet. |
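Because every error uses the same envelope, a client can branch on `error.code` plus the status. A sketch of that dispatch (the returned strings are illustrative advice, not API output):

```python
def classify_error(body: dict, status: int) -> str:
    """Branch on the consistent error envelope; codes from the table above."""
    code = body.get("error", {}).get("code")
    if status == 400 and code == "model_not_confidential":
        return "switch to a -TEE model id"
    if status == 401:
        return "check the vgpu_* key"
    if status == 402:
        return "top up balance"
    if status == 429:
        return "back off and retry"
    if status == 503 and code == "not_available":
        return "no TEE variant for this endpoint yet"
    return "unexpected error"
```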
Rate limits & billing
- Default rate limit: 1000 requests per minute. Contact support@voltagegpu.com for higher tiers.
- Billing model: USD per million tokens with the 1.85× markup already baked into GET /v1/models.
- Streaming: pre-charged against a conservative input-token estimate, then reconciled with the real usage block emitted at end of stream.
- Balance: debited in real time against your account. Insufficient balance returns 402 before any upstream call.
Rate limit headers on every response:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1704715200
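A client can use these headers to pause before hitting a 429. A minimal sketch, assuming X-RateLimit-Reset is a unix timestamp in seconds (as the example value suggests):

```python
import time

def seconds_until_allowed(headers: dict) -> float:
    """Return how long to wait before the next request: 0 if budget remains,
    otherwise the time until X-RateLimit-Reset."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset - time.time())

# In a real client: time.sleep(seconds_until_allowed(response.headers))
```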
Check live usage and top up at voltagegpu.com/billing.