Confidential AI Inference · Intel TDX

Inference that refuses to run outside an enclave.

An OpenAI-compatible API where every request is attested on hardware before it executes. No non-TEE fallback. No silent downgrade. If the upstream ever swaps a model off its confidential compute, the API rejects the call server-side with model_not_confidential.

Base URL: https://api.voltagegpu.com/v1
16 TEE-attested models
100% Intel TDX
0 non-TEE fallback
262K max context

Why confidential inference?

The standard LLM API assumes you trust the inference operator with your prompts, your data, and the weights they claim to be running. VoltageGPU replaces that trust with hardware attestation. Here is what changes.

Memory

Host can't read your prompts

Intel TDX encrypts the pod's memory with a key held inside the CPU. The host OS, the hypervisor and even a privileged VoltageGPU operator cannot dump your prompt or the model's KV-cache from outside the enclave.

Attestation

Every request proves it ran in an enclave

Each TDX quote is bound to the upstream provider's confidential_compute: true flag. The VoltageGPU gateway validates that flag on every call — requests to a model that lost its attestation are refused with a 400, not silently downgraded.

PCIe

GPU traffic is sealed too

Protected PCIe between the CPU enclave and the NVIDIA GPU means model weights and activations never leave the trust boundary in clear text — even if an attacker is sitting on the host bus.

Gate

Non-TEE models are rejected server-side

POST /v1/chat/completions calls assertConfidentialModel() before touching upstream. Any model ID without confidential_compute: true in the live catalog returns model_not_confidential — the request never reaches upstream.

No non-TEE fallback, ever. Requests for embeddings, image generation, audio, video, moderations and fine-tuning return 503 not_available until a hardware-attested variant is offered upstream. If we can't attest it, we don't serve it.
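The gate runs server-side, but you can mirror it client-side to fail fast before a network round trip. A minimal sketch, assuming catalog entries expose the confidential_compute flag described above; confidential_ids and assert_confidential are illustrative helpers, not part of the API:

```python
def confidential_ids(catalog: list[dict]) -> set[str]:
    """Model IDs the gate will accept: only entries carrying
    confidential_compute: true."""
    return {m["id"] for m in catalog if m.get("confidential_compute") is True}

def assert_confidential(model_id: str, catalog: list[dict]) -> None:
    """Client-side mirror of the server's assertConfidentialModel():
    fail fast instead of burning a round trip on a guaranteed 400."""
    if model_id not in confidential_ids(catalog):
        raise ValueError(f"'{model_id}' would be rejected: model_not_confidential")

# Toy two-entry catalog for illustration
catalog = [
    {"id": "deepseek-ai/DeepSeek-R1-0528-TEE", "confidential_compute": True},
    {"id": "deepseek-ai/DeepSeek-V3", "confidential_compute": False},
]
assert_confidential("deepseek-ai/DeepSeek-R1-0528-TEE", catalog)  # passes
```

The server remains the source of truth; this only saves a round trip when the model ID is obviously wrong.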

Available TEE models

16 models live · up to 256K tokens · all confidential_compute: true
GET /v1/models
deepseek-ai/DeepSeek-V3.2-TEE
128K ctx

Latest DeepSeek general-purpose model with reasoning and tool use. Strong default for most chat workloads.

in $0.28 · out $0.42 /M · reasoning · tools · json-mode
deepseek-ai/DeepSeek-V3.1-Terminus-TEE
160K ctx

Stabilised V3.1 release with long-context tool calling. Good drop-in for agent pipelines.

in $0.27 · out $1.00 /M · reasoning · tools · json-mode
deepseek-ai/DeepSeek-V3.1-TEE
160K ctx

Mainline V3.1 weights. Stronger reasoning than V3-0324, slightly slower.

in $0.27 · out $1.00 /M · reasoning · tools · json-mode
deepseek-ai/DeepSeek-V3-0324-TEE
160K ctx

Original V3 checkpoint. Cheapest entry point for DeepSeek-class quality.

in $0.25 · out $1.00 /M · tools · json-mode
deepseek-ai/DeepSeek-R1-0528-TEE
160K ctx

DeepSeek-R1 reasoning model. Best-in-class for math, code and multi-step planning.

in $0.45 · out $2.15 /M · reasoning · tools · json-mode
tngtech/DeepSeek-TNG-R1T2-Chimera-TEE
160K ctx

Distilled DeepSeek-R1 chimera. Reasoning-first, noticeably faster than R1-0528.

in $0.30 · out $1.10 /M · reasoning · tools · json-mode
Qwen/Qwen3.5-397B-A17B-TEE
256K ctx

397B MoE with 17B active. Flagship Qwen3.5 with reasoning and agentic tools.

in $0.39 · out $2.34 /M · reasoning · tools · json-mode
Qwen/Qwen3-235B-A22B-Instruct-2507-TEE
256K ctx

235B MoE with 22B active parameters. Balanced quality/throughput for most chat tasks.

in $0.10 · out $0.60 /M · tools · json-mode
Qwen/Qwen3-32B-TEE
40K ctx

32B dense reasoning model. Cheapest TEE entry point for latency-sensitive chat.

in $0.08 · out $0.24 /M · reasoning · tools · json-mode
Qwen/Qwen3-Coder-Next-TEE
256K ctx

Qwen3 specialised for code generation and repo-scale editing. Long context.

in $0.12 · out $0.75 /M · tools · json-mode
moonshotai/Kimi-K2.5-TEE
256K ctx

Kimi K2.5 long-context reasoning model. 262K window for whole-repo and whole-book workflows.

in $0.38 · out $1.72 /M · reasoning · tools · json-mode
MiniMaxAI/MiniMax-M2.5-TEE
192K ctx

MiniMax M2.5 reasoning model with 196K context. Strong reasoning + tool use at low input cost.

in $0.12 · out $0.99 /M · reasoning · tools · json-mode
openai/gpt-oss-120b-TEE
128K ctx

OpenAI GPT-OSS 120B weights running confidentially. Solid all-round chat with tools.

in $0.09 · out $0.36 /M · reasoning · tools · json-mode
zai-org/GLM-4.7-TEE
198K ctx

GLM-4.7 reasoning model with strong agentic tool use. Chinese + English bilingual.

in $0.39 · out $1.75 /M · reasoning · tools · json-mode
zai-org/GLM-5.1-TEE
198K ctx

Latest GLM 5.1. Flagship tier, highest quality in the Zhipu family.

in $0.95 · out $3.15 /M · reasoning · tools · json-mode
XiaomiMiMo/MiMo-V2-Flash-TEE
256K ctx

Xiaomi MiMo V2 Flash. Fastest TEE model in the catalog, optimised for high QPS.

in $0.09 · out $0.29 /M · tools · json-mode

This grid is a snapshot. The canonical live feed is GET /v1/models — it returns exactly what the gate will accept right now and includes live per-token pricing. Always fetch it before caching model IDs or rates in your application.
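A sketch of consuming that feed, assuming the standard OpenAI-style {"data": [...]} envelope; the context_length field and the toy payload are illustrative, so verify field names against the real response before relying on them:

```python
import json

def parse_catalog(payload: str) -> dict[str, dict]:
    """Index the live catalog by model ID, assuming the OpenAI-style
    {"data": [...]} envelope."""
    return {m["id"]: m for m in json.loads(payload)["data"]}

# Toy payload; context_length is an assumed field name, check the real feed.
sample = json.dumps({"data": [
    {"id": "Qwen/Qwen3-32B-TEE", "context_length": 40960},
]})
models = parse_catalog(sample)  # refresh this before caching IDs or rates
```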

Pricing

USD per 1M tokens with the VoltageGPU 1.85× markup already applied. Same numbers the API returns in its live catalog — no surprise math at billing time.

Cheapest · in $0.08 · out $0.24 · Qwen/Qwen3-32B-TEE
Mid tier · in $0.27 · out $1.00 · deepseek-ai/DeepSeek-V3.1-Terminus-TEE
Flagship · in $0.95 · out $3.15 · zai-org/GLM-5.1-TEE

Billing is metered in real time from the live inference catalog. Streaming responses are pre-charged against a conservative estimate, then reconciled against the final usage block. If the upstream never returns usage, a fallback based on the streamed choices[].delta.content byte count is used, so long completions cannot slip through unbilled.
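The reconciliation arithmetic itself is simple: a sketch of the per-request cost from a usage block and the catalog's per-million rates (the rates shown are DeepSeek-R1-0528-TEE's from the grid above):

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     in_per_m: float, out_per_m: float) -> float:
    """USD cost of one request at per-million-token catalog rates
    (the 1.85x markup is already inside those rates)."""
    return (prompt_tokens * in_per_m + completion_tokens * out_per_m) / 1_000_000

# DeepSeek-R1-0528-TEE: in $0.45, out $2.15 per 1M tokens
cost = request_cost_usd(prompt_tokens=8, completion_tokens=42,
                        in_per_m=0.45, out_per_m=2.15)  # about $0.0000939
```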

Quick start

Get your first confidential chat completion in under a minute. The API is 100% OpenAI-compatible — drop in the base URL and your vgpu_* key.

# Chat Completions — OpenAI-compatible, TEE-gated
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

Tip: existing OpenAI SDKs work as-is. Change base_url to https://api.voltagegpu.com/v1 and every request automatically runs inside an Intel TDX enclave. If you pass a non-TEE model ID by mistake, the API returns a clear 400 model_not_confidential.

Authentication

All API requests require a Bearer token. Generate one from the Dashboard Settings. Keys start with vgpu_.

# Bearer auth
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
  -H "Authorization: Bearer vgpu_sk_xxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1-0528-TEE", "messages": [...]}'

# Or with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(
    api_key="vgpu_sk_xxxxxxxxxxxx",
    base_url="https://api.voltagegpu.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-TEE",
    messages=[{"role": "user", "content": "Hello!"}],
)

Rotate keys regularly from your dashboard. Never embed them in client-side code — every token spent on a leaked key is billed to your account.

API reference

Chat completions

Generate conversational responses using TEE-attested LLMs. Drop-in replacement for POST /v1/chat/completions on api.openai.com.

Method   Endpoint                Description                              Auth
POST     /v1/chat/completions    Create a confidential chat completion    Yes

Request body parameters

model · required · TEE model ID from GET /v1/models. Non-TEE IDs are rejected with 400 model_not_confidential.
messages · required · OpenAI-style array of {role, content} objects.
max_tokens · optional · Max output tokens. Defaults to 1024.
temperature · optional · Sampling temperature 0–2. Defaults to 0.7.
stream · optional · Stream tokens as SSE. Usage reconciliation runs automatically at end of stream.
top_p · optional · Nucleus sampling. Defaults to 1.
tools / tool_choice · optional · OpenAI-style function calling. Works on every TEE model tagged tools above.
response_format · optional · { type: "json_object" } or a JSON schema. Requires a model tagged json-mode above.
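Putting the optional parameters together, a sketch of a request body that combines tools and json-mode; the get_weather tool is a hypothetical example, with its schema in OpenAI's function-calling format:

```python
payload = {
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-TEE",  # tagged tools + json-mode
    "messages": [{"role": "user", "content": "Weather in Paris, as JSON"}],
    "max_tokens": 1024,   # the documented default
    "temperature": 0.7,   # the documented default
    "response_format": {"type": "json_object"},
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}
```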

Models

List the live catalog. Only TEE-attested models are returned — this is the same filtered feed the gate uses.

Method   Endpoint          Description                                        Auth
GET      /v1/models        List all available TEE models with live pricing    Yes
GET      /v1/models/:id    Get a single TEE model's full metadata             Yes

Why only chat completions?

VoltageGPU only exposes workloads that run end-to-end inside an Intel TDX enclave with hardware attestation. /v1/embeddings, /v1/images/generations, /v1/audio, /v1/video, /v1/moderations and /v1/fine-tuning all return 503 not_available — the upstream catalog offers zero confidential variants for those modalities today. As soon as a TEE embedding or diffusion model ships, we'll open the corresponding endpoint.

SDK integration

Use any OpenAI-compatible SDK by overriding the base URL. Your existing code doesn't change — the TEE enforcement happens server-side.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="vgpu_sk_xxxxxxxxxxxx",
    base_url="https://api.voltagegpu.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528-TEE",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing"},
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)

TypeScript (OpenAI SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'vgpu_sk_xxxxxxxxxxxx',
  baseURL: 'https://api.voltagegpu.com/v1',
});

const response = await client.chat.completions.create({
  model: 'Qwen/Qwen3-235B-A22B-Instruct-2507-TEE',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain quantum computing' },
  ],
  max_tokens: 1024,
});

console.log(response.choices[0].message.content);

Streaming

Set stream: true to receive tokens as Server-Sent Events. The VoltageGPU gateway automatically injects stream_options: { include_usage: true } upstream, parses the final usage chunk on the fly, and reconciles the pre-charge against the real token count when the stream closes. Long completions cannot escape billing even if the upstream drops the connection.

# Enable streaming with stream: true
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [{"role": "user", "content": "Write a poem about the stars"}],
    "stream": true
  }'

# Response is a standard OpenAI SSE stream
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"The"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sun"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sets"}}]}
...
data: {"id":"chatcmpl-123","usage":{"prompt_tokens":8,"completion_tokens":42}}
data: [DONE]
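If you consume the raw SSE feed rather than an SDK, the chunks above assemble straightforwardly; a minimal sketch that collects delta content and the final usage block:

```python
import json

def collect_stream(lines):
    """Assemble the completion text and the final usage block from raw
    SSE lines shaped like the chunks shown above."""
    text, usage = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            text.append(choice.get("delta", {}).get("content", ""))
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text), usage

# Toy stream mirroring the chunks above
stream = [
    'data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"The"}}]}',
    'data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sun"}}]}',
    'data: {"id":"chatcmpl-123","usage":{"prompt_tokens":8,"completion_tokens":42}}',
    'data: [DONE]',
]
text, usage = collect_stream(stream)
```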

Errors

All errors follow a consistent JSON envelope:

{
  "error": {
    "message": "Model 'deepseek-ai/DeepSeek-V3' is not Confidential Compute. Use 'deepseek-ai/DeepSeek-V3-0324-TEE' instead.",
    "type": "invalid_request_error",
    "code": "model_not_confidential",
    "status": 400
  }
}
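A sketch of handling that envelope, distinguishing the gate's rejection from other client errors; raise_for_envelope is an illustrative helper, not part of any SDK:

```python
def raise_for_envelope(body: dict) -> None:
    """Map the error envelope to exceptions; gate rejections get their
    own type so callers can swap in a TEE model ID and retry."""
    err = body.get("error")
    if err is None:
        return
    if err.get("code") == "model_not_confidential":
        raise LookupError(err["message"])
    raise RuntimeError(f"{err.get('type')}: {err.get('message')}")

envelope = {"error": {
    "message": "Model 'deepseek-ai/DeepSeek-V3' is not Confidential Compute.",
    "type": "invalid_request_error",
    "code": "model_not_confidential",
    "status": 400,
}}
```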

Common status codes

200Success — request completed inside the enclave.
400Bad request — invalid params, or model_not_confidential for a non-TEE ID.
401Missing or invalid API key.
402Insufficient balance — top up at voltagegpu.com/billing.
429Rate limit exceeded. Check the X-RateLimit-* headers.
503not_available — you hit /v1/embeddings, images, audio, video, moderations or fine-tuning. No TEE variant exists yet.

Rate limits & billing

  • Default rate limit: 1000 requests per minute. Contact support@voltagegpu.com for higher tiers.
  • Billing model: USD per million tokens with the 1.85× markup already baked into GET /v1/models.
  • Streaming: pre-charged against a conservative input-token estimate, then reconciled with the real usage block emitted at end of stream.
  • Balance: debited in real time against your account. Insufficient balance returns 402 before any upstream call.

Rate limit headers on every response:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1704715200
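A sketch of turning those headers into a retry delay, assuming X-RateLimit-Reset is a Unix timestamp as shown:

```python
def backoff_seconds(headers: dict, now: float) -> float:
    """Seconds to wait before retrying, derived from X-RateLimit-*.
    Returns 0 while quota remains; otherwise waits until the reset time."""
    if int(headers.get("X-RateLimit-Remaining", 1)) > 0:
        return 0.0
    return max(0.0, float(headers["X-RateLimit-Reset"]) - now)

headers = {
    "X-RateLimit-Limit": "1000",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1704715200",
}
wait = backoff_seconds(headers, now=1704715170.0)  # 30.0 seconds until reset
```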

Check live usage and top up at voltagegpu.com/billing.

Support & resources