Inference that refuses to run outside an enclave.
An OpenAI-compatible API where every request is attested on hardware before it executes. No non-TEE fallback. No silent downgrade. If the upstream ever swaps a model off its confidential compute, the API rejects the call server-side with model_not_confidential.
Base URL: https://api.voltagegpu.com/v1

Why confidential inference?
The standard LLM API assumes you trust the inference operator with your prompts, your data, and the weights they claim to be running. VoltageGPU replaces that trust with hardware attestation. Here is what changes.
Host can't read your prompts
Intel TDX encrypts the pod's memory with a key held inside the CPU. The host OS, the hypervisor and even a privileged VoltageGPU operator cannot dump your prompt or the model's KV-cache from outside the enclave.
Every request proves it ran in an enclave
Each TDX quote is bound to the upstream provider's confidential_compute: true flag. The VoltageGPU gateway validates that flag on every call — requests to a model that lost its attestation are refused with a 400, not silently downgraded.
GPU traffic is sealed too
Protected PCIe between the CPU enclave and the NVIDIA GPU means model weights and activations never leave the trust boundary in clear text — even if an attacker is sitting on the host bus.
Non-TEE models are rejected server-side
POST /v1/chat/completions calls assertConfidentialModel() before touching upstream. Any model ID without confidential_compute: true in the live catalog returns model_not_confidential — the request never reaches upstream.
No non-TEE fallback, ever. Embeddings, image generation, audio, video, moderations and fine-tuning are returned as 503 not_available until a hardware-attested variant is offered upstream. If you can't attest it, we don't serve it.
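The gate described above can be sketched in a few lines. This is an illustrative model of the behavior, not the actual server code: the catalog contents, the Python function name, and the exception type are assumptions standing in for the real `assertConfidentialModel()`.

```python
# Illustrative sketch of the server-side TEE gate (not the real implementation).
CATALOG = {
    "deepseek-ai/DeepSeek-R1-0528-TEE": {"confidential_compute": True},
    "deepseek-ai/DeepSeek-V3": {"confidential_compute": False},
}

def assert_confidential_model(model_id: str) -> None:
    """Reject any model that is not hardware-attested in the live catalog."""
    entry = CATALOG.get(model_id)
    if entry is None or not entry.get("confidential_compute"):
        # Mirrors the API's 400 model_not_confidential rejection
        raise ValueError("model_not_confidential")
```

The key property is that the check runs before any upstream traffic: an unattested model ID fails here and never leaves the gateway.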
Available TEE models
- Stabilised V3.1 release with long-context tool calling. Good drop-in for agent pipelines.
- Mainline V3.1 weights. Stronger reasoning than V3-0324, slightly slower.
- Original V3 checkpoint. Cheapest entry point for DeepSeek-class quality.
- DeepSeek-R1 reasoning model. Best-in-class for math, code and multi-step planning.
- Distilled DeepSeek-R1 chimera. Reasoning-first, noticeably faster than R1-0528.
- 397B MoE with 17B active. Flagship Qwen3.5 with reasoning and agentic tools.
- 235B MoE with 22B active parameters. Balanced quality/throughput for most chat tasks.
- 32B dense reasoning model. Cheapest TEE entry point for latency-sensitive chat.
- Qwen3 specialised for code generation and repo-scale editing. Long context.
- Kimi K2.5 long-context reasoning model. 262K window for whole-repo and whole-book workflows.
- MiniMax M2.5 reasoning model with 196K context. Strong reasoning + tool use at low input cost.
- OpenAI GPT-OSS 120B weights running confidentially. Solid all-round chat with tools.
- GLM-4.7 reasoning model with strong agentic tool use. Chinese + English bilingual.
- Latest GLM 5.1. Flagship tier, highest quality in the Zhipu family.
- Xiaomi MiMo V2 Flash. Fastest TEE model in the catalog, optimised for high QPS.
This grid is a snapshot. The canonical live feed is GET /v1/models — it returns exactly what the gate will accept right now and includes live per-token pricing. Always fetch it before caching model IDs or rates in your application.
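A minimal refresh pattern looks like this. The response shape is an assumption based on the OpenAI-compatible list format (`{"data": [{"id": ...}, ...]}`); the helper name is illustrative.

```python
# Hedged sketch: derive accepted model IDs from a /v1/models-style response
# instead of hard-coding them, so your cache never drifts from the gate.
def live_model_ids(models_response: dict) -> set:
    """Return the set of model IDs the gate will accept right now."""
    return {m["id"] for m in models_response.get("data", [])}

# Example snapshot of a response body (shape assumed, IDs from this doc):
snapshot = {"data": [{"id": "deepseek-ai/DeepSeek-R1-0528-TEE"},
                     {"id": "Qwen/Qwen3-235B-A22B-Instruct-2507-TEE"}]}
```

Re-run this against a fresh `GET /v1/models` response on a timer or at startup, rather than baking IDs into config.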
Pricing
USD per 1M tokens with the VoltageGPU 1.85× markup already applied. Same numbers the API returns in its live catalog — no surprise math at billing time.
Billing is metered in real time from the live inference catalog. Streaming responses are pre-charged against a conservative estimate and then reconciled against the final usage block — if the upstream never returns usage, a fallback based on streamed choices[].delta.content bytes is used so long completions cannot be free-loaded.
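The arithmetic above can be sketched as follows. The 1.85× markup is from this page; the 4-bytes-per-token ratio in the fallback is an illustrative assumption, not the gateway's actual estimator.

```python
MARKUP = 1.85  # VoltageGPU markup, already applied in GET /v1/models pricing

def billed_price_per_mtok(upstream_usd_per_mtok: float) -> float:
    """Price shown in the live catalog: upstream rate with markup applied."""
    return upstream_usd_per_mtok * MARKUP

def fallback_completion_tokens(delta_contents: list) -> int:
    """If upstream never returns a usage block, estimate completion tokens
    from streamed choices[].delta.content bytes.
    The 4-bytes-per-token ratio here is an illustrative assumption."""
    total_bytes = sum(len(c.encode("utf-8")) for c in delta_contents)
    return max(1, total_bytes // 4)
```

The point of the fallback is the floor: even a stream that drops before its usage chunk arrives is billed for something, so long completions cannot ride for free.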
Quick start
Get your first confidential chat completion in under a minute. The API is 100% OpenAI-compatible — drop in the base URL and your vgpu_* key.
# Chat Completions — OpenAI-compatible, TEE-gated
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528-TEE",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 1024,
"temperature": 0.7
}'

Tip: existing OpenAI SDKs work as-is. Change base_url to https://api.voltagegpu.com/v1 and every request automatically runs inside an Intel TDX enclave. If you pass a non-TEE model ID by mistake, the API returns a clear 400 model_not_confidential.
Authentication
All API requests require a Bearer token. Generate one from the Dashboard Settings. Keys start with vgpu_.
# Bearer auth
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer vgpu_sk_xxxxxxxxxxxx" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-ai/DeepSeek-R1-0528-TEE", "messages": [...]}'
# Or with the OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
api_key="vgpu_sk_xxxxxxxxxxxx",
base_url="https://api.voltagegpu.com/v1",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528-TEE",
messages=[{"role": "user", "content": "Hello!"}],
)

Rotate keys regularly from your dashboard. Never embed them in client-side code — every token spent on a leaked key is billed to your account.
API reference
Chat completions
Generate conversational responses using TEE-attested LLMs. Drop-in replacement for POST /v1/chat/completions on api.openai.com.
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| POST | /v1/chat/completions | Create a confidential chat completion | Yes |
Request body parameters
| Parameter | Required | Description |
|---|---|---|
| model | required | TEE model ID from GET /v1/models. Non-TEE IDs are rejected with 400 model_not_confidential. |
| messages | required | OpenAI-style array of {role, content} objects. |
| max_tokens | optional | Max output tokens. Defaults to 1024. |
| temperature | optional | Sampling temperature 0–2. Defaults to 0.7. |
| stream | optional | Stream tokens as SSE. Usage reconciliation runs automatically at end of stream. |
| top_p | optional | Nucleus sampling. Defaults to 1. |
| tools / tool_choice | optional | OpenAI-style function calling. Works on every TEE model tagged tools above. |
| response_format | optional | { type: "json_object" } or JSON schema. Requires a model tagged json-mode or structured. |
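As a concrete illustration of the optional parameters, here are two request bodies. The `get_weather` function is a hypothetical example tool, and both payloads assume the chosen model carries the relevant tools / json-mode tag.

```python
# JSON mode: constrain output to a single JSON object.
json_mode_request = {
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [{"role": "user", "content": "List three primes as JSON."}],
    "max_tokens": 1024,   # the documented default
    "temperature": 0.7,   # the documented default
    "response_format": {"type": "json_object"},
}

# Function calling: OpenAI-style tools array (get_weather is hypothetical).
tool_request = {
    "model": "deepseek-ai/DeepSeek-R1-0528-TEE",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}}},
        },
    }],
    "tool_choice": "auto",
}
```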
Models
List the live catalog. Only TEE-attested models are returned — this is the same filtered feed the gate uses.
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| GET | /v1/models | List all available TEE models with live pricing | Yes |
| GET | /v1/models/:id | Get a single TEE model's full metadata | Yes |
Why only chat completions?
VoltageGPU only exposes workloads that run end-to-end inside an Intel TDX enclave with hardware attestation. /v1/embeddings, /v1/images/generations, /v1/audio, /v1/video, /v1/moderations and /v1/fine-tuning all return 503 not_available — the upstream catalog offers zero confidential variants for those modalities today. As soon as a TEE embedding or diffusion model ships, we'll open the corresponding endpoint.
SDK integration
Use any OpenAI-compatible SDK by overriding the base URL. Your existing code doesn't change — the TEE enforcement happens server-side.
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
api_key="vgpu_sk_xxxxxxxxxxxx",
base_url="https://api.voltagegpu.com/v1",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528-TEE",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing"},
],
max_tokens=1024,
)
print(response.choices[0].message.content)

TypeScript (OpenAI SDK)
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'vgpu_sk_xxxxxxxxxxxx',
baseURL: 'https://api.voltagegpu.com/v1',
});
const response = await client.chat.completions.create({
model: 'Qwen/Qwen3-235B-A22B-Instruct-2507-TEE',
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Explain quantum computing' },
],
max_tokens: 1024,
});
console.log(response.choices[0].message.content);

Streaming
Set stream: true to receive tokens as Server-Sent Events. The VoltageGPU gateway automatically injects stream_options: { include_usage: true } upstream, parses the final usage chunk on the fly, and reconciles the pre-charge against the real token count when the stream closes. Long completions cannot escape billing even if the upstream drops the connection.
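On the client side, consuming such a stream amounts to reading `data:` lines, accumulating deltas, and keeping the final usage chunk. A minimal sketch, assuming chunks follow the OpenAI streaming format shown below:

```python
import json

def parse_sse_stream(lines):
    """Minimal SSE reader: collect delta content and the final usage chunk."""
    content, usage = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore comments and blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                content.append(piece)
        if "usage" in chunk:  # final chunk carries the token counts
            usage = chunk["usage"]
    return "".join(content), usage
```

Note that the usage chunk may have no `choices` entry at all, which is why the reader treats the two fields independently.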
# Enable streaming with stream: true
curl -X POST "https://api.voltagegpu.com/v1/chat/completions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528-TEE",
"messages": [{"role": "user", "content": "Write a poem about the stars"}],
"stream": true
}'
# Response is a standard OpenAI SSE stream
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"The"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sun"}}]}
data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" sets"}}]}
...
data: {"id":"chatcmpl-123","usage":{"prompt_tokens":8,"completion_tokens":42}}
data: [DONE]

Errors
All errors follow a consistent JSON envelope:
{
"error": {
"message": "Model 'deepseek-ai/DeepSeek-V3' is not Confidential Compute. Use 'deepseek-ai/DeepSeek-V3-0324-TEE' instead.",
"type": "invalid_request_error",
"code": "model_not_confidential",
"status": 400
}
}

Common status codes
| Status | Meaning |
|---|---|
| 200 | Success — request completed inside the enclave. |
| 400 | Bad request — invalid params, or model_not_confidential for a non-TEE ID. |
| 401 | Missing or invalid API key. |
| 402 | Insufficient balance — top up at voltagegpu.com/billing. |
| 429 | Rate limit exceeded. Check the X-RateLimit-* headers. |
| 503 | not_available — you hit /v1/embeddings, images, audio, video, moderations or fine-tuning. No TEE variant exists yet. |
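Because every error uses the same envelope, a client can branch on `error.code` plus the status. A sketch of that dispatch (the returned strings are illustrative advice, not API output):

```python
def classify_error(body: dict, status: int) -> str:
    """Branch on the consistent error envelope; codes from the table above."""
    code = body.get("error", {}).get("code")
    if status == 400 and code == "model_not_confidential":
        return "switch to a -TEE model id"
    if status == 401:
        return "check the vgpu_* key"
    if status == 402:
        return "top up balance"
    if status == 429:
        return "back off and retry"
    if status == 503 and code == "not_available":
        return "no TEE variant for this endpoint yet"
    return "unexpected error"
```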
Rate limits & billing
- Default rate limit: 1000 requests per minute. Contact support@voltagegpu.com for higher tiers.
- Billing model: USD per million tokens with the 1.85× markup already baked into GET /v1/models.
- Streaming: pre-charged against a conservative input-token estimate, then reconciled with the real usage block emitted at end of stream.
- Balance: debited in real time against your account. Insufficient balance returns 402 before any upstream call.
Rate limit headers on every response:
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1704715200
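A client can use these headers to pause before hitting a 429. A minimal sketch, assuming X-RateLimit-Reset is a unix timestamp in seconds (as the example value suggests):

```python
import time

def seconds_until_allowed(headers: dict) -> float:
    """Return how long to wait before the next request: 0 if budget remains,
    otherwise the time until X-RateLimit-Reset."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset - time.time())

# In a real client: time.sleep(seconds_until_allowed(response.headers))
```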
Check live usage and top up at voltagegpu.com/billing.