79+ frontier models, one endpoint.
Point your OpenAI SDK's base URL at https://llm.smoo.ai/v1. Same shapes, same streaming, same tool calls, plus unified billing, org-scoped keys, and cross-lab fallback chains that catch 429s, 5xxs, and timeouts before they hit your code.
Drop-in quickstart
Same OpenAI SDK you already use. Different base URL and your Smoo AI virtual key. That's it.
curl:

    curl https://llm.smoo.ai/v1/chat/completions \
      -H "Authorization: Bearer $SMOOAI_LLM_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemini-2.5-flash",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

Python:

    import os

    from openai import OpenAI

    # Same SDK; only the key and base URL change.
    client = OpenAI(
        api_key=os.environ["SMOOAI_LLM_KEY"],
        base_url="https://llm.smoo.ai/v1",
    )
    resp = client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)

TypeScript:

    import OpenAI from 'openai';

    // Same SDK; only the key and base URL change.
    const client = new OpenAI({
      apiKey: process.env.SMOOAI_LLM_KEY,
      baseURL: 'https://llm.smoo.ai/v1',
    });
    const resp = await client.chat.completions.create({
      model: 'gemini-2.5-flash',
      messages: [{ role: 'user', content: 'Hello' }],
    });
    console.log(resp.choices[0].message.content);

Model catalog
Pricing shown is passthrough cost in USD per million tokens, used as the basis for org-metered overage. See /pricing for plan-included allowances and volume tiers.
| Model | Family | Tier | Context | Input / 1M | Output / 1M | Best for |
|---|---|---|---|---|---|---|
| claude-opus-4-6 | Anthropic | Frontier | 200K | $15.00 | $75.00 | Deepest multi-step reasoning, long-horizon planning, high-fidelity code |
| claude-sonnet-4-6 | Anthropic | Smart | 200K | $3.00 | $15.00 | Best tool-use + diff fidelity in our coding tests (BFCL v3, τ²-bench) |
| claude-sonnet-4-5 | Anthropic | Smart | 200K | $3.00 | $15.00 | Sonnet 4.5 — kept available for prompts pinned before 4.6 |
| claude-haiku-4-5 | Anthropic | Fast | 200K | $1.00 | $5.00 | Cheap, fast, strong JSON adherence — good for judges + classifiers |
| claude-opus-4-7 | Anthropic | Frontier | 200K | $5.00 | $25.00 | 64.3% SWE-bench Pro — step-change agentic coding over 4.6 (GA 2026-04-16) |
| claude-opus-4-8 | Anthropic | Frontier | 1M | $5.00 | $25.00 | Current Anthropic flagship — 1M context, ~4× less likely than 4.7 to slip a flaw (GA 2026-05-28). Fast mode available at $10/$25 per 1M for ~2.5× speed |
| gpt-5.5 | OpenAI | Frontier | 400K | $5.00 | $30.00 | Current OpenAI flagship — GA 2026-04-24, reduced hallucination on regulated domains |
| gpt-5.5-pro | OpenAI | Frontier | 400K | $30.00 | $180.00 | GPT-5.5 Pro reasoning tier — most capable, highest cost |
| gpt-5.4 | OpenAI | Frontier | 400K | $2.50 | $15.00 | Mid-frontier GPT-5.4 — between gpt-5 and gpt-5.5 on capability + cost |
| gpt-5.4-mini | OpenAI | Smart | 400K | $0.75 | $4.50 | Cheaper smart-tier GPT-5.4 sibling |
| gpt-5.4-nano | OpenAI | Fast | 400K | $0.20 | $1.25 | Ultra-cheap GPT-5.4-nano — high-volume structured output |
| gpt-5.4-pro | OpenAI | Frontier | 400K | $30.00 | $180.00 | GPT-5.4 Pro reasoning tier |
| gpt-5.2 | OpenAI | Smart | 400K | $1.75 | $14.00 | GPT-5.2 — mid-tier between 5.1 and 5.4 |
| gpt-5.2-codex | OpenAI | Smart | 400K | $1.75 | $14.00 | GPT-5.2 with Codex-coding training |
| gpt-5.2-pro | OpenAI | Frontier | 400K | $21.00 | $168.00 | GPT-5.2 Pro reasoning tier |
| gpt-5.1 | OpenAI | Smart | 400K | $1.25 | $10.00 | GPT-5.1 — refined GPT-5 family entry |
| gpt-5.1-codex | OpenAI | Smart | 400K | $1.25 | $10.00 | GPT-5.1 with Codex-coding training |
| gpt-5 | OpenAI | Frontier | 256K | $2.50 | $10.00 | Prior flagship — kept for pinned prompts |
| gpt-5-pro | OpenAI | Frontier | 256K | $15.00 | $120.00 | GPT-5 Pro reasoning tier |
| gpt-5-codex | OpenAI | Smart | 256K | $1.25 | $10.00 | GPT-5 with Codex-coding training |
| gpt-5-mini | OpenAI | Smart | 256K | $0.50 | $2.00 | Balanced smart-tier option with GPT-5 training |
| gpt-5-nano | OpenAI | Fast | 400K | $0.20 | $1.25 | Cheapest GPT-5 variant, good for high-volume structured output |
| gpt-4.1 | OpenAI | Smart | 1M | $2.00 | $8.00 | Big context, strong coding + tool use |
| gpt-4.1-mini | OpenAI | Fast | 1M | $0.40 | $1.60 | Long-context, low-cost workhorse |
| gpt-4.1-nano | OpenAI | Fast | 1M | $0.10 | $0.40 | Ultra-cheap 1M-context option for ingestion + summaries |
| gpt-4o | OpenAI | Smart | 128K | $2.50 | $10.00 | Mature multimodal (text + image); good for stable prompts |
| gpt-4o-mini | OpenAI | Fast | 128K | $0.15 | $0.60 | Battle-tested cheap tier — wide SDK compatibility |
| o3 | OpenAI | Specialty | 200K | $2.00 | $8.00 | o-series reasoning — visible chain-of-thought, strong on math + logic |
| o3-mini | OpenAI | Specialty | 200K | $1.10 | $4.40 | Cheaper o-series reasoning sibling |
| o3-pro | OpenAI | Specialty | 200K | $20.00 | $80.00 | o3 Pro — extended reasoning for hardest problems |
| o4-mini | OpenAI | Specialty | 200K | $1.10 | $4.40 | Reasoning-optimized — strong at math, logic, code synthesis |
| omni-moderation-latest | OpenAI | Specialty | 32K | Free | Free | Free content safety classifier — used by built-in guardrails. Free from OpenAI; Smoo passes through at cost |
| gemini-3.5-flash | Google | Smart | 1M | $1.50 | $9.00 | Current Google flagship Flash — GA 2026-05-19, 76.2% Terminal-Bench 2.1 |
| gemini-2.5-pro | Google | Frontier | 1M | $1.25 | $10.00 | Frontier reasoning with 1M context; great for large-doc analysis |
| gemini-2.5-flash | Google | Smart | 1M | $0.30 | $2.50 | Best tool-use-per-dollar (BFCL v3 leader in its price band). Smoo AI default smart model |
| gemini-2.5-flash-lite | Google | Fast | 1M | $0.10 | $0.40 | Very cheap, 1M context, fast first-token |
| gemini-2.0-flash | Google | Fast | 1M | $0.10 | $0.40 | Stable 2.0 family — kept for pinned prompts |
| gemini-3-flash-preview | Google | Smart | 1M | — | — | Next-gen Flash preview — 3/3 PASS on CS escalation E2E. Preview pricing not yet published |
| gemini-3.1-flash-lite | Google | Fast | 1M | — | — | GA 3.1 Flash-Lite — 2.1s TTFT latency champion, voice-pipeline-tier. Pricing not yet posted in our catalog; check the dashboard for live rate |
| gemini-3.1-flash-lite-preview | Google | Fast | 1M | — | — | Preview alias retained for backwards-compat with pinned callers. Use gemini-3.1-flash-lite (GA) for new code |
| gemini-3-pro-preview | Google | Frontier | 1M | — | — | Next-gen Pro preview. Preview pricing not yet published |
| gemini-3.1-pro-preview | Google | Frontier | 1M | — | — | Next-gen Pro refresh preview. Preview pricing not yet published |
| groq-llama-3.3-70b | Groq | Smart | 128K | $0.59 | $0.79 | Llama 3.3 70B on Groq — fast, cheap, clean tool loops |
| groq-llama-3.1-8b | Groq | Fast | 128K | $0.050 | $0.080 | Sub-300ms first token; cheapest path through Groq. Smoo AI default fast model (used for voice pipeline) |
| groq-llama-4-scout | Groq | Smart | 10M | $0.11 | $0.34 | 10M context — ingestion + large document reasoning |
| groq-llama-4-maverick | Groq | Smart | 1M | $0.20 | $0.60 | Larger Llama 4 variant, stronger reasoning than Scout |
| groq-kimi-k2 | Groq | Smart | 128K | $1.00 | $3.00 | Kimi K2-Instruct — MoE design, strong agentic task quality |
| groq-gpt-oss-120b | Groq | Smart | 128K | $0.15 | $0.60 | OpenAI OSS 120B — best for single-turn generation. Not recommended for multi-turn tool loops; known to drop structured output |
| groq-gpt-oss-20b | Groq | Fast | 128K | $0.10 | $0.30 | OpenAI OSS 20B — cheap, fast single-shot generation. Not recommended for multi-turn tool loops |
| groq-gpt-oss-safeguard-20b | Groq | Specialty | 128K | $0.10 | $0.30 | Safety-tuned open-weight GPT — content moderation tasks |
| deepseek-v4-flash | DeepSeek | Smart | 1M | $0.14 | $0.28 | 1M context, dual Thinking/Non-Thinking modes — smooth-reasoning primary |
| deepseek-v4-pro | DeepSeek | Smart | 1M | $0.43 | $0.87 | Pro-tier V4 reasoner — 75% intro discount through 2026-05-31. List price $1.74/$3.48 per 1M; refresh when intro ends |
| deepseek-chat | DeepSeek | Smart | 1M | $0.14 | $0.28 | Legacy alias — routes to deepseek-v4-flash (retiring 2026-07-24) |
| deepseek-reasoner | DeepSeek | Smart | 1M | $0.43 | $0.87 | Legacy alias — routes to deepseek-v4-pro (retiring 2026-07-24) |
| qwen-3.7-max-direct | Alibaba DashScope | Frontier | 1M | $2.50 | $7.50 | Current Qwen flagship — agent-first, native thinking, 200 tok/s, SWE-Pro + Terminal-Bench tier winner (GA 2026-05-20). 90% cache-hit discount ($0.25/M); accepts both OpenAI ChatCompletions and Anthropic Messages format |
| qwen-3.6-plus-direct | Alibaba DashScope | Smart | 1M | $0.33 | $1.95 | Qwen 3.6 Plus generalist — GA 2026-04-02, 1M context |
| qwen3-coder-flash-direct | Alibaba DashScope | Smart | 1M | $0.30 | $1.50 | Bench-winning coder — smooth-coding primary, 16/16 aider-polyglot PASS |
| qwen3-coder-plus-direct | Alibaba DashScope | Smart | 1M | $1.00 | $5.00 | PR-review tuned coder for large diffs |
| kimi-k2.6-direct | Moonshot | Smart | 262K | $0.95 | $4.00 | Current Kimi flagship — ties GPT-5.5 on SWE-Bench Pro, GA 2026-04-20 |
| kimi-k2-thinking-direct | Moonshot | Smart | 256K | $0.60 | $2.50 | Deepest reasoner in the Kimi line — smooth-reasoning fallback |
| kimi-k2.5-direct | Moonshot | Smart | 256K | $0.60 | $2.50 | Flagship general-purpose Kimi via Moonshot direct |
| glm-5.1-direct | Z.ai | Smart | 200K | $0.60 | $2.20 | 58.4% SWE-Pro — coder-forward, smooth-coding fallback (GA 2026-04-07) |
| glm-5-direct | Z.ai | Smart | 200K | $0.60 | $1.92 | Faster GLM (78 tok/s vs 5.1's 54) — GA 2026-02-11 |
| minimax-m2-direct | MiniMax | Smart | 200K | $0.30 | $1.20 | Frontier-class reviewer at $0.30 input — smooth-reviewing primary |
| minimax-m2.7-direct | MiniMax | Smart | 200K | $0.30 | $1.20 | Current MiniMax flagship — same price as M2 |
| minimax-m2.7-highspeed-direct | MiniMax | Smart | 200K | $0.60 | $2.40 | Throughput-optimized M2.7 — ~100 tps |
| deepseek-v3.2 | DeepSeek (via aggregator) | Smart | 128K | $0.27 | $1.10 | Aggregator-routed V3.2 — emergency failover only |
| deepseek-r1 | DeepSeek (via aggregator) | Frontier | 64K | $0.55 | $2.19 | Aggregator-routed R1 reasoner — emergency failover only |
| glm-5.1 | Z.ai (via aggregator) | Smart | 128K | $0.60 | $2.20 | Aggregator-routed GLM 5.1 — emergency failover only |
| minimax-m2.7 | MiniMax (via aggregator) | Smart | 200K | $0.30 | $1.20 | Aggregator-routed M2.7 — emergency failover only |
| minimax-m2.5 | MiniMax (via aggregator) | Smart | 200K | $0.30 | $1.20 | Aggregator-routed M2.5 — emergency failover only |
| kimi-k2.5 | Moonshot (via aggregator) | Smart | 256K | $0.60 | $2.50 | Aggregator-routed K2.5 — emergency failover only |
| kimi-k2-thinking | Moonshot (via aggregator) | Smart | 256K | $0.60 | $2.50 | Aggregator-routed K2-Thinking — emergency failover only |
| qwen3-coder-plus | Alibaba (via aggregator) | Smart | 1M | $1.00 | $5.00 | Aggregator-routed Coder-Plus — emergency failover only |
| qwen3-coder-flash | Alibaba (via aggregator) | Smart | 1M | $0.30 | $1.50 | Aggregator-routed Coder-Flash — emergency failover only |
| text-embedding-3-small | OpenAI | Embedding | 8K | $0.020 | — | 1536-dim embeddings — Smoo AI default for knowledge base ingestion |
| text-embedding-3-large | OpenAI | Embedding | 8K | $0.13 | — | 3072-dim embeddings — higher retrieval quality for specialist corpora |
| gemini-embedding-001 | Google | Embedding | 8K | $0.15 | — | 3072-dim Gemini embeddings — strong on multilingual + code |
| gemini-embedding-002 | Google | Embedding | 8K | — | — | First natively multimodal embedding — text + images + video + audio in one space, Matryoshka 3072→1536→768 dims (GA 2026-04-23). Strict upgrade over -001 for new RAG surfaces; keep -001 for index compatibility |
Prices refresh as upstream labs publish changes — your dashboard shows live rates and the effective rate after your plan's tier allowance. Overage is billed per your subscription tier.
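As a rough illustration of how the per-million-token prices in the catalog translate into a per-request cost, here is a minimal sketch. The `PRICES_PER_MILLION` dict and `estimate_cost` helper are hypothetical names used for this example only; the prices are copied from the table above and ignore plan allowances, volume tiers, and any discounts.

```python
from decimal import Decimal

# Passthrough prices in USD per 1M tokens (input, output), copied from the
# catalog table. This dict is a hypothetical helper, not part of any SDK.
PRICES_PER_MILLION = {
    "gemini-2.5-flash": (Decimal("0.30"), Decimal("2.50")),
    "claude-haiku-4-5": (Decimal("1.00"), Decimal("5.00")),
    "groq-llama-3.1-8b": (Decimal("0.050"), Decimal("0.080")),
}

MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the passthrough cost in USD for a single request."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_price * input_tokens + output_price * output_tokens) / MILLION


# 12,000 prompt tokens and 800 completion tokens on gemini-2.5-flash:
# 12000 * 0.30 / 1e6 + 800 * 2.50 / 1e6 = 0.0036 + 0.0020 = 0.0056 USD
print(estimate_cost("gemini-2.5-flash", 12_000, 800))
```

Decimal arithmetic is used so that small per-request amounts do not pick up floating-point rounding noise when summed over many requests.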
Smooth semantic aliases
Point the Smooth coding runtime at llm.smoo.ai/v1 and use stable intent-based model names. Aliases re-target to new upstream models as better options ship — your code doesn't change.
| Alias | Resolves to | Purpose |
|---|---|---|
| smooth-coding | qwen3-coder-flash | Coding workhorse — best agentic tool-use, native OpenAI tool_calls, 1M ctx |
| smooth-reasoning | deepseek-v4-flash | Deep reasoning + planning — 1M ctx, dual Thinking/Non-Thinking, $0.14/$0.28 |
| smooth-reviewing | MiniMax-M2 | Adversarial critique — different lab from the coder |
| smooth-judge | groq/llama-3.1-8b-instant | Sub-300ms JSON judge for guardrails + Narc verdicts |
| smooth-summarize | gemini-2.5-flash | Long-context summaries + compression (1M ctx) |
| smooth-planning | gemini-2.5-flash | Structured mapper / planning flows |
| smooth-fast | groq/llama-3.1-8b-instant | Sub-300ms utility — session naming, titles, autocomplete, voice fast-router |
| smooth-thinking | deepseek-v4-flash | Deprecated alias for smooth-reasoning (legacy callers) |
Every alias has a fallback chain — a single provider outage degrades to the next-best option rather than failing the request.
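Because an alias is just a model name, calling one should look the same as calling any catalog model. The sketch below builds such a request with the Python standard library only; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the request shape assumes the OpenAI chat completions format shown in the quickstart.

```python
import json
import urllib.request

BASE_URL = "https://llm.smoo.ai/v1"


def build_chat_request(alias: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a smooth-* alias."""
    body = {
        "model": alias,  # e.g. "smooth-coding"; the gateway resolves the alias
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the first choice's text (needs a real key)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Building the request performs no network I/O, so it is safe to inspect.
request = build_chat_request("smooth-coding", "Explain this diff", "sk-example")
print(request.full_url)
print(json.loads(request.data)["model"])
```

If the alias is later re-targeted to a different upstream model, only the gateway-side mapping changes; the `model` string in the request body stays the same.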
Smooth Routing — bench-backed primaries, cross-vendor fallback chains
Each smooth-* alias points at the current bench-leader for that purpose and falls back through a hand-curated chain that spans at least three labs. A Vertex 503 or DeepSeek hiccup degrades to the next provider before the caller sees an error. Chains are re-evaluated each time the published-benchmark picture shifts (most recent refresh: May 2026).
smooth-coding
Best agentic tool-use in the cheap-coder tier: native OpenAI tool_calls with no thinking-mode contract. DeepSeek-V4 leads SWE-bench Verified at 80.6%, but its thinking mode trips LiteLLM's reasoning_content guard on turn 1, so we pin Qwen as the entry-point primary and let DeepSeek-class reasoning land in smooth-reasoning. 1M context, $0.30/$1.50.
Fallback chain: glm-5.1 (Z.ai) → kimi-k2-thinking (Moonshot) → MiniMax-M2 (MiniMax)
smooth-reasoning
1M context, dual Thinking / Non-Thinking modes, $0.14/$0.28 per 1M: the cheapest top-tier reasoner.
Fallback chain: kimi-k2-thinking (Moonshot) → qwen3-235b-thinking-2507 (Alibaba)
smooth-reviewing
Cheap ($0.30/$1.20), coding-forward, agentic. It comes from a different lab than the coder, so it catches bugs the coder's training didn't see.
Fallback chain: qwen3-coder-plus (Alibaba) → glm-5.1 (Z.ai) → kimi-k2-thinking (Moonshot)
smooth-judge
Sub-300ms first-token. Verdicts are 1-line JSON; Gemini Flash was overkill at ~10× the cost, and the current primary matches its accuracy in the small-verdict shape.
Fallback chain: gemini-2.5-flash (Google) → claude-haiku-4-5 (Anthropic) → gpt-5-mini (OpenAI)
smooth-summarize
1M context, IFEval leader in the cheap Gemini tier, dialable thinking levels.
Fallback chain: qwen3-coder-plus (Alibaba, 1M ctx backstop) → gpt-5-mini (OpenAI)
smooth-fast
Sub-300ms first-token; P95 ~600ms end-to-end on production traffic. About 10× cheaper than the prior Gemini Flash Lite primary and about 1s faster on cold starts.
Fallback chain: gemini-2.5-flash-lite (Google) → claude-haiku-4-5 (Anthropic) → gpt-5-mini (OpenAI)
smooth-planning
Long context, structured-output friendly, cheap. Avoids Kimi K2-Thinking, which is overkill and slow for read-only mapping work.
Fallback chain: smooth-reasoning (DeepSeek) → qwen3-coder-plus (Alibaba, 1M ctx) → claude-haiku-4-5 (Anthropic)
Every primary also has a per-lab variant (e.g. smooth-coding-qwen, smooth-fast-gemini) so callers can pin a specific lab without the cross-vendor chain — useful for compliance, A/B testing, or sticky cache locality.
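The gateway handles fallback server-side, so callers do not need to write any of this themselves. Purely to illustrate the idea of a cross-vendor chain, the sketch below walks an ordered list of models and moves on when a provider returns a 429, a 5xx, or times out. Every name here (`RetryableUpstreamError`, `call_with_fallback`, `fake_call`) is hypothetical and does not describe the gateway's actual implementation.

```python
from typing import Callable, Optional, Sequence, Tuple


class RetryableUpstreamError(Exception):
    """Hypothetical error carrying the HTTP status of a failed provider call.

    A status of None stands in for a timeout.
    """

    def __init__(self, model: str, status: Optional[int]):
        super().__init__(f"{model} failed with status {status}")
        self.model = model
        self.status = status


def is_retryable(status: Optional[int]) -> bool:
    """429s, 5xxs, and timeouts (status None) should trigger a fallback."""
    return status is None or status == 429 or 500 <= status <= 599


def call_with_fallback(
    chain: Sequence[str], call: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each model in order; return (model, result) for the first success."""
    last_error: Optional[Exception] = None
    for model in chain:
        try:
            return model, call(model)
        except RetryableUpstreamError as err:
            if not is_retryable(err.status):
                raise  # e.g. a 400 is the caller's bug; falling back won't help
            last_error = err
    raise RuntimeError("all providers in the chain failed") from last_error


def fake_call(model: str) -> str:
    """Stand-in for a provider call: the primary is down with a 503."""
    if model == "qwen3-coder-flash":
        raise RetryableUpstreamError(model, 503)
    return f"answer from {model}"


chain = ["qwen3-coder-flash", "glm-5.1", "kimi-k2-thinking", "MiniMax-M2"]
print(call_with_fallback(chain, fake_call))
```

In this sketch, non-retryable statuses are re-raised immediately, on the assumption that a malformed request would fail on every provider in the chain.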
Why route through Smoo AI
Unified billing
One invoice across every lab. Tier-based token allowances, per-org metering, Stripe-synced overage.
Org-scoped virtual keys
Each organization gets its own key with optional model allowlist and budget cap. Rotate from the dashboard — no downtime.
Cross-lab fallback chains
Every model has a typed fallback chain spanning at least three labs. A Vertex 503, an Anthropic 429, or a DeepSeek timeout degrades to the next provider before surfacing an error.
Drop-in compatibility
OpenAI SDK, LangChain, LlamaIndex, Vercel AI SDK — anything that takes a base URL works unchanged.
Streaming, tool use, JSON mode
Everything the upstream model supports passes through untouched, plus kwargs the OpenAI shape does not cover.
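To make the streaming pass-through concrete, here is a minimal sketch of consuming a streamed response without an SDK. It assumes the gateway emits the standard OpenAI server-sent-event framing (`data: {...}` lines ending with `data: [DONE]`); `build_streaming_payload` and `iter_text_deltas` are hypothetical helper names, and the sample lines are hand-written rather than captured from the gateway.

```python
import json
from typing import Dict, Iterable, Iterator


def build_streaming_payload(model: str, prompt: str) -> Dict[str, object]:
    """Request body for a streamed chat completion in the OpenAI shape."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }


def iter_text_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # role-only or empty deltas carry no text
                yield text


# Hand-written sample of what a streamed response body might look like.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))
```

In practice the OpenAI SDK does this parsing for you when `stream=True` is set; the sketch is only meant to show what passes over the wire.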
Live OpenAPI spec
Full interactive reference below. The spec tracks upstream provider capabilities in real time.
Interactive API reference
Full endpoint + schema reference for everything the gateway serves — chat completions, embeddings, moderations, and key management. Try every endpoint inline.
/models
Model List
Use `/model/info` to get detailed model information such as pricing and mode. This endpoint exists only for compatibility with OpenAI projects like aider.
Query Parameters
| Name | Type | Description |
|---|---|---|
| return_wildcard_routes | string | |
| team_id | string | |
| include_model_access_groups | string | |
| only_model_access_groups | string | |
| include_metadata | string | Include additional metadata in the response, with fallback information. |
| fallback_type | string | Type of fallbacks to include: "general", "context_window", or "content_policy". Defaults to "general" when include_metadata=true. |
| scope | string | Optional. Currently only accepts "expand". When scope=expand is passed, proxy admins, team admins, and org admins receive all proxy models as if they were a proxy admin. |
Responses
200 Successful Response
422 Validation Error
| Property | Type | Description |
|---|---|---|
| detail | array | |
Code Examples
curl -X GET https://llm.smoo.ai/models \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN"
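The same call with query parameters can be sketched in Python using only the standard library. `build_models_request` and `fetch_models` are hypothetical helper names; the parameter names and allowed values come from the query parameter description above, and the response is returned as parsed JSON without assuming any particular schema.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Optional

GATEWAY_ROOT = "https://llm.smoo.ai"


def build_models_request(
    token: str,
    include_metadata: bool = False,
    fallback_type: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET /models request, optionally asking for fallback metadata."""
    params = {}
    if include_metadata:
        params["include_metadata"] = "true"
        if fallback_type is not None:
            # One of "general", "context_window", "content_policy"
            params["fallback_type"] = fallback_type
    url = f"{GATEWAY_ROOT}/models"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="GET"
    )


def fetch_models(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and return the decoded JSON body (needs a real token)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does no network I/O, so the URL can be inspected safely.
request = build_models_request(
    "YOUR_ACCESS_TOKEN", include_metadata=True, fallback_type="context_window"
)
print(request.full_url)
```

`fallback_type` is only attached when `include_metadata` is set, since the description above ties its default to `include_metadata=true`.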