openai-compatible · routing · langchain · cost-control

OpenAI-Compatible LLM Routing: Add Cost Control Without Rewriting Your Code

February 2026 · 6 min read

You're already calling OpenAI (or LangChain, LlamaIndex, Vercel AI SDK). Adding multi-model routing and cost caps usually means a new abstraction or rewriting calls. An OpenAI-compatible routing API flips that: same request and response shape as Chat Completions—you only change base_url and api_key. No new SDKs, no wrapper classes. This post explains what that means and how to use it.

What “OpenAI compatible” means here

Many routing or proxy services expose an endpoint that matches the OpenAI Chat Completions API: same HTTP path (/chat/completions), same JSON body (model, messages, stream, max_tokens, temperature, and optionally tools), and the same response shape (choices, usage, id, etc.). So any client built for OpenAI (official SDK, LangChain, or custom fetch) can point at that endpoint instead. You’re not adopting a new API—you’re pointing your existing integration at a router that then picks the model and enforces your rules.

That router can sit in front of your API keys (OpenAI, Anthropic, Google, Groq, DeepSeek). It selects the best model per request (by strategy and cost cap), calls the provider, and returns the answer in OpenAI format. You get multi-model routing and cost control without changing how your app constructs requests or parses responses.

Who it’s for

  • OpenAI Python SDK – You use openai.ChatCompletion.create(...) or the async client. Set base_url to the router and api_key to your router JWT.
  • LangChain – You use ChatOpenAI. Set openai_api_base and openai_api_key to the router; model can be a strategy (e.g. lowest-cost, balanced) instead of a concrete model name.
  • LlamaIndex, Vercel AI SDK, or any OpenAI-shaped client – Same idea: point the client at the router’s base URL and send your router token as the key.

Your code stays the same; only the destination of the request changes. The router handles strategy, cost caps, failover, and logging.
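For the OpenAI Python SDK, the switch can even happen outside your call sites: openai-python (v1+) reads its base URL and key from environment variables when you construct the client with no arguments. A minimal sketch (the JWT value is a placeholder):

```python
import os

# The OpenAI SDK picks these up automatically in OpenAI() / AsyncOpenAI(),
# so no request-building code changes at all.
os.environ["OPENAI_BASE_URL"] = "https://stepblend.com/api/v1"
os.environ["OPENAI_API_KEY"] = "YOUR_STEPBLEND_JWT"
```

This is handy for staging vs. production: route through StepBlend in one environment and hit OpenAI directly in another, with identical application code.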

The model field: strategy or specific model

With a native routing API you might send strategy: "lowest-cost" and max_cost: 0.01. In an OpenAI-compatible API, the model field is reused so you don’t need a second API shape:

  • Strategy: Set model to lowest-cost, balanced, fastest, or max-reliability. The router picks the best model for that strategy (and optional cost cap). Same behavior as choosing a routing strategy.
  • Specific model: Set model to gpt-4o, claude-3-5-sonnet-20241022, or openai:gpt-4o. The router forces that model. Good for multi-tenant overrides or tool calling (when the client sends tools).

Optional body fields (e.g. stepblend_max_cost, stepblend_strategy) let you pass a per-request cost cap or override the strategy without changing the model string. Same routing engine and max cost semantics as the native API.

Quick examples

OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://stepblend.com/api/v1",
    api_key="YOUR_STEPBLEND_JWT"
)

r = client.chat.completions.create(
    model="lowest-cost",
    messages=[{"role": "user", "content": "Summarize this in one sentence."}],
    max_tokens=100,
    extra_body={"stepblend_max_cost": 0.01}  # optional per-request cost cap
)
print(r.choices[0].message.content)
print((r.model_extra or {}).get("stepblend"))  # router metadata (routed model, provider, cost), if present

LangChain:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_base="https://stepblend.com/api/v1",
    openai_api_key="YOUR_STEPBLEND_JWT",
    model="balanced"
)
print(llm.invoke("What is an LLM routing layer?").content)

curl:

curl -X POST https://stepblend.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_JWT" \
  -H "Content-Type: application/json" \
  -d '{"model":"balanced","messages":[{"role":"user","content":"Hi"}],"max_tokens":50}'

In all cases, your StepBlend JWT is the only secret; provider keys stay in StepBlend and are used by the router. Responses include the usual choices, usage, and optionally a stepblend object with routed_model, provider, and estimated_cost_usd.
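If you call the endpoint with a plain HTTP client rather than an SDK, that metadata is just another top-level key on the response JSON. A minimal sketch, assuming the stepblend object has the fields listed above (responses without it degrade gracefully):

```python
def summarize_response(resp: dict) -> str:
    """Pull the answer plus routing metadata out of an OpenAI-shaped response.

    The `stepblend` key is this router's addition; a plain OpenAI
    response simply omits it, so we default every field.
    """
    answer = resp["choices"][0]["message"]["content"]
    meta = resp.get("stepblend", {})
    routed = meta.get("routed_model", "unknown")
    cost = meta.get("estimated_cost_usd", 0.0)
    return f"{answer} (via {routed}, ~${cost:.4f})"

sample = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}],
    "usage": {"total_tokens": 10},
    "stepblend": {
        "routed_model": "gpt-4o-mini",
        "provider": "openai",
        "estimated_cost_usd": 0.0002,
    },
}
print(summarize_response(sample))  # → Hi! (via gpt-4o-mini, ~$0.0002)
```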

What you get without rewriting

  • One setting change. Point your client at the router’s base URL and use your router token. No new request builders or response parsers.
  • Same routing and caps. The OpenAI-compatible endpoint uses the same engine as the native routing API: strategies, per-request cost caps, force model, failover, and logging.
  • Visibility. Requests show up in the same Control Center and count toward the same plan limits. You see which model ran and what it cost.
  • Tool calling. When your request includes tools, send a specific model (e.g. gpt-4o); the router proxies to that provider with the same request shape. Supported for OpenAI, Groq, and DeepSeek.
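A tool-calling request, then, looks like any other Chat Completions body with a concrete model instead of a strategy. A sketch using OpenAI's function-calling schema (the get_weather tool is a made-up example):

```python
# When `tools` are present, set a specific model per the guidance above,
# so the router proxies the request straight to that provider.
tool_request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
```

The response comes back in OpenAI format as well, so existing tool-call parsing (reading choices[0].message.tool_calls) keeps working unchanged.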

So you keep your existing code and gain routing, cost control, and a single place to see spend—without a new SDK or a full migration. For full request/response details, error codes, and rate limits, see the OpenAI-compatible API docs. To try routing and caps in the UI first, use the Optimizer or read the routing API.

Ready to add control to your AI calls?

Route through one endpoint. Set cost caps, pick strategies, and see spend—your API keys, no token resale.

Try the Optimizer
