
Rungate vs LiteLLM.

LiteLLM is a widely adopted provider-normalization layer: 100+ LLMs behind an OpenAI-compatible interface, usable as a Python SDK or a proxy server. Rungate is a run-level governance control plane. They overlap at the proxy surface but model the problem at different layers. Both are open source: LiteLLM under MIT, Rungate under Apache 2.0.

TL;DR

Breadth: LiteLLM wins by a lot. 100+ providers including Bedrock, SageMaker, Vertex, and self-hosted Ollama/vLLM, plus specialist endpoints (embeddings, TTS, rerank). Rungate ships OpenAI + Anthropic today, with Gemini planned.
Budgets: LiteLLM caps per virtual key (returns HTTP 401 when exceeded). Rungate caps per run (returns HTTP 402 Payment Required with full context). Different layer, different semantics.
Approval gates: Rungate has HTTP 202 pause/resume on policy-gated tool calls. LiteLLM has MCP require_approval config but no documented pause-retry flow.
Run-level state: Not present in LiteLLM. Budgets, policies, and audit attach to keys/teams/users. No concept of a multi-call workflow as a governed unit.
Deployment: LiteLLM offers two modes (SDK + proxy). Rungate is proxy-only. For script-heavy Python workflows, LiteLLM’s SDK mode is a real convenience.

Capability comparison

Built from the LiteLLM GitHub README and docs.litellm.ai as of 2026-04-20. Corrections welcome at [email protected].

Capability	Rungate	LiteLLM
Deployment modes	Proxy (HTTP)	SDK (Python) + Proxy (HTTP)
Governed primitive	Run (workflow)	Virtual key
Cross-call budget	Yes — spans all calls in a run	Per key; not per workflow
Budget-exceeded HTTP code	402 Payment Required (semantic)	401 Authentication Error
Budget-exceeded context in response	Rich `context` object: spend, limit, rule, run_id, steps	Error message with spend + budget
Human approval gate	HTTP 202 pause/resume on policy match	MCP `require_approval` flag; no HTTP 202 semantics
Per-request policy override	`x-rungate-policy` header	Config on the key
Supported providers	OpenAI, Anthropic (Gemini planned)	100+ across all major families
Self-hosted model support	—	vLLM, Ollama, LM Studio, Llamafile
Specialist endpoints	—	Embeddings, TTS, transcription, rerank
Observability callbacks	Native + OTel export	Lunary, MLflow, Langfuse, many others
Full-run audit reconstruction	Native — every call, tool, gate, approver, cost	Per-request logs
Latency at scale	Production-deployed; benchmarks pending	8ms p95 at 1k RPS
Teams + virtual-key hierarchy	Org → member → agent tokens	Team → key → user
License	Apache 2.0	MIT
Managed cloud	Rungate cloud (Free → Enterprise)	LiteLLM Cloud

When LiteLLM is the right choice

You need maximum provider coverage.

100+ LLMs across OpenAI, Anthropic, Google, AWS (Bedrock, SageMaker), Azure, Cohere, self-hosted, and specialist endpoints. If any part of your workflow touches an unusual provider, LiteLLM probably already has an adapter.

You want the SDK path, not a proxy.

LiteLLM can be a Python library you import — no proxy to run. For batch scripts, notebooks, or one-off pipelines where an extra hop is overhead, this is a real convenience. Rungate is proxy-only.

You need specialist endpoints (embeddings, TTS, rerank).

LiteLLM normalizes /embeddings, /audio/transcriptions, /audio/speech, /rerank. If your pipeline mixes chat with embeddings or transcription across providers, this is the right tool.

You already have an observability stack.

LiteLLM’s callback integrations for Lunary, MLflow, Langfuse, and others mean you can keep your existing dashboards. Rungate exports OpenTelemetry and emits its own webhooks, but its list of off-the-shelf integrations is narrower.

When Rungate is the right choice

Your budget needs to span a workflow, not a key.

“This customer-support run can’t spend more than $2 across every LLM call + tool invocation” is native in Rungate. In LiteLLM you can cap a virtual key, but once the workflow ends and the next one starts, the old spend is still on the key until you reset it.
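
A toy ledger makes the difference concrete. This is a minimal sketch of the accounting idea only, not either product's implementation: `Ledger`, `run_workflow`, the scope IDs, and the dollar amounts are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push a scope past its cap."""


class Ledger:
    """Toy spend tracker keyed by an arbitrary scope ID (hypothetical)."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spend = defaultdict(float)  # scope ID -> cumulative USD

    def charge(self, scope: str, cost_usd: float) -> None:
        if self.spend[scope] + cost_usd > self.cap_usd:
            raise BudgetExceeded(scope)
        self.spend[scope] += cost_usd


def run_workflow(ledger: Ledger, scope: str, costs: list[float]) -> bool:
    """Charge each call in order; return False if the cap blocks one."""
    try:
        for cost in costs:
            ledger.charge(scope, cost)
    except BudgetExceeded:
        return False
    return True


calls = [0.5, 0.5, 0.5, 0.5]  # one workflow: four calls, $2.00 total

# Key-scoped: both workflows charge the same key, so spend carries over.
by_key = Ledger(cap_usd=2.0)
key_results = [run_workflow(by_key, "key-1", calls) for _ in range(2)]

# Run-scoped: each workflow charges its own run ID and starts from zero.
by_run = Ledger(cap_usd=2.0)
run_results = [run_workflow(by_run, f"run-{i}", calls) for i in range(2)]
```

Under the key-scoped ledger the second workflow is blocked on its first call because the first workflow's $2.00 is still on the key. Under the run-scoped ledger both workflows complete.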

You need HTTP 402 + rich error context.

When Rungate blocks a call, you get HTTP 402 Payment Required with a context object an agent can parse: cumulative spend, limit, run ID, triggering rule. LiteLLM returns HTTP 401 and a string error message — workable, but semantically it conflates “bad credentials” with “over budget,” which complicates agent error-handling code.
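
Here is one way an agent might consume that response. Treat it as a sketch under stated assumptions: the field names come from the code comment later on this page (`cumulative_spend`, `limit_usd`, `run_id`, `rule`, `steps_completed`), but their placement under a top-level `context` key, their types, and the sample values are assumptions, and `BudgetContext` and `parse_budget_block` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class BudgetContext:
    """Fields named on this page; their types are assumed here."""
    run_id: str
    rule: str
    cumulative_spend: float
    limit_usd: float
    steps_completed: int


def parse_budget_block(status: int, body: dict[str, Any]) -> Optional[BudgetContext]:
    """Return the budget context for a 402 response, else None."""
    if status != 402:
        return None
    ctx = body.get("context") or {}  # assumed location of the context object
    try:
        return BudgetContext(
            run_id=str(ctx["run_id"]),
            rule=str(ctx["rule"]),
            cumulative_spend=float(ctx["cumulative_spend"]),
            limit_usd=float(ctx["limit_usd"]),
            steps_completed=int(ctx["steps_completed"]),
        )
    except (KeyError, TypeError, ValueError):
        return None  # malformed body: caller treats it as an opaque 402


# Hypothetical response body, for illustration only.
sample = {
    "context": {
        "run_id": "run_workflow_abc123",
        "rule": "max-spend-per-run",
        "cumulative_spend": 1.02,
        "limit_usd": 1.00,
        "steps_completed": 7,
    }
}
blocked = parse_budget_block(402, sample)
```

With a typed object in hand, the agent can log the triggering rule, stop dispatching on that `run_id`, and decide whether a new run is warranted.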

You need human-in-the-loop approval on specific tool calls.

Rungate’s policy engine can mark a tool as “requires approval” and pause the run via HTTP 202 until a human approves in Slack, email, or the dashboard. Once approved, the agent retries the same call and it goes through. LiteLLM’s MCP require_approval flag doesn’t model the pause-retry flow: it’s a tool-side config, not a runtime gate with resume semantics.
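
From the agent's side, the pause/retry flow can be sketched as a small loop. This is an illustration, not Rungate's client library: `send` is a hypothetical callable that issues the same request and returns the HTTP status, and the attempt limit and wait interval are assumptions. The only behavior taken from this page is that a 202 means "pending approval" and that the agent retries the same call.

```python
import time
from typing import Callable


def call_with_approval(
    send: Callable[[], int],
    max_attempts: int = 5,
    wait_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-send the same request while the gateway answers 202 (pending).

    Returns the first non-202 status, or 202 if approval never arrived
    within max_attempts.
    """
    status = 202
    for attempt in range(max_attempts):
        status = send()
        if status != 202:
            return status
        if attempt < max_attempts - 1:
            sleep(wait_seconds)  # give the human approver time to respond
    return status


# Simulated gateway: pending twice, then approved.
responses = iter([202, 202, 200])
final_status = call_with_approval(lambda: next(responses))
```

In a real agent, `wait_seconds` would be long enough for a human to respond, and a final 202 would be surfaced as "still waiting on approval" rather than treated as a failure.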

Compliance asks “what did the agent do?” and wants one coherent answer.

Rungate reconstructs a run as one audit trail: every call, tool, retry, approver, cost — in order, correlated, exportable. LiteLLM’s logs give you rows that you’d join yourself on key and timestamp.

The architectural difference, in code

Same scenario — an agent doing multi-step work against a $1 budget cap — expressed in each.

# LiteLLM — virtual key with per-key budget.
# The key is the unit of spend tracking + isolation.
# The proxy blocks requests synchronously when the key's cap is hit.

import openai

client = openai.OpenAI(
    base_url="https://your-litellm-proxy/v1",
    api_key="sk-your-virtual-key-with-1usd-cap",  # budget on the key
)

# All calls in this workflow use the same virtual key.
# LiteLLM accrues cost-per-token into the key's running total.
# When the key hits max_budget, the NEXT call fails.
r1 = client.chat.completions.create(...)   # ok
# ... workflow calls ...
# HTTP 401 "Authentication Error, ExceededTokenBudget: ..."
rN = client.chat.completions.create(...)
# Note: LiteLLM returns 401 for budget-exceeded, which is
# semantically an auth failure. HTTP 402 Payment Required would
# be more precise for spending-cap violations.

# Rungate — run-level model.
# The agent token is a stable credential, not a budget carrier.
# Budget attaches to the run (workflow) via policy.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.rungate.dev/v1",
    api_key="rg_agt_your_agent_token_here",
)

# Every call in THIS workflow carries the run ID.
# Rungate accrues spend against that run.
# Tomorrow's workflow under the same token has its own budget.
RUN = "run_workflow_abc123"
r1 = client.chat.completions.create(
    ..., extra_headers={"x-rungate-run-id": RUN, "x-rungate-policy": "prod-v4"},
)
# ... many calls + tool invocations + retries later ...
# HTTP 402 Payment Required with context: { cumulative_spend,
# limit_usd, run_id, rule, steps_completed }.
# Agent parses context, stops dispatching on this run, starts a
# new run with a fresh budget if appropriate.

Why the status code matters: an agent looking at HTTP 401 is supposed to retry with different credentials. An agent looking at HTTP 402 is supposed to stop dispatching and escalate. LiteLLM’s choice of 401 for budget-exceeded conflates these, which forces agent error handling to inspect the error body instead of the status. Small thing, real friction.
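
The friction shows up in the error handler. In this minimal sketch, the `Action` names and `next_action` are hypothetical, and the `ExceededTokenBudget` substring is taken from the error string quoted in the LiteLLM snippet above. Matching on message text is an assumption about how an agent would disambiguate, not a documented contract.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    REAUTHENTICATE = "reauthenticate"
    STOP_AND_ESCALATE = "stop_and_escalate"


def next_action(status: int, error_message: str = "") -> Action:
    """Map a gateway response to what the agent should do next."""
    if status == 402:
        return Action.STOP_AND_ESCALATE  # the status alone is enough
    if status == 401:
        # A 401 may mean bad credentials or an exhausted budget, so the
        # message text has to break the tie.
        if "ExceededTokenBudget" in error_message:
            return Action.STOP_AND_ESCALATE
        return Action.REAUTHENTICATE
    if 200 <= status < 300:
        return Action.PROCEED
    raise ValueError(f"status {status} is out of scope for this sketch")
```

The 402 branch is one comparison. The 401 branch depends on string matching, which breaks silently if the message wording changes.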

Transparency: HTTP 402 is IANA-reserved “for future use.” We picked it because the emerging de facto use across the industry — Stripe, GitHub, and a growing list of SaaS APIs — is billing/quota exhaustion, which is exactly what a budget cap is. 402 isn’t a ratified standard here, but it’s a better semantic fit than 401.

Can you use both?

Yes. A common shape: agent → Rungate (run-level governance) → LiteLLM (provider breadth + routing) → provider. You get Rungate’s run-level enforcement and LiteLLM’s 100+ LLMs. The cost is one extra hop. If you genuinely need both — wide provider fabric and run-level governance — this is a legitimate architecture.

Try Rungate

Apache 2.0 self-hosted today. Point your existing OpenAI or Anthropic SDK at https://api.rungate.dev/v1 with an rg_agt_* token. See the agent-native reference for code in Python, TypeScript, and curl.
