March 12, 2026 · 8 min read

Automatic LLM Failover: Never Let a Rate Limit Break Your App

Rate limits, provider outages, and model degradations are a fact of life when building with LLMs. Automatic failover lets your app keep running even when OpenAI goes down — without a single line of retry logic in your code.

The Fragility Problem

Every production LLM application has the same hidden vulnerability: it depends entirely on a single provider being up, responsive, and within rate limits. When that provider has a bad day — and they all do — your application either errors out or hangs.

OpenAI's API has experienced dozens of significant incidents in the past year alone. Anthropic, Groq, and every other provider have their own reliability profiles. More common than full outages are rate limit errors: HTTP 429 responses that mean you've exceeded the provider's requests-per-minute or tokens-per-minute ceiling for your tier.

For most developers, the response to rate limits is an exponential backoff loop: wait a second, retry; wait two seconds, retry; wait four seconds, retry. This works if your users are willing to sit through 10–30 seconds of latency. In a production application with real users, it doesn't.
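That backoff loop is worth seeing concretely. The sketch below is a minimal version of the pattern (`call_llm` is a hypothetical stand-in for your provider call, and `base_delay` is parameterized only for illustration):

```python
import time

def call_with_backoff(call_llm, max_retries=4, base_delay=1.0):
    """Retry call_llm with exponential backoff: base, 2x base, 4x base, ..."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return call_llm()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: the error finally reaches the user
            time.sleep(delay)  # the user is waiting through every one of these
            delay *= 2
```

With the defaults, the worst case is 1 + 2 + 4 = 7 seconds of sleeping before the final attempt even starts, which is exactly the latency problem described above.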

What Automatic Failover Does

Automatic LLM failover means that when your primary provider returns an error — whether a 429 Too Many Requests, a 503 Service Unavailable, or a 500 Internal Server Error — the gateway immediately retries the same request against a secondary provider, transparently, without the client ever seeing the error.

From the perspective of your application code, the request succeeds. The only difference might be a slightly higher latency (the time to detect the failure and route to the secondary) and the fact that the response came from a different model.

How Provider Failover Works in Proxide

When you route through Proxide, you configure a fallback chain in your dashboard — an ordered list of providers and models to try in sequence:

  1. Primary: openai/gpt-4o
  2. Secondary: anthropic/claude-3-5-sonnet-20241022
  3. Tertiary: groq/llama-3.3-70b-versatile

Proxide attempts the primary provider first. If it encounters any of the following error conditions, it immediately routes the request to the next provider in the chain:

  • HTTP 429 (rate limit)
  • HTTP 500, 502, 503 (server errors)
  • Connection timeout (> 10 seconds with no response)
  • Partial response timeout (stream stalled)

The failover is transparent to your client. Your application sends a single request to Proxide and receives a single response — it never needs to know that the request was routed to a secondary.
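Conceptually, the routing loop looks something like the following sketch. This is not Proxide's actual implementation — `send_to_provider` is a hypothetical transport function, and the retryable-status set mirrors the error list above:

```python
# Statuses that trigger failover, per the list above; other errors propagate.
RETRYABLE_STATUSES = {429, 500, 502, 503}

class ProviderError(Exception):
    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status

def route_with_failover(request, chain, send_to_provider):
    """Try each provider in order; advance on retryable errors or timeouts."""
    last_error = None
    for provider in chain:
        try:
            return send_to_provider(provider, request, timeout=10.0)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUSES:
                raise  # e.g. a 400 is the caller's bug, not a provider outage
            last_error = err
        except TimeoutError as err:  # connection or stalled-stream timeout
            last_error = err
    raise last_error  # entire chain exhausted
```

The key property is that the client only ever sees the final outcome: either a successful response from some provider in the chain, or a single error after the whole chain is exhausted.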

Setting Up a Failover Chain

Basic Setup

Point your OpenAI client at Proxide:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "prox-your-key-here",
  baseURL: "https://gateway.proxide.ai/openai/v1",
});
```

By default, Proxide uses the fallback chain you've configured in your dashboard. To override the fallback chain for a specific request, use the x-proxide-fallback header:

```typescript
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [{ role: "user", content: "Summarize this document..." }],
  },
  {
    headers: {
      // Explicit fallback chain for this request
      "x-proxide-fallback": "anthropic/claude-3-5-sonnet,groq/llama-3.3-70b",
    },
  }
);
```

Python Example

```python
from openai import OpenAI

client = OpenAI(
    api_key="prox-your-key-here",
    base_url="https://gateway.proxide.ai/openai/v1",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "x-proxide-fallback": "anthropic/claude-3-5-sonnet,groq/llama-3.3-70b",
    },
)
```

curl Example

```bash
curl https://gateway.proxide.ai/openai/v1/chat/completions \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -H "x-proxide-fallback: anthropic/claude-3-5-sonnet,groq/llama-3.3-70b" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Checking Which Provider Was Used

The response includes an x-proxide-provider header telling you which provider actually served the request:

x-proxide-provider: anthropic/claude-3-5-sonnet
x-proxide-failover: true
x-proxide-original-provider: openai/gpt-4o
x-proxide-original-error: 429

This is useful for logging and debugging — you can see at a glance when failover is happening and how often.
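For structured logging, you might lift those headers into a small record. The header names come from the example above; the helper itself is our own sketch, working on any case-sensitive header mapping:

```python
def failover_info(headers):
    """Extract Proxide failover metadata from a response-header mapping."""
    return {
        "provider": headers.get("x-proxide-provider"),
        "failed_over": headers.get("x-proxide-failover") == "true",
        "original_provider": headers.get("x-proxide-original-provider"),
        "original_error": headers.get("x-proxide-original-error"),
    }
```

Feeding a record like this into your metrics pipeline gives you a failover rate per provider, which is a useful early-warning signal even before a provider's own status page updates.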

Failover vs. Load Balancing

Failover (route on error) is different from load balancing (distribute load across providers). Proxide supports both:

  • Failover: Primary provider takes all traffic; others only activate on error. Best for cost optimization (primary is usually cheapest for your use case).
  • Load balancing: Traffic is split across providers by percentage. Best for very high-throughput applications that regularly hit rate limits.

You can configure load balancing ratios in the dashboard, e.g. 70% OpenAI / 30% Anthropic.
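A 70/30 split amounts to a weighted random pick per request. The sketch below illustrates the idea (this is not Proxide's actual algorithm; the `rng` parameter exists only to make the example deterministic):

```python
import random

def pick_provider(weights, rng=random):
    """weights: list of (provider, percentage) pairs summing to 100."""
    roll = rng.uniform(0, 100)
    cumulative = 0.0
    for provider, pct in weights:
        cumulative += pct
        if roll < cumulative:
            return provider
    return weights[-1][0]  # guard against float rounding at the boundary
```

Combined with failover, a load-balanced request that errors out still walks the rest of the chain, so the two features compose rather than conflict.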

Supported Providers

Proxide currently supports automatic failover between:

  • OpenAI (GPT-4o, GPT-4o-mini, o1, o3-mini)
  • Anthropic (Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus)
  • Groq (Llama 3.3 70B, Llama 3.1 8B, Mixtral 8x7B)
  • DeepSeek (DeepSeek-V3, DeepSeek-R1)
  • Google (Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash)
  • Mistral (Mistral Large, Mistral Small)
  • Together AI (Llama 3.3 70B, DeepSeek R1)
  • Fireworks AI (Llama v3.1 70B, Mixtral 8x22B)
  • xAI (Grok-3, Grok-3-mini)
  • Cohere (Command R+, Command R)
  • Perplexity (Sonar Pro, Sonar)
  • Qwen, Moonshot, HuggingFace, Replicate, and more

All providers are normalized to the OpenAI API format, so you never need to change your request structure when failing over between providers.

Real-World Impact

Teams using Proxide's failover typically see their effective API uptime go from single-provider reliability (typically 99.5–99.9%) to effectively 100%, because the chance of all three providers being down simultaneously is vanishingly small.
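The arithmetic behind that claim is simple, with the caveat that it assumes provider failures are independent (correlated outages, such as a shared cloud region, would weaken it):

```python
def chain_uptime(uptimes):
    """Probability that at least one provider in the chain is up,
    assuming independent failures: 1 - product of individual downtimes."""
    p_all_down = 1.0
    for u in uptimes:
        p_all_down *= (1.0 - u)
    return 1.0 - p_all_down
```

Three providers at 99.5% each give an all-down probability of 0.005³ = 1.25e-7, i.e. roughly 99.99999% effective uptime, about four seconds of expected downtime per year.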

The latency cost of failover is minimal: Proxide detects failure and routes to the secondary within 50–100ms. In practice, this is often faster than OpenAI's own retry-after headers suggest, because Proxide has real-time visibility into provider health and pre-warms connections to all providers.

Getting Started

Sign up for Proxide and configure your fallback chain in the dashboard. The free plan includes failover for up to 1 agent. The Pro plan ($49/month) includes unlimited agents, configurable load balancing, and real-time provider health monitoring.

Try Proxide free

Get started in 2 minutes. Change your baseURL and get automatic failover, budget limits, PII redaction, and more.

Start for free →