Why Traditional Caching Doesn't Work for LLMs
If you've tried to cache LLM API responses using a standard Redis key-value store with the prompt as the cache key, you've probably noticed it barely helps. The hit rate is near zero, because even trivially different phrasings produce different cache keys:
- "What's the capital of France?" — cache miss
- "What is the capital of France?" — cache miss (different key)
- "Capital of France?" — cache miss
- "Tell me the capital city of France" — cache miss
All four prompts have the same intent and would benefit from the same cached answer, but exact-match caching treats them as entirely different requests.
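The failure mode is easy to reproduce in a few lines: a plain `Map` keyed on the raw prompt string (a toy stand-in for a Redis exact-match cache) misses on every rephrasing.

```typescript
// Toy exact-match cache: the raw prompt string is the key, as it would
// be in a naive Redis setup.
const exactCache = new Map<string, string>();
exactCache.set("What's the capital of France?", "Paris");

const rephrasings = [
  "What is the capital of France?",
  "Capital of France?",
  "Tell me the capital city of France",
];

// Every rephrasing is a different key, so every lookup misses.
const hits = rephrasings.filter((p) => exactCache.has(p)).length;
console.log(hits); // 0
```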
LLM usage patterns compound this problem. Real users ask the same *questions* constantly — "How do I reset my password?", "What are your business hours?", "Explain this error message" — but they phrase them differently every time. An exact-match cache provides essentially zero benefit for a customer support bot or a documentation assistant.
What Semantic Caching Does Differently
Semantic caching matches prompts by *meaning* rather than by exact text. It works by converting each prompt into a vector embedding — a high-dimensional numerical representation of the prompt's semantic content — and then checking whether any cached prompt has a similar enough embedding.
The similarity is measured using cosine similarity (the cosine of the angle between the two vectors in embedding space). A cosine similarity of 1.0 means the prompts are semantically identical; 0.0 means completely unrelated. Proxide uses a configurable threshold (default: 0.92) to decide when a cached response is close enough to serve.
At a 0.92 threshold:
- "What's the capital of France?" and "What is the capital of France?" → similarity ≈ 0.97 → cache hit
- "What's the capital of France?" and "What's the largest city in France?" → similarity ≈ 0.84 → cache miss (different question)
- "How do I reset my password?" and "I forgot my password, how do I change it?" → similarity ≈ 0.93 → cache hit
How It Works Under the Hood
Here's the full request flow when semantic caching is enabled:
- Embedding generation: Proxide takes the incoming prompt and generates a vector embedding using OpenAI's `text-embedding-3-small` model (or your configured embedding provider). This costs roughly $0.00002 per 1,000 tokens — negligible compared to the cost of a GPT-4o completion.
- Vector similarity search: The embedding is compared against all cached prompt embeddings using approximate nearest-neighbour search (we use pgvector under the hood). This takes 1–3ms.
- Cache hit decision: If the nearest cached embedding has cosine similarity ≥ your threshold, Proxide returns the cached response immediately. The response includes `x-proxide-cache: hit` and `x-proxide-cache-similarity: 0.95` headers.
- Cache miss path: If no sufficiently similar cached prompt exists, the request is forwarded to the upstream LLM. The response is cached for future use, with a configurable TTL (default: 1 hour for general queries, shorter for time-sensitive content).
- Cache invalidation: You can manually invalidate cache entries via the Proxide API, or configure TTLs per prompt type.
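The steps above can be sketched end to end. This is a toy model, not Proxide's implementation: `embed` and `callUpstreamLlm` are stand-ins for the real embedding and completion calls, and the linear scan stands in for pgvector's approximate nearest-neighbour search.

```typescript
type CacheEntry = { embedding: number[]; response: string };

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Toy bag-of-keywords "embedding"; production uses an embedding model.
function embed(prompt: string): number[] {
  const vocab = ["capital", "france", "password", "reset"];
  const text = prompt.toLowerCase();
  return vocab.map((w) => (text.includes(w) ? 1 : 0));
}

function callUpstreamLlm(prompt: string): string {
  return `LLM answer for: ${prompt}`;
}

function handleRequest(
  prompt: string,
  cache: CacheEntry[],
  threshold = 0.92,
): { response: string; cacheStatus: "hit" | "miss" } {
  const embedding = embed(prompt); // step 1: embed the prompt
  // step 2: find the nearest cached embedding
  let best: { entry: CacheEntry; sim: number } | null = null;
  for (const entry of cache) {
    const sim = cosine(embedding, entry.embedding);
    if (best === null || sim > best.sim) best = { entry, sim };
  }
  // step 3: serve from cache if similarity clears the threshold
  if (best !== null && best.sim >= threshold) {
    return { response: best.entry.response, cacheStatus: "hit" };
  }
  // step 4: cache miss, forward upstream and store the result
  const response = callUpstreamLlm(prompt);
  cache.push({ embedding, response });
  return { response, cacheStatus: "miss" };
}

const cache: CacheEntry[] = [];
console.log(handleRequest("What's the capital of France?", cache).cacheStatus); // "miss"
console.log(handleRequest("What is the capital of France?", cache).cacheStatus); // "hit"
```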
Typical Cost Savings
In practice, semantic caching saves 20–40% on most production LLM workloads. The savings vary by use case:
| Use case | Typical cache hit rate | Cost reduction |
|---|---|---|
| Customer support bot | 35–50% | 30–45% |
| Documentation assistant | 40–60% | 35–55% |
| Code review / analysis | 15–25% | 12–20% |
| Creative generation | 5–10% | 3–8% |
| RAG with user data | 10–20% | 8–15% |
Customer support and documentation use cases benefit most because user questions cluster around a small set of common intents. Creative generation and personalised RAG applications benefit least because every request is genuinely unique.
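The hit-rate and cost columns are linked by simple arithmetic: every request pays the small embedding cost, and every hit avoids a completion. A back-of-the-envelope sketch, with the overhead ratio assumed for illustration:

```typescript
// Net fractional saving ≈ hitRate - embeddingOverhead, where
// embeddingOverhead is the embedding cost as a fraction of a completion
// (paid on every request, hit or miss). Illustrative numbers only.
function costReduction(hitRate: number, embeddingOverhead: number): number {
  return hitRate - embeddingOverhead;
}

// A support bot with a 40% hit rate and embeddings costing ~0.1% of a
// completion saves roughly 39.9% of its LLM spend.
console.log(costReduction(0.4, 0.001));
```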
Setting Up Semantic Cache with Proxide
Enable in Dashboard
Semantic caching is enabled per-route in your Proxide dashboard. Navigate to Routes → Cache Settings and enable semantic caching with your desired similarity threshold and TTL.
You'll need to provide an OpenAI API key (or configure an alternative embedding provider) for Proxide to generate embeddings.
Check Cache Headers in Your Application
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "prox-your-key-here",
  baseURL: "https://gateway.proxide.ai/openai/v1",
});

// .withResponse() exposes the raw HTTP response alongside the parsed
// body, so cache headers can be read for non-streaming requests.
// In streaming mode, check headers from the raw fetch response instead.
const { data, response } = await client.chat.completions
  .create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "What is the capital of France?" }],
  })
  .withResponse();

console.log(data.choices[0].message.content);
console.log(response.headers.get("x-proxide-cache"));
```

```bash
# Check cache headers with curl
curl https://gateway.proxide.ai/openai/v1/chat/completions \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What is the capital of France?"}]}' \
  -v 2>&1 | grep x-proxide-cache
```

On a cache hit, you'll see:

```
x-proxide-cache: hit
x-proxide-cache-similarity: 0.97
x-proxide-cache-age: 342
```

On a cache miss:

```
x-proxide-cache: miss
```

Adjusting the Similarity Threshold
The default 0.92 threshold works well for most use cases. You can tune it:
- Higher threshold (0.95–0.99): Fewer false positives — only very similar prompts return cached responses. Better for factual accuracy.
- Lower threshold (0.85–0.92): More cache hits, higher savings — but a slightly higher risk of returning a cached response that isn't quite right for the specific prompt.
For customer support bots where approximate answers are acceptable, 0.88–0.90 is a reasonable setting. For legal or medical applications where precision matters, 0.95+ is advisable.
Cache Warming
For known high-traffic queries, you can pre-populate the cache using the Proxide Cache API:
```bash
curl -X POST https://api.proxide.ai/v1/cache/warm \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompts": [
      "How do I reset my password?",
      "What are your business hours?",
      "How do I cancel my subscription?"
    ],
    "model": "gpt-4o",
    "ttl": 86400
  }'
```

This pre-generates embeddings and responses for all listed prompts, so the first real user to ask a similar question gets a cache hit immediately.
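The same call can be made from Node. The endpoint and payload shape mirror the curl example; the API key is a placeholder, and the injectable `post` parameter is just a convenience of this sketch, making the function easy to exercise without a live network.

```typescript
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// Warm the semantic cache with known high-traffic prompts.
async function warmCache(
  prompts: string[],
  apiKey: string,
  post: PostFn = fetch as unknown as PostFn,
): Promise<void> {
  const res = await post("https://api.proxide.ai/v1/cache/warm", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // 86400s = 24h TTL, matching the curl example above.
    body: JSON.stringify({ prompts, model: "gpt-4o", ttl: 86400 }),
  });
  if (!res.ok) throw new Error(`cache warm failed with status ${res.status}`);
}
```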
Semantic Cache + PII Redaction
Semantic caching and PII redaction work together. Proxide applies PII redaction first, then generates the embedding from the redacted prompt. This means:
- Two users asking "My email is alice@example.com, how do I reset my password?" and "My email is bob@example.com, how do I reset my password?" both get redacted to "My email is [REDACTED:EMAIL], how do I reset my password?" — and then get a cache hit from each other.
- Cached responses never contain any user's PII.
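The redact-then-embed ordering can be sketched with a single email pattern. The regex and the `[REDACTED:EMAIL]` tag here are illustrative, not Proxide's actual redaction rules:

```typescript
// Redact PII first; the redacted text is what gets embedded and cached,
// so prompts differing only in PII converge on one cache entry.
function redactEmails(prompt: string): string {
  // Simplified pattern; real PII detection is more thorough.
  return prompt.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[REDACTED:EMAIL]");
}

const a = redactEmails("My email is alice@example.com, how do I reset my password?");
const b = redactEmails("My email is bob@example.com, how do I reset my password?");

console.log(a === b); // true: both users share one cache entry
console.log(a); // "My email is [REDACTED:EMAIL], how do I reset my password?"
```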
Getting Started
Semantic caching is available on the Proxide Pro plan ($49/month). Sign up at app.proxide.ai and enable it in your route settings. For most teams, the cost savings from reduced LLM API calls will cover the Proxide subscription cost within the first month.