Semantic cache
Proxide's semantic cache matches incoming prompts against cached responses by meaning — not exact text. Rephrased versions of the same question all hit the cache, reducing LLM API costs by 20–40% on typical workloads.
How semantic caching works
Traditional caches use the exact prompt string as the cache key. Because users rarely phrase the same question identically, exact-match cache hit rates for LLM applications are near zero.
Semantic caching converts each prompt to a vector embedding and compares it to cached embeddings using cosine similarity. If the similarity exceeds the configured threshold, the cached response is returned without calling the LLM.
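As a toy illustration of the comparison step, here is cosine similarity computed in awk. The three-component vectors are made up for readability; real embeddings from text-embedding-3-small have 1,536 dimensions.

```shell
# cosine(a, b) = dot(a, b) / (|a| * |b|), with vectors passed as
# comma-separated strings. Values are illustrative, not real embeddings.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
    printf "%.2f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine "0.2,0.8,0.1" "0.3,0.7,0.2"   # similar direction: high similarity
cosine "0.2,0.8,0.1" "0.9,0.1,0.3"   # different direction: low similarity
```

Vectors pointing the same way score near 1.0 regardless of magnitude, which is why paraphrases of the same question land above the threshold while different questions fall below it.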
// First request — cache miss, response stored
"What is the capital of France?" → MISS
// These all hit the cache (similarity ≥ 0.92):
"What's the capital of France?" → HIT (0.97)
"Capital of France?" → HIT (0.95)
"Tell me the capital city of France" → HIT (0.93)
// This does NOT hit the cache (different question):
"What's the largest city in France?" → MISS (0.84)

Request flow
1. Prompt arrives at Proxide. An embedding is generated using text-embedding-3-small.
2. The embedding is compared to all cached embeddings via vector similarity search (pgvector ANN).
3. If the nearest match has cosine similarity ≥ threshold (default 0.92): return the cached response in <5ms.
4. If no match: forward to the LLM, cache the response with a TTL.
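The hit-or-miss decision in step 3 reduces to a single comparison. A minimal sketch with a hardcoded similarity score (in production the value comes from the vector search):

```shell
THRESHOLD=0.92     # default; configurable per route
similarity=0.95    # hardcoded for illustration

# awk does the floating-point comparison; exit 0 means "hit".
if awk -v s="$similarity" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
  echo "HIT: return cached response"
else
  echo "MISS: call the LLM, cache the response with a TTL"
fi
```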
Enabling semantic cache
Semantic cache is configured in your Proxide dashboard under Routes → Cache Settings. You will need to provide an OpenAI API key (or an alternative embedding provider key) for Proxide to generate embeddings.
Requirements:
- Proxide Pro plan or higher
- OpenAI API key for embedding generation (or an alternative embedding provider key)
- Semantic cache toggle enabled in the dashboard
Checking cache hits
Every response includes an x-proxide-cache header indicating whether the response was served from cache:
x-proxide-cache: hit
x-proxide-cache-similarity: 0.97
x-proxide-cache-age: 342 // seconds since cached
x-proxide-latency-ms: 4 // served from cache in 4ms

x-proxide-cache: miss
x-proxide-latency-ms: 612 // LLM call took 612ms

Reading cache headers in code
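In a script you will usually want to branch on the header value rather than eyeball it. A minimal sketch against a canned header dump (in practice, capture real headers with `curl -s -D - -o /dev/null ...`):

```shell
# Canned dump standing in for a real response's headers.
headers='x-proxide-cache: hit
x-proxide-cache-similarity: 0.95'

# Extract the value of the x-proxide-cache header.
status=$(printf '%s\n' "$headers" | awk -F': ' '$1 == "x-proxide-cache" { print $2 }')

if [ "$status" = "hit" ]; then
  echo "served from cache"
else
  echo "fresh LLM response"
fi
```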
curl https://gateway.proxide.ai/openai/v1/chat/completions \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o","messages":[{"role":"user","content":"Capital of France?"}]}' \
-v 2>&1 | grep "x-proxide-cache"
# First request:
# x-proxide-cache: miss
# Second request (or similar phrasing):
# x-proxide-cache: hit
# x-proxide-cache-similarity: 0.95

Tuning the similarity threshold
The default threshold of 0.92 works well for most use cases. You can adjust it in Routes → Cache Settings:
| Threshold | Behaviour | Best for |
|---|---|---|
| 0.95–0.99 | Only very close paraphrases hit the cache | Legal, medical, high-precision applications |
| 0.90–0.94 | Default — good balance of hits and accuracy | General purpose, customer support, docs |
| 0.85–0.89 | Broader matching, higher savings, more approximation | FAQ bots, low-stakes repetitive workloads |
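One way to evaluate a candidate threshold offline is to take pairs of logged prompts you have labelled as same-question or not, and count the two failure modes at that threshold. A sketch over made-up (similarity, should_match) rows; the numbers are illustrative, not real Proxide data:

```shell
THRESHOLD=0.90   # candidate value to evaluate

# Columns: similarity score, whether the pair really is the same question.
printf '%s\n' \
  "0.97 yes" \
  "0.93 yes" \
  "0.88 yes" \
  "0.84 no" \
| awk -v t="$THRESHOLD" '
    { hit = ($1 >= t) }
    hit && $2 == "no"   { false_hits++ }   # wrong answer served from cache
    !hit && $2 == "yes" { missed++ }       # LLM call that could have been skipped
    END { printf "false hits: %d, missed hits: %d\n", false_hits+0, missed+0 }'
```

Raising the threshold trades missed hits (lost savings) for fewer false hits (wrong answers), which is the same trade-off the table above describes.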
Cache TTL and invalidation
The default TTL is 1 hour for general queries. You can configure TTL per route in the dashboard.
To invalidate specific cache entries programmatically:
# Invalidate all cache entries for a specific prompt cluster
curl -X DELETE https://api.proxide.ai/v1/cache \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{"query": "capital of France", "threshold": 0.92}'
# Invalidate everything
curl -X DELETE https://api.proxide.ai/v1/cache/all \
-H "Authorization: Bearer prox-your-key-here"

Cache warming
For known high-traffic queries, pre-populate the cache to ensure the first user gets a fast response:
curl -X POST https://api.proxide.ai/v1/cache/warm \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"prompts": [
"How do I reset my password?",
"What are your business hours?",
"How do I cancel my subscription?"
],
"model": "gpt-4o",
"ttl": 86400
}'
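If your high-traffic prompts live in a file (one per line), the warm request body can be assembled with jq instead of hand-written JSON. The filename faqs.txt and the jq approach are assumptions; the payload shape matches the example above:

```shell
# Stand-in for an existing one-prompt-per-line file.
printf '%s\n' \
  "How do I reset my password?" \
  "What are your business hours?" \
  > faqs.txt

# -R reads raw lines, -n + inputs collects them into a JSON array.
jq -Rn --arg model "gpt-4o" --argjson ttl 86400 \
  '{prompts: [inputs], model: $model, ttl: $ttl}' < faqs.txt > warm.json

cat warm.json
```

The result can then be posted with `curl -X POST https://api.proxide.ai/v1/cache/warm ... -d @warm.json`.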