Documentation

Semantic cache

Proxide's semantic cache matches incoming prompts against cached responses by meaning — not exact text. Rephrased versions of the same question all hit the cache, reducing LLM API costs by 20–40% on typical workloads.

How semantic caching works

Traditional caches use the exact prompt string as the cache key. Because users rarely phrase the same question identically, exact-match cache hit rates for LLM applications are near zero.
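To make this concrete, here is a minimal sketch of exact-match keying (a common approach is to hash the raw prompt; the function name is illustrative, not part of Proxide's API). Two rephrasings of the same question produce unrelated keys, so the second request can never reuse the first response:

```python
import hashlib

def exact_cache_key(prompt: str) -> str:
    """Exact-match caches key on a hash of the raw prompt string."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# Rephrasings of the same question hash to completely different keys.
k1 = exact_cache_key("What is the capital of France?")
k2 = exact_cache_key("What's the capital of France?")
print(k1 == k2)  # False — an exact-match cache treats these as distinct
```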

Semantic caching converts each prompt to a vector embedding and compares it to cached embeddings using cosine similarity. If the similarity exceeds the configured threshold, the cached response is returned without calling the LLM.
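A minimal sketch of the comparison step, assuming plain cosine similarity over embedding vectors (the helper names and the toy 3-dimensional vectors are illustrative; real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(similarity: float, threshold: float = 0.92) -> bool:
    """The cached response is returned when similarity meets the threshold."""
    return similarity >= threshold

# Toy vectors standing in for embeddings of two similar prompts.
cached = [0.2, 0.7, 0.1]
incoming = [0.21, 0.68, 0.12]
print(is_cache_hit(cosine_similarity(cached, incoming)))  # True
```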

Example: cache hits for semantically similar prompts
// First request — cache miss, response stored
"What is the capital of France?"          → MISS

// These all hit the cache (similarity ≥ 0.92):
"What's the capital of France?"           → HIT  (0.97)
"Capital of France?"                      → HIT  (0.95)
"Tell me the capital city of France"      → HIT  (0.93)

// This does NOT hit the cache (different question):
"What's the largest city in France?"      → MISS (0.84)

Request flow

  1. Prompt arrives at Proxide. An embedding is generated using text-embedding-3-small.
  2. The embedding is compared to all cached embeddings via vector similarity search (pgvector ANN).
  3. If the nearest match has cosine similarity ≥ threshold (default 0.92): return the cached response in <5ms.
  4. If no match: forward to the LLM and cache the response with a TTL.
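The steps above can be sketched as a small in-memory cache. This is illustrative only: Proxide uses a pgvector ANN index rather than the linear scan shown here, and the class and method names are assumptions, not Proxide internals.

```python
import math
import time
from dataclasses import dataclass, field

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

@dataclass
class CacheEntry:
    embedding: list[float]
    response: str
    cached_at: float = field(default_factory=time.time)

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[CacheEntry] = []

    def lookup(self, embedding: list[float]):
        # Steps 2-3: find the nearest cached embedding by cosine
        # similarity (an ANN index does this efficiently at scale).
        best, best_sim = None, -1.0
        for entry in self.entries:
            sim = cosine(embedding, entry.embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is not None and best_sim >= self.threshold:
            return best.response, best_sim   # cache hit
        return None, best_sim                # step 4: forward to the LLM

    def store(self, embedding: list[float], response: str) -> None:
        self.entries.append(CacheEntry(embedding, response))

# A near-identical embedding hits; an unrelated one misses.
cache = SemanticCache()
cache.store([1.0, 0.0], "The capital of France is Paris.")
response, sim = cache.lookup([0.99, 0.05])
print(response)  # The capital of France is Paris.
```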

Enabling semantic cache

Semantic cache is configured in your Proxide dashboard under Routes → Cache Settings. You will need to provide an OpenAI API key (or an alternative embedding provider key) for Proxide to generate embeddings.

Requirements:

  • Proxide Pro plan or higher
  • OpenAI API key for embedding generation (or an alternative embedding provider configured)
  • Semantic cache toggle enabled in the dashboard

Checking cache hits

Every response includes an x-proxide-cache header indicating whether the response was served from cache:

Cache hit response headers
x-proxide-cache: hit
x-proxide-cache-similarity: 0.97
x-proxide-cache-age: 342       // seconds since cached
x-proxide-latency-ms: 4        // served from cache in 4ms
Cache miss response headers
x-proxide-cache: miss
x-proxide-latency-ms: 612      // LLM call took 612ms

Reading cache headers in code

curl
curl https://gateway.proxide.ai/openai/v1/chat/completions \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Capital of France?"}]}' \
  -v 2>&1 | grep "x-proxide-cache"

# First request:
# x-proxide-cache: miss

# Second request (or similar phrasing):
# x-proxide-cache: hit
# x-proxide-cache-similarity: 0.95
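If you are calling the gateway from application code, a small helper can branch on the documented headers. The function below is a hypothetical sketch, not part of any Proxide SDK; it only assumes the header names shown above:

```python
def describe_cache_result(headers: dict[str, str]) -> str:
    """Summarise the Proxide cache headers on a response."""
    if headers.get("x-proxide-cache") == "hit":
        sim = headers.get("x-proxide-cache-similarity", "?")
        age = headers.get("x-proxide-cache-age", "?")
        return f"cache hit (similarity {sim}, cached {age}s ago)"
    latency = headers.get("x-proxide-latency-ms", "?")
    return f"cache miss (LLM latency {latency}ms)"

print(describe_cache_result({
    "x-proxide-cache": "hit",
    "x-proxide-cache-similarity": "0.95",
    "x-proxide-cache-age": "342",
}))
# cache hit (similarity 0.95, cached 342s ago)
```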

Tuning the similarity threshold

The default threshold of 0.92 works well for most use cases. You can adjust it in Routes → Cache Settings:

Threshold    Behaviour                                             Best for
0.95–0.99    Only very close paraphrases hit the cache             Legal, medical, high-precision applications
0.90–0.94    Default — good balance of hits and accuracy           General purpose, customer support, docs
0.85–0.89    Broader matching, higher savings, more approximation  FAQ bots, low-stakes repetitive workloads
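To see how the threshold changes behaviour, here is a quick check applying different thresholds to the example similarity scores from the paraphrase example earlier (the scores are copied from that example; the helper is illustrative):

```python
# Similarity scores from the "capital of France" example above.
scores = {
    "What's the capital of France?": 0.97,
    "Capital of France?": 0.95,
    "Tell me the capital city of France": 0.93,
    "What's the largest city in France?": 0.84,
}

def hits(threshold: float) -> list[str]:
    """Prompts that would be served from cache at this threshold."""
    return [prompt for prompt, sim in scores.items() if sim >= threshold]

print(len(hits(0.95)))  # 2 — strict: only very close paraphrases
print(len(hits(0.92)))  # 3 — the default
print(len(hits(0.85)))  # 3 — even loose matching rejects the 0.84 outlier
```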

Cache TTL and invalidation

The default TTL is 1 hour for general queries. You can configure TTL per route in the dashboard.

To invalidate specific cache entries programmatically:

curl
# Invalidate all cache entries for a specific prompt cluster
curl -X DELETE https://api.proxide.ai/v1/cache \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"query": "capital of France", "threshold": 0.92}'

# Invalidate everything
curl -X DELETE https://api.proxide.ai/v1/cache/all \
  -H "Authorization: Bearer prox-your-key-here"

Cache warming

For known high-traffic queries, pre-populate the cache to ensure the first user gets a fast response:

curl
curl -X POST https://api.proxide.ai/v1/cache/warm \
  -H "Authorization: Bearer prox-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompts": [
      "How do I reset my password?",
      "What are your business hours?",
      "How do I cancel my subscription?"
    ],
    "model": "gpt-4o",
    "ttl": 86400
  }'