Semantic cache
Proxide's semantic cache matches incoming prompts against cached responses by meaning — not exact text. Rephrased versions of the same question all hit the cache, reducing LLM API costs by 20–40% on typical workloads.
How semantic caching works
Traditional caches use the exact prompt string as the cache key. Because users rarely phrase the same question identically, exact-match cache hit rates for LLM applications are near zero.
Semantic caching converts each prompt to a vector embedding and compares it to cached embeddings using cosine similarity. If the similarity exceeds the configured threshold, the cached response is returned without calling the LLM.
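As a toy illustration of the comparison step, here is cosine similarity computed in awk. The three-component vectors are made up for readability; real embeddings from text-embedding-3-small have 1,536 dimensions.

```shell
# cosine(a, b) = dot(a, b) / (|a| * |b|), with vectors passed as
# comma-separated strings. Values are illustrative, not real embeddings.
cosine() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ","); split(b, y, ",")
    for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
    printf "%.2f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine "0.2,0.8,0.1" "0.3,0.7,0.2"   # similar direction: high similarity
cosine "0.2,0.8,0.1" "0.9,0.1,0.3"   # different direction: low similarity
```

Vectors pointing the same way score near 1.0 regardless of magnitude, which is why paraphrases of the same question land above the threshold while different questions fall below it.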
// First request — cache miss, response stored
"What is the capital of France?" → MISS
// These all hit the cache (similarity ≥ 0.92):
"What's the capital of France?" → HIT (0.97)
"Capital of France?" → HIT (0.95)
"Tell me the capital city of France" → HIT (0.93)
// This does NOT hit the cache (different question):
"What's the largest city in France?" → MISS (0.84)

Request flow
1. Prompt arrives at Proxide. An embedding is generated using text-embedding-3-small.
2. The embedding is compared to all cached embeddings via vector similarity search (pgvector ANN).
3. If the nearest match has cosine similarity ≥ threshold (default 0.92): return the cached response in <5ms.
4. If no match: forward to the LLM, cache the response with a TTL.
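The hit-or-miss decision in step 3 reduces to a single comparison. A minimal sketch with a hardcoded similarity score (in production the value comes from the vector search):

```shell
THRESHOLD=0.92     # default; configurable per route
similarity=0.95    # hardcoded for illustration

# awk does the floating-point comparison; exit 0 means "hit".
if awk -v s="$similarity" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'; then
  echo "HIT: return cached response"
else
  echo "MISS: call the LLM, cache the response with a TTL"
fi
```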
Enabling semantic cache
Semantic cache is configured in your Proxide dashboard under Routes → Cache Settings. You will need to provide an OpenAI API key (or an alternative embedding provider key) for Proxide to generate embeddings.
Requirements:
- Proxide Pro plan or higher
- OpenAI API key for embedding generation (or an alternative embedding provider key)
- Semantic cache toggle enabled in the dashboard
Checking cache hits
Every response includes an x-proxide-cache header indicating whether the response was served from cache:
x-proxide-cache: hit
x-proxide-cache-similarity: 0.97
x-proxide-cache-age: 342 // seconds since cached
x-proxide-latency-ms: 4 // served from cache in 4ms

x-proxide-cache: miss
x-proxide-latency-ms: 612 // LLM call took 612ms

Reading cache headers in code
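In a script you will usually want to branch on the header value rather than eyeball it. A minimal sketch against a canned header dump (in practice, capture real headers with `curl -s -D - -o /dev/null ...`):

```shell
# Canned dump standing in for a real response's headers.
headers='x-proxide-cache: hit
x-proxide-cache-similarity: 0.95'

# Extract the value of the x-proxide-cache header.
status=$(printf '%s\n' "$headers" | awk -F': ' '$1 == "x-proxide-cache" { print $2 }')

if [ "$status" = "hit" ]; then
  echo "served from cache"
else
  echo "fresh LLM response"
fi
```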
curl https://gateway.proxide.ai/openai/v1/chat/completions \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o","messages":[{"role":"user","content":"Capital of France?"}]}' \
-v 2>&1 | grep "x-proxide-cache"
# First request:
# x-proxide-cache: miss
# Second request (or similar phrasing):
# x-proxide-cache: hit
# x-proxide-cache-similarity: 0.95

Tuning the similarity threshold
The default threshold of 0.92 works well for most use cases. You can adjust it in Routes → Cache Settings:
| Threshold | Behaviour | Best for |
|---|---|---|
| 0.95–0.99 | Only very close paraphrases hit the cache | Legal, medical, high-precision applications |
| 0.90–0.94 | Default — good balance of hits and accuracy | General purpose, customer support, docs |
| 0.85–0.89 | Broader matching, higher savings, more approximation | FAQ bots, low-stakes repetitive workloads |
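One way to evaluate a candidate threshold offline is to take pairs of logged prompts you have labelled as same-question or not, and count the two failure modes at that threshold. A sketch over made-up (similarity, should_match) rows; the numbers are illustrative, not real Proxide data:

```shell
THRESHOLD=0.90   # candidate value to evaluate

# Columns: similarity score, whether the pair really is the same question.
printf '%s\n' \
  "0.97 yes" \
  "0.93 yes" \
  "0.88 yes" \
  "0.84 no" \
| awk -v t="$THRESHOLD" '
    { hit = ($1 >= t) }
    hit && $2 == "no"   { false_hits++ }   # wrong answer served from cache
    !hit && $2 == "yes" { missed++ }       # LLM call that could have been skipped
    END { printf "false hits: %d, missed hits: %d\n", false_hits+0, missed+0 }'
```

Raising the threshold trades missed hits (lost savings) for fewer false hits (wrong answers), which is the same trade-off the table above describes.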
Cache TTL and invalidation
The default TTL is 1 hour for general queries. You can configure TTL per route in the dashboard.
To invalidate specific cache entries programmatically:
# Invalidate all cache entries for a specific prompt cluster
curl -X DELETE https://api.proxide.ai/v1/cache \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{"query": "capital of France", "threshold": 0.92}'
# Invalidate everything
curl -X DELETE https://api.proxide.ai/v1/cache/all \
-H "Authorization: Bearer prox-your-key-here"

Cache warming
For known high-traffic queries, pre-populate the cache to ensure the first user gets a fast response:
curl -X POST https://api.proxide.ai/v1/cache/warm \
-H "Authorization: Bearer prox-your-key-here" \
-H "Content-Type: application/json" \
-d '{
"prompts": [
"How do I reset my password?",
"What are your business hours?",
"How do I cancel my subscription?"
],
"model": "gpt-4o",
"ttl": 86400
}'
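If your high-traffic prompts live in a file (one per line), the warm request body can be assembled with jq instead of hand-written JSON. The filename faqs.txt and the jq approach are assumptions; the payload shape matches the example above:

```shell
# Stand-in for an existing one-prompt-per-line file.
printf '%s\n' \
  "How do I reset my password?" \
  "What are your business hours?" \
  > faqs.txt

# -R reads raw lines, -n + inputs collects them into a JSON array.
jq -Rn --arg model "gpt-4o" --argjson ttl 86400 \
  '{prompts: [inputs], model: $model, ttl: $ttl}' < faqs.txt > warm.json

cat warm.json
```

The result can then be posted with `curl -X POST https://api.proxide.ai/v1/cache/warm ... -d @warm.json`.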