Evaluating Langcache on Your Data
Semantic caching only helps if it works on your own traffic. A threshold that looks good on a toy dataset can be far too aggressive on real user queries, or too conservative to save meaningful latency and cost.
The langcache-customer-data-eval repo is built to answer that question on your own data. It runs an offline evaluation pipeline over your queries and cache candidates, then produces the metrics and plots you need to decide whether LangCache is a good fit and where the operating threshold should be set.
What the repo does
At a high level, the pipeline:
- embeds queries and cache candidates
- finds the nearest cache match for each query
- sweeps similarity thresholds across the score range
- computes metrics at each threshold
- writes CSV outputs and threshold plots for analysis
That gives you a concrete view of the tradeoff between cache aggressiveness and answer quality, rather than relying on a single anecdotal threshold.
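The core of those steps can be sketched in a few lines. This is an illustration, not the repo's actual code: the random vectors stand in for real embeddings of the query and cache texts, and cosine similarity is assumed as the score.

```python
import numpy as np

# Toy stand-ins for real embeddings: in the actual pipeline these would
# come from an embedding model applied to query and cache texts.
rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(5, 8))
cache_vecs = rng.normal(size=(3, 8))

# Normalize so a dot product equals cosine similarity.
q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
c = cache_vecs / np.linalg.norm(cache_vecs, axis=1, keepdims=True)

sims = q @ c.T                   # (n_queries, n_cache) similarity matrix
best_idx = sims.argmax(axis=1)   # nearest cache candidate per query
best_sim = sims.max(axis=1)      # its similarity score

# Sweep thresholds across the observed score range and record, at each
# threshold, the fraction of queries whose best match would be a cache hit.
thresholds = np.linspace(best_sim.min(), best_sim.max(), 11)
hit_rate = [(best_sim >= t).mean() for t in thresholds]
```

The sweep output is what ends up in the CSVs and plots: one row per threshold, so you can see where the hit rate falls off rather than guessing at a single cutoff.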
Two ways to evaluate
The repo supports two modes.
1. Fast cache-hit-ratio analysis
In the default mode, the pipeline focuses on cache hit ratio. This is useful when you want a quick first pass on a dataset and mainly care about how often a cache would fire at different thresholds.
This mode is helpful for questions like:
- Do my queries cluster tightly enough for semantic caching to help?
- How sensitive is cache hit ratio to the threshold?
- Is there an obvious operating region worth testing online?
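The sensitivity question in particular reduces to a small computation over the per-query best-match scores. A minimal sketch, with hypothetical scores standing in for a real batch:

```python
def hit_ratio(best_sims, threshold):
    """Fraction of queries whose best cache match clears the threshold."""
    return sum(s >= threshold for s in best_sims) / len(best_sims)

# Hypothetical best-match similarity scores for a batch of queries.
scores = [0.95, 0.91, 0.88, 0.72, 0.69, 0.55, 0.40]

curve = {t: hit_ratio(scores, t) for t in (0.5, 0.7, 0.9)}
# A steep drop between adjacent thresholds signals high sensitivity;
# a flat region is a candidate operating zone worth testing online.
```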
2. Full evaluation with quality metrics
The full evaluation mode adds an LLM-as-a-judge stage on top of the nearest-match retrieval step. The judge produces proxy labels for whether each retrieved cache match is actually acceptable, which makes it possible to compute metrics such as precision, recall, and F-scores across the threshold sweep.
This is the more useful mode when the real question is not just “how often would the cache hit?” but “how often would it hit correctly?”
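With judge labels in hand, the standard confusion-matrix metrics fall out at each threshold. A sketch, assuming binary labels (1 = the judge accepted the retrieved match) and hypothetical scores:

```python
def metrics_at_threshold(scores, labels, threshold):
    """Precision/recall/F1, treating 'score >= threshold' as a cache hit.

    labels: 1 if the judge deemed the retrieved match acceptable, else 0.
    """
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical best-match scores and judge labels.
scores = [0.95, 0.90, 0.85, 0.70, 0.60]
labels = [1, 1, 0, 1, 0]
p, r, f = metrics_at_threshold(scores, labels, 0.8)
```

Precision here answers "how often would the cache hit correctly?": of the matches the threshold lets through, the fraction the judge accepted.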
What you provide
The pipeline is built around two simple inputs:
- a query log CSV
- a cache/reference CSV
Both files must share a common text column. The query file represents the traffic you want to evaluate; the cache file represents the candidate responses or utterances that LangCache would try to reuse.
The tooling supports both local paths and s3:// paths, which makes it practical to run on small local samples or larger offline datasets stored in object storage.
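The input shape is easy to mock up. A minimal sketch with pandas; the column name "text" is an assumption for illustration, not necessarily the repo's required schema:

```python
import io
import pandas as pd

# Hypothetical minimal inputs: both files share a common text column.
queries_csv = io.StringIO(
    "text\nhow do I reset my password?\nwhat is my balance?\n"
)
cache_csv = io.StringIO("text\nreset your password via settings\n")

queries = pd.read_csv(queries_csv)
cache = pd.read_csv(cache_csv)

# pandas can also read s3:// URLs directly (with s3fs installed), e.g.:
# queries = pd.read_csv("s3://my-bucket/queries.csv")
```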
What you get back
The main outputs are the artifacts you need to choose a threshold responsibly:
- match files showing the best candidate per query
- threshold sweep CSVs
- cache-hit-rate plots
- in full mode, precision/recall/F-score outputs and comparison plots
That makes the repo useful for several evaluation loops:
- comparing embedding models
- comparing Redis-backed matching against in-memory matching
- testing how much quality drops as cache hit rate increases
- deciding what threshold should move to online experimentation
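The last loop, picking a threshold to take online, usually means applying a selection rule to the sweep output. One reasonable rule, sketched over hypothetical sweep rows (the 0.90 precision floor is an arbitrary example target, not a recommendation):

```python
# Hypothetical rows from a threshold-sweep CSV: (threshold, hit_rate, precision).
sweep = [
    (0.60, 0.80, 0.70),
    (0.70, 0.65, 0.85),
    (0.80, 0.45, 0.93),
    (0.90, 0.20, 0.98),
]

# Selection rule: the most aggressive setting (highest hit rate) that
# still meets the precision floor.
candidates = [row for row in sweep if row[2] >= 0.90]
chosen = max(candidates, key=lambda row: row[1]) if candidates else None
```

Other rules (maximize F1, cap the false-reuse rate) slot into the same pattern; the point is that the decision is made from measured rows, not a guessed cutoff.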
Why this is useful for LangCache
LangCache is most valuable when it improves latency and cost without introducing too many wrong cache reuses. The hard part is that every dataset has different duplication patterns, phrasing variation, and ambiguity.
This repo helps turn that into a measurable offline exercise. Instead of asking whether semantic caching works in general, you can ask:
- what cache hit ratio is possible on my data?
- what precision do I get at that hit rate?
- how does the threshold behave on real traffic?
- which model or serving setup gives the best tradeoff?
That is exactly the kind of analysis needed before enabling LangCache broadly in production.
Practical workflow
A typical workflow looks like this:
- Export a representative set of queries.
- Build or sample the cache candidate set.
- Run the evaluation pipeline offline.
- Inspect the threshold sweep outputs and plots.
- Choose a small set of candidate thresholds.
- Validate those thresholds in a controlled online setting.
The repo is therefore less about one benchmark number and more about giving teams a repeatable method for deciding whether semantic caching is viable on their own workload.
Repo
- GitHub: https://github.com/redislabsdev/langcache-customer-data-eval