Evaluating Langcache on Your Data
Semantic caching only helps if it works on your own traffic. A threshold that looks good on a toy dataset can be far too aggressive on real user queries, or too conservative to save meaningful latency and cost.
The langcache-customer-data-eval repo is built to answer that question on your own data. It runs an offline evaluation pipeline over your queries and cache candidates, then produces the metrics and plots you need to decide whether LangCache is a good fit and where the operating threshold should be set.
What the repo does
At a high level, the pipeline:
- embeds queries and cache candidates
- finds the nearest cache match for each query
- sweeps similarity thresholds across the score range
- computes metrics at each threshold
- writes CSV outputs and threshold plots for analysis
That gives you a concrete view of the tradeoff between cache aggressiveness and answer quality, rather than relying on a single anecdotal threshold.
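The core of those steps can be sketched in a few lines. This is an illustration, not the repo's actual code: the random vectors stand in for real embeddings of the query and cache texts, and cosine similarity is assumed as the score.

```python
import numpy as np

# Toy stand-ins for real embeddings: in the actual pipeline these would
# come from an embedding model applied to query and cache texts.
rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(5, 8))
cache_vecs = rng.normal(size=(3, 8))

# Normalize so a dot product equals cosine similarity.
q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
c = cache_vecs / np.linalg.norm(cache_vecs, axis=1, keepdims=True)

sims = q @ c.T                   # (n_queries, n_cache) similarity matrix
best_idx = sims.argmax(axis=1)   # nearest cache candidate per query
best_sim = sims.max(axis=1)      # its similarity score

# Sweep thresholds across the observed score range and record, at each
# threshold, the fraction of queries whose best match would be a cache hit.
thresholds = np.linspace(best_sim.min(), best_sim.max(), 11)
hit_rate = [(best_sim >= t).mean() for t in thresholds]
```

The sweep output is what ends up in the CSVs and plots: one row per threshold, so you can see where the hit rate falls off rather than guessing at a single cutoff.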
Two ways to evaluate
The repo supports two modes.
1. Fast cache-hit-ratio analysis
In the default mode, the pipeline focuses on cache hit ratio. This is useful when you want a quick first pass on a dataset and mainly care about how often a cache would fire at different thresholds.
This mode is helpful for questions like:
- Do my queries cluster tightly enough for semantic caching to help?
- How sensitive is cache hit ratio to the threshold?
- Is there an obvious operating region worth testing online?
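The sensitivity question in particular reduces to a small computation over the per-query best-match scores. A minimal sketch, with hypothetical scores standing in for a real batch:

```python
def hit_ratio(best_sims, threshold):
    """Fraction of queries whose best cache match clears the threshold."""
    return sum(s >= threshold for s in best_sims) / len(best_sims)

# Hypothetical best-match similarity scores for a batch of queries.
scores = [0.95, 0.91, 0.88, 0.72, 0.69, 0.55, 0.40]

curve = {t: hit_ratio(scores, t) for t in (0.5, 0.7, 0.9)}
# A steep drop between adjacent thresholds signals high sensitivity;
# a flat region is a candidate operating zone worth testing online.
```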
2. Full evaluation with quality metrics
The full evaluation mode adds an LLM-as-a-judge stage on top of the nearest-match retrieval step. The judge produces proxy labels for whether each retrieved cache match is actually acceptable, which makes it possible to compute metrics such as precision, recall, and F-scores across the threshold sweep.
This is the more useful mode when the real question is not just “how often would the cache hit?” but “how often would it hit correctly?”
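With judge labels in hand, the standard confusion-matrix metrics fall out at each threshold. A sketch, assuming binary labels (1 = the judge accepted the retrieved match) and hypothetical scores:

```python
def metrics_at_threshold(scores, labels, threshold):
    """Precision/recall/F1, treating 'score >= threshold' as a cache hit.

    labels: 1 if the judge deemed the retrieved match acceptable, else 0.
    """
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical best-match scores and judge labels.
scores = [0.95, 0.90, 0.85, 0.70, 0.60]
labels = [1, 1, 0, 1, 0]
p, r, f = metrics_at_threshold(scores, labels, 0.8)
```

Precision here answers "how often would the cache hit correctly?": of the matches the threshold lets through, the fraction the judge accepted.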
What you provide
The pipeline is built around two simple inputs:
- a query log CSV
- a cache/reference CSV
Both files must share a common text column. The query file represents the traffic you want to evaluate; the cache file represents the candidate responses or utterances that LangCache would try to reuse.
The tooling supports both local paths and s3:// paths, which makes it practical to run on small local samples or larger offline datasets stored in object storage.
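The input shape is easy to mock up. A minimal sketch with pandas; the column name "text" is an assumption for illustration, not necessarily the repo's required schema:

```python
import io
import pandas as pd

# Hypothetical minimal inputs: both files share a common text column.
queries_csv = io.StringIO(
    "text\nhow do I reset my password?\nwhat is my balance?\n"
)
cache_csv = io.StringIO("text\nreset your password via settings\n")

queries = pd.read_csv(queries_csv)
cache = pd.read_csv(cache_csv)

# pandas can also read s3:// URLs directly (with s3fs installed), e.g.:
# queries = pd.read_csv("s3://my-bucket/queries.csv")
```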
What you get back
The main outputs are the artifacts you need to choose a threshold responsibly:
- match files showing the best candidate per query
- threshold sweep CSVs
- cache-hit-rate plots
- in full mode, precision/recall/F-score outputs and comparison plots
That makes the repo useful for several evaluation loops:
- comparing embedding models
- comparing Redis-backed matching against in-memory matching
- testing how much quality drops as cache hit rate increases
- deciding what threshold should move to online experimentation
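The last loop, picking a threshold to take online, usually means applying a selection rule to the sweep output. One reasonable rule, sketched over hypothetical sweep rows (the 0.90 precision floor is an arbitrary example target, not a recommendation):

```python
# Hypothetical rows from a threshold-sweep CSV: (threshold, hit_rate, precision).
sweep = [
    (0.60, 0.80, 0.70),
    (0.70, 0.65, 0.85),
    (0.80, 0.45, 0.93),
    (0.90, 0.20, 0.98),
]

# Selection rule: the most aggressive setting (highest hit rate) that
# still meets the precision floor.
candidates = [row for row in sweep if row[2] >= 0.90]
chosen = max(candidates, key=lambda row: row[1]) if candidates else None
```

Other rules (maximize F1, cap the false-reuse rate) slot into the same pattern; the point is that the decision is made from measured rows, not a guessed cutoff.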
Why this is useful for LangCache
LangCache is most valuable when it improves latency and cost without introducing too many wrong cache reuses. The hard part is that every dataset has different duplication patterns, phrasing variation, and ambiguity.
This repo helps turn that into a measurable offline exercise. Instead of asking whether semantic caching works in general, you can ask:
- what cache hit ratio is possible on my data?
- what precision do I get at that hit rate?
- how does the threshold behave on real traffic?
- which model or serving setup gives the best tradeoff?
That is exactly the kind of analysis needed before enabling LangCache broadly in production.
Practical workflow
A typical workflow looks like this:
- Export a representative set of queries.
- Build or sample the cache candidate set.
- Run the evaluation pipeline offline.
- Inspect the threshold sweep outputs and plots.
- Choose a small set of candidate thresholds.
- Validate those thresholds in a controlled online setting.
The repo is therefore less about one benchmark number and more about giving teams a repeatable method for deciding whether semantic caching is viable on their own workload.
Repo
- GitHub: https://github.com/redislabsdev/langcache-customer-data-eval