An Alternative LLM-as-a-Judge Local Pipeline for Better Stability and Batch Scaling

evaluation
llm
vllm
benchmarking
A simpler yes/no classification setup improves determinism, scales better on larger batches, and preserves evaluation quality.
Author

Radoslav Ralev

Published

March 24, 2026

This post summarizes an internal evaluation of a revised local LLM-as-a-judge pipeline for cache-hit classification.

The main change is simple: instead of asking the model to generate a structured JSON object for each example, the pipeline now asks for a binary decision, yes or no, and compares the token probabilities p(yes) and p(no) at inference time.

That small interface change has a large systems impact. It removes fragile output parsing, reduces run failures, improves determinism across batch orderings, and makes larger-batch execution much more practical.

Why the earlier approach was brittle

The older implementation had three main weaknesses:

  • It relied on the model to emit valid JSON, so small formatting errors could break a run.
  • It showed non-deterministic behavior across batched inference, where identical sentence pairs could receive different labels depending on batch composition.
  • It was effectively zero-shot, which limited performance compared with a prompt design that better framed the classification task.

In practice, this meant that the evaluation pipeline was doing more work than necessary. The model was spending capacity on output formatting instead of the classification decision itself.
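The fragility is easy to reproduce. As a minimal sketch (the outputs and the `cache_hit` field name are illustrative, not the pipeline's actual schema), any formatting drift around the JSON payload breaks old-style parsing:

```python
import json

# A judge response the old pipeline expected: a strict JSON object.
good_output = '{"cache_hit": true}'

# A realistic failure mode: the model wraps the JSON in conversational prose.
bad_output = 'Sure! Here is the result:\n{"cache_hit": true}'

def parse_judge_json(text: str):
    """Old-style parsing: any deviation from strict JSON fails the example."""
    try:
        return json.loads(text)["cache_hit"]
    except (json.JSONDecodeError, KeyError):
        return None  # in the old pipeline this surfaced as a broken run

print(parse_judge_json(good_output))  # True
print(parse_judge_json(bad_output))   # None: formatting drift breaks parsing
```

The binary-decision interface sidesteps this entire class of failures, because there is nothing to parse: the decision is read directly from token probabilities.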

The new classification setup

The revised pipeline narrows the task to a direct binary choice:

  • Prompt the model to answer yes or no
  • Compare the token probabilities p(yes) and p(no)
  • Mark the pair as a cache hit when p(yes) > p(no)

This design makes the system easier to reason about and easier to scale. The surrounding code owns the output structure, while the model focuses on the decision boundary.
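The decision rule itself is a few lines. As a sketch (the function name is illustrative; the logits are assumed to be the final-token scores for the "yes" and "no" tokens):

```python
import math

def cache_hit_from_logits(logit_yes: float, logit_no: float) -> bool:
    """Renormalize over the two answer tokens and compare p(yes) to p(no).

    Softmax is monotonic, so p(yes) > p(no) iff logit_yes > logit_no;
    computing the probability also yields a confidence score for free.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    p_yes = e_yes / (e_yes + e_no)
    return p_yes > 0.5  # equivalent to p(yes) > p(no)

print(cache_hit_from_logits(2.1, -0.3))  # True
```

Because the comparison happens in the surrounding code, the output is deterministic for a given pair of logits: there is no sampling step and no free-form text to go wrong.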

What improved

The updated pipeline delivers four practical gains:

  • Improved stability through fewer failed runs
  • Better batch-size scaling on L40S GPUs
  • Deterministic outputs for repeated sentence pairs
  • Lower latency from a simpler forward-pass pattern

These are not just implementation conveniences. They matter directly for evaluation throughput, reproducibility, and confidence in benchmark results.

Prompt comparison and implementation behavior

The writeup compares the old and new setups across different batch sizes and prompt choices.

One key takeaway is that the new implementation produces more stable metrics with less variance. The goal is not to claim a dramatic quality jump from prompt engineering alone, but to show that the revised setup behaves more consistently under scale.

The experiments also compare Hugging Face execution with vLLM. Metric quality stays comparable, while vLLM provides a meaningful speed advantage in most runs.

Plot 1 - Different prompts

The first comparison looks at varying batch sizes with different system prompts. The main point is stability rather than raw quality: the revised pipeline shows less metric variance across batches while avoiding the failure modes of the earlier JSON-generation setup.

Comparison across varying batch sizes with different prompts.

Plot 2 - Similar prompts

The second comparison repeats the experiment with approximately the same prompt structure on both sides. This isolates the implementation change more directly and highlights the efficiency and stability gains of the new method while keeping output quality comparable.

Comparison across varying batch sizes with similar prompts.

Sub-10B model benchmark on Quora Question Pairs

The second part of the evaluation benchmarks local models under 10B parameters on Quora Question Pairs.

Two experimental choices make the results more trustworthy:

  • The evaluation sample size increases from 1024 to 4096
  • Each model is run 5 times on separate 4096-sample batches

That setup makes it possible to report means and standard deviations for precision, recall, F1, and runtime, rather than relying on a single noisy run.
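The aggregation itself is straightforward. A minimal sketch with hypothetical F1 scores (the numbers below are illustrative, not benchmark results):

```python
from statistics import mean, stdev

# Hypothetical F1 scores from 5 repeated runs of one model on
# separate 4096-sample batches (illustrative values only).
f1_runs = [0.871, 0.868, 0.874, 0.869, 0.872]

# Report the summary statistic used in the benchmark table: mean ± std.
print(f"F1 = {mean(f1_runs):.3f} ± {stdev(f1_runs):.3f}")
```

The same computation applies to precision, recall, and runtime, which is what allows the table later in the post to report variance alongside each point estimate.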

Benchmark observations

The main observations from the benchmark are:

  • Precision, recall, and F1 remain stable across repeated runs
  • Top-performing small models show low variance, suggesting the results are not driven by one favorable sample
  • vLLM is often about 2x faster than the Hugging Face path, though a few smaller models occasionally run faster through HF, where fixed serving overhead outweighs vLLM's throughput advantage

The overall message is that local LLM judging can be both practical and reproducible when the task formulation is kept narrow and the serving path is optimized.

Summary table

The benchmark summary reports precision, recall, and F1 with standard deviations across five runs, alongside inference times for both vLLM and the Hugging Face implementation.

Sub-10B Quora benchmark summary table.

The broad pattern is consistent with the rest of the writeup: small local models can deliver stable evaluation quality, and vLLM often provides a strong runtime advantage without changing the quality conclusions.

Takeaway

The strongest result here is not a single benchmark number. It is the systems lesson.

When an evaluation pipeline asks a model to do only the minimum necessary work, the entire stack becomes easier to scale and more reliable. In this case, replacing structured generation with probability-based binary classification improves stability, preserves evaluation quality, and makes local LLM judging a stronger option for large batch workloads.