Hybrid Search Retrieval: Combining Semantic Search and BM25 Algorithm with Examples

search
retrieval
bm25
embeddings
How to combine embedding similarity with BM25, plus failure modes (negation and identifiers).
Author

Srijith Rajamohan

Published

March 18, 2026

The examples below illustrate hybrid retrieval: a query is matched against a corpus of documents by combining semantic search (using ModernBERT embeddings) with the BM25 algorithm.

The scores from each method are normalized to the range 0 to 1 and then combined into a weighted-average hybrid score. The raw embedding score is also shown as a column for diagnostics. The ranks from each method are then combined using Reciprocal Rank Fusion (RRF) to produce a second score that can be used for retrieval.
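As a minimal sketch of the two fusion schemes described above, the snippet below computes a weighted hybrid score and an RRF score. It assumes min-max normalization, an equal weighting (`alpha=0.5`) and the commonly used RRF constant `k=60`; none of these are confirmed by the examples, and the scores are made up for illustration.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_hybrid(bm25, emb, alpha=0.5):
    """alpha weights the embedding score; (1 - alpha) weights BM25 (assumed split)."""
    b, e = min_max(bm25), min_max(emb)
    return [alpha * ei + (1 - alpha) * bi for bi, ei in zip(b, e)]

def ranks(scores):
    """1-based rank per document, highest score first (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, doc_index in enumerate(order, start=1):
        result[doc_index] = position
    return result

def rrf(score_lists, k=60):
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over each ranking."""
    fused = [0.0] * len(score_lists[0])
    for scores in score_lists:
        for doc_index, rank in enumerate(ranks(scores)):
            fused[doc_index] += 1.0 / (k + rank)
    return fused

# Hypothetical scores for four documents (not taken from the examples).
bm25_scores = [4.0, 2.0, 0.0, 1.0]
emb_scores = [0.70, 0.90, 0.50, 0.60]

hybrid = weighted_hybrid(bm25_scores, emb_scores, alpha=0.5)
fused = rrf([bm25_scores, emb_scores])
print(hybrid)
print(fused)
```

Note that RRF only looks at ranks, so it discards how far apart two scores are, while the weighted hybrid keeps that information but is sensitive to how each score is normalized.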

The best match for each query is highlighted in each example.

Example 1

Hybrid retrieval example 1 (BM25 vs embedding vs hybrid).

Example 2

Hybrid retrieval example 2 (BM25 vs embedding vs hybrid).

Failure mode 1 - Negations/directionality

BM25 is not plain keyword matching: it weights every query term by how rare that term is in the corpus, so any rare term is up-weighted, not just the keywords you care about.
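To make the rarity weighting concrete, here is a small sketch that computes the IDF part of BM25 over a toy corpus. It assumes whitespace tokenization and the Lucene-style IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`; the corpus is hypothetical and is not the one used in the examples.

```python
import math
from collections import Counter

def bm25_idf(corpus_tokens):
    """IDF per term: ln(1 + (N - n + 0.5) / (n + 0.5)), n = docs containing the term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    return {
        term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        for term, n in doc_freq.items()
    }

# Hypothetical corpus: "the" is common, every term of the second doc is rare.
corpus = [
    "the quarterly report is ready".split(),
    "the semantic search for doc #124".split(),
    "the meeting notes are ready".split(),
    "the budget is ready".split(),
]

idf = bm25_idf(corpus)
for term in ["the", "ready", "semantic", "doc", "#124"]:
    print(f"{term:10s} {idf[term]:.3f}")
```

In this toy corpus the identifier `#124` gets exactly the same weight as `semantic` or `doc`, because each appears in one document. BM25 has no notion that the identifier is the term that matters.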

In example 1, the key term of interest (the attribute/identifier) is #124. However, the other query terms ("semantic", "search", "for" and "doc") are all relatively rare in the corpus and contribute to higher BM25 scores, as the first two records in the table (8 and 11) show. Both of these records have negated intent, however, and are therefore incorrect matches. BM25 has no ability to identify negations.

Fortunately, the embedding similarity scores for these records are not the highest, but they will most likely exceed most commonly used thresholds. Semantic search can therefore also fail to identify negated intent.

Failure mode 2 - Attributes/Identifiers

The next three docs (ids 5, 7 and 6) have the same phrasing but the wrong identifiers (#123, #1243 and #1233 respectively). This lowers their BM25 scores a bit, but they still score higher than all the other documents. BM25 can detect a different attribute/identifier, but the size of the drop depends on what the rest of the sentence looks like:

  • If there is strong lexical overlap and only the identifier differs, there will be a small drop in the BM25 score.
  • If the rest of the sentence is empty (the query is just the identifier), the BM25 score is 0.
  • If there is very little overlap between the rest of the query and the doc, the identifier accounts for most of the score, so a mismatch causes a large drop.

BM25 can identify attribute/identifier changes, but it is hard to put a threshold on this score.
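The dependence on the rest of the sentence can be seen with a small from-scratch BM25 scorer. This is a sketch under stated assumptions: whitespace tokenization (so `#124` stays a single token, which a real tokenizer may not do), the typical parameters `k1=1.5` and `b=0.75`, and a hypothetical four-document corpus.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # terms missing from the doc contribute nothing
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical corpus, not the one from the examples.
docs = [
    "semantic search for doc #124",        # right identifier
    "semantic search for doc #123",        # same phrasing, wrong identifier
    "quarterly budget summary and notes",  # unrelated
    "meeting agenda draft version two",    # unrelated
]

full = bm25_scores("semantic search for doc #124", docs)  # strong overlap
id_only = bm25_scores("#124", docs)                       # identifier only
low = bm25_scores("please fetch item #124", docs)         # little overlap

print("strong overlap :", [round(s, 3) for s in full])
print("identifier only:", [round(s, 3) for s in id_only])
print("little overlap :", [round(s, 3) for s in low])
```

With strong overlap, the wrong-identifier doc keeps most of its score. With an identifier-only or low-overlap query, the same doc drops to zero. The same identifier mismatch therefore produces very different score gaps, which is why a single threshold is hard to choose.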

With respect to embeddings, two forces are at play: the global semantic intent match and the identifier match.

  • In 8 and 11, the intent is negated but the identifier matches.
  • In 5, 6 and 7, the general intent is correct but the identifier is wrong.
  • The resulting score is a complex interaction of the two, so the answer to “Can we identify negations or attribute changes with a threshold?” is heavily data-dependent.

Note that the embedding scores for 5, 7 and 6 are among the highest, because attribute/identifier differences are something that embeddings struggle to detect.

How do the hybrid score and RRF fare?

Combining the two signals, whether through a weighted hybrid score or through RRF, does not help here. In example 1, record 9, which is the closest match, still does not rank highly under either scheme.

To summarize both approaches across these failure modes, embeddings seem to have a slight edge.

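| Method    | Negation  | Directionality | Attribute/Identifier          |
|-----------|-----------|----------------|-------------------------------|
| BM25      | N         | N              | N (hard to put a threshold)   |
| Embedding | Sometimes | N              | Sometimes                     |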

We can’t eliminate all of these issues, but we can hope to mitigate them based on our understanding of the data.

But what did we want to achieve with BM25?

We are really looking to filter out matches that do not have the attributes/identifiers we seek. BM25 looks like it could provide that, but it does more than attribute matching. A more direct approach is:

  • Extract identifiers from the query.
  • Perform embedding-based retrieval of the query against the docs to retrieve a candidate set M.
  • Filter out docs in M whose identifiers do not match.

Unfortunately, this does nothing to address the negation/directionality limitation.
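The extract, retrieve and filter steps above can be sketched roughly as follows. Several pieces are assumptions for illustration only: identifiers are taken to look like `#` followed by digits, `retrieve_then_filter` and `toy_similarity` are hypothetical names, and the token-overlap similarity is a stand-in for a real embedding model such as ModernBERT so that the snippet runs without extra dependencies.

```python
import re

ID_PATTERN = re.compile(r"#\d+")  # assumed identifier format, e.g. "#124"

def extract_ids(text):
    """Return the set of identifiers found in the text."""
    return set(ID_PATTERN.findall(text))

def toy_similarity(a, b):
    """Placeholder for embedding cosine similarity: Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def retrieve_then_filter(query, docs, similarity, top_k=5):
    """Retrieve the top_k docs by similarity (set M), then drop identifier mismatches."""
    query_ids = extract_ids(query)
    scored = sorted(
        ((similarity(query, d), d) for d in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    candidates = scored[:top_k]  # set M
    if not query_ids:
        return candidates  # nothing to filter on
    # Keep a doc only if it contains every identifier mentioned in the query.
    return [(s, d) for s, d in candidates if query_ids <= extract_ids(d)]

# Hypothetical documents, not the ones from the examples.
docs = [
    "semantic search for doc #124",
    "semantic search for doc #123",
    "semantic search for doc #1243",
    "do not run semantic search for doc #124",  # negated intent
]

results = retrieve_then_filter("semantic search for doc #124", docs, toy_similarity, top_k=4)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Comparing extracted identifiers as whole tokens, rather than checking for a substring, keeps `#1243` from being treated as a match for `#124`. The negated document still passes the filter, which illustrates the limitation noted above.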