RAG: Hybrid Search Optimization Notes — TF-IDF, BM25 & RRF

Table of Contents

1. Hybrid Search

1.1. TF-IDF (Term Frequency-Inverse Document Frequency)

1.1.1. Introduction

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely-used weighting technique in information retrieval and text mining.

Its core purpose: Evaluate how important a word is to a document within a collection or corpus.

The core idea in one sentence: If a word appears frequently in a particular document (locally hot) but rarely across all other documents (globally rare), then that word best represents that document.

1.1.2. Core Formulas

TF-IDF is composed of two parts: TF (Term Frequency) and IDF (Inverse Document Frequency).

\[ TF\text{-}IDF = TF \times IDF \]

  1. TF (Term Frequency)

    Measures how often term \(t\) appears in document \(d\).

    • Intuition: The more frequently a word appears in a document, the more important it likely is to that document's content.
    • Formula: \[ TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
  2. IDF (Inverse Document Frequency)

    Measures how common or rare term \(t\) is across the entire corpus.

    • Intuition: If a word appears in many documents (e.g., "the", "is"), it has low discriminative power. Conversely, if a word is rare (e.g., "quantum mechanics"), its presence is highly informative.
    • Formula: \[ IDF(t) = \log \left( \frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t + 1} \right) \] Note: Adding 1 to the denominator is for smoothing, preventing division by zero.

1.1.3. Worked Example

To build intuition, assume a tiny corpus of just three documents:

  • Document A: "bee likes flower"
  • Document B: "I like bee"
  • Document C: "I like apple"
  1. Scenario 1: Importance of "bee" in Document A
    1. Calculate TF (local):

      • "bee" appears 1 time in Document A.
      • Total words in Document A = 3.

      \[ TF = \frac{1}{3} \approx 0.333 \]

    2. Calculate IDF (global):

      • Total documents in corpus = 3.
      • Documents containing "bee" = 2 (A and B).

      \[ IDF = \log\left(\frac{3}{2+1}\right) = \log(1) = 0 \] (With the +1 smoothing from the formula above, the IDF of "bee" drops to exactly 0; without smoothing it would be \(\log_{10}(3/2) \approx 0.176\). Either way, the more documents that contain a term, the closer its IDF gets to 0.)

  2. Scenario 2: Importance of "flower" in Document A
    1. Calculate TF: \[ TF = \frac{1}{3} \approx 0.333 \]
    2. Calculate IDF:

      • "flower" appears only in Document A (1 document).

      \[ IDF = \log\left(\frac{3}{1+1}\right) = \log(1.5) \approx 0.176 \] (base 10, with the same +1 smoothing as above)

    3. Final TF-IDF: \[ TF\text{-}IDF = 0.333 \times 0.176 \approx 0.059 \]
  3. Conclusions from the Example
    • "flower" (≈ 0.059) scores higher than "bee" (0).
    • "like" appears in two of the three documents ("likes" in Document A is a distinct token here; a stemmer would merge them), so with smoothing its IDF is \(\log(3/3) = 0\): it scores 0 just like "bee".
    • This matches our expectation: "flower" is a unique feature of Document A, and therefore its most representative term.
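The arithmetic above can be checked in a few lines of Python. This is a minimal sketch of the two formulas, using the smoothed base-10 IDF \(\log(N / (df + 1))\) defined earlier; the corpus and term names come straight from the example.

```python
import math

# Toy corpus from the worked example (lowercased, whitespace-tokenized)
corpus = {
    "A": "bee likes flower".split(),
    "B": "i like bee".split(),
    "C": "i like apple".split(),
}
N = len(corpus)

def tf(term, doc):
    # TF(t, d) = occurrences of t in d / total terms in d
    return doc.count(term) / len(doc)

def idf(term):
    # Smoothed IDF(t) = log10(N / (df(t) + 1))
    df = sum(term in doc for doc in corpus.values())
    return math.log10(N / (df + 1))

doc_a = corpus["A"]
print(round(tf("bee", doc_a) * idf("bee"), 3))        # 0.0   ("bee" is in 2 of 3 docs)
print(round(tf("flower", doc_a) * idf("flower"), 3))  # 0.059 ("flower" only in A)
```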

1.1.4. Core Philosophy: The Tension Between Local and Global

The essence of TF-IDF is solving one problem: How to find the balance between "local importance" and "global commonality"?

  1. Philosophical Breakdown
    | Concept            | Metric   | Perspective       | Subtext                                                                              |
    |--------------------+----------+-------------------+--------------------------------------------------------------------------------------|
    | Specific document  | TF       | Local             | "In this small environment, this word appears so many times; it must be important."   |
    | Global observation | IDF      | Global            | "Wait, let me check whether this word is everywhere in other documents. If it's a commodity, local frequency is useless." |
    | Importance result  | TF × IDF | Weighted judgment | Only words that are locally hot and globally rare are the true "local specialties."   |
  2. Analogy: Finding a Regional Specialty
    • Bottled water: Abundant in one store (high TF), but available in every convenience store nationwide (low IDF) → NOT a specialty.
    • Yak butter tea: Abundant in one store (high TF), but rarely seen in other regions (high IDF) → IS a specialty.

1.1.5. Advantages and Disadvantages

  1. Advantages
    1. Simple and effective: Clear logic, low computational cost, easy to implement.
    2. Automatic filtering: Automatically down-weights stop words (e.g., "the", "is", "a") without manual intervention.
    3. Strong baseline model: An excellent baseline for NLP tasks.
  2. Disadvantages
    1. Ignores word order: It is a "Bag of Words" model — cannot distinguish between "cat chases mouse" and "mouse chases cat".
    2. Ignores semantics: Cannot recognize synonyms (e.g., "computer" and "PC" are treated as entirely different words).
    3. Data sparsity: In large-scale corpora, the term-document matrix becomes extremely sparse.
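Point 1 is easy to demonstrate: a bag-of-words model reduces a sentence to its term counts, so the two sentences below become indistinguishable. A standard-library sketch:

```python
from collections import Counter

def bow(sentence: str) -> Counter:
    # A bag-of-words representation keeps only term counts, discarding order
    return Counter(sentence.lower().split())

print(bow("cat chases mouse") == bow("mouse chases cat"))  # True
```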

1.1.6. Code Example (Python)

Using the scikit-learn library for quick TF-IDF computation:

from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus
corpus = [
    'bee likes flower',
    'I like bee',
    'I like apple',
]

# Initialize and compute
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Print feature words and corresponding TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())

1.2. BM25 (Best Matching 25)

1.2.1. Introduction

BM25 is a probabilistic refinement of TF-IDF and the current gold standard for keyword-based retrieval (Keyword Search). It primarily addresses two problems with TF-IDF:

  1. Unbounded term frequency: TF-IDF allows term frequency to increase scores linearly without limit.
  2. No document length normalization: TF-IDF does not account for varying document lengths.

1.2.2. Core Formula

\[Score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}\]

The formula is composed of three parts:

  1. IDF (Inverse Document Frequency): Measures the rarity (weight) of a term.

    \[IDF(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\]

    • \(N\): total number of documents in the corpus
    • \(n(q_i)\): number of documents containing \(q_i\)
    • The 0.5 smoothing terms prevent the denominator from reaching zero and keep the IDF stable even for very common words; the trailing +1 guarantees the value stays non-negative.
  2. TF Component (Term Frequency): Measures how often a term appears in a document (with a saturation mechanism).
  3. Length Normalization: Adjusts the score based on document length.
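The three parts compose into a per-term contribution. Below is a direct transcription of the formula as a sketch; the function and argument names are mine, not from any particular library.

```python
import math

def bm25_term_score(f, df, N, dl, avgdl, k1=1.5, b=0.75):
    """Contribution of one query term to BM25(D, Q).

    f: occurrences of the term in D, df: documents containing it,
    N: corpus size, dl: |D|, avgdl: average document length.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5)) + 1
    tf_part = f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return idf * tf_part

# A term occurring twice in an average-length document, present in 10 of 1000 docs:
print(round(bm25_term_score(f=2, df=10, N=1000, dl=100, avgdl=100.0), 2))  # 7.92
```

Note that with `dl == avgdl` the length factor reduces to 1, so only the IDF and the saturated TF part remain.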

1.2.3. Key Improvements over TF-IDF

  1. A. Term Frequency Saturation
    • TF-IDF: Term frequency increases score linearly without limit (easy to game, unreasonable).
    • BM25: Introduces diminishing returns. After term frequency reaches a certain level, the score saturates (upper bound is \(k_1 + 1\)).

    The key insight: the 10th occurrence of a word in a document adds far less information than the 1st occurrence. BM25 models this reality; TF-IDF does not.

  2. B. Document Length Normalization
    • Problem: Longer documents naturally contain more words, giving them an unfair advantage.
    • BM25: Penalizes long documents (\(|D| > avgdl\)) and rewards shorter documents.

    This means a concise, focused document matching a query is boosted relative to a lengthy document that happens to mention the query term among thousands of other words.
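Improvement A can be made concrete. Holding \(|D| = avgdl\) so the length factor is 1, the TF component \(\frac{f \cdot (k_1 + 1)}{f + k_1}\) climbs toward its ceiling of \(k_1 + 1\) with sharply diminishing returns:

```python
k1 = 1.5  # ceiling of the TF component is k1 + 1 = 2.5

for f in (1, 2, 5, 10, 100):
    # TF component with |D| = avgdl, so the length factor is 1
    print(f, round(f * (k1 + 1) / (f + k1), 3))
```

Going from 1 to 2 occurrences adds about 0.43 to the component; going from 10 all the way to 100 adds less than 0.3.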

1.2.4. The Two Key Parameters

| Parameter | Range     | Role                             | Explanation                                                                                                  |
|-----------+-----------+----------------------------------+--------------------------------------------------------------------------------------------------------------|
| \(k_1\)   | 1.2 - 2.0 | Controls TF saturation speed     | Smaller values = faster saturation (repeated words matter less); larger values = more linear, closer to raw TF. |
| \(b\)     | 0 - 1     | Controls length-penalty strength | Default: 0.75. \(b = 1\) = full length penalty; \(b = 0\) = no length penalty at all.                          |
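The role of \(b\) can be illustrated by scoring the same term frequency in a document twice the average length (illustrative numbers, with \(k_1 = 1.5\)):

```python
k1, f = 1.5, 3
ratio = 2.0  # |D| / avgdl: a document twice the average length

for b in (0.0, 0.75, 1.0):
    # Larger b -> larger denominator -> stronger penalty for the long document
    denom = f + k1 * (1 - b + b * ratio)
    print(b, round(f * (k1 + 1) / denom, 3))
```

At \(b = 0\) the long document is not penalized at all; at \(b = 1\) it takes the full penalty, and the default 0.75 sits in between.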

1.3. RRF (Reciprocal Rank Fusion)

1.3.1. Core Concept

In a RAG system, any single retrieval method has blind spots. Hybrid retrieval improves accuracy by combining two complementary search strategies.

| Retrieval type          | Representative algorithm | Strengths                                                  | Weaknesses                                                       |
|-------------------------+--------------------------+------------------------------------------------------------+------------------------------------------------------------------|
| Keyword search (sparse) | BM25                     | Exact matching: proper nouns, abbreviations, IDs           | Cannot understand semantics; synonyms fail                       |
| Vector search (dense)   | Cosine similarity        | Semantic understanding, contextual relevance, fuzzy queries | Insensitive to precise details; can surface loosely related matches |

1.3.2. RRF (Reciprocal Rank Fusion) Principle

RRF is a fusion algorithm based on rank rather than score. It sidesteps the problem that different retrievers produce scores on incompatible scales (e.g., BM25 scores are unbounded while vector similarities are bounded), which makes direct weighted combination fragile.

  1. Formula

    For the final score of document \(d\):

    \[RRFscore(d) = \sum_{r \in R} \frac{1}{k + r(d)}\]

    • \(R\): The set of all retrieval result lists.
    • \(r(d)\): The rank of document \(d\) in a given list (Rank 1 = first place).
    • \(k\): Smoothing constant (typically set to 60), used to prevent top-ranked documents from having disproportionately large weight.
  2. Workflow
    1. Parallel Retrieval: Execute BM25 and Vector Search simultaneously.
    2. Obtain Rankings: Get two separate Top-K document lists.
    3. Fusion Calculation: Ignore original scores; compute RRF Score based solely on ranks.
    4. Re-ranking: Sort by RRF Score in descending order; take the Top-N results to feed to the LLM.
  3. Worked Example

    Suppose we search for "quantum computing applications":

    | Document | BM25 rank | Vector rank | RRF score (k = 60)      |
    |----------+-----------+-------------+-------------------------|
    | Doc X    | 3         | 1           | 1/63 + 1/61 ≈ 0.03227   |
    | Doc Y    | 1         | 5           | 1/61 + 1/65 ≈ 0.03178   |
    | Doc Z    | 2         | 4           | 1/62 + 1/64 ≈ 0.03175   |

    Doc X wins because it ranks near the top of both lists: consistent placement across both retrievers beats Doc Y's single first place.
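The calculation can be reproduced with a short sketch. The two rank lists mirror the example above; "P" and "Q" are hypothetical filler documents occupying vector ranks 2 and 3.

```python
def rrf(rank_lists, k=60):
    # score(d) = sum over lists of 1 / (k + rank(d)), ranks starting at 1
    scores = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

bm25_ranking = ["Y", "Z", "X"]              # Doc Y ranked 1st, Z 2nd, X 3rd
vector_ranking = ["X", "P", "Q", "Z", "Y"]  # Doc X 1st, Z 4th, Y 5th

fused = rrf([bm25_ranking, vector_ranking])
for doc_id in ("X", "Y", "Z"):
    print(doc_id, round(fused[doc_id], 5))  # X 0.03227, Y 0.03178, Z 0.03175
```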

1.3.3. Core Advantages of RRF

  1. Normalization-Free: No need for complex weight tuning (e.g., \(0.7 \times A + 0.3 \times B\)). Extremely robust.
  2. Complementary strengths: Captures both "literally exact" matches (IDs, model numbers) and "semantically related" concepts.
  3. Cold-start friendly: For new terms not seen in training data, BM25 serves as a fallback since it relies on lexical matching rather than learned embeddings.

1.4. Summary: How It All Fits Together in RAG

The retrieval pipeline in a RAG system:

  1. User Query arrives.
  2. BM25 (sparse/keyword search) retrieves documents based on exact term matching, powered by TF-IDF principles with saturation and length normalization improvements.
  3. Vector Search (dense/semantic search) retrieves documents based on embedding similarity, capturing meaning beyond exact words.
  4. RRF fuses both result lists by rank, producing a single unified ranking that leverages the strengths of both approaches.
  5. The Top-N documents are passed as context to the LLM for answer generation.

1.5. Key Takeaways

  • TF-IDF is the foundation: locally frequent + globally rare = important. Simple but ignores word order and semantics.
  • BM25 improves on TF-IDF with term frequency saturation (\(k_1\)) and document length normalization (\(b\)). It is the gold standard for keyword search.
  • RRF is the fusion layer: it elegantly combines keyword and vector search results using only rank positions, avoiding the score normalization problem entirely.
  • Hybrid Search = BM25 + Vector Search + RRF: This combination is the current best practice for RAG retrieval, giving you both precision (exact matches) and recall (semantic understanding).

1.6. Implementation

"""
From-scratch implementation of hybrid search: BM25 + Vector Search + RRF.
Requires an OPENAI_API_KEY environment variable.

All scoring formulas follow RAG.org:
  - TF(t,d) = count(t,d) / |d|
  - IDF(q)  = ln((N - n(q) + 0.5) / (n(q) + 0.5)) + 1        [BM25 variant]
  - BM25    = Σ IDF(qi) * f(qi,D)*(k1+1) / (f(qi,D) + k1*(1 - b + b*|D|/avgdl))
  - RRF     = Σ 1/(k + rank(d))
"""

import math
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any

import openai
from qdrant_client import QdrantClient

# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------

def tokenize(text: str) -> list[str]:
    """Lowercase and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# ---------------------------------------------------------------------------
# Corpus loader
# ---------------------------------------------------------------------------

@dataclass
class Document:
    id: str
    tokens: list[str]
    payload: dict[str, Any] = field(default_factory=dict)


def load_corpus(
    collection_name: str = "knowledge_base",
    host: str = "localhost",
    port: int = 6333,
) -> list[Document]:
    """Scroll all points from Qdrant and build an in-memory corpus."""
    client = QdrantClient(host=host, port=port)
    documents: list[Document] = []
    offset = None

    while True:
        results, next_offset = client.scroll(
            collection_name=collection_name,
            offset=offset,
            limit=256,
            with_payload=True,
            with_vectors=False,
        )
        for point in results:
            text = (point.payload or {}).get("text", "")
            documents.append(Document(
                id=str(point.id),
                tokens=tokenize(text),
                payload=point.payload or {},
            ))
        if next_offset is None:
            break
        offset = next_offset
     
    return documents


# ---------------------------------------------------------------------------
# BM25 scoring (from scratch)
# ---------------------------------------------------------------------------

def bm25_search(
    query: str,
    corpus: list[Document],
    limit: int = 15,
    k1: float = 1.5,
    b: float = 0.75,
) -> list[tuple[Document, float]]:
    """
    Score every document against the query using BM25.

    BM25(D, Q) = Σ IDF(qi) * f(qi,D)*(k1+1) / (f(qi,D) + k1*(1-b + b*|D|/avgdl))

    IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5)) + 1
    """
    query_tokens = tokenize(query)
    if not query_tokens or not corpus:
        return []
     
    N = len(corpus)
    avgdl = sum(len(doc.tokens) for doc in corpus) / N

    # Document frequency: how many docs contain each term
    df: dict[str, int] = defaultdict(int)
    for doc in corpus:
        seen = set(doc.tokens)
        for token in seen:
            df[token] += 1
         
    # Precompute IDF for query terms
    idf: dict[str, float] = {}
    for qt in query_tokens:
        n_q = df.get(qt, 0)
        idf[qt] = math.log((N - n_q + 0.5) / (n_q + 0.5)) + 1
     
    # Score each document
    scored: list[tuple[Document, float]] = []
    for doc in corpus:
        score = 0.0
        doc_len = len(doc.tokens)
        # Term frequency map for this document
        tf: dict[str, int] = defaultdict(int)
        for t in doc.tokens:
            tf[t] += 1
         
        for qt in query_tokens:
            f_q = tf.get(qt, 0)
            if f_q == 0:
                continue
            numerator = f_q * (k1 + 1)
            denominator = f_q + k1 * (1 - b + b * doc_len / avgdl)
            score += idf[qt] * numerator / denominator
         
        if score > 0:
            scored.append((doc, score))
         
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:limit]


# ---------------------------------------------------------------------------
# Vector search (via Qdrant)
# ---------------------------------------------------------------------------

@dataclass
class VectorResult:
    id: str
    score: float
    payload: dict[str, Any] = field(default_factory=dict)


def vector_search(
    query: str,
    collection_name: str = "knowledge_base",
    host: str = "localhost",
    port: int = 6333,
    limit: int = 15,
) -> list[VectorResult]:
    """Generate an OpenAI embedding and query Qdrant for nearest neighbors."""
    embedding = openai.OpenAI().embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    client = QdrantClient(host=host, port=port)
    response = client.query_points(
        collection_name=collection_name,
        query=embedding,
        limit=limit,
        with_payload=True,
    )
    points = response.points if hasattr(response, "points") else response

    return [
        VectorResult(id=str(p.id), score=p.score, payload=p.payload or {})
        for p in points
    ]


# ---------------------------------------------------------------------------
# RRF fusion
# ---------------------------------------------------------------------------

@dataclass
class SearchResult:
    id: str
    score: float
    payload: dict[str, Any] = field(default_factory=dict)


def rrf_fuse(
    bm25_results: list[tuple[Document, float]],
    vec_results: list[VectorResult],
    limit: int = 5,
    k: int = 60,
) -> list[SearchResult]:
    """
    Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank(d))
    """
    rrf_scores: dict[str, float] = defaultdict(float)
    payloads: dict[str, dict[str, Any]] = {}

    for rank, (doc, _bm25_score) in enumerate(bm25_results, start=1):
        rrf_scores[doc.id] += 1.0 / (k + rank)
        payloads.setdefault(doc.id, doc.payload)
     
    for rank, vr in enumerate(vec_results, start=1):
        rrf_scores[vr.id] += 1.0 / (k + rank)
        payloads.setdefault(vr.id, vr.payload)
     
    sorted_ids = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:limit]

    return [
        SearchResult(id=pid, score=score, payload=payloads[pid])
        for pid, score in sorted_ids
    ]


# ---------------------------------------------------------------------------
# Entry point: hybrid search
# ---------------------------------------------------------------------------

def hybrid_search(
    query: str,
    collection_name: str = "knowledge_base",
    host: str = "localhost",
    port: int = 6333,
    limit: int = 5,
    bm25_k1: float = 1.5,
    bm25_b: float = 0.75,
    rrf_k: int = 60,
) -> list[SearchResult]:
    """
    Full hybrid search pipeline:
      1. Load corpus from Qdrant
      2. BM25 search (from scratch)
      3. Vector search (via Qdrant + OpenAI embeddings)
      4. RRF fusion
    """
    prefetch = limit * 3

    # 1. Build BM25 corpus
    corpus = load_corpus(collection_name, host, port)

    # 2. BM25 search
    bm25_results = bm25_search(query, corpus, limit=prefetch, k1=bm25_k1, b=bm25_b)

    # 3. Vector search
    vec_results = vector_search(query, collection_name, host, port, limit=prefetch)

    # 4. RRF fusion
    return rrf_fuse(bm25_results, vec_results, limit=limit, k=rrf_k)


def test_machine_learning_notes_exist():
    """Query 'machine learning notes' and verify relevant results are returned."""
    results = hybrid_search("machine learning notes", limit=5)

    # At least one result should come back
    assert len(results) > 0, "Expected at least one result for 'machine learning notes'"

    # All RRF scores should be positive
    for r in results:
        assert r.score > 0, f"RRF score should be positive, got {r.score}"
     
    # At least one result should mention machine learning in its text payload
    texts = [r.payload.get("text", "").lower() for r in results]
    assert any(
        "machine learning" in t for t in texts
    ), f"No result contained 'machine learning'. Top texts: {[t[:80] for t in texts]}"

    # Print results for visibility
    for i, r in enumerate(results, 1):
        snippet = r.payload.get("text", "")[:120].replace("\n", " ")
        print(f"  #{i}  rrf_score={r.score:.5f}  id={r.id}")
        print(f"       source={r.payload.get('source_file', 'N/A')}")
        print(f"       text={snippet}...")
     
     
def main():
    print("Hello from rrf_from_scratch!")
    test_machine_learning_notes_exist()
    print("Test passed!")


if __name__ == "__main__":
    main()


Date: 2026-02-10 Tue. 00:00

Author: Silin Zhao

Created: 2026-02-10 Tue. 20:21
