RAG: Hybrid Search Optimization Notes — TF-IDF, BM25 & RRF
1. Hybrid Search
1.1. TF-IDF (Term Frequency-Inverse Document Frequency)
1.1.1. Introduction
TF-IDF (Term Frequency-Inverse Document Frequency) is a widely-used weighting technique in information retrieval and text mining.
Its core purpose: Evaluate how important a word is to a document within a collection or corpus.
The core idea in one sentence: If a word appears frequently in a particular document (locally hot) but rarely across all other documents (globally rare), then that word best represents that document.
1.1.2. Core Formulas
TF-IDF is composed of two parts: TF (Term Frequency) and IDF (Inverse Document Frequency).
\[ TF\text{-}IDF = TF \times IDF \]
- TF (Term Frequency)
Measures how often term \(t\) appears in document \(d\).
- Intuition: The more frequently a word appears in a document, the more important it likely is to that document's content.
- Formula: \[ TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
- IDF (Inverse Document Frequency)
Measures how common or rare term \(t\) is across the entire corpus.
- Intuition: If a word appears in many documents (e.g., "the", "is"), it has low discriminative power. Conversely, if a word is rare (e.g., "quantum mechanics"), its presence is highly informative.
- Formula: \[ IDF(t) = \log \left( \frac{\text{Total number of documents in the corpus}}{\text{Number of documents containing term } t + 1} \right) \] Note: Adding 1 to the denominator is for smoothing, preventing division by zero.
1.1.3. Worked Example
To build intuition, assume a tiny corpus of just three documents:
- Document A: "bee likes flower"
- Document B: "I like bee"
- Document C: "I like apple"
- Scenario 1: Importance of "bee" in Document A
Calculate TF (local):
- "bee" appears 1 time in Document A.
- Total words in Document A = 3.
\[ TF = \frac{1}{3} \approx 0.333 \]
Calculate IDF (global):
- Total documents in corpus = 3.
- Documents containing "bee" = 2 (A and B).
\[ IDF = \log\left(\frac{3}{2+1}\right) = \log(1) = 0 \] (Note: the +1 smoothing pushes the IDF of this fairly common word all the way to 0; without smoothing it would be \(\log(3/2) \approx 0.176\). Either way, words that appear in most documents end up with IDF near 0.)
- Scenario 2: Importance of "flower" in Document A
- Calculate TF: \[ TF = \frac{1}{3} \approx 0.333 \]
Calculate IDF:
- "flower" appears only in Document A (1 document).
\[ IDF = \log\left(\frac{3}{1+1}\right) = \log(1.5) \approx 0.176 \] (base 10, applying the same +1 smoothing as in Scenario 1)
- Final TF-IDF: \[ TF\text{-}IDF = 0.333 \times 0.176 \approx 0.059 \]
- Conclusions from the Example
- "flower" scores higher (≈ 0.059) than "bee" (0).
- "like" appears in Documents B and C (Document A has "likes"), so its smoothed IDF is \(\log\left(\frac{3}{2+1}\right) = 0\) and it contributes nothing.
- This matches our expectation: "flower" is a unique feature of Document A, therefore it is the most important term.
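The arithmetic above can be checked with a few lines of plain Python, using the smoothed base-10 IDF from section 1.1.2:

```python
import math

# Tiny corpus from the worked example
docs = {
    "A": "bee likes flower".split(),
    "B": "I like bee".split(),
    "C": "I like apple".split(),
}
N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of `term` divided by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Smoothed IDF: log10(N / (df + 1)), as in section 1.1.2
    df = sum(1 for doc in docs.values() if term in doc)
    return math.log10(N / (df + 1))

def tfidf(term, doc_id):
    return tf(term, docs[doc_id]) * idf(term)

print(round(tfidf("bee", "A"), 3))     # "bee" is in A and B, so IDF = log10(3/3) = 0
print(round(tfidf("flower", "A"), 3))  # "flower" only in A: (1/3) * log10(3/2)
```

Note that real implementations differ in details (scikit-learn, for instance, uses natural log, a different smoothing scheme, and L2 normalization), so exact numbers vary across libraries even though the ranking intuition is the same.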
1.1.4. Core Philosophy: The Tension Between Local and Global
The essence of TF-IDF is solving one problem: How to find the balance between "local importance" and "global commonality"?
- Philosophical Breakdown
- Philosophical Breakdown

| Concept | Metric | Perspective | Subtext |
|---|---|---|---|
| Specific Document | TF | Local | "In this small environment, this word appears so many times — it must be important." |
| Global Observation | IDF | Global | "Wait! Let me check if this word is everywhere in other documents. If it's a commodity, local frequency is useless." |
| Importance | Result | Weighted Judgment | Only words that are locally hot + globally rare are the true "local specialties." |

- Analogy: Finding a Regional Specialty
- Bottled water: Abundant in one store (high TF), but available in every convenience store nationwide (low IDF) → NOT a specialty.
- Yak butter tea: Abundant in one store (high TF), but rarely seen in other regions (high IDF) → IS a specialty.
1.1.5. Advantages and Disadvantages
- Advantages
- Simple and effective: Clear logic, low computational cost, easy to implement.
- Automatic filtering: Automatically down-weights stop words (e.g., "the", "is", "a") without manual intervention.
- Strong baseline model: An excellent baseline for NLP tasks.
- Disadvantages
- Ignores word order: It is a "Bag of Words" model — cannot distinguish between "cat chases mouse" and "mouse chases cat".
- Ignores semantics: Cannot recognize synonyms (e.g., "computer" and "PC" are treated as entirely different words).
- Data sparsity: In large-scale corpora, the term-document matrix becomes extremely sparse.
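The word-order blindness is easy to demonstrate: under a bag-of-words model, two sentences with opposite meanings produce identical term counts. A minimal stdlib sketch:

```python
from collections import Counter

# Bag-of-words representation: token counts only, order discarded
bow_1 = Counter("cat chases mouse".split())
bow_2 = Counter("mouse chases cat".split())

print(bow_1 == bow_2)  # True: the model cannot tell these sentences apart
```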
1.1.6. Code Example (Python)
Using the scikit-learn library for quick TF-IDF computation:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus
corpus = [
    'bee likes flower',
    'I like bee',
    'I like apple',
]

# Initialize and compute
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Print feature words and corresponding TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```
1.2. BM25 (Best Matching 25)
1.2.1. Introduction
BM25 is the probabilistic optimization of TF-IDF and the current gold standard for keyword-based retrieval (Keyword Search). It primarily solves two problems that TF-IDF has:
- Unbounded term frequency: TF-IDF allows term frequency to increase scores linearly without limit.
- No document length normalization: TF-IDF does not account for varying document lengths.
1.2.2. Core Formula
\[Score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}\]
The formula is composed of three parts:
IDF (Inverse Document Frequency): Measures the rarity (weight) of a term.
\[IDF(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\]
- \(N\): total number of documents in the corpus
- \(n(q_i)\): the number of documents containing \(q_i\)
- 0.5 smoothing terms: prevent the denominator from reaching zero and keep the log argument stable even for very common terms. The trailing \(+1\) keeps the IDF positive for all but extremely common terms; some implementations (e.g., Lucene) instead use \(\ln\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)\), which is non-negative by construction.
- TF Component (Term Frequency): Measures how often a term appears in a document (with a saturation mechanism).
- Length Normalization: Adjusts the score based on document length.
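As a quick sanity check on the IDF component, here is the BM25 IDF variant above evaluated for a rare and a common term in a hypothetical corpus of 1,000 documents (the counts are made up for illustration):

```python
import math

def bm25_idf(N, n):
    # BM25 IDF variant: ln((N - n + 0.5) / (n + 0.5)) + 1
    return math.log((N - n + 0.5) / (n + 0.5)) + 1

N = 1000                    # hypothetical corpus size
rare = bm25_idf(N, 5)       # term appearing in 5 documents
common = bm25_idf(N, 600)   # term appearing in 600 documents

print(f"rare:   {rare:.3f}")    # rare terms get a much larger weight
print(f"common: {common:.3f}")  # common terms are heavily discounted
```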
1.2.3. Key Improvements over TF-IDF
- A. Term Frequency Saturation
- TF-IDF: Term frequency increases score linearly without limit (easy to game, unreasonable).
- BM25: Introduces diminishing returns. After term frequency reaches a certain level, the score saturates (upper bound is \(k_1 + 1\)).
The key insight: the 10th occurrence of a word in a document adds far less information than the 1st occurrence. BM25 models this reality; TF-IDF does not.
- B. Document Length Normalization
- Problem: Longer documents naturally contain more words, giving them an unfair advantage.
- BM25: Penalizes long documents (\(|D| > avgdl\)) and rewards shorter documents.
This means a concise, focused document matching a query is boosted relative to a lengthy document that happens to mention the query term among thousands of other words.
1.2.4. The Two Key Parameters
| Parameter | Range | Role | Explanation |
|---|---|---|---|
| \(k_1\) | 1.2 - 2.0 | Controls TF saturation speed | Smaller values = faster saturation (repeated words matter less); larger values = more linear, closer to raw TF. |
| \(b\) | 0 - 1 | Controls length penalty strength | Default: 0.75. \(b=1\) = full length penalty; \(b=0\) = no length penalty at all. |
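Both effects can be seen by evaluating just the TF component of the formula. For a document of average length the length term reduces to \(f + k_1\); a short sketch:

```python
def tf_component(f, k1=1.5, b=0.75, dl=100, avgdl=100):
    # BM25 TF part: f*(k1+1) / (f + k1*(1 - b + b*dl/avgdl))
    return f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))

# Saturation: the contribution approaches k1+1 = 2.5 but never exceeds it
for f in (1, 2, 5, 10, 100):
    print(f"f={f:<3} -> {tf_component(f):.3f}")

# Length normalization: same term count, but the 5x-longer document scores lower
print(tf_component(3, dl=500) < tf_component(3, dl=100))  # True
```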
1.3. RRF (Reciprocal Rank Fusion)
1.3.1. Core Concept
In a RAG system, any single retrieval method has blind spots. Hybrid retrieval improves accuracy by combining two complementary search strategies.
| Retrieval Type | Representative Algorithm | Strengths | Weaknesses |
|---|---|---|---|
| Keyword Search (Sparse) | BM25 | Exact matching, proper nouns, abbreviations, IDs | Cannot understand semantics; synonyms fail |
| Vector Search (Dense) | Cosine Similarity | Semantic understanding, contextual relevance, fuzzy queries | Insensitive to precise details; may surface loosely related passages |
1.3.2. RRF (Reciprocal Rank Fusion) Principle
RRF is a fusion algorithm based on rank rather than score. It solves the problem that different retrievers produce scores on different scales (e.g., BM25 scores are unbounded, while cosine similarity is confined to a fixed range), which makes direct weighted combination difficult.
- Formula
For the final score of document \(d\):
\[RRFscore(d) = \sum_{r \in R} \frac{1}{k + r(d)}\]
- \(R\): The set of all retrieval result lists.
- \(r(d)\): The rank of document \(d\) in a given list (Rank 1 = first place).
- \(k\): Smoothing constant (typically set to 60), used to prevent top-ranked documents from having disproportionately large weight.
- Workflow
- Parallel Retrieval: Execute BM25 and Vector Search simultaneously.
- Obtain Rankings: Get two separate Top-K document lists.
- Fusion Calculation: Ignore original scores; compute RRF Score based solely on ranks.
- Re-ranking: Sort by RRF Score in descending order; take the Top-N results to feed to the LLM.
- Worked Example
Suppose we search for "quantum computing applications":
| Document | BM25 Rank | Vector Rank | RRF Score (k=60) |
|---|---|---|---|
| Doc X | 3 | 1 | 1/63 + 1/61 ≈ 0.0323 |
| Doc Y | 1 | 5 | 1/61 + 1/65 ≈ 0.0318 |
| Doc Z | 2 | 4 | 1/62 + 1/64 ≈ 0.0317 |

Doc X wins because it ranks highly on both lists, even though it was not #1 on either individual list.
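The fused scores for this example can be reproduced in a few lines:

```python
def rrf(ranks, k=60):
    # RRF score: sum of 1/(k + rank) over each result list the document appears in
    return sum(1 / (k + r) for r in ranks)

scores = {
    "Doc X": rrf([3, 1]),  # BM25 rank 3, vector rank 1
    "Doc Y": rrf([1, 5]),
    "Doc Z": rrf([2, 4]),
}
for doc, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{doc}: {s:.4f}")
# Doc X comes out on top despite not being #1 in either list
```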
1.3.3. Core Advantages of RRF
- Normalization-Free: No need for complex weight tuning (e.g., \(0.7 \times A + 0.3 \times B\)). Extremely robust.
- Complementary strengths: Captures both "literally exact" matches (IDs, model numbers) and "semantically related" concepts.
- Cold-start friendly: For new terms not seen in training data, BM25 serves as a fallback since it relies on lexical matching rather than learned embeddings.
1.4. Summary: How It All Fits Together in RAG
The retrieval pipeline in a RAG system:
- User Query arrives.
- BM25 (sparse/keyword search) retrieves documents based on exact term matching, powered by TF-IDF principles with saturation and length normalization improvements.
- Vector Search (dense/semantic search) retrieves documents based on embedding similarity, capturing meaning beyond exact words.
- RRF fuses both result lists by rank, producing a single unified ranking that leverages the strengths of both approaches.
- The Top-N documents are passed as context to the LLM for answer generation.
1.5. Key Takeaways
- TF-IDF is the foundation: locally frequent + globally rare = important. Simple but ignores word order and semantics.
- BM25 improves on TF-IDF with term frequency saturation (\(k_1\)) and document length normalization (\(b\)). It is the gold standard for keyword search.
- RRF is the fusion layer: it elegantly combines keyword and vector search results using only rank positions, avoiding the score normalization problem entirely.
- Hybrid Search = BM25 + Vector Search + RRF: This combination is the current best practice for RAG retrieval, giving you both precision (exact matches) and recall (semantic understanding).
1.6. Implementation
""" From-scratch implementation of hybrid search: BM25 + Vector Search + RRF. need OPENAI API key in environment All scoring formulas follow RAG.org: - TF(t,d) = count(t,d) / |d| - IDF(q) = ln((N - n(q) + 0.5) / (n(q) + 0.5)) + 1 [BM25 variant] - BM25 = Σ IDF(qi) * f(qi,D)*(k1+1) / (f(qi,D) + k1*(1 - b + b*|D|/avgdl)) - RRF = Σ 1/(k + rank(d)) """ import math import re from collections import defaultdict from dataclasses import dataclass, field from typing import Any import openai from qdrant_client import QdrantClient # --------------------------------------------------------------------------- # Tokenizer # --------------------------------------------------------------------------- def tokenize(text: str) -> list[str]: """Lowercase and extract alphanumeric tokens.""" return re.findall(r"[a-z0-9]+", text.lower()) # --------------------------------------------------------------------------- # Corpus loader # --------------------------------------------------------------------------- @dataclass class Document: id: str tokens: list[str] payload: dict[str, Any] = field(default_factory=dict) def load_corpus( collection_name: str = "knowledge_base", host: str = "localhost", port: int = 6333, ) -> list[Document]: """Scroll all points from Qdrant and build an in-memory corpus.""" client = QdrantClient(host=host, port=port) documents: list[Document] = [] offset = None while True: results, next_offset = client.scroll( collection_name=collection_name, offset=offset, limit=256, with_payload=True, with_vectors=False, ) for point in results: text = (point.payload or {}).get("text", "") documents.append(Document( id=str(point.id), tokens=tokenize(text), payload=point.payload or {}, )) if next_offset is None: break offset = next_offset return documents # --------------------------------------------------------------------------- # BM25 scoring (from scratch) # --------------------------------------------------------------------------- def bm25_search( query: str, corpus: 
list[Document], limit: int = 15, k1: float = 1.5, b: float = 0.75, ) -> list[tuple[Document, float]]: """ Score every document against the query using BM25. BM25(D, Q) = Σ IDF(qi) * f(qi,D)*(k1+1) / (f(qi,D) + k1*(1-b + b*|D|/avgdl)) IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5)) + 1 """ query_tokens = tokenize(query) if not query_tokens or not corpus: return [] N = len(corpus) avgdl = sum(len(doc.tokens) for doc in corpus) / N # Document frequency: how many docs contain each term df: dict[str, int] = defaultdict(int) for doc in corpus: seen = set(doc.tokens) for token in seen: df[token] += 1 # Precompute IDF for query terms idf: dict[str, float] = {} for qt in query_tokens: n_q = df.get(qt, 0) idf[qt] = math.log((N - n_q + 0.5) / (n_q + 0.5)) + 1 # Score each document scored: list[tuple[Document, float]] = [] for doc in corpus: score = 0.0 doc_len = len(doc.tokens) # Term frequency map for this document tf: dict[str, int] = defaultdict(int) for t in doc.tokens: tf[t] += 1 for qt in query_tokens: f_q = tf.get(qt, 0) if f_q == 0: continue numerator = f_q * (k1 + 1) denominator = f_q + k1 * (1 - b + b * doc_len / avgdl) score += idf[qt] * numerator / denominator if score > 0: scored.append((doc, score)) scored.sort(key=lambda x: x[1], reverse=True) return scored[:limit] # --------------------------------------------------------------------------- # Vector search (via Qdrant) # --------------------------------------------------------------------------- @dataclass class VectorResult: id: str score: float payload: dict[str, Any] = field(default_factory=dict) def vector_search( query: str, collection_name: str = "knowledge_base", host: str = "localhost", port: int = 6333, limit: int = 15, ) -> list[VectorResult]: """Generate an OpenAI embedding and query Qdrant for nearest neighbors.""" embedding = openai.OpenAI().embeddings.create( model="text-embedding-3-small", input=query ).data[0].embedding client = QdrantClient(host=host, port=port) response = client.query_points( 
collection_name=collection_name, query=embedding, limit=limit, with_payload=True, ) points = response.points if hasattr(response, "points") else response return [ VectorResult(id=str(p.id), score=p.score, payload=p.payload or {}) for p in points ] # --------------------------------------------------------------------------- # RRF fusion # --------------------------------------------------------------------------- @dataclass class SearchResult: id: str score: float payload: dict[str, Any] = field(default_factory=dict) def rrf_fuse( bm25_results: list[tuple[Document, float]], vec_results: list[VectorResult], limit: int = 5, k: int = 60, ) -> list[SearchResult]: """ Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank(d)) """ rrf_scores: dict[str, float] = defaultdict(float) payloads: dict[str, dict[str, Any]] = {} for rank, (doc, _bm25_score) in enumerate(bm25_results, start=1): rrf_scores[doc.id] += 1.0 / (k + rank) payloads.setdefault(doc.id, doc.payload) for rank, vr in enumerate(vec_results, start=1): rrf_scores[vr.id] += 1.0 / (k + rank) payloads.setdefault(vr.id, vr.payload) sorted_ids = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:limit] return [ SearchResult(id=pid, score=score, payload=payloads[pid]) for pid, score in sorted_ids ] # --------------------------------------------------------------------------- # Entry point: hybrid search # --------------------------------------------------------------------------- def hybrid_search( query: str, collection_name: str = "knowledge_base", host: str = "localhost", port: int = 6333, limit: int = 5, bm25_k1: float = 1.5, bm25_b: float = 0.75, rrf_k: int = 60, ) -> list[SearchResult]: """ Full hybrid search pipeline: 1. Load corpus from Qdrant 2. BM25 search (from scratch) 3. Vector search (via Qdrant + OpenAI embeddings) 4. RRF fusion """ prefetch = limit * 3 # 1. Build BM25 corpus corpus = load_corpus(collection_name, host, port) # 2. 
BM25 search bm25_results = bm25_search(query, corpus, limit=prefetch, k1=bm25_k1, b=bm25_b) # 3. Vector search vec_results = vector_search(query, collection_name, host, port, limit=prefetch) # 4. RRF fusion return rrf_fuse(bm25_results, vec_results, limit=limit, k=rrf_k) def test_machine_learning_notes_exist(): """Query 'machine learning notes' and verify relevant results are returned.""" results = hybrid_search("machine learning notes", limit=5) # At least one result should come back assert len(results) > 0, "Expected at least one result for 'machine learning notes'" # All RRF scores should be positive for r in results: assert r.score > 0, f"RRF score should be positive, got {r.score}" # At least one result should mention machine learning in its text payload texts = [r.payload.get("text", "").lower() for r in results] assert any( "machine learning" in t for t in texts ), f"No result contained 'machine learning'. Top texts: {[t[:80] for t in texts]}" # Print results for visibility for i, r in enumerate(results, 1): snippet = r.payload.get("text", "")[:120].replace("\n", " ") print(f" #{i} rrf_score={r.score:.5f} id={r.id}") print(f" source={r.payload.get('source_file', 'N/A')}") print(f" text={snippet}...") def main(): print("Hello from rrf_from_scratch!") test_machine_learning_notes_exist() print("Test passed!") if __name__ == "__main__": main()