
Hybrid Search: Combining Vector and Keyword Retrieval


Neither pure vector search nor traditional keyword search is perfect. Vector search excels at understanding semantic meaning but can miss exact keyword matches. Keyword search finds precise terms but fails to understand synonyms and context. Hybrid search combines both approaches, leveraging the strengths of each to deliver superior retrieval accuracy—often improving RAG system performance by 10-30%.

In this comprehensive guide, we’ll explore the theory and implementation of hybrid search systems. You’ll learn how BM25 keyword scoring works, understand different fusion strategies including Reciprocal Rank Fusion (RRF), and build a production-ready hybrid search system that significantly outperforms either approach alone. We’ll also examine why pure vector search fails on highly specific technical queries—such as product codes, version numbers, or rare proper nouns—and why pure BM25 fails to bridge vocabulary gaps between user phrasing and document terminology. Along the way you’ll gain practical intuition for the BM25 formula terms (TF saturation, IDF weighting, and length normalization), and you’ll see concrete benchmarks showing how fusion consistently delivers 10–30% recall improvements across diverse query sets. By the end, you’ll have both the conceptual foundation and the working Python code to integrate hybrid search into any RAG pipeline.

Why Hybrid Search?

To understand why hybrid search works so well, let’s examine the failure modes of each individual approach. Every retrieval method makes a trade-off between semantic generalization and lexical precision, and real-world corpora contain queries that demand both simultaneously. A user searching for error code E_DEADLOCK_42 needs exact string matching, while a user asking “how do I fix a database lock contention issue?” needs semantic understanding—and often a single RAG pipeline must handle both in the same session. Studying these failure modes not only motivates hybrid search but also helps you decide how to weight each component when you tune your system later.

Comparison of vector search, keyword search, and hybrid search results
Figure 1: Query results comparison showing how vector, keyword, and hybrid search handle different query types. Hybrid search captures relevant results that either method alone would miss.

Vector Search Limitations

  • Exact match failures: “Python 3.11.4” might match “Python 3.10.2” with high similarity
  • Rare terms: Unusual proper nouns or technical terms may not embed distinctly
  • Negation blindness: “not working” and “working” can have similar embeddings
  • Numeric precision: Specific numbers or codes don’t embed meaningfully

Keyword Search Limitations

  • Synonym blindness: “automobile” won’t match “car” without explicit handling
  • Context ignorance: “Java” (programming) vs “Java” (island) are identical
  • Paraphrase failure: “How do I start?” won’t match “Getting started guide”
  • Vocabulary mismatch: User terms may differ from document terminology
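To make the synonym and exact-match behavior concrete, here is a minimal sketch that scores documents by raw token overlap. This is a deliberately simplified stand-in for keyword matching (it is not BM25), and the helper name `token_overlap` is hypothetical.

```python
def token_overlap(query: str, document: str) -> int:
    """Count the distinct lowercase tokens shared by query and document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens)


# Synonym blindness: "automobile" and "car" share no surface form,
# so a purely lexical matcher sees no connection at all.
synonym_score = token_overlap("automobile repair", "car maintenance tips")

# Exact identifiers still match reliably, which is the strength of lexical search.
exact_score = token_overlap("error E_DEADLOCK_42", "Fix for error E_DEADLOCK_42")

print(f"synonym query overlap: {synonym_score}")
print(f"exact-term query overlap: {exact_score}")
```

The first query finds nothing even though the document is clearly relevant, while the second matches on both tokens. A real keyword engine adds weighting on top of this, but the underlying blind spot is the same.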

Hybrid search addresses these limitations by combining the semantic understanding of vectors with the precision of keyword matching. The key insight is that these two failure modes are largely orthogonal: vector search struggles precisely where keyword search excels (exact terms, rare tokens, numeric values), and keyword search struggles precisely where vector search excels (synonyms, paraphrases, cross-lingual queries). By running both retrievers in parallel and merging their ranked lists, we compensate for each method’s blind spots without sacrificing the strengths of either. In practice this means retrieving a larger candidate pool—typically 3× to 5× your desired final top_k—from each system before applying a fusion function that re-ranks the union.

Vector search (also called dense retrieval) converts text into dense numerical vectors using embedding models. Similarity is measured by comparing these vectors, typically using cosine similarity or dot product. Because the embedding model is trained to place semantically similar sentences close together in the high-dimensional space, vector search naturally handles synonyms, paraphrases, and even cross-lingual queries without any explicit vocabulary mapping. The quality of the embedding model is therefore the single most influential factor in vector search performance: a general-purpose model like all-MiniLM-L6-v2 performs well across domains, while a domain-fine-tuned model (e.g., trained on scientific literature or legal contracts) can yield substantial additional gains on specialist corpora. One important implementation detail is to normalize embeddings to unit length before storing them; this converts cosine similarity into a simple dot product, which is significantly faster to compute at scale.
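The normalization point above is easy to verify numerically. The sketch below uses random vectors as stand-ins for real embeddings (the dimension 384 is an arbitrary choice for illustration) and checks that cosine similarity equals the plain dot product once both vectors have unit length.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # stand-in for a query embedding
b = rng.normal(size=384)  # stand-in for a document embedding

# Cosine similarity computed the long way
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once (at indexing time), then use a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cosine, dot_of_units))
```

Because the norms are computed once up front, each query-time comparison reduces to a single dot product, which is why a matrix-vector product is enough to score an entire corpus.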

Diagram showing the hybrid search pipeline from query to merged results
Figure 2: Hybrid search pipeline architecture showing parallel vector and keyword searches followed by result fusion.
from sentence_transformers import SentenceTransformer
import numpy as np

def vector_search(query: str, documents: list[str], model: SentenceTransformer, top_k: int = 10):
    """
    Perform vector similarity search.
    
    Args:
        query: Search query
        documents: List of documents to search
        model: Embedding model
        top_k: Number of results to return
        
    Returns:
        List of (index, score) tuples, sorted by score descending
    """
    # Encode query and documents
    query_embedding = model.encode(query, normalize_embeddings=True)
    doc_embeddings = model.encode(documents, normalize_embeddings=True)
    
    # Compute cosine similarities (dot product for normalized vectors)
    similarities = np.dot(doc_embeddings, query_embedding)
    
    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Return (index, score) pairs
    return [(idx, similarities[idx]) for idx in top_indices]


# Example
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "Python is a programming language known for its simplicity.",
    "The python snake is found in tropical regions.",
    "Machine learning with Python is very popular.",
]

results = vector_search("Python programming tutorial", documents, model)
for idx, score in results:
    print(f"Score: {score:.3f} | {documents[idx][:50]}...")

BM25 (Best Matching 25) is the gold standard for keyword-based retrieval. It’s an evolution of TF-IDF that addresses its limitations through saturation functions and length normalization. The core idea is that raw term frequency has diminishing returns: the difference between a term appearing once and twice is significant, but the difference between appearing 20 and 21 times is negligible. BM25 captures this intuition through the saturation parameter k₁, which flattens the term-frequency curve. The IDF (Inverse Document Frequency) component simultaneously up-weights rare terms—those appearing in few documents—and down-weights common words like “the” or “is” that carry little discriminative information. Length normalization via the b parameter corrects for the fact that longer documents have a higher raw probability of containing any given term, ensuring that a brief but highly relevant paragraph isn’t penalized relative to a much longer but equally relevant chapter.
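The diminishing-returns behavior can be seen by evaluating the term-frequency part of BM25 on its own. This is a minimal sketch that assumes a document of exactly average length (so the length-normalization term equals 1) and uses k₁ = 1.5; the function name `tf_component` is hypothetical.

```python
def tf_component(tf: float, k1: float = 1.5) -> float:
    """Saturating term-frequency factor of BM25 for an average-length document."""
    return tf * (k1 + 1) / (tf + k1)


gain_1_to_2 = tf_component(2) - tf_component(1)
gain_20_to_21 = tf_component(21) - tf_component(20)

for tf in (1, 2, 5, 20, 21):
    print(f"tf={tf:>2}  factor={tf_component(tf):.3f}")

print(f"gain from 1 to 2 occurrences:   {gain_1_to_2:.4f}")
print(f"gain from 20 to 21 occurrences: {gain_20_to_21:.4f}")
```

The factor approaches but never reaches k₁ + 1, so no amount of keyword repetition can push a document's per-term contribution past that ceiling.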

Visual explanation of BM25 and RRF formulas
Figure 3: Key formulas in hybrid search: BM25 scoring and Reciprocal Rank Fusion for combining results.

BM25 Formula Explained

BM25(D, Q) = Σᵢ IDF(qᵢ) × (f(qᵢ, D) × (k₁ + 1)) / (f(qᵢ, D) + k₁ × (1 − b + b × |D|/avgdl))

Where:

  • f(qᵢ, D) = frequency of term qᵢ in document D
  • |D| = length of document D
  • avgdl = average document length in the corpus
  • k₁ = term frequency saturation parameter (typically 1.2-2.0)
  • b = length normalization parameter (typically 0.75)
  • IDF(qᵢ) = inverse document frequency of term qᵢ
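To connect the formula to code, here is a from-scratch sketch of the scoring function in pure Python. The formula above does not specify which IDF variant to use, so the one below (a smoothed, always non-negative form) is an assumption; libraries differ on this detail, so absolute scores will not necessarily match those of rank_bm25. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant (smoothed so it never goes negative)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "python is a programming language".split(),
    "the python snake lives in tropical regions".split(),
    "machine learning is popular".split(),
]
scores = bm25_scores("python programming".split(), corpus)
for doc, score in zip(corpus, scores):
    print(f"{score:.3f} | {' '.join(doc)}")
```

The first document matches both query terms and scores highest, the second matches only the common term "python", and the third scores zero because it shares no terms with the query.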
from rank_bm25 import BM25Okapi
import numpy as np

class BM25Search:
    """BM25 keyword search implementation."""
    
    def __init__(self, documents: list[str]):
        """
        Initialize BM25 index.
        
        Args:
            documents: List of documents to index
        """
        self.documents = documents
        
        # Tokenize documents (simple whitespace tokenization)
        tokenized = [doc.lower().split() for doc in documents]
        
        # Initialize BM25
        self.bm25 = BM25Okapi(tokenized)
    
    def search(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
        """
        Search for documents matching the query.
        
        Args:
            query: Search query
            top_k: Number of results
            
        Returns:
            List of (index, score) tuples
        """
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        
        # Get top-k indices
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        return [(idx, scores[idx]) for idx in top_indices if scores[idx] > 0]


# Example
documents = [
    "Python is a programming language known for its simplicity.",
    "The python snake is found in tropical regions.",
    "Machine learning with Python is very popular.",
]

bm25_search = BM25Search(documents)
results = bm25_search.search("Python programming")

for idx, score in results:
    print(f"BM25 Score: {score:.3f} | {documents[idx][:50]}...")
BM25 Parameters

The effectiveness of BM25 depends on two key parameters:

  • k₁ (1.2-2.0): Controls term frequency saturation. Higher values give more weight to repeated terms.
  • b (0.75): Controls length normalization. b=0 ignores length; b=1 fully normalizes.

Default values work well for most cases. Tune on your specific corpus if needed.
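As a rough illustration of what b does, the sketch below compares the term-frequency weight of the same term in a short and a long document for several values of b. The IDF factor is omitted because it is identical for both documents; the document lengths and the helper name `bm25_term_weight` are made up for the example.

```python
def bm25_term_weight(tf: int, doc_len: int, avgdl: float,
                     k1: float = 1.5, b: float = 0.75) -> float:
    """Term-frequency part of BM25 for one term (IDF omitted)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))


avgdl = 100.0
weights = {}
for b in (0.0, 0.75, 1.0):
    short = bm25_term_weight(tf=2, doc_len=50, avgdl=avgdl, b=b)
    long_ = bm25_term_weight(tf=2, doc_len=400, avgdl=avgdl, b=b)
    weights[b] = (short, long_)
    print(f"b={b:<4} short doc: {short:.3f}  long doc: {long_:.3f}")
```

With b = 0 the two documents receive the same weight; as b grows, the same two occurrences count for progressively less in the long document and slightly more in the short one.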

Fusion Strategies

Once we have results from both vector and keyword search, we need to combine them. Several strategies exist, each with different characteristics. The fundamental challenge is that vector scores (typically cosine similarity in the range −1 to 1) and BM25 scores (unbounded positive floats) live on completely different scales, so naïve summation is meaningless. The three main families of fusion are: score-based fusion (normalize both score ranges then interpolate), rank-based fusion (ignore raw scores and only use the rank position), and interleaving (alternate picking from each list). Score-based fusion gives you fine-grained control via the blending weight α, but is sensitive to score distribution shifts across queries. Rank-based fusion—especially Reciprocal Rank Fusion—is more robust because it is completely agnostic to score magnitudes. Understanding the trade-offs between these strategies will help you choose the right approach for your specific retrieval requirements.
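Of the three families, interleaving is the simplest to sketch. The version below is one possible round-robin variant, not a canonical algorithm: it walks both lists by rank position, takes the vector result first at each position (an arbitrary tie-break), and skips documents that were already picked. The function name `interleave_fusion` is hypothetical.

```python
def interleave_fusion(
    vector_results: list[tuple[int, float]],
    keyword_results: list[tuple[int, float]],
) -> list[int]:
    """Alternate between two ranked lists, skipping documents already taken."""
    merged: list[int] = []
    seen: set[int] = set()
    for position in range(max(len(vector_results), len(keyword_results))):
        for results in (vector_results, keyword_results):
            if position < len(results):
                idx = results[position][0]
                if idx not in seen:
                    seen.add(idx)
                    merged.append(idx)
    return merged


vector_hits = [(3, 0.91), (1, 0.80), (7, 0.55)]
keyword_hits = [(1, 12.4), (5, 9.8)]
order = interleave_fusion(vector_hits, keyword_hits)
print(order)
```

Note that interleaving produces an ordering but no fused score, and it gives no extra credit to a document that appears in both lists, which is the main reason the score-based and rank-based families are usually preferred.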

Diagram comparing different merge strategies for hybrid search
Figure 4: Comparison of fusion strategies: score combination, rank fusion, and weighted interleaving.

1. Score Combination

Normalize scores from each method to the [0, 1] range and then combine them with a weighted linear interpolation using the blending coefficient α. Setting α = 0.5 gives equal weight to both retrievers, while α = 0.7 biases the final ranking toward semantic (vector) similarity—a sensible default for natural-language question-answering tasks. Conversely, setting α = 0.3 biases toward keyword precision, which is preferable when queries contain rare identifiers such as part numbers, error codes, or scientific nomenclature.

Two edge cases need handling. If one retriever returns zero results (for example, BM25 returning nothing for an out-of-vocabulary query), there is nothing to normalize, so fall back to the non-empty result list. If every score in a list is identical, which includes the single-result case, the min-max denominator is zero; guard against this rather than dividing by zero. Another subtlety is that min-max normalization is sensitive to outliers: a single document with an anomalously high BM25 score can compress all other scores near zero, so consider percentile clipping before normalizing for more robust behavior.
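Percentile clipping can be sketched with NumPy as follows. The 10th and 90th percentile cut-offs are illustrative assumptions rather than recommended values, the sample scores are invented, and the helper name `robust_minmax` is hypothetical.

```python
import numpy as np


def robust_minmax(scores: list[float], lower_pct: float = 10.0,
                  upper_pct: float = 90.0) -> np.ndarray:
    """Clip scores to the given percentiles, then min-max normalize to [0, 1]."""
    values = np.asarray(scores, dtype=float)
    if values.size == 0:
        return values
    low, high = np.percentile(values, [lower_pct, upper_pct])
    if high == low:
        # All (clipped) scores are tied: treat every hit as equally strong
        return np.ones_like(values)
    clipped = np.clip(values, low, high)
    return (clipped - low) / (high - low)


# One anomalously high BM25 score followed by a cluster of ordinary ones
raw_scores = [42.0, 6.1, 5.8, 5.5, 5.0, 4.7, 4.1, 3.9, 3.2, 2.8]

raw = np.asarray(raw_scores)
plain = (raw - raw.min()) / (raw.max() - raw.min())
robust = robust_minmax(raw_scores)

print("plain :", np.round(plain, 3))
print("robust:", np.round(robust, 3))
```

With plain min-max the second-best document ends up close to zero because the outlier stretches the range; after clipping, the ordinary scores spread over most of the [0, 1] interval. The trade-off is that documents beyond the cut-offs become indistinguishable from each other.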

def score_fusion(
    vector_results: list[tuple[int, float]],
    keyword_results: list[tuple[int, float]],
    alpha: float = 0.5
) -> list[tuple[int, float]]:
    """
    Combine results using weighted score fusion.
    
    Args:
        vector_results: (index, score) from vector search
        keyword_results: (index, score) from keyword search
        alpha: Weight for vector scores (1-alpha for keyword)
        
    Returns:
        Combined (index, score) list
    """
    # Normalize scores to [0, 1]
    def normalize(results):
        if not results:
            return {}
        scores = [s for _, s in results]
        min_s, max_s = min(scores), max(scores)
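        if max_s == min_s:
            # All scores tied (e.g. a single hit): count every hit as a full
            # match instead of normalizing them all to 0
            return {idx: 1.0 for idx, _ in results}
        range_s = max_s - min_s
        return {idx: (s - min_s) / range_s for idx, s in results}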
    
    vector_norm = normalize(vector_results)
    keyword_norm = normalize(keyword_results)
    
    # Combine scores
    all_indices = set(vector_norm.keys()) | set(keyword_norm.keys())
    combined = {}
    
    for idx in all_indices:
        v_score = vector_norm.get(idx, 0)
        k_score = keyword_norm.get(idx, 0)
        combined[idx] = alpha * v_score + (1 - alpha) * k_score
    
    # Sort by combined score
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)

2. Reciprocal Rank Fusion (RRF)

RRF combines results based on their rank rather than their scores, making it robust to different score distributions. Because it uses only the ordinal position of each document in each result list, not the magnitude of the scores, RRF is unaffected by the score-scale mismatch between vector cosine similarity and BM25 values. This also makes it easy to extend to three or more retrievers: you simply sum the reciprocal-rank contributions from each additional list without introducing any new hyperparameters beyond k. In empirical evaluations on benchmarks such as BEIR and MS MARCO, RRF is often competitive with more sophisticated score-based fusion methods despite this simplicity, which is one reason it has become a common default in production retrieval systems, including search engines and enterprise document management platforms.

RRF_score(d) = Σ 1 / (k + rank(d))

Where k is a smoothing constant (typically 60) and rank(d) is the document’s 1-based position in each result list. The constant k prevents the score from blowing up when a document ranks first (rank = 1), ensuring that a rank-1 result scores 1/(60+1) ≈ 0.0164 rather than 1.0—a design choice that dampens the outsized influence of the very top result and allows consensus across lists to assert itself.

If a document appears in both the vector list at rank 2 and the keyword list at rank 5, its RRF score is 1/(60+2) + 1/(60+5) ≈ 0.0315, substantially higher than a document appearing in only one list at rank 1 (≈ 0.0164), which elegantly rewards cross-list agreement. This aggregation behavior is precisely why hybrid systems outperform single-modality approaches: documents that multiple independent signals agree on tend to be the most relevant ones.
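The arithmetic for this example is short enough to check directly:

```python
k = 60

# Document found by both retrievers: rank 2 in one list, rank 5 in the other
both_lists = 1 / (k + 2) + 1 / (k + 5)

# Document found by only one retriever, but at rank 1
single_list = 1 / (k + 1)

print(f"both lists:  {both_lists:.4f}")
print(f"single list: {single_list:.4f}")
```

Even two middling ranks add up to nearly twice the score of a single first place, which is the cross-list agreement effect described above.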

def reciprocal_rank_fusion(
    *result_lists: list[tuple[int, float]],
    k: int = 60
) -> list[tuple[int, float]]:
    """
    Combine multiple result lists using Reciprocal Rank Fusion.
    
    Args:
        *result_lists: Variable number of (index, score) lists
        k: RRF constant (default 60)
        
    Returns:
        Combined (index, score) list
    """
    rrf_scores = {}
    
    for results in result_lists:
        for rank, (idx, _) in enumerate(results, start=1):
            if idx not in rrf_scores:
                rrf_scores[idx] = 0
            rrf_scores[idx] += 1 / (k + rank)
    
    # Sort by RRF score
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

Reciprocal Rank Fusion Deep Dive

RRF has become the preferred fusion method for several reasons, and it is worth understanding each advantage before deciding whether to use it or a score-based alternative. First, because RRF only looks at rank positions, it sidesteps the difficult problem of calibrating scores from heterogeneous retrievers onto a common scale—a problem that plagues linear interpolation approaches. Second, its single hyperparameter k is remarkably stable: the default value of 60 (introduced in the original 2009 paper by Cormack, Clarke, and Buettcher) works well across a wide range of corpora and has rarely needed adjustment in practice.

Third, RRF handles the empty-result edge case gracefully—if a retriever returns no results for a query (e.g., BM25 returning nothing for a fully out-of-vocabulary query), its contribution simply drops to zero without any special-casing in code. Finally, extending RRF to more than two retrievers—for instance adding a third sparse retriever using SPLADE or a re-ranking pass—requires no formula changes: you simply add more 1/(k+rank) terms.
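Both properties, the graceful handling of an empty list and the extension to additional retrievers, can be seen in a compact sketch. The version below takes plain lists of document ids rather than (index, score) tuples to keep the example short; the third result list is a hypothetical stand-in for any extra retriever.

```python
from collections import defaultdict


def rrf(*ranked_lists: list[int], k: int = 60) -> list[tuple[int, float]]:
    """Fuse any number of ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[int, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = [4, 2, 9]
bm25_hits = []          # e.g. a fully out-of-vocabulary query
third_hits = [2, 7, 4]  # hypothetical third retriever

fused = rrf(dense_hits, bm25_hits, third_hits)
for doc_id, score in fused:
    print(f"doc {doc_id}: {score:.4f}")
```

The empty BM25 list contributes nothing and needs no special-casing, and the third list is handled by the same loop as the first two.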

  • Score agnostic: Works regardless of score scales or distributions
  • Parameter-light: Only one parameter (k) to tune
  • Proven effective: Competitive with or better than more complex fusion methods in many benchmarks
  • Robust: Handles missing results gracefully

The k Parameter

The constant k (default 60) controls how quickly scores decay with rank. Lower k values give more weight to top ranks, which is useful when your individual retrievers are already highly accurate and you want the fusion to preserve their strong top-1 preferences. Conversely, higher k values flatten the score curve, reducing the penalty for appearing at rank 10 versus rank 1—this is advantageous when your individual retrievers are noisier and consensus across positions matters more than exact rank.

A useful mental model: at k = 20 the rank-1 score is about 1.43× the rank-10 score, while at k = 60 the ratio drops to about 1.15×, meaning the fusion becomes much more egalitarian. In most RAG pipelines the default k = 60 performs well, but if the top-ranked results of your retrievers are consistently reliable, lowering k to between 20 and 30 will amplify those high-confidence top ranks. Keep in mind that k applies to every result list equally, so on its own it cannot favor one retriever over the other:

k Value | Rank 1 Score | Rank 10 Score | Ratio | Characteristic
20      | 0.048        | 0.033         | 1.43x | Strong top-rank preference
60      | 0.016        | 0.014         | 1.15x | Balanced (default)
100     | 0.010        | 0.009         | 1.09x | Flatter distribution
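If you want to explore other values of k, the rank-1 to rank-10 ratio is easy to recompute. This small sketch evaluates the RRF contribution 1/(k + rank) for a few values of k; the helper name `rrf_contribution` is hypothetical.

```python
def rrf_contribution(rank: int, k: int) -> float:
    """Contribution of a single list to a document's RRF score."""
    return 1 / (k + rank)


ratios = {}
for k in (20, 60, 100):
    top = rrf_contribution(1, k)
    tenth = rrf_contribution(10, k)
    ratios[k] = top / tenth
    print(f"k={k:<3}  rank 1: {top:.3f}  rank 10: {tenth:.3f}  ratio: {ratios[k]:.2f}x")
```

The ratio simplifies to (k + 10) / (k + 1), so it tends toward 1 as k grows, which is another way of seeing why large k values flatten the ranking.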

Complete Implementation

Let’s build a production-ready hybrid search system that combines everything we’ve learned. The implementation below wraps both a Sentence Transformer vector index and a BM25 keyword index into a single class, exposes both RRF and score-combination fusion strategies behind a common search() interface, and surfaces per-document component scores so you can audit exactly how much each retriever contributed to the final ranking.

Because the two indexes are built independently at indexing time and can be queried independently (and therefore in parallel) at search time, each component can be replaced on its own: you can swap the in-process numpy arrays for a LanceDB ANN index and a persistent BM25 store without changing the fusion logic. Pay particular attention to the initial_k parameter: fetching 3× to 5× more candidates than your desired final top_k before fusion is essential for recall, since a relevant document that ranks 11th in one list but 2nd in the other would lose the first list's contribution entirely if you only retrieved the top 10 from each:

"""
Production-ready hybrid search combining vector and keyword retrieval.
Features: BM25, vector search, RRF fusion, configurable weights.
"""

from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple, Optional, Dict, Any
from enum import Enum


class FusionMethod(Enum):
    """Available fusion methods."""
    RRF = "rrf"
    SCORE_COMBINATION = "score_combination"
    INTERLEAVE = "interleave"  # declared, but search() below does not implement it


@dataclass
class SearchResult:
    """A single search result."""
    index: int
    text: str
    score: float
    vector_score: Optional[float] = None
    keyword_score: Optional[float] = None
    metadata: Optional[Dict[str, Any]] = None


class HybridSearch:
    """
    Hybrid search combining vector and keyword retrieval.
    
    Features:
    - Dual-encoder vector search with Sentence Transformers
    - BM25 keyword search
    - Multiple fusion strategies (RRF, score combination)
    - Configurable weights and parameters
    """
    
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        bm25_k1: float = 1.5,
        bm25_b: float = 0.75
    ):
        """
        Initialize hybrid search.
        
        Args:
            embedding_model: Sentence transformer model name
            bm25_k1: BM25 term frequency saturation
            bm25_b: BM25 length normalization
        """
        self.model = SentenceTransformer(embedding_model)
        self.bm25_k1 = bm25_k1
        self.bm25_b = bm25_b
        
        # Will be initialized when documents are indexed
        self.documents: List[str] = []
        self.doc_embeddings: np.ndarray = None
        self.bm25: BM25Okapi = None
        self.metadata: List[Dict] = []
    
    def index(
        self,
        documents: List[str],
        metadata: List[Dict] = None,
        show_progress: bool = True
    ):
        """
        Index documents for search.
        
        Args:
            documents: List of document texts
            metadata: Optional metadata for each document
            show_progress: Show embedding progress bar
        """
        self.documents = documents
        self.metadata = metadata or [{} for _ in documents]
        
        # Build vector index
        print("Building vector index...")
        self.doc_embeddings = self.model.encode(
            documents,
            normalize_embeddings=True,
            show_progress_bar=show_progress
        )
        
        # Build BM25 index
        print("Building BM25 index...")
        tokenized = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(
            tokenized,
            k1=self.bm25_k1,
            b=self.bm25_b
        )
        
        print(f"Indexed {len(documents)} documents")
    
    def _tokenize(self, text: str) -> List[str]:
        """Tokenize text for BM25."""
        # Simple tokenization - extend with better preprocessing as needed
        import re
        text = text.lower()
        tokens = re.findall(r'\b\w+\b', text)
        return tokens
    
    def _vector_search(
        self,
        query: str,
        top_k: int
    ) -> List[Tuple[int, float]]:
        """Perform vector similarity search."""
        query_embedding = self.model.encode(
            query,
            normalize_embeddings=True
        )
        
        # Cosine similarity (dot product for normalized vectors)
        similarities = np.dot(self.doc_embeddings, query_embedding)
        
        # Get top-k
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [(idx, float(similarities[idx])) for idx in top_indices]
    
    def _keyword_search(
        self,
        query: str,
        top_k: int
    ) -> List[Tuple[int, float]]:
        """Perform BM25 keyword search."""
        tokenized_query = self._tokenize(query)
        scores = self.bm25.get_scores(tokenized_query)
        
        # Get top-k with positive scores
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        return [(idx, float(scores[idx])) for idx in top_indices if scores[idx] > 0]
    
    def _rrf_fusion(
        self,
        vector_results: List[Tuple[int, float]],
        keyword_results: List[Tuple[int, float]],
        k: int = 60
    ) -> List[Tuple[int, float, float, float]]:
        """
        Combine results using Reciprocal Rank Fusion.
        
        Returns:
            List of (index, rrf_score, vector_score, keyword_score)
        """
        rrf_scores = {}
        vector_scores = {idx: score for idx, score in vector_results}
        keyword_scores = {idx: score for idx, score in keyword_results}
        
        # Add vector search contributions
        for rank, (idx, _) in enumerate(vector_results, start=1):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank)
        
        # Add keyword search contributions
        for rank, (idx, _) in enumerate(keyword_results, start=1):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank)
        
        # Combine with original scores
        results = []
        for idx, rrf_score in rrf_scores.items():
            results.append((
                idx,
                rrf_score,
                vector_scores.get(idx, 0),
                keyword_scores.get(idx, 0)
            ))
        
        # Sort by RRF score
        results.sort(key=lambda x: x[1], reverse=True)
        return results
    
    def _score_fusion(
        self,
        vector_results: List[Tuple[int, float]],
        keyword_results: List[Tuple[int, float]],
        alpha: float = 0.5
    ) -> List[Tuple[int, float, float, float]]:
        """
        Combine results using normalized score fusion.
        
        Args:
            alpha: Weight for vector scores (1-alpha for keyword)
            
        Returns:
            List of (index, combined_score, vector_score, keyword_score)
        """
        def normalize(results):
            if not results:
                return {}
            scores = [s for _, s in results]
            min_s, max_s = min(scores), max(scores)
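            if max_s == min_s:
                # All scores tied (e.g. a single hit): count every hit as a
                # full match instead of normalizing them all to 0
                return {idx: 1.0 for idx, _ in results}
            range_s = max_s - min_s
            return {idx: (s - min_s) / range_s for idx, s in results}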
        
        vector_norm = normalize(vector_results)
        keyword_norm = normalize(keyword_results)
        
        vector_raw = {idx: score for idx, score in vector_results}
        keyword_raw = {idx: score for idx, score in keyword_results}
        
        all_indices = set(vector_norm.keys()) | set(keyword_norm.keys())
        
        results = []
        for idx in all_indices:
            v_norm = vector_norm.get(idx, 0)
            k_norm = keyword_norm.get(idx, 0)
            combined = alpha * v_norm + (1 - alpha) * k_norm
            
            results.append((
                idx,
                combined,
                vector_raw.get(idx, 0),
                keyword_raw.get(idx, 0)
            ))
        
        results.sort(key=lambda x: x[1], reverse=True)
        return results
    
    def search(
        self,
        query: str,
        top_k: int = 10,
        fusion_method: FusionMethod = FusionMethod.RRF,
        alpha: float = 0.5,
        rrf_k: int = 60,
        initial_k: int = None
    ) -> List[SearchResult]:
        """
        Perform hybrid search.
        
        Args:
            query: Search query
            top_k: Number of final results
            fusion_method: How to combine results
            alpha: Weight for vector search in score fusion
            rrf_k: Constant for RRF (higher = more equal weighting)
            initial_k: Number of results from each method before fusion
            
        Returns:
            List of SearchResult objects
        """
        if self.doc_embeddings is None:
            raise ValueError("No documents indexed. Call index() first.")
        
        # Get more results initially for better fusion
        initial_k = initial_k or top_k * 3
        
        # Run both searches
        vector_results = self._vector_search(query, initial_k)
        keyword_results = self._keyword_search(query, initial_k)
        
        # Fuse results
        if fusion_method == FusionMethod.RRF:
            fused = self._rrf_fusion(vector_results, keyword_results, rrf_k)
        elif fusion_method == FusionMethod.SCORE_COMBINATION:
            fused = self._score_fusion(vector_results, keyword_results, alpha)
        else:
            raise ValueError(f"Unknown fusion method: {fusion_method}")
        
        # Build final results
        results = []
        for idx, score, v_score, k_score in fused[:top_k]:
            results.append(SearchResult(
                index=idx,
                text=self.documents[idx],
                score=score,
                vector_score=v_score,
                keyword_score=k_score,
                metadata=self.metadata[idx] if self.metadata else None
            ))
        
        return results
    
    def search_vector_only(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """Search using only vector similarity."""
        results = self._vector_search(query, top_k)
        return [
            SearchResult(
                index=idx,
                text=self.documents[idx],
                score=score,
                vector_score=score,
                metadata=self.metadata[idx] if self.metadata else None
            )
            for idx, score in results
        ]
    
    def search_keyword_only(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """Search using only BM25 keywords."""
        results = self._keyword_search(query, top_k)
        return [
            SearchResult(
                index=idx,
                text=self.documents[idx],
                score=score,
                keyword_score=score,
                metadata=self.metadata[idx] if self.metadata else None
            )
            for idx, score in results
        ]


# Example usage
if __name__ == "__main__":
    # Initialize hybrid search
    search = HybridSearch()
    
    # Sample documents
    documents = [
        "Python is a high-level programming language with dynamic typing.",
        "The python snake is a non-venomous constrictor found in Asia and Africa.",
        "Machine learning with Python uses libraries like scikit-learn and TensorFlow.",
        "PyTorch is a deep learning framework developed by Facebook's AI team.",
        "The Burmese python is one of the largest snake species in the world.",
        "Python 3.11 introduced significant performance improvements.",
        "Natural language processing enables computers to understand human language.",
        "The ball python is a popular pet snake due to its docile nature.",
    ]
    
    # Index documents
    search.index(documents)
    
    # Compare search methods
    query = "Python programming language features"
    
    print(f"\nQuery: {query}")
    print("=" * 70)
    
    print("\n--- Vector Only ---")
    for r in search.search_vector_only(query, top_k=3):
        print(f"Score: {r.score:.3f} | {r.text[:60]}...")
    
    print("\n--- Keyword Only ---")
    for r in search.search_keyword_only(query, top_k=3):
        print(f"Score: {r.score:.3f} | {r.text[:60]}...")
    
    print("\n--- Hybrid (RRF) ---")
    for r in search.search(query, top_k=3, fusion_method=FusionMethod.RRF):
        print(f"Score: {r.score:.4f} | V: {r.vector_score:.3f} K: {r.keyword_score:.2f}")
        print(f"  {r.text[:60]}...")
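The search() method above calls the two retrievers one after the other. Since they are independent, they can also be dispatched concurrently. Here is a minimal sketch using a thread pool; the retrievers are stand-in functions with a hypothetical (query, k) signature, and threads are only likely to help when the retrievers wait on I/O (for example a remote vector database) or otherwise release the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical retriever signature: (query, k) -> list of (index, score)
Retriever = Callable[[str, int], list[tuple[int, float]]]


def retrieve_in_parallel(
    query: str,
    retrievers: list[Retriever],
    top_k: int = 10,
    oversample: int = 3,
) -> list[list[tuple[int, float]]]:
    """Run every retriever concurrently, asking each for top_k * oversample hits."""
    initial_k = top_k * oversample
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = [pool.submit(retriever, query, initial_k) for retriever in retrievers]
        # Results come back in the same order as `retrievers`
        return [future.result() for future in futures]


# Stand-in retrievers that fabricate results for demonstration only
def fake_vector(query: str, k: int) -> list[tuple[int, float]]:
    return [(i, 1.0 - 0.01 * i) for i in range(k)]


def fake_keyword(query: str, k: int) -> list[tuple[int, float]]:
    return [(i, 1.0 / (1 + i)) for i in range(0, 2 * k, 2)]


vector_hits, keyword_hits = retrieve_in_parallel(
    "python", [fake_vector, fake_keyword], top_k=5
)
print(len(vector_hits), len(keyword_hits))
```

The two result lists can then be passed to whichever fusion function you prefer, exactly as in the sequential version.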

Tuning and Optimization

Achieving optimal hybrid search performance requires thoughtful tuning based on your specific data and use cases. A common mistake is to tune on a generic benchmark (like MS MARCO) and apply those settings verbatim to a specialized domain—technical datasheets, legal contracts, or medical literature all exhibit very different term-distribution and query-type patterns. The right approach is to assemble a small golden query set (50–200 queries with known relevant documents) from your actual production traffic, run a grid or random search over the key parameters, and select the configuration that maximizes your target metric (typically nDCG@10 or Recall@5). It’s also worth re-tuning whenever you significantly change your document corpus—for instance after ingesting a new document type or switching embedding models—since those changes can shift which retriever is more reliable for particular query patterns.
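The golden-set workflow can be sketched as a small grid search over alpha. Everything below is illustrative: the two-query golden set and its scores are invented, `toy_search` stands in for a real hybrid search call, and the `search_fn(query, alpha)` signature is an assumption made for this example.

```python
from typing import Callable


def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def tune_alpha(
    golden_set: list[tuple[str, set[int]]],
    search_fn: Callable[[str, float], list[int]],
    alphas: list[float],
    k: int = 5,
) -> tuple[float, float]:
    """Return (best_alpha, mean_recall_at_k) over the golden query set."""
    best_alpha, best_recall = alphas[0], -1.0
    for alpha in alphas:
        recalls = [recall_at_k(search_fn(q, alpha), rel, k) for q, rel in golden_set]
        mean_recall = sum(recalls) / len(recalls)
        if mean_recall > best_recall:
            best_alpha, best_recall = alpha, mean_recall
    return best_alpha, best_recall


# Invented, already-normalized scores: query -> (vector scores, keyword scores)
TOY_SCORES = {
    "error E42": ({0: 0.9, 1: 0.2}, {1: 1.0}),
    "fix lock contention": ({2: 0.9, 3: 0.3}, {3: 0.5}),
}


def toy_search(query: str, alpha: float) -> list[int]:
    """Stand-in for a hybrid search call using weighted score fusion."""
    vec, kw = TOY_SCORES[query]
    doc_ids = set(vec) | set(kw)
    return sorted(
        doc_ids,
        key=lambda i: alpha * vec.get(i, 0.0) + (1 - alpha) * kw.get(i, 0.0),
        reverse=True,
    )


golden = [("error E42", {1}), ("fix lock contention", {2})]
best_alpha, best_recall = tune_alpha(
    golden, toy_search, alphas=[0.0, 0.25, 0.5, 0.75, 1.0], k=1
)
print(best_alpha, best_recall)
```

In this toy setup the identifier query needs the keyword signal and the natural-language query needs the vector signal, so only a balanced alpha satisfies both. On a real golden set you would use more queries, a finer grid, and the metric that matches your application.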

Key Parameters to Tune

| Parameter | Range | Higher Values | Lower Values |
|---|---|---|---|
| alpha (score fusion) | 0.0-1.0 | More vector influence | More keyword influence |
| rrf_k | 20-100 | More equal rank weighting | Stronger top-rank preference |
| initial_k | 2x-5x top_k | More candidates, better recall | Faster, may miss relevant docs |
| bm25_k1 | 1.2-2.0 | More term frequency weight | Saturates frequency effect |

Tuning Guidelines

  • Start with defaults: RRF with k=60 works well out of the box
  • Adjust based on query types:
    • Technical queries (codes, IDs): Increase keyword weight
    • Natural language questions: Increase vector weight
  • Use A/B testing: Compare configurations on real user queries
  • Monitor recall@k: Ensure relevant docs appear in top results
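
One way to act on the query-type guideline is to pick the fusion weight per query. The sketch below is only a heuristic under stated assumptions: `looks_like_identifier` and `choose_alpha` are hypothetical helpers, the 0.3 and 0.7 weights are illustrative rather than tuned values, and it assumes the convention from the table above, where a higher alpha means more vector influence.

```python
def looks_like_identifier(token: str) -> bool:
    """Heuristic: digits or underscores suggest a code, ID, or version number."""
    return "_" in token or any(ch.isdigit() for ch in token)


def choose_alpha(
    query: str,
    technical_alpha: float = 0.3,  # assumed value: lean on keyword matching
    natural_alpha: float = 0.7,    # assumed value: lean on vector similarity
) -> float:
    """Pick a score-fusion weight for one query.

    Lower alpha means more keyword influence, which suits exact-match
    lookups such as error codes or version numbers.
    """
    if any(looks_like_identifier(token) for token in query.split()):
        return technical_alpha
    return natural_alpha


for q in ["error E_DEADLOCK_42", "how do I fix a database lock contention issue?"]:
    print(f"{choose_alpha(q):.1f}  {q}")
```

A rule this simple will misclassify some queries (any query containing a number counts as technical), so treat it as a starting point and validate it against your golden query set before relying on it.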

Evaluation Methods

Proper evaluation is essential for understanding your hybrid search performance and guiding optimization efforts. Without a rigorous measurement framework, you risk making parameter changes that look promising on a handful of hand-picked test queries but actually degrade performance on the broader distribution. The metrics below capture complementary aspects of retrieval quality: Precision@k tells you how many of your returned results are actually relevant (a precision-focused metric useful when the user only reads the first few results), while Recall@k tells you what fraction of all relevant documents you surfaced (critical for RAG pipelines where missing context leads to hallucinated answers).

MRR (Mean Reciprocal Rank) rewards systems that place the first relevant result as high as possible—ideal for single-answer lookups—while nDCG@k applies a logarithmic discount to give partial credit for relevant results that appear lower in the ranking. Together, tracking all four metrics gives you a complete picture of retrieval quality across different use-case priorities.

Key Metrics

import math
from typing import List, Set

def precision_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = sum(1 for doc in retrieved_k if doc in relevant)
    return relevant_retrieved / k

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = sum(1 for doc in retrieved_k if doc in relevant)
    return relevant_retrieved / len(relevant) if relevant else 0.0

def mrr(retrieved: List[int], relevant: Set[int]) -> float:
    """Reciprocal rank of the first relevant result for a single query.

    Averaging this value over all queries gives Mean Reciprocal Rank.
    """
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def ndcg_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Calculate normalized Discounted Cumulative Gain with binary relevance."""
    dcg = sum(
        1 / math.log2(i + 2)  # i is 0-based, so rank = i + 1 and the discount is log2(rank + 1)
        for i, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    
    # Ideal DCG (all relevant docs at top)
    ideal_dcg = sum(
        1 / math.log2(i + 2)
        for i in range(min(len(relevant), k))
    )
    
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


# Example evaluation
def evaluate_search(search_system, test_queries):
    """Evaluate search system on test queries."""
    metrics = {"precision@5": [], "recall@5": [], "mrr": [], "ndcg@5": []}
    
    for query, relevant_docs in test_queries:
        results = search_system.search(query, top_k=10)
        retrieved = [r.index for r in results]
        relevant_set = set(relevant_docs)
        
        metrics["precision@5"].append(precision_at_k(retrieved, relevant_set, 5))
        metrics["recall@5"].append(recall_at_k(retrieved, relevant_set, 5))
        metrics["mrr"].append(mrr(retrieved, relevant_set))
        metrics["ndcg@5"].append(ndcg_at_k(retrieved, relevant_set, 5))
    
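    # Average each metric over all queries; guard against an empty test set
    return {
        name: (sum(values) / len(values) if values else 0.0)
        for name, values in metrics.items()
    }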
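
To make the four metrics concrete, here is a small worked example computed by hand for a single query. The document IDs and the relevance set are made up for illustration; the arithmetic uses the same binary-relevance definitions as the functions in this section.

```python
import math

retrieved = [3, 1, 7, 2, 9]  # ranked document IDs returned for one query
relevant = {1, 2, 5}         # ground-truth relevant IDs (5 was never retrieved)
k = 5

# Hit pattern by rank: ranks 2 and 4 are relevant.
hits = [doc in relevant for doc in retrieved[:k]]

precision_at_5 = sum(hits) / k            # 2 / 5
recall_at_5 = sum(hits) / len(relevant)   # 2 / 3
reciprocal_rank = 1 / (hits.index(True) + 1)  # first hit at rank 2 -> 1 / 2

# DCG: each hit at rank r contributes 1 / log2(r + 1).
dcg = sum(1 / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1) if hit)
# Ideal DCG: all 3 relevant docs placed at ranks 1, 2, 3.
ideal_dcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
ndcg_at_5 = dcg / ideal_dcg

print(f"precision@5={precision_at_5:.3f}")    # 0.400
print(f"recall@5={recall_at_5:.3f}")          # 0.667
print(f"reciprocal rank={reciprocal_rank:.3f}")  # 0.500
print(f"ndcg@5={ndcg_at_5:.3f}")              # 0.498
```

Note how the metrics disagree: precision and nDCG are dragged down by the irrelevant result at rank 1, while recall only reflects that one relevant document is missing from the top five.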

Conclusion

Hybrid search combines the semantic understanding of vector search with the precision of keyword matching, and it is one of the most reliable ways to raise retrieval quality in a RAG pipeline. By implementing the techniques in this guide, particularly Reciprocal Rank Fusion, you can build retrieval systems that significantly outperform either approach alone. The 10–30% recall improvement cited earlier is not a theoretical ceiling; in practice, gains are largest on corpora that mix natural-language prose with structured technical content (product specs, API documentation, financial reports), exactly the kinds of documents common in enterprise RAG deployments.

One final production consideration: ensure your hybrid pipeline degrades gracefully when one leg fails—if your vector store becomes temporarily unavailable, route all traffic to BM25 alone rather than returning empty results, and emit a metric so you can alert and investigate quickly. Building in this resilience from the start ensures that the hybrid approach delivers its recall benefits in production without introducing a new single-point-of-failure.
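
A minimal sketch of that fallback behavior follows. Everything here is an assumption for illustration: `search_with_fallback` is a hypothetical wrapper, the two search callables stand in for your own hybrid and BM25-only entry points, and a `Counter` stands in for whatever metrics client you actually use.

```python
import logging
from collections import Counter
from typing import Callable, List, TypeVar

T = TypeVar("T")

logger = logging.getLogger("hybrid_search")
fallback_counter: Counter = Counter()  # stand-in for a real metrics client


def search_with_fallback(
    hybrid_search: Callable[[str, int], List[T]],
    keyword_search: Callable[[str, int], List[T]],
    query: str,
    top_k: int = 10,
) -> List[T]:
    """Try the hybrid path first; serve BM25-only results if it fails."""
    try:
        return hybrid_search(query, top_k)
    except Exception as exc:  # in production, narrow this to your vector store's errors
        fallback_counter["hybrid_search.fallback_to_bm25"] += 1
        logger.warning("Hybrid search failed (%s); serving keyword-only results", exc)
        return keyword_search(query, top_k)


# Demo with fake search functions that simulate a vector store outage.
def failing_hybrid(query: str, top_k: int) -> List[str]:
    raise ConnectionError("vector store unavailable")


def bm25_only(query: str, top_k: int) -> List[str]:
    return ["doc-a", "doc-b"][:top_k]


results = search_with_fallback(failing_hybrid, bm25_only, "E_DEADLOCK_42", top_k=2)
print(results, dict(fallback_counter))
```

Passing the two search paths in as callables keeps the wrapper independent of any particular vector store or BM25 library, and the counter gives you a single signal to alert on when the fallback rate rises.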

Key takeaways:

  • Vector and keyword search have complementary strengths and weaknesses
  • RRF is the preferred fusion method for most applications
  • Proper tuning requires evaluation on representative queries
  • The hybrid approach typically improves recall by 10-30% over either method alone
  • Implementation is straightforward with BM25 and Sentence Transformers

In the next article we’ll explore how to run local LLMs with OpenAI-compatible APIs, completing the foundation for a fully local RAG system. Having a fast, accurate hybrid retriever is only half the pipeline—you still need a language model that can synthesize the retrieved context into a coherent, grounded answer. Running that model locally eliminates API latency, removes per-token costs, and keeps sensitive document content entirely on-premise, which is essential for many enterprise and regulated-industry use cases. We’ll cover model quantization options (GGUF, GPTQ, AWQ), server frameworks like Ollama and llama.cpp, and how to wire the OpenAI-compatible endpoint directly into the hybrid search pipeline you’ve built here.

Artur Poniedziałek
IT Expert & Project Manager

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.
