Local Text Embeddings with Sentence Transformers — Complete Guide

Key Topics: Sentence Transformers tutorial, local embedding models, text-to-vector conversion, semantic similarity, all-MiniLM-L6-v2, all-mpnet-base-v2, BGE embeddings, multilingual embeddings, batch embedding optimization, cosine similarity Python

Text embeddings are the foundation of modern semantic search and RAG systems. These dense vector representations capture the meaning of text, enabling machines to understand similarity, relevance, and relationships between documents. While cloud APIs like OpenAI’s embeddings are popular, running embeddings locally offers significant advantages: no API costs, complete data privacy, offline capability, and often lower latency.

In this comprehensive guide, we’ll explore Sentence Transformers, the leading open-source library for generating high-quality text embeddings. You’ll learn how to choose the right model for your use case, optimize batch processing for large-scale embedding, and implement production-ready embedding pipelines that rival commercial solutions.

What Are Text Embeddings?

Text embeddings transform human-readable text into fixed-size numerical vectors that capture semantic meaning. Unlike simple word counts or TF-IDF representations, embeddings encode the contextual meaning of entire sentences or paragraphs. Two sentences with completely different words but similar meanings will have similar embedding vectors — the model has learned to recognise that “automobile” and “car” occupy the same conceptual region in the vector space. This is a fundamental shift away from lexical approaches: instead of asking “do these documents share the same tokens?”, we ask “do these documents share the same meaning?”. The result is a representation rich enough to support tasks like semantic search, document clustering, duplicate detection, and question answering, all with a single unified model. Because the vector space is continuous and geometrically meaningful, you can even perform arithmetic on embeddings — measuring relational analogies, interpolating between ideas, or aggregating representations across a document collection.

Consider these two sentences:

  • “The cat sat on the mat.”
  • “A feline rested upon the rug.”

Traditional keyword matching would find almost no overlap between those two sentences: apart from the stopword “the”, none of the tokens are shared, so a BM25 or TF-IDF index would score them as essentially unrelated. But embedding models understand that these sentences mean nearly the same thing, producing vectors with high cosine similarity (the exact value depends on the model). This gap between lexical and semantic similarity is precisely where embeddings shine: they capture synonymy, paraphrase, and even abstract conceptual overlap that pure keyword approaches simply cannot detect. In a real-world search engine, this means a user querying “cheap accommodation in Paris” can still find documents that mention “affordable hotels near the Eiffel Tower”, without any need for manual synonym lists or query expansion rules. The ability to bridge this semantic gap is what makes embedding-backed search feel intelligent rather than mechanical.
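To make the lexical side of that comparison concrete, here is a minimal, dependency-free sketch that measures token overlap between the two example sentences. The tokeniser (lowercase, letters only) and the use of Jaccard overlap are simplifying assumptions for illustration, not what BM25 or TF-IDF compute exactly.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that appear in both texts."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


s1 = "The cat sat on the mat."
s2 = "A feline rested upon the rug."

shared = tokens(s1) & tokens(s2)   # only the stopword "the" is shared
score = jaccard(s1, s2)            # 1 shared token out of 10 distinct tokens

print(f"Shared tokens: {shared}")
print(f"Jaccard overlap: {score:.2f}")
```

The only shared token is a stopword, so any keyword-based ranking has nothing meaningful to match on, even though the sentences are paraphrases.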

Diagram showing text flowing through embedding model to produce vector representation
Figure 1: The embedding pipeline transforms raw text through a neural network encoder to produce a dense vector representation that captures semantic meaning.

How Transformer Models Generate Embeddings

Modern embedding models are based on transformer architectures (like BERT) that have been fine-tuned specifically for producing meaningful sentence-level representations. The training process typically involves learning from millions of text pairs where the model must predict similarity, enabling it to capture nuanced semantic relationships. The key innovation of the Sentence Transformers framework — introduced by Reimers and Gurevych in 2019 — was the use of Siamese and triplet network structures that allow two texts to be encoded independently and then compared, making inference dramatically faster than naive cross-encoder approaches.

During contrastive training, the model is exposed to positive pairs (texts with similar meaning) and negative pairs (texts with different meaning), and the loss function pushes similar pairs together while pulling dissimilar pairs apart in the vector space. This process, sometimes augmented with techniques like hard-negative mining or knowledge distillation from larger teacher models, is what gives sentence embeddings their impressive generalisation ability across domains and tasks. Understanding this training paradigm also explains why embeddings degrade gracefully out-of-domain: the model has learned broad semantic patterns, even if it lacks domain-specific fine-tuning.
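The push/pull behaviour described above can be sketched with a triplet loss on toy vectors. This is a dependency-free illustration only: the 2-D vectors and the margin of 0.5 are assumptions, and real training would use the loss classes that ship with the library over large batches rather than this hand-written function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)   # cosine distance to the similar text
    d_neg = 1.0 - cosine(anchor, negative)   # cosine distance to the dissimilar text
    return max(0.0, d_pos - d_neg + margin)


anchor = [1.0, 0.0]
near = [0.9, 0.1]    # points in almost the same direction as the anchor
far = [0.0, 1.0]     # orthogonal to the anchor

loss_good = triplet_loss(anchor, positive=near, negative=far)   # already well separated
loss_bad = triplet_loss(anchor, positive=far, negative=near)    # wrong way round

print(f"Well-separated triplet: {loss_good:.3f}")
print(f"Badly-placed triplet:   {loss_bad:.3f}")
```

A well-separated triplet contributes no loss, while a badly placed one produces a large loss whose gradient moves the positive closer and the negative further away.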

Embedding Dimensions

Embedding vectors typically have 384 to 1536 dimensions, though some specialised research models go higher. Higher dimensions can capture more nuance but require more storage and compute: a corpus of one million documents at 1536 dimensions (float32) consumes roughly 6 GB of memory just for the raw vectors, compared to about 1.5 GB at 384 dimensions. For most applications, 384–768 dimensions provide an excellent balance of quality and efficiency. Training quality also matters more than raw dimensionality: a well-trained 384-dimensional model such as all-MiniLM-L6-v2 can outperform a poorly trained model with far more dimensions, so compare candidates on a benchmark such as MTEB rather than by vector size alone.

Dimensionality reduction offers another lever to trade accuracy against storage costs. PCA can compress existing embeddings post-hoc, while models trained with Matryoshka Representation Learning (MRL) produce vectors whose leading dimensions can simply be truncated, often with surprisingly little quality loss. When designing a retrieval system, always benchmark retrieval quality at multiple dimension sizes before committing to a storage architecture, since the optimal choice depends heavily on your specific data distribution and query patterns.
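As a rough sketch of the truncation idea, the helper below keeps the leading dimensions of a vector and rescales the result to unit length so that dot products still behave like cosine similarity. The function name and the toy 4-D vector are assumptions for illustration, and truncation of this kind is only expected to preserve quality for models that were trained with MRL.

```python
import math


def truncate_and_renormalize(vec: list[float], target_dims: int) -> list[float]:
    """Keep the first `target_dims` values and rescale to unit length."""
    head = vec[:target_dims]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head


full = [0.6, 0.0, 0.8, 0.0]                   # toy unit-length "embedding"
small = truncate_and_renormalize(full, 2)     # half the storage per vector

print(small)
```

Whatever reduction you use, apply exactly the same transformation to both document and query vectors, and re-run your retrieval evaluation at each candidate size.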

| Dimension Size | Storage per Vector (float32) | Typical Use Case |
|---|---|---|
| 384 | 1.5 KB | Fast retrieval, memory-constrained |
| 768 | 3 KB | Balanced quality/performance |
| 1024 | 4 KB | High-quality semantic search |
| 1536 | 6 KB | Maximum quality (OpenAI ada-002) |
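The storage figures above follow from simple arithmetic: each float32 value takes 4 bytes. A small sketch, with hypothetical helper names, makes it easy to re-run the estimate for your own corpus size. It counts raw vectors only, so index structures and metadata in a real vector store would come on top.

```python
def kb_per_vector(dims: int, bytes_per_value: int = 4) -> float:
    """Storage for one vector in KB (1 KB = 1024 bytes)."""
    return dims * bytes_per_value / 1024


def raw_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Storage for the raw vectors only, in GB (1 GB = 10^9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9


for dims in (384, 768, 1024, 1536):
    per_vec = kb_per_vector(dims)
    corpus = raw_storage_gb(1_000_000, dims)
    print(f"{dims:>5} dims: {per_vec:.1f} KB per vector, {corpus:.2f} GB per million vectors")
```

Passing `bytes_per_value=2` gives the corresponding estimate for float16 storage.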

Introduction to Sentence Transformers

Sentence Transformers is a Python library built on top of Hugging Face Transformers that makes it trivially easy to generate state-of-the-art embeddings. It provides pre-trained models optimized for semantic similarity, along with tools for fine-tuning on custom data. Behind the scenes the library handles all the complexity of tokenisation, attention masking, pooling strategies (mean pooling, CLS token extraction, max pooling), and optional L2 normalisation — operations that would otherwise require dozens of lines of low-level Transformers code.
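To give a feel for what the library does on your behalf, here is a dependency-free sketch of mean pooling with an attention mask, followed by L2 normalisation. The tiny 2-D "token embeddings" are made up for illustration; in a real model these would be the transformer's per-token outputs, and the library performs the equivalent steps on tensors.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dims = len(token_embeddings[0])
    total = [0.0] * dims
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dims):
                total[d] += vec[d]
    return [v / max(count, 1) for v in total]


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


token_vectors = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]   # last row is a padding token
mask = [1, 1, 0]

pooled = mean_pool(token_vectors, mask)      # padding row is ignored
sentence_vec = l2_normalize(pooled)          # unit-length sentence embedding

print(pooled, sentence_vec)
```

Ignoring padded positions matters because texts in a batch are padded to a common length, and averaging over padding would skew the embeddings of shorter texts.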

This abstraction dramatically lowers the barrier to entry: a developer who has never worked with transformers can go from installation to a working semantic search prototype in under ten minutes. The library also integrates tightly with the Hugging Face Model Hub, meaning that new embedding models released by the research community are instantly accessible without any additional glue code. Perhaps most importantly, the Sentence Transformers ecosystem includes evaluation utilities and benchmark integrations, so you can rigorously compare models on your own data before committing to one for production use.

Key advantages of Sentence Transformers:

  • Pre-trained models: 100+ models ready to use, covering different languages and domains
  • Simple API: Generate embeddings with just two lines of code
  • Efficient: Optimized for both CPU and GPU inference
  • Extensible: Easy to fine-tune on your own data
  • Active development: Regular updates with new models and features
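
# Installation
pip install sentence-transformers

# With GPU support (CUDA 11.8): install the CUDA build of PyTorch first.
# --index-url replaces PyPI, so sentence-transformers must be installed separately.
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers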

Choosing the Right Model

Model selection significantly impacts both quality and performance. The best model depends on your language requirements, performance constraints, and quality needs. Let’s compare the most popular options and see how they differ in practice. A model that scores highest on the MTEB leaderboard is not automatically the best choice for your application — leaderboard scores are averages across many diverse benchmarks, and your specific domain (legal documents, medical records, e-commerce product descriptions) may weight tasks very differently.

Similarly, latency requirements matter: if your application must return search results in under 50 ms, a large 1024-dimensional model running on CPU may be disqualified regardless of its quality score. The pragmatic approach is to shortlist two or three candidates based on the criteria below, run a quick offline evaluation on a sample of your own data, and select the model that best balances quality and operational cost for your specific workload.
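A quick offline evaluation of shortlisted models can be as simple as comparing recall@k on a handful of labelled queries. The sketch below assumes you have already produced a ranked list of document IDs per query for each candidate. The query IDs, document IDs, and rankings shown are hypothetical placeholders, not real benchmark results.

```python
def recall_at_k(ranked_doc_ids: list[str], relevant_doc_ids: list[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(relevant_doc_ids)) / len(relevant_doc_ids)


def mean_recall_at_k(rankings: dict, relevant: dict, k: int) -> float:
    """Average recall@k over all labelled queries."""
    scores = [recall_at_k(rankings[q], relevant[q], k) for q in relevant]
    return sum(scores) / len(scores)


# Hypothetical labels and rankings for two candidate models
relevant = {"q1": ["d1", "d4"], "q2": ["d2"]}
model_a = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d1"]}
model_b = {"q1": ["d3", "d5", "d1"], "q2": ["d2", "d4", "d1"]}

score_a = mean_recall_at_k(model_a, relevant, k=2)
score_b = mean_recall_at_k(model_b, relevant, k=2)

print(f"Model A recall@2: {score_a:.2f}")
print(f"Model B recall@2: {score_b:.2f}")
```

Even a few dozen labelled queries from your own domain tend to be more informative for model selection than a leaderboard average, and the same harness can record latency per query alongside recall.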

Comparison chart of popular embedding models showing quality vs speed tradeoffs
Figure 2: Quality vs. speed comparison of popular Sentence Transformer models. Higher is better for quality (MTEB score), larger circles indicate larger model size.

The chart above compares models on the MTEB (Massive Text Embedding Benchmark) leaderboard. Here’s a detailed breakdown of top choices:

| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Fast | Very Good | General purpose, balanced |
| bge-large-en-v1.5 | 1024 | Medium | Excellent | High-quality retrieval |
| e5-large-v2 | 1024 | Medium | Excellent | Enterprise applications |
| multilingual-e5-large | 1024 | Medium | Excellent | Multi-language support |

Model Selection Guidelines

  • Start with all-MiniLM-L6-v2 for rapid prototyping and testing
  • Use all-mpnet-base-v2 for production with balanced requirements
  • Choose BGE or E5 large models when quality is paramount
  • Select multilingual models if your content spans multiple languages

Basic Usage and Examples

Getting started with Sentence Transformers couldn’t be simpler. The library handles model downloading, tokenization, and inference automatically — the first time you instantiate a model it is fetched from the Hugging Face Hub and cached locally, so subsequent runs start instantly without any internet connection. Let’s walk through the essential operations that form the building blocks of any embedding pipeline. A solid grasp of the basic API also makes it much easier to understand the more advanced optimisation techniques discussed later in this article, because those optimisations are simply smart ways of calling the same core primitives at scale. Even if you only ever use the simplest two-line interface, understanding what happens under the hood will help you debug issues like unexpectedly slow inference, out-of-memory errors, or inconsistent similarity scores.

Loading a Model

from sentence_transformers import SentenceTransformer

# Load a pre-trained model (downloads automatically on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Check model info
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"Max sequence length: {model.max_seq_length}")

Generating Embeddings

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single text
text = "Machine learning enables computers to learn from data."
embedding = model.encode(text)

print(f"Type: {type(embedding)}")
print(f"Shape: {embedding.shape}")
print(f"First 5 values: {embedding[:5]}")

# Multiple texts (more efficient)
texts = [
    "Machine learning enables computers to learn from data.",
    "Deep learning is a subset of machine learning.",
    "The weather is nice today."
]

embeddings = model.encode(texts)
print(f"Batch shape: {embeddings.shape}")  # (3, 384)

Encoding Options

The encode() method offers several useful parameters for controlling the output format, memory usage, and compute device. Understanding these parameters is key to writing efficient embedding code, because the default settings are convenient for experimentation but are often not optimal for production workloads. For example, normalize_embeddings=True applies L2 normalisation during encoding so that cosine similarity can be calculated as a simple dot product — a much faster operation than full cosine similarity when you are comparing millions of vector pairs. Similarly, choosing between convert_to_tensor=True (which keeps data on the GPU) and the default NumPy output has significant throughput implications when downstream processing can also run on GPU. The batch_size parameter is arguably the most impactful of all: too small and you leave GPU parallelism on the table; too large and you trigger out-of-memory errors; the sweet spot is hardware dependent and worth benchmarking explicitly.

# With progress bar for large batches
embeddings = model.encode(
    texts,
    show_progress_bar=True
)

# Return PyTorch tensors instead of numpy arrays
embeddings = model.encode(
    texts,
    convert_to_tensor=True
)

# Normalize vectors to unit length (recommended for cosine similarity)
embeddings = model.encode(
    texts,
    normalize_embeddings=True
)

# Control batch size for memory management
embeddings = model.encode(
    texts,
    batch_size=32
)

# Use specific device
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = model.encode(
    texts,
    device=device
)

Computing Semantic Similarity

Once you have embeddings, computing semantic similarity is straightforward. The most common metric is cosine similarity, which measures the angle between two vectors regardless of their magnitude. This makes cosine similarity robust to differences in text length — a short two-word query and a long paragraph can still produce a high similarity score if they address the same topic, because the L2 norms are effectively cancelled out.

In contrast, Euclidean distance is sensitive to vector magnitude and can produce counterintuitive results when comparing texts of very different lengths, which is why it is rarely the first choice for NLP retrieval tasks. Dot product similarity is another common alternative: when vectors are pre-normalised to unit length, it is mathematically identical to cosine similarity, whereas on unnormalised vectors it lets the magnitude encode an additional relevance signal (a technique used by some asymmetric retrieval models). For the vast majority of use cases with Sentence Transformers, cosine similarity or dot product on normalised vectors is the right default choice.
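The differences between the three metrics are easy to see on a toy example. In this dependency-free sketch, two made-up vectors point in exactly the same direction but differ in length: cosine similarity ignores the length difference, while Euclidean distance and the raw dot product do not.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


query = [1.0, 2.0]
doc = [2.0, 4.0]      # same direction as the query, twice the magnitude

cos_sim = cosine(query, doc)                          # unaffected by magnitude
raw_dot = dot(query, doc)                             # grows with magnitude
distance = euclidean(query, doc)                      # non-zero despite identical direction
unit_dot = dot(normalize(query), normalize(doc))      # matches cosine after normalisation

print(f"cosine={cos_sim:.3f} dot={raw_dot:.1f} euclidean={distance:.3f} unit_dot={unit_dot:.3f}")
```

This is why normalising at encoding time and then using the dot product is a common pattern: it gives cosine-equivalent scores with the cheapest possible comparison.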

Visual explanation of cosine similarity showing vector angles in embedding space
Figure 3: Cosine similarity measures the angle between vectors. Vectors pointing in similar directions (similar meaning) have similarity close to 1.0.

Understanding Similarity Scores

Cosine similarity ranges from -1 (opposite meaning) to 1 (identical meaning), with 0 indicating no relationship. For normalized vectors, the dot product equals cosine similarity, making computation very efficient. In practice, sentence embedding models rarely produce negative similarities for natural language text, so the effective working range is roughly 0.0–1.0; scores above 0.9 typically indicate near-duplicate content, 0.7–0.9 indicates closely related ideas, and scores below 0.5 often suggest the texts are addressing different topics.

Knowing these rough thresholds helps when setting retrieval cutoffs or building deduplication pipelines — rather than retrieving results with any positive similarity, you can apply a minimum threshold (say 0.65) to filter out weakly related chunks that would dilute the quality of a RAG response. These thresholds are model-dependent, however, so always calibrate them empirically on a labelled evaluation set rather than relying on rules of thumb.
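Applying a cutoff is a one-liner once you have scored results. The sketch below assumes results arrive as `(document, score)` pairs; the helper name, the example scores, and the 0.65 default are illustrative only and should be replaced by a threshold calibrated for your model.

```python
def filter_by_threshold(results, min_score=0.65, top_k=None):
    """Drop weakly related results, then return the best ones first."""
    kept = [(doc, score) for doc, score in results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k] if top_k is not None else kept


# Hypothetical scored chunks returned by a retriever
scored = [
    ("Refund policy for annual plans", 0.82),
    ("Office opening hours", 0.31),
    ("How to cancel a subscription", 0.71),
    ("Company history", 0.48),
]

context = filter_by_threshold(scored, min_score=0.65, top_k=3)
for doc, score in context:
    print(f"{score:.2f}  {doc}")
```

Returning fewer than `top_k` results, or none at all, is intentional: for a RAG prompt an empty context is often preferable to one padded with unrelated chunks.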

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The cat sits on the windowsill.",
    "A feline is resting by the window.",
    "Dogs are loyal companions.",
    "Python is a programming language.",
]

# Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute pairwise similarities
cosine_scores = util.cos_sim(embeddings, embeddings)

print("Pairwise similarity matrix:")
# .cpu() is required before .numpy() when the model ran on a GPU
print(cosine_scores.cpu().numpy().round(3))

# Print the similarity of every unique sentence pair
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Similarity [{i}]-[{j}]: {cosine_scores[i][j]:.3f}")
        print(f"  '{sentences[i][:40]}...'")
        print(f"  '{sentences[j][:40]}...'")

Semantic Search Implementation

Building a basic semantic search system with Sentence Transformers is remarkably simple, and the implementation below illustrates all the key components: indexing, query encoding, similarity computation, and ranking. It is intentionally kept minimal to make the core logic as clear as possible — in a real production system you would replace the in-memory vector store with a dedicated vector database like LanceDB or Qdrant, but the embedding logic remains essentially identical.

One important design decision worth noting is the separation between the index() phase (which encodes documents once and stores the results) and the search() phase (which encodes only the query at query time). This asymmetry is fundamental to efficient retrieval: document embeddings are expensive to compute but can be amortised across many queries, while query encoding is cheap and must be done in real time. Keeping these two phases separate also makes it easy to update the document index incrementally without re-encoding the entire corpus.

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearch:
    """Simple semantic search using Sentence Transformers."""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    
    def index(self, documents: list[str]):
        """Index a list of documents."""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")
    
    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for documents similar to the query."""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )
        
        # Compute similarities (dot product = cosine sim for normalized vectors)
        similarities = util.dot_score(query_embedding, self.embeddings)[0]
        
        # Get top-k results
        top_results = torch.topk(similarities, k=min(top_k, len(self.documents)))
        
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append((self.documents[int(idx)], score.item()))
        
        return results


# Usage example
search_engine = SemanticSearch()

documents = [
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning algorithms can identify patterns in data.",
    "The Eiffel Tower is located in Paris, France.",
    "Neural networks are inspired by biological brain structures.",
    "JavaScript is commonly used for web development.",
    "Deep learning requires large amounts of training data.",
]

search_engine.index(documents)

# Search
query = "What programming languages are easy to learn?"
results = search_engine.search(query, top_k=3)

print(f"\nQuery: {query}\n")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc}")
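The point about incremental updates can be sketched with a small append-only index. To stay dependency-free, this version takes vectors that are already unit-normalised. In practice they would come from something like `model.encode(new_docs, normalize_embeddings=True)`. The class name and the toy 2-D vectors are assumptions for illustration.

```python
class IncrementalIndex:
    """Append-only in-memory index over pre-normalised vectors."""

    def __init__(self):
        self.documents: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, documents: list[str], vectors: list[list[float]]) -> None:
        """Append new documents without touching existing entries."""
        if len(documents) != len(vectors):
            raise ValueError("documents and vectors must have the same length")
        self.documents.extend(documents)
        self.vectors.extend(vectors)

    def search(self, query_vector: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        """Rank documents by dot product (equals cosine for unit vectors)."""
        scored = [
            (doc, sum(q * v for q, v in zip(query_vector, vec)))
            for doc, vec in zip(self.documents, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


index = IncrementalIndex()
index.add(["doc about Python", "doc about Paris"], [[1.0, 0.0], [0.0, 1.0]])
index.add(["doc about JavaScript"], [[0.8, 0.6]])    # added later, no re-encoding

hits = index.search([1.0, 0.0], top_k=2)
print(hits)
```

Only the newly added documents need to be encoded, which is what makes the index/search split cheap to maintain as a corpus grows.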

Batch Processing Optimization

When embedding large document collections, efficient batch processing becomes critical. Poor batching can lead to memory errors, suboptimal GPU utilization, and unnecessarily long processing times. Let’s explore optimization strategies that can substantially reduce total embedding time compared to naive single-document loops. The core insight is that GPU hardware is built for parallel computation: encoding a batch of 128 short documents often takes only modestly longer than encoding a single one, so throughput rises steeply with batch size until compute or memory becomes the bottleneck.

On a consumer-grade GPU with 8 GB of VRAM and a typical 384-dimensional model, batch sizes in the range of 128–256 are usually optimal; on CPU, much smaller batches (32–64) tend to work better because the parallelism ceiling is lower. Measuring your specific hardware’s throughput curve with the profiling tips below is always worth the five minutes it takes — for a corpus of millions of documents, the difference between a suboptimal and optimal batch size can mean hours of wall-clock time.

Graph showing throughput vs batch size for different hardware configurations
Figure 4: Embedding throughput (documents/second) across different batch sizes. Optimal batch size varies by hardware but typically peaks between 32-128 for GPUs.
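Measuring the throughput curve on your own hardware takes only a few lines. The sketch below is written against any callable with the shape `encode_fn(texts, batch_size=...)`, and uses a stand-in function so that it runs without a model. With a real model you could pass, for example, `lambda t, batch_size: model.encode(t, batch_size=batch_size, show_progress_bar=False)`. The helper name and the candidate batch sizes are assumptions.

```python
import time


def benchmark_batch_sizes(encode_fn, texts, batch_sizes=(16, 32, 64, 128, 256)):
    """Return documents/second for each candidate batch size."""
    encode_fn(texts[:1], batch_size=1)    # warm-up call so one-off setup cost is not timed
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        encode_fn(texts, batch_size=bs)
        elapsed = time.perf_counter() - start
        results[bs] = len(texts) / elapsed if elapsed > 0 else float("inf")
    return results


def fake_encode(texts, batch_size=32):
    """Stand-in for a real encoder: returns one dummy vector per text."""
    return [[0.0] for _ in texts]


sample_texts = [f"Document {i} with some content." for i in range(1000)]
throughput = benchmark_batch_sizes(fake_encode, sample_texts, batch_sizes=(8, 16, 32))

for bs, docs_per_sec in throughput.items():
    print(f"batch_size={bs:>4}: {docs_per_sec:,.0f} docs/sec")
```

Use a sample whose text lengths resemble your real corpus, since sequence length affects both speed and memory use, and repeat each measurement a few times if the numbers look noisy.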

Memory-Efficient Batch Processing

from sentence_transformers import SentenceTransformer
import numpy as np
from typing import Generator, List
import gc

def batch_generator(items: list, batch_size: int) -> Generator:
    """Yield batches of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_large_corpus(
    texts: List[str],
    model: SentenceTransformer,
    batch_size: int = 64,
    show_progress: bool = True
) -> np.ndarray:
    """
    Embed a large corpus with memory-efficient batching.
    
    Args:
        texts: List of texts to embed
        model: SentenceTransformer model
        batch_size: Number of texts per batch
        show_progress: Whether to show progress
        
    Returns:
        Numpy array of embeddings
    """
    all_embeddings = []
    total_batches = (len(texts) + batch_size - 1) // batch_size
    
    for batch_idx, batch in enumerate(batch_generator(texts, batch_size)):
        # Encode batch
        batch_embeddings = model.encode(
            batch,
            batch_size=batch_size,  # otherwise encode() re-splits the batch at its default of 32
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
        
        all_embeddings.append(batch_embeddings)
        
        if show_progress and (batch_idx + 1) % 10 == 0:
            print(f"Processed batch {batch_idx + 1}/{total_batches}")
        
        # Periodic garbage collection for very large corpora
        if batch_idx % 100 == 0:
            gc.collect()
    
    return np.vstack(all_embeddings)


# Usage
model = SentenceTransformer('all-MiniLM-L6-v2')

# Simulate large corpus
large_corpus = [f"Document {i} with some content." for i in range(10000)]

embeddings = embed_large_corpus(
    large_corpus,
    model,
    batch_size=128
)

print(f"Final embeddings shape: {embeddings.shape}")

GPU Optimization

import numpy as np
import torch
from typing import List
from sentence_transformers import SentenceTransformer

def get_optimal_batch_size(model: SentenceTransformer, sample_text: str) -> int:
    """Determine optimal batch size based on available GPU memory."""
    
    if not torch.cuda.is_available():
        return 32  # CPU default
    
    # Get GPU memory info
    gpu_memory = torch.cuda.get_device_properties(0).total_memory
    gpu_memory_gb = gpu_memory / (1024**3)
    
    # Estimate based on model size and GPU memory
    model_size = sum(p.numel() * p.element_size() for p in model.parameters())
    model_size_gb = model_size / (1024**3)
    
    # Available memory for batching (leave headroom)
    available_gb = (gpu_memory_gb - model_size_gb) * 0.7
    
    # Rough estimate: ~0.01 GB per batch item for typical models
    estimated_batch = int(available_gb / 0.01)
    
    # Clamp to reasonable range
    return max(16, min(512, estimated_batch))


# GPU-optimized encoding
def encode_on_gpu(
    texts: List[str],
    model: SentenceTransformer,
    batch_size: int = None
) -> np.ndarray:
    """Encode texts using GPU with automatic batch sizing."""
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    if batch_size is None:
        batch_size = get_optimal_batch_size(model, texts[0] if texts else "sample")
    
    print(f"Using device: {device}, batch size: {batch_size}")
    
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        device=device,
        normalize_embeddings=True
    )
    
    return embeddings

Multilingual Embeddings

For applications dealing with multiple languages, multilingual embedding models are essential. These models create a shared embedding space where semantically similar content in different languages maps to nearby vectors — enabling cross-lingual search and retrieval without any translation step. This shared space emerges from training on massive multilingual corpora aligned across languages, teaching the model that the English word “cat”, the French word “chat”, and the German word “Katze” should all occupy the same neighbourhood in the embedding space.

The practical implication is powerful: a user can submit a query in Spanish and retrieve relevant documents that were written in Japanese, with no intermediate translation, as long as both are semantically related. This capability is invaluable in globalised enterprise settings where document repositories accumulate content in dozens of languages over time. One caveat worth noting is that multilingual models typically underperform their monolingual counterparts on English-only benchmarks, so if your application is exclusively English, a dedicated English model will usually give better results for the same compute cost.

from sentence_transformers import SentenceTransformer, util

# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Texts in different languages (all mean similar things)
texts = [
    "Machine learning is transforming technology.",      # English
    "L'apprentissage automatique transforme la technologie.",  # French
    "Maschinelles Lernen verändert die Technologie.",   # German
    "El aprendizaje automático está transformando la tecnología.",  # Spanish
    "机器学习正在改变技术。",  # Chinese
    "機械学習はテクノロジーを変革しています。",  # Japanese
]

# Encode all texts
embeddings = model.encode(texts, normalize_embeddings=True)

# Compute cross-lingual similarities
similarities = util.cos_sim(embeddings, embeddings)

print("Cross-lingual similarity matrix:")
print(similarities.numpy().round(2))

# All sentences should have high similarity despite different languages

Recommended Multilingual Models

| Model | Languages | Dimensions | Notes |
|---|---|---|---|
| multilingual-e5-large | 100+ | 1024 | Best quality, larger size |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ | 384 | Good balance, fast |
| distiluse-base-multilingual-cased-v2 | 50+ | 512 | Proven performer |
| LaBSE | 109 | 768 | Google’s language-agnostic model |

Production Pipeline Implementation

Let’s bring everything together into a production-ready embedding service. This implementation includes caching, error handling, monitoring, and efficient batch processing. These are not optional niceties — they are the difference between a research prototype that works in a notebook and a system that reliably serves hundreds of users in production. The caching layer, in particular, has an outsized impact on both latency and cost: in a typical RAG application, the same document chunks are re-embedded far less frequently than they are retrieved, but popular query phrasings can repeat thousands of times per day, and serving cached embeddings for those is essentially free.

The metrics hooks also pay for themselves quickly by surfacing unexpected performance regressions, model loading failures, or memory leaks that would otherwise only manifest as subtle degradations in search quality rather than hard errors. Treat this implementation as a starting point rather than a final blueprint — production services will inevitably need additional features like distributed caching, circuit breakers, and model hot-reloading, but the core patterns shown here remain the same.

"""
Production-ready embedding service using Sentence Transformers.
Features: caching, batch optimization, error handling, and metrics.
"""

from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Optional, Dict, Any, Union
from dataclasses import dataclass, field
from pathlib import Path
import hashlib
import pickle
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class EmbeddingResult:
    """Result of an embedding operation."""
    embeddings: np.ndarray
    texts: List[str]
    model_name: str
    dimension: int
    processing_time: float
    from_cache: int = 0
    
    @property
    def count(self) -> int:
        return len(self.texts)


@dataclass
class EmbeddingConfig:
    """Configuration for the embedding service."""
    model_name: str = "all-MiniLM-L6-v2"
    batch_size: int = 64
    normalize: bool = True
    cache_dir: Optional[str] = None
    device: Optional[str] = None  # None = auto-detect
    max_seq_length: Optional[int] = None


class EmbeddingService:
    """
    Production embedding service with caching and optimization.
    
    Features:
    - Automatic device selection (GPU/CPU)
    - Configurable caching
    - Batch processing optimization
    - Error handling and retries
    - Performance metrics
    """
    
    def __init__(self, config: EmbeddingConfig = None):
        self.config = config or EmbeddingConfig()
        self._model = None
        self._cache: Dict[str, np.ndarray] = {}
        self._cache_dir = Path(self.config.cache_dir) if self.config.cache_dir else None
        
        # Metrics
        self._total_encoded = 0
        self._total_cache_hits = 0
        self._total_time = 0.0
        
        # Initialize cache directory
        if self._cache_dir:
            self._cache_dir.mkdir(parents=True, exist_ok=True)
            self._load_cache()
    
    @property
    def model(self) -> SentenceTransformer:
        """Lazy-load the model."""
        if self._model is None:
            logger.info(f"Loading model: {self.config.model_name}")
            self._model = SentenceTransformer(self.config.model_name)
            
            if self.config.max_seq_length:
                self._model.max_seq_length = self.config.max_seq_length
            
            if self.config.device:
                self._model = self._model.to(self.config.device)
            
            logger.info(f"Model loaded. Dimension: {self._model.get_sentence_embedding_dimension()}")
        
        return self._model
    
    def embed(
        self,
        texts: Union[str, List[str]],
        use_cache: bool = True
    ) -> EmbeddingResult:
        """
        Generate embeddings for text(s).
        
        Args:
            texts: Single text or list of texts
            use_cache: Whether to use caching
            
        Returns:
            EmbeddingResult with embeddings and metadata
        """
        start_time = time.time()
        
        # Normalize input
        if isinstance(texts, str):
            texts = [texts]
        
        # Check cache
        if use_cache:
            embeddings, cache_hits = self._get_from_cache(texts)
        else:
            embeddings = [None] * len(texts)
            cache_hits = 0
        
        # Find texts that need encoding
        to_encode_indices = [i for i, e in enumerate(embeddings) if e is None]
        to_encode_texts = [texts[i] for i in to_encode_indices]
        
        # Encode missing texts
        if to_encode_texts:
            new_embeddings = self._encode_batch(to_encode_texts)
            
            for idx, emb in zip(to_encode_indices, new_embeddings):
                embeddings[idx] = emb
                
                # Cache the new embedding
                if use_cache:
                    self._add_to_cache(texts[idx], emb)
        
        # Stack results
        embeddings_array = np.vstack(embeddings)
        
        # Update metrics
        processing_time = time.time() - start_time
        self._total_encoded += len(texts)
        self._total_cache_hits += cache_hits
        self._total_time += processing_time
        
        return EmbeddingResult(
            embeddings=embeddings_array,
            texts=texts,
            model_name=self.config.model_name,
            dimension=embeddings_array.shape[1],
            processing_time=processing_time,
            from_cache=cache_hits
        )
    
    def _encode_batch(self, texts: List[str]) -> np.ndarray:
        """Encode a batch of texts."""
        try:
            embeddings = self.model.encode(
                texts,
                batch_size=self.config.batch_size,
                normalize_embeddings=self.config.normalize,
                show_progress_bar=len(texts) > 100
            )
            return embeddings
        except Exception as e:
            logger.error(f"Encoding error: {e}")
            raise
    
    def _get_cache_key(self, text: str) -> str:
        """Generate cache key for text."""
        content = f"{self.config.model_name}:{text}"
        # MD5 serves only as a fast, non-cryptographic fingerprint for cache lookups
        return hashlib.md5(content.encode()).hexdigest()
    
    def _get_from_cache(self, texts: List[str]) -> tuple[List[Optional[np.ndarray]], int]:
        """Retrieve embeddings from cache."""
        embeddings = []
        hits = 0
        
        for text in texts:
            key = self._get_cache_key(text)
            if key in self._cache:
                embeddings.append(self._cache[key])
                hits += 1
            else:
                embeddings.append(None)
        
        return embeddings, hits
    
    def _add_to_cache(self, text: str, embedding: np.ndarray):
        """Add embedding to cache."""
        key = self._get_cache_key(text)
        self._cache[key] = embedding
    
    def _load_cache(self):
        """Load cache from disk."""
        cache_file = self._cache_dir / f"{self.config.model_name.replace('/', '_')}_cache.pkl"
        if cache_file.exists():
            try:
                # pickle can execute arbitrary code on load, so only read cache files you wrote yourself
                with open(cache_file, 'rb') as f:
                    self._cache = pickle.load(f)
                logger.info(f"Loaded {len(self._cache)} cached embeddings")
            except Exception as e:
                logger.warning(f"Could not load cache: {e}")
                self._cache = {}
    
    def save_cache(self):
        """Save cache to disk."""
        if self._cache_dir:
            cache_file = self._cache_dir / f"{self.config.model_name.replace('/', '_')}_cache.pkl"
            with open(cache_file, 'wb') as f:
                pickle.dump(self._cache, f)
            logger.info(f"Saved {len(self._cache)} embeddings to cache")
    
    def get_metrics(self) -> Dict[str, Any]:
        """Get service metrics."""
        return {
            "total_encoded": self._total_encoded,
            "cache_hits": self._total_cache_hits,
            "cache_hit_rate": self._total_cache_hits / max(1, self._total_encoded),
            "total_processing_time": self._total_time,
            "avg_time_per_text": self._total_time / max(1, self._total_encoded),
            "cache_size": len(self._cache),
            "model_name": self.config.model_name,
        }
    
    def clear_cache(self):
        """Clear the embedding cache."""
        self._cache = {}
        logger.info("Cache cleared")


# Example usage
if __name__ == "__main__":
    # Configure service
    config = EmbeddingConfig(
        model_name="all-MiniLM-L6-v2",
        batch_size=64,
        normalize=True,
        cache_dir="./embedding_cache"
    )
    
    service = EmbeddingService(config)
    
    # Embed some texts
    texts = [
        "Artificial intelligence is transforming industries.",
        "Machine learning models learn from data.",
        "Natural language processing enables text understanding.",
    ]
    
    result = service.embed(texts)
    
    print(f"Embedded {result.count} texts in {result.processing_time:.3f}s")
    print(f"Embeddings shape: {result.embeddings.shape}")
    print(f"From cache: {result.from_cache}")
    
    # Embed again (should hit cache)
    result2 = service.embed(texts)
    print(f"\nSecond run - from cache: {result2.from_cache}")
    
    # Get metrics
    print(f"\nService metrics: {service.get_metrics()}")
    
    # Save cache for future use
    service.save_cache()
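
The service above builds its cache key from the model name and the text. A slightly more defensive variant is sketched below. It assumes you track a revision label for each model yourself: `model_revision` and `make_cache_key` are hypothetical names for illustration, not part of Sentence Transformers. The idea is that a retrained or updated model never reuses stale vectors from the cache.

```python
import hashlib


def make_cache_key(model_name: str, model_revision: str, text: str) -> str:
    """Hypothetical helper: the key changes if the model, its revision, or the text changes."""
    # Join with a control character so that ("a", "b:c") and ("a:b", "c") cannot collide
    payload = "\x1f".join((model_name, model_revision, text))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The revision labels "v1" and "v2" are made up for this example
key_a = make_cache_key("all-MiniLM-L6-v2", "v1", "hello world")
key_b = make_cache_key("all-mpnet-base-v2", "v1", "hello world")
key_c = make_cache_key("all-MiniLM-L6-v2", "v2", "hello world")

print(key_a != key_b)  # different model, different key
print(key_a != key_c)  # same model, different revision, different key
```

Whether you need the extra field depends on how often you swap or fine-tune models. For a single fixed model, the simpler key used in the service is enough.
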

Production Best Practices

  • Cache aggressively: Embedding the same text twice is wasteful
  • Batch when possible: GPU throughput improves significantly with batching
  • Monitor memory: Large batches can exhaust GPU memory
  • Normalize embeddings: Simplifies similarity computation
  • Version your models: Different models produce incompatible embeddings
  • Pre-warm models: First inference is slower due to model loading
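
To make the normalization point concrete, here is a minimal sketch in plain Python, with no model involved. The helper names and the two toy vectors are assumptions chosen for illustration. It shows that once vectors are scaled to unit length, a plain dot product gives the same value as full cosine similarity.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from scratch, including both norms."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


# Toy vectors standing in for real embeddings
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

full = cosine_similarity(a, b)                 # norms computed at query time
fast = dot(l2_normalize(a), l2_normalize(b))   # norms paid once, up front

print(abs(full - fast) < 1e-12)
```

In practice the normalization happens once, when the embedding is created, so every later comparison is just a dot product.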

Conclusion

Sentence Transformers provides everything you need to build production-quality embedding systems without relying on external APIs. By choosing the right model for your use case, implementing efficient batch processing, and following best practices for caching and optimization, you can achieve excellent results at minimal cost. Running embeddings locally also gives you complete control over your data pipeline — no chunks of sensitive customer documents leave your infrastructure, no API rate limits throttle your indexing jobs, and no price changes from a cloud vendor can make your application suddenly uneconomical.

As open-source models continue to close the quality gap with proprietary alternatives, the case for local embeddings only strengthens: the bge-large-en-v1.5 and e5-large-v2 models, for example, already outperform OpenAI’s text-embedding-ada-002 on several MTEB benchmarks while costing nothing per API call. Investing time now in a well-structured local embedding pipeline pays dividends across every project that builds on it, from the RAG system you are building today to future applications you haven’t yet imagined.

Key Takeaways

Key takeaways from this article:

  • Text embeddings capture semantic meaning in dense vector representations
  • Model selection involves trade-offs between speed, quality, and resource requirements
  • Batch processing with appropriate sizes maximizes throughput
  • Multilingual models enable cross-language semantic search
  • Caching prevents redundant computation in production systems

In the next article, we’ll explore how to store and efficiently search these embeddings using LanceDB, a lightweight but powerful vector database perfect for local RAG applications. LanceDB is designed to run entirely in-process alongside your Python application, eliminating the need to deploy and operate a separate vector-database server, which is a significant advantage for local and edge deployments. We will cover how to create and persist a LanceDB index from the embeddings generated here, how to run approximate nearest-neighbour queries at scale, and how to integrate the store into the end-to-end RAG pipeline developed throughout this series. By the end of that article, you will have a fully functional local retrieval system capable of indexing thousands of PDF pages and returning semantically relevant results in milliseconds, entirely offline and without any cloud dependencies.

Artur Poniedziałek
IT Expert & Project Manager

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.
