Text embeddings are the foundation of modern semantic search and RAG systems. These dense vector representations capture the meaning of text, enabling machines to understand similarity, relevance, and relationships between documents. While cloud APIs like OpenAI’s embeddings are popular, running embeddings locally offers significant advantages: no API costs, complete data privacy, offline capability, and often lower latency.
In this comprehensive guide, we’ll explore Sentence Transformers, the leading open-source library for generating high-quality text embeddings. You’ll learn how to choose the right model for your use case, optimize batch processing for large-scale embedding, and implement production-ready embedding pipelines that rival commercial solutions.
What Are Text Embeddings?
Text embeddings transform human-readable text into fixed-size numerical vectors that capture semantic meaning. Unlike simple word counts or TF-IDF representations, embeddings encode the contextual meaning of entire sentences or paragraphs. Two sentences with completely different words but similar meanings will have similar embedding vectors — the model has learned to recognise that “automobile” and “car” occupy the same conceptual region in the vector space. This is a fundamental shift away from lexical approaches: instead of asking “do these documents share the same tokens?”, we ask “do these documents share the same meaning?”. The result is a representation rich enough to support tasks like semantic search, document clustering, duplicate detection, and question answering, all with a single unified model. Because the vector space is continuous and geometrically meaningful, you can even perform arithmetic on embeddings — measuring relational analogies, interpolating between ideas, or aggregating representations across a document collection.
Consider these two sentences:
- “The cat sat on the mat.”
- “A feline rested upon the rug.”
Traditional keyword matching would find almost no overlap between those two sentences: apart from the stop word “the”, none of the tokens are shared, so a BM25 or TF-IDF index would score them as nearly unrelated. But embedding models recognise that these sentences mean nearly the same thing and produce vectors with high cosine similarity (the exact score depends on the model). This gap between lexical and semantic similarity is precisely where embeddings shine: they capture synonymy, paraphrase, and even abstract conceptual overlap that pure keyword approaches simply cannot detect. In a real-world search engine, this means a user querying “cheap accommodation in Paris” can still find documents that mention “affordable hotels near the Eiffel Tower”, without any need for manual synonym lists or query expansion rules. The ability to bridge this semantic gap is what makes embedding-backed search feel intelligent rather than mechanical.
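To make the lexical gap concrete, here is a minimal sketch that measures token overlap between the two example sentences using Jaccard similarity. The `tokens` and `jaccard` helpers are illustrative names, not part of any library, and the tokenisation (lowercasing plus punctuation stripping) is deliberately naive.

```python
import string


def tokens(text: str) -> set[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens across both texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


s1 = "The cat sat on the mat."
s2 = "A feline rested upon the rug."

shared = tokens(s1) & tokens(s2)   # only the stop word survives
overlap = jaccard(s1, s2)

print(f"Shared tokens: {shared}")
print(f"Jaccard overlap: {overlap:.2f}")
```

The only shared token is the stop word "the", so a purely lexical score stays close to zero even though the two sentences describe the same scene.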

How Transformer Models Generate Embeddings
Modern embedding models are based on transformer architectures (like BERT) that have been fine-tuned specifically for producing meaningful sentence-level representations. The training process typically involves learning from millions of text pairs where the model must predict similarity, enabling it to capture nuanced semantic relationships. The key innovation of the Sentence Transformers framework — introduced by Reimers and Gurevych in 2019 — was the use of Siamese and triplet network structures that allow two texts to be encoded independently and then compared, making inference dramatically faster than naive cross-encoder approaches.
During contrastive training, the model is exposed to positive pairs (texts with similar meaning) and negative pairs (texts with different meaning), and the loss function pulls similar pairs together while pushing dissimilar pairs apart in the vector space. This process, sometimes augmented with techniques like hard-negative mining or knowledge distillation from larger teacher models, is what gives sentence embeddings their impressive generalisation ability across domains and tasks. Understanding this training paradigm also explains why embeddings degrade gracefully out-of-domain: the model has learned broad semantic patterns, even if it lacks domain-specific fine-tuning.
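The contrastive objective can be sketched in a few lines of NumPy. This is an illustrative in-batch loss in the spirit of InfoNCE, not the exact loss used to train any particular model; the function name and the `temperature` value are assumptions made for the example. Each anchor treats its own paired text as the positive and every other text in the batch as a negative.

```python
import numpy as np


def in_batch_contrastive_loss(anchors: np.ndarray, positives: np.ndarray,
                              temperature: float = 0.05) -> float:
    """Cross-entropy over cosine scores; row i's positive is column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (batch, batch) scaled cosine scores
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # penalise low probability on the true pair


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = anchors + 0.01 * rng.normal(size=(4, 8))   # stand-ins for paraphrases
shuffled = aligned[[1, 2, 3, 0]]                     # deliberately wrong pairings

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_shuffled = in_batch_contrastive_loss(anchors, shuffled)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

The loss is small when each anchor is closest to its own positive and grows when the pairings are scrambled, which is the signal that gradient descent uses to reshape the vector space.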
Embedding Dimensions
Embedding vectors typically have 384 to 1536 dimensions, though some specialised research models go higher. Higher dimensions can capture more nuance but require more storage and compute — a corpus of one million documents at 1536 dimensions (float32) consumes roughly 6 GB of memory just for the raw vectors, compared to about 1.5 GB at 384 dimensions. For most applications, 384–768 dimensions provide an excellent balance of quality and efficiency, and benchmarks on the MTEB leaderboard consistently show that well-trained 384-dimensional models like all-MiniLM-L6-v2 outperform poorly trained 1536-dimensional ones.
Dimensionality reduction offers another lever to trade accuracy against storage costs. PCA can compress existing embeddings post-hoc, while models trained with Matryoshka Representation Learning (MRL) produce vectors that can simply be truncated to their leading dimensions, often with surprisingly little quality loss. When designing a retrieval system, always benchmark retrieval quality at multiple dimension sizes before committing to a storage architecture, since the optimal choice depends heavily on your specific data distribution and query patterns.
| Dimension Size | Storage per Vector | Typical Use Case |
|---|---|---|
| 384 | 1.5 KB | Fast retrieval, memory-constrained |
| 768 | 3 KB | Balanced quality/performance |
| 1024 | 4 KB | High-quality semantic search |
| 1536 | 6 KB | Largest footprint (e.g. OpenAI ada-002) |
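The storage figures above follow directly from the dimension count, assuming 4-byte float32 values and no index overhead. A small sketch of the arithmetic (the helper names are illustrative):

```python
def vector_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of one embedding; float32 uses 4 bytes per value."""
    return dim * bytes_per_value


def corpus_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a corpus in decimal gigabytes, excluding index overhead."""
    return n_vectors * vector_bytes(dim, bytes_per_value) / 1e9


for dim in (384, 768, 1024, 1536):
    print(f"{dim:>5} dims: {vector_bytes(dim) / 1024:.1f} KB per vector, "
          f"{corpus_gb(1_000_000, dim):.2f} GB per million vectors")
```

Passing `bytes_per_value=2` gives the corresponding figures for float16 storage, which halves every number in the table.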
Introduction to Sentence Transformers
Sentence Transformers is a Python library built on top of Hugging Face Transformers that makes it trivially easy to generate state-of-the-art embeddings. It provides pre-trained models optimized for semantic similarity, along with tools for fine-tuning on custom data. Behind the scenes the library handles all the complexity of tokenisation, attention masking, pooling strategies (mean pooling, CLS token extraction, max pooling), and optional L2 normalisation — operations that would otherwise require dozens of lines of low-level Transformers code.
This abstraction dramatically lowers the barrier to entry: a developer who has never worked with transformers can go from installation to a working semantic search prototype in under ten minutes. The library also integrates tightly with the Hugging Face Model Hub, meaning that new embedding models released by the research community are instantly accessible without any additional glue code. Perhaps most importantly, the Sentence Transformers ecosystem includes evaluation utilities and benchmark integrations, so you can rigorously compare models on your own data before committing to one for production use.
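To give a feel for what the library does after the transformer has run, here is a rough NumPy sketch of mean pooling followed by L2 normalisation. It operates on a hand-made toy array rather than real model output, and the function names are illustrative rather than the library's internals.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# Toy batch: one sentence, three token slots (the last is padding), two dimensions
token_embeddings = np.array([[[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pool(token_embeddings, attention_mask)
unit = l2_normalize(pooled)
print(pooled)   # the padding vector [9, 9] does not contribute
print(unit)
```

Masking matters because sentences in a batch are padded to a common length, and averaging over padding tokens would drag every embedding towards the same meaningless vector.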
Key advantages of Sentence Transformers:
- Pre-trained models: 100+ models ready to use, covering different languages and domains
- Simple API: Generate embeddings with just two lines of code
- Efficient: Optimized for both CPU and GPU inference
- Extensible: Easy to fine-tune on your own data
- Active development: Regular updates with new models and features
# Installation
pip install sentence-transformers
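# With GPU support (CUDA 11.8): install the CUDA build of PyTorch first,
# because --index-url replaces PyPI, where sentence-transformers is hosted
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install sentence-transformers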
Choosing the Right Model
Model selection significantly impacts both quality and performance. The best model depends on your language requirements, performance constraints, and quality needs. Let’s compare the most popular options and see how they differ in practice. A model that scores highest on the MTEB leaderboard is not automatically the best choice for your application — leaderboard scores are averages across many diverse benchmarks, and your specific domain (legal documents, medical records, e-commerce product descriptions) may weight tasks very differently.
Similarly, latency requirements matter: if your application must return search results in under 50 ms, a large 1024-dimensional model running on CPU may be disqualified regardless of its quality score. The pragmatic approach is to shortlist two or three candidates based on the criteria below, run a quick offline evaluation on a sample of your own data, and select the model that best balances quality and operational cost for your specific workload.
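A quick offline evaluation can be as simple as recall@k on a handful of labelled query/document pairs. The sketch below assumes you have already encoded queries and documents with each candidate model; the toy vectors and the `recall_at_k` helper are purely illustrative.

```python
import numpy as np


def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant_doc: list[int], k: int = 3) -> float:
    """Fraction of queries whose labelled document appears in the top-k results."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = q @ d.T                                  # cosine similarity matrix
    top_k = np.argsort(-scores, axis=1)[:, :k]        # best-scoring doc indices per query
    hits = [rel in row for rel, row in zip(relevant_doc, top_k)]
    return float(np.mean(hits))


# Toy stand-ins for embeddings produced by a candidate model
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
query_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
relevant_doc = [0, 2]   # the second query's labelled document is ranked last

r1 = recall_at_k(query_vecs, doc_vecs, relevant_doc, k=1)
r3 = recall_at_k(query_vecs, doc_vecs, relevant_doc, k=3)
print(f"recall@1 = {r1:.2f}, recall@3 = {r3:.2f}")
```

Running the same function over the embeddings from each shortlisted model gives a like-for-like comparison on your own data, which is usually more informative than a leaderboard average.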

The chart above compares models on the MTEB (Massive Text Embedding Benchmark) leaderboard. Here’s a detailed breakdown of top choices:
| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Fast | Very Good | General purpose, balanced |
| bge-large-en-v1.5 | 1024 | Medium | Excellent | High-quality retrieval |
| e5-large-v2 | 1024 | Medium | Excellent | Enterprise applications |
| multilingual-e5-large | 1024 | Medium | Excellent | Multi-language support |
- Start with all-MiniLM-L6-v2 for rapid prototyping and testing
- Use all-mpnet-base-v2 for production with balanced requirements
- Choose BGE or E5 large models when quality is paramount
- Select multilingual models if your content spans multiple languages
Basic Usage and Examples
Getting started with Sentence Transformers couldn’t be simpler. The library handles model downloading, tokenization, and inference automatically — the first time you instantiate a model it is fetched from the Hugging Face Hub and cached locally, so subsequent runs start instantly without any internet connection. Let’s walk through the essential operations that form the building blocks of any embedding pipeline. A solid grasp of the basic API also makes it much easier to understand the more advanced optimisation techniques discussed later in this article, because those optimisations are simply smart ways of calling the same core primitives at scale. Even if you only ever use the simplest two-line interface, understanding what happens under the hood will help you debug issues like unexpectedly slow inference, out-of-memory errors, or inconsistent similarity scores.
Loading a Model
from sentence_transformers import SentenceTransformer
# Load a pre-trained model (downloads automatically on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Check model info
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"Max sequence length: {model.max_seq_length}")
Generating Embeddings
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Single text
text = "Machine learning enables computers to learn from data."
embedding = model.encode(text)
print(f"Type: {type(embedding)}")
print(f"Shape: {embedding.shape}")
print(f"First 5 values: {embedding[:5]}")
# Multiple texts (more efficient)
texts = [
"Machine learning enables computers to learn from data.",
"Deep learning is a subset of machine learning.",
"The weather is nice today."
]
embeddings = model.encode(texts)
print(f"Batch shape: {embeddings.shape}") # (3, 384)
Encoding Options
The encode() method offers several useful parameters for controlling the output format, memory usage, and compute device. Understanding these parameters is key to writing efficient embedding code, because the default settings are convenient for experimentation but are often not optimal for production workloads. For example, normalize_embeddings=True applies L2 normalisation during encoding so that cosine similarity can be calculated as a simple dot product, which is cheaper than recomputing vector norms when you are comparing millions of vector pairs. Similarly, convert_to_tensor=True returns a PyTorch tensor on the model’s device instead of the default NumPy array, so the data stays on the GPU when the model runs there; this has significant throughput implications when downstream processing can also run on GPU. The batch_size parameter is arguably the most impactful of all: too small and you leave GPU parallelism on the table; too large and you trigger out-of-memory errors; the sweet spot is hardware dependent and worth benchmarking explicitly.
# With progress bar for large batches
embeddings = model.encode(
texts,
show_progress_bar=True
)
# Return PyTorch tensors instead of numpy arrays
embeddings = model.encode(
texts,
convert_to_tensor=True
)
# Normalize vectors to unit length (recommended for cosine similarity)
embeddings = model.encode(
texts,
normalize_embeddings=True
)
# Control batch size for memory management
embeddings = model.encode(
texts,
batch_size=32
)
# Use specific device
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
embeddings = model.encode(
texts,
device=device
)
Computing Semantic Similarity
Once you have embeddings, computing semantic similarity is straightforward. The most common metric is cosine similarity, which measures the angle between two vectors regardless of their magnitude. This makes cosine similarity robust to differences in text length — a short two-word query and a long paragraph can still produce a high similarity score if they address the same topic, because the L2 norms are effectively cancelled out.
In contrast, Euclidean distance is sensitive to vector magnitude and can produce counterintuitive results when comparing texts of very different lengths, which is why it is rarely the first choice for NLP retrieval tasks. Dot product similarity is another common alternative: on vectors normalised to unit length it is mathematically identical to cosine similarity, while on unnormalised vectors it lets the magnitude encode an additional relevance signal (a technique used by some asymmetric retrieval models). For the vast majority of use cases with Sentence Transformers, cosine similarity or dot product on normalised vectors is the right default choice.
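The differences between the three metrics are easy to see on a pair of small hand-picked vectors. This is a plain NumPy illustration; the vectors are arbitrary and not real embeddings.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])
b_scaled = 10 * b                     # same direction as b, ten times the magnitude

cos_ab = cosine(a, b)
cos_scaled = cosine(a, b_scaled)      # unchanged by scaling
dot_ab = float(a @ b)
dot_scaled = float(a @ b_scaled)      # grows with magnitude
dist_ab = float(np.linalg.norm(a - b))
dist_scaled = float(np.linalg.norm(a - b_scaled))   # grows with magnitude

# After L2 normalisation the dot product and cosine similarity coincide
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = float(a_unit @ b_unit)

print(f"cosine:    {cos_ab:.3f} vs {cos_scaled:.3f}")
print(f"dot:       {dot_ab:.1f} vs {dot_scaled:.1f}")
print(f"euclidean: {dist_ab:.3f} vs {dist_scaled:.3f}")
print(f"dot on unit vectors: {dot_unit:.3f}")
```

Scaling one vector leaves the cosine score untouched but changes both the dot product and the Euclidean distance, which is why normalising embeddings up front makes the choice of metric largely irrelevant.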

Understanding Similarity Scores
Cosine similarity ranges from -1 (opposite meaning) to 1 (identical meaning), with 0 indicating no relationship. For normalized vectors, the dot product equals cosine similarity, making computation very efficient. In practice, sentence embedding models rarely produce negative similarities for natural language text, so the effective working range is roughly 0.0–1.0; scores above 0.9 typically indicate near-duplicate content, 0.7–0.9 indicates closely related ideas, and scores below 0.5 often suggest the texts are addressing different topics.
Knowing these rough thresholds helps when setting retrieval cutoffs or building deduplication pipelines — rather than retrieving results with any positive similarity, you can apply a minimum threshold (say 0.65) to filter out weakly related chunks that would dilute the quality of a RAG response. These thresholds are model-dependent, however, so always calibrate them empirically on a labelled evaluation set rather than relying on rules of thumb.
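A minimum-score cutoff is straightforward to layer on top of top-k ranking. In the sketch below, `filter_by_threshold` is a hypothetical helper and the 0.65 cutoff is only the illustrative value from the text, not a recommendation for any particular model.

```python
import numpy as np


def filter_by_threshold(scores: np.ndarray, min_score: float = 0.65,
                        top_k: int = 5) -> list[tuple[int, float]]:
    """Return up to top_k (index, score) pairs, dropping anything below min_score."""
    order = np.argsort(-scores)[:top_k]   # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]


# Similarity of one query against five candidate chunks
scores = np.array([0.91, 0.42, 0.70, 0.64, 0.88])

kept = filter_by_threshold(scores, min_score=0.65, top_k=5)
for idx, score in kept:
    print(f"chunk {idx}: {score:.2f}")
```

With these scores, the chunks at 0.64 and 0.42 are dropped even though the caller asked for five results, so a RAG prompt built from `kept` contains only the reasonably related chunks.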
from sentence_transformers import SentenceTransformer, util
import torch
model = SentenceTransformer('all-MiniLM-L6-v2')
# Example sentences
sentences = [
"The cat sits on the windowsill.",
"A feline is resting by the window.",
"Dogs are loyal companions.",
"Python is a programming language.",
]
# Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
# Compute pairwise similarities
cosine_scores = util.cos_sim(embeddings, embeddings)
print("Pairwise similarity matrix:")
# Move to CPU first: .numpy() fails on a tensor that lives on the GPU
print(cosine_scores.cpu().numpy().round(3))
# Print the similarity for every unique pair of sentences
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Similarity [{i}]-[{j}]: {cosine_scores[i][j]:.3f}")
        print(f"  '{sentences[i][:40]}...'")
        print(f"  '{sentences[j][:40]}...'")
Semantic Search Implementation
Building a basic semantic search system with Sentence Transformers is remarkably simple, and the implementation below illustrates all the key components: indexing, query encoding, similarity computation, and ranking. It is intentionally kept minimal to make the core logic as clear as possible — in a real production system you would replace the in-memory vector store with a dedicated vector database like LanceDB or Qdrant, but the embedding logic remains essentially identical.
One important design decision worth noting is the separation between the index() phase (which encodes documents once and stores the results) and the search() phase (which encodes only the query at query time). This asymmetry is fundamental to efficient retrieval: document embeddings are expensive to compute but can be amortised across many queries, while query encoding is cheap and must be done in real time. Keeping these two phases separate also makes it easy to update the document index incrementally without re-encoding the entire corpus.
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearch:
    """Simple semantic search using Sentence Transformers."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents: list[str]):
        """Index a list of documents."""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for documents similar to the query."""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )
        # Compute similarities (dot product = cosine sim for normalized vectors)
        similarities = util.dot_score(query_embedding, self.embeddings)[0]
        # Get top-k results
        top_results = torch.topk(similarities, k=min(top_k, len(self.documents)))
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append((self.documents[idx], score.item()))
        return results

# Usage example
search_engine = SemanticSearch()
documents = [
"Python is a high-level programming language known for its simplicity.",
"Machine learning algorithms can identify patterns in data.",
"The Eiffel Tower is located in Paris, France.",
"Neural networks are inspired by biological brain structures.",
"JavaScript is commonly used for web development.",
"Deep learning requires large amounts of training data.",
]
search_engine.index(documents)
# Search
query = "What programming languages are easy to learn?"
results = search_engine.search(query, top_k=3)
print(f"\nQuery: {query}\n")
for doc, score in results:
    print(f"Score: {score:.3f} | {doc}")
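Incremental updates follow naturally from the index/search split: only the new documents need to be encoded, and their vectors are appended to the existing matrix. The sketch below is a NumPy-only illustration; `IncrementalIndex` and `toy_encode` are hypothetical names, and the toy encoder merely stands in for a real model.

```python
import numpy as np
from typing import Callable, Optional


class IncrementalIndex:
    """Append-only store that encodes only newly added documents."""

    def __init__(self, encode_fn: Callable[[list[str]], np.ndarray]):
        self.encode_fn = encode_fn
        self.documents: list[str] = []
        self.embeddings: Optional[np.ndarray] = None

    def add(self, new_documents: list[str]) -> None:
        """Encode the new documents and append them to the existing matrix."""
        new_vecs = np.asarray(self.encode_fn(new_documents), dtype=np.float32)
        self.documents.extend(new_documents)
        if self.embeddings is None:
            self.embeddings = new_vecs
        else:
            self.embeddings = np.vstack([self.embeddings, new_vecs])


def toy_encode(texts: list[str]) -> np.ndarray:
    """Stand-in encoder for the example: (character count, space count)."""
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)


index = IncrementalIndex(toy_encode)
index.add(["first document", "second"])
index.add(["a third one"])            # earlier documents are not re-encoded
print(index.embeddings.shape)
```

With a real model you could pass something like `lambda texts: model.encode(texts, normalize_embeddings=True)` as `encode_fn`. Repeated `np.vstack` calls copy the whole matrix each time, so for frequent updates a vector database is the better fit.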
Batch Processing Optimization
When embedding large document collections, efficient batch processing becomes critical. Poor batching can lead to memory errors, suboptimal GPU utilization, and unnecessarily long processing times. Let’s explore optimization strategies that can reduce total embedding time by an order of magnitude compared to naive single-document loops. The core insight is that GPU hardware is built for parallel computation: on modern hardware, processing 128 short documents in one batch often takes only modestly longer than processing a single document, so throughput can scale close to linearly with batch size up to the point where memory becomes the bottleneck.
On a consumer-grade GPU with 8 GB of VRAM and a typical 384-dimensional model, batch sizes in the range of 128–256 are usually a good starting point; on CPU, much smaller batches (32–64) tend to work better because the parallelism ceiling is lower. Measuring your specific hardware’s throughput curve is always worth the five minutes it takes: for a corpus of millions of documents, the difference between a suboptimal and optimal batch size can mean hours of wall-clock time.
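One way to measure that throughput curve is a small timing loop over candidate batch sizes. The sketch below is hardware-agnostic: `benchmark_batch_sizes` is a hypothetical helper that accepts any encoding callable, and `fake_encode` is a placeholder so the example runs without a model.

```python
import time
from typing import Callable, Sequence


def benchmark_batch_sizes(
    encode_fn: Callable[[Sequence[str], int], object],
    texts: Sequence[str],
    batch_sizes: Sequence[int] = (16, 32, 64, 128, 256),
) -> dict[int, float]:
    """Return throughput in texts per second for each candidate batch size."""
    results = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        encode_fn(texts, batch_size)
        elapsed = time.perf_counter() - start
        results[batch_size] = len(texts) / max(elapsed, 1e-9)
    return results


def fake_encode(texts: Sequence[str], batch_size: int) -> list:
    """Placeholder encoder: loops over batches without running a model."""
    out = []
    for i in range(0, len(texts), batch_size):
        out.extend([0.0] * 4 for _ in texts[i:i + batch_size])
    return out


throughput = benchmark_batch_sizes(fake_encode, ["sample text"] * 1000)
for batch_size, rate in throughput.items():
    print(f"batch_size={batch_size:>3}: {rate:,.0f} texts/s")
```

To benchmark a real model, pass something like `lambda t, bs: model.encode(list(t), batch_size=bs, show_progress_bar=False)` and use a sample of your own documents, since text length strongly affects the result. Run one warm-up call first so model loading does not distort the first measurement.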

Memory-Efficient Batch Processing
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import Generator, List
import gc
def batch_generator(items: list, batch_size: int) -> Generator:
    """Yield batches of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_large_corpus(
    texts: List[str],
    model: SentenceTransformer,
    batch_size: int = 64,
    show_progress: bool = True
) -> np.ndarray:
    """
    Embed a large corpus with memory-efficient batching.

    Args:
        texts: List of texts to embed
        model: SentenceTransformer model
        batch_size: Number of texts per batch
        show_progress: Whether to show progress

    Returns:
        Numpy array of embeddings
    """
    all_embeddings = []
    total_batches = (len(texts) + batch_size - 1) // batch_size
    for batch_idx, batch in enumerate(batch_generator(texts, batch_size)):
        # Encode batch
        batch_embeddings = model.encode(
            batch,
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
        all_embeddings.append(batch_embeddings)
        if show_progress and (batch_idx + 1) % 10 == 0:
            print(f"Processed batch {batch_idx + 1}/{total_batches}")
        # Periodic garbage collection for very large corpora
        if batch_idx % 100 == 0:
            gc.collect()
    return np.vstack(all_embeddings)


# Usage
model = SentenceTransformer('all-MiniLM-L6-v2')
# Simulate large corpus
large_corpus = [f"Document {i} with some content." for i in range(10000)]
embeddings = embed_large_corpus(
large_corpus,
model,
batch_size=128
)
print(f"Final embeddings shape: {embeddings.shape}")
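The function above still holds every embedding in RAM before stacking them. For corpora that do not fit in memory, one option is to write each batch straight into a disk-backed array. The following is a minimal sketch using `np.memmap`; `embed_to_memmap` and `toy_encode` are hypothetical names, and the toy encoder stands in for a real model.

```python
import os
import tempfile
from typing import Callable, Sequence

import numpy as np


def embed_to_memmap(texts: Sequence[str],
                    encode_fn: Callable[[Sequence[str]], np.ndarray],
                    dim: int, path: str, batch_size: int = 64) -> np.memmap:
    """Encode texts batch by batch, writing results into a disk-backed array."""
    out = np.memmap(path, dtype=np.float32, mode="w+", shape=(len(texts), dim))
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        out[start:start + len(batch)] = encode_fn(batch)   # only one batch in RAM
    out.flush()   # make sure everything is written to disk
    return out


def toy_encode(batch: Sequence[str]) -> np.ndarray:
    """Stand-in encoder: three copies of each text's character count."""
    return np.array([[len(t)] * 3 for t in batch], dtype=np.float32)


texts = [f"doc {i}" for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "embeddings.f32")
vectors = embed_to_memmap(texts, toy_encode, dim=3, path=path, batch_size=4)
print(vectors.shape)
```

A raw memmap file stores no shape or dtype metadata, so you need to record `dim` and the number of rows separately in order to reopen it later with `np.memmap(path, dtype=np.float32, mode="r", shape=...)`.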
GPU Optimization
import torch
import numpy as np
from typing import List, Optional
from sentence_transformers import SentenceTransformer


def get_optimal_batch_size(model: SentenceTransformer, sample_text: str) -> int:
    """Estimate a batch size from available GPU memory (rough heuristic)."""
    # Note: sample_text is not used by this simple estimate
    if not torch.cuda.is_available():
        return 32  # CPU default
    # Get GPU memory info
    gpu_memory = torch.cuda.get_device_properties(0).total_memory
    gpu_memory_gb = gpu_memory / (1024**3)
    # Estimate based on model size and GPU memory
    model_size = sum(p.numel() * p.element_size() for p in model.parameters())
    model_size_gb = model_size / (1024**3)
    # Available memory for batching (leave headroom)
    available_gb = (gpu_memory_gb - model_size_gb) * 0.7
    # Rough estimate: ~0.01 GB per batch item for typical models
    estimated_batch = int(available_gb / 0.01)
    # Clamp to reasonable range
    return max(16, min(512, estimated_batch))


# GPU-optimized encoding
def encode_on_gpu(
    texts: List[str],
    model: SentenceTransformer,
    batch_size: Optional[int] = None
) -> np.ndarray:
    """Encode texts using GPU with automatic batch sizing."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    if batch_size is None:
        batch_size = get_optimal_batch_size(model, texts[0] if texts else "sample")
    print(f"Using device: {device}, batch size: {batch_size}")
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        device=device,
        normalize_embeddings=True
    )
    return embeddings
Multilingual Embeddings
For applications dealing with multiple languages, multilingual embedding models are essential. These models create a shared embedding space where semantically similar content in different languages maps to nearby vectors — enabling cross-lingual search and retrieval without any translation step. This shared space emerges from training on massive multilingual corpora aligned across languages, teaching the model that the English word “cat”, the French word “chat”, and the German word “Katze” should all occupy the same neighbourhood in the embedding space.
The practical implication is powerful: a user can submit a query in Spanish and retrieve relevant documents that were written in Japanese, with no intermediate translation, as long as both are semantically related. This capability is invaluable in globalised enterprise settings where document repositories accumulate content in dozens of languages over time. One caveat worth noting is that multilingual models typically underperform their monolingual counterparts on English-only benchmarks, so if your application is exclusively English, a dedicated English model will usually give better results for the same compute cost.
from sentence_transformers import SentenceTransformer, util
# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# Texts in different languages (all mean similar things)
texts = [
"Machine learning is transforming technology.", # English
"L'apprentissage automatique transforme la technologie.", # French
"Maschinelles Lernen verändert die Technologie.", # German
"El aprendizaje automático está transformando la tecnología.", # Spanish
"机器学习正在改变技术。", # Chinese
"機械学習はテクノロジーを変革しています。", # Japanese
]
# Encode all texts
embeddings = model.encode(texts, normalize_embeddings=True)
# Compute cross-lingual similarities
similarities = util.cos_sim(embeddings, embeddings)
print("Cross-lingual similarity matrix:")
print(similarities.numpy().round(2))
# All sentences should have high similarity despite different languages
Recommended Multilingual Models
| Model | Languages | Dimensions | Notes |
|---|---|---|---|
| multilingual-e5-large | 100+ | 1024 | Best quality, larger size |
| paraphrase-multilingual-MiniLM-L12-v2 | 50+ | 384 | Good balance, fast |
| distiluse-base-multilingual-cased-v2 | 50+ | 512 | Proven performer |
| LaBSE | 109 | 768 | Google’s language-agnostic model |
Production Pipeline Implementation
Let’s bring everything together into a production-ready embedding service. This implementation includes caching, error handling, monitoring, and efficient batch processing. These are not optional niceties — they are the difference between a research prototype that works in a notebook and a system that reliably serves hundreds of users in production. The caching layer, in particular, has an outsized impact on both latency and cost: in a typical RAG application, the same document chunks are re-embedded far less frequently than they are retrieved, but popular query phrasings can repeat thousands of times per day, and serving cached embeddings for those is essentially free.
The metrics hooks also pay for themselves quickly by surfacing unexpected performance regressions, model loading failures, or memory leaks that would otherwise only manifest as subtle degradations in search quality rather than hard errors. Treat this implementation as a starting point rather than a final blueprint — production services will inevitably need additional features like distributed caching, circuit breakers, and model hot-reloading, but the core patterns shown here remain the same.
"""
Production-ready embedding service using Sentence Transformers.
Features: caching, batch optimization, error handling, and metrics.
"""
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Optional, Dict, Any, Union
from dataclasses import dataclass, field
from pathlib import Path
import hashlib
import pickle
import time
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class EmbeddingResult:
    """Result of an embedding operation."""
    embeddings: np.ndarray
    texts: List[str]
    model_name: str
    dimension: int
    processing_time: float
    from_cache: int = 0

    @property
    def count(self) -> int:
        return len(self.texts)


@dataclass
class EmbeddingConfig:
    """Configuration for the embedding service."""
    model_name: str = "all-MiniLM-L6-v2"
    batch_size: int = 64
    normalize: bool = True
    cache_dir: Optional[str] = None
    device: Optional[str] = None  # None = auto-detect
    max_seq_length: Optional[int] = None


class EmbeddingService:
    """
    Production embedding service with caching and optimization.

    Features:
    - Automatic device selection (GPU/CPU)
    - Configurable caching
    - Batch processing optimization
    - Error handling with logging
    - Performance metrics
    """

    def __init__(self, config: Optional[EmbeddingConfig] = None):
        self.config = config or EmbeddingConfig()
        self._model = None
        self._cache: Dict[str, np.ndarray] = {}
        self._cache_dir = Path(self.config.cache_dir) if self.config.cache_dir else None
        # Metrics
        self._total_encoded = 0
        self._total_cache_hits = 0
        self._total_time = 0.0
        # Initialize cache directory
        if self._cache_dir:
            self._cache_dir.mkdir(parents=True, exist_ok=True)
            self._load_cache()

    @property
    def model(self) -> SentenceTransformer:
        """Lazy-load the model."""
        if self._model is None:
            logger.info(f"Loading model: {self.config.model_name}")
            self._model = SentenceTransformer(self.config.model_name)
            if self.config.max_seq_length:
                self._model.max_seq_length = self.config.max_seq_length
            if self.config.device:
                self._model = self._model.to(self.config.device)
            logger.info(f"Model loaded. Dimension: {self._model.get_sentence_embedding_dimension()}")
        return self._model

    def embed(
        self,
        texts: Union[str, List[str]],
        use_cache: bool = True
    ) -> EmbeddingResult:
        """
        Generate embeddings for text(s).

        Args:
            texts: Single text or list of texts
            use_cache: Whether to use caching

        Returns:
            EmbeddingResult with embeddings and metadata
        """
        start_time = time.time()
        # Normalize input
        if isinstance(texts, str):
            texts = [texts]
        # Check cache
        if use_cache:
            embeddings, cache_hits = self._get_from_cache(texts)
        else:
            embeddings = [None] * len(texts)
            cache_hits = 0
        # Find texts that need encoding
        to_encode_indices = [i for i, e in enumerate(embeddings) if e is None]
        to_encode_texts = [texts[i] for i in to_encode_indices]
        # Encode missing texts
        if to_encode_texts:
            new_embeddings = self._encode_batch(to_encode_texts)
            for idx, emb in zip(to_encode_indices, new_embeddings):
                embeddings[idx] = emb
                # Cache the new embedding
                if use_cache:
                    self._add_to_cache(texts[idx], emb)
        # Stack results
        embeddings_array = np.vstack(embeddings)
        # Update metrics
        processing_time = time.time() - start_time
        self._total_encoded += len(texts)
        self._total_cache_hits += cache_hits
        self._total_time += processing_time
        return EmbeddingResult(
            embeddings=embeddings_array,
            texts=texts,
            model_name=self.config.model_name,
            dimension=embeddings_array.shape[1],
            processing_time=processing_time,
            from_cache=cache_hits
        )

    def _encode_batch(self, texts: List[str]) -> np.ndarray:
        """Encode a batch of texts."""
        try:
            embeddings = self.model.encode(
                texts,
                batch_size=self.config.batch_size,
                normalize_embeddings=self.config.normalize,
                show_progress_bar=len(texts) > 100
            )
            return embeddings
        except Exception as e:
            logger.error(f"Encoding error: {e}")
            raise

    def _get_cache_key(self, text: str) -> str:
        """Generate cache key for text."""
        content = f"{self.config.model_name}:{text}"
        return hashlib.md5(content.encode()).hexdigest()

    def _get_from_cache(self, texts: List[str]) -> tuple[List[Optional[np.ndarray]], int]:
        """Retrieve embeddings from cache."""
        embeddings = []
        hits = 0
        for text in texts:
            key = self._get_cache_key(text)
            if key in self._cache:
                embeddings.append(self._cache[key])
                hits += 1
            else:
                embeddings.append(None)
        return embeddings, hits

    def _add_to_cache(self, text: str, embedding: np.ndarray):
        """Add embedding to cache."""
        key = self._get_cache_key(text)
        self._cache[key] = embedding

    def _load_cache(self):
        """Load cache from disk."""
        cache_file = self._cache_dir / f"{self.config.model_name.replace('/', '_')}_cache.pkl"
        if cache_file.exists():
            try:
                with open(cache_file, 'rb') as f:
                    self._cache = pickle.load(f)
                logger.info(f"Loaded {len(self._cache)} cached embeddings")
            except Exception as e:
                logger.warning(f"Could not load cache: {e}")
                self._cache = {}

    def save_cache(self):
        """Save cache to disk."""
        if self._cache_dir:
            cache_file = self._cache_dir / f"{self.config.model_name.replace('/', '_')}_cache.pkl"
            with open(cache_file, 'wb') as f:
                pickle.dump(self._cache, f)
            logger.info(f"Saved {len(self._cache)} embeddings to cache")

    def get_metrics(self) -> Dict[str, Any]:
        """Get service metrics."""
        return {
            "total_encoded": self._total_encoded,
            "cache_hits": self._total_cache_hits,
            "cache_hit_rate": self._total_cache_hits / max(1, self._total_encoded),
            "total_processing_time": self._total_time,
            "avg_time_per_text": self._total_time / max(1, self._total_encoded),
            "cache_size": len(self._cache),
            "model_name": self.config.model_name,
        }

    def clear_cache(self):
        """Clear the embedding cache."""
        self._cache = {}
        logger.info("Cache cleared")


# Example usage
if __name__ == "__main__":
    # Configure service
    config = EmbeddingConfig(
        model_name="all-MiniLM-L6-v2",
        batch_size=64,
        normalize=True,
        cache_dir="./embedding_cache"
    )
    service = EmbeddingService(config)
    # Embed some texts
    texts = [
        "Artificial intelligence is transforming industries.",
        "Machine learning models learn from data.",
        "Natural language processing enables text understanding.",
    ]
    result = service.embed(texts)
    print(f"Embedded {result.count} texts in {result.processing_time:.3f}s")
    print(f"Embeddings shape: {result.embeddings.shape}")
    print(f"From cache: {result.from_cache}")
    # Embed again (should hit cache)
    result2 = service.embed(texts)
    print(f"\nSecond run - from cache: {result2.from_cache}")
    # Get metrics
    print(f"\nService metrics: {service.get_metrics()}")
    # Save cache for future use
    service.save_cache()
- Cache aggressively: Embedding the same text twice is wasteful
- Batch when possible: GPU throughput improves significantly with batching
- Monitor memory: Large batches can exhaust GPU memory
- Normalize embeddings: Simplifies similarity computation
- Version your models: Different models produce incompatible embeddings
- Pre-warm models: First inference is slower due to model loading
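Caching aggressively and monitoring memory pull in opposite directions when the cache is a plain dictionary that grows without limit. One way to reconcile them is a size-bounded cache with least-recently-used eviction. The sketch below is illustrative only: `LRUEmbeddingCache` is a hypothetical class, not part of Sentence Transformers or of the service above.

```python
from collections import OrderedDict
from typing import Optional

import numpy as np


class LRUEmbeddingCache:
    """In-memory cache that evicts the least recently used entry when full."""

    def __init__(self, max_items: int = 10_000):
        self.max_items = max_items
        self._store: "OrderedDict[str, np.ndarray]" = OrderedDict()

    def get(self, key: str) -> Optional[np.ndarray]:
        """Return the cached vector and mark it as recently used, or None."""
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]

    def put(self, key: str, value: np.ndarray) -> None:
        """Insert or refresh an entry, evicting the oldest one if over capacity."""
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)   # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._store)


cache = LRUEmbeddingCache(max_items=2)
cache.put("a", np.zeros(3))
cache.put("b", np.ones(3))
cache.get("a")                  # "a" becomes the most recently used entry
cache.put("c", np.full(3, 2.0))  # capacity exceeded, so "b" is evicted
print(len(cache), cache.get("b") is None)
```

The keys would typically be the same model-plus-text hashes produced by a method like `_get_cache_key`, so that embeddings from different models never collide.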
Conclusion
Sentence Transformers provides everything you need to build production-quality embedding systems without relying on external APIs. By choosing the right model for your use case, implementing efficient batch processing, and following best practices for caching and optimization, you can achieve excellent results at minimal cost. Running embeddings locally also gives you complete control over your data pipeline — no chunks of sensitive customer documents leave your infrastructure, no API rate limits throttle your indexing jobs, and no price changes from a cloud vendor can make your application suddenly uneconomical.
As open-source models continue to close the quality gap with proprietary alternatives, the case for local embeddings only strengthens: the bge-large-en-v1.5 and e5-large-v2 models, for example, already outperform OpenAI’s text-embedding-ada-002 on several MTEB benchmarks while costing nothing per API call. Investing time now in a well-structured local embedding pipeline pays dividends across every project that builds on it, from the RAG system you are building today to future applications you haven’t yet imagined.
Key Takeaways
Key takeaways from this article:
- Text embeddings capture semantic meaning in dense vector representations
- Model selection involves trade-offs between speed, quality, and resource requirements
- Batch processing with appropriate sizes maximizes throughput
- Multilingual models enable cross-language semantic search
- Caching prevents redundant computation in production systems
In the next article, we’ll explore how to store and efficiently search these embeddings using LanceDB, a lightweight but powerful vector database perfect for local RAG applications. LanceDB is designed to run entirely in-process alongside your Python application, eliminating the need to deploy and operate a separate vector-database server, which is a significant advantage for local and edge deployments. We will cover how to create and persist a LanceDB index from the embeddings generated here, how to run approximate nearest-neighbour queries at scale, and how to integrate the store into the end-to-end RAG pipeline developed throughout this series. By the end of that article, you will have a fully functional local retrieval system capable of indexing thousands of PDF pages and returning semantically relevant results in milliseconds, entirely offline and without any cloud dependencies.