Chunking is arguably the most critical yet often overlooked component of any Retrieval-Augmented Generation (RAG) system. The way you split your documents into smaller pieces directly impacts retrieval accuracy, context relevance, and ultimately the quality of generated responses. Poor chunking leads to lost context, irrelevant retrievals, and hallucinations—while intelligent chunking enables precise, contextually-rich answers.
In this comprehensive guide, we’ll explore both the science and the art of document chunking — the often-overlooked craft that separates mediocre RAG systems from truly effective ones. You’ll learn why chunk size matters and how to measure its impact, how to implement semantic-aware splitting that respects the natural structure of your documents, the critical role that overlap plays in preventing context loss at boundaries, and techniques for enriching chunks with the metadata that enables precise filtered retrieval. We’ll also cover how to evaluate your chunking strategy objectively, so you can quantify improvements and make data-driven decisions rather than relying on intuition. By the end, you’ll have a production-ready chunking pipeline that significantly improves your RAG system’s retrieval precision and the quality of responses it enables.
Why Chunking Matters for RAG
Before diving into implementation specifics, let’s understand why chunking is so crucial to the overall performance of a RAG system. In this architecture, documents are first split into chunks, then each chunk is converted into a dense embedding vector using a model like Sentence Transformers, and finally stored in a vector database indexed for fast similarity lookup. When a user asks a question, the query undergoes the same embedding process, and the system retrieves the most semantically similar chunks to provide as context window input to the language model. The quality of retrieved context directly determines whether the LLM can generate an accurate, grounded, well-cited response — or whether it will hallucinate, miss critical details, or provide only a partial answer. This means the chunking stage is not just a preprocessing convenience but a fundamental architectural decision with cascading effects on every downstream component of the RAG pipeline.
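To make the chunk, embed, store, retrieve flow concrete, here is a minimal sketch. The `toy_embed` function is a hypothetical stand-in (a bag-of-words count vector) used only so the example runs without a model; a real system would call an embedding model such as Sentence Transformers and query a vector database instead of a Python list.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(word.strip(".,!?").lower() for word in text.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Embed the query the same way as the chunks and return the closest ones."""
    index = [(chunk, toy_embed(chunk)) for chunk in chunks]  # stands in for the vector store
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

chunks = [
    "Overlap repeats text at chunk boundaries.",
    "Metadata enables filtered retrieval by source or date.",
]
print(retrieve("How does metadata help retrieval?", chunks))
```

Whatever the chunker emits is exactly what gets embedded and ranked here, which is why the splitting step shapes everything downstream.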
Consider what happens with poorly designed chunks — these failure modes are not theoretical edge cases but the everyday reality of RAG systems built without careful attention to splitting strategy. In production, they manifest as user-visible symptoms: a chatbot that misses obvious facts it should know, a search tool that surfaces tangentially related results instead of the right answer, or a summarization system that leaves out critical details because the relevant chunks were never retrieved. These problems are insidious because they’re often difficult to diagnose — retrieval failures don’t generate exceptions or error logs; they silently return slightly-wrong context that leads the LLM to produce plausible-sounding but inaccurate or incomplete responses. Each failure mode below can be directly addressed by the specific chunking techniques covered in this article, which is why understanding the structural causes is the essential foundation for designing effective solutions:
- Chunks too large: Embedding quality degrades as the semantic signal becomes diluted. The chunk may contain multiple topics, making similarity search less precise.
- Chunks too small: Critical context is lost. A chunk might contain a partial sentence or idea that’s meaningless without surrounding text.
- Hard boundaries: Splitting mid-sentence or mid-paragraph destroys semantic coherence and can separate related information.
- No overlap: Information at chunk boundaries may become unretrievable if it’s split between two chunks that neither fully captures.

The visualization above makes these failure modes concrete, translating abstract performance degradation into structural problems you can identify by direct inspection of your chunk outputs. Naive character-based chunking creates arbitrary boundaries that slice ideas mid-sentence and split related information across chunk borders, while semantic chunking respects the document’s actual logical structure by treating paragraphs and topic shifts as natural split points. The difference may look minor when you examine individual examples in isolation, but it has compounding effects on retrieval quality: when the embedding model encodes a chunk containing a sentence fragment or a half-formed argument, the resulting vector doesn’t accurately represent any coherent concept, leading to retrieval misses for queries that should have matched that content. To understand why, recall that transformer embedding models were trained on well-formed, complete natural-language inputs — sentences, paragraphs, documents — and their learned representations are calibrated for that regime; feeding them truncated fragments produces embeddings that fall outside the distribution the model was trained on, degrading their geometric reliability in similarity search. Conversely, when a chunk cleanly encodes a single well-formed idea, its embedding sits precisely in the semantic neighborhood of queries asking about that idea, producing the high-precision retrieval that makes a RAG system genuinely useful rather than merely impressive in demos. This is why improving chunking strategy — before touching the embedding model, the vector index, or the LLM prompt — consistently yields the best retrieval quality gains per hour of engineering effort.
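One way to inspect chunk outputs directly is a quick boundary check. The sketch below is a rough heuristic, not a standard metric: the function name `boundary_report` and its rules (a chunk that starts with a lowercase letter or ends without closing punctuation is treated as cut mid-sentence) are assumptions for illustration.

```python
def boundary_report(chunks: list[str]) -> dict[str, float]:
    """Fraction of chunks that appear to start or end mid-sentence (heuristic)."""
    stripped = [c.strip() for c in chunks if c.strip()]
    if not stripped:
        return {"starts_mid_sentence": 0.0, "ends_mid_sentence": 0.0}
    # A lowercase first character suggests the chunk begins inside a sentence.
    starts_mid = sum(1 for c in stripped if c[0].islower())
    # Missing terminal punctuation suggests the chunk ends inside a sentence.
    ends_mid = sum(1 for c in stripped if c[-1] not in ".!?\"')")
    return {
        "starts_mid_sentence": starts_mid / len(stripped),
        "ends_mid_sentence": ends_mid / len(stripped),
    }

sample = ["The quick brown fox jumps over", " the lazy dog. It was a beauti", "ful day."]
report = boundary_report(sample)
print(report)
```

Running this on a sample of your own chunks gives a cheap signal of how often boundaries fall inside sentences before you invest in a full retrieval evaluation.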
Chunking Approaches Compared
There are several established approaches to document chunking, each with distinct characteristics, different implementation complexities, and varying suitability for different types of source material. Understanding these options thoroughly helps you choose the right strategy — or the right combination of strategies — for your specific use case and document corpus rather than defaulting to whatever example code you find first. The choice is not purely technical: it also involves practical considerations like preprocessing compute budget, whether your documents have consistent or highly variable structure, how frequently new documents are added to the collection, and whether you have labeled query-document pairs to evaluate against. It’s also worth noting that these strategies are not mutually exclusive — a production system might apply semantic chunking for high-value reference documents queried frequently, while using faster recursive splitting for ephemeral or rapidly-updated content where re-indexing latency matters. The metadata layer also interacts with chunking strategy: more sophisticated semantic chunking tends to produce chunks that are harder to assign to a single section heading, while simpler structural splitting produces chunks with cleaner hierarchical metadata that supports richer filtered search. Starting with a simpler approach, measuring its retrieval performance on a representative query set, and then deciding whether the added complexity of semantic splitting is worth it is generally the right engineering mindset.

1. Fixed-Size Character Chunking
The simplest approach splits text into chunks of exactly N characters and can be implemented in a single line of code. While trivially easy to implement, this method ignores all semantic and syntactic boundaries, often cutting words in half, sentences mid-thought, and paragraphs before they’ve made their point. The resulting chunks can start mid-word and end mid-sentence, which is not just aesthetically unpleasant but harmful to embedding quality: the transformer model that encodes these chunks was trained on well-formed natural language text, and fragments that violate basic grammatical boundaries produce embeddings that poorly represent any coherent semantic concept. Fixed-size character chunking is useful as a baseline for benchmarking and as a quick-and-dirty approach in throwaway prototypes, but should almost never be used in a production RAG system where retrieval quality matters.
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Problem: This splits mid-word and mid-sentence
text = "The quick brown fox jumps over the lazy dog. It was a beautiful day."
chunks = fixed_size_chunks(text, 30)
# Result: ['The quick brown fox jumps over', ' the lazy dog. It was a beauti', 'ful day.']
2. Recursive Character Splitting
Recursive character splitting is a significant improvement over fixed-size chunking because it attempts to split text on natural linguistic boundaries before resorting to arbitrary character positions. The algorithm tries a prioritized list of separators — double newlines between paragraphs first, then single newlines, then sentence-ending periods with spaces, then plain spaces, and only finally individual characters when nothing else fits — producing chunks that at minimum preserve whole words and ideally preserve whole sentences. This is the core approach used by LangChain’s RecursiveCharacterTextSplitter, which has become something of a community standard for text splitting precisely because it works well across a wide variety of document types without requiring any domain knowledge or model infrastructure. For most RAG applications working with well-formatted source documents, recursive splitting provides a very good balance between implementation simplicity, low preprocessing cost, and solid retrieval quality.
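def recursive_split(
    text: str,
    chunk_size: int = 500,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", "")
) -> list[str]:
    """Recursively split text, trying natural boundaries first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Try each separator in order of preference
    for sep in separators:
        # The empty separator means "split by character"; the fallback below handles it
        if sep and sep in text:
            parts = text.split(sep)
            chunks = []
            current = ""
            # Greedily merge adjacent parts while the result still fits
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece that is still too large
            result = []
            for chunk in chunks:
                if len(chunk) > chunk_size:
                    result.extend(recursive_split(chunk, chunk_size, separators))
                else:
                    result.append(chunk)
            return result
    # Fallback to character splitting
    return fixed_size_chunks(text, chunk_size)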
3. Sentence-Based Chunking
Sentence-based chunking groups complete sentences together, either a fixed number per chunk or until a target chunk size is reached, so that every chunk contains only complete linguistic units that stand on their own. Because the embedding model always receives well-formed input, this tends to produce more stable and meaningful embeddings than character-based approaches that may introduce partial sentences. The tradeoff is that chunk sizes become variable (some sentences are short and some are long), which means your batch sizes for embedding and your eventual chunk count will vary more than with fixed-size approaches. For documents where logical completeness of individual statements is important, such as FAQ entries, legal statements, or factual claims that must not be split, sentence-based chunking helps each chunk carry complete, self-contained meaning.
import re

def sentence_chunking(text: str, max_sentences: int = 5) -> list[str]:
    """Split text into chunks of N sentences each."""
    # Simple sentence splitting (for production, use spaCy or NLTK)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        if chunk.strip():
            chunks.append(chunk)
    return chunks
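The function above groups a fixed number of sentences; a variant that packs whole sentences up to a size budget may fit better when you need to bound chunk length. The sketch below is one possible approach, and the name `sentence_chunks_by_size` and the `max_chars` parameter are illustrative rather than part of any library.

```python
import re

def sentence_chunks_by_size(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next sentence would overflow the budget, so close the chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

text = "First point. Second point is longer. Third."
result = sentence_chunks_by_size(text, max_chars=30)
print(result)
```

Keeping an oversized sentence whole is a deliberate choice here: it preserves the "complete sentences only" property at the cost of occasionally exceeding the budget.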
4. Semantic Chunking
Semantic chunking is the most sophisticated approach and uses embedding models to identify the actual meaning boundaries in text, rather than relying on syntactic markers like paragraph breaks or sentence punctuation. The core insight is that consecutive sentences discussing the same topic should be kept together in the same chunk, while sentences that mark a transition to a new topic should trigger a split — even if no explicit structural marker like a heading or paragraph break signals that transition. By measuring the cosine similarity between the embeddings of adjacent sentences, we can detect these semantic transitions as drops in similarity below a configurable threshold, effectively letting the embedding model itself guide the chunking decision. This approach is especially powerful for documents with irregular or inconsistent formatting — like transcripts, notes, or scraped web content — where structural markers alone are unreliable indicators of topic boundaries.
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunking(
    text: str,
    model: SentenceTransformer,
    threshold: float = 0.5,
    min_sentences: int = 2
) -> list[str]:
    """Split text based on semantic similarity between sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) < 2:
        return [text]
    # Get embeddings for all sentences
    embeddings = model.encode(sentences)
    # Calculate cosine similarity between adjacent sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = np.dot(embeddings[i], embeddings[i + 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
        )
        similarities.append(sim)
    # Find breakpoints where similarity drops below threshold
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        # Split only when similarity drops and the current chunk is long enough
        if sim < threshold and len(current_chunk) >= min_sentences:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
Choosing the Right Chunk Size
Chunk size is one of the most heavily debated parameters in RAG system design, with strong opinions on all sides and significant empirical variation depending on the embedding model, the document corpus, and the nature of the user queries. The core tension is between specificity and completeness: smaller chunks produce more focused embeddings that match queries with higher precision but may lack the surrounding context needed to form a complete, useful answer; larger chunks preserve more context but dilute the embedding’s semantic signal, reducing matching precision. The optimal size also depends critically on your embedding model’s maximum input length: all-MiniLM-L6-v2 truncates input beyond 256 tokens by default, while larger models like BGE-large or E5-large accept up to 512 tokens, and any text past the limit never reaches the embedding. Benchmarking your specific embedding model on a representative sample of your actual queries and documents is the only reliable way to find the sweet spot for your use case.

As shown in the chart above, there is a clear inverse-U-shaped relationship between chunk size and retrieval performance across most tested document types and embedding models. Too small leads to fragmented embeddings where individual chunks lack enough semantic content to match anything meaningfully; too large dilutes the embedding signal so severely that the vector represents an average of too many topics to usefully match any specific query. The sweet spot for most general-purpose applications falls in the 512–768 token range, but the specific optimal value can vary by as much as 3-4× depending on document type — legal documents with long interconnected clauses benefit from much larger chunks than FAQ-style documents where each entry is self-contained. These are starting-point guidelines; always validate against your actual query patterns:
| Document Type | Recommended Size | Reasoning |
|---|---|---|
| Technical docs | 512-1024 tokens | Complex concepts need more context |
| FAQs / Q&A | 256-512 tokens | Each Q&A is self-contained |
| Legal documents | 1024-2048 tokens | Clauses need full context to be meaningful |
| News articles | 512-768 tokens | Paragraphs are mostly self-contained |
| Conversational | 256-512 tokens | Short exchanges, quick retrieval |
Different embedding models have different input limits and optimal input lengths:
- all-MiniLM-L6-v2: Best up to ~256 tokens (longer input is truncated by default)
- all-mpnet-base-v2: Handles up to 384 tokens well
- e5-large-v2: Optimized for 512 tokens
- BGE models: bge-large-en-v1.5 handles up to 512 tokens; check the model card for other variants
Always align your chunk size with your embedding model’s sweet spot for best results.
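A simple guard can flag chunks that are likely to exceed a model's input limit before you embed them. The sketch below makes two loud assumptions: the limits in `MODEL_TOKEN_LIMITS` are copied from the list above and should be verified against each model card, and `approx_token_count` uses a rough 1.3 tokens-per-word estimate instead of the model's real tokenizer.

```python
# Limits taken from the list above; verify them against each model's card.
MODEL_TOKEN_LIMITS = {
    "all-MiniLM-L6-v2": 256,
    "all-mpnet-base-v2": 384,
    "e5-large-v2": 512,
}

def approx_token_count(text: str) -> int:
    """Rough estimate: about 1.3 tokens per whitespace-separated word (an assumption)."""
    return int(len(text.split()) * 1.3)

def oversized_chunks(chunks: list[str], model_name: str) -> list[int]:
    """Return indices of chunks whose estimated token count exceeds the model limit."""
    limit = MODEL_TOKEN_LIMITS[model_name]
    return [i for i, chunk in enumerate(chunks) if approx_token_count(chunk) > limit]

chunks = ["short chunk", "word " * 300]
print(oversized_chunks(chunks, "all-MiniLM-L6-v2"))
print(oversized_chunks(chunks, "e5-large-v2"))
```

For accurate counts, swap `approx_token_count` for the tokenizer that ships with your embedding model; the word-based estimate is only meant as a cheap first pass.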
The Role of Chunk Overlap
Chunk overlap is a technique where consecutive chunks deliberately share some content at their boundaries, creating intentional redundancy that might seem wasteful at first but turns out to be crucial for maintaining context continuity and preventing critical information from falling through the cracks. Without overlap, a document is divided into perfectly non-overlapping, butted-together segments — and any sentence or fact that happens to fall at a chunk boundary will be split between two chunks, with neither chunk containing enough of the statement to match queries about it. Overlap is the engineering solution to this boundary problem: by repeating the last N characters of each chunk at the beginning of the next, you ensure that information near boundaries appears in full context in at least one chunk, making it retrievable for any query semantically related to that content. The mathematical intuition is straightforward: if a critical definition spans positions 480–520 in a 500-character chunk, neither that chunk alone nor the following one captures the full definition — but with 100 characters of overlap, the following chunk begins at position 400, and the definition appears completely within it. Overlap also acts as insurance against imperfect splitting: even if your semantic or structural splitting logic places a boundary slightly wrong, the overlap ensures that information near that boundary still appears in complete form in the preceding or following chunk. The cost is a modest increase in storage and index size — typically 10–25% depending on the overlap ratio — but the improvement in retrieval completeness almost always justifies it.

Without overlap, a crucial fact or definition might straddle the boundary between chunks: the first part of the explanation in Chunk N and the conclusion in Chunk N+1, with neither chunk independently containing enough of the statement to be retrieved for a query about it. Overlap solves this by ensuring that any sentence that begins near the end of a chunk also appears near the beginning of the following chunk, so it will be fully captured in at least one chunk that can then be retrieved and provided as context. This is especially important for factual claims where the relationship between subject and predicate spans multiple sentences — a common pattern in technical and scientific writing. The slight increase in index size and embedding computation cost is almost always worth paying for this retrieval completeness guarantee.
def chunking_with_overlap(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100
) -> list[str]:
    """Split text into overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to end at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind(". ")
            if last_period > chunk_size // 2:  # Only if in second half
                chunk = chunk[:last_period + 1]
                end = start + last_period + 1
        if chunk.strip():
            chunks.append(chunk.strip())
        # Stop once the end of the text has been reached
        if end >= len(text):
            break
        # Move start back by overlap amount, always making forward progress
        start = max(end - overlap, start + 1)
    return chunks

# Example
text = """First paragraph introduces the topic. It contains important context.
Second paragraph builds on the first. It adds more details.
Third paragraph concludes. It summarizes everything."""
chunks = chunking_with_overlap(text, chunk_size=100, overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk[:50]}...")
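The boundary arithmetic described earlier (a definition spanning positions 480 to 520 with 500-character chunks) can be checked directly. The sketch below uses plain fixed-size spans without the sentence-boundary snapping of the function above, and the helper names `chunk_spans` and `covering_chunks` are hypothetical.

```python
def chunk_spans(text_len: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) character spans for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start += step
    return spans

def covering_chunks(spans: list[tuple[int, int]], lo: int, hi: int) -> list[int]:
    """Indices of chunks that fully contain the character range [lo, hi)."""
    return [i for i, (s, e) in enumerate(spans) if s <= lo and hi <= e]

no_overlap = chunk_spans(1000, 500, 0)      # [(0, 500), (500, 1000)]
with_overlap = chunk_spans(1000, 500, 100)  # [(0, 500), (400, 900), (800, 1000)]
print(covering_chunks(no_overlap, 480, 520))    # [] : no chunk holds the full definition
print(covering_chunks(with_overlap, 480, 520))  # [1] : the second chunk holds it
```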
Recommended Overlap Values
The optimal overlap depends on your chunk size and content type, and getting it right requires understanding the trade-off between retrieval completeness and index efficiency. A common rule of thumb is 10-20% overlap — enough to bridge boundary gaps without bloating your vector index with near-duplicate chunks that can obscure genuine semantic differences during retrieval. Too little overlap and you risk missing information that straddles chunk boundaries; too much and you create so many near-identical chunks that cosine similarity scores become unreliable, causing the retriever to return redundant rather than complementary context. The specific value also interacts with your typical query length: short keyword-style queries benefit from tightly-scoped chunks with minimal overlap, while long discursive queries benefit from more overlap that preserves narrative coherence across boundaries. Use these guidelines as starting points and measure retrieval F1 on a held-out query set to find the value that actually works for your corpus:
- 256-token chunks: 25-50 token overlap
- 512-token chunks: 50-100 token overlap
- 1024-token chunks: 100-200 token overlap
For highly interconnected content where ideas build on each other across multiple consecutive paragraphs — such as legal contracts, technical specifications, or academic methodology sections — consider larger overlaps of 20–25% to ensure that the contextual foundation of each argument is always co-present with its conclusions. For more independent content where each paragraph or section is largely self-contained — FAQ entries, glossary definitions, news articles — a smaller 10% overlap typically provides the boundary protection you need without excessive redundancy. When in doubt, start with 15% overlap and adjust based on measured retrieval performance on a representative set of queries. It’s also worth noting that overlap interacts with chunk size: a 100-token overlap on 500-token chunks represents 20% redundancy and is generally beneficial, while the same 100-token overlap on 200-token chunks represents 50% redundancy and may produce misleadingly similar chunks that hurt retrieval precision.
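To compare overlap settings across chunk sizes, it can help to compute the redundancy ratio and the approximate index growth it implies. This is a small sketch under the assumption of fixed-size chunks; `overlap_stats` is a hypothetical helper, and the growth figure ignores edge effects at the end of each document.

```python
def overlap_stats(chunk_size: int, overlap: int) -> dict[str, float]:
    """Redundancy ratio and approximate index growth for a given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    ratio = overlap / chunk_size
    # Each chunk advances by (chunk_size - overlap), so the number of chunks
    # grows by roughly 1 / (1 - ratio) compared with no overlap.
    growth = 1 / (1 - ratio)
    return {"redundancy": ratio, "index_growth": growth}

print(overlap_stats(500, 100))  # 20% redundancy, index about 1.25x larger
print(overlap_stats(200, 100))  # 50% redundancy, index about 2x larger
```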
Semantic Chunking Techniques
Semantic chunking elevates the chunking process from simple text manipulation to genuine content understanding: rather than splitting documents at syntactic markers or fixed character counts, it identifies where the actual meaning and topical focus of the text changes. This approach produces chunks that align with the natural topic structure of the document, ensuring each chunk represents a coherent, focused idea rather than an arbitrary text window — which in turn produces embeddings that occupy a well-defined region of semantic space rather than a blurred average of multiple unrelated topics. The tradeoff is higher preprocessing cost: semantic chunking requires running the embedding model over all sentences during the chunking phase itself, not just during the final embedding step, roughly doubling the embedding compute required. This additional cost is typically justified for high-value document collections where retrieval quality is critical, but may not be worth it for frequently-updated, low-stakes content.
Topic-Based Segmentation
The most effective semantic chunking technique uses sentence embeddings to detect topic transitions by measuring the semantic similarity between adjacent sentence pairs. When two consecutive sentences are discussing the same topic, their embeddings will be geometrically close in the embedding space — high cosine similarity — and they should stay in the same chunk. When the text pivots to a new topic, the sentence embeddings make a corresponding jump in semantic space — low cosine similarity — and that drop signals a natural chunk boundary. The threshold for this similarity drop is the key tunable parameter: lower thresholds create fewer, larger chunks that are more contextually complete but less focused; higher thresholds create more, smaller chunks that are more semantically precise but may lose important surrounding context. A good starting point is 0.4–0.6 cosine similarity, which typically balances topic precision with contextual completeness for technical document corpora.
import re
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticChunker:
    """Chunk text based on semantic similarity."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def chunk(
        self,
        text: str,
        similarity_threshold: float = 0.5,
        min_chunk_size: int = 100,
        max_chunk_size: int = 1000
    ) -> List[str]:
        """
        Split text into semantically coherent chunks.

        Args:
            text: Input text to chunk
            similarity_threshold: Similarity below this triggers a split
            min_chunk_size: Minimum characters per chunk
            max_chunk_size: Maximum characters per chunk

        Returns:
            List of text chunks
        """
        # Split into sentences
        sentences = self._split_sentences(text)
        if len(sentences) < 2:
            return [text] if text.strip() else []
        # Embed every sentence, then locate topic shifts between neighbors
        embeddings = self.model.encode(sentences)
        breakpoints = self._find_breakpoints(
            sentences, embeddings, similarity_threshold, min_chunk_size
        )
        return self._create_chunks(sentences, breakpoints, max_chunk_size)

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences."""
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]

    def _find_breakpoints(
        self,
        sentences: List[str],
        embeddings: np.ndarray,
        threshold: float,
        min_size: int
    ) -> List[int]:
        """Find semantic breakpoints between sentences."""
        breakpoints = []
        current_size = len(sentences[0])
        for i in range(len(embeddings) - 1):
            # Calculate cosine similarity
            sim = np.dot(embeddings[i], embeddings[i + 1]) / (
                np.linalg.norm(embeddings[i]) *
                np.linalg.norm(embeddings[i + 1])
            )
            # Create breakpoint if similarity is low and we have enough content
            if sim < threshold and current_size >= min_size:
                breakpoints.append(i + 1)
                # The next sentence starts the new chunk
                current_size = len(sentences[i + 1])
            else:
                current_size += len(sentences[i + 1])
        return breakpoints

    def _create_chunks(
        self,
        sentences: List[str],
        breakpoints: List[int],
        max_size: int
    ) -> List[str]:
        """Create chunks from sentences and breakpoints."""
        chunks = []
        start = 0
        for bp in breakpoints:
            chunk = " ".join(sentences[start:bp])
            # Handle oversized chunks
            if len(chunk) > max_size:
                # Split into smaller pieces
                sub_chunks = self._split_oversized(chunk, max_size)
                chunks.extend(sub_chunks)
            else:
                chunks.append(chunk)
            start = bp
        # Don't forget the last chunk
        if start < len(sentences):
            final_chunk = " ".join(sentences[start:])
            if len(final_chunk) > max_size:
                chunks.extend(self._split_oversized(final_chunk, max_size))
            else:
                chunks.append(final_chunk)
        return chunks

    def _split_oversized(self, text: str, max_size: int) -> List[str]:
        """Split an oversized chunk into smaller pieces."""
        # Relies on chunking_with_overlap from the overlap section
        return chunking_with_overlap(text, max_size, max_size // 5)

# Usage (document_text is the raw text of the document to index)
chunker = SemanticChunker()
chunks = chunker.chunk(document_text, similarity_threshold=0.4)
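To get a feel for how the similarity threshold changes the number of chunks, you can experiment on precomputed similarity scores without loading a model. The values in `sims` below are made up for illustration, and `split_points` is a simplified helper that ignores the minimum chunk size check used by the class above.

```python
def split_points(similarities: list[float], threshold: float) -> list[int]:
    """Sentence indices where a new chunk starts (similarity to the previous sentence is below threshold)."""
    return [i + 1 for i, sim in enumerate(similarities) if sim < threshold]

# Hypothetical adjacent-sentence similarities for a six-sentence passage.
sims = [0.82, 0.45, 0.77, 0.31, 0.58]
for threshold in (0.4, 0.5, 0.6):
    points = split_points(sims, threshold)
    print(threshold, points, f"{len(points) + 1} chunks")
```

Raising the threshold from 0.4 to 0.6 turns two chunks into four on this toy input, which matches the tradeoff described above: higher thresholds yield more, smaller chunks.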
Metadata Enrichment for Better Retrieval
Raw text chunks alone are often insufficient for effective retrieval in production RAG systems, because real-world retrieval needs go beyond simple semantic similarity — users want to filter by source document, restrict results to a specific time period, limit context to a particular category of documentation, or prioritize chunks from authoritative sources over general discussions. Enriching chunks with metadata dramatically improves retrieval by enabling these filtered searches, by providing the language model with provenance context that helps it calibrate confidence and cite sources accurately, and by enabling post-retrieval re-ranking based on structured attributes like recency or document authority. Systems without robust metadata frequently suffer from a class of retrieval error where semantically similar but contextually inappropriate chunks are returned — for example, a support chatbot that surfaces documentation for an older product version when the user is asking about the current release, simply because no version metadata was captured to enable filtering. The metadata you attach during the chunking phase is essentially the indexing strategy for your entire RAG system — the fields you capture here determine what filtering dimensions will be available forever, because adding a new filtering dimension after initial ingestion typically requires reprocessing the entire document collection. Consider your metadata schema from a user-experience perspective first: what attributes users would naturally want to scope their queries to, and what provenance fields the LLM needs to generate well-grounded, citable responses. It’s much easier to over-capture metadata at ingestion time than to go back and reprocess thousands of documents when a new filtering requirement emerges in production.
Design the metadata schema before you write a single line of chunking code, driven by how users will actually query the system and what provenance context the LLM needs to cite sources accurately. If you don’t capture document category during ingestion, you can never filter by category during retrieval, which means users querying within a specific product line or regulatory domain will get results polluted by irrelevant content. Treat each field as a deliberate design decision. Essential metadata to include with each chunk:
- Source document: Filename, URL, or document ID
- Page/section number: Location within the document
- Document metadata: Author, date, category, version
- Chunk position: First/middle/last, chunk index
- Parent context: Section heading, chapter title
- Content type: Paragraph, table, list, code block
from dataclasses import dataclass, field
from typing import Optional, Dict, Any
from datetime import datetime

@dataclass
class EnrichedChunk:
    """A text chunk enriched with metadata for RAG systems."""
    # Core content
    text: str
    # Source information
    source_document: str
    page_number: Optional[int] = None
    section_title: Optional[str] = None
    # Positioning
    chunk_index: int = 0
    total_chunks: int = 1
    is_first: bool = True
    is_last: bool = True
    # Document metadata
    document_type: str = "unknown"
    document_date: Optional[datetime] = None
    document_author: Optional[str] = None
    # Content type
    content_type: str = "text"  # text, table, code, list
    # Custom metadata
    extra_metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Convert to dictionary for storage."""
        return {
            "text": self.text,
            "source_document": self.source_document,
            "page_number": self.page_number,
            "section_title": self.section_title,
            "chunk_index": self.chunk_index,
            "total_chunks": self.total_chunks,
            "is_first": self.is_first,
            "is_last": self.is_last,
            "document_type": self.document_type,
            "document_date": self.document_date.isoformat() if self.document_date else None,
            "document_author": self.document_author,
            "content_type": self.content_type,
            **self.extra_metadata
        }
class MetadataEnricher:
    """Enrich text chunks with metadata."""

    def __init__(self, document_path: str, document_metadata: Optional[dict] = None):
        self.document_path = document_path
        self.document_metadata = document_metadata or {}

    def enrich_chunks(
        self,
        chunks: list[str],
        page_numbers: Optional[list[int]] = None,
        section_titles: Optional[list[str]] = None
    ) -> list[EnrichedChunk]:
        """Add metadata to a list of text chunks.

        page_numbers and section_titles, when given, need one entry per chunk.
        """
        enriched = []
        total = len(chunks)
        for i, text in enumerate(chunks):
            chunk = EnrichedChunk(
                text=text,
                source_document=self.document_path,
                page_number=page_numbers[i] if page_numbers else None,
                section_title=section_titles[i] if section_titles else None,
                chunk_index=i,
                total_chunks=total,
                is_first=(i == 0),
                is_last=(i == total - 1),
                document_type=self.document_metadata.get("type", "unknown"),
                document_date=self.document_metadata.get("date"),
                document_author=self.document_metadata.get("author"),
            )
            # Detect content type
            chunk.content_type = self._detect_content_type(text)
            enriched.append(chunk)
        return enriched

    def _detect_content_type(self, text: str) -> str:
        """Detect the type of content in the chunk."""
        # Simple heuristics - extend based on your needs
        lines = text.strip().split('\n')
        # Check for code (substring match, so prose containing e.g. "import " also matches)
        code_indicators = ['def ', 'class ', 'import ', 'function ', 'const ', 'var ']
        if any(indicator in text for indicator in code_indicators):
            return "code"
        # Check for lists: more than half of the lines start with a list marker
        list_starters = ['- ', '* ', '• ', '1. ', '1) ']
        list_lines = sum(1 for line in lines if any(line.strip().startswith(s) for s in list_starters))
        if list_lines > len(lines) * 0.5:
            return "list"
        # Check for table-like structure
        if '|' in text and text.count('|') > 4:
            return "table"
        return "text"


# Usage example
# `chunks` is the list of text chunks produced earlier; it must contain five
# items here to match the five page numbers and section titles below.
enricher = MetadataEnricher(
    "technical_spec.pdf",
    {"type": "specification", "author": "Engineering Team", "date": datetime(2024, 1, 15)}
)
enriched_chunks = enricher.enrich_chunks(
    chunks,
    page_numbers=[1, 1, 2, 2, 3],
    section_titles=["Introduction", "Introduction", "Requirements", "Requirements", "Implementation"]
)
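Once chunks carry metadata, the retrieval layer can narrow the candidate set before ranking by similarity. The following is a minimal sketch of that idea, assuming chunks are stored as plain dictionaries in the shape produced by to_dict(). The helper name filter_chunks and the sample records are hypothetical, and a real vector database would apply the same kind of filter through its own query API.

```python
from typing import Any, Dict, List


def filter_chunks(records: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Return the records whose metadata matches every key/value pair in criteria."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical records in the shape produced by EnrichedChunk.to_dict()
records = [
    {"text": "Scope of the spec.", "section_title": "Introduction",
     "content_type": "text", "page_number": 1},
    {"text": "- Must support TLS", "section_title": "Requirements",
     "content_type": "list", "page_number": 2},
    {"text": "def connect(): ...", "section_title": "Implementation",
     "content_type": "code", "page_number": 3},
]

# Restrict the search space to one section, or to one content type
requirements = filter_chunks(records, section_title="Requirements")
code_only = filter_chunks(records, content_type="code")
print(len(requirements), len(code_only))
```

Filtering before similarity search keeps irrelevant sections out of the candidate pool, which is usually cheaper than retrieving broadly and discarding results afterwards.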
Complete Implementation
Let’s bring all the techniques covered in this article together into a production-ready chunking pipeline that can be dropped directly into a real RAG system. This implementation combines semantic boundary detection (as in the SemanticChunker class), configurable overlap to protect against boundary losses, and metadata enrichment (as in MetadataEnricher) to attach the provenance and context information needed for filtered retrieval.
The design uses dataclasses for clear, typed interfaces and separates configuration from logic so you can tune parameters without touching implementation code. The ChunkConfig dataclass intentionally covers not just chunk size and overlap but also semantic splitting thresholds and embedding model selection, because these parameters interact and should be tuned together. In practice you would typically serialize this configuration to a YAML or JSON file so that different document types can have their own profiles: technical datasheets might use one configuration with larger chunks and a lower semantic threshold, while FAQ documents use another with smaller chunks and no overlap, all processed by the same pipeline code.
A key design principle throughout is that each processed chunk is fully self-contained: it carries not just its text content but everything needed to display it, filter it, link it back to its source document, and explain to the language model where the information came from. This makes the output straightforward to load into most vector databases and RAG orchestration frameworks, and means you can swap retrieval backends or prompt strategies without touching the chunking layer.
"""
Production-ready document chunking for RAG systems.
Combines semantic splitting, overlap, and metadata enrichment.
"""
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import re

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class ChunkConfig:
    """Configuration for the chunking pipeline."""
    # Size parameters (all in characters)
    target_chunk_size: int = 512  # Target size in characters
    min_chunk_size: int = 100
    max_chunk_size: int = 1500
    # Overlap
    overlap_size: int = 50
    # Semantic settings
    use_semantic_splitting: bool = True
    similarity_threshold: float = 0.5
    # Model
    embedding_model: str = "all-MiniLM-L6-v2"


@dataclass
class ProcessedChunk:
    """A fully processed chunk ready for embedding."""
    text: str
    metadata: Dict[str, Any]
    char_count: int
    word_count: int

    @classmethod
    def create(cls, text: str, **metadata) -> "ProcessedChunk":
        # Strip first so that the counts describe the stored text
        text = text.strip()
        return cls(
            text=text,
            metadata=metadata,
            char_count=len(text),
            word_count=len(text.split())
        )
class DocumentChunker:
    """
    Advanced document chunking with semantic awareness.

    Features:
    - Semantic boundary detection
    - Configurable overlap
    - Metadata enrichment
    - Multiple content type support
    """

    def __init__(self, config: Optional[ChunkConfig] = None):
        self.config = config or ChunkConfig()
        # Only load the embedding model when semantic splitting is enabled
        if self.config.use_semantic_splitting:
            self.model = SentenceTransformer(self.config.embedding_model)
        else:
            self.model = None

    def chunk_document(
        self,
        text: str,
        source: str,
        base_metadata: Optional[Dict[str, Any]] = None
    ) -> List[ProcessedChunk]:
        """
        Chunk a document with full processing.

        Args:
            text: Document text to chunk
            source: Source identifier (filename, URL, etc.)
            base_metadata: Metadata to include with all chunks

        Returns:
            List of ProcessedChunk objects ready for embedding
        """
        base_metadata = base_metadata or {}
        # Step 1: Pre-process text
        text = self._preprocess(text)
        # Step 2: Split into initial chunks
        if self.config.use_semantic_splitting:
            raw_chunks = self._semantic_split(text)
        else:
            raw_chunks = self._recursive_split(text)
        # Step 3: Apply overlap
        if self.config.overlap_size > 0:
            raw_chunks = self._apply_overlap(raw_chunks)
        # Step 4: Enrich with metadata
        processed = []
        for i, chunk_text in enumerate(raw_chunks):
            metadata = {
                **base_metadata,
                "source": source,
                "chunk_index": i,
                "total_chunks": len(raw_chunks),
                "position": self._get_position(i, len(raw_chunks)),
            }
            chunk = ProcessedChunk.create(chunk_text, **metadata)
            processed.append(chunk)
        return processed

    def _preprocess(self, text: str) -> str:
        """Clean and normalize text."""
        # Collapse runs of spaces and tabs; newlines are kept because the
        # recursive splitter relies on them as paragraph separators
        text = re.sub(r'[ \t]+', ' ', text)
        # Trim spaces left around line breaks
        text = re.sub(r' ?\n ?', '\n', text)
        # Remove excessive newlines
        text = re.sub(r'\n{3,}', '\n\n', text)
        return text.strip()

    def _semantic_split(self, text: str) -> List[str]:
        """Split using semantic similarity."""
        sentences = self._split_sentences(text)
        if len(sentences) < 3:
            return [text]
        embeddings = self.model.encode(sentences)
        # Find semantic breaks
        chunks = []
        current_sentences = [sentences[0]]
        current_size = len(sentences[0])
        for i in range(len(sentences) - 1):
            # Calculate similarity with next sentence
            sim = self._cosine_similarity(embeddings[i], embeddings[i + 1])
            next_size = current_size + len(sentences[i + 1])
            # Split on a topic shift (once the chunk is large enough),
            # or when adding the next sentence would exceed the size limit
            should_split = (
                (sim < self.config.similarity_threshold and
                 current_size >= self.config.min_chunk_size) or
                next_size > self.config.max_chunk_size
            )
            if should_split:
                chunks.append(" ".join(current_sentences))
                current_sentences = [sentences[i + 1]]
                current_size = len(sentences[i + 1])
            else:
                current_sentences.append(sentences[i + 1])
                current_size = next_size
        # Add final chunk
        if current_sentences:
            chunks.append(" ".join(current_sentences))
        return chunks

    def _recursive_split(self, text: str) -> List[str]:
        """Split using recursive character splitting."""
        separators = ["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " "]
        return self._split_recursive(text, separators)

    def _split_recursive(self, text: str, separators: List[str]) -> List[str]:
        """Recursively split on separators."""
        if len(text) <= self.config.target_chunk_size:
            return [text] if text.strip() else []
        for sep in separators:
            if sep in text:
                parts = text.split(sep)
                chunks = []
                current = ""
                for part in parts:
                    test = (current + sep + part) if current else part
                    if len(test) <= self.config.target_chunk_size:
                        # Part still fits: keep growing the current chunk
                        current = test
                    else:
                        if current:
                            chunks.append(current)
                        if len(part) > self.config.target_chunk_size:
                            # Recursively split large part
                            sub_chunks = self._split_recursive(part, separators[separators.index(sep)+1:])
                            chunks.extend(sub_chunks)
                            current = ""
                        else:
                            current = part
                if current:
                    chunks.append(current)
                return chunks
        # Fallback: character split
        return [text[i:i+self.config.target_chunk_size]
                for i in range(0, len(text), self.config.target_chunk_size)]

    def _apply_overlap(self, chunks: List[str]) -> List[str]:
        """Add overlap between consecutive chunks."""
        if len(chunks) <= 1:
            return chunks
        overlapped = [chunks[0]]
        for previous, current in zip(chunks, chunks[1:]):
            # Prefix each chunk with the last overlap_size characters
            # of the chunk before it
            tail = previous[-self.config.overlap_size:]
            overlapped.append(tail + " " + current)
        return overlapped

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences."""
        # Split on whitespace that follows sentence-ending punctuation
        pattern = r'(?<=[.!?])\s+'
        sentences = re.split(pattern, text)
        return [s.strip() for s in sentences if s.strip()]

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Calculate cosine similarity between vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def _get_position(self, index: int, total: int) -> str:
        """Determine chunk position label."""
        if total == 1:
            return "only"
        if index == 0:
            return "first"
        if index == total - 1:
            return "last"
        return "middle"


# Example usage
if __name__ == "__main__":
    config = ChunkConfig(
        target_chunk_size=500,
        overlap_size=50,
        use_semantic_splitting=True,
        similarity_threshold=0.45
    )
    chunker = DocumentChunker(config)

    # Sample document
    document = """
    Introduction to Machine Learning

    Machine learning is a subset of artificial intelligence that enables
    systems to learn from data. Unlike traditional programming where rules
    are explicitly coded, ML systems discover patterns automatically.

    Types of Machine Learning

    Supervised learning uses labeled data to train models. The algorithm
    learns to map inputs to known outputs. Common applications include
    image classification and spam detection.

    Unsupervised learning finds patterns in unlabeled data. Clustering
    and dimensionality reduction are key techniques. These methods help
    discover hidden structures in datasets.
    """

    chunks = chunker.chunk_document(
        document,
        source="ml_intro.pdf",
        base_metadata={"category": "education", "topic": "machine learning"}
    )

    for chunk in chunks:
        print(f"--- Chunk {chunk.metadata['chunk_index'] + 1}/{chunk.metadata['total_chunks']} ---")
        print(f"Position: {chunk.metadata['position']}")
        print(f"Words: {chunk.word_count}")
        print(f"Text: {chunk.text[:100]}...")
        print()
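The per-document-type profiles mentioned earlier can be kept in a JSON file and turned into ChunkConfig objects at load time. The sketch below shows one possible way to do that. The load_profiles helper and the profile values are hypothetical, and ChunkConfig is repeated only so the example runs on its own.

```python
import json
from dataclasses import dataclass, fields
from typing import Dict


@dataclass
class ChunkConfig:
    """Same fields as the ChunkConfig above, repeated so this sketch is self-contained."""
    target_chunk_size: int = 512
    min_chunk_size: int = 100
    max_chunk_size: int = 1500
    overlap_size: int = 50
    use_semantic_splitting: bool = True
    similarity_threshold: float = 0.5
    embedding_model: str = "all-MiniLM-L6-v2"


def load_profiles(raw_json: str) -> Dict[str, ChunkConfig]:
    """Parse a JSON object that maps profile names to ChunkConfig overrides."""
    allowed = {f.name for f in fields(ChunkConfig)}
    profiles = {}
    for name, overrides in json.loads(raw_json).items():
        # Reject misspelled keys instead of silently ignoring them
        unknown = set(overrides) - allowed
        if unknown:
            raise ValueError(f"Unknown settings in profile {name!r}: {sorted(unknown)}")
        profiles[name] = ChunkConfig(**overrides)
    return profiles


# Hypothetical profile values; tune them against your own documents
PROFILES_JSON = """
{
  "datasheet": {"target_chunk_size": 1000, "overlap_size": 100, "similarity_threshold": 0.4},
  "faq": {"target_chunk_size": 300, "overlap_size": 0, "use_semantic_splitting": false}
}
"""

profiles = load_profiles(PROFILES_JSON)
print(profiles["faq"])
```

Settings that a profile does not mention fall back to the dataclass defaults, so each profile only needs to list what differs for that document type.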
Evaluating Chunking Quality
How do you know if your chunking strategy is working? Evaluation is crucial for iterating toward optimal results, and without a systematic evaluation process you’re essentially tuning parameters in the dark. Measuring chunking quality before deployment can save enormous amounts of rework: a chunking strategy that looks good on paper but performs poorly in practice will degrade every retrieval interaction your users have, and the failures are often subtle — not complete misses but slightly-wrong context that leads the LLM to produce inaccurate or incomplete answers. Here are the key metrics and methodologies for objectively assessing chunk quality:
Metrics to Track
- Retrieval Precision: What percentage of retrieved chunks are actually relevant?
- Retrieval Recall: What percentage of relevant information was retrieved?
- Chunk Coherence: Do chunks represent complete, coherent ideas?
- Size Distribution: Are chunks reasonably uniform in size?
- Boundary Quality: Do chunks split at natural boundaries?
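Boundary quality can be approximated without any labeled data. The sketch below uses a deliberately crude proxy, assumed here for illustration only: the fraction of chunks that end on sentence-final punctuation. The function name boundary_quality and the sample chunks are hypothetical.

```python
from typing import List


def boundary_quality(chunks: List[str]) -> float:
    """Fraction of chunks that end with sentence-final punctuation.

    Trailing whitespace, closing quotes, and closing brackets are ignored.
    """
    if not chunks:
        return 0.0
    clean = 0
    for chunk in chunks:
        tail = chunk.rstrip().rstrip("\"')]”’")
        if tail.endswith((".", "!", "?")):
            clean += 1
    return clean / len(chunks)


sample = [
    "Supervised learning uses labeled data.",
    "The algorithm learns to map inputs to",   # cut mid-sentence
    "known outputs. Is that enough?",
    'She said "stop."',
]
print(boundary_quality(sample))  # 0.75
```

A low score suggests the splitter is cutting sentences in half. A high score does not prove the chunks are coherent, so treat it as a quick sanity check alongside the retrieval metrics.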
def evaluate_chunking(chunks: List[str]) -> dict:
    """Calculate chunk-size distribution metrics."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    sizes = [len(c) for c in chunks]
    mean_size = float(np.mean(sizes))
    std_size = float(np.std(sizes))
    return {
        "total_chunks": len(chunks),
        "avg_size": mean_size,
        "min_size": min(sizes),
        "max_size": max(sizes),
        "size_std": std_size,
        # Coefficient of variation (std / mean)
        "size_cv": std_size / mean_size if mean_size else 0.0,
    }
# Lower size_cv indicates more uniform chunks
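Size statistics say nothing about whether the right chunks are retrieved. Retrieval precision and recall need a small labeled set: for each test query, the chunk IDs a human judged relevant. The following is a minimal sketch under that assumption. The function name precision_recall_at_k and the chunk IDs are hypothetical, and retrieved IDs are assumed to be unique.

```python
from typing import Dict, List, Set


def precision_recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> Dict[str, float]:
    """Precision and recall over the top-k retrieved chunk IDs for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical IDs: ranked retriever output vs. human relevance labels
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
relevant_ids = {"c2", "c4", "c8"}

scores = precision_recall_at_k(retrieved_ids, relevant_ids, k=4)
print(scores)
```

Averaging these scores over a fixed set of real queries gives a number you can compare across chunking configurations.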
As a starting point, a few practical guidelines:
- Start with semantic chunking if you have the compute resources
- Use 10-20% overlap to preserve boundary context
- Align chunk size with your embedding model’s optimal input length
- Always include source metadata for traceability
- Test different configurations with your actual queries
- Monitor retrieval quality and iterate on your strategy
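To make the overlap guideline concrete, the sketch below derives the overlap from a fraction of the chunk size. It uses plain fixed-size character windows, not the semantic splitter from the implementation above, so that the arithmetic is easy to follow. The function name sliding_window_chunks and its parameters are hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap_fraction: float = 0.15) -> List[str]:
    """Split text into fixed-size character windows that overlap by a fraction of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    overlap = int(chunk_size * overlap_fraction)
    # Each window starts this many characters after the previous one
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


windows = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap_fraction=0.2)
print(windows)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

With a chunk size of 10 and 20% overlap, each window repeats the last two characters of the previous one. The same ratio applied to the 500-character configuration used earlier gives overlaps of 50 to 100 characters.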
Conclusion
Effective document chunking is the often-invisible foundation that determines whether a RAG system actually delivers on its promise of contextually accurate, well-grounded responses. By understanding the fundamental trade-offs between different splitting approaches and implementing semantic-aware splitting with proper overlap and metadata enrichment, you can substantially improve both the precision of chunk retrieval and the quality of the context provided to the language model. The key insight is that no single chunking strategy is optimal for all document types and query patterns — context-aware strategies that combine structural signals (headings, paragraphs) with semantic signals (topic shifts detected via embeddings) consistently outperform one-size-fits-all approaches. Investing time in your chunking pipeline and measuring its performance with real queries against real documents is one of the highest-leverage improvements you can make to an existing RAG system.
The key takeaways from this article distill into actionable principles that should inform every chunking design decision you make. These are not arbitrary rules of thumb — they are conclusions grounded in the mechanical realities of how embedding models represent text, how approximate nearest-neighbor search operates under different input distributions, and how language models consume and reason over retrieved context. Each principle maps directly to a failure mode: ignoring chunk size leads to diluted or fragmented embeddings, skipping overlap causes boundary losses, omitting metadata removes every filtering capability, and neglecting evaluation means you’re tuning parameters blindly. Before moving on to the embedding stage of your pipeline, make sure each of these points is genuinely reflected in your implementation:
- Chunk size significantly impacts both embedding quality and retrieval precision
- Semantic chunking produces more coherent chunks than naive splitting
- Overlap prevents loss of information at chunk boundaries
- Metadata enrichment enables filtered search and provides essential context
- Continuous evaluation and iteration are essential for optimization
In the next article we’ll explore how to generate high-quality embeddings for your chunks using local Sentence Transformer models, enabling a fully offline RAG pipeline with no API dependencies and no per-token cost. The chunking pipeline built in this article produces chunks that are ready for embedding, and the two stages are tightly coupled: the right chunk size for all-MiniLM-L6-v2 is different from the right size for bge-large-en-v1.5, so the choice of embedding model is a consequential architectural decision that should be made before you finalize your chunking strategy. A mismatch between chunk size and embedding model capacity, in either direction, degrades retrieval quality in ways that downstream components cannot compensate for. We’ll cover how to benchmark embedding models against your specific document corpus, what retrieval-focused metrics matter beyond leaderboard scores, and how quantization can reduce memory footprint without meaningfully impacting retrieval accuracy. By understanding both layers together, you’ll be able to optimize the text representation pipeline as an integrated system rather than as two independent stages tuned in isolation.