Describing PDF Charts with Vision Language Models

Key Topics: Vision Language Model chart description, VLM PDF extraction, GPT-4 Vision charts, LLaVA image analysis, multimodal RAG, figure-to-text conversion, chart OCR alternative, visual document understanding, image captioning AI, MiniCPM-V local vision model

Charts and figures in PDF documents contain invaluable information that traditional text extraction completely misses. A bar chart showing revenue growth, a pie chart of market share, or a technical diagram—all become invisible to standard RAG pipelines. Vision Language Models (VLMs) solve this problem by “seeing” images and generating rich text descriptions that can be indexed and searched alongside regular document text.

This comprehensive guide covers the entire pipeline: extracting figures from PDFs, processing them with VLMs (both cloud and local), crafting effective prompts for chart understanding, and integrating visual content into your RAG system. You’ll learn to work with models from GPT-4 Vision to local alternatives like LLaVA and MiniCPM-V. Along the way, we examine the architectural differences between these models—how LLaVA fuses a CLIP vision encoder with a language backbone, how Qwen-VL extends a large language model with a dedicated visual token pathway, and how proprietary systems like GPT-4o achieve exceptional reasoning about chart structure through massive multimodal pretraining.

We also tackle the practical challenges that arise in real-world document pipelines: handling low-resolution screenshots, dealing with overlapping figure regions, constructing prompts that reliably produce structured output, and batch-processing hundreds of charts without saturating API rate limits. By the end of this article you will have a working, production-ready system that can turn any PDF full of bar charts, scatter plots, and architectural diagrams into a fully searchable knowledge base.

Why VLMs for Document Understanding?

Technical documents, research papers, and business reports rely heavily on visual content to convey information. Consider what gets lost when we only extract text: charts that encode months of financial data, architectural diagrams that describe system topology at a glance, and scatter plots that reveal subtle correlations between variables. Even a competent OCR pass will capture the axis labels of a bar chart as isolated strings with no indication that they belong to a visual structure—the relationship between label, bar height, and comparison across categories is completely absent.

This information gap is especially damaging in RAG systems, because a user asking “which product line showed the highest growth last quarter?” may find their answer buried inside a figure, not in the surrounding prose. The only way to bridge this gap without manual annotation is to give the retrieval pipeline a pair of eyes—and that is precisely what Vision Language Models provide.

Figure 1: Complete pipeline for extracting visual content from PDFs and converting to searchable text using Vision Language Models.

Information Hidden in Visuals

  • Charts and graphs: Trends, comparisons, distributions, correlations
  • Technical diagrams: Architecture, workflows, system designs
  • Tables as images: Complex layouts that defy text extraction
  • Screenshots: UI mockups, code examples, command outputs
  • Infographics: Multi-element visual summaries

VLMs Bridge the Gap

Vision Language Models combine image understanding with language generation, making them uniquely suited to the challenge of chart comprehension. Architecturally, most modern VLMs follow a dual-encoder design: a vision encoder (typically based on CLIP or a similar contrastive model) converts the image into a sequence of patch embeddings, which are then projected into the same token space as text and fed into a large language model. Models like LLaVA 1.6 and Qwen-VL extend this idea with higher-resolution image tiling, enabling them to read small axis labels and data point annotations that earlier approaches routinely missed.
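
To give a feel for why higher-resolution tiling matters, the following rough sketch counts how many patch embeddings an image turns into. The patch size of 14 and tile size of 336 are assumptions (they match a CLIP ViT-L/14 encoder at 336 px), and the function names are hypothetical. Real models also add separator tokens, padding, or resampling, so treat the numbers as an order-of-magnitude guide rather than exact token counts.

```python
import math


def patch_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Patch embeddings a ViT-style encoder produces for a single image."""
    return (width // patch_size) * (height // patch_size)


def tiled_token_count(
    width: int,
    height: int,
    tile: int = 336,
    patch_size: int = 14,
) -> int:
    """
    Approximate token count when an image is split into tile x tile crops
    plus one downscaled overview of the whole image (a simplification).
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    per_tile = patch_token_count(tile, tile, patch_size)
    return (rows * cols + 1) * per_tile


if __name__ == "__main__":
    print(patch_token_count(336, 336))   # single low-resolution view
    print(tiled_token_count(672, 672))   # 2x2 tiles plus overview
```

The takeaway is that a tiled encoder spends several times more visual tokens on the same chart, which is what lets it resolve small axis labels, at the cost of a longer context for the language model.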

GPT-4o and Gemini Vision go further by integrating vision natively throughout the architecture, which gives them stronger multi-step reasoning: for example, inferring a compound annual growth rate from a line chart even when the exact values are not labeled. Choosing the right model involves balancing accuracy, throughput, data-privacy requirements, and deployment cost, which is why this guide covers both cloud and locally hosted options. Whichever model you choose, a VLM can:

  • Identify chart types (bar, line, pie, scatter, etc.)
  • Extract data values and labels
  • Describe trends and patterns
  • Explain relationships between elements
  • Generate searchable summaries
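
As an illustration of the last point, here is a minimal sketch of how a generated description might be packaged as an indexable text chunk with metadata. The `FigureChunk` class, the `build_figure_chunk` function, and the metadata keys are hypothetical names chosen for this example; adapt them to whatever your vector store or indexer expects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FigureChunk:
    """Text plus metadata, ready to be embedded and indexed."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def build_figure_chunk(
    description: str,
    source: str,
    page_number: int,
    caption: Optional[str] = None,
    figure_number: Optional[str] = None,
) -> FigureChunk:
    """Combine the VLM description with the author's caption, if any."""
    label = figure_number or "Figure"
    parts = [f"[{label}, page {page_number} of {source}]"]
    if caption:
        parts.append(f"Caption: {caption}")
    parts.append(description.strip())
    return FigureChunk(
        text="\n".join(parts),
        metadata={
            "source": source,
            "page": page_number,
            "type": "figure",
            "figure_number": figure_number,
        },
    )


if __name__ == "__main__":
    chunk = build_figure_chunk(
        "Bar chart comparing revenue across four regions.",
        source="report.pdf",
        page_number=3,
        caption="Revenue by region",
        figure_number="Figure 2",
    )
    print(chunk.text)
```

Keeping the page number and source in both the text and the metadata lets a retriever cite the figure and lets you filter figure chunks separately from prose chunks.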

Extracting Figures from PDFs

Before we can describe figures, we need to extract them from PDF documents. PyMuPDF provides powerful tools for this task, able to enumerate every embedded image XObject on a page, retrieve its raw bytes, and map it back to a precise bounding rectangle in page coordinates. However, not every figure in a PDF is stored as a raster image—many charts generated by tools like Matplotlib, Excel, or Adobe Illustrator are encoded as vector graphics (a stream of drawing commands), and these require a different extraction strategy: rendering the relevant page region at high DPI into a pixel buffer.

A robust extractor therefore needs to handle both cases, apply a minimum-size filter to ignore decorative icons and watermarks, and optionally search for nearby caption text so the description can be enriched with the author’s own label. Getting this extraction step right has a direct impact on VLM quality—sending a blurry, low-resolution crop to a vision model will produce a vague description no matter how good the prompt is.
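
Since resolution matters this much, it helps to estimate how large a rendered crop will be before rendering it. PDF coordinates are measured in points (72 per inch), so a region rendered at a given DPI is scaled by `dpi / 72`. The sketch below is plain arithmetic with hypothetical helper names. The pixel size a renderer actually produces can differ by a pixel because of rounding, and the 150 DPI floor is an assumption, not a recommendation from any library.

```python
import math
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points
POINTS_PER_INCH = 72


def rendered_size(bbox: BBox, dpi: int) -> Tuple[int, int]:
    """Approximate pixel size of a page region rendered at the given DPI."""
    x0, y0, x1, y1 = bbox
    width_px = round((x1 - x0) * dpi / POINTS_PER_INCH)
    height_px = round((y1 - y0) * dpi / POINTS_PER_INCH)
    return width_px, height_px


def dpi_for_min_long_side(
    bbox: BBox,
    min_long_side_px: int,
    floor_dpi: int = 150,
) -> int:
    """Smallest DPI (never below floor_dpi) whose longer side reaches the target."""
    x0, y0, x1, y1 = bbox
    long_side_pt = max(x1 - x0, y1 - y0)
    if long_side_pt <= 0:
        raise ValueError("bbox must have positive width and height")
    needed = math.ceil(min_long_side_px * POINTS_PER_INCH / long_side_pt)
    return max(floor_dpi, needed)


if __name__ == "__main__":
    region = (72, 100, 360, 280)  # a 4 x 2.5 inch figure
    print(rendered_size(region, 150))
    print(dpi_for_min_long_side(region, 1024))
```

A small figure on a dense page may need a noticeably higher DPI than the page default to reach a size the vision model can read comfortably.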

Figure 2: Figure extraction process: identify bounding boxes, crop regions, and export as high-quality images.
"""
Extract figures and images from PDF documents.
"""

import fitz  # PyMuPDF
from pathlib import Path
from dataclasses import dataclass
from typing import List, Optional, Tuple
import io
from PIL import Image


@dataclass
class ExtractedFigure:
    """An extracted figure from a PDF."""
    page_number: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    image_bytes: bytes
    image_format: str
    width: int
    height: int
    caption: Optional[str] = None
    figure_number: Optional[str] = None


class PDFFigureExtractor:
    """
    Extract figures and images from PDF documents.
    
    Supports:
    - Embedded images
    - Vector graphics rendered as images
    - Full-page rendering for complex layouts
    """
    
    def __init__(
        self,
        min_width: int = 100,
        min_height: int = 100,
        dpi: int = 150
    ):
        """
        Initialize the extractor.
        
        Args:
            min_width: Minimum image width to extract
            min_height: Minimum image height to extract
            dpi: Resolution for page rendering
        """
        self.min_width = min_width
        self.min_height = min_height
        self.dpi = dpi
    
    def extract_embedded_images(
        self,
        pdf_path: str
    ) -> List[ExtractedFigure]:
        """
        Extract embedded images from PDF.
        
        Args:
            pdf_path: Path to PDF file
            
        Returns:
            List of ExtractedFigure objects
        """
        figures = []
        doc = fitz.open(pdf_path)
        
        for page_num, page in enumerate(doc):
            image_list = page.get_images(full=True)
            
            for img_index, img_info in enumerate(image_list):
                xref = img_info[0]
                
                try:
                    base_image = doc.extract_image(xref)
                    image_bytes = base_image["image"]
                    image_format = base_image["ext"]
                    
                    # Get dimensions
                    img = Image.open(io.BytesIO(image_bytes))
                    width, height = img.size
                    
                    # Skip small images (likely icons/decorations)
                    if width < self.min_width or height < self.min_height:
                        continue
                    
                    # Get bounding box
                    bbox = self._get_image_bbox(page, xref)
                    
                    figures.append(ExtractedFigure(
                        page_number=page_num + 1,
                        bbox=bbox,
                        image_bytes=image_bytes,
                        image_format=image_format,
                        width=width,
                        height=height
                    ))
                    
                except Exception as e:
                    print(f"Error extracting image on page {page_num + 1}: {e}")
        
        doc.close()
        return figures
    
    def _get_image_bbox(
        self,
        page: fitz.Page,
        xref: int
    ) -> Tuple[float, float, float, float]:
        """Get bounding box for an image by xref."""
        for img in page.get_images():
            if img[0] == xref:
                # Get image rectangles
                rects = page.get_image_rects(img)
                if rects:
                    r = rects[0]
                    return (r.x0, r.y0, r.x1, r.y1)
        return (0, 0, 0, 0)
    
    def render_page_region(
        self,
        pdf_path: str,
        page_number: int,
        bbox: Tuple[float, float, float, float]
    ) -> bytes:
        """
        Render a specific region of a page as an image.
        
        Useful for extracting charts that are vector graphics
        rather than embedded images.
        
        Args:
            pdf_path: Path to PDF
            page_number: Page number (1-indexed)
            bbox: Region to render (x0, y0, x1, y1)
            
        Returns:
            PNG image bytes
        """
        doc = fitz.open(pdf_path)
        page = doc[page_number - 1]
        
        # Create clip rectangle
        clip = fitz.Rect(bbox)
        
        # Render at specified DPI
        mat = fitz.Matrix(self.dpi / 72, self.dpi / 72)
        pix = page.get_pixmap(matrix=mat, clip=clip)
        
        doc.close()
        return pix.tobytes("png")
    
    def detect_figure_regions(
        self,
        pdf_path: str,
        page_number: int
    ) -> List[Tuple[float, float, float, float]]:
        """
        Detect potential figure regions on a page.
        
        Uses heuristics to identify areas likely containing
        charts, diagrams, or other visual content.
        
        Args:
            pdf_path: Path to PDF
            page_number: Page number (1-indexed)
            
        Returns:
            List of bounding boxes
        """
        doc = fitz.open(pdf_path)
        page = doc[page_number - 1]
        
        regions = []
        
        # Method 1: Find drawing commands (vector graphics)
        drawings = page.get_drawings()
        if drawings:
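            # Merge all drawing rectangles into one bounding region.
            # This is a coarse heuristic, not real clustering: separate
            # vector figures on the same page end up in a single region.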
            all_rects = [fitz.Rect(d["rect"]) for d in drawings]
            if all_rects:
                combined = all_rects[0]
                for r in all_rects[1:]:
                    combined |= r  # Union of rectangles
                
                if combined.width > self.min_width and combined.height > self.min_height:
                    regions.append(tuple(combined))
        
        # Method 2: Find image XObjects
        for img_info in page.get_images():
            xref = img_info[0]
            bbox = self._get_image_bbox(page, xref)
            if bbox[2] - bbox[0] > self.min_width and bbox[3] - bbox[1] > self.min_height:
                regions.append(bbox)
        
        doc.close()
        return regions
    
    def extract_with_captions(
        self,
        pdf_path: str
    ) -> List[ExtractedFigure]:
        """
        Extract figures and attempt to find associated captions.
        
        Looks for text like "Figure 1:" or "Fig. 1" near images.
        """
        figures = self.extract_embedded_images(pdf_path)
        doc = fitz.open(pdf_path)
        
        for figure in figures:
            page = doc[figure.page_number - 1]
            
            # Search for caption below the image
            caption_rect = fitz.Rect(
                figure.bbox[0],
                figure.bbox[3],  # Start below image
                figure.bbox[2],
                figure.bbox[3] + 50  # Search 50 points below
            )
            
            caption_text = page.get_text("text", clip=caption_rect)
            
            # Look for figure label patterns such as "Figure 1:" or "Fig. 2a"
            import re
            fig_pattern = r'(?:Figure|Fig\.?)\s*(\d+[a-z]?)[:.]?\s*(.*)'
            match = re.search(fig_pattern, caption_text, re.IGNORECASE)
            
            if match:
                figure.figure_number = f"Figure {match.group(1)}"
                figure.caption = match.group(2).strip()
        
        doc.close()
        return figures


# Example usage
if __name__ == "__main__":
    extractor = PDFFigureExtractor(min_width=150, min_height=150)
    
    figures = extractor.extract_with_captions("technical_report.pdf")
    
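    for index, fig in enumerate(figures, start=1):
        print(f"Page {fig.page_number}: {fig.width}x{fig.height}")
        if fig.caption:
            print(f"  Caption: {fig.caption[:50]}...")
        
        # Save with the extension reported by PyMuPDF (the raw bytes may be
        # JPEG, PNG, etc.); the index keeps same-sized figures on one page
        # from overwriting each other
        out_name = f"figure_p{fig.page_number}_{index}.{fig.image_format}"
        with open(out_name, "wb") as f:
            f.write(fig.image_bytes)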

Vision Model Options

Several VLMs are available for chart description, each with different capabilities and deployment requirements. Cloud-hosted models like GPT-4o and Claude 3.5 Sonnet consistently achieve the highest accuracy on complex charts—they handle cluttered scatter plots, dual-axis line charts, and nested pie charts with minimal prompt engineering—but they introduce API costs, latency, and data-privacy concerns that are unacceptable in many enterprise settings. Open-weight models like LLaVA 1.6 and Qwen-VL can be run entirely on your own hardware via frameworks such as Ollama or vLLM, trading a modest accuracy drop for complete data sovereignty and predictable cost.

MiniCPM-V occupies an especially interesting niche: at under 8 billion parameters it fits comfortably on a single consumer GPU with 12 GB VRAM, yet its high-resolution input pipeline (it tiles images up to 1344×1344 pixels) lets it resolve fine chart details that smaller models blur over. Understanding these trade-offs before committing to a model will save considerable effort when you hit throughput or cost ceilings in production:

Figure 3: Vision model comparison across accuracy, speed, deployment options, and specialized capabilities.
Model              Provider     Best For                          Deployment
GPT-4 Vision       OpenAI       Highest accuracy, complex charts  Cloud API
Claude 3.5         Anthropic    Detailed analysis, reasoning      Cloud API
Gemini Pro Vision  Google       Fast, good accuracy               Cloud API
LLaVA 1.6          Open Source  Local deployment, good quality    Local/Ollama
MiniCPM-V          OpenBMB      Small, efficient, local           Local
Qwen-VL            Alibaba      Multilingual, local option        Local/API

Effective Prompting for Charts

The quality of chart descriptions depends heavily on how you prompt the VLM. A generic “describe this image” instruction produces a superficial caption (“a bar chart with blue and orange bars”) that is practically useless for retrieval. Structured prompts that enumerate specific extraction targets (title, axis labels, approximate values, trend direction, key comparisons) reliably elicit the dense, factual text that makes descriptions searchable.
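
One way to make the "enumerate specific targets" idea concrete is to generate the prompt from a list of field names and parse the answer back into a dictionary. This is only a sketch under assumptions: the field names, the `field: value` line format, and both function names are hypothetical, and a real model will not always follow the requested format, so the parser silently skips lines it does not recognize.

```python
from typing import Dict, List

DEFAULT_FIELDS = ["title", "x_axis", "y_axis", "values", "trend", "key_comparison"]


def build_structured_prompt(fields: List[str]) -> str:
    """Ask for exactly one 'field: value' line per extraction target."""
    lines = [
        "Describe this chart. Answer with exactly one line per field,",
        "in the form 'field: value'. Write 'unknown' if a field is not visible.",
        "",
    ]
    lines += [f"{name}:" for name in fields]
    return "\n".join(lines)


def parse_fields(response: str, fields: List[str]) -> Dict[str, str]:
    """Collect the first 'field: value' line found for each requested field."""
    wanted = {name.lower() for name in fields}
    parsed: Dict[str, str] = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        # Tolerate light markdown decoration such as '**Title**:' or '- title:'
        key = key.strip(" *-\t").lower()
        if sep and key in wanted and key not in parsed:
            parsed[key] = value.strip(" *\t")
    return parsed


if __name__ == "__main__":
    print(build_structured_prompt(DEFAULT_FIELDS))
    sample = "title: Revenue by Quarter\ntrend: rising\nextra chatter"
    print(parse_fields(sample, DEFAULT_FIELDS))
```

Fields missing from the parsed result are a useful signal in themselves: a description that comes back without values or a trend is a candidate for a retry or for a stronger model.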

Prompt engineering for charts also benefits from a two-pass approach: a first, lightweight call asks the model to classify the chart type, and a second, specialized call uses a prompt tailored to that type—because the salient features of a pie chart (segment percentages and relative proportions) differ fundamentally from those of a line chart (trend direction, peaks, and inflection points). When sending images to locally-hosted models that are sensitive to context length, keeping the prompt concise while still covering all required fields is an important balancing act. Different chart types benefit from specialized prompts:

Figure 4: Example transformation from visual chart to structured text description suitable for RAG indexing.
"""
Specialized prompts for different chart types.
"""

from enum import Enum
from typing import Dict


class ChartType(Enum):
    """Types of charts we handle."""
    BAR = "bar"
    LINE = "line"
    PIE = "pie"
    SCATTER = "scatter"
    TABLE = "table"
    DIAGRAM = "diagram"
    UNKNOWN = "unknown"


CHART_PROMPTS: Dict[ChartType, str] = {
    ChartType.BAR: """
Analyze this bar chart and provide a detailed description including:

1. **Chart Title and Labels**: What is the title? What do the axes represent?
2. **Categories**: List all categories/bars shown
3. **Values**: Provide approximate values for each bar
4. **Comparisons**: Which category has the highest/lowest value? What's the range?
5. **Trends**: Are there any notable patterns (e.g., growth, decline)?
6. **Key Insights**: What are the main takeaways from this chart?

Format your response as structured text that would be useful for search and retrieval.
""",
    
    ChartType.LINE: """
Analyze this line chart and provide a detailed description including:

1. **Chart Title and Labels**: What is being measured? What are the axes?
2. **Data Series**: How many lines are there? What does each represent?
3. **Time Period**: What is the date/time range shown?
4. **Trends**: Describe the overall trend (increasing, decreasing, stable, cyclical)
5. **Key Points**: Identify peaks, valleys, and inflection points
6. **Comparisons**: If multiple lines, how do they compare?

Format your response as structured text suitable for indexing.
""",
    
    ChartType.PIE: """
Analyze this pie chart and provide a detailed description including:

1. **Chart Title**: What is being divided into segments?
2. **Segments**: List each segment with its label and percentage/value
3. **Largest/Smallest**: Which segments are dominant? Which are minor?
4. **Proportions**: Describe the relative sizes (e.g., "X is twice as large as Y")
5. **Total**: If shown, what is the total value represented?

Format your response as structured text for search and retrieval.
""",
    
    ChartType.SCATTER: """
Analyze this scatter plot and provide a detailed description including:

1. **Variables**: What are the X and Y axes measuring?
2. **Correlation**: Is there a visible correlation (positive, negative, none)?
3. **Clusters**: Are there distinct clusters or groupings?
4. **Outliers**: Are there any notable outliers?
5. **Trend Line**: If present, describe the trend line
6. **Data Range**: What are the approximate ranges for X and Y?

Format your response as structured text suitable for indexing.
""",
    
    ChartType.TABLE: """
Analyze this table and provide a detailed description including:

1. **Table Title/Context**: What data does this table contain?
2. **Columns**: List all column headers and what they represent
3. **Rows**: How many rows? What do they represent?
4. **Key Data Points**: Highlight the most important values
5. **Patterns**: Are there any notable patterns in the data?
6. **Summary**: Provide a brief summary of what the table shows

Convert the key information into searchable text format.
""",
    
    ChartType.DIAGRAM: """
Analyze this diagram and provide a detailed description including:

1. **Diagram Type**: What kind of diagram is this (flowchart, architecture, process, etc.)?
2. **Components**: List all major components/elements shown
3. **Relationships**: Describe how components are connected
4. **Flow**: If applicable, describe the direction of flow
5. **Labels**: Include all important labels and annotations
6. **Purpose**: What process or system is this diagram explaining?

Format your response as structured text that captures the diagram's meaning.
""",
    
    ChartType.UNKNOWN: """
Analyze this image and provide a detailed description including:

1. **Image Type**: What kind of visual is this (chart, diagram, photo, etc.)?
2. **Main Subject**: What is the primary content shown?
3. **Details**: Describe all important elements, labels, and text visible
4. **Data**: If numeric data is shown, list the key values
5. **Context**: Based on the visual, what topic or subject does this relate to?
6. **Key Information**: What are the most important facts conveyed?

Provide a comprehensive description suitable for search and retrieval.
"""
}


def get_chart_prompt(chart_type: ChartType, context: str = "") -> str:
    """
    Get the appropriate prompt for a chart type.
    
    Args:
        chart_type: Type of chart
        context: Additional context (e.g., document title, surrounding text)
        
    Returns:
        Formatted prompt string
    """
    base_prompt = CHART_PROMPTS[chart_type]
    
    if context:
        return f"""
Context: This image is from a document about: {context}

{base_prompt}
"""
    return base_prompt


def classify_chart_type(vlm_response: str) -> ChartType:
    """
    Classify chart type from a preliminary VLM response.
    
    Ask the VLM "What type of chart is this?" first,
    then use this function to parse the response.
    """
    response_lower = vlm_response.lower()
    
    if any(term in response_lower for term in ["bar chart", "bar graph", "histogram"]):
        return ChartType.BAR
    elif any(term in response_lower for term in ["line chart", "line graph", "time series"]):
        return ChartType.LINE
    elif any(term in response_lower for term in ["pie chart", "donut chart"]):
        return ChartType.PIE
    elif any(term in response_lower for term in ["scatter", "scatter plot"]):
        return ChartType.SCATTER
    elif any(term in response_lower for term in ["table", "spreadsheet"]):
        return ChartType.TABLE
    elif any(term in response_lower for term in ["diagram", "flowchart", "architecture"]):
        return ChartType.DIAGRAM
    else:
        return ChartType.UNKNOWN

Complete Implementation

Let’s build a complete chart description system that works with multiple VLM providers. The design uses a provider-agnostic abstract base class so you can swap between OpenAI’s GPT-4o, Anthropic’s Claude Vision, and locally-hosted models without changing any business logic. Each provider receives a base64-encoded image alongside a structured prompt and returns a raw text response; a shared parsing layer then extracts numeric data points, identifies insight sentences, and assembles a ChartDescription dataclass that downstream components can consume uniformly.

Supporting multiple providers in a single codebase also makes it practical to implement a cost-tiered fallback strategy—attempt description with a cheap local model first, and escalate to a cloud API only when the local confidence score falls below a threshold. Batch processing support is built in from the start, with sequential iteration and per-image error handling so that a single corrupted figure does not abort an entire document ingestion job. Below is the complete implementation:

"""
Complete VLM-based chart description system.
Supports OpenAI, Anthropic, and local models via Ollama.
"""

import base64
from pathlib import Path
from typing import Optional, Union, List
from dataclasses import dataclass
from abc import ABC, abstractmethod
import httpx


@dataclass
class ChartDescription:
    """Result of chart analysis."""
    image_path: str
    chart_type: str
    description: str
    key_data: List[str]
    insights: List[str]
    raw_response: str


class VLMProvider(ABC):
    """Abstract base class for VLM providers."""
    
    @abstractmethod
    def describe_image(
        self,
        image_data: Union[str, bytes],
        prompt: str
    ) -> str:
        """Send image to VLM and get description."""
        pass


class OpenAIVLM(VLMProvider):
    """OpenAI GPT-4 Vision provider."""
    
    def __init__(self, api_key: str, model: str = "gpt-4o"):
        from openai import OpenAI
        self.client = OpenAI(api_key=api_key)
        self.model = model
    
    def describe_image(
        self,
        image_data: Union[str, bytes],
        prompt: str
    ) -> str:
        # Handle both file paths and bytes
        if isinstance(image_data, str):
            with open(image_data, "rb") as f:
                image_bytes = f.read()
        else:
            image_bytes = image_data
        
        # Encode to base64
        b64_image = base64.b64encode(image_bytes).decode("utf-8")
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{b64_image}",
                                "detail": "high"
                            }
                        }
                    ]
                }
            ],
            max_tokens=1500
        )
        
        return response.choices[0].message.content


class OllamaVLM(VLMProvider):
    """Local VLM via Ollama (LLaVA, etc.)."""
    
    def __init__(
        self,
        model: str = "llava:13b",
        base_url: str = "http://localhost:11434"
    ):
        self.model = model
        self.base_url = base_url
    
    def describe_image(
        self,
        image_data: Union[str, bytes],
        prompt: str
    ) -> str:
        # Handle both file paths and bytes
        if isinstance(image_data, str):
            with open(image_data, "rb") as f:
                image_bytes = f.read()
        else:
            image_bytes = image_data
        
        # Encode to base64
        b64_image = base64.b64encode(image_bytes).decode("utf-8")
        
        # Ollama API for vision models
        response = httpx.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "images": [b64_image],
                "stream": False
            },
            timeout=120.0
        )
        
        response.raise_for_status()
        return response.json()["response"]


class ChartDescriptor:
    """
    Main class for describing charts using VLMs.
    
    Handles:
    - Chart type classification
    - Specialized prompting
    - Structured output parsing
    """
    
    def __init__(self, vlm_provider: VLMProvider):
        """
        Initialize with a VLM provider.
        
        Args:
            vlm_provider: Instance of VLMProvider (OpenAI, Ollama, etc.)
        """
        self.vlm = vlm_provider
    
    def classify_chart(self, image_data: Union[str, bytes]) -> str:
        """
        Classify the type of chart in an image.
        
        Args:
            image_data: Path to image or image bytes
            
        Returns:
            Chart type string
        """
        prompt = """
        What type of chart or visual is shown in this image?
        
        Respond with ONLY one of these categories:
        - bar chart
        - line chart
        - pie chart
        - scatter plot
        - table
        - diagram
        - other
        """
        
        response = self.vlm.describe_image(image_data, prompt)
        return response.strip().lower()
    
    def describe(
        self,
        image_data: Union[str, bytes],
        context: str = "",
        chart_type: Optional[str] = None
    ) -> ChartDescription:
        """
        Generate a comprehensive description of a chart.
        
        Args:
            image_data: Path to image or image bytes
            context: Additional context about the document
            chart_type: Pre-classified chart type (or None to auto-detect)
            
        Returns:
            ChartDescription with structured information
        """
        # Auto-detect chart type if not provided
        if chart_type is None:
            chart_type = self.classify_chart(image_data)
        
        # Get appropriate prompt (classify_chart_type and get_chart_prompt
        # are defined in the prompt module shown earlier)
        chart_enum = classify_chart_type(chart_type)
        prompt = get_chart_prompt(chart_enum, context)
        
        # Get description
        raw_response = self.vlm.describe_image(image_data, prompt)
        
        # Parse response into structured format
        return self._parse_response(
            image_path=image_data if isinstance(image_data, str) else "",
            chart_type=chart_type,
            raw_response=raw_response
        )
    
    def _parse_response(
        self,
        image_path: str,
        chart_type: str,
        raw_response: str
    ) -> ChartDescription:
        """Parse VLM response into structured format."""
        
        # Extract key data points (lines containing a number or percentage)
        import re
        key_data = []
        for line in raw_response.split('\n'):
            if re.search(r'\d+\.?\d*%?', line):
                key_data.append(line.strip())
        
        # Extract insights (lines with key phrases)
        insight_phrases = ['shows', 'indicates', 'suggests', 'reveals', 
                          'highest', 'lowest', 'trend', 'increase', 'decrease']
        insights = []
        for line in raw_response.split('\n'):
            if any(phrase in line.lower() for phrase in insight_phrases):
                if len(line.strip()) > 20:  # Skip very short lines
                    insights.append(line.strip())
        
        return ChartDescription(
            image_path=image_path,
            chart_type=chart_type,
            description=raw_response,
            key_data=key_data[:10],  # Limit to top 10
            insights=insights[:5],    # Limit to top 5
            raw_response=raw_response
        )
    
    def batch_describe(
        self,
        images: List[Union[str, bytes]],
        context: str = ""
    ) -> List[ChartDescription]:
        """
        Describe multiple charts.
        
        Args:
            images: List of image paths or bytes
            context: Shared context for all images
            
        Returns:
            List of ChartDescription objects
        """
        results = []
        for i, image in enumerate(images):
            print(f"Processing image {i+1}/{len(images)}...")
            try:
                desc = self.describe(image, context)
                results.append(desc)
            except Exception as e:
                print(f"Error processing image {i+1}: {e}")
        
        return results


# Example usage
if __name__ == "__main__":
    # Using OpenAI
    # vlm = OpenAIVLM(api_key="sk-...")
    
    # Using local Ollama with LLaVA
    vlm = OllamaVLM(model="llava:13b")
    
    descriptor = ChartDescriptor(vlm)
    
    # Describe a single chart
    result = descriptor.describe(
        "sales_chart.png",
        context="Quarterly sales report for 2024"
    )
    
    print(f"Chart Type: {result.chart_type}")
    print(f"\nDescription:\n{result.description}")
    print("\nKey Data Points:")
    for point in result.key_data:
        print(f"  - {point}")
    print("\nInsights:")
    for insight in result.insights:
        print(f"  - {insight}")
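
The cost-tiered fallback mentioned before the listing is not part of the implementation above, so here is one possible sketch of it. The models used here do not expose a confidence score, so this example substitutes a crude text heuristic (minimum length and a minimum count of numbers) as a stand-in; that heuristic, its thresholds, and the function names are all assumptions for illustration. Providers are passed as plain callables taking `(image_data, prompt)`, which matches the shape of `describe_image` above.

```python
import re
from typing import Callable, Tuple

# Any callable that takes (image_data, prompt) and returns text
Describe = Callable[[bytes, str], str]


def looks_informative(text: str, min_chars: int = 200, min_numbers: int = 3) -> bool:
    """Crude stand-in for a confidence score: long enough and cites numbers."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return len(text.strip()) >= min_chars and len(numbers) >= min_numbers


def describe_with_fallback(
    image_data: bytes,
    prompt: str,
    primary: Describe,
    fallback: Describe,
    accept: Callable[[str], bool] = looks_informative,
) -> Tuple[str, str]:
    """
    Try the cheap provider first; escalate when it fails or when its
    answer does not pass the acceptance check. Returns (text, tier).
    """
    try:
        draft = primary(image_data, prompt)
        if accept(draft):
            return draft, "primary"
    except Exception as exc:
        print(f"Primary provider failed, escalating: {exc}")
    return fallback(image_data, prompt), "fallback"


if __name__ == "__main__":
    # Stub providers so the example runs without any model or API key
    def weak(image_data: bytes, prompt: str) -> str:
        return "A chart."

    def strong(image_data: bytes, prompt: str) -> str:
        return "Revenue grew from 10 to 25 million between Q1 and Q4."

    text, tier = describe_with_fallback(b"", "Describe this chart", weak, strong)
    print(tier, "->", text)
```

With the classes from the listing, the two callables could be the bound `describe_image` methods of an `OllamaVLM` and an `OpenAIVLM` instance. Recording the returned tier per figure also gives you a simple way to measure how often escalation happens and what it costs.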

Local VLM Deployment

For privacy-sensitive applications, running vision models locally is essential. When your documents contain proprietary financial data, medical records, or confidential engineering specifications, sending chart images to a third-party cloud API is simply not an option. Even in less sensitive contexts, local deployment eliminates per-request API costs, removes network latency, and gives you full control over model version and configuration. The Ollama framework makes local VLM deployment approachable: with a handful of shell commands it downloads models, manages GPU memory allocation, and serves a local HTTP API that includes OpenAI-compatible endpoints.

LLaVA 1.6 (available in both 7B and 13B parameter variants) is the most widely deployed open-weight option, offering a solid balance between accuracy and resource requirements—the 7B model runs comfortably on an 8 GB VRAM GPU, while the 13B model delivers noticeably better results on complex charts if you have 16 GB available. For edge deployments or CPU-only machines, MiniCPM-V provides competitive chart understanding at a fraction of the memory footprint. Here’s how to set up both options:

LLaVA via Ollama

# Install LLaVA model
ollama pull llava:13b

# Or the smaller 7B version
ollama pull llava:7b

# Test with an image (the image path goes inside the prompt)
ollama run llava:13b "Describe this image: ./chart.png"
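
Beyond the command line, the local HTTP API can be called directly from Python. The following is a minimal sketch, assuming an Ollama server on its default port (11434) and using only the standard library. The helper names `build_ollama_payload` and `describe_with_ollama` are illustrative and not part of any library.

```python
import base64
import json
import urllib.request


def build_ollama_payload(image_bytes: bytes, prompt: str,
                         model: str = "llava:13b") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        # Images are sent as a list of base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for a single JSON object instead of a token stream
        "stream": False,
    }


def describe_with_ollama(image_path: str, prompt: str,
                         model: str = "llava:13b",
                         host: str = "http://localhost:11434") -> str:
    """Send one image plus prompt to a local Ollama server (assumed running)."""
    with open(image_path, "rb") as f:
        payload = build_ollama_payload(f.read(), prompt, model)

    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Vision inference can be slow on modest hardware, hence the long timeout
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_with_ollama("chart.png", "Describe this chart")` returns the model's text answer. Keeping the payload builder separate makes it possible to unit-test the request shape without a running server.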

MiniCPM-V for Resource-Constrained Systems

"""
MiniCPM-V: Efficient local vision model.
Requires: pip install transformers torch pillow
"""

from transformers import AutoModel, AutoTokenizer
from PIL import Image
import torch


class MiniCPMVision:
    """
    MiniCPM-V local vision model.
    
    Optimized for efficiency while maintaining good quality.
    """
    
    def __init__(self, model_name: str = "openbmb/MiniCPM-V-2_6"):
        # Pick the device first: half precision is only a safe default on GPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if self.device == "cuda" else torch.float32
        
        self.model = AutoModel.from_pretrained(
            model_name,
            trust_remote_code=True,
            torch_dtype=dtype
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )
        
        self.model = self.model.to(self.device)
        self.model.eval()
    
    def describe(self, image_path: str, prompt: str) -> str:
        """
        Describe an image.
        
        Args:
            image_path: Path to image file
            prompt: Question or instruction
            
        Returns:
            Model's response
        """
        image = Image.open(image_path).convert("RGB")
        
        # Per the MiniCPM-V 2.6 model card, the PIL image and the prompt go
        # together in the message content, and the image argument stays None.
        messages = [{"role": "user", "content": [image, prompt]}]
        
        response = self.model.chat(
            image=None,
            msgs=messages,
            tokenizer=self.tokenizer
        )
        
        return response


# Example
if __name__ == "__main__":
    vlm = MiniCPMVision()
    
    description = vlm.describe(
        "revenue_chart.png",
        "Analyze this chart and describe the key trends shown."
    )
    print(description)

RAG Integration

Integrating chart descriptions into your RAG pipeline makes visual content searchable alongside ordinary document text. The key insight is that once a VLM has converted a bar chart into a paragraph of structured prose—listing category labels, approximate values, and trend observations—that description can be embedded and stored in a vector database using exactly the same workflow as any text chunk. This means a user’s natural-language query like “which quarter had the highest capital expenditure?” can semantically match a chart description even if the original figure contained no searchable text at all.

To preserve the distinction between text-derived and vision-derived chunks, it is worth storing a chunk_type field that allows filtered searches—for example, searching only figure chunks when a query explicitly mentions a chart or graph. Metadata such as the source page number, the detected chart type, and a path to the original image file should also be persisted alongside the embedding so that the final answer can include a reference back to the original figure, improving citation quality and user trust in the response:

Figure 5: Complete RAG architecture showing how visual descriptions are indexed alongside text content.

"""
RAG integration for visual content.
"""

from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from sentence_transformers import SentenceTransformer
import lancedb


@dataclass
class VisualChunk:
    """A chunk representing visual content."""
    chunk_id: str
    source_file: str
    page_number: int
    chunk_type: str  # "text", "figure", "table"
    content: str
    image_path: Optional[str] = None
    chart_type: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


class MultimodalRAG:
    """
    RAG system that handles both text and visual content.
    
    Visual content is converted to text descriptions
    and indexed alongside regular text chunks.
    """
    
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        db_path: str = "./multimodal_rag"
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.db = lancedb.connect(db_path)
        self.table = None
    
    def index_chunks(self, chunks: List[VisualChunk], table_name: str = "content"):
        """
        Index both text and visual chunks.
        
        Args:
            chunks: List of VisualChunk objects
            table_name: Name for the database table
        """
        # Prepare data
        data = []
        texts_to_embed = []
        
        for chunk in chunks:
            texts_to_embed.append(chunk.content)
            data.append({
                "chunk_id": chunk.chunk_id,
                "source_file": chunk.source_file,
                "page_number": chunk.page_number,
                "chunk_type": chunk.chunk_type,
                "content": chunk.content,
                "image_path": chunk.image_path or "",
                "chart_type": chunk.chart_type or "",
                "metadata": str(chunk.metadata or {})
            })
        
        # Generate embeddings
        embeddings = self.embedder.encode(texts_to_embed, show_progress_bar=True)
        
        # Add embeddings to data
        for item, emb in zip(data, embeddings):
            item["vector"] = emb.tolist()
        
        # Create table
        self.table = self.db.create_table(table_name, data, mode="overwrite")
        print(f"Indexed {len(chunks)} chunks")
    
    def search(
        self,
        query: str,
        top_k: int = 5,
        filter_type: Optional[str] = None
    ) -> List[Dict]:
        """
        Search for relevant content.
        
        Args:
            query: Search query
            top_k: Number of results
            filter_type: Optional filter ("text", "figure", "table")
            
        Returns:
            List of matching chunks
        """
        query_embedding = self.embedder.encode(query)
        
        search = self.table.search(query_embedding)
        
        if filter_type:
            search = search.where(f"chunk_type = '{filter_type}'")
        
        results = search.limit(top_k).to_list()
        
        return results
    
    def search_figures(self, query: str, top_k: int = 5) -> List[Dict]:
        """Search specifically for figures/charts."""
        return self.search(query, top_k, filter_type="figure")


def process_pdf_with_figures(
    pdf_path: str,
    vlm_provider,
    text_chunks: List[Dict]
) -> List[VisualChunk]:
    """
    Process PDF extracting both text and visual content.
    
    Args:
        pdf_path: Path to PDF file
        vlm_provider: VLM provider for chart description
        text_chunks: Pre-extracted text chunks
        
    Returns:
        Combined list of text and visual chunks
    """
    from pathlib import Path
    
    all_chunks = []
    
    # Add text chunks
    for i, chunk in enumerate(text_chunks):
        all_chunks.append(VisualChunk(
            chunk_id=f"text_{i}",
            source_file=pdf_path,
            page_number=chunk.get("page", 0),
            chunk_type="text",
            content=chunk["text"]
        ))
    
    # Extract and describe figures
    extractor = PDFFigureExtractor()
    descriptor = ChartDescriptor(vlm_provider)
    
    figures = extractor.extract_with_captions(pdf_path)
    
    for i, fig in enumerate(figures):
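        # Save the figure image so search results can link back to it;
        # the PDF name in the path avoids collisions between documents
        img_path = f"{Path(pdf_path).stem}_fig_{i}.png"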
        with open(img_path, "wb") as f:
            f.write(fig.image_bytes)
        
        # Get description
        try:
            description = descriptor.describe(
                img_path,
                context=Path(pdf_path).stem
            )
            
            # Create chunk
            content = f"""
Figure from page {fig.page_number}
{f'Caption: {fig.caption}' if fig.caption else ''}
Type: {description.chart_type}

Description:
{description.description}

Key Data:
{chr(10).join(description.key_data)}
"""
            
            all_chunks.append(VisualChunk(
                chunk_id=f"figure_{i}",
                source_file=pdf_path,
                page_number=fig.page_number,
                chunk_type="figure",
                content=content,
                image_path=img_path,
                chart_type=description.chart_type
            ))
            
        except Exception as e:
            print(f"Error processing figure {i}: {e}")
    
    return all_chunks


# Example usage
if __name__ == "__main__":
    # Initialize
    vlm = OllamaVLM(model="llava:13b")
    rag = MultimodalRAG()
    
    # Process a document
    text_chunks = [
        {"text": "Revenue grew 15% year over year...", "page": 1},
        {"text": "Operating expenses decreased by 8%...", "page": 2},
    ]
    
    chunks = process_pdf_with_figures(
        "annual_report.pdf",
        vlm,
        text_chunks
    )
    
    # Index everything
    rag.index_chunks(chunks)
    
    # Search across text and figures
    results = rag.search("revenue growth trends")
    
    for r in results:
        print(f"[{r['chunk_type']}] Page {r['page_number']}")
        print(f"  {r['content'][:100]}...")
        if r.get('image_path'):
            print(f"  Image: {r['image_path']}")
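
The filtered-search idea described earlier, restricting retrieval to figure chunks when a query explicitly mentions a chart or graph, can be approximated with a simple keyword router. This is a rough sketch: the keyword list and the name `pick_filter_type` are assumptions for illustration, and a production system might use a small classifier instead.

```python
import re
from typing import Optional

# Hypothetical keyword list; tune it for your own corpus and query logs
FIGURE_TERMS = re.compile(
    r"\b(?:charts?|graphs?|plots?|figures?|diagrams?|legends?|axis|axes)\b",
    re.IGNORECASE,
)


def pick_filter_type(query: str) -> Optional[str]:
    """Return "figure" when the query explicitly refers to visual content.

    Returns None otherwise, which leaves the search unfiltered.
    """
    return "figure" if FIGURE_TERMS.search(query) else None
```

The result plugs straight into the search method shown above, for example `rag.search(query, filter_type=pick_filter_type(query))`. Queries that do not mention visual content fall through to an unfiltered search across text and figure chunks.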

Quality and Evaluation

Evaluating chart descriptions requires both automated metrics and human judgment. Automated evaluation works by comparing the generated description against a ground-truth annotation: you can measure value recall (what fraction of the actual numeric values in the chart were mentioned), label recall (how many axis labels, legend entries, and category names appear in the description), and insight coverage (whether key qualitative observations such as “revenue peaked in Q3” are captured). These metrics are easy to compute programmatically and are suitable for regression testing as you iterate on your prompts or upgrade your VLM.

Human evaluation is equally important because automated metrics miss fluency, coherence, and the subtle ability to foreground the most decision-relevant information rather than exhaustively listing every data point. A practical evaluation workflow combines automated regression tests on a curated benchmark of annotated charts with periodic human review of a random sample of production descriptions, using the results to refine prompts and identify chart types that the current model consistently handles poorly:

Evaluation Checklist
  • Accuracy: Are the stated values correct?
  • Completeness: Are all important elements described?
  • Clarity: Is the description easy to understand?
  • Searchability: Would relevant queries find this description?
  • Consistency: Are similar charts described similarly?

def evaluate_description(
    description: str,
    ground_truth: dict
) -> dict:
    """
    Evaluate a chart description against ground truth.
    
    Args:
        description: Generated description
        ground_truth: Dict with expected values, labels, etc.
        
    Returns:
        Evaluation metrics
    """
    scores = {}
    
    # Check if key values are mentioned
    values_found = 0
    for value in ground_truth.get("values", []):
        if str(value) in description:
            values_found += 1
    
    if ground_truth.get("values"):
        scores["value_recall"] = values_found / len(ground_truth["values"])
    
    # Check if labels are mentioned
    labels_found = 0
    for label in ground_truth.get("labels", []):
        if label.lower() in description.lower():
            labels_found += 1
    
    if ground_truth.get("labels"):
        scores["label_recall"] = labels_found / len(ground_truth["labels"])
    
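    # Check for key insights. An insight counts as covered only when at least
    # half of its non-stopword terms appear, so "in" or "the" cannot match alone.
    stopwords = {"a", "an", "and", "at", "by", "for", "in", "is", "of", "on", "the", "to", "was"}
    description_lower = description.lower()
    insights_found = 0
    for insight in ground_truth.get("insights", []):
        terms = [w for w in insight.lower().split() if w not in stopwords]
        if terms and sum(t in description_lower for t in terms) >= len(terms) / 2:
            insights_found += 1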
    
    if ground_truth.get("insights"):
        scores["insight_coverage"] = insights_found / len(ground_truth["insights"])
    
    return scores
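
One limitation of plain substring matching for value recall is that an expected value of 15 also matches "150", while a VLM that reads "about 1,510" off a bar gets no credit for a true value of 1,500. A possible refinement, sketched here under the assumption that a 5% relative tolerance is acceptable for values read from a chart, parses the numbers out of the description and compares them numerically. The names `extract_numbers` and `value_recall` are illustrative.

```python
import re
from typing import Iterable, List

# Matches integers, decimals, and thousands separators such as 1,200.
# The lookbehind skips digits glued to letters, so "Q3" is not read as 3.
NUMBER_PATTERN = re.compile(r"(?<![\w.])-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Pull every standalone number out of a description."""
    return [float(m.replace(",", "")) for m in NUMBER_PATTERN.findall(text)]


def value_recall(description: str, expected: Iterable[float],
                 rel_tol: float = 0.05) -> float:
    """Fraction of expected values that appear within a relative tolerance."""
    expected = list(expected)
    if not expected:
        return 0.0

    found = extract_numbers(description)
    hits = 0
    for target in expected:
        tolerance = abs(target) * rel_tol
        if any(abs(number - target) <= tolerance for number in found):
            hits += 1
    return hits / len(expected)
```

The tolerance is a judgment call: a tighter value suits charts with printed data labels, while a looser one is more forgiving when values must be estimated from axis ticks.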

Conclusion

Vision Language Models unlock the visual content hidden in PDF documents, making charts, diagrams, and figures searchable and accessible to RAG systems. By combining careful figure extraction, specialized prompting, and proper integration, you can build document understanding systems that truly comprehend the complete content of technical documents. The architectural landscape of VLMs is evolving rapidly—models like LLaVA, Qwen-VL, MiniCPM-V, and the proprietary GPT-4o and Gemini Vision series each make different trade-offs between accuracy, resource consumption, and deployment flexibility, and the right choice depends on your specific combination of document types, privacy requirements, and throughput targets.

Prompt engineering remains a high-leverage investment: a well-crafted, chart-type-specific prompt typically extracts considerably more useful information than a generic instruction, and a two-pass strategy that classifies the chart before describing it tends to be more reliable than a single pass. Generating rich alt-text as a by-product of the description pipeline costs almost nothing extra and produces accessibility benefits that extend well beyond RAG: screen-reader users and automated summarization pipelines both benefit from the same structured figure descriptions that power your vector search.

Key Takeaways

Each article in this series has tackled one focused challenge in building a full-stack, local-first RAG system, and the items below summarize the most important lessons from the complete journey. Read in sequence, they also serve as a concise checklist for evaluating any document-intelligence pipeline you inherit or build from scratch: if a layer is missing or under-engineered, the downstream components end up compensating with increased complexity or degraded answer quality.

  • PyMuPDF: Extract text, tables, images, and metadata from PDFs
  • Chunking: Split documents intelligently for optimal retrieval
  • Embeddings: Use Sentence Transformers for local, high-quality vectors
  • LanceDB: Store and search vectors with minimal setup
  • Hybrid Search: Combine semantic and keyword search with RRF
  • Local LLMs: Run models locally for privacy and cost savings
  • Vision Models: Convert visual content to searchable text

Together, these components form a complete, privacy-respecting RAG pipeline capable of understanding complex technical documents. You now have all the tools needed to build production-quality document analysis systems. The text extraction, chunking, embedding, and retrieval layers established in earlier articles in this series provide the backbone, while the VLM-powered figure description layer added in this article fills in the visual gaps that previously forced engineers to either ignore figures entirely or hand-annotate them at great expense.

Scaling this pipeline to large document corpora requires attention to throughput—batching vision inference, caching descriptions for identical images across documents, and parallelizing PDF processing—as well as to quality over time, with continuous evaluation loops that catch prompt regressions before they reach end users. The rapid pace of VLM development means that any specific model recommendation will evolve, but the pipeline architecture described here is model-agnostic by design: swapping in a newer, more capable vision model is a one-line change to the provider configuration.
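
Of the throughput ideas above, caching descriptions for identical images is the simplest to sketch. The example below keys an in-memory cache on the SHA-256 hash of the image bytes, so a logo or recurring chart that appears in many documents is sent to the VLM only once. `DescriptionCache` and `describe_fn` are hypothetical names, and a real deployment would likely persist the cache to disk or a database.

```python
import hashlib
from typing import Callable, Dict


class DescriptionCache:
    """Cache VLM descriptions keyed by a content hash of the image bytes."""

    def __init__(self, describe_fn: Callable[[bytes], str]):
        # describe_fn is any callable that turns image bytes into text,
        # for example a thin wrapper around a VLM provider
        self._describe_fn = describe_fn
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def describe(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]

        # Cache miss: run the expensive vision call once and remember it
        self.misses += 1
        description = self._describe_fn(image_bytes)
        self._cache[key] = description
        return description
```

Hashing the raw bytes only catches byte-identical images. Figures that were re-encoded or resized would need a perceptual hash instead, which is outside the scope of this sketch.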

Artur Poniedziałek
IT Expert & Project Manager

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.
