Running Local LLMs with OpenAI-Compatible APIs

Key Topics: local LLM deployment, OpenAI API compatible server, LM Studio setup, Ollama tutorial, vLLM production, LLaMA local inference, private AI deployment, self-hosted language model, open source LLM, local reasoning models, extended thinking blocks

Running large language models locally offers compelling advantages: complete data privacy, zero API costs, and full control over the inference process. With OpenAI-compatible local servers, you can use the same code that works with GPT-4 while running models entirely on your own hardware—no internet connection required.

This comprehensive guide covers everything you need to set up local LLM inference: from user-friendly tools like LM Studio and Ollama for desktop use, to production-ready solutions like vLLM. You’ll learn to configure models, optimize performance, and integrate local LLMs into your RAG pipelines using the familiar OpenAI API interface. The guide also explores quantization formats like GGUF, AWQ, and GPTQ that allow large models to fit within consumer-grade GPU memory budgets. Whether you’re building a private document analysis tool, a cost-conscious chatbot, or a research platform that must stay air-gapped from the internet, local LLM deployment gives you full ownership of the entire inference stack. By the end, you’ll be able to swap between Ollama, LM Studio, vLLM, and even the real OpenAI API using a single unified client—with virtually zero code changes.

Why Run LLMs Locally?

Before diving into implementation, let’s understand when local LLM deployment makes sense. The decision isn’t always straightforward—cloud APIs like OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini offer state-of-the-art quality and near-instant scalability, while local models require upfront hardware investment and ongoing maintenance. However, for use cases involving sensitive data, high request volumes, offline environments, or tight cost controls, local deployment frequently wins on every meaningful metric. Industries such as healthcare, legal, and finance routinely handle documents where sending data to a third-party inference endpoint is either prohibited by regulation or simply too risky from a confidentiality standpoint. Understanding this tradeoff clearly before you write a single line of code will save you from a costly architectural pivot later.

Figure 1: Decision matrix comparing local vs. cloud LLM deployment across privacy, cost, performance, and maintenance dimensions.

Advantages of Local LLMs

  • Complete data privacy: Sensitive documents never leave your network
  • No recurring costs: After hardware investment, inference is free
  • Offline operation: Works without internet connectivity
  • Customization: Fine-tune models on your specific domain
  • Low latency: No network round-trip for each request
  • Unlimited usage: No rate limits or token quotas
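The "no recurring costs" point is easiest to judge with a quick break-even estimate. The sketch below is illustrative only: `break_even_tokens` is a hypothetical helper, and the hardware price, API price, and electricity cost are placeholder figures rather than real quotes.

```python
def break_even_tokens(
    hardware_cost: float,
    api_price_per_million: float,
    local_cost_per_million: float = 0.0,
) -> float:
    """Tokens after which local hardware becomes cheaper than a paid API.

    All inputs use the same currency; prices are per one million tokens.
    """
    saving_per_million = api_price_per_million - local_cost_per_million
    if saving_per_million <= 0:
        # Local inference never pays off if it costs as much per token as the API.
        return float("inf")
    return hardware_cost / saving_per_million * 1_000_000


# Placeholder figures: a 2000-unit GPU, an API at 10 per million tokens,
# and 0.50 per million tokens for electricity.
tokens = break_even_tokens(2000.0, 10.0, 0.5)
print(f"Break-even after roughly {tokens / 1e6:.0f} million tokens")
```

Plugging in your own request volume shows quickly whether the hardware pays for itself in weeks or in years.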

When Cloud APIs Are Better

  • Limited hardware: Top models require significant GPU memory
  • Cutting-edge capabilities: GPT-4 and Claude still lead in many benchmarks
  • Scalability needs: Handling thousands of concurrent users
  • Quick prototyping: No setup time required

Hardware Requirements

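Local LLM performance depends heavily on available hardware. The figures below are rough guides that assume quantized weights (roughly 4 to 8 bits per weight); unquantized FP16 models need considerably more memory: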

  • 7B models: 8GB+ VRAM, runs well on RTX 3080/4080
  • 13B models: 16GB+ VRAM, needs RTX 4090 or dual GPUs
  • 70B models: 48GB+ VRAM, requires A100, H100, or multiple GPUs
  • CPU inference: Possible but 10-50x slower than GPU
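A back-of-the-envelope calculation shows where tiers like these come from. The sketch below covers the weights only; the KV cache and runtime buffers come on top, and `weights_gb` is a hypothetical helper introduced for illustration.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = weights_gb(size, 16)
    int4 = weights_gb(size, 4)
    print(f"{size}B: ~{fp16:.0f} GB at 16-bit, ~{int4:.1f} GB at 4-bit (weights only)")
```

A 7B model at 16-bit precision already exceeds an 8 GB card, which is why quantized builds are the usual choice on consumer GPUs.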

Local LLM Server Options

Several excellent tools provide OpenAI-compatible APIs for local LLMs, and each has different strengths that suit different use cases. LM Studio targets desktop users who want a point-and-click experience, while Ollama prioritises developer ergonomics with a clean CLI and first-class Docker support. vLLM is purpose-built for multi-user production workloads and leverages advanced GPU memory management techniques to maximise throughput. At the other end of the spectrum, llama.cpp runs on almost any hardware—including CPUs and Apple Silicon—making it the best choice for edge or embedded deployments where a discrete NVIDIA GPU is not available. Choosing the right tool from the outset avoids expensive re-architecture later, so it is worth spending a few minutes with the comparison table below.

Figure 2: Comparison of popular local LLM servers: LM Studio, Ollama, vLLM, and llama.cpp.

| Tool | Best For | Platform | Key Features |
|---|---|---|---|
| LM Studio | Desktop users, beginners | Windows, Mac, Linux | GUI, easy model download, chat interface |
| Ollama | CLI users, developers | Mac, Linux, Windows | Simple CLI, Docker support, fast setup |
| vLLM | Production, high throughput | Linux | PagedAttention, continuous batching |
| llama.cpp | CPU inference, embedded | All platforms | Minimal dependencies, quantization |
| Text Generation WebUI | Advanced users, experimentation | All with Python | Extensions, multiple backends |

LM Studio: Getting Started

LM Studio provides the easiest path to running local LLMs with a polished GUI application. It handles model downloads, configuration, and provides both a chat interface and API server. The built-in model browser pulls directly from Hugging Face, so you can discover, filter by VRAM requirement, and download quantized GGUF variants without ever opening a terminal. LM Studio also exposes a hardware diagnostics panel that shows exactly how many model layers are offloaded to the GPU versus kept in system RAM—crucial information when you’re trying to squeeze a 13B model onto a card with only 12 GB of VRAM.

For teams doing rapid prototyping, the integrated chat UI lets you evaluate a model’s personality and instruction-following quality before committing to it in production code. LM Studio builds on llama.cpp, so raw inference speed is broadly comparable, but on Windows in particular it saves setup effort because it ships prebuilt runtimes, including a Vulkan backend that can use AMD and Intel GPUs in addition to NVIDIA hardware.

Installation and Setup

  1. Download LM Studio from lmstudio.ai
  2. Install and launch the application
  3. Browse the model catalog and download a model (e.g., LLaMA 3.1, Mistral, Qwen)
  4. Select the model and click “Start Server” to enable API access

By default, LM Studio runs its OpenAI-compatible API on http://localhost:1234. The server exposes the standard /v1/chat/completions, /v1/completions, and /v1/models endpoints, so any library built for the OpenAI API—including the official Python SDK, LangChain, and LlamaIndex—will work without modification. You can also enable CORS in LM Studio’s server settings, which is useful when calling the API from a browser-based front end running on a different port. For teams that share a single workstation, LM Studio can bind to 0.0.0.0 instead of 127.0.0.1, making the server accessible to other machines on the local network—a handy way to centralise a powerful GPU for multiple developers during a sprint.
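To make the wire format concrete, here is a minimal sketch of the raw HTTP request behind a chat completion, using only the standard library. `build_chat_request` is a hypothetical helper; it builds the request without sending it, so it can be inspected even when no server is running.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:1234/v1", "local-model", "Hello!")
print(request.get_method(), request.full_url)

# With the LM Studio server running, the request could be sent like this:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Any OpenAI client library ultimately sends a request of this shape, which is why pointing `base_url` at a local server is enough.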

from openai import OpenAI

# Connect to LM Studio's local server
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"  # Not actually validated
)

# Use like normal OpenAI API
response = client.chat.completions.create(
    model="local-model",  # Can be any string, LM Studio uses loaded model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

LM Studio Configuration Tips

  • GPU Layers: Set to maximum your VRAM allows for best speed
  • Context Length: Higher values use more memory but allow longer conversations
  • Thread Count: Match your CPU cores for optimal CPU inference
  • Batch Size: Increase for throughput, decrease for interactive use
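For the GPU Layers setting, a rough starting value can be estimated from the model file size. This is a sketch under simplifying assumptions: every layer is treated as equally large, `reserved_gb` is a guessed allowance for the KV cache and runtime buffers, and `gpu_layers_that_fit` is a hypothetical helper rather than anything LM Studio provides.

```python
def gpu_layers_that_fit(
    model_file_gb: float,
    total_layers: int,
    vram_gb: float,
    reserved_gb: float = 1.5,
) -> int:
    """Rough count of layers to offload, assuming all layers are equally large."""
    per_layer_gb = model_file_gb / total_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(total_layers, int(usable_gb / per_layer_gb))


# Example: a 7.9 GB model file with 40 layers on a 6 GB card.
print(gpu_layers_that_fit(7.9, 40, 6.0))
```

Treat the result as a first guess and adjust it while watching actual VRAM usage, since longer contexts need a larger reserve.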

Ollama: Command-Line Power

Ollama offers a streamlined command-line experience, making it perfect for developers and automated workflows. Its Docker support makes it excellent for containerised deployments where reproducibility and environment isolation are paramount. Under the hood, Ollama describes each model with a Modelfile, a simple declarative format that captures the base model, system prompt, sampling parameters, and context length in a single versioned artefact, making it straightforward to share a tuned configuration across a team. Ollama also manages model storage efficiently by deduplicating shared layers, so several custom models created from the same base (for example, different system prompts on top of llama3.1) reuse one copy of the weights instead of storing them again. The project maintains an official model registry at ollama.com with curated, pre-quantised builds of the most popular open-source models, meaning you can be running LLaMA 3.1, Mistral, Phi-3, or Qwen in under five minutes on a fresh machine.

Installation

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows (download from ollama.com)

# Or using Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Basic Usage

# Download and run a model
ollama run llama3.1

# List available models
ollama list

# Pull a specific model
ollama pull mistral:7b-instruct

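# Run with specific parameters: set them inside the interactive session
# (`ollama run` has no --num-ctx flag) or via PARAMETER in a Modelfile
ollama run llama3.1
# >>> /set parameter num_ctx 4096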

Using Ollama’s API

Ollama serves an OpenAI-compatible API on port 11434 that mirrors the most commonly used endpoints of OpenAI’s cloud service. The /v1/chat/completions endpoint accepts the same JSON payload shape, supports the same stream: true server-sent events protocol, and returns responses with the same field names, so switching between Ollama and the real OpenAI API usually comes down to changing base_url, the API key, and the model name. Ollama additionally exposes native endpoints like /api/generate and /api/chat with its own streaming format if you prefer a lighter payload, but for maximum ecosystem compatibility the OpenAI-compatible endpoint is the recommended choice. When running inside a Docker network, other containers can reach Ollama via the service name rather than localhost, enabling clean microservice architectures where the LLM is a first-class networked dependency.

from openai import OpenAI

# Connect to Ollama
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not validated
)

# List available models
models = client.models.list()
for model in models.data:
    print(model.id)

# Chat completion
response = client.chat.completions.create(
    model="llama3.1",  # Must match installed model name
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(response.choices[0].message.content)
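The stream: true protocol mentioned above delivers each chunk as a server-sent event line of the form `data: {...}`, ending with `data: [DONE]`. The sketch below parses such lines by hand; the sample lines are hand-written to resemble a streamed response, and `iter_sse_content` is a hypothetical helper (the OpenAI SDK normally does this for you).

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that ends the stream
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the shape of a streamed chat completion.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Par"}}]}',
    'data: {"choices": [{"delta": {"content": "is"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

Seeing the raw format is mostly useful for debugging proxies or writing a client in a language without an OpenAI SDK.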

Creating Custom Models with Modelfile

# Modelfile
FROM llama3.1

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

# Set system prompt
SYSTEM """
You are a helpful coding assistant specialized in Python.
Always provide clear explanations with code examples.
"""

# Save as Modelfile, then:
# ollama create my-coding-assistant -f Modelfile
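When several variants of a model are needed, the Modelfile can be generated rather than written by hand. The following sketch renders the same three instructions used above (FROM, PARAMETER, SYSTEM); `render_modelfile` is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, parameters: dict, system_prompt: str) -> str:
    """Render a Modelfile using the FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base}", ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines += ["", 'SYSTEM """', system_prompt.strip(), '"""', ""]
    return "\n".join(lines)


modelfile = render_modelfile(
    base="llama3.1",
    parameters={"temperature": 0.7, "top_p": 0.9, "num_ctx": 8192},
    system_prompt="You are a helpful coding assistant specialized in Python.",
)
print(modelfile)
```

Writing the returned string to a file named Modelfile and running the `ollama create` command shown above would register the variant.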

vLLM: Production Deployment

vLLM is designed for high-throughput production workloads. Its PagedAttention algorithm manages GPU memory efficiently, supporting far more concurrent users than naive implementations. Unlike LM Studio or Ollama, which keep each request’s KV cache as a single contiguous block, PagedAttention divides the cache into fixed-size pages that need not be contiguous, analogous to virtual memory in an operating system. This dramatically reduces fragmentation and allows many more parallel requests to share the same GPU. vLLM also implements continuous batching, which means it can admit new requests as existing ones complete rather than waiting for a full batch to finish, reducing average latency under load by up to 20×.
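The paging idea can be illustrated with a toy allocator. This is a conceptual sketch only, not vLLM’s implementation: `PagedKVAllocator` is a hypothetical class that tracks which fixed-size blocks each sequence owns, showing how a growing sequence can reuse blocks freed by a finished one even when they are not adjacent.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:
            # Current blocks are full (or none exist yet): take any free block.
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    allocator.append_token("a")  # 3 tokens occupy 2 blocks
allocator.append_token("b")      # 1 token occupies 1 block
allocator.release("a")           # sequence "a" finishes, its blocks are freed
for _ in range(5):
    allocator.append_token("b")  # "b" grows into freed, non-adjacent blocks
print(allocator.block_tables)
```

Because a sequence only ever holds the blocks it has actually filled, memory is not reserved up front for the maximum context length, which is where the extra concurrency comes from.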

For teams deploying to a Kubernetes cluster with A100 or H100 GPUs, vLLM supports tensor parallelism across multiple devices via --tensor-parallel-size, and pipeline parallelism via --pipeline-parallel-size, enabling models too large for a single card to be served transparently. While vLLM requires Linux and CUDA, its Docker image makes it straightforward to run in any cloud environment alongside standard Kubernetes infrastructure.

Figure 3: Request flow through a local LLM server providing OpenAI-compatible API endpoints.

Installation and Startup

# Install vLLM
pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 8192

# For multi-GPU setups
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1
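If the server is launched from a deployment script, assembling the command as an argument list avoids shell quoting and line-continuation mistakes. This sketch uses only the flags shown above; `build_vllm_command` is a hypothetical helper, and it prints the command rather than starting the server.

```python
import shlex
from typing import List, Optional


def build_vllm_command(
    model: str,
    host: str = "0.0.0.0",
    port: int = 8000,
    tensor_parallel_size: int = 1,
    max_model_len: Optional[int] = None,
) -> List[str]:
    """Assemble the server start command as a list of arguments."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", host,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


cmd = build_vllm_command("meta-llama/Meta-Llama-3.1-8B-Instruct", max_model_len=8192)
print(shlex.join(cmd))
# To launch for real: subprocess.run(cmd, check=True)
```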

vLLM with Docker

# docker-compose.yml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model meta-llama/Meta-Llama-3.1-8B-Instruct
      --max-model-len 8192
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Using the OpenAI API Interface

The beauty of OpenAI-compatible APIs is code portability. Your application code works with any backend—local or cloud—by simply changing the base_url parameter. This design pattern is sometimes called backend-agnostic inference, and it is one of the most valuable architectural conventions to adopt early in an AI project. By abstracting the LLM behind a standard interface, you can start development with a fast, cheap local model, measure performance against a cloud model like GPT-4o or Claude 3.7, and then decide whether the quality delta justifies the added cost and privacy risk—all without touching your application logic.

The pattern also makes A/B testing straightforward: route a percentage of traffic to a local 70B model and compare outputs against a cloud model to build evidence-based deployment decisions. OpenAI’s API has effectively become the de facto standard here, and virtually every major open-source LLM project now ships an OpenAI-compatible endpoint as a first-class feature.

"""
Unified LLM client supporting local and cloud backends.
"""

from openai import OpenAI
from typing import Optional
from dataclasses import dataclass
from enum import Enum


class LLMBackend(Enum):
    """Supported LLM backends."""
    OPENAI = "openai"
    LM_STUDIO = "lm_studio"
    OLLAMA = "ollama"
    VLLM = "vllm"


@dataclass
class BackendConfig:
    """Configuration for an LLM backend."""
    base_url: str
    api_key: str
    default_model: str


BACKEND_CONFIGS = {
    LLMBackend.OPENAI: BackendConfig(
        base_url="https://api.openai.com/v1",
        api_key="sk-...",  # Set via environment
        default_model="gpt-4o"
    ),
    LLMBackend.LM_STUDIO: BackendConfig(
        base_url="http://localhost:1234/v1",
        api_key="lm-studio",
        default_model="local-model"
    ),
    LLMBackend.OLLAMA: BackendConfig(
        base_url="http://localhost:11434/v1",
        api_key="ollama",
        default_model="llama3.1"
    ),
    LLMBackend.VLLM: BackendConfig(
        base_url="http://localhost:8000/v1",
        api_key="vllm",
        default_model="meta-llama/Meta-Llama-3.1-8B-Instruct"
    ),
}


class UnifiedLLMClient:
    """
    Unified client for local and cloud LLMs.
    
    Provides consistent interface regardless of backend.
    """
    
    def __init__(
        self,
        backend: LLMBackend = LLMBackend.OLLAMA,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None
    ):
        """
        Initialize the LLM client.
        
        Args:
            backend: Which LLM backend to use
            api_key: Override default API key
            base_url: Override default base URL
        """
        config = BACKEND_CONFIGS[backend]
        
        self.client = OpenAI(
            base_url=base_url or config.base_url,
            api_key=api_key or config.api_key
        )
        self.default_model = config.default_model
        self.backend = backend
    
    def chat(
        self,
        messages: list[dict],
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: int = 1000,
        stream: bool = False,
        **kwargs
    ):
        """
        Send a chat completion request.
        
        Args:
            messages: List of message dictionaries
            model: Model to use (default: backend's default)
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate
            stream: Whether to stream the response
            **kwargs: Additional OpenAI API parameters
            
        Returns:
            Chat completion response or stream
        """
        response = self.client.chat.completions.create(
            model=model or self.default_model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            stream=stream,
            **kwargs
        )
        
        if stream:
            return self._handle_stream(response)
        return response
    
    def _handle_stream(self, response):
        """Handle streaming response."""
        for chunk in response:
            # Some backends send chunks with an empty choices list (e.g. usage stats)
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
    
    def complete(self, prompt: str, **kwargs) -> str:
        """Simple completion helper."""
        messages = [{"role": "user", "content": prompt}]
        response = self.chat(messages, **kwargs)
        return response.choices[0].message.content
    
    def embed(self, texts: list[str], model: Optional[str] = None) -> list[list[float]]:
        """
        Get embeddings (if backend supports it).
        
        Note: Not all local backends support embeddings.
        Consider using sentence-transformers for local embeddings.
        """
        response = self.client.embeddings.create(
            model=model or "text-embedding-3-small",
            input=texts
        )
        return [item.embedding for item in response.data]


# Example usage
if __name__ == "__main__":
    # Easy backend switching
    client = UnifiedLLMClient(backend=LLMBackend.OLLAMA)
    
    # Simple completion
    answer = client.complete("What is 2 + 2?")
    print(answer)
    
    # Chat with history
    messages = [
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "Explain the Pythagorean theorem."}
    ]
    
    response = client.chat(messages, temperature=0.5)
    print(response.choices[0].message.content)
    
    # Streaming
    print("\nStreaming response:")
    for chunk in client.chat(messages, stream=True):
        print(chunk, end="", flush=True)
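The A/B testing idea mentioned earlier needs a way to split traffic. One simple option, sketched below, is to hash a request or user ID so that the same ID always lands on the same backend. `choose_backend` and the 20% default share are illustrative assumptions, not part of any library.

```python
import hashlib


def choose_backend(request_id: str, local_share: float = 0.2) -> str:
    """Deterministically route a share of requests to the local backend."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "ollama" if bucket < local_share else "openai"


counts = {"ollama": 0, "openai": 0}
for i in range(10_000):
    counts[choose_backend(f"request-{i}")] += 1
print(counts)
```

The returned strings match the values of the LLMBackend enum above, so `LLMBackend(choose_backend(request_id))` would select the corresponding configuration.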

Extended Thinking and Reasoning

Modern reasoning models like DeepSeek R1 and Qwen3 support extended “thinking” phases where the model reasons through complex problems step by step. This capability is especially valuable for coding, math, and multi-step reasoning tasks. Rather than producing an answer in a single forward pass, these models generate an internal chain-of-thought—often wrapped in <think>...</think> tags—before committing to a final response, closely mirroring the “System 2” deliberate reasoning described in cognitive science literature. The result is significantly fewer hallucinations on arithmetic, logic puzzles, and structured data extraction tasks compared with standard autoregressive generation.

Running reasoning models locally is particularly compelling because the thinking trace can be many thousands of tokens long; with cloud APIs that charge per token, a single complex query could cost dollars, whereas locally the only cost is electricity and compute time. Temperature also plays a different role with reasoning models: values between 0.5 and 0.7 tend to produce the best results, since too much randomness disrupts the coherent reasoning chain while too little causes the model to converge on its first guess without adequately exploring alternatives.

Figure 4: Extended thinking flow where reasoning models produce internal thought processes before final answers.

"""
Support for reasoning models with thinking blocks.
"""

from openai import OpenAI
from dataclasses import dataclass
from typing import Optional
import re


@dataclass
class ReasoningResponse:
    """Response from a reasoning model."""
    thinking: str
    answer: str
    raw_response: str


def call_reasoning_model(
    client: OpenAI,
    prompt: str,
    model: str = "deepseek-r1:14b",
    enable_thinking: bool = True,
    max_thinking_tokens: int = 4000,
    max_answer_tokens: int = 2000
) -> ReasoningResponse:
    """
    Call a reasoning model that supports thinking blocks.
    
    Args:
        client: OpenAI client
        prompt: User prompt
        model: Model name (e.g., deepseek-r1, qwen3)
        enable_thinking: Whether to enable extended thinking
        max_thinking_tokens: Max tokens for thinking phase
        max_answer_tokens: Max tokens for final answer
        
    Returns:
        ReasoningResponse with thinking and answer separated
    """
    system_prompt = (
        "You are a helpful AI assistant. "
        "Think through problems step by step before providing your final answer."
    )
    
    if enable_thinking:
        # Some models use special tags for thinking
        system_prompt += (
            "\nWrap your internal reasoning in <think>...</think> tags. "
            "After thinking, provide your final answer outside these tags."
        )
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_thinking_tokens + max_answer_tokens,
        temperature=0.6  # Lower temp for reasoning
    )
    
    raw_content = response.choices[0].message.content
    
    # Parse thinking and answer
    thinking = ""
    answer = raw_content
    
    # Extract thinking blocks
    think_pattern = r'<think>(.*?)</think>'
    matches = re.findall(think_pattern, raw_content, re.DOTALL)
    
    if matches:
        thinking = "\n".join(matches)
        answer = re.sub(think_pattern, '', raw_content, flags=re.DOTALL).strip()
    
    return ReasoningResponse(
        thinking=thinking,
        answer=answer,
        raw_response=raw_content
    )


# Example usage
if __name__ == "__main__":
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    
    # Complex reasoning task
    result = call_reasoning_model(
        client,
        prompt="""
        A farmer has 17 sheep. All but 9 die. How many sheep does the farmer have left?
        """,
        model="deepseek-r1:14b"
    )
    
    print("=== Thinking Process ===")
    print(result.thinking)
    print("\n=== Final Answer ===")
    print(result.answer)
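One edge case worth handling: if generation stops at the token limit while the model is still reasoning, the closing tag never arrives and a parser that expects complete blocks would treat the unfinished reasoning as the answer. The sketch below is one possible way to tolerate that; `split_thinking` is a hypothetical helper and assumes the `<think>...</think>` convention described above.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> Tuple[str, str]:
    """Return (thinking, answer), tolerating an unclosed <think> tag."""
    thinking_parts = [part.strip() for part in THINK_BLOCK.findall(raw)]
    remainder = THINK_BLOCK.sub("", raw)
    if "<think>" in remainder:
        # Output was cut off mid-thought: everything after the tag is reasoning.
        before, _, unfinished = remainder.partition("<think>")
        thinking_parts.append(unfinished.strip())
        remainder = before
    return "\n".join(thinking_parts), remainder.strip()


print(split_thinking("<think>All but 9 die, so 9 remain.</think>The farmer has 9 sheep."))
print(split_thinking("<think>Let me count the sheep"))
```

An empty answer in the second case is a useful signal that the token budget was too small for the model to finish.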

Performance Optimization

Maximizing local LLM performance requires tuning several interrelated factors, and understanding how they interact can mean the difference between a painfully slow model and one that matches cloud-API responsiveness. VRAM is almost always the primary bottleneck: the model weights, the KV cache, and the activations must all fit simultaneously on the GPU, so every decision around quantization and context length is ultimately a VRAM budget allocation problem. Beyond raw memory, inference throughput is shaped by batch size, the precision of the compute kernels (FP16 vs. INT8 vs. INT4), the number of CPU threads used for pre/post-processing, and whether the model’s attention layers have been compiled with FlashAttention. Profiling your specific hardware before choosing a quantization level pays dividends: a Q4_K_M model that fully fits in VRAM will outperform a Q5_K_M model that spills a few layers to system RAM by a wide margin, because even a small amount of CPU ↔ GPU data transfer becomes a severe bottleneck at inference time.

Quantization

Quantization reduces model precision to save memory and increase speed, and it is the single most impactful optimisation available to local LLM users. The three major quantization ecosystems are GGUF (used by llama.cpp, LM Studio, and Ollama), AWQ (Activation-aware Weight Quantization, favoured by vLLM), and GPTQ (a post-training quantization method supported across most frameworks). GGUF is the most convenient for desktop use because the entire model, including tokenizer and metadata, is packed into a single portable file, while AWQ and GPTQ are typically preferred in GPU server environments for their superior quality-to-speed ratio at INT4 precision. The K-quants introduced in llama.cpp (e.g., Q4_K_M, Q5_K_M) use a mixed-precision scheme that keeps the tensors most sensitive to quantization error at a higher bit depth than the rest, giving a better quality/size tradeoff than older uniform quantizations at a similar average bit depth.

| Quantization | Memory (vs FP16) | Speed | Quality Loss |
|---|---|---|---|
| FP16 (Full) | 100% | Baseline | None |
| Q8_0 | ~53% | +20% | Minimal |
| Q5_K_M | ~37% | +40% | Low |
| Q4_K_M | ~30% | +50% | Moderate |
| Q2_K | ~18% | +70% | Significant |

Recommended Quantizations

  • Q4_K_M: Best balance of size, speed, and quality for most uses
  • Q5_K_M: When quality is more important than memory
  • Q8_0: When you have plenty of VRAM and want near-full quality

Context Length Management

def estimate_context_memory(
    model_params_b: float,
    context_length: int,
    hidden_size: int = 4096,
    num_layers: int = 32,
    quantization: str = "Q4_K_M"
) -> float:
    """
    Estimate total memory for the model weights plus the KV cache.
    
    Args:
        model_params_b: Model parameters in billions
        context_length: Desired context window
        hidden_size: Model hidden dimension
        num_layers: Number of transformer layers
        quantization: Quantization level
        
    Returns:
        Estimated memory in GB
    """
    quant_factors = {
        "FP16": 2.0, "Q8_0": 1.0, "Q5_K_M": 0.65,
        "Q4_K_M": 0.5, "Q2_K": 0.25
    }
    
    bytes_per_param = quant_factors.get(quantization, 0.5)
    
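    # KV cache: 2 (keys and values) * layers * hidden_size * context * 2 bytes (FP16).
    # Assumes full multi-head attention; models with grouped-query attention need less.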
    kv_cache_bytes = 2 * num_layers * hidden_size * context_length * 2
    kv_cache_gb = kv_cache_bytes / (1024**3)
    
    # Model weights
    model_gb = model_params_b * bytes_per_param
    
    return model_gb + kv_cache_gb


# Example
memory_8k = estimate_context_memory(7, 8192, quantization="Q4_K_M")
memory_32k = estimate_context_memory(7, 32768, quantization="Q4_K_M")

print(f"7B Q4 @ 8k context: {memory_8k:.1f} GB")
print(f"7B Q4 @ 32k context: {memory_32k:.1f} GB")
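The estimate above sizes the KV cache from the full hidden dimension, which fits models with standard multi-head attention. Many recent models use grouped-query attention (GQA), where several query heads share one key/value head, so the cache is smaller. The sketch below makes that explicit; the layer and head counts are assumptions chosen to resemble an 8B-class model, so check the actual values in the model’s configuration.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_length: int,
    bytes_per_value: int = 2,
) -> float:
    """KV cache size in GiB for one sequence (keys and values, every layer)."""
    values = 2 * num_layers * num_kv_heads * head_dim * context_length
    return values * bytes_per_value / 1024**3


# Assumed shapes: 32 layers and head_dim 128, with 32 KV heads (full
# multi-head attention) versus 8 KV heads (grouped-query attention).
print(f"MHA @ 8k: {kv_cache_gib(32, 32, 128, 8192):.1f} GiB")
print(f"GQA @ 8k: {kv_cache_gib(32, 8, 128, 8192):.1f} GiB")
```

With these assumed shapes the GQA cache is a quarter of the multi-head one, which is why long contexts are more affordable on GQA models than the simpler formula suggests.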

RAG Integration

Combining local LLMs with RAG creates powerful, private knowledge systems that allow organisations to query their own documents without ever sending proprietary text to an external API. The architecture described below pairs sentence-transformer embeddings—which run entirely on CPU or GPU locally—with a LanceDB vector store and an Ollama or LM Studio generation backend, so the entire pipeline from document ingestion to answer generation stays within your network perimeter. This is a direct local equivalent of cloud RAG services like Azure AI Search + GPT-4o or Amazon Bedrock + Knowledge Bases, but with zero per-request cost and full auditability of every stage. One practical advantage of local RAG is that you can inspect and modify the retrieved context before it reaches the LLM, implement custom re-ranking algorithms, or inject structured metadata—all without hitting API rate limits or incurring extra charges for embedding calls.

"""
Complete local RAG pipeline with local LLM.
"""

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import lancedb
from typing import List, Dict, Any


class LocalRAG:
    """
    Local RAG pipeline using local embeddings and LLM.
    
    Components:
    - Sentence Transformers for embeddings
    - LanceDB for vector storage
    - Ollama/LM Studio for generation
    """
    
    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        llm_base_url: str = "http://localhost:11434/v1",
        llm_model: str = "llama3.1",
        db_path: str = "./rag_db"
    ):
        # Local embeddings
        self.embedder = SentenceTransformer(embedding_model)
        
        # Local LLM
        self.llm = OpenAI(
            base_url=llm_base_url,
            api_key="local"
        )
        self.llm_model = llm_model
        
        # Vector database
        self.db = lancedb.connect(db_path)
        self.table = None
    
    def index_documents(self, documents: List[Dict[str, Any]], table_name: str = "docs"):
        """
        Index documents with embeddings.
        
        Args:
            documents: List of {"text": str, ...metadata...}
            table_name: Name for the vector table
        """
        # Generate embeddings
        texts = [doc["text"] for doc in documents]
        embeddings = self.embedder.encode(texts, show_progress_bar=True)
        
        # Add embeddings to documents
        for doc, emb in zip(documents, embeddings):
            doc["vector"] = emb.tolist()
        
        # Create/overwrite table
        self.table = self.db.create_table(table_name, documents, mode="overwrite")
        print(f"Indexed {len(documents)} documents")
    
    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        """Retrieve relevant documents."""
        if self.table is None:
            raise ValueError("No documents indexed")
        
        # Embed query
        query_embedding = self.embedder.encode(query)
        
        # Search
        results = self.table.search(query_embedding).limit(top_k).to_list()
        
        return results
    
    def generate(
        self,
        query: str,
        context_docs: List[Dict],
        temperature: float = 0.7
    ) -> str:
        """Generate answer using retrieved context."""
        # Build context string
        context = "\n\n".join([
            f"Document {i+1}:\n{doc['text']}"
            for i, doc in enumerate(context_docs)
        ])
        
        # Create prompt
        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer questions based on "
                    "the provided context. If the context doesn't contain "
                    "relevant information, say so."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
        
        response = self.llm.chat.completions.create(
            model=self.llm_model,
            messages=messages,
            temperature=temperature
        )
        
        return response.choices[0].message.content
    
    def query(self, question: str, top_k: int = 5) -> Dict:
        """
        Complete RAG query: retrieve and generate.
        
        Returns:
            Dict with answer and sources
        """
        # Retrieve
        docs = self.retrieve(question, top_k)
        
        # Generate
        answer = self.generate(question, docs)
        
        return {
            "question": question,
            "answer": answer,
            "sources": docs
        }


# Example usage
if __name__ == "__main__":
    rag = LocalRAG()
    
    # Sample documents
    documents = [
        {"text": "Python is a high-level programming language.", "source": "doc1"},
        {"text": "Machine learning uses algorithms to learn patterns.", "source": "doc2"},
        {"text": "Neural networks are inspired by biological brains.", "source": "doc3"},
    ]
    
    rag.index_documents(documents)
    
    result = rag.query("What is Python?")
    print(f"Answer: {result['answer']}")
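Because retrieval and generation are separate steps, a custom re-ranking stage can be slotted in between them. As a deliberately simple illustration, the sketch below reorders retrieved documents by keyword overlap with the query; `rerank_by_overlap` is a hypothetical helper, and a real system would more likely use a cross-encoder or a hybrid score.

```python
import re
from typing import Dict, List


def rerank_by_overlap(query: str, docs: List[Dict], top_k: int = 3) -> List[Dict]:
    """Reorder retrieved docs by the share of query words each one contains."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(doc: Dict) -> float:
        doc_words = set(re.findall(r"\w+", doc["text"].lower()))
        return len(query_words & doc_words) / max(len(query_words), 1)

    # sorted() is stable, so ties keep the vector-search order.
    return sorted(docs, key=score, reverse=True)[:top_k]


retrieved = [
    {"text": "Machine learning uses algorithms to learn patterns.", "source": "doc2"},
    {"text": "Python is a high-level programming language.", "source": "doc1"},
]
best = rerank_by_overlap("What is Python?", retrieved, top_k=1)
print(best[0]["source"])
```

In the LocalRAG class above, such a function could be applied to the output of retrieve() before the documents are passed to generate().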

Conclusion

Running LLMs locally has become remarkably accessible with tools like LM Studio, Ollama, and vLLM. The OpenAI-compatible API standard means your code can seamlessly switch between local and cloud models, giving you flexibility to choose the right backend for each use case. What once required a team of MLOps engineers and a rack of A100s can now be achieved on a single consumer GPU in an afternoon—a shift that fundamentally alters the economics of AI integration for small teams and enterprises alike.

The ecosystem has also matured rapidly: quantisation quality has improved to the point where a Q4_K_M 7B model can rival GPT-3.5 on narrow, domain-specific tasks when paired with a well-crafted system prompt and a good RAG retrieval pipeline. As open-source models continue to close the gap with frontier closed-source models, the argument for local-first deployment will only strengthen, particularly for privacy-sensitive industries and high-volume applications where API costs compound quickly.

Key Takeaways

  • LM Studio: Best for getting started with a user-friendly GUI
  • Ollama: Perfect for developers who prefer command-line tools
  • vLLM: The choice for production deployments requiring high throughput
  • Quantization: Q4_K_M offers the best balance for most applications
  • Code portability: OpenAI-compatible APIs let you switch backends freely

In the final article of this series, we’ll explore how to use Vision Language Models to describe charts and figures extracted from PDFs, completing our comprehensive RAG pipeline. Vision-capable models such as LLaVA, BakLLaVA, and Qwen-VL can accept raw image bytes alongside a text prompt, making it possible to automatically generate natural-language captions for every chart, table screenshot, or diagram in a document corpus without any manual annotation.

This closes the last gap in a fully local, fully private document intelligence system: text is extracted and chunked by PyMuPDF, embedded by Sentence Transformers, indexed in LanceDB, searched with hybrid retrieval, and now even the visual content is described by a locally running vision model before being fed into the LLM context window. If you’ve worked through the series from the beginning, you’ll have all the pieces needed to build a production-grade RAG system that runs entirely on your own hardware.

Artur Poniedziałek
IT Expert & Project Manager
🤖 AI ⚡ PM 🐍 Python 🖥️ Local AI

IT Expert & Project Manager with 15+ years of experience. Exploring practical AI applications — from local LLMs and RAG systems to workflow automation. Writing to share knowledge and inspire others to experiment with new technologies.
