Blog

  • Serving LLMs: Core Concepts for Beginners

    So, you’ve trained a Large Language Model (LLM) or picked up a powerful open-source one. That’s the first mountain climbed. Now comes the next, arguably more complex one: how do you actually serve this model to potentially thousands of users efficiently without setting your cloud budget on fire?

    LLMs are notoriously massive and hungry for computational resources. A naive approach to serving them will quickly lead to slow response times and skyrocketing costs. But over the last few years, a stack of brilliant optimization techniques has emerged, turning what was once impractical into a robust, scalable reality.

    The Foundation: Understanding the KV Cache

    At its core, an LLM generates text one token (roughly, a word or part of a word) at a time in a process called autoregression. To generate the next token, it needs the context of all the tokens that came before it.

    The most basic way to do this is also the most inefficient: for every new token, the model would re-process the entire sequence from the very beginning.

    This is where the KV Cache comes in, and it’s the single most important optimization for LLM inference. For contrast, first look at the naive baseline loop below, which re-feeds the whole sequence on every step:

    import time
    import torch

    # Baseline loop WITHOUT a KV Cache: every iteration re-feeds the entire
    # sequence, so the model recomputes attention over all previous tokens.
    # `inputs`, `tokenizer`, and `generate_token` (a full forward pass that
    # returns the argmax token id for the last position) are assumed to be
    # defined earlier, e.g. in the surrounding notebook.
    generated_tokens = []
    next_inputs = inputs
    durations_s = []
    for _ in range(10):
        t0 = time.time()
        next_token_id = generate_token(next_inputs)
        durations_s.append(time.time() - t0)

        # Append the new token to the (ever-growing) input sequence and
        # extend the attention mask to cover it.
        next_inputs = {
            "input_ids": torch.cat(
                [next_inputs["input_ids"], next_token_id.reshape((1, 1))],
                dim=1),
            "attention_mask": torch.cat(
                [next_inputs["attention_mask"], torch.tensor([[1]])],
                dim=1),
        }

        next_token = tokenizer.decode(next_token_id)
        generated_tokens.append(next_token)
    • What it is: In the “attention” mechanism of a Transformer, the model calculates three matrices from the input: a Query, a Key, and a Value. To generate a new token, its Query is compared against the Keys of all previous tokens to figure out “what to pay attention to,” and then a weighted sum of the Values is used to produce the output. The KV Cache simply stores the Key and Value matrices for all previous tokens so they don’t have to be recalculated every single time. (A toy sketch of this computation follows this list.)
    • The Analogy We Love: Think of it like taking notes in a meeting. To understand the last sentence spoken, you don’t need to re-listen to the entire meeting recording from the start. You just glance at your notes (the KV Cache).
    • A Point of Confusion: In code, you might see the attention_mask being manually updated in a loop. Why? The attention mask tells the model which tokens to pay attention to. As we generate a new token and add it to our KV Cache, we must also extend the mask to tell the model, “Hey, pay attention to this new token, too!”
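
    To make the Query/Key/Value mechanics concrete, here is a toy single decode step against a cache (shapes only, no batching or multiple heads; the names are illustrative):

    import math
    import torch

    d, t = 64, 10                      # head dimension, number of cached tokens
    q_new = torch.randn(1, d)          # Query for the token being generated
    K_cache = torch.randn(t, d)        # cached Keys of all previous tokens
    V_cache = torch.randn(t, d)        # cached Values of all previous tokens

    scores = q_new @ K_cache.T / math.sqrt(d)   # compare new Query to cached Keys
    weights = scores.softmax(dim=-1)            # "what to pay attention to"
    output = weights @ V_cache                  # weighted sum of cached Values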

    The result is a two-phase generation process:

    1. Prefill: The slow first step where the prompt is processed and the initial KV Cache is filled. This determines the “Time to First Token” (TTFT).
    2. Decode: The fast subsequent steps where each new token is generated quickly by re-using the cache. This determines the token-per-second throughput.
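
    For contrast with the baseline loop above, here is a minimal sketch of the same generation loop rewritten to use the KV Cache via Hugging Face’s past_key_values. It assumes the same model, tokenizer, and inputs objects as before, with a greedy argmax standing in for generate_token:

    import time
    import torch

    past_key_values = None       # the KV Cache, filled during the prefill step
    next_inputs = inputs         # first iteration processes the full prompt (prefill)
    generated_tokens = []
    durations_cached_s = []
    for _ in range(10):
        t0 = time.time()
        with torch.no_grad():
            outputs = model(**next_inputs,
                            past_key_values=past_key_values,
                            use_cache=True)
        past_key_values = outputs.past_key_values   # reuse cached Keys/Values
        next_token_id = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        durations_cached_s.append(time.time() - t0)

        # Decode phase: feed ONLY the new token; extend the mask to cover it.
        next_inputs = {
            "input_ids": next_token_id,
            "attention_mask": torch.cat(
                [next_inputs["attention_mask"], torch.tensor([[1]])],
                dim=1),
        }
        generated_tokens.append(tokenizer.decode(next_token_id[0]))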

    Scaling Up: Continuous Batching

    Okay, so we can serve one user efficiently. What happens when we have dozens of requests hitting our server at once? The obvious answer is to “batch” them—process them together to maximize GPU utilization. But how we batch makes all the difference.

    • The Problem with Static Batching: The simple approach is to group, say, 8 requests and process them as one batch. The catch? The entire batch is only as fast as its slowest request. If one user asks for a 500-token essay and seven others ask for a 10-token answer, those seven will finish quickly and then sit idle, wasting precious GPU cycles while the long request finishes.
    • The Analogy: It’s like a bus that must drive to the final destination of its last passenger before it can pick up anyone new, even if everyone else got off at the first stop.
    • The Solution: Continuous Batching. This is a smarter scheduling algorithm. As soon as any request in the batch finishes, its spot is immediately filled with a new request from the waiting queue. This keeps the GPU constantly fed with useful work. The result is dramatically higher throughput and lower average latency for all users.
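
    A toy scheduler sketch of this idea (not tied to any real serving framework; decode_step and is_finished stand in for actual engine calls):

    from collections import deque

    def serve(requests, max_batch_size, decode_step, is_finished):
        """Toy continuous-batching loop: finished requests are evicted and their
        slots are refilled from the waiting queue after every decode step."""
        waiting, active = deque(requests), []
        while waiting or active:
            # Refill any free slots before the next step.
            while waiting and len(active) < max_batch_size:
                active.append(waiting.popleft())
            # One decode step for every sequence currently in the batch.
            for seq in active:
                decode_step(seq)
            # Evict finished sequences immediately; nobody waits for the
            # longest request in the batch (the static-batching problem).
            active = [seq for seq in active if not is_finished(seq)]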

    Fine-tuning with LoRA

    LoRA is an efficient fine-tuning technique that adapts a model’s behavior by injecting small, trainable “adapter” layers into an existing model, without altering the model’s original, massive weights.

    Imagine we set up a model with a hidden_size of 1024. The weight matrix W of the model.linear layer is a 1024 x 1024 matrix, containing over 1 million parameters (1024 × 1024 = 1,048,576).

    If you were to perform traditional fine-tuning on this model, you would need to update all of these 1 million+ parameters, which consumes significant computational resources. Furthermore, for each new task you fine-tune, you need to save a new, full-sized copy of the model.

    The LoRA Solution: Don’t Modify, Just Add

    LoRA’s approach is not to modify the original weights W, but to “freeze” them and add a “bypass” path alongside.

    Create low-rank matrices:  It creates two very “thin” matrices: lora_a (shape 1024 x 2) and lora_b (shape 2 x 1024). The number 2 here is the “rank.”

    Calculate the update: These two small matrices are multiplied (W2 = lora_a @ lora_b) to produce an “update matrix” that has the same shape (1024 x 1024) as the original weight matrix W.

    Combine the results: During the forward pass, the model’s output is the sum of two parts:

    • Output from the original path: base_output = X @ W
    • Output from the LoRA bypass: lora_output = X @ lora_a @ lora_b
    • Final result: total_output = base_output + lora_output

    This shows that we only need to add less than 0.4% of the original parameter count to simulate a full update of the entire weight matrix. During fine-tuning, we only train these 0.4% of parameters while the millions of original parameters remain frozen. This dramatically saves computational and storage resources.
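
    A minimal sketch of the LoRA bypass described above, assuming hidden_size = 1024 and rank = 2 (initializing lora_b to zero is the usual convention so training starts from the unmodified base model):

    import torch

    hidden_size, rank = 1024, 2
    X = torch.randn(1, hidden_size)                  # input activations
    W = torch.randn(hidden_size, hidden_size)        # frozen base weight (~1M params)
    lora_a = torch.randn(hidden_size, rank) * 0.01   # trainable, 1024 x 2
    lora_b = torch.zeros(rank, hidden_size)          # trainable, 2 x 1024

    base_output = X @ W                    # original (frozen) path
    lora_output = X @ lora_a @ lora_b      # low-rank bypass
    total_output = base_output + lora_output

    trainable = lora_a.numel() + lora_b.numel()      # 4,096 parameters
    print(trainable / W.numel())                     # ~0.0039, i.e. < 0.4%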

    Multi-LoRA

    Imagine a cloud service platform that uses a powerful foundation model (like Llama 3). Now, there are hundreds or thousands of customers, each of whom has fine-tuned this foundation model with their own data using LoRA to adapt it to their specific tasks (e.g., Customer A uses it for customer service chats, Customer B for summarizing legal documents, Customer C for generating marketing copy, etc.).

    Now, the server receives a batch of requests, which contains requests from different customers. This means that for each request in the batch, we need to apply a different LoRA adapter.

    The most naive (and worst) approach would be to deploy a separate model instance for each customer. This would immediately exhaust GPU memory and make the cost prohibitively high.

    The purpose of Multi-LoRA is to solve this exact problem: it allows you to load just one copy of the giant foundation model into memory, and then load hundreds or thousands of tiny LoRA adapters (the A/B matrices) alongside it. When processing a batch of requests, the system can dynamically apply the correct LoRA adapter for each individual request within the batch.

    Method 1: Looping

    The LoopMultiLoraModel class demonstrates the most intuitive method: using a for loop to iterate through each request in the batch.

    • In each step of the loop, it finds the corresponding lora_a and lora_b matrices for the current request using lora_indices.
    • It then performs the LoRA computation for that single request: y[batch_idx] += x[batch_idx] @ lora_a @ lora_b.

    Disadvantage: Python loops, especially in high-performance computing scenarios, are notoriously inefficient. They cannot fully leverage the massively parallel processing capabilities of a GPU, leading to higher latency as the batch size grows. The first chart in the script clearly illustrates this: the latency grows linearly with the batch size, indicating poor performance.
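
    A rough sketch in the spirit of that looping approach (the tensor names and shapes are illustrative, not the exact LoopMultiLoraModel class):

    import torch

    def loop_lora(x, base_weight, loras_a, loras_b, lora_indices):
        # x: (batch, hidden); loras_a: (num_loras, hidden, rank);
        # loras_b: (num_loras, rank, hidden); lora_indices: (batch,)
        y = x @ base_weight                         # shared base-model path
        for batch_idx, lora_idx in enumerate(lora_indices.tolist()):
            lora_a = loras_a[lora_idx]              # this request's adapter
            lora_b = loras_b[lora_idx]
            y[batch_idx] += x[batch_idx] @ lora_a @ lora_b
        return y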

    Method 2: Gathering and Vectorizing

    The GatheredMultiLoraModel class demonstrates the efficient implementation.

    • Instead of a loop, it uses a critical operation: torch.index_select. In a single step, this operation “gathers” all the required LoRA weights (loras_a and loras_b) for every request in the batch into new tensors, based on the lora_indices.
    • It then performs a single, batch-wide matrix multiplication: y += x @ lora_a @ lora_b. Here, x, lora_a, and lora_b are all tensors containing the data for the entire batch.

    Advantage: This operation is highly vectorized and can be processed in parallel at high speed by PyTorch on the GPU. The second chart in the script proves this: even as the batch size increases, the latency growth is far flatter than the looping method, resulting in much higher performance.
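
    And a sketch of the gathered, vectorized version using torch.index_select (same illustrative shapes as the loop sketch above; lora_indices must be a LongTensor):

    import torch

    def gathered_lora(x, base_weight, loras_a, loras_b, lora_indices):
        y = x @ base_weight                                      # (batch, hidden)
        lora_a = torch.index_select(loras_a, 0, lora_indices)    # (batch, hidden, rank)
        lora_b = torch.index_select(loras_b, 0, lora_indices)    # (batch, rank, hidden)
        # One batched matmul for the whole batch instead of a Python loop.
        y += (x.unsqueeze(1) @ lora_a @ lora_b).squeeze(1)
        return y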

    Production Systems Like LoRAX

    A production-grade serving framework like Predibase’s LoRAX is what you get when you put all these concepts together into a single, polished system.

    1. It measures TTFT and throughput, proving it uses a KV Cache.
    2. It handles concurrent requests of different lengths, proving it uses Continuous Batching.
    3. It serves different models via an adapter_id parameter, proving it’s a Multi-LoRA server.
  • MCP(3): Advanced Agentic Flows

    Reversing the Flow – Sampling

    Sampling is a client-side capability that allows an MCP server to make a request to the host application, asking it to generate an LLM completion. This creates a bidirectional cognitive loop: the host’s LLM can call a server tool, the server can perform some logic, and then the server can call back to the host’s LLM for further reasoning or analysis.   

    This capability is what enables patterns like ReAct (Reason-Act). A server can perform an action (Act), receive a result, and then use sampling to have the LLM Reason about that result to determine the next action. This is far more powerful than a simple, one-way tool call and is fundamental to creating agents that can plan and adapt.

    Due to its power, sampling also carries risks. A malicious server could attempt to abuse this feature. Therefore, the protocol includes critical safeguards: hosts MUST obtain explicit user consent before fulfilling any sampling request, and the protocol intentionally limits the server’s visibility into the final prompt that is actually sent to the LLM, giving the host and user ultimate control.
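
    As a concrete illustration, here is a minimal server-side sketch using the MCP Python SDK’s sampling support (the server name, tool, and prompt text are illustrative, and exact types can vary between SDK versions):

    # server.py (sketch): a tool that hands data back to the host's LLM for reasoning.
    from mcp.server.fastmcp import Context, FastMCP
    from mcp.types import SamplingMessage, TextContent

    mcp = FastMCP("LogAnalysisServer")

    @mcp.tool()
    async def summarize_errors(log_text: str, ctx: Context) -> str:
        """Asks the host's LLM to summarize critical errors in the given log text."""
        result = await ctx.session.create_message(
            messages=[
                SamplingMessage(
                    role="user",
                    content=TextContent(
                        type="text",
                        text=f"Please summarize the critical errors from the following log data:\n{log_text}",
                    ),
                )
            ],
            max_tokens=300,
        )
        return result.content.text if result.content.type == "text" else str(result.content)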

    Use Cases for Server-Driven Automation

    The ability for a server to request LLM reasoning opens up a new class of applications. Instead of the AI only calling tools, the server can actively ask the AI to think, essentially flipping the flow and enabling what is known as server-driven automation.

    • Reflective Analysis: A server with a tool that fetches a large, unstructured piece of data (e.g., a 10,000-line log file or a complex financial report) can use sampling to offload the analysis to a powerful reasoning model. After fetching the data, the server could send a sampling request to the host: “Please summarize the critical errors from the following log data: [log data]”. The server gets back a concise, structured summary instead of having to parse the raw data itself.
    • Dynamic Sub-task Generation: Imagine a “Project Manager” MCP server. It could have a single tool, execute_project(goal: str). When called, instead of having hardcoded logic, it could use sampling to ask the host’s LLM: “Given the goal ‘{goal}’, break it down into a series of smaller, executable steps using the available tools.” The LLM might return a JSON list of sub-tasks, which the server can then execute in sequence, creating a dynamic and adaptive workflow.
    • Mentorship and Second Opinions: A server can act as a specialized agent that leverages the host’s LLM for its “expertise.” The mentor-mcp-server example demonstrates this pattern: it provides a tool that accepts a block of code. Internally, it uses sampling to send a request to the host’s powerful reasoning model (like Claude 3.5 Sonnet or GPT-4o) with the prompt: “Provide a detailed critique of this code’s design, highlighting potential improvements and security vulnerabilities”. The server acts as an orchestrator for a specialized task, using the host’s general intelligence.

    The Developer’s Toolkit: Debugging and Troubleshooting

    Effective debugging is crucial for developing robust MCP servers.

    An effective debugging workflow must adopt a “bottom-up” approach: first, ensure the server works perfectly in isolation using the MCP Inspector, and only then move to troubleshoot its integration within a host application. This systematic process saves developers hours of frustration by allowing them to isolate the source of a problem efficiently.   

    The MCP Inspector: Your Essential Debugging Partner

    The MCP Inspector is the most critical tool in a developer’s arsenal. It is a standalone web-based UI that acts as a client, allowing you to connect to and interact with your MCP server directly, without needing a full host application like Cursor or Claude. It is the “Postman for MCP.”   

    To use it, you can run a single command from your terminal, pointing it to your server’s execution command.

    Bash

    # For a Node.js server built to dist/index.js
    npx @modelcontextprotocol/inspector node dist/index.js
    
    # For a Python server
    npx @modelcontextprotocol/inspector python server.py
    

    This command starts a local web server (typically on port 6274) and connects to your MCP server. The Inspector’s UI allows you to:   

    • Verify Connection: Immediately see if the Inspector successfully established a connection and handshake with your server.
    • View Capabilities: Inspect the server’s metadata and the list of Tools, Resources, and Prompts it advertises.
    • Test Tools: Select a tool, enter its arguments in a JSON editor, and execute it. This is invaluable for testing your tool’s logic, input validation, and error handling in a controlled environment.   
    • Browse Resources: List and read the content of available resources to ensure they are being served correctly.
    • Test Prompts: Invoke prompts with different arguments to check the generated output.

    Common Pitfalls and Solutions

    The following table serves as a practical checklist for diagnosing the most common issues developers encounter when building and integrating MCP servers.

    | Symptom | Diagnostic Step | What to Check |
    | --- | --- | --- |
    | Server not appearing in Host | 1. Check Host Logs | Look for connection errors or JSON parsing errors in the config file. |
    | | 2. Verify Server Process | Use ps aux or Task Manager to see if the server command is running. |
    | | 3. Check Host Config | Ensure the command and args have absolute paths. Verify JSON syntax. |
    | Tool not appearing in Host | 1. Test with Inspector | Connect the Inspector to your server. Does the tool appear in the “Tools” tab? |
    | | 2. Check Server Logs (stderr) | Look for errors during server initialization or tool registration. |
    | | 3. Review Tool Definition | Is the tool correctly registered (e.g., with @mcp.tool())? Is the name correct? |
    | Tool call fails | 1. Test with Inspector | Execute the tool in the Inspector with the exact same arguments the LLM tried to use. |
    | | 2. Check Input Validation | Is your tool handler correctly parsing the arguments? Add logs to check received parameters. |
    | | 3. Check Environment Variables | Does the tool rely on an API key that isn’t set in the host config env block? |
    | General Issues | 1. Use Structured Logging | Add logging with timestamps and request IDs to your server code (stderr). |
    | | 2. Isolate with Inspector | Always test new functionality with the Inspector before testing in the full Host application. |

    Real-World Integration Patterns and Architectures

    While building a single server is a great starting point, the true power of MCP is realized when it’s used to architect complex, production-grade AI systems. Adopting MCP is not just about connecting a single LLM to a single tool; it is a strategic decision that can lead to a more flexible, scalable, and future-proof “composable enterprise” architecture. This approach, where business capabilities are exposed as interchangeable, AI-consumable services, mirrors the philosophy of microservices but is tailored for the agentic AI era.   

    MCP wraps complex APIs into structured, natural-language-describable tools, allowing AI to use API functionality like calling a function, without needing to understand the underlying technical details.

    The Wrapper Pattern: Modernizing Legacy Systems

    The most common and immediately valuable integration pattern is using an MCP server as a modern, AI-friendly “façade” or “wrapper” over existing systems. This allows organizations to bring their legacy infrastructure into the AI ecosystem without costly refactoring.

    • API Wrapper: An MCP server can wrap an internal REST or GraphQL API, exposing a curated set of critical endpoints as well-described tools. Instead of forcing an LLM to understand a complex OpenAPI specification, the server provides simple, natural-language-friendly functions like create_user_ticket(title: str, description: str). (A hypothetical sketch of this pattern follows this list.)
    • Database Wrapper: A server can provide a secure, natural language interface to a SQL database. This is a powerful pattern for business intelligence and data analysis. The server can expose a high-risk run_sql_query(sql: str) tool for expert users, or safer, more abstract tools like find_customer_by_name(name: str) that construct and execute SQL queries internally.   
    • SaaS Wrapper: The vast and growing ecosystem of public MCP servers for platforms like GitHub, Slack, Google Drive, Sentry, and Stripe are prime examples of this pattern. They provide a standardized way for any agent to interact with these popular services.   
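
    A hypothetical sketch of the API Wrapper pattern (the ticketing endpoint, payload, and response fields are assumptions, not part of the original post):

    import httpx
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("TicketingServer")

    @mcp.tool()
    def create_user_ticket(title: str, description: str) -> str:
        """Creates a support ticket in the internal ticketing system."""
        response = httpx.post(
            "https://tickets.internal.example.com/api/v1/tickets",  # assumed endpoint
            json={"title": title, "description": description},
            timeout=10.0,
        )
        response.raise_for_status()
        return f"Created ticket {response.json()['id']}"  # assumed response schema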

    The Multi-Server Orchestrator Pattern

    This pattern demonstrates the composability of MCP. A single, sophisticated AI agent can orchestrate multiple, specialized MCP servers to accomplish a complex, multi-step task. In this model, the Host application (or an agentic framework like LangGraph or Google’s Agent Development Kit running within the host) acts as the central coordinator.   

    Consider a workflow where a user prompts an agent: “Review the latest ‘Project Phoenix’ spec from Google Drive, create a new feature branch in our Git repository based on the spec’s title, and then post a summary to the #phoenix-dev Slack channel.”

    1. The agent, running in the Host, discovers tools from three separate, specialized MCP servers: google-drive-mcp, git-mcp, and slack-mcp.
    2. It first calls the search_files tool on the Google Drive server to find the specification document and reads its content.
    3. Next, it calls the create_branch tool on the Git server, using information extracted from the document for the branch name.
    4. Finally, it calls the post_message tool on the Slack server to provide a status update.

    The Host manages the state and context between each tool call, seamlessly chaining together the capabilities of independent servers to execute a workflow.
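
    A rough host-side sketch of the plumbing for such a workflow, using the MCP Python client SDK (server commands, tool names, and arguments are assumptions; in a real host the LLM, not hardcoded logic, decides which tools to call and with what arguments):

    import asyncio
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def call_tool(command: str, args: list[str], tool: str, arguments: dict):
        """Starts one stdio MCP server, calls a single tool, and returns the result."""
        params = StdioServerParameters(command=command, args=args)
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                return await session.call_tool(tool, arguments)

    async def run_workflow():
        # 1. Find and read the spec via the Google Drive server; the host's LLM
        #    would extract the title/branch name from this result.
        await call_tool("node", ["google-drive-mcp.js"], "search_files",
                        {"query": "Project Phoenix spec"})
        # 2. Create the feature branch via the Git server.
        await call_tool("node", ["git-mcp.js"], "create_branch",
                        {"name": "feature/project-phoenix"})
        # 3. Post a status update via the Slack server.
        await call_tool("node", ["slack-mcp.js"], "post_message",
                        {"channel": "#phoenix-dev",
                         "text": "Created feature/project-phoenix from the Phoenix spec."})

    asyncio.run(run_workflow())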

  • MCP(2): Building Server Capabilities: The Three Primitives

    Tools: Giving the LLM Superpowers

    # In your server.py
    from mcp.server.fastmcp import FastMCP
    from typing import Literal
    
    mcp = FastMCP("CalculatorServer", version="1.0.0")
    
    @mcp.tool()
    def calculate(
        x: float,
        y: float,
        operation: Literal["add", "subtract", "multiply", "divide"],
    ) -> float:
        """
        Performs a basic arithmetic operation on two numbers.
        """
        if operation == "add":
            return x + y
        elif operation == "subtract":
            return x - y
        elif operation == "multiply":
            return x * y
        elif operation == "divide":
            if y == 0:
                # This error will be caught by the framework and sent as a structured error response.
                raise ValueError("Cannot divide by zero.")
            return x / y

    Exposing and Managing Resources: Providing Context

    Resources are read-only data sources that an LLM can access to gain context for its tasks. They are analogous to GET endpoints in a REST API and should be designed to be free of side effects.

    Static Resource: Exposing a fixed piece of data, like a project’s README.md file.

    # In your server.py
    @mcp.resource("docs://readme", title="Project README")
    def get_readme() -> str:
        """Returns the content of the project's README file."""
        with open("README.md", "r") as f:
            return f.read()

    Dynamic (Templated) Resource: Exposing data that changes based on parameters in the URI. This allows the client to request specific pieces of information.

    # In your server.py
    # This assumes a dictionary `user_profiles` exists.
    user_profiles = {
        "123": {"name": "Alice", "role": "Engineer"},
        "456": {"name": "Bob", "role": "Designer"},
    }
    
    @mcp.resource("users://{user_id}/profile")
    def get_user_profile(user_id: str) -> dict:
        """Returns the profile for a given user ID."""
        return user_profiles.get(user_id, {})

    In this example, a client can request users://123/profile to get Alice’s data.

    Integration Pattern: Managing Persistent Connections

    A common and critical real-world use case is providing access to a database. Managing the database connection lifecycle—opening it on startup and closing it gracefully on shutdown—is essential for a robust server. The Python SDK provides a lifespan context manager for this exact purpose.   

    This example demonstrates how to connect to a SQLite database and expose its data as a resource.

    # In your server.py
    import sqlite3
    from contextlib import asynccontextmanager
    from collections.abc import AsyncIterator
    from dataclasses import dataclass
    from mcp.server.fastmcp import FastMCP
    
    # Define a context object to hold the database connection
    @dataclass
    class AppContext:
        db_conn: sqlite3.Connection
    
    # Define the lifespan context manager
    @asynccontextmanager
    async def db_lifespan(server: FastMCP) -> AsyncIterator[AppContext]:
        """Manages the database connection lifecycle."""
        print("Connecting to database...")
        conn = sqlite3.connect("my_database.db")
        try:
            # Yield the connection to be used by handlers
            yield AppContext(db_conn=conn)
        finally:
            # Ensure the connection is closed on shutdown
            print("Closing database connection...")
            conn.close()
    
    # Initialize the server with the lifespan manager
    mcp = FastMCP("DatabaseServer", version="1.0.0", lifespan=db_lifespan)
    
    # Define a tool to query the database
    @mcp.tool()
    def query_products(max_price: float) -> list[dict]:
        """Queries the products table for items below a max price."""
        # Get the context, which includes the lifespan context
        ctx = mcp.get_context()
        # Access the database connection from the lifespan context
        db_conn = ctx.request_context.lifespan_context.db_conn
        
        cursor = db_conn.cursor()
        cursor.execute("SELECT id, name, price FROM products WHERE price <=?", (max_price,))
        products = [{"id": row, "name": row, "price": row} for row in cursor.fetchall()]
        return products

    Creating Reusable Prompts: Standardizing Workflows

    Prompts are user-controlled, parameterized templates that guide the LLM to perform a task in a specific, optimized way. They are useful for encapsulating complex instruction sets or standardizing common workflows.

    The Python SDK allows you to register prompts with parameters, similar to tools:

    # In your server.py
    from typing import Literal

    from mcp.server.fastmcp import FastMCP
    from mcp.server.fastmcp.prompts import base
    
    mcp = FastMCP("PromptServer", version="1.0.0")
    
    @mcp.prompt(title="Code Review Assistant")
    def review_code_prompt(code: str, focus: Literal["style", "performance", "security"]) -> list[base.Message]:
        """
        Generates a structured prompt for reviewing code with a specific focus.
        """
        # Illustrative messages; adjust the wording to your own review workflow.
        return [
            base.UserMessage(f"Please review the following code with a focus on {focus}:"),
            base.UserMessage(code),
            base.AssistantMessage("Understood. Here is my review:"),
        ]

    In this example, a client can invoke review_code_prompt and provide both the code to be reviewed and the focus of the review. The server then constructs a multi-part prompt to send to the LLM.

    It is important to note a practical caveat: while AI agents can often discover and use Tools autonomously, some host applications may require the user to explicitly select a Prompt from a list rather than the agent choosing it automatically. This makes Prompts better suited for user-initiated, standardized tasks.

  • MCP(1): From Protocol to Production

    Deconstructing the MCP Architecture: The “USB-C for AI”

    Before MCP, connecting AI models to the vast landscape of external tools, databases, and APIs was a significant engineering challenge.

    This challenge is often described as the “M×N integration problem”: for M different AI models and N different tools, a developer might need to build and maintain M×N unique, bespoke connectors. This combinatorial explosion resulted in brittle, high-maintenance systems that were difficult to scale, secure, or test, ultimately slowing the pace of innovation.


    Hosts, Clients, and Servers

    🔹 Host: The Front Door to AI

    The Host is the part of the app you actually see and use — like the AI chat on your desktop, or a smart code editor like Cursor. It talks to the AI model, handles your questions, and brings in tools if needed.
    Examples: Claude Desktop, Cursor IDE, the Hugging Face Python SDK, or other apps that embed an MCP client.

    🔹 Client: The Middleman

    Inside the Host, there’s a hidden helper called the Client. This part handles the connection to a specific server; each Client maintains a dedicated one-to-one connection with a single Server.

    🔹 Server: The Toolbox
    An external program or service that exposes capabilities (Tools, Resources, Prompts) via the MCP protocol. This could be a REST API, a SQL database, a local filesystem, or any other tool or data source. The server’s job is to expose these external capabilities through the standardized MCP interface by offering one or more of the protocol’s three main primitives: Tools, Resources, and Prompts.

    Your First MCP Server: A “Hello, World!”

    Building and Configuring the Server

    This guide will now walk through the creation of a local stdio server.

    Step 1: Create Project and Install Dependencies Open a terminal and create a new project directory. Then, use uv to initialize a virtual environment and install the MCP Python SDK.

    # Create and enter the project directory
    mkdir my_mcp_server
    cd my_mcp_server
    
    # Create and activate a virtual environment
    uv venv
    source .venv/bin/activate
    
    # Install the MCP SDK with command-line interface tools
    uv add "mcp[cli]"

    Step 2: Write the Server Code Create a new file named server.py. This script will use the FastMCP class from the SDK, which provides a high-level, decorator-based interface for building servers.

    # server.py
    from mcp.server.fastmcp import FastMCP
    
    # 1. Initialize the MCP server
    mcp = FastMCP("HelloWorldServer", version="0.1.0")
    
    # 2. Define a tool using the @mcp.tool() decorator
    @mcp.tool()
    def greet(name: str) -> str:
        """
        A simple tool that returns a greeting to the given name.
        """
        return f"Hello, {name}! Welcome to the world of MCP."
    
    # 3. This block allows the script to be run directly
    if __name__ == "__main__":
        mcp.run()

    This minimal server defines a single tool named greet that accepts one string argument, name.

    Step 3: Configure the Host Application

    The Host application (e.g., Cursor or Claude Desktop) needs to be told how to find and run this server. This is done via a JSON configuration file. For a project-specific server in Cursor, this file would be located at .cursor/mcp.json within your project’s root directory.

    Create this file and add the following configuration. Crucially, you must replace the placeholder paths with the absolute paths on your system.

    //.cursor/mcp.json
    {
      "mcpServers": {
        "helloServer": {
          "command": "/path/to/your/project/my_mcp_server/.venv/bin/python",
          "args": ["/path/to/your/project/my_mcp_server/server.py"],
          "env": {
            "LOG_LEVEL": "DEBUG"
          }
        }
      }
    }

    "helloServer": A unique name for your server within the host.

    "command": The absolute path to the Python executable inside your virtual environment.

    "args": The absolute path to your server.py script.

    "env": An optional object for passing environment variables to your server process, useful for API keys or configuration flags.

    MCP vs. Tool Calling

    Discoverability: How the system finds out what tools it can use

    |  | Tool Calling | MCP |
    | --- | --- | --- |
    | How it works | Tools are hardcoded. The system only knows fixed tools. | Tools are discovered at runtime. The system learns as it runs. |
    | Flexibility | Low – adding a new tool requires code changes. | High – new tools can be added without changing the client. |
    | Analogy | You have a fixed menu saved on your phone. | You walk into a café and get the latest menu handed to you. |
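
    Runtime discovery is easy to see in code. Here is a minimal client-side sketch (assuming the “Hello, World!” server.py from earlier in this post): the client simply asks the server what it offers.

    import asyncio
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def list_server_tools():
        params = StdioServerParameters(command="python", args=["server.py"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.list_tools()   # discovered at runtime
                for tool in result.tools:
                    print(tool.name, "-", tool.description)

    asyncio.run(list_server_tools())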

    Interaction Type: How the system communicates with tools

    |  | Tool Calling | MCP |
    | --- | --- | --- |
    | How it works | One-off request and response. No memory of past actions. | Continuous, two-way conversation. Remembers context. |
    | Good for | Simple, one-step tasks. | Complex tasks that need follow-up or multiple steps. |
    | Analogy | Like texting a bot: ask once, get an answer, end. | Like a phone call: you talk, respond, and continue the conversation. |