So, you’ve trained a Large Language Model (LLM) or picked up a powerful open-source one. That’s the first mountain climbed. Now comes the next, arguably more complex one: how do you actually serve this model to potentially thousands of users efficiently without setting your cloud budget on fire?
LLMs are notoriously massive and hungry for computational resources. A naive approach to serving them will quickly lead to slow response times and skyrocketing costs. But over the last few years, a stack of brilliant optimization techniques has emerged, turning what was once impractical into a robust, scalable reality.
The Foundation: Understanding the KV Cache
At its core, an LLM generates text one token (roughly, a word or part of a word) at a time in a process called autoregression. To generate the next token, it needs the context of all the tokens that came before it.
The most basic way to do this is also the most inefficient: for every new token, the model would re-process the entire sequence from the very beginning.
The snippet below is that naive loop in code. Notice how the full, ever-growing `input_ids` tensor (and its `attention_mask`) is fed back through the model on every iteration:
```python
import time

import torch

# Assumes `model`, `tokenizer`, and `inputs` (a dict holding "input_ids" and
# "attention_mask" tensors) were created earlier, e.g. with Hugging Face
# transformers. Each call re-runs the *entire* sequence through the model.
def generate_token(batch):
    with torch.no_grad():
        logits = model(**batch).logits
    return logits[:, -1, :].argmax(dim=-1)  # greedy choice of the next token

generated_tokens = []
next_inputs = inputs
durations_s = []
for _ in range(10):
    t0 = time.time()
    next_token_id = generate_token(next_inputs)
    durations_s += [time.time() - t0]
    # Naive, cache-free decoding: append the new token and feed the whole
    # growing sequence (plus an extended attention mask) on the next step.
    next_inputs = {
        "input_ids": torch.cat(
            [next_inputs["input_ids"], next_token_id.reshape((1, 1))],
            dim=1),
        "attention_mask": torch.cat(
            [next_inputs["attention_mask"], torch.tensor([[1]])],
            dim=1),
    }
    next_token = tokenizer.decode(next_token_id)
    generated_tokens.append(next_token)
```
This is where the KV Cache comes in, and it’s the single most important optimization for LLM inference.
- What it is: In the “attention” mechanism of a Transformer, the model calculates three matrices from the input: a Query, a Key, and a Value. To generate a new token, its Query is compared against the Keys of all previous tokens to figure out “what to pay attention to,” and then a weighted sum of the Values is used to produce the output. The KV Cache simply stores the Key and Value matrices for all previous tokens so they don’t have to be recalculated every single time.
- The Analogy We Love: Think of it like taking notes in a meeting. To understand the last sentence spoken, you don’t need to re-listen to the entire meeting recording from the start. You just glance at your notes (the KV Cache).
- A Point of Confusion: In code, you might see the `attention_mask` being manually updated in a loop. Why? The attention mask tells the model which tokens to pay attention to. As we generate a new token and add it to our KV Cache, we must also extend the mask to tell the model, “Hey, pay attention to this new token, too!”
The result is a two-phase generation process:
- Prefill: The slow first step where the prompt is processed and the initial KV Cache is filled. This determines the “Time to First Token” (TTFT).
- Decode: The fast subsequent steps, where each new token is generated by re-using the cache. This determines the tokens-per-second throughput.
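For contrast, here is a minimal sketch of the same greedy loop with the cache enabled. It assumes the same `model`, `tokenizer`, and `inputs` as above and a Hugging Face-style causal LM that returns and accepts `past_key_values`; during decode, only the newest token is fed through the model while the cache supplies the Keys and Values of everything before it.

```python
import torch

with torch.no_grad():
    # Prefill: run the whole prompt once and fill the KV Cache.
    outputs = model(**inputs, use_cache=True)
    past_key_values = outputs.past_key_values
    next_token_id = outputs.logits[:, -1, :].argmax(dim=-1)

    attention_mask = inputs["attention_mask"]
    generated_tokens = []
    for _ in range(10):
        generated_tokens.append(tokenizer.decode(next_token_id))
        # Decode: feed ONLY the newest token; the cached Keys/Values cover
        # the rest of the sequence. The mask still grows by one each step.
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), dtype=attention_mask.dtype)],
            dim=1)
        outputs = model(
            input_ids=next_token_id.reshape((1, 1)),
            attention_mask=attention_mask,
            past_key_values=past_key_values,
            use_cache=True)
        past_key_values = outputs.past_key_values
        next_token_id = outputs.logits[:, -1, :].argmax(dim=-1)
```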
Scaling Up: Continuous Batching
Okay, so we can serve one user efficiently. What happens when we have dozens of requests hitting our server at once? The obvious answer is to “batch” them—process them together to maximize GPU utilization. But how we batch makes all the difference.
- The Problem with Static Batching: The simple approach is to group, say, 8 requests and process them as one batch. The catch? The entire batch is only as fast as its slowest request. If one user asks for a 500-token essay and seven others ask for a 10-token answer, those seven will finish quickly and then sit idle, wasting precious GPU cycles while the long request finishes.
- The Analogy: It’s like a bus that must drive to the final destination of its last passenger before it can pick up anyone new, even if everyone else got off at the first stop.
- The Solution: Continuous Batching. This is a smarter scheduling algorithm. As soon as any request in the batch finishes, its spot is immediately filled with a new request from the waiting queue. This keeps the GPU constantly fed with useful work. The result is dramatically higher throughput and lower average latency for all users.
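To make the scheduling idea concrete, here is a minimal, framework-free sketch. The `Request` class and its `step()` method are hypothetical stand-ins for “run one decode step and report whether the request finished”; the point is the slot-refill logic, not a real inference engine.

```python
from collections import deque

class Request:
    """Hypothetical request: step() fakes one decode step, True when done."""
    def __init__(self, prompt, max_new_tokens):
        self.prompt = prompt
        self.remaining = max_new_tokens

    def step(self):
        self.remaining -= 1          # stand-in for generating one token
        return self.remaining <= 0   # finished?

def continuous_batching(waiting: deque, max_batch_size: int = 8):
    active = []
    while waiting or active:
        # The key difference from static batching: refill free slots
        # immediately instead of waiting for the whole batch to drain.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One decode step for every active request (a single batched
        # forward pass on a real server), dropping the ones that finished.
        active = [req for req in active if not req.step()]

# Seven short requests and one 500-token essay share the same batch, and the
# short ones free their slots for new arrivals as soon as they finish.
queue = deque(Request(f"prompt {i}", max_new_tokens=500 if i == 0 else 10)
              for i in range(8))
continuous_batching(queue)
```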
Fine-tuning with LoRA
LoRA is an efficient fine-tuning technique that adapts a model’s behavior by injecting small, trainable “adapter” layers into an existing model, without altering the model’s original, massive weights.
Imagine we set up a model with a `hidden_size` of 1024. The weight matrix `W` of the `model.linear` layer is a `1024 x 1024` matrix, containing over 1 million parameters (1024 × 1024 = 1,048,576).
If you were to perform traditional fine-tuning on this model, you would need to update all of these 1 million+ parameters, which consumes significant computational resources. Furthermore, for each new task you fine-tune, you need to save a new, full-sized copy of the model.
The LoRA Solution: Don’t Modify, Just Add
LoRA’s approach is not to modify the original weights `W`, but to “freeze” them and add a “bypass” path alongside.
- Create low-rank matrices: It creates two very “thin” matrices: `lora_a` (shape `1024 x 2`) and `lora_b` (shape `2 x 1024`). The number 2 here is the “rank.”
- Calculate the update: These two small matrices are multiplied (`W2 = lora_a @ lora_b`) to produce an “update matrix” that has the same shape (`1024 x 1024`) as the original weight matrix `W`.
- Combine the results: During the forward pass, the model’s output is the sum of two parts:
  - Output from the original path: `base_output = X @ W`
  - Output from the LoRA bypass: `lora_output = X @ lora_a @ lora_b`
  - Final result: `total_output = base_output + lora_output`
This shows that we only need to add less than 0.4% of the original parameter count to simulate a full update of the entire weight matrix. During fine-tuning, we train only this 0.4% of the parameters while the millions of original parameters remain frozen. This dramatically saves computational and storage resources.
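Here is a minimal PyTorch sketch of the arithmetic above. The `LoRALinear` name is illustrative rather than any particular library’s API; the `1024 x 1024` base weight is frozen and only the rank-2 `lora_a` / `lora_b` matrices stay trainable.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank bypass (illustrative)."""
    def __init__(self, hidden_size=1024, rank=2):
        super().__init__()
        self.base = nn.Linear(hidden_size, hidden_size, bias=False)
        self.base.weight.requires_grad_(False)  # freeze the original W
        self.lora_a = nn.Parameter(torch.randn(hidden_size, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, hidden_size))

    def forward(self, x):
        base_output = x @ self.base.weight.T          # original frozen path
        lora_output = x @ self.lora_a @ self.lora_b   # low-rank bypass
        return base_output + lora_output

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} = {trainable / total:.2%}")
# trainable: 4096 / 1052672 = 0.39%
```

Initializing `lora_b` to zeros means the bypass contributes nothing at the start of fine-tuning, so training begins from the unmodified base model’s behavior.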
Multi-LoRA
Imagine a cloud service platform that uses a powerful foundation model (like Llama 3). Now, there are hundreds or thousands of customers, each of whom has fine-tuned this foundation model with their own data using LoRA to adapt it to their specific tasks (e.g., Customer A uses it for customer service chats, Customer B for summarizing legal documents, Customer C for generating marketing copy, etc.).
Now, the server receives a batch of requests, which contains requests from different customers. This means that for each request in the batch, we need to apply a different LoRA adapter.
The most naive (and worst) approach would be to deploy a separate full model instance for each customer. This would immediately exhaust GPU memory, making the cost prohibitively high.
The purpose of Multi-LoRA is to solve this exact problem: it allows you to load just one copy of the giant foundation model into memory, and then load hundreds or thousands of tiny LoRA adapters (the A/B matrices) alongside it. When processing a batch of requests, the system can dynamically apply the correct LoRA adapter for each individual request within the batch.
Method 1: Looping
The `LoopMultiLoraModel` class demonstrates the most intuitive method: using a `for` loop to iterate through each request in the batch.
- In each step of the loop, it finds the corresponding `lora_a` and `lora_b` matrices for the current request using `lora_indices`.
- It then performs the LoRA computation for that single request: `y[batch_idx] += x[batch_idx] @ lora_a @ lora_b`.
Disadvantage: Python loops, especially in high-performance computing scenarios, are notoriously inefficient. They cannot fully leverage the massively parallel processing capabilities of a GPU, leading to higher latency as the batch size grows. The first chart in the script clearly illustrates this: the latency grows linearly with the batch size, indicating poor performance.
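The original class isn’t reproduced here, but the following self-contained sketch (with made-up sizes and a random placeholder for the base layer’s output) shows the shape of the looping approach:

```python
import torch

batch_size, hidden_size, rank, num_loras = 8, 1024, 2, 16

# One stack of A and B matrices per adapter (i.e. per customer).
loras_a = torch.randn(num_loras, hidden_size, rank)
loras_b = torch.randn(num_loras, rank, hidden_size)

x = torch.randn(batch_size, 1, hidden_size)             # one token per request
lora_indices = torch.randint(num_loras, (batch_size,))  # adapter id per request
y = x @ torch.randn(hidden_size, hidden_size)           # shared base-model output

# Looping approach: apply each request's adapter one at a time.
for batch_idx in range(batch_size):
    idx = lora_indices[batch_idx]
    lora_a = loras_a[idx]   # (hidden_size, rank)
    lora_b = loras_b[idx]   # (rank, hidden_size)
    y[batch_idx] += x[batch_idx] @ lora_a @ lora_b
```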
Method 2: Gathering and Vectorizing
The `GatheredMultiLoraModel` class demonstrates the efficient implementation.
- Instead of a loop, it uses a critical operation: `torch.index_select`. This operation can, in a single step, **“gather”** all the required LoRA weights (`loras_a` and `loras_b`) for every request in the batch into new tensors, based on the `lora_indices`.
- It then performs a single, batch-wide matrix multiplication: `y += x @ lora_a @ lora_b`. Here, `x`, `lora_a`, and `lora_b` are all tensors containing the data for the entire batch.
Advantage: This operation is highly vectorized and can be processed in parallel at high speed by PyTorch on the GPU. The second chart in the script proves this: even as the batch size increases, the latency growth is far flatter than the looping method, resulting in much higher performance.
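Under the same made-up setup as the loop sketch above, the gathered version replaces the Python loop with one `torch.index_select` and a single batched matrix multiplication:

```python
import torch

batch_size, hidden_size, rank, num_loras = 8, 1024, 2, 16

loras_a = torch.randn(num_loras, hidden_size, rank)
loras_b = torch.randn(num_loras, rank, hidden_size)

x = torch.randn(batch_size, 1, hidden_size)
lora_indices = torch.randint(num_loras, (batch_size,))
y = x @ torch.randn(hidden_size, hidden_size)  # shared base-model output

# Gather every request's adapter weights in one shot ...
lora_a = torch.index_select(loras_a, 0, lora_indices)  # (batch, hidden, rank)
lora_b = torch.index_select(loras_b, 0, lora_indices)  # (batch, rank, hidden)

# ... then apply them all at once with batched matmuls instead of a loop.
y += x @ lora_a @ lora_b
```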
Production Systems Like LoRAX
A production-grade serving framework like Predibase’s LoRAX is what you get when you put all these concepts together into a single, polished system.
- It reports TTFT and token throughput separately, reflecting the prefill/decode split that the KV Cache enables.
- It keeps latency low for concurrent requests of very different lengths, which is exactly what Continuous Batching delivers.
- It serves many fine-tuned variants of one base model via an `adapter_id` request parameter, which is Multi-LoRA in action.