Improving MCP tool schemas
to increase agent reliability


Pamela Fox

pamelafox.org

But first... let's have fun with MCP!

What if I could use MCP to pick my outfit for today?

An MCP server searching clothes images and rendering in VS Code Copilot

github.com/Azure-Samples/image-search-aisearch

Building MCP servers with Python

A basic FastMCP server

This tool signature:


@mcp.tool
async def add_expense(
    expense_date: str,
    amount: float,
    category: str,
    description: str,
) -> str:
    """Add a new expense."""
    ...
            

becomes this schema:


{
  "name": "add_expense",
  "description": "Add a new expense.",
  "inputSchema": {
    "properties": {
      "expense_date": {"type": "string"},
      "amount": {"type": "number"},
      "category": {"type": "string"},
      "description": {"type": "string"}
    },
    "required": ["expense_date", "amount",
      "category", "description"],
    "type": "object"
  }
}
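Every tool call the agent makes must satisfy this inputSchema. As a stdlib-only illustration (FastMCP actually validates arguments via Pydantic; `missing_required` is a hypothetical helper, not part of any SDK), here is a sample call payload checked against the schema's required list:

```python
# The generated inputSchema from above, plus arguments as an agent
# might send them in a tools/call request.
schema = {
    "name": "add_expense",
    "inputSchema": {
        "type": "object",
        "properties": {
            "expense_date": {"type": "string"},
            "amount": {"type": "number"},
            "category": {"type": "string"},
            "description": {"type": "string"},
        },
        "required": ["expense_date", "amount", "category", "description"],
    },
}

call_args = {
    "expense_date": "2026-03-07",
    "amount": 12.50,
    "category": "Food & drink",
    "description": "Lunch sandwich",
}

def missing_required(args: dict, tool_schema: dict) -> list[str]:
    """Return any required properties absent from the call arguments."""
    return [key for key in tool_schema["inputSchema"]["required"] if key not in args]

print(missing_required(call_args, schema))  # []
```

Note that nothing in this schema constrains *what* the category string contains, which is exactly the problem the next slides address.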
            

LLMs can be soo creative!

What does the model suggest for the category string?
Across 83 tool calls:

Gas Electronics Coffee Transportation Groceries Comida Gasolina Electrónica Clothing Apps Spa Entretenimiento Car purchase Dining Shoes Personal Care Lunch Dinner Ocio Zapatos Calzado Grocery Delivery Ropa y calzado Apps & Software

Rein those LLMs in with better schemas!

Annotate tool arguments with descriptions

Python code:


    category: Annotated[
        str,
        Field(
            description="Must be one of: "
            "Food & drink, "
            "Transit and Fuel, "
            "Media & streaming, "
            "Apparel and Beauty, "
            "Electronics & tech, "
            "Home and office, ..."
        ),
    ],
            

Generated schema:


"category": {
    "type": "string",
    "description": "Must be one of:
      Food & drink,
      Transit and Fuel,
      Media & streaming,
      Apparel and Beauty,
      Electronics & tech,
      Home and office, ..."
}
            

Use constrained types like Literal or Enum

Python code:


CATEGORY_LITERAL = Literal[
    "Food & drink", "Transit and Fuel",
    "Media & streaming", ...
]
    category: CATEGORY_LITERAL,
            

class Category(Enum):
    FOOD_AND_DRINK = "Food & drink"
    TRANSIT_AND_FUEL = "Transit and Fuel"
    ...

    category: Category,
            

Generated schema:


"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ]
}
            

Why not both?

Python code:


    category: Annotated[
        Category,
        Field(
            description=(
                "Choose the closest category. "
                "If truly unclear, use Misc.\n\n"
                "Heuristics: "
                "Food & drink=meals, coffee; "
                "Transit and Fuel=rideshare, "
                "gas, parking; ..."
            )
        ),
    ],
            

Generated schema:


"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ],
    "description": "Choose the closest
      category. If truly unclear, use
      Misc. Heuristics: Food & drink=
      meals, coffee; Transit and Fuel=
      rideshare, gas, parking; ..."
}
            

Do agents like stricter schemas?

PydanticAI agent with MCP server


server = MCPServerStreamableHTTP(url="http://localhost:8000/mcp")

model = OpenAIResponsesModel(
    "gpt-5.3-codex",
    provider=OpenAIProvider(openai_client=azure_openai_client))

agent = Agent(
    model,
    system_prompt=(
        "You help users log expenses. "
        f"Today's date is {datetime.now().strftime('%B %-d, %Y')}."
    ),
    output_type=str,
    toolsets=[server],
)

result = await agent.run("I bought a sandwich for $12.50.")
    

Filtering tool variants

The MCP server exposes multiple versions of the same tool with different schemas:


def add_expense_cat_b(category: Annotated[str, Field(description="...")], ...): ...

def add_expense_cat_c(category: Literal["Food & drink", ...], ...): ...

def add_expense_cat_d(category: ExpenseCategory, ...): ...

def add_expense_cat_e(category: Annotated[ExpenseCategory, Field(description="...")], ...): ...

    

The agent only sees one schema variant at a time, thanks to filtering:


toolset = server.filtered(
    lambda ctx, tool: tool.name == "add_expense_cat_b")
agent = Agent(model, toolsets=[toolset], ...)
result = await agent.run(case.prompt)
    

Batch evaluation across schema variants


EXPENSE_CASES = [
    ExpenseCase(
        name="clear_food_yesterday",
        prompt="Yesterday I bought a sandwich for $12.50.",
        expected_category="Food & drink",
        expected_date=get_yesterday(),
        expected_amount=12.50,
    ),
    ...  # 17 cases ➡️
]

def evaluate_category_match(tool_calls, expected):
    """Does the category match what we expected?"""
    for tc in tool_calls:
        category = tc.arguments.get("category")
        if category == expected:
            return EvalResult(passed=True, score=1.0)
    return EvalResult(passed=False, score=0.0)

for variant in ["cat_b", "cat_c", "cat_d", "cat_e"]:
    toolset = server.filtered(
        lambda ctx, tool, v=variant: tool.name == f"add_expense_{v}")
    agent = Agent(model, toolsets=[toolset], ...)
    for case in EXPENSE_CASES:
        result = await agent.run(case.prompt)
        evals = run_all_evaluations(
            result.tool_calls, case)
            
🗣️ Yesterday I bought a sandwich for $12.50.
🗣️ The Monday before this one... $12.50.
🗣️ Two Mondays ago... $8.75 coffee.
🗣️ First Monday of this month... $12.50.
🗣️ Last day of last month... $25.99 movie.
🗣️ Last business day... $60 gas.
🗣️ Day before yesterday... $4.50 coffee.
🗣️ Three days ago... $38 Uber.
🗣️ Last Friday... $18 movie ticket.
🗣️ Day after tomorrow... $20 bus pass.
🗣️ Yesterday... $65 Instacart delivery.
🗣️ Last day of last month... $79.99 headphones.
🗣️ Yesterday... car for 35000 USD.
🗣️ Yesterday... $0.99 for an app.
🗣️ Yesterday... $200 spa treatment.
🗣️ Yesterday... €50 on dinner.
🗣️ Ayer compré una laptop por $1200.

Running + reviewing evals

  1. Run batch evaluations → results.json + RESULTS.md

     uv run python evals/runner.py \
       --model gpt-4.1-mini \
       --seed 42 --temperature 0 \
       --output evals/runs/my_run

  2. Collaborate with GitHub Copilot on reviewing the results:

     "Summarize the most recent run"
     "Compare the runs across model X and Y, and summarize any differences you see in the results"
     "Highlight interesting failures"

Evals: Which category schema did best?

gpt-4.1-mini, 17 cases for each schema, with a Pydantic-AI agent:

                                           Annotated[str]  Literal  Enum   Annotated[Enum]
Was tool called?                           15/17           16/17    16/17  17/17
When called, did category match expected?  14/15           13/16    13/16  15/17
Schema size (avg tokens)                   374             412      424    836
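Schema size is measured here in tokens (the repo presumably uses a tokenizer for this); as a crude stdlib-only proxy, you can compare the lengths of the serialized schemas to see why enum values and descriptions add up:

```python
import json

def schema_chars(schema: dict) -> int:
    """Character count of the serialized schema: a crude size proxy
    for the token counts reported in the table above."""
    return len(json.dumps(schema))

bare = {"type": "string"}
with_enum = {
    "type": "string",
    "enum": ["Food & drink", "Transit and Fuel", "Media & streaming"],
}

print(schema_chars(bare), schema_chars(with_enum))
```

Every enum value and heuristic sentence is sent to the model on each request, so richer schemas trade tokens for accuracy.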

When did tool calling improve?

🗣️ "The day after tomorrow I will buy a bus pass for $20."
Annotated[str]
🤖 "Could you please specify the category?"
Annotated[Enum]
🤖 Transit and Fuel

When did category match improve?

🗣️ "I paid $0.99 for an app"
Literal
🤖 Apparel and Beauty
Annotated[Enum]
🤖 Electronics & tech
🗣️ "Last Friday I spent $18 on a movie ticket"
Annotated[str]
🤖 Arts and hobbies
Annotated[Enum]
🤖 Media & streaming

Date schema variants

Bare string

expense_date: str,

"expense_date": {
    "type": "string"
}

Annotated description

expense_date: Annotated[
    str, "Date in YYYY-MM-DD format"
],

"expense_date": {
    "description": "Date in YYYY-MM-DD format",
    "type": "string"
}

date type

expense_date: date,

"expense_date": {
    "format": "date",
    "type": "string"
}

Regex pattern

expense_date: Annotated[
    str,
    Field(pattern=r"^\d{4}-\d{2}-\d{2}$"),
],

"expense_date": {
    "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
    "type": "string"
}
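Three of these fragments can be previewed straight from Pydantic's `TypeAdapter` (v2; FastMCP builds its schemas on Pydantic, though its handling of bare-string annotations is its own), which is a quick way to check what an annotation will become before wiring it into a tool:

```python
from datetime import date
from typing import Annotated

from pydantic import Field, TypeAdapter

# Bare string -> {"type": "string"}
print(TypeAdapter(str).json_schema())

# date type -> {"format": "date", "type": "string"}
print(TypeAdapter(date).json_schema())

# Regex pattern -> adds "pattern" alongside "type": "string"
DatePattern = Annotated[str, Field(pattern=r"^\d{4}-\d{2}-\d{2}$")]
print(TypeAdapter(DatePattern).json_schema())
```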

Evals: Which date schema did best?

gpt-4.1-mini, 17 cases for each date schema, with a Pydantic-AI agent:

                          str    Annotated[str]  date   Field(pattern)
Was tool called?          17/17  17/17           17/17  17/17
Date match (of called)    12/17  12/17           12/17  12/17
Schema size (avg tokens)  326    406             414    423

Date schema made zero difference! All 4 variants produced identical results for every case.

Why? The parameter name expense_date already implies ISO 8601 format:

🗣️ "Yesterday I bought a sandwich for $12.50."
str (no hints at all!)
🤖 2026-03-07
Field(pattern) (regex)
🤖 2026-03-07

All 5 failures were date miscalculation on relative dates — not a schema issue:

🗣️ "Two Mondays ago I spent $8.75 on coffee."
str
🤖 2026-02-23
Field(pattern)
🤖 2026-02-23 ❌ (same wrong answer)

Do all models think alike?

Evals across models

Same 17 test cases, same Pydantic AI agent, different models:

Category

Did agent call the tool?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   15/17     17/17
Literal          17/17   16/17     17/17
Enum             17/17   16/17     17/17
Annotated[Enum]  17/17   17/17     17/17

When called, did category match expected?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   14/15     15/17
Literal          15/17   13/16     13/17
Enum             14/17   13/16     13/17
Annotated[Enum]  17/17   15/17     15/17

Date

Did agent call the tool?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             17/17   17/17     17/17
Annotated[str]  17/17   17/17     17/17
date            17/17   17/17     17/17
Field(pattern)  17/17   17/17     17/17

When called, did date match expected?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             15/17   12/17     17/17
Annotated[str]  15/17   12/17     17/17
date            15/17   12/17     17/17
Field(pattern)  15/17   12/17     17/17

Same prompt, different models

Category: "Yesterday I spent $200 on a spa treatment." with Annotated[Enum]

gpt-4o
🤖 Health & Fitness
gpt-4.1-mini
🤖 Apparel and Beauty
gpt-5.3-codex
🤖 Apparel and Beauty

Category: "Yesterday I bought a car for 35000 USD." with Annotated[Enum]

gpt-4o
🤖 Misc
(followed the "if unclear, use Misc" heuristic)
gpt-4.1-mini
🤖 Transit and Fuel
gpt-5.3-codex
🤖 Transit and Fuel

Date: "On the last day of last month I bought headphones for $79.99."

gpt-4o
🤖 2026-02-29
(Feb 29 doesn't exist!)
gpt-4.1-mini
🤖 2026-02-29
(still doesn't exist!)
gpt-5.3-codex
🤖 2026-02-28

Disagree with the "expected" answers?
Evals can also reveal ambiguities in your categories or test data!

Don't overthink it!

gpt-5.3-codex with Annotated[Enum], same 17 cases, varying reasoning effort:

                                  low    medium  high   xhigh
Did category match ground truth?  100%   88.2%   88.2%  88.2%
Did date match ground truth?      94.1%  94.1%   94.1%  94.1%
Schema size (average tokens)      862    890     939    1,114
Latency (average ms)              7,129  7,474   8,828  11,554

Why?

🗣️ "Yesterday I spent $200 on a spa treatment"
low
💭 "considering Health & Fitness or Apparel/Beauty... spa = personal care or wellness"
Health & Fitness
medium
💭 "personal care fits in Apparel and Beauty, while wellness aligns with Health & Fitness. Since a spa treatment feels more like a beauty or per..."
Apparel and Beauty

More reasoning → more overthinking → wrong answer!

Are all agent frameworks the same?

GitHub Copilot SDK agent with MCP server


client = CopilotClient()

session = await client.create_session(SessionConfig(
    model="gpt-5.3-codex",
    mcp_servers={
        "expenses": MCPRemoteServerConfig(
            type="http",
            url="http://localhost:8000/mcp",
            tools=["add_expense_cat_e"],
        )
    },
    system_message={
        "mode": "replace",
        "content": "You help users log expenses. "
            f"Today's date is {datetime.now().strftime('%B %-d, %Y')}.",
    },
))

await session.send_and_wait({"prompt": "I bought a sandwich for $12.50."})
    

Pydantic AI vs. GitHub Copilot SDK

Same model (gpt-5.3-codex), same 17 cases, same MCP server, different agent framework:

Was tool called at all?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   17/17        17/17
Literal          17/17        17/17
Enum             17/17        17/17
Annotated[Enum]  17/17        17/17

Did category match expected?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   15/17        15/17
Literal          13/17        13/17
Enum             13/17        13/17
Annotated[Enum]  15/17        15/17

Did date match expected?

Schema          Pydantic AI  Copilot SDK
str             17/17        17/17
Annotated[str]  17/17        17/17
date            17/17        17/17
Field(pattern)  17/17        17/17

Tools have return schemas too!

Return a plain string

Python code:


@mcp.tool
def get_expenses_a() -> str:
    """Get all expenses."""
    return "\n".join(
        f"Date: {e['date']}, Amount: ${e['amount']}, "
        f"Category: {e['category']}, ..."
        for e in expenses
    )
            

Example output:


Date: 2025-01-02, Amount: $4.50,
  Category: Food & drink,
  Description: Morning coffee
Date: 2025-01-02, Amount: $12.99,
  Category: Food & drink,
  Description: Lunch sandwich
Date: 2025-01-03, Amount: $45.00,
  Category: Transit and Fuel,
  Description: Gas station fill-up
...
            

Return a list of Pydantic models

Python code:


class Expense(BaseModel):
    """A single expense record."""
    expense_date: date = Field(
        alias="date",
        description="Date of the expense")
    amount: float = Field(
        description="Amount spent")
    category: str = Field(
        description="Category of expense")
    description: str = Field(
        description="Description of the expense")

@mcp.tool
def get_expenses_c() -> list[Expense]:
    """Get all expenses."""
    return [Expense(**e) for e in expenses]
            

Example output:


[
  {"date": "2025-01-02",
   "amount": 4.5,
   "category": "Food & drink",
   "description": "Morning coffee"},
  {"date": "2025-01-02",
   "amount": 12.99,
   ...},
  ...
]
            

Evals: Which output schema did best?

gpt-5.3-codex, 7 cases for each schema, with a Pydantic-AI agent:

🗣️ How many expenses are recorded in total?
🗣️ What is the most expensive expense?
🗣️ What is the cheapest expense?
🗣️ What is the date of the earliest expense?
🗣️ What category is the most expensive expense?
🗣️ Show the 3 most expensive as a table.
🗣️ Show all Electronics & tech as a table.
                     str          list[Expense]
Answered correctly?  7/7          7/7
Tool response size   6,297 chars  7,306 chars

Accuracy is identical, but structured output costs more in tool response size.

So why use it? For consumption by downstream MCP servers or agents.
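The size gap is easy to see by rendering the same records both ways; a stdlib sketch with hypothetical sample data standing in for the expense store:

```python
import json

# Hypothetical sample records standing in for the expense store.
expenses = [
    {"date": "2025-01-02", "amount": 4.5,
     "category": "Food & drink", "description": "Morning coffee"},
    {"date": "2025-01-02", "amount": 12.99,
     "category": "Food & drink", "description": "Lunch sandwich"},
]

# Variant A: plain string, one line per expense.
as_text = "\n".join(
    f"Date: {e['date']}, Amount: ${e['amount']}, "
    f"Category: {e['category']}, Description: {e['description']}"
    for e in expenses
)

# Variant C: structured output, as a list[Expense] would be serialized.
as_json = json.dumps(expenses)

# JSON repeats the quoted key names for every record, so it runs longer.
print(len(as_text), len(as_json))
```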

Conclusions

  • Structured schemas benefit your code: type safety, validation, IDE autocompletion, and fewer runtime errors
  • They may also improve model accuracy when calling tools
  • The only way to know is to do evals! Don't go on vibes alone (see Hamel's tweet recommending evals)

Thank you!

Slides:
pamelafox.github.io/py-ai-mcp-tool-schemas

Code:
github.com/pamelafox/py-ai-mcp-tool-schemas/

Questions? Find me online at:

Twitter: @pamelafox
Mastodon: @pamelafox@fosstodon.org
BlueSky: @pamelafox.bsky.social
GitHub: www.github.com/pamelafox
Website: pamelafox.org