Improving MCP tool schemas
to increase agent reliability


Pamela Fox

pamelafox.org

But first... let's have fun with MCP!

What if I could use MCP to pick my outfit for today?

An MCP server searching clothes images and rendering in VS Code Copilot

github.com/Azure-Samples/image-search-aisearch

Building MCP servers with Python

A basic FastMCP server

This tool signature:


@mcp.tool
async def add_expense(
    expense_date: str,
    amount: float,
    category: str,
    description: str,
) -> str:
    """Add a new expense."""
    ...
            

becomes this schema:


{
  "name": "add_expense",
  "description": "Add a new expense.",
  "inputSchema": {
    "properties": {
      "expense_date": {"type": "string"},
      "amount": {"type": "number"},
      "category": {"type": "string"},
      "description": {"type": "string"}
    },
    "required": ["expense_date", "amount",
      "category", "description"],
    "type": "object"
  }
}
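Every tool call the agent makes must satisfy this inputSchema. As a stdlib-only illustration (FastMCP actually validates arguments via Pydantic; `missing_required` is a hypothetical helper, not part of any SDK), here is a sample call payload checked against the schema's required list:

```python
# The generated inputSchema from above, plus arguments as an agent
# might send them in a tools/call request.
schema = {
    "name": "add_expense",
    "inputSchema": {
        "type": "object",
        "properties": {
            "expense_date": {"type": "string"},
            "amount": {"type": "number"},
            "category": {"type": "string"},
            "description": {"type": "string"},
        },
        "required": ["expense_date", "amount", "category", "description"],
    },
}

call_args = {
    "expense_date": "2026-03-07",
    "amount": 12.50,
    "category": "Food & drink",
    "description": "Lunch sandwich",
}

def missing_required(args: dict, tool_schema: dict) -> list[str]:
    """Return any required properties absent from the call arguments."""
    return [key for key in tool_schema["inputSchema"]["required"] if key not in args]

print(missing_required(call_args, schema))  # []
```

Note that nothing in this schema constrains *what* the category string contains, which is exactly the problem the next slides address.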
            

LLMs can be soo creative!

What does the model suggest for the category string?
Across 83 tool calls:

Gas Electronics Coffee Transportation Groceries Comida Gasolina Electrónica Clothing Apps Spa Entretenimiento Car purchase Dining Shoes Personal Care Lunch Dinner Ocio Zapatos Calzado Grocery Delivery Ropa y calzado Apps & Software

Rein those LLMs in with better schemas!

Annotate tool arguments with descriptions

Python code:


    category: Annotated[
        str,
        Field(
            description="Must be one of: "
            "Food & drink, "
            "Transit and Fuel, "
            "Media & streaming, "
            "Apparel and Beauty, "
            "Electronics & tech, "
            "Home and office, ..."
        ),
    ],
            

Generated schema:


"category": {
    "type": "string",
    "description": "Must be one of:
      Food & drink,
      Transit and Fuel,
      Media & streaming,
      Apparel and Beauty,
      Electronics & tech,
      Home and office, ..."
}
            

Use constrained types like Literal or Enum

Python code:


CATEGORY_LITERAL = Literal[
    "Food & drink", "Transit and Fuel",
    "Media & streaming", ...
]
    category: CATEGORY_LITERAL,
            

class Category(Enum):
    FOOD_AND_DRINK = "Food & drink"
    TRANSIT_AND_FUEL = "Transit and Fuel"
    ...

    category: Category,
            

Generated schema:


"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ]
}
            

Why not both?

Python code:


    category: Annotated[
        Category,
        Field(
            description=(
                "Choose the closest category. "
                "If truly unclear, use Misc.\n\n"
                "Heuristics: "
                "Food & drink=meals, coffee; "
                "Transit and Fuel=rideshare, "
                "gas, parking; ..."
            )
        ),
    ],
            

Generated schema:


"category": {
    "type": "string",
    "enum": [
        "Food & drink",
        "Transit and Fuel",
        "Media & streaming", ...
    ],
    "description": "Choose the closest
      category. If truly unclear, use
      Misc. Heuristics: Food & drink=
      meals, coffee; Transit and Fuel=
      rideshare, gas, parking; ..."
}
            

Do agents like stricter schemas?

PydanticAI agent with MCP server


server = MCPServerStreamableHTTP(url="http://localhost:8000/mcp")

model = OpenAIResponsesModel(
    "gpt-5.3-codex",
    provider=OpenAIProvider(openai_client=azure_openai_client))

agent = Agent(
    model,
    system_prompt=(
        "You help users log expenses. "
        f"Today's date is {datetime.now().strftime('%B %-d, %Y')}."
    ),
    output_type=str,
    toolsets=[server],
)

result = await agent.run("I bought a sandwich for $12.50.")
    

Filtering tool variants

The MCP server exposes multiple versions of the same tool with different schemas:


def add_expense_cat_b(category: Annotated[str, Field(description="...")], ...): ...

def add_expense_cat_c(category: Literal["Food & drink", ...], ...): ...

def add_expense_cat_d(category: ExpenseCategory, ...): ...

def add_expense_cat_e(category: Annotated[ExpenseCategory, Field(description="...")], ...): ...

    

The agent only sees one schema variant at a time, thanks to filtering:


toolset = server.filtered(
    lambda ctx, tool: tool.name == "add_expense_cat_b")
agent = Agent(model, toolsets=[toolset], ...)
result = await agent.run(case.prompt)
    

Batch evaluation across schema variants


EXPENSE_CASES = [
    ExpenseCase(
        name="clear_food_yesterday",
        prompt="Yesterday I bought a sandwich for $12.50.",
        expected_category="Food & drink",
        expected_date=get_yesterday(),
        expected_amount=12.50,
    ),
    ...  # 17 cases ➡️
]

def evaluate_category_match(tool_calls, expected):
    """Does the category match what we expected?"""
    for tc in tool_calls:
        category = tc.arguments.get("category")
        if category == expected:
            return EvalResult(passed=True, score=1.0)
    return EvalResult(passed=False, score=0.0)

for variant in ["cat_b", "cat_c", "cat_d", "cat_e"]:
    toolset = server.filtered(
        lambda ctx, tool, v=variant: tool.name == f"add_expense_{v}")
    agent = Agent(model, toolsets=[toolset], ...)
    for case in EXPENSE_CASES:
        result = await agent.run(case.prompt)
        evals = run_all_evaluations(
            result.tool_calls, case)
            
🗣️ Yesterday I bought a sandwich for $12.50.
🗣️ The Monday before this one... $12.50.
🗣️ Two Mondays ago... $8.75 coffee.
🗣️ First Monday of this month... $12.50.
🗣️ Last day of last month... $25.99 movie.
🗣️ Last business day... $60 gas.
🗣️ Day before yesterday... $4.50 coffee.
🗣️ Three days ago... $38 Uber.
🗣️ Last Friday... $18 movie ticket.
🗣️ Day after tomorrow... $20 bus pass.
🗣️ Yesterday... $65 Instacart delivery.
🗣️ Last day of last month... $79.99 headphones.
🗣️ Yesterday... car for 35000 USD.
🗣️ Yesterday... $0.99 for an app.
🗣️ Yesterday... $200 spa treatment.
🗣️ Yesterday... €50 on dinner.
🗣️ Ayer compré una laptop por $1200.

Running + reviewing evals

  1. Run batch evaluations → results.json + RESULTS.md

     uv run python evals/runner.py \
       --model gpt-4.1-mini \
       --seed 42 --temperature 0 \
       --output evals/runs/my_run

  2. Collaborate with GitHub Copilot on reviewing the results:

     "Summarize the most recent run"
     "Compare the runs across model X and Y, and summarize any differences you see in the results"
     "Highlight interesting failures"

Evals: Which category schema did best?

gpt-4.1-mini, 17 cases for each schema, with a Pydantic-AI agent:

                                           Annotated[str]  Literal  Enum   Annotated[Enum]
Was tool called?                           15/17           16/17    16/17  17/17
When called, did category match expected?  14/15           13/16    13/16  15/17
Schema size (avg tokens)                   374             412      424    836
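Schema size is measured here in tokens (the repo presumably uses a tokenizer for this); as a crude stdlib-only proxy, you can compare the lengths of the serialized schemas to see why enum values and descriptions add up:

```python
import json

def schema_chars(schema: dict) -> int:
    """Character count of the serialized schema: a crude size proxy
    for the token counts reported in the table above."""
    return len(json.dumps(schema))

bare = {"type": "string"}
with_enum = {
    "type": "string",
    "enum": ["Food & drink", "Transit and Fuel", "Media & streaming"],
}

print(schema_chars(bare), schema_chars(with_enum))
```

Every enum value and heuristic sentence is sent to the model on each request, so richer schemas trade tokens for accuracy.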

When did tool calling improve?

🗣️ "The day after tomorrow I will buy a bus pass for $20."
Annotated[str]
🤖 "Could you please specify the category?"
Annotated[Enum]
🤖 Transit and Fuel

When did category match improve?

🗣️ "I paid $0.99 for an app"
Literal
🤖 Apparel and Beauty
Annotated[Enum]
🤖 Electronics & tech
🗣️ "Last Friday I spent $18 on a movie ticket"
Annotated[str]
🤖 Arts and hobbies
Annotated[Enum]
🤖 Media & streaming

Date schema variants

Bare string

expense_date: str,

"expense_date": {
    "type": "string"
}

Annotated description

expense_date: Annotated[
    str, "Date in YYYY-MM-DD format"
],

"expense_date": {
    "description": "Date in YYYY-MM-DD format",
    "type": "string"
}

date type

expense_date: date,

"expense_date": {
    "format": "date",
    "type": "string"
}

Regex pattern

expense_date: Annotated[
    str,
    Field(pattern=r"^\d{4}-\d{2}-\d{2}$"),
],

"expense_date": {
    "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
    "type": "string"
}
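Three of these fragments can be previewed straight from Pydantic's `TypeAdapter` (v2; FastMCP builds its schemas on Pydantic, though its handling of bare-string annotations is its own), which is a quick way to check what an annotation will become before wiring it into a tool:

```python
from datetime import date
from typing import Annotated

from pydantic import Field, TypeAdapter

# Bare string -> {"type": "string"}
print(TypeAdapter(str).json_schema())

# date type -> {"format": "date", "type": "string"}
print(TypeAdapter(date).json_schema())

# Regex pattern -> adds "pattern" alongside "type": "string"
DatePattern = Annotated[str, Field(pattern=r"^\d{4}-\d{2}-\d{2}$")]
print(TypeAdapter(DatePattern).json_schema())
```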

Evals: Which date schema did best?

gpt-4.1-mini, 17 cases for each date schema, with a Pydantic-AI agent:

                          str    Annotated[str]  date   Field(pattern)
Was tool called?          17/17  17/17           17/17  17/17
Date match (of called)    12/17  12/17           12/17  12/17
Schema size (avg tokens)  326    406             414    423

Date schema made zero difference! All 4 variants produced identical results for every case.

Why? The parameter name expense_date already implies ISO 8601 format:

🗣️ "Yesterday I bought a sandwich for $12.50."
str (no hints at all!)
🤖 2026-03-07
Field(pattern) (regex)
🤖 2026-03-07

All 5 failures were date miscalculation on relative dates — not a schema issue:

🗣️ "Two Mondays ago I spent $8.75 on coffee."
str
🤖 2026-02-23
Field(pattern)
🤖 2026-02-23 ❌ (same wrong answer)

Do all models think alike?

Evals across models

Same 17 test cases, same Pydantic AI agent, different models:

Category

Did agent call the tool?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   15/17     17/17
Literal          17/17   16/17     17/17
Enum             17/17   16/17     17/17
Annotated[Enum]  17/17   17/17     17/17

When called, did category match expected?

Schema           gpt-4o  4.1-mini  5.3-codex (med)
Annotated[str]   17/17   14/15     15/17
Literal          15/17   13/16     13/17
Enum             14/17   13/16     13/17
Annotated[Enum]  17/17   15/17     15/17

Date

Did agent call the tool?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             17/17   17/17     17/17
Annotated[str]  17/17   17/17     17/17
date            17/17   17/17     17/17
Field(pattern)  17/17   17/17     17/17

When called, did date match expected?

Schema          gpt-4o  4.1-mini  5.3-codex (med)
str             15/17   12/17     17/17
Annotated[str]  15/17   12/17     17/17
date            15/17   12/17     17/17
Field(pattern)  15/17   12/17     17/17

Same prompt, different models

Category: "Yesterday I spent $200 on a spa treatment." with Annotated[Enum]

gpt-4o
🤖 Health & Fitness
gpt-4.1-mini
🤖 Apparel and Beauty
gpt-5.3-codex
🤖 Apparel and Beauty

Category: "Yesterday I bought a car for 35000 USD." with Annotated[Enum]

gpt-4o
🤖 Misc
(followed the "if unclear, use Misc" heuristic)
gpt-4.1-mini
🤖 Transit and Fuel
gpt-5.3-codex
🤖 Transit and Fuel

Date: "On the last day of last month I bought headphones for $79.99."

gpt-4o
🤖 2026-02-29
(Feb 29 doesn't exist!)
gpt-4.1-mini
🤖 2026-02-29
(still doesn't exist!)
gpt-5.3-codex
🤖 2026-02-28

Disagree with the "expected" answers?
Evals can also reveal ambiguities in your categories or test data!

Don't overthink it!

gpt-5.3-codex with Annotated[Enum], same 17 cases, varying reasoning effort:

                                  low    medium  high   xhigh
Did category match ground truth?  100%   88.2%   88.2%  88.2%
Did date match ground truth?      94.1%  94.1%   94.1%  94.1%
Schema size (average tokens)      862    890     939    1,114
Latency (average ms)              7,129  7,474   8,828  11,554

Why?

🗣️ "Yesterday I spent $200 on a spa treatment"
low
💭 "considering Health & Fitness or Apparel/Beauty... spa = personal care or wellness"
Health & Fitness
medium
💭 "personal care fits in Apparel and Beauty, while wellness aligns with Health & Fitness. Since a spa treatment feels more like a beauty or per..."
Apparel and Beauty

More reasoning → more overthinking → wrong answer!

Are all agent frameworks the same?

GitHub Copilot SDK agent with MCP server


client = CopilotClient()

session = await client.create_session(SessionConfig(
    model="gpt-5.3-codex",
    mcp_servers={
        "expenses": MCPRemoteServerConfig(
            type="http",
            url="http://localhost:8000/mcp",
            tools=["add_expense_cat_e"],
        )
    },
    system_message={
        "mode": "replace",
        "content": "You help users log expenses. "
            f"Today's date is {datetime.now().strftime('%B %-d, %Y')}.",
    },
))

await session.send_and_wait({"prompt": "I bought a sandwich for $12.50."})
    

Pydantic AI vs. GitHub Copilot SDK

Same model (gpt-5.3-codex), same 17 cases, same MCP server, different agent framework:

Was tool called at all?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   17/17        17/17
Literal          17/17        17/17
Enum             17/17        17/17
Annotated[Enum]  17/17        17/17

Did category match expected?

Schema           Pydantic AI  Copilot SDK
Annotated[str]   15/17        15/17
Literal          13/17        13/17
Enum             13/17        13/17
Annotated[Enum]  15/17        15/17

Did date match expected?

Schema          Pydantic AI  Copilot SDK
str             17/17        17/17
Annotated[str]  17/17        17/17
date            17/17        17/17
Field(pattern)  17/17        17/17

Tools have return schemas too!

Return a plain string

Python code:


@mcp.tool
def get_expenses_a() -> str:
    """Get all expenses."""
    return "\n".join(
        f"Date: {e['date']}, Amount: ${e['amount']}, "
        f"Category: {e['category']}, ..."
        for e in expenses
    )
            

Example output:


Date: 2025-01-02, Amount: $4.50,
  Category: Food & drink,
  Description: Morning coffee
Date: 2025-01-02, Amount: $12.99,
  Category: Food & drink,
  Description: Lunch sandwich
Date: 2025-01-03, Amount: $45.00,
  Category: Transit and Fuel,
  Description: Gas station fill-up
...
            

Return a list of Pydantic models

Python code:


class Expense(BaseModel):
    """A single expense record."""
    expense_date: date = Field(
        alias="date",
        description="Date of the expense")
    amount: float = Field(
        description="Amount spent")
    category: str = Field(
        description="Category of expense")
    description: str = Field(
        description="Description of the expense")

@mcp.tool
def get_expenses_c() -> list[Expense]:
    """Get all expenses."""
    return [Expense(**e) for e in expenses]
            

Example output:


[
  {"date": "2025-01-02",
   "amount": 4.5,
   "category": "Food & drink",
   "description": "Morning coffee"},
  {"date": "2025-01-02",
   "amount": 12.99,
   ...},
  ...
]
            

Evals: Which output schema did best?

gpt-5.3-codex, 7 cases for each schema, with a Pydantic-AI agent:

🗣️ How many expenses are recorded in total?
🗣️ What is the most expensive expense?
🗣️ What is the cheapest expense?
🗣️ What is the date of the earliest expense?
🗣️ What category is the most expensive expense?
🗣️ Show the 3 most expensive as a table.
🗣️ Show all Electronics & tech as a table.
                     str          list[Expense]
Answered correctly?  7/7          7/7
Tool response size   6,297 chars  7,306 chars

Accuracy is identical, but structured output costs more in tool response size.

So why use it? For consumption by downstream MCP servers or agents.
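The size gap is easy to see by rendering the same records both ways; a stdlib sketch with hypothetical sample data standing in for the expense store:

```python
import json

# Hypothetical sample records standing in for the expense store.
expenses = [
    {"date": "2025-01-02", "amount": 4.5,
     "category": "Food & drink", "description": "Morning coffee"},
    {"date": "2025-01-02", "amount": 12.99,
     "category": "Food & drink", "description": "Lunch sandwich"},
]

# Variant A: plain string, one line per expense.
as_text = "\n".join(
    f"Date: {e['date']}, Amount: ${e['amount']}, "
    f"Category: {e['category']}, Description: {e['description']}"
    for e in expenses
)

# Variant C: structured output, as a list[Expense] would be serialized.
as_json = json.dumps(expenses)

# JSON repeats the quoted key names for every record, so it runs longer.
print(len(as_text), len(as_json))
```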

Conclusions

  • Structured schemas benefit your code: type safety, validation, IDE autocompletion, and fewer runtime errors
  • They may also improve model accuracy when calling tools
  • The only way to know is to do evals! Don't go on vibes alone (see Hamel's tweet recommending evals)

Thank you!

Slides:
pamelafox.github.io/py-ai-mcp-tool-schemas

Code:
github.com/pamelafox/py-ai-mcp-tool-schemas/

Questions? Find me online at:

Twitter: @pamelafox
Mastodon: @pamelafox@fosstodon.org
BlueSky: @pamelafox.bsky.social
GitHub: www.github.com/pamelafox
Website: pamelafox.org