
ToolPicker

Pick K relevant tools out of N for an LLM agent, without dumping the whole catalogue into the prompt.

When an LLM agent has 50 tools it spends a third of its prompt budget on tool schemas it never calls. ToolPicker fuses BM25 lexical retrieval, semantic embeddings, and (optionally) an example-trained intent classifier through Reciprocal Rank Fusion, then packs the top-K into your token budget.
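For readers new to Reciprocal Rank Fusion, here is a minimal sketch of the idea in plain Python. It is not ToolPicker's implementation: the function name `rrf_fuse`, the `weights` parameter, the tie-breaking rule, and the toy rankings are all assumptions made for illustration, and `k=60` is simply the constant commonly used in RRF write-ups.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, weights=None):
    """Fuse best-first ranked lists of tool names with Reciprocal Rank Fusion.

    Hypothetical helper, not part of the toolpicker package.
    Each ranking contributes weight / (k + rank) to a tool's fused score.
    """
    if weights is None:
        weights = [1.0] * len(rankings)  # uniform-weight RRF
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, name in enumerate(ranking, start=1):
            scores[name] += weight / (k + rank)
    # Highest fused score first; ties broken by name so output is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))

# Toy rankings standing in for the output of a lexical and a semantic retriever.
bm25_ranking = ["send_email", "get_weather", "create_event"]
semantic_ranking = ["get_weather", "create_event", "send_email"]

fused = rrf_fuse([bm25_ranking, semantic_ranking])
print(fused)
```

Because RRF only looks at ranks, it can combine retrievers whose raw scores live on different scales without any score normalisation.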


Why this exists

Five tools in the prompt is fine. Fifty isn't. The token cost of carrying every tool schema scales linearly with the number of tools but the value of any single tool to a given query is sparse — most tools are irrelevant most of the time. Routing the tool list per query gets you smaller prompts, lower cost, and (usually) better tool-calling accuracy because the model isn't distracted by irrelevant signatures.
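To make the cost argument concrete, the sketch below packs an already-ranked tool list into a fixed token budget. It is an illustration under stated assumptions, not the library's packing code: `estimate_tokens` is a crude stand-in (roughly four characters per token) for a real tokenizer, and `pack_to_budget`, the greedy skip-if-too-big policy, and the toy schemas are hypothetical.

```python
import json

def estimate_tokens(schema):
    """Rough token estimate: about 4 characters of serialized JSON per token.

    Assumption for illustration only; a real tokenizer will give different counts.
    """
    return max(1, len(json.dumps(schema)) // 4)

def pack_to_budget(ranked_tools, k, token_budget):
    """Walk tools in rank order, keeping at most k that fit in token_budget."""
    selected, used = [], 0
    for tool in ranked_tools:
        if len(selected) == k:
            break
        cost = estimate_tokens(tool)
        if used + cost > token_budget:
            continue  # this schema does not fit; a later, smaller one still might
        selected.append(tool)
        used += cost
    return selected, used

ranked = [
    {"name": "get_weather", "description": "Get weather for a city."},
    {"name": "big_tool", "description": "x" * 400},  # oversized schema
    {"name": "send_email", "description": "Send an email."},
]

selected, used = pack_to_budget(ranked, k=5, token_budget=40)
print([tool["name"] for tool in selected], used)
```

Skipping an oversized schema instead of stopping at it is one possible policy; stopping at the first tool that does not fit would preserve strict rank order at the cost of leaving budget unused.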

ToolPicker is the routing layer. It does not call the tools, generate plans, or run a loop. It returns a list[Tool] and gets out of the way.


Headline numbers (v0.6)

Multi-strategy comparison on a 200-case in-repo synthetic corpus and a 500-case Gorilla slice, both using OpenAI text-embedding-3-small:

Synthetic corpus (200 cases):

strategy                p@1     p@3     mrr
bm25-only               0.645   0.760   0.701
semantic-only           0.885   0.970   0.926
hybrid-rrf              0.800   0.960   0.879
intent-only             0.715   0.925   0.819
bm25+semantic+intent    0.845   0.965   0.908

Gorilla slice (500 cases):

strategy                p@1     p@3     mrr
bm25-only               0.062   0.122   0.098
semantic-only           0.102   0.186   0.147
hybrid-rrf              0.088   0.168   0.132

Honest read: on these corpora, pure semantic retrieval beats every hybrid configuration under uniform-weight RRF. Adding intent narrows the gap on synthetic (p@1 rises from 0.800 for hybrid-rrf to 0.845 for bm25+semantic+intent, against 0.885 for semantic-only) but doesn't close it. The Gorilla absolute numbers are low because Gorilla is genuinely hard: 1,726 tightly clustered ML APIs against natural-language queries. Description enrichment (v0.6) lifted hybrid 29% relative over v0.5.
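For reading the p@1, p@3, and mrr columns, the sketch below shows one common way to compute them. It assumes each eval case has exactly one expected tool, that p@k means the fraction of cases whose expected tool appears in the top k, and that mrr is the mean of 1/rank of the expected tool (0 when it is missing). The eval harness may define these differently; the function names and toy cases are hypothetical.

```python
def hit_at_k(ranked, expected, k):
    """1.0 if the expected tool is among the top-k ranked names, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0

def reciprocal_rank(ranked, expected):
    """1/rank of the expected tool (1-indexed), or 0.0 if it was not returned."""
    for rank, name in enumerate(ranked, start=1):
        if name == expected:
            return 1.0 / rank
    return 0.0

def summarize(cases):
    """cases: list of (ranked_tool_names, expected_tool_name) pairs."""
    n = len(cases)
    return {
        "p@1": sum(hit_at_k(r, e, 1) for r, e in cases) / n,
        "p@3": sum(hit_at_k(r, e, 3) for r, e in cases) / n,
        "mrr": sum(reciprocal_rank(r, e) for r, e in cases) / n,
    }

# Four toy cases: expected tool at rank 1, rank 2, rank 3, and missing.
cases = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]

metrics = summarize(cases)
print(metrics)
```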

Reproduce in one command:

uv run python -m evals.compare --benchmark synthetic --embedder openai --output out/compare.json

Install

pip install toolpicker[openai]

The core install is zero-dependency. Optional extras gate the things you actually use:

pip install toolpicker[openai]    # OpenAIEmbeddings
pip install toolpicker[openapi]   # OpenAPISource
pip install toolpicker[mcp]       # MCPSource + live introspection
pip install toolpicker[tokens]    # tiktoken-backed token counting

Thirty-second example

from toolpicker import FunctionSchemaSource, OpenAIEmbeddings, ToolPicker

tools = [
    {"name": "get_weather", "description": "Get weather for a city.",
     "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}},
    {"name": "send_email", "description": "Send an email.",
     "parameters": {"type": "object", "properties": {"to": {"type": "string"}}}},
    # ... 48 more
]

picker = ToolPicker(FunctionSchemaSource(tools), embedder=OpenAIEmbeddings())
selected = picker.select("what's the forecast for Boston?", k=5)
# -> [Tool(name='get_weather', ...), ...]

More in the Quickstart.


What ToolPicker is not

  • Not a tool runner. It returns tools; you call them.
  • Not an agent framework. It plugs into LangChain / LlamaIndex / raw OpenAI / Claude Agent SDK / anything that takes a list[function_schema].
  • Not a vector database. The semantic half stores embeddings in-process. If you have 100k tools, you want a vector DB; ToolPicker is for the 10–10,000 tool range.
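To illustrate why an in-process store suits the stated range but not 100k tools, here is a minimal brute-force similarity search in plain Python. It is a sketch, not ToolPicker's storage layer: the three-dimensional vectors are invented toy values, and `cosine` and `rank_by_similarity` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, tool_vecs):
    """Score every stored tool vector against the query; most similar first."""
    return sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )

# Toy embeddings held in an ordinary dict (invented values, 3 dimensions).
tool_vecs = {
    "get_weather": [0.9, 0.1, 0.0],
    "send_email": [0.0, 0.2, 0.9],
    "create_event": [0.3, 0.8, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query_vec, tool_vecs)
print(ranking)
```

Every query scans every stored vector, so work grows linearly with the number of tools. That is cheap for a small catalogue, but it is the kind of cost an approximate nearest-neighbour index in a vector database is built to avoid at much larger scale.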

Where to go next

  • Quickstart — five-line working example with a real API.
  • Concepts — what BM25, semantic, intent, RRF, and token-budget packing each do, and when each wins.
  • Sources — FunctionSchemaSource, OpenAPISource, MCPSource, MergedSource.
  • Eval harness — how to reproduce the headline numbers, plus how to run on ToolBench / Gorilla after you fetch them.
  • API reference — full autogenerated reference.