1) Architecture overview (recommended mental model)
Tool calling is a three-part loop:
- Model server (vLLM / llama.cpp / text-generation-webui / Ollama) receives the chat plus tool schema.
- Model response includes a tool request (name + JSON arguments).
- Your controller validates arguments, runs approved code, returns tool output to the model for final answer.
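Concretely, one round of the loop exchanges three message shapes. A minimal sketch in OpenAI-style message format (the tool name `get_temperature`, the arguments, and the call ID are illustrative, not tied to any specific model):

```python
# 1) User turn goes to the model server along with the tool schema.
user_turn = {"role": "user", "content": "What is the temperature in New York?"}

# 2) Model responds with a structured tool request instead of prose.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_temperature",
                     "arguments": "{\"city\": \"New York\"}"},
    }],
}

# 3) Controller validates + executes, then returns the result so the
#    model can produce the final answer.
tool_turn = {
    "role": "tool",
    "tool_call_id": "call_1",
    "content": "{\"city\": \"New York\", \"temp_c\": 22}",
}
```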
Why this works well on Windows
- OpenAI-style APIs are common across all three stacks.
- You can reuse one Python controller across endpoints.
- You can begin with small local models, then scale up.
Recommended default path
- Best quality/perf: vLLM in WSL2 + CUDA GPU.
- Easiest all-Windows setup: Ollama.
- Most tweakable UI route: text-generation-webui.
2) Prerequisites
- Windows 11 (or recent Windows 10), admin access.
- NVIDIA GPU with current drivers (for practical local inference).
- At least 16 GB RAM (32+ GB is nicer), enough disk for models (20–100+ GB).
- Python 3.10+ and pip for your controller app.
| Stack | Endpoint style | Best for |
|---|---|---|
| vLLM | /v1/chat/completions (OpenAI-compatible) | Higher throughput + production-like serving |
| llama.cpp / text-generation-webui | OpenAI-compatible API in local server mode | GGUF flexibility, broad model compatibility |
| Ollama | /api/chat and OpenAI-compatible mode | Fastest setup experience |
3) Option A: vLLM + Nemotron/Hugging Face (WSL2)
Use this when: you want strong serving performance and OpenAI-compatible tool calling with explicit parser/template controls.
Step A1 — Prepare WSL2 and GPU access
- Install WSL2 and Ubuntu (if not already installed).
- Confirm your NVIDIA Windows driver is current.
- Inside WSL2, verify GPU visibility (nvidia-smi).
Step A2 — Install vLLM in WSL2
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install -U pip
pip install vllm
If dependency resolution fails, use the vLLM installation matrix for your CUDA/PyTorch combination from docs.
Step A3 — Select a tool-capable model
Pick a model card that explicitly supports instruction/chat + tool/function calling format. For Nemotron family models, verify VRAM needs and prompt/template notes on Hugging Face model card before serving.
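As a rough sanity check before pulling a model, weight memory is approximately parameter count times bytes per parameter. Actual usage is higher (KV cache, activations, CUDA context), so treat this as a floor, not a budget:

```python
# Back-of-envelope weight memory: params x bytes per parameter.
# This deliberately ignores KV cache and runtime overhead.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

bf16_4b = weight_gb(4, 2)    # a 4B BF16 model: ~7.5 GiB of weights alone
q4_8b = weight_gb(8, 0.5)    # an 8B model at 4-bit: ~3.7 GiB of weights
```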
Step A4 — Start vLLM with tool calling enabled
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template examples/tool_chat_template_llama3.1_json.jinja
Step A5 — Smoke test endpoint
curl http://localhost:8000/v1/models
Then run your controller (section 6) against http://localhost:8000/v1.
4) Option B: llama.cpp or text-generation-webui
Use this when: you want maximum model portability (especially GGUF) and straightforward local APIs.
Path B1 — llama.cpp server
- Get a tool-capable GGUF model (Qwen/Mistral/etc. with known tool template support).
- Run llama-server with Jinja/template support enabled for function calling.
llama-server --model C:\models\YourModel.gguf --jinja --host 0.0.0.0 --port 8080
If tool JSON is weird, try an explicit chat template override for your model family.
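While you sort out the template, a tolerant fallback parser can salvage tool calls that the model emits as plain text instead of structured output. This is a hypothetical helper, not a llama.cpp API; fixing the chat template so the server parses calls itself is the real solution:

```python
import json
import re

def extract_tool_call(text: str):
    """Best-effort fallback: pull the first JSON object containing a
    'name' field out of free-form model text. Returns None if no
    parseable tool call is found."""
    for match in re.finditer(r"\{.*\}", text, re.DOTALL):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None
```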
Path B2 — text-generation-webui (oobabooga)
- Install and launch webui on Windows.
- Enable API mode (e.g., --api) so OpenAI-style endpoints are exposed.
- Load a model with tool-calling support.
# Example launch flags
python server.py --api --listen --api-port 5000
Per the project wiki, tool calls return finish_reason: "tool_calls" and a structured tool_calls array; your app executes tools and sends back a role: "tool" message.
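That response shape can be handled like this. The sample response dict below is illustrative (field values made up); only the `finish_reason` check and the `role: "tool"` reply structure follow the wiki's description:

```python
import json

# Illustrative response in the shape described above; values are made up.
response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_temperature",
                             "arguments": "{\"city\": \"Berlin\"}"},
            }],
        },
    }]
}

choice = response["choices"][0]
tool_messages = []
if choice["finish_reason"] == "tool_calls":
    for tc in choice["message"]["tool_calls"]:
        args = json.loads(tc["function"]["arguments"])
        result = {"city": args["city"], "temp_c": 18}  # stand-in for real execution
        tool_messages.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(result),
        })
```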
5) Option C: Ollama on Windows
Use this when: you want the least setup friction and quick local iteration.
Step C1 — Install and run a tool-capable model
ollama pull qwen3
ollama run qwen3
Step C2 — Call tools through Ollama API
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"stream": false,
"messages": [{"role":"user","content":"What is the temperature in New York?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get temperature for a city",
"parameters": {
"type": "object",
"required": ["city"],
"properties": {"city": {"type":"string"}}
}
}
}]
}'
When Ollama returns a tool call, execute it in your app and send a follow-up message with the tool result.
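A sketch of that follow-up request body, assuming you have already run the tool in your own code (payload construction only, no network call; the tool call and result values are illustrative):

```python
import json

# The tool call as echoed back from Ollama, plus your locally computed result.
tool_call = {"function": {"name": "get_temperature",
                          "arguments": {"city": "New York"}}}
tool_result = {"city": "New York", "temp_c": 22}  # produced by your own code

# Follow-up /api/chat payload: original user turn, the assistant's tool
# call, and a role "tool" message carrying the result.
followup = {
    "model": "qwen3",
    "stream": False,
    "messages": [
        {"role": "user", "content": "What is the temperature in New York?"},
        {"role": "assistant", "tool_calls": [tool_call]},
        {"role": "tool", "content": json.dumps(tool_result)},
    ],
}
# POST followup to http://localhost:11434/api/chat to get the final answer.
```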
6) Controller pattern for tool calls (reusable across all 3 stacks)
This is the production-critical pattern: model asks, controller validates + executes, model finalizes.
import json
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # vLLM example
MODEL = "your-model-id"
client = OpenAI(base_url=BASE_URL, api_key="dummy")

def get_weather(city: str):
    # Replace with your real implementation
    return {"city": city, "temp_c": 22}

TOOL_MAP = {"get_weather": get_weather}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

messages = [{"role": "user", "content": "Weather in Chicago?"}]

for _ in range(6):  # hard stop avoids infinite tool loops
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto"
    )
    msg = resp.choices[0].message
    if not getattr(msg, "tool_calls", None):
        print(msg.content)
        break
    # Echo the assistant's tool-call turn back in plain-dict form so it
    # serializes cleanly on the next request.
    messages.append({
        "role": "assistant",
        "content": msg.content or "",
        "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
    })
    for tc in msg.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments or "{}")
        if name not in TOOL_MAP:
            result = {"error": f"unknown tool: {name}"}
        else:
            # Validate args here before execution
            result = TOOL_MAP[name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "name": name,
            "content": json.dumps(result)
        })
Adjust BASE_URL and MODEL per backend.
7) Safety guardrails you should implement on day one
- Allowlist tools: only expose specific functions, never raw shell/file/network by default.
- Schema validation: validate JSON args (type, range, enum, required fields).
- Execution timeout: avoid hung tool calls.
- Loop cap: max tool iterations per request (e.g., 4–8).
- Sensitive actions require confirmation: delete/write/send actions should need explicit user approval.
- Audit logs: log tool name, args hash, and result status (without leaking secrets).
- Prompt injection defense: treat tool-call intent as untrusted until validated by policy.
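The schema-validation guardrail can be hand-rolled in a few lines. A minimal sketch covering required fields and basic type checks (in production, a full JSON Schema validator such as the `jsonschema` package is a better fit; the `SCHEMA` here is the weather-tool example):

```python
# Minimal argument validation: check required fields and basic types
# against the tool's JSON schema before executing anything.
SCHEMA = {
    "type": "object",
    "required": ["city"],
    "properties": {"city": {"type": "string"}},
}

TYPE_MAP = {"string": str, "object": dict, "number": (int, float),
            "integer": int, "boolean": bool, "array": list}

def validate_args(args: dict, schema: dict):
    """Return a list of validation errors; empty list means args are OK."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in args and not isinstance(args[field], TYPE_MAP[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors
```

Reject the tool call (and return the error list to the model as the tool result) whenever `validate_args` returns a non-empty list.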
8) Verify success
- Endpoint responds (/v1/models for OpenAI-style, or Ollama /api/tags).
- Model returns a structured tool request (not a plain-text imitation).
- Your controller executes the tool and sends back a role: "tool" result.
- Model uses the tool output in its final answer.
- Bad arguments are rejected safely with a clear error result.
9) Troubleshooting
Model replies with fake tool JSON in normal text
- Wrong chat template/parser for model family.
- Model not actually tuned for tool use.
Malformed arguments / wrong function names
- Tighten tool schema descriptions and required fields.
- Use lower temperature during tool-selection turns.
vLLM fails or is unstable on Windows host
- Run it inside a WSL2 Linux environment.
- Confirm CUDA driver/runtime pairing and VRAM headroom.
Out-of-memory crashes
- Use a smaller model or quantized variant.
- Lower context length and batch/parallel settings.