AI & Automation / Local AI

Run Local LLM Tool Calling on Windows

This guide gives you one practical architecture and three implementation paths: Nemotron/Hugging Face via vLLM in WSL2, llama.cpp or text-generation-webui, and Ollama on Windows. You will finish with a local endpoint that can request tools (functions), a controller loop that executes them safely, and a verification checklist.

Audience: Windows users comfortable with command line, Python, and installing GPU/runtime dependencies.

1) Architecture overview (recommended mental model)

Tool calling is a three-part loop:

  1. Model server (vLLM / llama.cpp-WebUI / Ollama) receives chat + tool schema.
  2. Model response includes a tool request (name + JSON arguments).
  3. Your controller validates arguments, runs approved code, returns tool output to the model for final answer.
Important: the model never executes your tools directly. Your app does. That is your security boundary.
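The loop above can be written out as the plain message objects your controller actually shuttles between turns. This is a minimal sketch; the tool name (`get_time`) and its arguments are purely illustrative.

```python
# Sketch of one tool-calling round trip as OpenAI-style chat messages.
# The tool name ("get_time") and its arguments are illustrative only.

# 1. Your app sends the user message (plus a tool schema, not shown).
conversation = [{"role": "user", "content": "What time is it in Tokyo?"}]

# 2. The model answers with a structured tool request instead of prose.
conversation.append({
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_time", "arguments": '{"tz": "Asia/Tokyo"}'},
    }],
})

# 3. Your controller validates the arguments, executes get_time() itself,
#    and feeds the result back as a role:"tool" message keyed to the call id.
conversation.append({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": '{"time": "09:30"}',
})

# 4. Sending `conversation` back to the model yields the final answer.
```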

Why this works well on Windows

  • OpenAI-style APIs are common across all three stacks.
  • You can reuse one Python controller across endpoints.
  • You can begin with small local models, then scale up.

Recommended default path

  • Best quality/perf: vLLM in WSL2 + CUDA GPU.
  • Easiest all-Windows setup: Ollama.
  • Most tweakable UI route: text-generation-webui.

2) Prerequisites

  • Windows 11 (or recent Windows 10), admin access.
  • NVIDIA GPU with current drivers (for practical local inference).
  • At least 16 GB RAM (32+ GB is nicer), enough disk for models (20–100+ GB).
  • Python 3.10+ and pip for your controller app.
Reality check: large Nemotron variants are heavy. If VRAM is limited, start with smaller instruction models or quantized GGUF models (Option B/C), validate your controller loop, then scale.
Stack | Endpoint style | Best for
vLLM | /v1/chat/completions (OpenAI-compatible) | Higher throughput + production-like serving
llama.cpp / text-generation-webui | OpenAI-compatible API in local server mode | GGUF flexibility, broad model compatibility
Ollama | /api/chat and OpenAI-compatible mode | Fastest setup experience
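To judge whether a model fits your GPU before downloading it, a useful back-of-envelope rule is weights ≈ parameter count × bytes per parameter; KV cache and activations add more on top, so leave headroom. A rough sketch:

```python
def estimate_weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB.
    KV cache, activations, and runtime overhead are NOT included."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

# A 7B model: BF16 vs a 4-bit quantized GGUF variant.
print(estimate_weights_gb(7, 16))  # 14.0  (~14 GB of weights)
print(estimate_weights_gb(7, 4))   # 3.5   (~3.5 GB of weights)
```

This is why the quantized-GGUF route (Option B/C) is the pragmatic starting point on 8–12 GB cards.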

3) Option A: vLLM + Nemotron/Hugging Face (WSL2)

Use this when: you want strong serving performance and OpenAI-compatible tool calling with explicit parser/template controls.

Step A1 — Prepare WSL2 and GPU access

  1. Install WSL2 and Ubuntu (if not already installed).
  2. Confirm your NVIDIA Windows driver is current.
  3. Inside WSL2, verify GPU visibility (nvidia-smi).

Step A2 — Install vLLM in WSL2

python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install -U pip
pip install vllm

If dependency resolution fails, follow the vLLM installation matrix in the docs for your CUDA/PyTorch combination.

Step A3 — Select a tool-capable model

Pick a model whose card explicitly documents instruction/chat plus tool/function calling support. For Nemotron-family models, verify VRAM requirements and prompt/template notes on the Hugging Face model card before serving.

Step A4 — Start vLLM with tool calling enabled

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template examples/tool_chat_template_llama3.1_json.jinja
Parser and template must match your model family. If you get malformed tool calls, this is the first thing to adjust.

Step A5 — Smoke test endpoint

curl http://localhost:8000/v1/models

Then run your controller (section 6) against http://localhost:8000/v1.
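Before wiring up the full controller, it can help to confirm the server actually emits structured tool calls. The sketch below (stdlib only) builds a minimal probe request; it assumes the server from Step A4 is running and reuses the hypothetical `get_weather` tool from section 6.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # matches the vLLM serve command above

def build_tool_probe(model: str) -> dict:
    """Request body that should provoke a structured tool call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Weather in Chicago?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }

def probe(base_url: str, model: str) -> dict:
    """POST the probe and return the assistant message (server must be up)."""
    body = json.dumps(build_tool_probe(model)).encode()
    req = request.Request(f"{base_url}/chat/completions", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]
```

With the server running, `probe(BASE_URL, "<your-model-id>")` should return a message whose `tool_calls` entry names `get_weather`; if the call comes back as imitation JSON inside `content`, revisit the parser/template flags from Step A4.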

4) Option B: llama.cpp or text-generation-webui

Use this when: you want maximum model portability (especially GGUF) and straightforward local APIs.

Path B1 — llama.cpp server

  1. Get a tool-capable GGUF model (Qwen/Mistral/etc. with known tool template support).
  2. Run llama-server with Jinja/template enabled for function calling.
llama-server --model C:\models\YourModel.gguf --jinja --host 0.0.0.0 --port 8080

If tool JSON is weird, try an explicit chat template override for your model family.

Path B2 — text-generation-webui (oobabooga)

  1. Install and launch webui on Windows.
  2. Enable API mode (e.g., --api) so OpenAI-style endpoints are exposed.
  3. Load a model with tool-calling support.
# Example launch flags
python server.py --api --listen --api-port 5000

Per the project wiki, tool calls return finish_reason: "tool_calls" and a structured tool_calls array; your app executes tools and sends back a role: "tool" message.
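That contract can be checked with a small predicate in your app. A sketch, assuming the OpenAI-style response shape described above:

```python
def needs_tool_run(choice: dict) -> bool:
    """True when the backend signals a structured tool request:
    finish_reason == "tool_calls" plus a non-empty tool_calls array."""
    return (choice.get("finish_reason") == "tool_calls"
            and bool(choice.get("message", {}).get("tool_calls")))

# Illustrative response fragment (shape per the wiki, values made up).
choice = {
    "finish_reason": "tool_calls",
    "message": {"tool_calls": [{"id": "call_1", "type": "function",
                                "function": {"name": "get_temperature",
                                             "arguments": '{"city": "NYC"}'}}]},
}
print(needs_tool_run(choice))  # True
```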

5) Option C: Ollama on Windows

Use this when: you want the least setup friction and quick local iteration.

Step C1 — Install and run a tool-capable model

ollama pull qwen3
ollama run qwen3

Step C2 — Call tools through Ollama API

curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "qwen3",
  "stream": false,
  "messages": [{"role":"user","content":"What is the temperature in New York?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_temperature",
      "description": "Get temperature for a city",
      "parameters": {
        "type": "object",
        "required": ["city"],
        "properties": {"city": {"type":"string"}}
      }
    }
  }]
}'

When Ollama returns a tool call, execute it in your app and send a follow-up message with the tool result.
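A stdlib-only sketch of that follow-up request: it replays the model's tool request plus your locally computed result so the model can compose the final answer. The field layout mirrors the curl example above; newer Ollama releases also accept a tool-name field on the tool message, so check the current API docs for your version.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_followup(history: list, tool_call: dict, result: dict) -> dict:
    """Second /api/chat request: history + the model's tool request +
    your tool result, so the model can write the final answer."""
    return {
        "model": "qwen3",
        "stream": False,
        "messages": history + [
            {"role": "assistant", "tool_calls": [tool_call]},
            # role "tool" + JSON content is the common denominator across versions.
            {"role": "tool", "content": json.dumps(result)},
        ],
    }

def send(body: dict) -> dict:
    """POST to the local Ollama server (requires Ollama to be running)."""
    req = request.Request(OLLAMA_URL, data=json.dumps(body).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```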

6) Controller pattern for tool calls (reusable across all 3 stacks)

This is the production-critical pattern: model asks, controller validates + executes, model finalizes.

import json
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # vLLM example
MODEL = "your-model-id"

client = OpenAI(base_url=BASE_URL, api_key="dummy")

def get_weather(city: str):
    # Replace with your real implementation
    return {"city": city, "temp_c": 22}

TOOL_MAP = {"get_weather": get_weather}

TOOLS = [{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather",
    "parameters": {
      "type": "object",
      "properties": {"city": {"type": "string"}},
      "required": ["city"]
    }
  }
}]

messages = [{"role": "user", "content": "Weather in Chicago?"}]

for _ in range(6):  # hard stop avoids infinite tool loops
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto"
    )
    msg = resp.choices[0].message

    if not getattr(msg, "tool_calls", None):
        print(msg.content)
        break

    # Echo the assistant turn (including its tool_calls) back into the history
    messages.append(msg)

    for tc in msg.tool_calls:
        name = tc.function.name
        try:
            args = json.loads(tc.function.arguments or "{}")
        except json.JSONDecodeError:
            args = None
        if name not in TOOL_MAP:
            result = {"error": f"unknown tool: {name}"}
        elif args is None:
            result = {"error": "malformed JSON arguments"}
        else:
            # Validate args here before execution (see section 7)
            result = TOOL_MAP[name](**args)

        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "name": name,
            "content": json.dumps(result)
        })
Pro move: keep this loop identical and only swap BASE_URL/MODEL per backend.

7) Safety guardrails you should implement on day one

  • Allowlist tools: only expose specific functions, never raw shell/file/network by default.
  • Schema validation: validate JSON args (type, range, enum, required fields).
  • Execution timeout: avoid hung tool calls.
  • Loop cap: max tool iterations per request (e.g., 4–8).
  • Sensitive actions require confirmation: delete/write/send actions should need explicit user approval.
  • Audit logs: log tool name, args hash, and result status (without leaking secrets).
  • Prompt injection defense: treat tool-call intent as untrusted until validated by policy.
Never do this: expose an unrestricted “run_command” tool to the model. That is a speedrun to regret.
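The schema-validation guardrail can start very small. This is a minimal structural checker, not a full JSON Schema implementation; in production, prefer a proper validator library. It reuses the `get_weather`-style schema shape from section 6.

```python
def validate_args(schema: dict, args: dict) -> list[str]:
    """Minimal structural check against a tool's JSON parameter schema:
    required keys present, no unexpected keys, basic types match.
    A real deployment should use a full JSON Schema validator."""
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    for key, value in args.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            errors.append(f"unexpected field: {key}")
        elif not isinstance(value, type_map.get(spec.get("type"), object)):
            errors.append(f"wrong type for field: {key}")
    return errors

schema = {"type": "object", "required": ["city"],
          "properties": {"city": {"type": "string"}}}
print(validate_args(schema, {"city": "Chicago"}))  # []
print(validate_args(schema, {"town": "Chicago"}))  # missing "city" + unexpected "town"
```

In the controller loop, run this before `TOOL_MAP[name](**args)` and return the error list as the tool result instead of executing.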

8) Verify success

  1. Endpoint responds (/v1/models for OpenAI-style, or Ollama /api/tags).
  2. Model returns a structured tool request (not plain text imitation).
  3. Your controller executes tool and sends back role: "tool" result.
  4. Model uses tool output in final answer.
  5. Bad arguments are rejected safely with a clear error result.

9) Troubleshooting

Model replies with fake tool JSON in normal text

  • Wrong chat template/parser for model family.
  • Model not actually tuned for tool use.

Malformed arguments / wrong function names

  • Tighten tool schema descriptions and required fields.
  • Use lower temperature during tool-selection turns.

vLLM fails or is unstable on Windows host

  • Run it inside WSL2 Linux environment.
  • Confirm CUDA driver/runtime pairing and VRAM headroom.

Out-of-memory crashes

  • Use a smaller model or quantized variant.
  • Lower context length and batch/parallel settings.
