1) Architecture overview (recommended mental model)
Tool calling is a three-part loop:
- Model server (vLLM / llama.cpp / text-generation-webui / Ollama) receives the chat plus tool schema.
- Model response includes a tool request (name + JSON arguments).
- Your controller validates arguments, runs approved code, returns tool output to the model for final answer.
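Concretely, one round of the loop exchanges three message shapes. A minimal sketch in OpenAI-style message format (the tool name `get_temperature`, the arguments, and the call ID are illustrative, not tied to any specific model):

```python
# 1) User turn goes to the model server along with the tool schema.
user_turn = {"role": "user", "content": "What is the temperature in New York?"}

# 2) Model responds with a structured tool request instead of prose.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_temperature",
                     "arguments": "{\"city\": \"New York\"}"},
    }],
}

# 3) Controller validates + executes, then returns the result so the
#    model can produce the final answer.
tool_turn = {
    "role": "tool",
    "tool_call_id": "call_1",
    "content": "{\"city\": \"New York\", \"temp_c\": 22}",
}
```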
Why this works well on Windows
- OpenAI-style APIs are common across all three stacks.
- You can reuse one Python controller across endpoints.
- You can begin with small local models, then scale up.
Recommended default path
- Best quality/perf: vLLM in WSL2 + CUDA GPU.
- Easiest all-Windows setup: Ollama.
- Most tweakable UI route: text-generation-webui.
2) Prerequisites
- Windows 11 (or recent Windows 10), admin access.
- NVIDIA GPU with current drivers (for practical local inference).
- At least 16 GB RAM (32+ GB is nicer), enough disk for models (20–100+ GB).
- Python 3.10+ and pip for your controller app.
| Stack | Endpoint style | Best for |
|---|---|---|
| vLLM | /v1/chat/completions (OpenAI-compatible) | Higher throughput + production-like serving |
| llama.cpp / text-generation-webui | OpenAI-compatible API in local server mode | GGUF flexibility, broad model compatibility |
| Ollama | /api/chat and OpenAI-compatible mode | Fastest setup experience |
3) Option A: vLLM + Nemotron/Hugging Face (WSL2)
Use this when: you want strong serving performance and OpenAI-compatible tool calling with explicit parser/template controls.
Step A1 — Prepare WSL2 and GPU access
- Install WSL2 and Ubuntu (if not already installed).
- Confirm your NVIDIA Windows driver is current.
- Inside WSL2, verify GPU visibility (nvidia-smi).
Step A2 — Install vLLM in WSL2
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install -U pip
pip install vllm
If dependency resolution fails, use the vLLM installation matrix for your CUDA/PyTorch combination from docs.
Step A3 — Select a tool-capable model
Pick a model card that explicitly supports instruction/chat + tool/function calling format. For Nemotron family models, verify VRAM needs and prompt/template notes on Hugging Face model card before serving.
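As a rough sanity check before pulling a model, weight memory is approximately parameter count times bytes per parameter. Actual usage is higher (KV cache, activations, CUDA context), so treat this as a floor, not a budget:

```python
# Back-of-envelope weight memory: params x bytes per parameter.
# This deliberately ignores KV cache and runtime overhead.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

bf16_4b = weight_gb(4, 2)    # a 4B BF16 model: ~7.5 GiB of weights alone
q4_8b = weight_gb(8, 0.5)    # an 8B model at 4-bit: ~3.7 GiB of weights
```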
Step A4 — Start vLLM with tool calling enabled
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--host 0.0.0.0 \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template examples/tool_chat_template_llama3.1_json.jinja
Step A5 — Smoke test endpoint
curl http://localhost:8000/v1/models
Then run your controller (section 6) against http://localhost:8000/v1.
4) Option B: llama.cpp or text-generation-webui
Use this when: you want maximum model portability (especially GGUF) and straightforward local APIs.
Path B1 — llama.cpp server
- Get a tool-capable GGUF model (Qwen/Mistral/etc. with known tool template support).
- Run llama-server with Jinja/template support enabled for function calling.
llama-server --model C:\models\YourModel.gguf --jinja --host 0.0.0.0 --port 8080
If tool JSON is weird, try an explicit chat template override for your model family.
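While you sort out the template, a tolerant fallback parser can salvage tool calls that the model emits as plain text instead of structured output. This is a hypothetical helper, not a llama.cpp API; fixing the chat template so the server parses calls itself is the real solution:

```python
import json
import re

def extract_tool_call(text: str):
    """Best-effort fallback: pull the first JSON object containing a
    'name' field out of free-form model text. Returns None if no
    parseable tool call is found."""
    for match in re.finditer(r"\{.*\}", text, re.DOTALL):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            return obj
    return None
```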
Path B2 — text-generation-webui (oobabooga)
- Install and launch webui on Windows.
- Enable API mode (e.g., --api) so OpenAI-style endpoints are exposed.
- Load a model with tool-calling support.
# Example launch flags
python server.py --api --listen --api-port 5000
Per the project wiki, tool calls return finish_reason: "tool_calls" and a structured tool_calls array; your app executes tools and sends back a role: "tool" message.
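That response shape can be handled like this. The sample response dict below is illustrative (field values made up); only the `finish_reason` check and the `role: "tool"` reply structure follow the wiki's description:

```python
import json

# Illustrative response in the shape described above; values are made up.
response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_temperature",
                             "arguments": "{\"city\": \"Berlin\"}"},
            }],
        },
    }]
}

choice = response["choices"][0]
tool_messages = []
if choice["finish_reason"] == "tool_calls":
    for tc in choice["message"]["tool_calls"]:
        args = json.loads(tc["function"]["arguments"])
        result = {"city": args["city"], "temp_c": 18}  # stand-in for real execution
        tool_messages.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(result),
        })
```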
5) Option C: Ollama on Windows
Use this when: you want the least setup friction and quick local iteration.
Step C1 — Install and run a tool-capable model
ollama pull qwen3
ollama run qwen3
Step C2 — Call tools through Ollama API
curl -s http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
"model": "qwen3",
"stream": false,
"messages": [{"role":"user","content":"What is the temperature in New York?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get temperature for a city",
"parameters": {
"type": "object",
"required": ["city"],
"properties": {"city": {"type":"string"}}
}
}
}]
}'
When Ollama returns a tool call, execute it in your app and send a follow-up message with the tool result.
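A sketch of that follow-up request body, assuming you have already run the tool in your own code (payload construction only, no network call; the tool call and result values are illustrative):

```python
import json

# The tool call as echoed back from Ollama, plus your locally computed result.
tool_call = {"function": {"name": "get_temperature",
                          "arguments": {"city": "New York"}}}
tool_result = {"city": "New York", "temp_c": 22}  # produced by your own code

# Follow-up /api/chat payload: original user turn, the assistant's tool
# call, and a role "tool" message carrying the result.
followup = {
    "model": "qwen3",
    "stream": False,
    "messages": [
        {"role": "user", "content": "What is the temperature in New York?"},
        {"role": "assistant", "tool_calls": [tool_call]},
        {"role": "tool", "content": json.dumps(tool_result)},
    ],
}
# POST followup to http://localhost:11434/api/chat to get the final answer.
```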
6) Controller pattern for tool calls (reusable across all 3 stacks)
This is the production-critical pattern: model asks, controller validates + executes, model finalizes.
import json
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # vLLM example
MODEL = "your-model-id"
client = OpenAI(base_url=BASE_URL, api_key="dummy")

def get_weather(city: str):
    # Replace with your real implementation
    return {"city": city, "temp_c": 22}

TOOL_MAP = {"get_weather": get_weather}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

messages = [{"role": "user", "content": "Weather in Chicago?"}]

for _ in range(6):  # hard stop avoids infinite tool loops
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto"
    )
    msg = resp.choices[0].message
    if not getattr(msg, "tool_calls", None):
        print(msg.content)
        break
    # Echo the assistant's tool-call turn back in plain-dict form so it
    # serializes cleanly on the next request.
    messages.append({
        "role": "assistant",
        "content": msg.content or "",
        "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
    })
    for tc in msg.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments or "{}")
        if name not in TOOL_MAP:
            result = {"error": f"unknown tool: {name}"}
        else:
            # Validate args here before execution
            result = TOOL_MAP[name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "name": name,
            "content": json.dumps(result)
        })
Adjust BASE_URL and MODEL per backend.
7) Safety guardrails you should implement on day one
- Allowlist tools: only expose specific functions, never raw shell/file/network by default.
- Schema validation: validate JSON args (type, range, enum, required fields).
- Execution timeout: avoid hung tool calls.
- Loop cap: max tool iterations per request (e.g., 4–8).
- Sensitive actions require confirmation: delete/write/send actions should need explicit user approval.
- Audit logs: log tool name, args hash, and result status (without leaking secrets).
- Prompt injection defense: treat tool-call intent as untrusted until validated by policy.
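The schema-validation guardrail can be hand-rolled in a few lines. A minimal sketch covering required fields and basic type checks (in production, a full JSON Schema validator such as the `jsonschema` package is a better fit; the `SCHEMA` here is the weather-tool example):

```python
# Minimal argument validation: check required fields and basic types
# against the tool's JSON schema before executing anything.
SCHEMA = {
    "type": "object",
    "required": ["city"],
    "properties": {"city": {"type": "string"}},
}

TYPE_MAP = {"string": str, "object": dict, "number": (int, float),
            "integer": int, "boolean": bool, "array": list}

def validate_args(args: dict, schema: dict):
    """Return a list of validation errors; empty list means args are OK."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in args and not isinstance(args[field], TYPE_MAP[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors
```

Reject the tool call (and return the error list to the model as the tool result) whenever `validate_args` returns a non-empty list.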
8) Verify success
- Endpoint responds (/v1/models for OpenAI-style, or Ollama /api/tags).
- Model returns a structured tool request (not a plain-text imitation).
- Your controller executes the tool and sends back a role: "tool" result.
- Model uses the tool output in its final answer.
- Bad arguments are rejected safely with a clear error result.
9) Troubleshooting
Model replies with fake tool JSON in normal text
- Wrong chat template/parser for model family.
- Model not actually tuned for tool use.
Malformed arguments / wrong function names
- Tighten tool schema descriptions and required fields.
- Use lower temperature during tool-selection turns.
vLLM fails or is unstable on Windows host
- Run it inside a WSL2 Linux environment.
- Confirm CUDA driver/runtime pairing and VRAM headroom.
Out-of-memory crashes
- Use a smaller model or quantized variant.
- Lower context length and batch/parallel settings.