Tools & agents

The gateway supports tool-calling in three modes. You choose by the format of the request you send.

OpenAI tools format (recommended)

Send tools in the standard OpenAI schema. We forward them to the model and parse the response back into canonical tool_calls, regardless of whether the upstream model emits OpenAI-style or Hermes-style output.

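import os
from openai import OpenAI

# Point the OpenAI SDK at the gateway; the base URL matches the curl example below.
client = OpenAI(
    base_url="https://router.mingles.ai/v1",
    api_key=os.environ["ROUTER_API_KEY"],
)

resp = client.chat.completions.create(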
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Read /etc/hosts"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "read_file",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)
print(resp.choices[0].message.tool_calls)
# Example output (the exact class name depends on the SDK version):
# [ChatCompletionMessageToolCall(id='call_1', function=Function(name='read_file', arguments='{"path":"/etc/hosts"}'), type='function')]
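
Once tool_calls comes back, the client still has to run the tool and send the result to the model. The following is a minimal sketch of that step, not part of the gateway API: the registry, run_tool_call and tool_result_message are hypothetical helper names, and read_file is an assumed local implementation of the tool declared above.

```python
import json

# Hypothetical local implementation of the read_file tool declared in the request.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

# Illustrative registry mapping tool names to local callables (an assumption).
TOOL_REGISTRY = {"read_file": read_file}

def run_tool_call(name: str, arguments: str, registry=TOOL_REGISTRY) -> str:
    """Decode the JSON-encoded arguments string and call the matching function."""
    if name not in registry:
        return f"error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments)
    except json.JSONDecodeError as exc:
        return f"error: malformed arguments ({exc.msg})"
    if not isinstance(kwargs, dict):
        return "error: arguments must be a JSON object"
    return str(registry[name](**kwargs))

def tool_result_message(tool_call_id: str, content: str) -> dict:
    """Build the role="tool" message that answers one tool call."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}

# Usage with the response above (needs a live gateway call, so left commented out).
# Append the assistant message first, then one tool message per call:
# messages.append(resp.choices[0].message)
# for call in resp.choices[0].message.tool_calls or []:
#     output = run_tool_call(call.function.name, call.function.arguments)
#     messages.append(tool_result_message(call.id, output))
```

Returning errors as strings instead of raising lets the model see what went wrong and try again on the next turn.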

Raw Hermes (for OpenClaw and Hermes-native runtimes)

If your runtime already produces and consumes raw Hermes XML (<tool_call>{...}</tool_call>), set the X-Tool-Mode: raw header. We will not rewrite the prompt or parse the output — you get exactly what the model emits.

curl https://router.mingles.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTER_API_KEY" \
  -H "X-Tool-Mode: raw" \
  -H "Content-Type: application/json" \
  -d '{ "model": "moonshotai/Kimi-K2.6", "messages": [...] }'
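
In raw mode the parsing is left to your runtime. As a rough sketch of what that involves, the snippet below pulls <tool_call> blocks out of the message content. The function name is hypothetical, and the assumption that each block holds a JSON object with "name" and "arguments" keys follows the usual Hermes convention rather than anything the gateway guarantees.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(content: str) -> list[dict]:
    """Return the decoded JSON payload of every complete <tool_call> block."""
    calls = []
    for body in TOOL_CALL_RE.findall(content):
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # skip blocks whose JSON is malformed
        if isinstance(payload, dict):
            calls.append(payload)
    return calls

raw = 'Reading it now.\n<tool_call>\n{"name": "read_file", "arguments": {"path": "/etc/hosts"}}\n</tool_call>'
parsed = parse_hermes_tool_calls(raw)
```

A block that was cut off before its closing tag does not match the pattern at all, so truncated output yields an empty list instead of a half-parsed call.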

Streaming tool-calls

Streaming returns standard OpenAI delta.tool_calls chunks. The gateway buffers Hermes XML internally and emits incremental JSON to your client, so runtimes like Cline and Cursor get live indicators.
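
On the client side, the argument fragments have to be joined before they can be parsed. A minimal sketch of that accumulation, assuming chunks shaped like the OpenAI SDK's streaming objects (choices[0].delta.tool_calls entries with index, id and function fields); collect_tool_calls is a hypothetical helper, and the fake stream only stands in for a real stream=True response.

```python
from types import SimpleNamespace as NS

def collect_tool_calls(chunks) -> list[dict]:
    """Merge streamed delta.tool_calls fragments into complete calls, keyed by index."""
    calls: dict[int, dict] = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        for part in delta.tool_calls or []:
            call = calls.setdefault(part.index, {"id": None, "name": "", "arguments": ""})
            if part.id:
                call["id"] = part.id
            if part.function and part.function.name:
                call["name"] += part.function.name
            if part.function and part.function.arguments:
                call["arguments"] += part.function.arguments
    return [calls[i] for i in sorted(calls)]

# Simulated stream for illustration; a real one comes from create(..., stream=True).
def _chunk(index, id=None, name=None, arguments=None):
    part = NS(index=index, id=id, function=NS(name=name, arguments=arguments))
    return NS(choices=[NS(delta=NS(tool_calls=[part]))])

fake_stream = [
    _chunk(0, id="call_1", name="read_file", arguments=""),
    _chunk(0, arguments='{"path":'),
    _chunk(0, arguments='"/etc/hosts"}'),
]
calls = collect_tool_calls(fake_stream)
```

Only call json.loads on the arguments string after the stream has finished; intermediate fragments are not valid JSON on their own.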

Why tool-calls cut off or arrive as malformed JSON

The most common tool-calling failure is output truncation. Tool calls are generated as tokens, so they count against max_tokens. When the budget runs out mid-call, the client receives a partial <tool_call>{... fragment whose arguments fail to parse as JSON. This is what breaks Kilo Code, OpenCode, Hermes/OpenClaw and Context7-style runtimes.

The gateway detects truncation, keeps finish_reason: "length" so you can retry, and logs it — but it cannot reconstruct the missing half. To fix:

Raise max_tokens. Reasoning models also spend output budget on internal thinking before the tool call, so leave generous headroom.
Shorten tool schemas (see the 8 KB note below).
Retry on finish_reason: "length" with a larger budget.

See Model limits & behavior for the full breakdown.
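
The retry step can be sketched as a small wrapper. This is one possible approach rather than a gateway feature: create_with_retry is a hypothetical helper, and the starting budget, growth factor and ceiling are placeholder numbers to tune per model.

```python
def create_with_retry(create, *, max_tokens=1024, growth=2, ceiling=8192, **request):
    """Call create(...) again with a larger max_tokens while finish_reason is "length".

    `create` is any callable with the chat.completions.create signature.
    The default budget values are illustrative assumptions.
    """
    budget = max_tokens
    while True:
        resp = create(max_tokens=budget, **request)
        truncated = resp.choices[0].finish_reason == "length"
        if not truncated or budget >= ceiling:
            return resp  # either complete, or the ceiling was reached
        budget = min(budget * growth, ceiling)

# Usage (needs a live gateway call, so left commented out):
# resp = create_with_retry(
#     client.chat.completions.create,
#     model="moonshotai/Kimi-K2.6",
#     messages=messages,
#     tools=tools,
# )
```

If the response is still truncated at the ceiling, the wrapper returns it unchanged, so check finish_reason before parsing the arguments.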

Known limits

Parallel tool-calls: at most 1 tool-call per assistant message today.
Streaming tool-calls: best-effort; we emit chunks as JSON segments arrive.
Tool schemas: keep them under 8 KB total per request for parsing reliability.
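
A quick pre-flight check can catch oversized schemas before a request is sent. The sketch below measures the tools array as compact UTF-8 JSON. How the gateway itself counts the 8 KB is not specified here, so treating it as 8 * 1024 bytes of serialized JSON is an assumption, and both helper names are hypothetical.

```python
import json

# Assumed interpretation of the 8 KB guideline: 8 * 1024 bytes of compact JSON.
MAX_TOOL_SCHEMA_BYTES = 8 * 1024

def tool_schema_bytes(tools: list[dict]) -> int:
    """Size of the tools array when serialized as compact UTF-8 JSON."""
    text = json.dumps(tools, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))

def check_tool_schemas(tools: list[dict]) -> tuple[bool, int]:
    """Return (fits_under_limit, size_in_bytes) for a tools array."""
    size = tool_schema_bytes(tools)
    return size < MAX_TOOL_SCHEMA_BYTES, size

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]
ok, size = check_tool_schemas(tools)
```

Long description fields are usually the first thing to trim when a schema is over budget.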

Runtime configs

For ready-to-paste configs for OpenClaw, Cline, Continue.dev, Cursor, Aider and n8n, use the Agents Hub generator.