Tools & agents
The gateway supports tool-calling in three modes. You choose by the format of the request you send.
OpenAI tools format (recommended)
Send tools in the standard OpenAI schema. We forward them to the model and
parse the response back into canonical tool_calls, regardless of whether the
upstream model emits OpenAI-style or Hermes-style output.
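import os
from openai import OpenAI

# Point the OpenAI SDK at the gateway (same host and key as the curl example below).
client = OpenAI(
    base_url="https://router.mingles.ai/v1",
    api_key=os.environ["ROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Read /etc/hosts"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "read_file",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)
# Non-streaming responses carry complete tool calls on the message.
print(resp.choices[0].message.tool_calls)
# [ChatCompletionMessageToolCall(id='call_1', function=Function(name='read_file', arguments='{"path":"/etc/hosts"}'), type='function')]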
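Once a tool call comes back, your client still has to execute it and return the result to the model. The following is a minimal sketch of that step, not the gateway's own code: `run_tool_call`, `TOOL_REGISTRY` and the local `read_file` implementation are hypothetical names introduced for illustration. The shape of the returned message follows the standard OpenAI `role: "tool"` convention.

```python
import json


def read_file(path: str) -> str:
    """Hypothetical local implementation of the read_file tool."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


# Illustrative registry mapping tool names from the request to local callables.
TOOL_REGISTRY = {"read_file": read_file}


def run_tool_call(name: str, arguments: str, call_id: str, registry=TOOL_REGISTRY) -> dict:
    """Execute one tool call and build the role="tool" message to send back.

    `arguments` is the JSON string from tool_call.function.arguments.
    Errors are reported to the model as text instead of raising.
    """
    try:
        kwargs = json.loads(arguments)
        content = str(registry[name](**kwargs))
    except (json.JSONDecodeError, KeyError, TypeError, OSError) as exc:
        content = f"tool error: {exc!r}"
    return {"role": "tool", "tool_call_id": call_id, "content": content}
```

To continue the conversation, append the assistant message that contained the tool calls to `messages`, then append `run_tool_call(call.function.name, call.function.arguments, call.id)` for each call, and send the request again.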
Raw Hermes (for OpenClaw and Hermes-native runtimes)
If your runtime already produces and consumes raw Hermes XML
(<tool_call>{...}</tool_call>), set the X-Tool-Mode: raw header.
We will not rewrite the prompt or parse the output — you get exactly what the
model emits.
curl https://router.mingles.ai/v1/chat/completions \
  -H "Authorization: Bearer $ROUTER_API_KEY" \
  -H "X-Tool-Mode: raw" \
  -H "Content-Type: application/json" \
  -d '{ "model": "moonshotai/Kimi-K2.6", "messages": [...] }'
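In raw mode the parsing is up to you. If your runtime does not already have a Hermes parser, a minimal sketch could look like the one below. It assumes each `<tool_call>` block wraps a single JSON object (commonly of the form `{"name": ..., "arguments": {...}}`, which is an assumption here, not something the gateway guarantees), and `parse_hermes_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract the JSON payload of every complete <tool_call> block in `text`.

    Unterminated blocks (for example from truncated output) and blocks whose
    body is not a JSON object are skipped rather than raising.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict):
            calls.append(payload)
    return calls
```

Because truncated blocks are silently skipped, a caller would typically also check `finish_reason` before treating an empty result as "no tool call".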
Streaming tool-calls
Streaming returns standard OpenAI delta.tool_calls chunks. The gateway
buffers Hermes XML internally and emits incremental JSON to your client, so
runtimes like Cline and Cursor get live indicators.
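If you consume the stream yourself instead of through one of those runtimes, the argument fragments have to be stitched back together. The sketch below assumes the chunk shape used by the OpenAI Python SDK (`chunk.choices[0].delta.tool_calls`, each entry carrying `index`, `id`, `function.name` and `function.arguments`); `accumulate_tool_calls` is a hypothetical helper, not part of the SDK or the gateway.

```python
def accumulate_tool_calls(chunks):
    """Merge streamed delta.tool_calls fragments into complete tool calls.

    Fragments are grouped by their `index`; `arguments` pieces are
    concatenated in arrival order and left as a JSON string so the caller
    can decide how to handle a truncated (unparseable) result.
    """
    calls = {}
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        for delta in chunk.choices[0].delta.tool_calls or []:
            slot = calls.setdefault(delta.index, {"id": None, "name": "", "arguments": ""})
            if delta.id:
                slot["id"] = delta.id
            if delta.function is not None:
                if delta.function.name:
                    slot["name"] += delta.function.name
                if delta.function.arguments:
                    slot["arguments"] += delta.function.arguments
    return [calls[i] for i in sorted(calls)]
```

With the SDK this would be called as `accumulate_tool_calls(client.chat.completions.create(..., stream=True))`, after which each `arguments` string can be passed to `json.loads`.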
Why tool-calls cut off or arrive as malformed JSON
The most common tool-calling failure is output truncation. Tool calls are
generated as tokens, so they count against max_tokens. When the budget
runs out mid-call, the client receives a partial <tool_call>{... fragment whose
arguments fail to parse as JSON. This is what breaks Kilo Code,
OpenCode, Hermes/OpenClaw and Context7-style runtimes.
The gateway detects truncation, keeps finish_reason: "length" so you can retry,
and logs it — but it cannot reconstruct the missing half. To fix:
- Raise max_tokens. Reasoning models also spend output budget on internal thinking before the tool call, so leave generous headroom.
- Shorten tool schemas (see the 8 KB note below).
- Retry on finish_reason: "length" with a larger budget.
See Model limits & behavior for the full breakdown.
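The retry advice above can be sketched as a small wrapper. This is an illustration under assumptions: `call_with_growing_budget` is a hypothetical helper, and the starting budget, growth factor and ceiling are arbitrary example values, not gateway recommendations. `create` is any callable with the semantics of `client.chat.completions.create`.

```python
def call_with_growing_budget(create, *, max_tokens=1024, factor=2, ceiling=16384, **request):
    """Retry a chat completion while it ends with finish_reason == "length".

    Each retry multiplies max_tokens by `factor`, capped at `ceiling`.
    The last response is returned even if it is still truncated, so the
    caller should check finish_reason once more.
    """
    budget = max_tokens
    while True:
        resp = create(max_tokens=budget, **request)
        if resp.choices[0].finish_reason != "length" or budget >= ceiling:
            return resp
        budget = min(budget * factor, ceiling)
```

A call would look like `call_with_growing_budget(client.chat.completions.create, model=..., messages=..., tools=...)`. Every retry is billed as a full request, so starting with a realistic budget is cheaper than relying on the loop.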
Known limits
- Parallel tool-calls: at most 1 tool-call per assistant message today.
- Streaming tool-calls: best-effort; we emit chunks as JSON segments arrive.
- Tool schemas: kept under 8 KB total per request for parsing reliability.
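To catch oversized schemas before sending a request, you can measure them client-side. The sketch below assumes that "8 KB" means 8192 bytes of compactly serialized JSON for the whole `tools` array; how the gateway actually measures size is not specified here, so treat the result as an estimate. `tool_schema_bytes` and `check_tool_schemas` are hypothetical helper names.

```python
import json

# Assumption: 8 KB is interpreted as 8192 bytes of serialized JSON.
MAX_SCHEMA_BYTES = 8 * 1024


def tool_schema_bytes(tools: list) -> int:
    """Size in bytes of the tools array when serialized as compact JSON."""
    return len(json.dumps(tools, separators=(",", ":")).encode("utf-8"))


def check_tool_schemas(tools: list) -> int:
    """Return the serialized size, or raise ValueError if it exceeds the limit."""
    size = tool_schema_bytes(tools)
    if size > MAX_SCHEMA_BYTES:
        raise ValueError(f"tool schemas are {size} bytes; limit is {MAX_SCHEMA_BYTES}")
    return size
```

Long `description` fields and deeply nested `properties` are usually the first things to trim when this check fails.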
Runtime configs
For ready-to-paste configs for OpenClaw, Cline, Continue.dev, Cursor, Aider and n8n, use the Agents Hub generator.