Model limits & behavior
The gateway is OpenAI-compatible, but the underlying models, which are served by
upstream providers, behave differently from api.openai.com in a few ways. Most
“agent broke” reports trace back to the points below, so read this before filing a bug.
Token limits per model
Every model advertises its limits on the /v1/models endpoint, so SDKs and
agent runtimes can size requests correctly:
curl https://router.mingles.ai/v1/models \
  -H "Authorization: Bearer $ROUTER_API_KEY"
{
  "object": "list",
  "data": [
    {
      "id": "moonshotai/Kimi-K2.6",
      "object": "model",
      "owned_by": "mingles",
      "context_window": 200000,
      "max_output_tokens": 8192
    }
  ]
}
- context_window: total input + output tokens the model can hold.
- max_output_tokens: hard cap on what a single completion can generate.
These are advisory limits the gateway reports; the upstream network enforces the real caps. Always read them at runtime rather than hard-coding numbers.
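Reading the limits at runtime can be sketched as follows, using only the Python standard library. This is a minimal sketch that assumes the response shape shown above and an API key supplied by the caller. The helper names fetch_models and limits_by_model are illustrative and not part of any SDK.

```python
import json
import urllib.request

MODELS_URL = "https://router.mingles.ai/v1/models"


def fetch_models(api_key: str, url: str = MODELS_URL) -> dict:
    """GET /v1/models and return the decoded JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def limits_by_model(payload: dict) -> dict:
    """Map model id -> (context_window, max_output_tokens).

    Entries missing either field are skipped rather than guessed.
    """
    limits = {}
    for model in payload.get("data", []):
        window = model.get("context_window")
        max_out = model.get("max_output_tokens")
        if window is not None and max_out is not None:
            limits[model["id"]] = (window, max_out)
    return limits
```

A caller would typically run limits_by_model(fetch_models(key)) once at startup and look up the active model's pair before sizing each request.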
Reasoning models spend the output budget on “thinking”
This is the single most common surprise. Reasoning-capable models (e.g. Kimi)
generate internal reasoning tokens before the visible answer, and that
reasoning counts against max_output_tokens. If the budget is too small, the
model can spend the entire allowance thinking and return zero visible
tokens — the response looks empty or stalls.
What to do:
- Set max_tokens deliberately and generously; don’t leave it at a small default. If you want ~1k tokens of answer, allow several thousand so reasoning has headroom.
- If you get an empty completion with finish_reason: "length", that’s the signature of this problem: raise max_tokens.
- Keep prompts lean so more of the window is available for output.
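The empty-completion check and the budget increase described above might look like the sketch below. It assumes the standard OpenAI chat completion shape (choices[0].finish_reason and choices[0].message.content). The function names and the doubling factor are illustrative choices, not gateway behavior.

```python
from typing import Optional


def needs_bigger_budget(response: dict) -> bool:
    """True when the first choice stopped on length with no visible text."""
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()


def next_max_tokens(current: int, max_output_tokens: int, factor: int = 2) -> Optional[int]:
    """Return a larger max_tokens, or None if already at the model's cap."""
    if current >= max_output_tokens:
        return None
    return min(current * factor, max_output_tokens)
```

When next_max_tokens returns None the model's advertised cap has been reached, so the remaining option is to shorten the prompt rather than retry.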
Truncated tool calls → malformed JSON
When a tool call is cut off by max_output_tokens, the partial
<tool_call>{... fragment can reach your client as malformed JSON arguments.
The gateway detects this case, keeps finish_reason: "length" (so you can
retry), and logs it — but it cannot invent the missing half.
What to do:
- Raise max_tokens so the tool call can complete.
- Shorten tool schemas (keep them well under 8 KB total; see Tools & agents).
- On finish_reason: "length" during a tool call, retry with a larger budget.
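On the client side, a truncated tool call can be detected before its arguments are passed to a handler. The sketch below assumes the OpenAI chat completion shape, where each entry in message.tool_calls carries its arguments as a JSON string. The function name truncated_tool_calls is illustrative.

```python
import json


def truncated_tool_calls(response: dict) -> list:
    """Return names of tool calls whose arguments are not valid JSON.

    Only inspects responses that stopped with finish_reason "length";
    any other finish_reason yields an empty list.
    """
    choice = response["choices"][0]
    if choice.get("finish_reason") != "length":
        return []
    broken = []
    for call in (choice.get("message") or {}).get("tool_calls") or []:
        fn = call.get("function") or {}
        try:
            json.loads(fn.get("arguments") or "")
        except json.JSONDecodeError:
            broken.append(fn.get("name", "<unknown>"))
    return broken
```

A non-empty result is the cue to retry the same request with a larger max_tokens instead of executing the partial call.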
Large system prompts
Agents that ship 30k+ token system prompts (some OpenClaw/Hermes presets) can exhaust the input side of the window or starve the output budget. Trim the system prompt, or move static context into a retrieval step instead of pasting it on every request.
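The arithmetic behind this can be sketched as below. It is a rough illustration that assumes you already know the input token count, for example from usage.prompt_tokens on an earlier response if the gateway returns the standard usage object. The function name output_room is hypothetical.

```python
def output_room(context_window: int, max_output_tokens: int, input_tokens: int) -> int:
    """Tokens left for the completion once the input is accounted for.

    The result is bounded above by max_output_tokens and below by zero.
    """
    return max(0, min(max_output_tokens, context_window - input_tokens))
```

With the limits from the sample response (200000 and 8192), a 30000-token input leaves the full 8192 for output, a 195000-token input leaves 5000, and anything past the window leaves nothing.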
No KV cache
There is no prompt/KV caching across requests today. A long prefix that you re-send is re-processed from scratch every time; it does not get cheaper or faster on repeat. Design agent loops to avoid re-sending unchanged context where you can.
No built-in tools (web search / fetch)
The models have no built-in web search or web fetch. If your agent needs live data, implement it yourself as a normal function tool (you fetch the page and pass the text back in) — the gateway will route the tool call, but it will not browse the web for you.
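A do-it-yourself fetch tool might be sketched as follows, using the OpenAI function-calling format for the tool definition and the tool-role reply message. The tool name fetch_url, its parameters, and the helper names are illustrative assumptions and nothing the gateway provides. The character limit is an arbitrary choice to keep tool results small.

```python
import json
import urllib.request

# Tool definition to pass in the request's "tools" array.
# ASSUMPTION: "fetch_url" is a made-up name for this example.
FETCH_TOOL = {
    "type": "function",
    "function": {
        "name": "fetch_url",
        "description": "Fetch a web page and return the start of its raw body.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Absolute http(s) URL"}
            },
            "required": ["url"],
        },
    },
}


def http_get(url: str, limit: int = 20000) -> str:
    """Download a page and return at most `limit` characters of its body."""
    with urllib.request.urlopen(url, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")[:limit]


def run_tool_call(tool_call: dict, fetch=http_get) -> dict:
    """Execute one fetch_url call and build the tool message to send back."""
    args = json.loads(tool_call["function"]["arguments"])
    try:
        content = fetch(args["url"])
    except Exception as exc:  # report the failure to the model instead of crashing
        content = f"fetch failed: {exc}"
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}
```

The returned dictionary is appended to the conversation after the assistant message that requested the call, and the next request then lets the model read the fetched text.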
Modality
Current models are text-only. Image input is not enabled — multimodal requests will be rejected or ignored. Track the model list for changes.
Quick checklist
| Symptom | Likely cause | Fix |
|---|---|---|
| Empty response / stalls | Reasoning ate the output budget | Raise max_tokens |
| finish_reason: "length" | Hit max_output_tokens | Raise max_tokens, shorten prompt |
| Malformed tool-call JSON | Tool call truncated | Raise max_tokens, shorten tool schemas |
| Agent with huge system prompt fails | Input too large | Trim system prompt |
| Repeated prompts not faster | No KV cache | Avoid re-sending unchanged context |
| Web search “doesn’t work” | No built-in tools | Implement search as your own tool |