Model limits & behavior

The gateway is OpenAI-compatible, but the models served by upstream providers differ from api.openai.com in a few ways. Most “agent broke” reports trace back to the points below, so read this before filing a bug.

Token limits per model

Every model advertises its limits on the /v1/models endpoint, so SDKs and agent runtimes can size requests correctly:

curl https://router.mingles.ai/v1/models \
  -H "Authorization: Bearer $ROUTER_API_KEY"

{
  "object": "list",
  "data": [
    {
      "id": "moonshotai/Kimi-K2.6",
      "object": "model",
      "owned_by": "mingles",
      "context_window": 200000,
      "max_output_tokens": 8192
    }
  ]
}

context_window — total input + output tokens the model can hold.
max_output_tokens — hard cap on what a single completion can generate.

The gateway reports these limits as advisory values; the upstream provider enforces the actual caps. Always read them at runtime rather than hard-coding numbers.
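As a minimal sketch of reading the limits at runtime, the Python below fetches the model list once and derives an output budget from the two reported fields. The helper names (`parse_model_limits`, `fetch_model_limits`, `pick_max_tokens`) are illustrative and not part of the gateway, and `prompt_tokens` is assumed to be an estimate you already have.

```python
import json
import urllib.request

BASE_URL = "https://router.mingles.ai/v1"


def parse_model_limits(payload: dict) -> dict:
    """Map model id -> (context_window, max_output_tokens) from a /v1/models body."""
    limits = {}
    for model in payload.get("data", []):
        window = model.get("context_window")
        cap = model.get("max_output_tokens")
        if window is not None and cap is not None:  # skip models without limits
            limits[model["id"]] = (window, cap)
    return limits


def fetch_model_limits(api_key: str) -> dict:
    """Hypothetical helper: GET /v1/models and parse the advertised limits."""
    req = urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_model_limits(json.load(resp))


def pick_max_tokens(limits: tuple, prompt_tokens: int) -> int:
    """Largest output budget that fits both the window and the output cap."""
    context_window, max_output_tokens = limits
    room = context_window - prompt_tokens
    return max(0, min(max_output_tokens, room))
```

Calling `fetch_model_limits(os.environ["ROUTER_API_KEY"])` at startup (after `import os`) and passing the entry for your model to `pick_max_tokens` keeps the numbers out of your source code.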

Reasoning models spend the output budget on “thinking”

This is the single most common surprise. Reasoning-capable models (e.g. Kimi) generate internal reasoning tokens before the visible answer, and that reasoning counts against max_output_tokens. If the budget is too small, the model can spend the entire allowance thinking and return zero visible tokens — the response looks empty or stalls.

What to do:

Set max_tokens deliberately and generously — don’t leave it at a small default. If you want ~1k tokens of answer, allow several thousand so reasoning has headroom.
If you get an empty completion with finish_reason: "length", that’s the signature of this problem — raise max_tokens.
Keep prompts lean so more of the window is available for output.
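The advice above could be automated along these lines. This is a sketch, not gateway behavior: `send` is a caller-supplied function (an assumption here) that performs the chat completion request and returns the OpenAI-style response as a dict, and `ceiling` should come from the model's advertised max_output_tokens.

```python
def complete_with_headroom(send, messages, max_tokens, ceiling, growth=2):
    """Call send() and grow max_tokens while the reply is empty and truncated.

    Returns (response, max_tokens_used). Gives up once max_tokens reaches
    ceiling, returning the last response so the caller can inspect it.
    """
    while True:
        response = send(messages, max_tokens)
        choice = response["choices"][0]
        content = choice["message"].get("content") or ""
        truncated = choice.get("finish_reason") == "length"
        # Empty text plus finish_reason "length" is the signature described above.
        if not (truncated and not content.strip()):
            return response, max_tokens
        if max_tokens >= ceiling:
            return response, max_tokens  # already at the cap; nothing left to raise
        max_tokens = min(ceiling, max_tokens * growth)
```

Each retry is a full request, so starting with a generous budget is still cheaper than relying on the loop.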

Truncated tool calls → malformed JSON

When a tool call is cut off by max_output_tokens, the partial <tool_call>{... fragment can reach your client as malformed JSON arguments. The gateway detects this case, keeps finish_reason: "length" (so you can retry), and logs it — but it cannot invent the missing half.

What to do:

Raise max_tokens so the tool call can complete.
Shorten tool schemas (keep them well under 8 KB total — see Tools & agents).
On finish_reason: "length" during a tool call, retry with a larger budget.
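On the client side, a truncated call can be told apart from a valid one by parsing the arguments before executing anything. The sketch below assumes the standard OpenAI `tool_calls` response shape; `parse_tool_calls` is a hypothetical helper, not something the gateway provides.

```python
import json


def parse_tool_calls(choice: dict):
    """Return (calls, needs_retry) for one OpenAI-style choice dict.

    calls is a list of (name, arguments_dict). needs_retry is True when the
    arguments fail to parse and finish_reason is "length", i.e. the tool call
    was cut off and should be retried with a larger max_tokens.
    """
    calls = []
    for tool_call in choice["message"].get("tool_calls") or []:
        fn = tool_call["function"]
        try:
            args = json.loads(fn.get("arguments") or "{}")
        except json.JSONDecodeError:
            if choice.get("finish_reason") == "length":
                return [], True  # truncated: run nothing, retry instead
            raise  # malformed for some other reason: surface the error
        calls.append((fn["name"], args))
    return calls, False
```

Returning no calls on truncation is deliberate: executing the tool calls that did parse while dropping the broken one would leave the conversation in an inconsistent state.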

Large system prompts

Agents that ship 30k+ token system prompts (some OpenClaw/Hermes presets) can exhaust the input side of the window or starve the output budget. Trim the system prompt, or move static context into a retrieval step instead of pasting it on every request.

No KV cache

There is no prompt/KV caching across requests today. A long prefix that you re-send is re-processed every time; it does not get cheaper or faster on repeat. Design agent loops to avoid re-sending unchanged context where you can.

No built-in tools (web search / fetch)

The models have no built-in web search or web fetch. If your agent needs live data, implement it yourself as a normal function tool (you fetch the page and pass the text back in) — the gateway will route the tool call, but it will not browse the web for you.
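A do-it-yourself fetch tool might look like the following sketch. The tool name `fetch_url`, its schema, and the `max_chars` limit are illustrative choices; the definition uses the OpenAI function-calling format, which the gateway is assumed to accept as part of its OpenAI compatibility.

```python
import json
import urllib.parse
import urllib.request

# Tool definition to pass in the request's "tools" array.
FETCH_URL_TOOL = {
    "type": "function",
    "function": {
        "name": "fetch_url",  # hypothetical tool name
        "description": "Fetch a web page and return its text.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}


def fetch_url(url: str, max_chars: int = 20000) -> str:
    """Download a page over http(s) and return at most max_chars characters."""
    if urllib.parse.urlsplit(url).scheme not in ("http", "https"):
        raise ValueError("only http and https URLs are allowed")
    with urllib.request.urlopen(url, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]


def run_tool_call(tool_call: dict, fetch=fetch_url) -> dict:
    """Execute one fetch_url call and build the "tool" message to send back."""
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": fetch(args["url"]),
    }
```

Your agent loop appends the returned message to the conversation and sends the next request. Capping the returned text with `max_chars` keeps a single page from consuming the context window.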

Modality

Current models are text-only. Image input is not enabled — multimodal requests will be rejected or ignored. Track the model list for changes.

Quick checklist

Symptom	Likely cause	Fix
Empty response / stalls	Reasoning ate the output budget	Raise `max_tokens`
`finish_reason: "length"`	Hit `max_output_tokens`	Raise `max_tokens`, shorten prompt
Malformed tool-call JSON	Tool call truncated	Raise `max_tokens`, shorten tool schemas
Agent with huge system prompt fails	Input too large	Trim system prompt
Repeated prompts not faster	No KV cache	Avoid re-sending unchanged context
Web search “doesn’t work”	No built-in tools	Implement search as your own tool