Model limits & behavior

The gateway is OpenAI-compatible, but the models served by upstream providers behave differently from api.openai.com in a few ways. Most “agent broke” reports trace back to the points below, so read this before filing a bug.

Token limits per model

Every model advertises its limits on the /v1/models endpoint, so SDKs and agent runtimes can size requests correctly:

curl https://router.mingles.ai/v1/models \
  -H "Authorization: Bearer $ROUTER_API_KEY"

Example response:

{
  "object": "list",
  "data": [
    {
      "id": "moonshotai/Kimi-K2.6",
      "object": "model",
      "owned_by": "mingles",
      "context_window": 200000,
      "max_output_tokens": 8192
    }
  ]
}

  • context_window: total input + output tokens the model can hold.
  • max_output_tokens: hard cap on what a single completion can generate.

These are advisory limits the gateway reports; the upstream network enforces the real caps. Always read them at runtime rather than hard-coding numbers.
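
As a minimal sketch of reading the limits at runtime, the Python below fetches the model list and indexes the two fields by model id. The helper names index_limits and fetch_model_limits are illustrative and not part of any SDK; the URL, header, and field names are taken from the example above.

```python
import json
import urllib.request

MODELS_URL = "https://router.mingles.ai/v1/models"


def index_limits(payload: dict) -> dict:
    """Map model id -> (context_window, max_output_tokens) from a /v1/models body."""
    limits = {}
    for model in payload.get("data", []):
        # Skip entries that do not advertise both limits rather than guessing.
        if "context_window" in model and "max_output_tokens" in model:
            limits[model["id"]] = (model["context_window"], model["max_output_tokens"])
    return limits


def fetch_model_limits(api_key: str, url: str = MODELS_URL) -> dict:
    """Fetch the model list and return the indexed limits (performs a network call)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return index_limits(json.load(response))
```

Calling fetch_model_limits(os.environ["ROUTER_API_KEY"]) once at startup and looking up the model id before each request keeps the numbers out of your source code.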

Reasoning models spend the output budget on “thinking”

This is the single most common surprise. Reasoning-capable models (e.g. Kimi) generate internal reasoning tokens before the visible answer, and that reasoning counts against max_output_tokens. If the budget is too small, the model can spend the entire allowance thinking and return zero visible tokens — the response looks empty or stalls.

What to do:

  • Set max_tokens deliberately and generously — don’t leave it at a small default. If you want ~1k tokens of answer, allow several thousand so reasoning has headroom.
  • If you get an empty completion with finish_reason: "length", that’s the signature of this problem — raise max_tokens.
  • Keep prompts lean so more of the window is available for output.
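
The empty-completion signature described above can be checked in code. This is a sketch that assumes the response follows the OpenAI chat completions shape (a choice with finish_reason and message.content); the function names and the doubling factor are illustrative choices, not gateway behavior.

```python
from typing import Optional


def reasoning_starved(choice: dict) -> bool:
    """True when a choice hit the length cap without any visible text or tool call."""
    message = choice.get("message") or {}
    content = message.get("content") or ""
    has_tool_calls = bool(message.get("tool_calls"))
    return choice.get("finish_reason") == "length" and not content.strip() and not has_tool_calls


def next_max_tokens(current: int, max_output_tokens: int, factor: int = 2) -> Optional[int]:
    """Grow the budget, capped at max_output_tokens; None if already at the cap."""
    if current >= max_output_tokens:
        return None  # raising max_tokens further cannot help; shorten the prompt instead
    return min(current * factor, max_output_tokens)
```

A retry loop would call reasoning_starved on the first choice and, if it returns True, resend the request with the value from next_max_tokens until that returns None.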

Truncated tool calls → malformed JSON

When a tool call is cut off by max_output_tokens, the partial <tool_call>{... fragment can reach your client as malformed JSON arguments. The gateway detects this case, keeps finish_reason: "length" (so you can retry), and logs it — but it cannot invent the missing half.

What to do:

  • Raise max_tokens so the tool call can complete.
  • Shorten tool schemas (keep them well under 8 KB total — see Tools & agents).
  • On finish_reason: "length" during a tool call, retry with a larger budget.
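
One way to catch a truncated tool call on the client is to parse the arguments defensively and only retry when the parse failure coincides with finish_reason: "length". The sketch below assumes OpenAI-style tool_calls entries with a function.name and a JSON string in function.arguments; the helper names are hypothetical.

```python
import json


def parse_tool_arguments(choice: dict) -> list:
    """Return [(tool_name, args_or_None)]; None marks arguments that did not parse."""
    results = []
    message = choice.get("message") or {}
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        try:
            args = json.loads(function.get("arguments") or "")
        except json.JSONDecodeError:
            args = None  # truncated or otherwise malformed JSON
        results.append((function.get("name"), args))
    return results


def should_retry_with_more_tokens(choice: dict) -> bool:
    """True when the completion was cut off and at least one tool call is unparseable."""
    truncated = choice.get("finish_reason") == "length"
    broken = any(args is None for _, args in parse_tool_arguments(choice))
    return truncated and broken
```

Never execute a tool whose arguments came back as None; resend the request with a larger max_tokens instead.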

Large system prompts

Agents that ship 30k+ token system prompts (some OpenClaw/Hermes presets) can exhaust the input side of the window or starve the output budget. Trim the system prompt, or move static context into a retrieval step instead of pasting it on every request.
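
To see how much output budget a large prompt leaves, a rough pre-flight check can help. The sketch below uses the context_window and max_output_tokens definitions from the token limits section. The 4-characters-per-token ratio is an assumption for illustration only, since real tokenizers vary by model and language.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ASSUMPTION: about 4 characters per token."""
    return int(len(text) / chars_per_token) + 1


def output_headroom(context_window: int, max_output_tokens: int, input_tokens: int) -> int:
    """Tokens left for output once the input is counted, never above max_output_tokens."""
    remaining = max(context_window - input_tokens, 0)
    return min(remaining, max_output_tokens)
```

If output_headroom returns less than the max_tokens you plan to request, trim the system prompt before sending rather than after a failed call.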

No KV cache

There is no prompt/KV caching across requests today. A long prefix that you re-send is re-processed in full every time, so it does not get cheaper or faster on repeat. Design agent loops to avoid re-sending unchanged context where you can.
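
One simple way to send less on each turn is to cap the conversation history. The sketch below is one possible approach, not a gateway feature: the hypothetical trim_history helper keeps all system messages and only the most recent turns.

```python
def trim_history(messages: list, keep_last: int) -> list:
    """Keep every system message plus the most recent keep_last non-system messages."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent
```

This version cuts purely by count. OpenAI-style APIs typically expect a tool result to follow the assistant message that requested it, so a production version should avoid cutting between the two.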

No built-in tools (web search / fetch)

The models have no built-in web search or web fetch. If your agent needs live data, implement it yourself as a normal function tool (you fetch the page and pass the text back in) — the gateway will route the tool call, but it will not browse the web for you.
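
A web fetch tool of the kind described above might look like the following sketch. The tool name fetch_url, its schema, and the helper functions are all hypothetical; the definition uses the OpenAI-style function tool format, which the gateway is assumed to accept as part of its OpenAI compatibility.

```python
import json
import urllib.request

# Tool definition in the OpenAI-style "function" format; name and fields are illustrative.
FETCH_URL_TOOL = {
    "type": "function",
    "function": {
        "name": "fetch_url",
        "description": "Fetch a web page and return its text.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "Absolute http(s) URL"}},
            "required": ["url"],
        },
    },
}


def fetch_text(url: str, max_chars: int = 20000) -> str:
    """Download a page and return at most max_chars characters (performs a network call)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("only http(s) URLs are allowed")
    with urllib.request.urlopen(url, timeout=15) as response:
        return response.read().decode("utf-8", errors="replace")[:max_chars]


def run_tool_call(tool_call: dict, fetch=fetch_text) -> dict:
    """Execute one tool call and build the tool message to send back to the model."""
    function = tool_call["function"]
    if function["name"] != "fetch_url":
        content = f"unknown tool: {function['name']}"
    else:
        try:
            content = fetch(json.loads(function["arguments"])["url"])
        except Exception as error:  # report failures to the model instead of crashing the loop
            content = f"fetch failed: {error}"
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}
```

Pass FETCH_URL_TOOL in the request's tools list, run run_tool_call on each tool call the model returns, and append the resulting message to the conversation before the next request. Capping the returned text with max_chars keeps a single page from filling the context window.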

Modality

Current models are text-only. Image input is not enabled — multimodal requests will be rejected or ignored. Track the model list for changes.

Quick checklist

Symptom | Likely cause | Fix
Empty response / stalls | Reasoning ate the output budget | Raise max_tokens
finish_reason: "length" | Hit max_output_tokens | Raise max_tokens, shorten prompt
Malformed tool-call JSON | Tool call truncated | Raise max_tokens, shorten tool schemas
Agent with huge system prompt fails | Input too large | Trim system prompt
Repeated prompts not faster | No KV cache | Avoid re-sending unchanged context
Web search “doesn’t work” | No built-in tools | Implement search as your own tool