Prompt Caching

Cache long static context and pay less per request

Prompt caching is most effective when you keep a stable prefix (system prompt, policy text, retrieved context) at the start of every request and move dynamic user input to the end. Providers typically match the cache on an exact prefix, so any change to the shared blocks invalidates it.

Reference payload

Use this baseline request shape, adapting the model, provider sort strategy, and token limits to your workload.

request.json
{
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "long policy text"},
        {"type": "text", "text": "long KB chunk", "cache_control": {"type": "ephemeral"}}
      ]
    },
    {"role": "user", "content": "answer with citations"}
  ]
}
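Because a cache hit requires the prefix to match exactly, build each request from the same frozen system blocks and vary only the user turn. A minimal TypeScript sketch, mirroring the reference payload above (the helper name is illustrative; the `cache_control` breakpoint on the last block marks the end of the cacheable span under Anthropic-style caching):

```typescript
// Stable system blocks, defined once and reused verbatim on every request.
const SYSTEM_BLOCKS = [
  { type: "text", text: "long policy text" },
  { type: "text", text: "long KB chunk", cache_control: { type: "ephemeral" } },
];

// Hypothetical helper: only the user turn changes between calls, so the
// cached prefix stays byte-identical and eligible for cache reads.
function buildCachedRequest(userPrompt: string) {
  return {
    messages: [
      { role: "system", content: SYSTEM_BLOCKS },
      { role: "user", content: userPrompt },
    ],
  };
}

const a = buildCachedRequest("answer with citations");
const b = buildCachedRequest("summarize section 2");
// Both payloads share an identical system prefix; only the tail differs.
```

If you rebuild the system blocks per request (for example, by re-running retrieval or injecting a timestamp), the prefix changes and you pay the cache-write price every time.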

Configuration examples

TheRouter.ai keeps request semantics consistent across providers, so you can tune behavior without rewriting your app layer.

TypeScript
const payload = {
  model: "anthropic/claude-sonnet-4.5",
  messages: [
    {
      role: "system",
      content: [
        { type: "text", text: kbChunk },
        { type: "text", text: policies, cache_control: { type: "ephemeral", ttl: "1h" } },
      ],
    },
    { role: "user", content: userPrompt },
  ],
};
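Any OpenAI-compatible chat completions client can send this payload. A hedged fetch sketch; the base URL and environment variable name below are assumptions, so substitute the endpoint and key from your TheRouter.ai dashboard:

```typescript
const BASE_URL = "https://api.therouter.ai/v1"; // assumed endpoint; verify in your dashboard

// Pure helper so headers can be built (and inspected) without a network call.
function buildHeaders(apiKey: string): Record<string, string> {
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
}

async function send(payload: unknown, apiKey: string) {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: buildHeaders(apiKey),
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`request failed: ${res.status}`);
  return res.json();
}
```

Usage: `await send(payload, process.env.THEROUTER_API_KEY!)` with the payload defined above.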

Production note

Operate with guardrails: on some providers, cache writes cost more than cache reads, so a low hit rate can make caching a net loss. Measure hit rates before rolling out globally.

Use the activity feed and usage exports to validate that these settings improve reliability and cost in your real traffic mix.
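Hit rate can be computed from per-request usage counters in your exports. Field names vary by provider; the sketch below assumes Anthropic-style `cache_read_input_tokens` and `cache_creation_input_tokens` counters and is illustrative, not an exact export schema:

```typescript
interface Usage {
  input_tokens: number;                // uncached input tokens
  cache_read_input_tokens: number;     // input tokens served from cache
  cache_creation_input_tokens: number; // input tokens written to cache
}

// Fraction of input tokens served from cache across a batch of requests.
function cacheHitRate(rows: Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of rows) {
    read += u.cache_read_input_tokens;
    total +=
      u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  }
  return total === 0 ? 0 : read / total;
}
```

A hit rate near zero usually means the prefix is changing between requests (timestamps, shuffled retrieval order, per-user text ahead of the cached blocks).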