There's a specific kind of dread that hits when you check your AI API dashboard and the number is much, much larger than you expected. It happens to OpenClaw users constantly — not because the software is wasteful, but because the defaults are optimised for flexibility, not cost.

This post covers the four configuration changes that cut the bill by 80–95% in most setups, with the actual numbers and the interactive calculator you can use to test your specific usage.

Why Default Settings Are Expensive

OpenClaw ships with auto-compaction enabled and uses a short cache-retention profile by default — so it's not starting from zero. The problem is that it leaves the context window unconstrained. Without a hard token ceiling, conversations grow indefinitely, compaction can only do so much, and the model still bills you for every token it has to process — summarised or not.

Here's what the default setup does on every single message:

Sends your entire system prompt, processed and billed again on every turn
Carries the full conversation history forward — by message 30, you're sending 30,000+ tokens of context that mostly won't affect the response
Has no fallback for simple tasks — a "remind me to call back" request costs the same as a deep research task

The result: an agent doing 50 messages/day on Claude Opus 4.6, once conversations grow long and tool schemas are injected, can easily average 40,000–60,000 input tokens per message, running $400–$600/month in API fees. The same agent with the changes below costs $50–$80/month.

You can test these numbers against your own usage in the OpenClaw API Cost Calculator: adjust the sliders to match your message volume and see the before/after instantly.

The Four Changes

All four settings live under agents.defaults in your openclaw.json. The snippets below show only the relevant slice of that object.

1. Set a hard context limit

Without a cap, your conversation window grows indefinitely. Token 1 of a context window costs the same as token 50,000 — and most of that old context is doing nothing useful.

Setting contextTokens to 16,000–20,000 covers roughly the last 10–15 full exchanges, which is all an agent typically needs for coherent responses.

{
  "contextTokens": 16000
}
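To see what a hard ceiling does to the history, here is a minimal Python sketch of a sliding-window cap. It is not OpenClaw's implementation (OpenClaw compacts rather than simply dropping turns). It trims the oldest exchanges until the system prompt plus the remaining history fits under the budget. The 2,000-token system prompt and ~1,000-token exchange size are illustrative assumptions.

```python
def trim_to_budget(system_tokens: int, turns: list[int], budget: int = 16_000) -> list[int]:
    """Drop the oldest turns until the system prompt plus kept turns fit in `budget`.

    `turns` holds one token count per exchange, oldest first.
    """
    kept = list(turns)  # copy so the caller's history is untouched
    while kept and system_tokens + sum(kept) > budget:
        kept.pop(0)  # discard the oldest exchange first
    return kept


history = [1_000] * 30  # assumption: 30 exchanges of ~1,000 tokens each
kept = trim_to_budget(system_tokens=2_000, turns=history)
print(len(kept), 2_000 + sum(kept))  # 14 16000
```

With 30 exchanges of ~1,000 tokens, a 16,000-token budget keeps the 14 most recent ones, consistent with the 10–15 exchange estimate.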

2. Tune automatic context compaction

Instead of cutting off older messages abruptly, compaction summarises them. When the context approaches the cap, OpenClaw replaces old turns with a compact summary. The agent keeps long-term memory without carrying every word forward at full token cost. OpenClaw enables this by default — but pairing it with a hard contextTokens ceiling keeps the window from growing far enough to become expensive before compaction kicks in.

{
  "contextTokens": 16000,
  "compaction": {
    "mode": "default",
    "memoryFlush": {
      "softThresholdTokens": 13600
    }
  }
}

This alone keeps context-per-message from climbing past roughly 14,000–16,000 tokens, however long the conversation runs, instead of growing to the 20,000–50,000 tokens that long conversations carry without it.
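The effect on per-message context can be simulated. The Python sketch below is a simplified, hypothetical model, not OpenClaw's actual algorithm. It treats softThresholdTokens as the compaction trigger, which is a simplification. Each exchange adds an assumed 1,000 tokens. Whenever the next message would push the context past 13,600 tokens, all older turns are folded into a single summary of an assumed 1,500 tokens.

```python
SYSTEM_PROMPT = 2_000        # tokens (the system prompt size assumed in the before/after table)
TOKENS_PER_EXCHANGE = 1_000  # assumption: tokens each user + assistant exchange adds
SOFT_THRESHOLD = 13_600      # softThresholdTokens from the config, used here as the trigger
SUMMARY_TOKENS = 1_500       # assumption: size of the compacted summary


def simulate(n_messages: int, compact: bool) -> list[int]:
    """Return the input tokens sent with each message in one conversation."""
    history: list[int] = []  # token sizes of the summary + turns carried forward
    sent: list[int] = []
    for _ in range(n_messages):
        projected = SYSTEM_PROMPT + sum(history) + TOKENS_PER_EXCHANGE
        if compact and projected > SOFT_THRESHOLD:
            history = [SUMMARY_TOKENS]  # fold every older turn into one summary
        history.append(TOKENS_PER_EXCHANGE)  # the new exchange is always kept verbatim
        sent.append(SYSTEM_PROMPT + sum(history))
    return sent


unbounded = simulate(30, compact=False)
compacted = simulate(30, compact=True)
print("message 30 without compaction:", unbounded[-1])  # 32000
print("largest context with compaction:", max(compacted))
print("average with compaction:", sum(compacted) / len(compacted))
```

The unbounded conversation reaches 32,000 tokens by message 30, matching the "30,000+" figure earlier. The compacted one never exceeds the soft threshold. Its average sits well below that peak, which is why a per-message average can land far under the 16,000 ceiling.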

3. Enable prompt caching on your system prompt

Your system prompt is probably 500–1,500 tokens of instructions that never change. Without caching, you pay full price to re-process it on every single message.

Anthropic's prompt caching charges cached token reads at $0.50/M instead of $5/M for Claude Opus 4.6 — a 90% discount on that chunk. OpenAI and Google have equivalent mechanisms. OpenClaw exposes this via the cacheRetention setting.

{
  "models": {
    "anthropic/claude-opus-4-6": {
      "params": {
        "cacheRetention": "short"
      }
    }
  }
}
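Here is a rough Python estimate of what the system prompt alone costs per month, with and without caching. It uses the Opus 4.6 prices quoted above. It also uses Anthropic's published cache-write premium, which is 1.25× the base input price for the default five-minute cache. The hit rate is the share of messages that arrive while the cache is still warm. It is an assumption you should adjust, as is the 30-day month.

```python
BASE_INPUT = 5.00                # $/M input tokens, Claude Opus 4.6
CACHE_READ = 0.50                # $/M for cache hits
CACHE_WRITE = BASE_INPUT * 1.25  # $/M to write the cache (Anthropic's 5-minute TTL premium)


def uncached_cost(prompt_tokens: int, msgs_per_month: int) -> float:
    """Monthly $ for re-sending the system prompt at full price every turn."""
    return prompt_tokens * msgs_per_month / 1_000_000 * BASE_INPUT


def cached_cost(prompt_tokens: int, msgs_per_month: int, hit_rate: float) -> float:
    """Monthly $ with caching: hits pay the read price, misses pay the write price."""
    blended = hit_rate * CACHE_READ + (1 - hit_rate) * CACHE_WRITE
    return prompt_tokens * msgs_per_month / 1_000_000 * blended


msgs = 50 * 30  # 50 messages/day over an assumed 30-day month
for rate in (1.0, 0.9, 0.5, 0.0):
    print(f"hit rate {rate:.0%}: ${cached_cost(2_000, msgs, rate):.2f} "
          f"vs ${uncached_cost(2_000, msgs):.2f} uncached")
```

At a 0% hit rate, caching costs more than not caching, because every message pays the write premium. Steady traffic gets close to the full 90% discount. A chat that goes quiet for longer than the cache lifetime pays to rebuild the cache.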

4. Configure a failover model

OpenClaw supports model failover: if your primary model returns an error or is unavailable, traffic automatically falls back to a secondary model. This is not per-task routing (OpenClaw doesn't classify messages and route cheap vs. expensive automatically), but it does protect uptime and lets you put a cheaper model in the failover slot so degraded-mode traffic costs less.

Gemini 3 Flash Preview is priced at $0.50/M input, about 10× cheaper than Opus 4.6. Since OpenClaw won't split traffic by task on its own, the practical way to get per-task savings today is to deploy a second low-cost instance for high-volume simple tasks (reminders, lookups, short replies) and route between the two at the channel level.

{
  "model": {
    "primary": "anthropic/claude-opus-4-6",
    "fallbacks": [
      "google/gemini-3-flash-preview"
    ]
  }
}
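Conceptually, failover is a try-in-order loop rather than a router. The Python sketch below illustrates that behaviour under stated assumptions. `call_model` stands in for whatever client actually talks to each provider. `ModelUnavailable` is a hypothetical error type. Neither is part of OpenClaw's API. The model IDs are the ones from the config.

```python
from typing import Callable, Optional, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error a provider client raises on an outage or API error."""


def complete_with_failover(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order and return (model_used, reply) from the first success.

    Every prompt goes to the primary first; a fallback is used only when the
    models before it fail. Nothing here inspects the prompt to pick a model.
    """
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as err:
            last_error = err  # remember the failure and move down the chain
    raise RuntimeError("every model in the failover chain failed") from last_error


# Usage with a fake client where the primary is down.
def fake_call(model: str, prompt: str) -> str:
    if model == "anthropic/claude-opus-4-6":
        raise ModelUnavailable("503 from primary")
    return f"[{model}] ok: {prompt}"


chain = ["anthropic/claude-opus-4-6", "google/gemini-3-flash-preview"]
used, reply = complete_with_failover("remind me to call back", chain, fake_call)
print(used)  # google/gemini-3-flash-preview
```

Selection depends only on errors, so a cheaper fallback lowers cost only for traffic served while the primary is failing. The channel-level split is what moves the everyday bill.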

Before & After: 50 Messages/Day on Claude Opus

	Default (unconstrained)	With all 4 changes
Avg input tokens/msg	~50,000	~7,250
System prompt cost	Full price every turn ($5/M)	90% off via cache ($0.50/M)
Monthly API bill	~$400–$600	~$50–$80
Setup time	0 minutes	~10 minutes

The numbers above assume 50 messages/day on Claude Opus 4.6 ($5/M input, $25/M output, cache reads $0.50/M). “Default” reflects a real agent with a 2,000-token system prompt, tool schemas, and growing conversation history averaging ~50,000 input tokens/msg. Your mileage will vary, so plug your actual usage into the calculator.
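The input side of the table can be reproduced with a few lines of Python. This is a sketch under the stated assumptions plus two extra ones. First, a 30-day month. Second, the whole 2,000-token system prompt is a cache hit on every message in the optimised setup. Output tokens per message aren't given, so they default to zero here; add your own figure to see the full bill.

```python
MSGS_PER_MONTH = 50 * 30  # 50 messages/day, assumed 30-day month
INPUT_PRICE = 5.00        # $/M input tokens, Claude Opus 4.6
CACHE_READ_PRICE = 0.50   # $/M cached input tokens
OUTPUT_PRICE = 25.00      # $/M output tokens


def monthly_cost(input_tokens: int, cached_tokens: int = 0, output_tokens: int = 0) -> float:
    """Monthly $ for MSGS_PER_MONTH messages with the given per-message token counts.

    `cached_tokens` is the part of `input_tokens` billed at the cache-read price.
    """
    uncached = input_tokens - cached_tokens
    per_msg = (
        uncached * INPUT_PRICE
        + cached_tokens * CACHE_READ_PRICE
        + output_tokens * OUTPUT_PRICE
    ) / 1_000_000
    return MSGS_PER_MONTH * per_msg


before = monthly_cost(50_000)
after = monthly_cost(7_250, cached_tokens=2_000)
print(f"input only: ${before:.2f} -> ${after:.2f} ({1 - after / before:.0%} lower)")
```

On input alone that is roughly $375 versus $41, an 89% reduction. The gap to the quoted totals would come from output tokens and cache writes, which depend on how long your agent's replies are and how bursty its traffic is.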

Who Gets the Most From This

High-volume personal agents

Morning summaries, reminders, file management — dozens of short messages a day that don't need full context.

Team / community bots

Answering 100+ questions a day on Telegram or Discord. Context limits matter enormously here.

Deep research sessions

Long single-session analysis where the full context genuinely matters. Context limits would hurt more than help — use a higher cap or none.

Low-volume usage

5–10 messages/day with a cheaper model (Gemini Flash, Sonnet). Your bill is already small enough that this won't move the needle much.

Skip the Configuration Entirely

If you'd rather just have all of this set up correctly from the start — Clawship instances deploy with context limits, compaction, and prompt caching pre-configured. You pick the model, we handle the ops.

Deploy with optimised settings in 60 seconds

Free plan includes a shared Telegram bot. Starter and above get dedicated instances with full model and channel choice.


Or keep self-hosting and use the calculator to track what the config changes are actually saving you.

Why Your OpenClaw Bill Is So High (And How to Cut It by 90%)