There's a specific kind of dread that hits when you check your AI API dashboard and the number is much, much larger than you expected. It happens to OpenClaw users constantly — not because the software is wasteful, but because the defaults are optimised for flexibility, not cost.
This post covers the four configuration changes that cut the bill by 80–95% in most setups, with worked numbers you can adapt to your own usage.
Why Default Settings Are Expensive
OpenClaw ships with auto-compaction enabled and a short cache-retention profile by default, so it isn't starting from zero. The problem is that it leaves the context window unconstrained. Without a hard token ceiling, conversations keep growing, compaction can only do so much, and your provider still bills you for every input token the model processes, summarised or not.
Here's what the default setup does on every single message:
- Sends your entire system prompt every turn, and whenever it isn't served from the prompt cache you pay full input price to process it again
- Carries the full conversation history forward — by message 30, you're sending 30,000+ tokens of context that mostly won't affect the response
- Sends every request to the same model: a "remind me to call back" request is billed at the same per-token rate as a deep research task
The result: an agent handling 50 messages/day on Claude Opus 4.6 can easily average 40,000–60,000 input tokens per message once conversations grow long and tool schemas are injected, which runs to roughly $400–$600/month in API fees. The same agent with the changes below costs about $50–$80/month.
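To see why history dominates the bill, here is a toy model of per-message input tokens. It assumes a 2,000-token system prompt and 1,000 tokens per exchange (both illustrative, not OpenClaw measurements) and treats a context cap as a simple ceiling, ignoring compaction.

```python
# Toy model of per-message input tokens as a conversation grows.
# SYSTEM_PROMPT and TOKENS_PER_EXCHANGE are illustrative assumptions, not OpenClaw measurements.
SYSTEM_PROMPT = 2_000          # system prompt + tool schemas, resent every turn
TOKENS_PER_EXCHANGE = 1_000    # one user message + one reply, kept in history

def input_tokens(turn, cap=None):
    """Input tokens sent on a given turn (1-based): system prompt plus all prior exchanges."""
    total = SYSTEM_PROMPT + (turn - 1) * TOKENS_PER_EXCHANGE
    return total if cap is None else min(total, cap)   # a cap acts as a hard ceiling here

def conversation_total(turns, cap=None):
    """Sum of input tokens billed across every turn of one conversation."""
    return sum(input_tokens(t, cap) for t in range(1, turns + 1))

uncapped = conversation_total(50)
capped = conversation_total(50, cap=16_000)
print(f"turn 30 sends {input_tokens(30):,} tokens uncapped")
print(f"50 turns: {uncapped:,} tokens uncapped vs {capped:,} with a 16k cap")
```

Under these assumptions, turn 30 already sends 31,000 tokens, and a 16,000-token cap cuts a 50-turn conversation from 1,325,000 to 695,000 input tokens in total. Because every turn resends everything before it, uncapped cost grows roughly with the square of conversation length.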
The Four Changes
All four settings live under agents.defaults in your openclaw.json. The snippets below show only the relevant slice of that object.
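For reference, here is how the four slices described below nest together under agents.defaults. This sketch builds the object in Python and prints it as JSON; the values mirror this post's snippets, and the nesting should be checked against your OpenClaw version's configuration reference before you merge it into an existing openclaw.json.

```python
# Assemble the four config slices from this post into one agents.defaults object.
# Values mirror the snippets in this post; verify the nesting against your OpenClaw version.
import json

defaults = {
    "contextTokens": 16000,                                   # 1. hard context cap
    "compaction": {                                           # 2. compaction settings
        "mode": "default",
        "memoryFlush": {"softThresholdTokens": 13600},
    },
    "models": {                                               # 3. prompt caching per model
        "anthropic/claude-opus-4-6": {"params": {"cacheRetention": "short"}},
    },
    "model": {                                                # 4. primary model + failover
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["google/gemini-3-flash-preview"],
    },
}

config = {"agents": {"defaults": defaults}}
print(json.dumps(config, indent=2))
```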
1. Set a hard context limit
Without a cap, your conversation window grows indefinitely. Token 1 of a context window costs the same as token 50,000 — and most of that old context is doing nothing useful.
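A cap means something has to give once the window is full. The sketch below is a generic sliding-window trim, not OpenClaw's implementation: count_tokens() is a hypothetical estimate of about 4 characters per token, and trim_to_budget() is an illustrative helper that keeps the newest messages that fit alongside the system prompt.

```python
# Generic sliding-window trim (illustrative only; OpenClaw's own logic may differ).

def count_tokens(text: str) -> int:
    # Hypothetical estimate: roughly 4 characters per token, not a real tokenizer.
    return max(1, len(text) // 4)

def trim_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Return the most recent messages that fit in the budget next to the system prompt."""
    remaining = budget - count_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):      # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break                          # this message and everything older is dropped
        kept.append(message)
        remaining -= cost
    kept.reverse()                         # restore chronological order
    return kept

system_prompt = "s" * 8_000                      # about 2,000 tokens under the estimate
history = [f"{i:x>4000}" for i in range(40)]     # 40 messages of about 1,000 tokens each
kept = trim_to_budget(system_prompt, history, budget=16_000)
print(len(kept))
```

With a 2,000-token system prompt and 1,000-token messages, a 16,000-token budget keeps the 14 newest messages; everything older is simply gone, which is the gap compaction fills.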
Setting contextTokens to 16,000–20,000 covers roughly the last 10–15 full exchanges, which is all an agent typically needs for coherent responses.
```json
{
  "contextTokens": 16000
}
```
2. Enable automatic context compaction
Instead of cutting off older messages abruptly, compaction summarises them. When the context approaches the cap, OpenClaw replaces old turns with a compact summary. The agent keeps long-term memory without carrying every word forward at full token cost. OpenClaw enables this by default — but pairing it with a hard contextTokens ceiling keeps the window from growing far enough to become expensive before compaction kicks in.
```json
{
  "contextTokens": 16000,
  "compaction": {
    "mode": "default",
    "memoryFlush": {
      "softThresholdTokens": 13600
    }
  }
}
```
Combined with the cap, this keeps long conversations at roughly 14,000–16,000 tokens of context per message, no matter how long they run, down from the 20,000–50,000 an uncapped conversation typically reaches.
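Compaction differs from plain trimming in that older turns are condensed rather than discarded. This hypothetical compact() helper shows the shape of the idea: once history passes a soft threshold (13,600 here, borrowed from the config above only as an example trigger point), everything but the newest few messages is replaced by one summary. summarize() is a placeholder for a model call, and OpenClaw's actual compaction logic and thresholds may differ.

```python
# Hypothetical compaction sketch: condense older turns into one summary message.

def count_tokens(text: str) -> int:
    # Hypothetical estimate: roughly 4 characters per token, not a real tokenizer.
    return max(1, len(text) // 4)

def summarize(messages: list[str]) -> str:
    # Placeholder: a real implementation would ask a model to summarise these messages.
    return f"[summary of {len(messages)} earlier messages]"

def compact(history: list[str], soft_threshold: int, keep_recent: int = 6) -> list[str]:
    """Summarise all but the newest keep_recent messages once history exceeds soft_threshold."""
    total = sum(count_tokens(m) for m in history)
    if total <= soft_threshold or len(history) <= keep_recent:
        return history                     # under the threshold: leave history untouched
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = ["y" * 4_000 for _ in range(20)]       # 20 messages of about 1,000 tokens each
compacted = compact(history, soft_threshold=13_600)
print(len(compacted), sum(count_tokens(m) for m in compacted))
```

Here about 20,000 tokens of history collapse to one short summary plus the six newest messages, roughly 6,000 tokens. The trade-off is that the summary is lossy: details from early turns survive only if the summarising model keeps them.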
3. Enable prompt caching on your system prompt
Your system prompt (plus any tool schemas) is typically somewhere between a few hundred and a couple of thousand tokens of instructions that rarely change between turns. Without caching, you pay full input price to process it again on every single message.
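Anthropic's prompt caching bills cached token reads for Claude Opus 4.6 at $0.50/M instead of the regular $5/M input rate, a 90% discount on that chunk. OpenAI and Google have equivalent mechanisms. OpenClaw exposes this via the cacheRetention setting; setting it per model, as in the snippet below, makes it explicit for the model you pay the most for.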
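Caching is not free on the first request. Anthropic bills cache writes at 1.25× the base input rate for its default five-minute cache lifetime, so the discount only pays off when the same prefix is reused before the cache expires. This sketch compares a 2,000-token prompt sent n times within one cache lifetime, with and without caching; prompt_cost() is a hypothetical helper, and it assumes OpenClaw's "short" retention corresponds to that five-minute cache.

```python
# Cost of resending a static prompt prefix, with and without prompt caching.
# Rates are the Opus 4.6 prices quoted in this post; the 1.25x write premium is
# Anthropic's multiplier for its default five-minute cache lifetime.
BASE_PER_M = 5.00
CACHE_READ_PER_M = 0.50
CACHE_WRITE_PER_M = BASE_PER_M * 1.25

def prompt_cost(prompt_tokens: int, messages: int, cached: bool) -> float:
    """USD cost of sending the same prefix on `messages` turns inside one cache lifetime
    (with caching: one cache write, then cache reads for every later turn)."""
    if not cached:
        return prompt_tokens * messages * BASE_PER_M / 1_000_000
    write = prompt_tokens * CACHE_WRITE_PER_M
    reads = prompt_tokens * (messages - 1) * CACHE_READ_PER_M
    return (write + reads) / 1_000_000

for n in (1, 2, 10):
    uncached = prompt_cost(2_000, n, cached=False)
    cached = prompt_cost(2_000, n, cached=True)
    print(f"{n:>2} msgs: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

A single isolated message costs 25% more with caching ($0.0125 vs $0.0100), but two messages within the window already cost less ($0.0135 vs $0.0200), and ten cost about a fifth as much ($0.0215 vs $0.1000). Agents whose messages arrive minutes apart benefit most; an agent that hears from you once every few hours may keep paying the write premium.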
```json
{
  "models": {
    "anthropic/claude-opus-4-6": {
      "params": {
        "cacheRetention": "short"
      }
    }
  }
}
```
4. Configure a failover model
OpenClaw supports model failover: if your primary model returns an error or is unavailable, requests automatically fall back to a secondary model. This is not per-task routing (OpenClaw doesn't classify messages and send the cheap ones to a cheaper model), but it does protect uptime, and putting a cheaper model in the failover slot means degraded-mode traffic costs less.
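The failover behaviour can be pictured as an ordered loop that tries each model in turn. This is a minimal sketch, not OpenClaw's code: call_model() and ModelUnavailable are hypothetical stand-ins that simulate an outage on the primary, and OpenClaw's real failover rules (which errors trigger it, retries, cooldowns) may differ.

```python
# Minimal failover sketch: try the primary model, then each fallback in order.

class ModelUnavailable(Exception):
    """Hypothetical error raised when a model cannot serve a request."""

def call_model(model: str, prompt: str) -> str:
    # Hypothetical provider call; simulates an outage on the primary model.
    if model == "anthropic/claude-opus-4-6":
        raise ModelUnavailable(model)
    return f"{model} answered: {prompt}"

def complete_with_failover(prompt: str, primary: str, fallbacks: list[str]) -> str:
    errors = []
    for model in [primary, *fallbacks]:
        try:
            return call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(exc)             # record the failure and try the next model
    raise RuntimeError(f"all models failed: {errors}")

print(complete_with_failover(
    "remind me to call back",
    primary="anthropic/claude-opus-4-6",
    fallbacks=["google/gemini-3-flash-preview"],
))
```

With the primary simulated as down, the request is answered by the Gemini fallback; if every model fails, the caller gets a single error listing each failure.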
Gemini 3 Flash Preview is priced at $0.50/M input, about 10× cheaper than Opus 4.6. If you want that pricing for high-volume simple tasks (reminders, lookups, short replies), the practical approach today is to deploy a second, low-cost instance for them and split traffic at the channel level.
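To estimate what a channel-level split might save, here is a rough input-only comparison using the two input prices in this post and the ~7,250 tokens/message figure from the before/after table. The share of messages you can move to a Flash instance (simple_share) is a hypothetical knob, output pricing is ignored, and monthly_input_cost() is an illustrative helper.

```python
# Rough input-only cost of splitting traffic between an Opus and a Flash instance.
# Prices are the input rates quoted in this post; simple_share is a hypothetical knob.
OPUS_INPUT_PER_M = 5.00
FLASH_INPUT_PER_M = 0.50

def monthly_input_cost(msgs_per_day, tokens_per_msg, simple_share, days=30):
    """Monthly input spend when `simple_share` of messages go to Flash and the rest to Opus."""
    total_tokens = msgs_per_day * days * tokens_per_msg
    opus = total_tokens * (1 - simple_share) * OPUS_INPUT_PER_M
    flash = total_tokens * simple_share * FLASH_INPUT_PER_M
    return (opus + flash) / 1_000_000

all_opus = monthly_input_cost(50, 7_250, simple_share=0.0)
split = monthly_input_cost(50, 7_250, simple_share=0.6)
print(f"all on Opus: ${all_opus:.2f}/month, 60% on Flash: ${split:.2f}/month")
```

Under these assumptions, moving 60% of a 50-message/day workload to Flash cuts input spend from about $54 to about $25 a month.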
```json
{
  "model": {
    "primary": "anthropic/claude-opus-4-6",
    "fallbacks": [
      "google/gemini-3-flash-preview"
    ]
  }
}
```
Before & After: 50 Messages/Day on Claude Opus
| | Default (unconstrained) | With all 4 changes |
|---|---|---|
| Avg input tokens/msg | ~50,000 | ~7,250 |
| System prompt cost | Full price every turn ($5/M) | 90% off via cache ($0.50/M) |
| Monthly API bill | ~$400–$600 | ~$50–$80 |
| Setup time | 0 minutes | ~10 minutes |
The numbers above assume 50 messages/day on Claude Opus 4.6 ($5/M input, $25/M output, cache reads $0.50/M). “Default” reflects a real agent with a 2,000-token system prompt, tool schemas, and growing conversation history averaging ~50,000 input tokens/msg. Your mileage will vary with message volume, conversation length, and reply size.
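The table's bottom line can be reproduced with a simple per-message cost model. This sketch assumes an 800-token average reply (the post doesn't state one) and treats 2,000 tokens of the optimised context as cache reads; it ignores cache-write premiums and any failover traffic, and monthly_cost() is an illustrative helper.

```python
# Per-message cost model behind the before/after table (a sketch under stated assumptions).
# Prices are the Claude Opus 4.6 rates quoted above, in USD per million tokens.
INPUT_PER_M = 5.00
OUTPUT_PER_M = 25.00
CACHE_READ_PER_M = 0.50

OUTPUT_TOKENS = 800   # hypothetical average reply length; not stated in the post

def monthly_cost(input_tokens, output_tokens, cached_tokens=0, msgs_per_day=50, days=30):
    """Monthly spend; cached_tokens is the part of input_tokens billed at the cache-read rate."""
    uncached = input_tokens - cached_tokens
    per_msg = (
        uncached * INPUT_PER_M
        + cached_tokens * CACHE_READ_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return per_msg * msgs_per_day * days

before = monthly_cost(50_000, OUTPUT_TOKENS)
after = monthly_cost(7_250, OUTPUT_TOKENS, cached_tokens=2_000)
print(f"before ≈ ${before:,.0f}/month, after ≈ ${after:,.0f}/month")
```

With those assumptions it lands at roughly $405/month before and $71/month after, inside both ranges in the table; longer replies or a larger uncached share push both numbers up.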
Who Gets the Most From This
High-volume personal agents
Morning summaries, reminders, file management — dozens of short messages a day that don't need full context.
Team / community bots
Answering 100+ questions a day on Telegram or Discord. Context limits matter enormously here.
Deep research sessions
Long single-session analysis where the full context genuinely matters. Context limits would hurt more than help — use a higher cap or none.
Low-volume usage
5–10 messages/day with a cheap model (Gemini Flash, Sonnet). Your bill is already small enough that this won't move the needle much.
Skip the Configuration Entirely
If you'd rather just have all of this set up correctly from the start — Clawship instances deploy with context limits, compaction, and prompt caching pre-configured. You pick the model, we handle the ops.
Deploy with optimised settings in 60 seconds
Free plan includes a shared Telegram bot. Starter and above get dedicated instances with full model and channel choice.
Or keep self-hosting and compare your API dashboard before and after the config changes to see what they're actually saving you.