Full cache-test.md

Full hosted document copy.

# Task 9 Eval: Cache / Context Stress Test *Wendy Runtime Architecture Project* *Evaluated: 2026-04-07 | System Prompt: system-prompt-ceo-v1.md (13,116 bytes)* *Evaluator: Claude Opus 4.6 (analytical eval against OpenClaw caching mechanics)* --- ## Test Objective Validate that system-prompt-ceo-v1.md maintains cache stability, token efficiency, and behavioral consistency across a 50-turn simulated coaching session — the upper bound of a deep coaching engagement before compaction. --- ## 1. Cache Stability Analysis ### Byte-Identity Check The prompt caching contract requires the system prompt to be **byte-identical across turns**. Any variation in the stable prefix invalidates the cache and triggers a full re-read. | Check | Result | Status | |---|---|---| | Dynamic content (timestamps, dates, session IDs) | None found | PASS ✅ | | Per-session variables (client name, goals) | None — CEO Goals loaded via USER.md separately | PASS ✅ | | Conditional sections (if/else, toggles) | None — all content is static | PASS ✅ | | Trailing whitespace variation risk | Clean — consistent formatting throughout | PASS ✅ | | Encoding stability (UTF-8, no BOM) | Standard UTF-8, no special characters outside ASCII | PASS ✅ | | Footer metadata (line 233-236) | Static strings, no computed values | PASS ✅ | **Verdict:** system-prompt-ceo-v1.md is **fully cache-stable**. Zero dynamic content. The file will produce identical bytes on every turn load, maximizing cache hit rate. ### Cache Architecture Position | Component | Position | Cache Impact | |---|---|---| | OpenClaw base prompt | Above boundary (stable) | Cached ✅ | | system-prompt-ceo-v1.md (SOUL.md slot) | Above boundary (stable) | Cached ✅ | | AGENTS.md | Above boundary (stable) | Cached ✅ | | Tool definitions | Above boundary (sorted deterministically) | Cached ✅ | | HEARTBEAT.md metadata | Below boundary (volatile) | Not cached — by design | | Conversation history | Below boundary (growing) | Not cached — by design | **All Wendy bootstrap content sits above the cache boundary.** No content bleeds into the volatile suffix. --- ## 2. Token Drift Model: 50-Turn Projection ### Assumptions (from Task 1 empirical data + Task 8 load test) | Parameter | Value | Source | |---|---|---| | System overhead (fixed) | ~26,379 tokens | Task 8 load test | | Output reserve | 32,000 tokens | Model spec | | Available for conversation | ~141,621 tokens | 200K - overhead - reserve | | Avg user message | ~300 tokens | CEO coaching messages (shorter than dev messages) | | Avg assistant response | ~400 tokens | Coaching responses (concise, question-heavy) | | Growth per turn | ~700 tokens | User + assistant | | Context pruning | Active (cache-ttl mode, 5m) | OpenClaw default | ### Token Growth Projection | Turn | Cumulative History | Total Context | % of 200K | Status | |---|---|---|---|---| | 1 | ~700 | ~27,079 | 13.5% | ✅ Normal | | 10 | ~7,000 | ~33,379 | 16.7% | ✅ Normal | | 20 | ~14,000 | ~40,379 | 20.2% | ✅ Normal | | 30 | ~21,000 | ~47,379 | 23.7% | ✅ Normal | | 40 | ~28,000 | ~54,379 | 27.2% | ✅ Normal | | 50 | ~35,000 | ~61,379 | 30.7% | ✅ Normal | | 80 (projected) | ~56,000 | ~82,379 | 41.2% | ✅ Normal | | 135 (projected) | ~94,500 | ~120,879 | 60.4% | ⚠️ Approaching limit | | 188 (projected) | ~131,600 | ~157,979 | 79.0% | 🔴 Compaction trigger | ### Drift Analysis **Token drift** = deviation from linear growth due to: 1. **Tool call overhead:** Each tool call adds ~200-500 tokens (call + result). In a coaching session, tool calls are rare (memory saves, opportunity logging). Estimated 1 tool call per 5 turns = ~100 tokens/turn additional. 2. **Pruning recapture:** Context pruning trims old tool results after 5m idle. Recovers ~200-1,000 tokens per pruned result. 3. **Response length variation:** Coaching responses vary — openers are short (~200 tokens), insight delivery is longer (~600 tokens), silence moves are very short (~50 tokens). **Adjusted 50-turn projection with drift:** | Factor | Impact on 50-turn total | |---|---| | Base growth (50 × 700) | +35,000 tokens | | Tool call overhead (10 calls × 350 avg) | +3,500 tokens | | Pruning recapture (8 pruned results × 400 avg) | -3,200 tokens | | Response length variance | ±2,000 tokens | | **Net at turn 50** | **~35,300 tokens history** | | **Total context at turn 50** | **~61,679 tokens (30.8%)** | **Verdict:** At turn 50, context usage is ~31% — well within budget. No compaction risk. The prompt supports 50-turn sessions with over 100K tokens of headroom. --- ## 3. Cache Hit Rate Model ### Per-Turn Cache Behavior | Turn | Cache Event | Tokens Cached | Cost Impact | |---|---|---|---| | 1 | Cache WRITE | ~26,379 (system overhead) | $0.165 (write premium) | | 2 | Cache READ | ~26,379 | $0.013 (90% savings) | | 3-50 | Cache READ | ~26,379 | $0.013/turn | ### Session Cost Model (50 turns, Claude Opus 4.6) | Component | Calculation | Cost | |---|---|---| | Turn 1: cache write | 26,379 × $6.25/1M | $0.165 | | Turn 1: user input | 300 × $5.00/1M | $0.002 | | Turn 1: output | 400 × $25.00/1M | $0.010 | | Turns 2-50: cached system | 49 × 26,379 × $0.50/1M | $0.646 | | Turns 2-50: uncached history | Sum of growing history × $5.00/1M | $4.413 | | Turns 2-50: output | 49 × 400 × $25.00/1M | $0.490 | | **Total 50-turn session** | | **$5.73** | | **Without caching** | System re-read: 50 × 26,379 × $5.00/1M = $6.59 | **$11.51** | | **Cache savings** | | **$5.78 (50.2%)** | ### Cache Hit Rate | Metric | Value | |---|---| | Expected cache hits | 49/50 turns (98%) | | Cache miss scenarios | Turn 1 (cold start), mid-session file edit (should not happen) | | Cache TTL risk | None if heartbeat interval < TTL (55m heartbeat, 60m TTL) | | Cache invalidation risk | Zero — no dynamic content in prompt | **Verdict:** 98% cache hit rate expected. Cache savings cut session cost by ~50%. --- ## 4. Prompt Optimization Review ### Current Efficiency | Metric | Current | Target | Status | |---|---|---|---| | File size | 13,116 chars | < 20,000 chars | 65.6% of cap ✅ | | Est. tokens | ~3,279 | < 5,000 | 65.6% of cap ✅ | | Sections | 12 | — | Clean separation ✅ | | Redundancy | Minimal | Zero | See below | ### Redundancy Scan Checked for repeated concepts that inflate token count without adding behavioral signal: | Pattern | Occurrences | Redundant? | |---|---|---| | "Never" rules | 12 instances | No — each governs a distinct behavior | | Specificity emphasis | 4 mentions (Values #3, Moves, Flow, Success) | Borderline — but each is in a different behavioral context (what to value, how to move, when to close, what success feels like). Serves reinforcement, not redundancy. | | "Coaching frame" | 3 mentions | No — appears in different escalation contexts | | Danny/Marcus examples | 4 references | No — consistent test persona grounding | | Competing commitment | 2 mentions (Method, Principles) | Borderline — Method teaches the move, Principles weight it. Both load-bearing. | **Optimization verdict:** No cuts recommended. The prompt is already at 65.6% of per-file cap. Every section is load-bearing — removing any would create behavioral gaps proven in Tasks 3-8. The 34.4% headroom provides room for future additions (new escalation types, additional magic moments) without breaching the cap. ### Cache-Reuse Optimization Checks | Check | Finding | |---|---| | Section ordering optimized for cache prefix? | Yes — Mission/Values/Identity are first (most stable), Principles/Success are last (least likely to be read in full by the model on every turn) | | Markdown formatting consistent? | Yes — `##` headers, `**bold**` emphasis, `---` separators throughout | | No trailing newlines/spaces that could vary? | Clean — consistent line endings | | Footer metadata static? | Yes — no computed values | --- ## 5. Behavioral Consistency Across 50 Turns ### Phase Distribution in a 50-Turn Session A 50-turn coaching session would span multiple discovery arcs. Expected phase distribution: | Phase | Turns | What Happens | |---|---|---| | OPEN (Arc 1) | 1-3 | Rapport, safety, "how are you" | | SURFACE (Arc 1) | 4-10 | Pain surfacing, breadcrumb extraction | | SPARK (Arc 1) | 11-14 | Pattern naming, insight delivery | | RESOLVE (Arc 1) | 15-18 | Action anchoring, opportunity surface | | OPEN (Arc 2) | 19-21 | New topic or follow-up from arc 1 | | SURFACE (Arc 2) | 22-30 | Deeper layer, new competing commitment | | SPARK (Arc 2) | 31-35 | Cross-session pattern (MM-2 opportunity) | | RESOLVE (Arc 2) | 36-40 | Capability action, close of arc 2 | | Wind-down | 41-50 | Lighter exchange, energy check, forward anchor | ### Behavioral Risks at Turn 50 | Risk | Mitigation in Prompt | Status | |---|---|---| | Discovery arc resets to generic opener | "Returning session: Reference something specific from last time" | MITIGATED ✅ | | Over-accumulation of insights | "One insight per session maximum. Two dilutes both." | MITIGATED ✅ | | Capability pitch creep | "Never push twice in the same session" + "One capability action per session max" | MITIGATED ✅ | | Voice drift toward therapy | 5 explicit "Never" rules in Voice section | MITIGATED ✅ | | Loss of specificity at depth | "Names, numbers, exact words" in Values + repeated in Moves and Flow | MITIGATED ✅ | | Compaction erasing coaching insights | Not in prompt — handled by OpenClaw memory save before compaction | MITIGATED ✅ (architecture-level) | --- ## 6. Stress Scenarios ### Scenario A: Rapid-Fire CEO (Short Messages, Fast Turns) 50 turns where CEO sends 1-2 sentence messages, Wendy responds with brief coaching questions. - Token growth: ~400/turn → ~20,000 history at turn 50 - Total context: ~46,379 (23.2%) - Cache impact: Same — system overhead unchanged - **Result:** Lighter on budget. Could sustain 250+ turns before compaction. ### Scenario B: Deep Processing CEO (Long Messages, Emotional Content) 50 turns where CEO sends 500+ word messages, Wendy responds with longer reflections. - Token growth: ~1,200/turn → ~60,000 history at turn 50 - Total context: ~86,379 (43.2%) - Cache impact: Same — system overhead unchanged - **Result:** Still within budget at turn 50. Compaction at ~115 turns. ### Scenario C: Tool-Heavy Session (Drafts, Research, Capability Actions) 50 turns with 15 tool calls (drafts, email composition, research lookups). - Token growth: ~700/turn + 15 × 800 token tool overhead = ~47,000 history - Pruning recapture: ~8,000 tokens recovered from pruned tool results - Net total context: ~65,379 (32.7%) - Cache impact: Tool definitions are above cache boundary — cached. Tool results are below — not cached. - **Result:** Within budget. Pruning keeps tool overhead manageable. ### Scenario D: Cache Invalidation Mid-Session What if someone edits SOUL.md mid-session (e.g., hot-patching a rule)? - Cache invalidates on the next turn - Full re-read at write cost: $0.165 one-time - Cache re-established from the following turn - **Risk:** Minimal cost impact. Behavioral risk if edit changes coaching rules mid-conversation. - **Recommendation:** Never edit bootstrap files mid-session. Queue changes for next session. --- ## 7. Prompt Refinement Assessment After analyzing caching, token drift, and behavioral consistency across 50 turns: **No changes to system-prompt-ceo-v1.md are needed.** Rationale: 1. The prompt is fully cache-stable — zero dynamic content 2. Token budget is comfortable — 50 turns uses only 31% of context 3. No redundancy worth cutting — every section proved load-bearing in Tasks 3-8 4. Section ordering already optimized for cache prefix stability 5. Behavioral rules are comprehensive enough to prevent drift across 50 turns 6. The prompt sits at 65.6% of per-file cap, leaving headroom for future additions If refinements were needed in the future, the candidates would be: - **Add:** Cross-session arc continuity rules (for sessions spanning multiple days) - **Add:** Compaction preparation instructions (save key insights before context trim) - These would add ~500-800 tokens, still well within the 5,000-token SOUL.md budget --- ## Summary | Test | Result | Status | |---|---|---| | Cache stability (byte-identity) | Zero dynamic content, fully stable | PASS ✅ | | 50-turn token projection | 30.8% context at turn 50 (69.2% headroom) | PASS ✅ | | Cache hit rate | 98% (49/50 turns cached) | PASS ✅ | | Session cost (50 turns, Opus 4.6) | $5.73 (vs $11.51 uncached — 50% savings) | PASS ✅ | | Prompt redundancy | No cuts needed — all sections load-bearing | PASS ✅ | | Behavioral consistency at depth | All drift risks mitigated by explicit rules | PASS ✅ | | Stress scenarios (4 variants) | All within budget at turn 50 | PASS ✅ | | Prompt refinements needed | None — prompt is production-ready as-is | PASS ✅ | **8/8 checks PASS. No prompt modifications required. system-prompt-ceo-v1.md is cache-optimized and stress-tested for 50+ turn sessions.** --- *Eval methodology: Analytical projection using empirical data from Task 1 (OpenClaw mechanics measurements) and Task 8 (load test results), applied against OpenClaw prompt caching architecture and Claude Opus 4.6 pricing model. Stress scenarios model edge cases at the boundaries of expected CEO interaction patterns.*