s08

Context Compact

Memory Management

Context Will Fill Up

414 LOC9 toolsContext compaction

Compression keeps the conversation usable when the context window gets crowded.

Three-Layer Context Compression

Context Window

USR 1

AST 2

TRL 3

USR 4

AST 5

TRL 6

USR 7

AST 8

30%

30K

/ 100K

Token usage30,000 / 100,000

user

assistant

tool_result

Microwaiting

old tool_result

shrink bulky outputs

Autowaiting

token threshold

summarize the conversation

Manualwaiting

user command

keep one compact summary

Growing Context

The context window holds the conversation. Each API call adds more messages.

1/7

s01 → s02 → s03 → s04 → s05 → s06 → s07 → s08 → s09 → s10 → ... → s20

"Context will fill up — have a way to make room" — Four-layer compression pipeline: cheap first, expensive last.

Harness Layer: Compression — clean memory, unlimited sessions.

The Problem

The agent is running along, then freezes.

It has bash, read, write — all the capabilities it needs. But it read a 1000-line file (~4000 tokens), then read 30 more files, ran 20 commands. Every command's output, every file's contents, all pile up in the messages list.

The context window is finite. Once full, the API outright rejects the call: prompt_too_long.

Without compression, an agent simply cannot work on large projects.

The Solution

Compact Overview

The hook structure, skill loading, and sub-Agent from s07 are preserved, with some tools omitted to focus on compaction. The core change: insert three pre-processors (0 API calls) before each LLM call, trigger an LLM summary (1 API call) when tokens still exceed the threshold, and emergency-trim if the API throws an error.

Core design: cheap first, expensive last.

How It Works

Four-layer compression pipeline

L1: snip_compact — Trim Irrelevant Old Conversation

The agent ran 80 turns of conversation, accumulating 160 messages. The very first "help me create hello.py" is barely relevant to current work, yet it still occupies space.

Message count exceeds 50 → keep the first 3 (initial context) and the last 47 (current work), trim the middle; the only extra boundary rule is that assistant(tool_use) must not be separated from the following user(tool_result):

def snip_compact(messages, max_messages=50):
    if len(messages) <= max_messages:
        return messages
    head_end, tail_start = 3, len(messages) - (max_messages - 3)
    if head_end > 0 and _message_has_tool_use(messages[head_end - 1]):
        while head_end < len(messages) and _is_tool_result_message(messages[head_end]):
            head_end += 1
    if (tail_start > 0 and tail_start < len(messages)
            and _is_tool_result_message(messages[tail_start])
            and _message_has_tool_use(messages[tail_start - 1])):
        tail_start -= 1
    snipped = tail_start - head_end
    placeholder = {"role": "user", "content": f"[snipped {snipped} messages from conversation middle]"}
    return messages[:head_end] + [placeholder] + messages[tail_start:]

Messages are still trimmed directly; this just adds one boundary guard. tool_result content within remaining messages still keeps accumulating — message #34 may still hold 30KB of old file contents. → L2.

L2: micro_compact — Placeholder for Old Tool Results

Old results placeholder

The agent read 10 files consecutively. The full contents of reads 1–7 are still sitting in context, no longer needed, but hogging large amounts of space.

Keep only the 3 most recent tool_result entries intact; replace older ones with a one-line placeholder:

KEEP_RECENT_TOOL_RESULTS = 3

def micro_compact(messages):
    tool_results = collect_tool_result_blocks(messages)
    if len(tool_results) <= KEEP_RECENT_TOOL_RESULTS:
        return messages
    for _, _, block in tool_results[:-KEEP_RECENT_TOOL_RESULTS]:
        if len(block.get("content", "")) > 120:
            block["content"] = "[Earlier tool result compacted. Re-run if needed.]"
    return messages

Old results are cleared, but a single new result can be 500KB — one cat of a large file can max out the context. → L3.

L3: tool_result_budget — Persist Large Results to Disk

Large results to disk

The model read 5 large files in one go; all tool_result blocks in the last user message total 500KB.

Sum the size of all tool_result blocks in the last user message. If over 200KB → sort by size, starting from the largest, persist to .task_outputs/tool-results/, keeping only a <persisted-output> marker + a 2000-character preview in context. The model sees the marker and knows the full content is on disk, re-reading it when needed.

def tool_result_budget(messages, max_bytes=200_000):
    last = messages[-1]
    blocks = [(i, b) for i, b in enumerate(last["content"])
              if b.get("type") == "tool_result"]
    total = sum(len(str(b.get("content", ""))) for _, b in blocks)
    if total <= max_bytes:
        return messages
    ranked = sorted(blocks, key=lambda p: len(str(p[1].get("content", ""))), reverse=True)
    for idx, block in ranked:
        if total <= max_bytes:
            break
        block["content"] = persist_large_output(block["tool_use_id"], str(block["content"]))
        total = recalculate_total(blocks)
    return messages

The first three layers are all plain-text / structural operations — 0 API calls — but they cannot "understand" conversation content. Context may still be too large. → L4.

L4: compact_history — Full LLM Summary

Full LLM summary

All three previous layers have run, but after 30 minutes of continuous work on a huge project, tokens still exceed the threshold.

Three-step process:

Save transcript: Write the full conversation to .transcripts/ in JSONL format. The transcript preserves a recoverable record, but the model's active context only contains the summary. For the model's current reasoning, the details are no longer in context. The teaching code does not provide a transcript retrieval tool.
LLM generates summary: Send conversation history to the LLM, asking it to preserve key information: current goals, important findings, modified files, remaining work, user constraints, etc.
Replace message list: All old messages are replaced with a single summary. The teaching version only keeps the summary; the real Claude Code re-attaches some recent files, plans, agent/skill/tool context after compaction.

def compact_history(messages):
    transcript_path = write_transcript(messages)  # Save full conversation first
    summary = summarize_history(messages)          # LLM generates summary
    return [{"role": "user",
             "content": f"[Compacted]\n\n{summary}"}]

Circuit breaker: After 3 consecutive failures, stop retrying to prevent an infinite loop wasting API calls.

Reactive: reactive_compact

Sometimes the API still returns prompt_too_long (413) — when context grows faster than compression triggers.

This triggers reactive_compact: more aggressive than compact_history, it retreats from the tail, but still avoids leaving an orphaned tool_result.

def reactive_compact(messages):
    transcript = write_transcript(messages)
    tail_start = max(0, len(messages) - 5)
    if (tail_start > 0 and tail_start < len(messages)
            and _is_tool_result_message(messages[tail_start])
            and _message_has_tool_use(messages[tail_start - 1])):
        tail_start -= 1
    summary = summarize_history(messages[:tail_start])
    return [{"role": "user",
             "content": f"[Reactive compact]\n\n{summary}"}, *messages[tail_start:]]

Reactive compact has a retry limit (default 1). If it still fails, an exception is raised instead of looping forever. Full error recovery is deferred to s11.

Putting It All Together

def agent_loop(messages):
    reactive_retries = 0
    while True:
        # Three pre-processors (0 API calls)
        # Order: budget first, so large content is persisted before placeholders
        messages[:] = tool_result_budget(messages)    # L3: persist large results
        messages[:] = snip_compact(messages)          # L1: trim middle
        messages[:] = micro_compact(messages)         # L2: old result placeholders

        # Still too much? LLM summary (1 API call)
        if estimate_token_count(messages) > THRESHOLD:
            messages[:] = compact_history(messages)

        try:
            response = client.messages.create(...)
        except PromptTooLongError:
            if reactive_retries < MAX_REACTIVE_RETRIES:
                messages[:] = reactive_compact(messages)  # Emergency
                reactive_retries += 1
                continue
            raise  # retry limit exceeded, raise exception
        # ... tool execution ...

        # compact tool: when the model actively calls it, triggers compact_history
        if block.name == "compact":
            messages[:] = compact_history(messages)
            results.append({..., "content": "[Compacted. History summarized.]"})
            messages.append({"role": "user", "content": results})
            break  # end current turn, start fresh with compacted context

The order must not be swapped. L3 (budget) runs before L2 (micro) because micro replaces old large tool_results with one-line placeholders — budget must persist the full content before that happens. This is why CC source puts applyToolResultBudget first.

Changes From s07

Component	Before (s07)	After (s08)
Context management	None (context grows unbounded)	Four-layer compression pipeline + emergency
New functions	—	snip_compact, micro_compact, tool_result_budget, compact_history, reactive_compact
Tools	bash, read_file, write_file, edit_file, glob, todo_write, task, load_skill (8)	8 + compact (9)
Loop	LLM call → tool execution	Three pre-processors before each turn + threshold-triggered compact_history
Design principle	—	Cheap first, expensive last

Try It

cd learn-claude-code
python s08_context_compact/code.py

Try these prompts:

Read the file README.md, then read code.py, then read s01_agent_loop/README.md (read multiple files consecutively, observe L2 compressing old results)
Read every file in s08_context_compact/ (read a large amount of content at once, observe L3 persisting to disk)
Chat for 20+ turns, observe whether [auto compact] or [reactive compact] appears

What to watch for: After each tool execution, are old tool_result entries compressed? When tokens exceed the threshold after extended conversation, is summarization triggered automatically?

What's Next

Context compression lets an agent run for a long time without crashing. But after each compression, the preferences and constraints the user told it are also lost. Can we let the agent selectively remember important things?

s09 Memory → three subsystems: choosing what to remember, extracting key information, consolidating and organizing. Across compressions, across sessions.

Deep Dive Into CC Source Code

The following is based on analysis of CC source code compact.ts, autoCompact.ts, microCompact.ts, and query.ts.

Execution Order Comparison

The teaching version labels layers L1/L2/L3/L4 for pedagogical clarity, but actual execution order does not match the numbering:

Dimension	Teaching Version	Claude Code
Execution order	budget → snip → micro → auto	budget → snip → micro → collapse → auto (`query.ts:379-468`)
snip_compact	Keep head 3 + tail 47	CC only enables on main thread; implementation not in open-source repo (`HISTORY_SNIP` feature gate), but interface is visible: `snipCompactIfNeeded(messages)` → `{ messages, tokensFreed, boundaryMessage? }`, also exposes `SnipTool` for model-initiated snipping. Teaching version's 3/47 are simplified parameters
micro_compact	Text placeholder replacement	Two paths: time-based clears content directly, cached uses API `cache_edits` (legacy path removed)
micro_compact whitelist	By position (most recent 3)	time-based triggers by time threshold; cached triggers by count (`microCompact.ts`)
tool_result_budget	200KB characters	200,000 characters (`toolLimits.ts:49`)
compact_history threshold	Character count estimate	Precise tokens: `contextWindow - maxOutputTokens - 13_000`
Summary requirements	5 categories of info	9 sections + `<analysis>`/`<summary>` dual tags
Compression prompt	Simple prompt	Double-ended hard guardrails forbidding tool calls
PTL retry	Yes (simplified)	`truncateHeadForPTLRetry()` retreats by message groups (`compact.ts:243-290`)
Post-compaction recovery	None (teaching version only keeps summary)	Auto re-read recent files, plans, agent/skill/tool context
Circuit breaker	3 times	3 times (`autoCompact.ts:70`)
Reactive retry	1 time	CC has more granular tiered retries

Execution Order Details

The real order in CC source query.ts:

applyToolResultBudget (L379): persist large results first, ensuring full content is saved
snipCompact (L403): trim middle messages
microcompact (L414): old result placeholders
contextCollapse (L441): independent context management system (not in teaching version)
autoCompact (L454): LLM full summary

The teaching version's budget → snip → micro order matches this. The teaching version does not have the contextCollapse mechanism.

read_file Trade-off

The teaching version's micro_compact replaces old tool_result blocks with placeholders uniformly, including read_file. This usually does not affect functional correctness: if the model needs the file contents later, it can read the file again. The cost is an extra tool call and potentially lower prompt cache hit rates.

Claude Code does not solve this with the teaching version's simple rule. It also puts Read in the microcompactable tool set, but maintains a separate readFileState: repeated reads of unchanged files return FILE_UNCHANGED_STUB, and after compaction it restores recently read file contents within a budget (for example, up to 5 files, 5K tokens per file, 50K tokens total). That is a production-level cache and recovery mechanism. The teaching version does not expand into that machinery; it keeps the simpler trade-off of compacting old results and re-reading when needed.

Full Constant Reference

Constant	Value	Source File
`AUTOCOMPACT_BUFFER_TOKENS`	13,000	`autoCompact.ts:62`
`MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES`	3	`autoCompact.ts:70`
`MAX_OUTPUT_TOKENS_FOR_SUMMARY`	20,000	`autoCompact.ts:30`
`POST_COMPACT_TOKEN_BUDGET`	50,000	`compact.ts:123`
`POST_COMPACT_MAX_FILES_TO_RESTORE`	5	`compact.ts:122`
`POST_COMPACT_MAX_TOKENS_PER_FILE`	5,000	`compact.ts:124`
Time micro_compact interval	60 minutes	`timeBasedMCConfig.ts`
`MAX_COMPACT_STREAMING_RETRIES`	2	`compact.ts:131`

contextCollapse and sessionMemoryCompact

CC source code has two additional mechanisms not covered in this teaching version:

contextCollapse: An independent context management system that, when enabled, suppresses proactive autocompact (autoCompact.ts:215-222), with collapse's commit/blocking flow taking over context management. Manual /compact and reactive fallback remain independent paths, unaffected by contextCollapse.
sessionMemoryCompact: Before compact_history, CC first attempts a lightweight summary using existing session memory (covered in s09) without calling the LLM. This mechanism becomes clearer after learning s09.

What Does the Compression Prompt Look Like?

CC's compression prompt has two hard requirements:

Absolutely no tool calls: It begins with CRITICAL: Respond with TEXT ONLY. Do NOT call any tools., and appends another REMINDER at the end
Analyze first, then summarize: The model must first reason in an <analysis> tag, then output the formal summary in a <summary> tag. The analysis is stripped during formatting

Teaching Version Simplifications Are Intentional

micro_compact uses text placeholders → we don't have API-level cache_edits access
read_file is not special-cased → the teaching version accepts re-reading when needed instead of introducing readFileState and post-compaction recovery
Tokens estimated via character count → precise tokenizers are out of scope
Post-compaction recovery omitted → teaching version only keeps summary, does not auto re-attach files
Two auxiliary mechanisms not covered → they fall in the 10% detail category

The core design principle, cheap first, expensive last, is fully preserved.