Understand LLM Context

Concepts

Fundamentals and Representation

At its core, an LLM is stateless. Every API call begins with a blank slate, and the model retains nothing from previous interactions unless explicitly provided. What we call context is essentially the working memory of the system: the full payload of text (and increasingly, other modalities) that gets passed to the model on each inference call. This is distinct from persistent memory or long-term storage; context exists only for the duration of a single forward pass.

Understanding this distinction is critical for system design. When users perceive a chatbot as "remembering" their preferences, that illusion is constructed entirely by the application layer. The model itself has no notion of yesterday's conversation. Every piece of relevant history must be serialized and injected into the context window, every single time.

Before the model can process text, it must be converted into tokens. Tokenization algorithms like Byte Pair Encoding (BPE) or SentencePiece break text into subword units that the model can understand. A crucial insight for engineers: tokens do not map cleanly to words. The phrase "unbelievable" might be a single token, while ChatGPT could be split into three. This has direct implications for context budgeting. When you have a 128K token limit, that does not translate to 128,000 words. Depending on the language, formatting, and content type, you might fit anywhere from 80K to 300K words. Technical content with code snippets, special characters, or non-English text tends to be less token-efficient.

Attention Mechanism

Context windows exist because of how transformer models process information. The self-attention mechanism allows every token in the input to attend to every other token, computing relevance scores and building contextual representations. This is what enables LLMs to understand that "it" in sentence five refers to an object mentioned in sentence one.

However, this architecture comes with constraints. The attention computation scales with the square of the sequence length, which is why extending context windows has historically been so challenging. A model processing 8K tokens performs roughly 64 million attention calculations. Scale that to 128K tokens and you're looking at over 16 billion. This computational reality shapes everything from latency to infrastructure costs, and understanding it helps explain why "just make the context bigger" is never a simple solution.

Context Window & Its Management

Anatomy of a Context Window

Every model has a hard limit on context size, defined by its architecture and training. But the effective limit is usually smaller. System prompts, safety guidelines, and formatting overhead consume tokens before the user's content ever arrives. In production systems, it's common to reserve 10-20% of the context window for system-level instructions, leaving the remainder for actual conversation and retrieved content.

Context windows have grown dramatically over the past two years. GPT-3 launched with 4K tokens. Today, models routinely support 128K, 200K, or even 1M+ tokens. But bigger is not automatically better. Larger windows introduce latency, increase costs, and as we'll explore later, can actually degrade output quality in subtle ways. The evolution of window sizes reflects both architectural innovation and market pressure, but engineers should resist the temptation to treat expanded limits as a solution to all context problems.

The system prompt deserves special attention. This is the foundational instruction set that shapes model behavior, and it sits at the privileged beginning of the context. In complex applications, system prompts can run to thousands of tokens. Every token spent here is unavailable for conversation history or retrieved documents. Designing efficient, effective system prompts is an underappreciated skill in LLM engineering.

Composition and Truncation Strategies

Production systems rarely dump content into the context window haphazardly. Instead, they implement layered architectures with clear priority hierarchies. A typical structure might place system instructions first, followed by relevant retrieved documents, then conversation history, and finally the current user query. Each layer has a token budget, and when the total exceeds the limit, something must be cut.

The simplest truncation strategy is FIFO: first in, first out. When context overflows, drop the oldest messages. This is easy to implement but problematic in practice. Conversations often reference earlier content, and blindly removing old messages can sever important threads. Users might ask "what did I say about the budget?" when that budget discussion was truncated three turns ago.

Semantic-aware truncation offers a more sophisticated approach. Rather than purely chronological removal, the system scores content by relevance to the current query and retains what matters most. This requires additional computation (typically embedding similarity) but preserves coherence better. Some systems maintain "pinned" content that never gets truncated: critical user preferences, key facts, or summary checkpoints that anchor the conversation's context regardless of length.

Token budgeting can be fixed or dynamic. Fixed allocation assigns rigid limits to each context layer (e.g., 4K for system prompt, 20K for documents, 10K for history). Dynamic budgeting adjusts based on the specific query; a research question might allocate more to retrieved documents, while a casual chat prioritizes conversation history. The right approach depends on your application's needs and how predictable user interactions are.

Multi-Turn Session State

From the API perspective, every call is independent. But users experience conversations as continuous sessions. Bridging this gap is the application's responsibility. The client (or an intermediary service) must store conversation history and reconstruct it for each request.

Session serialization raises practical questions. How do you store conversation state? How do you handle concurrent requests within the same session? When a user returns after hours or days, do you restore the full history or start fresh? These decisions affect user experience, cost, and system complexity. Some architectures maintain server-side session stores with TTLs, while others push state management entirely to the client. There is no universal answer, only trade-offs appropriate to your use case.

Context Rot

Attention Degradation and Semantic Drift

Larger context windows create an illusion of unlimited memory, but attention is not evenly distributed. Research has documented the Lost in the Middle phenomenon: information placed in the middle of long contexts is retrieved and utilized less reliably than content at the beginning or end. Models exhibit both primacy bias (favoring early content) and recency bias (favoring recent content), leaving the middle as a kind of dead zone.

This has direct engineering implications. If you stuff 50 documents into context, the model might effectively ignore 30 of them based purely on position. Relevance ranking becomes crucial. The most important content should be placed strategically, not just appended sequentially.

As conversations extend, a subtler problem emerges: semantic drift. Initial instructions get diluted as the context fills with other content. A model told in the system prompt to "respond formally" might gradually shift to casual language as informal user messages accumulate. Conflicting information compounds the issue. If a user corrects themselves multiple times, all versions persist in context. The model must somehow reconcile contradictions, and it doesn't always do so reliably. In extreme cases, extended sessions can exhibit persona collapse, where the model's coherent behavior degrades into inconsistency.

Compaction Challenges

The obvious solution to context limits is summarization: periodically condense history to free up space. But summarization is inherently lossy. Details vanish. Nuance flattens. A summary stating "the user discussed budget concerns" loses the specific numbers, the emotional tone, and the context of why those concerns arose.

Reference resolution becomes particularly fragile after compaction. If the original context contained "I'll call my brother Mike about this," a summary might reduce this to "user will follow up with family." When the user later asks "did I mention Mike?", the model has no answer. Proper nouns, specific commitments, and temporal references are especially vulnerable to summarization loss.

Perhaps most insidious is the summarization-of-summaries problem. As conversations stretch across many sessions, you might summarize a summary, then summarize that summary again. Each pass compounds information loss. After several iterations, the resulting context may bear little resemblance to what actually occurred. Designing compaction strategies that preserve essential facts while discarding redundancy is a genuinely hard problem with no clean solutions.

Computational and Latency Costs

Context size directly impacts performance. Time-to-first-token (TTFT), the delay before the model begins streaming a response, increases with context length. For latency-sensitive applications like chat interfaces, this degradation is noticeable. Users waiting 3-5 seconds for a response in a long conversation will feel the friction.

Cost scales similarly. Most API pricing is based on token count, both input and output. A conversation that has accumulated 80K tokens of context costs 20 times more per turn than a fresh conversation with 4K. Over thousands of users and millions of requests, this adds up quickly. Context management is not just an engineering problem; it's an economic one.

Modern inference infrastructure uses KV-caching to avoid recomputing attention for unchanged context. But this cache is fragile. Any modification to context (even inserting a single token) can invalidate the cache and force full recomputation. Systems that dynamically inject retrieved content or reorder context elements may inadvertently defeat caching optimizations, paying the full computational cost on every request.

Current Approaches and Future Strategies

Retrieval-Augmented Generation (RAG)

RAG addresses context limitations by decoupling storage from the context window. Instead of cramming everything into context, you store information externally (typically in a vector database) and retrieve only what's relevant for each query.

The approach involves chunking documents into segments, generating embeddings for each chunk, and performing similarity search at query time. Effective chunking is harder than it sounds. Chunks that are too small lose context; chunks too large waste tokens and reduce precision. Hybrid search combining semantic similarity with keyword matching often outperforms pure vector search, especially for queries involving specific names, codes, or technical terms.

RAG is not a silver bullet. Relevance scoring is imperfect, and important information gets missed. There's also a tension between relevance and recency. A semantically similar document from two years ago might outrank a less similar but current document. For conversational applications, RAG struggles with the inherently temporal nature of dialogue. "What did we discuss yesterday?" requires temporal awareness that pure similarity search doesn't provide.

Memory Architectures and Compression Techniques

Sophisticated systems implement tiered memory architectures inspired by cognitive science. Working memory holds the current conversation and immediate context. Episodic memory stores summaries of past interactions, tagged with temporal metadata. Semantic memory captures distilled facts and user preferences that persist across sessions. Each tier has different retention policies, summarization strategies, and retrieval mechanisms.

Summarization pipelines in these architectures must preserve metadata alongside content. A summary should retain timestamps, confidence levels, and source references so the system can trace back to originals when needed. Event-driven consolidation (summarizing after significant interactions) often works better than purely time-based approaches, as it aligns with natural conversation boundaries.

Prompt compression techniques like LLMLingua and AutoCompressors offer a different angle. Rather than summarizing at the semantic level, these approaches compress text while preserving essential meaning. The model receives a compressed representation that maintains more information than a natural language summary of equivalent length. This is still an active research area with trade-offs between compression ratio and fidelity.

The fundamental tension in all compression approaches is lossy versus lossless. Lossless preservation of full context is eventually impossible under fixed token limits. Lossy approaches sacrifice information. The art is in choosing what to sacrifice, and no automated system does this perfectly. Human-in-the-loop review of memory consolidation remains valuable for high-stakes applications.

Architectural Innovations and Future Outlook

Architectural research continues to push at context limitations. Sparse attention patterns, implemented in models like Longformer and BigBird, reduce computational complexity by limiting which tokens attend to which. Rather than full quadratic attention, these models use sliding windows combined with global tokens, allowing much longer contexts without proportional cost increases.

State-space models like Mamba represent a more radical departure. By replacing attention with recurrent state-space layers, these architectures achieve linear scaling with sequence length. Early results are promising, though the approach involves different trade-offs around in-context learning and fine-tuning behavior. Whether state-space models will complement or replace transformers for long-context applications remains an open question.

Tool-augmented context offers a pragmatic near-term solution. Instead of storing everything in context, the model gains access to external memory through tool calls. It can query a database, search past conversations, or retrieve specific documents on demand. This shifts context management from passive accumulation to active retrieval, aligning compute with actual information needs.

Self-reflective context pruning represents an emerging pattern where the model itself evaluates context relevance. Rather than relying on external heuristics, the LLM scores which parts of its context are most valuable for the current task and suggests what can be safely removed. This approach leverages the model's understanding of its own attention patterns but requires careful prompt engineering to avoid runaway pruning.

The trajectory of these developments points toward what researchers increasingly call cognitive architectures. These are integrated systems combining working memory, long-term storage, retrieval, tool use, and meta-cognitive monitoring into coherent wholes. The context window becomes just one component in a larger memory infrastructure, dynamically managed rather than passively filled. Production systems are already moving in this direction, and the next generation of foundation models will likely incorporate memory primitives more natively.

Conclusion

Context is the central bottleneck in LLM system design. Every architectural decision, from tokenization to summarization, from retrieval to caching, ultimately flows through this constraint. Engineers who understand context deeply build better systems.

The trade-offs are persistent: capacity versus fidelity versus cost. Larger windows offer more information but introduce latency, expense, and attention degradation. Compression preserves tokens but loses nuance. Retrieval enables scale but imperfect relevance. There is no free lunch.

The practical imperative is to design context-aware abstractions from day one. Don't treat the context window as a dumping ground. Implement thoughtful layering, prioritization, and lifecycle management early. Monitor context utilization and quality metrics in production. Build for the systems of tomorrow, where context is actively managed by intelligent memory architectures, not just passively accumulated until limits force truncation.

Understand LLM Context

Concepts

Fundamentals and Representation

Attention Mechanism

Context Window & Its Management

Anatomy of a Context Window

Composition and Truncation Strategies

Multi-Turn Session State

Context Rot

Attention Degradation and Semantic Drift

Compaction Challenges

Computational and Latency Costs

Current Approaches and Future Strategies

Retrieval-Augmented Generation (RAG)

Memory Architectures and Compression Techniques

Architectural Innovations and Future Outlook

Conclusion

Comments

More from this blog

RAG and Agentic AI

Results and errors handling strategies (in C#)

Command Palette

Concepts

Fundamentals and Representation

Attention Mechanism

Context Window & Its Management

Anatomy of a Context Window

Composition and Truncation Strategies

Multi-Turn Session State

Context Rot

Attention Degradation and Semantic Drift

Compaction Challenges

Computational and Latency Costs

Current Approaches and Future Strategies

Retrieval-Augmented Generation (RAG)

Memory Architectures and Compression Techniques

Architectural Innovations and Future Outlook

Conclusion

Comments

More from this blog