You are here, likely because you are exploring the use of LLMs to work with a large number of documents. As LLMs become more powerful, you may be hoping that you can throw a large number of documents at an LLM and get comprehensive, accurate answers over the entire corpus. Unfortunately, this may not work out. Read on to find out why.
LLMs now have super large context windows
The capacity of LLMs has surged considerably in the last few years. Early LLMs could handle only a few thousand tokens, enough for a few pages of text or a short conversation. The situation has totally changed now. Google's Gemini models are a prime example of this rapid expansion: the initial Gemini 1.0 Pro offered a 32,000-token context window, Gemini 1.5 Pro significantly expanded this with a default 128,000-token window, and the 2.5 models reach a staggering 1 million tokens. This leap allows models to ingest entire codebases, lengthy novels, or extensive datasets in one go.
What 1 million tokens really means
To grasp the true scale of a 1-million-token context window, let's translate it into more familiar terms:
- Tokens to Words: A widely accepted conversion rate for English text is that 1 token is roughly equivalent to 0.75 words.
- Words to Pages: A standard, single-spaced page typically contains about 500 words.
- Number of Pages: Therefore, 1,000,000 tokens translate to approximately 750,000 words, which is about 1,500 pages!
For reference, Leo Tolstoy's War and Peace is about 560,000 words. In 1,500 pages you could comfortably fit a dozen of the lengthiest academic papers or comprehensive business reports, a significant portion of a legal codebook, or an entire software repository.
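As a quick sanity check, the conversion above can be expressed in a few lines of Python. The 0.75 words-per-token and 500 words-per-page figures are the rough approximations used above, not exact constants, so treat the output as a back-of-the-envelope estimate.

```python
# Back-of-the-envelope conversion from tokens to pages.
# The ratios are approximations, not exact constants.
WORDS_PER_TOKEN = 0.75   # roughly 0.75 English words per token
WORDS_PER_PAGE = 500     # roughly 500 words on a single-spaced page

def tokens_to_pages(tokens: int) -> float:
    """Convert a token count to an approximate page count."""
    words = tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE

print(f"{tokens_to_pages(1_000_000):,.0f} pages")  # -> 1,500 pages
```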
What are the use cases for large context windows?
Large context windows were designed with specific, high-value use cases in mind, particularly those requiring a deep, holistic understanding of extensive documents or datasets:
- Comprehensive Legal Analysis: Imagine feeding an LLM an entire federal or state legal code, including all relevant amendments, case law summaries, and legislative histories. A lawyer could then ask complex questions like, "Given this new business proposal, identify all potential regulatory compliance risks across our existing contracts and the relevant statutes in jurisdiction X, focusing on environmental impact." The LLM can then cross-reference thousands of pages to identify specific clauses and statutes, providing a granular, context-aware analysis that would take human legal teams weeks.
- Advanced Codebase Comprehension: A developer could input an entire software repository—source code, documentation, configuration files, and even bug reports—into the LLM. They could then query, "Find all instances where this specific data structure is modified across the codebase, explain its dependencies, and suggest refactoring improvements to optimize performance while maintaining backward compatibility." The LLM, with its full view of the project, can offer insights into interdependencies and potential pitfalls that are impossible to discern from reviewing isolated files.
These examples highlight that large context windows excel when deep, context-aware reasoning across a single, vast body of related text is required. For structured data extraction from documents, traditional OCR approaches combined with AI processing often provide better results than feeding raw documents to LLMs.
The "lost in the middle" problem affecting large context windows
Despite their impressive capacity, LLMs with large context windows face a significant challenge: the "lost in the middle" problem. An LLM tends to pay the most attention to information located at the very beginning and very end of its input, while often overlooking or misinterpreting crucial details buried in the middle of a long document. This isn't a bug, but an inherent characteristic of the transformer architecture and how LLMs are trained.
This phenomenon has been documented in academic research, for example, "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. (2023), which demonstrates that performance degrades significantly when relevant information is placed in the middle of long contexts, creating a distinctive U-shaped performance curve.

You probably know that LLMs are next-token prediction machines: they look back at the preceding tokens to predict the next one. As the LLM goes about predicting one token after the next, it processes the initial text a greater number of times, effectively baking the influence of those initial tokens into its working state. The final tokens, which contain the explicit questions or instructions, are the most immediate input before the model generates its response, so the model's final layers are strongly influenced by this information. Text in the middle of a large context window benefits from neither repetition nor the recency effect; it has a lower signal-to-noise ratio, which leads to lower accuracy. The terms context degradation and context rot are also used to describe this negative effect of large contexts on LLM accuracy.
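If you want to observe this effect yourself, a common approach is a "needle in a haystack" test: bury a known fact at different depths of a long prompt and check whether the model can still recall it. The sketch below is a minimal version of that test; `ask_llm` is a hypothetical helper standing in for whichever model API you use, and the filler text and depths are arbitrary illustrative choices.

```python
# Minimal "needle in a haystack" sketch for observing the lost-in-the-middle effect.
# ask_llm is a hypothetical callable wrapping whatever LLM API you use.

NEEDLE = "The access code for the vault is 4721."
QUESTION = "What is the access code for the vault?"
FILLER = "This sentence is neutral filler that carries no useful information. " * 2000

def build_prompt(depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + "\n" + NEEDLE + "\n" + FILLER[cut:] + "\n\n" + QUESTION

def run_test(ask_llm) -> dict[float, bool]:
    """Check recall of the needle at several depths; accuracy often dips in the middle."""
    return {depth: "4721" in ask_llm(build_prompt(depth))
            for depth in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Plotting recall against depth for a real model typically reproduces the U-shaped curve reported by Liu et al.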
Why the situation is even worse for lots of small documents
The "lost in the middle" problem is the same for one large document or many small documents, but additional negative effects creep for small documents that impact accuracy and usability. The severity of these negative effects is summarized below:
| Aspect | Single Large Document | Many Small Documents |
|---|---|---|
| Lost in the Middle Problem | High | High |
| Contextual Confusion | Low. The information is from a single, cohesive source. | ⚠️ High. Factual errors due to information blending from multiple documents |
| Noise & Irrelevance | Low. The entire document is inherently relevant. | ⚠️ High for disparate documents (e.g., a mix of emails, reports, and contracts); Low for uniform documents |
| Lack of Granularity | Low. Typically a well-structured document with full context | ⚠️ High if the "small documents" are snippets without context; Low for well-structured, complete documents |
| Cost & Latency | High. Sending a massive prompt is expensive and slow. | High. Same as the large-document case |
| Systemic Complexity & Fragility | High. A single-payload API call for a huge document is a point of failure. | High. Same as the large-document case |
| Interpretability & Debugging | Low. Hard to trace logic, but the source is a single document. | ⚠️ Very High. Extremely difficult to verify facts or trace errors to a specific source among hundreds of documents |
In summary, simply "dumping" a multitude of small documents into a large context window introduces challenges even beyond the lost-in-the-middle problem.
How retrieval-augmented generation (RAG) mitigates these issues
This is where Retrieval-Augmented Generation (RAG) systems prove invaluable. RAG has become the go-to technique for using LLMs over document collections. For further information on RAG, please refer to the article Augmend vs RAG. RAG mitigates some of the challenges of using many small documents as follows:
| Aspect | Many Small Documents (without RAG) | Many Small Documents with RAG |
|---|---|---|
| Lost in the Middle Problem | High | ✅ Solved. The system fetches only a small, highly relevant set of documents, so there is very little "middle" |
| Cost & Latency | High | ✅ Low. Only relevant documents means fewer tokens, so lower cost and latency |
| Contextual Confusion | High | ✅ Low. A cleaner, filtered context directly relevant to the query reduces the chance of conflicting information |
| Noise & Irrelevance | High for disparate documents; Low for uniform documents | ✅ Low. Same reason as above |
| Lack of Granularity | High for snippets; Low for structured documents | Variable. Depends on the quality of the chunking strategy used to break down documents |
| Interpretability & Debugging | Very High | ✅ Low. RAG uses specific documents and can therefore provide an audit trail |
| Systemic Complexity & Fragility | High | ✅ Low |
In essence, RAG acts as an intelligent librarian for the LLM. It finds the handful of most relevant "books" (documents) and places them neatly on the LLM's "desk" (context window) for focused analysis. Thus the reliability of LLM results actually improves by narrowing down the context to fewer tokens when analysing many small documents!
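To make the librarian analogy concrete, here is a minimal sketch of the retrieval step. It ranks documents against the query with TF-IDF cosine similarity (via scikit-learn) and keeps only the top few for the prompt; production RAG systems typically use embedding-based vector search and chunking instead, but the shape of the pipeline is the same. The function names are illustrative, not part of any particular library.

```python
# Minimal retrieval sketch: keep only the most relevant documents for the prompt.
# Uses TF-IDF similarity for simplicity; real systems usually use embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank documents by cosine similarity to the query and return the top k."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_prompt(query: str, documents: list[str]) -> str:
    """Place only the retrieved documents on the LLM's 'desk' (context window)."""
    context = "\n\n---\n\n".join(retrieve_top_k(query, documents))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

The key point is that only the handful of retrieved documents ever reaches the context window, no matter how large the corpus grows.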
So, if you are simply stuffing all of your documents into the LLM in the hope of more comprehensive results, you are also reducing the accuracy of those results.
How do you get both comprehensive results and good accuracy?
This is why we created Augmend. You essentially want to use the power of LLMs to consistently and accurately extract structured information out of each individual document first; effectively, you need an ETL-like process. This is the core strength of Augmend. After accurate extraction, quantitative and comprehensive analysis is as trivial as running a SQL query. For more on building a comprehensive data strategy, check out our guide on working across three horizons.
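For illustration, an ETL-like flow of this kind might look like the sketch below: extract the same structured fields from every document, load them into a table, and answer quantitative questions with plain SQL. `extract_fields` is a hypothetical stand-in for the per-document extraction step (it is not Augmend's API), and the invoice schema is a made-up example.

```python
# Sketch of an ETL-like flow: extract structured fields per document, then query with SQL.
# extract_fields() is a hypothetical stand-in for a per-document LLM extraction step.
import sqlite3

def extract_fields(doc_text: str) -> dict:
    """Hypothetical per-document extraction, e.g. an LLM call returning fixed fields."""
    raise NotImplementedError("plug in your extraction step here")

def load_and_query(documents: list[str]) -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (vendor TEXT, amount REAL, due_date TEXT)")
    for doc in documents:
        fields = extract_fields(doc)
        conn.execute(
            "INSERT INTO invoices VALUES (?, ?, ?)",
            (fields["vendor"], fields["amount"], fields["due_date"]),
        )
    # Comprehensive analysis becomes a plain SQL query over all documents.
    return conn.execute(
        "SELECT vendor, SUM(amount) FROM invoices GROUP BY vendor ORDER BY SUM(amount) DESC"
    ).fetchall()
```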
Are you interested in finding out more about Augmend? Click the Book a demo button in the navigation bar!