RAG and Agentic AI

Overview
Agentic AI (multi-agent LLM workflows) and Retrieval-Augmented Generation (RAG) are complementary. Agents run perceive → plan → act → observe loops and call tools/APIs. RAG supplies curated, evidence-backed context via an ingestion and a retrieval pipeline to reduce hallucination and improve decision accuracy.
This combination addresses a fundamental limitation of standalone LLMs: while they're remarkably good at reasoning and generation, their knowledge is frozen at training time. Worse, they can't reliably tell the difference between what they actually know and what they're making up. RAG grounds agent outputs in verifiable sources, which turns speculative responses into evidence-based decisions. In enterprise settings where accuracy, auditability, and domain-specific knowledge aren't optional, this matters a lot.
What's changed recently is how these systems relate to each other. Modern agentic systems increasingly treat RAG not as a separate preprocessing step but as an intrinsic capability. Agents autonomously decide when to retrieve, what sources to consult, and how to synthesize multiple evidence streams. This shift from RAG as preprocessing to RAG as agent skill enables more sophisticated workflows where retrieval happens dynamically based on what the task actually requires.
Ingestion Pipeline
Source Normalization
Convert PDFs, Word docs, spreadsheets, images, and other artifacts into machine-readable, LLM-friendly formats such as Markdown. Use document conversion utilities (e.g.: Docling) to normalize PDFs to Markdown while preserving structure and metadata.
Markdown makes sense as a target format because it preserves semantic structure (headings, lists, tables) while staying lightweight and universally parseable. Other solid options include Unstructured.io for gnarly document layouts, PyMuPDF for PDF text extraction with coordinate preservation, and Pandoc for cross-format conversions. For scanned documents, you'll need OCR engines like Tesseract or cloud services (Azure Document Intelligence, AWS Textract, etc) that can handle degraded image quality and complex layouts.
One thing worth setting up early: document fingerprinting using content hashes to catch duplicates and track versions. This prevents your index from bloating with redundant content and enables incremental updates when source documents change. You'll also need clear rules for handling conflicts when the same information shows up in multiple sources with different values.
Content Enrichment
Extract and preserve tables, graphs, captions, page numbers, image alt text; detect truncated pages and flag for OCR/manual review.
Tables are tricky. Simple extraction often loses the semantic relationship between headers and cells. You'll want table-aware parsing that converts tables to structured formats (JSON, CSV) with header associations preserved, or linearizes them into natural language descriptions for embedding. For complex nested tables, storing both the structured representation and a natural language summary works well.
Graphs and charts need special handling too. Extract the underlying data where possible, generate descriptive alt text using vision models, and store both the visual reference and textual description. Keep a mapping between extracted content and its original location (page, bounding box coordinates) so you can cite precisely and let users visually verify.
Set up automated quality gates that flag pages with low OCR confidence scores, detect truncated content at page boundaries, and identify missing figures or broken cross-references. Route flagged content to human review rather than silently ingesting garbage that will pollute your retrieval results downstream.
Chunking Strategy
Split documents into semantically coherent chunks sized to balance retrieval relevance and LLM context limits; include chunk metadata (source id; page; offset; confidence) for provenance.
If I had to pick one thing that makes or breaks RAG systems, it's chunking. This is probably the highest-leverage optimization in the entire pipeline. Naive approaches (fixed character counts, sentence boundaries) fragment semantic units and create chunks that lack sufficient context to stand on their own. Semantic chunking that respects document structure, splitting at section boundaries, paragraph breaks, or topic transitions detected via embedding similarity; that works much better.
Chunk sizes typically range from 256 to 1024 tokens, with 512 being a reasonable default. Overlapping windows (10-20% overlap) prevent information loss at chunk boundaries, which matters because critical details often span your artificial split points. For highly structured documents like legal contracts or technical specifications, hierarchical chunking that preserves parent-child relationships between sections is worth the extra complexity.
Your metadata schema should capture: source document identifier, version/timestamp, page numbers, section hierarchy, extraction confidence, content type (prose, table, code, list), and any domain-specific tags. This metadata powers filtered retrieval ("only search 2024 policy documents") and makes provenance tracking possible in downstream outputs.
Embedding Generation
Use a single, consistent embedding model for both ingestion and query embedding to avoid embedding-space mismatch; store embeddings with chunk metadata.
Embedding model selection has a big impact on retrieval quality. General-purpose models (OpenAI text-embedding-3-large, Cohere embed-v3, BGE-large) work well across domains, but domain-specific fine-tuning can yield substantial gains for specialized corpora in legal, medical, or scientific contexts. Test models on your actual retrieval tasks using held-out query sets before committing to one.
Dimension matters for both quality and cost. Higher-dimensional embeddings (1536, 3072) capture more nuance but increase storage and search latency. Many vector databases support dimensionality reduction or quantization (product quantization, scalar quantization) to trade modest accuracy loss for significant efficiency gains at scale.
Here's something people often learn the hard way: implement embedding versioning from day one. When you eventually upgrade embedding models (and you will), you'll need to re-embed your entire corpus. Without version tracking, you risk mixing incompatible embedding spaces. Store the model identifier alongside each embedding vector and build migration tooling for model transitions before you need it.
Vector DB write
Persist embeddings and metadata to a vector database optimized for fast similarity search and re-ranking hooks.
Which vector database you choose depends on scale, deployment constraints, and feature requirements. Pinecone and Weaviate offer managed services with minimal operational overhead. Milvus, Qdrant, and Chroma provide self-hosted options for data sovereignty requirements. pgvector extends PostgreSQL for teams that want to consolidate on existing infrastructure.
When evaluating options, look at: supported index types (HNSW, IVF, flat), filtering capabilities (metadata predicates during search), update semantics (real-time vs. batch), multi-tenancy support, and hybrid search features. For production systems, also check backup/restore procedures, monitoring integrations, and horizontal scaling characteristics.
Design your indexing strategy around your query patterns. If most queries filter by date range or document type, make sure those fields are indexed for efficient predicate pushdown. For large corpora, partition by logical boundaries (tenant, document category, time period) to limit search scope and improve latency.
Retrieval and Context Engineering
Query Embedding and Hybrid Recall
Convert queries to embeddings and combine semantic similarity with keyword/boolean search to capture both intent and exact matches.
Pure semantic search is great at understanding intent but can miss exact terminology matches that matter in technical domains. A query about "401(k) contribution limits" needs to match documents containing that exact term, even if semantically similar phrases like "retirement savings caps" score higher. Hybrid search gives you both.
Implement hybrid recall using reciprocal rank fusion (RRF) or learned combination weights. RRF provides a parameter-free baseline: for each retrieval method, assign scores based on rank position (score = 1 / (k + rank)), then sum scores across methods. Learned weights require evaluation data but can significantly outperform simple fusion by calibrating relative reliability of each retrieval signal.
Query preprocessing improves both semantic and keyword recall. Expand acronyms, resolve pronouns using conversation context, and decompose complex queries into sub-queries for multi-hop reasoning. For conversational agents, maintain query context across turns: "tell me more about that" needs resolution against the previous retrieval context, not a fresh search.
Top-K Selection and Re-ranking
Retrieve top-K chunks (typical K = 3–5), then re-rank by relevance signals (semantic score; keyword overlap; recency; source trust).
Initial retrieval optimizes for recall; you're casting a wide net to make sure relevant chunks don't slip through. Re-ranking optimizes for precision, surfacing the best chunks from that initial set. This two-stage approach lets you use lightweight embeddings for broad recall, then apply expensive cross-encoder models for precise ranking.
Cross-encoder re-rankers (Cohere Rerank, BGE-reranker, MS MARCO fine-tuned models) jointly encode query and document, enabling richer relevance modeling than dot-product similarity. The latency cost is real though: cross-encoders process each query-document pair individually; so apply re-ranking only to the initial retrieval set, never the full corpus.
Beyond semantic relevance, incorporate domain-specific ranking signals: document recency (prefer current policies over outdated versions), source authority (official documentation over forum posts), user access permissions (filter results the user can't actually see), and historical click-through data where you have it.
Chunk Combination and Coherence
Merge related chunks into a single coherent context block; preserve ordering and add minimal connective prompts to avoid contradictory snippets.
Retrieved chunks often overlap, repeat information, or cover the same concept at different granularities. Naive concatenation wastes context tokens and can confuse the LLM with redundant or subtly inconsistent phrasings. Implement deduplication based on content similarity: if two chunks exceed a similarity threshold (say, 0.9 cosine similarity), keep only the more comprehensive version.
For chunks from the same document, restore original ordering and add minimal structural markers ("From section 3.2:", "Continuing from the previous passage:"). For chunks from different sources discussing the same topic, label source transitions explicitly and note any apparent conflicts rather than hiding them.
A technique worth trying: context stuffing that expands high-confidence chunks with surrounding content. If a 512 token chunk scores highly, fetching the preceding and following chunks often provides valuable context that improves answer quality. You're effectively implementing dynamic chunk sizing based on retrieval confidence.
Compression and Prioritization
Summarize or compress low-value chunks and prioritize high-precision evidence to control token usage and latency.
Context window management is a bigger optimization surface than most people realize. Even with 100K+ token context windows, more content isn't always better. LLMs exhibit "lost in the middle" effects where information buried in long contexts gets less attention. Prioritize placement: put highest-relevance chunks at the beginning and end of the context block, with supporting evidence in the middle.
Tiered compression helps here: high-confidence chunks appear verbatim, medium-confidence chunks get summarized to key points, low-confidence chunks become one-line references with source links. LLM-based summarization can compress 4-5x while preserving key facts, though you're trading latency and risking some information loss.
For latency-sensitive applications, consider progressive retrieval: return an initial response based on top-1 or top-2 chunks while asynchronously retrieving and processing additional context for follow-up. Users generally prefer fast approximate answers with the option to dig deeper over consistently slow comprehensive responses.
Provenance and Evidence Surfacing
Attach chunk metadata and short citations to LLM outputs so agents can justify decisions and enable human verification.
Provenance isn't optional for enterprise RAG. Every factual claim in generated outputs should trace to specific source chunks, letting human reviewers verify accuracy and spot hallucinations. Implement citation generation as a core prompt engineering pattern: instruct the LLM to bracket claims with source references and explicitly flag statements not grounded in retrieved evidence.
Design citation formats for your use case. Inline references work for conversational responses; footnotes suit long-form documents. Include enough metadata for users to find the original source: document name, section heading, page number, and ideally a direct link or preview. For high-stakes domains (legal, medical, financial), consider citation verification that confirms generated citations actually appear in the referenced source.
Surfacing retrieval confidence alongside citations adds useful nuance: "According to the 2024 Employee Handbook (high confidence)..." versus "Based on a 2019 policy document that may be outdated (verify current policy)...". This calibrated uncertainty helps users weigh AI-generated information appropriately.
Scaling Tradeoffs and Operational Concerns
Diminishing Returns
More retrieved tokens can yield marginal gains and eventually degrade performance due to noise and LLM context limits; tune K and chunk size empirically.
The retrieval-quality curve is typically logarithmic. The first few high-quality chunks provide most of the value; each additional chunk contributes less. Past an inflection point, additional retrieval actually degrades output quality. Irrelevant chunks distract the model, near-duplicate content creates confusion, and sheer volume triggers "lost in the middle" attention issues.
Tune retrieval parameters empirically using held-out evaluation sets with human-judged relevance labels. Measure both retrieval metrics (precision@K, recall@K, MRR) and end-to-end answer quality (factual accuracy, completeness, hallucination rate). The optimal K varies by query type: simple factual lookups may need K=1-2, while complex analytical questions benefit from K=5-10.
Adaptive retrieval that adjusts K based on query characteristics or retrieval confidence is worth exploring. If the top-1 chunk scores way higher than alternatives (large margin), additional chunks probably add noise. If scores are clustered, more chunks may provide complementary perspectives worth including.
Latency and Cost
Larger contexts increase inference time and billing; mitigate via chunk prioritization, compression, and local model hosting.
RAG systems stack multiple latency sources: embedding generation (10-50ms), vector search (10-100ms depending on scale and index type), chunk fetching (variable), optional re-ranking (100-500ms for cross-encoder models), and LLM inference (scales roughly linearly with context + output tokens). Profile each component to find your actual bottlenecks.
Cost optimization strategies include: caching embeddings for common queries, tiered storage (hot chunks in memory, cold in disk-backed stores), batching embedding requests, and using smaller models for initial retrieval with larger models only for final generation. For high-volume systems, embedding and re-ranking costs can actually exceed LLM inference costs; hence factor this into architecture decisions.
Implement circuit breakers and graceful degradation. If retrieval latency exceeds thresholds, fall back to reduced K, skip re-ranking, or serve from cache. Users generally prefer fast approximate answers with the option to dig deeper over consistently slow comprehensive responses.
Data Curation Overhead
High-quality ingestion (OCR fixes, table extraction, metadata capture) reduces downstream noise but increases upfront engineering effort.
"Garbage in, garbage out" applies forcefully to RAG. Poor source processing creates systematic retrieval failures that are hard to debug and impossible to fix without re-ingestion. Invest in ingestion quality proportional to the corpus's importance and expected query volume.
Set up continuous quality monitoring: track retrieval success rates by source document, identify chunks that frequently appear in retrievals but rarely in final answers (suggesting low utility), flag sources with high hallucination rates in downstream outputs. Use this telemetry to prioritize re-processing of problematic sources.
Build feedback loops from production usage. When users mark answers as incorrect or unhelpful, trace back to retrieved chunks and source documents. This creates a prioritized queue for manual review and re-curation, focusing human effort where it actually moves the needle on system quality.
Local Models and Runtime Optimizations
On-prem Hosting
Open-source runtimes (e.g.: vLLM, Llama C++) can run models locally for data sovereignty and lower per-call costs.
Local hosting starts making economic sense at scale. Once inference volume exceeds roughly $10K-20K/month in API costs, dedicated GPU infrastructure often provides better unit economics. The exact crossover depends on utilization rates, hardware costs, and operational overhead: model this carefully before committing.
Data sovereignty requirements often mandate local hosting regardless of economics. Regulated industries (healthcare, finance, government) may prohibit sending data to third-party APIs, and even non-regulated organizations increasingly prefer keeping sensitive data on-premises. Local hosting also eliminates external dependencies for availability-critical applications.
Model selection involves different tradeoffs for local hosting versus API usage. Smaller models (7B-13B parameters) run on consumer GPUs and provide acceptable quality for many tasks. Larger models (70B+) require multi-GPU setups but approach frontier model quality. Quantized versions (4-bit, 8-bit) reduce memory requirements 2-4x with modest quality degradation, letting you run larger models on smaller hardware.
Runtime Tuning
Optimize KV cache, batching, and model runtime parameters to accelerate RAG and multi-agent throughput.
KV cache management is the primary lever for inference optimization. The attention mechanism's key-value cache grows linearly with sequence length and eats substantial GPU memory. PagedAttention (vLLM's approach) manages cache memory dynamically, eliminating fragmentation and enabling higher batch sizes. For RAG workloads with predictable context patterns, pre-computing and caching KV states for common context prefixes can help significantly.
Continuous batching dramatically improves throughput by adding new requests to running batches as previous requests complete. Unlike static batching (waiting for a full batch before processing), continuous batching maintains high GPU utilization even with variable request arrival rates. vLLM, TensorRT-LLM, and other modern runtimes implement this by default.
Other tuning parameters worth exploring: tensor parallelism configuration for multi-GPU setups, speculative decoding (using a smaller model to draft tokens verified by the larger model), and flash attention implementations that reduce memory bandwidth requirements. Profile systematically; optimal configurations vary significantly across models, hardware, and workload characteristics.
API compatibility
Maintain the same API surface as cloud models where possible to simplify integration while benefiting from local performance.
Standardizing on OpenAI-compatible API formats lets you switch between providers seamlessly and simplifies testing. vLLM, Ollama, LocalAI, and most inference servers support OpenAI-compatible endpoints. This compatibility layer means you can develop against cloud APIs during prototyping, then deploy to local infrastructure without touching application code.
Abstract provider-specific features behind consistent interfaces: model names, embedding dimensions, token limits, and capability flags should all be configurable without code changes. Implement health checks, retry logic, and fallback chains that can route traffic between local and cloud providers based on availability and load.
Maintain parity testing that validates local deployments produce comparable outputs to reference cloud models. Quantization and different inference implementations can introduce subtle behavioral differences; automated regression testing catches these before they hit production quality.
Agentic Patterns and RAG Integration
Agent Roles
Typical multi-agent patterns: planner/architect → implementer → reviewer for coding; triage and specialized agents for support/HR workflows.
Dividing cognitive labor across specialized agents mirrors how effective human teams work. Planner agents excel at decomposing complex tasks, managing dependencies, and maintaining coherent high-level strategy. Implementer agents focus on executing specific subtasks with deep domain expertise. Reviewer agents apply quality control, catch errors, and ensure outputs meet requirements.
For coding workflows, a common pattern includes an architect agent that designs system structure and interfaces, implementer agents specialized by technology (frontend, backend, database), and a reviewer agent checking for bugs, security issues, and style violations. Each agent can have tailored RAG access; the architect queries design pattern documentation while implementers access language-specific references.
Support workflows benefit from triage agents that classify incoming requests, route to specialized agents (billing, technical, account management), and escalate to humans when confidence is low. Each specialized agent maintains its own knowledge base and RAG configuration optimized for its domain. Implement clear handoff protocols: when an agent transfers responsibility, it should pass relevant context and retrieval results rather than forcing the receiving agent to rediscover information.
Tool Calling and Protocols
Use standardized protocols (e.g., model context protocols) for reliable service/API calls and agent coordination.
Robust tool calling requires careful schema design. Define clear input/output contracts, handle partial failures gracefully, and implement timeout and retry policies appropriate to each tool's characteristics. Tools should be idempotent where possible: agents may retry operations due to transient failures, and non-idempotent tools risk duplicating side effects.
Model Context Protocol (MCP) and similar standards provide structured frameworks for tool definition, capability discovery, and result formatting. Standardized protocols enable tool reuse across agents and simplify debugging through consistent logging and tracing formats. Tool registries that agents can query to discover available capabilities dynamically are worth building.
Agent coordination patterns include direct messaging (agents communicate point-to-point), blackboard systems (agents read/write to shared state), and orchestrator patterns (a central coordinator assigns tasks and collects results). Choose based on workflow complexity: simple linear pipelines work fine with direct messaging, while complex workflows with conditional branching and parallel execution benefit from explicit orchestration.
Observation and Feedback Loops
Instrument agents to observe outcomes, log evidence, and update memory or retraining signals to reduce repeated hallucinations.
Comprehensive observability is essential for debugging and improving agentic systems. Log every agent action: tool calls with inputs/outputs, retrieval queries with results, reasoning traces, final outputs. Structure logs for analysis so you can run queries like "show all cases where the billing agent called the refund tool after retrieval returned no results."
Implement outcome tracking that connects agent actions to downstream results. Did the generated code compile? Did the customer issue get resolved? Did the user accept or reject the suggestion? This feedback enables identifying failure patterns and measuring improvement over time.
Build learning loops that translate observations into system improvements. Common failure patterns should trigger knowledge base updates, retrieval tuning, or prompt refinements. For high-volume systems, automated anomaly detection that flags unusual patterns (sudden spike in tool failures, retrieval returning empty results, agent loops) for human investigation pays dividends.
RAG as a Guardrail
Integrate RAG retrieval into agent decision paths so agents consult evidence before acting and attach retrieved passages for traceability.
RAG transforms agents from confident confabulators into evidence-grounded reasoners. Design agent prompts to require evidence: "Before recommending an action, retrieve relevant documentation and cite specific passages supporting your recommendation." This forces deliberate consultation rather than relying solely on parametric knowledge.
Implement retrieval triggers at decision points: before executing irreversible actions, before providing factual claims to users, before contradicting previous agent outputs. Make absence of evidence explicit in prompts: "If no relevant documentation is found, state that the recommendation is based on general knowledge rather than verified sources."
Use retrieval confidence as an escalation signal. Low-confidence retrievals for high-stakes decisions should trigger human review rather than autonomous action. Combine retrieval-based grounding with output validation, and check generated outputs against retrieved sources to catch hallucinations that slipped through generation.

