AI Agents, Mathematics, and Making Sense of Chaos
Enterprises face a critical choice in their generative AI adoption strategy: fine-tuning or Retrieval-Augmented Generation (RAG)? While fine-tuning has been the go-to approach for early adopters seeking to quickly adapt genAI to their needs, a new study suggests that RAG may be the more powerful and sustainable path forward.
To date, fine-tuning has been the favored approach for enterprises looking to harness genAI. By further training foundation large language models (LLMs) on domain-specific data, businesses can rapidly develop custom applications tailored to their needs. This plug-and-play simplicity has made fine-tuning the entry point for many: a16z's research shows that 72% of enterprises rely on fine-tuning while only 22% rely on RAG.
However, the popularity of fine-tuning may owe more to timing than true technical superiority. As the first widely accessible adaptation technique, fine-tuning naturally attracted early adopters eager to experiment with genAI. The publicity around prominent fine-tuned models like BloombergGPT further fueled this trend.
However, as enterprises move beyond initial pilots into large-scale deployment, the limitations of fine-tuning are coming into sharper focus. Fine-tuning large models is resource-intensive, requiring substantial computational power and specialized technical talent. Fine-tuned models also struggle to keep pace with the rapid evolution of genAI, leaving enterprises at risk of being leapfrogged by nimbler competitors.
Rather than retraining LLMs on static datasets, RAG dynamically retrieves relevant information from a knowledge base to inform the model's outputs in real-time. This approach offers several advantages over fine-tuning.
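To make the mechanics concrete, here is a minimal sketch of a query-time RAG loop. Everything in it is illustrative: the bag-of-words "embedding", the in-memory store, and the call_llm placeholder are simplified stand-ins, not the study's setup or any particular framework's API.

```python
# A minimal, illustrative RAG loop (all names here are hypothetical).
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": a term-frequency vector. Real systems use dense
    # neural embeddings, but the retrieval logic has the same shape.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class KnowledgeBase:
    def __init__(self) -> None:
        self.docs: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        # Indexing new knowledge is cheap -- no model retraining involved.
        self.docs.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder for any LLM API call; echoes the prompt for demo purposes.
    return prompt

def answer(kb: KnowledgeBase, question: str) -> str:
    # Retrieved passages are injected into the prompt at query time,
    # grounding the model's output in the knowledge base.
    context = "\n".join(kb.retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```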
First, RAG is inherently more flexible and scalable. Enterprises can continually expand their knowledge base without expensive retraining, allowing them to quickly adapt to new data and use cases. This "build once, use many" architecture also enables a more efficient allocation of compute resources.
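With a store like the hypothetical KnowledgeBase sketched above, "build once, use many" reduces to indexing new material as it arrives; the model itself never changes:

```python
kb = KnowledgeBase()
kb.add("Q3 revenue guidance was raised to $1.2B.")    # hypothetical documents,
kb.add("The standard returns window is 30 days.")     # purely for illustration
print(answer(kb, "What is the returns policy?"))

# Months later: new facts are one add() away, with no retraining cycle.
kb.add("As of Q4 the returns window was extended to 60 days.")
```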
Second, RAG gives enterprises greater control over the provenance and security of their data. By keeping sensitive information in-house and querying it on demand, businesses can mitigate the data leakage and regulatory risks that come with sharing proprietary datasets externally for fine-tuning.
Control and customization are key, and for now these factors outweigh cost. This is where open source comes into the picture: secure control of proprietary data, visibility into why models produce certain outputs, and the ability to reach a required level of accuracy for a given use case are the primary reasons enterprises adopt open source and, to date, fine-tuning.
But RAG also has the potential to significantly reduce costs. Enterprises can build on top of open-source RAG frameworks and plug in their own knowledge bases, sidestepping the steep licensing fees of cloud-based fine-tuning services. As genAI budgets inevitably come under scrutiny, this cost-efficiency is a major selling point.
The researchers conducted a head-to-head comparison of the two techniques across a range of state-of-the-art LLMs and enterprise use cases.
On average, RAG-based models outperformed their fine-tuned counterparts by 16% on ROUGE score, 15% on BLEU score, and 53% on cosine similarity (see below for more explanation on what these tests mean). In plain English, RAG is better at "getting the gist" of the text, communicating as a human would, and returning results that are more factually grounded and relevant to the task at hand.
Interestingly, the study also found that combining RAG with fine-tuning did not yield additional benefits, and in some cases actually degraded performance. This underscores that RAG is a self-sufficient adaptation technique that can deliver superior results on its own.
Perhaps most importantly, the research validates RAG's ability to mitigate hallucination—the tendency of LLMs to generate plausible but factually incorrect outputs. By grounding responses in verified knowledge, RAG substantially reduces this risk compared to fine-tuning. For enterprises deploying genAI in high-stakes domains, this safety advantage cannot be overstated.
What's the downside of RAG? Fine-tuned models can adapt more deeply to the target task and reach conclusions that may not be available with RAG, making them more flexible and adaptive, so they may appear "smarter" on complex tasks. Matching that performance with RAG requires greater investment in the knowledge infrastructure and retrieval architecture needed to support dynamic context injection. The diagram below shows a conceptual example of the additional search infrastructure required for RAG.
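As a rough sketch of what that extra infrastructure involves (reusing the hypothetical KnowledgeBase from the earlier sketch; names and parameters are illustrative): before a single query is served, documents have to be chunked, embedded, indexed, and kept fresh as sources change.

```python
# Illustrative ingestion pipeline: the "knowledge infrastructure" side of RAG.
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Split long documents into overlapping word windows so retrieval
    # returns focused passages rather than whole files.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def ingest(kb: KnowledgeBase, documents: list[str]) -> None:
    # Chunk and index every document. A production pipeline also adds
    # deduplication, access controls, and re-indexing on updates.
    for doc in documents:
        for passage in chunk(doc):
            kb.add(passage)
```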
The key takeaway for enterprises is that RAG's ability to dynamically retrieve and incorporate verified information makes it a more reliable and accurate approach for deploying generative AI at scale. As businesses increasingly rely on AI to make high-stakes decisions and interact with customers, that reliability is the advantage that matters most.
RAG shows how enterprises need to re-envision genAI not as a static tool, but as an ever-evolving complex capability that grows with their organization.
Further details on benchmarking tests:
ROUGE, BLEU, and cosine similarity are all methods for evaluating the quality of text generated by AI models. Here's what each of them measures: ROUGE measures how much of a human-written reference text is captured in the model's output (an overlap-based recall measure). BLEU measures how much of the model's output is supported by the reference (an overlap-based precision measure, originally developed for machine translation). Cosine similarity measures how close the output and the reference are in meaning by comparing their vector representations.
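For intuition, here is a from-scratch sketch of simplified, unigram-only versions of these metrics. Real ROUGE and BLEU combine multiple n-gram orders (and BLEU adds a brevity penalty), and cosine similarity in practice uses dense embeddings rather than word counts; this sketch only shows what "overlap with a reference" means.

```python
from collections import Counter
import math

def rouge1_recall(candidate: str, reference: str) -> float:
    # Fraction of the reference's words recovered by the candidate.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

def bleu1_precision(candidate: str, reference: str) -> float:
    # Fraction of the candidate's words backed by the reference.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    return overlap / max(sum(cand.values()), 1)

def cosine_sim(a: str, b: str) -> float:
    # Angle between bag-of-words vectors; 1.0 means an identical word mix.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values()))
    norm *= math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

reference = "the returns window is 60 days"
output = "customers have a 60 day returns window"
print(rouge1_recall(output, reference))    # 0.5
print(bleu1_precision(output, reference))  # ~0.43
print(cosine_sim(output, reference))       # ~0.46
```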
The strong performance on ROUGE and BLEU suggests that RAG is more effective at generating text that matches what a human would write for the same task. This is likely because RAG can draw upon a vast knowledge base to find the most relevant information, rather than relying solely on patterns learned during fine-tuning.
The large advantage in cosine similarity is particularly noteworthy, as it points to RAG's ability to generate text that is semantically on-topic. By retrieving passages that are conceptually related to the input query, RAG can produce responses that are more focused and coherent.
Crucially, these metrics also have implications for the factuality and accuracy of the AI-generated text. Higher ROUGE and BLEU scores imply that RAG is more faithfully conveying the information from the reference texts, which are assumed to be factually accurate. The cosine similarity boost further reinforces that RAG's outputs are more closely aligned with the semantic content of the verified knowledge base.
In other words, by grounding its responses in a curated repository of information, RAG is less prone to hallucination (i.e., generating plausible but false statements) compared to fine-tuned models that rely purely on learned patterns. This is a critical advantage for enterprises deploying AI in domains where accuracy is paramount, such as finance, healthcare, and legal services.
Of course, no evaluation metric is perfect, and there are ongoing debates in the AI community about how to best assess language model outputs. But the fact that RAG consistently outperforms fine-tuning across multiple well-established benchmarks provides compelling evidence of its superiority in generating factual and semantically relevant text.