Enhancing your LLMs
Even the latest large language models (LLMs) aren’t all that useful, out of the box, for complex domain-specific tasks. We often talk to folks who are interested in enhancing an LLM. They have the correct intuition that their use case would be well served by “something like ChatGPT”, but one more in tune with their specific domain and its requirements. So I thought I’d put together a very quick primer on the most common methods for supercharging an LLM.
Finetuning
We’ll start with the oldest of these techniques, one that applies to all sorts of AI systems, not just LLMs. After an LLM has been trained, at a cost of many millions or billions of dollars, it has learned to statistically replicate the text corpus it was trained on. In finetuning, we take a smaller, more specialized (often private or proprietary) dataset and continue the training for a few more rounds, hopefully at a cost nowhere near millions of dollars. Think of it like higher education: high school gives you a broad understanding of the world, while at university you go deep on a topic of your choice.
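To make that concrete, here is a minimal sketch of what continued training can look like, using the Hugging Face transformers and datasets libraries. The base model, the corpus file name, and the hyperparameters are placeholders for illustration, not recommendations.

```python
# Minimal finetuning sketch: continue training a base causal LM on a
# specialized corpus. Model name, file path, and hyperparameters are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

base_model = "gpt2"  # stand-in for whichever base model you finetune
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Your specialized (often private) corpus, one example per line.
dataset = load_dataset("text", data_files={"train": "my_domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # a few more passes over the specialized data
```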
While finetuning is a staple in computer vision, I find it of limited relevance for large language models (it matters much more for small language models that you want to train for a very narrow task). The large models have “seen it all”, and showing them a few more examples of “how a lawyer writes” has limited effect on how useful the resulting model is in the end. What’s more: the moment you use a finetuned model, you’re cut off from improvements in the underlying base model. If you finetuned on GPT-3, you’d have to redo that tuning run for GPT-4, 4.5, 5, and so on.
Context Engineering
Sounds like another buzzword, but the idea is sound. If we consider Prompt Engineering the art of posing the question to the LLM in just the right way, Context Engineering is the art of giving it just the right background information to succeed at its task. The idea is simple: you set up your AI system so that each request to the LLM also brings with it a wealth of relevant context. That could be examples of the writing style you’re going for, or extensive guides on the desired style and output characteristics. We see this a lot in coding assistants. Claude Code, for example, will consult an instruction file (CLAUDE.md) where you can tell it how you want it to approach a coding task.
In addition to instructions, you can also set up the system so that just the right amount and type of context gets pulled into each request (a special case of this method is RAG, which we’ll talk about in the next section).
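Here is a rough sketch of what that can look like in code: a standing style guide and a handful of examples get bundled into every request. The file names and helper functions are made up for illustration, and the call assumes the current OpenAI Python SDK; swap in whichever model and client you actually use.

```python
# Context engineering sketch: every request carries curated background
# material alongside the actual question. File names are hypothetical.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def build_context() -> str:
    """Pull together the standing instructions and examples for each request."""
    style_guide = Path("style_guide.md").read_text()   # how we want output written
    examples = Path("good_examples.md").read_text()    # a few exemplary answers
    return f"## Style guide\n{style_guide}\n\n## Examples\n{examples}"

def ask(question: str) -> str:
    messages = [
        {"role": "system",
         "content": "You are our in-house writing assistant.\n\n" + build_context()},
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

print(ask("Draft a short product update announcement."))
```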
The benefit of this method is that it’s intuitive and largely independent of the underlying model. Claude, GPT, and co. might differ in how well they follow the instructions, examples, and guidelines, but you don’t have to perform another expensive training run just to use the same setup with the next version of a model.
Retrieval Augmented Generation (RAG)
Already old in “AI time”, RAG can be considered a special case of context engineering. The problem with dumping all relevant information, indiscriminately, into each request to an LLM is twofold. First, it’s expensive for models where you pay per (input) token. Second, it presents a “needle in the haystack” challenge for the LLM: “Somewhere in these 10,000 pages of documentation is a single paragraph with relevant info. Good luck!”
To solve this challenge, a RAG system first chops the input documents into more digestible pieces called chunks. Each chunk is then put through an embedding model that turns text into so-called vector embeddings. Sounds complicated, but an embedding is just a bunch of numbers. The cool thing about these model-generated number-bunches is that sentences which talk about the same idea end up with “almost the same” bunches of numbers. We could make this all mathematically precise, but the high-level idea is enough here. Each chunk, together with its vector, is then stored in an aptly named vector database.
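A bare-bones version of that indexing step might look like the following. The chunking is deliberately naive (fixed-size character windows), the embedding model is just a common open-source default, and a plain Python dictionary stands in for the vector database.

```python
# RAG indexing sketch: chunk the document, embed each chunk, keep
# (chunk, vector) pairs. The source file name is hypothetical.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunks with a little overlap."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

document = open("documentation.txt").read()   # the big document to index
chunks = chunk_text(document)
vectors = embedder.encode(chunks, normalize_embeddings=True)  # one vector per chunk

# Our stand-in "vector database": the chunks plus an array of their embeddings.
index = {"chunks": chunks, "vectors": np.asarray(vectors)}
```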
Now, when a request comes in, the RAG system computes the vector for that request and finds a handful of chunks whose vectors are close to it. Those chunks are then added to the context alongside the request.
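Continuing the sketch above (it reuses the embedder and index from the indexing step), the query side then looks roughly like this. Because the vectors were normalized, a simple dot product gives us cosine similarity.

```python
# RAG retrieval sketch: embed the request, rank chunks by cosine similarity,
# and prepend the best few to the prompt. Reuses `embedder` and `index` above.
import numpy as np

def retrieve(query: str, index: dict, top_k: int = 3) -> list[str]:
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index["vectors"] @ query_vec        # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:top_k]      # indices of the closest chunks
    return [index["chunks"][i] for i in best]

def build_rag_prompt(query: str, index: dict) -> str:
    context = "\n\n".join(retrieve(query, index))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```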
This sounds more complicated than it is. The promise of RAG is that you only feed the relevant chunks to the LLM. The challenge is that there are quite a few nuances to get right. How big should chunks be? What measure of vector similarity are we going to use?
And, crucially, not every request is of such a nature that a good answer can be found in a single relevant chunk. Often we need to take the entire document into account, which gets us right back to the original needle-in-the-haystack problem. Advanced versions such as Graph-RAG exist, though they are more challenging to set up than simple RAG.
It all depends…
The best method depends on your specific use case and challenge. The list above gives a very short overview of what’s out there. A great resource on these topics is Chip Huyen’s book AI Engineering. Thanks for sticking with this denser-than-usual post.
If you want to discuss which approach might be best for your problem, hit reply or schedule a call.