Big Data ≠ LLMs
A while ago, I was talking with a friend about potential use cases specifically for generative AI. My friend brought up a number of areas of their business where AI might help. Their intuition was spot on, but in most of these cases you would not use generative AI or language models. Instead, it was mostly number crunching: big data, statistics, and “classical” machine learning.
To set the record straight: Just because large language models are trained on massive datasets does not mean that they themselves are good at dealing with massive datasets. They are not, and they’re not intended to be. You would not load gigabytes of numerical data (financial records, for example) into ChatGPT and ask it to clean the data or check for anomalies.
I understand the allure: For language problems, LLMs appear to obsolete a lot of the finicky, use-case-dependent model building you had to do in the past. No need to build complex custom systems to classify reviews, apply content moderation to social media posts, or even grade essays. Just throw it all into ChatGPT with the right prompt. (If only it were that easy. But at least it’s plausible.)
With big data, though, there’s no way around custom building. There’s no general model that deals with it all, because there’s nothing that would make sense to train such a model on. And so if you need to find signatures of fraud in a list of credit card transactions, or patterns of buyer behaviour in sales data, you cannot use the same generic model with just a few tweaks to the prompt.
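To make “classical” concrete, here is a minimal sketch of what fraud-style anomaly flagging can look like without any LLM at all: a robust, median-based z-score over transaction amounts. The data, the 3.5 cutoff, and the single-feature setup are illustrative assumptions on my part; a real fraud system would use far richer features and purpose-built models.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which,
    unlike mean/stdev, are not dragged around by the outliers themselves.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:  # all values (nearly) identical: nothing to flag
        return []
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > threshold]

# Mostly small everyday purchases, plus one suspicious outlier:
transactions = [12.50, 8.99, 23.10, 15.00, 9.75, 18.40, 11.20, 9999.00]
print(flag_anomalies(transactions))  # → [7]
```

Nothing here is prompt engineering; it is a few lines of statistics, and that is the point: this kind of number crunching is what the problem actually calls for.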
You might, of course, have a tool that performs some basic statistical analysis automatically, and expose that tool to a language model or agent via the Model Context Protocol (MCP). So you would throw your dataset into the system, then ask a chatbot for a plot of this or that statistic, and it would oblige. I could even envision an automated system that asks you a few questions and then trains an appropriate model on the data so you can start making proper predictions.
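As a sketch of what one such exposed tool might compute, here is the core of a hypothetical summary-statistics function. The MCP server wiring itself is omitted, and the function name and output shape are my own assumptions; the idea is only that the model calls the tool and sees a small summary, never the raw gigabytes.

```python
from statistics import mean, median, stdev

def summarize_column(values: list[float]) -> dict:
    """Basic descriptive statistics for one numeric column.

    In an MCP setup, a function like this would be registered as a tool;
    the LLM would request it by name and receive only this small dict.
    """
    return {
        "count": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "stdev": round(stdev(values), 2) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

print(summarize_column([12.5, 8.99, 23.1, 15.0]))
```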
In these scenarios, the LLM provides a more ergonomic interface, but under the hood you’d be dealing with the tools of statistics, machine learning, and data analysis, not just vibes and prompts.