"The Computer Doesn't Do What I Tell It To!"
As a teenager, I was often called on to provide basic tech support for friends and family. They'd complain that the computer wouldn't "listen" to them. I'd chuckle at that, because in most cases the computer had done exactly as it was told. Barring outright bugs, old-school software is deterministic. Clicks and keystrokes have pre-programmed results you can rely on.
Not so with generative AI. Non-determinism lurks everywhere.
There is the inherent randomness of the generated output: a typical chat model will answer the same question differently each time you ask. When accessing a model programmatically, this randomness can be turned off, but other sources of non-determinism remain.
Where the model is fed raw user input, even slight variations in that input can lead to different results, even with the randomness set to zero: "Can you write me a poem about cats?" and "Make up a poem about cats!" might produce very different poems.
And then, there is of course the unpredictable way the model handles a large input, or a complex request. You might ask for a piece of writing in a particular format and it's anyone's guess whether you get it or not. So in that case, the computer really didn't do what you told it to.
What can AI engineers do here?
Aim narrow, accept wide. That's general good advice for a user experience. The more you have to tell your users exactly how to use the system, the less awesome they'll feel about themselves and the experience. On the flipside, the more you can make the tool's output predictable, or at least non-perplexing, the better.
Add safeguards around user input and post-process the model's output to file off the rough edges. This could be traditional preprocessing (like removing special characters and punctuation from an input where they shouldn't influence the actual output), or it could be yet another AI step.
Tune the model's randomness. Don't just accept the default parameter; it might not be the best, and neither might zero. Experiment and evaluate. (A minimal sketch of these last two points follows below.)
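Here's what that could look like, assuming the OpenAI Python SDK; the model name, the normalization rules, and the temperature value are placeholders to adapt to your own setup:

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def normalize(user_input: str) -> str:
    # File off rough edges that shouldn't change the answer:
    # collapse whitespace and strip trailing punctuation.
    cleaned = re.sub(r"\s+", " ", user_input).strip()
    return cleaned.rstrip("!?.")

def ask(prompt: str, temperature: float = 0.2) -> str:
    # Don't just accept the default temperature; experiment and evaluate.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": normalize(prompt)}],
        temperature=temperature,
    )
    return response.choices[0].message.content
```

With the temperature dialled down and the input lightly normalized, repeated calls become far more consistent, though never perfectly deterministic.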
And if all that doesn't help, just put "vibe" in front of your product name and all is forgiven ;)
AI in Medicine - Promise and Peril
I recently came across Rachel Thomas's talk on AI in medicine. She co-founded fast.ai (a great learning resource for all things deep learning) and has recently taken a strong interest in the intersection of AI and the medical field.
One line from the talk that hit me hard:
It can be exciting to hear about AI that can read MRIs accurately, but that’s not going to help patients whose doctors won’t take their symptoms seriously enough to order an MRI in the first place.
AI in medicine has so much promise; it's worth getting it right. There are the obvious technical pitfalls, such as data quality issues. But as the quote points out, there are also more pernicious inherent biases that, if we don't get it right, will be perpetuated instead of alleviated with AI.
Here are some points on how we would tackle an AI project in healthcare:
Relentless focus on the final desired outcome, which in my book means better patient health outcomes.
Brutal honesty about which metrics (accuracy, recall, F1 score, etc.) actually matter toward that end goal, and at what threshold, so we're not just chasing metrics for their own sake (a small sketch of this follows after this list).
Before writing any code or putting together any wireframes or prototypes, understand exactly the current workflow, how the new product will be integrated, by whom, and what conflicting incentives they might have.
Conduct a pre-mortem: Ask, "It's now six months later and the project was a total disaster. What happened?" Then develop defensively to prevent those things.
In particular, consider the following hypothetical future scenario: "Our well-meaning AI product had this terrible unintended side effect. What could we have done to prevent it?"
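To make that metrics point concrete, here's a minimal sketch with made-up numbers and a hypothetical threshold that would, in reality, be agreed on with clinicians:

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels only: 1 = condition present, 0 = not present
y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # ground truth from chart review
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # model predictions

recall = recall_score(y_true, y_pred)        # missed true cases are the costly error here
precision = precision_score(y_true, y_pred)

RECALL_THRESHOLD = 0.95  # hypothetical bar set with clinicians, not by the data team
print(f"recall={recall:.2f}, precision={precision:.2f}")
print("meets the bar" if recall >= RECALL_THRESHOLD else "not good enough to deploy")
```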
All of that is the non-negotiable groundwork for a successful initiative, and for making sure AI gets used for good.
Leverage: A Physicist’s Rant
In business speak, to leverage means to use. You might as well employ, deploy, or utilize. Just say "use", then.
However, in physics, "leverage" has a precise meaning. It's the principle behind clever contraptions such as:
the lever (duh)
pulleys
gears
These systems have one thing in common: They translate one physical situation into another and let you choose advantageous trade-offs. For example, with a 3:1 pulley, you can triple the force you can apply to something, but you pay for it by having to travel further: To lift something 1 foot with a 3:1 pulley, you must pull on the other end for 3 feet.
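In equation form (ignoring friction), the work you put in equals the work you get out; the gain in force is paid for in distance:

```latex
F_{\text{in}} \cdot d_{\text{in}} = F_{\text{out}} \cdot d_{\text{out}}
\qquad \Rightarrow \qquad
F \cdot 3\,\text{ft} = (3F) \cdot 1\,\text{ft}
```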
When businesses speak of leverage, this trade-off discussion is missing. To say "We're leveraging AI" sounds like you get something for nothing. But just like in the physical context, you rarely get a free lunch. Instead, you're making an advantageous trade-off. So, what is it you're trading in? And is it truly advantageous?
You can use AI to generate code, or writing, faster. But can you keep quality the same?
Using chatbots in customer service lets you scale up massively, but you trade in the human touch and connections.
It's essential to articulate what you're trading away and whether you're okay with it. Leverage only makes sense if what you gain is more valuable than what you lose.
You’re Not Hiring a Calculator…
Imagine you're hiring an accountant, and you worry about making sure that they can't use a calculator in the interview, because all you're going to ask them to do is add numbers in their head, so using a calculator would be cheating. Pointless, right?
It seems candidates and companies are locked in an arms race of AI sophistication, especially in tech. As it turns out, AI coding assistants are really good at the types of puzzles hiring managers like to use in tech assessments: "You're given two lists. Both are sorted. What's the fastest way to find their collective median?" and the like (a quick sketch of that one follows below).
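For reference, here's the straightforward merge-based answer; the "optimal" interview answer is a binary search running in O(log(min(m, n))) time, but this conveys the idea:

```python
from heapq import merge

def median_of_two_sorted(a: list[float], b: list[float]) -> float:
    # Merge the two already-sorted lists in linear time, then pick the middle.
    combined = list(merge(a, b))
    n = len(combined)
    mid = n // 2
    if n % 2 == 1:
        return combined[mid]
    return (combined[mid - 1] + combined[mid]) / 2

print(median_of_two_sorted([1, 3, 5], [2, 4]))  # 3
print(median_of_two_sorted([1, 2], [3, 4]))     # 2.5
```

Exactly the kind of thing an AI assistant produces in seconds, which is rather the point.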
So now we have candidates developing tools that sit in your browser and feed interview questions straight into ChatGPT, and hiring managers wondering if it's time to bring back the in-person (as in, live, not via Zoom) interview.
That's misguided. We can't celebrate the productivity gains that AI enables in those who know how to use it, and then freak out when candidates use AI in the hiring process. Why not design the whole process so that only those who can produce great results with AI pass?
Take writing. Yes, AI can generate lucid and passable responses. But let's say you're actually hiring for a position where someone has to produce writing. Marketing copy for your website, for example. Why bother hiring someone? Couldn't you just ask ChatGPT to "Write my marketing copy, please"? No, really: why aren't you saving all that money and instead spending just five minutes each day prompting the AI for the writing you need? Or hiring a part-time teenager to copy-paste your prompts into ChatGPT and paste the answers onto your website? Maybe because there's skill, taste, and discernment required beyond that? (I know for sure that I couldn't just prompt my way to a Pulitzer Prize.)
So then, in hiring, make it an "open-book" exam. AI explicitly allowed. Just raise the bar for the outcome you want to see. It instantly defeats the point of using AI to "cheat." So ask for spicy takes, strong opinions, a human explanation for why this is good and that is bad. (Try asking ChatGPT something like "which is the worst frontend framework" and you get some "on the one hand, on the other hand, to be fair, in the end..." wishy-washy position. A real human will be happy to go off on a fun rant.)
AI Theatre
Maybe it's because Vancouver's "Bard on the Beach" festival is kicking off, or because agile expert Yuval Yeret recently wrote a great post about Agile Theatre, but I've been thinking about theatre a bit.
As an art form, theatre is great. But when applied to anything else, the term is derogatory:
The security theatre we endure at airports. It causes a lot of hassle but doesn't demonstrably keep us any safer.
The Agile/Strategy/OKR theatre that Yuval writes about. Companies adopt a methodology's outward trappings and props without buying into its deeper insights.
And so, naturally, I think of AI theatre and the forms it can take:
Press-Release-Driven Development: "Look at us, we're so innovative. We use AI!" Yet the splashy demo never gets used in production.
FOMO-Driven Marketing: "Don't get left behind! AI is coming for you, your job, your company, your whole industry. Buy our consulting services NOW."
Chasing the wrong thing: Obsessing over which model tops which leaderboard or which system impressed mathematicians instead of focusing on tangible outcomes and benefits in the real world.
Wild extrapolations, prognostications, and tangential philosophical debates. It's fun. And it's safe, because you're not putting anything immediately falsifiable out there.
It can be entertaining to watch, but the real work happens quietly and with a much more pragmatic focus.
Benchmarks Considered Boring
I think about and write about AI and work with it daily. I've also repeatedly mentioned the importance of objective evaluations for your AI tools. So you might assume that I'm constantly checking the various leaderboards and benchmarks:
Who's on top of Chatbot Arena?
Is Opus-4 or o4-mini better at the FrontierMath benchmark?
"Gemini 2.5 Pro just topped SimpleBench! Here's why that's a big deal."
Honestly, I never really cared about them. Sure, having a rough sense of who's coming out with new models demonstrating generally improved capabilities is good. For that, it's enough to keep a very loose finger on the pulse. If at any point a model makes a really big splash, you're guaranteed to hear about it. No need to compulsively refresh the AI Benchmarking Dashboard.
But beyond larger trends, much of the leaderboard news is just noise. OpenAI comes out with its new model and tops the leaderboard. "This is why OpenAI is the leader and Google missed the boat", the pundits proclaim. Next week, Google comes out and it's all "Oh baby, Google is so back!".
Besides, leaderboard position is a terrible basis for making an informed choice:
The best model at a benchmark might fare poorly on the specific task you intend to use it for.
Or maybe it's "the best" in output quality, but you can't use it for your purpose due to... cost, deployment considerations, compliance issues, latency, etc
Instead of obsessing over rankings that collapse a multi-faceted decision into a single number, you'd do best to consider all these trade-offs holistically and then, combined with a benchmark custom-built for your use case, make an informed decision. Such a benchmark doesn't have to be elaborate; a minimal sketch follows below.
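In this sketch, `call_model` stands in for whichever model or provider you're comparing, and the test cases and pass criteria are placeholders for real inputs from your own workflow:

```python
# Each case pairs a realistic input with a minimal pass criterion.
test_cases = [
    {"input": "Summarize this support ticket: ...", "must_contain": ["refund"]},
    {"input": "Classify the urgency of: ...", "must_contain": ["high"]},
]

def passes(output: str, must_contain: list[str]) -> bool:
    return all(phrase.lower() in output.lower() for phrase in must_contain)

def run_benchmark(call_model) -> float:
    # call_model(prompt) -> str, supplied by you for each candidate model
    hits = sum(passes(call_model(case["input"]), case["must_contain"])
               for case in test_cases)
    return hits / len(test_cases)
```

Run the same handful of real cases against each candidate, weigh the scores against cost, latency, and compliance, and you'll learn more than any leaderboard can tell you.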
If You’re so Smart, Why Aren’t You Rich?
This well-known aphorism highlights the discrepancy between intellect and the accumulation of wealth. Not all smart people are rich, and vice versa.
It is, however, a good yardstick to filter out some AI hucksters: People who know the #1 secret to building 7-figure businesses with AI (in a single weekend) would just do that, instead of hawking their video courses online. That much should be obvious. The technology underlying the unfounded promises of massive wealth has changed (it used to be cryptocurrency), but the playbook is the same.
A more subtle display of this discrepancy is the focus of some AI companies on demonstrating their models' proficiency at obscure, elite-level math and coding problems. See here for an example: Inside the Secret Meeting Where Mathematicians Struggled to Outsmart AI | Scientific American
To me, this is a distraction. What's most likely happening here (the article is light on details) is that the AI recognizes specific repeating patterns in the math problems it has already been trained on and reproduces them.
My retort would be: If this AI (OpenAI's o4-mini model) is so smart, why do we read about the great AI disillusionment? Where are the novel breakthroughs in math and science, made purely by AI with no human input? Where are the actual 10x economic benefits?
These models are undoubtedly powerful, but only if they're carefully integrated, with lots of plumbing, engineering, and human ingenuity, into an overall system that produces results.
AI’s Swiss Cheese Capabilities
Another gem from Andrej Karpathy's Deep Dive into LLMs like ChatGPT is his likening of AI capabilities to Swiss cheese: Delicious all around but with big holes in it.
Imagine conversing with someone for the first time at a social event. You discuss some current affairs topics, and the person comes across as genuinely knowledgeable and thoughtful. Then you switch topics to sports, and not only do they not know what soccer is, they don't even know what a ball is. This is quite unthinkable for humans, but commonplace with AI.
Now, it's true that people can be brilliant in certain areas and lacking in others, yet we have built up an intuition of what we can expect from each other based on demonstrated capabilities.
The "Swiss Cheese" model of AI capabilities tells us that, with generative AI, we cannot rely on that intuition. These models can stun us with super-human performance in one domain and then frustrate us with basic mistakes and gaps in another.
The takeaway for running a successful AI initiative is that each new task and application of AI requires its own set of evaluations. You cannot rely on simple interpolation: "If it knows X and Z, then clearly it must know Y".
This is true both at the macroscopic and microscopic level:
Just because an AI model is good at writing code, which requires sound structured thinking, doesn't mean it'll be good at other tasks that require sound structured thinking, such as writing legal documents.
Just because the AI can do specific coding tasks surprisingly well doesn't mean it won't make mindbogglingly basic mistakes in others.
While these hilarious mistakes sometimes make for viral tweets, they also erode trust in the tools built on top of these models. Better to check, with solid evals and good testing, how well AI performs on your tasks before blindly trusting that it knows what it's doing.
Free Massive Yacht
It's Friday, so in the spirit of keeping the weekend light, here's what it (sometimes) feels like to be in business offering AI:
That Mitchell and Webb Look - Massive Yacht (s04e05) - YouTube
The three owners of MassiveYachts.com want to celebrate their website's success by offering a free massive yacht to their one millionth visitor. You can guess where this is going. The lucky millionth visitor dismisses the pop-up.
It's a case of an overly sensitive BS detector. In the sketch's scenario, the visitor rationalizes that the likelihood of a genuine "you're the millionth visitor" contest is so astronomically low that it's not even worth the minor effort of verifying the offer.
After being oversold and underwhelmed, business owners are rightfully wary of anyone promising massive efficiency gains from AI. But just as in the sketch, I wonder how many real yachts get dismissed as pop-ups these days.
AI vs OR
Inspired by two unrelated LinkedIn posts: one bemoaning a lack of success in using AI for route planning, the other bemoaning that operations research (OR) doesn't get the same time in the spotlight as AI does.
OR is about using the power of math to come up with definitive answers to questions such as how to best pack parcels onto different trucks and then send them to different destinations. Generally it's about finding the solution to a problem that maximizes or minimizes some objective function subject to some constraints.
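As a taste of what that looks like in code, here's a minimal sketch of the parcel example using PuLP, an open-source Python modelling library; the weights, values, and truck capacity are made up:

```python
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

weights = {"p1": 4, "p2": 7, "p3": 3, "p4": 5}    # parcel weights
values  = {"p1": 10, "p2": 14, "p3": 6, "p4": 9}  # value of delivering each parcel
capacity = 10                                     # truck capacity

prob = LpProblem("parcel_packing", LpMaximize)
take = {p: LpVariable(f"take_{p}", cat=LpBinary) for p in weights}

prob += lpSum(values[p] * take[p] for p in weights)               # objective: maximize value
prob += lpSum(weights[p] * take[p] for p in weights) <= capacity  # constraint: fit in the truck

prob.solve()
print([p for p in weights if take[p].value() == 1])
```

The solver returns the provably best selection; no training data required.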
I've seen first hand how OR is treated as the ugly step sibling of AI. At a previous job, in a pre-sales call, we were looking at a slam dunk OR problem. But the client kept pushing, because they wanted an AI solution. Maybe to impress investors, maybe to access special purpose government funding. Either way, the project didn't make sense that way and the client lost out on actually solving their problem.
In a way, AI and OR complement each other. OR solutions are mathematically precise and require no training data. They're great for "narrow road" problems where any slight departure from the right path leads to an invalid answer instead of just a "good but not great" answer.
Where OR falls short, though, is ease of use. Taking a messy real-world problem and shoving it into the mathematical formulations amenable to common solvers is hard, highly specialized work. You can't just hand a trucking company a license to Gurobi (a popular tool for optimizing so-called mixed integer problems) and be on your merry way.
So that's where AI can come to the rescue. I have great hopes for what a generative AI interface into the arcane OR tools could do.
Many people in industry I talk to have problems much better suited to OR, yet they are looking for solutions among the AI tools, because that's what everyone, from LinkedIn influencers to the big consulting firms, is pushing.
User Stories Revisited
Here's one reason AI initiatives end up in Proof-of-Concept Purgatory, never to see the light of the real world: not enough high-bandwidth communication and iterative development between the users and the builders.
Of course this afflicts projects of all types, but at least in recent years, "normal" software projects did quite okay. Take, for example, this explanation by the makers of Linear (a beautifully made project management and issue tracking software) on why they recommend tracking "issues" instead of "user stories" (the agile concept of writing the things you want to work on from the point of view of the user):
To quote,
User stories evolved over twenty years ago as a way to communicate what a customer wanted into product requirements that a software team could deliver. Fast forward to today and a lot of things have changed about how we build software. Customers are tech-savvy enough to articulate basic product requirements. We’ve developed standards for common features such as shopping carts, todo lists, and notifications so there is no need to explain how they should work. The best product and engineering teams understand their users deeply and are familiar with how their product should work.
Source: Write issues not user stories - Linear Method
If a client wants a web shop, they'll talk about the shopping cart, and everyone has a shared mental picture of how that should generally behave. Some discussion is still needed for full alignment, but you won't see wild swings that miss the mark completely.
Not so with AI. There has not been enough time for shared norms to emerge where everybody agrees on what terms mean, how systems should behave, and how AI capabilities should be presented. I've seen this directly affect project outcomes when talking to people who participated in AI initiatives: things get lost in communication, and now there is a wide gap between what the users wanted and what was delivered. In essence, this cartoon is relevant again. In one concrete example, the users needed a tool that gets a good draft going, with lots of iterations. What was delivered was an overengineered tool that tries (in vain) to get a perfect version out on the first shot (to save precious processing time). The users assumed it was obvious that the tool would work the way they expected. The engineers assumed their way was best.
The way out? Communication. To the point that it feels like too much communication. Exactly how will the tool fit into the existing workflow and the larger ecosystem of its company? How will results be delivered? Do we need speed, or accuracy? What are the thresholds for each? Lots of questions, and you can't, for the time being, assume that some things are too obvious to mention.
User stories, as they were originally conceived, were about recognizing this need for thinking from the users' point of view. Communication is important, but certain decisions cannot be made by the user. Instead, they require that the designers and engineers understand, at a deep level, what the users want to accomplish. That's the one decision that takes care of a thousand little decisions down the road. What decidedly does not work is a user asking for a feature without revealing their intent, and engineers rushing ahead with an implementation without pushing to know that intent.
Recalibrate your BS detector
The bullshit detector. We all have one built in. It rings its alarm bells when the Prince of Nigeria reaches out to us with a lucrative scheme. It buzzes when we're around used car salespeople. It stops us in our tracks so that we may ask: Is this too good to be true? A well-calibrated BS detector makes us skeptical enough not to be fooled, but not so skeptical that we dismiss valuable new ideas. As we experience false positives (calling BS on something legitimate) or false negatives (getting fooled by BS), over time our detectors become increasingly well calibrated.
But in the face of rapid innovation, things can get out of whack really quickly. We see this with every big wave of innovation. It's new, it's exciting, the possibilities are endless. Why shouldn't it work? When electricity went mainstream, we got a lot of questionable devices and people fell for them. Electricity had brought them so many marvels, what's one marvel more?
Then suddenly, we had nanoparticles and nanotech in everything. Quantum physics is a perennial favourite, of course, and now we have generative and agentic AI.
What to do? Especially with AI, I firmly believe that it holds great promise. Yet I also shake my head every time I see another AI influencer telling me that their half-baked prompts will replace entire professions.
How can we stay informed and current on the fantastic possibilities without falling for false promises? Some thoughts:
Have lots of conversations with other practitioners in your field. What have they seen work for them?
Press for details. Exactly how is this particular application of AI supposed to work? Is it a big leap compared to what you have seen working out in the wild?
But also: Stay open. Progress has been rapid. What seemed implausible a year ago is now routine. Simply don't bet the whole farm on any one idea.
As we press on and gain experience (and some battle scars, maybe), our BS detectors will become well-calibrated again. Until the next wave of innovation, anyway.
PS: Another way to navigate the threats and opportunities and tell hype from promise is to work with someone you trust on these topics. Ready when you are. Reply directly to see how we can be your navigation aids through these uncharted waters.
Buy vs Build
Is generative AI, and vibe coding in particular, ringing in the end of SaaS (Software-as-a-Service) as we know it? After all, why pay monthly for expensive software when you can code it up for yourself?
Tech influencers would have you believe that you'll save thousands, and they list all the things they've coded up in a weekend.
I'm skeptical.
Time Versus Money
Even before generative AI, any company had to decide whether to buy certain capabilities or build them in-house. Generative AI has shifted the costs of building somewhat, but has certainly not eliminated them. One of the benefits of buying a solution is not just that you save the upfront build cost. You also save the ongoing maintenance cost.
More than anything, a company wants to focus all its energy on the unique value it brings. Unless your company's product is a calendar booking app, it's a distraction to vibe-code your own alternative to Calendly rather than just paying the subscription cost.
Where in-house vibe coding has its place is in small tools you would never have commissioned custom software for before generative AI, because the return on investment would have been too low. If it costs tens of thousands of dollars to build a tool that, realistically, only saves you a thousand bucks each year, that project will not happen. If generative AI drops the cost to a few hundred dollars, real savings can be had, though, in the grand scheme of things, they're still small.
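A quick back-of-the-envelope calculation with illustrative numbers makes the point:

```python
annual_savings = 1_000            # what the little tool saves per year
traditional_build_cost = 30_000   # assumption: a quote for custom software
vibe_coded_build_cost = 300       # assumption: a weekend plus an AI subscription

print(traditional_build_cost / annual_savings)  # 30.0 years to break even: never happens
print(vibe_coded_build_cost / annual_savings)   # 0.3 years to break even: fine, but a small prize
```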
Instead of chasing such small gains, it's better to focus on putting AI to use where it can drastically enhance your unique value proposition. What is it that only you can do, and how can AI give you a boost there?
It's Not (So Much) About Data
Way before the current hype cycle involving generative AI, and the accompanying disillusionment, we already knew that the vast majority of machine learning (ML) projects failed, with some sources putting the number at 85% (Why 85% of Machine Learning Projects Fail - How to Avoid This)
The most dominant failure mode for this previous era of ML projects was a lack of quality data. After all, you can't train a quality ML model on insufficient input data. Garbage in, garbage out.
Interestingly, I don't see this failure mode as relevant for generative AI projects, for a few reasons:
The foundation models (ChatGPT, Claude, Llama, etc.) have been trained on so much data that they've already seen it all.
By their very nature, these models are quite robust to high variance in their inputs. Try talking to ChatGPT while making lots of spelling mistakes; it will still understand you.
You could even harness generative AI in the data cleaning process, as a first step in a larger workflow (a small sketch follows below).
Note: This point is specifically for applications of generative AI. In training a foundation model, the quality of input data matters very much.
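Here's a hedged sketch of that data-cleaning idea, again assuming the OpenAI SDK; the record format, the fields, and the model name are illustrative only:

```python
import json
from openai import OpenAI

client = OpenAI()

def clean_record(raw: str) -> dict:
    # Ask the model to standardize one messy CRM entry into structured JSON.
    prompt = (
        "Standardize this customer record into JSON with the keys "
        "'company', 'city', and 'country'. Fix obvious typos.\n\n" + raw
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

print(clean_record("ACME Gmbh, berlin, germnay"))
```

Spot-check the results, of course: it's a first pass in a larger workflow, not a substitute for validation.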
All this to say: if you're holding back on exploring uses of generative AI in your organization because you feel you don't have high-quality data, you might be worried for no reason. Unless your data is in such poor shape that neither AI nor human experts could make sense of it, it's likely good enough to see a positive return on investment.
PS: If you want to, or know someone who wants to, explore further whether your organization has solid, no-empty-promises-hype uses for AI, reach out for a free initial consultation! Just hit reply to this email or find us on our website https://www.aicelabs.com
Measuring Soft Outcomes
We've previously touched on the importance of objective evaluations when looking at an AI model's outputs. It's just as important to be objective about the project outcome itself. Otherwise, we risk going purely by gut feel.
Going all the way back to the initiative's inception, what was the needle we wanted to move?
Maybe it's an eminently quantifiable goal. Task X used to take 2 hours. Now it only takes 10 minutes.
Or it's an objective quality goal: Manual review was missing, on average, 5% of issues. Now we're only missing 1%.
However, goals can be softer: "enhancing employee satisfaction" is an excellent goal. Soft goals can be harder to measure, but it's not impossible. Even for the softest goals, you have a picture in your mind of what success would look like, or at least a sense of what's bothering you about the status quo. If that weren't the case, you wouldn't have a problem in the first place: if an outcome can't be measured, you might as well declare it achieved.
Sticking with the "employee satisfaction" example, let's assume you've noticed low employee satisfaction. So what? Well, maybe it leads to high turnover. And that's certainly something we can measure. Or it leads to lots of employees coming to their manager with complaints. That, too, can be measured. Whatever it is about employee satisfaction that's bothering you would have to manifest itself in some observable way. And if it can be observed, it can be measured.
So if you've determined that an annoying but necessary task leads to low employee satisfaction to the point that you want to do something about it, and you suspect that automating that task should help, you can then put the correct measures and objectives into place: The overall objective becomes, say, "reduce employee turnover by x%" or "x% fewer complaints to managers" (but be careful with the latter one... an easy way to achieve that metric is for managers to punish those who complain)
In any case, identifying the real goal of any project or initiative and tying it to a measurable outcome immensely clarifies what success looks like to anyone involved. It also moves the conversation to a more helpful place: If I know the ultimate goal, I can confidently make many subordinate decisions and trade-offs. How accurate does the model have to be? How fast? How much should we invest in a snappy, polished user interface?
Conversely, if there is no real goal other than a top-down mandate to "do something with AI", it's easy to see how none of the stakeholders would ever be able to align. Such an initiative cannot succeed. It'd be like playing golf on a course with no holes.
We've been here before
With all the recent news about the disillusionment that's setting in about generative AI, I'm wondering how AI initiatives compare to other initiatives. I'm sure those initiatives experience plenty of failure, too, and AI isn't that special.
There are undoubtedly failure modes unique to AI work. In the past, it was a lack of high-quality data. For generative AI, where data requirements can be significantly lower, it could be the lack of good, unambiguous evaluations.
But even when data is plentiful and evaluations are good, an AI initiative can stall and end up in what I'll call proof-of-concept purgatory if it just doesn't turn out to be all that useful. Now, why would that happen? Plenty of reasons:
The problem shouldn't have been tackled with AI in the first place.
The "problem" is non-existent, so nobody will use the solution.
The solution wasn't built with a tight user-focused feedback loop, so while it's generally going in the right direction, it still misses the mark.
The solution wasn't integrated into the larger ecosystem where it's being deployed.
These reasons are not unique to AI. To avoid these issues, follow these two principles:
Begin with the end in mind.
Release early and iterate with lots of feedback rather than planning everything out from the beginning.
That might sound contradictory: how can we start with the end in mind and still iterate and adjust? The key is that beginning with the end in mind means clearly understanding what success looks like, not prescribing in detail how we'll get there. The clear vision of the final outcome guides the iterations (and keeps them from running in circles).
With just those two principles, simple to understand but hard to put into practice, your project, whether it uses AI or not, has a much higher chance of success.
The AI Trough of Disillusionment
A scathing article in the Business Section of The Economist (Welcome to the AI trough of disillusionment) states that, "Tech giants are spending big, but many other companies are growing frustrated."
I can't say I'm surprised. Many of the past articles (browse the full archive at https://aicelabs.com/articles) here discuss the pitfalls of undertaking an AI initiative, whether that's building a bespoke tool or onboarding an off-the-shelf solution.
In a way, this article is vindication for our stance at AICE Labs that AI needs to be done right, from end-to-end, with clear outcomes to evaluate against. It's not enough to download a "10 prompts that will hypercharge your organization" article and call it a day. Neither is it enough to make a top-down decree to do something with this AI stuff, no matter how much money gets thrown at it.
There is no simple solution—and, for that matter, no complex big-consultancy-style 17-step process either—that will guarantee success with any project. Throw the massive product risk of an AI initiative into the mix and stir in overhyped promises from influencers and you have the perfect recipe for disappointment.
We've been here before. Machine learning went through several cycles of hype followed by disillusionment, and it's no surprise that the cycle repeats anew. What can we do? I hope this newsletter does a small part in shedding light on some of the pitfalls, and I'll expand on a few of them in the next little while.
All this to say: It doesn't have to be like this. It's painful to see so much effort and hard work go to waste and lead to disappointing outcomes for users and businesses, where expensive projects lead to nothing more than a proof-of-concept gathering dust in some forgotten cloud folder. And it's this pain that drives us at AICE Labs to dig deeper into what it takes to deliver outcomes rather than code.
AI Affordances
If you’ve driven a car lately, you’ve noticed that they just don’t have that many buttons any more. Instead, most functionality is accessed in nested layers of touch-screen menus. Want to raise the temperature for the rear seats? Tap for the climate menu, tap for rear settings, tap tap tap tap tap to increase it by five degrees.
The main problem is that this poses a road hazard. A secondary issue is that it obscures from the user any clear indication of what the system can do for you. In UI/UX (User Interface/User Experience) speak, the features of such a car have low discoverability because they lack affordances.
Affordances are the characteristics or properties of an object that suggest how it can be used. — What are Affordances?
I’ve been in hotel showers where I struggled longer than I’d care to admit figuring out how to turn the damn thing on. No indication what should be turned, twisted, pulled, or pushed. All sacrificed to “fancy” design.
Which, as always, brings us to AI systems.
When they are purely chat-based, they have no signifiers or affordances hinting at their capabilities: ChatGPT will draw you an image if you ask it to. Claude can’t do that. But you wouldn’t know that from just looking at either of them.
This, in turn, limits the amount of use the typical user (i.e., not a power-user who reads every “top 10 secret prompts to unleash your whatever” article out there) gets out of the tool. That’s mildly annoying for OpenAI, Anthropic, and co, but their tools get enough attention, and are sufficiently embedded in the zeitgeist, that over time everyone “gets it”.
If, on the other hand, you’re building and rolling out a custom tool for an internal workflow to supercharge your team’s capabilities, you need to account for how users will learn about and discover its capabilities. Luckily, you’ve got options:
Go full UI
UIs have buttons and other control elements with well-defined and understood behaviour. Buttons are for clicking, toggles are for toggling, and dropdown menus present more options. At their leisure, users can look at everything in the UI and get a full picture of the tool’s capabilities, much like they’d do with a physical system (such as a car that still has traditional buttons…).
If you have buttons, toggles, and menu options for everything you want people to accomplish with your tool, they’ll figure it out.
Document everything
Command-line tools live at the opposite end of the spectrum. They have no affordances whatsoever. You must know which command to type and how to pass in the right options and flags. None of it is there staring at you.
Instead, such tools come with comprehensive documentation about every subcommand and every variation you can use to control the behavior of the program. Check out the Git - Reference if you ever have to escape insomnia.
In the case of an AI tool, you’d want to list concrete examples of all the workflows and queries you built the tool for, explaining any caveats and limitations, the preferred format for passing additional instructions or data, and an explanation of what is (and isn’t) to be expected from the response.
Or something in between
Likely, you’ll combine some affordances in the user interface and comprehensive documentation for the rest. The documentation is there as the ground truth of the tool’s capabilities, and the UI elements facilitate a smooth flow for the everyday use of the tool.
Just don’t hide everything away in a misguided attempt to “simplify”. People want their car buttons back, and they don’t want to go hunting many layers deep to get their work done.
AI Tasks: Context and Open-Endedness
How do you know a task is a good candidate for AI automation?
The most accurate way to answer that is to go and build the AI tool. But let’s assume we don’t want to jump headfirst into it, because there’s time and money at stake.
We want to weed out those tasks where we wouldn’t expect current AI to have a fighting chance. For that, we can draw up a framework that looks at how well-defined the task’s outcome is and how much it depends on an overall context, leading us to a 2x2 matrix. (Can’t do consulting without sprinkling those matrices around every once in a while, after all.)
Let’s go through them:
Simple Automation
Tasks that do not require a lot of context and have narrowly defined success criteria are good candidates for simple automation. They might not even need any machine learning or AI. Or, if they do, they will be straightforward to implement, with AI engineering mainly focused on finding the right model prompt and processing the output back.
Examples
Summarizing a news article
Small-scale code refactorings (“Change this Python dictionary into a dataclass”)
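That second example, spelled out (names are illustrative):

```python
from dataclasses import dataclass

# Before: a loosely structured dictionary
user = {"name": "Ada", "email": "ada@example.com", "active": True}

# After: the same data as a typed dataclass
@dataclass
class User:
    name: str
    email: str
    active: bool = True

user = User(name="Ada", email="ada@example.com")
```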
Precision Automation
Some tasks have a clearly and narrowly defined outcome but require a lot of context, and that context might be implicit instead of easily passed to the AI tool. To handle such tasks with AI, you need a way to provide the appropriate context, which means a lot of data engineering behind the scenes and, before any work on the actual tool can begin, “downloading” the implicit knowledge from the subject matter experts. This is also where various retrieval-based techniques (basically, intelligent search for source material) play an important role.
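The core of those retrieval-based techniques fits in a few lines. Here’s a minimal sketch, where `embed` stands in for whatever embedding model you use to turn text into vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, docs: list[str], embed, top_k: int = 3) -> list[str]:
    # embed(text) -> np.ndarray, supplied by you (e.g. an embeddings API)
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]

# The retrieved snippets then go into the prompt as context for the narrow task.
```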
Examples
Reviewing a legal document and flagging problematic clauses. What counts as problematic depends on the context, but once that context is defined, it’s a narrowly defined task.
Implementing a straightforward feature in a codebase while adhering to the company’s coding guidelines and following their chosen practices.
Creative Exploration
Moving on to the two “high open-endedness” quadrants, let’s first define what we mean by that. We define open-endedness as an inability to state a universally accepted definition of done. Or, in short, you can’t tell in advance what a good solution to an open-ended task looks like, but you’ll know it when you see it. With a narrow task, you could let the tool judge whether the task was completed. With an open-ended task, you have to be the final judge.
If the task requires such open-endedness but does not require much context, there’s a good chance existing off-the-shelf generative AI tools are just what you need. ChatGPT, Claude, and co for text, Midjourney for images, Runway for videos, and countless more for bespoke requirements (company logos, marketing copy, etc.).
Example
Creating a visual to go with your blog post. Context dependence is low (paste in the blog post), but you must iterate over several variations to find something you like.
Guided Intelligence
Leaving the hardest nut to crack for last. A highly open-ended task that also requires intimate knowledge of your unique context. This combines the challenges of all the previous approaches. We’ll need lots of prep work to get the correct data (i.e., context) into the system. We also need intuitive interfaces that let you seamlessly iterate and refine so that you can explore solutions in a fruitful direction.
Example
Generating marketing copy that takes brand voice, target audience, corporate strategy, and legal requirements into consideration.
Why it matters
You’ll want to know what task you’re dealing with when choosing what AI system (if any) to build. For example, if you try to develop a “fire-and-forget” system for a highly open-ended task, you’ll waste a lot of time trying to find that one magical prompt that gets the AI to give you the perfect outcome.
Pick the simplest approach for the problem you have, but not simpler.
Making Users Awesome
There’s a great book by Kathy Sierra: Badass: Making Users Awesome. Aimed at product managers, it makes the compelling point that nobody wants your tool; they want the outcomes the tool enables. To create loyal, even raving, fans of your product, you should therefore build an entire ecosystem around helping people achieve these outcomes.
The book was written well before the recent generative AI wave, and so it focuses on pieces like creating tutorials around common use cases, and, generally, asking yourself how to optimize the tool so users can get to the outcomes they want.
But with AI, I can easily think of a novel way to make users awesome: by providing a natural language interface to the tool’s more advanced, obscure, or hard-to-get-right features. It doesn’t even have to do everything autonomously. It could be as simple as, “Tell me what you want to do, and I’ll look over your shoulder and guide you along,” so that the user even learns something.
I can immediately think of a few examples:
Loading your vacation pictures into Lightroom or Photoshop, you decide they don’t convey the relaxed summertime feeling you experienced. You ask the AI to work with you on enhancing them, and it walks you through some colour curve adjustments.
You open an Excel sheet with your department’s sales figures for the year. You want to achieve a certain visualization for a report, but aren’t convinced by the standard options. You ask the AI what could be done and it suggests some groupings, aggregations, and charts that could get you there.
In general, any software with lots of knobs to turn and tweak would be a great candidate here. There’s often a mismatch between reading the documentation, which outlines each feature in isolation, and how you’d actually use that feature.
As a concrete example, here’s Adobe’s official documentation on what happens if you set a layer’s blend mode to Overlay:
Multiplies or screens the colors, depending on the base color. Patterns or colors overlay the existing pixels while preserving the highlights and shadows of the base color. The base color is not replaced, but mixed with the blend color to reflect the lightness or darkness of the original color.
🤷♂️🤔❓
Now here’s a cool trick in Photoshop to instantly make the colours of your photo “pop” more:
Load your photo
Duplicate the only layer (which holds your photo)
Set its blend mode to “Overlay”
Adjust that layer’s opacity to control the strength of the effect
You learn these tricks by watching tons of tutorials on YouTube, or, these days, you could ask the AI. And if there’s a bespoke one built right into the tool, all the better.
I’m sure if you’re using industry-specific specialized tools, you can think of great examples where an intelligent AI assistant would give “normal” users superpowers.