Measuring Soft Outcomes
We've previously touched on the importance of objective evaluations when looking at an AI model's outputs. It's just as important to be objective about the project outcome itself. Otherwise, we risk going purely by gut feel.
Going all the way back to the initiative's inception, what was the needle we wanted to move?
Maybe it's an eminently quantifiable goal. Task X used to take 2 hours. Now it only takes 10 minutes.
Or it's an objective quality goal: Manual review was missing, on average, 5% of issues. Now we're only missing 1%.
However, goals can be softer: "enhancing employee satisfaction" is a good example. Those goals can be harder to measure, but it's not impossible. For even the softest goals, you have a picture in your mind of what success would look like, or at least a sense of what's bothering you about the status quo. If that weren't the case, you wouldn't have a problem: if an outcome can't be measured, you might as well declare it achieved.
Sticking with the "employee satisfaction" example, let's assume you've noticed low employee satisfaction. So what? Well, maybe it leads to high turnover. And that's certainly something we can measure. Or it leads to lots of employees coming to their manager with complaints. That, too, can be measured. Whatever it is about employee satisfaction that's bothering you would have to manifest itself in some observable way. And if it can be observed, it can be measured.
So if you've determined that an annoying but necessary task leads to low employee satisfaction, to the point that you want to do something about it, and you suspect that automating that task should help, you can then put the right measures and objectives in place. The overall objective becomes, say, "reduce employee turnover by x%" or "x% fewer complaints to managers" (but be careful with the latter one... an easy way to hit that metric is for managers to punish those who complain).
In any case, identifying the real goal of any project or initiative and tying it to a measurable outcome immensely clarifies what success looks like to anyone involved. It also moves the conversation to a more helpful place: If I know the ultimate goal, I can confidently make many subordinate decisions and trade-offs. How accurate does the model have to be? How fast? How much should we invest in a snappy, polished user interface?
Conversely, if there is no real goal other than a top-down mandate to "do something with AI", it's easy to see how none of the stakeholders would ever be able to align. Such an initiative cannot succeed. It'd be like playing golf on a course with no holes.
We've been here before
With all the recent news about the disillusionment that's setting in about generative AI, I'm wondering how AI initiatives compare to other initiatives. I'm sure those initiatives experience plenty of failure, too, and AI isn't that special.
There are undoubtedly several failure modes unique to AI work. In the past, it was a lack of high-quality data. For generative AI, where data requirements can be significantly lower, it could be the lack of good, unambiguous evaluations.
But even when data is plentiful and the evaluations are good, an AI initiative can stall and end up in what I'll call proof-of-concept purgatory if it just doesn't turn out to be all that useful. Now, why would that happen? Plenty of reasons:
The problem shouldn't have been tackled with AI in the first place.
The "problem" is non-existent, so nobody will use the solution.
The solution wasn't built with a tight user-focused feedback loop, so while it's generally going in the right direction, it still misses the mark.
The solution wasn't integrated into the larger ecosystem of where it's being deployed.
These reasons are not unique to AI. To avoid these issues, follow these two principles:
Begin with the end in mind.
Release early and iterate with lots of feedback rather than planning everything out from the beginning.
That might sound contradictory: how can we start with the end in mind and still iterate and adjust? The key is that beginning with the end in mind means clearly understanding what success looks like, not prescribing in much detail how we'll get there. The clear vision of the final outcome guides the iterations (and keeps them from running in circles).
With just those two simple-to-understand (but hard-to-put-into-practice) principles, your project, whether it uses AI or not, has a much higher chance of success.
The AI Trough of Disillusionment
A scathing article in the Business Section of The Economist (Welcome to the AI trough of disillusionment) states that, "Tech giants are spending big, but many other companies are growing frustrated."
I can't say I'm surprised. Many of the past articles (browse the full archive at https://aicelabs.com/articles) here discuss the pitfalls of undertaking an AI initiative, whether that's building a bespoke tool or onboarding an off-the-shelf solution.
In a way, this article is vindication for our stance at AICE Labs that AI needs to be done right, from end-to-end, with clear outcomes to evaluate against. It's not enough to download a "10 prompts that will hypercharge your organization" article and call it a day. Neither is it enough to make a top-down decree to do something with this AI stuff, no matter how much money gets thrown at it.
There is no simple solution—and, for that matter, no complex big-consultancy-style 17-step process either—that will guarantee success with any project. Throw the massive product risk of an AI initiative into the mix and stir in overhyped promises from influencers and you have the perfect recipe for disappointment.
We've been here before. Machine Learning went through several cycles of hype followed by disillusionment, and it's no surprise that the cycle repeats anew. What can we do? I hope this newsletter does a small part in shedding light on some of the pitfalls, and I'll expand on a few of them over the next little while.
All this to say: It doesn't have to be like this. It's painful to see so much effort and hard work go to waste and lead to disappointing outcomes for users and businesses, where expensive projects lead to nothing more than a proof-of-concept gathering dust in some forgotten cloud folder. And it's this pain that drives us at AICE Labs to dig deeper into what it takes to deliver outcomes rather than code.
AI Affordances
If you’ve driven a car lately, you’ve noticed that they just don’t have that many buttons any more. Instead, most functionality is accessed in nested layers of touch-screen menus. Want to raise the temperature for the rear seats? Tap for the climate menu, tap for rear settings, tap tap tap tap tap to increase it by five degrees.
The main problem is that this poses a road hazard. A secondary issue is that it obscures from the user any clear indication of what the system can do for you. In UI/UX (User Interface/User Experience) speak, the features of such a car have low discoverability because they lack affordances.
Affordances are the characteristics or properties of an object that suggest how it can be used. — What are Affordances?
I’ve been in hotel showers where I struggled longer than I’d care to admit figuring out how to turn the damn thing on. No indication what should be turned, twisted, pulled, or pushed. All sacrificed to “fancy” design.
Which, as always, brings us to AI systems.
When they are purely chat-based, they have no signifiers or affordances hinting at their capabilities: ChatGPT will draw you an image if you ask it to. Claude can’t do that. But you wouldn’t know that from just looking at either of them.
This, in turn, limits the amount of use the typical user (i.e., not a power-user who reads every “top 10 secret prompts to unleash your whatever” article out there) gets out of the tool. That’s mildly annoying for OpenAI, Anthropic, and co, but their tools get enough attention, and are sufficiently embedded in the zeitgeist, that over time everyone “gets it”.
If, on the other hand, you’re building and rolling out a custom tool for an internal workflow to supercharge your team’s capabilities, you need to account for how users will learn about and discover its capabilities. Luckily, you’ve got options:
Go full UI
UIs have buttons and other control elements with well-defined and understood behaviour. Buttons are for clicking, toggles are for toggling, and dropdown menus present more options. At their leisure, users can look at everything in the UI and get a full picture of the tool’s capabilities, much like they’d do with a physical system (such as a car that still has traditional buttons…).
If you have buttons, toggles, and menu options for everything you want people to accomplish with your tool, they’ll figure it out.
Document everything
Command-line tools live at the opposite end of the spectrum. They have no affordances whatsoever. You must know which command to type and how to pass in the right options and flags; none of it is there staring at you.
Instead, such tools come with comprehensive documentation about every subcommand and every variation you can use to control the behavior of the program. Check out the Git - Reference if you ever have to escape insomnia.
In the case of an AI tool, you’d want to list concrete examples of all the workflows and queries you built the tool for, explaining any caveats and limitations, the preferred format for passing additional instructions or data, and an explanation of what is (and isn’t) to be expected from the response.
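As a toy illustration of documentation-as-affordance, here is a sketch of how a hypothetical command-line AI tool might expose its supported workflows and caveats entirely through its help text. The tool name, workflows, and flags are all invented for this example:

```python
import argparse

# Hypothetical CLI for an internal AI contract-review tool. With no UI to
# discover, the --help output carries all the affordances: supported
# workflows, expected inputs, and known limitations.
parser = argparse.ArgumentParser(
    prog="contract-review",
    description=(
        "AI-assisted contract review. Supported workflows: flagging "
        "problematic clauses, summarizing obligations, drafting "
        "plain-language explanations."
    ),
    epilog=(
        "Caveats: English-language contracts only; the model may miss clause "
        "types it has not seen before, so a human should always review the output."
    ),
)
parser.add_argument("document", help="Path to the contract (plain text or PDF).")
parser.add_argument(
    "--workflow",
    choices=["flag-clauses", "summarize", "explain"],
    default="flag-clauses",
    help="Which supported workflow to run.",
)
parser.add_argument(
    "--guidelines",
    help="Optional path to your review guidelines, passed to the model as extra context.",
)

if __name__ == "__main__":
    args = parser.parse_args()
    print(f"Running {args.workflow} on {args.document}")
```

Running the tool with --help is the entire discovery story here, which is exactly why the documentation has to be exhaustive.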
Or something in between
Likely, you’ll combine some affordances in the user interface and comprehensive documentation for the rest. The documentation is there as the ground truth of the tool’s capabilities, and the UI elements facilitate a smooth flow for the everyday use of the tool.
Just don’t hide everything away in a misguided attempt to “simplify”. People want their car buttons back, and they don’t want to go hunting many layers deep to get their work done.
AI Tasks: Context and Open-Endedness
How do you know a task is a good candidate for AI automation?
The most accurate way to answer that is to go and build the AI tool. But let’s assume we don’t want to jump headfirst into it, because there’s time and money at stake.
We want to weed out those tasks where we wouldn’t expect current AI to have a fighting chance. For that, we can draw up a framework that looks at how well-defined the task’s outcome is and how much it depends on an overall context, leading us to a 2x2 matrix. (Can’t do consulting without sprinkling those matrices around every once in a while, after all.)
Let’s go through them.
Simple Automation
Tasks that do not require a lot of context and have narrowly defined success criteria are good candidates for simple automation. They might not even need machine learning or AI. Or, if they do, they will be straightforward to implement, with the AI engineering mainly focused on finding the right prompt and processing the model’s output.
Examples
Summarizing a news article
Small-scale code refactorings (“Change this Python dictionary into a dataclass”)
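For a sense of how little engineering the first example needs, here is a minimal sketch of a news-article summarizer, assuming an OpenAI-style chat API; the model name and prompt wording are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def summarize_article(article_text: str) -> str:
    """Summarize a news article in three concise bullet points."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model would do
        messages=[
            {"role": "system", "content": "Summarize the article in three concise bullet points."},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content
```

Most of the "AI engineering" here is deciding what the system prompt should say and what to do with the text that comes back.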
Precision Automation
Some tasks have a clearly and narrowly defined outcome but require a lot of context, and that context might be implicit instead of easily passed to the AI tool. To handle such tasks with AI, you need a way to provide the appropriate context, which means a lot of data engineering behind the scenes and, before any work on the actual tool can begin, “downloading” the implicit knowledge from the subject matter experts. This is also where various retrieval-based techniques (basically, intelligent search for source material) play an important role.
Examples
Reviewing a legal document and flagging problematic clauses. What counts as problematic depends on the context, but once that context is defined, it’s a narrowly defined task.
Implementing a straightforward feature in a codebase while adhering to the company’s coding guidelines and following their chosen practices.
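To make the retrieval idea concrete, here is a rough sketch of the first example: pull the guideline snippets most relevant to a clause into the prompt before asking the model to judge it. It assumes the same OpenAI-style API as above; the model names are placeholders, and a real pipeline would add chunking, caching, and proper evaluation.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Turn text into a vector; the embedding model is a placeholder."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def retrieve(query: str, guideline_snippets: list[str], top_k: int = 3) -> list[str]:
    """Return the guideline snippets most similar to the query (cosine similarity)."""
    q = embed(query)
    scored = []
    for snippet in guideline_snippets:
        s = embed(snippet)
        scored.append((float(q @ s / (np.linalg.norm(q) * np.linalg.norm(s))), snippet))
    scored.sort(reverse=True)
    return [snippet for _, snippet in scored[:top_k]]

def review_clause(clause: str, guideline_snippets: list[str]) -> str:
    """Ask the model whether a clause is problematic, given retrieved guidelines."""
    context = "\n".join(retrieve(clause, guideline_snippets))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": f"You review contract clauses against these guidelines:\n{context}"},
            {"role": "user", "content": f"Is this clause problematic, and why?\n{clause}"},
        ],
    )
    return response.choices[0].message.content
```

The hard part is not this code; it is getting the guidelines out of the experts’ heads and into that list of snippets in the first place.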
Creative Exploration
Moving on to the two “high open-endedness” quadrants, let’s first define what we mean by that. We define open-endedness as an inability to state a universally accepted definition of done. Or, in short, you can’t tell in advance what a good solution to an open-ended task looks like, but you’ll know it when you see it. With a narrow task, you could let the tool judge whether the task was completed. With an open-ended task, you have to be the final judge.
If the task requires such open-endedness but does not require much context, there’s a good chance existing off-the-shelf generative AI tools are just what you need. ChatGPT, Claude, and co for text, Midjourney for images, Runway for videos, and countless more for bespoke requirements (company logos, marketing copy, etc.).
Example
Creating a visual to go with your blog post. Context dependence is low (paste in the blog post), but you must iterate over several variations to find something you like.
Guided Intelligence
Leaving the hardest nut to crack for last: a highly open-ended task that also requires intimate knowledge of your unique context. This combines the challenges of all the previous quadrants. We’ll need lots of prep work to get the right data (i.e., context) into the system. We also need intuitive interfaces that let you seamlessly iterate and refine, so that you can explore solutions in a fruitful direction.
Example
Generating marketing copy that takes brand voice, target audience, corporate strategy, and legal requirements into consideration.
Why it matters
You’ll want to know what task you’re dealing with when choosing what AI system (if any) to build. For example, if you try to develop a “fire-and-forget” system for a highly open-ended task, you’ll waste a lot of time trying to find that one magical prompt that gets the AI to give you the perfect outcome.
Pick the simplest approach for the problem you have, but not simpler.
Making Users Awesome
There’s a great book by Kathy Sierra: Badass: Making Users Awesome. Aimed at product managers, it makes the compelling point that nobody wants your tool; they want the outcomes the tool enables. To create loyal, even raving, fans of your product, you should therefore build an entire ecosystem around helping people achieve these outcomes.
The book was written well before the recent generative AI wave, and so it focuses on pieces like creating tutorials around common use cases, and, generally, asking yourself how to optimize the tool so users can get to the outcomes they want.
But with AI, I can easily think of a novel way to make users awesome: by providing a natural language interface to the tool’s more advanced, obscure, or hard-to-get-right features. It doesn’t even have to do everything autonomously. It could be as simple as, “Tell me what you want to do, and I’ll look over your shoulder and guide you along,” so that the user even learns something.
I can immediately think of a few examples:
Loading your vacation pictures into Lightroom or Photoshop, you decide they don’t convey the relaxed summertime feeling you experienced. You ask the AI to work with you on enhancing them, and the AI walks you through some colour-curve adjustments.
You open an Excel sheet with your department’s sales figures for the year. You want to achieve a certain visualization for a report, but aren’t convinced by the standard options. You ask the AI what could be done and it suggests some groupings, aggregations, and charts that could get you there.
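A minimal sketch of that "look over your shoulder" idea: hand the model the user's goal plus a short catalogue of features it is allowed to point to, and ask for step-by-step guidance. The feature list is invented for illustration, and the API is the same OpenAI-style placeholder as in earlier sketches.

```python
from openai import OpenAI

client = OpenAI()

# Invented catalogue of features the assistant may point users to.
FEATURES = """
- Curves adjustment layer: fine-grained control over tone and colour
- Duplicate layer + blend modes: quick contrast and saturation changes
- Layer opacity: controls the strength of any layer-based effect
"""

def guide(user_goal: str) -> str:
    """Turn a plain-language goal into step-by-step guidance using known features."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a photo-editing coach. Suggest only steps that use "
                    f"the following features, and explain each step briefly:\n{FEATURES}"
                ),
            },
            {"role": "user", "content": user_goal},
        ],
    )
    return response.choices[0].message.content

print(guide("My vacation photos feel flat; I want a warm, relaxed summer look."))
```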
In general, any software with lots of knobs to turn and tweak would be a great candidate here. There’s often a mismatch between reading the documentation, which outlines each feature in isolation, and how you’d actually use that feature.
As a concrete example, here’s Adobe’s official documentation on what happens if you set a layer’s blend mode to Overlay:
Multiplies or screens the colors, depending on the base color. Patterns or colors overlay the existing pixels while preserving the highlights and shadows of the base color. The base color is not replaced, but mixed with the blend color to reflect the lightness or darkness of the original color.
🤷♂️🤔❓
Now here’s a cool trick in Photoshop to instantly make the colours of your photo “pop” more:
Load your photo
Duplicate the only layer (which holds your photo)
Set its blend mode to “Overlay”
Adjust that layer’s opacity to control the strength of the effect
You learn these tricks by watching tons of tutorials on YouTube, or, these days, you could ask the AI. And if there’s a bespoke one built right into the tool, all the better.
I’m sure if you’re using industry-specific specialized tools, you can think of great examples where an intelligent AI assistant would give “normal” users superpowers.
AI Doesn’t Learn
AI doesn’t learn. That might sound contradictory. Isn’t it called machine learning? What I’m talking about here concerns large language models (LLMs). In their creation, learning takes place. But once they’re done, they remain static and unchanged, unless the creators deliberately add on additional training runs.
It’s a bit like the main protagonist in Christopher Nolan’s Memento or, if you’re in the mood for lighter fare, Adam Sandler’s 50 First Dates. The characters retain their long-term memories from before a traumatic incident but cannot form new ones, and their short-term memories regularly reset.
Why it matters
If we want to treat AI like a capable personal assistant, we’d like it to learn and improve at its tasks over time. We want to avoid repeating ourselves when we give feedback.
Current AI systems have to rely on emulating learning. There’s ChatGPT’s memory function, for example. During the conversation, it will identify and store salient information about you and your preferences in a database. And then, behind the scenes, whenever you ask it a question, it’ll use that information to provide additional context to the underlying language model.
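Here is a toy sketch of that pattern: store salient facts between conversations and prepend them as context on the next question. The file name, storage format, and model are all placeholders; a real implementation decides much more carefully what to remember and when.

```python
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
MEMORY_FILE = Path("memory.json")  # placeholder flat-file store

def load_memories() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def remember(fact: str) -> None:
    """Store a salient fact about the user, e.g. 'prefers concise answers'."""
    memories = load_memories()
    memories.append(fact)
    MEMORY_FILE.write_text(json.dumps(memories))

def ask(question: str) -> str:
    """Answer a question with stored memories injected as extra context."""
    context = "\n".join(f"- {m}" for m in load_memories())
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": f"Known facts about the user:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```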
Higher-level learning and insight
These clever tricks allow the AI tool to build a repository of information that’s highly relevant to what you use it for. However, a higher-level feedback loop around growth and mastery is missing: If you repeatedly assign the same task to a human, they’ll become more proficient and efficient over time. They might even identify shortcuts or ways to automate or eliminate the task altogether. Assign the same task repeatedly to an AI tool, and it’ll work diligently on it the same way it’s always worked. That’s fine for small rote jobs, but it will be a significant handicap for the larger jobs we’d like AI agents to handle.
I’m curious whether simple tricks similar to ChatGPT’s memory will enable a fake-but-useful version of this higher-order learning, or whether we need a completely different AI paradigm for that.
Mechanical Turks Are for Market Risk Only
In recent news, AI startup Builder.ai, formerly Engineer.ai, entered bankruptcy proceedings. But the first time it made headlines was in 2019, when former engineers of the company alleged that its claims of using AI to fully automate app building were exaggerated and that it was using human developers in the background: This AI startup claims to automate app making but actually just uses humans
The first instance of passing a human off as an artificial intelligence was the Mechanical Turk, a “fraudulent chess-playing machine constructed in 1770”. These days, Amazon offers a service with that name, providing cheap human labour for repetitive tasks that cannot yet be easily automated.
Let’s be clear: Faking AI capabilities by using humans and using that to attract customers and investors is fraudulent, even if the plan is to use these investments to build out real AI capabilities.
Market Risk vs Product Risk
However, if you’re merely in the early stages of validating a business idea, it can be a good idea to postpone building out the AI tool and instead simulate it with human labour. But only if the main risk is market risk. If you are supremely confident that you can build the AI part but just aren’t sure whether anyone will pay for the thing, it won’t hurt to test the waters in a cheaper way. But if there’s no guarantee that you can build it, faking it is a dangerous dead end.
That is the trap Builder.ai fell into: all product risk, no market risk. If you don’t need to answer whether people would pay for the real thing, don’t bother faking it.
Always start with this question: What is the most significant uncertainty about our product? And how can we most effectively address it?
How the Sausage is Made
I’ve seen several posts on LinkedIn stating, “Your customers don’t care about technical details. They don’t care if you practice test-driven development, use microservices, write code in C, Java, Python, or Go, or follow Scrum, Kanban, or some other agile framework.”
In the purest sense, this is true. Customers don’t want code; they want the outcomes this code enables. A software tool’s product info page won’t include a checkbox like “100% made with Domain-Driven Design” or “Now with 98% test coverage.”
But in a practical sense, customers do care.
They want their software to be free of bugs.
They want frequent releases of those features they asked for.
If it’s a cloud-based service, they want it to be constantly available, with no noticeable slowdown or hiccups during peak traffic.
They’ve gotten used to patterns and paradigms of behaviour, so your user interface needs to respect and follow that.
They also want to know that their data is secure and won’t be for sale on the dark web the moment they hand it over to your site.
And because customers care about all of these, they implicitly care about you doing your best to get there. We’ve got best practices for a reason, and just because a customer can’t literally tell whether your product was developed one way or another doesn’t mean they won’t suffer (or enjoy) the second-order effects.
The Generalization Trap
One desirable outcome of an AI project is that it’ll save time. But there is a trap here.
The decline of specialist support
Over the last few decades, advances in computers and software have meant that many tasks can now be done by anyone, while before, they’d be done by specialized support staff. Using a computer, anyone can
put together slides for a presentation,
fill out a complete expense report, and
book travel and accommodation for a business trip,
so why bother having graphic designers, administrative assistants or travel agents on staff?
AI tools promise to vastly expand what anyone is capable of. That means the trend of shifting more work away from specialist support roles will likely continue. But unless AI makes that work effortless, we should think twice before performing such a reassignment. Here's why.
The world’s highest-paid assistant
A brilliant professor I knew once lamented that she sometimes felt like “the world’s highest paid administrative assistant” due to the vast amount of administrative work she had to do in addition to her actual duties of supervising graduate students, teaching classes, and conducting award-winning research.
Of course, she’s perfectly capable of using her computer to file her expense reports, book conference travel, fill out this or that form, and whatever else would have been handled by her assistant had the university provided one.
But whenever she’s doing that type of work, she’s not prepping classes, coaching students, or conducting research, which is (you’d assume) what the university hired her to do.
Like universities, most organizations have a core value-creating activity undertaken by a specific type of employee: researchers, software developers, writers, etc. Other roles exist to support them, but they don’t create value themselves. You’d think that, therefore, organizations would try to maximize the amount of value-creating work done by their value-creating employees, and ruthlessly eliminate anything that distracts from that. Instead, they focus on eliminating the supporting roles by shifting the responsibility of the supporting tasks onto the value-creating employees.
Supercharged Support
Instead of using AI to eliminate support roles and have everything handled by your core employees, think about using AI to make your support roles that much more effective. I’m convinced that in most cases this makes the most economic sense. If you’re a research institute, the value you create is directly related to the amount of time your researchers can fully dedicate to research. Anything you can take off their plate that’s not research is an economic win.
The organizations that thrive with AI won't be the ones that eliminate the most roles. They'll be the ones that amplify the right ones.
AI Project Outcomes, Part 2
In a previous post we talked about the desired level of autonomy for your AI project: Some manual work required, or fully autonomous?
Let’s now tackle a different question that’s just as important:
What is the goal of the AI project?
“Duh, it’s to solve problem X.”
Fair enough. But what does it mean to solve it? What does a home run look like? What does the successful solution enable you to do? If we have good answers to this question, we can make the right trade-offs and downstream decisions.
Success could mean
same quality, but lower cost
better quality, at no cost reduction
increased velocity at acceptable loss of quality and acceptable increase in cost
same quality, same cost, but everyone on your team is happier
There’s no right or wrong when it comes to success criteria. But once you’ve picked them, there are right and wrong choices for building the solution. Many projects fail, not just in AI, because such criteria were never established or communicated to all stakeholders.
Before drafting a project proposal, we at AICE Labs work hard with you to uncover this critical question: What does total success look like, and what value does that unlock for you? With those questions answered, we have many knobs to turn and can present a number of options that would get you closer to that goal.
Buy Nice or Buy Twice: Quality Thresholds
Back at my university outdoor club, we’d give interested newcomers this advice when they asked what sort of gear (sleeping bag, boots, hiking pack) to buy: you either buy nice, or you buy twice. Either you buy a sleeping bag that fits into your pack and keeps you warm at night, or you buy a cheap one that’s bulky and cold, and then you buy the better one anyway.
Of course, if you’re just starting out, it’s fine not to go for the ultra-high-end option, but a quality threshold must be met for an item to serve any purpose at all. If a good sleeping bag that keeps you warm costs $200 and a bad one that leaves you shivering costs $50, going for the bad one doesn’t save $150. It wastes $50.
The same goes for the effort invested in a project. It’s a caveat for the 80/20 rule. Just because you can get something to 80% done with just 20% of the total effort, there’s no guarantee that 80% done will deliver 80% of the total value. There might be thresholds below which, regardless of effort or % done, no value is delivered.
Fuzzy Doneness
With traditional software, we know whether it meets a threshold. A feature is either available or not. Even if certain functionality is only partially supported, it’s clear where the lines are drawn.
Once AI gets involved, it gets murkier. Imagine a feature for your email program that automatically extracts action items from emails you receive.
To trust it, you must be convinced it won’t miss anything.
But for the tool’s creators, it’s impossible to guarantee 100% accuracy.
As we’ve seen in a previous article on evals, the tool’s creators will have to set up ways to evaluate how well their tool does at action-item extraction. Then, they’ll need to determine the acceptable trade-off between convenience and accuracy for their users.
Are users okay with missing one out of a hundred action items in return for the efficiency gained?
What about one in ten, or one in a thousand?
To tie it all back to the intro: As long as you’re below the threshold, improvements don’t matter. If the tool only accurately identifies every other task from an email, it’s pretty useless. If it accurately identifies 6 out of 10, that’s still pretty useless.
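As a rough sketch of how the tool’s creators might know where they stand, here is one way to measure that miss rate against a hand-labelled set of emails. The extract_action_items function stands in for whatever the real tool does, and exact string matching is a simplification:

```python
def recall(extract_action_items, labelled_emails) -> float:
    """Fraction of hand-labelled action items the extractor actually finds.

    labelled_emails: list of (email_text, expected_items) pairs.
    extract_action_items: the function under test (a stand-in for the real tool).
    """
    found, total = 0, 0
    for email_text, expected_items in labelled_emails:
        predicted = set(extract_action_items(email_text))  # naive exact matching
        for item in expected_items:
            total += 1
            if item in predicted:
                found += 1
    return found / total if total else 0.0

# Usable only if it clears the threshold users actually need, e.g. "miss at
# most one action item in a hundred":
# usable = recall(extract_action_items, labelled_emails) >= 0.99
```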
Part of any successful AI initiative is getting crystal clear on what makes the outcome genuinely usable. How good does it need to be at the very least?
Your AI Product: How Far Will It Go?
In last week’s post, I introduced some initial thinking you’d want to do before tackling an AI project. In short, figure out the right capabilities based on where AI comes in.
Today, I’ll tackle the next thing you want to get clarity on: what you want the AI tool to accomplish. This goes beyond defining the problem at which you aim the tool. You also need to be clear on how far the tool will push it.
Coffee makers: An Example
I’ve got a fine yet simple espresso maker. I fill the grounds into the basket, attach it to the machine, turn a knob, and wait for delicious espresso to pour out. It has a steam function, so if I want to make a cappuccino, I can use it to froth up some milk.
My mom has a much fancier machine. She turns a dial to select from a number of beverage options, presses a button, and has the machine grind beans, heat and froth milk, and assemble them in just the right quantity and layering for a cappuccino, latte, macchiato, and dozens more.
Either produces a fine beverage, but I undoubtedly have to do more work than my mom. On the other hand, mine was an order of magnitude cheaper and requires no maintenance other than the occasional wipedown.
80/20
So with your AI tool and the problem you’re letting it loose on, you’ll also have to decide whether it requires some hand-cranking by the user or whether it should produce a perfect result autonomously at the press of a button. And just like with coffee makers, there’s an order-of-magnitude difference in cost and complexity:
Enough to just spit out a rough draft or some suggestions that the user then takes and runs with? Easy peasy.
The AI output needs to be near-perfect and only require user intervention in the rarest of cases? Quite a bit harder.
The user won’t even see the AI output and so it needs to be 100% reliable because it’ll feed into the next, unseen, part of the workflow, all to create a final result that needs to be absolutely correct? Now we’re talking absolute cutting edge of AI engineering.
As with many things in life, getting something that’s 80% there can often be achieved with 20% of the effort, and any incremental improvement on top of that will require much more effort.
Your challenge, then, is to find the optimal balance:
Conserve energy and resources and use the simplest approach possible, but
Deliver something complete and usable to your users.
Tackling AI Projects
Right before the (long, in Canada) weekend, let’s start talking about how you can tackle an AI project in a way that does not end up in tears of frustration (or just lack of interest).
I hope to write a number of posts on this topic. For now, let’s start with one big high-level question: How much AI are we talking about?
Little AI, Lots of Other Stuff
It might very well be that AI plays a small but important role in a much larger workflow, where the majority of steps are handled by traditional software or more classical machine learning as opposed to generative AI. An example would be any pre-existing tool that incorporates some AI to speed up certain steps, where the AI isn’t strictly necessary for the product to function. Like a todo-list app with an AI feature that breaks bigger tasks into smaller ones for you. Neat, but not essential.
Lots of AI, Little Other Stuff
Or you have something in mind where the AI does all the heavy lifting, and the surrounding stuff is a minimal skeleton that lets you get to the AI with as little distraction as possible. The extreme examples of this are the chat interfaces to large language models, especially in their original form without extra frills: you get a text box and a send button, and that’s all you need to reach the latest super-powerful LLM.
Why does it matter? Because it determines where our risks lie and where efforts need to be focused. In the first case, we have more software engineering to do and (comparatively) less AI engineering. It’ll still be important, but we won’t have to squeeze as much out of the model as we’d have to do in the other case. If your product is the AI model and little else, it better be really good. In this case, you’ll spend a lot of effort on AI engineering.
Somewhere in the middle
Unless you already have an established non-AI product and want to sprinkle a bit of AI in there, chances are you’re not in the “Little AI” situation. But unless you’re building the next big foundation model, you’re probably also not in the “Little Other Stuff” camp. In that case, you’re looking at a complex workflow that also requires a fair amount of AI, and that AI is not optional. An example here would be a complex industry-specific document review tool. It needs a lot of scaffolding and supporting workflows, data storage, user management, etc., but it also needs an appropriately engineered AI solution.
The right skillset for the right situation
What’s the right team for these scenarios? Here’s my take:
Software Engineers with a bit of AI training will handle Little AI just fine
No need to hire Prompt Engineers for $400,000/year. Just have your engineers take a stab. In this scenario, the AI is likely just one more external tool that gets incorporated, like any other third-party tool they’re already using.
ML Scientists can build the little stuff around their AI models
When it’s all about the model and you just need basic scaffolding to let it shine, your ML Scientists already have a great ecosystem of tools they can use. Gradio, Streamlit, and other tools let you hook an AI model up to a nice web interface in no time.
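For instance, a minimal Gradio sketch puts a text box and a button in front of a model in a handful of lines; summarize here is just a placeholder for whatever your model actually does:

```python
import gradio as gr

def summarize(text: str) -> str:
    # Placeholder for the actual model call.
    return "Summary of: " + text[:100]

# One input box, one output box: just enough scaffolding to let the model shine.
gr.Interface(fn=summarize, inputs="text", outputs="text", title="Summarizer demo").launch()
```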
A mixed team for the complex middle
As much as I’m against artificial skill boundaries (as we’ve unfortunately seen with the Front End / Back End split), I believe that projects that have both complex “traditional” software needs and complex AI model requirements need a mix of two roles. Someone to handle the AI engineering and someone to handle “everything else”. At the very least, you’d want a two-person team with a strong software engineer and a strong AI engineer.
So, that’s it for the start. Once you’ve got an idea of what sort of project you’re looking at and who the right people are for it, you can move on to the next step. More next week!
When to Wait and See
If you ever reach out to us at AICE Labs for a custom product involving AI (say agentic assistance with a tedious workflow, chatting to your data in a way that doesn’t expose it to the outside world, or creating writing and visualizations based on your own data sources), one of the questions we’d ask in our initial discovery call would be: Why now? Why not just wait and see what happens?
AI tools have been evolving quite fast, and rather than paying us the big bucks to develop a custom solution, you could just wait a year and get an off-the-shelf tool to do it for you. A number of startups founded right after the initial GPT-3 release found this out the hard way, when subsequent versions of ChatGPT handled perfectly well what they had built custom tools for. And we’d rather you not suffer buyer’s remorse.
What’s There
If you have a problem where AI could offer a solution, why not check out existing tools first? Would an existing model or tool, maybe with a bit of no-code automation (tools like Zapier, Make, or Pipedream), do the job already? If not, why not?
Maybe you have more stringent privacy requirements and can’t send your data to OpenAI, Anthropic, and co.
You need more sophisticated workflow assistance than a chatbot would give you, and your use case isn’t already handled by other “wrapper” tools.
Or maybe the workflow and all is fine, but the model’s responses just aren’t good enough.
It’s this last case where the most cost-effective solution is to just sit tight and wait for the models to get better, provided the reason for the poor answers lies in raw model power and not in some flaw in how the model is given the right sort of data.
If that does not sound like you and you’re curious about what we can build for you, get in touch.
Right Tool for the Wrong Job
I’m a big fan of Cal Newport’s writings on workplace productivity, and here’s a lesson we can also apply to the world of software:
Email is a much better communication tool than fax, voice memos, or handwritten notes.
Email is, however, terrible for having project-related back-and-forth conversations, with their long subject lines (RE:RE:RE:RE:RE:that thing we talked about), people hitting REPLY ALL in the worst situations imaginable and important points getting lost because two people replied to the same email thread simultaneously.
Messaging tools such as Slack and Teams are arguably much better here. We have channels, threads, and DMs. People can join and leave channels as needed, and the chronological order is easy to follow.
But Cal argues that ideally, work wouldn’t unfold in unstructured back-and-forth communication in the first place. It’s great for quick pings, but not good for driving a discussion to a quick and decisive conclusion, and definitely poison for our attention spans: If work can only proceed if I chime in on dozens of Slack or Teams threads, I have to constantly monitor them. Good luck getting deep work done that way.
Better Ways of Working
When we feel dissatisfied with a tool we’re using, be it a frontend framework, a database, a machine learning system, or a large language model, the initial instinct is to look for a better tool. Sometimes a better tool is indeed the answer, as with email versus fax, and sometimes it’s a distraction.
In these situations, we must take the proverbial step back and reassess things at a higher level. Instead of asking how we could improve the tool, we should ask how we could improve the overall workflow.
Tech Debt Interest Rates
The technical term for the results of lazy-today decisions that create more work later is technical debt, and when it comes to technical debt, there are two extreme camps:
Those who say that all of this worrying about clean code, good architecture, AI model evaluation, etc., slows you down unnecessarily. While we worry about test coverage, they launch ten new features.
Those who insist you’d better do it perfectly today, lest it all come crashing down on you.
Smart investments
The right amount and mix of tech debt lies somewhere between recklessly forging ahead and never getting anything shipped.
Like financial debt, where we distinguish between smart debt that finances future growth and unwise debt that finances excessive consumption, we can differentiate between wise and foolish tech debt:
Smart debt has a low interest rate while enabling massive progress today
Foolish debt has a high interest rate and does not unlock meaningful potential
“Interest rate” refers to how much extra work you’d have to do tomorrow because of your tech choices today. There’s a world of difference between “we didn’t want to spend time to nail down a design system, so we’ll have to go back later and replace hard-coded color values with style variables” versus “we picked a dead-end tech stack for the type of app we’re building so we need to rebuild from scratch”.
Some concepts we’ve come to appreciate at AICE Labs:
Prefer tech debt that you can pay down in small installments over time to debt that requires a big lump-sum payment. You can fit that work in here and there and immediately reap the benefits.
When interest rates are low and a time-sensitive opportunity is at hand, incurring debt is fine, even advised, especially if the opportunity unlocks additional resources you can use to pay down the debt.
Be honest about why you’re incurring tech debt. Are you being strategic or just lazy?
The Right Kind of Lazy
“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.”
― Bill Gates
I was contemplating this quote and felt it was missing something. A lazy person might indeed find an easy way to do a hard job. Automate a tedious task, identify simplifications that allow faster progress, and so forth.
A lazy person might also just do a poor, rushed job and leave us with a large mess to clean up: Writing code that's quick and dirty, rigging up an AI agent with no evaluation methods and guardrails in place. Those sorts of things.
Present Lazy vs Future Lazy
To tap into the engineering brilliance unlocked by the lazy person Bill Gates talks about, we need to be lazy about total effort over time, not just in the present moment:
Too lazy to write unit tests right now vs. too lazy to extensively debug and revisit things that inexplicably break
Too lazy to document code today vs. too lazy to repeatedly explain the system to every new team member
Too lazy to implement security best practices upfront vs. too lazy to recover from data breaches and rebuild customer trust
Through experience, we gain a sixth sense for these trade-offs. And then our inherent laziness becomes a powerful ally in reinforcing good habits.
Prototypes
Imagine getting an architect to design your dream home. They take your input, then return with a beautiful scale model. You are amazed. So amazed, in fact, that you inform them that further services (like getting permits and assembling a team of contractors) won't be necessary as you'll just move in right then and there. 😬
With real-life objects, we have the good sense to tell a model or prototype from the real thing. The scale is wrong, details are missing, or it's clear that the material isn't as durable as the final product should be.
Not so with code. Take a web app. It's all just pixels on the screen anyway. Nobody can discern what horrors lurk beneath as long as the user interface is polished:
Will it hold up to increased traffic?
Can hackers practically walk in through the front door?
If we want to add more features, will that take months due to piles of poor technical choices?
It appears to work, but has it been tested for all edge cases?
Remember these points when someone shows off the app they vibe-coded in a weekend. Maybe it's the real deal. Maybe it's just a cardboard cutout.
The XY Problem
Here's a frustration commonly experienced by novice programmers on the popular Q&A site StackOverflow:
Q: "Hey, I'm trying to do Y, but I'm running into these issues and can't figure out how to do it."
A: "Are you sure you want to do Y? Y sounds weird. You can't do Y anyway. Anyway, here are some things you might try to get Y working."
... many iterations later ...
Q: "None of these work. :-/"
The issue is that the asker really wants to do X. They have determined that one of the required steps to achieve X is Y, so they ask about Y without revealing that, ultimately, they want to achieve X.
An example inspired by [this xkcd comic] would be:
Q: "How can I properly embed Youtube videos in a Microsoft Word document?"
The real question is: "How can I share Youtube videos via email?"
XY AI
Of course, we're all smart here and avoid such ridiculous situations. But when we jump too quickly to an imagined solution, we get stuck trying to solve Y when there'd be a much simpler way to solve X. Especially with AI, where things are not as clear-cut as with standard programming and minor tweaks can lead to large differences in the outcome, we can fall prey to this thinking:
Do you need to fine-tune a large language model on custom data, or do you need to develop a better prompt for the base model and just provide relevant custom data when querying the model?
Do you even need AI? Maybe the answer to "How do I stop this regression model from over-fitting my data?" is to use some standard rules-based programming (see the sketch after this list).
Any question about loading a large language model and hosting it yourself, distributing it over multiple GPUs for parallel processing, and so on, might become moot if you just use a managed service that does all that for you.
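To make the second bullet concrete, here is a toy sketch of the rules-based alternative. The expense-flagging rules are invented for illustration, but the point stands: a handful of explicit thresholds may solve X with nothing to over-fit and nothing to prompt.

```python
def flag_expense(amount: float, category: str, has_receipt: bool) -> bool:
    """Rule-based check that might make a learned model unnecessary.

    The thresholds are hypothetical; the point is that explicit rules can be
    enough, with no training data, no over-fitting, and no model to host.
    """
    if not has_receipt and amount > 50:
        return True
    if category == "entertainment" and amount > 200:
        return True
    return False
```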
Good AI engineering takes a step back and takes time to explore and evaluate multiple possible Ys for the current X. That's how you don't get stuck with a suboptimal approach and loads of frustration.
