Buy Nice or Buy Twice: Quality Thresholds
Back at my university outdoor club, we’d give newcomers this advice when they asked what sort of gear (sleeping bag, boots, hiking pack) to buy: you either buy nice, or you buy twice. Either you buy a sleeping bag that fits into your pack and keeps you warm at night, or you buy a cheap one that’s bulky and cold, and then you buy the better one anyway.
Of course, if you’re just starting out, it’s fine not to go for the ultra-high-end option, but a quality threshold must be met for an item to serve any purpose at all. If a good sleeping bag that keeps you warm costs $200 and a bad one that leaves you shivering costs $50, going for the bad one doesn’t save $150. It wastes $50.
The same goes for the effort invested in a project. It’s a caveat to the 80/20 rule: just because you can get something 80% done with 20% of the total effort, there’s no guarantee that 80% done delivers 80% of the value. There may be thresholds below which, regardless of effort or percentage done, no value is delivered at all.
Fuzzy Doneness
With traditional software, we know whether it meets a threshold. A feature is either available or not. Even if certain functionality is only partially supported, it’s clear where the lines are drawn.
Once AI gets involved, things get murkier. Imagine a feature for your email program that automatically extracts action items from the emails you receive.
To trust it, you must be convinced it won’t miss anything.
But for the tool’s creators, it’s impossible to guarantee 100% accuracy.
As we’ve seen in a previous article on evals, the tool’s creators will have to set up ways to evaluate how well it performs at action-item extraction. Then they’ll need to determine what trade-off between convenience and accuracy is acceptable for their users.
Are users okay with missing one out of a hundred action items in return for the efficiency gained?
What about one in ten, or one in a thousand?
To tie it all back to the intro: as long as you’re below the threshold, improvements don’t matter. If the tool accurately identifies only every other action item in an email, it’s pretty useless. If it accurately identifies 6 out of 10, that’s still pretty useless.
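To make that concrete, here’s a minimal sketch of what such a threshold check could look like. Everything in it is an assumption for illustration: extract_action_items() stands in for the actual AI feature, LABELED_EMAILS stands in for a real hand-labeled eval set, MIN_RECALL is a made-up usability bar, and exact string matching stands in for whatever matching logic a real eval would use.

```python
# Minimal sketch of an eval for action-item extraction (illustrative only).
# Measures recall against a hand-labeled set and compares it to a minimum
# usability threshold.

# Hand-labeled examples: email text paired with the action items it contains.
LABELED_EMAILS = [
    ("Please send the Q3 report by Friday and book a room for Monday.",
     {"send the Q3 report by Friday", "book a room for Monday"}),
    ("Thanks for the update, no action needed.",
     set()),
]

# Hypothetical bar: "miss at most one in a hundred action items."
MIN_RECALL = 0.99


def extract_action_items(email_text: str) -> set[str]:
    # Stand-in for the AI feature under test; a real implementation would
    # call the extraction model here.
    return set()


def evaluate(labeled_emails) -> float:
    """Return recall: the share of labeled action items the tool actually found."""
    found, total = 0, 0
    for email_text, expected in labeled_emails:
        predicted = extract_action_items(email_text)
        found += len(expected & predicted)
        total += len(expected)
    return found / total if total else 1.0


if __name__ == "__main__":
    recall = evaluate(LABELED_EMAILS)
    verdict = "usable" if recall >= MIN_RECALL else "below the usability threshold"
    print(f"Recall: {recall:.2%} -> {verdict}")
```

A real eval would need a much larger labeled set and fuzzier matching (two phrasings of the same action item should count as a hit), but the shape of the decision is the same: pick the threshold first, then measure whether the tool clears it.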
Part of any successful AI initiative is getting crystal clear on what makes the outcome genuinely usable. How good does it need to be at the very least?