Move slow and test things: the key to trusted AI in business

The biggest obstacle preventing AI from making the leap from experiment to production, and how to overcome it

Authors

Richard Skinner, CEO, Phased AI

Artificial intelligence is the last digital transformation of our lifetimes, but the gap between AI in experiments and AI in production feels like a widening chasm.

The rate of abandoned AI projects has risen to 42%, up from 17% last year. That’s from a survey by S&P Global Market Intelligence of 1,000 respondents in companies across North America and Europe.

Some high-profile AI champions have had to publicly walk back their plans. A week after announcing that Duolingo would become an "AI-first" company, its CEO Luis von Ahn reversed his intention to phase out contractors wherever AI could do their jobs better. Similarly, Klarna rowed back on replacing customer support with a chatbot after the tech led to a massive jump in unresolved tickets.

It’s worth saying that the high failure rate identified in the S&P data isn’t stopping companies from experimenting. Most are still investing in generative AI, but the stumbling block seems to be firmly wedged between proof-of-concept phase and actual production.

In my experience, many experiments get as far as concluding "this kind of works, 85% of the time". Business users are very interested in the promise of AI, but is it any surprise that many are reluctant to take the next step and commit to more extensive projects when the risk of error or hallucination is still high?

Rapid evolution = risk

I have a theory about why this is happening: the technology is evolving so fast that it can feel like we’re in constant catch-up mode. It’s one thing to successfully launch something that works in experiment mode. It’s another to put it to work in the business – and to trust it with your customers’ and your own data.

Because fundamentally, that’s what it comes down to: a question of trust.

And for that, you still need people. With AI, you can insert automations within a process to speed it up, but you soon reach a point where you need to verify that it’s producing the right output. So how do we test that?

The answer to this question blends into another subject I'm interested in, which touches on the UX world. At the moment, there's a lot of buzz about tools like Lovable that let you mock up a whole application in minutes just by describing it in plain English. Figma, the interface design and prototyping app familiar to many designers, has an AI version in development.

Eliminating the humans?

When you start using these tools, you see that what you can do without human subject matter expertise is basically a mock-up. Even with the current state of the art, you still need people who know what good user experience looks like – and what good interactions look like in a UI.

As it happens, Phased AI and Each&Other are working together on exactly this challenge: we’ve been finding the human pain points our customers have, and figuring out where to automate them, effectively and accurately.

What we’ve found is that most of the work is not in building a prompt or an automation to start the process: it’s in evaluating it robustly at the far side.

We work with our customers to ensure they end up with a genuinely trustworthy application at the end of the experimentation phase, one that works with human workflows. Together with those businesses, we sketch out the best ideas for how they can use AI with high impact but, and this is the crucial part, low risk.

Typically, some customers will have played around with ChatGPT and think: "Could it do this task I have in mind? Can I make it work with the API, and make it safe?" Some have a specific use case and think generative AI can solve it, or make it faster, but don't know how.

Workshopping trustworthy AI

In a workshop, we establish whether this is a low-risk or high-risk use case, and we map out their current human process: is it a spreadsheet, or is it another application? Then we identify the hypothesis: can AI do this? Typically, we then build a very quick prototype for them, taking in as near-to-real data as we can.

Next, we test the AI output at the other end to see if it's accurate. In parallel, Each&Other evaluates how this will work from a UX point of view: how the user will actually experience it.

After the hypothesis and the rapid proof of concept, we build out a bigger data set, scale the test, build a quick UI, and the users check for flaws. That can be a simple checkbox exercise for the test cases: did the expected output match the actual output, yes or no? Or you can score it on a sliding scale.
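To make the idea concrete, here is a minimal sketch of that checkbox-or-score exercise in code. It is illustrative only: `TestCase`, `run_evals`, `model_fn` and `scorer` are hypothetical names, not a real Phased AI tool.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # input given to the AI
    expected: str  # what a correct output should look like

def run_evals(cases, model_fn, scorer=None):
    """Run each case through the model. Without a scorer this is the
    yes/no checkbox exercise; with one, it's a sliding-scale score."""
    results = []
    for case in cases:
        actual = model_fn(case.prompt)
        if scorer is None:
            # Checkbox: exact match (ignoring surrounding whitespace)
            score = 1.0 if actual.strip() == case.expected.strip() else 0.0
        else:
            # Sliding scale: scorer returns e.g. a value from 0.0 to 1.0
            score = scorer(case.expected, actual)
        results.append({"case": case, "actual": actual, "score": score})
    pass_rate = sum(r["score"] for r in results) / len(results)
    return results, pass_rate
```

In practice, `model_fn` would wrap the prompt-plus-API call under test, and the per-case results feed the report handed to the customer alongside the overall pass rate.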

This part is so important: even at the proof-of-concept stage, we're always testing whether the AI is consistently giving the output we think it should. And at the end of the proof of concept, when we hand the tool over to the customer to try in real scenarios, it comes with a report on our test cases and our assessment.
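Consistency can be probed the same way: call the model repeatedly with the same prompt and see how stable the output is. A minimal sketch, again with `model_fn` as a stand-in for whatever callable wraps the AI:

```python
from collections import Counter

def consistency_check(prompt, model_fn, runs=5):
    """Call the model `runs` times with the same prompt and report the
    most common output plus the fraction of runs that produced it."""
    outputs = [model_fn(prompt) for _ in range(runs)]
    most_common, count = Counter(outputs).most_common(1)[0]
    return most_common, count / runs
```

A consistency rate well below 1.0 on a prompt that should have a single right answer is a warning sign long before anything goes to production.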

After that, the next stage is making it production-level: establishing the accuracy threshold that is acceptable before the tool goes in front of customers, and what guardrails need to be in place if it interacts with them.

Y Combinator is doing a lot of work in this exact area. Many of the companies it works with are now building tools, platforms or wrappers to perform 'evals'. These robust checks assess the quality of AI models and agents, and they're especially critical in fields like law, where the risk of fallout from bad outputs is especially high.

For years, the tech industry motto was "move fast and break things". Right now, I believe we're at the very beginning of the transformation that AI can bring, and companies understandably want to be a part of it. The key to getting it right is to move slowly and intentionally, and to make sure you are properly evaluating at the other end.

AI has the potential to make us all quicker and more efficient, but the current state of the art is not replacing humans. None of this can simply be handed over to a machine: there is still work to do in understanding where the human needs to step in.



Copyright Ā© 2024 Each&Other Ltd.
