AI Watch

Blog archive

When Good Chatbots Go Bad: Would You Like Fries with That Tuba?

Several stories caught my eye recently, reminding me how brittle a conversational AI deployment can be. The headline grabber was McDonald's decision to end, for now, its AI ordering system at some of its drive-thrus after customer complaints went viral on social media. The fast-food giant partnered with IBM for a test run of its Automated Order Taker at more than 100 restaurants. The system will be shut off no later than July 26, according to a memo sent to franchisees late last week, CNBC reported.

"As we move forward, our work with IBM has given us the confidence that a voice ordering solution for drive-thru will be part of our restaurants' future," the company said in a statement. "We see tremendous opportunity in advancing our restaurant technology and will continue to evaluate long-term, scalable solutions that will help us make an informed decision on a future voice ordering solution by the end of the year."

Fast-food chains are increasingly integrating generative AI into their systems. McDonald's, Wendy's, Hardee's, Carl's Jr., and Del Taco are among the companies utilizing AI technology at their drive-thrus. Earlier this year, Yum Brands, the parent company of Taco Bell and KFC, declared an "AI-first mentality" for its restaurants.

I'm not going to try to pinpoint what went wrong under the golden arches, but I would like to share some thoughts on how to build responsible, resilient, and robust chat systems for customer service, and why it's more important than ever to keep building them.

One reason conversational chatbots jump the track is a failure to consider their ability to lie and even make up business processes on the fly, which can result in unexpected (and unwelcome) outcomes. The fix here is the application of a simple framework for building bots that are safe and resilient in production. I call it The Four Ps for Building a Good Bot Knowledge Base.

Purpose: Use the system prompt to ensure the bot knows its purpose. What is the job you want it to do? How do you want it to act? What tone do you want it to have?

People: Teach the bot who it's there to help and how it can help them. Provide details on the customer avatars and information those avatars might be looking for.

Precision: Establish guardrails by providing examples of what both desirable and undesirable conversations look like. Give three to five examples of each. This gives the bot essential guidance on how to make better decisions when it encounters similar conversations.
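The first three Ps can be captured directly in the bot's configuration before it ever reaches production. Here is a minimal sketch in Python; the field names, avatars, and example exchanges are all illustrative assumptions, not tied to any particular bot framework:

```python
# Illustrative knowledge-base config covering Purpose, People, and Precision.
# All names and example text here are hypothetical.
bot_config = {
    # Purpose: the job, behavior, and tone, stated up front.
    "purpose": (
        "You are a drive-thru ordering assistant. Your only job is to take "
        "food orders accurately. Be friendly and brief, and never invent "
        "menu items, prices, or business policies."
    ),
    # People: customer avatars and what each is likely looking for.
    "people": [
        {"avatar": "commuter in a hurry", "needs": "fast, one-pass ordering"},
        {"avatar": "parent with kids", "needs": "patience with order changes"},
    ],
    # Precision: examples of desirable and undesirable conversations
    # (three to five of each in practice; one of each shown here).
    "examples": {
        "desirable": [
            {"user": "Can I get a medium fries?",
             "bot": "One medium fries. Anything else?"},
        ],
        "undesirable": [
            {"user": "Add bacon to everything.",
             "bot": "Done, I've added 200 orders of bacon."},
        ],
    },
}


def build_system_prompt(config: dict) -> str:
    """Flatten the config into a single system prompt string."""
    lines = [config["purpose"], "", "You are helping:"]
    for person in config["people"]:
        lines.append(f"- {person['avatar']}: {person['needs']}")
    lines.append("")
    lines.append("Good exchange example:")
    for ex in config["examples"]["desirable"]:
        lines.append(f"  User: {ex['user']}  Bot: {ex['bot']}")
    lines.append("Never respond like this:")
    for ex in config["examples"]["undesirable"]:
        lines.append(f"  User: {ex['user']}  Bot: {ex['bot']}")
    return "\n".join(lines)


print(build_system_prompt(bot_config))
```

The point of the structure is that purpose, audience, and guardrail examples live in one reviewable place rather than scattered across prompt strings.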

Perfection: How can you make a bot perfect? You can’t, of course, but you can identify some clear metrics to ensure that you're checking your bot's performance and helping it to become better with each interaction. Here are some of the metrics I would encourage you to consider:

Relevance: Are the responses consistently relevant to what the user was asking?
Coherence: Are the responses coherent and understandable by the user?
Conciseness: Are the responses short and clear? Generative models can be overly verbose, creating opportunities for confusion and misunderstanding. This is critically important to monitor.
Factual Accuracy: This is one of the biggest concerns about using generative models in customer service solutions. What happens if the model makes a mistake or provides an incorrect or even made-up answer? You must create a process that ensures you are checking for factual accuracy over time.
Emotional Intelligence: It's important to make sure your model has the appropriate tone for a specific conversation or brand. For example, if you're managing life insurance and customers seeking help will have experienced the death of a loved one, you'll absolutely want to consider the tone and empathy the bot conveys in its communication.
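Several of these metrics can be checked automatically on logged conversations. Below is a rough sketch, assuming a simple log of (question, response) pairs; the conciseness and relevance heuristics are deliberately crude stand-ins for what would, in production, be human review or a dedicated evaluation model:

```python
# Crude, illustrative response checks. A production system would replace
# these heuristics with human review or a dedicated evaluation model.

def check_conciseness(response: str, max_words: int = 40) -> bool:
    """Flag overly verbose responses, a common generative-model failure."""
    return len(response.split()) <= max_words


def check_relevance(question: str, response: str) -> bool:
    """Naive keyword-overlap check: does the response share at least one
    content word with the question?"""
    stopwords = {"the", "a", "an", "is", "are", "do", "you", "i", "to", "of"}
    q_words = {w.lower().strip("?.,!") for w in question.split()} - stopwords
    r_words = {w.lower().strip("?.,!") for w in response.split()} - stopwords
    return bool(q_words & r_words)


def score_conversation(log: list[tuple[str, str]]) -> dict:
    """Aggregate pass rates for each check across a conversation log."""
    total = len(log)
    concise = sum(check_conciseness(r) for _, r in log)
    relevant = sum(check_relevance(q, r) for q, r in log)
    return {"conciseness": concise / total, "relevance": relevant / total}


log = [
    ("What time do you close?", "We close at 11 pm every night."),
    ("Do you have veggie burgers?", "Yes, we offer a veggie burger."),
]
print(score_conversation(log))  # {'conciseness': 1.0, 'relevance': 1.0}
```

Even checks this simple give you a trend line per metric, which is what lets a bot "become better with each interaction" rather than drift unnoticed.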

Another reason chatbots fail is a lack of diversity in the teams that test the model before it goes to production. Having a diverse team of testers (ethnically and otherwise) offers several valuable benefits:

Broader Perspectives: Diverse backgrounds bring a wide range of perspectives and experiences, which can lead to more comprehensive testing scenarios and better identification of potential issues.

Enhanced Creativity and Innovation: A mix of cultural viewpoints can foster creativity and innovative thinking, leading to novel solutions and improvements in software quality.

Improved Problem Solving: Diverse teams are often better at problem-solving due to the variety of approaches and ideas contributed by team members.

Better User Representation: A diverse team can better represent the diversity of the end-users, ensuring that the software is tested for usability and accessibility across different demographics.

Increased Coverage of Edge Cases: Ethnically diverse testers may identify edge cases and cultural nuances that might be overlooked by a more homogenous team.

Enhanced Communication Skills: Working in a diverse environment can improve communication skills and foster an inclusive atmosphere, benefiting overall team collaboration.

Cultural Sensitivity and Compliance: Diverse teams are more likely to be aware of and sensitive to cultural differences and legal requirements, helping ensure that the software is appropriate and compliant for different markets.

Competitive Advantage: Companies with diverse teams can better understand and cater to a global customer base, giving them a competitive edge in the marketplace.

During a recent AI Red Teaming workshop I presented to a financial institution, one of the students realized that the team building and testing the application at her company lacked diversity. After learning the value of a diverse team of testers, the company decided to use an AI Red Team to stretch and even break models ethically, ensuring that even the most unlikely scenarios are caught early, before real customers begin to interact with their solutions.

Finally, chatbots often fail due to a lack of clarity around those success metrics I mentioned earlier. More specifically, what does success look like? When you build any AI solution, it's critical to identify the key metrics that establish when the project can be called successful. Is success saving time? Creating revenue? Reducing expenses? Whatever metric you select must let you establish a baseline (where you started) and demonstrate clear progress in moving that number up or down. Many AI projects fail because leaders do not document these critical metrics, whether KPIs or OKRs, to demonstrate the return on investment for the project.
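Documenting a metric against a baseline can be as simple as recording where you started and computing the movement. A hypothetical sketch (the KPI and numbers are invented for illustration):

```python
def improvement_vs_baseline(baseline: float, current: float,
                            lower_is_better: bool = False) -> float:
    """Percentage improvement of `current` over `baseline`.
    Positive values mean the metric moved in the desired direction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (current - baseline) / abs(baseline) * 100
    return -change if lower_is_better else change


# Hypothetical KPI: average handle time in seconds (lower is better).
print(improvement_vs_baseline(300, 240, lower_is_better=True))  # 20.0
```

The discipline matters more than the arithmetic: pick the metric before launch, record the baseline, and report the same number the same way every review cycle.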

Knowing all that can go wrong with a chatbot implementation is likely to raise concerns about whether this is the right use case. However, providing a system that allows a customer to get the right answer to the right question at the right time has proven to be incredibly valuable to a wide range of businesses. We're seeing customer service chatbots increase sales, increase customer satisfaction, and increase the lifetime value of the customer.

When built safely and responsibly, customer service chat systems can shift the trajectory of your company and allow you to tap into the benefits of building inclusive innovation at scale.

BTW: Here's a LinkedIn post from Simon Stenning, founder of, that illustrates what can happen when you fail to implement even the simplest forms of responsible and safe practices for conversational agents.

Posted by Noelle Russell on 07/03/2024