Talking to machines through the years
Conversation comes naturally to us. It’s remarkable just how fluently we can converse on any number of topics, and adapt our style with ease to speak with any number of people.
In contrast, our conversations with machines can be clumsy and stilted. Conversational AI has been a long-standing research topic, and much progress has been made over the last decades. There are some large-scale deployed systems that we’re able to interact with by language, both spoken and written, although I’m sure very few people would call the interactions natural. But whether it’s a task-based conversation like booking a travel ticket, or a social chatbot that makes small talk, we’ve seen continual evolution in the way the technology is built.
The first chatbot
One of the first chatbots, and still among the most famous, is Eliza, built around 1966. It emulates a psychotherapist, using rule-based methods to pick out keywords in what a user types and reformulate those keywords into a pre-scripted question to ask back to the user. There are implementations still around today which you can try.
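To make the keyword-and-template idea concrete, here is a minimal sketch in Python. The rules and responses are invented for illustration and are not taken from Weizenbaum's original script; the original program also did things this sketch leaves out, such as swapping pronouns ("my" to "your").

```python
import re

# Hypothetical rules: (keyword pattern, response template).
# These are illustrative only, not Eliza's original script.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father|family)\b", re.IGNORECASE), "Tell me more about your {0}."),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]
FALLBACK = "Please go on."


def respond(utterance: str) -> str:
    """Return a pre-scripted question built from the first matching keyword rule."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Reuse the user's own words inside the scripted question.
            return template.format(match.group(1).lower())
    return FALLBACK


if __name__ == "__main__":
    for line in ["I feel tired.", "My mother called yesterday", "Hello there"]:
        print(f"> {line}\n{respond(line)}")
```

Even a handful of rules like these can give a surprisingly conversational feel, which goes some way to explaining the reaction Eliza received.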
Eliza’s inventor, Joseph Weizenbaum, conceived Eliza as a way to show the superficiality of communication between people and machines. And so he was surprised by the emotional attachment that some people went on to develop with Eliza.
“Press 1 to make a booking, press 2 for cancellations…”
The personal computer wasn’t a reality until the late 1970s. So at the time of Eliza there wasn’t really a way that people could interact with a text-based chatbot, unless they happened to work with computers. Chat technology instead began to be used in customer service scenarios over the phone. These systems were dubbed Interactive Voice Response (IVR). DTMF (dual-tone multi-frequency) signalling was initially a key part of these systems for enabling user input. DTMF assigns each key on the keypad a pair of frequencies, which are played together when the key is pressed and decoded by the receiver to figure out which key the user pressed. This is the mechanism behind the scenes when call centres ask you to “Press 1 for bookings, press 2 for cancellations…”, etc.
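As a rough illustration of how DTMF works, the sketch below uses the standard 12-key frequency grid (one low "row" frequency and one high "column" frequency per key), synthesises the two-tone signal for a key, and decodes it again by checking which row and column frequencies carry the most energy. The function names, sample rate and tone duration are my own choices for the example, and the detector is a simple correlation; real decoders often use the Goertzel algorithm and have to cope with noise and line distortion.

```python
import math

ROW_FREQS = (697, 770, 852, 941)   # low-group frequencies in Hz, one per keypad row
COL_FREQS = (1209, 1336, 1477)     # high-group frequencies in Hz, one per keypad column
KEYPAD = ("123", "456", "789", "*0#")


def key_to_tones(key):
    """Return the (low, high) frequency pair for a keypad key."""
    if len(key) == 1:
        for r, row in enumerate(KEYPAD):
            c = row.find(key)
            if c != -1:
                return ROW_FREQS[r], COL_FREQS[c]
    raise ValueError(f"not a DTMF key: {key!r}")


def synthesize(key, sample_rate=8000, duration=0.05):
    """Generate samples of the two sine waves for a key, added together."""
    low, high = key_to_tones(key)
    n = int(sample_rate * duration)
    return [
        math.sin(2 * math.pi * low * t / sample_rate)
        + math.sin(2 * math.pi * high * t / sample_rate)
        for t in range(n)
    ]


def tone_power(samples, freq, sample_rate):
    """Energy of the signal at one frequency (correlation with sine and cosine)."""
    re = sum(s * math.cos(2 * math.pi * freq * i / sample_rate) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * i / sample_rate) for i, s in enumerate(samples))
    return re * re + im * im


def decode(samples, sample_rate=8000):
    """Pick the strongest row and column frequency and look up the key."""
    row = max(range(len(ROW_FREQS)), key=lambda r: tone_power(samples, ROW_FREQS[r], sample_rate))
    col = max(range(len(COL_FREQS)), key=lambda c: tone_power(samples, COL_FREQS[c], sample_rate))
    return KEYPAD[row][col]


if __name__ == "__main__":
    print(key_to_tones("5"))
    print(decode(synthesize("5")))
```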
The first commercial IVR system for inventory control was invented in 1973, with commercialisation of IVRs picking up in the 1980s as computer hardware improved. Through the 1990s, as voice technology improved, limited vocabulary speech-to-text (STT) was increasingly able to handle some voice input from users, alongside continued use of DTMF. Phone conversations also need a way to respond to the user with voice. Initially, this would have been pre-recorded audio, and later text-to-speech (TTS).
In early systems, the natural language processing (NLP) used to interpret what users said was typically rule-based. To make life easier, the questions asked by the system were often very direct, in order to limit the range of things a person might say in response, e.g. “Please say either booking or cancellation”, or “Please state the city you are departing from”.
The conversation flow (i.e. what to say next) in these systems was handcrafted, like a flowchart. Standards were developed for writing conversational flows. VoiceXML is one such standard that came into being in 1999. It allowed voice user interface (VUI) designers to focus solely on designing the conversation, while software engineers could focus on the system implementation.
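A handcrafted flow of this kind can be pictured as a small state machine. The Python sketch below is only an illustration of the flowchart idea, not VoiceXML; the state names, the "booking reference" prompt and the "*" wildcard convention are all invented for the example.

```python
# Each state has a prompt and a table of transitions keyed on what the user says.
# "*" is a made-up wildcard meaning "accept any answer".
FLOW = {
    "start": {
        "prompt": "Please say either booking or cancellation.",
        "next": {"booking": "ask_city", "cancellation": "ask_reference"},
    },
    "ask_city": {
        "prompt": "Please state the city you are departing from.",
        "next": {"*": "done"},
    },
    "ask_reference": {
        "prompt": "Please state your booking reference.",
        "next": {"*": "done"},
    },
    "done": {"prompt": "Thank you. Goodbye.", "next": {}},
}


def step(state, user_input):
    """Follow one arrow of the flowchart; stay put (re-prompt) on unexpected input."""
    transitions = FLOW[state]["next"]
    word = user_input.strip().lower()
    if word in transitions:
        return transitions[word]
    if "*" in transitions:
        return transitions["*"]
    return state


def run(inputs):
    """Play a list of user inputs through the flow and return the system prompts."""
    state = "start"
    transcript = [FLOW[state]["prompt"]]
    for user_input in inputs:
        state = step(state, user_input)
        transcript.append(FLOW[state]["prompt"])
        if not FLOW[state]["next"]:  # reached a state with no way out
            break
    return transcript


if __name__ == "__main__":
    for prompt in run(["booking", "London"]):
        print(prompt)
```

Notice that anything other than "booking" or "cancellation" at the first step simply repeats the question, which is exactly the kind of rigidity that makes these systems feel stilted.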
Learning how to converse
Handcrafting conversation flows is complex, and leads to sometimes clumsy interactions and brittle systems that can break when users say something unexpected. From the early 2000s, researchers looked into ways to learn conversation flows rather than handcraft them. Many of the models at this time were based on reinforcement learning, and were able to learn a conversation flow (or ‘dialogue policy’) through interacting with simulators and by having lots of conversations with real people.
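To give a flavour of what learning a dialogue policy means, here is a toy sketch using tabular Q-learning, one of the simplest reinforcement learning methods. Everything in it is an assumption made for illustration: the two-slot booking task, the three system actions, the reward values, and a deliberately trivial simulated user who always answers the question asked. Research systems of the time dealt with far larger state spaces and noisy speech recognition.

```python
import random

ACTIONS = ("ask_city", "ask_date", "confirm")


def simulate_turn(state, action):
    """Toy simulated user: always answers the question that was asked."""
    has_city, has_date = state
    if action == "confirm":
        # Confirming ends the dialogue; it only succeeds once both slots are filled.
        return state, (10.0 if has_city and has_date else -5.0), True
    if action == "ask_city":
        has_city = True
    else:
        has_date = True
    return (has_city, has_date), -1.0, False  # every extra turn has a small cost


def train(episodes=3000, alpha=0.5, gamma=0.9, epsilon=0.3, max_turns=10, seed=0):
    """Learn which action to take in each state by trial and error."""
    rng = random.Random(seed)
    states = [(c, d) for c in (False, True) for d in (False, True)]
    q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for _ in range(episodes):
        state = (False, False)
        for _ in range(max_turns):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = simulate_turn(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Standard Q-learning update towards reward plus discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    # The learnt policy: the best-scoring action in each state.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}


if __name__ == "__main__":
    for state, action in train().items():
        print(state, "->", action)
```

Nobody tells the system to ask for the missing slot before confirming; it works that out from the rewards alone, which is the appeal of the approach.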
One of the difficulties of deploying such statistical systems for dialogue policy is the lack of control they offer to developers. In a world where companies like to maintain control of their brand in customer service interactions, it’s difficult to accept randomness in performance that might reflect poorly on them. A particularly egregious case is that of Tay, a social chatbot released by Microsoft in 2016 which quickly learnt to post offensive and inflammatory tweets, and had to be taken down.
As the internet grew, so too did the places in which conversational AI technology was deployed. Web browsers, instant messaging and mobile apps quickly became channels in which text-based chat was now viable.
The deep learning boom
Through the 2010s, deep learning had a big impact on STT and TTS systems, significantly improving them to handle a wider range of language. Deep learning also started to have an impact in the NLP community. Understanding the meaning of what a user says in a conversation is cast as two machine learning tasks: intent recognition and slot (or entity) recognition. Commercial platforms like Amazon Lex and Google Dialogflow are based around the ideas of intent and slot. Intent recognition is a text classification task which predicts which of a predefined set of intents the user’s utterance expresses. For example, a ticket booking system might have MakeBooking or MakeCancellation intents. Slot recognition is a named entity recognition (NER) task which aims at picking salient entities (or slots) out of the text. In a ticket booking scenario, DestinationCity and SourceCity might be among the slots a system aims to recognise. Together, the intent and slots can be used to infer that “I’d like to book a ticket to London” and “Please can I buy a ticket and I’m going to London” effectively mean the same thing to a ticket booking system. The system can use the recognised intent and slots to communicate with a wide range of systems (databases, knowledge graphs, APIs etc) and act on a user’s request.
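The sketch below shows the shape of the output rather than the machine learning itself. In place of a trained classifier and NER model it uses hand-written stand-ins (a keyword list for intents and a small list of city names for slots), all of which are assumptions for the example. The point is that two differently worded requests end up as the same intent-and-slot structure.

```python
import re

# Stand-ins for trained models: hypothetical keyword and city lists.
INTENT_KEYWORDS = {
    "MakeBooking": {"book", "buy", "reserve"},
    "MakeCancellation": {"cancel", "refund"},
}
KNOWN_CITIES = {"london", "paris", "edinburgh"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def recognise_intent(tokens):
    """Toy text classification: pick the intent with the most keyword hits."""
    words = set(tokens)
    scores = {intent: len(keywords & words) for intent, keywords in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def recognise_slots(tokens):
    """Toy entity recognition: a known city after 'to' or 'from' fills a slot."""
    slots = {}
    for previous, token in zip(tokens, tokens[1:]):
        if token in KNOWN_CITIES:
            if previous == "to":
                slots["DestinationCity"] = token.title()
            elif previous == "from":
                slots["SourceCity"] = token.title()
    return slots


def understand(text):
    tokens = tokenize(text)
    return {"intent": recognise_intent(tokens), "slots": recognise_slots(tokens)}


if __name__ == "__main__":
    print(understand("I'd like to book a ticket to London"))
    print(understand("Please can I buy a ticket and I'm going to London"))
```

A real system would replace `recognise_intent` and `recognise_slots` with learnt models, but whatever sits downstream (a database query, an API call) only ever sees the structured result.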
Using machine learning for NLP leads to conversational systems that can robustly handle a wide range of user inputs. Still, it’s common to have a layer of handcrafted rules alongside the ML model to handle edge cases or guarantee the system will behave appropriately for particularly important or common user queries. Further, even when machine learning can interpret individual user utterances, the overall conversation flow still usually remains handcrafted.
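One common way to combine the two is to check the handcrafted rules first and only fall back to the model when no rule fires, rejecting model predictions that are not confident enough. The sketch below assumes that arrangement; the rules, the `TransferToAgent` and `Fallback` intent names, the confidence threshold and the `fake_model` stand-in are all hypothetical.

```python
import re

# Hypothetical handcrafted overrides, checked before the model is consulted.
RULES = [
    (re.compile(r"\b(agent|human|operator)\b", re.IGNORECASE), "TransferToAgent"),
    (re.compile(r"\bcancel\b", re.IGNORECASE), "MakeCancellation"),
]


def fake_model(text):
    """Stand-in for a trained intent classifier: returns (intent, confidence)."""
    confidence = 0.9 if "ticket" in text.lower() else 0.3
    return "MakeBooking", confidence


def classify(text, model=fake_model, threshold=0.6):
    """Return (intent, source), where source records which layer made the decision."""
    for pattern, intent in RULES:
        if pattern.search(text):
            return intent, "rule"
    intent, confidence = model(text)
    if confidence >= threshold:
        return intent, "model"
    # Not confident enough: hand over to a safe default such as a clarifying question.
    return "Fallback", "low_confidence"


if __name__ == "__main__":
    for utterance in ["I need to speak to a human", "A ticket to London please", "Hmm"]:
        print(utterance, "->", classify(utterance))
```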
Deep learning for dialogue
The intent-and-slot approach has its limitations as a way of modelling dialogue. For now, though, it remains a common way to build both voice bots and chatbots in real-world applications.
Deep learning continues to impact the trajectory of conversational AI. Deep neural networks (DNNs) were first used for learning dialogue policies. With DNNs being used for both NLP and dialogue policy, the natural next step is to build a single model that directly predicts appropriate responses in a conversation. An example of this kind of model is Google’s Meena, a large neural network that’s trained to be able to respond appropriately in conversations about different topics.
These end-to-end neural dialogue models build on large non-conversational language models like BERT and GPT-3. However, they’re difficult to use in commercial products because of some key issues. It’s difficult to have any control over the conversation flow, and they can sometimes produce biased or inappropriate responses. This isn’t great for company branding! Also, they struggle to retain a consistent persona throughout a conversation, forget what they’ve previously said, often produce relatively boring responses, and cannot easily link with external sources of information like knowledge bases or APIs to take action or to find the right information. These models of dialogue are new, however, and current research is addressing these limitations.
Conversational AI has been the topic of extensive research and development for decades, and a lot has changed in that time. It’s impossible to do justice to all of the research that’s happened, and is still going on, so this is a small snapshot of how the field has developed. Things will look very different in a few years’ time as the challenges of the current technology are addressed.