Chatbots II


Cloudy builds a chatbot! Well, sort of. This week we go to the cutting edge of chatbots, show how hard it is to build one, and see if world class researchers can do better.

All the same

Ever wonder why all chatbots sort of do the same thing? Many companies advertise chatbot capabilities for customer service, product questions, and other narrow tasks. Take chatbots for banking. Microsoft has a banking case study, as does Amazon, and don’t forget IBM. It seems like there are a few industries that all of the chatbots flock to.

This is because the structure of the conversation is relatively easy to predict. In banking, people probably just want to check their balances, so developing some machine learning code to detect “balance asking” is no problem. But what if we wanted to build a general purpose chatbot that can respond to any request?
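To make the “balance asking” idea concrete, here is a minimal sketch of intent detection for a banking bot. Real systems train classifiers on labeled utterances; this keyword-matching version (with made-up intents and keywords) just illustrates why a narrow, predictable domain is so much easier than open conversation.

```python
# Minimal sketch of intent detection for a banking chatbot.
# Intents and keywords here are illustrative, not from a real product.

INTENT_KEYWORDS = {
    "check_balance": {"balance", "how much", "funds"},
    "transfer": {"transfer", "send money", "move money"},
}

def detect_intent(utterance):
    """Return the first intent whose keywords appear in the utterance."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return intent
    return "fallback"

print(detect_intent("What's my account balance?"))  # check_balance
print(detect_intent("Tell me a joke"))              # fallback
```

A general purpose chatbot has no such shortlist of intents to match against, which is where the trouble starts.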

“What I cannot create, I do not understand” – Richard Feynman

My original intent for this chatbot series was to research the cutting edge of chatbots and see how easy it would be to create a general purpose chatbot. While I get the underlying concept of bots, I have never built one entirely from scratch and was trying to grasp the inner details of how they work. I mean, how hard could it be?

Cloudy attempts to build a chatbot

In order to build my chatbot, I first needed some data. The first thing to understand about chatbots is that the publicly available datasets for training them are quite poor. The two most frequently used datasets are the Cornell Movie-Dialogs Corpus, which contains 220,579 conversational exchanges between 10,292 pairs of movie characters, and comments from Reddit, which, well, Reddit.
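For a sense of what the movie data looks like: the Cornell corpus ships a `movie_lines.txt` file whose fields are separated by the string `+++$+++` (line ID, character ID, movie ID, character name, utterance). Here is a small parsing sketch; the sample rows are paraphrased from the corpus, and the exact field layout should be checked against the corpus README.

```python
# Sketch of parsing the Cornell Movie-Dialogs Corpus line file.
# Field order assumed: lineID, characterID, movieID, name, utterance.

SAMPLE = """L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!"""

def parse_lines(raw):
    """Map line IDs to (speaker, utterance) pairs."""
    lines = {}
    for row in raw.splitlines():
        parts = row.split(" +++$+++ ")
        if len(parts) == 5:
            line_id, _, _, speaker, text = parts
            lines[line_id] = (speaker, text)
    return lines

parsed = parse_lines(SAMPLE)
print(parsed["L1045"])
```

A companion conversations file groups these line IDs into exchanges, which is what you actually feed a sequence-to-sequence model.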

There I was, not even an hour into my research, contemplating whether to build my chatbot based on movie quotes or Reddit. I chose to use the movie dataset, mostly because the data size was a lot smaller and because, well, Reddit.

Then I forked someone’s code from GitHub, and while the trolls out there may say I “cheated” by copying someone else’s code, I do this in my spare time, alright? The person I copied code from helped TA Stanford’s deep learning research course, so I thought I was in good hands.

Given the small data size, I thought I could get pretty good results after training my chatbot for 8 hours on a MacBook Pro. What I soon found was that after all this time, all the chatbot would reply with was periods. Here is an example of our interaction.

To the Cloud!

Given these bizarre results, it was time to take our solution to the cloud and run this on a GPU. In this situation, GPUs gave me about a 50x improvement in training time and let me run tens of thousands more iterations. But, did I achieve better results?


Turns out other people using this code had the same problem, and there was a long forum thread on GitHub to back my claim up. As the chatbot trained for longer and longer, there were also more unforeseen problems: my chatbot was getting so big (about 2 GB) that simply asking it questions used an excessive amount of computing power. So, was I about to give up on building a chatbot?

You bet I was.

Training on AWS with a GPU comes out to around $5 an hour, and I figured I needed about 15-20 hours of training time to get a chatbot that really works. Oh, well. For what it’s worth, this is what my chatbot was ultimately supposed to resemble:

Do cutting edge AI/ML researchers have better success?

The answer is clearly “Yes,” but not a resounding one. One interesting way to gauge progress in the development of chatbots is to look inside the Alexa Prize, a contest which offers up to $3.5 million for building the best bot. Amazon defines “best” as a bot that “achieves the grand challenge of conversing coherently and engagingly with humans for 20 minutes with a 4.0 or higher rating.”

The contest is not going well. Currently, the “Alexa Prize Socialbots” skill has a 2-star rating on the skills page, with 64% of users giving it a 1-star rating. Some of the user comments are great:

“I would rather talk to a toddler than talk to an Alexa Prize Socialbot again!”

“This is terrible AI.”

“The bots themselves are really terrible.”

Just to be clear, world-class universities enter this contest and put their heart and soul into it because you could win THREE POINT FIVE MILLION dollars. College kids go crazy just for free pizza, so you can imagine what they would do for REAL MONEY.

Back to the contest

There was a very interesting article in The Verge interviewing many of the contestants about their struggles with building bots. Many of the Alexa Prize contestants cited using machine learning but also relied on “hard coding” responses, or just manually coming up with responses to many of the questions, which is clearly not the science fiction we have been promised. One quote from the teams sums it up perfectly:

“Everyone starts with machine learning, and eventually, everyone realizes it doesn’t really work.”

Based on interviews with Alexa Prize contestants, some seem to be turning to the smartest AI ever developed: real people. One group outsourced their chatbot’s answers to Mechanical Turk, Amazon’s service where humans do mind-numbing tasks. If the chatbot could not come up with an answer, the question would get outsourced to a human. The hope is that after collecting enough responses from humans, the teams can get better results.
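The human-fallback pattern itself is simple to sketch: if the model’s best answer falls below some confidence threshold, hand the question off to a person. Here `model_answer` and `ask_human` are hypothetical stand-ins, not real Alexa Prize or Mechanical Turk APIs.

```python
# Hedged sketch of the "route to a human on low confidence" pattern.
# Both helpers below are illustrative stand-ins.

def model_answer(question):
    """Stand-in for a real model: returns (answer, confidence)."""
    canned = {"hello": ("Hi there!", 0.95)}
    return canned.get(question.lower(), ("", 0.1))

def ask_human(question):
    """Stand-in for submitting the question to a crowdsourcing service."""
    return f"[human answer to: {question}]"

def respond(question, threshold=0.5):
    answer, confidence = model_answer(question)
    if confidence >= threshold:
        return answer
    return ask_human(question)

print(respond("hello"))          # Hi there!
print(respond("what is love?"))  # [human answer to: what is love?]
```

The human answers can then be logged as new training data, which is the “get better results over time” part of the teams’ plan.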

What did I learn from all of this?

Chatbots are highly specialized for specific tasks, and general purpose chatbots are still years away. Even with some of the smartest people working on them, with support from large corporations like Amazon, there are still enormous challenges. The same advice parents give their children applies to chatbots:

Be careful talking to strangers.

Copyright © 2018 Ogilvy, All rights reserved.
