Testing, iteration, and failure will be a part of the journey when you’re building a technology and industry from the ground up. If you haven’t seen the news about Amazon’s Alexa chuckling randomly at users, let us bring you up to speed.
For the past few weeks there have been increasing reports of users hearing an eerie laugh come from their Alexa at random, without anyone asking for it or triggering the skill. Videos from users who caught these utterances flooded social media, with most describing how unnerving they found it.
Some users went on, waiting for it to happen again. Others unplugged their device. News agencies stood by, waiting for Amazon to catch the “bug.”
But was it a bug? No. It’s a product of the architecture of voice assistants, and it highlights why building this technology is so complicated. In fact, everyone working on voice platforms runs into similar problems; it just so happens that Alexa’s made users feel like they were filming the next Paranormal Activity.
The likely cause: a poorly trained “Intent.”
When a user says “Alexa, laugh,” the utterance is the Wake Word (“Alexa”) plus a command (“laugh”).
The voice assistant pulls the Intent (action keywords, in this case it’s “laugh”), and then executes on it by taking an action – in this case by playing a laughing sound.
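To make that flow concrete, here is a minimal sketch of the Wake Word plus Intent step, working on text that has already been transcribed. It is an illustration only, not Alexa’s or Mycroft’s actual code: the `INTENTS` table, the `play_laugh_sound` action name, and the `parse_utterance` function are all hypothetical names made up for this example.

```python
import re

# Assumptions for illustration: a single wake word and a tiny intent table.
WAKE_WORD = "alexa"
INTENTS = {"laugh": "play_laugh_sound"}  # hypothetical keyword -> action name


def parse_utterance(transcript):
    """Return the action name for a transcript, or None if nothing should fire."""
    words = re.findall(r"[a-z']+", transcript.lower())
    # The assistant only acts when the utterance starts with the wake word.
    if not words or words[0] != WAKE_WORD:
        return None
    # Look for the first action keyword after the wake word.
    for word in words[1:]:
        if word in INTENTS:
            return INTENTS[word]
    return None


if __name__ == "__main__":
    print(parse_utterance("Alexa, laugh"))
    print(parse_utterance("Let's have half of that chocolate cake"))
```

A real assistant makes this decision on audio rather than clean text, which is exactly where similar-sounding phrases cause trouble.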
The problem is that the phonemes – think of them as building blocks for speech – for “Alexa, laugh” are similar to several regular words or phrases. One of the trickiest parts of speech recognition is distinguishing between two phrases that sound similar phonetically, but have very different meanings. For example:
A Lexus left the car lot
Let’s have half of that chocolate cake
Hahahaha (because you’re supposed to be on a diet)
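One rough way to see how close these phrases are is to compare their phoneme sequences with an edit distance. The sketch below is an illustration under stated assumptions: the phoneme lists are hand-written, ARPAbet-style approximations (not taken from a pronouncing dictionary), and real recognizers score audio with acoustic models rather than comparing symbol lists like this.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Score from 0.0 to 1.0, where 1.0 means identical phoneme sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


# Hand-written approximations, assumed for illustration only.
ALEXA_LAUGH = ["AH", "L", "EH", "K", "S", "AH", "L", "AE", "F"]
A_LEXUS_LEFT = ["AH", "L", "EH", "K", "S", "AH", "S", "L", "EH", "F", "T"]
WHAT_TIME_IS_IT = ["W", "AH", "T", "T", "AY", "M", "IH", "Z", "IH", "T"]

if __name__ == "__main__":
    print(similarity(ALEXA_LAUGH, A_LEXUS_LEFT))
    print(similarity(ALEXA_LAUGH, WHAT_TIME_IS_IT))
```

Under these assumed transcriptions, “A Lexus left” scores far closer to “Alexa, laugh” than an unrelated phrase does, which is the kind of near miss that can trip an Intent.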
In an emailed statement to the New York Times, Amazon said, “We are changing that phrase to be ‘Alexa, can you laugh?’ which is less likely to have false positives, and we are disabling the short utterance ‘Alexa, laugh.’”
In addition, they’re changing the response to the command to, “Sure, I can laugh” and then following that with the laughing sounds. This verification serves as a buffer to let users know that the voice assistant was triggered.
Intent parsing requires substantial training data and user testing. In general, the more a neural network is ‘trained’ on representative data, the more accurate it becomes, because it learns to better distinguish between similar-sounding phonemes. Without adequate training and testing, the rate of ‘false positives’ (when an Intent is incorrectly triggered) is too high, and voice assistants risk the sort of incidents that we’ve seen with Alexa laughing.
Now, more than ever, it’s important that voice assistants are trained on a variety of voices: voices from different people, different ethnicities, different genders, different ages, different accents and different speech abilities. Every voice assistant on the market, and those in development, will become more robust as we use the lessons learned in this episode to further improve and iterate our technology.
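As a small worked example of what a false positive rate means here, the sketch below counts how often an Intent fired when the user did not ask for it. The `false_positive_rate` function and the sample trials are invented for illustration; they are not measurements from any real assistant.

```python
def false_positive_rate(trials):
    """Share of unintended utterances that still triggered the Intent.

    `trials` is a list of (intended, triggered) boolean pairs, one per utterance.
    """
    # Only utterances where the user did NOT intend the Intent can be false positives.
    unintended = [triggered for intended, triggered in trials if not intended]
    if not unintended:
        return 0.0
    return sum(unintended) / len(unintended)


# Made-up test log: (user intended "laugh", assistant triggered "laugh")
SAMPLE_TRIALS = [
    (True, True),    # correct trigger
    (False, False),  # correctly ignored
    (False, True),   # false positive: laughed without being asked
    (False, False),
    (False, False),
]

if __name__ == "__main__":
    print(false_positive_rate(SAMPLE_TRIALS))
```

Tracking a number like this across many different voices and accents is one way to tell whether more training is actually reducing accidental triggers.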
These events underscore a few crucial pieces regarding voice technology that’s built in a “black box” fashion.
Voice assistants are placed in intimate parts of the home.
Even when a Wake Word is triggered accidentally, the device begins recording. That means that when a stream of phonemes (sound building blocks) that merely sounds like a Wake Word is detected, personal data is sent to and saved on corporate servers. That’s a lot of privacy to forgo for accidental triggers, which will continue to happen, regardless of platform, for some time to come as voice assistants are still being trained.
The ‘black box’ nature – the inability to see what’s inside – is also an element of concern for users who were left waiting for answers when they heard chuckles coming from the corner. No one outside of Amazon could inspect the code, or work on identifying and resolving the issue. As the world transitions to a shared, decentralized economy, our ubiquitous and pervasive technologies need to keep pace.
Open source voice technologies – such as Mycroft AI – provide transparency while protecting privacy. While we share the same challenges around accurate Wake Word recognition and Intent Parsing that proprietary voice companies do, we have a significant advantage when issues do arise: our code is open for all to see. That means not only are our algorithms and methods able to be inspected – they also invite participation and collective improvement.
How can you help support privacy and user agency in voice technology? Back us on Indiegogo. Our next-generation voice assistant, Mark II, is out now.
Alyx works as a business analyst for Mycroft, working with data to shape metrics and the broader marketing strategy. She also writes these blog posts.