Mycroft is constantly striving to build a better Voice User Experience for the Community. We’ve introduced the Precise Wake Word spotter. We’ll shortly be deploying the Skill Marketplace. We’re spending the next few months improving skills and usability leading up to Mark II’s delivery.
We’ve also developed the Mimic2 Speech Synthesis engine, opening up Mycroft to more natural speech and more voices to choose from. Unlike Mimic1, however, Mimic2 can’t run on a Raspberry Pi. It requires a GPU to generate speech fast enough to be useful. So, we’re hosting it on our servers to provide voices for those who don’t have their own GPU. Hosting it ourselves also lets us deploy a feature that speeds up Mycroft’s responses: response caching.
By caching common responses, the Mimic2 engine doesn’t need to re-synthesize those responses every time they’re called for. Say 100 people every morning ask for the weather in Portland, Oregon within an hour or so of each other. Previously they would have all called the Mimic2 service separately to generate the same forecast 100 times. Now, the 5:30 am early riser may be the first to request Portland’s weather and will have it generated by Mimic2. But, her request now means the other 99 people get the response sent straight to the speaker, improving the time to response.
To learn a bit more about this feature, I checked in with Mycroft’s CTO Steve Penrod.
Steve – Mimic2’s initial implementation would generate every single Text to Speech request; the neural network running on a GPU generating a fresh audio representation for each requested phrase. This straightforward approach is what we call an MVP (Minimum Viable Product) — it did exactly what we needed it to do, but nothing else.
The nature of a voice assistant is that it often repeats stock phrases — “You are welcome” or “All alarms cleared”. We can (and do) cache those at the device level. But more impactful to Mimic2 is the fact that the server is generating responses for many devices. So as rare as the phrase “It is 11:02” or “Currently sunny and 72 degrees” might be for a single user, with thousands of devices interacting you start to get collisions for even those dynamic phrases.
Implementing a simple cache allows us to use a cheap and near infinite resource — disk space — to enhance the system without adding limited and expensive GPU resources.
Steve – There really is no decision — everything Mimic2 generates goes into the cache. The caching scheme places the most recent request at the top of the stack whether it was newly generated or pulled out of the cache. When we start to run out of space in the cache we simply clear out the bottom of the stack and throw away the oldest generated phrases.
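The scheme Steve describes matches what is commonly called a least-recently-used (LRU) cache. Below is a minimal Python sketch of that idea, not Mycroft's actual server code. The names `PhraseCache`, `fake_synthesize`, and `max_entries` are hypothetical, and the limit is a simple entry count standing in for the disk-space limit mentioned in the interview. The cache is keyed only by the phrase text.

```python
from collections import OrderedDict


class PhraseCache:
    """Hypothetical LRU cache for synthesized phrases, keyed by text only."""

    def __init__(self, synthesize, max_entries):
        self._synthesize = synthesize    # expensive call, e.g. GPU synthesis
        self._max_entries = max_entries  # assumption: count-based size limit
        self._entries = OrderedDict()    # oldest first, most recent last
        self.misses = 0                  # how often we had to synthesize

    def get_audio(self, phrase):
        if phrase in self._entries:
            # Cache hit: move the phrase back to the "top of the stack".
            self._entries.move_to_end(phrase)
            return self._entries[phrase]
        # Cache miss: generate the audio and store it unconditionally.
        self.misses += 1
        audio = self._synthesize(phrase)
        self._entries[phrase] = audio
        if len(self._entries) > self._max_entries:
            # Out of room: discard the least recently requested phrase.
            self._entries.popitem(last=False)
        return audio


def fake_synthesize(phrase):
    """Stand-in for the real engine; returns placeholder bytes."""
    return ("audio:" + phrase).encode("utf-8")


if __name__ == "__main__":
    cache = PhraseCache(fake_synthesize, max_entries=1000)
    # 100 people ask for the same forecast; only the first triggers synthesis.
    for _ in range(100):
        cache.get_audio("Currently sunny and 72 degrees")
    print(cache.misses)
```

In this sketch, the 100 identical weather requests from the example earlier in the post cost a single synthesis call; the other 99 are served from the cache.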
Steve – From a technical perspective, this is a great example of how all the old tricks are still useful even in the machine learning world.
A few have asked me if there is any privacy concern, but I don’t see any. Generated utterances have no association with a user account or skill. So even if we cache the phrase “Your balance is twelve fifty” there is no way to determine who initially created the interaction that generated that response, what the question was that elicited it, or even what skill was invoked to generate the output. It is impossible to tell if that balance was referring to a checking account, Steam credits, or the number of calories I have left on my diet plan.
From the user perspective, this is just a great performance boost!
Steve – It is hundreds of times faster to retrieve a phrase from a cache than it is to generate it. As a bonus, the more Mycroft is deployed the more effective this will be with dynamic content, as cache hit likelihood will increase with more users. There is network overhead that still exists, but we are expecting TTS response time to be cut in half on average. Though, aren’t you the guy in charge of metrics around here?
He’s right. So, I took a look at the metrics for our Opted-In user base, using the skills with the slowest Time to Response from the first Mycroft Benchmark. We deployed Mimic2 caching on August 31. My sample was all Mimic2 interactions for the 15 days leading up to August 31 and the 15 days after. For Time to Response (T2R), the Mimic2 cache shows an average drop from 12.27 seconds to 8.99 seconds, a reduction of more than 25%. Not quite the 50% cut Steve mentioned, but T2R takes into account other factors like skill handling (for example, turning on lights with an IoT skill).
When looking at Text to Speech (TTS) generation time in our range, we decreased on average from 6.44 seconds to 2.99 seconds. That’s a 53.5% reduction!
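For readers who want to check the arithmetic, here is a short Python sketch that computes the percentage reductions from the rounded averages quoted above. The function name `percent_reduction` is just an illustrative choice. Because it works from the rounded figures, the result can differ from the published percentages in the last decimal place; the published numbers were presumably computed from the unrounded data.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100.0


# Rounded averages (in seconds) quoted in the post.
t2r = percent_reduction(12.27, 8.99)  # Time to Response
tts = percent_reduction(6.44, 2.99)   # Text to Speech generation time

if __name__ == "__main__":
    print(f"T2R reduction: {t2r:.1f}%")
    print(f"TTS reduction: {tts:.1f}%")
```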
Everything we do for Mycroft is aimed at improving the experience for our Community. If you haven’t given Mimic2 a try yet, you can set it as your voice for Mycroft at https://home.mycroft.ai/#/setting/basic under “Voice”. While you’re there, why not Opt-In to Mycroft’s Open Dataset? Then, you can help make Mycroft better just by using it! We’re regularly updating Mimic2’s initial model, so if you run into words or phrases it stumbles on, let us know here.