Improving Mycroft through Metrics: The Mycroft Benchmark

August 30, 2018
Mycroft is proud to announce the Mycroft Benchmark, a way to compare Mycroft, or any voice assistant, to any other.

Machine Learning requires data to improve. The best source of that data is our Community members who Opt-In to share the data from their interactions with Mycroft. That allows us at Mycroft AI to build Open Datasets of tagged speech to improve Wake Word spotting, Speech to Text, and Text to Speech. But to improve the software that utilizes those engines, we need a different kind of data to analyze. How does Mycroft compare to other voice assistants and smart speakers? How does Mycroft itself improve over time? How can you help?

Benchmarking Mycroft

A benchmark matters for a number of reasons. First and foremost, it gives us a baseline of Mycroft’s performance on a given date that we can compare against once changes are in place. Then, as necessary, we can compare different configurations of Mycroft, new platforms and hardware for Mycroft, and our competition.

Over the last couple of weeks, we’ve been preparing and conducting a repeatable benchmark of Mycroft against other voice assistants in the field. This will be a new addition to the Mycroft Open Dataset: not tagged speech or intent samples, but a standard process and metrics that anyone can use to measure Mycroft and other voice assistants quantitatively. Below, I’ll report on the results of the first iteration, where we compared a Mycroft Mark I to a first-generation Amazon Echo and Google Home.

The Process

To conduct this benchmark, we had to put together a series of questions, which wasn’t as easy as it sounds. Because voice assistants are an emerging technology, no industry standards exist yet. So who better to set that standard than the Open player? We prepared a starter set of 14 questions based on the observed usage of Skills by Opted-In Mycroft users (more on that later), taking into consideration industry-reported most used Skills from places like Voicebot. That first run of questions was:

  1. How are you?
  2. What time is it?
  3. How is the weather?
  4. What is tomorrow’s forecast?
  5. Wikipedia search Abraham Lincoln
  6. Tell me a joke
  7. Tell me the news
  8. Say “Peter Piper picked a peck of pickled peppers”
  9. Set volume to 50 percent
  10. What is the height of the Eiffel Tower?
  11. Play Capital Cities Radio on Pandora
  12. Who is the lead singer of the Rolling Stones?
  13. Set a 2 minute timer
  14. Add eggs to my shopping list

That list has already evolved a bit to make the benchmark more objective. For one, this test was originally meant to check intent parsing along with response times; in the future, those will be split into separate benchmarks. For example, Google does not return any response when asked to “Wikipedia search” a topic, and both Mycroft and Alexa only accept volume adjustments between 0 and 10, so those requests needed rephrasing. For next time, the list will probably look more like this:

  1. Tell me about Abraham Lincoln
  2. What is the height of the Eiffel Tower?
  3. Who is the lead singer of the Rolling Stones?
  4. How is the weather?
  5. What is tomorrow’s forecast?
  6. Play Capital Cities Radio on Pandora
  7. Play Safe and Sound by Capital Cities on Spotify
  8. Set a 2-minute timer
  9. Set an alarm for tomorrow morning at 7:00
  10. What time is it?
  11. Tell me the news
  12. Add eggs to my shopping list
  13. Set volume to 5
  14. How are you?
  15. Tell me a joke
  16. Say/repeat/Simon says “[random sentence]”

To make sure we could accurately time each response, I recorded the session on video. I set the devices next to each other in the same room on the same Wi-Fi network and ran a network speed test on a laptop on the same network for reference. Once all that was taken care of, the requests began.

Issuing all the requests to each assistant took about 45 minutes. To get the best idea of when requests ended and responses started, I imported the audio into Audacity and used the waveforms to determine five points:

  • The Wake Word
  • The end of the request
  • The beginning of the response
  • The start of ‘real info’
  • The end of the response

About ‘Real Info’ – We wanted to see if the other assistants in the field might pad their responses with cached phrases to give more time to synthesize the real info of the response. This seems like an obvious way to improve the perception of a response. Hearing “Right now in Kansas City” – which can be easily pre-generated and cached to stream at the start of a weather response – certainly doesn’t detract from the experience. Though it does mean an extra second or so until you actually hear the temperature and weather. Deciding what is padding and when ‘real info’ starts is a subjective call right now, but we’ll be trying to define it well as things progress.
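
The five marked points reduce to a handful of intervals per request. Here is a minimal Python sketch of that bookkeeping, assuming each point is noted as a timestamp in seconds read off the waveform and that the intervals are measured from the end of the request. The `Marks` class, its field names, and the example timestamps are hypothetical and are not part of the benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Marks:
    """Timestamps (seconds into the recording) for one request."""
    wake_word: float
    request_end: float
    response_start: float
    real_info_start: float
    response_end: float

    def time_to_response(self) -> float:
        # End of the spoken request to the first audible response.
        return self.response_start - self.request_end

    def time_to_real_info(self) -> float:
        # End of the spoken request to the start of the actual answer.
        return self.real_info_start - self.request_end

    def padding(self) -> float:
        # How long the assistant speaks before the actual answer begins.
        return self.real_info_start - self.response_start


# Made-up timestamps for illustration only; not measured values.
example = Marks(
    wake_word=10.0,
    request_end=12.5,
    response_start=14.0,
    real_info_start=15.25,
    response_end=18.0,
)

print(example.time_to_response(), example.time_to_real_info(), example.padding())
```

Keeping the raw timestamps rather than only the intervals means the subjective ‘real info’ mark can be revised later without re-measuring everything else.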

The Results

Now to the good stuff, or in this case, the “room for improvement” stuff. Here are the results from the first Mycroft Benchmark.

Time to Response

One of the biggest points we wanted to track was the ‘Time to Response.’ In this context, that means the time from the end of the spoken request to the beginning of an audible response. We tracked that across the 14 questions using the new Mimic2 American Male voice. We found that Mycroft currently responds, on average, about 3.3x slower than Google and Amazon. On average for our sample, Alexa responded to requests in 1.66 seconds, Google Assistant responded in 1.45 seconds, and Mycroft in 5.03 seconds.
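
For readers who want to sanity-check the multiplier, here is a short Python sketch that derives slowdown ratios from the three averages above. The post does not say exactly how the 3.3x figure was calculated, so taking the mean of the two ratios is an assumption; the function and variable names are illustrative.

```python
# Average Time to Response in seconds, as reported above.
avg_response = {"Alexa": 1.66, "Google Assistant": 1.45, "Mycroft": 5.03}


def slowdown(baseline: str, subject: str = "Mycroft") -> float:
    """How many times slower `subject` is than `baseline`."""
    return avg_response[subject] / avg_response[baseline]


competitors = ("Alexa", "Google Assistant")
for name in competitors:
    print(f"Mycroft vs {name}: {slowdown(name):.2f}x")

# Assumption: summarize by averaging the two ratios.
ratios = [slowdown(name) for name in competitors]
mean_ratio = sum(ratios) / len(ratios)
print(f"Mean of the two ratios: {mean_ratio:.2f}x")
```

Computed this way, Mycroft comes out at about 3.0x slower than Alexa and about 3.5x slower than Google Assistant, which lands in the same neighborhood as the reported figure. Averaging per-question ratios instead would likely give a slightly different number.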

[Figure: Time to Response chart from the first Mycroft Benchmark]

Time to Real Info

The next thing we decided to track was when the voice assistant’s response actually began answering the question it was asked. As mentioned above, this is a subjective decision for the time being, but still offers some interesting data to look at. On average, Alexa started providing real info 3.02 seconds after the request finished. Google provided real info at 3.55 seconds. Mycroft started providing real info at 5.7 seconds.

[Figure: Time to Real Info chart from the first Mycroft Benchmark]

We can see that the graph is a good bit tighter here, and in one case, “Tell me the news,” Mycroft actually comes out on top. My presumption is that Mycroft’s competition is adding some phrasing to the beginning of responses that require API hits or pulling up a stream. The data also includes an outlier, Google’s response to the news query, which opened with a nearly 16-second notification about being able to search for specific topics or news sources. I also took a quick look at the time between the response starting and when the assistant provided Real Info. On average, Alexa spoke for 1.36 seconds before providing Real Info. Google Assistant spoke for 2.1 seconds before Real Info. Mycroft spoke for 0.66 seconds before providing Real Info.
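
The padding figures follow directly from the two sets of averages: time spoken before Real Info is the average Time to Real Info minus the average Time to Response. A quick Python check, assuming both averages cover the same set of questions (the dictionary names are illustrative):

```python
# Reported averages in seconds, both measured from the end of the request.
time_to_response = {"Alexa": 1.66, "Google Assistant": 1.45, "Mycroft": 5.03}
time_to_real_info = {"Alexa": 3.02, "Google Assistant": 3.55, "Mycroft": 5.70}

# Padding: how long each assistant speaks before the real info starts.
padding = {
    name: round(time_to_real_info[name] - time_to_response[name], 2)
    for name in time_to_response
}

for name, seconds in padding.items():
    print(f"{name}: {seconds:.2f} s of lead-in before Real Info")
```

The rounded averages give 0.67 seconds for Mycroft rather than 0.66; the small gap is most likely rounding in the published averages.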

Where to go from here

This benchmark was especially helpful in comparing Mycroft objectively to Google and Amazon. Eventually, we’ll be able to broaden it to others in the space. Now the trick is figuring out how to improve the experience, then return to this benchmark periodically to reassess.

For improvements to the experience, we have another source of metrics from which we’ll be able to get actionable information: the Mycroft Metrics Service.

Our Opted-In Community Members have timing information for their interactions with Mycroft anonymously uploaded to a database for analysis. This is how we determined the Mycroft Community’s most used Skills (that is, the Opted-In users’ most used Skills) for the 14 questions of the Benchmark. Apart from Skill usage, we have visibility into what steps are carried out in an interaction and how long each step takes. From there we can determine which steps of a Mycroft interaction take the longest, and work to speed them up or find creative improvements to the Voice User Experience.
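
To give a feel for that kind of analysis, here is a minimal Python sketch that averages per-step durations and ranks the steps from slowest to fastest. Everything in it is an assumption made for illustration: the record layout, the step names, and the durations are hypothetical placeholders, not the Metrics Service schema or real measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (interaction id, step name, duration in seconds).
# Placeholder values only; not data from the Mycroft Metrics Service.
records = [
    ("a", "wake_word", 0.25),
    ("a", "speech_to_text", 1.5),
    ("a", "intent_parsing", 0.25),
    ("a", "skill_handler", 1.0),
    ("a", "text_to_speech", 2.0),
    ("b", "wake_word", 0.25),
    ("b", "speech_to_text", 1.0),
    ("b", "intent_parsing", 0.75),
    ("b", "skill_handler", 0.5),
    ("b", "text_to_speech", 3.0),
]


def mean_duration_by_step(rows):
    """Group durations by step name and average each group."""
    by_step = defaultdict(list)
    for _interaction, step, seconds in rows:
        by_step[step].append(seconds)
    return {step: mean(values) for step, values in by_step.items()}


def slowest_first(rows):
    """Return (step, mean seconds) pairs, slowest step first."""
    means = mean_duration_by_step(rows)
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


for step, seconds in slowest_first(records):
    print(f"{step}: {seconds:.2f} s")
```

With real data, a median or a percentile would probably be a better summary than the mean, since a few slow network calls can skew an average.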

We’ll also revise the benchmark to be more explicit in comparing the timing of responses. It’s likely we’ll create one or more subjective measures for quality of response. As Skills expand, the number of questions will certainly expand too.

There’s also the question of where this information will live and be available to the community. The blog is a great place for explaining a new process but isn’t great for storing and displaying data. We’ve had some Skill Data published on GitHub since May. A repo and/or GitHub.io page will likely be the home of data, graphs, and more regular updates on Mycroft Metrics and Benchmarking. That will make it free and available for anyone to use, whether you’re comparing the speed of your local system to others, planning an improvement to Mycroft Core to speed up interactions, or creating a visualization for research. This data is Open and yours to use. Since that will take some time to set up, here is a Google Sheet to give you immediate access to the first round of data.

How can you help?

I’m so glad you asked! Like I mentioned, metrics come back only for Community Members who have Opted-In to the Open Dataset. So the best way to help is to Opt-In and use Mycroft! That way, we get a population of interactions that is as broad as possible. People on different networks, in different locations, using different devices, and interacting with Mycroft in different ways provide the best information for Mycroft and the community to make decisions on.

To Opt-In:

  • Go to home.mycroft.ai and Log In
  • In the top right corner of the screen, click your name
  • Select “Settings” from the menu. You’ll arrive at the Basic Settings page
  • Scroll to the bottom and once you’ve read about the Open Dataset, check “I agree” to Opt-In
  • That’s it!

Once you’ve done that you’ll not only be providing the metrics from your interactions, but also helping build STT, Wake Word spotting, and Intent Parsing for Mycroft. We always want to thank those members of our community who have Opted-In to help make AI for Everyone.

Have an idea to improve Mycroft’s metrics and benchmarking? Maybe a question on the process? Let us know on the forum.