
Grokotron: STT on the edge

January 9, 2023

Mycroft AI’s primary mission has always been to create a truly privacy-respecting voice assistant: a personal assistant rather than a household spying device, one that does what you want it to do rather than what the mega-corporation that sold it to you wants it to do. One of the greatest challenges for us has been the lack of a fast, accurate, flexible Speech to Text (STT) engine that can run locally. While the product is still in the early days of development, we believe we finally have an answer to this problem. We call it Grokotron.

For a voice assistant like Mycroft, speech recognition must be performed very quickly and with a high degree of accuracy. Improvements on both fronts are one of the reasons that voice interfaces have exploded in recent years. When it comes to automatic speech recognition, the differences between 80%, 90%, 95% or even higher accuracy may sound like small potatoes, but they are absolutely game changing for how usable a system is in the real world.
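To get a feel for why those few percentage points matter, here is a rough back-of-the-envelope sketch. It assumes the accuracy figure is per word and that errors on each word are independent, which is a simplification and not how any particular engine is scored. The function name and the six-word command length are made up for illustration.

```python
def utterance_success_rate(word_accuracy: float, num_words: int) -> float:
    """Chance that every word of an utterance is transcribed correctly,
    assuming each word is an independent trial (a simplifying assumption)."""
    return word_accuracy ** num_words


# Compare a few per-word accuracies for a hypothetical six-word command.
for accuracy in (0.80, 0.90, 0.95):
    rate = utterance_success_rate(accuracy, 6)
    print(f"{accuracy:.0%} per word -> {rate:.0%} of 6-word commands fully correct")
```

Under these assumptions, a six-word command comes through fully intact roughly 26%, 53% and 74% of the time respectively, so moving from 80% to 95% per word nearly triples the share of commands that are understood the first time.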

We’ve tried a lot of local STT options over the years, and while there’s been incredible work going into many projects, unfortunately nothing has come close to providing the level of experience we think is required for a general purpose voice assistant.

For this reason, by default Mycroft has used Google’s STT cloud services and layered on some additional privacy protections. We proxy the requests through Mycroft’s servers and delete identifying data related to these requests as soon as possible. (You can read more about that here.) But as much as we try to mitigate the privacy exposure inherent in such a system, this has always been a stopgap solution – a necessary evil in order to provide a quality voice experience.

We want Grokotron, our new STT module (based on the great work done on the Kaldi project), to break this reliance. It is not yet ready to replace big data cloud services for all users and all use cases, but we have big plans for it and look forward to it becoming a viable replacement for cloud services for those who want a zero-trust privacy solution.

Grokotron provides limited-domain automatic speech recognition on low-resource hardware like the Raspberry Pi 4 that comes in the Mark II. It does this extremely quickly, and of course completely offline. Grokotron’s impressive accuracy and performance are due to its hybrid nature. It includes both an acoustic model and a grammar of expected expressions that constrains its transcription. This grammar is easy to define and extend with a simple markup language. Because it can be expanded so easily, the range of expressions Grokotron can process, while limited, can be quite large and can be practically extended to cover nearly anything a voice assistant needs.
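To make the idea of a grammar of expected expressions a little more concrete, here is a toy sketch in Python. The template syntax used here, with `(a | b)` for alternatives and `[word]` for optional parts, is a hypothetical stand-in and not Grokotron’s actual markup (see the Grokotron documentation for that). The `expand` function is likewise invented for illustration. It simply lists every sentence a template allows.

```python
import re

# Split a template into brackets, pipes, and plain words.
TOKEN = re.compile(r"[()\[\]|]|[^\s()\[\]|]+")


def expand(template: str) -> list[str]:
    """List every sentence allowed by a template in the toy syntax."""
    tokens = TOKEN.findall(template)
    pos = 0

    def parse_alternatives(closer):
        # Parse "sequence | sequence | ..." up to `closer`,
        # or up to the end of the template when `closer` is None.
        nonlocal pos
        results = []
        while True:
            results.extend(parse_sequence())
            if pos < len(tokens) and tokens[pos] == "|":
                pos += 1
                continue
            break
        if closer is not None:
            if pos >= len(tokens) or tokens[pos] != closer:
                raise ValueError(f"expected {closer!r} in template")
            pos += 1
        elif pos < len(tokens):
            raise ValueError(f"unexpected {tokens[pos]!r} in template")
        return results

    def parse_sequence():
        # Parse consecutive words and groups; each phrase is a list of words.
        nonlocal pos
        phrases = [[]]
        while pos < len(tokens) and tokens[pos] not in ("|", ")", "]"):
            token = tokens[pos]
            pos += 1
            if token == "(":
                choices = parse_alternatives(")")
            elif token == "[":
                # An optional group may also contribute nothing at all.
                choices = parse_alternatives("]") + [[]]
            else:
                choices = [[token]]
            phrases = [phrase + choice for phrase in phrases for choice in choices]
        return phrases

    return [" ".join(words) for words in parse_alternatives(None)]


for sentence in expand("what is the weather [like] (today | tomorrow)"):
    print(sentence)
```

In this toy picture, one short template stands for four sentences, and a recognizer limited to such a set only has to decide which of the expected sentences it heard rather than search over all possible English.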

So whilst it won’t yet be transcribing your original space opera screenplay about an invasion led by the first Pontifex Dvorn, it can understand all of your requests to check the weather, set a timer, or even play different music.

To show this in action, we wanted to share a complete proof of concept image. This is a Mark II Sandbox image running the new Dinkum software with a couple of tweaks.

  1. It has Grokotron pre-configured for STT.
  2. It does not need to be connected to the internet to function.
  3. It has our backend pairing completely disabled, so even if you do connect to the internet, it won’t touch our servers.

Because this image is designed to run completely offline, functions normally provided by our backend are not available, including paid APIs like the weather and Wolfram Alpha. Some settings normally configured on the backend, such as the device’s location, must also be set manually within your mycroft.conf. See the Grokotron documentation for details.

The grammar pre-configured on this image does not yet cover all expressions which Mycroft’s core intent system can understand; however, it is straightforward to update the grammar and retrain the model on-device. Details on the sentence template syntax and training commands can also be found in the Grokotron documentation.

A Mycroft system already knows the majority of utterances that it is expecting to hear. These strings form the basis of both intent matching and integration test cases. A future optimization would be to reduce duplication of these definitions and have Grokotron use them to provide a local-only grammar model for any Skill that gets installed.
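As a rough illustration of that future optimization, the sketch below gathers example utterances from installed Skills so they could be handed to a grammar builder. It assumes each Skill keeps its example sentences one per line in `*.intent` files under `locale/<lang>/`; that layout, the `collect_utterances` name, and the sample Skill folder are assumptions for illustration, not a description of how Grokotron is wired up today.

```python
import tempfile
from pathlib import Path


def collect_utterances(skills_dir: Path, lang: str = "en-us") -> dict[str, list[str]]:
    """Map each skill folder name to the example sentences in its .intent files."""
    utterances: dict[str, list[str]] = {}
    # Assumed layout: <skills_dir>/<skill>/locale/<lang>/<name>.intent
    for intent_file in sorted(skills_dir.glob(f"*/locale/{lang}/*.intent")):
        skill = intent_file.relative_to(skills_dir).parts[0]
        for line in intent_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:  # skip blank lines
                utterances.setdefault(skill, []).append(line)
    return utterances


if __name__ == "__main__":
    # Build a throwaway skill folder so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        locale = Path(tmp) / "timer-skill" / "locale" / "en-us"
        locale.mkdir(parents=True)
        (locale / "start.intent").write_text(
            "set a timer\nstart a timer\n", encoding="utf-8"
        )
        print(collect_utterances(Path(tmp)))
```

The appeal of this design is that a Skill author would write each sentence once, and the same definitions could serve intent matching, tests, and speech recognition.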

Even big cloud STT systems have trouble with proper nouns. Media libraries are a classic challenge here. Beyoncé is only known because of how popular she is, but how about Ke$ha, or Urthboy? These names are trained into cloud-based models courtesy of partnerships with streaming media providers, but for open source tools these terms have traditionally been a bridge too far. Grokotron can use entity lists to define exactly the names it needs to recognize for each individual user’s case, which can go a long way toward mitigating this problem. Even better, such lists can be compiled and the model efficiently retrained on the fly. For instance, on ingestion of a music library, artist names could automatically be compiled into Grokotron’s grammar. This is just one feature we plan to work on to make Grokotron the best local STT system out there.
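Here is a small sketch of what compiling artist names into a grammar could look like. The `{artist}` slot syntax, the `fill_slots` function, and the sample library entries (including the placeholder track titles) are all hypothetical and only meant to illustrate the idea; the real entity list format is described in the Grokotron documentation.

```python
import re

# A slot looks like {name} in this toy syntax.
SLOT = re.compile(r"\{(\w+)\}")


def fill_slots(template: str, entities: dict[str, list[str]]) -> list[str]:
    """Return every sentence produced by replacing each slot with each listed value."""
    match = SLOT.search(template)
    if match is None:
        return [template]
    name = match.group(1)
    if name not in entities:
        raise KeyError(f"no entity list for slot {name!r}")
    prefix, rest = template[:match.start()], template[match.end():]
    # Fill the first slot, then recurse on the remainder of the template.
    return [
        prefix + value + tail
        for value in entities[name]
        for tail in fill_slots(rest, entities)
    ]


# A made-up music library; only the artist field matters here.
library = [
    {"artist": "Ke$ha", "title": "Track 1"},
    {"artist": "Urthboy", "title": "Track 2"},
    {"artist": "Beyoncé", "title": "Track 3"},
    {"artist": "Urthboy", "title": "Track 4"},
]

# De-duplicate artist names on "ingestion" of the library.
artists = sorted({track["artist"] for track in library})

for sentence in fill_slots("play music by {artist}", {"artist": artists}):
    print(sentence)
```

Because the list is built from the user’s own library, the grammar only ever contains names that user can actually ask for, however obscure they are.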

Without further ado, you can find the first Grokotron image here:

Download Grokotron

Grokotron Documentation