Voice interfaces hold so much promise because of their ease of use. But a truly intelligent voice agent should be able to listen and speak to you in your own language, right? Many are excited about the possibility of using the Mycroft open voice assistant with their native tongue.
Currently, Mycroft only officially works in English, with Community-driven efforts underway to support French, German, Italian, Portuguese, Spanish and Swedish. The exciting thing about the open source, Community-based approach of Mycroft is that translation can be done by more than just the individuals writing Skills. The users are empowered to band together and start support for their own language!
We’ve made a start by providing some initial documentation if you want to experiment with language support. Here, we break down how foreign language support must be implemented at each layer of the Voice Stack, and provide an overview of our language support roadmap.
In order for foreign language support to be useful, it needs to work across the entire Voice Stack. The Voice Stack is the combination of software components which, just like a layer cake, stack together to provide a voice service. Let’s take a look at them:
The Wake Word is the layer that tells the voice assistant to ‘wake up and start listening for commands’. It’s sometimes called a hot word. By default, the Wake Word on Mycroft Devices is ‘Hey Mycroft’. Initially Mycroft used PocketSphinx for Wake Words, but moved to the Precise Wake Word engine last fall.
PocketSphinx and Precise work in different ways. PocketSphinx maps phonemes – think of these as sound building blocks – to graphemes, which are the building blocks of written words. This way, PocketSphinx knows that the phoneme sequence HH EY M AY K R O F T matches the words “Hey Mycroft”. To learn more about the differences between phonemes and graphemes, this blog post is a great start.
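To make the phoneme-to-word idea concrete, here is a minimal sketch of a dictionary lookup. It is not how PocketSphinx is implemented: the tiny dictionary, the function name `phonemes_to_words` and the greedy longest-match strategy are all assumptions made for illustration.

```python
# Toy pronunciation dictionary: phoneme sequence -> written word.
# The entries are illustrative, not copied from a real PocketSphinx dictionary.
TOY_DICTIONARY = {
    ("HH", "EY"): "hey",
    ("M", "AY", "K", "R", "O", "F", "T"): "mycroft",
}


def phonemes_to_words(phonemes, dictionary=TOY_DICTIONARY):
    """Greedily match the longest known phoneme run at each position."""
    words = []
    i = 0
    while i < len(phonemes):
        for end in range(len(phonemes), i, -1):
            word = dictionary.get(tuple(phonemes[i:end]))
            if word is not None:
                words.append(word)
                i = end
                break
        else:
            return None  # no dictionary entry starts at this position
    return words


heard = "HH EY M AY K R O F T".split()
print(phonemes_to_words(heard))
```

A sound that has no entry in the dictionary makes the lookup fail, which is the same limitation you hit when a Wake Word needs phonemes the dictionary does not contain.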
In contrast, Precise works using a neural network. Learning from tagged samples of Mycroft users who have opted in to our open dataset, Precise is able to build an accurate model of the Hey Mycroft Wake Word. Building a model using samples from a wide variety of genders, accents and tones means that a man with a French accent saying Hey Mycroft will be recognized as well as an Australian woman, even though the phoneme sounds in these accents differ significantly. The downside of this approach, of course, is that the model needs a large, diverse dataset for training.
If you wish to use a Wake Word in a language other than English at the moment, you really have two choices, and each has benefits and drawbacks. First, you can set a custom Wake Word in your home.mycroft.ai account using the phonemes available in the English PocketSphinx dictionary. With this option you won’t be able to use phonemes that don’t occur in English – for example the “sch” sound in German, or the “xo” sound in Catalan. Second, if you want to use non-English phonemes, you will need to install a PocketSphinx dictionary in your chosen language, which is not for the faint-hearted and requires advanced Linux skills.
The Speech to Text (STT) layer is the part of the Voice Stack that transcribes what you say to the Mycroft device. Currently, Mycroft defaults to the excellent STT engine from Google, but anonymizes all the requests so any traffic is just seen by Google as ‘Mycroft’. Google STT supports several languages other than English, so if you speak a supported language you can edit your mycroft.conf file to try it.
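As a rough sketch of what such an edit might look like, the snippet below updates a configuration dictionary of the kind stored in mycroft.conf. The key names `lang` and `stt.module`, the value `google` and the helper `with_language` are assumptions for illustration; check the configuration documentation for the exact settings your version expects.

```python
import json


def with_language(config, lang_code, stt_module="google"):
    """Return a copy of a config dict with the language and STT module set."""
    updated = dict(config)
    updated["lang"] = lang_code  # assumed key name for the language code
    stt = dict(updated.get("stt", {}))
    stt["module"] = stt_module  # assumed key name for the STT engine
    updated["stt"] = stt
    return updated


existing = {"stt": {"module": "mycroft"}}
print(json.dumps(with_language(existing, "de-de"), indent=2))
```

The helper returns a new dictionary rather than changing the one passed in, so the original settings stay available if you want to switch back.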
We intend to move to DeepSpeech as our default STT layer in the future, and you can try DeepSpeech on Mycroft now. This offers many more options, including the potential to host DeepSpeech on your own server and, potentially, directly on your device. However, this is still young technology (currently version 0.3), and the only trained model so far is English. We are working with Mozilla to expand its language range in the future and to generalize the process to support every language!
Once the Speech to Text layer has turned spoken words into text we call it an Utterance. The Utterance is then run through our Intent Parsers layer. The role of an Intent Parser within the Voice Stack is to match an Utterance with the intended action in a specific Skill – that is, find the “Intent” of the user.
In the Mycroft Voice Stack, there are two different Intent parsing phases:
- Adapt: Keywords from the vocab files and patterns from the regex files of the Skills are combined into Intent rules and used to find text matches. This will generate a confidence score for any matching Intent. Flow of control is passed to the Skill with the highest Intent confidence score.
If Adapt can’t parse the Intent…
- Padatious: A neural network determines the confidence score based on Intent examples provided by Skills. Flow of control is still passed to the Skill with the highest Intent confidence score.
If neither Intent Parser finds a match, the flow of control is passed to a Fallback Skill like Wolfram|Alpha to handle the Utterance.
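The order of these phases can be sketched in a few lines. This is a toy model under stated assumptions, not the real Adapt or Padatious code: `adapt_like` scores plain keyword overlap, `example_based` substitutes word overlap with example phrases for the trained neural network, and the Skill names and the 0.5 threshold are made up.

```python
def adapt_like(utterance, skills):
    """Keyword phase: score is the fraction of a Skill's keywords present."""
    words = set(utterance.lower().split())
    best = None
    for skill, keywords in skills.items():
        score = len(words & keywords) / len(keywords)
        if score > 0 and (best is None or score > best[1]):
            best = (skill, score)
    return best


def example_based(utterance, examples):
    """Stand-in for a trained matcher: word overlap with example phrases."""
    words = set(utterance.lower().split())
    best = None
    for skill, phrases in examples.items():
        for phrase in phrases:
            target = set(phrase.lower().split())
            score = len(words & target) / len(words | target)
            if score >= 0.5 and (best is None or score > best[1]):
                best = (skill, score)
    return best


def route(utterance, skills, examples):
    """Try the keyword phase, then the example phase, then fall back."""
    match = adapt_like(utterance, skills) or example_based(utterance, examples)
    return match[0] if match else "fallback"


SKILLS = {"weather": {"weather"}, "news": {"news"}}  # hypothetical keywords
EXAMPLES = {"timer": ["set a timer", "start a timer"]}  # hypothetical examples

for text in ("what is the weather like", "please set a timer", "who was ada lovelace"):
    print(text, "->", route(text, SKILLS, EXAMPLES))
```

Because both phases work on words, every keyword and example phrase has to exist in the user's language before either phase can match anything.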
Before the Utterance is run through the intent parsers, a language-specific normalization occurs. Normalization cleans up the transcription, doing things like converting contractions to their expanded form (e.g. “What’s the weather like” becomes “What is the weather like”). This code must be added to mycroft-core itself for each new language. For example, normalization in Portuguese, which distinguishes between masculine and feminine forms of a word, would need to account for both masculine and feminine phrases.
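Contraction expansion is easy to picture with a small example. The sketch below is an assumption about how such a step could be written, not the normalizer in mycroft-core; the `CONTRACTIONS` table and the `normalize` function are hypothetical, and this version lowercases the words it expands.

```python
import re

# A few English contractions and their expanded forms (illustrative only).
CONTRACTIONS = {"what's": "what is", "it's": "it is", "don't": "do not"}


def normalize(utterance, contractions=CONTRACTIONS):
    """Expand known contractions in a transcribed Utterance."""
    text = utterance.replace("\u2019", "'")  # curly apostrophe -> straight
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, contractions)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: contractions[m.group(1).lower()], text)


print(normalize("What's the weather like"))
```

A new language needs its own table and often its own rules, which is why this step cannot simply be shared between languages.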
In the Voice Stack, the role of the Skill is to do the ‘heavy lifting’ and provide the user with the outcome they wanted – such as reporting the news or weather, or playing a piece of music.
The Mycroft skills system has supported multiple languages from the beginning. To support a new language, each Skill must translate three different pieces:
Independent directories within the Skill hold vocab for the various language codes. For example, a Skill written originally in English will have several files like vocab/en-us/Word.voc, with the English language pieces in the *.voc files. Adding German support involves creating vocab/de-de/Word.voc files holding the German version of the same words.
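To show how per-language vocab might be picked up, here is a minimal loader. It assumes a layout of one directory per language code under `vocab`, with one phrase per line in each `.voc` file; the function name `load_vocab` is hypothetical and this is not the loader Mycroft itself uses.

```python
from pathlib import Path


def load_vocab(skill_dir, lang):
    """Map each .voc file name (minus extension) to its list of phrases."""
    vocab = {}
    # Assumed layout: <skill_dir>/vocab/<lang>/<Name>.voc
    for voc_file in sorted(Path(skill_dir, "vocab", lang).glob("*.voc")):
        lines = voc_file.read_text(encoding="utf-8").splitlines()
        vocab[voc_file.stem] = [line.strip() for line in lines if line.strip()]
    return vocab
```

Calling `load_vocab(skill_dir, "de-de")` on a Skill without German files returns an empty dictionary, which is a quick way to see which Skills still need translating.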
A Skill might also use regular expressions in parsing, contained in its *.rx files. For a regex pattern match, not only do the words differ between languages, but the phrasing and placement of words change as well.
For example, let’s take the phrase “How’s the weather today”. In most European languages, the phrasing follows the structure “question – keyword – day”. However, in Turkish, note the two phrases:
“Bugün hava nasıl?” – “How’s the weather today?” (Literally, “Today weather how?”)
“Hava nasıl?” – “How’s the weather?”
The structure is “day – keyword – question”. This means that not only would regex files need to be rewritten to support Turkish, but the structure of the expressions needs to be changed as well. This process can be complicated further by languages which classify objects as masculine and feminine, because more regular expressions are required to cover all the cases needed to correctly identify an Intent.
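The difference in word order shows up directly in the patterns. The two expressions below are a sketch written for this post, not the contents of any real *.rx file; the `WEATHER_PATTERNS` table and `parse_weather` function are hypothetical.

```python
import re

# One pattern per language: the optional day sits at the end of the
# English phrase but at the start of the Turkish one.
WEATHER_PATTERNS = {
    "en-us": re.compile(
        r"^how(?:'s| is) the weather(?: (?P<day>today|tomorrow))?\??$",
        re.IGNORECASE,
    ),
    "tr-tr": re.compile(
        r"^(?:(?P<day>bugün|yarın) )?hava nasıl\??$",
        re.IGNORECASE,
    ),
}


def parse_weather(utterance, lang):
    """Return the weather Intent and optional day, or None if no match."""
    match = WEATHER_PATTERNS[lang].match(utterance.strip())
    if match is None:
        return None
    return {"intent": "weather", "day": match.group("day")}


print(parse_weather("How's the weather today?", "en-us"))
print(parse_weather("Bugün hava nasıl?", "tr-tr"))
print(parse_weather("Hava nasıl?", "tr-tr"))
```

Translating the words inside the English pattern would not be enough here, because the optional group has to move to the other end of the expression.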
Most Skills have lines of text spoken when a Skill completes a task or spoken when information is returned through an API. These Dialog files need to be changed for a new language.
Within some Skills, extra conditional processing may be required to handle new languages. For example, phrases that come back from an API in English may need to be translated into the target language, such as converting “cloudy” to “bewölkt” in German.
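One simple way to handle this is a lookup table per language. The sketch below is an assumption about how a Skill could do it; the `CONDITIONS` table and `localize_condition` function are hypothetical and hold only the single example from the text.

```python
# English API phrase -> translated phrase, keyed by language code.
CONDITIONS = {"de-de": {"cloudy": "bewölkt"}}


def localize_condition(condition, lang, table=CONDITIONS):
    """Translate an English API phrase, falling back to the original."""
    return table.get(lang, {}).get(condition.lower(), condition)


print(localize_condition("Cloudy", "de-de"))
print(localize_condition("Cloudy", "fr-fr"))
```

Falling back to the English phrase means a missing translation gives an imperfect answer rather than an error.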
A Skill will normally complete execution by speaking a line of Dialog to the user – like saying “the weather in Geelong today is clear skies and 22 degrees Celsius”. This is the Text to Speech layer of the Voice Stack, and its role is just that – to speak written information.
The default TTS engine used in Mycroft is Mimic. Mimic is currently available only in English, so if Mimic tries to speak foreign words, or words with diacritical marks (such as the ö sound in Swedish), then the pronunciation will be unnatural.
Mycroft, being modular, allows you to select other TTS engines. The Google TTS engine has more language options available. Again, if you want to configure this for your language, you need to edit your
Building a new Text to Speech engine like Mimic is very difficult, requiring expert level understanding of languages to build the phonetic mappings, plus generating the voice pieces for the synthesis.
As you can see, providing language support is no easy task. We are continually improving, and the following steps are part of building the tools to officially support more languages.
Once the current ‘Hey Mycroft’ Wake Word in Precise has an accurate model, we will be opening up the Precise Tagger to allow tagging of other Wake Words, including Wake Words in other languages. Since Precise itself is trained directly from recordings, it is already multi-language ready.
Estimates say DeepSpeech requires 10,000 hours of tagged samples to provide a workable STT model for a language. Both the English dataset and the machine learning code that runs DeepSpeech are still being built. Gathering 10,000 hours would be a huge amount of work for any individual, but spread over many collaborators it is a much more manageable task. We are creating the tools to collect and tag these training datasets as a community.
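A quick calculation shows why sharing the work matters. The contributor counts below are hypothetical, chosen only to illustrate the arithmetic; the 10,000 hour target is the estimate quoted above.

```python
TARGET_HOURS = 10_000  # estimated hours of tagged samples per language


def hours_per_contributor(contributors, target_hours=TARGET_HOURS):
    """Hours each person must supply if the work is split evenly."""
    return target_hours / contributors


for contributors in (1, 1_000, 10_000):  # hypothetical community sizes
    print(contributors, "contributors:", hours_per_contributor(contributors), "hours each")
```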
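Mycroft began with Python 2.7, whose default string type holds bytes rather than Unicode text, which made non-ASCII text awkward to handle. Unicode helps represent text in languages which do not use the Latin alphabet (the alphabet used to write English), as well as accented Latin characters. With the recent transition to Python 3, where strings are Unicode by default, this hurdle has been cleared.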
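A short Python 3 example shows what Unicode strings give you. It uses the German word from earlier in this post and is only an illustration, not code from Mycroft.

```python
word = "bew\u00f6lkt"  # "bewölkt", with ö written as an escape for clarity

# Python 3 strings count characters, not bytes.
assert len(word) == 7

# The UTF-8 encoding needs two bytes for ö, so the byte count is larger.
assert len(word.encode("utf-8")) == 8

# String methods understand non-ASCII letters too.
assert word.upper() == "BEW\u00d6LKT"

print(word, word.upper())
```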
We are putting resources into a new method for Text to Speech. Mimic 2 is based on a neural network and as such is “trained” rather than “programmed”. We estimate that it will take around 20 hours of recorded speech to yield a reasonable language model in nearly any language. These recordings need to be clear and of one single voice in order to accurately train the model, but this is still much easier than getting a Ph.D. in order to be able to build support in the original Mimic.
We are creating the first English-language voice and working out the kinks in the recording process. Recommendations and tools will soon be available for the Community to build their own voices.
Over the next few months, we’re creating a harvesting tool to identify all the vocab and dialog files within Mycroft Core and Skills that need to be translated. This will make it easy to see what needs to be done to support each language, and make it easier to keep up as new Skills are created and old ones change. Helping to bring Mycroft to your favorite language will then require no programming skill.
So what can you do right now to help advance language efforts?
- Help with Precise Tagging: The sooner we have the Hey Mycroft Wake Word training well, the sooner we can move on to training other Wake Words. You can tag Precise samples at home.mycroft.ai under “Tagging.”
- Opt-In to our Open Dataset: So that we can gather a diverse set of spoken samples with many voice types and accents, we need lots of people contributing to the Open Dataset.
- Have web-dev skills? Contact us! We’d love help with the harvesting project and associated web interfaces.
- Be patient! As you can see, multi-language support isn’t easy. But the Mycroft Community has the best potential to support not just the most profitable languages, but all of the languages.