We are proud to release the beta of version 2 of Mycroft’s Mimic Text to Speech technology! Mimic is now a deep learning based Text-to-Speech (TTS) engine trained on audio recordings from a single speaker. Our team member Kusal kindly volunteered to provide the vocals for the first iteration. He spent several weeks speaking predetermined phrases into a microphone.
Voice is a personal choice, and Mimic allows you to select or create a voice that fits your style and preference. To start using this new voice, change the Voice option in your Home settings to American Male (Beta). That’s it! Your devices will update to use the new voice after a minute. You can also tell Mycroft to “update your configuration.”
Mimic2 will be available on a three-month trial, after which it will become a feature for Mycroft Supporters.
Below is a technical explanation of Mimic. Read on if you’d like to see what’s happening under the hood.
The Mimic2 repo is a fork of Keith Ito’s open source implementation of the Tacotron architecture, which Google Research published last year. Keith was a huge help in getting us started, and we owe much of Mimic’s success to his excellent work.
Our initial implementation of Mimic used the concatenative approach, which relies on tiny audio recordings that are combined to form the speech. This process is labor intensive as it requires hardcoding different combinations of the audio clips to form words. The final output is clear but sounds emotionless and robotic.
The new Mimic uses deep learning, which generates higher quality speech than traditional concatenative Text-to-Speech systems. Using deep learning, we don’t need to hand engineer the features in speech; instead, we let the computer learn how to generate it. With enough processing power and data, computers are capable of learning features like intonation, tone, stress, and rhythm. These features make speech sound more human-like.
Simply put, the Mimic engine takes in a string of text and maps it to an audio output. Mimic was trained on a corpus of about 20,000 English sentences, which equates to about 16 hours of audio. During the recording process, Kusal had to speak clearly into a high-quality microphone in a noise-isolated environment. We spread the recording sessions over two weeks to prevent vocal fatigue. The clarity and the quality of the data are critical, as with any machine learning system. As they say: “garbage in, garbage out.”
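To put the corpus size in perspective, the two figures above imply an average clip length. The snippet below is just that arithmetic; both inputs are the approximate numbers quoted in this post, so the result is a rough estimate rather than a measured statistic.

```python
# Approximate figures quoted in the post.
NUM_SENTENCES = 20_000
TOTAL_HOURS = 16

total_seconds = TOTAL_HOURS * 3600
avg_seconds_per_sentence = total_seconds / NUM_SENTENCES

print(f"Total audio: {total_seconds} s")
print(f"Average clip length: {avg_seconds_per_sentence:.2f} s")
```

In other words, a typical training example is a short utterance of only a few seconds.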
This generation of Mimic is based on Tacotron, a groundbreaking and very successful neural network architecture for speech synthesis. It’s able to take a highly compressed source of information (text) and decompress it into audio. This process is complicated because the same text can correspond to different sounds and speaking styles. Because of the various nuances in spoken language, it is difficult to generate the appropriate output sound from the input text. Try speaking these simple sentences out loud and pay attention to the different pronunciations of the same word:
- I read a book last night; I like to read.
- The violinist took a bow after he dropped his bow.
- You have to use chopsticks to master their use.
- After you graduate you become a graduate.
Below is a concise technical explanation of the deep learning approach. It’s written to simplify the various methods in the neural network architecture so that they are easier to understand. If you’d like more in-depth details, you can read Google’s white paper on Tacotron.
As a high-level overview, the model takes in characters as input and outputs a raw spectrogram. An algorithm then transforms the raw spectrogram into audio waveforms. There are many neural network layers in this architecture that perform various functions. For conciseness, the layers are grouped into three main modules: an Encoder, a Decoder, and an Audio Reconstruction module.
Diagram of Mimic V2 architecture
The voice generation starts with the Encoder. A sentence is broken up into individual characters, which are represented as embeddings and fed into the Feature Extraction module. The Encoder uses this module to extract local and contextual information from the characters. The process helps define the various patterns in a sentence to aid in producing the sound of the audio output. For example, the “C” in “Chat” is pronounced differently than the “C” in “Cat.” Also, the intonation of a sentence would sound different if there is a question mark at the end versus a period. The output of the Feature Extraction module is essentially an abstract numerical representation of the various features in a sentence. These encoded text features are used to help generate the sound.
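As a rough illustration of the first step, here is a minimal sketch of turning a sentence into character embeddings. The vocabulary, the embedding size, and the random table are all assumptions made for the example; in a trained model the table is learned, and the actual symbol set used by Mimic may differ.

```python
import random

# Hypothetical character vocabulary; the real model's symbol set may differ.
VOCAB = list("abcdefghijklmnopqrstuvwxyz !'?,.")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}
EMBED_DIM = 4  # toy size chosen for readability

rng = random.Random(0)
# One vector per character. Random here; learned during training in a real model.
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def text_to_ids(text):
    """Lowercase the text and map each known character to its integer id."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


def embed(text):
    """Look up one embedding vector per character."""
    return [EMBEDDING_TABLE[i] for i in text_to_ids(text)]


vectors = embed("Chat?")
print(len(vectors), "characters ->", len(vectors[0]), "numbers each")
```

Note that the lookup alone gives the “c” in “chat” and the “c” in “cat” the same vector. Telling them apart is the job of the Feature Extraction module, which looks at neighboring characters.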
After making a pass through the Encoder, the output of the Feature Extraction module is fed into the Decoder. The Decoder’s job is to generate a mel spectrogram from the encoded text features. A spectrogram is a pictorial representation of sound: it shows how much energy each frequency carries over time. A mel spectrogram is a form of spectrogram whose frequency axis is scaled to match how the human ear perceives pitch.
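To make the mel scale concrete, here is a small sketch of one widely used Hz-to-mel conversion formula. Several variants of the mel scale exist, and this post does not say which one the Mimic code uses, so treat the constants below as an assumption.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 1000 Hz step covers fewer mels at high frequencies than at low ones.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(f"1000->2000 Hz: {low_step:.0f} mels, 7000->8000 Hz: {high_step:.0f} mels")
```

The shrinking step size reflects the fact that listeners distinguish low pitches more finely than high ones, which is why the mel scale spends more of its resolution on low frequencies.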
The Frame Prediction module (FPM) produces the raw mel spectrogram by recursively generating the frames. The information from the Encoder output is used to aid in the generation process. A method in the FPM called the Attention Mechanism helps align each character with its corresponding sound. As you can see in the diagram, the decoding process takes many repeated steps. The number of steps depends on the audio length. The Encoder output is fed into the FPM to start the decoding process.
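The idea behind attention can be sketched in a few lines: score every encoded character against the decoder’s current state, turn the scores into weights that sum to one, and take the weighted average. This sketch uses plain dot-product scoring for brevity, which is a simplification; the Tacotron paper describes a tanh-based scoring function, and the vectors below are made up for the example.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_outputs):
    """Dot-product attention: return (weights over characters, context vector)."""
    scores = [sum(q * k for q, k in zip(query, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [
        sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
        for d in range(dim)
    ]
    return weights, context


# Three made-up encoded characters and a decoder state that resembles the first.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = attend([4.0, 0.0], encoder_outputs)
print([round(w, 3) for w in weights])
```

At each decoding step the weights shift toward the character currently being spoken, which is how the model keeps text and sound aligned.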
The FPM produces two things: a mel spectrogram frame, which contains information on what the generated audio sounds like, and an abstract numerical representation of the state of the FPM, which we’ll call the internal state. The internal state contains information that is critical to generating the next frame, such as the text representation from the Encoder and the data from the previous decoding steps. These outputs are necessary because, to generate the next sound in speech, it’s beneficial to know the prior sounds generated and the information that generated those sounds. In reality, the FPM is a lot more complicated, but we’ll stick to this explanation for the sake of simplicity. The FPM combines those two outputs, passes them on to the next decoding step, then does its job again. The Decoder repeats this process, building out the mel spectrogram frame by frame, with each step taking in information from the previous step. After the final mel spectrogram output, the whole sentence is represented in speech form.
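The loop described above can be sketched as follows. The `predict_frame` function is a hypothetical toy stand-in for the FPM, not a neural network, and the fixed step count and all-zero starting frame are assumptions made to keep the example small. Only the control flow is the point: each step consumes the previous frame and state and emits the next ones.

```python
def predict_frame(encoder_output, prev_frame, state):
    """Toy stand-in for the FPM (hypothetical): returns (next_frame, new_state)."""
    summary = sum(encoder_output) / len(encoder_output)
    # Blend the previous frame with a summary of the encoded text.
    frame = [0.5 * p + 0.5 * summary for p in prev_frame]
    new_state = {"step": state["step"] + 1}
    return frame, new_state


def decode(encoder_output, num_steps, n_mels=3):
    """Build a spectrogram frame by frame, feeding each output back in."""
    frame = [0.0] * n_mels  # assumed all-zero starting frame
    state = {"step": 0}
    frames = []
    for _ in range(num_steps):
        frame, state = predict_frame(encoder_output, frame, state)
        frames.append(frame)
    return frames


spectrogram = decode([1.0, 1.0], num_steps=4)
for row in spectrogram:
    print(row)
```

Because every frame depends on the one before it, the frames cannot be generated in parallel, which is one reason longer sentences take longer to synthesize.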
The final step in the speech generation process is the Audio Reconstruction module. This module has two main components: the post-processing net and the Griffin-Lim algorithm. The full mel spectrogram generated by the Decoder is fed into the post-processing net, which transforms it into a linear spectrogram. This step is crucial for two reasons. First, the output needs to be converted from a mel spectrogram to a linear spectrogram before it can be reconstructed. Second, during the decoding process, the FPM makes mistakes in generating the mel spectrogram frames, and the post-processing net can learn to correct these mistakes. Once the post-processing net generates the linear spectrogram, the Griffin-Lim algorithm is applied to it. Griffin-Lim is a reconstruction algorithm that takes linear spectrograms and turns them into audio waveforms. The linear spectrogram does not contain information on the phase, which describes where each frequency component is within its cycle at a given moment. Griffin-Lim estimates the phases in a waveform from the spectrogram, but it’s not perfect. Thus each output waveform has some form of phase distortion. Phase distortion is why the voice can sound as if it’s omnipotent.
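To see why phase has to be estimated at all, consider the small demonstration below. It is not Griffin-Lim itself, only an illustration of the problem Griffin-Lim addresses: two different waveforms can have exactly the same magnitude spectrum, so the magnitudes alone cannot tell you which waveform to rebuild. The signal length and test tone are arbitrary choices for the example.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (fine for a tiny example)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitudes(signal):
    """Magnitude spectrum: what a spectrogram column keeps (phase is dropped)."""
    return [abs(c) for c in dft(signal)]


N = 16
# The same tone twice, shifted by a quarter of a cycle.
cosine = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
sine = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]

same_magnitude = all(
    abs(a - b) < 1e-9 for a, b in zip(magnitudes(cosine), magnitudes(sine))
)
same_waveform = all(abs(a - b) < 1e-9 for a, b in zip(cosine, sine))
print("same magnitudes:", same_magnitude, "| same waveform:", same_waveform)
```

Griffin-Lim works around this by starting from a guess at the phases and iteratively refining it until the resulting waveform is consistent with the given magnitudes.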
That’s it! That’s a simplified version of how Mimic takes in text and transforms it to audible speech. Mimic Text to Speech is in ongoing development, as it is not yet indistinguishable from a human. A few community members have asked about the possibility of using Tacotron2, Google’s shiniest TTS architecture. Architecturally, some argue Tacotron2 is more straightforward than Tacotron, but the most significant difference is the use of WaveNet in the Audio Reconstruction module. WaveNet produces exceptionally human-like speech quality, but the tradeoff is that the compute time is not practical. We’ve tried WaveNet on a Tesla K80, which Google Colab provides, and it took 13 minutes to generate 0.9 seconds of audio. While it sounds excellent, this is not practical for today. Other implementations hold some promise, such as Fast Wavenet. We will continue to develop and share our open source Mimic implementation while watching out for new developments that we can incorporate from this rapidly progressing space.
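For a sense of scale, the WaveNet timing quoted above can be expressed as a real-time factor, meaning seconds of compute per second of generated audio. The snippet is only arithmetic on the two numbers from our test; a voice assistant needs this ratio to be around 1 or lower to respond without a noticeable delay.

```python
# Figures from the WaveNet test described in the post.
generation_seconds = 13 * 60
audio_seconds = 0.9

real_time_factor = generation_seconds / audio_seconds
print(f"Real-time factor: about {real_time_factor:.0f}x slower than real time")
```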
Michael leads the ML team in research and development of technologies that involve speech recognition, speech synthesis, natural language processing, data acquisition, and ML software systems.