International Open Data Day was Saturday, March 2, 2019.
You may not have had a public holiday, or a big house party, but we think this is still one worth celebrating!
The Open Knowledge Foundation defines open data as that which “can be freely used, modified, and shared by anyone for any purpose”.
This can include broad information from websites, structured data sets such as those published by government bodies, or data sets intentionally compiled by communities like openstreetmap.org.
Just as with “open source”, how you can use that data depends on the license applied by its creator.
Many of the components that make up Mycroft require large data sets. Large established companies collect this data from their users, often without the users’ knowledge, via unintelligible End User License Agreements. They also keep the data totally inaccessible to researchers outside of their organisation, which gives those companies an incredible advantage in the machine learning world.
Open data sets, on the other hand, make it possible for new companies to innovate in this space, and to do so knowing they are using data that is explicitly licensed for reuse. Thanks to a range of open speech data sets (such as LJ, Blizzard, or CMU), anyone can train their own voices for use in speech synthesis or Text-To-Speech (TTS) systems. There are similar data sets available for use in speech recognition, or Speech-To-Text (STT) systems.
Mycroft’s Skills also use a whole range of open data each time you make a request:
“Hey Mycroft, what’s the weather forecast?”
OpenWeatherMap.org has your back.
“Hey Mycroft, how do I make a margarita?”
Let’s check TheCocktailDB.com.
“Hey Mycroft, what is a recurrent neural network?”
Here’s a very short summary courtesy of Wikipedia.
Every day, human beings all around the world are creating and improving open data, and making it more available. I am deeply appreciative of these contributions to collective human knowledge. It is impossible to predict all the ways that open data will be used. The only way to know is to put it out there and see what happens.
Mycroft has benefited greatly from all of this open data, and where appropriate we also like to pay it forward. Our code is open source, and we work closely with other organisations like Mozilla to create the next generation of open data sets.
The benefits of openness must, however, be balanced with the need for privacy and user choice. Our strong stance on this principle is why our company exists, and it is one of the key reasons our users trust Mycroft.
That is why we use an “Opt-In” data collection model. Unless you explicitly give Mycroft permission to use your data, we never keep it around.
The easiest way to help, if you’re happy to share some data for the common good, is to Opt-In to the Open Dataset through your personal settings page at home.mycroft.ai. Opting in grants Mycroft permission to retain data, such as samples of what you say to your device. These samples are then de-identified before being added to a data set. Should you choose to Opt-Out at a later date, any data originating from your account will be deleted.
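For readers who think in code, the policy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mycroft's actual implementation: the `OptInStore` class, its method names, and the use of a SHA-256 hash as a stand-in for de-identification are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class OptInStore:
    """Hypothetical in-memory store illustrating an opt-in retention policy."""

    opted_in: set = field(default_factory=set)
    samples: dict = field(default_factory=dict)  # pseudonym -> list of samples

    @staticmethod
    def _pseudonym(account_id: str) -> str:
        # Assumption: a one-way hash stands in for de-identification here.
        # Real de-identification involves far more than hashing an ID.
        return hashlib.sha256(account_id.encode("utf-8")).hexdigest()

    def opt_in(self, account_id: str) -> None:
        self.opted_in.add(account_id)

    def record(self, account_id: str, sample: str) -> bool:
        """Keep the sample only if the account has opted in."""
        if account_id not in self.opted_in:
            return False  # nothing is retained without permission
        key = self._pseudonym(account_id)
        self.samples.setdefault(key, []).append(sample)
        return True

    def opt_out(self, account_id: str) -> None:
        """Withdraw permission and delete everything from this account."""
        self.opted_in.discard(account_id)
        self.samples.pop(self._pseudonym(account_id), None)


store = OptInStore()
store.record("alice", "what's the weather forecast?")  # ignored, not opted in
store.opt_in("alice")
store.record("alice", "how do I make a margarita?")    # retained
store.opt_out("alice")                                  # sample is deleted
```

The point of the sketch is the default: with no explicit permission, `record` keeps nothing, and opting out removes whatever was previously stored for that account.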
The Mycroft community have contributed over 80,000 translations so far, which amounts to an average of 3,600 each week. These translations are available under the Apache 2.0 license, just like our code. If you are bilingual, please help us bring voice-interactive technology to more corners of the globe by joining our translate platform.
Common Voice is Mozilla’s speech recognition initiative. It aims to provide a speech-to-text engine that is open and accessible to everyone. Providing recordings of your voice adds to the diversity of their data, whilst validating the recordings of others improves its accuracy. These two contributions together enable better training and ultimately a solution that works better for us all. This of course aligns with our goals at Mycroft, so we work closely with Mozilla on their voice projects. Visit voice.mozilla.org and see how you can help out anytime you have a few spare minutes.