We’re living through yet another paradigm shift in computing. For the most challenging problems facing computer science, traditional software development is being supplanted by machine learning. Machine learning is a programming technique that allows algorithms to become more accurate at predicting outcomes without being explicitly programmed. Instead, data is used to present examples to the learning algorithm, much like a child is taught by being told something over and over again. Teaching machines also relies on training data: lots and lots of it.
To give you an idea of the data needs, let’s look at one of the key voice technologies. Speech To Text (STT) can be tackled with a machine learning approach. Estimates and experiments suggest that approximately 10,000 hours of audio are required to get a decent STT engine. That is the equivalent of nearly 5 years of someone talking 40 hours a week.
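The arithmetic behind that estimate can be checked with a short Python sketch. The 10,000 hours and 40 hours a week come from the paragraph above; the 52 working weeks per year (no holidays or breaks) is an assumption added here for illustration.

```python
# Figures taken from the paragraph above.
AUDIO_HOURS_NEEDED = 10_000
TALKING_HOURS_PER_WEEK = 40

# Assumption: someone talks every week of the year, with no breaks.
WEEKS_PER_YEAR = 52


def years_of_talking(total_hours: float, hours_per_week: float) -> float:
    """Return how many years of talking it takes to produce total_hours of audio."""
    weeks = total_hours / hours_per_week
    return weeks / WEEKS_PER_YEAR


years = years_of_talking(AUDIO_HOURS_NEEDED, TALKING_HOURS_PER_WEEK)
print(f"{years:.1f} years")  # prints "4.8 years"
```

That works out to 250 weeks, which is where the "nearly 5 years" figure comes from.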
Though many of the machine learning frameworks are starting to be released as open source, the data underpinning them is not being released. Without data, a learning framework cannot solve complex problems. This means that companies without access to LOTS of data are getting left behind.
There are currently a handful of organizations with billions of customers who have agreed to allow their data to be used. This agreement was necessary to make use of the company’s services. This is what’s in those “Terms of Service” notices everyone routinely accepts without careful reading. Typically, the only option for a user who wishes for privacy is to simply stop using these services. This is true of Amazon, Google, Facebook and hundreds of others.
For anyone who does not have billions of customers, there is little hope of collecting the volume of data needed to solve problems using modern methods. This includes academic researchers, smaller businesses, startups, individuals working on their pet projects and, ironically, open source organizations that are respectful of user privacy.
This has been the Catch-22 in the open source world as we enter the machine learning era. Privacy is respected, yet the basic thing needed to allow individuals to have reliable, auditable, trustworthy technologies is data. There has been no ethical way to capture the kind and volume of data needed to allow such technologies to be created.
Here at Mycroft, we’re on the path to changing that – at least in the realm of human-machine interfaces and speech to text. In Mycroft Core 0.8.22, we added the ability for users to select LEARN on Mark 1 devices and contribute recordings of their device activations, which helps reduce false positives (inadvertent activations) and false negatives (missed activations). Now all users can choose to Opt-In to be part of the Mycroft Open Dataset and share some of their data to help us improve this technology.
All who Opt-In under their basic settings at https://home.mycroft.ai/#/setting/basic can later choose to Opt-Out, stopping not only future data contributions but also removing any previously contributed data from future datasets. We truly appreciate the help of all in building an AI for Everyone and aim to safeguard what has been entrusted to us.
We’re extremely aware that we need to protect our users’ privacy. Here at Mycroft we view privacy as a basic human right and will go to significant lengths to protect it. Before we publish any data, we plan to anonymize it securely and remove any private or personally identifiable information. How? Good question. We’re still working on the details, and won’t publish any data until they are settled. We may use differential privacy, paid reviewers or secure sandboxes. Regardless, our goal is to make user data secure while also making it available to improve the state of the art in machine learning and conversational interfaces.
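Of the options mentioned, differential privacy is the easiest to illustrate in a few lines. The sketch below shows the Laplace mechanism, a standard differential-privacy technique for publishing a count, and is not a description of how Mycroft will anonymize data. The function name, the example statistic, and the epsilon value are all hypothetical, and the approach covers aggregate statistics only, not raw audio.

```python
import random


def laplace_noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise (hypothetical helper, for illustration).

    Adding or removing one recording changes a count of recordings by at
    most 1, so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centered on 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed only so the example is repeatable
# Hypothetical statistic: how many contributed recordings were false activations.
print(laplace_noisy_count(true_count=1200, epsilon=0.5, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy. A real deployment would also need a cryptographically secure noise source rather than Python's default generator.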
Our commitment to our core principles – Fast, Open, Simple, Strong – is absolute. “Being open” is the principle that spans everything we do.