
Privacy and Machine Learning | Our Open Data Set and Opt-In Feature

November 21, 2017


We’re living through yet another paradigm shift in computing. For the most challenging problems facing computer science, traditional software development is being supplanted by machine learning. Machine learning is a programming technique that allows algorithms to become more accurate at predicting outcomes without being explicitly programmed. Instead, data is used to present examples to the learning algorithm, much like a child is taught by being told something over and over again. Teaching machines likewise relies on training data, and lots of it.

How much data do we need?

To give you an idea of the data needs, let’s look at a key voice technology. Speech to Text (STT) can be tackled with a machine learning approach. Estimates and experiments suggest that approximately 10,000 hours of audio are required to get a decent STT engine. That is the equivalent of nearly 5 years of someone talking 40 hours a week.
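The "nearly 5 years" figure can be checked with a few lines of arithmetic. This is only a sketch: the 10,000 hours and 40 hours a week come from the text above, while the 52 weeks per year and the function name `years_of_talking` are assumptions made for illustration.

```python
# Rough check of the "nearly 5 years" estimate.
HOURS_NEEDED = 10_000   # audio hours for a decent STT engine (from the text)
HOURS_PER_WEEK = 40     # hours of talking per week (from the text)
WEEKS_PER_YEAR = 52     # assumption: no weeks off


def years_of_talking(total_hours, hours_per_week=HOURS_PER_WEEK,
                     weeks_per_year=WEEKS_PER_YEAR):
    """Return how many years it takes one person to produce total_hours of speech."""
    weeks = total_hours / hours_per_week
    return weeks / weeks_per_year


print(round(years_of_talking(HOURS_NEEDED), 1))  # prints 4.8
```

Allowing for a couple of weeks off per year would push the result closer to a full 5 years.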

Though many machine learning frameworks are now being released as open source, the data underpinning them is not. Without data, a learning framework cannot solve complex problems. This means that companies without access to lots of data are getting left behind.


How’s everyone else getting their hands on adequate data?

Currently, a handful of organizations have billions of customers who have agreed to allow their data to be used. This agreement was a condition of using the company’s services. It is what’s in those “Terms of Service” notices everyone routinely accepts without careful reading. Typically, the only option for a user who wants privacy is to stop using these services altogether. This is true of Amazon, Google, Facebook and hundreds of others.

Anyone who does not have billions of customers has little hope of collecting the volume of data needed to solve problems using modern methods. This includes academic researchers, smaller businesses, startups, individuals working on pet projects and, ironically, open source organizations that are respectful of user privacy.



Are privacy and training data mutually exclusive?

This has been the Catch-22 in the open source world as we enter the machine learning era. Open source projects respect privacy, yet the one thing needed to give individuals reliable, auditable, trustworthy technologies is data. There has been no ethical way to capture the kind and volume of data needed to create them.



A Cure for the Catch-22

Here at Mycroft, we’re on the path to changing that, at least in the realm of human-machine interfaces and speech to text. In Mycroft Core 0.8.22, we added a LEARN option on Mark 1 devices that lets users contribute recordings of their device activations to help reduce false positives (inadvertent activations) and false negatives (missed activations). Now all users can choose to Opt-In to the Mycroft Open Dataset and share some of their data to help us improve this technology.


All who Opt-In under their basic settings can later choose to Opt-Out, which stops future data contributions and also removes any previously contributed data from future datasets. We truly appreciate everyone’s help in building an AI for Everyone and aim to safeguard what has been entrusted to us.



Find the Open Data Set on our GitHub


We’re extremely aware that we need to protect our users’ privacy. Here at Mycroft we view privacy as a basic human right and will go to significant lengths to protect it. Before we publish any data, we plan to anonymize it securely and remove any private or personally identifiable information. How? Good question. We’re still working on the details, and won’t publish any data until they are settled. We may use differential privacy, paid reviewers or secure sandboxes. Regardless, our goal is to keep user data secure while also making it available to improve the state of the art in machine learning and conversational interfaces.


Our commitment to our core principles (Fast, Open, Simple, Strong) is absolute. “Being open” is the principle that spans everything we do.


Join us in building an open data set, so we can build an AI that’s truly for everyone.