The rush towards machine learning among big tech companies (and not only them) is pushing progress in a variety of domains.
How to handle privacy is one of them, and progress there can benefit other applications that use big data (or, more simply, private information) to give value back to the customer or the network.
As you might be aware, training machine learning models often requires a huge amount of data before the little robot starts seeing patterns and becomes capable of identifying, for instance, an object in an out-of-sample picture (i.e. a new picture).
We are far from the human capability of recognition, but luckily we are well equipped with a vast amount of data to compensate (take that, mother nature!).
This data, however, is often crowdsourced and contains sensitive information, and the models we train should not expose the private information contained in these datasets.
This is the foundation of the quest for differential privacy. I will not go into the details, as I just started looking into it after an A16z podcast and the recent announcements at the Google and Apple dev conferences. Below is what I have so far.
I don’t want to know, but I kind of need your data in some way
The fundamental concept that triggers the need for differential privacy is that, in an ideal scenario, the user of the data (the training model) does not need to know who owns any single piece of information, or even its true value, yet the collection as a whole gives enough clues to infer both.
Even if trustworthy, the user of the data ends up holding sensitive information that has to be protected in every part of the workflow (collection, storage, etc.), and that is a problem.
You don’t want to know, because if you know you are accountable for not telling everybody.
Differential privacy, a probabilistic concept, comes into play by messing up the collected data with noise (drawn from a given probability distribution) that cancels out once enough information is collected. In this scenario the user of the data can still get actual insights from the large dataset, because the noise cancels out (e.g. its distribution is centered at 0, so it averages out), while not having a clue about the individual data points. You are no longer a bank to rob, success! There are a few ways of doing that: adding the noise on the collecting device, or adding it once the data reaches the server. Both have pros and cons.
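To make the "noise cancels out" idea concrete, here is a minimal sketch of the on-device flavor. Everything in it is my own illustrative assumption rather than what Google or Apple actually ship: the noise is Laplace-distributed and centered at 0, each user's true value is a number between 0 and 1, and the names `laplace_noise` and `noisy_report` and the noise scale of 2.0 are made up for the example.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Zero-centered Laplace sample, built as the difference of two
    independent exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_report(true_value, scale, rng):
    """What a device would send to the server: the true value plus noise."""
    return true_value + laplace_noise(scale, rng)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical private values, one per user, each in [0, 1].
true_values = [rng.uniform(0.0, 1.0) for _ in range(100_000)]

# The server only ever sees the noisy reports.
reports = [noisy_report(v, scale=2.0, rng=rng) for v in true_values]

true_mean = statistics.fmean(true_values)
noisy_mean = statistics.fmean(reports)

# Average distance between a single report and the value behind it.
typical_individual_error = statistics.fmean(
    [abs(r - v) for r, v in zip(reports, true_values)]
)

print(f"true mean:                {true_mean:.4f}")
print(f"mean of noisy reports:    {noisy_mean:.4f}")
print(f"typical individual error: {typical_individual_error:.4f}")
```

With enough users the mean of the noisy reports lands close to the true mean, while any single report is off by an amount larger than the whole range of possible values, so it says very little about the person behind it.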
However, there is always the risk that someone attempts to reverse engineer your proprietary “smoke machine” and figures out the origin and true value of your data. Not good if you had some embarrassing high school pics on your phone.
This is not an easy business in general: the solution(s) need to balance the privacy budget, software complexity, the computational power required to generate noisy data and then clean it, training efficiency, and model quality, which is the reason you did all of this work in the first place.
Hence, you have to make your job hard enough that you do not know what you are doing, without screwing up the reason why you did the job in the first place (insight from the data).
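The tradeoff above can be sketched with a few numbers. I am assuming here the textbook Laplace mechanism, where the noise scale is the sensitivity divided by the privacy budget (usually written epsilon): a smaller epsilon means more privacy and more noise. The function `mean_abs_error`, the sensitivity of 1.0 and the chosen epsilons are all hypothetical, picked only to show the trend.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Zero-centered Laplace sample (difference of two exponentials)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_abs_error(epsilon, n_users, sensitivity=1.0, trials=200, seed=0):
    """Average error that the noise alone adds to an estimated mean.

    Assumes the Laplace mechanism: noise scale = sensitivity / epsilon.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    errors = []
    for _ in range(trials):
        noise = [laplace_noise(scale, rng) for _ in range(n_users)]
        errors.append(abs(statistics.fmean(noise)))
    return statistics.fmean(errors)


# Smaller epsilon = stronger privacy = noisier aggregate.
errors_by_epsilon = {
    eps: mean_abs_error(eps, n_users=1_000) for eps in (0.1, 1.0, 10.0)
}

for eps, err in errors_by_epsilon.items():
    print(f"epsilon={eps:>4}: average error on the mean = {err:.4f}")
```

Read it as: tightening the privacy budget costs accuracy, and the way to buy that accuracy back is more data (a larger `n_users`), which is exactly where the extra collection and compute costs come from.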
Now, if you want some further (technical) reading, I have been reading this article, and I will try to update or follow up on this post as my understanding of the topic grows.
I am interested in finding some synergies with the work I am doing on blockchain and selective privacy in public or permissioned blockchains. For instance, can we find a reliable way to open up a (blockchain) platform containing confidential information, getting the benefits of outsourcing creativity while preserving confidentiality?