#11 Privacy in a data-driven world

I do not get the point of a separate always-on device such as Alexa. My sister has one, and she uses it to play nursery rhymes for her kid. Devices like Alexa, which perform only some processing on-device (like wake-word detection), are a goldmine for data collection. With growing noise around the privacy of such devices, and with incidents such as intimate audio recordings from Alexa and Google Assistant being leaked to other customers and employees listening to recorded conversations, companies are starting to focus more and more on privacy. At least, it appears so.

Take the example of mobile devices. The use cases for data-driven optimization are manifold: from simpler things like grouping your photo gallery by faces, to suggesting emojis and GIFs based on the context being typed, to suggesting apps and actions based on your movements and activity on your phone. This allows companies such as Apple (the first large tech adopter of a technique known as differential privacy) to gather aggregated usage data without compromising the privacy of individual users.

The problem is simple. Companies cannot afford to miss out on the billions of data points being generated on their devices around the world. How do they use this data to improve their models while maintaining the privacy of their users? Additionally, the devices are constrained in computation power and battery, so a lot of heavy on-device computation is out of the question.

It is important to first point out how a breakdown of privacy generally occurs. A service can leak personally identifiable information (PII) either by mistake or due to a security vulnerability such as a data breach. A service might collect so much 'anonymous' information that it can be used to fingerprint a user, even without a single piece of exact personal information being stored. Browser-based fingerprinting, which relies on the uniqueness of a combination of browser and system attributes, has been used to augment cookie-based tracking for some time. Even with anonymized datasets, there is the possibility of another dataset becoming available later which can be used to correlate, say, usage patterns and break down the privacy veil. In a now often-quoted paper, researchers deanonymized the famous Netflix Prize dataset using an auxiliary IMDb dataset to identify users and then infer other attributes such as political inclinations. Similar attacks have been demonstrated in the past on anonymized medical-records datasets offered for sale.

Linkage attacks: Or why data anonymization does not work

There is an entire market for 'anonymized' datasets. While these datasets are quite useful for research, companies such as Cambridge Analytica have in the past been customers of many of these providers (possibly with the intent to deanonymize). A linkage attack, in its simplest form, deanonymizes a dataset by using some additional information, usually obtained through another channel. For a simplistic example, one anonymized release of a dataset with 100 attributes might expose only 5 of the attributes for a subset of the records, while another anonymization of the same dataset might expose a different subset of records and attributes. Combining several of these 'anonymized' releases can reconstruct almost the entire original dataset. The Netflix example mentioned above is another such case, where external background information was used. Sometimes a dataset that does not even exist at the time of anonymization, but becomes available in the future, can break the anonymity.
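
To make the idea concrete, here is a minimal sketch of a quasi-identifier linkage in Python. The two toy tables and every value in them are made up for illustration; the point is only that once two datasets share a handful of attributes such as zip code, birth date, and sex, the 'attack' is nothing more than a join.

```python
import pandas as pd

# 'Anonymized' health data: names stripped, quasi-identifiers kept. (Toy data.)
health = pd.DataFrame({
    "zip": ["02138", "02139", "02141"],
    "birth_date": ["1970-07-31", "1985-01-12", "1992-03-03"],
    "sex": ["F", "M", "F"],
    "diagnosis": ["hypertension", "diabetes", "asthma"],
})

# Public auxiliary data, e.g. a voter roll, with the same quasi-identifiers. (Toy data.)
voters = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "zip": ["02138", "02139", "02141"],
    "birth_date": ["1970-07-31", "1985-01-12", "1992-03-03"],
    "sex": ["F", "M", "F"],
})

# The linkage attack is just a join on the shared attributes.
reidentified = voters.merge(health, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```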

Differential Privacy

Differential privacy is a way to prevent linkage attacks, remarkably even against additional datasets that might become available in the future. Differential privacy also mathematically quantifies the loss of privacy, introducing the concept of a privacy budget. Apple places a limit on the amount of data that various Apple services can collect for a user, preserving privacy while still getting useful data to work with. For example, there is a strict limit of a single emoji-usage submission per user per day sent back to their servers. It is now well understood that answering too many queries on a dataset too accurately will eventually compromise the privacy of the individuals in it, even if the attacker only has access to a limited query interface, and that a certain amount of noise (random data) needs to be added to the answers to prevent this breakdown of privacy.

It is worthwhile to introduce the notion of 'randomized response', a technique used to collect information about questions that might be illegal or embarrassing to answer in a survey, where respondents would otherwise either lie or refuse to respond. Randomized response introduces the idea of deniability: you cannot be incriminated by your answer, because a randomization step stands between you and it. For any given question, a respondent flips a coin and responds truthfully if the coin turns up heads. If it does not, they perform a second coin flip and respond 'Yes' on tails and 'No' on heads. Note that while we are adding noise to the data, we know and can quantify the amount of noise that has been added, and can correct for it in the aggregate.
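
Here is a minimal simulation of that exact coin-flip scheme; the population size and the 30% true rate are made-up numbers. Since P('Yes') = 0.5 × true rate + 0.25, the aggregate can be debiased even though no individual answer reveals anything with certainty.

```python
import random

def randomized_response(truth):
    """One respondent's answer under the two-coin scheme."""
    if random.random() < 0.5:         # first coin: heads -> answer truthfully
        return truth
    return random.random() < 0.5      # second coin: tails -> 'Yes', heads -> 'No'

def estimate_true_rate(answers):
    """Debias the aggregate, using P(Yes) = 0.5 * true_rate + 0.25."""
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5

# Simulate 100,000 respondents, 30% of whom would truthfully answer 'Yes'.
population = [random.random() < 0.3 for _ in range(100_000)]
answers = [randomized_response(t) for t in population]
print(round(estimate_true_rate(answers), 3))  # close to 0.30 in aggregate
```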

In a similar vein, differential privacy provides a strong, quantifiable guarantee by introducing randomness. It is important to remember that differential privacy is not an algorithm in itself but a property of a process. More formally, a randomized algorithm is differentially private if it behaves similarly on similar datasets (where 'similar' means datasets that differ by the presence of a single data point). The bound on similarity is set by a parameter, usually called epsilon, that is linked to the privacy budget you can play with (smaller for stricter privacy but noisier data). If, for example, your data related to smoking is included in a dataset meant for finding the health impact of smoking, differential privacy promises that you do not face additional repercussions due to this act of inclusion. While the outcome of the analysis might indeed harm you (e.g. the result from the entire dataset concludes that smoking severely impacts health and your health insurer uses this analysis to increase your premium), the outcome would have been essentially the same even if you had not participated.
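
As a concrete sketch (not Apple's or anyone's production system), the Laplace mechanism is one standard way to make a counting query epsilon-differentially private: a count changes by at most one when a single record is added or removed, so adding Laplace noise with scale 1/epsilon is enough. The dataset and the epsilon values below are made up.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Answer 'how many records satisfy predicate?' with epsilon-differential privacy.

    A counting query has sensitivity 1 (it changes by at most 1 if one record
    is added or removed), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = int(predicate(records).sum())
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy dataset: ages of 10,000 people.
ages = np.random.randint(18, 90, size=10_000)
print(dp_count(ages, lambda a: a > 65, epsilon=0.1))  # stricter privacy, noisier answer
print(dp_count(ages, lambda a: a > 65, epsilon=1.0))  # looser privacy, closer to the true count
```

Each such query spends part of the privacy budget; under sequential composition the epsilons of repeated queries add up, which is exactly why per-device, per-day submission limits like Apple's matter.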

Differential privacy also allows quantifying things such as the loss of privacy under composition, that is, when a data analyst runs several different computations on the same data, and the privacy loss for data belonging to a group, such as family members.

Cynthia Dwork and her colleagues are pioneers in formulating and quantifying privacy mathematically. If you are mathematically inclined, the seminal monograph 'The Algorithmic Foundations of Differential Privacy' is highly recommended reading; it is the basis of some of the examples here and of many other works in the area. You could also refer to this video by Andrew Trask, which walks through the same examples as the monograph.

Federated Learning

Going back to the problem with sensitive data. In the absence of real-world data, simply because a service is not deployed yet, we use close-enough proxy datasets to train our models. Over time, as the models are deployed and run on new data, we could use this test-time data to fine-tune them. Except that this is a problem for sensitive data like medical records, audio and video recordings, and the words typed on your mobile. Companies want to improve their models from the additional data generated at the edge, without being responsible for uploading and storing that sensitive data on their servers, which would open them up to a future data leak. Federated learning is a set of techniques used to solve this exact problem.

The key ideas behind federated learning are to train sub-models on edge devices like our mobile phones, doing away with the need to upload the sensitive data to central servers, and then to securely aggregate these sub-models to update and improve the actual model. This has to work around the intermittent availability of the randomly selected devices (out of billions of them) and provide provable guarantees that the underlying data cannot be reconstructed or inferred from individual sub-models. Most federated learning implementations also make use of differential privacy, adding some noise to limit the influence of any single device's data on the final model and preserve the privacy of those users.
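
Below is a minimal, simulated sketch of the federated-averaging idea on a toy linear-regression task, just to show the shape of the loop: devices train locally and the server only ever sees (an aggregate of) model updates, never raw data. The devices, model, and hyperparameters are all invented for illustration; real deployments add secure aggregation protocols and differentially private noise on top.

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Train a linear model on one device; only the updated weights leave it."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(global_weights, devices):
    """One round: each device trains locally, the server averages the results."""
    updates = [local_update(global_weights, X, y) for X, y in devices]
    # In a real deployment, secure aggregation would run here so the server
    # only ever sees the average, not any individual device's update.
    return np.mean(updates, axis=0)

# Toy simulation: 20 devices, each holding a small private dataset.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
devices = []
for _ in range(20):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((X, y))

w = np.zeros(3)
for _ in range(30):
    w = federated_round(w, devices)
print(np.round(w, 2))  # approaches true_w without any device sharing raw data
```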

One of the early uses of federated learning was in Gboard, the keyboard that ships with most Android phones, to predict the next word typed based on previous input and to suggest relevant emojis and GIFs. Mozilla has used federated learning to learn from the address-bar typing history of hundreds of thousands of Firefox users to improve the suggestion experience in the address bar.


The monetary worth of a security vulnerability

A big question everyone was asking this week was why on earth would someone run a paltry Bitcoin scam with all the influential Twitter accounts they got access to? Or have you ever wondered why valid, working credit cards are sold for 10 bucks on the dark web rather than simply being used by the criminals for thousands of dollars of profit?

Research on the monetary worth of security vulnerabilities is scant. It generally relies on proxies: inefficient markets like the ones on the dark web, or the prices companies set for vulnerabilities as part of their bug bounty programs. Most companies do not disclose the true impact of a security incident beyond a technical post-mortem and perhaps a notification to their users. This has led to a lack of actual data for identifying the impact of a security vulnerability.

Quantifying the value is simpler in the case of, say, ransomware. Just a few weeks ago, the University of California, San Francisco paid $1.14 million towards a ransomware demand. In the case of compliance with the demands, a basic model could be the ransom paid plus the cost of the time and resources wasted. In the case of non-compliance, the cost would be the resources spent on failed negotiations, the time to recover the data from backups, or the real cost of losing all the data if no backups are available. It is not that simple in the case of, say, a leak of users' personal information. Even quantifying cyber risk is somewhat easier, thanks to established frameworks.
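
As a back-of-the-envelope sketch of that basic model, the snippet below compares the two branches. Every figure except the reported UCSF ransom amount is an illustrative assumption, not data from any real incident.

```python
def incident_cost(comply, ransom=1_140_000, downtime_cost_per_day=50_000,
                  downtime_days_if_paid=5, downtime_days_to_restore=14,
                  negotiation_cost=100_000, data_loss_cost=None):
    """Rough cost of a ransomware incident under the two options discussed above.

    All parameters except the ransom (the reported UCSF figure) are assumptions.
    """
    if comply:
        # Pay the ransom, plus the time and resources lost while recovering.
        return ransom + downtime_days_if_paid * downtime_cost_per_day
    # Don't pay: failed negotiations plus either a restore from backups
    # or, if there are no backups, the full cost of losing the data.
    recovery = (data_loss_cost if data_loss_cost is not None
                else downtime_days_to_restore * downtime_cost_per_day)
    return negotiation_cost + recovery

print(incident_cost(comply=True))                               # pay up
print(incident_cost(comply=False))                              # restore from backups
print(incident_cost(comply=False, data_loss_cost=5_000_000))    # no backups at all
```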

This lack of quantification of the worth of security vulnerabilities spills over to software products. In an early paper, Ross Anderson remarks that the software market is a market for lemons: since the security measures incorporated in a software product are not quantified, there is no reason for a customer to pay more for security.

The software market suffers from the same information asymmetry. Vendors may make claims about the security of their products, but buyers have no reason to trust them. In many cases, even the vendor does not know how secure its software is. So buyers have no reason to pay more for protection, and vendors are disinclined to invest in it. How can this be tackled?

To add to that, vendors sometimes do not even know which vulnerabilities to focus on, given the lack of information. Indirectly, this has created a huge market for security certifications, with multiple middlemen, for products, organizations, and people. It is an unfortunate scenario where a product owner needs to pay huge sums to get a certificate saying the product is secure, rather than relying on free and open standards to convey the security of their product to potential customers. I have in the past turned down inquiries from companies merely interested in paying money to get XYZ certification for their product instead of finding and fixing security vulnerabilities.


Other things that matter

I guess you know already: some Twitter accounts were hacked, with private DMs stolen via the data archive feature. This was possibly not orchestrated by an organized group or a nation-state. A BGP misconfiguration issue at Cloudflare took down part of the internet again, highlighting the centralization of power with a few for-profits on the internet. Case in point: the former ICANN CEO is now the co-CEO of Ethos Capital, which tried to buy the '.org' TLD, which I wrote about earlier. You can now run virtual machines on GCP that use AMD's technology to keep data encrypted in memory.


This is a repost from my newsletter. Do subscribe here https://technotes.substack.com

Cover photograph by Mark Mathosian from Flickr under CC BY-SA. The top and bottom margins are cropped.

