
The ABCs of Differential Privacy | by Essi Alizadeh



MASTERING BASICS

Image by Author

Differential privacy (DP) is a rigorous mathematical framework that permits the analysis and manipulation of sensitive data while providing robust privacy guarantees.

DP is based on the premise that the inclusion or exclusion of a single individual should not significantly change the results of any analysis or query carried out on the dataset as a whole. In other words, the algorithm should produce comparable results on these two datasets, making it difficult to infer anything specific about that individual. This safeguard prevents private information from leaking while still allowing useful insights to be drawn from the data.

Differential privacy was first introduced in the paper “Differential Privacy” by Cynthia Dwork [1], written while she was working at Microsoft Research.

Let’s take a look at an example to better understand how differential privacy helps to protect data.

Example 1

In a study examining the link between social class and health outcomes, researchers ask participants for private information such as where they live, their income, and their medical history [2].

John, one of the participants, worries that his personal information could leak and hurt his applications for life insurance or a mortgage. To address John’s concerns, the researchers can use differential privacy, which ensures that any results they share reveal essentially nothing specific about him. A useful way to think about this guarantee is John’s “opt-out” scenario, in which his data is left out of the study entirely: his privacy is protected because the analysis’s results are not tied to any of his personal details.

Differential privacy aims to protect privacy in the real-world computation as if the data were being analyzed in the opt-out scenario: because John’s data is not part of that computation, anything the results reveal about him can only come from the data available about everyone else.

A precise description of differential privacy requires formal mathematical language and technical concepts, but the basic concept is to protect the privacy of individuals by limiting the information that can be obtained about them from the released data, thereby ensuring that their sensitive information remains private.

Example 2

The U.S. Census Bureau used a differential privacy framework as part of its disclosure avoidance strategy to balance its data collection and reporting needs against the privacy concerns of respondents. You can find more information about the confidentiality protection provided by the U.S. Census Bureau here. Moreover, Garfinkel explains how DP was used in the 2020 US Census data here.

The meaning of “differential” within the realm of DP

The “differential” in differential privacy refers to its emphasis on the difference between the results produced by a privacy-preserving algorithm on two datasets that differ by just one individual’s data.

Mechanism M

A mechanism M is a randomized algorithm or process applied to the data that preserves privacy while still providing useful information.

Epsilon (ε)

ε is a privacy parameter that controls the level of privacy given by a differentially private mechanism. In other words, ε regulates how much the output of the mechanism can vary between two neighboring databases and measures how much privacy is lost when the mechanism is run on the database [3].

A smaller ε provides stronger privacy guarantees, but the output may be less useful as a result [4]. ε controls the amount of noise added to the data and bounds how much the output probability distribution can change when a single person’s data is altered.
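
To make this concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query; the `laplace_count` helper, the toy dataset, and the parameter choices are illustrative assumptions rather than a reference implementation. A counting query has sensitivity 1, so the noise scale is 1/ε: smaller ε means more noise.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=None):
    """Release a counting query under epsilon-DP via the Laplace mechanism.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(predicate(row) for row in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> larger noise scale -> stronger privacy, noisier answer.
ages = [34, 45, 29, 62, 51]
for eps in (0.1, 1.0, 10.0):
    print(eps, laplace_count(ages, lambda a: a >= 40, epsilon=eps))
```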

Delta (𝛿)

𝛿 is an additional privacy parameter that bounds the probability of a privacy failure. In other words, 𝛿 controls the probability of an extreme privacy breach, one in which the noise calibrated by ε does not provide sufficient protection.

𝛿 is a non-negative number, usually very small and close to zero, that bounds the chance of such a breach. Relaxing pure ε-DP to (ε, 𝛿)-DP in this way makes more complex analyses and machine learning models feasible while still protecting privacy (see [4]).

A smaller 𝛿 means a lower chance that someone’s privacy is compromised, but this comes at a cost: if 𝛿 is too small, more noise must be added, diminishing the quality of the results. 𝛿 is therefore one parameter to consider, and it must be balanced against ε and the utility of the data.

Consider two databases, D and D’, that differ by only one record.

Formally, a mechanism M is ε-differentially private if, for any two adjacent datasets D and D’, and for any possible set of outputs O, the following holds:

Pr[M(D) ∈ O] ≤ e^ε · Pr[M(D’) ∈ O]

Equivalently, we can reframe the above inequality in terms of divergences, resulting in the following:

Figure 1: Differential privacy in the context of divergences (Image by Author).

Here div[⋅∣∣⋅] denotes the Rényi divergence. See the paper “Rényi Differential Privacy” by Ilya Mironov for more information.
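
As a toy sanity check of this definition, consider randomized response, a textbook ε-DP mechanism in which a respondent answers truthfully with probability e^ε/(e^ε + 1) and lies otherwise. Here the two “adjacent datasets” are a single person’s two possible true answers, and the likelihood ratio of any reported answer never exceeds e^ε. The sketch below is an illustrative assumption, not code from the article.

```python
import math

def randomized_response_probs(epsilon):
    """P[report 'yes' | true answer] under randomized response:
    tell the truth with probability e^eps / (e^eps + 1), lie otherwise."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return {"yes": p_truth, "no": 1 - p_truth}   # keyed by the true answer

eps = 1.0
probs = randomized_response_probs(eps)
ratio = probs["yes"] / probs["no"]        # worst-case likelihood ratio
print(ratio <= math.exp(eps) + 1e-12)     # True: the e^eps bound holds (with equality here)
```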

A randomized mechanism M is considered (ε, 𝛿)-differentially private if the probability of a significant privacy breach (i.e., a breach that would not occur under ε-differential privacy) is no more than 𝛿. More formally, a mechanism M is (ε, 𝛿)-differentially private if, for any two adjacent datasets D and D’ and any set of outputs O,

Pr[M(D) ∈ O] ≤ e^ε · Pr[M(D’) ∈ O] + 𝛿

If 𝛿 = 0, then (ε, 𝛿)-DP reduces to ε-DP.

An (ε, 𝛿)-DP mechanism may be thought of, informally, as ε-DP with probability 1 − 𝛿.
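
A standard example of an (ε, 𝛿)-DP mechanism is the Gaussian mechanism, which calibrates the noise standard deviation as σ = √(2 ln(1.25/𝛿)) · Δ/ε, a bound valid for ε < 1 (see [4]). The sketch below, releasing a clipped mean, is a minimal illustration under assumed bounds and parameters, not the article’s own code.

```python
import math
import numpy as np

def gaussian_mean(values, lower, upper, epsilon, delta, rng=None):
    """Release a clipped mean under (epsilon, delta)-DP via the Gaussian
    mechanism, using the classical calibration from Dwork & Roth (2014),
    valid for epsilon < 1."""
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    # Changing one of n bounded records moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(values)
    sigma = math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon
    return float(clipped.mean() + rng.normal(0.0, sigma))

ages = np.array([34, 45, 29, 62, 51])
print(gaussian_mean(ages, lower=0, upper=100, epsilon=0.5, delta=1e-5))
```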

1. Post-processing immunity

The differentially private output can be subjected to any function or analysis, and the outcome will continue to uphold the original privacy guarantees. For instance, if you apply a differentially private mechanism to a dataset and then compute the average age from the mechanism’s output, the resulting average is still differentially private and carries the same privacy guarantee as the original output.

Thanks to the post-processing feature, we can use differentially private mechanisms in the same way as generic ones. Hence, it is possible to combine several differentially private mechanisms without sacrificing the integrity of differential privacy.
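
As a small illustration (an assumed example, not from the article): once a noisy count has been released under ε-DP, rounding or clamping it consumes no additional privacy budget, because those steps only touch the released value, never the raw data.

```python
import numpy as np

rng = np.random.default_rng()

# An epsilon = 1.0 Laplace release of a count (sensitivity 1).
true_count = 3
noisy_count = float(true_count + rng.laplace(scale=1.0 / 1.0))

# Post-processing: rounding and clamping the *released* value keeps the
# same 1.0-DP guarantee, since the raw data is never touched again.
published = max(0, round(noisy_count))
print(published)
```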

2. Composition

Composition is the property that ensures the guarantees of differential privacy continue to hold when multiple differentially private mechanisms are run on the same data or when queries are combined. Composition can be either sequential or parallel. Under sequential composition, if you apply two mechanisms M1 (ε1-DP) and M2 (ε2-DP) to a dataset, then releasing the outputs of both is (ε1 + ε2)-DP, as sketched below.
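
For instance (an assumed sketch, not the article’s code), two Laplace counting queries run on the same individuals with budgets ε1 and ε2 together satisfy (ε1 + ε2)-DP:

```python
import numpy as np

rng = np.random.default_rng()
ages = [34, 45, 29, 62, 51]

# Two counting queries on the same individuals, each a Laplace release.
eps1, eps2 = 0.5, 0.3
over_40 = sum(a >= 40 for a in ages) + rng.laplace(scale=1.0 / eps1)
under_30 = sum(a < 30 for a in ages) + rng.laplace(scale=1.0 / eps2)

# Sequential composition: publishing both results is (eps1 + eps2)-DP,
# i.e. a total privacy budget of 0.8 has been spent.
total_epsilon = eps1 + eps2
print(over_40, under_30, total_epsilon)
```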

WARNING: Although composition preserves differential privacy, the composition theorem also makes clear that there is a ceiling: each additional mechanism adds its ε to the total, so more privacy is lost every time a new mechanism is employed. If the accumulated ε becomes too large, the differential privacy guarantee is mostly meaningless [3].

3. Robustness to auxiliary information

Differential privacy is robust to attackers with auxiliary information: even if an attacker has access to other relevant data, they cannot learn much more about an individual from a DP output than they could without it. For instance, if a hospital were to release differentially private statistics about patients’ medical conditions, an attacker with access to other medical records would not be able to greatly increase their knowledge of a given patient from the published numbers.

The notion of differential privacy has been misunderstood in several publications, especially during its early days. Dwork et al. wrote a short paper [5] to correct some widespread misunderstandings. Here are a few examples of common misunderstandings:

  1. DP is not an algorithm but rather a definition. DP is a mathematical guarantee that an algorithm must meet in order to disclose statistics about a dataset. Several distinct algorithms meet the criteria.
  2. Various algorithms can be differentially private while still meeting accuracy requirements. If someone claims that differential privacy, a specific requirement on ratios of probability distributions, is incompatible with a given accuracy target, they must provide evidence for that claim, i.e., a proof that no DP algorithm can meet the specified standard. Such proofs are hard to come by, and first intuitions about what is and is not feasible are often wrong.
  3. There are no “good” or “bad” results for any given database. Generating the outputs in a way that preserves privacy (perfect or differential) is the key.

DP has established itself as a viable paradigm for protecting data privacy, which is particularly important now that machine learning and big data are becoming more widespread. This article covered several key concepts, including the privacy parameters ε and 𝛿, provided the mathematical definitions of ε-DP and (ε, 𝛿)-DP, explained key properties of DP, and addressed some of the most common misconceptions.

[1] Dwork, Cynthia (2006). “Differential Privacy.” In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, 1–12. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/11787006_1.

[2] Wood, Alexandra, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, James Honaker, Kobbi Nissim, David O’Brien, Thomas Steinke, and Salil Vadhan (2018). “Differential Privacy: A Primer for a Non-Technical Audience.” Vand. J. Ent. & Tech. L. 21 (1): 209–76. https://doi.org/10.2139/ssrn.3338027.

[3] Brubaker, M., and S. Prince (2021). “Tutorial #12: Differential Privacy I: Introduction.” Borealis AI. https://www.borealisai.com/research-blogs/tutorial-12-differential-privacy-i-introduction/.

[4] Dwork, Cynthia, Aaron Roth, et al. (2014). “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science 9 (3–4): 211–407.

[5] Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith (2011). “Differential Privacy — A Primer for the Perplexed.” Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality 11.

