
A Differential Privacy Example for Beginners | by Lia Zheng | Dec, 2022



Image Credit: Dima Andrei

Differential privacy (DP) is a way to protect the privacy of individuals in a dataset while preserving its overall usefulness. Ideally, no one should be able to tell the difference between a dataset and a parallel one with a single record removed. To achieve this, randomized algorithms are used to add noise to the data.

As a simple example, imagine this: you are in a school with a total student body of 300 people. Each of you is asked, “Have you ever cheated on a test?”

See how this can be a sensitive question? Those who have cheated may be reluctant to respond yes, out of fear of potential repercussions. So… how do we resolve this issue? This is where DP comes in handy.

Each student flips a coin. If it lands on heads, they tell the truth. If it lands on tails, they flip another coin: if that one lands on heads, they respond no; if tails, yes.

a diagram to illustrate the coin flip

This way, even if your survey response gets released publicly, there is plausible deniability about the accuracy of your response. At the same time, with a large enough number of respondents, the school can still make use of the data; it doesn’t become completely useless.

We’ll look at this coin flip example implemented in Python code.

Matplotlib will be used to graph the data for easy visualization, and the random module implements the random coin flips.
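The original notebook cells are not reproduced on this page, so the snippets below are minimal sketches rather than the author’s exact code. The setup amounts to two imports:

```python
import random                    # simulates the coin flips
import matplotlib.pyplot as plt  # draws the bar charts
```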

Part 1: Without DP

First, we will see what the mock data looks like without differential privacy. Our “raw data” is represented with 0s and 1s, where 0 means “did not cheat” and 1 means “cheated.” Each binary value corresponds to a different student, and their true “cheating status” is the only metric recorded.
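A sketch of how the mock data and its bar chart could look (raw_data and plot_counts are illustrative names, not necessarily the author’s):

```python
# Mock raw data: 100 cheaters (1) and 200 non-cheaters (0), one entry per student.
raw_data = [1] * 100 + [0] * 200

def plot_counts(data, title):
    # Count each answer and draw a two-bar chart.
    cheated = sum(data)
    did_not_cheat = len(data) - cheated
    plt.bar(["did not cheat", "cheated"], [did_not_cheat, cheated])
    plt.title(title)
    plt.show()

plot_counts(raw_data, "Raw responses, no DP (300 students)")
```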

example output from cell above

As you can see, 100 students report cheating, and 200 report not cheating.

Let’s add a new student to the data. Assume that they actually did cheat. Now, run the code again to see how their data point affects the graph.

Because it’ll be hard to see such a small difference, use the matplotlib annotation tool to add numeric values to the bars.
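A sketch of this step, assuming matplotlib 3.4+ for plt.bar_label (older versions can use plt.annotate instead):

```python
# Student #301 actually cheated, so append a 1.
raw_data.append(1)

cheated = sum(raw_data)
did_not_cheat = len(raw_data) - cheated
bars = plt.bar(["did not cheat", "cheated"], [did_not_cheat, cheated])
plt.bar_label(bars)  # write the exact count above each bar
plt.title("Raw responses after adding student #301")
plt.show()
```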

example output from cell above

In this traditional survey, when we add a student, it’s very easy to tell the difference between the two results. If the “cheated” count goes up, the student cheated; if the “did not cheat” count goes up, they didn’t. Because the “cheated” column is now 101 instead of 100, we know that student #301 cheated.

Part 2: With DP

To implement DP, each student flips a coin. If it lands on heads, they answer truthfully. If it lands on tails, they flip another coin: if that one lands on heads, they respond that they haven’t cheated, and if it lands on tails, they respond that they have.
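A sketch of that randomized-response mechanism, reusing the illustrative raw_data and plot_counts from above:

```python
def randomized_response(true_answer):
    # First flip: heads means answer truthfully.
    if random.random() < 0.5:
        return true_answer
    # Tails: flip again. Heads -> report "did not cheat" (0), tails -> "cheated" (1).
    return 0 if random.random() < 0.5 else 1

dp_data = [randomized_response(answer) for answer in raw_data]
plot_counts(dp_data, "Responses with DP")
```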

If you run the code multiple times, you’ll notice that the graph changes every time. That’s because DP algorithms inject randomness, as we did here with the coin flips.

example output from cell above

Now we can illustrate one of the main purposes of differential privacy: it should not be possible to tell the results of one dataset apart from those of a parallel dataset that differs by a single record.

Let’s add another new student to the data. Assume that they actually did cheat.

Now, run the code to see how their data point affects the graph. If you run this multiple times, the graph will change each time.
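One way this step might look:

```python
# Student #302 also actually cheated.
raw_data.append(1)

# Each re-run of the mechanism reports slightly different counts.
dp_data = [randomized_response(answer) for answer in raw_data]
plot_counts(dp_data, "DP responses after adding student #302")
```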

example output from cell above

As you see, we can’t really tell if Student #302 cheated or not because the output results contain a degree of randomness.

Because of the algorithm’s randomness, this is where a bigger dataset comes in handy. Now assume we have 30,000 students instead of 300. Run the code below a few times: the DP graph no longer changes significantly, because the counts settle close to their expected probabilities.
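A sketch, assuming the true proportions stay the same (one third cheaters):

```python
# Scale the same true proportions up to 30,000 students.
big_data = [1] * 10_000 + [0] * 20_000
dp_big = [randomized_response(answer) for answer in big_data]
plot_counts(dp_big, "DP responses, 30,000 students")
```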

example output from cell above

Thus, by using DP, we preserve privacy: no one can reliably tell whether a given individual was in the dataset or not.

How good are these results?

As discussed above, the larger the dataset, the more accurate the results, because the randomness averages out to known probabilities. With the coin flip, the amount of noise is quite high: in expectation, about 25% of the reported answers are false. Knowing this, however, we can still work out the actual distribution of cheaters vs. non-cheaters.

Say c is the true proportion of students who cheated; then, 1-c is the true proportion of students who didn’t cheat. Say p_y is the proportion of students who reported that they cheated.

p_y is obtained by adding 3/4 of the proportion of cheaters to 1/4 of the proportion of non-cheaters, since we expect 1/4 of each group’s responses to be false:

p_y = (3/4)c + (1/4)(1 − c)

That simplifies down to:

p_y = c/2 + 1/4, or equivalently c = 2·p_y − 1/2

So c is estimated as twice the proportion that responded yes, minus 1/2, showing how we can recover the true proportion of cheaters from the DP results. Notice how the randomness of the coin flips reduces to fixed probabilities in aggregate, which is why a large dataset is needed to preserve accuracy.
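As a quick check, the estimator is a one-liner (estimate_true_proportion is an illustrative name):

```python
def estimate_true_proportion(p_y):
    # Invert the randomized response: c = 2 * p_y - 1/2.
    return 2 * p_y - 0.5

print(estimate_true_proportion(125 / 300))  # 0.333..., i.e. 1/3
```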

Our results showed that about 125 of 300 students (so p_y = 5/12) reported cheating. Plugging this in gives c = 2 × 5/12 − 1/2 = 5/6 − 1/2 = 1/3, which matches the real proportion.

Conclusion

This coin flip example is a very simplified version of how DP can be used to preserve privacy. The coin flip anonymizes the responses, giving each one “plausible deniability,” while still allowing about 75% of the responses to be true. We can then calculate “backwards” to find a good estimate of the true proportion of each response, an estimate that becomes very accurate with large enough datasets.

The great advantage highlighted by this example is individual privacy: with DP, the results of two datasets that differ by one user are essentially indistinguishable, which protects privacy at the individual level.


