Question
(15 pts) When conducting private data release, we often want to calculate a given statistic with an $\varepsilon$-differential privacy guarantee. In class, we saw one way to calculate a mean in a differentially private manner. For this problem, we will consider calculating a differentially private odds ratio. An odds ratio is a statistical measure that gives us a sense of the correlation between two events (usually binary). It is often used to assess treatment effects. In this problem, we will consider the scenario where a group of $n$ individuals either were or were not vaccinated, and either did or did not contract COVID-19.

To be more concrete, consider a dataset $X$ of individuals labeled with numbers $1, \ldots, n$. The dataset has two binary dimensions, $S \in \{0, 1\}^n$ and $V \in \{0, 1\}^n$. For an individual $i$, $S(i) = 1$ if individual $i$ has had COVID, and $S(i) = 0$ otherwise. Similarly, $V(i) = 1$ if individual $i$ was vaccinated, and $V(i) = 0$ otherwise. Consider the following scalar-valued queries:

$$A = \sum_{i=1}^{n} \mathbb{1}[S(i) = 1 \text{ and } V(i) = 1]$$
$$B = \sum_{i=1}^{n} \mathbb{1}[S(i) = 0 \text{ and } V(i) = 1]$$
$$C = \sum_{i=1}^{n} \mathbb{1}[S(i) = 1 \text{ and } V(i) = 0]$$
$$D = \sum_{i=1}^{n} \mathbb{1}[S(i) = 0 \text{ and } V(i) = 0]$$

Then, an odds ratio $\tau$ can be calculated as:

$$\tau = \frac{A/C}{B/D} \qquad (1)$$

Intuitively, if $\tau = 1$, that suggests that $S$ and $V$ are uncorrelated, while $\tau > 1$ suggests that vaccines are associated with higher "odds" of COVID.

(a) Show that the sensitivity of the $\tau$ statistic, which we denote $\Delta\tau = \max_{X, X'} |\tau(X) - \tau(X')|$, can grow as $\Omega(n)$, even if we assume, say, that $A, B, C, D > c$ for some constant $c$.

(b) That the sensitivity of $\tau$ is unbounded makes it difficult to directly add Laplace noise to ensure privacy without strong assumptions about our data. However, we can still calculate $\tau$ privately with a simple algorithm, with reasonable error. Show that there exists an $\varepsilon$-differentially private estimator $\hat{\tau}$ such that for a constant $c$:

$$\Pr\big[(1 - c\,[\ldots])\,\tau \le \hat{\tau} \le (1 + c\,[\ldots])\,\tau\big] \ge 1 - p. \qquad (2)$$

You may assume that $\varepsilon$ is small, e.g. so that [...].

Note: Previously we asked you to show the weaker bound [...]. If you already showed just that, then you will receive full credit.

We've provided you with a synthetic $X$ for a population of 100,000 people in covid_data.csv. In reality, it is often infeasible to calculate a statistic like $\tau$ over the entire population (because we can't survey everyone), so let's understand how subsampling here affects the accuracy of $\hat{\tau}$ in practice. Please implement each step yourself, with no external libraries beyond numpy, pandas, etc. (or equivalents in your language of choice).

(a) Calculate $\tau$ for the entire population. This is the ground truth $\tau$.

(b) Calculate $\hat{\tau}_m$ from random subsamples of size $m$ from the population, where $m \in \{100, 1000, 10000\}$. Run this over $b = 10$ random subsamples (specify a seed set of size $b$) for each $m$, and report the average error between $\hat{\tau}_m$ and $\tau$ across these runs.

(c) Calculate an $\varepsilon$-differentially private estimate $\tilde{\tau}_m$ for each $m$ using the algorithm you designed in (2) for $\varepsilon = 0.01$. Report on the relative errors between $\tilde{\tau}_m$, $\hat{\tau}_m$, and $\tau$.

(d) Write 2-3 sentences of discussion on how well the algorithm performs in this practical setting. How does population size affect DP algorithm accuracy? Also, does having received the vaccine increase or decrease your odds of having contracted COVID?
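The counts A, B, C, D, the odds ratio in equation (1), and the sensitivity question in part (a) of the odds-ratio problem can be explored numerically. The sketch below is only an illustration, not a proof: the helper names (`counts`, `odds_ratio`, `build_dataset`, `neighbor_gap`) are hypothetical, and the particular family of datasets (A = D = k with B and C held at a small constant) is an assumption chosen to make the effect visible.

```python
import numpy as np

def counts(S, V):
    """Return the four cell counts (A, B, C, D) for binary arrays S and V."""
    S = np.asarray(S)
    V = np.asarray(V)
    A = int(np.sum((S == 1) & (V == 1)))  # had COVID, vaccinated
    B = int(np.sum((S == 0) & (V == 1)))  # no COVID, vaccinated
    C = int(np.sum((S == 1) & (V == 0)))  # had COVID, unvaccinated
    D = int(np.sum((S == 0) & (V == 0)))  # no COVID, unvaccinated
    return A, B, C, D

def odds_ratio(A, B, C, D):
    """tau = (A/C) / (B/D), written as A*D / (B*C) so only one division is needed."""
    return (A * D) / (B * C)

def build_dataset(A, B, C, D):
    """Hypothetical helper: build S and V arrays with the requested cell counts."""
    S = np.array([1] * A + [0] * B + [1] * C + [0] * D)
    V = np.array([1] * A + [1] * B + [0] * C + [0] * D)
    return S, V

def neighbor_gap(k, small=2):
    """Return (n, |tau(X) - tau(X')|) for an assumed pair of neighboring datasets.

    X has A = D = k and B = C = small. X' differs in one record, which is
    changed from (S=1, V=1) to (S=0, V=1), so A drops by 1 and B rises by 1.
    """
    S, V = build_dataset(k, small, small, k)
    tau = odds_ratio(*counts(S, V))
    S_neighbor = S.copy()
    S_neighbor[0] = 0  # record 0 sits in cell A; flipping S moves it to cell B
    tau_neighbor = odds_ratio(*counts(S_neighbor, V))
    return len(S), abs(tau - tau_neighbor)

for k in (10, 100, 1000):
    n, gap = neighbor_gap(k)
    print(f"n = {n:5d}   |tau(X) - tau(X')| = {gap:.1f}")
```

For this family the gap works out to k(k + small) / (small^2 (small + 1)), so it grows at least linearly in n = 2k + 2·small even though every count stays above a fixed constant.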
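The subsampling experiment in parts (a) through (c) of the practical exercise might be organized along the following lines. This is a minimal sketch under stated assumptions. The population is generated synthetically by a hypothetical `make_population` (with made-up infection rates) instead of being read from the provided CSV. The private estimator adds Laplace noise to each of the four counts and then takes the ratio, which is one common approach and not necessarily the algorithm the problem intends. All function names are illustrative.

```python
import numpy as np

def counts(S, V):
    """Return (A, B, C, D) for binary arrays S (had COVID) and V (vaccinated)."""
    A = int(np.sum((S == 1) & (V == 1)))
    B = int(np.sum((S == 0) & (V == 1)))
    C = int(np.sum((S == 1) & (V == 0)))
    D = int(np.sum((S == 0) & (V == 0)))
    return A, B, C, D

def odds_ratio(A, B, C, D):
    """tau = (A/C) / (B/D) = A*D / (B*C)."""
    return (A * D) / (B * C)

def make_population(n, rng):
    """Hypothetical stand-in for the provided data; the rates are made up."""
    V = (rng.random(n) < 0.6).astype(int)
    p_covid = np.where(V == 1, 0.05, 0.15)
    S = (rng.random(n) < p_covid).astype(int)
    return S, V

def private_odds_ratio(S, V, eps, rng):
    """Assumed mechanism: Laplace noise on each count, then the ratio.

    Replacing one record changes at most two counts by 1 each, so the L1
    sensitivity of (A, B, C, D) is 2 and the noise scale is 2 / eps.
    """
    noisy = np.array(counts(S, V), dtype=float) + rng.laplace(0.0, 2.0 / eps, size=4)
    A, B, C, D = np.maximum(noisy, 1.0)  # post-processing: keep counts positive
    return odds_ratio(A, B, C, D)

def run_experiment(S, V, sizes=(100, 1000, 10000), seeds=range(10), eps=0.01):
    """Average relative error of subsample and private estimates for each m."""
    tau = odds_ratio(*counts(S, V))  # ground truth on the full population
    results = {}
    for m in sizes:
        sub_err, dp_err = [], []
        for seed in seeds:
            # Seeded generator for reproducibility only (not for real releases).
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(S), size=m, replace=False)
            A, B, C, D = counts(S[idx], V[idx])
            if B == 0 or C == 0:
                continue  # odds ratio undefined for this subsample
            tau_m = odds_ratio(A, B, C, D)
            tau_dp = private_odds_ratio(S[idx], V[idx], eps, rng)
            sub_err.append(abs(tau_m - tau) / tau)
            dp_err.append(abs(tau_dp - tau) / tau)
        results[m] = (
            float(np.mean(sub_err)) if sub_err else float("nan"),
            float(np.mean(dp_err)) if dp_err else float("nan"),
        )
    return tau, results

S, V = make_population(100_000, np.random.default_rng(0))
tau, results = run_experiment(S, V)
print(f"population tau = {tau:.3f}")
for m, (sub_err, dp_err) in results.items():
    print(f"m = {m:6d}  subsample rel. err = {sub_err:.3f}  private rel. err = {dp_err:.3f}")
```

With ε = 0.01 the Laplace scale is 200, which is larger than most cell counts when m = 100, so the private estimate at small m is expected to be dominated by noise. Clamping the noisy counts at 1 is a design choice to avoid dividing by zero or negative values; because it is applied after the noise, it does not affect the privacy guarantee.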