Concept Graph & Summary using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:
Summary:
1.- John M. Abowd is the U.S. Census Bureau's Associate Director for Research and Methodology and Chief Scientist.
2.- The 2020 U.S. Census aims to accurately count the population while protecting individual privacy, which is challenging.
3.- The 2020 Census will collect basic demographic data on all U.S. residents as of April 1, 2020.
4.- Key Census data products include apportionment counts, redistricting data, and demographic and housing characteristics.
5.- In 2016, the Census Bureau began researching whether published 2010 Census data were vulnerable to database reconstruction and re-identification attacks.
6.- Using only published 2010 Census data, the Census Bureau reconstructed individual records and re-identified a portion by linking to commercial databases.
7.- This proved that publishing too many statistics from a confidential database allows individual records to be reconstructed, compromising privacy.
8.- The fundamental law of information recovery imposes a privacy-accuracy tradeoff when publishing statistics from confidential data.
9.- Formal privacy systems like differential privacy can provably protect confidentiality but reduce accuracy of published statistics.
10.- Statistical agencies and tech companies face the same challenge of the privacy-accuracy tradeoff when using confidential data.
11.- Social scientists need to work with computer scientists to determine the optimal privacy-accuracy balance for each use case.
12.- The Census Bureau set up a formal differential privacy system for the 2020 Census to protect individual privacy.
13.- Accuracy and privacy loss for 2020 Census data products were evaluated to inform policy decisions on the privacy-accuracy tradeoff.
14.- Unsupervised learning of disentangled representations from data aims to capture generative factors of variation in different parts of the representation.
15.- Theoretical results show unsupervised disentanglement learning is impossible for arbitrary data, in contrast to supervised learning.
16.- An empirical study investigated if disentangled representations can be learned unsupervised on common datasets used in the disentanglement literature.
17.- The study found the specific disentanglement method matters less than hyperparameter settings and random seeds for disentanglement performance.
18.- There are no consistent trends in hyperparameter settings that improve disentanglement across different datasets.
19.- Transferring good hyperparameters across similar datasets works to some degree, but not perfectly.
20.- Random seed and hyperparameter choice cause high variance in disentanglement scores for the same method.
21.- Unsupervised model selection to identify the most disentangled model from a set of trained models remains an open problem.
22.- Commonly tracked unsupervised metrics like reconstruction error do not reliably correlate with disentanglement scores.
23.- The role of inductive biases and supervision in disentanglement learning should be made explicit to avoid biasing scientific insights.
24.- Concrete benefits of disentangled representations for downstream tasks are still unclear and should be further investigated.
25.- Follow-up work found that a small amount of supervision enables model selection and improves disentanglement learning.
26.- In some settings, disentanglement may provide sample efficiency and fairness benefits for downstream tasks.
27.- A real-world robotics dataset was collected to encourage research on disentanglement beyond toy datasets.
28.- Disentanglement is formally defined as having a 1-to-1 mapping between each learned feature and a ground truth generative factor.
29.- The impossibility result constructs two generative models that could produce the same data but with different entangled representations.
30.- With only unsupervised data, the true generative model is unidentifiable, making disentanglement impossible without further assumptions to exclude alternative models.
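The reconstruction attack in points 5-7 can be sketched at toy scale: publish a few exact tabulations from a tiny confidential database and enumerate which microdata sets are consistent with them. Everything below is illustrative (a made-up three-person block with two binary attributes), not Census data or the Bureau's actual method, which solves the same kind of constraint system at scale.

```python
from itertools import combinations_with_replacement, product

# Hypothetical toy block of 3 residents; a record is (age_50_plus, male),
# each attribute coded 0/1. RECORD_SPACE lists every possible record.
RECORD_SPACE = list(product([0, 1], repeat=2))

def tabulate(records):
    # The "published statistics": total count, count aged 50+,
    # count male, and count of males aged 50+.
    return (len(records),
            sum(a for a, m in records),
            sum(m for a, m in records),
            sum(a * m for a, m in records))

def reconstruct(published, n):
    # Brute-force reconstruction: enumerate every multiset of n records
    # and keep those whose tabulations match the published tables.
    return [recs for recs in combinations_with_replacement(RECORD_SPACE, n)
            if tabulate(recs) == published]
```

For some published counts only one microdata set is consistent, so the "confidential" records are fully recovered; real attacks scale this idea up with integer programming and then re-identify records by linking to external data.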
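The privacy-accuracy tradeoff in points 8-12 is easiest to see in the Laplace mechanism, the textbook differential-privacy primitive. The 2020 Census actually uses the more elaborate TopDown algorithm; this is only a minimal sketch of the core idea.

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: adding noise with scale sensitivity / epsilon
    # makes the released count epsilon-differentially private.
    return true_count + laplace_noise(sensitivity / epsilon)
```

Smaller epsilon means a larger noise scale, so stronger privacy guarantees come at the cost of less accurate published counts; choosing epsilon is exactly the policy decision described in point 13.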
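The one-to-one definition in point 28 can be checked mechanically: map each learned code dimension to the generative factor it correlates with most, and require that mapping to be a strong bijection. The scorer below is an illustrative check on toy data, not one of the published disentanglement metrics.

```python
def abs_corr(x, y):
    # Absolute Pearson correlation between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def is_disentangled(factors, codes, threshold=0.9):
    # factors, codes: lists of value sequences (one per dimension).
    # Disentangled iff each code's best-matching factor is distinct
    # (a bijection) and every match is strong.
    best = [max(range(len(factors)), key=lambda j: abs_corr(c, factors[j]))
            for c in codes]
    strong = all(abs_corr(codes[i], factors[best[i]]) > threshold
                 for i in range(len(codes)))
    return strong and len(set(best)) == len(factors)
```

A code that is a rescaled copy of one factor per dimension passes; a code where every dimension mixes both factors fails, matching the informal definition.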
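The construction behind points 29-30 can be made concrete with Gaussian latents: rotating two independent standard normals yields a pair with exactly the same joint distribution, so unsupervised data alone cannot distinguish the original factor axes from rotated (entangled) ones. A quick empirical check, assuming a 45-degree rotation for illustration:

```python
import math
import random

def rotated_latent_stats(theta, n, seed=42):
    # Draw n pairs of independent standard normals and rotate each by theta.
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(math.cos(theta) * z1 - math.sin(theta) * z2)
        ys.append(math.sin(theta) * z1 + math.cos(theta) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Unit variances and zero covariance: the rotated pair is
    # statistically identical to the original independent factors.
    return var_x, var_y, cov
```

Since both parameterizations generate identical data, a purely unsupervised learner has no signal to prefer one over the other, which is the heart of the impossibility result and why point 30 says further assumptions are needed.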
Knowledge Vault built by David Vivancos 2024