Knowledge Vault 6 /20 - ICML 2016
Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis
Moritz Hardt & Aaron Roth
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | Chat GPT4o | Llama 3:

graph LR classDef main fill:#f9d4d4, font-weight:bold, font-size:14px classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px classDef problems fill:#d4d4f9, font-weight:bold, font-size:14px classDef solutions fill:#f9f9d4, font-weight:bold, font-size:14px classDef dp fill:#f9d4f9, font-weight:bold, font-size:14px classDef results fill:#d4f9f9, font-weight:bold, font-size:14px Main[Rigorous Data Dredging:
Theory and Tools
for Adaptive Data
Analysis] Main --> A[Introduction and Problem Statement] A --> A1[Rigorous data dredging tutorial
on Fathers Day 1] A --> A2[Stock prediction scam illustrates
overfitting 2] A --> A3[Overlooking hypothesis class size
causes overfitting 3] A --> A4[Holdout prevents static analysis
overfitting 4] A --> A5[Adaptive analysis revises methods
based outcomes 5] Main --> B[Challenges in Adaptive Analysis] B --> B1[More holdouts enable adaptivity,
expensively 6] B --> B2[Adaptive issues: p-hacking, snooping,
forking paths 7] B --> B3[Allow adaptivity while preventing
overfitting 8] B --> B4[Statistical queries framework captures
analysis tasks 9] B --> B5[Adaptivity: next query based
on answers 10] B --> B6[Empirical estimator fails with
adaptive queries 11] Main --> C[Differential Privacy Solutions] C --> C1[Differential privacy ensures distributional
stability 12] C --> C2[DP composes gracefully, allows
modular design 13] C --> C3[Gaussian mechanism answers n2
adaptive queries 14] C --> C4[Optimal DP answers expn adaptive queries 15] C --> C5[Transfer theorem: DP accuracy
on distribution 16] C --> C6[Proving transfer theorem via
thought experiment 17] Main --> D[Algorithms and Applications] D --> D1[Gaussian mechanism allows n^2
adaptive queries 18] D --> D2[Inefficient DP allows expn
adaptive queries 19] D --> D3[ThresholdOut algorithm prevents overfitting 20] D --> D4[Experiments show ThresholdOuts effectiveness 21] Main --> E[Results and Implications] E --> E1[Results generalize beyond expectations 22] E --> E2[Theoretical study provides algorithms,
rules 23] E --> E3[Ladder principle limits information leakage 24] E --> E4[Adaptive analysis with DP
prevents overfitting 25] class Main main class A,A1,A2,A3,A4,A5 intro class B,B1,B2,B3,B4,B5,B6 problems class C,C1,C2,C3,C4,C5,C6 dp class D,D1,D2,D3,D4 solutions class E,E1,E2,E3,E4 results

Resume:

1.- Tutorial on rigorous data dredging, given alone due to co-tutor's wife being 38 weeks pregnant on Father's Day.

2.- Anecdote about email stock prediction scam illustrating overfitting - sending opposite predictions to many and only continuing with correct subset.

3.- Error was failing to account for hypothesis class size and ability to overfit. P-hacking is a similar widespread problem.

4.- Tools like holdout exist to prevent overfitting in static data analysis, but practice of data analysis is more adaptive.

5.- In static analysis, method is selected, then data gathered. In adaptive analysis, method is revised based on outcomes.

6.- Having more holdout sets can enable adaptivity, but amount of data needed grows linearly with rounds of adaptivity, which is expensive.

7.- Adaptive data analysis issues have many names - data dredging, snooping, p-hacking, garden of forking paths. Some propose preregistration to combat.

8.- Goal is to allow adaptive analysis while preventing overfitting. Potential solutions measure information leakage to control future contamination.

9.- Statistical queries framework captures many data analysis tasks. Data analyst interacts with estimator that has sample from unknown distribution.

10.- Adaptivity means analyst picks next query based on previous answers. Estimator is accurate if whp all answers are close to truth.

11.- Empirical estimator is unbiased for non-adaptive queries but fails with adaptive queries - can force any classifier via "bagging-like" procedure.

12.- Differential privacy notion of distributional stability, robust to post-processing. Ensures similar output probabilities on neighboring datasets.

13.- Differential privacy composes gracefully - running k DP algorithms increases privacy loss only by sqrt(k). Allows modular design of DP algorithms.

14.- Basic DP algorithm: Gaussian mechanism adds N(0,1/αn) noise to query. Can answer n^2 adaptive queries for constant error whp.

15.- Optimal efficient DP algorithm can answer exp(n) adaptive queries. Inefficient algorithm maintains consistency with answers so far, exponential runtime.

16.- Transfer theorem: DP estimator accurate on sample is also accurate on distribution whp. Allows statistical validity under adaptive analysis.

17.- Proving transfer theorem via thought experiment with T copies of analyst interacting with estimator. DP means no copy can overfit its sample.

18.- Gaussian mechanism plugged into transfer theorem allows n^2 adaptive queries, a quadratic improvement over empirical estimator that's optimal for efficient estimators.

19.- Inefficient DP estimator allows exp(n) adaptive queries again when data size exceeds data dimension, else Gaussian mechanism preferred.

20.- ThresholdOut algorithm: if holdout and training query answers close, use training answer, else perturbed holdout answer. Allows many queries if low overfitting.

21.- Experiments show ThresholdOut prevents overfitting when no signal exists and finds correct number of features when mild signal exists, unlike naive holdout.

22.- Results generalize beyond expectations to arbitrary low-sensitivity queries, optimization queries like ERM, and non-low sensitivity queries like p-value computation.

23.- Studying adaptive analysis theoretically helps understand risks and benefits of adaptivity, provides algorithms and rules of thumb like careful holdout use.

24.- "Ladder principle" - in ML competitions, don't switch models for small improvements on leaderboard, wait for substantial gains to limit information leakage.

25.- Talk aimed to allow adaptive data analysis while preventing overfitting through differentially private algorithms. Thanks for attending on Father's Day.

Knowledge Vault built byDavid Vivancos 2024