Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis

Moritz Hardt & Aaron Roth

**Concept Graph & Resume using Claude 3.5 Sonnet | Chat GPT4o | Llama 3:**

graph LR
classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px
classDef problems fill:#d4d4f9, font-weight:bold, font-size:14px
classDef solutions fill:#f9f9d4, font-weight:bold, font-size:14px
classDef dp fill:#f9d4f9, font-weight:bold, font-size:14px
classDef results fill:#d4f9f9, font-weight:bold, font-size:14px
Main[Rigorous Data Dredging:

Theory and Tools

for Adaptive Data

Analysis] Main --> A[Introduction and Problem Statement] A --> A1[Rigorous data dredging tutorial

on Fathers Day 1] A --> A2[Stock prediction scam illustrates

overfitting 2] A --> A3[Overlooking hypothesis class size

causes overfitting 3] A --> A4[Holdout prevents static analysis

overfitting 4] A --> A5[Adaptive analysis revises methods

based outcomes 5] Main --> B[Challenges in Adaptive Analysis] B --> B1[More holdouts enable adaptivity,

expensively 6] B --> B2[Adaptive issues: p-hacking, snooping,

forking paths 7] B --> B3[Allow adaptivity while preventing

overfitting 8] B --> B4[Statistical queries framework captures

analysis tasks 9] B --> B5[Adaptivity: next query based

on answers 10] B --> B6[Empirical estimator fails with

adaptive queries 11] Main --> C[Differential Privacy Solutions] C --> C1[Differential privacy ensures distributional

stability 12] C --> C2[DP composes gracefully, allows

modular design 13] C --> C3[Gaussian mechanism answers n2

adaptive queries 14] C --> C4[Optimal DP answers expn adaptive queries 15] C --> C5[Transfer theorem: DP accuracy

on distribution 16] C --> C6[Proving transfer theorem via

thought experiment 17] Main --> D[Algorithms and Applications] D --> D1[Gaussian mechanism allows n^2

adaptive queries 18] D --> D2[Inefficient DP allows expn

adaptive queries 19] D --> D3[ThresholdOut algorithm prevents overfitting 20] D --> D4[Experiments show ThresholdOuts effectiveness 21] Main --> E[Results and Implications] E --> E1[Results generalize beyond expectations 22] E --> E2[Theoretical study provides algorithms,

rules 23] E --> E3[Ladder principle limits information leakage 24] E --> E4[Adaptive analysis with DP

prevents overfitting 25] class Main main class A,A1,A2,A3,A4,A5 intro class B,B1,B2,B3,B4,B5,B6 problems class C,C1,C2,C3,C4,C5,C6 dp class D,D1,D2,D3,D4 solutions class E,E1,E2,E3,E4 results

Theory and Tools

for Adaptive Data

Analysis] Main --> A[Introduction and Problem Statement] A --> A1[Rigorous data dredging tutorial

on Fathers Day 1] A --> A2[Stock prediction scam illustrates

overfitting 2] A --> A3[Overlooking hypothesis class size

causes overfitting 3] A --> A4[Holdout prevents static analysis

overfitting 4] A --> A5[Adaptive analysis revises methods

based outcomes 5] Main --> B[Challenges in Adaptive Analysis] B --> B1[More holdouts enable adaptivity,

expensively 6] B --> B2[Adaptive issues: p-hacking, snooping,

forking paths 7] B --> B3[Allow adaptivity while preventing

overfitting 8] B --> B4[Statistical queries framework captures

analysis tasks 9] B --> B5[Adaptivity: next query based

on answers 10] B --> B6[Empirical estimator fails with

adaptive queries 11] Main --> C[Differential Privacy Solutions] C --> C1[Differential privacy ensures distributional

stability 12] C --> C2[DP composes gracefully, allows

modular design 13] C --> C3[Gaussian mechanism answers n2

adaptive queries 14] C --> C4[Optimal DP answers expn adaptive queries 15] C --> C5[Transfer theorem: DP accuracy

on distribution 16] C --> C6[Proving transfer theorem via

thought experiment 17] Main --> D[Algorithms and Applications] D --> D1[Gaussian mechanism allows n^2

adaptive queries 18] D --> D2[Inefficient DP allows expn

adaptive queries 19] D --> D3[ThresholdOut algorithm prevents overfitting 20] D --> D4[Experiments show ThresholdOuts effectiveness 21] Main --> E[Results and Implications] E --> E1[Results generalize beyond expectations 22] E --> E2[Theoretical study provides algorithms,

rules 23] E --> E3[Ladder principle limits information leakage 24] E --> E4[Adaptive analysis with DP

prevents overfitting 25] class Main main class A,A1,A2,A3,A4,A5 intro class B,B1,B2,B3,B4,B5,B6 problems class C,C1,C2,C3,C4,C5,C6 dp class D,D1,D2,D3,D4 solutions class E,E1,E2,E3,E4 results

**Resume: **

**1.-** Tutorial on rigorous data dredging, given alone due to co-tutor's wife being 38 weeks pregnant on Father's Day.

**2.-** Anecdote about email stock prediction scam illustrating overfitting - sending opposite predictions to many and only continuing with correct subset.

**3.-** Error was failing to account for hypothesis class size and ability to overfit. P-hacking is a similar widespread problem.

**4.-** Tools like holdout exist to prevent overfitting in static data analysis, but practice of data analysis is more adaptive.

**5.-** In static analysis, method is selected, then data gathered. In adaptive analysis, method is revised based on outcomes.

**6.-** Having more holdout sets can enable adaptivity, but amount of data needed grows linearly with rounds of adaptivity, which is expensive.

**7.-** Adaptive data analysis issues have many names - data dredging, snooping, p-hacking, garden of forking paths. Some propose preregistration to combat.

**8.-** Goal is to allow adaptive analysis while preventing overfitting. Potential solutions measure information leakage to control future contamination.

**9.-** Statistical queries framework captures many data analysis tasks. Data analyst interacts with estimator that has sample from unknown distribution.

**10.-** Adaptivity means analyst picks next query based on previous answers. Estimator is accurate if whp all answers are close to truth.

**11.-** Empirical estimator is unbiased for non-adaptive queries but fails with adaptive queries - can force any classifier via "bagging-like" procedure.

**12.-** Differential privacy notion of distributional stability, robust to post-processing. Ensures similar output probabilities on neighboring datasets.

**13.-** Differential privacy composes gracefully - running k DP algorithms increases privacy loss only by sqrt(k). Allows modular design of DP algorithms.

**14.-** Basic DP algorithm: Gaussian mechanism adds N(0,1/αn) noise to query. Can answer n^2 adaptive queries for constant error whp.

**15.-** Optimal efficient DP algorithm can answer exp(n) adaptive queries. Inefficient algorithm maintains consistency with answers so far, exponential runtime.

**16.-** Transfer theorem: DP estimator accurate on sample is also accurate on distribution whp. Allows statistical validity under adaptive analysis.

**17.-** Proving transfer theorem via thought experiment with T copies of analyst interacting with estimator. DP means no copy can overfit its sample.

**18.-** Gaussian mechanism plugged into transfer theorem allows n^2 adaptive queries, a quadratic improvement over empirical estimator that's optimal for efficient estimators.

**19.-** Inefficient DP estimator allows exp(n) adaptive queries again when data size exceeds data dimension, else Gaussian mechanism preferred.

**20.-** ThresholdOut algorithm: if holdout and training query answers close, use training answer, else perturbed holdout answer. Allows many queries if low overfitting.

**21.-** Experiments show ThresholdOut prevents overfitting when no signal exists and finds correct number of features when mild signal exists, unlike naive holdout.

**22.-** Results generalize beyond expectations to arbitrary low-sensitivity queries, optimization queries like ERM, and non-low sensitivity queries like p-value computation.

**23.-** Studying adaptive analysis theoretically helps understand risks and benefits of adaptivity, provides algorithms and rules of thumb like careful holdout use.

**24.-** "Ladder principle" - in ML competitions, don't switch models for small improvements on leaderboard, wait for substantial gains to limit information leakage.

**25.-** Talk aimed to allow adaptive data analysis while preventing overfitting through differentially private algorithms. Thanks for attending on Father's Day.

Knowledge Vault built byDavid Vivancos 2024