Concept Graph & Resume using Claude 3.5 Sonnet | Chat GPT4o | Llama 3:
Resume:
1.- Tutorial on rigorous data dredging, delivered solo because the co-tutor's wife was 38 weeks pregnant on Father's Day.
2.- Anecdote about an email stock-prediction scam illustrating overfitting: send opposite predictions to different recipients and continue only with the subset that received correct ones.
3.- The recipients' error was failing to account for the size of the hypothesis class and its ability to overfit. P-hacking is a similar widespread problem.
4.- Tools like holdout sets exist to prevent overfitting in static data analysis, but data analysis in practice is more adaptive.
5.- In static analysis, the method is selected before the data are gathered; in adaptive analysis, the method is revised based on outcomes.
6.- Keeping multiple holdout sets can enable adaptivity, but the amount of data needed grows linearly with the number of rounds of adaptivity, which is expensive.
7.- Adaptive data analysis issues go by many names: data dredging, snooping, p-hacking, the garden of forking paths. Some propose preregistration to combat them.
8.- The goal is to allow adaptive analysis while preventing overfitting. Potential solutions measure information leakage to control future contamination.
9.- The statistical queries framework captures many data analysis tasks: the analyst interacts with an estimator that holds a sample from an unknown distribution.
10.- Adaptivity means the analyst picks the next query based on previous answers. The estimator is accurate if, with high probability (whp), all answers are close to the truth.
11.- The empirical estimator is unbiased for non-adaptive queries but fails under adaptive queries: an analyst can force any classifier via a "bagging-like" procedure.
12.- Differential privacy is a notion of distributional stability that is robust to post-processing: it ensures similar output probabilities on neighboring datasets.
13.- Differential privacy composes gracefully: running k DP algorithms increases the privacy loss only by a factor of sqrt(k), allowing modular design of DP algorithms.
14.- Basic DP algorithm: the Gaussian mechanism adds N(0, 1/(αn)) noise to each query answer and can answer ~n^2 adaptive queries with constant error whp.
15.- The optimal DP algorithm can answer exp(n) adaptive queries; it is inefficient, maintaining consistency with all answers so far at exponential runtime.
16.- Transfer theorem: a DP estimator that is accurate on the sample is whp also accurate on the distribution, giving statistical validity under adaptive analysis.
17.- The transfer theorem is proved via a thought experiment with T copies of the analyst interacting with the estimator; DP means no copy can overfit its sample.
18.- Plugging the Gaussian mechanism into the transfer theorem allows n^2 adaptive queries, a quadratic improvement over the empirical estimator and optimal among efficient estimators.
19.- The inefficient DP estimator again allows exp(n) adaptive queries when the data size exceeds the data dimension; otherwise the Gaussian mechanism is preferred.
20.- ThresholdOut algorithm: if the holdout and training answers to a query are close, return the training answer; otherwise return a perturbed holdout answer. This allows many queries when overfitting is low.
21.- Experiments show ThresholdOut prevents overfitting when no signal exists and finds the correct number of features when a mild signal exists, unlike the naive holdout.
22.- The results generalize beyond expectation queries to arbitrary low-sensitivity queries, optimization queries like ERM, and non-low-sensitivity queries like p-value computation.
23.- Studying adaptive analysis theoretically helps in understanding the risks and benefits of adaptivity and yields algorithms and rules of thumb, such as careful holdout use.
24.- "Ladder principle": in ML competitions, don't switch models for small improvements on the leaderboard; wait for substantial gains to limit information leakage.
25.- The talk aimed to enable adaptive data analysis while preventing overfitting through differentially private algorithms. Thanks for attending on Father's Day.
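The failure of the empirical estimator under adaptivity (item 11) can be reproduced in a few lines. This is a minimal sketch, not the talk's exact procedure: the two-round attack queries per-feature correlations on pure noise, then votes the positively correlated features, producing a classifier that looks accurate on the sample even though no signal exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 1000                        # sample size, number of candidate features
X = rng.choice([-1, 1], size=(n, d))    # pure-noise features
y = rng.choice([-1, 1], size=n)         # labels independent of X: no real signal

# Adaptive round 1: query the empirical correlation of every feature with y.
corr = X.T @ y / n

# Adaptive round 2 (chosen based on round 1's answers): keep the positively
# correlated features and let them vote.
selected = corr > 0
clf = np.sign(X[:, selected].sum(axis=1))
clf[clf == 0] = 1
train_acc = (clf == y).mean()

# On fresh data from the same signal-free distribution, accuracy falls to ~0.5.
X2 = rng.choice([-1, 1], size=(1000, d))
y2 = rng.choice([-1, 1], size=1000)
clf2 = np.sign(X2[:, selected].sum(axis=1))
clf2[clf2 == 0] = 1
fresh_acc = (clf2 == y2).mean()
print(train_acc, fresh_acc)
```

With many more candidate features than samples, the in-sample accuracy is far above chance while the out-of-sample accuracy stays near 0.5 — the empirical estimator's answers were contaminated by adaptivity.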
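The Gaussian mechanism of item 14 can be sketched as follows. The noise scale below is illustrative; the calibration to the target accuracy α and the privacy parameters is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mechanism(sample, query, sigma):
    """Answer a statistical query -- the mean of a [0,1]-valued function
    over the sample -- with Gaussian noise added for stability/privacy."""
    empirical = np.mean([query(x) for x in sample])
    return empirical + rng.normal(0.0, sigma)

# Toy usage: estimate P(x = 1) for Bernoulli(0.5) data.
n = 10_000
sample = rng.integers(0, 2, size=n)
ans = gaussian_mechanism(sample, lambda x: x, sigma=0.01)
print(ans)
```

Each released answer is the empirical mean plus fresh noise, which is what makes the mechanism's output distributionally stable across neighboring datasets.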
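The ThresholdOut idea of item 20 follows directly from that description. The threshold and noise scale below are illustrative choices, not the tuned values from the original algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def thresholdout(train, holdout, query, threshold=0.04, sigma=0.01):
    """If the training and holdout answers agree to within a (noisy)
    threshold, release the training answer; otherwise release a noisy
    holdout answer, so each query leaks little about the holdout set."""
    t = float(np.mean([query(x) for x in train]))
    h = float(np.mean([query(x) for x in holdout]))
    if abs(t - h) <= threshold + rng.normal(0.0, sigma):
        return t
    return h + rng.normal(0.0, sigma)

# Toy usage: both sets drawn from Bernoulli(0.5).
train = rng.integers(0, 2, size=5000)
holdout = rng.integers(0, 2, size=5000)
ans = thresholdout(train, holdout, lambda x: x)
print(ans)
```

As long as the analyst is not overfitting, answers come from the training set for free; the holdout's "budget" is spent only when the two disagree.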
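The ladder principle of item 24 can be sketched as a leaderboard that only releases a new score when the improvement is substantial; the step size here is an illustrative parameter.

```python
def ladder(scores, step=0.01):
    """Release a new leaderboard score only when a submission beats the
    best released score by at least `step`; otherwise repeat the previous
    best, limiting the information leaked by repeated submissions."""
    best = float("-inf")
    released = []
    for s in scores:
        if s >= best + step:
            best = s
        released.append(best)
    return released

print(ladder([0.70, 0.705, 0.72]))  # [0.7, 0.7, 0.72]
```

The middle submission's small gain is suppressed, so the competitor learns nothing from it and cannot climb the leaderboard by chasing noise.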
Knowledge Vault built by David Vivancos 2024