Masashi Sugiyama ICLR 2023 - Invited Talk - Importance-Weighting Approach to Distribution Shift Adaptation

**Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:**

```mermaid
graph LR
classDef reliable fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef weakly fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef noisy fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef transfer fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef future fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Masashi Sugiyama<br>ICLR 2023] --> B[Reliable ML: challenges, improve reliability. 1]
B --> C[Topics: weakly supervised, noisy labels, transfer learning. 2]
C --> D[Weakly supervised uses weak data, not fully labeled. 3]
D --> E[PU classification: positive, unlabeled samples only. 4]
D --> F[Other binary problems: PC, UU, SD, PNU. 5]
D --> G[Multi-class problems: complementary, partial, single-class labels. 6]
D --> H["Weakly Supervised Learning" book: unified framework. 7]
C --> I[Noisy label learning: train with label noise. 8]
I --> J[Loss correction by estimating noise transition matrix. 9]
I --> K[Volume minimization jointly estimates classifier, noise matrix. 10]
C --> L[Importance weight ratio estimation without separate densities. 11]
L --> M[Joint importance-predictor estimation minimizes test risk bound. 12]
L --> N[Online ensemble for continuous covariate shift. 13]
C --> O[Handle distribution shift beyond covariate shift. 14]
O --> P[Minibatch-wise approach for arbitrary joint shift. 15]
A --> Q[Future directions: combine techniques, handle joint shift. 16]
Q --> R[Practical: balance updates, robustness to malicious data. 17]
Q --> S[Estimate class prior in PU learning. 18]
Q --> T[Estimate noise matrix end-to-end by volume minimization. 19]
Q --> U[Meta-learn dynamic learning rate for online shift. 20]
Q --> V[Estimate input densities from empirical samples. 21]
Q --> W[Bridge theory and deep learning practice gap. 22]
Q --> X[Combine weakly supervised and dynamic feature learning. 23]
Q --> Y[Analyze joint representation, importance, shift methods. 24]
Q --> Z[Scale shift handling in large language models. 25]
Z --> AA[Question need for adaptation in large models. 26]
Z --> AB[Continual learning under shift in language models. 27]
Z --> AC[Limited memory approaches for joint shift adaptation. 28]
A --> AD[Overview: weakly supervised, noisy label, transfer learning. 29]
AD --> AE[Themes: risk estimation, importance, noise matrices, algorithms. 30]
class A,B,AD,AE reliable;
class C,D,E,F,G,H weakly;
class I,J,K noisy;
class L,M,N,O,P transfer;
class Q,R,S,T,U,V,W,X,Y,Z,AA,AB,AC future;
```


**Resume:**

**1.-**The talk focuses on reliable machine learning, addressing challenges like insufficient information, label noise, and data bias to improve system reliability.

**2.-**Three main topics are covered: weakly supervised learning, noisy label learning, and transfer learning, with the goal of more reliable ML.

**3.-**Weakly supervised classification uses weak supervision like positive and unlabeled data instead of fully labeled data, which is often too costly.

**4.-**Positive-Unlabeled (PU) classification trains a classifier using only positive and unlabeled samples, without any negative samples, by estimating risk functionals.
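The risk-estimation idea behind PU classification can be sketched as follows. Assuming the class prior is known, the negative-class risk is rewritten using only positive and unlabeled data, and the rewritten term is clamped at zero (the non-negative correction in the style of Kiryo et al.). The hinge loss and the sample values here are illustrative choices, not from the talk.

```python
import numpy as np

def nn_pu_risk(scores_p, scores_u, prior,
               loss=lambda z: np.maximum(0.0, 1.0 - z)):
    """Non-negative PU risk estimator (sketch).

    scores_p: classifier outputs on positive samples
    scores_u: classifier outputs on unlabeled samples
    prior:    assumed-known class prior pi = P(y = +1)
    loss:     margin loss l(z); hinge loss by default
    """
    # Positive-class risk, weighted by the class prior
    r_pos = prior * loss(scores_p).mean()
    # Negative-class risk expressed with P and U data only:
    # E_u[l(-f)] - pi * E_p[l(-f)]
    r_neg = loss(-scores_u).mean() - prior * loss(-scores_p).mean()
    # Clamp at zero so the empirical risk cannot go negative (overfitting guard)
    return r_pos + max(r_neg, 0.0)

risk = nn_pu_risk(np.array([2.0, 1.5, 0.5]),
                  np.array([-1.0, 0.2, -2.0, 1.0]), prior=0.4)
```

In practice the scores come from a trainable model and this estimator is minimized by gradient descent, with the clamp applied per minibatch.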

**5.-**Other weakly supervised binary classification problems include Positive-Confidence, Unlabeled-Unlabeled, Similar-Dissimilar, and Positive-Negative-Unlabeled classification, solvable using the same risk estimation framework.

**6.-**Multi-class weakly supervised problems like complementary labels, partial labels, and single-class confidence can also be addressed within the empirical risk minimization framework.

**7.-**The book "Weakly Supervised Learning" covers this topic in detail, providing a unified framework combining any loss function, classifier, optimizer and regularizer.

**8.-**Noisy label learning aims to train classifiers from data with noisy labels, which is challenging especially for input-dependent label noise.

**9.-**Loss correction methods based on estimating the noise transition matrix T can handle noisy labels, but T is difficult to estimate accurately.
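A minimal sketch of loss correction with a known transition matrix T: the model's clean-class probabilities are pushed through T to obtain noisy-label probabilities, and the likelihood of the observed noisy labels is maximized (forward correction in the style of Patrini et al.). The 2x2 matrix here is a hypothetical example; in the talk's setting T itself must be estimated.

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, T):
    """Forward loss correction (sketch).

    probs:        (n, k) clean-class probabilities predicted by the model
    noisy_labels: (n,) observed noisy labels
    T:            (k, k) noise transition matrix, T[i, j] = P(noisy=j | clean=i)
    """
    # Map clean predictions to predicted noisy-label probabilities
    noisy_probs = probs @ T
    # Negative log-likelihood of the labels we actually observed
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(picked).mean()

T = np.array([[0.9, 0.1],   # hypothetical noise: 10% of class 0 flips to 1
              [0.2, 0.8]])  # and 20% of class 1 flips to 0
probs = np.array([[0.7, 0.3], [0.2, 0.8]])
loss = forward_corrected_nll(probs, np.array([0, 1]), T)
```

Minimizing this corrected loss is consistent for the clean-label classifier when T is correct, which is exactly why an accurate estimate of T matters.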

**10.-**A volume minimization approach is proposed to jointly estimate the classifier and noise transition matrix T by minimizing the simplex volume.

**11.-**Methods are proposed for directly estimating the importance weight ratio between test and train distributions without estimating them separately.
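Direct ratio estimation can be illustrated with a least-squares fit in the spirit of uLSIF: model w(x) = p_test(x)/p_train(x) as a linear combination of Gaussian kernels and solve for the coefficients in closed form, never forming the two densities. The kernel width and regularizer below are hypothetical fixed values (normally chosen by cross-validation).

```python
import numpy as np

def ulsif_weights(x_train, x_test, sigma=1.0, lam=0.1):
    """Direct importance-weight estimation, uLSIF-style sketch."""
    def gauss_kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    centers = x_test                            # kernel centres at test points
    Phi_tr = gauss_kernel(x_train, centers)     # (n_tr, b) train design matrix
    Phi_te = gauss_kernel(x_test, centers)      # (n_te, b) test design matrix
    H = Phi_tr.T @ Phi_tr / len(x_train)        # 2nd moment under train dist.
    h = Phi_te.mean(axis=0)                     # 1st moment under test dist.
    # Closed-form ridge solution, then clamp: density ratios are non-negative
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(alpha, 0.0)
    return Phi_tr @ alpha                       # importance weights on train set

rng = np.random.default_rng(0)
w = ulsif_weights(rng.normal(0.0, 1.0, (50, 1)),
                  rng.normal(0.5, 1.0, (40, 1)))
```

The resulting weights can then reweight the training loss so that empirical risk minimization targets the test distribution.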

**12.-**A joint importance-predictor estimation method minimizes a justifiable upper bound on the test risk, improving upon two-step importance weighting approaches.

**13.-**Under continuous covariate shift where the input distribution changes over time, an online ensemble approach achieves optimal dynamic regret without knowing shift speed.
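The ensemble mechanism can be sketched with a multiplicative-weights (Hedge-style) combiner: base learners run in parallel and the meta-learner shifts mass toward whichever learner is currently tracking the drifting distribution best. This is a generic illustration of the online-ensemble idea, not the paper's exact algorithm; eta is a hypothetical fixed meta learning rate.

```python
import numpy as np

def hedge_ensemble(losses, eta=0.5):
    """Exponentially weighted online ensemble (sketch).

    losses: (T, k) loss of each of k base learners at each round
    Returns the (T, k) per-round combination weights over the learners.
    """
    T, k = losses.shape
    w = np.full(k, 1.0 / k)          # start from a uniform combination
    weights = np.empty((T, k))
    for t in range(T):
        weights[t] = w               # weights used at round t
        # Down-weight learners that suffered high loss this round
        w = w * np.exp(-eta * losses[t])
        w /= w.sum()
    return weights

# Toy run: learner 1 is consistently better, so it should gain weight
L = np.tile(np.array([[1.0, 0.0]]), (5, 1))
W = hedge_ensemble(L)
```

With base learners configured for different effective shift speeds, such a combiner adapts without knowing the true speed, which is the intuition behind the dynamic-regret guarantee.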

**14.-**Reliable machine learning requires handling distribution shift beyond covariate shift, as the test domain may not be covered by the training domain.

**15.-**For arbitrary joint shift where both P(x) and P(y|x) change, a minibatch-wise approach dynamically estimates importance weights by loss matching.

**16.-**Future directions include combining joint shift adaptation with weakly supervised learning, handling continuous joint shift, and incorporating limited memory continual learning.

**17.-**Practical considerations include balancing frequent model updates to reflect new data with robustness to malicious data through periodic/buffered updating schemes.

**18.-**Estimating the class prior probability p in PU learning is challenging and requires assumptions like positive-negative separability; various estimation methods have been proposed.

**19.-**In noisy label learning, the noise transition matrix T can be estimated end-to-end using a volume minimization approach with simplicial constraints.

**20.-**Meta-learning approaches to dynamically estimate the learning rate in online learning under continuous distribution shift are a promising research direction.

**21.-**Marginal input densities in importance weighting methods can be estimated from empirical samples, enabling practical implementation with representation learning models.

**22.-**Bridging the gap between theoretical analysis and deep learning practice in reliable machine learning is an ongoing challenge and opportunity.

**23.-**Weakly supervised learning techniques can potentially be combined with dynamic feature learning in practice to boost robustness and performance.

**24.-**Analyzing combined methods that jointly learn representations, estimate importance weights, and adapt to distribution shift remains an open theoretical problem.

**25.-**Scaling techniques for handling distribution shift in very large language models during fine-tuning is an important research problem.

**26.-**The need for domain adaptation in large pre-trained models is questioned, as their generality may already suffice for many domains.

**27.-**Continual learning under distribution shift in large language models is a key scenario requiring techniques that avoid storing all data.

**28.-**Limited memory approaches for continual joint distribution shift adaptation are crucial for scalability but require further research and development efforts.

**29.-**The talk gives an overview of reliable machine learning research spanning weakly supervised, noisy label, and transfer learning settings.

**30.-**Key themes include estimating risk functionals, importance weights and noise transition matrices, aiming to provide practical algorithms with theoretical guarantees.

Knowledge Vault built by David Vivancos 2024