Knowledge Vault 6/29 - ICML 2017
Real World Interactive Learning
Alekh Agarwal & John Langford

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
classDef supervised fill:#f9d4d4, font-weight:bold, font-size:14px
classDef interactive fill:#d4f9d4, font-weight:bold, font-size:14px
classDef contextual fill:#d4d4f9, font-weight:bold, font-size:14px
classDef algorithms fill:#f9f9d4, font-weight:bold, font-size:14px
classDef practical fill:#f9d4f9, font-weight:bold, font-size:14px
Main[Real World Interactive Learning]
Main --> A[Supervised learning ignores valuable interaction data 1]
Main --> B[Interactive learning leverages interaction data 2]
B --> C[Algorithm learns from features, actions, rewards 3]
B --> D[Full RL: special domains, large samples 4]
Main --> E[Contextual bandits: right signal, handle non-stationarity 5]
E --> F[Fits real-world problems: recommendations, ads, education 6]
E --> G[Tutorial: algorithms, theory, evaluation, exploration 7]
E --> H[Observe features, choose action, receive reward 8]
H --> I[Policies map features to actions 9]
H --> J[Offline evaluation using inverse propensity scoring 10]
Main --> K[Offline learning: importance weighted multi-class classification 11]
K --> L[Exploration algorithms: epsilon-greedy, Thompson sampling 12]
K --> M[Progressive validation for unbiased offline evaluation 13]
K --> N[Rejection sampling evaluates full interaction loop 14]
Main --> O[Failure modes: mismatched probabilities, non-stationarity 15]
O --> P[Learning systems needed: modular, scalable, general 16]
P --> Q[Decision Service: open-source contextual bandit system 17]
P --> R[Other systems: NEXT, StreamingBandit, different capabilities 18]
Main --> S[Non-stationarity: key issue requiring special techniques 19]
S --> T[Combinatorial actions need semi-bandits, submodularity approaches 20]
S --> U[Reward specification critical, map goals to proxies 21]
S --> V[Smart encoding reduces variance, improves efficiency 22]
Main --> W[Workable recipes exist for common scenarios 23]
W --> X[Complex.com case study: substantial real-world benefits 24]
W --> Y[Offline validation enables rapid evaluation 25]
W --> Z[Contextual bandits fit for broad consumption 26]
Main --> AA[Research needed: automatic algorithms, expanding RL 27]
AA --> AB[Contextual bandits: reliable, robust for applications 28]
AB --> AC[Example: personalizing EEG-based typing for disabled 29]
AB --> AD[Research benefited from many collaborators 30]
class A,B,C,D supervised
class E,F,G,H,I,J contextual
class K,L,M,N algorithms
class O,P,Q,R,S,T,U,V practical
class W,X,Y,Z,AA,AB,AC,AD interactive

Resume:

1.- Supervised learning is the bread and butter of machine learning, but ignores valuable interaction data.

2.- Interactive machine learning, including contextual bandits, can leverage interaction data to improve models.

3.- In interactive learning, the algorithm learns from features, actions, and rewards in a continuous loop.

4.- Full reinforcement learning requires special domains and large sample sizes, while active learning suffers from a wrong-signal problem: the feedback it gathers is labels rather than the rewards of interest.

5.- Contextual bandits provide the right reward signal, handle non-stationarity, and act as economically viable AI agents.

6.- Contextual bandits are a good fit for many real-world problems like recommendations, ads, education, music, robotics and wellness.

7.- The tutorial covers algorithms, theory, evaluation, learning, exploration, practical issues, systems, and experiences.

8.- In contextual bandits, features are observed, an action is chosen, and a reward is received, with the goal of maximizing reward.

9.- Policies map features to actions. Exploration, usually randomized, is critical to gather needed information.
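
A minimal sketch of the interaction loop behind points 8-9, in Python; the `explorer` and `environment` objects are hypothetical stand-ins for the exploration policy and the deployed application. The essential detail is that the probability of the chosen action is logged together with the features, action, and reward so the data can be reused offline.

```python
import random

def run_interaction_loop(explorer, environment, rounds=1000):
    """Contextual bandit loop: observe features, draw an action from the
    exploration policy's distribution, receive a reward for that action only,
    and log (features, action, probability, reward)."""
    logged = []
    for _ in range(rounds):
        x = environment.observe_features()        # context features (hypothetical interface)
        probs = explorer.action_distribution(x)   # randomized policy over the available actions
        a = random.choices(range(len(probs)), weights=probs)[0]
        r = environment.reward(a)                 # only the chosen action's reward is observed
        logged.append((x, a, probs[a], r))        # the logged probability enables offline reuse
    return logged
```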

10.- Offline policy evaluation is possible using techniques like inverse propensity scoring, enabling rapid testing of new policies.
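
A sketch of the inverse propensity scoring (IPS) estimator from point 10, assuming each logged record stores the probability with which the logged action was played and that the new policy is deterministic; `target_policy` is a hypothetical policy object.

```python
def ips_value(logged, target_policy):
    """IPS estimate of the average reward a new policy would have earned,
    computed from data logged under a different, randomized policy.
    Each record is (features x, action a, logging probability p, reward r)."""
    total = 0.0
    for x, a, p, r in logged:
        if target_policy.predict(x) == a:   # new policy agrees with the logged action
            total += r / p                  # reweight by 1/p to undo the logging bias
    return total / len(logged)
```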

11.- Offline learning from exploration data is feasible by reducing the problem to importance weighted multi-class classification.
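
One way to realize the reduction in point 11, sketched under the assumption that rewards are non-negative: every logged event becomes a multiclass example whose importance weight is the reward divided by the logging probability, so a learner that maximizes weighted accuracy also maximizes the IPS estimate of its policy's value.

```python
def to_importance_weighted_multiclass(logged):
    """Reduce offline contextual bandit learning to importance-weighted
    multiclass classification: (x, a, p, r) -> (features x, label a, weight r/p)."""
    return [(x, a, r / p) for x, a, p, r in logged]
```

Any importance-weighted or cost-sensitive multiclass learner can consume such examples; for instance, Vowpal Wabbit's contextual bandit modes accept logged (action, cost, probability) data directly.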

12.- Exploration algorithms like epsilon-greedy, Thompson sampling, and EXP4 balance exploration and exploitation.
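
As a concrete example for point 12, a sketch of the epsilon-greedy action distribution, which keeps every action's logging probability bounded away from zero; Thompson sampling instead randomizes by sampling a plausible model and acting greedily with respect to it.

```python
def epsilon_greedy_distribution(greedy_action, num_actions, epsilon=0.05):
    """Epsilon-greedy: mostly play the current policy's preferred action,
    but reserve probability epsilon for uniform exploration so every action
    retains a nonzero probability of being tried (and logged)."""
    probs = [epsilon / num_actions] * num_actions
    probs[greedy_action] += 1.0 - epsilon
    return probs
```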

13.- Progressive validation enables unbiased offline evaluation of learning algorithms on streaming data.
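
A sketch of progressive validation (point 13), assuming a hypothetical online `learner` with `predict`, `loss`, and `update` methods: each example is scored before the model trains on it, so the running average of these losses behaves like held-out test error without reserving a separate test set.

```python
def progressive_validation(learner, stream):
    """Predict-before-train evaluation on a stream of (features, label) pairs."""
    total_loss, n = 0.0, 0
    for x, y in stream:
        total_loss += learner.loss(learner.predict(x), y)  # evaluate with the current model
        learner.update(x, y)                               # then learn from the example
        n += 1
    return total_loss / n
```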

14.- Rejection sampling allows offline evaluation of exploration algorithms, considering the full interaction data loop.
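
A sketch of the rejection-sampling (replay) evaluator from point 14, under the simplifying assumption that the log was collected with uniformly random actions; `bandit_algo` is a hypothetical exploration algorithm with `choose` and `learn` methods. Events where the candidate algorithm disagrees with the logged action are discarded, and the accepted subsequence mimics a genuine online run.

```python
def replay_evaluate(bandit_algo, uniform_log):
    """Offline evaluation of a full exploration algorithm by replaying
    uniformly-logged events and keeping only the matching ones."""
    rewards = []
    for x, logged_action, _, r in uniform_log:
        a = bandit_algo.choose(x)
        if a == logged_action:           # accepted: the reward for this action is known
            bandit_algo.learn(x, a, r)   # the algorithm updates exactly as it would online
            rewards.append(r)
        # rejected events provide no feedback and are skipped
    return sum(rewards) / max(len(rewards), 1)
```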

15.- Failure modes in practice include mismatched action probabilities, non-stationary features, and delayed or unobserved rewards.

16.- Learning systems rather than just algorithms are needed, with modular, scalable designs, generality, and offline reproducibility.

17.- The Decision Service is an open-source and managed contextual bandit system that addresses many practical issues by design.

18.- Other recent contextual bandit systems include NEXT and StreamingBandit, with some differences in capabilities.

19.- Non-stationarity is a key issue in practice, requiring time-based and ensemble techniques beyond standard theory.

20.- Combinatorial action spaces like rankings require special approaches based on semi-bandits, submodularity, or cascading models.

21.- Reward function specification is critical and complex, often requiring mapping long-term goals to good short-term proxies.

22.- Smart reward encoding, like using infrequent nonzero rewards, can greatly reduce variance and improve data efficiency.
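
A small illustration of point 22, using hypothetical click data: because each IPS term is the reward divided by the logging probability, an encoding in which the frequent outcome (no click) maps to zero leaves most terms exactly zero and sharply reduces the estimator's variance, compared with an encoding where the frequent outcome is nonzero.

```python
def ips_terms(logged, target_policy, encode):
    """Per-event IPS terms under a chosen reward encoding; their variance
    drives the noise of the offline estimate. `encode` maps the raw outcome
    (e.g., click / no click) to a numeric reward."""
    return [encode(outcome) / p if target_policy.predict(x) == a else 0.0
            for x, a, p, outcome in logged]

# Two encodings of the same outcome (hypothetical):
reward_if_click = lambda clicked: 1.0 if clicked else 0.0    # nonzero only on rare clicks
cost_if_no_click = lambda clicked: 0.0 if clicked else 1.0   # nonzero on the common case
```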

23.- Despite gaps between theory and practice, workable recipes exist for common scenarios in framing contextual bandit problems.

24.- A Complex.com case study demonstrates how contextual bandit approaches can provide substantial real-world benefits.

25.- Offline progressive validation enables rapid evaluation of new models, features, and exploration algorithms on real data.

26.- Contextual bandit techniques have matured to be fit for broad consumption, providing gains over supervised learning with less complexity than RL.

27.- More research is needed on automatic/parameter-free algorithms and expanding the tractable subset of RL problems.

28.- For practitioners, contextual bandits are becoming more reliable, robust and usable for real applications.

29.- An example application is using contextual bandits to personalize EEG-based typing for disabled individuals.

30.- The research has benefited from many collaborators, with slides and references available on hunch.net.

Knowledge Vault built by David Vivancos 2024