Knowledge Vault 6/4 - ICML 2015
Advances in Structured Prediction
Hal Daumé III & John Langford
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

Advances in Structured Prediction

Introduction to Joint Prediction
Joint prediction introduction, avoiding the term "structured" (1)
Joint prediction examples: labeling, parsing, translation (2)
Joint prediction approaches: independent, multi-task, graphical (3)
Joint prediction characteristics: discrete predictions, decomposition (4)
Joint prediction issues: huge space, combinatorial loss (5)
Sequential decision-making formulation, reinforcement learning parallels (6)

Learning Approaches
Supervised learning drawbacks: distribution mismatch (7)
Imitation learning: DAgger algorithm, expert queries (8)
Imitation vs reinforcement: labeled data advantage (9)
Imitation limitations: costs, expert access (10)

AggreVaTe Algorithm
AggreVaTe: learned policy, deviations, cost-sensitive classification (11)
Multi-class vs cost-sensitive classification comparison (12)
AggreVaTe components: roll-in, roll-out, deviations (13)
Efficient loss calculation: expert policy, decomposition (14)
Online cost-sensitive classifier: one-against-all strategy (20)

Examples and Applications
Sequence labeling example: time step, labels (15)
Computational advantage: loss decomposition, past decisions (16)
Graph labeling example: linearizing, breadth-first search (17)
Node predictions: neighbor information, multiple passes (18)
Cost-sensitive examples: label perturbations, Hamming loss (19)

Comparisons and Relations
AggreVaTe convergence, empirical risk minimization relation (21)
Non-optimal experts in learning to search (22)
Learning to search vs conditional random fields (23)
Search vs CRFs: prediction vs probability (24)
Search vs recurrent neural networks: complementary (26)

Further Considerations
Graph labeling sensitivity to linearization order (25)
Online cost-sensitive classification in AggreVaTe (27)
Non-decomposable loss optimization: F-score example (28)
Loss decomposability enables computational shortcuts (29)
Further topics: reinforcement learning, neural networks (30)

Resume:

1.- Introduction to structured prediction and joint prediction problems. Avoiding the term "structured prediction" due to association with conditional random fields.

2.- Examples of joint prediction problems: sequence labeling, parsing, matching, machine translation, image segmentation, protein folding.

3.- Approaches to joint prediction: independent predictions, multi-task learning, graphical models, handcrafted solutions. Pros and cons of each.

4.- Characterizing joint prediction problems: learning to optimize a joint loss over discrete predictions, exponential output space, need for output decomposition.

5.- Issues with joint prediction: huge output space, combinatorial loss functions, need to assume output decomposes usefully to make progress.

6.- Sequential decision-making formulation for problems with ordered output decomposition. Parallels to reinforcement learning.
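
A minimal sketch of that sequential formulation (the names State, step, and hamming_loss are my own illustration, not the tutorial's code): a state is the input plus the partial output built so far, an action is the next discrete prediction, and a decomposable joint loss scores the completed output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class State:
    """Input plus the partial output built so far."""
    x: List[str]                                          # e.g. a token sequence
    partial_y: List[str] = field(default_factory=list)    # decisions made so far

    def is_final(self) -> bool:
        return len(self.partial_y) == len(self.x)

    def step(self, action: str) -> "State":
        # Taking an action = committing to the next discrete prediction.
        return State(self.x, self.partial_y + [action])

def hamming_loss(y_hat: List[str], y_gold: List[str]) -> int:
    # A joint loss that decomposes into one term per decision.
    return sum(a != b for a, b in zip(y_hat, y_gold))
```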

7.- Warm-up 1: Vanilla supervised learning approach to solving joint prediction. Drawbacks due to train/test distribution mismatch.
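
To make the mismatch concrete, here is a rough illustration (function names are hypothetical): the per-step classifier is trained on states drawn from gold trajectories, but at test time it conditions on its own, possibly wrong, previous predictions.

```python
def make_features(x, t, prev_label):
    # Hypothetical feature map: current word plus the previous label.
    return {"word": x[t], "prev": prev_label}

def training_examples(x, y_gold):
    # Training states come from the expert/gold trajectory.
    return [(make_features(x, t, y_gold[t - 1] if t else "<s>"), y_gold[t])
            for t in range(len(x))]

def greedy_decode(x, classifier):
    # Test-time states come from the learner's own trajectory: earlier
    # mistakes change the features seen by later decisions.
    y_hat = []
    for t in range(len(x)):
        prev = y_hat[-1] if y_hat else "<s>"
        y_hat.append(classifier(make_features(x, t, prev)))
    return y_hat
```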

8.- Warm-up 2: Imitation learning with the DAgger algorithm. Repeatedly queries the expert on the learner's own trajectories and retrains on the aggregated data. Provides theoretical guarantees.
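
A minimal DAgger-style loop, written as a sketch under assumed helpers train(dataset) -> policy, decode(policy, x) -> visited (state, action) pairs, and expert_policy(state) -> expert action:

```python
def dagger(initial_dataset, inputs, expert_policy, train, decode, n_iters=5):
    """Illustrative DAgger sketch: visit states with the current learned
    policy, label them with the expert, aggregate, and retrain."""
    dataset = list(initial_dataset)
    policy = train(dataset)
    for _ in range(n_iters):
        for x in inputs:
            # Roll in with the current learned policy to visit its states...
            for state, _ in decode(policy, x):
                # ...and ask the expert what it would have done in each one.
                dataset.append((state, expert_policy(state)))
        policy = train(dataset)   # retrain on the aggregated dataset
    return policy
```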

9.- Comparing imitation learning to reinforcement learning. Access to labeled training data and optimal reference policies is an advantage.

10.- Limitations of imitation learning: Doesn't capture varying costs of different wrong predictions. Requires too much expert access.

11.- AggreVaTe algorithm: Rolls in with the learned policy, makes one-step deviations, rolls out with the reference policy, generates cost-sensitive classification examples.
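
The loop can be sketched as follows (an illustration under assumed helpers rollin, rollout, and joint_loss, not the speakers' code): for each deviation step, roll in with the learned policy, try every action, complete the output with the reference policy, and record the resulting cost vector.

```python
def aggrevate_examples(x, y_gold, learned_policy, reference_policy,
                       actions, rollin, rollout, joint_loss):
    """Illustrative sketch of one pass over a single input.

    rollin(policy, x, t)         -> state reached after t steps of the policy
    rollout(policy, state, a, x) -> completed output after forcing action a
    joint_loss(y_hat, y_gold)    -> scalar loss of a complete output
    Returns cost-sensitive examples: (state, cost vector over actions).
    """
    examples = []
    for t in range(len(x)):                      # choose a deviation step
        state = rollin(learned_policy, x, t)     # roll in with the learned policy
        costs = []
        for a in actions:                        # one-step deviation per action
            y_hat = rollout(reference_policy, state, a, x)  # roll out with the reference
            costs.append(joint_loss(y_hat, y_gold))
        examples.append((state, costs))
    return examples
```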

12.- Relationship between multi-class and cost-sensitive classification. The goal is minimizing the expected cost of predictions rather than just the expected error rate.
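
Restated in standard notation (my phrasing, not a quote from the slides): multi-class classification minimizes expected 0/1 error, while cost-sensitive classification receives a full cost vector c per example and minimizes the expected cost of the chosen label.

```latex
\min_h \; \mathbb{E}_{(x,y)}\!\left[\mathbf{1}\{h(x)\neq y\}\right]
\qquad\text{vs.}\qquad
\min_h \; \mathbb{E}_{(x,c)}\!\left[c_{h(x)}\right]
```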

13.- Roll-in, roll-out, and one-step deviations in AggreVaTe. Uses the reference policy for roll-outs.

14.- Efficiently calculating losses without explicit roll-outs when using an expert reference policy, by exploiting loss decomposition (e.g. Hamming loss).

15.- Example 1: Sequence labeling with AGGRAVATE. Picking a time step, trying all labels, rolling out with expert, collecting cost-sensitive examples.

16.- Leveraging loss decomposition in sequence labeling to avoid doing roll-outs, gaining computational advantage. Only need to count incorrect past decisions.
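
A sketch of that shortcut under Hamming loss (my own illustration): assuming the expert reference policy labels the remaining positions correctly, the cost of trying label a at step t is the number of past mistakes, which is constant across labels, plus 0/1 for the deviation itself, so no roll-out is executed.

```python
def hamming_costs_without_rollout(prefix_y_hat, y_gold, t, labels):
    """Cost vector at deviation step t under Hamming loss, assuming an
    optimal expert would label all remaining positions correctly."""
    past_mistakes = sum(a != b for a, b in zip(prefix_y_hat, y_gold[:t]))
    # past_mistakes is the same for every candidate label, so only the
    # 0/1 term at step t affects which label looks cheapest.
    return [past_mistakes + (a != y_gold[t]) for a in labels]
```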

17.- Example 2: Graph labeling with AGGRAVATE. Linearizing graph traversal in a belief propagation-like manner. Breadth-first search ordering.

18.- Making predictions on a node given its neighbors' information. Multiple inward and outward passes provide interior and exterior information.
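
One way to realize this, sketched with hypothetical names (bfs_linearization, label_graph, predict): order the nodes breadth-first from a root, then sweep that order forwards and backwards so each node eventually sees labels from both its interior and exterior neighbours.

```python
from collections import deque

def bfs_linearization(adj, root):
    """Breadth-first ordering of nodes; adj maps node -> list of neighbours."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def label_graph(adj, root, predict, n_passes=2):
    """Predict a label per node from its neighbours' current labels.
    Alternating forward/backward sweeps play the role of inward and
    outward passes (labels may still be None on the very first sweep)."""
    order = bfs_linearization(adj, root)
    labels = {u: None for u in order}
    for p in range(n_passes):
        sweep = order if p % 2 == 0 else list(reversed(order))
        for u in sweep:
            labels[u] = predict(u, [labels[v] for v in adj[u]])
    return labels
```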

19.- Constructing cost-sensitive examples for a node by trying all label perturbations. Hamming loss enables avoiding roll-outs.

20.- Using an online cost-sensitive classifier in the inner loop of AggreVaTe. A one-against-all strategy is common in practice.
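
A minimal one-against-all reduction, sketched here as online least-squares regression of each label's cost (an assumption for illustration; any online regressor would do): fit one regressor per label to predict its cost, then choose the label with the lowest predicted cost.

```python
import numpy as np

class OneAgainstAllCostSensitive:
    """Cost-sensitive classification via one cost regressor per label."""

    def __init__(self, n_features, n_labels, lr=0.1):
        self.w = np.zeros((n_labels, n_features))   # one weight vector per label
        self.lr = lr

    def update(self, x, costs):
        # One squared-loss gradient step per label towards its observed cost.
        x = np.asarray(x, dtype=float)
        preds = self.w @ x
        self.w -= self.lr * np.outer(preds - np.asarray(costs, dtype=float), x)

    def predict(self, x):
        # Pick the label whose predicted cost is lowest.
        return int(np.argmin(self.w @ np.asarray(x, dtype=float)))
```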

21.- Convergence properties of AggreVaTe and relation to empirical risk minimization. The second half of the tutorial covers this.

22.- Handling non-optimal experts in learning to search. Recent work addresses this (paper being presented at the conference).

23.- Comparing learning to search and conditional random fields (CRFs). High-level similarities but differing approaches and optimization objectives.

24.- Learning to search reduces to prediction ability as a primitive, while CRFs reduce to probability modeling ability. Hard to directly compare.

25.- Sensitivity of graph labeling results to linearization order. Multiple inward/outward passes make the order matter less.

26.- Relationship between learning to search and recurrent neural networks. Orthogonal: one is algorithmic, the other representational. Potentially complementary.

27.- Online cost-sensitive classification as a sub-component of AggreVaTe. The choice of classifier affects final prediction quality.

28.- Possibility of optimizing non-decomposable loss functions like F-score using learning to search, but requiring explicit roll-outs.
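
For instance, a sequence-level F1 loss (illustrative code, with an assumed positive label) depends on precision and recall over the whole output, so it cannot be decomposed per position and each deviation must be rolled out to completion before it can be scored.

```python
def f1_loss(y_hat, y_gold, positive="B"):
    """1 - F1 of a complete output; not decomposable into per-position terms."""
    tp = sum(a == positive and b == positive for a, b in zip(y_hat, y_gold))
    if tp == 0:
        return 1.0
    precision = tp / sum(a == positive for a in y_hat)
    recall = tp / sum(b == positive for b in y_gold)
    return 1.0 - 2 * precision * recall / (precision + recall)
```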

29.- Decomposability of the loss function enables computational shortcuts in AggreVaTe when using an expert for roll-outs. Hamming loss is an example.

30.- Further discussion of connections to reinforcement learning, handling non-optimal experts, and recurrent neural networks in the second half of the tutorial.

Knowledge Vault built by David Vivancos 2024