Knowledge Vault 6/4 - ICML 2015
Advances in Structured Prediction
Hal Daumé III & John Langford
< Resume Image >

Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

graph LR
  classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef approaches fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef algorithms fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef examples fill:#f9d4f9, font-weight:bold, font-size:14px
  classDef comparisons fill:#d4f9f9, font-weight:bold, font-size:14px
  Main[Advances in Structured Prediction]
  Main --> A[Introduction to Joint Prediction]
  A --> A1[Joint prediction introduction, avoiding term structured 1]
  A --> A2[Joint prediction examples: labeling, parsing, translation 2]
  A --> A3[Joint prediction approaches: independent, multi-task, graphical 3]
  A --> A4[Joint prediction characteristics: discrete predictions, decomposition 4]
  A --> A5[Joint prediction issues: huge space, combinatorial 5]
  A --> A6[Sequential decision-making formulation, reinforcement learning parallels 6]
  Main --> B[Learning Approaches]
  B --> B1[Supervised learning drawbacks: distribution mismatch 7]
  B --> B2[Imitation learning: DAgger algorithm, expert queries 8]
  B --> B3[Imitation vs reinforcement: labeled data advantage 9]
  B --> B4[Imitation limitations: costs, expert access 10]
  Main --> C[AGGRAVATE Algorithm]
  C --> C1[AGGRAVATE: learned policy, deviations, cost-sensitive classification 11]
  C --> C2[Multi-class vs cost-sensitive classification comparison 12]
  C --> C3[AGGRAVATE components: roll-in, roll-out, deviations 13]
  C --> C4[Efficient loss calculation: expert policy, decomposition 14]
  C --> C5[Online cost-sensitive classifier: one-against-all strategy 20]
  Main --> D[Examples and Applications]
  D --> D1[Sequence labeling example: time step, labels 15]
  D --> D2[Computational advantage: loss decomposition, past decisions 16]
  D --> D3[Graph labeling example: linearizing, breadth-first search 17]
  D --> D4[Node predictions: neighbor information, multiple passes 18]
  D --> D5[Cost-sensitive examples: label perturbations, Hamming loss 19]
  Main --> E[Comparisons and Relations]
  E --> E1[AGGRAVATE convergence, empirical risk minimization relation 21]
  E --> E2[Non-optimal experts in learning to search 22]
  E --> E3[Learning to search vs conditional random fields 23]
  E --> E4[Search vs CRFs: prediction vs probability 24]
  E --> E5[Search vs recurrent neural networks: complementary 26]
  Main --> F[Further Considerations]
  F --> F1[Graph labeling sensitivity to linearization order 25]
  F --> F2[Online cost-sensitive classification in AGGRAVATE 27]
  F --> F3[Non-decomposable loss optimization: F-score example 28]
  F --> F4[Loss decomposability enables computational shortcuts 29]
  F --> F5[Further topics: reinforcement learning, neural networks 30]
  class Main main
  class A,A1,A2,A3,A4,A5,A6 intro
  class B,B1,B2,B3,B4 approaches
  class C,C1,C2,C3,C4,C5 algorithms
  class D,D1,D2,D3,D4,D5 examples
  class E,E1,E2,E3,E4,E5,F,F1,F2,F3,F4,F5 comparisons

Resume:

1.- Introduction to structured prediction and joint prediction problems. Avoiding the term "structured prediction" due to association with conditional random fields.

2.- Examples of joint prediction problems: sequence labeling, parsing, matching, machine translation, image segmentation, protein folding.

3.- Approaches to joint prediction: independent predictions, multi-task learning, graphical models, handcrafted solutions. Pros and cons of each.

4.- Characterizing joint prediction problems: learning to optimize a joint loss over discrete predictions, exponential output space, need for output decomposition.

5.- Issues with joint prediction: huge output space, combinatorial loss functions, need to assume output decomposes usefully to make progress.

6.- Sequential decision-making formulation for problems with ordered output decomposition. Parallels to reinforcement learning.
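
The following is a minimal sketch of this framing in Python (the names State, greedy_decode, and the toy policy are illustrative assumptions, not from the tutorial): a state is the input plus the partial output, an action extends the output by one decision, and the policy plays the same role as a policy in reinforcement learning.

    # Hypothetical framing of joint prediction as sequential decision making:
    # a "state" is the input plus the partial output built so far, an "action"
    # appends one more piece of the output, and a "policy" picks the next action.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        x: list                                        # input, e.g. a token list
        partial: list = field(default_factory=list)    # decisions made so far

    def greedy_decode(x, policy, labels):
        """Build a full joint prediction by querying the policy one step at a time."""
        state = State(x=x)
        for _ in range(len(x)):                        # one decision per position
            state.partial.append(policy(state, labels))
        return state.partial

    # Example: a trivial policy that tags every token as "O".
    print(greedy_decode(["I", "saw", "Ada"], lambda s, labels: "O", ["O", "PER"]))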

7.- Warm-up 1: Vanilla supervised learning approach to solving joint prediction. Main drawback: training states come from gold histories, while at test time the learner conditions on its own (possibly wrong) past predictions, so the train and test state distributions mismatch.
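
A hedged sketch of the warm-up under toy assumptions (features, make_supervised_examples, and decode are invented names): training examples are built from gold prefixes, while decoding conditions on the learner's own prefix, which is exactly where the mismatch appears.

    def features(x, t, history):
        # toy features: current token plus the most recent label in the history
        return (x[t], history[-1] if history else "<s>")

    def make_supervised_examples(x, y):
        """Training examples condition on the *gold* prefix y[:t]."""
        return [(features(x, t, y[:t]), y[t]) for t in range(len(x))]

    def decode(x, classifier):
        """Test-time prediction conditions on the learner's *own* prefix."""
        history = []
        for t in range(len(x)):
            history.append(classifier(features(x, t, history)))
        return history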

8.- Warm-up 2: Imitation learning with the DAgger algorithm. Repeatedly queries expert on learner's trajectories. Provides theoretical guarantees.
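
A compact sketch of a DAgger-style loop as summarized above (expert_action, train_classifier, and the five-iteration default are assumptions for illustration): the expert labels the states the learner itself visits, and all labeled states are aggregated before retraining.

    def dagger(train_set, expert_action, train_classifier, n_iterations=5):
        """Schematic DAgger loop.
        train_set: list of (x, y) pairs.
        expert_action(x, t, history, y): the expert's choice in a visited state.
        train_classifier(dataset): fits a policy mapping state -> action."""
        dataset, policy = [], None
        for _ in range(n_iterations):
            for x, y in train_set:
                history = []
                for t in range(len(x)):
                    state = (x, t, tuple(history))
                    a_star = expert_action(x, t, history, y)
                    dataset.append((state, a_star))      # expert label for this state
                    # roll in with the learned policy (the expert on the first pass),
                    # so later iterations collect states the learner actually visits
                    history.append(policy(state) if policy else a_star)
            policy = train_classifier(dataset)           # retrain on the aggregate set
        return policy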

9.- Comparing imitation learning to reinforcement learning. Access to labeled training data and optimal reference policies is an advantage.

10.- Limitations of imitation learning: Doesn't capture varying costs of different wrong predictions. Requires too much expert access.

11.- AGGRAVATE (AggreVaTe) algorithm: Rolls in with the learned policy, makes one-step deviations, rolls out with the reference policy, and generates cost-sensitive classification examples.
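
A schematic of how one cost-sensitive example could be generated, under assumed helper names (learned_policy, reference_policy, loss); this illustrates the roll-in / deviation / roll-out recipe, not the presenters' implementation.

    import random

    def aggravate_example(x, y, learned_policy, reference_policy, actions, loss):
        """Produce one cost-sensitive example (state, cost per action) -- schematic."""
        t_dev = random.randrange(len(x))                 # pick a deviation time step

        # 1) Roll in with the *learned* policy up to the deviation point.
        prefix = []
        for t in range(t_dev):
            prefix.append(learned_policy(x, t, prefix))

        # 2) One-step deviation: try every action at t_dev,
        # 3) then roll out with the *reference* policy and score the full output.
        costs = {}
        for a in actions:
            traj = prefix + [a]
            for t in range(t_dev + 1, len(x)):
                traj.append(reference_policy(x, t, traj, y))
            costs[a] = loss(traj, y)                     # e.g. Hamming loss vs gold

        state = (x, t_dev, tuple(prefix))
        return state, costs                              # fed to a cost-sensitive learner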

12.- Relationship between multi-class and cost-sensitive classification. Goal is minimizing expected cost of predictions vs just expected errors.
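
One toy way to see the relationship (example values are invented): a multi-class example is a cost-sensitive example whose cost vector is 0 for the correct label and 1 for every other label, whereas a general cost-sensitive example grades mistakes unequally.

    labels = ["noun", "verb", "adj"]

    # A multi-class example: (features, correct label).
    mc_example = ({"word": "run"}, "verb")

    # The same example as a cost-sensitive one: cost 0 for the right label, 1 otherwise.
    cs_example = ({"word": "run"}, {y: 0.0 if y == "verb" else 1.0 for y in labels})

    # A genuinely cost-sensitive example: not all mistakes are equally expensive.
    cs_graded = ({"word": "run"}, {"verb": 0.0, "noun": 0.3, "adj": 1.0})

    def expected_cost(predict_proba, costs):
        """The objective is the expected cost of the prediction, not just error count."""
        return sum(predict_proba[y] * costs[y] for y in costs)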

13.- Roll-in, roll-out, and one-step deviations in AGGRAVATE. Uses reference policy for roll-outs.

14.- Efficiently calculating losses without explicit roll-outs when using an expert reference policy, by exploiting loss decomposition (e.g. Hamming loss).
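
A short reconstruction of the argument for Hamming loss (my derivation, assuming the reference policy is an optimal expert that makes no further mistakes after the deviation):

    \ell(\hat y, y) = \sum_{t=1}^{T} \mathbf{1}[\hat y_t \neq y_t]

    \ell(\text{deviate to } a \text{ at step } t,\ \text{expert roll-out})
        = \underbrace{\sum_{s<t} \mathbf{1}[\hat y_s \neq y_s]}_{\text{constant in } a} + \mathbf{1}[a \neq y_t]

The first term is shared by every candidate action, so the cost vector reduces to c(a) = 1[a != y_t] up to an additive constant, and no explicit roll-out is required.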

15.- Example 1: Sequence labeling with AGGRAVATE. Picking a time step, trying all labels, rolling out with expert, collecting cost-sensitive examples.

16.- Leveraging loss decomposition in sequence labeling to avoid doing roll-outs, gaining computational advantage. Only need to count incorrect past decisions.
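
A sketch of that shortcut in code (function and variable names are assumptions): the count of past mistakes is identical for every candidate label, so only agreement with the gold label at the deviation step changes the cost.

    def hamming_costs_without_rollout(x, y_gold, t_dev, prefix, labels):
        """Cost vector at deviation step t_dev when the reference is an optimal expert.
        Under Hamming loss the expert makes no further mistakes, so the only part of
        the final loss that depends on the deviated action is whether it matches gold."""
        past_mistakes = sum(1 for s, a in enumerate(prefix) if a != y_gold[s])  # constant
        return {a: past_mistakes + (0 if a == y_gold[t_dev] else 1) for a in labels}

    # Example: the prefix already contains one mistake; only the current position's
    # correctness varies across candidate labels.
    costs = hamming_costs_without_rollout(
        x=["the", "cat", "sat"], y_gold=["DET", "NOUN", "VERB"],
        t_dev=2, prefix=["DET", "VERB"], labels=["DET", "NOUN", "VERB"])
    # -> {'DET': 2, 'NOUN': 2, 'VERB': 1}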

17.- Example 2: Graph labeling with AGGRAVATE. Linearizing graph traversal in a belief propagation-like manner. Breadth-first search ordering.

18.- Making predictions on a node given its neighbors' information. Multiple inward and outward passes provide interior and exterior information.
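
A toy sketch covering points 17 and 18 together (graph, feature, and function names are invented): linearize the graph by breadth-first search, then sweep back and forth over that order so each node's features can include its neighbors' current predicted labels.

    from collections import deque

    def bfs_order(adj, root):
        """Linearize a graph for sequential prediction via breadth-first search."""
        order, seen, queue = [], {root}, deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        return order

    def label_graph(adj, root, node_features, policy, n_passes=2):
        """Predict node labels along the BFS order, sweeping back and forth so that
        each node eventually sees informed predictions from all of its neighbors."""
        order = bfs_order(adj, root)
        pred = {v: None for v in adj}                               # nothing predicted yet
        for p in range(n_passes):
            sweep = order if p % 2 == 0 else list(reversed(order))  # outward / inward pass
            for v in sweep:
                neighbor_labels = [pred[u] for u in adj[v]]
                pred[v] = policy(node_features[v], neighbor_labels)
        return pred

    # Tiny example with a dummy policy that ignores its inputs.
    adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    print(label_graph(adj, "a", {v: {} for v in adj}, policy=lambda f, nb: "spam"))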

19.- Constructing cost-sensitive examples for a node by trying all label perturbations. Hamming loss enables avoiding roll-outs.

20.- Using an online cost-sensitive classifier in the inner loop of AGGRAVATE. One-against-all strategy is common in practice.
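
A minimal, assumed implementation of the one-against-all reduction for cost-sensitive classification (not the presenters' code and not tied to any particular library): one online regressor per label predicts that label's cost, and prediction takes the label with the smallest estimated cost.

    from collections import defaultdict

    class CSOAA:
        """Cost-sensitive one-against-all: one linear regressor per label predicts
        that label's cost; prediction is the argmin. Online squared-loss SGD updates."""

        def __init__(self, labels, lr=0.1):
            self.labels = labels
            self.lr = lr
            self.w = {y: defaultdict(float) for y in labels}   # sparse weights per label

        def _score(self, y, feats):
            return sum(self.w[y][f] * v for f, v in feats.items())

        def predict(self, feats):
            return min(self.labels, key=lambda y: self._score(y, feats))

        def update(self, feats, costs):
            """costs: dict label -> observed cost; one SGD step per label."""
            for y, c in costs.items():
                err = self._score(y, feats) - c
                for f, v in feats.items():
                    self.w[y][f] -= self.lr * err * v

    # Usage inside the learning-to-search loop: each (state features, cost vector)
    # produced by roll-in/roll-out becomes one online update.
    clf = CSOAA(labels=["DET", "NOUN", "VERB"])
    clf.update({"word=cat": 1.0, "prev=DET": 1.0}, {"DET": 1.0, "NOUN": 0.0, "VERB": 1.0})
    print(clf.predict({"word=cat": 1.0, "prev=DET": 1.0}))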

21.- Convergence properties of AGGRAVATE and relation to empirical risk minimization. The second half of the tutorial covers this.

22.- Handling non-optimal experts in learning to search. Recent work addresses this (paper being presented at the conference).

23.- Comparing learning to search and conditional random fields (CRFs). High-level similarities but differing approaches and optimization objectives.

24.- Learning to search reduces to prediction ability as a primitive, while CRFs reduce to probability modeling ability. Hard to directly compare.

25.- Sensitivity of graph labeling results to the linearization order. Multiple inward/outward passes make the order matter less.

26.- Relationship between learning to search and recurrent neural networks. Orthogonal: one is algorithmic, the other is representational. Potentially complementary.

27.- Online cost-sensitive classification as a sub-component of AGGRAVATE. Choice of classifier affects final prediction quality.

28.- Possibility of optimizing non-decomposable loss functions like F-score using learning to search, but requiring explicit roll-outs.
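
A sketch of what explicit roll-outs look like for a non-decomposable loss such as F-score (names are illustrative): each candidate action's cost is obtained only by completing the whole output with the roll-out policy and scoring it, since F-score has no per-position decomposition.

    def f1_loss(pred, gold, positive="ENT"):
        """1 - F1 over a single 'positive' tag; not a sum of per-position terms."""
        tp = sum(1 for p, g in zip(pred, gold) if p == positive and g == positive)
        fp = sum(1 for p, g in zip(pred, gold) if p == positive and g != positive)
        fn = sum(1 for p, g in zip(pred, gold) if p != positive and g == positive)
        if tp == 0:
            return 1.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 1.0 - 2 * precision * recall / (precision + recall)

    def costs_with_explicit_rollouts(x, gold, t_dev, prefix, labels, rollout_policy):
        """For a non-decomposable loss, every candidate action needs a full roll-out."""
        costs = {}
        for a in labels:
            traj = prefix + [a]
            for t in range(t_dev + 1, len(x)):
                traj.append(rollout_policy(x, t, traj, gold))
            costs[a] = f1_loss(traj, gold)        # score only the completed output
        return costs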

29.- Decomposability of loss function enables computational shortcuts in AGGRAVATE when using expert for roll-outs. Hamming loss is an example.

30.- Further discussion of connections to reinforcement learning, handling non-optimal experts, and recurrent neural networks in the second half of the tutorial.

Knowledge Vault built by David Vivancos 2024