Advances in Structured Prediction

Hal Daumé III & John Langford

**Concept Graph & Resume using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:**

```mermaid
graph LR
classDef main fill:#f9d4d4, font-weight:bold, font-size:14px
classDef intro fill:#d4f9d4, font-weight:bold, font-size:14px
classDef approaches fill:#d4d4f9, font-weight:bold, font-size:14px
classDef algorithms fill:#f9f9d4, font-weight:bold, font-size:14px
classDef examples fill:#f9d4f9, font-weight:bold, font-size:14px
classDef comparisons fill:#d4f9f9, font-weight:bold, font-size:14px
Main[Advances in Structured Prediction]
Main --> A[Introduction to Joint Prediction]
A --> A1[Joint prediction introduction, avoiding term structured 1]
A --> A2[Joint prediction examples: labeling, parsing, translation 2]
A --> A3[Joint prediction approaches: independent, multi-task, graphical 3]
A --> A4[Joint prediction characteristics: discrete predictions, decomposition 4]
A --> A5[Joint prediction issues: huge space, combinatorial 5]
A --> A6[Sequential decision-making formulation, reinforcement learning parallels 6]
Main --> B[Learning Approaches]
B --> B1[Supervised learning drawbacks: distribution mismatch 7]
B --> B2[Imitation learning: DAgger algorithm, expert queries 8]
B --> B3[Imitation vs reinforcement: labeled data advantage 9]
B --> B4[Imitation limitations: costs, expert access 10]
Main --> C[AggreVaTe Algorithm]
C --> C1[AggreVaTe: learned policy, deviations, cost-sensitive classification 11]
C --> C2[Multi-class vs cost-sensitive classification comparison 12]
C --> C3[AggreVaTe components: roll-in, roll-out, deviations 13]
C --> C4[Efficient loss calculation: expert policy, decomposition 14]
C --> C5[Online cost-sensitive classifier: one-against-all strategy 20]
Main --> D[Examples and Applications]
D --> D1[Sequence labeling example: time step, labels 15]
D --> D2[Computational advantage: loss decomposition, past decisions 16]
D --> D3[Graph labeling example: linearizing, breadth-first search 17]
D --> D4[Node predictions: neighbor information, multiple passes 18]
D --> D5[Cost-sensitive examples: label perturbations, Hamming loss 19]
Main --> E[Comparisons and Relations]
E --> E1[AggreVaTe convergence, empirical risk minimization relation 21]
E --> E2[Non-optimal experts in learning to search 22]
E --> E3[Learning to search vs conditional random fields 23]
E --> E4[Search vs CRFs: prediction vs probability 24]
E --> E5[Search vs recurrent neural networks: complementary 26]
Main --> F[Further Considerations]
F --> F1[Graph labeling sensitivity to linearization order 25]
F --> F2[Online cost-sensitive classification in AggreVaTe 27]
F --> F3[Non-decomposable loss optimization: F-score example 28]
F --> F4[Loss decomposability enables computational shortcuts 29]
F --> F5[Further topics: reinforcement learning, neural networks 30]
class Main main
class A,A1,A2,A3,A4,A5,A6 intro
class B,B1,B2,B3,B4 approaches
class C,C1,C2,C3,C4,C5 algorithms
class D,D1,D2,D3,D4,D5 examples
class E,E1,E2,E3,E4,E5,F,F1,F2,F3,F4,F5 comparisons
```


**Resume:**

**1.-** Introduction to structured prediction and joint prediction problems. Avoiding the term "structured prediction" due to association with conditional random fields.

**2.-** Examples of joint prediction problems: sequence labeling, parsing, matching, machine translation, image segmentation, protein folding.

**3.-** Approaches to joint prediction: independent predictions, multi-task learning, graphical models, handcrafted solutions. Pros and cons of each.

**4.-** Characterizing joint prediction problems: learning to optimize a joint loss over discrete predictions, exponential output space, need for output decomposition.

**5.-** Issues with joint prediction: huge output space, combinatorial loss functions, need to assume output decomposes usefully to make progress.

**6.-** Sequential decision-making formulation for problems with ordered output decomposition. Parallels to reinforcement learning.

**7.-** Warm-up 1: Vanilla supervised learning approach to solving joint prediction. Drawbacks due to train/test distribution mismatch.

**8.-** Warm-up 2: Imitation learning with the DAgger algorithm. Repeatedly queries expert on learner's trajectories. Provides theoretical guarantees.
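
The DAgger loop described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the authors' implementation; `expert_policy`, `train_classifier`, and `step` are hypothetical stand-ins for the expert, the base learner, and the environment transition:

```python
def dagger(expert_policy, train_classifier, initial_states, step, horizon, n_iters=2):
    """DAgger sketch: query the expert on states visited by the learner's own
    trajectories, aggregate the labels, and retrain on the growing dataset."""
    dataset = []                       # aggregated (state, expert_action) pairs
    policy = expert_policy             # the first iteration rolls in with the expert
    for _ in range(n_iters):
        for s0 in initial_states:
            s = s0
            for _ in range(horizon):
                dataset.append((s, expert_policy(s)))  # expert label on a learner state
                s = step(s, policy(s))                 # but the current policy drives
        policy = train_classifier(dataset)             # retrain on all data so far
    return policy
```

The key point the sketch makes concrete: the expert is queried on the learner's own state distribution, which is what repairs the train/test mismatch of the vanilla supervised approach.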

**9.-** Comparing imitation learning to reinforcement learning. Access to labeled training data and optimal reference policies is an advantage.

**10.-** Limitations of imitation learning: Doesn't capture varying costs of different wrong predictions. Requires too much expert access.

**11.-** AggreVaTe algorithm: rolls in with the learned policy, makes one-step deviations, rolls out with the reference policy, and generates cost-sensitive classification examples.
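
The data-collection step just described can be sketched as follows. This is a schematic under simplifying assumptions (deterministic transitions, loss evaluated on the final state); `step`, `loss`, and the policy arguments are hypothetical stand-ins, not the tutorial's code:

```python
import random

def aggrevate_example(learned_policy, reference_policy, step, loss,
                      s0, horizon, actions, rng=random):
    """One AggreVaTe collection step: roll in with the learned policy to a
    random time step, try every one-step deviation, roll out with the
    reference policy, and record the final losses as a cost vector."""
    t = rng.randrange(horizon)
    s = s0
    for _ in range(t):                        # roll-in with the learned policy
        s = step(s, learned_policy(s))
    costs = []
    for a in actions:                         # one-step deviations
        s2 = step(s, a)
        for _ in range(horizon - t - 1):      # roll-out with the reference policy
            s2 = step(s2, reference_policy(s2))
        costs.append(loss(s2))
    return s, costs                           # one cost-sensitive example
```

Each call yields one training example for the cost-sensitive learner: the roll-in state paired with the loss each action would eventually incur.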

**12.-** Relationship between multi-class and cost-sensitive classification. Goal is minimizing expected cost of predictions vs just expected errors.

**13.-** Roll-in, roll-out, and one-step deviations in AggreVaTe. Uses the reference policy for roll-outs.

**14.-** Efficiently calculating losses without explicit roll-outs when using an expert reference policy, by exploiting loss decomposition (e.g. Hamming loss).

**15.-** Example 1: Sequence labeling with AggreVaTe. Picking a time step, trying all labels, rolling out with the expert, collecting cost-sensitive examples.

**16.-** Leveraging loss decomposition in sequence labeling to avoid doing roll-outs, gaining computational advantage. Only need to count incorrect past decisions.
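
The shortcut above can be made concrete: when the expert completes the sequence, every later position is labeled correctly, so the cost of each deviation is just the count of past mistakes (a constant across actions) plus the deviation's own error. A small sketch with hypothetical names:

```python
def deviation_costs(prefix, truth, t, labels):
    """Cost of each one-step deviation at time t under Hamming loss, assuming
    the expert roll-out gets every later position right: no roll-out needed,
    just count past mistakes (constant) plus the deviation's own error."""
    past_mistakes = sum(p != y for p, y in zip(prefix, truth[:t]))
    return [past_mistakes + (a != truth[t]) for a in labels]
```

This turns an O(horizon) roll-out per action into an O(1) lookup per action, which is the computational advantage being described.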

**17.-** Example 2: Graph labeling with AGGRAVATE. Linearizing graph traversal in a belief propagation-like manner. Breadth-first search ordering.
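
The breadth-first linearization can be sketched with a standard BFS over an adjacency list (`adj` and `bfs_order` are illustrative names, not from the tutorial):

```python
from collections import deque

def bfs_order(adj, root):
    """Linearize a graph's nodes by breadth-first search from a root;
    predictions are then made in this order, and in reverse for an
    outward pass."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order
```

Traversing `order` and then `reversed(order)` gives the inward and outward passes mentioned in the next item.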

**18.-** Making predictions on a node given its neighbors' information. Multiple inward and outward passes provide interior and exterior information.

**19.-** Constructing cost-sensitive examples for a node by trying all label perturbations. Hamming loss enables avoiding roll-outs.

**20.-** Using an online cost-sensitive classifier in the inner loop of AggreVaTe. A one-against-all strategy is common in practice.
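
A minimal sketch of the one-against-all idea: keep one cost estimator per label and predict the label with the least estimated cost. Here a running average keyed by the raw feature stands in for a real online regressor; all names are hypothetical:

```python
class OneAgainstAllCS:
    """One-against-all cost-sensitive classifier sketch: one cost estimator
    per label, with prediction by argmin over estimated costs."""
    def __init__(self, labels):
        self.labels = labels
        self.totals = {}   # (feature, label) -> summed observed cost
        self.counts = {}   # (feature, label) -> number of observations

    def update(self, x, costs):
        # One cost per label, in the order of self.labels.
        for label, cost in zip(self.labels, costs):
            key = (x, label)
            self.totals[key] = self.totals.get(key, 0.0) + cost
            self.counts[key] = self.counts.get(key, 0) + 1

    def predict(self, x):
        def estimated_cost(label):
            key = (x, label)
            return self.totals.get(key, 0.0) / max(self.counts.get(key, 0), 1)
        return min(self.labels, key=estimated_cost)
```

In practice each per-label estimator would be a proper online regressor over rich features; the structure of the reduction is the same.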

**21.-** Convergence properties of AggreVaTe and its relation to empirical risk minimization. The second half of the tutorial covers this.

**22.-** Handling non-optimal experts in learning to search. Recent work addresses this (paper being presented at the conference).

**23.-** Comparing learning to search and conditional random fields (CRFs). High-level similarities but differing approaches and optimization objectives.

**24.-** Learning to search reduces to prediction ability as a primitive, while CRFs reduce to probability modeling ability. Hard to directly compare.

**25.-** Sensitivity of graph labeling results to linearization order. Multiple inward/outward passes make the order matter less.

**26.-** Relationship between learning to search and recurrent neural networks. Orthogonal: one is algorithmic, the other representational. Potentially complementary.

**27.-** Online cost-sensitive classification as a sub-component of AggreVaTe. The choice of classifier affects final prediction quality.

**28.-** Possibility of optimizing non-decomposable loss functions like F-score using learning to search, but requiring explicit roll-outs.
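
For contrast with Hamming loss, sequence-level F-score couples all positions through precision and recall, so it does not decompose into per-position costs. A small illustration, not from the tutorial:

```python
def f1_score(pred, truth, positive=1):
    """Sequence-level F-score: precision and recall couple all positions,
    so the score is not a sum of per-position terms the way Hamming
    loss is."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    predicted = sum(p == positive for p in pred)
    actual = sum(t == positive for t in truth)
    if tp == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision, recall = tp / predicted, tp / actual
    return 2 * precision * recall / (precision + recall)
```

Because changing one position shifts the score non-locally, evaluating each deviation requires completing the sequence first, i.e. an explicit roll-out.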

**29.-** Decomposability of the loss function enables computational shortcuts in AggreVaTe when using an expert for roll-outs. Hamming loss is an example.

**30.-** Further discussion on connections to reinforcement learning, handling non-optimal experts, recurrent neural networks in the second half of tutorial.

Knowledge Vault built by David Vivancos 2024