Knowledge Vault 2/37 - ICLR 2014-2023
Y-Lan Boureau ICLR 2017 - Learning End-to-End Goal-Oriented Dialog

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef task fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef subtask fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef data fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef model fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef result fill:#f9d4f9, font-weight:bold, font-size:14px;
A[Y-Lan Boureau ICLR 2017] --> B[Y-Lan Boureau from Facebook AI Research presenting joint work on goal-oriented dialogue. 1]
A --> C[Chitchat vs goal-directed commercial chatbots. 2]
C --> D[Users want quick goal-directed tasks. 3]
C --> E[Traditional goal-directed states need rules. 4]
C --> F[End-to-end models generalize using language. 5]
F --> G[Evaluating end-to-end goal-directed models. 6]
A --> H[Restaurant reservation subtasks for evaluation. 7]
H --> I[Synthetic dialogue data openly available. 8]
H --> J[Testing necessary subtasks identifies problems. 9]
A --> K[Part of Facebook dialogue papers. 10]
H --> L[Subtask 1: Querying database. 11]
H --> M[Subtask 2: Updating API call. 12]
H --> N[Subtask 3: Ranking API options. 13]
H --> O[Subtask 4: Providing additional information. 14]
H --> P[Subtask 5: Full deterministic dialogue. 15]
H --> Q[Baselines: TF-IDF, embeddings, memory networks. 16]
Q --> R[Memory networks: attention, multi-hop access. 17]
Q --> S[Results: Nearest neighbor beats TF-IDF. 18]
Q --> T[Embeddings solve subtask 1, memory 1-2. 19]
T --> U[Match type solves subtask 4. 20]
S --> V[In/out-of-vocabulary results, similar other datasets. 21]
T --> W[Memory network attention visualization. 22]
T --> X[Memory network example on dialogue. 23]
H --> Y[Dataset enables end-to-end comparison. 24]
H --> Z[Harder realistic datasets being developed. 25]
I --> AA[Small synthetic data mimics reality. 26]
I --> AB[Deterministic slot-filling enables 100%, single answer. 27]
I --> AC[Predicts utterance, no task-specific labels. 28]
I --> AD[Architecture trainable on real data. 29]
I --> AE[Open data encourages end-to-end testing. 30]
class A,B,C,H task;
class D,E,F,L,M,N,O,P subtask;
class I,V,AA,AB,AC,AD,AE data;
class Q,R,T,U,W model;
class G,J,K,S,X,Y,Z result;

Resume:

1.-The speaker is Y-Lan Boureau from Facebook AI Research, presenting joint work with Antoine Bordes and Jason Weston on goal-oriented dialogue.

2.-Open-ended chitchat dialogue is very broad, while commercial chatbots for tasks like booking hotels are much more limited and goal-directed.

3.-Users want goal-directed dialogue tasks done quickly without too much extraneous conversation. Success is measured by whether the goal is achieved.

4.-Traditionally, goal-directed dialogue states are defined by slots that need to be filled, which requires manual encoding of rules for each domain.
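
As a concrete illustration of point 4, here is a minimal sketch of a hand-engineered slot-filling dialogue state; the slot names and keyword rules are invented for illustration, not taken from the paper.

# Hypothetical hand-engineered slot-filling state: rules must be written per domain.
SLOTS = ["cuisine", "location", "party_size", "price_range"]

def update_state(state, user_utterance):
    """Fill slots from a user utterance using hand-written keyword rules."""
    words = user_utterance.lower().split()
    if "italian" in words:
        state["cuisine"] = "italian"
    if "paris" in words:
        state["location"] = "paris"
    if "cheap" in words:
        state["price_range"] = "cheap"
    for w in words:
        if w.isdigit():
            state["party_size"] = int(w)
    return state

state = {slot: None for slot in SLOTS}
state = update_state(state, "I would like a cheap Italian restaurant in Paris for 6")
missing = [s for s, v in state.items() if v is None]
print(state, "still missing:", missing)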

5.-The promise of end-to-end dialogue models is to generalize to new domains without assumptions about slots, using only language as raw input.

6.-Recent end-to-end neural dialogue systems have shown promise on open-ended chitchat, but the question is how to evaluate them on goal-directed tasks.

7.-The work breaks down the task of restaurant reservation into subtasks to evaluate where end-to-end models succeed and fail.

8.-The synthetic dialogue data produced, combining language patterns with a knowledge base, is available open-source at fb.ai/babi as part of the bAbI set of tasks testing the capabilities required of AI systems.

9.-Necessary but not sufficient subtasks are tested - success doesn't mean the system is smart, but failure indicates a problem to solve before moving on.

10.-The presented work is part of a larger set of dialogue papers from Facebook AI Research, overviewed in a blog post on their website.

11.-The first subtask in restaurant reservation is querying a database, filling in information about party size, cuisine, price range, and location.
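
A sketch of what subtask 1 amounts to, reusing the slot dictionary from the earlier sketch; the api_call-style query string and its field order are illustrative assumptions, not the exact dataset format.

# Sketch of subtask 1: once all four fields are known, emit a database query.
def make_api_call(state):
    required = ("cuisine", "location", "party_size", "price_range")
    if any(state.get(k) is None for k in required):
        return None  # keep asking questions until every field is filled
    return "api_call {cuisine} {location} {party_size} {price_range}".format(**state)

print(make_api_call({"cuisine": "italian", "location": "paris",
                     "party_size": 6, "price_range": "cheap"}))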

12.-The second subtask handles the user changing their mind and updating the API call with new information.

13.-The third subtask involves the API returning options and the system choosing what to display first to the user, likely by some ranking criteria.
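
A sketch of subtask 3 under the assumption that returned options carry a rating and are proposed best-rated first; the knowledge-base rows below are made up.

# Sketch of subtask 3: rank the options returned by the API call.
kb_results = [
    {"name": "resto_rome_cheap_1", "rating": 3},
    {"name": "resto_rome_cheap_2", "rating": 8},
    {"name": "resto_rome_cheap_3", "rating": 5},
]
ranked = sorted(kb_results, key=lambda r: r["rating"], reverse=True)
for r in ranked:
    print("suggest:", r["name"])  # highest-rated option proposed first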

14.-The fourth subtask is providing additional information requested by the user, like the restaurant's phone number or address.

15.-The fifth subtask is conducting the full dialogue, combining the steps. The dialogues are made deterministic for reproducible evaluation and comparison between systems.

16.-Baselines are provided, including an information retrieval TF-IDF method, a nearest neighbor approach, and supervised embeddings, along with a memory network end-to-end model.
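
A minimal sketch of the TF-IDF retrieval baseline, scoring candidate responses against the dialogue context with scikit-learn; the context and candidate strings are toy examples, not the actual data.

# TF-IDF baseline sketch: pick the candidate whose TF-IDF vector is most
# similar (cosine) to the dialogue context's vector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

context = "hi i would like to book a table for six in a cheap price range"
candidates = [
    "i'm on it",
    "which price range are you looking for",
    "here is the phone number of the restaurant",
]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([context] + candidates)
scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
best = scores.argmax()
print(candidates[best], scores)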

17.-Memory networks combine a large memory with a learning component that can read and write to it, using soft attention and multi-hop access.
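
A minimal numpy sketch of the soft-attention read a memory network performs over the stored dialogue history; the sizes, the two-hop loop, and the random vectors are illustrative stand-ins for components that are learned end-to-end in the real model.

# Soft attention over a memory of embedded utterances, repeated for two hops.
import numpy as np

d, n_memories = 32, 10                      # embedding size, stored utterances
rng = np.random.default_rng(0)
memory = rng.normal(size=(n_memories, d))   # embedded dialogue history
query = rng.normal(size=d)                  # embedded last user utterance

for hop in range(2):                        # multi-hop: re-read the memory
    attention = np.exp(memory @ query)
    attention /= attention.sum()            # soft attention weights over memories
    read = attention @ memory               # weighted sum of memory entries
    query = query + read                    # updated controller state
print("final attention:", np.round(attention, 3))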

18.-Results show rule-based systems getting 100% as a sanity check. TF-IDF performs poorly, while nearest neighbor does better, unlike in chitchat where TF-IDF was superior.

19.-Supervised embeddings solve the first subtask but fail on others. Memory networks solve the first two subtasks but not the others.

20.-Augmenting memory networks with a match type feature also solves the fourth task of providing additional information but still fails the third task.
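
A sketch of the match-type idea: append an entity-type token to a candidate response when it contains a knowledge-base entity that also appears in the dialogue context, so unseen phone numbers or addresses can be matched by type rather than by word. The entity table and token names here are illustrative.

# Match-type feature sketch: add type tokens such as R_phone to candidates.
ENTITY_TYPES = {"resto_1_phone": "R_phone", "resto_1_address": "R_address"}

def add_match_type_tokens(candidate, context):
    tokens = candidate.split()
    extra = [etype for entity, etype in ENTITY_TYPES.items()
             if entity in tokens and entity in context]
    return " ".join(tokens + extra)

print(add_match_type_tokens("here it is resto_1_phone",
                            "what is the phone number of resto_1_phone"))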

21.-Both in-vocabulary and out-of-vocabulary test results are reported. Similar patterns of results are seen on two other datasets - real human-bot dialogues and human-human restaurant booking data.

22.-Visualizing the attention in the memory networks shows it focuses on relevant slots for the first two tasks but fails to attend to ranking and extra info on later tasks.

23.-An example is shown of the memory network attention on more realistic dialogue data, showing reasonable behavior and knowing when it needs to request more information.

24.-Since the dataset was published, further work has improved on their baselines, which was the goal - to enable comparison of end-to-end approaches on this task.

25.-Harder datasets are being developed with more challenging realistic dialogue phenomena. An extended version will be featured as a track in the next Dialog State Tracking Challenge.

26.-The synthetic datasets are kept small (1000 examples) to mimic real-world cases of limited labeled data, but can easily be made larger if needed.

27.-Keeping the slot-filling order in the dialogues deterministic makes 100% accuracy achievable and ensures there is a single correct answer candidate.

28.-The system is trained end-to-end to predict the next utterance, without any task-specific labeling. A system trained on the synthetic data would be expected to perform very poorly if tested on a real dataset, and that is not the intent.
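
A sketch of next-utterance prediction framed as candidate ranking, in the spirit of the supervised-embeddings baseline: embed the context and each candidate as bags of word vectors and pick the highest dot product. The tiny vocabulary and random, untrained embeddings are for illustration only.

# Candidate ranking sketch: score(context, candidate) = embed(context) . embed(candidate)
import numpy as np

vocab = {w: i for i, w in enumerate(
    "hi book table six cheap which price range im on it".split())}
rng = np.random.default_rng(1)
E = rng.normal(size=(len(vocab), 16))        # word embeddings (learned in practice)

def embed(text):
    ids = [vocab[w] for w in text.split() if w in vocab]
    return E[ids].sum(axis=0)

context = "hi book table six"
candidates = ["which price range", "im on it"]
scores = [embed(context) @ embed(c) for c in candidates]
print(candidates[int(np.argmax(scores))], scores)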

29.-The same architecture can be trained on real datasets instead to compare approaches. The synthetic dataset is not meant to train systems that directly transfer to real restaurant booking.

30.-The data is openly available and they encourage the community to test their end-to-end dialogue approaches on it and report results to move the field forward.

Knowledge Vault built by David Vivancos 2024