Knowledge Vault 2/10 - ICLR 2014-2023
Karl Moritz Hermann; Phil Blunsom ICLR 2014 - Multilingual Distributed Representations without Word Alignment
<Resume Image>

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef multilingual fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef embeddings fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef distributional fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef multilingualData fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef compositionalSemantics fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef learning fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef evaluation fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef corpus fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef misc fill:#d4d4f9, font-weight:bold, font-size:14px;
A[Karl Moritz Hermann et al] --> B[Multi-lingual distributional representations, no alignment. 1]
A --> C[Embeddings: extend distributional hypothesis, multi-lingual. 2]
C --> D[Distributional hypothesis infers meaning from co-occurrence. 3]
C --> E[Multi-lingual data: words semantically close if aligned. 4]
E --> F[Multi-lingual data grounds language like real-world experiences. 5]
A --> G[Full-lingual compositional semantics: paraphrasing, translation. 6]
G --> H[Past work: autoencoder error, sentiment classification. 7]
G --> I[Goals: multi-lingual semantic space, avoid biases. 8]
I --> J[Simple model: close if aligned, far if not. 9]
I --> K[Benefits: task-independent, multi-lingual, semantic joint space. 10]
I --> L[Distance minimization has trivial solution, use hinge loss. 11]
I --> M[Bag-of-words composition model for simplicity. 12]
A --> N[Evaluation: cross-lingual document classification, German-English. 13]
N --> O[Two-stage: learn multi-lingual representations, train classifier. 14]
N --> P[English-French improved German despite no additional German data. 15]
N --> Q[T-SNE: learned representations cluster phrases across languages. 16]
N --> R[Bigram composition considering word order outperformed bag-of-words. 17]
A --> S[Recursive model: phrase/sentence level, no alignment needed. 18]
S --> T[Train on comparable/transcribed corpora, combine signals when available. 19]
A --> U[Massively multi-lingual TED talk corpus, 12 languages, multi-label. 20]
A --> V[Monolingual data unused, but could help. 21]
A --> W[Neuroscience: early vs late bilinguals have mixed vs separate representations. 22]
A --> X[More elegant than autoencoder-based multi-lingual learning via trees. 23]
class A,B,C,D,E,F,I,J,K,L,M,P,Q,V,W,X multilingual;
class C,D,E embeddings;
class D,E,F,G distributional;
class E,F,P multilingualData;
class G,H,I compositionalSemantics;
class J,K,L,M,O,S,T learning;
class N,O,P,Q,R evaluation;
class U,V corpus;
class W,X misc;

Resume:

1.-The talk discusses multi-lingual distributional representations learned without word alignment, aiming to leverage parallel corpora and achieve semantic transfer across languages.

2.-Embeddings are learned by extending the distributional hypothesis to multi-lingual corpora and the sentence level.

3.-The distributional hypothesis posits that word meaning can be inferred from the words it co-occurs with. This is more powerful with multi-lingual data.
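
A toy sketch of this idea (illustrative only, not part of the talk): counting which words occur near each other in a tiny corpus gives words used in similar contexts similar count vectors.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    # Count, for every word, how often each other word appears within
    # `window` positions of it.
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vectors = cooccurrence_counts(corpus)
# "cat" and "dog" share most of their context words, so their count
# vectors look similar: the distributional hypothesis in miniature.
```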

4.-Multi-lingual data allows learning that words in different languages are semantically close if they align with the same word in another language.

5.-Multi-lingual data can provide a form of semantic grounding, similar to how real-world experiences ground language learning in traditional linguistic theories.

6.-Reasons to pursue compositional semantics at the full sentence level include paraphrasing (checking whether two sentences have roughly the same meaning) and translation.

7.-Past work on compositional semantics used objective functions like autoencoder reconstruction error or classification signals like sentiment. The usefulness of these is questioned.

8.-Goals are to learn representations in a multilingual semantic space while avoiding task-specific biases and accounting for composition effects.

9.-A simple model would ensure sentence representations in two languages are close if the sentences are aligned and far if unaligned.
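
In symbols, writing f and g for the (assumed) composition functions that map a sentence a and its aligned translation b into the shared space, the basic objective described above is to make

```latex
E_{\mathrm{dist}}(a, b) = \left\lVert f(a) - g(b) \right\rVert^{2}
```

small for aligned sentence pairs and large for unaligned ones; the notation here follows the talk's description rather than quoting the paper.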

10.-Benefits are task-independent learning, multilingual representations, semantically plausible joint space representations, and using large contexts from compositional vector models.

11.-The distance minimization objective alone has a trivial solution (mapping every sentence to the same vector). A noise-contrastive hinge loss that forces unaligned sentences apart is used instead.
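
A minimal sketch of such a noise-contrastive hinge loss, assuming sentence representations are plain NumPy vectors; function names and the margin value are illustrative, not the authors' code.

```python
import numpy as np

def dist(a, b):
    # Squared Euclidean distance between two sentence representations.
    d = a - b
    return float(np.dot(d, d))

def hinge_loss(aligned_a, aligned_b, noise_b, margin=1.0):
    # Pull the aligned pair together while pushing a randomly sampled
    # unaligned ("noise") sentence at least `margin` further away, which
    # rules out collapsing every sentence onto the same vector.
    return max(0.0, margin + dist(aligned_a, aligned_b) - dist(aligned_a, noise_b))
```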

12.-A bag-of-words composition model is used for simplicity to focus on evaluating the bilingual objective rather than the composition method.
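
A minimal sketch of additive bag-of-words composition, assuming a hypothetical word-to-vector lookup called embeddings:

```python
import numpy as np

def compose_bow(tokens, embeddings, dim=128):
    # Bag-of-words composition: the sentence vector is the sum of its
    # word vectors, so word order is ignored entirely.
    vec = np.zeros(dim)
    for token in tokens:
        if token in embeddings:
            vec += embeddings[token]
    return vec
```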

13.-Evaluation uses a cross-lingual document classification task: a classifier is trained on labeled English documents and then used to classify German documents. This tests both monolingual and multilingual validity of the representations.

14.-The two-stage procedure first learns multilingual representations from parallel data, then trains a classifier on the learned representations.
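
A sketch of the two-stage setup under simple assumptions: stage one (multilingual embeddings en_emb and de_emb living in a shared space) is taken as given, document vectors are averaged word vectors, and logistic regression stands in for the classifier, which the talk does not prescribe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, embeddings, dim=128):
    # Average the pre-trained multilingual word vectors of one document.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cross_lingual_accuracy(en_docs, en_labels, de_docs, de_labels, en_emb, de_emb):
    # Stage 2: train on labeled English documents, evaluate on German ones.
    X_train = np.stack([doc_vector(d, en_emb) for d in en_docs])
    X_test = np.stack([doc_vector(d, de_emb) for d in de_docs])
    clf = LogisticRegression(max_iter=1000).fit(X_train, en_labels)
    return clf.score(X_test, de_labels)
```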

15.-Adding English-French data improved the German representations even though it contributed no additional German data, supporting the extension of the distributional hypothesis to multiple languages.

16.-t-SNE projections show that the learned representations cluster phrases with similar meanings closely together across English, German, and French.
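
A sketch of how such a projection can be produced with standard tooling (scikit-learn's TSNE plus matplotlib); this only illustrates the visualisation step, not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(vectors, languages):
    # Project shared-space phrase vectors to 2-D and colour them by
    # language; similar phrases should cluster regardless of language.
    coords = TSNE(n_components=2, init="random", perplexity=30).fit_transform(np.stack(vectors))
    for lang in sorted(set(languages)):
        idx = [i for i, l in enumerate(languages) if l == lang]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=10, label=lang)
    plt.legend()
    plt.show()
```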

17.-In subsequent experiments, a bigram composition model that takes word order into account outperformed the bag-of-words model.
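
One possible form of such a bigram model, shown as a sketch (the exact non-linearity is an assumption here, not taken from the paper): combine each adjacent pair of word vectors before summing.

```python
import numpy as np

def compose_bigram(tokens, embeddings, dim=128):
    # Sum a non-linearity applied to each adjacent pair of word vectors,
    # so the result depends on which words appear next to each other,
    # unlike the plain bag-of-words sum.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    if len(vecs) == 1:
        return vecs[0]
    return np.sum([np.tanh(vecs[i] + vecs[i + 1]) for i in range(len(vecs) - 1)], axis=0)
```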

18.-A recursive model was developed to learn representations at the phrase and sentence level, removing the need for sentence alignment.

19.-This enables training on comparable or transcribed corpora with document-level alignment and combining document and sentence level signals when available.
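
A sketch of this document-level extension, reusing the compose_bow and hinge_loss sketches above; combining them this way reflects points 18-19 as summarised here, not the authors' exact model.

```python
import numpy as np

def compose_document(sentences, embeddings, dim=128):
    # A document vector as the sum of its sentence vectors (compose_bow is
    # the hypothetical bag-of-words helper sketched earlier).
    sent_vecs = [compose_bow(tokens, embeddings, dim) for tokens in sentences]
    return np.sum(sent_vecs, axis=0) if sent_vecs else np.zeros(dim)

# With only document-level alignment (e.g., the same TED talk transcribed
# in two languages), the same noise-contrastive signal applies one level up:
# loss = hinge_loss(compose_document(doc_en, en_emb),
#                   compose_document(doc_de, de_emb),
#                   compose_document(random_other_de, de_emb))
```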

20.-A new massively multilingual corpus of TED talk transcripts across 12 languages was built for multi-label classification.

21.-The talk aimed to purely validate extending the distributional hypothesis to multilingual data, so monolingual data was not used, though it could help.

22.-Neuroscience shows early vs late bilingual learners have mixed vs separate representations. Sequential learning effects were not explored but seem worth trying.

23.-The multilingual approach was argued to be more elegant than recent autoencoder-based multilingual representation learning which requires generation via source trees.

Knowledge Vault built by David Vivancos 2024