Knowledge Vault 5/67 - CVPR 2021
Language Models Challenges and Progress
Noah Smith

Concept Graph & Summary using Claude 3 Opus | Chat GPT4o | Llama 3:

graph LR
classDef models fill:#f9d4d4, font-weight:bold, font-size:14px
classDef data fill:#d4f9d4, font-weight:bold, font-size:14px
classDef evaluation fill:#d4d4f9, font-weight:bold, font-size:14px
classDef efficiency fill:#f9f9d4, font-weight:bold, font-size:14px
classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
A[Language Models Challenges and Progress] --> B[Language models: core NLP, data-driven, pre-trainable 1]
B --> C[Improving coverage, generalization, efficiency, performance key 2]
A --> D[NLP evaluations: potentially flawed assumptions 3]
D --> E[Human-like generation goal oversimplifies, avoid some content 4]
A --> F[More dataset transparency, control methods needed 5]
A --> G[Research needed: training, engaging human evaluators 6]
A --> H[Vision, NLP can learn from each other's methods 7]
A --> I[GroK: eliminates word-type parameters, allows vocabulary changes 8]
I --> J[GroK uses outside sources for word representations 9]
I --> K[GroK excels out-of-domain, works with smaller lexicons 10]
I --> L[Vision may benefit from GroK-like models 11]
A --> M[Transformers common in language modeling, attention computationally expensive 12]
M --> N[Efficient transformers benefit high and low-resource groups 13]
M --> O[Random Fourier features can make attention more efficient 14]
O --> P[Random feature attention RFA: linear time, constant space 15]
O --> Q[RFA implies recency bias, can help if correct 16]
O --> R[RFA: 2x translation speedup, maintains performance 17]
O --> S[RFA minimally affects perplexity, can improve with techniques 18]
O --> T[RFA competitive in speed, accuracy on long-text tasks 19]
O --> U[Pre-trained models adaptable to RFA by swapping layers 20]
A --> V[Challenges remain: evaluation, adaptability, efficiency, ongoing research needed 21]
A --> W[Future areas: social impacts, human interaction, multilinguality 22]
A --> X[Vision-NLP collaboration holds great potential for both 23]
A --> Y[Genie: standardized human NLP evaluations for methodology research 24]
A --> Z[C4 dataset for T5 released for transparency 25]
class B,C,I,J,K,L,M,N,O,P,Q,R,S,T,U models
class Z,F data
class D,E,G,Y,V evaluation
class H,W,X future


1.- Language models are central to current NLP solutions, built from raw text data, and can be pre-trained separately from task-specific models.

2.- Improving language models' coverage, generalization, efficiency, and performance is key, as they impact all NLP application areas.

3.- Some NLP evaluations with human judges may be based on flawed assumptions about human perception of machine-generated language.

4.- Aspiring to human-like language generation oversimplifies what humans generate; NLP systems should not emulate some human-authored content.

5.- Greater transparency is needed in datasets underlying language models, along with better methods to analyze and control them.

6.- More research is needed on how to train and engage human evaluators to provide useful information for improving NLP systems.

7.- Computer vision and NLP communities can learn from each other regarding the role of non-researcher humans in research methodology.

8.- GroK is a language model that eliminates word-type-specific parameters, allowing the vocabulary to change without relearning anything.

9.- GroK incorporates outside information sources like lexicons and dictionaries to ground word representations.

10.- GroK outperforms non-compositional baselines in out-of-domain settings, and is robust to smaller lexicons, relevant for technical domains.

11.- Computer vision may benefit from GroK-like models for tasks with large label sets and few training observations per label.

12.- Transformers are commonly used as the encoding function in language modeling, with attention layers being computationally expensive for long sequences.

13.- Making transformers more efficient benefits both high-resource groups pushing model limits and low-resource groups doing more with less.

14.- Attention layers can be made more efficient by approximating their exponentiated inner products with random Fourier features, so that attention reduces to linear operations on feature vectors.

15.- Random feature attention (RFA) runs in linear time and constant space, designed as a drop-in replacement for standard softmax-based attention.

16.- RFA can build a recency-bias assumption into transformers, which helps generalization when that assumption is correct.

17.- RFA achieves nearly 2x decoding speedup on machine translation benchmarks while maintaining performance, outperforming other efficient attention methods.

18.- RFA has minimal effect on perplexity in language modeling and can even improve performance with additional techniques like cross-batch state passing.

19.- RFA is competitive in speed and accuracy on long-text classification benchmarks compared to other efficient attention approaches.

20.- Pre-trained language models can be adapted to use linear attention by swapping in RFA layers for some attention layers while leaving the others unchanged.

21.- Challenges remain in evaluation, adaptability, and efficiency of language models and transformers, requiring ongoing research and collaboration.

22.- Social and environmental impacts, applications, human interaction concerns, and multilinguality in NLP are important areas for future discussion.

23.- Collaboration between computer vision and NLP holds great potential for advancing both fields.

24.- Genie is a new leaderboard offering standardized human evaluations for NLP tasks to facilitate research on evaluation methodology.

25.- C4, the dataset used to build Google's T5 language model, has been publicly released to promote transparency.
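
The random-feature idea in points 14 to 16 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation discussed in the talk: queries and keys are assumed to be unit-norm, projections are drawn from a standard Gaussian, and `gate` is a hypothetical fixed scalar showing one way a recency bias could be expressed. The function names (`make_projections`, `phi`, `softmax_causal`, `rfa_causal`) are illustrative only.

```python
import math
import random


def make_projections(dim, num_features, rng):
    """Draw num_features random directions w ~ N(0, I) (assumed distribution)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def normalize(x):
    """Scale a vector to unit length; the sketch assumes unit-norm q and k."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def phi(x, W):
    """Random Fourier features: phi(x) . phi(y) approximates exp(-|x - y|^2 / 2)."""
    proj = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.sin(p) for p in proj] + [scale * math.cos(p) for p in proj]


def softmax_causal(queries, keys, values):
    """Reference causal softmax attention; work per step grows with the prefix."""
    outputs = []
    for t, q in enumerate(queries):
        scores = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys[: t + 1]]
        total = sum(scores)
        outputs.append([
            sum(s * v[j] for s, v in zip(scores, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def rfa_causal(queries, keys, values, W, gate=None):
    """Causal attention via random features, computed as a recurrence.

    The state (S, z) has a fixed size, so memory does not grow with sequence
    length. `gate` in [0, 1) is a hypothetical fixed decay that down-weights
    older positions; None keeps plain running sums.
    """
    n_feat, d_v = 2 * len(W), len(values[0])
    keep, write = (1.0, 1.0) if gate is None else (gate, 1.0 - gate)
    S = [[0.0] * d_v for _ in range(n_feat)]  # running sum of phi(k) v^T
    z = [0.0] * n_feat                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        pk, pq = phi(k, W), phi(q, W)
        for i in range(n_feat):
            z[i] = keep * z[i] + write * pk[i]
            for j in range(d_v):
                S[i][j] = keep * S[i][j] + write * pk[i] * v[j]
        denom = sum(a * b for a, b in zip(pq, z))
        outputs.append([
            sum(pq[i] * S[i][j] for i in range(n_feat)) / denom for j in range(d_v)
        ])
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    W = make_projections(4, 1000, rng)
    qs = [normalize([rng.random() + 0.1 for _ in range(4)]) for _ in range(5)]
    ks = [normalize([rng.random() + 0.1 for _ in range(4)]) for _ in range(5)]
    vs = [[rng.random(), rng.random()] for _ in range(5)]
    print("softmax:", softmax_causal(qs, ks, vs)[-1])
    print("rfa:    ", rfa_causal(qs, ks, vs, W)[-1])
```

With unit-norm inputs, exp(q . k) differs from exp(-|q - k|^2 / 2) only by a constant factor that cancels in the normalization, which is why the two functions can be compared directly. Each step of `rfa_causal` touches only the fixed-size state, whereas `softmax_causal` revisits the whole prefix.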

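Points 8 and 9 describe output word representations that carry no word-type-specific parameters. The sketch below illustrates that general idea only; it is not GroK's architecture. As a stand-in technique it hashes character trigrams and definition tokens into a shared table, so every name here (`TABLE`, `features`, `compose`, `next_word_distribution`) and the hashing scheme itself are assumptions made for illustration. The randomly initialized table stands in for parameters that would normally be learned.

```python
import math
import random
import zlib

DIM = 8            # size of the hidden state and of composed word vectors
NUM_BUCKETS = 512  # rows in the shared table (assumed value)

_rng = random.Random(0)
# Shared table: rows are indexed by hashed features, never by word identity.
TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def features(word, definition=""):
    """Surface-form trigrams plus tokens from an outside definition."""
    padded = f"<{word}>"
    ngrams = ["c:" + padded[i:i + 3] for i in range(len(padded) - 2)]
    gloss = ["d:" + tok for tok in definition.lower().split()]
    return ngrams + gloss


def compose(word, definition=""):
    """Average the shared-table rows selected by the word's hashed features."""
    feats = features(word, definition)
    vec = [0.0] * DIM
    for feat in feats:
        row = TABLE[zlib.crc32(feat.encode("utf-8")) % NUM_BUCKETS]
        for j in range(DIM):
            vec[j] += row[j] / len(feats)
    return vec


def next_word_distribution(hidden, lexicon):
    """Softmax over whatever vocabulary `lexicon` (word -> definition) holds."""
    words = list(lexicon)
    logits = [
        sum(h * e for h, e in zip(hidden, compose(w, lexicon[w]))) for w in words
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return {w: e / total for w, e in zip(words, exps)}


if __name__ == "__main__":
    lexicon = {"cat": "a small domesticated feline", "dog": "a domesticated canine"}
    hidden = [0.1] * DIM
    print(next_word_distribution(hidden, lexicon))
    # Adding a word needs no new parameters: its vector is composed on the fly.
    lexicon["axolotl"] = "a neotenic salamander"
    print(next_word_distribution(hidden, lexicon))
```

Because `TABLE` is the only parameter store and its size is independent of the lexicon, entries can be added or removed without touching any stored values, which mirrors the vocabulary-change property described in point 8.
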
Knowledge Vault built by David Vivancos 2024