Knowledge Vault 2/36 - ICLR 2014-2023
Alex Graves ICLR 2017 - Invited Talk - New Directions For Recurrent Neural Networks

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
classDef rnn fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef external fill:#d4f9d4, font-weight:bold, font-size:14px;
classDef dnc fill:#d4d4f9, font-weight:bold, font-size:14px;
classDef bptt fill:#f9f9d4, font-weight:bold, font-size:14px;
classDef act fill:#f9d4f9, font-weight:bold, font-size:14px;
classDef programming fill:#d4f9f9, font-weight:bold, font-size:14px;
classDef curriculum fill:#f9d4d4, font-weight:bold, font-size:14px;
classDef memory fill:#d4d4f9, font-weight:bold, font-size:14px;
A[Alex Graves ICLR 2017] --> B[RNNs: LSTM, GRU effective for sequences 1]
B --> C[RNNs: end-to-end, less feature engineering 2]
B --> D[RNN memory: fragile, costly 3]
A --> E[External memory: flexible, separate from cost 4]
E --> F[Early examples: NTM, MemNets, NMT 5]
E --> G[DNCs: sophisticated memory access mechanisms 6]
G --> H[DNCs excel at unseen structures 7]
G --> I[DNCs: multiple access mechanisms combined 8]
G --> J[DNCs: 18/20 bAbI, induction unclear 9]
E --> K[Scaling challenge: cost, sparse access helps 10]
A --> L[BPTT: memory cost, infrequent updates 11]
L --> M[Truncated BPTT: misses long-range 12]
L --> N[Synthetic gradients: decoupled training 13]
N --> O[Synthetic gradients: longer sequences, efficiency 14]
N --> P[Synthetic gradients: asynchronous hierarchical RNNs 15]
A --> Q[RNNs: computation tied to sequence length 16]
Q --> R[ACT: learned 'pondering' time per input 17]
R --> S[ACT: separates computation from data time 18]
R --> T[ACT: reveals informative data patterns 19]
R --> U[ACT: more compute for difficult inputs 20]
A --> V[Differentiable vs. manual programming unclear 21]
V --> W[Neural programs simpler than human-level 22]
V --> X[Bridging neural and symbolic programming challenging 23]
V --> Y[Self-programming computers eventual, path uncertain 24]
A --> Z[Automatic curriculum learning important challenge 25]
Z --> AA[Reinforcement learning needs data collection guidance 26]
A --> AB[Computer memory hierarchies could benefit neural 27]
AB --> AC[LSTM like registers, external like RAM 28]
AB --> AD[Fast, frequent neural memory rewrites penalized 29]
AB --> AE[Evolved computer memory hierarchy informative 30]
class A,B,C,D rnn;
class E,F,G,H,I,J,K external;
class L,M,N,O,P bptt;
class Q,R,S,T,U act;
class V,W,X,Y programming;
class Z,AA curriculum;
class AB,AC,AD,AE memory;

Resume:

1.-RNNs with multiplicative units like LSTM and GRU work well and are widely used for tasks involving sequential data.

2.-RNNs are increasingly trained end-to-end, with raw inputs fed in and raw outputs produced, reducing the need for feature engineering.

3.-RNN memory can be fragile, with new information overwriting what's stored. Computational cost also grows with memory size, since enlarging the hidden state enlarges the recurrent weight matrices (roughly quadratically).

4.-External memory gives the network less fragile, more flexible storage and decouples memory capacity from computational cost.

5.-Neural Turing Machines, Memory Networks, and attention-based Neural Machine Translation were early examples of neural networks with external memory accessed by attention.

6.-Differentiable Neural Computers (DNCs) are a newer example, with more sophisticated memory access mechanisms like content-based addressing and temporal linking.
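
As a rough illustration of the content-based addressing mentioned above, the NumPy sketch below turns cosine similarity between a query key and every memory row into soft read weights. The memory contents, sizes, and sharpness parameter are invented for the example; the DNC's other mechanisms (temporal linking, usage-based allocation) are only noted in a comment, not implemented.

```python
import numpy as np

def content_read(memory, key, beta=5.0):
    """Content-based addressing: soft read weights from cosine similarity
    between a query key and every memory row (beta sharpens the softmax).
    A DNC combines this with further mechanisms, e.g. a temporal link matrix
    that lets reads step forward/backward in the order rows were written."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    w = np.exp(beta * sims)
    w /= w.sum()                 # soft attention over memory rows
    return w @ memory, w         # read vector and the weights used

# Toy usage: 64 memory rows of width 16, queried with a noisy copy of row 3.
rng = np.random.default_rng(0)
M = rng.normal(size=(64, 16))
read_vec, weights = content_read(M, M[3] + 0.1 * rng.normal(size=16))
print(weights.argmax())          # most likely prints 3: the best-matching row
```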

7.-DNCs outperformed previous models on tasks like traversing London Underground connections, despite never seeing that structure during training.

8.-DNCs use multiple access mechanisms in combination, like content-based lookup to fill in missing information from a query.

9.-DNCs passed 18/20 bAbI tasks, but failed on basic induction for unknown reasons, highlighting areas for further research.

10.-Scaling up external memory systems has been challenging due to computational cost, but sparse access methods help efficiency.
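
A minimal sketch of the sparse-access idea, under the assumption that reads are restricted to a top-k subset of rows (the value of k, the sizes, and the data are illustrative). A practical sparse memory such as Sparse Access Memory would also avoid the full similarity scan below by using an approximate nearest-neighbour index.

```python
import numpy as np

def sparse_content_read(memory, key, k=4):
    """Soft read over only the k best-matching rows, so the softmax and the
    weighted sum touch k rows instead of all of them."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    top = np.argpartition(-sims, k)[:k]   # indices of the k most similar rows
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()                          # softmax over just those k rows
    return w @ memory[top], top

# Toy usage: the queried row itself should be among the k selected rows.
rng = np.random.default_rng(1)
M = rng.normal(size=(1024, 16))
vec, rows = sparse_content_read(M, M[10], k=4)
print(10 in rows)                         # expected: True
```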

11.-Back-propagation through time (BPTT) has drawbacks for RNNs: memory cost grows with sequence length, and weights are only updated after an entire sequence (or truncation window) has been processed.

12.-Truncated BPTT is commonly used but misses long-range interactions. Approximations to real-time recurrent learning (RTRL) are promising but not yet practical.

13.-Synthetic gradients predict error gradients using local information, allowing decoupled training of network components without full BPTT.
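
The PyTorch sketch below illustrates the synthetic-gradient idea for truncated BPTT: a small predictor estimates the gradient that future chunks would have sent back across each truncation boundary, and is itself trained against a bootstrapped target. The toy data, the GRU cell, the linear predictor, and the training loop are all assumptions for illustration, not the exact method presented in the talk.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
IN, HID, SEQ, CHUNK = 4, 32, 40, 10

rnn = nn.GRUCell(IN, HID)
readout = nn.Linear(HID, 1)
sg_net = nn.Linear(HID, HID)    # predicts d(future loss)/dh from h alone
opt = torch.optim.Adam(
    list(rnn.parameters()) + list(readout.parameters()) + list(sg_net.parameters()),
    lr=1e-3,
)

x = torch.randn(SEQ, 1, IN)     # one toy input sequence
y = torch.randn(SEQ, 1, 1)      # one toy target per step

h = torch.zeros(1, HID)
sg_prev = None                  # prediction made at the previous boundary

for start in range(0, SEQ, CHUNK):
    boundary = h.detach().requires_grad_(True)   # truncate the graph here
    h_t, chunk_loss = boundary, 0.0
    for t in range(start, start + CHUNK):
        h_t = rnn(x[t], h_t)
        chunk_loss = chunk_loss + ((readout(h_t) - y[t]) ** 2).mean()

    # Inject the predicted future gradient at the chunk's last state, so this
    # chunk is trained as if gradients had flowed back from later chunks.
    sg_here = sg_net(h_t.detach())
    total = chunk_loss + (h_t * sg_here.detach()).sum()

    opt.zero_grad()
    total.backward()

    # Train the predictor: its previous prediction should match the
    # bootstrapped gradient that actually arrived at this chunk's entry state.
    if sg_prev is not None:
        ((sg_prev - boundary.grad.detach()) ** 2).mean().backward()
    opt.step()

    # Recompute after the update so the prediction's graph is valid next time.
    sg_prev = sg_net(h_t.detach())
    h = h_t
```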

14.-Synthetic gradients make truncated BPTT more efficient, enabling training on much longer sequences that were previously impractical.

15.-Synthetic gradients enable asynchronous updates and communication between modules ticking at different timescales in a hierarchical RNN.

16.-For typical RNNs, the amount of computation is tied one-to-one to input sequence length (one state update per input step), which is limiting for complex reasoning tasks.

17.-Adaptive Computation Time (ACT) allows the network to learn how long to "ponder" each input before producing an output.
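
A NumPy sketch of the ACT halting loop, assuming a made-up transition function and halting unit: the network repeats its update on a single input until the cumulative halting probability reaches 1 - epsilon, then emits the probability-weighted mixture of the intermediate states. The ponder-cost penalty used during training is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
HID, EPS, MAX_PONDER = 16, 0.01, 10

# Illustrative parameters for one transition function and one halting unit.
W = rng.normal(scale=0.3, size=(HID, HID))
w_halt, b_halt = rng.normal(scale=0.3, size=HID), -1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ponder(state):
    """Repeat the transition on one input until the cumulative halting
    probability reaches 1 - EPS; emit the weighted mixture of states."""
    states, probs, cum = [], [], 0.0
    for n in range(MAX_PONDER):
        state = np.tanh(W @ state)                 # one pondering micro-step
        halt = sigmoid(w_halt @ state + b_halt)    # halting probability
        if cum + halt >= 1.0 - EPS or n == MAX_PONDER - 1:
            probs.append(1.0 - cum)                # remainder: weights sum to 1
            states.append(state)
            break
        probs.append(halt)
        states.append(state)
        cum += halt
    weights = np.array(probs)[:, None]
    return (weights * np.array(states)).sum(axis=0), len(probs)

out_state, steps = ponder(rng.normal(size=HID))
print(steps)   # number of pondering steps this input received
```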

18.-ACT separates computation time from data time, analogous to how memory networks separate computation from memory.

19.-ACT reveals informative patterns in data, like spikes in computation at uncertain points rather than just where loss is high.

20.-ACT shows that networks spend little compute on incompressible information, and more on difficult inputs or salient image regions.

21.-It's unclear if differentiable systems are fundamentally different from manually written programs, or if they can fully replicate programming abstractions.

22.-Neural programs learned so far are much simpler than human programming, which relies on abstractions like subroutines and recursion.

23.-Bridging implicit neural representations with symbolic programming abstractions is an open challenge; optimization alone may be insufficient.

24.-The speaker believes computers will eventually learn to program themselves, but the path to get there is uncertain.

25.-Automatic curriculum learning, or learning what to learn next, is an important challenge as we move beyond large supervised datasets.

26.-Reinforcement learning especially needs sophisticated mechanisms to guide data collection, as data is scarcer and trials are costly.

27.-Computers have memory hierarchies (registers, caches, RAM, disks) that match usage patterns, which could benefit neural memory systems.

28.-LSTM controller memory acts like registers or cache, while external read/write memory is more like RAM. Read-only memory could be added.

29.-Rewriting fast, frequently accessed neural memory may need to be penalized differently than slow, infrequently rewritten memory.

30.-Recreating the evolved memory hierarchy of modern computers may be useful in developing neural architectures with memory.

Knowledge Vault built by David Vivancos 2024