graph LR
classDef core fill:#ffe0cc,font-weight:bold,font-size:14px;
classDef learn fill:#ccf2ff,font-weight:bold,font-size:14px;
classDef plan fill:#e0ccff,font-weight:bold,font-size:14px;
classDef chall fill:#ffccff,font-weight:bold,font-size:14px;
classDef scale fill:#ffffcc,font-weight:bold,font-size:14px;
Main[OAK Core]
Main --> C1[Options Knowledge pillars 1]
C1 -.-> G1[Core]
Main --> C2[Online interaction births intelligence 2]
C2 -.-> G2[Learning]
Main --> C3[Option: control plus end 3]
C3 -.-> G1
Main --> C4[Agents craft own sub-problems 4]
C4 -.-> G2
Main --> C5[Sub-problem trains option 5]
C5 -.-> G2
Main --> C6[Models predict option outcomes 6]
C6 -.-> G3[Planning]
Main --> C7[Planning chains options 7]
C7 -.-> G3
Main --> C8[All learning online 8]
C8 -.-> G2
Main --> C9[World bigger than model 9]
C9 -.-> G1
Main --> C10[No forgetting critical 10]
C10 -.-> G4[Challenges]
Main --> C11[Virtuous abstraction cycle 11]
C11 -.-> G1
Main --> C12[Perception builds features 12]
C12 -.-> G2
Main --> C13[Utility ranks features 13]
C13 -.-> G2
Main --> C14[Sub-problems reward-respecting 14]
C14 -.-> G2
Main --> C15[Options store value models 15]
C15 -.-> G3
Main --> C16[Planning updates via imagination 16]
C16 -.-> G3
Main --> C17[Domain-general no facts 17]
C17 -.-> G1
Main --> C18[No design-time knowledge 18]
C18 -.-> G1
Main --> C19[Compute scales learning 19]
C19 -.-> G5[Scaling]
Main --> C20[Meta-learning while acting 20]
C20 -.-> G2
Main --> C21[Reward hypothesis: scalar suffices 21]
C21 -.-> G1
Main --> C22[Omit objectives constraints 22]
C22 -.-> G1
Main --> C23[Handle non-stationary worlds 23]
C23 -.-> G4
Main --> C24[Approximate functions inevitable 24]
C24 -.-> G4
Main --> C25[States built online 25]
C25 -.-> G2
Main --> C26[U-box builds state 26]
C26 -.-> G2
Main --> C27[Auxiliary values per feature 27]
C27 -.-> G2
Main --> C28[Off-policy GTD learns 28]
C28 -.-> G2
Main --> C29[Option models plan extended 29]
C29 -.-> G3
Main --> C30[Same plan high low 30]
C30 -.-> G3
Main --> C31[Utility curates abstractions 31]
C31 -.-> G2
Main --> C32[Forgetting plasticity obstacles 32]
C32 -.-> G4
Main --> C33[IDBD adapts rates 33]
C33 -.-> G4
Main --> C34[Play self-generates skills 34]
C34 -.-> G2
Main --> C35[Animals inspire problem creation 35]
C35 -.-> G2
Main --> C36[Five-page pseudocode goal 36]
C36 -.-> G5
Main --> C37[Contrasts LLM offline 37]
C37 -.-> G5
Main --> C38[Unifies RL planning 38]
C38 -.-> G1
Main --> C39[Mechanistic theory target 39]
C39 -.-> G5
Main --> C40[Scales to human 40]
C40 -.-> G5
Main --> C41[No labels demos 41]
C41 -.-> G2
Main --> C42[Discovers objects relations 42]
C42 -.-> G2
Main --> C43[Compute limits abstraction 43]
C43 -.-> G5
Main --> C44[Research not finished 44]
C44 -.-> G5
Main --> C45[RL path to AI 45]
C45 -.-> G1
Main --> C46[Minds from simple rules 46]
C46 -.-> G1
G1[Core] --> C1
G1 --> C3
G1 --> C9
G1 --> C11
G1 --> C17
G1 --> C18
G1 --> C21
G1 --> C22
G1 --> C38
G1 --> C45
G1 --> C46
G2[Learning] --> C2
G2 --> C4
G2 --> C5
G2 --> C8
G2 --> C12
G2 --> C13
G2 --> C14
G2 --> C20
G2 --> C25
G2 --> C26
G2 --> C27
G2 --> C28
G2 --> C31
G2 --> C34
G2 --> C35
G2 --> C41
G2 --> C42
G3[Planning] --> C6
G3 --> C7
G3 --> C15
G3 --> C16
G3 --> C29
G3 --> C30
G4[Challenges] --> C10
G4 --> C23
G4 --> C24
G4 --> C32
G4 --> C33
G5[Scaling] --> C19
G5 --> C36
G5 --> C37
G5 --> C39
G5 --> C40
G5 --> C43
G5 --> C44
class C1,C3,C9,C11,C17,C18,C21,C22,C38,C45,C46 core
class C2,C4,C5,C8,C12,C13,C14,C20,C25,C26,C27,C28,C31,C34,C35,C41,C42 learn
class C6,C7,C15,C16,C29,C30 plan
class C10,C23,C24,C32,C33 chall
class C19,C36,C37,C39,C40,C43,C44 scale
Summary:
Rich Sutton unveiled OAK, an architecture that tries to grow superintelligence from raw experience instead of curated data. It is built around the idea that an agent should face the world with no hard-wired facts, only the capacity to create options—controllable behaviours that can be started and stopped on command. Each option is paired with learned predictions of its likely outcomes, so planning becomes a matter of chaining these high-level behaviours rather than primitive actions. Because the world is far larger than any model the designer can embed, all learning, abstraction and model revision must happen online while the agent lives. The design therefore pushes everything that traditional systems do offline into runtime, making continual, non-forgetting learning the critical enabler.
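The option construct at the heart of this design can be sketched in a few lines. The structure below is purely illustrative (the field names and type choices are assumptions, not taken from the talk): an option bundles a way of behaving, a way of stopping, and a learned model of where the behaviour leads and what reward it collects along the way.

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any    # whatever representation the perception module produces
Action = int   # a discrete action space is assumed for this sketch

@dataclass
class Option:
    """A controllable behaviour: a way of behaving plus a way of stopping."""
    policy: Callable[[State], Action]        # pi(s): action to take while the option runs
    termination: Callable[[State], float]    # beta(s): probability of stopping in state s
    # Learned option model, consulted by the planner instead of simulating primitive steps:
    expected_outcome: Callable[[State], State]   # predicted state when the option terminates
    expected_reward: Callable[[State], float]    # predicted cumulative reward accumulated en route
```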
The core loop is deliberately simple and domain-independent. A perceptual module converts sensations into a growing set of state features. Every feature that proves useful becomes the target of a self-generated sub-problem: learn an option that reliably reaches states where that feature is active without sacrificing long-term reward. Solving the sub-problem yields both a policy and a predictive model of the extended behaviour. These option models are stacked into a hierarchical planner that reasons over temporally abstract transitions, letting the agent look many steps ahead without drowning in low-level detail. Features, options and models therefore bootstrap one another: new features suggest new options, whose mastery creates yet richer features, producing an open-ended cycle of abstraction that Sutton argues is necessary for generally intelligent behaviour.
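A condensed sketch of that perceive-learn-plan cycle is given below. Every component name (perception, utility, subproblems, planner, and so on) is invented here for illustration; the talk itself promises only that the real agent should fit in roughly five pages of pseudocode.

```python
def oak_step(agent, observation, reward):
    """One online step of the cycle described above (structural sketch, not a reference implementation)."""
    # 1. Perception: fold the new observation into the feature-based state estimate.
    state = agent.perception.update(observation)

    # 2. Utility ranking: track which features are proving useful for obtaining reward.
    agent.utility.update(state, reward)

    # 3. Sub-problem creation: highly ranked features spawn their own sub-problems.
    for feature in agent.utility.top_features():
        agent.subproblems.setdefault(feature, agent.new_subproblem(feature))

    # 4. Option and model learning: each sub-problem trains an option (policy + termination)
    #    and a model of that option's long-term outcome, off-policy, from the same experience stream.
    for sub in agent.subproblems.values():
        sub.option.learn(state, reward)
        sub.option_model.learn(state, reward)

    # 5. Planning: sweep over imagined option-level transitions to improve the value function.
    agent.planner.sweep(agent.subproblems, agent.value_function)

    # 6. Act: choose the next primitive action, possibly by continuing a running option.
    return agent.behaviour_policy(state)
```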
OAK is presented as a research manifesto rather than a finished product. It explicitly leaves unsolved two hard problems: continual deep learning without catastrophic forgetting, and automatic discovery of useful features from high-dimensional inputs. Sutton claims that once these pieces stabilise, the same compact architecture could scale from toy worlds to human society because it carries no built-in assumptions about the domain. By refusing to encode designer knowledge, the agent must itself discover objects, relations, goals and skills, thereby implementing a purely experiential route to powerful, and perhaps superhuman, intelligence.
Key Ideas:
1.- OAK stands for Options and Knowledge, the two pillars of the architecture.
2.- Intelligence emerges from online interaction, not from static training data.
3.- An option is a controllable behaviour (a policy) paired with a learned termination condition (a way of stopping).
4.- Agents create their own sub-problems by selecting salient state features.
5.- Each sub-problem trains an option that reliably attains its feature.
6.- Option models predict long-term outcomes of whole behavioural sequences.
7.- Planning chains options instead of primitive actions for efficient lookahead.
8.- All learning, modelling and abstraction occur during runtime, never offline.
9.- The world is assumed vastly larger than any model the designer can embed.
10.- Continual deep learning without forgetting is the critical missing capability.
11.- Features, options and models form a virtuous cycle of open-ended abstraction.
12.- Perception converts raw sensations into a growing set of state features.
13.- Features are ranked by their utility for solving current and future tasks.
14.- Sub-problems are reward-respecting: they maximise feature attainment minus any main-task reward forgone (see the sketch after this list).
15.- Options are stored with their value functions and transition models.
16.- Planning updates value estimates by sweeping over imagined option trajectories (see the sketch after this list).
17.- The architecture is deliberately domain-general and contains no built-in facts.
18.- Design-time knowledge is rejected to keep the agent conceptually simple.
19.- Runtime learning scales with available compute, not human expert time.
20.- Meta-learning improves the learning algorithm itself while the agent acts.
21.- The reward hypothesis states that scalar reward is sufficient for all goals.
22.- Multiple objectives, constraints or risk measures are intentionally omitted.
23.- The agent must handle non-stationary worlds that appear to change over time.
24.- Approximate value functions and policies are inevitable in big worlds.
25.- State representations are constructed online from sequences of actions and observations.
26.- The U-box, now called the perception module, builds the state estimate from raw inputs.
27.- Auxiliary value functions correspond to each discovered feature.
28.- Off-policy algorithms like GTD learn value functions for every option.
29.- Option models enable planning with temporally extended behaviours.
30.- The same planning mechanism works for both low and high-level options.
31.- Feature utility feedback curates which abstractions the agent keeps.
32.- Catastrophic forgetting and plasticity loss are major unsolved obstacles.
33.- IDBD (Incremental Delta-Bar-Delta), Sutton’s 1992 step-size adaptation algorithm, may help adapt learning rates continually (see the sketch after this list).
34.- Play is viewed as self-generated sub-problems for skill acquisition.
35.- Animals and infants inspire the idea of spontaneous problem creation.
36.- The architecture aims for a concise, five-page pseudocode description.
37.- OAK contrasts with large language models that rely on massive offline data.
38.- The approach unifies reinforcement learning, representation learning and planning.
39.- Success would yield a mechanistic theory of mind, reason and concept formation.
40.- The design is intended to scale from grid worlds to human-level complexity.
41.- No external labels or human demonstrations are required for learning.
42.- The agent must discover objects, relations and goals solely from experience.
43.- Computational resources are the only limit on achievable abstraction depth.
44.- The architecture is presented as a research direction, not a finished system.
45.- Sutton claims reinforcement learning offers the clearest path to strong AI.
46.- The grand quest is understanding how minds can arise from simple, general principles.
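Key Idea 14 can be made a little more concrete. The sketch below follows the reward-respecting-subtasks formulation associated with Sutton's group (the function name and bonus weight are assumptions for illustration): while the option runs, the sub-problem pays the ordinary main-task reward, and only at stopping time does the feature itself contribute a bonus, so chasing the feature can never justify throwing away long-term reward.

```python
def subtask_reward(main_reward: float, feature_value: float,
                   stopping: bool, bonus_weight: float = 1.0) -> float:
    """Reward-respecting sub-problem reward for one feature (illustrative sketch)."""
    reward = main_reward                        # the main task's reward is always respected
    if stopping:
        reward += bonus_weight * feature_value  # the feature only pays off at termination
    return reward
```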
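Key Idea 16 describes planning as sweeping over imagined option transitions. A minimal value-iteration-style sweep over option models could look like the following, where each model is assumed to return the expected reward along the way, the accumulated discount, and the expected termination state (all signatures are illustrative, not from the talk):

```python
def planning_sweep(states, option_models, value):
    """One sweep of imagined backups using option models (illustrative sketch).

    A whole temporally extended behaviour is backed up as if it were a single
    action, which is what lets the planner look far ahead cheaply.
    """
    for s in states:
        backups = []
        for model in option_models:
            reward, discount, s_next = model(s)   # model of the entire option, not one primitive step
            backups.append(reward + discount * value[s_next])
        value[s] = max(backups)
    return value
```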
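Key Idea 33 points to IDBD as one candidate for keeping step-sizes adapted during never-ending learning. A compact version of the 1992 update for a linear predictor is sketched below (the variable names and default meta step-size are choices made here, not quoted from the talk):

```python
import numpy as np

def idbd_update(w, beta, h, x, target, meta_step=0.01):
    """One IDBD step (Sutton, 1992) for a linear predictor y = w . x (sketch).

    Each weight carries its own log step-size beta; meta_step controls how fast
    those per-weight step-sizes adapt as the data distribution drifts.
    """
    delta = target - np.dot(w, x)            # prediction error
    beta = beta + meta_step * delta * x * h  # move log step-sizes along the meta-gradient
    alpha = np.exp(beta)                     # per-weight step-sizes
    w = w + alpha * delta * x                # delta-rule update with individual rates
    h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x  # trace of recent weight changes
    return w, beta, h
```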
Interviews by Plácido Doménech Espà & Guests - Knowledge Vault built by David Vivancos 2025