Knowledge Vault 6/99 - ICML 2024
Can LLMs Reason and Plan?
Subbarao Kambhampati
< Summary Image >

Concept Graph & Summary using Claude 3.5 Sonnet | ChatGPT-4o | Llama 3:

```mermaid
graph LR
    classDef core fill:#f9d4d4, font-weight:bold, font-size:14px
    classDef limits fill:#d4f9d4, font-weight:bold, font-size:14px
    classDef eval fill:#d4d4f9, font-weight:bold, font-size:14px
    classDef improve fill:#f9f9d4, font-weight:bold, font-size:14px
    classDef practical fill:#f9d4f9, font-weight:bold, font-size:14px
    Main[Can LLMs Reason<br>and Plan?] --> C[Core Capabilities]
    Main --> L[Limitations]
    Main --> E[Evaluation Methods]
    Main --> I[Improvement Approaches]
    Main --> P[Practical Usage]
    C --> C1[LLMs react rather than<br>think 1]
    C --> C2[Humans switch thinking modes 2]
    C --> C3[Reasoning claims need validation 3]
    C --> C4[Good reactions lack thinking 4]
    C --> C5[Models retrieve with errors 7]
    L --> L1[Block stacking shows limits 5]
    L --> L2[Models attempt impossible tasks 6]
    L --> L3[Training limits thought chains 12]
    L --> L4[Letter tests show limits 13]
    L --> L5[React prompts face barriers 14]
    L --> L6[Models cannot self-critique 15]
    E --> E1[Multiple tries show human<br>thinking 8]
    E --> E2[Style succeeds facts fail 9]
    E --> E3[Basic planning reveals flaws 10]
    E --> E4[Changes impact results badly 11]
    E --> E5[Distribution metrics fail here 19]
    I --> I1[External checks work better 16]
    I --> I2[Complexity timing stays constant 17]
    I --> I3[Size restricts learning range 18]
    I --> I4[Back-prompting helps some cases 26]
    I --> I5[External checks boost output 24]
    P --> P1[Models lack deductive skills 20]
    P --> P2[Planning needs deeper grasp 21]
    P --> P3[Removing rules helps wrongly 22]
    P --> P4[Give rough domain facts 23]
    P --> P5[Minimize human involvement 25]
    P3 --> P6[Style critique beats facts 27]
    P4 --> P7[Proofs need verification 28]
    P5 --> P8[Generate exceeds reasoning 29]
    P5 --> P9[Tools work within bounds 30]
    class Main,C,C1,C2,C3,C4,C5 core
    class L,L1,L2,L3,L4,L5,L6 limits
    class E,E1,E2,E3,E4,E5 eval
    class I,I1,I2,I3,I4,I5 improve
    class P,P1,P2,P3,P4,P5,P6,P7,P8,P9 practical
```

Summary:

1.- LLMs are primarily System 1 (reactive) processors rather than System 2 (deliberative) reasoners

2.- Humans can switch between System 1 and 2; LLMs cannot

3.- Claims about LLM reasoning often rest on human validation that gets misattributed to the model

4.- LLMs are impressive as System 1 processors but lack deliberative capabilities

5.- Simple block-stacking problems reveal LLMs' inability to reason logically
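What makes block stacking such a sharp probe is that correctness is mechanically checkable. The sketch below is a toy move-only Blocksworld of my own (a simplification, not the talk's exact benchmark): a plan is valid only if every move's preconditions hold when executed, which is precisely the step LLM-generated plans tend to violate.

```python
def is_clear(on, block):
    """A block is clear if nothing rests on top of it."""
    return all(support != block for support in on.values())

def validate_plan(on, plan):
    """Validate a move-only Blocksworld plan.

    `on` maps each block to what it rests on ("table" or another block);
    `plan` is a list of (block, destination) moves. The plan is valid
    only if each move's preconditions hold at the moment it is executed.
    """
    on = dict(on)  # work on a copy; don't mutate the caller's state
    for block, dest in plan:
        if not is_clear(on, block):                     # moved block must be clear
            return False
        if dest != "table" and not is_clear(on, dest):  # destination must be clear
            return False
        on[block] = dest
    return True
```

With B stacked on A, for example, a plan that moves A before unstacking B fails the check immediately, while an LLM will often emit it anyway.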

6.- LLMs always attempt solutions, even for unsolvable problems

7.- LLMs are N-gram models performing approximate retrieval with constant hallucination

8.- Re-prompting until success means the human in the loop, not the LLM, is doing the reasoning

9.- LLMs excel at style but struggle with factual correctness

10.- Systematic evaluation shows poor performance on basic planning tasks

11.- Obfuscated versions of problems severely impact LLM performance
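The obfuscation probe is easy to reproduce in spirit: rename the domain's meaningful tokens to arbitrary ones so that surface retrieval no longer helps, while the underlying logic is unchanged. A sketch using my own whole-word renaming scheme (not the talk's exact "Mystery Blocksworld" encoding):

```python
import re

def obfuscate(text, mapping):
    """Replace each whole-word key in `mapping` with its opaque alias."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

For instance, obfuscating "stack block A on block B" with {"stack": "feast", "block": "planet"} yields "feast planet A on planet B": logically identical, but the familiar retrieval cues are gone, and performance drops accordingly.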

12.- Chain-of-thought prompting doesn't generalize beyond the scope of its training examples

13.- Last-letter concatenation experiments show limited advice-taking capability
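The last-letter task is a good probe precisely because the ground truth is one line of code, so any deviation by the model is unambiguous. A minimal scorer (helper names are mine, not from the talk):

```python
def last_letter_concat(words):
    """Ground truth: concatenate the final letter of each word."""
    return "".join(word[-1] for word in words)

def score(model_answer, words):
    """Exact-match scoring against the programmatic ground truth."""
    return model_answer.strip() == last_letter_concat(words)
```

For example, the correct answer for ["machine", "learning"] is "eg"; models given explicit procedural advice for the task still drift from it as the word lists grow.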

14.- ReAct prompting faces limitations similar to chain of thought

15.- LLMs cannot effectively self-critique their solutions

16.- External verifiers perform better than LLM self-criticism

17.- LLM response time stays constant regardless of a problem's computational complexity, suggesting no real search

18.- Fine-tuning shows poor generalization beyond training problem size

19.- In-distribution vs. out-of-distribution metrics aren't useful for LLMs

20.- LLMs lack deductive closure capabilities

21.- Planning requires understanding action interactions, which LLMs struggle with

22.- Removing preconditions improves LLM performance but also removes the need for actual planning

23.- LLMs excel at providing approximate domain knowledge

24.- LLM-Modulo frameworks combine LLM generation with external verification

25.- Human intervention should be minimized in LLM systems

26.- Back-prompting can improve accuracy within reasonable iterations
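Points 16, 24, and 26 combine into one loop: the LLM generates, a sound external verifier critiques, and the critique is back-prompted for a bounded number of rounds. A sketch with hypothetical `generate` and `verify` callables (stubs standing in for an LLM call and a verifier, not a real API):

```python
def llm_modulo(generate, verify, prompt, max_rounds=3):
    """Bounded generate-verify-backprompt loop.

    generate(prompt) -> candidate string;
    verify(candidate) -> (ok, critique).
    The external verifier, not the LLM, is the arbiter of correctness;
    the round cap keeps a stubborn failure from looping forever.
    """
    for _ in range(max_rounds):
        candidate = generate(prompt)
        ok, critique = verify(candidate)
        if ok:
            return candidate
        prompt = prompt + "\nVerifier feedback: " + critique  # back-prompt
    return None  # fail closed rather than return an unverified answer
```

Returning None on exhaustion reflects point 30: the tool stays within its bounds instead of passing off an unchecked guess as an answer.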

27.- LLMs can effectively critique style, but not correctness

28.- AlphaProof and similar systems rely on external verifiers

29.- LLMs are better as generators than reasoners

30.- LLMs remain valuable tools when used appropriately within limitations

Knowledge Vault built by David Vivancos 2024