Taskonomy: Disentangling Task Transfer Learning

Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese.

**Concept Graph & Summary using Claude 3 Opus | Chat GPT4o | Llama 3:**

graph LR
classDef codeslam fill:#f9d4d4, font-weight:bold, font-size:14px
classDef representations fill:#d4f9d4, font-weight:bold, font-size:14px
classDef training fill:#d4d4f9, font-weight:bold, font-size:14px
classDef autoencoder fill:#f9f9d4, font-weight:bold, font-size:14px
classDef keyframe fill:#f9d4f9, font-weight:bold, font-size:14px
classDef testing fill:#d4f9f9, font-weight:bold, font-size:14px
classDef future fill:#f9d9d4, font-weight:bold, font-size:14px
A[Taskonomy: Disentangling Task Transfer Learning] --> B[CodeSLAM: Deep learning SLAM system 1]
A --> C[Sparse vs dense SLAM representations 2]
A --> D[Depth maps: Subspace, structural correlation 3]
A --> E[Autoencoder: Encodes depth maps 4]
A --> F[Depth prediction: Network modulates 5]
A --> G[Training: CNET dataset, end-to-end 6]
G --> H[Code size: 128 dimensions 7]
G --> I[Linear decoder, grayscale images 8]
E --> J[Predicted uncertainty: Depth discontinuities 9]
E --> K[Linear decoding: Depth function 10]
E --> L[Jacobian: Constant derivative 11]
E --> M[Smooth code perturbations 12]
B --> N[Keyframe-based SLAM: Pose, code variables 13]
N --> O[Dense bundle adjustment: Photometric error 14]
N --> P[Joint optimization: Pose, codes 15]
N --> Q[Optimization results: Reconstructions achieved 16]
N --> R[Speed: 10 Hz iterations 17]
B --> S[Real-world testing: New York dataset 18]
B --> T[Visual odometry: NYU dataset 19]
T --> U[Simple system: One optimization 20]
T --> V[Zero code prior: Robustness 21]
B --> W[Future: Real data, self-supervision 22]
B --> X[Network improvements: Architecture, structure 23]
B --> Y[Demo: Preliminary live system 24]
B --> Z[Generalization: Zero-code prediction, optimization 25]
class A,B,Y codeslam
class C,D representations
class E,F,J,K,L,M autoencoder
class G,H,I training
class N,O,P,Q,R keyframe
class S,T,U,V testing
class W,X,Z future


**Summary:**

**1.-** Vision tasks are related, not independent (e.g. depth estimation, surface normals, object detection, room layout)

**2.-** Quantifying task relationships lets us treat tasks in concert rather than in isolation and exploit the redundancies between them

**3.-** Reducing the need for labeled data is desirable and is the focus of research on self-supervised learning, unsupervised learning, meta-learning, domain adaptation, ImageNet features, and fine-tuning

**4.-** Task relationships enable transfer learning: using a model developed for one task to help solve another, related task

**5.-** Intuitive example: surface normal estimation benefits more from transfer from the image reshading task than from the segmentation task

**6.-** Quantifying task relationships at scale allows forming a complete graph that exposes the redundancies between tasks

**7.-** This enables solving a set of tasks in concert while minimizing supervision by leveraging redundancies (all tasks transferred from 3 sources)

**8.-** It also enables solving a desired novel task without much labeled data by inserting it into the task relationship structure

**9.-** Taskonomy: a fully computational method to quantify task relationships at scale and extract a unified transfer learning structure

**10.-** Defined a set of 26 diverse vision tasks (semantic, 3D, 2D) as a sample task dictionary

**11.-** Collected a dataset of 4M real indoor images with ground truth for all 26 tasks

**12.-** Trained a task-specific network for each of the 26 tasks, then froze its weights

**13.-** Quantified task relationships by taking the frozen encoder of one task's network and training a small readout network on top of it to solve another task

**14.-** The readout network's performance on the test set determines the strength of the directed transfer relationship from source task to target task

**15.-** Computed 26x25 = 650 transfer functions, one per ordered source-target pair, to obtain the complete directed graph of task relationships

**16.-** Normalized the graph's adjacency matrix using the Analytic Hierarchy Process (AHP) to account for the tasks' different output spaces and numerical properties

**17.-** Extracted the optimal subgraph from the normalized complete graph to maximize collective task performance while minimizing the number of sources used

**18.-** Subgraph selection also handles transferring to novel tasks not in the original dictionary

**19.-** Higher-order transfers (multiple sources transferring to one target) are also included in the framework

**20.-** Experimental results: 26 tasks, 26 task-specific networks, ~3,000 transfer functions, 47,000 GPU hours; transfer training used 8-100x less data

**21.-** A sample computed taxonomy shows intuitive connections (3D tasks connected to each other, semantic tasks connected to each other) and enables solving some tasks with limited data

**22.-** Gain metric: measures value gained by transfer learning. Quality metric: measures how close transfer results are to task-specific networks.

**23.-** Live web API to compute taxonomies with custom arguments and compare to ImageNet features baseline

**24.-** Additional experiments: significance tests, generalization tests, sensitivity analyses, comparisons to self-supervised/unsupervised baselines

**25.-** Taskonomy is a step towards understanding the space of vision tasks and treating tasks as a structured space rather than as isolated concepts

**26.-** It provides a fully computational framework and a unified transfer learning model as a move towards a generalist perception model

**27.-** Taskonomy outperforms ImageNet feature transfer learning baselines

**28.-** Includes mechanism to handle novel tasks not in original task dictionary

**29.-** Can provide guidance for multi-task learning in terms of gauging similarity between tasks

**30.-** Optimized subgraph maximizes collective performance on all tasks while minimizing number of source tasks
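Points 12 to 14 describe measuring a directed transfer by freezing a source encoder and training only a small readout on the target task. The toy sketch below illustrates that idea in plain Python. Everything in it is an assumption made for illustration: the two "encoders" are fixed hand-written feature functions rather than trained networks, the target is an invented 1-D regression problem, and the readout is a linear layer fitted with simple SGD.

```python
def encoder_a(x):
    # Hypothetical frozen source encoder whose features suit the target task.
    return [x, x * x]

def encoder_b(x):
    # Hypothetical frozen source encoder that lacks the quadratic feature.
    return [x]

def target(x):
    # Toy target task, invented for this sketch.
    return 3.0 * x * x - x + 0.5

def train_readout(encoder, xs, lr=0.05, epochs=500):
    """Fit a linear readout on top of a frozen encoder with plain SGD."""
    dim = len(encoder(xs[0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x in xs:
            f = encoder(x)  # the encoder is only evaluated, never updated
            err = sum(wi * fi for wi, fi in zip(w, f)) + b - target(x)
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def heldout_loss(encoder, readout, xs):
    """Mean squared error of encoder + readout on held-out inputs."""
    w, b = readout
    errs = [
        (sum(wi * fi for wi, fi in zip(w, encoder(x))) + b - target(x)) ** 2
        for x in xs
    ]
    return sum(errs) / len(errs)

train_xs = [-1.0 + 2.0 * i / 49 for i in range(50)]
test_xs = [-0.95 + 0.1 * i for i in range(20)]

loss_a = heldout_loss(encoder_a, train_readout(encoder_a, train_xs), test_xs)
loss_b = heldout_loss(encoder_b, train_readout(encoder_b, train_xs), test_xs)
print("source A is the stronger transfer:", loss_a < loss_b)
```

A lower held-out loss for one source than for another is what makes the edge from that source to the target stronger, which mirrors the reshading versus segmentation example in point 5.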
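Point 16 mentions normalizing transfer scores with the Analytic Hierarchy Process. A minimal sketch of the general AHP recipe for a single target task follows: build a matrix of pairwise preference ratios between sources, then take its principal eigenvector as the normalized scores. It assumes the pairwise comparisons are win rates (the fraction of test images on which one source beats another); that choice, the clipping constant, and the numbers are illustrative assumptions, not details taken from the summary above.

```python
def ahp_scores(win_rate, iters=200, eps=1e-3):
    """Normalized source scores for one target task.

    win_rate[i][j] is assumed to be the fraction of test images on which
    source i transfers better than source j.
    """
    n = len(win_rate)

    def clip(p):
        # Keep rates away from 0 and 1 so the ratios stay finite.
        return min(max(p, eps), 1.0 - eps)

    # ratio[i][j] says how many times better source i is than source j.
    ratio = [
        [1.0 if i == j else clip(win_rate[i][j]) / clip(win_rate[j][i])
         for j in range(n)]
        for i in range(n)
    ]

    # Power iteration converges to the principal eigenvector of a
    # positive matrix; renormalize so the scores sum to 1.
    v = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(ratio[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        v = [x / total for x in nxt]
    return v

# Illustrative win rates for three hypothetical sources on one target.
win_rate = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
scores = ahp_scores(win_rate)
print([round(s, 3) for s in scores])
```

Because the scores come from relative comparisons rather than raw losses, they are comparable across targets whose losses live on different scales, which is the motivation given in point 16.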
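Points 17 and 30 describe picking a subgraph that maximizes collective performance while using few sources. The sketch below shows the shape of that problem with an exhaustive search over small source sets. It is a stand-in for illustration only, not the authors' optimizer: the `budget` parameter, the task names used as dictionary keys, and the affinity values are all hypothetical, and only first-order transfers (one source per target) are considered.

```python
from itertools import combinations

def select_sources(affinity, sources, targets, budget):
    """Pick at most `budget` sources maximizing the summed target affinity.

    affinity[s][t] is the normalized strength of the transfer s -> t.
    Each target is served by the best source among those chosen.
    """
    best_score, best_plan = float("-inf"), None
    for k in range(1, budget + 1):
        for chosen in combinations(sources, k):
            plan = {t: max(chosen, key=lambda s: affinity[s][t]) for t in targets}
            score = sum(affinity[plan[t]][t] for t in targets)
            # Strict '>' keeps the smaller source set when scores tie.
            if score > best_score:
                best_score, best_plan = score, plan
    return best_plan, best_score

# Hypothetical affinities, made up for this example.
affinity = {
    "reshading":    {"normals": 0.6, "depth": 0.5, "layout": 0.2},
    "segmentation": {"normals": 0.1, "depth": 0.2, "layout": 0.3},
    "curvature":    {"normals": 0.5, "depth": 0.3, "layout": 0.5},
}
plan, score = select_sources(
    affinity,
    sources=list(affinity),
    targets=["normals", "depth", "layout"],
    budget=2,
)
print(plan, round(score, 2))
```

Exhaustive search grows combinatorially with the number of candidate sources, so it only suits toy sizes; it is used here because it makes the objective easy to read.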
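Point 22 names two evaluation metrics, Gain and Quality. One way to make them concrete, assuming both are expressed as win rates over per-image test losses, is sketched below. The win-rate formulation and the choice of baselines (a network trained from scratch on the same small dataset for Gain, the fully supervised task-specific network for Quality) are assumptions of this sketch, and the loss values are invented.

```python
def win_rate(losses_a, losses_b):
    """Fraction of test images on which model A has strictly lower loss than B."""
    if len(losses_a) != len(losses_b):
        raise ValueError("loss lists must cover the same test images")
    wins = sum(a < b for a, b in zip(losses_a, losses_b))
    return wins / len(losses_a)

# Invented per-image losses on four test images.
transfer = [0.20, 0.30, 0.50, 0.10]   # transferred network
scratch = [0.40, 0.35, 0.45, 0.30]    # trained from scratch, same small dataset
full = [0.10, 0.30, 0.60, 0.05]       # fully supervised task-specific network

gain = win_rate(transfer, scratch)     # value added by transferring
quality = win_rate(transfer, full)     # closeness to the task-specific network
print(gain, quality)
```

Under this reading, a Gain above 0.5 means transfer beats training from scratch on most images, and a Quality near 0.5 means the transferred network is roughly on par with the task-specific one.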

Knowledge Vault built by David Vivancos 2024