Knowledge Vault 5/72 - CVPR 2022
Toward Integrative AI with Computer Vision
Xuedong Huang
< Resume Image >

Concept Graph & Resume using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef accessibility fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef ai fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef speech fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef vision fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef multimodal fill:#f9d4f9, font-weight:bold, font-size:14px
  A[Toward Integrative AI with Computer Vision] --> B[Live captioning enables following along 1]
  A --> C[AI identifies objects, summarizes videos 2]
  A --> D[Speech, language breakthroughs in 50 years 3]
  D --> E[Hidden Markov Models combined knowledge 4]
  D --> F[IBM pioneered statistical machine translation 5]
  D --> G[Deep learning reduced speech error rates 6]
  A --> H[Foundation models: new AI paradigm 7]
  H --> I[Microsoft's unified speech foundation model 8]
  H --> J[Transformer handles speech, translation, summarization 9]
  D --> K[Racial disparities in speech recognition 10]
  D --> L[Z-code improves low-resource language translation 11]
  D --> M[Text summarization uses encoder-decoder architecture 12]
  H --> N[Foundation models combine language, vision, speech 13]
  A --> O[Lessons: probabilistic frameworks, foundation models, transformers 14]
  A --> P[Computer vision: 2D/3D signals, interpretation, tasks 15]
  P --> Q[Florence: Microsoft's computer vision foundation model 16]
  Q --> R[Florence uses transformers, supervised, self-supervised learning 17]
  Q --> S[Florence outperforms in 43/44 vision benchmarks 18]
  Q --> T[Florence classifies 400K open-ended concepts 19]
  Q --> U[Florence enables open-ended visual search 20]
  Q --> V[Florence + GPT-3 generates creative stories 21]
  Q --> W[Florence searches photos by visual concepts 22]
  Q --> X[Florence excels in segmentation, human matting 23]
  Q --> Y[Self-supervised learning improves Florence's segmentation 24]
  Q --> Z[Encoder-decoder architecture for image captioning 25]
  Q --> AA[Florence infers implicit attributes in captions 26]
  Q --> AB[Florence powers accessibility tools like Seeing AI 27]
  Q --> AC[Florence achieves superhuman visual question answering 28]
  A --> AD[Multimodal AI needs embodied real-world experiences 29]
  A --> AE[Speaker fielded questions, offered further discussion 30]
  class B,AB accessibility
  class C,H,N,AD ai
  class D,E,F,G,I,J,K,L,M speech
  class P,Q,R,S,T,U,V,W,X,Y,Z,AA,AC vision
  class AD multimodal

Resume:

1.- Accessibility features like live captioning in PowerPoint enable everyone to follow along, even with strong accents.

2.- AI can now identify objects and actions in a video, translate them, generate a text summary, and have an avatar narrate it.

3.- Over the past 50 years, there have been major breakthroughs in speech recognition, language understanding, and machine translation.

4.- Hidden Markov Models provided a probabilistic framework to combine acoustic, phonetic and language knowledge for speech recognition in the 1970s.
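The probabilistic combination the HMM framework enables can be sketched with the classic forward algorithm, which sums the probability of an observation sequence over all hidden state paths. The states, symbols, and probabilities below are purely illustrative, not from the talk.

```python
# Minimal HMM forward algorithm: P(observation sequence) under a toy model.
# All states, symbols, and probabilities here are made up for illustration.

def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) by summing over all hidden state paths."""
    # alpha[s] = P(obs[:t+1], state at time t == s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
            for s in states
        }
    return sum(alpha.values())

# Toy 2-state model: phone-like states "ae" and "t" emitting acoustic symbols.
states = ("ae", "t")
start_p = {"ae": 0.6, "t": 0.4}
trans_p = {"ae": {"ae": 0.7, "t": 0.3}, "t": {"ae": 0.4, "t": 0.6}}
emit_p = {"ae": {"a": 0.9, "t": 0.1}, "t": {"a": 0.2, "t": 0.8}}

p = forward(("a", "t"), states, start_p, trans_p, emit_p)  # ≈ 0.209
```

The same dynamic-programming recursion, with a max in place of the sum, yields the Viterbi decoder used to pick the most likely phone sequence.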

5.- In the 1990s, IBM applied similar statistical techniques from speech recognition to pioneer statistical machine translation.

6.- In the 2010s, deep learning replaced Gaussian mixture models in speech recognition, substantially reducing error rates.

7.- Foundation models, massive models trained on huge datasets for many tasks, have become a new AI paradigm.

8.- Microsoft created a unified foundation model for speech in 2017, covering many domains, tasks and languages in one model.

9.- A single transformer model can now do speech recognition, translation, summarization and more in many languages simultaneously.

10.- Despite progress, racial disparities still exist in speech recognition error rates, which the field is working to close.

11.- Z-code is a foundation model that uses monolingual and parallel data to improve low-resource language translation.

12.- Text summarization uses an encoder-decoder architecture similar to machine translation to condense documents into short summaries.
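At the heart of such encoder-decoder models is attention: each decoder step queries the encoder's states and takes a weighted sum of them. A minimal scaled dot-product attention sketch, with toy 2-dimensional vectors standing in for real encoder states:

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention: weight values by softmax(q·k / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of value vectors -> the context vector fed to the decoder.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Toy encoder states (keys == values here) and one decoder query.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention([1.0, 0.0], enc, enc)
```

In a full summarizer this runs once per generated token, letting the decoder condense long inputs by attending to the most relevant encoder positions.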

13.- Foundation models combining language, vision, speech etc. are an industry-wide trend across big tech companies.

14.- Three key lessons from speech & language AI are: 1) Probabilistic frameworks 2) Foundation models 3) Encoder-decoder transformers

15.- Computer vision faces challenges of 2D/3D signals, ambiguity of interpretation, and a diverse range of tasks.

16.- Florence is a computer vision foundation model developed by Microsoft, trained on 1 billion images.

17.- Florence uses a Swin transformer image encoder and a transformer text encoder, combining supervised and self-supervised contrastive learning.
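The contrastive objective behind such paired image/text encoders can be sketched as a symmetric InfoNCE loss: matched image-text pairs sit on the diagonal of a similarity matrix and are pushed toward high probability in both directions. The embeddings and temperature below are hypothetical, not Florence's actual values.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched image/text pairs lie on the diagonal."""
    imgs = [normalize(v) for v in img_emb]
    txts = [normalize(v) for v in txt_emb]
    # Cosine-similarity logits scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

    def ce(rows):  # cross-entropy with the correct pair on the diagonal
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (ce(logits) + ce(cols))

# Toy batch of two aligned pairs: the loss should be small.
loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
```

Swapping the text rows so pairs no longer match drives the loss up, which is exactly the signal that pulls matching images and captions together in the shared space.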

18.- Florence outperforms state-of-the-art models on 43 out of 44 computer vision benchmarks, even in zero-shot settings.

19.- Unlike the 22K labels of ImageNet, Florence can classify and caption images with 400K open-ended concepts.

20.- Florence uses semantic language understanding to enable open-ended visual search beyond predefined classification labels.
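Open-ended search of this kind reduces to nearest-neighbor ranking in a shared embedding space: embed the free-text query, then sort images by cosine similarity. The vectors below are hypothetical stand-ins for what a Florence-like model might produce.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_emb, photo_embs, top_k=2):
    """Rank photos by cosine similarity to the query in a shared space."""
    ranked = sorted(photo_embs, key=lambda name: cosine(query_emb, photo_embs[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical photo embeddings; no captions or labels are involved.
photos = {
    "beach.jpg":  [0.9, 0.1, 0.0],
    "dog.jpg":    [0.1, 0.9, 0.1],
    "sunset.jpg": [0.8, 0.0, 0.3],
}
results = search([1.0, 0.0, 0.1], photos)  # embedding of a coastal-scene query
```

Because the query is embedded rather than matched against a label set, any phrase the text encoder understands becomes a valid search term.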

21.- Combining Florence and GPT-3 allows generating creative stories about images that go beyond literal description.

22.- Florence enables searching personal photos by visual concepts without relying on captions or user signals.

23.- Florence achieves state-of-the-art results on tasks like human matting and image segmentation, even for non-human objects.

24.- Self-supervised learning allows Florence to pseudo-label data and iteratively improve its own image segmentation.
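The pseudo-labeling loop can be illustrated with a deliberately tiny self-training sketch: a 1-D threshold "classifier" labels only the unlabeled points it is confident about, then refits on the enlarged set. The data, margin, and model are invented for illustration and bear no relation to Florence's actual segmentation pipeline.

```python
# Minimal self-training sketch: confident predictions become pseudo-labels,
# which grow the training set for the next round. All numbers are illustrative.

def fit_threshold(points):
    """Place the decision threshold midway between the two class means."""
    neg = [x for x, y in points if y == 0]
    pos = [x for x, y in points if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def self_train(labeled, unlabeled, margin=1.0, rounds=3):
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        t = fit_threshold(labeled)
        confident = [x for x in pool if abs(x - t) >= margin]
        if not confident:
            break
        # Pseudo-label confident points and move them to the labeled set.
        labeled += [(x, int(x > t)) for x in confident]
        pool = [x for x in pool if abs(x - t) < margin]
    return fit_threshold(labeled)

t = self_train([(0.0, 0), (4.0, 1)], [0.5, 1.0, 3.0, 3.5])
```

The confidence margin is what keeps the loop from amplifying its own mistakes; real pipelines use much stronger filters, but the iterate-on-your-own-predictions structure is the same.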

25.- An encoder-decoder architecture allows Florence to excel at image captioning, including for text within images.

26.- Florence's image captioning goes beyond literal description to infer implicit attributes like player jersey letters.

27.- Florence powers accessibility tools like Seeing AI that help vision-impaired users interpret objects in photos.

28.- Florence achieves super-human performance on benchmarks like text-based image captioning and visual question answering.

29.- Multimodal AI combining vision, language, speech etc. still has room for advancement by learning from real-world embodied experiences.

30.- The speaker fielded audience questions and offered to discuss further after the session concluded due to time constraints.

Knowledge Vault built by David Vivancos 2024