Knowledge Vault 5/17 - CVPR 2016
You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
< Summary Image >

Concept Graph & Summary using Claude 3 Opus | ChatGPT-4o | Llama 3:

graph LR
  classDef yolo fill:#f9d4d4, font-weight:bold, font-size:14px
  classDef detection fill:#d4f9d4, font-weight:bold, font-size:14px
  classDef speed fill:#d4d4f9, font-weight:bold, font-size:14px
  classDef training fill:#f9f9d4, font-weight:bold, font-size:14px
  classDef future fill:#f9d4f9, font-weight:bold, font-size:14px
  A[You Only Look Once: Unified, Real-Time Object Detection] --> B[YOLO: real-time, unified object detection. 1]
  A --> C[Object detection: draw boxes, identify objects. 2]
  C --> D[Previous methods accurate but slow. 3]
  C --> E[Recent work sped up R-CNN. 4]
  B --> F[YOLO faster, slight accuracy tradeoff. 5]
  B --> G[Single neural network predicts detections. 6]
  G --> H[Image divided into SxS grid. 7]
  H --> I[Cells predict boxes, confidence, classes. 8]
  I --> J[Confidence reflects object presence, fit. 9]
  I --> K[Class probability map like segmentation. 10]
  I --> L[Confidence scores from probabilities, thresholds. 11]
  I --> M[Non-max suppression removes duplicates. 12]
  B --> N[Fixed output allows optimization. 13]
  B --> O[Network predicts detections simultaneously. 14]
  B --> P[Network trained end-to-end. 15]
  P --> Q[Ground truth assigned to cells. 16]
  Q --> R[Box predictions adjusted, confidence increased. 17]
  Q --> S[Confidence decreased for non-overlapping boxes. 18]
  Q --> T[Probabilities, coordinates not adjusted without objects. 19]
  P --> U[Pretrained on ImageNet, trained on detection. 20]
  B --> V[YOLO performs well, generalizes to artwork. 21]
  V --> W[YOLO outperforms DPM, R-CNN on artwork. 22]
  P --> X[YOLO trained on larger COCO dataset. 23]
  B --> Y[Video demo: real-time laptop detection. 24]
  Y --> Z[Detection fails on self-referential screen. 25]
  B --> AA[YOLO code open source, available. 26]
  AA --> AB[Future work: YOLO + XNOR networks. 27]
  AB --> AC[Goal: real-time detection on smaller devices. 28]
  B --> AD[YOLO frames detection as regression. 29]
  AD --> AE[Unlike sliding window, region proposal classifiers. 30]
  class A,B yolo
  class C,D,E,W detection
  class F,Y,AA,AB,AC,AD,AE speed
  class G,H,I,J,K,L,M,N,O training
  class P,Q,R,S,T,U,V,X training
  class Z future


1.- YOLO (You Only Look Once) is a real-time, unified object detection system.

2.- Object detection involves drawing boxes around objects in an image and identifying them.

3.- Previous object detection methods like DPM and R-CNN were accurate but very slow (14-20 seconds per image).

4.- Recent work focused on speeding up R-CNN, with Fast R-CNN (2s/image) and Faster R-CNN (140ms/image, 7 FPS).

5.- YOLO processes images much faster, at 45 FPS (22ms/image), with a small tradeoff in accuracy.

6.- It uses a single neural network to predict detections from full images in one evaluation instead of thousands.

7.- The image is divided into an SxS grid, with each cell predicting B bounding boxes, confidence for those boxes, and C class probabilities.
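This grid parameterization fixes the output size. A minimal Python sketch, using the PASCAL VOC values from the paper (S=7, B=2, C=20; these are configuration choices, not constants of the method):

```python
# YOLO output tensor shape for an SxS grid where each cell predicts
# B boxes (x, y, w, h, confidence = 5 numbers each) plus C class probabilities.
S, B, C = 7, 2, 20

depth = B * 5 + C            # 30 numbers per grid cell
output_shape = (S, S, depth)
print(output_shape)          # (7, 7, 30)
```

The fixed (7, 7, 30) tensor is what lets the whole pipeline be a single network with one regression output.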

8.- Bounding box confidence reflects if the box contains an object and how well the predicted box fits the object.
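In the paper, this confidence target is defined as Pr(Object) times the intersection over union (IOU) between the predicted and ground-truth boxes. A minimal IOU sketch (corner-format boxes are an assumption for illustration; YOLO itself predicts center/width/height):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7: overlap 1, union 7
```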

9.- Class probability map is like a coarse segmentation map, showing the probability of each class for objects in each cell.

10.- Multiplying class probabilities and bounding box confidence gives class-specific confidence scores for each box. Low-scoring boxes are thresholded out.
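The score-and-threshold step can be sketched for one cell; the confidences, probabilities, and threshold below are hypothetical numbers, not values from the paper:

```python
# One grid cell with 2 predicted boxes and 3 classes (hypothetical values).
box_confidences = [0.8, 0.1]
class_probs = [0.6, 0.3, 0.1]   # conditional class probabilities for the cell
threshold = 0.2

# Class-specific score = box confidence * conditional class probability.
scores = [[conf * p for p in class_probs] for conf in box_confidences]

# Keep only (box_index, class_index, score) triples above the threshold.
kept = [(b, c, s)
        for b, row in enumerate(scores)
        for c, s in enumerate(row)
        if s > threshold]
```

Here only the first box survives, with scores 0.48 and 0.24 for the first two classes; everything from the low-confidence second box falls below the threshold.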

11.- Non-max suppression removes duplicate detections, leaving the final detections for the image.
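A minimal greedy non-max suppression sketch (corner-format boxes and the 0.5 overlap threshold are illustrative assumptions, not the paper's exact post-processing code):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop remaining boxes
    that overlap it above iou_thresh, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two near-duplicate detections of one object plus one distant detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the duplicate (index 1) is suppressed
```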

12.- The fixed output size tensor allows the full detection pipeline to be expressed and optimized as a single network.

13.- The network predicts all detections simultaneously, incorporating global context about co-occurrence, relative size, and position of objects.

14.- The network is trained end-to-end to predict the full detection tensor from images.

15.- During training, each ground-truth box is assigned to the grid cell containing its center; that cell is responsible for predicting the box.

16.- Of the responsible cell's bounding box predictions, the one with the best overlap with the ground truth is adjusted toward it, and its confidence is increased.

17.- Confidence is decreased for bounding boxes that don't overlap with any objects.

18.- Class probabilities and box coordinates are not adjusted for cells without associated ground-truth objects.
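The assignment step above can be sketched in a few lines; S=7 matches the paper's PASCAL VOC setup, while the box center is a hypothetical value, not data from the paper:

```python
# Minimal sketch (not the authors' training code) of assigning a
# ground-truth box to the grid cell containing its center.
S = 7
gt_cx, gt_cy = 0.62, 0.30   # box center, normalized to [0, 1)

cell_col = int(gt_cx * S)   # column of the responsible cell -> 4
cell_row = int(gt_cy * S)   # row of the responsible cell    -> 2

# Target coordinates are expressed relative to that cell's origin,
# so the network regresses small offsets within [0, 1).
x_offset = gt_cx * S - cell_col
y_offset = gt_cy * S - cell_row
```

Only this cell's predictor receives coordinate and class gradients for the object, which is why the other cells' probabilities and coordinates are left untouched.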

19.- The network was pretrained on ImageNet and then trained on detection data with SGD and data augmentation.

20.- YOLO performs well on natural images with some mistakes. It generalizes well to artwork.

21.- YOLO outperforms DPM and R-CNN when trained on natural images and tested on artwork.

22.- YOLO was also trained on the larger Microsoft COCO dataset with 80 classes.

23.- The video demonstrates real-time detection on a laptop webcam, identifying objects like dogs, bicycles, plants, ties, etc.

24.- Detection breaks down if the laptop camera is pointed at its own screen, due to the recursive feedback of the display showing its own detections.

25.- YOLO's training, testing, and demo code is open source and available online.

26.- Future work includes combining YOLO with XNOR networks to develop a faster, more efficient version.

27.- The goal is to enable real-time object detection on less powerful hardware, such as CPUs and embedded devices.

28.- YOLO frames object detection as a regression problem, using features from the entire image to predict each bounding box.

29.- This is unlike sliding window and region proposal-based techniques that perform detection by applying a classifier multiple times.

30.- Predicting all bounding boxes simultaneously using features from across the image allows YOLO to learn contextual cues and still be fast.

Knowledge Vault built by David Vivancos 2024