Knowledge Vault 2/20 - ICLR 2014-2023
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, ICLR 2015 - Deep Structured Output Learning for Unconstrained Text Recognition

Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:

graph LR
  classDef unconstrained fill:#f9d4d4, font-weight:bold, font-size:14px;
  classDef models fill:#d4f9d4, font-weight:bold, font-size:14px;
  classDef joint fill:#d4d4f9, font-weight:bold, font-size:14px;
  classDef results fill:#f9f9d4, font-weight:bold, font-size:14px;
  classDef summary fill:#f9d4f9, font-weight:bold, font-size:14px;
  A[Max Jaderberg et al ICLR 2015] --> B[Unconstrained scene text recognition. 1]
  B --> C[Unseen word generalization. 3]
  B --> D[Constrained methods fail. 2]
  A --> E[Character sequence model. 4]
  E --> F[CNN word resizing. 6]
  E --> G[23 character classifiers. 5]
  A --> H[Bag-of-ngrams model. 7]
  H --> I[10,000-dim n-gram vector. 8]
  I --> J[Unique word representation. 9]
  A --> K[Joint structured model. 10]
  K --> L[Graph-based word path. 11]
  K --> M[Hinge loss training. 12]
  A --> N[Synthetic data training. 14]
  K --> O[Pretrained models finetuned. 15]
  A --> P[Real-world dataset results. 16]
  P --> Q[Joint outperforms character model. 16]
  P --> R[N-grams correct errors. 17]
  A --> S[Unseen word experiments. 18]
  S --> T[Joint generalizes better. 19]
  A --> U[Unconstrained state-of-the-art. 20]
  A --> V[Constrained competitive results. 21]
  V --> W[Rescoring dictionary words. 21]
  A --> X[CNN models combined. 22]
  X --> Y[Joint improves accuracy. 23]
  Y --> Z[Strong unseen generalization. 23]
  X --> AA[Competitive constrained performance. 23]
  A --> AB[Whole word context. 24]
  A --> AC[N-grams still beneficial. 25]
  A --> AD[Large synthetic CNN training. 26]
  A --> AE[N-grams boost performance. 27]
  A --> AF[CRF higher-order extension. 28]
  A --> AG[Joint excels in both scenarios. 29]
  A --> AH[Author invites discussion. 30]
  class B,C,D unconstrained;
  class E,F,G,H,I,J models;
  class K,L,M,N,O joint;
  class P,Q,R,S,T,U,V,W results;
  class X,Y,Z,AA,AB,AC,AD,AE,AF,AG,AH summary;

Resume:

1.-The paper focuses on unconstrained scene text recognition - recognizing text in images without being limited to a fixed lexicon or dictionary.

2.-Current state-of-the-art methods perform constrained text recognition, selecting the output from a fixed set of words (a lexicon), so they fail on words outside that set.

3.-Unconstrained text recognition is harder as the search space is much larger, but allows generalization to unseen words (zero-shot recognition).

4.-The first model is a character sequence model that uses a CNN to classify each character in the word independently.

5.-The character sequence model has 23 classifiers (one per character position, up to a maximum word length of 23), each predicting 1 of 37 classes (A-Z, 0-9, null).

6.-Input images are resized to a fixed size of 32x100 pixels. The model imposes no explicit language model on the output; a sketch of this architecture follows.
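
A minimal PyTorch sketch of this kind of fixed-length character classifier: a shared CNN over the 32x100 word image feeds 23 independent 37-way classifiers. The trunk and layer sizes here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

NUM_POSITIONS = 23   # maximum word length handled by the model
NUM_CLASSES = 37     # 26 letters + 10 digits + a null class for padding

class CharSequenceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared convolutional trunk over the 1x32x100 grey-scale word image
        # (illustrative layer sizes, not the paper's).
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(256 * 4 * 12, 1024), nn.ReLU())
        # One independent classifier per character position.
        self.heads = nn.ModuleList(
            [nn.Linear(1024, NUM_CLASSES) for _ in range(NUM_POSITIONS)]
        )

    def forward(self, x):                       # x: (batch, 1, 32, 100)
        h = self.fc(self.features(x))
        # Per-position class logits: (batch, 23, 37).
        return torch.stack([head(h) for head in self.heads], dim=1)

logits = CharSequenceModel()(torch.randn(2, 1, 32, 100))  # shape (2, 23, 37)
```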

7.-The second model represents words by the set of character n-grams (up to 4-grams) contained in the string.

8.-The bag-of-ngrams model outputs a 10,000-dimensional vector giving, for each of the 10,000 most common English n-grams, the probability that it appears in the word.

9.-The 10,000-dim binary vector representation is nearly always unique for English words. The bag-of-ngrams model itself is a 7-layer CNN.
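
A small Python sketch of how such a bag-of-n-grams target vector can be built for a word. The helper names and the `ngram_to_index` mapping are assumptions; in the paper the 10,000 entries are the most frequent English n-grams.

```python
from itertools import chain

def word_ngrams(word, max_n=4):
    """All character n-grams (1- up to 4-grams) contained in the word."""
    return set(chain.from_iterable(
        (word[i:i + n] for i in range(len(word) - n + 1))
        for n in range(1, max_n + 1)
    ))

def encode_ngrams(word, ngram_to_index):
    """Binary vector over the n-gram vocabulary: 1 where the n-gram occurs in the word."""
    vec = [0.0] * len(ngram_to_index)
    for g in word_ngrams(word.lower()):
        if g in ngram_to_index:
            vec[ngram_to_index[g]] = 1.0
    return vec

# Toy vocabulary standing in for the 10,000 most common English n-grams.
vocab = {g: i for i, g in enumerate(["s", "p", "sp", "spi", "res", "ing"])}
print(encode_ngrams("Spires", vocab))  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```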

10.-The two complementary word representations/models can be combined into a single joint model formulated as structured output learning.

11.-Character classifier outputs are nodes in a graph, with a word being a path through the graph. N-gram scores are edge scores.

12.-The joint model is trained to maximize the score of the correct word path compared to the highest scoring incorrect path.

13.-A hinge loss is used, which can be backpropagated through the beam search and CNNs to jointly optimize the full model.
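
A minimal Python sketch of the path score and structured hinge objective described in points 11-13. Plain lists and dicts stand in for the two CNNs' outputs, a precomputed candidate list stands in for the beam search, and the margin value is an assumption.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # the 37th (null) class is ignored here

def path_score(word, char_logits, ngram_scores, max_n=4):
    """Score of one word-path: unary character scores plus n-gram edge scores."""
    s = sum(char_logits[i][ALPHABET.index(c)] for i, c in enumerate(word))
    for n in range(2, max_n + 1):
        for i in range(len(word) - n + 1):
            s += ngram_scores.get(word[i:i + n], 0.0)
    return s

def structured_hinge(gt_word, candidates, char_logits, ngram_scores, margin=1.0):
    """Hinge loss: the correct path must beat the best incorrect path by a margin.
    `candidates` is assumed to contain at least one word different from `gt_word`."""
    gt = path_score(gt_word, char_logits, ngram_scores)
    best_wrong = max(path_score(w, char_logits, ngram_scores)
                     for w in candidates if w != gt_word)
    return max(0.0, margin + best_wrong - gt)
```

In the actual model both score sources are CNN outputs, so this loss can be backpropagated through them to optimize the full system end-to-end.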

14.-All models are trained purely on synthetically generated realistic data, but evaluated on real-world text image datasets.

15.-The character sequence and n-gram models are pretrained independently, then the joint model is finetuned after initializing from the pretrained weights.

16.-On real-world datasets, the jointly trained model outperforms the character sequence model alone, e.g. 90% vs 86% on ICDAR 2003.

17.-Examining specific examples shows how the n-gram scores help correct errors made by the character sequence model alone.

18.-Experiments demonstrate the ability to recognize unseen words by training on 45K words and testing on a different 45K words.

19.-The joint model generalizes much better to unseen words than the character sequence model (89% vs 80%) due to shared n-grams.

20.-In the unconstrained setup, the joint model sets a new state-of-the-art, although still lower than constrained recognition models.

21.-When constraining the joint model by rescoring a short list of dictionary words, it is competitive with state-of-the-art constrained models.
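
Constrained recognition then reduces to rescoring a short lexicon with the same joint score. A sketch, reusing path_score from the sketch after point 13:

```python
def recognize_constrained(lexicon, char_logits, ngram_scores):
    """Pick the dictionary word whose path through the joint model scores highest."""
    return max(lexicon, key=lambda w: path_score(w, char_logits, ngram_scores))
```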

22.-In summary, two complementary CNN models were presented and combined into a joint model trained with a structured output loss.

23.-The joint model improves accuracy over the individual models, demonstrates strong generalization to unseen words, and is competitive in constrained setups.

24.-Using the entire word image allows implicitly modeling things like font and lighting consistency across the characters in each word.

25.-Previous work has used wide context windows, but the inclusion of higher-order n-gram scores still provides significant benefits.

26.-Training on large amounts of synthetic data allows for training much larger CNNs than previous work in this area.

27.-Without the n-gram scores (only unary terms), the model still underperforms despite integrating context across the entire word image.

28.-The model can be seen as an extension of conditional random fields to higher-order terms (up to 4-grams).
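
In CRF terms the joint score can be written as follows (the notation is assumed here, matching the description in points 11 and 28):

```latex
S(w, x) = \sum_{i=1}^{|w|} \phi_i(w_i, x) + \sum_{g \in G_4(w)} \psi(g, x)
```

where the \phi_i are the character sequence CNN's per-position (unary) scores, G_4(w) is the set of character n-grams of w up to length 4, and \psi are the n-gram CNN's (higher-order) scores; training pushes S for the ground-truth word above the best incorrect path (point 13).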

29.-The joint model performs very well in both constrained and unconstrained text recognition scenarios.

30.-The author invites further questions and discussion after presenting the key ideas and results from the paper.

Knowledge Vault built by David Vivancos 2024