Concept Graph & Resume using Claude 3 Opus | Chat GPT4 | Gemini Adv | Llama 3:
Resume:
1.-The paper focuses on unconstrained scene text recognition - recognizing text in images without being limited to a fixed lexicon or dictionary.
2.-Current state-of-the-art methods perform constrained text recognition by choosing output from a fixed set of words, which fails on unseen words.
3.-Unconstrained text recognition is harder as the search space is much larger, but allows generalization to unseen words (zero-shot recognition).
4.-The first model is a character sequence model that uses a CNN to classify each character in the word independently.
5.-The character sequence model has 23 classifiers (one per character position), each predicting one of 37 classes (A-Z, 0-9, or a null "no character" class).
6.-Input images are resized to a fixed size of 32x100 pixels. The model imposes no strong language model.
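A minimal sketch, in PyTorch, of the architecture described in points 4-6: a shared CNN over the 32x100 word image feeding 23 independent 37-way classifiers, one per character position. The backbone layers and sizes below are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

NUM_POSITIONS = 23   # maximum word length handled by the model
NUM_CLASSES = 37     # A-Z, 0-9, plus a null "no character" class

class CharSequenceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy backbone; the paper's CNN is deeper.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(128 * 8 * 25, 512), nn.ReLU(),
        )
        # One independent linear classifier per character position.
        self.heads = nn.ModuleList(
            [nn.Linear(512, NUM_CLASSES) for _ in range(NUM_POSITIONS)]
        )

    def forward(self, x):                       # x: (B, 1, 32, 100) grayscale
        feats = self.backbone(x)
        return torch.stack([h(feats) for h in self.heads], dim=1)  # (B, 23, 37)

logits = CharSequenceModel()(torch.randn(4, 1, 32, 100))  # torch.Size([4, 23, 37])
```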
7.-The second model represents words by the set of character n-grams (up to 4-grams) contained in the string.
8.-The bag-of-ngrams model outputs a 10,000-dimensional vector indicating the probability of each of the 10,000 most common English n-grams.
9.-The 10,000-dim binary vector representation is nearly always unique for English words. The model architecture is a 7-layer CNN.
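A small sketch of the bag-of-n-grams target encoding from points 7-9: a word is mapped to a binary vector over a fixed vocabulary of frequent 1- to 4-grams. The helper names and the tiny corpus are assumptions for illustration; the paper uses the 10,000 most common English n-grams.

```python
from collections import Counter

def ngrams(word, max_n=4):
    # All contiguous 1- to 4-character substrings of the word.
    word = word.lower()
    return {word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def build_vocab(corpus, size=10000):
    # Keep the `size` most frequent n-grams seen across the corpus.
    counts = Counter(g for w in corpus for g in ngrams(w))
    return {g: i for i, (g, _) in enumerate(counts.most_common(size))}

def encode(word, vocab):
    # Binary indicator vector: 1 if the n-gram occurs in the word.
    vec = [0] * len(vocab)
    for g in ngrams(word):
        if g in vocab:
            vec[vocab[g]] = 1
    return vec

vocab = build_vocab(["spires", "spines", "pines"])
print(encode("spires", vocab))
```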
10.-The two complementary word representations/models can be combined into a single joint model formulated as structured output learning.
11.-Character classifier outputs are nodes in a graph, with a word being a path through the graph. N-gram scores are edge scores.
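A sketch of how a candidate word could be scored under the graph view in points 10-11: unary character scores at each position plus scores for every n-gram the word contains. The arrays below are random stand-ins for the two CNNs' outputs; the function name and null-class convention are assumptions for illustration.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # index 36 = null class

def word_score(word, char_scores, ngram_scores, max_n=4):
    """char_scores: (23, 37) array; ngram_scores: dict mapping n-gram -> score."""
    word = word.lower()
    score = 0.0
    # Unary terms: the character selected at each position of the path.
    for pos, ch in enumerate(word):
        score += char_scores[pos, ALPHABET.index(ch)]
    # Null class for unused trailing positions.
    for pos in range(len(word), char_scores.shape[0]):
        score += char_scores[pos, 36]
    # Higher-order terms: every 2- to 4-gram present in the word.
    for n in range(2, max_n + 1):
        for i in range(len(word) - n + 1):
            score += ngram_scores.get(word[i:i + n], 0.0)
    return score

char_scores = np.random.randn(23, 37)
ngram_scores = {"th": 1.2, "the": 2.0}
print(word_score("the", char_scores, ngram_scores))
```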
12.-The joint model is trained to maximize the score of the correct word path compared to the highest scoring incorrect path.
13.-A hinge loss is used, which can be backpropagated through the beam search and CNNs to jointly optimize the full model.
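A minimal sketch of the structured hinge loss in points 12-13, assuming PyTorch: the correct word's score must beat the best incorrect candidate (in practice found by beam search) by a margin. The `score` callable and the candidate list are stand-ins for the differentiable joint scorer and the beam-search output.

```python
import torch

def structured_hinge_loss(score, gt_word, candidates, margin=1.0):
    """score: word -> differentiable scalar; candidates: beam-search proposals."""
    gt_score = score(gt_word)
    rival_scores = torch.stack([score(w) for w in candidates if w != gt_word])
    # Zero once the ground truth beats the best rival by `margin`;
    # otherwise gradients push the correct path up and the rival down.
    return torch.clamp(margin + rival_scores.max() - gt_score, min=0.0)

# Toy usage with a trivially differentiable scorer.
w = torch.tensor(0.5, requires_grad=True)
score = lambda word: w * len(word)
loss = structured_hinge_loss(score, "spires", ["spine", "pines"])
loss.backward()
```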
14.-All models are trained purely on synthetically generated realistic data, but evaluated on real-world text image datasets.
15.-The character sequence and n-gram models are pretrained independently, then the joint model is finetuned after initializing from the pretrained weights.
16.-On real-world datasets, the jointly trained model outperforms the individual character sequence model, e.g. 90% vs 86% on ICDAR2003.
17.-Examining specific examples shows how the n-gram scores help correct errors made by the character sequence model alone.
18.-Experiments demonstrate the ability to recognize unseen words by training on 45K words and testing on a different 45K words.
19.-The joint model generalizes much better to unseen words than the character sequence model (89% vs 80%) due to shared n-grams.
20.-In the unconstrained setup, the joint model sets a new state of the art, although its accuracy remains below that of constrained recognition models.
21.-When constraining the joint model by rescoring a short list of dictionary words, it is competitive with state-of-the-art constrained models.
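A self-contained sketch of the lexicon rescoring in point 21: every word in a short dictionary is scored with the model and the best-scoring entry is returned. The toy scorer below uses only unary character terms; the real joint model also adds the n-gram scores.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # index 36 = null class

def joint_score(word, char_scores):
    # Unary-only stand-in for the trained joint model's word score.
    return sum(char_scores[i, ALPHABET.index(c)] for i, c in enumerate(word.lower()))

def recognize_constrained(lexicon, char_scores):
    return max(lexicon, key=lambda w: joint_score(w, char_scores))

char_scores = np.random.randn(23, 37)
print(recognize_constrained(["spires", "spines", "shires"], char_scores))
```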
22.-In summary, two complementary CNN models were presented and combined into a joint model trained with a structured output loss.
23.-The joint model improves accuracy over the individual models, demonstrates strong generalization to unseen words, and is competitive in constrained setups.
24.-Using the entire word image allows implicitly modeling things like font and lighting consistency across the characters in each word.
25.-Previous work has used wide context windows, but the inclusion of higher-order n-gram scores still provides significant benefits.
26.-Training on large amounts of synthetic data allows for training much larger CNNs than previous work in this area.
27.-Without the n-gram scores (i.e. using only the unary character terms), the model underperforms the full joint model, even though its CNN already integrates context across the entire word image.
28.-The model can be seen as an extension of conditional random fields to higher-order terms (up to 4-grams).
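One hedged way to write the scoring function of point 28 as a CRF-like energy (the notation is illustrative, not the paper's): unary potentials from the character classifiers plus higher-order potentials for every n-gram of order up to 4 contained in the word.

```latex
S(w \mid x) = \sum_{i=1}^{N} \phi_i(c_i, x)
            + \sum_{n=2}^{4} \sum_{i=1}^{|w|-n+1} \psi\!\left(c_i c_{i+1} \cdots c_{i+n-1},\, x\right)
```

Here \(\phi_i\) stands for the per-position character scores (N = 23 positions, nulls included) and \(\psi\) for the n-gram scores.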
29.-The joint model performs very well in both constrained and unconstrained text recognition scenarios.
30.-The author invites further questions and discussion after presenting the key ideas and results from the paper.
Knowledge Vault built by David Vivancos 2024