key: cord-0660593-wabv7pp4 authors: Wang, Hao; Liao, Junchao; Cheng, Tianheng; Gao, Zewen; Liu, Hao; Ren, Bo; Bai, Xiang; Liu, Wenyu title: Knowledge Mining with Scene Text for Fine-Grained Recognition date: 2022-03-27 journal: nan DOI: nan sha: 44b4abcaacab279c59c32a0943b49b25579bc862 doc_id: 660593 cord_uid: wabv7pp4

Recently, the semantics of scene text has been proven to be essential in fine-grained image classification. However, the existing methods mainly exploit the literal meaning of scene text for fine-grained recognition, which might be irrelevant when it is not significantly related to objects/scenes. We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text and enhances its semantics and correlation with the image to refine the image representation. Unlike the existing methods, our model integrates three modalities for fine-grained image classification: visual feature extraction, text semantics extraction, and the correlation of background knowledge. Specifically, we employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification. Experiments on two benchmark datasets, Con-Text and Drink Bottle, show that our method outperforms the state of the art by 3.72% mAP and 5.39% mAP, respectively. To further validate the effectiveness of the proposed method, we create a new dataset on crowd activity recognition for evaluation. The source code and new dataset of this work are available at https://github.com/lanfeng4659/KnowledgeMiningWithSceneText.

As a significant carrier, text conveys the information, knowledge, and emotions of human beings. Text in natural scene images contains sophisticated semantic information that can be used in many vision tasks such as image classification, visual search, and image-based question answering. Several approaches [2, 15, 22, 23, 26, 39] have been proposed to incorporate the semantic cues of scene text for image classification or retrieval and have achieved significant performance improvements. These methods follow a general pipeline that first spots the text with a scene text reading system, then converts the spotted words into text features and combines them with image features for the subsequent task.

* Authors contributed equally. † Corresponding author.

This paper explores how to dig deeper into background knowledge and extract the contextual information of scene text for the fine-grained image classification task. Unlike document text, in our observation, natural scene text is often sparse, appearing as a few keywords rather than complete sentences. Moreover, these few keywords may be vague and give no clue to the classification model when their semantic cues are not directly related to the precise meaning that the image conveys. As shown in Fig. 1 (a) and (b), the literal meaning of the keyword "Soda" explicitly expresses that the bottles in the two images belong to the category Soda despite their intra-class visual variance. However, we can hardly understand the object in Fig. 1 (c) by solely fetching the semantic cues of scene text. To understand such an image reliably, acquiring more relevant contextual knowledge about it is crucial. Therefore, we explore how to mine extra background knowledge and contextual information to enhance the correlation between scene text and the image.
For example, the table in Fig. 1 (d) exhibits the related information or knowledge of the scene text embodied in (c). The description of the entity Leninade indicates that it is a Soda beverage bottle. Thus, the knowledge extracted in this manner complements the literal meaning of the raw text and reduces the semantic loss caused by using the literal meaning of scene text only. Specifically, after extracting the text from the image with a scene text reading system [20, 40], we retrieve relevant knowledge from knowledge bases (e.g., WordNet [25] and Wikipedia) that store rich human-curated knowledge with all possible correlations to the target. As shown in Fig. 1 (d), the possible entities (e.g., party and political party) can be extracted for the text instance "party" from the knowledge bases. However, not all the retrieved contextual knowledge necessarily provides helpful semantic cues for understanding the visual contents. To separate relevant contextual information from irrelevant information, we design an attention module that focuses on the knowledge most pertinent to the semantics of objects or scenes.

We evaluate the performance of our method on two public benchmark datasets, Drink Bottle [2] and Con-Text [16]. The results demonstrate that using the contextual knowledge behind scene text can significantly improve the performance of fine-grained image classification models. To further prove the effectiveness of our method, we develop a new dataset consisting of 21 categories and 8785 natural images. The dataset mainly focuses on crowd activity, and most images contain multiple scene text instances. To the best of our knowledge, the existing crowd activity datasets do not contain scene text instances. However, everyday human activities are highly related to the presence of scene text, for example, in processions, exhibitions, press briefings, and sales campaigns. This dataset will be a valuable asset for exploring the role of scene text in crowd activity analysis.

In this paper, we propose a method that mines the contextual knowledge behind scene text to improve the performance of the multi-modality understanding task. To this end, we design a deep-learning-based architecture that combines features from three modalities, including visual contents, scene text, and knowledge, for fine-grained image recognition. Our method achieves significant improvements and can be applied to other tasks beyond fine-grained image classification, such as visual grounding [33] and text-visual question answering [3]. In addition, we propose a new dataset in which each image contains multiple scene text instances, which promotes the study of multi-modal crowd activity analysis.

The task of fine-grained image classification needs to distinguish images with subtle visual differences among object classes in some domains, such as animal species [12, 18], plant species [24], and man-made objects [19]. Previous methods [6, 10] classify objects with only visual cues and aim at finding a discriminative image patch. Recently, some approaches have shown a growing interest in combining textual cues with visual cues for this task. Movshovitz et al. [26] were the first to propose leveraging scene text for the fine-grained image classification task by using the visual cues of scene text. However, extracting robust visual cues of scene text is challenging due to the blur and occlusion of text instances.
Karaoglu et al. [15] employ the textual cues of scene text as a discriminative signal and combine them with visual features obtained by GoogLeNet [38] to distinguish places of business. To fully exploit the complementarity of visual information and textual cues, several methods [2, 22] propose to fuse features of the two modalities with an attention module. Bai et al. [2] propose an attention mechanism to select textual features from the word embeddings of recognized words. To overcome optical character recognition errors, Mafla et al. [22] leverage the PHOC [1] representation to construct a bag of textual words along with the Fisher vector [29] that models the morphology of text. Despite the promising progress, the existing methods exploit the literal meaning of scene text and overlook the meaningful human-curated knowledge behind the text.

Pre-trained language models such as ELMo [30] and BERT [8] are optimized to predict either the next word or some masked words in a given sequence. Petroni et al. [32] find that pre-trained language models, such as BERT, can recall factual and commonsense knowledge. Such knowledge is stored implicitly in the parameters of the language model and is useful for downstream tasks such as visual question answering [17]. This knowledge is usually obtained either from the latent context representations produced by the pre-trained model or by using the parameters of the pre-trained model to initialize a task-specific model for further fine-tuning. To make language models better aware of human-curated knowledge, some works [31, 34] explicitly integrate the knowledge in knowledge bases into the pre-trained language model. In our method, we employ both BERT [8] and KnowBert [31] as knowledge-aware language models and apply them to extract knowledge features. Although previous methods [36] extract knowledge features from sentences for vision-language tasks, they require the annotation of image-text pairs.

As shown in Fig. 2, the proposed network accepts as input an image, a knowledge base, and scene text spotted by a scene text reading system such as [20, 40]. The feature extraction part of our framework consists of three branches: the visual feature extraction branch, the knowledge extraction branch for retrieving relevant knowledge, and the knowledge-enhanced feature branch. In our method, we employ ViT [9] to extract the global visual features of the input image. We mainly detail the knowledge extraction branch, the knowledge-enhanced feature branch, and the visual-knowledge attention component in the following subsections.

The goal of the knowledge extraction branch is to extract relevant knowledge from Wikipedia and embed it into features. Such knowledge is stored via entities in a knowledge base, and relevant entities can be queried by scene text instances in our method. However, most text instances can map to multiple entities due to the uncertainty of the meaning of the text. For example, the text "apple" can denote either the fruit apple or the Apple company. This requires an entity candidate selector that takes a sentence as input and returns a list of C potential entities. Inspired by [11], we use an entity prior for entity candidate selection. The prior is the probability of a text instance referring to an entity, computed by averaging hyperlink count statistics from Wikipedia, a large Web corpus [37], and the YAGO dictionary [14]. As depicted in Fig. 3, we first combine all scene text instances into a sentence according to the spotting order. Then, the tokens of this sentence are obtained as in BERT. The entity candidate selector generates the top C entity candidates of each text instance based on the prior. Finally, the entity embeddings are obtained via the precomputed entity encoder in KnowBert. Specifically, the entity encoder adopts a skip-gram-like objective to learn 300-dimensional embeddings of Wikipedia page titles from Wikipedia descriptions. As a result, such entity embeddings encode the factual knowledge mined from Wikipedia descriptions.
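A minimal sketch of this prior-based candidate selection and embedding lookup is given below. The names candidate_prior, entity_embeddings, and TOP_C, as well as the table sizes, are hypothetical placeholders standing in for the precomputed KnowBert resources described above; this illustrates the retrieval step under those assumptions, not the released implementation.

```python
import torch

TOP_C = 30  # number of candidate entities kept per text instance (assumed value)

# Hypothetical stand-ins for the precomputed resources: a prior P(entity | mention)
# aggregated from Wikipedia hyperlink statistics, a large Web corpus, and the YAGO
# dictionary, plus 300-d entity embeddings learned from Wikipedia page titles and
# descriptions with a skip-gram-like objective.
candidate_prior = {  # mention -> list of (entity_id, prior probability)
    "party": [(101, 0.62), (102, 0.31), (103, 0.07)],  # e.g. party, political party, ...
}
entity_embeddings = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=300)


def retrieve_entity_candidates(spotted_words):
    """Return, for each spotted text instance, up to TOP_C candidate entity ids,
    their prior scores, and their 300-d entity embeddings (None if the word
    links to no entity)."""
    candidates = []
    for word in spotted_words:
        ranked = sorted(candidate_prior.get(word.lower(), []),
                        key=lambda x: x[1], reverse=True)[:TOP_C]
        if not ranked:
            candidates.append(None)
            continue
        ids = torch.tensor([eid for eid, _ in ranked])
        priors = torch.tensor([p for _, p in ranked])
        candidates.append({"ids": ids,
                           "priors": priors,
                           "embeddings": entity_embeddings(ids)})  # (C', 300)
    return candidates


# The spotted instances are also joined in spotting order into one sentence and
# tokenized by the BERT tokenizer; candidate retrieval itself only needs the words.
cands = retrieve_entity_candidates(["leninade", "party", "soda"])
```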
This branch aims at using the retrieved entity embeddings to enhance the representations of text. The architecture is adapted from KnowBert, which incorporates knowledge bases into BERT by inserting a knowledge attention and recontextualization component (KARC) at a particular layer. Following KnowBert, we insert Wikipedia into the 10th layer of the BERT encoder. The brief pipeline of this branch is given in Fig. 2. In each encoder layer, a TransformerBlock uses the word piece representations H_{i-1} from the previous layer as the query, key, and value, allowing each vector to attend to every other vector and yielding H_i. The KARC is the key component for integrating the retrieved entity embeddings into H_i. Different from the one in KnowBert, the span width is restricted to 1 in our KARC; namely, entities whose names cover more than one text instance are ignored due to the sparsity of scene text.

The details of KARC are given in Fig. 4. The word piece representations H_i are first projected to H_i^p by a linear layer. The representations of those word pieces that link to at least one entity are contextualized into contextual word representations S_e by a TransformerBlock. Meanwhile, the C candidate entity representations of each token are averaged to form weighted entity embeddings F. Specifically, as KnowBert does, we disregard all candidate entities with scores below a fixed threshold and softmax-normalize the remaining scores to weight the corresponding candidate entity representations. Then, S_e is updated by adding the entity embeddings F to form the word-entity representations S'_e. S'_e is employed to recontextualize H_i^p with a TransformerBlock, where we substitute H_i^p for the query and S'_e for both the key and value. Finally, a residual connection fuses the recontextualized representations with H_i through a linear function g, for which a fully connected layer is employed in our method.

Generally, not all knowledge associated with the text in an image has semantic relations to the object or scene. Some retrieved knowledge may correlate strongly with the image, while other knowledge may not be relevant at all. Therefore, we design a visual-knowledge attention component (VKAC) that focuses on the knowledge most pertinent to the semantics of objects or scenes. The basic idea is to take the global visual feature f_v ∈ R^{1×D} as the query and retrieve those knowledge features that are highly similar to f_v from all knowledge features H ∈ R^{N×D}, where D is the feature dimension. Formally, given f_v and H, we first calculate their similarities, W = θ(f_v) φ(H)^T, where both θ and φ are single linear functions that project the features into a common feature space, and W ∈ R^{1×N} is the resulting similarity matrix. Then, W is used to weight the knowledge features, and the weighted features are fed to a residual connection block: the weighted knowledge features are passed through a linear function κ and fused by a residual connection, producing the attended knowledge features H_out ∈ R^{1×D}.

The classifier, consisting of a fully connected layer and a softmax layer, performs the classification task, taking as input the concatenation of the global visual features and the knowledge-enhanced features. The objective function is the cross-entropy loss, L = -Σ_{m=1}^{M} y_m log p_m, where M is the number of categories, p_m is the probability of predicting the sample as the m-th category, and y_m is the associated label.
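The PyTorch sketch below shows one way to realize the VKAC and this classification head. It is a minimal illustration under stated assumptions: the class names are ours, θ, φ, and κ are single linear layers, the similarity scores W are softmax-normalized before weighting (the text only says W is used for weighting), the residual connection adds the weighted knowledge features to κ's output, and the feature width of 768 in the usage lines merely matches common ViT/BERT dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualKnowledgeAttention(nn.Module):
    """VKAC sketch: the global visual feature f_v queries the N knowledge
    features H; the similarities weight H, and a residual connection fuses
    the weighted features into the attended knowledge feature H_out."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)  # projects the visual query f_v
        self.phi = nn.Linear(dim, dim)    # projects the knowledge features H
        self.kappa = nn.Linear(dim, dim)  # output projection

    def forward(self, f_v, H):
        # f_v: (1, D) global visual feature, H: (N, D) knowledge features
        w = self.theta(f_v) @ self.phi(H).t()   # (1, N) similarity matrix W
        w = F.softmax(w, dim=-1)                # assumed normalization of W
        attended = w @ H                        # (1, D) weighted knowledge features
        return self.kappa(attended) + attended  # assumed residual form


class KnowledgeMiningHead(nn.Module):
    """Classification head: concatenate the global visual feature with the
    attended knowledge feature, apply a fully connected layer, and train
    with the standard cross-entropy loss."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.vkac = VisualKnowledgeAttention(dim)
        self.fc = nn.Linear(2 * dim, num_classes)

    def forward(self, f_v, H, label=None):
        h_out = self.vkac(f_v, H)                          # (1, D)
        logits = self.fc(torch.cat([f_v, h_out], dim=-1))  # (1, M)
        if label is None:
            return logits.softmax(dim=-1)                  # class probabilities p_m
        return F.cross_entropy(logits, label)              # L = -sum_m y_m log p_m


# Usage with dummy features (D = 768, 12 knowledge features, 28 categories):
head = KnowledgeMiningHead(dim=768, num_classes=28)
loss = head(torch.randn(1, 768), torch.randn(12, 768), torch.tensor([3]))
```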
First, we introduce the datasets used in our experiments and the new dataset created by us. Then, the implementation details are given. Third, we evaluate our method on the proposed Crowd Activity dataset and compare it with state-of-the-art approaches. Last, we conduct ablation studies. Following most existing methods, we compare with previous methods under the mAP metric.

The Con-Text dataset was introduced by Karaoglu et al. [16] and is a subset of the ImageNet dataset [7]. It is constructed by selecting the sub-categories of "building" and "place of business", and consists of 24,255 images classified into 28 visually similar categories. The Drink Bottle dataset was presented by Bai et al. [2] and consists of various types of drink bottle images contained in the soft drink and alcoholic drink sets of the ImageNet dataset [7]. The dataset has 18,488 images divided into 20 categories.

All categories within the two existing datasets are about products or places of business. The textual cues of those categories are obvious, and most images can be understood from the apparent meaning of scene text rather than the knowledge behind it. Therefore, we create a new dataset that concentrates on the activities of crowds for the fine-grained image classification task, named the Crowd Activity dataset, as automatically understanding crowd activity is meaningful for social security. This dataset is newly collected; the images are mainly searched on the Internet and collected from streets with mobile phones. All images in this dataset contain at least one text instance. The categories come from activities of daily living and demonstrations stimulated by hot events in recent years. Specifically, this dataset consists of 21 categories and 8785 images in total. As shown in Fig. 5, the 21 categories broadly fall into two types: activities of daily living (i.e., celebrating Christmas, holding sports meeting, holding concert, celebrating birthday party, celebrity speech, teaching, graduation ceremony, picnic, press briefing, shopping, celebrating Thanksgiving day) and demonstrations (i.e., protecting animals, protecting environment, appealing for peace, Brexit, COVID-19, election, immigrant, respecting female, racial equality, mouvement des gilets jaunes).

Before training, we first extract scene text with Google OCR or E2E-MLT. Then, the model is trained in an end-to-end manner. For the data augmentation on images, we first randomly crop an image patch from the original image with a scale from 0.05 to 1.0 while keeping the aspect ratio in the range [0.75, 1.33]. Next, the image patch is resized to 224 × 224. Finally, we normalize the image by setting both the mean and the standard deviation to (0.5, 0.5, 0.5). As for training BERT and KnowBert, no data augmentation is used other than shuffling the order of scene text instances before grouping them into a sentence, as both BERT and KnowBert can overfit quickly when the input text is not abundant. We adopt AdamW [21] to optimize the whole network with an initial learning rate of 3e-5. The learning rate is warmed up for 500 iterations, and a cosine annealing warm restart strategy is adopted at the same time. All models are trained for 10 epochs.
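A sketch of this training recipe with torchvision transforms and PyTorch schedulers is given below. The stand-in model and the restart period T_0 are assumptions: the text fixes the optimizer, learning rate, warmup length, and scheduler family, but not the restart period.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import (CosineAnnealingWarmRestarts, LambdaLR,
                                      SequentialLR)
from torchvision import transforms

# Image augmentation described above: a random crop covering 5%-100% of the
# original image with aspect ratio in [0.75, 1.33], resized to 224x224, then
# normalized with mean = std = (0.5, 0.5, 0.5).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.05, 1.0), ratio=(0.75, 1.33)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

# Stand-in for the full three-branch network, only so the optimizer has
# parameters to manage in this sketch.
model = torch.nn.Linear(768, 21)

# AdamW with an initial learning rate of 3e-5, a 500-iteration linear warmup,
# and cosine annealing with warm restarts; T_0 is an assumed restart period.
optimizer = AdamW(model.parameters(), lr=3e-5)
warmup = LambdaLR(optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / 500))
cosine = CosineAnnealingWarmRestarts(optimizer, T_0=1000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[500])
```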
We conduct all experiments based on PyTorch [27]. The code for ResNet-152 [13] and ViT [9] is taken from [22] and the timm package [41], respectively. For both ResNet-152 and ViT, models pre-trained on ImageNet are used for fine-tuning. The implementations of BERT [8] and KnowBert are from the Hugging Face Transformers library [42] and [31], respectively. For BERT, we load the model pre-trained on BookCorpus [43] and English Wikipedia. In addition, we use torchtext, a PyTorch package, for GloVe [28] and fastText [4]. During testing, the shorter side of the image is resized to 224, and then a 224 × 224 patch is cropped from the image center. As for the spotted scene text, we keep the original spotting order for BERT and KnowBert.
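For completeness, the corresponding test-time preprocessing as a small torchvision sketch; reusing the training normalization here is our assumption, since the text does not state it explicitly.

```python
from torchvision import transforms

# Resize the shorter side to 224 and crop a 224x224 patch from the image center.
test_transform = transforms.Compose([
    transforms.Resize(224),      # shorter side -> 224, aspect ratio preserved
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # assumed
])
```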
We compare our method with several baseline methods on the proposed Crowd Activity dataset, including visual baselines (ResNet-152 and ViT), textual/knowledge baselines (fastText and KnowBert), and multi-modal baselines ([22] and [23]). We conduct two types of experiments using two different dataset settings: 1) the visual baseline and multi-modal baseline models are trained on all training images and tested on all testing images; 2) the textual/knowledge baseline models are trained and tested on the subset of images that contain spotted text. The textual cues used in [22] and [23] are from fastText. Tab. 1 displays the quantitative comparisons on the Crowd Activity dataset. Among previous methods, ViT achieves state-of-the-art performance, while our method outperforms ViT by 7.2% mAP. In particular, the improvement on the subset of demonstrations reaches more than 11.0% mAP, a higher gain than on activities of daily living. The reason is that the visual cues of those demonstration activities are incredibly subtle; for example, in most scenarios, protest marchers hold flags and slogans while walking on the street. Such subtle visual cues require valuable knowledge for better understanding those scenes. Thus, the performance improvement confirms the significance of scene text instances in datasets such as Crowd Activity for the robust classification of fine-grained images.

Bai et al. [2] take GoogLeNet [38] as the visual backbone, while the most recent state-of-the-art methods [22, 23] employ ResNet-152 [13]. For a fair comparison, we first evaluate our method with ResNet-152 and take E2E-MLT [5] as the text spotter. Then, we conduct experiments under the setting of ViT and Google OCR. As shown in Tab. 2, our model achieves the best performance on the three datasets. The method of [23] outperforms previous methods by using the features of general objects within images. However, our model surpasses it by 5.39% and 3.72% on the Drink Bottle and Con-Text datasets, respectively. The method of [22] does not use the information of general objects; consequently, our method achieves superior performance over [22] on the two public datasets. The consistent outperformance of our proposed model over existing methods demonstrates the significance and effectiveness of integrating the knowledge behind scene text for better understanding the objects or scenes. To further validate the significance of introducing knowledge to this task, we compare our method with [22] and [23] on our Crowd Activity dataset, training their models with their officially released code. In Tab. 2, our method outperforms the method of [23] by 8.20% mAP, which further illustrates that mining knowledge is vital for fully understanding the meanings of natural images. As shown by the qualitative results in Fig. 6, the proposed method can identify visually alike images on the Drink Bottle and Con-Text datasets. As illustrated in Sec. 4.3, the visual cues and the literal meaning of scene text are highly subtle on the Crowd Activity dataset; yet, our method still classifies those images very well.

As shown in Fig. 7, the ResNet-152 model mainly focuses on the visual contents. However, the ViT model captures the visual contents and also harvests textual cues from the image through its self-attention mechanism. Thus, the knowledge features provide complementary information that boosts the performance of ViT beyond solely exploiting the literal meaning of scene text.

The impact of knowledge-enhanced features. As mentioned before, a direct way to mine knowledge is to exploit the output features of the BERT encoder. As shown in Tab. 3, employing knowledge-enhanced features from BERT achieves significant improvements over typical word embedding features (GloVe/fastText). The ViT+BERT model surpasses the ViT+fastText model by 6.69%, 1.63%, and 4.63% on Con-Text, Drink Bottle, and Crowd Activity, respectively. This superior performance proves that the explicit knowledge in knowledge bases significantly enriches the semantics of scene text for understanding objects. Furthermore, unlike BERT, KnowBert explicitly introduces knowledge from a knowledge base into the model, and the experimental results show that the KnowBert model consistently outperforms the BERT model. Therefore, introducing the knowledge behind scene text into neural network feature learning enhances the understanding of natural images. As shown in Fig. 8, the employment of knowledge substantially improves the classification accuracy; for instance, the knowledge behind "PM2.5" tells that the third image is about the environment.

The integration of KARC alone in model A improves the performance on all datasets. Moreover, integrating VKAC on top of KARC in model B increases the recognition performance by 2.28%, 1.67%, and 1.29% mAP on the Con-Text, Drink Bottle, and Crowd Activity datasets, respectively. The experimental results demonstrate the effectiveness of fusing multi-modal features for this task.

Joint optimization. Integrating the processes of knowledge mining, feature extraction, and classification in a unified network makes it feasible to optimize them jointly. A jointly optimized model can achieve better performance than one with separate feature extraction and classification, as those processes are complementary to each other. To confirm this assumption, we first train the ViT and KnowBert models with image data and scene text, respectively, under the supervision of the classification task. Then, the classifier and VKAC are trained, taking as input the visual features and knowledge-enhanced features extracted from the pre-trained models. As reported in Tab. 5, the model trained in an end-to-end manner significantly outperforms the one trained separately, showing the necessity of integrating the knowledge mining process into the network.

In this paper, we have confirmed that the usage of the knowledge behind scene text can improve the performance of the fine-grained image classification task.
Experiments on the two benchmark datasets and the proposed Crowd Activity dataset have verified the effectiveness and efficiency of our method for product recognition and crowd activity analysis. In the future, we will further explore the usage of knowledge mining with scene text in other multi-modal fusion tasks, such as scene text visual question answering and visual grounding.

References
[1] Word spotting and recognition with embedded attributes
[2] Integrating scene text and visual appearance for fine-grained image classification
[3] Scene text visual question answering
[4] Enriching word vectors with subword information
[5] E2E-MLT - an unconstrained end-to-end method for multi-language scene text
[6] Differentiable patch selection for image recognition
[7] ImageNet: A large-scale hierarchical image database
[8] BERT: Pre-training of deep bidirectional transformers for language understanding
[9] An image is worth 16x16 words: Transformers for image recognition at scale
[10] Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition
[11] Deep joint entity disambiguation with local neural attention
[12] Exploiting temporal information for DCNN-based fine-grained object classification
[13] Deep residual learning for image recognition
[14] Robust disambiguation of named entities in text
[15] Words matter: Scene text for image classification and retrieval
[16] Con-text: Text detection using background connectivity for fine-grained object classification
[17] MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering
[18] Novel dataset for fine-grained image categorization: Stanford dogs
[19] 3D object representations for fine-grained categorization
[20] Mask TextSpotter v3: Segmentation proposal network for robust scene text spotting
[21] Decoupled weight decay regularization
[22] Fine-grained image classification and retrieval by combining visual and locally pooled textual features
[23] Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval
[24] Fine-grained visual classification of aircraft
[25] WordNet: A lexical database for English
[26] Ontological supervision for fine grained classification of street view storefronts
[27] PyTorch: An imperative style, high-performance deep learning library
[28] GloVe: Global vectors for word representation
[29] Fisher kernels on visual vocabularies for image categorization
[30] Deep contextualized word representations
[31] Knowledge enhanced contextual word representations
[32] Language models as knowledge bases
[33] Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models
[34] Knowledge-aware language model pretraining
[35] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[36] From strings to things: Knowledge-enabled VQA model that can read and reason
[37] A cross-lingual dictionary for English Wikipedia concepts
[38] Going deeper with convolutions
[39] Scene text retrieval via joint text detection and similarity learning
[40] All you need is boundary: Toward arbitrary-shaped text spotting
[41] PyTorch image models
[42] Transformers: State-of-the-art natural language processing
[43] Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

Acknowledgements. This work was supported by the National Natural Science Foundation of China (61733007).