SlideImages: A Dataset for Educational Image Classification

David Morris, Eric Müller-Budack, Ralph Ewerth

Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_36

Abstract. In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which, however, mainly focus on photos with natural scene content. In addition, non-sensor-derived images such as illustrations, data visualizations, and figures are typically used to convey complex information or to explore large datasets, yet this kind of image has received little attention in computer vision. CNNs and similar techniques require large volumes of training data, and many document analysis systems are currently trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all of the actual educational images as a test dataset in order to ensure that approaches using this dataset generalize well to new educational images, and potentially to other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss the challenge of limited training data.

1 Introduction

Convolutional neural networks (CNNs) are making great strides in computer vision, driven by large datasets of annotated photos such as ImageNet [1]. However, many images relevant for information retrieval, such as charts, tables, and diagrams, are created with software rather than through photography or scanning.

There are several applications in information retrieval for a robust classifier of educational illustrations. Search tools might directly expose filters by predicted label, and natural language systems could choose images by type based on what information a user is seeking. Further analysis systems could extract additional information from an image to be indexed based on its class. In this case, we have classes such as pie charts and x-y graphs that indicate what type of information is in the image (e.g., proportions, or the relationship of two numbers) and how it is symbolized (e.g., angular size, or position along axes).

Most educational images are created with software and are qualitatively different from photos and scans. Neural networks designed and trained to make sense of the noise and spatial relationships in photos are sometimes suboptimal for born-digital images and educational images in general. Educational images and illustrations are also under-served in training datasets and challenges. Competitions such as the Contest on Robust Reading for Multi-Type Web Images [2] and ICDAR DeTEXT [3] have shown that these tasks are difficult and unsolved. Research on text extraction, such as Morris et al. [4] and Nayef and Ogier [5], has shown that even noiseless born-digital images are sometimes better analyzed with neural networks than with handcrafted features and heuristics. Born-digital and educational images therefore need further benchmarks on challenging information retrieval tasks in order to test generalization. In this paper, we introduce SlideImages, a dataset that targets images from educational presentations.
Most of these educational illustrations are created with diverse software, so the same symbols are drawn in different ways in different parts of an image. As a result, we expect that effective synthetic datasets will be hard to create, and that methods effective on SlideImages will generalize well to other tasks with similar symbols. SlideImages contains eight classes of image types (e.g., bar charts and x-y plots) and a class for photos. The labels were created with information extraction for image summarization in mind.

In the rest of this paper, we discuss related work in Sect. 2, details about our dataset and baseline method in Sect. 3, and results of our baseline method in Sect. 4, and we conclude with a discussion of potential future developments in Sect. 5.

2 Related Work

Several prior information retrieval publications use, or could use, document figure classification. Charbonnier et al. [6] built a search engine with image type filters. Aletras and Mittal [7] automatically label topics with images. Kembhavi et al.'s [8] diagram analysis assumes that the input figure is a diagram. Hiippala and Orekhova extended that dataset by annotating it in terms of Rhetorical Structure Theory, which implies that the same visual features communicate the same semantic relationships. De Herrera et al. [9] classify image types in order to filter search results for medical professionals. We intend to use document figure classification as a first step in automatic educational image summarization. A similar idea is followed by Morash et al. [10], who built one template for each type of image, manually classified images and filled out the templates, and suggested automating the steps of that process. Moraes et al. [11] mentioned the same idea for their SIGHT (Summarizing Information GrapHics Textually) system.

A number of publications on document image classification, such as Afzal et al. [12] and Harley et al. [13], use the RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset, which covers scanned documents. While document scans and born-digital educational illustrations differ materially in appearance, these papers show that the utility of deep neural networks is not limited to scene image tasks (Fig. 1).

A classification dataset of scientific illustrations was created for the NOA project [14]. However, that dataset is not publicly available and does not draw as many distinctions between types of educational illustrations. Jobin et al.'s DocFigure [15] consists of 28 categories of illustrations extracted from scientific publications, totaling 33,000 images. Techniques that work well on DocFigure [15] do not generalize to the educational illustrations in our use case scenarios (as we also show in Sect. 4.2). Different intended uses and software cause sufficient differences in illustrations that a dataset specifically of educational illustrations is needed.

3 Dataset and Baseline Method

CNNs and related techniques are heavily data-driven: an approach consists not only of an architecture and an optimization technique, but also of the data used for that optimization. In our case, we consider the dataset our main contribution. When building our taxonomy, we chose classes such that the members of each class share the same types of salient features, and appropriate summaries would also be similar in structure. Our classes are also all common in educational materials.
3.1 Dataset Assembly

Beyond the requirements of our taxonomy, our dataset needed to be representative of common educational illustrations in order to fit real-world applications, and legally shareable in order to promote research on educational image classification. Educational illustrations are created by a variety of communities with varying expertise, techniques, and tools, so drawing a dataset from a single source may eliminate certain variables in educational illustration. To identify these variables, we kept our training and test data sources separate.

We assembled training and validation datasets from various sources of open access illustrations. Bar charts, x-y plots, maps, photos, pie charts, slide images, table images, and technical drawings were manually selected by a student assistant (supported by the main author) using the Wikimedia Commons image search for related terms. We manually selected graph diagrams, which we also call node-edge diagrams or "structured diagrams," from Kembhavi et al.'s [8] AllenAI Diagram Understanding (AI2D) dataset; not all AI2D images contain graph edges [8]. The training dataset of SlideImages consists of 2,938 images and is intended for fine-tuning CNNs, not for training from scratch. The SlideImages test set is derived from a snapshot of the SlideWiki open educational resource platform (https://slidewiki.org/) datastore obtained in 2018. From that snapshot, two annotators manually selected and labeled 691 images. Our data are available at our code repository: https://github.com/david-morris/SlideImages/.

3.2 Baseline System

The SlideImages training dataset is small compared to datasets like ImageNet [1], with over 14 million images, RVL-CDIP [13] with 400,000 images, or even DocFigure [15] with 33,000 images. Much of our methodology is shaped by the need to confront the challenges of a small dataset. In particular, we aim to avoid overfitting: the tendency of a classifier to identify individual images and patterns specific to the training set rather than the desired semantic concepts.

For pre-training, a large, diverse dataset is required that contains a large proportion of educational and scholarly images. We pre-trained on a dataset of almost 60,000 images labeled by Sohmen et al. [14] (the NOA dataset), provided by the authors on request. The images are categorized as composite images, diagrams, medical imaging, photos, or visualizations/models.

To mitigate overfitting, we used data augmentation: distorting an image while keeping its relevant traits. We used image stretching, brightness scaling, zooming, and color channel shifting, as shown in our source code, and we applied similar augmentation for pre-training and training. We also added dropout with a rate of 0.1 on the extracted features before the fully connected and output layers.

We use MobileNetV2 [16] as our network architecture, chosen as a compromise between a small number of parameters and performance on ImageNet. Intuitively, a smaller parameter space implies a model with more bias and lower variance, which is better suited to smaller datasets. We initialized our weights from an ImageNet model and pre-trained for a further 40 epochs, with early stopping, on the NOA dataset using the Adam (adaptive moment estimation) optimizer [17]. This additional pre-training was intended to cause the lower layers of the network to extract more features specific to born-digital images. We then trained for 40 epochs with Adam and a learning rate schedule that drops the learning rate by a factor of 10 at the 15th and 30th epochs.
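To make this setup concrete, the following is a minimal sketch of a fine-tuning pipeline along these lines in Keras. It is our illustration, not the authors' released code: the augmentation ranges, base learning rate, batch size, input size, and directory layout are assumptions; only the architecture, the 0.1 dropout on extracted features, the Adam optimizer, the 40 training epochs, and the schedule that divides the learning rate by 10 at epochs 15 and 30 come from the description above.

```python
# Sketch of the described fine-tuning setup (not the authors' code).
# Augmentation ranges, base learning rate, batch size, and image size are
# illustrative assumptions; dropout rate (0.1), optimizer (Adam), epoch
# count, and the LR schedule follow the paper's description.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 9  # eight illustration classes plus photos

# Data augmentation: stretching, brightness scaling, zooming, channel shifts.
train_gen = keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    brightness_range=(0.8, 1.2),
    zoom_range=0.2,
    channel_shift_range=20.0,
    rescale=1.0 / 255,
)

# MobileNetV2 backbone initialized from ImageNet weights.
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet",
    pooling="avg",
)
model = keras.Sequential([
    base,
    layers.Dropout(0.1),  # dropout on the extracted features
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

def schedule(epoch, lr):
    """Divide the LR by 10 at epochs 15 and 30 (0-indexed in Keras)."""
    base_lr = 1e-4  # assumed base learning rate
    if epoch >= 30:
        return base_lr / 100
    if epoch >= 15:
        return base_lr / 10
    return base_lr

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Assumed directory layout: one subdirectory per class.
train_it = train_gen.flow_from_directory("slideimages/train",
                                         target_size=(224, 224),
                                         batch_size=32)
model.fit(train_it, epochs=40,
          callbacks=[keras.callbacks.LearningRateScheduler(schedule)])
```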
Our implementation is available at https://github.com/david-morris/SlideImages/.

4 Experimental Results

We performed two experiments: one to establish a baseline, and one to show that this dataset represents a meaningful improvement over existing work. Because our classes are unbalanced, we report summary statistics as per-class accuracies averaged with weights given by the number of instances per class.

4.1 Baseline Results

We set a baseline for our dataset with the classifier described in Sect. 3.2. The confusion matrix in Fig. 2 shows that misclassifications tend towards a few types of errors, but none of the classes have collapsed. While certain classes are likely to be misclassified as another specific class (such as structured diagrams as slides), those relationships do not hold in reverse, and a correct classification is more likely. Figure 2 shows that our baseline leaves room for improvement, and that our test set helps to identify challenges in this task. Viewing individual classification errors also highlighted a few problems with our training data.

4.2 Comparison with DocFigure

The related DocFigure dataset covers similar images and has much more data than SlideImages. To justify SlideImages, we created a head-to-head comparison of classifiers trained in the same way (as described in Sect. 3.2) on the SlideImages and DocFigure datasets. All SlideImages classes except slides have an equivalent in DocFigure. We show the reduction in the data used, and the relative sizes of the datasets, in Table 1. The head-to-head datasets contain only the matching classes, and in the case of the DocFigure dataset, the original test set has been split into validation and test sets. After obtaining the two trained networks, we tested each network on both the matching test set and the other test set. Although we were unable to reproduce the VGG-V baseline used by Jobin et al. [15], we used a linear SVM on VGG-16 features and achieved comparable results on the full DocFigure dataset (a 90% macro average compared to their 88.96% with a fully neural feature extractor).

The results (Table 2) show that SlideImages is a more challenging and potentially more general task. The network trained on SlideImages did even better on the DocFigure test set than on the SlideImages test set. Despite coming from a different source and being approximately a fifth of the size of the DocFigure training data, the SlideImages training set produced the network that performed better on our test set.
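As a concrete gloss on the summary statistic used above, the short sketch below computes per-class accuracies from a confusion matrix and averages them weighted by class support. This is our illustration, not the authors' evaluation code, and the label arrays are hypothetical toy data.

```python
# Sketch of the weighted per-class accuracy reported above (our gloss, not
# the authors' evaluation code); y_true/y_pred are hypothetical toy labels.
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])  # ground-truth class indices
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 0, 2])  # predicted class indices

cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)  # per-class accuracy (recall)
support = cm.sum(axis=1)                        # instances per class
weighted_acc = np.average(per_class_acc, weights=support)

# Equivalent via scikit-learn: per-class recall weighted by support.
assert np.isclose(weighted_acc,
                  recall_score(y_true, y_pred, average="weighted"))
print(per_class_acc, weighted_acc)
```

Note that averaging per-class recall with support weights coincides with overall accuracy, so this statistic can also be read as plain accuracy over the unbalanced test set.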
5 Conclusions

In this paper, we have presented the task of classifying educational illustrations and images in slides, and we have introduced a novel dataset, SlideImages. The classification task remains an open problem despite our baseline and represents a useful task for information retrieval. We have provided a test set derived from actual educational illustrations and a training set compiled from open access images. Finally, we have established a baseline system for the classification task. Potential avenues for future research include experimenting with the DocFigure dataset in the pre-training and training phases, and experimenting with text extraction for multimodal classification.

References

[1] ImageNet: a large-scale hierarchical image database
[2] ICPR2018 contest on robust reading for multi-type web images
[3] ICDAR2017 robust reading challenge on text extraction from biomedical literature figures (DeTEXT)
[4] A neural approach for text extraction from scholarly figures
[5] Semantic text detection in born-digital images via fully convolutional networks
[6] NOA: a search engine for reusable scientific images beyond the life sciences
[7] Labeling topics with images using a neural network
[8] A diagram is worth a dozen images
[9] Semi-supervised learning for image modality classification
[10] Guiding novice web workers in making image descriptions using templates
[11] Evaluating the accessibility of line graphs through textual summaries for visually impaired users
[12] Cutting the error by half: investigation of very deep CNN and advanced training strategies for document image classification
[13] Evaluation of deep convolutional nets for document image classification and retrieval
[14] Figures in scientific open access publications
[15] DocFigure: a dataset for scientific document figure classification
[16] MobileNetV2: inverted residuals and linear bottlenecks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018)
[17] Adam: a method for stochastic optimization

Acknowledgement. This work is financially supported by the German Federal Ministry of Education and Research (BMBF) and the European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).