authors: Chhablani, Gunjan; Sharma, Abheesht; Pandey, Harshit; Dash, Tirtharaj
title: Superpixel-based Knowledge Infusion in Deep Neural Networks for Image Classification
date: 2021-05-20
DOI: 10.1145/3476883.3520216

Superpixels are higher-order perceptual groups of pixels in an image, often carrying much more information than the raw pixels. There is an inherent structure to the relationship among the different superpixels of an image: for example, adjacent superpixels are neighbours of each other. Our interest here is to treat these relative positions of the various superpixels as relational information about an image. This relational information can convey higher-order spatial information about the image, such as the relationship between the superpixels representing the two eyes in an image of a cat: the two eyes are placed adjacent to each other on a straight line, or the mouth is below the nose. Our motive in this paper is to assist computer vision models, specifically those based on Deep Neural Networks (DNNs), by incorporating this higher-order information from superpixels. We construct a hybrid model that leverages (a) a Convolutional Neural Network (CNN) to deal with the spatial information in an image and (b) a Graph Neural Network (GNN) to deal with the relational superpixel information in the image. The proposed model is learned using a generic hybrid loss function. Our experiments are extensive, and we evaluate the predictive performance of our proposed hybrid vision model on seven image classification datasets from a variety of domains, such as digit and object recognition, biometrics, and medical imaging. The results demonstrate that the relational superpixel information processed by a GNN can improve the performance of a standard CNN-based vision system.

In the ever-burgeoning field of Deep Learning, the task of image classification and recognition has taken centre stage, chiefly after the introduction of the ILSVRC challenge. There have been significant architectural innovations for Convolutional Neural Networks (CNNs). In the last few years, the core approach of basic convolution has been adapted to graph-structured data via the introduction of Graph Neural Networks (GNNs), a term coined by Scarselli et al. [20]. A graph is a representation of binary relations, and GNNs can utilise this relational information during their learning. The learning process involves information flow between nodes using an AGGREGATE operation, through which GNNs construct a graph representation useful for the supervised learning task at hand (e.g., graph classification or regression). Binary relations can be observed in many kinds of data instances; in graphs, they are represented as edges. For instance, a graph of countries may have edges between countries that share borders; a graph of cities may have an edge between two cities if they share a bus service; and in a network of papers, papers that cite each other are related. In the case of tasks involving images, binary relations can readily be seen at the level of 'image superpixels'. Superpixels are higher-order perceptual groups of pixels that partition an image into meaningful atomic regions, which can be used to replace the rigid structure of the pixel grid. They often convey much more information than low-level raw pixels and share common characteristics such as intensity levels [19].
Figure 1 shows the super-pixellated version of two raw images. One can observe that the superpixels share binary relations with their neighbours, resulting in a graph structure. Our present work treats the relational information conveyed by the superpixels as higher-order spatial information about an image and provides this information to aid image classification. It has been observed that incorporating (domain) knowledge can significantly enhance the performance of deep learning models [7], even in problems where the amount of available data is low [6]. Although the higher-level spatial information encoded in the image superpixels cannot directly be called "domain knowledge", the neighbourhood structure represented by the connections among the superpixels does loosely convey some form of relational or domain information. Our intention in this work is to investigate a methodology for infusing such knowledge into a deep-learning pipeline for image classification problems.

In this work, we leverage spatial information from the image and infuse knowledge in the form of binary relations procured from the superpixel graph representation of the image. CNN filters tend to learn parameters based on pixel-level information. We hypothesise that fusing superpixel-level information with this can provide a higher-level understanding of the image and aid a CNN in classification tasks. Specifically, we treat the graph resulting from the image superpixels as an input to a GNN and learn a CNN-based vision system together with the GNN. The coupled hybrid CNN+GNN system is expected to be knowledge-rich at the level of both raw pixels and superpixels.

Major contributions. Overall, the major contributions of our paper are as follows: (1) we treat the superpixel representation of an image as a graph and allow a GNN to extract higher-level domain information about the image; (2) we construct an image classification model by coupling the GNN with a CNN-based baseline; (3) we conduct a series of empirical evaluations of our coupled CNN+GNN hybrid system on four popular image classification benchmarks and three case studies.

The rest of the paper is organised as follows. In Section 2, we provide details of our methodology, including a new loss function suitable for learning the hybrid model. Section 3 provides details of our experiments. Section 4 gives a brief description of related work. Section 5 concludes the paper.

Simple Linear Iterative Clustering (SLIC) [1] is an easy-to-use and straightforward algorithm. SLIC generates superpixels by clustering the pixels of an image based on their colour similarity and their proximity in the image plane. As a black-box procedure, as is the case in our present study, SLIC takes an approximate number of superpixels and the input image as its inputs, and outputs a segmented image. We use SLIC to construct the segmented version of the input image: each segmented patch in the image is now a superpixel. We then treat the centroid of every superpixel as a node in a graph, and we link these nodes by building a radius graph [3]: for every pair of nodes, we form an edge between them if and only if the Euclidean distance between them is less than a pre-ordained radius $r \in \mathbb{R}$. Mathematically, the radius graph is defined as $G = (V, E)$ such that

$$E = \{(u, v) \mid u, v \in V,\ u \neq v,\ \lVert \mathrm{pos}(u) - \mathrm{pos}(v) \rVert_2 < r\},$$

where $\mathrm{pos}(\cdot)$ denotes the centroid location of a superpixel node in the image plane.

Our main motive in this work is to learn jointly from the image and its superpixel representation. This results in a complementary combination of a vision model (CNN) and a relational model (GNN).
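Before turning to the coupled model, the graph-construction step just described can be made concrete in a few lines. The following is a minimal sketch, assuming scikit-image's SLIC implementation and SciPy's k-d tree for the fixed-radius neighbour search; the function name `superpixel_radius_graph` and the exact feature layout (mean pixel value plus centroid, matching the node features reported in Section 3) are our illustrative choices, not the authors' released code.

```python
import numpy as np
from scipy.spatial import cKDTree
from skimage.segmentation import slic

def superpixel_radius_graph(image, n_segments=75, radius=5.0):
    """Segment `image` with SLIC and build a radius graph over the
    superpixel centroids. Returns (node_features, edge_list)."""
    # For 2D grayscale arrays, tell SLIC there is no channel axis
    # (scikit-image >= 0.19); RGB images use the default last axis.
    channel_axis = -1 if image.ndim == 3 else None
    segments = slic(image, n_segments=n_segments, start_label=0,
                    channel_axis=channel_axis)

    features, centroids = [], []
    for label in np.unique(segments):
        ys, xs = np.nonzero(segments == label)
        cy, cx = ys.mean(), xs.mean()
        # Node features: mean (normalised) pixel value of the superpixel
        # plus its centroid location -- 3 values for grayscale, 5 for RGB.
        mean_val = np.atleast_1d(image[ys, xs].mean(axis=0))
        features.append(np.concatenate([mean_val, [cy, cx]]))
        centroids.append((cy, cx))

    # Connect superpixels whose centroids lie within `radius` of each other.
    tree = cKDTree(centroids)
    edges = sorted(tree.query_pairs(r=radius))
    return np.stack(features), np.array(edges)
```

For example, `superpixel_radius_graph(img, n_segments=75, radius=5.0)` roughly mirrors the MNIST settings reported in Section 3.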
The CNN takes care of feature extraction (from low-level features, such as edges, in the initial layers to complex objects in the later layers). In contrast, the graph-based model takes the superpixel-based radius graph and extracts relational information about the image that can act as domain knowledge. Fig. 2 illustrates this hybrid combination. Regardless of the detailed architectural specifics, the goal is to construct a representation of the input image that consists of both the spatial information (extracted by the CNN) and the domain information (extracted by the GNN). This representation can then be used by a feedforward fully-connected neural network that outputs a class label for the input image. In order to train the coupled model in an end-to-end fashion and adequately utilise the information from the combined representation, we propose a simple hybrid loss:

$$\mathcal{L} = \alpha\,\mathcal{L}_{CNN} + (1 - \alpha)\,\mathcal{L}_{GNN}$$

Here $\mathcal{L}_{CNN}$ and $\mathcal{L}_{GNN}$ denote the cross-entropy losses for the CNN and the GNN, respectively. The parameter $\alpha$ determines the relative importance accorded to the two models during training; for example, a value $\alpha = 0.5$ would mean that the raw spatial information and the domain information are equally important. For our construction, we treat $\alpha$ as a tunable hyperparameter. For inference on the trained model (i.e., prediction), we construct hybrid logits based on the logits computed by the CNN backbone and the GNN backbone for an input image as follows:

$$h = \alpha\, h_{CNN} + (1 - \alpha)\, h_{GNN},$$

where $h_{CNN}$ and $h_{GNN}$ are the logits from the CNN and the GNN, respectively.

This study aims to integrate superpixel-level domain knowledge with vision systems, specifically those based on CNNs. Our empirical experiments attempt to answer the following research questions: RQ1: Can GNNs construct rich relational representations from superpixels? RQ2: Can GNN-constructed knowledge improve CNN-based vision systems?

We test our hypothesis on a range of datasets: one dataset for handwriting recognition, one from the fashion and clothing domain, two from the object recognition domain, two from the domain of biometrics, and one from the domain of medical diagnosis. We briefly describe these datasets below. A summary of the datasets is provided in Fig. 3, showing the number of classes and the number of instances in the training and test sets of each dataset.

MNIST [13] is one of the most popular image classification datasets. It is a database of 28x28 grayscale images of handwritten digits, and the task is to identify the digit in the image. MNIST has 10 classes, one for every digit. Fashion-MNIST (FMNIST) [22] is a sister dataset of MNIST, consisting of fashion and clothing items collated from Zalando's database of article images. Each image is a 28x28 grayscale image, and the dataset has 10 classes, one for every type of fashion article. CIFAR-10 [12] is an established object recognition dataset consisting of 32x32 RGB images. It has ten classes, such as aeroplane, automobile, bird, cat, and dog, with 6000 instances per class. CIFAR-100 [11] is another object recognition dataset, with 100 classes of 32x32 RGB images. It is similar to CIFAR-10, but with 600 instances per class. The Labelled Faces in the Wild (LFW) [9, 10] dataset is a face recognition dataset in which every image is a 250x250 RGB image. We do not consider all identities in the dataset; in order to maintain class balance, we discard identities with fewer than 20 face images, which leaves 62 classes. SOCOFing.
The Sokoto Coventry Fingerprint [21] dataset is made up of 6,000 fingerprint images from 600 African subjects. All images are grayscale with a resolution of 96x103. The task is to identify the person to whom a given fingerprint sample belongs. COVID. The COVID-19 chest X-ray dataset [4, 18] is a medical-imaging dataset of chest radiographs, where the task is to detect COVID-19 pneumonia in a given chest X-ray image.

We use the method described in Sec. 2.1 to construct a superpixel-based radius graph for each image in these datasets. For convenience, we treat this as a graph-construction procedure that takes two inputs, a set of images and the number of superpixels (n), and returns the set of radius graphs obtained from the superpixel representations of the images. We use the PyTorch library for the implementation of the CNN models and PyTorch Geometric for the GNN models. We conduct all our experiments on OVHCloud's "AI Training" platform with the following configuration: a 32GB NVIDIA V100S GPU, 45GB main memory and 14 Intel Xeon 2.90GHz processors; and on an NVIDIA DGX station with a 32GB Tesla V100 GPU, 256GB main memory and 40 Intel Xeon 2.20GHz processors.

Our method is straightforward. For each dataset D, we determine the number of superpixels (n) for the graph-construction procedure based on visual analysis. Let D be a dataset consisting of a set of images X and their corresponding class labels Y, so that D is written as (X, Y). In the following steps, D_tr and D_te denote the official train-set and test-set for a dataset D; these can be obtained from the official sources referred to in this paper. In outline, our methodology is: construct superpixel radius graphs for the images in D_tr and D_te; train a standalone CNN on D_tr as a baseline; train the coupled CNN+GNN model on the resulting image-graph pairs using the hybrid loss; and evaluate both models on D_te. The following details pertaining to these steps are relevant:

• We use n = 75 for MNIST and FMNIST, n = 100 for CIFAR, LFW and SOCOFing, and n = 200 for COVID to construct the superpixel graphs; the number of radial neighbours is set to 5 for the MNIST, FMNIST and CIFAR datasets, 27 for COVID, 10 for LFW, and 15 for SOCOFing. These values were decided using graph visualisation;
• The node feature-vector for the radius graphs comprises the normalised pixel value and the location of the centroid in the image, resulting in a 3-length feature-vector for MNIST and FMNIST, and a 5-length feature-vector for CIFAR, COVID and LFW;
• We use a 90:10 split on D_tr (and, correspondingly, on its superpixel graphs) to obtain a validation set for hyperparameter tuning;
• We use the AdamW optimiser [15] for training all our models;
• The hyperparameters are obtained by a sweep across the following grids for all datasets: batch size {128, 256}, learning rate {1e-3, 1e-4, 1e-5}, and weight-decay parameter {0, 0.001};
• The CNN structure is built with 3 blocks; each block consists of a convolutional layer, a batch-norm layer, a ReLU activation and a max-pool layer. The number of channels in the three convolutional layers is 32, 64 and 64, respectively;
• The coupled CNN+GNN structure consists of the same CNN backbone as above, together with a GNN consisting of three graph-attention (GAT) layers with 32, 64 and 64 channels, respectively, with the number of attention heads set to 1 (see the sketch after this list);
• To determine the optimal value of α, we run a grid search over multiple values in the range (0, 1); α = 0.75 was found to be the optimal value for our datasets. However, as mentioned earlier, we advise treating α as a tunable hyperparameter in future work;
• We use accuracy as the metric for measuring the predictive performance of the models on D_te.

In all our experiments, the primary intention is to show whether relational information provided via a superpixel graph is able to aid in improving the performance of a CNN-based model.
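To tie the architectural details above to the hybrid loss and inference rule of Section 2, here is a minimal sketch of the coupled model, assuming PyTorch and PyTorch Geometric. The layer widths follow the bullets above; everything else (kernel sizes, padding, pooling, and the names `HybridCNNGNN`, `hybrid_loss`, `hybrid_logits`) is our assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class HybridCNNGNN(nn.Module):
    """Coupled model: a 3-block CNN over raw pixels and a 3-layer GAT
    over the superpixel radius graph, each with its own classifier head."""
    def __init__(self, in_channels, node_features, num_classes):
        super().__init__()
        def block(c_in, c_out):  # conv -> batch-norm -> ReLU -> max-pool
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))
        self.cnn = nn.Sequential(block(in_channels, 32),
                                 block(32, 64), block(64, 64))
        self.cnn_head = nn.LazyLinear(num_classes)  # infers flattened size
        # Three single-head graph-attention layers (32, 64, 64 channels).
        self.gats = nn.ModuleList([GATConv(node_features, 32, heads=1),
                                   GATConv(32, 64, heads=1),
                                   GATConv(64, 64, heads=1)])
        self.gnn_head = nn.Linear(64, num_classes)

    def forward(self, image, x, edge_index, batch):
        h_cnn = self.cnn_head(self.cnn(image).flatten(1))
        for gat in self.gats:
            x = F.relu(gat(x, edge_index))
        h_gnn = self.gnn_head(global_mean_pool(x, batch))
        return h_cnn, h_gnn

def hybrid_loss(h_cnn, h_gnn, target, alpha=0.75):
    # L = alpha * L_CNN + (1 - alpha) * L_GNN, both cross-entropy.
    return (alpha * F.cross_entropy(h_cnn, target)
            + (1 - alpha) * F.cross_entropy(h_gnn, target))

def hybrid_logits(h_cnn, h_gnn, alpha=0.75):
    # At inference, combine the two backbones' logits with the same alpha.
    return alpha * h_cnn + (1 - alpha) * h_gnn
```

Training would then minimise `hybrid_loss` with AdamW over image-graph pairs, and prediction takes an argmax over `hybrid_logits`; the default `alpha=0.75` reflects the value reported above.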
One should note that the primary CNN backbone in the standalone model and in the coupled model must remain the same in order to draw any meaningful conclusion. The principal results of our experiments are reported in Fig. 4. The results demonstrate that the higher-level domain knowledge extracted by a GNN from the superpixel graph can, in some cases, substantially boost the predictive performance of CNNs in a hybrid-learning setting.

Readers familiar with deep neural networks would agree that networks with complex structures and a large number of parameters tend to be more accurate than simpler networks with fewer parameters. A valid argument, therefore, could be that our hybrid model has a more complex structure and more parameters than a standard CNN model (the baseline in our work), and hence could simply be learning better image representations than a CNN alone. The results presented in this work should therefore be read with this caveat in mind, and understood as showing that the GNN in the hybrid model enforces some form of structural constraint that is learned alongside the standard convolutions over the input image. However, we have ensured that the percentage difference in the number of parameters between the two models (the CNN and the hybrid) is only approximately 10%.

Various graph-based learning approaches for superpixel image classification can be distinguished by how they perform neighbourhood aggregation. In MoNet [17], weighted aggregation is performed by learning a scale factor based on geometric distances; the authors also test their model on the MNIST Superpixels dataset. In Graph Attention Networks (GATs) [2], self-attention weights are learned. SplineCNN [8] uses B-spline bases for aggregation, whereas SGCN [5] is a variant of MoNet that uses a different distance metric. There are also some preliminary works on enhancing semantic-segmentation systems using superpixel representations. For instance, Mi et al. [16] propose a Superpixel-enhanced Region Module (SRM), which they train jointly with a deep neural forest; the SRM alleviates noise by leveraging pixel-superpixel associations. In DrsNet [24], coarse and fine superpixel masking are applied to the CNN features of the input image, particularly for rare classes and background areas. The authors of [25] use a superpixel-based multiple-local-region joint representation CNN model to classify very-high-resolution (VHR) panchromatic and multispectral (MS) remote-sensing images. In [23], the authors propose an image recognition method based on superpixels and feature fusion, where features (such as global, texture and appearance features) are computed from the superpixel representation. In [14], a superpixel-guided layer-wise embedding CNN is devised, with a focus on remote-sensing image classification; superpixels guide the network when it comes to unlabelled samples and help in handling irregular spatial dependencies.

Our present work demonstrates that superpixel-based graphs represent domain knowledge for images, and that infusing this knowledge into a standard vision-based training process yields significant gains in predictive accuracy. We are able to conclude that a GNN can deal with the relational information conveyed by a superpixel graph and can construct high-level relations that boost predictive performance when coupled with a CNN model.
A straightforward extension is to employ a pre-trained CNN model for domain-specific tasks and observe how the superpixel information impacts the predictive performance.

Our code repository is available at: https://github.com/abheesht17/super-pixels. We provide details on how to obtain the data used in our experiments, as well as details on how to use our code repository for applications in other domains. The authors can be contacted for more details on the code package.

References.
[1] SLIC superpixels compared to state-of-the-art superpixel methods.
[2] Superpixel image classification with graph attention networks.
[3] The complexity of finding fixed-radius near neighbors.
[4] Can AI help in screening viral and COVID-19 pneumonia?
[5] Spatial graph convolutional networks.
[6] A review of some techniques for inclusion of domain-knowledge into deep neural networks.
[7] Incorporating symbolic domain knowledge into graph neural networks.
[8] SplineCNN: Fast geometric deep learning with continuous B-spline kernels.
[9] Unsupervised joint alignment of complex images.
[10] Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments.
[11] CIFAR-100 dataset.
[12] CIFAR-10 dataset.
[13] The MNIST database of handwritten digits.
[14] Superpixel-guided layer-wise embedding CNN for remote sensing image classification.
[15] Decoupled weight decay regularization.
[16] Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation.
[17] Geometric deep learning on graphs and manifolds using mixture model CNNs.
[18] Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images.
[19] Learning a classification model for segmentation.
[20] The graph neural network model.
[21] Sokoto Coventry Fingerprint dataset (SOCOFing).
[22] Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
[23] Image classification with superpixels and feature fusion method.
[24] DrsNet: Dual-resolution semantic segmentation with rare class-oriented superpixel prior.
[25] Superpixel-based multiple local CNN for panchromatic and multispectral image classification.

Acknowledgements. We sincerely thank Jean-Louis Quéguiner, Head of Artificial Intelligence, Data and Quantum Computing at OVHCloud, for providing us with the necessary computing resources for conducting our experiments.