key: cord-0681324-6avfbijr authors: Deshpande, Aditya M.; Minai, Ali A.; Kumar, Manish title: One-Shot Recognition of Manufacturing Defects in Steel Surfaces date: 2020-05-12 journal: nan DOI: nan sha: 5d72fd72b607b98abcdbf8c597c05074bc97aa6f doc_id: 681324 cord_uid: 6avfbijr

Quality control is an essential process in manufacturing, both to make the product defect-free and to meet customer needs. Automating this process is important for maintaining high quality together with high manufacturing throughput. With recent developments in deep learning and computer vision, it has become possible to detect various features in images with near-human accuracy. However, many of these approaches are data intensive, so training and deploying such a system on a manufacturing floor may become expensive and time-consuming. The need for large amounts of training data is one of the factors limiting the applicability of these approaches in real-world manufacturing systems. In this work, we propose the application of a Siamese convolutional neural network to perform one-shot recognition for such a task. Our results demonstrate how one-shot learning can be used in quality control of steel by identifying defects on the steel surface. This method can significantly reduce the training data requirements and can also run in real time.

Quality control plays a major role in ensuring customer satisfaction and reducing manufacturing cost. With a human in the loop, this process often consumes a lot of time. With large production requirements and the increasing complexity of industrial ecosystems, the demands of quality inspection have outpaced the human operator's ability to recognize and address quality issues. Addressing this limitation requires automating quality control. This automation is done by tracking the parameters of interest and quantifying their deviations from the desired values.
* Email: deshpaad@mail.uc.edu, {ali.minai, manish.kumar}@uc.edu. Email addresses are given in order of author names. 1 Corresponding author.

Industry 4.0 and the Industrial Internet of Things (IIoT) have resulted in the modernization of manufacturing practices. IIoT has catalyzed instrumentation, monitoring, and analytics in the industry, and has facilitated the collection of large amounts of data from various sensors and manufacturing processes. This has laid the foundation for the use of data-intensive approaches such as deep learning on the factory floor for monitoring and inspection tasks [29, 23], and has paved the way for various innovations in smart manufacturing. Recently, the fields of deep learning and computer vision have produced several pivotal advances that address complex problems. Deep neural networks have solved challenging problems in visual perception [31, 32], speech recognition [46], language understanding [43, 6] and robot autonomy [21]. These techniques leverage the expressiveness of neural network architectures as powerful and flexible function approximators. Given large amounts of data, deep neural networks form the basis for learning sophisticated characteristics of the input. As a result, the field of computer vision is also shifting from statistical methods to deep neural network-based approaches. Computer vision methods can be used for non-invasive inspection of the manufacturing output. The quality of prediction of deep learning-based vision systems is highly dependent on the training data. Although IIoT has enabled the collection and storage of large amounts of data, this data may require annotation, and labeling the data can be expensive and time-consuming. In a growing manufacturing facility, when the manufacturing of a new product is launched, the inspection requirements for that product may be completely different. Thus, new training data will be required to train a new model.
As a result, training and deployment of deep learning solutions in manufacturing environments can be challenging. These limitations of deep learning-based methods form the motivation of this work. In this paper, we present the novel application of one-shot recognition for steel surface defect detection. The results show the effectiveness of this approach in recognizing various defects in steel surfaces while significantly reducing the training data requirements. Even with one sample of a particular class, this approach is able to effectively identify a defect belonging to that class. To the best of our knowledge, this work is the first of its kind demonstrating the application of one-shot recognition for quality inspection of steel surfaces. This paper is organized as follows: Section 2 is a brief literature review of research on artificial intelligence and smart manufacturing. Details of our application of one-shot recognition of surface defects using the Siamese network are presented in section 3. Section 4 provides the details of the dataset used in this work. Section 5 presents the experimentation details and results. Section 6 gives the conclusion and future work directions.

There is an unprecedented increase in sensory data as a result of Industry 4.0 and IIoT. Decisions in intelligent manufacturing are influenced by the information collected from all across the manufacturing facilities, including manufacturing equipment, manufacturing process, labor activity, product line, and environmental conditions. Machine learning has revolutionized data interpretation approaches. Advancements in deep learning and computer vision have provided robust solutions for difficult problems including object detection [31, 32], object tracking [4], anomaly detection [17] and feature extraction [35]. Vision-based inspection is classified as a nondestructive evaluation technique in the manufacturing industry.
Combined with recent advancements in computer vision, these inspection processes can be automated and improved without compromising product quality. To design robust automated non-invasive vision systems for quality control, interdisciplinary knowledge of manufacturing and advanced image-processing techniques is essential. The work in [14] presented a detailed overview of inspection tasks in the semiconductor industry that can potentially be automated using vision techniques. An elegant solution for manufacturing defect inspection using convolutional neural networks (CNN) and transfer learning on X-ray images was presented in [8]. The authors used Mask Region-based CNN [10] for this application. This method can simultaneously detect and segment multiple defects in the input image. The application used the GRIMA database of X-ray images (GDXray) for casting and welding [24] to demonstrate the effectiveness of this approach. Computer vision has also been applied to the detection of damage and cracks in concrete surfaces. One of the early studies applying image processing to detect defects in concrete surfaces presented a comparative study of various image processing techniques, including the fast Haar transform, the fast Fourier transform, the Sobel edge detector, and the Canny edge detector [1]. A robust approach using deep learning-based crack classification of concrete surfaces was presented recently in [3]. The authors used a deep CNN and presented a comparative study of their approach against traditional methods including Sobel and Canny edge detection. CNNs were found to be capable of detecting cracks without any failures under a wide range of image conditions. The work in [33] presented a novel and integrative intelligent optical inspection approach to monitoring the quality of printed circuit boards (PCBs) in manufacturing lines.
The author in this work also emphasized the use of deep neural networks for non-invasive vision-based inspection. Several other machine learning approaches were applied to monitor PCB manufacturing in [42]. This paper presented a detailed comparative study of methods including multi-layer perceptrons, support vector machines (SVMs), radial basis function-based SVMs, decision trees, random forest, naive-Bayes classifier, logistic regression and gradient boosting. Another example of quality control with deep learning in PCB manufacturing can be found in [22]. To enable real-time inspection and localization of various PCB features in the image, the authors of that work trained the YOLO object detector on annotated PCB images. Surface inspection is an important part of the quality control process in manufacturing, and considerable work is being done on detecting surface flaws using deep learning. The authors in [27] trained a neural network on data from six surface types: wafer, solid color paint, pearl color paint, fabric, stone and wood. Some of the early work on steel surface defect detection with deep CNNs is available in [40], which used photometric stereo images of steel to train the network models. A novel neural network architecture designed for segmentation and localization of defects on metallic surfaces is presented in [41]. In that work, a cascaded autoencoder (CASAE) is used in the first stage to localize and extract the features of the defect from the input image, followed by accurate classification of the defect using a compact CNN in the second stage. In a similar context, the U-Net neural network architecture [35] has also proven to be very useful for saliency detection on surfaces. The authors in [15] obtained state-of-the-art results with the U-Net architecture for the detection of defects on magnetic tile surfaces.
Although deep learning has shown great promise for smart manufacturing, it comes at the cost of large data requirements. Since annotating the data collected from manufacturing lines may not always be possible, the immediate deployment of these systems is limited. To address these issues, there has been recent interest in the research community in developing neural networks that can effectively learn the mapping from sensor space to target space from small datasets. Transfer learning in deep neural networks is one step in that direction [45, 26]. The key idea here is that the hidden layers of a CNN are generic extractors of latent features from the data. Transfer learning enables the reuse of a pretrained neural network after fine-tuning with a relatively small dataset for a new task. Thus, the ImageNet CNN architecture [19], which contains more than 60 million parameters, may not require training from scratch; a few thousand training images may be enough to learn a new classification task. Approaches like few-shot learning and zero-shot learning can further reduce the data requirements for deep learning tasks [38, 30, 44, 34]. Few-shot learning uses only a few examples of each category from a dataset (typically fewer than 10) to learn image classification. Zero-shot learning is designed to capture the knowledge of various attributes in the data during training and to use this knowledge in the inference phase to categorize instances among a new set of classes. The one-shot recognition approach using the Siamese neural network architecture is also an excellent example, requiring only one data sample [18]. This network has found applications in areas where the data available to train neural networks may be limited. The one-shot recognition approach has proven to be useful in tasks like drug discovery [2], natural language processing [25], audio recognition [7] and image segmentation [36].
The low training data requirements of this approach make it suitable for visual inspection tasks. In this work, we explore the application of Siamese network-based one-shot recognition to the visual inspection task in smart manufacturing. We show the effectiveness of this method on steel surface defect recognition. The results also include a comparison of this approach with a conventional CNN and with a simple one-shot learning baseline, the nearest-neighbor algorithm with a single neighbor.

The key idea behind one-shot image recognition is that, given a single sample image of a particular class, the network should be able to recognize whether candidate examples belong to the same class or not. During training, the network learns to identify the differences in features of the input image pair. During the inference phase, the learned network can be reused with only one example image of a certain class to recognize whether the candidate data belongs to the same class or not. The Siamese network architecture used in this work is shown in fig. 1. This model is trained to learn a good representation of defects in steel surfaces. We use the contrastive loss function explained in section 3.1 for training the network. Once trained, the model should be able to recognize multiple defects given a single example of each defect. In fig. 1, the two modules of the network are identical and share the same weights. Each module can be viewed as a parametric function of the weights $\theta$, given by $f_\theta : \mathbb{R}^N \to \mathbb{R}^n$ with $N \gg n$. The high-dimensional input (image) in $\mathbb{R}^N$ is reduced to an output that is an encoded vector of lower dimension $n$. In this case, $N = 100 \times 100$ and $n = 5$. Readers should note that the outputs of the two modules, taken from the layers of size $n = 5$, are referred to as the encoded vectors $f_\theta(x_1)$ and $f_\theta(x_2)$. The final output of the architecture is the Euclidean distance between these encoded vectors. The input to the model is a pair of single-channel (grayscale) images $x_1$ and $x_2$.
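To give a feel for the size of one module, the layer sizes given in this section (three 3 × 3 convolutions with 4, 8 and 8 feature maps of size 100 × 100, then fully connected layers of 500, 500 and 5 units) can be turned into a rough parameter count. The following is only a back-of-the-envelope sketch: it assumes padding that keeps every feature map at 100 × 100, no pooling between layers, and a bias term in every layer, none of which is stated explicitly in the text. The function names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per output map (assumed)."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights of a fully connected layer plus one bias per output unit (assumed)."""
    return n_in * n_out + n_out


def module_layers(side=100, conv_channels=(4, 8, 8), dense_sizes=(500, 500, 5)):
    """List (description, parameter count) for one module of the Siamese network."""
    layers = []
    in_channels = 1  # single-channel (grayscale) input
    for out_channels in conv_channels:
        layers.append((f"conv 3x3, {out_channels} maps of {side}x{side}",
                       conv_params(in_channels, out_channels)))
        in_channels = out_channels
    # Assumption: the last feature maps are flattened without pooling.
    n_in = in_channels * side * side
    for n_out in dense_sizes:
        layers.append((f"dense {n_in} -> {n_out}", dense_params(n_in, n_out)))
        n_in = n_out
    return layers


layers = module_layers()
total = sum(count for _, count in layers)
for description, count in layers:
    print(f"{description:32s} {count:>12,d}")
print(f"{'total':32s} {total:>12,d}")
```

Under these assumptions almost all parameters sit in the first fully connected layer, and because the two modules share weights, the count applies to the whole Siamese network rather than to each branch separately.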
The two modules are identical; each has three convolutional layers with 4, 8 and 8 feature maps respectively (from left to right), each of size 100 × 100. The convolutional layers are followed by three fully connected layers of size 500, 500 and 5 respectively. A kernel size of 3 × 3 is used for the convolutions, with a stride of 1. The ReLU activation function is applied to the output feature maps of each layer.

For training, we used the contrastive loss function [9, 5]. Equation (1) describes the loss function $L(\cdot)$. The loss function is parameterized by the weights of the neural network $\theta$ and the training sample $i$. The $i$-th training sample from the dataset is a tuple $(x_1, x_2, y)_i$, where $x_1$ and $x_2$ are a pair of images and the label $y$ is equal to 1 if $x_1$ and $x_2$ belong to the same class and 0 otherwise. The first term on the right-hand side (RHS) of equation (1) imposes a cost on the network if the input images $x_1$ and $x_2$ belong to the same class, i.e., $y = 1$. The second term penalizes the input sample if the images belong to different classes, i.e., $y = 0$. The margin $m > 0$ is a constant. The term $D_{\theta,i}$ is defined in equation (2): it is the Euclidean distance between the $n$-dimensional outputs of the neural network modules for the input image pair $x_1$ and $x_2$ in sample $i$ of the dataset. For the $i$-th sample with $y = 1$, the second term in equation (1) evaluates to zero. Therefore, the loss value in this case is directly proportional to the square of the distance between $f_\theta(x_1)$ and $f_\theta(x_2)$. Since the objective is to minimize the loss, the network weights are learned so as to reduce the distance between the encoded vectors of the input samples $x_1$ and $x_2$. Intuitively, this can be understood as the model learning that the two input images are similar. On the other hand, if the input sample has the label $y = 0$, the first term on the RHS vanishes. If $y = 0$ and $D_{\theta,i} > m$, the model is not penalized.
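The behaviour of the two terms can be checked numerically with a small sketch. It assumes the standard form of the contrastive loss from [9], written with the label convention used here (y = 1 for a same-class pair): L = y · D²/2 + (1 − y) · max(0, m − D)²/2. The factor 1/2, the margin value of 2.0 and the function names are assumptions made for illustration, and the encoded vectors are hand-picked 5-dimensional examples rather than network outputs.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two encoded vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(enc1, enc2, y, margin):
    """Contrastive loss for one sample (assumed form, see the lead-in above).

    y = 1: the pair belongs to the same class, so only the first term is active.
    y = 0: the pair belongs to different classes, so only the second term is active.
    """
    d = euclidean_distance(enc1, enc2)
    same_class_term = y * 0.5 * d ** 2
    different_class_term = (1 - y) * 0.5 * max(0.0, margin - d) ** 2
    return same_class_term + different_class_term


m = 2.0  # hypothetical margin
a = [0.0, 0.0, 0.0, 0.0, 0.0]
far = [3.0, 4.0, 0.0, 0.0, 0.0]   # distance 5 from a
near = [1.0, 0.0, 0.0, 0.0, 0.0]  # distance 1 from a

print(contrastive_loss(a, far, y=1, margin=m))   # same class, far apart: penalized
print(contrastive_loss(a, far, y=0, margin=m))   # different class, beyond margin: no penalty
print(contrastive_loss(a, near, y=0, margin=m))  # different class, inside margin: penalized
```

With this form, a same-class pair is penalized in proportion to the squared distance, while a different-class pair contributes only when its distance falls inside the margin.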
The penalty is applied only if the Euclidean distance between $f_\theta(x_1)$ and $f_\theta(x_2)$ is less than the set margin $m$. The objective in this case is to push the encoded vectors $f_\theta(x_1)$ and $f_\theta(x_2)$ away from each other in the $n$-dimensional space and make the distance between them greater than $m$. One can think of the second term in the loss function as the model learning to understand the differences between $x_1$ and $x_2$ when they belong to different classes. As a result of this loss function, the Siamese network not only learns to estimate the similarity score of the input pair of images; the non-zero second term for dissimilar pairs also prevents the model from collapsing to a constant function. For a detailed mathematical explanation of the contrastive loss, readers are referred to [9].

We trained our Siamese network model using the Northeastern University (NEU) surface defect database [39, 13, 12]. This database consists of six classes of surface defects on hot-rolled steel strip, viz., rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In) and scratches (Sc). The dataset has 1,800 grayscale images in total, with 300 samples for each of the six classes. The resolution of each sample image is 200 × 200 pixels. A few sample images from the dataset for each class are shown in fig. 2. The dataset images vary in illumination, which introduces further challenges for the image recognition task. This variability results in large differences between samples belonging to the same class. Another challenge is the similarity between images belonging to different classes, as can be seen in fig. 2. For example, the similarity between images in the categories of crazing and rolled-in scale is easily noticeable. To overcome the problem of limited quantity and limited diversity of data, we augment the existing steel surface defect dataset with affine transformations.
Each image in the dataset is rotated randomly about its center. The angle of rotation is chosen uniformly from the set of angles {0, π/2, π, 3π/2} (in radians). In the above equations, $I_{in}$ represents the input image and $I_{out}$ is the output image. $U(-10, 10)$ represents the uniform distribution from which the scalar value $\beta$ is sampled.

The neural network model was trained using the NEU surface defect dataset. The hyperparameter values used for training the network are provided in table 1. The NEU dataset was divided into two sets for one-shot recognition. The training set consisted of three classes, viz., rolled-in scale, patches and inclusion. The remaining classes of crazing, pitted surface and scratches were shown to the network in the testing phase for one-shot recognition. Data samples were chosen randomly during training. While sampling an image pair, the two images were chosen from the same category with a probability of 0.5, with a corresponding label of y = 1. Similarly, the images were chosen from two different categories with the remaining probability of 0.5, with label y = 0. This tuple of image pair and label $(x_1, x_2, y)$ is then augmented with the transformations described in section 4. Before passing an image to the network, its pixel values were normalized to fall in the range [−1, 1].

Table 1:
Neural network optimizer | Adam [16]
Adam parameters (β_1, β_2) | (0.9, 0.999)

The experiments were performed on an Intel i7 platform with 16 GB RAM and an NVIDIA RTX 2070. Training with the surface defect dataset was fairly quick: it took approximately 2 hours to train this architecture from scratch. The training and validation curves for the optimization of our model, trained using the dataset of 900 image samples augmented with the transformations described in section 4, are shown in fig. 3. The training was done for 100 epochs with a batch size of 32.
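The pair sampling, rotation and normalization steps described above can be sketched as follows. This is a minimal illustration rather than the training pipeline itself: it assumes images are stored as lists of rows with 8-bit pixel values in [0, 255], that rotations are counter-clockwise, and that each image of a pair is rotated independently. The transformation involving β is left out of the sketch, and all function names are hypothetical.

```python
import random


def rotate_quarter_turns(image, k):
    """Rotate a 2-D image (list of rows) by k * 90 degrees counter-clockwise."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def normalize(image):
    """Map pixel values from [0, 255] (assumed 8-bit) to [-1, 1]."""
    return [[pixel / 127.5 - 1.0 for pixel in row] for row in image]


def sample_pair(images_by_class, rng):
    """Draw (x1, x2, y): same class with probability 0.5 (y = 1), else different (y = 0)."""
    classes = list(images_by_class)
    if rng.random() < 0.5:
        label = rng.choice(classes)
        # Note: this simple sketch may draw the same image twice.
        x1 = rng.choice(images_by_class[label])
        x2 = rng.choice(images_by_class[label])
        return x1, x2, 1
    label1, label2 = rng.sample(classes, 2)  # two distinct classes
    return rng.choice(images_by_class[label1]), rng.choice(images_by_class[label2]), 0


def make_training_sample(images_by_class, rng):
    """Sample a labeled pair, rotate each image by a random multiple of pi/2, normalize."""
    x1, x2, y = sample_pair(images_by_class, rng)
    x1 = normalize(rotate_quarter_turns(x1, rng.randrange(4)))
    x2 = normalize(rotate_quarter_turns(x2, rng.randrange(4)))
    return x1, x2, y
```

A call such as `make_training_sample(images_by_class, random.Random(0))`, where `images_by_class` maps each of the three training class names to its list of images, returns one augmented, normalized pair together with its label.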
Here, we can see that the validation loss follows a decreasing trend along with the training loss as the number of epochs increases. During this learning, the model appears to pick up the visual saliencies of the reference image and the candidate image, and the loss accumulated during training therefore decreases. During the testing phase, the images were chosen randomly. These images belong to a different set of classes that were never shown to the network during training. Our trained network model was able to recognize images in real time during inference: each sample took approximately 0.0112 seconds to evaluate on the CPU. Candidate images were classified as belonging to the same class as the true image used in one-shot recognition based on the value of equation (2) for the image pair. The margin $m$ was used as the threshold for this decision. Figure 4 illustrates some of the results of the Siamese network evaluated during testing. Results are presented with class names along with the images, as well as the dissimilarity score of each image pair. The dissimilarity score is the value of equation (2) for the true image ($x_1$) and the candidate image ($x_2$). It can be observed from this figure that images belonging to separate categories have a larger dissimilarity score than images that belong to the same category. The images from dissimilar classes have a score larger than the value of the margin $m$ used in the contrastive loss function. It can be inferred from this observation that the neural network architecture is able to effectively capture the similarities and differences between the features of the input samples.

Fig. 4. Illustration of results obtained by the network in training phase

We compared the results of one-shot recognition with the K-nearest neighbor (KNN) classification algorithm and a feed-forward convolutional neural network architecture. The KNN algorithm was chosen since it can form a basic one-shot learning system.
With a value of K = 1, the algorithm was used for image classification of defective surfaces. KNN was shown a single image instance from each class of the dataset, and its test accuracy was evaluated as the proportion of correctly classified test instances. The raw images were used as input, and the Euclidean distance between images was used as the metric for classifying candidate images into a particular category. We used the KNN implementation from scikit-learn for this purpose [28]. We also compared our approach with a feed-forward CNN classifier [19]. The CNN used for this comparison had an architecture similar to one of the modules of the Siamese network. The input to the network is a single-channel image. The outputs of this network are the class probabilities of the input image belonging to one of the six classes of the surface defect dataset. The network had three convolutional layers with 4, 8 and 8 feature maps respectively. The size of each feature map was 100 × 100. A kernel size of 3 was used for the convolutions, with a stride of 1, in these layers. The third convolutional layer was followed by two fully connected layers of size 500 each. The output layer had 6 neurons. The ReLU activation function was used in all layers except the output layer, which used a sigmoid activation function to represent the class probabilities of the input image. The training set consisted of 80% of the dataset, and the remaining data was used for validation and testing of this network. The categorical cross-entropy loss was used for training the CNN along with the Adam optimizer. This network was trained for 120 epochs with a batch size of 128. Table 2 summarizes our testing results for each method on the steel surface defect dataset. Referring to table 2, it can be seen that the KNN algorithm does not work well and shows poor performance in the inference phase.
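The two one-shot decision procedures compared in this section differ mainly in the space where distances are measured, which a short sketch can make concrete. The following is a minimal illustration under stated assumptions: `toy_encode` is a hypothetical stand-in for the trained module f_θ (here just the mean intensity), the margin of 0.5 and the 4-pixel "images" are made up, and the 1-NN function is a plain re-implementation for illustration rather than the scikit-learn classifier used in the experiments.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def siamese_same_class(encode, reference, candidate, margin):
    """One-shot decision: same class if the encoded vectors are closer than the margin."""
    return euclidean_distance(encode(reference), encode(candidate)) < margin


def one_nearest_neighbor(support, candidate):
    """1-NN on raw pixels: return the label of the closest single reference image.

    support is a list of (label, flattened_image) with one entry per class.
    """
    label, _ = min(support, key=lambda item: euclidean_distance(item[1], candidate))
    return label


def toy_encode(image):
    """Hypothetical placeholder for the trained encoder: mean pixel intensity."""
    return [sum(image) / len(image)]


support = [("Cr", [0.0, 0.0, 0.0, 0.0]), ("PS", [1.0, 1.0, 1.0, 1.0])]
candidate = [0.9, 1.0, 0.8, 1.0]

print(one_nearest_neighbor(support, candidate))
print(siamese_same_class(toy_encode, support[1][1], candidate, margin=0.5))
print(siamese_same_class(toy_encode, support[0][1], candidate, margin=0.5))
```

In the Siamese case the distance is computed between learned encodings, whereas 1-NN compares raw pixel vectors directly, so its result depends entirely on how well pixel-space distance reflects class membership.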
The KNN algorithm is clearly not usable in real-world scenarios: it is not optimized to learn a good feature representation of the data, and the Euclidean distance metric is not an appropriate function for quantifying the match between high-dimensional images [20]. Although the CNN had superior performance, one should also note that one-shot recognition was shown only a single image sample of a new image category to obtain the observed performance, as opposed to the 80% of data from each category used to train the CNN. For a fair comparison between the proposed Siamese network architecture and the CNN, we also present in table 3 the results of training both models on identical data from the NEU dataset. The training set for both models consisted of 80% of the NEU dataset from all six classes, and the remaining data was used for validation and testing. This table reports the test accuracy of both models. All training hyperparameters and loss functions were kept the same as described earlier in this section for the respective neural network models. From the results in table 3, it was observed that the CNN and the Siamese network had competitive performance when trained on identical data from the NEU surface defect dataset. The results in this table also suggest that the Siamese network will converge to the CNN performance as the size of the dataset used for its training increases. Based on the results observed from the Siamese network for one-shot recognition, this approach has the potential for easy and fast deployment on actual factory floors when training data is limited. With ever-growing production demands and increasing requirements for automation in quality control, it is well suited to situations where data annotation is difficult or data availability is limited.

In this work, we show the application of one-shot recognition with a Siamese convolutional neural network to steel surfaces.
This vision-based approach makes a two-fold contribution to the automation of quality control. First, being non-invasive, it allows the surface quality after production to be inspected remotely without any damage to the steel. Second, it requires minimal labeled data to handle images of a new class, which makes it easy to adapt this approach to different tasks. This novel application of deep learning and computer vision paves the way for the development of various innovations in the manufacturing space. The architecture used for the network presented in this work is not optimal. One future direction is to find better hyperparameter values for the dataset. In this work, only single-channel image data of surface defects was used to inspect steel surfaces; more feature-rich sensor data is a good next step to explore. Another apparent direction for future work is exploring a wider class of texture inspection. A further direction is transferring the learned weights of a pre-trained model like VGG net [37] or ResNet [11] into the one-shot recognition framework, fine-tuned for the domain of vision-based inspection. Apart from images, one-shot recognition can also be used to identify similarities or differences in time-series data. In this case, applications such as health monitoring and predictive analytics of manufacturing machines [47] remain to be explored. Recurrent neural network modules in a Siamese architecture can form a good solution for analyzing time-series data and identifying the similarities and differences between two instances. IIoT and machine learning, in general, can favor the use of various types of raw sensor data to allow intelligent real-time decision making in modern industries. A large amount of this data is unstructured, and few-shot machine learning approaches have the potential to use this data effectively to obtain valuable insights.
[1] Analysis of edge-detection techniques for crack identification in bridges
[2] Low data drug discovery with one-shot learning
[3] Deep learning-based crack damage detection using convolutional neural networks
[4] Context-aware deep feature compression for high-speed visual tracking
[5] Learning a similarity metric discriminatively, with application to face verification
[6] Pre-training of deep bidirectional transformers for language understanding
[7] Audio Metric Learning by Using Siamese Autoencoders for One-Shot Human Fall Detection
[8] Detection and Segmentation of Manufacturing Defects with Convolutional Neural Networks and Transfer Learning
[9] Dimensionality reduction by learning an invariant mapping
[10] Proceedings of the IEEE international conference on computer vision
[11] Deep residual learning for image recognition
[12] Semi-supervised defect classification of steel surface based on multi-training and generative adversarial network
[13] An end-to-end steel surface defect detection approach via fusing multiple hierarchical features
[14] Automated visual inspection in the semiconductor industry: A survey
[15] Surface defect saliency of magnetic tile
[16] Adam: A method for stochastic optimization
[17] An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos
[18] Siamese neural networks for one-shot image recognition
[19] ImageNet Classification with Deep Convolutional Neural Networks
[20] Handwritten digit recognition using k nearest-neighbor, radial-basis function, and backpropagation neural networks
[21] End-to-end training of deep visuomotor policies
[22] Capacitor detection in pcb using yolo algorithm
[23] Enhancing sustainability and energy efficiency in smart factories: A review
[24] GDXray: The database of X-ray images for nondestructive testing
[25] Learning text similarity with siamese recurrent networks
[26] Learning and transferring mid-level image representations using convolutional neural networks
[27] Machine learning-based imaging system for surface defect inspection
[28] Scikit-learn: Machine Learning in Python
[29] Optimisation of manufacturing process parameters using deep neural networks as surrogate models
[30] Optimization as a model for few-shot learning
[31] You only look once: Unified, real-time object detection
[32] Faster R-CNN: Towards real-time object detection with region proposal networks
[33] On the development of intelligent optical inspections
[34] An embarrassingly simple approach to zero-shot learning
[35] U-net: Convolutional networks for biomedical image segmentation
[36] One-shot learning for semantic segmentation
[37] Very Deep Convolutional Networks for Large-Scale Image Recognition
[38] Prototypical networks for few-shot learning
[39] A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects
[40] Convolutional neural networks for steel surface defect detection from photometric stereo images
[41] Automatic metallic surface defect detection and recognition with convolutional neural networks
[42] A framework for inspection of dies attachment on PCB utilizing machine learning techniques
[43] Attention is all you need
[44] Matching networks for one shot learning
[45] How transferable are features in deep neural networks?
[46] Deep learning for environmentally robust speech recognition: An overview of recent developments
[47] Deep learning and its applications to machine health monitoring