key: cord-0851531-529yxpao authors: Shu, Yufeng; Li, Bin title: Surface Defect Detection and Recognition Method for Multi-Scale Commutator Based on Deep Transfer Learning date: 2021-07-30 journal: Arab J Sci Eng DOI: 10.1007/s13369-021-05963-3 sha: 26b6672af92ec332a7bdf11de0ccd6f94819eb92 doc_id: 851531 cord_uid: 529yxpao In view of the fact that traditional strip surface defect detection and recognition methods cannot adapt to the changing actual detection environment, and deep learning-based detection and recognition methods have high requirements for data volume, a new strip surface defect detection and recognition based on deep transfer learning is proposed. Method: First, the ResNet network trained based on the ImageNet dataset is transferred to the Faster R-CNN classic target detection algorithm. In order to deal with the problem of large differences in defect scales, the regional recommendation network in Faster R-CNN is improved and designed. A multi-scale regional recommendation network (MS-RPN) is proposed. The strip surface defect data set is used for experimental verification. The experimental results show that compared with Faster R-CNN, the proposed method has higher accuracy and is more suitable for strip surface defect detection applications. The proposed method has an accuracy of 84.14%, 88.81%, 88.35%, 92.86%, 92.86% and 92.53 for detecting scratches, bruises, cracks, oil stains and black spots, respectively. The continuous social development has made more and more high-tech industries gradually known to people, but many industries have increasingly higher requirements on the quantity and quality of raw materials, such as commutator as an important part of motor in the manufacturing industry, especially in vehicle manufacturing, ship manufacturing, aerospace equipment manufacturing and other industries. Therefore, the quality of the commutator surface directly determines its product quality. 
In addition, the commutator is very vulnerable to external influences in the process of production and processing, resulting in a variety of problems. For example, different raw materials, processing methods and manufacturing equipment will lead to a series of surface defects of the commutator such as scratches, cracks, oil stains, black spots, which not only affect the appearance to a certain extent, but also may have huge safety risks in quality. For instance, defects can reduce the physical properties of mechanical parts, such as corrosion resistance, wear resistance and fatigue strength, which will cause serious problems in the final product quality [1] . Consequently, how to find and locate defects in time during the production process has become an urgent problem in the commutator production industry. With the rapid development of computer vision technology, mature application systems such as face detection, vehicle detection, speech recognition and text recognition have been widely used in various industries [2] [3] [4] [5] . Computer vision-based surface defect detection technology for commutator has also developed rapidly [6] , in which the product surface image is obtained through non-contact detection and optical automatic collection, and then, the specific features of the surface image are extracted using different algorithms. Computer vision is an artificial intelligence field that trains computers to understand and interpret the visual environment. Using digital pictures from cameras and videos, as well as deep learning models, machines can properly recognize and categorize things. The purpose of machine vision detection technology is to capture the surface image of the work piece using a camera rather than human eyes and then to assess if there are flaws on the surface of the work piece by processing the picture through a computer. 
Parsytec developed HTS-2 and HTS-2 W surface defect detection systems and applied the classifier technology based on artificial neural network to the field of commutator detection for the first time [7] . Harmonized System or Harmonized Tariff Schedule is represented by an HS or HTS code. The codes were created by the World Customs Organization (WCO) and are used to categorize and characterize items that are sold globally. In most situations, an HTS code must be allocated to a traded good in order to import or export it internationally. The HTS code must conform to the Harmonized Tariff Schedule of the country of import. HTS-2 column (2) has higher duty rates, which are necessary for nations that do not have regular commercial ties with the USA. The HTS lists the nations that are eligible for international trade program or are subject to column 2 tariff rates. Tolba combined decision theory, multi-scale features with the neural network to get an optimal Bayesian classifier and applies it to flat and even defect detection of commutator surface [8] . However, it is difficult to maintain good detection performance in the day-to-day changing environment for the surface defect detection method of commutator based on machine learning and some shallow neural networks. Recently, the deep learning algorithm has been ranked first in the target detection challenge because of its high feature expression ability and high fitting, so it is a very competitive method to transplant the target detection method based on deep learning to the detection of surface defects of commutator. For many years, target detection techniques were investigated. Many classic target detection techniques came into being in the 1990s. Mainly conventional extraction techniques have been utilized to extract characteristics and then coupled with template matching algorithms or target identification classifications. 
Currently, there are two main types of framework for target detection algorithms based on deep learning: candidate box-based target detection and regression-based target detection. Girshick et al. proposed R-CNN [9] , the first target detection framework based on candidate boxes, in 2014, followed by Fast R-CNN [10] , Faster R-CNN [11] and Mask R-CNN [12] . Regression-based target detection methods are mainly based on the You Only Look Once (YOLO) series [13] [14] [15] and the Single Shot Multi-Box Detector (SSD) series [16, 17] . Face detection is a computer technique that locates human faces in digital pictures and is used in a number of applications. The term also refers to the psychological process through which people find and attend to faces in a visual scene. Vehicle detection aims to provide vehicle counting, speed measurement, traffic accident identification, traffic flow forecasts, etc., and various sensors are used to collect the continually generated traffic information. Speech recognition is accomplished using acoustic and language modeling techniques. Language modeling connects sounds with word sequences to help discriminate between words that sound similar; acoustic modeling represents the link between linguistic units of speech and audio signals. Text recognition can automate tedious data entry for credit cards, receipts and business cards. Text can also be extracted from photographs of documents, for example to enhance accessibility or to translate documents with a cloud-based API. Fortunately, methods based on deep learning can better cope with the complex changes brought by the practical application environment, because deep learning focuses on deep, automatic feature extraction and learns features layer by layer from low level to high level, with a good ability of feature extraction and selection. 
However, the basic conditions of deep learning are a large amount of data as support and sufficient model training. In the study of commutator surface defects, the data are mostly collected by manual photography in the production workshop, which results in small differences between samples and much repetitive data. Moreover, the amount of data actually collected is usually insufficient for a deep learning target detection model, so transfer learning is needed to carry the relevant knowledge from the existing deep learning target detection field over to the commutator surface defect detection field [18] . The ImageNet dataset [19] is probably the best source data: the feature expression ability of a network model trained on ImageNet is transferred to the field of commutator surface defect detection and recognition. The ImageNet image classification dataset has more than 1 million images. In addition, commutator surface defect detection and recognition rely mainly on edge, texture, spatial and other feature information, and ImageNet as source data provides rich and diverse image spatial feature information. Transfer learning is an approach in deep learning (and machine learning) in which knowledge is transferred across models: a new problem is addressed by reusing all or part of a model that has already been trained on another task. R-CNNs (Region-Based Convolutional Neural Networks) are a family of machine learning models used in computer vision, especially for object detection. 
In this paper, a new commutator surface defect detection and recognition method is proposed based on transfer learning, in which the feature extraction layer learned by ResNet network [20] in ImageNet is transferred to the commutator surface defect dataset, and the transferred feature layer covers a strong feature extraction ability. Specifically, firstly, the feature layer in Resnet-101 is transferred to the feature extraction layer in Faster R-CNN, a target detection network based on deep learning, and then, a new deep transfer learning network is constructed for commutator surface defect detection and recognition. In order to make full use of the multi-scale feature map in Resnet101, the regional referral network in Faster R-CNN is trimmed to adapt to the ResNet network, and the problem of missing detection due to the large-scale difference in the commutator surface defects is solved. Traditional approaches for detecting and recognizing strip surface defects are unable to adapt to the changing detection environment. A new deep transfer learning-based strip surface defect detection method is suggested. The suggested approach has a greater level of accuracy and is better suited for strip surface fault identification. Big data analysis will be a dependable source of information extracted from sensors and actuators. The data gathered using the integrated telematic and real-time research capabilities are the bright advantage for self-driving automobiles. A sensor fusion system combining 3D camera sensor data with Lidar sensor data is proposed to achieve this goal [21] . Object detection using deep learning models has shown promising results when it comes to detecting an object in pictures. The goal of this research is to use real-life pictures to annotate and locate medical face mask items. Its goal is to keep individuals safe from COVID-19 transmission [22] . 
Goggles are an important aspect of any ski or snowboarding outfit since they protect your eyes from the weather and harm. Defects, particularly on the lens surface, are inherent in the ski goggles manufacturing business. A new approach based on machine vision is given to solve this challenge [23] . In recent years, transfer learning has attracted more and more attention. As shown in Fig. 1 , in the traditional machine learning method, the training set and the test set must be in the same feature space and have the same data distribution. However, in practice, data in most fields are limited, characterized by heterogeneous migration learning that solves the problem of how to transfer knowledge from existing source domains to target domains, and even between heterogeneous data. Migration from source databases to target databases with different database management systems from different vendors in source and destination data bases, then it is known as heterogeneous migration. Heterogeneous transfer learning is characterized by different source and destination domains, but it can also be coupled with additional problems including heterogeneous data distributions and label spaces. There are three main types of transfer learning: ① case-based transfer in isomorphic space; ② feature-based transfer in isomorphic space; ③ transfer learning in heterogeneous space. Case-based transfer learning mostly occurs when the source dataset is very similar to the target dataset, mainly to filter out the data in the source dataset that is helpful for the classification of the target dataset and add it to the target dataset for model training, so as to improve the generalization ability of the model. Feature-based transfer learning in isomorphic space can well overcome the large data differences between fields. 
Some common features, such as texture features, edge features, high-level abstract features, are found at the feature level, and then, the feature extraction ability is transferred to the target dataset to improve the model generalization ability on the target dataset. In the field of target detection, it is common to encounter target individuals with large-scale differences in an image. If the same convolution feature extraction strategy is adopted for individuals with large-scale differences in the target, insufficient feature extraction for large-scale targets and As shown in Fig. 2a , this strategy is called image pyramid. By adjusting the input image to different sized copies, the copies with different scales are combined with the original image to become image pyramid. Then, these different scale images are input into the feature extraction network to extract convolution features, which are easy to produce multi-scale features with semantic representativeness. At last, these prediction features are combined, and the final prediction results are output. Image pyramid method has been improved in recognition accuracy and positioning accuracy, mainly because the features from various sizes of images do surpass those based on single-scale images. Although the performance is improved, such a strategy consumes a lot of time and memory, which limits its application in real-time tasks. Other methods, including Fast R-CNN and Faster R-CNN, do not use this policy by default. In Faster R-CNN, the strategy shown in Fig. 2b is adopted, mainly to extract the feature of single-scale image input by convolutional neural network with multi-scale output and use the convolutional feature map of the last layer. Compared with image pyramid, this strategy requires significantly less memory and computing cost, so it can be deployed in the training and testing phase of real-time network. As shown in Fig. 
2c, the SSD network predicts directly on convolution feature maps of different scales, which not only combines the multi-scale features but also greatly improves the operation speed. In the FPN network, the connection between different scales is further strengthened to form a feature pyramid: the feature map of the previous scale is up-sampled by 2 × and connected with the feature map of the matching scale, fully combining the semantic information of the deep convolution feature maps with the spatial information of the shallow convolution feature maps, which greatly improves the detection accuracy. In addition, a detector based on a deep neural network can easily be modified to embed the feature pyramid building module. Faster R-CNN is a real-time target detection model put forward by Girshick et al. in 2015, in which a neural network is used for the first time to generate candidate regions, removing the time-consuming candidate region generation of previous algorithms (R-CNN, SPP-Net and Fast R-CNN all use selective search (SS) to extract region proposals, which takes up most of the time of the whole detection process and is the bottleneck restricting further improvement of their detection speed). It is a very important algorithm in the R-CNN series because it reduces the extraction time of region proposals and integrates candidate region extraction, feature extraction and classification into a single framework. The Faster R-CNN network structure is shown in Fig. 3 ; it is mainly composed of four parts: feature extraction layer, candidate region proposal network, ROI pooling and object detection. The feature extraction layer is composed of a group of basic convolution, non-linear activation and pooling operations. 
The purpose is to extract the feature map of the input image, which is shared by the RPN (region proposal network) and the fully connected layers. The introduction of the region proposal network (RPN) is the distinctive contribution of Faster R-CNN. The RPN is a fully convolutional network, trained end to end, that predicts object bounds and objectness scores at each position. Features extracted from the shared convolution layers are fed into the RPN, which slides an N × N convolution kernel over the output feature map of the shared convolution layers and maps each sliding position to a low-dimensional feature vector. These low-dimensional vectors are fed into two fully connected layers that output the positions and scores of candidate boxes of multiple sizes and aspect ratios. Because the RPN outputs candidate regions of different sizes, the ROI pooling layer maps candidate regions of different sizes to feature vectors of the same length. Finally, the fixed-size feature vector is sent to the regression layer to fine-tune the detection box and output it. The components of the Faster R-CNN structure in Fig. 3 (feature extraction layer, convolution feature map, ROI pooling and object detection) are described in turn. Feature extraction layer: The feature extraction layer extracts features from the input training data using convolution layers. Each convolution layer has a collection of filters that aid in feature extraction. The convolution output is subsequently passed through a ReLU (rectified linear unit) activation, which makes the transformation non-linear. Convolution feature map: A convolution applies a filter to an input, producing an activation. Applying the same filter repeatedly across an input, e.g., a picture, yields a feature map showing the locations and strengths of the detected feature. 
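The convolution and ReLU steps described above can be illustrated with a minimal pure-Python sketch. The 3 × 4 input, the 2 × 2 vertical-edge filter and the function names `conv2d_valid` and `relu` are illustrative assumptions, not values or code from the paper. As in most CNN libraries, the "convolution" here is technically a cross-correlation (the filter is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(fmap):
    """Rectified linear unit: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


# Hypothetical input: dark region on the left, bright region on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
```

The feature map is non-zero only in the column where the filter straddles the edge, which is the sense in which a feature map records the location and strength of a detected feature.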
ROI pooling: Region of interest pooling (RoI pooling) is a procedure commonly employed with convolutional neural networks in object detection applications. ROI pooling takes each ROI from the input and transforms the piece of the input feature map that corresponds to that ROI into a fixed-dimension map. Surface defect detection of the commutator is of great importance to the performance of the motor. Because traditional machine learning and deep learning methods require a large amount of data, applying them directly to commutator surface defect detection easily leads to insufficient model training. In view of this, in this paper the feature extraction layers (convolution and pooling layers) of the ResNet model pre-trained on the ImageNet dataset, which likewise consists of image data, are transferred to the detection task on the target dataset using feature-based transfer learning in isomorphic space. What is transferred is the feature extraction ability of the model (such as the ability to extract edge features, texture features, and shape and other high-level abstract features), so as to improve the generalization ability of the commutator surface defect model. ResNet, short for residual network, is a classic neural network that serves as the foundation for many computer vision applications. It uses identity mappings to address the vanishing gradient problem: ResNet employs a skip connection in which the original input is added to the output of the convolution block, which gives the gradient an alternate path to pass through. At the feature level, there are some invariant universal features that are common to both the ImageNet dataset and the commutator surface defect dataset, so feature-based transfer learning in isomorphic space can be carried out. 
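To make the ROI pooling step described above concrete, the following is a minimal sketch of max pooling a region of a feature map into a fixed-size grid. The function name `roi_max_pool`, the bin-splitting rule (floor for the bin start, ceiling for the bin end) and the toy 4 × 6 feature map are assumptions made for illustration; the paper does not specify these details.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the region `roi` of a 2D feature map into an out_h x out_w grid.

    `roi` is (row_start, col_start, row_end, col_end) in feature-map cells,
    with the end indices exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    if h <= 0 or w <= 0:
        raise ValueError("ROI must cover at least one cell")
    pooled = []
    for i in range(out_h):
        rs = r0 + (i * h) // out_h                      # bin start (floor)
        re = r0 + ((i + 1) * h + out_h - 1) // out_h    # bin end (ceiling)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(fmap[r][c] for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map whose value at (r, c) is r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]

whole = roi_max_pool(fmap, (0, 0, 4, 6), 2, 2)   # 4 x 6 region
small = roi_max_pool(fmap, (1, 1, 4, 4), 2, 2)   # 3 x 3 region
print(whole, small)
```

Both regions, although of different sizes, are reduced to the same 2 × 2 shape, which is what allows the subsequent layers to receive feature vectors of equal length.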
The term "visible-surface detection method" refers to algorithms that identify visible items. Users perceive an image with non-transparent items and surfaces, but they cannot see surfaces hidden behind objects that are closer to their eyes. To achieve a realistic screen appearance, these concealed surfaces must be removed. When a 3D item has to be shown on a 2D screen, the areas that are visible from the specified viewing position are identified. In this paper, the model flow of commutator surface defect detection is shown in the figure, and its core idea is to transfer model features to reduce training complexity. Firstly, ResNet trained on ImageNet is transferred to the commutator surface defect detection model. Then, the commutator surface defect dataset is loaded and the convolution feature maps are extracted using the trained ResNet, so that the pre-trained detection module continues to be trained. ImageNet is a picture database organized according to the WordNet hierarchy (at the moment only the nouns), in which each node of the hierarchy is represented by hundreds or thousands of images. The initiative has helped advance computer vision and deep learning research. Researchers can use the data free of charge for non-commercial purposes. Target Data offer the most accurate, fast and complete pre-mover data available as an end-to-end platform for integrated mover commercialization solutions. Some parameters in the trained ResNet network are kept unchanged and used together with the pre-trained detection module. Finally, a commutator surface detection model based on transfer learning is obtained, which is trained on the commutator surface defect dataset and fine-tuned for further model prediction. The flow chart of the commutator surface defect detection and identification algorithm based on transfer learning is shown in Fig. 
4 , in that the ImageNet dataset and commutator surface defect dataset are taken as an input for feature training and feature training process. In that, machine vision-based surface defect equipment has replaced artificial visual inspection in a broad range of industrial areas such as 3C, cars, domestic appliances, production machinery, semi-conductors and electronics, chemical, pharmaceutical, aero-aerospace and lighting sector. Then, the feature transfer data and pre-training are processed in commutator surface defect detection model construction further the process followed in fine-tuning model to obtain the feature extraction module and detection module. The extraction of the feature is a technique of reducing the dimensions in which a big group of pixels in an image are represented effectively to capture interesting parts of the picture. Finally, the model prediction is obtained. In transfer learning, only part of the feature extraction layer applied to commutator surface detection is transferred, not the whole network. Therefore, the network structure of surface defect detection for commutator is defined first. In this paper, the surface defect detection of commutator is improved on the basis of Faster R-CNN network. As mentioned above, the feature extraction layer of Faster R-CNN network uses ZF or VGG-16, which is replaced by a more expressive residual network and transferred based on the ImageNet dataset training model. The convolutional neural network ZFNet is a classic convolutional neural network. they are difficult to calculate and memory demanding. In this paper, the structure of the commutator surface defect network is shown in Fig. 5 . It is observed that the network in this paper mainly consists of three parts: transfer network module, region proposal network module and detection module. 
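The idea that only part of the pre-trained network is transferred, and that some transferred parameters stay fixed during fine-tuning, can be sketched without any deep learning framework. Everything specific below is a hypothetical stand-in: the layer names, the scalar "weights", the choice of which prefixes are transferred and which layer is frozen, and the helper names `transfer_backbone` and `sgd_step`. It is a sketch of the general pattern, not the authors' training procedure.

```python
def transfer_backbone(pretrained, detector_init, transferred_prefixes):
    """Copy matching feature-extraction parameters from the pre-trained model.

    Parameters that exist only in the pre-trained model (e.g. its classifier
    head) are not copied; parameters that exist only in the detector keep
    their initial values.
    """
    params = dict(detector_init)
    transferred = set()
    for name, value in pretrained.items():
        if name in params and name.startswith(tuple(transferred_prefixes)):
            params[name] = value
            transferred.add(name)
    return params, transferred


def sgd_step(params, grads, lr, frozen):
    """One gradient step that leaves the frozen parameters unchanged."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Hypothetical scalar "weights" standing in for whole layers.
pretrained = {"conv1.w": 0.8, "layer1.w": -0.3, "fc.w": 1.5}
detector_init = {"conv1.w": 0.0, "layer1.w": 0.0, "rpn.w": 0.1, "det.w": 0.2}

params, transferred = transfer_backbone(pretrained, detector_init, ("conv1", "layer1"))
grads = {name: 1.0 for name in params}
updated = sgd_step(params, grads, lr=0.1, frozen={"conv1.w"})
print(sorted(transferred), updated)
```

In this toy run the classifier weight `fc.w` is never copied, `conv1.w` keeps its pre-trained value after the update, and the remaining parameters are fine-tuned.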
The combination of the feature maps F1, F2 and F3 of different scales generated by the transfer network forms the feature pyramid, which is the input of the region proposal network. To accept the multi-scale feature map input, the region proposal network is modified appropriately, and the detection module follows the construction of Fast R-CNN. The feature extraction network in this paper is transferred from the ResNet-101 network trained on ImageNet. ResNet-101 is a 101-layer deep neural network; a version of the network pretrained on more than a million images can be loaded from the ImageNet database. The specific structure of ResNet is as follows, and the details of the network are shown in Table 1 . The region proposal network (RPN) was introduced in Faster R-CNN; it takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. In this paper, some improvements have been made to the RPN to support multi-scale detection and the multi-scale feature map output of ResNet-101, so that commutator surface defects of different scales can be detected better. The MS-RPN network described in this paper decomposes the original RPN network into four parts (sub-RPN networks). In each sub-RPN network, the process of generating region proposals is unchanged: a small network slides over the feature map of the corresponding scale and is fully connected to an n × n sliding window of the input convolution feature map. Each sliding window is mapped to a low-dimensional vector that is fed into two sibling fully connected layers, namely the box regression layer (reg) and the box classification layer (cls). Bounding box regression is an important technique for refining or predicting localization boxes in modern object detection algorithms. 
Bounding box regression is typically trained to regress from region proposals or fixed anchor boxes to nearby bounding boxes of pre-defined target classes. A classification layer computes the cross-entropy loss, optionally weighted, for classification tasks with mutually exclusive classes. For example, for a network with K classes, a fully connected layer with K outputs and a softmax layer should precede the classification layer. In each sliding window, k region proposals are predicted at the same time, so the reg layer has 4k outputs. The cls layer outputs 2k scores to estimate the probability of object/non-object for each proposal. The k region proposals are parameterized relative to k reference boxes (called anchors). Unlike the original RPN, each sub-RPN network uses only one scale (the original RPN uses 3) and three aspect ratios. Each anchor is centered on a sliding window. Specifically, each sub-RPN network generates k = 3 anchors at each sliding location (k = 9 in the RPN). Therefore, there are a total of W × H × k anchors on a convolution feature map of size W × H (Fig. 6 ). The loss function determines the network training associated with objects in a single image. Because this network has multiple branches, the loss function of the network is redesigned on the basis of Faster R-CNN. The loss function of MS-RPN consists of two losses: the classification loss $L_{cls}$ and the bounding box regression loss $L_{reg}$. For any training sample, the loss function is given by Eq. (1), where $i$ is the index of the $i$-th candidate box in a mini-batch, $p_i$ is the predicted probability that the $i$-th candidate box is a defect, and $p_i^*$ is the corresponding label, which is 1 if the $i$-th candidate box is a defect and 0 otherwise. 
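The anchor layout described above (one scale and three aspect ratios per sub-RPN, hence k = 3 anchors per sliding position and W × H × k anchors in total) can be sketched as follows. The stride, the anchor scale and the concrete ratios 1:2, 1:1 and 2:1 are assumptions chosen for illustration; the text states only that three aspect ratios are used.

```python
import math


def make_anchors(feat_w, feat_h, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Return one (x0, y0, x1, y1) anchor per feature-map cell and aspect ratio.

    `ratios` are height/width ratios; every anchor has area scale**2.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of the sliding window, in input-image pixels.
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for ratio in ratios:
                w = scale / math.sqrt(ratio)
                h = scale * math.sqrt(ratio)
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Hypothetical sub-RPN: 4 x 3 feature map, stride 16, single scale of 64 pixels.
anchors = make_anchors(feat_w=4, feat_h=3, stride=16, scale=64)
print(len(anchors))   # W * H * k anchors
print(anchors[1])     # the square (1:1) anchor at the first sliding position
```

Running one such generator per pyramid level, each with its own single scale, gives the multi-scale coverage that the original RPN obtains from three scales on one feature map.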
For a training sample assigned to the $k$-th branch, the loss in Eq. (1) is

$$l^{k}\left(\{p_i^{k}\}, \{t_i^{k}\} \mid M\right) = L_{cls}\left(p_i^{k}, p_i^{k*}\right) + L_{reg}\left(t_i^{k}, t_i^{k*}\right) \tag{1}$$

where $p_i^{k}$ is the probability that the $i$-th region proposal is a defect; $p_i^{k*}$, equal to 0 or 1, is the corresponding label; $t_i^{k}$ is the vector representing the predicted candidate box coordinates; and $t_i^{k*}$ is the coordinate vector corresponding to the true boundary. Following Faster R-CNN [11] , the classification loss function $L_{cls}$ and the regression loss function $L_{reg}$ are defined as

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right] \tag{2}$$

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{3}$$

where $R$ is the smooth L1 function and $t_i = \{t_x, t_y, t_w, t_h\}$ is a vector representing the predicted parameterized candidate box coordinates. The loss function for the entire network is

$$L(M) = \sum_{k=1}^{4} \lambda_k \sum_{i \in S_k} l^{k}\left(\{p_i^{k}\}, \{t_i^{k}\} \mid M\right)$$

where $k \in [1, 4]$ indexes the $k$-th branch of the detection network, $\lambda_k$ is the loss weight of the $k$-th branch and $S_k$ is the training sample set of the $k$-th branch.

From the pictures of commutators with scratch, bruise, crack, oil stain and black spot defects taken in a commutator production workshop, 6600 pictures with scratches, 6800 with bruises, 6900 with cracks, 9500 with oil stains and 8400 with black spots were obtained. The data are then expanded using dataset expansion operations such as horizontal flipping and rotation. To facilitate dataset labeling, all images are ground-truth labeled using the dataset labeling tool (Labeling). When the labeling is complete, a fixed-format XML file is generated that contains the number of the image and the center coordinates of the commutator surface defect in the image, as well as its width and height. A set of samples that have been tagged with one or more labels is referred to as labeled data. In most cases, labeling takes a set of unlabeled data and adds useful tags to each component. Humans can be asked to make judgments about an unlabeled piece of data to obtain labels. 
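As a small illustration of how a label stored as center coordinates plus width and height behaves under the horizontal-flip expansion mentioned above, consider the sketch below. The tuple layout `(cx, cy, w, h)`, the continuous pixel-coordinate convention and the function names are assumptions; the exact XML schema of the labeling tool is not given in the text.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center x, center y, width, height) label to (x0, y0, x1, y1)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def hflip_box(box, image_width):
    """Mirror a (cx, cy, w, h) label when the image is flipped left to right.

    Only the horizontal center moves; the vertical center and the size are unchanged.
    """
    cx, cy, w, h = box
    return (image_width - cx, cy, w, h)


# Hypothetical defect label on a 100-pixel-wide image.
label = (30, 40, 20, 10)
flipped = hflip_box(label, image_width=100)
print(flipped)
print(center_to_corners(*flipped))
```

Flipping twice returns the original label, which is a convenient sanity check when writing augmentation code.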
Figure 7 shows the dataset being labeled with image labeling software: the green box marks the ground truth, and the XML file saved after the labeling is complete is shown on the right. Some original and labeled samples are shown in Fig. 8 . In the experiment, the model was evaluated based on accuracy, recall rate, F1 value and average time estimate:

$$\text{Accuracy} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Accuracy} \times \text{Recall}}{\text{Accuracy} + \text{Recall}}$$

where TP = the correct part in the detection; TP + FP = the total detection recognition result; TP + FN = the total detection sample. In this paper, the proposed algorithm was compared with the Faster R-CNN algorithm on the commutator surface defect dataset. HOG and YOLO are two of the most often used object detection approaches. YOLO is based on deep neural networks, whereas HOG is a feature descriptor that has been shown to perform well with SVM and similar machine learning models. In the experiment, a detection was considered successful only when the IOU of the detected defect area and the true labeled defect area was greater than 70%, and the recall rate and accuracy rate were then calculated to obtain the F1 value. Each experiment was repeated 3 times, and the mean of the 3 F1 values was taken as the evaluation result. The statistical results of the experimental data are shown in Table 2 . As shown in Table 2 , the accuracy of this method for detecting scratches, bruises, cracks, oil stains and black spots reaches 84.14%, 88.81%, 88.35%, 92.86%, 92.86% and 92.53%, respectively. 
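The evaluation protocol described above (a detection counts as correct only if its IOU with a labeled defect exceeds 70%, after which accuracy, recall and F1 are computed from TP, FP and FN) can be sketched as follows. The greedy one-to-one matching rule and the example boxes are assumptions for illustration; the paper does not state how detections are matched to ground-truth boxes.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(detections, ground_truths, iou_threshold=0.7):
    """Return (accuracy, recall, f1), with accuracy defined as TP / (TP + FP)."""
    matched = set()
    tp = 0
    for det in detections:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue  # each labeled defect can be matched only once
            value = iou(det, gt)
            if value > best_iou:
                best_iou, best_j = value, j
        if best_j is not None and best_iou > iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    accuracy = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    total = accuracy + recall
    f1 = 2 * accuracy * recall / total if total > 0 else 0.0
    return accuracy, recall, f1


# Hypothetical boxes: one good detection, one miss and one detection that overlaps too little.
ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 21, 31, 31)]
accuracy, recall, f1 = evaluate(detections, ground_truths)
print(accuracy, recall, f1)
```

In this toy case the third detection overlaps a labeled defect with an IOU of about 0.68, so under the 70% rule it is counted as a false positive rather than a correct detection.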
Compared with the original Faster R-CNN, the overall accuracy of the proposed method increases by nearly 14 percent, mainly because the ResNet network is first transferred to the field of commutator surface defect detection using transfer learning. Compared with the backbone network ZF or VGG-16 of the original Faster R-CNN, this network has more powerful feature expression ability and can extract more useful information to detect and identify defects. Moreover, ResNet can output multi-scale feature maps. On this basis, the improved MS-RPN network greatly enhances the adaptability of the network to scale changes and achieves higher detection accuracy, especially for small-scale defects such as small black spots and scratches. Since the feature extraction network in this model is based on ResNet transferred from training on the large ImageNet dataset, a group of comparative experiments is set up in this paper to verify whether the added transfer learning helps the detection and recognition of commutator surface defects: the commutator surface defect detection model is trained with and without transfer learning, respectively. As shown in Fig. 9 , the loss comparison between the two models shows that the network with transfer learning performs better. In this paper, RPN is selected as the reference against which the performance of MS-RPN is compared. In the experiment, if the IOU of the predicted area and the ground truth is greater than 0.7, the detection is considered successful. Only the number of region proposals is changed, and other settings remain unchanged. The curves relating the number of region proposals to the recall rate of MS-RPN and RPN are shown in Fig. 10 . Recall for both networks increases with the number of region proposals. 
Figure 10 shows that the recall rate obtained with 500 region proposals in RPN can be achieved with only 200 region proposals in MS-RPN, which suggests that reducing the number of region proposals helps improve the efficiency of detection.

In this paper, Faster R-CNN, a target detection algorithm based on deep learning, is applied to the surface defect detection and recognition of the commutator, which overcomes the problem that traditional detection methods cannot adapt to the changing practical application environment. In addition, according to the theory of transfer learning, the ResNet network trained on ImageNet is transferred to the proposed commutator surface defect detection model, which achieves a more powerful feature expression ability. Moreover, in order to deal with defects of different scales, a multi-scale region proposal network, MS-RPN, which is better suited to the ResNet network, is proposed. Tests on the commutator surface defect dataset show that the algorithm improves significantly on the Faster R-CNN algorithm. Therefore, compared with the original Faster R-CNN algorithm, the proposed algorithm is more suitable for commutator surface defect detection and identification. The proposed method has an accuracy of 84.14%, 88.81%, 88.35%, 92.86%, 92.86% and 92.53% for detecting scratches, bruises, cracks, oil stains and black spots, respectively.