title: Convolutional Neural Networks for Automatic Classification of Diseased Leaves: The Impact of Dataset Size and Fine-Tuning
authors: Caluña, Giovanny; Guachi-Guachi, Lorena; Brito, Ramiro
date: 2020-08-24
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58799-4_68

One of the major concerns for agricultural productivity is the early detection of diseases in crops. Recently, some researchers have begun to explore Convolutional Neural Networks (CNNs) in the agricultural field for leaf disease identification. A CNN is a category of deep artificial neural network that has demonstrated great success in computer vision applications, such as video and image analysis. However, its drawbacks are the demand for a huge quantity of data covering a wide range of conditions, as well as careful fine-tuning to work properly. This work explores and compares five outstanding CNN architectures to determine their ability to correctly classify a leaf image as healthy or unhealthy. Experimental tests are performed on an unbalanced and small dataset composed of healthy and diseased leaves. In order to achieve high accuracy with the explored CNN models, a fine-tuning of their hyperparameters is performed. Furthermore, some variations are applied to the raw dataset to increase the quality and variety of the leaf images. Preliminary results provide a point of view for selecting CNN architectures for leaf disease identification based on accuracy, precision, recall and F1 metrics. The comparison demonstrates that, without considerably lengthening the training, ZFNet achieves high accuracy and increases it by 10% after 50 K iterations, making it a suitable CNN model for the identification of diseased leaves using datasets with small variation, few classes and small sizes.

One of the major issues affecting farmers around the world is plant disease caused by various pests, viruses, and insects, resulting in huge monetary losses. The best way to relieve and combat this is timely detection. In almost all real solutions, early detection is performed by humans, which is an expensive and time-consuming approach. Automatic approaches for leaf disease identification aim to identify the signs of diseases at an initial phase and decrease the huge effort of human observation in large farms. These methods involve several stages such as image acquisition (to obtain images to be analyzed), image pre-processing (to enhance image quality), feature extraction (to get useful properties from image pixels), and classification (to categorize an image into classes based on its discriminating features). Recent research has begun to explore CNN approaches in medicine [1], mechanical engineering [2] and in the agricultural field to identify plant diseases and pests from images. CNNs, a category of deep learning approaches, have been extensively adopted for image classification purposes. They were introduced for general-purpose analysis of visual imagery and are characterized by their ability to learn key features on their own. Their adoption in specific fields brings several challenges, such as choosing appropriate hyperparameter values (learning rate, step, and maximum number of iterations), as well as providing a training dataset of sufficient quantity and quality.
In order to determine how the CNN structure, hyperparameter values and data quality influence the performance reached in the classification of leaf diseases, this work explores and compares five outstanding standard CNN architectures: Inception V3 [3], GoogLeNet [4], ZFNet [5], ResNet 50 [6], and ResNet 101 [6], which outperform other image classification techniques. Furthermore, we evaluate some pre-processing tasks on the Plant Village dataset [7], a set of images of diseased and healthy crops such as pepper, tomato and potato, and perform fine-tuning of the hyperparameters of each CNN model. The remainder of this paper is organized as follows. Section 2 describes the most relevant related works. The raw dataset and the operations applied to increase its quality and size are described in Sect. 3. Explored CNN architectures are presented in Sect. 4. The experimental setup is presented in Sect. 5. Experimental results, obtained on the processed datasets, are gathered and discussed in Sect. 6. Finally, Sect. 7 deals with the concluding remarks and future work.

Among the various CNN architectures used in the agricultural field, CaffeNet, AlexNet, GoogLeNet, VGG, ResNet, and Inception V3 have been utilized as the underlying model structure, aiming at exploring the performance of identifying types of leaf diseases from healthy leaves of single or multiple crops such as tomato, pear, cherry, peach, and grapevine [8-15]. The effectiveness of a CNN model in identifying leaf diseases on single or multiple crops depends mainly on the quantity and quality of the images. The main benefit of working with multiple crops is to enrich the learning of image feature descriptions, but it presents drawbacks when diseases have to be identified at early stages [16]. Several researchers have explored structural changes of some standard CNN models. In [17], the LeNet CNN architecture was modified to identify three maize leaf diseases from healthy leaves. Based on the ResNet 50 CNN architecture, three different architectures were proposed in [16] to create a single multi-crop CNN model. The proposed architecture is capable of integrating contextual meta-data of the plant species to identify seventeen diseases of five crops: wheat, barley, corn, rice and rape-seed. For experimental tests, the proposed multi-crop CNN model used a dataset constructed with more than one hundred thousand images taken by cellphone in real fields. In addition, some optimization studies have been introduced to address the problems of too many parameters and the computation and training time consumed by CNN models. For instance, a global pooling dilated CNN is implemented in [18] to address the excessive number of parameters of the AlexNet CNN when recognizing six common cucumber leaf diseases. The impact of the size and variety of the datasets on the effectiveness of CNN models is studied in [19]. The study discussed the use of transfer learning on GoogLeNet to deal with the problem of leaf disease recognition based on an image database of 12 plant species including rice, corn, potato, tomato, olive, and apple. An empirical exploration of fine-tuning and transfer learning on VGG 16, Inception V4, ResNet (with 50, 101 and 152 layers) and DenseNet (with 121 layers) is addressed in [20]. The results show that DenseNet constantly improves its accuracy with an increasing number of epochs, with no signs of overfitting or performance degradation.
On the other hand, several specialized CNN architectures have been designed with a reduced number of hidden layers to provide fast and accurate leaf disease identification, such as [21], which used only two hidden layers to differentiate healthy and unhealthy images of rice leaves. In [22], a novel architecture based on a three-channel CNN with four hidden convolutional layers is presented. It is proposed for the identification of vegetable leaf diseases by combining the color components of the diseases. Besides, a hybrid convolutional encoder network is proposed in [23]. It identifies six classes of diseases on potato, tomato and maize crops.

CNNs demand huge amounts of varied data, otherwise the model may not be robust or may overfit. Therefore, the raw dataset was augmented by performing pre-processing operations such as illumination changes (mainly based on brightness and contrast variation), data augmentation, and the addition of random classes. Figure 1 shows some examples from the original dataset and the output after the transformations performed over the raw data. The current work used the Plant Village dataset [7], a publicly accessible dataset that contains 13 K RGB images of healthy and infected plant leaves, with resolution 256 × 256. The dataset contains 26 different anomalies in 14 crop species including apple, blueberry, cherry, corn, grape, orange, peach, bell pepper, potato, raspberry, soybean, squash, strawberry and tomato. In this work, all images of diseased leaves were labeled as unhealthy, resulting in a raw dataset with two classes (healthy and unhealthy leaves). B. Data Augmentation: It enriched the data with augmented images resulting from: rotating the image with probability = 1 between −5° and 5°, random distortions with grid width = 4, grid height = 4 and magnitude = 8, left-right flips, top-bottom flips, and random zooms with probability = 0.5 and percentage area = 0.8 (where probability indicates how often the operation is applied and magnitude is the strength of the distortion). These operations were performed using the Augmentor library [24] and produced 9 K additional leaf images. C. Random Classes Added: In order to distinguish healthy from diseased leaves, this procedure altered the raw dataset by adding 6 classes containing images of Airplane, Brain, Chandelier, Turtle, Jaguar, and Dog. This resulted in 1,238 additional images.
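As a concrete illustration of the data augmentation step described above, the following is a minimal sketch of how such a pipeline could be assembled with the Augmentor library. The directory paths, the distortion probability and the flip probabilities are assumptions (only the rotation, distortion and zoom parameters are reported in the text).

```python
# Minimal sketch of the augmentation pipeline described above (Augmentor library).
# The input/output paths and the flip/distortion probabilities are assumptions; the
# rotation, grid and zoom parameters follow the values reported in the text.
import Augmentor

pipeline = Augmentor.Pipeline(
    source_directory="plantvillage/train",   # hypothetical path to raw training images
    output_directory="augmented")            # hypothetical output folder

# Rotation applied to every image, between -5 and +5 degrees
pipeline.rotate(probability=1.0, max_left_rotation=5, max_right_rotation=5)

# Random elastic distortions on a 4x4 grid with magnitude 8 (probability assumed)
pipeline.random_distortion(probability=1.0, grid_width=4, grid_height=4, magnitude=8)

# Left-right and top-bottom flips (probabilities assumed, not given in the paper)
pipeline.flip_left_right(probability=0.5)
pipeline.flip_top_bottom(probability=0.5)

# Random zooms keeping 80% of the original area
pipeline.zoom_random(probability=0.5, percentage_area=0.8)

# Generate the ~9 K additional leaf images mentioned in the text
pipeline.sample(9000)
```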
Inception V3 [3], GoogLeNet [4], ResNet 50 [6] and ResNet 101 [6] were selected because of their successful performance in image classification tasks [25, 26]. On the other hand, ZFNet [5] was also explored due to its simplicity and low computational cost. The architectures of the explored CNN models are schematized in Fig. 2 (a) GoogLeNet [4]; b) Inception V3 [3]; c) ResNet 50 [6]; d) ResNet 101 [6]; e) ZFNet [5]). Each one is characterized by a repeating sequence of convolutional layers, layers of activation functions, pooling layers, specialized modules (such as Inception and Residual modules aimed at efficient computation) and fully connected layers to output the classification label. Each layer type is named according to the operations it performs. In this work, the last fully connected layer was adjusted to support two output classes.
-Convolutional Layer: It consists of a set of filters to extract different features from the input image. The main purpose of each filter is to build a feature map that detects simple structures such as borders, lines and squares, which are grouped in subsequent layers to construct more complex shapes. In mathematical terms, convolution computes dot products between the input data and the entries of the filters to learn and detect patterns from the previous layer, as shown in Eq. 1:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n),    (1)

where I represents the image and K the filter, also often called kernel; m and n represent the dimensions of the filter, typically 3×3, 5×5, 7×7 or 11×11; and (i, j) represents the pixel where the convolution operation is performed. For multi-channel inputs the sum also runs over the r channels of the image, where r is the number of channels. The filters are slid from the top-left to the bottom-right corner, applying the convolution operation to create the feature map.
-Activation Functions: The convolution process generates many negative values. These values are not useful for the next layers and produce additional computational load. Therefore, after a convolutional layer, an activation function is applied. The activation function sets negative values to zero, which helps achieve faster and more effective training. The Rectified Linear Unit (ReLU) is the most widely used in deep learning techniques. ReLU aims to increase the non-linearity in the images, setting the value to 0 if the input is less than 0 and leaving it unchanged otherwise.
-Pooling Layer: The pooling operation splits the convolved features into disjoint regions of a size given by a filter F. Then, the 'max' or 'mean' value is taken from each disjoint region to preserve the main features and progressively reduce the number of parameters, and thus the computational load, by sub-sampling the spatial size of the input.
-Fully Connected Layer (FC): It takes the values from the previous layer and maps them into a single vector of class probabilities, from which the class the input belongs to is obtained.
-Inception Module: It is used in deeper CNNs [4] to provide efficient computation and dimensionality reduction. This module is designed with stacked 1 × 1 convolutions after the max-pooling layer. It applies the convolution operation through filters of different sizes to extract the most diverse and relevant parameters during training. The different feature maps obtained are concatenated before the next layer. A significant change to this kind of module was proposed by [3], which applied a factorization operation to the inception modules by taking large filters and dividing them into two or more convolutions with smaller filters. The factorized inception module produces about 12× fewer parameters than AlexNet [27] (one of the most commonly implemented architectures).
-Residual Module: In this module, each layer is the input for the next layer and also directly feeds into layers about 2-3 hops away. The Residual Module characterizes the ResNet [6] CNN architecture and aims to address the vanishing gradient problem [28]. The residual learning function is given by Eq. 2:

H(X) = F(X) + X,    (2)

where X is the input of the block, F(X) is the residual function and H(X) is the final output of a residual block. This reformulation was done on the basis that a shallow network has less training error than its deeper counterpart, so the residual modules ensure that deeper networks have an error no greater than their shallower counterparts.
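To make Eq. 2 concrete, below is a minimal NumPy sketch of the identity shortcut H(X) = F(X) + X. The two-layer residual function F is illustrative only and does not reproduce the actual ResNet 50/101 blocks, which use convolutions and batch normalization.

```python
# Minimal NumPy sketch of the identity shortcut in Eq. 2: H(X) = F(X) + X.
# The toy residual function F(x) = W2 · ReLU(W1 · x) is an assumption for
# illustration; real ResNet blocks use convolutions and batch normalization.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Apply a toy residual function F and add the identity shortcut."""
    f_x = w2 @ relu(w1 @ x)   # residual function F(X)
    return relu(f_x + x)      # H(X) = F(X) + X, followed by a non-linearity

# Example: a 4-dimensional input with random weights of matching shape
rng = np.random.default_rng(0)
x = rng.normal(size=4)
w1 = rng.normal(size=(4, 4))
w2 = rng.normal(size=(4, 4))
print(residual_block(x, w1, w2))
```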
The five explored CNN architectures for the classification of leaf images are described in the following subsections.

ZFNet is an enhanced modification of the AlexNet [27] architecture, formed by eight layers and characterized by its simplicity. Its convolutional layers are each followed by a max pooling layer. Each convolutional layer uses a decreasing filter size, starting with a filter size of 7 × 7, as shown in Fig. 2e). ZFNet supports the hypothesis that bigger filters lose more pixel information, which can be preserved with smaller filters to improve accuracy. This model uses the ReLU activation function and the iterative optimization of mini-batch stochastic gradient descent as the learning algorithm, computing the cost function over a certain number of training examples for each mini-batch.

GoogLeNet [4], also called Inception V1, uses Inception Modules to build deeper networks with more efficient computation. It includes 1 × 1 convolutions in the middle of the network and global average pooling at the end, which replaces the fully connected layers, saving a large number of parameters and improving accuracy. GoogLeNet is 22 layers deep, starting with a filter size of 7 × 7 that decreases through the network, followed by 9 inception modules, as illustrated in Fig. 2a).

GoogLeNet-Inception V3 [3] is an improved version of GoogLeNet, achieving higher computational efficiency with fewer parameters. With 42 layers of depth, its computational cost is only around 2.5× higher than GoogLeNet [4], and it is much more efficient than VGGNet [29]. Its most relevant features are the factorization technique applied through the network and its inception modules. The factorization is applied in order to reduce the number of parameters while keeping the network efficient. Inception V3 starts with small filters of 3 × 3 and keeps them until reaching the inception modules, as shown in Fig. 2b).

The Residual Networks, ResNets, represent very deep networks of up to 152 layers. They learn residual representation functions instead of learning the signal representation directly. ResNet 101 and ResNet 50 [6] include skip connections, also called shortcut connections, to fit two or more stacked layers to a desired residual mapping instead of hoping they fit it directly. The skip connection is a solution to the degradation problem (accuracy saturation) that arises in the convergence of deeper networks. Skip connections make deeper networks effective by improving the learning process. The main difference between ResNet 101 and ResNet 50 is the number of convolutional layers: ResNet 101 has twice as many layers as ResNet 50, as depicted in Fig. 2d) and c), respectively.

For experimental tests, each dataset was split into three sets with percentages of 85%, 10%, and 5% for training, validation, and testing, respectively. The training set was processed to generate an LMDB dataset. LMDB is a compressed format for working with large datasets; all images were sampled randomly from the training set. The explored CNNs received the generated LMDB as input. The CNN models were trained using the Python interface of the Caffe framework, which is well suited to research code and rapid prototyping. All experiments were performed using a Tesla K80 Nvidia Graphics Processing Unit (GPU) that provides 11,441 MiB of GPU memory, which allowed using the original image resolution of 256 × 256. In order to determine how complex conditions in the dataset impact the performance of the CNN models applied to leaf disease identification, this work trains the explored CNN models with the original raw data and with the training datasets augmented using three different operations (illumination changes, data augmentation and random classes added).
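A minimal sketch of how training could be launched through Caffe's Python interface is shown below. The solver file name and GPU id are assumptions; the hyperparameters referenced by the solver (learning rate, step size, maximum number of iterations) are discussed next.

```python
# Minimal sketch of launching training through Caffe's Python interface (pycaffe).
# The solver file name and GPU id are assumptions; the solver prototxt is expected
# to point to the train/validation LMDBs and the network definition.
import caffe

caffe.set_device(0)      # single GPU, as in the Tesla K80 setup described above
caffe.set_mode_gpu()

solver = caffe.SGDSolver('zfnet_solver.prototxt')   # hypothetical solver file

# Run the optimization up to max_iter defined in the solver, snapshotting and
# testing according to the solver settings
solver.solve()
```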
In addition, the crucial hyperparameters such as learning rate (lr), step, and maximum number of iterations (#It.) were fine-tuned, aiming at identifying a fast and accurate CNN model. The lr is the most critical hyperparameter to tune in order to achieve good performance. It controls how quickly or slowly a CNN model learns a problem: a value that is too small might produce a long training process that could get stuck, while a value that is too large might yield a sub-optimal set of weights, learning too fast, or an unstable training. Step defines a decaying lr policy, which reduces the lr value over the total number of iterations at a given step size. A smaller step value results in more frequent reduction of the lr over the total number of training iterations, which may slow the training process due to the resulting small lr. #It. determines at which iteration the highest accuracy might be achieved. The CNN models differ in their hyperparameters; the default values of all hyperparameters are shown in Table 1. This work used Stochastic Gradient Descent (SGD) in the training process to compute the gradient of the cost function for each batch.

Accuracy = (Tp + Tn)/(Tp + Tn + Fp + Fn), precision = Tp/(Tp + Fp), recall = Tp/(Tp + Fn) and F1 = 2 × (recall × precision)/(recall + precision) were computed to measure the ability of the explored CNN models to classify leaf images into the corresponding class, and to determine how the different operations over the raw dataset and the hyperparameter fine-tuning influence the performance achieved. The parameters Tp, Tn, Fp, and Fn refer to a healthy leaf correctly classified as healthy, an unhealthy leaf correctly classified as unhealthy, a healthy leaf classified as unhealthy, and an unhealthy leaf classified as healthy, respectively. In this sense, accuracy quantifies the ratio of correctly classified samples, precision quantifies the ratio of correctly classified positive samples over all instances classified as positive, recall measures the ratio of correctly classified positive samples over the total number of positive samples, and F1 is the harmonic mean between precision and recall, which shows how accurate and robust a classifier is.
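As a quick reference, the sketch below computes these four metrics from the confusion counts. It is a straightforward transcription of the formulas above; the example counts are arbitrary and not taken from the paper's results.

```python
# Straightforward transcription of the evaluation metrics defined above.
# The confusion counts tp, tn, fp, fn used in the example call are arbitrary values.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=420, tn=380, fp=45, fn=55)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```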
Using the raw dataset with only two classes and minor variations among all samples, the experimental results depicted in Fig. 3 clearly show that ZFNet takes advantage of the default training settings presented in Table 1. Indeed, ZFNet reaches the highest values (above 90%), overcoming all the explored models at iteration 10 K, which is attributed to its shallowness and simplicity. Experimental tests also demonstrate that deeper networks achieve lower values, which can be improved only if more training iterations are performed with larger and more varied datasets. In this sense, ResNet 101 achieves the worst accuracy, precision, recall and F1 values. These results are due to the dataset characteristics, which tend to cause overfitting problems in deeper models. In addition, ResNet 50 shows an improvement in all measures compared to ResNet 101, particularly because of the smaller number of layers and parameters managed by the model. As is well known, dataset size and poor data variation have a fundamental impact on the overall performance of deeper CNNs. For this reason, ResNet 101 was examined to determine the impact of the different pre-processing operations that vary the raw leaf images and dataset size. Results reported in Fig. 4 show that both the illumination variation and additional random classes datasets, experiments B and C respectively, achieve lower accuracy than the raw dataset. In the illumination variation case, this is attributed to overfitting produced by the repetition of the same image, whereas with the random classes included, the accuracy decreases by 5%. This happens because, although the variation in the dataset size is increased, the difference between the two main classes (healthy and unhealthy leaves) remains the same. Overall, the augmented dataset reaches a notable increase in accuracy due to the addition of new relevant information to the main classes, which avoids the overfitting produced in the previous cases. This can be affirmed because ResNet 101 accuracy increases by a significant percentage, around 20%, in experiment D as more data variability is applied. Similar behaviour can be seen in the precision values. Precision decreases with illumination changes, increases slightly with data augmentation and the additional random classes, while in experiment D, with all the changes combined, the precision obtains the best performance. It is important to highlight that precision is quite a relevant measure in this work, since it establishes the ability of the explored CNN model to identify diseased leaves. On the other hand, the F1 and recall scores reach their highest values in experiment B, using the augmented dataset. In other words, applying the variations hinders the deep CNN model when it has to identify healthy leaves, but its performance improves when it has to identify diseased leaves.

Fine-tuning aims to increase the performance of the explored CNN models by making small variations on their critical hyperparameters. The results obtained from training with the integrated dataset (the raw dataset with all changes applied to it) are shown in Table 2. From the scores obtained by ZFNet, it is clear that shallow CNN models often lead to a higher number of parameters; besides, they can be efficient models for the diseased-leaf classification process when the number of classes, data variations and dataset sizes are limited. With few iterations in test 5, ZFNet achieves high accuracy and increases it by 10% after 50 K iterations. On the other hand, each explored CNN model behaves differently for the evaluated hyperparameter values. For instance, GoogLeNet is highly susceptible to a high lr, and hence the values obtained from tests 1 and 2 are lower than 50% in all metrics. The improvement reached in the rest of the tests is produced by its large default batch size, which gives fast learning with a smaller lr. GoogLeNet also achieves great precision, reaching values up to 90% in test 5, which was trained with just 547 iterations. This marks an improvement of 10% over the scores obtained with the raw data. Inception V3 reaches its highest accuracy (77%) under the conditions established in test 2. A relevant fact appears in test 6, where the model shows an abrupt change from the previous tendency: the accuracy of test 6, with a small lr and many iterations (apparently the optimal conditions for Inception V3), dropped to its worst point (under 50%). This is presumably due to overfitting caused by the large #It. and small lr. The highest precision score is obtained with lr = 0.01 and a large number of iterations (around 40 K). The recall and F1 scores retain the tendency to increase towards high values, except in test 6, where all results were deficient.
ResNet 101 scores depend on the number of iterations and, to a lesser extent, on the lr value. This is demonstrated by the accuracy reached by the model, which does not change significantly unless the #It. is increased, a fact attributed to the depth of the model. Its highest accuracy is 86%, which represents a significant increase over the experiments with raw data and is attributed to the new variation in the dataset. The model also shows great performance in the precision, recall and F1 score metrics. On the contrary, the accuracy obtained with ResNet 50 was under 55% in all tests, a very low performance compared with the other models. This fact is mostly attributed to the dataset used for the different tests (called "all changes"), because the accuracy remained poor under all the variations performed on the model.

In this work, an empirical analysis of five state-of-the-art CNNs was carried out. The explored CNN architectures include GoogLeNet, Inception V3, ResNet 50, ResNet 101 and ZFNet. They differ in the number of convolutional layers, the number of parameters and the way these are arranged. From the experiments, ZFNet shows a growing trend in accuracy, precision and F1 scores with no signs of overfitting or performance degradation. ZFNet obtained a 93% validation accuracy at 50 K iterations, beating the rest of the explored CNN models for the identification of diseased leaves using datasets with a small number of classes, limited data variation and small size. Based on the obtained results, as expected, all CNN models require a careful fine-tuning of their hyperparameters, because they do not show direct correlations between hyperparameter values and their depth. As future work, we propose to extend this study by exploring more pre-processing techniques to improve the quality and quantity of a limited dataset and thus accomplish a reliable classification for several leaf diseases.

Automatic colorectal segmentation with convolutional neural network
Automatic microstructural classification with convolutional neural network
Rethinking the inception architecture for computer vision
Going deeper with convolutions
Visualizing and understanding convolutional networks
Deep residual learning for image recognition
An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing
Deep neural networks based recognition of plant diseases by leaf image classification
Assessing the performance of convolutional neural networks on classifying disorders in apple tree leaves
Identification of apple leaf diseases based on deep convolutional neural networks. Symmetry
Can deep learning identify tomato leaf disease?
Deep learning convolutional neural network for apple leaves disease detection
Deep learning models for plant disease detection and diagnosis
How convolutional neural networks diagnose plant disease
Image-based tomato leaves diseases detection using deep learning
Crop conditional convolutional neural networks for massive multi-crop plant disease classification over cell phone acquired images taken on real field conditions
Maize leaf disease classification using deep convolutional neural networks
Cucumber leaf disease identification with global pooling dilated convolutional neural network
Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification
A comparative study of fine-tuning deep learning models for plant disease identification
A deep learning approach for the classification of rice leaf diseases
Three-channel convolutional neural networks for vegetable leaf disease recognition
Seasonal crops disease prediction and classification using deep convolutional encoder network
An analysis of deep neural network models for practical applications
ImageNet classification with deep convolutional neural networks
Learning deep ResNet blocks sequentially using boosting theory
Very deep convolutional networks for large-scale image recognition

Acknowledgement. This work used the supercomputer 'Quinde I' from the public company Siembra EP in the Republic of Ecuador.