key: cord-0057748-lckvwf92
authors: Zhu, Jiongye; Wang, Xiaohan; Lei, Ling; Ye, Minchao; Qian, Yuntao
title: Random Convolutional Network for Hyperspectral Image Classification
date: 2021-03-18
journal: Geometry and Vision
DOI: 10.1007/978-3-030-72073-5_26
sha: f1079a4248211b66b6c4c95056c742a5d2da0e3a
doc_id: 57748
cord_uid: lckvwf92

Convolutional neural network (CNN) has shown remarkable performance in the field of hyperspectral image (HSI) classification owing to its excellent feature extraction ability. However, HSI classification is a small-sample-size problem due to the labour cost of labeling, and CNN may perform poorly on HSI data because of the ill-conditioned and overfitting problems caused by the lack of sufficient training samples. Extreme learning machine (ELM) is a kind of single-layer feedforward neural network (FNN) with high training efficiency, which simplifies the learning of parameters. Therefore, in this paper, we combine the convolutional feature extraction of CNN with the parameter randomization idea of ELM and propose a random convolutional network (RCN) model. The proposed RCN randomly generates the parameters of the three-dimensional (3D) convolution kernels in the convolutional layer used for joint spectral-spatial feature extraction. RCN avoids ill-conditioned and overfitting problems in the case of small samples by significantly reducing the number of parameters to be trained. Further analyses of the convolution kernel sizes and the number of convolution kernels are also carried out. Experiments on two real-world HSI datasets demonstrate that the proposed RCN algorithm has excellent generalization ability.

It is generally known that hyperspectral image (HSI) classification has long been a hot research topic. Over the past decades, various methods have been adopted for higher accuracy, including support vector machine (SVM) [13], random forest [5], sparse logistic regression (SLR) [15], convolutional neural network (CNN) [4], extreme learning machine (ELM) [8, 9], etc. Feature extraction, which derives representative features from the data, is one of the most important steps in HSI classification [1]. Extracting prominent features has a great influence on the generalization performance of classifiers. A series of feature extraction methods have been used, such as deep stacked autoencoders [3], manifold learning [17] and minimum noise fraction (MNF) [6]. However, these methods often lack sufficient feature extraction ability and cannot achieve satisfactory classification accuracy. In contrast, CNN shows outstanding feature extraction performance on HSIs: it uses convolution kernels in its convolutional layers to extract features from the data, and the kernel parameters are updated during training, which avoids a complex and unreliable explicit feature extraction process. However, labeling HSI data costs a lot of manpower, which places high demands on the generalization ability of classifiers in the case of small samples. Moreover, CNN contains a large number of parameters, which means that many training samples are required. When the number of available samples is limited, overfitting and ill-conditioned problems are more likely to occur, which reduces the classification accuracy of CNN.
Besides, CNN is time-consuming because all of its parameters must be updated iteratively during training via the back-propagation (BP) algorithm [12]. Compared with CNN, ELM randomly generates the parameters of its hidden neurons, which reduces the number of parameters to be trained. Therefore, ELM tends to perform well in small-sample-size classification tasks because it averts ill-conditioned problems. Meanwhile, the straightforward closed-form solution of ELM makes it computationally fast. To deal with the imbalance between the number of parameters and the number of available training samples, we propose a new network in this paper, namely the random convolutional network (RCN), which combines the spectral-spatial convolutional feature extraction of CNN and the parameter randomization of ELM. The proposed approach greatly reduces the number of parameters in the model by randomly generating the parameters of the three-dimensional (3D) convolution kernels. Therefore, RCN not only works efficiently but also has high generalization ability.

The rest of this paper is organized as follows. Section 2 introduces CNN and ELM. Section 3 presents the proposed RCN model. The experimental results on HSI data are analyzed in Sect. 4, followed by conclusions in Sect. 5.

CNN is a development of the multilayer perceptron (MLP) [11]. In recent years, it has demonstrated remarkable performance in the field of image classification and attracted considerable attention. CNN consists of an input layer, hidden layers and an output layer. The hidden layers of CNN include convolutional layers, pooling layers and fully-connected layers [7]. The feature extraction of CNN is realized via several convolution kernels [10] in the convolutional layers. Sliding over the input image, each convolution kernel produces a corresponding feature map; the feature maps are then passed to a pooling layer, which downsamples them to reduce the number of parameters while retaining useful information. The fully-connected layer summarizes all previous operations and combines the extracted features. Finally, the output layer produces the classification labels through a logistic or softmax function. As a kind of deep neural network (DNN) [16], CNN trains its parameters by constructing a loss function: after parameter initialization, the input and output values of each layer are computed by forward propagation to obtain the output-layer error, and the parameters are then updated iteratively through the BP algorithm until the loss function is minimized. The weight-sharing structure of CNN reduces the number of parameters and the complexity of network computation. However, plenty of parameters still need to be trained, so numerous training samples are generally required for CNN to reach high classification accuracy. Furthermore, using the BP algorithm to update parameters increases the time complexity of the algorithm.

In traditional artificial neural networks, the parameters of hidden-layer nodes are optimized by iterative algorithms [2]. These iterative steps often make the training of parameters take up a lot of time, so the efficiency of the training process cannot be guaranteed. In order to overcome this drawback of the BP algorithm, ELM was proposed by Huang [9].
The parameters of the hidden-layer nodes are generated randomly, and the output weights of the network are obtained by minimizing a loss function, which can be solved in an explicit form, so no iterative steps are required. For a single-hidden-layer neural network, suppose there are N training samples X = [x_1^T, x_2^T, ..., x_N^T]^T ∈ R^{N×m}, where each row x_i represents the input feature vector of a sample and m is the input feature dimension. The labels are represented by one-hot encoding as T ∈ R^{N×P}, where P is the number of classes. With L hidden neurons, the output of the network can be expressed as

\sum_{j=1}^{L} \beta_j \, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{t}_i, \quad i = 1, 2, \ldots, N, \tag{1}

where g(·) is the activation function (e.g., sigmoid), w_j ∈ R^m denotes the weight vector connecting the input layer to the jth hidden neuron, β_j ∈ R^P denotes the weight vector from the jth hidden neuron to the output layer, and b_j is the bias of the jth hidden neuron. Eq. (1) can be rewritten as

H\beta = T, \tag{2}

where H is the output of the hidden layer, β is the weight matrix of the output layer, and T is the expected output matrix. In ELM, w_j and b_j (j = 1, 2, ..., L) are assigned random numbers, so H can be directly calculated. Then β can be directly solved via

\hat{\beta} = H^{\dagger} T, \tag{3}

where H† is the Moore-Penrose generalized inverse of the matrix H. Due to its good generalization ability, ELM has recently drawn increasing attention in the pattern recognition and machine learning fields. However, because of its limited feature extraction ability, it is difficult for ELM to provide satisfactory accuracy in HSI classification. In addition, the number of hidden neurons demands manual adjustment.

In this paper, in order to solve ill-conditioned problems in the case of limited samples and to improve training efficiency, we propose the RCN, which combines the convolutional feature extraction of CNN and the parameter randomization of ELM. The proposed RCN algorithm adopts a random convolutional layer with 3D convolution kernels as the feature extraction layer. Specially designed for HSIs, RCN has joint spectral-spatial feature extraction ability thanks to its 3D convolution kernels. Besides, the convolution kernels in RCN are randomly generated, so the number of network parameters that need to be trained is greatly reduced. Therefore, even in the case of small samples, RCN has high generalization ability, since it averts ill-conditioned problems and can extract spectral-spatial features. Also, because the time-consuming BP algorithm is no longer used, the training process is faster. Figure 1 shows the architecture of RCN.

The proposed RCN extracts features through several 3D convolution kernels. More specifically, an HSI data cube D ∈ R^{A×B×C} is convolved with multiple convolution kernels K_i ∈ R^{I×J×K} (i = 1, 2, ..., U, where U denotes the number of kernels) to get the ith feature cube D_i^F ∈ R^{A×B×C}. All feature cubes are stacked along the spectral dimension to form the final feature cube D^F ∈ R^{A×B×UC} (see Fig. 2), and the features of each pixel can be extracted from D^F. In the output layer, RCN adopts SVM with a radial basis function kernel (SVM-RBF) as the classifier. The algorithm framework of RCN is listed in Algorithm 1: for i = 1, 2, ..., U, generate the convolution kernel K_i with random numbers in [0, 1] and obtain the ith feature cube D_i^F by 3D convolution of D with K_i; then stack all feature cubes along the spectral dimension to form D^F, extract the pixel features, and train the SVM-RBF classifier.
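To make the preceding formulas and the RCN pipeline concrete, the following is a minimal sketch in Python, assuming NumPy, SciPy and scikit-learn are available. The function names (elm_fit, rcn_extract_features), the 'nearest' boundary handling of the 3D convolution, and the toy data are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
from numpy.linalg import pinv
from scipy.ndimage import convolve
from sklearn.svm import SVC


def elm_fit(X, T, L=1000, seed=0):
    """Closed-form ELM training (Eqs. (1)-(3)): random hidden parameters, beta = H^dagger T."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))    # random input-to-hidden weights w_j
    b = rng.standard_normal(L)                  # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # sigmoid hidden-layer output, shape (N, L)
    beta = pinv(H) @ T                          # Moore-Penrose solution of H beta = T
    return W, b, beta


def rcn_extract_features(cube, num_kernels=3, kernel_size=(5, 5, 5), seed=0):
    """Random 3D convolutional features: U untrained kernels in [0, 1], stacked along the bands."""
    rng = np.random.default_rng(seed)
    feature_cubes = []
    for _ in range(num_kernels):
        kernel = rng.uniform(0.0, 1.0, size=kernel_size)               # never trained
        feature_cubes.append(convolve(cube, kernel, mode='nearest'))   # same-sized (A, B, C) output
    return np.concatenate(feature_cubes, axis=2)                       # final cube of shape (A, B, U*C)


# Toy usage: a random cube standing in for an HSI, a few labelled pixels, SVM-RBF on top.
cube = np.random.rand(20, 20, 30)
features = rcn_extract_features(cube, num_kernels=3, kernel_size=(5, 5, 5))
rows, cols = np.array([2, 5, 7, 11]), np.array([3, 8, 1, 14])
labels = np.array([0, 0, 1, 1])
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(features[rows, cols, :], labels)
predictions = clf.predict(features.reshape(-1, features.shape[2]))

# The ELM baseline itself, trained on one-hot targets via the closed-form solution above.
T_onehot = np.eye(2)[labels]
W, b, beta = elm_fit(features[rows, cols, :], T_onehot, L=50)
```

Note that RCN borrows only ELM's randomization idea; its output layer is the SVM-RBF classifier rather than the closed-form output weights of Eq. (3).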
After introducing the proposed RCN algorithm in the previous sections, we now conduct experiments on two real-world HSI datasets to evaluate its performance. The experimental process of this article is shown in Fig. 3.

Two well-known HSI datasets are adopted for the experiments, namely Indian Pines and Pavia University. The Indian Pines dataset was obtained by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pine Test Site of Northwestern Indiana on June 12, 1992. The spatial size of the raw image is 145 × 145, with 220 spectral bands. The noisy bands (bands 104-108, 150-163, and 220) are discarded, so 200 bands remain for the experiments. Indian Pines includes 16 land-cover classes and 10249 labeled pixels; Table 1 lists the number of labeled samples for each class. The second dataset is Pavia University, which was captured by the ROSIS sensor over the city of Pavia, Italy. After removing 12 noisy bands, the image has a size of 610 × 340 × 103. Pavia University consists of 9 land-cover classes and 42776 labeled pixels in total, which are listed in Table 2. The numbers of training and testing samples are the same as in Ref. [14]. For the Indian Pines dataset, the training and testing sets are listed in Table 1. For the Pavia University dataset, we take 50 samples per class as the training set, as listed in Table 2.

In order to test the performance of the proposed RCN algorithm, it is compared with the following four classification algorithms; the settings of RCN itself are listed last, and a short sketch of the shared SVM-RBF hyperparameter search is given after the list.

- CNN [14]: CNN is a kind of feedforward neural network (FNN) with convolution operations and a deep structure. It is one of the representative algorithms of deep learning and is widely used in HSI classification. The compared CNN consists of an input layer, three convolutional layers with ReLU activation, two max-pooling layers, four fully-connected layers and one output layer. The architecture of the compared CNN is shown in Fig. 4, and the detailed parameter settings can be found in [14]. It should be noted that the feature extraction process in [14] is quite different from the proposed RCN; we do not re-run the experiments of [14] but directly take the reported accuracies for comparison.
- MLP [14]: The multi-layer perceptron (MLP) compared here has three layers: an input layer, a hidden layer with ReLU activation, and an output layer with a softmax function. The number of nodes in the input layer equals the number of bands in the specific HSI dataset, and the output layer contains as many nodes as there are classes. The number of hidden nodes is calculated by the formula (n_b + n_c) · 2/3, where n_b and n_c denote the number of bands and classes, respectively. Specifically, the MLP topology is 200-144-16 for the Indian Pines dataset and 103-75-9 for the Pavia University dataset. The experimental results are directly taken from [14].
- ELM [8]: ELM is a single-layer FNN whose hidden-node parameters are generated randomly without any iterative algorithm. The compared ELM uses the sigmoid activation function, and the number of hidden nodes is set to 1000.
- SPEC-SVM: As a supervised classification algorithm, SVM is frequently adopted as a benchmark. Here SVM is applied to the raw spectra. The misclassification penalty parameter is searched over C ∈ {2^{-10}, 2^{-9}, ..., 2^{10}}, and an RBF kernel is used with parameter γ ∈ {2^{-10}, 2^{-9}, ..., 2^{10}}.
- RCN: The algorithm proposed in this work. Feature cubes are extracted from the raw HSI data cube in the convolutional layer of RCN, and SVM with an RBF kernel is used as the classifier with the same parameter ranges as SPEC-SVM.
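As a rough illustration of the hyperparameter search shared by SPEC-SVM and the RCN classifier, the sketch below uses scikit-learn's GridSearchCV; the 5-fold cross-validation, the variable names and the toy data are assumptions, not settings reported in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# C and gamma are both searched over {2^-10, 2^-9, ..., 2^10}, as described above.
param_grid = {
    'C': [2.0 ** p for p in range(-10, 11)],
    'gamma': [2.0 ** p for p in range(-10, 11)],
}

# X_train holds raw spectra (SPEC-SVM) or RCN features, one row per labelled pixel.
X_train = np.random.rand(80, 200)
y_train = np.random.randint(0, 4, size=80)

search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)  # 5-fold CV is an assumption
search.fit(X_train, y_train)
best_svm = search.best_estimator_
print(search.best_params_)
```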
In this paper, we design experiments to analyze the effect of the convolution kernel sizes and of the number of convolution kernels on the accuracies. In all RCN experiments, (I × J × K) × U means that RCN adopts U convolution kernels of size I × J × K. We randomly generate the parameters of 16 convolution kernels with different sizes to analyze the influence of the kernels' spatial and spectral sizes on the accuracies. The sizes of these 16 convolution kernels are shown in Table 3.

Table 3. Experimental design for testing the size of convolution kernels: single kernels of different sizes, including (7 × 7 × 3) × 1, (7 × 7 × 5) × 1, (7 × 7 × 7) × 1, (7 × 7 × 9) × 1, (9 × 9 × 3) × 1, (9 × 9 × 5) × 1, (9 × 9 × 7) × 1 and (9 × 9 × 9) × 1.

We also choose four kinds of basic convolution kernel sizes and set up experiments for analyzing the influence of the number of convolution kernels on the accuracies. The experimental settings are listed in Table 4.

Table 4. Experimental design for testing the number of convolution kernels, including (7 × 7 × 7) × 1, (7 × 7 × 7) × 2, (7 × 7 × 7) × 3, (9 × 9 × 9) × 1, (9 × 9 × 9) × 2 and (9 × 9 × 9) × 3.

In order to make the results reliable, we repeat each experiment 10 times, with the training and testing samples randomly selected in each run, and report the accuracies averaged over the 10 runs. Overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ) are adopted as evaluation criteria.

For the Indian Pines dataset, the experimental results are shown in Table 5. As can be seen, the proposed RCN performs better than CNN, MLP, ELM and SPEC-SVM. For instance, RCN with a (3 × 3 × 3) × 1 convolution kernel reaches 85.60% OA, while MLP obtains 75.24%, ELM obtains 54.12%, and SPEC-SVM obtains 72.91%. Compared with CNN, the accuracies of RCN are approximately 6% higher in OA and AA, and 3% higher in κ.

Testing the Size of Kernels: In Table 5, we can see that when the spatial size of the kernel is fixed, the accuracies tend to decrease as the spectral size increases from 3 to 9.

Testing the Number of Kernels: From Table 5, it can be seen that RCN with two or three convolution kernels obtains better accuracies more easily than RCN with only one kernel. That is to say, multiple convolution kernels may give RCN better recognition ability. In this comparison, RCN with (5 × 5 × 5) × 3 convolution kernels obtains the maximum accuracies (OA: 88.75%, AA: 93.84%, κ: 87.17%). Overall, the best OA (88.75%) and κ (87.17%) in Table 5 are obtained by RCN with (5 × 5 × 5) × 3 convolution kernels, and RCN with a (7 × 7 × 3) × 1 convolution kernel reaches the best AA (94.09%).

For the Pavia University dataset, RCN also reaches higher accuracies than MLP, ELM and SPEC-SVM (see Table 6), and compared with CNN the proposed RCN generally performs better.

Testing the Size of Kernels: Table 6 shows that with a (5 × 5 × 5) × 1 kernel, RCN reaches the best OA (92.48%) and best κ (90.06%), while the maximum AA (92.83%) is obtained by RCN with a (3 × 3 × 3) × 1 convolution kernel. In general, the accuracies decrease slightly as the spectral size increases.

Testing the Number of Kernels: When the number of convolution kernels increases, the accuracies are, to some extent, better than with a single kernel (see Table 6). In this part, RCN with (5 × 5 × 5) × 2 convolution kernels reaches the maximum accuracies (OA: 92.12%, AA: 93.42%, κ: 89.67%).
In Table 6 , when the size of convolution kernel is (9×9×3)×1, RCN achieves best OA (92.48%) and κ (90.06%) while RCN with (5 × 5 × 5) × 2 convolution kernels reaches best AA (93.42%). To sum up, the performance of RCN is much better than that of ELM and SPEC-SVM. Also, RCN performs better than CNN in general for it solves the overfitting and ill-conditioned problems in the case of small samples. For RCN itself, increasing the spectral size of convolution kernel may have a negative effect on the accuracies. For a specific HSI dataset, when only one convolution kernel is used, it is meaningful to select the appropriate kernel size to get better generalization ability. When the number of convolution kernels increases, RCN will get better porformance in general. In other words, if higher precision is demanded, using multiple convolution kernels is worth trying. In this paper, we have proposed a new network named RCN for joint spectralspatial feature extraction in HSIs. The proposed algorithm applys the straightforward solution idea of ELM on the CNN model which has remarkable feature extraction performance. Similar to ELM, the parameters of convolution kernels are randomly generated without any iterative tuning. These 3D convolution kernels form a random convolutional layer of RCN, which is used for feature extraction. After extracting features from raw HSI data, SVM-RBF is used for classification in the output layer. Compared with CNN, it can be seen that RCN greatly reduces the number of parameters which need to be trained so that RCN is easier to obtain more accurate solutions for it avoids overfitting and ill-conditioned problems in the case of small samples. Through the experiments on two HSI datasets in this paper, it can be concluded that the proposed RCN outperforms CNN, MLP, ELM and SPEC-SVM in classification tasks, which proves that RCN is an algorithm with excellent generalization ability and high training efficiency. Feature extraction and classification for EEG signals using wavelet transform and machine learning techniques A fast iterative shrinkage-thresholding algorithm for linear inverse problems Deep learning-based classification of hyperspectral data Deep feature extraction and classification of hyperspectral images based on convolutional neural networks Investigation of the random forest framework for classification of hyperspectral data Mapping lithology in canada's arctic: application of hyperspectral data using the minimum noise fraction transformation and matched filtering Convolutional neural network architectures for matching natural language sentences Extreme learning machine for regression and multiclass classification Extreme learning machine: theory and applications Fast polynomial approximation of heat kernel convolution on manifolds and its application to brain sulcal and gyral graph pattern analysis Face recognition: a convolutional neural-network approach MRI brain tumour segmentation using hybrid clustering and classification by back propagation algorithm. Asian Pac Classification of hyperspectral remote sensing images with support vector machines A new deep convolutional neural network for fast hyperspectral image classification Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features Deep learning in neural networks: an overview Discriminative analysis for symmetric positive definite matrices on lie groups