title: TextConvoNet: A Convolutional Neural Network based Architecture for Text Classification
authors: Soni, Sanskar; Chouhan, Satyendra Singh; Rathore, Santosh Singh
date: 2022-03-10

In recent years, deep learning-based models have significantly improved performance on Natural Language Processing (NLP) tasks. Specifically, the Convolutional Neural Network (CNN), initially used for computer vision, has shown remarkable performance on text data in various NLP problems. Most of the existing CNN-based models use 1-dimensional convolving filters (n-gram detectors), where each filter specialises in extracting n-gram features from the input word embeddings. The input word embeddings, also called the sentence matrix, are treated as a matrix where each row is a word vector. Thus, the model applies one-dimensional convolution and extracts only n-gram based features from the sentence matrix. These features can be termed intra-sentence n-gram features. To the best of our knowledge, all the existing CNN models are based on the aforementioned concept. In this paper, we present a CNN-based architecture, TextConvoNet, that not only extracts the intra-sentence n-gram features but also captures the inter-sentence n-gram features in the input text data. It uses an alternative approach for the input matrix representation and applies a two-dimensional multi-scale convolutional operation on the input. To evaluate the performance of TextConvoNet, we perform an experimental study on five text classification datasets. The results are evaluated using various performance metrics. The experimental results show that the presented TextConvoNet outperforms state-of-the-art machine learning and deep learning models for text classification purposes.

Natural language processing (NLP) involves the computational processing and understanding of natural/human languages. It covers a range of tasks that rely on statistics and data-driven computation techniques [1]. One of the important tasks in NLP is text classification. It is a classical problem where the prime objective is to classify (assign labels or tags to) textual contents [2]. Textual contents can be sentences, paragraphs, or queries [2], [3]. There are many real-world applications of text classification such as sentiment analysis [4], news classification [5], intent classification [6], spam detection [7], and so on. Text classification can be done by manual labeling of the textual data. However, with the exponential growth of text data in industry and over the Internet, automated text categorization has become very important. Automated text classification approaches can be broadly classified into three categories: rule-based, data-driven (machine learning/deep learning-based), and hybrid approaches. Rule-based approaches classify text into different categories using a set of pre-defined rules; however, they require complete domain knowledge [8], [9]. Alternatively, machine learning-based approaches have proven to be significantly effective in recent years.

Sanskar Soni and Satyendra Singh Chouhan are with the Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, 302017, INDIA. E-mail: {2018ucp1265,sschouhan.cse}@mnit.ac.in
Santosh Singh Rathore is with the Department of Information Technology, ABV-IIITM Gwalior, 474015, INDIA. E-mail: santoshs@iiitm.ac.in
All the machine learning approaches work in two stages: first, they extract some handcrafted features from the text; next, these features are fed into a machine learning model. For extracting the handcrafted features, bag-of-words, n-gram based models, term frequency-inverse document frequency (TF-IDF), and their extensions are popularly used. For the second stage, many classical machine learning algorithms such as Support Vector Machine (SVM), Decision Tree (DT), conditional probability-based methods such as Naïve Bayes, and other ensemble-based approaches are used [10], [11], [12], [13]. Recently, some deep learning methods, specifically RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network), have shown remarkable results in text classification [14], [15], [16], [17], [18], [19], [20].

CNN-based models are trained to recognize patterns in text, such as key phrases. Most CNN-based models utilize one-dimensional (1-D) convolution followed by a one-dimensional max-pooling operation to extract a feature vector from the input word embeddings. This feature vector is fed into the classification layer as an input for classification purposes. The input word embedding is a word matrix where each row represents a word vector. Therefore, one-dimensional (1-D) convolution extracts n-gram based features by performing the convolution operation on two or more word vectors at a time. However, improving text classification results by utilizing the n-gram features between different sentences using the convolution operation still remains an open research question. Furthermore, the input matrix structure also remains a point to ponder and could be revamped to apply multidimensional convolution.

This paper presents TextConvoNet, a new CNN-based architecture for text classification. Unlike existing works, the proposed architecture uses 2-dimensional convolutional filters to extract the intra-sentence and inter-sentence n-gram features from text data. First, it represents the text data as a paragraph-level (multi-sentence) embedding matrix, which allows 2-dimensional convolutional filters to be applied. Thereafter, multiple convolutional filters are applied to extract features. The resultant features are concatenated and fed into the classification layer for classification purposes. To evaluate the performance of the presented TextConvoNet, we perform experiments on benchmark binary as well as multi-class classification datasets. Evaluation of TextConvoNet is done based on various performance metrics such as accuracy, precision, recall, F1-score, specificity, and G-mean. Additionally, we compare the performance of TextConvoNet with state-of-the-art classification models. The contributions of this paper are as follows.
• An approach is presented to represent input text data as a multidimensional word embedding representation. It overcomes the restriction of applying one-dimensional convolution only.
• A multi-scale feature extraction approach using two-dimensional convolution and pooling operations is presented.
• The overall model is tested on five benchmark datasets, and extensive experiments are performed to test the proposed model's effectiveness.
• We have also evaluated different possible versions of TextConvoNet for text classification purposes.
• An extensive comparison of the proposed model with existing state-of-the-art CNN-based models on five benchmark datasets is performed to show the model's efficacy.
• An ablation study of TextConvoNet is performed, focused on optimizing the hyper-parameters involved at the different layers of the presented model.

The rest of the article is organized as follows. Section II discusses the literature review. Section III starts with background information on text classification using CNNs and then provides details of the presented TextConvoNet architecture. Section IV provides the details of the experimental setup and analysis. It starts with details of the datasets used, followed by performance metrics, implementation details, experimental results, and comparison with the state-of-the-art. It also presents the ablation study and the few-shot learning analysis of TextConvoNet. Section V presents the conclusions along with future directions.

Text classification is one of the important tasks in Natural Language Processing. There have been many data-driven approaches suggested for text classification. Recently, deep learning-based approaches have emerged and performed significantly well in text classification. In this section, we discuss some of the relevant deep learning-based models suggested for text classification. Two neural networks have been particularly popular in NLP problems: Long Short Term Memory (LSTM) and the Convolutional Neural Network (CNN). LSTM can extract current information and remember past data points in a sequence [16], [17], [21], [22]. However, LSTM-based models have a very high training time because the input is processed sequentially, one token per step. In addition, LSTM with attention networks introduces an additional computation burden, because of the exponential function and the normalized alignment score computation over all the words in the text [23].

One of the early CNN-based models was the Dynamic CNN (DCNN) for text classification [24]. It used dynamic max-pooling, where the first layer of DCNN builds a sentence matrix using the word embeddings. A convolutional architecture then uses repeated convolutional layers with dynamic k-max-pooling to extract feature maps over the sentence. These feature maps are capable of capturing short- and long-range relationships between the words. Later, Kim [25] gave one of the simplest CNN-based models for text classification, which has become the benchmark architecture for many recent text classification models. Kim's model applied a single layer of convolution on top of input word vectors obtained from an unsupervised neural language model (word2vec). It was shown to improve upon the state-of-the-art for binary and multi-class text classification problems.

Recently, there have been some attempts towards improving the architectures of CNN-based models [26], [27], [28], [18], [19]. Instead of using pre-trained low-dimensional word vectors as an input to a CNN, the authors in [28] directly applied CNNs to high-dimensional text data to learn the embeddings of small text regions for classification. Most of the existing works have shown that convolutions of sizes 2, 3, 4, or 5 give significant results in text classification. Some studies explored the impact of the word embeddings and CNN architectures on model performance. Encouraged by VGG [29] and ResNets [30], Conneau et al. [31] proposed a VDCNN (Very Deep CNN) model for text processing. It applies CNN directly at the character level and uses small convolutions and pooling functions. The research showed that the performance of VDCNN improves with increasing depth.
In another work, Le et al. [32] showed that deep architectures can outperform shallow architectures when text data is represented as a sequence of characters. Later, Squeezed-VDCNN was suggested [18], which improved VDCNN to work on mobile platforms. However, a basic shallow-and-wide network outperforms deep models, such as DenseNet [33], with word embeddings as inputs. To the best of our knowledge, all the above-discussed CNN-based networks extract n-gram based features using varied sizes of kernels/filters. In light of the above works, we present a novel CNN-based architecture that extracts intra-sentence n-gram features and also captures inter-sentence n-gram features.

In this section, we first discuss the existing CNN-based approach to text classification using a simple example. Next, we present the proposed TextConvoNet framework. The text classification problem can be formally defined as follows.

Definition 1. Given a text dataset T consisting of labelled text articles, each text article has a particular label/class l ∈ L depending on the NLP task. In the case of binary classification, there are a total of two labels for the text dataset. A text article te ∈ T consists of sentences and words. Let us say a text article te_i contains m sentences s_1, ..., s_m, and a sentence s_j (1 ≤ j ≤ m) contains n words. The words in a sentence s_j can be represented as W_j = w_1, ..., w_n. The objective of text classification is to learn a model M that can correctly classify any new text article t_new into a label l ∈ L.

Kim [25] presented a simple and effective architecture for text classification; we call it Kim's CNN model throughout the paper for simplicity. This architecture served as a guiding light and basis for many CNN-based architectures for text classification, and many recent architectures internally use this model [34], [35], [36], [37]. In Kim's CNN model, sentences are mapped to embedding vectors that are made available as an input matrix to the model. It makes use of only a single layer of convolution applied on top of word vectors obtained from an existing pre-trained model, with kernel sizes 3, 4, and 5. The resultant feature maps are further processed using a max-pooling layer to distill or summarize the extracted features, which are subsequently sent to a fully connected layer. Figure 2 shows a simple example of text classification using Kim's CNN model. As shown in Figure 2, the input to the model is a sentence represented as a matrix. Each row of the matrix is a vector that represents a word. 1-D convolution is performed on the matrix with kernel sizes 3, 4, and 5. Max-pooling is performed on the resulting filter maps, which are then concatenated and sent to the final fully connected layer for the classification purpose. Formally, the sentence modelling is as follows.

Sentence Modelling. For each sentence, let w_p ∈ R^z denote the word embedding of the p-th word in the sentence, where z is the word embedding dimension. Suppose that a sentence has n words; the sentence can then be represented as an embedding matrix W ∈ R^{n×z}. We can refer to it as a word matrix where every row denotes the vector of a particular word of the sentence. Let w_{p:p+q} represent the concatenation of the vectors w_p, w_{p+1}, ..., w_{p+q}. The convolution operation is performed on this input embedding layer. It involves a filter k ∈ R^{s×z} that is applied to a window of s words to produce a new feature. For example, a feature c_p is generated from the window of words w_{p:p+s−1} by Equation 1:

c_p = f(k · w_{p:p+s−1} + b)    (1)

Here, b ∈ R and f denote the bias and a non-linear activation function, respectively. The filter (kernel) k is applied to all possible windows of the sentence using the same weights to create the feature map

c = [c_1, c_2, ..., c_{n−s+1}]    (2)
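To make the 1-D convolution pipeline above concrete, the following is a minimal Keras sketch of a Kim-style classifier for the binary case. The three branches with 100 filters of kernel sizes 3, 4, and 5 follow the description of Kim's model given later in the implementation details; the vocabulary size, embedding dimension, and sequence length are illustrative assumptions rather than values taken from the paper.

# Minimal sketch of a Kim-style 1-D CNN text classifier (binary case).
# VOCAB_SIZE, EMBED_DIM, and MAX_WORDS are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # assumed vocabulary size
EMBED_DIM = 300      # assumed word-embedding dimension (e.g., GloVe 300d)
MAX_WORDS = 100      # assumed maximum number of words per input text

inp = layers.Input(shape=(MAX_WORDS,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)    # sentence matrix W in R^{n x z}

# One 1-D convolution branch per kernel size; each branch acts as an n-gram detector.
branches = []
for ks in (3, 4, 5):
    c = layers.Conv1D(filters=100, kernel_size=ks, activation="relu")(emb)
    branches.append(layers.GlobalMaxPooling1D()(c))   # max-over-time pooling

merged = layers.Concatenate()(branches)
dense = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(dense)    # binary classification head

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Each convolution slides over consecutive rows of the word matrix, so every branch extracts only intra-sentence n-gram features; this is the limitation that the paragraph-level representation introduced next is designed to remove.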
In real-world scenarios, paragraphs are strung together in a very complex manner, making it very difficult for any model to come up with the correct labelling, whether it be a sentiment or a news category. So there may be instances where the model cannot extract the inter-sentence features and hence fails to produce a suitable result. Motivated by this shortcoming, we present an alternative input structure and propose a novel CNN model that uses this input structure and employs 2-D convolution [38]. The description of our input structure is given below.

Paragraph Modelling. For each sentence in a paragraph, let w_i ∈ R^z represent the word embedding of the i-th word in the sentence, where z is the dimension of the word embedding. Given that a paragraph has m sentences and n words in each sentence, the paragraph can be represented as an embedding matrix W ∈ R^{m×n×z}. Such an arrangement can be termed a paragraph matrix, where each row depicts a sentence of the paragraph, each cell holds a single word w, and the 3rd dimension holds the embeddings, i.e., the word vectors.

The overall architecture of our proposed model, TextConvoNet, is shown in Figure 1. The presented TextConvoNet model uses the alternate paragraph input structure, 2-D convolution instead of 1-D convolution, and differing kernel sizes. TextConvoNet sends the input matrix into 4 parallel pathways of convolution layers. The first two layers (intra-sentence layers), with 32 filters each and kernel sizes of (1 × 2) and (1 × 3), respectively, are concatenated and have the role of extracting the intra-sentence n-gram features. The other two layers (inter-sentence layers), with 32 filters each and kernel sizes of (2 × 1) and (2 × 2), are concatenated together and have the sole purpose of drawing out the inter-sentence n-gram features. These intra-sentence and inter-sentence layers are further concatenated and fed into the fully connected layer consisting of 64 neurons, which subsequently performs the relevant classification task. A detailed explanation of the architecture is given as follows.

1) Convolution Layer: This layer applies filters to the input to create feature maps that condense the detected features of the input. Let F(m, n) be an input paragraph of size m × n and f be the filter with a kernel size of (g × h), where b(g, h) represents the biases. The outcome of the convolutional layer is computed by Equation 3.

2) ReLU Activation Layer: The purpose of the ReLU activation layer after each convolution layer is to normalize the output. This layer also helps the model learn complex patterns with a reduced possibility of vanishing gradients and a cheap computation cost. The ReLU activation function is calculated using Equation 4:

f(γ) = max(0, γ)    (4)

Here γ is the input to the ReLU function.

3) Concatenation Layer: This layer takes the various input blobs and concatenates them in a continuous manner.

4) Fully Connected Layer: This is a multilayer perceptron that is connected to all the activations from the previous layers. The activation of these neurons is calculated by matrix multiplication with their weights plus an offset value.
Let us assume the input is β of size u, and let r be the number of neurons in the fully connected layer. The output of this layer is computed using Equation 5, where ϑ denotes the activation function and the output matrix is F_{u×r}.

5) Dropout Layer: The dropout layer randomly activates or deactivates (sets to 0) the outgoing edges of hidden units at each update of the training phase, which helps to reduce overfitting.

6) Classification Layer: This is the final layer, which performs classification based on the attributes extracted by the previous layers. It is a traditional ANN layer with softmax or sigmoid as the activation function.

7) Loss Function: For the binary text classification task, TextConvoNet is trained by minimizing the binary cross-entropy (Equation 6) over a sigmoid activation function. For the task of multi-class classification, TextConvoNet is trained by minimizing the categorical cross-entropy (Equation 7) over a softmax activation function. These loss functions can be formulated as

L_binary = − Σ_i [ y_i log(t_i) + (1 − y_i) log(1 − t_i) ]    (6)

L_categorical = − Σ_i Σ_j y_ij log(t_ij)    (7)

Here, i is the index of a training instance, j is the index of a label (class), t_ij is the output of the final fully connected layer, and y_ij is the ground truth (actual value) of the i-th training sample for the j-th class.

Almost all real-life conversations, reviews, and remarks are generally long and complex, and may convey a different perspective in each line, although only a single deep-rooted sentiment is attached to the whole paragraph. To uphold the semantics, the paragraph is converted into a paragraph-level sentence embedding without any preprocessing of the text. The embedding matrix is then sent into 4 lateral pathways, subdivided into the intra-sentence layers (kernel sizes 1 × 2, 1 × 3) and the inter-sentence layers (kernel sizes 2 × 1, 2 × 2), with 32 filters in every layer. These hyperparameters were selected through the GridSearchCV method from a plethora of other suitable hyperparameter choices. The results also supported our approach of selecting small window sizes to capture every minute detail. Similarly, using the GridSearchCV method, the learning rate was chosen to be 0.01 and the number of neurons in the final fully connected layer to be 64. The convolutional layers were limited to four, as additional layers led to overfitting.

We have created various variants of the proposed model to arrive at an effective text classification framework. Figure 1 shows the baseline model. However, it is interesting to see whether increasing the number of n-gram based kernels and inter-sentence kernels improves the efficacy of the model. Thus, we have extended the baseline to create two versions of TextConvoNet: TextConvoNet 4 and TextConvoNet 6.
• TextConvoNet 4: Our base/parent model with 4 convolution layers (with different kernel sizes), 2 for extracting the intra-sentence n-gram features and the other 2 for extracting the n-gram based inter-sentence attributes.
• TextConvoNet 6: The same framework as above but extending the convolutional pathways to 6: 3 for drawing out intra-sentence n-gram features and the other 3 for inter-sentence n-gram features.
We have also performed modifications on various parameters of TextConvoNet: the number of filters, dropout rate, kernel sizes, number of nodes in the fully connected layer, optimizers, etc. The effectiveness of these modifications is experimentally validated in Section IV-E.
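As a rough illustration of the pathway layout described above, the following Keras sketch builds the TextConvoNet 4 variant for a binary task. The paragraph dimensions, the use of global max-pooling to reduce each branch before concatenation, and the binary output head are our assumptions for illustration; they are not a verbatim reproduction of Figure 1.

# Minimal sketch of the TextConvoNet_4 idea: four parallel 2-D convolution
# pathways over a paragraph matrix of shape (m sentences, n words, z dims).
# The paragraph dimensions and the per-branch pooling are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

M_SENTENCES = 10   # assumed maximum sentences per paragraph (m)
N_WORDS = 30       # assumed maximum words per sentence (n)
EMBED_DIM = 300    # assumed word-embedding dimension (z), e.g., GloVe vectors

inp = layers.Input(shape=(M_SENTENCES, N_WORDS, EMBED_DIM))

def pathway(kernel_size):
    """One convolution pathway: 32 filters of the given 2-D kernel size."""
    c = layers.Conv2D(32, kernel_size, activation="relu")(inp)
    return layers.GlobalMaxPooling2D()(c)   # assumed reduction before concatenation

# Intra-sentence pathways: kernels span words within a single sentence.
intra = layers.Concatenate()([pathway((1, 2)), pathway((1, 3))])
# Inter-sentence pathways: kernels span words across adjacent sentences.
inter = layers.Concatenate()([pathway((2, 1)), pathway((2, 2))])

features = layers.Concatenate()([intra, inter])
dense = layers.Dense(64, activation="relu")(features)
dense = layers.Dropout(0.4)(dense)                   # dropout rate from the ablation study
out = layers.Dense(1, activation="sigmoid")(dense)   # use a softmax head for multi-class tasks

model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

The TextConvoNet 6 variant would add one more intra-sentence and one more inter-sentence pathway to the same skeleton.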
In this section, we first provide a description of the datasets used. For the performance evaluation, we conducted experiments on various binary and multi-class text classification datasets. This section also discusses the performance metrics used and the baseline machine learning and deep learning models used for comparison. Finally, we discuss the results of the experimental analysis. In the experiments and results, we first present the performance comparison of TextConvoNet with the baseline models. Moreover, we conduct a statistical test to assess whether TextConvoNet performs significantly differently from the other baseline models. Next, we discuss the ablation study of TextConvoNet with respect to various parameters of the presented model. In addition, we analyze the performance of our model by varying the number of sentences in a paragraph. Thereafter, we discuss the experimental results with respect to dataset size, i.e., we examine how the presented TextConvoNet performs with minimal data and in challenging scenarios.

We have performed experiments on various publicly available binary and multiclass datasets. Only a subset of instances from the datasets was included for training and testing. The complete details about the datasets are given in Table I. No additional changes have been made to the datasets, and no preprocessing has been applied to the text. For the experiments, we used 2 binary and 3 multiclass datasets. The binary datasets are the well-known SST-2 and Amazon Review datasets, whereas the multiclass datasets consist of Ohsumed (R8), Twitter Airline Sentiment, and the Coronavirus Tagged datasets. All the datasets are publicly available and are sourced from Kaggle. The details of the datasets are mentioned below.

In the experimental evaluation, we have used various classification evaluation measures such as accuracy, precision, recall, F1-score, specificity, G-mean, and MCC (Matthews Correlation Coefficient). The reason for choosing the above performance metrics is the wide variety of applications of NLP in the current scenario: there are many cases where precision is given more preference than accuracy, and similarly for the other performance metrics. We therefore evaluated our model TextConvoNet on a multitude of performance metrics in order to get a comprehensive analysis of the model and to check its viability for every NLP application. Table II shows the descriptions/formulae of all the performance metrics considered.

To assess the statistical significance of the presented TextConvoNet 4 and TextConvoNet 6 against the other considered machine learning and deep learning techniques, we performed the Wilcoxon signed-rank paired sample test. It is a non-parametric test, which does not assume normality of the within-pair differences. It tests the hypothesis of whether the median difference between the tested pair is zero or not. We have used a significance level of 95% (i.e., α=0.05) for all the tests. The framed null hypothesis (H_0) and alternative hypothesis (H_a) are as follows.
H_0: There is no statistically significant difference between the paired groups at α=0.05.
H_a: There is a statistically significant difference between the paired groups at α=0.05.
The null hypothesis is rejected when the experimental p-value is less than the α value, in which case we conclude that there is a significant difference between the paired groups; otherwise, the null hypothesis is retained.
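To illustrate the statistical procedure just described (and the Pearson effect size r discussed in the next paragraph), a minimal SciPy sketch is given below; the paired score arrays are hypothetical placeholders, not results from the paper.

# Minimal sketch of the Wilcoxon signed-rank comparison and the Pearson
# effect size r. The score arrays are hypothetical placeholders.
import numpy as np
from scipy import stats

# Paired performance scores of two models over the same datasets/metrics.
scores_textconvonet = np.array([0.91, 0.88, 0.93, 0.85, 0.90])
scores_baseline     = np.array([0.87, 0.86, 0.88, 0.82, 0.89])

stat, p_value = stats.wilcoxon(scores_textconvonet, scores_baseline)
print(f"Wilcoxon statistic = {stat:.3f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the performance difference is statistically significant.")
else:
    print("Fail to reject H0: no statistically significant difference found.")

# Pearson effect size r = z / sqrt(2n); here the z-score is recovered from the
# two-sided p-value, which is an approximation.
n_pairs = len(scores_textconvonet)
z = stats.norm.isf(p_value / 2)
r = z / np.sqrt(2 * n_pairs)
print(f"effect size r = {r:.2f}")   # ~0.1 low, ~0.3 medium, ~0.5 large (Cohen)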
Further, we have performed an effect size analysis using the Pearson effect r measure. The effect size shows the magnitude of the performance difference among the groups; the larger the effect size, the stronger the relationship between the two variables. It is defined by Equation 8:

r = z / √(2n)    (8)

where 2n is the number of observations, including the cases where the difference is 0, and z is the z-score value defined by Equation 9. According to Cohen [39], the effect size is: low, if r ≈ 0.1; medium, if r ≈ 0.3; and large, if r ≈ 0.5.

For a comprehensive performance evaluation of the proposed TextConvoNet, we have used seven different machine learning techniques, namely Multinomial Naive Bayes [40], Decision Tree (DT) [41], Random Forest (RF) [42], Support Vector Classifier (SVC) [43], Gradient Boosting classifier [44], K-Nearest Neighbour (KNN) [45], and XGBoost [46]. A comprehensive evaluation of the proposed TextConvoNet against these techniques helps in establishing its usability and increases the generalization of the results. Since TextConvoNet is a convolutional neural network-based deep learning architecture, we have also included some deep learning-based approaches for performance comparison. Specifically, we have implemented Kim's CNN model [25], Long Short Term Memory (LSTM) [47], [16], and a VDCNN [29] based model proposed for text classification (Table IV), and compared our model with these models. We have also compared our model with other recent attention- and/or transformer-based deep learning models such as BERT [48], Attention-based BiLSTM [21], and Hierarchical Attention Networks (HAN) [22]. The descriptions and implementation details of these techniques are given as follows. All implementations have been carried out using Python libraries.

1) Kim's CNN Model [25]: A detailed description of Kim's model has been provided in Section III-A. The implementation details of this model are as follows. In Kim's model, sentences are mapped to the embedding vectors that are made available as an input matrix to the model. It uses 3 parallel layers of convolution over word vectors obtained from an existing pre-trained model, with 100 filters of kernel sizes 3, 4, and 5. It is followed by a dense layer of 64 neurons and a classification layer.

2) Long Short Term Memory (LSTM) [47], [16]: Long Short-Term Memory networks (LSTMs) are a special form of recurrent neural network (RNN) that can handle long-term dependencies. LSTMs have a chain-like RNN structure, but the repeating module has a different structure: rather than a single neural network layer, there are four, which interact in a particular way. Several works have used the LSTM model for different text classification tasks [16], [17], [21], [22]. For the comparison with our model, we used a single LSTM layer with 32 memory cells followed by the classification layer.

3) Very Deep Convolutional Neural Networks (VDCNN) [29]: Unlike TextConvoNet, which is a shallow network, VDCNN uses multiple layered convolution and max-pooling operations. Therefore, inspired by VDCNN, we implemented a version of it based on word embeddings. This model uses four different pooling processes, each of which reduces the resolution by half, resulting in four different feature map tiers (64, 128, 256, and 512), followed by a max-pooling layer. After the 4 convolution pair operations, the 512×k resulting features are transformed into a single vector, which is the input to a three-layer fully connected classifier (4096, 2048, 2048) with ReLU hidden units and softmax outputs.
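For reference, the classical machine learning baselines listed at the beginning of this subsection are typically built as a two-stage pipeline of TF-IDF features followed by a conventional classifier. A minimal scikit-learn sketch is shown below; the toy corpus, feature settings, and choice of SVC are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a classical ML baseline: TF-IDF features fed into a
# conventional classifier (SVC here; any of the seven baselines could be
# swapped in). The corpus and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = ["great movie, loved it", "terrible plot and acting",
         "an instant classic", "a waste of time"]   # placeholder corpus
labels = [1, 0, 1, 0]                               # placeholder labels

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # stage one: handcrafted n-gram/TF-IDF features
    ("clf", SVC(kernel="linear")),                   # stage two: a classical classifier
])

baseline.fit(x_train, y_train)
print(classification_report(y_test, baseline.predict(x_test)))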
Depending on the classification task, the number of output neurons varies.

4) Attention+BiLSTM [21]: It starts with an input layer that tokenizes input sentences into indexed lists, followed by an embedding layer. Bidirectional LSTM cells (100 hidden units) are concatenated to obtain a representation of each token. Attention weights are obtained from a linear projection and a non-linear activation. The final sentence representation is a weighted sum of all token representations. The final classification output is derived from a simple dense and softmax layer.

5) BERT [48]: The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks, and we have used the same strategy in order to compare TextConvoNet with a BERT-based model. We used pre-trained bert-base-uncased embeddings to encode the input, followed by a dense layer of 712 neurons and a classification layer.

6) Hierarchical Attention Networks (HAN) [22]: We set the word embedding dimension to 300 and the GRU dimension to 50 in our experiment. A combination of forward and backward GRUs provides 100 dimensions for the word/sentence annotation in this scenario. The word/sentence context vectors have a dimension of 100 and are randomly initialized. We use a mini-batch size of 32 for training.

All the experiments examining the model performance of TextConvoNet 4 and TextConvoNet 6 are carried out on a system with a dual-core Intel Core i5 processor and 8 GB RAM, running the Macintosh operating system, with a 64-bit processor and access to an NVidia K80 GPU kernel. All experiments were performed in Python 3.0. The models are trained over a mini-batch size of 32 using Adam as the optimizer. The learning rate is chosen to be 0.1, and the models are trained over 10 epochs with early stopping to avoid overfitting. All these hyperparameters are chosen by making use of a hyperparameter optimization technique called GridSearchCV. For generating word vectors from sentences, we use GloVe (https://nlp.stanford.edu/projects/glove/), a pre-trained word embedding model.

In the first experiment, we tested the performance of TextConvoNet on five datasets (Section IV-A). Table III shows the performance of TextConvoNet (TextConvoNet 6 and TextConvoNet 4) in comparison with the baseline machine learning models and state-of-the-art deep learning based models. In Table III, results are reported for the binary classification datasets (1 and 2) and the multi-class classification datasets (3, 4, and 5). From the results, the following inferences can be drawn.
• In terms of accuracy, all the deep learning models produce values above 70%. The highest value is for dataset-3, and it is given by the proposed TextConvoNet 6. The minimum average gain in accuracy over all datasets is reported to be 1.5%, with a maximum average gain of 31%.
• In terms of precision, TextConvoNet performs better than all the other models for all datasets except dataset-2, with a minimum average gain in precision over all datasets of 4.1% and a maximum average gain of 70.8%, and almost similar results for dataset-2.
• In terms of recall, TextConvoNet produces almost comparable results for dataset-1 and dataset-2, with a minimum average gain of 3.7% and a maximum average gain of 65.8%, alongside the best results on dataset-5.
• The average gain values for F1-score were also found to be similar to those for precision, with TextConvoNet performing best on dataset-5.
• There was a minimum average gain of 1.1% and a maximum average gain of 18.6% in terms of specificity. TextConvoNet produced almost comparable results to the other machine and deep learning models for dataset-2.
• The minimum and maximum average gains are 3.9% and 68% for G-mean1, and 2.7% and 65.66% for G-mean2, respectively.
• In terms of MCC, TextConvoNet performs extremely well on all the datasets, with a minimum average gain of 7.4% and a maximum average gain of about 30%.
• Overall, on the multi-class datasets (3, 4, and 5), TextConvoNet performs better than all the other models in terms of all the performance metrics.

We have also compared the results of TextConvoNet with two attention-based models, BiLSTM followed by attention (Attention+BiLSTM) and the Hierarchical Attention Network (HAN), and one transformer-based model, BERT. Table IV shows the results of TextConvoNet alongside Attention+BiLSTM, BERT, and HAN on the different performance measures. From the table, we can observe that HAN has poor results on all the datasets in comparison with the other models. On datasets 1 and 3, TextConvoNet 4 has better results in comparison with the others. On datasets 2, 4, and 5, TextConvoNet 6 outperforms the others. Overall, the results of TextConvoNet are better in comparison with the other models.

1) Wilcoxon signed-rank test: Tables V and VI report the Wilcoxon signed-rank test results in terms of p-values and effect r for TextConvoNet 6 and TextConvoNet 4 against the other techniques, respectively. The last row in the tables shows whether a statistically significant performance difference has been found between the compared groups. From Table V, it is observed that there is a statistically significant difference between the presented TextConvoNet 6 and all the other considered techniques. The experimental p-values are less than the significance level of 0.05 in all groups. Further, the effect r values are higher than 0.45 in all the groups, showing a large magnitude of performance difference between TextConvoNet 6 and the other techniques. Similarly, from Table VI, it is observed that the performance difference between the presented TextConvoNet 4 and the other techniques is statistically significant at the given significance level for all cases. A significant difference can be seen in all the groups, as the p-values are below 0.05. Further, the effect r values are higher than 0.40 for all the groups, showing a large magnitude of the performance difference.

We evaluated our model variant TextConvoNet 6 over the variety of parameters given in Table VII to analyse their effect on the various performance metrics. Between any two versions, there is a change in the kernel size. Within a version, between any two sub-versions, there are changes in the number of filters, the dropout rate before the classification layer, the optimizer, and the units in the fully connected layer. Generally, the preferred performance metric changes with the needs of the application and the type of model. In practice, each NLP application is unique; therefore, for every NLP application, a suitable approach/model is needed. Hence, we have performed an ablation study on the selected datasets (Table I). The results are shown in Table VIII. From the results across the various versions of TextConvoNet, the following observations have been drawn.
• We conclude that Adam is the best optimizer for the model, as the other optimizers either took a large amount of time to train or did not give good results, as in the case of RMSProp (V1.2, V2.2, V3.2, V4.2).
• A dropout rate of 0.4 was found to be optimal, as the model was slightly overfitting at a value of 0.5.
• It was observed that the performance of TextConvoNet 6 improves slightly on all the metrics when increasing the value from m/4 to m on datasets where the maximum number of sentences in a paragraph is considerably small, as shown for Dataset-4.
• For datasets where the maximum number of sentences in a paragraph is considerably larger, TextConvoNet 6 is seen to perform well at lower values of m (m/4, m/2, 3m/4).
A possible reason for the above observations is that in smaller paragraphs, TextConvoNet 6 finds it difficult to extract features from the text due to less data and hence requires a larger number of sentences to work on, as evident for Dataset-4. On the other hand, TextConvoNet 6 can afford to drop a portion of the sentences, as in Dataset-2 with its larger paragraphs, due to the ample amount of textual data already present.

In minimal-data and challenging scenarios, it becomes rather important that the model can train well on a minimalist dataset (i.e., a dataset with a smaller number of training instances) and perform reasonably well on the test set [49]. Few-shot learning for text classification is a scenario in which only a small amount of labeled data for each category is available. The goal is for the prediction model to generalize to new, unseen examples in the same categories both quickly and effectively. In this experiment, the test dataset's size remains constant, as mentioned in Table I. It is observed that the TextConvoNet model performs better than all the other baseline models, with lower test error rates even at a lower proportion of training examples, and TextConvoNet was able to achieve this without any change in its parameter space. TextConvoNet extracts not just the n-gram based characteristics between the words of the same sentence, as a 1-D CNN does, but also the inter-sentence n-gram based features. As a result, TextConvoNet is able to extract additional features that 1-D CNN models cannot. This strengthens our claim that the proposed TextConvoNet performs reasonably well even with a smaller number of training examples.

In this paper, we presented a convolutional neural network-based deep learning architecture for text classification. The important feature of TextConvoNet is that it not only extracts the n-gram features from the text data, but also captures the inter-sentence n-gram features. This was made possible by providing an alternate input representation for the text data. The extensive performance evaluation has shown that the proposed TextConvoNet is effective for binary and multi-class classification problems. Therefore, it is evident that extracting the inter-sentence relationships improves the text classification task. In future work, we will explore the idea of representing the input in higher dimensions so that convolution operations can capture additional features from the textual data.

We have prepared a supplementary file, which includes the following details. The first part discusses the implementation of all the machine learning and deep learning models with the values of the control parameters used. The next part contains additional results that were not reported in the paper due to space constraints. This file is available in the GitHub repository 7 and is uploaded with the paper.
[1] Natural language processing: a historical review
[2] Text classification algorithms: A survey
[3] Deep learning based text classification: A comprehensive review
[4] Baselines and bigrams: Simple, good sentiment and topic classification
[5] Toward a better performance evaluation framework for fake news classification
[6] Owi: Open-world intent identification framework for dialog based system
[7] Drifted twitter spam classification using multiscale detection test on kl divergence
[8] Feature engineering for text classification
[9] Integrating associative rule-based classification with naïve bayes for text classification
[10] Text categorization with support vector machines: Learning with many relevant features
[11] Text classification using machine learning techniques
[12] The influence of preprocessing on text classification using a bag-of-words representation
[13] A comparison of event models for naive bayes text classification
[14] Combining knowledge with deep convolutional neural networks for short text classification
[15] A c-lstm neural network for text classification
[16] Rethinking complex neural network architectures for document classification
[17] Sgm: sequence generation model for multi-label classification
[18] Squeezed very deep convolutional neural networks for text classification
[19] Multichannel cnn with attention for text classification
[20] Deep learning-based text classification: A comprehensive review
[21] Attention-based bidirectional long short-term memory networks for relation classification
[22] Hierarchical attention networks for document classification
[23] Attention-based lstm for aspect-level sentiment classification
[24] A convolutional neural network for modelling sentences. Annual Meeting of the Association for Computational Linguistics
[25] Convolutional neural networks for sentence classification
[26] Deep learning for extreme multi-label text classification
[27] Effective use of word order for text categorization with convolutional neural networks
[28] Deep pyramid convolutional neural networks for text categorization
[29] Very deep convolutional networks for large-scale image recognition
[30] Deep residual learning for image recognition
[31] Very deep convolutional networks for text classification
[32] Do convolutional networks need to be deep for text classification
[33] Densely connected convolutional networks
[34] Understanding convolutional neural networks for text classification
[35] A clinical text classification paradigm using weak supervision and deep representation
[36] Sentiment classification using convolutional neural networks
[37] Document-level text classification using single-layer multisize filters convolutional neural network
[38] Image-based text classification using 2d convolutional neural networks
[39] What does effect size tell you
[40] Naive bayes classifiers
[41] A survey of decision tree classifier methodology
[42] Classification and regression by randomforest
[43] A fast iterative nearest point algorithm for support vector machine classifier design
[44] Greedy function approximation: a gradient boosting machine
[45] Learning k for knn classification
[46] Xgboost: extreme gradient boosting
[47] Long short-term memory
[48] Bert: Pre-training of deep bidirectional transformers for language understanding
[49] Few-shot learning for short text classification