key: cord-1042895-mjw188hr
authors: Kotwal, Shallu; Rani, Priya; Arif, Tasleem; Manhas, Jatinder; Sharma, Sparsh
title: Automated Bacterial Classifications Using Machine Learning Based Computational Techniques: Architectures, Challenges and Open Research Issues
date: 2021-10-12
journal: Arch Comput Methods Eng
DOI: 10.1007/s11831-021-09660-0
sha: a7a3c7b138beeafcbc5a2e25f14f7fbc2d57a19e
doc_id: 1042895
cord_uid: mjw188hr

Bacteria are important in a variety of practical domains, including industry, agriculture and medicine. Only a few species of bacteria are beneficial to humans, whereas the majority are extremely dangerous and cause a variety of life-threatening illnesses in different living organisms. Traditionally, this class of microbes has been detected and classified using approaches such as Gram staining, biochemical testing and motility testing. However, with the availability of large amounts of data and technical advances in medical and computer science, machine learning methods have been widely used and have shown strong performance in the automatic detection of bacteria. The latest technologies employing different Artificial Intelligence techniques are greatly assisting microbiologists in solving extremely complex problems in this domain. This paper presents a review of the literature on the machine learning approaches that have been used to classify bacteria over the period 1998–2020. The resources include research papers and book chapters from publishers of national and international repute such as Elsevier, Springer, IEEE and PLOS. The study presents a detailed and critical analysis of the adoption of different machine learning methodologies in the field of bacterial classification, along with their limitations and future scope. In addition, the opportunities and challenges in implementing these techniques in the field are presented to provide deeper insight for researchers working in this area.

Microorganisms, or microbes, are organisms that are too small to be seen with the naked eye and can be observed only under a microscope. They exist in both single-celled and multicellular forms. The earth is full of microorganisms, e.g., bacteria, fungi, algae, viruses and protozoa, which play a very important role in the environment as well as in human life. Only a small proportion of microbes are beneficial to humans, and these are extensively used in activities such as food processing, agriculture, industry and medicine. Some microbes are used in the fermentation of foods, sewage treatment, fuel production, maintaining soil fertility, the preparation of medicines, etc.; however, many microbes are very harmful to humans and the ecosystem [1]. Microorganisms are responsible for various human diseases such as typhoid, food poisoning, AIDS, polio, milder forms of cold, and cancer. In general, microbes are very important living organisms on earth, and humans must be aware that they can pose a great threat to other organisms living on this planet [2].

A. Bacteria

Bacteria are prokaryotic microorganisms that lack a true nucleus; their genetic material (DNA) remains scattered in the cytoplasm. Based on their morphology, they are classified as spherical (cocci), spiral (spirilla), rod-shaped (bacilli), comma-shaped (vibrios) or corkscrew-shaped (spirochaetes) [3]. They are generally omnipresent; however, human beings act as a perfect host for them. A broad category of bacteria is harmless and even helpful in many ways, whereas some are pathogenic and lead to many diseases in humans and other animals, e.g. tetanus, typhoid fever, foodborne illness, cholera and tuberculosis [4]; many of these can also prove fatal.

B. Machine Learning:

Machine learning (ML) is a sub-field of Artificial Intelligence (AI) and can be defined as "the field of computer science that gives computers the ability to learn without being explicitly programmed". ML is extensively used in prediction systems, network packet classification [5], sentiment analysis [6], speech recognition [7], medical diagnosis [8], the financial industry, signal processing [9], fretting fatigue analysis [10], agriculture [11], etc. Generally, an ML workflow is divided into two parts: training and testing. In the training part, samples are taken as input and their features are learned by a learning algorithm to build the model; in the testing part, the learned model uses the execution engine to make predictions for previously unseen data. ML is classified into three categories: supervised learning, unsupervised learning and reinforcement learning. Several ML algorithms are already available, e.g., Linear Regression, Support Vector Machine (SVM), Decision Tree, Naïve Bayes, Artificial Neural Network (ANN), Random Forest (RF), Deep Learning (DL) and K-Nearest Neighbour (KNN) [12]. These can either be applied directly to the given data or combined into a hybrid model with techniques such as genetic algorithms [13] and neuro-fuzzy systems [14].

(a) Linear Regression: Linear regression is a statistical method used for predictive analysis. It models a linear relationship between a dependent variable and one or more independent variables [12].
(b) Decision Tree: This algorithm is used for classification problems. It works with both categorical and continuous dependent variables by splitting the data into two or more homogeneous sets, starting from a root node [12].
(c) SVM: SVM is extensively used for solving classification problems, including handwriting analysis [15], face recognition [13] and text recognition [16, 17].
(d) Naïve Bayes: This is a classification technique based on Bayes' theorem with an assumption of independence between predictors. The model assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The algorithm is easy to implement and is particularly useful for very large datasets [18].
(e) ANN: ANN is an ML algorithm that tries to mimic the functioning of the human nervous system. It consists of a large number of connected processing units that work together to process information and generate meaningful results from it. DL is an enhancement over the traditional ANN that is concerned with building much larger and more complex neural networks (NN). It automatically extracts features from the given data and does not require the user to do so. It can be used for solving both classification and regression problems. The convolutional neural network (CNN) is a type of DL architecture that is becoming dominant in image analysis. AlexNet was one of the first CNN architectures to show that DL is effective in computer vision tasks. CNNs have achieved the best performance in many areas such as plant disease identification [19], self-driving cars [20], flex electricity effect representation [21] and remote sensing [22].
(f) RF: RF is an ensemble learning method for solving classification and regression problems. In a random forest, the prediction is obtained from a number of decision trees [23, 24].
(g) KNN: KNN is a distance-based ML algorithm and is one of the most widely used data mining techniques in pattern recognition and classification problems [12].

C. Traditional Methods in Bacteria Classification:

Owing to the dual nature of bacteria, they can be classified into various categories in view of their utility in fields such as medical diagnosis, biotechnology, food processing and genetic engineering. Traditionally, the classification of bacteria was performed manually by microbiologists. The most commonly used techniques involve observing the phenotypic characteristics of bacterial species, such as size, shape and color, under the microscope. Some bacterial species are so similar in shape and size that microbiologists are forced to change methodology and perform additional tests, such as motility testing, molecular testing and biochemical testing, in order to classify them successfully. The whole procedure depends entirely on human expertise, which requires long training of skilled operators, and the results may be affected by the expert's mood on a given day. Because of these drawbacks, suitable and reliable automatic methods are needed for bacterial classification [24, 25].

A thorough literature survey shows that the traditional methods of classification are time consuming and frequently prone to errors. These shortcomings create wide scope for scientists to adopt ML approaches in the field of bacteria classification. ML techniques have already shown promising results in the classification of microscopic images [26], including diabetic image classification, cervical cancer cell classification, oral cancer classification and microorganism classification. This provides a basis and encouragement for carrying out similar work in microbiology to classify different bacteria. Compared with traditional methodology, ML techniques have proven to be faster, more accurate, more convenient and cheaper [27]. A great deal of work has been done by researchers in the field of bacteria classification using different techniques. Many researchers have shown experimentally that ML algorithms can efficiently classify images on the basis of feature extraction techniques, and have employed ML techniques to develop automated tools for the segmentation, feature extraction and classification of microscopic bacterial images. Fig. 1 gives a flowchart of an automatic microscopic bacterial image classification system. As shown in Fig. 1, the ML model works in five phases. The first phase acquires the bacterial images. The second phase performs data pre-processing to remove unwanted noise, blur, etc., making the image data clearer and technically suitable for processing. The third phase deals with the segmentation of the data. The fourth phase carries out feature extraction and feature selection on the digital images, and the last phase classifies the images into their respective classes.
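The five phases described above can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions, not the system of any reviewed study: the synthetic 8×8 image, the 3×3 median filter, the threshold of 128, the two features and the elongation rule are all hypothetical choices made for brevity.

```python
from statistics import median

# Hypothetical five-phase pipeline; every name, constant and rule below is an
# illustrative assumption, not the method of any reviewed study.

def acquire_image():
    """Phase 1 (acquisition): synthetic 8x8 grayscale image standing in for a micrograph."""
    img = [[20] * 8 for _ in range(8)]
    for r in range(2, 6):              # bright 4x3 object on a dark background
        for c in range(2, 5):
            img[r][c] = 220
    img[0][7] = 255                    # one isolated noisy pixel
    return img

def preprocess(img):
    """Phase 2 (pre-processing): 3x3 median filter to suppress isolated noise."""
    h, w = len(img), len(img[0])
    return [[median(img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2)))
             for c in range(w)]
            for r in range(h)]

def segment(img, threshold=128):
    """Phase 3 (segmentation): fixed global threshold -> binary mask."""
    return [[1 if v > threshold else 0 for v in row] for row in img]

def extract_features(mask):
    """Phase 4 (feature extraction): area and bounding-box elongation of the foreground."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {"area": len(coords),
            "elongation": max(height, width) / min(height, width)}

def classify(features, elongation_cutoff=1.2):
    """Phase 5 (classification): toy rule separating elongated from round objects."""
    return "rod-like" if features["elongation"] >= elongation_cutoff else "round"

def run_pipeline():
    mask = segment(preprocess(acquire_image()))
    features = extract_features(mask)
    return mask, features, classify(features)

mask, features, label = run_pipeline()
print(features, label)
```

In this toy run the median filter removes the isolated bright pixel before thresholding, at the cost of rounding the corners of the object. Trade-offs of this kind are one reason pre-processing choices matter in practice.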

Microorganisms play a vital role in human life and contribute widely to the earth's ecosystem. The importance of microbes in everyday life is easily seen in the scenario created worldwide by COVID-19. Similarly, bacteria are considered to be among the most important microorganisms and play an immense role in human life and the earth's ecosystem. Identification and classification of bacteria is a tedious task because they are small and not visible to the naked eye. Traditional methodology is cumbersome and time consuming, and thus wastes much of the microbiologist's time. Research on bacteria can take place only once the species in question has been properly classified and placed in an appropriate category. Inter-family classification is easy compared with intra-family classification, owing to the minute variation in shape and size within a family. Traditional methodologies classify bacteria on the basis of morphological features, which are few in number compared with the features available to automated technologies in a modern computing environment. The advent of AI in image classification, through different ML techniques, has given researchers an edge in automating the classification of bacteria based on image analysis. The use of ML techniques assists microbiologists to a great extent by reducing the time needed for bacterial classification. ML-based models also enhance the accuracy of bacterial image classification and have shown strong results. ML-based classification uses different feature extraction and feature selection techniques to overcome the limitation of traditional methodologies, which rely on a limited number of morphological features. The decision taken by an ML model involves a very large number of parameters, which adds to its accuracy.

Fig. 1 Article overview

Figure 2 diagrammatically depicts the major sections of the article. This review is structured as follows: Section II covers the sources of information from which the articles were collected during the literature survey. Section III presents the use of different ML techniques in image analysis through a literature survey covering 1998 to 2020. Section IV discusses the classification techniques and their performance metrics in each case. Section V presents the conclusion and the future scope of each article included in the study. The future scope from all the articles is presented to give fruitful direction to researchers interested in carrying out further improvements in this domain.

The aim of this review is to identify the related studies in the field of bacteria classification and identification using ML techniques. The review covers the period from 1998 to 2020. The articles were searched and collected from reputed publication sources such as SpringerLink, IEEE Xplore, Elsevier, the ACM Digital Library, ResearchGate and PLOS. The keywords used while searching for relevant articles in the respective journal databases are given in Table 1. Initially, a total of 81 research articles were obtained in the field, but after careful investigation, 21 irrelevant and duplicate articles were excluded. After screening the abstracts and full texts of the remaining articles, and considering only work written in English, the number of articles was reduced to 48 (Table 2). Of these, 40 relevant articles based solely on ML techniques were finally selected for this review. The selected articles include research papers and book chapters published in reputed journals and in the proceedings of national and international conferences.

Research articles from 15 journals and 12 IEEE conferences were included in this review paper. Out of 14 journals that are referred, 7 journals are published by Springer link, 5 

ML is a branch of AI that has shown great success in computer science. ML algorithms are trained on given data and can thereafter be used to predict outcomes on new data. In microbiology, researchers use ML techniques for the analysis of digital microscopic bacterial images. The study of bacteria is known as bacteriology, in which the researcher analyzes bacteria on the basis of their morphological features and genetics. Beneficial bacteria are economically important in many areas, such as food processing, biotechnology, fiber retting, pest control and genetic engineering. Escherichia coli is used to prepare vitamin K and riboflavin, Lactobacillus is used to make curd and cheese, and Ruminococcus spp. help digest cellulose by secreting the enzyme cellulase. Pathogenic bacteria are harmful to humans and cause many diseases: Mycobacterium tuberculosis causes tuberculosis, and saprotrophic bacteria attack and decompose organic matter and cause food poisoning [28]. It is therefore very important to classify these bacterial species through ML techniques in order to understand their behavior. ML techniques have been extensively employed by researchers for studying different types of bacteria, and the following section discusses some of the important studies from 1998 to 2020.

A. Studies carried out w.e.f. 1998 to 2010:

In 1998, Holmberg et al. [29] presented an ANN-based technique for the classification of urinary bacteria. They used an ANN with an electronic nose to extract features of the bacteria from sensor data. The data were collected from the Department of Microbiology, The University Hospital in Linköping. A total of 100 Petri dishes containing bacteria (E. coli, Enterococcus, Proteus mirabilis, Pseudomonas aeruginosa and Staphylococcus saprophyticus) were monitored, and 48 features were finally extracted from the dataset. These features were then fed to the proposed system, which achieved a final classification accuracy of 76%.

In the same year, a neural network based technique was proposed by Veropoulos et al. [30] for the classification of tuberculosis bacteria. The architecture consisted of a multi-layered NN using the back propagation learning rule and 2 hidden units. The dataset consisted of 1147 image objects, of which 267 were tubercle bacilli and 880 were other objects. Of these, 1000 images were used for training and 147 for testing. The dataset was collected from the South African Institute for Medical Research (SAIMR), Cape Town. The proposed system achieved accuracies of 91.8% with KNN, 97.9% with BP, 96.6% with SCG and 95.9% with KA.

In 2001, Liu et al. [31] worked on the classification of bacterial morphotypes using morphological features. The authors used a KNN classifier and proposed an image analysis program, CMEIAS, for the Windows NT environment. They obtained 1937 grayscale digital images of various communities. The CMEIAS shape classifier gave an accuracy of 96.0% on the training set of 1471 cells and 97.0% on the test set of 466 cells, representing all 11 bacterial morphotype classes.

In 2004, Forero et al. [32] implemented a technique for the classification of Mycobacterium tuberculosis bacilli using the K-means clustering algorithm. The authors used microscopic sputum samples (between 8 and 100 images per full sample) rather than individual images. A set of 397 negative images from 31 subjects and 75 positive images from 4 patients was acquired. The proposed approach achieved a specificity of 93.54% and a sensitivity of 100%.
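Sensitivity, specificity and accuracy recur throughout the studies reviewed here, so it may help to state how they follow from confusion-matrix counts. The sketch below uses hypothetical counts that are not taken from any reviewed study; the function names are likewise illustrative.

```python
def sensitivity(tp, fn):
    """True-positive rate: fraction of actual positives that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: fraction of actual negatives that were rejected."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts chosen only for illustration.
tp, fn, tn, fp = 48, 2, 90, 10

print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"specificity = {specificity(tn, fp):.2%}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")
```

Because accuracy pools both classes, it can look favourable on an imbalanced dataset even when one of the other two measures is poor, which is why several of the reviewed studies report all three.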

A three-layered back propagation neural network (BPNN) was introduced by Xiaojuan et al. [33] for the classification of bacteria. Six features were extracted from 8 types of bacteria. Sixty samples from the NMCR (National Microbe Culture Resource) database were selected for training and 20 samples were used for testing. The proposed system achieved a classification accuracy of 86.3%.

Men et al. [34] presented an ML-based technique for the classification of heterotrophic bacterial colonies. The classification was done using SVM. The primary heterotrophic bacteria colony images were first pre-processed and then used to extract 6 different features, i.e. area, perimeter, equivalent diameter, shape factor, length and width. All of these features were then fed to the classification algorithm. A total of 300 instances were obtained after feature extraction, of which 200 were heterotrophic bacteria colonies and 100 were non-heterotrophic bacteria colonies. Of these, 60% were used as the training set and the remaining 40% as the testing set. The proposed method achieved a training accuracy of 98.7% and a testing accuracy of 96.9%.
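The six shape features named above can be computed from a binary colony mask along the following lines. This is a sketch under assumptions, since the source does not give the exact definitions: the perimeter is taken as the number of exposed pixel edges, the equivalent diameter as sqrt(4A/π), the shape factor as the circularity 4πA/P², and length and width as the bounding-box extents. The function name and the test mask are hypothetical.

```python
import math

def shape_features(mask):
    """Compute six shape features from a binary mask (list of rows of 0/1).

    Definitions are assumptions made for illustration:
    perimeter = number of foreground pixel edges facing background,
    shape factor = 4*pi*area / perimeter**2 (1.0 for an ideal circle).
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    area = len(pixels)

    perimeter = 0
    for r, c in pixels:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < h and 0 <= cc < w
            if not (inside and mask[rr][cc]):
                perimeter += 1      # this edge borders background or the image frame

    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    extent_rows = max(rows) - min(rows) + 1
    extent_cols = max(cols) - min(cols) + 1

    return {
        "area": area,
        "perimeter": perimeter,
        "equivalent_diameter": math.sqrt(4 * area / math.pi),
        "shape_factor": 4 * math.pi * area / perimeter ** 2,
        "length": max(extent_rows, extent_cols),
        "width": min(extent_rows, extent_cols),
    }

# Hypothetical 4x3 rectangular colony inside a 6x5 frame.
mask = [[1 if 1 <= r <= 4 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(6)]
print(shape_features(mask))
```

Each colony then contributes one six-element feature vector, which is the form of input a classifier such as an SVM would consume.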

An ML-based approach was implemented by Chen et al. [35] (2009) to count and classify the bacterial colonies on a Petri dish. The proposed method was able to recognize chromatic and achromatic images, and could deal with both colored and clear medium images. After recognition, the dish/plate regions containing bacteria were detected. The morphological features of each species were used for classification with an SVM with a radial basis function kernel. Petri dishes were collected with two different types of medium and bacterial strains. The first type of plate, containing blue Mitis-Salivarius agar, was collected from the Department of Pediatric Dentistry at the University of Alabama at Birmingham, and the second type, containing clear LB agar, was obtained from the Division of Nephrology, Department of Medicine at the University of Alabama at Birmingham. The proposed system was reported to be robust and effective for automatic bacterial colony analysis. The method achieved an overall accuracy of 96%; the accuracy for the 25 chromatic images was 92% and the accuracy for the 75 achromatic images was 97%.

In the same year, Xiaojuan et al. (2009) [36] proposed a BPNN for the classification of wastewater bacteria. The experiment used a total of 16 samples, of which 10 were used for training and 6 for testing. The proposed method employed a self-adaptive accelerated back propagation algorithm to train the classifier on bacterial microscopic images. After training, the method was tested with the CECC database, and the results showed that the approach was effective in improving the speed and consistency of large-scale surveys or rapid determination of bacterial abundance and morphology, which could make the estimation of bacterial condition more accurate. The proposed approach gave an accuracy of 85.5%.

Osman et al. (2009) [37] developed an NN method combined with a genetic algorithm for the detection of Mycobacterium tuberculosis bacteria in tissues. The proposed method comprised four main steps, i.e. image segmentation using K-means clustering, feature extraction, feature selection and classification using a GA-NN approach. The authors extracted the seven Hu moment invariants as features, applied a genetic algorithm for feature selection, and used the Levenberg-Marquardt algorithm to train a multilayer perceptron to classify bacteria into two classes, i.e. 'possible TB' and 'true TB'. For the experiment, Ziehl-Neelsen (ZN) stained tissue images of tubercle bacilli were collected from the Pathology Department, Hospital Universiti Sains Malaysia, Kelantan. Around 960 object images (360 'true TB' and 600 'possible TB') were obtained from 120 tissue slide images with various staining conditions. From these 960 object images, the best 680 were chosen, of which 400 object images (200 'true TB' and 200 'possible TB') were used for training and 280 object images (80 'true TB' and 200 'possible TB') were used for testing. The approach achieved an accuracy of 89.64%.
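The K-means segmentation step mentioned above can be illustrated with Lloyd's algorithm applied to scalar pixel intensities. This is a minimal sketch under assumptions: the reviewed study's actual feature space, number of clusters and initialisation are not stated here, so the intensity values, k = 2 and the evenly spread initial centroids are all hypothetical.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Lloyd's algorithm on scalar intensities (requires k >= 2).

    Initial centroids are spread evenly over the value range, an assumed
    deterministic initialisation chosen so the example is reproducible.
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda i: abs(v - centroids[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v)].append(v)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated

    labels = [nearest(v) for v in values]
    return centroids, labels

# Hypothetical pixel intensities: dark background and a few bright object pixels.
pixels = [12, 15, 10, 200, 210, 14, 205, 11]
centroids, labels = kmeans_1d(pixels)
print(centroids)  # cluster means for background and object
print(labels)     # 0 = background cluster, 1 = object cluster
```

Pixels assigned to the brighter cluster form the candidate object region, from which features such as moment invariants could then be extracted.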

An automated classification method was introduced by Khutlang et al. (2010) [38] to identify Mycobacterium tuberculosis in images of ZN-stained sputum smears. The authors used KNN, PNN and SVM classifiers for the classification of objects. The dataset consisted of a total of 11,259 instances of bacteria. Of these, 6901 instances from 11 subjects were used for training, comprising 4999 instances of bacilli and 1902 instances of non-bacilli. For testing, a total of 4358 instances obtained from 8 subjects were used, of which 1838 were bacilli and 2520 were non-bacilli. The approach achieved sensitivity and specificity values of 95%.

In the same year, Hiremath et al. (2010) [39] developed an ML-based technique for the classification of cocci bacterial cells. The proposed system employed three classifiers, namely 3σ, K-NN and NN, for identifying the arrangement of cocci bacterial cells from their geometric features. The ANN consisted of 5 input neurons for the 5 shape features, three outputs, a back propagation function and a gradient descent function. The dataset consisted of 500 color digital bacterial cell images of different types, i.e. cocci, diplococci, streptococci, tetrad, sarcinae and staphylococci. The proposed approach achieved an accuracy of 84% to 94% with the 3σ classifier; the K-NN classifier gave an accuracy of 75% to 100% with k = 1 and 96% to 100% with k = 3. The NN classifier achieved an accuracy of 98% to 100%. Among all the classifiers, the ANN proved to be the best for this experiment.
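The role of k in a K-NN classifier, reported above for k = 1 and k = 3, can be illustrated with a small sketch. The two-dimensional feature vectors, the labels and the deliberately mislabelled-looking training point are hypothetical and are not drawn from the reviewed study, which used five shape features.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query` (Euclidean distance).

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (circularity, elongation) features; values are illustrative only.
train = [
    ((0.95, 1.00), "cocci"),
    ((0.90, 1.10), "cocci"),
    ((0.92, 1.05), "cocci"),
    ((0.60, 2.00), "diplococci"),
    ((0.55, 2.10), "diplococci"),
    ((0.58, 1.90), "diplococci"),
    ((0.62, 1.80), "cocci"),       # atypical point lying inside the diplococci region
]

query = (0.63, 1.82)
print(knn_predict(train, query, k=1))  # decided by the single nearest, atypical point
print(knn_predict(train, query, k=3))  # the vote of three neighbours outweighs it
```

With k = 1 the prediction follows whichever single training point happens to be closest, whereas k = 3 smooths over an isolated atypical sample. This is one plausible reason for a wider accuracy range at k = 1, although the source does not analyse the cause.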

Akova et al. [40] in 2010 proposed an ML-based method to classify bacterial serovars. A total of 28 subclasses from 5 different bacterial species (E. coli, Staphylococcus, Salmonella, Vibrio and Listeria) were used in the experiment. The dataset consisted of 2054 random samples of the bacteria, which were split into 80% for training and 20% for testing. Textural features were first obtained from the samples and then fed to the support vector data description (SVDD) and Bayesian classifiers. The Bayesian classifier was not specific to the bacteria classification problem, so its results were also validated on the benchmark letter recognition dataset from the UCI repository. The experiment achieved an accuracy of 82%.

B. Studies carried out w.e.f. 2011 to 2020:

In 2011, Rulaningtyas et al. [41] implemented an NN trained with back propagation for the classification of tuberculosis bacteria. The classification was done using an NN with fine-tuned hyper-parameter values, i.e. momentum = 0.9, learning rate = 0.5, mean square error = 0.00036 and number of hidden layers = 20. The dataset consisted of 100 samples of tuberculosis binary images, of which 75 images were used for training and 25 images for testing. The approach delivered an accuracy of 80%.
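The momentum and learning rate quoted above enter the weight update of gradient-based training roughly as follows. This sketch assumes the classical momentum form of the update and applies it to a hypothetical one-parameter quadratic loss; the exact update rule and network of the reviewed study are not given in the source.

```python
def momentum_descent(grad, w0, learning_rate=0.5, momentum=0.9, steps=300):
    """Gradient descent with classical momentum (assumed form of the update).

    velocity <- momentum * velocity - learning_rate * grad(w)
    w        <- w + velocity
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

# Toy loss L(w) = 0.5 * (w - 3)^2, whose gradient is (w - 3); minimum at w = 3.
w_final = momentum_descent(lambda w: w - 3.0, w0=0.0)
print(w_final)
```

In back propagation the same update is applied to every weight, with the gradient supplied by the chain rule through the layers. The momentum term reuses a fraction of the previous step, which damps oscillation and speeds progress along consistent gradient directions.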

A Cocci Classifier Tool was developed by Hiremath et al. [42] to detect the different types of cocci, i.e. cocci, diplococci, streptococci, tetrad, sarcinae and staphylococci. The proposed system consisted of three classifiers, namely a 3σ classifier, a KNN classifier and an NN classifier. The NN classifier comprised an input layer with five neurons that received five shape features as inputs, a hidden layer, and an output layer with six neurons, one for each class. The authors took 1733 cells of cocci, diplococci and tetrad for classification and achieved an accuracy of 97%. The remaining 310 cells, representing the classes sarcinae, streptococci and staphylococci, were classified with 91% accuracy, resulting in an overall classification accuracy of 94% for the entire training and testing set. The accuracy achieved by the 3σ classifier was in the range of 92-99%, while that of KNN was 91-100% and that of the NN classifier was 97-100%. Of all the classifiers, the NN gave the best results.

In 2013, Ahmed et al. [43] implemented distributed grid computing for the classification of Vibrio bacteria. Fisher's discriminant analysis was used for extracting the features and an SVM with a linear kernel was subsequently used for classification. The technique involved standardizing the scatter patterns obtained from bacterial colonies using histogram equalization and image centering. Thereafter, features such as Haralick texture features and Zernike and Chebyshev moments were extracted, resulting in a feature vector containing thousands of features. Finally, the most prominent features were selected using Fisher's criterion. The dataset consisted of a total of 1000 images, i.e. 100 images per strain. The proposed method achieved accuracy levels varying from 90 to 99% for different bacterial species.

In the same year, Chayadevi et al. [44] employed both statistical and NN approaches for extracting bacterial clusters and counting different bacterial species in digital microscopic images. The proposed technique involved thresholding and binarization, followed by segmentation and feature extraction. Finally, K-means clustering and self-organizing maps (SOM) were used for clustering and counting. The authors used a primary dataset collected from a hospital, consisting of 320 digital images of bacteria. The results obtained with the proposed system were compared with the manual counts taken by doctors, and the proposed system proved to be more accurate than human visual counting.

In 2014, Ferrari et al. [45] proposed a multistage classification technique using SVM with radial basis function (RBF) kernels. The experiment used solid agar plates from which a total of six species of bacteria were classified, i.e. Enterococcus faecalis, E. coli, Klebsiella, Proteus mirabilis, Staphylococcus aureus and Streptococcus agalactiae. For each bacterial class, two solid agar plates were inoculated. In total, 22 agar plate images were analyzed, from which 74 isolated bacterial colonies were identified. Of the samples, 60% were used for training, 20% for cross-validation and 20% to measure the generalization performance of the classifier. The proposed system gave an accuracy of 93%.
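A 60/20/20 partition of the kind described above can be sketched as follows. The shuffling seed, the rounding of fractional counts and the absence of stratification by species are assumptions, since the source does not state how the 74 colonies were actually allocated.

```python
import random

def three_way_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split into training, validation and test subsets.

    Fractional sizes are rounded to the nearest integer (an assumption);
    the test subset receives whatever remains.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 74 colony identifiers, matching the number of isolated colonies in the study.
train, val, test = three_way_split(range(74))
print(len(train), len(val), len(test))
```

With 74 items the fractions do not divide evenly, so any implementation has to choose a rounding convention. Keeping the test subset disjoint from the data used for training and model selection is what allows it to estimate generalization performance.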

A novel RF-based technique was applied by Ayas et al. [46] for the classification of tuberculosis bacilli. The dataset was obtained from the Mycobacteriology Laboratory at the Faculty of Medicine, Karadeniz Technical University. It consisted of 116 images from five positive-smear slides of five different subjects. In the experiment, 40 images were used for training and 76 images for testing. The method achieved an accuracy of 89.38% with a sensitivity of 93% and a specificity of 96.97%. The results were then compared with SVM, ANN and GPDF, and the approach showed better performance than these techniques.

Govinda et al. [47] introduced a technique for the classification of tuberculosis bacilli using ZN images. The approach used SVM through the LIBSVM library for classification. The experiment was carried out using two microscopic image sets: the first was obtained from the Public Health Image Library (PHIL), an open source of microscopic images, and the second from Global Hospital and Health City, Chennai. These image sets were used to extract the ZN-stained images of 34 tuberculosis-positive and 16 tuberculosis-negative patients. The proposed method achieved an accuracy of 90.89% with a sensitivity of 72.89%.

In the same year, a DL-based technique was implemented by Nie et al. [48] for the classification of bacterial images. The classification was done using a CNN model with multiple intermediate layers. The authors cultured bacteria on two different media (ISP2 and oatmeal agar), where each plate contained 2 colonies selected from a set of 17 strains. The plates were incubated at 30 °C for 8 days and were imaged on the 2nd, 4th, 6th and 8th day after spotting. From these, 862 images of growing colonies were obtained. The proposed CNN architecture was trained on these images, resulting in accuracies ranging from 52.5 to 78.47%.

In 2016, Seo et al. [49] proposed an ML algorithm for the classification of five species of Staphylococcus bacteria. The experiment used hyperspectral microscopic images of the five bacterial species, obtained from the Poultry Microbiology Safety and Processing Research Unit of the U.S. National Poultry Research Center, Agricultural Research Service, in Athens, GA. First, spectral data for the five species (aureus, haemolyticus, hyicus, sciuri and simulans) were generated from regions of interest (ROIs) using threshold-based segmentation. Next, outliers were removed using the Mahalanobis distance method, followed by key wavelength selection using Pearson's correlation coefficient. In the last step, classification was done using SVM and partial least squares discriminant analysis (PLS-DA). The proposed method gave an accuracy of 89.8% for SVM and 97.8% for PLS-DA. The SVM method made it possible to identify not only Gram-negative bacteria but also Gram-positive bacteria.
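One way in which Pearson's correlation coefficient can drive wavelength selection is sketched below. The source does not state how the coefficient was applied, so ranking bands by the absolute correlation between band intensity and a binary class label is purely an assumption, as are the function names and the toy spectra.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)   # undefined (division by zero) for a constant sequence

def select_wavelengths(spectra, labels, top_n=2):
    """Return indices of the top_n bands with the largest |r| against the labels.

    `spectra` holds one row per sample and one column per wavelength band.
    """
    scores = []
    for band in range(len(spectra[0])):
        column = [row[band] for row in spectra]
        scores.append((abs(pearson(column, labels)), band))
    scores.sort(reverse=True)
    return sorted(band for _, band in scores[:top_n])

# Hypothetical reflectance values: 6 samples x 4 bands, two classes.
spectra = [
    [0.10, 0.30, 0.80, 0.20],
    [0.12, 0.31, 0.78, 0.25],
    [0.11, 0.29, 0.82, 0.22],
    [0.50, 0.30, 0.40, 0.24],
    [0.52, 0.32, 0.42, 0.21],
    [0.51, 0.28, 0.38, 0.26],
]
labels = [0, 0, 0, 1, 1, 1]
print(select_wavelengths(spectra, labels))
```

Reducing each spectrum to a few informative bands lowers the dimensionality presented to the classifier, which matters for hyperspectral data where each pixel carries many correlated bands.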

In the same year, a neural network based technique was implemented by Priya et al. [50] for the classification of tuberculosis bacilli. The technique involved the following steps in order: segmentation, feature extraction by Fourier descriptors, and feature selection using fuzzy entropy. The selected features were fed to a hybrid classifier, i.e. SVM coupled with a multilayer perceptron network (SVNN), for classification. The result of SVNN was compared with that of BPNN, and the approach showed better accuracy. The dataset was collected from the South African National Health Laboratory Services, Groote Schuur Hospital in Cape Town, where it was prepared by smearing the sputum specimen on a clean slide. A total of 1537 objects from 100 images were used for training. The proposed system achieved an accuracy of 92.5% with a sensitivity of 95% and a specificity of 90%.

Lopez et al. [51] presented a classification technique for the identification of Mycobacterium tuberculosis (MT) using a CNN. The authors created a patch dataset of 9,770 patches of positive and negative smears from 492 images. CNN models were trained on three versions of the patches: R-G, RGB and grayscale. The versions were compared using ROC curves, and the best experimental results were obtained with the R-G format. The authors used three convolutional layers to evaluate the image dataset with a robust balanced fusion and to classify each patch as positive or negative for MT. The proposed system achieved an accuracy of 96%.

Also in the same year, a CNN-based technique was introduced by Turra et al. [52] for the classification of bacterial colonies from hyperspectral images. The architecture consisted of two convolutional layers, one pooling layer and a softmax layer for classification. The dataset was taken from the American Type Culture Collection (ATCC) and consisted of 16 (1750 colonies) grown on Petri dishes to form 106 hyperspectral imaging (HSI) volumes. The CNN model was trained for 50,000 iterations. The approach was compared with SVM and RF and showed better overall performance: it achieved an accuracy of 99.7%, compared with 99.5% for SVM and 93.8% for RF.
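For readers unfamiliar with the softmax layer mentioned above, the following minimal sketch shows how raw class scores are turned into class probabilities. The three scores are hypothetical values and are not taken from the cited work.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical network outputs for three colony classes
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # index of the most probable class
```

The predicted class is simply the index with the largest probability, which is how a softmax output is usually read in a classification network.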

In 2017, Zielinski et al. [53] presented a DL-based hybrid technique for the classification of bacterial species and genera. The authors used a deep convolutional neural network (DCNN) to obtain image descriptors; a pooling encoder then produced feature vectors, and classification was finally carried out using SVM or RF. The experiment was carried out on the Digital Images of Bacteria Species (DIBaS) dataset, collected from the Chair of Microbiology of the Jagiellonian University in Krakow, Poland, which consisted of 33 bacterial species with 20 images of each. The dataset was partitioned randomly into 50% training and 50% testing instances. The proposed approach achieved an accuracy of 97.24%.

Mohamed et al. [54] implemented an ML technique for bacteria classification. The technique involved feature extraction using the Bag of Words model and classification using a multiclass linear SVM. The authors selected a database of 200 DIBaS images containing 10 species (Acinetobacter baumannii, Actinomyces israelii, Enterococcus faecium, Lactobacillus jensenii, Lactobacillus paracasei, Fusobacterium, Lactobacillus delbrueckii, Lactobacillus reuteri, Micrococcus spp. and Candida albicans) with 20 images of each species. The dataset was divided into two parts: 70% was used for training and 30% for testing. The method achieved an accuracy of 97% on the ten-class classification problem and 100% on eight classes.
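The Bag of Words encoding can be sketched as follows. This is a toy illustration only: the two-dimensional descriptors and the two-word codebook are hypothetical, and in practice the codebook would be learned (for example by clustering local descriptors) before the histograms are passed to the SVM.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    distances = [
        sum((d - c) ** 2 for d, c in zip(descriptor, word))
        for word in codebook
    ]
    return distances.index(min(distances))

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word occurrences for one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Hypothetical codebook of two visual words and four local descriptors of one image.
codebook = [(0.0, 0.0), (10.0, 10.0)]
image_descriptors = [(1.0, 1.0), (0.0, 2.0), (1.0, 0.0), (10.0, 9.0)]
features = bow_histogram(image_descriptors, codebook)  # [0.75, 0.25]
```

Each image is thus reduced to a fixed-length vector regardless of how many local descriptors it contains, which is what makes it usable as SVM input.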

In 2018, Panicker et al. [27] proposed a DL-based approach for the automatic detection of tuberculosis (TB) bacilli in microscopic sputum smear images. The method used image binarization for image de-noising and a CNN for pixel-level classification. The CNN architecture consisted of three convolutional layers, a fully connected layer and a sigmoid layer. The dataset of TB images was collected from the Instituto Nacional de Pesquisas da Amazonia (INPA) lab, Manaus, Brazil. It consisted of 120 images, from which 900 positive patches and 900 negative patches were cropped. Of these, 80% of the patches were used for training and 20% for testing. The proposed method achieved a recall of 97.13%, a precision of 78.4% and an F-score of 86.76%.
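As a consistency check on these figures, the F-score is the harmonic mean of precision P and recall R. Substituting the reported values (written here as fractions) gives:

```latex
F = \frac{2 P R}{P + R}
  = \frac{2 \times 0.784 \times 0.9713}{0.784 + 0.9713}
  = \frac{1.5230}{1.7553}
  \approx 0.8677
```

This agrees with the reported F-score of 86.76% up to rounding of the inputs.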

Wahid et al. [55] applied the Inception V1 approach to the classification of microscopic bacterial images. The technique involved manual cropping of the images, conversion from grayscale to RGB, flipping, and finally translation of the resulting image. The images were then classified using the Inception V1 DCNN model. The authors collected 500 images of five bacterial species (Clostridium botulinum, Vibrio cholerae, Neisseria gonorrhoeae, Borrelia burgdorferi and Mycobacterium tuberculosis) from several online resources such as HOWMED and PIXNIO. In the experiment, 400 images were used for training and 100 for testing. The method achieved an accuracy of 95%.
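The pre-processing steps described above (grayscale to RGB, flip, translate) can be sketched on a tiny image stored as nested lists. This is an illustrative sketch only: the function names, the one-pixel shift and the black fill value are assumptions, and a real pipeline would normally rely on an image library.

```python
def gray_to_rgb(image):
    """Replicate each grayscale value into three identical channels."""
    return [[(p, p, p) for p in row] for row in image]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]

def translate(image, dx, dy, fill):
    """Shift the image by dx columns and dy rows, padding exposed pixels with fill."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted

gray = [
    [1, 2, 3],
    [4, 5, 6],
]
rgb = gray_to_rgb(gray)
flipped = flip_horizontal(rgb)  # first row becomes (3,3,3), (2,2,2), (1,1,1)
shifted = translate(flipped, dx=1, dy=0, fill=(0, 0, 0))  # one pixel to the right
```

Each transformed copy can be added to the training set alongside the original image, which is the usual purpose of such flips and shifts.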

Also in the same year, a CNN-based approach was applied by Traore et al. [56] to image recognition of Vibrio cholerae and Plasmodium falciparum in order to classify epidemic pathogens. The approach was implemented with seven hidden layers (six convolutional layers and one fully connected layer), and a softmax layer was used for the final classification. TensorFlow was used as the backend framework for implementing DL. The dataset was downloaded from Google and consisted of 200 images of Vibrio cholerae and 200 images of Plasmodium falciparum. The data were partitioned into an 80% training set and a 20% testing set. The method achieved an overall accuracy of 94%.
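An 80/20 partition of such a two-class dataset can be sketched as below. The source does not say whether the split was stratified, so keeping the class ratio equal in both parts is an assumption here, as are the label strings and the random seed.

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise which images of this class go where
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# 200 images per class, as in the dataset described above.
labels = ["vibrio_cholerae"] * 200 + ["plasmodium_falciparum"] * 200
train_idx, test_idx = stratified_split(labels)
```

With 200 images per class this yields 320 training and 80 testing images, with 40 test images from each class.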

Rahmayuna et al. [57] proposed a technique to classify images of four common pathogenic bacteria at the genus level. The authors classified the images in two steps: the first improved image quality using the Contrast Limited Adaptive Histogram Equalization (CLAHE) method, and the second involved texture analysis. Classification was then performed using an SVM with linear and radial basis function (RBF) kernels. The authors collected 600 optical images of bacteria: 150 each of Escherichia, Listeria, Salmonella and Staphylococcus. Of these, 540 images were used for training and 60 for testing the proposed framework, and an accuracy of 90.33% was obtained.
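The difference between the linear and RBF kernels can be seen in a few lines. The feature vectors and the value of gamma below are hypothetical and serve only to show how each kernel scores a pair of samples.

```python
import math

def linear_kernel(x, y):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); equals 1 for identical vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

u = [1.0, 2.0]  # hypothetical texture feature vectors
v = [2.0, 0.0]
k_lin = linear_kernel(u, v)  # 1*2 + 2*0 = 2.0
k_rbf = rbf_kernel(u, v)     # exp(-0.5 * 5) = exp(-2.5)
```

The linear kernel grows with the magnitude of the features, whereas the RBF kernel depends only on the distance between samples and decays towards zero as they move apart.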

A DL-based approach was introduced by Hay et al. [58] for the identification of gut bacteria in the larval zebrafish intestine using 3D light sheet fluorescence microscopy images. The proposed CNN architecture classified the images as bacterial or non-bacterial. The authors used Google's open-source TensorFlow framework to implement the CNN architecture. A comparison of the CNN with RF and SVM was also presented, and the proposed CNN architecture gave the best results, with an accuracy of 90%.

In 2019, Mithra et al. [59] applied deep belief neural networks to the classification of Mycobacterium tuberculosis bacilli in ZN-stained microscopic images. The model was compared with existing models, and the authors concluded that the deep belief neural network classifier gave better results. The dataset was collected from ZNSM-iDB and contained 500 images in the categories autofocus, no bacillus, few bacilli, overlapping and over-stained. Of these, 275 images were used for training and 225 for testing. The proposed system achieved an accuracy of 98.23%, with a sensitivity of 97.55% and a specificity of 97.86%.

In the same year, Treebupachatsakul et al. [60] implemented a LeNet CNN approach for the recognition of two species of bacteria, i.e. Staphylococcus and Lactobacillus. The method was implemented in Python using the Keras API with the TensorFlow DL framework. The authors created their own dataset of more than 400 sample images of two types, Staphylococcus aureus and Lactobacillus delbrueckii. The dataset was split into two parts: 80% for training and 20% for testing. The proposed system achieved an accuracy of 75%.

A hybrid technique based on Inception-V3 and SVM was proposed by Ahmed et al. [61] for the classification of microscopic bacterial images. The method pre-processed each image by manual cropping, conversion from grayscale to RGB, flipping and finally translation, followed by feature extraction using the Inception-V3 DCNN model. The images were then assigned to their respective classes using an SVM. The authors used seven bacterial species: Clostridium botulinum, Borrelia burgdorferi, Rickettsia rickettsii, Mycobacterium tuberculosis, Streptococcus pyogenes, Vibrio cholerae and Neisseria gonorrhoeae. Of the dataset, 80% of the samples (800 microscopic images) were used for training the hybrid network and 20% (160 samples) for training the DCNN model. For testing, the authors used 200 images of the seven bacteria. The proposed system achieved an accuracy of 96%.

Abd-Alhalem et al. [62] presented a DL technique for the classification of bacteria (Actinobacteria, Firmicutes and Proteobacteria) on the basis of DNA nucleotides, i.e. A, T, C and G, using a CNN based on random projection with different data reduction layers. The proposed system consisted of four layers: three convolutional layers, each with a pooling layer, and a fully connected layer that uses the softmax activation function. The authors collected 2000 sequences, which can be grouped into five ordered taxonomic ranks: phylum, class, order, family and genus. The proposed system was compared with an SVM classifier and gave better results.
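Random projection as a data reduction step can be sketched as follows. The source does not describe how the sequences were encoded or how the projection was parameterised, so the one-hot encoding, the Gaussian projection matrix, the output size of 8 and all function names are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ATCG"

def one_hot(sequence):
    """Flatten a DNA string into a 0/1 vector with four entries per base."""
    vector = []
    for base in sequence:
        vector.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vector

def random_projection_matrix(input_dim, output_dim, seed=0):
    """Gaussian random matrix scaled by 1/sqrt(output_dim)."""
    rng = random.Random(seed)
    scale = output_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
            for _ in range(output_dim)]

def project(vector, matrix):
    """Multiply the projection matrix by the input vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

encoded = one_hot("ATCGGA")                         # 6 bases -> 24 values
matrix = random_projection_matrix(len(encoded), 8)  # hypothetical target size
reduced = project(encoded, matrix)                  # 24 values -> 8 values
```

The projection matrix is fixed once and reused for every sequence, so the reduced vectors remain comparable with one another.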

The following year, Bonah et al. [63] used a meta-heuristic optimization algorithm for food-borne pathogenic bacteria classification using hyperspectral imaging. The proposed method experimentally showed better results than SVM, competitive adaptive reweighted sampling (CARS) and particle swarm optimization. In the same year, a DL-based classifier was implemented by the same authors [65] for the classification of food-borne bacterial species at the cellular level using hyperspectral microscope imaging (HMI) technology combined with a CNN. The proposed method implemented a 1D-CNN architecture, KNN and SVM for classification. The imaging dataset for five food-borne bacteria, i.e. Campylobacter fetus, Escherichia coliform, Listeria innocua, Salmonella Typhimurium and Staphylococcus aureus, was collected from the Poultry Microbiological Safety and Processing Research Unit (PMSPRU) of the U.S. Department of Agriculture in Athens, Georgia, USA. The proposed system achieved an overall accuracy of 90%.

In the same year, Mhathesh et al. [66] applied a DL technique to the classification of 3D light sheet fluorescence microscopy images of larval zebrafish. The authors used a CNN to classify the bacterial images. In this model, different activation functions, i.e. sigmoid, tanh and ReLU, were used to analyse the accuracy of the model. The results were then compared with other classifiers such as the support vector classifier, RF and ConvNet. The approach achieved an accuracy of 95%, outperforming the other selected techniques.
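The three activation functions compared in that study have simple closed forms, shown here as a minimal sketch using their standard textbook definitions. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Squash x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash x into the interval (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, x)

# Print the three activations side by side for a few sample inputs.
for value in (-2.0, 0.0, 2.0):
    print(value, round(sigmoid(value), 4), round(tanh(value), 4), relu(value))
```

Sigmoid and tanh saturate for large inputs while ReLU does not, which is one reason the choice of activation can change the accuracy a network reaches.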

In the same year, Sajedi et al. [67] used the Extreme Gradient Boosting (XGBoost) classification method to classify three Myxobacterial suborders, i.e. Cystobacterineae, Sorangiineae and Nannocystineae. The method used the Gabor transform to extract texture features and XGBoost to classify the three categories of bacteria. The accuracy achieved by the proposed system was 90.28%. In addition, the authors reviewed literature related to the classification of bacteria using ML methods, including deep neural networks.
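A Gabor filter, the building block of Gabor texture features, is a Gaussian envelope multiplied by a sinusoidal carrier. The sketch below generates the real part of one such kernel; the kernel size and all parameter values are hypothetical, and the cited work does not report its filter bank settings here.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter sampled on a size x size grid (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

kernel = gabor_kernel(size=5, wavelength=4.0, theta=0.0, sigma=2.0)
```

Convolving an image with a bank of such kernels at several orientations and wavelengths, and summarising the responses, gives texture features that a classifier such as XGBoost can consume.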

Tables 3 and 4 summarize the different ML approaches employed for the classification of different bacterial species, along with the limitations and future scope of each work. The literature survey shows that ML techniques have achieved considerable success in automating the traditional procedure of bacteria classification. The majority of the work has focused on food bacteria, different bacterial colonies, wastewater bacteria, cocci bacteria and some pathogenic bacterial species such as tuberculosis, Vibrio cholerae, etc. The types of images used include ZN-stained sputum smear microscopic images, hyperspectral images and digital microscopic images of bacterial species as well as of whole agar plates.

Over the course of the study we found that both the datasets and the number of studies are limited. Most of the datasets employed by researchers are private and cannot be accessed directly. Very few datasets, such as DIBaS (http://misztal.edu.pl/software/databases/dibas/) [68], HOWMED (http://howmed.net/microbiology) and PIXNIO (http://pixnio.com/photos/science/microscopy images) [69], are available online. They provide a starting point for ML-based bacteria classification, but there is still a need for high-quality benchmark datasets. A lot of work has been done by different researchers using different datasets, but only limited work has been done on each dataset, which makes it difficult to compare the performance of different techniques across datasets. This creates scope for rigorous research in this field. It has also been observed that some ML techniques perform well on a particular dataset but suffer performance degradation on others.

Several researchers working in this domain have used supervised (SVM, RF, KNN, etc.) and unsupervised (K-means clustering) ML algorithms to achieve bacteria classification in a semi-automatic mode. The performance of these techniques has mostly been evaluated using accuracy. When a dataset is imbalanced, however, accuracy cannot be considered the only measure of a system's efficiency. The majority of researchers have experimented on imbalanced datasets and used only accuracy as the performance metric; only a few have used additional metrics such as precision, recall and F-score to examine their systems more thoroughly. From 2015 onwards, fully automatic DL-based systems using CNNs have shown excellent results for bacteria classification. On the DIBaS dataset, a CNN achieved an accuracy of 97.3%.
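The limitation of accuracy on imbalanced data can be made concrete with a toy example. The class sizes below (5 positive and 95 negative samples) and the trivial classifier that always predicts the negative class are hypothetical and chosen only to illustrate the point.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-score (0.0 where undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

y_true = [1] * 5 + [0] * 95   # 5 bacilli-positive samples, 95 negative
always_negative = [0] * 100   # classifier that never detects a positive
report = evaluate(y_true, always_negative)
```

Here the accuracy is 0.95 although not a single positive sample is found, while recall and F-score are both 0.0, which exposes the failure that accuracy alone hides.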

Training a deep learning model to high accuracy requires comparatively large datasets; small datasets may lead to overfitting. DIBaS and the private datasets on which deep learning has been applied are small, and it is not clear from the literature whether the proposed deep learning models are overfitted. To address the small-dataset issue, researchers have employed methods such as data augmentation and transfer learning. These methods are not sufficient on their own, however, and bring related problems: in transfer learning one cannot provide convolutional filters suited specifically to the task, and in data augmentation unknown patterns and data cannot be handled. It is very difficult to create large datasets because most of the technologies are patented and lack the thorough involvement of microbiologists. The best way to improve the efficiency of deep learning models is to work in collaboration with microbiologists to design and annotate labelled benchmark datasets.

This paper reviews the research on ML-based bacteria classification, drawing on articles published in journals and conference proceedings available through sources such as Elsevier, Springer, IEEE Xplore, the ACM Digital Library, PLOS, etc. The review covers the period from 1998 to 2020. It shows that ML techniques can be used effectively for the classification of various types of bacteria. Depending on the data and imaging modalities, researchers have used different feature extraction and selection techniques to make ML techniques more effective and efficient for bacterial classification.

This section discusses bacterial image classification, which consists of data acquisition, data pre-processing, segmentation, feature extraction and classification, as shown in Fig. 1.

Pre-processing is an important step for increasing the performance of ML methods. In the field of bacterial image classification, researchers have used a variety of microscopic images, including gram-stained images, ZN-stained sputum smear images and hyperspectral images. To extract reliable features, different image pre-processing techniques, i.e. background correction, image resizing, image enhancement and colour-space transformation, have been employed. In addition, pre-processing is applied to reduce errors resulting from noise and artifacts.

The field of ML has already proven its worth in solving problems in areas such as prediction systems, image recognition, speech recognition, medical diagnosis and the financial industry. When it comes to applying ML approaches to microscopic bacterial classification, a significant number of challenges remain to be solved. These can be summed up briefly: the datasets are small, the image quality is low and the target objects are themselves small. Because the public datasets are small, researchers cannot study a large variety of bacterial species, which leaves a wide research gap. Small datasets may also lead to model overfitting. Several strategies, such as data augmentation, dropout and regularization, have proven successful in preventing the overfitting that arises from small datasets. Image rotation, vertical and horizontal flips, zooming in and out within a certain range, and random horizontal and vertical shifts are all effective data augmentation methods for microscopic bacterial images. Image noise, poor spatial resolution, small object size and low image contrast are among the other issues that plague microscopic images. To address these difficulties, various image pre-processing techniques such as adaptive median filtering and Gaussian filtering are employed. In addition, researchers have the opportunity to explore further image pre-processing techniques such as the Wiener filter, unsharp mask filtering, deep neural networks such as autoencoders [70], deep residual dense networks [21], CNNs [62], linear contrast adjustment, etc. More feature descriptors combining colour and texture features can also be applied to extract more significant features from bacterial images [16]. Researchers have used a wide range of ML algorithms in this domain for feature extraction and classification. However, both the quality and the quantity of data are insufficient for creating effective ML models. Some researchers [45, 62, 64, 65] merge one ML technique with others, such as SVM with a radial basis function, CNN with random projection, and DL with LSTM, ResNet and 1D-CNN, to produce hybrid models for bacterial image classification. However, the desired efficiency is yet to be attained because of poor data quality. As a result, the fundamental difficulty that ML-based research still faces is the scarcity of high-quality datasets, which arises from the non-availability of standard datasets in online repositories. The majority of the data used for experimentation was created privately by the authors themselves and lacks several benchmarks. For this reason, only a few bacterial species have been classified to date using ML approaches, which restricts researchers from realizing the full potential of these methods for solving complex problems in this field.
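As an illustration of the noise-reduction filters mentioned above, the sketch below applies a plain 3x3 median filter to a tiny image stored as nested lists. It is the basic (non-adaptive) variant, not the adaptive median filter, and the image values are hypothetical.

```python
import statistics

def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    Border pixels use only the neighbours that exist; median_low keeps
    the result an actual pixel value when the window has an even size.
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            output[y][x] = statistics.median_low(window)
    return output

# A flat background with a single bright noise pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
denoised = median_filter_3x3(noisy)
```

The isolated bright pixel is replaced by the background value, which is the behaviour that makes median filters suitable for impulse noise in microscopic images.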

This paper has given an overview of the different ML techniques used in bacterial classification. ML techniques have been applied extensively in this field, and microbiology has benefited from their application. They help microbiologists in many respects, i.e. in the identification and classification of bacteria and in the overall automation of these processes. The aim of the review was to explore the ML-based methodologies used by various researchers in bacterial image classification. The review included articles published in different journals and conference proceedings during the period from 1998 to 2020. Different imaging and data modalities have been explored in the domain of bacterial classification. The data collected in this paper highlight the efforts made by the researchers in their studies. However, it is difficult to compare their methods and performance because of the differences in the datasets and imaging modalities used. The variation in performance may also be attributed to the different approaches and methods used. It has been observed that ML techniques give high accuracy and precision values in the classification of bacteria. DL methods have also been used extensively for the classification of bacteria, including state-of-the-art algorithms such as deep belief networks, CNN and LSTM. Researchers have studied very few species of bacteria because of the limited availability of datasets. Some researchers downloaded imaging datasets from Google, some created their own, and others collected them from laboratories or hospitals. ML techniques have been used in many fields with good results, but because of the lack of datasets, their performance here does not yet meet requirements. Some researchers combined DL techniques with other ML techniques, e.g. LSTM along with other image processing methods, to create hybrid ML models suited to their requirements. The review shows that most researchers used 2D images, while a few also used 3D imaging data. ML has been used successfully in the classification of bacteria in the past few years.

In future, researchers can combine DL techniques with other ML approaches to obtain better results. They can explore other DL approaches, such as recursive neural networks, autoencoders, etc., in the field of microbiology. Datasets can also be extended to cover more bacterial species, and the data can then be classified with different ML techniques or combinations of such techniques. ML techniques have great future potential in the field of microbiology. Researchers also continue to explore other domains, such as informatics, medicine, entomology and biology, where ML techniques can be used.

Funding No funds, grants, or other support was received.

Microorganisms

Microbes

Bacteria

A perspective of the machine learning approach for the packet classification in the software defined network

Enhancing the classification accuracy in sentiment analysis with computational intelligence using joint sentiment topic detection with medlda

Machine learning paradigms for Speech recognition: an overview

Segmentation of cervical cells for automated screening of cervical cancer: a review

Special section on emerging challenges in computational intelligence for signal processing applications

Multi-scale analysis of fretting fatigue in heterogeneous materials using computational homogenization

Performance evaluation of machine learning techniques for mustard crop yield prediction from soil analysis

A review on machine learning techniques

Recognizing gender from human facial regions using genetic algorithm

Usage and implementation of neuro-fuzzy systems for classification and prediction in the diagnosis of different types of medical disorders: a decade review

Handwriting Analysis based on Segmentation Method for Prediction of Human Personality using Support Vector Machine

Text recognition in scene image and video frame using color channel selection

A novel feature descriptor for image retrieval by combining modified color histogram and diagonally symmetric co-occurrence texture pattern

Algorithms of Machine Learning

Plant disease identification using deep neural networks

Deep learning for self-driving cars: chances and challenges

Computational machine learning representation for the flexoelectricity effect in truncated pyramid structures

Woodland labeling in Chenzhou, China, via deep learning approach

An advanced deep residual dense network (DRDN) approach for image super-resolution

Environment Microorganism classification using conditional random fields and deep Convolutional neural networks

A survey for the application of content-based microscopic image analysis in microorganism classification domains

Machine learning and deep learning based computational approaches in automatic microorganisms image recognition: methodologies, challenges, and developments

Automatic detection of tuberculosis bacilli from microscopic sputum smear images using deep learning methods

Bacteria classification based on feature extraction from sensor data

Image processing and neural computing used in the diagnosis of tuberculosis

CMEIAS: a computer-aided system for the image analysis of bacterial morphotypes in microbial communities

Identification of tuberculosis bacteria based on shape and color

An improved BP neural network for wastewater bacteria recognition based on microscopic image analysis

Application of support vector machine to heterotrophic bacteria colony recognition

An automated bacterial colony counting and classification

A novel wastewater recognition method based on microscopic image analysis

A genetic algorithm-neural network approach for Mycobacterium tuberculosis detection in Ziehl-Neelsen stained tissue slide images

Classification of mycobacterium tuberculosis in images of ZN-stained sputum Smear

Identification and classification of cocci bacterial cells in digital microscopic images

A Machine-learning approach to detecting unknown bacterial serovars

Automatic classification of tuberculosis bacteria using neural network

Identification and classification of cocci bacterial cells in digital microscopic images

Classification of bacterial contamination using image processing and distributed computing

Extraction of bacterial clusters from digital microscopic images through statistical and neural network approaches

Multistage Classification for Bacterial colonies recognition on solid Agar plates

Random forest-based tuberculosis bacteria classification images of ZN-stained sputum smear sample

Automated tuberculosis screening using Ziehl-Neelsen images

A Deep Learning Framework for Bacterial Image Segmentation and Classification

Identification of Staphylococcus species with hyperspectral microscope imaging and classification algorithms

Automated object and image level classification of TB images using support vector neural network classifier

Automatic classification of light field smear microscopy patches using Convolutional Neural Networks for identifying mycobacterium tuberculosis

CNN-based identification of hyperspectral bacterial signatures for digital microscopy

Deep learning approach to bacterial colony classification

Automated classification of Bacterial Images extracted from Digital Microscope via Bag of Words Model

Classification of microscopic images of bacteria using deep convolutional neural network

Deep convolution neural network for image recognition

Pathogenic Bacteria Genus Classification using Support Vector Machine

Performance of convolutional neural network for identification of bacteria in 3D microscopy datasets

Automated identification of mycobacterium bacillus from sputum images for tuberculosis diagnosis

Bacteria classification using image processing and deep learning

Combining deep convolutional neural network with support vector machine to classify microscopic bacteria images

Bacterial classification with convolutional neural network based on different data reduction layers

Vis-NIR hyperspectral imaging for the classification of bacterial foodborne pathogens based on pixel-wise analysis and a novel CARS-PSO-SVM model

Single-cell classification of foodborne pathogens using Hyperspectral microscope imaging coupled with deep learning framework

Classification of foodborne bacteria using Hyperspectral microscope imaging technology coupled with Convolutional neural networks

A 3D Convolutional neural network for bacterial image classification

Image-processing based taxonomy analysis of bacterial macromorphology using machine learning model

Bacteria images on HOWMED

Recent deep learning techniques, challenges and its applications for medical healthcare system: a review

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.