key: cord-0076631-8eoffi6n authors: Jain, Anurag; Nadeem, Ahmed; Majdi Altoukhi, Huda; Jamal, Sajjad Shaukat; Atiglah, Henry Kwame; Elwahsh, Haitham title: Personalized Liver Cancer Risk Prediction Using Big Data Analytics Techniques with Image Processing Segmentation date: 2022-03-28 journal: Comput Intell Neurosci DOI: 10.1155/2022/8154523 sha: 7082dffd96324f73a8acad1ad1bca92bf063b2bf doc_id: 76631 cord_uid: 8eoffi6n

Data analytics is a massively parallel processing approach that can be used to forecast a wide range of illnesses. Many scientific research methodologies require a significant amount of time and processing effort, which degrades overall system performance. Virtual screening (VS) is a drug discovery approach built on big data techniques; it is used for the development of novel drugs and is a time-consuming procedure that docks ligands from several databases against the protein receptor. The proposed work is divided into two modules: image processing-based cancer segmentation, and analysis of the extracted features using big data analytics. This statistical approach is critical in the development of new drugs for the treatment of liver cancer. Machine learning methods, including the MapReduce and Mahout algorithms, were used to prefilter the set of ligand filaments before they were used in the prediction of liver cancer. This work proposes the SMRF algorithm, an improved scalable random forest algorithm built on the MapReduce foundation, which classifies massive datasets on a computer cluster or in a cloud computing environment. With SMRF, small partitions of data are processed and optimised across a large number of computers, allowing for the highest possible throughput. Compared with the standard random forest method, the test results show that the SMRF algorithm retains comparable accuracy while delivering superior overall performance. Performance metric analysis shows an accuracy of about 80 percent for the liver cancer prediction that feeds into the actual formulation of the drug studied here.

The liver is the second-largest organ in the human body after the skin, and a healthy adult's liver weighs approximately three pounds. It is situated on the right side of the body, under the right lung, and is covered by the ribcage [1]. A sulcus (a ridge) separates its lobes. The liver operates much like a chemical factory. Its role in digestion is to produce the proteins and bile that the body needs to function effectively and to remove ingested toxins from the body [2]. Using the vitamins, carbohydrates, and minerals it stores, the liver breaks down numerous nutrients from the gut, controls cholesterol excretion, and supplies rapid energy when needed. Throughout the body, the cell serves as the fundamental unit from which tissues are built. Growing and dividing into new cells are typical functions of cells in their normal state [3]. A cell that becomes old or damaged is replaced with a fresh one. Every now and again, however, something goes wrong during this process.
When the body does not replace old or damaged cells correctly, their tissues produce nodules and tumours. Liver tumours are classified into two types: benign and malignant [4]. Benign tumours are less dangerous than malignant ones; they do not threaten the patient's life, are comparatively uncommon, and, unlike malignant tumours, do not usually regrow after being excised. A benign tumour does not spread to other parts of the body, although it may press on the tissues in its immediate surroundings. Malignant tumours are cancerous and may be fatal [5]; when removed, they can regrow and become very dangerous, and they can spread throughout the body, affecting many organs. Primary liver cancer and secondary liver cancer are the two forms of liver cancer that may occur in people. Primary liver cancer refers to a malignant tumour that begins in the liver itself, whereas secondary liver cancer develops in another part of the body and then spreads into the liver [6]. Hepatocellular carcinoma (HCC) is the term used for a tumour that develops in the hepatocytes, that is, a cancer of the liver arising from within the organ itself. Hepatocellular carcinoma is responsible for around 75-90 percent of all liver cancer cases in the United States. Primary liver tumours are classified into several categories, including cholangiocarcinoma (bile-duct cancer), combined HCC and cholangiocarcinoma, tumours of mesenchymal tissue (sarcoma), and hepatoblastoma, an uncommon malignant tumour that manifests itself in children and young adults [7]. Based on the insights achieved, new technologies in the computer science sector are expected to emerge in the coming years. The "third paradigm" is derived from the many analyses and implementations that have been carried out [8]. Findings in biomedical applications have been obtained through experimental analysis and the numerous surveys conducted during the research process, and various discoveries have been made to fulfil the needs of an imaginative future and to keep up with the ever-increasing number of requirements. Data processing complexity increases as the speed requirement rises [9]; accordingly, this study concentrates on applications that benefit from faster computation and from an increase in available computing resources. Gathering and processing a wide range of data is the primary reason for the development of this paradigm, which is beneficial to researchers [10]. Some of the most significant breakthroughs have been made in medicinal applications and biomedical research. The development of new drugs is a complicated process involving a variety of procedures, in which candidate molecular structures are chosen and identified from among a large number of possibilities (see Figure 1). Drug discovery has been documented to take 10 to 15 years [11]. As the number of ligands available in the pharmaceutical industry grows, a big data analytics technology called virtual screening (VS) [12] is being used to screen them all. The primary goal of the established approach is the prediction of ligands that bind the protein receptor.
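To make the virtual screening idea above concrete, the following plain-Java sketch ranks a handful of hypothetical ligands by docking affinity and keeps those at or below a cutoff. The ligand identifiers, the scores, and the use of the -5.8 kcal/mol median reported later in this paper as the cutoff are illustrative assumptions, not data or code from the authors' pipeline.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative only: rank hypothetical ligands by docking affinity (kcal/mol).
// More negative affinity indicates stronger predicted binding to the receptor.
public class VirtualScreeningSketch {
    record Ligand(String id, double affinityKcalPerMol) {}

    public static void main(String[] args) {
        List<Ligand> ligands = List.of(
            new Ligand("LIG-001", -7.2),
            new Ligand("LIG-002", -4.9),
            new Ligand("LIG-003", -6.1),
            new Ligand("LIG-004", -5.8));

        // Keep only candidates at or below the chosen affinity cutoff, best first.
        double cutoff = -5.8;
        ligands.stream()
               .filter(l -> l.affinityKcalPerMol() <= cutoff)
               .sorted(Comparator.comparingDouble(Ligand::affinityKcalPerMol))
               .forEach(l -> System.out.println(l.id() + "  " + l.affinityKcalPerMol()));
    }
}
```

In a real screen the scores would come from a docking engine such as AutoDock Vina rather than being hard-coded.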
Using the docking technique, it is possible to shorten the amount of time it takes to identify new medications for the treatment of liver cancer. Hepatocellular carcinoma (HCC) is the most difficult kind of cancer to treat since it develops in the liver's tissue and is very harmful in today's society. Globally, liver cancer cases have increased from 641,000 to 643,000 over the last four decades [13]. Figure 1 depicts the mortality rates and the increase in liver cancer in developing and developed countries, respectively. Extensive data analysis can help speed up the process of medication development, which is a time-consuming endeavour. For example, the development of aspirin therapy used in biomedical treatment was informed by a study of the electronic health records (EHRs) of affected patients [14]. In that study, patient records were gathered from the database of the United States Preventive Services Task Force, in which aspirin is used against cancer cells. In addition, raloxifene [15], which was approved by the FDA in 2007, and dapoxetine, used for the treatment of premature ejaculation, are examples of medications approved in this way. Using healthcare informatics software, a large portion of the therapeutic sector has examined gene expression and cellular screening in order to determine the chemical makeup of the cancer cell [16]. Numerous discussions have taken place in the biomedical sector in preparation for the drug development process that will be discussed in the following sections. Speed-up learning is a form of machine learning in which the problem solver tackles an issue based on its previous experience [17]: it examines the earlier problem solver's experience and traces its steps and solutions. A distinction is made between rote learning and explanation-based learning. Roughly speaking, rote learning is the more traditional approach of learning by taking advice, which may come from a variety of sources such as human experts or other internet-based information [18]. Learning by example is an inductive learning method in which a decision tree is used to guide the learner through the process; this algorithm is based on Quinlan's algorithm, also known as ID3. Clustering is the inductive process in which unlabelled data are grouped into comparable groups, called clusters, using the Euclidean distance or the Manhattan distance as the basis for grouping [19]. Similar to inductive learning, learning by analogy is a kind of learning in which information is retrieved from previous knowledge; it is one of the most basic reasoning strategies in human cognition. The rest of the article is organized as follows: Section 2 presents the background analysis, Section 3 the proposed work, Section 4 the experimental study, and Section 5 the conclusion and future work. Several software tools and technologies had to be brought together in order to create the new medication. In this section, the main platforms of the current structure are explored in detail, allowing for a more in-depth examination of the planned work. The MapReduce approach [20], which operates over enormous datasets, is an advanced though still under-used technique in the IT sector and is employed here for big data analytics.
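As a concrete illustration of the MapReduce pattern used throughout this work, the sketch below counts ligands per 1 kcal/mol affinity bin, in the spirit of the affinity histogram discussed later (Figure 2). It assumes the Hadoop client libraries are on the classpath and a hypothetical tab-separated input format of "ligandId<TAB>affinity"; the job driver and cluster configuration are omitted, and this is not the authors' implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the map and reduce functions: bin docking affinities and count ligands per bin.
public class AffinityHistogram {

    // Map: parse one "ligandId<TAB>affinity" line and emit (bin, 1).
    public static class BinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;                    // skip malformed lines
            double affinity = Double.parseDouble(parts[1]);  // kcal/mol; more negative = stronger
            long bin = (long) Math.floor(affinity);          // 1 kcal/mol wide bins
            ctx.write(new Text(Long.toString(bin)), ONE);
        }
    }

    // Reduce: sum the counts emitted for each bin.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```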
A large number of nodes can benefit from parallel and distributed MapReduce execution because of the technique's high scalability and reliability [21]. MapReduce is straightforward to program and is widely utilised in a variety of real-time applications. The MapReduce approach is used to handle a large amount of data at one time; its key benefits are that it is easy to deploy and that it tolerates faults well. The most important job here is to establish a model for the discovery of a new medication [17]. The MapReduce approach used to identify new drugs makes use of two processes, namely, the map function and the reduce function. Apache Mahout, developed by the Apache Foundation, is another key technology; it provides a library of machine learning algorithms and uses the Hadoop platform as its foundation. Mahout has been at the forefront of new and innovative developments since its various algorithms were implemented [22]. Mahout is used for big data processing with data structures that are compatible with a given machine learning approach, such as deep learning. Although this toolkit provides a Java library, it does not include a user interface [23]. To examine the varied chemical compositions of the obtained data, chemists created Open Babel, an open-source programme that is available for free. The primary goal of this programme is to provide multiplatform libraries for molecular models [24], as well as to perform various data conversions for the medicine that has been produced. The research in [17] indicated that back propagation produced the best results in terms of accuracy (71.59 percent), precision (69.74 percent), and specificity (82 percent), while the NBC classifier had much better sensitivity (77.95 percent) than the other classifiers. The KNN technique, when applied to the AP Liver dataset using common characteristics (SGOT, SGPT, and ALP), provides high accuracy compared with other algorithms. ANN and SVM performance were evaluated on various cancer datasets in [18], with accuracy, sensitivity, specificity, and area under the curve (AUC) all being measured and compared. The BUPA liver disorder dataset was split into a training set (70 percent) and a testing set (30 percent); after analysis, SVM provided accuracy of 63.11 percent, sensitivity of 36.67 percent, specificity of 100.0 percent, and AUC of 68.34 percent, while the artificial neural network provided accuracy of 57.28 percent, sensitivity of 75.00 percent, specificity of 32.56 percent, and AUC of 53.78 percent. In [19], a dataset in which 78 percent of liver cancer patients had associated cirrhosis was employed; it covered two classes, HCC and nontumour livers. The data were separated into two groups, training and testing, and the K-nearest neighbour approach was used to handle missing values. Employing principal component analysis, the author optimised a fuzzy neural network before comparing the GA search results to the improved fuzzy neural network. That study found that, using a smaller number of genes, FNN-PCA could achieve an accuracy of 95.8 percent. The classification of the liver and nonliver disease datasets was based on the findings of this study [20].
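The comparisons above report accuracy, sensitivity, specificity, and AUC from 10-fold or hold-out evaluations. The sketch below shows how such figures are typically obtained with the Weka API (the toolkit used later in this paper for the baseline random forest). The file name "liver.arff" and the choice of class index 1 as the positive class are assumptions made here for illustration.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Illustrative 10-fold comparison of two classifiers on a liver dataset in ARFF format.
public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("liver.arff").getDataSet();  // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);                // class is the last attribute

        for (Classifier c : new Classifier[] { new J48(), new NaiveBayes() }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s  acc=%.2f%%  sens=%.2f  spec=%.2f  AUC=%.2f%n",
                    c.getClass().getSimpleName(),
                    eval.pctCorrect(),
                    eval.truePositiveRate(1),    // sensitivity with respect to class index 1
                    eval.trueNegativeRate(1),    // specificity with respect to class index 1
                    eval.areaUnderROC(1));
        }
    }
}
```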
Medical data from a Chennai medical centre with 15 features were used for preprocessing, and the C4.5 and Naive Bayes classifiers were used for the study; the C4.5 algorithm outperformed the Naive Bayes method in terms of accuracy. The major contributions of the proposed work are to identify cancer using features obtained with the MapReduce technique together with image processing, which identifies the classes of cancer in the patients' CT scans, and to reduce execution time while enhancing the accuracy rate. In the following section, the technique developed to discover the new drug for the treatment of liver cancer in the field of big data analytics is described [25]. The approach begins with dataset selection and algorithm discovery in the big data community. The liver cancer diagnosis is based on protein deficiency. The protein deficiency of the liver tissue is identified using the 4JLU receptor, which includes the crystal structure of BRCA1 [12]. The Protein Data Bank (PDB) is used to obtain the structural information needed to conduct the study. Because the receptor had to be built from the ground up, the Cambridge library's collection of ligands was used [13]; this library contains 10^6 ligands, of which the proposed work randomly samples 10^4. The virtual screening process is carried out with AutoDock Vina (AV). Input images are acquired from the Kaggle dataset to extract the features of cancer from the tumour sets, and the liver cancer features are collected from both infected and non-infected data. Figure 2 shows the frequency distribution of docking affinity (kcal/mol), with the median marked as the separation point. The separation point is taken as the median, calculated as -5.8 kcal/mol, and is used to distinguish active from inactive ligands. Using the true-positive and true-negative values, the larger value is chosen [26], and false counts are discounted using the mean value. The docked complexes in the molecular format PDBQT [27] are converted from ligands into the fingerprint format (FPF), which makes the data usable by machine learning algorithms. Open Babel is the toolbox used to track the chemical composition of the discovered drug and the chemical conversions taking place in the drug structure. The FPF, which is hexadecimal, is converted into a binary structure arranged as an n x m matrix; the resulting vector elements are associated with the label class of the dataset [28]. From the pseudocode, the first stage, as with many other ground filtering methods, is the production of V_min, which is based on the cell size parameter and the amount of data present. The two vectors corresponding to [min : cellSize : max] for each coordinate, x_i and y_i, can be supplied directly by the user or computed quickly and automatically from the data. Rather than generating a raster for each of the (x, y) dimensions separately, the SMRF method generates a raster spanning the range between the ceiling of the minimum value and the floor of the maximum value in each dimension. If the cell size parameter is not an integer, the same general rule applies to values that are evenly divisible by the cell size. For example, if the cell size is 0.5 m and the x values lie in the range 52345.6 to 52545.4, the grid range would be [52346, 52545].
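A small sketch of the grid-vector rule just described, reproducing the worked example above. The method name gridVector and the use of a plain double array are illustrative choices, not part of the paper.

```java
// Sketch of the grid-vector rule described above: the raster spans from the ceiling of
// the minimum coordinate to the floor of the maximum, stepping by the cell size.
public class GridVector {
    static double[] gridVector(double min, double max, double cellSize) {
        double start = Math.ceil(min);
        double end = Math.floor(max);
        int n = (int) Math.floor((end - start) / cellSize) + 1;
        double[] v = new double[n];
        for (int i = 0; i < n; i++) v[i] = start + i * cellSize;
        return v;
    }

    public static void main(String[] args) {
        // Worked example from the text: cell size 0.5 m, x values in [52345.6, 52545.4]
        double[] xs = gridVector(52345.6, 52545.4, 0.5);
        System.out.println(xs[0] + " .. " + xs[xs.length - 1]);   // prints 52346.0 .. 52545.0
    }
}
```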
The method is designed to be applied to both the first and final returns of the point cloud, although, as noted in the next paragraph, it is possible to build a minimal surface that is almost as good with just the last returns. Even though the last return of any given pulse is most likely to be ground, this is not always the case: for example, the last return of one pulse may happen to hit an object at a given location, while the first return of another pulse happens to strike closer to the ground at the same location. Removing the first return of the second pulse early in this example would introduce a minor inaccuracy into the DEM that no filter could later remove. Therefore, it is recommended that both the first and last returns be utilised, since the unnecessary observations are quickly deleted during the first grid-generation process. The minimum-surface grid V_min created from the vectors (x_i, y_i) is filled with the lowest elevation values closest to the original LIDAR data. The data construction step is followed by the formation of five models. These models are trained on the dataset with the labelled class and are used to predict the severity of the cancer using a machine learning algorithm [14]. The prediction is made for the discovery of a new drug with a certain chemical composition. Figure 3 represents the flowchart of the proposed work. The implemented algorithm is based on the MapReduce algorithm using a Java implementation. In the proposed work, the best three algorithms were selected and combined to form a classifier with higher accuracy. The electronic health records include information such as the patient's identification number, status, age, gender, hepatosis, ascites, edema, bilirubin, cholesterol, albumin, and other vital signs. The data under consideration must be clinically transformed, that is, made acceptable for further processing, before they can be used; this clinical transformation stage is also referred to as the preparation step. Null values, irrelevant values, and noisy values may be found in the unprocessed data. These data flaws would result in misclassification, and as a result they need to be transformed clinically. Missing data in the considered dataset are imputed with values generated using the mode function. Following the preparation of the data, three subsets of the dataset are prepared for use in the random forest classification system for categorising occurrences. When generating the subsets, three characteristics are taken into consideration: platelet count, alkaline phosphatase, and cholesterol levels. The random forests are constructed by combining three classification techniques, namely, C4.5, J48, and Naive Bayes, into a single structure. There are many voting methods that may be used for an ensemble of classifiers; in this case, we use the majority vote technique to combine the classifiers, and the conclusion reached by the majority of classifiers is shown as the output. Random forest is a machine learning technique constructed from multiple layers of decision trees and developed using the bagging process [29]. The independent variable X is combined with K decision trees to form the set of classifiers h_1(X), h_2(X), ..., h_k(X).
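The following Weka sketch illustrates the majority-vote combination described above. Because C4.5 is exposed as J48 in Weka, a RandomTree is substituted as the third ensemble member purely to keep three distinct learners; that substitution, the file name "liver.arff", and classifying the first instance are assumptions for illustration, not the authors' configuration.

```java
import java.util.HashMap;
import java.util.Map;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomTree;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal majority-vote ensemble over three base classifiers.
public class MajorityVoteEnsemble {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("liver.arff").getDataSet();   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] members = { new J48(), new NaiveBayes(), new RandomTree() };
        for (Classifier c : members) c.buildClassifier(data);

        Instance query = data.instance(0);            // classify one example instance
        Map<Double, Integer> votes = new HashMap<>();
        for (Classifier c : members) {
            double predicted = c.classifyInstance(query);   // predicted class index
            votes.merge(predicted, 1, Integer::sum);
        }
        double winner = votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
        System.out.println("Majority class index: " + winner);
    }
}
```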
Each classifier is trained and produces its classification using the matrix obtained in the classification process. SMRF (scalable MapReduce random forest) is one of the techniques of big data learning [15]. The proposed technique consists of three phases, implemented as follows:
Step 1. The descriptor file from the dataset is subjected to attribute description.
Step 2. The generating stage subdivides the given dataset into bootstrap samples that can be trained using the bagging algorithm.
Step 3. The voting phase, in which the decision trees give their classification results.
The proposed SMRF technique decides the final classification by the highest vote. Figure 4 shows the scalable random forest algorithm based on the MapReduce technique. Bayes theorem: it is important to determine which hypothesis is the most likely within a given hypothesis space S; in the context of machine learning, this is judged against the observed training data. The prior probability of a hypothesis β, written Q(β), is the probability that the hypothesis is true before any training data are seen, and it reflects whatever background knowledge is available about the correct hypothesis; even when no prior information is available, some prior may be assumed from the facts given. In a similar vein, the prior probability of the provided training data α is calculated, and Q(α) represents the probability of observing the supplied data. In general, the probability of x given y is written Q(x|y). In machine learning, the quantity of interest is Q(β|α), the posterior probability of a hypothesis given a particular training dataset, which may be used to determine the confidence in that hypothesis [16]. Bayes' theorem is the cornerstone of the Bayesian learning approach because it calculates the posterior probability Q(β|α) from the prior probability Q(β), the likelihood Q(α|β), and the data probability Q(α): Q(β|α) = Q(α|β) Q(β) / Q(α). According to Bayes' theorem, Q(β|α) grows as Q(α|β) and Q(β) increase; if Q(α) grows, it can be observed from the equation that the value of Q(β|α) decreases, because data that are likely regardless of the hypothesis provide less support for it. The hypothesis in S that is most likely given the observed facts is then sought. The hypothesis selected in this way is known as the Maximum A Posteriori (MAP) hypothesis, and it is computed by applying Bayes' theorem to each candidate hypothesis: β_MAP = argmax over β in S of Q(α|β) Q(β). In the final step, Q(α) is dropped since it does not depend on the hypothesis and acts as a constant. K star classifier: occurrences that are comparable to one another are assumed to have the same categorisation. The K star approach utilises transformations, picking one transformation at random from all of the possible transformations using an entropic measure. Entropy is employed as a distance measure in this approach, and the distance between instances is computed with it; the complexity of a transformation is measured by the distance between the occurrences it connects. This is accomplished via the use of instance transformations and mappings over a limited set of transformations. Assume that a is the initial instance and that b is the ending instance.
Let us suppose that the set of transformations X is predefined and that there may be infinitely many instances. Each transformation x maps one instance to another, x : I -> I, and a designated stop transformation terminates a sequence. A finite sequence of transformations x_1, ..., x_n maps an instance a according to x(a) = x_n(x_{n-1}(... x_1(a) ...)). When q is a probability function on X*, the set of all finite transformation sequences, it must satisfy the usual requirements of a probability function. The function q* is then defined as the probability of all paths from instance a to instance b, q*(b|a) = the sum of q(x) over all sequences x in X* that map a to b, and q* satisfies the corresponding properties. The L* function is then defined as L*(b|a) = -log2 q*(b|a). The proposed SMRF technique is evaluated in a Hadoop environment. The Java-based Weka workbench is adopted to run the random forest algorithm with the same parameters as the traditional algorithm, and the system's precision is governed by the parameter marked K. Many methodologies were investigated to compare the various algorithms with the proposed approach, and the mean value determines the reported accuracy of the system. The experimental analysis of various applications, used to evaluate the proposed work, is tabulated in Table 1. Across these analyses, the proposed SMRF algorithm has better accuracy in various fields and a lower error factor. Figure 5 represents the comparison of the proposed algorithm with the traditional algorithm. For SMRF, the accuracies on the "corral" and "ionosphere" datasets are 97.66% and 93.16%, respectively, which are much higher than those of the traditional random forest. The experimental results with the mean parameter, represented as K, are shown in Figure 5. The proposed algorithm uses 10 nodal points with a 100-decision-tree structure. The SMRF algorithm runs in parallel, which reduces classification time and increases the system's accuracy based on the MapReduce model, and the scalability of the system is higher than that of the other algorithms. The proposed work therefore achieves good classification accuracy, which supports better drug discovery. Database images are collected from the Cancer Imaging Archive, which consists of both normal and abnormal images. The database comprises MRI images, CT scan images, and ultrasound scan images, covering both normal and abnormal cases. The proposed work operates on a collection of such images. Morphological operations consist of categories such as close, erosion, dilation, mask, and mark. These procedures are carried out to smooth the dilated area and to remove unwanted particles within the converted RGB image. Using these techniques, the filtered picture may be separated into its parts by structural and morphological procedures. The output of this process for the MRI scan, CT scan, and ultrasound scan is shown in Figure 6. Figure 7 represents the preprocessing stage for cancer images. The segmentation process is based on the watershed algorithm and the Sobel edge detection technique. The watershed algorithm is a mathematical morphology method founded on concepts from topology and belongs to the region-based segmentation approaches.
Its intuition originates from topography: images are viewed as a topographic relief, and the grayscale value of each pixel stands for the elevation at that point. There are numerous ways to compute the watershed; the efficient algorithm based on immersion simulation proposed by Vincent and Soille [7] is a milestone of watershed research, since it improves the computation by an order of magnitude compared with the long-established watershed algorithms, and for this reason the watershed algorithm has been applied widely. The results of watershed segmentation are shown in Figure 8. Consider the following scenario: the input picture is of an elephant. This picture, complete with pixels, is the first image fed into the convolutional layer system. A black-and-white image is read as a 2D layer, with each pixel given a value between zero and 255, where zero is entirely black and 255 represents fully white. For a colour image, on the other hand, the result is a 3D array with three layers, blue, green, and red, each of which has values between 0 and 255. The matrix is then read, for which the programme picks a smaller picture, referred to as the "filter" (or kernel), from which the information is read. The depth of the filter is the same as the depth of the input. The filter then performs a convolution movement across the input picture, moving one unit to the right each time it is applied. At each position it multiplies its values by the values of the original image, the products are added together, and a single number is produced. Iterating this over the full picture results in a matrix that is smaller than the original input image; this final array is the feature map, also called the activation map. In order to conduct operations such as edge detection, sharpening, and blurring, the picture is convolved with several filters. All that is required is the specification of parameters such as the size of the filter, the number of filters, and the network's architectural design. From a human standpoint, this behaviour is analogous to recognising the basic colours and edges of a picture. However, in order to identify the picture and detect the traits that distinguish it as, for example, that of an elephant and not that of a cat, distinguishing characteristics such as the elephant's enormous ears and trunk must be recognised; this is where the nonlinear and pooling layers help. The nonlinear layer (ReLU) is added after the convolution layer and is responsible for increasing the nonlinearity of the representation by applying an activation function to the feature maps. The ReLU layer eliminates any negative values and improves the accuracy of the network. Although various alternatives are available, such as tanh or sigmoid, ReLU is the most common since it allows the network to be trained much more quickly. In the next stage, many views of the same item are used so that the network can always identify the image regardless of its size or position. For example, in the elephant image, the network must be able to detect the elephant regardless of whether it is walking, standing still, or racing.
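The convolution-plus-ReLU step just described can be written in a few lines of plain Java; pooling and the fully connected layer are covered in the next paragraph. The 4x6 grayscale patch and the vertical-edge kernel below are made-up values chosen so that the filter responds where intensity rises from left to right; they are not taken from the paper's network.

```java
// Plain-Java illustration of a single convolution + ReLU step on a grayscale patch.
public class ConvReluSketch {
    // "Valid" convolution followed by ReLU (negative responses clamped to zero).
    static double[][] convolveRelu(double[][] img, double[][] k) {
        int kh = k.length, kw = k[0].length;
        int oh = img.length - kh + 1, ow = img[0].length - kw + 1;
        double[][] out = new double[oh][ow];
        for (int i = 0; i < oh; i++) {
            for (int j = 0; j < ow; j++) {
                double sum = 0;
                for (int u = 0; u < kh; u++)
                    for (int v = 0; v < kw; v++)
                        sum += img[i + u][j + v] * k[u][v];
                out[i][j] = Math.max(0, sum);   // ReLU
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] patch = {                 // dark on the left, bright on the right
            {0, 0, 0, 255, 255, 255},
            {0, 0, 0, 255, 255, 255},
            {0, 0, 0, 255, 255, 255},
            {0, 0, 0, 255, 255, 255}};
        double[][] verticalEdge = {          // responds where intensity increases left to right
            {-1, 0, 1},
            {-1, 0, 1},
            {-1, 0, 1}};
        double[][] featureMap = convolveRelu(patch, verticalEdge);
        for (double[] row : featureMap) {    // strongest responses appear at the edge location
            for (double v : row) System.out.printf("%8.1f", v);
            System.out.println();
        }
    }
}
```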
It is necessary to have picture flexibility, and this is where the pooling layer is useful. It works on the picture's dimensions (height and width) to gradually shrink the size of the input, allowing the items in the image to be seen and identified no matter where they are positioned in the image space. Pooling also aids in the prevention of overfitting, which occurs when the model retains too much information and has no room for new information. Max pooling is perhaps the most well-known example of pooling: the picture is split into a succession of nonoverlapping regions, and the maximum value in each region is kept, eliminating unnecessary information and reducing the image to its smallest useful size. This operation also helps to account for distortions in the picture. The fully connected layer is the next step, which attaches an artificial neural network to the CNN. By incorporating the diverse extracted information, this network makes it possible to forecast the picture classes with improved accuracy. At this point, the gradient of the error function is computed with respect to the weights of the neural network; the weights and feature detectors are tweaked to obtain the best possible performance, and the process is repeated. The classification process is performed using a convolutional neural network, which consists of many layers and yields a classification rate over the three categorised database image types. Appendix 1 presents the flowchart of the proposed work. This would help patients and practitioners to identify the early stage of liver cancer and assist with diagnosis. Figure 9 shows the classification results on the proposed dataset.
Figure 6: Image outputs in preprocessing.
Table 2 reports the performance metrics of the proposed work on various CT sample images, and Table 3 compares the proposed work with existing work. The SMRF method is implemented in a Hadoop cluster distributed computing environment. We use the Weka workbench to run classic random forest with the same settings as before, and we set the K value to 100 to be able to compare the accuracy levels of the two methods side by side. As an assessment measure, we employ 10-fold cross-validation to evaluate the results of the various approaches, and we compute the mean of the accuracy of these two classifiers in order to decrease the bias of datasets that have been classified in a particular way. The SMRF algorithm yields better results than the traditional algorithm for liver cancer prediction. The proposed model is built on the MapReduce model, which brings substantial benefits for big data analysis and for cloud computing environments. The comparative study with the various algorithms confirms the better results of the implementation. The proposed structure is based on decision trees and is applied to drug discovery for liver cancer. The conclusion is that the SMRF algorithm is more suitable for classifying massive datasets in a distributed computing environment than the traditional random forest algorithm [30]. Data Availability: The data that support the findings of this study are available on request from the corresponding author.
All authors declare that they do not have any conflicts of interest.

References:
The price of innovation: new estimates of drug development costs
ZINC: a free tool to discover chemistry for biology
Virtual screening: an overview
CBO-IE: a data mining approach for healthcare IoT dataset using chaotic biogeography-based optimization and information entropy
Breast and cervical cancer in 187 countries between 1980 and 2010: a systematic analysis
Plant leaves disease classification using Bayesian regularization back propagation deep neural network
A novel blood pressure estimation method based on the classification of oscillometric waveforms using machine-learning methods
The family of MapReduce and large-scale data processing systems
Image forgery detection using singular value decomposition with some attacks
Gaussian process regression (GPR) based non-invasive continuous blood pressure prediction method from cuff oscillometric signals
A robust quasi-quantum walks-based steganography protocol for secure transmission of images on cloud-based E-healthcare platforms
Novel machine learning applications on fly ash based concrete: an overview
Myocardial infarction detection based on deep neural network on imbalanced data
Multiobjective genetic algorithm and convolutional neural network based COVID-19 identification in chest X-ray images
A multitier deep learning model for arrhythmia detection
Learning Apache Mahout Classification
AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading
Building a virtual ligand screening pipeline using free software: a survey
RosettaLigand: protein-small molecule docking with full side-chain flexibility
Cancer detection using artificial neural network and support vector machine
Liver cancer classification using principal component analysis and fuzzy neural network
Review of video compression techniques based on fractal transform function and swarm intelligence
Estimating the surveillance of liver disorder using classification algorithms
Efficient deep learning approach for augmented detection of coronavirus disease
An intelligent agent based framework for liver disorder diagnosis using artificial intelligence techniques
Liver disease prediction using Bayesian classification
Taxonomy on EEG artifacts removal methods, issues, and healthcare applications