A Disparateness-Aware Scheduling using K-Centroids Clustering and PSO Techniques in Hadoop Cluster

E. Laxmi Lydia1, Dr. M. Ben Swarup2
1 Associate Professor, Department of Computer Science and Engineering, Vignan's Institute of Information Technology, Visakhapatnam, Andhra Pradesh, India.
2 Professor, Department of Computer Science and Engineering, Vignan's Institute of Information Technology, Visakhapatnam, Andhra Pradesh, India.
Email: elaxmi2002@yahoo.com

Abstract. Big data storage management is one of the most challenging issues for Hadoop cluster environments, since data-intensive applications frequently involve a high degree of data access locality. Traditional high-performance computing approaches rely on dedicated servers for data storage and data replication. To address the disparateness among jobs and resources, a Disparateness-Aware Scheduling algorithm is proposed for the cluster environment. In this work we present a K-centroids clustering mechanism for big data on a Hadoop cluster. The approach focuses on the energy consumption of the Hadoop cluster, which helps to increase system reliability. The resources of the Hadoop cluster are categorized with the K-centroids clustering algorithm in order to minimize the scheduling delay. A novel provisioning mechanism is introduced that takes load, energy, and network time into account. By integrating these three parameters, an optimized fitness function is employed in Particle Swarm Optimization (PSO) to select the computing node. A failure may still occur in the network after an otherwise successful execution; to improve fault tolerance, the cluster is migrated away from the failed node, the fitness is recomputed by PSO, and the corresponding optimal node is predicted. The experimental results exhibit better scheduling length, scheduling delay, speed up, failure ratio, and energy consumption than the existing systems.

Keywords: K-Centroids Clustering, Big data, Hadoop Cluster, data access locality, data replication, system reliability, particle swarm optimization

1. Introduction

In recent years, big data has rapidly developed into a hotspot that attracts great attention from academia, industry, and even governments around the world [1-2]. Nature and Science have published special issues dedicated to discussing the opportunities and challenges brought by big data [3, 4]. McKinsey, the well-known management and consulting firm, claimed that big data has penetrated into every area of today's industry and business functions and has become an important factor of production [5]. Using and mining big data heralds a new wave of productivity growth and consumer impetus. O'Reilly Media even asserted that "the future belongs to the companies and people that turn data into products" [6]. Some even say that big data can be regarded as the new petroleum that will power the future information economy. In short, the era of big data is already in the offing.

What is big data? So far, there is no universally accepted definition. In Wikipedia, big data is defined as "an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications" [7].
From a macro perspective, big data can be regarded as a bond that subtly connects and integrates the physical world, the human society, and cyberspace. The physical world is reflected in cyberspace, embodied as big data, through the Internet, the Internet of Things, and other information technologies, while human society generates its big-data-based mapping in cyberspace by means of mechanisms like human-computer interfaces, brain-machine interfaces, and the mobile Internet [8]. In this sense, big data can basically be classified into two categories: data from the physical world, which is usually obtained through sensors, scientific experiments, and observations (such as biological data, neural data, astronomical data, and remote sensing data), and data from the human society, which is often acquired from such sources or domains as social networks, the Internet, health, finance, economics, and transportation.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It has been used by many big technology companies, such as Amazon, Facebook, Yahoo and IBM. Hadoop [1] is best known for MapReduce and its distributed file system (HDFS). The MapReduce idea was introduced in a Google paper [11]; put simply, a MapReduce task is another form of divide and conquer. Hadoop [8] is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly. HDFS provides the storage and support for distributed computing; MapReduce and HDFS are the two main subprojects of the Hadoop platform. Hadoop uses the FIFO algorithm as its default scheduling algorithm. Our study of scheduling algorithms for Hadoop found that FIFO alone is unable to satisfy the demands of users: we cannot keep only the idea of first come, first served. We need to serve the requests that have higher priority while at the same time remaining fair to other users. We therefore introduce a K-centroids clustering algorithm for the big data Hadoop cluster.

1.1 Hadoop and HDFS Overview

The Hadoop Distributed File System (HDFS) [4] is the primary storage system used by Hadoop applications. HDFS is designed to handle large files (multi-GB) with sequential read/write operations. Each file is broken into chunks and stored across multiple DataNodes as local OS files; the NameNode keeps track of the overall file directory structure and the placement of chunks. A DataNode reports all its chunks to the NameNode at bootup. Each chunk has a version number, which is increased on every update; therefore the NameNode knows if any of the chunks of a DataNode are stale, and those stale chunks will be garbage collected at a later time. To read a file, the client API calculates the chunk index based on the offset of the file pointer and makes a request to the NameNode. The NameNode replies with the DataNodes that hold a copy of that chunk. From this point, the client contacts the DataNode directly without going through the NameNode.
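The read path above amounts to mapping a byte offset to a chunk index and asking the NameNode where that chunk lives. The following minimal sketch illustrates this; the client calls (namenode.locate, datanode.read_chunk) and the fixed 64 MB chunk size are illustrative assumptions, not the actual Hadoop client API.

CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size in bytes (64 MB)

def chunk_index_for(offset):
    # The client API maps a file-pointer offset to the chunk that contains it.
    return offset // CHUNK_SIZE

def read_at(namenode, path, offset, length):
    idx = chunk_index_for(offset)
    # Ask the NameNode which DataNodes hold a copy of this chunk (hypothetical call),
    # then read from one of them directly, without routing data through the NameNode.
    datanodes = namenode.locate(path, idx)
    return datanodes[0].read_chunk(path, idx, offset % CHUNK_SIZE, length)

# Example: an offset of 200 MB falls in chunk 3 of a file.
print(chunk_index_for(200 * 1024 * 1024))  # -> 3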
The response of the NameNode indicates which replica is the primary and which are the secondary replicas. The client then pushes its changes to all DataNodes in any order, and each DataNode stores the change in a buffer. After the changes are buffered at all DataNodes, the client sends a "commit" request to the primary, which determines an update order and pushes this order to all other secondaries. After all secondaries complete the commit, the primary responds to the client about the success. All changes of chunk distribution and metadata are written to an operation log file at the NameNode. This log file maintains an ordered list of operations, which is important for the NameNode to recover its view after a crash. The NameNode also maintains its persistent state by regularly checkpointing to a file. In case of a NameNode crash, a new NameNode takes over after restoring the state from the last checkpoint file and replaying the operation log.

Fig. 1

1.2 MapReduce Overview

The MapReduce framework [3] consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing any failed tasks. The slaves execute the tasks as directed by the master. As mentioned, MapReduce applications are based on a master-slave model [6]. This part describes the operations that are performed by a generic application to transform input data into output data according to that model. The user defines the map and reduce functions [5]. The map function processes a key/value pair and returns a list of intermediate key/value pairs:

Map(k1, v1) → list(k2, v2)

The reduce function merges all intermediate values having the same intermediate key:

Reduce(k2, list(v2)) → list(v2)

The JobTracker first determines the number of splits (each split is configurable, ~16-64 MB) from the input path and selects some TaskTrackers based on their network proximity to the data sources; the JobTracker then sends the task requests to those selected TaskTrackers. Each TaskTracker starts the map phase by extracting the input data from its splits. For each record parsed by the "InputFormat", it invokes the user-provided "map" function, which emits a number of key/value pairs into a memory buffer. A periodic wakeup process sorts the memory buffer for the different reducer nodes, invoking the "combine" function. The key/value pairs are sorted into one of the R local files (assuming there are R reducer nodes). When the map task completes (all splits are done), the TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers for the reduce phase. Each TaskTracker reads the region files remotely, sorts the key/value pairs and, for each key, invokes the "reduce" function, which collects the key/aggregated value into the output file (one per reducer node).

Fig. 2
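To make the Map and Reduce signatures above concrete, here is a minimal word-count sketch driven by a plain in-memory loop; the function names and the driver are illustrative assumptions and do not use the Hadoop API.

from collections import defaultdict

def map_fn(filename, line):
    # Map(k1, v1) -> list(k2, v2): one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v2): merge all values for the same key.
    return [sum(counts)]

def run(lines):
    intermediate = defaultdict(list)
    for line in lines:                        # map phase
        for k2, v2 in map_fn("input", line):
            intermediate[k2].append(v2)       # shuffle: group values by key
    return {k: reduce_fn(k, vs)[0] for k, vs in intermediate.items()}  # reduce phase

print(run(["big data on a hadoop cluster", "big cluster"]))
# -> {'big': 2, 'data': 1, 'on': 1, 'a': 1, 'hadoop': 1, 'cluster': 2}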
2. Related Work

Hadoop's MapReduce operation is initiated by requests from the TaskTracker to the JobTracker. The principle is similar to ordinary non-preemptive scheduling in an operating system: a task cannot be interrupted once it has been assigned. Among the scheduling algorithms studied for Hadoop, four classic algorithms stand out.

2.1 First In First Out (FIFO): This expression describes the principle of a queue processing technique that services conflicting demands by ordering processes on a first-come, first-served basis: what comes in first is handled first, and what comes in next waits until the first is finished. This is the default algorithm in Hadoop.

2.2 Round Robin (RR): In computer operation, one method of having different processes take turns using the resources of the computer is to limit each process to a certain short time period and then suspend it to give another process a turn (or "time-slice"). This is often described as round-robin process scheduling.

2.3 Highest Priority First (HPF): In this scheduling algorithm, the processor is always assigned to the ready process with the highest priority. The priority number can be static or dynamic. A static priority number is determined when the process is created, based on the initial characteristics of the process or on user requirements, and cannot be changed during operation. A dynamic priority number is given an initial value when the process is created and is then adjusted as the characteristics of the process change during its run.

2.4 Weighted Round Robin (WRR): Weighted Round Robin is a scheduling discipline in which each packet flow or connection has its own packet queue in a network interface card. It is the simplest approximation of generalized processor sharing (GPS): while GPS serves infinitesimal amounts of data from each nonempty queue, WRR serves a number of packets from each nonempty queue, as sketched below.
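As an illustration of the WRR discipline just described, the following sketch serves up to a fixed per-queue weight of packets in each round; the queues and weights are made-up examples, not taken from the paper.

from collections import deque

queues = {"flowA": deque(["A0", "A1", "A2", "A3", "A4"]),
          "flowB": deque(["B0", "B1"])}
weights = {"flowA": 3, "flowB": 1}   # assumed per-flow weights

def wrr_round(queues, weights):
    # Serve each nonempty queue up to its weight in packets, then move on.
    served = []
    for name, q in queues.items():
        for _ in range(weights[name]):
            if not q:
                break
            served.append(q.popleft())
    return served

while any(queues.values()):
    print(wrr_round(queues, weights))
# Round 1: ['A0', 'A1', 'A2', 'B0']; Round 2: ['A3', 'A4', 'B1']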
3. Disparateness-Aware Scheduling Approach in the Hadoop Cluster Environment

This section explains the overall flow of the proposed Disparateness-aware scheduling approach in the cluster environment. Initially, the disparate cluster environment is created along with resource properties such as resource type, processing speed, and memory. In order to avoid scheduling delay, the system forms clusters using K-centroids clustering. Depending on the priorities, each node is moved to a cluster, and the cosine similarity is computed to form the clusters. After the clusters are formed, the fitness function is estimated for each cluster with the consideration of load, energy, and time. The clusters are then scheduled and executed after the load is uploaded. If any failure occurs during the process, the value is recomputed using PSO and another optimal node is predicted. Figure 3 depicts the overall flow diagram of the proposed methodology. The major components of the proposed system are briefly discussed as follows.

Table 1 Symbols and their descriptions

3.1 K-Centroids Clustering

The well-known clustering problem can be solved by K-means, which is one of the simplest unsupervised learning algorithms. It assumes K clusters and classifies the given cluster processors in a simple and easy way. K-means clustering does not guarantee an optimal solution, as its performance depends on the initial centroids. Thus, the proposed system uses a partitioning clustering, namely K-centroids clustering, as described in the following algorithm. Table 1 shows the notation and descriptions employed in the proposed system.

Fig. 3 An overall flow diagram of the proposed K-centroids clustering in the Hadoop cluster.

Algorithm 1: K-Centroids Clustering
Input: Cluster Processors CPn, k value
Begin:
    Initialize k centroids Ck
    For i = 0, 1, 2 … n then    // Cluster Resource Property
        While (Curr_Index < k) then
            If Sim[i][Curr_Index] < Min then
                Min = Sim[i][Curr_Index]
                Min_Index = Curr_Index;
            End If
            Curr_Index++;
        End While
        SetClusterID(Min_Index);
    End For i;
End Begin;

The inputs to the K-centroids clustering algorithm above are the cluster processors and the k value. At first, the k centroids are initialized and the vectors V1 and V2 are defined with respect to the retrieving time, cluster processing speed, and size of the cluster resource. From these vectors, the cosine similarity is estimated. The similarity measure is then compared against the minimum value for the current index; if it is smaller, it becomes the new minimum. This check continues until the current index reaches the k number of centroids, and the index of the minimum is set as the cluster ID. Further, the fitness function is estimated, considering the load, energy, and time for each cluster, until all jobs are scheduled. It is defined as:

f(n) = T(i,r) + L(i,r) + E(i,r)    (1)

3.1.1 Time Computation

The time for the completion of the ith cluster on the rth cluster resource is described as:

T(i,r) = ITr + ET(i,r) + RT(i,r)    (2)

where the executing time ET(i,r) and the retrieving time RT(i,r) of the ith cluster on the rth cluster resource are given in Eq. (3) and Eq. (4), respectively. Herein, BWsr is the bandwidth from the scheduler to the receiver and δsr denotes the delay that occurs between the scheduler and the receiver for network communication.

3.1.2 Load Computation

The load of the rth cluster resource at the ith cluster submission is given in Eq. (5).

3.1.3 Energy Computation

The energy required for the ith cluster on the rth cluster resource is defined as:

E(i,r) = IEr + EE(i,r)    (6)

where the execution energy of the ith cluster on the rth cluster resource is given as:

EE(i,r) = CS * CPEr    (7)

Based on the consideration of time, load, and energy, the proposed fitness function has the following advantages:
- The scheduling delay is avoided in the initial stage.
- The computation cost is minimized.
- The execution time is decreased.
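As a concrete illustration of the similarity comparison in Algorithm 1 and the fitness of Eq. (1), the following sketch assumes each cluster resource and centroid is described by a property vector of (retrieving time, processing speed, resource size); the vectors, the helper names, and the summed fitness terms are illustrative assumptions rather than the authors' implementation.

import math

def cosine_similarity(v1, v2):
    # Cosine similarity between two resource property vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def assign_cluster(resource, centroids):
    # As in Algorithm 1: keep the index (cluster ID) of the minimum similarity.
    sims = [cosine_similarity(resource, c) for c in centroids]
    return min(range(len(centroids)), key=lambda i: sims[i])

def fitness(time_ir, load_ir, energy_ir):
    # Eq. (1): f(n) = T(i,r) + L(i,r) + E(i,r)
    return time_ir + load_ir + energy_ir

centroids = [(2.0, 1.5, 4.0), (0.5, 3.0, 1.0)]      # assumed centroid vectors
print(assign_cluster((1.8, 1.4, 3.5), centroids))    # cluster ID for one resource
print(fitness(12.0, 0.6, 3.2))                       # fitness of one candidate node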
3.2 PSO-Based Disparateness-Aware Scheduling Model

Heuristic optimization algorithms are widely used for solving a wide variety of NP-complete problems. PSO is one of the more recent evolutionary optimization techniques and has fewer algorithm parameters than other approaches.

Algorithm 2: Disparateness-Aware Scheduling using PSO
Input: Cluster List Jn, Cluster Resource CRm
Output: Allocated Cluster List ALn
Begin
    For x = 0, 1, 2 … n then
        TCR = { };    // Temporary Cluster Resource list
        For y = 0, 1, 2 … m then
            If (Jx.Rtype().Equals(CRy.Rtype())) then
                TCR.add(CRy)
            End If;
        End For y;
        SCR = PSO(TCR, Jx)    // Selected Cluster Resource
        Jx.setClusterResource(SCR)
        ALx ← SCR
    End For x;
End Begin;

The inputs to this model are the cluster list and the cluster resources. Initially, the temporary cluster resource list is empty, and the similarity is then verified: the model checks whether the resource type of the cluster list entry matches the resource type of a cluster resource. If they match, that cluster resource is added to the temporary cluster resource list. PSO is then applied to the temporary cluster resource list and the cluster list entry to obtain the selected cluster resource. The final output of this scheduling model is the allocated cluster list together with the selected cluster resources. When a failure occurs during this process, the fitness is recomputed by PSO and another optimal node is predicted. The advantages of this Disparateness-aware scheduling model are:
- The number of failures is minimized.
- Resource utilization is increased.
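The PSO(TCR, Jx) call in Algorithm 2 is not spelled out in the paper; the following is a minimal particle swarm sketch that searches over the indices of the temporary cluster resource list and scores candidates with an Eq. (1)-style sum of time, load, and energy. The swarm parameters, candidate records, and cost function are illustrative assumptions.

import random

def pso_select(candidates, cost, particles=10, iterations=30, w=0.7, c1=1.5, c2=1.5):
    # Particles move over the continuous index range [0, len(candidates) - 1];
    # rounding a position picks a concrete candidate resource.
    hi = len(candidates) - 1
    pos = [random.uniform(0, hi) for _ in range(particles)]
    vel = [0.0] * particles
    pbest = pos[:]                                        # personal best positions
    pbest_cost = [cost(candidates[round(p)]) for p in pos]
    g = min(range(particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g], pbest_cost[g]           # global best

    for _ in range(iterations):
        for i in range(particles):
            r1, r2 = random.random(), random.random()
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], 0), hi)     # keep inside the index range
            c = cost(candidates[round(pos[i])])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i], c
    return candidates[round(gbest)]

# Assumed candidate resources (the TCR list), each with time, load and energy terms.
tcr = [{"id": "CR1", "T": 12.0, "L": 0.6, "E": 3.2},
       {"id": "CR2", "T": 9.5,  "L": 0.9, "E": 2.8}]
print(pso_select(tcr, cost=lambda r: r["T"] + r["L"] + r["E"])["id"])  # typically the lower-cost "CR2"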
4. Performance Analysis

This section compares the performance of the proposed Disparateness-Aware Scheduling (DAS) algorithm with two existing scheduling algorithms: Highest Priority First (HPF) [12] and Weighted Round Robin (WRR) [12]. The performance metrics used for the analysis are: system reliability, scheduling delay, scheduling length, speed up, energy consumption, and failure ratio with respect to the number of clusters. Due to the dynamic availability of resources, evaluating the behavior of scheduling algorithms on real cluster platforms is not practical, and experiments on real platforms are often non-reproducible; simulation is therefore the natural choice for testing and comparing the scheduling algorithms. Thus, an extensive simulation environment of the cluster system is built as in Table 2.

Table 2 Simulation parameters

4.1 System Reliability vs. Number of Clusters

The reliability of the system is calculated as the average reliability of all clusters and is mathematically defined in Eq. (8), where R[EAk] is the reliability probability distribution of cluster Ak and n is the number of clusters. Fig. 4 describes the system reliability in terms of the number of clusters for the existing HPF and WRR and the proposed DAS model. The system reliability of the DAS model is greater than that of the two existing approaches, and its value gradually decreases as the number of clusters grows.

Fig. 4 The result of system reliability with respect to the number of clusters.

4.2 Scheduling Length vs. Number of Clusters

The schedule length is measured as:

SL = max{SL(A1), SL(A2), …, SL(An)}    (9)

Fig. 5 The result of schedule length with respect to the number of clusters.

Fig. 5 depicts the result of the schedule length in terms of the number of clusters for the existing HPF and WRR and the proposed DAS model. The schedule length of the proposed model is lower than that of the two existing systems.

4.3 Speed Up vs. Number of Clusters

The speed up is the ratio of the sequential execution time to the schedule length of the output schedule; it is computed as in Eq. (10). Fig. 6 shows the result of speed up with respect to the number of clusters for the existing HPF and WRR and the proposed DAS model. The speed up of the DAS model is higher than that of the two existing approaches, and its value gradually increases with the number of clusters.

Fig. 6 The result of speed up with respect to the number of clusters.

4.4 Scheduling Delay vs. Number of Cluster Resources

Fig. 7 The result of scheduling delay with respect to the number of cluster resources.

Fig. 7 shows the result of scheduling delay with respect to the number of cluster resources for the existing HPF and WRR and the proposed DAS model. The proposed system has a lower scheduling delay than the other two scheduling algorithms.

4.5 Energy Consumption vs. Number of Cluster Resources

Fig. 8 shows the result of energy consumption with respect to the number of cluster resources for the existing HPF and WRR and the proposed DAS model. The proposed system consumes less energy than the other two scheduling algorithms, and its value gradually increases with the number of cluster resources.

Fig. 8 The result of energy consumption with respect to the number of cluster resources.

4.6 Failure Ratio vs. Number of Cluster Resources

The result of the failure ratio with respect to the number of cluster resources for the existing HPF and WRR and the proposed DAS model is shown in Fig. 9. The proposed system has a lower failure ratio than the other two existing scheduling algorithms.

Fig. 9 The result of failure ratio with respect to the number of cluster resources.

5. Conclusion and Future Work

This paper proposes a Disparateness-Aware Scheduling algorithm for the cluster environment. In this work we present a K-centroids clustering mechanism for big data on a Hadoop cluster. The approach focuses on the energy consumption of the Hadoop cluster, which helps to increase system reliability. The resources of the Hadoop cluster are categorized with the K-centroids clustering algorithm in order to minimize the scheduling delay. A novel provisioning mechanism is introduced that takes load, energy, and network time into account. By integrating these three parameters, an optimized fitness function is employed in Particle Swarm Optimization (PSO) to select the computing node. A failure may still occur in the network after an otherwise successful execution; to improve fault tolerance, the cluster is migrated away from the failed node, the fitness is recomputed by PSO, and the corresponding optimal node is predicted. The experimental results exhibit better scheduling length, scheduling delay, speed up, failure ratio, and energy consumption than the existing systems.

References
[1] V. Mayer-Schonberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton Mifflin Harcourt, 2013.
[2] A. Cuzzocrea, Privacy and security of big data: current challenges and future research perspectives, in: Proceedings of the First International Workshop on Privacy and Security of Big Data, PSBD '14, 2014.
[3] Big data, Nature 455(7209) (2008) 1-136.
[4] Dealing with data, Science 331(6018) (2011) 639-806.
[5] C. O'Neil, R. Schutt, Doing Data Science: Straight Talk from the Frontline, O'Reilly Media, Inc., 2013.
[6] Big data, http://en.wikipedia.org/wiki/Big_data, 2014.
[7] G. Li, X. Cheng, Research status and scientific thinking of big data, Bull. Chin. Acad. Sci. 27(6) (2012) 647-657.
[8] Y. Wang, X. Jin, X. Cheng, Network big data: present and future, Chinese J. Comput. 36(6) (2013) 1125-1138.
[9] X.-Q. Cheng, X. Jin, Y. Wang, J. Guo, T. Zhang, G. Li, Survey on big data system and analytic technology, J. Softw. 25(9) (2014) 1889-1908.
[10] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
[11] http://www.vmware.com/appliances/directory/uploaded_files/What%20is%20Hadoop.pdf.
[12] Haiyang Li, "PWBRR Algorithm of Hadoop Platform".