key: cord-0901989-b02eghd4
authors: Yen, Tseng-Chang; Wang, Kuo-Hsiung; Wu, Chia-Huang
title: Reliability-based measure of a retrial machine repair problem with working breakdowns under the F-Policy
date: 2020-10-06
journal: Comput Ind Eng
DOI: 10.1016/j.cie.2020.106885
sha: 21cb8decd9cdd877254929e3f02ef94d065562db
doc_id: 901989
cord_uid: b02eghd4

We study the reliability and sensitivity analysis of a retrial machine repair problem with working breakdowns operating under the F-policy. The F-policy addresses the common issue of controlling arrivals to a queueing system, and it requires a startup time before failed machines are allowed to enter the orbit. The server is subject to working breakdowns only when there is at least one failed machine in the system. When the server is busy it works at a fast rate, but during a working breakdown it works at a slow rate. Failure and repair times of the server are exponentially distributed. The Laplace transform technique is used to develop two system performance measures: system reliability and mean time to system failure (MTTF). Extensive numerical results show how the performance measures are affected by changes in each system parameter.

Cloud computing is a burgeoning computing paradigm that has gained tremendous popularity in recent years. It shifts the deployment of computing infrastructure (such as CPU, network, and storage) from end users to the cloud data center.

Cloud computing services are delivered over the Internet via virtual machines (VMs) and virtual computing resources. Cloud services provide scalable and flexible infrastructure for the Internet of things (IoT), machine learning, blockchain, computer-assisted instruction (CAI), etc. In the COVID-19 pandemic era, many countries have decided to temporarily close schools to reduce contact and save lives. To avoid disrupting learning and upending lives, teaching is moving online on an untested and unprecedented scale. Online learning is an Internet-based CAI system that uses Internet technology to improve the efficiency of education. By exploiting the benefits of cloud computing, an online-learning cloud system can dynamically allocate computing and storage resources for teaching material. To maintain the quality of the online-learning service and the stability of real-time data transmission, the reliability of the cloud system should be investigated. The maintenance of an online-learning cloud system can be modelled as a retrial machine repair problem.

The retrial machine repair problem has been widely adopted to study many real-world systems, such as call centers, telecommunication systems, and cloud computing networks. In this paper, we consider a retrial machine repair problem (RMRP) with working breakdowns that combines the F-policy with an exponential startup time before failed machines are allowed to enter the retrial system. The server may be subject to a working breakdown while serving a failed machine. When the server is busy or subject to a working breakdown, arriving failed machines cannot receive service immediately; they join the retrial orbit for a certain period of time and then retry to receive service. Maintaining a high level of system reliability is generally a requisite for many real-world repairable systems.

For the most comprehensive concepts in retrial queues, we refer the reader to Artalejo (1999a, 1999b), Artalejo and Gomez-Corral (2008), and Phung-Duc (2019). The literature on retrial systems is vast and rich. Artalejo and Falin (2002) provided extensive comparisons of standard and retrial queueing systems. Sherman and Kharoufeh (2006) studied an unreliable M/M/1 retrial queue with an infinite-capacity orbit and a normal queue. The authors derived stability conditions and some important stochastic decomposability results. Wang (2006) extended Sherman and Kharoufeh's work to study an M/G/1/K retrial queue with server breakdowns. Efrosinin and Winkler (2011) proposed a Markovian retrial queue with a constant retrial rate and an unreliable server that incorporates a threshold recovery policy. The authors developed important results for the waiting time distribution and the optimal threshold level. The authors provided extensive results on the arrival distribution, the busy period, and the waiting time process. Reliability analysis of the retrial queue with server breakdowns and repairs was studied by Wang et al. (2001). Gharbi and Ioualalen (2006) considered finite-source retrial systems with multiple server breakdowns and repairs using a generalized stochastic Petri net model. The authors developed steady-state performance and reliability indices, and also carried out a sensitivity analysis with numerical illustrations. Ke et al. (2013) presented the availability analysis of a repairable retrial system with warm standbys. Kuo et al. (2014) investigated the reliability of a retrial system with mixed standbys. The authors provided a sensitivity analysis for the mean time to failure as well as the steady-state availability. Chen (2018) proposed a reliability analysis of a retrial machine repair system with working breakdowns and a single repair server with a recovery policy. Both sensitivity analyses and relative sensitivity analyses of system reliability and mean time to system failure were performed. Chen and Wang (2018) analyzed the system reliability of the N-policy retrial machine repair problem with a single controllable server. Numerical experiments were provided to reveal how performance measures are affected by changes in each system parameter.

For F-policy queues, Wang et al. (2008) considered the optimal control of a G/M/1/K system combining the F-policy with an exponential startup time before clients are allowed into the system. The authors applied the supplementary variable technique to develop the steady-state probability distribution of the number of clients in the system. Wang and Yang (2009) examined the F-policy M/M/1/K system with server breakdowns. Quasi-Newton methods and direct search were adopted to determine the optimal threshold F, the optimal service rate, and the optimal startup rate at minimum cost. An F-policy M/M/1/K system with working vacations and an exponential startup time was explored by Yang et al. (2010). Extensive numerical results were carried out to reveal the influence of system parameters on the cost function. For systematic developments of F-policy control systems, one can refer to the research papers by Jain et al. (2016a, 2016b) and Chang et al. (2018). The fault tolerance problem of a repairable redundant system with general retrial times operating under the F-policy was analyzed by Jain and Sanga (2017). Numerical results were provided to study the effects of different system parameters on various performance measures. Shekhar et al. (2017) proposed a computational approach to calculate the transient and steady-state probabilities of the machine repair problem with an unreliable server operating under the F-policy. Greedy selection and quasi-Newton methods were used to determine the optimal solutions at minimum cost. Jain and Sanga (2019a) studied an F-policy M/M/1/K retrial queueing system with state-dependent rates. A cost function was constructed to determine the optimal service rate at minimum cost. The optimal control F-policy for the M/M/R/K queue with an additional server and balking was investigated by Jain and Sanga (2019b). A cost model was framed to determine the optimal threshold parameter and the optimal service rate. The finite state-dependent queueing model with general retrial times operating under the F-policy was explored by Jain and Sanga (2019c). A sensitivity analysis of the cost function with numerical illustrations was provided. Jain and Sanga (2020a) provided a state-of-the-art survey of the literature on state-dependent queueing models operating under the F-policy. Recently, Jain and Sanga (2020b) investigated the machine repair system with general retrial times operating under the F-policy in a fuzzy environment. The signed distance method was used to defuzzify the cost function. The authors applied a genetic algorithm to find the optimal control parameter and the corresponding minimum cost. To the best of our knowledge, system reliability and sensitivity analysis of an RMRP with working breakdowns operating under the F-policy has never been discussed in the literature. This paper differs from previous work in that:

(i) Queueing control problems are broadly divided into two categories: one controls the service (N-policy), whereas the other controls arrivals (F-policy). The F-policy reliability problem has distinct characteristics that differ from those of the N-policy reliability problem;

(ii) It aims at controlling arrivals to study reliability characteristics of a retrial machine repair problem;

(iii) We simultaneously consider the F-policy and working breakdowns, a combination that has not been investigated in existing work.

The rest of the paper is organized as follows. Section 2 gives the description of the problem. In Section 3, we use the Laplace transform technique to develop the main results, namely system reliability and mean time to system failure. Sensitivity and relative sensitivity analysis is also addressed in this section. In Section 4, extensive numerical results are carried out to reveal the influence of system parameters on system reliability, mean time to system failure, sensitivity, and relative sensitivity. Section 5 concludes the paper.

We propose a retrial machine repair problem (RMRP) with N = M + S identical and independent machines and a single server in the repair facility. As many as M machines can be operated simultaneously in parallel; the remaining S machines are available as warm-standby spares. Operating machines fail independently according to a Poisson process. Similarly, each available warm standby fails independently of the state of all the others, at its own Poisson rate, which lies between zero and the failure rate of an operating machine. Whenever an operating machine fails, it is immediately replaced by a warm standby, if one is ready. We assume that when a warm standby moves into an operating state, its failure characteristics become those of an operating machine. Each failed operating or standby machine is instantly sent to the single server. The server is unreliable and may experience a working breakdown during busy periods. When the server is busy it serves at a fast service rate; during a working breakdown, instead of terminating service completely, it continues at a lower service rate. Service times are assumed to be exponentially distributed. The server may encounter a working breakdown at any time, at a constant breakdown rate. Whenever the server breaks down, it is immediately repaired at a constant repair rate. Breakdown and repair times of the server are assumed to be exponentially distributed. When the server recovers from a working breakdown, the service rate returns from the lower rate to the fast rate. Whenever a machine fails, it is immediately sent to the repair facility, where repair is provided in the order of breakdowns, that is, under the first-come, first-served (FCFS) discipline. An arriving failed machine that finds the server busy must enter the orbit and retry for service after a certain period of time. Each failed machine in the orbit repeats its service request after a retrial time; retrial times are exponentially distributed.

Furthermore, to control arriving failed machines, an F-policy is implemented. The F-policy is defined as follows. When the number of failed machines in the system reaches full capacity K, no further arriving failed machines are allowed to enter the system until the number of failed machines in orbit decreases to the threshold value F (0 ≤ F ≤ K − 1). At that moment, the server requires an exponentially distributed startup time before it starts allowing failed machines to join the system. The system then operates continuously until the number of failed machines again reaches full capacity K, at which point the above process repeats. According to the F-policy, to ensure that the server can operate normally, the repair facility prevents new failed machines from entering with probability c. Otherwise, when the system is full, the entry of a new failed machine leaves the server unable to complete the repair because of insufficient system resources. The system then moves to the unsafe failure (uf) state with probability 1 − c, and the occurrence of an unsafe failure always results in a system failure.
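The F-policy cycle described above can be sketched as a small admission gate. This is only an illustrative sketch: the class and method names are hypothetical, the threshold F = 3 is an arbitrary value chosen for the demonstration, a single count of failed machines is used for both the capacity and the threshold test, and the exponential startup time is represented by an explicit `finish_startup` call rather than by a random draw.

```python
class FPolicyGate:
    """Toy admission gate following the F-policy description (illustrative only)."""

    def __init__(self, K: int, F: int) -> None:
        if not 0 <= F <= K - 1:
            raise ValueError("threshold must satisfy 0 <= F <= K - 1")
        self.K = K
        self.F = F
        self.blocked = False      # True while new failed machines are refused
        self.starting_up = False  # True while the startup time is running

    def observe(self, n_failed: int) -> None:
        """Update the gate from the current number of failed machines."""
        if not self.blocked and n_failed >= self.K:
            self.blocked = True        # capacity K reached: stop admitting
        elif self.blocked and not self.starting_up and n_failed <= self.F:
            self.starting_up = True    # threshold F reached: begin startup

    def finish_startup(self) -> None:
        """Call when the startup time has elapsed; admission resumes."""
        if self.starting_up:
            self.blocked = False
            self.starting_up = False

    def admits(self) -> bool:
        return not self.blocked


# K = 8 as in the cloud illustration; F = 3 is a hypothetical threshold.
gate = FPolicyGate(K=8, F=3)
trace = []
for n in [5, 8, 6, 3, 3]:
    gate.observe(n)
    trace.append(gate.admits())
gate.finish_startup()
trace.append(gate.admits())
print(trace)
```

The gate stays closed while the startup time is running, which mirrors the requirement that arrivals resume only after startup completes.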

A practical situation related to an online-learning cloud system is presented for illustrative purposes. To satisfy the requests of many concurrent online-learning connections, an online-learning cloud system is established via a VM cluster on four quad-core servers. Since there are 16 CPU cores in total (4 × 4 cores), 16 VMs (each assigned one CPU core) are deployed to establish a private cloud (Fig. 1), in which 12 VMs serve as working online-learning servers and 4 VMs as warm-standby servers. Meanwhile, one physical server with a Linux system and the pre-installed Cloudstack cloud administration tool is deployed as the VM repair server. In addition, a 1 TB CEPH distributed storage system is installed to improve the I/O efficiency of the online-learning system. Each VM consists of a kernel program and a 25 GB root filesystem with a pre-installed Debian Linux system and the Moodle online-learning tool. An extra 200 GB root filesystem is mounted on the VM repair server to provide a storage pool (treated as the orbit) for failed VMs. Therefore, at most 8 (treated as K) failed VMs can be sent to the repair server to wait for the repair procedure. To guarantee the quality of service, a failed working VM is replaced with a warm-standby VM, and failed VMs are sent to the VM repair server, which checks and repairs the malfunctioning root filesystem. When a failed VM arrives, it undergoes the repair procedure immediately, according to its system log file, if the VM repair server is available. Otherwise, the failed VM is transferred to the storage pool of the repair server to wait for filesystem check and reconfiguration. When the number of failed VMs reaches 8, Cloudstack has an 80% (treated as c) chance of prohibiting any failed VMs from entering the repair server because of the storage space limitation K. Cloudstack may fail to prevent new failed VMs from entering the repair server with probability 0.1 and cause an online-learning system error due to excessive use of storage space. A faster service rate is provided when the VM repair server is normal. The VM repair server may be subject to failure and provide a slower service rate due to a software or hardware problem, such as abnormal program execution or a disk hot swap.
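As a reading aid, the structural parameters quoted in this illustration can be collected in a small configuration object. The sketch below is an assumption-laden summary rather than part of the model: the class name and the sanity checks are hypothetical, only values stated in the text (12 working VMs, 4 standbys, capacity 8, an 80% blocking chance) are included, and no rate parameters are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudExample:
    """Structural parameters quoted in the online-learning illustration."""
    M: int = 12     # working online-learning VMs
    S: int = 4      # warm-standby VMs
    K: int = 8      # most failed VMs the repair server can hold
    c: float = 0.8  # stated chance of blocking arrivals when the system is full

    @property
    def N(self) -> int:
        # Total number of VMs, N = M + S.
        return self.M + self.S

    def validate(self) -> None:
        # Sanity checks assumed for this sketch, not taken from the paper.
        if min(self.M, self.S, self.K) < 0:
            raise ValueError("counts must be non-negative")
        if self.K > self.N:
            raise ValueError("capacity K cannot exceed the number of machines")
        if not 0.0 <= self.c <= 1.0:
            raise ValueError("c must be a probability")


example = CloudExample()
example.validate()
print(example.N)  # 16 = 12 + 4
```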

In this RMRP, we describe the system states by the pair (i, j), where i denotes the state of the server and j = 0, 1, 2, ... represents the number of failed machines in the retrial orbit. The values of i define the system states as follows:

i = 0: the server is operating normally but is idle, and arriving/retrial failed machines are allowed to join the system for service;

i = 1: the server is operating normally but is busy, one failed machine is being served, and arriving failed machines are forced to enter the orbit;

i = 2: the server is operating normally but is busy and the system is blocked, and arriving failed machines are not allowed to enter the system;

i = 3: the server is busy during a working breakdown, and failed machines are allowed to join the system;

i = 4: the server is busy during a working breakdown, and failed machines are not allowed to join the system;

i = 5: the server is idle during a working breakdown, and arriving/retrial failed machines are allowed to join the system.
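The six server states listed above can be encoded compactly, which helps when building a transition-rate matrix by program. The following is a minimal sketch; the enum member names and helper functions are hypothetical labels chosen for readability, and "admits arrivals" is taken to mean that failed machines may join the system (service or orbit).

```python
from enum import IntEnum


class ServerState(IntEnum):
    NORMAL_IDLE = 0             # normal, idle, arrivals/retrials admitted
    NORMAL_BUSY = 1             # normal, busy, arrivals go to the orbit
    NORMAL_BUSY_BLOCKED = 2     # normal, busy, system blocked
    BREAKDOWN_BUSY = 3          # working breakdown, busy, arrivals admitted
    BREAKDOWN_BUSY_BLOCKED = 4  # working breakdown, busy, system blocked
    BREAKDOWN_IDLE = 5          # working breakdown, idle, arrivals admitted


BLOCKED = frozenset({ServerState.NORMAL_BUSY_BLOCKED,
                     ServerState.BREAKDOWN_BUSY_BLOCKED})
BREAKDOWN = frozenset({ServerState.BREAKDOWN_BUSY,
                       ServerState.BREAKDOWN_BUSY_BLOCKED,
                       ServerState.BREAKDOWN_IDLE})
IDLE = frozenset({ServerState.NORMAL_IDLE, ServerState.BREAKDOWN_IDLE})


def admits_arrivals(i: int) -> bool:
    """True if failed machines may join the system in server state i."""
    return ServerState(i) not in BLOCKED


def in_working_breakdown(i: int) -> bool:
    return ServerState(i) in BREAKDOWN


def is_idle(i: int) -> bool:
    return ServerState(i) in IDLE


for state in ServerState:
    print(int(state), state.name, admits_arrivals(state))
```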

With the above definition of system states, the state space can be presented as

At time t, the probabilities of different states in the system are defined as follows:

P_{i,j}(t): probability of having j failed machines in the retrial orbit when the server is in state i at time t.

The mean failure rate and the mean retrial rate are given by

The state transition-rate diagram for the RMRP with working breakdowns operating under the F-policy is shown in Figure 2.

Figure 2. Reliability of retrial machine repair system with working breakdowns under the F-policy.

The differential-difference equations for each state at time t can be derived from Figure 2. For instance, the differential-difference equation for the state is

Implementing the Laplace transform and the technique of integration by parts in calculus as 

Similarly, by taking the Laplace transform to the differential-difference equation of each system state, the Laplace transform equations (1)-(21) are established.
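The integration-by-parts step referred to above is the standard rule for the Laplace transform of a derivative. As a worked reminder, write P_{i,j}(t) for a state probability and P^*_{i,j}(s) for its transform (notation assumed here). Because probabilities are bounded, the boundary term at infinity vanishes for s > 0:

```latex
\begin{aligned}
\int_0^\infty e^{-st}\,\frac{dP_{i,j}(t)}{dt}\,dt
  &= \Bigl[e^{-st}P_{i,j}(t)\Bigr]_0^\infty
     + s\int_0^\infty e^{-st}P_{i,j}(t)\,dt \\
  &= s\,P^*_{i,j}(s) - P_{i,j}(0).
\end{aligned}
```

Applying this rule to every state turns each differential-difference equation into a linear algebraic equation in the transforms, with the initial probabilities appearing on the right-hand side.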

[Equations (1)-(21) are not recoverable from the extracted text.]

Here the transition rate matrix is a square matrix whose order depends on K, and it can be partitioned into sub-matrices. [The matrix order, the block layout, and the definitions of the sub-matrices are not recoverable from the extracted text.]

Given the probability that the system has failed at or before time t, the reliability function R_Y(t) is obtained as follows:

It is always finite since the integrand is a monotonically decreasing and bounded function of t. This implies that the mean time to system failure (MTTF) exists and can be calculated by (26).

Consequently, the MTTF can be obtained as
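To make the MTTF computation concrete, the sketch below works through a deliberately tiny absorbing Markov chain rather than the full RMRP. It relies on a standard fact: if Q_T is the generator restricted to the non-failed states, the vector m of mean times to failure solves Q_T m = -1, which equals the integral of the reliability function (the Laplace transform of reliability evaluated at s = 0). The two-state toy chain and its rates `a`, `b`, `f` are hypothetical and chosen only so the answer can be checked by hand.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def mttf(Q_transient, start=0):
    """Mean time to absorption from `start`: solve Q_T m = -1."""
    n = len(Q_transient)
    m = solve(Q_transient, [-1.0] * n)
    return m[start]


# Toy chain: state 0 -> 1 at rate a, state 1 -> 0 at rate b,
# state 1 -> failure (absorbing) at rate f.
a, b, f = 1.0, 2.0, 0.5
Q_T = [[-a, a],
       [b, -(b + f)]]
toy_mttf = mttf(Q_T)
print(toy_mttf)
```

For these rates the hand calculation gives m_1 = (1 + b/a)/f = 6 and m_0 = m_1 + 1/a = 7, which the solver reproduces up to rounding.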

For each system parameter, differentiating (25) with respect to that parameter gives the gradient of system reliability used in the sensitivity analysis.

Similarly, the gradient of the MTTF can be derived by differentiating (27).

Moreover, the relative gradients of R_Y(t) and the MTTF can also be computed.

Since the derived results for R_Y(t) and the MTTF are complex non-linear functions of the system parameters, an explicit closed form of the derivative with respect to each parameter is difficult to obtain. Hence, we calculate the gradient numerically: the gradient in Equation (28) at a given point is computed by a difference quotient with a small difference magnitude.
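The numerical gradient described above can be sketched in a few lines. Two assumptions are made here: a central difference is used (the text only specifies a small difference magnitude, so the exact scheme is a guess), and the relative gradient is taken in its common form, the gradient multiplied by the parameter value and divided by the function value. The function `toy_mttf` is a hypothetical closed-form stand-in for the model output.

```python
def gradient(func, theta, h=1e-6):
    """Central-difference approximation of d func / d theta."""
    return (func(theta + h) - func(theta - h)) / (2.0 * h)


def relative_gradient(func, theta, h=1e-6):
    """Gradient scaled by theta / func(theta) (dimensionless sensitivity)."""
    return gradient(func, theta, h) * theta / func(theta)


def toy_mttf(failure_rate, a=1.0, b=2.0):
    """Hypothetical stand-in for the MTTF as a function of one rate."""
    return (1.0 + b / a) / failure_rate + 1.0 / a


g = gradient(toy_mttf, 0.5)            # exact derivative is -3 / 0.5**2 = -12
rg = relative_gradient(toy_mttf, 0.5)  # exact value is -12 * 0.5 / 7
print(g, rg)
```

The relative form makes parameters with different units comparable, which is why both rankings are usually reported side by side.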

This section presents numerical results for several related analyses.

First, following the online-learning cloud example provided in Section 2.1, the base (initial) parameters are selected [the parameter list is not recoverable from the extracted text; the only surviving value is 0.04].

Based on the above basic setting, except for parameters M and S (which are always fixed), we change one specific parameter at a time and keep the values of the other parameters. That is, we investigate the effect of each parameter on the system reliability to identify which parameters are critical and need to be controlled well. The numerical results of R_Y(t) for system reliability assessment are depicted in Figures 3-13. From these figures, we observe that the mean failure rate, c, and K significantly affect system reliability (see Figures 3, 11, 13). Two parameters, including the fast service rate, affect the system reliability moderately (see Figures 5, 6). Four parameters, including the slow service rate and F, affect it slightly (see Figures 7, 8, 9, 12), while two others hardly affect it (see Figures 4, 10). This suggests several ways for managers to enhance reliability: (i) enlarge the system capacity; (ii) reduce the mean failure rate or improve VM stability; (iii) add mechanisms for successfully blocking failed VMs when the system is full.

Figure 11. System reliability affected by c.

Figure 12. System reliability affected by F.

Figure 13. System reliability affected by K.
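The one-parameter-at-a-time procedure used for these figures can be sketched on a toy model. Everything in the block below is illustrative: the two-state chain with an absorbing failure state, the rate names `a`, `b`, `f`, and the base values are hypothetical stand-ins for the RMRP, and the reliability is obtained by integrating the forward equations with a classical fourth-order Runge-Kutta step instead of inverting a Laplace transform.

```python
def reliability(t_end, a, b, f, steps=2000):
    """R(t_end) for a toy chain: 0 -> 1 (rate a), 1 -> 0 (b), 1 -> fail (f)."""
    def deriv(p):
        p0, p1 = p
        return (-a * p0 + b * p1, a * p0 - (b + f) * p1)

    p = (1.0, 0.0)       # start in state 0 with probability 1
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv((p[0] + 0.5 * h * k1[0], p[1] + 0.5 * h * k1[1]))
        k3 = deriv((p[0] + 0.5 * h * k2[0], p[1] + 0.5 * h * k2[1]))
        k4 = deriv((p[0] + h * k3[0], p[1] + h * k3[1]))
        p = (p[0] + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
             p[1] + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)
    return p[0] + p[1]   # probability of not having failed yet


base = {"a": 1.0, "b": 2.0, "f": 0.5}  # hypothetical base setting


def sweep(name, values, t_end=5.0):
    """Vary one parameter, keep the others at their base values."""
    results = []
    for v in values:
        params = dict(base)
        params[name] = v
        results.append((v, reliability(t_end, **params)))
    return results


curve = sweep("f", [0.25, 0.5, 1.0])
for value, r in curve:
    print(value, round(r, 4))
```

Repeating `sweep` for each parameter name gives one curve family per parameter, which is the structure of the figure series referenced above.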

In this subsection, some numerical experiments are performed to investigate how each system parameter affects the mean time to system failure (MTTF).

Extensive numerical results are carried out on sensitivity and relative sensitivity analysis. First, the sensitivity analysis on system reliability is provided and depicted in Figure 14, from which the parameters can be ranked by the magnitude of their effect on system reliability. Furthermore, the signs of the gradient of system reliability with respect to one group of parameters are negative, which indicates that system reliability increases as these parameters decrease. In contrast, the signs of the gradient with respect to a second group of parameters are positive, and therefore the system reliability increases as these parameters increase. In addition, the results show that the mean failure rate and c have significant effects, and therefore they should be carefully estimated when assessing the system reliability. A monitoring system with continuous data collection may therefore be crucial to control them well.

Figure 14. Sensitivity analysis on reliability function.

Next, as for the sensitivity analysis of the MTTF, Table 9 tabulates the gradient of the MTTF with respect to the various parameters, from which the parameters can be ranked by the magnitude of their effect on the MTTF.

Finally, the relative sensitivity analyses of system reliability and MTTF are also performed. Extensive numerical results for relative sensitivity on system reliability and MTTF are shown in Figure 15 and Table 10, respectively. Figure 15 likewise yields a ranking of the parameters by their relative sensitivity on system reliability.

In this paper, we introduced a retrial machine repair problem with working breakdowns operating under the probabilistic F-policy. We applied the Laplace transform technique and the matrix-analytic method to solve the differential-difference equations of the investigated system. Then, explicit expressions for the system reliability and the mean time to system failure were derived. A practical illustration of an online-learning cloud system to which the proposed model can be applied was provided. Based on this application case, a reliability-based sensitivity analysis of the effects of the system parameters was performed. The associated numerical results were tabulated and displayed graphically to evaluate the impact of the different parameters. The results indicate that the effect of the mean startup rate is negligible. Increasing the probability of successfully blocking arrivals, the mean service rate, or the system capacity can effectively enhance the system reliability. The results also show that the mean arrival rate and the blocking probability need to be estimated carefully: if they change without the manager noticing, the system reliability may drop and cause additional operating cost. Possible topics for future research include: (1) random times (e.g. repair times or retrial times) that follow general distributions, (2) the system availability of the model, (3) a server subject to unpredictable breakdowns, and (4) the optimal management strategy.

• We consider an F-policy retrial machine repair problem with working breakdowns.
• One practical illustration of the online-learning cloud system is presented.
• The differential equations are tabulated and solved by the Laplace transform technique.
• Closed-form expressions for the reliability and the MTTF are explicitly derived.
• Reliability sensitivity analysis with respect to system parameters is performed.
• Several management insights for managers to enhance the reliability are provided.

Accessible bibliography on retrial queues. Mathematical and Computer Modelling

A classified bibliography of research on retrial queues: progress in 1990-1999. Top

Standard and retrial queueing systems: a comparative analysis

Retrial Queueing Systems: A Computational Approach

Analysis of a standby redundant system with controlled arrival of failed machines

Reliability analysis of a retrial machine repair problem with warm standbys and a single server with N-policy

System reliability analysis of retrial machine repair systems with warm standbys and a single server of working breakdown and recovery policy

Queueing system with a constant retrial rate, nonreliable server and threshold-based recovery

GSPN analysis of retrial systems with servers breakdowns and repairs

Control F-policy for fault tolerance machining system with general retrial attempts

F-policy for M/M/1/K retrial queueing model with state-dependent rates. Performance Prediction and Analytics of Fuzzy, Reliability and Queuing Models

Optimal control F-policy for M/M/R/K queue with an additional server and balking

Admission control for finite capacity queueing model with general retrial times and state dependent rates

State dependent queueing models under admission control F-policy: a survey

Fuzzy cost optimization and admission control for machine interference problem with general retrial

Control F-policy for Markovian retrial queue with server breakdowns. 1st IEEE International Conference on Power Electronics, Intelligent Control and Energy System

Queueing analysis of machine repair problem with controlled rates and working vacation under F-policy

Availability of a repairable retrial system with warm standby components

Reliability-based measures for a retrial system with mixed standby components

Retrial queueing models: A survey on theory and applications

Threshold control policy for maintainability of manufacturing system with unreliable workstations

An M/M/1 retrial queue with unreliable server

Reliability analysis M/G/1 queues with general retrial times and server breakdowns

Reliability analysis of the retrial queue with server breakdowns and repairs

A recursive method for the F-policy G/M/1/K queueing system with an exponential startup time

Controlling arrivals for a queueing system with an unreliable server: Newton-Quasi method

Optimization and sensitivity analysis of controlling arrivals in the queueing system with single working vacation