title: Statistical Perspectives on Reliability of Artificial Intelligence Systems
authors: Hong, Yili; Lian, Jiayi; Xu, Li; Min, Jie; Wang, Yueyao; Freeman, Laura J.; Deng, Xinwei
date: 2021-11-09

Artificial intelligence (AI) systems have become increasingly popular in many areas. Nevertheless, AI technologies are still in their developing stages, and many issues need to be addressed. Among those, the reliability of AI systems needs to be demonstrated so that the AI systems can be used with confidence by the general public. In this paper, we provide statistical perspectives on the reliability of AI systems. Different from other considerations, the reliability of AI systems focuses on the time dimension. That is, the system can perform its designed functionality for the intended period. We introduce a so-called SMART statistical framework for AI reliability research, which includes five components: Structure of the system, Metrics of reliability, Analysis of failure causes, Reliability assessment, and Test planning. We review traditional methods in reliability data analysis and software reliability, and discuss how those existing methods can be transformed for reliability modeling and assessment of AI systems. We also describe recent developments in modeling and analysis of AI reliability and outline statistical research challenges in this area, including out-of-distribution detection, the effect of the training set, adversarial attacks, model accuracy, and uncertainty quantification, and discuss how those topics can be related to AI reliability, with illustrative examples. Finally, we discuss data collection and test planning for AI reliability assessment and how to improve system designs for higher AI reliability. The paper closes with some concluding remarks.

Autonomous systems are the main applications. Typical examples include autonomous vehicles, industrial robotics, aircraft autopilot systems, and unmanned aircraft (e.g., drones). Empowered by sensor, wireless communication, and big data technologies, autonomous systems use AI systems to perceive the operating environment and make decisions in a real-time manner. The autonomous vehicle (AV) is perhaps the example closest to everyday life. Ma et al. (2020) provided a detailed description of how various AI and big data techniques are integrated into AV systems. There are many manufacturers working on the design and testing of AVs (e.g., Waymo and Cruise). There are also programs that allow AV units to be tested on public roads (e.g., the AV tester program by the California Department of Motor Vehicles). Industrial robotics empowered by AI systems that can achieve a high level of automation is taking the global manufacturing industry into the era of Industry 4.0. Industrial robotics can improve productivity to a large extent and reduce the cost of production. Webster and Ivanov (2020) discussed the evolution of the integration of robots and AI technology in economics and society. In aerospace, AI systems advance the industry with tremendous progress in aircraft autopilot systems and unmanned aircraft/drones. For example, Doherty et al. (2013) studied high-level mission specification and planning using delegation in unmanned aircraft. Baomar and Bentley (2016) discussed a robust learning-by-imitation approach to extend the capabilities of the intelligent autopilot system.
Sarathy et al. (2019) discussed the safety issues in applying AI in unmanned aircraft. Overall, the reliability of those autonomous systems is important because of the critical nature of those applications.

At the component level, AI technologies and modules provide innovative approaches to solve problems in many areas, although their autonomous level is not the same as that of systems such as self-driving cars. Examples of those technologies include CV and NLP. CV includes image (e.g., face) and pattern recognition, and NLP includes speech recognition and machine translation. The application of those technologies, most of the time with the participation of humans, has greatly improved human productivity. There are also many kinds of bots built based on AI technology, such as chatbots and other content-related bots. In the medical area, AI is used to assist diagnosis with a high level of automation. For example, Imran et al. (2020) provided an AI-based diagnosis approach for COVID-19 using audio information from the patient's cough. Esteva et al. (2021) discussed the application of DL-based CV approaches to medical imaging, medical video, and clinical development. In summary, the reliability of those AI applications at the component level is still important because it affects the validity of decision-making and the user experience.

For any system, reliability is often of interest because failure events can lead to safety concerns. Failures of AI systems can lead to economic loss and even, in some extreme cases, to loss of life. For example, a failure in the autopilot system of an autonomous car can lead to an accident with loss of life. Thus, reliability is critical, especially for autonomous systems. To provide some concrete examples, we use the AI incident cases reported in the AI Incident database (2021), which gathers news entries from various sources. Among the 126 incidents reported to date, we found that 72 can be related to reliability events. Figure 2 plots the counts for the AI application sectors, and for the AI systems and technologies, based on the 72 reliability-related cases. The figure shows that AI applications are popular in many sectors and that various AI systems and technologies are applied. Despite the prevalence, we also notice that 29 of those 72 incidents involve deaths or injuries, which shows that reliability issues can lead to serious loss. From another point of view, the large-scale deployment of AI technologies requires public trust. High reliability is one important aspect of winning the trust of consumers, which requires reliability demonstration.

The importance of the reliability of AI systems has been highlighted by several authors. For example, Jenihhin et al. (2019) reviewed the challenges of reliability assessment and enhancement in autonomous systems. Athavale et al. (2020) discussed the trends in AI reliability in safety-critical autonomous systems on the ground and in the air. The proper demonstration of AI system reliability requires capturing real-world scenarios together with real failure modes. Thus, data collection is essential for demonstrating AI reliability, and statistics can play an important role in such efforts. After AI reliability data are obtained, statistical modeling and analysis can be conducted and reliability predictions can be made.
The data can also be used to identify causes of reliability issues, and thus one can improve the design of the AI systems for better reliability, which is challenging but provides opportunities for statistical reliability research. AI reliability falls within the larger scope of AI safety and AI assurance, the importance of which has been emphasized in many existing research papers. Amodei et al. (2016) outlined concrete problems in AI safety research, and Batarseh, Freeman, and Huang (2021) provided a comprehensive review of research on AI assurance. Thus, from a larger scope, AI reliability is an important aspect of AI assurance, to which research efforts need to be devoted.

The rest of the paper is organized as follows. Section 2 describes a general "SMART" statistical framework for AI reliability. Section 3 briefly describes the common methods used in traditional reliability analysis and how they can be linked to AI reliability studies. Section 4 discusses new challenges in statistical modeling and analysis of AI reliability, and several specific topics for statistical research with illustrative examples. Section 5 discusses how to use design of experiments approaches in data collection for AI reliability studies and improvement. Section 6 contains some concluding remarks.

In this section, we introduce the "SMART" framework for the AI reliability study, which contains five components. Here, the acronym "SMART" comes from the first letter of each of the five components below.
• Structure of the system: Understanding the system structure is a fundamental step in the AI reliability study.
• Metrics of reliability: Appropriate metrics need to be defined for AI reliability so that data can be collected over those metrics.
• Analysis of failure causes: Conducting failure analysis to understand how the system fails (i.e., failure modes) and what factors affect the reliability.
• Reliability assessments: Reliability assessments of AI systems include reliability modeling, estimation, and prediction.
• Test planning: Test planning methods are needed for efficient reliability data collection.
The first three points are covered in Section 2. The traditional and new frameworks for AI reliability assessment are covered in Sections 3 and 4, respectively. Test planning is covered in Section 5.

For AI systems (e.g., autonomous vehicles), we can conceptually divide the overall system into hardware systems and software systems. Figure 1 lists some commonly seen hardware and software systems. In addition to the typical hardware in a product (e.g., the mechanical devices in a vehicle), the hardware used for AI computing can include the central processing unit (CPU), application-specific integrated circuit (ASIC), graphics processing unit (GPU), tensor processing unit (TPU), and intelligence processing unit (IPU). There are also various types of cameras, sensors, and devices that are used for collecting images, sounds, and other data formats that feed into the AI system. The hardware can also include network infrastructure, as wireless communication is common for AI systems. The core of many software systems consists of machine learning/deep learning (ML/DL) based algorithms and other rule-based algorithms. The algorithms include image recognition, speech recognition, CV, NLP, and classification. The text cloud in Figure 3(a) shows the variety of algorithms used in the reliability-related cases reported in the AI Incident database.
In addition to the core ML/DL algorithms, the software system can also include data collection, processing, and decision-making components. Many algorithms are based on DL models. The widely used DL algorithm structures include the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), and reinforcement learning (RL). An introduction to those neural network structures can be found in Goodfellow, Bengio, and Courville (2016). Transfer learning has also been used in AI systems. A comprehensive introduction to transfer learning is available in Pan and Yang (2009).

Hardware reliability is in general well studied, and there are mature methods for testing and assessing hardware reliability. Thus, the focus of AI reliability, different from traditional reliability studies, is on the software system. More specifically, it is on the reliability of those ML/DL algorithms. Compared to hardware reliability, software reliability is typically more difficult to test, which brings challenges to the research and development of reliable AI systems. In addition to the hardware and software systems, there are two other factors to consider as part of the AI system structure: the hardware-software interaction and the interaction of the system with the operating environment. The new challenges concern how hardware errors affect the software and whether there are algorithms or architectures that are more robust to hardware failures. AI systems are typically trained or developed for use in a certain operating environment. When the operating environment changes, it is likely that the AI systems will encounter errors. Thus, system structures that can adapt to the operating environment will make the system more reliable.

The usual definition of reliability is the probability of a system performing its intended functions under expected conditions. Reliability is closely related to robustness and resilience, but the focus is on the time dimension. There is little work that formally defines the reliability of AI systems. For AI systems, because the software system is of major concern, the definition of AI reliability leans toward the software part. Kaur and Bahl (2014) defined the reliability of software as "the probability of the failure-free software operation for a specified period of time in a specified environment." There are three key elements in the definition of reliability: "failure", "time", and "environment". The failure events of an AI system can be mostly related to software errors, in addition to the failure of hardware. For hardware failures, Hanif et al. (2018) discussed that AI hardware failures are related to soft errors, aging, process variation, and temperature. For software failures, software errors and interruptions are generally considered as failure events. For example, the occurrence of a disengagement event is considered a failure of the AV system (e.g., Min et al. 2020). The environment includes the physical environment for hardware systems, such as temperature, humidity, and vibration, and the environment defined by the training dataset. For example, if there is an object that is not in the training dataset, the AI system may not be able to recognize it, and thus the object is beyond the intended operating environment. The operating environment defined by the software system is usually more important for AI reliability. Metrics are needed to characterize reliability for AI systems, such as the failure rate, event rate, and error rate; a small sketch of computing such empirical event-rate metrics from event logs is given below.
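As a minimal illustration of event-rate-type metrics (not taken from the paper), the following sketch computes an empirical event rate per operating hour and per 1,000 miles from a hypothetical log of interruption events; the timestamps and mileage values are made up purely for illustration.

```python
import numpy as np

# Hypothetical log for one AI system: times (hours) at which interruption
# events occurred, total operating hours, and total mileage driven.
event_times_hours = np.array([12.5, 47.0, 110.3, 180.9, 410.2])
total_hours = 500.0
total_k_miles = 18.0   # thousands of miles

n_events = len(event_times_hours)
event_rate_per_hour = n_events / total_hours
event_rate_per_k_miles = n_events / total_k_miles
mean_time_between_events = total_hours / n_events

print(f"events observed:            {n_events}")
print(f"event rate per hour:        {event_rate_per_hour:.4f}")
print(f"event rate per 1,000 miles: {event_rate_per_k_miles:.3f}")
print(f"mean time between events:   {mean_time_between_events:.1f} hours")
```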
For hardware, the bit-flip error rate has been used in the literature. The metrics for software-related failures are more complicated. The measurement of the reliability of an AI algorithm is associated with the performance of the AI algorithm. Most AI algorithms are designed to solve problems of classification, regression, clustering, etc. Bosnić and Kononenko (2009) used prediction accuracy from ML algorithms as a reliability measure. Zhang, Mukherjee, and Lebeck (2019) defined statistical robustness from three aspects: sampling quality, convergence diagnostic, and goodness of fit. Jha et al. (2019) introduced an attribution-based confidence metric for DNNs, which can be computed without the training dataset. Overall, there are many metrics available at the algorithm level, but universal metrics for algorithm reliability are generally lacking. At the system level, the event rate is suitable for a wide range of applications.

Hardware failures can also include network infrastructure failures, because the network is an important component of many AI systems. Thus, network failure is also a form of hardware failure. Traditional factors such as the physical environment (e.g., temperature, humidity, vibration) and product use rate affect the hardware systems. There can be various reasons for failures at the software level. Figure 3(b) shows the text cloud for the potential failure causes for the reliability-related cases reported in the AI Incident database. As we can see from the plot, typical causes are prediction errors, data quality, model bias, adversarial attacks (AA), and so on. Many prediction errors are caused by distribution shift. Distribution shift usually means the operating environment is different from the training-set environment. For example, Mårtensson et al. (2020) studied the reliability of DL models on out-of-distribution MRI data. AA can be a critical issue for reliability. In AA, a small perturbation of the data is applied to make the model prediction inaccurate, leading to reliability incidents. For example, Song et al. (2018) provided an example of AA on object detection tasks. Data quality can also lead to software failures. If the data that are fed into the algorithm come with noise, are contaminated, or are from faulty sensors, failures can also occur. For example, Ma et al. (2020) discussed the effect of sensor data quality on AI reliability. Failure causes can also differ by algorithm type (e.g., CNN, RNN), and certain algorithms may be less prone to failure than others. Compared to hardware issues, software issues are more of a concern for AI reliability. An exploratory analysis of the causes of disengagement events using the California driving test data found that software issues were the most common reasons for failure events.

Based on the above discussion, the factors that can affect AI reliability fall into three categories: operating environment, data, and model (i.e., algorithm). Figure 4 shows the Venn diagram for the three factors that affect AI reliability, which will be further discussed in Section 4.1. The interactions of the three factors (e.g., the data-model interaction) complicate the reliability modeling problems and provide many opportunities for statistical research. Reliability assessment of AI systems requires data that are collected over the reliability metrics. Depending on the reliability metric, different kinds of data can be collected. The existing work on AI reliability data analysis is sparse.
At the system level, the reliability data are usually in the format of time-to-event data. The availability of reliability data for autonomous systems is limited due to the sensitive nature of reliability data. The California Department of Motor Vehicles (2020) provides publicly available data from its AV testing program, which can be used for reliability analysis. At the algorithm level, a recent review (2020) discussed methods for improving the reliability of deep neural networks in NLP. Zhao et al. (2020) proposed a safety framework based on Bayesian inference for critical systems using DL models. In summary, a general modeling framework for the reliability of AI systems needs to be developed.

Traditional reliability analysis mainly uses time-to-event data, degradation data, and recurrent events data to make reliability predictions. The classical methods of reliability data analysis can be found in, for example, Nelson (1982), Lawless (2003), and Meeker, Escobar, and Pascual (2021). The area of reliability analysis has gone through many changes due to technology development, especially the prevalence of sensors, and new opportunities have been outlined in Meeker and Hong (2014) and Hong, Zhang, and Meeker (2018). In this section, we give a brief introduction to traditional reliability analysis methods and link them to AI reliability data analysis.

For failure-time data, let T denote the failure time and consider the log-location-scale family of distributions, which includes the lognormal and Weibull distributions. The cumulative distribution function (cdf) and probability density function (pdf) of T are, respectively,
F(t; θ) = Φ[(log(t) − µ)/σ]  and  f(t; θ) = φ[(log(t) − µ)/σ]/(σt).
Here, µ is the location parameter, σ is the scale parameter, and θ = (µ, σ)′. For the lognormal distribution, we can replace Φ and φ with the standard normal cdf Φ_nor and pdf φ_nor, respectively. For the Weibull distribution, we can replace Φ and φ with the standard smallest extreme value cdf Φ_sev(z) = 1 − exp[−exp(z)] and pdf φ_sev(z) = exp[z − exp(z)]. For failure-time data, the reliability function is defined as R(t; θ) = Pr(T > t). With failure or right-censoring times t_i and failure indicators δ_i observed for units i = 1, . . . , n, the likelihood function can be written as
L(θ) = ∏_{i=1}^{n} [f(t_i; θ)]^{δ_i} [1 − F(t_i; θ)]^{1−δ_i}.    (1)
The maximum likelihood (ML) estimates can be obtained by finding the value of θ that maximizes (1). The inference of the reliability of the product is based on the estimated reliability function R(t; θ) with θ replaced by its ML estimate.
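The following is a minimal sketch (not from the paper) of fitting a Weibull failure-time model to right-censored data by maximizing a likelihood of the form (1); the lifetimes, censoring time, and parameter values are simulated and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Simulate hypothetical lifetimes (e.g., time to first failure of an AI module),
# right-censored at a fixed study end time tau.
shape_true, scale_true, tau = 1.5, 200.0, 150.0
t_full = weibull_min.rvs(shape_true, scale=scale_true, size=100, random_state=1)
delta = (t_full <= tau).astype(float)        # 1 = failure observed, 0 = censored
t_obs = np.minimum(t_full, tau)

def neg_log_lik(par):
    shape, scale = np.exp(par)               # optimize on the log scale for positivity
    log_f = weibull_min.logpdf(t_obs, shape, scale=scale)   # failure contributions
    log_s = weibull_min.logsf(t_obs, shape, scale=scale)    # censored contributions
    return -np.sum(delta * log_f + (1.0 - delta) * log_s)

fit = minimize(neg_log_lik, x0=np.log([1.0, 100.0]), method="Nelder-Mead")
shape_hat, scale_hat = np.exp(fit.x)

# Estimated reliability R(t) at t = 100 based on the fitted model.
print(f"shape_hat = {shape_hat:.2f}, scale_hat = {scale_hat:.1f}")
print(f"R(100) estimate = {weibull_min.sf(100.0, shape_hat, scale=scale_hat):.3f}")
```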
Degradation data measure the performance of the system over time. When the performance deterioration reaches a pre-defined failure threshold, a failure occurs. For illustration, Figure 5(b) plots the degradation paths for a group of laser units, in which the performance deterioration is measured by the percentage of current increase. Performance degradation can also occur for AI models over time. For example, the performance of AI-based models can deteriorate over time after deployment in the field due to changes in operating conditions. For computing hardware, the soft error rate can increase over time for AI systems. For degradation data, the two widely used classes of models are the general path models (e.g., Lu and Meeker 1993, Bagdonavičius and Nikulin 2001, and Bae, Kuo, and Kvam 2007) and the stochastic process models. The stochastic models include the Wiener process (e.g., Whitmore 1995), the gamma process (e.g., Lawless and Crowder 2004), and the inverse Gaussian process (e.g., Ye and Chen 2014). Covariate information is often incorporated through regression models.

Here we give a brief introduction to the general path model (GPM), which was originally proposed by Lu and Meeker (1993). Suppose the degradation level is D(t) at time t. We consider that a failure has occurred for a unit if its D(t) reaches a failure-definition level D_f, and the failure time T is the first time D(t) crosses D_f. The basic idea of the GPM is to find a parametric model that fits all degradation paths well. Let y_ij be the degradation measurement for unit i at time t_ij, j = 1, . . . , n_i and i = 1, . . . , n. Here, n is the number of units, and n_i is the number of measurements from unit i. Then the degradation path can be modeled as
y_ij = D(t_ij; α, γ_i) + ε_ij,
where α represents the vector of fixed-effect parameters and γ_i represents the vector of random effects for unit i. The random effects are assumed to follow a multivariate normal distribution, γ_i ~ MVN(0, Σ). The errors are assumed to be independent and to follow a normal distribution, ε_ij ~ N(0, σ²_ε). Denote the parameters in the model as θ = {α, Σ, σ²_ε}. For the estimation of θ, we can obtain the maximum likelihood estimates by maximizing the likelihood function. For an increasing path, the cdf of the failure time is F_T(t; θ) = Pr[D(t) ≥ D_f]. Except for some simple degradation paths, in most situations F_T(t; θ) does not have a closed-form expression. Numerical methods such as numerical integration and simulation can be used to compute the cdf of T. The inference of the reliability is made based on the estimated F_T(t; θ).

Recurrent events occur when a system can experience the same type of event repeatedly over time. For AI systems, failure-time data and degradation data mainly result from hardware failures, while recurrent events data mainly come from software failures. Recurrent events occur in AI systems, such as the disengagement events in autonomous vehicles analyzed in Min et al. (2020). Although we defer the details of the dataset to Section 3.3, Figure 6(a) plots the recurrence of the disengagement events over time for AV units. The recurrent events data are often modeled by event intensity models or mean cumulative functions, with regression models often used to incorporate covariates. The nonhomogeneous Poisson process (NHPP) and the renewal process are widely used (e.g., Yang et al. 2013, and Hong et al. 2015). Lindqvist et al. (2003) proposed the trend-renewal process, which includes the NHPP and the renewal process as special cases. Here we briefly introduce the NHPP model. Denote by N(t) the number of events that occurred in (0, t] and by N(s, t) the number of events in (s, t]. For a Poisson process, the number of recurrences in (s, t] follows a Poisson distribution with parameter Λ(s, t). That is,
Pr[N(s, t) = r] = [Λ(s, t)]^r exp[−Λ(s, t)]/r!,  r = 0, 1, 2, . . . .
Here, Λ(s, t) represents the cumulative intensity function between times s and t. That is, Λ(s, t) = ∫_s^t λ(u) du, and λ(u) is a positive recurrence rate. For the NHPP, the intensity function is non-constant, and it can be assumed to have a known functional form with unknown parameters. For example, the power-law function,
λ(t; θ) = (β/η)(t/η)^(β−1),
is a commonly used form for the intensity function with parameter θ = (β, η)′. For parameter estimation, we can use the maximum likelihood method. Suppose we have n units, and the event times for unit i are t_ij, ordered as 0 < t_i1 < t_i2 < · · · < t_{i n_i} < τ_i. Here, n_i = N(τ_i) is the number of events and τ_i is the last observation time for unit i. Then the likelihood is
L(θ) = ∏_{i=1}^{n} {∏_{j=1}^{n_i} λ(t_ij; θ)} exp[−Λ(0, τ_i)],
with ∏_{j=1}^{0}(·) = 1. The statistical inference is based on the estimated intensity function λ(t; θ), with θ replaced by its ML estimate.
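To make the recurrent-events formulas concrete, here is a small sketch (not from the paper) that simulates event times from a power-law NHPP on a fixed observation window and recovers (β, η) by maximizing the likelihood above; all settings are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Simulate recurrent events (e.g., software interruptions) for n units from a
# power-law NHPP with Lambda(t) = (t/eta)^beta, each observed on (0, tau].
beta_true, eta_true, tau, n = 0.8, 50.0, 730.0, 20
unit_events = []
for _ in range(n):
    times, hpp_time = [], 0.0
    while True:
        hpp_time += rng.exponential(1.0)                # unit-rate HPP arrival
        t = eta_true * hpp_time ** (1.0 / beta_true)    # invert Lambda to get an NHPP time
        if t > tau:
            break
        times.append(t)
    unit_events.append(np.array(times))

def neg_log_lik(par):
    beta, eta = np.exp(par)                             # enforce positivity
    ll = 0.0
    for t in unit_events:
        ll += np.sum(np.log(beta / eta) + (beta - 1.0) * np.log(t / eta))
        ll -= (tau / eta) ** beta                       # minus Lambda over (0, tau]
    return -ll

fit = minimize(neg_log_lik, x0=np.log([1.0, 100.0]), method="Nelder-Mead")
print("beta_hat, eta_hat =", np.round(np.exp(fit.x), 3))
```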
Software reliability is an area of traditional reliability that is closely related to AI reliability. In modeling software reliability, usually a software reliability growth model (SRGM) based on the NHPP is built. The assumption is that the faults of the software should be fixed through testing. There are basically two different classes of traditional SRGMs, based on the shape of the cumulative number of failures against time, as described in Wood (1996): S-shaped and concave. Some commonly used NHPP-based model forms are listed in Table 1. Different from traditional software, in which the operating environment is relatively stable and the system is less affected by the environment, the performance of an AI algorithm depends on the operating environment to a larger extent. In addition, there may be intrinsic errors inside the AI algorithm that cannot be removed. Thus, the assumption in traditional software reliability, that the reliability of software goes to 1 as testing time goes to infinity, may not hold for AI reliability. To address this, Bastani and Chen (1990) used two independent Poisson processes to model the features of AI reliability, in which the failure rate decreases through time in one Poisson process and is fixed in the other Poisson process. Such an idea can be further extended to model more complicated AI reliability problems. There are also NHPP-based software reliability models that were developed for the fault intensity function and the mean cumulative function (Song, Chang, and Pham 2017). Here we briefly discuss those ideas. As introduced in Pham and Pham (2019), let λ(t) be the fault intensity function, Λ(t) be the mean cumulative function, and N(t) be the number of faults. The mean cumulative function Λ(t) is modeled through the fault intensity function, Λ(t) = ∫_0^t λ(u) du. Further, the λ(t) term is multiplied by a random effect that represents the uncertainty of the system fault detection rate in the operating environments. The paper also takes the uncertainty of the operating environment into consideration and lets λ(t) be a stochastic process, considering both a dynamic additive noise model and a static multiplicative noise model for λ(t). These models, however, have not been applied in modeling an AI system yet.

In this section, we provide an illustration of how traditional reliability methods can be applied in modeling AI reliability. Min et al. (2020) analyzed the disengagement events data from the AV testing program overseen by the California Department of Motor Vehicles (DMV). A disengagement event happens when the AI system and/or the backup driver determines that the driver needs to take over the driving. Disengagement events can be seen as a sign that the AV is "not reliable enough." The program provides data on disengagement event time points and the monthly mileage driven by the AVs. The total follow-up time is τ = 730 days. Let x_i(t), 0 < t ≤ τ, be the mileage driven for unit i at time t. The unit of x_i(t) is k-miles (i.e., 1000 miles). The daily average of the monthly mileage was used for x_i(t), so x_i(t) can be represented as x_i(t) = Σ_{l=1}^{n_τ} x_il 1(τ_{l−1} < t ≤ τ_l). Here, n_τ = 24 is the number of months in the follow-up period, x_il is the daily mileage for unit i during month l, τ_l is the ending day of month l counted from the start of the study, and 1(·) is an indicator function. Let x_i(t) = {x_i(s) : 0 < s ≤ t} denote the history of the mileage driven for unit i up to time t. Figure 6 shows a visualization of a subset of the recurrent events data from the manufacturer Waymo. In particular, Figure 6(a) shows the observation windows and events of 20 AVs, and Figure 6(b) shows the corresponding driven mileage of five AV units. As described in Section 3.1, the NHPP model is usually used to describe the recurrent event rate. The event intensity function for unit i is modeled as
λ_i[t; x_i(t), θ] = λ_0(t; θ) x_i(t).
Here, λ_0(t; θ) = λ_0(t) is the baseline intensity function (BIF) with parameter vector θ. Because λ_0(t; θ) is the mileage-adjusted event intensity, the BIF can be interpreted as the event rate per k-miles at time t when x_i(t) = 1. The baseline cumulative intensity function (BCIF) is Λ_0(t; θ) = Λ_0(t) = ∫_0^t λ_0(s; θ) ds. Note that Λ_0(0; θ) = 0 and Λ_0(t; θ) is a non-decreasing function of t. The BCIF Λ_0(t) can be interpreted as the expected number of events from time 0 to t when x(t) = 1 for all t.
The cumulative intensity function (CIF) for unit i is Λ_i[t; x_i(t), θ] = ∫_0^t λ_0(s; θ) x_i(s) ds. Commonly used parametric models for the NHPP, such as those listed in Table 1, can be applied to model Λ_0(t). Other than the parametric models, Min et al. (2020) also proposed a more flexible nonparametric spline method to estimate the BCIF of the NHPP. In the spline model, the BCIF is represented as a linear combination of spline bases. That is,
Λ_0(t; θ) = Σ_{l=1}^{n_s} β_l γ_l(t),  β_l ≥ 0,  l = 1, . . . , n_s.
Here, θ = (β_1, . . . , β_{n_s})′ is the vector of spline coefficients, the γ_l(t)'s are the spline bases, and n_s is the number of spline bases. Taking the derivative with respect to t, the BIF is λ_0(t; θ) = Σ_{l=1}^{n_s} β_l dγ_l(t)/dt. Because of the constraints that Λ_0(0; θ) = 0 and that Λ_0(t; θ) is a non-decreasing function of t, I-splines of degree 3 are used (e.g., Ramsay 1988). Each I-spline basis takes value zero at t = 0 and is monotonically increasing. By taking non-negative coefficients (i.e., β_l ≥ 0), a non-decreasing Λ_0(t; θ) is obtained. Numerical algorithms are used for parameter estimation. The estimated BIFs decrease over time for Waymo, Cruise, and PonyAI, indicating that there are improvements in the AV reliability through time for these three manufacturers. However, for Zoox, although the estimated BIF at the starting point is just above 0.5, which is a lot better than the estimated BIF of PonyAI, it remains essentially unchanged through time. This pattern indicates that there are not many improvements in the AV reliability for Zoox.

As discussed in Section 2, an AI system can fail due to hardware and software reasons. Because hardware failures can be well tackled by the traditional reliability framework, we focus on the software aspect of the problem. For the software components, ML/DL algorithms are widely used in many AI systems. As discussed in Section 2.3, the three factors that mainly contribute to failure events are the environment, data, and model (i.e., algorithm). As a framework for modeling AI reliability, we focus on interruptive events, caused by the operating environment, data, and models, that can lead to software errors. We define those events as failure events. Because reliability focuses on performance over time, we can link the failure events to a time process. Assuming the arrival of such events follows a stochastic process (e.g., an NHPP), we can further link the events to reliability prediction. Then, the traditional methods in reliability can all be applied. The idea of modeling for AI reliability is illustrated in Figure 4, in which the counting process N(t) represents the event process. To introduce the modeling framework, we define some more notation. Suppose there are k types of interruptive events and the arrival of those events follows a counting process N_j(t) with intensity function λ_j[t; x(t)], which depends on a covariate vector x(t). Here, x(t) is a general vector that contains external information such as the operating environment (e.g., the occurrence of distribution shift), low quality of input data, data noise, and the arrival of adversarial attacks (AA). The probability of such an interruptive event resulting in a failure event is modeled as p_j(z). Here, z is a general vector that summarizes the internal reliability properties of the AI system. For example, z may contain information on the system's ability to detect out-of-distribution (OOD) samples, its robustness to low-quality data and AA, and its ability to generate highly accurate predictions with low uncertainty. The probability p_j(z) can be modeled, for example, through a logistic-type regression with parameter vector β_j, j = 1, . . . , k.
Thus, the overall intensity of the counting process N(t) for the failure events is
λ[t; x(t), z] = Σ_{j=1}^{k} λ_j[t; x(t)] p_j(z).    (2)
Based on model (2), reducing p_j(z) can improve the reliability of the AI systems. The remaining research task is to model the event process, which depends on the three factors and their interactions. Here we discuss some characteristics of the three factors. For the operating environment, one common contribution to failure events is that the operating environment is different from the training environment, which is referred to as the presence of OOD samples. For example, a new object appears and the AI algorithm cannot recognize it. If the algorithm fails to detect the OOD samples when making a prediction, an incorrect decision is likely to be made, potentially leading to errors. Thus, OOD detection and the ability to adapt appropriately to OOD samples are important in improving AI reliability. Also, the quality of the training data is highly related to the performance of the algorithm. Data with errors, for example, caused by sensor malfunction, can lead to errors in prediction. Another aspect of data quality is related to the data bias/imbalance issue, which is of particular importance for many classification algorithms used in AI systems. The effect of data quality on the performance of the AI systems is coupled with the algorithm. Thus, it is of interest to study how data and algorithms affect model accuracy. AA can be viewed as a special kind of "data quality" issue, in which one purposely feeds problematic inputs to the algorithm so that the AI system will fail to generate the correct output. The robustness of an algorithm to AA is key to the reliability of the AI system. Most AI systems depend on the accuracy of the predictions powered by ML/DL algorithms. The adopted algorithm needs to provide sufficiently high accuracy on the training set so that it can be used in practice. Thus, it is important to study the relationship between reliability and model accuracy. In addition, the prediction made by the algorithm is associated with uncertainty. High uncertainty can lead to less reliable performance, and quantifying the uncertainty in predictions is also an important task. In the following sections, we give a brief introduction to OOD detection, the modeling of data quality and AI algorithms, AA, and model accuracy and uncertainty quantification, with some illustrative examples.

OOD observations in the data never appear in the training set. In classification problems, many ML tasks assume that the labels in the test set all appear in the training set. However, it is possible that we encounter a new class in the test dataset. For example, suppose we want to use images of the hind legs of frogs caught in Southeast Asian rain forests to predict the species of the frogs. According to Lambertz et al. (2014), it is possible that these frogs belong to a new species that has never been discovered. There are various existing methods for detecting OOD samples for technologies used in AI, including methods based on classifier confidence scores, distance-based methods, and generative and hybrid models (Choi et al. 2018). Here, we give a concrete application for an illustration of OOD detection. Given a pretrained CNN-based network, let x_i and y_i, i = 1, . . . , n, be the inputs and outputs of the network, respectively. Here, n is the number of samples. Let j be the class index, and let n_j be the number of samples within class j. The current total number of classes is denoted by k. We use f(x) to denote the output of the penultimate layer (the layer before the final activation function) of the neural network.
Then we assume that f(x) given class j follows a multivariate normal distribution. That is, f(x)|y = j ~ MVN(µ_j, Σ), where the mean and variance-covariance matrix are estimated as
µ_j = (1/n_j) Σ_{i: y_i = j} f(x_i)  and  Σ = (1/n) Σ_{j=1}^{k} Σ_{i: y_i = j} [f(x_i) − µ_j][f(x_i) − µ_j]′.
One can then define the Mahalanobis distance-based confidence score of x_i to measure the distance of x_i to its nearest class,
M(x_i) = min_j [f(x_i) − µ_j]′ Σ^{−1} [f(x_i) − µ_j].
If M(x_i) is beyond a fixed threshold, x_i is flagged as belonging to a new class. The above classification method is based on linear discriminant analysis (LDA), where different classes share the same covariance matrix. For illustration, Figure 8 shows the histograms of the LDA-based Mahalanobis distances on the MNIST data (e.g., Deng 2012). Data with labels 0 and 1 are considered OOD samples and are not used in the training set. Based on the histograms, we see that the Mahalanobis distances of the two OOD classes (i.e., data with labels 0 and 1) are distinct from those of the data with labels 2 to 9. The LDA procedure has good classification performance and can be used for OOD detection.
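A minimal sketch of this Mahalanobis-distance OOD score is given below; it uses synthetic features in place of the penultimate-layer outputs f(x) of a trained CNN, so the class structure and numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for penultimate-layer features f(x) of k in-distribution classes.
k, d, n_per_class = 8, 16, 200
class_means = rng.normal(0.0, 3.0, size=(k, d))
feats = np.vstack([rng.normal(class_means[j], 1.0, size=(n_per_class, d)) for j in range(k)])
labels = np.repeat(np.arange(k), n_per_class)

# Class means and pooled (shared) covariance matrix, as in LDA.
mu_hat = np.stack([feats[labels == j].mean(axis=0) for j in range(k)])
centered = feats - mu_hat[labels]
sigma_inv = np.linalg.inv(centered.T @ centered / feats.shape[0])

def mahalanobis_score(f_x):
    """Mahalanobis distance from a feature vector to its nearest class mean."""
    diffs = f_x - mu_hat                                      # shape (k, d)
    return np.min(np.einsum("jd,de,je->j", diffs, sigma_inv, diffs))

x_in = rng.normal(class_means[0], 1.0)          # looks like training class 0
x_ood = rng.normal(10.0, 1.0, size=d)           # far from all training classes
print("score, in-distribution:", round(mahalanobis_score(x_in), 1))
print("score, OOD:", round(mahalanobis_score(x_ood), 1))   # flag as new class if above a threshold
```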
The reliability of AI systems is affected by data quality and the specific algorithms used. Data quality can refer to the accuracy of data collection, the efficiency of data collection, the quality of data processing, feature derivation, etc. Taking the classification task as an illustration, imbalance and noise in the training set can cause a drop in accuracy (Ning et al. 2019). Here, data imbalance means imbalance in the proportions of observations among the different class labels. Furthermore, deviation of the distribution of labels in the test set from that in the training set affects the robustness of AI algorithms as well. It is important to study how data quality affects model accuracy. Here we give a brief introduction to the method in Lian et al. (2021), in which a mixture experimental design was used to study class imbalance in the training set and the difference in label distributions between the training and test sets. The performance of the AI algorithms is measured by the area under the receiver operating characteristic curve, abbreviated as AUC, for each class. The XGboost (e.g., Chen et al. 2015) and CNN (e.g., Kim 2014) algorithms are considered. Following the notation of the paper, we define z_1, which is 1 if the XGboost algorithm is used and 0 if the CNN algorithm is used. Lian et al. (2021) investigated how the effects of class imbalance change across two different datasets, the KEGG data and the bone marrow data. The two datasets are derived for the classification of the relationship between gene pairs (Yuan and Bar-Joseph 2019). We define z_2, which is 1 if the KEGG data are used and 0 if the bone marrow data are used. A surrogate model for the performance of the AI algorithms, the AUC averaged over all classes (y), with covariates given by the proportions of labels (x_1, x_2, x_3), the AI algorithm (z_1), and the choice of dataset (z_2), is
y = Σ_{j=1}^{m} β_j x_j + Σ_{j<j′} β_{jj′} x_j x_{j′} + Σ_{k=1}^{h} Σ_{j=1}^{m} γ_{kj} x_j z_k + Σ_{k<k′} δ_{kk′} z_k z_{k′} + ε,    (3)
where m = 3, h = 2, and β_j, β_{jj′}, γ_{kj}, and δ_{kk′} are regression coefficients. Note that (3) does not contain main-effect terms for the two processing variables z_1 and z_2. In order to draw inference for the two processing variables, the sum-to-zero constraint Σ_{j=1}^{m} (γ_{kj} + γ_k) = 0 is imposed, where γ_k is the coefficient for processing variable z_k. To maintain the ability of the trained AI models to identify all classes, the proportions of labels are constrained to be bounded away from zero in the design. Table 2 gives the design of experiments (DOE) for the class proportions of the training set with 28 runs; the first 7 runs, for covariate factors z_1 = 1 (XGboost) and z_2 = 1 (KEGG data), are shown in the table. For runs 8-28, the configurations are the cross design of the (x_1, x_2, x_3) values in the table with (z_1, z_2) taking values in {(1, 0), (0, 1), (1, 1)}. To explore the effect of the deviation in distribution between the training set and the test set, Lian et al. (2021) considered three different scenarios: the balanced scenario, the consistent scenario, and the reversed scenario. The designs are distinct among the three scenarios. As an illustration, we visualize the results from the balanced scenario, in which the test set has equal proportions for the three labels. Figure 9 shows the triangular contour plots of the predicted mean AUC. In general, balanced training datasets produce higher accuracy. For the bone marrow data, both algorithms need a greater x_3 to obtain the maximum average AUC. XGboost outperforms CNN on both datasets. Compared to XGboost, x_3 has the highest priority for CNN, as XGboost has a more systematic pattern.

The research on AA focuses on finding adversarial points in the data. An adversarial point x* for a given x and model output f(x) is defined as a point such that x − x* is small enough under a defined norm (usually the L_2 norm) and f(x) ≠ f(x*). To find an adversarial point, one needs to solve an optimization problem of the form
min_{r ∈ R} ||r||  subject to  f(x + r) ≠ f(x),
where R is some set of perturbations and ||·|| is some norm. Often one can consider x* = x + r with r from a certain set of perturbations. Adversarial attacks can lead to misclassification, which can further lead to reliability issues. Chen et al. (2019) found that other AI algorithms, such as XGboost, can also suffer from AA. To ensure the accuracy of the AI application, efforts should be made to prevent or mitigate the damage from AA. From this perspective, it is necessary to detect AA and to study how AA affects the reliability of AI systems.
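As a toy illustration of the perturbation viewpoint above (and not the method of any specific cited attack), the following sketch applies a gradient-sign perturbation to the input of a simple logistic-regression classifier with assumed, fixed weights; whether the prediction actually flips depends on the perturbation budget epsilon.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy differentiable classifier: logistic regression with fixed (assumed trained) weights.
d = 10
w, b = rng.normal(size=d), 0.1

def prob_class1(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# An input currently classified as class 1 (true label y = 1).
x = 0.2 * w + 0.1 * rng.normal(size=d)
y = 1.0
print("original  p(class 1) =", round(prob_class1(x), 3))

# Gradient of the cross-entropy loss with respect to the input:
# for logistic regression, dL/dx = (p - y) * w.
grad_x = (prob_class1(x) - y) * w

# Gradient-sign perturbation within an L-infinity budget epsilon,
# stepping in the direction that increases the loss.
epsilon = 0.3
x_adv = x + epsilon * np.sign(grad_x)

print("perturbed p(class 1) =", round(prob_class1(x_adv), 3))
print("prediction flipped?  ", (prob_class1(x) > 0.5) != (prob_class1(x_adv) > 0.5))
```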
An ML/DL model has to be accurate enough (i.e., higher than a threshold) so that the model can be applied in the field. Thus, model accuracy is a key factor for reliability. One question that is often asked is how much one should trust the model accuracy, which leads to the uncertainty quantification (UQ) problem. Quantifying the uncertainty of ML models is key to understanding the reliability of model predictions, especially for critical AI tasks. Much work has been conducted regarding UQ in DL models. Two types of approaches are often used: the ensemble Monte Carlo approach (Abdar et al. 2021) and the Bayesian approach. Ensemble Monte Carlo frameworks usually train multiple models on respective datasets and use the models' predictions as a predictive distribution of the DL model. Lakshminarayanan, Pritzel, and Blundell (2016) presented an ensemble method in which a large number of models are trained through re-sampling of the dataset. Gal and Ghahramani (2016) proposed the dropout idea to construct multiple models: a random sample of network nodes is dropped out from the model during training, and an empirical distribution over the outputs is built through these multiple models' predictions. Bayesian approaches usually assign a prior over the network weights and quantify the uncertainty through the posterior distribution of the weights.

As an illustration, we describe how variational inference is used to conduct UQ. Consider n observations, and let X = {x_i : i = 1, . . . , n} be the collection of inputs, where x_i is an input tensor. The corresponding response vector is y = (y_1, . . . , y_n)′. A deep Bayesian neural network (BNN) is constructed as f(x, ω), in which f(x, ω) denotes the BNN model output and ω is a vector that contains all the weight parameters in the neural network. The weight parameters ω in the network are treated as random variables. The response is assumed to follow a normal distribution with the model output as the mean. That is, y_i ~ N[f(x_i, ω), σ²], where η = (ω′, σ²)′ collects the unknown parameters. A prior distribution p(η) is assigned over the parameters. By Bayes' theorem, the posterior of the model parameters is
p(η | X, y) ∝ p(y | X, η) p(η).
With the posterior distribution, one can make a prediction for a new observation with input x_new. Specifically, we denote the prediction from the BNN as y_new. Then the posterior for y_new is
p(y_new | x_new, X, y) = ∫ p(y_new | x_new, η) p(η | X, y) dη.
Due to the high-dimensional parameter space, the exact posterior p(η | X, y) is usually intractable, so we use a variational distribution q(η; θ) to approximate it. Here, θ is the parameter vector of the variational distribution. Then the estimate of θ can be obtained by the following optimization,
θ* = argmin_θ { E_{q(η; θ)}{log[q(η; θ)]} − E_{q(η; θ)}{log[p(y, η | X)]} }.    (5)
The negative of the objective function in (5) is called the evidence lower bound, and the term E_{q(η; θ)}{log[q(η; θ)]} is the negative entropy of the variational distribution q(η; θ). The UQ can be carried out based on the estimated variational distribution q(η; θ*).
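The sketch below illustrates the dropout-based flavor of UQ mentioned above (Gal and Ghahramani 2016) rather than full variational inference: repeated stochastic forward passes with fresh dropout masks give an empirical predictive distribution. The two-layer network and its weights are random stand-ins for a trained model, so the printed numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a trained two-layer regression network; in practice the weights
# would come from training the network with dropout.
d_in, d_hidden = 5, 64
W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_hidden))
w2 = rng.normal(0.0, 1.0 / np.sqrt(d_hidden), size=d_hidden)

def forward_with_dropout(x, p_drop=0.5):
    """One stochastic forward pass with a fresh dropout mask on the hidden layer."""
    h = np.maximum(x @ W1, 0.0)                    # ReLU hidden layer
    mask = rng.random(d_hidden) > p_drop           # keep each node with prob. 1 - p_drop
    h = h * mask / (1.0 - p_drop)                  # inverted-dropout scaling
    return h @ w2

# Monte Carlo dropout: repeat stochastic passes to approximate the predictive
# distribution of the output for a new input x_new.
x_new = rng.normal(size=d_in)
samples = np.array([forward_with_dropout(x_new) for _ in range(500)])

print("predictive mean:", round(samples.mean(), 3))
print("predictive std :", round(samples.std(), 3))   # quantified prediction uncertainty
```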
The key step in AI testing is to build reliability testbeds by using various approaches to collect data on the performance of AI systems or components under various operational and environmental variables. Statistical methods, such as DOE, computer experiments, and reliability test planning, can help with efficient data collection. For the testing of AI components, most of which are ML/DL algorithms, statisticians can collect data on the performance of algorithms, especially using DOE. For example, a mixture design was used to collect the performance data of AI algorithms as described in Section 4.3. In a traditional setting, statisticians typically participate in the DOE but leave the data collection to the subject-matter experts. For AI algorithms, however, most of them can be run with modern computing power, to which statisticians have access. Thus, it provides an opportunity to do the DOE, data collection, and performance data analysis in a streamlined way.

In addition to AVs, another AI product is the drone. Delivery drones can be useful in daily life. According to Masunaga (2015), the first Federal Aviation Administration-approved drone delivery brought 24 packages of medicine to rural Virginia. Also, drones can be used in photo capture. Testing the safety and reliability of drones is a meaningful topic. However, there are relatively few papers on test planning for drones. Hosseini, Nosratollahi, and Sadati (2017) claimed that the reliability of unmanned aerial vehicles (UAVs) should be considered in the design phase so that redesigning the UAV becomes less likely, and proposed a design algorithm based on a multidisciplinary optimization method. Deng and Li (2014) pointed out that the safety requirements of UAVs differ across flight phases and that simulations need to be done in those different flight phases: takeoff, climb, level flight, etc. Overall, however, the statistical DOE idea has not been widely applied in the area of AI system test planning.

In traditional reliability analysis, accelerated tests (AT) are widely used to obtain information in a timely manner for products that can last for years or even decades. An introduction to AT can be found in Escobar and Meeker (2006). The basic idea of AT is to test units at high levels of use rate, temperature, voltage, stress, or some other accelerating variables. Based on the data collected in the AT, a statistical model is built to predict reliability. AT plays an important role in reliability analysis because it provides an efficient way for rapid product development. The sequential testing idea is also used in reliability test planning (e.g., sequential Bayesian designs for accelerated life tests). AT is also used in software reliability (e.g., Fujii et al. 2010).

To increase the stress on AI systems, one way is to use input-data acceleration. For example, identical twins are a particularly stringent stress test for facial recognition algorithms. Using input data with a lot of noise can also probe the reliability of the system more quickly. In addition, testing the systems under AA can be viewed as a form of input-data acceleration. Operating environment acceleration, which is to test the AI systems under OOD situations that go beyond the envelope of the training data, can also put stress on the systems. Also, error injection can be considered as a way of putting stress on the AI algorithms. For example, Bosio et al. (2019) used fault injections to study the reliability of deep CNNs for automotive applications. In summary, the idea of AT can be applied in AI testing, and additional modeling efforts are needed to make reliability predictions based on the AT data collected over the AI tests. The key step is to model the acceleration factor and link the reliability performance to the normal use condition.

The ultimate goal of statistical reliability analysis is to improve designs for reliable AI systems. In this section, we discuss several points that can be useful for improvements in AI reliability. Figure 11 shows a flow chart for AI reliability improvement, and the following explains its main idea. As illustrated in Figure 11, starting with an initial design, one can use AT to speed up the development cycles and collect data in a more time-efficient manner. Then, one can use statistical modeling to make assessments and predictions of AI reliability, as described in Section 5.2. With failure events observed, one needs to find the failure causes. Some of the causes are discussed in Section 2.3. For existing AI failure events, it is important to find the causes of the failures. For example, cause analysis can be applied to those events reported in the AI Incident database (2021). With the cause analysis results, the next step is design improvement. The following aspects can be considered: enhancing OOD detection, improving data quality, and reducing biases. Enhancing OOD detection is an important component of the overall reliability of the AI systems. It is also crucial to design the operational domain in an appropriate way. We should design algorithms that are more robust to low-quality data and to hardware failures. Meanwhile, it is also important to improve data quality and reduce biases. Choosing algorithms that are more robust to errors can also be important. The architectural vulnerability factor (AVF) is a measure of the vulnerability of a DNN to errors (e.g., Goldstein et al. 2020). Using DNN structures that have a low AVF can be an effective way to improve the algorithm design for reliable AI. Iterations can be taken over the four steps (i.e., reliability testing, assessment, cause analysis, and improvement) until the reliability reaches the desired level. Then the system can be deployed to the field.
Field tracking is still needed to ensure that the AI system performs the same as demonstrated in the development stage.

In this paper, we provide statistical perspectives on the reliability analysis of AI systems. The objective of the paper is to provide some general discussion with illustrations on several concrete problems, while we do not attempt an exhaustive literature review because the AI literature is vast and involves many areas. We provide a statistical framework and failure analysis for AI reliability. We discuss the traditional reliability methods and software reliability with AI applications. We describe research opportunities, including OOD detection, the effect of data quality and algorithms, and model accuracy and UQ, with illustrative examples. We also discuss data collection, test planning, and improvements for AI systems. As described in the paper, there are many exciting opportunities in studying the reliability of AI systems, and statistics can play an important role in the area. One challenge is the limited public availability of reliability data from AI systems, which is common for all systems and products because reliability data are usually proprietary and sensitive. Also, the collection of field test data is usually costly and time consuming. The publicly available California DMV database for AV testing is one exception. For the reliability data of AI algorithms, as mentioned in Section 5.1, one can collect such data using in-house computing power. However, it would be useful to build data repositories for AI reliability datasets. As for modeling methods, Bayesian methods have been widely used in reliability (e.g., Hamada et al. 2008). Although we did not discuss Bayesian reliability in this paper, Bayesian methods can be an area worth exploring for AI reliability modeling. We would like to remark that this paper focuses on the reliability aspect of AI systems. We do not cover other aspects of AI systems, such as safety, trustworthiness, and security, which also need to be addressed for the large-scale deployment of AI systems. A broader picture, called AI assurance (e.g., Batarseh, Freeman, and Huang 2021), aims to address all of those issues, of which reliability is certainly an important dimension.

References
A review of uncertainty quantification in deep learning: Techniques, applications and challenges
Improving the reliability of deep neural networks in NLP: A review. Knowledge-Based Systems 191
Concrete problems in AI safety
AI and reliability trends in safety-critical autonomous systems on ground and air
Degradation models and implied lifetime distributions
Estimation in degradation models with explanatory variables
Hands off the wheel in autonomous vehicles?: A systems perspective on over a million miles of field data
An intelligent autopilot system that learns flight emergency procedures by imitating human pilots
Assessment of the reliability of AI programs
A survey on artificial intelligence assurance
Variational inference: A review for statisticians
Weight uncertainty in neural network
Exploratory analysis of automated vehicle crashes in California: A text analytics & hierarchical Bayesian heterogeneity-based approach
A reliability analysis of a deep neural network
An overview of advances in reliability estimation of individual predictions in machine learning
Autonomous vehicle tester program
Robustness verification of tree-based models
EAD: elastic-net attacks to deep neural networks via adversarial examples
Xgboost: extreme gradient boosting
WAIC, but why? Generative ensembles for robust anomaly detection
Flight safety control and ground test on UAV
The MNIST database of handwritten digit images for machine learning research
Autonomous vehicles: disengagements, accidents and reaction times
High-level mission specification and planning for collaborative unmanned aircraft systems using delegation
A review of accelerated test models
Deep learning-enabled medical computer vision
Autonomous vehicles disengagements: Trends, triggers, and regulatory limitations
Formal scenario-based testing of autonomous vehicles: From simulation to the real world
A software accelerated life testing model
Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
Interpretation of neural networks is fragile
Reliability evaluation of compressed deep learning models
Deep Learning
Explaining and harnessing adversarial examples
Practical variational inference for neural networks
Variational rejection sampling
Bayesian Reliability
Robust machine learning systems: Reliability and security for deep neural networks
Did we test all scenarios for automated and autonomous driving systems?
Failure prediction for autonomous driving
Black-box alpha divergence minimization
System unavailability analysis based on window-observed recurrent event data
Big data and reliability applications: The complexity dimension
Multidisciplinary design optimization of UAV under uncertainty
AI4COVID-19: AI enabled preliminary diagnosis for COVID-19 from cough samples via an app
Challenges of reliability assessment and enhancement in autonomous systems
Attribution-based confidence metric for deep neural networks
Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?
Software reliability, metrics, reliability improvement using agile process
Convolutional neural networks for sentence classification
Special session: Reliability analysis for ML/AI hardware
Simple and scalable predictive uncertainty estimation using deep ensembles
Anatomy, histology, and systematic implications of the head ornamentation in the males of four species of Limnonectes (Anura: Dicroglossidae)
Covariates and random effects in a gamma process model with application to degradation and failure
Statistical Models and Methods for Lifetime Data
Sequential Bayesian design for accelerated life tests
Training confidence-calibrated classifiers for detecting out-of-distribution samples
A simple unified framework for detecting out-of-distribution samples and adversarial attacks
Robustness with respect to class imbalance in artificial intelligence classification algorithms
Enhancing the reliability of out-of-distribution image detection in neural networks
The trend-renewal process for statistical analysis of repairable systems
Using degradation measures to estimate a time-to-failure distribution
Analysis of autopilot disengagements occurring during autonomous vehicle testing
Artificial intelligence applications in the development of autonomous vehicles: a survey
The reliability of a deep learning model in clinical out-of-distribution MRI data: A multicohort study
First FAA-approved drone delivery takes medicine to rural Virginia
Statistical Methods for Reliability Data
Reliability meets big data: Opportunities and challenges, with discussion
Software reliability growth models predict autonomous vehicle disengagement events
Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control
Variational boosting: Iteratively refining posterior approximations
Reliability analysis of artificial intelligence systems using recurrent events data from autonomous vehicles
A family of software reliability models with bathtub-shaped fault detection rate
Applied Life Data Analysis
Capjack: capture in-browser crypto-jacking by deep capsule network through behavioral analysis
Deep learning vs. traditional computer vision
GPU lifetimes on Titan supercomputer: Survival analysis and reliability
A survey of the usages of deep learning for natural language processing
A survey on transfer learning
A generalized software reliability model with stochastic fault-detection rate
Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI
Monotone regression splines in action
Realizing the promise of artificial intelligence for unmanned aircraft systems through behavior bounded assurance. IEEE/AIAA 38th Digital Avionics Systems Conference (DASC)
Detecting out-of-distribution examples with in-distribution examples and gram matrices
Simulation driven design and test for safety of AI based autonomous vehicles
Physical adversarial examples for object detectors
A three-parameter fault-detection software reliability model with the uncertainty of operating environments
Intriguing properties of neural networks
Robotics, artificial intelligence, and the evolving nature of work
Estimating degradation by a Wiener diffusion process subject to measurement error
Contrastive training for improved out-of-distribution detection
Software reliability growth models
Statistical reliability analysis of repairable systems with dependent component failures under partially perfect repair assumption
The inverse Gaussian process as a degradation model
Deep learning for inferring gene relationships from single-cell expression data
How reliable should military UAVs be?
A case for quantifying statistical robustness of specialized probabilistic AI accelerators
A safety framework for critical systems utilising deep neural networks
Assessing the safety and reliability of autonomous vehicles from road testing

The authors acknowledge the Advanced Research Computing program at Virginia Tech and Virginia's Commonwealth Cyber Initiative (CCI) AI testbed for providing computational resources. The work is supported by CCI and CCI-Coastal grants to Virginia Tech.