key: cord-0067984-5uzutfv4
authors: Zuo, Mingjian
title: System reliability and system resilience
date: 2021-09-29
journal: Front
DOI: 10.1007/s42524-021-0176-y
sha: d84740e9f74dba9224c4134f7f350ae02d623f85
doc_id: 67984
cord_uid: 5uzutfv4
With the development of new technology, engineering systems are becoming more complex in structure and possessing more advanced functions. Customers are setting higher quality requirements for new products. Reliability must be considered in all aspects of the product life cycle. Inadequate reliability considerations may cause civil aviation disasters, nuclear power plant accidents, spacecraft launch failures, power system shutdowns, and other major accidents. Since the emergence of the reliability discipline in the 1950s, reliability theory has been developing rapidly, which has played an irreplaceable role in promoting the progress of major industries such as aviation, aerospace, and nuclear energy. It has also greatly improved the quality of daily necessities such as computers, appliances, and automobiles. The capability of manufacturing high-end equipment with high reliability and long life has become an important strategic indicator of a country's global strength and competitiveness. In recent years, due to climate change and ecosystem degradation, extreme weather has become more frequent all over the world. Under unusual conditions such as extreme weather and external attacks, it is no longer possible for sophisticated engineering systems to meet their performance requirements if only traditional reliability indexes are used, and resilience has become an important index to consider for system design, evaluation, and optimization.
To respond to the increasingly complex international environment and diversified challenges, the US has been requiring military equipment systems to ensure functional stability under uncertain and diverse combat conditions, to have the ability to quickly restore functions after experiencing damage, and to adapt rapidly to changing environmental conditions. Around 2010, the US Department of Defense introduced the concept of "resilience" into weapon and equipment development and established the Engineering Resilience System Project. This project formulated a long-term strategy to continuously improve the resilience of weapon equipment through its design, improvement, manufacturing, and deployment. Resilience has become an important new guiding performance index throughout the entire lifecycle of weapon equipment in the US (Scott, 2012). Indeed, resilience has become a key performance index of all critical systems, including power grids, infrastructure, and communication systems.
2 System reliability and system resilience: Definition and relationship
The word "reliability" first appeared in 1816 in the writing of Coleridge, who used it to praise his poet friend Robert Southey (Coleridge, 1983). Later, with the application of statistics and probability theory in the manufacturing and maintenance of vacuum tubes, reliability theory and reliability engineering emerged (Saleh and Marais, 2006). In 1957, the Electronic Equipment Reliability Advisory Group of the US Department of Defense first published a report titled Military Electronic Equipment Reliability Study (Coppola, 1984), which defined reliability as the ability of a product to perform its required functions under required conditions for a required time period. The required conditions are the working conditions of the product. The required time represents the mission time of the product. The required function means that the performance of the product satisfies the requirements of design and usage.
This ability is mainly reflected in indexes of system performance, including probability of success, failure rate, and mean time between failures. Barlow and Proschan (1965) systematically documented the mathematical definition of reliability, maintenance strategies, stochastic models, and redundancy optimization models, which laid the theoretical foundation for the rapid development of reliability engineering research. The word "resilience" derives from Latin and was first used to describe the ability of materials to return to their original states after deformation. In the 1970s, the renowned ecologist Holling (1973) introduced the concept of resilience into ecology, where it represents the ability of an ecosystem to maintain its function when affected by environmental changes. Subsequently, the concept of resilience was extended to social, economic, management, medical, engineering, and other fields. Because these application contexts are so diverse, resilience has been defined in many different ways. Ecosystem resilience (Carpenter et al., 2001) characterizes the ability of ecosystems to self-sustain, self-regulate, and resist various external pressures and perturbations in order to maintain biological composition, ecosystem structure, and function after external disturbances push them away from equilibrium. The concern of ecosystem resilience is not the time or ability to return to a single stable state but the transition between many stable states and the duration of different stable states. Resilience in the social domain (Walker et al., 2004; Adger et al., 2005; Labrague and de Los Santos, 2020) focuses on the ability of individuals, communities, and teams to recover when major emergencies or natural disasters occur. Resilience in the economic field (Rose and Liao, 2005; Rose, 2007) mainly concerns the ability of economic entities to withstand disasters such as economic crises and to recover after the shocks.
It focuses on prevention, prediction, and post-disaster economic reconstruction strategies for major economic disasters such as economic crises. Resilience in the medical field (van de Leemput et al., 2014) is mainly concerned with the development of mental diseases such as depression and pays special attention to quantitative judgment and intervention before critical changes in mental illness. Since 2010, engineering resilience research has been focusing on resistance or recovery capability during disruptive events in major infrastructure (such as power systems, network systems, transportation systems, buildings, and nuclear power plants) and weaponry equipment systems (Buldyrev et al., 2010; Ouyang and Duenas-Osorio, 2014). The relationships between system reliability and system resilience can be described as follows.
(1) Reliability is used to evaluate the ability of a product or device to function satisfactorily under usual working conditions and generally does not consider the ability to survive in extreme or sudden conditions. During the usage of such products or devices, the main focus is on maintenance strategies or methods to maintain their reliability, which is limited to their functional recovery under usual working conditions.
(2) Resilience is used to evaluate the ability of a product or device to resist disturbance from natural or human-caused events, including prediction of extreme events, evaluation of the impact of extreme events, absorption of the shock, response, adaptation, and recovery. Resilience emphasizes the adaptability of the product or device to external disturbances, which generally do not fall within the scope of usual working conditions.
(3) Both reliability and resilience are important capabilities and characteristics of a system. System reliability mainly depends on the reliabilities of the components within the system, the relationships among these components, and the system structure. System resilience, on the other hand, depends not only on the performance of the components and the redundancy within the system, but also on the self-organizing ability of the components (the ability to reconfigure or reconstruct). Since system resilience analysis considers system reliability under disaster conditions, many reliability analysis methods and state transition analysis approaches may be used for system resilience analysis when unexpected events occur.
(4) For design, evaluation, and optimization of system reliability, we generally assume that the components in the system are statistically independent. However, system resilience researchers usually assume that the components in the system are statistically dependent or related. Resilience researchers in the ecological, social, management, and economic areas have been considering the correlations among components, the redundancy in the system, and the dynamic evolution process. Therefore, for resilience analysis of engineering systems, we should consider dependence factors such as common-cause failures and cascading failures and take advantage of complex reliability models.
(5) System resilience and system reliability are two different aspects of system performance. This means that a system with high resilience may have low reliability, and vice versa. System resilience emphasizes the system's ability to resist damage in a specific performance state and/or to return to its original performance state after degradation. System reliability, in contrast, refers to the ability of the system to remain at a specific performance level. During certain emergencies, if the demand on a system becomes lower, the system may have a higher probability of meeting this reduced demand, meaning that the system reliability is higher during this period of time.
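Point (5) can be illustrated numerically. The following is a minimal Python sketch, not taken from the source: it assumes a hypothetical three-state system whose capacity distribution (`capacity_probs`) is invented purely for illustration, and it treats reliability as the probability that the available capacity meets the demand. The function name `demand_reliability` is likewise an assumption.

```python
def demand_reliability(capacity_probs, demand):
    """Probability that system capacity is at least `demand`.

    capacity_probs maps each possible capacity level to its probability
    (hypothetical values; the probabilities are expected to sum to 1).
    """
    total = sum(capacity_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    # Sum the probabilities of all states that can satisfy the demand.
    return sum(p for capacity, p in capacity_probs.items() if capacity >= demand)


# Assumed capacity distribution: failed, degraded, and fully working states.
capacity_probs = {0: 0.05, 50: 0.15, 100: 0.80}

normal_demand = demand_reliability(capacity_probs, 100)
reduced_demand = demand_reliability(capacity_probs, 50)
print(f"reliability at demand 100: {normal_demand:.2f}")
print(f"reliability at demand 50:  {reduced_demand:.2f}")
```

With these assumed numbers, the probability of meeting the full demand of 100 is 0.80, while the probability of meeting a reduced demand of 50 is about 0.95. Reliability in this sense rises when demand drops, which says nothing about how quickly the system could recover from damage.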
3 Research challenges of system reliability and system resilience
Deep space exploration, advanced aviation equipment, high-speed rail transportation, and nuclear energy engineering are some of the research frontiers worldwide. The complexity of system structures, the diversity of functions, and the multidimensionality of operational data pose new challenges to system reliability theory and technology. First, reconfigurable system structures and polymorphic system characteristics fall outside the scope of traditional system reliability modeling and analysis, which makes system reliability evaluation much more difficult. Second, there are more and more functional couplings among subsystems and components within a system, presenting modeling difficulties for functional dependency and fault cascading. Traditional reliability modeling methods based on statistical independence of components are unable to meet the needs of modern complex system reliability design, analysis, and evaluation. Third, due to the widespread adoption of sensor technology, a vast amount of system operating data of various types is now available. Integrating these multidimensional operating data with expert knowledge and test data to achieve a more accurate system reliability assessment is a challenging problem. Fourth, the system to be studied is no longer a "machine" or a physical system; instead, it is a "man-machine-environment" system. The added human factors and uncertain environmental factors bring new challenges to system reliability modeling and analysis. Reported system reliability modeling methods have covered system structures such as parallel/series systems, k-out-of-n systems (Kuo and Zuo, 2003), consecutive k-out-of-n systems (Cui et al., 2015), multistate systems (Lisnianski and Levitin, 2003; Liu et al., 2019), distributed computing systems (Li and Peng, 2015), and phased mission systems.
The factors that have been considered for such systems include discrete and continuous shocks (Wang et al., 2020), dynamic environment (Hong and Meeker, 2013; Zou et al., 2015), statistical dependence (Liu et al., 2019), multi-criteria (Zhang and Chen, 2016), fault tolerance (Mo et al., 2008), and other complex failure mechanisms (Peng et al., 2010; Ye and Xie, 2015; Wu et al., 2016). Reviews on different aspects of reliability analysis have also been reported (Zio, 2009; Si et al., 2011). Reliability evaluation, prediction, and analysis driven by performance data (such as operation and maintenance data) are active research topics in the field of reliability (Yun et al., 2012; Hong and Meeker, 2013; Zou et al., 2015; Liu et al., 2016). The existence of uniform optimal solutions for the component allocation problem of the consecutive k-out-of-n system has been a difficult problem (Cui and Hawkes, 2008). Diverse, multi-level, and ambiguous objectives, component polymorphism, component reliability uncertainty, structural redundancy, strategy diversity, failure dependency, and networked system structure also add considerable difficulties and challenges to building and solving system reliability models (Wang and Li, 2015; Kim et al., 2016). Maintenance decision models and maintenance optimization methods are also critical research topics in statistics, probability, and operations research (Berrade et al., 2012; Compare and Zio, 2015; Zhao et al., 2015). Searching for optimal test sequences under constraints such as unit test sequence, multiple test equipment, and reachability test has also been a challenging problem (Ji et al., 2015).
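Of the system structures cited in this section, the k-out-of-n: G structure has a compact closed form under the traditional assumption of independent, identical components, with series and parallel systems as its two extreme cases. The Python sketch below illustrates that textbook binomial formula; the function names are hypothetical, and the independence assumption is exactly the one that the text argues modern coupled systems often violate, so the numbers should be read as a baseline rather than as a realistic assessment.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Reliability of a k-out-of-n: G system.

    The system works if at least k of its n components work. Components are
    assumed independent and identical, each working with probability p.
    """
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Binomial tail: probability that i components work, summed over i >= k.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def series_reliability(n, p):
    """A series system needs all n components (n-out-of-n)."""
    return k_out_of_n_reliability(n, n, p)


def parallel_reliability(n, p):
    """A parallel system needs only one component (1-out-of-n)."""
    return k_out_of_n_reliability(1, n, p)


for k in (1, 2, 3):
    print(f"{k}-out-of-3 with p = 0.9: {k_out_of_n_reliability(k, 3, 0.9):.3f}")
```

For three components with p = 0.9, the values are 0.999 (parallel), 0.972 (2-out-of-3), and 0.729 (series), which shows how strongly the structure alone drives system reliability when component reliabilities are fixed.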
The ongoing COVID-19 pandemic, the increasing frequency of extreme weather, and the rapid changes in international political and military situations have also posed new challenges to reliability researchers, as we need to develop better tools to make systems more reliable and more resilient under these turbulent situations. A complex network is a high-level abstraction of a complex system. Its study starts with network topology and network dynamics, which represent the key characteristics of complex systems. Resilience theory based on complex networks provides excellent quantitative analysis tools and methods for system resilience research. In-depth research on the resilience of complex networks has been carried out from multiple aspects, such as resilience indexes (including early warnings) (Scheffer et al., 2009), cascading failures (Buldyrev et al., 2010), and multi-layer dependency networks (Duan et al., 2019). A series of important and influential research results have been obtained recently, laying a solid theoretical foundation for engineering applications of complex system resilience models. Additional challenges of complex network resilience research include (Liu et al., 2020): modeling of the dynamic evolution of complex systems, resilience research on large-scale real networks, multi-layer collaborative network resilience analysis, improvement of reported network resilience methods, network resilience control, and network optimization. In addition to the theory of complex network resilience, data-driven management and control of social networks and social organizations under major epidemic conditions, early warning and prevention of possible collapse of ecosystems, multi-hazard-coupled infrastructure resilience assessment and optimization, immune networks, and human organ function resilience are also research frontiers that must be considered.
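To make the network view concrete, the sketch below computes one simple topology-based index: the fraction of nodes that remain in the largest connected component after some nodes are removed. This is an illustrative proxy chosen here as an assumption; it is not the specific index or model used in the works cited above, and the function name and the six-node ring network are hypothetical.

```python
from collections import deque


def largest_component_fraction(adjacency, removed):
    """Fraction of all original nodes in the largest surviving component.

    adjacency maps each node to a list of its neighbours (undirected graph).
    removed is an iterable of nodes treated as failed.
    """
    if not adjacency:
        return 0.0
    removed = set(removed)
    seen = set()
    best = 0
    for start in adjacency:
        if start in removed or start in seen:
            continue
        # Breadth-first search over the surviving nodes reachable from start.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in removed and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best / len(adjacency)


# Hypothetical six-node ring: each node is linked to its two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(largest_component_fraction(ring, []))       # intact network
print(largest_component_fraction(ring, [0]))      # one node fails
print(largest_component_fraction(ring, [0, 3]))   # two opposite nodes fail
```

In the ring, losing one node leaves the other five connected, whereas losing two opposite nodes splits the network into two pairs. The index captures only static connectivity; cascading or interdependent failures of the kind discussed above would need a dynamic model on top of it.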
As we face the needs of complex projects and infrastructures, it is an urgent task to develop advanced universal resilience frameworks and intelligent platforms for resilience design, verification, and optimization.
References
Social-ecological resilience to coastal disasters
Mathematical Theory of Reliability
Maintenance scheduling of a protection system subject to imperfect inspection and replacement
Catastrophic cascade of failures in interdependent networks
From metaphor to measurement: Resilience of what to what?
Surrogates for resilience of social-ecological systems
Genetic algorithms in the framework of Dempster-Shafer theory of evidence for maintenance optimization problem
Reliability engineering of electronic equipment: A historical perspective
A note on the proof for the optimal consecutive-k-out-of-n: G line for n ≤ 2k
M-consecutive-k, l-out-of-n system
Universal behavior of cascading failures in interdependent networks
Resilience and stability of ecological systems
Field-failure predictions based on failure-time data with dynamic covariate information
Fault detection techniques based on multivariate statistical analysis
Derating design for optimizing reliability and cost with an application to liquid rocket engines
Optimal Reliability Modeling: Principles and Applications
COVID-19 anxiety among front-line nurses: Predictive role of organisational support, personal resilience and social support
Series phased-mission systems with heterogeneous warm standby components
Service reliability modeling of distributed computing systems with virus epidemic
Multi-State System Reliability: Assessment, Optimization and Applications
Chatter reliability prediction of turning process system with uncertainties
Reliability assessment for multistate systems with state transition dependency
Mission reliability analysis of fault-tolerant multiple-phased systems
Multi-dimensional hurricane resilience assessment of electric power systems
Reliability and maintenance modeling for systems subject to multiple dependent competing failure process
Economic resilience to natural and man-made disasters: Multidisciplinary origins and contextual dimensions
Modeling regional economic resilience to disasters: A computable general equilibrium analysis of water service disruptions
Highlights from the early (and pre-) history of reliability engineering
Early-warning signals for critical transitions
Remaining useful life estimation: A review on the statistical data driven approaches
Resilience, adaptability and transformability in social-ecological systems
A general discrete degradation model with fatal shocks and age- and state-dependent nonfatal shocks
Redundancy allocation optimization for multistate systems with failure interactions using Semi-Markov process
Evaluating the reliability of multi-body mechanism: A method considering the uncertainties of dynamic performance
Stochastic modeling and analysis of degradation for highly reliable products
Economic design of a load-sharing consecutive k-out-of-n: F system
Multi-objective reliability redundancy allocation on an interval environment using particle swarm optimization
Approximate methods for optimal replacement, maintenance, and inspection policies
Reliability engineering: Old problem and new challenges
Reliability forecasting for operators' situation assessment in digital nuclear power plant main control room based on dynamic network model