title: The Risks of Machine Learning Systems
authors: Tan, Samson; Taeihagh, Araz; Baxter, Kathy
affiliations: Samson Tan, samson.tmr@comp.nus.edu.sg, Salesforce Research Asia and School of Computing, National University of Singapore; Araz Taeihagh, spparaz@nus.edu.sg, Lee Kuan Yew School of Public Policy and CTIC, National University of Singapore; Kathy Baxter, kbaxter@salesforce.com, Office of Ethical & Humane Use, Salesforce
date: 2022-04-21

The speed and scale at which machine learning (ML) systems are deployed are accelerating even as an increasing number of studies highlight their potential for negative impact. There is a clear need for companies and regulators to manage the risk from proposed ML systems before they harm people. To achieve this, private and public sector actors first need to identify the risks posed by a proposed ML system. A system's overall risk is influenced by its direct and indirect effects. However, existing frameworks for ML risk/impact assessment often address an abstract notion of risk or do not concretize this dependence. We propose to address this gap with a context-sensitive framework for identifying ML system risks comprising two components: a taxonomy of the first- and second-order risks posed by ML systems, and their contributing factors. First-order risks stem from aspects of the ML system, while second-order risks stem from the consequences of first-order risks. These consequences are system failures that result from design and development choices. We explore how different risks may manifest in various types of ML systems, the factors that affect each risk, and how first-order risks may lead to second-order effects when the system interacts with the real world. Throughout the paper, we show how real events and prior research fit into our Machine Learning System Risk framework (MLSR). MLSR operates on ML systems rather than technologies or domains, recognizing that a system's design, implementation, and use case all contribute to its risk. In doing so, it unifies the risks that are commonly discussed in the ethical AI community (e.g., ethical/human rights risks) with system-level risks (e.g., application, design, control risks), paving the way for holistic risk assessments of ML systems.

While machine learning (ML) has the potential to improve the quality of our lives, it also has the potential to bring about new harms and exacerbate existing ones. Much of the ML literature has naturally focused more on the benefits of ML than the harms to society. Therefore, we provide a counter-perspective by categorizing the risks posed by ML systems. For the purposes of discussion, we consider an ML system to be the ML-based component(s) of a software system. In some scenarios, this software system may also interface with a hardware system, such as in an autonomous robot. Since the risks of traditional computer systems have been extensively discussed [87, 162], we focus our efforts on the new, or increased, risks brought about by their ML components. Due to existing power structures, ML systems often benefit some demographics at the expense of others [13, 62, 119, 137, 175]. Recent work and news have highlighted the negative impacts of ML systems on society and the environment [12, 28, 44, 65, 118, 119, 171, 177]. For example, Rousseau et al. found GPT-3 [27] to be highly unreliable and downright dangerous for healthcare applications [153]. This has led to calls and proposals for legislation to regulate the sale and use of ML systems, such as the draft EU AI Act [63], the US' Algorithmic Accountability Act (AAA) [1], and California's Automated Decision Systems Accountability Act (ADSAA) [29].
The EU AI Act focuses on predefined classes of applications deemed either "prohibited" or "high risk" (Title III), leaving others to be governed by existing product safety regulations, with an exception being transparency obligations for specific classes of applications (Title IV). Presently, it does not provide the guiding criteria for the creation of these lists. In contrast, the AAA and ADSAA outline general criteria for an ML system to be considered "high risk", such as whether the system uses personal or demographic-related information. However, none of the three regulations provide guidance on how "medium" and "lower risk" applications should be assessed and managed. China's white paper on trustworthy AI goes into more detail on managing teams and the development process, but does not discuss risk-based differentiated measures [36]. We argue that accurately characterizing the risks posed by ML systems is crucial to enacting meaningful regulations while not stifling innovation [30, 176]. We note that the latter does not mean giving businesses a free pass to forgo safer, more inclusive solutions and prioritize their profits above all. Instead, organizations must recognize and address the risks of technological disruption. A comprehensive risk taxonomy will enable this, while also helping organizations define internal policies on ML applications not covered by the law. Although the European Commission and German Data Ethics Commission have published frameworks for risk-based regulation [47, 63], algorithmic impact assessments (IAs) are often used in the literature instead of risk assessments (RAs) [4, 46, 122, 126, 150, 185]. IAs primarily focus on the ML system's impacts (e.g., on human rights and the environment), while RAs, in addition, are able to surface direct risks of the system that contribute to its impacts. Even as IAs and RAs are increasingly mandated by law [1, 29, 63, 186], the notion of risk, impact, or harm is ill-defined in most existing IA/RA frameworks and proposed regulations [1, 29, 63, 68, 122, 185]. This is likely due to their tendency to focus on algorithmic instead of ML systems (a subset of the former). Consequently, they lack detailed discussions of the risks and risk factors specific to ML systems. This creates loopholes that can be exploited by malicious actors [190] and risks identifying only easily measurable harms [125]. Other studies examine in detail the risks of specific applications (e.g., autonomous vehicles) [119, 128], but many of the highlighted risks are domain-specific and not broadly applicable. We argue that a structured understanding of ML-specific risks is crucial to the development of comprehensive RAs. Software risk assessment is a well-established practice with taxonomies that overlap with the concerns of ML risk assessment, such as reliability, safety, and security [87, 162]. However, work in this area has been primarily concerned with the system's negative consequences for the organization developing it, as opposed to the affected communities.
While the latter has always existed, the recent proliferation of ML systems has increased both their likelihood and severity, in addition to introducing new risks resulting from learned behavior. In summary, existing work on algorithmic IA and software RA lacks detailed discussions of ML-specific risks. On the regulation side, upcoming legislation does not provide sufficient guidance on non-high-risk applications and often refers to a vague idea of risk. To advance the field on this front, we develop a Machine Learning System Risk framework (MLSR) of the risks posed by ML systems and their risk factors. MLSR is inspired by existing work on software risks and tech-related harms, application-specific studies, and the Universal Declaration of Human Rights [4, 87, 126, 176, 188]. MLSR connects the direct risks of a system (first-order risks) to the risks that arise from its interaction with the real world (second-order risks), discussing in detail the factors that contribute to each type of risk. This will help organizations perform holistic risk assessments, devise appropriate mitigation measures, and make project approval/denial decisions, especially when a proposed ML system does not appear immediately dangerous. Finally, an understanding of the risks surrounding ML systems will aid in the creation of appropriate and enforceable standards and regulations. In this paper, we focus on organizing the risks posed by ML systems into a framework of first- and second-order harms and leave the assessment of these risks to future work. Impact assessments aim to identify and monitor the positive or negative consequences of existing and planned projects [189], while risk assessments focus on the potential negative consequences of planned projects [147]. Such assessments help us assert some control over the future instead of being subjected to its whims [14]. We use the ISO definition of risk: the consequence of an event combined with its likelihood of occurrence [92]. Risk and impact assessments have been (and continue to be) extensively used across industries to evaluate both the risk/impact of projects on the environment and affected communities [26, 42, 48, 57, 60, 72, 123, 130, 194] and the risk to a project or system [82, 114, 141, 162, 167, 199]. Algorithmic IAs evolved from those for social and environmental impact [122, 130, 189], while RAs have a long history of usage for safety- or mission-critical applications, such as nuclear power plants [72], hazardous material transportation [191], civil aviation [95], and civil engineering [199]. In information systems and software development, risks are often studied from the organization's perspective [163]; negative consequences are noteworthy if they threaten the organization. For example, Susan Sherer breaks software risk down into three categories of consequences for the organization: development, use, and maintainability [162]. Of the twelve risks that make up this taxonomy, only one captures the impact of the developed software on its users, society, or the environment: safety. Higuera and Haimes present a comprehensive three-level taxonomy of sixty-four software risks [87]. Similarly, almost all of these risks are concerned with threats to the project's success, with only two (safety and human factors) addressing risks to its users.
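To make the ISO definition above concrete, the sketch below shows one common way risk assessments operationalize it: scoring each risk as the product of ordinal likelihood and consequence ratings. The scales, scores, and example entries are illustrative assumptions for this sketch, not values prescribed by ISO, the software RA literature, or MLSR.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: int   # 1 = rare, 5 = almost certain (assumed ordinal scale)
    consequence: int  # 1 = negligible, 5 = catastrophic (assumed ordinal scale)

    def score(self) -> int:
        # ISO-style risk: the consequence of an event combined with its likelihood.
        return self.likelihood * self.consequence

# Hypothetical risk register entries, purely for illustration.
register = [
    Risk("misapplication: hands-off driving of a semi-autonomous vehicle", 3, 5),
    Risk("training data: under-representation of minority dialects", 4, 3),
    Risk("implementation: bug in an external perception library", 2, 4),
]

# Rank risks by score so the highest-priority items surface first.
for risk in sorted(register, key=lambda r: r.score(), reverse=True):
    print(f"{risk.score():>2}  {risk.name}")
```

Registers of this kind typically feed the organization's decision about which risks warrant mitigation or escalation; MLSR is concerned with what goes into the "name" column for ML systems.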
While the negative impact of software systems on society and the environment is not a new phenomenon [155], the recent proliferation of biased and unsafe ML systems has cast the spotlight on software systems and the algorithms powering them [13, 137]. Consequently, researchers in the ethical AI and public policy communities have called for the use of IAs [129, 145, 160] to surface the direct, indirect, and insidious harms that could be caused by an ML system, such as entrenching historical discrimination. Such assessments can be a way of improving accountability in the use, procurement, and development of ML systems by ensuring actors perform due diligence [1, 63, 119, 145, 150]. However, regulatory support is crucial to accountability: it cannot exist without the threat of consequences for non-compliance [21, 129, 178]. Reisman et al. highlight the difficulties facing governments in assessing the impact of automated decision systems (ADSs), in large part due to their "black-box" nature and trade secrecy claims [150]. While they provide some guidelines for implementing IAs in the public agency context, they do not identify specific relevant risks or impacts. In contrast, the Canadian government's ADS IA tool is a questionnaire for assessing socioeconomic and environmental impact, impact on government operations, system complexity, data management, and procedural fairness [100, 185]. While comprehensive, it lacks a well-defined taxonomy of impacts to guide users and does not examine the system in detail [115, 125]. On the other hand, Microsoft's Harms Modeling framework comprises a comprehensive taxonomy of thirty-eight tech-related impacts and example questions to assess each one [126]. Krafft et al. presented the Algorithmic Equity Toolkit with an IA worksheet for community members to assess the impact of a government-deployed ADS on them [109]. New Zealand's Algorithm Charter appears to bridge the gap between IAs and RAs by defining impact as a component of risk, but does not define the terms "algorithm", "risks", and "impacts". The Ethics and Algorithms Toolkit provides a step-by-step checklist that operationalizes algorithmic risk as impact on people and property, data use, level of accountability, and bias [4]. Impact assessments may also be performed post-hoc and informally by external/independent auditors to highlight the flaws in already deployed ML systems, such as in the Gender Shades study [28]. However, it is important to acknowledge that post-hoc analysis is often challenging, especially for ML platforms that do not have access to customer data to detect if their customers are using the system to cause harm. Furthermore, it may be difficult to observe some vulnerable communities to determine if they are disparately impacted unless user demographic data is collected. Despite their usefulness, existing IAs risk only operationalizing "impact" in ways amenable to computational evaluation. This will lead to a loss of trust when they inevitably fail to capture pertinent harms [125].

First-order risks stem directly from the machine learning system (Section 4) and their consequences lead to second-order risks when the system interacts with the real world (Section 5). We summarize them in Tables 1 and 2 (Appendix A).

To summarize, research on software RAs largely focuses on the risk to the organization, as opposed to the risk to society or the environment.
Existing work on algorithmic IAs identifies the negative consequences of algorithmic systems to society and the environment but focuses primarily on discrimination risk or does not link them to specific system development choices. An organization will require such traces to address the concerns identified by IAs. Presently, proposed legislation also tends to address "risks" and "impacts" at a high level, which need to be operationalized in greater detail to be useful. In contrast, MLSR documents the various failure modes of an ML system and their contributing factors, and connects them to its impacts on society and the environment. The risks posed by ML systems can be categorized into first-order, second-order, and beyond. We explore first-and second-order risks in the following sections but acknowledge the presence of higher-order risks due to knock-on effects. While some of the risks described below may apply to all software systems, we focus on the aspects specific to ML systems. Hence, we will refer to our framework as the Machine Learning System Risk framework (MLSR). To construct this taxonomy, we used algorithmic impacts as a starting point, with safety, discrimination, and environmental risks [4, 126, 171] . We then included software risks pertinent to ML systems, such as design, implementation, safety, privacy, security, and organizational risks [82, 162] . Separately, we surveyed the ML literature and identified common themes such as training/validation data, algorithm, robustness, design, implementation, privacy, security, and emergent behavior risks. We supplemented the above with reporting on ML-related incidents and professional experience, identifying more risks such as application, misapplication, and control risks. Concurrently, we also identified factors that affect each risk. Finally, we grouped them into first-and second-order risks using the following criteria: First-order risks directly arise from the choices made during the ML system's conception, design, and implementation, and relate to the ways it can fail. The consequences of these choices lead to second-order risks when the ML system interacts with the world. Second-order risks, hence, relate to the impact of first-order consequences on the real world. Although differentiating the risks into two orders increases MLSR's complexity, it allows practitioners to safely exclude a second-order risk from their list of concerns if the ML system is not vulnerable to the associated first-order risk. First-order risks can be generally broken down into risks arising from intended and unintended use, system design and implementation choices, and properties of the chosen dataset and learning components. Note: This is an unordered list. This is the risk posed by the intended application or use case. It is intuitive that some use cases will be inherently "riskier" than others (e.g., an autonomous weapons system vs. a customer service chatbot). Application domain. As alluded to above, the intended purpose of the ML system can be a major risk factor, holding all other variables constant. Other than the specific use case, the domain could also contribute to the application risk. For example, it is intuitive that the negative consequences are more severe for an image classification system used to aid melanoma diagnoses than one used for Lego brick identification. Consequentiality of system actions. The impact of the ML system's actions on the affected community members is another important factor in the system's application. 
For example, a slightly inaccurate automated text scoring system carries relatively minor consequences if used only for providing feedback on ungraded homework, compared to being used for grading school assignments. While inaccuracies in the latter use case may affect a student's annual ranking, it carries a lower risk compared to using the same system to grade national exams that determine a student's future, where even minor inaccuracies can unfairly impact their ability to enter their desired university or major [65] . Protected populations impacted. Most societies have special protections for certain population groups such as children, the elderly, disabled, or ethnic minorities. For example, in the US, the Child Online Privacy Protection Act imposes stricter requirements on operators of websites or online services directed to children under 13 years of age [66] . Similarly, some social groups may be more vulnerable to the negative impacts of an ML system and lower thresholds for harm may therefore be necessary for them. The US Federal Trade Commission has warned of penalties against companies that sell or use biased AI systems that harm protected groups [67] . Effect on existing power differentials and inequalities. Use cases that entrench or amplify power differentials between the organization employing the system and the affected population should be assigned a higher risk from a human rights perspective. This can take the form of increased surveillance , which increases the organization's power over the public but not vice-versa. Other applications may amplify systemic inequalities due to the ease, scale, and speed with which predictions can now be made [62] . Additionally, the act of codifying it in a potentially black-boxed ML system may entrench these learned biases when humans fail to question their predictions [120] . Scope of deployment environment. A system operating in an open environment, such as the outdoors, will often have to account for more uncertainties than in a closed one, such as an apartment. Consequently, there is a higher likelihood of failure in the former. For example, an autonomous cleaning robot deployed in a park will be exposed to a significantly more diverse range of inputs than one used in an apartment. In the latter, the system does not need to handle significant changes in weather conditions and seasons. Additionally, the ability to navigate uneven and unstable terrain will likely be less critical for an indoor cleaning robot compared to one deployed in a park. We refer to this "openness" as the deployment environment's scope: a wider scope presents more potential points of failure and, therefore, a higher risk. Scale of deployment. The scale of a use case will also significantly affect its risk. For example, a system that affects a community of 42 will likely have a lower upper bound of negative consequences compared to being deployed worldwide. Presence of relevant evaluation techniques/metrics. Although held-out accuracy is commonly used to evaluate ML models developed for research, this assumes that the training distribution and the deployment environment's distribution are identical. Such evaluation will be insufficient for ML systems meant to be used in the real world since this assumption is often violated. The result is poor system robustness to distributional variation with various second-order consequences (see Section 4.5). 
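As a rough illustration of why held-out accuracy alone can be misleading, the sketch below trains a toy text classifier and compares accuracy on a held-out split against accuracy on the same held-out inputs after simple character-level noise. The dataset, model, and noise function are placeholder assumptions, not a recommended evaluation protocol.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

random.seed(0)

# Tiny stand-in dataset for a sentiment-like task (assumption for illustration).
texts = ["the delivery was quick and helpful", "great service and friendly staff",
         "terrible support and slow response", "the agent was rude and unhelpful"] * 25
labels = [1, 1, 0, 0] * 25

def add_typos(text: str, rate: float = 0.15) -> str:
    """Crude character-drop noise as a stand-in for real-world misspellings."""
    return "".join(ch for ch in text if ch == " " or random.random() > rate)

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=0)
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

clean_acc = model.score(X_test, y_test)                          # standard held-out estimate
noisy_acc = model.score([add_typos(t) for t in X_test], y_test)  # same inputs, shifted by noise
print(f"clean accuracy: {clean_acc:.2f}, noisy accuracy: {noisy_acc:.2f}")
```

A gap between the two numbers is the kind of robustness problem that a single held-out metric cannot reveal.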
Therefore, any evaluation of an application's risk must consider the availability of metrics to evaluate performance on the dimensions relevant to the application or deployment environment [181] . For example, a taskoriented chatbot should not only be evaluated using the success rate of the held-out validation set, but also its ability to cope with misspellings, grammatical variation, and different dialects, and generate sentences in the appropriate register. The lack of appropriate metrics reduces the ability to detect such flaws before deployment and increases the risk of negative consequences. Similarly, it is difficult to predict the impact of a risk on the real world. For example, group-level F1 scores for a face recognition system are not indicative of the magnitude of the system's impact on an individual when it is wrong in the real world (e.g., the consequences of arresting a wrongly identified but innocent minority [3] ). Optionality of interaction. The ability to opt-out of interacting with or being affected by an ML system can limit its negative impacts on a person. For example, choosing to interact with a human customer service agent rather than a chatbot may reduce the risk of being misunderstood if the chatbot has not been specifically trained on the customer's language variety. Inversely, being unable to opt-out of the interaction may increase the likelihood and frequency that an individual will experience negative consequences resulting from the ML system. For example, replacing human agents with automated ones as interfaces to essential services may unintentionally prevent the underprivileged from using them due to linguistic barriers. This is a real possibility when the agents have trained on the prestige variety of a language, but the people most in need of access to social welfare services only speak a colloquial variety. Accountability mechanisms. From an organizational perspective, mechanisms that hold the actors accountable for the systems they build reduce the likelihood of negative consequences. For example, an organization might create explicit acceptability criteria, such as comparable accuracy across social groups, reward engineers for meeting these criteria, and block deployment when the system falls short. However, this will only work when acceptance criteria are not in conflict (e.g., engineers being rewarded more for increased user engagement than meeting an acceptable bias threshold). Stakeholders' machine learning literacy. To give useful feedback and seek remediation, the affected community member might require basic knowledge of how ML systems work and the ways they could be impacted. For example, someone unaware of how recommendation algorithms work (or even the existence of such algorithms) may be unable to appreciate the extent to which their political views are influenced by their consumption of social media and video streaming sites [10, 15, 79, 151] . 2 The affected individual will hence be unaware that they are in an echo chamber, resulting in an inability to break free or give appropriate feedback to the product developers [96] . Research has also shown a person's knowledge of AI to affect their interpretation of machine-generated explanations [59] . This is the risk posed by an ideal system if used for a purpose/in a manner unintended by its creators. In many situations, negative consequences arise when the system is not used in the way or for the purpose it was intended, and can be thought of as being "misapplied". 
An example is a semi-autonomous vehicle being used as if it were fully autonomous, with the driver taking their hands off the wheel or even leaving the driver's seat while the vehicle is in motion [20] . Ability to prevent misuse. The ability to prevent misuse before it occurs significantly reduces misapplication risk. In the case of autonomous vehicles, the car might be programmed to automatically slow to a stop if individuals remove their hands from the wheel or if there is a significant weight decrease in the driver's seat while the car is in motion. However, while such failsafes significantly reduce risk, they do not entirely eliminate it since they can be bypassed [9] . Ability to detect misuse. Being able to detect if the ML system is being used for unintended purposes is crucial to preventing misuse. This can take the form of a component that alerts the organization when a user tries to process inputs with features that match those belonging to prohibited applications (e.g., using a computer vision system for physiognomic purposes), or detect prohibited actions (e.g., leaving the driver's seat when the semi-autonomous vehicle is in motion). Merely relying on whistleblowers and journalists to detect misuse will likely result in the vast majority of misuses going undetected. The detection method's efficacy would, therefore, inversely affect the misapplication risk. Ability to stop misuse. Assuming it is possible to detect misapplication, the next factor in managing this risk is an organization's ability to stop misuse once it has been detected. For example, the ability to detect if a customer is using a computer vision system for an unacceptable application (e.g., face recognition for predictive law enforcement) and terminate their access will significantly lower the likelihood of the system being used for such purposes. This is directly related to the system's control risk (see Section 4.8). Being able to instantly shut the system down or terminate the user's access will lower the likelihood and severity of negative consequences stemming from misuse, compared to a delayed or non-response, and could be the difference between life and death for the people affected by the system. This is the risk of the ML algorithm, model architecture, optimization technique, or other aspects of the training process being unsuitable for the intended application. Since these are key decisions that influence the final ML system, we capture their associated risks separately from design risks, even though they are part of the design process. Performance of model architecture, optimization algorithm, and training procedure. Different combinations of model architecture, optimization algorithm, and training procedure have different effects on its final performance (e.g., accuracy, generalization). These choices are independent of modeling choices (discussed in Section 4.6), where the ML practitioner translates a problem statement into an ML problem/task (e.g., by defining the input and output space). For example, a language model can be trained with either the causal or masked language modeling objective [52] . While the latter is suitable for text classification, it may be suboptimal for text generation. Additionally, some training procedures (e.g., domain adversarial training [74] ) may improve the ML system's ability to generalize to new domains with minimal extra training data but may hurt performance on the original domain. 
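As a minimal illustration of the objective choice mentioned above, the sketch below computes a causal and a masked language modeling loss over the same token sequence, using a toy embedding-plus-linear stand-in for a real model. A real masked LM would also replace the masked input tokens with a mask symbol before the forward pass; that step is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len, hidden = 100, 8, 16
tokens = torch.randint(0, vocab_size, (1, seq_len))   # placeholder token ids

# Toy "model": embedding followed by a linear LM head (stand-in for a transformer).
embed = torch.nn.Embedding(vocab_size, hidden)
lm_head = torch.nn.Linear(hidden, vocab_size)
logits = lm_head(embed(tokens))                        # shape (1, seq_len, vocab_size)

# Causal LM objective: predict token t+1 from tokens up to t (labels shifted by one).
causal_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)

# Masked LM objective: compute the loss only at randomly masked positions.
mask = torch.rand(tokens.shape) < 0.15
mask[0, 0] = True                                      # ensure at least one masked position
mlm_labels = tokens.masked_fill(~mask, -100)           # -100 positions are ignored by the loss
masked_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), mlm_labels.reshape(-1), ignore_index=-100
)
print(causal_loss.item(), masked_loss.item())
```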
While accuracy on general benchmark datasets is often used to differentiate models, a better indicator of real-world efficacy is performance on similar applications, due to nuances in the target distribution and the tendency of state-of-the-art models to be optimized for leaderboards [61] . Reliability and computational cost of machine learning component(s) in production. Beyond efficacy, it is also important to consider the reliability and resource intensiveness of the chosen ML algorithm, model architecture, and optimization technique combination in production scenarios. From an operational standpoint, a highly accurate system that is computationally intensive or failure-prone may be less desirable than a slightly less accurate one without those flaws. Explainability/transparency. Algorithmic opacity and unpredictability can pose risks and make it difficult to ensure accountability [175] . While new mandated levels of transparency and explainability of algorithms are being demanded through the likes of the EU's General Data Protection Regulation (GDPR) to tackle bias and discrimination [81] , it can be at times impossible for the experts to interpret how certain outputs are derived from the inputs and design of the algorithm [8, 175] . This suggests the difficulty of assigning liability and accountability for harms resulting from the use of the ML system, as inputs and design rules that could yield unsafe or discriminatory outcomes cannot as easily be predicted [105, 111] . Therefore, a system that can explain its decision in the event of a mistake is often desirable in high-stakes applications. A mistake can take the form of an accident resulting from a decision [132] , a denied loan [106] , assigning different credit limits based on gender [101] . While explainability on its own is insufficient to reduce biases in the system or make it safer, it may aid the detection of biases and spurious features, thereby reducing safety and discrimination risks when the flaws are rectified. Other use cases, such as judicial applications [49] , may require such explainability due to their nature. However, not all machine learning algorithms are equal in this regard. Decision trees are often considered highly explainable since they learn human-readable rules to classify the training data, 3 while deep neural networks are a well-known example of a black-box model. While there have been recent advances in explaining neural network predictions [98] , researchers have also demonstrated the ability to fool attention-based interpretation techniques [144] . This may allow developers to prevent the network's predictions from being correctly interpreted during an audit. The choice of an ML algorithm and its training method, therefore, affects this aspect of algorithm risk. This is the risk posed by the choice of data used for training and validation. Due to their data-driven nature, the behavior of machine learning systems is often heavily influenced by the data used to train them. An ML system trained on data encoding historical or social biases will often exhibit similar biases in its predictions. Separate from the training data, validation datasets are often used to evaluate an ML model's ability to generalize beyond the training data, to new examples from the same distribution, or to examples with different characteristics (other distributions). 
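One simple way to use validation data for this purpose is to break metrics down by subpopulation or domain. A minimal sketch is shown below, assuming hypothetical dialect groups and a binary task purely for illustration; large per-group gaps on a representative validation set are one concrete signal of a mismatch.

```python
from collections import defaultdict

# Hypothetical validation records: (predicted, actual, group). The group labels
# and predictions are illustrative assumptions, not real evaluation results.
val_records = [
    (1, 1, "dialect_A"), (0, 0, "dialect_A"), (1, 1, "dialect_A"), (1, 0, "dialect_A"),
    (0, 1, "dialect_B"), (0, 0, "dialect_B"), (1, 0, "dialect_B"), (0, 1, "dialect_B"),
]

correct, total = defaultdict(int), defaultdict(int)
for pred, gold, group in val_records:
    correct[group] += int(pred == gold)
    total[group] += 1

overall = sum(correct.values()) / sum(total.values())
print(f"overall accuracy: {overall:.2f}")
for group in sorted(total):
    acc = correct[group] / total[group]
    # A large negative gap for a group flags potential representativeness or bias issues.
    print(f"{group}: accuracy {acc:.2f} (gap vs overall {acc - overall:+.2f})")
```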
Representative validation data can be used to detect potential mismatches between the training data and the deployment environment, such as the presence of social biases or spurious features in the training data. We summarize key data risks specific to ML systems and refer the reader to Demchenko et al. for a detailed discussion of the general issues around big data [50] . Control over training and validation data. Using pretrained models (e.g., GPT-3 [27] , BERT [52] , Inception [174] ) for processing unstructured data such as images and text is becoming increasingly common. While this can significantly improve performance, the trade-off is reduced control over the training data for teams that do not pretrain their own models and simply build on top of publicly released models or machine learning API services (e.g., translation). Given the discovery of systemic labeling errors, stereotypes, and even pornographic content in popular datasets such as ImageNet [16, 135, 187] , it is important to consider the downstream ramifications of using models pretrained on these datasets. The studies mentioned above were performed on publicly available datasets; Birhane et al. further highlight the existence of pretrained models trained on private datasets that cannot be independently audited by researchers [16] . Demographic representativeness of training and validation data. Due to the data-driven nature of machine learning, training an ML system on data that insufficiently represent underrepresented demographics may lead to disproportionate underperformance for these demographics during inference, especially if unaccounted for during model design [28, 89, 173] . This is representativeness in the quantitative sense, of the "number of examples in the training/validation set", and the performance disparity can result in allocational harms where the minority demographics have reduced access to resources due to the poorer performance. For example, poor automated speech recognition performance for minority dialect speakers (e.g., African American Vernacular English) will have devastating consequences in the courtroom [99, 107] . We may also think of representativeness in the qualitative sense, where stereotypical examples are avoided and fairer conceptions of these demographics are adopted [28, 75, 103] . Since labels are often crowdsourced, there is the additional risk of bias being introduced via the annotators' sociocultural backgrounds [6, 51] and desire to please [139] . Similarity of training and validation data distribution to deployment distribution. Where demographic representativeness deals with the proportion of subpopulations in the dataset, distributional similarity is more concerned with major shifts between training and deployment distributions. This can occur when there is no available training data matching a niche deployment setting and an approximation has to be used. However, this comes with the risk of domain mismatch and consequently, poorer performance. For example, an autonomous vehicle trained on data compiled in Sweden would not have been exposed to jumping kangaroos. Subsequently deploying the vehicle in Australia will result in increased safety risk from being unable to identify and avoid them, potentially increasing the chance of a crash [204] . Quality of data sources. The popular saying, "garbage in, garbage out", succinctly captures the importance of data quality for ML systems. 
Common factors affecting the quality of labeled data include annotator expertise level [197] , inter-annotator agreement [23, 136] , overlaps between validation and training/pretraining data [116] . The recent trend towards training on increasingly large datasets scraped from the web makes manual data annotation infeasible due to the sheer scale. While such datasets satiate increasingly large and data-hungry neural networks, they often contain noisy labels [117] , harmful stereotypes [2, 12, 54] , and even pornographic content [166] . Kreutzer et al. manually audited several multilingual web-crawled text datasets and found significant issues such as wrongly labeled languages, pornographic content, and non-linguistic content [32] . An even greater concern from the ML perspective is the leakage of benchmark test data and machine-generated data (e.g., machine-translated text, GAN-generated images) into the training set [54] . The former was only discovered after training GPT-3 [27] , while the latter is inevitable in uncurated web-crawled data due to its prevalence on the Internet. Researchers have also discovered bots completing data annotation tasks on Amazon Mechanical Turk, a platform used to collect human annotations for benchmark datasets. 4 However, cleaning such datasets is no mean feat: blocklist-based methods for content filtering may erase reclaimed slurs, minority dialects, and other non-offensive content, inadvertently harming the minority communities they belong to [54] . In fact, the very notion of cleaning language datasets may reinforce sociocultural biases and deserves further scrutiny [179] . Presence of personal information. The presence of personal information in the training data increases the risk of the ML model memorizing this information, as deep neural networks have been shown to do [7, 201] . This could lead to downstream consequences for privacy when membership inference attacks are used to extract such information [165] . We discuss this in greater detail in Section 5.4. This is the risk of the system failing or being unable to recover upon encountering invalid, noisy, or out-of-distribution (OOD) inputs. There is often significant variation in real-world environments, compared to research benchmarks. For example, objects may appear different under various lighting conditions or wear out over time, and human-generated text often exhibits sociolinguistic variation [112] . Additionally, malicious actors may exploit flaws in a system's design to hijack it (e.g., in the form of an adversarial attack [80] ). The inability to handle the above situations may lead to negative consequences for safety (e.g., autonomous vehicle crashes) or fairness (e.g., linguistic discrimination against minority dialect speakers [181, 182] ). Since ML systems sit at the intersection of statistics and software engineering, our definition encompasses two different definitions of robustness: the first relates to distributional robustness, where a method is resistant to deviations from the training data distribution [90] ; the second refers to the ability of a system to "function correctly in the presence of invalid inputs or stressful environmental conditions" [91] . Scope of deployment environment. Similar to Section 4.1, the deployment environment's scope determines the range of variation the ML system will be exposed to. 
For example, it may be acceptable for an autonomous robot operating in a human-free environment to be unable to recognize humans, but the same cannot be true for a similar robot operating in a busy town square. A larger range, therefore, usually necessitates either a more comprehensive dataset that can capture the full range of variation or a mechanism that makes the system robust to input variation. A broader scope may also increase the possibility of adversarial attacks, particularly when the system operates in a public environment. Mechanisms for handling of out-of-distribution inputs. Out-of-distribution (OOD) inputs refer to inputs that are from a distribution different from the training distribution [69] . They include inputs that should be invalid, noisy inputs (e.g., due to background noise, scratched/blurred lenses, typographical mistakes, sensor error), natural variation (e.g., different accents, lens types, environments, grammatical variation), and adversarial inputs (i.e., inputs specially crafted to evade perception or induce system failure). Incorporating mechanisms that improve robustness (e.g., adversarial training [121] ) reduces robustness risk, but often comes with extra computational overhead during training or inference. Failure recovery mechanisms. In addition to functioning correctly in the presence of OOD inputs, system robustness also includes its ability to recover from temporary failure [157] . An example of recovery is an autonomous quadrupedal robot regaining its footing without suffering physical damage after missing a step on the way down a staircase [64] . This is the risk of system failure due to system design choices or errors. While the ML model is the core component, we should not neglect the risks resulting from how the problem is modeled as an ML task and the design choices concerning other system components, such as the tokenizer in natural language processing (NLP) systems. Data preprocessing choices. ML systems often preprocess the raw input before passing them into their modeling components for inference. Examples include tokenization [127] , image transformation, and data imputation and normalization. Additionally, data from multiple sources and modalities (image, text, metadata, etc) may be combined and transformed in ETL (extract, transform, load) pipelines before being ingested by the model. The choices made here will have consequences for the training and operation of the ML model. For example, filtering words based on a predefined list, as was done for Copilot [39] . Such simplistic filtering does not account for the sociolinguistic nuances of slurs and offensive words, and could unintentionally marginalize the very communities it was intended to protect [12] . Modeling choices. The act of operationalizing an abstract construct as a measurable quantity necessitates making some assumptions about how the construct manifests in the real world [94] . Jacobs and Wallach show how the measurement process introduces errors even when applied to tangible, seemingly straightforward constructs such as height [94] . A mismatch between the abstract construct and measured quantity can lead to poor predictive performance, while confusing the measured quantity for the abstract construct can have unintended, long-term societal consequences [22] . In contrast to recent end-to-end approaches for processing unstructured data (e.g., image, text, audio), ML systems that operate on tabular data often make use of hand-engineered features. 
The task of feature selection then rests on the developer. Possible risks here include: 1) Training the ML component on spurious features; 2) Using demographic attributes (e.g., race, religion, gender, sexuality) or proxy attributes (e.g., postal code, first or last name, mother tongue) for prediction [137] . The former could result in poor generalization or robustness, the latter, entrenching discrimination against historically marginalized demographics. For example, the automated essay grading system used in the GRE was shown to favor longer words and essays over content relevance, unintentionally overscoring memorized text [24, 146] . Other automated grading systems have proven to be open to exploitation by both students and NLP researchers [35, 53] . Specificity of operational scope and requirements. Designs are often created based on requirements and specifications. Consequently, failing to accurately specify the requirements and operational scope of the system increases the risk of encountering phenomena it was not designed to handle. This risk factor is likely to be most significant for ML systems that are high stakes or cannot be easily updated post-deployment. Design and development team. Although software libraries such as PyTorch [140] and transformers [195] are increasing the accessibility of machine learning, a technical understanding of ML techniques and their corresponding strengths and weaknesses is often necessary for choosing the right modeling technique and mitigating its flaws. Similarly, good system design requires engineers with relevant experience. A team with the relevant technical expertise may be able to identify gaps in the design requirements and help to improve them. Conversely, the lack of either increases the risk of an ML system failing post-deployment or having some unforeseen effects on the affected community. There have been calls for mandatory certification of engineers to ensure a minimum level of competency and ethical training, though they are largely voluntary [38] . Additionally, the diversity of a team (in terms of demographics) will affect its ability to identify design decisions that may disproportionately impact different demographics [142] , such as using proxy attributes in modeling or training an international chatbot only on White American English. Stakeholder and expert involvement. Since the development team is unlikely to be able to identify all potential negative consequences, other experts (e.g., human rights experts, ethicists, user researchers) and affected stakeholders should be consulted during the design process [71, 181] . This involvement helps to mitigate the team's blind spots and identify unintended consequences of its design choices, allowing them to be addressed before anyone is harmed. In some cases of participatory machine learning, affected stakeholders can directly influence the system's design as volunteers [83] . This is the risk of system failure due to code implementation choices or errors. A design may be imperfectly realized due to the organization's coding, code review, or code integration practices leading to bugs in the system's implementation. Additionally, the rise of open-source software packages maintained by volunteers (e.g., PyTorch) brings with them a non-trivial chance for bugs to be introduced into the system without the developers' knowledge [183] . Reliability of external libraries. Software development is increasingly reliant on open source libraries, and machine learning is no different. 
Despite their benefits (e.g., lower barrier to entry), using external libraries, particularly when the development team is unfamiliar with the internals, increases the risk of failure due to bugs in the dependency chain [183]. Additionally, over-reliance on open source libraries may result in critical systems going down if the dependencies are taken offline [41]. The level of risk here is therefore determined by the reliability of and community support for the library in question. For example, a library that is widely used and regularly updated by a paid team will likely be more reliable than one released by a single person as a hobby project, even though both are considered open source libraries. However, this is not a given, as the recently discovered Log4j vulnerability demonstrates [108]. Other common sources of bugs resulting from the use of external libraries are API changes that are not backward-compatible [203]. Code review and testing practices. The intertwined nature of the data, model architecture, and training algorithm in ML systems poses new challenges for rigorous testing [202]. In addition, deep learning systems often fail silently and continue to work despite implementation errors (see https://ppwwyyxx.com/blog/2017/Unawareness-Of-Deep-Learning-Mistakes). Good code review and unit testing practices may help to catch implementation errors that may otherwise go unnoticed, lowering the implementation risk [158]. This is the difficulty of controlling the ML system. In many scenarios, the ability to shut down an ML system before it causes harm can significantly reduce its second-order risks. For example, the ability to instantly override an autonomous weapon system's decision may be the difference between life and death for a wrongly targeted civilian [68]. Level of autonomy. ML systems are often designed with different levels of autonomy in mind: human-in-the-loop (human execution), human-on-the-loop (human supervision), and full autonomy [68, 131]. Fully autonomous systems may be more difficult to regain control of in the event of a malfunction; however, it may be simpler to program contingency measures since system developers may assume that the system always bears full responsibility. For example, a real estate company's automated house-flipping system was able to proceed with purchasing over 6,000 houses even after its neural-network-based forecasting algorithm (https://www.zillow.com/z/zestimate) generated inaccurate forecasts of house prices during the COVID-19 pandemic, resulting in financial losses of over USD 420,000,000 [73]. In contrast, its competitor emerged relatively unscathed due to its use of human supervision [168]. On the other hand, although a human-supervised system is designed to make intervention easier, the dynamics of human-machine interactions may increase the difficulty of determining responsibility as a situation unfolds. While human oversight is theoretically desirable, the above paradox indicates that a human-on-the-loop design could increase control risk if the additional complexity is not accounted for. Manual overrides. In human-on-the-loop and fully autonomous systems, the ability to rapidly intervene and either take manual control of or shut down the system is crucial to mitigating the harms that result from misprediction. One factor that significantly impacts this ability is the latency of the connection to the ML system (remote vs. on-site intervention).
This is particularly important in applications that may cause acute physical or psychological injuries, such as autonomous weapons/vehicles and social media bots with a wide reach. Other factors include the ease with which the human supervisor can identify situations requiring intervention and the ease of transitioning from an observer to an actor [192]. These are often tightly connected to the design choices made with regard to the non-ML components of the system. For example, appropriate explainability/interpretability functionality may help the human supervisor identify failures (e.g., when the system's actions and explanations do not align). For high-stakes applications, human supervisors will need to be sufficiently trained (and potentially certified) to react appropriately when they need to assume control. This is the risk resulting from novel behavior acquired through continual learning or self-organization after deployment. Although the most commonly discussed ML systems are those trained on static datasets (systems that are continuously retrained fall in this category), there is a paradigm of machine learning known as continuous, active, or online learning. In the latter, the model is updated (instead of retrained) when new data becomes available. While such a paradigm allows an ML system to adapt to new environments post-deployment, it introduces the danger of the ML system acquiring novel undesirable behavior. For example, the Microsoft Tay chatbot, which was designed to learn from interactions with other Twitter users, picked up racist behavior and conspiracy theories within twenty-four hours of being online [148]. This paradigm (and associated risks) will likely be most relevant for robots and other embodied agents that are designed to adapt to changing environments [154]. Task type. The danger of emergent behaviors will likely differ depending on the task the ML system is designed to perform. For example, an NLP system that is mainly in charge of named entity recognition will likely be less dangerous than a chatbot even if both acquire new behaviors through continual learning, since the former has a limited output/action space. Novel behavior can also emerge when ML systems interact with each other. This interaction can take place between similar systems (e.g., AVs on the road) or different types of systems (e.g., autonomous cars and aerial drones). This is similar to the idea of swarm behavior [19, 159], where novel behavior emerges from the interaction of individual systems. While desirable in certain situations, there remains a risk of unintended negative consequences. Scale of deployment. The number of deployed systems interacting is particularly relevant to novel behaviors emerging due to self-organization since certain types of swarming behavior may only emerge when a certain critical mass is reached [55]. For example, swarm behavior would be more likely to emerge in vehicular traffic comprising mainly autonomous vehicles surrounding traditional vehicles than vice-versa. Second-order risks result from the consequences of first-order risks and relate to the risks resulting from an ML system interacting with the real world, such as risks to human rights, the organization, and the natural environment. This is the risk of direct or indirect physical or psychological injury resulting from interaction with the ML system.
By nature, ML systems take away some degree of control from their users when they automate certain tasks. Intuitively, this transfer of control should be accompanied by a transfer of moral responsibility for the user's safety [143] . Therefore, a key concern around ML systems has been ensuring the physical and psychological safety of affected communities. In applications such as content moderation, keeping the system updated may involve the large-scale manual labeling and curation of toxic or graphic content by contract workers. Prolonged exposure to such content results in psychological harm, which should be accounted for when assessing the safety risk of these types of ML systems [134, 170] . First-order risks may lead to safety risk in different ways. For example, poor accuracy may lead to the system failing to recognize a pedestrian and running them over [33] , a melanoma identifier trained on insufficiently diverse data may result in unnecessary chemotherapy [169] , or swarming ML systems that endanger human agents (e.g., high-speed maneuvers via inter-vehicular coordination making traffic conditions dangerous for traditional vehicles) [196] . The inability to assume/regain control in time may also result in increased safety risk, (e.g, overriding an autonomous weapon before it mistakenly shoots a civilian) [68] . This is the risk of an ML system encoding stereotypes of or performing disproportionately poorly for some demographics/social groups. ML systems gatekeeping access to economic opportunity, privacy, and liberty run the risk of discriminating against minority demographics if they perform disproportionately poorly for them. This is known as "allocational harm". Another form of discrimination is the encoding of demographic-specific stereotypes and is a form of "representational harm" [43] . The Gender Shades study highlighted performance disparities between demographics in computer vision [28] while Bolukbasi et al. discovered gender stereotypes encoded in word embeddings [18] . Recent reporting has also exposed gender and racially-aligned discrimination in ML systems used for recruiting [45] , education [65] , automatic translation [86] , and immigration [149] . We focus on how discrimination risk can result from first-order risks and refer the reader to comprehensive surveys for discussions on the biases in ML algorithms [17, 94, 124, 161, 172] . There are various ways in which first-order risks can give rise to discrimination risk. For example, facial recognition systems may be misused by law enforcement, using celebrity photos or composites in place of real photos of the suspect [76] . This leads to discrimination when coupled with performance disparities between majority and minority demographics [28] . Such disparities may stem from misrepresentative training data and a lack of mitigating mechanisms [161] . Insufficient testing and a non-diverse team may also cause such disparities to pass unnoticed into production [59, 142] . Finally, even something as fundamental as an argmax function may result in biased image crops [198] . This is the risk of loss or harm from intentional subversion or forced failure. Goodfellow et al. discovered the ability to induce mispredictions in neural computer vision models by perturbing the input with small amounts of adversarially generated noise [80] . This is known as an evasion attack since it allows the attacker to evade classification by the system. 
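A minimal sketch of such an evasion attack, in the spirit of the fast gradient sign method, is shown below. The randomly initialized linear classifier is only a stand-in for a real vision model, so the perturbation may or may not flip its prediction here; against trained models, small perturbation budgets are often enough.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder classifier standing in for a real vision model (assumption).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
model.eval()

image = torch.rand(1, 1, 28, 28)   # stand-in input image
label = torch.tensor([3])          # stand-in ground-truth class

# FGSM-style evasion: nudge the input in the direction that increases the loss.
image.requires_grad_(True)
loss = F.cross_entropy(model(image), label)
loss.backward()
epsilon = 0.1                      # perturbation budget
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

print("clean prediction:      ", model(image).argmax(dim=1).item())
print("adversarial prediction:", model(adversarial).argmax(dim=1).item())
```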
Some attacks emulate natural phenomena such as raindrops, phonological variation, or code-mixing [11, 58, 180, 182, 200] . ML systems tend to be highly vulnerable if the models have not been explicitly trained to be robust to the attack. Another attack vector involves manipulating the training data such that the ML system can be manipulated with specific inputs during inference, (e.g., to bypass a biometric identification system) [34] . This is known as "data poisoning. " The application, control over training data, and model's robustness to such attacks are potential risk factors. Finally, there is the risk of model theft. Researchers have demonstrated the ability to "steal" an ML model through ML-as-a-service APIs by making use of the returned metadata (e.g., confidence scores) [102, 110, 138, 184] . Extracted models can be deployed independent of the service, or used to craft adversarial examples to fool the original models. The application setting and design choices significantly affect the amount of metadata exposed externally. For example, while an autonomous vehicle does not return the confidence scores of its perception system's predictions, model thieves may still be able to physically access the system and directly extract the model's architecture definition and weights. The risk of loss or harm from leakage of personal information via the ML system. Although we only focus on privacy in this section, we use the GDPR's definition of personal data due to its broad coverage: "any information relating to an identified or identifiable natural person". 8 Privacy breaches often result from compromised databases [133] and may be mitigated with proper data governance and stewardship [152] . However, we wish to highlight privacy risks that are specific to ML systems. Although federated learning [164] has been proposed to avoid storing training data in a central location (avoiding the problem of compromised databases), it may still be possible to recover training examples from a model learned in this manner [77, 78] . Researchers have also demonstrated that information about the training data can be retrieved from an ML model [37, 70, 165] , and in some cases, the training examples themselves can even be extracted [31] . Therefore, simply securing the training data is now insufficient. The risk of harm to the natural environment posed by the ML system. There are three major ways in which ML systems can harm the environment. The first is increased pollution or contribution to climate change due to the system's consumption of resources. This relates to the energy cost/efficiency during training and inference, hence, the energy efficiency of the chosen algorithm, its implementation, and training procedure are key factors here [5, 113, 171] . Other key factors include the energy efficiency of the system's computational hardware and the type of power grid powering the ML system since some power sources (e.g., wind turbines) are cleaner than others (e.g. fossil fuels) [85] . The second is the negative effect of ML system's predictions on the environment and relate to the system's use case, prediction accuracy, and robustness. For example, an ML system used for server scaling may spin up unnecessary resources due to prediction error, causing an increase in electricity consumption and associated environmental effects. Another ML system may be used to automatically adjust fishing quotas and prediction errors could result in overfishing. 
Organizational risk is the risk of financial and/or reputational damage to the organization building or using the ML system. An organization may incur such damage when its ML system is shown to have negative consequences for safety, fairness, security, privacy, or the natural environment. For example, a company was lambasted for its search engine's response to a query about India's ugliest language [93]. Reputational damage can also occur if the public perceives the system as likely to cause such consequences, as in the case of a police department trialing the Spot robot [88].

Although we have discussed a number of common risks posed by ML systems, we acknowledge that there are many other ethical risks, such as the potential for psychological manipulation, dehumanization, and exploitation of humans at scale [126]. These align with the notion of surveillance capitalism, in which humans are treated as producers of data that are mined for insights into their future behavior [205]. Such insights are often used to sell advertisement exposures. This incentive mismatch between the public and companies can lead to design choices that are detrimental to the former but beneficial to the latter [206]. Examples include the fanning of religious tensions that increased offline violence [84, 193] and the promotion of outrageous content to increase engagement [56].

The negative impacts of an ML system (e.g., on fairness, safety, and the environment, or through surveillance capitalism) are often the result of suboptimal choices during its conception, design, and implementation. Existing work on software RA largely focuses on risks to the project's success rather than the project's risk to the world. Moreover, such frameworks were developed for deterministic software systems and do not account for the new risks posed by software systems that learn from data. Existing work on algorithmic IA, on the other hand, primarily focuses on assessing these impacts or examining the factors leading to specific categories of impacts (e.g., fairness). Existing draft regulations also refer to a vague notion of risk that can be intentionally or unintentionally misinterpreted to suit the actor's position [190]. To improve the quality of discussions around ML risk, we present the Machine Learning System Risk framework (MLSR) to connect the risk of system failures (first-order effects) to the risk of societal and environmental impacts (second-order effects). We do this by examining advances in the ML literature and connecting them to research on algorithmic impacts. Drawing connections between specific first- and second-order risks helps ML practitioners pinpoint the parts of the ML system and development process that may have negative impacts. Although the second-order risks may appear significantly more pressing, our framework indicates that they are often symptoms of problems in the system's design and development. Hence, addressing the first-order risks of an ML system will naturally mitigate its second-order risks to the external world.
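As a minimal illustration of how such a mapping might be recorded in practice, the hypothetical snippet below links first-order findings to the second-order risks they may trigger, using only pairings discussed in this paper. The keys, values, and function are invented for illustration; a real assessment would enumerate system-specific links and attach severity and likelihood estimates to each.

```python
# Hypothetical first-order -> second-order links drawn from examples in this
# paper; not an exhaustive taxonomy. Keys are first-order findings for a
# specific system, values are the second-order risks they may lead to.
MLSR_LINKS = {
    "algorithm: poor perception accuracy": ["safety"],
    "training data: non-representative sample": ["safety", "discrimination"],
    "robustness: vulnerable to adversarial inputs": ["security", "safety"],
    "control: no timely human override": ["safety"],
    "design: confidence scores exposed via API": ["security"],
    "implementation: energy-inefficient training loop": ["environmental"],
}

def second_order_risks(findings):
    """Collect the second-order risks implied by a list of first-order findings."""
    return sorted({risk for f in findings for risk in MLSR_LINKS.get(f, [])})

print(second_order_risks([
    "training data: non-representative sample",
    "control: no timely human override",
]))  # -> ['discrimination', 'safety']
```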
It is equally critical that both internal and independent regulators holistically assess the risks of proposed and existing ML systems, beyond those often discussed in the ethical AI community. MLSR is therefore a first step towards a common vocabulary for such assessments. For practitioners looking to reduce the negative impacts of the ML systems they build, we recommend using MLSR to map out how the first-order risks posed by the ML system lead to its second-order risks before applying conventional risk estimation methods to quantify them. Future work may also explore how MLSR can be combined with RA tools such as severity-likelihood estimation to develop templates for conducting RAs. Connecting social and environmental impacts to specific technical failures may also inspire the creation of new ML techniques that address these risks, creating more diversity in technical research and reducing the emphasis on beating the state of the art.

Table 2. Summary of second-order risks and first-order risks that may lead to each risk (excerpt: Safety).
Safety:
• Application: Neural generation-based chatbot for dispensing health advice
• Misapplication: Leaving the driver's seat of a semi-autonomous vehicle while in motion
• Algorithm: Poor human identification accuracy in AV perception resulting in accidents
• Training & validation data: Melanoma identifier trained on non-diverse data leading to unnecessary chemotherapy
• Robustness: AV perception system misclassifying the moon as a traffic light while on an expressway
• Design: Under-specification of the weather conditions the AV is expected to operate in
• Implementation: Software bugs in an open-source library used in the AV perception system
• Control: Inability to prevent an autonomous weapon from shooting a civilian
• Emergent behavior: Microsoft Tay

REFERENCES
Algorithmic Accountability Act of Persistent anti-muslim bias in large language models The Computer Got It Wrong': How Facial Recognition Led To False Arrest Of Black Man Ethics and Algorithms Toolkit Carbontracker: Tracking and predicting the carbon footprint of training deep learning models Karthikeyan Natesan Ramamurthy, and Moninder Singh A closer look at memorization in deep networks Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI CR Engineers Show a Tesla Will Drive With No One in the Driver's Seat Facebook news and (de) polarization: reinforcing spirals in the 2016 US election Synthetic and Natural Noise Both Break Neural Machine Translation On the Dangers of Stochastic Parrots: Can Language Models Be Too Big Race After Technology: Abolitionist Tools for the New Jim Code Against the Gods: The Remarkable Story of Risk Users polarization on Facebook and Youtube Large image datasets: A pyrrhic win for computer vision Language (Technology) is Power: A Critical Survey of "Bias" in NLP Man is to Computer Programmer as Woman is to Homemaker?
Debiasing Word Embeddings Swarm intelligence: From natural to artificial systems CHP investigates man caught doing potentially deadly Tesla autopilot stunt on Bay Area roads Analysing and assessing accountability: A conceptual framework Sorting things out: Classification and its consequences Inter-annotator Agreement for a German Newspaper Corpus Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country Energy Efficiency and Economic Fallacies Identifying risks using a new assessment tool: the missing piece of the jigsaw in medical device risk assessment Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification Automated Decision Systems Accountability Act Leadership and pandering: A theory of executive policymaking Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models Quality at a glance: An audit of web-crawled multilingual datasets Lawsuit filed against Tesla after accident that injured 5 police officers Targeted backdoor attacks on deep learning systems using data poisoning These students figured out their tests were graded by AI -and the easy way to cheat. The Verge China Academy of Information and Communications Technology and JD Explore Academy. 2022. White Paper on Trustworthy Artificial Intelligence Label-Only Membership Inference Attacks AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries Banned: The 1,170 words you can't use with GitHub Copilot Profiles of the Future How one programmer broke the internet by deleting a tiny piece of code US Nuclear Regulatory Commission, et al. 2013. Spent Fuel Transportation Risk Assessment. NUREG-2125 Advances in Neural Information Processing Systems 30 Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women Amazon scraps secret AI recruiting tool that showed bias against women Data.govt.nz. 2020. Algorithm Charter For Aotearoa New Zealand Data Ethics Commission of the Federal Government (Germany) PRIAM: a privacy risk analysis methodology. In Data privacy management and security assurance The judicial demand for explainable artificial intelligence Big security for big data: Addressing security challenges for the big data infrastructure Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Don't take "nswvtnvakgxpm" for an answer -The surprising vulnerability of automatic content scoring systems to adversarial input Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Critical mass in the emergence of collective intelligence: a parallelized simulation of swarms in noisy environments Five points for anger, one for a 'like': How Facebook's formula fostered rage and misinformation Ecological risk assessment framework for low-altitude overflights by fixed-wing and rotary-wing military aircraft From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks The who in explainable ai: How ai background shapes perceptions of ai explanations Social impact assessment: the state of the art. 
Impact Assessment and Project Appraisal Utility is in the Eye of the User: A Critique of NLP Leaderboards Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor Proposal for a Regulation laying down harmonised rules on artificial intelligence Robust rough-terrain locomotion with a quadrupedal robot Flawed Algorithms Are Grading Millions of Students' Essays Children's Online Privacy Protection Rule Aiming for truth, fairness, and equity in your company's use of AI Regulating human control over autonomous systems A Baseline for Detecting Misclassified and Out of-Distribution Examples in Neural Networks. 2017. Hendrycks, Dan and Gimpel, Kevin Model inversion attacks that exploit confidence information and basic countermeasures Value Sensitive Design: Shaping Technology with Moral Imagination Probabilistic risk assessment in the nuclear power industry Zillow, facing big losses, quits flipping houses and will lay off a quarter of its staff Domain-adversarial training of neural networks Word embeddings quantify 100 years of gender and ethnic stereotypes Garbage In, Garbage Out: Face Recognition on Flawed Data Inverting Gradients-How easy is it to break privacy in federated learning? Oya Beyan, Stefan Decker, and Chunming Rong. 2021. Towards General Deep Leakage in Federated Learning Effects of the news-finds-me perception in communication: Social media use implications for news seeking and learning about politics Explaining and Harnessing Adversarial Examples European Union regulations on algorithmic decision-making and a "right to explanation Risk filtering, ranking, and management framework using hierarchical holographic modeling Ores: Lowering barriers with participatory machine learning in wikipedia The Facebook whistleblower says its algorithms are dangerous. Here's why Towards the systematic reporting of the energy and carbon footprints of machine learning Facebook translates 'good morning' into 'attack them', leading to arrest Software Risk Management NYPD stops using Boston Dynamics' robodog following backlash. MSN Moving beyond "algorithmic bias is a data problem Robust statistics Risk management -Vocabulary India's 'Ugliest' Language? Google Had an Answer (and Drew a Backlash) Measurement and Fairness An assessment of risk and safety in civil aviation ChamberBreaker: Mitigating the Echo Chamber Effect and Supporting Information Hygiene through a Gamified Inoculation System The coal question: Can Britain survive? (1865) How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods Testifying while black: An experimental study of court reporter accuracy in transcription of The Government of Canada's Algorithmic Impact Assessment: Take Two Apple's 'sexist' credit card investigated by US regulator The Thieves on Sesame Street are Polyglots -Extracting Multilingual Models from Monolingual APIs One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision Economic implications of mandated efficiency in standards for household appliances Crashed software: assessing product liability for software defects in automated vehicles Credit denial in the age of AI Racial disparities in automated speech recognition The Log4j security flaw could impact the entire internet An Action-Oriented AI Policy Toolkit for Technology Audits by Community Advocates and Activists Thieves on Sesame Street! 
Model Extraction of BERT-based APIs Accountable algorithms Sociolinguistic patterns Quantifying the carbon emissions of machine learning Fuzzy risk assessment of oil and gas offshore wells Understanding Canada's Algorithmic Impact Assessment Tool Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets Webvision database: Visual learning and understanding from web data Autonomous vehicles for smart and sustainable cities: An in-depth exploration of privacy and cybersecurity implications Algorithmic decision-making in AVs: Understanding ethical and technical concerns for smart cities When Do We Trust AI's Recommendations More Than People's? Towards Deep Learning Models Resistant to Adversarial Attacks AI and Big Data: A blueprint for a human rights, social and ethical impact assessment Atmospheric pollution in Leicester A Survey on Bias and Fairness in Machine Learning Algorithmic Impact Assessments and Accountability: The Co-Construction of Impacts Microsoft. 2020. Harms Modeling Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP Policy and society related implications of automated driving: A review of literature and directions for future research Assembling Accountability: Algorithmic Impact Assessment for the Public Interest Environmental impact assessment Trusted autonomy between humans and robots: Toward human-on-the-loop in robotics and autonomous systems This Road In Yosemite Is Causing Teslas On Autopilot To Crash Equifax Officially Has No Excuse YouTube moderators are being forced to sign a statement acknowledging the job can give them PTSD. The Verge Pervasive label errors in test sets destabilize machine learning benchmarks How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy Knockoff nets: Stealing functionality of black-box models On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications PyTorch: An imperative style, high-performance deep learning library Information security risk analysis Ignoring Diversity Hurts Tech Products and Ventures The moral responsibility gap and the increasing autonomy of systems Learning to Deceive with Attention-Based Explanations Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing Understanding Mean Score Differences Between the e-rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test Risk assessment: Theory, methods, and applications Why Microsoft's 'Tay' AI bot went wrong New Zealand passport robot tells applicant of Asian descent to open eyes Algorithmic impact assessments: A practical framework for public agency accountability The homogeneity of right-wing populist and radical content in YouTube recommendations Data governance and stewardship: designing data stewardship entities and advancing data access Doctor GPT-3: hype or reality? Nabla Technologies Blog Marc Toussaint, and Michiel Van de Panne. 2021. 
From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence The generational impact of software The Khazzoom-Brookes postulate and neoclassical growth Learning robust failure response for autonomous vision based flight Unit Tests for Stochastic Optimization Wilfried Elmenreich, Farshad Arvin, Ahmet Şekercioğlu, and Micha Sende. 2021. Swarm intelligence and cyber-physical systems: concepts, challenges and future trends Disparate impact in big data policing Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview The three dimensions of software risk: technical, organizational, and environmental Information systems risks and risk factors: are they mostly about information systems? Privacy-preserving deep learning Membership inference attacks against machine learning models It Began as an AI-Fueled Dungeon Game. It Got Much Darker Risk assessment in maritime transportation Redfin CEO explains how its iBuyer home buying program avoided pitfalls that sunk Zillow Group Do AI models recognise rare, aggressive skin cancers? An assessment of a direct-to-consumer app in the diagnosis of Merkel cell carcinoma and amelanotic melanoma The Psychological Well-Being of Content Moderators: The Emotional Labor of Commercial Moderation and Avenues for Improving Support Energy and Policy Considerations for Deep Learning in NLP Mitigating Gender Bias in Natural Language Processing: Literature Review A framework for understanding unintended consequences of machine learning Rethinking the Inception Architecture for Computer Vision Governance of artificial intelligence Governing autonomous vehicles: emerging responses for safety, liability, privacy, cybersecurity, and industry risks Towards Autonomous Vehicles in Smart Cities: Risks and Risk Governance Assessing the regulatory challenges of emerging disruptive technologies Linguistically-Inclusive Natural Language Processing Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots Reliability Testing for Natural Language Processing Systems It's Morphin' Time! Combating Linguistic Discrimination with Inflectional Perturbations An Empirical Study of Bugs in Machine Learning Systems Stealing machine learning models via prediction apis Treasury Board of Canada Secretariat Directive on Automated Decision-Making Andrew Ilyas, and Aleksander Madry. 2020. From imagenet to image classification: Contextualizing progress on benchmarks Universal Declaration of Human Rights Demystifying the Draft EU Artificial Intelligence Act-Analysing the good, the bad, and the unclear elements of the proposed approach A GIS-based framework for hazardous materials transport risk assessment Lucid Reveals New Details of its 'DreamDrive' Advanced Driver Assist System, its Answer to Tesla's Autopilot. FutureCar An Independent Assessment of the Human Rights Impact of Facebook in Myanmar Risk assessment and information systems Transformers: State-of-the-art natural language processing Autonomous Maneuver Coordination Via Vehicular Communication Learning from multiple annotators with varying expertise Image Cropping on Twitter: Fairness Metrics, Their Limitations, and the Importance of Representation, Design, and Agency Risk assessment of construction projects It's Raining Cats or Dogs? Understanding deep learning (still) requires rethinking generalization Machine learning testing: Survey, landscapes and horizons An Empirical Study on TensorFlow Program Bugs Volvo admits its self-driving cars are confused by kangaroos. 
The Guardian Big other: Surveillance capitalism and the prospects of an information civilization The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. Hachette UK.

A SUMMARY OF FIRST- AND SECOND-ORDER RISKS

First-Order Risks and their Risk Factors
Application: • Application domain • Consequentiality of system actions • Protected populations impacted • Effect on existing power differentials and inequalities • Scope of deployment environment • Scale of deployment • Presence of relevant evaluation techniques/metrics • Optionality of interactions • Accountability mechanisms • Stakeholders' machine learning literacy
Misapplication: • Ability to prevent misuse • Ability to detect misuse • Ability to stop misuse
Algorithm: • Performance of model architecture, optimization algorithm, and training procedure • Reliability and computational cost of machine learning component(s) in production • Explainability/transparency
Training & validation data: • Control over data • Demographic representativeness of data • Similarity of training, validation, and deployment distributions • Quality of data sources • Presence of personal information
Robustness: • Scope of deployment environment • Mechanisms to improve OOD input handling • Failure recovery mechanisms
Design: • Data preprocessing choices • Modeling choices • Specificity of operational scope and requirements • Design and development team • Stakeholder and expert involvement
Implementation: • Reliability of external libraries • Code review and testing practices

ACKNOWLEDGMENTS
We are grateful to Anna Bethke, Min-Yen Kan, and Qian Cheng for their insightful feedback on drafts of this paper.