Chapter 9

Fragility and Intelligibility of Deep Learning for Libraries

Michael Lesk, Rutgers University

Introduction

On February 7, 2018, Mounir Mahjoubi, then the "digital minister" of France (le secrétariat d'État chargé du Numérique), told the civil service to use only computer methods that could be understood (Mahjoubi 2018). To be precise, what he actually said to l'Assemblée Nationale was: "Aucun algorithme non explicable ne pourra être utilisé." I gave this to Google Translate and asked for it in English. What I got (on October 13, 2019) was:

No algorithm that can not be explained can not be used.

That's a long way from fluent English. As I count the "not" words, it's actually reversed in meaning. But what if I leave off the final period when I enter it in Google Translate? Then I get:

No non-explainable algorithm can be used

Quite different, and although only barely fluent, now the meaning is right. The difference was only the final punctuation on the sentence.[1]

This is an example of the fragility of an AI algorithm. The point is not that both translations are of doubtful quality. The point is that a seemingly insignificant change in the input produced such a difference in the output. In this case, the fragility was detected by accident.

[1] In the months between my original queries in October 2019 and the final preparations for publication in November 2020, the algorithm has changed to produce the same translation with or without a period: "No non-explicable algorithm can be used."

Machine learning systems have a set of data for training. For example, if you are interested in translation, and you have a large collection of text in both French and English, you might notice that the word truck in English appears where the word camion appears in French. And the system might "learn" this translation. It would then apply this in other examples; this is called generalization. Of course, if you wish to translate French into British English, a preferred translation of camion is lorry. And if the context of your English truck is a US discussion of the wheels and axles underneath railway vehicles, the better French word is le bogie.

Deep learning enthusiasts believe that with enough examples, machine learning systems will be able to generalize correctly. There can be various kinds of failures: we can discuss both (a) problems in the scope of the training data and (b) problems in the kind of modeling done. If the system has sufficiently general input data so that it learns well enough to produce reliably correct results on examples it has not seen, we call it robust; robustness is the opposite of fragility.

Fragility errors here can arise from many sources—for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Or the data may not have the scope of the real problem: if you train for "boat" based on ocean liners, don't be surprised if the program fails on canoes.

In addition, there are also modeling issues. Suppose you use a very simple model, such as a linear model, for data that is actually perhaps quadratic or exponential. This is called "underfitting" and may often arise when there is not enough training data.
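To make underfitting concrete, here is a minimal sketch, assuming NumPy is available; the data and numbers are invented for illustration. A straight line is fitted to data generated by a quadratic process, and its error stays large no matter how many such points are collected, because the model family is too simple.

```python
# A minimal sketch of underfitting (synthetic data, invented for illustration):
# a straight line fitted to data generated by a quadratic process.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)  # quadratic signal + noise

# Degree-1 (linear) fit: too simple for the underlying curve.
linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
# Degree-2 fit: matches the process that generated the data.
quad_fit = np.polyval(np.polyfit(x, y, deg=2), x)

def rmse(pred):
    return np.sqrt(np.mean((y - pred) ** 2))

print(f"linear fit RMSE:    {rmse(linear_fit):.2f}")  # stays large: underfitting
print(f"quadratic fit RMSE: {rmse(quad_fit):.2f}")    # roughly the noise level
```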
The reverse is also possible: there may be a lot of training data, including many noisy points, and the program may decide on a very complex model to cover all the noise in the training data. This is called "overfitting" and gives you an answer too dependent on noise and outliers in your data. For example, 1998 was an unusually warm year, but the decline in world temperature for the next few years suggests it was noise in the data, not a change in the development of climate.

Fragility is also a problem in image recognition ("AI Recognition" 2017). Currently the most common technique for image recognition research projects is the use of convolutional neural nets. Recently, several papers have looked at how trivial modifications to images may impact image classification. Here (figure 9.1) is an image taken from Su, Vargas, and Sakurai (2019). The original image class is in black and the classifier choice (and confidence) after adding a single unusual pixel are shown in blue, with the extraneous pixel in white. The images were deliberately processed at low resolution—hence the pixellation—to match the input requirement of a popular image classification program.

Figure 9.1: Examples of misclassification.

The authors experimented with algorithms to find the quickest single-pixel change that would deceive an image classifier. They were routinely able to fool the recognition software. In this example, the deception was deliberate; the researchers searched for the best place to change the image.

Bias and mistakes

We have seen a major change in the way we do machine learning, and there are real dangers involved. The current enthusiasm for neural nets risks the use of processes which cannot be understood, as Mahjoubi warned, and which can thus conceal methods we would not approve of, such as discrimination in lending or hiring. Cathy O'Neil has described this in her book Weapons of Math Destruction (2016).

There is much research today that seeks methods to explain what neural nets are doing. See Guidotti et al. (2018) for a survey. There is also a 2018 DARPA program on "Explainable AI." Techniques used can include looking at the results over a range of input data and seeing if the neural net can be modeled by a decision tree, or modifying the input data to see which input elements have the greatest effect on the results, and then showing that to the user. For example, Mariusz Bojarski et al. describe a self-driving system that highlights what it thinks is important in what it is seeing (2017). However, this is generally research in progress, and it raises the question of whether we can trust the explanation generator.

Many popular magazines have discussed this problem; Forbes, for example, had an explanation of how the choice of datasets can produce a biased result without any deliberate attempt to do so (Taulli 2019). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). The MIT Media Lab hosts the Algorithmic Justice League, trying to stop organizations from building socially slanted systems. Similar thoughts come from groups like the Data and Society Research Institute or the AI Now Institute.

Again, the problems may be accidental or deliberate. The phrase "data poisoning" has been used to suggest malicious creation of training data or examples of data designed to deceive machine learning systems.
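The single-pixel search described above used a more sophisticated method (differential evolution over pixel positions and colors), but its flavor can be sketched with a brute-force loop. In the sketch below, classify is a hypothetical stand-in for whatever image classifier is being probed, and the image is assumed to be a NumPy array.

```python
# Toy sketch in the spirit of the single-pixel experiments: try one pixel at a
# time and see whether the classifier's label flips. `classify` is a hypothetical
# stand-in for a real image classifier returning (label, confidence); `image` is
# assumed to be a NumPy array (grayscale H x W or color H x W x 3).
def single_pixel_attack(image, classify, value=255):
    original_label, _ = classify(image)
    height, width = image.shape[0], image.shape[1]
    for row in range(height):
        for col in range(width):
            perturbed = image.copy()
            perturbed[row, col] = value          # change exactly one pixel
            new_label, confidence = classify(perturbed)
            if new_label != original_label:
                return row, col, new_label, confidence
    return None  # no single-pixel change (at this value) fooled the classifier
```

A real attack, like the one in the paper, would search over pixel values as well as positions rather than fixing a single value.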
There is now a DARPA research program, "Guaranteeing AI Robustness against Deception (GARD)," supporting research to learn how to stop trickery such as a demonstration of converting a traffic stop sign to a 45 mph speed limit with a few stickers (Eykholt et al. 2018). More generally, bias in systems deciding whether to grant loans may be discriminatory but nevertheless profitable.

Even if you want to detect AI mistakes, recognizing such problems is difficult. Often things will be wrong and we won't know why. And even hypothetical (but perhaps erroneous) explanations can be very convincing; people easily believe plausible stories. I routinely give my students a paper that concludes that prior ownership of a cat prevents fatal myocardial infarctions; its result implies that cats are more protective than statin drugs (Qureshi et al. 2009). The students are very quick to come up with possibilities like "petting a cat is relaxing, relaxation reduces your blood pressure, and lower blood pressure decreases the risk of heart attacks." Then I have to explain that the paper evaluates 32 possibilities (prior/current ownership × cats/dogs × 4 medical conditions × fatal/nonfatal) and you shouldn't be surprised if you evaluate 32 chances and one is significant at the 0.05 level, which is only 1 in 20: with 32 independent tests, the chance that at least one comes out significant at that level is about 1 − 0.95^32 ≈ 0.8. In this example, there is also the question of reverse causality: perhaps someone who is in ill health will decide he is too sick to take care of a pet, so that the poor health is not caused by the lack of a cat, but rather the poor health causes the absence of a cat.

Sometimes explanations can help, as in a machine learning program that was deliberately trained to distinguish images of wolves and dogs but was trained using pictures of wolves that always contained snow and pictures of dogs that never did (Ribeiro, Singh, and Guestrin 2016). Without explaining that, 10 of 27 subjects thought the classifier was trustworthy; after pointing out the snow, only 3 of 27 subjects believed the system. Usually you don't get such a clear presentation of a mis-trained system.

Recognition of problems

Can we tell when something is wrong? Here's the result of a Google Photo merge of three other photos: two landscapes and a picture of somebody's friend. The software was told to make a panorama and stitched the images together (Peng 2018). It looks like a joke, and even made it into a list of top jokes on Reddit. The author's point was that the panorama system didn't understand basic composition: people are not the same scale as mountains.

Figure 9.2: Panoramic landscape.

Often, machine learning results are overstated. Google Flu Trends was acclaimed for several years and then turned out to be undependable (Lazer et al. 2014). A study that attempted to compare the performance of machine learning systems for medical diagnosis with actual doctors found that of over 20,000 papers analyzed, only a few dozen had data suitable for an evaluation (Liu et al. 2019). The results claimed comparable accuracy, but virtually none of the papers presented adequate data to support that conclusion. Unusually promising results are sometimes the result of overfitting (Brownlee 2018); this is what was wrong with Google Flu Trends. A machine learning program can learn a large number of special cases and then find that the results do not generalize.
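A routine defense against this failure is to hold out data the model never sees during training and compare the two scores; a large gap between training and test accuracy is the classic signature of overfitting. Here is a minimal sketch, assuming scikit-learn and a synthetic dataset invented for illustration.

```python
# Minimal sketch: a held-out test set exposes a model that has memorized noise.
# The dataset is synthetic (scikit-learn's make_classification, with 20% of the
# labels deliberately flipped); the point is the gap between the two scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can fit every quirk (and every flipped label) of the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # near 1.0
print("test accuracy:    ", model.score(X_test, y_test))    # noticeably lower
```

Limiting the tree's depth, or collecting more training data, narrows the gap; the point is that the held-out score, not the training score, is the one to believe.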
In other cases problems can result when using "clean" data for training, and then encountering messier data in applications. Ideally, training and testing data should be from the same dataset and divided at random, but it can be tempting to start off with examples that are the result of initial and higher quality data collection.

Sometimes in the past we had a choice between modeling and data for predictions. Consider, for example, the problem of guessing what the weather will be tomorrow. We now do this based on a model of the atmosphere that uses the Navier-Stokes equations; we use supercomputers and derive tomorrow's atmosphere from today's (Christensen 2015). What did we do before we had supercomputers? Solving those equations by hand is impractical. One of the methods was "prediction by analogy": find some day in the past whose weather was most similar to today. Suppose that day is Oct. 20, 1970. Then use October 21, 1970 as tomorrow's prediction. Prediction by analogy doesn't require you to have a model or use advanced mathematics. In this case, however, it doesn't work as well—partly because we don't have enough past days to choose from, and we only get new days at the rate of one per day.

In fact, Huug van den Dool estimated the amount of past data needed to make accurate predictions this way as 10^30 years' worth, which is far more than the age of the universe (Wilks 2008). The underlying problem is that the weather is very random. If your state lottery is properly run, it should be completely pointless to look at past winning numbers and try to guess the next one. The weather is not that random, but it has too much variation to be solved easily by analogy. If your problem is very simple (tic-tac-toe) you could indeed write down each position and what the best next move is; there are only about 255,000 games.

To deal with more realistic problems, much of machine learning research is now focused on obtaining larger training sets. Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, "more data beats better algorithms." In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, "The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets. The basic learning and decoding algorithms have not changed substantially in 40 years" (2014). Nevertheless, speech recognition has gone from frustration to useful products such as dictation software or home appliances.

Lacking a model, however, means that we won't know the limits of the calculations being done. For example, if you have some data that looks quadratic, but you fit a linear model, any attempt at extrapolation is fraught with error. If you are using a "black box" system, you don't know when this is happening. And, regrettably, many AI software systems are sold as black boxes where the purchasers and users do not have access to the process, even if they might be able to understand it.

What's changing

Many AI researchers are sensitive to the risks, especially given the publicity over self-driving cars. As the hype over "deep learning" built up, writers discussed examples such as a Pittsburgh medical system that proposed to send patients with both pneumonia and asthma home, because the computer had not understood that patients with both problems were actually being sent to the ICU (Bornstein 2016; Caruana et al. 2015).
Many people work on ways of explaining or presenting neural net software (Harley 2015). Most important, perhaps, are new EU regulations that prohibit automated decision making that affects EU citizens and provide a "right of explanation" (Metz 2016).

We recognize that systems which don't rely on a mathematical model may be cheaper to build than ones where the coders understand what is going on. More seriously, they may also be more accurate. This image is from the same article on understandability (Bornstein 2016).

Figure 9.3: Explainability.

If there really is a tradeoff between what will solve the problem and what can be explained, we know that many system builders will choose to solve the problem. And yet even having explanations may not be an answer; a key paper on interpretability discusses the complexities of meaning related to explanation, causality, and modeling (Lipton 2018).

Arend Hintze has noted that we do not always impose a demand for explanation on people. I can write that the New York Public Library main building is well proportioned and attractive without anyone expecting that I will recite its dimensions or the source of the marble used to construct it. And for some problems that's fine: I don't care how my camera decides on the focus distance to the subject. Where it matters, however, we often want explanations; the hard ethical problem, as noted before, is if better performance can be achieved in an inexplicable way.

Recommendations

2017 saw the publication of the "Asilomar AI principles" (2017). Two of these principles are:

• Safety: AI systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible.
• Failure Transparency: If an AI system causes harm, it should be possible to ascertain why.

The problem is that the technology used to build many systems does not enable verifiability and explanation. Similarly, the World Economic Forum calls for protection against discrimination but notes many ways in which technology can have unanticipated and undesirable effects as a result of machine learning ("How to Prevent" 2018).

Historically there has been and continues to be too much hype. An important image recognition task is distinguishing malignant and benign spots on mammograms. There have been promises for decades that computers would do this better than radiologists. Here are examples from 1995 ("computer-aided diagnosis can improve radiologists' observational performance") (Schmidt and Nishikawa 1995) and 2009 ("The Bayesian network significantly exceeded the performance of interpreting radiologists") (Burnside et al. 2009). A typical recent AI paper to do this with convolutional neural nets reports 90% accuracy (Singh et al. 2020). To put this in perspective, the problem is complex, but some examples are more straightforward, and even pigeons can reach 85% (Levenson et al. 2015). A serious recent review is "Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection" (Lehman et al. 2015). Very recently there was another claim that computers have surpassed radiologists (Walsh 2020); we will have to await evaluation. As with many claims of medical progress, replicability and evaluation are needed before doctors will be willing to believe them.

What should we do?
Software testing generally is a decades-old discipline, and many basic principles of regression testing apply here also:

• Test data should cover the full range of expected input.
• Test data should also cover unexpected and even illegal input.
• Test data should include known past failures believed cleared up.
• Test data should exercise all parts of the program, and all important paths (coverage).
• Test data should include a set of data which is representative of the distribution of actual data, to be used for timing purposes.

It is difficult to apply these ideas in parts of the AI world. If the allowed input is speech, there is no exhaustive list of utterances which can be sampled. If a black-box commercial machine learning package is being used, there is no way to ask about coverage of any number of test cases. If a program is constantly learning from new data, there is no list of previously fixed failures to be collected that reflects the constantly changing program.

And obviously the circumstances of use matter. We may well, as a society, decide that forcing banks evaluating loan applications to use decision trees instead of deep learning is appropriate, so that we know whether illegal discrimination is going on, even if this raises the costs to the banks. We might also believe that the safest possible railway operation is important, even if the automated train doesn't routinely explain how it balanced its choices of acceleration to achieve high punctuality and low risk.

What would I suggest? Organizationally:

• Have teams including both the computer scientists and the users.
• Collaborate with a statistician: they've seen a lot of these problems before.
• Work on easier problems.

As examples, I watched a group of zoologists with a group of computer scientists discussing how to improve accuracy at identifying animals in photographs. The discussion indicated that you needed hundreds of training examples at a minimum, if not thousands, since the animals do not typically walk up to the camera and pose for a full-frame shot. It was important to have both the people who understood the learning systems and the people who knew what the pictures were realistically like. The most amusing contribution by a statistician happened when a computer scientist offered a program that tried to recognize individual giraffes, and a zoologist complained that it only worked if you had a view of the right-hand side of the giraffe. Somebody who knew statistics perked up and said, "It's a 50% chance of recognizing the animal? I can do the math for that." And it is simpler to do "is there any animal in the picture?" before asking "which animal is it?" and create two easier problems.

Technically:

• Try to interpolate rather than extrapolate: use the algorithm on points "inside" the training set (thinking in multiple dimensions).
• Lean towards feature detection and modeling rather than completely unsupervised learning.
• Emphasize continuous rather than discrete variables.

I suggest using methods that involve feature detection, since that tells you what the algorithm is relying on. For example, consider the Google Flu Trends failure; the public was not told what terms were used. As David Lazer noted, some of them were just "winter" terms (like "basketball"). If you know that, you might be skeptical.
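Even when the model itself is opaque, it is usually possible to ask which inputs it leans on. The sketch below uses scikit-learn's permutation importance on a synthetic dataset; the data are stand-ins for whatever a real application (search terms, loan-application fields) would supply.

```python
# Minimal sketch: asking a fitted model which inputs it actually relies on.
# The data are synthetic stand-ins for whatever a real application would use
# (search terms, loan-application fields, and so on).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time and measure how much the
# held-out score drops; features whose shuffling barely matters are not being used.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```

Features whose shuffling barely changes the score are effectively unused, which is the kind of evidence you would want before trusting, or distrusting, a system.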
More significant are decisions like jail sentences or college admissions; that racial or religious discrimination played no role can be verified by knowing that the program did not use those attributes. Knowing what features were used can sometimes help the user: if you know that your loan application was downrated because of your credit score, it may be possible for you to pay off some bill to raise the score.

Sometimes you have to use categorical variables (what county do you live in?), but if you have a choice of how you phrase a variable, asking something like "how many minutes a day do you spend reading?" is likely to produce a better fit than asking people to choose "how much do you read: never, sometimes, a lot?" A machine learning algorithm may tell you how much of the variance each input variable explains; you can use that information to focus on the variables that are most important to your problem, and decide whether you think you are measuring them well enough.

Why not extrapolate? Sadly, as I write this in early April 2020, we are seeing all sorts of extrapolations of the COVID-19 epidemic, with expected US deaths ranging from 30,000 to 2 million, as people try to fit various functions (Gaussians, logistic regression, or whatever) with inadequately precise data and uncertain models. A simpler example is Mark Twain's: "In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the 'Old Oolitic Silurian Period,' just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen" (1883).

Finally, note the advice of Edgar Allan Poe: "Believe nothing you hear, and only one half that you see."

References

"AI Recognition Fooled by Single Pixel Change." BBC News, November 3, 2017. https://www.bbc.com/news/technology-41845878.
"Asilomar AI Principles." 2017. https://futureoflife.org/ai-principles/.
Bojarski, Mariusz, Larry Jackel, Ben Firner, and Urs Muller. 2017. "Explaining How End-to-End Deep Learning Steers a Self-Driving Car." NVIDIA Developer Blog. https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.
Bornstein, Aaron. 2016. "Is Artificial Intelligence Permanently Inscrutable?" Nautilus 40 (1). http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.
Brownlee, Jason. 2018. "The Model Performance Mismatch Problem (and What to Do about It)." Machine Learning Mastery. https://machinelearningmastery.com/the-model-performance-mismatch-problem/.
Burnside, Elizabeth S., Jessie Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M. Geller, Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. 2009. "Probabilistic Computer Model Developed from Clinical Data in National Mammography Database Format to Classify Mammographic Findings." Radiology 251 (3): 663–72.
Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. "Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission." In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 1721–30. New York: ACM Press. https://doi.org/10.1145/2783258.2788613.
Christensen, Hannah. 2015. "Banking on Better Forecasts: The New Maths of Weather Prediction." The Guardian, January 8, 2015. https://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes.
Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramèr, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. "Physical Adversarial Examples for Object Detectors." 12th USENIX Workshop on Offensive Technologies (WOOT 18).
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. "A Survey of Methods for Explaining Black Box Models." ACM Computing Surveys 51 (5): 1–42.
Halevy, Alon, Peter Norvig, and Fernando Pereira. 2009. "The Unreasonable Effectiveness of Data." IEEE Intelligent Systems 24 (2).
Harley, Adam W. 2015. "An Interactive Node-Link Visualization of Convolutional Neural Networks." In Advances in Visual Computing, edited by George Bebis et al., 867–77. Lecture Notes in Computer Science. Cham: Springer International Publishing.
"How to Prevent Discriminatory Outcomes in Machine Learning." 2018. White Paper from the Global Future Council on Human Rights 2016–2018, World Economic Forum. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning.
Huang, Xuedong, James Baker, and Raj Reddy. 2014. "A Historical Perspective of Speech Recognition." Communications of the ACM 57 (1): 94–103.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. "The Parable of Google Flu: Traps in Big Data Analysis." Science 343 (6176): 1203–1205.
Lehman, Constance, Robert Wellman, Diana Buist, Karl Kerlikowske, Anna Tosteson, and Diana Miglioretti. 2015. "Diagnostic Accuracy of Digital Screening Mammography with and without Computer-Aided Detection." JAMA Internal Medicine 175 (11): 1828–1837.
Levenson, Richard M., Elizabeth A. Krupinski, Victor M. Navarro, and Edward A. Wasserman. 2015. "Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images." PLoS One, November 18, 2015. https://doi.org/10.1371/journal.pone.0141357.
Lipton, Zachary. 2018. "The Mythos of Model Interpretability." Communications of the ACM 61 (10): 36–43.
Liu, Xiaoxuan et al. 2019. "A Comparison of Deep Learning Performance against Health-Care Professionals in Detecting Diseases from Medical Imaging: A Systematic Review and Meta-Analysis." Lancet Digital Health 1 (6): e271–97. https://www.sciencedirect.com/science/article/pii/S2589750019301232.
Mahjoubi, Mounir. 2018. "Assemblée nationale, XVe législature. Session ordinaire de 2017–2018." Compte rendu intégral, Deuxième séance du mercredi 07 février 2018. http://www.assemblee-nationale.fr/15/cri/2017-2018/20180137.asp.
Metz, Cade. 2016. "Artificial Intelligence Is Setting Up the Internet for a Huge Clash with Europe." Wired, July 11, 2016. https://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/.
O'Neil, Cathy. 2016. Weapons of Math Destruction. New York: Crown.
Peng, Tony. 2018. "2018 in Review: 10 AI Failures." Medium, December 10, 2018. https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983.
Qureshi, A. I., M. Z. Memon, G. Vazquez, and M. F. Suri. 2009. "Cat Ownership and the Risk of Fatal Cardiovascular Diseases: Results from the Second National Health and Nutrition Examination Study Mortality Follow-up Study." Journal of Vascular and Interventional Neurology 2 (1): 132–5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 1135–1144. New York: ACM Press.
Schmidt, R. A., and R. M. Nishikawa. 1995. "Clinical Use of Digital Mammography: The Present and the Prospects." Journal of Digital Imaging 8 (1 Suppl 1): 74–9.
Singh, Vivek Kumar et al. 2020. "Breast Tumor Segmentation and Shape Classification in Mammograms Using Generative Adversarial and Convolutional Neural Network." Expert Systems with Applications 139.
Su, Jiawei, Danilo Vasconcellos Vargas, and Kouichi Sakurai. 2019. "One Pixel Attack for Fooling Deep Neural Networks." IEEE Transactions on Evolutionary Computation 23 (5): 828–841.
Taulli, Tom. 2019. "How Bias Distorts AI (Artificial Intelligence)." Forbes, August 4, 2019. https://www.forbes.com/sites/tomtaulli/2019/08/04/bias-the-silent-killer-of-ai-artificial-intelligence/#1cc6f35d7d87.
Twain, Mark. 1883. Life on the Mississippi. Boston: J. R. Osgood & Co.
Tugend, Alina. 2019. "The Bias Embedded in Tech." The New York Times, June 17, 2019, section F, 10.
Walsh, Fergus. 2020. "AI 'Outperforms' Doctors Diagnosing Breast Cancer." BBC News, January 2, 2020. https://www.bbc.com/news/health-50857759.
Wilks, Daniel S. 2008. Review of Empirical Methods in Short-Term Climate Prediction, by Huug van den Dool. Bulletin of the American Meteorological Society 89 (6): 887–88.