key: cord-0633810-fhptchfq
authors: Stijn, Jip van
title: Moral Decision-Making in Medical Hybrid Intelligent Systems: A Team Design Patterns Approach to the Bias Mitigation and Data Sharing Design Problems
date: 2021-02-16
sha: 58fe205c1a50f4e0c1c3480903e6b08ca99dd2f0
doc_id: 633810
cord_uid: fhptchfq

Increasing automation in the healthcare sector calls for a Hybrid Intelligence (HI) approach to closely study and design the collaboration of humans and autonomous machines. Ensuring that medical HI systems' decision-making is ethical is key. The use of Team Design Patterns (TDPs) can advance this goal by describing successful and reusable configurations for design problems in which decisions have a moral component, as well as by facilitating communication in multidisciplinary teams designing HI systems. For this research, TDPs were developed to describe a set of solutions for two design problems in a medical HI system: (1) mitigating harmful biases in machine learning algorithms and (2) sharing health and behavioral patient data with healthcare professionals and system developers. The Socio-Cognitive Engineering methodology was employed, integrating operational demands, human factors knowledge, and a technological analysis into a set of TDPs. A survey was created to assess the usability of the patterns in terms of their understandability, effectiveness, and generalizability. The results showed that TDPs are a useful method to unambiguously describe solutions for diverse HI design problems with a moral component at varying abstraction levels, in a way that is usable by a heterogeneous group of multidisciplinary researchers. Additionally, results indicated that the SCE approach and the developed questionnaire are suitable methods for creating and assessing TDPs. The study concludes with a set of proposed improvements to TDPs, including their integration with Interaction Design Patterns, the inclusion of several additional concepts, and a number of methodological improvements. Finally, the thesis recommends directions for future research.

Over the past decades, the healthcare domain has witnessed a steep increase in automation. eHealth applications allow for higher quality and more cost-effective care (Elbert et al., 2014), robot-assisted surgery has proven to be effective and safe in several medical domains (Ghezzi & Corleta, 2016), and machine learning algorithms are capable of classifying radiology images with malignant cancers better than many a radiologist (Lakhani et al., 2018). As cutting-edge technological advancements continue to be made, there is ample reason to believe that the near future will continue to bring increasingly autonomous systems into the medical sector. While this process of automation has unprecedented potential, concerns have been voiced both in academia and in society. Dystopian scenarios sketch autonomous systems slowly replacing healthcare professionals, providing cold and impersonal care to patients. Others point at undesirable effects of current autonomous systems, such as the death of pedestrian Elaine Herzberg due to a collision with a self-driving car (Heaven, 2018), deaths resulting from the use of surgical robots (Alemzadeh et al., 2016), or algorithms disadvantaging minorities in their job opportunities (Dastin, 2018), access to healthcare (Obermeyer et al., 2019), and even exam results (Adam, 2020).
It is undeniable that medical autonomous systems are bound to make mistakes that influence life and death while they, unlike human health care professionals, cannot be held accountable for their actions (Pepito et al., 2019). These are legitimate worries that should be taken into serious consideration. Due to the great beneficial and commercial potential of autonomous systems, it seems implausible and unreasonable to stop the movement towards increasing automation in healthcare altogether. What is possible, however, is to study the positive and negative effects of these systems, and design them carefully so as to make sure that their actions align with our goals and values.

While it is certain that the job description of many healthcare professionals will change with the technological advancements, the fear that autonomous systems will completely take over the healthcare sector has diminished substantially in recent years. There is a general academic consensus that, although we may not be able to fully apprehend what artificial intelligence will be capable of in the far future, humans possess a distinct set of qualities that machine agents will not be able to parallel in the near future (Dellermann et al., 2019). Among such qualities are thinking outside the box, reacting to unexpected situations, and navigating the social world. In fact, a growing body of literature advocates that the abilities of human and artificial intelligence should not be viewed as competitive, but rather as complementary. This philosophy has been captured in the term 'Hybrid Intelligence' (HI), which aims to utilize the complementary strengths of human and artificial intelligence so that they can perform better than either of the two separately (Kamar, 2016; Peeters et al., 2020; Akata et al., 2020). Following this paradigm implies studying not just the cooperation of humans with an autonomous system, but rather the union of humans and machine agents in a single system. This conception allows for research that would not be possible from either the autonomous or human factors perspectives (e.g. studying the interactions between the weaknesses and strengths of both types of agents).

The field of hybrid intelligence comes with its own set of unique challenges, which are inherently interdisciplinary in nature. One of these challenges is the study of how complex human-computer teams make moral decisions, and, following from this, how we can design such systems in a way that they behave ethically. As the previous section pointed out, there is a pressing demand for research in this area, considering the fact that moral decisions in medical human-computer teams are already affecting many people and will soon affect many more, with crucial consequences for their health.

The moral component of autonomous machines has been the subject of recent studies, especially in the field of machine ethics. This discipline attempts to contribute to the creation of Artificial Moral Agents (AMAs) that follow certain ethical rules, either implicitly or explicitly (Anderson & Anderson, 2007). However, most of this research focuses on fully autonomous systems, excluding the human component of how these machines are used in practice. From a hybrid intelligence perspective, it is key to include not only machine requirements but also human cognitive capabilities in the study of AMAs. As this is a highly interdisciplinary endeavor, it is first and foremost crucial to establish a common language to talk about moral situations in Hybrid Intelligent systems.
Van der Waa et al. (2020) have made a first attempt using so-called Team Design Patterns (TDPs), describing the task allocation of human-computer teams in morally sensitive situations. However, this language offers only minimal means for expressing human cognitive components and requirements, which is a vital element for a truly hybrid intelligent approach toward moral decision-making in autonomous systems. Additionally, the method has so far solely been used as a taxonomy and has not been utilized in the design of a system, meaning that there have been no evaluations of its effectiveness.

The aim of the current thesis is twofold. Firstly, it attempts to advance the conceptualization of moral decision-making in hybrid intelligent systems in (and perhaps even outside) the medical domain. It will do so by creating Team Design Patterns for two use cases: bias mitigation and data sharing in a hybrid intelligent digital assistant for diabetes type II care. These patterns are a contribution to the academic literature on their own, as they are reusable entities for solving similar design challenges. Additionally, the creation of these patterns is meant to contribute to filling the gaps in the literature mentioned above: (1) the expression of both human cognitive components and AI requirements in the pattern language and (2) the development of TDPs not only as a library of successful and reusable design solutions, but also their application as a method in the design process of Hybrid Intelligence systems by a multidisciplinary team. Secondly, this thesis aims to contribute to a scientific standard in the methodology of conceptualizing moral decision-making in hybrid intelligent systems, as no such standard currently exists. This research will carefully document the methods used both for constructing the TDPs and for their usability evaluation. As there is currently no documented method for creating and evaluating TDPs, it is outside the scope of this thesis to fully establish such a methodology. Rather, it may serve as a first benchmark in the development of a sound methodological framework for the conceptualization of moral decision-making in human-computer teams.

Following from the research aims, the current thesis attempts to answer several research questions (RQs): RQ 1: How can Team Design Patterns describe moral decision-making for bias mitigation and data sharing in a medical hybrid intelligent system, so that they are usable by researchers from the various disciplines involved in the design of such systems? As will become clear in Chapter 4, usability of the Team Design Patterns is defined as having three main components: understandability, generalizability, and effectiveness. Hence, the current thesis will answer three sub-questions of the first research question. The next section gives an overview of how the current thesis is organized to provide an answer to these research questions.

The second chapter of this thesis describes the methodology and research design used in this research. It first addresses the two methodological theories the design of this research is based on: Socio-Cognitive Engineering (SCE) and Value-Sensitive Design. It then describes the research design, including the use case of a Hybrid Intelligent system to counter Diabetes Type II through prevention, diagnosis, treatment, and management. The third chapter presents the foundation component from the SCE approach.
It starts with an outline of the operational demands, analyzing what kind of support is needed in the moral domain of the system for diabetes care. This leads to the identification of two main design challenges: the mitigation of harmful biases in learning models and the sharing of patient data for medical and research aims. Subsequently, it analyzes the relevant human factors, focusing on psychological theories of moral decision-making. Finally, it gives an overview of the technological principles relevant to the envisioned system, including the distinction between machine learning and knowledge reasoning, and an overview of the current academic literature regarding techniques and solutions to the design challenges. Chapter 4 consists of the specification phase of the research, explaining how the knowledge from the foundation component was integrated into a set of patterns. This chapter addresses the objectives, functions, and claims of the design patterns, after which the proposed Team Design Patterns are presented. Chapter 5 describes the setup of the evaluation of the design patterns proposed in Chapter 4. It describes the development and use of a survey to obtain a combination of qualitative and quantitative data from researchers with varying backgrounds. It then presents the qualitative and quantitative results of this questionnaire. Finally, Chapter 6 discusses the findings of this research and provides suggestions for improvements of the team design patterns as presented in Chapter 4. Additionally, this chapter reflects on the methodology of this research. It concludes with a brief summary of the research and a set of suggestions for future research in the area of moral decision-making in medical Hybrid Intelligent systems.

The previous chapter demonstrated the need for studying moral decision-making in hybrid intelligent systems and presented a set of research questions to advance this goal. The current chapter sets out the methodological framework for this thesis, and the research design following from adopting this methodology. The first section (2.1) gives a detailed explanation of Team Design Patterns, including an example scenario. It then discusses the Socio-Cognitive Engineering methodology and its integration with Value-Sensitive Design. The second section discusses the research design of the thesis, consisting of three phases: foundation, specification, and evaluation. Additionally, it presents the use case of a digital assistant for patients and healthcare professionals in the prevention, diagnosis, treatment and management of diabetes type II.

As introduced in Chapter 1, this thesis aims to conceptualize and describe moral decision-making in hybrid intelligent systems in the medical domain. A promising format for this is the use of Team Design Patterns (Van Diggelen & Johnson, 2019; Van Diggelen et al., 2019; Van der Waa et al., 2020): a combination of text and pictorial language to describe possible solutions to recurring design problems. These patterns represent reusable and generic HI design solutions in a coherent way, aimed at facilitating the multidisciplinary HI design process. Subsection 2.1.1 presents the aims, use, and inner workings of Team Design Patterns as described in the academic literature so far.
As there is no set of methods yet to arrive at Team Design Patterns, subsection 2.1.2 provides the motivation for adopting the Socio-Cognitive Engineering methodology to create the desired design patterns, extended with elements of Value-Sensitive Design (Neerincx et al., 2019; Harbers & Neerincx, 2017).

The notion of design patterns originated in the domain of architecture (Alexander et al., 1977). Alexander noticed that common design problems usually resulted in similar solutions and argued for a universal language to describe generic solutions for frequently occurring problems. For example, the design of an upstairs bathroom may be different for each specific house, but always requires sufficient foundation, piping, and mechanisms for safe use of electricity, which often results in the same type of design. Alexander (ibid.) stressed that these patterns should only capture the core of the solution on a general and abstract level, to allow for the specific circumstances and limitations of each individual design. This way, 'you can use this solution a million times over, without ever doing it the same way twice' (ibid.). Since then, design pattern languages have been developed in several other disciplines, including workflow engineering, object-oriented programming, and interaction design (Van Diggelen & Johnson, 2019). Although the languages differ strongly across domains, they are often accompanied by graphical illustrations in order to optimally communicate the proposed solution of each pattern.

Recently, several scholars have attempted to create a design pattern language in the domain of human-agent teaming. However, an often-voiced worry in this domain is that it is difficult to codify the intuitive, straightforward process that is teamwork. Additionally, human-agent teamwork is relevant in many application domains with diverse sets of stakeholders, who all have differing jargon and varying knowledge of human-agent teaming. Van Diggelen and Johnson (2019) created so-called 'Team Design Patterns' (TDPs) with the aforementioned challenges in mind, resulting in a promising language to express human-agent teamwork. According to the authors, there are four requirements for these TDPs: 'Team pattern design solutions should be (1) simple enough to provide an intuitive way to facilitate discussions about human-machine teamwork solutions among a wide range of stakeholders including non-experts, (2) general enough to represent a broad range of teamwork capabilities, (3) descriptive enough to provide clarity and discernment between different solutions and situations, and (4) structured enough to have a pathway from the simple intuitive description to the more formal specification.' (Van Diggelen & Johnson, 2019, p. 118)

Table 1 shows some of the main concepts of the pattern language, including their equivalents in the pictorial language. It uses images of actors (either a human or machine agent) carrying blocks with text to signify the tasks each actor is responsible for. The boxes, or tasks, can either have a direct contribution to the team goal (indicated by opaque boxes) or an indirect contribution (aimed at making the team more efficient but not directly contributing to the team goal, indicated by partly transparent boxes). Patterns can consist of multiple frames, each showing a different allocation of tasks. A transition between these frames is depicted by solid arrows.
Sometimes these transitions are initiated by a specific actor, which is indicated by a dotted arrow from that actor to the solid transition arrow. Van der Waa et al. (2020) attempted to use this pattern language specifically for describing task allocation in moral decision-making and noted that two other concepts were important. Firstly, they added a distinction between moral tasks, requiring some sort of moral capabilities (depicted with a red color), and non-moral tasks, not requiring such capabilities (depicted with a blue color). Secondly, they incorporated the requirement of moral capabilities of the actors, by depicting them with a big heart (full human or human-like moral capabilities), a small heart (some moral capabilities, but not at the human level), or no heart (indicating no moral capabilities).

As an example, let us consider a scenario involving a blind man called Jason. Recent technological advancements allow the design of a robotic guide dog that can support Jason in his daily needs, together constituting a human-agent team. Designers of the robotic dog can use TDPs to explicate the expected behaviors, requirements, and responsibilities in this (elementary version of a) hybrid intelligent system. Let us imagine the envisioned robotic dog guiding Jason to a doctor's appointment. They are walking on a countryside road and encounter a T-intersection. They need to turn right to reach the doctor's office, but suddenly a cry for help is heard from the left. What should the team do? Figure 1 shows three possible TDPs to describe this situation in increasing levels of complexity.

Example pattern 1: This pattern shows the basic tasks: Jason follows the robot dog, and the robot dog guides its owner. Just by itself, this example is not capable of describing how the human-agent team should react to the cry for help. Using this pattern in the development of the system could result in the team simply ignoring it by design.

Example pattern 2: Jason has another task while following the robot dog: scanning for abnormalities. Since this task does not necessarily bring the team closer to the doctor's office, it is an indirect task. Additionally, because the task may require some moral cognitive components, it has a red color. If Jason finds any abnormalities, he can take the initiative to change the task division, resulting in the transition to a different frame. Now, the robotic dog stands still, while Jason decides whether to change the route.

Example pattern 3: This pattern is similar to pattern 2, but the robot dog has more responsibilities. Not only Jason, but also the robotic dog can scan for abnormalities and initiate a transition. This may be useful because the robotic dog may be able to perceive visual abnormalities or even have better hearing than Jason. After the transition, when Jason is deciding whether to change the route, the robotic dog can provide decision support. For example, the robotic dog may be able to assess the level of urgency of the person in need or remind Jason of the urgency of his own doctor's appointment.
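To make the structure of such a pattern concrete for readers with a programming background, the following minimal sketch expresses example pattern 3 as plain data objects in Python. The class and field names are illustrative assumptions for discussion purposes only; they are not part of the published TDP language or its pictorial notation.

```python
# Minimal sketch (not part of the TDP language itself): representing
# "example pattern 3" from the guide-dog scenario as plain data objects,
# so that a design team could inspect or validate pattern structure in code.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class MoralCapability(Enum):
    NONE = "none"        # no heart in the pictorial language
    PARTIAL = "partial"  # small heart
    FULL = "full"        # big heart


@dataclass
class Actor:
    name: str
    is_human: bool
    moral_capability: MoralCapability


@dataclass
class Task:
    description: str
    assigned_to: str      # actor name
    moral: bool = False   # red (moral) vs. blue (non-moral) task
    direct: bool = True   # opaque (direct) vs. transparent (indirect) task


@dataclass
class Frame:
    name: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Transition:
    source: str            # frame name
    target: str            # frame name
    initiators: List[str]  # actors allowed to trigger the transition


@dataclass
class TeamDesignPattern:
    name: str
    actors: List[Actor]
    frames: List[Frame]
    transitions: List[Transition]


# Example pattern 3: both Jason and the robotic dog scan for abnormalities
# and either of them may initiate the transition to the deliberation frame.
jason = Actor("Jason", is_human=True, moral_capability=MoralCapability.FULL)
dog = Actor("robotic dog", is_human=False, moral_capability=MoralCapability.PARTIAL)

walking = Frame("walking", [
    Task("follow the robotic dog", "Jason"),
    Task("scan for abnormalities", "Jason", moral=True, direct=False),
    Task("guide Jason along the route", "robotic dog"),
    Task("scan for abnormalities", "robotic dog", moral=True, direct=False),
])

deliberating = Frame("deliberating", [
    Task("decide whether to change the route", "Jason", moral=True),
    Task("provide decision support", "robotic dog", direct=False),
])

pattern_3 = TeamDesignPattern(
    name="guide dog with shared scanning and decision support",
    actors=[jason, dog],
    frames=[walking, deliberating],
    transitions=[Transition("walking", "deliberating", ["Jason", "robotic dog"])],
)
```

Such a representation adds nothing to the pictorial language itself, but it illustrates how the same concepts (actors, moral and non-moral tasks, frames, and actor-initiated transitions) could be checked or exchanged in a multidisciplinary design team.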
Socio-Cognitive Engineering (SCE, Neerincx et al., 2019) is a methodology that meets the first two requirements. It was developed specifically for the design of hybrid intelligent systems, combining elements from cognitive engineering, user-centered design, and requirement analysis. In the past decade it has been implemented in a wide range of systems in various domains, including digital assistants supporting children with diabetes (Looije et al., 2016), social robots for the elderly (Peeters et al., 2016), and a support system for disaster response (Mioch et al., 2012). An overview of SCE is illustrated in Figure 2. The methodology is always applied to satisfy a specific demand in a particular context. The foundation layer consists of three components. Firstly, it includes an analysis of operational demands, which revolves around inspecting the work domain of the hybrid intelligent system and the support that is needed. Additionally, it includes an analysis of human factors relevant to the system, as well as an analysis of the technological principles that may be appropriate for the envisioned support. In the specification component of SCE, a number of objectives of the envisioned system are defined. This leads to the recognition of functions of the system, which are contextualized by scenario-like descriptions of the envisioned human-machine interactions. Finally, the system's functions are supposed to bring about certain effects, which are called claims. Lastly, the evaluation component aims to use a prototype or simulation to test whether the specified functions really have the claimed effects in order to fulfill the objectives. The results of the evaluation can then be utilized to revise and enhance the foundation and specification components, incrementally advancing the product. Neerincx et al. (2019) emphasize that SCE can be used in a cyclic fashion, incrementally advancing the foundation, specification, and prototype of the system. However, there are no strict rules regarding the order of revisions, as the designers can utilize the concepts in the framework to advance their design in whichever order they see fit (Ibid.).

As mentioned before, SCE was not specifically developed for the design of Team Design Patterns, but it is a promising methodology for doing so. To satisfy the third requirement mentioned above, the moral values of the stakeholders of the patterns must be incorporated in the methodology. Value-Sensitive Design (VSD) is the most elaborate methodological framework for dealing with values in the design process (Harbers & Neerincx, 2017). It defines values as 'what a person or group of people considers important in life' (Friedman et al., 2013). Stakeholders can be direct, meaning that they directly interact with the envisioned system, or indirect, meaning that they do not interact with the system but may still be affected by it. When one value comes at the expense of another, this often results in so-called value tensions. Harbers & Neerincx (2017) proposed integrating VSD methods into the SCE methodology in each of the three components. Most importantly, this would result in a careful analysis of the direct and indirect stakeholders of the product in the foundation layer. When sufficient attention is paid to the values of the possible stakeholders, they argue, this will naturally affect the specification phase. This can happen implicitly, when they are used in the process of defining the scenarios, objectives, functions, and claims. Additionally, values can be supported in the specification phase explicitly through value stories, when there is a separate objective to support a certain value. Finally, in the evaluation phase it is necessary to test the effects of the prototype on the values of the stakeholders.
Harbers & Neerincx (2017) applied this integrated approach in the design of a virtual assistant; the design process ultimately led to a prototype virtual assistant that was positively received by the stakeholders and domain experts. While the SCE and VSD methodologies are not specifically geared towards the design of Team Design Patterns, this chapter has thus far shown their suitability for doing so. I propose treating the Team Design Patterns as a special and abstract prototype that can be created using the foundation, specification, and evaluation concepts as described above. However, TDPs are special in the sense that their stakeholders are not just the people that will be working with and affected by the envisioned system in its application domain (e.g. the operator, team leader, and passengers in Harbers & Neerincx' study), but also the designers of the system. As these researchers and developers are the primary prospective users of these patterns, it is key to include their wishes and demands in the three phases of the methodology as well.

The research of the current thesis is structured according to the three phases of the SCE methodological framework. As mentioned in the previous section, this methodology is always used to solve a problem in a particular context. For this thesis, this requires a use case of a system in development in which moral decisions are made. The use case that this thesis focuses on is a hybrid intelligent system aimed at countering Diabetes Type II (DT2) in the Netherlands, being developed by the Netherlands Organization for Applied Scientific Research (TNO). This system will take the shape of a digital assistant for both health care professionals and patients in the prevention, diagnosis, treatment, and management of the disease. The system is hybrid because it requires a close cooperation between human knowledge and social skills on the one hand, and machine-based learning on the other. The DT2 use case is embedded in a four-year research program called FATE, aimed at researching novel techniques for fair, transparent, explainable (co-)learning decision-making in human-machine teams. The research presented in this thesis roughly coincided with the first half year of this project, meaning that the research project was in its very early stages. Hence, the envisioned DT2 system was treated as a use case and not (yet) as a system ready for deployment.

The current research is largely exploratory in nature, as it aims to acquire knowledge about ways to conceptualize moral decision-making in medical hybrid intelligent systems in such a way that it may help researchers in communicating about such systems and make them more ethical by design. Both the research questions and the proposed methodology to answer them are dissimilar to any research in this field thus far. Therefore, the aim of the research design is not to acquire definitive answers, but rather to enhance the pattern language and sketch a framework for the methodology for creating patterns in the domain of moral decision-making. This can serve as a starting point for prospective cycles of the SCE methodology and future research in general. The foundation, specification, and evaluation components are used as guidance for the creation of a first HI pattern language for AI-based decision support systems, particularly geared towards decisions with moral aspects in lifestyle-related health management.

The previous chapter described the SCE and VSD methodological frameworks and explained the motivation for adopting them for the research in this thesis.
It also presented the general research design, as well as the DT2 use case. The current chapter describes the foundation layer of the SCE methodology (see Figure 2).

The operational demands analysis of the SCE methodology revolves around the question: what kind of support is needed in the application domain? (Neerincx et al., 2019) In its early design stages, the aim of the envisioned DT2 system is to provide support to both health care professionals and patients in the prevention, diagnosis, treatment, and management of the disease, leading to a large number of possible applications. However, the general structure of the FATE system is shown in Figure 3. AI developers and healthcare professionals create models based on existing patient data and domain knowledge. These models may, for example, be geared towards predicting a patient's risk of developing diabetes (aiding prevention) or suggesting a diagnosis. Alternatively, they may predict the best type and dose of medicine (aiding treatment) or predict the best type of lifestyle change to make the disease as unintrusive as possible (aiding management). Through several modules, these predictions or suggestions are presented to the relevant patient or healthcare professional. Finally, the models are improved and updated by patients' medical and behavioral data. Within this structure, three design challenges with a moral component were identified (see Table 3). The current thesis focuses on the first two challenges (bias mitigation and data sharing), because of time constraints and because they are the most tangible (the third problem only arises in later stages of the DT2 system's development).

The first moral problem is that learning models may develop biases that may result in unfair treatment by the system. The underrepresentation of certain ethnicities in the input data, for example, can result in a racial bias in the system. People of those ethnic backgrounds may then receive worse care than others, which can be unfair and unjust. The system's developers may employ techniques to mitigate the harmful bias (see section 3.3.2), but this usually results in a lower average accuracy of the system. The direct stakeholders in this moral issue are the AI developer and the patients. The most prominent value-tension is between the system's (average) effectiveness and the fair treatment of each patient.

The second moral problem that may arise in the prospected DT2 system is the sharing of patients' health and behavioral data. This is useful for the healthcare professionals because it gives them an increased ability to track the development of the disease in their patients, allowing more personalized care and quicker responses to urgent situations. The tracking and sharing of patient data also results in improvements in the models of the system, as these may be self-learning when they receive the true values of their predictions. Additionally, they can detect patterns in large amounts of health and behavioral data, resulting in more personalized care. However, the designers of the system have a moral and legal obligation to ask for consent to share the patient's data with their doctor or to use it to enhance the system's models. The data is the patients' property and their autonomy may be compromised when it is shared without their consent. Yet, although it is key to respect the patient's autonomy regarding their own data, some patients may find it confronting, distressing, or simply too complex to research how and with whom their data will be shared.
The direct stakeholders in this moral issue are the system's designers, the healthcare professionals, and the patients. The most predominant value-tension is between beneficence (resulting from more personalized care from the healthcare professional and the system) and the patient's privacy.

Table 3: Overview of possible value-tensions in the prospected DT2 system.

Due to time constraints, the current thesis focuses on the first two processes. The third moral problem arises in the treatment and management applications of the envisioned system, where the learning models are expected to give suggestions for altering the patient's lifestyle (e.g. more exercise, less smoking, a different diet), possibly based on the individual's unique profile. Additionally, the possibility to track the progress of the disease based on the patient's (bio)medical and behavioral data may result in better and more personal suggestions from health care professionals. These suggestions would likely be visible on a mobile device, in the shape of an eHealth application. This functionality has unprecedented potential for providing personalized care but has a risk of being experienced as invasive. The direct stakeholders in this moral issue are the system's designers, the healthcare professionals, and the patients. The main value-tension in this moral problem is between beneficence (due to more personalized care) and the patient's autonomy.

In very simple terms, morality is a feature of an action or a person that can be considered to be on a scale from good to bad. Moral decision-making has received substantial attention for as long as philosophers have existed, as questions about what is "good" and how one can achieve it have been at the center of many a debate. Throughout history, research into morality can be divided into two very high-level categories: prescriptive and descriptive. The former category, also called ethics, has received most attention, and focuses on answering questions such as what is good and how one should act to achieve it. The latter category is concerned with studying how people actually make decisions that fall in the moral domain and has only developed with the rise of academic disciplines such as psychology and econometrics. It is often difficult to keep these categories entirely separated, as the descriptive study of decisions that fall in the moral domain inherently necessitates the delineation of what counts as good or bad, which is a normative presupposition. Nevertheless, the normative-descriptive distinction is helpful to structure the description of academic theories in this domain. Hence, the following subsection will touch on the wide debate regarding normative morality, or ethics. The second subsection will then give an overview of descriptive accounts and models regarding moral decision-making. The latter subsection is most directly relevant for the current research, as this study attempts to describe and conceptualize actual moral decision-making in human-computer teams. However, the current thesis is not merely descriptive in nature but also attempts to design optimal structures for moral decision-making, which can facilitate and promote ethical behavior of the human-computer team in question (Hancock, 2003). Hence, the definition of what is successful moral decision-making is an inherently ethical question. In broad terms there are three major approaches in normative ethics: (1) virtue ethics, (2) consequentialism, and (3) deontology.
Virtue ethics, originating in the work of Aristotle, revolves around the development of a moral character. It treats virtues and vices as the foundational components of morality, meaning that achieving 'good' comes down to developing virtuous character traits (Hursthouse, 1999). As the name suggests, consequentialism treats the consequences of actions as the foundational components of ethical behavior. The best way to do 'good', according to this theory, is to try to predict the consequences of one's optional actions and pick those that maximize well-being (Sinnott-Armstrong, 2003). Finally, deontology (or duty ethics) emphasizes moral rules as the basic component of morality. According to this theory, initiated by Immanuel Kant, the most ethical way to behave is to act in accordance with a set of universal rules that translate into rights and duties. An important manifestation of this approach is visible in the United Nations' Universal Declaration of Human Rights (UN General Assembly, 1948). In bioethics, an important and influential subset of duty ethics is Beauchamp and Childress' widely accepted four principles of bioethics (2001). The set of principles is made up of respect for autonomy, beneficence, nonmaleficence, and justice, which can largely be traced back to the ancient Greek Hippocratic tradition.

While the philosophical debate about how one should act is still very much alive, the rise of the field of psychology has paved the way for a different scientific angle towards morality: the descriptive account of how people actually make moral decisions (regardless of whether they pick the ethically "right" choice). One of the most influential theories in moral psychology was proposed by Lawrence Kohlberg (1958). He described the development of moral reasoning in humans as a progression through multiple stages, inspired by Piaget's stage theory of cognitive development (1932). Based on empirical research on children, he concluded that humans necessarily progress from 'pre-conventional' stages of reasoning (based on egoism, obedience and punishment) to 'conventional' stages of reasoning (based on social norms or law and order), and ultimately to 'post-conventional' stages of reasoning (based on social contracts and universal ethical principles). As in Piaget's work, stages cannot be skipped, because they each provide a new perspective in the development of reasoning that is necessary for the following stage (Dovidio et al., 2017).

While Kohlberg used a body of evidence to support his step-wise developmental model of moral reasoning, the field of behavioral psychology soon found that moral reasoning as specified in Kohlberg's theory does not account for people's actual moral behaviors (Ibid.). In fact, there is a general academic consensus that there is a considerable gap between people's moral ideas and putting those into practice. This was famously revealed by Darley & Batson (1973), who showed that people who were in a rush to give a lecture on the topic of the 'Good Samaritan' were unlikely to offer help to a person in need. Several lines of research have attempted to explain this gap between moral reasoning and moral behavior. Sternberg (2012), for example, developed a chronological psychological model of eight steps that need to be taken for successful moral decision-making, as shown below. If the subject fails to take any of the eight steps, they will not engage in ethical behavior (regardless of their conscious moral principles).

1. recognize that there is an event to which to react;
2. define the event as having an ethical dimension;
3. decide that the ethical dimension is of sufficient significance to merit an ethics-guided response;
4. take responsibility for generating an ethical solution to the problem;
5. figure out what abstract ethical rule(s) might apply to the problem;
6. decide how these abstract ethical rules actually apply to the problem so as to suggest a concrete solution;
7. prepare for possible repercussions of having acted in what one considers an ethical manner;
8. act.
(Sternberg, 2012)

In Sternberg's view, his model is not only an accurate depiction of successful moral decision-making but can also be used for the "moral education" of children and thereby close the gap between theory and practice. His perspective is strongly embedded in western rationalism, as he explicitly assumes that ethical reasoning and ethical behavior can be largely rational (Ibid.). This is a controversial notion, which I will soon return to. However, Sternberg points out two important requirements for moral reasoning that are relevant for any model. Firstly, he emphasizes that the person in question needs to 'define the situation as having an ethical dimension'. For any choice in any situation, the interpretation of that choice as an ethical one has a profound impact on how one reacts to it. This component provides one answer to the perceived gap between people's ethical ideals and their practice: they often simply do not recognize the situation in which their (or any) moral values apply. If a problem is not perceived or defined as a moral problem, someone will most likely also view their reaction and its consequences as amoral. Secondly, in the fourth step, Sternberg stresses that taking 'responsibility for generating an ethical solution to the problem' is an essential requirement for engaging in ethical behavior. Many would attest that a large factor in many of the world's biggest problems is that many people recognize them, but few feel responsible for solving them. Especially in organizational structures where it is not immediately clear who bears responsibility for the ethical outcome of certain problems, it can thus be advantageous to make explicit agreements in order to ensure that someone is accountable for reacting to a moral decision.

However, these models trying to explain moral reasoning have received widespread criticism for being too rationalistic. Dennis Krebs and colleagues (1997) noted that until then, moral decision-making had almost exclusively been studied using abstract moral dilemmas, while real-life moral decision-making takes many other shapes that are much less philosophical and more social in nature. Around the turn of the millennium, Jonathan Haidt (2001) posed the hypothesis that not moral reasoning, but moral emotions and intuitions are the origin of moral judgement and decision-making. In his view, people perform moral reasoning only as a 'post-hoc rationalization' to justify moral judgements that have already been formed by immediate moral intuitions (Ibid.). This 'social intuitionist' model is supported by compelling evidence and soon gained much attention, not least from evolutionary psychologists. The idea that emotions play a large role in moral judgement is sensible from an evolutionary perspective, as most theorists agree that morality has developed in humans (and other species) as a mechanism to ensure cooperative and pro-social behaviors in groups (Tomasello & Vaish, 2013).
A group or society in which individuals hurt each other and play by their own rules is probably less successful and has a lower chance of survival than a group that shows cooperative behavior towards one another. In the words of Dennis Krebs and colleagues, the original function of moral judgement was 'to induce those with whom one formed cooperative relations to uphold the cooperative systems in order to maximise the benefits to all' (1997). Subsequently, groups with 'moral' behavior had a higher chance of survival, leading to their genes (and ideas) flourishing, which in turn led to both a natural and social ingraining of moral responses to certain situations. As the spread of these attitudes through genetics or social interaction does not necessarily depend on language, they can be perceived as 'intuitions' (which may explain why they are sometimes so hard to put into words). This evolutionary function fits the development of social emotions such as shame, anger, envy, and guilt. Yet the intuitionist perspective on morality has its own downsides, as it alludes to the idea that morality is ruled by emotions and cannot be influenced by conscious thought. This raises the question of whether we have any control over our arguably most important decisions, while at the same time discounting an entire body of rationalistic literature.

The rationalist and intuitionist views on moral decision-making were united in dual-process theory. This theory encompasses the idea that the functioning of the brain can be divided into two 'systems' or 'pathways'. One of the pathways is 'fast', evolved early in the development of humans and animals, and is often associated with quick, automatic, and emotional responses. The other pathway is 'slow', ends in the prefrontal cortex, and is usually associated with conscious and controlled thought (Sloman, 2002). The notion of these two brain mechanisms explaining human behavior has been deeply influential for the past two decades, affecting virtually every subdiscipline in psychology, neuroscience, and beyond. In a persuasive series of articles (2004, 2007, 2008, 2009), Joshua Greene and colleagues applied dual-process theory to moral reasoning in an attempt to synthesize the rationalist and intuitionist models described above. The basis of their proposed theory can be found in neuroscientific fMRI research from 2001 (Greene et al., 2001). In this study, participants faced traditional 'trolley problem' dilemmas (Thomson, 1984). The brain imaging indicated that deontological moral judgement is associated with neural activity in the fast, emotional system of dual-process theory, while utilitarian moral judgements go together with activity in the slower, conscious system. While the contraposition of two competing driving forces for human behavior (one rational, one emotional) is an old and familiar picture, aligning this distinction with the debate between normative moral theories and mapping these onto brain processes was a ground-breaking enterprise.

A theory as influential as Greene's inevitably became the subject of sobering criticism. In response to dual-process theory in general, many authors have stressed that the high-level distinction between two systems does not do justice to the complex and interactive nature of the brain (e.g. Osman, 2004; Pennycook et al., 2018).
While it may seem attractive to classify all human thoughts and behaviors as resulting from a one-dimensional distinction in brain processes, this reduces the perceived complexity of the processes, eventually making it harder to understand them in their entirety. Additionally, there is always a risk of identifying the competition between the two systems as the cause of reasoning and behavior, while the evidence is merely of a correlational nature. Finally, Greene's experiments have been subjected to a long list of methodological objections (Berker, 2009). For example, scholars have noted that the moral dilemmas used in the research can better be described as 'personal versus impersonal' than 'typical retributivist versus typical consequentialist' (Ibid.). This may explain the neuroscientific and behavioral differences found in their study better than a distinction between two philosophical ideas. This supports the wider criticism that it is unlikely that a twofold distinction in brain processes maps perfectly onto an ancient and abstract philosophical debate. Regardless of these criticisms, it is important to take Greene's dual-process theory seriously when researching human moral reasoning and behavior. People's moral decision-making can (at least partly) be explained as an interplay between (sometimes competing) mechanisms that also determine behavior in non-moral domains (Bucciarelli et al., 2008). Sloman's dual-process theory (2002) may serve as a first tool for structuring further research.

The study of people's reactions to moral choices has not solely received attention in psychology and cognitive neuroscience. Recently, economists and econometricians have taken up the topic as well. Caspar Chorus (2015), for example, applied the economic perspective of discrete choice analysis in an attempt to synthesize many of the models mentioned above. Figure 4 shows the resulting model. Where Greene's model was criticized for lumping all human morality into two big categories, Chorus shows that moral behavior is a subtle interplay between several long- and short-term processes. This includes feedback loops of behaviors, expectations, and changing personal and societal norms due to the behavior of the individual and others. He attests that his model is a first step in deconstructing the inner nuances of moral reasoning and behavior, and that it can be used as a framework for further research.

Finally, research into moral decision-making at the team level is highly relevant to the conceptualization of moral decision-making in human-machine teams. Unfortunately, only a few scholars have investigated this matter. Van Soeren & Miles (2003) examined the role of teams in moral distress in end-of-life decision-making in critical care through a case study. They found that tensions between the stakeholders (including the family, intensive care professionals, and transplant team members) often arose from a lack of communication and feelings of being unheard. They suggest shaping a process of regular interdisciplinary team reviews for all stakeholders to give input from their perspective (Ibid.). According to the authors, this would allow everyone involved to take a step back from dealing with consecutive crises, and instead look at the overall continuity of the patient's condition. They add that this would benefit a mutual development of trust among team members, resulting in fewer moral tensions in the decision-making process. In a similar vein, Gunia et al.
(2012) researched the role of contemplation, conversation, and explanation in moral decision-making. In terms of conversation style, they differentiated between conversation partners that promoted more self-interested behavior (self-interested conversation) and conversation partners that stimulated more community-minded behavior (moral conversation). They found that contemplation and moral conversation promote more selfless behavior, while immediate choice and self-interested conversation cause more selfish behavior (Ibid.). Additionally, they found that people can usually provide explanations consistent with their decisions just before and just after those decisions. While it may be unsurprising that these two studies on team processes in moral decision-making emphasize the role of communication processes, it is important to explicate and specify these factors in an attempt to get a clearer overview of how team processes precisely affect decisions in the moral domain.

At the intersection of human factors and technology lies the categorization of different types of cognitive aids. McLaughlin et al. (2019) recently made a useful contribution to this field by linking advancements in aid development to psychological theories of cognition. The authors affirm that traditional taxonomies of cognitive aids focus on their surface characteristics, while at the most fundamental level they can be categorized by which cognitive process they support. McLaughlin and colleagues identify five main cognitive processes: attention, memory, perception, decision-making, and knowledge (Ibid.). Each of these classes consists of several more specific processes. For example, attention aids can support humans in selective, orienting, sustained, or divided attention. The entire taxonomy is shown in Figure 5.

As mentioned in Chapter 2, the envisioned system to support the prevention, diagnosis, treatment, and management of DT2 is still in its very early stages at the time of writing this thesis. The necessary technologies in terms of hardware and software are still largely unclear. The general vision of the system is that it will be able to provide personalized care on a range of (mobile) devices, using modules that combine different types of artificial intelligence. Subsection 3.3.1 gives a short overview of the high-level distinction between machine learning and knowledge reasoning systems. The subsequent two sections give an overview of possible solutions the academic literature prescribes for the bias mitigation and data sharing design problems.

The envisioned DT2 support system is expected to make use of artificial intelligence to provide personalized care. In the literature on algorithmic fairness, a common requirement is group fairness: the condition that different (demographic) groups receive positive predictions at comparable rates, also known as statistical parity (Ibid.). Individual fairness, in contrast, entails that people with similar traits with respect to a certain task be treated similarly. While these are intuitive conditions, it has proven difficult to find a universal measure to assess whether an algorithm meets them. Till Speicher and colleagues (2018) proposed the use of existing inequality indices from the field of economics to assess the degree of fairness of algorithms, addressing both group and individual fairness. In a comprehensive review of the currently available methods for reducing unfairness in machine learning algorithms, Friedler et al. (2019) identify three types of methods based on how they affect the algorithm: pre-processing, in-processing, and post-processing methods.
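Before turning to these mitigation methods, the following minimal sketch illustrates how the group-fairness notion discussed above could be quantified for a binary classifier. It assumes binary predictions and a single sensitive attribute; the measures shown (per-group positive prediction rate, per-group accuracy, and their gaps) are simple illustrations and not the inequality indices proposed by Speicher et al. (2018).

```python
# Minimal sketch: quantifying group (un)fairness for a binary classifier.
# Assumes binary labels/predictions and one sensitive attribute (e.g. ethnicity).
from collections import defaultdict
from typing import Dict, List, Tuple


def per_group_rates(y_true: List[int], y_pred: List[int],
                    groups: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {group: (positive prediction rate, accuracy)}."""
    stats = defaultdict(lambda: [0, 0, 0])  # positives, correct, total
    for t, p, g in zip(y_true, y_pred, groups):
        stats[g][0] += p
        stats[g][1] += int(t == p)
        stats[g][2] += 1
    return {g: (pos / n, correct / n) for g, (pos, correct, n) in stats.items()}


def statistical_parity_difference(rates: Dict[str, Tuple[float, float]]) -> float:
    """Largest gap in positive prediction rates between any two groups
    (0 means perfect statistical parity)."""
    positive_rates = [pos for pos, _ in rates.values()]
    return max(positive_rates) - min(positive_rates)


if __name__ == "__main__":
    # Toy data for two hypothetical patient groups.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

    rates = per_group_rates(y_true, y_pred, groups)
    print(rates)                                 # per-group (positive rate, accuracy)
    print(statistical_parity_difference(rates))  # 0.5 - 0.0 = 0.5
```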
Firstly, pre-processing methods start from the observation that training data is a frequent cause of unfairness, as it can capture historical discrimination or under-represent minority groups. Calmon and colleagues (2017) provided a probabilistic framework for discrimination-preventing pre-processing. They defined an optimization problem for dataset transformations that trade off group fairness, individual fairness, and overall accuracy (defined as data utility), using regular classifiers. In-processing methods consist of modifications to existing algorithms in order to reduce unfair predictions. Kamishima and colleagues (2012), for example, attempted to create a 'prejudice remover regularizer' that enforces a classifier's independence from sensitive information. The method was applied to a logistic regression classifier but can be adapted to different types of algorithms. A third fairness-enhancing approach is to modify the results of a classifier through post-processing methods. For example, Kamiran et al. (2012) created a method to modify decision tree leaf labels after training. Pleiss et al. (2017), too, address simple post-processing methods to satisfy fairness constraints. However, they critically note that these methods often hinge upon withholding predictive information for randomly chosen inputs (Ibid.). Many would agree that this is unsatisfactory in sensitive settings such as healthcare, as it implies that individuals for whom a correct prediction or diagnosis is available purposefully receive a false prediction for the sake of achieving equal accuracy for all subgroups. As is clear from this brief overview, methods for addressing bias and unfairness in learning algorithms are still in rapid development and are likely to change substantially over the coming years. Moreover, it is evident that the suitability of fairness-enhancing measures and methods is dependent on the input data, learning algorithm, and development aim at hand. An HI bias mitigation framework should allow for rapid developments in the field and for the possibility to test and apply methods on a case-by-case basis.

The sharing of medical data is a widespread phenomenon, especially due to the increasing popularity of eHealth applications (Blenner et al., 2016). In recent years, several legal requirements have been put in place to safeguard user privacy in digital environments, with a strong emphasis on consent (e.g. the General Data Protection Regulation (GDPR), 2018). In a comprehensive study of eHealth software for diabetes care, Blenner et al. (2016) found that although many of these applications have privacy policies that formally fulfill the legal requirements, they can be misleading for their users. Patients may, for example, have the mistaken belief that their health data is not shared with third parties, even though this is generally the case. While the scientific realm has formulated ethical guidelines for the data sharing and consent-giving process in smart health (e.g. Jones & Moffitt, 2016; O'Connor et al., 2017), this has not yet been converted into a concrete set of best practices for the design of eHealth systems. However, several scholars have provided typologies of consent in (digital) medical settings, which can provide a first framework for the technological underpinnings of its design. Firstly, the most elementary form of medical consent has been termed simple consent, relevant for minor medical procedures that pose a low health risk (Whitney et al., 2003).
The simple consent procedure consists of a short explanation of what the intervention entails, followed by an explicit or implicit agreement by the patient. In contrast, recent years have witnessed considerable attention for the concept of informed consent: the requirement to ensure that the patient truly understands what they are consenting to (Grady, 2015). According to Whitney and colleagues, the process of achieving informed consent should consist of a 'discussion of nature, purpose, risks and benefits of proposed intervention, any alternatives, and no treatment, followed by explicit patient agreement or refusal' (2003). However, Christine Grady (2015) observed that the informed consent process usually differs substantially in detail and formality depending on whether it is intended for clinical interventions or for research purposes. She also noted that 'with more recently embraced learning paradigms, these goals are converging, or at least the boundaries are shifting' (Ibid.), calling for a more comprehensive and universal set of requirements for the clinical and research consent-giving process.

Sarah Moore and colleagues (2017) recognized the increasing role of digital systems in obtaining consent, making a distinction between in-person and remote consent. While in-person consent has an inherent social mechanism for establishing mutual understanding and obtaining informed consent, remote consent is less personal, making it more difficult to ensure that the subject has fully understood the terms. In a similar vein, Rowbotham and colleagues (2013) noticed that the dense texts in many remote consent forms resulted in low user attention. They conducted an experiment in which one group of subjects was given a 'standard' consent form, while another group was presented with an introductory video, standard consent language, and an interactive quiz with special attention to data privacy, aggregation, and sharing. The second group had a significantly and substantially better understanding of the research risks and procedures than the control group. This interactive informed consent is especially relevant for remote variants of obtaining consent.

The recognition of individual preferences regarding data sharing and the increasing complexity of digital systems has led to the identification of two additional types of consent. Firstly, Rake and colleagues (2017) advocated a personalized consent flow that would allow subjects to control the health data collected by their mobile and wearable devices, in order to regulate to what extent these are shared for research purposes. This type of consent necessitates a differentiated set of consent rules that can differ per subject, as well as a mechanism through which the patient can oversee and change these whenever they want (for the latest developments in this domain, see Rau et al., 2020). Finally, Bunnik and colleagues (2013), too, recognized a need for personalized consent rules, but they also noticed that some patients desire a larger choice space (and larger amounts of information) than others. Hence, in a consent system for deciding which hereditary diseases to test for in a personal genome test, they developed a tiered-layered-staged model through which patients receive differing amounts of choices and support based on their preferences (Ibid.).

The previous chapter explained the foundation for the creation of Team Design Patterns for moral decision-making in hybrid intelligent systems in the medical domain.
This included an operational demands analysis, resulting in three areas in the DT2 system where moral decision-making plays an important role, of which bias mitigation and data sharing were chosen for investigation in this thesis. It also gave an overview of the academic knowledge regarding moral decision-making and of current solutions for these two design problems. The current chapter presents the specification phase of the SCE method, in which all this foundational knowledge is brought together into Team Design Patterns describing possible solutions for the design problems. The operational demands and technology analyses were used to create the patterns' team process configurations. Additionally, the (dis)advantages of each pattern address the value tensions identified in the analysis of operational demands. The taxonomy of cognitive aids and the human factors literature were used for the patterns' human requirements, while Van Harmelen & Ten Teije's hybrid AI boxology (2019) was used to identify AI requirements. The patterns' pictures were created using the Google Draw tool. Section 4.1 addresses the objectives, functions, and claims of the patterns. Subsequently, subsection 4.2.1 presents the patterns for the bias mitigation solution. In the creation of Pattern 2 for this design problem it became evident that another layer of more specific patterns could be created to elucidate one abstract task. These more specific patterns are addressed in subsection 4.2.2. Then, subsection 4.2.3 presents the design patterns for the data-sharing design problem.

Objectives

The envisioned Team Design Patterns have three main objectives. Firstly, they are aimed at facilitating communication among researchers and designers from the numerous disciplines that are involved in the creation of hybrid intelligent systems. This may be more complicated than it seems, as it means that the patterns have to cover ethical concepts in a way that is comprehensible for AI engineers and address technical approaches in software engineering in such a way that human factors experts understand them. Above all, these wide ranges of knowledge must be incorporated without becoming too abstract to provide meaningful conceptualizations of the design problems and solutions. A second objective of the TDPs is that they should not provide solutions to just one specific design challenge in a single system. Rather, they must be applicable to a set of design problems in various hybrid intelligent systems, preferably in several application domains. This way, the value of the TDPs lies in the possibility to reuse and improve them, drawing lessons from their implementations so far. Lastly, the use and application of the TDPs should have a positive effect on the envisioned systems. In the case of these patterns aimed at morality in hybrid intelligence, this objective means that the patterns should lead to more thoughtful and explicit moral decision-making than would occur without their use.

The created TDPs have a number of functions that aim to contribute to fulfilling the abovementioned objectives through claims. The most important function of the patterns is their composition, meaning the combination of a textual introduction, a pictorial and textual stepwise representation of the proposed solution, and a table indicating key features of the pattern. Each of these components can be subdivided into smaller functions. For example, each concept of the TDP pictorial language illustrated in section 2.1.1 serves as a requirement for conveying key information of the patterns.
Additionally, the tables include the human and AI requirements of each pattern, as well as possible advantages and disadvantages of its implementation. Each of these features can again be understood as a function of its own, as they structure the patterns' use and functioning. The claims of the TDPs correspond to the objectives stated earlier. To facilitate communication between experts from various backgrounds, the claim of the patterns is that they are understandable for designers and researchers across various relevant disciplines. Specifically, the prospective users should understand the solution presented by each pattern (i.e. its content), as well as how the pattern works (i.e. how the functional components relate to and complement one another). Hence, this first claim can be subdivided into two claims: understandability and coherency. These two features may not fully constitute the objective of facilitating communication between disciplines, but they are undoubtedly a fundamental requirement for it. In order to be applicable to more than one situation, the second claim is that the patterns are generalizable. The functions should all contribute to a level of abstraction at which the patterns describe solutions that can be reused. Finally, the third claim of the patterns is that they are effective, meaning that they lead to better or more appropriate moral decision-making when they are used in the design of a system. There may be a certain degree of subjectivity in the effectiveness claim, as there is no clear universal notion of the 'appropriateness' of moral decision-making, which is a normative matter. Even though the effectiveness claim may be partly subjective, it remains important to the assessment of TDPs, as the improvement and explication of moral decision-making in hybrid intelligent systems is one of their core objectives.

In the envisioned DT2 system, machine learning models will be used to make predictions for patients, for example to give them a diagnosis of pre-diabetes. For Diabetes Type II, it is known that the disease runs a different course in people with a Surinamese or Hindu background than in individuals with European heritage. This means that these groups also differ in which factors are most important for diagnosis (e.g. blood glucose, weight...). Simply removing 'sensitive' data types, such as ethnicity, gender, sexual orientation, or socioeconomic status, may do more harm than good, as they are needed for accurate predictions and help to give the patient the customized care they need. However, including these sensitive characteristics while minorities are underrepresented can result in social discrimination: it may bring about systematic differences in the accuracy of the predictions for the minority groups and a mismatch between the resulting care and patients' personal circumstances. As the models are self-learning, the risk of social discrimination remains after deployment, necessitating a mechanism that mitigates harmful biases.

In this pattern, the machine agent solely performs a classic machine learning task: predicting the diagnosis (e.g. diabetes) of patients. The human AI developer supervises this process, measuring both the overall accuracy and the fairness of the predictions. If the human agent judges that the balance between these two measures is off (e.g. because people with a Surinamese background receive significantly less accurate diabetes diagnoses), the human AI developer initiates a takeover.
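As a minimal, hypothetical sketch of the kind of monitoring this pattern assumes, the following Python fragment compares per-subgroup accuracy against overall accuracy and flags when a takeover (or, in the next pattern, a handover) might be warranted. The threshold, group labels, and function names are illustrative assumptions and not part of the DT2 design.

```python
from collections import defaultdict

def subgroup_accuracies(y_true, y_pred, groups):
    """Compute prediction accuracy per subgroup (e.g. ethnic background)."""
    correct, total = defaultdict(int), defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    return {g: correct[g] / total[g] for g in total}

def fairness_alert(y_true, y_pred, groups, max_gap=0.10):
    """Flag a potential bias if any subgroup's accuracy lags the overall
    accuracy by more than `max_gap` (an illustrative preset condition)."""
    overall = sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)
    per_group = subgroup_accuracies(y_true, y_pred, groups)
    lagging = {g: acc for g, acc in per_group.items() if overall - acc > max_gap}
    return overall, per_group, lagging

# Example with made-up diagnoses for patients from two hypothetical subgroups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
groups = ["A", "A", "A", "B", "B", "B", "B", "B"]
overall, per_group, lagging = fairness_alert(y_true, y_pred, groups)
if lagging:
    print(f"Possible bias for subgroups {list(lagging)}: "
          "initiate a takeover or handover to the AI developer.")
```

In a fully autonomous variant, a check of this kind, together with a pre-programmed choice of mitigation method, could run without human intervention and be audited during the recurrent examination.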
In this takeover, the machine stops its task, while the human AI developer changes the model. Once the developer has finished, the situation returns to normal.

In this pattern, the machine agent has more moral responsibilities than in Pattern 1. The machine agent makes the predictions regarding patients' diagnoses and performs moral supervision on itself: it measures whether its own predictions are equally accurate for each subgroup (e.g. whether diabetes diagnoses for people from a Surinamese background are as accurate as for patients from other ethnic groups). The human AI developer is on stand-by. If the machine agent measures a bias in its own predictions, it initiates a handover to the human agent. In this handover, the machine agent explains to the human why a moral decision is necessary (e.g. because the model is racially biased against people with a Surinamese background). Subsequently, the human and the computer agent make a joint decision in changing the model.

In this pattern, the machine is fully autonomous and is responsible for the entire process of making the diagnoses, assessing the fairness and accuracy, and keeping the balance by changing the model when necessary. To make this possible, the process must be preceded by a value elicitation phase, in which the AI developer accurately programs this process. A recurrent examination (which can happen based on time passed or the number of decisions made) repeats the value elicitation process and monitors possible ethical drift (Van der Waa et al., 2020). This makes sure that the machine works in accordance with the set of goals and values specified by the human.

While addressing the bias mitigation design challenge described above, it became clear that other design challenges could be nested within these patterns. For example, in Pattern 2, the human and machine agent have a joint responsibility to change the model, which presents a design challenge of its own. There may be several configurations of the team members' tasks and responsibilities to solve this sub-challenge, which may be addressed by so-called sub-patterns. The sections below present four such sub-patterns, describing different arrangements of the AI developer and the machine agent changing the learning model for DT2 diagnoses.

In this pattern, the machine agent has a memory support role. The human agent decides whether a model change is desirable (e.g. because the model has a strong racial bias). After this, the machine agent aids in the weighing of methods by showing previously taken measures in similar situations (e.g. over-sampling patients with a Surinamese background). The human agent takes this into consideration and picks a method to change the learning model. The machine then stores this choice in memory in order to provide future memory aid.

In this sub-pattern, the machine has a predictive role. As the AI developer considers the various methods that could resolve the accuracy-fairness imbalance, they can employ the machine agent to predict the effects of applying such a method. The machine simulates the effects on the fairness-accuracy balance based on previous observations and presents these to the AI developer. The human can take these simulations into account when choosing the appropriate method to change the model.

In this pattern, the machine has more moral responsibilities, while the human agent only has a reviewing role. In the first frame, not the human but the machine decides whether a model change is desirable, based on preset conditions (e.g.
if the learning model's diabetes diagnoses are over 10% less accurate for people with a Surinamese background). After this, the machine simulates all possible methods to change the model and their effects on the accuracy-fairness trade-off. The machine agent then suggests the optimal method, which the human agent reviews. The human agent takes this into consideration and finally picks the preferred method.

An important challenge in the design of medical Hybrid Intelligent systems is the sharing of patient data. In the Diabetes Type II case, there are many situations in which access to patient data can improve the accuracy and effectiveness of the learning models. For example, it is desirable that the models in the system can learn whether their predicted diagnosis was right or wrong. Additionally, the sharing of health behaviors (e.g. exercise, diet, weight) has considerable potential for personalizing lifestyle advice. Finally, for optimal medical treatment of the patient it can be very beneficial to notify the patient's doctor in the case of a critical medical situation (e.g. three hypoglycemic episodes, or 'hypos', in one week). However, the designers of the system have a moral and legal obligation to ask for consent to share the patient's data with their doctor or to use it to enhance the system's models. The patient's data is their property, and their autonomy may be compromised when data is shared without their consent. Yet, although it is key to respect the patient's autonomy regarding their own data, some patients may find it confronting, distressing, or simply too complex to investigate how and with whom their data will be shared.

This pattern describes a common way of attaining medical consent from a patient: through an in-person encounter with a healthcare professional (HCP). In this pattern, the HCP signs the patient into the system and explains the terms and conditions of giving consent. The HCP and the patient then perform the joint action of deciding whether to give consent to share the patient's data with the system. The HCP records this, and the machine agent then shares the patient's data accordingly. This type of consent is usually all-or-nothing: there is a standard format for what the patient consents to, as there is little time to go over a large set of differentiated rules (e.g. only sharing certain data with certain individuals). The joint moral decision is largely based on a trust relationship between the HCP and the patient.

This pattern describes the configuration of simple remote consent (Whitney et al., 2003; Moore et al., 2017) between the machine agent and the patient, without the assistance of an HCP. It consists of the machine agent signing in the patient and presenting the terms and conditions of giving all-or-nothing consent: if the patient does not consent to the predetermined set of data sharing rules, they cannot join the system or use the service. This is a very common way of obtaining consent for digital systems outside the medical realm.

This pattern describes a configuration of informed consent (Whitney et al., 2003), which is legally and morally stronger than simple consent. The machine agent indicates what the terms and conditions of giving all-or-nothing consent entail. Additionally, there is a mechanism through which the patient's understanding of the possible effects of giving consent is confirmed, for example through a small quiz, similar to the solution proposed by Rowbotham et al. (2013).
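To make the contrast between the all-or-nothing patterns above and the differentiated and self-learning variants described next more concrete, the following sketch shows one possible way to represent consent rules as a mapping from (data type, recipient) pairs to decisions. The data types, recipients, and class names are illustrative assumptions, not part of the DT2 design.

```python
from dataclasses import dataclass, field

DATA_TYPES = ["personal", "health_behavior", "medication", "sensitive"]
RECIPIENTS = ["doctor", "ai_developer"]

@dataclass
class ConsentPolicy:
    # Maps (data_type, recipient) -> True/False; missing pairs default to deny.
    rules: dict = field(default_factory=dict)

    @classmethod
    def all_or_nothing(cls, granted: bool):
        """Simple/informed consent: one decision covers every data type and recipient."""
        return cls({(d, r): granted for d in DATA_TYPES for r in RECIPIENTS})

    def set_rule(self, data_type: str, recipient: str, granted: bool):
        """Differentiated consent: the patient decides per data type and recipient."""
        self.rules[(data_type, recipient)] = granted

    def may_share(self, data_type: str, recipient: str) -> bool:
        return self.rules.get((data_type, recipient), False)

# A patient consents to sharing health behaviors with their doctor,
# but not with the AI developers.
policy = ConsentPolicy()
policy.set_rule("health_behavior", "doctor", True)
policy.set_rule("health_behavior", "ai_developer", False)
print(policy.may_share("health_behavior", "doctor"))   # True
print(policy.may_share("sensitive", "ai_developer"))   # False (default deny)
```

A self-learning variant could additionally fit a simple model over the rules the patient has provided so far and suggest a value for pairs the patient has not yet decided on, subject to the patient's review.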
In this pattern, the patient can pick differentiated consent rules, based on the type of the shared data and the person to whom the data will be visible. In the first frame, the machine agent presents all the different consent options, explains why this moral decision has to be made, and explains what the consequences of this choice are. The patient reads the terms of these consents. In the next frame, the patient provides their consent rules, based on the type of shared data (e.g. personal, health behaviors, medication intake, sensitive) and the person the data will be visible to (e.g. their doctor, the AI developers, no one). In the final frame, the machine agent shares the patient's data according to the rules, while the patient supervises this process. If the patient is not satisfied with one or more of the sharing rules (e.g. they realize that their weight is shared with their doctor and they do not feel comfortable with that), they can take the initiative to repeat the process and set new consent rules. This pattern is similar to the solution proposed by Rake and colleagues (2017).

This pattern is similar to the previous pattern but resolves the possible disadvantage of the patient being overwhelmed by choices. It starts with establishing the patient's preferred level of autonomy, for example by giving the patient three options: 'Standard consent rules', 'I want to choose but I need help', or 'Let me choose everything'. Depending on this choice, the system gives a varying number of options to choose between, accompanied by varying levels of decision support. This way, the patients themselves can determine how complicated the consent-giving process is. Again, the patient can supervise the data-sharing process and may decide to readjust the consent rules (and, possibly, the preferred level of autonomy) if they are dissatisfied with the way their data is shared. This pattern is similar to the solution proposed by Bunnik et al. (2013).

In this pattern, the machine agent has a self-learning capability, which also gives it more moral responsibilities. In the first frame, the patient chooses their consent rules, while the machine agent gives decision support proportional to how much autonomy the patient desires. When the patient provides a new consent rule, the machine agent uses this to create a model and learn their consent behavior. In the next frame, the machine agent shares the data according to this model. When it detects a new consent type (e.g. a data type or person not yet included in the model), it switches to the third frame. In this frame, the machine agent suggests a consent rule based on its model of the patient's consent behavior. The patient then reviews this suggestion and decides whether or not to give consent to this rule. The machine agent then includes this rule in its model, and the second frame becomes active again.

The previous chapter presented the Team Design Patterns that were created by combining the operational demands, human factors, and technological principles addressed in Chapter 3. The next step in the Socio-Cognitive Engineering method is an evaluation of the designed patterns. This is necessary to validate whether the patterns have their claimed effects, and it serves as input for the foundation and specification layers in the next iteration. The current chapter describes the suggested methods for this process, as there is no current standard for evaluating Team Design Patterns.
The first section describes the questionnaire that was created to evaluate the patterns, as well as the qualitative and quantitative methods of analysis. The second section describes the results of using these analysis methods.

The current literature does not prescribe a standard method for evaluating Team Design Patterns. As follows from the SCE methodology, it is key to validate whether the functions in the patterns establish the desired effects as expressed in the claims. This leads to a method with four metrics: (1) understandability, (2) coherency, (3) effectiveness, and (4) generalizability. Because the patterns and pattern language are still in an early stage of development, we chose to perform a usability test with a sample of the patterns' prospective direct users: researchers and designers of hybrid intelligent systems. As the patterns are designed for the purpose of facilitating communication between different disciplines in the design process of hybrid systems, the ideal shape of the evaluation would be a focus group that simulates this process, after which participants can reflect on their experience and the patterns. However, due to time constraints of the current thesis and mobility constraints due to the COVID-19 pandemic, the research employed a questionnaire to evaluate the patterns. The motivation for using a questionnaire was that it can provide quantitative ratings regarding the metrics and, more importantly, can add qualitative data for a richer picture of what is necessary to improve the patterns and pattern language.

The questionnaire was created using guidelines provided by Brinkman (2009); the statements used for each pattern are shown in Table 4. Participants were asked to what extent they agreed with these statements on a five-point Likert scale ranging from 'Completely disagree' to 'Completely agree'. Each of the questions was followed by the prompt 'Please explain your answer' and a long answer text field. This section finished with an open question regarding the two shown patterns: 'In your view, which tasks/concepts are missing in or should be added to the patterns on this page? And why?' Afterward, participants were asked to what extent they agreed with the statement 'I understand how I'm expected to judge the patterns in this questionnaire' on a five-point Likert scale.

Table 4. Likert scale questions for each pattern:
(1) The proposed design solution of this pattern is easy to understand.
(2) The combination of pictorial and textual information in this pattern provides a coherent representation of the solution.
(3) The implementation of this pattern will lead to appropriate moral decision-making in human-machine diabetes care.
(4) The solution in this pattern can be applied to other human-machine systems than diabetes care.

The next section of the questionnaire showed a video introducing the concept of 'sub-patterns', with the explanation that the joint task to 'Change the model' in Bias Mitigation Pattern 2 can be specified further. It then presented the (sub-)design challenge of changing the model presented in Chapter 4, including its relevance, the actors involved, and the moral tension it entails (which is the same as for the main pattern). It then presented two of the sub-patterns from Chapter 4: The Memorizing Machine (Pattern 2.1) and The Suggesting Machine (Pattern 2.4). Again, the patterns consisted of a textual description combined with the pictorial and graphical representations of the patterns. The succeeding page consisted of the four Likert scale statements shown in Table 4, each followed by a long text field for the participants to explain their answer. Again, the section ended with the question 'In your view, which tasks/concepts are missing in or should be added to the patterns on this page? And why?'.
The final section of the questionnaire presented two of the data-sharing patterns described in Chapter 4: Differentiated Consent (Pattern 4) and Self-learning Consent (Pattern 6), along with the four Likert scale statements followed by open answer fields. These patterns were chosen because both are relatively complex, which was expected to lead to a relatively large variance on the measures. Additionally, Pattern 4 describes an existing solution, while Pattern 6 describes a hypothetical solution that does not yet exist. Differences in responses to these patterns may therefore provide some insight into the usability of Team Design Patterns for describing existing solutions versus their ability to plan prospective solutions to design problems.

After a pilot trial with two researchers, it was evident that filling in the questionnaire took longer than an hour. To minimize the burden on participants' schedules and to decrease the risk that they would be demotivated to start or finish the questionnaire, the final section was made optional. The rationale behind this was that, in this stage of the research, it would be better to gather evaluations of fewer patterns from many researchers from different disciplines than to have a few respondents evaluate more patterns.

As described above, the target users of the patterns are researchers and designers in the field of hybrid intelligence. As this evaluation is meant as a first reflection on the usability and value of the patterns and pattern language from various perspectives, it was imperative to approach researchers and designers from a broad variety of disciplines. Hence, thirty researchers and developers from various research areas at TNO were approached through a network sample. Participants were approached by email with a link to the online questionnaire, after which they had two weeks to respond. Twenty of the thirty approached researchers and designers responded, taking around 45 minutes to fill in the questionnaire. Only five respondents chose to fill in the optional part of the questionnaire. Due to this low number, their quantitative results were omitted from the analysis, while their open answers were included in the qualitative analysis. One of the approached researchers tried filling in the questionnaire but had many objections to the approach taken, resulting in limited understanding of the patterns and difficulties answering the questions. To still include the perspective of this researcher, the questionnaire was substituted with a 1.5-hour unstructured interview by phone, focusing on their vision and objections. The results were then included in the qualitative analysis.

For the quantitative part of the analysis, the Likert scale questions regarding background knowledge and the ratings regarding the metrics (understandability, coherency, effectiveness, and generalizability) of each of the patterns were analyzed. Due to the nonparametric nature of Likert scale data and the relatively small sample size (N=20), the mode and median were used as an indication of the distribution of the responses.
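As a small, hypothetical illustration of this kind of nonparametric analysis (the descriptives above and the correlation and difference tests used below), the following Python sketch computes the mode and median of two sets of Likert ratings and applies Spearman's and Wilcoxon's tests; the ratings themselves are made up for the example and are not the study's data.

```python
import statistics
from scipy import stats

# Hypothetical five-point Likert ratings from the same respondents
# for two patterns (values are illustrative only).
pattern_1 = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
pattern_2 = [3, 4, 3, 3, 4, 4, 3, 5, 3, 4]

# Descriptives appropriate for ordinal data.
print("Pattern 1: mode =", statistics.mode(pattern_1),
      "median =", statistics.median(pattern_1))
print("Pattern 2: mode =", statistics.mode(pattern_2),
      "median =", statistics.median(pattern_2))

# Spearman's rank correlation between the two sets of ratings.
rho, p_corr = stats.spearmanr(pattern_1, pattern_2)
print(f"Spearman rho = {rho:.2f}, p = {p_corr:.3f}")

# Wilcoxon rank-sum test for a difference between the two sets of ratings.
stat, p_diff = stats.ranksums(pattern_1, pattern_2)
print(f"Rank-sum statistic = {stat:.2f}, p = {p_diff:.3f}")
```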
For the same reasons, Spearman's rank correlation was used to test for correlations between variables, while the Wilcoxon rank-sum test was used to test for statistically significant differences between ratings for the patterns.

The goal of the qualitative analysis was to gain further insight into the metrics described above. Additionally, it aimed to reveal concepts and requirements that are still missing from the patterns from the perspective of their anticipated users. Hence, the qualitative data was analyzed thematically and largely data-driven. The four metrics were used as predetermined themes, within which sub-themes were inferred by categorizing the responses on an increasingly abstract level.

The distributions of self-reported background knowledge are illustrated in Figure 6. Participants generally showed high self-reported experience scores for Human Cognition & Behavior and Human-AI Cooperation, both with a mode of 4. Previous knowledge of FATE, the research program providing the use case, was relatively low (with a mode of 1), as was experience with Team Design Patterns (with a mode of 2).

Figure 8: Distribution of self-reported understanding of how to judge the patterns.

Distributions of understandability scores of all four patterns are depicted in Figure 9. Pattern 2 was rated less understandable than Pattern 1, although not significantly (p=.055), and had a significantly lower understandability rating than sub-patterns 2.1 (p=.011) and 2.2 (p=.039). Understandability scores of Pattern 1 and Pattern 2.1 were correlated (r=.54, p=.015), and understandability scores of Pattern 2 and 2.2 were correlated (r=.48, p=.03). There were no correlations between expertise ratings and understandability scores.

Figure 9: Overview of the distributions of understandability scores per pattern, rated on a five-point Likert scale.

Histograms of the distributions of the patterns' coherency scores are visualized in Figure 10. Patterns 2.1 and 2.2 received significantly higher coherency scores than Patterns 1 and 2 (p<0.04). Coherency scores between all four patterns were strongly correlated (.55