key: cord-0621205-cq8kuo0u authors: Zhou, Jianlong; Verma, Sunny; Mittal, Mudit; Chen, Fang title: Understanding Relations Between Perception of Fairness and Trust in Algorithmic Decision Making date: 2021-09-29 journal: nan DOI: nan sha: 496b8c4bc9129ef82e7612fc34e1d3e32e452808 doc_id: 621205 cord_uid: cq8kuo0u

Algorithmic processes are increasingly employed to perform managerial decision making, especially after the tremendous success of Artificial Intelligence (AI). This paradigm shift is occurring because these sophisticated AI techniques guarantee the optimality of performance metrics. However, this adoption is currently under scrutiny due to various concerns such as fairness, and how the fairness of an AI algorithm affects users' trust is a legitimate question to pursue. In this regard, we aim to understand the relationship between induced algorithmic fairness and its perception in humans. In particular, we are interested in whether these two are positively correlated and reflect substantive fairness. Furthermore, we also study how induced algorithmic fairness affects user trust in algorithmic decision making. To understand this, we perform a user study that simulates candidate shortlisting in a human resource recruitment setting while manipulating the mathematically introduced fairness. Our experimental results demonstrate that different levels of introduced fairness are positively related to the human perception of fairness, and simultaneously positively related to user trust in algorithmic decision making. Interestingly, we also found that users are more sensitive to higher levels of introduced fairness than to lower levels. Besides, we summarize the theoretical and practical implications of this research with a discussion on the perception of fairness.

Artificial Intelligence (AI) has powerful capabilities in prediction, automation, planning, targeting, and personalisation [6]. It has been increasingly used to make important decisions that affect human lives in different areas, ranging from social and public management to promoting productivity for economic wellbeing. For example, AI can be used to decide loan approvals in banks and to manage the engagement and job outcomes of workers within an organization. These algorithms are also utilized by various hiring platforms to recommend and recruit candidates in human resource settings [13, 14] (such AI-informed decision making is also called algorithmic decision making). Beyond all these functionalities of AI, a paramount concern with AI decision making is that equal or equitable treatment based on people's performance or needs [22] is required [19, 30]. This setting of equitable treatment is known as fairness in AI. On the other hand, unintentional (or intentional) discrimination can cause unfairness in AI and lead to poor decision making. Fairness thus becomes critical, as a fair decision making system amplifies satisfaction levels with algorithmic decision making [19, 30]. Fairness (or unfairness) is often a consequence of either the training data or the design of machine learning models, yet it is the fairness humans actually perceive in algorithmic decision making that ultimately affects the adoption of such systems in real-world applications [20]. Besides, inputs to AI models such as machine learning models are often historical records or samples of events.
They are usually not precise descriptions of events and conceal discrimination within sparse details that are very difficult, if not impossible, to identify. AI models are also imperfect abstractions of reality, as their sole purpose is better generalisation capability. Therefore, imprecision and uncertainty are inherent in AI. Meanwhile, AI models are usually "black-boxes" for users and even for AI experts [39]. Users simply provide input data to an AI system, and after selecting some menu options, the system displays colourful viewgraphs and/or recommendations as output. It is neither clear nor well understood why these AI algorithms made a certain prediction, or how trustworthy their predictions are. In a nutshell, these concerns demonstrate that the successful use of AI critically depends on user trust in AI systems. One widely cited definition describes trust as "the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party" [24]. Considerable research on fairness has shown that fairness perceptions are linked to trust, for example in management and organizations [18, 31]. Differently, in algorithmic decision making, the mathematical fairness introduced by AI models and/or data (also referred to as introduced fairness in this paper) is perceived by humans implicitly or explicitly (also referred to as perception of fairness in this paper). The perceived fairness is a central component of maintaining satisfactory relationships with humans in decision making [1]. Given various mathematical formulations of fairness, three major findings are: 1) demographic parity most closely matches the human perception of fairness [32]; 2) transparency and outcome control affect perceived fairness [21]; and 3) multiple factors affect perceptions of fairness in algorithmic decision making [35]. Since fairness (or discrimination) is introduced by AI models and/or the data, it is critical to understand whether a given level of introduced fairness affects its perception by humans in algorithmic decision making. Therefore, in this work we aim to investigate the relations between the introduced fairness and the human perception of fairness. Our aim is to understand how introduced fairness is perceived by humans; in particular, is this relation positive or negative? Furthermore, we investigate whether the introduced fairness affects users' trust in algorithmic decision making. In this regard, we utilise statistical parity as the actual fairness level of an AI system (defined in Sec. 3, Eq. 1), as it has been widely accepted as a metric to measure fairness. We then design a user study to investigate the perception of fairness by simulating human resource recruitment for candidate shortlisting while manipulating the introduced fairness. Due to lockdown restrictions during the COVID-19 pandemic, this user study was performed online to collect participant responses. In summary, our experimental results demonstrate two important findings: 1) introduced fairness is positively related to the human perception of fairness; and 2) simultaneously, a high level of fairness leads to increased trust in algorithmic decision making.
These findings illustrate that trust judgments can be influenced by fairness information, which is comprehensively discussed both theoretically and practically in a dedicated section. With the increasing use of AI in critical domains, especially human-related decision making such as the allocation of social benefits, hiring, and criminal justice [3, 11], fairness is becoming one of the key concerns in algorithmic decision making. Current research on fairness in machine learning focuses on formalising the definition of fairness and quantifying the unfairness (bias) of an algorithm with different metrics [8, 12, 26]. These works typically begin by outlining fairness in the context of different protected attributes (sex, race, origin, culture, etc.) receiving equal treatment by algorithms [2, 16]. Various definitions of fairness have been investigated, ranging from statistical bias, group fairness, and individual fairness to process fairness and others, resulting in 21 different definitions [27]. However, it is impossible to satisfy all definitions of fairness at the same time [9, 15]. Despite the proliferation of fairness definitions and unfairness quantification approaches, little work investigates humans' perceived fairness (perception of fairness) when fairness defined by a specific definition is introduced. This paper uses statistical parity as the definition of fairness to investigate the human perception of fairness in algorithmic decision making. Statistical parity and its utilisation in this work are briefly described in Section 3. Various studies have investigated user trust variations in algorithmic decision making. Zhou et al. [38, 41] argued that communicating user trust benefits the evaluation of the effectiveness of machine learning approaches. Kizilcec [17] proposed that the transparency of algorithm interfaces can promote awareness and foster user trust; it was found that appropriate transparency of algorithms through explanation benefited user trust. Ribeiro et al. [29] explained predictions by learning an interpretable model locally around the prediction and visualizing the importance of the most relevant features to improve user trust in classifications. Other studies that empirically tested the importance of explanation to users, in various fields, consistently showed that explanations significantly increase users' confidence and trust [4, 33]. Zhou et al. [40] investigated the effects of presenting the influence of training data points on predictions to boost user trust and found that presenting the influences of training data points significantly increased user trust in predictions, but only for training data points with higher influence values under the high model performance condition. Zhang et al. [37] investigated the effect of confidence scores and local explanations on trust. It was found that confidence scores can help calibrate people's trust in an AI model, but local explanations were not able to create a perceivable effect for trust calibration as expected, perhaps because of the experiment design. In addition, researchers found that user trust had significant correlations with users' experience of system performance [39]. Yin et al. [36] also found that the stated model accuracy had a significant effect on the extent to which people trust the model, suggesting the importance of communicating ML model performance for user trust.
These previous works primarily focus on investigating the effects of explanation and model performance on user trust in algorithmic decision making. However, less attention has been paid to the perception of fairness and its effects on trust, which is investigated in this paper. It was found that customers' perceptions of fair treatment are important in driving trustworthiness and engendering trust in the banking context [31]. Earle and Siegrist [10] found that procedural fairness had no effect on trust when issue importance was high, while procedural fairness had moderate effects on perceived fairness and trust when issue importance was low. Nikbin et al. [28] found that perceived service fairness had a significant relationship with trust, and confirmed the mediating role of satisfaction and trust in the relationship between perceived service fairness and behavioural intention. Lee [20] investigated how people perceive decisions made by algorithms compared with decisions made by humans in a management context. It was found that algorithmic decisions were perceived as less fair and trustworthy and evoked more negative emotion than human decisions. Previous work pays more attention to the relations between the perception of fairness, especially procedural fairness, and user trust in social interaction contexts such as marketing and services; however, little work examines the effects of fairness on user trust in algorithmic decision making. This study investigates whether the introduced fairness is positively received by humans and how such fairness affects user trust by simulating candidate shortlisting in a human resource recruitment setting in algorithmic decision making. Fairness is a complex and multi-faceted concept that depends on context and culture [2]. Narayanan [27] described at least 21 mathematical definitions of fairness from the literature. This is because of different reasons such as different contexts/applications, different stakeholders, impossibility theorems, as well as allocative versus representational harms. In this study, statistical parity, one of the group fairness definitions, is used to represent fairness. Statistical parity states that a predictor is fair if the prediction $\hat{Y}$ is independent of the protected attribute $A$, so that

$P(\hat{Y} = 1 \mid A = 0) = P(\hat{Y} = 1 \mid A = 1).$

This means that subjects in both protected and unprotected groups have an equal probability of being assigned to the positive predicted class. Taking recruitment as an example, this would imply an equal probability for male and female applicants to receive a positive predicted recruitment outcome, where $A = 0$ represents male applicants and $A = 1$ represents female applicants. Based on these preliminaries, the statistical parity difference ($SPD$) is defined as:

$SPD = \lvert P(\hat{Y} = 1 \mid A = 0) - P(\hat{Y} = 1 \mid A = 1) \rvert$

where $SPD$ is in the range of $[0, 1]$; $SPD = 0$ represents complete fairness, and $SPD = 1$ represents complete unfairness. This paper manipulates various fairness levels of $SPD$ between $[0, 1]$ to learn how introduced fairness is perceived and how it affects trust in algorithmic decision making. This paper poses the following hypotheses:
H1: The human perceived fairness will be positively related to the introduced fairness; that is, a high level of introduced fairness will result in a high level of perceived fairness by humans, and vice versa.
H2: User trust will be positively related to the introduced fairness.
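As an illustration of the fairness metric used here, the short sketch below computes the statistical parity difference from binary shortlisting predictions and a binary protected attribute. It is a minimal sketch: the function name and the example data are illustrative and not taken from the paper.

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """Absolute difference in positive-prediction rates between the two
    groups of a binary protected attribute (0 = male, 1 = female here).

    SPD = |P(Y_hat = 1 | A = 0) - P(Y_hat = 1 | A = 1)|
    """
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)
    rate_group0 = y_pred[protected == 0].mean()  # shortlisting rate for group A = 0
    rate_group1 = y_pred[protected == 1].mean()  # shortlisting rate for group A = 1
    return abs(rate_group0 - rate_group1)

# Illustrative example: 10 male and 10 female applicants, 6 vs. 3 shortlisted.
y_pred = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7
protected = [0] * 10 + [1] * 10
print(statistical_parity_difference(y_pred, protected))  # prints approximately 0.3
```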
That is, a high level of introduced fairness will result in a high level of trust in algorithmic decision making, and vice versa.
H3: Humans will be more sensitive to changes at high levels of introduced fairness than at low levels of introduced fairness.
In this study, a company is assumed to be recruiting staff for a position. The company posted the job description and different applicants submitted their applications. Participants act as recruiters who shortlist applicants for the position. A machine learning system named Automatic Recruiting Assistant (ARA) is used to help participants make decisions on the shortlisting of applicants. ARA is candidate assessment software that uses historical recruiting data to train a machine learning model and predict whether a candidate will be shortlisted. The ARA's accuracy is kept constant across all shortlisting tasks. This study investigates the perception of fairness, which is measured with a single scale that focuses on a global perception of appropriateness [7]. In this study, the questionnaire item on fairness is "Overall, female and male applicants are treated fairly by ARA", rated on a 5-point Likert-type response scale ranging from 1 (strongly disagree) to 5 (strongly agree). Trust is assessed with the following six self-report items [25], each rated on a 5-point Likert-type response scale ranging from 1 (strongly disagree) to 5 (strongly agree):
• I believe the ARA is a competent performer.
• I trust the ARA.
• I have confidence in the advice given by the ARA.
• I can depend on the ARA.
• I can rely on the ARA to behave in consistent ways.
• I can rely on the ARA to do its best every time I take its advice.
This section details the experiment to examine our hypotheses on the introduced fairness, the human perception of fairness, and trust in an automatic recruitment algorithm. Tasks were designed to investigate the effects of different fairness levels on user trust in algorithmic decision making. The protected attribute in this study is the gender of applicants. In this case, the $SPD$ is the difference in shortlisting rates by gender. Fairness was introduced by manipulating $SPD$ with discrete values of 0, 0.1, 0.2, 0.3, 0.4, . . . , 0.8, 0.9, and 1.0, where each discrete $SPD$ value was used as a measure of fairness to define the number of male and female applicants as well as the number of male and female applicants shortlisted in each task. Table 1 shows 11 task examples corresponding to different $SPD$ values. In this table, "Rate (Male)" represents the predicted success rate for male applicants, "Rate (Female)" represents the predicted success rate for female applicants, "Male #" represents the number of male applicants, "Female #" represents the number of female applicants, "Listed Male #" represents the number of shortlisted male applicants, and "Listed Female #" represents the number of shortlisted female applicants. With the same $SPD$ settings as in the table, different numbers of male and female applicants were used to generate another 11 tasks. Altogether, 22 tasks were conducted by each participant. Two additional training tasks were also conducted by each participant before the formal tasks. The order of tasks was randomized during the experiment to avoid any bias. Due to social distancing restrictions and lockdown policies during the COVID-19 pandemic, our experiment was implemented using the Flask framework in Python and deployed online on the Heroku cloud platform.
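To make the task-design step concrete, here is a minimal sketch of how a shortlisting task with a target SPD could be constructed. The applicant counts, the assumed male success rate, and the helper name are illustrative assumptions; they do not reproduce the exact configurations of Table 1.

```python
def make_task(n_male, n_female, male_rate, target_spd):
    """Build one shortlisting task whose shortlisted counts realise a target
    statistical parity difference (male rate - female rate = target_spd).

    Counts and rates here are illustrative; Table 1 in the paper defines the
    actual task configurations used in the study.
    """
    female_rate = male_rate - target_spd
    assert 0.0 <= female_rate <= 1.0, "rates must stay within [0, 1]"
    return {
        "male_applicants": n_male,
        "female_applicants": n_female,
        "shortlisted_male": round(n_male * male_rate),
        "shortlisted_female": round(n_female * female_rate),
        "spd": target_spd,
    }

# Eleven tasks covering SPD = 0.0, 0.1, ..., 1.0 (assumed male success rate of 1.0).
tasks = [make_task(10, 10, 1.0, spd / 10) for spd in range(11)]
```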
The deployed application link was then shared with participants to invite them to conduct tasks. In this study, participant responses to tasks were stored in a MySQL database that was directly connected to the Flask application. Figure 1 shows a screenshot of a task conducted in the experiment. 20 participants, mainly university students in the 20-30 age group with an average age of around 25, were invited via various means of communication such as emails, text messages, and social media posts. After each task was displayed on the screen, the participants were asked to answer seven questions based on the task. The first question was on the fairness of the applicant shortlisting shown in the task, while the other six questions were on the participant's trust in the decision making of the ARA. This study aims to understand: 1) how the introduced fairness is perceived by humans, and 2) how the introduced fairness affects user trust in algorithmic decision making. To perform the analyses, we first normalised the collected trust and fairness data. We then performed one-way ANOVA tests on the normalised data, followed by post-hoc comparisons using Tukey HSD tests. The fairness and trust values were normalised with respect to each subject to minimise individual differences in rating behaviour using the equation given below:

$r'_u = \frac{r_u - \min_u}{\max_u - \min_u}$

where $r_u$ and $r'_u$ are the original and normalised fairness or trust ratings respectively from user $u$, and $\min_u$ and $\max_u$ are the minimum and maximum of the ratings from user $u$ over all of his/her tasks. Figure 2 shows the mean normalised perceived fairness (perception of fairness) over introduced fairness levels (error bars represent the 95% confidence interval of the mean; the same applies in other figures). A one-way ANOVA test found statistically significant differences in perceived fairness among the 11 introduced fairness levels ($F(10, 429) = 29.872$, $p < .001$). Further post-hoc comparisons with Tukey HSD tests were conducted to test pairwise differences in perceived fairness between introduced fairness levels. It was found that the perceived fairness at $SPD$ = 0, 0.1, and 0.2 had significant differences with all other levels from 0.4 to 1.0 respectively ($p < .001$ for all). The perceived fairness at $SPD$ = 0 ($p < .001$) and 0.1 ($p < .005$) also had significant differences with $SPD$ = 0.3. However, no significant differences were found in perceived fairness among any pair of $SPD$ values at 0, 0.1, and 0.2. It was also found that the perceived fairness at $SPD$ = 0.3 had significant differences with $SPD$ = 0.6, 0.7, ..., 1.0 respectively ($p < .017$ for all). Furthermore, the perceived fairness at $SPD$ = 0.4 ($p < .006$) and 0.5 ($p < .005$) had significant differences with $SPD$ = 1.0. Despite no other significant differences being found in perceived fairness among introduced fairness levels, Figure 2 shows that the perceived fairness has a clear decreasing trend with the decrease of introduced fairness (increase of $SPD$ values). The results suggest that participants' perception of fairness was positively related to the introduced fairness (H1), but was not sensitive to small changes of introduced fairness. Moreover, participants were more sensitive to the perceived fairness at high levels than at low levels, as we expected (H3). These findings also imply that the introduced fairness can be safely used to validate the human perception of fairness.
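The following is a minimal sketch, under assumed column names and file layout, of the analysis pipeline described above: per-subject min-max normalisation of ratings, a one-way ANOVA across the 11 introduced fairness levels, and Tukey HSD post-hoc comparisons. The file name "responses.csv" and the columns are assumptions for illustration.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Assumed layout: one row per (participant, task) with columns
# 'user', 'spd' (introduced fairness level), and 'fairness' (1-5 rating).
df = pd.read_csv("responses.csv")

# Per-subject min-max normalisation to reduce individual differences in rating behaviour.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

df["fairness_norm"] = df.groupby("user")["fairness"].transform(minmax)

# One-way ANOVA across the 11 introduced fairness (SPD) levels.
groups = [g["fairness_norm"].values for _, g in df.groupby("spd")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_val:.4f}")

# Post-hoc pairwise comparisons with Tukey HSD.
print(pairwise_tukeyhsd(df["fairness_norm"], df["spd"]))
```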
Following the trend of perceived fairness described above, we divided the introduced fairness levels into three groups: Group A, Group B, and Group C (ordered from high to low introduced fairness). A one-way ANOVA test was then conducted for these three groups of introduced fairness. It was found that there were statistically significant differences in perceived fairness among the three introduced fairness group levels ($F(2, 117) = 104.725$, $p < .001$). Post-hoc comparisons with Tukey HSD tests were conducted to test pairwise differences in perceived fairness between introduced fairness group levels. It was found that the perceived fairness at the introduced fairness level of Group A was significantly higher than that at the levels of Group B ($p < .001$) and Group C ($p < .001$) respectively. The perceived fairness at the introduced fairness level of Group B was also significantly higher than that at the level of Group C ($p < .001$). The results show that the human perception of fairness was positively related to the introduced fairness. The findings imply that the introduced fairness based on $SPD$ can safely reflect the perception of fairness in algorithmic decision making. Figure 4 shows the mean normalised trust ratings over introduced fairness ($SPD$) levels. A one-way ANOVA test found statistically significant differences in trust ratings among the 11 fairness levels ($F(10, 429) = 11.550$, $p < .001$). Post-hoc comparisons using Tukey HSD tests then found significant differences in trust responses between introduced fairness level pairs, as shown in Table 2. It shows that participants had significantly higher trust in AI-informed decisions under high introduced fairness levels (low $SPD$ values) than under low introduced fairness levels (high $SPD$ values). For example, participants had significantly higher trust under $SPD$ = 0 than under $SPD$ = 0.7 ($p < .003$). However, user trust did not show significant differences among the high introduced fairness levels (e.g. $SPD$ = 0, 0.1, 0.2, 0.3). We further analysed trust differences under the three introduced fairness group levels A, B, and C described above. A one-way ANOVA test found statistically significant differences in user trust among the three introduced fairness group levels ($F(2, 117) = 48.272$, $p < .001$). Further post-hoc comparisons with Tukey HSD tests found that user trust was significantly higher at the introduced fairness level of Group A than at the levels of Group B ($p < .001$) and Group C ($p < .001$) respectively. User trust was also significantly higher at the introduced fairness level of Group B than at the level of Group C ($p < .001$). The findings suggest that user trust had a positive relationship with the introduced fairness, as we expected (H2): the higher the introduced fairness level, the higher users' trust in decisions. Our study found that the introduced fairness was positively related to the fairness perceived by humans. Besides, it also shows that high levels of introduced fairness resulted in high levels of human perception of fairness. These findings confirm that the introduced fairness level can be safely used to evaluate the human perception of fairness. Furthermore, the introduced fairness was also positively related to users' trust in algorithmic decision making. Once again, we see that a high level of introduced fairness benefited user trust. It was also found that participants were more sensitive to the introduced fairness at high levels than at low levels.
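As a rough illustration of the grouped analysis described above, the sketch below bins the SPD levels into three groups and reruns the ANOVA and Tukey HSD tests on normalised trust ratings. The group cut points, column names, and file name are assumptions for illustration only; this extract does not state the exact SPD ranges of Groups A, B, and C.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same assumed layout as the previous sketch, with 'trust' being the mean of the six trust items.
df = pd.read_csv("responses.csv")

# Per-subject min-max normalisation, as for the fairness ratings.
df["trust_norm"] = df.groupby("user")["trust"].transform(
    lambda x: (x - x.min()) / (x.max() - x.min())
)

# Bin SPD levels into three groups; these cut points are illustrative
# assumptions, not the paper's actual group boundaries.
df["group"] = pd.cut(df["spd"], bins=[-0.01, 0.2, 0.5, 1.0], labels=["A", "B", "C"])

groups = [g["trust_norm"].values for _, g in df.groupby("group", observed=True)]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_val:.4f}")
print(pairwise_tukeyhsd(df["trust_norm"], df["group"]))
```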
Fairness heuristic theory [23, 34] suggests that when individuals face uncertain circumstances they rely on impressions of fairness to determine whether to cooperate and enter into exchange relationships with the other party; that is, individuals use fairness judgements to form their perceptions of trust. Social exchange theory [5] also argues that fair actions and treatment by one party generate reciprocation in the form of trust by the other party in the exchange. In the context of candidate shortlisting in the human resource setting utilised in this paper, recruiters were unsure about the outcomes from the Automatic Recruiting Assistant. As a result, recruiters formed their trust perception based upon the fairness perception, and a high level of perception of fairness resulted in a high level of trust in algorithmic decision making. These findings have significant implications for algorithmic decision making applications. For example, when trust is difficult to examine in algorithmic decision making, the human perception of fairness can be used to estimate user trust, since the human perception of fairness is positively related to introduced fairness. Our findings also imply that the introduced fairness can be safely used to validate the human perception of fairness. Furthermore, since humans are more sensitive to high levels of fairness, high levels of fairness (rather than low levels) can be explicitly presented in the user interface of AI applications to boost user trust in algorithmic decision making. Overall, the findings from this study have at least the following implications: 1) the estimation of user trust in algorithmic decision making from the human perception of fairness; 2) the design of user interfaces for AI applications that boost user trust by explicitly presenting high levels of fairness to users; and 3) the manipulation of the human perception of fairness by manipulating the level of introduced fairness. Fairness is a key concern in algorithmic decision making in many application areas, including shortlisting candidates for advertised positions or approving loans based on historical records. The human perception of fairness affects the adoption of AI applications. This paper examined the relations between the introduced fairness and the human perception of fairness and investigated how the introduced fairness affected user trust in algorithmic decision making. Experimental results showed that the introduced fairness was positively related to the human perception of fairness, and concurrently it was also positively related to users' trust. Interestingly, users were more sensitive to fairness at high levels than at low levels. The findings can be used to help estimate trust in algorithmic decision making and to inform user interface design for AI solutions. Future work will focus on introducing AI explanations into the pipeline to understand their effects on user trust in algorithmic decision making.

References
[1] When consumers care about being treated fairly: The interaction of relationship norms and fairness norms
[2] An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias
[3] Fairness in criminal justice risk assessments: The state of the art
[4] Explaining Recommendations: Satisfaction vs. Promotion
[5] Exchange and Power in Social Life
[6] AI in the public interest
[7] Measuring Justice and Fairness
[8] The measure and mismeasure of fairness: A critical review of fair machine learning
[9] Algorithmic decision making and the cost of fairness
[10] On the Relation Between Trust and Fairness in Environmental Risk Management
[11] Certifying and removing disparate impact
[12] Measuring the biases that matter: The ethical and causal foundations for measures of fairness in algorithms
[13] Implicit Skills Extraction Using Document Embedding and Its Use in Job Recommendation
[14] Artificial Intelligence, Employee Engagement, Fairness, and Job Outcomes
[15] The Impossibility Theorem of Machine Fairness - A Causal Perspective
[16] Avoiding discrimination through causal reasoning
[17] How Much Information?: Effects of Transparency on Trust in an Algorithmic Interface
[18] Employees' Perceptions of Trust, Fairness, and the Management of Change in Three Private Universities in Cyprus
[19] Fairness in human resource management, social exchange relationships, and citizenship behavior: testing linkages of the target similarity model among nurses in the United States
[20] Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management
[21] Procedural Justice in Algorithmic Fairness: Leveraging Transparency and Outcome Control for Fair Algorithmic Mediation
[22] What Should Be Done with Equity Theory?
[23] Fairness heuristic theory: Justice judgments as pivotal cognitions in organizational relations
[24] An Integrative Model of Organizational Trust
[25] I Trust It, but I Don't Know Why: Effects of Implicit Attitudes Toward Automation on Trust in an Automated System (2013)
[26] Fair inference on outcomes
[27] Translation tutorial: 21 fairness definitions and their politics
[28] The effects of perceived service fairness on satisfaction, trust, and behavioural intentions
[29] Why Should I Trust You
[30] Designing fair AI for managing employees in organizations: a review, critique, and design agenda
[31] The impact of fairness on trustworthiness and trust in banking
[32] Human Perception of Fairness: A Descriptive Approach to Fairness for Machine Learning
[33] MoviExplain: A Recommender System with Explanations
[34] Uncertainty management: The influence of uncertainty salience on reactions to perceived procedural fairness
[35] Factors Influencing Perceived Fairness in Algorithmic Decision-Making: Algorithm Outcomes, Development Procedures, and Individual Differences
[36] Does Stated Accuracy Affect Trust in Machine Learning Algorithms
[37] Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making
[38] Be Informed and Be Involved: Effects of Uncertainty and Correlation on User Confidence in Decision Making
[39] Human and Machine Learning: Visible, Explainable, Trustworthy and Transparent
[40] Physiological Indicators for User Trust in Machine Learning with Influence Enhanced Fact-Checking
[41] Measurable Decision Making with GSR and Pupillary Analysis for Intelligent User Interface

Acknowledgements: The authors would like to thank all participants in this study.