key: cord-0802439-gpq15x43 authors: Rayhan, M. D.; Alam, M. D. Golam Rabiul; Dewan, M. Ali Akber; Ahmed, M. Helal Uddin title: Appraisal of high-stake examinations during SARS-CoV-2 emergency with responsible and transparent AI: Evidence of fair and detrimental assessment date: 2022-05-20 journal: nan DOI: 10.1016/j.caeai.2022.100077 sha: 523e4c0b6d4a7a2323f2a9d31613e38bb3fb556b doc_id: 802439 cord_uid: gpq15x43 In situations like the coronavirus pandemic, colleges and universities are forced to limit their offline and regular academic activities. Extended postponement of high-stakes exams due to health risks thereby reduces productivity and progress in later years. Several countries decided to organize the exams online. Since many other countries with large education boards had inadequate infrastructure and insufficient resources during the emergency, education policy experts considered a solution to simultaneously protect public health and fully resume high-stakes exams: canceling offline exams and introducing a uniform assessment process to be followed across the states and education boards. This research proposes a novel system that uses an AI model to accomplish the complex task of evaluating all students across education boards with a maximum level of fairness, and analyzes its ability to fairly appraise exam grades in the context of high-stakes examinations during the SARS-CoV-2 emergency. Essentially, a logistic regression classifier on top of a deep neural network is used to output predictions that are as fair as possible for all learners. The predictions of the proposed grade-awarding system are explained by the SHAP (SHapley Additive exPlanations) framework. SHAP made it possible to identify the features of the students' portfolios that contributed most to the predicted grades.
In the setting of an empirical analysis in one of the largest education systems in the Global South, 81.85% of learners were assigned fair scores, while 3.12% of the scores were significantly smaller than the actual grades, which would have had a detrimental effect had the system been applied for real. Furthermore, SHAP allows policy-makers to debug the predictive model by identifying and measuring the importance of the factors involved in the model's final decision and removing those features that should not play a role in the model's "reasoning" process. The global health crisis resulting from the 2020 coronavirus outbreak caused the biggest worldwide education disruption in history. Extended postponement of high-stakes exams due to health risks prevented millions of examinees from proceeding further in their careers. Moreover, the prolonged suspension of exams can have a huge financial impact. These operational barriers may reduce university admissions and course enrollments (Dhanalakshmi et al., 2021). The virus outbreak led governments around the globe to temporarily shut down kindergartens, schools, colleges, universities, and other learning institutions nationwide. The majority of countries decided to continue the massive closure of schools from the fourth week of March 2020 as the number of Covid-19 cases increased exponentially. Thus, the lockdown and social distancing measures immediately had an enormous impact on education. The suspension of the education system was maintained during the development of the vaccines, which was expected to take many years (Niko Kommenda, 2020). The large education boards in the Global South remained suspended until mid-2021. Global and national policy-makers of education bodies were trying their best to tackle the unprecedented emergency in education. For instance, to ensure learning continuity, many countries transferred campus learning to online learning.
However, education bodies remained puzzled over how to handle the scheduled assessments and exams, especially high-stakes exams such as school-leaving exams, gateways to jobs, and other major public examinations. Halting the graduation process will impact the future careers of these students as well as the economies of the affected countries. Interruption of high-stakes exams delays student qualification and graduation, which in turn delays entry into higher education or the job market. Hence, throughout 2020 and 2021, handling high-stakes exams was among the top priorities on all policy-makers' agendas (UNESCO's COVID-19 Education Response). Lower- and middle-income countries are more vulnerable to time and resource constraints in handling nationwide high-stakes exams (Davidson and Katopodis, 2020). Various measures have been considered to cope with the emergency, including cancellation, postponement, derogation, on-screen tests, paper-based examinations with physical distancing, remote assessment, and alternative approaches for validation and certification. Each of these solutions has its drawbacks in terms of fairness and evaluation quality; therefore, the set of solutions is not equally applicable to all education systems. If a single set of solutions were feasible, countries' policy responses would not have been so diverse. At the beginning of the pandemic and school closures, education bodies around the world planned to arrange the high-stakes exams upon the reopening of schools. As 2020 was reaching its end, with millions of candidates' progression already hampered, countries rapidly opted for new strategies regarding the assessment of high-stakes exams. During the absence of formal assessment, many governments introduced and approved alternative approaches to high-stakes exams.
After a long period of waiting and lengthy consideration over accountability and health risk, one of the E-9 countries, Bangladesh, decided to assign the arithmetic mean of the previous two exams to the final exam of its graduating college students. Around the same time, the Indian Central Board of Secondary Education (CBSE) promulgated a 40:30:30 formula for the evaluation of the Class XII graduation exam (The Indian Express). The 40:30:30 ratio denotes the percentages taken from the Class XII, XI, and X results, respectively. Bangladesh and India declared that they had no option but to consider an alternative to the offline college-leaving exam in order to protect the safety, health, and social-emotional well-being of students and educational personnel, as well as to alleviate the logistic and financial burdens associated with organizing and conducting exams. Adapting an evaluation system because of war or a major conflict is referred to as a latent assessment strategy (Clarke, 2011). The latent assessment model for the Bangladesh Higher Secondary Certificate exam is a mathematical model that generates results by looking at the previous two achievements in the secondary stage. The purpose of an alternative appraisal model for high-stakes exam grades as an emergency measure is to maximize fairness and equity. The higher education board of Bangladesh applied a simple arithmetic function based on two parameters: performance at the junior and senior secondary levels. A mathematical model's abstraction of the real-world system is expressed in its conceptual model (Levins, 1966). Simple models with a small number of parameters are typically unaware of the probability distribution. Evaluating high-stakes exam candidates with such simpler models has a higher probability of distorting the true distribution as well as the fairness of evaluation. The country's experts believed there was apparently no solution that would favor everybody amid so many uncertainties (Hashan, 2020).
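The 40:30:30 weighting described above can be expressed as a simple function. The sketch below is illustrative only: the 0-100 marks scale and the use of per-class averages are assumptions, as the exact computation details of the CBSE formula are not reproduced here.

```python
def cbse_40_30_30(class_xii, class_xi, class_x):
    """Hypothetical sketch of the CBSE 40:30:30 weighting: 40% from the
    Class XII result, 30% from Class XI, and 30% from Class X, all
    assumed to be averages on a 0-100 marks scale."""
    return 0.40 * class_xii + 0.30 * class_xi + 0.30 * class_x

# A candidate averaging 80, 70, and 90 in Classes XII, XI, and X:
print(cbse_40_30_30(80, 70, 90))  # 0.4*80 + 0.3*70 + 0.3*90 = 80.0
```

Like the Bangladeshi arithmetic mean, this is a fixed deterministic rule: two candidates with identical past averages always receive identical final scores.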
Several times before the All India Senior School Certificate Examination results were published, the high courts of India ordered CBSE to modify the proposed formula to ensure equity (News18, b). The consequence of publishing high-stakes grades computed with an arithmetic formula is quite upsetting: reportedly, thousands of candidates challenged their awarded grades, and extreme self-injurious attempts were even made by some young adults (News18, a). Not only does the detrimental effect create unexpected consequences, but the unexpectedly high number of students with an awarded grade above 90% also makes university admission an uphill task. The arithmetic formula for high-stakes exam assessment created a large number of gainers and losers. This article identifies the research gap, since there is no globally conceded model as an alternative to high-stakes exams during crises, and thereby proposes a robust framework that is capable of appraising the grades of all types of candidates based on the student portfolio. A student portfolio is a set of educational data, including academic accomplishments, identification, awards, obtained marks, honors, certifications, etc., compiled in any form and presented publicly or privately. Recent literature shows that AI and educational data mining are research fields still in their infancy, and their successful application in educational institutions has to be demonstrated more extensively (Saddiqa et al., 2021; Lemay et al., 2021). The previous decade has witnessed remarkable growth in research on computer-based evaluation based on a student portfolio (Chen et al., 2020b). However, little effort has been made to incorporate deep learning technology into formal assessment systems (Chen et al., 2020a). This research establishes an inclusive and equitable machine learning model that goes beyond the idea of a uniform assessment formula. The research question below drives the efforts of this study.
Is an AI-driven assessment system an effective pragmatic solution during the absence of formal high-stakes assessment activity? This research problem became obvious to the authors when the Global South nations were in quest of an inclusive assessment formula for high-stakes exams (News18, c). While the method applies to any education system, this study exemplifies it by applying it to one E-9 member country's higher secondary certificate assessment. Moreover, the model is applicable as a smart learning analytics tool (Chen and Li, 2021; Solano-Flores et al., 1999) to support precision education (Yang et al., 2021, Table 4), which will promote fairer student assessment (Friedler et al., 2008). A systematic literature review discovered that early student performance prediction can help universities provide timely actions, like planning appropriate training to improve students' success rates (Alyahyan and Düştegör, 2020). The centralized high-stakes examinations are summative evaluations that are generally carried out at the end of the learning process. One disadvantage of this type of assessment is that learners discover their true performance when it is too late. A reliably predicted grade sheet would enhance a student's confidence as well as suggest emphasizing particular subjects. Therefore, the prediction of academic performance in higher education provides several benefits to teachers, students, policy-makers, and institutions. The proposed system is tested in the setting of an empirical study with the central board's Class 12 examination, which aims at determining the model's ability to fairly evaluate high-stakes exam grades while optimizing the gainer-to-loser ratio. An example of a hypothetical dedicated computer to predict exam grades is illustrated in Figure 1. It works like a vending machine that outputs a grade per subject when a student portfolio is input.
Later, in Section 3.1, the Turing test will be applied to the machine. The current research on autonomous assessment systems is discussed in Section 2.1, followed by a discussion of the pandemic's global disruption of formal education and assessment in Section 2.2. The E-9 member country's secondary education system, in which this study's evaluation model is tested, is then shortly described in Section 2.3, followed by sections on emerging assessment policies during the pandemic (Section 2.4) and AI systems in higher education (Section 2.5). The building blocks of the proposed model are discussed in Section 2.6. The research problem, the model's architecture, and feature engineering based on students' academic data are analyzed in Sections 3.1, 3.2, and 3.3, respectively. Data are subsequently analyzed in Section 4.1, the model's performance is reported in Section 4.2, and the interpretation of predicted scores is provided in Section 4.3. A discussion of the impact of the findings and the study's limitations is given in Section 5. Lastly, Section 6 concludes the paper. The main contributions of the study are summarized below:
• Proposes an alternative inclusive solution to evaluate transcripts of high-stakes examinations;
• Harnesses the students' numerical and categorical data to learn the underlying distribution;
• Establishes a generic framework that produces fair and trustworthy evaluations of each candidate of the central education board;
• Develops an autonomous automated evaluation system for the Higher Secondary Certificate, which had a detrimental effect on only 3.12% of the students;
• Uses SHAP to minimize the negative impact of including irrelevant features in both the predictive and explanation models.
In a framework for building an effective student assessment system, the World Bank reported that the latent assessment strategy can be applied to countries where there is no formal assessment activity or where the education system has been interrupted due to war or other conflict (Clarke, 2011). The world has experienced educational crises several times in history. The educational cost of World War II led to a considerable decline in educational attainment in higher education. The educational disruption due to the suspension of enrollment and assessment processes becomes apparent if one observes the time-series data (during, before, and after) of student performance (Ichino and Winter-Ebmer, 2004). As a crisis lasts over time, the odds of educational stress increase correspondingly. The economic loss due to World War II was observed even 40 years after the war (Ichino and Winter-Ebmer, 2004). A trustworthy data-centric method for student assessment that does not rely on traditional assessment could minimize the economic and emotional impacts by minimizing the dropout rate. In today's world, remote learning tools look practical, but the education sector hesitates to implement an intelligent assessment system for emergencies, despite the fact that sufficient computational resources are available to train predictive models of student academic performance. The purpose of assessment, which is to fairly determine a student's progress, can be met using a predictive model, as many researchers in the field of AI in education have suggested. A large-scale project in a primary school in Vietnam implemented an artificial neural network to predict a student's probability of succeeding in math and Vietnamese (Musso et al., 2020). The model reached very high accuracies (95-100%).
Alongside the ability to predict student performance, the model had important implications for policy-makers by highlighting the reasons (the features) for the prediction. A recent study (Rodríguez-Hernández et al., 2021a) illustrated the most important features contributing to the prediction of academic performance. However, the study did not evaluate whether the proposed solution was fair to all learners. In essence, no research has suggested an inclusive high-stakes assessment model that is fairer than human decisions and can be considered as an alternative during the absence of examinations. It is very challenging to build a highly accurate assessment model that works for all students and minimizes its detrimental effects. Therefore, more studies should be carried out to develop large-scale assessment models for statewide high-stakes exams. The SDG-Education 2030 Steering Committee provides strategic guidance to the global education community and ensures follow-up and review for education in the 2030 Sustainable Development Agenda. As soon as Covid-19 spread quickly around the world and was declared a pandemic by the WHO, the Steering Committee underscored that the Covid-19 pandemic was not only a global health crisis but also an educational crisis. The SDG-Education 2030 Steering Committee called on its member states' governments to respect strategic policy recommendations in response to the pandemic (SDG-Education 2030 Steering Committee). The committee drew attention to teachers' and education personnel's safety, health, and well-being. Besides, the Steering Committee urged governments to maintain strong political commitment and investment in education throughout and after the crisis (Guterres, 2020). The UN warned that the pandemic was creating severe disruption in the world's education systems and was threatening a loss of learning whose impact may stretch beyond one generation of students.
The report empirically anticipated the economic impact on households, which is likely to widen pre-existing inequities in education. Nearly 23.8 million children and youth (from pre-primary to tertiary) may drop out or not have access to school next year (2021) due to the pandemic's economic impact alone, thereby pushing millions into severe poverty. Other research by the World Bank (Pedro et al., 2021) suggested that 25 percent more students may fall below the baseline level of proficiency needed to participate effectively and productively in society. The United Nations encouraged authorities to bring about a set of solutions, previously considered difficult or impossible to implement, to ensure that education systems are more flexible, equitable, and inclusive. (The 2030 Agenda for SDG 4 is to "ensure inclusive and equitable quality education and promote lifelong learning opportunities for all.") The UN Secretary-General called on national authorities and the international community to come together to place education at the forefront of recovery agendas and protect investment in education. During this state of confusion and chaos, not only pedagogy will be affected but also numerous factors such as organizational routines and placement rates at various educational institutions. Both (SDG-Education 2030 Steering Committee) and (Guterres, 2020) foresee that the pandemic will particularly impact the education community of low- and middle-income countries. In the future, it may be necessary to conduct a cohort study between the group that was kept out of school during the pandemic (whose formal assessment was postponed) and the regular group before the pandemic that went through regular assessment. Such studies will uncover the educational, economic, and emotional costs of the Covid-19 pandemic. During 2020, no policy recommendation suggested reopening schools or arranging examinations during the emergency.
In fact, the United Nations reported that some countries opened schools and colleges, only to close them again after a resurgence of the virus. The International Association of Universities (IAU) is an official partner of UNESCO that acts as the global voice of higher education institutions and organizations from around the world. The IAU planned to carry out three global surveys on the impact of Covid-19 on universities and other higher education institutions. IAU's first Global Survey Report (Marinoni et al., 2020) was conducted to capture a description of the worldwide disruption caused by Covid-19 in higher education; the results were analyzed both at the global level and at the regional level in four regions of the world (Africa, the Americas, Asia & Pacific, and Europe). According to the survey, 80% of respondents believed that Covid-19 would have an impact on the enrollment numbers of the new academic year. Respondents from Asia & Pacific were the most negative: 85% of them believed that Covid-19 would have a major negative impact on their enrollment numbers, since college-leaving examinations were on hold and the dropout rate after college might rise. The World Bank Group conducted a rapid assessment of post-secondary education disruption due to Covid-19 (Bank, 2020). The survey suggested flexible adaptations of admission and examination protocols for the incoming academic year to ensure a healthy higher education community during the crisis. Nonetheless, IAU's report provided motivation for remodeling public policy during the pandemic. Two-thirds of 424 higher education institutions across the world reported that their senior management and faculty had been consulted by public or government officials in the context of public policies related to Covid-19 (Marinoni et al., 2020).
This indicates that much current research at higher education institutions is focused on public policy regarding the Covid-19 pandemic and that this research is being recognized by the respective governments. Before the partition of India and Pakistan in 1947, the education system of the Indian subcontinent was governed by the British colonials (Rahman et al., 2010). When Bangladesh became independent from Pakistan in 1971, its education system was restructured under the direction of Dr. Qudrat-e-Khuda (Rahman et al., 2010). Since then, letter grading has been adopted in the assessment of student performance in all phases of secondary school. Higher education in Bangladesh consists of general, technical, engineering, agriculture, business, and medical courses. The minimal criterion for higher education admission is the Higher Secondary Certificate (HSC) or equivalent. The entire secondary education is a seven-year program divided into three stages: 3 years of junior secondary (grades 6-8), 2 years of senior secondary (grades 9-10), and 2 years of higher secondary (grades 11-12). The completion of the junior, senior, and higher secondary stages is assessed by the Junior School Certificate (JSC) exam, the Secondary School Certificate (SSC) exam, and the Higher Secondary Certificate (HSC) exam, respectively. Figure 2 illustrates the grade scale from 0 to 5 on the X-axis, while the Y-axis represents the passing years. The three horizontal lines in Figure 2 reflect the three phases of the secondary school system. The letter grades at the JSC, SSC, and HSC levels with respect to the corresponding scores are listed in Table 1. HSC candidates come from all corners of the country, including towns and rural areas. Hence, the college graduation exam is a high-stakes exam, as it is organized countrywide by the education board.
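A score-to-letter-grade mapping of the kind listed in Table 1 can be sketched as a small lookup. Since Table 1 is not reproduced here, the band boundaries and grade points below are illustrative assumptions, not the board's official table.

```python
# Illustrative score-to-letter-grade mapping on the 0-5 grade-point
# scale used at the JSC/SSC/HSC levels. The band boundaries below are
# assumptions for illustration; the official bands appear in Table 1.
GRADE_BANDS = [  # (minimum score, letter grade, grade point)
    (80, "A+", 5.00),
    (70, "A",  4.00),
    (60, "A-", 3.50),
    (50, "B",  3.00),
    (40, "C",  2.00),
    (33, "D",  1.00),
]

def letter_grade(score):
    """Return the (letter, grade point) pair for a 0-100 subject score."""
    for cutoff, letter, grade_point in GRADE_BANDS:
        if score >= cutoff:
            return letter, grade_point
    return "F", 0.00

print(letter_grade(85))  # ('A+', 5.0)
print(letter_grade(30))  # ('F', 0.0)
```

Such a discretization is what turns a raw subject score into the categorical grade labels that the appraisal model later predicts.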
The universal exam-centric assessment method is indispensable for the E-9 countries with large education bodies. During the crisis of the coronavirus pandemic, classroom teaching was hosted online. However, more than one million candidates were waiting to take the HSC exam and proceed to the next stage of life, whilst another million from the previous level were added to the queue. As an action to mitigate the educational burden, the Board of Intermediate and Secondary Education of Bangladesh published the HSC result of the year 2020 by simply averaging the JSC and SSC grades from the previous academic record (Xinhua). The model can be understood from the diagram in Figure 3. Unlike most mathematical models, the structure has parameters {w0, w1}, a functional form {average}, and variables {JSC, SSC}, with the HSC result as output. In this article, the weighted-mean model of the E-9 member country is referred to as the baseline model. According to the baseline model in Figure 3, the HSC result is the output of the average function of Equation (1), HSC = w0 · JSC + w1 · SSC, where w0 = 0.25 and w1 = 0.75 (the weights of the weighted mean summing to one). The baseline model loses the non-linear association of HSC results with other discriminatory variables and is thereby likely to under-fit the true distribution of real-world observations (Aho et al., 2014). The baseline model, illustrated in Figure 3, assumes that the real world operates deterministically. This means that an HSC candidate's subject grades tend to occur somewhere between the corresponding JSC and SSC grades. In contrast, this research assumes that the real world is stochastic and that two or more candidates with the same JSC and SSC grades can obtain significantly different HSC grades. The subject-wise formula for the All India Senior School Certificate Examination (AISSCE) has a constant term X_f, the best performance grade in Class X, similar to an intercept in a linear equation.
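The baseline weighted-mean model of Equation (1) reduces to a two-parameter function, sketched below. Only w0 = 0.25 is stated in the text; w1 = 0.75 is an assumption that follows from the weights of a weighted mean summing to one.

```python
# Sketch of the baseline model of Equation (1): the HSC grade is a
# weighted mean of the earlier JSC and SSC grades. The paper states
# w0 = 0.25; w1 = 0.75 is assumed so that the weights sum to one.
W0, W1 = 0.25, 0.75

def baseline_hsc(jsc_gpa, ssc_gpa):
    """Deterministic weighted-mean appraisal on the 0-5 GPA scale."""
    return W0 * jsc_gpa + W1 * ssc_gpa

# Two candidates with identical JSC/SSC grades always receive the same
# HSC grade -- the deterministic assumption this research argues against.
print(baseline_hsc(4.5, 5.0))  # 0.25*4.5 + 0.75*5.0 = 4.875
```

The sketch makes the under-fitting argument concrete: the output depends on exactly two inputs, so any variation among students with equal JSC and SSC grades is invisible to the model.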
Independent of group transition or any irregular event, each subject grade is awarded as a function of X_f, which may benefit a large number of candidates while being detrimental for those who could not perform well in Class X but deserve better in the AISSCE. The AISSCE subject scale is between [...]. The AISSCE and HSC assessments were designed considering only the regular type of student, who performs similarly in both the senior and higher secondary schools. Hence, the baseline model is inadequate to compute the results of all candidates. To address the concerns about the baseline model, a human committee was formed to provide recommendations about irregular candidates and more suitable parameter values {w0, w1} for the baseline model (Hossain, 2020). However, the committee was not allowed to reject the baseline model (Mamun, 2020); that is, the functional form had to remain as in Equation (1). AI-driven applications are rapidly being employed in different fields of higher education, where computer systems automatically analyze digital data and provide recommended actions. Many higher education institutions have already merged learning analytics with their existing systems. Learning analytics explores the history of students' administrative data, learning activities, etc. to provide insights about the past and make predictions about the future. Generally, recommendation systems provide advising services for students as a sort of chatbot (Wang et al., 2021). Chatbots are built using speech recognition and natural language processing. For example, in Germany chatbots assist undergraduate students in their choice of subjects and answer queries regarding courses and course units. It has also become possible to automate admission procedures, where an AI model assesses a candidate's portfolio and forecasts whether the candidate meets the prerequisites to enroll in a university course.
Such a model is trained to understand the pattern between the portfolio, the admission decision, and degree completion from historical data, based on past admission decisions made by the human committee. An analogous application is early-warning systems that predict the dropout risk of a student and provide an opportunity for educators to help individuals overcome their deficiencies. As numerous fields of secondary and post-secondary education are adopting autonomous decision-making methods, the reputation of institutions that use such systems primarily relies on the fairness of the process. Even if AI systems are accurate, they should never violate fair decision-making, legal norms, or ethical principles. Some input data may unwittingly introduce a prejudice against individuals. For example, an automated classification system in higher education will breach equality if it discriminates against individuals based on their family or religion. In essence, the AI model has to be developed such that it avoids any infringement of the constitutional norms of the given region. Policy-makers must employ an AI system in higher education only after the risk of losing people's trust and loyalty has been prevented. A recent survey among the Dutch adult population (Helberger et al., 2020) found that a greater number of respondents were optimistic regarding the fairness of automated decision-making with AI: 54% considered AI a fairer decision-maker than human decision-makers, whereas only 33% believed human-only decisions were fairer than AI. About 9% believed the fairness depends on the context, 3% did not consider either decision-maker fairer than the other, and 1% thought AI and humans should work together. The first building block of the appraisal model is an artificial neural network.
Neural network classifiers with a cross-entropy cost function and sufficient sample data can produce outputs that are good estimates of Bayesian probabilities (Richard and Lippmann, 1991). Another building block, the Restricted Boltzmann Machine (RBM), was proposed in (Salakhutdinov et al., 2007). An RBM with an arbitrarily large number of neurons has been demonstrated to be a universal model for complex data distributions (Larochelle and Bengio, 2008). By stacking several RBMs, a deep learning model can be created, and in many cases this topology outperforms the typical feed-forward network. For instance, (Hassan et al., 2019) extract abstractions of physiological signals through three stacked RBMs to find the complex relationship of human emotion with the input signals. Conditioned on independent variables, binary outcomes are produced using a sigmoid function in logistic regression (Bonney, 1987). A logistic regression model projects the independent variables into a one-dimensional space, which goes into the squashing function to produce binary outcomes. Multinomial logistic regression modeling is suggested over statistical modeling to identify anomaly intrusion (Wang, 2005). AI model interpretability is indispensable for understanding how each feature contributes to the final outcome. However, machine learning models with a complex architecture cannot explain their predictions. Therefore, the SHAP (SHapley Additive exPlanations) framework was used to address this tension between accuracy and interpretability (Lundberg and Lee, 2017). SHAP measures feature importance on the same scale as the predictions, which makes it easier for humans to interpret. A perceptive interpretability framework (Tjoa and Guan, 2020) has more usage in the field of medical AI or high-stakes decision-making AI to achieve responsible AI. The mean Shapley values, or base values, of features can be useful to interpret variable importance for all data points (Bosch, 2021).
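The additivity property that puts SHAP attributions on the same scale as the prediction can be illustrated with a toy linear model, where the exact Shapley value of feature j (assuming independent features) has the closed form phi_j = w_j * (x_j - E[x_j]). The weights and data below are invented for illustration and are unrelated to the paper's pipeline.

```python
# Toy illustration of SHAP's additivity on a linear model: the base
# value (mean prediction over a background dataset) plus the feature
# attributions reproduces the model's prediction exactly.
weights = [0.6, 1.2]          # hypothetical model coefficients
bias = 0.5
background = [[3.0, 1.0],     # background dataset defining E[x_j]
              [1.0, 3.0]]

def predict(x):
    return bias + sum(w * xi for w, xi in zip(weights, x))

def shap_values(x):
    """Exact Shapley values for a linear model: w_j * (x_j - mean_j)."""
    means = [sum(col) / len(background) for col in zip(*background)]
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, means)]

x = [4.0, 2.0]
base_value = sum(predict(row) for row in background) / len(background)
phi = shap_values(x)
# Additivity: base value + sum of attributions equals the prediction.
print(abs(base_value + sum(phi) - predict(x)) < 1e-9)  # True
```

For the deep pipeline of this paper, no such closed form exists; the SHAP framework approximates the same attributions for arbitrary models.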
The Turing test is a popular method to determine a machine's ability to exhibit intelligent behavior. Alan Turing, in his seminal paper (Turing, 1950), first coined the concept of the Turing test in terms of an imitation game, a quantified approach in the quest to determine whether a machine can think. There are three entities in the imitation game: an interrogator and two participants, X and Y, where one is a human participant and the other is a machine. X and Y both perform an activity as instructed by the interrogator, such as attempting a math problem, cooking and serving a dish, performing a medical diagnosis, writing computer programs, summarizing articles, driving a vehicle, etc. The objective of the machine is to perform the assigned task with adequate intelligence such that its outcome is barely distinguishable from the human one. In the case of an intelligent machine, the interrogator will not have more than a 70% chance of making the right identification. The exam grade appraiser model in this research is considered the machine in a Turing test using the same game setting described by (Turing, 1950). In essence, upon inserting a student identifier, the associated transcript of the high-stakes examination will be shown on the interrogator's output device. The interrogator will receive two transcripts, one produced by the machine and another obtained from the human evaluation process. The interrogator will see the retrieved grade sheet for the first time and will have no knowledge of how the student performed in the submitted exam. However, the interrogator can access the previous portfolios of students and the distribution of student features. The proposed model is designed with the hypothesis that it can produce outcomes similar to what is exhibited in the real world. Therefore, the proposed model is trained to imitate the transcript generated from the real world where students take an exam.
To persuade the interrogator, the transcripts from sources X and Y have to be very similar, so that they look as if they come from the same universe. In contrast, a large dissimilarity will make the interrogator believe that one of the transcripts was generated by a rule-based system. The appraisal machine's endeavor is mathematically described by the objective function in Equation (3), where the output for each course is a one-hot encoded vector, the true labels and predictions are denoted y_c,i and ŷ_c,i respectively, m denotes the number of candidates, and k is the number of courses taken by each candidate. A machine that outputs a constant grade for all students will be easily distinguishable as a machine-generated transcript. Similarly, simple arithmetic operations such as averaging previous grades, the maximum/median/mode of institutional performance, or a function of junior and secondary school grades will, first, make the two transcripts in front of the interrogator highly dissimilar and, second, because the interrogator has access to the portfolio information and distributions, cause such assessment formulas to fail the Turing test. More than one million (1,365,789) candidates registered for the HSC exam for the 2020 session, as summarized in Table 2. The first five digits of an institution's identifier are therefore used as encoded information about the institution's location. A more detailed overview of feature construction is given in Section 3.3. Events of detrimental appraisal are stored as a training dataset for the logistic detriment classifier. Only the detection of a detriment effect is required for the modification of the output distribution. Once θ has converged, manually eliminating a feature from the flattened array is allowed before publishing the grade on the transcript. Feature contributions are revealed with the additive feature explainer. A complex model with many parameters is proven to fit data better than a simple model with few parameters (Myung, 2003).
Therefore, unlike the baseline model, which contains a small number of parameters, the proposed machine learning pipeline contains nearly a million model weights, denoted θ, and initially the model parameters carry no prior distributional assumption about the true distribution. Model outputs are referred to using the y-hat notation ŷ, generated upon providing input arguments (θ, x) to the appraisal model. Following a tweak or dynamic modification of the model parameters, a different probability distribution is generated each time (Myung, 2003, eq. (1)). The appraisal model is allowed to tweak its θ as long as the model moves closer to zero error. The appraisal model finds the adjustment of θ that reduces the cost function while fitting the underlying distribution of the provided high-stakes exam performance. Given m candidates from previous sessions, there are a feature vector X = (x_1, x_2, ..., x_m) and an observed output vector Y = (y_1, y_2, ..., y_m). A random sample from the population of past candidates is given to the model during the training, or tweaking, phase to adjust its θ so that the model produces a probability distribution f_p(y|x, θ) that is most likely the underlying true distribution. The optimal parameter search during the tweaking phase is formulated as Maximum Likelihood Estimation (Aldrich, 1997). The appraisal model for the nationwide exam candidates learns to appraise with only one objective function, which expresses the deviation of appraisals from the actual outcome. No prior bias toward any particular group of candidates is introduced, since such a bias would invalidate the appraisal model. For categorical outcomes, the objective function for the appraisal model is the categorical cross-entropy function provided in Equation (3). Several choices of the objective function L(θ) are provided in Equations (4) and (5), specifically for numeric outcomes. Equation (5) allows the policy-makers to set a penalty term as a hyper-parameter.
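As a minimal sketch of the categorical cross-entropy objective referenced as Equation (3), here simplified to one course per candidate (the paper's version additionally averages over the k courses):

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    # mean negative log-likelihood over candidates; y_true holds one-hot
    # grade vectors, y_pred the model's predicted probability vectors
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        total -= sum(t * math.log(p) for t, p in zip(yt, yp) if t > 0)
    return total / len(y_true)
```

Minimizing this quantity over θ is equivalent to the maximum-likelihood formulation described above, since the loss is the negative log-likelihood of the observed grades.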
Policy-makers can decide to add a non-detrimental guarantee using a penalized objective function such as Equation (5), where the penalty factor γ ∈ (0.5, 1.0] prevents the model from outputting distributions that are below the true distribution. For equiprobable outcomes, γ motivates the appraisal model to assign the higher marks, which might eliminate the risk of a detrimental effect. On the other hand, when appraising irregular candidates, their portfolio might contain several subject grades of the HSC examination. Hence, the past HSC history is provided along with the portfolio for appraisal of the remaining HSC grades. In terms of appraising irregular candidates, the aim is to appraise some or all HSC subject grades such that the fairness of evaluation is maximized by appraising grades that are most likely the observed real results. Now, to determine whether an output generated by the model is close to the true distribution of previous graduates, an energy state is introduced (in Equation (6), a joint configuration of the portfolio, HSC grades, and a layer of neurons) between the irregular candidates' portfolio and the output labels. A low energy state close to zero is expected to achieve an outcome similar to the real-world data. The model can approximate any true distribution p by approximating q, and the precision of the approximation w.r.t. the true distribution is their difference, measured as relative entropy in Equation (7), where KL(p||q) is zero when q = p and non-negative otherwise. By descending the energy function, the model drives KL(p||q) toward zero. The aim, therefore, is to find the optimal configuration of θ in Equation (8) that reduces the gap between the energy of the observed data and the produced data. The objective function L(θ) for irregular candidates deals with a free energy function associated with Equations (6) and (9).
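The relative entropy of Equation (7) can be sketched directly for discrete grade distributions; this is an illustrative stdlib implementation, not the study's code:

```python
import math

def kl_divergence(p, q):
    # relative entropy KL(p || q) between discrete grade distributions,
    # in the spirit of Equation (7); zero iff the distributions coincide
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note that KL divergence is asymmetric and assumes q assigns non-zero probability wherever p does, which holds for softmax outputs.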
Finally, the objective function for the case where partial outcomes are known is defined as in Equation (10), which was introduced with the Conditional Restricted Boltzmann Machine (Mnih et al., 2011). The core unit of the proposed architecture is the artificial neuron, loosely modeled on the building block of the human brain. Billions of neurons form a human brain that is capable of multiple tasks. In this context, the artificial brain is specifically designed to meet the goal of the identified research objective. The arithmetic block loosely resembles brain neurons, which fire activation signals; such arithmetic blocks are widely used in solving reasoning tasks. At a lower conceptual level, multiple logistic regression models were stacked, one per output head. Once a logistic regression classifier raises a detriment flag, the probability distribution of the output layer is modified: the grade assigned by the backbone deep learning model is discarded on the grounds that it has a higher chance of being a detrimental assessment. The autonomous evaluation process will then predict a higher grade than the previously appraised grade. In this later process, some of the input factors may need to be multiplied by zero in order to disable the effect of those particular features. An example of the outcome probability correction after detriment flag detection is shown in Figure 8, where the subject grade was awarded the letter grade 'C' by the backbone model. Following the detriment flag detection, the higher grade with the next-highest probability, 'A-', is chosen as the output grade. Once the iterative correction process is done, the outcomes can be generated by applying the soft-max function to the distinct probabilities. From the facts and the data exploration, it can be observed that a candidate's portfolio of the higher secondary examination in Bangladesh must inherit one category from each branch of the qualitative feature space, as shown in Figure 9.
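The correction step illustrated by the Figure 8 example can be sketched as follows. This is a hypothetical reconstruction of the rule (discard the flagged argmax grade, re-award the higher grade carrying the next-highest probability), not the authors' exact procedure:

```python
GRADES = ['A+', 'A', 'A-', 'B', 'C', 'D']  # ordered from highest to lowest

def correct_detriment(probs):
    # hypothetical sketch: when the detriment flag fires, discard the
    # backbone's argmax grade and re-award the strictly higher grade
    # that carries the next-highest softmax probability
    flagged = max(range(len(probs)), key=probs.__getitem__)
    higher = probs[:flagged]  # probabilities of strictly higher grades
    if not higher:
        return GRADES[flagged]  # already the top grade; nothing to raise
    best = max(range(len(higher)), key=higher.__getitem__)
    return GRADES[best]
```

With probabilities peaking at 'C' and 'A-' holding the second-highest mass, the function reproduces the Figure 8 outcome of awarding 'A-'.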
A candidate can be registered to take the public exam from an institute or privately. Students who could not pass one or more subjects in the previous session re-register in the following year for improvement or retake the entire test. Categories with binary labels, e.g., educational version and registration type, are encoded into a single binary bit where 0 and 1 represent the distinct labels of the specific category, as shown in Equation (12). In particular, the qualitative feature space is initially selected by conducting Student's t-test, retaining features with a significant P-value (Brevard and Ricketts, 1996). Recent studies have shown that prior academic achievement, socioeconomic conditions, and high school characteristics are important predictors of academic performance (Rodríguez-Hernández et al., 2021b; Rizvi et al., 2019). Besides the qualitative properties mentioned in Figure 9, topographical information, and particularly the migration event, is identified as an impactful variable due to its significant P-value. This feature did not require additional pre-processing to incorporate, because a unique educational institution identifier, or EIIN, is a 6-digit number that exists on JSC, SSC, and HSC transcripts. Topographical information such as postal code, village, road, and district is encoded within the first 5 digits of the EIIN number. Since the EIIN code does not represent a numeric value, the five-digit decimal code is converted into a (10 × 5) matrix E whose columns are indicator vectors, as in Equation (13): the entry e_d,j is 1 if the j-th digit of the code equals d and 0 otherwise. However, the privately registered candidates in Figure 9 do not appear in the exam from any institution. An option for making the model understand that a candidate is privately registered is to use a custom predefined constant EIIN, e.g., EIIN 99999, which is non-existent and can be represented with the matrix E.
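The indicator-matrix encoding of Equation (13) can be sketched as a small function (an illustrative reconstruction; the column ordering in the paper's matrix may differ):

```python
def eiin_to_matrix(eiin):
    # encode the first five digits of a 6-digit EIIN as a 10x5 matrix E
    # whose columns are one-hot indicator vectors, one per digit
    digits = [int(d) for d in str(eiin).zfill(6)[:5]]
    E = [[0] * 5 for _ in range(10)]
    for col, d in enumerate(digits):
        E[d][col] = 1
    return E
```

Each column sums to one, so the encoding is a valid set of indicator vectors regardless of the digit values, including a reserved sentinel code for privately registered candidates.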
Finally, other variables such as gender, age, family members, and place of birth did not show a significant impact in the t-test; hence, those features were eliminated from the student portfolio. The mapping of grade points is shown in Table 1. The baseline model discards the "Fail" grade, and the proposed model does the same by considering the six passing grades as the output. Quantitative representations of subject-wise marks from the JSC and SSC are clipped to the range [33, 100], and the subject-wise output marks of the higher secondary examination are likewise clipped to the pass-mark range [33, 100]. When training the appraisal model to appraise numeric scores, it is efficient to scale the range down to [0, 1] using Equation (14) in order to achieve faster convergence (Wan, 2019). The final result sheet of the HSC examination carries letter grades for seven subjects along with a cumulative grade point average, based on a scale generated from the range of total marks obtained out of one hundred. Since Higher Secondary Examination transcripts in Bangladesh contain the letter grades shown in Table 1, in this research the JSC and SSC subject grades are provided on ordinal scales, and the labels produced by the appraisal model are likewise categorical. Thus, each candidate is a data point in a high-dimensional space derived from the qualitative and quantitative features of the candidate portfolio. The training data of this study contains no (1) erroneous, (2) missing, or (3) incomplete attribute values. Consequently, the training data is free of the sources of "attribute noise" (Zhu and Wu, 2004). The two other sources, responsible for "class noise", are wrongly labeled instances and contradictory examples (Zhu and Wu, 2004). Each training instance of the real-world data is validated by many stakeholders before being published as a record.
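The clip-then-scale step described above can be sketched as a one-liner in the spirit of Equation (14) (an illustrative reconstruction, assuming standard min-max scaling over the pass range):

```python
def scale_marks(marks, lo=33, hi=100):
    # clip subject marks to the pass range [33, 100], then min-max
    # scale to [0, 1] for faster convergence, as in Equation (14)
    clipped = min(max(marks, lo), hi)
    return (clipped - lo) / (hi - lo)
```

The clipping guards against out-of-range inputs before scaling, so the model always sees features in [0, 1].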
Moreover, in the event of a probable mistake in a published grade, the training data has also gone through the real-world student inquiry procedure. Thus, the chances of a mislabeled instance are negligible. A high rate of contradictory examples can often cause the appraisal model to form fallacious decision boundaries. One common issue in classification tasks is noise in the training samples, which may lead to small clusters of one target class appearing in the region of the domain corresponding to another class. If many such data points lie near the decision boundaries, they may hinder the learning algorithm and increase complexity due to over-fitting behavior (Libralon et al., 2009). Fortunately, neural networks only fit the noisy data after many iterations of gradient descent; during the initial training process, the model focuses on the true signal. Applying an early stopping strategy to the appraisal model will provably help the model learn from the true signal (Kingma and Ba, 2015), because stopping the learning process before it over-fits the noisy samples prevents the model from memorizing the boundary data points. Early stopping was set to activate when validation performance started to decrease even while training performance was still improving. Furthermore, pruning the noisy data points during the training process helps the appraisal model learn the task more confidently (Northcutt et al., 2021). Therefore, early stopping (Kingma and Ba, 2015) and confident learning (Northcutt et al., 2021) were applied to suppress the effect of noise during the training process. Although performance is evaluated based on model predictions, the accuracy of the appraisal process highly depends on the correctness of the training samples the model receives.
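The early-stopping rule described above (halt when validation performance degrades while training performance keeps improving) can be sketched framework-free; the function names and patience value here are illustrative, not the study's configuration:

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=3):
    # halt when validation loss stops improving for `patience` epochs,
    # before the model starts memorizing noisy boundary points
    best, wait = float('inf'), 0
    for epoch in range(max_epochs):
        train_step(epoch)          # one pass of gradient descent
        val_loss = validate(epoch)  # held-out performance check
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return epoch + 1, best
```

Driving it with a synthetic validation-loss curve that bottoms out and then rises shows the loop stopping shortly after the minimum rather than running to max_epochs.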
Therefore, it is indispensable to first ensure (I) a typical distribution of class labels across training and testing observations and (II) a true rate of signal in the machine learning data. The first concern, about the class distribution, is discussed in 4.1. Lastly, this study will answer the following questions in 4.3:
• whether the training data had a useful signal in it;
• whether the number of training samples was sufficient for training.
Without such a signal, the model would produce a constant grade for all the candidates or randomness in the appraisal. As a responsible AI model, it is indispensable to make sure that, in the worst case, only a small portion of candidates is treated inequitably due to erroneous appraisal. In addition to the confusion matrix used to scrutinize appraisal model performance, the maximum chance of being a loser due to machine error is sought in each subject using the scale in Figure 12. The heat map in Figure 13 shows that the portion of losers per hundred is significantly low. The heat map is a kind of litmus test for the responsible appraisal model, which will help researchers benchmark their models in this domain. Any rational alternative appraisal model will have some reasoning errors, as the real world is non-stationary. A model with no errors would achieve the highest value of 1.0 in the Fair-Zone of Figure 13. As the unified appraisal model (one architecture and objective function) introduces some deviation for the test data set in Figure 11, the litmus test is conducted by comparing each appraisal with the truly obtained grade in each individual subject using the deviation scale in Figure 12. In the real world, two identically deserving candidates will frequently have a one-letter-grade difference in their transcripts due to a 1- or 2-mark difference in the exam script. The force plot in Figure 14 shows the output probability of getting an A+ in a particular subject.
The Explainer process, i.e., the SHAP method, determines an expected probability output E[f(z)] for each output node, which is set as the base value in the force plot (Lundberg and Lee, 2017, Fig. 1). E[f(z)] is the output if no features of the current output ŷ_c,i were known. In this example, the deciding factors for an A+ pushed the probability slightly above the base value. Input contributions for all letter grades (A+, A, A-, B, C and D) can be produced in parallel. For simplicity, only the top three features pushing the probability higher or lower than the base value are kept in Figure 14, while the original interactive plot includes more than 400 attributes. The fairness ratio and the KL divergence between the real outcomes and the model predictions, reported in Table 3, indicate that the implemented model achieved reasonable performance in rationally appraising the high-stakes assessment. To discover the presence of a signal, a random dataset with the same dimensions as the original training data was produced by shuffling the input vectors, with the aim of training the appraisal model on it. If the input matrix x has no signal, then the model will output the grade most often observed during the training procedure. How this new appraisal model performed on real-world test data is shown by the confusion matrices for the obligatory courses in Figure 15. The appraisal on test data revealed that the model trained on the random dataset could not learn from input signals: the confusion matrices for the compulsory subjects in Figure 15 show that the appraisal model assigned maximum probability to the most frequently encountered grade point. For example, as in the letter-grade distribution in Figure 10, A- appeared most often as the target label for the Bangla subject. Therefore, the new model naively predicted an A- grade for Bangla, as shown in Figure 15, to minimize the categorical cross-entropy loss.
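The behavior of the signal-free model described above (collapsing to the most frequent grade) corresponds to a simple majority-class baseline, which can be sketched as follows; the function name is illustrative:

```python
from collections import Counter

def majority_baseline(labels):
    # the naive predictor a signal-free model collapses to: always award
    # the most frequent grade observed during training; returns that grade
    # and the accuracy it achieves on the training labels
    grade, count = Counter(labels).most_common(1)[0]
    return grade, count / len(labels)
```

Comparing a trained model's accuracy against this baseline is a quick check that the input features actually carry a signal rather than the model merely memorizing the label marginals.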
Consequently, the erroneous appraisal of the Bangla course would turn a substantial number of graduates into gainers (those who would actually attain lower than A-) and losers (those who would attain higher than A-). Likewise, for the other two compulsory subjects, English and ICT, the new model awarded the most frequently observed letter grades. Thus, the appraisal model that learned to predict the HSC grades from the synthesized training dataset failed to rationally appraise the three compulsory subjects: Bangla, English, and ICT. In contrast, the appraisal model linked with the original training data did not maximize its soft-max probability based only on the most frequent grade, as illustrated in 4.2; therefore, the model trained with real-world historical data could appraise the exam grades from the pattern discovered between grades and features. To ensure the quality of the training data, this study followed the iterative procedure (Angluin and Laird, 1988, Table 1) to determine the noise rate η_b and compared the required number of training samples m for obtaining an appraisal model with a high probability that the appraised grades are not too different from the true outcomes (Angluin and Laird, 1988, Theorem 2). Firstly, the noise determination procedure outputs the rate of disagreement, i.e., the proportion of disagreement with θ during the initial training process of the appraisal model, as an estimator for the noise η_b. The noisy examples were identified using the confidence joint (Northcutt et al., 2021, Figure 1). The iterative search procedure converged on a small fraction of η_b = 0.375. Then, with the noise rate in the training sample quantified, the lower bound on the required number of observations was determined using (Angluin and Laird, 1988, eq. (1)), given in Equation (15). The right-hand side of the inequality in Equation (15) involves N, the size of the hypothesis space, i.e., the finite set of appraisal rules.
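The sample-size bound can be evaluated directly. This sketch assumes Equation (15) instantiates the standard noise-tolerant bound from Angluin and Laird (1988), m ≥ 2 ln(2N/δ) / (ε²(1 − 2η_b)²); the exact form used in the paper may differ:

```python
from math import ceil, log

def min_samples(eps, delta, eta_b, N):
    # assumed noise-tolerant PAC lower bound (Angluin & Laird, 1988):
    # m >= 2 * ln(2N / delta) / (eps^2 * (1 - 2*eta_b)^2)
    return ceil(2 * log(2 * N / delta) / (eps ** 2 * (1 - 2 * eta_b) ** 2))
```

As expected, a higher noise rate η_b inflates the required number of samples, while a large hypothesis space N contributes only logarithmically.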
The value of N is a large number, but it makes only a small contribution to the sample size due to the log operation. The training samples were approximately four-fold larger than the minimum bound in Equation (15). If Equation (15) were not satisfied, a new η_b would be generated with the noisy instances discarded; in the worst case, a trade-off would have to be made by increasing the tolerance level ϵ. In response to Covid-19 in 2021, some of the E-9 member countries followed the weighted-average method to appraise 12th-grade students without an exam. The baseline model considers only two parameters: the associated grade points of the Junior School Certificate and the Secondary School Certificate. However, the method is not inclusive, in the sense that the process is not applicable to candidates who are irregular or have migrated to a different study group. Perceived fairness was one pitfall of the baseline model. For instance, if compulsory subject 1 is appraised with the weighted-average method with w_ssc_bangla = 0.75 and w_jsc_bangla = 0.25, then ≥20.12% of the candidates get grades higher than the +1 deviation (at least 20.12% gainers). Moreover, ≥3.27% of candidates become losers in the first compulsory subject (Bangla) if appraised by the weighted-average method. In compulsory subject 2 (English), ≥40% of students are given grades higher than +1 grade point, which makes the distribution of appraised grades substantially dissimilar to that of the original transcripts. The unbounded gainer and loser proportions in each subject increase the risk of deviating greatly from the true distribution. Another disadvantage of the baseline model is the ceiling on HSC grade points. To illustrate, a student who obtained 3.5 in the SSC and 3.5 in the JSC achieves 3.5 in the HSC, which is the maximum obtainable (3.5 × 0.75 + 3.5 × 0.25) from the weighted-average model.
But it is natural for a student with a 3.5 in previous board exams to obtain an A or A+ later in the HSC exam. Therefore, setting a ceiling on higher secondary results violates the stochastic nature of the real world. This section compares the national-level distributions of the cumulative grade point average generated by the baseline model and the proposed model with the ground-truth distribution. Deviation, or dissimilarity, from the CGPA distribution of the 2019 passing year is analyzed using the KL divergence metric. In Table 4, the proposed model produces a distribution curve closer to the real-world curve, as the model learned to capture complex patterns in its latent space. On the other hand, the baseline model, which considers only JSC and SSC grade points, generated a distribution curve with a KL divergence of 1.19 from the real-world curve. In the comparison shown in Table 4, a lower relative entropy, or KL divergence, indicates a better match with the reference distribution, the true distribution of average cumulative grade points. Therefore, appraisal through the proposed machine learning model imitates the real-world distribution better than the baseline model, which will prevent an anomalous distribution of HSC examination results. The proposed model therefore retains transparency, accountability, and fairness at the individual level as well as at the aggregate level. If any national education board wants to implement an alternative appraisal system, this study can help in benchmarking their solutions against the proposed model. Moreover, policy-makers and the public will be informed about the level of fairness they can expect from the proposed appraisal model, which will eventually decrease policy-makers' skepticism and the public's mistrust or fear of AI-driven transcript generators.
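The baseline weighted-average appraisal and its ceiling effect, discussed earlier, can be sketched as a minimal function (weights w_ssc = 0.75 and w_jsc = 0.25 as in the text; the function name is illustrative):

```python
def weighted_average_grade(ssc_gp, jsc_gp, w_ssc=0.75, w_jsc=0.25):
    # baseline appraisal: HSC grade point as a weighted mean of the two
    # prior board results; since the weights sum to 1, the output can
    # never exceed the better of the two inputs -- the ceiling effect
    return w_ssc * ssc_gp + w_jsc * jsc_gp
```

Because the output is a convex combination of the inputs, a candidate with 3.5 in both prior boards can never be appraised above 3.5, reproducing the criticism above.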
As a thought experiment, consider the worst-case scenario in which the lockdown is extended for another couple of months, or another crisis such as a locust outbreak, earthquake, or cyclone arrives. Will education policy experts be able to equitably protect the nation's future workforce from academic loss without creating an adverse effect? To achieve a unified and responsible AI appraisal system, this article presents a framework addressing the critical task of alternatively appraising high-stakes exam candidates during an emergency, guaranteeing that appraisal of candidates from quantitative and qualitative features is attainable with high fairness and minimal risk. Given the findings of this study, it is important to understand the reliability of AI-driven assessment tools in the future. As with the majority of studies, the design of the current study is subject to limitations. A major limitation in applying the methodology could be the inconsistency of the historical data. A recent policy change in the assessment will likely make the previous portfolio information obsolete, resulting in fewer training data points. A further limitation to adopting the proposed model is that, as opposed to an established assessment system, there are not enough institutional structures and policies for an alternative emerging model (Clarke, 2011). The authors perceive that the limitation of not being accurate 100% of the time arises because this is the maximum accuracy achievable with the architecture. Therefore, future directions will include meta-learning algorithms to dynamically generate the architecture of the appraisal model. Moreover, a future endeavor is to create a knowledge base from the trained model's latent variables so that policy-makers can identify the factors responsible for higher secondary exam performance and improve them.
This study sets out an AI agent achieving maximum fairness and the lowest detrimental effect as an appraisal system for high-stakes exams. Any national emergency creates a barrier to arranging high-stakes exams in large education systems, such as those of the E-9 countries, causing candidates' progress to halt. To ensure fairness and equity for the entire batch of candidates, appraising subject-wise grades using a responsible AI model with strong predictive skills can be a rational policy alternative to examination. This research suggests explainable AI as an equitable substitute for high-stakes exams during crisis situations. The rigorous empirical research scrutinizes the strength of the proposed inclusive computerized system, which was developed to appraise transcripts of a higher secondary standardized exam. The universal function approximation technique lies at the core of the appraisal model. The appraisal model will be useful even when there is no need to assess candidates with machine learning; predicting high-stakes exam performance beforehand and ensuring extra care can be one use case. Not only will a rational machine learning model come in handy when policy-makers look for an alternative to large-scale exam arrangements, but it will also build trust in AI and data science responses to epidemics that can mitigate potential harm. The appraisal framework provides high flexibility for policy-makers in choosing output types, portfolio features, loss functions, activation functions, and other hyper-parameters associated with the framework. In a particular case, if the authorities decide to hold exams on a reduced number of subjects and appraise the entire transcript on that basis, the proposed method can leverage the fairness of appraisal by generating the remaining grades considering all the existing information as input. This trained model can further be used as a checkpoint for transfer learning.
References (titles as recovered from extraction; author and year details were lost):
Model selection for ecologists: the worldviews of AIC and BIC
R. A. Fisher and the making of maximum likelihood 1912-1922
Predicting academic success in higher education: literature review and best practices
Learning from noisy examples
The COVID-19 crisis response (2020)
Representation learning: a review and new perspectives
Logistic regression for dependent binary observations
Identifying supportive student factors for mindset interventions: a two-model machine learning approach
Residence of college students affects dietary intake, physical activity, and serum lipid levels
Sequential, typological, and academic dynamics of self-regulated learners: learning analytics of an undergraduate chemistry online course. Computers and Education: Artificial Intelligence 2, 100024
Application and theory gaps during the rise of artificial intelligence in education. Computers and Education: Artificial Intelligence 1, 100002
Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: a retrospective of all volumes of Computers & Education
Framework for building an effective student assessment system: READ/SABER working paper
In a pandemic, everyone gets an asterisk
Engagement detection in online learning: a review
Adaptive appearance model tracking for still-to-video face recognition
A study on COVID-19: impacting Indian education
Rules of business
Enabling teachers to explore grade patterns to identify individual needs and promote fairer student assessment
Policy brief: education during COVID-19 and beyond
Op-ed: when you are an HSC student in 2020
Human emotion recognition using deep belief network architecture
Who is the fairest of them all? Public attitudes and expectations regarding automated decision-making
Multilayer feedforward networks are universal approximators
HSC 2020 cancelled: you can't make everyone happy
The long-run educational cost of World War II
Adam: a method for stochastic optimization
Explainable automated essay scoring: deep learning really has pedagogical value
Classification using discriminative restricted Boltzmann machines
Comparison of learning analytics and educational data mining: a topic modeling approach. Computers and Education: Artificial Intelligence 2, 100016
The strategy of model building in population biology
Pre-processing for noise detection in gene expression classification data
A unified approach to interpreting model predictions
HSC exams cancelled: will the batch of 2020 suffer because of it?
The impact of COVID-19 on higher education around the world
Conditional restricted Boltzmann machines for structured output prediction
Identifying reliable predictors of educational outcomes through machine-learning predictive modeling
Tutorial on maximum likelihood estimation
Boy kills himself over poor marks in class 10 board exams
CBSE 10th result 2021: Delhi HC to hear plea seeking modification in assessment formula
Parents, students flag concerns over CBSE, CISCE result calculation formula
Covid vaccine tracker: when will a coronavirus vaccine be ready?
Confident learning: estimating uncertainty in dataset labels
Activation functions: comparison of trends in practice and research for deep learning
INARA: intelligent exoplanet atmospheric retrieval, a machine learning retrieval framework with a data set of 3 million simulated exoplanet atmospheric spectra
Simulating the potential impacts of COVID-19 school closures on schooling and learning outcomes: a set of global estimates
Historical development of secondary education in Bangladesh: colonial period to 21st century
Neural network classifiers estimate Bayesian a posteriori probabilities
The role of demographics in online learning: a decision tree based approach
Artificial neural networks in academic performance prediction: systematic implementation and predictor evaluation. Computers and Education: Artificial Intelligence 2, 100018
Open data interface (ODI) for secondary school education
Restricted Boltzmann machines for collaborative filtering
The SDG-Education 2030 Steering Committee recommendations for COVID-19 education response
Mastering the game of Go with deep neural networks and tree search
Management of scoring sessions in alternative assessment: the computer-assisted scoring approach
Explained: what is CBSE's formula for evaluating class XII students' results?
A survey on explainable artificial intelligence (XAI): toward medical XAI
Computing machinery and intelligence
Managing high-stakes exams and assessments during the COVID-19 pandemic
Influence of feature scaling on convergence of gradient iterative algorithm
Directions of the 100 most cited chatbot-related human behavior research: a review of academic publications. Computers and Education: Artificial Intelligence 2, 100023
A multinomial logistic regression modeling approach for anomaly intrusion detection
Bangladesh cancels major public examination amid COVID-19 fears
Human-centered artificial intelligence in education: seeing the invisible through the visible. Computers and Education: Artificial Intelligence 2, 100008
AI technologies for education: recent research & future directions. Computers and Education: Artificial Intelligence 2, 100025
Class noise vs. attribute noise: a quantitative study

Highlights
• A machine learning model is introduced as an alternative for appraising transcripts of high-stakes examinations.
• It exploits both quantitative and qualitative raw features and generalizes the underlying distribution.
• The established automated process could appraise rational grades for the Bangladesh Higher Secondary Board examination 97% of the time without academic detriment.
• Further detrimental effects can be suppressed by interpreting model predictions and disabling responsible features.

Declaration of competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.