key: cord-0917901-5kbbwzcp
authors: De Pretis, Francesco; Landes, Jürgen
title: EA(3): A softmax algorithm for evidence appraisal aggregation
date: 2021-06-17
journal: PLoS One
DOI: 10.1371/journal.pone.0253057
sha: 4aea8c9e282703ddc2afbd56cf378f8ba15fbd3c
doc_id: 917901
cord_uid: 5kbbwzcp

Real World Evidence (RWE) and its uses are playing a growing role in medical research and inference. Prominently, the 21st Century Cures Act, approved in 2016 by the US Congress, permits the introduction of RWE for the purpose of risk-benefit assessments of medical interventions. However, appraising the quality of RWE and determining its inferential strength are, more often than not, thorny problems, because evidence production methodologies may suffer from multiple imperfections. This raises the problem of how to aggregate multiple appraised imperfections and perform inference with RWE. In this article, we thus develop an evidence appraisal aggregation algorithm called EA(3). Our algorithm employs the softmax function, a generalisation of the logistic function to multiple dimensions, which is popular in several fields: statistics, mathematical physics and artificial intelligence. We prove that EA(3) has a number of desirable properties for appraising RWE and we show how the aggregated evidence appraisals computed by EA(3) can support causal inferences based on RWE within a Bayesian decision making framework. We also discuss features and limitations of our approach and how to overcome some shortcomings. We conclude with a look ahead at the use of RWE.

Real World Evidence (RWE) [1] is one of the new frontiers of medical research and inference and attracts growing interest in academic and industrial research. RWE comprises observational data obtained outside the context of Randomised Controlled Trials (RCTs) which are produced during routine clinical practice. On a broader understanding, any source of information that is related to medications and not directly retrievable from RCTs may be regarded as a potential generator of RWE, e.g. social networks [2] .

Despite being known for a long time and in some cases applied as an informative support in the drug approval process [3] (e.g. the anticoagulant Rivaroxaban [4] ), RWE has recently been brought to the fore by the US Congress with Pub.L. 114-255 (21st Century Cures Act), which in 2016 modified the Food and Drug Administration (FDA) procedures for licensing medications. The act allows pharmaceutical companies, under certain conditions, to provide "data summaries" and RWE such as observational studies, insurance claims data, patient input, and anecdotal data rather than RCT data for drug approval purposes. After the turn to RCTs as the gold standard in the drug approval process, this is the first act allowing for uses of RWE in the drug approval process in an industrialised country. This move also sparked the interest of the European Medicines Agency (EMA) and the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) [5, 6] .

The use and standards for proper use of RWE have ignited a serious debate in the scientific community [7] [8] [9] [10] [11] ; for a special issue see [12] . Proponents of the use of RWE point to the fact that RWE can be produced much faster than conducting and analysing a clinical study [13, 14] . This allows pharmaceutical companies to obtain approval for new products or new indications (off-label use) quicker, which can benefit companies as well as patients [15] . Faster and safe drug approval procedures are particularly relevant during the current Covid-19 pandemic [16, 17] . However, many researchers have expressed concerns related to data quality, validity, reliability and sensitivity to capture the exposure, adverse effects and outcomes of interest when using RWE [18] [19] [20] [21] [22] . Using RWE for medical inference presents methodological challenges [23] , though some efforts have been carried out to efficiently merge evidence coming from RCTs and observational studies [24] [25] [26] , also for causal inference purposes [27, 28] . Attempts to provide a framework for appraising the quality of evidence for medical inference have been going on since long before the current debate on uses of RWE began, e.g. GRADE [29, 30] . However, these frameworks do not provide a clear way to quantitatively solve this problem nor do they lend themselves to an integration into a standard decision making framework [31] [32] [33] [34] .

The US National Research Council has issued the following call: "The risk-of-bias assessment of individual studies should be carried forward and incorporated into the evaluation of evidence among data streams" [35] . This point appears crucial to us for appraising RWE. There is, however, no commonly accepted methodology for carrying out RWE appraisals. A possible solution to this problem is to split the appraisal of RWE into multiple more manageable appraisals along different dimensions and then to aggregate these appraisals. However, how can we aggregate these multiple appraisals? And how can we then use this aggregate for decision making?

We here address these two questions by proposing an algorithm based on (1) the softmax function (a generalisation of the logistic function to multiple dimensions) as an instrumental tool for aggregation within (2) a Bayesian decision making framework. While the softmax function was initially introduced in statistical mechanics, it has now found widespread application in machine learning and artificial intelligence methods at large [36] [37] [38] . Bayesian approaches, in turn, are increasing in popularity, in part due to their intuitive incorporation of information and updating procedures.

Drawing on these traditions, we present an Evidence Appraisal Aggregation Algorithm, EA 3 (suggested pronunciation: "EA-cube"), which compresses a generic vector of evidence appraisals along multiple dimensions into a scalar. Roughly, input data (evidence appraisals) are first processed through the softmax function and then aggregated by the application of a geometric mean. EA 3 is then shown to have some desirable properties. It offers the possibility of emphasising or de-emphasising the maximum values associated with each evidence appraisal via a cautiousness parameter (the thermodynamic β of softmax). Furthermore, EA 3 allows one to incorporate the importance of the dimensions of appraisal. Finally, we show how EA 3 can be used to support assessments of causal hypotheses within a Bayesian decision making approach.

To the best of our knowledge, EA 3 represents one of the first attempts to solve the problem of evidence appraisal through an easy-to-exploit numerical measure [39, 40] . In line with the previously mentioned US Environmental Protection Agency (EPA) recommendations [35], our appraisals can be understood as assessments of the risk of bias, but also of other possible methodological flaws. We offer a formalisation of such assessments and facilitate a tracking of these assessments through evidence aggregation to the calculation of probabilities of hypotheses of interest. Our proposal thus aims to be "transparent, reproducible and scientifically defensible" as suggested by the EPA [35, p. 79] .

The rest of this article is organised as follows: in Materials and Methods, we introduce the softmax function as well as a motivating example and then present our softmax algorithm in some detail and discuss its properties. The Results section puts forward a method to apply EA 3 in Bayesian decision making problems. A final Discussion outlines advantages and limitations of our approach and points to important future work.

In this section, we first introduce the softmax function, then we present the EA 3 algorithm and discuss its properties.

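The softmax function (more correctly softargmax, also known as normalised exponential function) is a function from $\mathbb{R}^k$ to $\mathbb{R}^k$ ($k \in \mathbb{N}$) mapping a vector $\tilde{A} = \langle a_1, \dots, a_l, \dots, a_k \rangle$ to a vector $\sigma(\tilde{A})$ as follows:

$$\sigma(\tilde{A})_l = \frac{\exp(\beta \cdot a_l)}{\sum_{i=1}^{k} \exp(\beta \cdot a_i)}, \qquad l = 1, \dots, k,$$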

where β is a real number different from zero, see Table 1 for an overview of key notation. We now briefly discuss some of the properties of the softmax function (henceforth softmax) and recall some of its applications to mathematical physics, probability theory, statistics, machine learning and artificial intelligence.

Normalisation. While the input vector may contain any real number, the output of softmax is normalised in the sense that all components of the output vector are in the unit interval and sum to one. The output vector can hence be understood as a probability distribution over k elementary events where the probabilities are proportional to the exponential of the input vector.
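As a minimal illustration of the definition and of the normalisation property, the following Python sketch implements softmax with the parameter β. The function name `softmax` and the example vector are ours and purely illustrative; the largest exponent is subtracted before exponentiation only for numerical stability, which does not change the output.

```python
import math

def softmax(values, beta=1.0):
    """Softmax of `values` with thermodynamic parameter `beta`."""
    # Subtracting the largest exponent avoids overflow and leaves the result unchanged.
    shift = max(beta * a for a in values)
    exps = [math.exp(beta * a - shift) for a in values]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative input; any real numbers are allowed.
output = softmax([-1.0, 0.5, 3.0], beta=2.0)
print(output)
print(sum(output))  # equals one up to floating-point rounding
```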

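Translational invariance. Softmax is invariant under translations: let $\tilde{A}'$ be obtained from a vector $\tilde{A}$ by adding a constant $c \in \mathbb{R}$ to every component of $\tilde{A}$, then

$$\sigma(\tilde{A}')_l = \frac{\exp(\beta \cdot (a_l + c))}{\sum_{i=1}^{k} \exp(\beta \cdot (a_i + c))} = \frac{\exp(\beta \cdot c) \cdot \exp(\beta \cdot a_l)}{\exp(\beta \cdot c) \cdot \sum_{i=1}^{k} \exp(\beta \cdot a_i)} = \sigma(\tilde{A})_l .$$

So, if $\tilde{A}'$ is obtained from $\tilde{A}$ via translation, then $\sigma(\tilde{A}') = \sigma(\tilde{A})$.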

Softmax is not scale invariant. It is easy to prove that multiplying every component of an input vector $\tilde{A}$ by some constant c does not, in general, return the same output vector.
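Both properties can be checked numerically. The sketch below uses an illustrative input vector and constants of our own choosing, with β fixed to 1; it compares the softmax of a vector with the softmax of its translated and of its scaled version.

```python
import math

def softmax(values, beta=1.0):
    exps = [math.exp(beta * a) for a in values]
    total = sum(exps)
    return [e / total for e in exps]

def max_gap(u, w):
    """Largest componentwise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(u, w))

A = [0.1, 0.4, 0.7]                  # illustrative input
translated = [a + 0.2 for a in A]    # add c = 0.2 to every component
scaled = [2.0 * a for a in A]        # multiply every component by c = 2

translation_gap = max_gap(softmax(A), softmax(translated))
scale_gap = max_gap(softmax(A), softmax(scaled))
print(translation_gap)  # zero up to floating-point rounding
print(scale_gap)        # clearly positive
```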

The β parameter allows one to change the base of the exponential function. This choice permits one to emphasise or de-emphasise the maximum value belonging to the input vector: the greater β, the greater the maximal component of the output vector. For β = +∞ the output vector vanishes everywhere except at those components at which the input vector is the greatest (in this case, softmax becomes an argmax). Conversely, for β = −∞ the output vector vanishes everywhere except at those components at which the input vector is the smallest (argmin). In the limit case β = 0 the output vector is the uniform probability distribution, resulting in a loss of all the information contained in the input.
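The effect of β can be seen on a small example. In the sketch below the input vector is illustrative, and large finite values of β stand in for the limits β = ±∞.

```python
import math

def softmax(values, beta):
    # Shift by the largest exponent for numerical stability.
    shift = max(beta * a for a in values)
    exps = [math.exp(beta * a - shift) for a in values]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative input: the maximum sits at index 1, the minimum at index 0.
A = [0.2, 0.9, 0.5]
for beta in (-100.0, 0.0, 1.0, 10.0, 100.0):
    print(beta, [round(p, 4) for p in softmax(A, beta)])
```

For β = 0 all three components coincide, for large positive β almost all mass sits on the maximal input, and for large negative β almost all mass sits on the minimal input.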

The first use of softmax goes back to 1868 when Ludwig Boltzmann introduced the function for modelling ideal gases. Today, softmax is known as the Boltzmann-Gibbs distribution in statistical mechanics, where the index set $\{1, \dots, l, \dots, k\}$ represents the microstates of a classical thermodynamic system in thermal equilibrium, $a_l$ is the energy of the state l and β the inverse temperature (thermodynamic β) [41, 42] . Beyond the representation of physical systems, the distribution and this modelling have paved the way for some noteworthy algorithms based on the same statistical mechanics assumptions, e.g. Gibbs sampling [43] .

The normalisation property has led to applications of softmax in probability theory to represent a categorical distribution [44] and in statistics to define a classification method through the so-called softmax regression, an equivalent to multinomial logistic regression [45, 46] . This property has been widely used also in medical statistics [47] [48] [49] .

In recent years, two fields have seen a rising interest in softmax: machine learning and artificial intelligence [50, 51]. The term softmax itself was first introduced by Bridle for neural networks, where it is usually employed as an activation function to normalise data [52] . In computer science, applications of softmax are varied: classification methods (again, softmax regression) for supervised and unsupervised learning [53-55], computer vision [56-58], reinforcement learning [59] [60] [61] and hardware design [62] , to name just some current areas of application. Additionally, a considerable number of conference papers attests to the popularity of softmax and its proposed variants [63] [64] [65] [66] [67] .

Consider the hypothesis that paracetamol use causes asthma in children [68] . Only relatively few RCTs have been conducted that could help us determine the truth of this hypothesis [69] . RWE will thus have to (!) play an important role in treatment and prescription decisions that have to be made now, that is before (meta-analyses of) RCTs can deliver conclusive evidence [70] .

RWE for and against this causal hypothesis is, for example, obtained from relatively large surveys [71] [72] [73] [74] [75] [76] [77] . Such evidence is clearly less confirmatory than well-run RCTs and we hence need to find a way to appraise this evidence. De Pretis et al. (2019) [78] suggested that such surveys can be appraised along three independent and relevant dimensions: the duration of the surveyed time period, the sample size and the methodology for adjustment and stratification. Appraisals are represented by numbers in the unit interval where 1 represents a perfect appraisal (e.g. perfect methodology for adjustment and stratification) and 0 represents the worst possible score (e.g. tiny sample size). These three appraisals are then aggregated by taking their arithmetic mean.

Simply taking the arithmetic mean is problematic for a number of reasons. Firstly, the dimensions of appraisal are all given the same weight. This problem can be easily addressed by moving to a weighted mean where the weights represent the importance of the dimensions of appraisal. Secondly, every weighted mean of three equal numbers c is equal to c. That is, multiple imperfections of RWE of equal degree c lead to an overall appraisal equal to c. We think that the overall appraisal ought to be less than c, since multiple imperfections are worse than just one imperfection. Thirdly, a decision maker has no flexibility in the aggregation of appraisals to represent his/her attitude towards the question "how much worse are multiple imperfections than a single imperfection?". We hence think that a suitable aggregation is not idempotent.
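The second point, idempotence of weighted means, can be made concrete with a few lines of Python. The weights and the common appraisal value c below are hypothetical and only serve as an illustration.

```python
def weighted_mean(appraisals, weights):
    """Weighted mean of the appraisals; weights need not sum to one."""
    return sum(w * a for w, a in zip(weights, appraisals)) / sum(weights)

c = 0.6                       # hypothetical common appraisal
weights = [0.5, 0.3, 0.2]     # hypothetical importance weights
aggregate = weighted_mean([c, c, c], weights)
print(aggregate)  # equal to c up to rounding, whatever the weights are
```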

We next present and explain the EA 3 algorithm to aggregate evidence appraisals, which addresses these points.

We assume that evidence is appraised in k relevant, pairwise different and mutually independent dimensions represented by a normalised appraisal vector

$$\tilde{A} = \langle a_1, \dots, a_l, \dots, a_k \rangle \in [0, 1]^k ,$$

see the E-Synthesis subsection for a suggested set of dimensions for appraisal. We do not commit to a fixed number of evidence appraisals (in agreement with multi criteria decision making in medicine [33] and risk prediction for multiple outcomes [79] ). We also make use of a given ranking of the importance of the different dimensions of appraisal. We represent this ranking by a vector $\tilde{R} = \langle r_1, \dots, r_l, \dots, r_k \rangle \in (0, 1)^k$ such that $\sum_{i=1}^{k} r_i = 1$. The more important the appraisal $a_l$, the greater the value $r_l$. EA 3 proceeds in 5 steps listed in Table 2 and explained below:

Table 2. EA 3 algorithm structure with objectives described for each step.

Step | Objective
1 | Appraisals weighted by ranking
2 | Softmax with a positive thermodynamic β
3 | Rescaling
4 | Geometric averaging
5 | Normalisation to unit interval

https://doi.org/10.1371/journal.pone.0253057.t002

1. Appraisals weighted by ranking:

$$\langle r_1 \cdot a_1, \dots, r_k \cdot a_k \rangle$$

Description. Step 1 weighs every appraisal by its importance.

2. Softmax with a positive thermodynamic β:

$$\sigma(\langle r_1 \cdot a_1, \dots, r_k \cdot a_k \rangle)_l = \frac{\exp(\beta \cdot r_l \cdot a_l)}{\sum_{i=1}^{k} \exp(\beta \cdot r_i \cdot a_i)}$$

Description. Step 2 applies, as advertised above, softmax with a parameter β representing cautiousness, cf. the discussion following Proposition 1.

3. Rescaling:

$$\sigma(\langle r_1 \cdot a_1, \dots, r_k \cdot a_k \rangle) \cdot (\tilde{A} \times \tilde{R})$$

where × denotes the scalar product between two vectors of the same length k.

Description. Step 3 rescales the softmax of Step 2 by the aggregated ranked appraisals. Softmax has the well-known property that it is invariant under uniform pointwise translations, $\sigma(\langle a_1, \dots, a_k \rangle) = \sigma(\langle a_1 + c, \dots, a_k + c \rangle)$. For our application, this property means that if a study $S_2$ is appraised to be better than a study $S_1$ in every dimension by the same amount c, then $\sigma(S_1) = \sigma(S_2)$. This is clearly undesirable, as a uniformly better study should score better than a uniformly worse study. Multiplying by $\tilde{A} \times \tilde{R}$ is a simple and intuitive way of ensuring that EA 3 is not invariant under uniform pointwise translations. Not only is our algorithm sensitive to pointwise translations, it is even the case that every improvement of an appraisal leads to a greater number $v_f$ (see Proposition 2).

4. Geometric averaging:

$$v := \frac{\prod_{i=1}^{k} \exp(\beta \cdot r_i \cdot a_i)}{\sum_{i=1}^{k} \exp(\beta \cdot r_i \cdot a_i)} \cdot (\tilde{A} \times \tilde{R}) = \frac{\exp(\beta \cdot (\tilde{A} \times \tilde{R}))}{\sum_{i=1}^{k} \exp(\beta \cdot r_i \cdot a_i)} \cdot (\tilde{A} \times \tilde{R})$$

Description. Step 4 compresses the vector to a scalar. To achieve this task, we apply a geometric mean, as is routinely done in machine learning for comparing items with a different number of properties and numerical ranges [80] [81] [82] .

5. Normalisation to unit interval:

Step 5 ensures that the final output is in the unit interval. We find this normalisation convenient for our application and point out that this step might not be necessary for other applications.

To summarise, given two k-tuples $(\tilde{A}, \tilde{R}) \in [0, 1]^k \times (0, 1)^k$ as input, the algorithm returns a single number in the unit interval as output. We can understand EA 3 as a map and thus write $\mathrm{EA}^3(\tilde{A}, \tilde{R}) \in [0, 1]$ (see Corollary 1 for a proof that EA 3 maps into the unit interval).
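The five steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `ea3` is ours, Steps 1 to 4 use the closed form $v = \exp(\beta \cdot (\tilde{A} \times \tilde{R})) / \sum_i \exp(\beta \cdot r_i \cdot a_i) \cdot (\tilde{A} \times \tilde{R})$, and Step 5 is assumed to divide $v$ by the value it takes for a perfect study ($\tilde{A} = 1@k$), which is the scalar independent of $\tilde{A}$ that maps a perfect study to 1.

```python
import math

def ea3(appraisals, ranking, beta):
    """Sketch of EA 3: aggregate appraisals in [0, 1] into one number in [0, 1]."""
    if len(appraisals) != len(ranking):
        raise ValueError("appraisals and ranking must have the same length")

    def unnormalised(a_vec):
        weighted = [r * a for r, a in zip(ranking, a_vec)]   # Step 1
        exps = [math.exp(beta * w) for w in weighted]        # Step 2 (numerators)
        dot = sum(weighted)                                  # scalar product of Step 3
        return math.exp(beta * dot) / sum(exps) * dot        # Step 4, closed form

    # Step 5 (assumption): divide by the value obtained for a perfect study.
    return unnormalised(appraisals) / unnormalised([1.0] * len(ranking))

equal = [1 / 3, 1 / 3, 1 / 3]                 # equally important dimensions
print(ea3([0.9, 0.6, 0.3], equal, beta=2.0))  # hypothetical appraisals
print(ea3([1.0, 1.0, 1.0], equal, beta=2.0))  # perfect study: 1.0
print(ea3([0.0, 0.0, 0.0], equal, beta=2.0))  # worst study: 0.0
```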

Denoting by c@k a vector of length k with all components equal to c, we find the following.

Proposition 1. For all $c \in [0, 1]$ and all $\beta > 0$ it holds that

Proof. The computation is straightforward:
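The computation can be sketched as follows. The display is a reconstruction from the closed form of $v$ in Step 4 and rests on one assumption: that the normalisation in Step 5 divides $v$ by the value it takes for a perfect study, $\tilde{A} = 1@k$. With $\tilde{A} = c@k$ and $\tilde{R} = \tfrac{1}{k}@k$ we have $\tilde{A} \times \tilde{R} = c$ and $\sum_{i} \exp(\beta \cdot r_i \cdot a_i) = k \cdot \exp(\beta c / k)$.

```latex
\begin{aligned}
v\bigl(c@k, \tfrac{1}{k}@k\bigr)
  &= \frac{\exp(\beta c)}{k \exp(\beta c / k)} \cdot c , \\
\mathrm{EA}^3\bigl(c@k, \tfrac{1}{k}@k\bigr)
  &= \frac{v\bigl(c@k, \tfrac{1}{k}@k\bigr)}{v\bigl(1@k, \tfrac{1}{k}@k\bigr)}
   = c \cdot \frac{\exp(\beta c)}{\exp(\beta c / k)} \cdot \frac{\exp(\beta / k)}{\exp(\beta)}
   = c \cdot \exp\Bigl(\beta \cdot (c - 1) \cdot \tfrac{k-1}{k}\Bigr) \le c .
\end{aligned}
```

Under this reading, the exponent is zero for β = 0 and, whenever c < 1 and k ≥ 2, tends to minus infinity as β grows.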

This observation demonstrates the role of β and how the simplest ranking scheme (all dimensions are ranked equally) acts in the simple case in which all appraisals are equal to c, see Fig 1 for an illustration. The greater β, the smaller $v_f$ and the further the curves plotted in Fig 1 are from the identity map. This means that a study with all appraisals equal to c will have an aggregate, $v_f$, of less than c. In other words, RWE that is less than perfect in more than one respect has an even lower aggregated appraisal. This seems right: studies which might produce poor evidence for multiple reasons are considered to produce very poor evidence. It is for this reason that we require that β > 0. β = +∞ represents maximal cautiousness: if the study is not perfect in all respects (c < 1), then $\mathrm{EA}^3(c@k, \tfrac{1}{k}@k) = 0$. β = 0 represents maximal optimism (and in our eyes overly strong optimism) in that $\mathrm{EA}^3(c@k, \tfrac{1}{k}@k) = c$: a study with a number of imperfections (c < 1) is overall as good as a study with just a single imperfection.

Furthermore, note that if β � 0, then Eq (2) […]

Proposition 2. For every ranking $\tilde{R}$, $\mathrm{EA}^3(\cdot, \tilde{R})$ is strictly increasing in every appraisal $a_l$.

Proof. It suffices to verify that all the partial derivatives of $\mathrm{EA}^3(\cdot, \tilde{R})$ with respect to the $a_l$ are strictly positive for all $a_l \in [0, 1]$. Since the normalisation step is a multiplication by a scalar which does not depend on $\tilde{A}$, it suffices to verify that all the partial derivatives of $v$ with respect to the $a_l$ are strictly positive for all $a_l \in [0, 1]$.

We now compute that this is indeed the case:
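A sketch of this computation, using the closed form of $v$ from Step 4 with the shorthand $D := \tilde{A} \times \tilde{R} = \sum_i r_i a_i$ and $S := \sum_i \exp(\beta \cdot r_i \cdot a_i)$ (both abbreviations are ours):

```latex
\begin{aligned}
v &= \frac{D \, e^{\beta D}}{S} , \\
\frac{\partial v}{\partial a_l}
  &= \frac{r_l \, e^{\beta D} (1 + \beta D)}{S}
     - \frac{D \, e^{\beta D} \cdot \beta \, r_l \, e^{\beta r_l a_l}}{S^2}
   = \frac{r_l \, e^{\beta D}}{S}
     \Bigl[ 1 + \beta D \Bigl( 1 - \frac{e^{\beta r_l a_l}}{S} \Bigr) \Bigr] > 0 .
\end{aligned}
```

The bracket is at least one because $r_l > 0$, $\beta > 0$, $D \ge 0$ and, for $k \ge 2$, the single summand $e^{\beta r_l a_l}$ is strictly smaller than the full sum $S$.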

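The sharp inequality follows from the fact that exp(β · […]

Corollary 1. EA 3 maps into the unit interval.

Proof. Applying Proposition 2, it suffices to show that $\mathrm{EA}^3(0@k, \tilde{R}) = 0$ and $\mathrm{EA}^3(1@k, \tilde{R}) = 1$. The first condition follows from $0@k \times \tilde{R} = 0$ and the second from $1@k \times \tilde{R} = 1$.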

Also note that if $\tilde{A} = 0@k$, then $\tilde{A} \times \tilde{R} = 0$ and thus $v_f = 0$. If $\tilde{A} \neq 0@k$, then $\tilde{A} \times \tilde{R} > 0$ and thus $v_f > 0$.

Fig 1. […] is the second factor within the scope of the exponential function. The smaller parameter β and the greater the number of appraisals (the greater k), the closer $\mathrm{EA}^3(c@k, \tfrac{1}{k}@k)$ gets to the identity map. This graph clearly displays the monotonicity of these functions.

https://doi.org/10.1371/journal.pone.0253057.g001

Similarly, if $\tilde{A} = 1@k$, then $\tilde{A} \times \tilde{R} = 1$ and thus $v_f = 1$. If $\tilde{A} \neq 1@k$, then $\tilde{A} \times \tilde{R} < 1$ and thus $v_f < 1$.
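Monotonicity and the two boundary cases can also be checked numerically. The sketch below reuses a compact version of the aggregation; the ranking and the appraisals are hypothetical, and Step 5 is again assumed to divide by the score of a perfect study.

```python
import math

def ea3(appraisals, ranking, beta):
    """EA 3 sketch; Step 5 is assumed to divide by the score of a perfect study."""
    def v(a_vec):
        weighted = [r * a for r, a in zip(ranking, a_vec)]
        dot = sum(weighted)
        return math.exp(beta * dot) / sum(math.exp(beta * w) for w in weighted) * dot
    return v(appraisals) / v([1.0] * len(ranking))

ranking = [0.5, 0.3, 0.2]   # hypothetical ranking, sums to one
base = [0.4, 0.7, 0.2]      # hypothetical appraisals
beta = 3.0
for l in range(len(base)):
    improved = list(base)
    improved[l] += 0.1      # improve a single appraisal
    print(l, ea3(base, ranking, beta), ea3(improved, ranking, beta))
```

In each printed row the second score exceeds the first, in line with Proposition 2.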

Returning to the suspected causal link between paracetamol use and asthma, we now compare the aggregated appraisals of several RWE-providing surveys involving children, previously considered in [78] , according to De Pretis et al. (2019) [78] and according to EA 3 . See Table 3 for the formulae and Figs 2 and 3 for a graphical comparison under the assumption of equally important appraisal dimensions, $\tilde{R} = \tfrac{1}{3}@3$. We note that for β = 0 both approaches agree and that the aggregate appraisal computed with EA 3 decreases with increasing cautiousness parameter β.
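The comparison can be reproduced in miniature. The appraisal vector below is hypothetical and is not taken from Table 3; the sketch again assumes that Step 5 divides by the score of a perfect study.

```python
import math

def arithmetic_mean(appraisals):
    return sum(appraisals) / len(appraisals)

def ea3(appraisals, ranking, beta):
    """EA 3 sketch; Step 5 is assumed to divide by the score of a perfect study."""
    def v(a_vec):
        weighted = [r * a for r, a in zip(ranking, a_vec)]
        dot = sum(weighted)
        return math.exp(beta * dot) / sum(math.exp(beta * w) for w in weighted) * dot
    return v(appraisals) / v([1.0] * len(ranking))

appraisals = [0.9, 0.6, 0.3]      # hypothetical appraisals of one survey
ranking = [1 / 3, 1 / 3, 1 / 3]   # equally important dimensions
betas = [0.0, 1.0, 2.0, 5.0]
scores = [ea3(appraisals, ranking, b) for b in betas]

print("arithmetic mean:", arithmetic_mean(appraisals))
for b, s in zip(betas, scores):
    print("beta =", b, "EA 3 =", s)
```

For β = 0 the score coincides with the arithmetic mean, and the scores fall as β increases.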

We are not aware of other approaches of qualitative aggregations of multiple evidence appraisals for medical inference. We hence lack a standard against which to benchmark our proposal. However, there are substantive bodies of literature on aggregating numerically represented judgements and preferences, which, at times, tackle a formally equivalent aggregation problem. A related proposal for medical inference is the GRADE methodology, which puts forward a way to obtain a qualitative confidence rating in hypotheses. The suggestion is to use the lowest confidence ranking for critical outcomes as the aggregate confidence [83] . By contrast, our approach is quantitative and all appraisals contribute to the aggregate.

Another field relevant to our work is the current research on Bayesian hierarchical models for aggregation. In the already mentioned [24, 25] such models are employed to combine different study types in meta-analysis and to account for bias, with the objective of correcting it. Whereas in this article we consider one study and multiple appraisals of bias, the inverse may be considered true in [24] . There, the author employs a bias-correcting Bayesian hierarchical model [84] to combine different study types in meta-analysis. That model is based on a mixture of two random effects distributions, where the first component corresponds to the model of interest and the second component to the hidden bias structure. The resulting model is thus adjusted by the internal validity bias of the studies included in a systematic review.

We now illustrate how EA 3 can be incorporated into the Bayesian decision making framework [85] , in which decisions are based on all the available evidence [86] . In this framework, a decision maker is facing a decision problem in which a number of possible acts are at his/her disposal. However, the decision maker is unsure about the state of the world and thus adopts a prior probability function defined over a finite set of possible worlds, Ω.

All the available evidence is then used to determine a posterior probability function by conditionalising the prior probability function. In order to represent the decision maker's preferences, all pairs of acts and worlds (the possible outcomes) are assigned a utility value in the real numbers. Normatively correct decisions are those which maximise the decision maker's expected utilities, where expectations are calculated with respect to the updated probability function [87] [88] [89] .

One immediate issue in this framework is that it is hard to calculate a posterior probability function. This issue is normally solved by applying Bayes' Theorem (see the following subsection). Bayes' Theorem is ubiquitous in Bayesian analyses and it is straight-forwardly applied, if the evidence can be taken at face value. In medical inference, where evidence cannot be taken at face value, numerous methodological design features and choices (conscious and subconscious) bear on the information a study provides.

Consider a set of exhaustive and mutually exclusive statistical hypotheses $H_1, \dots, H_n$, i.e. the states of the world. Let us denote the available evidence by $E$. Bayes' Theorem then allows us to compute the posterior probability of the hypothesis $H_h$:

$$P(H_h \mid E) = \frac{P(H_h) \cdot P(E \mid H_h)}{\sum_{g=1}^{n} P(H_g) \cdot P(E \mid H_g)}$$

So, the posterior probability can be computed from prior probabilities over hypotheses and conditional probabilities. The prior probabilities are provided by the decision maker's prior beliefs about the state of the world. The conditional probabilities are likelihoods specified by the statistical hypotheses. Hence, computing the posterior probability is a simple exercise in the probability calculus-under the assumption that the conditional probabilities are likelihoods specified by statistical models.
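As a small numerical illustration of this computation, the following sketch applies Bayes' Theorem to three hypotheses. The function name and all probabilities are hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(H_g | E) from priors P(H_g) and likelihoods P(E | H_g)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(E) by the law of total probability
    return [j / evidence for j in joint]

priors = [0.5, 0.3, 0.2]        # hypothetical P(H_1), P(H_2), P(H_3)
likelihoods = [0.1, 0.4, 0.7]   # hypothetical P(E | H_1), P(E | H_2), P(E | H_3)
posterior = bayes_posterior(priors, likelihoods)
print(posterior)
```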

In medical inference problems with RWE, the calculations of Bayes' Theorem remain valid, the statistical models however do not specify the relevant likelihoods for RWE. The challenge hence arises to specify these conditional probabilities. We next show how this can be done via an application of EA 3 .

What should the posterior probabilities $Q(E_1 \mid H)$ look like, given a single study $E_1$? For starters, if the evidence can be taken at face value, $\tilde{A} = 1@k$, then $Q(E_1 \mid H)$ should just be $P(E_1 \mid H)$. If the evidence contains no information whatsoever, $\tilde{A} = 0@k$ and $v_f = 0$, then the posterior $Q(E_1 \mid H)$ should just equal the prior probability $P(E_1)$, so $Q(E_1 \mid H) = P(E_1)$. That is, whether $H$ is true or not does not change the probability of obtaining $E_1$. In all other cases, the posterior probability $Q(E_1 \mid H)$ should be somewhere between the posterior $P(E_1 \mid H)$ and the prior probability $P(E_1)$.

These considerations suggest that $Q(E_1 \mid H)$ may be computed as a weighted mean of the posterior and the prior probability:

$$Q(E_1 \mid H) = v_f \cdot P(E_1 \mid H) + (1 - v_f) \cdot P(E_1). \tag{3}$$

Applying Corollary 1 we see that $Q(E_1 \mid H)$ is different from the prior, if the posterior and the prior are different and $v_f > 0$. From a theoretical point of view, one may interpret the convex combination in Eq (3) as a Jeffrey update [90] . Under this interpretation, $v_f$ is interpreted as the probability that the evidence can be taken at face value and $1 - v_f$ can be interpreted as the probability that the evidence is completely uninformative.
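The convex combination is straightforward to compute. In the sketch below, the function name `discounted_likelihood` and the numbers are ours; the two boundary cases recover $P(E_1 \mid H)$ and $P(E_1)$.

```python
def discounted_likelihood(v_f, p_e_given_h, p_e):
    """Q(E_1 | H) = v_f * P(E_1 | H) + (1 - v_f) * P(E_1)."""
    if not 0.0 <= v_f <= 1.0:
        raise ValueError("v_f must lie in the unit interval")
    return v_f * p_e_given_h + (1.0 - v_f) * p_e

# Hypothetical values: P(E_1 | H) = 0.8 and P(E_1) = 0.3.
print(discounted_likelihood(1.0, 0.8, 0.3))  # evidence at face value: 0.8
print(discounted_likelihood(0.0, 0.8, 0.3))  # uninformative evidence: 0.3
print(discounted_likelihood(0.5, 0.8, 0.3))  # in between
```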

$$\begin{aligned}
Q(H_h \mid E_1) &= \frac{Q(H_h) \cdot Q(E_1 \mid H_h)}{Q(E_1 \mid H_1) Q(H_1) + Q(E_1 \mid H_2) Q(H_2) + \dots + Q(E_1 \mid H_n) Q(H_n)} \\
&= \frac{P(H_h) \cdot Q(E_1 \mid H_h)}{Q(E_1 \mid H_1) P(H_1) + Q(E_1 \mid H_2) P(H_2) + \dots + Q(E_1 \mid H_n) P(H_n)} \\
&= \frac{P(H_h) \cdot \bigl(v_f \cdot P(E_1 \mid H_h) + (1 - v_f) P(E_1)\bigr)}{\sum_{g=1}^{n} P(H_g) \cdot \bigl(v_f \cdot P(E_1 \mid H_g) + (1 - v_f) P(E_1)\bigr)} .
\end{aligned} \tag{4}$$

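The assumption of a single available RWE study is, of course, rather unrealistic. We now show how to deal with multiple available RWE studies, $E = \{E_1, \dots, E_s\}$. We begin by applying EA 3 to every study individually, thus obtaining s-many outputs $v_f^1, \dots, v_f^s$. Under the assumption that the studies have been conducted independently from each other, we can generalise Eq (4) as follows:

$$Q(H_h \mid E) = \frac{P(H_h) \cdot \prod_{j=1}^{s} \bigl(v_f^j \cdot P(E_j \mid H_h) + (1 - v_f^j) P(E_j)\bigr)}{\sum_{g=1}^{n} P(H_g) \cdot \prod_{j=1}^{s} \bigl(v_f^j \cdot P(E_j \mid H_g) + (1 - v_f^j) P(E_j)\bigr)} .$$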
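A minimal sketch of this update for several independently conducted studies is given below. The function name, the data layout and all numbers are hypothetical; the prior probability $P(E_j)$ of each study is obtained here from the priors and likelihoods via the law of total probability.

```python
def posterior_with_appraisals(priors, study_likelihoods, v_fs):
    """Posterior Q(H_g | E_1, ..., E_s) with discounted likelihoods.

    priors[g] = P(H_g); study_likelihoods[j][g] = P(E_j | H_g);
    v_fs[j] = aggregated appraisal (EA 3 output) of study j.
    """
    weights = list(priors)
    for likelihoods, v_f in zip(study_likelihoods, v_fs):
        # P(E_j) under the prior, by the law of total probability.
        p_e = sum(p * l for p, l in zip(priors, likelihoods))
        for g, l in enumerate(likelihoods):
            weights[g] *= v_f * l + (1.0 - v_f) * p_e
    total = sum(weights)
    return [w / total for w in weights]

priors = [0.5, 0.5]                   # hypothetical P(H_1), P(H_2)
studies = [[0.8, 0.2], [0.6, 0.3]]    # hypothetical P(E_j | H_g)
print(posterior_with_appraisals(priors, studies, [0.7, 0.4]))
print(posterior_with_appraisals(priors, studies, [0.0, 0.0]))  # returns the priors
```

With a single study and $P(E_1)$ computed in this way, the denominator reduces to $P(E_1)$, so the result equals $v_f \cdot P(H_h \mid E_1) + (1 - v_f) \cdot P(H_h)$, a convex combination of the standard posterior and the prior.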

E-Synthesis is a Bayesian framework developed for determining probabilities of particular drugs causing a specific adverse reaction [78, [91] [92] [93] [94] [95] . In order to facilitate the inference from real world data to a causal hypothesis, a layer of so-called "indicators" has been inserted between the hypothesis of interest and the data. The indicators have been derived from Hill's Guidelines [96] and serve as (probabilistic) testable consequences of the causal hypothesis. Learning that an indicator is true raises the probability of the causal hypothesis to a degree. For example, learning that there is a correlation between a drug and an adverse effect does not entail that the drug causes an adverse reaction. Nevertheless, the presence of a correlation does increase our suspicion that there indeed might be a causal relationship between a drug and an adverse event.

Evidence for adverse reactions often emerges spontaneously in the form of case reports, and suspected adverse reactions are often confirmed only from observational data [97] . Such RWE is at a high risk of bias and hence the RWE needs to be appraised. E-Synthesis has been designed to incorporate such appraisals of RWE, making their role explicit by formalising them as variables (previously, these variables have been termed "evidential modulator" variables). The following dimensions of appraisal have been suggested within the E-Synthesis framework: sample size, duration of the study, degree of sponsorship bias, degree of adjustment for covariates and the degree of analogy between the study population and the studied population. Randomised studies can also be appraised for how well blinding, randomisation and placebo control were implemented. E-Synthesis was originally intended for philosophical applications; however, it has also recently been developed for more practical matters. As yet, no suggestion has been made of how to aggregate evidence appraisals and how to incorporate these appraisals for decision making. We next show how this can be done for a specific indicator of causation applying EA 3 . Denoting by © the causal hypothesis of a drug D causing a specific adverse drug reaction (ADR) and by Ind an indicator variable, we have for the posterior probability of © for RWE, Q(© | E),

This calculation uses the fact that the causal indicator variable mediates the inference from data to the causal hypothesis © in the technical sense that conditionalisation on it renders the data and © independent.

We now return to the motivating example of determining a probability of the causal hypothesis (©) that paracetamol use causes asthma in children. In the E-Synthesis approach, the Beasley et al. (2011) [77] study is informative about the "rate of growth" indicator, so Ind = RoG. The posterior probability of © (given only this study) is thus computed as:

Using Eq (3) and the suggested conditional probabilities of P(RoG|�) 

We note that in the model of De Pretis et al. (2019) [78] this single study is conclusive evidence that RoG holds, i.e. there does exist a strongly increasing dose-response relationship between paracetamol use in children and severe onset of asthma. This probability is 

In this article, we presented an algorithm to support the assessment of the inferential strength of RWE in order to make sound decisions. We proceeded by considering different dimensions of appraisal and then moved on to aggregate multiple appraisals according to the different dimensions into an aggregate. Subsequently, we showed how such an aggregate can be used within a Bayesian decision making framework. Our formal approach carries forward evidence appraisals, incorporates them into an overall appraisal of the evidence and integrates it into decision making [35] . It also enables sensitivity analyses of these appraisals via variations of $\tilde{A}$, as well as sensitivity analyses of the ranking, via variations of $\tilde{R}$, and of the cautiousness parameter β. Furthermore, our approach is transparent, reproducible and scientifically defensible, thus satisfying the desiderata suggested by the US Environmental Protection Agency [35, p. 79]. While our formal aggregation approach is motivated by the need to appraise RWE for medical inference, the developed algorithm is, in principle, applicable to other aggregation problems, too. Whether it is suitable to a particular problem depends on particular circumstances.

Our approach is limited by the assumptions we made, e.g. we assumed that the dimensions of appraisal are independent of each other and that rankings and appraisals can be represented numerically. If at least one of our assumptions fails to hold in an application, then the theoretical considerations made here might not apply. These limitations may be overcome by applications of multi-criteria decision making methodology [98] .

In future work, we aim to determine empirically supported dimensions for evidence appraisal, calibrate ranking schemes and determine (normatively and/or descriptively) appropriate values of the β-parameter in order to assess the validity and reliability of EA 3 based on actual data [35] . The β-parameter, which represents cautiousness, reflects risk attitudes which can differ from user to user and from application to application. Furthermore, EA 3 reflects the position of a single agent (or of a unanimous committee). In reality, drug approval or withdrawal decisions are a group effort involving experts from different areas (toxicologists, pharmacists, clinicians, statisticians as well as patient representatives [99] ), who have different risk attitudes (different β), different appraisals (different $\tilde{A}$) and different rankings (different $\tilde{R}$). We thus plan to integrate EA 3 into a multi-agent framework which represents different (risk) attitudes, preferences and areas of expertise of stakeholders in drug (un-)safety assessments.

We expect the assessment and use of RWE for medical inference to continue to grow in coming years, drawing on scientific fields in which there are, by the very nature of the investigation, (next to) no randomised studies. For example, in macroeconomics we cannot simply randomly assign countries into different trial arms to learn about the disputed causal relationships between minimum wages and employment [100] and in nutrition science it is not possible to randomise people into drinkers of red wine and non drinkers for a trial lasting several years to learn about the hypothesised causal influences of red wine on health and well-being [101] . Similarly, in pharmacovigilance ADRs may take too long to manifest (years of treatment with olanzapine cause tardive dyskinesia [102] ) or be too rare yet fatal (in some cases, 1 fatality in every 10,000 patients [103]) to be detected by RCTs. We think that the use of RWE for pharmacovigilance and medical inference more widely is an area holding great promise despite 

Real-World Evidence - What Is It and What Can It Tell Us?

Pharmacology and social media: Potentials and biases of web forums for drug mention analysis - case study of France

Use of Real-world Data for New Drug Applications and Line Extensions

Rivaroxaban real-world evidence: Validating safety and effectiveness in clinical practice

Real-World Data for Regulatory Decision Making: Challenges and Possible Solutions for Europe

Trial designs using real-world data: The changing landscape of the regulatory approval process

Multidimensional Evidence Generation and FDA Regulatory Decision Making

Harnessing the Power of Real-World Evidence (RWE): A Checklist to Ensure Regulatory-Grade Data Quality

Real-World Evidence and Real-World Data for Evaluating Drug Safety and Effectiveness

Real world data: an opportunity to supplement existing evidence for the use of long-established medicines in health care decision making

A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. eGEMs

Guest editors' note on special issue on real-world experience and randomized clinical trials

Real-world Evidence versus Randomized Controlled Trial: Clinical Research Based on Electronic Medical Records

Examining the Impact of Real-World Evidence on Medical Product Development

Substantial evidence of effect

Expert opinion on Real World Evidence RWE in drug development and usage

Drug Evaluation during the Covid-19 Pandemic

Multiple Object Extraction from Aerial Imagery with

Reinforcement Learning and Higher-Order Learning in Multi-Agent Finite Games

International Joint Conferences on Artificial Intelligence Organization; 2020

On Passivity and Reinforcement Learning in Finite Games

Hardware Implementation of a Softmax-Like Function for Deep Learning

Removing the Target Network from Deep Q-Networks with the Mellowmax Operator

Model-Free IRL Using Maximum Likelihood Estimation

On Controllable Sparse Alternatives to Softmax

An Alternative Softmax Operator for Reinforcement Learning

Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax

Risk factors for asthma: is prevention possible? The Lancet

Risk of wheezing and asthma exacerbation in children treated with paracetamol versus ibuprofen: a systematic review and meta-analysis of randomised controlled trials

The Association of Acetaminophen and Asthma Prevalence and Severity

The Safety of Acetaminophen and Ibuprofen Among Children Younger Than Two Years Old

Paracetamol sales and atopic disease in children and adults: an ecological analysis

Asthma Morbidity After the Short-Term Use of Ibuprofen in Children

Paracetamol use in pregnancy and wheezing in early childhood

Acetaminophen Use and the Symptoms of Asthma, Allergic Rhinitis and Eczema in Children

The Role of Acetaminophen and Geohelminth Infection on the Incidence of Wheeze and Eczema

Acetaminophen Use and Risk of Asthma, Rhinoconjunctivitis, and Eczema in Adolescents

E-Synthesis: A Bayesian Framework for Causal Assessment in Pharmacosurveillance

Criteria for evaluating risk prediction of multiple outcomes

A New Geometric Mean FMEA Method Based on Information Quality

Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction

The dropout learning algorithm

GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes

The hierarchical metaregression approach and learning from clinical evidence

The Foundations of Statistics

On the Application of Inductive Logic

Scientific Reasoning

Bayesian Philosophy of Science

In Defence of Objective Bayesianism

The Mathematics of Changing One's Mind, via Jeffrey's or via Pearl's Update Rule

Reviewing the Mechanistic Evidence Assessors E-Synthesis and EBM+: A Case Study of Amoxicillin and Drug Reaction with Eosinophilia and Systemic Symptoms (DRESS)

Epistemology of Causal Inference in Pharmacology

New Insights in Computational Methods for Pharmacovigilance: E-Synthesis, a Bayesian Framework for Causal Assessment

Pharmacovigilance as personalized evidence

Artificial intelligence methods for a Bayesian epistemology-powered evidence evaluation

The environment and disease: association or causation

Worldwide withdrawal of medicinal products because of adverse drug reactions: a systematic review and analysis

Multi-criteria decision making methods: A comparative Study. Dordrecht, The Netherlands: Kluwer

The Food and Drug Administration Advisory Committees and Panels: How They Are Applied to the Drug Regulatory Process

Philosophy of Economics

On the evidentiary standards for nutrition advice

Randomised double-blind comparison of the incidence of tardive dyskinesia in patients with schizophrenia during long-term treatment with olanzapine or haloperidol

Drug Induced Liver Injury: Premarketing Clinical Evaluation - Guidance for Industry

The authors are grateful to Bolin Gao (University of Toronto, Canada) for discussing his recent works on the softmax function. The authors would also like to thank Martin Posch (Medical University of Vienna, Austria) for helpful suggestions to improve the manuscript.

Conceptualization: Francesco De Pretis, Jürgen Landes.