title: Neuro-computational mechanisms of action-outcome learning under moral conflict
authors: Fornari, L.; Ioumpa, K.; Nostro, A. D.; Evans, N. J.; De Angelis, L.; Paracampo, R.; Gallo, S.; Spezio, M.; Keysers, C.; Gazzola, V.
date: 2021-09-30
journal: bioRxiv
DOI: 10.1101/2020.06.10.143891

Predicting how actions result in conflicting outcomes for self and others is essential for social functioning. We tested whether Reinforcement Learning Theory captures how participants learn to choose between symbols that define a moral conflict between financial self-gain and other-pain. We tested whether choices are better explained by model-free learning (decisions based on combined historical values of past outcomes), or model-based learning (decisions based on the current value of separately expected outcomes) by including trials in which participants know that either self-gain or other-pain will not be delivered. Some participants favored options benefiting themselves; others favored options preventing other-pain. When the favored outcome was removed, participants instantly altered their choices, suggesting model-based learning. Computational modelling confirmed that choices were best described by model-based learning in which participants track expected values of self-gain and other-pain separately, with an individual valuation parameter capturing their relative weight. This valuation parameter predicted costly helping in an independent task. The expectations of self-gain and other-pain were also biased: the favoured outcome was associated with more differentiated symbol-outcome probability reports than the less favoured outcome.
fMRI helped localize this bias: signals in the pain-observation network covaried with pain prediction errors without linear dependency on individual preferences, while the ventromedial prefrontal cortex contained separable signals covarying with pain prediction errors in ways that did and did not reflect individual preferences.

We often have to learn that certain actions lead to favorable outcomes for us, but harm others, while alternative actions are less favorable for us but avoid or mitigate harms to others (1). Much is already known about the brain structures involved in making moral choices when the relevant action-outcome contingencies are well known (2-9), but how we learn these contingencies remains poorly understood, especially in situations pitting gains to self against losses for others.

Reinforcement learning theory (RLT) has successfully described how individuals learn to benefit themselves (10, 11) and, most recently, how they learn to benefit others (12-15). At the core of reinforcement learning is the notion that we update expected values (EV) of actions via prediction errors (PE), i.e. the differences between actual outcomes and the expected values represented in mind.

Ambiguity in morally relevant action-outcome associations raises specific questions with regard to RLT, especially if outcomes for self and others conflict. If actions benefit the self and harm others, are these conflicting outcomes combined into a common valuational

(M2). M1 instantiates model-free learning in which participants' decisions are based on the history of past reward values, without representation of the nature of outcomes, while M2 instantiates model-based learning with separable representations of the expected outcomes for self and others. For M2, we further compared a variant that scales outcomes based on personal preferences for money vs. shock (M2Out) vs.
a variant that initially tracks expectations independently of personal preferences regarding outcomes, but that introduces weights at the decision phase of the task (M2Dec). We expect all but M0 to perform comparably well during the conflict phase of our experiment. We however expect that if participants use model-based learning under conflict, M2 should outperform M1 during the 11th trial when one outcome is removed. We had no specific predictions about whether M2Dec or M2Out would predict choices at the 11th trial more accurately.

Shocks were made visible to the participants through pre-recorded videos showing facial expressions from the confederate. In the fMRI experiment participants believed these recordings to be part of a real-time video-feed, while in the Online experiment they were aware that the videos had been pre-recorded, but believed that after the task we would select 15 trials and administer the corresponding shocks to another participant in the lab. We used this video-feedback, instead of the symbolic feedback more often used in neuroeconomic paradigms, to explore the neural systems activated in situations where the consequences of our actions are only available from the facial expressions of the people around us, and to remain closer in design to the paradigms used in nonhuman animal models (21).

Finally, we used the neuroimaging data to inform our understanding of how our participants update values in our tasks. Given substantial literature on how individuals learn to maximize gains for the self (22, 23), we focus on how the brain updates values based on shocks to others. In particular, given the conceptual difference between M2Out and M2Dec, we ask whether we can distinguish and localize areas in which BOLD signals covary with prediction errors in ways that do vs. do not depend on wf.
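The conceptual difference between M1, M2Out and M2Dec described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the delta-rule update, the learning rate `lr`, the function names, and the +1/-1 coding of money outcomes are hypothetical choices made here, not the authors' fitted implementation. Only the role of wf (weighting money against shock) and the separation of money and shock expectations in M2 follow the text.

```python
def update(ev, outcome, lr):
    """Delta rule: move ev toward the outcome by a fraction lr of the prediction error."""
    pe = outcome - ev  # prediction error (assumed form)
    return ev + lr * pe, pe


def step_m1(ev, money, shock, wf, lr):
    """M1 (model-free): a single composite value per symbol, outcomes merged before learning."""
    combined = wf * money + (1 - wf) * shock
    new_ev, _ = update(ev, combined, lr)
    return new_ev


def step_m2out(ev_m, ev_s, money, shock, wf, lr):
    """M2Out: separate money and shock values; wf scales the outcomes themselves."""
    ev_m, _ = update(ev_m, wf * money, lr)
    ev_s, _ = update(ev_s, (1 - wf) * shock, lr)
    return ev_m, ev_s


def step_m2dec(ev_m, ev_s, money, shock, lr):
    """M2Dec: separate values learned without weights; wf enters only at decision time."""
    ev_m, _ = update(ev_m, money, lr)
    ev_s, _ = update(ev_s, shock, lr)
    return ev_m, ev_s


def decision_value_m2out(ev_m, ev_s):
    """Weights are already embedded in the learned values."""
    return ev_m + ev_s


def decision_value_m2dec(ev_m, ev_s, wf):
    """Weights are applied when the two expectations are combined for choice."""
    return wf * ev_m + (1 - wf) * ev_s


# One trial with a symbol yielding high money (+1) and a painful shock (-1).
m1 = step_m1(0.0, 1.0, -1.0, wf=0.5, lr=0.5)
m2out = step_m2out(0.0, 0.0, 1.0, -1.0, wf=0.8, lr=0.5)
m2dec = step_m2dec(0.0, 0.0, 1.0, -1.0, lr=0.5)
```

In this sketch, a participant with high wf ends up with a nearly flat shock expectation under M2Out but a fully differentiated one under M2Dec, which is the distinction the 11th-trial test is designed to probe.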
More specifically, given an extensive literature that has outlined a network of brain regions recruited while witnessing painful vs non-painful facial expressions (24-27), and the availability of a multivariate signature to quantify activity in this network (28), we first ask whether activity in this network covaries with prediction errors for shocks, and if so, whether it does so in ways that do or do not depend linearly on wf. Second, we perform a more explorative analysis to localize where in the brain the BOLD signal covaries with pain prediction errors in ways that do or do not significantly depend linearly on wf. Recent reviews identified that the ventromedial prefrontal cortex (vmPFC) has BOLD signals that covary positively with the current value of multiple outcomes, in particular for chosen options

details of the videos). Inter-stimulus and inter-trial intervals were adapted to the fMRI and online situations, and

Participants vary in their preference but most favour one symbol in the first 10 trials of a block

Figure 3A-B shows participants' choices in the Conflict condition (see also Table 1). Using a binomial distribution, with 60 trials for the fMRI experiment (10 trials x 6 blocks), and 120 trials for the Online experiment (10 trials x 12 blocks), we found that in the fMRI experiment about half the participants chose the pain-reducing option above chance, about half were within what would be expected without preference, and just 3 chose the lucrative option above chance. For shorthand, we refer to the choice patterns in these subgroups, divided by the binomial probability, as 'considerate', 'ambiguous' or 'lucrative' preferences. In the Online experiment, participants were evenly subdivided across these three preference subgroups. Figure 1D).
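The binomial grouping described above can be sketched as follows. The significance level `alpha`, the one-sided test in each direction, and the function names are assumptions for illustration; the exact cutoffs used in the paper are not restated in this section.

```python
from math import comb


def upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_count_above_chance(n, alpha=0.05, p=0.5):
    """Smallest count k whose upper-tail probability under chance is at most alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p) <= alpha:
            return k
    raise ValueError("no count reaches the requested alpha for this n")


def classify(pain_reducing, n, alpha=0.05):
    """Label a participant from the number of pain-reducing choices out of n trials."""
    k = min_count_above_chance(n, alpha)
    if pain_reducing >= k:
        return "considerate"
    if n - pain_reducing >= k:  # lucrative choices are the complement
        return "lucrative"
    return "ambiguous"


# Thresholds for the two experiments (60 and 120 Conflict trials).
threshold_fmri = min_count_above_chance(60)
threshold_online = min_count_above_chance(120)
```

Because the threshold depends on `alpha` and on whether the test is one- or two-sided, the values returned here need not match the cutoffs used for the reported groups.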
To probe participants' explicit learning, in the Online experiment we asked them to report how likely each symbol was associated with high-shock and high-money. Overall, participants tended to report higher probabilities for symbols with higher probability, also capturing the difference between the Conflict and NoConflict condition (comparing the N and U shapes in Figure 4A). Because choices are thought to be driven by the difference in expected value across options, we summarized explicit reports as the difference in reported probability for the favored minus the non-favoured symbol, with larger reported differences in the correct direction providing more evidence for (explicit) learning (Figure 4B). We observed that the reported differences in probability across the symbols were different from zero, and in the expected direction in all cases (all BF10>3), showing that as a group, participants with considerate and lucrative preferences also learned something about the association with outcomes that they appear to be less interested in. Second, participants show evidence of noticing the Conflict vs No-Conflict difference, which they would not need to do if they focused exclusively on maximizing money or minimizing shocks: participants with considerate preference (green) reported lower association with high-shock for their preferred (shock-reducing) symbol in the Conflict and No-Conflict blocks (both 3rd and 4th violin below zero), and they reported that their favored symbol had lower high-money probability in the Conflict (1st violin) and higher high-money probability in the No-Conflict (2nd violin) than their non-preferred symbol.
A similar pattern is true for participants with lucrative preference (blue): they reported that their favorite (money-maximizing) symbol had a higher high-money probability than their non-favored symbol after both Conflict and No-Conflict blocks, and that this symbol was more likely in Conflict and less likely in No-Conflict blocks to lead to high-shock than their non-preferred symbol. Participants with ambiguous preference could also report above chance differences in probability for the two symbols in the correct direction in all cases (Figure 4C).

Interestingly, the magnitude of the explicitly reported probability difference was biased in favour of the outcome participants seem to weigh more strongly. When separating participants into preference groups, we see that those choosing the pain-reducing option above chance show a stronger difference, that was also closer to the actual difference, in reported probability for the shock than money outcomes, and those choosing the lucrative option show a stronger difference, that was also closer to the actual difference, for the money than the shock outcomes (difference between the filled green violins in Figure 4B for the considerate preference = 18.2±4.45 s.e.m., W=377.5, BF10=577.22, p<0.001 and between the filled blue violins for the lucrative = 9.2±3.25 s.e.m., W=233, BF10=7.67, p<0.018). Because the thresholds for grouping participants into these three preference groups are somewhat arbitrary, we also examined whether, over all participants, the preference (calculated as proportion of pain-reducing choices) was predictive of their bias, and this was the case (Figure 4D, Pearson's r=0.51; t(77)=5.2, p=1.6 x 10^-6).
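As a quick consistency check on the correlation reported above, the t statistic for a Pearson correlation follows from r and the sample size via t = r·sqrt(n-2)/sqrt(1-r²). The sketch below assumes n=79, which is inferred here from the reported 77 degrees of freedom rather than stated in this section.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for a Pearson correlation r over n observations."""
    return r * sqrt(n - 2) / sqrt(1 - r**2)


# r = 0.51 with 77 degrees of freedom (n = 79, inferred) gives t close to the reported 5.2.
t = t_from_r(0.51, 79)
```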
That ambiguous preference participants could still report above chance probability estimates (Wilcoxon signed-rank, all p<0.001, BF10>100; Figure 4C) suggests that their lack of above chance choice preference was not due to a total lack of symbol-outcome association learning. A lack of clear preference for one outcome over the other seems also to have played a role. Indeed, when the conflict was removed (Figure 4E orange), ambiguous preference participants do demonstrate learning, but choose the favorable option less frequently than the other two groups (Bayesian one-way ANOVA, F(2,76)=15.6, p<0.001; BFincl group=8284.52, Figure 4F). At the single subject level, in No-Conflict blocks, 54% of the ambiguous preference subjects fell below the learning threshold determined using a binomial distribution (71/120 correct choices; red line in Figure 4F), while just 7% of those with considerate and 4% of those with lucrative preference were below this threshold. The ambiguous preference group thus appears to contain a mix of individuals that do not show significant learning in our task even without conflict, and individuals that can learn without conflict but fail to display clear preferences in the presence of conflict.

Conflict condition are not the result of a simple rule 'reset choice when my preferred outcome is removed, but not when my non-preferred outcome is removed', but considers the probability of outcomes associated with the remaining outcome, which are the only difference between Conflict and No-Conflict blocks.

Interestingly, removing the guiding outcome in Conflict trials did not lead to a mirror-symmetrical (i.e. ~80%) preference for the other symbol: the average choice allocation switched from ~80% (considerate) or ~20% (lucrative) pain-reducing choices on the 10th trial to just below, or above, ~50% on the 11th, respectively (Figure 5A).
That the first choice after DropOut is not as polarized as the 10th trial is in line with the bias we observed in the explicit reports (Figure 4B,D): given that the difference in reported outcome probabilities was less pronounced for the less-guiding outcome, one would expect less polarized choices in a decision based on these lesser differences in expected values. For the ambiguous group, there was no robust evidence that removing either quantity led to a change of preference towards the better option for the remaining outcome. It is important to emphasize that the choice on trial 11 occurs before the participant is presented with the single remaining outcome, and thus probes choices exclusively based on expected values that were learned during conflict trials. Supplementary Figure 4 also shows choices on trials 12-20, but as these trials do not contain two potentially conflicting outcomes, but only one outcome, we do not expect our models to adequately represent participants' reasoning and learning over those subsequent trials.

In summary, choices reveal substantial individual variability in preference, but most participants show evidence of some form of learning. Explicit reports and their difference across Conflict and No-Conflict blocks reveal that participants have explicit and separable representations of outcome probability differences across symbols in terms of money and shock, although these representations are more differentiated for the outcome that their choices indicate to bear more weight. Finally, devaluation performed by informing participants that their preferred outcome will no longer be delivered leads to an immediate switch away from their previously preferred option during Conflict.
Together this supports the notion that choices in our conflictual conditions are dominated by model-based learning in which participants represent separable outcomes for money to self and shock to others, with the individual preferences influencing the magnitude of their respective symbol-outcome association. This would suggest that M2Out (Figure 2) might be the computational implementation of RLT learning most suitable to capture our participants' choices in our paradigm.

Devaluation, as implemented in our conflict Drop-Out blocks, is critical to distinguish model-based and model-free learning. To directly compare our neurocomputational models (Figure 2A), we therefore fitted our competing models on the first 10 trials of the Conflict DropOut blocks, and then examined (i) how well they predicted those first 10 choices using an approximation of the leave-one-out information criterion (LOOIC (33)) and (ii) how well they predicted choices on the 11th trial, which was not included in the model fitting, when one of the outcomes is removed. Because the 11th trial was not included in the fit, we can directly quantify the likelihood of the observed choices on the 11th trial under the different models to assess which model best predicts behavior under devaluation. A priori, we expect the four models to make very different predictions on the 11th trial. M0, by design, continues to predict random choices. Because M1 does not have separable representations of money and shock, its predictions for the 11th trial are simply based on the symbol with highest composite expected value (EV) based on the first 10 trials, and thus predicts that choice behavior does not change on the 11th trial. Finally, because the M2 models have separate EV for money and shock, we programmed the model to transform the information that participants receive before they perform the 11th choice (i.e.
one outcome will no longer be delivered), into a revised decision criterion based on the remaining expected value only, without using a wf in their choice, as there is no longer a conflict to resolve (Figure 2B). For M2Out and M2Dec, choice is then dictated for the 11th trial by cat_logit(τ(EV_R(symbol1), EV_R(symbol2))), where subscript R represents the remaining quantity: M (money) or S (shock). We thus expect choices not to change much if the less preferred outcome is removed, and to change significantly if the preferred outcome is removed.

Figure 6A illustrates the predictions of the three learning models (M1, M2Out, M2Dec) as compared with the actual choices for the critical DropOut blocks. For visualization purposes, we did not include M0 and the ambiguous preference participants in this figure, because they do not predict or show the learning behavior we are trying to capture. During the initial 10 trials in which the conflict is present, all three learning models (M1, M2Out, M2Dec) capture the general shape of the learning curve, and can accommodate individual differences in preference number increases for participants with considerate preferences, and fewer pain-reducing choices for participants with lucrative preference. Including all participants (including the ambiguous participants), model comparison over the first 10 trials with the LOOIC confirms that M1, M2Out and M2Dec perform similarly well, i.e. remain within a standard error of one another, and perform better than the random-choice model M0 (Figure 6B, and Supplementary Figure 5 for results that exclude the ambiguous and non-believers groups). Note that the information criterion (IC) scale captures how much information is lost when comparing model predictions with actual choices, and smaller IC values thus characterize models that better describe behavior.
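The 11th-trial decision rule described earlier in this section (a categorical-logit choice over the remaining expected values, with no wf) can be sketched as a softmax. Treating τ as a multiplicative inverse temperature, the argument layout, and the function names are assumptions for illustration and are not taken from the authors' model code.

```python
from math import exp


def choice_probs(values, tau):
    """Softmax over tau-scaled values, i.e. the probabilities a categorical-logit rule implies."""
    scaled = [tau * v for v in values]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def trial11_probs(ev_money, ev_shock, removed, tau):
    """Choice probabilities on trial 11, using only the expected values of the remaining outcome.

    ev_money and ev_shock each hold one learned value per symbol;
    removed names the outcome participants were told will no longer be delivered.
    """
    if removed == "money":
        remaining = ev_shock
    elif removed == "shock":
        remaining = ev_money
    else:
        raise ValueError("removed must be 'money' or 'shock'")
    return choice_probs(remaining, tau)


# Hypothetical learned values: symbol 0 is lucrative, symbol 1 is pain-reducing.
ev_money = [0.6, -0.6]
ev_shock = [-0.2, 0.2]
p_money_removed = trial11_probs(ev_money, ev_shock, "money", tau=5.0)
p_shock_removed = trial11_probs(ev_money, ev_shock, "shock", tau=5.0)
```

With these hypothetical values the smaller spread in `ev_shock` yields a less polarized choice when money is removed than the spread in `ev_money` yields when shocks are removed, mirroring the asymmetry discussed in the text.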
A similar pattern is true for the choices in the fMRI experiment (Supplementary Figure 5c). Next, we used the 11th trial to arbitrate across M1, M2Out and M2Dec. As we saw in Figure 5, participants actually responded to devaluation in ways that depended on their preference. Lucrative preference participants kept their choices unchanged when shocks were removed, but switched their preference to just above 50% pain-reducing choices on average when money was removed. The converse was true for considerate preference participants. When money is removed and only shocks remain (Figure 6B, gray background), all learning models correctly predict that considerate preference participants do not change their choices, but only M2Out (purple, and marked by the black arrow-head) accurately predicts that participants with lucrative preference change their preference to just above 50%. In contrast, M1 (gray) fails to predict any change for those lucrative participants, and M2Dec (light blue) overestimates their change. When shocks are removed and only money remains (Figure 6B

p<0.001, BF10>1000, Figure 7B). Parameter recovery further shows that if one simulates participants with different wf, M2Out accurately recovered much of that variance (r(wf_simulated, wf_estimated)=0.69, p<10^-6, BF10>10^6, Supplementary Figure 6). For our fMRI participants, we also performed a Helping task not involving learning that we had previously developed (4) (Supplementary material §3). In each Helping trial, participants viewed a victim receive a painful stimulation, and could decide to donate money. If no money was donated, the victim received a second stimulation of equal intensity. If some money was donated, each € donated reduced the second stimulation by one point on the 10-point pain scale.
We found that the wf value estimated in our learning task in the fMRI experiment could significantly predict how much money participants on average donated to reduce the victim's pain in this different Helping task (Kendall τ(wf,choice)=-0.47, BF10=76, p<0.001, Figure 7C

To examine whether signals in the affective vicarious pain signature (AVPS) covary with prediction errors for shocks (PES), and whether they do so in a way that correlates with wf or not, we extracted the wf-normalized PES parameter estimate image (βPES) from each participant, and dot-multiplied it with the AVPS neural signature. It is critical for the interpretation of our fMRI analysis that we use wf-normalized PE predictors (as described in Methods: fMRI data analysis: PE wf-normalization) to avoid wf influencing the results twice: in calculating the parameter estimate image and in calculating the regression of the parameter estimates on wf. The AVPS is known to have larger signals when viewing painful compared to non-painful facial expressions (28). We coded shock outcomes in terms of their value for a participant and not their intensity, i.e. a non-painful shock had a value of +1 and a painful shock a value of -1. Hence, trials with painful shocks should typically have negative prediction errors, and βPES should load negatively on this neural signature. We found this to be the case (Figure 8B, t(24)=-5.55, p=1E-5, BF10=2090). Next, we asked whether the strength of the loading depended on wf using a Bayesian correlation, and found evidence of the absence of such a relationship (r=-0.058, p=0.78, BF10=0.257, Figure 8C). The facial pain observation network thus does carry signals that negatively covary with prediction errors for shocks in our paradigm, and these signals do not depend on personal preferences (wf). Voxels contributing substantially to the loading can be found in Figure 8D.
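The signature analysis described above (dot-multiplying each participant's βPES image with the AVPS and testing the resulting loadings against zero) can be sketched with toy data. The flattened three-voxel "images", the function names, and the use of a plain one-sample t statistic are assumptions for illustration; they do not reproduce the authors' imaging pipeline or their Bayesian tests.

```python
from math import sqrt
from statistics import mean, stdev


def loading(beta_image, signature):
    """Dot product of a flattened parameter-estimate image with a flattened signature map."""
    if len(beta_image) != len(signature):
        raise ValueError("image and signature must cover the same voxels")
    return sum(b * s for b, s in zip(beta_image, signature))


def one_sample_t(values):
    """t statistic testing whether the mean of values differs from zero (len(values) - 1 df)."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


# Toy example: three hypothetical participants, three voxels each.
signature = [1.0, 0.0, -1.0]
beta_images = [
    [-1.0, 0.2, 1.0],
    [-0.5, 0.1, 1.5],
    [-2.0, 0.0, 1.0],
]
loadings = [loading(img, signature) for img in beta_images]
t_stat = one_sample_t(loadings)
```

A negative `t_stat` in this toy setup corresponds to the direction reported in the text, where βPES loads negatively on the signature; relating the per-participant loadings to wf would be a separate correlation step.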
Next, we performed a more explorative voxel-wise linear regression that predicts the parameter estimate of the PES modulator using a constant and wf. As expected for a region involved in valuation, we found the vmPFC to have signals covarying positively with PES (i.e. higher signals when shocks were lower than expected). A more ventral cluster showed signals associated with PES in a way that depended on wf (Figure 8E, warm colors and Supplementary Table 6), while a more dorsal vmPFC cluster showed an association with PES after removing variance explained by wf (Figure 8E, cold colors and Supplementary Table 7). In addition, we find that the left somato-motor cortex, including BA4 and 3, also harboured signals covarying positively with PES, with a more dorsal location representing it with a significant dependence on wf, and a more ventral region in a way that did not significantly depend on wf.

Because PES and PEM predictors in our fMRI design are somewhat correlated due to the outcome structure of our task (average r=-0.26, ranging from -0.49 to -0.03 across our participants), we performed a parameter recovery analysis (see Supplementary Material §11) that confirmed that if we simulate participants with signals that covary with PES but not PEM, we recover significant parameter estimates for PES but not PEM, and vice versa. We also found that the PEM parameter estimate images do not load on the AVPS, and found evidence of absence of such loading (t(24)=0.392, p=0.7, BF10=0.226, Figure 8B), confirming our ability to appropriately separate and localize PEM from PES signals.

that are less intense than expected (i.e. positive PES) in participants that place more weight on shocks (high 1-wf values). Cold color in the outer two columns: results of the contrast constant>0 in the same linear regression with (1-wf) (punc<0.001, k=FWEc=132 voxels).
This identifies voxels with signals that increase with increasing PES after the variance explained by (1-wf) is removed. All results in (E) were FWE cluster corrected at p<0.05, following cluster cutting at punc<0.001, specified using the critical FWE cluster size FWEc. Note that the reverse contrasts for (E) did not yield results that survive FWE cluster correction. The light gray line helps the viewer to visually compare the location of the clusters across the two contrasts.

Our aim was to investigate how participants learn action-outcome associations during moral conflict, so-called morally salient features (35), that contrast gains for the self against pain for others. We had three central questions. First, whether such learning would be model-free or model-based, and whether it can be modeled using reinforcement learning models. Second, whether individual preferences for maximizing self-gains vs. minimizing other-pain would bias learning towards the preferred outcome. Third, if such bias exists, whether and how the brain would process other-pain exclusively in a biased fashion, or whether it might have co-existing representations that are biased and unbiased.

Our results indicate that participants' choices appear to be dominated by biased model-based learning (16). The dominance of model-based learning was apparent (i) from the ability to provide above chance explicit reports for both outcomes that also captured the difference between Conflict and No-Conflict blocks and (ii) from the reaction to devaluation by not altering choices if the preferred outcome was preserved, but by switching preference when the preferred outcome was removed in the conflict blocks.
That models were biased was borne out by the fact that (i) individuals had more differentiated reports for their favoured outcome-type and (ii) preferences after devaluation were asymmetric, with preference levels depending on the weight of the remaining quantity.

Considering the dominance of the biased model-based learning, we next asked how this biased model-based learning would be implemented in the brain. Given the considerable literature on learning to choose options that maximize gains for the self (22, 36), we focused on how shocks to others are processed in the brain under conflict. In principle, there might be two ways in which the bias in our model-based learning could be implemented in the brain: (1) the bias occurs so early that a person maximizing self-gain fails to process shocks altogether (but see Supplementary material §15 for preliminary eye-tracking data) or (2) all participants initially process shocks in similar ways and modulate these signals with wf only in later stages of processing. In case (1) we would fail to find BOLD signals associated with PES in a way that does not depend on wf anywhere in the brain, while in case (2) we could expect that earlier stages of processing (e.g. the network associated with pain observation (24-26, 28)) have signals associated with PES that do not depend on wf, while later stages of processing associated with valuation, particularly in the vmPFC (14, 22, 29-31), might have PES signals that depend on wf and inform choices. The psychological (37) and neuroimaging (38-42) literature cannot adjudicate these alternatives, as they have shown biases of empathy at both early and late levels of processing.
However, our fMRI data speak to option 2: the pain observation network, as quantified using the affective vicarious pain signature (28), had signals that covaried with prediction errors for shocks with evidence for the absence of a linear dependence on wf, while within the medial prefrontal cortex, we found more dorsal regions with signals that covary with prediction errors for shocks with magnitudes that did not significantly depend on wf, as well as more ventral signals that did. That the ventral but not the dorsal cluster had valuation signals that depend on wf, i.e. on whether a participant focuses on minimizing other-pain or not,

A limitation of our fMRI study is that, unlike prediction errors for shocks and money that we show to be distinguishable, outcomes and prediction errors were so highly correlated (shock: mean r=0.84, ranging 0.75-0.87; money: mean r=0.84, ranging 0.78-0.86) that we cannot establish whether signals are truly representing prediction errors rather than outcomes. While this problem is not unusual (e.g. (12)), a novel version of the paradigm better dissociating prediction error and outcome will be necessary to explore whether the pain observation network simply processes outcomes or already integrates outcomes with expectations as predictive coding theories might predict (43-45). However, whether the pain observation network processes outcomes or prediction errors, it does so in ways that do not depend on wf, thereby providing evidence against the notion that the bias is so pervasive that the brain does not at all represent shocks independently of bias.
This hybrid representation of some signals that are biased and others that are not echoes the fact that in our Online experiments, even participants with lucrative preference have above chance access to the association of the symbols to shocks (Figure 4B), while the magnitude in probability difference does correlate with wf (Figure 4D). A promising direction in which our study could be expanded would involve the use of electro-physiological recordings. In particular, one could then explore whether signals in the pain observation network covary with prediction errors earlier than the vmPFC signals that do depend on wf, and whether coherence across these regions increases during shock observation, to test the idea that vmPFC valuation may be informed by unbiased activity in the pain observation network.

In addition to these core fMRI findings, we made a couple of additional, less expected findings in our explorative analysis. First, we failed to find signals robustly associated with prediction errors for money when applying a whole brain correction for multiple comparisons

high wf (i.e. with highly lucrative preferences) and more participants with low wf, in our fMRI study, could help explain why it was more difficult to find signals that represent the self-money compared to the other-shocks. Additionally, money was always delivered (i.e. money was delivered also in the low-money condition), and we had robust activations in regions associated with monetary rewards when outcomes were revealed. The strength of that trial-independent signal may have made it more difficult to detect the apparently smaller differences between trials with higher and lower monetary outcomes. In the future, it might be helpful to compare conditions with no monetary reward at all against a condition with monetary reward, and to recruit a larger number of lucrative participants.
Second, we also found regions around the left central sulcus encoding prediction errors for shocks in ways that depend on wf or not, extending into both SI (area 3b and 1) and MI (area 4). While it is tempting to interpret these activations as representing some form of embodiment of the somatic pain of others onto the observer's own somatomotor cortex (4, 27), such conclusions should be made with care considering the limitations of reverse inference, in particular given the explorative nature of this whole brain analysis (46). Neuromodulating these foci, akin to what we have done previously in our helping task (4), and exploring the impact on choices, would be important to explore their causal contribution to learning and decision making.

Participants varied substantially in their preferences under conflict. In our Online study, about one third preferred the symbol leading to high monetary gain, another third the symbol minimizing the pain to the other, while for the remaining third the number of trials we acquired did not suffice to identify a significant preference. This pattern held even though, to maximize conflict, we asked each participant in the Online experiment to adjust the amount of money on offer for the high money outcome to be equivalent to the cost of the other receiving the painful shock-intensity we used (Supplementary Material §2). Indeed, participants that had a lower indifference point later chose the lucrative option more often, and these individual differences in choice are thus not a trivial consequence of individual differences in the amount of money on offer. In our models, we find that an individual parameter, the weighting factor (wf, conceptually similar to the salience factor alpha in the Rescorla-Wagner Learning Model (19)), is an effective way to capture these individual differences in the relative weight placed on the money vs shocks.
Importantly, wf had external validity, in that it predicted how much money the same participants gave to reduce shocks to the same confederate in a different task, the Helping Task (4), which does not require learning. We also found that wf was better than our trait measures of empathy and money attitude scales at predicting donation in this Helping Task. Recent work has supported the view that state empathy is regulated by motives and context (see (37, 47) for reviews). Our moral conflict task creates financial incentives known to downregulate empathy (37). It is thus perhaps unsurprising that the IRI, which measures

Our study has several additional limitations. First, we limited our model comparison to a number of hypotheses driven by RLT models. We did not test ratio or logarithmic ratio models of valuational representation in this study. These valuational structures are known to occur but are less often indicated in modeling gains and losses to the self (17). Future experiments could be optimized to explore whether such alternative ways to combine these values may be more appropriate under certain moral conflicts. Second, our participants, or at least some of them, may not use an RLT model at all. Instead, they may use rules such as 'choose a symbol randomly, and only switch if you encounter x unfavorable outcomes in a row'. Future studies may benefit from studying how well such heuristics may perform. Third, we used wf to address quantitative individual differences. We recognize that wf only indicates preferences during our tasks and that it is risky to interpret the wf as suggesting stable moral values, as is the case for all behavioral studies absent strong evidence regarding stable moral commitments.
Future studies may wish to explore whether different individuals may be best captured using qualitatively different models and in conjunction with validated evidence of long-term moral commitments and values. This might be particularly relevant when including participants with independently demonstrated morally considerate commitments on the one hand, or psychiatric disorders affecting social functioning on the other. Fourth, our evidence for a biased model-based approach hinges on drop-out trials in our Online experiment, and our request for probability reports. We introduced these trials to rigorously reveal the nature of the learning that participants deploy in conflict situations, and they indeed provided the data necessary to adjudicate across our models. However, we must also consider the possibility that these very trials also influenced participants to deploy the model-based learning that provides more optimal decisions during drop-out and more accurate probability reports. Performing a conflict task with only a single devaluation trial on a large number of participants may be a way to exclude this possibility. However, if participants had adapted their strategy to optimally fit the requirements of the task, M2Dec would have been even more adaptive.

To summarize, our data shed light on the processes at play when adults have to learn that certain actions lead to favorable outcomes for themselves but harm others, while alternative actions are less favorable for themselves but avoid harm to others. We show that in our task, the choices of a majority of participants are best described by a biased model-based approach, in which some brain signals (e.g. in the pain observation network) covary with prediction errors for the harm to others in ways that do not depend on whether participants prefer to maximize gains for self or minimize harm to others, while others (e.g.
in the vmPFC) do depend on this individual preference. We foresee that the neurocognitive model we introduce and this task will be particularly useful, in the future, to understand how neurocomputational processes may differ in antisocial populations that often fail to make choices that appropriately consider harm to others in situations in which such harm results from pursuing personal goals.

Two independent experiments were performed: an Online study and an fMRI study. Behavior was tested in a larger sample of participants, but this had to be done online, using the online platform Gorilla (https://gorilla.sc/), due to COVID-19 restrictions in place at the time. Brain activity was measured in a smaller number of participants in our fMRI scanner. Table 1 gives an overview of the number of participants and experimental conditions included in each study.

In total, 106 healthy volunteers with normal or corrected-to-normal vision, and no history of neurological, psychiatric, or other medical problems, or any contraindication to fMRI (for the fMRI experiment only) were recruited for our experiments (Table 1).

In the Online experiment, participants' main task was the probabilistic reinforcement learning task, in which participants had to choose between two new symbols based on learning how much money each symbol could earn them, and whether the symbols lead to a painful shock to another participant (Figure 1). The task included a deceptive cover story in which participants were informed, at the very beginning of the study, that while they performed the task online, a second participant was present in our lab. It was explained that the separation of the two participants allowed the experiment to be performed under COVID-19 restrictions.
In reality, no one was invited to our lab, and this cover story served to create a situation in which participants took what they thought to be decisions with real implications for self and others, while at the same time minimizing discomfort to others (no shocks were actually delivered to the other person). Participants then received a general overview of all the tasks they would encounter during the experiment. The text also informed them that 15 trials would be randomly selected from the main Learning task, and that the self-money of these trials would be added

No-Conflict. This condition was identical to the Conflict blocks with the exception that the symbol-outcome associations were unlikely to introduce a moral conflict, because the symbol that was usually best for the self was also usually best for the other: one symbol was most likely associated with high monetary reward for the participant and non-noxious stimulation to the other participant, while the other symbol was usually associated with low monetary reward and a noxious stimulation. There was therefore a clear incentive to choose the symbol leading to the higher reward, as it was also the symbol that would be most beneficial for the other.

ConflictMoneyDropout. Each block included two sets of 10 trials. The first 10 trials were identical to those presented in the Conflict condition. After the 10th trial, the participants were informed that the money outcome was removed and the block continued with the last 10 trials, in which participants would only see the videos of the actor receiving the stimulation, and no money would be awarded. Participants were informed that the probability of each symbol being associated with high or low stimulation would remain the same over the 20 trials.
for each symbol separately, to recall the probability of that symbol to be associated with higher monetary reward and with the noxious stimulation using a scale from 0 to 100% (Figure 1C). The order in which the two symbols were presented and the type of question (association with high shock or with high reward) were randomized between participants. The slider starting position was always at 50%.

p=0.004). Participants were led to believe that the videos they saw were a live video feed of another participant receiving these shocks in a nearby room, although in reality they were pre-recorded movies. All participants were presented the same videos, albeit in randomized order, so that we can compare the average donation of the participants as a willingness to give up money to reduce the pain of others.

As the participants of the fMRI experiment came in person to the lab, the cover story was slightly different from that adopted in the online version. For the fMRI experiment, we used the same cover story used and validated in Gallo et al. (4). Each participant was paired with what they believed to be another participant like them, although in reality it was a confederate, author S.G. They drew lots to decide who would play the role of the learner (or the donor in the Helping task) and of the pain-taker. The lots were rigged so that the confederate would always be the pain-taker. The participant was then taken to the scanning room while the confederate was brought to an adjacent room, connected through a video camera. Participants were misled to think that electrical stimulations were delivered to the confederate in real time, and that what the participants saw on the monitor was a live feed from the pain-taker's room. In reality, we presented pre-recorded videos of the confederate's reactions.
In contrast to the Online experiment, in which the high-money amount was selected for each participant using the Optimization task, in the fMRI Learning task the high-money amount offered was fixed at 1.5€ for all participants, and corresponded to the average amount associated with the indifference point in the Optimization task (1.53 ± 0.37 SD; Supplementary Material 2). The low-money amount was the same as in the Online experiment, 0.5€.

At the end of the fMRI tasks, participants were debriefed and asked to fill out the interpersonal reactivity index (IRI) empathy questionnaire (51), and the money attitude scale (MAS) (52). To assess whether participants believed that the other participant really was receiving electrical shocks, at the end of the experiment, participants were asked 'Do you think the experimental setup was realistic enough to believe it' on a scale from 1 (strongly disagree) to 7 (strongly agree). All participants reported that they at least somewhat agreed with the statement (i.e. 5 or higher).

The fMRI task was programmed in Presentation (www.neurobs.com), and presented under Windows 10 on a 32-inch BOLD screen from Cambridge Research Systems visible to participants through a mirror (distance eye to mirror: ~10cm; from mirror to the screen: ~148cm). The timing of the task was adapted to the requirements of fMRI: each trial started with a jittered fixation cross lasting 3-9 seconds. Then the two symbols appeared and participants could make their choice without a time restriction. After the button press, there was a fixation cross ranging from 3 to 9 seconds, followed by the video with a duration of 2 seconds.

Where ANOVAs were used, we report BFincl, which is the probability of the data given a model including the factor divided by the average probability of the data given the models not including that factor. Normality was tested using the Shapiro-Wilk test.
If this test rejected the null hypothesis of a normal (or bivariate normal for correlations) distribution, we used non-parametric tests such as the Wilcoxon signed rank or Kendall's Tau test, while if the null hypothesis was not rejected, we used parametric tests. We always used default priors for Bayesian statistics as used in JASP.

Our experiment represents a variation of a classical two-armed bandit task and was modeled using a reinforcement learning (RL) algorithm with a Rescorla-Wagner updating rule (19). We compared 4 models explained in Figure 1E.

For the Learning Task, our aim was to identify brain activity that scales with the prediction error for shock (PES) when the outcome is revealed, and to compare this activity across participants with different weighting factors. Analyses and the experimental design therefore focused on the outcome phase. In line with other studies (e.g., (12)), activity during the decision phase will not be analysed here for two reasons. First, to isolate activity when outcomes are revealed, we randomized the interval between outcomes and the response screen. As a result, decisions can have occurred at any time between the last outcome and the next button press, making it difficult to capture the activity linked to that decision. Second, the button press required for the response triggers significant brain activity in frontal regions that could have been hard to dissociate from the valuation processes we are interested in.

To analyse activity during the outcome phase, we ideally would have included predictors for the monetary outcome (OutM) and the shock (OutS) as well as their prediction errors (PEM and PES).
Unfortunately, PES and OutS are too highly correlated (0.741, ranging 0.750 to 0.866 across our 27 participants), and the same is true for PEM and OutM (0.749, ranging 0.781 to 0.862 across our 27 participants), to allow us to include them into a single design matrix. Indeed, we performed a parameter recovery attempt that simulates voxels with various mixtures of OutS, PES, OutM and PEM across 24 participants with the addition of noise, and tried to recover them using design matrices with all 4 parametric modulators, and failed to have appropriate recovery (data not shown). This is not surprising, because for time efficiency, here we aimed to focus on the initial learning of the association, including only 10 trials per block, and during the initial learning outcomes with high OutS will also cause high PES and trials with high OutM will also cause high PEM. Our parameter recovery simulations however confirmed that we can include PES and PEM in a single design matrix, as these are

depend on wf or not, we divided PES by (1-wf) and PEM by wf before entering them into the design matrix. As a result, the first PES value in the parametric modulator would always be PES=-1 if it was a high-shock outcome or PES=+1 for a low-shock outcome, independently of the participant's wf value. If signals covary with PES in a way that does not linearly depend on wf, the parameter estimate across participants (βPES) would violate H0:βPES=0, but not would violate both of these H0. Note that for outcomes, the coding was +1 for good outcomes (i.e. high money or low shock) and -1 for bad outcomes (i.e. low money or high shock). EV and PE follow that polarity.

Results were then analysed in two ways. First, to improve reverse inference, we use a

We thank C. Gavanozi, I. Gembutaite and B. Hoekzema for helping with data acquisition in the Replication and Outcome-Dropout study. We thank A.
Veggerby Lind for helping to record the stimuli used in the Replication and Outcome-Dropout studies. We thank A. Gentile for his input and help with developing the learning models. VG, CK, AN, LD, KI, SG, LF and RP thank P. Lockwood as her work inspired the development of the learning tasks.

The authors declare no competing interests.

1s. All videos were validated by independent groups of subjects, who were asked to rate the intensity of the pain experienced by the confederate on a scale from 1 to 10, with '1' being 'just a simple touch sensation' and '10' being 'most intense imaginable pain'. As low pain intensity videos we selected the ones with ratings of 1 and 2 and as high pain intensity videos we selected the ones with ratings of 4, 5 and 6. Videos were edited using Adobe Premiere Pro CS6 (Adobe, San Jose, CA, USA).

In the Online experiment, before the learning tasks, we determined the amount of monetary reward that would have a subjective value equivalent to the painful shock received by the other participant. This task enabled us to personalize the amount of money participants were later offered as high reward in the learning tasks to create a meaningful conflict. Participants always had to choose between a pain-reducing option combining 0.5€ for them with a low shock to the confederate, and a lucrative option combining a higher amount of money for themselves with a high shock to the confederate. The amount of money offered in the lucrative option varied in steps of 0.25€ across the 5 types of choices (see the

In total there were 2 sessions of 15 trials each presenting the hand video and 2 presenting the face. The single session with 6 blocks of the Learning task was presented at the end, after the four Helping Task sessions.
This was because the fMRI experiment was centered on the Helping task, and the Learning task was meant as pilot data, which should have been followed by a second fMRI data acquisition centered on the Learning task, which was however impossible due to COVID-19 restrictions.

To better understand what drove participants' choices in the Online learning task, at the end of the Online task, participants were asked the question "What motivated your choices?". Five options were proposed to the participants and they could choose to select 1 or more. In case they selected more than 1, they were asked to select them in order of importance, starting from the most important. The options presented to participants were the following:

Since there were 5 alternatives, to quantify motivation, values from 0 to 5 were assigned to A-E according to their position in the list of motivations (0 if an option had not been selected, 5 if it had been indicated as first motivation, 4 as second, 3 as third, 2 as fourth and 1 as fifth). To assess whether the motivation overall was more lucrative or pain-reducing, we then derived a secondary measure of 'Pain-Avoidance' that would be negative if the motivation was mainly 'gaining as much money as possible' and positive if the motivation was 'preventing harm to others' or 'avoiding to see the other person receiving the shocks'. Specifically, this was calculated as (D+E)/9-A/5. We divided D+E by 9 and A by 5, because a person with maximum pain-reducing motivation could choose D and E as their first two motivations and thus obtain 5+4=9, and one maximally lucrative would choose A as their first motivation and thus obtain 5. We decided on

choices during devaluation at the 11th trial (as inlet Figure 6A, and Figure 6C

The average correlation between the time courses of the parametric modulators for PES and PEM was -0.26, ranging from -0.49 to -0.03.
Due to this correlation, we explored whether our experimental design and GLM approach can disentangle voxels that represent PES from those representing PEM, and whether they can differentiate voxels linearly dependent on wf from those that are not. Our GLM included, during the outcome period, a boxcar for the duration of the movie with two parametric modulators, one for PES and one for PEM. Both have been normalized by dividing them by 1-wf and wf respectively. This was done, as described in the Methods and Materials section of the main manuscript, to ensure that PES and PEM predictors become independent of preference and wf per se. When used in the GLM, the parameter estimates for these normalised PES values can then be compared across participants to identify if the brains of participants with higher weight on shocks (i.e. larger value for 1-wf) show larger signals for a given outcome than participants with lower weight on shocks. Using the original PES values would make that interpretation difficult, because they are already dependent on wf.

Table 9: Percentage of two-tailed t-tests significant at p<0.05 from the 1000 simulations using signals generated with multiplications with wf or (1-wf), and in brackets, the percentage of p1-tailed<0.001. The top row specifies how the signals were generated before adding 1sd of noise, the leftmost column, the null hypothesis that was tested in the hypothesis testing.

The above tables explore evidence against the null hypothesis, but for the univariate analysis we also ask whether, for voxels mixed without a certain factor, the GLM can actually provide evidence for the null hypothesis using Bayesian statistics (8), with a bound of BF10<1/3.
Using a Bayesian test, with n=25, we know that |t|<1 provides evidence in

Summary: In our simulations, with 1sd of noise, we can detect voxels with signals linearly dependent on PES and/or PEM accurately: if we use alpha=0.05, as we would for the AVPS analysis, signals generated by including PES but not PEM are detected as representing PES in ~99% of cases, and only in ~5% of cases as representing PEM, and vice versa for voxels generated to include PEM but not PES. Using Bayesian statistics, we can even provide evidence in favour of the H0 for the former (βPES=0) in ~70% of cases, and the latter H0 (βPEM=0) in 66% of cases. Within our sample size, we thus have power to arrive at conclusions that match the way we generated the signals in the majority of simulations. Even

The development of generosity and moral cognition across five cultures
Neural Responses to Ingroup and Outgroup Members' Suffering Predict Individual Differences in Costly Helping
Moral transgressions corrupt neural representations of value
The causal role of the somatosensory cortex in prosocial behaviour
The cognitive neuroscience of moral judgment and decision making. The Cognitive Neurosciences
Empathic concern drives costly altruism
A Neurocomputational Model of Altruistic Choice and Its Implications
The neuroscience of social decision-making. Annu. Rev.
11. W.
Schultz, Updating dopamine reward signals
Neurocomputational mechanisms of prosocial learning and links to empathy
The Anterior Cingulate Gyrus and Social Cognition: Tracking the Motivation of Others
When Implicit Prosociality Trumps Selfishness: The Neural Valuation System Underpins More Optimal Choices When Learning to Avoid Harm to Others Than to Oneself
Model-free decision making is prioritized when learning to avoid harming others
Goals and habits in the brain
How Costs Influence Decision Values for Mixed Outcomes
Instrumental Responding Following Reinforcer Devaluation
A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement
Harm to Others Acts as a Negative Reinforcer in Rats
The valuation system: a coordinate-based meta-analysis of BOLD fMRI experiments examining neural correlates of subjective value
Neural Circuitry of Reward Prediction Error
Meta-analytic evidence for common and distinct neural networks associated with directly experienced pain and empathy for pain
Is Empathy for Pain Unique in Its Neural Correlates? A Meta-
A meta-analysis of neuroimaging studies on pain empathy: investigating the role of visual information and observers' perspective
Somatosensation in social perception
Empathic pain evoked by sensory and emotional-communicative cues share common and process-specific neural representations
Prefrontal and
The neurobiology of rewards and values in social decision making
Neural mechanisms of observational learning
An agent independent axis for executed and modeled choice in medial prefrontal cortex
Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC
Theory of Probability
Impressions of Milgram's obedient teachers: Situational cues inform inferences about motives and traits
Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies
Empathy: a motivated account
Singer, I feel how you feel but not always: the empathic brain and its modulation
What are you feeling? Using functional magnetic resonance imaging to assess the modulation of sensory and affective responses during empathy for pain
Changes in brain activity following the voluntary control of empathy
Responsibility modulates pain-matrix activation elicited by the expressions of others in pain
Obeying orders reduces vicarious brain activation towards victims' pain
Where and how our brain represents the temporal structure of observed action
Predictive coding: an account of the mirror neuron system
Hebbian learning and predictive mirror neurons for actions, sensations and emotions
Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding
Dissociating the ability and propensity for empathy
The amygdala and ventromedial prefrontal cortex in morality and psychopathy
Reduced spontaneous but relatively normal deliberate vicarious representations in psychopathy
Measuring individual differences in empathy: Evidence for a
multidimensional approach
The development of a money attitude scale
Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence
Revealing neurocomputational mechanisms of reinforcement learning and decision-making with the hBayesDM package
Factor analysis and AIC
Bayesian measures of model complexity and fit
SPM12 manual

Testing each value against zero using a Wilcoxon signed rank test showed that for the considerate preference group values were significantly above zero (W=380, p<0 This suggests that participants in the Considerate and Lucrative group could, and decided to, consciously report motivations that were in line with their choices, while the Ambiguous group appeared to have less clearly polarized monetary or pain-avoiding motivations.

explorative whole brain analysis where the cluster-cutting threshold was set at 0.001. The proportion of significant results was as follows:

favour of H0: βPES=0 over H1: βPES≠0, and |r|<0.17 for H0: r(wf, βPES)=0 over H1: r(wf, βPES)≠0 (BF10<1/3, using default priors in JASP)

Highlighted in yellow are the questions which we considered more interesting to analyze
-Question 5: effect of group on the sense of guilt when witnessing the high intensity electrical stimulations Independent sample t-test t(51)=2.85, p=0.01, BF10=6.93) and Ambiguous (Independent sample t-test t(53)=-2.38, p=0.02, BF10=2.44); while there was no evidence of difference between the amount of guilt reported by Lucrative and Ambiguous
-Question 6: absence of evidence of the effect of group on the amount of responsibility perceived during the task (Main effect of group on responsibility, F(2,76)=2.10, p=0 564)
-Question 12: The three groups do not differ in the amount of motivation (F(2,76)=0.05, p=0 BFincl=0.115)
-Sum of the answers to question 8-9

The causal role of the somatosensory cortex in prosocial behaviour
A Comparison of Three
Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code
A parameter recovery assessment of time-variant models of decision-making
A Method for Efficiently Sampling from Distributions with Correlated Dimensions
Association, A multidimensional approach to individual differences in empathy
The development of a money attitude scale
Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence
Author Correction: Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence

priors on wf and on the S and M learning-rate parameters were all truncated normal distributions (between 0 and 1) with mean 0.5 and standard deviation 0.2, while the prior for the τ parameter was a truncated normal distribution (between 0 and infinity) with mean 1 and standard deviation 3. The estimates plotted in Figure 6c display the estimated posterior means from each simulated data set.

In the main manuscript, we use a hierarchical Bayesian model to estimate these parameters under the assumption that we sampled multiple individuals from the same underlying population. In particular for the learning rates S and M, it is likely that different participants may gravitate towards similar S and M given the similarity in volatility experienced by them. For the parameter recovery we perform here, this assumption is not true: we deliberately sampled the entire space of possible S and M using a uniform distribution. Accordingly, for the parameter recovery, each simulated behavior was analysed individually using a non-hierarchical model. The priors for this individual implementation were not informed by the hyper-parameters of the final study, and were simply aimed at informing us about the proportion of all possible variance in these parameters that can be retrieved from analysing participants one at a time.

p<0.001, BF10>1000).
As reported in the main text, the same was true for wf. In contrast, parameter recovery shows that although the learning rates can be significantly recovered by M2Out, the correlation values are more modest than for wf and τ (Kendall's τ(Ssimulated, Sestimate)=0.25, Kendall's τ(Msimulated, Mestimate)=0.24, both p<0.001, BF10>1000), and the values obtained from our model-fitting should thus be interpreted more tentatively. In particular, we find that the learning rate is difficult to estimate for the outcome that participants consider less in their choices: for simulated participants with wf<0.1 (i.e. that minimize shocks to others), Kendall's τ between simulated and estimated values drops to 0.03 for M but is 0.36 for S, while for simulated participants with wf>0.9 (i.e. that maximize gains to the self), it is 0.32 for M but drops to 0.06 for S. Accordingly, we did not include these parameter estimates in the main manuscript due to their reduced robustness.

It might also be noted that the median learning rates estimated by the hierarchical Bayesian model for the Online experiment (Supplementary Figure 6B) were close to 0.25, and were very

For this parameter recovery, we ran 1000 simulations. In each, we simulated 25 participants. For each participant, we used their own design matrices (the same used for the actual GLM first level analysis of the fMRI activity after convolution with the haemodynamic response function) to mix signals in each subject using three mixings: (i) -1*PES+0*PEM+noise; (ii) 0*PES+1*PEM+noise; and (iii) -1*PES+1*PEM+noise. Noise was random Gaussian noise with a standard deviation set at 1 SD of the mixed signal. Next, we ran a GLM using the same design matrix, and saved the parameter estimates for PES (which we will call βPES) and PEM (which we will call βPEM) for each participant.
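The per-participant part of this recovery procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the PES and PEM regressors here are hypothetical uniform random values rather than HRF-convolved design-matrix columns, the number of trials (60) is an assumed placeholder, and the function names (`fit_two_regressors`, `simulate_participant`) are introduced only for this example. The sketch shows mixing (i), adding Gaussian noise at 1 SD of the mixed signal, and recovering the two parameter estimates by ordinary least squares.

```python
import random


def center(values):
    """Subtract the mean so the intercept can be dropped from the fit."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def fit_two_regressors(signal, pes, pem):
    """Ordinary least squares for signal ~ b_pes*PES + b_pem*PEM + intercept.

    Solves the 2x2 normal equations on mean-centered data.
    """
    y, x1, x2 = center(signal), center(pes), center(pem)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b_pes = (s1y * s22 - s2y * s12) / det
    b_pem = (s2y * s11 - s1y * s12) / det
    return b_pes, b_pem


def simulate_participant(pes, pem, w_pes, w_pem, rng, noise_sd_ratio=1.0):
    """Mix PES and PEM into a simulated voxel signal, add noise, refit."""
    mixed = [w_pes * s + w_pem * m for s, m in zip(pes, pem)]
    mean = sum(mixed) / len(mixed)
    sd = (sum((v - mean) ** 2 for v in mixed) / len(mixed)) ** 0.5
    # Noise standard deviation is set relative to the SD of the mixed signal.
    noisy = [v + rng.gauss(0.0, noise_sd_ratio * sd) for v in mixed]
    return fit_two_regressors(noisy, pes, pem)


rng = random.Random(0)            # fixed seed, for reproducibility only
n_participants, n_trials = 25, 60  # n_trials is an assumed placeholder
betas = []
for _ in range(n_participants):
    # Hypothetical regressors standing in for the real parametric modulators.
    pes = [rng.uniform(-1, 1) for _ in range(n_trials)]
    pem = [rng.uniform(-1, 1) for _ in range(n_trials)]
    # Mixing (i): -1*PES + 0*PEM + noise
    betas.append(simulate_participant(pes, pem, -1.0, 0.0, rng))

mean_b_pes = sum(b[0] for b in betas) / n_participants
mean_b_pem = sum(b[1] for b in betas) / n_participants
```

Under mixing (i), the average estimate for PES should sit near -1 and the one for PEM near 0; repeating the loop many times and testing the 25 estimates against zero gives the hit and false-alarm proportions discussed in the text.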
We then perform a t-test for βPES and one for βPEM to see if across the 25 parameter estimates (one per participant) there is evidence against the null hypothesis H0:βPES=0 or H0:βPEM=0. Of course, if PES was mixed into the voxel's activity (case i or iii), a significant t-test would be a hit, while a non-significant t-test would be a miss, and the same applies to PEM for case ii. After repeating this procedure 1000 times, we count the proportion of the 1000 simulations where a t-test was significant against H0:βPES=0 or H0:βPEM=0. Additionally, to see how often the analysis falsely detects a dependence on wf although wf was not included in the mixing, we also test H0: r(wf, βPES)=0 and H0: r(wf, βPEM)=0. Initially, we use p<0.05 as a criterion, to look at the specificity and sensitivity for the case in which we explore responses in the AVPS, which is univariate. We also indicate proportions at p<0.001, but this time for a one-tailed test, in parentheses, to provide results relevant for an

All our learning models had in common that they learned by updating expected values for the two symbols using prediction errors, and additively combined self-money and other-shock. All of them captured individual variability in preference using a weighting factor (wf). What varied was when the two outcomes were combined. In our model-free version (M1), the outcomes themselves are combined into a single composite outcome value, which is already biased by the participant's preference (Figure 2). Translated into psychological terms, this model would capture a model-free learning in which each symbol is associated with a value that captures how good or bad the outcomes for those symbols have felt in the past.
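A minimal sketch of the contrast between combining outcomes before learning (M1) and combining expectations only at decision time (M2Dec) is given below. It is an illustration under stated assumptions rather than the fitted models: the update is a plain Rescorla-Wagner rule, outcomes use the +1 (good) / -1 (bad) coding, wf is treated as the weight on money and 1-wf as the weight on shock, both symbols are sampled on every trial with deterministic outcomes, and all function names, the learning rate and the temperature are hypothetical.

```python
import math


def m1_update(ev, out_money, out_shock, wf, lr):
    """M1: combine both outcomes into one composite value, then learn it."""
    composite = wf * out_money + (1.0 - wf) * out_shock
    return ev + lr * (composite - ev)


def m2dec_update(ev_money, ev_shock, out_money, out_shock, lr_money, lr_shock):
    """M2Dec: learn money and shock expectations separately, unweighted."""
    ev_money = ev_money + lr_money * (out_money - ev_money)
    ev_shock = ev_shock + lr_shock * (out_shock - ev_shock)
    return ev_money, ev_shock


def m2dec_value(ev_money, ev_shock, wf, money_on=True, shock_on=True):
    """Weight the separate expectations by preference at decision time.

    An outcome that has been removed (dropout) simply does not contribute.
    """
    value = 0.0
    if money_on:
        value += wf * ev_money
    if shock_on:
        value += (1.0 - wf) * ev_shock
    return value


def prob_choose_a(value_a, value_b, temperature):
    """Softmax probability of choosing symbol A over symbol B."""
    return 1.0 / (1.0 + math.exp(-(value_a - value_b) / temperature))


# Toy conflict: A gives high money and high shock, B low money and low shock.
wf, lr, temperature = 0.8, 0.3, 0.5   # a hypothetical lucrative participant
outcomes = {"A": (+1.0, -1.0), "B": (-1.0, +1.0)}  # (money, shock) coding
m1 = {"A": 0.0, "B": 0.0}
m2 = {"A": (0.0, 0.0), "B": (0.0, 0.0)}
for _ in range(10):
    for symbol, (out_m, out_s) in outcomes.items():
        m1[symbol] = m1_update(m1[symbol], out_m, out_s, wf, lr)
        m2[symbol] = m2dec_update(*m2[symbol], out_m, out_s, lr, lr)

# Before dropout both models favour the lucrative symbol A.
m1_pref_a = prob_choose_a(m1["A"], m1["B"], temperature)
m2_conflict = prob_choose_a(
    m2dec_value(*m2["A"], wf), m2dec_value(*m2["B"], wf), temperature
)
# After money dropout, only M2Dec can re-evaluate using shock alone.
m2_dropout = prob_choose_a(
    m2dec_value(*m2["A"], wf, money_on=False),
    m2dec_value(*m2["B"], wf, money_on=False),
    temperature,
)
```

In this toy setting the two models make the same choices while both outcomes are available, but only the version with separate expectations shifts towards symbol B as soon as money is removed, because M1 stores a single composite value with nothing left to re-weight.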
In case of devaluation, this model-free learning does not switch its preference to the symbol that has the highest expected value on the remaining quantity because it does not have an internal model that separates expectations for self-money and other-shock. In contrast, in our unbiased model-based version (M2Dec), participants track expectations separately for self-money and other-shock, independently of preference, and preference only plays out during the decision phase. In psychological terms, this captures an unbiased model-based learning in which participants separately know how likely each symbol will lead to self-money or other-shock, respectively. Only when a decision must be taken will they combine these predictions with the relative value that self-money and other-pain have for them personally, to come to a decision under conflict (Figure 2). In this version, the variable 'expected value' represents expectation in the objective units in which the outcome is coded, independently of whether this particular outcome is more or less valued by the participant in the conflict situation. Accordingly, in case of devaluation,