key: cord-0330434-egxr9tqz
title: Dynamic Representation of the Subjective Value of Information
authors: Kobayashi, Kenji; Lee, Sangil; Filipowicz, Alexandre L. S.; McGaughey, Kara D.; Kable, Joseph W.; Nassar, Matthew R.
date: 2021-02-14
journal: bioRxiv
DOI: 10.1101/2021.02.12.431038
sha: dedc069ddd48ec6a0c135f0502d124ac45484252
doc_id: 330434
cord_uid: egxr9tqz

To improve future decisions, people should seek information based on the value of information (VOI), which depends on the current evidence and the reward structure of the upcoming decision. When additional evidence is supplied, people should update the VOI to adjust subsequent information seeking, but the neurocognitive mechanisms of this updating process remain unknown. We used a modified beads task to examine how the VOI is represented and updated in the human brain. We theoretically derived, and empirically verified, a normative prediction that the VOI depends on decision evidence and is biased by reward asymmetry. Using fMRI, we found that the subjective VOI is represented in right dorsolateral prefrontal cortex (DLPFC). Critically, this VOI representation was updated when additional evidence was supplied, showing that DLPFC dynamically tracks the up-to-date VOI over time. These results provide new insights into how humans adaptively seek information in the service of decision making.

Fig. 1. Experimental paradigm. We adopted the beads task with three key modifications: asymmetry in the reward structure, initial evidence prior to information seeking, and an updating event (one extra bead). (A) Participants observed a number of beads drawn from a jar and made a bet on its composition. Each bead was marked with a face or a house. There were two possible jar compositions: 60% face beads and 40% house beads, or 40% face beads and 60% house beads. The jars are colored here only for illustrative purposes. (B) Reward structure. Participants earned more reward points by correctly betting on one of the two jar types. The experiment consisted of two blocks; in each block, one of two reward structures was presented in each trial. The first block involved a baseline shift and the second block involved a scale manipulation. (C) Trial sequence. In a third of the trials (bet-only trials), participants were presented with a number of beads from the jar and immediately made a bet on its type. In the remaining trials (information-seeking trials), they were presented with the initial beads, an extra bead, and then allowed to seek further information by drawing more beads from the jar before making a bet on one of the two jars. Participants could draw as many beads as they needed within five seconds, but each additional draw incurred a cost (0.1 points). The extra bead was presented to evoke updating in the value of information.

Fig. 2. (A) The probability of the jar type (the true jar is the high-reward jar) increases with the number of observed high-reward beads and decreases with the number of observed low-reward beads. (B) The probability of the jar type is determined by the beads difference. (C) Due to the reward asymmetry, when equal numbers of high-reward and low-reward beads have been observed (the diagonal), the EV to bet on the high-reward jar is higher than the EV to bet on the low-reward jar.
The agent would experience the smallest EV difference, and hence the highest uncertainty on the bet, when more low-reward beads have been drawn (the white region). (D) The EV difference is smallest at the beads difference of −5 across all reward structures. Top: Bet EVs are not affected by a baseline shift in rewards. Bottom: The relative magnitudes of the EVs remain the same when rewards are scaled down overall. (E) The theoretical VOI is highest when the uncertainty on the bet is highest (the beads difference = −5, the black region). This is because the next bead would provide evidence in favor of either jar type, resolving the uncertainty. (F) The theoretical VOI takes an inverted-U shape across all reward structures. Top: The VOI is unaffected by a baseline shift in rewards. Bottom: When the rewards are scaled down, the magnitude of the VOI becomes smaller as well, but the peak location remains the same. EV: expected value, VOI: value of information.

The posterior of the jar type is determined by the numbers of high-reward beads (the majority bead in the high-reward jar, e.g., face) and low-reward beads (the majority bead in the low-reward jar, e.g., house) observed from the jar so far (Fig. 2A). The more high-reward beads have been drawn, the more likely the jar is the high-reward jar, and vice versa. More specifically, the posterior is determined by the difference in the numbers of observed beads (high-reward beads minus low-reward beads) (Fig. 2B; Eq. 1). When more high-reward beads have been observed than low-reward beads (the beads difference > 0), the probability of the high-reward jar is higher than the probability of the low-reward jar, and it increases with the beads difference. Conversely, when more low-reward beads have been observed (the beads difference < 0), the evidence favors the low-reward jar.

In order to evaluate the EV of a bet, the agent needs to combine the posterior on the jar type with the reward structure (Fig. 2C). Due to the reward asymmetry, when the current evidence does not favor either jar (the beads difference = 0; the diagonal in Fig. 2C), the EV of betting on the high-reward jar is higher than the EV of betting on the low-reward jar. The EVs of the two bets are closest to each other when more low-reward beads have been observed (the beads difference = −5; the white region in Fig. 2C). This prediction holds across all of our reward structures (Fig. 2D); a baseline shift in rewards does not affect the EV difference, and a scale manipulation multiplicatively affects both EVs without changing their relative magnitudes. Therefore, if forced to bet on one of the two possible jars, the EV-maximizing agent would experience the highest choice uncertainty not when equal numbers of beads have been observed, but when more low-reward beads have been observed than high-reward beads.

Under economic theories, the VOI, or the value of drawing an additional bead, is evaluated based on how much the next bead would improve the upcoming bet on average (Eq. 2). Qualitatively, the theoretical VOI tends to increase with the uncertainty about which jar type to bet on, because an additional bead would provide more evidence for either jar type and resolve the uncertainty over possible actions (Fig. 2E). For instance, when the agent is under high uncertainty on the bet (the beads difference = −5; the black region in Fig. 2E), an additional bead would help them make a bet irrespective of its type; if the next bead is a high-reward bead, it provides additional evidence in favor of the high-reward jar, whereas if it is a low-reward bead, it favors the low-reward jar. The agent can therefore improve the EV by making a bet conditional on the next bead type. On the other hand, when the agent has observed more high-reward beads than low-reward beads (e.g., the beads difference = +10; top right in Fig. 2E), or many more low-reward than high-reward beads (e.g., the beads difference = −10; bottom left in Fig. 2E), an additional bead would not affect the subsequent bet; the agent would bet on the high-reward jar or the low-reward jar, respectively, no matter what the next bead turned out to be. Therefore, the theoretical VOI takes an inverted-U shape as a function of the beads difference, with its peak at a negative beads difference (−5) (Fig. 2F).
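These qualitative predictions are easy to check numerically. The sketch below is a minimal illustration (not the authors' analysis code): it computes the posterior from the beads difference, the EVs of the two bets under the three reward structures used in the task, and a simplified one-step VOI that only looks one bead ahead (the full VOI defined in the Methods, Eqs. 2-3, also considers further draws). All function and variable names are our own.

```python
import numpy as np

Q = 0.6  # majority-bead proportion in each jar (60/40 composition)

def p_high_jar(d):
    """Posterior that the jar is the high-reward jar, given the
    beads difference d = n_high - n_low (flat prior over the two jars)."""
    return 1.0 / (1.0 + ((1 - Q) / Q) ** d)

def bet_evs(d, R_H, R_L, R_0):
    """Expected values of betting on the high- vs. low-reward jar."""
    p = p_high_jar(d)
    return R_H * p + R_0 * (1 - p), R_0 * p + R_L * (1 - p)

def one_step_voi(d, R_H, R_L, R_0):
    """Myopic value of information: expected best bet EV after seeing one
    more bead, minus the best bet EV right now (a simplification of the
    full VOI, which also allows further draws at a cost)."""
    p = p_high_jar(d)
    p_next_high = Q * p + (1 - Q) * (1 - p)  # P(next bead is high-reward)
    ev_now = max(bet_evs(d, R_H, R_L, R_0))
    ev_after = (p_next_high * max(bet_evs(d + 1, R_H, R_L, R_0))
                + (1 - p_next_high) * max(bet_evs(d - 1, R_H, R_L, R_0)))
    return ev_after - ev_now

diffs = np.arange(-10, 11)
for rewards in [(70, 10, 0), (170, 110, 100), (7, 1, 0)]:
    gap = [abs(np.subtract(*bet_evs(d, *rewards))) for d in diffs]
    voi = [one_step_voi(d, *rewards) for d in diffs]
    print(rewards,
          "| EV gap smallest at d =", int(diffs[np.argmin(gap)]),
          "| one-step VOI peaks at d =", int(diffs[np.argmax(voi)]))
# For all three reward structures, both land at a negative beads
# difference (d = -5), in line with Fig. 2D and 2F.
```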
Our theoretical framework therefore yields an important prediction: the information-seeking strategy should be biased by the reward asymmetry. Participants should draw additional beads more frequently when more low-reward beads have been observed than high-reward beads (the beads difference < 0). The predicted bias holds across reward structures (Fig. 2F); manipulation of the reward baseline (in the baseline block) does not affect the VOI, and manipulation of the reward scaling (in the scale block) affects the overall magnitude of the VOI but does not drastically alter its inverted-U shape. This prediction might be somewhat counterintuitive, as the motivation for information seeking is expected to be higher when the current evidence favors the less desirable state (the low-reward jar). However, it is consistent with the widespread notion of confirmation bias, under which an agent needs less evidence to bet on a desirable state than on an undesirable state (e.g., Gesiarz et al., 2019). More generally, the prediction echoes the general assumption that information seeking should be driven not by the motivation to predict the state (which jar is the true jar?) but to maximize rewards (which jar to bet on?). If, in contrast to our theoretical assumption, an agent were solely motivated to accurately predict the state, they would seek information the most when the beads difference is zero. A bias in information seeking would therefore suggest that participants seek information based on its instrumentality for future reward seeking, as normatively prescribed. To our knowledge, the bias in information seeking under reward asymmetry is a novel theoretical prediction that has not yet been directly tested.

Behavior

We examined participants' information-seeking behavior, and in particular, whether it was biased by the reward asymmetry as predicted. If participants sought to improve their subsequent bet choice and maximize rewards, the frequency of information seeking (i.e., how often they drew at least one bead) should be biased towards a negative beads difference, i.e., when more low-reward beads have been drawn than high-reward beads.

Observed information-seeking behavior was biased in the predicted direction (Fig. 3A). In both the baseline and scale blocks, the frequency of drawing an additional bead was highest when more low-reward beads had been drawn than high-reward beads.
Sensitivity to the reward asymmetry was also confirmed by the bets on the jar type in the bet-only trials (Fig. 3B); the frequency of betting on the high-reward jar increased with the beads difference, and the indifference point (the point at which participants were equally likely to bet on either jar) was shifted towards a negative beads difference. These results show that participants incorporated both the current evidence and the reward asymmetry in reward-seeking and information-seeking choices.

A notable deviation from the theoretical prediction is that participants' information seeking was not sensitive to the reward scale manipulation. In our framework, the theoretical VOI is smaller when the rewards are scaled down (even though its peak location remains the same), while it is unaffected by a reward baseline shift (Fig. 2F). Thus, if our participants were perfectly sensitive to the reward structure on a trial-by-trial basis, their information seeking should be affected by the trial-wise reward manipulation in the scale block but not in the baseline block. To test this, we examined how information-seeking behavior differed across reward conditions and blocks. To characterize the relationship between information seeking and the beads difference without assuming its functional form, we used Gaussian Process (GP) logistic regression (Rasmussen & Williams, 2006). We fit four models to participants' behavior: Model 1 assumed sensitivity to the scale manipulation but not to the baseline manipulation, as normatively prescribed; Model 2 assumed sensitivity to both manipulations; Model 3 assumed a difference between blocks but no sensitivity to manipulation in either block; and Model 4 assumed no difference between blocks or reward conditions. We found that Model 3 outperformed the other models, including Model 1, according to both leave-one-participant-out and leave-one-trial-out cross validation (…, −1142.25, and −1166.17). Therefore, participants' information-seeking behavior was systematically different between blocks, even though they did not change their strategy based on the reward structure on a trial-by-trial basis.

Fig. 3. Behavior. Participants' information-seeking and reward-seeking behavior was biased by the reward asymmetry as predicted. (A) Participants' information seeking, or the frequency at which they drew at least one bead, peaked when more low-reward beads had been drawn than high-reward beads. (B) In the bet-only trials, the frequency with which they bet on the high-reward jar increased with the beads difference and was biased by the reward asymmetry. Lines indicate the best-fit model, which assumed sensitivity to blocks but not to reward manipulations within blocks. Error bars: bootstrap SD, resampling participants.

We speculate that shifting information-seeking strategies on a trial-by-trial basis was too cognitively taxing for our participants, because we also manipulated the beads difference and the trial type (information-seeking or bet-only). Despite this limitation, we observed that participants' information seeking exhibited a clear bias in both blocks. Indeed, Model 3, which allowed asymmetry in information seeking, performed better than another model (Model 5) that assumed symmetric information seeking.

Previous studies of information seeking have mostly examined information that would not be useful for future decisions (i.e., information seeking for its own sake), and one study that examined instrumentality-driven information seeking used a one-shot paradigm that did not involve any updating (Kobayashi & Hsu, 2019).
Thus, it remains unknown to what extent the neural representation of VOI is generalizable across tasks and decision contexts, and whether previously reported regions also represent and update the VOI in our experimental paradigm.

To look for brain regions that represent the VOI, we empirically estimated the subjective VOI from information-seeking behavior. We used the winning model of our GP logistic regression to estimate the subjective VOI for each block and looked for regions whose activation at the presentation of the initial beads tracked it. We found a cluster in the right DLPFC representing the subjective VOI (Fig. 4B; cluster-forming threshold p < .001, cluster-mass p < .05, whole-brain FWE corrected; peak MNI coordinate = [48, 42, 24]). Activation in this cluster peaked when more low-reward beads had been drawn in both blocks, consistent with the prediction (Fig. 4C). This cluster was the only one that survived our whole-brain statistical threshold (we also observed a cluster in the right anterior insula at a more lenient threshold, p < .10; peak MNI coordinate = [30, 24, 4]; Fig. S1).

Interestingly, the DLPFC cluster overlaps with a VOI cluster reported in a previous study that examined one-shot instrumentality-driven information seeking (Kobayashi & Hsu, 2019) (Fig. S1), providing converging evidence that the right DLPFC represents the VOI across decision contexts, at least when information is primarily acquired based on its instrumentality for future value-guided decisions.

Updating of VOI representation

We then turned to our final question: how is the VOI updated in the brain upon the arrival of additional evidence? When the evidence available to agents changes, they need to track the up-to-date VOI in order to seek information adaptively over time. Specifically, we examined how the right DLPFC responds to the extra bead presented after the initial beads but prior to the information-seeking choice (Fig. 5A). We derived the VOI updating, or the difference between the posterior and prior VOI, as a function of the initial beads difference (the prior evidence) and the type of the extra bead (the evidence that causes updating). For instance, if participants have observed many more low-reward beads than high-reward beads (the beads difference < −5), an extra high-reward bead would positively update the VOI, as it slightly increases the uncertainty on the bet, while an extra low-reward bead would negatively update the VOI, as it further decreases the uncertainty on the bet. The directionality of updating is the opposite when more high-reward beads have been observed (the beads difference > 0).
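This sign pattern follows from the inverted-U shape of the VOI. As a rough numerical illustration (not the authors' code), the sketch below evaluates the VOI of Eqs. 2-3 with a finite lookahead and prints how an extra high- or low-reward bead changes it from several initial beads differences; the lookahead depth, the chosen reward structure, and all names are illustrative assumptions.

```python
from functools import lru_cache

Q = 0.6                    # majority-bead proportion in each jar
R_H, R_L, R_0 = 70, 10, 0  # one of the task's reward structures
COST, HORIZON = 0.1, 30    # draw cost (points) and lookahead depth

def p_high(d):
    # Posterior of the high-reward jar given the beads difference d (Eq. 1)
    return 1.0 / (1.0 + ((1 - Q) / Q) ** d)

def bet_ev(d):
    # Best EV of betting now: max over betting on the high- or low-reward jar
    p = p_high(d)
    return max(R_H * p + R_0 * (1 - p), R_0 * p + R_L * (1 - p))

@lru_cache(maxsize=None)
def draw_ev(d, steps):
    # Best EV achievable after drawing one more bead (Eq. 3), with a finite
    # lookahead standing in for the paper's 200-step backward recursion
    p = p_high(d)
    p_next_high = Q * p + (1 - Q) * (1 - p)
    def value(d_next):
        if steps == 0:
            return bet_ev(d_next)
        return max(bet_ev(d_next), draw_ev(d_next, steps - 1) - COST)
    return p_next_high * value(d + 1) + (1 - p_next_high) * value(d - 1)

def voi(d):
    return draw_ev(d, HORIZON) - bet_ev(d)  # Eq. 2

for d0 in (-8, -5, -2):  # initial beads difference before the extra bead
    print(f"d0 = {d0:+d}: extra high-reward bead changes VOI by "
          f"{voi(d0 + 1) - voi(d0):+.3f}, extra low-reward bead by "
          f"{voi(d0 - 1) - voi(d0):+.3f}")
# Below the VOI peak (beads difference around -5), a high-reward bead raises
# the VOI and a low-reward bead lowers it; above the peak the pattern
# reverses; at the peak itself both bead types lower the VOI.
```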
Fig. 4. (A) The subjective VOI was estimated for each block based on information-seeking behavior (Fig. 3A). (B) The right DLPFC represented the subjective VOI (cluster-mass p < .05, whole-brain FWE corrected; peak MNI coordinate: [48, 42, 24]). (C) As predicted, right DLPFC activation peaks at a negative beads difference in both blocks. Error bars: SEM.

We hypothesized that the right DLPFC tracks the up-to-date VOI over time, such that it responds not only to the VOI based on the initial beads but is dynamically updated to reflect the appropriate VOI after observation of the extra bead. To test this, we estimated the effects of the initial VOI and the VOI updating on BOLD signals from the region of interest (ROI) defined above (Fig. 4B). In order to avoid strong assumptions about the time course of the updating process, we estimated the effects of the initial VOI and the VOI updating across time using finite impulse response (FIR) functions aligned to the presentation of the extra bead (Fig. 5, top). We included three FIRs in a GLM: one parametrically modulated by the initial VOI, one modulated by the VOI updating, and one without parametric modulation (intercept). Since the ROI was originally defined based on its response to the initial VOI (albeit in an earlier time window), the estimated effect of the initial VOI is biased, but the estimated effect of the VOI updating depends critically on the exact bead that was drawn, and thus is independent of our ROI selection process (Fig. 5A).

Fig. 5. Updating of the VOI representation. The right DLPFC tracks the VOI as it is updated by an extra bead, presented after the initial beads but prior to information seeking. (A) The VOI updating was calculated as the signed difference between the VOI after the extra bead and the VOI before the extra bead. (B) Time courses of the initial VOI signal (grey) and the VOI updating signal (purple) in the right DLPFC. The right DLPFC responds not only to the initial VOI but also to the updating of the VOI (temporal cluster-mass p < .05, FWE corrected). Since the region of interest was defined based on the initial VOI signal, estimation of the initial VOI signal is biased, but estimation of the updating signal is unbiased. Error bars: SEM.

The estimated time courses are shown in Fig. 5B. As expected, the right DLPFC represents the initial VOI early on. Importantly, the right DLPFC also responded positively to the VOI updating (cluster-forming threshold p < .05, cluster-mass p < .05, FWE corrected across time). The rise of the VOI updating signal lags behind the initial VOI signal in time, but the two return to baseline in parallel. The estimated time courses look somewhat sluggish, which presumably reflects the nature of our experimental paradigm, in which participants had several seconds to complete information seeking.

This evidence demonstrates that neural representations in right DLPFC shift from the initial (a priori) VOI to the updated (a posteriori) VOI, suggesting that this brain region dynamically tracks the VOI based on the up-to-date evidence in service of adaptive information seeking over time.

Discussion

In order to make better decisions, we need to seek information adaptively based on what we already know (up-to-date decision evidence) and what is at stake (reward structure). When our knowledge is updated, we need to update the VOI accordingly to decide whether to seek further information. Deficits in updating the VOI could lead to excessive repetition of information seeking even after enough evidence has been accumulated (Hauser et al., 2017), or conversely, to premature jumping to conclusions without enough evidence (Dudley et al., 2016; Ross et al., 2015). Despite its importance and ubiquity in the real world, we know little about how people evaluate and update the VOI. In this study, we used a variant of the beads task, in which decision evidence was parametrically manipulated on a trial-by-trial basis, to examine how information seeking is shaped by current evidence and an asymmetric reward structure, and how the VOI is represented and updated in the brain.
We theoretically derived, and empirically verified, the normative prediction that information seeking should be biased by reward asymmetry. Participants were more likely to seek information when the current evidence favored the less rewarding state, owing to high uncertainty about which state to bet on. While the current study used asymmetric monetary rewards, our theoretical framework can be generalized beyond economic decision making based on the notion that people assign intrinsic values to beliefs that they can hold (Kunda, 1991; Sharot & Garrett, 2016). If people are incentivized to hold certain beliefs, they will be more motivated to seek information when the current evidence supports the less desirable belief, even without extrinsic reward asymmetry (e.g., people check the latest number of COVID-19 cases more often when it is increasing than when it is decreasing). It is worth noting, however, that the current study only examined reward structures in which a correct bet yields asymmetric rewards but an incorrect bet does not, while outcomes of an incorrect prediction could also be asymmetric in some real-world scenarios (e.g., it would be more punishing to underestimate the chance of COVID-19 transmission and end up causing a superspreader event than to overestimate it and avoid a social gathering). More comprehensive, generalizable predictions would be obtained by expanding our findings to various reward structures.

Our theoretical and behavioral findings may provide some insight into confirmation biases observed across domains. Confirmation bias is commonly framed as a bias in updating processes and/or decision criteria due to reward asymmetry or other factors such as pre-commitment (Gesiarz et al., 2019; Leong et al., 2019; Luu & Stocker, 2018; Talluri et al., 2018). We showed that, even without biases in updating or decision criteria, information seeking should be biased by reward asymmetry. The current study was not designed to test conventional confirmation bias; our behavioral measure of information seeking is not sensitive to a bias in updating, and a bias in decision criteria is not distinguishable from a non-neutral risk attitude in our paradigm. Future research may examine how confirmation bias in updating and/or decision criteria affects information seeking, and conversely, how the information-seeking bias would strengthen or weaken the effects of confirmation bias. Another exciting question for future research is whether people exhibit an information-seeking bias when sampling evidence from internal representations rather than the external world, such as episodic memory (Shadlen & Shohamy, 2016).

Our finding of a VOI representation in DLPFC is consistent with a previous fMRI study on instrumentality-driven information seeking (Kobayashi & Hsu, 2019), despite a number of key differences in task design. First, our paradigm required probabilistic inference on the hidden jar composition based on observable evidence, while the previous study did not require such inference. Second, we parametrically manipulated the decision evidence and examined its effect on information-seeking behavior and the underlying neural signals. Thus, the current study not only replicates but also critically extends the findings of Kobayashi & Hsu (2019) by showing that DLPFC is sensitive to the current evidence and biased by reward asymmetry, a key theoretical prediction of the instrumentality-driven VOI.
Along with neuroimaging evidence that DLPFC is also activated during information seeking driven by factors other than instrumentality (Gruber et al., 2014; Kang et al., 2009; Jepma et al., 2012), these results suggest that DLPFC is critical for adaptive information seeking across decision contexts and domains.

Unlike previous studies, we did not find a VOI representation in reward regions (e.g., striatum or VMPFC) or ACC (Bromberg-Martin & Hikosaka, 2009; Brydevall et al., 2018; Charpentier et al., 2018; Gruber et al., 2014; Kaanders et al., 2020; Kang et al., 2009; Krebs et al., 2009; Lau et al., 2020; White et al., 2019). It is possible that we lacked statistical power to detect signals in these regions; indeed, we found a VOI cluster in anterior insula at a liberal threshold (Fig. S1), a region that often coactivates with ACC in task-based and resting-state fMRI (Fox et al., 2005; Menon & Uddin, 2010; Seeley et al., 2007). Alternatively, the involvement of these regions could depend on the task and decision context. For instance, striatum and/or VMPFC may be more important when the information-seeking cost is larger and variable, which would demand online computation of the VOI relative to the cost.

Importantly, we showed that DLPFC not only represents the VOI based on the initial evidence but also updates it when additional evidence is supplied; in other words, DLPFC dynamically tracks the up-to-date VOI over time.

Methods

All procedures were approved by the Institutional Review Board at the University of Pennsylvania.

Participants. Fifteen people (11 female, 4 male; age 18-28, mean = 21.27, standard deviation = 2.79) participated in the experiment. They provided informed consent in accordance with the Declaration of Helsinki.

Task design. We adopted a variant of the beads task (Furl & Averbeck, 2011; Huq et al., 1988; Phillips & Edwards, 1966); the participant was presented with a jar containing two types of beads and asked to guess its composition (i.e., which type made up the majority of the beads) by drawing some beads from the jar (Fig. 1A). Our variant had three important features. First, the participant was rewarded for identifying the correct jar composition, but the reward structure was asymmetric, such that the participant could earn more rewards by correctly betting on one jar type than on the other (Fig. 1B). Second, a variable number of beads was drawn from the jar and presented to the participant at the beginning of each trial, empirically manipulating the evidence available to the participant before they sought information. Third, an extra bead was presented on a subset of trials to update the initial evidence. These features allowed us to examine how the brain represents and updates the VOI based on evidence that changes over time.

The experiment consisted of two interleaved trial types, bet-only trials and information-seeking trials (Fig. 1C). In the bet-only trials, the participant was first presented with a number of beads drawn from the jar. Each bead was depicted as a rounded picture of a face or a house (one face picture and one house picture were used throughout the experiment). Beads marked with a face were presented to the left and those marked with a house to the right. The participant was told that these beads were drawn from one of two jars: a face-majority jar, which consisted of 60% face beads and 40% house beads, and a house-majority jar, which consisted of 60% house beads and 40% face beads.
Rewards for correct and incorrect bets (in points) were also presented, in green and gray, respectively. Rewards for a bet on the face-majority jar were shown above the face beads, and rewards for a bet on the house-majority jar above the house beads. Rewards for a correct bet on one jar were numerically larger than rewards for a correct bet on the other jar (reward asymmetry), while an incorrect bet on either jar yielded the same reward (Fig. 1B). After the presentation of the initial beads for 3 seconds, the participant was asked to make a bet. During the bet phase of the task, the face and house beads were separately outlined by magenta boxes, and the participant could press the left or right button on a response box to bet on the face- or house-majority jar, respectively. Trials in which the participant did not make a bet within 3 seconds were terminated and discarded from the analysis.

In the information-seeking trials, the participant was first presented with the initial beads screen (same as in the bet-only trials), followed by a blank screen (0-2 seconds). Next, an extra bead drawn from the jar was presented, marked with either a face or a house (1 second), and was added to the corresponding group of beads on the initial screen (0-2 seconds). The participant was then asked to decide whether to draw more beads from the jar before making a bet on its composition (information-seeking phase). Two choices appeared on the screen, "draw" and "bet", and the participant pressed one button to draw one more bead and another button to terminate the information-seeking phase and proceed to the bet (the sides of the options were randomized across trials). The participant was allowed to draw as many beads as they wanted within 5 seconds, and a face or house bead was added to the screen every time they pressed the "draw" button. The participant was told that each draw incurred a constant small cost (0.1 points). Once they pressed the "bet" button (or once 5 seconds had passed), they were presented with the bet screen (same as in the bet-only trials).

The task was programmed in Matlab (The MathWorks, Natick, MA) using the MGL (http://justingardner.net/mgl/) and SnowDots (http://code.google.com/p/snow-dots/) extensions.

Procedure. In a separate task session before scanning, participants received extensive training on the task, in which various aspects of the task were gradually introduced (betting on the jar composition, asymmetric rewards, costly draws, and multiple reward structures). During the subsequent session, participants completed the task inside the scanner. Participants made responses using an MRI-compatible button box. They were compensated based on the total points they acquired in the scanning session (500 points = $1).

The scanning experiment consisted of two blocks, which differed in reward structure (Fig. 1B). In the first block (the baseline block), one of two reward structures, $(R_H, R_L, R_0)$ = (70, 10, 0) or (170, 110, 100), was randomly presented in each trial, where $R_H$ is the reward for a correct bet on the high-reward jar, $R_L$ is the reward for a correct bet on the low-reward jar, and $R_0$ is the reward for an incorrect bet; thus, the participant earned a baseline reward of 100 points irrespective of their bet in half of the trials. In the second block (the scale block), one of two reward structures, $(R_H, R_L, R_0)$ = (70, 10, 0) or (7, 1, 0), was randomly presented in each trial; thus, the participant earned a tenth of the rewards in half of the trials.
Each block consisted of two scanning runs, one in which the high-reward jar was the face-majority jar and one in which the high-reward jar was the house-majority jar; their order was counterbalanced across participants.

On each trial, the participant was presented with 20 or 30 initial beads from the jar. The difference between the numbers of initial beads marked with a face and with a house was uniformly sampled from a discrete set of values ranging from −10 to 10 in increments of 2. Unbeknownst to the participant, the true jar type was stochastically determined following the Bayesian posterior conditional on the initial beads difference (see Eq. 1 below). In the information-seeking trials, the type of the extra bead and of all additional beads drawn by the participant (face or house) was stochastically determined based on the hidden jar type. The participant was not provided with feedback on their bet accuracy or rewards on a trial-by-trial basis. They were, however, informed of the total number of points they had accumulated at the end of each run.

Theory. Normative predictions about the VOI, or how much an optimal agent should pay for information, were derived under the assumptions that the agent conducts full Bayesian inference on the jar type, deterministically makes an optimal choice to maximize the expected value (EV), is risk neutral, and optimally seeks information based on its instrumentality, i.e., how much it would improve the EV of the subsequent bet choice. Our theoretical framework did not consider any additional information-seeking motives, such as curiosity, savoring, dread, or uncertainty reduction.

Let $S_H$ be the state in which the true jar is the high-reward jar and $S_L$ the state in which it is the low-reward jar. Let $a_H$ be the action of betting on $S_H$ and $a_L$ the action of betting on $S_L$. Let us further refer to the majority beads in the high-reward jar as high-reward beads and the majority beads in the low-reward jar as low-reward beads (for instance, if the high-reward jar is the house-majority jar, a house bead is a high-reward bead and a face bead is a low-reward bead; note that the beads were not directly associated with rewards per se). The goal of the agent is to choose between $a_H$ and $a_L$ to maximize EV given the current evidence (i.e., the number of high-reward beads $n_H$ and low-reward beads $n_L$ drawn from the jar so far) and the reward structure $(R_H, R_L, R_0)$.

The likelihood of drawing a high-reward bead $b_H$ or a low-reward bead $b_L$ conditional on the jar type is known to the agent:

$$P(b_H \mid S_H) = P(b_L \mid S_L) = 0.6, \qquad P(b_H \mid S_L) = P(b_L \mid S_H) = 0.4.$$

With a flat prior over the two jar types, the posterior on the jar type is therefore

$$P(S_H \mid n_H, n_L) = \frac{0.6^{n_H}\, 0.4^{n_L}}{0.6^{n_H}\, 0.4^{n_L} + 0.4^{n_H}\, 0.6^{n_L}} = \frac{1}{1 + (2/3)^{\,n_H - n_L}}, \qquad P(S_L \mid n_H, n_L) = 1 - P(S_H \mid n_H, n_L), \quad (1)$$

which is a function of the beads difference $n_H - n_L$ (e.g., the posterior is the same when $(n_H, n_L)$ = (5, 2) or (15, 12)) (Fig. 2A, B).

Given the posterior, the agent makes a choice among three options: to bet on $S_H$, to bet on $S_L$, or to seek information and draw an additional bead from the jar, which incurs a cost $c_{draw}$ (0.1 points). The agent should decide whether to draw an additional bead based on the VOI, or the improvement in the bet's EV afforded by the next bead:

$$VOI(n_H, n_L) = EV_{draw}(n_H, n_L) - EV_{bet}(n_H, n_L), \quad (2)$$

where $EV_{draw}$ is the highest EV that the agent could achieve after drawing the next bead (without considering the information-seeking cost), and $EV_{bet}$ is the highest EV that the agent can achieve by betting immediately:

$$EV(a_H \mid n_H, n_L) = R_H\, P(S_H \mid n_H, n_L) + R_0\, P(S_L \mid n_H, n_L),$$
$$EV(a_L \mid n_H, n_L) = R_0\, P(S_H \mid n_H, n_L) + R_L\, P(S_L \mid n_H, n_L),$$
$$EV_{bet}(n_H, n_L) = \max\{EV(a_H \mid n_H, n_L),\ EV(a_L \mid n_H, n_L)\}.$$

Since the posterior is determined by the beads difference (Eq. 1), the bet EVs are also determined by the beads difference.

In order to evaluate $EV_{draw}$, we have to take into account two important facets of our information-seeking paradigm: first, the content of information (the type of the next bead, $b_H$ or $b_L$) is stochastic, and second, the agent can decide whether to draw yet another bead after observing the next bead. Therefore, we have to evaluate the likelihood of the next bead type and combine it with the EV of an optimal choice conditional on each bead type. The likelihood of the next bead type given the current evidence is evaluated according to the posterior on the jar type:

$$P(b_H \mid n_H, n_L) = 0.6\, P(S_H \mid n_H, n_L) + 0.4\, P(S_L \mid n_H, n_L),$$
$$P(b_L \mid n_H, n_L) = 0.4\, P(S_H \mid n_H, n_L) + 0.6\, P(S_L \mid n_H, n_L).$$

If the next bead is $b_H$, it would update the evidence from $(n_H, n_L)$ to $(n_H + 1, n_L)$. Then the agent can either make an optimal bet and achieve $EV_{bet}(n_H + 1, n_L)$, or pay the cost to draw another bead and achieve $EV_{draw}(n_H + 1, n_L) - c_{draw}$. Similarly, if the next bead is $b_L$, it would update the evidence to $(n_H, n_L + 1)$, based on which the agent can either make an optimal bet and achieve $EV_{bet}(n_H, n_L + 1)$ or draw another bead and achieve $EV_{draw}(n_H, n_L + 1) - c_{draw}$. Therefore, the highest EV that the agent can achieve after drawing an additional bead is

$$EV_{draw}(n_H, n_L) = P(b_H \mid n_H, n_L)\, \max\{EV_{bet}(n_H + 1, n_L),\ EV_{draw}(n_H + 1, n_L) - c_{draw}\} + P(b_L \mid n_H, n_L)\, \max\{EV_{bet}(n_H, n_L + 1),\ EV_{draw}(n_H, n_L + 1) - c_{draw}\}. \quad (3)$$

In Eq. 3, $EV_{draw}(n_H, n_L)$ on the left-hand side depends on $EV_{draw}(n_H + 1, n_L)$ and $EV_{draw}(n_H, n_L + 1)$ on the right-hand side; we therefore evaluated it numerically by backward recursion (up to 200 steps; Fig. S2). Under this framework, information seeking is motivated by information's instrumentality for future reward seeking.
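A compact numerical sketch of Eqs. 1-3 follows (it is not the authors' implementation; the variable names, the grid of beads differences, and the treatment of the grid edges are our own choices). It evaluates $EV_{draw}$ by backward recursion, analogously to the 200-step recursion reported in Fig. S2, and checks that the VOI peaks at a beads difference of −5 under all three reward structures.

```python
import numpy as np

Q = 0.6                 # majority-bead proportion in each jar
C_DRAW = 0.1            # cost per additional draw (points)
N_STEPS = 200           # depth of the backward recursion (as in Fig. S2)
D = np.arange(-60, 61)  # grid of beads differences n_H - n_L

def posterior(d):
    """Eq. 1: P(S_H | n_H, n_L) as a function of d = n_H - n_L."""
    return 1.0 / (1.0 + ((1 - Q) / Q) ** d)

def ev_bet(d, R_H, R_L, R_0):
    """Highest EV of betting now: max of EV(a_H) and EV(a_L)."""
    p = posterior(d)
    return np.maximum(R_H * p + R_0 * (1 - p), R_0 * p + R_L * (1 - p))

def voi(R_H, R_L, R_0):
    """Eq. 2: VOI(d) = EV_draw(d) - EV_bet(d), with EV_draw (Eq. 3) computed
    by backward recursion; at the recursion horizon and at the edges of the
    grid the agent is forced to bet."""
    p = posterior(D)
    p_bh = Q * p + (1 - Q) * (1 - p)       # P(next bead is high-reward)
    bet = ev_bet(D, R_H, R_L, R_0)
    ev_draw = bet.copy()                   # horizon 0: must bet after drawing
    for _ in range(N_STEPS):
        v = np.maximum(bet, ev_draw - C_DRAW)       # bet now vs. draw again
        v_hi = np.append(v[1:], bet[-1])            # state after a high-reward bead
        v_lo = np.insert(v[:-1], 0, bet[0])         # state after a low-reward bead
        ev_draw = p_bh * v_hi + (1 - p_bh) * v_lo   # Eq. 3
    return ev_draw - bet

for rewards in [(70, 10, 0), (170, 110, 100), (7, 1, 0)]:
    v = voi(*rewards)
    print(rewards, "-> VOI peaks at beads difference", int(D[np.argmax(v)]))
# Each reward structure yields a peak at a negative beads difference (-5),
# matching Fig. 2F and Fig. S2.
```

Consistent with the predictions above, the baseline-shifted structure (170, 110, 100) yields exactly the same VOI as (70, 10, 0), whereas the scaled-down structure (7, 1, 0) yields a smaller VOI with the same peak location.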
Behavioral data analysis. In order to examine information-seeking behavior, we analyzed the frequency with which participants drew at least one bead as a function of the beads difference. We specifically focused on whether they drew the first bead as a function of the current evidence and examined whether this was biased by the reward asymmetry, as theoretically predicted. The relationship between information-seeking behavior and the beads difference was analyzed using Gaussian Process (GP) logistic regression (Rasmussen & Williams, 2006), implemented with the GPML toolbox (https://github.com/alshedivat/gpml) (Rasmussen & Nickisch, 2010). To test whether information-seeking behavior systematically differed across blocks and reward conditions, we compared models using leave-one-participant-out and leave-one-trial-out cross validation (LOTO CV). We also applied the same analytic approach to the bet choices, comparing the performance of Models 1-4. We found that Model 3 outperformed the other models for both information-seeking and bet choices.
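As a rough illustration of this approach (not the authors' GPML/Matlab pipeline, and using purely synthetic data), the sketch below fits a Gaussian Process classifier to draw/no-draw choices as a smooth function of the beads difference, without assuming a parametric functional form. All variable names, the kernel settings, and the simulated choice probabilities are illustrative assumptions, and the model comparison itself (Models 1-5 with leave-one-participant-out and leave-one-trial-out cross validation) is not reproduced here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Synthetic data: beads difference on each information-seeking trial and
# whether the participant drew at least one extra bead (1) or bet right away (0).
rng = np.random.default_rng(0)
beads_diff = rng.integers(-10, 11, size=400)
p_draw_true = 0.1 + 0.8 * np.exp(-0.5 * ((beads_diff + 5) / 3.0) ** 2)  # bias toward d = -5
drew = rng.binomial(1, p_draw_true)

X = beads_diff.reshape(-1, 1).astype(float)
gp = GaussianProcessClassifier(kernel=ConstantKernel(1.0) * RBF(length_scale=3.0))
gp.fit(X, drew)

grid = np.arange(-10, 11, dtype=float).reshape(-1, 1)
p_hat = gp.predict_proba(grid)[:, 1]
print("estimated P(draw) peaks at beads difference", int(grid[np.argmax(p_hat), 0]))
# With this synthetic generating process, the recovered curve should peak
# near a beads difference of -5.
```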
fMRI data analysis. Functional images were preprocessed, normalized to the MNI space, and spatially smoothed (Gaussian kernel FWHM = 6 mm).

To look for regions that represent the subjective VOI upon the initial beads presentation, we ran a GLM analysis (GLM 1). The regressor of interest modeled the initial beads presentation, parametrically modulated by the subjective VOI. Participant-level T-statistics were entered into the population-level inference using FSL randomise, in which clusters that showed a positive response to the subjective VOI were defined at a voxel-wise cluster-forming threshold of p < .001 and evaluated by sign-flipping permutation on cluster mass. A cluster that survived whole-brain family-wise error (FWE) correction at p < .05 is reported in Fig. 4B; another cluster that survived a more lenient threshold (p < .10) is reported in Fig. S1.

To illustrate how the cluster's activation varied as a function of the beads difference, we ran another GLM (GLM 2) using FSL FEAT, which included a regressor for each level of beads difference separately, along with the same nuisance regressors as GLM 1. T-statistics for each regressor of interest were then averaged across runs within each block and then averaged across all voxels in the right DLPFC cluster defined as above (Fig. 4B).

Lastly, to examine how the DLPFC responds to the updating of the VOI, we ran another GLM with finite impulse response (FIR) regressors aligned to the presentation of the extra bead: one set parametrically modulated by the initial VOI, one modulated by the VOI updating, and one without parametric modulation (see Results and Fig. 5).
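To make the FIR construction concrete, here is a schematic numpy sketch of the three FIR regressor sets described above (one unmodulated, one modulated by the initial VOI, one modulated by the VOI updating). It is not the authors' FSL pipeline; the TR, the number of time bins, the mean-centering of the modulators, and all example values are illustrative assumptions.

```python
import numpy as np

TR = 2.0        # repetition time in seconds (illustrative)
N_SCANS = 300   # volumes in a run (illustrative)
N_BINS = 8      # number of FIR time bins following the extra-bead onset

# Illustrative trial-wise inputs: onset of the extra bead (s), the VOI implied
# by the initial beads, and the signed VOI update caused by the extra bead.
onsets = np.array([20.0, 65.0, 112.0])
initial_voi = np.array([1.2, 0.3, 0.8])
voi_update = np.array([-0.4, 0.6, -0.1])

def fir_regressors(onsets, modulator, n_scans, tr, n_bins):
    """One column per post-onset time bin; each trial contributes its
    modulator value to the bin covering (onset + k * TR)."""
    X = np.zeros((n_scans, n_bins))
    for onset, m in zip(onsets, modulator):
        for k in range(n_bins):
            scan = int(onset // tr) + k
            if scan < n_scans:
                X[scan, k] += m
    return X

design = np.hstack([
    fir_regressors(onsets, np.ones_like(onsets), N_SCANS, TR, N_BINS),              # intercept FIRs
    fir_regressors(onsets, initial_voi - initial_voi.mean(), N_SCANS, TR, N_BINS),  # initial VOI
    fir_regressors(onsets, voi_update - voi_update.mean(), N_SCANS, TR, N_BINS),    # VOI updating
])
print(design.shape)  # (N_SCANS, 3 * N_BINS): three FIR sets, as in the GLM above
```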
Fig. S1. (A) At a liberal threshold (cluster-forming threshold p < .001, cluster-mass p < .10, corrected for whole-brain FWE), the subjective VOI was positively associated with activation in the right anterior insula. (B) The DLPFC cluster identified in the current dataset (Fig. 4B) overlaps with a subjective VOI cluster reported in Kobayashi & Hsu (2019).

Fig. S2. The theoretical VOI was numerically estimated by backward recursion (up to 200 steps). The VOI reached an asymptote at each level of beads difference over the course of recursion. Moreover, the VOI was highest at a negative beads difference (−5) throughout recursion.

References
A distinct inferential mechanism for delusions in schizophrenia
Cognition in schizophrenia: core psychological and neural mechanisms
Learning the value of information in an uncertain world
Orbitofrontal cortex uses distinct codes for different choice attributes in decisions motivated by curiosity
Midbrain dopamine neurons signal preference for advance information about upcoming rewards
Lateral habenula neurons signal errors in the prediction of reward information
The neural encoding of information prediction errors during non-instrumental information seeking
Psychological expected utility theory and anticipatory feelings
Valuation of knowledge and ignorance in mesolimbic reward circuitry
Psychosis, Delusions and the "Jumping to Conclusions" Reasoning Bias: A Systematic Review and Meta-analysis
Optimal strategies for seeking information: Models for statistics, choice reaction times, and human information processing
Seeking Information to Reduce the Risk of Decisions
Meta-analytic investigations of structural grey matter, executive domain-related functional activations, and white matter diffusivity in obsessive compulsive disorder: An integrative review
The human brain is intrinsically organized into dynamic, anticorrelated functional networks
Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex
Parietal Cortex and Insula Relate to Evidence Seeking Relevant to Reward-Related Decisions
Evidence accumulation is biased by motivation: A computational account
Towards a neuroscience of active sampling and curiosity
States of curiosity modulate hippocampus-dependent learning via the dopaminergic circuit
Increased decision thresholds trigger extended information gathering across the compulsivity spectrum
Information value theory
Probabilistic Judgements in Deluded and Non-Deluded Subjects
Neural mechanisms underlying the induction and relief of perceptual curiosity
Medial frontal cortex activity predicts information sampling in economic choice
Dopamine: generalization and bonuses
The wick in the candle of learning: Epistemic curiosity activates reward circuitry and enhances memory
Double dissociation of value computations in orbitofrontal and anterior cingulate neurons
The psychology and neuroscience of curiosity
Common neural code for reward and information value
Diverse motives for human curiosity
Neural Mechanisms of Foraging
The novelty exploration bonus and its attentional modulation
Temporal resolution of uncertainty and dynamic choice theory
The case for motivated reasoning
Shared striatal activity in decisions to satisfy curiosity and hunger at the risk of electric shocks
Neurocomputational mechanisms underlying motivated seeing
The psychology of curiosity: A review and reinterpretation
Post-decision biases reveal a self-consistency principle in perceptual inference
Functionally Dissociable Influences on Learning Rate in a Dynamic Environment
Saliency, switching, attention and control: a network model of insula function
Dissociable Forms of Uncertainty-Driven Representational Change Across the Human Brain
Conservatism in a simple probability inference task
Gaussian Processes for Machine Learning (GPML) Toolbox
Gaussian Processes for Machine Learning
Jumping to Conclusions About the Beads Task? A Meta-analysis of Delusional Ideation and Data-Gathering
Frontal cortex subregions play distinct roles in choices between actions and stimuli
Choice, uncertainty and value in prefrontal and cingulate cortex
Dissociable Intrinsic Connectivity Networks for Salience Processing and Executive Control
Decision Making and Sequential Sampling from Memory
Integration theory applied to judgments of the value of information
Forming Beliefs: Why Valence Matters
How people decide what they want to know
Dorsal anterior cingulate cortex and the value of control
Anterior cingulate engagement in a foraging context reflects choice difficulty, not foraging value
Advances in functional and structural MR image analysis and implementation as FSL
The what, where and how of delay activity
Confirmation bias through selective overweighting of choice-consistent evidence
Value of information for decisions
A neural network for information seeking
Humans use directed and random exploration to solve the explore-exploit dilemma