title: Spatial structure, phase, and the contrast of natural images
authors: Rideaux, Reuben; West, Rebecca K.; Wallis, Thomas S. A.; Bex, Peter J.; Mattingley, Jason B.; Harrison, William J.
date: 2021-11-02
journal: bioRxiv
DOI: 10.1101/2021.06.16.448761

The sensitivity of the human visual system is thought to be shaped by environmental statistics. A major endeavour in vision science, therefore, is to uncover the image statistics that predict perceptual and cognitive function. When searching for targets in natural images, for example, it has recently been proposed that target detection is inversely related to the spatial similarity of the target to its local background. We tested this hypothesis by measuring observers' sensitivity to targets that were blended with natural image backgrounds. Targets were designed to have a spatial structure that was either similar or dissimilar to the background. Contrary to masking from similarity, we found that observers were most sensitive to targets that were most similar to their backgrounds. We hypothesised that a coincidence of phase alignment between target and background results in a local contrast signal that facilitates detection when target-background similarity is high. We confirmed this prediction in a second experiment. Indeed, we show that, by solely manipulating the phase of a target relative to its background, the target can be rendered easily visible or undetectable. Our study thus reveals that, in addition to its structural similarity, the phase of the target relative to the background must be considered when predicting detection sensitivity in natural images.

The human visual system is tasked with parsing the complexity of natural environments into a coherent representation of behaviourally relevant information.
These operations have been shaped by various selective pressures over evolutionary and developmental timescales. Therefore, the perceptual computations that guide cognition and behaviour ultimately serve to extract functional information from rich and complex naturalistic environments (Carandini et al., 2005; Simoncelli & Olshausen, 2001). For example, a common task is to find a pre-defined target object in a complex or cluttered visual environment. The vast majority of our knowledge of the visual system, however, has been derived from experiments using relatively sparse stimulus displays that are not representative of our typical visual diets. The aim of the present study was to investigate how natural image structure influences target detection. We tested how detection is influenced by the spatial structure, phase, and contrast of natural image backgrounds to determine the features that best predict detection sensitivity.

Luminance contrast plays a critical role in most visual tasks. The human visual system is tuned to detect contrast across a range of spatial and temporal frequencies. Neurons in primary visual cortex (V1) are classically understood as processing local regions of oriented contrast that can define the borders of objects (Hubel & Wiesel, 1959). Such properties of individual neurons govern phenomenal perception and are thought to be shaped by the statistics of natural environments (Barlow, 1961, 1972). The encoding of contrast within the visual system is most commonly studied with oriented grating stimuli, such as Gabor wavelets. Grating stimuli are conveniently characterised by a simple set of parameters: orientation, contrast, position, and spatial frequency. From a computational perspective, "Gabor wavelet analyses" allow the decomposition of any image into mathematically tractable component features. Such analyses are relatively simple and are common in many computer vision applications.
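A grating stimulus of this kind can be built directly from the four parameters listed above. The sketch below (NumPy) is illustrative only: the envelope width `sigma` and all parameter values are arbitrary choices for demonstration, not values taken from the study.

```python
import numpy as np

def gabor(size, orientation, contrast, position, spatial_freq, sigma=0.15):
    """Build a Gabor wavelet on a [-1, 1] square grid from its defining
    parameters: orientation (radians), contrast (peak amplitude), position
    (x0, y0) of the envelope centre, and spatial frequency (cycles per unit
    distance on the grid). sigma sets the Gaussian envelope width."""
    x, y = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
    x, y = x - position[0], y - position[1]
    # Rotate coordinates so the carrier grating runs along the requested orientation
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    carrier = np.cos(2 * np.pi * spatial_freq * xr)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))   # Gaussian window
    return contrast * carrier * envelope

g = gabor(size=128, orientation=np.pi / 4, contrast=0.5,
          position=(0.0, 0.0), spatial_freq=4)
```

Because the stimulus is the product of a unit-amplitude carrier and a unit-peak envelope, the contrast parameter directly bounds the wavelet's peak amplitude.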
Early theory suggested that analogous decomposition processes occur in the visual system (Campbell & Robson, 1968). However, more recent studies suggest that individual visual neurons encode complex higher-order statistical information that is not necessarily predicted by Gabor parameters (e.g. Cadena et al., 2019).

One common approach to investigating contrast sensitivity in natural conditions is to have observers detect contrast-defined targets embedded in digital photographs or movies. Relative to sensitivity as typically quantified with a uniform background, spatio-temporal contrast sensitivity is diminished when viewing dynamic movies, particularly at lower frequencies (Bex et al., 2009). Furthermore, during free viewing of natural movies, the large-scale retinal changes caused by saccadic eye movements also diminish sensitivity, likely due to forward and backward masking (Dorr & Bex, 2013; Wallis et al., 2015). In general, such studies have revealed that the sensitivity of the visual system does indeed depend on naturalistic context (Bex & Makous, 2002; Geisler, 2008).

Researchers have further sought to understand the statistical regularities of natural scenes that impact the detectability of targets. For example, various image structures, such as the density of edges within close proximity to the target, negatively impact detection sensitivity (Bex et al., 2009; see also Wallis et al., 2015). Indeed, the discriminability of visual objects can be predicted from the spatial proximity of surrounding visual clutter (Balas et al.).

The present study

The aim of the present study was to test observers' sensitivity to targets presented on natural image backgrounds. Importantly, we designed the test stimuli a priori such that targets approximated the appearance of, and were aligned with, the local structure of a natural image background, or differed from the local structure.
We therefore distinguish target-background alignment from target-background similarity in terms of the stimulus generation procedure (alignment) versus an image statistic (similarity). As shown in Figure 1, we automated the placement of targets within natural backgrounds according to oriented contrast energy at different image regions. We created two conditions, one in which targets were aligned with their backgrounds and one in which targets were misaligned with their backgrounds. In contrast to this stimulus generation procedure, target-background similarity is a measure of the correlation between a target and a background without a target. While target-background similarity ranges from 0 to 1 for all stimuli, our stimulus generation method results in higher similarity scores for aligned targets than for misaligned targets (on average). Based on previous studies showing a negative impact of increasing target-background similarity on detection (Bex et al., 2009; Sebastian et al., 2017), we expected to find worse detection sensitivity when targets were aligned with the background, and were therefore highly similar, than when they were misaligned relative to the background, and were therefore relatively dissimilar. To anticipate our results, however, we found the opposite, instead revealing that the influence of target-background similarity on detection sensitivity depended almost entirely on the relative phase of the target.

Participants. We used a single-subjects design in which we measured observers' perceptual performance with high precision and treated each observer as a replication (Smith & Little, 2018).

… the distractor was a natural image with no target filters. The filters were designed to be similar or dissimilar to the underlying natural image structure. We generated target stimuli by blending a source image of a natural object with derivative of Gaussian wavelets (henceforth: filters).
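The similarity statistic described above, a correlation between the target and the target-free background bounded by 0 and 1, can be sketched as follows. The absolute normalised correlation used here is an assumption for illustration; the study's exact normalisation is not specified in this excerpt.

```python
import numpy as np

def similarity(target, background):
    """Absolute normalised correlation between a target patch and the
    target-free background patch, bounded by 0 and 1.

    NOTE: this particular normalisation is an assumption; the study may
    use a different similarity statistic."""
    t = (target - target.mean()).ravel()
    b = (background - background.mean()).ravel()
    denom = np.linalg.norm(t) * np.linalg.norm(b)
    if denom == 0:
        return 0.0                  # a uniform patch carries no structure
    return float(abs(t @ b) / denom)

rng = np.random.default_rng(0)
patch = rng.random((32, 32))
```

Under this formulation an "aligned" target, which shares its structure with the background, scores near 1, while structure drawn from an unrelated image scores near 0.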
The blending process followed four steps: 1) find the dominant orientation of each pixel in a source image, 2) find the relative contrast of each pixel in the image, 3) draw some number of filters at the highest-contrast image regions, and 4) combine the filters with a source image. We expand on these steps below.

First, we used a steerable filter approach to determine the dominant orientation at every pixel in a given source image (Freeman & Adelson, 1991). Filters were directional first derivatives of Gaussians oriented at 0° and 90°:

G_0(x, y) = -(x/σ²) exp(-(x² + y²)/(2σ²))   (Equation 1)

G_90(x, y) = -(y/σ²) exp(-(x² + y²)/(2σ²))   (Equation 2)

Here, x_i indexes the x-y coordinates of the contrast maxima, and θ(x_i) is the dominant orientation at this location. We then steered a filter at this location as follows:

S = cos(θ(x_i)) G_0,x_i + sin(θ(x_i)) G_90,x_i

Here, S is the resulting filter stimulus in the range -1 to 1. Note the additional subscript of the filters, which indicates that the filters were centred on the location of the contrast maxima, x_i. This is trivially achieved by centring the x-y coordinates in Equations 1 and 2 on the coordinates of x_i.

The stimulus was windowed in a circular aperture with a diameter matching the width of the source image (i.e. 2°) and a raised cosine edge, transitioning to zero contrast over 6 pixels. To constrain the filters generated by Equation 9 to appear within the windowed portion of the stimulus, the same aperture was applied to the contrast map, C, prior to generating the stimulus. Any values in the blended stimulus lower than 0 or higher than 1 were clipped.

For trials in which multiple filters were combined with a source image, we used an iterative procedure to draw n local maxima from the contrast image. Following the argmax operation in Equation 7, we updated the contrast map to minimise the contrast at the maxima:

C_{n+1} = C_n (1 - G(x_i, βσ))

where C_n is the contrast map for the n-th filter, and G(x_i, βσ) is a two-dimensional Gaussian with a peak of 1, centred on the location of the maxima x_i, and a standard deviation of βσ.
Here, σ is the standard deviation of the basis filters, while β is a scaling factor that determines the spatial extent of change in the contrast map. The effect of this adjustment is the creation of a new local maximum at a different location than in the previous iteration. The greater the value of β, the greater the spatial spread of filters. The first filter location is always the image region with the highest contrast. After accounting for the effect of β, subsequent filters are placed in regions of diminishing contrast. In trials in which multiple filters were present, backgrounds were randomly selected as described above.

Image selection. The 26,107 images in the THINGS database are grouped into 1,854 concepts (e.g. "dog", "cup", "brush", etc.), such that there are at least 12 unique, high-quality images for each concept (Hebart et al., 2019). In each testing session (500 trials), we selected 1000 source images from unique concepts such that no two images were drawn from the same concept. The target background was thus always drawn from a different concept than the distractor image. However, it was necessary that some concepts were repeated across testing sessions, and it was possible that some individual images were also repeated across sessions (but never within sessions).

On each trial, we selected two images from the set of 1000: one image for the target background, and a second image for the distractor. On half the trials, the target filters were generated from the target background and were therefore aligned with the background, while on the other half of trials they were generated from the distractor image, but blended with the target background, and were therefore misaligned relative to the target background.
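The stimulus-generation steps above (steerable-filter orientation estimation, and iterative selection of contrast maxima with Gaussian suppression) can be sketched as follows. The filter radius, σ, and the scaling factor β are placeholder values, and the functions follow the description above rather than the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def dog_bases(sigma=2.0, radius=8):
    """First-derivative-of-Gaussian basis filters at 0 and 90 degrees
    (the two directional bases described above, up to a constant)."""
    ax = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(ax, ax)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return -x / sigma**2 * g, -y / sigma**2 * g

def dominant_orientation(image, sigma=2.0):
    """Per-pixel dominant orientation and oriented contrast energy via
    steerable filters (after Freeman & Adelson, 1991)."""
    gx, gy = dog_bases(sigma)
    r0 = convolve(image, gx, mode='reflect')    # response of 0-deg basis
    r90 = convolve(image, gy, mode='reflect')   # response of 90-deg basis
    return np.arctan2(r90, r0), np.hypot(r0, r90)

def steered_filter(theta, sigma=2.0):
    """Steer the basis pair to orientation theta; normalised to [-1, 1]."""
    gx, gy = dog_bases(sigma)
    s = np.cos(theta) * gx + np.sin(theta) * gy
    return s / np.abs(s).max()

def draw_maxima(contrast_map, n, sigma=2.0, beta=4.0):
    """Iteratively pick n filter locations. After each pick, the map is
    multiplied by (1 - G), where G is a 2D Gaussian with peak 1 at the
    chosen maximum and SD beta * sigma, so the next maximum lands
    elsewhere; beta controls the spatial spread of the filters."""
    c = contrast_map.astype(float).copy()
    ys, xs = np.mgrid[0:c.shape[0], 0:c.shape[1]]
    locations = []
    for _ in range(n):
        iy, ix = np.unravel_index(np.argmax(c), c.shape)
        locations.append((int(iy), int(ix)))
        g = np.exp(-((ys - iy)**2 + (xs - ix)**2) / (2 * (beta * sigma)**2))
        c *= 1.0 - g
    return locations
```

For a simple luminance ramp, the orientation map returned by `dominant_orientation` is constant (the gradient direction), and `draw_maxima` visits bumps of a contrast map in order of decreasing contrast.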
Target filters were generated from the distractor background on misaligned trials, as opposed to an unused image, so that the filters and their source image were presented on every trial; we doubt this decision was important to our results. We chose to present two different background images on each trial, rather than, for example, two copies of the same background image, because we did not want observers simply to spot the difference between two similar images. Instead, observers had to perform the more natural task of searching unfamiliar and unique backgrounds for targets.

Procedure. A typical trial sequence is shown in Figure 2. Each trial began with a small red fixation spot in the centre of the display, followed by outlines of the upcoming stimulus locations. Natural image backgrounds were followed by a blank of 500 ms, after which the observer reported which of the two patches contained the target filter(s) using the keyboard. Following the observer's response, the image patch with target filters was re-displayed for an additional 500 ms, outlined in green or red depending on whether the observer's response was correct or incorrect, respectively. Feedback was provided to facilitate observers reaching a stable level of performance. No breaks were programmed, but they could be taken by withholding a response. Each session included ten repeats of each trial type, all presented in random order, giving 500 trials per session and taking approximately 15 minutes when no breaks were taken.

Φ^-1[p_i] = β0 + β1 X_1i   (Equation 15)

where the right-hand side is the sum of weighted linear predictors and X_1i is the absence or presence of the signal (i.e. 0 or 1, respectively) on the i-th trial. By fitting such a probit model, estimates of the predictor weights β1 and β0 are identical to d' and c, respectively, as calculated in Equation 13 and Equation 14.
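The equivalence between probit-regression weights and classical signal detection measures can be checked numerically. The counts below are hypothetical, and the fit uses a direct maximum-likelihood probit (SciPy) rather than any particular GLM library.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical trial-level data from a present/absent detection task:
# 100 signal-present trials (80 hits) and 100 signal-absent trials (30 FAs).
X = np.r_[np.ones(100), np.zeros(100)]            # 1 = signal present
y = np.r_[np.repeat([1, 0], [80, 20]),            # hits, misses
          np.repeat([1, 0], [30, 70])]            # false alarms, correct rej.

def neg_loglik(beta):
    """Negative log-likelihood of a probit model P(yes) = Phi(b0 + b1 * X)."""
    p = np.clip(norm.cdf(beta[0] + beta[1] * X), 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

b0, b1 = minimize(neg_loglik, x0=[0.0, 0.0]).x

# Classical SDT estimates from hit and false-alarm rates
H, FA = 0.80, 0.30
d_prime = norm.ppf(H) - norm.ppf(FA)
# The fitted slope b1 recovers d'. With the signal coded 0/1 the intercept
# is Phi^-1(FA); centring the signal predictor (e.g. +/- .5) instead makes
# the intercept correspond to the bias term.
```

Because the model is saturated (two parameters, two cells), the maximum-likelihood fit reproduces the cell-wise probits exactly, which is why the slope matches the d' formula.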
Whereas these equations fully specify sensitivity and bias in a single-interval present/absent judgement task, some small modifications are needed to quantify sensitivity in a two-alternative forced choice (2AFC) task such as ours. First, X_1i denotes whether the target filter(s) appeared in the left or right spatial interval, defined as -.5 or .5, respectively. Similarly, observers' reports (i.e. "target appeared in the left or right interval") were defined as -1 and 1, respectively. Finally, in a 2AFC task, observers have two opportunities to detect the target (once per spatial interval), and so raw d' will be greater than in a single-interval detection design. Therefore, sensitivity (but not bias) must be scaled by 1/√2.

Importantly, we can extend Equation 15 to quantify sensitivity to any number of other predictors, Z_k:

Φ^-1[p_i] = β0 + β1 X_1i + Σ_k βk X_1i Z_ki

Consider, for example, the influence of filter amplitude (α) on an observer's sensitivity:

Φ^-1[p_i] = β0 + β1 X_1i + β2 X_1i α_i   (Equation 19)

Note that filter amplitude is entered into the model as an interaction with target location because the model's predicted outcome is a spatial report; target amplitude alone can only predict a change in bias. In preliminary model fits, we found that such bias was not significantly different from zero, and thus included only interaction terms to facilitate interpretability of the standard bias term, β0. We selected other model predictors according to the model that produced the lowest Akaike information criterion (AIC; see below).

Finally, we implemented this model as a multilevel GLM (GLMM) to partially pool coefficient estimates across observers (Gelman & Hill, 2007):

β_{k,j} ~ N(0, σk²)

Here, β_{k,j} is the offset for each predictor k and observer j, relative to the parameter's mean μk, contingent on the parameter's estimated population variance σk². The partial pooling of observers' data in a GLMM results in more extreme values being pulled toward the population mean estimate.
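The 1/√2 correction for 2AFC sensitivity follows from the standard equal-variance model, in which 2AFC proportion correct equals Φ(d'/√2); a minimal numerical check:

```python
import numpy as np
from scipy.stats import norm

d_single = 1.5                              # d' in single-interval units

# Equal-variance model: 2AFC proportion correct is Phi(d' / sqrt(2)),
# because the decision variable is the difference of two noisy intervals.
pc_2afc = norm.cdf(d_single / np.sqrt(2))

# Inverting shows why a raw 2AFC sensitivity estimate must be scaled by
# 1/sqrt(2) to be comparable with single-interval sensitivity.
d_recovered = np.sqrt(2) * norm.ppf(pc_2afc)
```

The difference of two independent unit-variance intervals has variance 2, hence the √2 in the denominator.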
Note that in our experiment, however, such pooling is relatively minor due to the large number of trials, and therefore the high precision of each observer's estimated performance, as well as the relatively small number of observers. Because images were drawn randomly from trial to trial from a pool of tens of thousands of images, we did not expect many, if any, repeats of each image. We therefore did not model the background images as a random effect, but we note that such a design could be chosen in future work to estimate the variance associated with each tested background.

We entered into the model the factors target amplitude, number of filters, and target-background alignment, which, as noted above, were each entered as an interaction with the spatial interval of the target. In hindsight, our inclusion of the condition in which target amplitude was 0 was unnecessary. For all such trials, therefore, we set all predictors to 0 so that they were omitted from model calculations. The model fit was improved by including nonlinear terms, raising amplitude and the number of Gabors to the exponents 0.5 and 2, respectively. We further tested all combinations of interactions, but none improved the model fit as assessed by the Akaike information criterion.

Post-hoc analysis of the interaction between the number of filters and filter amplitude. We modelled the joint influence of the number of filters and filter amplitude on proportion correct as a two-dimensional surface (see Figure 5B). The surface is defined as:

z = f(α_t, α_v) …

where z is the surface of estimated proportion correct, f(α_t, α_v) is a cumulative probability function relating target amplitude to accuracy according to a threshold and variance, α_t and α_v, respectively, and f