title: Gravitational-wave astronomy with a physical calibration model
authors: Payne, Ethan; Talbot, Colm; Lasky, Paul D.; Thrane, Eric; Kissel, Jeffrey S. (electronic address: ethan.payne@ligo.org)
date: 2020-09-21

We carry out astrophysical inference for compact binary merger events in LIGO-Virgo's first gravitational-wave transient catalog (GWTC-1) using a physically motivated calibration model. We demonstrate that importance sampling can be used to reduce the cost of what would otherwise be a computationally challenging analysis. We show that including the physical estimate for the calibration error distribution has negligible impact on the inference of parameters for the events in GWTC-1. Studying a simulated signal with matched-filter signal-to-noise ratio $\text{SNR}=200$, we project that a calibration error estimate typical of GWTC-1 is likely to be negligible for the current generation of gravitational-wave detectors. We argue that other sources of systematic error (from waveforms, prior distributions, and noise modelling) are likely to be more important. Finally, using the events in GWTC-1 as standard sirens, we infer an astrophysically informed improvement on the estimate of the calibration error in the LIGO interferometers.

The burgeoning field of gravitational-wave astronomy [1-3] is advancing our understanding in multiple fields of astrophysics, including cosmology [4], galactic and stellar evolution [5], and strong-field gravity [6]. As the sensitivity of observatories improves, and gravitational waves are observed with increasingly large signal-to-noise ratios (SNRs) [7-9], an understanding of systematic effects will become ever more important. Sources of systematic bias include errors associated with gravitational waveforms [10], imperfect prior distributions, incorrect estimates of the noise power spectral density [11-13], and errors associated with the calibration of the detectors [14]. Here, we focus on errors associated with calibration. Calibration is the process of converting the detector's primary control-system error signal, which arises from differences in the lengths of the interferometer's arms, into an estimate of the strain on the detector [15-18]. Imperfect knowledge of the interferometer's control system and of its response to differential arm-length changes leads to systematic error in the amplitude and phase of the calibration. This error is estimated by conducting a vast suite of measurements of the control system and propagating the results of those measurements through a physically informed model. The resulting error estimate is represented by a frequency-dependent probability distribution. In order to avoid bias, the estimated probability distribution of calibration errors must be taken into account when inferring the astrophysical parameters of gravitational-wave signals [14]. Unfortunately, marginalizing over calibration error distributions can dramatically increase the number of parameters used in astrophysical inference: from the 15 required to describe a binary black hole to > 50 [19]. This increase in parameter space can lead to a significant increase in computational cost and to convergence issues, which has, up to this point, limited efforts to carry out astrophysical inference that includes an estimate of calibration errors.
In this work, we demonstrate a computationally efficient implementation of the original physical calibration model [16, 18, see Sec. II] for astrophysical inference. Following [20], we first evaluate the posterior distribution of astrophysical parameters without any estimated calibration error distribution, then employ importance sampling to reweight these approximate results to include this contribution. Importance sampling [21, 22] is the technique of constructing weights for individual samples that determine each sample's contribution to the inferred probability distribution. Having verified the analysis procedure, we carry out a study of gravitational-wave signals from the first LIGO-Virgo Gravitational-Wave Transient Catalog (GWTC-1) [3] using estimates of the calibration error at the time of those events. Combining data from multiple events, we infer an astrophysically informed calibration error estimate [23], showing that it is possible to learn about the Advanced LIGO interferometers [24] using gravitational waves as standard sirens. With the assumption that the estimated systematic error present in GWTC-1 will be typical in the future, we demonstrate the effect of the calibration error estimate on a simulation of an SNR = 200 binary black hole merger, approximately the loudest event that will be observed by the second-generation interferometer network during its operational lifetime [7, 25]. In this regime, naive application of our importance sampling algorithm becomes inefficient. However, we find that the calibration error estimate still has only a marginal effect on the posterior distributions of the intrinsic astrophysical parameters. The localization parameters are the most affected by the inclusion of the calibration error distribution; the sky-map credible region approximately doubles in size for an SNR = 200 event. We conclude that the impact of calibration error estimates will likely be small compared to the other, previously mentioned sources of systematic error in astrophysical parameter estimation.

The remainder of this paper is organized as follows. In Section II, we summarize the physically motivated calibration model, its parameters, and the collective error estimate from [16, 18]. In Section III, we describe our methodology for efficiently marginalizing over the probability distribution of calibration errors with importance sampling. In Section IV, we demonstrate our implementation using simulated data, while in Section V, we analyze data from GWTC-1. We end with concluding remarks in Section VI.

In this section, we summarize the physical calibration model for the LIGO detectors described in [16, 18]. Though the Virgo detector [26] is similar to the LIGO detectors, for Virgo the systematic-error probability distribution used in this study is instead informed by the 68% confidence-interval bounds on the systematic error described in Ref. [17].

A. The physical model

The LIGO detectors are dual-recycled, kilometer-scale Fabry-Pérot Michelson interferometers, most sensitive between 10 and 2000 Hz [24]. The passage of a gravitational-wave signal induces differential displacement between the two arms of the interferometer. This differential arm displacement is measured at the output of the detector, where interfering laser light reflected from the resonant arm cavities is incident on a set of photodiodes [27]. The photodiode signals are summed and digitized to form a signal which includes both detector noise and gravitational waves.
The digitized signal also serves as the residual error signal of the feedback control system for the changes in differential arm length. This, among other control loops in the detector, ensures that external noise sources do not force the interferometer cavities off resonance. This is achieved by differentially actuating on the arm cavity mirrors, or test masses, and their suspension systems [28-30]. Below ∼ 100 Hz, the control system's actuation forces suppress the interferometer's response to differential arm displacement. Above this frequency, the response is free of control-system influence and depends on both the interferometric response to differential arm displacement and the signal-processing electronics of the photodiodes. Thus, in order to reconstruct the measured strain from the digitized photodiode signal over the entire sensitive frequency band, a physically motivated model of the response and control system is required.

The model is divided into two conceptual components. The first component, the sensing function, is an optomechanical description of the interferometer as if it were free of control forces, photodiode signal-processing electronics, and the digital acquisition system. The second component, the actuation function, describes how the control system splits the single, digitally filtered error signal among three stages of cascading actuators on the test-mass quadruple suspension systems, incorporating those actuators' digital-to-analog converters and signal-processing electronics. The actuation function also includes the complex displacement response of the test mass to the forces from each stage [28-30]. Both model components are frequency-dependent complex transfer functions that are mostly static in time, but each has slowly time-varying correction factors to account for natural drifts in their behavior. The calibrated strain signal h is related to the differential arm error signal d_err (in the frequency domain) by the response function R,

h(f) = \frac{R(f)}{L} \, d_{\rm err}(f), \qquad R(f) = \frac{1}{C(f)} + D(f) A(f). (1)

Here, C is the sensing function while A is the actuation function. The variable D describes the set of digital filters responsible for converting the error signal to the single differential arm control signal. The variable L is the length of the interferometer arms. Equation (1) illustrates how systematic errors in C and A lead to an error in h. Following [16, 18], we employ the following model for the sensing function:

C(f, t) = \frac{\kappa_C(t) H_C}{1 + i f / f_{\rm CC}(t)} \left( \frac{f^2}{f^2 + f_S^2 - i f f_S / Q} \right) C_R(f) \, e^{-2\pi i f \tau_C}. (2)

Here, f is frequency, and {H_C, f_CC, τ_C, f_S, Q} are the parameters describing the optomechanical response (summarized in Table I), which are part of a larger set of calibration model parameters Λ. The parameters κ_C(t) and f_CC(t) represent the time-dependent corrections needed to account for alignment drift in the suspended cavities. Detuning between the signal recycling cavity and arm cavities [31] is modelled as an optical spring with a characteristic frequency, f_S, and associated quality factor, Q. Finally, C_R(f) is the digital acquisition response, which is measured a priori with high precision. The probability distributions for the time-independent parameters of the sensing function, Λ, are determined by fitting measurements from a single, representative reference time with the model outlined in Eq. (2) using Markov Chain Monte Carlo (MCMC) sampling [32].
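To make Eq. (1) concrete, here is a minimal sketch of how the response function combines the sensing and actuation components. This is our illustration rather than the collaboration's calibration pipeline; the function names are ours, and the 4 km arm length is a nominal placeholder.

```python
import numpy as np

def response_function(C, A, D):
    """Response function of Eq. (1): R = 1/C + D*A.

    C, A, and D are complex arrays giving the sensing function,
    actuation function, and digital filters on a common frequency grid.
    """
    return 1.0 / C + D * A

def strain_from_error_signal(d_err, R, L=4000.0):
    """Calibrated strain from the differential-arm error signal,
    h = (R / L) * d_err, where L is the (nominal 4 km) arm length."""
    return R * d_err / L
```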
The probability distribution for the parameter f_CC is determined both by an MCMC fit from the reference-time measurement and, like κ_C(t), by the continuous high-precision tracking of its time dependence [59]. The actuation function is modeled as follows:

A(f, t) = \sum_{i \in \{U, P, T\}} \kappa_i(t) \, F_i(f) \, H_i \, A_i(f) \, e^{-2\pi i f \tau_i}. (3)

Here, {H_U, τ_U, H_P, τ_P, H_T, τ_T} are the actuation calibration parameters summarized in Table II. The subscripts refer to the stage of the suspension system where the actuation force is applied: the "upper-intermediate" stage, U, the "penultimate" stage, P, and the "test mass" stage, T. The force-to-displacement response and the response of the actuator electronics are incorporated in A_i(f). The digital distribution filters, F_i(f), and the scalar time-varying correction factors, κ_i(t), are precisely known, and so do not appear in Table II. Again, the prior distributions for the actuation parameters are determined by MCMC sampling with data from single measurements of each stage's response. The values and uncertainties associated with time-dependent quantities are computed at a 1 hr cadence over the duration of an observing run.

The final parameter within the physical calibration model is an overall scalar magnitude factor, η_PCAL, whose probability distribution is derived from any systematic error and uncertainty in the photon calibrator systems (PCALs). The photon calibrator systems are used as fiducial displacement references for each detector [33, 34]. Typically, the systematic error is negligibly different from unity and only adds an overall magnitude uncertainty: coincidentally 0.79% for both LIGO detectors during the second observing run. This additional correction is applied as a multiplicative factor to the response function [60].

While the physical model of the response function, R(Λ), produces an approximately correct response, inspection of an ensemble of frequency-dependent residuals R_measured/R(Λ), constructed from sensing and actuation function measurements, shows that the model is incomplete, i.e., the residuals are not consistent with unity; see Fig. 11 of [18]. The authors of [18] build an additional phenomenological model for C and each stage of A on top of the physical model in order to estimate the residuals, completing the error estimate with new phenomenological parameters. The phenomenological model employs Gaussian process regression of the residuals, interpolating between 128 frequency points [35, 36], with several hyper-parameters constraining the covariance kernel between frequency points. The corrected sensing and actuation functions are given by

C'(f) = \eta_C(f) \, C(f), \qquad A'_i(f) = \eta_{A_i}(f) \, A_i(f), \quad i \in \{U, P, T\}. (4)

Here, (C, A) are the physical sensing and actuation models while (C', A') are the phenomenologically corrected models. These corrections are included as a part of the frequency-dependent estimated distribution of calibration error. Since η_C and η_{A_i} for each stage are complex-valued functions described by a magnitude and phase, the phenomenological model introduces an additional 256 × 4 calibration parameters to Λ. After applying both physical and phenomenological models to LIGO data, the authors of [18] find that the distribution of errors in the response R completely explains R_measured/R(Λ), and is dominated, in most frequency regions, by uncertainty from the Gaussian process fit. That is, the systematic error from the imperfect design of the physical model is large compared to the uncertainty in its parameters.
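The following sketch evaluates the sensing and actuation models as we have reconstructed them in Eqs. (2) and (3) from Refs. [16, 18]; the function names and the dictionary layout for the actuation stages are our own conventions, and the exact functional forms should be checked against those references.

```python
import numpy as np

def sensing(f, H_C, f_CC, tau_C, f_S, Q, kappa_C=1.0, C_R=1.0):
    """Sensing function of Eq. (2): optical gain kappa_C * H_C, a
    coupled-cavity pole at f_CC, a detuned-SRC optical spring (f_S, Q),
    the measured acquisition response C_R(f), and a time delay tau_C."""
    pole = kappa_C * H_C / (1.0 + 1j * f / f_CC)
    spring = f**2 / (f**2 + f_S**2 - 1j * f * f_S / Q)
    return pole * spring * C_R * np.exp(-2j * np.pi * f * tau_C)

def actuation(f, stages):
    """Actuation function of Eq. (3). `stages` maps each label in
    {'U', 'P', 'T'} to a dict with entries 'kappa' (time-varying
    correction), 'F' (digital distribution filter), 'H' (gain),
    'A_i' (force-to-displacement and electronics response), and
    'tau' (time delay)."""
    total = 0j
    for s in stages.values():
        total = total + (s["kappa"] * s["F"] * s["H"] * s["A_i"]
                         * np.exp(-2j * np.pi * f * s["tau"]))
    return total
```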
However, by introducing such a high-dimensional phenomenological model, the systematic error of the physical model is converted almost entirely into statistical uncertainty. With so many free parameters, we expect it should be possible to fit nearly any measured form of R.

Our goal is to estimate the astrophysical parameters θ describing the gravitational waveform of a compact binary merger, given strain data h, while marginalizing over the unknown calibration parameters, Λ. We follow style conventions from [37]. Assuming Gaussian noise, and assuming for the moment that the data are perfectly calibrated, the likelihood of the data in frequency bin j is

\mathcal{L}_\emptyset(h_j | \theta) = \frac{1}{2\pi P_j} \exp\left( -\frac{2\Delta f}{P_j} \left| h_j - \mu_j(\theta) \right|^2 \right). (5)

Here, P_j is the power spectral density of the interferometer and μ_j(θ) denotes the gravitational-wave model. In this manuscript, we utilize IMRPhenomPv2 [38, 39] as our source model for binary black hole systems, and IMRPhenomPv2_NRTidal [40] for binary neutron star mergers. The parameters of the compact binary coalescence, θ, include intrinsic properties such as the masses and spins of the individual compact objects, and extrinsic parameters informing the orientation and location of the binary system. The subscript j refers to a single frequency bin; the bins are spaced by Δf. Since the noise in each bin is approximately independent, the combined likelihood is simply

\mathcal{L}_\emptyset(h | \theta) = \prod_j \mathcal{L}_\emptyset(h_j | \theta). (6)

The product over frequency bins is implied in subsequent equations. Meanwhile, the calibration error is described by the ratio of the true response function R(Λ), which depends on calibration parameters Λ, to the original response function used to calibrate the data, R_∅; this ratio is denoted η_R in Ref. [18]. Since the data are calibrated with R_∅ while the true response is R(Λ), the astrophysical signal appears in the calibrated data scaled by R_∅/R(Λ) = η_R^{-1}, so the calibration-dependent likelihood is

\mathcal{L}(h | \theta, \Lambda) = \frac{1}{2\pi P} \exp\left( -\frac{2\Delta f}{P} \left| h - \mu(\theta)/\eta_R(\Lambda) \right|^2 \right). (7)

Our target distribution, the one for which we want to generate posterior samples, is Eq. (7) marginalized over Λ:

\mathcal{L}(h | \theta) = \int d\Lambda \, \mathcal{L}(h | \theta, \Lambda) \, \pi(\Lambda). (8)

Here π(Λ) is our prior on the calibration parameters. The target distribution can be computationally expensive to sample from owing to the extra dimensionality associated with Λ. However, if the original calibration R_∅ is at least approximately correct, and if the SNR of the event is not too large (we quantify how large momentarily), then we can employ importance sampling to avoid sampling in Λ. Following [20], we define our proposal distribution, corresponding to the likelihood we would use if we believed the original response function R_∅ was perfectly accurate: this is precisely Eq. (6). We use the proposal distribution to generate samples in θ using the Bilby [41, 42] implementation of Dynesty [43], a nested sampling algorithm [44]. Since we are not sampling in Λ, the proposal samples are computationally cheap to generate.

Next, for each posterior sample of the binary model parameters drawn from the proposal distribution, θ_i, we calculate a weight, which requires marginalizing over calibration parameters. Following [18], we carry out this calculation using a predetermined set of N = 10^4 calibration response curves, generated with random draws from the prior distribution {Λ_k} ∼ π(Λ), so that

\mathcal{L}(h | \theta) \approx \frac{1}{N} \sum_{k=1}^{N} \mathcal{L}(h | \theta, \Lambda_k). (9)

We define a doubly-indexed weight relating the proposal likelihood to the target likelihood:

w_{ik} = \frac{\mathcal{L}(h | \theta_i, \Lambda_k)}{\mathcal{L}_\emptyset(h | \theta_i)}. (10)

Here, i indexes binary posterior samples for the parameters θ while k indexes calibration prior samples for Λ. The calibration-marginalized weight is simply

w_i = \frac{1}{N} \sum_{k=1}^{N} w_{ik}. (11)

Alternatively, we can marginalize over the gravitational-wave model parameters in order to obtain weights useful for constructing posteriors for the calibration parameters:

w_k = \frac{1}{Z_\emptyset(h)} \int d\theta \, \mathcal{L}(h | \theta, \Lambda_k) \, \pi(\theta) \approx \frac{1}{n} \sum_{i=1}^{n} w_{ik}, (12)

where Z_∅(h) is the normalization coefficient of the proposal posterior distribution, known as the Bayesian evidence. This procedure is similar to approaches for estimating neutron-star equations of state with Gaussian processes [45].
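As an illustration of Eqs. (10)-(12), here is a minimal sketch of the reweighting step, assuming the per-sample, per-curve log-likelihoods have already been evaluated; the array and function names are ours, not those of any released pipeline.

```python
import numpy as np
from scipy.special import logsumexp

def marginal_weights(log_l_target, log_l_proposal):
    """Calibration-marginalized weights w_i of Eq. (11).

    log_l_target: (n, N) array of log L(h | theta_i, Lambda_k).
    log_l_proposal: (n,) array of log L_0(h | theta_i).
    Works in log space to avoid overflow for high-SNR events."""
    N = log_l_target.shape[1]
    log_w = logsumexp(log_l_target, axis=1) - np.log(N) - log_l_proposal
    return np.exp(log_w)

def calibration_weights(log_l_target, log_l_proposal):
    """Calibration-curve weights w_k of Eq. (12), averaged over the
    astrophysical posterior samples."""
    n = log_l_target.shape[0]
    log_w_ik = log_l_target - log_l_proposal[:, None]
    return np.exp(logsumexp(log_w_ik, axis=0) - np.log(n))
```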
The weights quantify the relative importance of each sample in light of the fact that we are actually interested in the target distribution, not the proposal distribution. The weights can be input directly into routines for constructing corner plots. They may also be used to calculate the Bayesian evidence for the target distribution, Z_Λ(h), from the evidence for the proposal distribution. The ratio of the two evidences is simply the average weight, known as the Bayes factor,

\mathcal{B}^\Lambda_\emptyset = \frac{Z_\Lambda(h)}{Z_\emptyset(h)} = \frac{1}{n} \sum_{i=1}^{n} w_i, (13)

which provides a measure of the preference for the calibration model in comparison to the null hypothesis ∅ that the data are already correctly calibrated. The process of constructing these weights is known as importance sampling [21, 22]. This approach is not confined to the calibration model outlined in Section II, and allows for the application of improved models in the future. Furthermore, the method can equivalently be applied with the other spline models [14, 19] used for analyses in GWTC-1 [3]. The efficacy of importance sampling can be measured using an efficiency [20, 46, 47]:

\epsilon = \frac{n_{\rm eff}}{n}, (14)

n_{\rm eff} = \frac{\left( \sum_i w_i \right)^2}{\sum_i w_i^2}. (15)

Here, n is the number of astrophysical samples generated using the proposal distribution while n_eff < n is the number of effective samples created from importance sampling. If the proposal distribution is close to the target distribution, the efficiency will be high. As a rule of thumb, ε > 50% is "excellent" (providing a fast, reliable answer) while ε ≈ 1%-50% is "good," providing adequate efficiency to make importance sampling clearly useful. Efficiencies ≲ 1% indicate that the proposal distribution is not necessarily a good approximation for the target distribution, and so reweighting begins to become inefficient, requiring a large number of initial samples and many evaluations of the target likelihood in order to obtain a reliable answer. The efficiency falls with increasing SNR, since louder events are characterized by progressively more sharply peaked likelihood functions. We verify that the efficiency is above 10% when SNR ≲ 40. One can judge the convergence of the importance-sampled result by considering the number of effective samples. The efficiency can also be used as a measure of the overall effect of the inclusion of a physical calibration model, though there are better measures. Pathological cases, where importance sampling fails due to multi-modality, are unlikely to apply to our present application; see [20] for additional details.

One benefit of likelihood reweighting is its low computational cost. If one instead directly executes Bayesian inference with the calibration-marginalized likelihood, the number of evaluations of the more computationally expensive likelihood is orders of magnitude larger than the number of posterior samples produced. By utilizing likelihood reweighting, the proposal distribution is found with a cheap likelihood function, and the expensive likelihood is used sparingly in post-processing. We can also use the astrophysical-parameter-marginalized weights to construct posterior distributions for the calibration hyper-parameters informed by an ensemble of events. We construct weights for the k-th set of calibration curves informed by M events as

w_k^{\rm tot} = \prod_{\nu=1}^{M} w_k^{(\nu)}, (16)

where ν indexes the different events, not to be confused with the additional implied product over frequency bins in Eq. (6). The average combined weight is the Bayes factor for the calibration error distribution compared to the null hypothesis that the calibration error is zero.
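A short sketch of Eqs. (13)-(16), again with our own function names, computing the Bayes factor, the reweighting efficiency, and multi-event combined weights from the weights defined above:

```python
import numpy as np

def bayes_factor(w):
    """Bayes factor of Eq. (13): the average weight."""
    return np.mean(w)

def efficiency(w):
    """Importance-sampling efficiency of Eqs. (14)-(15),
    epsilon = n_eff / n, with the Kish effective sample size
    n_eff = (sum w)^2 / sum w^2."""
    w = np.asarray(w)
    n_eff = np.sum(w) ** 2 / np.sum(w**2)
    return n_eff / len(w)

def combined_calibration_weights(per_event_w_k):
    """Multi-event weights of Eq. (16): an elementwise product over
    events of the per-event calibration-curve weights w_k^(nu).
    `per_event_w_k` is a sequence of M arrays, each of length N."""
    return np.prod(np.vstack(per_event_w_k), axis=0)
```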
Of course, in order to combine multiple events, we must take care to ensure that the interferometer is in the same state. Otherwise, the calibration parameters can be different for different events. Thus, one must ensure that events are only combined for a period during which the interferometer is maintained in a steady configuration.

We validate our method using simulated signals injected into Gaussian noise colored to match the Advanced LIGO design sensitivity noise curve [24]. We analyze two signals, both with properties consistent with GW150914 [1]. We focus on high-SNR events, where calibration uncertainty is relatively more important. In one case, we adjust the distance to achieve an optimal SNR = 30, which is comparable to the loudest observed gravitational-wave signal, GW170817 [2], with SNR ≈ 32. In the second case, we set the distance to achieve SNR = 200. We use calibration envelopes equivalent to the calibration estimate at the time of GW170817 [61].

Starting with the SNR = 30 event, we compare the posterior distributions for the binary parameters θ obtained three different ways: ignoring calibration error; marginalizing over calibration error estimates with the importance sampling method described above; and with "direct sampling," in which we marginalize over calibration with the N = 10^4 response curves at every step using Eq. (9) as the nested sampler explores the astrophysical parameter space. The direct sampling method is relatively slow compared to importance sampling (by a factor of ∼ 250), requiring the use of pBilby [48], a parallelized implementation of Dynesty [43]. All three methods produce nearly identical posterior distributions, which are difficult to distinguish by eye, illustrating that the calibration error distribution has only a very small effect on our inferences about astrophysical parameters. This is also verified in Sec. V when analysing all events from GWTC-1. In Table III, we present the maximum one-dimensional Jensen-Shannon (JS) divergence [49] comparing the similarity of the posterior distributions obtained using each method. The JS divergence is a symmetric extension of the Kullback-Leibler (KL) divergence [50], and ranges from 0 bits (no divergence) to 1 bit (maximal divergence). We obtain JS divergence values ≲ 6 × 10^-3, which are similar to those obtained from comparing results from different stochastic sampling codes sampling the same likelihood [41, 42, 51].

TABLE III: The largest one-dimensional Jensen-Shannon (JS) divergence (bit) comparing the similarity of the posterior distributions obtained using different methods for a simulated SNR = 30 binary black hole signal; subscripts indicate the parameter with the largest divergence (right ascension, RA, or primary spin magnitude, a_1). The small value in each cell indicates that the three methods produce similar distributions, which implies that calibration uncertainty does not have a significant effect on astrophysical inference.

              direct                     importance
no error      JS_RA = 8.07 × 10^-3       JS_RA = 5.16 × 10^-3
importance    JS_a1 = 3.94 × 10^-3       --
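A sketch of the kind of comparison reported in Table III: the largest one-dimensional JS divergence between two posterior sample sets, computed from binned marginals. This is our illustration, not the authors' code; it uses SciPy's jensenshannon, which returns the JS distance, so we square it to obtain the divergence in bits.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def max_js_divergence(samples_a, samples_b, n_bins=50):
    """Largest one-dimensional JS divergence (in bits) between two
    posterior sample sets of shape (n_samples, n_parameters)."""
    divergences = []
    for x, y in zip(samples_a.T, samples_b.T):
        lo = min(x.min(), y.min())
        hi = max(x.max(), y.max())
        bins = np.linspace(lo, hi, n_bins + 1)
        p, _ = np.histogram(x, bins=bins)
        q, _ = np.histogram(y, bins=bins)
        # jensenshannon normalizes the histograms and returns the JS
        # *distance* (square root of the divergence); square it to get
        # the divergence, in bits for base=2.
        divergences.append(jensenshannon(p, q, base=2) ** 2)
    return max(divergences)
```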
The SNR = 200 event allows us to study what is likely to be the maximum-SNR regime for second-generation gravitational-wave detectors. The Advanced LIGO/Virgo network at design sensitivity is expected to observe O(10^4) events over its operational lifetime. Assuming that the distribution of network SNR scales like SNR^-4 [7], the number of events with a signal-to-noise ratio greater than 200 will be O(1) (see also [25]). The posterior distributions for the astrophysical parameters are presented in the top panels of Fig. 1. We see the largest differences in the extrinsic parameters (right panel). In particular, we highlight that the credible regions on the sky are approximately twice as large when marginalizing over calibration error estimates than when assuming no calibration error is present. Quantitatively, this difference corresponds to a JS divergence of 0.105 for right ascension. More modest changes are seen for the remaining parameters, with JS divergences ≤ 3.04 × 10^-2; qualitative astrophysical results are unchanged. The mean and 95% credible regions for the prior and posterior on the calibration uncertainty are shown in the lower panels. We note that the width of the posterior credible regions is approximately half that of the prior credible regions.

FIG. 1 (caption, partially recovered): The inclusion of the calibration envelope broadens the majority of astrophysical parameters a modest amount. The sky localization of the event broadens noticeably with the inclusion of calibration uncertainty, expanding by a factor of ≈ 2 in solid angle. This indicates the possibility that even for the loudest events observed, the calibration error estimate may not play a major role in the inferences made about the intrinsic properties of the source. It is interesting to note that constraining calibration model parameters at lower frequencies, where the gravitational-wave signal is detected, can inform the calibration model at higher frequencies where no signal is present.

The Bayes factor for the SNR = 200 injection is B^Λ_∅ = 2.04 × 10^-4, indicating a preference for the null hypothesis that there is no calibration error. This is expected given that we did not perturb the simulated data to introduce a systematic error. However, this result also tells us something interesting about the calibration model. No calibration error (η_R = 1) should be allowed as one possible realisation of the calibration envelope. What does it mean, therefore, that the data so strongly prefer the null hypothesis for this injection? We suspect there are two factors at play. First, some of the preference is likely coming from the large parameter space of the physical calibration model. This results in a penalty known as an Occam factor, whereby simpler models with a smaller prior volume are preferred to models with a larger prior volume, provided the data are fit accurately. However, we suspect that there is a more important factor at play: the 10^4 realizations of the calibration envelope may not be sufficient to adequately fit the zero-error data. If this is the case, it could be highlighting the limitations that arise when we represent a continuous response function with some finite number of curves. Additional work beyond our present scope would be useful to investigate these hypotheses.

We analyze the 11 binary merger events identified in GWTC-1 [3, 52] using the method described in Section III. Strain data are taken from the open data release [52], while noise power spectral densities, produced with BayesWave [54, 55], are taken from Ref. [53]. Calibration error distributions for the LIGO detectors in the first observing run and for Virgo are estimated using the spline method [19, 56]. Observations during the second observing run directly utilize the physical calibration model presented in Sec. II. To illustrate the typical effect of the inclusion of the physical calibration model, we first consider GW170608. In the top panels of Fig. 2, we show the posterior distributions for the astrophysical parameters of GW170608. The red contours include marginalisation over calibration error estimates while the blue contours do not. While there are small differences between the red and blue contours (we encourage the reader to squint at the posterior distributions for DEC and ι), it is clear that the inclusion of uncertainty in the calibration error has a very small effect on the size and shape of the astrophysical posterior distributions. In the bottom panels of Fig. 2, we show the reconstructed calibration response function. The thick red curve is averaged over draws from the calibration parameter posterior distribution while the green curve is averaged over draws from the prior.
The slight difference between the red and green credible intervals shows a (small) change in the mean and 95% confidence intervals of the calibration envelope. We also present the efficiencies and JS divergences for all events in GWTC-1 in Table IV. The efficiency for obtaining calibration-marginalized samples is ε = 78.2% for GW150914 and ε = 64.9% for GW170817. The non-unity efficiency for these two events is due to their larger network SNR. For the other events in GWTC-1, we obtain efficiencies of ε = 97.4%-99.7%. Visual inspection of the posterior distributions for the other events in GWTC-1 confirms that the effect of uncertainty in the calibration error is negligible for events in GWTC-1. This is further verified by JS divergences ≲ 1.5 × 10^-3, which are comparable to values found between different implementations of stochastic sampling algorithms [42] and smaller than those due to differences in waveform models [3]. The full set of GWTC-1 analysis results is available for download [57].

Finally, we conclude by determining the calibration envelope using events from the second observing run (O2) as standard sirens. We only use events from O2 to ensure that the time-independent calibration parameters are identical. We compute the combined weights for the calibration response curves following Eq. (16). With eight events in O2, the combined calibration envelope is only marginally informed by the gravitational-wave signals. The reconstructed envelope, evaluated at the time of GW170729, is presented in Fig. 3. We observe only a modest change from the prior. The total Bayes factor comparing the calibration uncertainty hypothesis to the zero-uncertainty hypothesis is likewise modest: B^Λ_∅ = 2.33. More events are required to meaningfully inform the calibration error estimate. However, with the requirement to periodically update the calibration model parameters as improvements to the detectors are made, the required number of events may not be achievable in the foreseeable future. This is also concluded within Ref. [23], where they comment that, due to the periodic model updates, astrophysical calibration may never be competitive.

We have presented a calibration-marginalized likelihood for astrophysical parameters employing a physically informed model for the calibration error as presented in [16, 18]. Within the signal-to-noise ratio regime of previously observed events, and with estimates of calibration errors at the levels reported in GWTC-1, we find the effect of calibration error is at the same level as the effect of stochastic sampling errors and less than other known systematics. Recent work in Ref. [58] has also investigated similar marginalization using direct sampling of the calibration error curve index instead of importance sampling. The conclusions drawn within Ref. [58] are consistent with those drawn here.
We also demonstrated that, if calibration errors remain as low as in GWTC-1, even future loud events will incur only modest changes in the estimates of astrophysical parameters, with the potential exception of increased uncertainty in the sky location. We further demonstrated improved inference of calibration parameters using the collection of events from GWTC-1 as standard sirens. Our findings are consistent with [23], where it is found that using gravitational-wave events to improve the estimate of calibration errors beyond that determined from in-situ measurements requires thousands of detections.

Endnotes:
[59] The analysis undertaken in this manuscript does not incorporate the probability distribution from MCMC sampling for f_CC(t), following an implementation error in the analysis undertaken in Ref.
[60] While each detector's fiducial reference has its own systematic error and associated uncertainty, the collection of references for the entire network is seeded from a single global reference. This common error on the global reference is included in the uncertainty for each detector, and excluded as an independent error from this analysis as it is degenerate with the luminosity distance of a gravitational-wave source.
[61] The Virgo observatory uses a spline-based calibration uncertainty model.

We thank Salvatore Vitale, Carl-Johan Haster, Lilli Sun, Ben Farr, and Evan Goetz for insightful comments and for sharing an early version of their manuscript. We thank Nikhil Sarin and Rory Smith for providing guidance on the use of pBilby. We thank Greg Mendell and Rick Savage for thoughtful discussions, and Reed Essick for helpful comments on the manuscript. This work is supported through Australian Research Council (ARC) Centre of Excellence CE170100004. EP acknowledges the support of the LSC Fellows program. PDL is supported through ARC Future Fellowship FT160100112 and ARC Discovery Project DP180103155. ET is supported through ARC Future Fellowship FT150100281. This is document LIGO-P2000294. This research has made use of data, software and/or web tools obtained from the Gravitational Wave Open Science Center (https://www.gw-openscience.org), a service of LIGO Laboratory, the LIGO Scientific Collaboration and the Virgo Collaboration. The authors are grateful for computational resources provided by the LIGO Laboratory and supported by National Science Foundation Grants PHY-0757058 and PHY-0823459. Computing was performed on computing clusters at the California Institute of Technology (LIGO Laboratory) and Swinburne University of Technology (OzSTAR). We would like to thank all of the essential workers who put their health at risk during the COVID-19 pandemic, without whom we would not have been able to complete this work.