key: cord-0264257-elpblxnn authors: Ho, Ming-Feng; Bird, Simeon; Garnett, Roman title: Damped Lyman-alpha Absorbers from Sloan Digital Sky Survey DR16Q with Gaussian processes date: 2021-03-19 journal: nan DOI: 10.1093/mnras/stab2169 sha: bb4068b6935932625b1bcb135953e1efc99e82ef doc_id: 264257 cord_uid: elpblxnn We present a new catalogue of Damped Lyman-$alpha$ absorbers from SDSS DR16Q, as well as new estimates of their statistical properties. Our estimates are computed with the Gaussian process models presented in Garnett et al. (2017); Ho et al. (2020) with an improved model for marginalising uncertainty in the mean optical depth of each quasar. We compute the column density distribution function (CDDF) at $24$. Our results for DLA line density and total hydrogen density are consistent with previous measurements. Despite a small bias due to the poorly measured blue edges of the spectra, we demonstrate that our new model can measure the DLA population statistics when the DLA is in the Lyman-$beta$ forest region. We verify our results are not sensitive to the signal-to-noise ratios and redshifts of the background quasars although a residual correlation remains for detections from $z_{textrm{QSO}}<2.5$, indicating some residual systematics when applying our models on very short spectra, where the SDSS spectral observing window only covers part of the Lyman-$alpha$ forest. Damped Lyman-α absorbers (DLAs) are strong Lyman-α absorption features discovered in quasar spectral sightlines. At the densities required to produce neutral hydrogen column densities above the DLA threshold, NHI > 10 20.3 cm −2 (Wolfe et al. 1986 ), the gas of DLAs is self-shielded from the ionising effect of the ultra-violet background (UVB) (Cen 2012) but diffuse enough to have a low star formation rate (Fumagalli et al. 2013) . DLAs contain a large fraction of the neutral hydrogen budget after reionisation (Gardner et al. 1997; Noterdaeme et al. 2012; Zafar, T. et al. 2013; Crighton et al. 2015) , which make them a direct probe of the distribution of neutral gas. Numerical simulations tell us DLAs are associated with a wide range of halo masses, with a peak value in the range of 10 10 − 10 11 M (Haehnelt et al. 1998; Prochaska & E-mail: mho026@ucr .edu † E-mail: sbird@ucr.edu Wolfe 1997; Pontzen et al. 2008; Rahmati & Schaye 2014) . Through cross-correlating the DLAs with the Lyman-α forest, Font-Ribera et al. (2012) measured a DLA bias factor bDLA = 2.17 ± 0.2. This implies a median host halo mass of ∼ 10 12 M , assuming all DLAs arise from halos of the same mass and. However, a model which assumes a power-law distribution function of DLA cross-section as a function of halo mass is only in marginal tension with the data (Bird et al. 2015) . Furthermore, a later measurement from SDSS-DR12 (Pérez-Ràfols et al. 2018) found a bias factor bDLA = 1.99 ± 0.11, and a median host halo mass ∼ 4 × 10 11 M , in good agreement with simulations. Alternative measurements by cross-correlating with CMB lensing data are broadly consistent with both simulated DLAs and Lyman-α clustering (Alonso et al. 2018; Lin et al. 2020 ). In the cosmology context, the Lyman-α forest is a successful probe of matter clustering between 2 < z < 6 (Croft et al. 1998; McDonald et al. 2000; Viel et al. 2004; McDonald et al. 2005b; Iršič et al. 2017; Chabanier et al. 2019 ). However, high column density absorbers such as DLAs will bias cosmological parameter estimates from Lyman-α and thus need to be masked out (McDonald et al. 2005a) . Simulations have been performed to study the effect of damped absorbers on the Lyman-α 1-D and 3-D flux power spectrum (Rogers et al. 2018a,b) , and a recent Bayesian fitting method has been proposed to better understand how DLA contaminants affect cosmological inference using the BAO peak (Cuceu et al. 2020) . In this work, we present new estimates for the column density distribution function (CDDF), the abundance of DLAs, and the average neutral hydrogen density at z = 2−5 for DLAs in the Sloan Digital Sky Survey IV quasar catalogue from Data Release 16 (SDSS-IV/eBOSS DR16) (Dawson et al. 2016; Lyke et al. 2020) . We compute DLA population statistics using the Gaussian process (GP) model presented in Ho et al. (2020) , a modified version of the machine learning framework from Garnett et al. (2017) . We retrain our model on SDSS DR12 (Eisenstein et al. 2011; Dawson et al. 2013; Alam et al. 2015; Pâris, Isabelle et al. 2018 ) and generate a DLA catalogue from DR16Q (Lyke et al. 2020) . We compute DLA population statistics from the DLA catalogue, which update the estimates we made in Bird et al. (2017) ; Ho et al. (2020) . The pipeline presented in Garnett et al. (2017) provided for the first time probabilistic detections of DLAs in each spectrum, which comes with a posterior distribution on putative DLAs for the column density and the absorber redshift. With the aid of a full posterior probability distribution for the number of DLAs in each quasar spectrum, "soft" detections in noisy data become available. We propagate uncertainties from each individual spectrum into the global population, without setting any hard threshold on the minimum required probability for the presence of DLAs. We are thus able to include even noisy spectra in our sample of DLAs. Ho et al. (2020) added an alternative model for sub-DLAs, which regularised excessive detections at low column density. We also included absorption from the mean optical depth in the Lyman-α forest in the GP mean function. This helped prevent the pipeline from using DLAs to compensate for Lyman-α forest absorption in the spectrum, essential at high redshift. In this work, we further improve this aspect of our model. We marginalise out uncertainty in the effective optical depth in each spectrum using the measured mean optical depth as a prior when computing the evidence for the null, DLA, and sub-DLA models. Several other DLA search methods for SDSS spectra have been implemented. These range from visual inspection surveys (Slosar et al. 2011) , visually guided Voigt profile fitting (Prochaska et al. 2005; Prochaska & Wolfe 2009 ), and template fitting (Noterdaeme et al. 2009 (Noterdaeme et al. , 2012 , to machine learning based methods such as a convolutional neural network (CNN) approach (Parks et al. 2018 ) and an unpublished Fisher discriminant analysis (Carithers 2012) . The CNN method (Parks et al. 2018) was also run to identify DLAs as part of the SDSS DR16 quasar catalogue (Lyke et al. 2020) . We compare the DLAs detected by our GP model and the DLAs in DR16Q in Section 6. Machine learning methods have also been proposed to classify broad absorption lines (BALs), including a linefinder based convolutional neural network (CNN) (Busca & Balland 2018) and a hybrid of a CNN with a principal component analysis Guo & Martini (2019) . Section 2 will briefly outline our modelling decisions and the changes to the model made in this work. Section 2.1 describes the cuts we applied to SDSS DR16Q. We recap our modelling details in Section 2.2. We present our results in Section 3, including the CDDF in Section 3.1 and the incidence rate of DLAs and total HI density in Section 3.2. In Section 4, we discuss the possible remaining systematics in our method. Section 5 shows population statistics for DLAs in Ly∞ to Lyβ. In Section 6, we briefly compare our DLA catalogue to the DLAs presented in the SDSS DR16Q catalogue, which implemented a CNN model (Parks et al. 2018) to classify DLAs. We conclude in Section 7. Here we briefly recap our Gaussian process (GP) based framework for detecting DLAs using Bayesian model selection. We summarise the general approach, while more comprehensive mathematical details may be found in Garnett et al. (2017); Ho et al. (2020) . A quasar sightline has spectroscopic observations D = (λ, y), where λ is a vector of rest wavelength bins, and y is a vector of observed flux at these wavelength bins. Suppose we have built likelihood functions for a set of models {Mi}. We can evaluate the posterior probability of a model, M, given a quasar observation, D, based on Bayes' rule: where p(D | M) is the model evidence of the quasar spectrum D given model M, Pr(M) is the prior probability of model M, and the denominator on the right-hand-side is the sum of posterior probabilities of all models in consideration. Concretely, we have the model without DLAs (M¬DLA), the model with k DLAs ({M DLA(i) } k i=1 ), and the model with sub-DLAs (M sub ). We set k = 3 here, allowing up to 3 DLAs per spectrum. We consider a posterior probability of a sub-DLA, M sub , not to be a DLA detection, as in Ho et al. (2020) . Section 2.2 describes the details of how we compute the model evidence for each model. Table 1 lists mathematical notation and definitions of parameters used throughout the paper. Our GP model requires a training set without DLAs for training the null model, M¬DLA. We use the DLAs in SDSS DR12Q detected by Ho et al. (2020) as our true DLA labels. Here we list the subset of DR12 quasars omitted from our training sample: • Quasars with zQSO < 2.15, which have almost no Lyman-α forest, are removed. • BAL: quasars with a broad absorption line (BAL) probability larger than 0.75 (BAL PROB 0.75) are removed, as suggested by Lyke et al. (2020) . BAL PROB is derived from QuasarNET (Busca & Balland 2018) . • CLASS PERSON == 30: quasars classified as BALs by human visual inspection are removed. • ZWARNING: spectra flagged with ZWARNING for pipeline redshift estimation are removed, but extremely noisy spectra with TOO MANY OUTLIERS are kept. We have in total 89, 408 spectra without DLAs for training the null model. We also use the same above criteria to select the DR16Q spectra for applying our model. In addition to the above criteria, the DR16Q quasar sample to which our model is applied is a subset of the full DR16Q sample chosen following additional conditions: • IS QSO FINAL == 1: We require this flag in the quasar sample, specifying that a spectrum is robustly classified as a quasar. • CLASS PERSON == 3 or 0: This flag specifies that the spectrum was classified by a human as a quasar (3) or was not visually identified (0) . • SOURCE Z: as suggested in Section 3.2 of (Lyke et al. 2020) , spectra with Z > 5 and SOURCE Z == PIPE have a suspect redshift estimate and should not be used without a careful visual re-inspection. We thus remove these spectra from our analysis. Integral to our method is a reliable quasar redshift estimate. It is not trivial to reliably estimate quasar redshifts in the large samples provided by DR16Q, 1 and so we are careful to use the redshift estimates suggested by Lyke et al. (2020) . To ensure our quasar redshifts are as homogeneous as possible, we use Z PCA, the recommended redshift estimate method for statistical analyses of a large ensemble of quasars. We also remove the spectra where redshift measurements disagree with each other by more than 0.1, which means we remove samples with |zi − zj| > 0.1 for zi, zj ∈ {Z PIPE, Z PCA, Z, Z VI}. If Z VI is not present, we use only the other three redshift estimates. Our final DR16Q sample size contains 159 807 Lyman-α quasar spectra. 1 Indeed, we have extended our GP framework to provide a quasar redshift estimate (Fauber et al. 2020 ). Consider a distant quasar with a known redshift, zQSO. Each spectroscopic observation gives us the observed flux, y, on a set of wavelength pixels in observed-frame wavelengths, λ obs . Since the quasar redshift is assumed to be known, we shift into the rest frame, λ = λ obs /(1 + zQSO). Standard errors are provided with each observed flux pixel, σ(λi), with λi the ith pixel in λ, and we define the noise variance of each observed flux pixel as νi = σ(λi) 2 . Given the observed flux of a quasar, we normalise all flux measurements by dividing the median flux observed between [1425Å, 1475Å] in the restframe, a wavelength range redwards of the Lyα emission and avoiding major emission lines. For each quasar observation, we have data D = (λ, y, ν, zQSO). We want to build a likelihood function to describe this data: which is the likelihood of the flux y given all other observed quantities. We model this likelihood as a Gaussian process: where µ is the mean vector of the GP, and Σ is the covariance matrix of the GP. We will use bold lowercase italics for vectors and bold uppercase letters for matrices. A GP is fully specified by its first two central moments: the mean function, µ(λ), and the covariance kernel, K(λ, λ ), (Rasmussen & Williams 2005) . Our task now is to learn the mean function and the covariance function from the training set. Suppose we have a set of quasar observations without any intervening DLAs, {D1, D2, · · · , DN spec }, where Nspec is the number of quasars in the training set. We can then learn the mean function by taking a precision weighted average: where the summation is over i index. j indicates jth pixel in the observed flux, i represents ith spectrum, and we only average over the non-NaN values. Note this differs from Ho et al. (2020) , where we used the mean rather than the precision weighted average. The precision weighted average can be viewed as a result of using an uninformative prior on µj and an independent Gaussian likelihood for each yij. If we have a set of normally disturbed flux pixels with each flux pixel follows yij ∼ N (µj, νij) with known variance νij and an unknown µj with an uninformative prior, the posterior will be a normal distribution with a new mean equals a precision weighted average. Instead of training on the raw observed flux y directly, we follow to train the mean function and the kernel on the flux after removing the average effect of the Lyman-α forest, the de-forest flux: which means we replace observed flux and its variance with the flux and variance before the suppression of Lyman-α forest. The effective optical depth is parameterised as: where λ1i is the transition wavelength from Lyman-α to the ith member in the Lyman series, f1i represents the oscillator strength, z1i is the absorber redshift, and we set N = 31. The absorber redshift is written as: We parameterise the effective optical depth by a power-law relation with τ0,MF and βMF parameters. Here we specify a subscript "MF" to annotate the parameters modified by mean flux suppression. Fig 1 shows our new GP mean function, compared to Ho et al. (2020) . Taking this Lyman-α mean flux into account introduces a dependence on quasar redshift into the mean function of the GP for each quasar: µ(λ, zQSO; βMF, τ0,MF) = µ(λ) · exp(−τ eff,HI (z(λ, zQSO); βMF, τ0,MF)) . (6) µ(λ) is the mean function we learned from Eq 2. We learn the mean function on a dense grid of wavelengths on a chosen rest-frame wavelength range: λ ∈ [850.75Å, 1420.75Å] with a linearly equal spacing of ∆λ = 0.25Å. Ho et al. (2020) only modelled the null model in the Lyman-α region, [911.75Å, 1215.75Å]. We extend the red end of our model to include a part of the metal line region until 1420.75Å. This empirically improved the column density estimation of DLAs near the Lyman-α emission peak, as otherwise part of the damping wing would go beyond 1215.75Å when a large DLA is very close to the quasar. The mean function is thus written as a mean vector µ(zQSO; βMF, τ0,MF) = µ(λ, zQSO; τ0,MF, βMF) and the kernel is written as a matrix Σ(λ, λ ) = Σ. The covariance matrix's optimisation procedure is described in Garnett et al. (2017) ; Ho et al. (2020) . We factorise the covariance matrix as in Ho et al. (2020) : The K matrix is a positive-definite symmetric matrix corresponding to the covariance between each quasar flux pixel. Ω is a diagonal matrix describing the absorption noise: diag ω is freely optimisable while the Lyman-α flux term, (1 − exp(τ eff,HI (z; β, τ0)) + c0) 2 , includes the redshift dependent noise variance with which we model the Lyman-α forest. The optimised absorption noise parameters used here are: The AF is a diagonal matrix reflecting the mean vector suppression for each spectrum corresponding to the mean flux in the Lyman-α forest: The parameters of this matrix follow the values given in Kamble et al. (2020) , which used a power-law relation to measure the effective optical depth in the Lyman-α forest in SDSS DR12: with associated uncertainty for each parameter: The instrumental noise is encoded in the diagonal matrix diag νi, where i simply denotes the ith quasar observation: The final covariance matrix learned from our data is shown in Fig 2. Comparing the kernel matrix we learned in this work to Ho et al. (2020) , the current kernel is less noisy and contains several distinct features of emission lines. The reduction in the noise is due to a larger training set, SDSS DR12Q catalogue, is used for optimising the kernel. After having learned the GP null model, we can write down the null model likelihood function: where the notation M¬DLA specifies that our null GP model is conditioned on a training set without DLAs. Once we have trained our GP null model, M¬DLA, according to Section 2.2.1, we need to integrate out the nuisance parameters associated with Lyman-α forest absorption to get the model evidence. meanflux parameters (βMF, τ0,MF) without their uncertainties, so the model evidence straightforwardly equals to Eq 14 without integration. In this work, we take the uncertainties of meanflux suppression into account and integrate them out, according to Kamble et al. (2020) prior. The model evidence thus will be: where we integrate out (βMF, τ0,MF) with We then use Quasi-Monte Carlo (qmc) to integrate out the meanflux parameters with 30 000 samples of (βMF, τ0,MF). qmc takes samples from a so-called low-discrepancy sequence, leading to faster convergence. Here we draw 30 000 samples generated from a scrambled Halton sequence, which gives samples approximately uniformly distributed on a unit square [0, 1] 2 . We then use inverse transform sampling to transform the Halton sequence to the distribution described in Eq 17. Suppose we have a trained GP null model in Eq 14, the DLA likelihood function will be the null model likelihood function multiplied by Voigt profiles for each line in the Lyman series of the absorber: p(y | λ, ν,zQSO, βMF, τ0,MF, Here A (k) = diag a (k) and a (k) is the function with k voigt profiles, which represents k DLAs: a(λ; zDLA, NHI) is a Voigt profile parameterised by the DLA's redshift, zDLA, and the column density of the DLA, NHI. The Voigt profile parameterisation used in this work is the same as Garnett et al. (2017) . We set the maximum number of DLAs per spectrum at k = 3 in this work, as there are rarely more than three DLAs per spectrum. As described in Garnett et al. (2017) , the default Voigt profile we use in this work includes Lyα, Lyβ, and Lyγ absorption, which allows us to constrain the DLA column density better when the Lyman-β forest is in the observation window. To get the model evidence, according to Eq 18, we need to integrate out the prior over the DLA parameters and the meanflux parameters (βMF, τ0,MF). For convenience, we denote the parameters which need to be integrated out by , βMF, τ0,MF}. For the model with a single DLA, we have four parameters θ = {zDLA, NHI, βMF, τ0,MF}. The model evidence is: By assuming each parameter is independent of each other, we factorise the parameter prior as: where we assign the Kamble et al. (2020) prior for the meanflux parameters as in Eq 17. We use the same prior for column density, p(NHI | MDLA), as Ho et al. (2020) . This was trained using kernel density estimation on the log 10 NHI distribution from Lee et al. (2013) DR9 DLAs with an addition of a 3% uniform prior. The zDLA prior is uniform within the search range for DLAs. We set this search range to be from Lyman-β to Lyman-α. Removing DLAs detected in the Lyman-β forest ensures the purity of DLA samples in deriving the statistical properties of the DLA population. However, to generate a complete catalogue, we also consider a search range from the Lyman limit to Lyman-β. We used the same model and priors for the sub-DLA model as in Ho et al. (2020) . The sampling range of the redshifts of sub-DLAs is the same as for the DLA model. Model priors are the same as Ho et al. (2020) , based on the DLA catalogue in SDSS DR9 (Carithers 2012). In this section, we show some example spectra to demonstrate our proposed model. Figure 3 shows an example with prominent DLA features. As shown in the parameter space (middle plot), the posterior distribution is peaked at the maximum a posteriori (MAP) values of those two DLAs. Our GP model estimates the parameters of the DLAs with small uncertainties. As shown in the top plot, our MAP values agree with the column densities measured by the CNN model reported in the DR16Q catalogue. Figure 4 shows an extremely noisy spectrum, for which our GP model is very uncertain about the effective Lyman-α absorption in the spectrum. The DLA models are degenerate with the absorption from the Lyman-α forest. Without modelling the uncertainty in the mean flux, the GP model does not know that the drop in the spectrum can be explained by Lyman-α forest absorption. It instead fits a big DLA with NHI = 10 22.9 cm −2 as its preferred explanation for the drop in flux. As we use more parameters to compute the DLA or sub-DLA model, the model selection will prefer to fit a Voigt profile to the GP if all candidate models are poorly fit. Thus, the DLA or sub-DLA model's evidence is sometimes too strong compared to the null model. The most common poor fit situations are quasar spectra with zQSO < 2.5 and with low signal-to-noise ratios (SNR). As SDSS optical spectra have a fixed observing window, quasar spectra with zQSO < 2.5 have an incomplete Lyman-α forest. The constraining power of the quasar becomes weaker as only part of the data fits into our modelling window, [850.75Å, 1420.75Å]. Thus the DLA model and the null model are closer in likelihood space. To avoid this situation, we introduced an additional Occam's razor in Ho et al. (2020) , which is injected in the model selection as: Here N is the Occam's razor penalty, and we used N = 10 000 in Ho et al. (2020) . We previously validated the Occam's razor strength by matching it to the DR9 concordance catalogue (Carithers 2012) . In this work, however, we modify our null model to consider uncertainty from the mean flux measurement, which means it has more parameters. Thus, the null model gains more constraining power, so a weaker Occam's razor may be preferable. To make our model posteriors more consistent with human identifications, we decided to conduct a visual inspection on a small subset of the spectra. We first train a model without Occam's razor and select at random from this model 239 putative large DLAs with NHI > 10 22 cm −2 and 243 putative small DLAs with 10 20 NHI < 10 21 cm −2 . We visually inspect each spectrum and compute the model posteriors with a range of strengths for Occam's razor, N = {1, 10, 100, 1 000, 30 000}. We then treat each spectrum as a multiple-choice problem: if we think the model posterior of a given Occam's razor describes the given spectrum well, then we record one vote for this value of Occam's razor. Multiple selections are allowed for each spectrum as the model posteriors are often very close. After collecting votes, the winning value of Occam's razor was N = 1 000, a ten times reduction from our earlier value. For quasar spectra with zQSO > 2.5 there are enough data points in the Lyman-α range that the strength of Occam's razor has a small effect. We will discuss the effect of Occam's razor in Section 4. We suggest incorporating variations due to Occam's razor into the uncertainty in population statistics for conservative usage. Here we summarise the modifications we made in this work, comparing to the model of Ho et al. (2020) : (i) Our training set is SDSS DR12 quasar spectra with DLAs detected by Ho et al. (2020) removed. We considered a DLA to be detected if the posterior probability of a spectrum containing a DLA is larger than 0.9, P (MDLA | D) > 0.9. (ii) The wavelength range modelled goes from λrest = 850.75Å to λrest = 1 420.75Å. (iii) The effective optical depth prior, (τ0, β), is updated from Kim et al. (2007) to Kamble et al. (2020) . (iv) The uncertainty in the mean flux suppression parameters, τ0 and β, is marginalised while computing the model evidence. The first modification gives us a training set size containing 89, 408 spectra without DLAs. The larger training set better learns the covariance structure of quasar emission lines, which allows the second modification: expanding the model to cover the Lyman break and Siv line. The expansion en-ables the model to use the metal lines to constrain the correlations of the emission lines in the Lyman-α forest. When using the previous modelling range, [911.75Å, 1215 .75Å], we found that we often detected DLAs with high NHI in the red end of the spectrum, where the code inserts a DLA at the quasar redshift to compensate for an oddly shaped Lymanα emission line. This was possible because when we cut the spectrum at a rest frame wavelength 1215.75Å, half of the damping wings were removed, allowing for more model freedom and dubious NHI estimation. Third, to make the mean flux suppression prior for (τ0, β) consistent with the DR12Q training set, we switched to the mean flux measurement based on BOSS DR12Q (Kamble et al. 2020 ). Our last modification is marginalising the uncertainty of Kamble et al. (2020) 's parameters while marginalising the DLA parameters. To compute the statistical properties of DLAs, we need to convert the posterior distribution of a DLA in each spectrum into the expected number of DLAs per redshift or column density bin, for which we use the method described in Bird et al. (2017) . We briefly summarise the modelling decisions we made to produce the DLA samples in the result section: • Search range: from Lyman-β to Lyman-α. 2 . • Maximum number of DLAs: three. • Maximum zQSO: quasar redshifts < 7. • DLA redshift 2 < zDLA < 5. Figure 5 shows the CDDF we estimate from DR16Q spectra. In the following sections, the CDDF is computed for NHI ∈ [10 20 , 10 23 ], while the DLA incidence rate dN/dX and the total HI density in DLAs ΩDLA are computed for NHI ∈ [10 20.3 , 10 23 ]. Ho20 refers to Ho et al. (2020) , a DR12 DLA catalogue that used a modified GP model from Garnett et al. (2017) . The CDDF is a histogram of column densities normalised by the effective spectral path that could contain 2 For the CDDF in Ho et al. (2020) , we used a sampling range from Lyβ + 3 000 km s −1 to Lyα − 30 000 km s −1 to avoid finding DLAs in the proximity zone. Here, we instead use Lyβ + 3 000 km s −1 to Lyα − 3 000 km s −1 . This has a very moderate effect on our results, however, we provide a check of systematics due to removing DLAs near to the quasar redshift in Section 4.2. DLAs. We count all spectral path with an absorber with zDLA < 5. Error bars denote the 68% confidence limits, and the grey band represents the 95% confidence limits. Note that the uncertainties here are the statistical uncertainties associated with the GP model. They do not include uncertainty due to potential systematics. Section 4 will describe how possible systematics would affect the CDDF. As shown in Figure 5 , we observe non-zero column density until 3 × 10 22 cm −2 . Our DR16 measurement is mostly consistent with our previous DR12 measurement until NHI 9 × 10 21 . For NHI 3 × 10 22 , both our DR12 and DR16 measurement are consistent with zero at 95% confidence level, though there is one bin from DR16 not consistent with zero (see Table A3 ). We also measure no turn over for the CDDF at the high column end, NHI ∼ 10 21.5 cm −2 . It was suggested in Schaye (2001) that molecular hydrogen sets a maximum NHI so that steepen the CDDF at the high end. The latest simulated CDDF from simba (Hassan et al. 2020) , which included molecular hydrogen formation in their star formation recipe, predicts no turn over at the high end, consistent with our measurements. In Figure 6 , we plot the CDDF with different Occam's razor strengths. When the Occam's razor strength is weak (N = 30), model selection will find DLAs even though the SNR is low, so we get more absorbers at both high and low column density ends. On the other hand, if the razor strength is strong (N = 30 000), model selection will prefer to avoid finding DLAs at low SNR spectra, which results in a decrease. However, in general, in Figure 6 , we observe the razor strength only marginally affects the CDDF. Thus the small tension at the low end, NHI ∈ [10 20 , 10 20.3 ], between our CDDF and N12 is more likely due to other reasons than Occam's razor. We show the redshift evolution of the CDDF in Figure 7 . The downward pointing symbols indicate the 68% upper confidence limit when the data is consistent with zero at 68% confidence limits. As we can anticipate, for high- redshift quasars with zQSO > 4, since the flux is highly absorbed, we detect DLAs with larger uncertainties, and the number of large DLAs is consistent with zero. In both Bird et al. (2017) and Ho et al. (2020) , we found that the CDDF is getting shallower at z > 4. However, given our detection for NHI > 4 × 10 21 at z > 4 is highly uncertain and consistent with zero detection, this trend is not significant in our current dataset. Instead, the detection of DLAs with NHI < 4 × 10 21 at z > 4 is consistent with the measurements at z ∈ [2.5, 4]. We find no strong evidence for an evolution of the slope of the CDDF at z > 4. . The CDDF derived from DLAs in a variety of redshift bins. Labels show the redshift bins in used. We show 68% confidence limits in error bars and 95% confidence limits in grey areas. If the bin is consistent with no detection at 68% limits, we show a down-pointing arrow indicating the 68% confidence upper limit. One possible reason why we found the CDDF was shallower at z > 4 in Bird et al. (2017) and Ho et al. (2020) is absorption due to the Lyman-α forest. When the spectrum is highly absorbed, there is a degeneracy between a large DLA and the forest's absorption. In Bird et al. (2017) , we did not model the GP mean as a function of effective optical depth, so it is possible the model was trying to use DLAs to compensate the excess absorption due to the forest, which results in a shallower CDDF at z > 4. In Ho et al. (2020) , the slope of the CDDF is less shallow at z > 4, as we modelled the effective optical depth into our GP mean. In this work, we integrated out the measurement uncertainty of the mean flux, and the slope is almost indistinguishable from the CDDF at z ∈ [2.5, 4]. This may indicate that, to understand the DLAs at z > 4 better, we need a better measurement for the effective optical depth at z > 4. One interesting feature in Figure 7 is the drop in the amplitude of the CDDF at z ∈ [2, 2.5]. As we will discuss in Section 3.2, the drop of CDDF at the low redshifts also shows in the DLA incident rate, dN/dX. We will discuss this in more detail in the next section. Figure 8 shows the incident rate of DLAs, dN/dX, as a function of absorber redshift. Our results are consistent with Wolfe (2009) and Ho et al. (2020) and are slightly lower than Noterdaeme et al. (2012) . dN/dX is sensitive to the weaker DLAs, so the difference between Noterdaeme et al. (2012) and Prochaska & Wolfe (2009) is likely due to the false positive rate. Prochaska & Wolfe (2009) performed a visually-guided Voigt profile fitting on SDSS-DR5. Though their sample size is smaller, with the help of the human eye, their method is likely less prone to false positives than the automated template fitting used in Noterdaeme et al. (2012) . This difference may also explain the drop in amplitude of the CDDF at z ∈ [2, 2.5]. Prochaska & Wolfe (2009) (2012) have a larger discrepancy at z ∈ [2, 2.5], and Noterdaeme et al. (2012) may overestimate the weak absorbers at this redshift range where the spectra are short. Our measurement is consistent with Prochaska & Wolfe (2009) , which implies that we detect fewer weak absorbers in low redshift bins and explains the small tension in the CDDF at low-NHI. One noticeable feature in dN/dX, which we have not discussed before, is the decrease of the line density from z = 3.5 to z = 4.0 and another increase at z > 4.0. This feature is also shown in our Ho20 measurement. The drop of dN/dX at z ∈ [3.5, 4] is consistent with Prochaska & Wolfe (2009) at 95% confidence limits. One interesting question is whether the increase from z ∈ [4.0, 4.5] is real. The measurements at z > 4 still have large error bars, so it is hard to say whether dN/dX at z > 4 is an increase or a flat line. More data, especially with high SNR, are needed to determine the trend of line density at z > 4. In Figure 9 , we show the total HI density in DLAs in terms of cosmic density. Our results are mostly consistent with Noterdaeme et al. (2012) at z ∈ [2.5, 3.5]. At higher redshift bins, z > 3.5, our measurements are consistent with Prochaska & Wolfe (2009) and Crighton et al. (2015) . Crighton et al. (2015) used high signal-to-noise spectra from a smaller survey, so they have larger error bars. Comparing to Ho20, our current ΩDLA has more mass at low-redshifts (zDLA ∼ 2) and less mass at zDLA ∈ [3.5, 4]. The trend of ΩDLA in DR16 is shallower than Ho20. We also plot the ΩDLA measured by Berg et al. (2019) in Figure 9 . We see our DR16 measurement is consistent with Berg et al. (2019) even at z > 3.5. There is a slight tension at z < 2.5, which may be because some low-redshift spectra are too short and noisy to measure column density confidently using our model. SDSS spectra with zQSO < 2.5 only covers a region from the Lyα to Lyβ or shorter. When the signal-to-noise is low, it is difficult to identify DLAs even using human eyes. As shown in Fig 10, belief in detecting a DLA in a short and noisy spectrum, as discussed in Section 2.4. Note that, from Figure 7 we see there are no solid detections for NHI > 3×10 21 DLAs at z > 4. In Bird et al. (2017) , ΩDLA was skewed towards high values at z > 4 even without real detections of large DLAs. Our result in Figure 9 does not have this issue. This may indicate our proposed method of integrating out the uncertainty on meanflux measurement helps us avoid the forest biasing the posterior density of NHI towards the high end. In general, we observe an increase of ΩDLA from z = 2 to z = 3.5, and a slight decrease from z = 3.5 to z = 4. For zQSO > 4, the measurement error is larger and less correlated between redshift bins, as in Ho20. This is reasonable given the lower quasar number density at high redshift. Figure 10 shows the line density and ΩDLA with various Occam's razor strengths. As expected, the razor strengths only moderately affect the statistics of low redshift spectra. For dN/dX, our results are consistent with Prochaska & Wolfe (2009) , even with the weakest razor. N12 still detects somewhat more weak DLAs than we do, even though we only apply a small penalty for these short and noisy spectra. 4.1 Effect of signal to noise ratios Figure 11 and Figure 12 show the abundance of DLAs from subsets of our catalogue with various signal-to-noise cuts (SNR), SNR > 2 and > 4. We define our SNR as the median of the ratio between the flux and the instrumental noise within the quasar spectrum redwards of the Lyman-α emission peak. This specific choice is to avoid introducing correlations between the detected DLAs and the SNR. With this definition, 80% of the quasar spectra have SNR > 2, and 46% of the spectra have SNR > 4. We verify that, in Figure 11 , the CDDF is not sensitive ALL GP SNR > 2 SNR > 4 Figure 11 . The CDDF of DLAs for a subset of samples with different minimal SNRs. SNR > 2 (orange) excludes 20% of the noisiest spectra, and SNR > 4 (green) excludes 54% of the spectra. 68% confidence limits are drawn as error bars, while 95% confidence limits are shown as a grey filled band. to the SNR when NHI < 10 22 cm −2 . However, we note that the highest non-zero column density at 95% confident limits changed from NHI < 3 × 10 22 cm −2 to NHI 10 22 cm −2 for samples with SNR > 4. This is likely because there are too few high column density absorbers to constrain the CDDF sufficiently at the high end in the smaller high SNR sample. We find that our ΩDLA measurement exhibits no systematic correlation with the SNR cuts. We notice a dependence of SNR on dN/dX at z ∈ [2.0, 2.5], which is due to the difficulty of finding DLAs in short and noisy spectra. As discussed in Section 2.4, it is hard to find features in these spectra, and the observing window cannot fully cover a high-NHI DLA profile with damping wings. To secure our samples' purity, we use an Occam's razor penalty which may also introduce this SNR dependence at z ∈ [2, 2.5]. As mentioned in Krogager et al. (2019) , the colour and magnitude criteria used in SDSS for quasar target selec-tion is biased against dusty DLAs, which harbour a certain amount of cold neutral gas. Krogager et al. (2019) showed that in SDSS DR7 this caused ΩDLA to be underestimated by 10 − 50% at z ∼ 3. Also, redder quasars containing metal rich dusty DLAs will have lower SNR in the blue part of the spectrum and thus may be excluded from the sample of Noterdaeme et al. (2009) , who enforced CNR > 3. This effect is likely to be substantially reduced in our sample, if present at all, as we use all quasars irrespective of SNR. We also define SNR using the region redwards of the Lyα emission peak specifically to avoid this kind of selection effect, and we are using DR16, which has a different and more complex selection function. More quantitatively, the XQ-100 targets in Berg et al. (2019) use only radio-selected quasars, or quasars previously found by other techniques, and so avoids any SDSS colour selection bias. Figure 9 shows that our ΩDLA mostly agrees with Berg et al. (2019) , implying that colour effects in our sample are smaller than those in DR7. In Figure 13 , we test our measured dN/dX with different quasar redshift bins. In a perfect scenario without systematics, we expect that the absorber properties be uncorrelated with the background quasars, as they are widely separated in physical space. However, Figure 13 , shows some residual correlation between absorber properties and the redshifts of the background quasars for DLAs in spectra with zQSO < 3. In Figure 14 , we have investigated removing the sampling range near the quasar redshift, |zQSO − z| < 30 000 km s −1 . We found removing the putative absorbers near the Lyman-α emission is sufficient to remove the correlation between quasar redshifts and DLA properties at z < 3. A small tension still exists for the z = 2 bin within 2.5 > zQSO > 2.0 for dN/dX, which may be due to the effect discussed above for SNR, as these very short spectra are often also noisy. To understand the implication of applying Occam's razor to the model posteriors, we conduct a test based on adding noise to a DLA spectrum. We choose a quasar spectrum that we are very confident contains a DLA and add additional Gaussian noise with zero mean and standard deviation σ to the flux and noise variance. We then examine changes in the DLA model posterior p({MDLA} | D). This test will mimic the effect of SNR on the model's ability to detect the underlying DLAs. For Occam's razor N = 30 000, the model posterior is p({MDLA} | D) 0.9 for σ 1.5, which corresponds to SNR 0.9. On the other hand, for a model without Occam's razor, the model posterior is p({MDLA} | D) 0.9 for σ 3, which means SNR 0.5. A strong Occam's razor thus introduces false negatives in very noisy spectra. However, by visually inspecting the flux with σ = 3 we determined that it is almost impossible for humans to identify the underlying DLA. Therefore, we choose to follow the value (N = 1 000) we determined in Section 2.4. We were unable to quantify the number of false positives, as our simple assumption of Gaussian noise rarely produces correlated structures that resemble DLAs. In practice, false positives are likely caused by oscillatory structure embedded in the noise, present when the SNR is extremely low. We have shown the CDDF, dN/dX, and ΩDLA of our GP model in Section 3. In this section, instead of using a sampling range from Lyβ to Lyα, we only compute the population statistics of DLAs detected within the Lyβ forest region. We set the sampling range to be Lyman limit +30 000 km s −1 to Lyman-β. We cut off a wider velocity width at the blue 10 21 10 22 10 23 N HI (cm 2 ) end to avoid counting DLAs detected right on the edge of the Lyman break. Figure 15 shows the CDDF for DLAs in the Lyman β region. As we can see from the figure, it is mostly consistent with the CDDF from Lyβ-Lyα for NHI < 10 21 , and it starts to diverge for NHI > 20 22 . We visually inspected those spectra and found that they are mostly due to fitting large DLAs on the spectra's noisy left edges. This may indicate that additional regularisation is still needed to avoid spurious detections at the blue end of high redshift spectra. In particular, if the redshift measurement is slightly inaccurate, parts of the Lyman break move into our modelling window. We also show the dN/dX and ΩDLA for DLAs in the Lyman-β forest region in Figure 16 . dN/dX in the Lyman-β region is broadly consistent with other measurements, with the detection consistent with zero at zDLA > 4. For ΩDLA, in the right panel of Figure 16 , we observe our measurement is biased high and highly uncertain for zDLA > 3.5. This may be because our current model can only poorly estimate the column density from the Lyman-β region from high-redshift quasar spectra, perhaps due to the high level of absorption from the Lyman-β and Lyman-α forests at these redshifts. Alternatively, it could again reflect that the mean flux measure is not certain at these redshifts, so the degeneracy between large DLAs and the effective Lyman-α/Lyman-β absorption is not fully broken by sampling (τ0,MF, βMF). SDSS DR16Q includes DLA measurements using the convolutional neural network (CNN) model of Parks et al. (2018) . The DLAs from the CNN model are recorded as CONF DLA, Z DLA, and NHI DLA columns in the DR16Q catalogue. 3 To compare our model and the CNN model, we restrict the zDLA sampling range of the CNN DLAs to be the same as our GP DLAs. Table 2 shows the confusion matrix. On the existence of DLAs, which means the binary classification of having at least one DLA or no DLA, the GP model is ∼ 94.8% in agreement with the CNN model. If we only consider only spectra with SNR > 6, the rate of agreement climbs to ∼ 96.5%. We have also checked the CDDF of the CNN DLAs, as shown in Figure 17 . The sampling range is restricted to be the same as ours, and we only count the DLAs with CONF DLA larger than 0.98. The CDDF of the CNN model under-detects DLAs with NHI > 7 × 10 20 , compared to N12. We have discussed this issue in Figure 19 of Ho et al. (2020) . The CDDF of the CNN model in the DR16Q catalogue shows improvements in detecting more high column density systems comparing to Parks et al. (2018) , but it is still an order of magnitude lower than N12 for NHI > 2 × 10 21 . Thus the lack of high column density systems in the CNN DLAs, as identified in Ho et al. (2020) , is still present in the latest catalogue. The dN/dX of the CNN model, in contrast, mostly agree with our GP measurements. Bins with z > 4.5 are even consistent at the 1-σ level. Since dN/dX is sensitive to low column density systems, it shows these two codes find consistent small DLAs, but differ in their column density estimates. We compare DLAs detected by the CNN and GP codes on a spectrum-by-spectrum basis in Figure 18 . As anticipated, the CNN and the GP code have a perfect agreement in zDLA, but the CNN predicts slightly lower log 10 NHI than the GP code, consistent with the CDDF plot in Figure 17 . We visually inspected 319 quasar spectra, where the CNN code strongly disagrees with the GP code's detections. As expected, most cases are spectra with low SNRs, where even human experts will have difficulty identifying DLAs. Besides those low-SNR cases, in general, the CNN code has false negatives on DLAs overlapping with sub-DLAs or DLAs very close to each other. There are 24 out of 319 cases which show a clear pattern where the CNN missed the DLAs when multiple absorption systems are overlapping or nearby. 4 Some of these are ambiguous detections, but 9 out of 24 have apparent damping wings on the absorber. We show two examples in Figure 19 . The first one shows a sub-DLA intervening on the right of the DLA damping wings. Though the damping wings are disturbed by the sub-DLA, the pattern of a DLA is still visible. The second example shows two DLAs close to each other, but not close enough to overlap. We suspect these non-detections for the CNN code are due to the lack of training data for multiple absorption systems (sub-DLAs or DLAs) close to each other. Since these overlapping cases are rare in the real dataset, we think one might need to implement simulated DLAs/sub-DLAs to augment the CNN training set. The comparison of Ω DLA with difference sampling ranges, Lyβ-Lyα (blue) and Ly∞-Lyβ (red). Other plot settings are the same as Figure 9 . Table 2 . The confusion matrix for multi-DLAs detections between the GP and the CNN model (Parks et al. 2018 ). Note we require both the model posteriors of our GP model and DLA confidence in Parks to be larger than 0.98. We also require log 10 N HI > 20.3. The maximum number of DLAs is fixed to three, and everything larger than three is considered three. Parks et al. (2018) . The z DLA and log 10 N HI values are taken from the SDSS DR16Q catalogue in column Z DLA and NHI DLA. We require the confidence of DLAs to be larger than 0.98 and set the search range of the CNN DLAs to be the same as our search range, which is Lyman-β +3 000 km s −1 to z QSO − 3 000 km s −1 . (Right) The line density of the DLAs detected by the CNN model. All three measurements, GP, PW09, and CNN are consistent on the line density. We have presented a new estimate of the abundance of DLAs from z = 2 to z = 5 and a DLA catalogue built from SDSS DR16Q spectra (Lyke et al. 2020 ) using our Gaussian process model Garnett et al. (2017) ; Ho et al. (2020) . We verify our results are in good agreement with previous measurements from Noterdaeme et al. (2012) , Wolfe (2009), and Crighton et al. (2015) . We newly integrate out the uncertainty in the measured mean flux, which improves our modelling of DLA detection uncertainties for zQSO > 4 without biasing towards high NHI detections. We note, nevertheless, that there is a residual dependence on low-redshift spectra with zQSO < 2.5. This could be due to unmodelled systematics or simply because the lowredshift optical spectra are incomplete in the Lyman series range, so we can not securely detect DLAs in low zQSO. Incorporating spectra with shorter observed wavelengths could potentially verify these detections at zQSO < 2.5. Our measurement shows the abundance of DLAs and neural hydrogen increases moderately over 2 < z < 4, while the trend beyond z = 4 is unclear due to statistical uncertainties. Larger datasets and better mean flux measurements are needed to give more robust constraints for DLA detections at z > 4. Our DLA catalogue is publicly available at http: //tiny.cc/gp_dla_dr16q, including both MATLAB catalogue and JSON catalogue. A sub-DLA candidate catalogue is available in JSON format. README files are included to describe the data formats of both catalogues. The data files for DLA population statistics are also included, including CDDF, dN/dX, and ΩDLA with or without SNR cuts. A tutorial for manipulating the MATLAB catalogue is publicly available at https://github.com/jibanCat/gp_ dla_detection_dr16q_public/tree/master/notebooks as a notebook file. Our GP code is also publicly available at https://github.com/jibanCat/gp_dla_detection_ dr16q_public/. Table A5 . The column density distribution function integrated over spectral lengths within 2.5 < z < 3 log 10 N HI f (N HI ) (10 −21 ) 68% limits (10 −21 ) 95% limits (10 −21 ) Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) −21 ) 68% limits (10 −21 ) 95% limits We thank the anonymous referee for the constructive comments and suggestions. We thank Reza Monadi and Bryan Scott for useful discussions and comments. SB was supported by NSF AST-1817256. RG was supported by the NSF under award numbers IIS-1939677, OAC-1940224, and IIS-1845434. SB and RG were supported by an Amazon.com Machine Learning Research Award, which also provided computing time. We hope the world can recover from the COVID pandemic soon.