key: cord-0936827-ojcx5s2f
authors: Domingues, T. D.; Mourino, H.; Sepulveda, N.
title: Analysis of cutoff point estimation for determining seropositivity in the context of SARS-CoV-2 infections
date: 2021-12-21
journal: nan
DOI: 10.1101/2021.12.20.21268100
sha: 164ab1dd0c81626af96b029d089dce1c63cd63e6
doc_id: 936827
cord_uid: ojcx5s2f

In this work will apply mixture models based on distributions from the SMSN family to antibody data against four SARS-CoV-2 virus antigens. Furthermore, since the true infection status of individuals is known a priori, performance measures will be calculated for the methods proposed for cutoff point estimation such as sensitivity, specificity and accuracy. The results of a simulation study will also be presented.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection that causes the devastating and often lethal COVID-19 disease was first detected in China, province of Wuhan in December 2019 ( [26] ). Rapidly, SARS-CoV-2 infection spread over the entire world and the COVID-19 disease was declared as a pandemic by the World Health Organization.

The detection of the virus is so far done by the so-called reverse transcription quantitative PCR (RT-qPCR) on samples from nasopharyngeal or throat swabs ( [26] ). In general, only symptomatic individuals or people who were in close contact with detected cases are tested, which might lead to an underestimation of the proportion of individuals infected with SARS-CoV-2 ( [31] ). Alternatively, serological testing allows to detect asymptomatic individuals exposed to the infection. In addition, serological testing is able to quantify the degree of exposure to the infection in the population. In this context, it is important to estimate seroprevalence at the population level, i.e., the proportion of seropositive individuals that show antibodies against any SARS-CoV-2 antigen ( [14] ).

The presence of antibodies in a serum sample can be regarded as an indicator of immunity against a given infectious agent or as an indicator of past infection in the absence of vaccination ( [10] ). The detection of antibodies in the serum samples is classically done via enzyme linked immunosorbent assays (ELISA), where the resulting data are light intensities, also called optical density, which reflects the underlying antibody concentration in the samples ( [9] ). For statistical convenience, the analysis of serological data proceeds by dichotomizing the amount of antibodies present in the serum of an individual using an arbitrary cutoff point in the antibody distribution to achieve a certain sensitivity and specificity. This allows the classification of individuals into seronegative (with antibody levels below the cutoff point) and seropositive (with antibody levels above the cutoff point) ( [26] ).

Given the possible impact of the cutoff chosen, different criteria for seropositivity determination have a direct impact on the sensitivity and specificity of the respective serological classification ( [22] ). In addition, it might also impact the estimation of the seroprevalence ( [13] ) and the following (epidemiological) decision that can be taken when facing a given estimate of this epidemiological parameter. This means that when determining the cutoff point for a serological test, one should take into account the benefit of the test, the economic and social consequences of serological misclassification and the prevalence of the disease in the population. It turns out that these aspects are often ignored in practice ( [25] ).

One of the traditional methods to establish the cutoff point in serological assays is to consider the logarithmic transformation of the antibody concentration of a known seronegative population and proceed to calculate the mean plus 2 or 3 standard deviations ( [25, 18, 4, 32] ). This method is more adequate when the antibody distribution of the seronegative population is normally distributed ( [4] ). However, our previous studies of different serological data ( [30, 9] ) showed evidence against a normality assumption for the antibody levels associated with a putative seronegative population. In the case where the true infection (or disease) status is known, ROC curve-based methods are most commonly used to determine the cutoff point for defining seropositivity. These methods are widely discussed in the literature ( [23, 12, 27, 11, 5, 33, 21] ).

Alternatively, finite mixture models can be used to determine the seropositivite cutoff directly from the data ( [4, 29, 13, 21, 9] ). In our previous work, three methods for determining seropositivity cutoff were explored using the so-called scale mixtures of Skew-normal distributions in the case where the true infection status is unknown ( [9] ). In this paper we applied the same methods and models in order to evaluate their performance in freely available serological data concerning SARS-CoV-2 virus ( [26] ). We also used simulation to understand the performance of the cutoff estimators associated with different criteria for seropositivity determination.

In this study we analyzed IgG antibody responses against four SARS-CoV-2 spike or nucleoprotein antigens: RBD -glycoprotein receptor-binding domain; S tri -S trimeric spike protein; S1 -spike glycoprotein S1 domain; S2 -SARS-CoV-2 spike glycoprotein S2 domain. Antibodies were measured in serum samples collected up to 39 days after symptom onset from 215 adults in four French hospitals (53 patients and 162 health-care workers) with quantitative RT-PCRconfirmed SARS-CoV-2 infection. A total of 335 negative control serum samples were collected from France, Thailand, and Peru before the start of the COVID-19 pandemic ( [26] ). A detailed description of lab procedures can be found in the original study ( [26] ). The data is freely available at https://github.com/MWhite-InstitutPasteur/SARSCoV2SeroDXphase2.

Serological data can be viewed as arising from two or more latent populations; each population is assumed to represent different levels of exposure to a given antigen. For simplicity, individuals that were never exposed or exposed a long time ago to an infectious agent are considered as seronegative. In contrast, individuals exposed to the same infectious agent are considered seropositive. In this scenario, the antibody distribution can be described by a mixture of two or more probability distributions ( [8] ). However, the true serological state of the individuals is unknown and therefore it needs to be estimated.

In the particular case of the SARS-CoV-2 data, we know which individuals were exposed to the virus and, therefore, we can assume to know which individuals are true seronegative and seropositives.

In many serological studies, it is common to assume a normal distribution for the basis of the mixture models. However, the behaviour of antibody distribution is not constant over time and their concentration decreases after infection ( [26] ). This fact makes the distribution of the seropositive population skewed to the left ( [10] ). In order to accommodate the possible skewness in the seropositive population we use the scale mixture of Skew-Normal (SMSN) class of distributions that include the Skew-Normal and the Skew-t distributions, which will be the focus of our study. A brief description of these alternative distributions can be found below.

Let W ⌢ SN (µ, σ 2 , α) a random variable with a Skew-Normal distribution. In this distribution, the parameters µ, σ 2 , and α can be seen as the location, scale, and shape parameters, respectively. Then the probability density function (pdf) is given by

where ϕ(.) and Φ(.) is the pdf and the cumulative distribution function of the standard Normal distribution, respectively ( [3, 1, 9] ). The Skew-Normal distribution is part of a family of distributions called the Scale Mixtures of Skew-Normal distributions (SMSN), of which the Skew-t distribution is also a particular case ( [9] ).

A random variable W is said to have a Skew-t distribution, W ⌢ ST (µ, σ 2 , α, v), if the pdf is given by

where f T (.; µ, σ 2 , v + 1) and F T (.; µ, σ 2 , v + 1) represents the pdf and the cumu- 

Let G 1 and G 2 be the seronegative and seropositive subpopulations from a population G, respectively. Let π 1 and π 2 the probabilities of sampling a seronegative and a seropositive individual, respectively (with the usual restriction of 2 k=1 π k = 1 and 0 ≤ π k ≤ 1) and considering Z the random variable that represents the antibody level. The probability density function (pdf) of Z is given by

where f k (z; θ k ) is the mixing probability density function of Z associated with the k − th latent population and parameterized by the vector θ k . Θ is the vector of all unknown parameters of the mixture model, i.e., Θ = (π 1 , π 2 , θ 1 , θ 2 ). In our application, f k (z; θ k ), is given by the Skew-normal or the Skew-t distributions.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021. 12.20.21268100 doi: medRxiv preprint In general, the estimation of a finite mixture model can be done by the classical EM algorithm ( [15] ). The EM algorithm is an iterative method widely used in incomplete data problems where the maximum likelihood estimators (MLE) have no closed expression ( [7] ). Considering (z 1 , z 2 , ..., z n ) the observed sample of size n and Y i ≡ Y ik , (i = 1, .., n; k = 1, 2), the binary vector representing the component from which the data comes from. Thus, Y i ⌢ Bernoulli(π 2 ) and the pdf of Y i is given by

We have that the complete data is the pair (z n , y n ) and the joint pdf is given by

Then, the log-likelihood function is given by

The step E of the EM algorithm consists in obtaining

log{(π 2 f 2 (z i ; θ 2 )},

The step M consists in maximizing Q(Θ, Θ (p) ) as function of the unknown parameters. However, if the model has many parameters that need to be estimated, then step M may incur in computational problems such as excessive time consuming or estimate instability. In this sense, it is possible to break the step M into several sub-steps (S > 1) that allow to get around these computational constraints by performing some restrictions on the parameters. This method is called expectation-conditional-maximization (ECM) algorithm ( [20, 17, 19] ). Considering that Θ (p+s) represents the value of Θ in the s th CM step of the iteration p + 1 in order to maximize Q(Θ, Θ (p) ) and the constraint function g s (Θ) = g s (Θ (p+(s−1) ), the ECM algorithm is performed as follow ( [17] ):

1. calculate the expected complete-data log-likelihood given the current estimates of the parameters, Θ (p) . The calculations are the same as for the EM algorithm; 2. fix Θ (p) and calculate Θ (p+s) to maximise the expected complete-data loglikelihood; 3. fix Θ (p+s) and calculate Θ (p+(s+1)) to maximise the expected complete-data log-likelihood on the s + 1 sub-step iteration and continuing until you have gone through all the S sub-steps.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint

In this way, it can be seen that Q(Θ, Θ (p+s) ≥ Q(Θ, Θ (p) ) for all Θ ∈ Ω s (Θ (p+s) ), where Ω s (Θ (p+s) ) = {Θ ∈ Ω : g s (Θ) = g s (Θ (p+(s−1) )} ( [20, 17, 19] ).

Considering the SMSN family of distributions, namely the Skew-Normal and the Skew-t distributions, the application of the ECM algorithm in the context of mixtures can be found in ( [16, 3] ).

In order to decide which model is the best one among all the models fitted to the same data, we used the Bayesian Information Criterion (BIC) ( [9] ).

Seroprevalence is an epidemiological measure defined by the proportion of seropositive individuals in the sample. For its estimation, it is then necessary to define the serological status of the i-th individual by dychotomization the variable, Z i , which represents the antibody concentration of the individual. This dychotomization is done by determining a value c such that for antibody values equal to or greater than c, the individual is classified as seropositive and seronegative, otherwise. Thus, let Y be the random variable representing the number of seropositive individuals in a sample of size n, we have to

where π 2 represents the seroprevalence, i.e, π 2 = P [Z i ≥ c] and I {.} is the indicator variable. Considering that the random variable representing the antibody levels Z i is modelled by a finite mixture of distributions, the way to estimate the cutoff c from the observed data is not standard. To facilitate the determination of this cutoff value, we below present three estimation methods or criteria.

-Method 1 (M1): It is based on the 99.9%-quantile associated with the estimated seronegative population. This method is the most popular in seroepidemiology ( [29, 28] ). It is often called as the 3σ rule, because the 99.9%quantile is given by the mean plus 3 times the standard deviation of a normally distributed seronegative population; -Method 2 (M2): It relies on the minimum of the density mixture functions.

In the case of two latent populations, the cutoff corresponds to the absolute minimum, and in the case of three or more latent populations the cutoff corresponds to the lowest relative minimum. This point can be calculated using the Dekker's algorithm ( [6] ). It should be noted that the minimum of the mixing function is not expected to coincide with the point of intersection of the probability densities of each individual subpopulation; -Method 3 (M3): It imposes a threshold in the the so-called conditional classification curves ( [29] ). Under the assumption that all components but the first one refer to seropositive individuals, the conditional classification curve for the i-th individual given the antibody level Z i = x is defined as

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint

In turn, the classification curve of seronegative individuals is simply given by

After calculating these curves, one can impose a minimum value for the classification of each individual. In this case, two cutoff values arise in the antibody distribution, one for the seronegative individuals and another for seropositive individuals. Mathematically, the classification rule is given as follows

where c − and c + are the cutoff values in the antibody distribution that ensure a minimum classification probability, say 90%. To calculate these cutoff values in practice, one can use the bisection method providing an initial interval where they might be located ( [29] ).

In order to evaluate the performance of each of the cutoff points, we estimated the respective sensitivity and specificity. Let D and D * be the true and estimated serological classification (or infection status), respectively. Sensitivity is defined as the conditional probability

In turn, the specificity is defined as

The overall performance of each method is given by the accuracy (ACC) of the proposed method which corresponds to the proportion of correct results, that is,

We performed a small simulation study to assess the performance of cutoff points proposed by each method. With this purpose, we assume two simulation scenarios regarding the mixture model assumed for the data: (i) a mixture model based on the Skew-Normal distributions and (ii) a mixture model based on the Skew-t distribution.

For each scenario, we simulated 1000 samples with dimensions 100, 500 and 1000. In addition, for each simulation cycle, the weight of the mixture model was varied to check the ability of the model to identify the seropositive component . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint even when the weight assigned to that component is very low. The implications of varying the weight of the seronegative and seropositive population are as follows: in the case where the proportion of seronegative individuals is very high relative to seropositive individuals, more effective decisions can be made to control the number of infections in the population. The opposite scenario is important in the case of effectiveness of vaccination in the population, particularly for individuals who may have lost immunity. To this end, it was considered that the proportion of seronegative individuals could take the values 90%, 60% and 30%, being the respective proportion of seropositive individuals 10%, 40% and 70%, respectively. For each simulated sample, the parameters of the mixture model were estimated by maximum likelihood (via the EM algorithm) according to the distributional scenarios described above, as well as the respective cutoff points according to the methods M1, M2 and M3. Considering θ * the estimated parameter, θ the true value of the parameter, than we calculate the relative error that is 1

We used the package mixsmsn to fit different mixture models based on SMSN ( [24] ). In particular, we used the function smsn.mix to estimate the model parameter via the EM algorithm For fitting the Student's t-distribution, we considered the R package extraDistr ( [34] ), namely, the function dlst to calculate their density, the function plst to define the cumulative distribution function and the function rlst to generate random samples in the simulation study. The fitting of the Skew-Normal distributions was performed with the package sn ( [2] ). The functions dsn, psn and rsn were used to calculate the probability density function, the cumulative distribution function and generate random samples of the Skew-Normal distribution, respectively. In the case of the Skew-t distribution, the functions dst, pst and rst were used to calculate the probability density function, the cumulative distribution function and generate random samples, respectively.

For this study, data relating to 549 individuals was analysed. Serum samples were collected from individuals with confirmed SARS-CoV-2 infection by PCR test in four hospital units from Paris, namely: 4 (0.7%) from the Hôpital Bichat, 49 (9.0%) from the Hôpital Cochin and 161 (29.3%) from the Nouvel Hôpital (Strasbourg). Regarding the negative controls, 68 (12.4%) are from the Thai Red Cross (TRC), 90 (16.4%) from the Peruvian donors (NHP) and 177 (32.2%) from the France blood donors (Établissement Français du Sang). For each antigen under analysis, the logarithmic transformation of base 10 was considered for the concentration of antibodies against that antigen.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (Figure 1 ). Such differences were expected given the general knowledge about the infection status, i.e., individuals who have already been exposed to the virus will have a higher concentration of antibodies than those who are still susceptible.

We performed the fitting of the different mixture models considering two subpopulations, i.e., a seronegative population and a seropositive population. According to the BIC values, the model based on the Skew-Normal distribution was considered for the following antigens: RBD (BIC=852.25), S1 (BIC=561.63), S2 (BIC=775.29). For the case of the Stri antigen, the best model was found to be the Skew-t distribution (BIC=915.82) (Figure 2 and table 2 ). As has been observed in previous studies, there is a marked skew to the right of the data for the seronegative population and a skewed for the left in the seropositive population, although not very marked for the S1 (α S1 = 1.062) and S2 (α S2 = 0.450)

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint antigens (Table 1) . Antibody distribution for S1 antigen. C. Antibody distribution for S2 antigen. D. Antibody distribution for S tri antigen. Antibody concentration in x axis is given in log 10 units. 

After defining the model that best fits the data, we proceeded to categorize the amount of antibodies for each antigen by estimating the cutoff point. For this, we used the methods M1, M2 and M3 already described and whose results are shown in Figure 1 and table 2.

Estimation of the cutoff point based on the minimum densities of the mixture model (M2) proved to be the method with the highest sensitivity for classifying seropositive individuals, as well as the one that produces the highest proportion of correct results (accuracy) for the RBD antigen (cutof f = 2.49, sens = . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint 86.45%, ACC = 92.89%), S1 (cutof f = 2.27, sens = 71.03%, ACC = 86.89%) and S2 (cutof f = 2.39, sens = 83.64%, ACC = 90.89%). In the case of the S tri antigen, it was not possible to calculate the sensitivity and accuracy of the method based on the 99.9%-quantile (M1), given the high values that the quantile assumes leading to the seropositive population being fully absorbed by it. Thus, for comparison purposes, the application of each methods to the Skew-Normal distribution was considered, again verifying that the method based on the minimum densities of the mixture model produces the highest sensitivity (cutof f = 2.46, sens = 90.19%). However, for this antigen, the method with the highest accuracy is based on the conditional probability (set at 90%) of classifying an individual as being seropositive (ACC=93.44%) (Figure 3 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 21, 2021. ; https://doi.org /10.1101 /10. /2021 In order to evaluate the quality of methods M1, M2 and M3, the optimal cutoff point was estimated using the ROC curve. This is possible since the true infection status of the individuals is known. It is interesting to see that in terms of specificity and accuracy the results are similar to the method that is traditionally used (ROC curve). However, it is possible to observe a poor performance of the M1 method with regard to its sensitivity. (Figure 3 , table 2 and table 3) . 

To conduct the simulation study, two scenarios were considered: the first consists of the scenario where the model that best fits the data is a Skew-Normal distribution, and the second where the model that best fits the data is a Skew-t distribution. For this purpose, the results for the RBD antigen (Skew-Normal distribution) and the S tri antigen (Skew-t distribution) were selected. For each scenario the sample size was varied, as well as the proportion of seronegative individuals in the population. The results are shown in table 4 and table 5 .

In general it is found that as the sample size increases, both the relative error and the root mean square error tend to decrease. It is also found that for small samples and extreme π 1 values (π 1 = 0.3 or π 1 = 0.9), the models tend to have some difficulty in identifying a seronegative and seropositive population. This is a result that alerts to the existence of possible false positives and false negatives in the case of small samples. In situations where there is an ongoing vaccination plan and therefore the majority of the population is seropositive (e.g. π 1 = 0.9) it is important to know if it is possible to identify seronegative individuals in this population given the time of immunization. If the timing of immunization is short, it is important to identify these individuals early in order to take action and prevent a further increase in infections.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint Table 4 : Relative bias and Mean Square Error (MSE) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2) and conditional probability method (M3) for the RBD antigen. opt M 1 denotes the theoretical cutoff point for the 99.9%-quantile; opt M 2 denotes the theoretical cutoff point for the minimum of the density mixture method; opt M 3 denotes the theoretical cutoff point for conditional probability method. π 1 denotes the weight of the seronegative population; c M 1 denotes the cutoff estimated by M1 method after N=1000 simulations; c M 2 denotes the cutoff estimated by M2 method after N=1000 simulations; c M 3 denotes the cutoff estimated by M3 method after N=1000 simulations. 

The purpose of this study was to use a flexible class of mixture models to antibody data against the SARS-CoV-2 virus. In particular, we used a class of models that allows capturing the skewness present in this type of data, namely . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint Table 5 : Relative bias and Mean Square Error (MSE) of the 99.9%-quantile method (M1); minimum of mixture densities method (M2) and conditional probability method (M3) for the S tri antigen. opt M 1 denotes the theoretical cutoff point for the 99.9%-quantile; opt M 2 denotes the theoretical cutoff point for the minimum of the density mixture method; opt M 3 denotes the theoretical cutoff point for conditional probability method. π 1 denotes the weight of the seronegative population; c M 1 denotes the cutoff estimated by M1 method after N=1000 simulations; c M 2 denotes the cutoff estimated by M2 method after N=1000 simulations; c M 3 denotes the cutoff estimated by M3 method after N=1000 simulations. the Skew-Normal and Skew-t distributions.

It has become clear that diagnostic tests play a key role in the early identification of infected individuals, allowing us to act to control a pandemic by isolating and tracing the contacts of an infected person. Diagnostic tests can . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint classify an individual as seronegative or seropositive by defining a cutoff point that can take on different values depending on the technique used by the manufacturer to develop the test. Most of the time, this cutoff point is relaxed and is calculated using the 3σ-rule, which assumes that the underlying distribution of the data is Normal. However, as we have seen in our application, this assumption cannot always be made, making this method unfeasible.

Note that this study has the advantage that the true cases and controls of the infection are known, allowing us to compare different methods for obtaining the cutoff point that allows classifying an individual as seropositive.

In [9] , three methods for obtaining the cutoff point had been presented that could not yet be validated because the true infection status of the individuals was not known. In this sense, we proceeded to use these methods in this study, and it was verified that the three methods under analysis present high accuracy, compared to methods used in literature, namely through the empirical ROC curve. However, the proposed methods proved to be more specific than sensitive. Note that the performance of the method based on the 99.9% probability quantile may be overestimated, especially when the fitted distribution corresponds to a heavy-tailed distribution (such as the Skew-t distribution). This is because the calculation of this quantile involves only and exclusively the population of seropositive individuals, so that if the distribution is too skewed to the right, then the seropositive population is totally absorbed by this quantile.

When a new virus is present in the population, there is a natural tendency for the proportion of susceptible individuals to be much higher than the seropositive individuals. This is the phase in which early identification of the infected people is essential for pandemic control, although total control of the spread of the virus only occurs when there is vaccination or eradication of the virus. In this sense, with the simulation study developed in this work, we intend to analyze the pandemic evolution scenarios and understand the behavior of different methods for determining the cutoff point. It was found that as the sample size increases, there is a tendency for the relative error and the mean square error of the cutoff point estimates in skewed distributions to decrease, while this tendency is not linear in the case of the usual symmetric distributions (Normal and Student t). This fact may be due to the fact that symmetrical distributions are not the most appropriate for these types of data, or even that the proposed methods should not be used when considering the usual distributions.

As we expected, for small sample sizes and for large imbalances in the serological populations, the proposed models were found to have problems in identifying two components. Note that in the case of skewed distributions, it will be natural that if the weight of the seronegative population is very high, then observations relating to the seropositive population are considered false negatives and false . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 21, 2021. ; https://doi.org/10.1101/2021.12.20.21268100 doi: medRxiv preprint positives otherwise.

A limitation of this study is the fact that the adjustment of the different mixture models was performed using the same distribution for the two components (through the package mixsmsn). If the components of the mixture model were distinct, this would have a direct implication on the estimated cutoff points. However, the package that would allow this analysis is now discontinued.

In conclusion, we recommend the use of mixture models based on distributions of the SMSN family for the analysis of serological data given the flexibility of these models, as well as the use of the proposed methods for determining cutoff points as an alternative to the method based on the 3σ rule.

The skew-normal and related families

The skew-normal and related distributions such as the skew-t

Robust mixture modelling based on scale mixtures of skew-normal distributions

Mixture model analysis for establishing a diagnostic cut-off point for pertussis antibody levels

Optimal cutoff and accuracy of an igm enzyme-linked immunosorbent assay for diagnosis of acute scrub typhus in northern thailand: an alternative reference method to the igm immunofluorescence assay

Algorithms for minimization without derivatives

Maximum likelihood estimation from incomplete data via the em algorithm

A statistical analysis of serological data from the uk myalgic encephalomyelitis/chronic fatigue syndrome biobank

Analysis of antibody data using finite mixture models based on scale mixtures of skew-normal distributions

Analysis of serological surveys using mixture models: application to a survey of parvovirus b19

On determining the most appropriate test cut-off value: the case of tests with continuous results

Determination of the accuracy and optimal cut-off point for elisa test in diagnosis of human brucellosis in iran

Is it appropriate to use fixed assay cut-offs for estimating seroprevalence?

Estimating sars-cov-2 seroprevalence and epidemiological parameters with uncertainty from serological surveys. medRxiv

A simple multithreaded implementation of the em algorithm for mixture models

Finite mixture modelling using the skew-normal distribution

The ecme algorithm: a simple extension of em and ecm with faster monotone convergence

Application of a noninvasive oral fluid test for detection of treponemal igg in a predominantly hiv-infected population

The em algorithm and extensions

Maximum likelihood estimation via the ecm algorithm: A general framework

Defining seropositivity thresholds for use in trachoma elimination studies

Use of mixture models in determining laboratory criterion for identification of seropositive individuals: application to parvovirus b19 serology

The inconsistency of "optimal" cut-points using two roc based criteria

Fitting finite mixture of scale mixture of skewnormal distributions

Determination of the optimal cutoff value for a serological assay: an example using the johne's absorbed eia

Multiplex assays for the identification of serological signatures of sars-cov-2 infection: an antibody-based diagnostic and machine learning study

Finding the optimal cut-point for gaussian and gamma distributed biomarkers

The validity of diagnostic cut-offs for commercial and in-house scrub typhus igm and igg elisas: A review of the evidence

Current mathematical models for analyzing anti-malarial antibody data with an eye to malaria elimination and eradication

Detection and modeling of anti-leptospira igg prevalence in cats from lisbon area and its correlation to retroviral infections, lifestyle, clinical and hematologic changes

Guessous, I.: Seroprevalence of anti-sars-cov-2 igg antibodies in geneva, switzerland (serocov-pop): a population-based study

Application of a mixture model for determining the cutoff threshold for activity in high-throughput screening

Defining an optimal cut-point value in roc analysis: an alternative approach. Computational and mathematical methods in medicine

Additional univariate and multivariate distributions