title: Accuracy of four lateral flow immunoassays for anti-SARS-CoV-2 antibodies: a head-to-head comparative study
authors: Jones, Hayley E; Mulchandani, Ranya; Taylor-Phillips, Sian; Ades, A E; Shute, Justin; Perry, Keith R; Chandra, Nastassya L; Brooks, Tim; Charlett, Andre; Hickman, Matthew; Oliver, Isabel; Kaptoge, Stephen; Danesh, John; Di Angelantonio, Emanuele; Wyllie, David
date: 2021-06-04
journal: EBioMedicine
DOI: 10.1016/j.ebiom.2021.103414

BACKGROUND: SARS-CoV-2 antibody tests are used for population surveillance and might have a future role in individual risk assessment. Lateral flow immunoassays (LFIAs) can deliver results rapidly and at scale, but have widely varying accuracy.

METHODS: In a laboratory setting, we performed head-to-head comparisons of four LFIAs: the Rapid Test Consortium's AbC-19™ Rapid Test, OrientGene COVID IgG/IgM Rapid Test Cassette, SureScreen COVID-19 Rapid Test Cassette, and Biomerica COVID-19 IgG/IgM Rapid Test. We analysed blood samples from 2,847 key workers and 1,995 pre-pandemic blood donors with all four devices.

FINDINGS: We observed a clear trade-off between sensitivity and specificity: the IgG band of the SureScreen device and the AbC-19™ device had higher specificities, but OrientGene and Biomerica higher sensitivities. Based on analysis of pre-pandemic samples, the SureScreen IgG band had the highest specificity (98.9%, 95% confidence interval 98.3 to 99.3%), which translated to the highest positive predictive value across any pre-test probability: for example, 95.1% (95% uncertainty interval 92.6, 96.8%) at 20% pre-test probability. All four devices showed higher sensitivity at higher antibody concentrations ("spectrum effects"), but the extent of this varied by device.
INTERPRETATION: The estimates of sensitivity and specificity can be used to adjust for test error rates when using these devices to estimate antibody prevalence. If the tests were used to determine whether an individual has SARS-CoV-2 antibodies, in an example scenario in which 20% of individuals have antibodies, we estimate that around 5% of positive results on the most specific device would be false positives.

FUNDING: Public Health England.

Tests for SARS-CoV-2 antibodies are used for population serosurveillance [1,2] and could in future be used for post-vaccination seroepidemiology. Given evidence that antibodies are associated with reduced risk of COVID-19 disease [3-7], antibody tests might also have a role in individual risk assessment [8], pending improved understanding of the mechanisms and longevity of immunity. Both uses require understanding of test sensitivity and specificity: these can be used to adjust seroprevalence estimates for test errors [9], while any test used for individual risk assessment would need to be shown to be sufficiently accurate, and in particular highly specific [10,11]. A number of laboratory-based immunoassays and lateral flow immunoassays (LFIAs) are now available, which detect IgG and/or IgM responses to the spike or nucleoprotein antigens [12-14]. Following infection with SARS-CoV-2, most individuals generate antibodies against both of these antigens [15]. Existing efficacious recombinant vaccines contain the spike antigen [16]; vaccinated individuals therefore generate a response to this antigen only. LFIAs are small devices which produce results rapidly, without the need for a laboratory, and therefore have the potential to be deployed at scale. A Cochrane review identified 38 studies evaluating LFIAs for SARS-CoV-2 antibodies by late April 2020. However, results from most studies were judged to be at high risk of bias, and very few studies directly compared multiple devices [14].
Where direct comparisons have been performed, they have shown that the accuracy of LFIAs varies widely across devices [13,17-20]. A key limitation of most studies is that sensitivity has been estimated only from individuals who previously had a positive PCR test. In a recent evaluation of one LFIA, the UK Rapid Test Consortium's "AbC-19™ Rapid Test" (AbC-19 hereafter), we found evidence that this can over-estimate sensitivity [21]. We attributed this to PCR-confirmed cases tending to be more severe, particularly early in the pandemic, when access to testing was very limited. Since more severe disease is associated with increased antibody concentrations [22-24], which may be easier to detect, estimates of test sensitivity based only on previously PCR-confirmed cases are susceptible to "spectrum bias" [25,26]. In this paper, we present a head-to-head comparison of the accuracy of AbC-19 and three other LFIAs, based on a large number (n = 4,842) of blood samples. The three additional devices were OrientGene "COVID IgG/IgM Rapid Test Cassette", SureScreen "COVID-19 Rapid Test Cassette", and Biomerica "COVID-19 IgG/IgM Rapid Test", hereafter referred to as OrientGene, SureScreen and Biomerica for brevity. We analysed blood samples from 2,847 key workers participating in the EDSAB-HOME study and 1,995 pre-pandemic blood donors from the COMPARE study [27], in a laboratory setting. All samples were from distinct individuals. We evaluated each device using two approaches. First (Approach 1), we compared LFIA results with the known previous infection status of pre-pandemic blood donors ("known negatives") and the 268 EDSAB-HOME participants who reported previous PCR positivity ("known positives"). Second (Approach 2), we compared LFIA results with results on two sensitive laboratory immunoassays in EDSAB-HOME participants. Both approaches were pre-specified in our protocol (available at http://www.isrctn.com/ISRCTN56609224).
We have previously reported the accuracy of the AbC-19 device based on the same sample set and overall approaches: these results are reproduced here for comparative purposes [21]. Following this previous work, and in particular the spectrum effects observed [21], we anticipated that estimates of sensitivity based on comparison with a laboratory immunoassay (Approach 2) and estimates of specificity based on pre-pandemic sera (Approach 1) would be the least susceptible to bias.

Devices (Table S3): Only three studies compared the accuracy of two or more of the four devices. With the exception of our previous report on the accuracy of the AbC-19™ device, which the current manuscript builds upon, sample sizes ranged from 7 to 684. For details, see Supplementary Materials (Figure S1, Tables S1, S2). The largest study compared OrientGene, SureScreen and Biomerica. SureScreen was estimated to have the highest specificity (99.8%, 95% CI 98.9 to 100%) and OrientGene the highest sensitivity (92.6%), but with uncertainty about the latter result due to small sample sizes. The other two comparative studies were small (n = 65, n = 67) and therefore provide very uncertain results. We previously observed spectrum effects for the AbC-19™ device, such that sensitivity is upwardly biased if estimated only from PCR-confirmed cases. The vast majority of previous studies estimated sensitivity in this way. We performed a large-scale (n = 4,842), head-to-head laboratory-based evaluation and comparison of four lateral flow devices, which were selected for evaluation by the UK Department of Health and Social Care's New Tests Advisory Group on the basis of a survey of available test and performance data. We evaluated the accuracy of diagnosis based on both IgG and IgM bands, and on the IgG band alone. We found a clear trade-off between sensitivity and specificity across devices, with the SureScreen and AbC-19™ devices being more specific and OrientGene and Biomerica more sensitive.
Based on analysis of 1,995 pre-pandemic blood samples, we are 99% confident that SureScreen (IgG band reading) has the highest specificity of the four devices (98.9%, 95% CI 98.3, 99.3%). By including individuals without PCR confirmation, and exploring the relationship between laboratory immunoassay antibody index and LFIA positivity, we were able to explore spectrum effects. We found evidence that all four devices have reduced sensitivity at lower antibody indices. However, the extent of this varies by device and appears to be less for the other devices than for AbC-19. Our estimates of sensitivity and specificity are likely to be higher than would be observed in real use of these devices, as they were based on majority readings of three trained laboratory personnel. When used in epidemiological studies of antibody prevalence, the estimates of sensitivity and specificity provided in this study can be used to adjust for test errors. Increased precision in error rates will translate to increased precision in seroprevalence estimates. If lateral flow devices were used for individual risk assessment, devices with maximum specificity would be preferable. However, if, for example, 20% of the tested population had antibodies, we estimate that around 1 in 20 positive results on the most specific device would be incorrect.

Unlike AbC-19, the other three devices contain separate bands representing detection of IgG and IgM. We report results for these by two different scoring strategies: (i) "one band", in which we considered a result to be positive only if the IgG band was positive; and (ii) "two band", in which we considered results to be positive if either band was positive. In statistical analysis, these two readings were treated as separate "tests", such that our comparison was of seven tests in total. By definition, the "two band" reading of each device has sensitivity greater than or equal to, but specificity less than or equal to, the "one band" reading.
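The panel above notes that these sensitivity and specificity estimates can be used to adjust seroprevalence estimates for test errors. A common form of this adjustment is the Rogan-Gladen correction; the sketch below is our gloss (in Python rather than the R/Stata used in the study), with assumed illustrative values, not the authors' stated implementation:

```python
def rogan_gladen(apparent_prev, sens, spec):
    """Adjust an observed test-positive proportion for known test error rates."""
    adjusted = (apparent_prev + spec - 1) / (sens + spec - 1)
    return min(max(adjusted, 0.0), 1.0)  # clip to the valid [0, 1] range

# Illustrative only: 18% of samples test positive on a device with assumed
# sensitivity 85% and specificity 98.9%; the adjusted prevalence is ~20.1%
adjusted = rogan_gladen(0.18, 0.85, 0.989)
```

The clipping step matters in practice: when the apparent prevalence falls below the false-positive rate, the raw estimator goes negative and is conventionally truncated at zero.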
The two sets of study participants have been described in full previously [21]. A flow diagram is provided (Fig. 1). EDSAB-HOME (ISRCTN56609224) was a prospective study designed to assess the accuracy of LFIAs in key workers in England [24]. The research protocol is available at http://www.isrctn.com/ISRCTN56609224. Participants were convenience samples, recruited through their workplaces in three recruitment streams. Individuals in Streams A (fire and police officers: n = 1,147) and B (healthcare workers: n = 1,546) were recruited without regard to previous SARS-CoV-2 infection status. Stream C (n = 154) consisted of additional healthcare workers who were recruited based on self-reported previous PCR positivity. Symptom history was not part of the eligibility criteria. During June 2020, all participants (n = 2,847) completed an online questionnaire and had a venous blood sample taken at a study clinic. Sample size considerations, eligibility criteria, the recruitment process and demographic characteristics (Table S4) are described in the Supplementary Materials. Detailed information on participants from the study questionnaire (including symptom history, testing history and household exposure) is provided elsewhere [24]. In addition to the Stream C individuals, some Stream A/B individuals (n = 114) also self-reported previous PCR positivity. All self-reported PCR results were later validated by comparison with national laboratory records. We refer to the total (n = 268) individuals with a previous PCR positive result as "known positives" and to the remaining n = 2,579 EDSAB-HOME participants as "individuals with unknown previous infection status" at clinic visit. Twelve of 268 known positives reported having experienced no symptoms. Of the known positives reporting symptoms, the median (interquartile range) number of days between symptom onset and study clinic was 63 (52 to 75).
COMPARE (ISRCTN90871183) was a 2016-2017 blood donor cohort study in England [27]. We performed stratified random sampling by age, sex and region to select 2,000 participants, of whom 1,995 had samples available for analysis. We refer to these samples as "known negatives". All tests were performed by experienced laboratory staff at PHE Colindale, London. All EDSAB-HOME samples were first tested with two laboratory immunoassays: Roche Elecsys®, which measures total (including IgG and IgM) antibodies against the nucleoprotein, and the EuroImmun Anti-SARS-CoV-2 ELISA assay, which measures IgG antibodies against the S protein S1 domain. Any immunoassay failing for technical reasons was repeated. Lateral flow devices were stored in a temperature-controlled room (thresholds 16-30°C, actuals from continuous monitoring system 19-20°C). Laboratory staff reading the LFIAs received on-site training from Abingdon Health (the manufacturers of the AbC-19 device), SureScreen and Biomerica. Laboratory staff discussed the evaluation, including use of the LFIA device, with OrientGene on a call. The manufacturers' instructions for use were followed, with plasma being pipetted into the devices, followed by the chase buffer supplied with the kits. The first 350 COMPARE samples were interspersed randomly among EDSAB-HOME samples, with the remaining 1,650 COMPARE samples being analysed later. Each device was independently read by three members of staff. Readers were blind to demographic and clinical information on participants and to results on any previous assays. Readers scored LFIA test bands using the WHO scoring system for subjectively read assays: 0 ("negative"), 1 ("very weak but definitely reactive"), 2 ("medium to strong reactivity") or 7 ("invalid") [28]. As this scoring system does not clearly state how to categorise "weak" bands, our readers used a score of 1 for what they considered to be either "weak" or "very weak". The majority score was taken as the consensus reading.
For assessment of test sensitivity and specificity, scores of 1 and 2 were grouped as "positive". If any band of a device was assigned a consensus score of 7 (invalid), the sample was retested and the re-test results taken as primary. We report numbers and proportions of invalid bands, and the total number and proportion of all devices with at least one invalid band. We re-tested samples when LFIAs made apparent errors, on the following basis: for all four devices, we re-tested EDSAB-HOME samples if the result differed from the immunoassay composite reference standard of "positive on either Roche Elecsys® or EuroImmun, versus negative on both". For three of the four devices, we also re-tested all COMPARE samples that incorrectly tested positive. For Biomerica, due to a lack of devices and a high observed false positive rate, only false positives in the first batch of 350 samples were re-tested. Any re-test results are reported as secondary. Two checks were made before pipetting to ensure that samples were in the correct position. Once results were available, an initial check was made to ensure that no obvious mistakes had been made by readers on the laboratory scoring sheet (i.e. that scoring was for the correct sample). Data were then manually entered into a spreadsheet for each test run and every result checked against the primary data (laboratory scoring sheet). Results were transferred from each test run spreadsheet to the main results sheet, and checked for correct alignment.

Estimation of sensitivity, specificity, positive and negative predictive values

Approach 1: We estimated LFIA accuracy through comparison of results with the known previous SARS-CoV-2 infection status of individuals. Specificity was estimated from all 1,995 "known negative" samples. The association of false positivity with age, sex and ethnicity was also explored. We estimated sensitivity from the 268 "known positive" EDSAB-HOME samples.
Numbers of false negatives are also reported by time since symptom onset and separately for asymptomatic individuals.

Approach 2: We estimated LFIA accuracy through comparison with results on the Roche Elecsys® laboratory immunoassay in EDSAB-HOME samples. This was selected as the primary laboratory reference standard on the basis that it was the assay available to us with the highest published accuracy for detection of recent SARS-CoV-2 infection at the time of sample collection, as per our protocol. We used the manufacturer-recommended positivity threshold of 1.0. At this threshold, the assay has been estimated to have sensitivity of 97.2% (95% CI 95.4, 98.4%) and specificity of 99.8% (99.3, 100%) for previous infection [12]. As sensitivity analyses, we also report accuracy estimates based on comparison with EuroImmun and with a composite reference standard of "positive on either laboratory assay versus negative on both". We treated EuroImmun results as positive if they were greater than or equal to the manufacturer "borderline" threshold of 0.8 [24]. In Approach 2, estimates of sensitivity were calculated separately for known positives and individuals with unknown previous infection status, to assess for potential spectrum bias [21]. Specificity was estimated from reference standard negative individuals among the "unknown previous infection status" population. As EDSAB-HOME Streams A and B comprise a "one gate" population [21,29], we also report Approach 2 results from all EDSAB-HOME Streams A and B participants combined, regardless of previous PCR positivity.

Positive and negative predictive values: We estimated the positive and negative predictive value (PPV and NPV) for example scenarios of 10%, 20% and 30% pre-test probability. To calculate these, we used estimates of specificity based on pre-pandemic sera (Approach 1) and sensitivity based on comparison with Roche Elecsys® in individuals with unknown previous infection status (Approach 2).
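The PPV and NPV calculations described here follow directly from Bayes' theorem. A minimal sketch (in Python rather than the R/Stata used for the analysis; the 85% sensitivity figure is an assumed illustrative value, while the 98.9% specificity is the Approach 1 SureScreen IgG estimate quoted in this paper):

```python
def predictive_values(sens, spec, pretest):
    """PPV and NPV from sensitivity, specificity and pre-test probability."""
    tp = sens * pretest              # true positives
    fp = (1 - spec) * (1 - pretest)  # false positives
    tn = spec * (1 - pretest)        # true negatives
    fn = (1 - sens) * pretest        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# At 20% pre-test probability, ~99% specificity keeps the PPV near 95%,
# i.e. roughly 1 in 20 positive results would be false positives.
ppv, npv = predictive_values(0.85, 0.989, 0.20)
```

This makes the trade-off reported in the Findings concrete: the PPV is driven almost entirely by specificity at low pre-test probabilities, while the NPV is driven by sensitivity.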
As noted above, we anticipated that these estimates of the respective parameters would be the least susceptible to bias. Statistical analysis was performed in R 4.0.3 and Stata 15. Sensitivity and specificity were estimated as observed proportions based on each reference standard, with 95% CIs computed using Wilson's method. Logistic regressions with age, sex and ethnicity as covariates were used to explore potential associations with false positivity. To further explore potential associations with age, we also fitted fractional polynomials and plotted the best-fitting functional form for each test. In comparing the sensitivity and specificity of the seven "tests", we used generalised estimating equations (GEE) to account for conditional correlations among results [30]. For example, in Approach 1 we fitted separate GEE regressions, with test as a covariate, to the "known positives" and to the "known negatives". We used independence working covariance matrices [30]. We obtained 95% uncertainty intervals (UIs) around PPVs, NPVs, differences in sensitivity and differences in specificity using Monte Carlo simulation. This is a commonly used approach for propagating uncertainty in functions of parameters, used frequently, for example, in decision modelling [31]. We sampled one million iterations from a multivariate normal distribution (using the R function "mvnorm") for each set of GEE regression coefficients, using the parameter estimates and robust variance-covariance matrix. We calculated each function of parameters of interest (e.g. the PPV at 20% prevalence) at each iteration. We report the median value across iterations as the parameter estimate and the 2.5th and 97.5th percentiles as 95% uncertainty intervals. We also present ranks (from 1 to 7) for each set of sensitivity, specificity and PPV estimates. These were similarly computed at each iteration of the Monte Carlo simulations, and summarised by medians and 2.5th and 97.5th percentiles across simulations [32].
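The Monte Carlo uncertainty propagation described above can be sketched as follows. The study sampled GEE coefficient vectors from a multivariate normal distribution with the robust variance-covariance matrix; this simplified sketch (Python, with assumed illustrative point estimates and standard errors on the logit scale) samples sensitivity and specificity independently, which ignores their covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def ppv(sens, spec, pretest):
    return sens * pretest / (sens * pretest + (1 - spec) * (1 - pretest))

# Assumed illustrative point estimates and standard errors on the logit scale
sens_hat, se_sens = logit(0.85), 0.15
spec_hat, se_spec = logit(0.989), 0.20

# One million iterations, as in the study; back-transform each draw to a
# probability, then compute the function of interest at each iteration
n_iter = 1_000_000
sens_draws = expit(rng.normal(sens_hat, se_sens, n_iter))
spec_draws = expit(rng.normal(spec_hat, se_spec, n_iter))
ppv_draws = ppv(sens_draws, spec_draws, 0.20)

estimate = np.median(ppv_draws)                   # reported point estimate
lo, hi = np.percentile(ppv_draws, [2.5, 97.5])    # 95% uncertainty interval
```

Ranks across the seven "tests" would be computed the same way: rank the seven PPV draws at each iteration, then summarise each test's rank distribution across iterations.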
We further report the proportion of simulations for which each test was ranked first, i.e. the probability that the test is the "best" with regard to each measure. Within the Approach 2 analysis of Roche Elecsys® positives, we report the absolute difference between sensitivity estimated from PCR-confirmed cases and sensitivity estimated from individuals with unknown previous infection status, with 95% UI. Among individuals who were positive on Roche Elecsys®, we also examined the relationship between the amount of antibody present and the likelihood of lateral flow test positivity. We categorised anti-nucleoprotein (Roche Elecsys®) and, separately, anti-S1 (EuroImmun) antibody indices into bins containing similar numbers of samples and calculated the observed sensitivity (i.e. the proportion of positive results) with 95% CI for each LFIA in each bin. To aid visual assessment of the relationship between antibody index and sensitivity, we also plotted exploratory dose-response curves. The shape of the fitted curve was selected using the Akaike information criterion, using the drc package in R [33]. The EDSAB-HOME study was approved by an NHS research ethics committee. The study was commissioned by the UK Government's Department of Health and Social Care (DHSC), and was funded and implemented by Public Health England, supported by the NIHR Clinical Research Network Portfolio. The DHSC had no role in the study design, data collection, analysis, interpretation of results, writing of the manuscript, or the decision to publish.

Figs. 2 and 3 show results from Approach 1 and from the Approach 2 analysis of individuals with unknown previous infection status at clinic visit, which are mutually exclusive and exhaustive subsets of the 4,842 samples. Also shown are results from the Approach 2 sensitivity analyses with alternative reference standards. Estimated differences between the sensitivity and specificity of tests, with 95% UIs, are shown in Tables S5 and S6.
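The binning of antibody indices into groups of similar size, described in the Methods above, can be sketched as follows (Python with NumPy rather than the study's R code; function and variable names are our own):

```python
import numpy as np

def binned_sensitivity(antibody_index, lfia_positive, n_bins=5):
    """Split immunoassay-positive samples into bins of similar size by
    antibody index and compute the observed LFIA sensitivity per bin."""
    antibody_index = np.asarray(antibody_index, dtype=float)
    lfia_positive = np.asarray(lfia_positive, dtype=float)
    # Quantile-based edges give bins containing similar numbers of samples
    edges = np.quantile(antibody_index, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(antibody_index, edges[1:-1])  # bin labels 0..n_bins-1
    return [(edges[i], edges[i + 1], lfia_positive[bins == i].mean())
            for i in range(n_bins)]
```

In the study, each binned proportion was reported with a 95% CI, and the exploratory dose-response curves were fitted separately with the drc package in R.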
Both approaches show a clear trade-off between sensitivity and specificity, with SureScreen 1 band and AbC-19 having higher specificities but lower sensitivities, while OrientGene and Biomerica have higher sensitivities but lower specificities. From Approach 1, SureScreen 1 band is estimated to have higher specificity but lower sensitivity than AbC-19, whereas the two tests appeared comparable (although with all point estimates marginally favouring SureScreen) from Approach 2. As a result, we estimate the one band reading of the SureScreen device to have the highest PPV. Approach 1 estimates are shown in Tables 1 (specificity) and 2 (sensitivity). The SureScreen 1 band reading was estimated to have 98.9% specificity (95% CI 98.3, 99.3%), with high certainty (99%) of this being the highest. This is 1.0% (95% UI 0.2, 1.8%) higher than the specificity of AbC-19 (Table S6), which was ranked 2nd (95% UI 2nd, 4th). There was no strong evidence of any association between false positivity and age for any device (Table S7), although there was some indication that Biomerica 1 band specificity might decline in older adults (Figure S2). With the exception of an apparent association of false positivity of the AbC-19 device with sex, which we have reported previously [21], there was no indication of specificity varying by sex or ethnicity (Table S8). SureScreen 1 band was, however, estimated to have the lowest sensitivity when this was estimated from PCR-confirmed cases only (Table 2: 88.8%, 95% CI 84.5, 92.0%), 3.7% (95% UI 0.5, 7.1%) lower than AbC-19 (Table S5). Among the 268 "known positives", nine were negative on Roche Elecsys®. Removing these from the denominator slightly increased point estimates of sensitivity (Table 3), but had no notable impact on rankings. Among the 2,579 individuals with unknown previous infection status, 354 were positive on Roche Elecsys®.
Point estimates of sensitivity were lower for all seven tests in this population than among known positives (see below). In this population, there was evidence that both the OrientGene and Biomerica devices have higher sensitivity than SureScreen or AbC-19 (Table S5). There was no evidence of a difference between the sensitivity of SureScreen and AbC-19 (absolute difference in favour of SureScreen = 0.8%, 95% UI -2.2, 3.9%). Increases in sensitivity in the 2 band versus 1 band reading of the OrientGene and Biomerica devices were minimal. Based on the 2,225 individuals with unknown previous infection status who were negative on Roche Elecsys®, specificity estimates were very similar to those from Approach 1 for SureScreen and Biomerica, but around 1% higher for AbC-19 and OrientGene (Table 3). The ranking of devices was consistent across the two approaches, but the observed difference in specificity between SureScreen and AbC-19 was much reduced in Approach 2 (difference = 0.1%, 95% UI -0.4 to 0.6%, Table S6). Figs. 2, 3 and Tables S9, S10 show results from sensitivity analyses on the 2,579 samples from individuals with unknown previous infection status. When EuroImmun was taken as the reference standard, estimates of specificity were robust, while sensitivity appeared slightly higher for AbC-19, OrientGene and SureScreen, but slightly lower for Biomerica, although with overlapping CIs. All devices were estimated to have slightly lower sensitivity when evaluated against the composite reference standard. OrientGene was ranked highest for sensitivity across all three immunoassay reference standards, but with Biomerica appearing a close contender when evaluated against Roche Elecsys®. Table S11 shows sensitivity and specificity estimated from all EDSAB-HOME Streams A and B ("one gate" study), based on comparison with each of the three immunoassay reference standards. Rankings of devices were quite robust to inclusion of PCR-confirmed cases.
Re-test results are shown in Table S12. Based on the sets of estimates that we consider least susceptible to bias (see Methods), we are 99% confident that the SureScreen 1 band reading has the highest PPV. This ranking does not depend on pre-test probability (Table 4, Figure S4). At a pre-test probability of 20%, we estimate the SureScreen 1 band reading to have a PPV of 95.1% (95% UI 92.6, 96.8%), such that we would expect approximately one in twenty positive results to be incorrect. OrientGene and Biomerica have the highest-ranking NPVs. There is very little difference between the NPVs for the one and two band readings of these devices (Table 4). For all seven tests, point estimates of sensitivity were lower among individuals with unknown previous infection status who were positive on Roche Elecsys® than among PCR-confirmed cases, with strong statistical evidence of a difference for all tests except OrientGene (Table 3). The greatest observed difference was for AbC-19. Fig. 4 shows that all devices were more sensitive at higher antibody concentrations. This effect was most marked in the devices with lower sensitivity, particularly AbC-19. All LFIAs had high sensitivity at the highest anti-S IgG concentrations, but at lower concentrations many lateral flow tests were falsely negative (Figure S3, Tables S13, S14). Very few bands or devices produced invalid readings (Table S15). Laboratory assessors reported that SureScreen bands were intense, well defined and easy to read, and that OrientGene bands were also easy to read. For Biomerica, some gradients and streaking in band formation were observed, which led to devices taking slightly longer to read. As we have reported previously, AbC-19 bands were often weak visually [21]. All devices showed some variability in reading across the three assessors. Concordance was highest for the SureScreen IgG band: there were no discrepancies in the reading of this band for 98.7% (98.3, 98.9%) of devices (Table S16).
Positive OrientGene, Biomerica and SureScreen IgG bands all tended to be stronger than AbC-19 bands: for example, across the 613 EDSAB-HOME samples that were positive on Roche Elecsys®, 76%, 69% and 79% respectively showed "medium to strong reactivity", compared with 44% of AbC-19 devices (Table S17). Concordance was lower for reading of IgM than IgG bands. IgM bands, when read as positive, were also often weak. We found evidence that SureScreen (when reading the IgG band only) and AbC-19 have higher specificities than OrientGene and Biomerica, but that the latter have higher sensitivities. We can confidently conclude that the SureScreen 1 band reading has ~99% specificity, since this estimate was robust across two large discrete sample sets. In contrast, estimates of the specificity of AbC-19 and OrientGene varied slightly across Approaches 1 and 2. As Approach 2 denominators are subject to some misclassification error, we consider the estimates of specificity based on pre-pandemic samples to be the most reliable. The sensitivities of OrientGene and Biomerica appeared comparable based on a reference standard of the Roche Elecsys® (anti-N) immunoassay, whereas OrientGene appeared to have higher sensitivity when an alternative (anti-S) reference standard was used. This difference is not surprising, since Biomerica also measures the anti-N response whereas OrientGene (like the other two devices studied) measures the anti-S response. For all four devices, there was some evidence of lower sensitivity to detect lower concentrations of antibody. This spectrum effect appeared strongest for the AbC-19 test and weakest for OrientGene. Due to spectrum effects, we consider Approach 2 estimates of sensitivity to be the most realistic.
Notably, none of the four devices met the UK Medicines and Healthcare products Regulatory Agency's requirement of sensitivity >98% for the use case of individual-level risk assessment [11], even in our least conservative analytical scenario, which we expect to overestimate sensitivity. On the other hand, the basis for this criterion is unclear, as we would expect high specificity to be the key consideration for this potential use case. Major strengths of this work include its size and the performance of all LFIAs on an identical sample set. This design is optimal for comparing test accuracy [34]. Inclusion of laboratory immunoassay positive cases without PCR confirmation is an additional key strength over most previous studies in this field (see Research in Context panel and Table S2): this allowed assessment and quantification of spectrum effects. Antibody test sensitivity may have been over-estimated by studies that have quantified this from previously PCR-confirmed cases alone, particularly if blood samples were taken at a point when access to PCR testing was very limited. A limitation of our study is that tests were conducted in a laboratory setting, with the majority reading across three expert readers being taken as the result. For devices with discrepancies between readers, the accuracy of a single reader can be expected to be lower [21]. Accuracy may be lower still if devices were read by individuals at home with little or no training, and may differ if device reading technologies were used.

Fig. 3. Sensitivity and specificity of lateral flow devices, with 95% confidence intervals, plotted in Receiver Operator Characteristic space. Four sets of estimates are shown: (i) Approach 1, i.e. specificity from analysis of known negatives and sensitivity from known positives (sample size: n = 1,995 for specificity, n = 268 for sensitivity); (ii) Approach 2 analysis of individuals with unknown previous infection status ("unknowns"), calculated against the Roche Elecsys® reference standard (sample size: n = 2,225 for specificity, n = 354 for sensitivity); (iii) Approach 2 sensitivity analysis: analysis of unknowns compared with the alternative EuroImmun reference standard (n = 2,233 for specificity, n = 346 for sensitivity); (iv) Approach 2 sensitivity analysis: analysis of unknowns compared with the alternative composite reference standard (CRS) of positive on either Roche Elecsys® or EuroImmun versus negative on both (n = 2,207 for specificity, n = 372 for sensitivity). NB SureScreen 2 band overlays OrientGene 1 band in the first panel.

Table 1. Specificity of lateral flow devices: Approach 1 (known negatives). Estimates based on analysis of 1,995 pre-pandemic samples. CI = confidence interval, UI = uncertainty interval based on percentiles from Monte Carlo simulation, TNs = true negatives, FPs = false positives, "Probability best" = the proportion of Monte Carlo simulations in which the test had the highest specificity. Note: these AbC-19™ results have been published previously [21] and are reproduced here for comparative purposes.

Table 3. Sensitivity and specificity of lateral flow devices: Approach 2. Comparison with the Roche Elecsys® immunoassay in EDSAB-HOME samples, stratified by previous PCR positivity. CI = confidence interval, UI = uncertainty interval based on percentiles from Monte Carlo simulation, "Probability best" = the proportion of Monte Carlo simulations in which the test had the highest sensitivity or specificity. Note: the AbC-19™ results have been published previously [21] and are reproduced here for comparative purposes.
Variation by reader type seems particularly likely for the devices that our laboratory assessors found more difficult to read, due to weak bands or gradients and streaking in band formation. The SureScreen IgG band, followed by the OrientGene IgG band, had the highest concordance across readers, who also reported these bands to be easy to read.

An ongoing difficulty in this field is ambiguity as to whether the true parameters of interest are sensitivity and specificity to previous infection, to the presence of particular antibodies, or to "immunity". Our estimates are best interpreted as sensitivity and specificity to "recent" SARS-CoV-2 infection (Approach 1) or to the presence of an antibody response (Approach 2). These can be expected to correlate very highly, since most individuals seroconvert [15] and both the anti-S and anti-N antibody responses are highly specific to SARS-CoV-2 [12]. Although we believe that, due to spectrum effects, our estimates of sensitivity based on a reference standard of Roche Elecsys® are more reliable than those based on previous PCR confirmation, we note that this assay may itself make some errors, and that evaluation against it may tend to favour LFIAs measuring anti-N responses. We explored this in sensitivity analyses using two alternative reference standards. Although there is strong evidence that the presence of an antibody response correlates with reduced risk [3,4], our estimates should not be interpreted directly as sensitivity and specificity to detect "immunity" or "any" previous infection (given declining antibody responses over time). Further, our study describes test accuracy following natural infection, not after vaccination: estimates of sensitivity would require further validation in vaccinated populations if the tests were to be used for post-vaccination monitoring. Notably, antigen choice precludes both Biomerica and Roche Elecsys® from this use case.
An additional limitation of our analyses is that we did not quantify the accuracy of tests used in sequence, e.g. checking positive results on Test A with a confirmatory Test B [35]. Finally, we estimated device accuracy in key workers in the UK, and we note that accuracy may not be generalisable to other populations.

If these devices are used for seroprevalence estimation, our estimates of LFIA accuracy can be used to adjust for test errors [9]. The "one gate" estimates of sensitivity would likely be the most appropriate for this. For the alternative potential use case of individual risk assessment (pending improved understanding of immunity), it would be desirable to use the most specific test, or the test with the highest PPV, which we estimate to be the SureScreen 1-band reading, followed by AbC-19™. At 20% seroprevalence, we estimate that around 1 in 20 SureScreen IgG positive readings would be a false positive. Confirmatory testing, possibly with a second LFIA, would be an option, although this would require evaluation.

DW, RM, HEJ, STP, AEA, TB, AC, MH and IO planned the study. KRP and JS planned the laboratory-based investigation. SK, JD, EDA and DW planned the specificity investigations. DW, RM, EDSAB-HOME site investigators and COMPARE investigators collected/provided samples. RB, EL and TB collated samples and performed assays. KRP and JS conducted experiments. HEJ, DW and SK did the statistical analyses. NC performed the rapid review of previous evidence. HEJ and DW wrote the paper, which all authors critically reviewed. Data have been verified by HEJ and DW.

JS and KP report financial activities on behalf of WHO in 2018 and 2019 in evaluation of several other rapid test kits. MH declares unrelated and unrestricted speaker fees and travel expenses in the last 3 years from MSD and Gilead. JD has received grants from Merck, Novartis, Pfizer and AstraZeneca, and personal fees and non-financial support from the Pfizer Population Research Advisory Panel.
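The test-error adjustment for seroprevalence estimation referred to above [9] is commonly implemented as the Rogan-Gladen correction, which inverts the relationship between apparent and true prevalence. The sketch below is illustrative only: the 98.9% specificity matches the figure reported for the SureScreen IgG band, but the 85% sensitivity and 18% apparent prevalence are assumed example values, not estimates from this study.

```python
def adjusted_prevalence(apparent_prev: float, sens: float, spec: float) -> float:
    """Rogan-Gladen correction.

    Inverts E[apparent] = sens * pi + (1 - spec) * (1 - pi) to recover the
    true prevalence pi from the apparent (test-positive) proportion.
    """
    if sens + spec <= 1:
        raise ValueError("test must be informative (sens + spec > 1)")
    adjusted = (apparent_prev + spec - 1) / (sens + spec - 1)
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(max(adjusted, 0.0), 1.0)

# Illustrative numbers: if 18% of a surveyed population tests positive on a
# device with assumed 85% sensitivity and 98.9% specificity, the adjusted
# prevalence is slightly higher than the apparent one.
print(adjusted_prevalence(0.18, 0.85, 0.989))
```

When the apparent positive rate equals the false positive rate (1 − specificity), the adjusted estimate is zero, which is why highly specific devices are preferred for surveillance in low-prevalence settings.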
Outside of this work, RB and EL perform meningococcal contract research on behalf of PHE for GSK, Pfizer and Sanofi Pasteur. All other authors declare no conflicts of interest.

[1] Prevalence of SARS-CoV-2 antibodies in a large nationwide sample of patients on dialysis in the USA: a cross-sectional study.
[2] SARS-CoV-2 antibody seroprevalence in the general population and high-risk occupational groups across 18 cities in Iran: a population-based cross-sectional study.
[3] Antibody Status and Incidence of SARS-CoV-2 Infection in Health Care Workers.
[4] Prior SARS-CoV-2 infection is associated with protection against symptomatic reinfection.
[5] Safety and immunogenicity of the ChAdOx1 nCoV-19 vaccine against SARS-CoV-2: a preliminary report of a phase 1/2, single-blind, randomised controlled trial.
[6] Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK.
[7] Do antibody positive healthcare workers have lower SARS-CoV-2 infection rates than antibody negative healthcare workers? Large multi-centre prospective cohort study (the SIREN study).
[8] Ethical Implementation of Immunity Passports During the COVID-19 Pandemic.
[9] Estimating prevalence from the results of a screening test.
[10] Testing for SARS-CoV-2 antibodies.
[11] Target Product Profile: antibody tests to help determine if people have immunity to SARS-CoV-2, Version 2. London: MHRA.
[12] Performance characteristics of five immunoassays for SARS-CoV-2: a head-to-head benchmark comparison.
[13] Clinical and laboratory evaluation of SARS-CoV-2 lateral flow assays for use in a national COVID-19 seroprevalence survey.
[14] Antibody tests for identification of current and past infection with SARS-CoV-2.
[15] Antibody responses to SARS-CoV-2 in patients with COVID-19.
[16] World Health Organization. COVID-19 – Landscape of novel coronavirus candidate vaccine development worldwide, 2020.
[17] Diagnostic performance of seven rapid IgG/IgM antibody tests and the Euroimmun IgA/IgG ELISA in COVID-19 patients.
[18] Antibody testing for COVID-19: a report from the national COVID scientific advisory panel.
[19] Evaluation of serological SARS-CoV-2 lateral flow assays for rapid point of care testing.
[20] Evaluation of SARS-CoV-2 serology assays reveals a range of test performance.
[21] "AbC-19 rapid test" for detection of previous SARS-CoV-2 infection in key workers: test accuracy study.
[22] Longitudinal observation and decline of neutralizing antibody responses in the three months following SARS-CoV-2 infection in humans.
[23] Detection, prevalence, and duration of humoral responses to SARS-CoV-2 under conditions of limited population exposure.
[24] Association between self-reported signs and symptoms and SARS-CoV-2 antibody detection in UK key workers.
[25] Problems of spectrum and bias in evaluating the efficacy of diagnostic tests.
[26] Empirical evidence of design-related bias in studies of diagnostic tests.
[27] Comparison of four methods to measure haemoglobin concentrations in whole blood donors (COMPARE): a diagnostic accuracy study.
[28] WHO performance evaluation protocols.
[29] Case-control and two-gate designs in diagnostic accuracy studies.
[30] The statistical evaluation of medical tests for classification and prediction.
[31] Probabilistic sensitivity analysis using Monte Carlo simulation: a practical approach.
[32] Evidence synthesis for decision making in healthcare.
[33] Dose-Response Analysis Using R.
[34] Comparative accuracy: assessing new tests against existing diagnostic pathways.
[35] Orthogonal SARS-CoV-2 Serological Assays Enable Surveillance of Low-Prevalence Communities and Reveal Durable Humoral Immunity.

The study was commissioned by the UK Government's Department of Health and Social Care, and was funded and implemented by Public Health England, supported by the NIHR Clinical Research Network (CRN) Portfolio.
We thank the following people who supported laboratory testing, data entry and checking, and specimen management: Jake Hall, Maryam Razaei, Nipunadi Hettiarachchi, Sarah Nalukenge, Katy Moore, Maria Bolea, Palak Joshi, Matthew Hannah, Amisha Vibhakar, Siew Lin Ngui, Amy Gentle, Honor Gartland, Stephanie L Smith, Rashara Harewood, Hamish Wilson, Shabnam Jamarani, James Bull, Martha Valencia, Suzanna Barrow, Joshim Uddin, Beejal Vaghela, Shahmeen Ali. We also thank Steve Harbour and Neil Woodford, who provided staff, laboratories, and equipment; the blood donor centre staff and blood donors for participating in the COMPARE study; and Philippa Moore, Antoanela Colda and Richard Stewart for their invaluable contributions in the Milton Keynes General Hospital and Gloucestershire Hospitals study sites.

Supplementary material associated with this article can be found in the online version at doi:10.1016/j.ebiom.2021.103414.