key: cord-0691755-jbkaxd4k
authors: Pearce, Neil; Rhodes, Sarah; Stocking, Katie; Pembrey, Lucy; van Veldhoven, Karin; Brickley, Elizabeth B.; Robertson, Steve; Davoren, Donna; Nafilyan, Vahe; Windsor-Shellard, Ben; Fletcher, Tony; van Tongeren, Martie
title: Occupational differences in COVID-19 incidence, severity, and mortality in the United Kingdom: Available data and framework for analyses
date: 2021-05-10
journal: Wellcome Open Res
DOI: 10.12688/wellcomeopenres.16729.1
sha: 281e0f4379167d78f3c6c17d1f835590dd4571bf
doc_id: 691755
cord_uid: jbkaxd4k

There are important differences in the risk of SARS-CoV-2 infection and death depending on occupation. Infections in healthcare workers have received the most attention, and there are clearly increased risks for intensive care unit workers who are caring for COVID-19 patients. However, a number of other occupations may also be at an increased risk, particularly those which involve social care or contact with the public. A large number of data sets are available with the potential to assess occupational risks of COVID-19 incidence, severity, or mortality. We are reviewing these data sets as part of the Partnership for Research in Occupational, Transport, Environmental COVID Transmission (PROTECT) initiative, which is part of the National COVID-19 Core Studies. In this report, we review the data sets available (including the key variables on occupation and potential confounders) for examining occupational differences in SARS-CoV-2 infection and COVID-19 incidence, severity and mortality. We also discuss the possible types of analyses of these data sets and the definitions of (occupational) exposure and outcomes. We conclude that none of these data sets are ideal, and all have various strengths and weaknesses. For example, mortality data suffer from problems of coding of COVID-19 deaths, and the deaths (in England and Wales) that have been referred to the coroner are unavailable. On the other hand, testing data is heavily biased in some periods (particularly the first wave) because some occupations (e.g. healthcare workers) were tested more often than the general population. Random population surveys are, in principle, ideal for estimating population prevalence and incidence, but are also affected by non-response. Thus, any analysis of the risks in a particular occupation or sector (e.g. transport), will require a careful analysis and triangulation of findings across the various available data sets.

There are epidemics of SARS-CoV-2 infection throughout most parts of the world 1,2 , and the United Kingdom is currently experiencing particularly high infection and death rates. There are major occupational differences in the risk of SARS-CoV-2 infection and death [3] [4] [5] , but there have been relatively few systematic analyses of infection or death rates across different occupation types. There are clearly increased risks for intensive care unit workers who are caring for COVID-19 patients, as well as increased risks for other health and social care workers. However, a number of other occupations may also be at an increased risk, particularly those which involve social care or contact with the public 5 .

A large number of data sets are available to potentially assess occupational risks of COVID-19 incidence, severity, or mortality ( Table 1) . We are reviewing these data sets as part of the Partnership for Research in Occupational, Transport, Environmental COVID Transmission (PROTECT) initiative, part of the National COVID-19 Core Studies. In this report, we review the available data sets, and in the Discussion, we provide more detail on some of the larger and more relevant data sets available for examining occupational differences in SARS-CoV-2 infection and COVID-19 incidence, severity and mortality. We also discuss the possible types of analyses of these data sets and the definitions of (occupational) exposure and outcomes.

In any analyses of this type, one may distinguish several populations that are relevant:

• There is a target population to which we wish to draw inferences (e.g. all people in the UK, all people on the planet)

• There is a source population which is used as the source of participants for a particular study (e.g. everyone living in the UK aged 20-64 and in employment)

• There is a (perhaps smaller) study population (i.e. the group of people who actually take part in the study, with some of the source population not taking part either due to selection by the investigators, or self-selection (i.e. non-response))

Since the focus is on occupational exposure to COVID-19, almost all analyses will focus on the working age population and will usually be restricted to those who were in employment at the beginning of the pandemic on 11 March 2020 6 . In data sets such as the Office for National Statistics (ONS) mortality data, the source population is the entire population of England and Wales (aged 20-64 and in employment at the beginning of the pandemic, and with an occupation recorded). In other data sets, e.g. UK Biobank, the source population is the entire population of England and Wales, aged 40-69 years and living in the UK in 2006, and who have not emigrated subsequently; the study population is those who actually took part in the survey (response rate = 5.5%).

Cohort data includes national mortality data (ONS data), cohorts based on Electronic Health Records (EHRs) such as OpenSafely, as well as population cohorts such as UK Biobank and many others (this data is being integrated and standardised, to the extent possible, by the Longitudinal Health Core Study, and the Data and Connectivity Core Study (National COVID-19 Core Studies)). Most cohorts have, or will have, linked mortality data. Many also have SARS-CoV-2 testing data, either as a single test, as a series of repeated test results, or self-reported tests and symptoms. Some also have hospitalization data.

In some instances, case-control studies can be nested within cohorts, or can be conducted as 'stand-alone' studies. One particular instance of this is the test-negative design 7,8 , which has been proposed for COVID-19 research in populations in which not everyone has been tested. The logic is that there are many individual factors (health seeking behaviour, access to transport, etc.) which may influence someone's ability to get tested. Thus, if we compare those who test positive with general population control samples, there may be considerable bias. When the test-negative design is applied to COVID-19, people who are tested are given the questionnaire on risk factors (or we obtain risk factor information some other way), and we then compare those who test positive with those who test negative. If everyone in the study population is tested (i.e. a comprehensive investigation), then this is essentially a cross-sectional study. However, in cases where not everyone is tested, we compare the test-positives with the test-negatives. It should be noted that people may be tested because they have symptoms, and therefore those who test negative may have a different respiratory infection. Thus, when we compare these two groups, we can learn about risk factors that are specific for SARS-CoV-2 (rather than respiratory infections in general). We can learn even more if we can also give a questionnaire to an additional carefully selected control person who was not symptomatic and therefore not tested. By comparing the test-positives with their controls, we can learn about risk factors for SARS-CoV-2, and by comparing the test-negatives with their controls, we can learn about risk factors for other respiratory infections. By putting the three sets of analyses together 7 (i.e. test-positives vs test-negatives, test-positives vs additional selected controls, and test-negatives vs population controls) and using triangulation 9 , we can learn a great deal.
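The three comparisons in the test-negative design can be illustrated with a minimal Python sketch. All counts below are hypothetical, the exposure ("public-facing job") is an assumed example, and the crude odds ratio with a Wald interval stands in for whatever adjusted model a real analysis would use.

```python
import math

def odds_ratio(exposed_a, unexposed_a, exposed_b, unexposed_b):
    """Crude odds ratio of exposure in group A vs group B, with a Wald 95% CI."""
    ratio = (exposed_a * unexposed_b) / (unexposed_a * exposed_b)
    se_log = math.sqrt(1 / exposed_a + 1 / unexposed_a + 1 / exposed_b + 1 / unexposed_b)
    lower = math.exp(math.log(ratio) - 1.96 * se_log)
    upper = math.exp(math.log(ratio) + 1.96 * se_log)
    return ratio, lower, upper

# Hypothetical counts as (public-facing job, other job); not taken from any data set.
test_positive = (60, 40)
test_negative = (90, 110)
untested_controls = (80, 120)

comparisons = {
    # Risk factors specific to SARS-CoV-2 rather than respiratory infection in general
    "test-positive vs test-negative": odds_ratio(*test_positive, *test_negative),
    # Risk factors for SARS-CoV-2 infection
    "test-positive vs controls": odds_ratio(*test_positive, *untested_controls),
    # Risk factors for other respiratory infections
    "test-negative vs controls": odds_ratio(*test_negative, *untested_controls),
}

for label, (ratio, lower, upper) in comparisons.items():
    print(f"{label}: OR = {ratio:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Reading the three odds ratios side by side, rather than relying on any one of them, is the point of the design: each comparison is subject to a different selection mechanism.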

Cross-sectional surveys include the baseline surveys for cohort studies (e.g. if everyone has a SARS-CoV-2 test at baseline), and 'one-off' outbreak investigations. Essentially, if everyone is only tested once, then usually the study will be cross-sectional. Such surveys can be analysed in the same way as a case-control study 10 .

The outcome data will vary according to the data set under analysis. It can include measures of SARS-CoV-2 infection (symptoms, positive test results), severity (hospitalisation, intensive care unit (ICU) admission) or mortality (COVID-19-related death, excess mortality). In most analyses one would take the first positive test result by reverse transcription polymerase chain reaction (RT-PCR) or serology as an outcome. One would only consider multiple positive test results in the same person if it were considered that these involved different infections.

There are a number of different classification methods for symptoms 12 , for example, the 'any symptom that could be caused by Coronavirus' definition applied by Understanding Society 13 . Other methods include focussing on three key symptoms 14 or applying a prediction model 15 .

There are also a number of ways to classify death from COVID-19 16 . For example, some methods include deaths where COVID-19 is mentioned on the death certificate 17 , whereas others use 'any death within 28 days of a positive test', as seen on the GOV.UK website.

The analyses described in this document focus on the relationship between occupation and work-related risk factors and health outcomes. An ideal investigation into the risk of transmission and infection in the workplace would include data that indicates the (likelihood of) exposure to infected people. However, this is virtually impossible, perhaps with the exception of healthcare staff working in COVID-19 wards. Hence, markers for the risk of exposure in groups of workers (rather than individuals) will need to be developed. In occupational epidemiological studies, different methodologies have been used to assess exposures to hazardous agents (or markers of exposure) in workplaces. Ideally, exposure is assessed quantitatively based on measurements of the environments. This is extremely challenging for SARS-CoV-2 due to the transient nature of the exposure. One possible option for future research may be to measure SARS-CoV-2 in sewage waste from workplaces, in order to determine if infections are occurring, and some trials are ongoing 18 . However, such data are unlikely to be widely available, and it will not be possible to use such data to distinguish between the exposure of individual workers within the same workplace.

Information on occupational risk factors can be collected through questionnaires. Many of the studies and data sources reported in Table 1 will include data from questionnaires completed by participants. Unfortunately, the level and detail of occupational information requested in the questionnaires varies widely between the different data sources and studies. Some will have very limited data, e.g. just whether participants are working from home or are furloughed, working hours (e.g. full-time or part-time work), patterns (shift-work), or job security (e.g. zero hours contracts). Further details can be collected by questionnaires, and an example of a questionnaire which aims to collect data on work-related risk factors is described in Extended data 19 .

Analyses of health outcomes, including symptoms, positive tests, hospitalisation, ICU admissions, and deaths, for each occupational group are informative. Ideally, occupational data should be collected and analysed using a standard occupational classification (SOC), such as SOC2010 or SOC2020. The use of the SOC will allow better comparison across studies. In this classification, each occupation is given a 4-digit code, but analyses can also be done using just the first digit, first two digits, etc. (see Discussion for 1- and 2-digit SOC codes). Analyses using 4-digit codes may not always be possible due to the size of the study; however, when possible, they may provide very useful information. For example, the first ONS report on COVID-19 deaths and occupation 20 demonstrated that within the broad category of Road Transport Driver (SOC 821), the COVID-19 mortality rate was elevated in bus and taxi drivers, but not in large goods vehicle and van drivers, suggesting that contact with the general public is a risk factor.
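Because the SOC is hierarchical, moving between levels of detail amounts to truncating the code. The following minimal Python sketch illustrates this; the records and the helper names `soc_group` and `outcome_counts` are hypothetical, and the codes are used only as examples of 4-digit strings.

```python
from collections import Counter

def soc_group(code: str, digits: int) -> str:
    """Truncate a 4-digit SOC code to its first `digits` digits (1 to 4)."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a 4-digit SOC code, got {code!r}")
    if not 1 <= digits <= 4:
        raise ValueError("digits must be between 1 and 4")
    return code[:digits]

def outcome_counts(records, digits):
    """Return {group: (cases, participants)} at the requested level of SOC detail."""
    cases, totals = Counter(), Counter()
    for code, had_outcome in records:
        group = soc_group(code, digits)
        totals[group] += 1
        cases[group] += int(had_outcome)
    return {group: (cases[group], totals[group]) for group in totals}

# Hypothetical records: (4-digit SOC code, whether the participant had the outcome)
records = [
    ("8212", True), ("8212", False), ("8213", True), ("8214", True),
    ("2231", False), ("2231", True),
]

print(outcome_counts(records, 4))  # most detailed level, sparse in small studies
print(outcome_counts(records, 3))  # e.g. all codes starting 821 combined
print(outcome_counts(records, 1))  # major groups only
```

Keeping the full 4-digit code in the data and truncating at analysis time means the same data set can support both coarse and fine comparisons as numbers permit.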

3-digit and 4-digit occupational codes can be selected and grouped on the basis of prior knowledge. One example of this is given in the first ONS report 20 (see Table 2 ).

Similar analyses have been done grouping healthcare workers and social care workers 17 .

Occupational Self-Coding and Automatic Recording (OSCAR)

One barrier to using SOC or other standardised occupational classifications is that they generally require collection of information on job and activities using free text questions, combined with post-hoc coding. This can be very time consuming, although some tools are available that can be used for (semi-)automatic coding, e.g. the Computer Assisted Structured Coding Tool (CASCOT). Still, many researchers are not keen to include open-ended and free text questions.

To overcome this problem, an occupational self-coding tool was developed for a study on chronic obstructive pulmonary disease (COPD) using the UK Biobank 20 . Occupational Self-Coding and Automatic Recording (OSCAR) was developed by the authors using the hierarchical structure of the SOC2000, which allows individuals to collect and automatically code their lifetime job histories via a simple decision-tree model 20 .

We are currently modifying OSCAR in order to focus only on recent occupations (e.g. since the beginning of 2020, rather than a full history). In addition, we have developed a more detailed occupational questionnaire as an optional tool in the COVID-19 version of OSCAR (see Extended data 19 ).

The SOC codes can also be used in combination with a Job Exposure Matrix (JEM) which we are currently developing. This approach has been used successfully in many other occupational epidemiological studies based on general population data 21 , where limited data are available on work-related factors. A JEM is basically a cross-tabulation of occupations against exposure factors.

Occupations are classified for each of these factors as follows:

Furthermore, the JEM also includes the following estimates:

1. Job insecurity: proportion of flexible labour contracts (including zero-hours contracts)

2. Migrant workers: proportion of migrant workers

The JEM is developed based on a combination of data and expert judgement which are used to classify each occupation, e.g. according to the likelihood/extent of public contact. As the JEM is developed in collaboration with European partners, an international occupational classification system (ISCO) is used, rather than the UK SOC classification. Hence, when completed the JEM will need to be translated into SOC, for which we will use a combination of existing cross-classifications as well as expert judgements.

When considering differences in SARS-CoV-2 and COVID-19 risk in different occupations, the 'standard' confounders include age, sex, ethnicity, deprivation, and region. Some of these factors may be time-varying, and this should ideally be taken into account in the analysis.

The term 'race' is an artificial construct, and therefore most researchers prefer to use the term 'ethnicity' 22 , which is a complex construct that includes biology, history, cultural orientation and practice, language, religion, and lifestyle, all of which can affect health. The UK census reports 18 categories of ethnicity (Table 3). Although it may be necessary to group these 18 categories into two (White, and Black, Asian and Minority Ethnic (BAME)) when study numbers are small, many object to this categorisation on the basis that there are substantial differences (including experiences of racism as well as cultural, social, economic, and historical factors) between the different 'non-White' ethnic groups; thus it is preferable to report study findings separately for each ethnic group if the numbers permit. For example, one recent analysis 23 of COVID-19 infection, hospitalisation, and mortality reported the findings by separating ethnicities into White (63%), South Asian (6%), Black (2%), Other (2%) and Mixed (1%), with 26% not providing any information on ethnicity.

The UK census has 10 categories for regions in England and Wales (Table 4) . Each region (with the exception of London) includes a mix of urban and rural residents.

The UK census has five categories of household deprivation (Table 5 ).

There are also several potential effect modifiers, including working from home, being furloughed, and the availability and use of personal protective equipment (PPE). All of these may modify the risk of infection, even if remaining in the same job throughout the pandemic.

Descriptive analyses

All analyses will usually start with similar descriptive analyses, e.g. tables of the characteristics of the study participants.

Intersectoral approaches may also be used in these descriptive analyses. These will usually be specific to the data set under analysis, so we will not try to establish general principles here.

The main studies that have used directly age-standardised rates are the ONS analyses 20 . These have estimated age-standardised mortality rates (ASMR) standardised to the 2013 European Standard Population. They are described in more detail in the Discussion section.
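Direct age standardisation weights each age-specific rate by the share of a standard population in that age band. The Python sketch below shows the arithmetic only; the two age bands, the counts, and the weights are hypothetical and are not the 2013 European Standard Population values, which would be substituted in a real analysis.

```python
def directly_standardised_rate(deaths, population, standard_weights, per=100_000):
    """Weighted average of age-specific rates, using standard population weights."""
    total_weight = sum(standard_weights.values())
    weighted_sum = sum(
        standard_weights[age] * deaths[age] / population[age]
        for age in standard_weights
    )
    return weighted_sum / total_weight * per

def crude_rate(deaths, population, per=100_000):
    """Unstandardised rate, shown only for comparison."""
    return sum(deaths.values()) / sum(population.values()) * per

# Hypothetical occupation-specific data by age band
deaths = {"20-44": 10, "45-64": 40}
population = {"20-44": 100_000, "45-64": 50_000}

# Hypothetical standard population weights (placeholders, not ESP 2013)
standard_weights = {"20-44": 60, "45-64": 40}

asmr = directly_standardised_rate(deaths, population, standard_weights)
print(f"Crude rate: {crude_rate(deaths, population):.1f} per 100,000")
print(f"Age-standardised rate: {asmr:.1f} per 100,000")
```

In this toy example the standardised rate differs from the crude rate because the occupation's age structure differs from that of the standard population, which is exactly the distortion that standardisation is meant to remove when occupations are compared.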

Cohort studies that have more comprehensive data, including data on potential confounders, can be analysed using Poisson regression 25 or the Cox proportional hazards model (they should yield the same results). For each occupational group being considered (see below for how these are defined and compared), we might run the following models if we are specifically investigating occupational exposures, and we wish to adjust for confounders such as ethnicity, deprivation, etc: 

There is a considerable amount of literature on the use of excess mortality analyses for studying COVID-19 mortality 26 . The rationale is that excess all-cause mortality may, in some instances, be a better measure of the true mortality burden from COVID-19 than is the case for COVID-19-specific mortality, because of the problems of classification of COVID-19 death on death certificates 1,2 . For example, Vandoros 27 used ONS data on the number of deaths in England and Wales that did not officially involve COVID-19 over the period 2015-2020; they used a difference-in-differences econometric approach to study whether there was a relative increase in deaths not registered as COVID-19-related during the pandemic, compared to a control time period. Results suggest that there were an additional 968 weekly deaths that officially did not involve COVID-19, compared to what would otherwise have been expected. Vandoros concluded that it is possible that some people are dying from COVID-19 without being diagnosed, and/or that there are excess deaths due to other causes resulting from the pandemic.
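The difference-in-differences logic can be sketched in a few lines of Python. This is a deliberately simplified illustration using group means; the weekly counts are hypothetical and the function is not a reconstruction of the model fitted by Vandoros, which was regression-based.

```python
from statistics import mean

def difference_in_differences(treated_before, treated_after, control_before, control_after):
    """Change in the pandemic year minus the change over the same weeks in control years."""
    change_treated = mean(treated_after) - mean(treated_before)
    change_control = mean(control_after) - mean(control_before)
    return change_treated - change_control

# Hypothetical weekly deaths not registered as involving COVID-19
deaths_2020_before = [11000, 11200]   # weeks before the pandemic reached the country
deaths_2020_after = [12500, 12300]    # weeks after
deaths_prior_before = [10900, 11100]  # same calendar weeks, averaged over earlier years
deaths_prior_after = [10400, 10600]

excess_weekly_deaths = difference_in_differences(
    deaths_2020_before, deaths_2020_after, deaths_prior_before, deaths_prior_after
)
print(f"Estimated excess weekly deaths: {excess_weekly_deaths}")
```

Subtracting the change seen in earlier years removes the usual seasonal decline in deaths over the same weeks, so the remaining difference is attributed to the pandemic period, under the assumption that the two series would otherwise have moved in parallel.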

Case-control studies can be analysed using logistic regression 25 . The general modelling strategy is essentially the same as that described for Poisson regression or the Cox proportional hazards model (see above).

The idea of 'triangulating' evidence from different methods and data sources has been proposed and used implicitly for decades, often without explicitly describing it as triangulation 9,28,29 . The key aspect of triangulation is that it involves comparing results from at least two (but ideally more) methods that have differing key sources of unrelated bias 9 . If evidence from such different epidemiological approaches all point to the same conclusion, this strengthens confidence that that is the correct causal conclusion, particularly when the key sources of bias for some of the approaches predict that the findings would point in opposite directions. The difference between 'epidemiologic triangulation' and the systematic review or meta-analysis of trials or epidemiological studies is that a systematic review seeks similar studies, which are expected to yield similar findings, and hence can be grouped in a meta-analysis to obtain a more precise estimate of an exposure. Epidemiological triangulation, in contrast, looks for different types of studies, which might be expected to yield different findings, because they involve different potential biases, or biases in different directions; this allows one to assess the likely existence or absence of the biases that one might be concerned about in one particular type of study 30 . Triangulation is particularly relevant to analyses of the relationship between COVID-19 and occupation, since the available databases have different strengths and weaknesses, often with biases in different directions. Thus, it is important to compare findings for a particular occupation (e.g. healthcare workers) across different data sets, and to attempt to understand why different analyses may give different results, and what the potential strengths and directions of the biases are in the different data sets.

Meta-analysis 31 is a quantitative technique that allows the combination of effect measures from multiple studies to increase precision and to allow for an overall summary. Meta-analysis is often accompanied by forest plots 32 , which allow visual comparison of effect measures, to assess consistency and explore variation.
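As an illustration of how effect measures are combined, the following Python sketch pools ratio estimates using fixed-effect inverse-variance weighting on the log scale and reports Cochran's Q as a simple heterogeneity statistic. The estimates are hypothetical, and the fixed-effect model is an assumption made for brevity; a random-effects model may be more appropriate when data sets differ substantially.

```python
import math

def fixed_effect_pool(estimates):
    """Pool (ratio, lower95, upper95) estimates by inverse-variance weighting of logs."""
    log_effects, weights = [], []
    for ratio, lower, upper in estimates:
        # Recover the standard error of the log ratio from the 95% interval width
        se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
        log_effects.append(math.log(ratio))
        weights.append(1 / se ** 2)
    pooled_log = sum(w * b for w, b in zip(weights, log_effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    # Cochran's Q: weighted squared deviations of each study from the pooled value
    q = sum(w * (b - pooled_log) ** 2 for w, b in zip(weights, log_effects))
    return {
        "ratio": math.exp(pooled_log),
        "lower": math.exp(pooled_log - 1.96 * pooled_se),
        "upper": math.exp(pooled_log + 1.96 * pooled_se),
        "q": q,
    }

# Hypothetical rate ratios for one exposure from three data sets
estimates = [(1.5, 1.0, 2.25), (2.0, 1.0, 4.0), (1.2, 0.8, 1.8)]
pooled = fixed_effect_pool(estimates)
print(f"Pooled ratio {pooled['ratio']:.2f} "
      f"(95% CI {pooled['lower']:.2f}-{pooled['upper']:.2f}), Q = {pooled['q']:.2f}")
```

A large Q relative to the number of studies minus one would suggest that the estimates are too heterogeneous for a single pooled summary to be meaningful.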

An advantage of analysing multiple data sets using the same general protocol is that there will be consistency in terms of the chosen outcome measures, the summary measures used, the format of the occupation variables, and the confounders adjusted for. However, in this context meta-analysis must be approached very cautiously because of the complex heterogeneity among the data sets in terms of the methods of data-collection, outcome measures, time periods covered, and testing strategies.

Occupations can be grouped in many different ways and the comparison of multiple occupation groups will lead to a large number of effect measures that are likely to be unsuitable for meta-analysis. The use of the JEM (see below) will allow us to look at the effect of a small number of key exposure variables related to occupation. Meta-analysis could then be performed on the effect measures related to these exposures.

There is a variety of analysis strategies which are used in analyses of this type, and there is no single 'gold standard' that can be universally applied 33,34 . One possible analysis strategy would involve considering the following contrasts: 

In this section we discuss the key data sets associated with this study in further detail.

The Office for National Statistics (ONS) has recently published several reports on COVID-19 deaths in the working age population (20-64 years) in England and Wales 20 . There were high COVID-19 death rates in selected occupations, particularly for men, including high death rates in occupations involving public contact 35,36 . These job types include security guards, taxi drivers and chauffeurs, bus and coach drivers, chefs, sales and retail assistants, and social care workers.

The findings were adjusted for age, but not for other factors such as ethnic group, place of residence and deprivation.

In the ONS data, deaths were defined using the International Classification of Diseases, 10th Revision (ICD-10). Deaths involving the coronavirus (COVID-19) include those with an underlying cause, or any mention of ICD-10 codes U07.1 (COVID-19, virus identified) or U07.2 (COVID-19, virus not identified). ONS applied an age restriction, selecting deaths among those aged 20 to 64 years, because of limitations of occupational mortality data for those below the age of 20 years and those above the age of 64 years. Occupation is reported on the death certificate at the time of death registration by the informant. This information was then coded using SOC2010.

Population counts for occupations were obtained from the Annual Population Survey (APS), using data collected in 2019 17,37 . The APS is the largest ongoing household survey in the UK and is based on interviews with members of randomly selected households. The survey covers a range of diverse topics, including information on occupation, which is then coded using the SOC2010 Manual 38 . The population counts were also restricted to those aged 20 to 64 years and were weighted to be representative of those living in England and Wales.

Mortality rates for the broader population of all usual residents in England and Wales were based on the mid-year population estimates for 2018.

This is the 'standard' way of conducting such analyses, which has been used in the ONS reports to date, where the numerator data is obtained from death registrations, and the denominator data is obtained from population surveys. The relevant files are death registrations, England and Wales and the Annual Population Survey (see Table 1 ).

This is a data set newly available from ONS 39 . The 2011 census was linked to the 2011-2013 Patient Registers (PR) using deterministic and probabilistic matching. It was first linked deterministically using 24 different matching keys, based on a combination of forename, surname, date of birth, sex, and geography (postcode or Unique Property Reference Number). Using different combinations of these variables ensured that records that contain errors in these variables could nonetheless be linked. The matches needed to be unique within a matching key for the match to be accepted. Probabilistic matching was then used to attempt to match records that were not linked deterministically, using 13 different combinations of personal identifiers. Candidate matches were assigned to census records using the Fellegi-Sunter probabilistic matching method. ONS did not replenish the sample with post-2011 births or immigrants, even though both groups were in the population at risk of COVID-19-related death in March 2020. While the latter group could have been identified and in principle linked to our data, neither group is captured in the 2011 census and therefore they have no ethnicity or covariate data recorded. Additionally, the younger population has been the least affected by COVID-19-related hospitalisation and mortality. For the same reason, individuals not enumerated at the 2011 census (estimated to be 6.1% of the population of England and Wales) were not included in the study population.
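The deterministic stage can be illustrated with a minimal Python sketch. The three matching keys, the field names, and the records below are hypothetical stand-ins for the 24 keys used by ONS, which are not listed here; the sketch only shows the principle of trying successively looser keys and accepting a match when it is unique within a key.

```python
from collections import defaultdict

# Hypothetical matching keys, ordered from strictest to loosest
MATCH_KEYS = [
    ("forename", "surname", "dob", "sex", "postcode"),
    ("surname", "dob", "sex", "postcode"),   # tolerates an error in forename
    ("forename", "surname", "dob", "sex"),   # tolerates a change of address
]

def deterministic_link(census, register, match_keys=MATCH_KEYS):
    """Return {census_id: register_id} for matches that are unique within a key."""
    links, used_register_ids = {}, set()
    for key in match_keys:
        census_index, register_index = defaultdict(list), defaultdict(list)
        for rec in census:
            if rec["id"] not in links:
                census_index[tuple(rec[f] for f in key)].append(rec["id"])
        for rec in register:
            if rec["id"] not in used_register_ids:
                register_index[tuple(rec[f] for f in key)].append(rec["id"])
        for value, census_ids in census_index.items():
            register_ids = register_index.get(value, [])
            # Accept only one-to-one matches; ambiguous candidates are left unlinked
            if len(census_ids) == 1 and len(register_ids) == 1:
                links[census_ids[0]] = register_ids[0]
                used_register_ids.add(register_ids[0])
    return links

def person(id_, forename, surname, dob, sex, postcode):
    return {"id": id_, "forename": forename, "surname": surname,
            "dob": dob, "sex": sex, "postcode": postcode}

census = [
    person("c1", "Ann", "Smith", "1970-01-01", "F", "AB1 2CD"),
    person("c2", "Bob", "Jones", "1980-05-05", "M", "EF3 4GH"),
    person("c3", "Cat", "Lee", "1990-09-09", "F", "IJ5 6KL"),
]
register = [
    person("r1", "Ann", "Smith", "1970-01-01", "F", "AB1 2CD"),
    person("r2", "Rob", "Jones", "1980-05-05", "M", "EF3 4GH"),  # forename differs
    person("r3", "Cat", "Lee", "1990-09-09", "F", "IJ5 6KL"),
    person("r4", "Cat", "Lee", "1990-09-09", "F", "IJ5 6KL"),    # duplicate candidate
]

links = deterministic_link(census, register)
print(links)
```

Records left unlinked at this stage, such as the ambiguous third census record in this example, are the ones that would be passed on to probabilistic matching.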

At this stage, the data set only includes deaths for 2020, but it is possible that deaths from 2011-2019 could also be linked.

2020 in relation to occupation. They found that there were 120,075 working participants aged 49-64 years in 2020, after excluding those who had died previously, or had missing data. They compared the occupation at baseline to that at follow-up, for a sub-cohort of 12,292 people who completed further data collection between 30th April 2014 and 7th March 2019. They found high agreement between the job at baseline and at follow-up: 67% for 'other essential workers', and 92% for 'non-essential workers'. For more narrowly defined occupational groups, agreement ranged from 53% for food workers to 88% for healthcare professionals.

One possible set of analyses for this data is to undertake standard cohort (Poisson regression or Cox regression) analyses with a positive SARS-CoV-2 test as the outcome. Such analyses have been performed by Chadeau-Hyam et al. 40 who also compared the risk factors for positive COVID-19 tests with those for negative COVID-19 tests (this is discussed further below). Mutambudzi et al. 43 have performed similar analyses with severe COVID-19 (a positive test in a hospital setting and/or COVID-19-related death) as the outcome. Thus, they have already published findings for the standard SOC occupational groups but have not published any findings for a COVID-19-related JEM.

An alternative approach to analysing the UK Biobank data would be to use the test-negative design. The rationale for this is that during the first wave of the pandemic testing was done on the basis of symptoms and/or high-risk occupations (e.g. healthcare workers), so standard cohort analyses may be biased (e.g. Chadeau-Hyam et al. 40 found particularly high positivity rates for healthcare workers, which may just reflect that this group was being tested regularly). Chadeau-Hyam et al. in part addressed this selection bias by comparing the findings for positive and negative COVID-19 tests (they compare the findings for tested vs non-tested, +ve vs non-tested, -ve vs non-tested, and +ve vs -ve), but such an analysis has not been done for occupation.

Understanding Society is a UK-wide long-term longitudinal study involving approximately 10,000 participants per decade. Understanding Society uses probability sampling and is constructed to allow population inferences. From April 2020, participants from the main Understanding Society sample completed an online survey relating to the COVID-19 pandemic once a month from April to July, and then once every 2 months from September onwards. Each survey includes core content (e.g. SARS-CoV-2 test results and symptoms, information about working from home or furlough) which is designed to track changes. The survey also includes variable content adapted each month as the coronavirus situation develops. The latest release of data was for the September 2020 questionnaire, and at that point 19,763 participants had completed at least one survey. Occupation data was collected in June 2020 and this included 3-digit SOC codes and sector data. The dataset contains information on age, gender, and ethnicity, as well as geographical information. Nandi and Platt 44 found that within the Understanding Society population, Black Africans are more likely to report experiencing SARS-CoV-2 symptoms than White UK, and this could not be explained by greater exposure to overcrowding or by the fact that they were keyworkers.

The Understanding Society suite of data sets includes weighting (if necessary) to allow valid population inferences. This includes weighting related to the design (clustering and stratification) and to the response. Weighted analyses may be conducted using the svydesign function from the survey package in R.

One possible set of analyses for this data is to undertake standard cohort (Poisson regression or Cox regression) analyses with either positive SARS-CoV-2 test and/or symptoms suggestive of SARS-CoV-2 as the outcome, and using the 1-digit SOC codes or sector as covariates. Note that this dataset is unlikely to be large enough to consider breakdown by 2-digit SOC codes. Covariates that take into account periods of working from home or furlough can be included (these could be time-varying). Analysis using covariates derived from the JEM can also be included. Symptom data is likely to overestimate the incidence of SARS-CoV-2; however, access to testing and motivation to take a test are likely to vary by occupation, whereas reporting of symptoms is likely to be independent of occupation.

An alternative approach to analysing the UK Understanding Society data would be to use the test-negative design. The rationale for this is that during the first wave of the pandemic testing was done on the basis of symptoms and/or high-risk occupations (e.g. healthcare workers), so standard cohort analyses may be biased. Usually once someone has tested positive, they would not be re-tested, and if they were, they would be excluded from the analysis.

Williamson et al. 46 have analysed the OpenSafely data and linked the primary care records to 10,926 COVID-19-related deaths. They found higher death rates to be related to male sex, older age, higher deprivation, diabetes, severe asthma, and various other medical conditions. Black and South Asian people were at higher risk of COVID-19-related death, even after adjustment for potential confounders.

The ethnic differences were explored further by Mathur et al. 23 who found substantial evidence of ethnic inequalities in the risk of testing positive, ICU admission, and mortality, which persisted after accounting for explanatory factors including household size. However, they noted that some of this excess risk may be related to factors not captured in clinical records, such as occupation. They note that prioritizing linkage between health, social care and employment data and engaging with ethnic minority communities is essential for generating evidence to prevent further widening of ethnic inequalities in COVID-19.

Thus, OpenSafely is a potentially important database for examining occupational differences in COVID-19 incidence, severity, and mortality, adjusted for other factors such as deprivation and ethnicity. However, occupational information has not been linked to OpenSafely at this stage.

A large number of data sets are available to potentially assess occupational risks of COVID-19 incidence, severity, or mortality. All have various strengths and weaknesses. For example, mortality data suffer from problems of coding of COVID-19 deaths, and the unavailability (in England and Wales) of deaths that have been referred to the Coroner, and testing data is heavily biased in some periods (particularly the first wave) because some occupations (e.g. healthcare workers) were tested more often than the general population. In principle, random population surveys are ideal for estimating population prevalence and incidence but are also affected by non-response. Thus, any analysis of the risks in a particular occupation or sector (e.g. transport), will require a careful analysis and triangulation of findings across the various available data sets.

Underlying data

All data underlying the results are available as part of the article and no additional source data are required.

We read the method article by Neil Pearce and colleagues with great interest, in part because we are moving forward with similar work here in Canada. The paper is a great contribution. Although in many ways it is UK-specific, the broader issues it addresses are relevant to non-UK researchers trying to develop the best methods for approaching this difficult topic. This paper was very useful in organizing our thoughts on the methodological and challenging issues, though we do have some suggestions.

On page 5, perhaps testing should be added to the list of outcomes to be examined. Although it is not a disease, it is an important indicator for the potential to recognize the disease and testing and test-positivity rates are useful for understanding COVID-19 and the development of public policy.

OSCAR is a very positive development for future coding of occupations and we look forward to learning more. On the other hand, the automated coding currently used for many large existing data sets can have major problems in terms of both reliability and validity, which increase with the number of digits used. The misclassification introduced is non-differential with regard to disease status, so it likely attenuates associations. This deserves mention as a limitation of these datasets and highlights the value of OSCAR.
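The attenuation can be illustrated with a small worked example. The sketch below is a hypothetical Python calculation, not an analysis of any of the data sets discussed: the 2x2 counts, the sensitivity and the specificity of the occupational coding are all invented, and the same coding error is applied to cases and non-cases to represent non-differential misclassification.

```python
def odds_ratio(exp_cases, unexp_cases, exp_noncases, unexp_noncases):
    """Odds ratio from a 2x2 table of exposure by case status."""
    return (exp_cases * unexp_noncases) / (unexp_cases * exp_noncases)


def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts after imperfect exposure classification."""
    observed_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    observed_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return observed_exposed, observed_unexposed


# Invented "true" table: exposure is membership of one occupational group.
true_or = odds_ratio(200, 800, 100, 900)

# The same (assumed) coding accuracy applies to cases and non-cases.
sens, spec = 0.8, 0.9
obs_exp_cases, obs_unexp_cases = misclassify(200, 800, sens, spec)
obs_exp_non, obs_unexp_non = misclassify(100, 900, sens, spec)
observed_or = odds_ratio(obs_exp_cases, obs_unexp_cases, obs_exp_non, obs_unexp_non)

print(f"true OR {true_or:.2f}, observed OR {observed_or:.2f}")
```

With these assumed values the observed odds ratio is pulled towards 1 relative to the true odds ratio, which is the muting of associations described above.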

We were surprised at the lack of discussion of industry sector. Some characteristics of a workplace can be better captured by the industry, such as whether the work is "public facing" or "essential", which affects the potential for infection while operating, or whether the work continues during lockdown. For example, someone in a cleaning occupation could have a quite different risk depending on whether they are employed in a hospital, factory, restaurant, or recreational facility.

Although "Confounders and Effect Modifiers" is a heading, the discussion of effect modification is very limited. In particular, the issue of race/ethnicity is extremely important and deserves consideration as an effect modifier. In our country it has a major impact on where people are employed, testing rates, availability of vaccines, and vaccine hesitancy.

We were surprised that the discussion of geography was limited to political regions. Surely other options are available in the UK? One of the major challenges facing us is differentiating workplace from community transmission, and geography, at the very least urban versus rural, is a useful surrogate.

Triangulation is discussed in broad terms. Perhaps an example would be helpful, such as using the population health approaches discussed in the paper with the workplace level information provided by the Public Health England outbreak investigations.

Effect modification is not raised in the context of analysis. I assume that the investigators would look at this, but it is important to mention understanding the complex relationship between the variables before treating them as confounders and adjusting away their effects. Again race/ethnicity is an important example but, given differences in testing, vaccination, and other factors, even sex and age deserve close examination before adjusting away their effects. For example, are certain occupational groups infected at an earlier age?

Although selection bias is mentioned in relation to the UK Biobank, no further discussion of the point is provided, other than it may diminish over time. A major challenge with many similar cohorts is that they are based on voluntary participation and may not be representative of the labour force.

In the first sentence, "and the United Kingdom is currently experiencing particularly high infection and death rates": suggest changing to "has experienced" so that the statement is not tied to one point in time.

In Table 1 please specify "UK" in the title. Is the availability of occupational data in REACT still "unknown?" Perhaps "Possible" in the last column could be described more?

The link for the occupational questionnaire (reference 19) seems to have a description of the questionnaire, but not the questionnaire itself, which would be helpful.

Are sufficient details provided to allow replication of the method development and its use by others? It would be good to include the time frame of the data (calendar year/months) presented in Table 2 to make clear that this was during the COVID pandemic.

The OSCAR tool is a great innovation; I hadn't heard of this before. It could greatly facilitate systematic collection of occupation data.

Tried to get a look at the Questionnaire at the "Extended Data" link, but didn't manage to see the actual questionnaire.

The JEM development is very promising. 'Risk factors for transmission' in the JEM could perhaps also include interaction with members of the public. This would be the case, for example, for workers stacking supermarket shelves. Or distinguish between indoor (e.g., building) or enclosed-space (e.g., public transport bus) proximity with members of the public versus outdoor proximity (e.g., traffic control worker at an inner city construction site)? Such interaction/interfacing should probably be independent of distance, acknowledging the potential for aerosol transmission. Is this what the authors are trying to get at by "c. Indirect contact"? This is not clear.

It's a finer/minor point, but job insecurity might be better expressed as 'employment precarity' because some higher status jobs have low security but relatively good working conditions, whereas precarious employment (such as zero hours contracts) has both low security and a raft of other poor working conditions that could predispose to COVID exposure and infection. Perhaps the focus on zero hours contracts is because there is a source of data on this in the UK by occupation?

The focus on occupation is well-founded and based on the availability of data as well as historical precedent. But perhaps the authors could consider (if they haven't already) whether industrial sector information could also be useful, where it is accessible? This could provide another lens on key constructs/risk factors such as precariousness/job insecurity from which to triangulate. For example, the hospitality and retail sectors (in many countries, though not certain about the UK) have a particularly high prevalence of precariously employed workers. CASCOT appears to be able to code sector as well as occupation?

This article makes a valuable contribution in detailing a wide range of population-level data sources. In seeking to generate relevant measures from these various sources, a possibly useful distinction could be identifying those measures of infection/morbidity/mortality occurrence that are based on the same occupation 'measurement method' for numerators and denominators, or cases and non-cases in the populations from which cases have emerged (such as rates by occupation based on APS data with comparably SOC-coded occupation for cases and non-cases). These can still be biased, but would at least be internally consistent in exposure (occupation) measurement. We face the same challenges in estimating suicide rates among workers in particular occupations or sectors (e.g., building and construction) based on Coronial investigation records to determine the occupation of suicide cases, while sourcing occupation or sector denominator data from periodic (~every 3-5 years) Labour Force and Census surveys, leaving all sorts of room for error.

Please check the links. At least one needs to be more specific: the hyperlink from OSCAR (Occupational Self-Coding and Automatic Recoding) took me to a web page for "Lungs at Work", ways of assessing exposure to SARS-CoV-2 through occupational characteristics like interacting with the public, working on a production line in close proximity to other workers, and by being in a so-called "essential" occupation. Some examples of exposure assessments that would strengthen the paper include [1] [2] [3] [4] [5].

A few minor additional suggestions:

In the second paragraph, it would be helpful to state explicitly that the data resources are for the U.K.

In the discussion of race/ethnicity on page 6, and/or in the discussion of confounders/effect modifiers on page 8, I think that it would be helpful to go into more detail about the complex potential roles of race/ethnicity (and I suppose also deprivation) in the pandemic. It is not at all a simple matter to "control" for race/ethnicity when it may affect risk of infection, underlying conditions, probability of being tested, quality of health care, and probably several other critical steps. Hawkins et al. 6 found that Blacks consistently had higher mortality rates from Covid-19 than Whites within the same occupation, in Massachusetts USA. There are several possible reasons for this, but I think the paper would be improved by acknowledging the complexity of teasing out the reasons for race/ethnic differences.

On page 5, the application of wastewater epidemiology to workplaces is a good point to raise, and I think there might be a few additional references that could point readers to concrete examples. Prisons and other congregate settings are being studied effectively to identify outbreaks, and these of course are occupational exposures.

2. Baker MG, Peckham TK, Seixas NS: Estimating the burden of United States workers exposed to infection or disease: A key factor in containing risk of COVID-19 infection

Prevalence of Underlying Medical Conditions Among Selected Essential Critical Infrastructure Workers -Behavioral Risk Factor Surveillance System

Occupations by Proximity and Indoor/Outdoor Work: Relevance to COVID-19 in All Workers and Black/Hispanic Workers

Social Determinants of COVID-19 in Massachusetts, United States: An Ecological Study

Is the rationale for developing the new method (or application) clearly explained? Yes

Is the description of the method technically sound?

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Institute for Health Transformation, School of Health & Social Development, Deakin University, Geelong, Australia

This article makes a valuable contribution in detailing a wide range of population-level data sources for investigating occupational differences in COVID-19 incidence, severity, and mortality in the UK.

Page 3, first paragraph of Introduction: "the United Kingdom is currently (as of MONTH, year) experiencing particularly high infection and death rates." Suggest inserting the month and year, as the situation is constantly changing.

Page 5, Methods, under the 'Occupational Codes' section: SOC is presumably specific to the UK? Simply mention that qualification for non-UK international readers, and perhaps note its compatibility/translatability to ISCO (International Standard Classification of Occupations)?

Page 6, Table 2: Excellent example comparing mortality rates by different types of vehicle drivers, but would be

not a description or report on OSCAR (whereas the CASCOT link does go to a CASCOT-specific page).

Is the description of the method technically sound? Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Report 25 May 2021 https://doi.org/10.21956/wellcomeopenres.18450.r43812 © 2021 Kriebel D. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Department of Public Health, University of Massachusetts Lowell, Lowell, MA, USA

This is a very useful summary of a large number of resources available in the U.K. for studies of occupational differences in Covid-19. The topic is highly relevant because the roles of occupation in risk of Covid-19 are complex, and unfortunately these roles have not been sufficiently taken into consideration in public debates and policy formulation. The authors are very qualified to provide a thorough overview of the topic with a valuable compendium of resources both in data and in methods.

One substantive addition to the paper would strengthen it significantly. The discussion of exposure variables could be strengthened. The paper lacks reference to the literature on different

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.

Occupational and environmental exposure assessment and epidemiology.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.