key: cord-0254225-zl0gvapm
authors: Scholes, S.; Ng Fat, L.; Moody, A.; Mindell, J. S.
title: Does the use of prediction equations to correct self-reported height and weight improve obesity prevalence estimates? A pooled cross-sectional analysis of Health Survey for England data
date: 2022-01-30
journal: nan
DOI: 10.1101/2022.01.28.22270014
sha: a38d51125b12f7afc7aa2efb89b960389865c359
doc_id: 254225
cord_uid: zl0gvapm

Objective: Adults typically overestimate height and underestimate weight compared with measured values, and such misreporting varies by socio-demographic and health-related factors. Using self-reported and interviewer-measured height and weight, collected from the same participants, we aimed to develop a set of prediction equations to correct bias in self-reported height and weight, and assess whether this adjustment improved the accuracy of obesity prevalence estimates relative to those based only on self-report. Design: Population-based cross-sectional study. Participants: 38,942 participants aged 16+ (Health Survey for England 2011-16) with non-missing self-reported and interviewer-measured height and weight. Main outcome measures: Comparisons between self-reported, interviewer-measured (gold standard) and corrected (based on prediction equations) body mass index (BMI in kg/m2) including (i) difference between means and obesity prevalence, and (ii) measures of agreement for BMI classification. Results: On average, men overestimated height more than women (1.6 and 1.0cm, respectively; p<0.001), whilst women underestimated weight more than men (-2.1 and -1.5kg, respectively; p<0.001). Underestimation of BMI was larger on average for women than for men (-1.1 and -1.0kg/m2, respectively; p<0.001). Obesity prevalence based on self-reported BMI was 6.8 and 6.0 percentage points (pp) lower than that estimated using measured BMI for men and women, respectively. Corrected BMI (based on models containing all significant predictors of misreporting of height and weight) lowered underestimation of obesity to 0.8pp in both sexes and improved the sensitivity of being classified as obese over self-reported BMI by 15.0pp for men and 12.2pp for women. Results based on models using age alone as a predictor of misreporting were similar.
Conclusions: Compared with self-reported data, applying prediction equations improved the accuracy of obesity prevalence estimates and increased sensitivity of being classified as obese. Including additional socio-demographic variables does not add enough predictive power to justify the added complexity of including them in prediction equations.

• The limitations of body mass index (BMI) calculated from self-reported height and weight are well known.

• Health examination surveys such as the Health Survey for England (HSE) enable study of reporting bias when they collect both self-report and directly measured data on height and weight from the same participants.

• This study used HSE 2011-16 data to derive a set of adjustments to self-reported height and weight based on linear regression models that estimated measured values from self-reported values, with additional corrections for socio-demographic and health-related factors predictive of misreporting.

• Corrected and measured BMI (the gold standard) were compared to quantify by how much obesity prevalence estimates were improved relative to those based on self-report data only.

• Prediction equations are specific to time, place, target population and methods of data collection. As such, these may not be applicable to surveys with more recent data, or different socio-demographic, health and self-reported anthropometric profiles.

A few large cross-sectional surveys in England and the United States (US) include only self-reports of height and weight, as direct measurement is not feasible due to factors such as cost.

In lieu of direct measures, body mass index (BMI) derived from self-reported height and weight (hereafter, self-reported BMI) is sometimes used for research 1 and regularly for estimating obesity prevalence at a sub-national level as part of monitoring efforts 2 3 .

However, systematic literature reviews 4 5 and epidemiologic studies [6] [7] [8] [9] [10] [11] have consistently shown that adults on average overestimate height and underestimate weight compared with measured values. Whilst such misreporting is typically moderate on average for continuous variables, self-reported BMI often results in a systematic misclassification of BMI categories.

Misreporting of height and weight, plus the skewness of BMI distributions, result in significant underestimation of obesity prevalence 11 .

Misreporting of height and weight varies by socio-demographic factors, e.g. by sex 11 , age 12 , race/ethnicity 10 , and socioeconomic status 9 , and by health-related factors such as current smoking status 8 and self-perceived health 8 . Younger women in particular underestimate weight (linked to social desirability bias) 13 , while older persons overestimate height (linked to reporting height measured earlier in life, prior to becoming shorter due to changes in bone and muscle) 6 12 . Misreporting of weight is greater in higher BMI categories 10 11 14-16: the term "flat slope syndrome" in obesity epidemiology describes the systematic tendency for self-reported BMI (relative to measured BMI) to overestimate low values and underestimate high values 7 17 18 .

Health examination surveys (e.g. the Health Survey for England (HSE) and the US National Health and Nutrition Examination Survey (NHANES)) often collect both self-report and measured data on height and weight from the same participants, enabling study of self-reporting bias 10 11 . As such analyses have shown systematic patterns in misreporting, equations including variables predictive of misreporting have been developed for use with surveys collecting self-report but not measured height and weight 9 10 16 . In England, self-reported height and weight in the Active Lives Survey (ALS) datasets are adjusted by formulae based on HSE data to monitor excess weight (BMI ≥25kg/m2) at local authority level. For surveys with no direct measurements, such adjustments are made in the expectation that corrected values improve BMI classification sensitivities, and so estimate levels of excess weight and obesity more accurately, compared with reliance on self-report data alone 3 16 .

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version posted January 30, 2022; https://doi.org/10.1101/2022.01.28.22270014 (medRxiv preprint).

The HSE is the main data source for monitoring overweight and obesity in the general population in England. Annually since 1991, trained interviewers have measured participants' height and weight. Self-reported height and weight were included in each survey year between 2011 and 2016. The present study aims to analyse HSE 2011-16 data to develop a set of equations to correct self-reported height and weight to more closely approximate measured height and weight. Should corrected BMI show an improvement over self-reported BMI, these equations could then be applied to (i) self-report data in HSE 2021 (where interviewer-measurement was not possible for a substantial portion of fieldwork due to Covid-19 pandemic precautions) and (ii) other interview-based surveys (e.g. ALS) to improve accuracy of obesity prevalence estimates. Our objectives were to (i) identify which variables are associated with misreporting (thereby meriting inclusion in prediction equations), and (ii) assess whether applying the chosen equations improved the classification of adults into BMI categories.

The present study used HSE data on adults (aged 16y+) from all survey years between 2011 and 2016, when both self-reported and measured height and weight were collected. The HSE is a cross-sectional, general population survey of individuals living in private households, with a new sample each year randomly selected by address 19 . Data collection occurs throughout the year. The first stage is a health interview, including questions about sociodemographic factors, diagnosed health conditions, self-rated health and health-related lifestyle behaviours, together with direct measurements of (and, in 2011-2016, self-reports of) height and weight. The second stage is a nurse visit, including biophysical measurements. Interviews and nurse visits take place in the participants' own homes. All adults in selected households were eligible (maximum 10); the percentage of eligible households participating ranged from 66% in 2011 to 59% in 2016. Participants gave verbal consent to be interviewed, visited by a nurse, and to have anthropometric measurements taken. Research ethics approval was obtained from relevant committees.

Self-reported height and weight were collected with the questions: "How tall are you without shoes?" and "How much do you weigh without clothes and shoes?" Height was reported in either metres or feet and inches; weight was reported in kilogrammes (kg) or stones and pounds. Height was measured using a portable stadiometer with a sliding head plate, a base plate and connecting rods marked with a measuring scale. Participants were asked to remove their shoes. One measurement (to the nearest even millimetre) was taken, with participants stretching to the maximum height and the head positioned in the Frankfort plane. For those not pregnant, a single weight measurement (to the nearest 100g; maximum 200kg) was recorded using Class III Seca scales; participants were asked to remove their shoes and any bulky clothing or heavy items in pockets, etc. No adjustment was made for the weight of clothing. Participants unable to stand or unsteady on their feet were not measured. Those who weighed 200+kg were asked for their estimated weight because the scales are inaccurate above this level: these two cases were included herein (measured weight set equal to reported weight) to avoid underestimating weight in the upper tail. Participants were assigned missing values if they were considered by the interviewer to have unreliable measurements, e.g. those who were too stooped or wore excessive clothing. Participants were not told at the time of interview that their height and weight would be measured; however, given their informed consent, it is likely that they might have anticipated being measured subsequent to their report.

All participants (n=49,817) were asked their height and weight soon after starting the interview, and the measurements took place near the end. As our aim was to study self-reporting bias, the analytical sample was limited to n=38,942 participants with non-missing self-reported and measured height and weight. Participants were excluded for the following reasons: pregnant (n=471); missing self-report and measured data (n=1001); missing self-report but not measured data (n=2550); and missing measured but not self-report data (n=6853).

Missing self-report and measured data were primarily due to "don't knows" and refusals, respectively.

We decided a priori to conduct sex-specific analyses due to documented differences in reporting 6 8-10 14 20 and the sexual dimorphisms in height, weight and adiposity 21 . Initial analyses showed no linear trend in misreporting over the six-year period in either sex (Supplementary Data Table S1): pooled data were therefore used for subsequent analyses.

Among complete cases (n=38,942), mean self-reported and measured height, weight and BMI were calculated with 95% confidence intervals (CIs). To compare across distributions 22 , we computed values at the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles 11 . BMI was calculated as weight in kg divided by height in metres squared, and the World Health Organization classification was used to group participants into five categories: underweight (BMI <18.5kg/m2); normal weight (18.5-24.9kg/m2); overweight, not obese (25.0-29.9kg/m2); obesity grades I and II (30.0-39.9kg/m2) and obesity grade III (≥40kg/m2) 23 .

Participants were also classified according to the binary categories of (i) overweight including obesity (excess weight: ≥25kg/m2), and (ii) obesity (≥30kg/m2). These definitions were applied to all participants as adults are defined in the HSE series as aged 16y+.
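The BMI definition and cut-offs above can be illustrated with a minimal sketch. The function names and the example participant are hypothetical, and the published bands (e.g. 18.5-24.9) are treated here as continuous cut-offs at 18.5, 25, 30 and 40kg/m2, which is an assumption about how boundary values were handled.

```python
def bmi(weight_kg, height_cm):
    """BMI = weight in kg divided by height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def who_category(bmi_value):
    """Five-group classification described in the text (continuous cut-offs assumed)."""
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25.0:
        return "normal weight"
    if bmi_value < 30.0:
        return "overweight, not obese"
    if bmi_value < 40.0:
        return "obesity grades I and II"
    return "obesity grade III"


def is_excess_weight(bmi_value):
    """Binary category (i): overweight including obesity."""
    return bmi_value >= 25.0


def is_obese(bmi_value):
    """Binary category (ii): obesity."""
    return bmi_value >= 30.0


# Made-up participant who over-reports height by 2 cm and under-reports weight by 2 kg.
bmi_self = bmi(weight_kg=94.0, height_cm=178.0)
bmi_measured = bmi(weight_kg=96.0, height_cm=176.0)

print(round(bmi_self, 1), who_category(bmi_self))
print(round(bmi_measured, 1), who_category(bmi_measured))
```

In this invented case a modest misreport of height and weight is enough to move the participant from the obese category (measured) to the overweight, not obese category (self-reported).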

To compare self-report and measured data, we first calculated the difference (self-reported minus measured) between means (height, weight, and BMI) and between prevalence (overweight including obesity, and obesity). The degree of individual variability in the difference was summarised by the standard deviation (SD) 4 6 . Bland-Altman limits of agreement (LOA) were calculated for BMI. Secondly, we cross-tabulated self-reported and measured BMI categories. Using measured BMI as gold standard, we calculated estimates of sensitivity (the percentage of participants in a given measured BMI category who were placed in the same category by self-report, i.e. true positives) and specificity (the percentage of participants outside that category who were correctly classified as such, i.e. true negatives) to quantify the classification accuracy of self-reported BMI. Cohen's Kappa (κ) statistic was also used to assess the degree of agreement after accounting for agreement at random 13 .
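The agreement measures named above can be sketched as follows. This is an unweighted illustration on invented BMI values: it ignores the survey weights and clustering used in the actual analysis, the limits of agreement use the conventional mean difference ± 1.96 SD, and all function names are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(reported, measured):
    """Bland-Altman limits: mean difference +/- 1.96 SD of the differences."""
    diffs = [r - m for r, m in zip(reported, measured)]
    centre, spread = mean(diffs), stdev(diffs)
    return centre - 1.96 * spread, centre + 1.96 * spread


def sensitivity_specificity(test_positive, truly_positive):
    """Sensitivity and specificity of a binary classification against a gold standard."""
    pairs = list(zip(test_positive, truly_positive))
    tp = sum(1 for t, g in pairs if t and g)
    fn = sum(1 for t, g in pairs if not t and g)
    tn = sum(1 for t, g in pairs if not t and not g)
    fp = sum(1 for t, g in pairs if t and not g)
    return tp / (tp + fn), tn / (tn + fp)


def cohens_kappa(a, b):
    """Agreement between two classifications, corrected for chance agreement."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    expected = sum((a.count(label) / n) * (b.count(label) / n)
                   for label in set(a) | set(b))
    return (observed - expected) / (1 - expected)


# Invented BMI values (kg/m2) for eight participants.
measured_bmi = [22.0, 27.5, 31.0, 30.4, 35.2, 24.1, 29.0, 33.3]
reported_bmi = [21.8, 26.9, 29.6, 29.5, 33.8, 24.0, 28.1, 32.0]

obese_measured = [b >= 30.0 for b in measured_bmi]
obese_reported = [b >= 30.0 for b in reported_bmi]

lower, upper = limits_of_agreement(reported_bmi, measured_bmi)
sens, spec = sensitivity_specificity(obese_reported, obese_measured)
kappa = cohens_kappa(obese_reported, obese_measured)
print(lower, upper, sens, spec, kappa)
```

In the toy data two of the four participants who are obese by measurement fall just below 30kg/m2 by self-report, so sensitivity is low even though every self-reported obese case is a true case.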

Linear regression modelling was used to develop prediction equations that estimated measured values of height and weight from self-reported height and weight, with appropriate adjustments for any variables independently associated with misreporting 8-10 16 24 . This involved three main steps.

Step 1: Predictors of misreporting

First, as in other studies 6 9 20 25 , participants with an absolute difference (self-reported minus measured) ≥4SD from the mean were considered outliers (with possible unrealistic reported values): these were excluded (height: n=189; weight: n=276) to avoid potentially undue influence on the equations. For those remaining (n=38,477), separately for height and weight, linear regression modelling was used to identify which variables were independently predictive of the difference between self-reported and measured values. Continuous variables for self-reported height and weight were entered as linear, quadratic, and cubic terms to allow for possible non-linearity 10 ; higher power terms were excluded if the slope was not statistically different from zero (p>0.05). Independent variables included age-group (16-17y, 18-19y, and in 5-yr intervals up to 85y+); ethnic group; Government Office Region; cigarette smoking status (current, ex-, never-smoker); general health (very good/good, fair, bad/very bad); presence of a limiting longstanding illness; and two indicators of socioeconomic status. Based on joint Wald tests, variables statistically significant as a whole (p<0.05) were candidates for inclusion in prediction equations.
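The outlier rule in Step 1 might be sketched as below. The sketch assumes that "≥4SD from the mean" refers to the mean and SD of the difference (self-reported minus measured) and that the rule is applied to one measure at a time; the data and the function name are invented for illustration.

```python
from statistics import mean, stdev


def flag_outliers(reported, measured, k=4.0):
    """Flag records whose (reported - measured) difference lies k or more SDs from the mean difference."""
    diffs = [r - m for r, m in zip(reported, measured)]
    centre, spread = mean(diffs), stdev(diffs)
    return [abs(d - centre) >= k * spread for d in diffs]


# Invented heights (cm): 39 plausible over-reports of 0.5-2.0 cm and one implausible 30 cm over-report.
offsets = [0.5, 1.0, 1.5, 2.0]
measured_height = [170.0] * 40
reported_height = [170.0 + offsets[i % 4] for i in range(39)] + [200.0]

flags = flag_outliers(reported_height, measured_height)
kept = [(r, m) for r, m, out in zip(reported_height, measured_height, flags) if not out]
print(sum(flags), len(kept))
```

Because a single extreme value inflates the SD it is judged against, a rule of this kind only removes gross discrepancies and leaves ordinary misreporting in the modelling sample.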

Step 2: Deriving the prediction equations

Secondly, as in other studies 9 20 , the sample was randomly split into a training (hereafter, split-sample A) and testing (split-sample B) dataset using a 70:30 ratio. Split-sample A (n=27,035) was used for model-fitting and refinement, and parameter estimation (prediction equations). Split-sample B (n=11,442) provided an independent assessment of predictive accuracy of the equations.

To develop the prediction equations via linear regression modelling, measured height and weight were the dependent variables 8 , and self-reported height and weight, age-group, and any other variables significantly associated with misreporting (from Step 1) were the independent variables. In a final refinement step, only significant variables (p<0.05) were retained (hereafter, full models) for reasons of parsimony (i.e. achieving similar goodness-of-fit with as few predictors as possible). We also fitted models containing age-group only as a predictor of misreporting (hereafter, reduced models): such equations may be particularly useful for researchers using surveys that do not contain all the independent variables retained in the full models.
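The split-sample workflow can be sketched on simulated data. Everything below is an assumption made for illustration: the heights are simulated, a single binary "older" indicator stands in for the age-group variable, the model is unweighted ordinary least squares solved through the normal equations, and none of the coefficients correspond to the published equations.

```python
import random


def ols(X, y):
    """Ordinary least squares: solve (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Simulated participants: (self-reported height, older-age indicator, measured height).
rng = random.Random(0)
rows = []
for _ in range(1000):
    measured = rng.gauss(170.0, 7.0)
    older = 1.0 if rng.random() < 0.5 else 0.0
    reported = measured + 1.0 + 1.5 * older + rng.gauss(0.0, 1.5)
    rows.append((reported, older, measured))

# Random 70:30 split into split-sample A (fitting) and split-sample B (testing).
rng.shuffle(rows)
cut = len(rows) * 7 // 10
sample_a, sample_b = rows[:cut], rows[cut:]

# Fit measured height on self-reported height and the age indicator in split-sample A.
X_a = [[1.0, rep, old] for rep, old, _ in sample_a]
y_a = [meas for _, _, meas in sample_a]
b0, b_rep, b_old = ols(X_a, y_a)


def corrected(rep, old):
    """Apply the fitted prediction equation to a self-reported value."""
    return b0 + b_rep * rep + b_old * old


# Mean error (estimate minus measured) in split-sample B, before and after correction.
err_self = sum(rep - meas for rep, old, meas in sample_b) / len(sample_b)
err_corr = sum(corrected(rep, old) - meas for rep, old, meas in sample_b) / len(sample_b)
print(err_self, err_corr)
```

Fitting in one subsample and evaluating in the other means the reduction in mean error is assessed on participants who played no part in estimating the coefficients.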

Step 3: Assessing the predictive accuracy of the equations

Thirdly, the prediction equations generated from split-sample A were applied to split-sample B. Corrected BMI was derived using the predicted values of height and weight. Descriptive statistics (means for continuous variables and percentages for BMI categories) were used to compare self-reported, measured, and corrected values. Using measured BMI as gold standard, estimates of sensitivity and specificity were calculated to quantify by how much corrected BMI improved BMI classification. Finally, to investigate the presence of any systematic error in self-reported BMI (i.e. the "flat slope syndrome" mentioned earlier), we fitted a linear regression model in which the difference between self-reported and measured BMI was the dependent variable and measured BMI was the independent variable. To investigate any such error in corrected BMI, we fitted a linear regression model in which the difference between corrected and measured BMI was the dependent variable and measured BMI was the independent variable 16 . For each case, a significantly negative slope for BMI would indicate the tendency for BMI underestimation to increase as measured BMI increases.
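The "flat slope" check can be illustrated with a simple least-squares slope. The BMI values below are invented and noise-free, constructed so that underestimation grows by 0.1 units per unit of measured BMI; the sketch does not reproduce the significance test or the survey weighting used in the analysis.

```python
def simple_regression(x, y):
    """Slope and intercept of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: measured BMI from 20 to 40, with under-reporting that increases with BMI.
measured_bmi = [20.0 + 2.0 * i for i in range(11)]
reported_bmi = [m - 0.2 - 0.1 * (m - 20.0) for m in measured_bmi]

# Dependent variable: reported minus measured BMI; independent variable: measured BMI.
differences = [r - m for r, m in zip(reported_bmi, measured_bmi)]
slope, intercept = simple_regression(measured_bmi, differences)
print(slope, intercept)
```

A negative slope of this kind is what the text describes: the gap between reported (or corrected) and measured BMI widens as measured BMI rises.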

We also examined predictive accuracy of equations derived using a simpler approach, based on linear regression models with the measured height/weight values as the dependent variable and the self-reported height/weight values and age (in single-year bands but trimmed at 90y) as independent variables 10 26 . Linear, quadratic and cubic terms were entered for self-reported height and weight, and linear and quadratic terms were entered for age. Each term was retained in the model (thereby included in the equation) irrespective of statistical significance to allow for possible associations in future datasets 10 . This approach (implemented using formulae based on HSE 2012-14 data) is currently used to correct self-reported height and weight in the ALS for monitoring levels of excess weight across English local authorities.
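The terms entered in this simpler approach can be written out as a design row. The sketch assumes that "trimmed at 90y" means ages above 90 are capped at 90; the coefficient vector is a placeholder that simply returns the self-reported value and is not taken from the published equations.

```python
def design_row(self_reported, age_years):
    """Intercept, linear/quadratic/cubic self-report terms, linear/quadratic age terms."""
    age = float(min(age_years, 90))  # assumed interpretation of trimming at 90y
    return [1.0,
            self_reported, self_reported ** 2, self_reported ** 3,
            age, age ** 2]


def predict(coefficients, self_reported, age_years):
    """Predicted measured value: dot product of coefficients and the design row."""
    row = design_row(self_reported, age_years)
    return sum(c * x for c, x in zip(coefficients, row))


# Placeholder coefficients only: they leave the self-reported value unchanged.
placeholder = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(design_row(170.0, 95))
print(predict(placeholder, 170.0, 95))
```

Keeping every term in the row regardless of statistical significance mirrors the description in the text, where terms were retained to allow for possible associations in future datasets.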

All analyses accounted for the complex survey design, incorporating survey non-response weights and geographical clustering. Statistical significance was set at p<0.05 for two-tailed tests, with no adjustment for multiple comparisons. HSE datasets are available via the UK Data Service (www.ukdataservice.ac.uk) [27] [28] [29] [30] [31] [32] and are subject to an end user license agreement. Dataset preparation was performed in SPSS v24.0 (IBM Corp., Armonk, New York); analysis was performed in Stata v17.1 (StataCorp LLC, College Station, Texas).

Patients and the public were not involved in the design, conduct, reporting, or dissemination plans of our research (which involves secondary analysis of existing data). The project was shaped by discussions with the HSE Steering Group, including representatives from various national government agencies and local authorities.

To compare self-reported and measured data, Table 1 presents the difference between means (height, weight, BMI) and prevalence (BMI categories). On average, men overestimated height more than women (difference: 1.6cm and 1.0cm, respectively; p<0.001 for sex difference), whilst women underestimated weight more than men (difference: -2.1kg and -1.5kg; p<0.001). Underestimation of BMI was slightly larger on average for women than for men (-1.1kg/m2 and -1.0kg/m2; p<0.001). About three-quarters of adults had self-reported BMI values within 2 units of measured BMI (data not shown); however, underestimation of BMI was greater in higher BMI categories (Supplementary Data Table S2). The Bland-Altman LOA show the range within which approximately 95% of the differences between self-reported and measured BMI would be expected to fall 33 . The LOA for self-reported BMI ranged from -4.6 to 2.5 BMI units for men and from -5.0 to 2.7 units for women (unweighted data; Supplementary Data Figure S1). For both sexes, the highest percentiles of BMI were lower for self-reported than for measured data, indicating more compressed distributions (Supplementary Data Table S2; Figure S2). The prevalence of overweight including obesity was 8.0 and 9.0 percentage points (pp) lower for self-reported than measured data for men and women, respectively. The equivalent figures for obesity were 6.8 and 6.0pp.


Using measured BMI as gold standard, 77% of men and 78% of women were correctly classified using self-reported BMI; the sensitivity of the self-reported obese category was 69% for men and 72% for women (Table 2). Of those classified as overweight, but not obese (25.0-29.9kg/m2) based on self-reported BMI, 19% of men and 22% of women were classified as obese based on measured BMI.

... Table S4).

The regression coefficients (prediction equations) for the aforementioned variables, based on the models with measured height as the dependent variable, correct self-reported height upwards (positive signs) or downwards (negative signs) as appropriate (Table S4): e.g., compared with those in the White group, the predicted measured height from the self-reported height of participants in the Asian group is corrected downwards by 0.61cm (men) and 1.78cm (women). The reduced models contained age-group only as a predictor of misreporting. R-squared values (proportion of variance in measured height explained by the independent variables) for the full and reduced models were similar for men (88.6% and 88.5%, respectively) and for women (87.7% and 87.1%).

For men, being an ex-regular or never (vs. current) smoker was associated with greater underestimation of weight; lower educational status (O level or no qualifications vs. having a degree) was associated with lower underestimation. For women, being an ex-regular or never (vs. current) smoker and being in the Black (vs. White) ethnic group were associated with greater underestimation of weight (Supplementary Data Table S5). Table S5 shows the prediction equations for deriving corrected values of weight: e.g. compared with current smokers, the predicted measured weight from the self-reported weight of never smokers was adjusted upwards by 0.65kg (men) and 0.18kg (women). R-squared values for the full and reduced models were similar for men (94.3% for both models) and for women (95.1% for both models).

The prediction equations were applied to split-sample B to generate corrected values. Table 3 shows the means (height, weight, BMI) and percentages (BMI categories) for the self-reported, measured, and corrected values. In each case, corrected estimates were closer than self-reported estimates to measured estimates, and the differences in means between corrected and measured values were not significantly different from zero, with the exception of height for women (e.g. measured height was underestimated on average by 0.1cm in the full model (p=0.003; data not shown)). The LOA for corrected BMI ranged from -2.9 to 2.9 BMI units for men and from -3.0 to 3.0 units for women (unweighted data; Supplementary Data Figure S3). For each set of data, Figure 1 plots the mean values for height and weight by sex and age-group.


Overweight including obese (≥25.0kg/m2); Obese (≥30.0kg/m2). Formulae used to generate corrected BMI values are shown in Tables S4 and S5 (Supplementary data).

Compared with measured BMI, obesity prevalence based on self-reported BMI was 6.5 and 5.2pp lower for men and women, respectively. The corresponding value for corrected BMI (full models) was 0.8pp for both sexes (Table 3; Figure 2). The prevalence of overweight including obesity was slightly overestimated (1.2% men; 0.4% women). Table 4 shows the cross-tabulation of corrected (full models) and measured BMI categories.

83% of men and 84% of women were correctly classified. Sensitivity values of the obesity category based on self-reported BMI were 71% and 75% for men and women, respectively (data not shown). The sensitivity of obesity based on corrected BMI increased to 86% and 87% for men and women, respectively, an absolute improvement over self-reported data of 15.0 and 12.2pp. The corresponding sensitivities based on reduced models were similar (86% both sexes: data not shown). In contrast, sensitivity values in the normal weight category (18.5-24.9kg/m2) were higher for self-reported (91% men; 94% women) than corrected BMI (83% men; 89% women).

The results of linear regression analyses in which measured BMI was the predictor of error are shown in supplementary data (Table S6). The significantly negative slope for BMI (p<0.001 for both sexes) indicates that the differences between corrected and measured BMI showed a systematic, though smaller, bias in relation to measured BMI.

Table 4. Corrected BMI categories (full model) cross-tabulated against measured BMI categories.

The fitted regression equations describing the relationship between self-reported and measured height/weight with age as the single continuous predictor of misreporting are shown as supplementary data (Tables S7-S8; Figure S4 plots the mean values for height and weight by sex and age-group). The differences in means between the corrected and measured values were not significantly different from zero, with the exception of height for women; the sensitivity estimates for this simpler method were similar to those for the equations described above (overweight including obesity: 94% men and 93% women; obesity: 85% men and 86% women; Table S9).


Using pooled HSE 2011-16 data containing self-reported and interviewer-measured height and weight, we developed two sets of prediction equations that can be easily used to correct for biases in self-reported BMI. Although not perfectly predictive of measured BMI, corrected BMI performed better than self-reported BMI in more closely approximating obesity prevalence based on measured BMI. Applying corrected values also increased sensitivity of the obesity category. Using measured BMI as gold standard, the sensitivity of obesity for the full model-corrected BMI was estimated as 86% and 87% for men and women, respectively, an improvement in absolute terms over self-reported BMI by 15.0 and 12.2pp.

In agreement with other studies 11 , the present study showed that mean height was overestimated by self-report relative to measured height, and that mean weight was underestimated, resulting in a net underestimation of mean BMI. Our estimates for the difference between means in height, weight and BMI were within the range shown by systematic literature reviews 4 5 . In agreement with other studies 9 11 , we found that women underestimated weight more than men but that men overestimated height more than women.

As reported elsewhere 34 , we do not know whether the differences between self-reported and measured anthropometrics arise from participants' lack of knowledge about their current height and weight or from misreporting of information that is accurately known.

In agreement with other studies 11 , the mean differences found in the present study between self-reported and measured data were moderate on average (around -1kg/m 2 for BMI).

However, it is important to look beyond differences in means, as moderate differences on average can be accompanied by (i) a large degree of variability between individuals (shown by the SD of the difference between self-reported and measured data) 4 , (ii) a compression of the BMI distribution (shown by lower values at the highest percentiles for self-reported than for measured data) 11 , and (iii) sizeable misclassification of BMI categories based on self-reported data due to such compression (e.g. shifting adults below the BMI cut-off of 30kg/m2 for obesity), resulting in an underestimation of obesity prevalence 11 . As reported elsewhere 16 , a large degree of misclassification can occur if a non-trivial number of adults have a moderate difference between self-reported and measured BMI at the margins of broadly defined BMI categories. The positively skewed distribution of BMI increases this effect.


Our results were mainly in agreement with previous studies 9 16 : developing prediction equations to correct self-reported height and weight by socio-demographic and health-related variables to more closely approximate measured values is feasible. First, in our main analysis, corrected BMI reduced the underestimation of obesity prevalence compared with self-reported BMI 9 , but it remained underestimated (in absolute terms) by 0.8pp for both sexes.

As found elsewhere 16, measured BMI significantly predicted the difference between corrected and measured BMI, indicating that the systematic error in self-reported BMI was not eliminated by the prediction equations. The presence of such residual bias has been identified as a reason for not using equations to predict measured values from self-reported values 7. However, the usefulness of prediction equations has been demonstrated by the ability to reduce considerably, although not eliminate, the differences between self-reported and measured anthropometrics across a few, easily gathered socio-demographic and health-related variables 16, as well as increasing the sensitivity of obesity classification vs. self-reported data 8. Our results also showed that the prediction equations decreased sensitivity in the normal weight category (through erroneously shifting a proportion of normal weight participants to the overweight but not obese category, leading to slight overestimation of levels of excess weight). This finding was consistent with previous studies 9 20, and likely reflects higher levels of reporting accuracy of anthropometry data among normal weight adults.
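The sensitivity comparison described above can be sketched as follows, treating measured BMI as the gold standard and a cut-off of 30kg/m2 for obesity. The BMI values are invented for illustration (they are not HSE data), and the function name is an assumption.

```python
from typing import Sequence, Tuple

OBESITY_CUTOFF = 30.0  # BMI cut-off for obesity in kg/m2


def sensitivity_specificity(
    measured: Sequence[float],
    estimated: Sequence[float],
    cutoff: float = OBESITY_CUTOFF,
) -> Tuple[float, float]:
    """Sensitivity and specificity of obesity classification.

    Measured BMI is treated as the gold standard; `estimated` holds
    self-reported or corrected BMI for the same participants.
    """
    tp = fn = tn = fp = 0
    for m, e in zip(measured, estimated):
        truly_obese = m >= cutoff
        flagged_obese = e >= cutoff
        if truly_obese and flagged_obese:
            tp += 1
        elif truly_obese:
            fn += 1
        elif flagged_obese:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented values for eight hypothetical participants.
measured_bmi = [24.0, 27.5, 29.6, 30.4, 31.0, 33.2, 36.8, 22.1]
self_reported_bmi = [23.5, 26.8, 28.9, 29.3, 29.9, 32.0, 35.1, 22.0]
corrected_bmi = [24.4, 27.8, 29.9, 29.8, 30.8, 33.1, 36.3, 22.8]

print("self-reported:", sensitivity_specificity(measured_bmi, self_reported_bmi))
print("corrected:    ", sensitivity_specificity(measured_bmi, corrected_bmi))
```

In this toy example the corrected values identify three of the four participants with measured obesity, compared with two of four for self-report: sensitivity improves, but one case is still missed, mirroring the residual underestimation discussed above.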

Second, as elsewhere 8 9 25 35, our similar results based on full and reduced models, and those of a simpler approach predicting measured values directly from self-reported values and age, confirmed that no single model stood out as the best overall candidate, and that adding variables such as ethnic group and educational status to equations only marginally improved predictive accuracy (shown by similarity in R-squared and estimates of sensitivity). As reported elsewhere 36, differences between demographic subgroups in the mean difference between self-reported and measured weight may be explained to some extent by differences in measured weight: adjustment for weight in regression models therefore results in attenuation of subgroup differences 37. It may be reasonably concluded that including additional variables such as educational status and ethnic group does not add enough predictive power to the models to justify the added complexity of including them in prediction equations.
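The simpler approach (predicting measured values directly from self-reported values and age) can be sketched on synthetic data as below. This is a minimal illustration under stated assumptions, not the equations developed in this study: the data-generating rule, the 70/30 training/test split, the random seed and the helper names fit_ols and predict are all invented for the example, and ordinary least squares is solved with the normal equations in plain Python.

```python
import random


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Predicted measured weight for one row [1, self-reported weight, age]."""
    return sum(c * x for c, x in zip(coefs, row))


# Synthetic data (assumption): heavier people under-report weight more,
# with an arbitrary small age term and random noise.
rng = random.Random(0)
rows, measured = [], []
for _ in range(2000):
    true_weight = rng.uniform(50.0, 120.0)
    age = rng.uniform(16.0, 90.0)
    reported = (true_weight - 1.5 - 0.05 * (true_weight - 80.0)
                + 0.02 * (age - 50.0) + rng.gauss(0.0, 2.0))
    rows.append([1.0, reported, age])
    measured.append(true_weight)

# Fit on the training set, evaluate on the held-out test set.
split = int(0.7 * len(rows))
coefs = fit_ols(rows[:split], measured[:split])
test_rows, test_measured = rows[split:], measured[split:]
n_test = len(test_rows)

mean_reported_error = sum(
    r[1] - m for r, m in zip(test_rows, test_measured)) / n_test
mean_corrected_error = sum(
    predict(coefs, r) - m for r, m in zip(test_rows, test_measured)) / n_test

print("intercept, self-report slope, age slope:", coefs)
print("mean self-reported minus measured (kg):", mean_reported_error)
print("mean corrected minus measured (kg):", mean_corrected_error)
```

On the held-out data the mean error of the corrected values should be close to zero, whereas self-report retains the built-in underestimation. A mean error near zero does not imply that error is removed at every level of measured weight, which is the residual bias noted above.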

Pooling data across six years ensured a sample size large enough to compare self-reported and measured height and weight overall and by various socio-demographic and health-related variables, and allowed splitting the data into training and test datasets. Using a regression-based approach we were able to correct for differences in misreporting of height and weight across various subgroups. Unlike other studies 6, there was no time lapse between the collection of self-reported and measured height and weight, and consistent methodology was used in each survey. We used two approaches to develop prediction equations to enable researchers to evaluate for themselves whether either approach, and if so which, best suits their data and goals.

A study limitation is the sizeable number of participants excluded from our analyses due to missing data on height and weight. Our findings could be biased if complete cases were systematically different from those with missing data (e.g. if those who refused to be measured were more likely than those who did not refuse to underestimate their weight, at least partly because they were heavier), and such bias could result in prediction equations that are inaccurate 20. To partially evaluate this bias we compared obesity prevalence based on self-reported BMI among those with and without measured BMI 25. Obesity prevalence via self-reported BMI was higher for those without measured BMI (22% men; 25% women) than for those with measured BMI (18% men; 19% women), indicating that heavier participants were less likely to agree to direct measurement 25. Furthermore, as in other studies, in developing the prediction equations we excluded a small but non-trivial number of participants with a large observed difference between self-reported and measured data: this exclusion may have limited the generalisability of our analyses to some extent. Such cases would be impossible to identify and exclude in surveys that collect self-report but not measured data 13.

Other limitations include potentially relevant variables that we could not include in regression models because they were not available in all HSE years (e.g. physical activity; perceptions of weight). We also decided a priori to use age as a categorical rather than continuous variable in our main analysis. As only categorical age is now provided on publicly available HSE datasets (to preserve anonymity of participants), our approach enables researchers to easily replicate our results and revise/update equations accordingly. (Continuous age was used in our sensitivity analysis to replicate the approach used to correct self-reported height and weight in the ALS.) Finally, although we showed no linear trend in misreporting, the prediction equations we have developed based on 2011-16 data might not be entirely applicable to more recent data. This might be the case if obesity prevalence has greatly increased or decreased, such that the composition of this group changed, and/or the social desirability of having a normal BMI increased or decreased. Changes in accuracy of home scales, or in the up-to-date knowledge of one's own height and weight (e.g. if health workers began to routinely measure height as part of BMI assessment, and relay that information to patients) could also affect the applicability of these equations to more recent data. Likewise, any potential increase in misreporting of weight associated with weight gain during the Covid-19 pandemic (e.g. due to fewer opportunities for outdoor physical activity) is not taken into account by the equations developed herein.

Our findings must also be interpreted with caution. HSE 2011-16 participants are likely to have anticipated that interviewers would take direct measurements of height and weight, resulting in more "truthful" reporting than, for example, in a telephone interview where participants would not anticipate being measured 10. Previous studies have shown that misreporting of height (except for older adults) and weight was smaller for in-house interviews compared with telephone interviews 35. More "truthful" reporting is associated with an underestimation of the differences between self-reported and measured height and weight 3. Applying the prediction equations developed in the present study to surveys which collect height and weight data by telephone interviews or mailed questionnaires would likely underestimate obesity prevalence to a greater extent than shown herein. Finally, as cautioned elsewhere 9 11 20, prediction equations are specific to time, place, target population, and methods of data collection. We do not assume that these equations developed using HSE data are applicable to non-HSE samples with different socio-demographic, health and self-reported anthropometric profiles.

The prediction equations developed in the present study improved the sensitivity of self-reported obesity and took into account the variations in potential misreporting of height and weight by socio-demographic and health-related variables. Including additional socio-demographic variables does not add enough predictive power to justify the added complexity of including them in prediction equations. Potentially these equations could be used to adjust for errors in self-reported BMI; however, important caveats to their use need to be considered.

The HSE datasets generated and analysed during the current study (age banding for participants) are available via the UK Data Service (UKDS: https://ukdataservice.ac.uk/), subject to their end user license agreement.

1. Body-mass index and mortality among 1.46 million white adults

2. Regional inequalities in self-reported conditions and non-communicable diseases in European countries: findings from the European Social Survey (2014) special module on the social determinants of health

3. Correction of self-reported BMI based on objective measurements: a Belgian experience

4. A comparison of direct vs. self report measures for assessing height, weight and body mass index: a systematic review

5. A comparison of measured versus self-reported anthropometrics for assessing obesity in adults: a literature review

6. Self-reported weight and height

7. Prediction equations do not eliminate systematic error in self reported body mass index

8. The validity of obesity based on self reported weight and height: implications for population studies

9. The feasibility of establishing correction factors to adjust self-reported estimates of obesity

10. Regression models to predict corrected weight, height and obesity prevalence from self-reported data: data from BRFSS 1999-2007

11. Comparisons of self reported and measured height and weight, BMI, and obesity prevalence from national surveys

12. Effects of age on validity of self-reported height, weight, and body mass index: findings from the Third National Health and Nutrition Examination Survey, 1988-1994

13. Accuracy of body mass index categories based on self-reported height and weight among women in the United States

14. The validity of self-reported weight in US adults: a population based cross-sectional study

15. Underreporting of BMI in adults and its effect on obesity prevalence estimations in the period

16. Accuracy and usefulness of BMI measures based on self-reported weight and height: findings from the NHANES & NHIS

17. The predictive validity of body mass index based on self-reported weight and height

18. Evaluation of a suggested novel method to adjust BMI calculated from self reported weight and height for measurement error

19. Cohort profile: the health survey for England

20. The usefulness of "corrected" body mass index vs. self-reported body mass index: comparing the population distributions, sensitivity, specificity, and predictive utility of three correction equations using Canadian population-based data

21. Sexual dimorphism in body composition across human populations: associations with climate and proxies for short and long term energy supply

22. Determinants of the population health distribution: an illustration examining body mass index

23. World Health Organization. BMI classification

24. Validation and calibration of self-reported height and weight in the Danish Health Examination Survey

25. Estimates of obesity based on self-report versus direct measures

26. Validity of self-reported height and weight in 4808 EPIC-Oxford participants

27. Health Survey for England

28. Health Survey for England

29. Health Survey for England

30. Health Survey for England

31. Health Survey for England

32. Health Survey for England

33. Use and reporting of Bland-Altman analyses in studies of self-reported versus measured weight and height

34. Body mass definitions of obesity: sensitivity and specificity using self-reported weight and height

35. Trends in national and state-level obesity in the USA after correction for self-report bias: analysis of health surveys

36. Validation of self-reported height and weight in a large, nationwide cohort of US adults

37. Bias in reported body weight as a function of education, occupation, health and weight concern

The authors thank the interviewers and nurses, the participants in the Health Survey for England series, and colleagues at NatCen Social Research. The authors also thank NHS Digital. NHS Digital is the trading name of the Health and Social Care Information Centre.

The Health Survey for England was funded by NHS Digital. The authors are funded to conduct the annual HSE but this specific study was not funded. NHS Digital had no role in the analysis, interpretation of data, decision to publish or preparation of the manuscript for this specific study.

SS conceptualised the study and was responsible for conducting the analyses, interpreting the results and drafting the manuscript. SS, LNF, AM and JSM critically revised the manuscript. All authors have read and approved the manuscript.

The procedures used in the HSE to obtain informed consent from survey participants are very closely scrutinised by an NHS ethics committee each year. Approval was obtained from the

The authors declare that they have no competing interests.