key: cord-0173179-uk46dlie
authors: Koltai, Júlia; Vásárhelyi, Orsolya; Röst, Gergely; Karsai, Márton
title: Monitoring behavioural responses during pandemic via reconstructed contact matrices from online and representative surveys
date: 2021-02-17
journal: nan
DOI: nan
sha: 3c704923b64d1c63da83d0cd46c61eb2481f451b
doc_id: 173179
cord_uid: uk46dlie

The unprecedented behavioural responses of societies have evidently been shaping the COVID-19 pandemic, yet it is a significant challenge to accurately monitor the continuously changing social mixing patterns in real time. Contact matrices, usually stratified by age, summarise interaction motifs efficiently, but their collection relies on conventional representative survey techniques, which are expensive and slow to obtain. Here we report a data collection effort involving over 2.3% of the Hungarian population to simultaneously record contact matrices through a longitudinal online survey and a sequence of representative phone surveys. To correct the non-representative biases characterising the online data, we develop a reconstruction method that uses census data and the representative samples to provide a scalable, cheap, and flexible way to dynamically obtain closer-to-representative contact matrices. Our results demonstrate the potential of combined online-offline data collections to understand the changing behavioural responses determining the future evolution of the outbreak, and to inform epidemic models with crucial data.

and the content of the questionnaire and explain our data collection methods in detail. Subsequently we introduce our methodology for weighting age contact matrices collected online, in which the dimensions of the weights are derived from representative data collections conducted in the same period. Finally, we demonstrate our methodology on contact matrices observed during the first wave of the COVID-19 pandemic in Hungary.
The MASZK questionnaire

The primary purpose of our questionnaire was to dynamically estimate the age contact matrices of people in different environments (like home, work, school, or elsewhere). For this reason, we asked respondents about the number of people from different age groups with whom they had contacts. First, we recorded reference contact patterns by asking respondents about their contacts during a typical weekday and weekend before the COVID-19 outbreak in Hungary (13th March 2020). Second, we recorded actual contact patterns of participants by asking them about their contact activities on the day before their actual response. We classified close contacts as physical contacts (direct physical contacts without using personal protective equipment), and proxy contacts (two persons stayed closer than 2 meters to each other for at least 15 minutes) 44. Individual contact patterns were recorded as the approximate number of contacts between the ego and their peers from the age groups 0-4, 5-14, 15-29, 30-44, 45-59, 60-69, 70-79, and 80+. For the sake of potential adoption of our method and reproducibility of results, we share the core part of our questionnaire, including the essential questions for our analysis, in the Supplementary Information (SI) 45.

Figure 1. Contact dynamics, representative and reconstructed age contact matrices. Age contact matrices measured during the (a) reference and (b) pandemic period via CATI survey methodology on a representative sample (blue) and via weighted non-representative online data collection (orange) after reconstruction (for methodology see section on Construction of age contact matrices). Data for children under 18 (indicated with asterisk and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups. (c) Timeline of early pandemic regulations in Hungary and the average number of per capita daily proxy social contacts in rural areas (solid green line), the central area (red solid line) of Hungary, and in the whole country (blue solid line). While online data collection was continuously ongoing after the 23rd March 2020, representative data via telephone surveys were collected during the periods assigned by diagonal shading. Blue shades indicate telephone census collection, while grey shades cover the online observation period of the actual study. Both methods retrospectively recorded the contact patterns from the reference period (before 13th March 2020), except for age groups under 15 in the online questionnaire.

Online data collection

MASZK was originally developed as an online survey 40, and was later published as a mobile phone application 46. Participation was, and still is, voluntary and the data collection was completely anonymous (for further details see the Methods section). The data collection started on the 23rd March 2020 and is still ongoing (as of Spring 2021). While keeping the core questionnaire (shared in the SI) intact, the additional content has been adjusted to the most pressing issues of the pandemic at a given time, like work and home office conditions, job security, self-protection practices, or intention to be vaccinated once a vaccine is available. Respondents were asked to fill out the questionnaire on as many days as they could, providing ongoing relevant information about their contacts. To date, the questionnaire has been completed 405,984 times by 226,086 respondents, which accounts for ∼2.3% of the population of Hungary.
The collected data sensitively reflects public awareness and reactions to national regulations, as can be followed in Fig. 1c. During the reference period, until the 13th of March 2020 when the first regulations were announced, the average daily number of proxy social contacts of individuals was measured at ∼25. This number dropped radically by 88% to a value of ∼3 after a national lock-down was introduced. Subsequently, the lock-down was lifted first in rural Hungary (4th May 2020) and later in the more densely populated central region (18th May 2020). This was followed by a modest increase in the number of social contacts to ∼8, which, however, never reached its reference value until the end of the observed period (20th June 2020). In this work we analyse a period of three consecutive weeks (29th April to 19th May 2020) during the first relaxation of the restrictive measures, as both types of data collection campaigns were conducted in these days. Using online surveys we recorded 30,770 responses from 12,208 people during this three-week period (see Methods, and SI, Table S1).

Nationally representative telephone survey

In addition to the ongoing online data collection, CATI surveys were conducted each month by a market research company, asking the same questionnaire on a nationally representative sample of people. The sample size was 1,500, which is 50% larger than the conventional sample size for nationally representative samples in Hungary. Data collection campaigns were conducted at the beginning of the lock-down period (2-7 April 2020), during the first relaxation period (6-12 May 2020), and in each month after May 2020. In the current work, we analyse the data of the second period, where two thirds of the data was collected about weekdays and one third about weekends (for further details see Methods).
Our goal with this data collection method was to obtain more realistic and representative data about the contact patterns of the Hungarian population, and to compare similar data coming from different sources in order to develop tools for reducing the biases inherent in the non-representative sample of a voluntary online survey.

In order to construct the age contact matrix of social contacts for the whole population, we collected information about the number of proxy and physical contacts of each respondent $x$ during the reference and actual periods in different settings. For a given social connection type, period, and setting, using the age of the respondents we assigned them to one of eight age groups $A$ (as defined in section The MASZK questionnaire), while doing the same for their contacts too. Thus we obtained, for each user $x$, an individual contact matrix $M_x$ coding the number of contacts they had with others from age groups $i \in A$. Assuming an individual representative weight $w_x$ for each respondent, we computed a weighted average contact matrix $M_{ij}$, which was column-wise normalised, thus giving us the weighted average number of contacts between a person from age group $j$ and someone from age group $i$. Note that this matrix is not symmetric, and in case of a fully representative sample, weights would be $w_x = 1$, simplifying the computation to a simple averaging process (see Methods).

Despite the many advantages of open online surveys, due to voluntary participation they often record a highly non-representative sample of the observed population, which may cause misleading conclusions about the nature of the epidemic process. To identify the most relevant social-demographic dimensions along which the online survey data is biased, we compare the non-representative online data to the corresponding national census. In most cases, the tests for representativeness of an online survey focus on standard social-demographic characteristics of the observed sample and the population.
In our case, however, the relevant characteristics are those that significantly influence the contact patterns of the respondents. To explore these underlying factors, we performed regression analysis on the proxy contacts of respondents in the representative sample recorded in the actual period (for further details see Methods and SI). As the goal of the regression analysis was to detect influencing factors relevant for a later weighting process, the independent variables of these models were limited to those that were both asked in the questionnaire and available in the census. Although we could identify several dimensions that significantly affect the contact patterns of people, we could not rule out the possibility that other dimensions, not included in the survey or not measured by the census, also influence the contact patterns significantly. These regression analyses indicated that age, employment status, education, settlement type, gender, and geographical region of the domicile are the most significant social-demographic dimensions along which our online data is non-representative. Indeed, the statistics shown in Fig. 2 indicate that respondents of the online survey are more likely to be middle aged, employed, higher educated, to live in the capital, and to be women. On the other hand, people who are lower educated, older than 70 years, or live in small settlements like towns are under-represented. These striking differences demonstrate that the analysis of the raw online survey would lead to biased contact patterns, which are hardly generalizable to the whole Hungarian population.

After the detection of those social-demographic variables that significantly affect the contact patterns observed in the representative survey, we provide a weighting methodology for the online survey to make it more accurate in the measurement of contact patterns of the whole population.
The goal of this procedure is to provide an individual weight $w_x$ for every respondent $x$, which indicates how much they need to be taken into account in the reconstructed online data to make it representative. Respondents who belong to an under-represented social group get higher weights, while those from over-represented groups get lower ones. From the results in Fig. 2 it is evident that the differences between the online and the census data are quite large. This suggests that individual weights will take values from a very broad range, which is undesirable as extreme weights can result in unstable estimates 48. Therefore, our weighting methodology needs to meet two goals: (1) bringing the online survey data closer to the Hungarian Census, by making it more representative in terms of the identified social-economic dimensions; while (2) keeping the size of the weights in a reasonable range. To meet the second goal, we applied iterative proportional fitting (IPF). IPF is a weighting methodology that adjusts the inner cells of an n-dimensional contingency table in such a way that it returns the previously provided expected row and column margins 43. In our case, the expected margins (the population distributions of the weighting variables) are taken from census data, and the contingency tables on which we apply the weighting procedure are derived from the online survey data. To obtain well fitting weights that satisfy both of our goals, we built on the age stratified structure of contact matrices. First, as they are built up from age-group-wise normalised vectors for each age group, the relative proportions of age groups can be neglected (not included as expected margins) in the IPF, which considerably decreases the variation of the obtained individual weights.
Second, as the same dimensions are not necessarily relevant in each age group (e.g., in some age groups the education level affects contact patterns, while in others the geographical location is important), the identification of relevant weighting dimensions is conducted separately in each age group, which can lead to more realistic weights. The results strengthen this argument, as very different social-demographic dimensions significantly affected the total number of actual proxy contacts in different age groups, as summarised in Table 1 (for margins see SI).

Table 1. Table of age groups, the corresponding social-demographic variables and weight limits. Social-demographic variables are listed for each age group, which were used as margins in the IPF procedure, together with the minimum and maximum values of calculated individual weights. The symbol * indicates interactions between variables. To increase the precision of the weighting procedure, regression analyses targeting the detection of those dimensions that affect the contact patterns of people were conducted separately on each age group. The selected dimensions served as expected margins in the IPF procedure. Note that some age groups are merged to make age categories populated enough and to be compatible with the age categories of the census.

Compared to standard cell weighting, IPF is less likely to result in extremely small or large weights. In our case, after the selection of the relevant dimensions, the IPF process obtained weights that stayed within the range of 0.04 and 25.49 (as presented in Table 1, with the weight distribution summarised in the SI). The closer an individual weight is to one, the more the corresponding individual is representative of their age group by the listed dimensions. The weight values characterising different age groups can thus disclose which groups are strongly biased in the online survey as compared to population data.
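The IPF step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the respondent records, the two weighting variables, and the target shares below are hypothetical, and in the study the procedure is run separately within each age group with census margins. The function `ipf_weights` and the helper `weighted_shares` are names introduced here for illustration only.

```python
def weighted_shares(rows, weights, variable):
    """Weighted share of each category of `variable` among the respondents."""
    totals = {}
    for row, w in zip(rows, weights):
        totals[row[variable]] = totals.get(row[variable], 0.0) + w
    grand_total = sum(totals.values())
    return {category: t / grand_total for category, t in totals.items()}


def ipf_weights(rows, target_margins, max_iter=200, tol=1e-12):
    """Rake individual weights until every weighted margin matches its target.

    rows: list of dicts, one per respondent (hypothetical survey records).
    target_margins: {variable: {category: target population share}}.
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        largest_change = 0.0
        for variable, target in target_margins.items():
            current = weighted_shares(rows, weights, variable)
            # Scale each respondent by target share / current share of their category.
            factors = [target[row[variable]] / current[row[variable]] for row in rows]
            weights = [w * f for w, f in zip(weights, factors)]
            largest_change = max(largest_change, max(abs(f - 1.0) for f in factors))
        if largest_change < tol:
            break
    # Rescale so that the average weight is one.
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


# Hypothetical online sample of one age group: women and capital residents
# are over-represented relative to the (made-up) target margins.
respondents = (
    [{"gender": "F", "settlement": "capital"}] * 4
    + [{"gender": "F", "settlement": "village"}] * 2
    + [{"gender": "M", "settlement": "capital"}] * 3
    + [{"gender": "M", "settlement": "village"}] * 1
)
targets = {
    "gender": {"F": 0.5, "M": 0.5},
    "settlement": {"capital": 0.3, "village": 0.7},
}

weights = ipf_weights(respondents, targets)
weight_range = (min(weights), max(weights))
```

Inspecting `weight_range` is the toy analogue of the minimum and maximum weights reported per age group: the fewer margins are enforced, the narrower this range tends to be.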
From this perspective of evaluation, the results in Table 1 suggest that the age groups of 60-69 and 15-29 are the ones closest to the population data of the same age group according to their composition by the listed dimensions. At the same time, the most problematic age group is the 70+, where observed minimum and maximum weights cover the largest range. The larger range of weights can be explained by the self-selection process of respondents, in which older generation is less likely to adopt digital technologies or have internet access, thus, those respondents, who filled out the online questionnaire from this age group are not typical representatives of the whole age group. The reconstructed online proxy age contact matrix (panel Fig. 3e ) appeared with an expected structure very similar to the representative result (panel Fig. 3c ). It exposes a strong diagonal component induced by age homophily (for annotated matrices see SI), meanwhile it suggests larger contact numbers between people of age 15-59, including the employed population of the country. These matrices were recorded during the period in May 2020, when schools were closed in Hungary. This is reflected in the higher contact numbers between the youngest age groups and their parents' generation from the age group of 30-44. However, if we compare the representative or the reconstructed (weighted) matrices to their corresponding reference period measures (see Fig. 1a and b), we evidently see the radical decrease in the number of contacts (darker shades for reference period and lighter for the later one) and the closure of schools significantly reducing the number of homophilic contacts between children of age 5-14 as compared to the reference period. To quantify the precision of our reconstruction method we compare the raw (not weighted) and reconstructed online proxy contact matrices to the corresponding representative matrix. 
Although we have demonstrated that the IPF method provides weights within a reasonable range, it is still not evident which age cells changed the most by the weighting, and which of them became closer to their representative value due to the reconstruction. In the diagonal of Fig. 3 we depict the three actual proxy contact matrices built from the representative survey (Fig. 3c), from the reconstructed (weighted) online survey (Fig. 3e), and from the raw (not weighted) online survey (Fig. 3g). First, in the upper diagonal, we compare these matrices by calculating their pairwise differences (see Fig. 3a, b and d). The difference between the representative survey and the raw online data (Fig. 3a) shows that middle-aged respondents of the online data collection had a higher number of average contacts with young and middle-aged adults than the respondents of the representative survey. Meanwhile, the non-representative online data collection underestimates the number of contacts of elderly people with others of similar age. However, while the absolute difference in the total number of contacts between the representative and the not weighted online survey was 16.4, after reconstruction this difference between the representative and weighted online matrices reduced to 14.9, which corresponds to a 9.13% increase in Relative Accuracy Gain (for precise definition see Methods). Our weighting method performs the best in cases when the difference between two matrix cells is close to 0 (white in Fig. 3b), like in the case of the 60-69 years old egos and their 30-44 years old alters. The difference matrix of the non-weighted and weighted matrices depicts the effect of the reconstruction process on the online matrix (see Fig. 3d). Although the magnitudes of differences are not large, certain heterogeneities are visible, like the decrease of contact numbers between middle-aged people and the increase of contacts between 70-79 years old egos and similar others after the reconstruction.

Figure 3. Results of iterative proportional fitting. Normalized actual proxy contact matrices (green diagonal), their pairwise difference matrices (above diagonal) and pairwise two-tail T-test results (below diagonal) are depicted for the online non-weighted, online weighted, and representative matrices. In the difference matrices red or blue cells indicate that the source matrix (column label) appeared with higher or lower number of average contact than the target (row label) in a given cell. For results of pairwise two-tail T-tests blue to yellow cells (corresponding to p > 0.05, assigned by an arrow beside the colorbar) indicate that the given cell is not significantly different in the source (column label) and target (row label) matrices. Data for children under 18 (indicated with asterisk and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups (see Limitations).

To further quantify the goodness of the weighting in detail, we tested whether a cell of a contact matrix is significantly different from the same cell of another contact matrix. Each cell of a contact matrix $M_{ij}$ appears as the average of the distribution of the number of contacts between the age group $j$ of a respondent and the age group $i$ of their peers. Thus we can perform a pairwise two-tailed independent sample T-test for each cell to see whether the population means of the two groups corresponding to the respective cells measured in different contact matrices are significantly different from each other 49.
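The cell-wise comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the source does not say which variant of the independent-sample test was used, so the unequal-variance (Welch) statistic is an assumption; the p-value uses a standard normal approximation that is only reasonable for large samples; individual weights are ignored; and the contact counts are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_statistic(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (an assumption here)."""
    se_squared = variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(se_squared)


def two_tailed_p_large_sample(t_stat):
    """Two-tailed p-value from a standard normal approximation of the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Hypothetical per-respondent contact counts for two matrix cells,
# keyed by (ego age group, contact age group).
representative = {
    ("45-59", "15-29"): [0, 1, 2, 3, 4] * 10,
    ("70-79", "70-79"): [0, 1, 2, 3, 4] * 10,
}
online = {
    ("45-59", "15-29"): [0, 1, 2, 3, 4] * 10,
    ("70-79", "70-79"): [5, 6, 7, 8, 9] * 10,
}

p_values = {}
for cell in representative:
    t_stat = welch_t_statistic(representative[cell], online[cell])
    p_values[cell] = two_tailed_p_large_sample(t_stat)

# Cells whose means are not significantly different at the 0.05 level.
not_significant = [cell for cell, p in p_values.items() if p > 0.05]
```

Counting the entries of `not_significant` over all 64 cells gives the kind of summary used to compare the raw and the weighted online matrices against the representative one. For small per-cell samples an exact Student t distribution would be needed instead of the normal approximation.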
These tests show whether the differences presented in the upper diagonal of the figure are statistically significant, or just the results of estimation uncertainties. In the visualisations of the lower diagonal panels of Fig. 3, yellow cells correspond to p > 0.05 values (p = 0.05 is indicated by arrows near colorbars), suggesting that average contact numbers between the corresponding age groups are not significantly different in the two data sources. For example, this is the case in the cell of egos from age group 45-59 and their peers from 15-29 in Fig. 3f, which shows the results of the significance tests comparing the values of the representative and the weighted online matrices. This result suggests that the average contact numbers between the 45-59 and the 15-29 years old are not significantly different in the representative and in the weighted online matrices. To check the robustness of our matrix reconstruction method, we performed the same significance test between the raw (not weighted) online matrix and the representative matrices (Fig. 3i). Comparing its results to the results of the weighted and representative matrices (Fig. 3f), the number of cells that are not significantly different increased by 6.38% in the latter (from 44 to 47), while the range of similarity has also increased (indicated by more yellow cells). Meanwhile, from the T-test results between the raw (not weighted) and weighted online matrices (see Fig. 3h) it is evident that the weighting helped to capture the contact patterns better in the reconstructed matrix, especially in the case of the active population (30-59) with same-age and older people, and the contacts of the elderly people (70-79) with younger others. Precise estimation of the contact patterns of these age groups is especially important for predicting the potential number of infected cases that may end up with severe medical conditions in the COVID-19 pandemic 26.
These results show that the reconstruction caused significant changes in the values of 8 cells out of the 64 and that these changes brought the value of the given cell closer to the representative one in most cases (for exact significance values see SI, Figure S2 ). It is very important to emphasize that the comparison of the actual proxy contacts in the representative and weighted/not weighted online matrices does not follow the same logic for children in the first two age groups. Due to data protection regulations, the CATI survey is only representative for the adult population of Hungary and not for children, while the online survey could not involve under age children either. Data of children are based on the responses of adult parents estimating the contact patterns of their own children. This estimation is surely biased as, especially for older children, parents may not be fully aware about all daily social contacts of their children. Consequently, we cannot use the representative sample as a 'gold standard' for these age groups, because the population of children recorded in that data is not representative for the children population of the whole country. Correction of this bias would require a separate data collection campaign involving a representative set of children directly, which in turn would raise challenges to meet privacy regulations of under-aged participants and fall beyond the scope of the actual study. Nevertheless, this explains the larger differences between the online and representative matrices in the first two columns in Fig. 3 off-diagonal panels. If we do not consider these age groups, the Relative Accuracy Gain of the weighting process increases to 11.92% as the absolute difference in the number of contacts between the representative and weighted online survey decreases to 11.36 which corresponds to an increase of 8.33% in the number of significantly not different cells. 
To make this bias evident, we separated the non-representative age groups with a vertical dashed line within the matrices, while indicated by asterisks at the labels in each relevant plot. Another potential limitation may be rooted in the sampling of the observed population. This issue is present at the online data collection, where the number of responses may vary in time. If the size of the online sample is too small, individual weights would diverge and the reconstructed matrices would suffer from large errors. In the present study, this is not an issue, as in the examined period the number of daily responses were stable and relatively high. However in the case of a longitudinal data collection, these parameters can change due to the varying level of public awareness, political influence, or media campaigns. Finally, not only the number, but also the composition of the respondents may change in time, thus the precision of the actual weights may decrease. To account for this effect in the dynamical reconstruction of contact matrices, one would need to make a representative data collection periodically, and recompute the relevant dimensions and weights for each period. Although we have collected representative samples in each month since April 2020, the demonstration of dynamical re-weighting is the subject of a future investigation (in preparation). There we also plan to apply more experimental weighting procedures, where we will not only include variables available in the census, but also others only available in the representative data. The goal of these weighting experiments is to increase the Relative Accuracy Gain of the procedure. Emergency situations, like the actual COVID-19 pandemic, may induce radical changes in the behavioural patterns of people leading to the reduction and re-organisation of their social interactions 50 . 
Changes may be induced by external influences such as governmental interventions or a change in employment status, but they may also strongly depend on individual decisions induced by self- and environment-awareness or risk-avoiding behaviour. All these influences have convoluted effects on the size and structure of personal interactions, leading to different paths of epidemic transmission in a connected population 51. Age contact matrices provide a useful way to summarise and follow such changes in the social fabric at different settings and times. Importantly, they can be further used for more realistic modelling of epidemic spreading. Nevertheless, their collection has been rather sporadic and expensive, and, other than in some recent studies 26, they were collected during 'normal' times, thus they commonly failed to capture changes in contact patterns during emergency periods. In this study we provide a feasible alternative approach, which combines the advantages of online data collections with the precision provided by representative telephone surveys. We report here one of the largest data sets collected to date to estimate age contact matrices in a single country, reaching over 2.3% of the population of Hungary. As the online data provided a non-representative sample of the population, we developed a methodology to reconstruct closer-to-representative contact matrices from the online data by using the simultaneously collected representative samples. This data collection method is not only scalable, flexible in terms of content, and relatively cheap, but it also allows for dynamical estimation of contact matrices with high temporal and spatial resolution. The reproducibility of our results and the possible adoption of our methods in different countries are primary concerns for us.
For these reasons, alongside this study, we share the core questionnaire for further use 45, together with the raw, reconstructed, and representative matrices and all supporting data calculated for Hungary. To date, our data collection method has already been implemented in Mexico 52 and Cuba. We hope that it will prove useful to collect relevant data for applied epidemiological modelling in other countries too and, at large, will contribute to the global efforts to fight the actual COVID-19 and any future pandemic.

The online data collection started on the 23rd of March 2020 through the website covid.sed.hu and later using a mobile phone app 46. The anonymity of participants was ensured by using encrypted browser cookies to store hashed identifiers locally, while transferring only anonymous encrypted data to a central secure server. Encrypted browser cookies were used for the detection of returning respondents filling out the questionnaire on multiple days. The participants did not have to give any information that could be used for their re-identification. The data collection fully complied with the European and Hungarian data privacy regulations in force and was approved by the Hungarian National Authority for Data Protection and Freedom of Information 53. The data collection was accompanied by an ongoing marketing campaign, including regular radio and newspaper interviews, ads on social media platforms, and posters on public transportation, to reach the broadest audience possible. Targeted campaigns were also published with the help of national organisations to reach parents, university students, or elderly people. In this study, we analyse data collected between the 29th of April and the 19th of May 2020, a period in which we recorded 30,770 responses from 12,208 respondents of the online questionnaire.
The questionnaire consisted of two parts in order to minimise the burden and potential churning (sample attrition) of participants:

Static questionnaire: It was asked only once, upon first response (controlled by encrypted browser cookies), about information that does not change frequently, like the year the respondent was born, gender, domicile, education level, etc. This static part also included questions about the proxy contact patterns of the respondent during the reference period, before the official declaration of the pandemic on the 13th of March 2020. We recorded reference contact patterns separately for typical weekdays and weekends of the respondents, together with their household structure detailed by age and gender.

Dynamic questionnaire: Respondents were asked to complete it ideally every day about their activities on the previous day. More specifically, we asked about the reasons they were outside, the places they visited, the protection they wore, the travel mode they used, the changes in their working conditions, etc. We asked questions about their proxy and physical social contacts outside their home, at work, or elsewhere, and also about those people with whom they had contacts at home but who are not part of their household. For those who mentioned children under 18 years in their household, more questions were asked about the contact patterns of their children at school or elsewhere.

We share the full questionnaire, including the essential questions for our analysis, in the SI.

A smaller scale, but nationwide representative data collection was also conducted between the 6th and 12th of May 2020 using exactly the same questionnaire taken from the online survey. The data collection was implemented by CATI survey methodology using both landline and mobile phone numbers. A multi-step, proportionally stratified, probabilistic sampling procedure was used for sampling.
The sample is representative of the Hungarian population aged 18 or older by gender, age, education, and domicile. Sampling errors were corrected using iterative proportional post-stratification weights. After data collection, only the anonymised and hashed data were shared with people involved in the project, after they had signed non-disclosure agreements.

We categorised people into eight age groups, as defined in the main text, and thus constructed $8 \times 8$ matrices with column indices corresponding to the age group of our respondents and row indices corresponding to the age group of their contacts. To compute the population-level age contact matrix, we use a formal description. Let $X$ be the set of respondents (egos), and let $Y$ be the set of individuals who are contacts of some $x \in X$. For a specific $x$, let $N_x \subset Y$ be the set of individuals who are contacts of $x$. We denote by $a(x) \in A = \{1, \dots, 8\}$ the age group of an individual $x$. Next we define the matrix $M_{x,y}$ for each $x \in X$ and $y \in N_x$ as follows: $(M_{x,y})_{i,j} = 1$ if $a(x) = j$ and $a(y) = i$, and zero otherwise. For an ego $x$ we can now compute its individual contact matrix as $M_x = \sum_{y \in N_x} M_{x,y}$. Finally, we use an individual weight $w_x$ assigned to each ego, coming from the IPF weighting method described in the main text. This weight describes how strongly an ego and its contacts should be counted in order to obtain a contact matrix for a closer-to-representative population. The population-level contact matrix is then computed from the individual matrices $M_x$ and the weights $w_x$.

The goal of the weighting process was to correct the unrepresentativeness of the online data without producing very large weights, which may lead to large errors in the estimates. However, unlike in a general survey, representativeness in our case did not refer to the Hungarian population in general, but specifically to its contact patterns.
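The construction of the individual matrices $M_x$ and their weighted aggregation can be illustrated with a short sketch. The explicit population-level formula is not reproduced in this text, so the normalisation used here (dividing column $j$ by the total weight of the egos in age group $j$, which gives a weighted average number of contacts per ego) is an assumption, not necessarily the authors' exact definition. The function names, the 0-based age group indices, and the toy data are hypothetical.

```python
N_GROUPS = 8  # eight age groups, indexed 0..7 here (the text uses 1..8)


def individual_matrix(ego_group, contact_groups, n=N_GROUPS):
    """M_x: entry (i, j) counts the contacts in age group i of an ego in group j."""
    m = [[0.0] * n for _ in range(n)]
    for i in contact_groups:
        m[i][ego_group] += 1.0
    return m


def population_matrix(egos, n=N_GROUPS):
    """Weighted average number of contacts per ego.

    egos: iterable of (ego_group, contact_groups, weight) tuples.
    ASSUMPTION: column j is normalised by the total weight of egos in group j.
    """
    total = [[0.0] * n for _ in range(n)]
    weight_per_group = [0.0] * n
    for ego_group, contact_groups, w in egos:
        weight_per_group[ego_group] += w
        m_x = individual_matrix(ego_group, contact_groups, n)
        for i in range(n):
            for j in range(n):
                total[i][j] += w * m_x[i][j]
    return [
        [total[i][j] / weight_per_group[j] if weight_per_group[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): two egos in age group 2 with weights 1 and 3.
egos = [
    (2, [2, 2, 3], 1.0),  # two contacts in group 2, one in group 3
    (2, [3], 3.0),        # one contact in group 3
]
M = population_matrix(egos)
```

In this toy example the second ego counts three times as much as the first, so the column for age group 2 is dominated by its single contact in group 3.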
To uncover which variables affect the actual proxy contacts the most in the different age groups, we applied linear regression analysis to the representative survey data for each age group separately. The dependent variable of these regressions was the total number of actual proxy contacts, and the independent variables were those that we measured in the questionnaire and that were also available at the population level from the census. The following independent variables met these two criteria: region (the seven main geographical regions of Hungary, according to where the respondent lives), type of settlement of the domicile, gender, highest level of education, and activity (a detailed typology of the respondent's work type, white or blue collar, or the reason they are not employed). We built three models for each age group. In the first model, only the main effects of these variables were included. In the second model, we added the two-way interaction terms of all independent variables. Finally, in the third model we included only those interaction terms in which neither the region nor the activity variable was present, as these categorical variables have many categories and thus add too many parameters to the interactions. This step was done to obtain clearer signals, so that the large number of categories of these two variables does not distort the effects of the others. For each age group, we selected the significant variables and the significant interaction terms as weighting dimensions. If the main effect of a variable was significant and an interaction term containing the same variable was also significant, we included only the interaction term, because the margins of the interaction also contain the margins of its constituent variables. Based on the results of the regression analyses and on the comparison of the online data with the population data, in some cases we included aggregated categories (values) of these dimensions in the weighting procedure.
For example, in the case of activity, a binary variable was created whose two categories indicated whether the respondent worked or did not work. In the case of geographical region, instead of the original seven categories we used two, which indicated whether the respondent lived in the central region of the country (which includes the capital) or in another region. The reason for these simplifications was that, for these variables, the strongest effects on people's contact patterns appeared along these divisions.

We define the Relative Accuracy Gain (RAG) to quantify how much accuracy we gain in approximating the representative contact matrix by weighting the online contact matrix, as compared to the unweighted case. It is defined as a function of the sums of absolute differences in the total number of contacts between the representative (rs) and the weighted online (ow) matrices, and between the representative and the unweighted online (onw) matrices. In this definition, $M_{rs}$ denotes the actual proxy matrix obtained from the nationally representative survey, $M_{ow}$ is the weighted actual proxy matrix obtained after reconstruction from the online survey, and $M_{onw}$ is the unweighted actual proxy matrix measured directly from the online survey.

The goal of the Hungarian Data Provider Questionnaire (MASZK) was to dynamically estimate the age contact matrices of people in different settings (like home, work, school, or elsewhere). To collect such data we developed a questionnaire asking about people's demographic characteristics, domicile, family structure, health conditions, travel patterns, education level, employment situation, and more. Most importantly, we asked them about the number of people from different age groups with whom they had contacts.
First, we recorded reference contact patterns by asking respondents about their contacts during a typical weekday and weekend before the COVID-19 outbreak in Hungary (13th March 2020). Second, we recorded the actual contact patterns of participants by asking them to indicate all contact activities that happened on the day before their response. We defined contacts in two different ways relevant for possible infection transmission. Interactions between people without any protection were called physical contacts, while proxy contacts were defined as two people staying closer than 2 meters to each other for at least 15 minutes. Individual contact patterns were recorded as the number of contacts between the ego and their peers from the age groups 0-4, 5-14, 15-29, 30-44, 45-59, 60-69, 70-79, and 80+. Due to privacy regulations, it was not possible to record the contact patterns of under-age people directly. Nevertheless, to collect data about children younger than 18 years, we asked respondents living in the same household as an under-age child to estimate the child's number of contacts in different settings. For the sake of potential adoption of our method and reproducibility of results, we share the questionnaire, including the essential questions for our analysis, in this repository 45 .

The overarching goal of the modelling process on the representative survey data was to identify those variables that can be used to weight our non-representative online data coming from the MASZK questionnaire. Since our goal was to weight our dataset to be more representative with respect to the number of proxy contacts of respondents, we first ran regression models to identify the factors that significantly affect the daily number of proxy interactions. As the contact matrices contain the average number of proxy contacts for each age group separately, we ran general linear models with an identity link function separately for each age group.
In this way, we could choose the factors that significantly affect the number of proxy contacts specifically for the given age group. This method helps to avoid weighting variables that do not influence the contacts in every age segment, and thus it limits the increase in the standard deviation of the estimation. The dependent variable of the models was the number of proxy contacts; the independent variables were region and activity type as factors, and gender, education, and type of settlement as covariates. These independent variables were selected because census data are available on the distribution of these attributes, which we could later use in the weighting procedure. Moreover, as these dimensions are commonly recorded in any census, our method can easily be applied in different countries without collecting expensive representative samples. In addition to the baseline models, we built extended models in which we included the interaction terms of the independent variables. For each age group, we selected for the subsequent weighting procedure those variables or interactions that significantly (at the 0.05 level) affected the number of proxy contacts. In the case of the region variable, the results suggested that the main differences are between Central Hungary and the other regions, so in the weighting procedure we treated the region variable as a dummy variable. Similarly, in the case of the independent variable measuring activity type, the breakpoints were mostly between the active and not active groups, so we also included this variable in the weighting procedure as a dummy variable. The resulting variables are available in Table 1 in the main text, while the full model tables are presented below in Tables S2, S3 and

Using the detected socio-demographic variables that significantly affect the contact patterns, in the main text we described a weighting methodology for the online survey.
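The weighting relies on iterative proportional fitting (IPF, also known as raking). The sketch below shows the generic technique on two binary dimensions of the kind selected above; it is an illustration only, not the authors' implementation. The function names, the toy respondents, and the target population shares are hypothetical and are not census values.

```python
def rake_weights(rows, targets, n_iter=100, tol=1e-9):
    """Generic IPF (raking).

    rows: list of dicts mapping a dimension name to the respondent's category.
    targets: dict mapping a dimension name to {category: population share}.
    Returns one weight per respondent, normalised to a mean of 1.
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        max_shift = 0.0
        for dim, shares in targets.items():
            total = sum(weights)
            # Current weighted total of every category of this dimension.
            current = {}
            for row, w in zip(rows, weights):
                current[row[dim]] = current.get(row[dim], 0.0) + w
            # Rescale so that the weighted margin matches the target share.
            for k, row in enumerate(rows):
                factor = shares[row[dim]] * total / current[row[dim]]
                weights[k] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:
            break
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


def weighted_share(rows, weights, dim, category):
    """Weighted share of respondents falling into `category` of `dim`."""
    hit = sum(w for row, w in zip(rows, weights) if row[dim] == category)
    return hit / sum(weights)


# Toy online sample (hypothetical): central and active respondents are over-represented.
rows = (
    [{"region": "central", "activity": "active"}] * 3
    + [{"region": "central", "activity": "inactive"}]
    + [{"region": "other", "activity": "active"}]
    + [{"region": "other", "activity": "inactive"}]
)
# Hypothetical population margins, NOT taken from the Hungarian census.
targets = {
    "region": {"central": 0.3, "other": 0.7},
    "activity": {"active": 0.55, "inactive": 0.45},
}
weights = rake_weights(rows, targets)
```

After fitting, the weighted margins of the sample match the target shares, while respondents from under-represented categories receive weights above 1. With many weighting dimensions or sparse cells the weights can become extreme, which is why the range of the resulting weights is worth inspecting.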
Our goal was to provide a method that assigns a weight $w_x$ to each individual $x$ such that the weights are not distributed very broadly, as extreme weights increase the standard errors of the estimates and decrease the accuracy of the estimation. Therefore, our weighting methodology needs to keep the weights in a reasonable range. This was possible by applying iterative proportional fitting, which resulted in individual weights distributed over a relatively small range, $0 < w_x < 25$, as demonstrated in Fig. S4.

Table 1 in the main text.

To extend the results reported in the main text, here we summarise the measured and reconstructed contact matrices and their comparison in a matrix plot panel annotated with numerical values. More precisely, in Fig. S5 we show the representative, online-weighted, and online-unweighted matrices on the diagonal. Above the diagonal we depict the pairwise differences between these matrices, while below the diagonal we show the results of pairwise two-tailed t-tests. The raw, reconstructed, and representative matrices are shared as data tables in an online repository 45 .

Figure 5 . Annotated matrices. Normalized actual proxy contact matrices (green diagonal), their pairwise difference matrices (above the diagonal), and pairwise two-tailed t-test results (below the diagonal) are depicted for the online non-weighted, online weighted, and representative matrices. In the difference matrices, red or blue cells indicate that the source matrix (column label) has a higher or lower average number of contacts, respectively, than the target matrix (row label) at the given cell. For the results of the pairwise two-tailed t-tests, yellow to blue cells (corresponding to p > 0.05, marked by an arrow beside the colorbar) indicate that the given cell is not significantly different between the source (column label) and target (row label) matrices.
Data for children under 18 (indicated with an asterisk and vertical dashed lines) could not be collected directly due to privacy regulations, thus our data cannot provide a representative sample for the first two age groups.

Social contacts and mixing patterns relevant to the spread of infectious diseases
Duration and distance of exposure are important predictors of transmission among community contacts of Ontario SARS cases
Transmission of influenza A in human beings
How contagious are common respiratory tract infections?
Review of aerosol transmission of influenza A virus
Predicting the behavior of techno-social systems
Inferring the structure of social contacts from demographic data in the analysis of infectious diseases spread
Projecting social contact matrices in 152 countries using contact surveys and demographic data
Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing
COVID-19 epidemic in Switzerland: on the importance of testing, contact tracing and isolation
Population-scale longitudinal mapping of COVID-19 symptoms, behaviour and testing
Predicted adoption rates of contact tracing app configurations-insights from a choice-based conjoint study with a representative sample of the UK population
The fundamental limitations of COVID-19 contact tracing methods and how to resolve them with a Bayesian network approach
Inherent privacy limitations of decentralized contact tracing apps
The mathematics of infectious diseases
Epidemic processes in complex networks
Complex social networks
Statistical physics of vaccination
What types of contacts are important for the spread of infections? using contact survey data to explore European mixing patterns
Gender-structured population modeling: mathematical methods, numerics, and simulations
The French connection: the first large population-based contact survey in France relevant for the spread of infectious diseases
A systematic review of social contact surveys to inform transmission models of close-contact infections
Contacts in context: large-scale setting-specific social mixing matrices from the BBC Pandemic project
Quantifying the impact of physical distance measures on the transmission of COVID-19 in the UK
Social mixing patterns in rural and urban areas of Southern China
Changes in contact patterns shape the dynamics of the COVID-19 outbreak in China
Representative contact diaries for modeling the spread of infectious diseases in Taiwan
Social contact patterns relevant to the spread of respiratory infectious diseases in Hong Kong
Social contacts, vaccination decisions and influenza in Japan
Social contact patterns in Vietnam and implications for the control of infectious diseases
Characteristics of human encounters and social mixing patterns relevant to infectious diseases spread by close contact: a survey in Southwest Uganda
Social contact structures and time use patterns in the Manicaland Province of Zimbabwe
Quantifying age-related rates of social contact using diaries in a rural coastal population of Kenya
Estimating contact patterns relevant to the spread of infectious diseases in Russia
A household-based study of contact networks relevant for the spread of infectious diseases in the highlands of Peru
Projecting social contact matrices to different demographic structures
Close encounters of the infectious kind: methods to measure social mixing behaviour
Comparison of three methods for ascertainment of contact information relevant to respiratory pathogen transmission in encounter networks
Social mixing patterns for transmission models of close contact infections: exploring self-evaluation and diary-based data collection through a web-based interface
Hungarian data supply questionnaire (MASZK) (date of access 2020.09.28)
Hungarian data supply questionnaire (MASZK) team
Early phase of the COVID-19 outbreak in Hungary and post-lockdown scenarios
Discrete multivariate analysis: theory and practice
Surveillance definitions for COVID-19
Magyar népszámlálás 2011
Encyclopedia of survey research methods
The paired t test under artificial pairing. The
Using social and behavioural science to support COVID-19 pandemic response
Social network-based distancing strategies to flatten the COVID-19 curve in a post-lockdown world
COVID-19 UNAM
Nemzeti adatvédelmi és információszabadság hatóság

The authors are very thankful to the COVID-19 development team led by Vilmos Bilicki from the Department of Software Development at the University of Szeged 41 and to Eszter Bokányi for the data analysis and her constructive comments. This work was done in the framework of the Hungarian National Development, Research, and Innovation (NKFIH) Fund 2020-2.1.1-ED-2020-00003. JK was supported by the Premium Postdoctoral Grant of the Hungarian Academy of Sciences. MK is thankful for the support from the DataRedux (ANR-19-CE46-0008) project funded by ANR and the SoBigData++ (H2020-871042) project. GR was supported by NKFIH FK 124016, EFOP-3.6.1-16-2016-00008, and TUDFO/47138-1/2019-ITM.

Author contributions statement
J.K., M.K. and O.V. contributed equally to this work, collected data and analysed the results. All authors reviewed the manuscript.

The authors declare no competing interests.

Monitoring behavioural responses during pandemic via reconstructed contact matrices from online and representative surveys

Table 4 . Population census values of the variables used when applying the weighting methodology (iterative proportional fitting) to the online survey, in order to make it measure more accurately the contact patterns of the whole population in the age groups 45-59, 60-69, and 70+.