Interviewer Effects in Live Video and Prerecorded Video Interviewing

West, Brady T; Ong, Ai Rene; Conrad, Frederick G; Schober, Michael F; Larsen, Kallan M; Hupp, Andrew L

Journal of Survey Statistics and Methodology, December 3, 2021. DOI: 10.1093/jssam/smab040

ABSTRACT: Live video (LV) communication tools (e.g., Zoom) have the potential to provide survey researchers with many of the benefits of in-person interviewing, while also greatly reducing data collection costs, given that interviewers do not need to travel and make in-person visits to sampled households. The COVID-19 pandemic has exposed the vulnerability of in-person data collection to public health crises, forcing survey researchers to explore remote data collection modes, such as LV interviewing, that seem likely to yield high-quality data without in-person interaction. Given the potential benefits of these technologies, the operational and methodological aspects of video interviewing have started to receive research attention from survey methodologists. Although it is remote, video interviewing still involves respondent-interviewer interaction that introduces the possibility of interviewer effects. No research to date has evaluated this potential threat to the quality of the data collected in video interviews. This research note presents an evaluation of interviewer effects in a recent experimental study of alternative approaches to video interviewing, including both LV interviewing and the use of prerecorded videos of the same interviewers asking questions embedded in a web survey ("prerecorded video" interviewing). We find little evidence of significant interviewer effects when using these two approaches, which is a promising result. We also find that when interviewer effects were present, they tended to be slightly larger in the LV approach, as would be expected in light of its being an interactive approach. We conclude with a discussion of the implications of these findings for future research using video interviewing.

KEYWORDS: Interviewer effects; Video interviewing; Web surveys.

Live video (LV) communication tools (e.g., Zoom) have the potential to provide survey researchers with many of the benefits of in-person interviewing while also greatly reducing data collection costs, because interviewers do not need to travel and make in-person visits to sampled households.
The COVID-19 pandemic has exposed the vulnerability of in-person data collection to public health crises, forcing survey researchers to explore remote data collection approaches, such as LV interviewing, that seem likely to yield high-quality data without in-person interaction. Given the potential benefits of video communication technologies, the operational and methodological aspects of video interviewing have started to receive research attention from survey methodologists (Endres and Hillygus 2019; Conrad, Schober, Hupp, West, Larsen, et al. 2020; Schober, Conrad, Hupp, Larsen, Ong, et al. 2020). However, we know of no survey organizations that routinely use LV in production surveys; more methodological research on LV interviewing is necessary before this becomes a routine survey research practice.

This research note presents an analysis of interviewer effects from an experimental study of alternative approaches to using video communication technologies to collect survey data. This work is especially important given the significant restrictions on survey research that have been introduced by the global pandemic and the potential of LV interviewing to provide data quality benefits similar to face-to-face interviewing without putting interviewers at risk. The results provide good news about the use of this approach, and at the same time they underscore the importance of monitoring interviewer effects when using these types of video interviewing approaches.

One recent methodological study of LV interviewing focused on experimental comparisons of data quality with in-person interviewing (Endres and Hillygus 2019). Comparing LV and face-to-face interviewing to self-administration, these authors found benefits of LV interviewing quite similar to those of in-person interviewing: less nondifferentiation, less item nonresponse, fewer "don't know" responses, and greater participant satisfaction than in a conventional web survey. Another recent methodological study performed an experimental comparison between a different type of video-based interviewing approach, namely prerecorded videos (PVs) of questions being asked by an interviewer in a web survey, and a traditional web survey (Haan, Ongena, Vannieuwenhuyze, and de Glopper 2017). These authors found few differences in disclosure of sensitive information and respondent engagement, and they suggest that their "video-web" approach did not improve data quality because, relying on prerecorded questions, it lacked responsivity.

Although these studies have provided insights into the effects that LV interviewing or PVs of interviewers asking questions might have on data quality, they did not evaluate a potential threat to data quality associated with interviewer-administered modes of data collection: interviewer effects. LV interviewing involves respondent-interviewer interaction, which introduces the possibility of interviewer effects on collected survey responses by allowing interviewers to question and probe responses differently from one another. PVs of different interviewers asking questions feature the exact same delivery of a question across all respondents assigned to a particular interviewer and do not involve respondent-interviewer interaction. This reduces the possibility of interviewer effects due to responsive nonverbal behaviors, verbal probing and clarification, and other tailored behaviors that are possible in LV interviewing.
However, interviewer effects may still arise in surveys using PVs, due to observable or inferable interviewer characteristics such as gender, age, or race/ethnicity (Krysan and Couper 2003; Conrad, Schober, Nielsen, and Reichert 2020), or due to the magnification of differences in question delivery between interviewers that results from each interviewer's recording being presented identically every time a question is asked. Different interviewers conducting LV interviews or recording videos of themselves asking survey questions may therefore introduce the same types of effects on survey measures that have been reported in myriad prior studies of in-person interactions (West and Blom 2017). The resulting variability among interviewers in the distributions of the responses collected reduces the efficiency of survey estimates, lowering effective sample sizes and statistical power in a manner similar to cluster sampling (Elliott and West 2015). Table 1 summarizes theoretical mechanisms that could introduce interviewer effects for each of these two approaches. Based on these mechanisms, we hypothesize that LV interviewing will more often introduce interviewer effects, especially for more sensitive or complex items; the lack of studies evaluating interviewer effects in LV or PV interviews precludes hypotheses based on empirical evidence.

This research note presents an evaluation of interviewer effects in a recent experimental study of LV and PV interviewing. While one should ideally study interviewer effects from a Total Survey Error perspective (West and Blom 2017), and prior work has suggested that interviewer effects may arise from variance among interviewers in the types of respondents recruited (e.g., West and Olson 2010), we focus exclusively on the variability in survey measures among interviewers because the interviewers were not responsible for recruitment. We seek answers to the following two research questions: (1) How much interviewer variance arises when using each approach, in particular for responses to sensitive questions and measures of satisficing? (2) Does the interviewer variance differ significantly across these two approaches?

We implemented a randomized experiment in an original data collection that took place between August 2019 and March 2020 (with most of the data collection occurring before the COVID-19 pandemic). This study sought to evaluate three different data collection approaches: LV, PV, and a conventional (text-only) web survey. We focus exclusively on the LV and PV approaches, as the web approach did not involve interviewers. A nonprobability sample of individuals ages 18 and above was recruited to participate in the study via one of two online research services: CloudResearch (cloudresearch.com) or the University of Michigan online Health Research program (umhealthresearch.org). Study participants were therefore volunteer online panelists, and all interested participants were randomly assigned to one of the three approaches upon agreeing to participate. All individuals were offered a $5 Amazon gift code for participating, and participants randomly assigned to the LV approach were offered an additional $15, conditional on completing their interview along with an online debriefing that was offered to both LV and PV participants. About one-third of the participants completed the survey using a smartphone (31.9 percent in LV and 40.3 percent in PV).
Table 1. Theoretical Mechanisms That Could Introduce Interviewer Effects in the LV and PV Approaches

- Ability to adjust question presentation and subsequent dialogue responsively (LV: Yes; PV: No). Example studies: Conrad and Schober (2000); Schober and Conrad (1997); West, Conrad, Kreuter, and Mittereder (2018); van der Zouwen, Dijkstra, and Smit (1991).
- Attempts to establish rapport (LV: Yes; PV: No). Example studies: Goudy and Potter (1975); Garbarski, Schaeffer, and Dykema (2016).
- Speed asking questions/alternative delivery styles (LV: Yes; PV: Yes). Example studies: Olson and Peytchev (2007); Olson and Smyth (2015).
- Voice characteristics (LV: Yes; PV: Yes). Example studies: Charoenruk and Olson (2018).
- Fixed/identical question presentation by a given interviewer (LV: No; PV: Yes). Example studies: Haan et al. (2017).
- Inability to adjust question presentation and subsequent dialogue responsively (LV: No; PV: Yes). Example studies: Conrad and Schober (2000); Schober and Conrad (1997); Haan et al. (2017); Dijkstra and Ongena (2006); Jäckle, Roberts, and Lynn (2010).

NOTE.—These are not meant to be comprehensive lists of all studies on a particular source of interviewer effects; see West and Blom (2017) and Davis et al. (2010) for more comprehensive overviews.

The survey featured 36 questions drawn from major US social surveys and other methodological studies (e.g., the General Social Survey; see the Supplementary data online). Ten questions required an open numerical response (e.g., hours watching television per day), nine presented categorical response options (e.g., Likert-type agree/disagree responses to statements like "It is important to maintain a healthy diet"), and 17 items (statements) comprised three batteries, each offering the same response options for every statement in the battery (e.g., agree/disagree). The batteries were implemented as a series of individual items (i.e., not as grids in PV) for comparability across the approaches, and responses to these items were dichotomized as neutral (e.g., neither agree nor disagree) versus non-neutral (e.g., somewhat disagree) to allow for modeling the probability of a substantive (i.e., non-neutral) response.

The questions were organized by topic and presented in order from the least sensitive topic to the most sensitive topic. These sensitivity ratings were determined by a prior norming study in which respondents rated how uncomfortable they thought people would be answering each question and selecting each response option; the norming study was based on approaches used in prior assessments of question sensitivity (e.g., Fail, Schober, and Conrad 2021; Feuer, Fail, and Schober 2019). Questions about credit card balance, attending religious services, volunteer work, helping the homeless, participating in local elections, sex frequency, and frequency of watching pornography were all rated as more sensitive in the norming study.

We also analyzed three measures of data quality: disclosure, measured by the average sensitivity of the response options selected by a participant (e.g., because 24 percent of the respondents in the sensitivity rating study rated reporting having only one sex partner in the past 12 months as "Very Uncomfortable" or "Somewhat Uncomfortable," the sensitivity of that response is 0.24); rounding, measured by the number of numerical answers divisible by 10 (e.g., ten versus eleven movies seen in the last month); and near straightlining, measured by whether a participant selected the same response option for all, or all but one, of the items in any of the three batteries.
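As an illustration of how these three indicators can be operationalized, the short Python sketch below computes a disclosure score, a rounded-answer count, and a near-straightlining flag for a single respondent. All variable names, option codes, and sensitivity values are hypothetical stand-ins, not the study's actual data structures.

```python
# Illustrative computation of the three data quality measures described above.
# All data structures here are hypothetical stand-ins for the study's files.

def disclosure_score(selected_options, sensitivity):
    """Mean rated sensitivity of the response options a participant selected.

    selected_options: list of option identifiers chosen by one participant
    sensitivity: dict mapping option identifier -> proportion of norming-study
                 raters calling that response "Very/Somewhat Uncomfortable"
    """
    rated = [sensitivity[o] for o in selected_options if o in sensitivity]
    return sum(rated) / len(rated) if rated else float("nan")

def rounded_count(numeric_answers):
    """Number of open numerical answers that are divisible by 10."""
    return sum(1 for a in numeric_answers if a is not None and a % 10 == 0)

def near_straightlining(battery_responses, batteries):
    """1 if, in any battery, the participant gave the same response to all
    items or to all but one of them; 0 otherwise.

    battery_responses: dict mapping item name -> selected response
    batteries: dict mapping battery name -> list of its item names
    """
    for items in batteries.values():
        answers = [battery_responses[i] for i in items]
        most_common = max(set(answers), key=answers.count)
        if sum(a != most_common for a in answers) <= 1:
            return 1
    return 0

# Toy usage with hypothetical option codes and sensitivities
sens = {"sex_partners_1": 0.24, "porn_never": 0.10, "porn_weekly": 0.35}
print(disclosure_score(["sex_partners_1", "porn_weekly"], sens))  # 0.295
print(rounded_count([10, 11, 40, 7]))                             # 2
print(near_straightlining(
    {"a1": 3, "a2": 3, "a3": 3, "a4": 2},
    {"sports": ["a1", "a2", "a3", "a4"]},
))                                                                # 1
```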
Initial analyses have shown that LV interviewing produced significantly less disclosure, a significantly higher proportion of rounded answers, and significantly less straightlining than the PV approach; whether these measures of data quality tend to vary across interviewers in each of the two approaches remains an open question.

LV interviewing (279 respondents) was implemented as synchronous two-way video using the BlueJeans video platform (https://www.bluejeans.com/), through which the interviewer administered a questionnaire programmed in Blaise and displayed on the interviewer's screen below the BlueJeans window. Participants randomly assigned to LV interviewing scheduled appointments with one of eight interviewers from the University of Michigan Survey Research Center. The same eight interviewers were video recorded asking the same 36 survey questions, and these recordings were embedded in a Blaise 5 web survey comprising the PV approach. For each PV respondent, one of the eight interviewers asked all 36 questions. The response options did not appear until the recording had played in its entirety, and the video recordings autoplayed except on mobile devices.

At the beginning of data collection, we required participants from CloudResearch to schedule interviews with only one randomly assigned interviewer, attempting to implement interpenetrated sample assignment for the purpose of estimating interviewer effects. When this approach resulted in many missed and unscheduled appointments, we allowed Michigan Health Research respondents to schedule appointments with any interviewer at any of the available interview slots. In instances where multiple interviewers were available at the same time, the scheduling software we used, Calendly, assigned interviewers based on who had completed the fewest interviews thus far. Because this process did not result in true random assignment of participants to interviewers, we reviewed the distributions of demographic measures (sex, age, race, and education) of the participants assigned to each interviewer on a weekly basis. When notable imbalances across the interviewers were identified, we adjusted which interviewers' appointment slots participants from certain demographic groups could view for scheduling. Table 2 shows the final distributions of selected demographic characteristics for each of the eight interviewers. Outside of the gender distributions for LV interviewing, the eight interviewers ultimately interviewed participants with similar demographic features; the same was true for the PV interviews.

For each of the thirty-six survey variables and the data quality measures, we fit the following mixed-effects model, where i indexes respondents, j indexes interviewers, LV and PV are indicators of assignment to the two approaches, and g() is a link function appropriate for a given type of dependent variable (e.g., the logit link for a binary dependent variable):

g\{E(y_{ij})\} = \beta_0 + \beta_1 LV_{ij} + u_{LV,j} LV_{ij} + u_{PV,j} PV_{ij} + e_{ij},    (1)

where (u_{LV,j}, u_{PV,j})^{\top} \sim N(\mathbf{0}, \Sigma), with \Sigma containing the interviewer variance components \tau^{2}_{LV} and \tau^{2}_{PV} and their covariance \tau_{LV,PV}, and e_{ij} \sim N(0, \sigma^{2}) for continuous items. Given the small number of interviewers available for this study, we used penalized quasi-likelihood (as implemented in the GLIMMIX procedure of SAS Version 9.4) to fit logistic regression models of the form in (1) to each of the binary measures (excluding the residual term), and restricted maximum likelihood to fit linear regression models of the same form to the numeric measures.
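For the continuous items, a model of this form can be fit in standard mixed-model software. The sketch below is a minimal illustration in Python's statsmodels, not the SAS GLIMMIX/REML setup the study actually used; the simulated data frame and its column names (y, lv, pv, interviewer) are hypothetical. Specifying random slopes on the LV and PV indicators with no random intercept gives each interviewer a separate effect under each approach, with an unstructured covariance so the two effects can covary; the binary items would instead require a generalized linear mixed model fit by penalized quasi-likelihood, which is not shown here.

```python
# Minimal sketch (not the authors' code): linear mixed model with
# approach-specific, correlated random interviewer effects, fit by REML.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical stacked data: one row per respondent, with an approach
# indicator (lv/pv) and the id of the interviewer who asked the questions.
n = 400
df = pd.DataFrame({
    "interviewer": rng.integers(1, 9, size=n),   # 8 interviewers
    "lv": rng.integers(0, 2, size=n),            # 1 = live video, 0 = prerecorded
})
df["pv"] = 1 - df["lv"]

# Toy outcome with small, approach-specific interviewer effects built in.
u_lv = rng.normal(0, 0.3, size=8)
u_pv = rng.normal(0, 0.1, size=8)
df["y"] = (2.0 + 0.3 * df["lv"]
           + u_lv[df["interviewer"] - 1] * df["lv"]
           + u_pv[df["interviewer"] - 1] * df["pv"]
           + rng.normal(0, 1, size=n))

# Random slopes on lv and pv (no random intercept) => one random interviewer
# effect per approach, with an unstructured 2x2 covariance so they can covary.
model = smf.mixedlm("y ~ lv", data=df, groups=df["interviewer"],
                    re_formula="0 + lv + pv")
fit = model.fit(reml=True)

cov_re = np.asarray(fit.cov_re)   # 2x2 covariance of (u_LV, u_PV); order: lv, pv
sigma2 = fit.scale                # residual variance
iic_lv = cov_re[0, 0] / (cov_re[0, 0] + sigma2)
iic_pv = cov_re[1, 1] / (cov_re[1, 1] + sigma2)
print(fit.summary())
print("IIC (LV):", iic_lv, " IIC (PV):", iic_pv)
```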
These estimation procedures reduce the bias in estimates of variance components that can arise with small samples of higher-level clusters (McNeish and Stapleton 2016). We specified the model in (1) for each of our measures because each interviewer ultimately produced data for both approaches. This model, which differs from the more traditional multilevel models used to study interviewer effects, allows for unique interviewer variance depending on the approach used (LV or PV), which enables us to answer our second research question. Furthermore, the model allows the random interviewer effects associated with each approach to covary, enabling an assessment of the correlation of the two random effects for each interviewer across the two approaches.

For our first research question, we initially tested the significance of the covariance of the two random effects by fitting the model in (1), referred to as Model 1, along with a reduced model with the covariance of the random effects constrained to be zero (Model 2), and performing a likelihood ratio test of the null hypothesis that the covariance was equal to zero. We note that estimating the covariance of the random effects based on Model 1 was only feasible for outcome measures (responses) that presented evidence of nonzero variance in the random interviewer effects for both approaches. We then tested the significance of the interviewer variance components for each approach by using an appropriate mixture-based likelihood ratio test of the null hypothesis that a given variance component is equal to zero (e.g., West and Olson 2010). For our second research question, we tested whether the interviewer variance components were equal by fitting a reduced model (Model 3) with the interviewer variance constrained to be equal for each approach, and then performing a likelihood ratio test of the null hypothesis that the two variance components were equal (West and Elliott 2014).

Given the relatively low statistical power of this study to detect interviewer variance and differences in interviewer variance between the approaches (with only eight interviewers), we supplemented our analyses with descriptive estimates of intra-interviewer correlations (IICs) for each item measured in each approach, based on ratios of the estimated interviewer variance component to the total variance for each item (see West et al. [2018] for details). Because multilevel models constrain estimates of IICs to be greater than or equal to zero, we also estimated the IICs using the ANOVA-based method outlined by Kish (1962) and frequently employed in prior studies of interviewer effects (e.g., Groves and Magilavy 1986). We also generated plots of the two predicted values (EBLUPs) of the random effects for each interviewer (corresponding to the two approaches) to visually examine the correlations of the two random effects for each interviewer when we found evidence of interviewer variance.

Table 3 NOTE.—REML = restricted maximum likelihood; PQL = penalized quasi-likelihood; PM = past month; PY = past year. IICs estimated using the ANOVA-based method (Kish 1962), which allows for negative estimates of the IICs, are shown in parentheses. Estimates of 0 indicate variance components that were very close to zero (i.e., less than 0.001) when using the multilevel modeling methods. (a) There were only seven cases of near straightlining on the food battery, and two of them were interviewed by one interviewer in the PV approach; this estimated IIC under multilevel modeling was therefore not considered reliable.
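The sketch below illustrates two generic quantities used in analyses of this kind: (i) the classical one-way ANOVA estimator of the intra-interviewer correlation, in the spirit of Kish (1962), which can yield negative values, and (ii) the 50:50 chi-square mixture p-value commonly used when a likelihood ratio test places a single variance component on the boundary of zero. Both functions are illustrations rather than the study's code, and the input format (one array of responses per interviewer) is a hypothetical convenience.

```python
# Generic sketches (not the study's code) of two quantities described above.
import numpy as np
from scipy import stats

def anova_iic(groups):
    """ANOVA-based intra-interviewer correlation (can be negative).

    groups: list of 1-D arrays, one array of responses per interviewer.
    Uses the one-way ANOVA estimator rho = (MSB - MSW) / (MSB + (n0 - 1) * MSW),
    where n0 is the usual adjusted average workload per interviewer.
    """
    k = len(groups)
    sizes = np.array([len(g) for g in groups])
    n_total = sizes.sum()
    grand_mean = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    n0 = (n_total - (sizes ** 2).sum() / n_total) / (k - 1)
    return (msb - msw) / (msb + (n0 - 1) * msw)

def mixture_lrt_pvalue(lr_statistic):
    """p-value for testing one variance component = 0 using an equal mixture
    of chi-square(0) and chi-square(1) reference distributions."""
    return 0.5 * stats.chi2.sf(lr_statistic, df=1) + 0.5 * (lr_statistic <= 0)

# Toy usage with 8 hypothetical interviewers, ~35 responses each
rng = np.random.default_rng(1)
toy = [rng.normal(loc=mu, scale=1.0, size=35) for mu in rng.normal(0, 0.3, 8)]
print(anova_iic(toy))
print(mixture_lrt_pvalue(3.2))
```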
3.1 Research Question 1: How Much Interviewer Variance Arises When Using Each Approach?

Table 3 presents estimates of the IICs for each measure by approach, along with indications of whether the interviewer variance for a given approach was found to be significantly greater than zero. In general, the majority of the estimated IICs were very small for both approaches: 22 of the 43 measures were computed as zero for each approach when using multilevel modeling, and 22 (LV) and 19 (PV) were computed as negative when using Kish's ANOVA-based method. Table 4 shows that the means and ranges of the IICs are largely in line with similar descriptive summaries reported in prior studies of interviewer effects on multiple survey items. We see no evidence of the mean IIC for the LV approach being substantially higher than the mean IIC for PV or the means reported in prior studies, and the ranges of the estimated IICs for both approaches are actually somewhat smaller than observed in the prior literature.

Table 4 NOTE.—Based on aggregation of the data in Table 1 from Groves and Magilavy (1986). Based on aggregation of the data from the literature review in Table 1 of O'Muircheartaigh and Campanelli (1998); based on their own study, these authors report that 80 percent of the IICs computed were less than 0.02 (see figure 1 in O'Muircheartaigh and Campanelli 1998).

We found no evidence of significant interviewer variance in the PV approach, although five of the estimated PV IICs were at or above 0.02 when using the multilevel modeling and ANOVA-based methods to compute them. Such IICs would generally be considered "large" and would likely have a negative impact on the precision of survey estimates (Groves and Magilavy 1986; O'Muircheartaigh and Campanelli 1998). We also found five relatively large IICs in the LV approach, using the same criterion; two were significantly greater than zero at the 0.10 or 0.01 level, despite our reduced power. The survey items that seemed to introduce larger interviewer effects in the LV interviews included a binary indicator of ever performing volunteer work (IIC = 0.084, p < .01) and a binary indicator of ever helping the homeless (IIC = 0.026, p < .10). We also found a relatively large IIC in LV (IIC = 0.084) for the measure of straightlining based on the battery of items on sports. We did not find any evidence of a correlation between the rated sensitivity of an item and the magnitude of its estimated IIC in either approach.

3.2 Research Question 2: Does the Interviewer Variance Differ across the Two Approaches?

Table 5 summarizes the model fitting and testing results for selected outcome measures with notable differences in the estimated IICs across the approaches (from Table 3). We were largely unable to reliably estimate correlations of the random interviewer effects in the two approaches, given that the majority of the items presented evidence of no interviewer variance in one of the two approaches. As a result, we focus primarily on the tests of equality of the interviewer variance between the two approaches. We only found evidence of the interviewer variance in LV being significantly larger than in the PV approach for one of the items (volunteer work). We visualize these differences in interviewer variance for three of the items in figure 1, which presents predictions of the interviewer effects for the eight interviewers for each of these items in each approach.
The larger variance of the black dots representing predictions of the random effects in LV for the eight interviewers is evident, as is the general lack of correlation between the predicted random effects. Closer inspection of the predicted random interviewer effects in figure 1 indicates that the second interviewer had relatively large positive effects on the responses to each item in LV. This interviewer completed a total of thirty-nine LV interviews (Table 2). Both LV and PV respondents were asked to complete an online debriefing after the survey, which was designed for quality control purposes and to provide more insight into possible differences in response distributions between the approaches. An analysis of selected debriefing items indicated that fewer respondents interviewed by the second interviewer reported being "Very Satisfied" with the interview compared with respondents of the other interviewers. Respondents interviewed by this individual also reported the lowest comfort with the interviewer (only 54 percent said they were "Very Comfortable" with the interviewer), and two of this interviewer's respondents reported that they felt they could not answer honestly. The interviewer behavior that produced these reactions might also have affected respondents' answers (e.g., eliciting more socially desirable responses), which may have introduced higher interviewer variance on selected items relative to the PV approach.

We found that interviewer effects in LV interviewing tended to be rare and did not arise at substantially higher rates than in the PV approach. Out of the 43 variables analyzed, including six measures of data quality, we only found evidence of IICs greater than 0.02 for five items in the LV interviews (two of which had significant interviewer variance components) and five items in the PV approach (none of which had significant variance components). When comparing the two approaches, we did find significantly higher interviewer variance in LV interviews for the item on frequency of volunteering, suggesting that this approach may introduce opportunities for larger interviewer effects on selected items. While the general lack of significant findings may have been due to relatively low statistical power, given that we only studied eight interviewers, descriptive analyses of the estimated IICs for the two approaches using multiple estimation methods suggested that the large majority of the IICs were small (<0.02), meaning that a higher-powered study may well have led to the same conclusions.

This is generally good news for LV interviewing. Although there is no obvious reason why video mediation would increase standardization compared to in-person interviewing, the fact that interviewer effects were generally no greater in LV than in PV could mean that the interaction was as standardized in LV as in PV. This could be the case if video mediation reduced the amount of probing (Mangione, Fowler, and Louis 1992), but there is no clear theoretical reason why this would have occurred, and without video recordings of the LV interviews we cannot test this possibility. Because LV interviewing mimics in-person interviewing in several important ways, potential interviewer effects introduced by LV in different contexts need monitoring, just as in in-person interviewing. The same is true for the PV approach, which may produce significant data quality benefits relative to a text-only web survey at far lower cost than LV.
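To give a rough sense of why such monitoring matters, the display below applies a standard Kish-type interviewer design effect approximation to the largest IIC reported above (0.084); the average workload of roughly 35 LV interviews per interviewer is an assumption derived from the 279 LV respondents and eight interviewers, not a figure reported in the study.

```latex
% Standard interviewer design effect approximation (Kish-type), applied
% illustratively; \bar{m} \approx 35 is an assumed average workload.
\[
\mathrm{deff}_{\mathrm{int}} \;=\; 1 + (\bar{m} - 1)\,\rho_{\mathrm{int}},
\qquad
n_{\mathrm{eff}} \;=\; \frac{n}{\mathrm{deff}_{\mathrm{int}}}.
\]
\[
\text{With } \rho_{\mathrm{int}} = 0.084 \text{ and } \bar{m} \approx 35:\quad
\mathrm{deff}_{\mathrm{int}} \approx 1 + 34 \times 0.084 \approx 3.9,
\qquad
n_{\mathrm{eff}} \approx \frac{279}{3.9} \approx 72.
\]
```

Under these illustrative assumptions, an IIC of 0.084 would inflate the variance of an estimated mean by a factor of nearly four relative to a design with no interviewer effects, which is why even a handful of items with IICs of this size warrants attention.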
Future experiments evaluating the LV approach on a larger scale could use our results for power analysis (West 2020), compare interviewer effects in LV interviewing and in-person interviewing, and carefully examine any differences between LV and in-person interviewing in terms of the factors introducing interviewer effects. Whether interviewer effects introduced by LV or PV vary by the device used (e.g., computer versus mobile device) also warrants future examination. Analyses of explanations for any interviewer effects observed will be crucial for future work in this area (West and Blom 2017). These explanatory factors may include observable features of the interviewers (which we could not access in this study), behaviors during the interviews (e.g., in debriefings, some interviewers expressed frustration with technical difficulties and with coaching people through the use of a mobile phone for the interview), and nonverbal expressions (e.g., some interviewers said that trying to keep a "poker face" when hearing responses to sensitive questions was difficult). We did not have the resources to record the LV interviews and analyze the recordings, but future studies using more interviewers could code a subsample of recorded LV interviews and study potential correlates of any interviewer effects observed, such as respondent comfort answering questions.

REFERENCES

First Impressions Count: Interviewer Appearance and Information Effects in Stated Preference Studies
Improvement of the Quality of Responses to Factual Survey Questions by Interviewer Training
Interviewer Behavior and the Quality of Social Network Data
Do Listeners Perceive Interviewers' Attributes from Their Voices and Do Perceptions Differ by Question Type?
Interviewer and Clustering Effects in an Attitude Survey
Clarifying Question Meaning in a Household Telephone Survey
Interviewers, Video, and Survey Data Collection
Social Identities of Virtual Interviewers and Their Impact on Survey Responses
Interviewer Effects in Public Health Surveys
Stereotype Threat and Race of Interviewer Effects in a Survey on Political Knowledge
Question-Answer Sequences in Survey-Interviews
BMI of Interviewer Effects
"Clustering by Interviewer": A Source of Variance That is Unaccounted for in Single-Stage Health Surveys
A Future for Video Interviewing?
An Experimental Assessment of Video Mode Compared to Face-to-Face and Online Self-Complete Interviewing
The Time It Takes to Reveal Embarrassing Information in a Mobile Phone Survey
A Study of Interviewer Variance
Empirically Assessing Survey Question and Response Sensitivity
The Impact of Technology on Interaction in Computer-Assisted Interviews
Interviewing Practices, Conversational Practices, and Rapport: Responsiveness and Engagement in the Survey Interview
Interview Rapport: Demise of a Concept
Measuring and Explaining Interviewer Effects in Centralized Telephone Surveys
Interviewers' Questions: Rewording Not Always a Bad Thing
Response Behavior in a Video-Web Survey: A Mode Comparison Study
Telephone versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Bias
Meeting Both Ends: Between Standardization and Recipient Design in Telephone Survey Interviews
Assessing the Effect of Data Collection Mode on Measurement
Interviewer Gender and Gender Attitudes
Logistic Regression with Multiple Random Effects: A Simulation Study of Estimation Methods and Statistical Packages
Studies of Interviewer Variance for Attitudinal Variables
Race in the Live and the Virtual Interview: Racial Deference, Social Desirability, and Activation Effects in Attitude Surveys
Interviewer Gender Effects on Survey Responses to Marriage-Related Questions
Question Characteristics and Interviewer Effects
The Effect of Small Sample Size on Two-Level Model Estimates: A Review and Illustration
Effect of Interviewer Experience on Interview Pace and Interviewer Attitudes
The Effect of CATI Questions, Respondents, and Interviewers on Response Time
The Relative Impact of Interviewer Effects and Sample Design Effects on Survey Precision
An Assessment of the Reliability of WFS Data
Separating Interviewer and Sampling-Point Effects
Does Conversational Interviewing Reduce Survey Measurement Error?
Design Considerations for Live Video Survey Interviews
The Effects of Black and White Interviewers on Black Responses in 1968
Response Effects in Surveys: A Review and Synthesis
Interviewer Effects in Telephone Surveys
Studying Respondent-Interviewer Interaction: The Relationship between Interviewing Style, Interviewer Behavior, and Response Behavior
Nonverbal Behavior in Face-to-Face Survey Interviews: An Analysis of Interviewer Behavior and Adequate Responding
Designing Studies for Comparing Interviewer Variance in Two Groups of Survey Interviewers
Explaining Interviewer Effects: A Research Synthesis
Can Conversational Interviewing Improve Survey Response Quality without Increasing Interviewer Effects?
Frequentist and Bayesian Approaches for Comparing Interviewer Variance Components in Two Groups of Survey Interviewers
How Much of Interviewer Variance Is Really Nonresponse Error Variance?

Supplementary materials are available online at academic.oup.com/jssam.