key: cord-1035902-e8kot7mf
authors: Eikenboom, Anna M; Le Cessie, Saskia; Waernbaum, Ingeborg; Groenwold, Rolf H H; de Boer, Mark G J
title: Quality of Conduct and Reporting of Propensity Score Methods in Studies Investigating the Effectiveness of Antimicrobial Therapy
date: 2022-03-07
journal: Open Forum Infect Dis
DOI: 10.1093/ofid/ofac110
sha: 8211c93f819a7266f4c9e47bbeb87bcb862d6ff9
doc_id: 1035902
cord_uid: e8kot7mf

BACKGROUND: Propensity score methods are becoming increasingly popular in infectious disease medicine to correct for confounding in observational studies. However, applying and reporting propensity score techniques correctly requires substantial knowledge of these methods. The quality of conduct and reporting of propensity score methods in studies investigating the effectiveness of antimicrobial therapy is yet undetermined. METHODS: A systematic review was performed to provide an overview of studies (2005–2020) on the effectiveness of antimicrobial therapy that used propensity score methods. A quality assessment tool and a standardized quality score were developed to evaluate a subset of studies in which antibacterial therapy was investigated in detail. The scale of this standardized score ranges between 0 (lowest quality) and 100 (excellent). RESULTS: A total of 437 studies were included. The absolute number of studies that investigated the effectiveness of antimicrobial therapy and that used propensity score methods increased 15-fold between the periods 2005–2009 and 2015–2019. Propensity score matching was the most frequently applied technique (65%), followed by propensity score–adjusted multivariable regression (25%). A subset of 108 studies was evaluated in detail. The median standardized quality score per year ranged between 53 and 61 (overall range: 33–88) and remained constant over the years. CONCLUSIONS: The quality of conduct and reporting of propensity score methods in research on the effectiveness of antimicrobial therapy needs substantial improvement. The quality assessment instrument that was developed in this study may serve to help investigators improve the conduct and reporting of propensity score methods.

In infectious disease medicine, it is of great importance to investigate the effectiveness of antimicrobial therapies in an efficient and valid manner [1]. Increasing antimicrobial, and especially antibacterial, resistance and newly emerging infectious diseases such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) create the need for rapid development of antimicrobial therapy. Randomized controlled trials (RCTs) are considered the gold standard to investigate treatment effects [2, 3]. However, RCTs for many antimicrobial treatment decisions may not be feasible, may be unethical, or may be too costly or not timely enough [3]. Therefore, observational studies are commonly performed to investigate treatment effects of antimicrobial therapy [1]. However, because in observational studies treatment assignment is not a random process, direct comparison of treatment groups without taking baseline differences into account may lead to incorrect conclusions due to confounding. One of the approaches to correct for measured confounding is the use of propensity score methods [4-6]. These methods attempt to balance the observed baseline covariates between the treatment and control groups. Among participants with the same propensity score, the distribution of the observed covariates in the treated and untreated groups would be approximately the same, similar to an RCT [4].
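As a compact restatement of the balancing idea described above, the propensity score and its balancing property can be written as follows. The notation is an assumption of this sketch and does not appear in the original text: T is the treatment indicator (1 = treated, 0 = untreated) and X is the vector of observed baseline covariates.

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad X \perp T \mid e(X)
```

The second expression states that, conditional on the propensity score, treatment assignment is independent of the observed covariates. It says nothing about covariates that were not measured.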

Due to increasing antimicrobial resistance and rapidly evolving insights in antimicrobial therapy, it is not surprising that the popularity of propensity score methods has also increased greatly in the field of infectious diseases in the past 15 years [5, 7, 8]. However, applying propensity score techniques correctly requires substantial knowledge of propensity score analysis and its underlying assumptions. Moreover, studies in which propensity score methods are applied should report sufficient details of the analyses that were performed [4]. Therefore, a quality assessment instrument consisting of a set of quality criteria is needed to ensure quality of conduct and reporting of studies using propensity score methods [9-11]. In addition, it is still unknown how the quality of conduct and reporting of studies using propensity score methods has evolved in the field of infectious disease medicine over the past 15 years.

The aim of this study was to provide an overview of the use of propensity score methods in studies investigating the effectiveness of antimicrobial therapy between 2005 and 2020, to develop a quality assessment instrument for studies using propensity score methods, and to use this tool to assess the quality of conduct and reporting of the studies over time.

The study consisted of a systematic review, the development of a quality assessment tool, and a quality assessment. The study structure and data flow are presented in Figure 1.

A systematic review was conducted following the PRISMA criteria (Supplementary Data 1) and was registered in the PROSPERO database (registration number: CRD42020210473) [12]. This systematic review describes (a) the number of studies investigating the effectiveness of antimicrobial therapy in which propensity score methods were applied that have been published in the past 15 years, (b) the infectious diseases that were investigated, and (c) how often different propensity score techniques have been used. PubMed was searched using a search strategy that was carefully designed in collaboration with a librarian (Supplementary Data 2). Titles and abstracts were screened to find studies that fulfilled the following eligibility criteria: original research of observational data; a main study aim of estimating the effectiveness (note: not safety) of antimicrobial therapy (ie, antibacterial, -fungal, -viral, and -parasitic therapy); use of propensity score methods in the analysis; published between September 1, 2005, and September 1, 2020; and written in English. Case reports, meta-analyses, reviews, abstracts, and protocols were excluded.

A quality assessment instrument for studies using propensity score methods was developed inspired by the general principles of the Delphi method to reach consensus [13, 14]. The list of recommendations developed by Yao et al. was used as a starting point. On the basis of additional literature on propensity score methods, modifications were made to the quality criteria suggested by Yao et al., and quality criteria were added or removed from the list [4-6, 11, 15-24]. The tool was developed to assess if sufficient details were reported on the propensity score method used, whether assumptions of propensity score methods were discussed, and whether the balance of baseline variables before and after propensity score analysis was checked. The list was discussed among experts, and a quality assessment tool was drafted. Subsequently, 3 independent experts reviewed the quality assessment tool using a feedback form (Supplementary Data 3). Following the principles of the Delphi method, improvements were made if similar feedback was provided by at least 2 out of 3 experts. A subset of the papers that were included in the systematic review was evaluated in detail using the quality assessment instrument. The subset consisted of studies in which the effect of an antimicrobial therapy for (sub)acute bacterial infections was investigated. Studies that concerned fungal infections, parasitic infections, viral infections, and chronic infections and studies in which antibiotic prophylaxis was investigated were excluded from this part of the analyses. Due to time constraints, it was not feasible to include all eligible studies for quality assessment. If >10 studies met the inclusion criteria in 1 publication year, 10 studies were randomly selected. Every article received a random number, generated by using a random number generator. Then, articles were sorted by publication year and ascending random numbers.
Per publication year, the 10 studies that received the lowest random numbers were included. Selected studies were reviewed and assessed by application of the quality assessment tool (by A.E.). The results were subsequently discussed in a larger team (by A.E., M.B., and S.C.). Because there were more quality criteria for studies that used propensity score matching and IPTW than for stratification and covariate adjustment using propensity scores, the maximum score depended on the propensity score method used. Therefore, total scores were standardized to a value between 0 and 100 by dividing the total score by the maximum score that could be achieved and then multiplying by 100. For the purpose of this study, the standardized score is further referred to as the Standardized Quality Score for Propensity score Methods (SQSPM).
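The random selection per publication year and the standardization of the total score can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure the authors actually ran: the function names, the fixed seed, and the example maximum scores are hypothetical.

```python
import random
from collections import defaultdict


def select_per_year(studies, max_per_year=10, seed=0):
    """Randomly keep at most `max_per_year` studies per publication year.

    `studies` is a list of (study_id, year) pairs. The seed is an assumption
    added only to make this sketch reproducible.
    """
    rng = random.Random(seed)
    # Give every article a random number.
    numbered = [(year, rng.random(), study_id) for study_id, year in studies]
    # Sort by publication year, then by ascending random number.
    numbered.sort()
    selected = defaultdict(list)
    for year, _, study_id in numbered:
        # Keep the studies with the lowest random numbers in each year.
        if len(selected[year]) < max_per_year:
            selected[year].append(study_id)
    return dict(selected)


def standardized_score(total_score, max_score):
    """Rescale a raw total to 0-100 using the method-specific maximum score."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return 100.0 * total_score / max_score
```

For example, with a hypothetical maximum of 20 points for a matching study, a raw total of 12 gives `standardized_score(12, 20)`, that is, 60.0. Dividing by the method-specific maximum is what makes scores comparable across propensity score methods with different numbers of applicable criteria.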

Categorical variables were reported using percentages; continuous variables were reported using means with standard deviations or, in case of skewed variables, medians with interquartile ranges. Articles included for systematic review were categorized per propensity score method used. The systematic review identified 4 propensity score techniques: matching, inverse probability of treatment weighting (IPTW or IPW), stratification, and propensity score-adjusted regression [4]. When the propensity score method used was not explicitly mentioned, the method was deduced from the "Methods" and "Results" sections of the paper. The number of studies using propensity score methods and the distribution of the propensity score methods used were both calculated per year.
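To make two of the four techniques concrete, the sketch below derives IPTW weights and stratum assignments from already-estimated propensity scores. It is an illustration only: the weights shown are one common (unstabilized) form targeting the average treatment effect, the default of 5 strata is an assumption, and the function names are hypothetical. None of this is taken from the reviewed studies.

```python
def iptw_weights(treated, propensity):
    """Inverse probability of treatment weights (unstabilized form).

    treated: list of 0/1 indicators.
    propensity: estimated probability of treatment given the covariates.
    """
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated patients get 1/e, untreated patients get 1/(1 - e).
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights


def stratify(propensity, n_strata=5):
    """Assign each patient to one of `n_strata` equal-sized strata by score rank."""
    order = sorted(range(len(propensity)), key=lambda i: propensity[i])
    strata = [0] * len(propensity)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(propensity)
    return strata
```

Scores close to 0 or 1 produce very large weights, which is one reason why reporting the distribution of the weights matters in IPTW studies.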

The median SQSPM per year was determined, and the SQSPM was compared between different propensity score methods and types of infectious diseases. Summaries of the scores per criterion were provided to investigate which criteria were frequently met and which criteria need attention in future research. The degree of concordance between different items of the quality score was assessed using Pearson correlation coefficients and visualized using a heatmap. In this way, patterns of items (ie, clusters) that are often reported poorly could be observed. All analyses were conducted using SPSS, version 25.0 (IBM, Armonk, NY, USA).
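The analyses in this paragraph were run in SPSS. Purely as an illustration, the median score per year and the pairwise Pearson correlations between criteria could be computed as sketched below. The function names and data layout are assumptions, and criteria are assumed to be coded 0/1 per study.

```python
from collections import defaultdict
from math import sqrt
from statistics import median


def median_score_per_year(records):
    """records: iterable of (publication_year, standardized_score) pairs."""
    by_year = defaultdict(list)
    for year, score in records:
        by_year[year].append(score)
    return {year: median(scores) for year, scores in sorted(by_year.items())}


def pearson(x, y):
    """Pearson correlation; returns None when either item has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def concordance_matrix(items):
    """items: dict mapping criterion label -> list of 0/1 scores, one per study."""
    labels = list(items)
    return {a: {b: pearson(items[a], items[b]) for b in labels} for a in labels}
```

The resulting nested dictionary can be passed to any plotting tool to draw the heatmap. For two binary items, the Pearson coefficient coincides with the phi coefficient.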

The systematic search strategy yielded 923 unique peer-reviewed studies. For the systematic review, 437 studies fulfilled the eligibility criteria, and a subset of 108 studies was included for quality assessment. A bibliography of the studies that were included in the systematic review and quality assessment can be found in Supplementary Data 4. The data flow is described in Figure 1. Table 1 shows an overview of the number of studies in which propensity score methods were applied by infectious disease category and propensity score method. Overall, propensity score matching was the most frequently used propensity score method (65% of the studies), followed by propensity score-adjusted regression (25% of the studies). In Figure 2A, the number of studies per year is reported.

The absolute number of studies investigating effectiveness of antimicrobial therapy and using propensity score methods increased 15-fold between the periods 2005-2009 and 2015-2019. In Figure 2B, the relative frequency of the use of the 4 propensity score methods per year is depicted. Initially, propensity score-adjusted regression was the most frequently used propensity score method, but its popularity decreased over time. Stratification was frequently applied in the early years as well but almost disappeared after 2017. IPTW and propensity score matching increased in popularity over the years.

The draft quality assessment tool consisted of 20 criteria, of which criterion 14 consisted of 3 subcriteria (Supplementary Data 6). Based on the feedback from the independent experts, several adjustments were made. The definitive quality assessment tool (Table 2) consists of 18 criteria. For quality assessment, 108 studies were included. Table 3 shows that 66% of these studies received an SQSPM between 50 and 70. Overall, scores ranged between 33 and 88. The score category 80-90 was reached by 2 studies, both of which used propensity score matching. Both articles lacked a complete description of the influence of missing data on propensity score estimation (criterion 18) and the details of propensity score analysis (criterion 7). The discussion of the positivity assumption (ie, that all participants are able to receive both treatments, meaning that estimated propensity scores should not be too close to 0 or 1) (criterion 17b), the sensitivity analysis (criterion 10), and balance after propensity score analysis (criterion 14b) were reported incompletely in both articles. In Figure 2B, the median SQSPM per year is depicted; it fluctuated between 53 and 61 from 2005 until 2020. There was no improvement in the quality of conduct and reporting over the years. For every propensity score technique separately, a similar trend was observed (Supplementary Data 7). No differences in scores were observed between different types of infectious diseases (unpublished data).

In Figure 3, the percentages of articles that completely fulfilled the criteria are reported by criterion. The 5 criteria that were met most frequently were criteria that require description of the use of propensity score methods in the abstract or title (criterion 1), the propensity score method used (criterion 3), the statistical methods used to analyze the data after applying the propensity score method (criterion 9), the software used for analysis (criterion 11), and the sample size before and after matching (in matching studies; criterion 12). These criteria were all met in at least 90% of evaluated studies. The 5 criteria that were met least frequently were criteria that require description of the sensitivity analysis that was used (criterion 10), the distribution of propensity scores in both treatment groups (criterion 14c), the distribution of the weights (in IPTW studies; criterion 15), discussion of the positivity assumption (criterion 17b), and the influence of missing data on propensity score estimation (criterion 18). These criteria were met in ≤15% of evaluated studies.
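Two of the least frequently met criteria, describing the distribution of propensity scores in both treatment groups (criterion 14c) and discussing overlap (criterion 17b), can be supported by a simple numerical summary. The sketch below is an illustration under assumptions: the function names are hypothetical, and the minimum/maximum rule for common support is only one simple heuristic, not a rule prescribed by the quality assessment tool.

```python
from statistics import median


def summarize_by_group(treated, values):
    """Minimum, median, and maximum of `values` within each treatment group."""
    summary = {}
    for group in (0, 1):
        subset = [v for t, v in zip(treated, values) if t == group]
        summary[group] = (min(subset), median(subset), max(subset))
    return summary


def outside_common_support(treated, propensity):
    """Indices of patients whose score lies outside the range shared by both groups."""
    s = summarize_by_group(treated, propensity)
    low = max(s[0][0], s[1][0])   # larger of the two group minima
    high = min(s[0][2], s[1][2])  # smaller of the two group maxima
    return [i for i, e in enumerate(propensity) if e < low or e > high]
```

The same `summarize_by_group` helper can be applied to IPTW weights instead of propensity scores to describe the distribution of the weights (criterion 15). Both functions assume that each treatment group contains at least one patient.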

The concordance in scores between criteria is reported in a heat map (Supplementary Data 8). A high level of concordance was observed between criteria that concern checking of balance and criteria that require discussion of underlying assumptions of propensity score methods. A high level of concordance was also observed between criteria requiring a detailed description of the propensity score model in the "Methods" section.

Table 1 footnotes. Abbreviations: BSI, bloodstream infection; GI, gastrointestinal infection; IPTW, inverse probability of treatment weighting; UTI, urinary tract infection. a n = the number of studies within the infectious disease category. In multiple studies, >1 propensity score method was applied. Therefore, the numbers of the different propensity score methods within the infectious disease category add up to more than the total number of studies included in the infectious disease category. Percentages of different propensity score methods were calculated on the total number of studies within the infectious disease category. Therefore, the percentages add up to >100%.

In the period of study, we observed a large absolute increase in the number of published peer-reviewed studies investigating the effectiveness of antimicrobial therapy in which propensity score methods were applied. The results of the quality assessment showed that the quality of conduct and reporting of propensity score methods in these studies was far from optimal, with the majority of the studies having a standardized quality score between 50 and 70 out of 100. We also found that the quality of conduct and reporting did not improve over time. Quality criteria that were more specific to studies using propensity score methods were less often met than more generic quality criteria. Furthermore, the concordance analysis showed that studies tended to report either many details of the propensity score analysis or none, and to discuss either all assumptions of propensity score methods or none. This indicates that some researchers may not have been sufficiently aware of the concept of discussing underlying assumptions or providing details of the analysis methods that were used.
The fact that the quality of conduct and reporting of propensity score methods is far from optimal has been observed in other fields of research, for example, oncology [11], cardiovascular surgery [15], in high-ranked journals in different disciplines [10], and in a comprehensive quality assessment that included all fields of the medical literature [25]. In these papers, suggestions for improvement were provided, including describing the process of the propensity score model development and the propensity score analysis in detail, checking balance after applying propensity score methods, and discussion of assumptions of propensity score methods [10, 11, 15, 25]. Furthermore, previous research has shown that the quality of reporting of confounding in general was still far from ideal after the publication of the STROBE guideline [26].
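Checking balance after applying a propensity score method, one of the suggestions mentioned above, is often done with the standardized mean difference (also named in the quality assessment tool). The sketch below is a minimal illustration for continuous covariates in a matched sample. The function names are hypothetical, and the default cutoff of 0.1 is an assumed rule of thumb rather than a value taken from this study.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated_values, control_values):
    """Absolute standardized mean difference for one continuous covariate.

    Uses the unweighted pooled standard deviation of the two groups;
    weighted samples would need weighted means and variances instead.
    """
    difference = abs(mean(treated_values) - mean(control_values))
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        # No spread in either group: balanced only if the means coincide.
        return 0.0 if difference == 0 else float("inf")
    return difference / pooled_sd


def imbalanced_covariates(treated, control, threshold=0.1):
    """Names of covariates whose SMD exceeds `threshold` (an assumed cutoff)."""
    return [
        name
        for name in treated
        if standardized_mean_difference(treated[name], control[name]) > threshold
    ]
```

Here `treated` and `control` are dictionaries mapping a covariate name to the list of values in each group. Each group needs at least 2 observations per covariate for the sample variance to be defined.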

Although in infectious disease research a systematic review on quality of conduct and reporting of propensity score methods has not previously been performed, attention to the importance of careful use of propensity score methods has been raised. For example, in a letter to the editor of this journal, Roth et al. argued that studies using propensity score methods in infectious disease medicine should be conducted and reported in a more standardized manner [9]. In an educational paper, Amoah et al. demonstrated in a case study how different propensity score methods and standard regression methods could be used correctly [27].

The quality assessment instrument that was developed in this study describes the spectrum of methodological and reporting standards. The instrument can be used by reviewers and researchers to improve quality of conduct and reporting of studies using propensity score methods in the field of antimicrobial therapy and in other fields of clinical research. It is important to emphasize that the instrument should not be seen as a scoring tool that provides an absolute indication of quality of conduct and reporting of such a study. In our quality assessment, all criteria had equal weight, and we used the SQSPM to count the number of criteria that were fulfilled and to compare these numbers between types of methods and over the years. However, to calculate a score that provides an absolute indication of quality, several criteria should receive more weight than others. In particular, checking the balance after applying a propensity score method and discussion of the assumptions of propensity score methods should probably be weighted more heavily. Therefore, the quality assessment tool should rather be seen as a set of quality criteria that should all be fulfilled and can be used as a checklist to evaluate which items are still unaddressed. Still, even if all quality criteria were fulfilled, propensity score methods only address measured confounders. Therefore, to ensure correct conduct of propensity score methods, the risk of unmeasured confounding should always be estimated before applying propensity score methods.

A strength of this study is that we first presented a detailed overview of the use of propensity score methods in studies investigating the effectiveness of antimicrobial therapy, which is particularly helpful to put the results of the quality assessment into perspective. This outlined the relevance of improvement of quality of conduct and reporting of propensity score methods and provided context for the quality assessment tool that was developed in this study. Furthermore, the quality assessment tool was developed in a structured and stepwise manner. This study was limited in a few respects. To keep the quality assessment tool practical and concise, several quality criteria had to be prioritized over others. For example, there are >2 assumptions of propensity score methods that could be discussed, but we included the 2 assumptions that were considered most important. Another limitation is that the initial quality assessment was performed by one of the investigators and that, as they become more experienced, investigators may become more (or less) strict in their assessments. However, after the quality assessment, 10 articles were re-assessed, and no major discrepancies were found. The above-named factors could have influenced the mean scores, but did not likely influence the overall outcome patterns. Of note, studies were assessed based on conduct and reporting, but it was not a study aim to evaluate if the obtained results were valid (eg, by assessing the likelihood that there would be unmeasured confounding). Previous research showed that methodological features such as propensity scores may not always be mentioned in the title or abstract [28]. Therefore, it is possible that the quality of reporting of propensity score methods is even less optimal than reported here.

Table 2. Quality assessment tool (a)

1. The use of propensity score analysis is indicated with a commonly used term in the title or the abstract. Scoring: No = 0.

Methods

2. Motivation (b) for using propensity score methods is indicated. Scoring: No = 0.

3. It is described which propensity method is used (if >1, consider the primary analysis). Scoring: Yes, weighting = 1; Yes, stratification = 1; Yes, covariate adjustment using propensity score = 1; No = 0.

4. It is indicated which method is used to estimate the propensity score. Scoring: Yes, a logistic model = 1; Yes, decision trees = 1; Yes, other, namely: = 1; No = 0.

5. The process of variable selection for the propensity score model is described. Scoring: Yes, variables are selected with a statistical selection method = 1; Yes, some variables are specified beforehand, and others are selected with a statistical selection method = 1; Yes, other, namely: = 1; No = 0.

6. The variables included in the propensity score model are described. Scoring: No = 0.

7. Details of propensity score analysis are described:
a. Details that should be described for propensity score matching: matching algorithm, caliper, matching ratio, with/without replacement.
b. Details that should be described for propensity score weighting: It is clear how weights are obtained.
c. Details that should be described for propensity score stratification: The number of strata is provided; strata are defined clearly.

9. Statistical methods to analyze data after applying a propensity score method are described.

11. The software used for analysis is indicated. If propensity score matching was used: The package used to create matched sample is described as well.

12. Sample size for each treatment group before and after matching is reported. Scoring: No = 0; Not applicable (IPTW, stratification, covariate adjustment using propensity score) = 0.

13. The distribution of baseline characteristics for each group before propensity score analysis is described.

14a. After propensity score matching, weighting, or stratification: The distribution of baseline characteristics in the matched/weighted groups or in each stratum is reported. Scoring: No = 0; Not applicable (covariate adjustment using propensity score) = 0.

14b. After propensity score matching, weighting, or stratification: It has been checked whether sufficient balance has been achieved (love plot, standardized mean difference, etc.). Response options include: Incomplete.

14c. After propensity score matching, weighting, or stratification: The distribution of the propensity scores in both treatment groups is described (in plot or text).

15. The distribution of the size of the weights is described. Scoring: Yes = 1; No = 0; Not applicable (PSM, stratification, covariate adjustment using propensity score) = 0.

16. The number of patients with missing data for each variable of interest for the propensity score analysis is reported. Scoring: No = 0.

17a. Modeling assumptions are met: The no unmeasured confounders assumption is discussed. Scoring: No = 0.

17b. Modeling assumptions are met: It is discussed whether there is sufficient overlap to perform propensity score analysis (positivity assumption). Scoring: No = 0.

18. The influence of missing data in propensity score estimation and missing data due to incomplete matching is discussed. Scoring: No = 0.

Abbreviations: IPTW, inverse probability of treatment weighting; PSM, propensity score matching.

a The quality assessment tool is based on the suggested quality criteria by Yao et al. [11], additional literature on propensity score methods, and discussion between experts.

b For this quality assessment, the use of propensity score methods was considered to be motivated when somewhere in the article it was at least mentioned that propensity score methods were used to address confounding.

From the results of this study, the conclusion can be drawn that the quality of conduct and reporting of propensity score methods in studies investigating antimicrobial therapy needs substantial improvement. The quality assessment instrument constructed in this study can be used as a starting point for designing, conducting, and reporting a study in which propensity score methods are applied. The instrument can also assist reviewers in evaluating whether an article on a study that uses propensity score methods meets all requirements. Optimally, guidelines should be developed and incorporated in tools such as STROBE or ROBINS-I [29, 30]. By doing this, the quality of conduct and reporting of these increasingly popular statistical methods can be improved. This would structurally contribute to the validity of research on the effectiveness of antimicrobial therapy.

Supplementary materials are available at Open Forum Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

We would like to thank Dr. S. A. Swanson for providing feedback on the quality assessment tool and for commenting on the manuscript.

Figure 3 note: See Table 2 for the definition of the criteria. For every criterion, only the studies to which the criterion applied were included.

1. Evidence for health decision making - beyond randomized, controlled trials
2. Testing Treatments: Better Research for Better Healthcare
3. Understanding controlled trials. Why are randomised controlled trials important?
4. An introduction to propensity score methods for reducing the effects of confounding in observational studies
5. An introduction to propensity scores: what, when, and how
6. The central role of the propensity score in observational studies for causal effects
7. The antibiotic resistance crisis: part 1: causes and threats
8. The antimicrobial resistance crisis: causes, consequences, and management
9. Plea for standardized reporting and justification of propensity score methods
10. Review of the use of propensity score diagnostics in papers published in high-ranking medical journals
11. Reporting and guidelines in propensity score analysis: a systematic review of cancer and cancer surgical studies
12. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
13. Analysis of the Future: The Delphi Method (P-3558)
14. The Delphi Method: Techniques and Applications
15. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement
16. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group
17. Propensity score: an alternative method of analyzing treatment effects
18. Formulating causal questions and principled statistical answers
19. Indications for propensity scores and review of their use in pharmacoepidemiology
20. The relative ability of different propensity score methods to balance measured covariates between treated and untreated subjects in observational studies
21. A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study
22. A tutorial and case study in propensity score analysis: an application to estimating the effect of in-hospital smoking cessation counseling on mortality
23. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples
24. An overview of the objectives of and the approaches to propensity score analyses
25. Reporting of covariate selection and balance assessment in propensity score analysis is suboptimal: a systematic review
26. Quality of reporting of confounding remained suboptimal after the STROBE guideline
27. Comparing propensity score methods versus traditional regression analysis for the evaluation of observational data: a case study evaluating the treatment of gram-negative bloodstream infections
28. Title, abstract, and keyword searching resulted in poor recovery of articles in systematic reviews of epidemiologic practice
29. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration
30. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions