title: An update on statistical modeling for quality risk assessment of clinical trials
authors: Koneswarakantha, B.; Menard, T.
date: 2021-07-15
doi: 10.1101/2021.07.12.21260214

Abstract

Background - As investigator site audits have largely been conducted remotely during the COVID-19 pandemic, remote quality monitoring has gained some momentum. To further facilitate the conduct of remote Quality Assurance (QA) activities, we developed new quality indicators, building on a previously published statistical modelling methodology.

Methods - We modeled the risk of having an audit or inspection finding using historical audit and inspection data from 2011-2019. We used logistic regression to model finding risk for 4 clinical impact factor (CIF) categories: Safety Reporting, Data Integrity, Consent and Protecting Endpoints.

Results - The resulting areas under the receiver operating characteristic curve (AUC) were between 0.57 and 0.66, with calibrated predictive ranges of 27-41%. The combined and adjusted risk factors could be used to easily interpret risk estimates.

Conclusion - Continuous surveillance of the identified risk factors and the resulting risk estimates could complement remote QA strategies and help to manage audit targets and audit focus, also in post-pandemic times.

Introduction

In a recent project, we modelled the risk of having clinical trial audit or inspection findings by combining historical audit and inspection findings gathered over 9 years with operational QA data. Findings were grouped into 5 clinical impact factors (CIFs), for which we were able to model finding risk using logistic regression with easily interpretable features. Despite a low signal-to-noise ratio, we could reliably predict a decrease in risk of 12-44% with 2-8 coefficients per model [1]. However, the features that we generated were not very distinctive: most of them described study protocol properties, so the models would assign similar risk to all sites of a given study. Nevertheless, we demonstrated that our approach could be used to identify risk factors for audit and inspection findings [1].

Having expanded our historical operational dataset, we could apply our previously established methodology to remodel the risk of audit and inspection findings. The resulting models were of similar predictive quality but included more distinctive site-specific risk factors. This improved operational site quality monitoring, as ongoing risk assessments adapted to site quality indicators.

Methods

Audit and inspection findings and all clinical trial data were gathered from Roche internal data sources. Geographic population data was purchased from https://marketplace.namara.io.

The risk of audit and inspection findings was modelled as previously described [1]. Briefly, audit and inspection findings from 2011-2019 were assigned to one of five Clinical Impact Factor (CIF) categories. Several operational features (based on Adverse Events (AE), Issues and Deviations, and Data Queries) were engineered to reflect the state of the site at the time of the audit or inspection. Operational features were complemented by site characteristics such as geographic population and study characteristics such as therapeutic area. To account for the number of patients and the individual study progress of each patient, we normalized features by either the number of patient visits or the total number of days that had passed for all patients since enrollment. As critical thresholds for quality indicators can be protocol specific, we also calculated the study percent rank for some features (e.g. percentage of missed visits and number of parallel trials), which indicated the percentage of sites in the same study that had a lower value.
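To make this feature engineering concrete, the following is a minimal sketch in Python/pandas. The data and column names (n_aes, n_visits, missed_visits) are hypothetical illustrations, not our internal schema, and pandas' percentile rank is only an approximation of "share of sites with a lower value".

```python
import pandas as pd

# Hypothetical site-level snapshot at the time of an audit; the columns
# are illustrative and do not correspond to the actual internal schema.
sites = pd.DataFrame({
    "study_id":      ["S1", "S1", "S1", "S2", "S2"],
    "site_id":       ["A", "B", "C", "D", "E"],
    "n_aes":         [40, 5, 12, 7, 3],
    "n_visits":      [200, 50, 80, 35, 60],
    "missed_visits": [10, 1, 8, 2, 0],
})

# Normalize raw event counts by exposure (here: patient visits) so that
# sites with different enrollment and study progress become comparable.
sites["aes_per_visit"] = sites["n_aes"] / sites["n_visits"]
sites["pct_missed_visits"] = sites["missed_visits"] / sites["n_visits"]

# Study percent rank: the approximate share of sites in the same study
# with a lower value, making protocol-specific thresholds comparable
# across studies.
sites["missed_visits_pct_rank"] = (
    sites.groupby("study_id")["pct_missed_visits"].rank(pct=True)
)
print(sites)
```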
Continuous features were normalized using a Yeo-Johnson power transformation [2] and binned into 5 groups of equal value range. For missing values, all resulting binary bins were set to zero. Finding frequencies for each bin and for categorical features were examined, and promising candidates were preselected. The set of preselected features was narrowed down by fitting logistic regression models to a training data subset (2011-2015). We then iteratively removed uninterpretable features based on Subject Matter Expert (SME) review and merged bins to facilitate their interpretation, obtaining a final feature set which was used to fit a model on the entire data set. We confirmed that there were no relevant correlations between the final model features, with a maximum absolute correlation of 0.19. Sites that had missing values for all final features were removed before model fitting.

To validate the models and obtain a performance estimate, we used time series cross validation [3], in which the data from each year served as a test set for a model fit on all data from previous years (see Fig. 1). The receiver operating characteristic area under the curve (AUC) and Brier scores were calculated by pooling the predictions from all test sets. To fit a calibration model, the probability risk estimates of the test set predictions were divided into 4 bins of semi-equal range with a minimum of 100 test predictions per bin. For each bin, the predicted risk was averaged and the actual observed risk was calculated; the calibration model itself was a linear regression of the observed risk on the mean predicted risk. To avoid extreme predictions from unvalidated combinations of risk factors, we calculated lower and upper range limits using the observed risk of the 50 lowest and 50 highest risk estimates.

Overall, the availability of more features allowed us to simplify our modelling strategy compared to the previous iteration [1]. As most features were site specific, we no longer removed sites from previously audited studies from the test sets during time series cross validation (a step previously needed to avoid data leakage [4] from the training to the test set). Furthermore, as the risk models were generated with a larger number of coefficients, a linear calibration fit was more appropriate than a manually fitted step function.
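As an illustration of this validation and calibration scheme, here is a minimal sketch using scikit-learn on synthetic data. The data frame, column names and equal-width prediction bins are simplifying assumptions; this is not the production code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

def time_series_cv(df, feature_cols, target_col="finding", year_col="year"):
    """Expanding-window cross validation: each year is scored by a model
    fit on all earlier years; test predictions are pooled across folds."""
    preds = []
    years = sorted(df[year_col].unique())
    for year in years[1:]:  # the first year only ever serves as training data
        train = df[df[year_col] < year]
        test = df[df[year_col] == year]
        model = LogisticRegression(max_iter=1000)
        model.fit(train[feature_cols], train[target_col])
        p = model.predict_proba(test[feature_cols])[:, 1]
        preds.append(pd.DataFrame({"p": p, "y": test[target_col].to_numpy()}))
    return pd.concat(preds, ignore_index=True)

def fit_calibration(pooled, n_bins=4):
    """Bin the pooled predictions, then regress the observed risk per bin
    on the mean predicted risk per bin (equal-width bins for simplicity)."""
    bins = pd.cut(pooled["p"], bins=n_bins)
    grouped = pooled.groupby(bins, observed=True).agg(
        mean_pred=("p", "mean"), obs_risk=("y", "mean"))
    return LinearRegression().fit(grouped[["mean_pred"]], grouped["obs_risk"])

# Synthetic stand-in data with a weak signal, mimicking the low
# signal-to-noise setting described in the text.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2011, 2020, size=2000),
    "x1": rng.normal(size=2000),
    "x2": rng.normal(size=2000),
})
df["finding"] = (rng.random(2000) < 0.25 + 0.1 * (df["x1"] > 1)).astype(int)

pooled = time_series_cv(df, ["x1", "x2"])
print("AUC:  ", roc_auc_score(pooled["y"], pooled["p"]))
print("Brier:", brier_score_loss(pooled["y"], pooled["p"]))
calibration = fit_calibration(pooled)
```

On real data, the bin edges would instead be chosen so that each bin holds at least 100 test predictions, and the lower and upper range limits would be derived from the 50 lowest and 50 highest risk estimates as described above.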
Results

The identified risk factors could be categorized into several groups (AE, Issues and Deviations, Data Queries, Geographical Populations, Parallel Trials and Study Characteristics; see Table 2).

AEs, data queries, and issues and deviations represented frequent on-site events that were connected to heavily regulated operational processes and left a coherent data trail. Thus, event frequencies and processing times could serve as proxy measures for operational quality and were likely to influence the risk of findings.

We had observed in the past that geographical location influences finding risk. Instead of using country-specific risk factors, we chose to analyse characteristics of the population living within a 100 km radius of a site, as not all sites in a given country draw from a similar patient population and attract similar staff. Furthermore, population-specific features were more descriptive and easier to interpret. For example, a low male-to-female sex ratio in the younger adult population (18-39 years) was a good surrogate indicator for sites in urban areas (Fig. 6), as it tends to be easier for women to find adequate work in urban centers than in rural areas [5, 6]. The remaining risk factor categories, parallel trials and study characteristics, were less influential and had already been identified previously [1].

We were able to reproduce 4 out of 5 models using this refined set of features. The model for the impact factor sponsor oversight had shown the weakest performance in the previous iteration [1], and the previously identified risk factors did not fulfill the stricter interpretability criteria. The best AUC we could obtain was 0.54, with a calibrated predictive range of only 3% (Table 1 and Fig. 2). For the other CIFs, the AUCs were 0.57-0.66 with calibrated predictive ranges of 27-41%. The risk factors were drawn from several risk factor categories (Table 2), mostly derived from operational site data. The individual adjusted risk factors for each model have been plotted as tabular forest plots, which also show the risk factor distribution in the training data (Fig. 3-6).

Interpretation

Clinical Impact Factor: Consent (Fig. 3)

Informed consent requires paper signatures on forms that include the most up-to-date study information. This process is not yet fully digitized in most studies [7]. Therefore, a common auditing activity is to verify the paper signatures; if signatures were missing or obtained too late, a consent finding would be raised. The risk of such findings was increased for pediatric studies, which require signatures from each parent. Moreover, when patients missed scheduled visits, they might not have been re-consented in time; consequently, risk increased with higher percentages of missed visits.

Site issues and protocol deviations must be captured in an issue log and processed within a specific timeframe. Interestingly, the number of issues that were closed late was indicative of an increased consent risk, while a low number of minor deviations reduced the risk of consent findings. These risk factors did not seem to be directly connected to the consent process and seemed to be purely indicative of site quality; they also influenced the risk of findings related to other CIFs (Fig. 5-6). An increase in consent risk could also be found for sites located in densely populated areas.
We could not see how this connected to the consent process. Since population density did not influence any of the other risk models, we could only suspect a strong unknown confounding variable that we were not yet capturing.

Clinical Impact Factor: Safety (Fig. 4)

AEs need to be adequately recorded in the medical database, followed by a medical seriousness and causality assessment by qualified site staff. For serious AEs (SAEs), accelerated reporting timelines apply. Any detectable failure in this process would trigger safety findings. Accordingly, high rates of AEs increased the risk of safety findings, while a low rate of SAEs or the absence of SAEs decreased it. Independently of AE rates, cancer studies had an increased risk of safety findings, as did sites located in regions with a high percentage of people over 60 years old. We could speculate that older or terminally ill patients were more likely to suffer from concomitant diseases, which added complexity to adequate AE capture and causality assessment.

Clinical Impact Factor: Data Integrity (Fig. 5)

Whenever there were mismatches between source data records and the clinical databases, a data integrity finding was raised. Invalid entries in the clinical database, or a site monitor discovering questionable entries, could trigger a data query that needed to be addressed by the site staff. We observed that if the normalized number of open queries per site was low, the risk of data integrity findings was also decreased. Furthermore, a comparatively low number of active trials at the site in the last year and a low number of issues due at the time of the audit both decreased data integrity finding risk. These last two risk factors were not directly connected to data integrity and should thus be viewed as generalisable site quality indicators. Interestingly, the ratio of males to females aged 18-39 in the population around the site seemed to influence data integrity risk. A lower ratio was indicative of sites in an urban location, while a high ratio seemed to indicate a more rural location, as rural labor markets were primarily populated by men [5, 6]. We could speculate that urban site staff were more used to working with medical computer systems, or that network connectivity was better at urban sites.

Clinical Impact Factor: Protecting the Primary Endpoints (Fig. 6)

Root causes for findings raised because the primary endpoints were at risk were numerous. Among the most frequent were inadequate study documents, mishandling of samples or the investigational medicinal product (IMP), and mismanaged protocol deviations. In this category we mostly identified risk factors that were indicative of overall site operational quality, some of which also influenced finding risk for other CIFs, such as the number of late issues, the number of minor deviations and the query processing time. Furthermore, speedy reporting of AEs decreased risk as well, which was potentially another operational site quality indicator.

Discussion

Using more operational site features instead of the more static study features of our previous iteration was a major improvement.
This resulted in more diverse risk predictions for the sites within a given trial, and in risk estimates that continuously adjusted as participating sites created new data points. However, none of the new operational features correlated with sponsor-oversight-related audit and inspection findings. Useful features could probably be engineered from monitoring visits, source data verification and vendor management, but we have not yet been able to obtain that data electronically for a sufficient fraction of previously audited sites. The long observational time span, during which data standards and IT systems have changed, is the biggest limiting factor. The model that we had previously fit for the sponsor oversight CIF was the lowest performer among all models, with an AUC barely over 0.5; its features were based on study characteristics and total site enrollment rather than enrollment at the time of the audit. Despite the low AUC, the sponsor oversight model of the last iteration had an acceptable calibrated prediction range of Δ26%, which we were not able to reproduce with the refined set of features of this iteration.

This showed that our attempt to model risk in this low signal-to-noise environment was not bias-free: we did need to make certain compromises in order to end up with interpretable risk models that could support business decisions. For example, strictly linear relationships between modelled risk and features were rarely detectable. In many cases, risk would only increase or decrease once a feature crossed a certain threshold (see AE per visit in Fig. 4), and we therefore resorted to binning all numerical features. Moreover, the large fraction of missing values for some features was not ideal. By setting all bins derived from a feature with a missing value to zero, we implicitly imputed missing values as belonging to the reference bin. This carried some risk of bias, a trade-off that we were willing to make in favour of including additional interpretable risk factors in our models.
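The following is a minimal sketch of this binning and missing-value strategy, assuming scikit-learn's Yeo-Johnson implementation and toy data. In the fitted models, one bin acts as the reference level, so an all-zero indicator row implicitly assigns a missing value to that reference bin.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

def bin_numeric_feature(values, n_bins=5):
    """Yeo-Johnson-transform a numeric feature, cut it into n_bins of
    equal value range, and one-hot encode the bins. Rows with missing
    values get all-zero indicators and therefore fall into the implicit
    reference bin of the regression."""
    s = pd.Series(values, dtype="float64")
    observed = s.dropna()
    pt = PowerTransformer(method="yeo-johnson")
    transformed = pd.Series(
        pt.fit_transform(observed.to_frame()).ravel(), index=observed.index)
    bins = pd.cut(transformed, bins=n_bins)  # equal-range bins
    dummies = pd.get_dummies(bins, prefix="bin", dtype=int)
    # Reindexing to the full input restores the missing rows as all zeros.
    return dummies.reindex(s.index, fill_value=0)

# Toy feature with missing values; NaN rows end up with all bins at zero.
X = bin_numeric_feature([0.1, 2.5, np.nan, 7.0, 0.4, 12.0, np.nan, 3.3])
print(X)
```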
Altogether, we have modelled audit finding risk for four CIFs with improved AUC and calibrated predictive range. We were able to include at least one operational risk factor that changes over the course of a study, together with several site-specific risk factors. It was thus likely that a given set of sites from one or more studies would cover the entire predictive range of a model, allowing quality professionals to rank sites by CIF finding risk in a meaningful and transparent way.

It is important to note that, despite our effort to validate our models using time series cross validation and calibration, we have merely modelled the risk of historic investigator site audit and inspection findings. The recent pandemic has accelerated the decline of traditionally conducted audits and inspections, and remote quality activities are becoming more and more common [8, 9]; the number of traditional site audits in 2020 was already too low to be included in this iteration. Moreover, sites audited in the past were often the sites that had recruited the most patients and thus carried the highest risk for the study outcome, and findings were restricted to quality issues that could be identified by an auditing team on site. This selection bias limits the generalizability of the models we have created; in practice, we therefore display audit finding risk along with additional relevant operational site quality indicators that can be grouped under the same CIF. The risks and the quality indicators are integrated into a clinical analytics dashboard used by quality leads to manage audit focus and audit target selection.

Conclusion

As travel and physical access to sites become more restricted, new complementary QA strategies based on remote data analytics are emerging. Monitoring quality indicators and audit finding risk estimates could help to manage audit target selection and audit focus. To establish regulatory and industry trust, and to foster adoption of analytics-driven QA, we will continue to focus our efforts on the development of novel QA methods [10, 11, 12, 13, 14] and on cross-company collaboration and data sharing [15].

References

[1] Harnessing the Power of Quality Assurance Data: Can We Use Statistical Modeling for Quality Risk Assessment of Clinical Trials?
[2] A new family of power transformations to improve normality or symmetry
[3] On the use of cross-validation for time series predictor evaluation
[4] Leakage in Data Mining: Formulation, Detection, and Avoidance
[5] How is internal migration reshaping metropolitan populations in Latin America? A new method and new evidence
[6] Gender-Specific Migration from Eastern to Western Germany: Where Have All the Young Women Gone
[7] A need to simplify informed consent documents in cancer clinical trials. A position paper of the ARCAD Group
[8] Leveraging analytics to assure quality during the Covid-19 pandemic - the COVACTA clinical study example
[9] Letter to the Editor: New Approaches to Regulatory Innovation Emerging During the Crucible of COVID-19
[10] Enabling Data-Driven Clinical Quality Assurance: Predicting Adverse Event Reporting in Clinical Trials Using Machine Learning
[11] Follow-Up on the Use of Machine Learning in Clinical Quality Assurance: Can We Detect Adverse Event Under-Reporting in Oncology Trials
[12] Follow-up on the Use of Advanced Analytics for Clinical Quality Assurance: Bootstrap Resampling to Enhance Detection of Adverse Event Under-Reporting
[13] Bayesian Modeling for the Detection of Adverse Events Underreporting in Clinical Trials
[14] Using Statistical Modeling for Enhanced and Flexible Pharmacovigilance Audit Risk Assessment and Planning
[15] Cross-company collaboration to leverage analytics for clinical quality and accelerate drug development - the IMPALA industry group