key: cord-0977260-cezgpocd
authors: Ueno, Taro; Ichikawa, Daisuke; Shimizu, Yoichi; Narisawa, Tomomi; Tsuji, Katsunori; Ochi, Eisuke; Sakurai, Naomi; Iwata, Hiroji; Matsuoka, Yutaka J
title: Comorbid insomnia among breast cancer survivors and its prediction using machine learning: a nationwide study in Japan
date: 2021-10-27
journal: Jpn J Clin Oncol
DOI: 10.1093/jjco/hyab169
sha: 55450fc06c2d72c9ad7643b0b879f27084796172
doc_id: 977260
cord_uid: cezgpocd

OBJECTIVE: Insomnia is an increasingly recognized major symptom of breast cancer that can seriously disrupt quality of life during and many years after treatment. Sleep problems have also been linked with survival in women with breast cancer. The aims of this study were to estimate the prevalence of insomnia in breast cancer survivors, clarify the clinical characteristics of their sleep difficulties and use machine learning techniques to explore clinical insights.
METHODS: Our analysis of data, obtained in a nationwide questionnaire survey of breast cancer survivors in Japan, revealed a prevalence of suspected insomnia of 37.5%. With the clinical data obtained, we then used machine learning algorithms to develop a classifier that predicts comorbid insomnia. The performance of the prediction model was evaluated using 8-fold cross-validation.
RESULTS: When using optimal hyperparameters, the L2 penalized logistic regression model and the XGBoost model provided predictive accuracy of 71.5 and 70.6% for the presence of suspected insomnia, with areas under the curve of 0.76 and 0.75, respectively. Population segments with high risk of insomnia were also extracted using the RuleFit algorithm. We found that cancer-related fatigue is a predictor of insomnia in breast cancer survivors.
CONCLUSIONS: The high prevalence of sleep problems and its link with mortality warrant routine screening.
Our novel predictive model using a machine learning approach offers clinically important insights for the early detection of comorbid insomnia and intervention in breast cancer survivors.

Breast cancer is the leading cancer affecting women worldwide. In 2009, 60 000-70 000 new cases of breast cancer were reported in Japan (1). Advances in screening and treatment are improving survival times. With the 5-year survival rate as high as 93%, many Japanese women are now living as survivors of breast cancer. Longer survival times are drawing increasing attention to the impact of the disease and its treatment on long-term outcomes and health-related quality of life.

Insomnia is one of the most prevalent symptoms experienced by cancer patients (2). In Japan, the prevalence of insomnia is 14.6-22.3% among women in the general population (3-5), and prevalence is known to be higher in breast cancer survivors than in the general population (2, 6-8). The pooled estimate for the prevalence of sleep disturbance is ∼40%. In addition, insomnia can be a significant independent prognostic factor in breast cancer survivors (9-11). Despite this evidence, no studies have systematically investigated the prevalence of comorbid insomnia among breast cancer survivors in Japan.

The purpose of this study was to clarify the prevalence, severity and characteristics of insomnia in breast cancer survivors and to determine the clinical characteristics associated with the comorbidity. We also developed a classifier that predicts comorbid insomnia using two machine learning algorithms, namely, the L2 penalized logistic regression model and the XGBoost model (12). Furthermore, we used the RuleFit algorithm to extract the hidden rules for segments at high risk of comorbid insomnia.

In this study, we analyzed responses to the Athens Insomnia Scale (AIS) as part of a nationwide survey of Japanese breast cancer survivors (13).
The study was approved by the institutional review board of the National Cancer Center, Japan (ID: 2018-295) and by the ethics committees of all 34 participating hospitals. Data were collected anonymously, and care was taken not to collect any identifying information. Attending physicians in the outpatient clinics of the facilities handed out a set of materials containing explanatory documents and a survey form to eligible participants, who completed the survey independently and returned it by mail to the research office.

Measurement items are described in the protocol paper (13). The following items were collected: background information, the Global Physical Activity Questionnaire, EuroQol 5 Dimension, the Japanese equivalent of WHO Health and Work Performance Questionnaire Short Form, the Cancer Fatigue Scale (CFS), the Concerns about Recurrence Scale, the AIS, the Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE) and the Resilience Scale. The total score of each scale was used to develop the prediction models.

Eligibility criteria were as follows: (i) diagnosis of primary breast cancer without distant metastasis, (ii) no recurrence, (iii) age ≥ 20 years, (iv) completion of initial treatments with curative intent aside from hormone therapy and (v) already informed of the diagnosis of breast cancer. Participants who could not complete the self-reported questionnaire (written in Japanese) unaided were excluded. Participants who did not complete the AIS were also excluded.

For participant recruitment, we selected 52 facilities accredited by the Japanese Breast Cancer Society that conducted more than 100 breast cancer surgeries in the period from April 2016 to March 2017. Based on the sampling method used in the public opinion survey on cancer countermeasures conducted by the Japan Cabinet Office, we set 22 stratified categories according to 11 districts and population size (>200 000 people and <200 000 people).
We used the AIS to assess the comorbidity of insomnia. The scale, which was created by the World Health Organization as part of the 'World Sleep and Health Project' (14, 15), is an eight-item, self-administered psychometric instrument with a maximum total score of 24 points. An AIS score of 6 is the optimum cutoff based on the balance between sensitivity and specificity (16), and we considered a score of ≥6 to indicate suspected comorbid insomnia in this study.

The missing values were imputed using a regression model in which the variables other than the one being imputed were used as predictors (17). Logistic regression analysis was used for categorical variables, and multiple regression analysis was used for continuous variables. All the data except for those with >20% missing values were used to build the prediction models.

We used the L2 penalized logistic regression model (18) to realize increased stability while overcoming logistic regression's shortcoming of degraded performance when features are strongly correlated (19). Along with the baseline model, we used models obtained by machine learning using a gradient-boosting decision tree (GBDT) approach, which is an ensemble learning algorithm that combines base learners, such as decision trees and linear classifiers (20). We adopted the GBDT approach because of its superior predictive ability (21). GBDT gives a predictive model as an ensemble of decision trees and achieves high predictive ability with a differentiable loss function. GBDT requires tuning of parameters, such as the number of trees, shrinkage parameter and interaction depth, which was done by a grid search.

The RuleFit algorithm learns sparse linear models that include automatically detected interaction effects in the form of decision rules (22). The first step of the algorithm is extracting decision rules from the original features, which was done using random forests in the present study (23).
The second step is learning a sparse linear model with the original features and new features based on the decision rules. In the study, we used the L2 penalized logistic regression model as a sparse linear model (24). These new features reflect the interactions between the original features.

To assess the predictive ability of each machine learning algorithm, we used an 8-fold external cross-validation procedure. For cross-validation, we used subject-wise data splitting rather than record-wise data splitting because identity confounding has been reported with the latter (25, 26). Then, to assess the predictive accuracy of our developed classifier, we used receiver-operating characteristic curve analysis, where the area under the curve (AUC) represented the ability to predict comorbid insomnia (27, 28). In the medical field, a prediction model is considered to have high accuracy when its AUC is ≥0.9, moderate accuracy when it is ≥0.7 but <0.9 and low accuracy when it is ≥0.5 but <0.7. In this study, moderate accuracy or higher was considered to indicate a suitable model.

The importance of each variable for class discrimination in the predictive model was assessed using the mean decrease in gain. The Python libraries pandas (version 1.0.5), xgboost (version 0.9) and scikit-learn (version 0.21.2) were used for data handling and for building and evaluating the prediction models.

In total, 791 individuals from 34 hospitals participated in the nationwide survey, and we collected data from 759 participants who returned the AIS questionnaire. Characteristics of these participants are shown in Table 1. Overall, 284 participants (37.4%) were assessed as having suspected comorbid insomnia, 83 of whom (11%) took medication for insomnia.
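The labelling and evaluation conventions described above (an AIS total of ≥6 as the positive class, AUC as the performance measure and the AUC bands for high, moderate and low accuracy) can be illustrated with a minimal sketch. It uses only the Python standard library rather than the scikit-learn and xgboost pipeline the study reports, and it is not the authors' code. The function names are hypothetical, and the 0-3 range for each AIS item is an assumption based on the standard scale (eight items, maximum total 24).

```python
from typing import Sequence

AIS_CUTOFF = 6  # from the text: a total score >= 6 indicates suspected insomnia


def ais_total(item_scores: Sequence[int]) -> int:
    """Sum the eight AIS items (assumed to be scored 0-3 each, maximum 24)."""
    if len(item_scores) != 8 or any(not 0 <= s <= 3 for s in item_scores):
        raise ValueError("AIS expects eight item scores in the range 0-3")
    return sum(item_scores)


def suspected_insomnia(total: int) -> bool:
    """Apply the cutoff used in the study to an AIS total score."""
    return total >= AIS_CUTOFF


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_band(auc: float) -> str:
    """Map an AUC to the bands quoted in the text."""
    if auc >= 0.9:
        return "high"
    if auc >= 0.7:
        return "moderate"
    if auc >= 0.5:
        return "low"
    return "below chance"


# Prevalence arithmetic from the reported counts: 284 of 759 respondents.
prevalence_percent = round(100 * 284 / 759, 1)
print(prevalence_percent)    # 37.4
print(accuracy_band(0.76))   # moderate (L2 penalized logistic regression)
print(accuracy_band(0.75))   # moderate (XGBoost)
```

Under these bands, both reported AUCs (0.76 and 0.75) fall in the moderate range, which the study treats as indicating a suitable model.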
Characteristics of each questionnaire item in the AIS are shown in Table 2.

The L2 penalized logistic regression model (Fig. 1) and the XGBoost model (Fig. 2) were used to develop classifiers for comorbid insomnia in breast cancer survivors based on questionnaire surveys and had predictive accuracy of 71.5 and 70.6% for the presence of suspected insomnia, giving AUCs of 0.76 and 0.75, respectively.

We then investigated the importance of variables in the optimal predictive model obtained by machine learning. Figure 3 shows the ranking of variable importance in the L2 penalized logistic regression model. General fatigue determined on the CFS (29) was the most important variable for predicting comorbid insomnia, followed by physical fatigue and cognitive fatigue. In addition, high QOL measured on the EuroQol Five-Dimensional Questionnaire (30) and resilience measured on the 14-item Resilience Scale (31, 32) were less strongly related to comorbid insomnia. Figure 4 shows the ranking of variable importance in the XGBoost model, where general fatigue was again ranked as the most important variable for prediction of comorbid insomnia.

We further used the RuleFit algorithm (22) to discover the hidden rules that may be predictive of the risk of comorbid insomnia from among a number of potential candidates. Figure 5 shows results for the population segment classified by the RuleFit algorithm as having high risk based on the following rules: presence of depressive symptoms on the PRO-CTCAE, cognitive fatigue score > 4.5 on the CFS, and resilience score < 78.5 on the 14-item Resilience Scale Short Version (RS14). The prevalence of insomnia in this segment was high at 73%. Figure 6 shows the results for the population segment classified by the RuleFit algorithm as having low risk of comorbid insomnia based on the following rules: RS14 score > 62.5, a higher EQ5D score indicating higher QOL and no decrease in physical activity.
The prevalence of insomnia in the segment was only 20%.

Comorbid insomnia is known to affect quality of life and prognosis in breast cancer survivors. In this study, using nationwide survey data, we have shown for the first time a high prevalence of insomnia in breast cancer survivors in Japan. Based on AIS responses, the prevalence of suspected comorbid insomnia was as high as 37.5%. Previous reports showed that the prevalence of sleep disturbance in breast cancer is ∼40%. The prevalence of insomnia among breast cancer patients in Japan is therefore comparable with that in previous reports and higher than that in the general population. The participants tended to complain of symptoms such as daytime sleepiness, insufficient sleep duration and unsatisfactory quality of sleep.

Using the survey data collected, we also used machine learning algorithms, namely, the L2 penalized logistic regression model and the XGBoost model, to develop classifiers that successfully predicted comorbid insomnia in breast cancer survivors. The ranking of variable importance revealed that fatigue, QOL and resilience were predictive of both risk of comorbid insomnia and protection against it. These variables were selected in both the logistic regression model and the XGBoost model, so the results were consistent. The impact of insomnia on QOL is widely recognized and the relationship between insomnia, fatigue and resilience has also been reported (33, 34).

We further extracted segments with high risk of comorbid insomnia using the RuleFit algorithm. The algorithm identified cancer-related fatigue, health-related QOL and resilience as important rules. These results suggest that, for early detection of comorbid insomnia and intervention in breast cancer survivors, it might be helpful in clinical practice to assess these factors and to focus on patients with cancer-related fatigue, low health-related QOL and low resilience.
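To make concrete how a RuleFit-style rule defines a population segment, the high-risk conditions reported in the results (depressive symptoms present, cognitive fatigue score > 4.5 on the CFS, RS14 score < 78.5) can be applied as a simple filter. The following is a minimal sketch, not the authors' implementation. It assumes the listed conditions are combined as a conjunction, as is usual for a single decision rule. The dictionary keys and the four toy respondents are hypothetical and are not study data.

```python
from typing import Callable, Dict, List, Optional

Respondent = Dict[str, float]
Condition = Callable[[Respondent], bool]

# Thresholds are taken from the text; the key names are hypothetical.
HIGH_RISK_CONDITIONS: List[Condition] = [
    lambda r: r["depressive_symptoms"] == 1,   # present on the PRO-CTCAE
    lambda r: r["cfs_cognitive"] > 4.5,        # cognitive fatigue on the CFS
    lambda r: r["rs14"] < 78.5,                # resilience on the RS14
]


def in_segment(r: Respondent, conditions: List[Condition]) -> bool:
    """A respondent belongs to the segment only if every condition holds."""
    return all(cond(r) for cond in conditions)


def segment_prevalence(respondents: List[Respondent],
                       conditions: List[Condition],
                       cutoff: int = 6) -> Optional[float]:
    """Share of segment members with an AIS total at or above the cutoff.
    Returns None when no respondent satisfies the rule."""
    segment = [r for r in respondents if in_segment(r, conditions)]
    if not segment:
        return None
    cases = sum(1 for r in segment if r["ais_total"] >= cutoff)
    return cases / len(segment)


# Toy, entirely hypothetical respondents used only to exercise the functions.
toy_respondents: List[Respondent] = [
    {"depressive_symptoms": 1, "cfs_cognitive": 6, "rs14": 60, "ais_total": 9},
    {"depressive_symptoms": 1, "cfs_cognitive": 5, "rs14": 70, "ais_total": 4},
    {"depressive_symptoms": 0, "cfs_cognitive": 7, "rs14": 55, "ais_total": 8},
    {"depressive_symptoms": 1, "cfs_cognitive": 3, "rs14": 50, "ais_total": 7},
]

print(segment_prevalence(toy_respondents, HIGH_RISK_CONDITIONS))  # 0.5
```

In the toy data, two respondents satisfy all three conditions and one of them has an AIS total of ≥6, giving a segment prevalence of 0.5. The 73% and 20% figures reported in the study come from the survey data, not from this sketch.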
In addition, self-reported change in physical activity was also selected as an important rule. Patients whose physical activity does not decrease after diagnosis of breast cancer tend to be less likely to have insomnia. Growing evidence suggests that exercise may play a role in maintaining and improving common cancer-related health outcomes, and multiple international organizations have issued guidelines recommending high levels of physical activity in cancer survivors (35, 36). In addition, the adverse effects of decreased physical activity during the COVID-19 pandemic on long-term outcomes in cancer patients have been noted (37). The need for recently developed home-based exercise programs for breast cancer patients (38, 39) is expected to continue increasing in the future.

Our study has some limitations. First, we used cross-sectional data, so we cannot determine the direction of causality in the results. Although some clinical variables such as fatigue, resilience, quality of life and physical activity are associated with comorbid insomnia in breast cancer survivors, we could not identify the risk factors. To solve the problem, several methods have been proposed for discovering causal relationships in cross-sectional studies (13). Second, we could not distinguish the effect of cancer treatments on insomnia. A previous study has indicated that cancer treatments, such as chemotherapy and radiotherapy, cause worsening of insomnia symptoms (40). Further investigation by subgroup analysis in a larger population may provide additional clinical value.

In summary, our findings are consistent with the high prevalence of sleep problems in breast cancer survivors. A novel predictive

technical or material support. N.S. was responsible for the patient and public involvement. Y.J.M. was in charge of the supervision.
1. Cancer incidence and incidence rates in Japan in 2009: a study of 32 population-based cancer registries for the Monitoring of Cancer Incidence in Japan (MCIJ) project
2. Insomnia in the context of cancer: a review of a neglected problem
3. An epidemiological study of insomnia among the Japanese general population
4. Nationwide epidemiological study of insomnia in Japan
5. Prevalence of sleep disturbance and hypnotic medication use in relation to sociodemographic factors in the general Japanese adult population
6. Prevalence and risk factors of sleep disturbances in breast cancer survivors: systematic review and meta-analyses
7. Prevalence and risk factors for insomnia among breast cancer patients on aromatase inhibitors
8. Behavioral symptoms in patients with breast cancer and survivors
9. Actigraphy-measured sleep disruption as a predictor of survival among women with advanced breast cancer
10. Sleep and survival among women with breast cancer: 30 years of follow-up within the Nurses' Health Study
11. Sleep duration is associated with survival in advanced cancer patients
12. XGBoost, a machine learning method, predicts neurological recovery in patients with cervical spinal cord injury
13. Study protocol for a nationwide questionnaire survey of physical activity among breast cancer survivors in Japan
14. Athens Insomnia Scale: validation of an instrument based on ICD-10 criteria
15. Development and validation of the Japanese version of the Athens Insomnia Scale
16. The diagnostic validity of the Athens Insomnia Scale
17. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example
18. Ridge estimators in logistic regression
19. Multilocus association mapping using generalized ridge logistic regression
20. Stochastic gradient boosting
21. Gradient boosting machines, a tutorial
22. Predictive learning via rule ensembles
23. Random forests: finding quasars. Statistical Challenges in Astronomy
24. Regression shrinkage and selection via the lasso
25. The need to approximate the use-case in clinical machine learning
26. Detecting the impact of subject characteristics on machine learning-based diagnostic applications
27. Measuring the accuracy of diagnostic systems
28. A readers' guide to the interpretation of diagnostic test properties: clinical example of sepsis
29. Development and validation of the cancer fatigue scale: a brief, three-dimensional, self-rating scale for assessment of fatigue in cancer patients
30. EuroQol - a new facility for the measurement of health-related quality of life
31. Development and psychometric evaluation of the Resilience Scale
32. Reliability and validity of the Japanese version of the Resilience Scale and its short version
33. Insomnia and fatigue symptom trajectories in breast cancer: a longitudinal cohort study
34. Lack of resilience is related to stress-related sleep reactivity, hyperarousal, and emotion dysregulation in insomnia disorder
35. Exercise is medicine in oncology: engaging clinicians to help patients move through cancer
36. Nutrition and physical activity guidelines for cancer survivors
37. Physical activity for oncological patients in COVID-19 era: no time to relax
38. Effect of home-based high-intensity interval training and behavioural modification using information and communication technology on cardiorespiratory fitness and exercise habits among sedentary breast cancer survivors: habit-B study protocol for a randomised controlled trial
39. Data validation and verification using blockchain in a clinical trial for breast cancer: regulatory sandbox
40. Cancer treatments and their side effects are associated with aggravation of insomnia: results of a longitudinal study

The authors wish to thank Prof Uchitomi (National Cancer Center Japan), Dr Shimazu (National Cancer Center Japan) and Mr Motohashi (SUSMED) for their generous support and helpful advice. The authors also wish to thank Ms Akutsu for her efforts in data management.
This article is based on results obtained from a project, P20006, commissioned by the New Energy and Industrial Technology Development Organization. This study was supported by the National Cancer Center Research and Development Fund (30-A-17). Ochi has received research support from Nippon Suisan Kaisha. Ueno and Ichikawa are presidents and shareholders of SUSMED, Inc. Matsuoka has received speaker fees from Suntory Wellness, Pfizer, Mochida, Eli Lilly and Morinaga Milk, and Cimic and is conducting collaborative research with SUSMED. All other authors declare that they have no competing interests regarding this work. The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. All authors were responsible for the acquisition, analysis or interpretation of data and the critical revision of the manuscript for important intellectual content. T.U., D.I. and Y.J.M. took care of the drafting of the manuscript. T.U. and D.I. were in charge of the statistical analysis. Y.S. and T.N. were in charge of administrative,