key: cord-0801176-5iu9h9ke
authors: Hu, Bing; Li, Yun; Wang, Guilian; Zhang, Yanqing
title: The Blood Gene Expression Signature for Kawasaki Disease in Children Identified with Advanced Feature Selection Methods
date: 2020-06-28
journal: Biomed Res Int
DOI: 10.1155/2020/6062436
sha: 6656e9039b72e3c3e329f3b8bf85c13324474d12
doc_id: 801176
cord_uid: 5iu9h9ke

Kawasaki disease (KD) is an acute vasculitis, accompanied by coronary artery aneurysm, coronary artery dilatation, arrhythmia, and other serious cardiovascular diseases. The etiology of KD remains unclear, so it is necessary to study the molecular mechanism and related factors of KD. In this study, we analyzed the expression profiles of 75 DB (identifying bacteria), 122 DV (identifying virus), 71 HC (healthy control), and 311 KD (Kawasaki disease) samples. We identified 332 key genes related to KD and pathogen infections using a combination of advanced feature selection methods: (1) Boruta, (2) Monte-Carlo Feature Selection (MCFS), and (3) Incremental Feature Selection (IFS). The number of signature genes was narrowed down step by step. Subsequently, their functions were revealed by KEGG and GO enrichment analyses. Our results provide clues to the potential molecular mechanisms of KD and may be helpful for KD detection and treatment.

Kawasaki disease (KD) is an acute vasculitis, accompanied by coronary artery aneurysm, coronary artery dilatation, arrhythmia, and other serious cardiovascular diseases [1, 2]. It was first described by the Japanese physician Kawasaki in the late 1960s and has since been reported around the world with an increasing incidence [3, 4]. According to a recent survey, Japan has the highest incidence of KD, with 265 cases per 100,000 children under the age of five [5]. KD initially manifests as high fever, cervical lymphadenopathy, and mucocutaneous inflammation [6].
Aspirin therapy and intravenous immunoglobulin (IVIG) injection play a key role in the effective treatment of KD, reducing the incidence of coronary artery complications from 25% to 5% [7]. KD occurs not only in infancy and childhood but also in adolescence. The young age of onset suggests that susceptibility may be related to the maturity of the immune system [8]. So far, the etiology of KD is unclear, but epidemiological features indicate that there may be a connection between it and as-yet-undefined pathogen infections. In the surveys of Uehara and Belay, the incidence of KD peaked in winter and spring, similar to that of many respiratory diseases. This seasonal pattern suggests that KD may be caused by one or several pathogens related to respiratory diseases [2, 8, 9]. According to statistics, 8-42% of patients were associated with respiratory virus infection and 33% with bacterial infection [10-13]. Viral infection leads to abnormal lymphocyte subsets and inflammation, which were positively correlated with the occurrence of vascular inflammation in KD [14]. Rowley et al. detected upregulated expression of interferon-stimulated genes in acute KD lung tissue, indicating the presence of a cellular immune response after viral infection. They also observed that coronary artery inflammation in KD was characterized by an antiviral immune response, including the upregulation of genes induced by type I interferon and the activation of cytotoxic T lymphocytes [15-17]. A related study suggested that some common respiratory viruses, such as enteroviruses, adenoviruses, coronaviruses, and rhinoviruses, were associated with KD cases [11]. Among these viruses, human coronavirus (HCoV)-229E has been reported to be possibly involved in the occurrence of KD [18]. Together, these findings strongly support the hypothesis that viral and bacterial infections may be related to KD.
To date, there is no specific clinical diagnostic test for KD, and diagnosis still depends heavily on symptoms and ultrasound imaging results [19]. Therefore, it is still necessary to study the molecular mechanism and related factors of KD. In this study, we analyzed the expression profiles of DB (identifying bacteria), DV (identifying virus), HC (healthy control), and KD (Kawasaki disease) samples. By comparing their expression profiles, we obtained 332 key genes related to KD and pathogen infections. Subsequently, their functions were revealed by KEGG and GO enrichment analysis. Our study provides a direction for the study of the potential molecular mechanism of KD occurrence.

HumanHT-12 V4.0 expression beadchip. Only the common 25,159 genes were analyzed. We performed quantile normalization to make sure that samples from different batches were comparable, using the R function "normalize.quantiles" in the package preprocessCore (https://bioconductor.org/packages/preprocessCore/).

Filtering. Since there were many genes and most of them were not associated with KD, we first applied Boruta feature filtering [21] to detect all the relevant genes. Boruta feature filtering is an advanced feature selection method wrapped around random forest. First, the real dataset was shuffled. Then, the importance of each feature was calculated. The features with real importance scores significantly higher than the shuffled ones were kept. Iteratively, all relevant features were selected. With Boruta feature filtering, we obtained a much smaller number of features for further analysis. We used the Python package Boruta (https://pypi.org/project/Boruta/) to apply the Boruta feature filtering.

Feature Selection. We adopted the Monte-Carlo Feature Selection (MCFS) [22] to rank the relevant features. It generated a number of randomly selected feature sets and then constructed many classification trees [23-25].
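As a rough illustration of the quantile normalization step mentioned earlier in this section, the following pure-Python sketch forces every sample (column) of a gene-by-sample matrix to share the same value distribution. It is not the preprocessCore implementation used in the study: the function name `quantile_normalize` and the toy matrix are hypothetical, and ties are broken by position here, which may differ from how "normalize.quantiles" treats tied values.

```python
from statistics import mean


def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Illustrative sketch only (hypothetical helper, not preprocessCore).
    Ties within a sample are broken by row order rather than averaged.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])

    # Collect each sample's values as a column.
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean across samples of the r-th smallest value.
    sorted_columns = [sorted(col) for col in columns]
    reference = [
        mean(col[r] for col in sorted_columns) for r in range(n_genes)
    ]

    # Replace each value by the reference value at its within-sample rank.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_index in enumerate(order):
            result[gene_index][j] = reference[rank]
    return result


# Toy example: 4 genes (rows) measured in 3 samples (columns).
toy = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(toy)
```

After normalization, every column of `normalized` contains the same set of values, so differences between samples reflect only the rank of each gene within its sample, which is what makes samples from different batches comparable.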
By ensembling these classification trees, the importance of each feature was calculated. In general, a feature was important if it had been selected by many classification trees. Suppose d was the total number of relevant features selected by Boruta, m features (m <