key: cord-0585890-8msop7p4 authors: Bleu-Laine, Marc-Henri; Puranik, Tejas G.; Mavris, Dimitri N.; Matthews, Bryan title: Multi-Class Multiple Instance Learning for Predicting Precursors to Aviation Safety Events date: 2021-03-10 journal: nan DOI: nan sha: 7428de1f2041ffaa956c134ea17669b6f568b3cb doc_id: 585890 cord_uid: 8msop7p4 In recent years, there has been a rapid growth in the application of machine learning techniques that leverage aviation data collected from commercial airline operations to improve safety. Anomaly detection and predictive maintenance have been the main targets for machine learning applications. However, this paper focuses on the identification of precursors, which is a relatively newer application. Precursors are events correlated with adverse events that happen prior to the adverse event itself. Therefore, precursor mining provides many benefits including understanding the reasons behind a safety incident and the ability to identify signatures, which can be tracked throughout a flight to alert the operators of the potential for an adverse event in the future. This work proposes using the multiple-instance learning (MIL) framework, a weakly supervised learning task, combined with carefully designed binary classifier leveraging a Multi-Head Convolutional Neural Network-Recurrent Neural Network (MHCNN-RNN) architecture. Multi-class classifiers are then created and compared, enabling the prediction of different adverse events for any given flight by combining binary classifiers, and by modifying the MHCNN-RNN to handle multiple outputs. Results obtained showed that the multiple binary classifiers perform better and are able to accurately forecast high speed and high path angle events during the approach phase. Multiple binary classifiers are also capable of determining the aircraft's parameters that are correlated to these events. 
The identified parameters can be considered precursors to the events and may be studied or tracked further to prevent these events in the future.

The aviation industry brings a tremendous amount of social and economic benefits. It has been observed that the size of the air transportation industry has doubled every 15 years [1], and it was expected to continue growing [2] prior to the COVID-19 pandemic. In 2018 alone, airlines around the world carried a total of 4.3 billion passengers, while the total global economic impact of the industry reached USD 2.7 trillion in 2016 [1]. Furthermore, even though the industry continues to grow and more people are flying every day, aviation safety has improved over the past decades, with the accident rate decreasing to 0.6 in 2019, which is below the target rate set by the Federal Aviation Administration (FAA). This reduction is the result of the efforts undertaken by agencies such as the National Aeronautics and Space Administration (NASA), the FAA, the National Transportation Safety Board (NTSB), and others. In particular, these efforts led to better certification standards, better operating procedures, and decision-support systems [4]. However, it is important to keep reducing the accident rate so that the number of accidents does not rise with the industry's expected continued growth [2, 5]. In order to significantly improve safety, the aviation industry has been moving towards a proactive approach to safety assessment and vulnerability identification, which consists of characterizing potential risks in terms of anomalies or deviations from nominal operations and the precursors to adverse events. This knowledge can be leveraged to increase awareness of emerging vulnerabilities amongst the operators and be incorporated into automated monitoring tools to flag the risks before they result in near-misses, incidents, or accidents [6].
Extensive data collection and advancements in data-mining methodologies are key enablers of proactive risk management in aviation [6, 7]. Airline programs such as Flight Operations Quality Assurance (FOQA) have enabled the creation of large and heterogeneous data sets. FOQA is a voluntary safety program that is designed to make commercial aviation safer. Data is collected using devices such as the Quick Access Recorder (QAR) or directly from the Flight Data Recorder (FDR) [8]. Traditional techniques of flight data analysis have focused on a continuous cycle of data collection from on-board recorders, retrospective analysis of flight-data records, identification of operational safety exceedances, design and implementation of corrective measures, and monitoring to assess their effectiveness. Airlines work with the FAA to reduce and eliminate safety risks, and flight safety divisions within airlines generally use FOQA data to perform exceedance and statistical analyses. Exceedance analysis consists of setting specific limits on the recorded airborne data so that particular parameters that fall outside of the normal operating conditions can be flagged [8]. The level of exceedance can be programmed for different severities of events. These profiles are used to create distributions of various criteria, which enables airlines to evaluate flight risk levels and trend known vulnerabilities over time [8]. A validation step at the end of each analysis is performed to determine the nature of the corrective actions required and to store valid events in databases for analyzing trends [8]. Research efforts have been made towards advancing the development of data-driven methodologies, applying data science techniques, and using modern machine learning, including deep learning algorithms, in the context of aviation safety [9] [10] [11] [12] [13] [14] [15].
The majority of the data mining effort in aviation is directed at detecting anomalies in aviation data [10, 12, 13, [16] [17] [18] [19] [20] [21] and relies on unsupervised learning techniques due to the lack of labeled data. Data mining has also been applied to predictive maintenance in aviation [9] and to predicting future trajectory states [22, 23]. While identifying anomalies is important, it is also critical to investigate the causal factors or precursors to these anomalies or other safety events in order to understand them better and prevent them in the future. Recent literature has focused on identifying precursors to safety events and anomalies using FOQA data [24] [25] [26]. A precursor can be defined as any event that is correlated with a safety incident and occurs prior to the incident itself [24]. Precursors are useful for forecasting safety events and provide insights into why the event happened. Knowledge of the precursors can then be used to initiate actions that prevent events from occurring [26]. Therefore, being able to identify and monitor precursors is an important step towards proactive safety enhancement in aviation operations. One of the main challenges in the identification of precursors observed in the literature is the lack of subject-matter-expert validated labels for the identified anomalies and safety events. To account for this lack of labels, this paper proposes the application of a weakly-supervised learning technique called multi-class multiple-instance learning (MIL) [27] to detect multiple different adverse events and their precursors. Considering the above observations, the main aim of this paper is the development of a methodology that leverages high-dimensional aviation data to predict multiple adverse events and discover their precursors.
This work will use a flight-level label, called a bag label in MIL terminology, to predict the anomalies at the flight level and to identify their precursors along with their occurrence in time during a flight. The current framework uses a deep learning model constructed as a multi-headed convolutional neural network (CNN), where each flight sensor gets its own CNN and the multi-headed architecture is designed to work with the MIL framework. A successful implementation of the methodology uses the flight's data to 1) predict adverse events, and 2) determine the precursors to the predicted adverse events. One major benefit of this framework over current approaches is the extension of the binary classifier to perform multi-class predictions, together with the ability to retrieve the precursors with little to no post-processing. Additional benefits include the transferability of the model due to its multi-head architecture, knowledge discovery that helps analysts find root causes of safety events faster, and the relative simplicity of the model, which provides better transparency and explainability (compared to previous architectures).

This section presents recent work related to anomaly detection, precursor mining, and other applications of deep learning models in aviation safety. The data source utilized for this work is also presented, along with the definition of multiple-instance learning and its accompanying assumptions.

Focusing on precursors to anomalies is important as it allows potential causes of safety hazards to be identified. Recent work shows a growing interest in detecting precursors in various domains. Multiple techniques have been explored and the most relevant ones are summarized in this subsection. Yue Ning et al. [28] presented an approach for precursor identification using nested Multi-Instance Learning (n-MIL). In their work, they forecast societal events in different cities.
At the instance level, the probability of an article published on a given day is modeled using a simple logistic function. These probabilities are then aggregated over a day, and finally the probabilities for different days are aggregated together up to a certain number of days before the event, creating the nested structure. The authors explain that the probability of a news article on a given day can be used to estimate how related the article is to the target. Since the articles with higher probabilities influenced the classifier decision, they are likely to be precursors of the predicted societal event. Janakiraman et al. [24] proposed a Deep Temporal Multiple-Instance Learning (DT-MIL) framework, which combines Multiple-Instance Learning and recurrent neural networks, in this case a gated recurrent unit (GRU), to mine precursors in FOQA data. In this approach, the individual time steps are considered low-level instances and the whole flight is considered a bag. The labels (occurrence of an adverse event) are given at the bag level, but the methodology takes advantage of MIL and uses the low-level instances to correctly predict the bag label, which allows the instance-level labels to be inferred. The time steps at which the probability of a safety event occurring is greater than a defined threshold are retrieved. The region of time for which the probability of a safety event occurring (called the precursor score) is high is then analyzed during post-processing, where each feature is perturbed one at a time. The precursors are identified by finding the features whose perturbations had the most significant impact in reducing the precursor score. The DT-MIL model is also referred to as a newer version of the Automatic Discovery of Precursors in Time Series (ADOPT) *, which is different from the architecture presented in [29]. Ackley et al.
[25] have used a sequential backward selection technique along with Random Forest classification models for predicting Unstable Approach adverse events. They identified the critical parameters using a cumulative feature importance score and grouped them into various categories of parameters (such as energy-related, configuration-related, etc.) that contribute significantly towards the identification of the adverse event. Their analysis is conducted at fixed altitudes above the event detection trigger altitude of 1000 feet above touchdown. Similarly, Lee et al. [30] used a Random Forest algorithm to identify precursors to two different aviation safety events using supervised learning. The normalized precursor score is obtained using the Gini importance of all parameters contained in the classification model. Melnyk et al. [31] proposed a framework based on Hidden Semi-Markov Models for detecting precursors to aviation safety incidents due to human factors. They performed an empirical evaluation of their models against traditional anomaly detection algorithms and demonstrated better performance on synthetic and flight simulator data. Mangortey et al. [32] used a variety of clustering techniques to identify clusters of nominal operations and subsequently determine important parameters that differentiate outliers from those nominal clusters. Despite presenting interesting insights, their method is still unsupervised and lacks validation.

* https://github.com/nasa/ADOPT

The proposed framework is demonstrated using a publicly available data set obtained from NASA's DASHlink website, which is a collaborative sharing network for researchers in the Data Mining and Systems Health Management field †. Flight data were recorded from a single type of regional jet operating in commercial service over a three-year period.
The data contain detailed aircraft dynamics, system performance, and other engineering parameters but are de-identified such that they cannot be traced back to a particular manufacturer or airline. Since this data set is not part of any airline's FOQA program, additional preprocessing is required to create FOQA-like flags and label safety events for individual flights; the labeling was created using domain-based rules. The definitions of the resulting adverse events are presented in Table 1. Each of these events is characterized by a severity level ranging from 1 to 3, with 3 being the most severe. For this work, only flights without any safety events and flights with safety events of severity level 3 are considered. This limits the overlap between normal and abnormal operations. Moreover, to ensure a multi-class problem instead of a multi-label one, flights with multiple adverse events were omitted. The high speed in approach and the high path angle in approach events were selected for this work because of their higher frequency in the data set. It is noted that the methodology developed in this work is applicable to any similar data set containing time series data of aircraft parameters and the known occurrence of an event.

† https://c3.nasa.gov/dashlink/resources/?page=3&sort=-created&type=28

The multiple-instance learning framework is used when labels are only available for sets called bags [33], where each bag contains many instances. The learning task is therefore supervised, but since the labels are not provided for each instance, the task is said to be weakly supervised. This framework is of interest because it copes with weak supervision, which is common to many fields since labelling data is costly. In particular, it is useful in the context of aviation because of the lack of labeled data. In this context, we can consider the bag to be a flight and the instances to be the time-steps of the features.
Furthermore, positive bags are flights that had a safety event, while negative bags are the ones with no events recorded. The standard MIL assumption, which is used in this work, states that all negative bags contain only negative instances, and that all positive bags contain at least one positive instance [33]. This means that a flight experiencing an event has at least one abnormal time-step that caused the whole flight to be considered a positive bag. Flights can later be divided into training, validation, and testing sets, and the performance of a chosen model can then be evaluated on the latter two subsets. The learning task for this work is a classification problem. In a classification application, the task is to use given data and assign it to one of multiple predefined classes. In particular, given time series from sensor data, we can assign the series a label such as a safety event [10]. The MIL classification task can be performed at the bag level and at the instance level. However, the methodology developed in this work uses the bag label to infer the instance label of each feature at each time-step. In other words, if a bag is classified as positive, the algorithm will infer the time instances that enabled the classification of the bag.

The Intelligent Methodology for the Discovery of Precursors of Adverse Events (IM-DoPE) developed in this work is described in this section. The four steps of the methodology, as seen in Fig. 2, are the data processing, the development of the model, the extraction of precursors, and the derivation of precursor scores. These steps allow for the discovery of precursors related to adverse events of interest, which can be used to provide insights into the cause of these events.

The DASHlink sample flight data is recorded at a frequency of 1 Hz. Since all parameters already share this sufficiently high sampling frequency, no effort was spent on re-sampling the data set to a common frequency.
The data set had already been cleaned prior to accessing it, as no missing data was observed. Subsequent tasks mainly focused on selecting events of interest, dimensionality reduction, data interpolation, and data formatting.

As previously stated, two adverse events were selected for this work due to their higher frequency in the available data set. These events are pre-defined and characterized by the deviation of certain aircraft parameters from accepted nominal behaviors. Fig. 3 shows the counts of nominal flights and of adverse flights that experienced one of the two events. As seen in the figure, the high speed event was the most commonly observed event among the available data, followed by the high path angle event.

The dimensionality of the data was reduced by leveraging the correlations among the features. Certain parameters (e.g. computed airspeed and true airspeed) are different but convey similar information. This is due to the redundancy of parameters, the aircraft's physics, and the derivation of parameters [25]. Using the data, a correlation matrix was created. For each pair of features, Pearson's correlation coefficient is computed. The coefficient is a measure of the strength of the association between the features in the pair ‡. The Pearson correlation coefficient for two random variables $x$ and $y$ is given by:

$$r_{xy} = \frac{s_{xy}}{\sqrt{s_x^2 \, s_y^2}} \qquad (1)$$

The correlation coefficient is bounded such that $-1 \le r_{xy} \le 1$; values closer to 1 signify strong positive correlation, while values closer to -1 suggest strong negative correlation. In eq. 1, $s_x^2$ is the sample variance of the variable $x$, $s_y^2$ that of the variable $y$, and $s_{xy}$ the covariance between $x$ and $y$. Since features are considered highly correlated if they have a correlation coefficient higher than 0.9 [34], a threshold of 0.90 was set so that, for any pair of features whose correlation coefficient has an absolute value greater than or equal to the threshold, one of the correlated features was removed.
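The correlation-based filtering described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in this work: the function names, the toy parameter values, and the rule of keeping the first-seen feature of each highly correlated pair are assumptions, since the text does not state which member of a pair is dropped.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    The 1/(n-1) factors of the sample covariance and variances cancel,
    so plain sums of centered products are sufficient.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return 0.0  # a constant feature carries no correlation information
    return s_xy / math.sqrt(s_xx * s_yy)


def drop_correlated(features, threshold=0.90):
    """Return the names of the features kept after filtering.

    `features` maps a feature name to its list of values. A feature is kept
    only if |r| with every already-kept feature is below the threshold
    (assumption: the first-seen feature of a correlated pair is retained).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values (hypothetical), five samples per parameter.
toy = {
    "computed_airspeed": [120, 130, 140, 150, 160],
    "true_airspeed": [126, 137, 147, 158, 168],  # nearly collinear with the first
    "flap_position": [0, 15, 5, 30, 10],
}
kept_features = drop_correlated(toy)
```

In this toy case the true airspeed is removed because it is almost perfectly correlated with the computed airspeed, while the flap position is retained.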
Note that the absolute value of the correlation coefficient between feature pairs was taken since features can be highly negatively correlated. For this work, only continuous variables that are not highly correlated were used as the input feature space. In addition to the correlation-based feature selection, trivial precursors were removed. This step was performed to avoid observing one of the speed parameters as a precursor to the high speed event.

All the events of interest for this work are flagged at 1,000 ft above touchdown. Given the large number of flights and the differences between them, re-sampling the data was necessary to ensure uniformity across all flights. Thus, starting from a distance of 20 nautical miles away from the 1,000 ft mark, flights were re-sampled at every quarter nautical mile. Each flight therefore contains 81 data points or time steps. A Python script was developed to interpolate each feature and retrieve its value at every quarter-mile distance, yielding a data set similar to table 2 for each flight.

‡ Data Analysis - Pearson's Correlation Coefficient: http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442

In practice, modern algorithms such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) usually take three-dimensional tensors as inputs. In order to account for this, the data from each flight is arranged in a three-dimensional form, where the first dimension (N) corresponds to the batch size, or the number of flights that the deep learning algorithm will process at once. The second dimension (L) is the number of time-steps in each flight, equal to 81 for this work. Finally, the last dimension (D) refers to the number of features (aircraft parameters), which corresponds to 58 continuous variables.
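The re-sampling and formatting steps can be illustrated with the short sketch below. It is a hedged example rather than the script developed for this work: it assumes linear interpolation, a per-sample distance to the 1,000 ft mark that is available for each flight, and toy values for two hypothetical parameters instead of the 58 used here.

```python
STEP_NM = 0.25   # spacing of the distance grid (nautical miles)
START_NM = 20.0  # distance from the 1,000 ft mark at which the window starts
GRID_NM = [START_NM - STEP_NM * k for k in range(81)]  # 20.0, 19.75, ..., 0.0


def linear_interp(x_new, xs, ys):
    """Linearly interpolate ys (sampled at increasing xs) at x_new, clamping at the ends."""
    if x_new <= xs[0]:
        return ys[0]
    if x_new >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x_new <= xs[i]:
            w = (x_new - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])


def resample_flight(distance_nm, feature_columns):
    """Resample one flight onto the quarter-mile grid.

    distance_nm: distance to the 1,000 ft mark for each recorded sample.
    feature_columns: one list of values per aircraft parameter.
    Returns an L x D nested list (81 time steps by D parameters).
    """
    # Interpolation expects increasing abscissae, so sort samples by distance.
    order = sorted(range(len(distance_nm)), key=lambda i: distance_nm[i])
    xs = [distance_nm[i] for i in order]
    cols = [[col[i] for i in order] for col in feature_columns]
    return [[linear_interp(d, xs, ys) for ys in cols] for d in GRID_NM]


# Toy flight (hypothetical values): samples from 21 nm down to -0.5 nm.
distance = [21.0 - 0.1 * k for k in range(216)]
altitude = [1000.0 + 300.0 * d for d in distance]
airspeed = [140.0 + 2.0 * d for d in distance]

flight = resample_flight(distance, [altitude, airspeed])
batch = [flight]  # nested list of shape (N, L, D) = (1, 81, 2)
```

Stacking the per-flight arrays in this way gives the (N, L, D) layout described above; in a deep learning library the nested list would simply be converted to a tensor.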
The model architecture was chosen to take advantage of the Multiple-Instance Learning (MIL) framework, the feature extraction capabilities of Convolutional Neural Networks (CNNs) [35, 36], and the temporal pattern recognition ability of Recurrent Neural Networks (RNNs), yielding an MHCNN-RNN architecture. This architecture choice allows the aircraft's parameters to be initially processed individually by the MHCNNs, which yields feature maps. The complex (time-dependent) correlations between the feature maps are then learned by the RNN. The model was developed using PyTorch [37], a high performance Python deep learning library. The layer configuration is shown in fig. 6. The MHCNN is effective at processing each feature independently [35]. This allows the model to find relevant patterns in each aircraft parameter, which subsequently leads to the identification of the parameters that are highly correlated to the adverse event. Indeed, the most important aircraft parameters are the ones that helped the algorithm correctly predict an anomaly, and they can therefore be considered precursors to the anomaly. A window-based approach was used to process the data with the MHCNN, allowing information to be extracted from each feature at different time regions. The window is defined by the kernel size of the CNN, and the networks all use a stride of 1, such that overlapping windows are created. Some instances of time are more useful than others for predicting an anomaly, and this approach allows the algorithm to have a more granular extraction of features. Similar to the architecture proposed in [35], four CNNs are used in each head to process the time series and expand the single channel of each time series, as seen in fig. 6 and fig. 7, resulting in the extraction of relevant information from each window of the time series. In Figure 6, $C_1$, $C_2$, and $C_3$ represent the number of channels, and $L_1$, $L_2$, and $L_3$ represent the length of the time series after each convolution operation.
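As a rough illustration of how the lengths after each convolution arise, the sketch below applies the standard output-length formula for a one-dimensional convolution to the 81 time steps of a flight. The kernel size of 3 and the absence of padding are assumptions made for the example only; they are not the values selected in this work.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (standard convolution arithmetic)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Four stacked convolutions per head, stride 1, hypothetical kernel size 3.
lengths = []
current = 81  # time steps per flight
for _ in range(4):
    current = conv1d_out_len(current, kernel_size=3, stride=1, padding=0)
    lengths.append(current)
# lengths == [79, 77, 75, 73]
```

With a stride of 1 and no padding, each layer shortens the series by the kernel size minus one; padding each side by (kernel size - 1) / 2 for an odd kernel would instead keep all 81 steps.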
The length of the time series after each operation depends on CNN parameters such as the kernel size, the stride, and whether or not padding was used. The first three convolutional layers use batch normalization to reduce the internal covariate shift and bring a regularization effect [35], and each normalization is followed by a ReLU activation function, which has become a default choice when developing neural networks due to its many advantages [38]. The fourth convolutional layer reduces the number of channels back to 1 and applies a sigmoid activation function, allowing the output (or feature map) to be interpreted as the probability of a given feature being a precursor. The chronological order of the feature maps was kept, which helped maintain the temporal information of the data. A concatenation step was then performed to combine the feature maps of each aircraft parameter into a three-dimensional precursor score tensor.

Fig. 6 Layer configuration with output size of layer

The concatenated tensor is the input to the GRU, which processes the features all together, unlike the MHCNN, which processes the features independently. This layer learns the temporal patterns present in the information extracted by the previous convolutional layers. The GRU is appropriate for handling temporal data due to its structure, which is characterized by two gates: the update and reset gates [36]. Both of these gates are used to select which information to keep and which to ignore, and they allow the GRU to handle long-term dependencies better than simpler architectures [36]. A hyperbolic tangent activation is then applied to the output of the GRU, followed by a time-distributed dense layer that squeezes all the features into a single dimension, allowing for additional approximation capability [24]. The output is passed through a sigmoid function and can be interpreted as the probability that a precursor occurs at a given instant of time.
Finally, to classify a given flight as positive (event occurred), it is assumed that if any instance of time is positive then the whole flight is labeled as positive (MIL assumption). Thus, a max pooling layer is used to get the maximum probability across time to classify the flight. A threshold is set such that a flight is positive only if the maximum probability across time is greater than or equal to the threshold. The architecture is thus expected to achieve two critical goals:
1) Label the bag for identifying adverse events
2) Extract precursors and the time instances at which they occur

Although the two approaches are similar because they both take advantage of the MIL framework for precursor mining, it is important to note the differences between this architecture and the previously developed algorithm ADOPT [24]. The architecture proposed in this paper is designed such that the precursor probabilities for each feature can be extracted directly from the neural network layers; this approach aims at creating a more interpretable model. ADOPT can only identify temporal precursors directly, and needs to perform a sensitivity analysis during a post-processing step, which consists of perturbing one individual feature at a time and measuring its effect on the time instance probabilities to determine precursors in the feature space. This method potentially misses interactions between features. Being able to retrieve precursors by leveraging the combination of the MHCNNs (extraction of individual information for each parameter) and the GRU (extraction of temporal interactions of parameters) mitigates this drawback while eliminating the overhead required to set up the post-processing analysis.

Multiple Binary Classifiers
The architecture described in III.B.1 corresponds to a binary classifier. Both IM-DoPE and ADOPT make use of binary classifiers, as their task is not only to correctly predict the occurrence of an adverse event but also to identify precursors to the event.

Fig. 7 Layer configuration showing where the precursor score is extracted

Both models could be extended to include multiple outputs, allowing for the training of multiple events at once [24]. However, for this work the multi-class model is obtained by combining multiple binary classifiers, since it was observed that they performed well at completing their individual tasks. Two popular approaches to combining the classifiers are the "one-vs-all" and the "one-vs-one" strategies [39]. The first strategy learns to classify one class at a time, where the class is distinguished from all the other ones. The second strategy requires creating multiple classifiers for different pairs of classes. In this work, the approach taken is a mix between the two strategies. Indeed, two classifiers are trained, one for each event, similarly to the "one-vs-all" approach, but each of them is trained to recognize either the event or the default nominal operation. This approach was chosen to allow for the interpretation of nominal flights when performing predictions and to avoid training an extensive number of models. When using the models to perform an inference, the output class is chosen to be the output of the model with the highest probability (i.e. the highest confidence) that is greater than a specified threshold. If none of the probabilities is greater than the threshold, then the output defaults to the class corresponding to normal operations, since each model predicts either an event or normal operation. When using sigmoid functions, the decision threshold is problem dependent and could be treated as a hyperparameter to tune. However, a default decision threshold of 0.5 was used for this work since it yielded good performance for the model. The advantages of this architecture are the clear interpretability of the probability of a flight experiencing a given event or not, and the easier learning task that the model has to accomplish.
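The inference rule for combining the binary classifiers can be sketched in a few lines. This is a minimal illustration of the rule as described, not the code used in this work: the function name, the class labels, and the example probabilities are hypothetical, and the strict inequality follows the wording "greater than a specified threshold".

```python
def combine_binary_predictions(event_probs, threshold=0.5, nominal_label="nominal"):
    """Pick the final class from the outputs of one binary classifier per event.

    event_probs maps an event name to the flight-level probability (maximum
    over time) returned by that event's classifier. The most confident event
    is returned if its probability exceeds the threshold; otherwise the
    flight is labeled as a nominal operation.
    """
    best_event, best_prob = max(event_probs.items(), key=lambda kv: kv[1])
    return best_event if best_prob > threshold else nominal_label


# Hypothetical classifier outputs for three flights.
flight_a = combine_binary_predictions({"high_speed": 0.91, "high_path_angle": 0.62})
flight_b = combine_binary_predictions({"high_speed": 0.20, "high_path_angle": 0.40})
flight_c = combine_binary_predictions({"high_speed": 0.55, "high_path_angle": 0.80})
```

Because every classifier is trained against nominal operations, falling back to the nominal class when no probability clears the threshold requires no additional model.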
The binary architecture can also be changed to include multiple output nodes, such that the output after the last max pooling layer is of size (batch size, $n$), where $n$ is the number of classes to predict. The sigmoid layer before the final max pooling was kept to ensure that, for each class (anomaly), the probability that the flight experiences it is bounded between 0 and 1. This allows for multi-labeling, which is a use case in aviation since multiple events could occur during the same flight. Using a sigmoid layer is different from using a softmax layer, which would require the probabilities of the events to sum to 1. This architecture was preferred over the usage of a softmax layer because it can handle both multi-class and multi-label problems. For this work, the model was trained using flights that experienced only one anomaly at a time, and it therefore learned to keep the probabilities of non-occurring events low while keeping those of occurring events higher. Thus, for each flight the class with the highest probability was chosen as the final output class. The advantage of this architecture is the convenience of being able to train one model instead of multiple binary classifiers.

Given the novelty of the chosen architecture for the precursor mining task, there is a lack of knowledge on what parameters to choose for the model and on how to train it. To mitigate this problem, a hyperparameter search was performed to determine the optimal parameters of the architecture for each event. A grid search is the most straightforward strategy for performing a hyperparameter search, as it entails searching through the space created by all possible combinations of hyperparameters. This means that each parameter is given an array of values to try, and the model is evaluated using the metrics described in III.B.4 for each combination of parameters. Table 3 defines the hyperparameter space containing 36 possible combinations.
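A grid search of this kind can be sketched with `itertools.product`. The example below is illustrative only: the hyperparameter names and values are hypothetical stand-ins chosen so that the grid also contains 36 combinations (they are not the contents of Table 3), and `toy_validation_f1` is a placeholder for training a model and scoring it on the validation set.

```python
from itertools import product

# Hypothetical search space: 3 x 3 x 2 x 2 = 36 combinations.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "kernel_size": [3, 5, 7],
    "gru_hidden_size": [8, 16],
    "dropout": [0.0, 0.2],
}


def grid_search(space, evaluate):
    """Evaluate every combination in `space` and return the best one.

    `evaluate` takes a dict of hyperparameters and returns a score to maximize.
    Returns (best_params, best_score, number_of_combinations_tried).
    """
    names = list(space)
    best_params, best_score, tried = None, float("-inf"), 0
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried


def toy_validation_f1(params):
    """Placeholder score; a real run would train and validate a model here."""
    score = 0.90
    score -= 0.025 * abs(params["kernel_size"] - 5)
    score -= 0.02 if params["learning_rate"] != 1e-3 else 0.0
    score -= 0.01 if params["gru_hidden_size"] != 16 else 0.0
    score -= 0.01 if params["dropout"] != 0.2 else 0.0
    return score


best_params, best_score, n_combinations = grid_search(SEARCH_SPACE, toy_validation_f1)
```

The cost of a grid search grows multiplicatively with the number of values per hyperparameter, which is why the space is kept small when every combination requires training a full model.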
Similarly to [41], the data set was divided into 3 subsets: the training set, the validation set, and the testing set, as seen in fig. 8. Stratified subsets are used to ensure that the proportion of nominal flights to abnormal flights remains consistent in each subset. Additionally, stratified mini-batches were used during the training process. The training set was used to train the model using a given set of hyperparameters (i.e. one hyperparameter combination). A Tesla V100 GPU with 16 GB of memory, accessed through Google Colab, was leveraged to train the models. The performance of the models on the validation set was used to determine the best model.

3) Recall: Measures the fraction of positive labels that were actually detected [43]. The closer to one the better. With TP and FN denoting the numbers of true positives and false negatives, the recall is defined as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

4) F1 score: Harmonic mean of the precision and recall [43]. The closer to one the better. The F1 score is defined as:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

5) DFA: Metric created to measure the resemblance in the precursor score ranking of IM-DoPE and ADOPT [29]. Assuming precursor scores $s_{i,n}$ for feature $i$ of flight $n$ ranging between 0 and 1, $N$ flights, and $d$ features, the DFA for each combination of hyperparameters is given by:

$$\mathrm{DFA}_n = \frac{1}{d} \sum_{i=1}^{d} \left( s^{\mathrm{ADOPT}}_{i,n} - s^{\mathrm{IM\text{-}DoPE}}_{i,n} \right)^2 \qquad (5)$$

$$\mathrm{DFA} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{DFA}_n \qquad (6)$$

The DFA for a combination of hyperparameters is the mean sum square error between the precursor scores of ADOPT and IM-DoPE; therefore, the closer it is to zero, the more alike the rankings of the two algorithms are. The precursor score sum square errors are averaged across all features in eq. 5, and across all flights in eq. 6. ADOPT has been validated by experts for a few events, such as the high speed exceedance, and can thus be used as a second-hand validation if no human expert is available.

Once acceptable performance is achieved by the model, it can be used to identify precursors. The precursor score $p_{i,t}$ of each feature $i$ at time step $t$ is extracted directly from the architecture. As seen in fig. 9, the raw precursor score for each feature can be extracted from the last convolution layer of each head.
The raw scores are bounded between 0 and 1 due to the sigmoid function used in the fourth convolutional layer. From empirical results, it was observed that the model learns to set non-important parameters to 0.5. As an example, in Fig. 9, the rudder position, the rudder pedal, and the spoiler position are deemed not important by the model since their individual scores remain at 0.5 for the last 20 nautical miles before 1,000 ft. On the other hand, the radio altitude was flagged as a precursor. As the less important features go through the different convolutional layers, they get reduced to zero. This zeroed-out feature is then passed through the sigmoid activation, resulting in a raw score of 0.5. The identified precursors have instances of time (precursor score over time) at which they are more correlated to the event of interest (grayed areas in Figs. 9 and 10). The combination of GRU and dense layer allows for the extraction of these time instances, as seen in Fig. 10. Similar to ADOPT, abnormal flights will have the precursor score increase towards 1 and nominal flights will have the score fall towards 0. Additionally, since the precursors are more correlated to an event at time windows identified by the GRU, an overall score can be given to the precursor by averaging its raw score values $p_{i,t}$ (feature $i$, time step $t$) within an identified window $W$. A more natural way to express the precursor score is to have a score of zero for parameters that are not important. Therefore, the raw precursor scores were adjusted. The adjusted precursor score $\hat{p}_i$ of feature $i$ is defined by the following equation:
$$\hat{p}_i = \frac{1}{|W|} \sum_{t \in W} \left| p_{i,t} - 0.5 \right|$$
where $|W|$ is the number of time steps within the time window $W$. The adjusted precursor score is thus bounded between 0 and 0.5. From this definition, it can be seen that features with higher scores are flagged as precursors.
Multiple Binary Classifiers
For each event, 36 combinations of hyperparameters were evaluated and the best classifiers were selected.
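The adjustment of the raw precursor scores can be illustrated with the short sketch below. It assumes that the adjusted score is the mean absolute deviation of the raw score from 0.5 over the time steps of the identified window, which is consistent with the stated bounds of 0 and 0.5 and with unimportant features scoring zero. The function name and the example score sequences are hypothetical.

```python
def adjusted_precursor_score(raw_scores, window):
    """Average |raw score - 0.5| over the time steps in `window`.

    raw_scores: raw precursor scores of one feature, one value per time step.
    window: indices of the time steps in the identified window (assumed
            to come from the precursor score over time).
    """
    indices = list(window)
    if not indices:
        raise ValueError("the time window must contain at least one step")
    return sum(abs(raw_scores[t] - 0.5) for t in indices) / len(indices)


# Hypothetical raw scores for two features over six time steps.
rudder_raw = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]      # stays at 0.5: not important
altitude_raw = [0.5, 0.6, 0.9, 1.0, 0.8, 0.5]    # deviates inside the window

window = range(2, 5)  # time steps 2, 3 and 4
rudder_adjusted = adjusted_precursor_score(rudder_raw, window)
altitude_adjusted = adjusted_precursor_score(altitude_raw, window)
print(rudder_adjusted, altitude_adjusted)
```

Because the deviation is taken in absolute value, a raw score that falls towards 0 is treated in the same way as one that rises towards 1.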
Table 5 summarises the quantitative results obtained when evaluating the models on the test set (see also Fig. 11). Additionally, the best model of IM-DoPE performs better than ADOPT when using the same training and testing sets. It is, however, important to note that while a grid-search was performed for IM-DoPE for each event, the default settings of ADOPT were used. Some models obtained during the grid-search
Table 6 presents scores similar to those in Table 5, showing that even when combined, each model learns to identify its positive class correctly. The confusion matrix in Table 7 confirms this behavior, as the numbers of FP and TN for each class are relatively low.
On one hand, the results obtained show slight improvements in the F1 scores of the nominal and high speed classes, thanks to the respective increases in the recall and precision scores. On the other hand, the ability to accurately predict the high path angle anomalies decreased. In fact, this model architecture is able to find most flights that experienced the high path angle event, but it produces a higher number of false positives than the other multi-class model, leading to a much lower F1 score. Even though both the multiple binary models and the multiple output nodes model can be used along with the precursor identification part of the IM-DoPE methodology, the combination of multiple binary classifiers yielded better results and was therefore chosen to perform the identification of precursors. The results are presented in the following subsections.
As previously mentioned, the final model is expected to predict adverse events and identify their precursors. The adjusted precursor score can be obtained for each feature and each flight. Considering true positive flights only, the average adjusted precursor score can be determined and the precursors can be identified on a fleet level, as seen in Table 9 for the top 5 precursors of each event.
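The fleet-level ranking described above can be sketched as follows. This is a minimal illustration under assumed data structures: each flight is represented by a hypothetical dictionary holding its true label, the predicted label, and its adjusted precursor score per feature, and every flight is assumed to report the same features. Only true positive flights contribute to the average.

```python
def fleet_precursor_ranking(flights, top_k=5):
    """Rank features by mean adjusted precursor score over true positives.

    flights: list of dicts with keys 'label' (1 = event occurred),
             'prediction' (1 = event predicted) and 'scores'
             (feature name -> adjusted precursor score).
    Returns a list of (feature, mean score) pairs, highest score first.
    """
    true_positives = [
        f for f in flights if f["label"] == 1 and f["prediction"] == 1
    ]
    if not true_positives:
        return []
    totals = {}
    for flight in true_positives:
        for feature, score in flight["scores"].items():
            totals[feature] = totals.get(feature, 0.0) + score
    means = {name: total / len(true_positives) for name, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical flights; the third one is a false positive and is ignored.
flights = [
    {"label": 1, "prediction": 1, "scores": {"altitude": 0.4, "pitch": 0.1}},
    {"label": 1, "prediction": 1, "scores": {"altitude": 0.2, "pitch": 0.3}},
    {"label": 0, "prediction": 1, "scores": {"altitude": 0.0, "pitch": 0.5}},
]
ranking = fleet_precursor_ranking(flights)
print(ranking)
```

Restricting the average to true positives avoids diluting the ranking with scores from flights on which the classifier was wrong.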
Noticeable differences are observed in the precursor rankings for the two events. For example, the glideslope deviation and the pitch angle are characteristic of a high path angle event, while the N1 target relates to engine power, which can be related to speed. Some resemblances are also observed; for instance, the altitude is seen to be a precursor for both events. This is expected since the altitude above touch-down is used to define both events.
The trained model can also be used to analyze individual flights. Two flights, experiencing a high speed event and a high path angle event respectively, are analyzed in Sections IV.B.2 and IV.B.3. When performing an inference using the trained model, the precursor scores can be extracted from the model's MHCNN outputs. Following the steps previously highlighted in Section III.C.2, the adjusted precursor score can be computed and used to identify the precursors for a flight.
Fig. 12 shows the precursor ranking obtained for a flight that experienced a high speed event. Similar precursors were identified by ADOPT for this flight. The top 5 precursors were the altitude, the radio altitude, the flight path acceleration and the N1 target. Additionally, the total pressure was also identified by ADOPT as a top precursor. Once the precursors are discovered, the aircraft's parameters can be plotted to assess them. In particular, the time series is of interest when the precursor score over time is greater than the 0.5 threshold, which occurs at around 1.5 nautical miles away from 1,000 ft above touch-down on fig in the last 2.5 nautical miles. The oil temperature of the flight is also plotted to show that, even though it has lower values than the mean of nominal flights, the relatively low precursor score obtained is likely due to the fact that the temperature is relatively constant and is not highly correlated to the event.
Similar to the high speed event case, precursors to the high path angle event can be probed from the model.
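The use of the 0.5 threshold on the precursor score over time can be illustrated with the sketch below, which locates the first sample at which the score exceeds the threshold and lists every flagged sample. The function names, the distance grid, and the score values are hypothetical; only the 0.5 threshold comes from the text.

```python
def first_exceedance(distances_nm, scores, threshold=0.5):
    """Return the distance of the first sample whose score exceeds the
    threshold, or None if the threshold is never exceeded.

    distances_nm: distance remaining to the reference point, in nautical
                  miles, ordered as flown (decreasing).
    scores: precursor score over time, one value per distance sample.
    """
    for distance, score in zip(distances_nm, scores):
        if score > threshold:
            return distance
    return None


def flagged_samples(distances_nm, scores, threshold=0.5):
    """Return all distances at which the score is above the threshold."""
    return [d for d, s in zip(distances_nm, scores) if s > threshold]


# Hypothetical precursor score over the last 3 nautical miles.
distances_nm = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
scores = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9]

onset = first_exceedance(distances_nm, scores)
flagged = flagged_samples(distances_nm, scores)
print(onset, flagged)
```

The flagged samples indicate which portion of each parameter's time series is worth inspecting against nominal flights.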
Again, the identified precursors provide a list of aircraft parameters that can be assessed through visualization. For the selected flight, the presence of a dominating precursor is observed. Indeed, there is a larger difference between the top two precursors than there was for the high speed event, as seen in Fig. 14. This larger difference suggests a highly abnormal glideslope deviation. Moreover, the glideslope was also identified by ADOPT as the top parameter. Other important parameters were the pitch angle, the radio altitude, the airbrake position, and the flight path acceleration. The abnormal behavior is confirmed by Fig. 14, since the glideslope deviation is much greater than 2 times the standard deviation. The event is likely caused mainly by the high deviation in the glideslope. However, other precursors such as the pitch angle and the flight path acceleration are also identified and observed to have abnormal patterns. It is important to note that these two parameters can also be precursors to the high speed event, as previously observed. This flight in particular was not classified as a high speed event, but some of the parameters behaved similarly to how they would behave during such an event. This led the high speed event classifier to push its precursor score over time closer to 1. Ultimately, the multi-class classifier correctly labeled this flight as a high path angle event because the high path angle classifier had a stronger confidence.
The MHCNN-RNN architecture yields satisfying results. The scores obtained from the conventional classification metrics show the model's ability to extract information from the data in order to make accurate predictions. In fact, the model accurately forecasts the events since it predicts them before they actually occur, though variability in the prediction is observed.
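The way the binary classifiers resolve a flight such as the one discussed above can be sketched as follows. The rule shown is an assumption made for illustration: the event whose classifier is most confident wins, and the flight is labeled nominal when no classifier exceeds a 0.5 decision threshold. The text only states that the classifier with the stronger confidence determined the label, so the threshold, the function name, and the probabilities are hypothetical.

```python
def combine_binary_classifiers(event_probabilities, threshold=0.5):
    """Combine per-event binary classifier outputs into one class label.

    event_probabilities: dict mapping an event name to the probability
                         returned by that event's binary classifier.
    threshold: assumed decision threshold below which no event is declared.
    """
    event, probability = max(event_probabilities.items(), key=lambda kv: kv[1])
    return event if probability > threshold else "nominal"


# Hypothetical outputs: both classifiers react, the path angle one more so.
ambiguous_flight = {"high_speed": 0.70, "high_path_angle": 0.95}
quiet_flight = {"high_speed": 0.20, "high_path_angle": 0.10}

ambiguous_label = combine_binary_classifiers(ambiguous_flight)
quiet_label = combine_binary_classifiers(quiet_flight)
print(ambiguous_label, quiet_label)
```

Under this rule, a flight that partially resembles a high speed event is still assigned to the high path angle class when that classifier is the more confident of the two.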
The model also identifies precursors, and it was observed that different precursors are discovered for different events, which is expected. The high speed top precursors relate more to poor energy management, while the precursors for the high path angle include trajectory-related parameters, as seen in Table 9. The identified precursors are partially validated given their resemblance to the precursors identified by ADOPT, which is characterized by the low DFA values. Additional support towards the validation of the discovered precursors is obtained through visualization and
The work presented in this paper tackles the precursor mining task. A public flight data set was leveraged to create FOQA-like labels. After performing the required preprocessing steps, a novel architecture for the task was developed to take advantage of the Multi-Instance Learning framework, the feature extraction capabilities of Convolutional Neural Networks, and the temporal pattern recognition capabilities of Recurrent Neural Networks. On one hand, binary classifiers were trained to predict both high speed and high path angle events; on the other, the MHCNN-RNN architecture was modified to handle multiple outputs. In both cases, a grid-search was implemented to determine the best parameters for the neural networks, and thus the best model for the prediction of the safety events. For the multiple binary classifier case, the best two models were then combined to form a unique multi-class classifier. For both multi-class extensions, the final models were evaluated on a test set and high scores were observed for classification metrics such as F1 score, precision, and recall, in particular when combining binary classifiers. Furthermore, the binary models were then used to identify precursors and provide the average precursor score across all flights that experienced each of the two events.
Finally, visualizations were used to observe the behaviors of the identified precursors, which exhibited patterns different from normal flight operations. Future work will include enhancing the interpretability of the precursor score tensor. While empirically the parameters that deviate from 0.5 are understood to be correlated to the event of interest, the meaning of the direction of the deviation (greater than or lower than 0.5) needs to be investigated. Further improvements to the model can be made towards increasing the prediction window, and allowing the classification of unknown precursors instead of defaulting unknown behaviors to the nominal class. Additionally, the lower scores for the modified MHCNN-RNN could be due to the class imbalance, especially the lower number of high path angle events. Future work will also explore methodologies to handle class imbalance, and extend the learning task to a multi-label problem.
Aviation Benefits Report
Commercial Market Outlook
Aviation Safety 2019 Year in Review
A Modeling Environment for Assessing Aviation Safety
The Potential for Improving Aviation Safety and Reducing the Accident Rate
URL aviationsafetyblog.asmspro.com/blog/best-data-mining-methods-predictive-aviation-safety-risk-management
Error Prevention as Developed in Airlines
Federal Aviation Administration Advisory Circular 120-82, Flight Operational Quality Assurance
Recent Advances in Anomaly Detection Methods Applied to Aviation
Challenges and Opportunities in Flight Data Mining: A Review of the State of the Art
Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms
Anomaly Detection in General-Aviation Operations Using Energy Metrics and Flight-Data Records
Identification of Instantaneous Anomalies in General Aviation Operations using Energy Metrics
A Novel Deep learning method for aircraft landing speed prediction based on cloud-based sensor data
Incremental-learning-based unsupervised anomaly detection algorithm for terminal airspace operations
Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study
An Application of DBSCAN Clustering for Flight Anomaly Detection During the Approach Phase
Anomaly detection via a Gaussian Mixture Model for flight operation and safety monitoring
Real-time anomaly detection framework using a support vector regression for the safety monitoring of commercial aircraft
Unsupervised Anomaly Detection in Flight Data Using Convolutional Variational Auto-Encoder
Trajectory Clustering within the Terminal Airspace Utilizing a Weighted Distance Function
Towards online prediction of safety-critical landing metrics in aviation using supervised machine learning
Deep spatio-temporal neural networks for risk prediction and decision support in aviation operations
Explaining Aviation Safety Incidents Using Deep Temporal Multiple Instance Learning
A Supervised Learning Approach for Safety Event Precursor Identification in Commercial Aviation
Data-Driven Precursor Detection Algorithm for Terminal Airspace Operations
A Multiclass Multiple Instance Learning Method with Exact Likelihood
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning
Using ADOPT Algorithm and Operational Data to Discover Precursors to Aviation Adverse Events
Critical Parameter Identification for Safety Events in Commercial Aviation Using Machine Learning
Detection of precursors to aviation safety incidents due to human factors
Application of Machine Learning to Parameter Selection for Flight Risk Identification
Multiple instance learning: A survey of problem characteristics and applications
Multi-head CNN-RNN for multi-time series anomaly detection: An industrial case study
Deep Learning
PyTorch: An Imperative Style, High-Performance Deep Learning Library
A Gentle Introduction to the Rectified Linear Unit (ReLU)
An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes
Adam: A Method for Stochastic Optimization
Machine Learning Approach to the Analysis of Traffic Management Initiatives
Machine Learning with R: Discover How to Build Machine Learning Algorithms, Prepare Data, and Dig Deep into Data Prediction Techniques with R
Machine Learning: a Probabilistic Perspective
The authors would like to thank Dr. Nikunj Oza, Dr. Milad Memarzadeh, and Dr. Hamed Valizadegan for the feedback and insights they provided.