title: Earthquake Nowcasting with Deep Learning
authors: Fox, Geoffrey; Rundle, John; Donnellan, Andrea; Feng, Bo
date: 2021-12-18

We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning, using three distinct models built on recurrent neural networks and transformers. We discuss different choices for observables and measures, presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash Sutcliffe Efficiency, which compares the deviation of nowcast from observation with the variance over time in each spatial region. The software is available as open source together with the preprocessed data from the USGS.

Earthquake forecasting is an old and difficult problem with many interesting characteristics. In studying it we hope not only to shed light on this socioscientific challenge but also to develop new methods based on deep learning that can be applied in other areas. Perhaps the most important characteristic is the nature of the challenge: it is unlikely that building a new zettascale supercomputer will allow accurate simulation of quakes and lead to reliable earthquake predictions [1]. As a phase transition in a system with unknown boundary conditions and phenomenological models (as for friction), it is not obvious that earthquakes are the solution of a set of differential equations or that accurate probabilities of large events can be computed. For these reasons, we have chosen recently to focus on earthquake nowcasting [2], which is the estimation of risk in the present, the immediate past, and the near future. Earthquake nowcasting is rather an archetype of the data-intensive problems characteristic of the fourth paradigm of scientific discovery [3]; we suppose that the patterns of previous events hold clues to the future. This differs somewhat from other areas where machine learning has been successful; for example, language translation and image recognition are very complicated problems, but they contain natural patterns from grammar, words, segments, colors, etc. that are clearly, if complexly, correlated with what we want to find out. Machine learning successfully learns this complex relationship between inputs and predictions. Earthquake nowcasting challenges AI to discover "hidden variables" in a case where their existence is not as clear as in other successes of machine and deep learning. These ideas are illustrated in figure 1, where we pursue the data-intensive approach on the right side rather than developing a theoretical model and determining its unknown parameters from the data. In this paper, everything is hidden and determined by the data! Data-driven methods avoid the bias of incomplete theories, but the data may not be rich enough to provide good insights. We partly address this here with a choice of "known inputs", which are natural mathematical expansion functions explained in Sec. 3.5. The history of earthquakes is recorded as events detailing the (three-dimensional) position and size of each quake. These can be binned in space and time to generate geospatial time series, i.e. sequences in time labeled by spatial position (and bin size).
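To make the binning concrete, the following is a minimal sketch, not the authors' actual preprocessing pipeline, of how a USGS-style event catalog with columns for time, latitude, longitude, and magnitude could be aggregated into 0.1-degree spatial bins and two-week time bins. The column names, region corner, and toy catalog values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def bin_catalog(events: pd.DataFrame,
                lat0=32.0, lon0=-120.0,   # assumed lower-left corner of the study region
                dspace=0.1,               # 0.1-degree spatial bins
                t0="1950-01-01", dt_days=14):
    """Aggregate an event catalog into (time bin, lat bin, lon bin) cells.

    `events` is assumed to have columns: 'time' (datetime), 'latitude',
    'longitude', 'magnitude'.  Returns one row per occupied space-time bin
    with the list of magnitudes falling in that bin.
    """
    ev = events.copy()
    ev["ilat"] = np.floor((ev["latitude"] - lat0) / dspace).astype(int)
    ev["ilon"] = np.floor((ev["longitude"] - lon0) / dspace).astype(int)
    ev["itime"] = ((ev["time"] - pd.Timestamp(t0)).dt.days // dt_days).astype(int)
    grouped = (ev.groupby(["itime", "ilat", "ilon"])["magnitude"]
                 .apply(list)
                 .reset_index(name="magnitudes"))
    return grouped

# Example with a toy catalog (purely illustrative values)
toy = pd.DataFrame({
    "time": pd.to_datetime(["1950-01-05", "1950-01-08", "1950-02-20"]),
    "latitude": [33.45, 33.47, 34.10],
    "longitude": [-116.55, -116.58, -117.30],
    "magnitude": [3.4, 4.1, 3.6],
})
print(bin_catalog(toy))
```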
This class of problem is extremely common in both science and commercial areas. Perhaps the most intense study of this problem has come from traffic (transportation) studies, with many innovations aimed at ride-hailing and papers from, for example, the companies Uber and Didi as well as academia. However, there are also many medical and earth/environmental science problems of this type [4]-[9]. This problem class includes cases where there is a significant spatial structure that typically strongly couples nearby space points. Details of the geometric structure are important, for example, in traffic-related problems, where the road network and land use define a graph structure relating different spatial points [10]-[13]. The problem chosen here is termed a spatial bag [14], [15], where there is spatial variation but it is not clearly linked to the geometric distance between spatial regions. Earthquakes have an obvious coarse-grained spatial structure, but local structure comes from faults, and we have not found the latter useful in the current analysis, which therefore falls in the spatial bag class. The hydrology study of catchments [4], [16] is also a spatial bag problem: different catchments are related by values of static attributes such as aridity, elevation, and the nature of the soil, and not directly by geometry. In the following text, we first review the use of nowcasting and machine learning for earthquakes. Then we describe the data setup for deep learning and the three methods used in this work. We present nowcasting results from each of these and finish with a summary of some open questions in this promising but preliminary area. We hope that the deep learning approach can be more general and not require a priori guesswork as to which patterns are important. Earthquakes are a clear and present danger to world communities [17], and many earthquake forecasting methods have been proposed to date with little success. A comprehensive recent review is given by [2]. In this paper, we continue the development of a new approach, earthquake nowcasting, using machine learning methods that have recently been developed. Nowcasting is a term originating from economics and finance. It refers to the process of determining the uncertain state of the economy or markets at the current time by indirect means. By the current time we mean the present, the immediate past, and the very near future. Rundle and Donnellan [18], [19] and [2], [20]-[22] have applied this idea to seismically active regions, where the goal is to determine the current state of the fault system and its current level of progress through the earthquake cycle. In our implementation of this idea, we use the global catalog of earthquakes, using "small" earthquakes to determine the level of hazard from "large" earthquakes in the region. The original method is based on the idea of an earthquake cycle. A specific region and a specific large earthquake magnitude of interest were defined, ensuring that there is enough data to span at least 20 or more large earthquake cycles in the region. An "Earthquake Potential Score" (EPS) was then defined as the cumulative probability distribution P(n < n(t)), where n(t) is the current count of small earthquakes (here those with magnitude > 3.29) since the last large earthquake. Other easily available quantities are powers of the quake energy (using Energy ~ 10^(1.5m), where m is the magnitude). We use the concept of "energy averaging" when there are multiple events in a single space-time bin.
The magnitude assigned to each bin is defined in equation (2) as

m_bin = (2/3) log10( Σ_i 10^{1.5 m_i} )     (2)

where the sum runs over the events i in the bin, with individual magnitudes m_i. We also use energy averaging, defined in equation (3), for quantities Q such as the depth:

Q_bin = Σ_i Q_i 10^{1.5 m_i} / Σ_i 10^{1.5 m_i}     (3)

The multiplicity of each bin is the number of events in the bin satisfying a given criterion (such as magnitude > 3.29) and is not energy averaged:

N_bin = Σ_i 1     (4)

We also looked at powers of the energy instead of its logarithm, for n = 0.25, 0.5, 1:

E^n_bin = ( Σ_i 10^{1.5 m_i} )^n     (5)

As the exponent n increases from n~0 (the log) to n=1, one becomes more sensitive to large earthquakes but loses dynamic range in the deep learning network, as measured by the mean value/maximum of the input data. We have done a preliminary analysis of E^0.25, but this paper only presents an analysis of the data using "Log(Energy)" (the magnitude). Note that all input properties and predictions (separately for each data category) are independently normalized to have a maximum modulus of 1 over all space and time values. In deep learning, values of O(1) are especially significant, as that is where activation layers introduce nonlinearity into the analysis. The time dependence of different measures of earthquake activity is compared in figure 5 for 2-week time bins. The data is summed over the 2400 spatial bins but averaged as in equations (1) to (5) within each space-time bin. All plots show m_bin defined in (2), which we use for most inputs and outputs, and we compare with the multiplicity of events with magnitude > 3.29, the multiplicity of all events, E^0.25, and E^0.5. Note that E^0.5 is the "Benioff strain" [39]-[42], a measure of the accumulating stress or strain. The m_bin curve in each figure is renormalized to have a maximum that is precisely one half that of the other plotted quantities. Unsurprisingly, m_bin has the least drastic time structure, and the multiplicity plots are similar in their overall structure to E^0.5. We largely deal with m_bin, as its smoother behavior is easier to model. We tried some preliminary analysis with E^0.25 for targets (defined later) but did not find a good description of the data, as deep learning does not easily describe measurements with a large ratio of maximum to mean. These ratios are recorded in Table 1, where for the model the individual pixel row is most relevant. In the results presented here, we considered only the 500 most active "pixels" among the 2400 of the full dataset. These active regions were selected, as in earlier papers, by cuts on the counts of events with magnitude > 3.29. Of these, 400 pixels were used for training and 100 for validation and testing. The 2400 analyzed 0.1 by 0.1-degree subregions (pixels) are illustrated in figure 6, with the 400 training pixels in red and the 100 validation pixels in green; the pixel intensity corresponds to earthquake activity. This analysis is built around forward and backward observables using 2-week time units and the m_bin observables, which are calculated over various time intervals and in the forward and backward directions. m_bin(F:Δt,t) is the energy-averaged magnitude of equation (2) calculated over the time interval Δt starting at the bin following time t. m_bin(B:Δt,t) is the energy-averaged magnitude of equation (2) calculated over the time interval Δt ending at the bin at time t. Forward measurements are predicted and backward measurements are inputs. Note that a one-year prediction m_bin(F:Δt=52 weeks,t) cannot be calculated from the 26 two-week predictions m_bin(F:Δt=2 weeks,t) to m_bin(F:Δt=2 weeks,t+25); the energy averaging (roughly a maximum magnitude) does not commute with averaging the predictions.
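The following short sketch, our illustration rather than the authors' released code, implements equations (2)-(5) as reconstructed above for the list of event magnitudes in one space-time bin; the helper names are hypothetical.

```python
import numpy as np

def bin_features(magnitudes, depths=None, n_powers=(0.25, 0.5, 1.0), mag_cut=3.29):
    """Per-bin observables following equations (2)-(5) above."""
    m = np.asarray(magnitudes, dtype=float)
    energies = 10.0 ** (1.5 * m)               # Energy ~ 10^(1.5 m), constants dropped
    e_sum = energies.sum()

    m_bin = (2.0 / 3.0) * np.log10(e_sum)      # eq. (2): energy-averaged magnitude
    multiplicity = int((m > mag_cut).sum())    # eq. (4): event counts, not energy averaged
    e_powers = {n: e_sum ** n for n in n_powers}  # eq. (5): E^n; n=0.5 is the Benioff strain

    depth_bin = None
    if depths is not None:                     # eq. (3): energy-averaged depth
        d = np.asarray(depths, dtype=float)
        depth_bin = float((d * energies).sum() / e_sum)

    return {"m_bin": m_bin, "multiplicity": multiplicity,
            "E_powers": e_powers, "depth_bin": depth_bin}

# One bin containing three events (illustrative values)
print(bin_features([3.4, 4.1, 3.6], depths=[7.0, 10.5, 5.2]))
```

Because the bin energy is dominated by the largest event, m_bin behaves roughly like a maximum magnitude, which is why it varies more smoothly in time than the raw energy powers.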
Here earthquake analysis differs from other time series where the observables are directly measured event counts. In the traffic problems mentioned earlier, the road structure is described by a graph neural net; we did not use this approach here but rather, after much experimentation, settled on a simple idea shown in fig. 7. We divided the faults into groups (36 in the results presented here) and labeled each region by its index on a space-filling curve running through our region. Each of the 2400 pixels is then given a static label from 0 to 35, which is used as described below. There are many ways of generating space-filling curves; we chose 4 independent ones, so each pixel ends up with four static labels. We introduced the concept of a spatial bag for time series earlier and illustrate it in fig. 8, where we have a set of time series for which the spatial distances (e.g. locality) between points are not important; rather, the series are differentiated by values of properties that can be either static (such as the percentage of the population with high blood pressure in medical examples, or the minimum annual temperature in hydrology catchment studies) or dynamic (such as a local social distancing measure). In other papers, we will discuss problems that combine the features of spatial bags and distance locality, where convolutional networks (especially convLSTM) are clearly useful. We tried convLSTM for the earthquake case but did not find it as effective as the methods reported here. Four distinct deep learning implementations will be considered:
1) LSTM: a pure recurrent neural network -- a two-layer LSTM [43].
2) TFT: a version of the Google Temporal Fusion Transformer [44]-[46] that uses two distinct LSTMs -- one each for encoder and decoder -- with a temporal (attention-based) transformer.
3) Science Transformer: a hybrid model built at the University of Virginia with a space-time transformer for the decoder and a two-layer LSTM for the encoder.
4) AE-TCN: an architecture that combines two models. One is an AutoEncoder (AE) that encodes and decodes the image-like input and output. The other is a Temporal Convolutional Network (TCN) that takes the differences between the input and output of the AutoEncoder and predicts the one-step-ahead loss, which is used to nowcast imminent earthquakes in [49], [50].
All of these are programmed using the standard Tensorflow 2 class model. This required some adjustment for the TFT, where the software is presented [47] in Tensorflow 1 compatibility mode running under Tensorflow 2. NVIDIA has a modern version of the TFT in PyTorch [48]. We modified the TFT in a few ways to allow multiple targets and multi-layer LSTMs. We also use the Mean Square Error (MSE) rather than quantiles (mean absolute error) as the loss function; this loss is used identically in all three models. MSE appears more appropriate than MAE for this problem, as it emphasizes the interesting large-earthquake region; MAE weights small earthquakes more than MSE does. Deep learning models are typically constructed as a set of connected layers whose forward and back-propagation are supported by the well-known PyTorch and Tensorflow frameworks. Each layer has parameters one can change, and the user can also vary the order and nature of the layers. We have three implementations with rather different structures and sizes. The smallest model is the LSTM, with 6 layers shown in figure 9 and 66,594 trainable parameters. Layers used include basic dense, dropout, activation, softmax, add, and multiply layers as well as complex layers such as the LSTM.
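For orientation, here is a minimal sketch in the spirit of the simplest (two-layer LSTM) model; it is not the exact 66,594-parameter network of fig. 9, and the layer sizes, window length, and feature counts are illustrative assumptions.

```python
import tensorflow as tf

# Illustrative dimensions (assumptions, not the paper's exact configuration):
W = 13         # length of the input time window in 2-week units
IN_PROP = 27   # number of input time series per pixel
OUT_PRED = 24  # number of predicted time series per pixel

def build_two_layer_lstm():
    """A small two-layer LSTM nowcaster sketched after the paper's simplest model."""
    inputs = tf.keras.Input(shape=(W, IN_PROP))
    x = tf.keras.layers.Dense(64, activation="relu")(inputs)   # per-time-step embedding
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)     # first recurrent layer
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.LSTM(64)(x)                            # second recurrent layer
    outputs = tf.keras.layers.Dense(OUT_PRED)(x)               # projection to targets
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")                # MSE loss, as in the paper
    return model

build_two_layer_lstm().summary()
```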
The broad classes of layers we use are dense embedding layers for initial processing and projection layers for final processing, transformers to identify patterns, and LSTM recurrent networks for time-series processing. In other work, we have used autoencoders and temporal convolutional networks to identify earthquakes as extreme events [49], [50]. We illustrate the simplest network in fig. 9, where grey represents layers and yellow-green represents data arrays with their dimensions. The high-level architectures are compared in fig. 10, while the detailed TFT architecture is well described in [44], [45] and the other two models can be found in [51], [52]. In fig. 9 the array dimensions are labeled by W the sequence window length (13), InProp the number of input time series (27), and OutPred the number of predicted time series (24). The Temporal Fusion Transformer has two important differences from the other two models. Firstly, it uses static variables to provide context (initial values) for several of the time-dependent steps. Secondly, it uses separate input embedders and output mappers for each variable, whereas the other two methods use shared embedders and mappers. In future work, these TFT strategies should be considered in the other methods. Deep learning involves somewhat different inputs and outputs (predictions or targets) from more traditional approaches. Following [45], we identify three major classes of features, which are either functions of time (time series) for each member of the spatial bag (i.e. each pixel in fig. 6) or static characteristics of the bag member. These classes are Known Inputs, Observed Inputs, and Targets. We describe these in general and then specify how they were chosen for each model. Observed Inputs are the basic measured quantities such as the magnitude and depth of earthquakes. In the Covid example illustrated in fig. 3, they are daily infections and fatalities but also time-dependent auxiliary variables such as the vaccination rate, measures of social distancing, hospitalization rates, etc. In the earthquake case, bin multiplicity counts are auxiliary variables. More formally, we divide observed inputs into "(a priori) unknown inputs" and targets, since one typically has variables that are observed inputs at times up to the current time but also need to be predicted and so fall in the target category as well. Known Inputs are an interesting concept that includes both static features and time series (known time-dependent features). These are parameters that are known in both the past and the future, whereas observed inputs are only known in the past and need to be nowcast into the future. In the Covid and earthquake examples, the only measured known inputs are the static features. In commercial applications, daily signatures of holidays or weekends are interesting known inputs. Note that these are categorical variables; in general, parameters can be categorical or come from a numerical scale, and the latter tend to dominate scientific time series. In our analysis, we made extensive use of Mathematical Expansions, which are functions of time in terms of which it appears natural to express the time dependence. In general, this is the place where simple models or theoretical intuition can be fed in, but there is so far not much experience in exploiting this. As well as feeding theoretical intuition into the "known inputs", one can also feed sophisticated functions of the observed data -- such as those in our earlier work -- in as "observed inputs", allowing the deep learning to identify precursor patterns of future earthquakes.
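To make the three feature classes concrete, the following schematic sketch (our illustration, with hypothetical shapes) shows how the tensors for one spatial-bag member might be organized for a nowcast made at time index t: observed inputs exist only for the past window, known inputs are available for past and future times, and targets are the quantities to be predicted.

```python
import numpy as np

# Hypothetical sizes: W past time steps, F future steps to nowcast
W, F = 13, 1
N_OBSERVED = 27   # observed inputs: magnitudes, depths, multiplicities, ...
N_KNOWN = 9       # known inputs: e.g. static space-filling-curve labels + mathematical expansions
N_TARGETS = 24    # targets: e.g. forward m_bin over several intervals

rng = np.random.default_rng(0)

observed = rng.normal(size=(W, N_OBSERVED))      # known only up to the current time t
known = rng.normal(size=(W + F, N_KNOWN))        # computable for past AND future times
targets = rng.normal(size=(N_TARGETS,))          # what the network must predict at t

print(observed.shape, known.shape, targets.shape)
```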
Fig. 12 shows the modified TFT applied to describe daily Covid data up to November 29, 2021 (extended from [43]). The weekly mathematical "known input" helps the model obtain a good description. The annual periodic variation was used to advantage for describing hydrology data in [52]. We also used Legendre polynomials P_l(θ(t)) as known inputs, where θ(t) varies from -1 to +1 over the full time range of the problem; the results here use the Legendre polynomials from l = 0 to 4 (a short illustrative sketch of these known inputs appears below). Note that θ(t) and the P_l(θ(t)) lie between -1 and +1 and so fit the deep learning framework constraints quite well, as they do not create poor mean/max ratios. These are known inputs because they can be calculated for any time value -- past or future. Targets have already been introduced; these are the functions of time that we are trying to predict. They often include all or a subset of the observed inputs, as for training they need to be known for times previous to that at which the nowcast is made. However, just as we had unknown inputs that were time series which were not predicted, we also sometimes used Synthetic Targets that were predicted but were not part of the input set. These need to be known for "all times" so they can be used in training. However, whereas inputs are best if they have no missing data, that is not required for targets: targets can be missing because we use an additive loss function (MSE or MAE), and missing targets are simply dropped from the sum in the loss function. We use this when predicting many years into the future, where targets are dropped from training at times for which the future measurements are not available. If targets predict several different intervals into the future, we choose the maximum training time so that the targets with the smallest forward interval are known, even if the others are not. In the simplest models, targets are the values of observables at the final time. However, we also use synthetic futures where the observables are the m_bin(F:Δt,t) integrals over time periods Δt of up to 4 years. We now summarize the choices that we made for the three networks discussed in this paper. Note that for the LSTM and Science Transformer, static variables are implemented as "known input time series" with values independent of time. For the TFT, static variables are cleverly used to provide initial context for the appropriate layers processing the true time series. The deep learning models are perhaps the glamorous part of the analysis, but substantial data engineering is needed to pre-process the input data and visualize the results. This hard work is described in [52] and can be viewed in the online Jupyter notebooks. The deep learning methods use approaches that were largely developed originally for natural language processing. Recurrent Neural Networks (RNNs) explicitly focus on sequences and pass them through a common network with rather subtle features; they are designed to gather a history that allows the time dependence to be remembered. Attention-based methods [56] are more modern and perhaps somewhat easier to understand, as attention is a simple idea. NLP is basically a classification problem (look up tokens in a context-sensitive dictionary), whereas science time series tend to be numerical, so it is not immediately obvious how to use attention in technical time series; we describe one possible approach in this paper. There have been a few studies of transformer architectures for numerical time series, such as [57]-[63], but there is not a large literature.
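The Legendre-polynomial known inputs mentioned above can be generated as in the following minimal sketch; the number of time bins is an illustrative assumption, not the exact count used in the paper.

```python
import numpy as np
from numpy.polynomial import legendre as L

def legendre_known_inputs(n_times, l_max=4):
    """Known inputs P_0..P_l_max evaluated at theta(t) in [-1, 1].

    theta(t) maps the full time range of the problem linearly onto [-1, +1],
    so these values can be computed for any past or future time bin.
    """
    theta = np.linspace(-1.0, 1.0, n_times)
    # legvander returns an (n_times, l_max+1) array whose column l is P_l(theta)
    return L.legvander(theta, l_max)

known = legendre_known_inputs(n_times=1846)   # roughly the number of 2-week bins, 1950-2020
print(known.shape, known.min(), known.max())  # all values lie in [-1, 1]
```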
Attention [64], [65] means that one "learns" structure from other related data and looks for patterns using a simple "dot-product" mechanism, discussed later, that matches the structure of different sequences; there are other approaches to matching patterns, which is a good topic for future work. Here we use a simple attention mechanism in an initial decoder but use a recurrent LSTM network for the encoder, as shown in fig. 13b). Such mixtures have been investigated and compared [66], [67]. We compare the two architectures shown in a) and b) of fig. 13: the pure LSTM used in sec. 2 and the science transformer. The basic item for the LSTM and the Transformer is the same: a space point with a time sequence, with each time in the sequence having a set of static and dynamic values. In an LSTM the sequence is "hidden" and you have to unroll the recurrent network to see it. In the transformer, however, the different time values in a sequence are treated directly, so each item contains W terms (W is the size of the time sequence). Each term is embedded in an input layer and then mapped by 3 different layers into vectors Q (query), K (key), and V (value). As shown in fig. 14, one matches terms i and j by calculating Q(i)K^T(j) and ranking with a softmax step. This weight multiplies the characteristic vector V(j) of the pattern, and the total attention A(i) for item i is calculated as a weighted sum over the values V(j). There are several different attention "heads" (networks generating Q, K, V) in each step, and the whole process is repeated in multiple encoder layers. The result of the encoder step is considered separately for each item (each time in a time sequence at a given location), and the embedded input of this layer is combined with the attention as input to the LSTM decode step. In natural language processing you look for patterns among neighboring sentences, but for science time series you can use larger regions, as spatial bags have no locality. This leads to many choices for the space over which attention is calculated, as one cannot realistically consider all items simultaneously. Suppose we have N_loc locations, each with N_seq sequences of length W. Then the space to be searched has size N_loc · N_seq · W, which is too large. In the COVID-19 example of fig. 12 [43], N_loc = 500, N_seq ~ 700, and W up to 13. In the hydrology example in [52], N_loc = 671, N_seq ~ 7000, and W up to 270. The next subsection describes the 3 search strategies we have looked at, depicted in fig. 15:
• Temporal search: points in the sequence for a fixed location
• Spatial search: locations for a fixed position in the sequence
• Full search: the complete location-sequence space
One will need to sample items randomly, as only a small fraction of the space is looked at in one attention step whatever method is used. Note that in all cases we used a batch size of 1, as the attention space was effectively the batch. Actually, in the LSTM stage, the different locations in the attention space were considered separately and the attention search space became the batch. In the work reported here the attention space (and batch size) was set to N_loc, but this is not required and would not work in some cases with large values of N_loc. Even in the examples considered here, the search space can get so large that one needs to address memory size issues.
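A minimal numpy sketch of the dot-product attention step described above follows (a single head, with illustrative dimensions and random projection weights); this is the generic scaled dot-product mechanism rather than the authors' exact layer.

```python
import numpy as np

def dot_product_attention(X, d_k=16, rng=np.random.default_rng(1)):
    """Single-head dot-product attention over the items in X (shape [n_items, d_model])."""
    n_items, d_model = X.shape
    # The three learned projections; random here purely for illustration
    Wq = rng.normal(size=(d_model, d_k))
    Wk = rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                      # Q(i) . K(j) for every pair i, j
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax ranking over j
    return weights @ V                                   # A(i) = weighted sum of the V(j)

# e.g. one attention space of 500 embedded items of dimension 32
A = dot_product_attention(np.random.default_rng(0).normal(size=(500, 32)))
print(A.shape)   # (500, 16)
```

In practice the attention space (the set of items over which i and j range) is chosen by one of the temporal, spatial, or full search strategies listed above.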
Note that the encoding step of the transformer consists of many matrix multiplications and obtains excellent GPU performance; typically the LSTM decoder would be a significant part of the compute time, so the addition of attention is not a major performance hurdle, although it does require a large GPU memory. The TFT only uses temporal attention, separately for each space point, so only the science transformer needs major (80 GB of A100) GPU memory. Note that the choice of attention space implies different initial shuffling to form batches and also different prediction-stage approaches. For spatial and temporal searches one can keep all locations at a particular time value together, both in forming batches and in calculating predictions. For the full search the complete set of N_loc · N_seq sequences must be shuffled. In practice, we combined the space-time and pure time searches and accumulated the results of these two attention searches. These models give very detailed predictions over time and space, and we can only illustrate them in this paper. We give a global summary with the Nash Sutcliffe Efficiency and a sample of time-dependent results in figures of the earthquake activity m_bin(F:Δt,t). The Nash Sutcliffe Efficiency (NSE) is given in equation (1); we use the normalized NSE, NNSE = 1/(2-NSE), which runs between 0 (bad) and 1 (perfection). Table 5 gives the results for m_bin summed over all 400/100 locations. Note that all time interval bins use the forward m_bin described in section 2 and equation (2). Figures 16-19 show the nowcasts for the training and validation sets for a selection of time intervals: 2 weeks, one year, two years, and four years. In this initial study, the 3 methods give qualitatively similar results, which show strong promise for useful nowcasts. The time period listed corresponds to the forward prediction of that length. The TFT only uses backward prediction but does this for time steps reaching 26 fortnights into the future, so the backward one-year value at t+26 fortnights is the forward prediction at time t. It is not clear whether the detailed 2-week predictions of the TFT into the future (not given in the other models) are valuable. We noted before that one cannot use these predictions to calculate cumulative observables, as energy averaging and taking the mean do not commute. The current TFT implementation was not set up to nowcast all the time intervals covered by the other models, so this model is absent in some cells of Table 5 and in the later figures for 2-year and 4-year nowcasts. This is not intrinsic to the TFT model and will be addressed in future work using larger memory machines to train the networks. The AE-TCN joint model is built as described in previous research [49], [50], and an overview of the architecture is shown earlier in Figure 10. This research is an empirical study for predicting extreme shocks as major events controlled by a pre-configured threshold value. Fig. 20 illustrates a result from a test set covering around 2,000 data points, where each data point is an image-like snapshot of Southern California with pixels representing shock energy. We calculate the loss of the AutoEncoder as the mean absolute error between the reconstruction and the input. In Figure 20, the top sub-figure is the min-max scaled loss on the test dataset, and the bottom sub-figure is the model prediction on the test dataset; a high prediction value indicates that the corresponding event is large. In this joint modeling, the autoencoder is responsible for encoding and decoding the image input, and the temporal convolutional network takes the differences from the autoencoder and predicts a high loss, which indicates large events one step ahead.
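Equation (1) is not reproduced in the text above, so the sketch below assumes the standard Nash Sutcliffe definition (one minus the ratio of the squared nowcast error to the observed variance over time), combined with the NNSE = 1/(2-NSE) normalization quoted above; it is our illustration rather than the authors' evaluation code.

```python
import numpy as np

def nnse(observed, nowcast):
    """Normalized Nash Sutcliffe Efficiency for one spatial bin's time series.

    Assumes the standard NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2),
    then maps it to NNSE = 1 / (2 - NSE), which runs from 0 (bad) to 1 (perfect).
    """
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(nowcast, dtype=float)
    nse = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return 1.0 / (2.0 - nse)

# Toy example: a nowcast that tracks the observation fairly well
t = np.arange(100)
obs = np.sin(t / 7.0)
pred = obs + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(round(nnse(obs, pred), 3))
```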
Above we have presented an initial analysis of the use of deep learning for nowcasting earthquakes. We believe the methods are promising and have broad applicability across earth science, which offers many examples of geospatial time series. We have identified a class -- spatial bags -- where different space points are linked by different values of static labels and not directly by the geometry. We see very complex networks, ranging from over 60,000 to 8 million trainable (unknown) parameters. We use the data to develop the model, which then replaces the physics description with these hidden variables. Note that a feature of the gradient descent optimizers is that one can use a redundant parametrization and still converge to a reasonable description. With very many parameters one will over-train, but by comparing training and validation losses we find the good descriptions presented here. This paper is not a definitive study, as it is probing somewhat uncharted territory. In particular, we have made very many heuristic choices that need further investigation -- or one could say there are many hyperparameters in this problem that need to be explored. Areas to explore include the time unit (2 weeks here), the choice of known inputs, the many parameters defining the networks, the targets, and the basic variable (from m_bin to the energy E and its powers). We should also explore the choice of the validation set: should it be obtained by dividing in space (as here) or in time? We noted earlier that the TFT has several clever ideas that could be used in the LSTM and Science Transformer examples. Further, for the transformer, should one use temporal patterns as in the TFT and AE-TCN, or the space-time choice made in the Science Transformer? We intend to build a benchmark set of time series datasets and reference implementations to play the same role for time series that ImageNet, ILSVRC, and AlexNet played for images. The different implementations establish best practices and can be used in different applications, as already identified [68] in finance, networking, security, monitoring of complex systems from Tokamaks [69] to operating systems, and environmental science. This work will be performed in the MLCommons Science Data working group [70]. We intend to combine the open datasets and clean reference implementations available in MLCommons [71] with documentation and tutorials, which will allow MLCommons benchmarks to encourage the broad community to study these examples and use the ideas in other applications, as well as to improve on our base reference implementations.
Computational Earthquake Science
The complex dynamics of earthquake fault systems: new approaches to forecasting and nowcasting of earthquakes
The Fourth Paradigm 10 Years On
D2 2020 AI4ESS Summer School: Recurrent Neural Networks and LSTMs
CAMELS Extended Maurer Forcing Data
Catchment-Aware LSTMs for Regional Rainfall-Runoff Modeling GitHub
Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets
The CAMELS data set: Catchment attributes and meteorology for large-sample studies
A Comprehensive Review of Deep Learning Applications in Hydrology and Water Resources
Artificial Intelligence for Smart Transportation Video
Deep multi-view spatial-temporal network for taxi demand prediction
Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting
Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework
Deep Learning for Spatial Time Series
Study of Earthquakes with Deep Learning, presented at the Frankfurt Institute for Advanced Study Seismology & Artificial Intelligence Kickoff Workshop, Virtual
FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks
The Mechanics of Earthquakes and Faulting
Nowcasting earthquakes in southern California with machine learning: Bursts, swarms, and aftershocks may be related to levels of regional tectonic stress
Nowcasting earthquakes in southern California with machine learning: Bursts, swarms and aftershocks may reveal the regional tectonic stress
Nowcasting Earthquakes by Visualizing the Earthquake Cycle with Machine Learning: A Comparison of Two Methods
Nowcasting earthquakes
Natural Time, Nowcasting and the Physics of Earthquakes: Estimation of Seismic Risk to Global Megacities
Global Seismic Nowcasting With Shannon Information Entropy
Linear pattern dynamics in nonlinear threshold systems
Principal component analysis of geodetically measured deformation in Long Valley caldera, eastern California
Eigenpatterns in southern California seismicity
Nowcasting Earthquakes: Imaging the Earthquake Cycle in California with Machine Learning
Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks
Long short-term memory
Generative Adversarial Network for Stock Market price Prediction
A Gentle Introduction to Generative Adversarial Networks (GANs)
Generative adversarial network
ID3 Algorithm
Nash-Sutcliffe model efficiency coefficient
Modelling daily streamflow at ungauged catchments: what information is necessary?
Enhancing streamflow forecast and extracting insights using long-short term memory networks with data integration
Earthquake Hazards Program of United States Geological Survey
Earthquake Data Used in Study 'Earthquake Forecasting with Deep Learning'
Precursory Seismic Activation and Critical-point Phenomena
Accelerated Seismic Release and Related Aspects of Seismicity Patterns on Earthquake Faults
A systematic test of time-to-failure analysis
Log-periodic behavior of a hierarchical failure model with applications to precursory seismic activation
AICov: An Integrative Deep Learning Framework for COVID-19 Forecasting with Population Covariates
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
Temporal fusion transformers for interpretable multi-horizon time series forecasting
Temporal Fusion Transformer: Time Series Forecasting with Interpretability, Google's state-of-the-art Transformer has it all
Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
TFT For PyTorch
Spatiotemporal Pattern Mining for Nowcasting Extreme Earthquakes in Southern California
TSEQPREDICTOR: Spatiotemporal Extreme Earthquakes Forecasting for Southern California
Study of Earthquakes with Deep Learning (Earthquakes for Real)
Deep Learning Based Time Evolution
FFFFWNPF-EARTHQB-LSTMFullProps2 Google Colab for LSTM Forecast
FFFFWNPFEARTHQ-newTFTv29 Google Colab for TFT Forecast
Y8FFFFWNPF-EARTHQDGX-Transformer1 DGX Jupyter Notebook for Science Transformer Forecast
Attention in Natural Language Processing
An attention based deep learning model of clinical events in the intensive care unit
Spatiotemporal Attention for Multivariate Time Series Prediction and Interpretation
Attention-Based Hierarchical Recurrent Neural Network for Phenotype Classification
Deep Contextual Clinical Prediction with Reverse Distillation
CAMP: Co-Attention Memory Networks for Diagnosis Prediction in Healthcare
Think Globally
Attend and diagnose: Clinical time series analysis using attention models
Attention is All you Need
Attention Is All You Need
A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning
Benchmarking Deep Learning for Time Series: Challenges and Directions
Predicting disruptive instabilities in controlled fusion plasmas through deep learning
Science Data working Group of MLCommons
MLCommons Homepage: Machine learning innovation to benefit everyone

The work of GCF and BF is partially supported by the National Science Foundation (NSF) through awards CIF21 DIBBS 1443054, CINES 1835598, and Global Pervasive Computational Epidemiology 1918626. We thank Tony Hey, Jeyan Thiyagalingam, Lijiang Guo, Gregor von Laszewski, Niranda Perera, Saumyadipta Pyne, Judy Fox, Xinyuan Huang, Russell Hofmann, Keerat Singh, JCS Kadupitiya, and Vikram Jadhao for great discussions. Sercan Arik and the Google TFT team were very helpful. We are grateful to the Cisco University Research Program Fund grant 2020-220491 for supporting this research. The inspiration of MLCommons has been essential to guide our research. Research by JBR has been supported by a grant from the US Department of Energy to the University of California, Davis, DoE grant number DE-SC0017324, and by a grant from the National Aeronautics and Space Administration to the University of California, Davis, number NNX17AI32G.
Portions of the research by Andrea Donnellan were carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.