The usefulness of NLP techniques for predicting peaks in firefighter interventions due to rare events
Authors: Selene Cerna, Christophe Guyeux, David Laiymani
Date: 2022-02-26. Journal: Neural Comput Appl. DOI: 10.1007/s00521-022-06996-x

In some countries such as France, the number of operations assisted by firefighters has shown an almost linear increase over the years, contrary to their resource capacity. For this reason, predicting the number of interventions has become a necessity. Initially, time series models were developed with several types of qualitative and quantitative features, including the alert level of the bulletins, to predict the operational load. We realized that interventions related to human activities are quite predictable. However, the recognition of interventions due to rare events such as storms or floods needs more than quantitative meteorological data to be identified, since there are almost always zero cases. Thus, this work proposes the application of natural language processing techniques, namely long short-term memory, convolutional neural networks, FlauBERT, and CamemBERT, to extract features from the texts of weather bulletins in order to recognize periods with peak interventions, where the intense workload of firefighters is caused by rare events. Four categories identified as Emergency Person Rescue, Total Person Rescue, interventions related to Heating, and Storm/Flood were our targets for the multilabel classification models developed. The results showed a remarkable accuracy of 80%, 86%, 92%, and 86% for Emergency Rescue People, Total Rescue People, Heating, and Storm/Flood, respectively.

The missions and the activity of fire departments change from one country to another: in some countries, fire departments are only in charge of extinguishing fires, while in others they are also in charge of rescuing people, whether it is urgent or not. In countries such as France, fire activities represent only about 10% of all their interventions, and they may be called out for road accidents, floods, respiratory ailments, gastroenteritis, or even wasp nests. In western countries with an aging population, the share of personal assistance is constantly increasing, while repeated economic crises lead to budget cuts. Thus, in countries where fire brigades' activity encompasses rescuing people, the number of interventions has been steadily increasing for several years now, and the COVID-19 pandemic has only amplified the situation. In this context, ensuring rapid and efficient interventions becomes a major challenge for many brigades, and it seems interesting to try to plan interventions in advance. Better still, answering the questions When? Where? Of what type (road accident, domestic accident, flood, etc.)? can have a strong impact and therefore save lives. This difficult problem is just beginning to be studied [1, 2, 3].

In fact, the vast majority of brigades have accumulated an important mass of data related to their past interventions. It is therefore interesting to try to use these data and the recent advances in machine learning (ML) to try out intervention prediction. It is reasonable to think that such predictions are possible, as the reasons for these interventions are to some extent deterministic.
Forest fires occur more frequently in dry, hot weather than in wet, rainy weather; floods follow heavy rains; domestic and work accidents occur mostly in the middle of the day, and they are rare at 3 a.m.; falls on the ice do not occur in summer and drowning in outdoor pools does not occur in winter. As can be seen from the above examples, weather conditions have an undeniable impact on human activity and the accidents it causes, and therefore on the activity of firefighters. In addition, most national weather prediction services provide online services or APIs to retrieve various physical quantities useful for weather knowledge. These quantities, which are internationally codified and are mandatory measurements, include temperature, pressure, wind direction and strength, dew point, or hygrometry, for a set of stations spread over the entire territory.

However, while these physical quantities are useful in predicting, to some extent, the weather, they are only partially effective in predicting firefighters' interventions, their types, and intensities. Knowing that a thunderstorm event is coming does not accurately predict the extent to which firefighters' activity will be affected. More precisely, by coupling the history of weather conditions with the history of interventions, we can see that while some storm-type weather conditions clearly lead to peaks in intervention, others, on the contrary, go somewhat unnoticed. In other words, the quantitative values of the measured physical quantities alone do not make it possible to accurately predict the occurrence of periods of heavy intervention (for example, heavy rains do not systematically lead to flooding).

In this article, we would like to insist on the fact that most states have also set up a meteorological vigilance service, with reports indicating the risk incurred (snow, ice, storm, floods, etc.), the duration of the event, and its textual description, as well as useful advice. Integrating the level of vigilance in the form of qualitative variables, one for each monitored risk, improves, on the one hand, the regression scores in the learning phase of the total number of interventions. On the other hand, we argue that the textual content of the bulletins is valuable and underexploited, that sentences such as "it is advisable to unplug electrical appliances" or "it is strongly advised not to walk along the coast" are rich in information for the prediction of interventions such as heating or emergency rescue due to rare events, and that automatic natural language processing tools are now mature enough to understand the link between such assertions and peak intervention periods.

With these elements in mind, the present work performs a detailed study of various ML models developed for the prediction of intervention peaks due to rare weather events and identifies the remarkable impact of NLP models and weather bulletin texts. The rarity of events such as floods or heating incidents makes them difficult to predict due to the small amount of data available and, as we will see in the remainder of this paper, these events can be the source of extremely high numbers of interventions. The predictions are made for four categories of interventions, namely Emergency Person Rescue, Total Person Rescue, interventions related to Heating, and interventions related to Storm/Flood. The data were collected from 2012 to 2020.
They are composed of interventions of the Fire Department of Doubs (SDIS 25), located in the northeast of France, quantitative weather variables, and texts of weather bulletins and their vigilance levels from Météo-France. In this way, SDIS 25 or other emergency medical services (EMS), in general, could better recognize periods with a high workload generated by rare events, through text processing of weather bulletins, and strategically prepare the appropriate personnel and equipment to deal with the crisis, so that no service disruptions occur and more lives can be saved.

In summary, this article proposes four main contributions, described as follows.

a. Analysis and prediction of interventions using basic univariate time series models. This allows us to recognize the seasonality of rescue-type interventions and to identify the lack of recognition of incidents triggered by rare events. We compared three basic models per category to predict the number of interventions per hour; the three models respectively predict the overall mean, the last known value, and the mean per hour.

b. Analysis and prediction of interventions using multivariate time series models. This allows us to complement and identify the trend and the seasonality of the signal by adding qualitative and quantitative variables (vigilance indicators and weather and calendar variables). However, these are still not sufficient to recognize interventions caused by rare events. We developed seven models per category with the state-of-the-art extreme gradient boosting (XGBoost) technique, in which the features are combined to measure their impact on the prediction.

c. Analysis and prediction of intervention peaks using multilabel classification models based on decision trees and tabular data. The problem is restated as a multilabel classification task for the four categories, using the variables of the previous models and the XGBoost and random forest (RF) techniques, well known in the literature for their fast execution and robustness. This allows us to determine the influence of the variables under this new perspective and to deduce that there is still a lack of information to recognize the peaks due to rare events. Thus, these models become our baselines for the following models with natural language processing (NLP).

d. Analysis and prediction of intervention peaks using multilabel classification models based on NLP techniques and text from meteorological bulletins. We developed and compared models based on traditional (Long Short-Term Memory, LSTM, and Convolutional Neural Network, CNN) and modern (FlauBERT and CamemBERT transformers) NLP techniques, using only texts extracted from weather bulletins. This allows us to significantly improve the forecast of the peaks compared to the previous models with decision trees and tabular data, and thus to demonstrate that it is possible to extract much more information from public weather bulletins using NLP techniques.

The structure of this article continues with Sect. 2, which reviews the contributions of related works. Section 3 describes the acquisition and preprocessing of the data. Section 4 presents the types of neural networks applied for the natural language processing of bulletins. Section 5 presents the results of three basic approaches developed before the approach with NLP techniques, to analyze and demonstrate the efficacy of the latter with the best model found. The article ends with Sect. 6, in which our conclusions are shown and avenues for future improvements are discussed.
Among the works reviewed and related to the optimization of fire departments' responses to incidents, we mainly found contributions for the prediction of interventions [3, 4] and fires [5]. Likewise, in certain parts of the world firefighters are part of EMS, since they also provide ambulance services. In this way, predictions of traffic accidents [6], ambulance response time, and resource allocation are also included [7, 8, 9]. Furthermore, we can find works related to the prediction of rare events such as earthquakes [10] and hurricanes [11], which would allow firefighters to identify a specific location for damage assessment and develop better strategies when assisting the population. In fact, the aforementioned works did not use NLP techniques.

However, in other studies, NLP techniques demonstrate their outstanding utility by enriching data sources and predictions. This is achieved by recovering habits, preferences, emotions, feelings, and distress messages through the recognition of semantic patterns [12, 13, 14] from various media such as social networks, the news, therapeutic reminders, and information systems, in video-based, audio-based, and text-based formats [15]. On the one hand, in [16], the authors presented a cognitive assistant system for EMS based on the Google Speech API, in which the voice records (incident description and patient status) received by the respondent are converted into texts to extract medical concepts. In this way, the system responds by providing information to rescuers on protocols to follow, such as resuscitation and airway management. On the other hand, NLP has contributed to the generation of terminological sources for the classification and forecasting of rare events or crises [17, 18, 19, 20]. For example, [18] presents the first study on crisis management using French transformer-based architectures (BERT, FlauBERT, and CamemBERT) applied to French social media, classifying tweets related to natural disasters. In [21], the authors make use of the Bayesian model averaging approach and linear-chain conditional random fields to extract knowledge from tweets and build a decision support system to identify early warning signs of earthquakes. Also, [22] developed the Flood AI Knowledge Engine, which is a system composed of ontology management, query mapping and execution, and NLP modules. The system provides emergency preparedness and response, as well as knowledge about flood-related resources for the population. This knowledge is returned by interpreting natural language queries from users.

Having seen the impact of NLP on the recognition of extreme events and given that, to the authors' knowledge, no previous studies have exploited a source such as weather bulletins to detect trends in the number of firefighter interventions, the present work takes advantage of NLP techniques to process these bulletins and predict peak intervention periods for the categories Emergency Person Rescue, Total Person Rescue, interventions related to Heating, and Storm/Flood.

We collected the interventions of the SDIS 25 firefighters over the period 2012-2020, for the entire department, for each time slot of this period (76224 slots, 1 slot = 1 hour), and for the following four types of intervention: Emergency Person Rescue, Total Person Rescue, interventions related to Heating, and Storm/Flood. A short statistical description of each type of intervention (per hour) is provided in Table 1, while Fig.
1 plots the curves for each type for the first days of 2012. It can be seen that in both types of personal assistance (emergency and total), the number of interventions is on average high enough to show seasonal patterns: the 24-hour cycle is clearly visible in Fig. 1. We can also see that some days are special, such as New Year's Day, and that the integration of calendar variables into this daily seasonality should improve the predictions. Finally, the rarity of storm or heating events will make their predictions problematic in the absence of additional information, which already shows the interest of considering weather-type variables. This is even more true for the "storms and floods" case, as it has the lowest average (0.26 interventions per hour) and an extremely high maximum (82 interventions in one hour): these events are very rare, but the source of an extremely high number of interventions.

This is the reason why we have retrieved historical meteorological data from the Météo-France site (essential SYNOP data [23]). The three closest meteorological stations selected are those of Nancy-Ochey (latitude 48.581000, longitude 5.959833), Dijon-Longvic (47.267833; 5.088333), and Basel-Mulhouse (47.614333; 7.510000). The data recovered in this way are temperature (degrees Celsius), pressure (Pa), pressure variation (Pa per hour), barometric trend (categorical), humidity (percent), dew point, last hour rainfall (millimeters), last three hours rainfall (millimeters), mean wind speed (10 min., m/s), mean wind direction (10 min., degrees), gusts over a period (m/s), horizontal visibility (m), and current weather (categorical, 100 possible values).

These data were supplemented by (textual) vigilance alert bulletins from Météo-France [24]. They are XML files containing the type of vigilance (heatwave, extreme cold, snow or ice, thunderstorms, strong winds), the beginning and end of the vigilance period, the level of the alert (green, orange, or red), a detailed description of the risk (including the locations impacted, the conditions to be expected, etc.), as well as a set of very detailed advice to users. An example of such a file is provided in Fig. 2. Finally, we added calendar-type variables, namely the hour of the time slot considered, the day in the week, the day in the month, the month in the year, and the year considered.

In total, 12218 weather bulletins have been produced since 2011, including 1054 for the northeast region (known as CMIRNE) of France, which interests us. Eighteen departments are represented in this region, but we were initially only interested in the Doubs and its three neighboring departments (Territory of Belfort, Haute-Saône, and Jura). The set of texts available may seem small at first glance, but each bulletin covers a number of departments and has several sections (location, description, qualification of the phenomenon, new facts, current situation, expected evolution, possible consequences, and behavioral advice), and each section is made up of several long and detailed sentences. The result is a corpus of 76333 characters. These bulletins are then split by section. Over the entire vigilance period, the average number of interventions is calculated for each of the four types considered, and a class 0 or 1 is introduced depending on whether this number is above the average number of interventions of the type in question over the whole 2012-2020 period.
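As an illustration, the following minimal sketch reproduces this labelling step in pandas; the file and column names are hypothetical, and only the thresholding logic follows the description above.

```python
# Minimal sketch of the 0/1 labelling described above (file and column names
# are hypothetical; only the thresholding logic follows the text).
import pandas as pd

CATEGORIES = ["emergency_rescue", "total_rescue", "heating", "storm_flood"]

interventions = pd.read_csv("interventions_hourly.csv", parse_dates=["slot"])    # one row per hourly slot
bulletins = pd.read_csv("bulletin_sections.csv", parse_dates=["start", "end"])   # one row per bulletin section

# Global hourly mean per category over the whole 2012-2020 period.
global_mean = interventions[CATEGORIES].mean()

def label_section(row):
    # Average hourly interventions during this section's vigilance period.
    in_period = interventions["slot"].between(row["start"], row["end"])
    period_mean = interventions.loc[in_period, CATEGORIES].mean()
    # Class 1 when the period average exceeds the 2012-2020 average, else 0.
    return (period_mean > global_mean).astype(int)

bulletins[CATEGORIES] = bulletins.apply(label_section, axis=1)
```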
We thus associate 4 binary labels with each text in each section of each bulletin, as shown in Table 2. Through this encoding, the question of interest becomes whether, in the context of a given bulletin, the number of interventions of each type will be above its average. In Fig. 3, one can observe in more detail the number of samples (texts) with value 0, where the number of interventions was below or equal to the mean, and value 1 in the opposite case. Furthermore, we find an unequal distribution in each binary class of each category that might bias the prediction models.

Let us now introduce the neural networks that have been considered here for natural language processing.

Long short-term memory (LSTM) networks are a type of recurrent neural network (RNN), which is a category of neural networks dedicated to sequence processing [25]. In the case of NLP, RNN are interesting since, first, they can process sequences of variable size and, second, the use of recurrent connections allows the past part of the signal to be analyzed. In this way, RNN are particularly well suited to handle three different types of problems: sequence labeling, sequence classification, and sequence generation. A recurrent network can be approximated by a non-recurrent network unfolded in time. But as the unfolded network is deeper, the vanishing of the gradient is more important during learning, and it becomes more difficult to train. In the same way, as the weights of the recurrent layer are duplicated, RNNs are also subject to exploding gradients. Although they are very effective for modeling short- or medium-term dependencies, RNN are still insufficient for modeling long-term or very long-term dependencies. In NLP, it is common to need to model dependencies of the order of a hundred or more time steps. That is why LSTM have been introduced.

To model very long-term dependencies, it is necessary to give recurrent neural networks the ability to maintain a state over a long period. This is the purpose of LSTM cells. The cell can be seen as an internal memory: the forget gate, f_t, decides which memories will be eliminated from the previous long-term state c_{t-1}; the input gate, i_t, decides which parts of the candidate state g_t will be added to the long-term state; and the output gate, o_t, decides whether the cell content should influence the output y(t) of the neuron. Let x_t be the present entry and h_{t-1} the preceding short-term state. The LSTM can be expressed in the following way:

i_t = \sigma(W_{xi}^\top x_t + W_{hi}^\top h_{t-1} + b_i)
f_t = \sigma(W_{xf}^\top x_t + W_{hf}^\top h_{t-1} + b_f)
o_t = \sigma(W_{xo}^\top x_t + W_{ho}^\top h_{t-1} + b_o)
g_t = \tanh(W_{xg}^\top x_t + W_{hg}^\top h_{t-1} + b_g)
c_t = f_t \otimes c_{t-1} + i_t \otimes g_t
y_t = h_t = o_t \otimes \tanh(c_t)

where W_{xi}, W_{xf}, W_{xo}, and W_{xg} are the weight matrices for their connection to the input vector x_t; W_{hi}, W_{hf}, W_{ho}, and W_{hg} are the weight matrices for their connection to the previous short-term state h_{t-1}; and b_i, b_f, b_o, and b_g are the bias terms of each layer. LSTM are generally used in layers, and in this case the outputs of all neurons are fed back to the inputs of all neurons.

In this paper, we used the Keras and TensorFlow libraries in Python to initially build several architectures with 1, 2, and 3 LSTM layers, different numbers of neurons, and constant values for the learning rate and batch size. Then, we evaluated their performances and selected the three architectures that gave the best initial results; these are the ones described in Table 3.
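As a concrete illustration, the following is a minimal Keras sketch of a multilabel LSTM text classifier of the kind described above; the layer sizes, vocabulary size, and embedding dimension are illustrative assumptions, not the values of Table 3.

```python
# Minimal sketch of a multilabel LSTM text classifier (hypothetical sizes).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 1000   # size of the word index kept by the tokenizer (assumption)
EMBED_DIM = 64      # embedding dimension (assumption)

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    LSTM(64),                          # a single LSTM layer; Table 3 explores 1 to 3 layers
    Dense(4, activation="sigmoid"),    # one sigmoid output per label (multilabel setting)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.2, epochs=200, batch_size=59)
```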
Finally, to obtain the best LSTM model, we intensified the search with the three selected architectures, varying the learning rate and batch size parameters. For this, we used the HyperOpt library with 100 iterations and the Tree Parzen Estimator (tpe.suggest) algorithm, which models two density functions instead of the probability of an observation in order to estimate the expected improvement of a new configuration. Furthermore, before the texts enter the neural network, a preprocessing was applied as detailed below (a minimal code sketch of these steps is given further below):

a. Inflected word endings were removed from the French texts and the base forms of the words were returned (lemmatization), using the "spacy" library. Unlike a model pre-trained on a large vocabulary, this LSTM model relies only on the texts of the weather bulletins, hence the need to bring the words to their base form in order to better identify the risk of a possible event. For example, the words "inondations" and "inondables" share the same root and refer to "floods."

b. Using the "nltk" library, French stopwords were eliminated, that is, words such as articles, pronouns, prepositions, and auxiliary verbs, among others, which do not have a big impact on our predictions.

c. From the Keras library, we used the Tokenizer function to create a dictionary of words based on their frequency. Then, we reduced the dictionary to the 1000 most repeated words and, for each text, we generated vectors of integer values that are the indexes in the dictionary. Finally, they were padded to the same length and fed into the neural network described previously.

Convolutional neural networks (CNN) are a class of deep neural networks, most often applied to visual image analysis [26]. More recently, however, CNN have also found their place in solving problems related to NLP tasks. A CNN is generally built around the following layers:

- Convolutional layers, composed of neurons whose purpose is to detect patterns (feature maps) from their inputs.
- Pooling layers, whose purpose is to reduce the feature map dimensionality in order to be more computationally efficient. These layers are often chained after convolutional layers.
- These two previous sets of layers are generally followed by one or more fully connected layers.

As previously stated, CNN have been a game changer in the field of image analysis. They have also shown some interesting results in NLP [18, 27]. Indeed, texts can be represented as an array of vectors, just like images can be represented by an array of pixel values. Here, we deal with one-dimensional convolutions, but the principles remain the same: we still want to find patterns in the sequence, which become more complex with each added convolutional layer. In order to associate each word with a specific vector, different techniques can be used. We are talking here about word embedding techniques. The most effective ones are those that are context-sensitive. We can cite here GloVe [28] and Word2Vec [29] and, as we will see in the next section, BERT is another word embedding technique. Word embeddings are able, by reducing the dimension, to capture the context and the semantic and syntactic similarity (gender, synonyms, etc.) of a word. For example, one would expect the words "remarkable" and "admirable" to be represented by relatively closely spaced vectors in the vector space where these vectors are defined. It has been shown that contextual word embeddings can effectively capture the semantic and arithmetic properties of a word.
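Returning to the preprocessing steps (a)-(c) above, the following minimal sketch illustrates how they can be implemented; the spaCy model name, example text, and sequence length are assumptions rather than the exact settings used.

```python
# Minimal sketch of preprocessing steps (a)-(c): lemmatization, stopword
# removal, and tokenization/padding (settings are illustrative).
import spacy
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nlp = spacy.load("fr_core_news_sm")                 # French pipeline used for lemmatization
french_stopwords = set(stopwords.words("french"))   # requires nltk.download("stopwords")

def preprocess(text):
    doc = nlp(text.lower())
    # (a) keep lemmas, (b) drop French stopwords and non-alphabetic tokens
    return " ".join(tok.lemma_ for tok in doc
                    if tok.is_alpha and tok.lemma_ not in french_stopwords)

raw_texts = ["Des inondations sont possibles sur les secteurs exposés."]  # bulletin section texts (example)
clean_texts = [preprocess(t) for t in raw_texts]

# (c) index the 1000 most frequent words and pad all sequences to one length
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(clean_texts)
X = pad_sequences(tokenizer.texts_to_sequences(clean_texts), maxlen=300)  # maxlen is an assumption
```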
Word embedding also reduces the size of the problem and therefore of the learning task. In the case of CNN, the obtained input vectors allow the model to have a much better representation of the words during the learning phase. Given the sequential nature of texts, RNN and LSTM are more common models in NLP. However, they can be long and difficult to train. In this way, for large datasets, CNN can be an interesting alternative. Similar to the architecture selection procedure performed for the LSTM, Table 4 shows the specifications of the three CNN architectures selected to subsequently perform an exhaustive search, varying the learning rate and batch size with the HyperOpt library. In addition, the text preprocessing performed before entering the CNN is the same as that applied for the LSTM in Sect. 4.1.

The sequential nature of RNNs was regularly pointed out as a hindrance to the training of these models on long texts, both for computation time reasons (even modern GPUs do not parallelize this type of process well) and because of gradient vanishing problems (despite the use of LSTM models). In order to no longer process texts sequentially, transformer models provide a solution by processing the whole sequence at once and by using the attention mechanism, which allows different types of relationships between tokens to be captured. A transformer is built from an encoder and a decoder, each of them consisting of a stack of attention and dense layers. Since 2017 and the first transformer model, many models have been developed, such as ELMo, GPT-2 and GPT-3, BERT, XLNet, RoBERTa, and Turing-NLG. The main advantages of the transformer model are the following:

- The distance between two tokens is no longer a parameter taken into account by the model (the model can take long-term dependencies into account).
- The attention matrix calculation allows the encoding and then the decoding of the sequences to be parallelized, thus accelerating the calculations.
- No labeled data are required to pre-train these models, and it is then possible to train a transformer-based model by providing a huge amount of unlabeled text data.

From the last point it follows that it is possible to do transfer learning with such a trained model in order to perform other NLP tasks such as text classification, named entity recognition, text generation, etc. This is how, in 2018, Google presented BERT (Bidirectional Encoder Representations from Transformers) [30]. It is a transformer composed of a stack of encoders only (N = 12 or 24 depending on the version: base with 110 million parameters or large with 340 million parameters). BERT was originally pretrained using two tasks. It hides some of the words (15%, although this is actually more complex) and learns how to find them. This allows it to acquire a general and bidirectional knowledge of the language. BERT also learns to recognize whether two sentences are consecutive or not. The corpus used for this pre-training was the BooksCorpus with 800M words and a version of the English Wikipedia with 2,500M words. When it came out, BERT was able to outperform state-of-the-art models on a large set of NLP benchmarks such as GLUE (General Language Understanding Evaluation) or SQuAD (Stanford Question Answering Dataset). In the following, we use transfer learning (with our vigilance data) on two pre-trained models based on the BERT architecture and a French corpus.
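As an indication of how such transfer learning can be set up, the following is a minimal sketch using the HuggingFace transformers library in PyTorch; the example text, labels, and absence of a full training loop are simplifications, not the configuration reported in Table 17.

```python
# Minimal sketch: adapting a French BERT-based model to the 4-label task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base",
    num_labels=4,                                 # the four intervention categories
    problem_type="multi_label_classification",    # sigmoid outputs + BCE loss
)

texts = ["Des coupures d'électricité et de téléphone sont possibles."]  # a bulletin-like sentence
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([[1.0, 1.0, 0.0, 1.0]])     # one binary label per category
out = model(**enc, labels=labels)
out.loss.backward()                               # a single training step (optimizer omitted)
```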
CamemBERT [31] is based on RoBERTa, which improves the original implementation of BERT by using dynamic masking and training with larger batches and for longer. However, the main differences between RoBERTa and CamemBERT are that the latter uses whole-word masking and SentencePiece tokenization instead of WordPiece. CamemBERT has been pre-trained on the French subcorpus (138 GB) of a huge multilingual corpus (the Common Crawl OSCAR corpus), and it was shown that crawled data with high variability are preferable to Wikipedia-based data. In the present article, we run the base implementation of CamemBERT from the HuggingFace transformers library in PyTorch, which uses the original architecture of BERT base (12 layers, 768 hidden dimensions, 12 attention heads, and 110M parameters).

FlauBERT [32] is also a RoBERTa-based model for French. It has been pre-trained on less but more curated data (71 GB). Its text corpus consists of 24 sub-corpora collected from different sources with diverse subjects and writing styles. Its performance is very close to that of CamemBERT. In fact, its original paper demonstrates that the performance of a parsing model can be improved with an ensemble of FlauBERT and CamemBERT. In the present article, we run the base implementation of FlauBERT from the HuggingFace transformers library in PyTorch, which also uses the original architecture of BERT base.

First, we present the scores of the univariate time series models for the prediction of the number of interventions. The models perform constant predictions equal to the mean, predictions corresponding to the persistence model (we predict the last known values), and predictions corresponding to the mean per hour (see Fig. 4). Next, we develop multivariate time series models, based on one of the most powerful machine learning tools available today: XGBoost [33]. The data used are composed of calendar features (day in the week, month, year, etc.) and quantitative meteorological variables corresponding to the three meteorological stations closest to the Doubs, which makes it possible to measure the extent to which predictions of firefighting interventions can be made by taking only numerical data from meteorological information, and only vigilance levels from bulletins as indicators. Then, we reframe the problem as a multilabel classification task in order to analyze the impact of the tabular data previously described, but with a new perspective that allows us to recognize the periods with intervention peaks rather than the number of interventions. For this purpose, we compared the XGBoost classifier with that of a technique also recognized for its speed and robustness, called random forest [34]. Finally, we replaced the tabular data with the meteorological texts of the bulletins and applied NLP techniques to study how valuable these data can be.

Scores corresponding to constant predictions are given in this subsection. The results obtained are shown in Table 5 for the Emergency Person Rescue, Table 6 for the Total Person Rescue, Table 7 for the Heating-related interventions, and Table 8 for the Storm/Flood ones. In these tables, MAE stands for the mean absolute error and RMSE for the root mean squared error, the most usual metrics for regression. As can be seen, the average per hour does better than the average alone for personal assistance, which is understandable given the daily seasonality of human activity. As shown in Fig.
4, there are fewer interventions at night than during the day because people are simply sleeping; similarly, because people eat at noon, there is a plateau at that time. This improvement is not found for interventions such as heating or storm, as these are only minimally related to human activity: a storm can cause damage both day and night. One might a priori be astonished that the simplistic persistence model does better than the average per hour, which seems more sophisticated. However, this is well explained by the scarcity of interventions of the heating or storm type: if most of the time there are 0 interventions per hour, then the probability that the slots at time t and time t+1 both have no intervention is high. Thus, replicating what happened at time t as a prediction of what will happen at time t+1 is a winning strategy when events are rare. The good performance of this model in the case of rescue-type interventions is again explained by the strong daily seasonality of these interventions: between one hour and the next, the number of interventions remains close.

The first features to be taken into consideration are obviously the calendar data, which fully condition human activity. The time of day captures the daily seasonality, with a drop in activity at night, a maximum of activity in the late morning and in the afternoon, and a plateau at noon. For various reasons, the day in the week also has its importance: weekend, a day without school for children, etc. In the same way, the day in the year makes it possible to recover particular periods such as the summer, the winter vacations, or even particular days (national holiday, Christmas, New Year's Day, etc.). These particularities are also contained, but in a less continuous way, in the month in the year. Finally, the year is important because it allows us to model the general upward trend, for the various reasons mentioned above (aging population, disengagement of the private sector, etc.).

For accidents related to heating (chimney fires, etc.), storms and floods, or even emergency rescue, it is reasonable to think that meteorology is important and that predictions should be improved by adding variables describing it. For example, between two national holidays, if the first one is rainy while the second one takes place under a radiant sun during a heatwave, this will probably have an impact on firefighters' call-outs. The purpose of this section is to see whether this meteorological influence can be recovered without using NLP. We will therefore compare the quality of the predictions for the four types of intervention, with or without the temporal data (calendar), with or without the color of the weather alert (vigilance), and with or without the quantitative data from the Météo-France site (weather) such as temperature, pressure, and wind. To do so, we randomly separate our dataset between learning (80%) and test (20%), and we look at the MAE and RMSE scores on the test set once the learning is completed. This learning is done with XGBoost (Poisson regression, max depth of 6), with 20% of the learning set used for validation and an early stopping criterion of ten steps.

The first lesson to be learned from Table 9 is that, in the case of emergency person rescue, calendar data are obviously what appears to be most important, but that predictions can be improved by adding the color of the weather alert bulletin.
These bulletins are basically too coarse a dataset to be useful in predicting these types of interventions on their own. As such, quantitative meteorological data do a little better but are far from what is obtained with calendar data. However, the best result is obtained by mixing calendar data with vigilance bulletins, whereas the addition of quantitative weather variables systematically lowers the quality of the predictions. The information that they carry on their own is already contained in the calendar-plus-vigilance pair, as if cleaned of the noise that the quantitative variables carry: variables such as gusts over a period or mean wind speed (10 min.) have too local a scope, both spatially and temporally, whereas the bulletins have a more general scope (the information is digested). Note again that, among the baselines, none does better than the persistence model. An autoregressive component is fundamental to this prediction problem, and it would improve the scores of Table 9 in an obvious way: among the interventions between t and t+1, there are the new interventions that appear after t, and those that appeared previously but are still ongoing (the system has strong inertia).

The same lessons can be learned from the total interventions, in a more pronounced way, see Table 10. This time, the best result is achieved by calendar data alone, and the addition of weather variables only pushes XGBoost to overfit. These total interventions include, in addition to the emergency rescues, accidents on the public highway and non-emergency rescues: a person trapped in an elevator or locked out on a balcony, a wasp's nest, etc. These interventions are, by nature, much more difficult to predict. And the impact of temperature or atmospheric pressure is certainly much smaller on this type of intervention than, for example, knowing whether it is the middle of the day or the middle of the night. One can then naturally wonder if adding textual information on the weather could improve such results.

In the case of interventions related to heating, this time the best score is obtained by linking the calendar data to the vigilance bulletins, as shown in Table 11. This is understandable, given the nature of the interventions (chimney fire, electrical heating appliance fire, etc.). Here again, quantitative meteorological information does not provide the same benefit as the color of the vigilance bulletin, for the same reasons as previously mentioned. Conversely, for events such as storms and floods, if the calendar alone produces the best results, it is closely followed by the combination of calendar and meteorological data (cf. Table 12). Here, in a counterintuitive way, the vigilance bulletins greatly reduce performance. All this can be explained by noting first of all that floods do not occur at any time of the year in the Doubs, but mainly in winter. They are very localized in time, while the vigilance bulletins usually extend over a fairly long period. Finally, such events follow heavy rainfall, a fact that is captured in the weather variables.

To conclude, the case of heating-type interventions shows that by adding information about the weather, predictions can be improved. The case of storms and floods, on the other hand, shows that quantitative information (temperature, pressure, etc.) and qualitative information (risk of storm, flooding, etc.) do not provide enough information on the weather: the latter has an obvious impact on the interventions, but these variables do not improve the predictions.
The Emergency and Total Person Rescue cases confirm that these variables describe the weather too crudely, which explains why we are interested in the textual content of the vigilance bulletins.

In this section, the problem is restated as a multilabel classification, where the labels are the four categories Emergency Person Rescue, Total Person Rescue, Heating, and Storm/Flood, and the possible classes are 0 and 1, calculated as described in Sect. 3.2. This is in order to identify the peak periods of intervention for each category. The quantitative and qualitative variables described in the preceding section are the inputs for the models developed in this section, where each sample of the dataset represents one hour; an illustration is shown in Fig. 6. In addition, the models created result from various combinations of the variables, in order to observe their influence under this classification approach. These models will be the baselines to be overcome by the NLP techniques in the following section.

Before applying NLP techniques, we also sought to develop and compare models with simpler techniques, which do not require a greater consumption of resources, such as those based on decision trees, but which have demonstrated their remarkable effectiveness in the literature with tabular data. We chose XGBoost as a representative of boosting algorithms, which seek to reduce model bias, and random forest as a representative of bagging algorithms, which seek to reduce variance. Since decision trees are robust when working with categorical and continuous values, we kept the original values of our variables. Furthermore, as in the previous section, we randomly split the data into 80% for training and 20% for testing. Since XGBoost and random forest do not natively support multitarget classification, we used MultiOutputClassifier, from the Scikit-Learn library, to fit one classifier per target. To select the best model, we used Bayesian optimization via the HyperOpt library. In total, 100 iterations were performed for each combination of variable types and each technique. To guide the search for the best model, we set as loss function the Micro F1-score metric, hereafter called F1-score for short. Generally, this metric is used to assess the quality of multilabel binary problems, where a score closer to 1 means better Micro-Precision and Micro-Recall (we will abbreviate both to Precision and Recall, respectively) and closer to 0 means poor model performance. Other metrics such as Accuracy and Balanced Accuracy are also presented to analyze our resulting models.

Table 13 shows the results of the best models obtained for each data input combination, and Table 16 describes their hyperparameters and the search space used. As we can see in Table 13, calendar data together with vigilance alert levels improve the performance of the models. What is more, the best model, obtained by random forest, used only calendar data, reaching an F1-score of 0.81. Also, we note that when the inputs are weather, vigilance indicators, and weather plus vigilance indicators, the models show an F1-score below 0.64, a poor performance, when in fact these variables should contribute more to the recognition of intervention peaks. Therefore, these are the baseline results that we need to outperform to be efficient: an NLP tool must have an F1-score greater than those values.
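To make this baseline setup concrete, the following is a minimal sketch of the multilabel decision-tree baselines with Scikit-Learn and XGBoost; the hyperparameter values and the synthetic data are illustrative only (the paper tunes the hyperparameters with HyperOpt on the real tabular features).

```python
# Minimal sketch of the tabular multilabel baselines (illustrative settings).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 10))           # placeholder for calendar / vigilance / weather features
Y = rng.integers(0, 2, (500, 4))    # placeholder for the 4 binary labels per hourly sample

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

for base in (XGBClassifier(max_depth=6, n_estimators=200),
             RandomForestClassifier(n_estimators=300)):
    clf = MultiOutputClassifier(base).fit(X_train, Y_train)   # one classifier per label
    pred = clf.predict(X_test)
    print(type(base).__name__, "micro F1:", f1_score(Y_test, pred, average="micro"))
```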
Thus, we have to see whether information derived from the text of the bulletins allows better prediction of interventions than simple calendar data, vigilance levels, or quantitative information such as temperature or wind speed. For this purpose, we are interested in comparing different NLP models. In this section, we seek to discover whether we obtain better learning for the four outputs when we process the texts of the bulletins. For this task, the dataset used considers the bulletins of the Doubs and of its three neighboring departments. They were structured as shown in Table 2, and the task remains a multilabel classification for the recognition of the intervention peak periods. The initial dataset was split into 80% for the learning phase and 20% for the testing phase. Experimentation was then performed by varying different hyperparameters for the different models. The results presented in Table 14 correspond to the best results obtained for each technique applied. We calculated the same metrics presented in Sect. 5.3 to identify possible improvements.

On the one hand, as described in Sects. 4.1 and 4.2, for LSTM and CNN we applied a text preprocessing and used the HyperOpt library to search for the best configuration for our models, where the number of iterations was 100 and the guiding metric was the F1-score. The best setting for LSTM was architecture no. 1, with a learning rate of 6e-4, 200 epochs, and a batch size of 59. The best configuration for CNN was architecture no. 3, with a learning rate of 9e-3, 105 epochs, and a batch size of 95. On the other hand, the transformers were used in their base versions, CamemBERT (110 million parameters) and FlauBERT (138 million parameters), with the hyperparameters reported in Table 17.

From Table 14, we see that the LSTM and CNN results are quite similar. However, the two French transformer models, CamemBERT and FlauBERT, outperform both traditional techniques. This is mainly because the LSTM and CNN models were trained only with the vocabulary of the bulletins, while the transformers were pre-trained with an extensive vocabulary of the French language (Common Crawl OSCAR corpus). Thus, this set of experiments confirms the recent literature results [35] on text classification problems and the superiority of transformer models. Note that Accuracy and Balanced Accuracy are quite different due to the imbalance of the dataset, as mentioned in Sect. 3.2. Nevertheless, when looking at the F1-score, all the models in this section remarkably outperform the results obtained in the previous section with decision trees (Table 13). What is more, the best model obtained with CamemBERT outperforms by far the best model obtained with random forest, by 8%, 14%, 1%, 5%, and 9% when comparing F1-score, Accuracy, Balanced Accuracy, Precision, and Recall, respectively. When we examine accuracy by category in Table 15, we confirm that feature extraction from texts improves the recognition of intervention peaks due to rare events. The best NLP model reached accuracies of 80%, 86%, 92%, and 86% for Emergency Rescue People, Total Rescue People, Heating, and Storm/Flood, respectively.
Moreover, the last two categories, Heating and Storm/Flood, which represent interventions generated by rare events and which were complicated to recognize with the approaches analyzed in the previous sections, are the ones that demonstrated high accuracy with the NLP techniques and bulletin texts, without degrading the accuracy of the two rescue-type categories.

As mentioned in Sect. 2, the models developed for predicting the interventions of fire departments or EMS generally use tabular data such as temporal information, quantitative meteorological variables, and traffic indicators [3, 5, 6, 11]. These are very useful for recognizing incidents related to human activity, since it is possible to identify seasonality and trend over time. For example, people are more active during the day than at night, there are more drownings in pools during the summer than in the winter, as the population increases the incidents also increase, etc. This is also because the major operational burden of these organizations comes from rescue-type interventions. However, some interventions are difficult to detect, since they occur only a few times a year. These are produced by rare events such as natural phenomena (storms, floods, forest fires, etc.), and although their occurrence is minimal over the years, the workload they produce in a short period can be, in some cases, 28 times higher than normal. For example, in 2016, the average number of interventions assisted by SDIS 25 per hour was 3.34. However, there was an hour in which 84 interventions occurred due to a storm. The storm caused flooding in the region, human and material damage, and breakdowns in the service of SDIS 25 [3]. For this reason, we need an intelligent system that could help to predict the peak periods of intervention generated by rare events. Thus, the present work developed models based on NLP techniques and meteorological bulletins from public sources to recognize periods with a heavy operational load.

The results obtained are significant for practical purposes. Initially, the model could be deployed in production as a small stand-alone application. Alternatively, the predictions (the binary indicators) could be included in a larger set of tabular data that would be the input for a predictive model of the number of interventions for a certain time and location. In this way, with an initial application or with a more robust system, the fire department could reorganize its personnel and equipment to cope with these periods of high demand, reduce breakdowns in service due to lack of resources, and save more lives.

The present paper demonstrates the effectiveness of NLP techniques for the recognition of rare events that will cause an above-average increase in firefighter interventions during certain periods. This is done by processing the texts contained in the weather bulletins using the traditional techniques LSTM and CNN, and the transformers CamemBERT and FlauBERT. The results of the NLP models and bulletin texts exceed those of the baselines with decision trees (XGBoost and random forest) and tabular data by 8% and 14% when comparing the best F1-score and Accuracy, respectively. The advantage of using these texts is also reflected when assessing the accuracy of the two categories with interventions related to rare events, achieving 92% for Heating and 86% for Storm/Flood with the best CamemBERT model developed.
In this way, fire departments and EMS, in general, would be able to identify peak periods of interventions and optimize their response by establishing better strategies to prepare their equipment for natural disasters (storms, floods, etc.) and keep the population better protected and safe. As future work, we propose to add meteorological bulletins from other departments of the country, which would allow us to better track a possible extreme event and its consequences on the firefighters' workload. Furthermore, we will develop and compare other modeling and text preprocessing techniques with the texts in French and English. Finally, we aim to integrate the results of this classification approach into a larger regression model that predicts the number of interventions for a certain period (hourly, daily, and monthly) and by locality (principal cities and mountain cities).

See Tables 16 and 17.

Conflict of interest: The authors certify that there are no conflicts of interest or competing interests in this work.

References

[1] Predicting fire brigades operational breakdowns: a real case study
[2] Forecasting the number of firefighter interventions per region with local-differential-privacy-based data
[3] A comparison of LSTM and XGBoost for predicting firemen interventions
[4] Predicting the category of fire department operations
[5] Predicting fires for policy making: improving accuracy of fire brigade allocation in the Brazilian Amazon
[6] A comprehensive solution to road traffic accident detection and ambulance management
[7] Demand forecast using data analytics for the preallocation of ambulances
[8] Optimization and simulation of an ambulance location problem
[9] Integrating the ambulance dispatching and relocation problems to maximize system's preparedness
[10] Spatio-temporal seismic data analysis for predicting earthquake: Bangladesh perspective
[11] Analyzing temporal-spatial evolution of rare events by using social media data
[12] Novel use of natural language processing (NLP) to predict suicidal ideation and psychiatric symptoms in a text-based mental health intervention in Madrid
[13] Sentiment analysis with NLP on Twitter data
[14] Scoring tourist attractions based on sentiment lexicon
[15] Hands-on Python natural language processing: explore tools and techniques to analyze and process text with a view to building real-world NLP applications
[17] EMTerms 1.0: a terminological resource for crisis tweets
[18] A three-level classification of French tweets in ecological crises
[19] Twitter speaks: a case of national disaster situational awareness
[20] Twitter in mass emergency: what NLP techniques can contribute
[21] Earthquake management: a decision support system based on natural language processing
[22] An intelligent system on knowledge generation and communication about flooding
[23] Météo-France public data
[24] Vigilance cards and bulletins archive
[25] Long short-term memory
[26] Deep learning
[27] Rapid classification of crisis-related data on social networks using convolutional neural networks
[28] GloVe: global vectors for word representation
[29] Efficient estimation of word representations in vector space
[30] BERT: pre-training of deep bidirectional transformers for language understanding
[31] CamemBERT: a tasty French language model
[32] FlauBERT: unsupervised language model pre-training for French
[33] XGBoost
[34] Random forests
[35] Deep learning based text classification: a comprehensive review