key: cord-0675469-7znek63o
authors: Sharma, Mandar; Brownstein, John S.; Ramakrishnan, Naren
title: TCube: Domain-Agnostic Neural Time-series Narration
date: 2021-10-11
journal: nan
DOI: nan
sha: 6b8ef926e486079c0eed877dfc863cbc896f8914
doc_id: 675469
cord_uid: 7znek63o

The task of generating rich and fluent narratives that aptly describe the characteristics, trends, and anomalies of time-series data is invaluable to the sciences (geology, meteorology, epidemiology) and to finance (trades, stocks, or sales and inventory). Efforts in time-series narration have hitherto been domain-specific and have used predefined templates that offer consistency but lead to mechanical narratives. We present TCube (Time-series-to-text), a domain-agnostic neural framework for time-series narration that couples the representation of essential time-series elements in the form of a dense knowledge graph with the translation of said knowledge graph into rich and fluent narratives through the transfer-learning capabilities of PLMs (Pre-trained Language Models). TCube's design primarily addresses the challenge of building a neural framework in the complete absence of annotated training data for time-series. The design incorporates knowledge graphs as an intermediary for the representation of essential time-series elements, which can be linearized for textual translation. To the best of our knowledge, TCube is the first investigation of the use of neural strategies for time-series narration. Through extensive evaluations, we show that TCube can improve the lexical diversity of the generated narratives by up to 65.38% while still maintaining grammatical integrity. The practicality and deployability of TCube are further validated through an expert review (n=21) where 76.2% of participating experts wary of auto-generated narratives favored TCube as a deployable system for time-series narration due to its richer narratives. Our code-base, models, and datasets, with detailed instructions for reproducibility, are publicly hosted at https://github.com/Mandar-Sharma/TCube.

Real-world data is often temporal in nature. From the global outbreaks of infectious diseases to the prices of stocks, all chronologically recorded data takes the form of a time-series. Thus, its mining and analysis have been of significant interest to the scientific community [1]. Time-series narration aims to portray the discerning characteristics of a time-series obtained from such analysis through a textual narrative. The efficacy of narratives as an aid to data comprehension has been validated through studies in digital libraries [2] as well as causal networks [3]. Petre, advocating for the importance of textual representations of data [4], humorously notes, "A picture is worth a thousand words - isn't it? And hence graphical representation is by its nature universally superior to text - isn't it? Why then isn't the anecdote itself expressed graphically?".

Time-series narration falls under the umbrella of data-to-text, a sub-field of NLG (Natural Language Generation) that aims to produce meaningful and coherent textual descriptions of non-linguistic data [5], [6]. Although data-to-text has garnered significant interest over the years, recent efforts for textual description of data have focused on either tabular data [7]-[9] or graph data [10], [11]. The attention that these data types have garnered simultaneously highlights the two key challenges for time-series data. The first lies in the design and training of such a system given the paucity of "gold" datasets, and the second in its evaluation standards.

• End-to-end models for data-to-text generation learn a direct input-output mapping from data to text [12], [13] through the use of annotated datasets such as WikiBio [12] and E2E [14] for tabular data and WebNLG and DART [15], [16] for RDF (Resource Description Framework) triples [17]. In both tabular data and RDF triples, the information to be presented in the narrative is present in the data itself and is copied to the output tokens, making end-to-end learning possible. In contrast, time-series requires further processing for the discovery of underlying patterns to be narrated. Due to the inherent numerical and continuous nature of time-series, one needs to consider a time-series as a whole rather than as a sum of its individual constituents. Thus, one would have to either follow the traditional modular pipeline architecture [5], where non-linguistic data is transformed into text through several intermediate steps, or formulate a novel approach suited to time-series data altogether.
• The "gold" narratives in the aforementioned datasets offer a common ground for automated evaluation of competing frameworks on the basis of word-based metrics such as BLEU [18] and its variants [19]-[21]. Thus, there are domain-familiar metrics available to showcase how one framework can perform better than another. For time-series data, without human annotations corresponding to the data, automated evaluation through said word-based metrics is not possible.

As will be discussed in the related works section, there have been several previous efforts for time-series narration. Although these pioneering efforts have laid significant groundwork for this field, the recent work in time-series narration falls short in two crucial areas. First, the systems are domain-specific, modeled specifically for use in fields such as meteorology, intensive care, and health monitoring. Second, the proposed systems have not actualized the recent advances in language processing, relying instead on the traditional pipeline architecture. Graefe et al. [22] note that "news consumers get more pleasure out of reading human-written as opposed to computer-written content". Thus, these template-based narratives can be met with a dismissive response by their users due to their seemingly mechanical nature; we elaborate on this further in our expert review section.

To address these challenges, we present T³: Time-series-To-Text, which stands out from previous forays in this task through a) its domain-agnostic nature and b) its coupling of a dense knowledge-graph-based representation of essential time-series elements with the translation of said knowledge graph into rich and fluent narratives through the transfer-learning capabilities of large PLMs (Pre-trained Language Models) fine-tuned to this specific task, tackling the paucity of annotated data. Figure 1 highlights the diversity in the narratives generated by T³ along with the automatic extrapolations and abbreviations deduced by the language models. The terms 'United Kingdom', 'United States', and 'Carbon Monoxide' are automatically abbreviated to 'UK', 'US', and 'CO' respectively. Similarly, the system extrapolates information such as adding 'as a measurement of air quality' when mentioning carbon monoxide values, adding 'the state of' to Kansas, and introducing the term 'trade volume' when describing export values. Our contributions are summarized as follows:

• To the best of our knowledge, T³ is the first foray into neural time-series narration. Our rigorous evaluations across multi-domain datasets showcase that T³ produces up to 65.38% more diverse narratives with grammatical integrity comparable to the existing baselines.
• Through an expert review (n = 21), we validate the performance, practicality, and linguistic superiority of T³. 76.2% of participating experts who were wary of auto-generated narratives favored T³ as a deployable system as compared to existing baselines.
• We benchmark the performance of several time-series segmentation and regime-shift detection algorithms as well as prominent PLMs for outlining the best approach to a domain-agnostic time-series narration framework.
• Our code-base, pre-trained models, and the datasets used, along with a detailed notebook guide for reproducibility, are made public (https://github.com/Mandar-Sharma/TCube).

II. NARRATIVES: GOOD, BAD, AND BORING

Textual narratives are swiftly becoming important components of visualization systems, either as a way to generate data insights to accompany visualizations [23] or to structure visualizations for better communication [24]. Research into what makes an effective narrative is still in its infancy and is necessarily tied to the underlying analytical task and domain. For temporal data, we identify the following crucial facets:

Level of detail: Should the narrative capture an executive summary or provide in-depth access to the underlying data?

Language diversity: Greater diversity in language prevents monotony but could detract from conveying key messages and conclusions. Lower diversity, on the other hand, supports comparison of different narratives but leads to "glossing over" by analysts, defeating the very purpose of these narratives.

Verbalizing numbers: The verbalization of quantitative or probabilistic data (using Kent's words of estimative probability [25] or the NIC/Mercyhurst standardization) and trends is considered important in specific domains (such as intelligence analysis [26]); however, other applications argue for direct access to the original numeric information.

Human performance aspects: Understanding the characteristics of narratives that lead to improved human performance is an ongoing research problem [27]. Narratives provide increased comprehension, interest, and engagement and are known to contribute "distinct cognitive pathways of comprehension" with increased recall, ease of comprehension, and shorter reading times [28]. Conversely, the challenge of the written word implies slowness and error-prone behavior due to short-term memory limits.

In essence, successful narrative research requires a standardization of both the generation and evaluation space, and an understanding of how a narrative fits into the larger comprehension process of the analyst. As an example, a "bad" narrative for a fictional monthly sales-volume dataset, in the form of "The sales numbers for January 2019 were 1500 while the sales numbers for February 2019 were 2000. Similarly, the sales numbers for ...", fails to meet all of the above criteria: it is lexically repetitive, portrays no information about the data that would have been difficult to discern visually, and presents the numbers as-is with no verbalization.

While some of the earliest work on time-series narration can be traced back to 1994 with the Forecast Generator (FOG) [29] , a framework for generating bilingual (English/French) textual summaries of weather forecasts, in the recent decades, Ehud Reiter's research group has laid significant groundwork for this domain. Their SUMTIME-MOUSAM project [30] generates short textual summaries of weather forecasts and SUMTIME-TURBINE [31] generates the same for sensor readings from a gas turbine. The design of these SUMTIME systems highlights the importance of domain expertise in relaying the information embedded in a raw time-series in a manner relevant to the end user. Following this, their SUMTIME project was extended to SUMTIME-NEONATE [32] , which generates textual summaries of time-series data intended to aid medical professionals in monitoring infants in neonatal intensive care units. In 2003 [33] , the authors highlight the use of Gricean maxims of cooperative communication [34] for the selection of the most crucial information to be relayed to the end user. The authors further investigate the impact of word choice in textual summarization by avoiding words specific to one idiolect and words whose meanings varied in different idiolects [35] .

Kacprzyk et al. [36] propose the use of Zadeh's calculus of linguistically quantified propositions with varying t-norms to summarize time-series segmented with Piece-wise Linear Approximations. Castillo-Ortega et al. [37] propose linguistic summarization of time-series based on the hierarchical structure of time, where the multiple candidate summaries are evaluated with a multi-objective evolutionary algorithm. In the physiological domain, Banaee et al. [38] propose a system to summarize the data streams from health monitoring systems in a clinician- and patient-centric manner. Dubey et al. [39] propose the use of Case-based Reasoning from records of previous summaries to summarize weather reports.

Thus, there has been significant investigation into this domain. However, the research emphasis has heavily been on the identification of the information to relay to the end user rather than on relaying the information in a manner engaging to the end user, that is, having the narratives themselves be rich and fluent. The textual output of the above-mentioned systems follows the traditional modular pipeline architecture of Reiter and Dale [5]. Commercial services such as The Automatic Statistician and Narrative Science offer data summarization through visualization and narratives. Although their technology and code are proprietary, a perusal of the offered samples for time-series summarization hints towards templated generation where variables from analysis are plugged into preset templates.

In this section, we outline some necessary background in time-series segmentation, detecting shifting regimes, and PLMs, as a foundation for T³'s architecture.

Given a time-series $T$ of length $n$, a segmentation of $T$ contains a set of distinct temporal cut-points $S = \{c_1, c_2, \ldots, c_k\}$ corresponding to $k$ straight lines, where $k \ll n$ [40]. The segmentation approach can be limited by the number of segments $k$ produced, or by a predefined threshold for segment-wise or cumulative error. As time-series of varying types and lengths need to be approximated with varying numbers of segments, we evaluate the following candidate segmentation algorithms based on a preset error threshold to promote domain-agnosticism.

Sliding Windows: The data points from a time-series are added to a sliding window until the maximum approximation error is met and a segment is formed. This process repeats with the window starting from the next data point.

Bottom-Up: The algorithm starts with the finest approximation such that a time-series of length $n$ is approximated by $n/2$ segments. The algorithm iteratively merges the lowest-cost adjacent segments until the stopping criterion is met.

SWAB: An acronym for the integration of Sliding Windows and Bottom-Up, SWAB [41] first defines an initial buffer $w$ on which Bottom-Up is performed. The first segment from $w$ is reported and the corresponding data points are removed from it. Remaining points from the series are read into $w$ until the linear fit on it reaches an error threshold. This process is repeated until the buffer $w$ reaches the end of the time-series.
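As an illustration of the Bottom-Up strategy, the following is a minimal sketch rather than the implementation used in this work. It assumes NumPy, uses the sum of squared residuals of a least-squares line as the merge cost, and lets segments be disjoint index ranges; the function names `fit_error` and `bottom_up` and the `max_error` parameter are hypothetical.

```python
import numpy as np

def fit_error(y, start, end):
    """Sum of squared residuals of a least-squares line fitted to y[start:end]."""
    seg = y[start:end]
    if len(seg) < 3:
        return 0.0  # a line passes exactly through one or two points
    x = np.arange(start, end)
    slope, intercept = np.polyfit(x, seg, 1)
    return float(np.sum((seg - (slope * x + intercept)) ** 2))

def bottom_up(y, max_error):
    """Return segment boundaries [(start, end), ...] with `end` exclusive."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Finest approximation: roughly n/2 segments of two points each.
    bounds = [(i, min(i + 2, n)) for i in range(0, n, 2)]
    while len(bounds) > 1:
        # Cost of merging every pair of adjacent segments.
        costs = [fit_error(y, bounds[i][0], bounds[i + 1][1])
                 for i in range(len(bounds) - 1)]
        best = int(np.argmin(costs))
        if costs[best] > max_error:
            break  # stopping criterion: the cheapest merge is too costly
        bounds[best:best + 2] = [(bounds[best][0], bounds[best + 1][1])]
    return bounds

# A series that rises and then falls is reduced to two segments.
series = list(range(10)) + list(range(10, 0, -1))
print(bottom_up(series, max_error=1e-6))
```

Sliding Windows and SWAB can reuse the same `fit_error` cost; they differ only in how points are fed to the fitting step.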

Regime shift or switching refers to changes in the state or structure of a time-series. For domain-agnosticism, we require the shift-detection algorithms to be unsupervised, universal approximators, and input-length invariant. Based on these criteria, we evaluate the following candidates:

Representation Learning: Franceschi et al.'s [42] unsupervised representation learning algorithm, hereafter denoted as "RL", learns representations of time-series elements using an encoder architecture based on causal dilated convolutions with a triplet loss arrangement that employs time-based negative sampling.

Matrix Profile: The Matrix Profile [43], [44] is a multipurpose annotation (profile) of a time-series $T$ where the $i$-th location on the profile records the distance of the sub-sequence in $T$ at the $i$-th location to its nearest neighbor.
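To make the Matrix Profile definition concrete, here is a naive quadratic-time sketch in NumPy. It is not the optimized algorithm of [43] nor the API of [44]; the z-normalized Euclidean distance, the exclusion zone of half a window around each position (to skip trivial self-matches), and the names `znorm` and `naive_matrix_profile` are assumptions made for illustration.

```python
import numpy as np

def znorm(x):
    """Z-normalize a sub-sequence; constant sub-sequences map to zeros."""
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def naive_matrix_profile(ts, m):
    """profile[i] = distance from the sub-sequence starting at i (length m)
    to its nearest non-trivial neighbor in the same series."""
    ts = np.asarray(ts, dtype=float)
    n_sub = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n_sub)])
    excl = max(1, m // 2)  # assumed exclusion zone around position i
    profile = np.full(n_sub, np.inf)
    for i in range(n_sub):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):min(n_sub, i + excl + 1)] = np.inf
        profile[i] = d.min()
    return profile

# A periodic series with one injected spike: the profile is zero where the
# pattern repeats and largest around the spike.
ts = np.tile([0.0, 1.0, 2.0, 1.0], 8)
ts[13] = 10.0
profile = naive_matrix_profile(ts, m=4)
print(int(np.argmax(profile)))
```

The window length `m` has to be chosen before the profile is computed and its appropriate value depends on the input series.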

Transfer learning in language processing has been democratized and made universal with the advent of PLMs [45], which share the multi-headed attention core architecture of transformers [46]. Transfer learning, in the context of PLMs, is essentially the adaptation of these massive language models to downstream tasks such as data-to-text, question answering, and summarization via a fine-tuning process on task-specific data. Through the efforts of Wolf et al. [47], second-generation seq2seq PLMs such as Google's T5 [48] and Facebook's BART [49], auto-regressive PLMs such as OpenAI's GPT-2 [50], and many more have been made accessible to the larger community.

The motivation behind using PLMs for this task not only stems from the fact that they lead the benchmark for a multitude of downstream language processing tasks [51] but also due to the evidence that PLMs, due to their apparent acquisition of worldly knowledge [52] , in some cases refuse to generate false outputs even when the input to the system is corrupted [11] . As Open AI's GPT-3 [53] has not been released for public access at the time of publication of this paper, we have not been able to incorporate it into our experiments.

The PLMs we intend to investigate, viz. OpenAI's GPT-2, Facebook's BART, and Google's T5, though differing in their architectures and training strategies, share an auto-regressive decoder. Auto-regressive language generation is based on the assumption that the probability distribution of a sequence of words can be decomposed into the product of conditional next-word distributions. If $W_0$ is the initial context word sequence and $T$ is the length of the sequence to be generated, then the probability distribution can be defined as:

$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$$

Basic Sampling: This strategy is based on randomly picking a word $w_t$ based on its conditional probability distribution, $w_t \sim P(w \mid w_{1:t-1})$. Thus, the next word in the sequence is chosen based on its conditional probability of occurrence.

Top-K Sampling: In top-K sampling [54] , the K words most likely to occur next in the sequence are chosen and the probability mass is redistributed among these K words. This leads to a more "human-like" text generation.

Top-p Sampling: Top-p sampling [55] , also known as nucleus sampling, addresses a core issue in top-K sampling. Since top-K re-distributes the probability mass among the top K chosen words, it has the potential to break down in particularly sharp or flat distributions. If a distribution is sharp, the limit on the selection of just K words can lead to insensible text generation. On the other hand, for flat distributions, the limit prevents the generation from being diverse. Thus, instead of limiting the sampling space to K words, top-p samples from the smallest possible set of words whose cumulative probability exceeds a predefined probability p.
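The difference between the three sampling strategies can be sketched on a toy next-word distribution. The snippet below is an illustrative NumPy sketch, not the decoding code of any particular PLM library; the function names and the example probabilities are made up for the illustration.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose cumulative
    probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

def sample(probs, rng):
    """Basic sampling: draw a word index from the given distribution."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
next_word_probs = [0.5, 0.3, 0.1, 0.05, 0.05]     # toy distribution
print(top_k_filter(next_word_probs, k=2))         # mass split over 2 words
print(top_p_filter(next_word_probs, p=0.85))      # mass split over 3 words
print(sample(top_p_filter(next_word_probs, p=0.85), rng))
```

With top-K the number of candidate words is fixed, whereas with top-p it grows or shrinks with the shape of the distribution.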

The two-stage design of T³, as illustrated in Figure 2, is motivated by the need to produce rich and fluent narratives of time-series data with the least possible human intervention. Subsections VI-A and VI-B describe the experiments that motivate the specific choices of segmentation and regime-shift detection algorithms for T³, while subsections VI-C and VI-D do the same for our choice of PLMs.

Stage I: The time-series is first log-transformed to approximately conform the data to normality before information extraction. This log-transformed series is segmented into k linear segments where the individual slopes of these k segments indicate the trends followed by the data in their respective intervals. Simultaneously, sequential data points with similar properties are clustered together based on their learned representations. These clusters represent changing regimes in the dataset. The above time-series characteristics are encoded into an RDF-based knowledge graph. Figure 3 illustrates a sample knowledge graph (curtailed) as extracted from T³'s first stage for the United States COVID19 time-series.

Fig. 2. The two-stage T³ framework: In Stage I, the system extracts trends, regimes, and peaks from the input time-series, which are formulated into a knowledge graph. In Stage II, a PLM fine-tuned for graph-to-text generation generates the narrative from the input graph.
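The trend-extraction part of Stage I can be sketched as follows. This is a hedged illustration only: the use of `log1p` (to tolerate zero counts), the predicate names such as `increasing_trend`, the textual form of the interval, and the function name `segment_triples` are all assumptions and are not taken from the T³ code-base. Segment boundaries are assumed to come from a prior segmentation step.

```python
import numpy as np

def segment_triples(series, bounds, subject):
    """Turn per-segment slopes of a log-transformed series into
    (head, relation, tail) triples for a knowledge graph."""
    # Assumption: log(1 + x) is used so that zero values stay finite.
    logged = np.log1p(np.asarray(series, dtype=float))
    triples = []
    for start, end in bounds:  # `end` is exclusive
        x = np.arange(start, end)
        slope = np.polyfit(x, logged[start:end], 1)[0]
        if slope > 0:
            trend = "increasing"
        elif slope < 0:
            trend = "decreasing"
        else:
            trend = "flat"
        # Hypothetical predicate and object wording.
        triples.append((subject, f"{trend}_trend",
                        f"step {start} to step {end - 1}"))
    return triples

cases = [1, 2, 4, 8, 16, 8, 4, 2, 1]          # toy series
for triple in segment_triples(cases, [(0, 5), (5, 9)], "toy series"):
    print(triple)
```

Regimes and peaks would contribute further triples of the same shape, so that the whole graph can be handed to Stage II as a list of triples.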

Stage II: Prior to T³'s execution, the PLMs are fine-tuned with both the WebNLG and DART datasets for graph-to-text translation. The knowledge graph from Stage I is thus translated into a rich and descriptive narrative by these PLMs using sampling techniques for strategic language generation.

Time-series: To promote domain-agnosticism, the datasets used for evaluating T³ are drawn from five different fields: COVID19, Direction of Trade Statistics (DOTS), Carbon Monoxide Pollution, World Population, and Climate Change. Based on the amount and consistency of the data, we consider the same ten countries (United States, India, Brazil, Russia, United Kingdom, France, Spain, Italy, Turkey, and Germany) across these datasets. The CO (Carbon Monoxide) units, however, are extracted for the U.S. states with EPA state codes 1 through 10. Table I provides a brief statistical summary of these datasets.

Fine-tuning: The RDF-based datasets WebNLG v3.0 and DART v1.1 are used for fine-tuning the PLMs in T³. Table I briefly summarizes the statistics of these datasets, where $N_x$ represents the number of samples for $x \in \{train, dev, test\}$ and $V$, $WSR$, and $SSR$ represent the vocabulary size, words per SR (Surface Realization), and sentences per SR respectively.

Tokens <X>, where X ∈ {H, R, T}, are added to the start of the Head (subject), Relationship (predicate), and Tail (object) entities of each RDF triple. The Adam optimizer [56] with a linearly decreasing learning rate is used to fine-tune the PLMs, with learning rates initially set to 3e-5 for T5 and BART and 5e-4 for GPT-2. For uniformity, the maximum token lengths for all PLMs are set to their default maximum (512) with a batch size of 4. For strategic decoding, based on the average length (∼100 words) and the average number of unique words (∼50) present in the generated narratives, we set K to 50 for top-K sampling. Similarly, based on popular practice, we set p to 0.92 for top-p sampling.
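The linearization of RDF triples with the <H>, <R>, and <T> markers can be sketched in a few lines. The function name `linearize` and the example triple are hypothetical; only the placement of the three markers before the head, relationship, and tail follows the description in the text.

```python
def linearize(triples):
    """Flatten (head, relation, tail) triples into a single input string,
    marking each element with its <H>, <R>, or <T> token."""
    parts = []
    for head, relation, tail in triples:
        parts.append(f"<H> {head} <R> {relation} <T> {tail}")
    return " ".join(parts)

# Hypothetical triples, for illustration only.
graph = [
    ("toy series", "increasing_trend", "step 0 to step 4"),
    ("toy series", "decreasing_trend", "step 5 to step 8"),
]
print(linearize(graph))
```

The resulting string is what a seq2seq or auto-regressive PLM would receive as its source text during fine-tuning and generation.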

In order to evaluate our candidate segmentation algorithms, we must first determine the right value of the allowable maximum linear-fit error appropriate for our datasets. The evaluation of the total SSE (Sum of Squared Errors) of residuals vs. k (the number of segments produced), as a function of the maximum linear-fit error, hints at 2.75 as a potential error "sweet spot". The figure below presents this analysis for the U.S. COVID19 dataset: the left marker indicates the trade-off point between the total SSE and k, while the right marker indicates the point where both total SSE and k stabilize. Table II outlines the performance of the selected segmentation algorithms across our datasets with the maximum linear-fit threshold set to 2.75. We observe that SWAB consistently performs the best in terms of both the $r^2$ goodness-of-fit and SSE, making it the segmentation algorithm of choice for T³.

Out of the k segments produced for each time-series, if the slope of the $(i-1)$-th segment follows that of the $i$-th segment, we rearrange them as a single segment for continuity. This is illustrated in the figure above for the U.S. COVID19 time-series, where the original k segments are consolidated based on their slopes into 6 long segments (k > 6) that indicate the core trends followed by the time-series over significant time-spans.
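A minimal sketch of this consolidation step is given below. It assumes that one slope "follows" another when both have the same sign, which is an interpretation and not a rule stated in the text; the function name `merge_by_slope` is hypothetical, and the input lists are assumed to be non-empty and of equal length.

```python
import numpy as np

def merge_by_slope(bounds, slopes):
    """Merge consecutive segments whose slopes share the same sign.

    bounds: list of (start, end) index pairs, in temporal order.
    slopes: slope of the linear fit for each segment.
    """
    merged = [bounds[0]]
    last_sign = np.sign(slopes[0])
    for (start, end), slope in zip(bounds[1:], slopes[1:]):
        if np.sign(slope) == last_sign:
            # Same direction: extend the current consolidated segment.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
            last_sign = np.sign(slope)
    return merged

bounds = [(0, 5), (5, 9), (9, 14), (14, 20)]
slopes = [0.2, 0.5, -0.1, -0.3]
print(merge_by_slope(bounds, slopes))
```

Here four fine-grained segments collapse into one rising and one falling segment.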

For the evaluation of our candidate regime-shift detection algorithms, we force these algorithms to produce a known number of regime shifts validated through visual interpretation of the data: regime shifts in COVID19 cases should correspond to waves of outbreak, as illustrated in the figure below, whereas those in DOTS Exports should correlate with inflation or deflation in the economy. Table III outlines the performance of Matrix Profile and RL across our datasets based on the standard deviations (σ) of the formed regimes. Our evaluations lead us to conclude that the performance of Matrix Profile and RL is on par and varies with the individual dataset. In our implementation, an RL instance trained on the COVID19 dataset showcases high cross-domain transferability when applied to other series in our catalog. The Matrix Profile, however, requires a window-size definition prior to its execution, which varies based on the input time-series. Because RL requires no such per-series configuration and thus favors automation, it is the regime-shift detection algorithm of choice for T³.

The task of translating a graph to text is predominantly a Machine Translation task. Thus, the PLM architectures of preference are seq2seq models such as Google's T5 and Facebook's BART. However, for completeness, we also include an auto-regressive model, OpenAI's GPT-2, in our evaluation. The performance of these models is benchmarked across three dataset configurations: WebNLG, DART, and a

To evaluate T³, we measure its performance with respect to our baseline, the templated generation framework. The templated generation takes in the data from Stage I of T³; however, instead of passing it to Stage II, it feeds it to a template designed for the desired domain. The narratives generated by these systems are evaluated based on three core dimensions of linguistic quality:

• The Flesch RE (Reading Ease) score [57] measures the readability of a text based on the average length of its sentences and the average number of syllables of its words (see https://pypi.org/project/textstat/). Ranging from 0 to 100, higher scores represent higher levels of readability.
• The TTR (Type-Token Ratio) (see https://pypi.org/project/lexical-diversity/) is a measure of text diversity where tokens refers to the total number of words in a given text while types refers to the number of unique words. It is calculated as $TTR = \frac{\text{Types}}{\text{Tokens}}$; the closer the TTR is to 1, the more lexical variety there is in a given text.
• The G (grammar score) (see https://pypi.org/project/language-tool-python/) represents the grammatical integrity of the text. Similar to TTR, the closer G is to 1, the better the grammar of the text: $G = 1 - \frac{\text{Number of grammatical errors in a sentence}}{\text{Number of words in a sentence}}$

For each of our five datasets described in section 5-B, the RE score, TTR, and grammar score (G) are averaged over the aforementioned ten countries/states. The performance of T³ is evaluated with three decoding strategies: T³ with $PLM_{top-K}$ represents the use of the top-K sampling scheme, T³ with $PLM_{top-p}$ represents the use of the top-p sampling scheme, and T³ with simply $PLM$ refers to the default sampling scheme where words are sampled from the base conditional probability distribution without the use of top-K or top-p strategies. Table V illustrates the comparative performance of T³ with templated generation. From this, we make four key observations:

1) T³ significantly outperforms templated generation in lexical diversity. The highest increase in lexical diversity was observed in the COVID19 dataset, where T³ increases the TTR by 65.38%, while the lowest observed increase was in the DOTS Exports dataset, where T³ increases the TTR by 13.33%.
2) T³ remains closely competitive with templated generation in maintaining grammatical integrity. As templated generation uses pre-defined sentence planning, its grammar is expected to be perfect (G = 1). While T³ achieves perfect grammatical integrity in the DOTS Exports, U.S. CO Pollution, and World Population datasets, the highest observed loss in grammatical integrity was 7.9% in the Global Temperature dataset.
3) T³ consistently outperforms templated generation in terms of readability, although not significantly. We attribute this to the distinct sentences formed when each element of the knowledge graph is translated to text.
4) In terms of PLM selection, we observe that T5 tends to lean more towards grammatical integrity while BART tends to produce more linguistically diverse text. Similar observations are made for the sampling strategies: top-p sampling leads to more grammatically consistent texts while top-K sampling promotes linguistic diversity.
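The TTR and G formulas, and the relative increase used to compare T³ with the templated baseline, are simple enough to sketch directly. The snippet below is an illustrative plain-Python version that tokenizes on whitespace; it does not call the textstat, lexical-diversity, or language-tool-python packages referenced in the text, and the error count passed to `grammar_score` is assumed to come from an external grammar checker. All function names and example numbers are hypothetical.

```python
def type_token_ratio(text):
    """TTR = unique words / total words (whitespace tokens, lower-cased)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

def grammar_score(n_errors, n_words):
    """G = 1 - grammatical errors / words, for a single sentence."""
    return 1 - n_errors / n_words

def relative_increase(baseline, value):
    """Percentage increase of `value` over `baseline`."""
    return 100 * (value - baseline) / baseline

sentence = "the cases rose and the cases fell"
print(type_token_ratio(sentence))       # 5 unique words out of 7
print(grammar_score(1, 20))             # one error in a 20-word sentence
print(relative_increase(0.50, 0.75))    # made-up TTR values
```

Averaging these per-narrative scores over the ten countries or states of a dataset gives one figure per metric and dataset.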

We conduct an expert review (n = 21) [58] to validate the practicality of T³. The review simultaneously acts as a human evaluation of T³'s narratives. 85.7% of the recruited experts had expertise in data science, 76.2% in data visualization, and 66.7% in NLP. When asked to rate their trust in machine-generated narratives on a 1 to 5 Likert scale, the response from the experts resembled a right-skewed bell curve where 42.9% of the experts had chosen a rating of 3 (neither complete trust nor distrust in machine-generated narratives). In agreement with [22], 61.9% of the recruited experts acknowledged being dismissive of machine-generated narratives, while the remaining claimed equal treatment of both machine- and human-generated narratives. Each expert was presented with 2 time-series datasets, where each time-series was accompanied by 4 narratives: a baseline templated narrative, 2 narratives randomly sampled from T³, and finally, a sub-par T³ narrative (generated by repeatedly sampling from T³ until a sub-par narrative was generated). For each of these narratives, the experts were asked to rate its coherence, linguistic diversity, grammatical integrity, and data fidelity (does the model tend to hallucinate?) on a 1 to 5 Likert scale. Figure 4 presents an overview of the findings: T³ and templated generation were rated comparably in terms of coherence, grammatical integrity, and data fidelity. However, T³ was rated considerably higher in terms of linguistic diversity, in alignment with our experimental findings. In their concluding remarks, 76.2% of the experts chose T³ over templated narratives for deployable systems. For the remaining 23.8% of the experts, who chose templated narratives, the stated sentiment resonates with the need for mission-critical data fidelity.

We have presented T³, a domain-agnostic neural framework for time-series narration. Through our experiments, we outline a strategy forward for universal time-series narration. There are numerous avenues to pursue to augment the space of time-series narration. From the analysis of time-series data to the realization of natural language summaries, work in each of these spaces will bring us closer to better data-to-text systems. With a dataset of time-series and narrative pairs, a promising direction for future exploration lies in learning direct mappings from numbers to text, extending beyond just time-series.

This work was partially supported by DARPA (Defense Advanced Research Projects Agency) under contract number FA8650-17-C-7720. The views, opinions and/or findings expressed in this publication are solely those of the author(s).

[1] A review on time series data mining
[2] Vis author profiles: Interactive descriptions of publication records combining text and visualization
[3] Once upon a time in visualization: Understanding the use of textual narratives for causality
[4] Why looking isn't always seeing: readership skills and graphical programming
[5] Building Natural Language Generation Systems
[6] Survey of the state of the art in natural language generation: Core tasks, applications and evaluation
[7] Table-to-text generation by structure-aware seq2seq learning
[8] Data-to-text generation with content selection and planning
[9] A hierarchical model for data-to-text generation
[10] Triple-to-text: Converting RDF triples into high-quality natural languages via optimizing an inverse KL divergence
[11] Investigating pretrained language models for graph-to-text generation
[12] Neural text generation from structured data with application to the biography domain
[13] End-to-end content and plan selection for data-to-text generation
[14] The E2E dataset: New challenges for end-to-end generation
[15] Creating training corpora for NLG micro-planners
[16] DART: Open-domain structured data record to text generation
[17] Resource description framework (RDF): Concepts and abstract syntax
[18] Bleu: a method for automatic evaluation of machine translation
[19] ROUGE: A package for automatic evaluation of summaries
[20] METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments
[21] chrF: character n-gram F-score for automatic MT evaluation
[22] Readers' perception of computer-generated news: Credibility, expertise, and readability
[23] Augmenting visualizations with interactive data facts to facilitate interpretation and communication
[24] Graphiti: Interactive specification of attribute-based edges for network modeling and visualization
[25] Words of Estimative Probability
[26] Analysis of competing hypotheses
[27] Coupling story to visualization: Using textual analysis as a bridge between data and interpretation
[28] Using narratives and storytelling to communicate science with nonexpert audiences
[29] Using natural-language processing to produce weather forecasts
[30] Modelling the task of summarising time series data using KA techniques
[31] Sumtime-turbine: a knowledge-based system to communicate gas turbine time-series data
[32] Summarizing neonatal time series data
[33] Generating English summaries of time series data using the Gricean maxims
[34] Logic and conversation
[35] Choosing words in computer-generated weather forecasts
[36] Linguistic summarization of time series using a fuzzy quantifier driven aggregation
[37] Linguistic summarization of time series data using genetic algorithms
[38] A framework for automatic text generation of trends in physiological time series data
[39] Textual summarization of time series using case-based reasoning: a case study
[40] Cut-n-reveal: Time series segmentations with explanations
[41] Segmenting time series: A survey and novel approach
[42] Unsupervised scalable representation learning for multivariate time series
[43] Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets
[44] MPA: a novel cross-language API for time series analysis
[45] Pre-trained models for natural language processing: A survey
[46] Attention is all you need
[47] Transformers: State-of-the-art natural language processing
[48] Exploring the limits of transfer learning with a unified text-to-text transformer
[49] BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
[50] Language models are unsupervised multitask learners
[51] Text-to-text pre-training for data-to-text tasks
[52] Language models are open knowledge graphs
[53] Language models are few-shot learners
[54] Hierarchical neural story generation
[55] The curious case of neural text degeneration
[56] Adam: A method for stochastic optimization
[57] A new readability yardstick
[58] Evaluating visualizations: do expert reviews work