title: COMPASS: a Creative Support System that Alerts Novelists to the Unnoticed Missing Contents
authors: Mori, Yusuke; Yamane, Hiroaki; Shimizu, Ryohei; Mukuta, Yusuke; Harada, Tatsuya
date: 2022-02-26

When humans write, they may unintentionally omit some information. Complementing the omitted information using a computer is helpful in providing writing support. Recently, in the field of story understanding and generation, story completion (SC) was proposed to generate the missing parts of an incomplete story. However, its applicability is limited because it requires the user to have prior knowledge of which part of a story is missing; missing position prediction (MPP) can be used to compensate for this problem. MPP aims to predict the position of the missing part, but the prerequisite knowledge that "one sentence is missing" is still required. In this study, we propose Variable Number MPP (VN-MPP), a new MPP task that removes this restriction; that is, the task of predicting multiple missing sentences, or of judging that there are no missing sentences in the first place. We also propose two methods for this new MPP task. Furthermore, based on the novel task and methods, we developed a creative writing support system, COMPASS. The results of a user experiment involving professional creators who write texts in Japanese confirm the efficacy and utility of the developed system.

Creativity is human nature. Writing and reading stories are essential aspects of creativity. Moreover, understanding how humans write and interpret stories is inextricably linked to understanding humans themselves. Winston [2011] theorized this as The Strong Story Hypothesis: the mechanisms that enable humans to tell, understand, and recombine stories separate human intelligence from that of other primates. Stories are used not only for entertainment, but also for a variety of other purposes: advertisement [Ono et al., 2019], marketing [McKee and Gerace, 2018], education [Koncel-Kedziorski et al., 2016], and serious storytelling [Lugmayr et al., 2017]. Storytelling is deeply rooted in human life. Currently, thanks to the Internet, anybody can freely publish their original stories. However, creating a story is not an easy task. It is even more difficult to write something that people like and want to read. Sometimes, even professional writers fall into slumps (so-called "writer's block") during the writing process and cannot complete their stories. To understand the secret of creating good stories, various studies have been conducted [Campbell, 1949]. To clarify what it means to create a story, the practical knowledge of creators who are actually engaged in the creation process is important. Indeed, the accumulated knowledge of many creators has helped in understanding stories [Forster and Stallybrass, 1927, Propp, 1968, Vonnegut, 1995]. Rules for creating stories have been extensively studied; the "Three-act structure" [Field, 2006] and "Save the cat" [Snyder, 2005] are famous examples. These works can help people who wish to create good stories demonstrate their creativity. In this way, story understanding and story generation are inextricably linked. Open story generation is a problem proposed by Li et al. [2013], in which a story about any domain is automatically generated without a priori manual knowledge engineering.
The intent of the authors of each paper may differ, but the terms "open-domain story generation" and "open-ended story generation" can be included in this group. Because "open-ended" is easily understood and conveys the general meaning of "not restricted," for the sake of simplicity we refer to story generation in this direction as "open-ended story generation" in this paper. Note that giving a prompt/title as an input to the models is common practice in open-ended storytelling; this kind of input is not considered a "restriction" in this area. We define "context-aware story generation" as story generation tasks in which a human-written story/plot (so-called "context") is given as an input and, according to the given context, models generate subsequent sentences, complementary sentences for a missing middle part, and so on. Typical examples of this approach include story ending generation (SEG) [Zhao et al., 2018], SC [Wang and Wan, 2019], and story infilling [Ippolito et al., 2019]. In open-ended story generation, the machine-generated stories are less constrained, but this does not mean that open-ended story generation is more difficult than context-aware story generation. For example, in the story completion task, models must understand the context that has already been written by humans and then fill in the gaps. The task requires both story understanding and story generation, so restriction by context does not make the task easy. Both open-ended story generation and context-aware story generation have important implications. With that in mind, in this study we focus on the latter approach. To overcome the issue that conventional story completion tasks require information regarding the position of the missing part in a story, we previously proposed "Missing Position Prediction" (MPP) as a task to predict the position from the given incomplete story. In our previous paper [Mori et al., 2020], we proposed MPP with limited conditions (LC-MPP). In this paper, building on LC-MPP, we propose an updated version of MPP, Variable Number MPP (VN-MPP), as a task closer to a realistic setting. In story and narrative research, it is necessary to define a story and the kind of text that can be regarded as a story; the task of judging whether a text is a story is known as story detection [Eisenberg and Finlayson, 2017]. We define a story as a series of events related to characters and having a beginning and an end; these events are intended to change the emotions and relationships of the characters. The major contributions of this study are summarized as follows.

• To overcome the issue that conventional SC tasks require information regarding the position of the missing part in a story, we previously proposed "Missing Position Prediction" as a task to predict the position from the given incomplete story, first in the form of MPP with limited conditions (LC-MPP) [Mori et al., 2020]. In this paper, building on LC-MPP, we propose an updated version of MPP, Variable Number MPP (VN-MPP), as a task closer to a realistic setting.
• We propose two novel methods for VN-MPP and Story Completion (SC): the two-module approach and the end-to-end approach.
• Based on our proposed tasks and methods, we developed a system for human story writing assistance. We named this system "COMPASS", which stands for a writing support system to COMPlement Author unaware Story gapS.
Subsequently, four professionals in the field of creative writing in Japanese evaluated the developed system and confirmed its efficacy and utility. This paper is divided into two major parts. In the first part, we present the proposed VN-MPP task (Section 3) and a proposed method to solve it (Section 4). In the second part, we discuss the implementation of VN-MPP as a creation support system (Section 5) and the experimental verification of its practicality by professionals (Section 6). We plan to make the source codes publicly available in the future. 2 Related Work Applying machine learning to human story writing assistance is an approach of which interesting works were published in recent years [Roemmele, 2016 , Peng et al., 2018 , Yao et al., 2019 , Goldfarb-Tarrant et al., 2019 . Referring to Recurrent Neural Networks (RNN) as a promising machine learning framework for language generation tasks, Roemmele [2016] envisioned the task of narrative auto-completion applied to helping an author write a story. Peng et al. [2018] proposed an analyze-to-generate framework for controllable story generation. They apply two types of generation control: 1) ending valence control (happy or sad ending) and 2) storyline keywords. Yao et al. [2019] proposed a two-step pipeline for open-domain story generation: 1) story planning, which generates a storyline represented by an ordered list of words, and 2) surface realization, which composes a story based on the storyline. They proposed a hierarchical generation framework named plan-and-write that combines storyline planning and surface realization to generate stories from titles. Based on the studies carried out by Yao et al. [2019] and Holtzman et al. [2018] , Goldfarb-Tarrant et al. [2019] presented a neural narrative generation system named Plan-and-Revise in which humans and computers collaborate to generate stories. This research is positioned in the context of such research on creative writing support, and at the same time aims to apply the research on SC to creative writing support. SC was recently proposed in the field of story understanding and generation as a method for generating the missing parts of an incomplete story. Here, we discuss the research on story understanding and generation, which are strongly associated with SC. At the intersection of NLP and literary analysis, various studies have been conducted. Referring to the narrative cloze test [Chambers and Jurafsky, 2008] as a typical example of a story understanding task considering events, Mostafazadeh et al. [2016] proposed SCT as a more difficult task. SCT presents four sentences, and the last sentence is excluded from a story composed of five sentences. The system must select an appropriate sentence from two choices that complement the missing last sentence. In addition to the task, the authors released a large-scale story corpus named ROCStories, 3 which is a collection of non-fictional daily-life stories written by hundreds of workers at Amazon Mechanical Turk. The five-sentence stories contain varied common-sense knowledge. SCT was proposed as a challenging task, but the improvement of NLP methods is so rapid that an even more challenging task was needed to evaluate the performance of models in this ever-evolving field. When SCT was proposed, machine learning models could solve this task with an accuracy of less than 60%. However, in a few years, Chaturvedi et al. [2017] achieved an accuracy of 77.6% and Radford et al. [2018] achieved 86.5%. 
With the development of the field of story understanding, the field of story generation has also become more active in research. Story generation approaches can be roughly divided into two types. One involves studies that produce the entire story [Fan et al., 2018] . The other involves studies that complement the existing incomplete text [Ippolito et al., 2019 , Donahue et al., 2020 , Wang et al., 2020 . In this paper, we focus on the latter approach. This is because we believe that generating sentences to improve an incomplete story is indeed an important task in human writing assistance. Inspired by the aforementioned task, SCT, which is a subtask of story generation, SEG was designed by Zhao et al. [2018] . In their SEG, a system is given an incomplete story, where the last sentence is excluded from the original five-sentence story. The objective of the task is to automatically generate the last sentence of this given incomplete story. Furthermore, based on SEG, Wang and Wan [2019] proposed an SC task and investigated the problem of generating missing story plots at any position in an incomplete story. If a sentence in the middle is missing, the task becomes more difficult because the system must capture the context both before and after the missing sentence. Additionally, in recent years, research regarding text infilling has been actively conducted [Ippolito et al., 2019 , Donahue et al., 2020 . Regarding stories, Ippolito et al. [2019] worked on complementing the missing span between left and right contexts, which they called "story infilling." However, these studies require that the writer have prior knowledge of the missing parts, and they do not consider the case where the writer is unaware of the flaws in their work. To overcome this limitation, we [Mori et al., 2020] proposed a story comprehension task named MPP. In MPP, an incomplete story with one sentence missing is given as input. Unlike the previously mentioned task, no information regarding the position of the missing content is required. MPP requires the prediction of the position of the missing part. The ability to solve this task indicates that computers can identify flaws in a story's plot. Referring to our previous work [Mori et al., 2020] , Cai et al. [2020] pointed out that most previous studies focused solely on the problem of what to infill/modify and the need to know the positions of missing parts a priori. To solve MPP, it is necessary to identify unnaturalness because MPP is deeply related to a fundamental question in story understanding: whether or not the model understands the flow of a story. However, conventional MPP has several restrictions. In the first MPP approach proposed by us [Mori et al., 2020] , it must be known a priori that there is a missing position in the input story and that there is only one such instance. In reality, an input story may be complete, i.e., it has no missing parts. Furthermore, there may be a case in which there are multiple missing positions. In this paper, we propose VN-MPP in which the number of missing positions, including zero, can vary. Referring to RNNs as a promising machine learning framework for language generation tasks, Roemmele [2016] envisioned the task of narrative auto-completion applied to helping an author write a story. With the advent of the sequence-to-sequence model (Seq2seq), the use of neural networks as a method for generating natural sentences has become commonplace. Seq2seq was first proposed for machine translation [Sutskever et al., 2014] . 
However, it has been widely applied to other tasks in NLP [Vinyals and Le, 2015]. In SEG, simple Seq2seq and an extension using the attention mechanism are used as baselines [Zhao et al., 2018, Li et al., 2018, Guan et al., 2019, Mori et al., 2019a]. Transformer [Vaswani et al., 2017], which replaced the RNNs in Seq2seq with self-attention, is the basis of today's significant improvements in NLP. Unsupervised pre-trained large neural models, such as BERT [Devlin et al., 2019] and GPT-2 [Radford et al., 2019], were proposed using the Transformer architecture and soon became the mainstream in NLP. These pre-trained models are roughly divided into two groups: one uses the Transformer encoder (bi-directional architecture) and the other uses the Transformer decoder (left-to-right architecture). In sequence generation, it was common knowledge that models using the left-to-right architecture [Radford et al., 2019] are more suitable. However, instead of using only the Transformer-encoder-based or the Transformer-decoder-based architecture, attempts to create Seq2seq (encoder-decoder) models that use unsupervised pre-trained large neural models to initialize both the encoder and the decoder are becoming the new mainstream [Lewis et al., 2020, Rothe et al., 2020]. In this paper, our proposed method is based on BART (proposed by Lewis et al. [2020]), which combines a BERT-like bidirectional encoder with a GPT-2-like autoregressive decoder and exhibits high performance in tasks such as summarization. In text generation tasks, human evaluation, i.e., involving humans as judges, is generally regarded as the gold standard. This is natural, because most text generation models aim to generate text that is "natural" to humans. However, some problems remain. Human evaluation is costly, time-consuming, and dependent on individual abilities. Regarding Amazon Mechanical Turk, a commonly used crowdsourcing platform, Ippolito et al. [2019] reported that evaluation by average workers is unreliable in the task of story infilling: they inserted one honeypot question among 11 questions and found that performance on the honeypot question was close to random guessing. August et al. [2020] pointed out that human evaluation schemes tend to ignore the difference in perspective between authors and readers. Therefore, automatic metrics to evaluate the day-by-day progress of natural language generation are strongly needed. However, it has been shown that traditional metrics correlate poorly with human evaluation, so proper evaluation of text generation is difficult [Liu et al., 2016, Novikova et al., 2017, Chaganty et al., 2018, Gatt and Krahmer, 2018, Hashimoto et al., 2019]. Recently, based on large unsupervised pre-trained neural models, various machine-learned metrics have been proposed and tested, such as BERTScore [Zhang et al., 2020] and BLEURT [Sellam et al., 2020]. In particular, for story generation, Guan and Huang [2020] proposed UNION, a learnable UNreferenced metrIc for evaluating Open-eNded story generation. Based on the above metrics, we evaluated how well the method proposed in this paper performs on our proposed VN-MPP task. We begin by formulating SEG, SC, and MPP. Then, we formulate our proposed VN-MPP. We define $S = \{s_1, s_2, \ldots, s_n\}$ as a story comprising $n$ sentences. In SEG, $S' = \{s_1, s_2, \ldots, s_{n-1}\}$ is given as an input. The objective of the task is to generate an appropriate ending.
For story completion, an incomplete story consisting of $n-1$ sentences $S' = \{s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n\}$, where $k$ represents the position of the missing sentence in the story, is provided. The objective of the task is then to generate an appropriate sentence that is coherent with the given sentences. For each task, the model is trained to maximize the probability $p(y \mid S')$, where $y$ represents the ground-truth sentence. To overcome the issue that the story completion model requires information about $k$, i.e., the position of the missing sentence, our previous work [Mori et al., 2020] proposed MPP, which predicts $k$ from the given $n-1$ sentences. Similar to the story completion task, an incomplete story comprising $n-1$ sentences $S' = \{s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n\}$ is given as an input. However, no information about $k$ is provided. The order of the sentences is known, but the missing position is unknown; more specifically, $s_{k-1}$ and $s_{k+1}$ are treated as consecutive sentences. The objective here is to predict $k$ from the input. In other words, the model is trained to maximize the probability $p(\mathrm{missing}=k \mid S')$. In MPP, it must be known a priori that there is a missing position in the input story and that there is only one such instance. In this study, we propose VN-MPP, in which the number of missing positions, including zero, can vary. A story comprising $n-m$ ($0 \le m < n$) sentences $S' = \{s_{i_1}, s_{i_2}, \ldots, s_{i_j}, s_{i_{j+1}}, \ldots, s_{i_{n-m}}\}$ is given as an input. However, no information about $m$ is provided. We use the indices $i_j$ to indicate that, for a given incomplete story, we do not know how many sentences are missing from the original text. For example, we may be given the first, third, and fifth sentences (i.e., $\{s_1, s_3, s_5\}$) as $S'$, but we do not know their original positions; hence they are represented as the first, second, and third sentences of the incomplete story (i.e., $\{s_{i_1}, s_{i_2}, s_{i_3}\}$). The order of the sentences is known, but the number of missing positions and the location of each missing position are unknown. More specifically, $s_{i_j}$ and $s_{i_{j+1}}$ are treated as consecutive sentences, but some sentences may be missing between them. Our objective here is to predict all the missing positions from the input, including the case where there is no missing position ($m = 0$), i.e., where the input story is complete. We note that even when there is a missing part in the story, it may be caused by the writer's intention, in the sense of "I want the readers to read between the lines." However, the missing part can also be an unintentional mistake. Analyzing whether the model can tell that a missing part is a writer's intentional omission is beyond the scope of this study. To solve our proposed VN-MPP and SC, we propose two methods in this section. Then, in Section 5, we arrange one of these methods to be more suitable for a creative writing assistance system.

• Proposed methods (this section)
  - Two-module Approach
  - End-to-end Approach
• Arranged method (discussed in Section 5)
  - Two-module Approach with an improved SC module (Two-module v2)

As mentioned earlier, we first discuss the following two approaches: the two-module approach and the end-to-end approach. The first method consists of two modules, the VN-MPP module and the SC module; we call this the two-module approach. The second method treats VN-MPP and SC in an end-to-end manner; we call this the end-to-end approach. We also provide the details of dataset preprocessing, which is another key factor of our proposed methodology.
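To make the VN-MPP setting concrete before turning to the methods, the following is a minimal sketch (our illustration, not the authors' released code) of how an incomplete story and its ground-truth missing positions can be constructed from a complete one; all function and variable names are ours, and the example story is the one used later in Figure 1.

```python
import random
from typing import List, Tuple

def make_vn_mpp_example(story: List[str], max_missing: int = 2,
                        rng: random.Random = random.Random(0)) -> Tuple[List[str], List[int]]:
    """Drop m sentences (m may be 0) from a complete story.

    Returns the observed incomplete story S' and the gap positions, expressed as
    indices into S': a gap at position j means one or more sentences are missing
    just before the j-th observed sentence; a gap at len(S') means the ending is missing.
    """
    n = len(story)
    m = rng.randint(0, min(max_missing, n - 1))        # number of missing sentences
    removed = sorted(rng.sample(range(n), m))          # original indices that were dropped
    observed = [s for i, s in enumerate(story) if i not in removed]
    # Convert the original indices into gap positions within the observed story.
    gaps = sorted({i - sum(1 for r in removed if r < i) for i in removed})
    return observed, gaps

story = [
    "Jennifer has a big exam tomorrow.",
    "She got so stressed, she pulled an all-nighter.",
    "She went into class the next day, weary as can be.",
    "Her teacher stated that the test is postponed for next week.",
    "Jennifer felt bittersweet about it.",
]
incomplete, gaps = make_vn_mpp_example(story)
print(incomplete)  # the model only sees these sentences, treated as consecutive
print(gaps)        # what VN-MPP must recover; an empty list means the story is complete
```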
Although the end-to-end method is considered to be the mainstream in the field of machine learning, in the first method we deliberately divided the modules by task so that the output of VN-MPP alone can be confirmed. In addition, because various methods for SC have been proposed and more are expected to be developed in the future, it is also advantageous that our proposed VN-MPP module can be easily combined with these methods. As written above, we build two modules: the VN-MPP module and the SC module. However, we handle both modules with a unified structure. We use large Transformer-based Seq2seq models to solve both VN-MPP and SC. In other words, we treat both VN-MPP and SC as Seq2seq-type tasks, as represented by translation, summarization, and so on. Because the tasks are conversions within the same language, we can consider them similar in form to summarization. We introduce a new special token, which acts as a placeholder marking a missing sentence, to solve both VN-MPP and SC. In VN-MPP, the input is an incomplete story, and the output is the same incomplete story with each missing position filled by the special token. In SC, the input is an incomplete story containing the special tokens, and the output is a completed story in which the special tokens are replaced with appropriate sentences. Our end-to-end approach is simpler than our two-module approach. In this approach, VN-MPP and SC are handled in a single step and treated as one Seq2seq task. Because this end-to-end approach also involves conversions within the same language, we can again consider it similar in form to summarization. In this case, there is no need to introduce additional special tokens: we simply feed an incomplete story to the Seq2seq encoder and obtain a complete story as the output of the Seq2seq decoder. From the original complete stories in a dataset, we create two kinds of incomplete data by preprocessing. A conceptual diagram of the preprocessing is presented in Figure 1, which illustrates the procedure with a five-sentence example story about Jennifer's exam. Given an original story comprising n sentences, it can be written as (1) the complete story. From this, we create two incomplete stories: (2) an incomplete story in which the removed sentences are replaced with the special tokens, and (3) an incomplete story from which the special tokens have also been removed, so that it contains no information regarding the missing positions. Concretely, we randomly choose the number of missing sentences m, replace m sentences of the original story with the special token, and then remove the special tokens to obtain the version without position information. The aim of the entire task can be explained as the conversion from (3) to (1). Our end-to-end approach tries to do this in one step. On the other hand, our two-module approach performs two conversions: the VN-MPP module converts (3) to (2), and the SC module converts (2) to (1). How we define m for each dataset is described in detail in Section 4.5. We implement our code based on PyTorch [Paszke et al., 2019], an open-source machine learning framework provided as a Python library.
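A minimal sketch of this preprocessing is shown below. The literal special-token string used by the authors is not reproduced here, so `<missing>` is only an assumed placeholder; the pairing of inputs and outputs follows the (1)–(3) conversions described above.

```python
import random

MISSING = "<missing>"  # assumed placeholder; the actual special-token string may differ

def build_training_pairs(story, m, rng=random.Random(0)):
    """Create the three text forms from one complete story.

    (1) complete story, (2) incomplete story with special tokens,
    (3) incomplete story without any position information.
    """
    removed = set(rng.sample(range(len(story)), m))
    complete = " ".join(story)                                           # (1)
    with_tokens = " ".join(MISSING if i in removed else s
                           for i, s in enumerate(story))                 # (2)
    without_info = " ".join(s for i, s in enumerate(story)
                            if i not in removed)                         # (3)
    return {
        "vn_mpp": (without_info, with_tokens),   # two-module approach: (3) -> (2)
        "sc": (with_tokens, complete),           # two-module approach: (2) -> (1)
        "end_to_end": (without_info, complete),  # end-to-end approach: (3) -> (1)
    }
```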
To make use of unsupervised pre-trained large neural models, our code is also based on Hugging Face Transformers [Wolf et al., 2020], which provides general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG) for both TensorFlow 2.0 and PyTorch. Specifically, we use PyTorch version 1.7.0 and Hugging Face Transformers version 4.5.1. We use ROCStories as a story dataset and, from another domain, the CNN / Daily Mail summarization dataset [See et al., 2017], to investigate how useful our proposed VN-MPP is in domains other than stories. ROCStories is typically used in SEG [Guan et al., 2019, Li et al., 2018, Zhao et al., 2018]. Similarly, Wang and Wan [2019] used ROCStories for their story completion task. The dataset is also used for controllable story generation [Peng et al., 2018]. It has been pointed out that the Story Cloze Test dataset (SCT-v1.0), published together with ROCStories, has a bias: there is too large a difference between the right and wrong endings, so classification performance can be improved in unintended ways. Sharma et al. [2018] proposed the SCT-v1.5 dataset to avoid this bias. However, we use ROCStories, the dataset used for training in SCT, which is not directly affected by the SCT-v1.0 bias. In addition, SCT-v1.5 places restrictions on workers when they write right/wrong endings in order to make the two-choice question adequately difficult. Of course, SCT-v1.5 is superior to SCT-v1.0 for conducting SCT, but we believe that SCT-v1.0 and ROCStories are closer to stories that humans write freely (although there are some rules, including the five-sentence restriction). As shown in Table 1, the dataset was randomly split in the ratio of 8:1:1 to obtain the training, development, and test sets, respectively. We removed sentences from each five-sentence story. The number of missing positions m was randomly decided with 0 ≤ m ≤ 2, based on a discrete uniform distribution. For the development and test sets, this removal procedure was performed when creating the dataset, to improve reproducibility. For the training set, we retained the original five-sentence story in the dataset and removed sentences randomly when reading the data during training. As a result, different sentences could be removed from the same story with different values of m, thus acting as data augmentation and preventing over-fitting. WritingPrompts is one of the commonly used datasets in the story domain [Fan et al., 2018]. We also considered using this dataset, but ultimately decided not to, for the following reasons. WritingPrompts consists of 303,358 stories paired with writing prompts collected from an online forum, Reddit. The average length of a prompt / story is 28.4 / 734.5 words, respectively. As it contains very long stories, it is generally used with trimming (retaining a predetermined number of words from the start and truncating the rest). In other words, it is difficult to handle the "whole story" as is. We believe that this creates a problem in learning VN-MPP, because VN-MPP is designed to learn by contrasting complete and incomplete texts in order to estimate which part of an incomplete text is incomplete. Hence, trimmed texts, i.e., texts from which certain parts have been deleted, are not suitable for training VN-MPP models. Although we propose VN-MPP mainly for creative support and the target is story-like text, the applications of VN-MPP and the proposed methods are not limited to stories.
To show this, we use a dataset from other domains. The CNN / Daily Mail summarization dataset contains online news articles paired with multi-sentence summaries. The average number of tokens of article /summary is 781 / 56, respectively. The original CNN / Daily Mail dataset was proposed to support supervised neural methodologies for machine reading and question answering [Hermann et al., 2015] , and Nallapati et al. [2016] modified the dataset to be used for summarization. The version we use for our task was proposed by See et al. [2017] . This version of the dataset is non-anonymized; this contrasts with the earlier versions, in which the data are anonymized. They also stated that the non-anonymized version is the favorable problem to solve because it requires no pre-processing. We also believe that the non-anonymized version is favorable for our task because it allows the ability to consider proper nouns to be evaluated. This time, we decided to use highlights instead of articles. This is because "highlights" are considered to be more important per sentence, i.e., if they are missing, it would be a big problem. Moreover, the purpose of using this dataset is to show the versatility of VN-MPP. In other words, (1) it can be applied to more than just stories (2) it can be applied even if the original text is not a five-sentence text. For the training set of the CNN / Daily Mail summarization dataset, we examined the mean and standard deviation (std) of the word length and sentence length of the highlights. We found that the mean word length is 54.7 and the std is 23.0. For the number of sentences, the mean is 3.68 and the std is 1.35. Thus, the highlights of the dataset contain variability and are useful in investigating the adaptability of VN-MPP. We use the original split as shown in Table 2 , We removed sentences from an article. The number of missing positions m was randomly decided to be 0 ≤ m ≤ min(9, #sentences), based on a discrete uniform distribution. This removal procedure was performed when creating the dataset, to improve reproducibility. From the perspective of addressing social media, we considered using the dataset crawled from Twitter, such as Coronavirus (COVID-19) Tweets Datasets (COV19Tweets Dataset) [Lamsal, 2020] . However, Twitter's terms of service prohibit redistributing tweets as is (because users may want to delete their own tweets). Therefore, the dataset contains the IDs of the tweets rather than the full text of the tweets. Given Twitter's concern for the rights of users to control their own tweets, we felt it was undesirable to include the original tweets retrieved from Twitter IDs in this paper. If only automatically generated tweets were included, it would be difficult to make a qualitative evaluation of the proposed task, such as whether completion is indeed achieved. Although we believe that it is important to handle Twitter, which is a representative example of microblogs, in the study of Social Media, we decided to avoid handling the corpus consisting of Twitter data in this study for this reason. To show the trend of word utilization in the dataset we used, we did a word cloud visualization. The former is a dataset of everyday stories, so we can see things like "went" and "friend." The latter is a news dataset, so "say" and "said" stand out, indicating that someone has said something. We experimented with the concatenation task of VN-MPP and SC. 
The hierarchical method that we proposed [Mori et al., 2020] together with the MPP task is based on the limited condition that "one sentence is missing from a five-sentence text," and it is difficult to apply it directly to the VN-MPP task. VN-MPP is a task that cannot be solved by the conventional methods used for tasks such as MPP, SC, and SEG. In this paper, we propose and evaluate the first two methods for solving VN-MPP. We first evaluated the proposed methods on VN-MPP with each dataset using BLEU, a standard evaluation metric for machine translation [Papineni et al., 2002]. BLEU is also used in the evaluation of tasks such as SEG; therefore, it can be considered useful in the evaluation of VN-MPP as well. Furthermore, for VN-MPP on ROCStories, which requires creative text generation, we conducted another evaluation with the recently proposed metrics described in Section 2.5. Concretely, we used UNION, BERTScore, and BLEURT. For UNION, we used the officially provided checkpoint trained with ROCStories. For BERTScore and BLEURT, we used their default models, "roberta-large" and "bleurt-tiny-128", respectively. For the two-module approach, we used BART-base or BART-large for each module; when combining the two modules, the base models are the same. For the end-to-end approach, we also used BART-base or BART-large. BART uses the standard Transformer architecture [Vaswani et al., 2017], except that the GeLU activation function [Hendrycks and Gimpel, 2020] is used instead of ReLU [Nair and Hinton, 2010] and parameters are initialized from $\mathcal{N}(0, 0.02)$. The BART-base model has 6 layers each for the encoder and the decoder; the BART-large model has 12 layers for each. We fine-tuned our models over three epochs on NVIDIA Tesla V100 GPUs. Specifically, we used one GPU for training on ROCStories, and four GPUs for training on the highlights of the CNN / Daily Mail summarization dataset. We used AdamW [Loshchilov and Hutter, 2019] optimization with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1\mathrm{e}{-8}$. We set the initial learning rate to $3\mathrm{e}{-5}$ and linearly decreased the learning rate from the initial value toward 0 on a 10-epoch schedule to avoid over-fitting. Transformer-based Seq2seq language models have greatly improved performance compared to conventional models in text-to-text tasks, especially in summarization and translation. To set up the training parameters, we mainly referred to the training settings of the summarization task, because we consider summarization to be more similar to story completion than translation is. The results of the evaluation with BLEU are shown in Tables 3, 4, 5, and 6. The average length of the generated text is also shown. Note that for the two-module approach, the VN-MPP module and the SC module were evaluated separately, using the input and output of each module. The results for the BART-base based models on ROCStories are presented in Table 3, and those for the BART-large based models on ROCStories in Table 4. In the same manner, the results for the models based on BART-base or BART-large and fine-tuned on the CNN / Daily Mail summarization dataset are shown in Tables 5 and 6, respectively.
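For reference, the training configuration described above corresponds roughly to the following sketch using Hugging Face Transformers and PyTorch; the checkpoint name, the `train_pairs` data pipeline, and the batch size are illustrative assumptions rather than the authors' exact code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizer, get_linear_schedule_with_warmup

model_name = "facebook/bart-base"  # or "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).cuda()

# `train_pairs` is assumed to be a list of (source_text, target_text) tuples
# produced by the preprocessing sketched earlier.
def collate(batch):
    src, tgt = zip(*batch)
    enc = tokenizer(list(src), padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(list(tgt), padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.999), eps=1e-8)
# Linear decay toward zero defined over a 10-epoch schedule, as in the settings above.
total_steps = len(loader) * 10
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

for epoch in range(3):  # fine-tuned for three epochs
    for batch in loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```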
[Table 8: generation examples on ROCStories. The example shows an input story about Serena planning a surprise party for her husband's birthday, together with the VN-MPP module output (marking the missing position), the SC module output, and the end-to-end output, both of which complete the story with a sentence such as "She was able to throw a surprise party for him."]

For the VN-MPP module, good results are obtained, especially for ROCStories, and high BLEU values are also obtained for the CNN / Daily Mail summarization dataset. Although not as good as the VN-MPP module, the SC module and the end-to-end model also show sufficient performance considering that their tasks include sentence generation. Table 7 shows the results of the evaluation with the other metrics. Note that for these metrics the two-module approach was evaluated as a whole, not separately. The reason is that UNION, in particular, is a measure for evaluating whether a text is story-like or not, and is inappropriate for evaluating the VN-MPP module, which outputs a text that includes special tokens. For the same reason, we do not use these metrics for CNN / Daily Mail. UNION is a binary classification model; hence, we show the ratio of output texts judged to be "story-like" in the test set (9817 instances), with a probability of 0.5 as the threshold. Only the two-module approach using BART-large has a slightly low UNION value; the other models have values above 0.9, indicating that story-like outputs can be generated (completed) by our methods. The values for BERTScore and BLEURT are close. Examples of the generated results on ROCStories and on CNN / Daily Mail are shown in Tables 8 and 9, respectively. Regarding the experimental results obtained using ROCStories, it can be seen that the number and positions of the missing sentences can be correctly predicted even when the number of missing sentences differs. Furthermore, in terms of meaning, sentences that are natural in the context of the story can be generated. On the other hand, on CNN / Daily Mail, the model failed to estimate the missing positions, or generated sentences that were close in mood but not necessarily in context.

[Table 9: Generation results of BART-base based fine-tuned models on CNN / Daily Mail. The example concerns the arrest of Michael Moschetto, with the VN-MPP module output, the SC module output, and the end-to-end output.]

Based on the task and methods proposed in Sections 3 and 4, we built a system that executes VN-MPP and SC. We named this system "COMPASS", which stands for a writing support system to COMPlement Author unaware Story gapS. As explained above, VN-MPP removes the constraints of LC-MPP and allows the number of sentences in the input to vary; more specifically, both the number of sentences in the original story and the number of missing sentences can vary. Figure 4 shows the appearance of this system. Moreover, we introduce beam search as the decoding algorithm for the system instead of greedy sampling.
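As an illustration of this decoding setup, a generation call with a user-selectable beam size and number of suggestions might look like the following sketch; the parameter values and the `model`/`tokenizer` variables (the fine-tuned seq2seq model and its tokenizer) are assumptions, not the system's exact code.

```python
def suggest_completions(incomplete_story: str, beam_size: int = 5, num_suggestions: int = 3):
    """Return several candidate completions using beam search instead of greedy decoding."""
    inputs = tokenizer(incomplete_story, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=beam_size,                               # user-selectable beam width
        num_return_sequences=min(num_suggestions, beam_size),
        early_stopping=True,
        max_length=256,
    )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
```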
As can be seen in [Bahdanau et al., 2015, Freitag and Al-Onaizan, 2017], beam search was already being used in Seq2seq models in the early days, when Seq2seq models using RNNs were proposed [Graves, 2012, Boulanger-Lewandowski et al., 2013, Sutskever et al., 2014]. By using beam search, multiple candidate sentences can be displayed. We made the beam size and the number of suggested sentences user-selectable, so that users can receive more interactive assistance with story completion. (Greedy sampling is also known as greedy search or the greedy algorithm.) Based on the user input, our system can identify the parts to be completed in incomplete stories and generate candidate sentences for completion. In addition to inputting arbitrary sentences, the user can also use pre-prepared example sentences, which makes it easy to test the system's usability. In the prototype system we have already made publicly available (https://github.com/mil-tokyo/mppsc-demo), the input is limited to "four sentences with one sentence missing from a story consisting of five sentences," and only one sentence is generated as a candidate. The system using VN-MPP, which is an extension of LC-MPP, removes these limitations and can estimate multiple missing locations for variable-length inputs. In addition, we developed a new two-module v2 system that enables us to present multiple candidates by beam search.

Figure 4: The VN-MPP + SC demo system for human story writing assistance. It estimates the missing positions of a given incomplete story and generates and presents sentences to complete the story.

We also added functions that display information that may be useful to users. One such function evaluates the story-likeness of the completed text; for this function, we use UNION [Guan and Huang, 2020] as the metric. Note that we trained our own UNION model using DistilBERT [Sanh et al., 2019] instead of BERT, which was used in the original implementation. Another function visualizes the emotions of the reader. In our previous study [Mori et al., 2019b], we proposed "Emotional Flow" and showed the importance of emotions, especially multi-perspective and multi-dimensional emotions. Considering emotions is a well-known approach in storytelling; hence, we include emotion visualization as an essential part of our system. To predict emotions from the input and output text, we fine-tuned BERT on EmoBank, a text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance (VAD) scheme [Buechel and Hahn, 2017a,b]. This dataset contains annotations from two perspectives, the reader perspective and the writer perspective; we focus on the annotations from the reader perspective. As our intention is to focus on stories, we selected the "fiction" and "essays" categories, considering the characteristics of each category. Using the predicted Valence and Arousal values, we draw the Emotional Flow.
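A rough sketch of such an emotion predictor is given below: a pre-trained BERT encoder with a two-dimensional regression head for Valence and Arousal, fine-tuned on reader-perspective EmoBank annotations. The checkpoint name and training details are assumptions for illustration, not the authors' exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Two regression targets: Valence and Arousal (reader perspective).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, problem_type="regression"
)

def predict_valence_arousal(sentence: str):
    """Predict (valence, arousal) for one sentence; applied per sentence to draw an Emotional Flow."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        valence, arousal = model(**inputs).logits.squeeze(0).tolist()
    return valence, arousal

# During fine-tuning, each EmoBank example would provide float labels, e.g.
#   model(**tokenizer(text, return_tensors="pt"), labels=torch.tensor([[v, a]]))
# and the MSE loss is computed automatically because problem_type="regression".
```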
[Figure 5: Preprocessing the dataset in two-module v2 and the correspondence between the input and output of each module, illustrated with the same example story about Jennifer's exam: (1) an original story, (2) an incomplete story with missing-position information (special tokens), (3) an incomplete story without that information, and (4) the target sentences concatenated with another special token. Module 1 performs VN-MPP.]

Although the end-to-end method is considered the mainstream in the field of machine learning, in the first method we deliberately divided the modules by task so that the output of VN-MPP can be confirmed independently. In addition, because various methods for SC have been proposed and more are expected to be developed in the future, it is also advantageous that our proposed VN-MPP module can be easily combined with these methods. As a comparison with Figure 1, we show the new dataset preprocessing method and the input/output of the two modules in Figure 5. As before, the same process is used to create a version of the input in which some sentences are replaced by the special token, and an incomplete story that does not contain any information about the missing positions. However, the new method additionally uses the original sentences that were replaced by the special token, i.e., the missing sentences themselves. The problem is that the number of missing sentences is variable in VN-MPP and, if the missing sentences were generated separately, the number of generation steps would depend on the input, which makes batch learning difficult. Therefore, we introduce yet another special token. Specifically, we add this second special token to the beginning of each missing sentence, so that all the missing sentences can be treated as a single target sequence. Then, after generating the missing sentences, we use the token as a guide to split the generated sequence and retrieve the complementary sentences. This allows the new Module 2 to generate only the completion sentences, and to be controlled without affecting the rest of the system. (The implementation of this method contained a misspelling of this second special token. However, the token is registered in the tokenizer as an independent symbol, and its behavior is not affected by the string inside it. Additionally, the proposed system does not display the token, so users never see the typo. Therefore, this typo affects neither the evaluation experiment nor the claims in this paper.) The front-end and back-end of this web application are written in Python using Streamlit (https://streamlit.io/). For the machine learning model implementation, we mainly use PyTorch and Hugging Face Transformers. To verify the usefulness of our proposed system from the viewpoint of creative writing assistance, we conducted a user study. Specifically, we asked professional storytellers in Japan to evaluate it.

Figure 6: Initial state of the Japanese version of the proposed system.

When a user accesses the system, the screen shown in Figure 6 is presented. Then, the user enters an arbitrary text, and the system returns the missing positions, as shown in Figure 7. In addition, candidate sentences to insert into the missing positions are presented (Figure 8). The Emotional Flow is also displayed, allowing the user to see the kinds of emotions the sentence will evoke in the reader.
The user can adjust the parameters to find desirable candidate sentences while considering the emotion the user wants to evoke in the reader (Figure 9). The example input used in the series of images and its English translation are as follows:

Figure 7: The user inputs an arbitrary text and presses the "Enter" or "Return" key, and our VN-MPP is executed. The input is repeated and the result of VN-MPP is displayed. In the example in this figure, two missing positions are predicted.

• その秘密を抱えて過ごす日々のなか、彼が運命的に出会ったのが、一学年下の少女・朱莉だった。実は彼女も同じ糸を見られると打ち明けられた肇。"糸"の秘密をめぐる共犯者として、しだいに二人は惹かれ合っていくが……。
• (English translated version) While spending his days with this secret, he fatefully meets a girl one year younger than him, Akari. Hajime is told that she can also see the same strings. As accomplices in the secret of the "strings," they gradually become attracted to each other.

The example input is a modified version of the synopsis of the following work: 『僕らふたりに運命の糸は』 (霧友 正規, KADOKAWA 富士見L文庫, 2019). We artificially made the following information missing: Hajime is a high school student who can see the "red thread of fate" of anyone he touches. As shown in Figure 9, our system can generate the concrete content equivalent to "he" and "this secret" before the pronoun and the demonstrative adjective appear.

Figure 8: Candidate sentences to insert into the missing positions are presented. The Emotional Flow is also displayed, allowing the user to see the kinds of emotions each sentence will evoke in the reader.

Figure 9: The user can adjust the parameters to find desirable candidate sentences. In Candidate 2, the protagonist, "肇 (Hajime)," is said to be a "高校二年生 (sophomore in high school)," which matches the setting of the original work used for the input sentence, and in Candidate 3, a college-student setting is suggested because of the "大学の卒業式 (college graduation ceremony)."

6.2 Development of the Japanese-version System for User Study

To handle Japanese input, we use mBART50, a multilingual Sequence-to-Sequence model proposed by Tang et al. [2020]. They state that mBART, referred to as an example of previous multilingual models, has been trained on a variety of languages, but that the multilingual nature of the pre-training is not used during finetuning. They proposed "multilingual finetuning" as a replacement for bilingual finetuning and demonstrated large improvements. We use the pre-trained checkpoint "facebook/mbart-large-50" for finetuning on the Japanese-version tasks, VN-MPP and SC. To develop the Japanese version of our system, we needed a dataset of stories written in Japanese. We considered the following two approaches.

• Gather stories written in Japanese and use them as the Japanese dataset.
• Translate the English dataset into Japanese and use it as the Japanese dataset.

We chose the former approach and constructed a Japanese novel synopsis dataset: Narou-synopsis. "Shosetsuka ni Narou (小説家になろう)" is a web service that allows users to post their original novels. Users can post and read novels for free. The name of the service is a registered trademark of HinaProject Inc., and the company provides an API to obtain the metadata of the novels posted to the website. We use this API to retrieve the metadata of the novels. The available metadata include a synopsis, which we utilize as training data. We retrieved metadata for a total of 40519 novels from 21 genres.
The service assigns points to novels based on readers' responses, and we obtained up to 2,000 entries in each genre, in order of the highest total points. Some genres had fewer than 2,000 entries; hence, the total number was less than 42,000. As a second approach, we first attempted to add a translation module to the English version of the system that we developed earlier. Specifically, for each Japanese input, we obtained an English translation of the input sentence by means of a Japanese-English translation system. The system then evaluated it, and subsequently performed back translation between English and Japanese to obtain the Japanese output. However, we deemed this method unsuitable for novel texts because the back translation changes the style and nuance of the original text. Therefore, we attempted to use machine translation before learning; specifically, we used all of ROCStories machine-translated into Japanese as the training data. We named this dataset ROCStories-auto-Ja. For machine translation, we used mBART50 finetuned for multilingual machine translation, named "facebook/mbart-large-50-many-to-many-mmt." In this multilingual scenario, we discovered an advantage of the two-module approach: the VN-MPP module and the SC module can be trained on different datasets and then combined. The system using ROCStories-auto-Ja showed good performance in MPP. However, because the complementary sentences it generated were not originally written in Japanese, it was difficult for it to capture the context of proper nouns. Although the MPP validation score on Narou-synopsis was lower than that on ROCStories-auto-Ja, the SC module trained on Narou-synopsis was able to generate sentences typical of modern Japanese entertainment novels for young people. Therefore, for the user study we used a system that combines the VN-MPP module trained on ROCStories-auto-Ja and the SC module trained on Narou-synopsis. The model used for the translation was "facebook/mbart-large-50-many-to-many-mmt," the same checkpoint of mBART50 that we used for creating ROCStories-auto-Ja. We asked four professionals in the creative writing field to evaluate our proposed system. We prepared three systems: our proposed system and two comparison systems. The comparison systems were designed and implemented without MPP ability.

• System A: COMPASS (Proposed System)
• System B: "Always Add Last" System
• System C: "Random MPP" System

The order of the three systems was determined randomly. All three systems were identical in appearance, except for the letter assigned to each system. "Always Add Last" is a system that always determines that the ending is missing. This comparison system was designed with reference to Write With Transformer, a text auto-completion system in which causal language models such as GPT-2 are used to generate subsequent text. Moreover, the appearance of the system was aligned with that of our proposed system for a fair comparison. In "Random MPP," a randomly selected position is presented as the "missing position" instead of the output of Module 1 (VN-MPP). Random seeds were set so that the same input always returns the same result. The design of the "Random MPP" system was the most difficult part. A system that "points out different positions as missing for the same input each time it is executed" could also be considered; this would be the equivalent of advice from someone who changes their opinion from time to time. However, as our intention was not to frustrate the user, we designed "Random MPP" as described above.
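A deterministic "Random MPP" baseline of this kind can be sketched as follows; seeding the random generator with a hash of the input is our assumption about one way to make the suggested position reproducible for the same input, not a description of the authors' implementation.

```python
import hashlib
import random
from typing import List

def random_mpp(sentences: List[str]) -> int:
    """Return a pseudo-random 'missing position' (0 .. len(sentences)), fixed for identical input."""
    digest = hashlib.sha256("\n".join(sentences).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))    # same input -> same seed -> same position
    return rng.randint(0, len(sentences))   # position len(sentences) means "add at the end"
```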
As the computing environment for executing the systems, we used Amazon Elastic Compute Cloud (Amazon EC2) of Amazon Web Services (AWS). To ensure fairness, we executed the three systems on equivalent virtual servers; on Amazon EC2, we chose the "g4dn.xlarge" instance for the virtual server. We had the evaluators view an instructional video explaining how to use the systems. For this instructional video, we used an earlier version of the proposed system instead of the version used for the evaluation, i.e., System A; otherwise, evaluators could have identified the proposed system by matching it with the examples shown in the video. Then, we had the evaluators access and use the three systems and fill out a questionnaire with their ratings. This experiment was designed to allow the evaluators to participate remotely. The questions we asked the users to answer are shown in Tables 10 and 11. The questions in Table 10 constituted the first half of the questionnaire, and involved user evaluation and comparison of the systems. The questions in Table 11 constituted the remaining part, and sought to determine how the users create stories and what they desire from creative support systems. Note that the original questionnaire we used was written in Japanese. Here, we discuss the evaluation results obtained from the user study. All comments in the free writing field were written in Japanese. We have included the original responses by the evaluators as they were written; the English translations are given by us for reference purposes. In the questionnaire, we first asked each evaluator to individually rank the systems as best, second best, and worst based on their opinion. We also asked each of them to write the reason for their ranking. Table 12 shows the collected answers. The responses varied, but overall System A, i.e., our proposed system, received the highest rating. To make the results easier to understand at a glance, a bar plot is presented in Figure 10. The evaluator who said that C was the best stated that diversity was more important than accuracy when proposing missing positions; therefore C, which randomly gave out more missing positions, was probably more suitable for this evaluator than A, which does not point out missing parts unless necessary. The evaluator who said that B was best emphasized the importance of being able to develop the story, and it is likely that B, which generates a continuation, was more appropriate support for this person than the other systems. We also asked the evaluators to rate each module of each system on a 5-point scale. The results are shown in Figure 11. Regarding the MPP modules, the proposed method received the highest evaluation. It is interesting to note that although the same SC modules were used in the three systems, their evaluations differed significantly. For each system, we asked the evaluators about the usefulness of each function. In this question, evaluators were allowed to choose as many functions as they thought to be useful. They were also asked to write the reason for their answer. As shown in Table 10, the choices of functions included the following:

• Predict where to add - MPP
• Propose complementary sentences - SC

[Table 10: First half of the questionnaire used in the user study, including a free-form item asking evaluators to share any story created with the system that they considered good enough to publish. With "Others," a free description box was provided.]
Regarding the question about useful function, if the user felt that there was nothing useful, we asked that they select "Other" and indicate that. "*" in the "Required" row indicates that the answer is required. Although the terms MPP and SC were not communicated to the evaluators, the correspondence with their choices is shown for reference. For Emotional Flow, the name was given to the evaluator as shown in Table 8 . The result of the votes are shown in Figure 12 , and the reason why they consider functions useful/not useful are shown in Table 13 . From the reasons given the evaluators, it is clear that different evaluators had different thoughts about the functions. Some of evaluators found the quantification of story-likeness interesting, whereas others were uncomfortable with the fact that it was not clear how it was being evaluated. In addition, some of them focused on emotion visualization; we anticipate that our Emotional Flow will be able to meet the needs of such users when valence and arousal estimation become more accurate. On the other hand, the complementary sentences were generally rated low. This suggests that, from a professional novelist's point of view, the model trained on the current dataset does not generate sufficiently good sentences. 6.6 Future Direction 6.6.1 Copyright and Privacy In the evaluation experiment, the proposed system and the comparison systems were executed on the (virtual) server managed by us, who conducted the experiment. This is because we avoided sharing the source code with the evaluators in order to prevent them from knowing which of the three systems, A, B, and C, was the proposed system. However, when considering the operation of the proposed system in the real world, running it as a stand-alone system in the hands of each user is an important topic to consider. Privacy is an important issue in the exchange of information via the Internet. In particular, in the case of creative writing support systems, the nature of supporting the writing of works that have not yet been released to the world means that copyright must be carefully handled alongside with privacy. When the system operator collects the creative works input by the users, it is necessary to establish rules to protect the users' rights, and the users must be fully convinced and assured that their rights are being protected. Alternatively, one possible solution is to make the system a stand-alone system where no one but the user oneself can see the input. Our proposed system has also been verified to work on CPUs. It can be run in environments that are not rich in computing resources, although the response time experienced by the user will be longer. Although it may conflict with the considerations mentioned above of privacy and copyright, gathering Japanese story data and constructing a significant and high-quality dataset is an essential part of the system's future development. In this study, we developed a Japanese version of the system in order to have professional creators who work in Japanese evaluate the system. However, there was a major difficulty in doing so, especially regarding the lack of datasets. As an alternative approach, we have successfully trained a module of MPP by translating the entire English novel dataset into Japanese. However, it was also confirmed that applying the same method to the SC module was undesirable because it would destroy the style and nuance of the text. 
Indeed, some of the evaluators complained about the generated complementary sentences, as shown in Table 13. We should consider improving the method so that it learns better sentence expressions from a small amount of data; however, it is also important to develop high-quality, large-scale datasets. The understanding and cooperation of the people who actually produce such data, i.e., professional creators, are essential for realizing such datasets.
In the evaluation experiment, all evaluators were asked to try the same three systems. As a result, it became clear that each evaluator preferred a different system. Overall, the proposed system was rated best, but the evaluations were uneven.

Reasons to consider functions as useful/not useful

System A (COMPASS):
- ストーリーらしさの数値化は面白かった。加筆すべき箇所はラストに来ることが多かったが、その他の箇所に来ることもあり、自分だったら何を入れるかを検討するのに役だった。補完文の提案内容はつたなかった。 (It was interesting to quantify the story-likeness. The parts that needed to be added were often in the last part of the story, but sometimes they were in other parts, which helped me to think about what I would add. The suggestions for complementary sentences were not so good.)
- 調整用パラメータはいじれば提示も変わってくるものの、どう影響しているのかが分かりにくい。ストーリーらしさの数値化は正直なところそれの持つ意味がよく分からない。何をもってストーリーらしいと評価しているのか等。読者の感情の可視化はもう少し読み解きやすい形式があると助かるのではないか。以上の点は全てのシステムで同じように感じたため、以降の設問においては回答を省略させていただく。 (The parameters for adjustment can be tweaked to change the presentation, but it is hard to see how they affect it. To be honest, I do not really understand the meaning behind the quantification of story-likeness; on what basis does the system evaluate story-likeness, and so on. It would be helpful if the visualization of the reader's emotions were in a format that is easier to read and understand. Since I felt the same way about these points in all systems, I will omit my answers to the following questions.)
- それが一文だけでなく長期的に視覚化できるようになると、物語構築が立てやすい。 (If it can be visualized in the long term, not just for one sentence, it is easier to build the narrative structure.) (* "it (それ)" in this comment refers to "Predict and visualize the reader emotions.")

System B:
- 感情変化の可視化は作家にとって有用な情報。 (Visualization of emotional changes is useful information for writers.)
- ストーリーらしさの数値化は、評価のクオリティが高まれば第三者的な視点として役に立つと思う。 (I think the quantification of story-likeness can be useful as a third-party perspective if the quality of the evaluation is high enough.)

System C:
- 感情は有益な情報。 (Emotions are useful information.)
- 省略すると言いつつ最後に一つ思い出したので書いておくと、自分はマウスのホイールでページを上下に送ろうとするタイプで、マウスカーソルが読者感情の可視化のグラフのところにある時にその動作をしようとするとグラフが拡大縮小されてしまいページを上下出来ないことに若干の不便を感じた。 (Although I said I would omit my answers, I remembered one thing at the end, so I will note it here: I am the type of person who moves the page up and down with the mouse wheel, and when the mouse cursor is on the graph of the reader-emotion visualization, the graph zooms in and out instead, so I cannot scroll the page. I found this slightly inconvenient.)
- 加筆すべき箇所の提案が多かったのは、自分が改めて「そこになにか入れられないか?」と考え直すきっかけに出来るように思う。補完文に関してはそのまま活かせる印象はなかった。 (The many suggestions of places to add text gave me a chance to reconsider, "Could I put something in there?" As for the complementary sentences, I did not have the impression that I could use them as they were.)

Table 13: Reasons given by evaluators for selecting the functions as useful for story creation.

What each evaluator wants from a creative support system also points in a different direction: what is desirable for one author may not be desirable for another. Therefore, personalization will play an important role in the future direction of creation support systems. We did not conduct a long-term experiment because it would have been a burden on the subjects. However, it is possible to adapt the output of the system to a user based on the user's input data and feedback on the output. Further fine-tuning and automatic optimization of the parameters for output adjustment may be considered. With the user's permission, if the system can use the person's previous writings as training data, it should be able to generate sentences that better match the user's own writing style.
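As a rough sketch of what such personalization could look like in practice (the checkpoint name, hyperparameters, and example texts below are illustrative assumptions, not a description of COMPASS itself), a generative module could be fine-tuned on a user's previous writings:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Illustrative assumption: "gpt2" stands in for whatever pretrained generator the
# system uses; for Japanese, a Japanese checkpoint would be substituted.
checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The user's previous writings, collected with explicit permission (toy examples here).
user_texts = [
    "The lighthouse keeper counted the ships as if they were her own heartbeats.",
    "He wrote letters he never sent, one for every storm that passed the coast.",
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = Dataset.from_dict({"text": user_texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="personalized-generator",
                           num_train_epochs=3,
                           per_device_train_batch_size=2,
                           learning_rate=5e-5),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the fine-tuned model now leans toward the user's style
```

With only a handful of texts per user, such fine-tuning would overfit quickly, so in practice parameter-efficient adaptation or strong regularization would likely be needed.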
7 Conclusion
To overcome the issue of conventional SC tasks requiring information regarding the position of the missing part in a story, we previously proposed MPP, a task to predict that position from the given incomplete story. Specifically, we proposed MPP with limited conditions (LC-MPP) in [Mori et al., 2020]. In this paper, we proposed an updated version called Variable Number MPP (VN-MPP). In LC-MPP, it is known in advance that the input story has a missing position and that there is exactly one such position. In reality, however, an input story may already be complete, that is, the missing position k may be null; furthermore, there may be multiple missing positions, that is, k may take multiple values. We thus proposed VN-MPP as a task closer to this more realistic setting. Furthermore, we proposed two novel methods for VN-MPP and Story Completion (SC): the two-module approach and the end-to-end approach. Our proposed methods cope not only with the variability in the number of missing sentences but also with the variability in the number of sentences in the input. Based on the proposed MPP task, we developed a story writing support system, COMPASS. Further, we created a Japanese version of the system and conducted an evaluation experiment involving professional evaluators who are currently engaged in creative activities. The results confirm the usefulness of our MPP-based writing support system. By having the evaluators actually use the system, we were able to obtain concrete answers regarding their desiderata for creative support systems. Based on their feedback, we will further develop the system to make it more useful for creative support.
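As a final, concrete illustration of the VN-MPP setting summarized above, the minimal sketch below shows one possible input/output contract for such a predictor; the per-gap scoring and threshold are purely illustrative assumptions and do not describe our actual modules.

```python
from typing import List

def select_missing_gaps(gap_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Hypothetical VN-MPP output step: given one 'missing-ness' score for each of
    the len(sentences) + 1 candidate gaps of an input story (before the first
    sentence, between consecutive sentences, and after the last sentence),
    return the gap indices judged to be missing. Unlike LC-MPP, the result may
    be empty (the story is already complete) or contain several positions."""
    return [i for i, score in enumerate(gap_scores) if score >= threshold]

# Toy example with a 3-sentence story, i.e., 4 candidate gaps (indices 0..3).
scores = [0.05, 0.81, 0.12, 0.64]            # hypothetical model outputs, one per gap
print(select_missing_gaps(scores))           # -> [1, 3]: two sentences appear to be missing
print(select_missing_gaps([0.1, 0.2, 0.1]))  # -> []: no missing sentence
```

The essential difference from LC-MPP is visible in the return type: the list of predicted positions may be empty, a single index, or several indices.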
References
Exploring the effect of author and reader identity in online story writing: the STORIESINTHEWILD corpus
Neural machine translation by jointly learning to align and translate
Audio chord recognition with recurrent neural networks
EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis
Readers vs. writers vs. texts: Coping with different perspectives of text understanding in emotion annotation
Narrative incoherence detection
The Hero with a Thousand Faces
The price of debiasing automatic metrics in natural language evaluation
Unsupervised learning of narrative event chains
Story comprehension for predicting what happens next
Creative writing with a machine in the loop: Case studies on slogans and stories
BERT: Pre-training of deep bidirectional transformers for language understanding
Enabling language models to fill in the blanks
A simpler and more generalizable story detector using verb and character features
Hierarchical neural story generation
Strategies for structuring story generation
The Screenwriter's Workbook, Revised Edition
Aspects of the novel
Beam search strategies for neural machine translation
Neural language generation: Formulation, methods, and evaluation
Survey of the state of the art in natural language generation: Core tasks, applications and evaluation
Computational approaches to storytelling and creativity. AI Magazine
Plan, Write, and Revise: an interactive system for open-domain story generation
Sequence transduction with recurrent neural networks
UNION: An unreferenced metric for evaluating open-ended story generation
Story ending generation with incremental encoding and commonsense knowledge
A knowledge-enhanced pretraining model for commonsense story generation
Unifying human and statistical evaluation for natural language generation
Teaching machines to read and comprehend
Learning to write with cooperative discriminators
A survey of deep learning applied to story generation
INSET: Sentence infilling with INter-SEntential transformer
Unsupervised hierarchical story infilling
Automatic novel writing: A status report
A theme-rewriting approach for generating algebra word problems
Design and analysis of a large-scale COVID-19 tweets dataset
Creating characters in a story-telling universe
Story-telling as planning and learning
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
Story generation with crowdsourced plot graphs
Generating reasonable and diversified story ending using sequence to sequence model with adversarial training
How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation
Multilingual denoising pre-training for neural machine translation
Decoupled weight decay regularization
Serious storytelling - a first definition and review
Storynomics: Story-Driven Marketing in the Post-Advertising World. Twelve
Tale-spin, an interactive program that writes stories
Toward a better story end: Collecting human evaluation with reasons
How narratives move your mind: A corpus of shared-character stories for connecting emotional flow and interestingness
Finding and generating a missing part for story completion
A corpus and cloze evaluation for deeper understanding of commonsense stories
Rectified linear units improve restricted Boltzmann machines
Abstractive text summarization using sequence-to-sequence RNNs and beyond
Why we need new evaluation metrics for NLG
Advertising plot generation system based on comprehensive narrative analysis of advertisement videos
BLEU: A method for automatic evaluation of machine translation
PyTorch: An imperative style, high-performance deep learning library
Towards controllable story generation
Counterfactual story reasoning and generation
Improving language understanding by generative pre-training
Language models are unsupervised multitask learners
Writing Stories with Help from Recurrent Neural Networks
Leveraging pre-trained checkpoints for sequence generation tasks
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Get to the point: Summarization with pointer-generator networks
BLEURT: Learning robust metrics for text generation
Tackling the story ending biases in the story cloze test
SAVE THE CAT! The Last Book on Screenwriting You'll Ever Need
Sequence to sequence learning with neural networks
Multilingual translation with extensible multilingual pretraining and finetuning
Attention is all you need
A neural conversational model
Kurt Vonnegut on the shapes of stories
Narrative interpolation for generating and understanding stories
Transformer-based conditioned variational autoencoder for story completion
The strong story hypothesis and the directed perception hypothesis
Transformers: State-of-the-art natural language processing
MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models
XLNet: Generalized autoregressive pretraining for language understanding
Plan-and-Write: Towards better automatic storytelling
BERTScore: Evaluating text generation with BERT
From plots to endings: A reinforced pointer generator for story ending generation