Relation/Entity-Centric Reading Comprehension
Takeshi Onishi
2020-08-27

Constructing a machine that understands human language is one of the most elusive and long-standing challenges in artificial intelligence. This thesis addresses this challenge through studies of reading comprehension, with a focus on understanding entities and their relationships. More specifically, we focus on question answering tasks designed to measure reading comprehension. We focus on entities and relations because they are typically used to represent the semantics of natural language.

In Chapter 1, we overview the history of reading comprehension tasks and their various styles. We also differentiate reading comprehension tasks from other question answering tasks. Then, we present entities and their relations in the context of reading comprehension tasks. In Chapter 2, we present an original reading comprehension dataset. We used baseline systems and a sampling approach to control the difficulty of the dataset. As a result, the dataset achieves high human performance and low machine performance, and the gap indicates that the dataset provides questions that require deep understanding of texts. In Chapter 3, we analyze neural network models for reading comprehension tasks and show that the vector representations learned in these models can be understood as being composed of a predicate applied to entities. In Chapter 4, we apply our findings from Chapter 3 to another reading comprehension dataset focusing on entities and their relations. We propose a transformer encoder-based model and show that the model achieves higher development accuracy than other models with a similar number of parameters. In Chapter 5, we present our work on relation extraction, a task for predicting a relation between two given entities described in a text resource.

My special thanks go to the late Mr. Tatsuro Toyoda, the founder of the Toyota Technological Institute. I really appreciate the opportunity to study at the Toyota Technological Institute at Chicago and in Japan, and it is a great honor for me to be the first student to graduate from both Toyota Technological Institutes.

In many areas of engineering, it is our dream to create a machine that is more productive than a human, can tolerate working for longer, and releases workers from tedious tasks. In Artificial Intelligence (AI) we seek a machine that has the intelligence of humans. Here, intelligence might include the ability to understand images, understand speech, and read texts. The ability to read texts is studied in Natural Language Processing (NLP), a field of study concerned with processing natural language texts, whose ultimate goal is to create a machine that understands natural language texts.
Although this ability is essential for the desired machine to communicate with humans as workers do, understanding texts is not a well-defined goal, and it is nontrivial to verify the ability. A classic approach to verifying the ability is the Turing test [6], where a tester talks with a machine or a human, and we see whether the tester can reliably tell the machine from the human. Although the test setting is convincing, there are two practical issues. First, the test cannot compare the intelligence of two given machines. The test verifies whether each machine has intelligence or not, and each machine produces a largely independent conversation, so it is difficult to compare the test results. Second, the test only verifies the existence of intelligence; it does not help to explain how the machine understands given texts.

A practical alternative is reading comprehension tasks, where a machine answers a question about a given passage rather than holding a conversation. Here it is important that answering the question requires information described in the passage, so we can see how much a machine understands the given passage by observing the answer the machine gives. In this setting, we can compare the abilities of machines by simply counting the number of correct answers given by each machine. Additionally, we are also interested in how the information in texts is represented inside the machines, especially deep neural network models, which are notoriously difficult to interpret. We approach this question by focusing on entities and their relations described in the texts, and we show that the vectors of neural readers can be decomposed into a predicate and entities. This dissertation therefore presents studies of reading comprehension tasks focusing on entities and relations. We believe that understanding how machines handle entities and their relations in a given passage helps further the study of machine reading comprehension and, eventually, contributes to the ultimate goals of AI.

A machine that understands human language is the ultimate goal of NLP. Understanding is a nontrivial concept to define; however, the NLP community believes it involves multiple aspects and has put decades of effort into solving different tasks for the various aspects of text understanding, including:

Syntactic aspects:
• Part-of-speech tagging: This is the task of assigning a syntactic category to each token in a sentence. Each token is identified as a noun, verb, adjective, etc. Figure 1.1 shows an example of part-of-speech tagging.
• Syntactic parsing: This is the task of finding syntactic phrases in a sentence, such as noun phrases and verb phrases. Figure 1.2 shows an example of syntactic parsing.
• Dependency parsing: A dependency is a relation between tokens where one token modifies another. Dependency parsing is the task of finding all dependencies in a sentence.

Semantic aspects:
• Named entity recognition: This is the task of finding named entities and their types in a sentence. Typical named entity types are "Person" and "Location".
• Coreference resolution: This is the task of collecting tokens that refer to the same entity. For example, Donald Trump can be referred to by "he", "Trump", or "the president." Figure 1.5 shows an example of coreference resolution.

A reading comprehension task is a question answering task that is designed to test all these aspects and to probe even deeper levels of understanding.
For example, a machine might find the candidate answers, (1) Robbie Keane and (2) Dimitar Berbatov, in the passage with named entity recognition and coreference resolution, and then find that "Dimitar Berbatov" is the best answer. A machine thus takes a passage and a question and returns an answer. Hence, a supervised training instance is a tuple of a passage, question, and answer. The passage is a text resource that provides enough information to find the answer, such as a news article, an encyclopedia article, or multiple paragraphs of such articles. The question is also a text resource, but it is much shorter than the passage. The answer style differs depending on the style of each reading comprehension task. Here, we divide existing reading comprehension tasks into three styles depending on their answer type.

• Multiple choice: In this style, a list of candidate answers is given along with each passage and question. Hence the answer is one of the candidate answers. The performance of a machine is evaluated by accuracy: the number of correct answers over the number of all questions.

• Span prediction: In this style, the answer is a span in the passage, i.e., the answer is a pair of a start token and an end token. This style is also referred to as extractive question answering. In the example question in Table 1.2, there are two occurrences of COVID-19 in the passage, but the answer is the second one. The performance of a machine is evaluated at the span level by exact matching (EM) and/or an F1 score, as illustrated in the sketch after this list. EM is the same as accuracy, where the predicted span is correct if and only if the sequence of words specified by the predicted span is the same as the sequence of words specified by the gold span. This matching scheme might be called string matching. The F1 score is the harmonic mean of precision and recall computed between the bag of tokens in the predicted span and the bag of tokens in the gold span:

Precision = |P ∩ G| / |P|,  Recall = |P ∩ G| / |G|,  F1 = 2 · Precision · Recall / (Precision + Recall),

where P and G are the bag of tokens in the predicted span and that in the gold span, respectively.

• Free-form answer: In this style, the answer can be any sequence of words in a vocabulary; thus, a machine generates the sequence to answer the given question.
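As a concrete illustration of the span-level metrics above, here is a small sketch of EM and bag-of-tokens F1 (illustrative helper functions, not from the thesis):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    # EM: the predicted span string must equal the gold span string.
    return pred.strip() == gold.strip()

def span_f1(pred: str, gold: str) -> float:
    # F1 over bags of tokens: precision = |P ∩ G| / |P|, recall = |P ∩ G| / |G|.
    pred_bag, gold_bag = Counter(pred.split()), Counter(gold.split())
    overlap = sum((pred_bag & gold_bag).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_bag.values())
    recall = overlap / sum(gold_bag.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("COVID-19", "COVID-19"))                               # True
print(round(span_f1("the COVID-19 outbreak", "COVID-19 outbreak"), 2))   # 0.8
```

When a question has several gold spans, the maximum EM and F1 over the gold spans is typically reported.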
The goal of other question answering tasks is to appropriately answer questions posed by humans, and reading comprehension skills matter less. Thus the machine may use any kind of information resource, including structured knowledge such as knowledge bases and unstructured knowledge texts such as encyclopedias, dictionaries, news articles, and Web texts. Additionally, the unstructured knowledge texts are longer than a passage and typically web-scale. These information resources demand less of the reading comprehension skills described above. For example, given access to a large text corpus, a simple grammatical transformation and string matching will likely suffice to answer a question like "who is the president of the U.S." Here the question can be grammatically transformed into a declarative sentence, "*** is the president of the U.S.", and the machine is then likely to find a sentence that matches the declarative sentence. On the other hand, the goal of reading comprehension is to understand a given (short) text; thus a machine uses only the given unstructured text. The texts are typically short and carefully written so that they require more reading comprehension skills. For example, multiple given passages might share some information.

Such shared information is called world knowledge, and some machines might be able to answer a question correctly without reading the given passage, by using the world knowledge written in other passages. This issue makes it difficult to tell whether the machine has reading comprehension skills. To avoid this issue, early work in this field mostly focused on fictional stories [13], because each fictional story has different characters and plots and is thus unlikely to share information with others. An early study [14] describes this difference using the terms micro-reading and macro-reading. Macro-reading is a task where the input is a large text collection, and the output is a large collection of facts expressed by the text collection, without requiring that every fact be extracted. Micro-reading is a task where a single text document is input, and the desired output is the full information content of that document.

Reading comprehension question answering is not new, and we can find early work from the 1970s. In this section, we review the history of three paradigms: development of the theory, rule-based systems, and deep learning systems.

Very early systems in the 1970s operated in very limited domains. For example, SHRDLU [15] is a computer program in which a user can move objects in a 3D computer graphics environment using English. LUNAR [16] is another computer program that answers questions about lunar geology and chemistry, and Baseball [17] answers questions about baseball. One of the most notable early works in the 1970s might be the QUALM system [13]. That work proposed a conceptual theory to understand the nature of question answering: it analyzed how humans classify questions, and the algorithm classified questions in a similar way to humans.

In the 1980s to 1990s, various rule-based systems were proposed for each domain. Here we describe a notable shared task and dataset. The dataset was proposed by Hirschman et al. [18] and consists of 60 stories for development and 60 stories for testing of 3rd to 6th grade material, and each story is followed by short-answer questions, i.e., who, what, when, where, and why questions. In the task, a machine takes each story and question and then finds a sentence in the story that most likely contains the answer key. Multiple rule-based systems were developed for this task. Deep Read [18] takes a bag-of-words approach with shallow linguistic processing, including stemming, name identification, semantic class identification, and pronoun resolution. QUARC [19] uses lexical and semantic correspondence, and Charniak et al. [20] combine them. As a result, these systems achieved 30-40% accuracy, i.e., they correctly predict a sentence containing the answer for 30-40% of questions.

From the 2010s, supervised learning models significantly improved their performance in various tasks, including reading comprehension tasks; some supervised learning models even surpassed human performance in some tasks [2]. These improvements were made possible by deep neural networks and large-scale datasets. A deep neural network is a scalable machine learning model, typically composed of "units". Each unit takes an input vector x and returns an output vector y by using a linear and a non-linear transformation:

y = f(Wx + b),

where W is a weight matrix, b is a bias vector, and f is a non-linear function.
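A minimal sketch of such a unit with a ReLU non-linearity (numpy, illustrative only):

```python
import numpy as np

def unit(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # One "unit": a linear transformation followed by a non-linearity f (here, ReLU).
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weight matrix
b = np.zeros(4)               # bias vector
x = rng.normal(size=3)        # input vector
y = unit(x, W, b)             # output vector
print(y.shape)                # (4,)
```

Stacking many such units, layer after layer, yields a deep network.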
The deep neural network is trained by a stochastic gradient descent algorithm, where a loss is computed on a subset of training instances and the gradient of the loss is computed with respect to the parameters of the deep neural network. The parameters are then updated in the direction opposite to the gradient:

θ ← θ − η ∇_θ L(X; θ),

where L is the loss function to be minimized, X is a subset of training instances called a mini-batch, θ is the parameters, and η is a learning rate. The stochastic gradient algorithm takes linear time in the size of the training data, and its memory requirement is linear in the size of the mini-batch. Thus neural network models can learn from arbitrarily large training data in linear time with the stochastic gradient algorithm. Larger training data provides more instances to learn from; hence scaling up training data is believed to be a promising approach in machine learning.

Here, we note the contribution of the World Wide Web (WWW) to large-scale training data. The WWW is an information system over the Internet where a document or web resource is identified by a Uniform Resource Locator (URL). People upload various kinds of texts to the WWW, including news articles, blog articles, and encyclopedia articles. The amount of text on the WWW was estimated at at least 320 million pages in 1998 [21], and at at least billions of pages in 2016 [22]. Naturally, these texts are computer-readable, unlike texts in books, and some of them are copyright-free. Hence the text on the WWW is a large, accessible text resource. Recently, the WWW has become a major resource for multiple standard reading comprehension datasets, including SQuAD, Wikihop, and HotpotQA [23, 5, 24]. Thanks to the large-scale datasets supported by the WWW and the scalable training algorithm, deep neural network models can absorb a significantly large amount of information from these datasets. As a result, these models perform better and better, and their performance is reaching human performance in some tasks [2].

The significant success of deep learning raises two questions.
• "What is a good question in reading comprehension tasks?"
• "How do these machines understand texts?"

Questions in reading comprehension tasks are designed to test reading comprehension skills, and each question requires these skills to solve. Today, as deep neural network models perform better and better, we are more and more interested in more complicated reading comprehension skills that go beyond NER, coreference resolution, and dependency parsing. Additionally, we need to feed millions of such questions to train the deep neural network models, and it is not realistic to write each question manually. To address this problem and provide millions of such comprehension questions, we take a sampling approach in Chapter 2.

Early systems were rule-based, and the mechanism of their text understanding is relatively explainable. For example, if a machine reads a given text by applying rules designed by a researcher, then the process can be explained by the sequence of rules that the machine used; this sequence explains how the machine understands the given texts. On the other hand, deep neural network models perform multiple vector transformations, and each transformation does not explicitly correlate with any grammatical or semantic rule. Thus, unlike rule-based systems, the sequence of these operations does not sufficiently explain how the machine understands the given text. We claim that entities and their relations can be a key to explainability in Section 1.2.
Then, we empirically analyze how neural network models understand texts by using entities and their relations in Chapter 3, and apply the findings to our novel neural reader in Chapter 4. In Chapter 5, we extract these entities and relations and visualize them for materials science.

We are interested in entities and their relations in the context of reading comprehension. In the following, we overview entities and their relations in the context of knowledge bases. Then, we describe reading comprehension datasets focusing on entities and relations, and also relation extraction from the point of view of reading comprehension.

Entities and their relations are well studied in the context of knowledge bases. A knowledge base such as WordNet [25] or Wikidata [26] is a structured database that typically represents its information using entities and their relations, as Fig. 1.6 shows for the relations around "John McCormick". Here, entities and their relations are defined for the information that is to be represented. Quine [27] stated that "To be assumed as an entity is [...] to be reckoned as the value of a variable" or "to be is to be the value of a variable". Hobbs [28], inspired by Quine, limited entity types to "physical object, numbers, sets, times, possible worlds, propositions, events". Naturally, their relations are also designed for the target information.

Entities and their relations are critical for solving questions in some reading comprehension question answering tasks. For example, each answer in the CNN/Daily Mail dataset [1] is an entity that satisfies the condition given by the question sentence. The dataset is cloze-style, where each question is a sentence whose key entity is blanked out; the question asks for the blanked entity to be found in the given passage. In other cases, each question of Wikireading [9] and Wikihop [5] consists of an entity and/or a relation. In Wikihop, each question is a pair of a subject entity and a relation, and the answer is an object entity that has the relation with the subject entity. In Wikireading, each question is a relation, the passage describes a subject entity, and the answer is an object entity that has the relation with the subject entity described in the passage.

We also consider question answering tasks whose answers are relations. These tasks are studied in the context of relation extraction in knowledge base population, described in Section 1.2.1. Relation extraction is a task for finding a relation between two given entities described in a text resource. It is worth noting that the task is different from relation classification. Relation classification is a task for finding a relation between two given entities described in a given text resource (typically a sentence) where the positions of these entities are given. In relation extraction, on the other hand, the positions are not given, and the text resource is typically longer than a single sentence. Thus, the task can be viewed as another reading comprehension task focusing on entities and relations in the text.

Entities and relations are critical for these tasks; however, we believe that entities and their relations are critical not only for these datasets but also for other datasets that implicitly require a machine to understand entities and their relations. In this section, we briefly overview how knowledge bases help various tasks, including question answering and information retrieval, and the motivation for knowledge base population, a task to fill a knowledge base from texts.
A knowledge base is often a critical component of an expert system. An expert system is typically composed of inference rules written by hand and a knowledge base, and it emulates the decision-making ability of a human expert. As it is sometimes difficult for the human expert to explain his/her decisions, it is difficult to design complicated inference rules, but it might be easier to add more knowledge to the knowledge base. The performance of each system heavily depends on the coverage of its knowledge base. Today, some large-scale knowledge bases are available, e.g., Freebase and Wikidata. Freebase started as a collaborative knowledge base whose data was accumulated by its community members. Freebase consists of 125M tuples of a subject entity, object entity, and their relation, whose topics spread over 4K types, including people, media, and locations [29, 30]. Wikidata is also a collaborative knowledge base, consisting of 87M entities, and most of these entities are linked to entities in sister projects such as Wikipedia; thus, it can provide extra information about these entities. Such large-scale knowledge bases help various tasks, including information retrieval and question answering, but the coverage of the knowledge base remains critical for performance. Despite the efforts of the community members who maintain these knowledge bases, their sizes are far from sufficient because new knowledge is emerging rapidly. On the other hand, we are more likely to be able to access textual information describing the new knowledge. Thus, we study knowledge base population to feed the knowledge base from texts.

Entity-centered reading comprehension dataset

Researchers distinguish the problem of general knowledge question answering from that of reading comprehension [1, 31], as described in Section 1.1.2. Reading comprehension is more difficult than knowledge-based or Information Retrieval (IR)-based question answering in two ways. First, reading comprehension systems must infer answers from a given unstructured passage rather than structured knowledge sources such as Freebase [29] or the Google Knowledge Graph [32]. Second, reading comprehension systems cannot exploit the high level of redundancy present on the web to find statements that provide a strong syntactic match to the question [33]. In contrast, a reading comprehension system must use the single phrasing in the given passage, which may be a poor syntactic match to the question.

In this chapter, we describe the construction of a new reading comprehension dataset that we refer to as Who-did-What (WDW) [7]. A typical example question reads:

Question: Sources close to the presidential palace said that Fujimori declined at the last moment to leave the country and instead he will send a high level delegation to the ceremony, at which Chilean President Eduardo Frei will pass the mandate to ***.

Each cloze question is formed from the first sentence of a news article (the question article). An information retrieval system is then used to select a passage with high overlap with the first sentence of the question article, and an answer choice list is generated from the person named entities in the passage. Our dataset differs from the CNN/Daily Mail dataset [1] in that it forms questions from two distinct articles rather than from summary points. This allows problems to be derived from document collections that do not contain manually-written summaries.
This also reduces the syntactic similarity between the question and the relevant sentences in the passage, increasing the need for deeper semantic analysis. To make the dataset more challenging, we selectively remove problems so as to suppress four simple baselines: selecting the most mentioned person, selecting the first mentioned person, and two language model baselines. This is also intended to produce problems requiring deeper semantic analysis. The resulting dataset yields a larger gap between human and machine performance than existing ones. Humans can answer questions in our dataset with an 84% success rate, compared to the estimates of 75% for CNN [2] and 82% for the CBT named entities task [31]. In spite of this higher level of human performance, various existing readers perform significantly worse on our dataset than they do on the CNN dataset. For example, the Attentive Reader [1] achieves 63% on CNN but only 55% on WDW, and the Attention Sum Reader [3] achieves 70% on CNN but only 59% on WDW. In summary, we believe that our WDW is more challenging and requires deeper semantic analysis.

Our WDW is related to several datasets for machine comprehension. In this section, we review notable reading comprehension datasets since the 1990s, including datasets developed after our WDW.

The Deep Read dataset [18] is an outstanding early reading comprehension dataset. The dataset consists of 60 development and 60 test simulated news stories of 3rd to 6th grade material. Each story is followed by short-answer 5W questions (who, what, when, where, and why questions), such as: "What is the name of our national library?", "When did this library burn down?", "Where can this library be found?", and "Why were some early people called 'men of the written tablets'?"

MCTest [34] is a multiple-choice dataset of fictional stories written to be understandable by seven-year-old children, with questions such as:

Question 1: What is the name of the trouble making turtle?
Candidate answers: a) Fries, b) Pudding, c) James, d) Jane
Question 2: What did James pull off of the shelves in the grocery store?
Candidate answers: a) pudding, b) fries, c) food, d) splinters
Answer 2: (a) pudding

These fictional stories and questions were written by Amazon Mechanical Turk crowd workers. Although the authors claim that their crowdsourcing approach is scalable, this dataset is too small to train models for the general problem of reading comprehension.

The bAbI synthetic question answering dataset [8] contains passages describing a series of actions in a simulation, followed by a question. For this synthetic data a logical algorithm can be written to solve the problems exactly (and, in fact, is used to generate the ground truth answers).

An anonymized cloze-style problem, as in the CNN/Daily Mail datasets, looks like the following:

Passage: ... a small aircraft carrying @entity5, @entity6 and @entity7 the @entity12 @entity3 crashed a few miles from @entity9, near @entity10, @entity11 ...
Candidate answers: 1) entity1, 2) entity2, 3) entity3, ...

SQuAD [23] is a span-prediction dataset whose answer is a span of text in the given document. A sample question is given in Table 2.6, for example:

Passage: ... Short, intense periods of rain in scattered locations are called "showers" ...
Question: What is another main form of precipitation besides drizzle, rain, snow, sleet and hail?
Question: Where do water droplets collide with ice crystals to form precipitation?

Questions and answer spans are written by crowd workers. In the dataset construction, a crowd worker writes five questions and their answer spans for each passage, which is a paragraph of a Wikipedia article shorter than 500 characters.
In addition to the answer span, two other crowd workers are given only the passage and question and predict the answer span. Thus, each question has at most three gold answer spans. The evaluation metrics are EM and F1, where F1 is computed between the bag of tokens in a gold answer span and the bag of tokens in the predicted span.

MS MARCO is a reading comprehension dataset with the aspect of macro-reading. The dataset consists of 100K questions sampled from user queries issued to a search engine. Each question comes with a passage, which is a set of approximately ten web pages retrieved by an information retrieval system. These questions and passages make the task more like a general question answering task than a reading comprehension task. Firstly, the passage is longer than in other datasets, whose passage is a paragraph or a news article. Secondly, it is unclear whether answering these web-query questions requires reading comprehension skills; e.g., we generally form a web query from keywords rather than a question sentence, to help keyword matching. These aspects make these questions more likely to be solvable by syntactic matching.

TriviaQA [36] is another reading comprehension dataset with the aspect of macro-reading. The dataset consists of 96K questions and 663K evidence documents. The questions and their answers come from 14 trivia and quiz-league websites. The answer type is free-form answer, and the evaluation metrics are EM and F1, following SQuAD. The evidence documents are the passages in our context and are collected from web pages and Wikipedia articles using a web search engine. It is worth noting that each question has multiple evidence documents to read, unlike SQuAD, where each question has one passage. Thus the passage is relatively long for each question, and the dataset therefore has the aspect of macro-reading.

NarrativeQA [37] is a medium-scale reading comprehension dataset consisting of 1.5K passages and 47K questions. The passages are from books or movie scripts, and the questions are written by crowd workers. In the dataset construction, the crowd workers write pairs of a question and an answer based solely on a given summary of the corresponding passage. The answer type is free-form answer, and the evaluation metrics are BLEU, Meteor, ROUGE, and the mean reciprocal rank (MRR). Here MRR is the average of 1/r, where r is the rank of the correct answer among the candidate answers.

HotpotQA [24] is a reading comprehension dataset requiring reasoning, where reasoning means providing a set of sentences explaining why the answer is selected. The dataset consists of 113K questions and passages. Each passage is a set of paragraphs from Wikipedia articles, and each question is written by a crowd worker. Additionally, the crowd worker picks supporting facts: sentences in the passage that determine the answer for each question. The dataset employs Joint F1 as an evaluation metric in addition to EM and F1. Joint F1 is computed as

P^joint = P^(ans) · P^(sup),  R^joint = R^(ans) · R^(sup),  Joint F1 = 2 · P^joint · R^joint / (P^joint + R^joint),

where P^(ans) and P^(sup) are the precisions of the answer span and the supporting facts, respectively, and R^(ans) and R^(sup) are the corresponding recalls. This evaluation metric forces machines to find not only the correct answer span but also the correct supporting facts.

Wikireading [9] is the largest reading comprehension dataset reviewed in this section, consisting of 19M question-answer pairs. The dataset is constructed from Wikipedia and Wikidata.
Wikipedia is a free online encyclopedia hosted by the Wikimedia Foundation that consists of more than 6 million articles. Wikidata is a collaboratively edited knowledge base hosted by the Wikimedia Foundation that consists of sets of tuples, i.e., (subject entity, relation type, argument entity). There are more than 7,000 relation types, including "instance_of" and "location", and most entities in Wikidata and entries in Wikipedia are linked to each other. In the dataset, each question is a pair of the subject entity and relation type of a tuple, and the answer is the argument entity of the tuple. The corresponding passage for the question is the Wikipedia article whose title is the subject entity. The answer type is free-form answer, and a machine is expected to predict the name of the argument entity. Again, the evaluation metrics are EM and F1, following SQuAD. The dataset is quite biased: the top 20 relation types cover 75% of the dataset.

We now describe the construction of our WDW in more detail. To generate a question, we first select a random article (the "question article") from the Gigaword corpus and take the first sentence of that article (the "question sentence") as the source of the cloze question. The hope is that the first sentence of an article contains prominent people and events which are likely to be discussed in other, independent articles. To convert the question sentence to a cloze question, we first extract named entities using the Stanford NER system [38] and parse the sentence using the Stanford PCFG parser [39]. The person named entities are candidates for deletion to create a cloze problem. For each person named entity, we then identify a noun phrase in the automatic parse that is headed by that entity. For example, in the sentence "President Obama met yesterday with Apple Founder Steve Jobs" we identify the two person noun phrases "President Obama" and "Apple Founder Steve Jobs". When a person named entity is selected for deletion, the entire noun phrase is deleted. For example, when deleting the second named entity, we get "President Obama met yesterday with ***" rather than "President Obama met yesterday with Apple founder ***". This increases the difficulty of the problems because systems cannot rely on descriptors and other local contextual cues. About 700,000 question sentences are generated from Gigaword articles (8% of the total number of articles).

Once a cloze question has been formed, we select an appropriate article as a passage. The article should be independent of the question article but should discuss the people and events mentioned in the question sentence. To find a passage, we search the Gigaword dataset using the Apache Lucene information retrieval system [40], using the question sentence as the query. The named entity to be deleted is included in the query and is required to be included in the returned article. We also restrict the search to articles published within two weeks of the date of the question article. Articles containing sentences too similar to the question in word overlap and phrase matching near the blanked phrase are removed. We select the best matching article satisfying our constraints. If no such article can be found, we abort the process and move on to a new question.

Given a question and a passage, we next form the list of choices. We collect all person named entities in the passage except unblanked person named entities in the question. Person named entities that are subsets of another, longer named entity are eliminated from the choice list.
For example, the choice "Obama" would be eliminated if the list also contains "Barack Obama". We also discard ambiguous cases where a part of the blanked NE appears in multiple choices in the list; e.g., if a passage has "Bill Clinton" and "Hillary Clinton" and the blanked phrase is "Clinton", then we discard it. We found this simple coreference rule to work well in practice, since news articles usually employ full names for initial mentions of persons. If the resulting choice list contains fewer than two or more than five choices, the process is aborted and we move on to a new question. (The maximum of five helps to avoid sports articles containing structured lists of results.)

After forming an initial set of problems, we then remove "duplicated" problems. Duplication arises because Gigaword contains many copies of the same article, or articles where one is clearly an edited version of another. Our duplication-removal process ensures that no two problems have very similar questions. Here, similarity is defined as the ratio of the size of the bag-of-words intersection to the size of the smaller bag.

Then we remove some problems in order to focus our dataset on the most interesting problems. We decided to remove questions that can be solved by a syntactic matching algorithm, a counting algorithm, or a simple heuristic algorithm, because we found that machine learning systems easily learned these techniques from such questions; thus, they were not appropriate for teaching and testing the deeper reading comprehension skills of these machine learning systems. We used the following two syntactic matching algorithms, a counting algorithm, and a heuristic algorithm as baselines to find such questions, and we remove questions to suppress their performance.

• First person in passage: Select the person that appears first in the passage.
• Most frequent person: Select the most frequent person in the passage.
• n-gram: Select the most likely answer to fill the blank under a 5-gram language model trained on Gigaword minus articles which are too similar to one of the questions in word overlap and phrase matching.
• Unigram: Select the most frequent last name using the unigram counts from the 5-gram model.

To minimize the number of questions removed, we solve an optimization problem defined by limiting the performance of each baseline to a specified target value while removing as few problems as possible, i.e.,

maximize_α Σ_C α(C) |T(C)|  subject to  Σ_{C : C_i = 1} α(C) |T(C)| ≤ kN for 1 ≤ i ≤ |b|,  0 ≤ α(C) ≤ 1,

where T(C) is the subset of the questions solved by the subset C of the suppressed baselines, α(C) is a keeping rate for question set T(C), C_i = 1 indicates the i-th baseline is in the subset, |b| is the number of baselines, N is the total number of questions, and k is the upper bound for the baselines after suppression. We choose k to yield random performance for the baselines. The performance of the baselines before and after suppression is shown in the corresponding table.

We report the performance of the following systems to characterize our dataset:
• Word overlap: Select the choice c which, inserted into the question q, is the most similar to some sentence s in the passage, i.e., maximizes CosSim(bag(c + q), bag(s)).
• Sliding window and Distance baselines (and their combination) from Richardson et al. [34].
• Semantic features: an NLP-feature-based system from Wang et al. [41].
• Attentive Reader: an LSTM with an attention mechanism [1].
• Stanford Reader: an Attentive Reader modified with a bilinear term [2].
• Attention Sum Reader: a GRU with a pointer-attention mechanism [3].
• Gated-Attention Reader: an Attention Sum Reader with gated layers [4].
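As a sketch of how the keeping rates α(C) in the suppression step above could be computed with an off-the-shelf linear-programming solver (the counts |T(C)| and the threshold k below are made up for illustration):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Hypothetical setup: 2 baselines, so subsets C range over {00, 01, 10, 11}.
# counts[C] = |T(C)|, the number of questions solved by exactly the baselines in C.
subsets = list(itertools.product([0, 1], repeat=2))
counts = np.array([3000.0, 2500.0, 2200.0, 1800.0])   # |T(C)| for each C (made up)
N = counts.sum()
k = 0.15                                              # target accuracy per baseline

# Maximize sum_C alpha(C)|T(C)|  <=>  minimize -counts . alpha
A_ub, b_ub = [], []
for i in range(2):                                    # one constraint per baseline
    row = [counts[j] if C[i] == 1 else 0.0 for j, C in enumerate(subsets)]
    A_ub.append(row)
    b_ub.append(k * N)

res = linprog(c=-counts, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(subsets))
print(dict(zip(subsets, np.round(res.x, 3))))         # keeping rate alpha(C) per subset
```

Questions in T(C) are then kept with probability α(C), which randomly removes just enough problems to push each baseline down toward the target accuracy.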
Several of the neural readers above leverage the frequency of the answer in the passage, a heuristic which appears beneficial for the CNN/Daily Mail tasks. It seems that our suppression of the most-frequent-person baseline more strongly affects the performance of these systems.

We presented a large-scale person-centered cloze dataset. The dataset is not anonymized, and each passage is raw text, which is not only natural but also easier to pre-process with syntactic and semantic parsers. In the dataset construction, we used baseline suppression, where we selected undesired questions with multiple baseline systems and randomly removed some of them. This approach can flexibly adjust the difficulty and quality of a dataset by replacing the baseline systems that select undesired questions.

Analysis of a neural structure in entity-centered reading comprehension

As we discussed in Section 2.1, several large-scale cloze-style reading comprehension datasets [1, 31, 7] have been introduced, and their large sizes enable the application of deep learning. Despite the significant performance of deep learning models, the prediction structure of these models is poorly understood. In this chapter, we present empirical evidence for the emergence of predication structure in a certain class of deep learning models for reading comprehension (neural readers): "aggregation" and "explicit reference" readers. Both kinds of readers work on the CNN/Daily Mail dataset, a dataset with anonymized entities. This work was published as the best paper at the 2nd Workshop on Representation Learning for NLP [42].

Before we explain the neural readers, we review the CNN/Daily Mail dataset, where entities are anonymized. This dataset consists of anonymized passages and questions in which named entities are replaced by anonymous entity identifiers such as "entity37". For example, the passage might contain "entity52 gave entity24 a rousing applause", and the question might be "X received a rousing applause from entity52"; the answer is then the most appropriate entity identifier in the passage to fill X. The same entity identifiers are used over all the problems, and a different identifier is assigned to an entity every time the passage and question are read. Thus, the entity identifiers are presumably just pointers to semantics-free tokens and do not have any semantic meaning. We will write entity identifiers as logical constant symbols such as c rather than strings such as "entity37".

"Aggregation" readers, including Memory Networks [8, 43], the Attentive Reader [1], and the Stanford Reader [2], use bidirectional LSTMs or GRUs to construct a contextual embedding h_t of each position t in the passage and also an embedding h_q of the question q. They then select an answer c using a criterion similar to

argmax_c Σ_t ⟨h_t, h_q⟩ ⟨h_t, e(c)⟩,

where e(c) is the vector embedding of the constant symbol (entity identifier) c. In practice the inner product ⟨h_t, h_q⟩ is normalized over t using a softmax to yield attention weights α_t. Here Σ_t α_t h_t can be viewed as a vector representation of the passage.

We argue that for aggregation readers, roughly defined by Equation (3.2), the hidden vectors h_t decompose into a sum of a statement (predicate) vector and an entity-pointer vector. The first inner product in Equation (3.3) is interpreted as measuring the extent to which the statement Φ_t[x] made at position t implies the question statement Ψ[x] for any x. The second inner product is interpreted as restricting t to positions talking about the constant symbol c. Note that the posited decomposition of h_t is not explicit in Equation (3.2) but instead must emerge during training.
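As a concrete illustration of this aggregation-style selection criterion, here is a small numpy sketch (random vectors and dimensions chosen only for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def aggregation_score(H, h_q, E):
    """H: (T, d) contextual embeddings h_t; h_q: (d,) question embedding;
    E: (num_candidates, d) output embeddings e(c) of the candidate entity identifiers."""
    alpha = softmax(H @ h_q)          # attention over passage positions
    passage = alpha @ H               # sum_t alpha_t h_t, a vector representation of the passage
    return E @ passage                # one score <passage, e(c)> per candidate

rng = np.random.default_rng(1)
H, h_q, E = rng.normal(size=(50, 8)), rng.normal(size=8), rng.normal(size=(4, 8))
print(int(np.argmax(aggregation_score(H, h_q, E))))   # index of the predicted entity identifier
```

Under the posited decomposition, the product ⟨h_t, h_q⟩⟨h_t, e(c)⟩ factors into a statement-matching term and an entity-matching term.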
We present empirical evidence that this structure does emerge. The empirical evidence is somewhat tricky, as the direct-sum structure that divides h_t into its two parts need not be axis-aligned and therefore need not literally correspond to vector concatenation.

"Explicit reference readers", including the Attention Sum Reader [3], the Gated-Attention Reader [4], and the Attention-over-Attention Reader [44], avoid Equation (3.2) and instead score each candidate answer by aggregating the attention over the positions that explicitly refer to it, as in Equation (3.12) below.

In this research, we have only considered anonymized datasets that require the handling of semantics-free constant symbols. However, even for non-anonymized datasets such as WDW, it is helpful to add features which indicate which positions in the passage are referring to which candidate answers. This indicates, not surprisingly, that reference is important in question answering. The fact that explicit reference features are needed in aggregation readers on non-anonymized data indicates that reference is not being solved by the aggregation readers. However, as reference seems to be important for cloze-style question answering, these problems may ultimately provide training data from which reference resolution can be learned.

Here we classify readers into aggregation readers and explicit reference readers. Aggregation readers appeared first in the literature, including Memory Networks [8, 43], the Attentive Reader [1], and the Stanford Reader [2]. Explicit reference readers, including the Attention Sum Reader [3], the Gated-Attention Reader [4], and the Attention-over-Attention Reader [44], were proposed later. In the following sections, we define aggregation readers more specifically by Equations (3.7) and (3.9), and explicit reference readers by Equation (3.12). The aggregation readers finally compute a probability distribution P over the answers with a softmax: the Stanford Reader scores each candidate against the passage representation with a bilinear term, while the Attentive Reader applies a multi-layer perceptron to the passage and question representations, and its answer distribution is defined over the full vocabulary rather than just the candidate answer set A.

Attention Sum Reader. The Attention Sum Reader scores each candidate answer a by summing the attention over its references:

P(a|p, q, A) = Σ_{t ∈ R(a,p)} α_t.   (3.12)

Here we think of R(a, p) as the set of references to a in the passage p. It is important to note that Equation (3.12) is an equality and that P(a|p, q, A) is not normalized to the members of R(a, p). When training with the log-loss objective, this drives the attention α_t to be normalized: to have support only on the positions t with t ∈ R(a, p) for some a.

Gated-Attention Reader. The Gated-Attention Reader [4] involves a K-layer biGRU architecture in which the question is re-encoded at each layer ℓ:

h_q^ℓ = [fGRU(e(q))_{|q|}, bGRU(e(q))_1],  1 ≤ ℓ ≤ K.   (3.14)

Attention-over-Attention Reader. The Attention-over-Attention Reader [44] uses a more elaborate method to compute the attention α_t. We will use t to range over positions in the passage and j to range over positions in the question. The model computes h = biGRU(e(p)) and h^q = biGRU(e(q)), turns pairwise passage-question scores into attentions α_{t,j} (normalized over t) and β_j (normalized over j), and finally sets

α_t = Σ_j β_j α_{t,j}.

Note that the final equation defining α_t can be interpreted as applying the attention β_j to the attentions α_{t,j}. This reader uses Equations (3.12) and (3.13).

In this section, we claim an emergent predication structure in the hidden vectors h_t that explains the high performance of aggregation readers. Intuitively, we think of the hidden state vector h_t as carrying both a statement (predicate) about the position and a pointer to the entity mentioned there. Formally, the decomposition of h_t into this predication structure is not necessarily axis-aligned. Rather than posit an axis-aligned concatenation, we posit that the hidden vector space H is a possibly non-aligned direct sum

H = S ⊕ E,

where S is a subspace of "statement vectors" and E is an orthogonal subspace of "entity pointers".
Each hidden state vector h ∈ H then has a unique decomposition as h = Ψ + e for Ψ ∈ S and e ∈ E. This is equivalent to saying that the hidden vector space H is some rotation of a concatenation of the vector spaces S and E. In this non-axis-aligned model, we assume emergent embeddings s(Φ) and s(a), with s(Φ) ∈ S and s(a) ∈ E. We also assume that the latent spaces are learned in such a way that the explicit entity output embeddings satisfy e_o(a) ∈ E.

This predication structure reflects the fact that a question asks for a value of x such that a statement Ψ[x] is implied by the passage. For a question Ψ we might even suggest a vectorial interpretation of entailment: the passage statement Φ entails Ψ when Φ covers the predicate dimensions that are active in Ψ. This interpretation is exactly correct if some of the dimensions of the vector space correspond to predicates, Ψ is a 0-1 vector representing a conjunction of predicates, and Φ is also 0-1 on these dimensions, indicating whether a predicate is implied by the context.

We now present empirical evidence for this emergent structure. The empirical evidence supports two corollaries that are derived from the structure. Thus, the aggregation readers and the explicit reference readers are using essentially the same answer selection criterion.

In this section, we propose a novel approach, one-hot pointer annotation, to locate entities in a passage. In this approach, we use a non-anonymized dataset (WDW) and add a one-hot indicator to each input (word embedding) that indicates occurrences of candidate answers in the passage. This approach simply provides the reference information R(a, p) without losing any information in the passage, unlike anonymized entity identifiers, which remove the original tokens of the passage. Additionally, we hope that the one-hot indicator helps aggregation readers, which apparently benefit from the anonymization. The performance of aggregation and explicit reference readers on WDW is shown in Table 3.2.

Table 3.2: Accuracy on the Who-did-What dataset. Each result is based on a single model. Results for neural readers other than NSE are based on replications of those systems. All models were trained on the relaxed training set, which uniformly yields better performance than the restricted training set. The first group of models are explicit reference models and the second group are aggregation models. + indicates anonymization with a better reference identifier. Human performance is 84%.

Our experiments indicate that both explicit reference and aggregation readers benefit greatly from this externally provided reference information. In particular, explicit reference readers rely on reference resolution: a specification of which phrases in the given passage refer to candidate answers. Aggregation readers also seem to demonstrate a stronger learning ability in that they essentially learn to mimic explicit reference readers by identifying the reference annotation and using it appropriately. This is done most clearly in the pointer reader architectures. Furthermore, we have argued for, and given experimental evidence for, an interpretation of aggregation readers as learning an emergent predication structure: a factoring of neural representations into a direct sum of a statement (predicate) representation and an entity (argument) representation.

There is great interest in learning representations for natural language understanding. However, the current state of neural reading comprehension is such that systems still benefit from externally provided linguistic features, including externally annotated reference resolution.
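As a concrete illustration of the one-hot pointer annotation described above, here is a small sketch that appends the indicator to each word embedding (token-level matching only; the thesis matches noun phrases to candidates):

```python
import numpy as np

def add_pointer_annotation(embeddings, tokens, candidates):
    """Append a one-hot indicator to each word embedding marking which candidate
    answer (if any) the token refers to."""
    T, d = embeddings.shape
    indicator = np.zeros((T, len(candidates)))
    index = {c: i for i, c in enumerate(candidates)}
    for t, tok in enumerate(tokens):
        if tok in index:
            indicator[t, index[tok]] = 1.0
    return np.concatenate([embeddings, indicator], axis=1)   # (T, d + num_candidates)

tokens = ["Barack", "Obama", "met", "Steve", "Jobs"]
emb = np.zeros((5, 4))
out = add_pointer_annotation(emb, tokens, ["Obama", "Jobs"])
print(out.shape)   # (5, 6)
```

Downstream, a reader can recover the reference information R(a, p) directly from these indicator columns.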
It would be interesting to develop fully automated neural readers that perform as well as readers using externally provided annotations. In this work, we claimed, and empirically showed, that the success of aggregation readers and explicit reference readers can be explained by the emergent predication structure described above. Finally, we proposed one-hot pointer annotation to help aggregation readers, whose performance indicates that these neural networks benefit from externally provided linguistic features, including externally annotated reference information.

Relation and entity centered reading comprehension

In this work, we apply the externally provided reference information that improved the performance of neural readers in Chapter 3 to another reading comprehension task focusing not only on entities but also on their relations, and we propose a novel neural model together with a training algorithm that trains the model memory-efficiently. We propose a Transformer-based model with an explicit reference structure that efficiently captures global contexts. Although the self-attention layer in the Transformer consumes memory that scales quadratically with the length of the input sequence, we propose a training algorithm whose memory requirement is constant with respect to the length of the sequence. We employ Wikihop to demonstrate the performance of the model and the training algorithm. Wikihop is a reading comprehension dataset focusing not only on entities but also on their relations. We presented studies to find an entity in a passage for a given textual query, i.e., cloze-style reading comprehension, in Chapter 2 and Chapter 3. Wikihop, on the other hand, is a reading comprehension task whose query consists of a relation and an entity and asks for another entity that has the relation with the query entity. Our model, trained by the proposed algorithm, achieved the state of the art on Wikihop.

Wikihop consists of a passage, question, candidate answers, and an answer. Here a question is a tuple of a query entity and a relation, and the answer is another entity that has the relation with the query entity. The task is closely related to relation extraction tasks, and, unlike cloze-style reading comprehension, it requires not only finding an entity but also understanding relations in the passage. In addition, the dataset also provides anonymized passages that help with reference resolution. Wikihop is designed for multi-hop reading comprehension with relatively long passages. In Wikihop, each passage has multiple paragraphs, as illustrated in the example from [5]. The questions and answers are derived from Wikidata, which is designed as a set of tuples, each consisting of a subject entity, object entity, and their relation; there are more than 7,000 relation types, including "instance_of" and "location".

In this section, we review related work for Wikihop in terms of three approaches. In the first approach, models have various self-attention structures. A limitation of the naive self-attention layer is the maximum length of a sequence that it can take. These models modify the self-attention structure to overcome the limitation; however, their training time (including pre-training and fine-tuning for a downstream task) is longer than those of other models. In the second approach, models consist of a pre-trained encoder and an additional network structure, so that they only need to be fine-tuned for a downstream task. We also take the pre-training and fine-tuning approach, but we propose a simpler model on top of an encoder.
In the third approach, models are trained from scratch: all parameters are randomly initialized and optimized only on the dataset of the downstream task. These models have no access to the additional linguistic resources used in pre-training and do not perform as well as pre-trained models.

Models modifying the self-attention structure: In recent years, pre-trained Transformers have been surpassing the performance of other neural structures, such as recurrent neural networks and convolutional neural networks, in reading comprehension tasks. The Transformer is a neural structure that processes a sequence with stacked self-attention layers [46]. Each self-attention layer computes an attention from a token to the other tokens as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where Q, K, and V are the query, key, and value vectors for each token and d_k is their dimension. The network structure is completely geometry-free, i.e., there is no structure that preserves the order of tokens in the sequence as recurrent networks do; instead, the Transformer takes a position embedding along with a word embedding for each token. This self-attention mechanism gives the Transformer rich expressive power. However, the structure requires an amount of memory that is quadratic in the sequence length during training. The self-attention structure is trained by a stochastic algorithm with two steps for updating parameters. The first step is the forward process, where the structure computes the loss through the query, key, and value embeddings. The second step is the backpropagation, where we compute the gradient for each parameter using the query, key, and value embeddings. Thus, the query, key, and value embeddings must be kept until the backpropagation, and, since the attention above involves a score for every pair of tokens, the stored activations grow quadratically with the sequence length. Here, we review approaches that modify the self-attention layer to address this issue.

Dynamic self-attention [47] is a self-attention layer whose attention is over the top-K tokens selected by a convolutional layer [48]. Transformer-XL and XLNet [49, 50] have a self-attention layer that uses relative position embeddings rather than absolute positions. A relative position provides the distance between two tokens: the token that we compute the attention from and the token that we compute the attention to. Thus they are not restricted to a fixed number of absolute positions. Although these approaches potentially solve the fundamental limitation of the Transformer encoder, these models need to be pre-trained from scratch. Typically, these Transformer encoders are pre-trained on training data that is much larger than the training data of downstream tasks. As a result, the pre-training is the most time-consuming part of parameter optimization. Thus, the other approaches reviewed in the following section add additional structure on top of pre-trained encoders so that they can avoid the pre-training.

Another approach is fine-tuning based on pre-trained encoders. In this approach, a model consists of an encoder whose parameters are pre-trained and an additional neural structure whose parameters are randomly initialized. The pre-trained encoder provides contextual word embeddings for each input text. The encoder is pre-trained on a large-scale language resource, so it is believed that the encoder acquires some general linguistic knowledge and that its contextual word embeddings help downstream tasks. The additional structure is a task-specific neural structure that can efficiently leverage these contextual word embeddings for the downstream task.
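To make the quadratic-memory issue discussed above concrete, here is a small numpy sketch of scaled dot-product self-attention; the (T, T) score matrix is the part that grows quadratically with the sequence length:

```python
import numpy as np

def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (T, T): one score per token pair,
                                             # which is why memory grows quadratically in T
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # (T, d_v) contextual embeddings

T, d = 128, 64
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)                             # (128, 64)
```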
The parameters of the additional structure are fine-tuned for the task while the model is trained on the downstream dataset. For example, Graph Convolutional Networks have been used on top of the Embeddings from Language Models (ELMo) encoder [53, 54]. Chen et al. [55] proposed a two-stage approach: in the first stage, a pointer network [56] selects a part of the passage that is likely essential for solving the question; in the second stage, a Transformer model takes that part of the passage and finds the answer.

It is worth mentioning that, in some studies, models are trained from scratch. These models consist of a relatively simple encoder and a relatively complicated additional neural structure. For example, Zhong et al. [57] proposed a Coarse-grain Fine-grain Coattention Network consisting of attention over candidate entities mentioned in each paragraph and another attention over the paragraphs, on top of a bidirectional Gated Recurrent Unit (GRU) encoder [58]. Tu et al. [59] proposed a Heterogeneous Document-Entity (HDE) graph whose nodes are entity mentions and paragraphs encoded by a GRU. Dhingra et al. [60] proposed a GRU with additional connections between tokens that refer to the same entity (coreference).

We propose a simpler and more efficient structure that adds a sum layer on top of a Transformer encoder. Our model works without the time-consuming pre-training, and our experiments indicate that this simple structure efficiently leverages the contextual embeddings given by the pre-trained Transformer encoder.

We propose a Transformer-based model with the explicit reference structure and a training algorithm for it. Here, the Transformer encoder is a function that takes a sequence of tokens and returns a contextual embedding for each token in the sequence. As we explained in Chapter 3, the explicit reference structure is a neural network structure that scores a candidate answer by explicitly taking the contextual embeddings of the tokens referring to that candidate. In this model, the Transformer encoder encodes each paragraph independently and computes the contextual embeddings of its tokens, so that its memory usage is linear in the number of paragraphs and does not scale quadratically with the length of the passage, as we saw in Section 4.2. Then the model accumulates these embeddings over paragraphs and scores the candidate answers. The overview of this model is shown in Figure 4.3. We also propose a training algorithm that reduces the memory usage during training to a constant with respect to the number of paragraphs.

Since the passage is a set of paragraphs, the Transformer encoder encodes the paragraphs independently. We denote the k-th paragraph by para_k, the question by q, and the encoder parameters by Φ. The contextual embeddings of the k-th paragraph are then H_Φ^k = encoder_Φ([q; para_k]); here the Transformer encoder takes a concatenation of the question and the paragraph. The contextual embeddings of the tokens referring to each candidate answer are accumulated over all paragraphs. Since each question consists of a relation and an entity q_e, we also accumulate a query entity embedding in the same way, and the candidate answer embeddings are concatenated to the query entity embedding.
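A small sketch of this paragraph-wise accumulation and scoring (the encode function below is a stand-in for the Transformer encoder over [question; paragraph], and the final linear layer is a stand-in for f; all names are illustrative):

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(0)

def encode(question, paragraph):
    # Stand-in for the Transformer encoder over [question; paragraph];
    # returns one contextual embedding per paragraph token.
    return rng.normal(size=(len(paragraph), DIM))

def score_candidates(question, paragraphs, candidates, query_entity):
    cand_sum = {c: np.zeros(DIM) for c in candidates}
    query_sum = np.zeros(DIM)
    for para in paragraphs:                    # each paragraph is encoded independently
        H = encode(question, para)
        for t, token in enumerate(para):       # R(para_k, c): positions where c occurs
            if token in cand_sum:
                cand_sum[token] += H[t]
            if token == query_entity:
                query_sum += H[t]
    w = rng.normal(size=2 * DIM)               # stand-in for the fully connected layer f
    return {c: float(w @ np.concatenate([v, query_sum])) for c, v in cand_sum.items()}

paras = [["@entity1", "is", "in", "@entity7"], ["@entity7", "borders", "@entity3"]]
scores = score_candidates(["country", "@entity1"], paras, ["@entity3", "@entity7"], "@entity1")
print(max(scores, key=scores.get))             # predicted candidate
```

Because each paragraph is encoded separately, the peak activation memory depends on the paragraph length rather than on the full passage length.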
Letting the accumulated embedding of a candidate $c$ be $a_c = \sum_k \sum_{t \in R(\mathrm{para}_k, c)} H^k_\Phi[t]$, the score of the i-th candidate answer is $s(c_i) = f([a_{c_i}; a_{q_e}])$, where $H^k_\Phi[t]$ is the t-th contextual representation vector for the given paragraph, $f$ is a fully connected layer, and $R(\mathrm{para}_k, c)$ is the set of positions $t$ where the entity $c$ occurs in the paragraph. To find these positions when entities are not anonymized, we matched each entity against the noun phrases in the passage whose words mostly match the entity. We also propose a stochastic gradient algorithm to train this model whose memory usage is constant with respect to the number of paragraphs (Algorithm 1). In this model, the Transformer encoder takes each paragraph instead of the entire passage, so the memory usage of the naive stochastic gradient algorithm is quadratic in the length of the paragraphs and linear in the number of paragraphs, which is still too large to fit in GPU memory when the passage has many paragraphs. During training, the memory is consumed by the computational graph. The computational graph can be viewed as a representation of the objective function, and it requires memory for every output of the parameterized functions in the objective during training. For example, for a composed function $f_1(f_2(x; \theta_2); \theta_1)$, the parameters $\theta_1$ are updated using the gradient $\partial f_1 / \partial \theta_1$, and the computational graph keeps the output value of $f_2$ from the forward propagation until the backpropagation. Our training algorithm computes the forward propagation twice for each backpropagation. In the first forward propagation, we compute the loss without keeping all neural outputs; then, in the second forward propagation, we compute the same loss while keeping the subset of the neural outputs whose parameters are updated in the upcoming backpropagation. In the first forward propagation, we compute the contextual embedding of each paragraph independently without keeping neural outputs. We denote these embeddings by $H^k_\Phi$, computed as above; here we keep the contextual embeddings only and discard the rest of the neural output values. In the second forward propagation, we recompute the contextual embedding of a single paragraph and then compute the total loss while keeping the neural outputs for the following backpropagation. We denote the contextual embedding of this target paragraph by $H^k_\Phi$, and we sum the contextual embedding of the target paragraph and those of the other paragraphs. The total loss for the passage is then the loss of the correct answer $a$ computed from the candidate scores, and only the neural outputs under the target paragraph's $H^k_\Phi$ are stored in the computational graph. The gradient is then computed with respect to $\Phi$, and $\Phi$ is updated as in Algorithm 1. Algorithm 1: Update steps for each question in the training algorithm, which performs the forward propagation twice for each backpropagation. Input: query $q$, paragraphs $\mathrm{para}_0, \mathrm{para}_1, \ldots$, candidate answers $c_0, c_1, \ldots$, and answer $a \in \{c_0, c_1, \ldots\}$. 1: for $\mathrm{para}_k \in \mathrm{para}_0, \mathrm{para}_1, \ldots$ The total loss is computed for each paragraph so that all parameters are updated. Our model is initialized mostly from a pre-trained BERT model and fine-tuned on anonymized Wikihop. We use the anonymized version to avoid solving coreference resolution and identifying the mentions of each candidate answer ourselves, so that we use exactly the same reference information that other systems used. The encoder of our model is BERT [61], whose parameters are initialized from BERT-base with twelve self-attention layers and 512 position embeddings. BERT-base is a medium-size Transformer encoder whose parameter scale is similar to that of the compared models; its inputs contain an identifier for each anonymized entity in the passages.
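Returning to Algorithm 1, the following minimal PyTorch sketch illustrates the two forward passes, assuming hypothetical helpers `model.accumulate` (encode one paragraph and return its accumulated candidate and query embeddings) and `model.score` (map accumulated embeddings to candidate scores); the exact bookkeeping in Algorithm 1 may differ.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, question, paragraphs, mentions, answer_idx):
    """Two-pass update for one question (sketch; helper names are illustrative)."""
    with torch.no_grad():                        # first forward pass: no graph is kept
        frozen = [model.accumulate(question, p, m) for p, m in zip(paragraphs, mentions)]
    for k, (p, m) in enumerate(zip(paragraphs, mentions)):
        live = model.accumulate(question, p, m)  # second forward pass for paragraph k only
        # sum the live embeddings of paragraph k with the frozen embeddings of the others,
        # so only paragraph k's computational graph is alive at a time
        total = live + sum(e for j, e in enumerate(frozen) if j != k)
        loss = F.cross_entropy(model.score(total).unsqueeze(0), torch.tensor([answer_idx]))
        optimizer.zero_grad()
        loss.backward()                          # gradients flow only through paragraph k
        optimizer.step()                         # for simplicity this sketch steps per paragraph,
                                                 # so the frozen embeddings become slightly stale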
Other parameters are randomly initialized. Our model is fine-tuned on Wikihop for five epochs. During the fine-tuning, we permuted the candidate answers in each passage to avoid over-fitting. We used 10% dropout [62], warmup [63] over the first 8% of the training data, and the Adam optimizer [64] for parameter optimization. The learning rate was searched from 2 × 10^-6 up to 2 × 10^-4. Table 4.2 shows the performance of each system on the development data and the test data. The first four models are trained from scratch, and the following models are pre-trained on large-scale data and then fine-tuned on the Wikihop training data. The table shows that the performance of our system is significantly higher than those of the other systems on the development data. Our system shows more than 2% higher accuracy than Longformer-base on the development data. Longformer-base and Longformer-large have 12 and 24 layers, respectively, and our model uses BERT-base with 12 layers; hence its parameter size is similar to that of Longformer-base. On the test data, Longformer-large achieved the highest accuracy; however, our model achieved the best accuracy among the models of its parameter scale. Additionally, the Longformers are trained on non-anonymized data, so they can potentially leverage the information in the candidate answer names. In contrast, our model is trained on anonymized data where candidate answers are replaced by entity IDs; thus, it cannot leverage the information in the candidate answer names. It is also worth noting that models trained on the anonymized training data perform as well as or better on the non-anonymized test data than on the anonymized test data, because we can always convert non-anonymized data into anonymized data. In order to better understand the contribution of the explicit reference structure to the performance, we report two reference accuracies: a model that reads each paragraph independently, and an oracle model that reads only the paragraphs mentioning the answer. The first model scores each candidate answer for each paragraph independently during training, so it does not take into account the context beyond each paragraph. Unlike the explicit reference reader, each candidate answer is scored for each paragraph independently as $s_k(c_i) = f\bigl(\sum_{t \in R(\mathrm{para}_k, c_i)} H^k_\Phi[t]\bigr)$ (4.11), and the model predicts the answer by the maximum score over the paragraphs, i.e., $\hat{a} = \arg\max_{c_i} \max_k s_k(c_i)$. The first row of Table 4.3 shows the accuracy of this model. The accuracy drops by 8% from our full explicit reference Transformer. This gap indicates that the simple embedding sum contributes significantly to capturing context beyond each paragraph. The second model is an oracle model that takes only paragraphs containing the correct answer, so it indicates the maximum performance the explicit reference Transformer could attain within each paragraph. The model is trained and tested only on paragraphs containing the correct answer. It is worth noting that this oracle is strong and removes most of the candidate answers. The second row of Table 4.3 shows the accuracy of the oracle model. Naturally, its performance is better than those of the non-oracle models, and the strong accuracy indicates the potential of the explicit reference Transformer. We proposed the explicit reference Transformer, which has a simple sum layer on top of a pre-trained Transformer encoder. The sum layer, called the explicit reference structure, operates over the contextual token embeddings referring to each candidate answer.
Our model is simple and efficiently fine-tuned on Wikihop, and its performance is significantly better than that of models with a similar parameter size.

Table 4.3 Development accuracy (%).
Independent paragraphs: 69.4
Oracle paragraphs: 96.9
Our model: 77.4
The independent-paragraph model reads each paragraph independently, and the oracle-paragraph model takes only paragraphs mentioning the correct answer.

We also proposed a novel stochastic gradient descent training algorithm. The algorithm performs the forward computation twice: once for computing the contextual embeddings and once more for storing the neural outputs needed for the backpropagation. The algorithm requires memory that is constant with respect to the number of paragraphs in the input passage; thus, it trains the Transformer encoder memory-efficiently. For future work, we would like to apply this model to other datasets to show the robustness of the approach. The Transformer encoder encodes geometric information solely by position embeddings, unlike recurrent networks and convolutional networks, which encode geometric information through their network structures. Nevertheless, we believe the Transformer encoder is strongly associated with the geometry of the input sequence, and the contextual token embedding on top of the t-th token mostly represents that token. Hence, using the explicit embeddings of task-specific tokens seems a promising approach.

Relation extraction with weakly supervised learning for materials science

In this chapter, we present our work on relation extraction for materials science [65]. As we described in Section 1.2, relation extraction is studied in the context of knowledge base population; however, it can be viewed as a reading comprehension task that asks for the relation between two given entities. Thus, in this study, we find the relation between two given entities from a text resource, and we also build a graph from these relations that visualizes the knowledge described in the text resource. Additionally, this work is collaborative work with materials science, and the target knowledge to be visualized is information that helps material development. A key strategy for building structured knowledge in materials science is the Processing-Structure-Property-Performance (PSPP) reciprocity [66]. The PSPP reciprocity is a framework for understanding material development, a field of study that seeks a manufacturing process that gives a material specific properties. The PSPP reciprocity explains how a manufacturing process gives a material its properties in three steps: process, structure, and property. The first step is a set of processings, where each processing is a (typically) chemical or physical input to the material. The second step is a set of structures, where each structure of the material is a pattern of molecules in the material. The third step is a set of properties, where each property is a characteristic of the material that we find valuable. The PSPP reciprocity explains that the first step, the processings, builds structures in the material; the second step, the structures, gives properties of the material; and the third step, the properties, gives the performance of the material. The PSPP reciprocity gives rise to a knowledge graph and a PSPP chart, defined as follows. In the knowledge graph, each node represents a specific process, structure, or property; a processing node has an edge to a structure node if the processing builds the structure, and the structure node has an edge to a property node if the structure affects the property.
No processing node is directly connected to a property node because, according to the PSPP reciprocity, all properties are given by processings through structures. A subset of the knowledge graph is called a PSPP chart, e.g., Fig. 5.1 [67]. The edges in the PSPP chart indirectly visualize the processings that have an impact on specific desired properties and thereby help material development. Even though PSPP charts are practically helpful in material development, there are a huge number of nodes in the knowledge graph, and it is expensive to find all edges by hand. Hundreds of processings, structures, and properties are known in materials science, so the number of possible edges, which grows with the product of the node counts, is far too large to examine manually. In practice, expert researchers draw a PSPP chart, a subgraph of the knowledge graph around the target properties. In this research, we developed a computer-aided material design system that automatically builds a PSPP chart from given scientific articles. The system is based on weak supervision, which is well studied in the context of knowledge base completion, such as in TAC. Here, the system is trained on about 100 relationships and thousands of non-annotated scientific articles obtained from Elsevier's API, and it then completes all relations among the processing, structure, and property nodes.

Figure 5.1: The process-structure-property-performance reciprocity.

The system does not rely on any specific structured dataset such as AtomWork [68]; instead, it relies on scientific articles that likely cover the knowledge needed to fill the PSPP chart. The system then visualizes the processings that likely have an impact on given target properties. This study is closely related to knowledge base population, the task of finding relations between entities in a knowledge base. A knowledge base is a well-structured database consisting of relationships among entities, i.e., tuples of an entity pair and a relation. It is difficult to complete all relationships in a knowledge base by hand, and automatic approaches that complete the knowledge graph from texts are studied in the field of NLP. Among these approaches, we use distant supervision [69]. In distant supervision, we preprocess the training data, namely a subgraph of the knowledge base (tuples of an entity pair and their relation) and a corpus (raw text), and generate weakly labeled sentences. Each weakly labeled sentence is a sentence mentioning multiple entities whose relation is in the subgraph, and it is labeled by that relation. In other words, the weakly labeled sentence is distantly labeled by the relation in the knowledge base. These weakly labeled sentences are then used to train machine learning models. The labels of these sentences are noisier than manual labels for each sentence, and reducing this label noise is a key to the approach. Feature-based machine learning models and convolutional neural network (CNN) models have been studied in the distantly supervised approach. In recent years, CNN models have surpassed feature-based models [70, 71, 72, 73]. Residual learning is used to help train deep CNN networks [74]. Zeng et al. [75] split a sentence into three parts and applied max pooling to each part of the sentence over a CNN layer. Sentence-level attention has been introduced for selecting a key sentence. In this approach, a network takes a set of sentences for a relation between two entities, where each sentence contains both entities.
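As a concrete illustration of how the weakly labeled sentences are generated, the following sketch pairs sentences with relations from the knowledge base subgraph; `entity_mentions` is a hypothetical helper that returns the entities mentioned in a sentence (e.g., by string matching against the entity list).

def weakly_label(kb_relations, corpus_sentences, entity_mentions):
    """Minimal sketch of distant supervision (names are illustrative).

    kb_relations: dict mapping an entity pair (e_i, e_j) to its relation label.
    corpus_sentences: iterable of raw sentences.
    Returns (sentence, relation) pairs used as weakly labeled training data.
    """
    labeled = []
    for sentence in corpus_sentences:
        entities = sorted(entity_mentions(sentence))
        for i, e_i in enumerate(entities):
            for e_j in entities[i + 1:]:
                if (e_i, e_j) in kb_relations:       # the pair appears in the KB subgraph
                    labeled.append((sentence, kb_relations[(e_i, e_j)]))
                elif (e_j, e_i) in kb_relations:     # pairs may be stored in either order
                    labeled.append((sentence, kb_relations[(e_j, e_i)]))
    return labeled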
An attention mechanism over a CNN allows the network to automatically select a key sentence that likely describes the desired relation, which helps overcome the noise of the distant labels [76, 77, 78]. Our task is to complete a PSPP knowledge graph from scientific articles and to extract a subgraph of the PSPP knowledge graph. Let $E$ be the entities of the knowledge graph and $r_{e_i,e_j}$ be the binary relation between entities $e_i, e_j \in E$. Our system completes the PSPP knowledge graph in two steps, entity collection and relation identification, and then produces a PSPP chart for given properties from the knowledge graph. In the first step, our system collects the entities $E$ of the PSPP knowledge graph, and these entities are classified into the three material development steps: processing, structure, and property. For example, 'tempering' and 'hot working' are classified as processing, 'grain refining' and 'austenite dispersion' are classified as structure, and 'strength' and 'cost' are classified as property. In the second step, our system identifies the relations among entities $r_{e_i,e_j}$ from scientific articles. Here a machine learning model is trained on weakly labeled sentences, i.e., $\{(S_{e_i,e_j}, r_{e_i,e_j}) \mid e_i, e_j \in E_{\mathrm{train}} \subset E\}$, where $E_{\mathrm{train}}$ is the set of entities in the PSPP charts used for training. The trained model fills in the other relations to complete the PSPP knowledge graph. Additionally, our system finds and visualizes the processes that likely have an impact on given properties. Here, we assume a scenario where a researcher is developing a new material with certain desired properties and is looking for processes related to those properties in a PSPP chart. In this scenario, the PSPP chart is drawn with the processes and structures around the desired properties. In this section, we describe how we collected the entities of the knowledge graph. Keyword lists alone are not long enough to cover structural entities from a materials science standpoint, so all keywords and the n most frequent noun phrases are collected, and each word or phrase is assigned a node in the PSPP knowledge graph. The total numbers of entities were 500, 500, and 1000 for process, property, and structural entities, respectively. In this section, we describe our CNN model for identifying the relation between entities. We use a stacked CNN with residual connections [74]. The CNN model consists of convolutional units with a deep residual learning framework that embeds the sentence into a vector representation. The vector representation then produces the probability of the binary relation through a sigmoid layer. We show the overview of the model in Fig. 5.3. The CNN model takes each weakly labeled sentence. Let the sentence be $s = (t_0, \ldots, t_i, \ldots)$, where $t_i$ is the i-th token in the sentence, and let $W(t_i) \in \mathbb{R}^{d_w}$ be the word embedding of token $t_i$. We define the relative distance from a token to an entity in the sentence as $k - i$, where $k$ is the position of the entity and $i$ is the position of the token, and we let the relative position embedding of the token be $P(k - i) \in \mathbb{R}^{d_p}$. We define the token embedding as $x_i = [W(t_i); P_1(k_1 - i); P_2(k_2 - i)]$, where $k_1$ and $k_2$ are the positions of the first and second entity in the sentence. Note that each sentence $s$ is padded to a fixed length $L$, and any relative distance greater than $D_{\max}$ is treated as $D_{\max}$. We put the token embeddings into the first convolutional layer. The convolutional unit of the first layer takes the token embeddings around position $i$ and computes $c_i \in \mathbb{R}^{d_c}$ as $c_i = g(w\, x_{i:i+h-1} + b)$, where $x_{i:i+h-1} = [x_i; \ldots; x_{i+h-1}]$, $w \in \mathbb{R}^{d_c \times h(d_w + 2 d_p)}$, and $b \in \mathbb{R}^{d_c}$ is a bias. $g$ is an elementwise non-linear function, ReLU.
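The sketch below illustrates the token embedding with the two relative position embedding tables and, in a trailing comment, the first convolutional layer; the embedding sizes are placeholders, and the symmetric clipping of negative distances is our assumption.

import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Sketch of x_i = [W(t_i); P1(k1 - i); P2(k2 - i)] with placeholder sizes."""
    def __init__(self, vocab_size, d_w=100, d_p=5, max_dist=30):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_w)
        # two relative position tables, one per entity, indexed by clipped distance
        self.pos1 = nn.Embedding(2 * max_dist + 1, d_p)
        self.pos2 = nn.Embedding(2 * max_dist + 1, d_p)
        self.max_dist = max_dist

    def forward(self, token_ids, k1, k2):
        i = torch.arange(token_ids.shape[0])
        # clip distances to [-max_dist, max_dist] (symmetric clipping assumed) and shift to >= 0
        d1 = (k1 - i).clamp(-self.max_dist, self.max_dist) + self.max_dist
        d2 = (k2 - i).clamp(-self.max_dist, self.max_dist) + self.max_dist
        return torch.cat([self.word(token_ids), self.pos1(d1), self.pos2(d2)], dim=-1)

# First convolutional layer over windows of h token embeddings, e.g.:
#   conv = nn.Conv1d(d_w + 2 * d_p, d_c, kernel_size=h)
#   c = torch.relu(conv(x.t().unsqueeze(0)))   # (1, d_c, L - h + 1)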
Following the first convolutional layer, the other convolutional layers are stacked with residual learning connections that directly transmit a signal from a lower layer to a higher layer while skipping the middle layers. We define two adjacent convolutional layers as a residual CNN block, with $\tilde{c}^0 = c$. In the k-th block, the first convolutional layer $\hat{c}^k_i$ takes a signal from the immediately lower layer $\tilde{c}^{k-1}_{i:i+h}$ and another signal from the lower block $\tilde{c}^{k-2}_{i:i+h}$. We put the output of the last convolutional layer into a max pooling layer. Denoting the last output by $\tilde{c}^K \in \mathbb{R}^{(L-h+1) \times d_c}$, the max pooling layer computes $z \in \mathbb{R}^{d_c}$ by taking the maximum of each dimension over the positions, $z_j = \max_i \tilde{c}^K_{i,j}$. Then, we put $z$ into two fully connected layers and a sigmoid function that gives the probability of the desired relation given the sentence, $P(r \mid s)$, where $r$ is the binary relation between the entities, $w_g \in \mathbb{R}^{d_c \times d_c}$, and $b_g \in \mathbb{R}^{d_c}$. The desired probability $P(r = \mathrm{True} \mid e_i, e_j)$ is the maximum of these probabilities over the sentences, i.e., $P(r = \mathrm{True} \mid e_i, e_j) = \max_{s \in S_{e_i,e_j}} P(r = \mathrm{True} \mid s)$, where the parameters are $\Phi = \{W, P_1, P_2, w, b\}$. Additionally, we generate a PSPP chart from the knowledge graph for given desired properties. Here the PSPP chart is a subgraph of the knowledge graph that indicates the processings likely to have an impact on the desired properties. We find the PSPP chart by considering a max-flow problem between the processings and the given properties: the inlets are all processings and the outlets are the given properties. The capacity of each edge is the score of the relation, i.e., $P(r = \mathrm{True} \mid e_i, e_j)$. We maximize the amount of flow with a limited number of nodes in the graph. We compute the capacity of each entity in the graph, which is the amount of flow that it can accept. Recalling that structure nodes are connected to property and processing nodes, and that no processing and property nodes are connected, all flows pass through structure nodes. We define the capacity of a structure node $e$ as $C(e) = \min\bigl(\sum_{e' \in \mathrm{PRC}} P(r = \mathrm{True} \mid e, e'),\ \sum_{e' \in \mathrm{PRP'}} P(r = \mathrm{True} \mid e', e)\bigr)$ (5.12), where PRC is the set of all processing nodes and PRP' is the set of the desired properties. Similarly, we define the capacity of a processing node in terms of its relation scores to the structure nodes, where STR' is the set of all structure nodes. The produced PSPP chart is composed of n processings, m structures, and the desired properties, where n and m are given hyper-parameters. The processing and structure entities are the n and m nodes with the highest capacities. For efficiency, the nodes are searched greedily, so optimality is not guaranteed; a small illustrative sketch of this construction is given below. The CNN model of Section 5.3 was trained and evaluated on a set of PSPP charts and scientific articles. The model was trained on weakly labeled sentences mentioning entities in the PSPP charts used for training; it then took weakly labeled sentences mentioning entities in held-out PSPP charts for testing and predicted the relations between entities in those held-out charts. In both training and testing, the weakly labeled sentences were found in the scientific articles described in Section 5.4. We used four PSPP charts [66] for training and testing. These four charts have 104 entity pairs in total, as shown in Table 5.2 and Table 5.3. We used three of the charts for training and the fourth chart for testing; thus, we trained and tested our model on four pairs of training and test charts. We used the likelihoods of the relationships in these four test charts for computing the precision and recall curves in Section 5.4 in order to obtain a smooth curve.
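The sketch referred to above shows how the node capacities could drive a greedy chart construction; the capacity of a processing node is an assumed form, since its exact definition is not reproduced here, and the relation scores are looked up symmetrically.

def rel(P, a, b):
    """Relation score between two entities; the pair may be stored in either order."""
    return P.get((a, b), P.get((b, a), 0.0))

def build_pspp_chart(P, processings, structures, desired_properties, n, m):
    """Greedy PSPP chart construction from node capacities (illustrative sketch).

    P[(e_i, e_j)]: relation score P(r = True | e_i, e_j) from the CNN model.
    Returns the n processing and m structure nodes with the highest capacities.
    """
    # capacity of a structure node: the smaller of its total relation score to all
    # processings (inflow) and to the desired properties (outflow), as in Eq. (5.12)
    cap_structure = {
        s: min(sum(rel(P, p, s) for p in processings),
               sum(rel(P, s, q) for q in desired_properties))
        for s in structures
    }
    top_structures = sorted(structures, key=cap_structure.get, reverse=True)[:m]
    # capacity of a processing node over the selected structures (assumed definition)
    cap_processing = {
        p: sum(min(rel(P, p, s), cap_structure[s]) for s in top_structures)
        for p in processings
    }
    top_processings = sorted(processings, key=cap_processing.get, reverse=True)[:n]
    return top_processings, top_structures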
We used publicly accessible scientific articles on ScienceDirect for training and testing. ScienceDirect is an Elsevier platform providing access to journal articles in a variety of fields, such as social sciences and engineering. Approximately 3,400 articles were collected using the keywords 'material' and 'microstructure' on ScienceDirect, i.e., each article is related to both 'material' and 'microstructure'. About 5,000 weakly labeled sentences were found in these scientific articles using the four PSPP charts, i.e., roughly 50 sentences for each entity pair on average. We trained the CNN model described in Section 5.3 on the weakly labeled sentences. Each weakly labeled sentence is labeled as follows. Let the set of sentences mentioning entities $e_i$ and $e_j$ be $S_{e_i,e_j}$. Here each entity is mapped to a span in a sentence by max-span string matching, i.e., an entity is mapped to a span if the span is the entity name and no other entity name overlaps it. For instance:
• Within each phase, the properties are ...
• When a substance undergoes a phase transition ...
The word phase in the first sentence is mapped to the entity 'phase', but phase transition in the second sentence is mapped to 'phase_transition' instead of 'phase'. Thus a sentence mentions an entity if and only if the entity is mapped to a span in the sentence. The model parameters are optimized by stochastic gradient descent and dropout. Dropout randomly drops some signals in the network, which is thought to help the generalization capability of the network. We employed the Adam optimizer with a learning rate of 0.00005, and we randomly dropped signals from the max pooling layer during training with a probability of 20%. The word embeddings were initialized with GloVe vectors [79]. Other hyper-parameters are listed in Table 5.4. We compared the performance of our CNN model to the performance of legacy machine learning models: logistic regression and a Support Vector Machine (SVM). The evaluation metrics are precision and recall, which are standard metrics for information extraction tasks. Precision is the ratio of correctly predicted positive entity pairs to all predicted positive entity pairs and gives the accuracy of the prediction. Recall is the ratio of correct predictions to all positive entity pairs in the test data and gives the coverage of the prediction. A positive entity pair is a pair whose relation is True. We obtain high precision when few pairs are predicted positive and high recall when many pairs are predicted positive. In this evaluation, the hyper-parameter is an integer t, the number of positive entity pairs in the prediction. For a given t and a set of entity pairs in the test relationship data, the system predicts a binary relation for each pair: it predicts the t most likely pairs as positive, and the other pairs as negative. The entity pairs in the test relationship data were scored by a machine learning model trained on the corresponding training relationship data, where the score was $P(r = \mathrm{True} \mid e_i, e_j)$. Then, the precision and recall for a given hyper-parameter t were computed as $\mathrm{precision} = |R_t \cap R_{\mathrm{test}}| / |R_t|$ and $\mathrm{recall} = |R_t \cap R_{\mathrm{test}}| / |R_{\mathrm{test}}|$, where $R_{\mathrm{test}}$ is the set of entity pairs with positive relations in all test relationship data and $R_t$ is the set of the t most likely positive entity pairs; a small sketch of this computation is given below. The likelihood is the score given by the model. We developed a web-based end-to-end demo system, shown in Fig. 5.7. The demo system works in a typical scenario of material development, where a scientist is looking for factors related to certain desired properties.
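The precision and recall at a given t can be computed directly from the model scores, as in this small sketch.

def precision_recall_at_t(scores, positive_pairs, t):
    """Precision and recall when the t highest-scoring entity pairs are predicted positive.

    scores: dict mapping an entity pair to its score P(r = True | e_i, e_j).
    positive_pairs: set of entity pairs whose relation is True in the test charts (R_test).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:t])                   # R_t: the t most likely positive pairs
    correct = len(predicted & positive_pairs)
    precision = correct / t
    recall = correct / len(positive_pairs)
    return precision, recall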
The demo system provides a PSPP design chart for the properties that the scientist specifies. The end-to-end system runs on Apache Tomcat. The system input consists of the desired properties along with a base material. The desired properties are selected from the list of properties collected as in Section 5.3. The base material is the target material, such as aluminum or titanium; specifying it is important for obtaining the desired knowledge. For example, the relationship between 'strength' and 'matrix' in titanium alloys might be different from the same relationship in aluminum alloys. The system then predicts PSPP relations from the scientific articles about the base material. First, the system selects a set of scientific articles for the base material; as in Section 5.4, the articles were collected by keyword search on ScienceDirect. The system then predicts all relations among the entities collected in Section 5.3 and scores them as in Section 5.3. Finally, the system generates a PSPP chart for the given properties as in Section 5.3. The system output is a PSPP design chart suggesting the required structures and processes. The chart, formed by three columns (process, structure, and property), suggests relations from the processes to the desired properties. Moreover, for each relation, the system provides a representative sentence to justify the relation and aid the researcher's understanding. In this study, we developed and tested a knowledge extraction and representation system intended to support material design by representing knowledge as relationships in PSPP design charts. We leveraged weakly supervised learning for relation extraction. The end-to-end system proved our concept, and its relation extraction performance was superior to that of the baseline models. Our contribution in this study is twofold. First, we proposed a novel knowledge graph based on PSPP charts and developed a system to build the knowledge graph from text using NLP technologies. Second, we experimentally verified that such technical knowledge can be extracted from text by machine learning models. Our target knowledge is the relations in PSPP design charts. These relations are rather technical and significantly different from typical relations in NLP such as 'has_a' and 'is_a'. Extracting these relations from text is nontrivial and might require other knowledge resources, such as equations and properties of materials; however, we experimentally verified that a state-of-the-art machine learning model can find these relations in texts.

Figure 5.7: The end-to-end demo system. a) Desired properties and a base material were selected. b) A sample of the generated PSPP design chart; the desired properties were toughness and creep strength, and 'steel' was selected as the base material. c) A sentence describing the relation between toughness and carbon content.

Knowledge graphs in the scientific domain have recently been in high demand, and numerous works have been published; we overview related work, much of which appeared after our work. Just as we employed the PSPP reciprocity, various types of knowledge graphs have been studied, each for its own target information. In the general scientific domain, Auer et al. [80] proposed the vision and infrastructure of a knowledge base for science. In biology, Jiang et al. [81] pointed out that some relational scientific facts are true only under specific conditions. For example, given the following sentence: "We observed that ...
alkaline pH increases the activity of TRV5/V6 channels in Jurkat T Cells." [82], we can find the relational fact {"alkaline pH", increase, "TRV5/V6 channels"}, which is true if {"TRV5/V6 channels", locate, "Jurkat T Cells"} holds. Another knowledge base for biology combines multiple structured databases and scientific papers [83]. In materials science, Mrdjenovich et al. [84] manually built propnet, a knowledge graph for materials science. Mention-level annotated datasets of scientific text, such as the SemEval shared tasks on scientific information extraction, have also been constructed, and Luan et al. [88] proposed SciERC as an extension of these datasets. These annotations provide cleaner training labels and make training efficient. Information extraction for materials science (materials informatics) is also in high demand and actively studied. For example, another kind of information to be extracted for materials science is synthesis procedures. A synthesis procedure is a sequence of operations to synthesize a compound. Mention-level annotated datasets are provided for this task [89, 90, 91], and Mysore et al. [92] apply the generative model of Kiddon et al. [93] to induce the procedures. Furthermore, several essential NLP technologies have been studied for materials informatics, such as entity recognition for materials science [94, 95] and word2vec [96] trained on materials science publications [97].

Table 5.5 Sample representative sentences scored by the CNN model. Label P indicates that the entities are positively related in the test relationship data, and label N indicates a negative relation. Entities in each sentence are underlined. The score is the $v_r z_2$ of each sentence.
Score/Label: 36.5/P. Sentence: ... the following matrix form: k u = λu ...

In this thesis, we discussed reading comprehension, focusing on entities and their relations. We started with an overview of reading comprehension tasks and the role of entities and their relations in these tasks. In early work, these tasks provided small hand-written datasets for rule-based systems. Later, the datasets grew larger and larger for machine learning models, especially for deep neural network models that are capable of being trained on such large-scale training data. We then claimed that the goal of these tasks is to test the reading comprehension skills of machines, which differentiates reading comprehension from other question answering tasks. Additionally, we are interested not only in testing these skills but also in how the machine understands texts, and we claimed that entities and their relations can be a key to explaining it. In Chapter 2, we constructed a reading comprehension dataset, WDW, that is designed to validate reading comprehension skills, especially the skill of understanding entities in given texts. We used baseline systems and a sampling approach to control the difficulty of the dataset so that each question requires appropriate reading comprehension skills to solve it. The dataset gives a large gap between human performance and machine performance, which shows that our dataset requires deeper text understanding. In Chapter 3, we investigated the skill of understanding entities and experimentally identified a neural network module that is associated with each entity in neural readers. We explored neural readers and classified them into aggregation readers and explicit reference readers by their neural structures on top of the contextual token embeddings. We experimentally found contextual token embeddings that strongly correlate with each entity, and we then showed that the attention layer of the aggregation reader mimics the explicit reference of the explicit reference reader.
In Chapter 4, we fed these findings back into another entity- and relation-centric reading comprehension dataset, Wikihop, and improved the performance of the neural network model. Here we leveraged the neural structure associated with each entity for scoring each candidate answer. Additionally, we proposed a training algorithm that can train self-attention layers without consuming memory quadratically. In Chapter 5, we developed a visualization system that summarizes given texts into a graph consisting of entities and their relations. This system extracts entities and their relations from a collection of scientific articles, and these entities and relations produce a graph that visualizes a summary of the given articles. This work is collaborative work with materials science, and the target information to be visualized is PSPP relations. We showed that such highly scientific relations can be extracted by a novel neural network trained on about 100 labeled relations and unlabeled scientific articles. We presented our contributions to reading comprehension focusing on entities and their relations. Here, we discuss further work needed to better understand the reading comprehension skills of deep neural networks. Thanks to deep neural networks and large-scale datasets, the performance of machines in reading comprehension tasks has improved significantly. On the other hand, it becomes more and more difficult to explain each semantic role of a vector representation as the network structure becomes more and more complicated. We presented an empirical analysis of neural readers in Chapter 3 and identified contextual token embeddings that strongly correlate with each entity embedding in an entity-centric dataset. A follow-up question might be the following: how are entities treated in other reading comprehension styles and other neural models? Recently, other reading comprehension styles, such as span prediction and free-form answers, have become more popular, and other neural models, such as the Transformer, have been proposed. However, they are still based on linear transformations; thus, we can capture the correlation between two arbitrary vector representations by computing inner products, just as in Chapter 3. We can therefore apply the same approach to these reading comprehension styles and identify the neural modules that correlate with each entity. We are also interested in a practical issue of machine learning that we faced in Chapter 5: the lack of training data for a specific domain. In many practical cases, it is nontrivial to collect a sufficient amount of manually labeled training data for neural network models, and a domain-specific dataset tends to be smaller than a general-domain dataset like [98]. Thus, the size of the dataset tends to be a bottleneck for performance. In this thesis, we took three approaches to address this issue. First, we built a dataset by heuristically matching news articles and sampling them in Chapter 2. Second, we initialized our model with a pre-trained neural network and then fine-tuned it in Chapter 4. Third, we combined relational information and texts through the idea of distant supervision in Chapter 5. There are other interesting approaches, including zero-shot learning [99], one-shot learning [100, 101], and few-shot learning [102]. We believe it is critical to choose a suitable learning scheme to develop a domain-specific machine learning system.
Investigating variability of fatigue indicator parameters of two-phase nickel-based superalloy microstructures
A study on the microstructures and mechanical properties of Ti-B20-0.1B alloys of direct rolling in the α+β phase region
Preparation and properties of an Al2O3/Ti(C,N) micro-nano-composite ceramic tool material by microwave sintering
Microstructure evolution and mechanical properties of drop-tube processed, rapidly solidified grey cast iron
Effects of Sc addition on the microstructure and mechanical properties of cast Al-3Li-1.5Cu-0.15Zr alloy
Microstructure analysis and yield strength simulation in high Co-Ni secondary hardening steel
Study on the microstructure and mechanical properties of Aermet 100 steel at the tempering temperature around 482 °C
Effect of thermo-mechanical cycling on the microstructure and toughness in the weld CGHAZ of a novel high strength low carbon steel
Effect of ingot grain refinement on the tensile properties of 2024 Al alloy sheets
Effect of microstructure on cleavage resistance of high-strength quenched and tempered steels
Effects of solution treatment on microstructure and mechanical properties of thixoformed Mg2Sip/AM60B composite
Microstructural evolution and mechanical properties of linear friction welded Ti2AlNb joint during solution and aging treatment

Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems.
A thorough examination of the CNN/Daily Mail reading comprehension task.
Text understanding with the attention sum reader network.
Gated-attention readers for text comprehension. In Annual Meeting of the Association for Computational Linguistics.
Constructing datasets for multi-hop reading comprehension across documents.
Who did what: A large-scale person-centered cloze dataset.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks.
WikiReading: A novel large-scale language understanding task over Wikipedia.
BLEU: A method for automatic evaluation of machine translation.
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments.
ROUGE: A package for automatic evaluation of summaries.
A conceptual theory of question answering.
Populating the semantic web by macro-reading internet text.
Procedures as a representation for data in a computer program for understanding natural language.
The lunar sciences natural language system: Final report.
Baseball: An automatic question-answerer.
Deep Read: A reading comprehension system.
A rule-based question answering system for reading comprehension tests.
Reading comprehension programs in a statistical-language-processing class.
Searching the world wide web.
Estimating search engine index size variability: A 9-year longitudinal study.
SQuAD: 100,000+ questions for machine comprehension of text.
HotpotQA: A dataset for diverse, explainable multi-hop question answering.
WordNet: An Electronic Lexical Database. Language, Speech, and Communication.
Wikidata: A Free Collaborative Knowledgebase.
On what there is. The Review of Metaphysics.
Ontological promiscuity.
Freebase: A collaboratively created graph database for structuring human knowledge.
Freebase data dumps.
The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.
Introducing the knowledge graph: things, not strings. Official Google blog.
WikiQA: A challenge dataset for open-domain question answering.
MCTest: A challenge dataset for the open-domain machine comprehension of text.
MS MARCO: A human generated machine reading comprehension dataset.
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.
The NarrativeQA reading comprehension challenge.
The Stanford CoreNLP Natural Language Processing Toolkit.
Accurate unlexicalized parsing.
Lucene in Action, Second Edition.
Machine comprehension with syntax, frames, and semantics.
Emergent predication structure in hidden state vectors of neural readers.
End-to-end memory networks.
Attention-over-attention neural networks for reading comprehension.
Long short-term memory.
Attention is all you need.
Token-level dynamic self-attention network for multi-passage reading comprehension.
Xception: Deep learning with depthwise separable convolutions.
Generalized autoregressive pretraining for language understanding. In Proceedings of Advances in Neural Information Processing Systems.
Transformer-XL: Attentive language models beyond a fixed-length context.
Reformer: The efficient transformer.
Longformer: The long-document transformer.
Question answering by reasoning across documents with graph convolutional networks.
Deep contextualized word representations.
Multi-hop question answering via reasoning chains.
Pointer networks.
Coarse-grain fine-grain coattention network for multi-evidence question answering.
Learning phrase representations using RNN encoder-decoder for statistical machine translation.
Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs.
Neural models for reasoning over multiple mentions using coreference.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Dropout: A simple way to prevent neural networks from overfitting.
Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. The Computing Research Repository (CoRR).
Adam: A method for stochastic optimization.
Relation extraction with weakly supervised learning based on process-structure-property-performance reciprocity.
Cybermaterials: Materials by design and accelerated insertion of materials.
Designing a new material world.
Inorganic materials database for exploring the nature of material.
Distant supervision for relation extraction without labeled data.
Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Multi-instance multi-label learning for relation extraction. In Natural Language Processing and Computational Natural Language Learning.
Distant supervision for relation extraction with matrix completion.
Modeling Relations and Their Mentions without Labeled Text.
Deep residual learning for weakly-supervised relation extraction.
Distant supervision for relation extraction via piecewise convolutional neural networks. In Conference on Empirical Methods in Natural Language Processing.
Neural relation extraction with selective attention over instances.
Distant Supervision for Relation Extraction with Sentence-Level Attention and Entity Descriptions.
A soft-label method for noise-tolerant distantly supervised relation extraction.
GloVe: Global vectors for word representation.
Towards a knowledge graph for science.
The role of 'condition': A novel scientific knowledge graph representation and construction model.
TRPV5/V6 channels mediate Ca(2+) influx in Jurkat T cells under the control of extracellular pH.
An information extraction and knowledge graph platform for accelerating biochemical discoveries.
Propnet: A knowledge graph for materials science. Matter, 2.
Towards the Bosch materials science knowledge base.
SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications.
SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers.
Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. In Annual Meeting of the Association for Computational Linguistics.
The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures.
Text-mined dataset of inorganic materials synthesis recipes. Scientific Data.
Automatically extracting action graphs from materials science synthesis procedures.
Mise en place: Unsupervised interpretation of instructional recipes.
Named entity recognition and normalization applied to large-scale information extraction from the materials science literature.
Machine-learned and codified synthesis parameters of oxide materials. Scientific Data.
Efficient estimation of word representations in vector space.
Unsupervised word embeddings capture latent knowledge from materials science literature.
ImageNet: A large-scale hierarchical image database.
Learning to detect unseen object classes by between-class attribute transfer.
One-shot learning of object categories.
Object classification from a single example utilizing class relevance metrics.
Diverse few-shot text classification with multiple metrics.