Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

Laurel Orr†, Megan Leszczynski†, Simran Arora†, Sen Wu†, Neel Guha†, Xiao Ling‡, and Christopher Ré†
†Stanford University ‡Apple
{lorr1,mleszczy,simran,senwu,nguha,chrismre}@cs.stanford.edu, xiaoling@apple.com

Abstract

A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly grounded in reasoning patterns for disambiguation. We define core reasoning patterns for disambiguation, create a learning procedure to encourage the self-supervised model to learn the patterns, and show how to use weak supervision to enhance the signals in the training data. Encoding the reasoning patterns in a simple Transformer architecture, Bootleg meets or exceeds state-of-the-art on three NED benchmarks. We further show that the learned representations from Bootleg successfully transfer to other non-disambiguation tasks that require entity-based knowledge: we set a new state-of-the-art on the popular TACRED relation extraction task, improving by 1.0 F1 points, and demonstrate up to 8% performance lift in highly optimized production search and assistant tasks at a major technology company.

1 Introduction

Knowledge-aware deep learning models have recently led to significant progress in fields ranging from natural language understanding [38, 41] to computer vision [56]. Incorporating explicit knowledge allows models to better recall factual information about specific entities [38]. Despite these successes, a persistent challenge that recent works continue to identify is how to leverage knowledge for low-resource regimes, such as tail examples that appear rarely (if at all) in the training data [16]. In this work, we study knowledge incorporation in the context of named entity disambiguation (NED) to better disambiguate the long tail of entities that occur infrequently during training.1

Humans disambiguate by leveraging subtle reasoning over entity-based knowledge to map strings to entities in a knowledge base. For example, in the sentence "Where is Lincoln in Logan County?", resolving the mention "Lincoln" to "Lincoln, IL" requires reasoning about relations because "Lincoln, IL"—not "Lincoln, NE" or "Abraham Lincoln"—is the capital of Logan County. Previous NED systems disambiguate by memorizing co-occurrences between entities and textual context in a self-supervised manner [16, 51]. The self-supervision is critical to building a model that is easy to maintain and does not require expensive hand-curated features. However, these approaches struggle to handle tail entities: a baseline SotA model from [16] achieves less than 28 F1 points over the tail, compared to 86 F1 points over all entities. Despite their rarity in training data, many real-world entities are tail entities: 89% of entities in the Wikidata knowledge base do not have Wikipedia pages to serve as a source of textual training data. However, to achieve 60 F1 points on disambiguation, we find that the prior SotA baseline model should see an entity on the order of 100 times during training (Figure 1 (right)).

1 In this work, we define tail entities as those occurring 10 or fewer times in the training data.
Figure 1: (Left) The four reasoning patterns for disambiguation, illustrated on mentions of "Lincoln"; the correct entity is bolded. (Right) F1 versus the number of times an entity was seen in training data for a baseline NED model compared to Bootleg across the head, torso, tail, and unseen entities; the annotation in the right panel notes that up to 100x more data is needed to recover the performance of Bootleg over the tail.

This presents a scalability challenge as there are 15x more entities in Wikidata than in Wikipedia, the majority of which are tail entities. For the model to observe each of these tail entities 100x, the training data would need to be scaled to 1,500x the size of Wikipedia. Prior approaches struggle with the tail, yet industry applications such as search and voice assistants are known to be tail-heavy [4, 20]. Given the requirement for high quality tail disambiguation, major technology companies continue to press on this challenge [29, 39].

Instead of scaling the training data until co-occurrences between tail entities and text can be memorized, we define a principled set of reasoning patterns for entity disambiguation across the head and tail. When humans disambiguate entities, they leverage signals from context as well as from entity relations and types. For example, resolving "Lincoln" in the text "How tall is Lincoln?" to "Abraham Lincoln" requires reasoning that people, not locations or car companies, have heights—a type affordance pattern. These core patterns apply to both head and tail examples with high coverage and involve reasoning over entity facts, relations, and types, information which is available for both head and tail in structured data sources.2 Thus, we hypothesize that these patterns assembled from the structured resources can be learned over training data and generalize to the tail.

In this work, we introduce Bootleg, an open-source, self-supervised NED system designed to succeed on head and tail entities.3 Bootleg encodes the entity, relation, and type signals as embedding inputs to a simple stacked Transformer architecture. The key challenges we face are understanding how to use knowledge for NED, designing a model that learns those patterns, and fully extracting the useful knowledge signals from the training data:

• Tail Reasoning: Humans use subtle reasoning patterns to disambiguate different entities, especially unfamiliar tail entities. The first challenge is characterizing these reasoning patterns and understanding their coverage over the tail.
• Poor Tail Generalization: We find that a model trained using standard regularization and a combination of entity, type, and relation information performs 10 F1 points worse on disambiguating unseen entities compared to the two models which respectively use only type and only relation information. We find this performance drop is due to the model's over-reliance on discriminative textual and entity features compared to more general type and relation features.

• Underutilized Data: Self-supervised models improve with more training data [7]. However, only a limited portion of the standard NED training dataset, Wikipedia, is useful: Wikipedia lacks labels [19] and we find that an estimated 68% of entities in the dataset are not labeled.4

2 We find that type affordance patterns apply to over 84% of all examples, including tail examples, while KG relation patterns apply to over 27% of all examples and type consistency applies to over 8% of all examples. In Wikidata, 75% of entities that are not in Wikipedia have type or knowledge graph connectivity signals, and among tail entities, 88% are in non-tail type categories and 90% are in non-tail relation categories.
3 Bootleg is open-source at http://hazyresearch.stanford.edu/bootleg

Bootleg addresses these challenges through three contributions:

• Reasoning Patterns for Disambiguation: We contribute a principled set of core disambiguation patterns for NED (Figure 1 (left))—entity memorization, type consistency, KG relation, and type affordance—and show that on slices of Wikipedia examples exemplifying each pattern, Bootleg provides a lift over the baseline SotA model on tail examples of 18 F1, 56 F1, 62 F1, and 45 F1 points, respectively. Overall, using these patterns, Bootleg meets or exceeds state-of-the-art performance on three NED benchmarks and outperforms the prior SotA by more than 40 F1 points on the tail of Wikipedia.

• Generalizing Learning to the Tail: Our key insight is that there are distinct entity-, type-, and relation-tails. Over tail entities (based on entity count in the training data), 88% have non-tail types and 90% have non-tail relations. The model should balance these signals differently depending on the particular entity being disambiguated. We thus contribute a new 2D regularization scheme to combine the entity, type, and relation signals and achieve a lift of 13.6 F1 points on unseen entities compared to the model using standard regularization techniques. We conduct extensive ablation studies to verify the effectiveness of our approach.

• Weak Labelling of Data: Our insight is that because Wikipedia is highly structured—most sentences on an entity's Wikipedia page refer to that entity via pronouns or alternative names—we can weakly label additional mentions in our training data. Through weak labeling, we increase the number of labeled mentions in the training data by 1.7x, and find this provides a 2.6 F1 point lift on unseen entities.

With these three contributions, Bootleg achieves SotA on three NED benchmarks. We further show that embeddings from Bootleg are useful for downstream applications that require knowledge of entities. We show the reasoning patterns learned in Bootleg transfer to tasks beyond NED by extracting Bootleg's learned embeddings and using them to set a new SotA, by 1.0 F1 points, on the TACRED relation extraction task [2, 53], where the prior SotA model also uses entity-based knowledge [38].
Bootleg representations further provide an 8% performance lift on highly optimized industrial search and assistant tasks at a major technology company. For Bootleg's embeddings to be viable for production, it is critical that these models are space-efficient: the models using only Bootleg relation and type embeddings each achieve 3.3x the performance of the prior SotA baseline over unseen entities using 1% of the space.

2 NED Overview and Reasoning Patterns

We now define the task of named entity disambiguation (NED), the four core reasoning patterns, and the structural resources required for learning the patterns.

Task Definition Given a knowledge base of entities E and an input sentence, the goal of named entity disambiguation is to determine the entities e ∈ E referenced in the sentence. Specifically, the input is a sequence of N tokens W = {w_1, . . . , w_N} and a set of M non-overlapping spans in the sequence W, termed mentions, to be disambiguated, M = {m_1, . . . , m_M}. The output is the most likely entity for each mention.

The Tail of NED We define the tail, torso, and head of NED as entities occurring fewer than 11 times, between 11 and 1,000 times, and more than 1,000 times in training, respectively. Following Figure 1 (right), the head represents those entities a simple language-based baseline model can easily resolve, as shown by a baseline SotA model from [16] achieving 86 F1 over all entities. These entities were seen enough times during training to memorize distinguishing contextual cues. The tail represents the entities these models struggle to resolve due to their rarity in training data, as shown by the same baseline model achieving less than 28 F1 on the tail.

4 We computed this statistic by counting the proper nouns and the pronouns/known aliases for an entity on that entity's page that were not already linked.

2.1 Four Reasoning Patterns

When humans disambiguate entities in text, they conceptually leverage signals over entities, relationships, and types. Our empirical analysis reveals a set of desirable reasoning patterns for NED. The patterns operate at different levels of granularity (see Figure 1 (left))—from patterns which are highly specific to an entity, to patterns which apply to categories of entities—and are defined as follows.

• Entity Memorization: We define entity memorization as the factual knowledge associated with a specific entity. Disambiguating "Lincoln" in the text "Where is Lincoln, Nebraska?" requires memorizing that "Lincoln, Nebraska", not "Abraham Lincoln", frequently occurs with the text "Nebraska" (Figure 1 (left)). This pattern is easily learned by now-standard Transformer-based language models. As this pattern is at the entity level, it is the least general pattern.

• Type Consistency: Type consistency is the pattern that certain textual signals indicate that the types of entities in a collection are likely similar. For example, when disambiguating "Lincoln" in the text "Is a Lincoln or Ford more expensive?", the keyword "or" indicates that the entities in the pair (or sequence) are likely of the same Wikidata type, "car company". Type consistency is a more general pattern than entity memorization, covering 12% of the tail examples in a sample of Wikipedia.5

• KG Relations: We define the knowledge graph (KG) relation pattern as when two candidates have a known KG relationship and textual signals indicate that the relation is discussed in the sentence.
For example, when disambiguating "Lincoln" in the sentence "Where is Lincoln in Logan County?", "Lincoln, IL" has the KG relationship "capital of" with Logan County while "Lincoln, NE" does not. The keyword "in" is associated with the relation "capital of" between two location entities, indicating that "Lincoln, IL" is correct, despite being the less popular candidate entity associated with "Lincoln". As patterns over pairs of entities with KG relations cover 23% of the tail examples, this is a more general reasoning pattern than consistency.

• Type Affordance: We define type affordance as the textual signals associated with a specific entity type in natural language. For example, "Manhattan" is likely resolved to the cocktail rather than the borough in the sentence "He ordered a Manhattan." due to the affordance that drinks, not locations, are "ordered". As affordance signals cover 76% of the tail examples, it is the most general reasoning pattern.

Required Structural Resources An NED system requires entity, relation, and type knowledge signals to learn these reasoning patterns. Entity knowledge is captured in unstructured text, while relation signals and type signals are readily available in structured knowledge bases such as Wikidata: from a sample of Wikipedia, 27% of all mentions and 23% of tail mentions participate in a relation, and 97% of all mentions and 92% of tail mentions are assigned some type in Wikidata. As these structural resources are readily available for all entities, they are useful for generalizing to the tail. A rare entity with a particular type or relation can leverage textual patterns learned from every other entity with that type or relation. Given the input signals and reasoning patterns, the next key challenge is ensuring that the model combines the discriminative entity and more general relation and type signals that are useful for disambiguation.

3 Bootleg Architecture for Tail Disambiguation

We now describe our approach to leverage the reasoning patterns based on entity, relation, and type signals. We then present our new regularization scheme to inject an inductive bias about when to use general versus discriminative reasoning patterns and our weak labeling technique to extract more signal from the self-supervised training data.

5 Coverage numbers are calculated from representative slices of Wikidata that require each reasoning pattern. Additional details are in Section 5.

Figure 2: Bootleg's neural model. The entity, type, and relation embeddings are generated for each candidate and concatenated to form our entity representation matrix E. This, together with our word embedding matrix W, are inputs to Bootleg's Ent2Ent, Phrase2Ent, and KG2Ent modules, which aim to encode the four reasoning patterns. The most likely candidate for each mention is returned.

3.1 Encoding the Signals

We first encode the structural signals—entities, KG relations, and types—by mapping each to a set of embeddings.

• Entity Embedding: Each entity e is represented by a unique embedding u_e.

• Type Embedding: Let T be the set of possible entity types. Given a known mapping from an entity e to its set {t_e,1, . . . , t_e,T | t_e,i ∈ T} of T possible types, Bootleg assigns an embedding t_e,i to each type. Because an entity can have multiple types, we use an additive attention [3], AddAttn, to create a single type embedding t_e = AddAttn([t_e,1, . . . , t_e,T]). We further allow the model to leverage coarse named entity recognition types through a mention-type prediction module (see Appendix A for details). This coarse predicted type is concatenated with the assigned type to form t_e.

• Relation Embedding: Let R represent the set of possible relationships any entity can participate in. Similar to types, given a mapping from an entity e to its set {r_e,1, . . . , r_e,R | r_e,i ∈ R} of R relationships, Bootleg assigns an embedding r_e,i to each relation. Because an entity can participate in multiple relations, we use additive attention to compute r_e = AddAttn([r_e,1, . . . , r_e,R]).

As in existing work [16, 40], given the input sentence of length N and the set of M mentions, Bootleg generates for each mention m_i a set Γ(m_i) = {e_i^1, . . . , e_i^K} of K possible entity candidates that could be referred to by m_i. For each candidate and its associated types and relations, Bootleg uses a multi-layer perceptron e = MLP([u_e, t_e, r_e]) to generate a vector representation for each candidate entity, for each mention. We denote this entity matrix as E ∈ R^(M×K×H), where H is the hidden dimension. We use BERT to generate contextual embeddings for each token in the input sentence. We denote this sentence embedding as W ∈ R^(N×H). W and E are passed to Bootleg's model architecture, described next.
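To make the encoding step concrete, the following is a minimal PyTorch-style sketch of how a single candidate representation could be assembled from its entity, type, and relation embeddings. The module and argument names (AdditiveAttention, CandidateEncoder, hidden_dim) are illustrative assumptions, not Bootleg's actual implementation.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Collapse a variable-size set of embeddings into one vector (Bahdanau-style [3])."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, embs):                              # embs: (num_items, hidden_dim)
        weights = torch.softmax(self.score(embs), dim=0)  # (num_items, 1)
        return (weights * embs).sum(dim=0)                # (hidden_dim,)

class CandidateEncoder(nn.Module):
    """e = MLP([u_e, t_e, r_e]) for a single candidate entity."""
    def __init__(self, num_entities, num_types, num_relations, hidden_dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, hidden_dim)
        self.type_emb = nn.Embedding(num_types, hidden_dim)
        self.rel_emb = nn.Embedding(num_relations, hidden_dim)
        self.type_attn = AdditiveAttention(hidden_dim)
        self.rel_attn = AdditiveAttention(hidden_dim)
        self.proj = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, hidden_dim))

    def forward(self, entity_id, type_ids, relation_ids):
        u_e = self.entity_emb(entity_id)                      # unique entity embedding
        t_e = self.type_attn(self.type_emb(type_ids))         # collapse multiple types
        r_e = self.rel_attn(self.rel_emb(relation_ids))       # collapse multiple relations
        return self.proj(torch.cat([u_e, t_e, r_e], dim=-1))  # candidate vector e
```

A full implementation would batch this over all M mentions and K candidates to form the M × K × H entity matrix E described above.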
3.2 Bootleg Model Architecture

The design goal of Bootleg is to capture the reasoning patterns by modeling textual signals associated with entities (for entity memorization), co-occurrences between entity types (for type consistency), textual signals associated with relations along with which entities are explicitly linked in the KG (for KG relations), and textual signals associated with types (for type affordance). We design three modules to capture these design goals: a phrase memorization module, a co-occurrence memorization module, and a knowledge graph connection module. The model architecture is shown in Figure 2. We describe each module next.

Phrase Memorization Module We design the phrase memorization module, Phrase2Ent, to encode the dependencies between the input text and the entity, relation, and type embeddings. The purpose of this module is to learn textual cues for the entity memorization and type affordance patterns. It should also learn relation context for the KG relation pattern. It will, for example, allow the person type embedding to encode the association with the keyword "height". The module accepts as input E and W and outputs E_p = MHA(E, W), where MHA is the standard multi-headed attention with a feed-forward layer and skip connections [48].

Co-occurrence Memorization Module We design the co-occurrence memorization module, Ent2Ent, to encode the dependencies between entities. The purpose of the Ent2Ent module is to learn textual cues for the type consistency pattern. The module accepts E and computes E_c = MHA(E) using self-attention.

Knowledge Graph (KG) Connection Module We design the KG module, KG2Ent, to collectively resolve entities based on pairwise connectivity features. Let K represent the adjacency matrix of a (possibly weighted) graph where the nodes are entities and an edge between e_i and e_j signifies that the two entities share some pairwise feature. Given E, KG2Ent computes E_k = softmax(K + wI)E + E, where I is the identity and w is a learned scalar weight that allows Bootleg to learn to balance the original entity and its connections. This module allows for representation transfer between two related entities, meaning entities with a high-scoring representation will boost the score of related entities. The second term acts as a skip connection between the input and output. In Bootleg, we allow the user to specify multiple KG2Ent modules: one for each adjacency matrix. The purpose of KG2Ent, along with Phrase2Ent, is to learn the KG relation pattern.

End-to-End The computations for one layer of Bootleg are:

E′ = MHA(E, W) + MHA(E)
E_k = softmax(K + wI)E′ + E′

where E_k is passed as the entity matrix to the next layer. After the final layer, Bootleg scores each entity by computing S_dis = max(E_k v^T, E′ v^T), with S_dis ∈ R^(M×K) and a learned scoring vector v ∈ R^H. Bootleg then outputs the highest scoring candidate for each mention. This scoring treats E_k and E′ as two separate predictions in an ensemble, allowing the model to use collective reasoning from E_k when it achieves the highest scoring representation. If there are multiple KG2Ent modules, we use the average of their outputs as input to the next layer and, for scoring, take the maximum score across all outputs. For training, we use the cross-entropy loss of S_dis to compute the disambiguation loss L_dis.
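The sketch below illustrates one such layer under simplifying assumptions: PyTorch's nn.MultiheadAttention stands in for the MHA blocks (omitting the feed-forward sublayers and skip connections the paper includes), and tensor shapes and names are illustrative rather than taken from the Bootleg codebase.

```python
import torch
import torch.nn as nn

class BootlegLayerSketch(nn.Module):
    """One layer combining Phrase2Ent, Ent2Ent, and KG2Ent, as in the equations above."""
    def __init__(self, hidden_dim, num_heads=4):
        super().__init__()
        self.phrase2ent = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ent2ent = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.kg_weight = nn.Parameter(torch.zeros(1))   # learned scalar w

    def forward(self, E, W, adj):
        # E: (1, M*K, H) flattened candidate matrix; W: (1, N, H) sentence embedding;
        # adj: (M*K, M*K) adjacency matrix K from the KG2Ent formula.
        phrase, _ = self.phrase2ent(E, W, W)        # Phrase2Ent: candidates attend over text
        cooc, _ = self.ent2ent(E, E, E)             # Ent2Ent: candidates attend over each other
        E_prime = phrase + cooc                     # E' = MHA(E, W) + MHA(E)
        eye = torch.eye(adj.size(0))
        A = torch.softmax(adj + self.kg_weight * eye, dim=-1)
        E_k = A @ E_prime.squeeze(0) + E_prime.squeeze(0)   # KG2Ent with skip connection
        return E_prime.squeeze(0), E_k

def score(E_prime, E_k, v):
    """S_dis = max(E_k v^T, E' v^T): the two outputs are treated as an ensemble."""
    return torch.maximum(E_k @ v, E_prime @ v)
```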
3.3 Improving Tail Generalization

Regularization is the standard technique to encourage models to generalize, as models will naturally fit to discriminative features. However, we demonstrate that standard regularization is not effective when we want to leverage a combination of general and discriminative signals. We then present two techniques, regularization and weak labeling, to encourage Bootleg to incorporate general structural signals and learn general reasoning patterns.

3.3.1 Regularization

We hypothesize that Bootleg will over-rely on the more discriminative entity features compared to the more general type and relation features to lower training loss. However, tail disambiguation requires Bootleg to leverage the general features. Using standard regularization techniques, we evaluate three models which respectively use only type embeddings, only relation embeddings, and a combination of type, relation, and entity embeddings. Bootleg's performance on unseen entities is 10 F1 points worse on the latter than on each of the former two, suggesting that standard regularization is not sufficient when the signals operate at different granularities (details in Table 9 in Appendix B). We can improve tail performance if Bootleg leverages memorized discriminative features for popular entities and general features for rare entities. We achieve this by designing a new regularization scheme for the entity-specific embedding u, which has two key properties: it is 2-dimensional, and more popular entities are regularized less than less popular ones.

• 2-Dimensional: In contrast to 1-dimensional dropout, 2-dimensional regularization masks the full embedding. With probability p(e), we set u = 0 before the MLP layer; i.e., e = MLP([0, t_e, r_e]). By entirely masking the entity embedding in these cases, the model learns to disambiguate using the type and relation patterns, without entity knowledge.

• Inverse Popularity: We find in ablations (Appendix B) that setting p(e) proportional to a power of the inverse of the entity e's popularity in the training data (i.e., the more popular, the less regularized) gives the best performance and improves by 13.6 F1 points on unseen entities over standard regularization. In contrast, fixing p(e) at 80% improves performance by over 11.3 F1 points over standard regularization, and regularizing proportional to a power of popularity only improves performance by 3.8 F1 points (details in Section 4).

The regularization scheme encourages Bootleg to use entity-specific knowledge when the entity is seen enough times to memorize entity patterns and encourages the use of generalizable patterns over the rare, highly masked, entities.
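A minimal sketch of this scheme is shown below. The exact functional form of p(e) (the exponent, any scaling or capping) is not specified here, so the formula, hyperparameters, and function names should be read as assumptions.

```python
import torch

def entity_mask_prob(counts, exponent=0.95, max_prob=0.95):
    """p(e) proportional to a power of inverse popularity: rare entities are masked
    (regularized) more, popular entities less. Exponent and cap are illustrative."""
    probs = 1.0 / counts.clamp(min=1).float() ** exponent
    return probs.clamp(max=max_prob)

def mask_entity_embeddings(u, counts, training=True):
    """2D regularization: with probability p(e), zero out the *entire* entity
    embedding u_e (not individual dimensions), so the model must fall back on
    the type and relation signals for that candidate."""
    if not training:
        return u
    p = entity_mask_prob(counts)                             # (num_candidates,)
    keep = (torch.rand_like(p) >= p).float().unsqueeze(-1)   # (num_candidates, 1)
    return u * keep                                          # e = MLP([0, t_e, r_e]) when masked

# Example: a tail entity seen twice is masked far more often than a head entity.
u = torch.randn(3, 128)                  # entity embeddings for 3 candidates
counts = torch.tensor([2, 500, 100000])  # training-data popularity of each candidate
masked = mask_entity_embeddings(u, counts)
```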
3.3.2 Weakly Supervised Data Labeling

We use Wikipedia to train Bootleg: we define a self-supervision task in which the internal links in Wikipedia are the gold entity labels for mentions during training. Although this dataset is large and widely used, it is often incomplete, with an estimated 68% of named entities being unlabeled. Given the scale and the requirement that Bootleg be self-supervised, it is not feasible to hand-label the data. Our insight is that because Wikipedia is highly structured—most sentences on an entity's Wikipedia page refer to that entity via pronouns or alternative names—we can weakly label our training data [44] to add mention labels. We use two heuristics for weak labeling: the first labels pronouns that match the gender of a person's Wikipedia page as references to that person, and the second labels known alternative names for an entity if the alternative name appears in sentences on the entity's Wikipedia page. Through weak labeling, we increase the number of labeled mentions in the training data by 1.7x across Wikipedia, and find this provides a 2.6 F1 point lift on unseen entities (full results in Appendix B Table 11).
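The two heuristics amount to simple string and pronoun matching over each page's sentences. The sketch below is one way they could be written; the function signature, the span-text (rather than character-offset) output, and the simple token matching are all assumptions for illustration.

```python
import re

FEMALE, MALE = {"she", "her", "hers"}, {"he", "him", "his"}

def weak_label_sentence(sentence, page_entity, page_gender, known_aliases, existing_labels):
    """Heuristically add mention labels for one sentence from page_entity's Wikipedia page.
    Heuristic 1: pronouns matching the page subject's gender refer to that person.
    Heuristic 2: known alternative names for the page subject refer to that entity."""
    labels = dict(existing_labels)   # span text -> entity, seeded from anchor links
    pronouns = FEMALE if page_gender == "female" else MALE if page_gender == "male" else set()
    for token in re.findall(r"[A-Za-z]+", sentence):
        if token.lower() in pronouns and token not in labels:
            labels[token] = page_entity
    for alias in known_aliases:
        if alias in sentence and alias not in labels:
            labels[alias] = page_entity
    return labels

# On the "Abraham Lincoln" page, both the pronoun and the alternative name get labeled.
labels = weak_label_sentence(
    "He delivered the Gettysburg Address; Lincoln was re-elected in 1864.",
    page_entity="Abraham Lincoln", page_gender="male",
    known_aliases=["Lincoln"], existing_labels={},
)
```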
4 Experiments

We demonstrate that Bootleg (1) nearly matches or exceeds state-of-the-art performance on three standard NED benchmarks and (2) outperforms a BERT-based NED baseline on the tail. As NED is critical for downstream tasks that require knowledge of entities, we (3) verify that Bootleg's learned reasoning patterns can transfer by using them for a downstream task: using Bootleg's learned representations, we achieve a new SotA on the TACRED relation extraction task and improve performance on a production task at a major technology company by 8%. Finally, we (4) demonstrate that Bootleg can be sample-efficient by using only a fraction of its learned entity embeddings without sacrificing performance. We (5) ablate Bootleg to understand the impact of the structural signals and the regularization scheme on improved tail performance.

4.1 Experimental Setup

Wikipedia Data We define our knowledge base as the set of entities with mentions in Wikipedia (for a total of 5.3M entities). We allow each mention to have up to K = 30 possible candidates. As Bootleg is a sentence disambiguation system, we train on individual sentences from Wikipedia, where the anchor links and our weak labeling (Section 3.3) serve as mention labels.

Our candidate lists Γ are mined from Wikipedia anchor links and the "also known as" field in Wikidata. For each person, we further add their first and last name as aliases linking to that person. We use the mention boundaries provided in the Wikipedia data and generate candidates by performing a direct lookup in Γ. We use the Wikidata and YAGO knowledge graphs and Wikipedia to extract structural data about entity types and relations as input for Bootleg. Further details about the data are in Appendix B.

Metrics We report micro-average F1 scores for all metrics over true anchor links in Wikipedia (not weak labels). We measure the torso and tail sets based on the number of times that an entity is the gold entity across Wikipedia anchors and weak labels, as this represents the number of times an entity is seen by Bootleg. For benchmarks, we also report precision and recall using the number of mentions extracted by Bootleg and the number of mentions defined in the data as denominators, respectively. The numerator is the number of correctly disambiguated mentions. For Wikipedia data experiments, we filter mentions such that (a) the gold entity is in the candidate set and (b) they have more than one possible candidate. The former is to decouple candidate generation from model performance for ablations.6 The latter is to not inflate a model's performance, as all models are trivially correct when there is a single candidate.

6 We drop only 1% of mentions from this filter.

Training For our main Bootleg model, we train for two epochs on Wikipedia sentences with a maximum sentence length of 100. For our benchmark model, we train for one epoch and additionally add a title embedding feature, a sentence co-occurrence KG matrix as another KG module, and a Wikipedia page co-occurrence statistical feature. Additional details about the models and training procedure are in Appendix B.
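For concreteness, the following sketch shows one way an alias-to-candidate map Γ could be built and queried as described above; the class name, the ranking of candidates by anchor-link frequency, and the data-loading details are assumptions rather than Bootleg's actual pipeline.

```python
from collections import Counter, defaultdict

class CandidateGenerator:
    """Alias-to-candidate map Gamma built from anchor links and Wikidata
    'also known as' aliases; a sketch, not Bootleg's actual data pipeline."""
    def __init__(self, max_candidates=30):
        self.max_candidates = max_candidates
        self.alias_counts = defaultdict(Counter)   # alias -> Counter(entity -> link count)

    def add_anchor(self, alias, entity, count=1):
        self.alias_counts[alias.lower()][entity] += count

    def add_person(self, full_name, entity):
        # Add the full name plus first and last name of a person as aliases for that person.
        parts = full_name.split()
        for alias in {full_name, parts[0], parts[-1]}:
            self.add_anchor(alias, entity)

    def candidates(self, mention):
        # Direct lookup: the K most frequently linked entities for this alias.
        ranked = self.alias_counts.get(mention.lower(), Counter()).most_common(self.max_candidates)
        return [entity for entity, _ in ranked]

gen = CandidateGenerator()
gen.add_anchor("Lincoln", "Lincoln, Nebraska", count=500)
gen.add_anchor("Lincoln", "Abraham Lincoln", count=900)
gen.add_anchor("Lincoln", "Lincoln, Illinois", count=20)
print(gen.candidates("Lincoln"))   # ['Abraham Lincoln', 'Lincoln, Nebraska', 'Lincoln, Illinois']
```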
4.2 Bootleg Performance

Benchmark Performance To understand the overall performance of Bootleg, we compare against reported state-of-the-art numbers on two standard sentence benchmarks (KORE50, RSS500) and the standard document benchmark (AIDA CoNLL-YAGO). Benchmark details are in Appendix B. For AIDA, we first convert each document into a set of sentences where a sentence is the document title, a BERT SEP token, and the sentence. We find this is sufficient to encode document context into Bootleg. We fine-tune the pretrained Bootleg model on the AIDA training set with a learning rate of 0.00007, 2 epochs, a batch size of 16, and evaluation every 25 steps. We choose the test score associated with the best validation score.8 In Table 1, we show that Bootleg achieves up to 5.8 F1 points higher than prior reported numbers on these benchmarks.

8 We use the standard candidate list from Pershina et al. [36] when comparing to existing systems for fine-tuning and inference on AIDA CoNLL-YAGO.

Table 1: We compare Bootleg to the best published numbers on three NED benchmarks. "-" indicates that the metric was not reported.

Benchmark | Model | Precision | Recall | F1
KORE50 | Hu et al. [24] | 80.0 | 79.8 | 79.9
KORE50 | Bootleg | 86.0 | 85.4 | 85.7
RSS500 | Phan et al. [40] | 82.3 | 82.3 | 82.3
RSS500 | Bootleg | 82.5 | 82.5 | 82.5
AIDA | Févry et al. [16] | - | 96.7 | -
AIDA | Bootleg | 96.9 | 96.7 | 96.8

Tail Performance To validate that Bootleg improves tail disambiguation, we compare against a baseline model from Févry et al. [16], which we refer to as NED-Base.9 NED-Base learns entity embeddings by maximizing the dot product between the entity candidates and fine-tuned BERT-contextual representations of the mention. NED-Base is successful overall on the validation set, achieving 85.9 F1 points, which is within 5.4 F1 points of Bootleg (Table 2). However, when we examine performance over the torso and tail, we see that Bootleg outperforms NED-Base by 8 and 41.2 F1 points, respectively. Finally, on unseen entities, Bootleg outperforms NED-Base by 50 F1 points. Note that NED-Base only has access to textual data, indicating that text is often sufficient for popular entities, but not for rare entities.

9 As code for the model from Févry et al. [16] is not publicly available, we re-implemented the model. We used our candidate generators and fine-tuned a pretrained BERT encoder rather than training a BERT encoder from scratch, as is done in Févry et al. [16]. We trained NED-Base on the same weakly labelled data as Bootleg for 2 epochs.

Table 2: We compare Bootleg to a BERT-based NED baseline (NED-Base) on validation sets of a Wikipedia dataset. We report micro-average F1 scores. All torso, tail, and unseen validation sets are filtered by the number of entity occurrences in the training data and such that the mention has more than one candidate.

Model | All Entities | Torso Entities | Tail Entities | Unseen Entities
NED-Base | 85.9 | 79.3 | 27.8 | 18.5
Bootleg | 91.3 | 87.3 | 69.0 | 68.5
Bootleg (Ent-only) | 85.8 | 79.0 | 37.9 | 14.9
Bootleg (Type-only) | 88.0 | 81.6 | 62.9 | 61.6
Bootleg (KG-only) | 87.1 | 79.4 | 64.0 | 64.7
# Mentions | 4,065,778 | 1,911,590 | 162,761 | 9,626
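The bucketing behind Table 2 follows directly from the definitions in Section 2 (unseen = 0 occurrences, tail ≤ 10, torso 11-1,000, head > 1,000) and the mention filters above. A small sketch of that evaluation grouping, with assumed argument names:

```python
def bucket(gold_count):
    """Map a gold entity's training-data occurrence count to an evaluation bucket."""
    if gold_count == 0:
        return "unseen"
    if gold_count <= 10:
        return "tail"
    if gold_count <= 1000:
        return "torso"
    return "head"

def eval_buckets(mentions, train_counts, candidate_fn):
    """Group filtered validation mentions by bucket. Each mention is a (text, gold_entity)
    pair; train_counts maps entity -> occurrences over anchors and weak labels."""
    groups = {"unseen": [], "tail": [], "torso": [], "head": []}
    for text, gold in mentions:
        cands = candidate_fn(text)
        # Keep only mentions whose gold entity is a candidate and that have >1 candidate.
        if gold in cands and len(cands) > 1:
            groups[bucket(train_counts.get(gold, 0))].append((text, gold))
    return groups
```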
4.3 Downstream Evaluation

Relation Extraction Using the learned representations from Bootleg, we achieve a new state-of-the-art on TACRED, a standard relation extraction benchmark. TACRED involves identifying the relationship between a specified subject and object in an example sentence as one of 41 relation types (e.g., spouse) or no relation. Relation extraction is well-suited for evaluating Bootleg because the substrings in the text can refer to many different entities, and the disambiguated entities impact the set of likely relations. Given an example, we run inference with the Bootleg model to disambiguate named entities and generate the contextual Bootleg entity embedding matrix, which we feed to a simple Transformer architecture that uses SpanBERT [27] (details in Appendix C). We achieve a micro-average test F1 score of 80.3, which improves upon the prior state of the art—KnowBERT [38], which also uses entity-based knowledge—by 1.0 F1 points and upon the baseline SpanBERT model by 2.3 F1 points on the TACRED-Revisited data ([53], Alt et al. [2]) (Table 3). We find that the Bootleg downstream model corrects errors made by the SpanBERT baseline, for example by leveraging entity, type, and relation information or by recognizing that different textual aliases refer to the same entity (see Table 4).

Table 3: Test micro-average F1 score on the revised TACRED dataset.

Model | F1
Bootleg Model | 80.3
KnowBERT | 79.3
SpanBERT | 78.0

Table 4: Examples of how the contextual entity representation from Bootleg, generated from entity, relation, and type signals, can help our downstream model. We provide the TACRED example, the signals provided by Bootleg, and the predictions of our model and the baseline SpanBERT model.

• TACRED example: "Vincent Astor, like Marshall (subj), died unexpectedly of a heart attack (obj) in 1959 ..." (gold relation: Cause of Death). Bootleg signals: disambiguates "Marshall" to Thomas Riley Marshall and "heart attack" to myocardial infarction, which have the Wikidata relation "cause of death". Our prediction: Cause of Death. SpanBERT prediction: No Relation.

• TACRED example: "The International Water Management (obj) Institute or IWMI (subj) study said both ..." (gold relation: Alternate Names). Bootleg signals: disambiguates the alias "International Water Management Institute" and its acronym, the alias "IWMI", to the same Wikidata entity. Our prediction: Alternate Names. SpanBERT prediction: No Relation.

In studying the slices for which the Bootleg downstream model improves upon the baseline SpanBERT model, we rank TACRED examples in three ways: by the proportion of words where Bootleg disambiguates the word as an entity, leverages Wikidata relations for the embedding, and leverages Wikidata types for the embedding. For each of these three, we report the gap between the SpanBERT model's and the Bootleg model's error rates on the examples with an above-median proportion (more Bootleg signal) relative to the below-median proportion (less Bootleg signal). We find that the relative gap between the baseline and Bootleg error rates is larger on the slice above the median (with more Bootleg information) than below it by 1.10x, 4.67x, and 1.35x, respectively: with more Bootleg information, the improvement our SotA model provides over SpanBERT increases (more details in Appendix C).
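As a rough illustration of how Bootleg's contextual entity embeddings could be consumed by such a downstream model, the sketch below concatenates them with text-encoder token representations before classification. The alignment of entity embeddings to tokens, the pooling, the dimensions, and the class name are assumptions; the actual downstream architecture is described in Appendix C.

```python
import torch
import torch.nn as nn

class RelationClassifierWithBootleg(nn.Module):
    """Sketch of a TACRED-style head: token representations from a text encoder
    (e.g., SpanBERT) are concatenated with aligned Bootleg contextual entity
    embeddings, then pooled and classified into 41 relations + no_relation."""
    def __init__(self, text_dim=768, bootleg_dim=512, num_labels=42):
        super().__init__()
        self.proj = nn.Linear(text_dim + bootleg_dim, text_dim)
        self.classifier = nn.Linear(text_dim, num_labels)

    def forward(self, token_reps, bootleg_ent_reps):
        # token_reps: (batch, seq, text_dim); bootleg_ent_reps: (batch, seq, bootleg_dim),
        # zeros where a token is not part of any disambiguated entity.
        fused = torch.relu(self.proj(torch.cat([token_reps, bootleg_ent_reps], dim=-1)))
        pooled = fused.mean(dim=1)       # simple mean pooling over the sequence
        return self.classifier(pooled)   # relation logits
```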
Industry Use Case We additionally demonstrate how the learned entity embeddings from Bootleg provide useful information to a system at a large technology company that answers factoid queries such as "How tall is the president of the United States?". We use Bootleg's embeddings in the Overton [45] system and compare to the same system without Bootleg embeddings as the baseline. We measure the overall test quality (F1) on an in-house entity disambiguation task as well as the quality over the tail slices, which include unseen entities. Per company policy, we report quality relative to the baseline rather than raw F1 scores; for example, if the baseline F1 score is 80.0 and the subject F1 is 88.0, the relative quality is 88.0/80.0 = 1.1. Table 5 shows that the use of Bootleg's embeddings consistently results in a positive relative quality, even over Spanish, French, and German, where improvements are most visible over tail entities.

Table 5: Relative F1 quality of an Overton [45] model with Bootleg embeddings over one without, in four languages.

Validation Set | English | Spanish | French | German
All Entities | 1.08 | 1.03 | 1.02 | 1.00
Tail Entities | 1.08 | 1.17 | 1.05 | 1.03

4.4 Memory Usage

We explore the memory usage of Bootleg during inference and demonstrate that by only using the entity embeddings for the top 5% of entities, ranked by popularity in the training data, Bootleg reduces its embedding memory consumption by 95% while sacrificing only 0.8 F1 points over all entities. We find that the 5.3M entity embeddings used in Bootleg consume the most memory, taking 5.2 GB of space, while the attention network only consumes 39 MB (1.37B updated model parameters in total, 1.36B from embeddings). As Bootleg's representations must be used in a variety of downstream tasks, the representations must be memory-efficient: we thus study the effect of reducing Bootleg's memory footprint by only using the most popular entity embeddings. Specifically, for the top k% of entities ranked by the number of occurrences in the training data, we keep the learned entity embedding intact. For the remaining entities, we choose a random entity embedding for an unseen entity to use instead. Instead of storing 5.3M entity embeddings, we thus store (k/100) × 5.3M, which gives a compression ratio of (100 − k). Figure 3 shows performance for k of 100, 50, 20, 10, 5, 1, and 0.1. We see that when just the top 5% of entity embeddings are used, we only sacrifice 0.8 F1 points overall and in fact score 2 F1 points higher over the tail. We hypothesize that the increase in tail performance is due to the fact that the majority of mention candidates all have the same learned embedding, decreasing the amount of conflict among candidates from textual patterns.

Figure 3: We show the F1 across all entities, torso entities, tail entities, and unseen entities as we decrease the number of embeddings we use during inference, assigning the non-popular entities to a fixed unseen entity embedding. For example, a compression ratio of 80 means only the top 20% of entity embeddings are used, ranked by entity popularity.
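A minimal sketch of this compression step, keeping only the top-k% embeddings by popularity and remapping every other entity to one shared stand-in row, is shown below; the function name, the use of a random row as the unseen embedding, and the remapping scheme are assumptions for illustration.

```python
import torch

def compress_entity_embeddings(emb, counts, keep_fraction=0.05):
    """Keep learned embeddings for the top keep_fraction of entities by training
    popularity; remap every other entity to one shared 'unseen entity' row.
    Returns the compressed table and an index remapping old ids -> new rows."""
    num_entities, dim = emb.shape
    num_keep = max(1, int(num_entities * keep_fraction))
    top_ids = torch.argsort(counts, descending=True)[:num_keep]
    compressed = torch.empty(num_keep + 1, dim)
    compressed[:num_keep] = emb[top_ids]
    rand_id = int(torch.randint(0, num_entities, (1,)))
    compressed[num_keep] = emb[rand_id]                       # stand-in unseen row
    remap = torch.full((num_entities,), num_keep, dtype=torch.long)  # default: unseen row
    remap[top_ids] = torch.arange(num_keep)
    return compressed, remap

emb = torch.randn(1000, 256)                 # toy stand-in for the 5.3M x H table
counts = torch.randint(0, 10_000, (1000,))
compressed, remap = compress_entity_embeddings(emb, counts, keep_fraction=0.05)
# At inference, look up entity i via compressed[remap[i]].
```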
4.5 Ablation Study

Bootleg To better understand the performance gains of Bootleg, we perform an ablation study over a subset of Wikipedia (data details are explained in Appendix B). We train Bootleg with: (1) only learned entity embeddings (Ent-only), (2) only type information from type embeddings (Type-only), and (3) only knowledge graph information from relation embeddings and knowledge graph connections (KG-only). All model sizes are reported in Appendix B Table 10. In Table 2, we see that just using type or knowledge graph information leads to improvements on the tail of over 25 F1 points and on unseen entities of over 46 F1 points compared to the Ent-only model. However, neither the Type-only nor the KG-only model performs as well on any of the validation sets as the full Bootleg model. An interesting comparison is between Ent-only and NED-Base. NED-Base overall outperforms Ent-only due to the fine-tuning of BERT word embeddings. We attribute the high performance of Ent-only on the tail compared to NED-Base to our Ent2Ent module, which allows for memorizing co-occurrence patterns over entities.

Regularization To understand the impact of our entity regularization function p(e) on overall performance, we perform an ablation study on a sample of Wikipedia (explained in Appendix B). We apply (1) a fixed regularization set to a constant percentage of 0, 20, 50, and 80, (2) a regularization function proportional to a power of the inverse popularity, and (3) the inverse of (2). Table 6 shows results over unseen entities (full results and details in Appendix B). We see that the fixed regularization of 80% achieves the highest F1 among the fixed regularizations of (1). The method that regularizes by inverse popularity achieves the highest overall F1. We further see that the scheme where popular entities are more regularized sees a drop of 9.8 F1 points in performance compared to the inverse popularity scheme.

Table 6: We show the micro F1 score over unseen entities for a Wikipedia sample as we vary the entity regularization scheme p(e). A scalar percentage means a fixed regularization. InvPop (inverse popularity scheme) applies less regularization to more popular entities, and Pop applies more regularization to more popular entities.

p(e) | 0% | 20% | 50% | 80% | Pop | InvPop
Unseen Entities | 48.6 | 52.5 | 57.7 | 59.9 | 52.4 | 62.2

5 Analysis

We have shown that Bootleg excels on benchmark tasks and that Bootleg's learned patterns can transfer to non-NED tasks. We now verify whether the defined entity, type consistency, KG relation, and affordance reasoning patterns are responsible for these results. We evaluate each over a representative slice of the Wikipedia validation set that exemplifies one of the reasoning patterns and present the results from each ablated model (Table 7).

• Entity To evaluate whether Bootleg captures factual knowledge about entities in the form of textual entity cues, we consider the slice of 28K overall, 5K tail examples where the gold entity has no relation or type signals available.

• Type Consistency To evaluate whether Bootleg captures consistency patterns, we consider the slice of 312K overall, 19K tail examples that contain a list of three or more sequential distinct gold entities, where all items in the list share at least one type.

• KG Relation To evaluate whether Bootleg captures KG relation patterns, we consider the slice of 1.1M overall, 37K tail examples for which the gold entities are connected by a known relation in the Wikidata knowledge graph.

• Type Affordance To evaluate whether Bootleg captures affordance patterns, we consider a slice where the sentence contains keywords that are afforded by the type of the gold entity. We mine the keywords afforded by a type by taking the 15 keywords that receive the highest TF-IDF scores over training examples with that type (see the sketch below). This slice has 3.4M overall, 124K tail examples.

Table 7: We report the Overall/Tail F1 scores across each ablation model for a slice of data that exemplifies a reasoning pattern. Each slice is representative but may not cover every example that contains the reasoning pattern.

Model | Entity | Type Consistency | KG Relation | Type Affordance
NED-Base | 59/29 | 84/29 | 91/30 | 87/28
Bootleg | 66/47 | 95/85 | 98/92 | 93/73
Bootleg (Ent-only) | 59/31 | 87/45 | 90/42 | 87/39
Bootleg (Type-only) | 53/44 | 93/80 | 93/69 | 90/66
Bootleg (KG-only) | 40/29 | 92/79 | 97/93 | 89/68
% Coverage | 0.7%/3.3% | 8%/12% | 27%/23% | 84%/76%
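The following is a small sketch of how the type affordance keywords described in the last bullet could be mined; grouping all sentences of a type into one document and using scikit-learn's TfidfVectorizer is an assumption about the exact procedure, not a description of our implementation.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

def affordance_keywords(examples, top_k=15):
    """examples: iterable of (sentence, gold_entity_type). Groups sentences by type,
    treats each type's concatenated sentences as one document, and returns the top_k
    highest-TF-IDF words per type as that type's affordance keywords."""
    docs = defaultdict(list)
    for sentence, ent_type in examples:
        docs[ent_type].append(sentence)
    types = sorted(docs)
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(" ".join(docs[t]) for t in types)
    vocab = vectorizer.get_feature_names_out()
    keywords = {}
    for row, ent_type in enumerate(types):
        scores = tfidf[row].toarray().ravel()
        top = scores.argsort()[::-1][:top_k]
        keywords[ent_type] = [vocab[i] for i in top]
    return keywords

examples = [("How tall is Lincoln?", "person"),
            ("Where is Lincoln located?", "location")]
print(affordance_keywords(examples, top_k=3))
```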
Pattern Analysis For the slice representing each reasoning pattern, we find that Bootleg provides a lift over the Ent-only and NED-Base models, especially over the tail. We find that Bootleg generally combines the entity, relation, and type signals effectively, performing better than the individual Ent-only, KG-only, and Type-only models, although the KG-only model performs well on the KG relation slice. The lift from Bootleg across slices indicates the model's ability to capture the reasoning required for the slice. We provide additional details in Appendix D.

Error Analysis We next study the errors made by Bootleg and find four key error buckets (examples in Table 8).

• Granularity Bootleg struggles with granularity, predicting an entity that is too general or too specific compared to the gold entity. Considering the set of examples where the predicted entity is a Wikidata subclass of the gold entity or vice versa, Bootleg predicts a too general or too specific entity in 12% of overall and 7% of tail errors.

• Numerical Bootleg struggles with entities containing numerical tokens, which may be due to the fact that the BERT model represents some numbers with sub-word tokens and is known to not perform as well on numbers as other language models [49]. To evaluate examples requiring reasoning over numbers, we consider the slice of data where the entity title contains a year, as this is the most common numerical feature in a title. This slice covers 14% of overall and 25% of tail errors.

• Multi-Hop There is room for improvement in multi-hop reasoning. In the example shown in Table 8, none of the present gold entities—Stillwater Santa Fe Depot, Citizens Bank Building (Stillwater, Oklahoma), Hoke Building (Stillwater, Oklahoma), or Walker Building (Stillwater, Oklahoma)—are directly connected in Wikidata; however, they share connections to the entity "Oklahoma". This indicates that the correct disambiguation is Citizens Bank Building (Stillwater, Oklahoma), not Citizens Bank Building (Burnsville, North Carolina). To evaluate examples requiring 2-hop reasoning, we consider examples where none of the present entities are directly linked in the KG, but a present pair connects to a different entity that is not present in the sentence. We find this occurs in 6% of overall and 7% of tail errors. This type of error represents a fundamental limitation of Bootleg, as we do not encode any form of multi-hop reasoning over a KG: our KG information only encodes single-hop patterns (i.e., direct connections). A sketch of the 2-hop check used to identify this slice follows Table 8.

• Exact Match Bootleg struggles on several examples in which the exact entity title is present in the text. Considering examples where the BERT baseline is correct but Bootleg is incorrect, in 28% of the examples the textual mention is an exact match of the entity title. Further, 32% of the examples contain a keyword from the entity title that Bootleg misses. We attribute this decrease in performance to Bootleg's regularization: the mention-to-entity similarity would need to be encoded in Bootleg's entity embedding, but the regularization encourages Bootleg to not use entity-level information.

Table 8: We identify four key error buckets for Bootleg: granularity, numerical errors, multi-hop reasoning, and missed exact matches. We provide a Wikipedia example, Bootleg's predicted entity, and the gold entity for each bucket.

• Granularity. Wikipedia example: "Posey is the recipient of a Golden Globe Award nomination, a Satellite Award nomination and two Independent Spirit Award nominations." Bootleg prediction: Satellite Awards. Gold entity: Satellite Award for Best Actress – Motion Picture.

• Numerical. Wikipedia example: "He competed in the individual road race and team time trial events at the 1976 Summer Olympics." Bootleg prediction: Cycling at the 1960 Summer Olympics – 1960 Men's Road Race. Gold entity: Cycling at the 1976 Summer Olympics – 1976 Men's Road Race.

• Multi-hop. Wikipedia example: "Other nearby historic buildings include the Santa Fe Depot, the Citizens Bank Building, the Hoke Building, the Walker Building, and the Courthouse". Bootleg prediction: Citizens Bank Building (Burnsville, North Carolina). Gold entity: Citizens Bank Building (Stillwater, Oklahoma).

• Exact Match. Wikipedia example: "According to the Nielsen Media Research, the episode was watched by 469 million viewers..." Bootleg prediction: Nielsen ratings. Gold entity: Nielsen Media Research.
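The 2-hop check referenced above reduces to a simple set test over the KG neighborhoods of the gold entities in a sentence; the sketch below implements that stated criterion, with the data layout (an entity-to-neighbor-set map) assumed for illustration.

```python
from itertools import combinations

def is_two_hop_example(gold_entities, neighbors):
    """gold_entities: entities present in one sentence; neighbors: entity -> set of
    directly connected KG entities. Returns True when no present pair is directly
    linked but at least one pair shares an off-sentence neighbor (a 2-hop link)."""
    present = set(gold_entities)
    for a, b in combinations(gold_entities, 2):
        if b in neighbors.get(a, set()):
            return False                 # a direct (1-hop) connection exists
    for a, b in combinations(gold_entities, 2):
        shared = neighbors.get(a, set()) & neighbors.get(b, set())
        if shared - present:
            return True                  # connected only through an outside entity
    return False

# The Stillwater example: the buildings are linked only through "Oklahoma".
neighbors = {
    "Stillwater Santa Fe Depot": {"Oklahoma"},
    "Citizens Bank Building (Stillwater, Oklahoma)": {"Oklahoma"},
}
print(is_two_hop_example(list(neighbors), neighbors))   # True
```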
6 Related Work

We discuss related work in terms of both NED and the broader picture of self-supervised models and tail data. Standard, pre-deep-learning approaches to NED have been rule-based [1] or leverage statistical techniques and manual feature engineering to filter and rank candidates [50]. For example, link counts and similarity scores between entity titles and mentions are two such features [12]. These systems tend to be hard to maintain over time, with the work of Petasis et al. [37] building a model to detect when a rule-based NED system needs to be retrained and updated. In recent years, deep learning systems have become the new standard (see Mudgal et al. [32] for a high-level overview of deep learning approaches to entity disambiguation and entity matching problems). The most recent state-of-the-art models generally rely on deep contextual word embeddings with entity embeddings [16, 46, 51]. As we showed in Table 2, these models perform well over popular entities but struggle to resolve the tail. Jin et al. [26] and Hoffart et al. [23] study disambiguation at the tail, and both rely on phrase-based language models for feature extraction. Unlike our work, they do not fuse type or knowledge graph information for disambiguation.

Disambiguation with Types Similar to our work, recent approaches have found that type information can be useful for entity disambiguation [9, 14, 21, 31, 43, 55]. Dredze et al. [14] use predicted coarse-grained types as entity features in an SVM classifier. Chen et al. [9] model type information as local context and integrate a BERT contextual embedding into the model from [17]. Raiman and Raiman [43] learn their own type system and perform disambiguation through type prediction alone (essentially capturing the type affordance pattern). Ling et al. [31] demonstrate that the 112-type FIGER type ontology could improve entity disambiguation, and the LATTE framework [55] uses multi-task learning to jointly perform type classification and entity disambiguation on biomedical data. Gupta et al. [21] add both an entity-level and a mention-level type objective using type embeddings. We build on these works, using fine- and coarse-grained entity-level type embeddings and a mention-level type prediction task.

Disambiguation with Knowledge Graphs Several recent works have also incorporated (knowledge) graph information through graph embeddings [35], co-occurrences in the Wikipedia hyperlink graph [42], and the incorporation of latent relation variables [30] to aid disambiguation. Cetoli et al. [8] and Mulang et al. [33] incorporate Wikidata triples as context into entity disambiguation by encoding triples as textual phrases to use as additional inputs, along with the original text to disambiguate, into a language model. In Bootleg, the Wikidata connections through the KG2Ent module allow for collective resolution and are not just additional features.

Entity Knowledge in Downstream Tasks The works of Broscheit [6], Peters et al. [38], Poerner et al. [41], and Zhang et al. [54] all try to add entity knowledge into a deep language model to improve downstream natural language task performance. Peters et al. [38], Poerner et al. [41], and Zhang et al. [54] incorporate pretrained entity embeddings and finetune either on the standard masked sequence-to-sequence prediction task or combined with an entity disambiguation/linking task.10 On the other hand, Broscheit [6] trains its own entity embeddings. Most works, like Bootleg, see lift from incorporating entity representations in the downstream tasks.

10 Entity disambiguation refers to when the mentions are pre-detected in text. Entity linking includes the mention detection phase. In Bootleg, we focus on the entity disambiguation task.

Wikipedia Weak Labelling Although uncommon, Broscheit [6], De Cao et al. [13], Ghaddar and Langlais [19], and Nothman et al. [34] all apply heuristic weak labelling techniques to increase link coverage in Wikipedia for either entity disambiguation or named entity recognition. All methods generally rely on finding known surface forms for entities and labelling those in the text. Bootleg is the first to investigate the lift from incorporating weakly labelled Wikipedia data over the tail.
Self-Supervision and the Tail The works of Tata et al. [47], Chung et al. [11], Ilievski et al. [25], and Chung et al. [10] all focus on the importance of the tail during inference and the challenges of capturing it during training. They all highlight the data management challenges of monitoring the tail (and other missed slices of data) and improving generalizability. In particular, Ilievski et al. [25] studies the tail in NED and encourages the use of separate head and tail subsets of data. From a broader perspective of natural language systems and generalizability, Ettinger et al. [15] highlights that many NLP systems are brittle in the face of tail linguistic patterns. Bootleg builds off this work, investigating the tail with respect to NED and demonstrating the generalizable reasoning patterns over structural resources can aid tail disambiguation. 7 Conclusion We present Bootleg, a state-of-the-art NED system that is explicitly grounded in a principled set of reasoning patterns for disambiguation, defined over entities, types, and knowledge graph relations. The contributions of this work include the characterization and evaluation of core reasoning patterns for disambiguation, a new learning procedure to encourage the model to learn the patterns, and a weak supervision technique to increase utilization of the training data. We find that Bootleg improves over the baseline SotA model by over 40 F1 points on the tail of Wikipedia. Using Bootleg’s entity embeddings for a downstream relation extraction task improves performance by 1.0 F1 points, and Bootleg’s representations lead to an 8% lift on highly optimized production tasks at a major technology company. We hope this work inspires future research on improving tail performance by incorporating outside knowledge in deep models. Acknowledgements: We thank Jared Dunnmon, Dan Fu, Karan Goel, Sarah Hooper, Monica Lam, Fred Sala, Nimit Sohoni, and Silei Xu for their valuable feedback and Pallavi Gudipati for help with experiments. We gratefully acknowledge the support of DARPA under Nos. FA86501827865 (SDH) and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervi- sion); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, the HAI-AWS Cloud Credits for Research program, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwith- standing any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government. References [1] John Aberdeen, John D Burger, David Day, Lynette Hirschman, David D Palmer, Patricia Robinson, and Marc Vilain. Mitre: Description of the alembic system as used in met. In TIPSTER TEXT PROGRAM PHASE II: Proceedings of a Workshop held at Vienna, Virginia, May 6-8, 1996, pages 461–462, 1996. [2] Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. Tacred revisited: A thorough evaluation of the tacred relation extraction task. 
In ACL, 2020. [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. [4] Michael S Bernstein, Jaime Teevan, Susan Dumais, Daniel Liebling, and Eric Horvitz. Direct answers for search queries in the long tail. In SIGCHI, 2012. [5] Terra Blevins and Luke Zettlemoyer. Moving down the long tail of word sense disambiguation with gloss-informed biencoders. arXiv preprint arXiv:2005.02590, 2020. 15 [6] Samuel Broscheit. Investigating entity knowledge in bert with simple neural end-to-end entity linking. arXiv preprint arXiv:2003.05473, 2020. [7] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. [8] Alberto Cetoli, Mohammad Akbari, Stefano Bragaglia, Andrew D O’Harney, and Marc Sloan. Named entity disambiguation using deep learning on graphs. arXiv preprint arXiv:1810.09164, 2018. [9] Shuang Chen, Jinpeng Wang, Feng Jiang, and Chin-Yew Lin. Improving entity linking by modeling latent entity type information. arXiv preprint arXiv:2001.01447, 2020. [10] Yeounoh Chung, Peter J Haas, Eli Upfal, and Tim Kraska. Unknown examples & machine learning model generalization. arXiv preprint arXiv:1808.08294, 2018. [11] Yeounoh Chung, Neoklis Polyzotis, Kihyun Tae, Steven Euijong Whang, et al. Automated data slicing for model validation: A big data-ai integration approach. TKDE, 2019. [12] Silviu Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, 2007. [13] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904, 2020. [14] Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 277–285, 2010. [15] Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. Towards linguistically generalizable nlp systems: A workshop and shared task. arXiv preprint arXiv:1711.01505, 2017. [16] Thibault Févry, Nicholas FitzGerald, Livio Baldini Soares, and Tom Kwiatkowski. Empirical evaluation of pretraining strategies for supervised entity linking. In AKBC, 2020. [17] Octavian-Eugen Ganea and Thomas Hofmann. Deep joint entity disambiguation with local neural attention. arXiv preprint arXiv:1704.04920, 2017. [18] Daniel Gerber, Sebastian Hellmann, Lorenz Bühmann, Tommaso Soru, Ricardo Usbeck, and Axel- Cyrille Ngonga Ngomo. Real-time rdf extraction from unstructured data streams. In ISWC, 2013. [19] Abbas Ghaddar and Philippe Langlais. Winer: A wikipedia annotated corpus for named entity recognition. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 413–422, 2017. [20] Ben Gomes. Our latest quality improvements for search. https://blog.google/products/search/ our-latest-quality-improvements-search/, 2017. [21] Nitish Gupta, Sameer Singh, and Dan Roth. Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2681–2690, 2017. 
[22] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics, 2011.
[23] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. KORE: Keyphrase overlap relatedness for entity disambiguation. In CIKM, 2012.
[24] Shengze Hu, Zhen Tan, Weixin Zeng, Bin Ge, and Weidong Xiao. Entity linking via symmetrical attention-based neural network and entity structural features. Symmetry, 2019.
[25] Filip Ilievski, Piek Vossen, and Stefan Schlobach. Systematic study of long tail phenomena in entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, pages 664–674, 2018.
[26] Yuzhe Jin, Emre Kiciman, Kuansan Wang, and Ricky Loynd. Entity linking at the tail: Sparse signals, unknown entities and phrase models. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14), pages 453–462. ACM, February 2014.
[27] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529, 2019.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[29] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
[30] Phong Le and Ivan Titov. Improving entity linking by modeling latent relations between mentions. arXiv preprint arXiv:1804.10637, 2018.
[31] Xiao Ling, Sameer Singh, and Daniel S Weld. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328, 2015.
[32] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, 2018.
[33] Isaiah Onando Mulang, Kuldeep Singh, Chaitali Prabhu, Abhishek Nadgeri, Johannes Hoffart, and Jens Lehmann. Evaluating the impact of knowledge graph context on entity disambiguation models. arXiv preprint arXiv:2008.05190, 2020.
[34] Joel Nothman, James R Curran, and Tara Murphy. Transforming Wikipedia into named entity training data. In Proceedings of the Australasian Language Technology Association Workshop 2008, pages 124–132, 2008.
[35] Alberto Parravicini, Rhicheek Patra, Davide B Bartolini, and Marco D Santambrogio. Fast and accurate entity linking via graph embedding. In Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), pages 1–9, 2019.
[36] Maria Pershina, Yifan He, and Ralph Grishman. Personalized page rank for named entity disambiguation. In NAACL, 2015.
[37] Georgios Petasis, Frantz Vichot, Francis Wolinski, Georgios Paliouras, Vangelis Karkaletsis, and Constantine D Spyropoulos. Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 426–433, 2001.
[38] Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1005.
[39] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, et al. KILT: A benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252, 2020.
[40] Minh C. Phan, Aixin Sun, Yi Tay, Jialong Han, and Chenliang Li. Pair-linking for collective entity disambiguation: Two could be better than all. TKDE, 2019.
[41] N Poerner, U Waltinger, and H Schütze. E-BERT: Efficient-yet-effective entity embeddings for BERT. arXiv preprint arXiv:1911.03681, 2019.
[42] Priya Radhakrishnan, Partha Talukdar, and Vasudeva Varma. ELDEN: Improved entity linking using densified knowledge graphs. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1844–1853, 2018.
[43] Jonathan Raphael Raiman and Olivier Michel Raiman. DeepType: Multilingual entity linking by neural type system evolution. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[44] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In VLDB, 2017.
[45] Christopher Ré, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. Overton: A data system for monitoring and improving machine-learned products. In CIDR, 2020.
[46] Hamed Shahbazi, Xiaoli Z Fern, Reza Ghaeini, Rasha Obeidat, and Prasad Tadepalli. Entity-aware ELMo: Learning contextual entity representation for entity disambiguation. arXiv preprint arXiv:1908.05762, 2019.
[47] Sandeep Tata, Vlad Panait, Suming Jeremiah Chen, and Mike Colagrosso. ItemSuggest: A data management platform for machine learned ranking services. In CIDR, 2019.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[49] Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. Do NLP models know numbers? Probing numeracy in embeddings. arXiv preprint arXiv:1909.07940, 2019.
[50] Vikas Yadav and Steven Bethard. A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470, 2019.
[51] Ikuya Yamada and Hiroyuki Shindo. Pre-training of deep contextualized embeddings of words and entities for named entity disambiguation. arXiv preprint arXiv:1909.00426, 2019.
[52] Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, and Gerhard Weikum. HYENA: Hierarchical type classification for entity names. In Proceedings of COLING 2012: Posters, pages 1361–1370, Mumbai, India, December 2012. The COLING 2012 Organizing Committee.
[53] Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 35–45, 2017.
[54] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
[55] Ming Zhu, Busra Celikkaya, Parminder Bhatia, and Chandan K. Reddy. LATTE: Latent type modeling for biomedical entity linking. In AAAI, 2020.
[56] Yuke Zhu, Joseph J Lim, and Li Fei-Fei. Knowledge acquisition for visual question answering via iterative querying. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1154–1163, 2017.

A Extended Model Details

We now provide additional details about the model introduced in Section 3. We first describe our type prediction module and then describe the added entity positional encoding.

Type Prediction To allow the model to further infer the correct types for an entity, especially when the entity does not have a preassigned type, we add a coarse mention type prediction task given the mention embedding. Given a mention m and a coarse type embedding matrix T, the task is to assign a coarse type embedding to the mention m; i.e., to determine t_m. We do so by adding the embeddings of the first and last tokens of the mention from W to generate a contextualized mention embedding m. We predict the coarse type of the mention, t̂_m, by computing

$$S_{\text{type}} = \mathrm{softmax}(\mathrm{MLP}(\mathbf{m})), \qquad \hat{t}_m = S_{\text{type}} T,$$

where S_type is a distribution over coarse types. For each entity candidate of m, t̂_m is concatenated to the other type embedding t_e before the MLP. This task is supervised by minimizing the cross entropy between S_type and the true coarse type of the gold entity, generating a type prediction loss L_type. When performing type prediction, our overall loss is L_dis + L_type.

Position Encoding We need Bootleg to be able to reason over the absolute and relative positions of the words and mentions in the sentence. For example, in the sentence “Where is America in Indiana?”, “America” refers to the city in Indiana, not the United States. In the sentence “Where is Indiana in America?”, “America” refers to the United States. The relative positions of “Indiana”, “in”, and “America” signal the correct answer. To provide this signal, we add the sinusoidal positional encoding from Vaswani et al. [48] to E before it is passed to our neural model. Specifically, for mention m, we concatenate the positional encodings of the first and last tokens of m, project the concatenation to dimension H, and add it to each of m’s K candidates in E. As we use BERT word embeddings for W, the positional encoding is already added to the words in the sentence.
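For concreteness, a minimal PyTorch sketch of the coarse type prediction head is shown below. The module name, the MLP shape, and the way the mention embedding is formed from the first and last token embeddings are illustrative assumptions, not the released Bootleg implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseTypePrediction(nn.Module):
    """Sketch of the coarse mention-type prediction head.

    Assumes contextualized word embeddings W of shape (batch, seq_len, hidden_dim)
    and a coarse type embedding matrix T of shape (num_coarse_types, type_dim).
    """

    def __init__(self, hidden_dim, type_dim, num_coarse_types):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_coarse_types),
        )
        self.type_emb = nn.Embedding(num_coarse_types, type_dim)  # matrix T

    def forward(self, W, first_idx, last_idx, gold_type=None):
        batch = torch.arange(W.size(0), device=W.device)
        # Contextualized mention embedding: sum of first and last token embeddings.
        m = W[batch, first_idx] + W[batch, last_idx]
        logits = self.mlp(m)
        S_type = F.softmax(logits, dim=-1)           # distribution over coarse types
        t_hat = S_type @ self.type_emb.weight        # soft coarse type embedding t̂_m
        loss = None
        if gold_type is not None:
            # Type prediction loss L_type (cross entropy against the gold coarse type).
            loss = F.cross_entropy(logits, gold_type)
        return t_hat, loss
```

In the full model, t̂_m would be concatenated with each candidate’s type embedding t_e, and L_type would be added to the disambiguation loss L_dis as described above.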
B Extended Results

We now give the details of our experimental setup and training. We then give extended results for the regularization scheme and model ablations. Lastly, we extend our error analysis to validate Bootleg’s ability to reason over the four patterns.

B.1 Evaluation Data

Wikipedia Datasets We use two main datasets to evaluate Bootleg.

• Wikipedia: we use the November 2019 dump of Wikipedia to train Bootleg. We use the set of entities that are linked to in Wikipedia, for a total of 3.3M entities. After weak labelling, we have a total of 5.7M sentences.

• Wikipedia Subset: we use a subset of Wikipedia for our micro ablation experiments over regularization parameters. We generate this subset by taking all sentences in which at least one mention is a mention from the KORE50 disambiguation benchmark. Our set of entities is all entities and entity candidates referred to by mentions in this subset of sentences. We have a total of 370,000 entities and 520,000 sentences.

For our Wikipedia experiments, we use an 80/10/10 train/test/dev split by Wikipedia pages, meaning all sentences from a single Wikipedia page are placed into one of the splits. For our benchmark model, we use a 96/2/2 train/test/dev split over sentences to allow our model to learn as much as possible from Wikipedia for our benchmark tasks.

Benchmark Datasets We use three benchmark NED datasets. Following standard procedure [17], we only consider mentions whose linked entities appear in Wikipedia. The datasets are summarized as follows:

• KORE50: KORE50 [23] consists of difficult-to-disambiguate sentences and contains 144 mentions to disambiguate. Note that, as of the November 2019 Wikipedia dump, one of the 144 mentions does not have a Wikipedia page. Although it is standard to remove mentions that do not link to an entity in Wikipedia, to be comparable to other methods, we evaluate on all 144 mentions, not 143.

• RSS500: RSS500 [18] is a dataset of news sentences and contains 520 mentions (4 of the mentions did not have entities in E).

• AIDA CoNLL-YAGO: AIDA CoNLL-YAGO [22] is a document-based news dataset containing 4,485 test mentions, 4,791 validation mentions, and 18,541 training mentions. As Bootleg is a sentence-level NED system, we create sentences from documents following the technique of Févry et al. [16], where we concatenate the title of the document to the beginning of each sentence.

To improve the quality of annotated mention boundaries in the benchmarks, we follow the technique of Phan et al. [40] and allow for mention boundary expansion using a standard off-the-shelf NER tagger.11 For candidate generation, as aliases may not exist in Γ, we gather possible candidates by looking at n-grams in descending order of length and determine the top 30 by measuring the similarity of the proper nouns in the example sentence to each candidate’s Wikipedia page text.

11 We use the spaCy NER tagger from https://spacy.io/

Structural Resources The last source of input data to Bootleg is the structural resources of types and knowledge graph relations. We extract relations from Wikidata knowledge graph triples. For our pairwise KG adjacency matrix used in KG2Ent, we require both the subject and object to be in E. For our relation embeddings, we only require the subject to be in E, as our goal is to extract all relations an entity participates in, independent of the other entities in the sentence. We have a total of 1,197 relations. We use two different type sources to assign types to entities—Wikidata types and HYENA types [52]—and use coarse HYENA types for type prediction. The Wikidata types are generated from Wikidata’s “instance of”, “subclass of”, and “occupation” relationships. The “occupation” types are used to improve disambiguation of people, which otherwise only receive the “human” type in Wikidata. We filter the set of Wikidata types to those occurring 100 or more times in Wikipedia, leaving 27K Wikidata types in total (see the sketch below for this filtering and assignment step). The HYENA type hierarchy has 505 types derived from the YAGO type hierarchy. We use the coarsest HYENA type of an entity as the gold type for type prediction; there are five coarse HYENA types (person, location, organization, artifact, and event) plus a miscellaneous category.
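The Wikidata type filtering and per-entity type assignment can be pictured with a short sketch. The data structures (a map from entity to Wikidata types, per-entity occurrence counts in Wikipedia) and the frequency-weighted counting are assumptions made for illustration, not the exact Bootleg pipeline.

```python
from collections import Counter

T = 3                  # max types stored per entity (as in our model configuration)
MIN_TYPE_COUNT = 100   # keep Wikidata types seen at least 100 times in Wikipedia

def filter_and_assign_types(entity_to_types, entity_counts):
    """entity_to_types: dict entity_id -> list of Wikidata type ids.
    entity_counts: dict entity_id -> number of occurrences in Wikipedia."""
    # Count how often each type occurs in the training data, weighted by
    # how often its entities appear.
    type_counts = Counter()
    for ent, types in entity_to_types.items():
        for t in types:
            type_counts[t] += entity_counts.get(ent, 0)

    kept_types = {t for t, c in type_counts.items() if c >= MIN_TYPE_COUNT}

    # Assign each entity up to T of its remaining types, most frequent first.
    assigned = {}
    for ent, types in entity_to_types.items():
        filtered = [t for t in types if t in kept_types]
        filtered.sort(key=lambda t: -type_counts[t])
        assigned[ent] = filtered[:T]
    return assigned
```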
B.2 Training Details

Model Parameters We train three separate Bootleg models: two on our full Wikipedia data (one for the ablations and one for the benchmarks) and one on our micro dataset. For all models we use 30 candidates for each mention and incorporate the structural resources discussed above. We set T = 3 and R = 50 for the number of types and relations assigned to each entity. For the models trained on our full Wikipedia data, we set the hidden dimension to 512, the dimension of u to 256, and the dimension of all other type and relation embeddings to 128. For the model trained on our micro dataset, we set the hidden dimension to 256, the dimension of u to 256, and the dimension of all other type and relation embeddings to 128.

The final differences to discuss are between the benchmark model and the ablation model trained over all of Wikipedia. To make the best-performing model for the benchmarks, we add two additional features we found improved performance:

• We use an additional KG2Ent module alongside the adjacency matrix indicating whether two entities are connected in Wikidata. This module uses a matrix containing the log of the number of times two entities occur in a sentence together in Wikipedia; if they co-occur fewer than 10 times, the weight is 0. We found this helped the model better learn entity co-occurrences from Wikipedia.

• We allow our model to use additional entity-based features that are concatenated into our final E matrix. We add two features. The first is the average of the BERT WordPiece embeddings of the title of an entity (see the sketch at the end of this subsection). This is similar to improving tail generalization by embedding a word definition in word sense disambiguation [5], and it allows the model to better capture textual cues indicating the correct entity. The second is a 1-dimensional feature counting how many other entities in the sentence appear on the entity’s Wikipedia page. This increases the likelihood of an entity that has more connections to the other candidates in the sentence.

We empirically find that using the page co-occurrences as an entity feature rather than as a KG2Ent module performs similarly and reduces the runtime. Further, our benchmark model uses a fixed regularization scheme of 80%, which did not hurt benchmark performance, and training was marginally faster than with the inverse popularity scheme. We did not use these features for the ablations, as we wanted a clean study of the model components as described in Section 3 with respect to the reasoning patterns.
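One possible way to compute the entity-title feature described above, shown as a minimal sketch using the HuggingFace transformers library; the choice of bert-base-uncased and the use of the static (non-contextual) WordPiece embedding table are assumptions for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def title_embedding(title: str) -> torch.Tensor:
    """Average the static BERT WordPiece embeddings of an entity title."""
    ids = tokenizer(title, add_special_tokens=False, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # Look up rows of BERT's WordPiece embedding table (no contextualization).
        wordpiece_vecs = bert.embeddings.word_embeddings(ids)  # (1, num_pieces, 768)
    return wordpiece_vecs.mean(dim=1).squeeze(0)               # (768,)

# Example: this vector would be concatenated into the entity's row of E.
vec = title_embedding("Abraham Lincoln")
```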
Training We initialize all entity embeddings to the same vector to reduce the impact of noise from unseen entities receiving different random embeddings. We use the Adam optimizer [28] with a learning rate of 0.0001, a dropout of 0.1 in all feedforward layers, and 16 heads in our attention modules, and we freeze the BERT encoder stack. Note that for the NED-Base model we do not freeze the encoder stack, to be consistent with Févry et al. [16]. For the models trained on all of Wikipedia, we use a batch size of 512 and train for 1 epoch for the benchmark model and 2 epochs for the ablation models on 8 NVIDIA V100 GPUs. For our micro data model, we use a batch size of 96 and train for 8 epochs on an NVIDIA P100 GPU.

B.3 Extended Ablation Results

Ablation Model Size Table 10 reports the model sizes of each of the five ablation models from Table 2. As we finetune the BERT language model in NED-Base (to be consistent with Févry et al. [16]) but do not do so in Bootleg, we do not count the BERT parameters in our reported sizes, to keep them comparable.

Table 10: We report the model sizes in MB of each of the five ablation models: NED-Base, Bootleg, Bootleg (Ent-Only), Bootleg (KG-Only), and Bootleg (Type-Only).

                      NED-Base   Bootleg   Ent-Only   Type-Only   KG-Only
Embedding Size (MB)      5,186     5,201      5,186          13         1
Network Size (MB)            4        39         35          38        34
Total Size (MB)          5,190     5,240      5,221          51        35

Regularization We now present the extended results of our regularization and weak labelling ablations over our representative micro dataset. Table 9 gives full ablations over a variety of regularization techniques. As in Table 2, we include results from models using only the entity, type, or relation information, in addition to the BERT and Bootleg models.

Table 9: (top) We compare Bootleg to a BERT-based NED baseline (NED-Base) on validation sets of our micro Wikipedia dataset and ablate Bootleg by only training with entity, type, or knowledge graph data. We further ablate (bottom 8 rows) the regularization schemes for the entity embeddings of Bootleg.

Model                  All Entities   Torso Entities   Tail Entities   Unseen Entities
NED-Base                       90.2             91.6            50.5              21.5
Bootleg (Ent-only)             89.1             89.0            48.3              15.5
Bootleg (Type-only)            91.6             90.4            65.9              56.8
Bootleg (KG-only)              91.8             90.8            65.3              58.6
Bootleg (p(e) = 0%)            92.5             92.3            67.7              48.6
Bootleg (p(e) = 20%)           92.8             92.5            68.9              52.5
Bootleg (p(e) = 50%)           92.9             92.7            70.1              57.7
Bootleg (p(e) = 80%)           92.8             92.2            69.5              59.9
Bootleg (InvPopLog)            92.7             91.9            69.7              61.1
Bootleg (InvPopPow)            92.8             92.3            70.5              62.2
Bootleg (InvPopLin)            92.6             91.8            69.7              61.0
Bootleg (PopPow)               92.9             92.5            68.9              52.4
# Mentions                   96,237           37,077          11,087             2,810

We report the results of inverse popularity regularization based on three different functions that map the curve of entity counts in training to a regularization value. For each function, we fix that entities with a frequency of 1 receive a regularization value of 0.95 and entities with a frequency of 10,000 receive a value of 0.05, and we assign intermediate values to generate a linear, logarithmic, and power curve that applies less regularization to more popular entities (sketched below). The regularization reported in Table 6 uses a power-law function, f(x) = 0.95x^(-0.32). We also report in Table 9 a linear function, f(x) = -0.00009x + 0.9501, and a logarithmic function, f(x) = -0.097 log(x) + 0.96. Each regularization function is clipped to the range [0.05, 0.95]. We leave it as future work to explore other varied regularization functions.
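The three inverse-popularity schedules can be written down directly. This is a minimal sketch of the mapping from an entity’s training-data count to its regularization value, with the clipping bounds taken from the description above.

```python
import math

def clip(value: float, lo: float = 0.05, hi: float = 0.95) -> float:
    # All schedules are restricted to the range [0.05, 0.95].
    return max(lo, min(hi, value))

def inv_pop_power(count: int) -> float:
    # Power-law schedule (InvPopPow): 0.95 at count 1, ~0.05 at count 10,000.
    return clip(0.95 * count ** -0.32)

def inv_pop_linear(count: int) -> float:
    return clip(-0.00009 * count + 0.9501)

def inv_pop_log(count: int) -> float:
    return clip(-0.097 * math.log(count) + 0.96)

# Example: a tail entity seen once is heavily regularized, a popular one barely.
print(inv_pop_power(1), inv_pop_power(10_000))   # ~0.95, ~0.05
```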
Table 9 shows similar trends as reported in Section 4: Bootleg with all sources of information and the power-law inverse regularization performs best over the tail and unseen entities. We do see that the model trained with a fixed regularization of 0.5 performs marginally better on the torso and over all entities, likely because this applies less regularization to those entity embeddings, allowing it to better leverage memorized entity patterns while still leveraging some type and relational information (as shown by its improved performance over lower fixed regularization values). This model, however, performs 4.5 F1 points worse on unseen entities than the best model.

Table 11: We report Bootleg trained with versus without weak labelling on our micro Wikipedia dataset. The slices are defined by gold anchor counts (pre-weak labelling). We use the InvPopPow regularization for both.

Model              All Entities   Torso Entities   Tail Entities   Unseen Entities
Bootleg                    92.8             92.6            70.5              63.3
Bootleg (No WL)            93.3             93.1            70.2              60.7
# Mentions               96,237           36,904          11,541             3,146

Weak Labeling Table 11 shows Bootleg’s results with the inverse power-law regularization, with and without weak labelling. For this ablation, we define our sets of torso, tail, and unseen entities by counting entity occurrences before weak labelling, to better understand the lift from adding weak labelling (rather than the drop without it). We see that weak labelling provides a lift of 2.6 F1 points over unseen entities and 0.3 F1 points over tail entities. Surprisingly, without weak labelling, Bootleg performs 0.5 F1 points better on torso entities. We hypothesize this occurs because the noisy weak labels increase the amount of signal available for Bootleg to learn consistency patterns for rarer entities—noisy signals are better than no signals—whereas popular entities already have enough less-noisy gold labels in the training data, so the noise from weak labels may create conflicting signals that hurt performance. To validate this hypothesis, we observe that overall, counting both true and weakly labelled mentions, 4% of mentions without weak labelling share the same types as at least one other mention in the sentence, while 14% of mentions with weak labelling do. Our model predicts a consistent answer only 4% of the time without weak labelling, compared to 13% of the time with weak labelling. Note these coverage numbers are slightly higher than those reported in Section 5, as we are using a weaker form of consistency—two mentions in the sentence sharing the same type, independent of position and ordering—and are including weakly labelled mentions. This indicates that consistency is a significantly more useful pattern with weak labelling, and that our model predicts more consistent answers with weak labelling than without. Over the torso with weak labelling, we find that 14% of the errors across all mentions (weakly labelled and anchor) occur when Bootleg uses consistency reasoning but the correct answer does not follow the consistency pattern. Without weak labelling, only 5% of the errors are due to consistency.

C Extended Downstream Details

We now provide additional details of our SotA TACRED model, which uses Bootleg embeddings.

Input We use the revisited TACRED dataset [2]: each example includes text and subject and object positions in the text. The task is to extract the relation between the subject and object. There are 41 potential relations as well as a “no relation” option. The other features we use as inputs are NER tags, POS tags, and contextual Bootleg embeddings for the entities that Bootleg disambiguates in the sentence.

Bootleg Model As TACRED does not come with existing mention boundaries, we perform mention extraction by searching over n-grams, from longest to shortest, in the sentence and extracting those that are known mentions in Bootleg’s candidate maps (a sketch of this procedure follows below). We use the same Bootleg model from our ablations with entity, KG, and type information, except with the addition of fine-tuned BERT word embeddings. For efficiency, we train on a subset of the Wikipedia training data relevant to TACRED. To obtain the relevant subset, we take Wikipedia sentences containing entities extracted during candidate generation from a uniform sample of TACRED data; i.e., entities in the candidate lists of detected mentions from a uniform TACRED sample. The contextualized entity embeddings from Bootleg over all of TACRED are fed to the downstream model.
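The longest-first n-gram mention extraction can be sketched as follows. The candidate-map structure (a dictionary from lowercased alias strings to candidate entity lists) and the greedy non-overlapping matching are assumptions made for illustration.

```python
def extract_mentions(tokens, candidate_map, max_ngram=5):
    """Greedily extract known mentions, longest n-grams first.

    tokens: list of word tokens for the sentence.
    candidate_map: dict mapping a lowercased alias string to its candidate entities.
    Returns a sorted list of non-overlapping (start, end, alias) spans.
    """
    used = [False] * len(tokens)
    spans = []
    for n in range(min(max_ngram, len(tokens)), 0, -1):        # longest to shortest
        for start in range(0, len(tokens) - n + 1):
            end = start + n
            if any(used[start:end]):
                continue                                       # skip overlapping spans
            alias = " ".join(tokens[start:end]).lower()
            if alias in candidate_map:
                spans.append((start, end, alias))
                for i in range(start, end):
                    used[i] = True
    return sorted(spans)

# Example with a tiny hypothetical candidate map.
cand_map = {"abraham lincoln": ["Q91"], "lincoln": ["Q91", "Q28260"]}
print(extract_mentions("How tall is Abraham Lincoln ?".split(), cand_map))
```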
Downstream Model We first use standard SpanBERT-Large embeddings to encode the input text, concatenate the contextual Bootleg embeddings with the SpanBERT embeddings, and then pass the result through four transformer layers. We apply a softmax for scoring and train with a cross-entropy loss. We freeze the Bootleg embeddings and fine-tune the SpanBERT embeddings. We use the following hyperparameters: the learning rate is 0.00002, the batch size is 8, gradient accumulation is 6, the number of epochs is 10, L2 regularization is 0.008, the warm-up percentage is 0.1, and the maximum sequence length is 128. We train with an NVIDIA Tesla V100 GPU.

Extended Results We study the model performance as a function of the signals provided by Bootleg. In Table 12, we show that on slices with above-median numbers of Bootleg entity, relation, and type signals identified in the TACRED example, the relative gap between the baseline and Bootleg errors is larger above the median than below it by 1.10x, 4.67x, and 1.35x respectively. In Table 13, we show the error rates of the baseline SpanBERT model relative to the Bootleg model for the slices where Bootleg provides an entity, relation, or type signal for the TACRED example’s subject or object. On the slices where these signals are respectively present, the baseline model performs 1.20x, 1.18x, and 1.20x worse than the Bootleg TACRED model. These results indicate that the knowledge representations from Bootleg successfully transfer useful information to the downstream model.

Table 12: We rank TACRED examples by the proportion of words that receive Bootleg embedding features where: Bootleg disambiguates an entity, leverages Wikidata relations for the embedding, and leverages Wikidata types for the embedding. We take examples where the proportion is greater than 0. For each of these three slices, we report the gap between the SpanBERT model and Bootleg model’s error rates on the examples with above-median proportion (more Bootleg signal) relative to the below-median proportion (less Bootleg signal). With more Bootleg information, the improvement our SotA model provides over SpanBERT increases.

Bootleg Signal   # Examples with the Signal   Gap Above/Below Median
Entity                            15323                       1.10
Relation                           5400                       4.67
Type                              15509                       1.35

Table 13: We compute the error rate of SpanBERT relative to our Bootleg downstream model for three slices of TACRED data where, respectively, Bootleg disambiguates the subject and/or object, Bootleg leverages Wikidata relations for the embedding of the subject and object pair, and Bootleg leverages Wikidata types for the embedding of the subject and/or object in the example.

Subject-Object Signal   # Examples   BERT/Bootleg Error Rate
Entity                       12621                      1.20
Relation                       542                      1.18
Obj Type                     12044                      1.20
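A minimal sketch of how the downstream head might combine the two embedding sources is given below. The exact layer sizes, pooling, and classifier head are assumptions, and bootleg_embs stands in for the frozen contextual entity embeddings produced by Bootleg.

```python
import torch
import torch.nn as nn

class TacredHead(nn.Module):
    """Concatenate frozen Bootleg embeddings with SpanBERT token embeddings,
    run four transformer layers, and classify into 41 relations + no_relation."""

    def __init__(self, spanbert_dim=1024, bootleg_dim=512, hidden=512, num_classes=42):
        super().__init__()
        self.proj = nn.Linear(spanbert_dim + bootleg_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, spanbert_embs, bootleg_embs):
        # spanbert_embs: (batch, seq, spanbert_dim); bootleg_embs: (batch, seq, bootleg_dim)
        x = torch.cat([spanbert_embs, bootleg_embs.detach()], dim=-1)  # Bootleg stays frozen
        x = self.encoder(self.proj(x))
        return self.classifier(x.mean(dim=1))  # pooled logits; train with cross entropy
```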
D Extended Error Analysis

D.1 Reasoning Patterns

Here we provide additional details about the distributions of types and relations in the data.

Distinct Tails Like entities, types and relations also have tail distributions. For example, types such as “country” and “film” appear 2.7M and 800k times respectively, while types such as “quasiregular polyhedron” and “hospital-acquired infection” appear once each in our Wikipedia training data. Meanwhile, relations such as “occupation” and “educated at” appear 35M and 16M times respectively, while relations such as “positive diagnostic predictor” and “author of afterword” each appear 7 times in the Wikipedia training data. However, we find that the entity, relation, and type tails are distinct: 88% of the tail entities by entity count have Wikidata types that are non-tail types, and 90% of the tail entities by entity count have non-tail relations.12 For example, the head Wikidata type “country” contains the rare entities “Palaú” and “Belgium–France border”.

12 Similar to tail entities, tail types and tail relations are defined as those appearing 1-10 times in the training data.

We observe that Bootleg significantly improves tail performance over each of these tails. We rank the Wikidata types and relations by the number of occurrences in the training data and study the lift from Bootleg as a function of the number of times the signal appears during training. Bootleg performs 9.4 and 20.3 F1 points better than the NED-Base baseline for examples whose gold types appear more and less, respectively, than the median number of times during training. Bootleg performs 7.8 and 13.7 F1 points better than the baseline for examples whose gold relations appear more and less, respectively, than the median number of times during training. These results indicate that Bootleg excels on the tails of types and relations as well.

Next, ranking the Wikidata types and relations by the proportion of rare (tail and unseen) entities they contain, we further find that Bootleg provides the lowest error rates across types and relations regardless of the proportion of rare entities, while the baseline and Entity-Only models give relatively larger error rates as the proportion of rare entities increases (Figure 4). The trend for types is flat as the proportion of rare entities increases, while the trend for relations slopes upward. These results indicate that Bootleg is better able to transfer the patterns learned from one entity to other entities that share its types and relations. The improvement from Bootleg over the baseline increases as the rare proportion increases, indicating that Bootleg is able to efficiently transfer knowledge even when the type or relation category contains no or few popular entities.

Figure 4: For all the entities of a particular type or relation, we calculate the percentage of rare (tail and unseen) entities. We show the error rate on the Wikipedia validation set as a function of the rare proportion of entities of a given (Left) relation or (Right) type appearing in the validation set. (Curves shown for Bootleg, BERT, KG-Only, Type-Only, and Entity-Only.)
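The rare-proportion analysis above can be reproduced with a short script. The input format (per-example gold type, per-entity training counts, per-example correctness flags) and the tail threshold of 10 are assumptions that mirror the definitions used in this paper.

```python
from collections import defaultdict

TAIL_THRESHOLD = 10  # entities seen <= 10 times in training are rare (tail or unseen)

def rare_proportion_per_type(type_to_entities, entity_counts):
    """Fraction of each type's entities that are rare."""
    props = {}
    for t, ents in type_to_entities.items():
        rare = sum(1 for e in ents if entity_counts.get(e, 0) <= TAIL_THRESHOLD)
        props[t] = rare / len(ents)
    return props

def error_rate_by_rare_proportion(examples, props, num_buckets=10):
    """examples: list of (gold_type, is_correct). Buckets the error rate by the
    rare-entity proportion of the example's gold type."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for gold_type, is_correct in examples:
        bucket = min(int(props[gold_type] * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        errors[bucket] += 0 if is_correct else 1
    return {b / num_buckets: errors[b] / totals[b] for b in sorted(totals)}
```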
Type Affordance For the type affordance pattern, we find that the TF-IDF keywords provide high coverage over the examples containing the gold type: 88% of examples where the gold entity has a particular type contain an affordance keyword for that type. An example of a type with full coverage by the affordance keywords is “café”, with keywords such as “coffee”, “Starbucks”, and “Internet”; in each of the 77 times an entity of the type “café” appears in the validation set, an affordance keyword is present. Types with low coverage by affordance keywords in the validation set tend to be the rare types: for types with coverage below 50%, such as “dietician” or “chess official”, the median number of occurrences in the validation set is 1. This supports the need for knowledge signals with distinct tails, which can be assembled together to address the rare examples.
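One possible way to mine and check type-affordance keywords, shown as a sketch with scikit-learn; grouping sentences by gold type, taking the top-k TF-IDF terms per type, and the coverage computation are assumptions consistent with the description above rather than the exact procedure.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

def mine_affordance_keywords(examples, top_k=25):
    """examples: list of (sentence, gold_type). Returns type -> set of keywords."""
    docs = defaultdict(list)
    for sentence, gold_type in examples:
        docs[gold_type].append(sentence)
    types = sorted(docs)
    corpus = [" ".join(docs[t]) for t in types]        # one document per type
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(corpus)
    vocab = vec.get_feature_names_out()
    keywords = {}
    for i, t in enumerate(types):
        row = tfidf[i].toarray().ravel()
        top = row.argsort()[::-1][:top_k]              # highest-TF-IDF terms for this type
        keywords[t] = {vocab[j] for j in top}
    return keywords

def affordance_coverage(examples, keywords):
    """Fraction of examples whose sentence contains a keyword of its gold type."""
    hits = sum(
        1 for sentence, t in examples
        if any(k in sentence.lower() for k in keywords.get(t, ()))
    )
    return hits / len(examples)
```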