Integrating Approaches to Word Representation
Yuval Pinter
2021-09-10

The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing. I present a survey of the distributional, compositional, and relational approaches to addressing this task, and discuss various means of integrating them into systems, with special emphasis on the word level and the out-of-vocabulary phenomenon.

The mission of natural language processing (NLP) as a computational research field is to enable machines to function in human-oriented environments where language is the medium of communication. We want them to understand our utterances, to connect these utterances with the objects and concepts of the surrounding world, to produce language which is meaningful to us and helps us navigate a task or satisfy an emotional need. Over the years of its existence, the mainstream of NLP has known shifts motivated by developments in computation, in linguistics, in foundational artificial intelligence, and in learning theory. Since the mid-2010s, the clear dominant framework for tackling NLP tasks, and an undeniably powerful one, has been that of deep neural networks (DNNs). This connectionist approach was originally motivated by the workings of the human brain, but has since developed its own characteristics, and formed a well-defined landscape for exploration which includes constraints stemming from the fundamental properties of its design.

This survey focuses on one of these built-in constraints, which I believe to be central to DNNs in the context of natural language, and specifically of text processing, namely that of representations. DNNs "live" in metric space: their operation manipulates real numbers organized into vectors and matrices, propagating function applications and calculated values within instantiations of pre-defined architectures. This mode of existence is very well-suited to problem domains that inhabit their own metric space, like the physical realms of vision and sound. In stark contrast to these, the textual form of linguistic communication is built atop a discrete alphabet and hinges on notions such as symbolic semantics, inconsistent compositionality, and the arbitrariness of the sign (de Saussure, 1916).

(1) The large white dog ate the hot dog left on the dining table.

The example in (1) exhibits all of these: the symbol dog refers to two distinct objects bearing no semantic resemblance; large and white each describe the (canine) dog's physical properties, while dining categorizes the table based on its function; and hot does not modify (the second) dog at all, but rather joins it to denote a distinctive atomic concept.

Given these properties of language, it is far from straightforward to decide the means by which to transform raw text into an input for a neural NLP system tasked with a goal which requires a grasp of the overall communicative intent of the text, such that this initial representation does not lose basic semantics essential to the eventual outcome. This transformation process is known as embedding, and its artifacts are themselves known as embeddings, a term often used interchangeably in context with "vectors" or "distributed representations".
Indeed, the choice of default representations has seen several shifts within the short DNN era, motivated in part by advances in computational power but also by a collective coming to terms with the limitations of the preceding methods. The great challenge of representation is compounded by the unboundedness of it all: human concept space is ever-expanding, and each new concept may be assigned an arbitrary sign (e.g., zoomer); within an existing concept space, associations capable of inspiring new utterances occupy a combinatorial magnitude which is essentially infinite; and even the form-meaning relationship itself exhibits malleability brought on by humans' interaction with text input devices and by various cognitive biases. Each of these sources of expansion loads any proposed representational method with the additional burden of generalizing to novel inputs while maintaining consistency in the manner by which they are represented in the system. In the NLP literature, the surface manifestation of the expanding spaces of concept and form, and of the more locally-constrained disparity between text available at different points in time of a model's training and deployment, is known as the out-of-vocabulary problem, and the unseen surface forms themselves are termed OOVs.

In this survey, I consider three central approaches to representing the fundamental units of natural language text in its input stage and the consequences of each approach's selection on the goals of the systems they are applied in. The first, most popular, and most successful one when used in isolation is the distributional approach, where the representation function is trained to embed textual units which appear in similar contexts close to each other in vector space. The second is the compositional approach, which seeks to assemble embeddings for workable textual units by breaking them down into more fundamental elements and applying functions over those elements' own representations, with a weaker commitment to semantic guarantees. The last is the relational approach, which makes use of large semantic structures curated manually or in a semi-supervised fashion, leveraging known connections between text and concepts and among concepts in order to create embeddings manifesting humans' notions of "meaning". The OOV problem features heavily in the motivation and analysis of the work presented, as it poses challenges to each of the approaches described, yet the exact definitions of vocabularies and of OOV-ness are themselves challenged by the advent of NLP systems that have become mainstream following the processes described in this work, namely contextualized subword embeddings.

Natural language is ultimately a system for conveying meaning, information, and social cues from the realm of human experience into a discrete linear form by encoding them as auditory, visual, and/or textual symbols, which are then iteratively composed into more complex units. In order to process such a system's outputs by computational means, it seems fitting to identify those symbols which carry the basic units of meaning, and then find the proper ways to map those meanings into representations for a program which can compose them. The first step, that of identifying linguistic atoms, proves to be a formidable challenge. From the surface output perspective, the common wisdom is that the basic semantic unit of language is what is known as a morpheme.
The English word unbelievable, for example, is composed of a stem morpheme believe, a semantic-syntactic suffix -able recasting the verb into an adjective pertaining to potential, and a semantic prefix un- denoting negation. But this morpheme-as-atom stipulation is not unassailable. Processes below the morpheme level have been documented across languages, for example the sound symbolism phenomenon known as phonaesthesia, where arbitrary sound patterns correlate with a concept or conceptual properties, such as /gl/ in the English light/shine-related words glow, glitter, and glare (Blake, 2017). Less arbitrarily, patterns and even individual sounds in names are known to evoke semantic qualities based on their acoustic properties (Köhler, 1947; Bergh et al., 1984). In English-language informal communication modes, writers sometimes employ the practice of expressive lengthening, where a single character in a word is repeated in order to amplify its referent's extension on some scale. For example, looooong would be used to describe a particularly long object or period of time. In addition to these sub-morpheme phenomena, the relation between morphemes and the atoms of our conceptual space is neither univalent nor one-to-one. Certain stem morphemes, like star, denote multiple types of concepts or objects (polysemy and homonymy), while some concepts may be referred to using different morphemes, like the relevant meanings of room and space (synonymy). The suffix -s can denote either a third-person present verb form or a plural noun (polyexponence), and both are replaced by -es under certain local conditions (flexivity).

Theoretical quibbles notwithstanding, NLP is a practical field, and from its nascence it was clear that finding the most appropriate way to break text down to its purest elements should not set back our efforts to perform sequence-level tasks and develop useful applications. Thus, concessions must be made in the form of selecting a unit easily extractable from text and working with it. This necessity coincides with the reality of having English as the overwhelmingly central target of NLP applications and easiest source of data. The focus on a language with mostly isolating morphology, where morphemes often occupy distinct word forms that are related through sentence-level syntax, conspired with the technical ease of detecting whitespace in text and led to an inevitable starting point for the community in using the space-delimited word as the basic unit of text analysis. The very name of the fundamental bag-of-words (BoW) approach illustrates the implicit synonymity of "word" and "basic unit of representation" in NLP jargon. Although subword- and multiword-level systems were designed and developed outside this paradigm, mostly citing a non-English motivation, when the neural revolution came, the predominant methods again anchored the field to the space-delimited word as the atom.

The most obvious advantage of this approach is its simplicity, considering how difficult it is in practice to extract correct sub-word morphemes directly from text. Historically-entrenched orthographic conventions and local-context phonological processes lead to phenomena such as variance in morpheme form across instantiations, for example the disappearance of the stem's final e in unbelievable or the s-t alternation in derivations like Mars-Martian, making a deterministic mapping from surface form to morpheme sequence impossible.
The lack of overt textual marking of morpheme boundaries (except for the uncommon case of hyphenation) also leads to ambiguous segmentation in words like unionize, and the general property of our sound and writing systems' inventory being relatively small leads to the incidence of affix-identical sequences in single-morpheme words like reply (cf. shortly) and bring (cf. lying). Automatic detection of morphemes can be achieved today by unsupervised data-driven systems like Morfessor (Creutz and Lagus, 2002, 2007), which rely on large amounts of training data and provide no guarantee of finding the true morphemes in all cases or for all downstream applications.

The idea of breaking down concepts in language into numerically-valued axes has played a role in the formation of the modern research landscape in linguistics. Osgood (1952) proposed a low-dimensional space in which nominal objects and concepts are represented by values associated with characteristics which may describe them, such that "eager" and "burning" share a value along the weak ⇔ strong dimension, while differing along the cold ⇔ hot dimension. The values were elicited from human subjects. Scaling this very linguistically-motivated approach manually over an entire language is at the very least impractical, and over the years, relaxations of this scheme for defining word representations distributed along dimensions gave rise to more automation-friendly processing techniques. Most crucial was the realization that the individual dimensions in the representation space do not have to be meaningful in and of themselves. Liberating the dimensions from their labels allowed the number of dimensions to be governed by concerns of data availability and computational memory and power, rather than by the precision of our semantic theory and ontological thoroughness; it allows for the discovery of unnamed but possibly useful similarities and distinctions between concepts; and it "leaves room" for new properties to be learned if, for example, a domain shift occurs during the process of applying an embedding-based system to a downstream task.

Embedding concepts into a "blank" vector space using learning methods turns the implied causal direction that motivated Osgood's framework on its head: instead of creating the embeddings based on what we know about language and the relations between concepts, the latter become the proxy target by which we can measure whether or not the embeddings learned by our model are useful to us. Starting with an arbitrary metric space with well-known properties, such as R^d, becomes a great advantage, as the space comes with metrics and operations which are easy to conceptualize and imagine as the necessary proxies. The formative instance of this realization was the ability to score the relative directionality of two vectors using the cosine similarity function, producing scores which can be compared to annotations in word similarity resources such as RG-65 (Rubenstein and Goodenough, 1965), where human subjects were asked to score word pairs without the hassle of decomposing them into their semantic properties first. Metric space also affords the intuitive parallelogram metaphor of word analogy, haunting every introductory text and presentation on embeddings with the equation king − man + woman ≈ queen.
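The sketch below works through both of these proxies on a toy vocabulary: cosine similarity as relative directionality, and the parallelogram analogy. The four-dimensional vectors are made-up numbers chosen only so the arithmetic is easy to follow; real embeddings would come from a pre-trained model.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Relative directionality of two vectors: 1.0 for parallel, 0.0 for orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings; the numbers are invented for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.8, 0.3]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.2, 0.9, 0.2]),
}

print(cosine(emb["king"], emb["queen"]))    # similar words point in similar directions

# Parallelogram analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
scores = {w: cosine(v, target) for w, v in emb.items()}
print(max(scores, key=scores.get))          # 'queen' with these toy vectors
```

In practice, analogy evaluation excludes the three query words from the candidate set; with the toy numbers above, queen happens to win even without that exclusion.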
The development of the distributed view of representation for linguistic objects accompanied the rise of methodologies making use of the distributional hypothesis, traditionally attributed to Harris (1954) and famously framed by Firth as "you shall know a word by the company it keeps". The maximalist interpretation of this adage as "a word is defined by applying a combination function to the set of its contexts", applied before the modern neural era in influential methods such as Brown clustering (Brown et al., 1992), is an appealing principle to the embedding movement for good reason: breaking words down into contexts provides us with just the distributed fixed dimensions we seek. Once we decide exactly what "context" means to us, we can programmatically extract all contexts for all target words given only a corpus, and base our latent dimensions (whose number is limited to hundreds or thousands for practical reasons) on them. The two methods which ended up dominating the distributional embeddings landscape share a definition of context, essentially "words that appear near the target word", but translate this decision into embeddings differently. In SkipGram (Mikolov et al., 2013a), dimension significance is built "bottom-up" from a random initialization and a traversal of the corpus; in GloVe (Pennington et al., 2014), dimensions are the result of an implicit reduction of the full V × V co-occurrence matrix, where V is the number of words in our vocabulary. The former approach was inspired by early embedding systems (Bengio et al., 2003) developed around the task of language modeling, which is defined with an expectation based in distributional signals, while the latter has origins in latent semantic analysis (LSA; Deerwester et al., 1990). Evaluation on intrinsic tasks such as similarity datasets and analogy benchmarks (e.g., Finkelstein et al., 2001; Mikolov et al., 2013b; Hill et al., 2015) cemented distributional word embeddings as the representation go-to and an accessible replacement for one-hot encodings in a host of applications, while performance on downstream tasks within deep learning systems advanced the understanding of the utility that pre-training can afford end-to-end systems which include an embedding layer (Collobert and Weston, 2008; Collobert et al., 2011).

The choice of space-delimited words as the basic unit for representation, and the large resource investment necessary to pre-train a distributional model over a large corpus, in both money and time, create a situation where vectors can mostly be trusted as long as the words they represent are present in the pre-training corpus. The models so far discussed have no intrinsic ability to represent words not present in their lookup table, known as out-of-vocabulary words, or OOVs (Brill, 1995; Brants, 2000; Plank, 2016; Heigold et al., 2017; Young et al., 2018). Empirical analyses such as the one in Pinter et al. (2017) show that indeed, the overwhelming majority of downstream datasets contain words not present in the pre-training corpora. Pinter et al. (2020a) present a diachronic dataset showcasing the volume of novel terms entering a large, steady daily publication in English over time; but even a snapshot of a language at a given moment contains unlimited domain-specific terms, morphological derivations, named entities, potential loanwords, typographical errors, and other sources of OOVs which may very reasonably appear in text analysis tasks and which the downstream model should be given the faculty to handle.
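Returning to the shared definition of context above ("words that appear near the target word"), the following sketch extracts windowed co-occurrence counts from a toy corpus; these counts are the kind of raw material that SkipGram consumes as (target, context) training pairs and that GloVe aggregates into its co-occurrence matrix. The window size and the toy corpus are arbitrary choices for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    # For every target token, count the tokens appearing within `window` positions
    # on either side: the shared notion of "context" behind SkipGram and GloVe.
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

corpus = "the large white dog ate the hot dog left on the dining table".split()
print(dict(cooccurrence_counts(corpus)["dog"]))
# {'large': 1, 'white': 1, 'ate': 1, 'the': 2, 'hot': 1, 'left': 1, 'on': 1}
```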
In fact, according to Kornai (2002), statistical reasoning leads us to conclude that languages have an infinite vocabulary. But even if a language's word set were finite, and all present in some corpus, practical memory and lookup constraints would still limit embedding tables to non-exhaustive vocabularies.

To overcome the intrinsic limits of corpus-learned embedding tables, the distributional approach has given rise to heuristics that try to initialize embeddings for OOVs beyond the trivial random-initialization fallback. If one were to stay true to Firth's maxim, one possible strategy would be to keep SkipGram's context embedding table as well as the main table (for "target" words), and initialize OOV embeddings based on the context in which they are first encountered (Horn, 2017). This approach has not caught on, and instead most practitioners took to the use of a single special embedding, commonly rendered as UNK, named as an abbreviation of unknown (Bengio et al., 2003). In a pre-training stage, such an embedding is learned by replacing a small percentage of the corpus with a dedicated token, thus gaining at least some prior for an initialization, in some sense an average over possible contexts for encountering any word. This approach is brutally simplistic; it assumes not only that all novel words are representable using the same approximation technique, but that they are all exactly the same. The first assumption alone is easy to dispute: a careful observation of any taxonomy of word formation processes (Lieber, 2005; Plag, 2018) suggests that embedding new words into an existing space must involve considering multiple approaches in parallel.

• Words created by processes at the multi-word level, such as compounding or blending, require means of extracting the underlying words and composing the semantic contribution from each of them. For example, brunch is a blend of breakfast and lunch; a reasonable initial embedding can be the mean vector for these two words, hopefully keeping it at a high similarity with other meals and the appropriate time of day (a code sketch of this and related strategies appears below).

• Words that are inflections of known words, for example ameliorating, can benefit from a morphological analysis which finds their stem and syntactic suffix, placing the new vector at the sum of the verb ameliorate and the generalized notion of -ing verbs, if one is realized in the embedding space (arguably, in a good space it should at least be reliably extractable).

• Novel named entities such as Lyft or SARS-CoV-2, more often than not, reflect arbitrary naming practices and cultural primitives, and even recognition of their type (person / organization / location, etc.) might well be impossible without access to knowledge bases covering the appropriate domain, noting explicitly where in concept space the novel word should be embedded.

• Some OOVs are the result of unpredictable subword processes such as typographical errors (typos) and stylistic variation, like the aforementioned expressive lengthening. In such cases, it is sometimes best to opt out of creating a new embedding at all and simply map the new form to the existing embedding of its intended canonical word form. This choice will depend on the intended application; in certain cases like sentiment analysis, the stylistic information itself is essential.
• Loanwords like vespa originate in a different language than the one the embedding was produced for, but in some cases we have access to an embedding space for the origin language and a function which translates between the two languages' spaces. A system which can detect the word and its origin, perhaps overcoming processes like writing-system transliteration and phonological adaptation, can start by embedding the target-language word in a position projected from the source language's embedding for the equivalent word form.

This is not a comprehensive list. More types of novel words are identified in Pinter et al. (2020a), and not all suggestions in the taxonomy above correspond to actual existing work. Limiting this discussion to a strict interpretation of written-form uniqueness also prevents us from considering as OOVs concepts which are spelled in the same way as other words, either by chance (homography, for example row as a line or a fight), by naming (e.g., Space Force), or by processes such as zero-derivation (the verb smoke, derived from the noun). In languages other than English, some OOV-creating forces may be more dominant in word formation. Morphologically rich languages, as one edge case, feature large percentages of OOVs in novel texts of a given size compared to English, and this property is often compounded by the fact that many of these are low-resource languages, possessing a relatively small corpus-extracted vocabulary to begin with.

The richness and unpredictability of the OOV problem calls for complementing the word representation systems obtained distributionally with additional approaches, which is the focus of this survey. The first approach considered is an attempt to break the space-delimited word paradigm and get at the finer atomic units of meaning, which can then either be used as the fundamental representation layer, or induce better representations at the word level. This perspective, known as the compositional approach, is inspired mostly by cases where insufficient generalization is achieved over morphological word formation processes. Under the compositional framework, an ideal representation for unbelievable can be obtained by (1) detecting its three morphological components un-, believe, and -able, (2) querying reliable representations learned for each of them, distributionally or otherwise, and (3) properly assembling them via some appropriate function. (Throughout, I use the term subword to denote textual units which are largely between the character level and the word level, when no guarantee of their morphological soundness is attempted; in appropriate contexts, this can also denote word-long or character-long elements which are nevertheless obtained by a subword tokenizer.) Each of these three steps is a challenge in itself and open to various implementational approaches. Learning representations for subword units is usually done by considering the subword elements in unison with the full word while applying a distributional method (e.g., Bojanowski et al., 2017), but some have opted for pre-processing the pre-training corpus such that only lemma forms exist as raw text and the other tokens are explicit representations of the morphological attributes attached to each lemma (Avraham and Goldberg, 2017; Tan et al., 2020), inducing the production of more consistent vocabularies. Yet others leave the learning to the downstream task itself, feeding off the backpropagated signal from the training instances (Sutskever et al., 2011; Ling et al., 2015; Lample et al., 2016; Garneau et al., 2019), while others train a compositional network based on the word embedding table in an intermediate phase between pre-training and downstream application (Pinter et al., 2017; Zhao et al., 2018).
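The sketch below, referenced in the word-formation taxonomy above, illustrates two of the simpler strategies: initializing a blend like brunch as the mean of its constituents' vectors, and collapsing expressive lengthening back onto an existing canonical form. The embedding table is filled with random stand-in vectors, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical pre-trained table; random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
table = {w: rng.normal(size=50) for w in ["breakfast", "lunch", "long", "ameliorate"]}

def embed_blend(constituents):
    # e.g. brunch as a blend of breakfast + lunch: average the constituents' vectors.
    return np.mean([table[w] for w in constituents], axis=0)

def map_lengthened(form):
    # Expressive lengthening ("looooong"): collapse repeated characters and reuse the
    # canonical form's existing vector instead of creating a new one. (Naive: words
    # with legitimate double letters would be over-collapsed by this rule.)
    collapsed = "".join(c for c, prev in zip(form, " " + form) if c != prev)
    return table.get(collapsed)

table["brunch"] = embed_blend(["breakfast", "lunch"])
print(np.allclose(table["brunch"], (table["breakfast"] + table["lunch"]) / 2))  # True
print(map_lengthened("looooong") is table["long"])                              # True
```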
The composition function from subwords to the word level is also open to many different approaches: prior work has opted for construction techniques as diverse as using the subword strings as one-hot entries to represent the words themselves (Huang et al., 2013); summing morpheme embeddings to produce word embeddings (Botha and Blunsom, 2014); traversing a possibly deep morphological parse tree using a recursive neural network (Luong et al., 2013); positing probabilistic word embeddings for which the morpheme embeddings act as a prior distribution (Bhatia et al., 2016); side-by-side training of both word-level and character-level modules followed by concatenating the resulting representations, to allow the downstream model to learn from both levels independently and control the interaction terms directly; assembling a hierarchical recurrent net that progressively encodes longer portions of text in each layer (Chung et al., 2019); or dispensing with the word level altogether and just representing text with a single atomic layer of characters or subwords (Sennrich et al., 2016).

Most challenging of all is the detection of the subwords themselves. As noted above, morphemes are hard to detect from the surface form of a word. For the default setting where no curated resources exist to allow correct morpheme extraction from a word's form, as is the case in nearly all languages in the world, the mainstream of compositional representation research has centered on the raw character sequence, the unarguable atom of text (at least in languages using the Latin script, like English; Chinese text analysis has benefitted from decomposing characters into strokes or radicals, Hebrew and Arabic include diacritical marks that are not character-intrinsic, and elsewhere, treatment of individual bytes from the Unicode representation of characters has also shown merit), which is used either via direct operation or as a basis for heuristics that define subword units based on statistical objectives. The great advantage of using characters or primitive character n-grams as the atomic unit for the model (Santos and Zadrozny, 2014; Kim et al., 2016; Wieting et al., 2016; Bojanowski et al., 2017; Peters et al., 2018) is that it rids us of the need to explicitly designate morphemes altogether; the challenge is to still capture the information they convey, somehow. In contrast, heuristically learning a subword vocabulary from information-theoretic notions (Sennrich et al., 2016; Kudo and Richardson, 2018) or from the character-sequence unigram distribution (Kudo, 2018) may find us many true morphemes, but there is no guarantee of either precision or recall: corpus collection effects are significant in determining the ultimate vocabulary, orthographic norms may still obfuscate many useful generalized morphemes, and many frequent character sequences may enter the subword vocabulary as the result of coincidental quirks. For example, the character sequence eva might contribute to the representation of unbelievable, passing along signals learned from unrelated words such as Eva or evaluate.
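A sketch of this style of composition, in the spirit of the character n-gram model of Bojanowski et al. (2017): the word vector is the sum of vectors for its boundary-padded character n-grams. The n-gram range and the random vectors are illustrative stand-ins, but the extraction itself shows the coincidental eva unit appearing inside unbelievable.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    # Pad with boundary markers, then take all character n-grams in the given range.
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

grams = char_ngrams("unbelievable")
print("eva" in grams)   # True: a coincidental unit shared with 'Eva' and 'evaluate'

# Hypothetical n-gram table; a real one would be trained distributionally alongside words.
rng = np.random.default_rng(0)
ngram_table = {g: rng.normal(size=50) for g in grams}

# Composition by summation: the word vector is the sum of its n-gram vectors.
word_vec = np.sum([ngram_table[g] for g in grams], axis=0)
print(word_vec.shape)   # (50,)
```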
The ever-growing popularity of systems which use such vocabularies in conjunction with the null composition function, which ignores sub-word hierarchy and passes the downstream model the embeddings corresponding to the raw subword sequence (see §8), prevents any possibility of correcting erroneous subword tokens at the word level: in this scenario, the next processing layer of the model will use the embedding for eva as if it were a part of the input just as important as a frequent word like house.

Another way to complement distributionally-trained embeddings is to incorporate signals from curated type-level relational resources. The prominent category of such resources is semantic graphs, such as WordNet (Fellbaum, 1998) and BabelNet (Navigli and Ponzetto, 2010), which encode the structural qualities of language as a representation of human knowledge. The core goal of semantic graphs is to describe connections between referents in the perceived and conceived world, and to this end they make an explicit distinction between words as character sequences and an internal semantic primitive which we can call concepts. Concepts form the chief node type in the semantic graph, connected by individual edges typed into relations such as hypernymy (elm "is a" tree) or meronymy (branch "is part of a" tree), as well as linguistic facts about concept names (shop.verb "is derivationally related to" shop.noun) which make use of the word-form partition of the graph's node set. In a similar vein, relations which straddle the divide between form and function, like synonymy, are extractable from the bipartite subgraph relating word forms and their available meanings.

In the context of language representation, these structures offer a notion of atomicity stemming from our conceptual primitives, an attractive premise. They may not answer all needs arising from inflectional morphology (since syntactic properties do not explicitly denote concepts) or some of the other word formation mechanisms, but the rich ontological scaffolding offered by the graph, and the prospect of assigning separate embeddings to homonyms in a model-supported manner, assuming sense can be disambiguated in usage, seem much "cleaner" than relying on large corpora and heuristics to statistically extract linguistic elements and their meaning. In addition to this conceptual shift, as it were, the graph structure itself provides a learning signal not present in linear corpus text, relating the basic units to each other through various types of connections and placing all concepts within some quantifiable relation of each other (within each connected component, although lack of any relation path is also a useful signal). The structure can also occupy the place of the fragile judgment-based word similarity and analogy benchmarks, allowing more exact, refined, well-defined relations to be used for both learning the representations and evaluating them. Methods which embed nodes and relations from general graph structures before even considering any semantics attached to individual nodes and edges, like node2vec (Grover and Leskovec, 2016) and graph convolutional nets (Schlichtkrull et al., 2017), indeed serve as a basis and inspiration for many of the works in this space.

The relational paradigm complements the distributional one in a fundamentally different manner than the compositional one does, and this difference has bearing on the OOV problem, which can be viewed from several perspectives.
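Before turning to those perspectives, the relation types described above can be inspected directly; the sketch below queries WordNet through NLTK's interface, assuming the nltk package and its WordNet corpus are available. The specific senses and relations returned depend on the WordNet version.

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time corpus download
from nltk.corpus import wordnet as wn

# Concepts ("synsets") are distinct from the character sequence used to look them up.
for synset in wn.synsets("elm"):
    print(synset.name(), "->", [h.name() for h in synset.hypernyms()])  # hypernymy

tree = wn.synsets("tree")[0]
print(tree.part_meronyms())            # meronymy: parts of a tree, if listed for this sense

shop_verb = wn.synsets("shop", pos=wn.VERB)[0]
print(shop_verb.lemmas()[0].derivationally_related_forms())  # form-level link to shop.noun
```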
First is the potential of semantic graphs to improve the representation of words that are rare or not present in a large corpus used to initialize distributional embeddings. This has proven to be a powerful direction through methods such as retrofitting (Faruqui et al., 2015), where embeddings of related concepts are pushed together in a post-processing learning phase, showcasing WordNet's impressive coverage of English domain-specific taxonomies such as the classical natural sciences. Elsewhere, properly modelling hypernymy, for example, has been found to help understand text with rare words whose hypernyms are well-represented in the pre-training corpus (Shwartz et al., 2017). Still, semantic graphs provide only a partial solution to the overall goal of OOV impact mitigation, given their limited scope and heavy reliance on expert annotation.

From the other direction, systems relying on semantic graphs for applications such as question answering and dialogue generation are likely to encounter "OOVs" of their own, i.e. words and concepts not present in the underlying graph. Unlike the corpus-OOV problem, which cannot be quantified convincingly without selecting a specific downstream task first, coping with graph-OOVs can be examined through tasks intrinsic to the graph structure itself. One such task is relation prediction, where we assume a concept has a known connection with some other concept, and need to figure out which one. Depending on our perspective, either the source or the target of the relation may be the OOV concept; for example, on first encounter of the concept Indian lettuce, we wish to know its hypernym from our set of known concepts. This task is also useful for a similar class of graphs known as knowledge graphs (KGs), such as Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014), which differ from semantic graphs in several aspects. While WordNet curates connections between semantic concepts and dictionary entries, including certain aspects of the physical world (e.g. "an elm is a tree"), KGs focus on real-world entities and often time-sensitive encyclopedic knowledge (e.g. "Satya Nadella is the CEO of Microsoft"). WordNet is a manually-crafted resource created by language and domain experts, whereas many KGs are either crowdsourced or automatically extracted from databases and large text corpora. As a result, KGs are typically disconnected, shallow, and sparse, boasting areas of hubness and areas of isolation; this contrasts with semantic graphs, where systematic connectedness and hierarchy have been observed (Sigman and Cecchi, 2002). KGs are also distinguished by the richness of their relation type variety, in the hundreds or thousands, compared to WordNet's 18 relation types (including seven pairs of relations reciprocal to each other). Nevertheless, much of the work on the relation prediction problem has been developed and evaluated on both semantic and knowledge graphs, as well as on derived tasks like graph completion, where the entirety of a node's connections are to be inferred at once, imitating real-world scenarios of knowledge discovery. Over the years, distributional methods have been used to feed increasingly complex neural nets predicting relations by embedding both concept nodes and relation edges based on corpus-trained tables, to a large degree of success (e.g.
Nickel et al., 2011; Socher et al., 2013; Bordes et al., 2013; Yang et al., 2014; Neelakantan et al., 2015; Ji et al., 2015; Shi and Weninger, 2017; Dettmers et al., 2018; Nathani et al., 2019). The basic idea calls for embedding concepts into a metric space and modeling relations by some operator that induces a score for an embedding pair input, either by translating the concept vectors, combining them via bilinear operators, projecting them onto a "scoring scale", or designing an intricate deep system that finds complex relationships. While these systems achieve impressive results, they all build on an implicit assumption that relation prediction is a strictly local task: the fit of an edge can be estimated from the nodes it connects and the intended label alone. In KGs, where structure is of secondary concern, this assumption may go a long way before its limitations begin to strain performance; in semantic graphs, where structure is much more crucial, it is increasingly likely that connections are predicted which should not be permissible under enforceable structural constraints, e.g. the requirement that the hypernym graph contain no cycles. Some systems indeed go beyond the individual edge to embed and predict relations, for example the idea of a path prediction task (Guu et al., 2015) which demands more reliance on structure, or embedding methods leveraging local neighborhoods of relation interactions and automatic detection of relations from syntactically parsed text in an iterative manner (Riedel et al., 2013; Schlichtkrull et al., 2017). Others have constructed prediction models where an adversary produces examples which violate structural constraints such as symmetry and transitivity (Minervini and Riedel, 2018). Pinter and Eisenstein (2018) present a system which improves WordNet prediction by augmenting the distributionally-obtained signal with features (motifs) representing the global structure of the semantic edifice. In addition to the task benefit, the emerging feature weights lead to the discovery of some general properties of English semantics.

Recent developments in NLP have brought about a shift in the balance depicted so far with respect to the atomic level chosen to represent language in applications and the approaches taken to create these representations. Advances in multi-task learning and transfer learning, both in non-neural NLP and in non-NLP deep methods, matured well enough to allow deep NLP to use them effectively as well. The increase of available computation power and the extreme utility found to lie in recurrent nets, most notably the Long Short-Term Memory cell (LSTM; Hochreiter and Schmidhuber, 1997), led to a series of works suggesting the incorporation of instance-specific context into the feature extraction part of a model, before applying any task-specific elements, beginning with simple prediction tasks (Melamud et al., 2016), followed by near-full coverage of core NLP (Peters et al., 2018). The next step was to continue training the shared-architecture context learner, which we can now safely call a language model, during the downstream step, in a process known as fine-tuning (Howard and Ruder, 2018).
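A minimal PyTorch-style sketch of this pre-train-then-fine-tune pattern, with a toy recurrent encoder and random data standing in for a genuinely pre-trained language model and a real downstream dataset; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    # Stand-in for a pre-trained context learner (e.g. a biLSTM language model);
    # in practice its weights would come from large-scale language-model pre-training.
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        states, _ = self.lstm(self.emb(token_ids))
        return states                      # one contextualized vector per input token

class FineTunedClassifier(nn.Module):
    def __init__(self, encoder, num_labels=2):
        super().__init__()
        self.encoder = encoder             # the shared, "pre-trained" architecture
        self.head = nn.Linear(2 * 64, num_labels)   # task-specific layer added downstream

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids).mean(dim=1))   # pool, then classify

encoder = ContextEncoder()                 # pretend this was pre-trained on a large corpus
model = FineTunedClassifier(encoder)
# Optimizing *all* parameters, encoder included, is what makes this fine-tuning
# rather than feature extraction over frozen contextualized vectors.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

tokens = torch.randint(0, 1000, (8, 12))   # a toy batch of token ids
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(tokens), labels)
loss.backward()
optimizer.step()
```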
Design and processing power considerations, but also downstream performance, fueled the shift (Radford et al., 2018) from recurrent net infrastructure to transformer models (Vaswani et al., 2017), which in turn facilitated another major conceptual innovation where autoregressive token prediction was replaced by masked language modeling, in which sequence-medial tokens are hidden from the representation layer and must be predicted based on the remaining context (Devlin et al., 2019; Liu et al., 2019). Throughout this evolution, one main principle remained stable: the language prediction task acts as the pre-training step, providing a scaffolding model which is capable of representing tokens within a sequence at a level of effectiveness that allows downstream tasks to begin training with meaningful contextualized representations. The heart of contextualization lies in the distributional approach.

The design of these pre-training tasks meant they could no longer tolerate OOV tokens at the rate encountered by static embedding algorithms, as that might render the models unusable for any words that appear in context with OOVs downstream, rather than just the OOVs themselves. On the other hand, the prediction layer creates a computational bottleneck which scales with the size of the vocabulary, since every token must be available for prediction at all model steps. Therefore, these models resorted to compositional techniques for the bottom layer where the input sequence is processed into tokens. The character convolution net selected for ELMo (Peters et al., 2018) did not gain traction, possibly because it did not provide an adequate method for predicting text from the output layer, and so subsequent models, particularly those relying on transformers, operate over a sequence of equal-status tokens, each representing a word or a subword, from a mid-size vocabulary (tens of thousands) built in a pre-pre-training phase using the statistical heuristic techniques mentioned in §6. These models inherit the problems endemic to these methods, such as inadequacy for certain OOV classes, morphological unsoundness, and length imbalance, as well as issues like the added burden they impose on already length-limited token sequences. Common wisdom seems to hold that they make up for these shortcomings within the depths of their fully-connected transformer layers, and end up with satisfactory top-layer representations. Recent work challenging these models with truly novel word forms suggests otherwise (Pinter et al., 2020a,b), while work on either incorporating the compositional signal into subword-vocabulary transformers (Ma et al., 2020; Aguilar et al., 2020; El Boukkouri et al., 2020; Pinter et al., 2021), or replacing the subwords with characters or bytes altogether (Clark et al., 2021; Xue et al., 2021), is rapidly gaining traction as well.
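As a small illustration of such a fixed subword vocabulary at work, the sketch below runs a few word forms through a pre-trained tokenizer; it assumes the HuggingFace transformers package is installed and that the tokenizer files can be downloaded or are cached locally, and the exact pieces produced depend on the chosen vocabulary.

```python
# Assumes the `transformers` package and access to the pre-trained tokenizer files.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["unbelievable", "looooong", "SARS-CoV-2"]:
    print(word, "->", tok.tokenize(word))
# The resulting pieces are equal-status tokens from a fixed vocabulary; they need not
# align with morphemes, and a truly novel form can shatter into many arbitrary units.
```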
References

Char2subword: Extending the subword embedding space from pre-trained models using robust character compositionality
The interplay of semantics and morphology in word embeddings
A neural probabilistic language model. The Journal of Machine Learning Research
Sound advice on brand names
Morphological priors for probabilistic neural word embeddings
Sound symbolism in English: Weighing the evidence
Enriching word vectors with subword information
Freebase: A collaboratively created graph database for structuring human knowledge
Translating embeddings for modeling multi-relational data
Compositional morphology for word representations and language modelling
TnT: A statistical part-of-speech tagger
Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging
Class-based n-gram models of natural language
Hierarchical multiscale recurrent neural networks
Canine: Pre-training an efficient tokenization-free encoder for language representation
A unified architecture for natural language processing: Deep neural networks with multitask learning
Natural language processing (almost) from scratch
Unsupervised discovery of morphemes
Unsupervised models for morpheme segmentation and morphology learning
Cours de linguistique générale
Indexing by latent semantic analysis
Convolutional 2D knowledge graph embeddings
BERT: Pre-training of deep bidirectional transformers for language understanding
Word: A typological framework
CharacterBERT: Reconciling ELMo and BERT for word-level open-vocabulary representations from characters
Retrofitting word vectors to semantic lexicons
WordNet: An Electronic Lexical Database
Placing search in context: The concept revisited
Attending form and context to generate specialized out-of-vocabulary words representations
node2vec: Scalable feature learning for networks
Traversing knowledge graphs in vector space
Distributional structure. Word
How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse?
SimLex-999: Evaluating semantic models with (genuine) similarity estimation
Long short-term memory
Context encoders as a simple but powerful extension of word2vec
Universal language model fine-tuning for text classification
Learning deep structured semantic models for web search using clickthrough data
Knowledge graph embedding via dynamic mapping matrix
Character-aware neural language models
Gestalt Psychology (2nd edn), New York
How many words are there? Glottometrics
Subword regularization: Improving neural network translation models with multiple subword candidates
SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing
Neural architectures for named entity recognition
English word-formation processes
Finding function in form: Compositional character models for open vocabulary word representation
RoBERTa: A robustly optimized BERT pretraining approach
Better word representations with recursive neural networks for morphology
CharBERT: Character-aware pre-trained language model
Mapping unseen words to task-trained embedding spaces
context2vec: Learning generic context embedding with bidirectional LSTM
Efficient estimation of word representations in vector space
Linguistic regularities in continuous space word representations
Adversarially regularising neural NLI models to integrate logical background knowledge
Learning attention-based embeddings for relation prediction in knowledge graphs. Annual Meeting of the Association for Computational Linguistics
BabelNet: Building a very large multilingual semantic network
Compositional vector space models for knowledge base inference
Poincaré embeddings for learning hierarchical representations
A three-way model for collective learning on multi-relational data
The nature and measurement of meaning
GloVe: Global vectors for word representation
Deep contextualized word representations
Predicting semantic relations using global graph properties
Mimicking word embeddings using subword RNNs
NYTWIT: A dataset of novel words in the New York Times
Will it unblend?
Learning to look inside: Augmenting token-based encoders with character-level information
Word-formation in English
What to do about non-standard (or non-canonical) language in NLP
Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss
Improving language understanding by generative pre-training
Relation extraction with matrix factorization and universal schemas
Contextual correlates of synonymy
Learning character-level representations for part-of-speech tagging
Modeling relational data with graph convolutional networks
Neural machine translation of rare words with subword units
ProjE: Embedding projection for knowledge graph completion
Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection
Global organization of the WordNet lexicon
Reasoning with neural tensor networks for knowledge base completion
Generating text with recurrent neural networks
Mind your inflections! Improving NLP for non-standard Englishes with Base-Inflection Encoding
Observed versus latent features for knowledge base and text inference
Representing text for joint embedding of text and knowledge bases
Attention is all you need
Wikidata: A free collaborative knowledgebase
Charagram: Embedding words and sentences via character n-grams
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Embedding entities and relations for learning and inference in knowledge bases
Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine
Generalizing word embeddings using bag of subwords

Acknowledgments

This survey is an adapted version of the introduction to my PhD thesis. I thank my committee for helping to shape it: my advisor, Jacob Eisenstein; Mark Riedl, Dan Roth, Wei Xu, and Diyi Yang.