authors: Raman, Natraj; Shah, Sameena; Veloso, Manuela
title: Structure and Semantics Preserving Document Representations
date: 2022-01-11

Retrieving relevant documents from a corpus is typically based on the semantic similarity between the document content and query text. The inclusion of structural relationships between documents can benefit the retrieval mechanism by addressing semantic gaps. However, incorporating these relationships requires tractable mechanisms that balance structure with semantics and take advantage of the prevalent pre-train/fine-tune paradigm. We propose here a holistic approach to learning document representations by integrating intra-document content with inter-document relations. Our deep metric learning solution analyzes the complex neighborhood structure in the relationship network to efficiently sample similar/dissimilar document pairs and defines a novel quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and structurally unrelated to be far apart in the representation space. Furthermore, the separation margins between the documents are varied flexibly to encode the heterogeneity in relationship strengths. The model is fully fine-tunable and natively supports query projection during inference. We demonstrate that it outperforms competing methods on multiple datasets for document retrieval tasks.

Figure 1: Retrieval framework overview. Document representations are learned to preserve both the semantic content and structural relationships by efficiently mining similar/dissimilar pairs and varying the separation margins dynamically. The documents are ranked by their distance to a query projected into the same representation space.

Furthermore, incorporating the state-of-the-art pre-train/fine-tune contextual language model paradigm [12] into graph neural systems is infeasible due to resource constraints. Metric learning [13] based inference is an effective alternative to graph traversal, with support for fine-tuning and inductive inference. The idea here is to use a simple distance function to separate the documents based on their similarity in the representation space. However, current attempts [14] using metric learning focus exclusively on semantic text similarity and suffer from the requirement for explicit labels to distinguish similar and dissimilar documents. These labels are often expensive to obtain and, even if available, their flat nature cannot capture the rich and complex network interactions inherent in a large corpus. Automatically determining these labels is non-trivial due to the combinatorial explosion induced by entangled neighborhood structures. In addition, existing models utilize Siamese [15] or triplet [16] architectures that are not suitable for encoding the different facets of similarity. We address these issues here by proposing a new deep metric learning based approach for learning document representations that accounts both for intra-document content and inter-document relations. Our solution does not require any explicit labels, and instead dynamically constructs a relative measure of similarity to separate the documents in the representation space (see Figure 1). Specifically, the corpus structure is analyzed offline to arrange the documents in an increasing order of connectedness.
These ranked documents are repeatedly subdivided to sample structurally similar and dissimilar pairs, while the equivalent semantic pairs are constructed from the document content. This sampling procedure covers a wide range of neighborhoods in the relationship network and can scale well to large corpora. We extend the classical triplet loss [17] with a quintuplet loss function that simultaneously encourages document pairs that are semantically relevant to be closer and structurally unrelated to be far apart. This extension also addresses a key limitation in triplet loss, where the separation margins are fixed a priori. We instead allow the margins to grow geometrically based on the extent of structural similarity. This flexibility facilitates a relative order of separation in the representation space and enables distinguishing strong relations from weaker ones. In contrast to graph neural methods, our learned model allows the query representation to be computed seamlessly at inference. Furthermore, it supports long-form text and fine-tunes the Transformer [18] language model weights adaptively during training, thereby enabling task specific customization. We conduct experiments on multiple publicly available datasets [10, 19, 20, 21] and show that the proposed model outperforms competing methods. We also include an analysis of the learned embeddings. Our contributions are as follows:

• Beyond Semantics: A holistic approach to learning document representations that balances local document context with the global relationship network, thereby preserving both semantics and structure.
• Structure Mining: A novel mechanism to construct similar and dissimilar pairs of documents based on a divide and conquer sampling of the neighborhood structure.
• Relative Margins: A discriminative treatment of the representation space, encoding the nuanced relations between documents through variable units of separation.
• Quintuplet Loss: An efficient multi-input neural architecture that aggregates in parallel two different loss functions corresponding to structural and semantic facets.
• Inductive and Fine-tunable: A retrieval centric model that natively supports query projection and can be fine-tuned for task specific objectives.

In the following, Section 2 compares our work with related efforts, Section 3 describes the model in detail, Section 4 presents the results and Section 5 summarizes our findings.

Recent document retrieval approaches [22, 23] are fuelled by applying deep neural networks to rank relevant documents in response to a query. Our setting is an ad-hoc top-K retrieval scenario in which there is no access to relevance judgements during training and the ranking is purely based on text similarity. Of particular interest to this paper are neural language models [12, 24] that learn text representations by pre-training on a large unsupervised corpus and allow fine-tuning on a target task. While excelling at several sentence level tasks such as classification, their use of a cross-encoder makes them unsuitable for large scale semantic similarity search. A workaround is to learn embeddings that can be directly compared with a similarity metric as in [14]. The major drawback of the above models is that they focus exclusively on the content semantics, and ignore the valuable relationships between the documents. Furthermore, they cater to sentence level inputs rather than document level inputs.
While some models such as [10] define similarity with respect to document relations, the key difference from our work is that we explicitly account for both structure and semantics in tandem. The definition of an efficient label mining procedure based on neighborhood structure further differentiates our work. Our model modifies the conventional triplet [16] network architecture with multi-instance inputs and defines a custom loss function. There have been previous efforts in using generic n-tuple inputs [25, 26, 27], and a variety of loss functions such as contrastive loss [28], triplet-center loss [29], lifted loss [30], histogram loss [31], multi-similarity loss [32] and circle loss [33] have been explored before. While we share with these models the general intention of designing an objective function that assigns larger weights to informative inputs, our work differs in its focus on introducing different notions of similarity rather than just improving the pair selection strategy. Another extension in our model is the flexible variation of separation margins based on the relative strength of relationships. In [34], a dynamic violation margin for the triplet loss is formulated by constructing a class-level hierarchical tree. We differ from this by supporting a more generic graph structure. [35] proposes a graded mechanism to push inputs by distinct margins according to their relevance degree in order to construct coherent visual embeddings. In contrast to their image modality and semantic relevance, we focus on text and the relationship structure. Graph representation learning [36, 37, 38, 11] is an alternative approach to metric learning for incorporating network structure. However, its primary focus is on encoding network topology, and even when text attributes are included [39], they are treated as side information; consequently, it is not possible to fine-tune the language model for task specific customization with this approach. Notably, generalizing the embeddings to new unseen vertices is not straightforward with these models, and in comparison our model directly supports deriving the embeddings for out-of-sample query text.

This section first provides an overview of the model and then describes the components involved in detail. The problem of retrieving a limited number of relevant documents corresponding to a user provided query in a self-supervised, zero relevance label scenario is addressed here. A traditional retrieval mechanism that is purely based on the document content may fail to surface documents that only have a weak reference to the query and yet are highly relevant due to associational and deductive reasoning. For example, a query on coronavirus mortality may not sufficiently match a document that discusses the benefits of vaccinations. However, indirect associations between the concepts present in this document and the query make it highly relevant. A structured corpus could have already captured the strong relationship between documents that discuss virus mortality and vaccinations through topic tags or citations. We wish to utilize this auxiliary structural knowledge resource for efficient retrieval. Our approach involves training a document representation model that encodes the corpus structure along with the content semantics into the learned document embeddings in a metric learning setting. The key idea here is to ensure that similar documents are close in the representation space while dissimilar documents are well separated.
The similarity (and dissimilarity) is defined based on both a document's content and its relationships with other documents. The former allows meaningful comparison with a query representation, while the latter bridges the semantic gaps highlighted above between a query and a document during retrieval. The representation model employs multiple instances of a deep neural network with parameter sharing to learn fixed length document embeddings. The state-of-the-art attention based Transformer [18] neural architecture is adopted here. The network weights are initialized from a language model such as [12] that is pre-trained on a large unlabeled corpus, and the weights are adaptively tuned during training to capture target domain specific information. The network accepts similar and dissimilar document pairs as input, one each for content semantics and corpus structure. With K inputs and a training size of N, there are O(N^K) combinations, and hence the construction of less redundant and highly informative document pairs poses a significant challenge. To tackle this, we detail an efficient pair mining procedure in the sequel that is suitable for multi-hop propagation and is based on biased sampling. Furthermore, training this deep learning network requires defining an objective function that encourages compact grouping and dispersed separation appropriately. Conventional loss functions use a margin value that is constant across the representation space to quantify the separation between similar and dissimilar document pairs. We relax this assumption and propose a novel quintuplet loss function where the margins vary dynamically based on the relationship strength for additional flexibility. Finally, during inference a user defined query is converted into query embeddings using the trained document representation model. These query embeddings are compared with the document embeddings to obtain similarity scores, and the documents are simply ranked by these scores and returned.

Let D = {d_1, . . . , d_N} be a corpus of N documents. Let each document be composed of a tuple of text fragments d_i = (s_{i1}, . . . , s_{iS_i}), where S_i is the number of fragments in document i. The fragments may be sentences, paragraphs or any other segmentation unit, including simply overlapping windows of text, and are comprised of sequences of words (tokens). This formulation, which uses document fragments of manageable size as inputs rather than an entire document, provides direct support for long-form text. We assume the availability of relationship information between documents as part of the corpus structure. For example, documents that share the same topic code or documents that are referenced by another document may be treated as being related. Let A ∈ R^{N×N} be an adjacency matrix, where A_{ij} > 0 indicates a relationship between documents i and j. The adjacency matrix may be unweighted with A_{ij} ∈ {0, 1}, ∀ 1 ≤ i, j ≤ N, or use real valued weights to capture the strength of the relationships. When training the metric learning network, for a given anchor document i, we wish to identify a structurally related document φ_i^+ and an unrelated document φ_i^-. A naïve random sampling from row i of A to identify φ_i^+ and φ_i^- is inappropriate, and it is important to examine the neighborhood structure present in the adjacency matrix to account for higher order proximities.
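As a concrete illustration of the structures defined above, the following sketch builds the text fragments and the adjacency matrix from a list of relation pairs. This is not the authors' code; the function names, the symmetric treatment of relations and the window/stride values are illustrative assumptions.

```python
# Illustrative sketch only: fragments as overlapping token windows and an
# adjacency matrix A with A[i, j] > 0 for related documents. Names, the
# symmetric treatment of relations and the window/stride sizes are assumptions.
import numpy as np

def make_fragments(tokens, window=128, stride=64):
    """Split a token sequence into overlapping windows (the fragments s_i1 ... s_iSi)."""
    if len(tokens) <= window:
        return [tokens]
    return [tokens[k:k + window] for k in range(0, len(tokens) - window + 1, stride)]

def build_adjacency(n_docs, relation_pairs, weights=None):
    """Build A in R^{NxN}; relation_pairs is a list of (i, j) document index pairs."""
    A = np.zeros((n_docs, n_docs))
    for idx, (i, j) in enumerate(relation_pairs):
        w = 1.0 if weights is None else weights[idx]
        A[i, j] = w
        A[j, i] = w  # assumed symmetric: a citation or shared topic relates both documents
    return A
```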
Let f : A → Â be a function that performs link analysis [40] on A and produces an intimacy matrix Â ∈ R^{N×N} such that Â(i, j) quantifies the connectivity strength between documents i and j after analysing the entire link structure. We set f based on the PageRank algorithm and compute Â as

Â = α (I − (1 − α) Ā)^{-1},

where α ∈ [0, 1] is the damping factor, I is the identity matrix and Ā is the column normalized adjacency matrix. Let Υ_i = argsort(Â(i, :)) \ i denote the sequence of documents that are in increasing order of connectivity to a document i, with i being excluded. Let n be the largest position in the sequence where Â(i, Υ_{i,n}) > 0. We recursively subdivide this sequence to identify the potential sets of documents that can serve as positive and negative pairs, where Φ_{il} denotes the candidate similar and dissimilar documents at level l for document i. Figure 2 provides an illustration of the partition mechanism for two levels of a target document d_6 with n = 6, where d_4 is structurally the closest and d_2 is the farthest. Now φ_i^+ and φ_i^- are sampled uniformly from these candidates.

The above sampling procedure covers a wide range of the relationship network, with an increased focus on the structurally related documents over the unrelated ones during partition. This inherent bias towards selecting hard triples of anchor, similar and dissimilar documents is desirable in metric learning for faster convergence. Furthermore, the explicit characterization of partition levels provides an opportunity to tailor the separation margins relative to the level from which the pairs were sampled. For example, we would expect the separation between similar and dissimilar pairs to be larger when sampled from Φ_{i1} and smaller for Φ_{iL}, since the latter contains much harder examples. It is also necessary to identify a semantically relevant document ψ_i^+ and an irrelevant document ψ_i^- for a given anchor i. We set ψ_i^+ to a corrupted form of the anchor document. Specifically, we replace 25% of the tokens in the anchor with random tokens sampled from the vocabulary or a special token such as [MASK]. Such token replacements have been hugely successful in masked language modeling [12, 41] and force the model to distinguish the tokens based on the context, thereby avoiding overfitting. The irrelevant document ψ_i^- is computed through hard negative mining [42], i.e. the document amongst all the other documents in a batch that is semantically closest to the anchor is the hardest negative sample and is selected as the semantically irrelevant document.

The document embeddings are learned by an extension to the classical Siamese [15] and triplet [16] networks, with four different input branches that share the same neural architecture. The outputs from these branches are tied together using a common loss function. The four input branches correspond to an anchor i, a structurally similar document φ_i^+, a structurally dissimilar document φ_i^- and a semantically similar document ψ_i^+. Note that the semantically dissimilar document ψ_i^- is determined online dynamically as outlined above, and hence does not require a separate input branch. Figure 3 illustrates the network design. The Transformer neural architecture uses multiple self-attention and feed-forward layers to produce the aggregate representation of an input sequence. Let T : W → h be the Transformer function that accepts a sequence of tokens W and produces a fixed length vector h ∈ R^E, and let δ : (a, b) → R^+ be a distance function.
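The structural pair mining described above can be sketched as follows. This is an interpretation under stated assumptions rather than the authors' implementation: the recursive split rule below simply halves the ranked sequence at each level, drawing candidate positives from the more connected half and candidate negatives from the less connected half before recursing on the more connected half.

```python
# Hedged sketch of the pair mining step: PageRank-style intimacy matrix,
# connectivity ranking and level-wise candidate sets. The split rule below is
# an assumption consistent with the description, not the paper's exact formula.
import numpy as np

def intimacy_matrix(A, alpha=0.15):
    """Compute A_hat = alpha * (I - (1 - alpha) * A_bar)^-1 with column-normalized A_bar."""
    col_sums = np.clip(A.sum(axis=0, keepdims=True), 1e-12, None)
    A_bar = A / col_sums
    n = A.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A_bar)

def candidate_levels(A_hat, i, num_levels=2):
    """Return a list of (similar candidates, dissimilar candidates) per level for anchor i."""
    order = np.argsort(A_hat[i])          # documents in increasing order of connectivity to i
    order = order[order != i]             # exclude the anchor itself
    order = order[A_hat[i, order] > 0]    # keep only positions with positive intimacy
    levels = []
    for _ in range(num_levels):
        half = len(order) // 2
        if half == 0:
            break
        levels.append((order[half:], order[:half]))  # more connected half vs. less connected half
        order = order[half:]                          # recurse on the harder, more connected half
    return levels

def sample_structural_pair(levels, rng):
    """Uniformly sample (phi_plus, phi_minus) from a randomly chosen level; also return the level."""
    l = int(rng.integers(len(levels)))
    similar, dissimilar = levels[l]
    return int(rng.choice(similar)), int(rng.choice(dissimilar)), l + 1
```

The returned level index can then be used to scale the structural separation margin during training, as described next.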
Since the documents are segmented into text fragments, it is necessary to first sample a fragment index and use the corresponding fragment as input to the Transformer network; the input embedding for each branch is the Transformer output for its sampled fragment. Thus we have five sets of embeddings, one each for the anchor and the φ_i^+, φ_i^-, ψ_i^+ and ψ_i^- documents, with the last entry being determined online by comparing an anchor embedding with all the other embeddings in a mini-batch B and picking the closest embedding.

The loss function must encourage the distance from the anchor embeddings to a similar document embedding to be less than that to a dissimilar document embedding during training. Furthermore, we need to incorporate a margin term to specify the minimum separation distance. Let m_φ and m_ψ be real-valued margin hyper-parameters corresponding to the structure and semantic components respectively, and let l_i be the partition level from which the structural pair was sampled. The quintuplet loss function L is defined as

L = (1 − γ) L_φ + γ L_ψ,

where L_φ is the structural relation loss, L_ψ is the semantic relevance loss and γ ∈ [0, 1] is a hyper-parameter that controls the relative importance between these two losses. It is important to note that the margin for structure is scaled by the inverse of the partition level, thereby varying the separation distance based on the connectivity strength of the similar and dissimilar documents to the anchor. This flexibility overcomes the limitation in traditional triplet loss where all the dissimilar points are pushed away by an equal margin.

Given a query sequence of tokens q during inference, its embedding vector is computed dynamically as h_q = T(q). This vector is compared with the embedding vectors corresponding to all the fragments in all the documents of the corpus for ranked retrieval. Let ∆ : (a, b) → [0, 1] be a bounded similarity function and r_{is} = ∆(h_q, h_{is}) be the query similarity score with fragment s of document i. We can now determine the set of top K similarity scores R_i from r_{is}, ∀ s = 1 . . . S_i. The aggregated similarity score for document i is then computed as

r_i = Σ_{k=1}^{K} w_k R_{i,k},

where R_{i,k} is the k-th highest score in R_i and w_k is the weight for position k. We set w_k = e^{−ωk}, where ω is a hyper-parameter. Intuitively, the similarity value at position 1 will contribute more to the overall document score than, say, the value at position 5. Finally, these scores are used to rank the documents in the corpus.

We first detail the different datasets and experiment settings used for evaluation, and then present the results and discussion. The SciDocs [10] dataset contains a subset of scientific papers available in Semantic Scholar. We treat the paper abstracts as the text content, while the citation graph serves as the relationship network. The dataset provides ground-truths for a recommendation task, which can be used for evaluating document retrieval. In particular, it collects data from user clickthrough logs of an academic search engine to construct a set of similar papers for a query paper title. We use this clickthrough data purely for testing our model on query retrieval and do not use it for training. The other datasets used for evaluation are: (a) arXivCS [20], a document corpus from computer science publications in arXiv.org, with the citation structure mined from the TeX files, (b) Scholarly [19], which contains publications from the ACM digital library, and (c) ACL-ARC [21], a corpus of publications about computational linguistics. Since there are no explicit abstract tags in these datasets, we use the first 20 sentences of a paper as the text content. Similar to SciDocs, we use the references for relation structure.
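Before turning to the evaluation protocol, the following sketch pulls together the training objective and the inference-time scoring described earlier in this section. It is an interpretation rather than the authors' implementation: the hinge form of the two loss components and the m_φ / l_i margin scaling are assumptions consistent with the text, the Euclidean training distance matches the experiment settings, and the tensor names are illustrative.

```python
# Hedged PyTorch-style sketch of the quintuplet objective and the weighted
# top-K scoring. The hinge losses and the m_phi / level margin scaling are
# assumptions consistent with the description, not the paper's exact equations.
import torch
import torch.nn.functional as F

def quintuplet_loss(h_anchor, h_phi_pos, h_phi_neg, h_psi_pos, level,
                    m_phi=2.0, m_psi=0.5, gamma=0.5):
    """h_* are (B, E) embedding batches; level is a (B,) tensor of partition levels l_i."""
    dist = lambda x, y: F.pairwise_distance(x, y)           # Euclidean distance

    # Semantically dissimilar document: hardest in-batch negative for each anchor.
    with torch.no_grad():
        pairwise = torch.cdist(h_anchor, h_anchor)           # (B, B) anchor-to-anchor distances
        pairwise.fill_diagonal_(float("inf"))                # exclude self matches
        neg_idx = pairwise.argmin(dim=1)                     # closest other document in the batch
    h_psi_neg = h_anchor[neg_idx]

    # Structural relation loss with a level-dependent margin m_phi / l_i.
    loss_phi = F.relu(dist(h_anchor, h_phi_pos) - dist(h_anchor, h_phi_neg) + m_phi / level)
    # Semantic relevance loss with a fixed margin m_psi.
    loss_psi = F.relu(dist(h_anchor, h_psi_pos) - dist(h_anchor, h_psi_neg) + m_psi)

    # gamma = 0 concentrates the loss on structure, gamma = 1 on semantics.
    return ((1.0 - gamma) * loss_phi + gamma * loss_psi).mean()

def document_score(fragment_similarities, k=5, omega=0.05):
    """Aggregate a document's fragment similarities: r_i = sum_k w_k * R_{i,k}, w_k = exp(-omega * k)."""
    top_k, _ = torch.topk(fragment_similarities, min(k, fragment_similarities.numel()))
    positions = torch.arange(1, top_k.numel() + 1, dtype=top_k.dtype)
    return (torch.exp(-omega * positions) * top_k).sum()
```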
There is no explicit ground-truth available with the arXivCS, Scholarly and ACL-ARC datasets for evaluating retrieval. Hence, we use a citation context sentence (taken only after the first 20 sentences, to ensure it is out-of-sample) as the query and expect to find content from the cited paper in the retrieved results.

We compare the performance of our retrieval model with several competing methods. Aggregated word vectors from Word2Vec [43] provide an effective baseline for evaluating distributed vector representations in the semantic feature space. For context-aware representations, we include the standard BERT [12] and DistilBERT [24] models with average embeddings of all the tokens, which tend to perform better than using the CLS token output for semantic similarity tasks. We also fine-tune these models with text from our target datasets using the standard language model training objective to evaluate the effects of exposing task specific content. Additionally, we compare with SciBERT [44], a language model based on BERT but pre-trained on a large corpus of scientific text, to assess the impact of structure inclusion on different types of Transformer models. For comparisons with equivalent loss functions, we use SentBERT [14], a state-of-the-art model that uses triples of anchor, semantically similar and dissimilar sentences to learn the embeddings. Finally, we also compare with a state-of-the-art deep graph representation algorithm, SGC [38], in which graph convolutions are applied over text features based on the neighbourhood structure.

The hyper-parameters are set as follows: structure separation margin m_φ to 2, semantic separation margin m_ψ to 0.5, relative loss value γ to 0.5, score factor ω to 0.05 and damping factor α to 0.15. The document fragments are created using overlapping windows of text with a sequence length of 128. The training procedure uses the Adam optimizer with a learning rate of 5e-5 and an epsilon of 1e-8. The distance function δ is set to Euclidean, while cosine similarity is used during inference. We train our models on an 8 GPU NVIDIA Tesla V100 instance with a batch size of 24 for 1 epoch.

The comparison results for the SciDocs dataset are furnished in Table 1. We use the Recall@K metric, which measures the percentage of relevant documents from the ground-truth being returned in the top K results. For this dataset, the query is the paper title as searched by the user, while the relevant documents are those papers from the search results that were clicked by the user and marked as ground-truth. While the Word2Vec and vanilla BERT/DistilBERT models perform poorly, fine-tuning the latter with the abstracts seems to improve the recall rate. The SciBERT model performs better than BERT/DistilBERT owing to it being already trained on scientific data. By including structure information, the SGC model improves over the above models. However, its lack of support for fine-tuning inhibits its ability to perform effective semantic similarity matching. The closest match to our model in performance is SentBERT, which also uses metric learning in a fine-tune paradigm, albeit considering only semantic similarity. The importance of adding structure with semantics is evident from the nearly 2% improvement offered by our models when compared with SentBERT, and an even larger difference for the SciBERT variant. Table 2 shows the quantitative comparisons for the other three datasets on the citation prediction task. The behavior seen for SciDocs is also consistently repeated for the Scholarly, arXivCS and ACL-ARC datasets.
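For reference, the Recall@K figures reported in Tables 1 and 2 can be computed per query with a small helper like the one below (illustrative code, not from the paper) and averaged over all queries.

```python
# Illustrative helper for Recall@K: the fraction of ground-truth relevant
# documents that appear among the top K retrieved results for a query.
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant)
    return hits / len(relevant)

# Example: two of the three relevant documents appear in the top 5 results.
print(recall_at_k([7, 3, 9, 1, 4, 8], [3, 4, 6], k=5))  # 0.666...
```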
Note that the citation context is used as the query for these datasets, and it is possible that the abstract of the paper being cited does not sufficiently match the query text in the citation context. This is reflected in the overall low recall rates across the models. However, the relative difference between the models is still a useful assessment measure, and the furnished results confirm the utility of including both semantics and structure rather than just the semantics.

We also use sample sentences from papers (which were never seen during training) as queries and assess whether the model returns the corresponding paper at the top position during retrieval. For SciDocs we use the title as the query, while for the other datasets we use a sentence from the paper body. Unlike the recommendation and citation tasks, this self prediction task intuitively relies only on the intra-document content and hence is a good measure of evaluating any loss of performance in semantic matching due to the incorporation of structure. The results are presented in Table 3 and yet again our model mostly outperforms the others. While our SciBERT model comfortably improves over the rest, our BERT/DistilBERT variant does not consistently exceed SentBERT. This is unsurprising since SentBERT's primary strength is semantic similarity, which aligns well with this particular task.

Sample queries and their corresponding top 5 retrievals for the SciDocs dataset are displayed in Figure 4. Snippets of the document fragments returned by a semantics only model are shown in blue, while those returned by our structure + semantics model are shown in green. A qualitative assessment substantiates their relevance to the query. The overlap in results between the two models is unsurprising considering their shared objective on semantic similarity. However, the model that incorporates structure seems more likely to return the truly relevant documents, as marked in the ground-truths.

In order to validate whether the structure information is reflected in the learned document embeddings, we plot in Figure 5 the distances between documents that are structurally related in green and the documents that are unrelated in red. It can be seen in the left figure that for the model that ignores structure, the related and unrelated documents have a similar distribution for the distances. In contrast, the right figure using our model appears less cluttered, with evident separation. This indicates that the learned embeddings have smaller distances between structurally similar documents and larger distances for dissimilar documents. Additionally, we visually verify the structure preserving nature of the model in Figure 6. The left side figure shows the ground-truth relationship structure for a few sample documents in the SciDocs dataset. For illustration purposes, the documents are colored by the subgraph they belong to. On the right side, the 2D embeddings of these documents obtained using a t-SNE [45] projection are plotted. It can be seen that the documents within the same subgraph remain cohesive in the learned representation space. This cohesiveness of structurally related documents is beneficial because even if a relevant document does not sufficiently match a query, by virtue of its proximity to another connected document that contains the query terms, the relevant document's probability of being retrieved increases. Such a behavior is essential to surface indirect associations and logical deductions.
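The two embedding checks above (the distance distributions in Figure 5 and the t-SNE projection in Figure 6) can be reproduced along the following lines. The sketch uses hypothetical variable names and scikit-learn's t-SNE, and is not the authors' analysis code.

```python
# Illustrative sketch of the embedding analysis: distances for structurally
# related vs. unrelated pairs, and a 2D t-SNE projection of the embeddings.
# Variable names and the sampling of unrelated pairs are assumptions.
import numpy as np
from sklearn.manifold import TSNE

def pair_distances(H, pairs):
    """Euclidean distances between embedding pairs; H is (N, E), pairs is a list of (i, j)."""
    return np.array([np.linalg.norm(H[i] - H[j]) for i, j in pairs])

def related_and_unrelated_pairs(A, num_unrelated, seed=0):
    """Read related pairs off the adjacency matrix and sample a set of unrelated pairs."""
    rng = np.random.default_rng(seed)
    related = [(int(i), int(j)) for i, j in zip(*np.nonzero(A)) if i < j]
    unrelated = []
    n = A.shape[0]
    while len(unrelated) < num_unrelated:
        i, j = rng.integers(n, size=2)
        if i != j and A[i, j] == 0:
            unrelated.append((int(i), int(j)))
    return related, unrelated

def project_2d(H, seed=0):
    """2D t-SNE projection of the learned document embeddings for visual inspection."""
    return TSNE(n_components=2, random_state=seed).fit_transform(H)
```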
We also investigate the effect of varying the structural margins in a relative manner rather than fixing them to a constant value. Given an anchor and a structurally similar document, we identify a set of documents with varying degrees of dissimilarity to the anchor and compare the embedding distances between these dissimilar pairs and the similar pair. The variation in dissimilarity levels is due to differences in the relationships: e.g., the anchor may have a direct connection to the similar document (Level 0), a one-hop connection to a dissimilar document (Level 1), a two-hop connection to another (Level 2) and no direct connection at all to the third dissimilar document (Level 3). We should expect the distance of the anchor to these documents to increase progressively from one level to the next. Figure 7 displays a radial plot of the differences in distances between the Level 0 pair and the pairs at other levels for sample documents from the SciDocs dataset. The level of a point is differentiated by color, while its position depends on the distance rank. For instance, a point lying on the outer circle indicates a larger distance and on the inner circle the smallest distance. Intuitively, the Level 1 points should all lie on the inner circle and the Level 3 points on the outer circle, since a closer relationship implies a nearer distance and vice-versa. This desirable behavior can be observed, with a majority of Level 1 (red) points lying on the inner circle and Level 3 (blue) points placed on the outer circle. Even when not conforming to the ideal case, the errors are largely restricted to a single level misplacement. The introduction of flexible margins helps impose such an implicit order of separation between the documents and can thus effectively capture the nuances in relations.

We also perform sensitivity analysis on the various parameters to gauge the importance of tuning them. In Figure 8 the recall rate of the ACL-ARC dataset for different choices of γ is displayed. By design, the loss is concentrated on the structure component if γ = 0, while a value of 1 focuses on the semantics. It can be observed that extreme values that degenerate to considering only one of these components are not ideal, with the optimal values lying over a wide range between 0.25 and 0.75. Interestingly, ignoring the structure has a relatively greater penalty. This is possibly because the semantic component is already accommodated in a partial manner due to the use of pre-trained language model weights. The parameter m_ψ influences the separation margin enforced by the semantic component during training. The recall performance of our DistilBERT model for different choices of this parameter corresponding to various datasets is plotted in Figure 9. While excessively large values result in a drop in performance, the results are stable between values of 0.5 and 1.5, indicating a broad range of options.

We introduced a new representation learning mechanism that integrates structural relations with semantic information to enrich the document embeddings. The sampling procedure based on link structure analysis helps traverse even complex neighbourhoods, and the quintuplet loss function offers a flexible balance between the structure and semantic facets. The variation of separation margins in accord with the relationship strength results in a more coherent representation space. Our experiments illustrate the utility of this model for retrieving documents that are relevant to a query.
In future, we wish to leverage deep graph embeddings and extend the model to support multi-modal inputs.

References
[1] Reading wikipedia to answer open-domain questions
[2] Coarse-to-fine query focused multi-document summarization
[3] Content-based citation recommendation
[4] Scalable clustering of news search results
[5] Best match: new relevance search for pubmed
[6] Query driven algorithm selection in early stage retrieval
[7] Multi-stage document ranking with bert
[8] Which* bert? a survey organizing contextualized encoders
[9] Information retrieval as semantic inference: A graph inference model applied to medical search
[10] Specter: Document-level representation learning using citation-informed transformers
[11] Graph representation learning: A survey
[12] Pre-training of deep bidirectional transformers for language understanding
[13] Deep metric learning: A survey. Symmetry
[14] Sentence-bert: Sentence embeddings using siamese bert-networks
[15] Fully-convolutional siamese networks for object tracking
[16] Deep metric learning using triplet network
[17] Triplet loss in siamese network for object tracking
[18] Attention is all you need
[19] A comprehensive evaluation of scholarly paper recommendation using potential citation papers
[20] A high-quality gold standard for citation-based tasks
[21] The acl anthology reference corpus: A reference dataset for bibliographic research in computational linguistics
[22] An introduction to neural information retrieval
[23] A deep look into neural ranking models for information retrieval
[24] Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter
[25] Beyond triplet loss: a deep quadruplet network for person re-identification
[26] Learning joint gait representation via quintuplet loss minimization
[27] Adaptive offline quintuplet loss for image-text matching
[28] Understanding the behaviour of contrastive loss
[29] Triplet-center loss for multi-view 3d object retrieval
[30] Deep metric learning via lifted structured feature embedding
[31] Learning deep embeddings with histogram loss
[32] Multi-similarity loss with general pair weighting for deep metric learning
[33] Circle loss: A unified perspective of pair similarity optimization
[34] Deep metric learning with hierarchical triplet loss
[35] Ladder loss for coherent visual-semantic embedding
[36] Semi-supervised classification with graph convolutional networks
[37] Graph attention networks
[38] Simplifying graph convolutional networks
[39] Improved semantic-aware network embedding with fine-grained word alignment
[40] Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications
[41] Pre-training text encoders as discriminators rather than generators
[42] Improved embeddings with easy positive triplet mining
[43] Distributed representations of words and phrases and their compositionality
[44] Scibert: A pretrained language model for scientific text
[45] Visualizing data using t-sne

This paper was prepared for information purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co and its affiliates ("JP Morgan"), and is not a product of the Research Department of JP Morgan. J.P. Morgan makes no representation and warranty whatsoever and disclaims all liability for the completeness, accuracy or reliability of the information contained herein.
This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful. © 2021 JP Morgan Chase & Co. All rights reserved.