key: cord-0482359-bhj70jpu authors: Sundriyal, Megha; Singh, Parantak; Akhtar, Md Shad; Sengupta, Shubhashis; Chakraborty, Tanmoy title: DESYR: Definition and Syntactic Representation Based Claim Detection on the Web date: 2021-08-19 journal: nan DOI: nan sha: 469ecdbfd2603558ca6caf9d3521824cec7d0b09 doc_id: 482359 cord_uid: bhj70jpu

The formulation of a claim rests at the core of argument mining. Demarcating a claim from a non-claim is arduous for both humans and machines, owing to the latent linguistic variance between the two and the inadequacy of extensive definition-based formalization. Furthermore, the increase in the usage of online social media has resulted in an explosion of unsolicited information on the web, presented as informal text. To account for the aforementioned, in this paper, we propose DESYR, a framework that aims to resolve these issues for informal web-based text by leveraging a combination of hierarchical representation learning (dependency-inspired Poincaré embedding), definition-based alignment, and feature projection. We do away with fine-tuning compute-heavy language models in favor of a more domain-centric but lighter approach. Experimental results indicate that DESYR improves upon the state-of-the-art systems across four benchmark claim datasets, most of which were constructed from informal texts. We see an increase of 3 claim-F1 points on the LESA-Twitter dataset; an increase of 1 claim-F1 point and 9 macro-F1 points on the Online Comments (OC) dataset; an increase of 24 claim-F1 points and 17 macro-F1 points on the Web Discourse (WD) dataset; and an increase of 8 claim-F1 points and 5 macro-F1 points on the Micro Texts (MT) dataset. We also perform an extensive analysis of the results. We make a 100-D pre-trained version of our Poincaré variant publicly available along with the source code.

The increasing vogue of Online Social Media (OSM) has led to a colossal sway within the media-consuming audience.
Participation over OSM has swiveled into a corresponding and associated phenomenon. We are incessantly exposed to news, media, opinions, and perspectives. Such a constant barrage of information can, in turn, increase the plausibility of resources being mishandled or exploited. The 45th US Presidential election witnessed the alarming impact of fake news, where a substantial percentage of citizens were influenced by a malicious website [14]. In their study, Allcott and Gentzkow [1] revealed that every American clicked on at least one fake news article related to the presidential candidates during the elections. Recent cases of spreading fear and prompting false cures due to fake news have threatened countless lives during the COVID-19 global pandemic [24]. The effect of claims applies unprecedentedly and often leads to the loss of money and precious human life. Several such cases have been popping up within the global community.

Several machine learning and natural language processing models have been proposed to handle fake news and the automated detection of erroneous claims. Within recent literature, fake news detection has become more of an umbrella term that encapsulates sub-tasks like stance detection, hostility detection, claim detection, etc. Argument Mining (AM) has been a significant focus of NLP for the past few decades. Segmenting argumentative and non-argumentative texts, detecting claims, and parsing argument assemblies are some of the main concentrations within this field. Narrowing down from AM to the domain of claim detection, the task in layman's terms is to detect sentences containing a claim. As an elementary intuition, claim detection is a precursor to fact-checking, wherein the segregation of claims helps restrict the corpus that needs a fact-check.
Toulmin [34] initiated the early works on argumentation in the 1950s; in his argument conjecture, he described a claim as 'an assertion that needs to be proven'. Formally, according to the Oxford Dictionaries, a claim is defined as: "to say that something is true although it has not been proven and other people may not believe it".

Table 1: Representative examples of claims and non-claims. Explicit expressions of claims are in italics.
  Claim: @realDonaldTrump A lot of people are saying cocaine cures COVID-19. So.
  Claim: Injecting disinfectant might not cure #coronavirus, but TheLastDance sure is saving lives during #Lockdown.
  Non-claim: maybe if i develop feelings for covid-19 it will leave
  Non-claim: Olympic events are rooted in old traditions.

The concept of false claims is entrenching its roots in every domain: online social media platforms, news articles, political agendas, and so forth. The divergence boundary between a claim and a non-claim is very thin, making the conception of a claim very subjective and abstruse. As a consequence, the categorization of claims is taxing for human annotators and state-of-the-art neural models alike. Herein, the difficulty exists given the disparity in perception and the lack of an existing formalization for claims. The more classical works in claim detection practiced syntactic composition, where they cashed in on the use of combinatorial distillation from context-free grammars, constituency parse trees, and other linguistic renditions to design their models [18, 21]. The more recent works shifted towards neural approaches, leveraging large Language Models (LMs) in an attempt to capture dormant features through explicit linguistic encapsulation [8, 10]. In our previous work [15], we attempted to encode both linguistic and contextual knowledge to study claims across different distributions.
The literature encompassing OSM-based informal texts and the fusion of structure and context for claim detection is still sporadic and needs to be driven further. Table 1 shows a few representative examples of claims and non-claims.

Motivation: Claims, as is very apparent, subsist across a variety of distributions, from essays, to Wikipedia articles, to OSM posts and comments. However, with the voluminous explosion of data on social media, it is of paramount importance that we concentrate distinctly upon claims distributed through OSM [4, 27]. The web has become the pivot for all things social and global, and a vast majority of this human expression comes in textual form, specifically short, informal texts (Twitter, Facebook, etc.). More often than not, conformity bias [35] comes into play, pits users with strong, wide-ranging opinions against each other, and can at times end up promoting something erroneous. For example, a tweet that reads 'drinking an antiseptic cures COVID' can be a cause of massive unrest and demands an immediate fact-check. Visibly, with the increasing magnitude of data, the automated furthering of texts that comprise a claim into a fact-verification pipeline should garner gravity. Another qualm, as previously mentioned, is the issue of division between claims and non-claims; these, at times, tend to lie in a domain of high invariability. Additionally, the severe imbalance of existing datasets leads to data severance (for performance augmentation) within already small datasets. On top of this, existing systems have inclined towards large LMs and shown a profusion of promise; however, they are, without a doubt, computationally expensive. We address these intricacies and propose a framework that pursues the objective of erecting discernible feature spaces for the individual classes while avoiding the use of LMs and fixating on a linguistic- and definition-centric approach.
Proposed Methodology: We propose DESYR, a DEfinition and SYntactic Representation based claim detection model that achieves better segregation of the feature space for classification and, moreover, learns to leverage the guidelines for identifying claims and non-claims proposed in our previous work [15] for the furtherance of the feature constitution. It attains this by employing an intelligible unity of feature projection, attention-based alignment, and pre-transformer deep learning. Also, in line with Lippi and Torroni [21], who argued that a claim could be grounded in linguistics, we propose DARÉ, the Dependency-PoincARÉ embedding, a dependency-inspired variant of the Poincaré embedding [25]. It helps assimilate enhanced representations of word vectors by capturing intrinsic hierarchies in dependency trees. We evaluate DESYR across four web-based datasets comprising short informal texts and observe up-to-par results throughout. We conduct a result analysis across several baselines, along with a general analysis of our predictions. The comparative investigation supports the finer performance of DESYR against other state-of-the-art systems. Our major contributions are as follows:
• DESYR, a claim detection system tailored towards informal web-based texts. The said system determines the existence of claims in online text by aligning the query to encoded definitions and projecting them into a purer space.
• Novel combination of gradient reversal layer and attentive orthogonal projection. To the best of our knowledge, this is the first attempt in the domain of claim detection at using a gradient-reversal layer to learn label-invariant features and leveraging an attentive orthogonal projection to draw the choicest feature representations.
• Comprehensive evaluation and state-of-the-art results. We evaluate DESYR against the state-of-the-art systems and LM baselines across four benchmark datasets.
The contrast suggests the superiority of our architecture, and our ablation study highlights the importance of each module.
• DARÉ, our novel dependency-inspired Poincaré variant. It shows promise for linguistically-grounded NLP tasks. We release a pre-trained 100-dimensional version of DARÉ.

We release the 100-D pre-trained version of our Poincaré variant (DARÉ) and the source code of DESYR for further research at https://github.com/LCS2-IIITD/DESYR-CIKM-2021.

Claims are an indispensable component of AM, and claim detection is equally crucial for the fake news detection pipeline. With the expansion of OSM, misinformation proliferated through claims now undeniably poses a greater risk to consumers. With zero deterrence, the spread of misinformation can be rapid and harmful. Therefore, claim detection has recently gained traction as an NLP task that works as a precursor to automated fact verification.

Figure 1: A schematic diagram of the DESYR framework for claim detection. The spotlight network, s-net, is the backbone of DESYR and incorporates the alignment of the input text with the claim and non-claim definitions. Moreover, it selectively extracts the relevant features through an attentive orthogonal projection of label-invariant features (from r-net) and the aligned input representation (from d-net). The auxiliary classification layers help DESYR learn sub-modules directly from the gradients' signals.

In 2011, Bender et al. [5] proposed an annotation system and presented the Authority and Alignment in Wikipedia Discussions (AAWD) corpus, which consisted of a collection of around 365 discussions curated from Wikipedia Talk Pages (WTP). Their work grabbed a lot of attention from researchers on claims and provided a basis for this challenging task. In the past decade, the study of claim detection gained some traction within the NLP research community, with a principal attempt by Rosenthal and McKeown [32].
They endeavoured to mine claims from discussion platforms and implemented a supervised approach based on sentiment and word-gram derivatives. Although their work was restricted to classical machine learning approaches, it formed the basis for future works in this field. Levy et al. [18] put forward a context-dependent claim detection (CDCD) model. They described a 'context-dependent claim' as "a general, concise statement that directly supports or contests the given topic". They exercised their approach only over Wikipedia scripts while maneuvering context-based and context-free feature sets for spotting claims. Lippi and Torroni [21] propounded the context-independent claim detection (CICD) model that employed constituency parse trees to capture structural knowledge without an explicit encapsulation of context, the pitfall being that they only engineered it around a thoroughly superintended Wikipedia corpus. Levy et al. [19] proposed the first unsupervised approach for claim detection. According to the authors, a claim begins with the word 'that', and the main concept (MC) or a topic name then follows. This work, however, is restricted to distributions of formal texts. Most literature on claims focuses on domain specificity. As a result, there is a lack of generalization in existing systems. To confront this, Daxenberger et al. [10] performed a qualitative analysis across six datasets and argued that the anatomy of claims stands differently within different distributions. More recently, the study of claims, too, has trended towards the utilization of transformers. Chakrabarty et al. [8] used over 5 million self-labeled Reddit comments containing the abbreviations IMO (In My Opinion) or IMHO (In My Honest Opinion) to fine-tune their LM, expecting to gravitate its distribution towards their conceptualization of a claim. They, however, made no evident attempt at encapsulating syntactic properties. Gupta et al.
[15] leveraged both semantic and latent syntactic features through an amalgamation of linguistic encoders (part-of-speech and dependency based) and a contextual encoder (BERT). They additionally annotated a Twitter dataset and proposed thorough guidelines centred around annotating claims. Cheema et al. [9] investigated the role of images in claims and presented a novel framework leveraging dual modalities for claim detection. The CLEF-2020 shared task [2] witnessed several models that were specifically tweaked for claim detection. Williams et al. [36] took the cake with a fine-tuned RoBERTa [22] model accentuated by mean pooling and dropout. Nikolov et al. [26] took second place with their out-of-the-box RoBERTa vectors heightened by Twitter metadata.

DESYR attempts to abrogate the drawbacks of previous works; we propose an architecture grounded in linguistics and expedited by context- and definition-based alignment. The more conventional renderings of claim detection are constructed with either contextual methodologies [8, 10] or syntactic methodologies [19, 21] at the centre. Recently, in our precedent work [15], we proposed the LESA architecture, which leverages individually-furnished contextual and linguistic encoders combined into one for improved performance on claim detection. Moreover, in the realm of natural language processing, we observe increasing use of language models and fine-tuning in addition to other modules in the architecture, a computationally heavy process with a significant number of learnable parameters. In LESA [15], we incorporated three modules, including two transformer-based models, for claim detection, thus making the architecture heavy on compute. Additionally, within the literature, we observe inferior performance on the minority class, despite efforts at imbalance eradication.
Given our outlook on the assimilation of syntax and context, and driven by an attempt to prune the aforementioned drawbacks, we now propose DESYR. DESYR leverages representation learning by exercising a novel dependency-inspired variant of the Poincaré embedding [25]. Furthermore, we aim at amputating the class-invariant features and obtaining superior class-representing features by incorporating the feature projection technique highlighted by Qin et al. [30].

DESYR's backbone comprises two parallel networks: the regulation-net (r-net) and the spotlight-net (s-net). The regulation-net learns the class-invariant features through a gradient-reversal layer [13]. On the other hand, the spotlight-net draws representations of the input that are devoid of the class-invariant features, thereby allowing for better distinction between our binary labels. Both r-net and s-net incorporate a feature extractor module, called feature-net (f-net), in their respective modules: f-net_r and f-net_s. In addition, we also incorporate a definition network (d-net) that aims at aligning input texts to definitions of a claim and non-claim to further augment class variance. We leverage d-net in our spotlight network module by enabling the feature network (f-net_s) to learn the segregation between claim and non-claim definitions. We hypothesize that such an alignment would be exploited by DESYR in successive layers for claim detection. In particular, we fine-tune the hidden representation in s-net to extract the essence of the segregation between the claim and non-claim definitions through d-net. Moreover, DESYR is specifically calibrated towards short informal web-based texts, and we leverage the claim and non-claim definitions proposed in [15] to engineer our d-net. Intricate details of the aforementioned modules are discussed in further sections. We present a high-level architecture of the DESYR framework in Figure 1.
The Dependency-PoincARÉ (DARÉ) embedding is DESYR's variant of the Poincaré embedding of Nickel and Kiela [25]. Employing DARÉ, we attempt to capture latent linguistic properties in textual dependency. Below, we briefly introduce dependency parsing, followed by a dossier on how the former was imbued into the Poincaré ball.

Figure 2: Hierarchy (graph) formulation for DARÉ training: dependency-based tree for the sentence 'A hearing is scheduled on the issue today'.

Dependency parsing is a function that maps a sequence of tokens {w_1, w_2, ...} to a dependency tree. A dependency parse tree is a directed graph with nodes and edges, where each node represents an individual token w_i, and each edge represents the syntactic dependency between tokens w_i and w_j. An edge w_i →_r w_j comprises a parent node w_i directed at the child node w_j with the dependency r, where r is the nature of the dependency. We employ spaCy for the dependency parsing. An example is shown in Figure 2.

Poincaré Embedding: As highlighted by Nickel and Kiela [25], embedding hierarchical graph-based information in Euclidean space can be difficult, owing to the exponential growth of nodes, which brings leaf nodes in different branches close to one another, thereby distorting hierarchies. With the Poincaré ball, the distance from the center grows exponentially towards the boundary, allowing one to fit an arbitrary number of levels in the hyperbolic space. We then optimize our word vectors on this space and optimize the following loss function with negative sampling:

  L(Θ) = − Σ_{(u,v) ∈ Δ} log [ e^{−d(u,v)} / Σ_{v′ ∈ N(u)} e^{−d(u,v′)} ],

where d(u, v) is the Poincaré distance such that

  d(u, v) = arcosh( 1 + 2 ‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²)) ),

Θ is our set of vectors, Δ is the set of all embedded hierarchies-cum-dependencies (in this case, ∀ w_i → w_j), and N(u) is the set of random tokens that are not associated with u. Additionally, we attempt at the disambiguation of part-of-speech (POS) by formulating our vocabulary entries as word vectors of the tokens augmented with their POS tags, such that t_i = w_i.POS_i. Our loss function is trained similarly to how it would be in Euclidean space.
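As a concrete illustration of the distance above, the Poincaré distance between two points of the open unit ball can be computed directly. The helper below is a sketch we provide for exposition, not the authors' released code:

```python
import math

def poincare_distance(u, v):
    """Poincaré distance between two points inside the unit ball:

    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_norm = lambda x: sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    arg = 1.0 + 2.0 * diff / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v)))
    # acosh is monotone, so distances blow up as points near the boundary
    return math.acosh(arg)
```

For instance, the distance between (0.5, 0) and the origin works out to arcosh(5/3) = ln 3 ≈ 1.0986, and distances grow without bound as points approach the boundary of the ball, which is what lets the hyperbolic space accommodate deep dependency hierarchies.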
However, the only difference is that we employ Riemannian gradient descent [7] for the optimization. We utilize Gensim's implementation to train DARÉ.

Prior to discussing r-net and s-net, we discuss the component f-net, which is common to both the aforementioned networks and serves as the feature extractor. To put this into context, an f-net comprises stacked BiLSTMs whose hidden units are processed by a sequential self-attention mechanism, as suggested by Zheng et al. [38]. Inspired by the Inception [33] architecture, we optimize f-net through an auxiliary softmax layer. The intuition behind these auxiliary outputs is that they act as implicit assistance against the vanishing gradient problem and make the low-level features of the network more accurate. As mentioned before, the transitional outputs (all hidden units) from the BiLSTM are used as inputs for our r-net and s-net. To emphasize again, each has an individual f-net with no shared parameters. The only distinguishing feature between f-net_s and f-net_r is that the former comprises d-net.

The d-net module helps in aligning the inputs to predefined guidelines/definitions that elucidate the characteristics of claims and non-claims. Intuitively, for any given input text, we find its alignment against the aforementioned sets of definitions using an attention-based mechanism. This helps us draw divergent associations from the input with respect to claims and non-claims. To put it formally, suppose we have two sets of definitions, D_c = [d_1^c, d_2^c, ..., d_p^c] for claims and D_n = [d_1^n, d_2^n, ..., d_q^n] for non-claims. The input text forms our query; this query is then processed (discussed below) against each of the definitions in D_c and D_n to get the claim-definition-map and the non-claim-definition-map, respectively. The d-net comprises d-net_c and d-net_n. These are carbon copies, where one is initialized with the definitions of claims and the other with those of non-claims.
We process these definitions and the input text through a BiLSTM encoder. For each definition, we obtain a value vector, and for the input text, we take the last time-step representation as the query vector. Subsequently, for each definition encoding (value), we calculate its attention score [23] against the query. Finally, the query-value attention score is pooled (global average) over the sequence axis to obtain a 1-dimensional representation. We repeat the process for each pair of query and definition: ⟨query, d_i^c⟩ ∀i ∈ {1, 2, ..., p} and ⟨query, d_j^n⟩ ∀j ∈ {1, 2, ..., q}. The concatenation of all query-value attention scores forms the definition-based representation of the input against the respective definition set. We append the representations from d-net_c and d-net_n behind the BiLSTM output at each time step to enhance the feature learning in f-net_s, thereby explicitly helping the transient BiLSTM features become more archetypal. The d-net is only part of f-net_s, owing to the fact that s-net is our primary classifier, whereas r-net acts analogously to a feature selector.

The regulation network acts to collect class-invariant (common, shared amongst the classes) vector representations from f-net_r. It functions as a network trained in parallel to the s-net. As highlighted in [12, 30], we employ a Gradient Reversal Layer (GRL) to capture the class-invariant features. In a nutshell, a gradient-reversal layer can be thought of as a pseudo-functional mapping whose forward and backward propagations are respectively defined by two opposed equations:

  Forward:  R(x) = x
  Backward: ∂R(x)/∂x = −λI

The transient BiLSTM output from f-net_r, which serves as the input to r-net, learns the invariant features and is then drifted to s-net for computing the orthogonal projection.

The spotlight network is the prime module of DESYR. We amalgamate the class-invariant features of r-net through an attentive orthogonal projection layer (a-OPL).
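The gradient-reversal behaviour described above can be sketched in a few lines. This is a framework-free toy sketch (the class name and λ hyper-parameter are illustrative, not the authors' implementation): the layer is the identity in the forward pass but flips and scales gradients in the backward pass.

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; scales gradients by -lambda going backward.

    Placed between a feature extractor and an auxiliary classifier, it pushes
    the extractor towards features the classifier cannot separate, i.e.,
    class-invariant features.
    """

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # R(x) = x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # dR/dx = -lambda * I
```

In an autograd framework the same effect is obtained by overriding the backward rule of an identity op, so the feature extractor receives reversed gradients from the attached classifier.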
The orthogonal projection layer aims at drawing the choicest feature representations [30]. For convenience, we refer to the feature representations from f-net_s and f-net_r as v_s ∈ R^{T×h} and v_r ∈ R^{T×h}, respectively, where T is the number of time-steps and h is the hidden dimension. We design s-net to extract the semantic representation of our input text and project the same into a non-homogeneous domain space. To accomplish this, we project v_s onto the direction orthogonal to v_r. The space orthogonal to v_r should, in theory, be rid of class-homogeneity, and projecting v_s onto it should yield discriminative information stripped of class-invariant knowledge. Further, we describe the mathematical details of a-OPL. The projection of a vector a onto a vector b is defined as

  proj_b(a) = ((a · b) / ‖b‖²) b.

Utilizing the above equation, we project v_s onto the direction orthogonal to v_r while tending to each time-step from their respective LSTMs using a TimeDistributed Layer (TDL), such that

  ṽ_t = v_{s,t} − proj_{v_{r,t}}(v_{s,t})  for each time-step t.

To further refine this feature representation, we then attend to each time-step in ṽ using sequential self-attention [38]. The aforementioned forms the basis for our a-OPL. The attentive vector is then utilized for classification.

We train s-net and r-net parallel to each other and employ the sparse categorical focal loss. In a classification setting with labels y, the loss is defined as

  FL(p, y) = − (1 − p_y)^γ log(p_y),

where p is a vector that represents the approximate probability distribution across our two classes, p_y is the predicted probability of the true class y, and γ is the focusing parameter, which in essence acts to down-weigh easy-to-classify examples. A higher γ implies a heavier discounting of easy-to-classify examples.

Owing to our formulation of a claim detection system for informal texts on the web, we accumulate four publicly-available web-based datasets, as mentioned below:
• Twitter Dataset: It is a claim detection dataset of COVID-19 tweets released recently [15]. The dataset consists of ∼10K labeled tweets. The authors also released a set of definitions for claim and non-claim text in their annotation guidelines. As mentioned earlier, we utilize them for our definition network (d-net).
In total, the annotation guidelines consist of 10 claim definitions and 8 non-claim definitions. The statistics of the four datasets are summarized in Table 2. We can observe that, as opposed to the Twitter dataset, the MT, OC, and WD datasets are imbalanced towards non-claims. Due to the label skewness, we adopt a sampling mechanism, similar to our precedent work [15], for our experiments (c.f. Section 5).

In this section, we delineate an exhaustive analysis of our model's performance and also carry out a predictive comparison against the state-of-the-art claim detection systems. We additionally analyze the predictions made by our best model on out-of-sample instances from Twitter. Besides the aforementioned, we conduct an ablation study to evaluate the substance added by each sub-module of our architecture.

We first lay out the backdrop for our experiments and highlight the key conditions and practices. To compute the DARÉ embedding of size 100, we use the open-source Sentiment140 corpus, comprising 1.6 million tweets. We use stacked BiLSTMs with 256 hidden units for both f-nets. To emphasize again, there are no shared parameters between f-net_r and f-net_s. Additionally, to encode our query and definitions in the d-net, we use a BiLSTM with 64 hidden units. We use the pre-furnished definitions proposed in our previous work [15] to harbour our d-net; we encipher d-net_c with 10 definitions and d-net_n with 8 definitions. To train our model, we proceed with a vocabulary size of 30K and a maximum document length of 50. We use the Adam [17] optimizer and the sparse categorical focal loss [20]. We train the model for 100 epochs with a batch size of 32 and exercise early stopping. Since most of the datasets are unbalanced towards one class, we adopt a sampling technique to alleviate the issue. We experiment with multiple sampling ratios (c.f.
Table 4), and as a result of fine-tuning, we select a sampling ratio of 5:2 (claims : non-claims) for the Twitter dataset based on foraging through the sampling-ratio search space. For consistency, we select a sampling ratio of 5:2 (non-claims : claims) for the OC, WD, and MT datasets as well. Note that this sampling technique is in line with previous neural attempts that procured their best results on a 1:1 ratio [10, 15]. Additionally, to ensure riddance of seed stochasticity and to incorporate maximal data, we train on 5 random splits and average them using voting. Also, to incorporate a more holistic weighting mechanism, we vote on predictions across three different values (1, 2, 3) of the γ parameter in the focal loss. For evaluating the claim detection systems, we compute claim-F1 (C-F1) and macro-F1 (M-F1) scores.

Another fact of importance is that with DESYR, we design an architecture that is lighter in comparison to previous state-of-the-art systems, especially LESA. LESA churns approximately 111 million model parameters, while our proposed model has only 7 million. Moreover, the standard XLNet and BERT-based models require a few hundred million parameters for the same. As is evident, constructing models in adherence with the task statement at hand can be equally as effective, if not more, and at times can be accomplished with a fraction of the compute.

Due to the highly subjective nature of claims, claim detection can more often than not prove to be a demanding task even for humans, let alone machines. As with automated, neural (machine-based) claim detection, the problem becomes even more acute in the case of web-based short texts, which usually lack soundness in their linguistic edifice. Most of the existing claim detection models (including state-of-the-art systems) struggle with the accurate identification of claims.
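For reference, the sparse categorical focal loss used for training above can be sketched as follows. This is a minimal NumPy version written for illustration under the usual focal-loss formulation; the released implementation may differ:

```python
import numpy as np

def sparse_categorical_focal_loss(y_true, probs, gamma=2.0):
    """Mean focal loss, -(1 - p_t)^gamma * log(p_t), over a batch.

    y_true: (N,) integer class labels
    probs:  (N, C) predicted class probabilities (rows sum to 1)
    gamma:  focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    # probability assigned to the true class of each example
    p_t = probs[np.arange(len(y_true)), y_true]
    # (1 - p_t)^gamma down-weights confidently classified (easy) examples
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With γ = 0 the expression reduces to ordinary cross-entropy; raising γ shrinks the contribution of confidently classified examples, which is what makes the loss attractive for skewed claim/non-claim distributions.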
To assess and contrast the performance of DESYR, we consider the following systems as baselines:
• BERT [11]: A bidirectional transformer-inspired auto-encoder LM that we fine-tune for classification.
• XLNet [37]: Similar to BERT, this too is a bidirectional transformer-inspired LM, the only difference being that it is auto-regressive. We fine-tune it for classification.
• Accenture [36]: The authors employed a fine-tuned RoBERTa-based system and nabbed the first position in the CLEF-2020 claim detection shared task [3].
• Team Alex [26]: The system ranked second at the CLEF-2020 shared task. The authors proposed the fusion of RoBERTa-based features and Twitter metadata to detect claims.
• LESA [15]: The state-of-the-art claim detection system, wherein the authors leveraged part-of-speech and dependency-based linguistic encoders in sync with a BERT-based encoder to detect claims.
• In addition, we also perform a simple K-means clustering-based evaluation. We assign dataset points to one of the two clusters, claim and non-claim, considering their BERT and Poincaré representations separately.

We present our collated results in Table 3. To compute the base efficacy of the DARÉ embedding, we compare it against BERT embeddings [31] without proffering any external supervision. Please note that the Sentence-Transformers package [31] facilitates out-of-the-box computation of BERT-based dense vector representations for sentences. We employ K-means clustering to segregate the claim and non-claim clusters. We evaluate the test data points in both clusters and report the results in the first two rows of Table 3. We observe that DARÉ performs better than BERT on the Twitter dataset by a considerable margin. We also perform K-means clustering on the remaining three datasets and observe improvements in most of the cases.
The improvements could possibly be attributed to our trained distribution being closer to web-based informal texts, despite the training corpus being significantly smaller than BERT's (Wikipedia: 2,500 million words; Book Corpus: 800 million words). Furthermore, we evidently observe that DESYR outperforms all the existing baseline systems, including the current state-of-the-art, LESA [15]. On the Twitter dataset, DESYR obtains the foremost C-F1 score in contrast to all other baseline systems; it accounts for a +3.3% improvement over LESA in C-F1. With the OC dataset, we find that all the baseline systems, including DESYR, report low scores for claims. However, DESYR does yield an M-F1 of 0.60 (a +9-point improvement over LESA's performance), suggesting it performs well for the non-claim class. Out of the four datasets, we observe the highest relative improvement on the WD dataset, with 0.59 C-F1 and 0.78 M-F1, translating to a climb of 68.5% and 27.8% over LESA, respectively. On the MT dataset as well, we observe an increment of 11.26% in C-F1 and 6.25% in M-F1. On average, DESYR improves the state-of-the-art performance by 16.36% in C-F1 and by 13.3% in M-F1. As a general observation, we see how systems grounded in linguistics, such as DESYR and LESA, outperform large LMs like BERT and XLNet, which in turn indicates the importance of task-specificity in model adaptation.

Given that we hinge our model around Twitter, we perform the ablation study by hacking off individual components from DESYR one at a time and evaluating on the Twitter dataset [15]. We present the ablation study results in Table 4. We report C-F1 along with M-F1 and weighted-F1 (W-F1). We draw the simplest variant of DESYR by dropping d-net and employing sparse categorical cross-entropy in place of the focal loss to train our model. Additionally, along with the mentioned withdrawals, we train on a 1:1 sampling in place of the 5:2 sampling.
As can be seen in Table 4 (rows 4-5), we observe an increase of 5 claim-F1 points and 5 macro-F1 points simply by reverting to the 5:2 sampling. We conduct another similar ablation while retaining d-net with categorical cross-entropy, and again observe an increase of 5 claim-F1 points and 5 macro-F1 points on reverting to the 5:2 sampling. It is worth mentioning that the previous benchmark on the Twitter dataset applied a 1:1 sampling across all its experiments [15]; we, however, espy worse results on using the same.

Table 5 (excerpt). Each row lists the text, the gold label, and the LESA and DESYR predictions.
9. Here is my experience and I hope it will help you with your decision: In preschool, I had some issues, just like your son. | non-claim | claim | claim
10. That's why Germany should not introduce capital punishment! | claim | non-claim | claim
MT:
11. Alternative treatments should be subsidized in the same way as conventional treatments, since both methods can lead to the prevention, mitigation or cure of an illness. | claim | claim | non-claim
12. Besides it should be in the interest of the health insurers to recognize alternative medicine as treatment, since there is a chance of recovery. | non-claim | claim | claim

We additionally discern that in two distinct cases, the addition of d-net results in a boost: we see a gain of 1 claim-F1 point and 1 macro-F1 point when comparing ablations that differ only in the presence of d-net. Within DESYR, we experiment with different sampling ratios, and we outperform with both the original sampling and the 1:1 sampling. With the original sampling, we see competitive claim-F1 results; however, the same does not hold for macro-F1. The 5:2 sampling is thus a better fit: in addition to better results, it also keeps the class skew in check. Furthermore, we detect that the variant of DESYR without focal loss performs worse than DESYR; the claim-F1 and macro-F1 values drop by 1 point each. Another interesting ablation is where we choose to initialize DESYR with GloVe [29] in place of DARÉ.
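The sampling ablation above can be sketched as a simple downsampling routine that pushes the two classes toward a target claim:non-claim ratio such as 5:2. This is an illustrative, hypothetical routine, not DESYR's actual data pipeline; only downsampling is performed, so neither class is duplicated.

```python
import random

def resample_to_ratio(claims, non_claims, ratio=(5, 2), seed=0):
    """Downsample the two classes so that len(claims):len(non_claims)
    matches ratio[0]:ratio[1] as closely as possible without upsampling."""
    rng = random.Random(seed)
    r_c, r_n = ratio
    # Largest number of (r_c + r_n)-sized "units" both classes can supply.
    units = min(len(claims) // r_c, len(non_claims) // r_n)
    return (rng.sample(claims, units * r_c),
            rng.sample(non_claims, units * r_n))

# Toy corpus: 100 claims and 60 non-claims; a 5:2 target keeps all 100
# claims (units = min(100 // 5, 60 // 2) = 20) and 40 non-claims.
claims = [f"claim-{i}" for i in range(100)]
non_claims = [f"non-claim-{i}" for i in range(60)]
c, n = resample_to_ratio(claims, non_claims, (5, 2))
```

Passing `ratio=(1, 1)` instead reproduces the balanced setting used by the earlier benchmark, which the ablation finds inferior here.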
We see that DESYR in its native state outperforms the GloVe initialization by 1 claim-F1 point and 4 macro-F1 points (rows 8-9). Glancing at Table 3, one can infer that all the claim detection systems, including DESYR, are still far from perfect and are therefore prone to errors. To reinforce our point once again: owing to the highly impressionistic nature of claims and the folksiness of OSM sites, the detection of claims is cumbersome. To qualitatively appraise the performance of DESYR, we perform an error analysis in this section. Table 5 highlights a few randomly sampled instances from the Twitter, OC, WD, and MT datasets, along with their gold labels and the labels predicted by DESYR. For comparison, we also analyze the predictions of the state-of-the-art claim detection architecture, LESA [15]. In some cases, both DESYR and LESA fail to identify claims; in most cases, however, DESYR performs well. We additionally report instances misclassified by DESYR and/or LESA. As underlined previously, on the Twitter dataset, DESYR obtains better classification results than all other baseline systems. In examples 1 and 3, we see that LESA misclassifies both examples, whereas DESYR identifies the assertions and classifies them correctly. With mistakes being inevitable, we observe that most of the misclassified samples are non-claims (2, 4, 5, 9, 12). A potential reason could be the skewed nature of the Twitter dataset [15]: the dataset is imbalanced with a hulking predisposition towards claims (the number of claims remains greater even after the 5:2 sampling). The presence of the phrase 'effective vaccination is completed' in example 2 drives the system to incorrectly predict it as a claim. Example 7 has a dearth of context and could possibly have been part of a bigger phrase; on top of that, it is not a statement that would severely affect public opinion, which is presumably why DESYR misclassifies it.
The latter argument is based on the guidelines proposed in our previous work [15]. Clearly, the gold label for 9 indicates that it is not a claim; however, within the realm of possibility, the person does seem to indirectly claim to have had issues. It is possible that DESYR misclassifies this case owing to that dormant pattern. Sentence 12 emphasizes a chance of recovery with an alternative medicine, i.e., it reports a medical fact to be true, albeit with chance. There exists a possibility that DESYR and LESA interpret a chance of recovery using medication as a claim, and therefore misclassify it.

Table 6 (excerpt). Each row lists the tweet, its gold label, and DESYR's prediction.
3. Nearly 51 lakh COVID-19 vaccine doses will be received by states/UTs within next 3 days: Health ministry. | claim | claim
4. I STAND WITH HUMANITY #IndiaStandwithPalestine | non-claim | non-claim
5. Par for the course. As if we'd trust an internal review. We are asking for #COVIDPublicEnquiryNow | non-claim | claim
6. They claim they are no longer asking for Aadhaar/mobile to give food at Indira canteens. But our Youth Congress team found otherwise. And documented it. We will continue to expose this government's lies. | claim | claim
7. A third Australian has died in India from COVID-19. The family members of 11,000 stranded Aussies are pleading for the government to bring them home. #9News | claim | claim
8. My heartfelt gratitude to the men in uniform who did not deter from putting their lives in danger saving the lives of our citizens under extreme conditions. | non-claim | non-claim
9. Interacted with doctors across India. They shared insightful inputs, based on their own experiences of curing COVID-19. The determination of our doctors during these times is remarkable! | claim | non-claim
10. Abolish unpaid internships. There is absolutely no valid reason that justifies why you're having students work 40h/week and paying them nothing. | claim | claim

Over time, OSM sites have emerged as hubs for short, unstructured pieces of informal text, where the amount of slang and incoherence in writing is generally more significant than on other online platforms. Given the prime focus of DESYR, we evaluate its performance on real-world data to detect claims on the web. We collect 50 random samples from Twitter and predict their labels (claim or non-claim) using DESYR. Note that we collect these examples in the wild. Subsequently, we present the predictions to three human evaluators and ask them to verify the labels following the claim annotation guidelines defined in [15]. Finally, majority voting is used to obtain the final gold label for these 50 tweets. We obtain an inter-annotator agreement (Fleiss' kappa) of 0.76 among the three evaluators. We present some of the instances in Table 6. As expected, most of the claims are correctly labelled by DESYR: out of 50 samples, DESYR classifies 32 correctly, with a claim-F1 of 0.60 and a macro-F1 of 0.73. Alongside, our model marks some false positives at the expense of its precision. This is not an ideal situation, but it is a better scenario than biasing towards false negatives, where claims are wrongly classified as non-claims. Examples 1 and 3 are claims exhibiting some statistics; DESYR rightly captures these values and classifies both tweets as claims. However, DESYR is unable to learn the importance of the numerical features in example 2, especially since they occur several times within the text, and it ends up mislabelling the tweet. Following the claim guidelines we published in our prior work [15], negating a false claim also counts as a claim. In example 6, the user tries to negate a claim and further claims to have documented the evidence; DESYR comprehends the assertions and rightly labels it a claim.
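The agreement and adjudication steps above are standard: Fleiss' kappa over the three annotators' labels, and per-tweet majority voting for the gold label. A self-contained sketch with toy annotations (not the paper's actual 50 tweets) follows.

```python
from collections import Counter

def majority_vote(labels):
    """Gold label = the most common of the annotators' labels for one item."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings: list of items, each a list of category labels,
    one per rater (same number of raters for every item)."""
    items, raters = len(ratings), len(ratings[0])
    cats = sorted({lab for row in ratings for lab in row})
    counts = [[row.count(c) for c in cats] for row in ratings]
    # Per-item observed agreement P_i and overall category proportions p_j.
    P_i = [(sum(n * n for n in row) - raters) / (raters * (raters - 1))
           for row in counts]
    p_j = [sum(row[j] for row in counts) / (items * raters)
           for j in range(len(cats))]
    P_bar = sum(P_i) / items          # mean observed agreement
    P_e = sum(p * p for p in p_j)     # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy annotations by three raters over four tweets.
votes = [["claim", "claim", "claim"],
         ["claim", "claim", "non-claim"],
         ["non-claim", "non-claim", "non-claim"],
         ["non-claim", "claim", "non-claim"]]
gold = [majority_vote(v) for v in votes]
kappa = fleiss_kappa(votes)
```

With an odd number of raters and two labels, majority voting never ties, which is presumably why three evaluators were used.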
Through example 8, the user imparts his/her personal beliefs, which might or might not affect the public; thus, DESYR possibly interprets it as a personal opinion and marks it as a non-claim. (The three human evaluators are linguists by profession; their ages range between 24 and 45 years.) Examples 7 and 10 encompass the strong claim phrases 'has died in India' and 'paying them nothing', respectively. Clearly, these two examples make strong social assertions that would be of interest to a larger audience, which is presumably why they are labeled as claims. Finally, in example 9, the user commends the determination of doctors during the global pandemic and expresses their experiences. On the surface, this example would not fall under the claim category, and DESYR indeed mis-classifies it as a non-claim, possibly overlooking the significance of the phrase 'curing COVID-19'. Our observations from the in-the-wild evaluation suggest that DESYR can assign labels to unseen tweets quite efficiently and accurately. Moreover, we do not observe any unexpected behavior from DESYR. This furnishes us with empirical evidence that DESYR can be used for claim detection in informal texts. Through this far-reaching and systematic study, we make notable contributions that constitute considerable strides in the field of claim detection. We brought Poincaré embeddings to an NLP task, where they showed promising results for claim detection. The proposed model, DESYR, determines the existence of claims in online text by aligning the query to encoded definitions and projecting them into a purer space. We evaluated DESYR across four web-based datasets comprising short informal texts and observed up-to-par results. The comparative investigation evinced the more nuanced performance of our model against various existing systems. Experiments demonstrated the superiority of our model, with ≥3% claim-F1 improvements over the existing state-of-the-art claim detection system, LESA.
Additionally, we exhibited every individual component's performance and significance in our model through an exhaustive ablation study. Finally, we showed the robustness of DESYR through a qualitative human evaluation in the wild on 50 random samples.

REFERENCES
[1] Social media and fake news in the 2016 election.
[2] Reem Suwaileh, and Fatima Haouari. 2020. CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media.
[3] Reem Suwaileh, and Fatima Haouari. 2020. CheckThat! at CLEF 2020: Enabling the Automatic Identification and Verification of Claims in Social Media.
[4] The COVID States Project #14: Misinformation and Vaccine Acceptance.
[5] Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages.
[6] Identifying Justifications in Written Dialogs.
[7] Stochastic gradient descent on Riemannian manifolds.
[8] IMHO fine-tuning improves claim detection.
[9] On the Role of Images for Analyzing Claims in Social Media.
[10] What is the Essence of a Claim? Cross-Domain Claim Identification.
[11] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[12] Unsupervised domain adaptation by backpropagation.
[13] Domain-adversarial training of neural networks.
[14] Learning word vectors for 157 languages.
[15] LESA: Linguistic Encapsulation and Semantic Amalgamation Based Generalised Claim Detection from Online Content.
[16] Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse.
[17] Adam: A method for stochastic optimization.
[18] Context dependent claim detection.
[19] Unsupervised corpus-wide claim detection.
[20] Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection.
[21] Context-independent claim detection for argument mining.
[22] RoBERTa: A robustly optimized BERT pretraining approach.
[23] Effective approaches to attention-based neural machine translation.
[24] During this coronavirus pandemic.
[25] Poincaré Embeddings for Learning Hierarchical Representations.
[26] Team Alex at CLEF CheckThat! 2020: Identifying Check-Worthy Tweets With Transformer Models.
[27] World Health Organization et al. 2020. Immunizing the public against misinformation.
[28] Joint prediction in MST-style discourse parsing for argumentation mining.
[29] GloVe: Global vectors for word representation.
[30] Feature projection for improved text classification.
[31] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
[32] Detecting opinionated claims in online discussions.
[33] Going deeper with convolutions.
[34] The Uses of Argument.
[35] Conformity biased transmission in social networks.
[36] Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models.
[37] XLNet: Generalized autoregressive pretraining for language understanding.
[38] OpenTag: Open attribute value extraction from product profiles.

ACKNOWLEDGMENTS
The work was partially supported by the Accenture Research Grant. T. Chakraborty would like to acknowledge the support of the Ramanujan Fellowship, CAI, IIIT-Delhi, and the ihub-Anubhuti-iiitd Foundation set up under the NM-ICPS scheme of the Department of Science and Technology, India.