key: cord-0158009-97ls6jfa
authors: Shen, Xinyue; He, Xinlei; Backes, Michael; Blackburn, Jeremy; Zannettou, Savvas; Zhang, Yang
title: On Xing Tian and the Perseverance of Anti-China Sentiment Online
date: 2022-04-19
journal: nan
DOI: nan
sha: 9d070671b49dabf988464779696aff75fe592878
doc_id: 158009
cord_uid: 97ls6jfa
Sinophobia, anti-Chinese sentiment, has existed on the Web for a long time. The outbreak of COVID-19 and the extended quarantine has further amplified it. However, we lack a quantitative understanding of the cause of Sinophobia as well as how it evolves over time. In this paper, we conduct a large-scale longitudinal measurement of Sinophobia, between 2016 and 2021, on two mainstream and fringe Web communities. By analyzing 8B posts from Reddit and 206M posts from 4chan's /pol/, we investigate the origins, evolution, and content of Sinophobia. We find that anti-Chinese content may be evoked by political events not directly related to China, e.g., the U.S. withdrawal from the Paris Agreement. During the COVID-19 pandemic, daily usage of Sinophobic slurs has significantly increased even with the hate-speech ban policy. We also show that the semantic meaning of the words "China" and "Chinese" shifts towards Sinophobic slurs with the rise of COVID-19 and remains so during the pandemic period. We further use topic modeling to show that the topics of Sinophobic discussion are diverse and broad. We find that both Web communities share some common Sinophobic topics like ethnics, economics and commerce, weapons and military, foreign relations, etc. However, compared to 4chan's /pol/, more daily life-related topics including food, game, and stock are found on Reddit. Our findings also reveal that the topics related to COVID-19 and blaming the Chinese government are more prevalent in the pandemic period. To the best of our knowledge, this paper is the longest quantitative measurement of Sinophobia. 
The story of Xingtian speaks of a deity that fought against the Supreme Divinity. Although Xingtian's army lost, he refused to stop fighting, so extreme actions were taken: he was decapitated and his head was buried under a mountain. However, this was still not enough to stop Xingtian. He continued to fight, using his nipples to see and his bellybutton to speak. The story of Xingtian is one of persistence, and it has a parallel in a worrying behavior on the Web that likewise continues to persist. While Sinophobia, i.e., anti-Chinese sentiment, had a staggering rise after the COVID-19 pandemic began, it has persisted for hundreds of years. For example, in 1882 the Chinese Exclusion Act was passed, which barred Chinese workers from entering the US until its repeal roughly 60 years later in 1943 [22]. In 2013, a survey conducted by the Pew Research Center [20] showed that Sinophobia persisted in the West, e.g., only 34% of Americans, 28% of Italians, and 28% of Germans had a favorable opinion of China.
With the development of the Internet, media such as text, images, and videos are created and shared at ever-increasing volume. However, the Web also has a downside, e.g., the rise of fringe communities such as 4chan's Politically Incorrect board (/pol/). Sinophobia is indeed a "popular" topic discussed on fringe Web communities [34, 38, 40]. The effects of Sinophobia are seen not only on the Web but also in the physical world. Relia et al. [35] provide evidence that online racist activity correlates with hate crimes. The outbreak of COVID-19 has amplified Sinophobia on fringe Web communities and on mainstream social media like Twitter and Reddit [38].
With the advent of the COVID-19 pandemic and its origins in China, Sinophobia has become a topic of research. Tahmasbi et al. [38] study the rise of Sinophobic behaviors after COVID-19 on both fringe and mainstream Web communities over 5 months and Ziems et al. 
[61] investigate the evolution of anti-Asian sentiment on Twitter across three months after the outbreak of COVID-19. However, to the best of our knowledge, there is no study that analyzes Sinophobia over a longer period of time (i.e., including the period before the COVID-19 era), and thus there is a meaningful gap in our understanding of the evolution of Sinophobia on social media. Also, the previous study [38] only focuses on word-level analysis; a more comprehensive content-level analysis with respect to toxicity and topics is missing. More importantly, as the world begins to recover from COVID-19, it remains unclear whether Sinophobic behavior has seen a downtrend as well.
In this paper, we perform a large-scale longitudinal measurement of how Sinophobia has ebbed and flowed from 2016 to 2021 over two Web communities: 4chan's Politically Incorrect board (/pol/) and Reddit. With 206,329,303 posts on /pol/ and 8,118,465,218 posts on Reddit, we quantify Sinophobic behaviors with regard to their origins, evolution, and content. Concretely, we first measure the temporal patterns of China-related posts and examine Sinophobic slurs in detail. We find that Sinophobia was prevalent before COVID-19 and was correlated with political events such as the Trump-Tsai Call [57] and the Hong Kong Protests [45]. More importantly, many such political events are not directly related to China, e.g., the U.S. withdrawal from the Paris Agreement [58] and the inauguration of Joe Biden [51]. COVID-19, however, saw a drastic change in Sinophobia: the daily usage of Sinophobic slurs increases substantially after COVID-19, e.g., the usage frequency of the Sinophobic term "chicoms" increases 5.2×. Even with a hate-speech ban policy [43], slurs still slip through on Reddit, which calls for the mainstream community to take more action and responsibility. 
Using Google's Perspective API, we find that posts related to China and Chinese are more toxic than a baseline set of comments, and this toxicity drastically increases during the pandemic period (see the section "Temporal Analysis" for more details). We then analyze the content of posts and find that "china" and "chinese" have had a sharp shift away from referring to the country/Chinese government and towards Sinophobic slurs. For instance, on /pol/, the meaning of "china" is close to "taiwan" and "asia" in 2016, while it shifts to "chink" and "chinkland" in 2020. By performing topic extraction, we find a diverse set of discussions. For instance, /pol/ and Reddit both have Sinophobic topics that are related to ethnics, economics and commerce, weapons and military, foreign relations, etc. Compared to /pol/, Sinophobic topics on Reddit are more diverse in language and cover a wider range of topics that are related to daily life such as food, game, and stock. We also observe that the Sinophobic topics switch after the outbreak of COVID-19. For instance, people show more interest in pandemic-related topics, with an increase of 11× on 4chan's /pol/ (topic 3) and 5.32× on Reddit (topic 7). Our analysis also reveals that, in the pandemic period, users express more toxic posts towards the Chinese government with anger, e.g., the scale of such topics grew to 1.25× its pre-pandemic level on 4chan's /pol/ (topic 5) and 1.71× on Reddit (topic 12).
Disclaimer. Note that content posted on both Web communities we study can be considered racist and offensive. In the rest of this paper, we do not censor any language to better illustrate the peculiarities of the problem. We inform the reader that this paper contains content that is likely to be offensive and disturbing.
We now review relevant previous work. We report on two areas: 1) studies on Sinophobia across years; 2) measurement of racial activity on social networks.
Sinophobia studies across years. 
Prejudice against East Asians has persisted for a long time in the West: in the 19th century, the racist phrase "yellow peril" was coined to insult Chinese people, and it resurfaced when COVID-19 broke out [21]. Several surveys and studies depicted the unfavorable attitudes towards China prior to COVID-19 [4, 7, 31]. Tahmasbi et al. [38] first examined Sinophobia during the outbreak of COVID-19 on Twitter and 4chan's /pol/. They mainly focused on Sinophobic behaviors at the word level and found that COVID-19 indeed provoked the emergence of Sinophobic slurs. However, their dataset only spans 5 months and therefore lacks a longitudinal perspective. Nguyen et al. [30] conducted sentiment analysis on race-related tweets and found that the proportion of negative tweets referencing Asians increased by 68.4% until March 2020.
Racism analysis on the internet. Besides Sinophobic behaviors, several papers measure online racial activities. Zannettou et al. [60] presented a quantitative analysis of online antisemitism. Cervi [8] and Chandra et al. [9] analyzed Islamophobia. Mittos et al. [26] focused on ethnic discrimination on Reddit and 4chan's /pol/ via understanding genetic testing conversations. Yang et al. [59] investigated self-narration of racial discrimination in Reddit threads.
Remarks. To the best of our knowledge, a quantitative understanding of the cause of Sinophobia as well as how it evolves over several years is lacking. In this paper, we aim to bridge the gap.
In this section, we present our datasets from 4chan's Politically Incorrect board (/pol/) and Reddit. Table 1 summarizes the two collected datasets.
4chan's /pol/. 4chan is an anonymous imageboard organized by sub-communities named "boards", each driven by a specific topic of interest. In this work, we focus on the Politically Incorrect board (/pol/) as it is the main board for the discussion of politics and world events. 
Additionally, 4chan is ephemeral as it maintains a limited number of active threads and permanently deletes threads after a week. We choose to include 4chan in our analysis as it is a fringe Web community, known for the dissemination of toxic and offensive language [19], hence it is likely to include a considerable amount of Sinophobic language [38]. We obtain the data via the official API provided by 4chan [1] and follow the data collection approach described in [33]. In total, we collect 206,329,303 posts between June 30, 2016 and March 18, 2021.
Reddit. Reddit is a social news aggregation and discussion website organized by subreddits, which are sub-communities created by users. Registered members are allowed to submit content to these subreddits, get replies from other users, or receive upvotes or downvotes from other members (upvotes and downvotes determine the popularity of content within the platform). We include Reddit in our analysis because of the platform's popularity [41] and the platform's diversity (i.e., with hundreds of millions of subreddits the platform covers a huge set of interests). Due to the platform's popularity and diversity, we expect that discussions related to Chinese and Asian people in general are happening on Reddit, hence allowing us to include in our analysis a more mainstream platform (compared to 4chan). In this paper, we collect all posts and comments on Reddit from [3]. In total, we gather 8,118,465,218 posts.
Table 1: Overview of the 4chan and Reddit datasets. Filtered posts refer to posts containing the term "china" or "chinese".
Table 2 (surviving rows only; columns: peak, date, event).
9 | 2020-01-23 | COVID-19 lockdown in China [50]
10 | 2020-03-21 | COVID-19 spread worldwide [49]
11 | 2020-08-01 | Trump said he will ban TikTok in the US [6]
12 | 2020-12-13 | A major leak of a register with the details of nearly two million CCP members was exposed to the public [28]
13 | 2021-01-20 | Inauguration of Joe Biden [51]
To identify the scope of Sinophobia, we start with an investigation of the temporal patterns of the terms "china" and "chinese." We then elaborate on our findings about Sinophobic slurs as well as their influence and implications. Next, we study the meaning behind the changes in terms of correlation and perspectives.
We first focus on the daily usage of "china" and "chinese" on 4chan's /pol/ and Reddit. To do this, for each post, we convert the text to lowercase, perform tokenization using NLTK [5], and search for the terms "china" and "chinese." Figure 1 shows the daily percentage (over all posts per day) of the terms "china" and "chinese" on 4chan's /pol/ and Reddit. We annotate the figure with the day that the World Health Organization (WHO) first tweeted about coronavirus (orange dotted line, January 4, 2020) [44], which is the first day that COVID-19 officially entered the global spotlight on Web communities. In this paper, we consider the period before January 4, 2020 as the pre-pandemic period and the period after January 4, 2020 as the pandemic period. We also detect 13 peaks in our datasets by using peak detection [37] and annotate them with blue dotted lines. By manually checking around 2,000 posts in each peak, we find that the discussion at each peak corresponds to at least one event that happened around that day (see Table 2). Surprisingly, we find that events not directly related to China may also evoke discussions related to China and Chinese people. For instance, on June 1, 2017, President Trump announced that the U.S. would end all participation in the 2015 Paris Agreement on climate change mitigation [58]. 
This announcement evokes numerous posts about China and Chinese on both 4chan's /pol/ and Reddit, and the volume reaches its peak on June 2, 2017. At this peak, China and Chinese are frequently mentioned for the roles they play in environmental protection. For instance, a Reddit user posts: "fucking china is the one producing the co2 bigle."
We then measure the peak width of each event via [37] and regard it as the interest period of each event. The relative height at which the peak width is measured is 0.5. We find that the pattern of the interest period changes thoroughly during the pandemic period. For instance, in the pre-pandemic period, the average interest time of "china" ("chinese") is 2.76 (2.83) days. However, it increases to 8.00 (9.28) days in the pandemic period, indicating prolonged discussions and extended interest around China and Chinese people. The daily usage of "china" and "chinese" also rises in the pandemic period. On 4chan's /pol/, for example, the average percentage of posts per day containing "china" ("chinese") rises from 0.50% (0.34%) to 1.23% (0.66%) after the outbreak of COVID-19. For Reddit, it also increases from 0.18% to 0.22%. This finding implies that COVID-19 indeed raised the attention towards China and Chinese people on fringe and mainstream Web communities.
We then take a further step to understand the origin, usage, and evolution of Sinophobic slurs. Concretely, we semi-automatically capture anti-China and anti-Chinese slur words via word2vec models [24] trained on the entire 4chan's /pol/ dataset and a dataset of 1% randomly sampled posts from Reddit. Before training the models, we first pre-process the two datasets as follows: 1) we convert the content of posts to lowercase and expand contractions such as "it's" to "it is"; 2) we remove punctuation, stopwords, HTML tags, and numbers. Next, we train word2vec models for each Web community on the whole pre-processed corpus with all words that appear at least 20 times. 
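The pre-processing steps listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the contraction map and stopword list are tiny hypothetical stand-ins, the function names `preprocess` and `build_vocab` are made up for this example, and the order of the cleaning steps is an assumption.

```python
import html
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny resources; a real pipeline would use fuller lists.
CONTRACTIONS = {"it's": "it is", "can't": "cannot", "don't": "do not"}
STOPWORDS = {"the", "a", "an", "is", "are", "it", "to", "of", "and", "in", "on", "do", "not"}

TAG_RE = re.compile(r"<[^>]+>")      # HTML tags such as <br>
NUMBER_RE = re.compile(r"\d+")       # runs of digits
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(post):
    """Lowercase, expand contractions, then drop HTML tags, numbers,
    punctuation, and stopwords. Returns a list of tokens."""
    text = TAG_RE.sub(" ", post)
    text = html.unescape(text).lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TO_SPACE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocab(tokenized_posts, min_count=20):
    """Keep only words that appear at least `min_count` times in the corpus."""
    counts = Counter(tok for post in tokenized_posts for tok in post)
    return {word for word, count in counts.items() if count >= min_count}
```

The `min_count=20` default mirrors the frequency cut-off stated in the text; everything else about the tokenization is a simplification for illustration.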
We set the context window to 5 following [15]. To semi-automatically identify hate words towards "china" and "chinese", we leverage 15 hate words following [19], which are "nigger," "faggot," "retard," "bitch," "idiot," "cunt," "kike," "fag," "nazi," "trash," "pussy," "goy," "frog," "spic," and "chink." Then, we perform operations (i.e., additions) on the word embeddings of each of the above hate words and the terms "china"/"chinese" and extract the 10 most similar words based on the resulting embeddings (e.g., extracting the most similar words to the embedding that is calculated from the addition of the embeddings for the words "nigger" and "china"). By iteratively repeating the same procedure for all combinations of words, we obtain 300 words (15 hate words × 2 terms × 10 similar words). We then count the number of appearances of each similar word and rank them. In Table 3, we report the top 15 most similar words to "china" and "chinese," as well as the top 15 most frequently occurring words obtained by combining the embeddings of the hate words and the terms "china" and "chinese." By manually inspecting them, we find nine derogatory terms referring to China and Chinese, including "chink" (slur word referring to Chinese and East Asian people) [48], "chinks" (plural of chink), "chinkland" (an offensive word referring to the land of chinks) [13], "chinka" (derives from "chink" and the last syllable of the word "nigga") [12], "chinaman" (evolved from its use in pejorative contexts regarding Chinese) [11], "chicoms" (a contemptuous term used to refer to a Communist Chinese) [10], "chang" (an ethnic slur for Chinese, evolved from mockery of the Chinese language) [47], "chyna" (a deliberately misspelled word to insult China) [39], and "gook" (a derogatory term used against Asians) [14]. An example post from 4chan's /pol/: "i can't believe there's actual people who side with fucking chyna. like seriously they are another level of shit people". Another /pol/ user posts: "everyone is seriously making me hate chinks. 
we need to ship them all back to chinka".
Figure 2 shows the daily proportion of Sinophobic slur words on 4chan's /pol/. We first observe that the usage of Sinophobic slurs coincides with political events. For instance, on Aug 1, 2020, when Trump announced he would ban TikTok [6], there is a sudden spike of all nine slurs on both Web communities. In addition, the average daily usage of slurs also surges in the pandemic period. For example, the usage frequency of "chicoms" climbs 5.2× in the pandemic period. Note that this rise in slurs cannot simply be attributed to the increase of "china" and "chinese". Take "chink" and "chinks" as an example. The daily usage of the two words amounts to 29.51% and 42.79% of the usage of the term they refer to, "chinese," in the pre-pandemic period. However, this proportion climbs to 41.77% and 52.75% in the pandemic period, indicating that users are increasingly inclined to use slur words to refer to Chinese people. We see a different slur distribution on Reddit (see Figure 3). Since Reddit has had a hate-speech ban policy in place since June 2020 [43], the frequency of Sinophobic slurs does not rise as significantly as on /pol/, but we still observe slurs slipping through on Reddit. For instance, "chicoms" increases 39.73% in the pandemic period. These findings highlight the wide usage of Sinophobic slurs on both 4chan's /pol/ and Reddit, especially during the pandemic period.
To verify the relationship between the terms "china", "chinese", and the slur words in a quantitative way, we measure their correlation coefficients in different periods. Concretely, for each Web community, we split the percentage values of daily posts into pre-pandemic and pandemic periods. We then calculate the Pearson correlation among the pre-pandemic percentage values and among the pandemic percentage values of the eleven terms, i.e., "china," "chinese," as well as the nine Sinophobic slurs (see Figure 4). 
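As an illustration of the per-period correlation described above, the following Python sketch splits two daily percentage series at the split date and computes the Pearson coefficient for each part. The function names are hypothetical, and assigning January 4, 2020 itself to the pandemic period is an assumption, since the text only defines the periods before and after that day.

```python
import math
from datetime import date

SPLIT_DAY = date(2020, 1, 4)  # day of the first WHO tweet, as stated in the text


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def correlation_by_period(days, series_a, series_b, split=SPLIT_DAY):
    """Return (pre-pandemic r, pandemic r) for two daily percentage series.
    Assumption: the split day itself is counted as pandemic."""
    pre = [(a, b) for d, a, b in zip(days, series_a, series_b) if d < split]
    pan = [(a, b) for d, a, b in zip(days, series_a, series_b) if d >= split]
    r_pre = pearson([a for a, _ in pre], [b for _, b in pre])
    r_pan = pearson([a for a, _ in pan], [b for _, b in pan])
    return r_pre, r_pan
```

Running `correlation_by_period` over every pair of the eleven terms would yield the two correlation matrices; the pairwise loop is omitted to keep the sketch short.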
We find that the correlation between "china" and "chinese" is high on 4chan's /pol/ (0.84) and Reddit (0.89) in the pre-pandemic period, and after the outbreak of COVID-19, it even rises to 0.95 and 0.96 on /pol/ and Reddit, respectively, which sheds light on the increasing homogenization between the country and its people. In addition, official words, e.g., "china" and "chinese", also become closely related to most of the ethnic slurs in the pandemic period. Take "chinaman" and the term it refers to, "chinese," as an example: in the pre-pandemic period, their correlation is 0.48 on 4chan's /pol/ and 0.12 on Reddit. However, in the pandemic period, it surges to 0.84 on 4chan's /pol/ and 0.46 on Reddit, which indicates that the discussions around China and Chinese people substantially changed during the pandemic period.
We then use Google's Perspective API [17] to score the posts along two attributes, i.e., SEVERE_TOXICITY and IDENTITY_ATTACK. SEVERE_TOXICITY measures the degree of hate, rudeness, and disrespect of comments, and IDENTITY_ATTACK evaluates how negative or hateful comments targeting someone because of their identity are. Concretely, for each Web community, we select the posts that contain "china" or "chinese" in the pre-pandemic and pandemic period. Note that we also extract 1% of all posts without the term "china" or "chinese" as a baseline. The CDFs of the different scores are summarized in Figure 4c, and we list our findings as follows. First, we observe that /pol/ posts are more toxic than Reddit posts in general, which is expected as /pol/ users are anonymous and considered more notorious [33], and the platform is less moderated. For instance, the percentage of posts with a SEVERE_TOXICITY score greater than 0.5 is more than 34.26% on 4chan's /pol/, while it is only 10.79% on Reddit. 
Second, compared to the posts that do not mention "china" and "chinese", the posts containing them have a higher score across all Perspective dimensions. Take /pol/ as an example: the percentage of posts with an IDENTITY_ATTACK score ≥ 0.5 is more than 53.91% for the posts containing "china" or "chinese," while it is only 37.16% for the posts without them. Third, compared to the pre-pandemic period, the pandemic period has a higher toxicity level. For example, in the pre-pandemic period, the percentage of posts with SEVERE_TOXICITY ≥ 0.8 is 3.15% on Reddit, and it climbs to 4.65% in the pandemic period.
[Figure caption, continued] Here (pre-pandemic)/(pandemic) means the posts that contain "china" or "chinese" in the pre-pandemic/pandemic period and (without) means the posts that do not contain "china" or "chinese" in the whole period.
Takeaways. We find that events that are not directly related to China may also evoke discussion about China and Chinese people, e.g., the United States withdrawal from the Paris Agreement [58] and the inauguration of Joe Biden [51]. The discussions around China and Chinese people are also prolonged in the pandemic period. For instance, in the pre-pandemic period, the average interest time of "china" ("chinese") is 2.76 (2.83) days, while it increases to 8.00 (9.28) days in the pandemic period. Analysis of Sinophobic slurs reveals that they are widely used on both fringe and mainstream Web communities, especially during the pandemic period. Surprisingly, even with the hate-speech ban policy [43], slurs are still observed on Reddit, which calls for the mainstream community to take more action and responsibility. We also find that the usage patterns of Sinophobic slurs become more similar on both Web communities in the pandemic period, implying that people's linguistic habits when referring to the Chinese are tilting towards slurs. 
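The threshold-based comparisons reported in this section (for example, the share of posts with a Perspective score of at least 0.5 or 0.8) reduce to a simple computation once scores are available. The sketch below assumes the scores have already been obtained and are plain floats in [0, 1]; it does not call the Perspective API, and the function and group names are hypothetical.

```python
def share_at_or_above(scores, threshold):
    """Percentage of posts whose score is >= threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    hits = sum(1 for s in scores if s >= threshold)
    return 100.0 * hits / len(scores)


def empirical_cdf(scores, x):
    """Fraction of posts with score <= x, i.e., one point on a CDF curve."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s <= x) / len(scores)


def compare_groups(groups, threshold=0.5):
    """Map each group name (e.g., 'pre-pandemic', 'pandemic', 'without')
    to its percentage of posts at or above the threshold."""
    return {name: share_at_or_above(scores, threshold)
            for name, scores in groups.items()}
```

For instance, `compare_groups({"pandemic": [0.9, 0.2], "without": [0.1, 0.3]}, threshold=0.8)` gives 50.0 for the first group and 0.0 for the second.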
Perspective analysis of Sinophobic slurs shows that Sinophobia is more severe in posts from 4chan's /pol/, posts that contain "china" or "chinese", and posts in the pandemic period.
In this section, we aim to study the evolution of Sinophobia from the content of the posts in our dataset. Specifically, we measure how drastically the semantic meaning of "china" and "chinese" has changed and examine the shift through a diachronic visualization. Next, we provide a detailed analysis of a number of Sinophobic topics discussed on 4chan's /pol/ and Reddit over six years.
To better understand the meaning shift of "china" and "chinese", for each Web community, we train multiple word2vec models, one on the corpus of each month, with the same settings we used to train the whole-corpus model (see the section "Temporal Analysis"). Note that we discard June 2016 as it only contains two days' posts on /pol/, which is not meaningful for comparison. In this way, for each Web community, we have 57 word2vec models corresponding to each month from July 2016 to March 2021. We treat the pretrained GoogleNews model [23] released by Google as our baseline model and align the monthly models to it to ensure that the vectors are projected onto the same coordinate axes [18]. Figure 5 and Figure 6 display the diachronic word embeddings of "china," "chinese," "virus," "hk," "america," and "jew" on 4chan's /pol/ and Reddit, respectively. We choose "virus" and "hk" as baselines for event-driven meaning shift, "america" as a comparison for "china", and "jew" as a comparison for "chinese". The Y-axis is the cosine similarity between the vector of the (aligned) monthly model and that of the baseline model, which is a conventional evaluation metric used to measure word embedding differences [38, 60]. First, we observe that the most significant semantic changes of these words happen in different periods. 
For "china", "chinese", and "virus", it is during the outbreak of COVID-19 (Jan, 2020); for "america", it happens when Donald Trump wins the U.S. election (Nov, 2016); for "hk", it is during the Hong Kong protests (Oct, 2019); and for "jew", it is when a synagogue attack happened on a High Holy Day (Oct, 2019). These findings hold for both 4chan's /pol/ and Reddit, which indicates that the semantic changes correspond to political events and are a cross-platform phenomenon. We also measure the average cosine similarity for different periods. Specifically, in the pre-pandemic period, the average cosine similarity of "china" ("chinese") is 0.25 (0.46) on 4chan's /pol/ with variance of 0.02 (0.03). However, in the pandemic period, this value shifts to 0.18 (0.39) with variance of 0.04 (0.06), which indicates more significant semantic changes after COVID-19. For comparison, the average cosine similarity of "hk", "america", and "jew" is 0.28, 0.30, and 0.29 with variance of 0.07, 0.02, and 0.03 in the pre-pandemic period and 0.22, 0.29, and 0.33 with variance of 0.05, 0.03, and 0.03 in the pandemic period, which is more stable than "china", "chinese", and "virus".
To further understand the meaning behind such dramatic changes, we visualize the diachronic word embeddings of words that are similar to the terms "china" and "chinese", following the methodology proposed by [18]. In a nutshell, we start by selecting the month including the split month (2020-01) and six other months with the most frequent mentions of "china" and "chinese" for each year, i.e., 2016-12, 2017-08, 2018-04, 2019-10, 2020-03, and 2021-01 (see Figure 1). Next, we train the word2vec model for each month and align these word2vec models to the first one (2016-12). All embeddings are projected into two dimensions via t-SNE [42]. For each model, we select the top 10 most similar words (denoted as reference words) of the keyword "china" or "chinese" and the results are depicted in Figure 9. 
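The per-period averages and variances of cosine similarity reported above can be sketched as follows. This is a minimal illustration that assumes the monthly vectors are already aligned to the baseline model; the function names are hypothetical, the split month 2020-01 is assigned to the pandemic period by assumption, and population variance is used because the text does not state which variance estimator was applied.

```python
import math
from statistics import mean, pvariance


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def summarize_by_period(similarity_by_month, split_month="2020-01"):
    """similarity_by_month maps 'YYYY-MM' to the cosine similarity between
    the aligned monthly vector and the baseline vector of one word.
    Returns {'pre': (mean, variance), 'pandemic': (mean, variance)}.
    'YYYY-MM' strings compare correctly in lexicographic order."""
    pre = [s for m, s in similarity_by_month.items() if m < split_month]
    pan = [s for m, s in similarity_by_month.items() if m >= split_month]
    return {
        "pre": (mean(pre), pvariance(pre)),
        "pandemic": (mean(pan), pvariance(pan)),
    }
```

A drop in the pandemic mean relative to the pre-pandemic mean, as reported for "china" and "chinese", indicates that the monthly vectors moved further away from the baseline meaning.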
By inspecting the positions of words in Figure 9, we can trace the corresponding meaning shift in each month. For instance, for "china" on 4chan's /pol/ (see Figure 9a), the meaning is close to "taiwan" and "asia" in December 2016. Then, it moves to "nk", "sk", and "korea" in August 2017, "prc" in April 2018, and "hk" and "mainland" in October 2019. Until 2019, all these shifts are still related to geographic terms and political events, e.g., the Trump-Tsai Call [57], the North Korean Nuclear Test [52], and the Hong Kong Protests [45]. However, in January 2020, the meaning of "china" shifts not only to pandemic words, e.g., "quarantine" and "outbreak", but also towards slur words like "chink" and "chinkland". Then, it moves to "chyna" in March 2020, and finally takes a step back in January 2021, which shows that the meaning of "china" is shifting from denoting the country to the Sinophobic slur words. We have a similar observation for "chinese" on /pol/. For Reddit, we find that the terms also shift from region-related words, e.g., "taiwan" and "tibet" to pandemic words, e.g., "tariff" and "virus." However, we do not observe Sinophobic slur words there.
We then dive into the detailed toxic topics that are related to China and Chinese on 4chan's /pol/ and Reddit. Concretely, we collect all posts that mention "china" or "chinese" and apply the same pre-processing strategy as in the section "Racial Slurs", which results in 2,193,410 posts from 4chan's /pol/ and 26,183,882 posts from Reddit. Similar to [36], we consider posts with a SEVERE_TOXICITY score ≥ 0.8 as toxic posts and retain 224,087 posts from 4chan's /pol/ and 388,060 posts from Reddit as the training set of each Web community. We choose SEVERE_TOXICITY as it is less sensitive to milder forms of toxicity, such as comments that include positive uses of curse words [16]. We leverage Top2vec [2] as our topic modeling method to extract topics from the posts. 
After training with its default settings, the models generate 924 topics for 4chan's /pol/ and 1,542 topics for Reddit, respectively. We then hierarchically reduce the total number of topics to smaller values ranging from 10 to 80 with a step size of 10. Concretely, we use u_mass [25] to evaluate the quality of the models with different numbers of topics, where a higher u_mass indicates a better model. We calculate the u_mass for each model and pick the one with the highest u_mass value as our final topic model, which has 20 topics for both 4chan's /pol/ and Reddit. Figures 7 and 8 show the word cloud of each topic for each Web community. Note that we sort the topics by their popularity, e.g., topic 0 is the most popular one.
Topics on 4chan's /pol/. First, when looking at the results of 4chan's /pol/ (see Figure 7), we uncover some general topics including ethnics (topic 0), economics and commerce (topic 1), foreign relations (topic 4), China internal affairs (topic 6), Chinese government (topic 5), weapon and military (topic 15), as well as event-related topics, like COVID-19 (topic 3) and U.S. election (topic 7), depicting the panoramic view of diverse anti-China rhetoric. For instance, the most popular topic on 4chan's /pol/ towards Chinese is topic 0, containing words referring to ethnics, mostly towards Chinese and other East Asians (e.g., "asians", "koreans", "chinese" as well as "iq", "genes"). A /pol/ user posted: "the chinese iq meme comes from the chinese government literally cherrypicking their best. most of china and almost the entirety of southern asia are dumb." Another /pol/ user posted: "most asians are awful in everyway, including the women. though japs, and occasionally korean or chinese, are decent like whites. the holocaust is a lie. whites are superior. most blacks are physically incompetent as much as they are mentally." 
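The model selection step described earlier in this section (scoring each reduced topic model with u_mass and keeping the best one) can be sketched in a few lines of Python. The formula below follows one common formulation of u_mass, a sum of log ratios of document co-occurrence counts over ordered pairs of a topic's top words; whether the authors summed or averaged over pairs and topics is not stated, so this is an assumption, and the function names are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """u_mass-style coherence of one topic.
    top_words: the topic's top words, most relevant first.
    documents: tokenized posts (lists of words).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over pairs with l < m, where D counts
    the documents containing all given words. Values closer to 0 are better."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


def pick_topic_count(coherence_by_count):
    """Given {number_of_topics: model-level u_mass}, return the number of
    topics with the highest (i.e., best) u_mass."""
    return max(coherence_by_count, key=coherence_by_count.get)
```

With candidate sizes from 10 to 80 in steps of 10, `pick_topic_count` would be called on a dictionary with eight entries; the text reports that 20 topics scored best on both platforms.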
We also observe several topics containing multi-language words, like Chinese (topic 16), Portuguese (topic 16), and German (topic 18), which indicates the active participation of speakers using these languages in Sinophobic discussions. For example, a /pol/ user who sets his country to Brazil posts "você é um preto socialista filho de uma puta que nunca se importou desse país inteiro estar se vendendo pra china." (in translation, "you're a black socialist son of a bitch who never cared about this whole country selling to china.") Interestingly, TikTok [54], a famous video-sharing application owned by a Chinese company, is frequently mentioned in toxic posts on /pol/ (topic 13) and is regarded as a weapon from China, e.g., "tiktok is a chinese weapon designed to turn western children into weak pussies.", indicating fear or hatred towards Chinese people. These kinds of emotions and prejudice are also regarded as the origins of Sinophobia [4].
Topics on Reddit. For Reddit, we can observe a number of toxic topics similar to those on 4chan's /pol/, e.g., economics and commerce (topic 1), weapon and military (topic 8), internal affairs (topic 17), ethnics (topic 3), foreign relations (topic 8), Chinese government (topic 12), COVID-19 (topic 7), and U.S. election (topic 9). However, toxic topics on Reddit are more diverse in language and cover a wider range of aspects than those on 4chan's /pol/. For instance, the most popular topic on Reddit is a multi-language one, combining slurs in German, Italian, Spanish, and other languages. Examples are "zusammen china ficken (in translation, fuck together china).", "la brutta china è realtà. (in translation, ugly china is reality.)", and "china filha da puta (in translation, china son of a bitch)", respectively. "Yellow fever", a strong sexual or romantic preference for persons of Asian descent [46], is also a hot topic on Reddit (ranked in the top 6). For example, a Reddit user posts: "i got a bitch up in china, i like to fuck her vagina." 
In addition, we observe racial slurs in topics about daily life, including food (topic 10), game (topic 11), and stock (topic 18). For example, "I ate chinese food so you better hope I don't fart in your fucking face.", "stupid chinese cheaters messing up all the good battle royale games", and "i feel like nio is the laughing stock of the trading world. lol fucking lying chinese hustle."
Shift in topics. Now, we discuss the shift of topics from the pre-pandemic period to the pandemic period (see Table 4 and Table 5). Overall, we find that users' interests drastically shift to COVID-19 on both Web communities. The shift ratio reaches 11× on 4chan's /pol/ and 5.32× on Reddit. We also observe that Chinese-government topics are the most toxic ones over the whole period on both /pol/ (ST = 0.86) and Reddit (ST = 0.85). Their scale grows 1.25× on 4chan's /pol/ (topic 5) and 1.71× on Reddit (topic 12) during the pandemic period, implying that users post more toxic, angry content about the Chinese government. Meanwhile, we also notice differences between the two Web communities. For instance, the topic accusing Biden of being a Chinese spy appears on both 4chan's /pol/ (topic 7) and Reddit (topic 9). However, its share of the discussion increases 1.59× in the pandemic period on /pol/, while it remains almost unchanged, i.e., 1.01×, on Reddit. Besides, the most popular topic on Reddit, which consists of multi-language slurs, continues growing in the pandemic period, which implies that users from more countries contribute Sinophobic slurs.
Takeaways. In this section, we focus on the evolution of Sinophobic content. We analyze the magnitude of semantic change and find that "china," "chinese," and "virus" undergo larger changes than baseline words like "america," "hk," and "jew."
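To make the notion of a "drop in average cosine similarity" concrete, here is a minimal sketch under assumptions that are not taken from this section: each word has one embedding per time slice, the embeddings of different slices are already aligned to a common space, and stability is measured as the mean cosine similarity between consecutive slices. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_consecutive_similarity(vectors):
    """Average cosine similarity between each slice and the next one."""
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims)

def similarity_drop(pre_vectors, pandemic_vectors):
    """Positive values mean the word is less stable in the pandemic period."""
    return (mean_consecutive_similarity(pre_vectors)
            - mean_consecutive_similarity(pandemic_vectors))

# Toy example: a word that is stable before the pandemic and drifts afterwards.
pre = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pandemic = [[1.0, 0.0], [1.0, 1.0]]
drop = similarity_drop(pre, pandemic)
print(round(drop, 3))
```

A larger positive drop for a target word than for a baseline word would indicate that the target word's usage changed more between the two periods.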
For instance, compared to the pre-pandemic period on 4chan's /pol/, the average cosine similarity of "china," "chinese," and "virus" in the pandemic period drops by 0.066, 0.065, and 0.076, respectively, while the corresponding drops are only 0.003, 0.057, and -0.034 for "america," "hk," and "jew," respectively. To further understand the meaning of such changes, we visualize the diachronic word embeddings of words that are similar to "china" and "chinese." Our observation reveals that the meaning of "china" and "chinese" shifts from denoting the country to pandemic-related words and Sinophobic slurs (see also Figure 9). We then dive into the detailed Sinophobic topics and find that the most popular topic is related to Asian people on 4chan's /pol/, while on Reddit we find Sinophobic slurs shared in various languages. Also, we find that both Web communities share some common Sinophobic topics, e.g., ethnics, weapon and military, etc. However, compared to 4chan's /pol/, Reddit contains more Sinophobic topics related to daily life, including food (topic 10), game (topic 11), and stock (topic 18). After the outbreak of COVID-19, we find that users' interests drastically shift to COVID-19 on both Web communities. The shift ratio reaches 11× on 4chan's /pol/ (topic 3) and 5.32× on Reddit (topic 7). In the pandemic period, users also post more toxic content about the Chinese government, which is the most toxic topic on both fringe and mainstream Web communities.
In this paper, we investigate how online Sinophobia has evolved by performing a large-scale measurement from 2016 to 2021 over two Web communities (Reddit and 4chan's /pol/). We first investigate the temporal pattern of the posts related to China and analyze the Sinophobic slurs we discover. Our findings quantitatively reveal that Sinophobia was well established before COVID-19, most often sustained by directly or indirectly related political events.
While the COVID-19 pandemic greatly increased Sinophobia online, it also marked a sharp change in the kind of Sinophobia exhibited: "china" and "chinese" shifted away from referring to the country/government of China towards Sinophobic slurs. When exploring the characteristics of Sinophobic topics across the Web communities we study, we find an overlap in Sinophobic topics like ethnicities; however, Reddit also features much more benign topics such as food. Much like Xingtian, although COVID-19 has made Sinophobia much more grotesque, it has been an ever-present part of online discussion.
Github - 4chan/4chan-api: Documentation for 4chan's read-only json api
Top2Vec: Distributed Representations of Topics
The Pushshift Reddit Dataset
Sinophobia: Anxiety, violence, and the making of Mongolian identity
Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
Trump says he will ban tiktok from operating in the us
people-around-the-globe-aredivided-intheir-opinions-of-china
Exclusionary Populism and Islamophobia: A comparative analysis of Italy and Spain. Religions
"A Virus Has No Religion": Analyzing Islamophobia on Twitter During the COVID-19 Outbreak
GENSIM. models.word2vec - word2vec embeddings
About the api - attributes and language
Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
Ilias Leontiadis, Riginos Samaras, Gianluca Stringhini, and Jeremy Blackburn. Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan's Politically Incorrect Forum and Its Effects on the Web
America's global image remains more positive than China's. Pew Research
When "model minorities" become "yellow peril" - Othering and the racialization of Asian Americans in the COVID-19 pandemic
Sacramento's ChinaTown Mall. 1882 - american sinophobia, the chinese exclusion act and "the driving out"
Efficient Estimation of Word Representations in Vector Space
Distributed Representations of Words and Phrases and their Compositionality
Optimizing Semantic Coherence in Topic Models
"And We Will Fight For Our Race!" A Measurement Study of Genetic Testing Conversations on Reddit and 4chan
China pledges neutrality - unless us strikes north korea first
A leak containing a register with the details of nearly two million ccp members
North korea says it will nuke u.s. at first sign of pre-emptive strike
Exploring U.S. Shifts in Anti-Asian Sentiment with the Emergence of COVID-19
Poll: Asian-americans see individuals' prejudice as big discrimination problem
Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board
How anti-chinese sentiment is spreading on social media
Ethnicity and National Origin-based Discrimination in Social Media and Hate Crimes Across 100 US Cities. AAAI
Do Platform Migrations Compromise Content Moderation? Evidence from r/The_Donald and r/Incels
"Go eat a bat, Chang!": On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19
User comments for the name chyna
The New York Times. How anti-asian activity online set the stage for real-world violence
Reddit claims 52 million daily users, revealing a key figure for social-media platforms
Visualizing Data using t-SNE
Timeline: Who's covid-19 response
-2020 hong kong protests
Covid-19
Covid-19 lockdown in china
Inauguration of joe biden
Territorial disputes in the south china sea
Timeline of the 2019-2020 hong kong protests
Timeline of the 2019-2020 hong kong protests
United states withdrawal from the paris agreement
Understanding Self-Narration of Personally Experienced Racism on Reddit
A Quantitative Approach to Understanding Online Antisemitism
Sandeep Soni, and Srijan Kumar. Racism is a Virus: Anti-Asian Hate and Counterhate in Social Media during the COVID-19 Crisis. CoRR abs
This work is partially funded by the Helmholtz Association within the project "Trustworthy Federated Data Analytics" (TFDA) (funding number ZT-I-OO1 4) and supported by the National Science Foundation (grant number 2046590).