Word Clustering for Historical Newspapers Analysis

Lidia Pivovarova, Jani Marjanen, Elaine Zosa
University of Helsinki
firstname.lastname@helsinki.fi

Abstract

This paper is part of a collaboration between computer scientists and historians aimed at the development of novel methods for the analysis of historical newspapers. We present a case study of ideological terms ending with the suffix -ism in nineteenth-century Finnish newspapers. We propose a two-step procedure to trace differences in word usage over time: training diachronic embeddings on several time slices and then clustering the embeddings of selected words together with their neighbours to obtain historical context. The obtained clusters turn out to be useful for historical studies. The paper also discusses specific difficulties related to the development of historian-oriented tools.

1 Introduction

Large corpora of historical newspapers are now digitized and available for automatic processing. Newspapers have long been important sources of information for historians and social scientists, but massive digitization opens the possibility of applying advanced statistical and NLP methods to historical newspapers. Even though news as a genre has been well studied in the NLP community, switching to historical news poses additional difficulties for text processing. Automatically digitized news archives contain much noise related to imperfect OCR and article separation, as well as to less standardised writing practices. Many NLP tools, such as POS-taggers and lemmatizers, are optimized for modern texts and work less well on historical data. At the same time, historical news shares most of the properties of modern news data: it is biased, incomplete, controversial and apt to change over time.
If historical news is challenging for linguistic analysis, it is even harder for historical studies, since the research questions historians are trying to answer are complex and lie far beyond fact discovery. Often they are interested in attitudes, stances, viewpoints, and discourse change in general. These tasks require the development of novel methods and instruments oriented specifically at historical research.

We present NewsEye, a research project aimed at the development of novel tools and methods for the analysis of historical newspapers¹. The project is a collaboration between digital humanists and computer scientists funded by the European Union’s Horizon 2020 research and innovation programme.

¹ https://www.newseye.eu/

This paper focuses on a case study of ideological terms ending with the suffix -ism (such as liberalism, socialism, or conservatism) in nineteenth-century newspapers from Finland. These terms, known as isms, are condensed representations of complex notions that played an important role in political discourse in the nineteenth century (and long after that). The rhetorical usage of isms in historical text has been studied before (Kurunmäki and Marjanen, 2018b,a; Marjanen, 2018), though as far as we are aware this is the first attempt to apply statistical analysis to trace the development of these terms in a diachronic newspaper archive.

Not all words ending with -ism are ideological. The suffix is also used for medical terms and diseases (rheumatism), scientific terms (magnetism), personal traits (cynicism), artistic movements (cubism), religions (baptism) or political practices related to particular persons (bonapartism). It is not always possible to draw a strict line between ideologies and other categories. Moreover, the ideological load of these terms might change over time.
We apply a corpus-based analysis to find out how the vocabulary of isms changed in nineteenth-century Finnish newspapers and how the usage of ideological isms differs from that of other words with the suffix -ism. We try to implement a robust analysis procedure that would be applicable to other tasks with minimal human intervention. Our method consists of two main steps: first, we extract from the corpus all words with the suffix -ism; second, we cluster these words and their semantic neighbours in an unsupervised fashion. This procedure requires no human intervention other than the interpretation of results and, consequently, is potentially applicable to other research questions.

2 Data

2.1 Corpora

Newspapers in Finland were published in two main languages, Finnish and Swedish. At the beginning of the nineteenth century the majority of newspapers were published in Swedish, though by the 1880s Finnish and Swedish newspapers were printed in almost equal amounts. The Finnish- and Swedish-language press had a different distribution of topics and exhibited a slightly different political outlook, though contemporaries often relied on newspapers in both languages (Engman, 2016). Another peculiarity of these data is the censorship exercised by the government of the Russian Empire. The censorship was abandoned in 1905, which led to an outburst of socialist rhetoric in the press, especially in the Finnish-language newspapers, since they were more likely to have a rural or working-class background.

We use a digitized collection of nineteenth-century Finnish newspapers freely available from the National Library of Finland (Pääkkönen et al., 2016). We use the full Swedish and Finnish data from 1820 to 1917, treating them as two separate corpora. Each corpus is split into five double-decades. The total number of words in each corpus is presented in Table 1.

In Figure 1 we present relative frequencies for a selection of the most frequent isms in our data.
It can be seen that the proportion of isms grows over time. The plots demonstrate some differences between the datasets: e.g. patriotism is much more frequent in the Swedish dataset.

Time slice     Millions of words
               FINNISH    SWEDISH
1820-1839          1.3       25.5
1840-1859         10.3       77.9
1860-1879         90.6      326.7
1880-1899        805.3      966.9
1900-1917       2439.0      953.0
Total           3346.6     2355.2

Table 1: Corpus size by double decade.

Both corpora are lowercased and lemmatized using LAS, an open-source language-analysis tool (Mäkelä, 2016)². LAS is a meta-analysis tool that provides a wrapper for many existing tools developed for specific tasks and languages. Though LAS supports multiple languages, most of the effort has gone into processing Finnish data, including historical Finnish. The output for our Swedish data is noisier. In particular, the Swedish LAS lemmatizer is unable to predict lemmas for out-of-vocabulary words, e.g. boulangismen (definite form of ‘boulangism’). Thus we apply an additional normalization step and convert all words ending with -ismen or -ismens into -ism forms. For all other words we use the LAS output; implementation of proper Swedish lemmatization is beyond the scope of this paper.

3 Approach

3.1 Diachronic embeddings

We train continuous embeddings (Mikolov et al., 2013) on each double-decade. We use the Gensim Word2Vec implementation (Řehůřek and Sojka, 2010) with the Skip-gram model, a vector dimensionality of 100, a window size of 5 and a frequency threshold of 100: only lemmas that appear more than 100 times within a double decade are used for training. One hundred is an arbitrary and rather conservative threshold that ensures that each word in a model has a reliable amount of context and that the embeddings are trustworthy. On the other hand, we lose some isms because they appear fewer than 100 times in a double-decade.
For instance, patriotism and liberalism appear for the first time in the Swedish corpus in 1791 and 1820 respectively, but the corresponding vectors exist in our models starting from 1820-1839 and 1840-1859 respectively. The number of distinct isms in our models is presented in Table 2.

² https://github.com/jiemakel/las

Figure 1: A selection of the most frequent words ending with the suffix -ism/-ismi: (a) Swedish, (b) Finnish. The x-axis presents relative frequency in items per million.

Since training word embeddings is a stochastic process, the particular values of vectors do not stay close across runs, though distances between words are quite stable. To ensure that embeddings are stable across time slices, we follow the approach proposed by Kim et al. (2014): embeddings for time slice t + 1 are initialized with the vectors built on t; then training continues using the new data. The learning rate is set to the end learning rate of the previous model, to prevent models from diverging rapidly. This approach has been previously used by Hengchen et al. (2019) with slightly different data.

3.2 Clustering

We cluster word embeddings into semantically close groups using the Affinity Propagation clustering technique (Frey and Dueck, 2007). The main advantages of Affinity Propagation are that it detects the number of clusters automatically and is able to produce clusters of various sizes.
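As a rough illustration of how the clustering is embedded in the overall procedure of this section, the sketch below shows the two steps that surround it: selecting words whose cosine similarity to at least one ism exceeds 0.5, and keeping only clusters that contain an ism. It is a minimal pure-Python sketch under stated assumptions, not the implementation used in the experiments: the toy two-dimensional vectors, the Swedish example words, the hand-written cluster labels (standing in for Affinity Propagation output) and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_ism(word):
    """True for words with the Swedish or Finnish suffix (-ism / -ismi)."""
    return word.endswith(("ism", "ismi"))


def select_close_words(vectors, threshold=0.5):
    """Keep every ism plus every word close enough to at least one ism."""
    isms = [w for w in vectors if is_ism(w)]
    selected = set(isms)
    for word, vec in vectors.items():
        if any(cosine(vec, vectors[ism]) > threshold for ism in isms):
            selected.add(word)
    return selected


def clusters_with_isms(labels):
    """Group words by cluster label and drop clusters without any ism."""
    clusters = {}
    for word, label in labels.items():
        clusters.setdefault(label, set()).add(word)
    return {label: words for label, words in clusters.items()
            if any(is_ism(w) for w in words)}


# Hypothetical toy embeddings; real models use 100 dimensions.
vectors = {
    "reumatism": [1.0, 0.1],
    "gikt": [0.9, 0.2],
    "socialism": [0.1, 1.0],
    "arbetare": [0.2, 0.9],
    "parti": [0.5, 1.0],
    "val": [0.6, 0.9],
    "regn": [-1.0, -1.0],   # far from every ism, so it is not selected
}
selected = select_close_words(vectors)

# Hand-written labels standing in for the output of a clustering algorithm.
labels = {"reumatism": 0, "gikt": 0, "socialism": 1,
          "arbetare": 1, "parti": 2, "val": 2}
kept = clusters_with_isms(labels)   # cluster 2 has no ism and is dropped
```

In this toy setting the selection step discards only the unrelated word, while the filtering step discards a whole cluster of selected words, which mirrors the difference between the close and select counts reported for the real data.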
FINNISH
Time slice     ism    close   cluster   select
1820-1839        0        -         -        -
1840-1859        0        -         -        -
1860-1879        1      157         1       12
1880-1899       35     5977        20      442
1900-1917      119     8940        70     1543

SWEDISH
Time slice     ism    close   cluster   select
1820-1839        3      724         3       49
1840-1859       17     1845        12      211
1860-1879       61     5229        31      669
1880-1899      120    12233        54     1320
1900-1917      137    11858        56     1387

Table 2: Number of distinct words used at various steps of the algorithm: ism is the number of distinct words with the suffix -ism, close is the number of words whose cosine similarity to at least one ism is higher than 0.5, cluster is the number of clusters that contain at least one ism, select is the number of words in these clusters.

Affinity Propagation has been previously used for various language analysis tasks, including clustering collocations into semantically related classes (Kutuzov et al., 2017) and unsupervised word sense induction (Alagić et al., 2018). Both papers pay special attention to fine-tuning the algorithm and selecting hyper-parameters. We cannot tune the algorithm due to the lack of a gold standard, which is typical for exploratory historical research. We use the standard implementation from the Scikit-learn package (Pedregosa et al., 2011), with default parameters.

The procedure works as follows. In the data selection step we extract from the corpus all words with a cosine similarity higher than 0.5 to at least one ism. Then we perform clustering on this enriched dataset. Finally, the clusters are filtered so that only clusters that contain at least one ism are presented for qualitative analysis.

The number of words used at the various steps of the analysis is presented in Table 2. It can be seen from the table that the number of isms in the Finnish data is much smaller than in the Swedish data. In particular, in the first two double decades there are no Finnish isms above the frequency threshold.
That could be partially explained by the smaller amount of Finnish newspapers but also by the difference between the languages. The suffix -ismi is not as productive in Finnish and is used mostly with loan words, while Swedish more readily adopts the suffix -ism. In many cases Swedish words ending with -ism are translated into Finnish using native suffixes. For example, Swedish katolicism is translated into Finnish as katolilaisuus. In some cases, two words with the same meaning but different endings existed in the same time period, e.g. protestantismi and protestanttisuus or nationalismi and kansallisuusaate.

It can be seen in the table that, although 0.5 is an arbitrary threshold, up to 90% of the words selected using this threshold are filtered out after the clustering. The number of selected clusters is generally smaller than the number of words with the suffix -ism, since isms tend to cluster together.

4 Results and Observations

One of the main difficulties for our work is the lack of gold standard annotations. We cannot know in advance how the words should be clustered, especially the most problematic ideological terms, which are the main objects of our study. However, we can make several common-sense assumptions about the expected outcome. For example, it would be reasonable to expect that disease names should not appear in the same cluster as philosophical concepts or that artistic movements should be clustered together. In this section we present several observations, starting with those that can be considered “sanity checks” for the clustering.

Rheumatism

In the nineteenth century rheumatism was often mentioned in medical advertisements. Automatic advertisement filtering in historical news is not a trivial task, since advertisements were less regulated, contained more text and looked similar to other articles. Moreover, such filtering is not always necessary, since advertisements might provide researchers with valuable insights³.
We use the entire corpora to build embeddings, and as a consequence rheumatism is one of the most frequent words with the suffix -ism in our data, as can be seen in Figure 1 (for the Swedish data we sum up counts for the spelling variants reumatism and rheumatism). Table 3 shows all clusters from our Finnish data that contain words related to rheumatism.

³ See, for example, a recent blog post analyzing gender stereotypes in nineteenth-century drug advertisements: https://www.newseye.eu/blog/news/british-drug-advertising-in-the-19th-century-through-the-prism-of-gender/

1880-1899 1900-1917
reumatismi ‘rheumatism’; vähäverisyys ‘anaemia’; risatauti ‘lymphadenitis’; veripuute ‘anaemia’; heillou ‘weakness?’ocr; luuvalo ‘gout’; nivelreumatismi ‘arthritis’; epämuodostuma ‘deformity’; kohju ‘hernia’; luumalo ‘gout’ocr; kroonillinen ‘chronic’; mahatauti ‘gastroenteritis’; mahakatarri ‘gastritis’; iskä ‘?’; latus ‘?’; suolitauti ‘salt deposits’; riisitauti ‘rickets’; hermovaiva ‘nerve ailment’; liikavarvas ‘callus’; verenvähyys ‘anaemia’; ruumisvika ‘body problem’; veritauti ‘blood disease’; kihti ‘gout’; lihavuus ‘obesity’; kaljupäisyys ‘baldness’; verettömyydä ‘verettömyydä’; säilöstystauti ‘canning disease’; heikkohermoisuus ‘neurasthenia’; lihanen ‘obese’; sukupuoli- ‘sex/gender’ocr; jalkahiki ‘foot odor’; sappitauti ‘biliary disease’; heitlous ‘weakness’ocr; selkäydintauti ‘spinal cord disease’; kivuton ‘painless’; hermoheikkous ‘neurasthenia’; ruokasulatushäiriö ‘digestion problem’; reumatillinen ‘rheumatic’; kalvetustauti ‘anaemia’; vinous ‘skewness’; tautitila ‘disease place’; reumaatillinen ‘rheumatic’; vähäverinen ‘anaemic’; epämuodostua ‘to deform’; hermosairaus ‘neuropathy’; reumatismi ‘rheumatism’; hiustauti ‘hair disease’; jäsensärky ‘limb ache’; hermo ‘nerve’; oxygeno ‘?’; vatsakatar ‘gastritis’; umpitauti ‘constipation’; nuha ‘rhinitis’; hermotautinen ‘neurotic’; topioli ‘?’; kurkkukatarri ‘pharyngitis’; parannuskeino ‘remedy’; hoitokeino ‘cure’; spirosiini ‘spirosin’; lazarol ‘lazarol’; lääkitä ‘to medicate’; kotilääke ‘home medicine’; reumaattinen ‘rheumatic’; hammastauti ‘tooth disease’; rautaliuos ‘iron care’; jäsenkolotus ‘limb ache’; leini ‘rheumatism’; linjamentti ‘ointment’; parannusaine ‘betterment’; vilustuminen ‘cold’; luuvalo ‘gout’; latsaro ‘?’; hengityselimettauti ‘respiratory disease’

Table 3: Clusters containing Finnish words related to rheumatism. Original words are presented in italics together with English translations in quotes. The mark ocr means that the word is incorrectly spelled due to OCR errors; “?” means “impossible to translate”; these are mostly fragments of words appearing due to OCR errors. Bottom left: an advertisement for a rheumatism medicine from Hufvudstadsbladet, 01.03.1912, no. 59, p. 15.

It can be seen that rheumatism does not interfere with other isms: the clusters consist entirely of words related to drugs, medical procedures, diseases and other physical conditions, such as baldness or obesity. In that sense the clusters are rather precise and justify our algorithmic decisions.

On the other hand, the clusters may be too fine-grained for our needs. In the 1900-1917 double-decade there are two clusters with similar meaning: one related to reumatismi ‘rheumatism’, another to nivelreumatismi ‘(rheumatoid) arthritis’. Very similar results were obtained on the Swedish data: reumatism ‘rheumatism’ and ledgångsreumatism ‘arthritis’ are split into different clusters even though the spelling variants rheumatism and reumatism are clustered together.

We suggest that the fine-grained clustering does not as such reflect semantic differences; rather, the differences in distribution come from slightly different uses in the newspapers.
While there are similarities, it seems that rheumatism appears more often in medical advertising, whereas arthritis seems more likely to appear in text content with a more ambitious take on educating the public about medical issues.

Spiritism

In Table 4 we present clusters obtained from the Swedish data that contain the word spiritism. The cluster for the 1860-1879 double decade contains a few words related to this popular practice, such as pressensé and kabal, though most of its content consists of names of famous scientists and writers. This might be an error: some of the names might belong to persons who were discussed in the context of spiritism (as objectors to spiritism or as scientific authorities), e.g. Aristotle or Galileo, and others are words that are similar to these names. In other words, spiritism might be an outlier in this cluster. It might also be the case that spiritism was sometimes used in the sense of ‘spiritualism’ and that Darwin and the others were discussed in this context. This would require further analysis.

The clusters for the later double-decades do not exhibit such problems and consist mostly of words clearly related to spiritism, including some very specific terms, such as transmigration, and more general esoteric concepts, such as theosophy or freemasonry. The 1880-1899 cluster might also reflect a contemporary discussion on the relations between science and mysticism, since it contains such isms as positivism or darwinism.

Separatism

Separatism is a trickier concept, which undergoes a noticeable usage change in our datasets, as can be seen in Table 5, where we present clusters for Swedish separatism.
1860-1879 1880-1899 1900-1917 spiritism ‘spiritism’ spiritism ‘spiritism’ teosofi ‘theosophy’ spiritism ‘spiritism’ hypnotism ‘hypnotism’ pressensé ‘presence’ (Fr) frimureri ‘freemasonry’ feder ‘?’ andevärld ‘spirit world’ teosofisk ‘theosophic’ pater ‘pater’ voltaire ‘Voltaire’ mysterium ‘mystery’ spiritualism ‘spiritualism’ spiritistisk ‘spiritualistic’ telepati ‘telepathy’ darwin ‘Darwin’ renan ‘Renan’ darwinism ‘darwinism’ positivism ‘positivism’ själavandring ‘transmigration’ zola ‘Zola’ newton ‘Newton’ buddism ‘Buddhism’ darvinism ‘darvinism’ trolleri ‘magic’ journalism ‘journalism’ balzac ‘Balzac’ michelet ‘Michelet’ vegetarianism ‘vegetarianism’ astrologi ‘astrology’ ockult ‘occult’ astrologisk ‘astrological’ galilei ‘Galileo’ corneille ‘Corneille’ teosofisk ‘theosophic’ bibelkritik ‘Bible criticism’ astrologi ‘astrology’ frimureri ‘freemasonry’ aristoteles ‘Aristotle’ kabal ‘cabal’ metafysik ‘metaphysics’ teosofien ‘theosophy’ gondiagnos ‘eye diagnosis’ alkemi ‘alchemy’ oppert ‘Oppert’ rousseau ‘Rousseau’ darvin ‘Darvin’ darvins ‘Darvin’ clairvoyance ‘clairvoyance’(Fr) proudhon ‘Proudhon’ zolas ‘Zola’ utvecklingslära ‘evolution’ malthus ‘Malthus’ tankeläsning ‘mind reading’ quand ‘when’ (Fr) loyson ‘Loyson’ själavandring ‘transmigration’ tungomlstalande ‘tongues’ Table 4: Clusters containing Swedish word spiritism. 
1860-1879 1880-1899 1900-1917 separatism ‘separatism’ separatism ‘separatism’ rent ‘?’ separatism ‘separatism’ riksid ‘national idea’ocr mysticism ‘mysticism’ naturalism ‘naturalism’ finskhet ‘Finnishness’ fennomanins ‘Fennomania’ statsid ‘state idea’ocr rikspolitik ‘national policy’ darwinism ‘darwinism’ moral ‘morality’ fennomani ‘Fennomania’ svenskhet ‘Swedishness’ bourgeoisins ‘bourgeoisie’ byråkratien ‘bureaucracy’ tidsanda ‘zeitgeist’ krass ‘crass’ utopi ‘utopia’ fennomanin ‘Fennomania’ vikingaparti ‘Viking party’ samhällsopinion ‘social opinion’ materialistisk ‘materialistic’ otro ‘incredible’ språkpolitik ‘language policy’ publicistisk ‘publishing’ sträfvandenas ‘?’ rikskomplex ‘national complex’ rationalistisk ‘rationalistic’ wantro ‘?’ partiagitation ‘party agitation’ partiyra ‘?’ nationalitet- ‘national’ocr santryska ‘true Russian’ menniskonaturen ‘human nature’ tidehvarfvets ‘?’ partifanatism ‘party fanaticism’ ämbetsmannavälde ‘officialdom’ materialism ‘materialism’ materialist ‘materialistic’ språkgräl ‘language quarrel’ gränsmärke ‘borderline’ gränsmark ‘borderline’ocr konservatism ‘conservatism’ språkfanatism ‘language fanaticism’ riksenhet ‘national assembly’ idealism ‘idealism’ rationalism ‘rationalism’ språkfråga ‘language question’ samhällskraft ‘social force’ statlighet ‘statehood’ negation ‘negation’ abstraktion ‘abstraction’ spräkfrägan ‘language question’ frihetssträvande ‘freedom-aspiring’ wäldets ‘?’ idealistisk ‘idealistic’ ljusskygghet ‘photophobia’ riksmakt ‘national power’ själfhärskarmakten ‘?’ Table 5: Swedish clusters containing word separatism 1880-1899 1900-1917 separatismi ‘separatism’ ruotsi-kiihkoinen ‘Svekoman’ ruotsinmielinen ‘Swedish-minded’ separatismi ‘separatism’ ruotsalaisuus ‘Swedishness’ viikinki ‘Viking’ ruotsi-mielinen ‘Swedish-minded’ nationalismi ‘nationalism’ natsionalismi ‘nationalism’ fennomaani ‘Fennoman’ epäkansallinen ‘anti-national’ viikingit ‘Vikings’ opportunismi ‘opportunism’ natfionalismi 
‘nationalism’ocr separatisti ‘separatist’ ruotsikko ‘Swedish’(person) miikinki ‘Viking’ocr pöppö ‘?’ eristäytyminen ‘isolation’ kansalliskiihko ‘nationalism’ miikingit ‘Vikings’ocr suomimielinen ‘Finnish-minded’ ruotsi-mielisyys ‘Swedish-mindedness’ intelligens ‘intelligence’ länsieurooppalainen ‘Western-European’ wiitinki ‘Viking’ocr wiilinki ‘Viking’ocr miitinki ‘Viking’ocr ruotsimielinen ‘Swedish-minded’ rotutaistelu ‘race fight’ vapaamielisyy ‘liberalism’ocr suomi-kiihkoinen ‘Fennoman’ fennoman ‘Fennoman’ henkiheimolainen ‘soul mate’ sanomalehdistö! ‘press’ antipatia ‘antipathy’ dagbladilainen ‘member of the Dagblad circle’ miiking ‘Viking’ocr fennomani ‘Fennoman’ kansallinenviha ‘national anger’ kiihkokansallisuus ‘nationalism’ wiiking ‘Viking’ocr fennomaaninen ‘Fennoman’ ruotsikiihkoisuus ‘Svekomania’ eristäytyä ‘self-isolate’ liittolaisuus ‘alliance’ wiilinli ‘Viking’ocr miikinkilehti ‘Vikings’ newspaper’ocr suomenmielinen ‘Finnish-minded’ocr vihamieli-syy ‘hostility’ocr kansallinenylpeys ‘national pride’ miikinkiläinen ‘Vikingish’ocr ruolsinmielinen ‘Swedish-minded’ ruotsiliihloinen ‘Svekoman’ocr kielipolitiikka ‘language policy’ herranenluokka ‘?’ miikingilehti ‘Vikings’ newspaper’ocr epälansallinen ‘anti-national’ocr kansallinenliike ‘national movement’

Table 6: Finnish clusters containing the word separatismi.

Most of the words in the 1860-1879 cluster are religious, philosophical or scientific notions; thus we can assume that the cluster presents a religious context of separatism. The 1880-1899 cluster contains a completely different set of words, including references to specific political entities, such as the Fennoman movement, and rather emotional expressions, such as agitation or fanaticism. These words are related to a contemporary discussion about national identity and national language. The 1900-1917 cluster is again different from the previous two and contains more general political lexis.
Thus, we can suggest that at the beginning the notion of separatism had a mostly religious meaning, then it was adopted by a limited number of liberals, and finally it spread into a more general political discourse.

The Finnish clusters for separatismi, presented in Table 6, are quite similar to the Swedish ones. The main difference is that in 1860-1879 the word is mentioned fewer than 100 times and is consequently excluded from our models. But the 1880-1899 and 1900-1917 Finnish clusters follow the same pattern: the former contains quite specific references, while the latter consists of more general political words.

The change in the distribution of separatism seems to be related to a change in the dominant context in which it was discussed (from a religious context to a political one). This also entails some degree of semantic change.

This contextual and semantic shift is to some extent visible in the changes in the nearest neighbours of separatism presented in Figure 2a. However, nearest neighbours produce a vaguer overview: for example, religious isms, such as pietism, are present among the nearest neighbours of separatism in 1860-1879. Similarly, the overlap between the Finnish clusters, shown in Table 6, and the nearest neighbours of separatismi, presented in Figure 2b, is very limited.

Figure 2: t-SNE plot of the word separatism and its nearest neighbours across time slices: (a) Swedish, (b) Finnish.

This can be explained by the nature of the clustering procedure: each word can be among the nearest neighbours of any number of other words, while Affinity Propagation assigns a word to exactly one cluster, so that socialism and katolicism are separated into clusters of their own. The difference between the outputs demonstrates an added value of the clustering, which commits to a single grouping of words among the many possibilities provided by the embeddings. At the same time, this also means a loss of information, especially for polysemous words.
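The contrast between overlapping neighbour lists and exclusive cluster membership can be illustrated with a small pure-Python sketch. It rests on several assumptions that are not part of our experiments: the two-dimensional vectors are invented, the two exemplars are chosen by hand, and assigning each word to its most similar exemplar is only a simplified stand-in for Affinity Propagation, which selects exemplars itself through message passing. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=2):
    """Return the k words most similar to `word` (excluding itself)."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return others[:k]


def assign_to_exemplars(vectors, exemplars):
    """Hard assignment: every word goes to its single most similar exemplar."""
    return {word: max(exemplars, key=lambda e: cosine(vec, vectors[e]))
            for word, vec in vectors.items()}


# Invented toy vectors: one "religious" and one "political" direction.
vectors = {
    "separatism": [1.0, 0.8],
    "pietism": [1.0, 0.2],
    "kyrka": [1.0, 0.0],
    "fennomani": [0.2, 1.0],
    "parti": [0.0, 1.0],
}

# separatism shows up in the neighbour lists of both contexts ...
religious_nn = nearest_neighbours("pietism", vectors)
political_nn = nearest_neighbours("fennomani", vectors)

# ... but a hard assignment places it in exactly one group.
assignment = assign_to_exemplars(vectors, ["pietism", "fennomani"])
```

Here separatism is a neighbour of both pietism and fennomani, yet it ends up only in the group around pietism, so its political associations are no longer visible in the clustered output.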
5 Conclusion and Further Work

We presented our ongoing work aimed at the implementation of tools facilitating historical studies of newspaper archives. We proposed an unsupervised procedure to trace differences in word usage over time. The procedure consists of two major steps: training diachronic embeddings and then clustering the embeddings of selected words together with their neighbours to obtain historical context. In this paper we applied this procedure to a group of words ending with the suffix -ism.

The method allowed us to distinguish ideological terms, such as socialism, from other words with the same suffix, such as disease names or scientific terms. This promising result suggests that it is worthwhile to elaborate the proposed method further.

At this stage of the work we are unable to draw any clear conclusions about the usage of isms in nineteenth-century Finland. Clusters that contain ideological words are the most problematic to interpret, which is not surprising given the complex nature of the underlying concepts. Nevertheless, we consider the obtained clusters useful for historical studies, since they provide a researcher with a condensed representation of word usage in a large corpus. This is a novel way to look at historical data, which might be especially useful in combination with other tools such as named entity recognition or topic modelling.

Further improvements of the method should address both parts, namely embeddings and clustering. We plan to try building word embeddings that are continuous in time (Dubossarsky et al., 2019; Gillani and Levy, 2019; Rosenfeld and Erk, 2018; Yao et al., 2018), which would allow us to investigate gradual semantic shifts rather than splitting the data into discrete time slices. Improvement of the clustering might include fine-tuning of the algorithm parameters, though this is quite hard to do without manually annotated data.
Thus, our main focus will be on finding other applications for the proposed procedure that are both meaningful from a historical research point of view and easy to assess. We will also continue the development of complex instruments for historical news analysis that utilize clustering techniques together with other automatic text analysis methods.

Acknowledgements

We are grateful to Simon Hengchen and Mark Granroth-Wilding for their help with data preparation. This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grants 770299 (NewsEye) and 825153 (EMBEDDIA).

References

Domagoj Alagić, Jan Šnajder, and Sebastian Padó. 2018. Leveraging lexical substitutes for unsupervised word sense induction. In Thirty-Second AAAI Conference on Artificial Intelligence.

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, and Dominik Schlechtweg. 2019. Time-out: Temporal referencing for robust modeling of lexical semantic change. In The 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Max Engman. 2016. Språkfrågan: Finlandssvenskhetens uppkomst 1812-1922. Svenska litteratursällskapet i Finland.

Brendan J Frey and Delbert Dueck. 2007. Clustering by passing messages between data points. Science 315(5814):972–976.

Nabeel Gillani and Roger Levy. 2019. Simple dynamic word embeddings for mapping perceptions in the public sphere. In NAACL HLT 2019. page 94.

Simon Hengchen, Ruben Ros, and Jani Marjanen. 2019. A data-driven approach to the changing vocabulary of the nation in English, Dutch, Swedish and Finnish newspapers, 1750-1950. In Proceedings of the Digital Humanities (DH) conference 2019, Utrecht, The Netherlands.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. ACL 2014 page 61.

Jussi Kurunmäki and Jani Marjanen. 2018a. Isms, ideologies and setting the agenda for public debate. Journal of Political Ideologies 23(3):256–282. https://doi.org/10.1080/13569317.2018.1502941.

Jussi Kurunmäki and Jani Marjanen. 2018b. A rhetorical view of isms: An introduction. Journal of Political Ideologies 23(3):241–255.

Andrey Kutuzov, Elizaveta Kuzmenko, and Lidia Pivovarova. 2017. Clustering of Russian adjective-noun constructions using word embeddings. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. pages 3–13.

Eetu Mäkelä. 2016. LAS: an integrated language analysis tool for multiple languages. The Journal of Open Source Software 1.

Jani Marjanen. 2018. Ism concepts in science and politics. Contributions to the History of Concepts 13(1).

Tomas Mikolov, Kai Chen, Greg S Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In NIPS.

Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen, and Eetu Mäkelä. 2016. Exporting Finnish digitized historical newspaper contents for offline use. D-Lib Magazine 22(7/8).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, pages 45–50.

Alex Rosenfeld and Katrin Erk. 2018. Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pages 474–484.

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. 2018. Dynamic word embeddings for evolving semantic discovery. In The 11th ACM International Conference on Web Search and Data Mining.