Exploring the linguistic landscape of geotagged social media content in urban environments

Tuomo Hiippala
Department of Languages, University of Helsinki, Finland, Digital Geography Lab, University of Helsinki, Finland and Helsinki Institute of Sustainability Science, University of Helsinki, Finland

Anna Hausmann, Henrikki Tenkanen, and Tuuli Toivonen
Digital Geography Lab, University of Helsinki, Finland, Department of Geography and Geosciences, University of Helsinki, Finland and Helsinki Institute of Sustainability Science, University of Helsinki, Finland

Abstract

This article explores the linguistic landscape of social media posts associated with specific geographic locations using computational methods. Because physical and virtual spaces have become increasingly intertwined due to location-aware mobile devices, we propose extending the concept of linguistic landscape to cover both physical and virtual environments. To cope with the high volume of social media data, we adopt computational methods for studying the richness and diversity of the virtual linguistic landscape, namely, automatic language identification and topic modelling, together with diversity indices commonly used in ecology and information sciences. We illustrate the proposed approach in a case study covering nearly 120,000 posts uploaded on Instagram over 4.5 years at the Senate Square in Helsinki, Finland. Our analysis reveals the richness and diversity of the virtual linguistic landscape, which is also shown to be susceptible to continuous change.
Correspondence: Tuomo Hiippala, University of Helsinki, P.O. Box 24, 00014, Finland. E-mail: tuomo.hiippala@helsinki.fi

Digital Scholarship in the Humanities, Vol. 34, No. 2, 2019, pp. 290–309. doi:10.1093/llc/fqy049. Advance Access published on 1 October 2018.

© The Author(s) 2018. Published by Oxford University Press on behalf of EADH. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Author ORCID iDs: http://orcid.org/0000-0002-8504-9422, http://orcid.org/0000-0002-9639-9532, http://orcid.org/0000-0002-0918-4710, http://orcid.org/0000-0002-6625-4922

1 Introduction

Staying connected to social media has become an inseparable aspect of everyday life for many. This kind of constant connectedness is enabled by mobile devices, such as smartphones and tablet computers, which allow users to create and share content and to maintain personal relationships while being on the move (Deumert, 2014b; Baym, 2015). Mobile devices are also increasingly aware of their geographic location due to widespread adoption of positioning technology in consumer electronics (Kellerman, 2010). Consequently, many social media platforms now allow and explicitly encourage users to anchor the content they create to specific geographic locations. This practice, known as geotagging, provides social media platforms with information about the mobility of their users, which can be used for targeting advertisements and profiling their consumer preferences.

Geotagged social media content also holds potential for sociolinguistic inquiry. In this article, we adopt the term virtual linguistic landscape, which Ivkovic and Lotherington (2009) coined for discussing multilingualism on the web, to describe the languages present in geotagged social media content posted from a specific geographic location. We propose that the virtual linguistic landscape may be considered an extension of the physical linguistic landscape in the built environment.

To explore the characteristics of virtual linguistic landscapes, we analyse nearly 120,000 posts uploaded on Instagram from the Senate Square in Helsinki, Finland, over a period of 4.5 years. We seek to answer the following research questions:

(1) How can virtual linguistic landscapes be characterized in terms of their linguistic richness and diversity?
(2) How do virtual linguistic landscapes change over time?

Given the high volume of data, we adopt methods from the field of natural language processing, namely, automatic language identification and topic modelling. To measure linguistic richness and diversity, we use established indices from the fields of ecology and biology, which have been previously applied to the study of linguistic landscapes (Peukert, 2013; Manjavacas, 2016). We also perform temporal analyses at various timescales to examine changes in the virtual linguistic landscape. We do not, however, seek to compare or make claims about the respective characteristics of virtual and physical linguistic landscapes (cf. Deumert, 2014a, pp. 117–18).
Instead, we aim to develop methods for studying high volumes of geotagged social media content, setting the stage for approaches involving mixed methods, which are ultimately necessary for achieving a comprehensive view of virtual linguistic landscapes.

2 Physical Places and Virtual Spaces

Androutsopoulos (2014) has observed that new sources of data for sociolinguistic inquiry are currently emerging at the intersection of research on computer-mediated communication (CMC) and linguistic landscapes. Whereas CMC covers private and public communication in digital media, such as social media platforms, discussion forums, and email, the research on linguistic landscapes focuses on “signs and other artifacts in public space” (Androutsopoulos, 2014, p. 75, our emphasis). These definitions may reflect an emerging division of work between the aforementioned domains of sociolinguistic research, as the study of linguistic landscapes has traditionally focused on built environments, covering various locations ranging from tourist attractions (Bruyèl-Olmedo and Juan-Garau, 2015) to transportation hubs (Soler-Carbonell, 2016) and various media from billboards to shop signs (Gorter, 2013).

At the same time, the broader notion of public space, which Androutsopoulos (2014) assigns to the domain of linguistic landscapes, has been and continues to be transformed by digital technology in the form of both hardware and software (Dodge and Kitchin, 2005). In the field of human geography, one of the leading theorists of this transformation is Aharon Kellerman (see Kellerman, 2010, 2016), who has argued that mobile devices have enabled the emergence of a “double space” of intertwined physical and virtual spaces (see also Zook and Graham, 2007). This double space now increasingly envelops its subjects, as access to the virtual space is no longer restricted by limitations arising from static hardware in the physical space, such as desktop computers.
Due to the increased potential for spatial mobility, this double space can now fill or support many basic human needs, including those originally defined by Abraham Maslow (Kellerman, 2014). For example, needs pertaining to esteem, such as status and reputation, are increasingly formed in virtual spaces (Kellerman, 2014, p. 542). Kellerman (2010, p. 2993) identifies multiple connections between the physical and virtual spaces, which are grouped along several dimensions: organization, or how such spaces are structured; movement, or the connections between spaces; and users, who populate these spaces. Two specific connections warrant further attention, namely, the convergence of physical and virtual places, and the languages encountered in virtual spaces, as both shape the virtual linguistic landscape.

First, Kellerman (2010, p. 2993) proposes that locations defined in the virtual space tend to converge with their ‘real’ counterparts in physical space. This tendency is also evident in user-generated social media content. To exemplify, visual content on social media platforms for mobile photography, such as Instagram, has been suggested to serve the purpose of mediating the user’s presence or activities at some specific physical location (Villi, 2015). Alternatively, geographic information such as place names may be provided linguistically in the caption and/or in hashtags accompanying the visual content. The most accurate form of geographic information, however, is produced by location-aware devices, which are now widely available to consumers through smartphones (Kellerman, 2010, p. 2997).
Together, the combination of new communicative practices and technological infrastructure may be suggested to drive the convergence of physical and virtual spaces.

Second, in terms of their linguistic characteristics, Kellerman (2010) suggests that physical spaces are characterized by domestic languages, whereas virtual spaces are dominated by English due to their international orientation. Lee (2017, p. 16) has observed that assumptions about the dominance of English in virtual spaces have been common among both academic and popular audiences ever since the Internet became widely used. Yet measuring the actual linguistic diversity of virtual spaces remains a challenge (Paolillo, 2007), which is also affected by how such virtual spaces are defined and delimited (Leppänen and Peuronen, 2012). However, the current consensus seems to be that languages other than English are becoming increasingly prominent on the Internet (Lee, 2016, p. 118).

In virtual spaces, the linguacultural make-up of users has the potential to be extremely diverse, because online interactions do not require physical presence, but allow participation from a distance, as illustrated in Fig. 1. Moreover, users may choose to use different languages for different audiences (Androutsopoulos, 2015). It is also important to acknowledge that online interactions can be asynchronous and unfold over longer periods of time. Moreover, not all social media content is necessarily created at the time of upload, as exemplified by the practice of posting content related to previous events under hashtags such as #throwback. Similarly, the content associated with a specific virtual location need not necessarily have been created at the actual physical location.
Acknowledging the possibility of such temporal and spatial discrepancies, we build on the work of Kellerman (2010, 2014, 2016) and propose that geotagged social media posts anchored to a specific geographic location act as an extension of the linguistic landscape of the corresponding physical environment. This extension is enabled by the double space, which encompasses both physical and virtual spaces, assisted by technologies such as satellite positioning. However, unlike signs and other objects found in the physical environment, social media posts cannot take a material form (although augmented reality may eventually allow them to be represented in physical space, cf. Allen et al., 2018), as they exist on platforms in the virtual space, which may be accessed using any device capable of doing so, either from the actual location or from a distance.

Like urban spaces in general, linguistic landscapes are dynamic and sensitive to social and economic changes (Gorter and Cenoz, 2015). As Papen (2012) has shown, changes in the physical linguistic landscape may take place over longer timescales, occasionally spanning decades or more. The virtual linguistic landscape, in turn, may be more sensitive to short-term changes due to the immateriality of digital content. In addition, the use and status of a physical location are likely to influence its virtual linguistic landscape in geotagged social media, because these attributes may be expected to be carried over from the physical space to the virtual space. To draw on an example, the linguistic landscapes of tourist attractions, landmarks, or transportation hubs may be expected to be diverse due to their cultural value or role in the transportation network (Bruyèl-Olmedo and Juan-Garau, 2015; Soler-Carbonell, 2016). With these points in mind, we now turn our attention towards the data collected from the Senate Square and the methods applied to its analysis.

3 Social Media Data and Computational Methods

3.1 Data and location

We collected data from Instagram,1 a social media platform for sharing photographs and short videos, using the platform’s application programming interface (API). In total, we collected 117,418 posts uploaded by 74,051 unique users between 4 July 2013 and 11 February 2018, that is, over a period of roughly 4.5 years. As illustrated in Fig. 2, each geotagged post on Instagram is associated with a specific location pre-defined on the platform, which means the geographic coordinates of an individual data point do not provide GPS-level accuracy, unlike on some other platforms, such as Twitter and Flickr.

Fig. 1 A fictional example showing how (1) two Finnish users at the Senate Square speak Finnish with each other, but one of them posts a photograph with an English caption on Instagram, having a number of international users in her social network. (2) Associating the photograph with the location named Helsinki Cathedral allows a German user who searches for content from Helsinki to discover the photograph. (3) Despite physical distance, German users can interact with the content and each other, contributing to the virtual linguistic landscape of the Senate Square. Each step in this chain of events involves language choices, which all contribute to the virtual linguistic landscape

Instead, the geographic coordinates associated with an Instagram post refer to what is commonly termed a point-of-interest (POI) in the field of geoinformatics (Hochmair et al., 2018). Instagram POIs are provided by the parent company, that is, Facebook. The response to any spatial query is therefore restricted to content associated with a POI on the platform. In our case, each post retrieved for the study was geotagged to a POI located within a 150-m radius from the point 60.169444 latitude and 24.9525 longitude (WGS-84), which lies at the centre of the Senate Square in downtown Helsinki, Finland.

We chose the location due to its status as a cultural landmark and a tourist attraction, which are likely to be reflected in its virtual linguistic landscape. Overlooked by the Lutheran Cathedral and surrounded by the main building of the University of Helsinki and the Government Palace, the Senate Square and its neoclassical architecture are widely recognized as one of the most important landmarks in Helsinki and in all of Finland. The Lutheran Cathedral, in particular, which is shown in Fig. 2, is often used as a symbol for the city of Helsinki (Jokela, 2014). In addition to its role as a tourist attraction, the Senate Square serves as a venue for different events, ranging from concerts and festivals to protests and demonstrations.

3.2 Identifying the language of social media content

Like many other forms of digital data, geotagged social media content may be characterized as ‘big’ due to its high volume, velocity, and variety (Kitchin, 2013). Together, these characteristics present several challenges for the collection, processing, and analysis of social media data. Challenges related to volume and velocity may be met by adopting a programmatic approach, that is, collecting data systematically via an API and processing the data accordingly (see Tenkanen, 2017, p. 22).
For mapping the languages that make up the virtual linguistic landscape, further processing involves automatic language identification, which is an active area of research within the broader field of natural language processing (Zubiaga et al., 2016).

Fig. 2 Social media platforms such as Instagram (1), Twitter (2), and Flickr (3) all allow users to embed geographic metadata into their content at various degrees of accuracy from GPS coordinates to POI locations defined by the platforms

Automatic language identification, however, is not a straightforward task due to the variety of the data, which in this case takes the form of linguistic variation. Much has been written about the language of social media in recent years, revealing variation across different linguistic structures (see Zappavigna, 2013; Seargeant and Tagg, 2014; Hoffman and Bublitz, 2017). On a more practical level, the length of social media posts is typically limited, which encourages the use of abbreviations, non-standard spellings, and other forms of creative language use (Carter et al., 2013, p. 196). Another challenge emerges from the use of hashtags, which are used to affiliate around shared values or topics (Zappavigna, 2011). Hashtags are often written in multiple languages (Barton, 2018; Lee and Chau, 2018), which injects multilingual material into otherwise monolingual texts. The same holds true for usernames on social media platforms.

Each of the aforementioned issues introduces additional challenges to performing automatic language identification. Yet it should be noted that identifying the language of a sentence is not a straightforward task for humans either, due to ambiguous language use or orthographically similar words in multiple languages.
For example, a caption consisting of a single proper noun, such as ‘Helsinki’, may represent Finnish, English, German, or some other language whose vocabulary includes this word, essentially preventing the identification of language.

We evaluated several state-of-the-art frameworks that provide pre-trained models for performing automatic language identification. The libraries considered for the current study are listed in Table 1 and introduced briefly below. The first framework, fastText, relies on word embeddings, which is a technique for learning numerical representations of words in a vocabulary by observing their distribution in their context of occurrence (Bojanowski et al., 2017). The second framework, langid.py, is designed to provide reliable language identification across multiple domains, such as official documents, newspaper articles, and social media messages (Lui and Baldwin, 2012). Finally, the third framework, CLD2 or the Compact Language Detector 2, was originally developed for Google’s Chromium open-source project but has not been documented in a peer-reviewed publication. For this study, we used CLD2 via the polyglot natural language processing library.

All programs developed for this study were written using the Python 3.6.3 programming language, to take advantage of the wide range of libraries available within the Python ecosystem. The libraries used include the Natural Language Toolkit (NLTK; Bird et al., 2009), polyglot, spaCy, and gensim (Rehurek and Sojka, 2010) for natural language processing; scikit-bio for diversity measures; and pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011) for storing and manipulating the data. All code written for this study is made publicly available with an open licence at: https://doi.org/10.5281/zenodo.1404729.
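As an illustration of how captions might be cleaned before they are passed to any of these frameworks, the following minimal sketch follows the preprocessing steps listed in Table 2. It is not the code written for the study, which is available at the address given above. Two simplifications are assumptions made here to avoid third-party dependencies: emojis are dropped by their Unicode category instead of being converted to shortcodes first, and a simple regular expression stands in for the Punkt sentence tokenizer. The function name preprocess_caption is hypothetical.

```python
import re
import unicodedata


def preprocess_caption(caption):
    """Clean a caption and split it into sentence-like units (cf. Table 2)."""
    # Step 2: replace line breaks with whitespace.
    text = caption.replace("\r", " ").replace("\n", " ")
    # Steps 2-3 (simplified, an assumption of this sketch): drop emojis and
    # other pictographs by Unicode category "So" (Symbol, other) instead of
    # converting them to shortcodes and removing those with a regex.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    tokens = text.split()
    # Step 4: remove words beginning with "@" (user mentions).
    tokens = [t for t in tokens if not t.startswith("@")]
    # Step 5: remove words beginning with "#" (hashtags).
    tokens = [t for t in tokens if not t.startswith("#")]
    # Step 6: remove words with no letter or digit, such as the smiley ":-)".
    tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    text = " ".join(tokens)
    # Step 7: shorten runs of the same punctuation mark to a single character.
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Step 8 (simplified, an assumption of this sketch): split after
    # sentence-final punctuation; a stand-in for the Punkt tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if s.strip()]


example = ("Great weather in Helsinki!!! On holiday with @username. :-) "
           "#helsinki #visitfinland \U0001F913")
print(preprocess_caption(example))
# → ['Great weather in Helsinki!', 'On holiday with']
```

Each resulting sentence would then be handed to a language identification framework one at a time. Working on whitespace-delimited words, as Table 2 describes, means that punctuation attached to a mention or hashtag is removed together with it.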
3.3 Evaluating language identification frameworks

To evaluate how the language identification frameworks introduced above perform on our data, we created a ground truth by randomly sampling 1,476 captions from the data without replacement. We then applied the preprocessing steps described in Table 2 to these captions, extracting a total of 2,011 sentences. Two annotators, namely, the first and the second author, subsequently identified the language of each preprocessed sentence manually. We annotated each language using its ISO-639 code, such as ‘en’ for English, or using multiple codes joined by a + if the sentence featured more than one language, such as ‘en+fi’ for English and Finnish.

Table 1 Language identification frameworks used in the study

Name         Reference                   Number of languages supported
fastText     Bojanowski et al. (2017)    176
langid.py    Lui and Baldwin (2012)      97
CLD2         –                           83

To assess the level of agreement between the two annotators, we used the common metrics for measuring inter-rater agreement surveyed in Artstein and Poesio (2008), such as Fleiss’ κ (0.929), Scott’s π (0.929), and Krippendorff’s α (0.929) as implemented in NLTK (Bird et al., 2009). The average observed agreement between the two annotators was 0.948. Overall, these metrics suggest that the ground truth can be reliably used for evaluating the performance of language identification frameworks, particularly as the manual classification also accounted for code-switching within sentences.
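To make the difference between observed and chance-corrected agreement concrete, the sketch below computes average observed agreement and Scott's π for two annotators who label the same sentences. It is a plain-Python illustration only; the figures reported above were obtained with the implementations in NLTK. The function names and the toy labels are invented for this example.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Share of items to which both annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty items.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def scotts_pi(labels_a, labels_b):
    """Scott's pi: observed agreement corrected for chance agreement."""
    a_o = observed_agreement(labels_a, labels_b)
    # Expected agreement uses one label distribution pooled over annotators.
    pooled = Counter(labels_a) + Counter(labels_b)
    total = 2 * len(labels_a)
    a_e = sum((count / total) ** 2 for count in pooled.values())
    if a_e == 1.0:
        # Only one label was ever used, so chance correction is undefined.
        raise ValueError("Scott's pi is undefined when only one label occurs.")
    return (a_o - a_e) / (1 - a_e)


# Toy annotations: ISO-639 codes assigned by two hypothetical annotators.
annotator_1 = ["en", "en", "fi", "fi"]
annotator_2 = ["en", "en", "fi", "en"]
print(observed_agreement(annotator_1, annotator_2))  # → 0.75
print(scotts_pi(annotator_1, annotator_2))
```

In the toy example the annotators agree on three of four items, yet the chance-corrected value is considerably lower, because with only two labels a fair amount of agreement is expected by chance alone.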
For the final ground truth, we dropped captions whose language we disagreed on, retaining a total of 1,374 captions with 1,863 sentences, which was further reduced to 1,688 by leaving out sentences whose language could not be manually identified or which contained sentence-internal code-switching.

We then evaluated the language identification frameworks against the ground truth and examined whether their performance would improve by excluding sentences with a low character count. fastText and langid.py had a slight advantage over CLD2, as they supported all manually identified languages present in the ground truth, whereas CLD2 did not support Latin. However, the ground truth contained only three sentences in Latin, so this disadvantage should not have a major impact on the performance of CLD2. Table 3 reports the reliability of predictions for each framework at different character thresholds, using Krippendorff’s α to correct for chance agreement. Average observed agreement, or accuracy, is given in parentheses.

As Table 3 shows, the fastText library and its pre-trained model provide superior performance compared to langid.py and CLD2 regardless of the character threshold. langid.py and CLD2 begin to match fastText’s baseline performance only at the threshold of thirty characters or above, which simultaneously involves losing nearly 60% of the data. This trade-off is obviously unacceptable, which is why we chose fastText for automatic language identification.

3.4 Measuring richness and diversity

To measure the richness and diversity of the languages that make up the virtual linguistic landscape, we adopt common indices used in the fields of ecology and information sciences, such as richness, Menhinick’s richness, Berger–Parker dominance, and Shannon entropy.
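For orientation, the four indices named above can be computed from a list of per-sentence language labels as in the minimal sketch below, which uses their standard textbook definitions. It is a plain-Python illustration, not the scikit-bio code used in the study; the function name, the toy labels, and the choice of base 2 for the logarithm in Shannon entropy are assumptions made here.

```python
import math
from collections import Counter


def diversity_indices(labels, base=2):
    """Compute richness and diversity indices for a list of language codes."""
    counts = Counter(labels)
    n = sum(counts.values())  # total number of observations (sentences)
    if n == 0:
        raise ValueError("At least one observation is required.")
    s = len(counts)  # number of distinct languages
    shares = [count / n for count in counts.values()]
    return {
        # Richness: the number of distinct languages observed.
        "richness": s,
        # Menhinick's richness: richness scaled by the square root of sample size.
        "menhinick": s / math.sqrt(n),
        # Berger-Parker dominance: share of the single most common language.
        "berger_parker": max(shares),
        # Shannon entropy: uncertainty about the language of a random sentence.
        "shannon": -sum(p * math.log(p, base) for p in shares),
    }


# Toy sample: four sentences in three languages.
print(diversity_indices(["en", "en", "fi", "ru"]))
```

Richness alone ignores how evenly the languages are distributed, which is why it is paired with a dominance measure and an entropy measure: a landscape with many languages can still be dominated by one of them.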
Peukert (2013) provides a thorough introduction to using these indices to measure linguistic diversity, illustrating their application in a comparison of physical linguistic landscapes in two neighbourhoods in Hamburg, Germany and showing how these indices may be used to measure and compare linguistic diversity across locations. Manjavacas (2016), in turn, applies similar indices to geotagged Twitter posts from Berlin, Germany. Because these indices are relatively new to the study of linguistic landscapes, we introduce them in greater detail in connection with the analyses of linguistic richness and diversity in Section 4.4.

Table 2 The individual steps of the preprocessing strategy were designed to counter common challenges in automatic language identification, such as emojis and smileys, excessive punctuation, multilingual hashtags and usernames, and sentence-level code-switching

1. The original caption includes hashtags, user mentions, and smileys and emojis.
   Example: Great weather in Helsinki!!! On holiday with @username. :-) #helsinki #visitfinland
2. We begin by replacing any line breaks with whitespace and convert the emojis into their corresponding emoji shortcodes, which are wrapped in colons.
   Example: Great weather in Helsinki!!! On holiday with @username. :-) #helsinki #visitfinland :nerd_face_&_sunny_&_passenger_ship:
3. The colons make finding the emojis easy using a regular expression, which we then apply to remove them.
   Example: Great weather in Helsinki!!! On holiday with @username. :-) #helsinki #visitfinland
4. We then remove any words that begin with an @ symbol, which indicates a username.
   Example: Great weather in Helsinki!!! On holiday with :-) #helsinki #visitfinland
5. Next, we remove any hashtags, that is, any words beginning with a #.
   Example: Great weather in Helsinki!!! On holiday with :-)
6. Any remaining non-alphanumeric words in the caption, such as the smiley :-), are then removed using a regular expression.
   Example: Great weather in Helsinki!!! On holiday with
7. Longer sequences of exclamation or question marks (e.g. !!!), full stops, and other kinds of punctuation are shortened to just one of each character (e.g. !).
   Example: Great weather in Helsinki! On holiday with
8. These sequences can confuse the Punkt sentence tokenizer (Kiss and Strunk, 2006), which outputs a Python list containing sentence tokens. These tokens are then fed to the language identification frameworks one at a time.
   Example: ["Great weather in Helsinki!", "On holiday with"]

4 Exploring the Virtual Linguistic Landscape

4.1 Temporal patterns in social media activity

Fig. 3 presents Instagram activity around the Senate Square over 24 h. The figures show the average number of posts and their standard deviation for each hour of the day for four different samples: Fig. 3a shows the hourly frequency of all posts in the data set over 1,681 days, which also includes posts without any linguistic content (n = 117,418). Not surprisingly, this frequency reflects common hours of activity in the city, with approximately four to six posts per hour for daytime and evening hours. During the night, the number falls to roughly two posts per hour. A similar pattern may be observed in Fig. 3b, which only includes posts with captions (n = 102,687).

The pattern changes when choosing different timescales and preprocessing the data for language identification (n = 77,338), as illustrated in Fig. 3c and d, which show the average hourly number of posts for weekdays (n = 1,188) and weekends (n = 478), respectively. Whereas the weekdays show a peak around lunch hours, the activity increases considerably towards the evening during weekends. A D’Agostino–Pearson test showed that none of the hourly observations in Fig.
3c and d follow a normal distribution, which means that the statistical differences between hourly activity may be evaluated using Levene’s test and the Mann–Whitney U-test. For Levene’s test, which compares the variance of samples, the differences were found to be statistically significant for Hours 2 (W = 4.947, P = 0.027), 4 (W = 6.971, P = 0.009), 5 (W = 17.829, P < 0.001), 7 (W = 5.536, P = 0.019), 9 (W = 8.387, P = 0.004), and 16 (W = 7.111, P = 0.008). The Mann–Whitney U-test, which examines the difference in averages, showed a statistically significant difference for Hour 2 (U = 18,043.5, P = 0.025).

This suggests that social media activity is subject to temporal variation, which can be revealed by examining the data on different timescales. In other words, studying the activity at lunch hour during the working week will reveal a different picture than an analysis focusing on the late hours on the weekend. This variation will undoubtedly affect the appearance of the virtual linguistic landscape on the daily scale and beyond.

As a culturally valued landmark and a tourist attraction, the Senate Square also experiences seasonal variation, attracting a higher number of users during the summer months and Christmas holidays, as shown in Fig. 4a. The seasonal pattern becomes increasingly pronounced due to the rapidly growing popularity of Instagram as a social media platform. Fig. 4b, in turn, shows the average number of sentences per day of the week, which reveals increased activity during the weekend.
This trend, however, becomes less pronounced due to loss of data when predictions are filtered using the probabilities provided by fastText, which is visualized using the coloured bands in Fig. 4b. Generally, these probabilities are distributed over the 176 languages supported by fastText and range between 0 and 1, which reflects how confident the framework is about its prediction. Requiring a certain level of confidence, as expressed by the probability associated with a prediction, naturally results in a trade-off between the quality of predictions and volume of data. Including all predictions regardless of their level of confidence is likely to increase the number of errors, as very short sentences force fastText to make uninformed guesses based on limited data.

Table 3 Krippendorff’s α scores for language identification frameworks at different character thresholds for preprocessed sentence length

Framework    No threshold     >10 characters    >20 characters    >30 characters
CLD2         0.845 (0.895)    0.850 (0.899)     0.895 (0.928)     0.961 (0.974)
fastText     0.909 (0.939)    0.919 (0.946)     0.961 (0.974)     0.978 (0.985)
langid.py    0.787 (0.851)    0.799 (0.861)     0.868 (0.908)     0.917 (0.943)
Data loss    0% (0)           17.07% (318)      41.28% (796)      59.85% (1,115)

Note: fastText gives the best result at every threshold. For data loss, the value in parentheses reports the number of sentences lost.

To improve the quality of language identification while preserving the temporal features of Instagram activity at the Senate Square, we exclude predictions that fall into the first decile either in terms of their associated probability (<0.4231) or character length after preprocessing (<10), amounting to a loss of 17.31% of the data. This left us with 90,353 sentences in eighty unique languages posted over 1,662 days for analysing temporal changes in the virtual linguistic landscape.

Fig. 3 The daily ‘pulse’ of the Senate Square on Instagram. The line shows the average number of posts per hour, whereas the area indicates the standard deviation from the average. (a) Average posts per hour for all posts in dataset (n = 117,418) over 1,681 days. (b) Average posts per hour for posts with captions (n = 102,687) over 1,676 days. (c) Average posts per hour during weekdays (Monday to Friday, n = 1,188) for captions whose language could be identified (n = 55,293). (d) Average posts per hour during weekend (Saturday and Sunday, n = 478) for captions whose language could be identified (n = 22,045)

4.2 The distribution of languages over time

We now turn our attention towards the virtual linguistic landscape of the Senate Square by examining the sentence-level distribution of languages in the captions. The chosen level of analytical granularity was not linguistically informed but defined by our preprocessing strategy, which uses sentence tokenization (see Table 2). Our discussion focuses on Fig. 5, which shows the top ten languages identified using fastText, accompanied by 99.9% confidence intervals estimated by drawing 10,000 bootstrapped samples from the underlying data. This means that the mean value lies within these intervals at 99.9% probability. If the confidence intervals do not overlap, the difference between individual languages is significant at 0.01 level. The graphs in Fig. 5 are presented in pairs.
On the left-hand side, the Y-axes show the daily relative frequency, which is calculated by dividing the number of observations for each language by the total number of daily observations for all languages. This measurement is intended to capture the power relations and visibility of different languages in the virtual linguistic landscape. On the right-hand side, the Y-axes give the number of sentences per day. This measurement is intended to account for the growing volume of data, which was observed in Fig. 4a.

Fig. 4 Monthly and weekly Instagram activity around the Senate Square. (a) Number of unique users per month. Note that observations for 2013 and 2018 cover only a part of the year. (b) Average sentences per day of the week for sentences whose language could be identified at various probability thresholds

Fig. 5 Daily relative frequencies for languages identified using fastText, with 99.9% confidence intervals estimated using 10,000 bootstrapped samples from the underlying data, which are marked by the shaded areas. The lines show a third-order polynomial regression fitted using ordinary least squares. (a) Daily relative frequencies for the top-3 languages: English (en), Finnish (fi), Russian (ru) and other languages (n = 77). (b) Daily sentence counts for the top-3 languages. (c) Daily relative frequencies for the top 4–6 languages: Japanese (ja), Korean (ko) and Swedish (sv). (d) Daily sentence counts for the top 4–6 languages. (e) Daily relative frequencies for the top 7–10 languages: Spanish (es), German (de), Italian (it) and Portuguese (pt). (f) Daily sentence counts for the top 7–10 languages

To begin with, Fig. 5a shows the daily relative frequencies for the three most common languages (English, Finnish, and Russian) and the combined relative frequency for the remaining seventy-seven languages identified in the data (grouped together under the label ‘other’). These languages also underline the role of Senate Square as a tourist destination, as approximately half of the sentences are written in English. Furthermore, English seems to be gaining most from the growing popularity of Instagram, as indicated by the growing sentence count in Fig. 5b. Assuming that the dominance of English results from its role as a lingua franca, this raises questions about who the users of English are. We will return to this issue in Section 4.3.

Generally, the ‘big three’ (English, Finnish, and Russian) make up the vast majority of the virtual linguistic landscape. What is particularly worth noting in Fig. 5a and b is that Finnish overtook Russian as the second most common language only in 2015. Traditionally, Helsinki has been a popular destination among Russians due to its proximity and accessibility via road, rail, sea, and air. Interestingly, the decline of the Russian language coincides with the economic sanctions imposed on Russia due to the invasion of Ukraine, which caused the number of Russian tourists visiting Helsinki to dip in 2015 and 2016 (Official Statistics of Finland, 2018). A Kruskal–Wallis H-test comparing the daily relative frequencies for Russian in 2014 and 2015–16 showed that the difference was statistically significant (H = 31.503, P < 0.001).
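To make the test above concrete, the following minimal sketch computes the tie-corrected Kruskal–Wallis H statistic using only the Python standard library. The function names and the two samples are hypothetical; the P value helper is valid only for two groups, where H is compared against a chi-square distribution with one degree of freedom. It is an illustration under these assumptions, not the computation used in the article.

```python
import math
from itertools import chain


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more samples."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)
    ranks, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the ranks i + 1 .. j.
        ranks[pooled[i]] = (i + 1 + j) / 2
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sums = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    # The correction is undefined if every pooled value is identical.
    return h / (1 - tie_term / (n**3 - n))


def p_value_two_groups(h):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(h / 2))


# Hypothetical daily relative frequencies (not the values from the study).
share_2014 = [0.30, 0.28, 0.33, 0.31, 0.29]
share_2015_16 = [0.22, 0.20, 0.25, 0.21, 0.24]
h = kruskal_wallis_h(share_2014, share_2015_16)
p = p_value_two_groups(h)
print(f"H = {h:.3f}, P = {p:.4f}")
```

For more than two groups, the P value would require the chi-square distribution with k - 1 degrees of freedom, which is available in statistical libraries.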
Figure 5c–f zooms into the languages outside the top three, which were grouped together under the label ‘other’ in Fig. 5a and b. Note that this move is accompanied by a change of scale, as the relative frequencies and sentence counts for these languages are considerably lower than those in Fig. 5a and b. The observations are split into different figures for a clearer view, but if Fig. 5c–f were presented in a single graph, the confidence intervals would overlap for many languages, indicating that the differences in their frequencies and counts are not statistically significant. The way the relative frequencies of these languages fluctuate suggests that they contribute sporadically to the virtual linguistic landscape, which is also supported by their low sentence counts.

Nevertheless, Fig. 5c and d shows how geographically remote languages such as Japanese (ja) and Korean (ko) contribute to the virtual linguistic landscape, even temporarily surpassing Swedish, the second official language of Finland. The relatively low proportion of Swedish in the virtual linguistic landscape stands in stark contrast with the physical linguistic landscape, in which Swedish remains very prominent, as public signs are required to be bilingual if the number of minority speakers in the municipality exceeds 8% or 3,000 individuals (Syrjälä, 2017, p. 118). This is naturally the case with Helsinki as well, which is historically a bi- and multilingual city. However, fastText cannot distinguish between standard Swedish and Finland-Swedish, which means these observations should not be associated exclusively with the Swedish-speaking minority in Finland, but include visitors from Sweden as well.
Coming back to Japanese and Korean, it should be noted that although tourism statistics for Helsinki show that visitors from European countries outnumber Asians three to one (Official Statistics of Finland, 2018), the widespread adoption of mobile technology among Japanese and Korean users may explain their prominence in the virtual linguistic landscape. These languages, however, decline towards the present, although tourism statistics show that arrivals from Japan and Korea continue to increase, which may suggest that these users are abandoning Instagram. European visitors, in turn, are likely to include a sizeable number of business travellers, who may be less likely to contribute to the virtual linguistic landscape at the Senate Square, which may explain the relatively low proportion of major languages spoken in Europe such as Spanish, German, Italian, and Portuguese.

4.3 Language choices among users

The most striking feature of the virtual linguistic landscape at the Senate Square is the dominance of the English language, as it is unlikely that half of the users active at the location would speak English as their first language. To investigate language choices among users, we retrieved the time and location of up to thirty-three previous posts for each user; these users were naturally limited to those who had posted captions whose language we could identify. To determine the likely country of origin for each user, we first retrieved the administrative region of each coordinate/timestamp pair in the location history using a point-in-polygon query. Next, we used the timestamps to determine the overall duration of a user’s activity within each region by calculating the time between the oldest and newest posts. In addition to storing the region with the longest period of activity, we also recorded the region with the most activity.
Finally, we calculated the average duration of activity for each user by dividing the time spent at each region by the total number of regions visited.

The initial data for estimating the users’ country of origin contained 75,685 posts by 49,842 unique users. On average, the location history of a user contained 18.02 coordinate/timestamp pairs (SD = 8.46), whereas the average period of activity amounted to 152 days (SD = 161). To make our estimation more reliable, we discarded the first quartile for both coordinate/timestamp pairs and the longest period of activity. In practice, this meant excluding users with eleven or fewer coordinate/timestamp pairs and whose longest period of activity was 44 days or less. For the final estimation, we retained a total of 45,685 posts by 31,442 unique users. For these users, we assumed that the administrative region where the users had been active for the longest period of time could be used to approximate their country of origin.

Table 4 presents the distribution of sentences in the six most frequent languages shown in Fig. 5 among users from the ten most frequent countries of origin. As may be expected, the majority of users active in the vicinity of the Senate Square come from Finland, but what is surprising is that Finnish users post nearly as much in English as in Finnish. Previous surveys on the role of the English language in Finland have emphasized the popularity and importance of English, particularly among the youth (Leppänen et al., 2011). This may be a source of bias, as youth are also more likely to use social media (Longley et al., 2015; Hausmann et al., 2018).
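The country-of-origin estimate described earlier in this section can be sketched roughly as follows. The sketch assumes that each post in a user's location history has already been resolved to an administrative region (for instance by a point-in-polygon query); the function name `summarize_location_history` and the example history are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def summarize_location_history(history):
    """Return (region with longest activity, region with most posts, span).

    `history` is a list of (region, timestamp) pairs for a single user.
    """
    by_region = defaultdict(list)
    for region, timestamp in history:
        by_region[region].append(timestamp)
    # Period of activity: time between the oldest and newest post per region.
    spans = {region: max(ts) - min(ts) for region, ts in by_region.items()}
    longest_region = max(spans, key=spans.get)
    busiest_region = max(by_region, key=lambda region: len(by_region[region]))
    return longest_region, busiest_region, spans[longest_region]


# Hypothetical location history for one user (illustrative values only).
history = [
    ("Finland", datetime(2017, 1, 1)),
    ("Finland", datetime(2017, 4, 1)),
    ("Spain", datetime(2017, 2, 10)),
    ("Spain", datetime(2017, 2, 12)),
    ("Spain", datetime(2017, 2, 15)),
]
home, busiest, span = summarize_location_history(history)
print(home, busiest, span.days)
```

In this example the user posts more often in one region during a short visit, yet the longest period of activity points to another region, which is why the two measures are kept apart.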
Nevertheless, the high proportion of sentences (45.9%) written in English warrants closer attention, as similar findings have been reported for other social media platforms, namely Twitter, by Laitinen et al. (2018).

To do so, we trained a topic model over monolingual English captions posted by users whose country of origin was estimated to be Finland. These data consisted of 8,636 captions with 5,552 unique words after removing rare and frequent words that appeared in a single sentence or in more than 25% of the sentences. The model was trained using the Latent Dirichlet Allocation algorithm for 150 iterations with ten passes through the corpus, using the implementation provided in the gensim library (Rehurek and Sojka, 2010). To preprocess the data, we adopted the procedure set out in Table 2. We also removed stopwords defined in NLTK (Bird et al., 2009) and lemmatized the words using the lookup table for English in spaCy. Finally, we calculated a coherence score, Cv, for each topic, which has been suggested to correlate strongly with human evaluations of topic coherence (Röder et al., 2015).

Table 5 gives the ten most prominent topics with their ten most frequent words. Some of the coherence scores are fairly low, which is not surprising given the noisy social media data and the small size of the corpus. Nevertheless, the topics can provide insights into the nature of the content posted in English by Finnish users. To begin with, several topics seem to be strongly associated with the location, weather, leisure, and celebrations such as Christmas and New Year’s Eve (1 and 3) and the Lux light carnival (6). Many topics also feature words associated with a positive sentiment (3, 5–7, 9, and 10). This suggests that Finns use English to connect with international audiences, appraising the physical location and the activities associated with it in the virtual space.
Finnish users appear to participate in maintaining the identity of the location as a culturally valued landmark, at the same time construing the location as a tourist attraction. The role of English as the lingua franca of tourism (Francesconi, 2014), which may also explain the choice of language, is also supported by a positive view of the language and a high level of proficiency in Finland (Leppänen et al., 2011). However, the preference for English holds for most, but not all linguistic groups contributing to the virtual linguistic landscape: Table 4 shows that Russians clearly prefer their native language over English.

Table 4 The distribution of the six most common languages among the users originating in ten most common countries

Country    Finnish  English  Russian  Swedish  Japanese  Korean     All
Finland     10,691   10,629      673      468        57      17  23,127
Russia          73      903    8,157        2         –       1   9,261
The USA        100    2,687       97        8         1       4   2,987
The UK          82    1,813       31        7         5       5   1,998
Germany         78      836       53        2         7       3   1,281
Sweden          88      528       59      308         4       3   1,061
Spain           72      478      112        4         1      10   1,048
Italy           55      554      133        6         5       7   1,019
France          37      474      110        3         9      12     817
Japan           14      247        1        1       364      14     674

Note: The countries are ranked by their popularity in the leftmost column. The rightmost column gives the total number of sentences written by users from the particular country in all languages.

4.4 The diversity of the virtual linguistic landscape

Finally, we turn towards the richness and diversity of the virtual linguistic landscape, applying the indices introduced in Section 3.4. The following discussion focuses on Fig. 6, which shows several indices applied to the results of automatic language identification. We introduce these indices and explain their implications below.

Fig. 6a shows the linguistic richness, or simply the number of unique languages per day, and the number of singletons, that is, how many languages appear only once a day. In Fig. 6a, the parallel increase in unique languages and singletons suggests that smaller languages are driving the increase in linguistic richness. This observation was supported by a strong positive Pearson correlation between 30-day rolling averages for unique languages and singletons (r = 0.975, n = 1,633, P < 0.001). Increasing linguistic richness also correlated with the increase in unique users (r = 0.899, n = 1,633, P < 0.001), as shown in Fig. 6b. To summarize, Fig. 6a and b suggests that the growing popularity of Instagram has resulted in an increasingly rich virtual linguistic landscape at the Senate Square, as smaller linguistic groups have adopted the platform.

The simple richness index, however, does not account for the growing volume of data due to the increasing popularity of the platform. This perspective can be provided by Menhinick’s richness index, which emphasizes the relationship between data volume and richness. Menhinick’s richness index, shown in Fig. 6c, reveals a decreasing trend over the 4.5 years. This trend suggests that despite increasing linguistic richness, driven by the increase in smaller languages, the virtual linguistic landscape is increasingly dominated by languages such as English, Finnish, and Russian (cf. Fig. 5a and b). In other words, the growing volume of data has made the dominant languages increasingly prominent in the virtual linguistic landscape, which is reflected in a decreasing value for Menhinick’s richness index.

Measuring the diversity of the virtual linguistic landscape requires indices that account for both the number of languages observed and their relative proportions. One such index is the Berger–Parker dominance index, shown in Fig. 6d, which gives the fraction of observations for the language with the most posts per day.
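The richness measures discussed so far can be sketched for a single day of observations as follows. The sketch assumes the standard textbook definitions (Menhinick's index as the number of languages divided by the square root of the number of observations, and Berger–Parker dominance as the share of the most frequent language), since Section 3.4 is not reproduced here; the function name `richness_measures` and the sample day are hypothetical.

```python
import math
from collections import Counter


def richness_measures(languages):
    """Richness, singletons, Menhinick and Berger-Parker for one day."""
    counts = Counter(languages)
    n_observations = sum(counts.values())
    richness = len(counts)
    return {
        "richness": richness,
        # Languages observed exactly once during the day.
        "singletons": sum(1 for c in counts.values() if c == 1),
        # Richness scaled by the square root of the data volume.
        "menhinick": richness / math.sqrt(n_observations),
        # Share of observations taken up by the most frequent language.
        "berger_parker": max(counts.values()) / n_observations,
    }


# Hypothetical sentence-level language labels for one day (not real data).
day = ["en"] * 8 + ["fi"] * 4 + ["ru"] * 2 + ["ja", "ko"]
measures = richness_measures(day)
print(measures)
```

Doubling every count in the example leaves richness unchanged but lowers Menhinick's index, which mirrors the contrast between Fig. 6a and Fig. 6c.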
Given the observations in Fig. 5a, approximately half of the time the dominant language is English. The decreasing Berger–Parker index suggests that the dominant languages are losing ground to smaller languages, showing a drop of thirty points during the 4.5 years, which suggests that the virtual linguistic landscape of the Senate Square is becoming increasingly diverse. This observation is also supported by the decreasing dominance index in Fig. 6e, which measures the respective proportions of languages: a dominance index of 0 would indicate that all languages are equally present, whereas an index of 1 would mean the total dominance of a single language.

Table 5 A topic model trained over 8,636 captions written in English by Finnish users, with one topic per column

1          2        3        4        5         6         7          8          9           10
Helsinki   Get      Year     Make     Love      Helsinki  Good       Town       Look        Day
Christmas  Cold     Happy    Start    Night     Lux       Pizza      Run        Go          One
Cathedral  Menu     New      Open     Great     Light     Morning    Conjurer   Lot         Independence
Light      Thing    Well     Art      Enjoy     Finland   Beautiful  Afternoon  Let         Church
Market     Finally  Time     Night    People    Sunday    Walk       Friday     Know        Back
Senate     Ready    Take     Welcome  See       Home      Lovely     Finnish    Special     Nice
Square     New      Week     Way      Last      Festival  City       Colour     Right       Finland
Time       May      Picture  Wine     Come      Snow      Sun        Well       Exhibition  Sunny
Lunch      Always   Thank    Drink    December  Amaze     Blue       Look       Pretty      Big
Winter     Taste    Get      Spring   Weekend   Wait      Today      Know       Like        Last
0.342      0.263    0.492    0.292    0.3       0.37      0.345      0.254      0.356       0.289

Note: The words (rows) associated with each topic are sorted by their weight in a descending order. The final row gives the coherence score Cv for the topic (Röder et al., 2015).

Fig. 6 Various diversity measures applied to the data set, with 99.9% confidence intervals estimated using 10,000 bootstrapped samples from the underlying data. The line shows a third-order polynomial regression fitted using ordinary least squares. (a) Richness and singletons. (b) Richness and daily unique users. (c) Menhinick richness. (d) Berger–Parker dominance. (e) Dominance. (f) Shannon entropy

Finally, the observed increase in diversity is also supported by Shannon entropy, shown in Fig. 6f, which captures the amount of information required to describe the degree of order/disorder in a system. The higher the degree of disorder (in this case, the variety of languages and their respective probabilities of occurrence), the more information is required to describe the state of the system, that is, the virtual linguistic landscape. Interestingly, the index for Shannon entropy peaks in 2017. This may suggest that the virtual linguistic landscape of the Senate Square has reached its maximal degree of diversity (with slightly over eight languages on the average day, as shown in Fig. 6a) possible within the current userbase of Instagram.

To summarize, several conclusions may be drawn from the indices in Fig. 6. The richness of the virtual linguistic landscape increases as the number of users grows. Although the number of languages found in the virtual linguistic landscape grows, dominant languages such as English, Finnish, and Russian gain the most from the growth, enabling them to consolidate their position. Yet the proportion of dominant languages is decreasing, which indicates increasing diversity.
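The two diversity indices discussed above can be sketched as follows. This is an illustration under two assumptions that the text here does not confirm: that the dominance index is the sum of squared proportions (Simpson's dominance) and that Shannon entropy is measured in bits (base 2). Under this definition of dominance, equally frequent languages yield 1/S for S languages, which approaches 0 only as the number of languages grows. The function names and counts are hypothetical.

```python
import math


def dominance(counts):
    """Sum of squared proportions; 1.0 means a single language dominates."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def shannon_entropy(counts):
    """Shannon entropy in bits for a list of positive counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical daily sentence counts per language (not real data).
skewed_day = [8, 4, 2, 1, 1]
even_day = [4, 4, 4, 4]

for name, counts in (("skewed", skewed_day), ("even", even_day)):
    print(name, round(dominance(counts), 4), round(shannon_entropy(counts), 4))
```

A day on which one language accounts for every sentence gives a dominance of 1 and an entropy of 0, the opposite extreme of the evenly distributed example.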
Put differently, smaller languages are gaining on the share of the dominant languages. At the same time, the virtual linguistic landscape at the Senate Square seems to have reached a point where the linguistic diversity no longer increases. In other words, the number of languages in the virtual linguistic landscape remains the same, but the smaller languages change.

5 Discussion and Conclusion

Our results suggest that virtual linguistic landscapes can be effectively characterized using computational methods, which are necessary for handling high volumes of social media data. With carefully planned preprocessing, automatic language identification and other natural language processing techniques can do most of the analytical work in a sufficiently reliable manner. However, insights provided by automatic language identification are limited without the means to evaluate the respective proportions of the observed languages. Our analysis revealed a rich and diverse virtual linguistic landscape at the Senate Square, which is dominated by English, as the language is used extensively by both locals and tourists.

The results also emphasize the role of Senate Square as a highly valued cultural landmark and a tourist attraction (Jokela, 2014). The cultural importance is manifested in the high number of posts by locals, whereas the impact of tourism is reflected by the high number of foreign visitors. In this respect, our findings support Kellerman’s (2010) view that qualities associated with the physical place may be carried over to the corresponding virtual space. Although we did not explicitly touch upon the issue in the analysis, it should be noted that global mobility and tourism are a privilege of a select few rather than the many, which is likely to be reflected in the linguistic landscape. Choosing an alternative location for the study, such as a local transportation hub, would have likely yielded very different results (cf. Soler-Carbonell, 2016).
The richness and diversity of the virtual linguistic landscape also resonate with Lee’s (2016, p. 119) proposal that user-generated social media content increases the potential for exposure to foreign languages. Geotagged social media content may be particularly effective for this purpose, as content associated with a location can be accessed through map interfaces instead of using hashtags or search terms in some specific language. This effect is further reinforced by Instagram, which allows locations defined on the platform to have multilingual names. All the content associated with the locations named in different languages is then aggregated under a single point of interest. This is also likely to drive the formation and maintenance of the double space, as conceptualized by Kellerman (2010).

In addition, the nature of Instagram as a platform must be taken into account when interpreting the results. Unlike Twitter, which acts as a forum for public discussion, Instagram may be preferred for sharing personal experiences (Zappavigna, 2011, 2016; Tenkanen et al., 2017). Together with the intended audience, the platform may affect language choices among users (Androutsopoulos, 2015). Tracing these linguistic repertoires would, however, require a much closer analysis of longitudinal data for individual users, which was beyond the scope of this article. However, our proposed method could be easily adopted for a large-scale study of what Pennycook and Otsuji (2014, p. 166) have called ‘a geography of linguistic happenings’. Such analyses, however, would still be limited by the spatial accuracy of Instagram, as observed in Section 3.1.
Users may, for instance, associate content with locations higher in the POI hierarchy (such as ‘Helsinki’ instead of ‘Senate Square’) or choose the wrong location altogether.

In terms of other limitations, the results are naturally affected by how widely Instagram has been adopted by potential users of social media, and should be evaluated in the light of the inherent bias towards younger population found in social media data (Longley et al., 2015; Hausmann et al., 2018). Furthermore, the proposed method cannot provide a fine-grained view of the linguistic landscape, because automatic language identification cannot detect code-switching within sentences, or distinguish between varieties of a single language, such as American and British English or Finland-Swedish and Standard Swedish, unless explicitly trained to do so.

Despite these limitations, our results suggest that Instagram and other social media platforms with geolocated content do nevertheless hold much potential for sociolinguistic inquiry, as suggested by Androutsopoulos (2014). Tapping further into this potential, however, would benefit from collaborating with geographers, to leverage more advanced methods for spatiotemporal analysis. Such analyses could be used, for instance, to reveal where and when particular linguistic groups are active, to evaluate the potential for interaction between these groups. Longitudinal analyses for individual users, in turn, could be used to investigate their linguistic repertoires.

Finally, because computational methods develop rapidly, analytical tools should be shared openly to enable the replication and reproduction of research, which would benefit the entire field of study. A natural extension to the current work would be to take on what Jaworski and Thurlow (2010) have conceptualized as semiotic landscapes, whose analysis would include other modes of expression besides language in the virtual linguistic landscape.
Although research on artificial intelligence is making rapid progress in processing multimodal data (Bateman et al., 2017, pp. 163–4), identifying fine-grained patterns of multimodal communication in high volumes of geotagged social media data is likely to remain a long-term endeavour. Nevertheless, sufficiently mature computational techniques can already support the study of both virtual and physical linguistic landscapes, and their potential applications should be explored further.

Funding

This work was supported by the Finnish Cultural Foundation and the Kone Foundation.

References

Allen, P. T., Fatah, A., and Robison, D. (2018). Urban encounters reloaded: Towards a descriptive account of augmented space. In Jung, T. and tom Dieck, M. C. (eds), Augmented Reality and Virtual Reality: Empowering Human, Place and Business. Cham: Springer, pp. 259–73.
Androutsopoulos, J. (2014). Computer-mediated communication and linguistic landscapes. In Holmes, J. and Hazen, K. (eds), Research Methods in Sociolinguistics: A Practical Guide. Oxford: Wiley, pp. 74–90.
Androutsopoulos, J. (2015). Networked multilingualism: some language practices on Facebook and their implications. International Journal of Bilingualism, 19(2): 185–205.
Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4): 555–96.
Barton, D. (2018). The roles of tagging in the online curation of photographs. Discourse, Context and Media, 22: 39–45.
Bateman, J. A., Wildfeuer, J., and Hiippala, T. (2017). Multimodality: Foundations, Research and Analysis – A Problem-Oriented Introduction. Berlin: De Gruyter Mouton.
Baym, N. K. (2015). Personal Connections in the Digital Age, 2nd edn. Malden, MA: Polity.
Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. Sebastopol, CA: O’Reilly.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5: 135–46.
Bruyèl-Olmedo, A. and Juan-Garau, M. (2015). Shaping tourist LL: language display and the sociolinguistic background of an international multilingual readership. International Journal of Multilingualism, 12(1): 51–67.
Carter, S., Weerkamp, W., and Tsagkias, M. (2013). Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1): 195–215.
Deumert, A. (2014a). Digital superdiversity: a commentary. Discourse, Context & Media, 4–5: 116–20.
Deumert, A. (2014b). Sociolinguistics and Mobile Communication. Edinburgh: Edinburgh University Press.
Dodge, M. and Kitchin, R. (2005). Code and the transduction of space. Annals of the Association of American Geographers, 95(1): 162–80.
Francesconi, S. (2014). Reading Tourism Texts: A Multimodal Analysis. Bristol: Channel View Publications.
Gorter, D. (2013). Linguistic landscapes in a multilingual world. Annual Review of Applied Linguistics, 33: 190–212.
Gorter, D. and Cenoz, J. (2015). Translanguaging and linguistic landscapes. Linguistic Landscapes, 1(1–2): 54–74.
Hausmann, A., Toivonen, T., Slotow, R., Tenkanen, H., Moilanen, A., Heikinheimo, V., and Di Minin, E. (2018). Social media data can be used to understand tourists’ preferences for nature-based experiences in protected areas. Conservation Letters, 11(1): e12343.
Hochmair, H. H., Juhász, L., and Cvetojevic, S. (2018). Data quality of points of interest in selected mapping and social media platforms. In Kiefer, P., Huang, H., Van de Weghe, N. and Raubal, M. (eds), Progress in Location Based Services 2018. Cham: Springer, pp. 293–313.
Hoffman, C. R. and Bublitz, W. (eds) (2017). Pragmatics of Social Media. Berlin and Boston: De Gruyter Mouton.
Ivkovic, D. and Lotherington, H. (2009). Multilingualism in cyberspace: conceptualising the virtual linguistic landscape. International Journal of Multilingualism, 6(1): 17–36.
Jaworski, A. and Thurlow, C. (eds) (2010). Semiotic Landscapes: Language, Image, Space. London and New York: Continuum.
Jokela, S. (2014). Tourism and identity politics in the Helsinki churchscape. Tourism Geographies, 16(2): 252–69.
Kellerman, A. (2010). Mobile broadband services and the availability of instant access to cyberspace. Environment and Planning A, 42: 2990–3005.
Kellerman, A. (2014). The satisfaction of human needs in physical and virtual spaces. The Professional Geographer, 66(4): 538–46.
Kellerman, A. (2016). Daily Spatial Mobilities: Physical and Virtual. New York and London: Routledge.
Kiss, T. and Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4): 485–525.
Kitchin, R. (2013). Big data and human geography: opportunities, challenges and risks. Dialogues in Human Geography, 3(3): 262–7.
Laitinen, M., Lundberg, J., Levin, M., and Martins, R. (2018). The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data. In Mäkelä, E., Tolonen, M. and Tuominen, J. (eds), Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7–9, pp. 349–62.
Lee, C. (2016). Multilingual resources and practices in digital communication. In Georgakopoulou, A. and Spilioti, T. (eds), The Routledge Handbook of Language and Digital Communication. New York and London: Routledge, pp. 118–32.
Lee, C. (2017). Multilingualism Online. New York and London: Routledge.
Lee, C. and Chau, D. (2018). Language as pride, love, and hate: archiving emotions through multilingual Instagram hashtags. Discourse, Context and Media, 22: 21–9.
Leppänen, S. and Peuronen, S. (2012). Multilingualism and the internet. In Chapelle, C. A. (ed.), The Encyclopedia of Applied Linguistics. Oxford: Wiley-Blackwell.
Leppänen, S., Pitkänen-Huhta, A., Nikula, T., Kytölä, S., Törmäkangas, T., Nissinen, K., Kääntä, L., Räisänen, T., Laitinen, M., Pahta, P., Koskela, H., Lähdesmäki, S., and Jousmäki, H. (2011). National survey on the English language in Finland: Uses, meanings and attitudes, Vol. 5 of Studies in Variation, Contacts and Change in English. Helsinki: University of Helsinki.
Longley, P. A., Adnan, M., and Lansley, G. (2015). The geotemporal demographics of Twitter usage. Environment and Planning A: Economy and Space, 47(2): 465–84.
Lui, M. and Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea, July 10. Association for Computational Linguistics, pp. 25–30.
Manjavacas, E. (2016). Mapping urban multilingualism through Twitter. Master’s thesis, The Free University of Berlin.
McKinney, W. (2010). Data structures for statistical computing in Python. In van der Walt, S. and Millman, J. (eds), Proceedings of the 9th Python in Science Conference, Austin, Texas, United States, June 28–July 3, pp. 51–6.
Official Statistics of Finland (2018). Accommodation statistics. http://www.stat.fi/til/matk/index.html (accessed 6 July 2018).
Paolillo, J. C. (2007). How much multilingualism? Language diversity on the internet. In Danet, B. and Herring, S. C. (eds), The Multilingual Internet: Language, Culture, and Communication Online. Oxford: Oxford University Press, pp. 408–30.
Papen, U. (2012). Commercial discourses, gentrification and citizens’ protest: The linguistic landscape of Prenzlauer Berg, Berlin. Journal of Sociolinguistics, 16(1): 56–80.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.
Pennycook, A. and Otsuji, E. (2014). Metrolingual multitasking and spatial repertoires: ‘pizza mo two minutes coming’. Journal of Sociolinguistics, 18(2): 161–84.
Peukert, H. (2013). Measuring linguistic diversity in urban ecosystems. In Duarte, J. and Gogolin, I. (eds), Linguistic Superdiversity in Urban Areas: Research Approaches. Amsterdam: Benjamins, pp. 75–93.
Rehurek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of 7th Language Resources and Evaluation Conference: Workshop on New Challenges for NLP Frameworks, ELRA, pp. 45–50.
Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM’15), ACM, pp. 399–408.
Seargeant, P. and Tagg, C. (eds) (2014). The Language of Social Media. Basingstoke: Palgrave.
Soler-Carbonell, J. (2016). Complexity perspectives on linguistic landscapes: a scalar analysis. Linguistic Landscape, 2(1): 1–25.
Syrjälä, V. (2017). Naming businesses – in the context of bilingual Finnish cityscapes. In Ainiala, T. and Östman, J.-O. (eds), Socio-onomastics: The Pragmatics of Names. Amsterdam: Benjamins, pp. 183–202.
Tenkanen, H. (2017). Capturing Time in Space: Dynamic Analysis of Accessibility and Mobility to Support Spatial Planning with Open Data and Tools. PhD thesis, Department of Geosciences and Geography, University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-2935-9.
Tenkanen, H., Di Minin, E., Heikinheimo, V., Hausmann, A., Herbst, M., Kajala, L., and Toivonen, T. (2017). Instagram, Flickr, or Twitter: Assessing the usability of social media data for visitor monitoring in protected areas. Scientific Reports, 7(17615).
Villi, M. (2015). ‘Hey, I’m here right now’: Camera phone photographs and mediated presence. Photographies, 8(1): 3–22.
Zappavigna, M. (2011). Ambient affiliation: A linguistic perspective on Twitter. New Media and Society, 13(5): 788–806.
Zappavigna, M. (2013). Discourse of Twitter and Social Media: How We Use Language to Create Affiliation on the Web. London: Continuum.
Zappavigna, M. (2016). Social media photography: construing subjectivity in Instagram images. Visual Communication, 15(3): 271–92.
Zook, M. A. and Graham, M. (2007). Mapping digiplace: Geocoded internet data and the representation of place. Environment and Planning B, 34(3): 466–82.
Zubiaga, A., Vicente, I. S., Gamallo, P., Pichel, J. R., Alegria, I., Aranberri, N., Ezeiza, A., and Fresno, V. (2016). TweetLID: a benchmark for tweet language identification. Language Resources and Evaluation, 50(4): 729–66.

Note

1 http://www.instagram.com