title: Automatic Image Content Extraction: Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual Archives
authors: Männistö, Anssi; Seker, Mert; Iosifidis, Alexandros; Raitoharju, Jenni
date: 2022-04-05

Applying machine learning tools to digitized image archives has the potential to revolutionize quantitative research of visual studies in humanities and social sciences. The ability to process a hundredfold greater number of photos than has traditionally been possible, and to analyze them with an extensive set of variables, will contribute to deeper insight into the material. Overall, these changes will help to shift the workflow from simple manual tasks to more demanding stages. In this paper, we introduce the Automatic Image Content Extraction (AICE) framework for machine learning-based search and analysis of large image archives. We developed the framework in a multidisciplinary research project as a framework for future photographic studies by reformulating and expanding the traditional visual content analysis methodologies to be compatible with the current and emerging state-of-the-art machine learning tools and to cover the novel machine learning opportunities for automatic content analysis. The proposed framework can be applied in several domains in humanities and social sciences, and it can be adjusted and scaled into various research settings. We also provide information on the current state of different machine learning techniques and show that there are already various publicly available methods that are suitable for a wide range of visual content analysis tasks.

The so-called visual turn has been one of the major paradigmatic changes in the fields of humanities and social sciences in the past few decades. Usually, the visual turn refers to a shift in emphasis in these fields towards an increasing importance of the visible. It gained prominence in the 1990s and succeeded the linguistic turn (Jay 2002, pp. 267-278; Pauwels 2000, p. 7). At the beginning of the 21st century, the visual turn in humanities can be seen as an aspect of another major change, called the digital turn (Nicholson 2013; Wevers and Smits 2020, p. 194). These two turns, visual and digital, have proceeded in parallel as media companies, museums, archives, and other traditional holders of vast image reservoirs have increasingly made their collections publicly available.
This opening up has provided completely new possibilities for exploring and analyzing historical images and visual heritage. One of the most important prerequisites for building a more visual social science is, as Grady (2007, p. 211) defines it, demonstrating that visual data provide answers to research questions that are not addressed satisfactorily by the use of more conventional, non-visual methods. According to Cerku (2019, p. 237), researchers should be aware of the importance of historical photographs when examining culture, because of their potential to reveal the social, economic, and political life of the past. Emmison and Smith (2000, p. 2) argue that photographs have been misunderstood as constituting forms of data in their own right when they "should be considered in the first instance as means of preserving, storing, or representing information." In this sense, photographs should, in their view, be seen as analogous to codesheets, tape recordings, or any one of the numerous ways in which social researchers seek to capture data. Collier (2004, p. 6) states that strong visual research collections present rich potential because they may be productively analyzed in many different ways, both direct and indirect. The ideal situation would be a carefully made visual research collection with comprehensive temporal, spatial, and other contextual recording, good annotations, and a collection of associated information in an organized data file. However, as Collier (ibid.) points out, such systematic collections are relatively rare, which leads many scholars to underestimate the true potential of photography, film, and video as reliable sources of cultural information.

Efficient use of visual data in humanities has been hindered by the lack of suitable methodologies for analyzing and accessing large digital image collections. Traditional research methods are based on manual and mostly text-based tools developed in the era of analogue photographs. This typically limits the number of photographs considered to somewhere between 400 and 1,000 (see examples in Section 4). These numbers are difficult to scale up as each photo must be analyzed individually and manually by a human coder. It is time-consuming to search a large image collection for typical or particular kinds of photos, or for special patterns or features in photos, as the images usually lack content-aware annotations, or this information is incomplete or inadequate. Arnold and Tilton (2020, p. 1) point out that while visual collections may include extensive metadata about the provenance of a digital image, there is often little to no structured data pertaining to the content of the image itself. Similarly, Wevers and Smits (2020, p. 195) observe that textual analysis, in all shapes and forms, has come to dominate digital humanities and, as text-based querying provides a mainly textual view of digital sources, "as a field, digital humanities has grossly neglected visual content, causing a lopsided representation of all sorts of digitized archives." Another problem is that different fields in social sciences, humanities, and behavioral sciences have been developing and using their methods in isolation from each other. As Pauwels (2020, p. 2) states, "while visual practices and approaches are being proposed and advocated from myriad disciplinary and theoretical positions, it seems that often more effort is expended in trying to 'appropriate' a domain, method, or technique than in developing a more cumulative and integrative stance."
Many scholars are nevertheless enthusiastic about the possibilities of using machine learning methods in various fields of social sciences and humanities (e.g., Bødker 2018; Broersma and Harbers 2018; Fiorucci et al. 2020; Sherren et al. 2017). Parry (2020, p. 18) ponders in her book chapter 'Quantitative Content Analysis of the Visual' in The Handbook of Visual Research Methods as follows: "Traditional media research techniques have been designed with limited media outlets in mind, manageable via human coding alone. But the 'big data' generated by online media practices requires appropriate digital tools and methods, including automated coding or sorting." In the same manner, Pauwels (2020, p. 9) notes that while "the current hype around 'big data' contributes to the idea that empirical research is becoming 'more visual', (...) it does not take advantage of the many visual dimensions of culture and society as a prime source of data." Currently, there are only a few examples of the actual use of machine learning tools and methods in these fields, and many researchers see machine learning more as a far-future direction than a realistic option for their studies. Rose, who has written one of the most comprehensive books on visual methodologies, refers to image content recognition software only briefly in one paragraph (Rose 2016, p. 292). She mentions that researchers are working on programs that will be able to undertake searches based on images' visual content rather than on the tags that have been attached to them. As Rose found such software to be commercial and very expensive, she chose not to discuss it further in her book. She (ibid.) mentions that the practical challenges of accessing and storing images at a large scale are significant. This is why she assumes that most readers of her book will be able to work with a few hundred images at most. Surprisingly, machine learning themes are not covered even in a recent comprehensive volume concentrating entirely on methodology, The Handbook of Visual Research Methods (Pauwels and Mannay 2020). Under these conditions, although more and more historical image collections are made public, their utilization does not advance equally fast, unless new methodological tools are put into operation.

An obvious reason explaining the rare use of machine learning tools in humanities is the extremely rapid development of the required machine learning techniques during the last decade. In particular, deep learning with convolutional neural networks (CNNs), along with improved computing capabilities, has relatively recently opened completely new opportunities in image content analysis (LeCun, Bengio, and G. Hinton 2015). With CNNs, it is possible to explore the content (what is represented) and the style (how it is represented) of images in large archives (Wevers and Smits 2020, p. 195) as well as to do distant viewing of large collections of visual material before actually studying them (Arnold and Tilton 2019, p. i3). Arnold and Tilton (2019, p. i6) describe distant viewing of large image collections as a "framework for the automatic extraction of semantic elements of visual materials followed by the aggregation and visualization of these elements via techniques from exploratory data analysis". They call for establishing code systems for specific computational interpretations through metadata extraction algorithms (ibid., pp. i3, i4, i13).
Until recently, as Wevers and Smits (2020) point out, the computational analysis of visual material has been limited, for practical reasons, to simple operations, such as adding metadata (Kuhn 2018) or looking at some basic features, such as the size, color, or saturation of the visual material (Manovich 2012). Wevers and Smits (2020, pp. 195, 203-204) call for new techniques for exploring visual material in archives using non-textual search methods, which enable accessing the contents of images directly, without referring to textual content such as titles and captions and without having to browse through archives manually. They (ibid., p. 198) state that new machine learning-based techniques also contribute to the construction of a radically new overview of visual culture, which allows for the analysis of trends and changes over extended periods.

During the last few years, pilot studies applying state-of-the-art algorithms for photographic studies have started emerging. Wevers and Smits (ibid.) demonstrated the use of a pretrained CNN model for three different automatic visual content analysis tasks over a collection of 450,000 historical images: 1) determining whether an image is an illustration or a photograph and defining the historical watershed moment when the number of photographs overtook the number of illustrations, 2) querying images based on abstract visual aspects (clustering visually similar advertisements), and 3) retraining the last layer of the network to classify images into nine relevant visual categories defined by domain experts. A recent study by Chumachenko et al. (2020), co-authored by most of the authors of this paper, demonstrated the use of CNNs for conducting a novel kind of analysis of prominent Finnish World War II photographers. A CNN was trained to recognize the photographer from the photo, which allowed learning some abstract similarities and differences between the photographers. Furthermore, object detection was applied to analyze the most typical topics for each photographer. Arnold and Tilton (2020) used CNN-based semantic image segmentation to produce structured data on the content of historical photos. Väisänen et al. (2021) applied state-of-the-art semantic clustering, scene classification, and object detection methods to analyze human-nature interactions in Finnish national parks using visitors' geotagged social media photos as source data. Araujo, Lock, and Velde (2020) created a protocol for using commercial Application Programming Interfaces (APIs) for Automatic Visual Content Analysis (AVCA) in large-scale image datasets. Traditional visual content analysis (see Section 2) starts from particular research questions and hypotheses, which are approached by examining the structural content of individual images in a sample. The protocol introduced by Araujo, Lock, and Velde (ibid.) does not focus on these structural features and related variables and values. Rather, it creates a more qualitative level of analysis: "Using our proposed framework, researchers can evaluate the extent to which computer vision labels may be able to assist in the classification of concepts that are theoretically relevant to Communication Science, yet in many cases, the level of complexity of these concepts may fall short." (ibid., p. 259) As Araujo, Lock, and Velde (ibid.)
focused on existing APIs instead of the underlying computer vision and machine learning methods, they point out that "the black-box approach of commercial API's also means that these models may change over time, which may pose challenges to replicability" (ibid., p. 258).

In this paper, we take a more holistic view on the use of machine learning methods for visual studies in humanities and social sciences. We aim to provide a comprehensive picture of how adopting new methods and techniques will enhance traditional settings of photographic studies in various academic fields and of what new approaches these tools enable. We summarize our views into a novel Automatic Image Content Extraction (AICE) framework for large image archives that can replace the previous models when automatic tools are used. In AICE, we reformulate and expand the traditional visual content analysis framework to be compatible with the current and emerging state-of-the-art machine learning tools and to cover the novel machine learning opportunities for automatic content analysis. We also provide information on the current state of different machine learning techniques and show that there are already various publicly available methods that are suitable for a wide range of visual content analysis tasks. Our goal with AICE is to provide a general framework for analyzing and searching large image collections. It can be applied in several domains in humanities and social sciences, and it can be adjusted and scaled into various research settings. The rest of this paper is structured as follows. Section 2 focuses on the basics of traditional visual content analysis. Section 3 introduces our novel Automatic Image Content Extraction (AICE) framework. The following sections illustrate the use of the AICE framework by explaining how machine learning can enhance traditional visual content analysis tasks (Section 4) and what kind of novel opportunities it will open (Section 5). Section 6 discusses the challenges and drawbacks associated with the novel tools. Finally, Section 7 concludes the paper.

Content analysis is a method of analysing visual images that was originally developed to interpret written and spoken texts. It was first used in the interwar period by social scientists wanting to analyse the journalism of the emerging mass media. It was given a boost during the Second World War in order to detect implicit messages in German radio broadcasts. Hence, it is an explicitly quantitative methodology, and it was developed to address the need of tackling the sheer scale of the mass media (Rose 2016, pp. 85-86; see also Berelson 1952, pp. 21-25). According to the classic definition of Berelson (ibid., p. 18): "Content analysis is a research technique for the objective, systematic, and quantitative description of the manifest content of communication." While content analysis initially focused mainly on texts, it was also considered in the context of nonverbal information. Nevertheless, it is notable that Berelson (ibid.) in his classic book Content Analysis in Communication Research does not even include photographs in his list of "interesting applications" of quantitative content analysis. Instead, he lists, with examples, the following non-verbal forms of communication: paintings, sculptures, drawings, cartoons, maps, gestures, tone and voice patterns, and music (ibid., pp. 108-109).
Reading Images: The Grammar of Visual Design, a book by Kress and Leeuwen (1996), can be considered a turning point enhancing the use of quantitative methods in studying the contents of photographs in a wide variety of fields in humanities and social sciences. In their book, Kress and Leeuwen (ibid., p. 1) provided "inventories of major compositional structures which have become established as conventions in the course of the history of visual semiotics, and to analyze how they are used to produce meaning by contemporary image-makers." They called their approach the grammar of visual design, which is a large, theory-like concept covering all sorts of images, such as photographs, movies, drawings, paintings, and advertisements. Many scholars have adjusted the theory and its tools to narrower research settings. Bell (2004) was perhaps the first to use the term Visual Content Analysis (VCA), leaning mostly on tools, variables, and values created by Kress and van Leeuwen. In his analysis, Bell studied the differences between cover images of Cleo magazine published 25 years apart. Bell's version of VCA is a readily operable adaptation of the original grammar, and it is particularly suitable for photographic analysis. Because of its practicality, we have chosen to use it as one of the major sources for the AICE framework we introduce in this paper. Bell (ibid., p. 8) describes VCA as "a systematic, observational method used for testing hypotheses about the ways in which the media represent people, events, situations, and so on. It allows quantification of samples of observable content classified into distinct categories." In the same manner, Riffe et al. (2019, p. 3) define quantitative content analysis "as the systematic assignment of communication content to categories according to rules, and analysis of relationships involving those categories using statistical methods." Although Bell and Riffe speak in the context of media or communication content, there is nothing that restricts the use of VCA only to those research areas. Lutz and Collins (in Rose 2016, p. 87) separate three factors which favor the use of content analysis in dealing with relatively large numbers of images. First, they suggest that content analysis can reveal empirical results that might otherwise be overwhelmed by the sheer bulk of material under analysis. Second, they insist that content analysis and qualitative methods are not mutually exclusive and that content analysis can include qualitative interpretation. Finally, they also suggest that content analysis prevents a certain sort of 'bias', because it is a method for analysing images that does not rely on pre-existing interpretative categories. The factor separating quantitative analysis, like VCA, from many other main research techniques, such as qualitative analysis or discourse analysis, is its focus on the 'manifest content'. Riffe et al. (2019, p. 22) state that important in this specification is focusing on content's "manifest (or denotative or shared) meaning as opposed to connotative (or latent) 'between-the-lines' meaning". They stress that quantitative content analysis is the "systematic and replicable examination of symbols, which have been assigned numeric values according to valid measurement rules." The analysis, then, involves comparing relationships of those values using statistical methods. According to Bell (2004, pp. 7-8),
typical research questions which may be addressed using content analysis include questions of the priority/salience or 'bias' of the content and historical changes in modes of representation of, for example, gender, occupational, class, or ethnically codified images. As stated by Emmison and Smith (2000, p. 58), one of the principal advantages of quantitative visual research is the historical depth which can be obtained in an inquiry. One of the main challenges of content analysis has been that, if a study focuses on human behavior or visibility in photographs (as the typical cases are), even selecting the photos depicting people is laborious. The process is most burdensome, understandably, with photo archives that are not digitized. With such archives, collecting and selecting material and annotating it manually is a very time-consuming practice, as researchers must examine each photo individually. This is why it is usually not possible to consider a whole large-scale dataset with the traditional manual methods. With digitized archives, the process is mitigated as researchers can use dedicated software to browse and store images and to use search tools. As the search tools are usually text-based, this typically means that only the original caption is used to describe the content. In such cases, it is not possible to make true content-based searches of the collection. Otherwise, the workflow with digitized archives is quite similar to the traditional process where a researcher actually picks printed photos from a drawer or filing cabinet or uses a microfilm reader. With these traditional workflow modes, a researcher observes features or elements of a photo that are relevant for particular research questions. In quantitative analysis, a researcher or coder writes down certain aspects of a photo to a spreadsheet or database. Typically, these aspects are the number or qualities of humans or of various object classes. As this observation process is time-consuming, these object classes usually form just a fraction of the total number of features in the photos. A fundamental difficulty with visual media is, as Manovich and Douglass (2013) see it, that they do not have a standard vocabulary or grammar. This makes the adaptation of automatic visual analysis challenging. The solution Manovich and Douglass (ibid.) suggest is to focus on "visual form", such as size, color, or saturation, which is easy for computers to analyze, and not on semantics, which are harder to automate. However, we believe that systematic semantic analysis of visual content has the potential to provide much more interesting research outcomes, and we consider the current state-of-the-art machine learning techniques adequate for starting to use them also for practical semantic analysis purposes.

In the next section, we describe our novel AICE framework designed for systematic large-scale visual content analysis using automatic computer vision techniques. Our main purpose is to transform and expand the traditional visual content analysis framework in a way that is compatible with current and future automatic image analysis methods. We formulate our framework in a similar manner as the version of VCA by Bell (2004) discussed in the previous section, but we significantly enrich it with new variables, values, and connections to existing machine learning techniques.
The guiding light for Bell, which we also follow in our framework, was that in order to choose variables, visual content "must be explicitly and unambiguously defined and employed consistently ('reliably') to yield meaningful evidence relevant to an hypothesis" (Bell 2004, p. 9). To do that, it is "first necessary to define relevant variables of representation and/or salience. Then, on each variable, values can be distinguished to yield the categories of content which are to be observed and quantified." (ibid.)

Our AICE framework is summarized in Tables 1-3. The first two columns form the core of the framework. They represent the variables and corresponding values for content analysis, considering both the needs of photographic studies and the feasibility of automatic analysis. The variables are divided into six main topics. Technical variables ('1 Technical') share some common, basic features of every image, and they are useful in many subsequent higher-level image analysis tasks (e.g., social distance estimation (Seker et al. 2021b)). With modern photographs, variables 1.2-1.4 can usually be obtained from the metadata, but for historical photos this is typically not the case. When the information is not included, it may also be predicted from the photo content and, therefore, is included in our AICE framework. Especially for historical photos, the camera technical variables may also be a study question as such, e.g., for analyzing the development of photography. Predicting the photographer from a general photo is naturally almost impossible, but it may be possible to some extent when the possible pool of photographers is limited (Chumachenko et al. 2020). The variables in '2 Composition' and '3 Modality' correspond to the "visual form" that is mostly easy to evaluate from the images. The semantic image content and participants, i.e., topics, people, objects, settings, and their properties, are included in '4 Content/Participants', whereas the interaction between the participants and between the participants and the camera is evaluated in '5 Interaction'. The variables in '6 Visual similarity', on the other hand, bring into our framework many important applications of visual analysis based on appearance-based similarity, which may be evaluated focusing on different variables defined in AICE. The practical use of these variables will be illustrated in different examples below. It should be taken into account that the variables are interconnected. Machine learning methods for more complex tasks may need other variables as inputs; e.g., to perform age estimation, the people in the images must be detected first. In some cases, the same machine learning method (e.g., a trained CNN) may evaluate several variables (e.g., height and weight estimation) or partial combinations of different variables (e.g., a classifier can categorize both scenes and events). The values suggested in the second column are only advisory. After selecting the variables of interest, the target values must be decided based on the research goals (and constrained by available machine learning techniques). The third column ('Difficulty') aims to give an impression of the difficulty of the tasks for the current automatic image analysis and machine learning algorithms. 'Trivial' stands for variables that can be determined from the image easily and without any errors, and 'Easy' tasks can be performed using the existing algorithms with a high accuracy, but there may be some errors or ambiguities.
'Medium' tasks can currently be performed to some extent, but errors are expected. Finally, 'Difficult' tasks may remain partially unsolvable also in the future. For example, professions may be recognized in some special cases where distinctive clothing or equipment is used, while defining the profession of a random person walking on a street during his/her free time is almost impossible. It should be noted that the evaluation is based on the subjective views of the co-authors with a background in machine learning research. No experimental comparisons of the difficulty levels were conducted. The difficulty levels depend significantly on the source images (e.g., the time period and environment where the photos were taken) and are subject to change as new algorithms for different tasks are being developed. Furthermore, it should be noted that we evaluated the difficulty levels considering only the difficulty of the task itself, assuming that large-scale annotated training datasets are available. However, the more specialized the analysis task at hand, the less likely it is that a suitable training dataset can be found, or the available training sets may contain only a small subset of the topics of interest. We return to the challenges of the training data in Section 6. Finally, the last column ('Tasks') links the visual content analysis variables with the available methods and terminology used in the computer vision field. The description of the specific methods available for different tasks is beyond the scope of this paper, but interested readers may refer to articles in the computer vision and machine learning fields for more information. Examples of state-of-the-art methodologies for the different problems include classification (Krizhevsky, Sutskever, and G. E. Hinton 2012) as well as photographer recognition and framing classification (Chumachenko et al. 2020). We can now return to the recent pilot works demonstrating the use of CNN models for photographic studies described in Section 1. The first task considered by Wevers and Smits (2020) was to classify images into illustrations and photographs.

In this section, we have collected some pioneering studies that have used relatively large collections of images (photos the researchers themselves have taken, advertisements and media images, photos stored in various photographic archives) in many fields of humanities and social sciences. We describe the material collection and the research methodology used, and we elaborate how applying machine learning tools according to the proposed AICE framework could have advanced the process of collecting material or determining variables and their values, and, thus, possibly enhanced the results. The list of studies selected here is, of course, by no means comprehensive. In traditional photographic research settings of media images using visual content analysis, the number of photographs is typically between 400 and 1,000. As an example of war- or crisis-related topics, Griffin and Lee (1995) constructed a systematic inventory of the types, range, and frequency of photographic images presented in American news magazines during the Persian Gulf War with 1,104 images (see also Griffin 2004; Männistö 2004). In all these studies, as in all the other studies discussed later in this section, one of the main benefits of using automatic image analysis would be the possibility to conduct the study on a much larger dataset, possibly collected from additional sources and time periods.
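To make the scale advantage concrete, the sketch below shows how one AICE content variable, '4.2.2.1 Number of people', could be extracted for an entire folder of digitized photos using a publicly available object detector pretrained on the COCO dataset (here via the torchvision library). This is our minimal illustration, not part of any of the studies discussed in this section; the folder path and confidence threshold are assumptions to be adjusted per archive.

```python
# Minimal sketch: extracting AICE variable '4.2.2.1 Number of people'
# for a folder of digitized photos with a COCO-pretrained detector.
# The folder path and score threshold are illustrative assumptions.
from pathlib import Path

import torch
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()
PERSON = weights.meta["categories"].index("person")

@torch.no_grad()
def count_people(path, score_threshold=0.7):
    """Number of detected persons in one photo."""
    image = preprocess(Image.open(path).convert("RGB"))
    result = model([image])[0]
    keep = (result["labels"] == PERSON) & (result["scores"] > score_threshold)
    return int(keep.sum())

counts = {p.name: count_people(p) for p in sorted(Path("archive").glob("*.jpg"))}
```

The same loop generalizes to any object class the detector knows, and the per-photo counts can be written directly into the kind of spreadsheet traditionally filled in by a human coder.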
One of the pioneering projects using photographic information in the field of visual anthropology was 'Balinese Character: A Photographic Analysis' by Bateson and Mead (1942). The study was carried out in the late 1930s with the aim of inquiring into Balinese culture. At the center of the book, there are 759 photographs organized into different themes, selected from an original pool of 25,000 photos. The analysis focused on gestures and bodily expressions (Emmison and Smith 2000). Machine learning tools for the AICE variables '5.3.4 Pose of MC/SC', '5.3.5 Action of MC/MM', '5.4 Interaction', and '5.4.3 Human-object interaction' could be used to automatically repeat a similar analysis on the full dataset. Goffman (1979) analyzed stereotyped portrayals of males and females in magazine advertisements in the seminal work 'Gender Advertisements', including 508 images. After Goffman's work, several researchers have extended the inquiry into gender stereotypes using replications of Goffman's content analysis methodology, as stated by Bell and Milic (2002, p. 205) (see also Emmison and Smith 2000). For our paper, it is beneficial to take a closer look at the way Bell and Milic (2002) revisit Goffman's work. Their goal was to demonstrate the usefulness of combining quantitative analysis with a semiotic approach. In order to do this, they developed Goffman's approach further, utilizing dimensions of visual structure derived from and operationalized using Kress and van Leeuwen's system of analysis (see Section 2). They used this apparatus to analyze 827 advertisements collected from popular Australian magazines published during 1997-98 (ibid.). Many of the hypotheses Bell and Milic (ibid.) include are relatively easy to analyze automatically. Examples of such hypotheses are: "In terms of social distance, females will be framed more intimately and be less likely to be represented at a 'public' distance than men", "Women, more frequently than men, will be shown on the right section of a layout that is structured along the horizontal axis", and "Women will less frequently be depicted gazing at the camera/viewer than men". AICE variables such as '4.2.1.3 Gender', the distance of '5.2.2 Of MC to others', and '5.3.1 Gaze direction of MC' correspond directly to these hypotheses. An aspect to which Bell and Milic (ibid.) give a lot of attention is the formal relations between depicted elements, which can be analyzed in terms of position in the frame. The AICE framework is particularly capable of acknowledging the spatial distribution of persons and objects inside the frame, both in the image plane and in depth. Several AICE variables, such as the spatiality of the main character and the main motif in the image plane ('5.1.1 Of MC in image plane', '5.1.3 Of MM in image plane') or in depth ('5.1.2 Of MC in depth'), will contribute to this. With machine learning tools, it would be easy to also add other related variables such as '4.2.1.1 Status', '4.1.2 Salience', '4.2.1.2 Age', or '5.3.5 Action of MC/MM'. Kozol (2009) began in the 1980s to research Life magazine, which is considered the most influential publication of visual news in the period after the Second World War. She wanted to analyze whether the traditional "family values" were referenced as an idealized portrait of the white, middle-class nuclear family consisting of female housewives and male breadwinners. While collecting the material, Kozol faced several challenges. The first and foremost problem was the amount of material. There were simply too many issues to examine.
In the end, Kozol selected for her study every issue from the months of October and May across fifteen years of the postwar period, looking for news photo-essays that included pictures of families. Kozol's experience is a typical example of a case where a researcher cannot study the large body of material as a whole. It is also a good example of where machine learning tools could enhance the process. Machine learning could both select the potential family photos by picking photos containing a woman, a man, and children (AICE variables '4.2.2.1 Number of people', '4.2.1.2 Age', '4.2.1.3 Gender') and be used to carry out further analysis including '4.2.1.5 Ethnicity' and '4.2.1.8 Occupation'. While occupation may be difficult to define automatically (as it is also manually), housewife detection could be eased by detecting actions ('5.3.5 Action of MC/MM') such as cooking, cleaning, or childcare. Lutz and Collins (in Rose 2016, pp. 87-93) made a content analysis of nearly 600 of the photographs published in National Geographic between 1950 and 1986. They collected the sample by choosing one photo at random from each of the 594 articles on non-Western people. This procedure produced a manageable number of photos that Lutz and Collins and their research assistants could analyze manually. In a close reading of those photos, they examined issues of race, gender, privilege, progress, and modernity through an analysis of factors such as color and pose. If Lutz and Collins had been able to utilize a machine learning-based automatic method for analysing the content, it would have enabled them to use a much larger sample, e.g., all of the several thousand photos that were in the original 594 articles. While some of the included tasks cannot currently be performed automatically at human accuracy, the larger sample would still give statistically significant results, and more variables could be easily added. Grady (2007, p. 211) made a close examination of depictions of black persons in advertisements published in Life magazine, 1936-2000. His study is a good example of a very demanding research setting in which carefully developed quantitative data form a ground for making interpretations of sensitive questions, such as how the commitment of the white population to racial integration has changed. Grady's sample consisted of nine chronologically stratified periods from 1936-2000 and contained a total of 590 advertisements, which were scanned and entered into a database. The coding was done manually on a spreadsheet. Two of the most central observations in Grady's study (ibid., pp. 225, 234) are very interesting as they are combinations of variables in the proposed AICE framework. The first observation was on the percentage of black persons depicted in personal or private settings. This could be operationalized in AICE using the variables '4.2.1.5 Ethnicity' and '4.4.3 Privacy'. The other observation was on the percentage of black persons in the sample where there is eye contact between black and white persons. The values in this variable were "Eye contact between races" and "One race looking at another". Grady noticed that his analysis shows that in the iconography of segregation, black persons tended to look at whites admiringly, while whites looked at blacks in a more patronizing and condescending fashion. Today, Grady concludes, there is not much difference in how these two ethnic groups look at each other; it is just that they hardly do so in advertisements.
A part of a similar analysis could be conducted by combining the AICE variables '4.2.1.5 Ethnicity' and '5.3.1 Gaze direction of MC'. However, evaluating whether a look is admiring or patronizing would be very challenging. Even though the task can be seen as evaluating the variable '5.3.3 Expressions of MC/SC', it is an example where the selected target values make the task very difficult. Furthermore, annotated large-scale datasets for expression recognition do not contain these categories, meaning that a lot of data would first have to be manually annotated to be able to even try training a machine learning model. Of the above-mentioned studies, we take Wilmott's paper for closer examination. It was published in 2017 and was thus made at a time when many machine learning tools were already available but were not used. The paper is notable and rare as it brings together a detailed VCA setting of a very serious theme and the use of Securitization Theory, which is a special branch of the study of international relations. In that context, Wilmott (2017, p. 68) focused on visual communication acts, or what she refers to as picture acts. Two of the three research questions Wilmott sought to answer were such that they could benefit from the use of AICE: 1) How does gender figure into the visual depictions of Syrian refugees? 2) How do the visual depictions of Syrian refugees differ across U.K. online newspapers? The U.K. online quality news websites in Wilmott's study were The Guardian, The Telegraph, and The Independent. Thus, the sample reflects three different affiliations in the political spectrum. Every newspaper article in a certain time frame was scanned, and all the photographs of Syrian refugees were collected and included in the study according to several criteria, finally yielding the inclusion of 299 photographs. Several coders annotated the material utilizing a codebook created originally by the Media Department of the London School of Economics (Wilmott 2017, p. 71). The research was built on several different technical considerations to evaluate whether photographs humanize or dehumanize refugees (ibid., p. 72). Most of these considerations could be easily automated following the AICE framework. Wilmott's first interest was to examine whether refugees are portrayed as individuals. This was done because previous studies assert that photographs of individual victims evoke more empathetic emotions in viewers than groups of victims, and further that large groups overshadow individuality and attribute common characteristics to all members of a given social group. Following preceding examples, Wilmott coded images of refugees into four categories, from (1) individuals to (4) large groups (AICE variable '4.2.2.1 Number of people'). The content analysis was then further refined through an experiment that explored a second factor: camera distance, using traditional categories of frame of view, starting from various types of 'close-ups' and ending with various 'overview shots' ('2.2 Photo framing'). Moreover, the coding also looked at whether the refugees look directly into the camera ('5.3.1 Gaze direction of MC'). These decisions were based on literature that suggests that closer photos, especially those in which subjects look directly into the camera, produce a more intimate association with refugees (ibid., p. 72). Furthermore, as Wilmott (ibid.) describes, men are more often associated with violent behavior than women, and a group of male migrants more effectively visualizes the feeling of threat.
Therefore, the photographs were coded as 'male and female', 'female', 'male', 'only children depicted', and 'unclear' ('4.2.1.3 Gender' and '4.2.1.2 Age'). Wilmott's findings indicate that the image of the refugees has become mainstreamed and that there were no major differences in representing refugees between the compared newspapers. As the number of photos in the study is rather low, only 299, it is obvious that the analysis of such a small sample would not be an ideal, or even proper, use of machine learning tools at all. However, we note that the results have not been compared to the average images in the investigated newspapers and, therefore, the conclusion of the image of the refugees becoming mainstreamed lacks some evidence that could be obtained by comparing with the actual mainstream images. Furthermore, while no significant differences between the newspapers were found, comparisons against the average photos in the respective papers could show whether this differs from the typical images and whether there are differences between the newspapers from this point of view. We will get back to this example in the following section.

5 New opportunities in photographic studies brought by machine learning

While we have already mentioned several times that machine learning methods are especially suitable for very large datasets, it should be emphasized that this suitability especially concerns comparisons. Machine learning methods tend to make some errors. For example, it could be that a distance estimation algorithm tends to somewhat underestimate the distance of the main character to the camera, or a person detection algorithm might find only 95% of the people in the photos. However, these sorts of errors can be considered systematic as long as the considered sample is large enough and the photos to be compared have a similar quality. Therefore, claims such as "Photo collection A contains exactly 4521 persons" or "In photo collection B, the average distance of the main character to the camera is 2.3 m" may not be feasible, but the differences between datasets can be more accurately estimated, e.g., as "In photo collection A, photos have on average 30% more people than photos in collection B."

We now return to Wilmott's study (ibid.) on depicting Syrian refugees in newspapers, which we described in the previous section. AICE could bring a totally new layer of analysis as it could annotate all the photographs in a certain period in the three newspapers, not just the 299 belonging to articles concerning Syrian refugees. By annotating the whole material of thousands of images, the machine learning-enhanced analysis could create a reliable reference material. Wilmott's study did show that the differences between the newspapers were small, but it did not tell how the samples compared in relation to the typical ways each of the newspapers represents the qualities of persons in photographs. Consider, for example, The Independent. What if it turned out that it in general has a particularly low number of portrayals with eye contact, while at the same time that number is very high in the other two newspapers in question? In this imaginary case, the relative difference between how Syrian refugees are depicted making eye contact and the general way of showing eye contact in person photos would be considerably different between the newspapers. This is of course just a hypothetical example, but it nevertheless illustrates how the ability to use the entirety, not just a fragment, of the photos in the analysis would significantly broaden the analysis.
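Returning to the point about systematic errors, a toy simulation illustrates why relative comparisons remain reliable even when a detector misses a fixed fraction of the people; the 95% recall figure and the synthetic photo collections below are assumptions chosen purely for illustration.

```python
# Toy simulation: a detector that finds only ~95% of the people in each
# photo understates absolute counts in both collections, but the
# relative difference between the collections is largely preserved
# (provided the photos are of similar quality, as noted above).
import random

random.seed(0)
RECALL = 0.95  # assumed fraction of people the detector finds

def detected_total(true_counts_per_photo):
    return sum(
        sum(random.random() < RECALL for _ in range(n))
        for n in true_counts_per_photo
    )

collection_a = [random.randint(0, 8) for _ in range(5000)]  # true people per photo
collection_b = [random.randint(0, 6) for _ in range(5000)]

true_ratio = sum(collection_a) / sum(collection_b)
estimated_ratio = detected_total(collection_a) / detected_total(collection_b)
print(f"true A/B ratio {true_ratio:.3f}, estimated {estimated_ratio:.3f}")
```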
Another important aspect is that AICE allows easily including multiple variables in the research setting and looking for any interesting correlations between them. It is understandable that, using the traditional methods, no one wants to invest a large amount of time in manually annotating variables that are unlikely to be interesting for the study at hand. However, with machine learning-based methods, such analysis can be carried out with little extra effort. While in many cases this may indeed confirm that the variable is not interesting, it is also likely that at some point such trials lead to interesting observations and novel insights into phenomena that would never have been revealed by the traditional research methodology.

In our view, two interrelating dimensions can be separated in the spatiality and use of space in photographic studies: (1) the positions of persons and objects inside the frame, both in the image plane and in depth, and (2) the social distance of participants in an image. Analyzing spatiality has been an important goal in visual research ever since Hall (1966) published his theory of proxemics and social distances. Spatiality is an important aspect also for Emmison and Smith (2000), the authors of the book Researching the Visual, considered one of the most comprehensive introductions to visual inquiry. According to Emmison and Smith (ibid., p. 4), objects, people, and events, which constitute the raw materials for visual analysis, are not encountered in isolation but rather in a specific context: "It is this spatial existence which serves as the means whereby much of their sociocultural significance is imparted. Visual data, in short, must be understood as having more than just the two-dimensional component." They (ibid., p. 191) describe that everyday public behaviour has a spatial and territorial component to it. Their thinking leans on some classical texts in sociology, such as Simmel's 'The Sociology of the Senses' from the early 20th century. Simmel drew attention to one of the key features of modern life, that interaction is based on sight rather than sound (ibid., p. 192). The study of the ways humans utilize or orient to space described by Hall is essential also for the theory of Kress and Leeuwen (1996). According to Hall, in everyday interaction, social relations determine the distance (literally and figuratively) we keep from one another. We carry with us a set of invisible boundaries beyond which we allow only certain kinds of people to come. The location of these invisible boundaries is determined by configurations of sensory potentialities: by whether or not a certain distance allows us to smell or touch the other person, for instance, and by how much of the other person we can see with our peripheral (60 degree) vision (Bell 2004, p. 24; Kress and Leeuwen 1996, pp. 130-131). At the core of Hall's theory is a typology with categories corresponding to different fields of vision. At intimate distance, we see the face or head only. At close personal distance, we take in the head and the shoulders. At far personal distance, we see the other person from the waist up. At close social distance, we see the whole figure. At far social distance, we see the whole figure 'with space around it'. And at public distance, we can see the torsos of at least four or five people. (Hall 1966, pp. 114-129; see also Bell 2004; Kress and Leeuwen 1996, pp. 129-131.)
For Kress and van Leeuwen, equally important as the social distance is the use of space in terms of the positions of persons or objects within the frame. They show, for example, that the value of an element in the top half of a portrait-shaped frame is different from its value if located in the bottom half. A similar difference in information value exists also between the dimensions of center and margin in an image structure (ibid., pp. 193-208; see also Bell and Milic 2002, p. 211). While exploring the changes and trends in the ways humans utilize space in everyday interaction is at the core of many historical and sociological studies using photographic archives as source material, proper tools for finding and measuring spatial qualities effectively have been lacking. Acknowledging and operationalizing them with machine learning tools would mean a major leap in photographic studies. Basically, this could enable considering 2D photos as representations of a certain 3D situation or scene. By doing that, machine learning has the potential to revolutionize many traditional research settings, especially those interested in the interaction of people. As shown in Table 3, we have adopted the above-described proximity typology as suggested target values for the distance-related variables under '5.2 Distances'. This typology can be utilized in at least three ways in photographic studies, all of which could be operationalized with different machine learning tools. First, it tells about the physical distance between the photographer and the main character or theme ('5.2.1 Of MC to camera'). This distance unavoidably affects the interaction between the photographer and the main character. Second, determining the mutual distances of persons in photos ('5.2.2 Of MC to others') will give information on what kind of relations the persons have with each other: whether they are close to or far away from each other, how they are interacting, and so on. Third, the typology can be used to evaluate group formation ('4.2.2.2 Number of groups') and distances between groups ('5.2.3 Between groups'). Furthermore, the fields of vision linked with the proximity typology closely correspond to the traditional definitions of photo framing in film and television. Framing is also a basic property and feature of every photograph, telling whether it is a close-up or an overview (or something in between) of the main character or theme. When this information is automatically determined (AICE '2.2 Photo framing') and included in the metadata of a photo, it will serve as one of the basic variables in searching images from large image archives in many professional use cases. Photo framing information can be accompanied by information on the camera angle ('2.3 Camera angle'). To bring out more aspects of the three-dimensional situation that took place in front of the camera, the variable '5.1.7 Spatial relationships' focuses on how the participant locations are linked to each other (e.g., under/next to/behind/in front of).
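As a sketch of how the distance variables could be operationalized, the snippet below maps an interpersonal distance estimate in metres, assumed to come from an upstream model such as a social distance estimator (Seker et al. 2021b), onto the proximity typology. The metric boundaries are the values commonly cited for Hall's zones and may need adjusting for a particular study.

```python
# Sketch: mapping an estimated interpersonal distance (metres) to the
# proximity typology used as target values under '5.2 Distances'.
# The distance itself must come from an upstream estimation model;
# the boundaries below are commonly cited values for Hall's zones.
HALL_ZONES = [
    (0.45, "intimate"),
    (0.76, "close personal"),
    (1.20, "far personal"),
    (2.10, "close social"),
    (3.60, "far social"),
]

def proxemic_zone(distance_m: float) -> str:
    for upper_bound, zone in HALL_ZONES:
        if distance_m < upper_bound:
            return zone
    return "public"

# e.g., pairwise zones between the main character (MC) and others
distances = {("MC", "person_1"): 0.9, ("MC", "person_2"): 4.3}
zones = {pair: proxemic_zone(d) for pair, d in distances.items()}
print(zones)  # {('MC', 'person_1'): 'far personal', ('MC', 'person_2'): 'public'}
```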
To give a more practical example, we turn our attention to a study by Temelová and Novák (2011) on daily street life in Prague. Their basic questions were "Who are the users of the space?" and "Are there spatial and temporal differences in the manner various social groups use the public space?". In order to demonstrate dimensions of differentiation, they explored the heterogeneity of users based on differences in people's wealth, age, ethnicity, and everyday practices. Methodologically, their study relied primarily on fieldwork and direct observation of the incidents and events which characterise the outdoor life of the area. Photographs were used only to illustrate typical and special ways of utilizing public space and people's mutual distances, and also to identify people belonging to various categories (age, professions, etc.). The kind of research setting Temelová and Novák carried out would be possible to create, e.g., by using a large number of photographs taken from a fixed position at certain intervals of time. Along with the above-discussed variables related to spatiality, it would be easy to automatically analyze other variables Temelová and Novák were interested in, such as '4.2.1.2 Age', '4.2.1.3 Gender', '5.3.4 Pose of MC/SC', and '5.3.5 Action of MC/MM'. Others, such as whether a person belongs to the category of "managers and professionals", estimated in terms of his/her clothing, are of course more challenging, but they may be approximated using the variables '4.2.1.8 Occupation' and '4.2.1.12 Clothes'. This kind of approximation is present also in the original research setting using direct observation. Moreover, if specific machine learning tools are developed and fine-tuned for settings like this, a similar study could then be easily repeated in many cities or places as a comparative approach. It should be noted that the differences and trends related to proxemics may naturally be quite subtle and not possible to extract from a small set of images. Instead, very large datasets are needed to reliably evaluate average behavioral traits. Such research has not been feasible using traditional manual research methods and, thus, machine learning methods may open a novel, unexplored research direction. Analyzing the photographic contents of the news media during the COVID-19 crisis, for example, would provide valuable material for observing the changes in people's behavior in terms of their use of personal space. During this "corona era", public discourse was filled with notions concerning how far away people should be from each other. Reporting of the crisis filled or affected practically the whole coverage of the news media, and the time span was rather long. It would thus form a good body of material against which to compare findings in different countries or from a reference period a few years earlier.

Analysing interaction, gestures, postures, touching, or gazes between persons in photos is a typical task in many photographic research settings. As Emmison and Smith (2000, pp. 218-220) put it, humans use body language consciously or unconsciously to convey information to others, e.g., about emotions, social status, openness to interaction, and sexual arousal. Much of the most interesting theory is on the role of gaze and eye contact in contemporary social life. However, as the authors note, such work is difficult to conduct because of the unobservability of the gaze. As Simmel and others have observed, the modern city has brought people into proximity. This has caused problems for the use of gaze and also, on occasion, for touch (ibid., p. 225). With traditional visual inquiry methods, analysing these kinds of variables from photographs is one of the most time-consuming phases of any research process, as it requires a lot of interpretation.
Machine learning offers several tools for automating this kind of analysis, shown in AICE under the variable categories '5.3 Activity' and '5.4 Interaction'. Here, it should be noted that while evaluating a person's pose (standing/sitting/lying) is relatively easy, it is challenging to evaluate actions involving movement, as single shots of, e.g., a walking, running, or jumping person may look very similar, and more information (video or several shots) is needed for reliable estimation. On the other hand, tools for '6.1 Similar images' can be used to find out whether an interesting gesture or pose appears repeatedly, and tools for '6.2 Appearance-based grouping' can be used to divide actions and poses automatically into groups containing similar gestures (see Section 5.5). Combining these data (what a person is doing and where he/she is looking) with scene recognition (e.g., '4.4.4 Scene' and '4.4.5 Event') and with the spatial information discussed in Section 5.2 will produce an unprecedentedly rich compilation of the content properties of an image. These data are useful as such when making automatic annotations of photographic contents for the various needs of image archives. They also enable constructing new kinds of search tools and methods and will, thus, contribute to improving the accessibility of image archives. The tools would be useful in several fields of social sciences and humanities, studying, e.g., possible changes in the external forms of, or interaction between, populations in a certain time span. As a practical example, pose estimation, such as the OpenPose model (Cao et al. 2019) illustrated in Fig. 1, could be used in a sociobiological study similar to the one Klein carried out searching for differences in the sitting postures of males and females. In Klein's study, postures were recorded for 600 men and women, and the study focused on typical ways of sitting, depending on gender, age, and place, in one city in 1984 (Emmison and Smith 2000, pp. 220-226). With machine learning tools, it would be possible to classify not just the original material, but also reference material from other cities and other times, depending of course on the available archives.

Figure 1: An example of a skeleton model (Cao et al. 2019) that can be used for automatic pose analysis. (Image source: https://www.pexels.com)

Furthermore, automatically clustering sitting poses according to the AICE variable '6.2 Appearance-based grouping' could reveal typical sitting pose variants that were not observed in the manual examination.
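On top of the keypoints produced by a pose estimator such as OpenPose, even simple geometric rules can approximate a coarse pose variable like '5.3.4 Pose of MC/SC'. The heuristic below is our illustration only: the joint names, pixel coordinates, and the 0.5 ratio are assumptions, and a real study would rather train a classifier on annotated keypoints.

```python
# Crude sitting/standing heuristic on 2D keypoints (image y grows
# downwards). Keypoints are assumed to come from a pose estimator;
# joint names and the 0.5 ratio are illustrative assumptions.
def pose_label(keypoints):
    """keypoints: dict mapping joint name to (x, y), or None if unseen."""
    hip = keypoints.get("hip")
    knee = keypoints.get("knee")
    ankle = keypoints.get("ankle")
    if not (hip and knee and ankle):
        return "unknown"
    thigh_drop = abs(knee[1] - hip[1])    # vertical extent of the thigh
    shank_drop = abs(ankle[1] - knee[1])  # vertical extent of the lower leg
    # When sitting, the thigh is roughly horizontal, so its vertical
    # extent is small compared to the lower leg's.
    return "sitting" if thigh_drop < 0.5 * shank_drop else "standing"

print(pose_label({"hip": (100, 300), "knee": (145, 318), "ankle": (148, 420)}))  # sitting
print(pose_label({"hip": (100, 250), "knee": (102, 340), "ankle": (104, 432)}))  # standing
```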
Defining the most salient element is a task present in almost every research setting in humanities and social sciences that uses photographs as source material. The most salient elements are usually also those which at minimum need to be acknowledged and annotated when describing the content of a photograph. As shown earlier in this paper, human behavior or visibility is typically at the core of the study in many research settings. In such cases, the researcher usually needs to define who is the main character, which in most cases is also the most salient participant in the photograph. Detecting salient elements can thus be considered one of the central functions of visual content analysis. As Bell (2004, p. 21) describes, content analysis can show what is given priority or salience and what is not. It can show how images are connected, who is given publicity and how, as well as which agendas are run by particular media. Kress and Leeuwen (2021, pp. 179-220) discuss salience in detail with examples of many types of images (movies, classical art, advertisements, and magazine page layouts). For them, salience indicates that "some image elements are made more conspicuous than other elements (...) and this makes them more important [items] of information in the whole" (ibid., p. 181). Salience is also one of the three key compositional principles of images, alongside framing and the information value of different parts of an image (ibid., p. 204).

Traditionally, detecting salience in images has required a lot of attention and interpretative work from the researcher, as salience can result from a combination of several different variables. This becomes obvious when looking at how Kress and Leeuwen (ibid., p. 211) define the term: "Visual salience results from a complex interaction (...) between a number of factors. These include size, sharpness of focus, (...) areas of high tonal contrasts (...), color contrasts, placement in the visual field, perspective (foreground objects are more salient than background objects, and elements that overlap other elements are more salient than elements they overlap)". Also, as Kress and Leeuwen (ibid., p. 216) point out, it is not only what is made salient, but also how it is made salient that contributes to the meaning. Machine learning methods offer a way to carry out large-scale salience analysis (AICE '4.1.2 Salience') in an objective manner and to combine the salience analysis with other AICE variables or with the textual content to analyze how certain topics are illustrated. Recognizing and analyzing the main character can be seen as a subtask of salience analysis, and it is traditionally one of the basic tasks, and variables, in a wide variety of photographic studies in the fields of social sciences and humanities. A recent work (Seker et al. 2021a) demonstrated that it is possible to automatically recognize the main character with high accuracy in different kinds of photos, as illustrated in Fig. 2.

It is worth emphasizing that the new opportunities machine learning-enabled analysis provides extend beyond the existing predefined categories. When deep neural networks are trained, they internally learn to form generalized representations of the images that allow them to classify also previously unseen images. These representations can also be extracted and used for searching or grouping images based on their visual similarity (AICE '6 Visual similarity'), as also discussed by Arnold and Tilton (2019) with respect to distant viewing of databases. Perhaps the most straightforward way to exploit these internal representations is content-based image retrieval (Tzelepi and Tefas 2018) via query by example. This means that instead of searching for a particular category, it is possible to search for images that are most similar to a particular query image or a group of query images (AICE '6.1 Similar images'). Here, it should be noted that the similarity always depends on the task the training was performed for. For example, if a network was trained for recognizing faces, it will learn to extract representations that help it in this particular task. These representations can then be used for query by face, where the query image can depict any (unknown) face and the network can find the most similar faces. In the same manner, the representations learned by networks trained for the corresponding tasks can be used, for example, for querying by objects, environments, or poses.
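The following is a minimal sketch of such query by example, under stated assumptions: a general-purpose pretrained ResNet-50 from torchvision serves as the feature extractor (a network trained for faces or poses could be substituted to change what "similar" means), the file names are hypothetical placeholders, and the archive is ranked by cosine similarity to the query.

```python
# Sketch: content-based image retrieval with generic deep features.
import torch
import torchvision
from torchvision.models import ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
preprocess = weights.transforms()
model = torchvision.models.resnet50(weights=weights)
model.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
model.eval()

def embed(path):
    """Return an L2-normalized feature vector for one image."""
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)[0]

archive = ["photo_001.jpg", "photo_002.jpg", "photo_003.jpg"]  # placeholders
query = embed("query.jpg")
ranked = sorted(archive, key=lambda p: -float(query @ embed(p)))
print(ranked)  # most visually similar archive images first
```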
An illustrative example of query by pose is provided in Fig. 3, where the top image is used for querying and possible query results with similar poses are shown below it. As the similarity of the photos is measured by the similarity of poses in this example, the other image content can be very different, while the photo is still considered a close match to the query image. A more practical example of using such an approach in photographic studies was provided by Wevers and Smits (2020) in their second task, querying advertisements based on their abstract visual aspects.

Figure 3: An illustration of content-based image retrieval in the case of query by pose. (Image source: https://www.pexels.com)

Furthermore, the network representations can be used for grouping images based on their similarity (AICE '6.2 Appearance-based grouping'). This can reveal similarities and groupings in the data that the researcher would not have known to look for. There are also tools for visualizing such similarities and groupings, as demonstrated in a recent paper (Chumachenko et al. 2020), where the similarities between different photographers were illustrated based on clustering their photos. A simplified illustration of clustering images into groups based on the similarity of poses is provided in Fig. 4, where the clustering algorithm has identified three distinct pose groups. It should be noted that in real photo collections the clusters are rarely so clear and distinct; instead, the images represent a more continuous distribution of poses (or any other feature used for clustering), making the clusters fuzzier by nature.

Figure 4: An illustration of clustering images by the similarity of poses. (Image source: https://www.pexels.com)

As mentioned in Section 5.3, tools for '6.1 Similar images', i.e., querying by example, could be used to find out whether an interesting gesture or pose appears repeatedly in a photo collection, and tools for '6.2 Appearance-based grouping', i.e., clustering based on visual similarity, could be used to divide actions and poses automatically into groups containing similar gestures. These techniques are most suitable for studying concepts for which text-based descriptions are cumbersome but the visual similarity of the studied categories is high. The appearance-based techniques can also help avoid the difficulty of predefining the pose/action categories of interest and may also reveal similarities that could be very difficult to observe manually.
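A toy sketch of such appearance-based grouping is given below; random vectors stand in for real per-photo pose keypoints (e.g., extracted as in the pose sketch above) or deep features, and k-means from scikit-learn partitions the photos into a chosen number of groups. In practice, the number of clusters and the choice of features are research decisions, and, as noted above, real clusters are fuzzier.

```python
# Sketch: cluster per-photo pose vectors into groups of similar poses.
import numpy as np
from sklearn.cluster import KMeans

# One row per photo, e.g. 33 landmarks x (x, y) = 66 values; random
# placeholders stand in for actual extracted keypoints here.
poses = np.random.default_rng(0).random((300, 66))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(poses)
for k in range(3):
    members = np.where(kmeans.labels_ == k)[0]
    print(f"pose group {k}: {len(members)} photos, e.g. indices {members[:5]}")
```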
6 Challenges and risks of the automatic approach

In this paper, we have extensively discussed the new opportunities and directions machine learning-based methods open in photographic studies. However, it is good to keep in mind that there are also challenges and risks related to the new tools. In this section, we review some of these challenges and give our suggestions for mitigating them.

Perhaps the most practical problem is that for many researchers in the humanities with a non-technical background, it may be difficult to find and apply suitable machine learning tools. Even though most of the tools discussed in this paper are publicly available, it requires some expertise to understand which methods are the most suitable ones, how to apply pretrained models, and how to adjust them for specific tasks. On the other hand, there are still interesting novel challenges also from the machine learning point of view. Therefore, we believe that interdisciplinary research between humanistic and machine learning researchers is the most fruitful solution to both the practical issues and the novel technical challenges. Furthermore, we believe that publishing the used algorithms in an easy-to-use format (e.g., as in Seker et al. 2021a) can help other researchers in their work and gradually lead towards easier application of the tools as well as easier comparison with earlier works.

In the analysis of the results, it is important to keep in mind that machine learning models are not error-free, and the harder the task, the lower the expected accuracy. Different machine learning models can also perform very differently on the same task: a low performance may result from an unsuitable or poorly employed model, while other models might perform better. For the same reason, it is important to be aware of the employed models in any comparative studies and not to directly compare the results of studies conducted using different models for the same task, because this might lead to biased or erroneous conclusions. Some of the errors made by machine learning models may appear unexplainable or even absurd due to the black-box nature of the algorithms, which refers to the difficulty of explaining why a model gives a certain output, e.g., a classification decision (Rudin 2019). The black-box nature is an enormous problem in fields such as medicine or autonomous driving, where a single critical error may lead to life-threatening situations. In humanistic studies, such individual errors are much less severe, but the possibility of errors should still be considered in the analysis. For this reason, we see machine learning algorithms as most suitable for searching for suitable research material among very large photo collections and for conducting large-scale comparative studies. As long as errors can be expected to occur at approximately the same rate on each dataset, the results can indicate true findings. At the same time, explainable machine learning (Tjoa and Guan 2020) is expected to develop rapidly due to the pressure from fields with huge economic value, and these developments can then also benefit future humanistic photographic studies.

A more challenging problem than individual errors is biased results or analyses. In their survey on biases and fairness in machine learning, Mehrabi et al. (2021) define fairness in the context of decision making as the "absence of any prejudice or favoritism toward an individual or group based on their inherent or acquired characteristics" and list several different sources of bias in machine learning applications. Most machine learning algorithms are neutral by nature, but they learn based on the training data shown to them, and any bias in the data can be adopted or even emphasized by the model (A. Wang, Narayanan, and Russakovsky 2020). For example, if a facial expression recognition algorithm is trained using mainly photos of young white people, it is natural that it will be less accurate for older people and/or people of color. Many of the data-related biases have a humanistic or societal nature, and more collaboration between humanistic and machine learning researchers will be needed to better understand how these biases are reflected in machine learning-based analysis, not only in humanistic studies but in various applications across society.
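A simple first step for surfacing such biases is to evaluate a model's accuracy per annotated group rather than only in aggregate, as in the toy sketch below; the group labels and outcome counts are hypothetical.

```python
# Sketch: per-group accuracy check as a basic fairness sanity test.
import numpy as np

groups = np.array(["young"] * 70 + ["old"] * 30)              # label per photo
correct = np.array([1] * 63 + [0] * 7 + [1] * 18 + [0] * 12)  # 1 = model correct

for g in np.unique(groups):
    mask = groups == g
    print(f"{g}: accuracy {correct[mask].mean():.2f} over {mask.sum()} photos")
# A large gap (here 0.90 vs 0.60) suggests the training data or the model
# underserves one group and warrants closer inspection.
```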
Besides the training data, the design of the machine learning models may also introduce biases. For example, our recent social distance estimation algorithm (Seker et al. 2021b) assumes that people's torsos are upright and underestimates the distances if this assumption is violated. This happens more commonly when people are sitting, which, in turn, most likely leads to more underestimation in indoor images. To tackle the different biases, we further emphasize the importance of close collaboration between humanistic and machine learning researchers, as well as the open publication of the applied models and data whenever possible without privacy violations (Borgesius, Gray, and Eechoud 2015). This will allow critical evaluation of the results and biases in future studies and gradually lead to a better understanding of the possible pitfalls.

When machine learning becomes more widely used in humanistic studies, it is expected that the availability of high-quality datasets will encourage research in the related themes. While these opportunities can open exciting new directions in humanistic studies, it may become necessary to consider whether data availability will dictate the topics and research questions too much, leading to increased inequality, e.g., by favoring majority cultures over minority cultures as research topics. In a similar way, if a certain category of interest is missing from the available training data, it may be tempting to simply omit the category from the research, leading to altered research directions. To avoid such problems, the humanistic field should not see machine learning merely as a ready-to-use tool, but should also take responsibility for annotating large-scale image collections with the concepts relevant for subsequent humanistic studies. To this end, the development of faster and easier annotation tools, as well as of incremental learning methods that allow adding new categories to already trained algorithms (e.g., D. Roy, Panda, and K. Roy 2020), is important for future photographic studies.

7 Conclusions

The introduction of novel machine learning methods has the potential both to significantly reduce the time needed for selecting suitable material and, at the same time, to give a more detailed picture of the quantitative features of very large image collections. Data-driven analysis and the ability to automatically search and annotate image contents can remove the most laborious tasks in visual content analysis. This may also decrease the number of errors in the observation and note-taking process. These changes will help to shift the workflow from simple manual tasks to more demanding stages, such as analyzing and examining the databases and performing a qualitative phase where the researcher seeks to answer 'what the data mean' (Bell 2004, p. 22). The new tools will also enable the use of much larger sets of photos than has traditionally been possible. As each photo can be annotated with detailed information on its content, structural patterns, and locations, much more sophisticated research settings can be created. Overall, the novel machine learning techniques will make it possible to renew at least the quantitative part of photographic studies in the near future.

In this paper, we proposed a holistic framework for machine learning-based image analysis called Automatic Image Content Extraction (AICE). To make the traditional visual content analysis methodologies compatible with the novel machine learning techniques, we reformulated and expanded the traditional framework by adding several variables and suggesting suitable values for each variable.
We also considered the difficulty of automatic extraction of each variable with the current state-of-the-art machine learning models and linked the variables to existing machine learning techniques. The proposed framework can be applied in several domains in humanities and social sciences. We provided multiple examples of how machine learning techniques applied according to the proposed AICE framework could have enhanced earlier photographic studies and illustrated the novel opportunities they open for future studies. We also discussed the main challenges in adopting the automatic approach and suggested solutions to these challenges. As the main solution, we encourage expanding collaboration between humanistic and machine learning researchers. In addition, we believe that the humanistic field should not see machine learning merely as a ready-to-use tool, but should also take responsibility for annotating large-scale image collections with the concepts relevant for subsequent humanistic studies.

While this paper and our proposed AICE framework focus solely on images, it is worth noting that the novel opportunities opened by the advancement of machine learning techniques are wider. In particular, different multi-modal techniques that simultaneously analyze several sources of information, such as news texts, graphical elements, or page layout along with the images, or videos consisting of both visual and audio data, can open additional interesting opportunities for future humanistic and social science studies.

References

Height and Weight Estimation from Unconstrained Images
Automated visual content analysis (AVCA) in communication research: A protocol for large scale image classification with pretrained computer vision models
Distant viewing: analyzing large visual corpora
Enriching historic photography with structured data using image region segmentation
Content analysis of visual images
Goffman's gender advertisements revisited: Combining content analysis with semiotic analysis
Content analysis in communication research
Journalism history and digital archives
Open data, privacy, and fair information principles: Towards a balancing framework
Eurocity persons: A novel benchmark for person detection in traffic scenes
Exploring Machine Learning to Study the Long-Term Transformation of News: Digital newspaper archives, journalism history, and algorithmic transparency
OpenPose: realtime multiperson 2D pose estimation using Part Affinity Fields
Applied visual anthropology in the progressive era: The influence of Lewis Hine's child labor photographs
Machine learning based analysis of Finnish World War II photographers
Approaches to analysis in visual anthropology
The Visual Social Distancing Problem
Researching the Visual: Images, Objects, Contexts and Interactions in Social and Cultural Inquiry
Machine learning for cultural heritage: A survey
Dual attention network for scene segmentation
Gender advertisements
Digital image processing
Advertising images as social indicators: depictions of blacks in LIFE magazine
Picturing America's 'War on Terrorism' in Afghanistan and Iraq: Photographic motifs as news frames
Picturing the Gulf War: constructing an image of war in Time
The hidden dimension
Application of Image Classification for Fine-Grained Nudity Detection
Cultural relativism and the visual turn
Gaze360: Physically Unconstrained Gaze Estimation in the Wild
Geometric loss functions for camera pose regression with deep learning
Life Magazine Photographs
Reading images: The grammar of visual design
Imagenet classification with deep convolutional neural networks
Images on the Move: Analytics for a Mixed Methods Approach
Deep learning
Visual Saliency Based on Multiscale Deep Features
Deep facial expression recognition: A survey
Transferable interactiveness knowledge for human-object interaction detection
Deep learning for generic object detection: A survey
Deepfashion: Powering robust clothes recognition and retrieval with rich annotations
Journalismikritiikin vuosikirja 2004
How to compare one million images?
Visualizing change: Computer graphics as a research method
StreetStyle: Exploring world-wide clothing styles from millions of photos
A survey on bias and fairness in machine learning
Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR)
A survey of clustering with deep learning: From the perspective of network architecture
Covering the Dead: Death images in Israeli newspapers - ethics and praxis
The Digital Turn: Exploring the methodological possibilities of digital newspaper archives
Quantitative content analysis of the visual
Taking the visual turn in research and scholarly communication: key issues in developing a more visually literate (social) science
An integrated conceptual framework for visual social research
The SAGE handbook of visual research methods
Analyzing media messages: Using quantitative content analysis in research
Age and gender recognition in the wild with deep attention
Visual methodologies: An introduction to researching with visual materials
Tree-CNN: a hierarchical deep convolutional neural network for incremental learning
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
Automatic Main Character Recognition for Photographic Studies
Automatic Social Distance Estimation From Images: Performance Evaluation, Test Benchmark, and Algorithm
Digital archives, big data and image-based culturomics for social impact assessment: opportunities and challenges
Analyzing human-human interactions: A survey
Daily street life in the inner city of Prague under transformation: the visual experience of socio-spatial differentiation and temporal rhythms
A survey on explainable artificial intelligence (XAI): Toward medical XAI
Deep convolutional learning for content based image retrieval
Exploring human-nature interactions in national parks with social media photographs and computer vision
REVISE: A tool for measuring and mitigating bias in visual datasets
Transferring deep object and scene representations for event recognition in still images
Deep face recognition: A survey
The visual digital turn: Using neural networks to study historical images
The politics of photography: Visual depictions of Syrian refugees in UK online media
Visual relationship detection with internal and external linguistic knowledge distillation
A CNN-RNN architecture for multi-label weather recognition
Places: A 10 million image database for scene recognition

Acknowledgments

We would like to acknowledge support from Intelligent Society Platform funded by Academy of Finland, project "Improving Public Accessibility of Large Image Archives".