key: cord-0108921-lbqzrw68
authors: Zaino, Giulia; Recchiuto, Carmine Tommaso; Sgorbissa, Antonio
title: Culture-to-Culture Image Translation with Generative Adversarial Networks
date: 2022-01-05
journal: nan
DOI: nan
sha: ee8adae4b4aae1b5cf25d9954d8a5aae097904d6
doc_id: 108921
cord_uid: lbqzrw68

This article introduces the concept of image "culturization", defined as the process of altering the "brushstroke of cultural features" that make objects perceived as belonging to a given culture while preserving their functionalities. First, we propose a pipeline for translating objects' images from a source to a target cultural domain based on Generative Adversarial Networks (GAN). Then, we gather data through an online questionnaire to test four hypotheses concerning the preferences of Italian participants towards objects and environments belonging to different cultures. As expected, results depend on individual tastes and preferences: however, they are in line with our conjecture that some people, during the interaction with a robot or another intelligent system, might prefer to be shown images whose cultural domain has been modified to match their cultural background.

YUKIKO is an 83-year-old Japanese woman living with her son Matsuo in a traditional Japanese house. Yukiko sleeps on a futon on the tatami floor, which, she says, is very good for her back. Recently, every time Yukiko has woken up at night to go to the bathroom, she has felt a little dizzy and confused, and sometimes she has had trouble finding the light switch. Last week she fell in the dark, and nobody noticed she was lying on the floor until morning. Matsuo is worried about her safety and decided to set up a Smart Environment composed of a small table-top robot assistant named Tetsuwan Atomu 1, voice-activated lights and cameras that can detect emergencies. Yukiko gladly agreed to have a camera installed in her room: "I don't want to be a burden for my son" she says. It's Sunday evening: Tetsuwan greets her with a bow and then chats with her for a while about cherry blossoms in Spring by showing pictures of beautiful trees in Kyoto on its display. "You know so many things," Yukiko says with a smile as she lies down on her futon on the tatami floor, preparing to sleep, "Goodnight!" Tetsuwan receives an alert from one of the cameras that has recognized a person lying on the floor as an emergency, which ordinarily requires alerting Yukiko's family. However, according to the robot's cultural database, it is common for people living in traditional Japanese houses to sleep on a tatami floor, so this may not be an emergency. "Switch off the light, please," says Yukiko, yawning and confirming Tetsuwan's assessment. Tetsuwan relaxes: it would smile if it could.

All authors are with the University of Genoa, DIBRIS, Via All'Opera Pia 13, 16148, Genova (Italy), e-mail: antonio.sgorbissa@unige.it.

1 The android boy popular in Japan, known as "Astro Boy" in western countries.

The scenario above introduces a key concept: how to provide personalized interaction by re-configuring the behaviour of intelligent systems depending on the cultural context. Indeed, the way Tetsuwan Atomu greets Yukiko, the topics it chooses for conversation, the pictures it shows on its screen, and the way it interprets the situation when Yukiko lies on the floor show that the assistant is culturally competent. The concept of "culture" is complex [10, 15, 20, 22, 33], and there is no consensus among researchers on how to define it.

A simple yet effective definition holds that culture is a shared representation of the world of a group of people. By "culturally competent," we mean an intelligent system that can adapt its perceptions, plans, actions, and interaction style depending on the worldview of the person it is interacting with, including their beliefs, values, language, norms, and visibly expressed forms such as customs, art, clothing, food [7, 8, 44] .

Cultural factors in ICT have been investigated in the last decade, particularly in Robotics [4, 49]. All previous approaches focussed on what can make the system more or less acceptable to people of different cultures, either concerning its appearance or its behavior [32, 48, 53], including verbal and non-verbal interaction [3, 47] or social distance [14, 27]. However, as confirmed by recent systematic and scoping reviews, none of the earlier approaches aimed to define a general conceptual framework to make an intelligent system for health and social assistance culturally competent [1, 39, 45]. This problem was addressed [7, 8] for the first time by developing a culturally competent Socially Assistive Robot (SAR) for elderly care starting from research in Transcultural Nursing [33] and culturally competent Health Care [5]. However, cultural competence is crucial not only in SARs, but in all AI domains that could benefit from a smart ICT infrastructure, including Health & Social Care, Education, Travel, and Business [2, 9, 25].

In this general scenario, the article focuses on the visual representation of objects in different cultures [30]. The anthropological study of material culture teaches us that everyday objects have different designs in different world areas, even when they share many similarities in terms of the functionalities they offer. This is not a surprise: the design of objects is related both to the social tasks they intend to accomplish and to the material enabling them to do it [50]. As a consequence, in any given geographical area and culture, some visual features tend to coherently repeat with a higher frequency, even if the presence of aesthetic universals has been found [18, 19]. Using a metaphor, two objects may have the same functionalities or affordances [16], and yet be given "brushstrokes of cultural features" that make them uniquely recognizable.

Suppose now that a robot is interacting with a person, e.g., showing instructions on how to perform a given task such as shaving or cleaning the floor: this capability may play a key role in SARs, e.g., when interacting with older people with mild forms of dementia. In the case of floor cleaning, instructions are likely to go like this: (1) Take a bucket; (2) Take a mop; (3) Take a cleaning detergent; (4) Sweep or vacuum first; (5) Fill the bucket; (6) Dip and wring the mop; (7) Begin mopping (Figure 1). How important is it that, during interaction with people, robots and other intelligent systems show images with which the person is familiar? The advertising industry knows all too well that a product may need different designs and culturally appropriate advertisements to hit the market in different areas of the world. The ad of a detergent in India or Japan is unlikely to use Italian actors playing the role of an Italian family in an Italian house unless a specific message in this sense is required, and the opposite is true in Italy.

Given these premises, the article's contribution is twofold.

• The article addresses for the first time the problem of culturizing images, and explores Generative Adversarial Networks (GAN) [17, 24, 29, 37, 40, 42, 59] for this purpose. We do not propose a new GAN architecture, but do this using state-of-the-art GANs and related tools. With the term "culturization," we intuitively mean altering the "brushstroke of cultural features" that make objects perceived as belonging to a given culture while preserving the perceived functionalities of objects. As clarified in the article, we train GANs to translate images from a source to a target cultural domain.

• The article evaluates, through an online questionnaire submitted to Italian participants with different age ranges followed by a statistical analysis of data, hypotheses related to the perceived cultural context of objects (both real and GAN-generated ones), the level of appreciation towards objects belonging to different cultural contexts, and finally the perceived realism of indoor environments patched with culturized objects.

Section II describes the State-of-the-Art related to image-to-image translation using GANs. Section III describes the system and the process implemented to culturize objects and environments. Sections IV, V, and VI present data acquisition and analysis. Conclusions are given in Section VII.

Image culturization can be described as an image-to-image translation problem: we need to translate an image from a source to a target cultural domain. Suppose we have

• a set of images, related to each other by a shared characteristic that defines them as belonging to the same domain, also referred to as the source domain S;

• another set of images, different from the first one, defining a target domain T;

image-to-image translation is the problem of learning a mapping G : S → T such that the distribution of images from G(S) is indistinguishable from the distribution T. In other words, the objective is to modify images belonging to S to make them "similar" to those belonging to T.

Image-to-image translation [28] may benefit from GAN-based approaches [43]. Examples include adding semantic labels to photos by overlaying pixels that represent semantic information [12], producing photorealistic images from sketched drawings [13, 58], translating photographs into paintings [59], or filling missing regions of images based on the available visual data [35]. GANs are not the only feasible approach to this problem: however, this article does not aim to find the best-performing solution for image culturization. Our objective is to find a feasible (possibly suboptimal) solution that meets all constraints posed by image culturization, to be later tested with human participants to confirm the relevance of this new concept. According to this rationale, and considering that GAN-based methods are the vast majority of solutions [43], we will limit our analysis to GANs.

GANs have been successful in several domains, ranging from the creation of purely synthetic images (e.g., faces of people that do not exist in the real world [17, 46]) to the modification of selected features in a pre-existing image (e.g., the age or facial attributes of a person [23, 55]).

In the simplest formulation, GANs have two components [17]: a generator and a discriminator. Both the generator and the discriminator shall be trained using a vast set of collected data x describing a probability distribution p_data: the generator is trained to generate samples with probability distribution p_g = p_data, whereas the discriminator will aim to distinguish the real data from the generated ones. The main idea is to create a competition between two networks. On the one hand, the discriminator is challenged with samples that are increasingly difficult to recognize as fake. On the other hand, the generator is trained to "cheat" the discriminator by producing more realistic samples that make it fall into error.

In [17] the discriminator and the generator are both multilayer perceptrons with parameters θ_g (the weights of the generator) and θ_d (the weights of the discriminator). The generator x = G(z; θ_g) takes as input a noise variable z with probability distribution p_z(z) and maps it to data space with a distribution p_g(x). The discriminator is instead a binary classifier D(x; θ_d) that takes as input a sample x (which may either be one of the samples of the data or the output of the generator) and outputs the probability that x came from the data distribution p_data(x) rather than from the generator. After training, both G and D will reach a point at which they cannot improve because p_data(x) = p_g(x), and D(x; θ_d) = 0.5. Optimal parameters (θ_g, θ_d) are obtained by playing a two-player minimax game with value function V, maximizing V with respect to θ_d and minimizing it with respect to θ_g (Eq. 1).
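For reference, the minimax game of [17] is usually written in the standard form below. It is reproduced here on the assumption that this is the expression denoted as Eq. 1; the symbols follow the definitions given in the paragraph above.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
  \tag{1}
```

At the equilibrium described in the text, where p_data(x) = p_g(x) and D(x) = 0.5 everywhere, both expectations equal log 0.5, so V takes the value −log 4.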

Since the most promising application of GANs emerged to be in the field of computer vision, Deep Convolutional Generative Adversarial Networks (DCGANs) were developed to exploit the success of Convolutional Neural Networks (CNNs) [46]. The generator still takes as input a latent noise variable z but now generates an image through convolutional decoding operations. DCGANs proved able to generate higher-quality images and provided the basis for subsequent architectures. However, similar to the original GANs, they do not allow specifying additional constraints on the generated data samples. A relationship between the noise variable z fed to the generator and the generated images exists (e.g., different faces are produced), but it does not allow us to control the outcome (e.g., to produce younger versus older people's faces).

Conditional GANs (CGAN) [41] overcome this limitation by providing additional information to the generator and the discriminator during training, forcing the generator to synthesize a fake sample with desired characteristics. The generator learns to produce realistic samples corresponding to a specific label, whereas the discriminator learns to distinguish fake sample-label pairs from real sample-label pairs. The value function V used in training is the same as Eq. 1, by substituting G(z) with G(z|y) and D(x) with D(x|y), where y is the additional information supplied to the network.
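Written out, the substitution described above gives the expression below. This is a sketch of the usual CGAN form, assuming that Eq. 1 is the standard minimax value function of [17]; y is the additional information supplied to both networks.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
```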

In the following, we will review a subset of GAN-based solutions [24, 29, 37, 40, 42, 59] that have interesting properties and requirements for image culturization and, as such, may be representative of broader classes. The analysis is limited to a small subset of the available solutions [43]. Since the focus of this article is on the impact of image culturization on people, different or improved solutions might be considered as they emerge in the future.

All approaches typically require two datasets, each dataset containing images with homogeneous characteristics in terms of content, style, or resolution. However, some approaches have the additional requirement that images need to be paired during training: for every image of the source domain S, there should be a corresponding image of the target domain T. A notable example in this class is Pix2Pix [24]: the model postulates a transformation that can modify the source image to make it ideally belong to the target domain while keeping the result as close as possible to the input image, and the goal of Pix2Pix is to learn this transformation and perform it. To this end, Pix2Pix adopts a CGAN approach where the input image conditions the generator: the discriminator takes as input a generated or a real image belonging to the target domain and guesses whether the image is real or not. The generator is a U-Net [51], i.e., an encoder-decoder that passes the features extracted in each encoding layer to the corresponding decoding layer: this helps the reconstruction phase by increasing the image quality. The discriminator is a PatchGAN [34], which does not output a scalar probability value but a matrix where each value corresponds to the probability that a patch of the starting image is not fake. The network is trained with the CGAN loss function, by adding a weighted term that measures the difference between the generated image and the real image in the corresponding pair.
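The way the two loss terms combine can be illustrated with a minimal sketch. It is not the Pix2Pix implementation: images are toy grayscale grids stored as nested lists, the paired-image term is assumed to be a mean absolute (L1) difference, the adversarial term is assumed to be the common "−log D" generator form averaged over the PatchGAN output matrix, and `l1_weight` is a hypothetical illustrative value.

```python
import math


def l1_distance(image_a, image_b):
    """Mean absolute difference between two equally sized grayscale images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def patch_adversarial_loss(patch_probs):
    """Generator-side adversarial term (assumed form): mean of -log(p) over the
    PatchGAN output, where p is the probability that a patch is real."""
    values = [p for row in patch_probs for p in row]
    return sum(-math.log(p) for p in values) / len(values)


def generator_objective(patch_probs, generated, target, l1_weight=100.0):
    """Adversarial term plus weighted difference from the paired target image."""
    adversarial = patch_adversarial_loss(patch_probs)
    return adversarial + l1_weight * l1_distance(generated, target)


# Toy example: a 2x2 PatchGAN output and a 2x2 generated/target image pair.
patch_probs = [[0.5, 0.5], [0.5, 0.5]]
generated = [[0.0, 1.0], [0.5, 0.5]]
target = [[0.0, 0.5], [0.5, 1.0]]
loss = generator_objective(patch_probs, generated, target, l1_weight=10.0)
```

The weight controls how strongly the generator is pulled towards the paired target image rather than towards merely fooling the discriminator.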

Pix2Pix and similar approaches obtain good results and turn out to be robust for several applications despite their simplicity. The main limitation is that they require large paired datasets to learn a mapping between a source and a target domain, which is problematic if we want to map objects into their culturized version in several different cultures, as large paired datasets are complex and expensive to build.

In order to overcome the constraint that datasets need to be paired during training, some approaches propose a new concept, referred to as consistency of the cycle: the image x_S, after the first translation G(x_S) from S to T, is fed to a second generator that performs a backward translation F(G(x_S)) to remap the image to the starting domain S, and the result is compared with the original image. This concept was first introduced by CycleGAN [59], a conditioned GAN network that creates a mapping from S to T using the solution proposed in [26] for the generator and PatchGAN for the discriminator. CycleGAN is trained by substituting the original GAN loss in Eq. 1 with the one used in [38] and, more importantly, adding a weighted "cycle-consistency" L1 loss measuring the difference between x_S and F(G(x_S)). CycleGAN has been used to re-create the style of impressionist painters starting from a photograph, following the intuition that we can capture the essence of a painter's style despite having never seen a side-by-side example of, say, a Monet painting next to a photo of the scene he painted. This concept fits well with the culturization concept: in analogy with impressionist painters representing reality with their painting style, we want to capture how a human self-identifying with a given culture would represent a particular object through cultural brushstrokes.
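The cycle-consistency term can be sketched as follows. This is a toy illustration, not CycleGAN itself: images are nested lists of grayscale values in [0, 1], and the hypothetical functions `brighten` and `darken` stand in for the trained generators G and F.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two equally sized grayscale images."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(image_a, image_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def cycle_consistency_loss(x_s, forward, backward):
    """L1 difference between x_S and its reconstruction F(G(x_S))."""
    return l1_distance(x_s, backward(forward(x_s)))


def brighten(image):
    """Hypothetical stand-in for G : S -> T (adds 0.25, clipped at 1.0)."""
    return [[min(1.0, p + 0.25) for p in row] for row in image]


def darken(image):
    """Hypothetical stand-in for F : T -> S (subtracts 0.25, clipped at 0.0)."""
    return [[max(0.0, p - 0.25) for p in row] for row in image]


x_s = [[0.0, 0.5], [0.5, 1.0]]
loss = cycle_consistency_loss(x_s, brighten, darken)
```

In this toy case the loss is non-zero only because clipping makes `darken` an imperfect inverse of `brighten` for the brightest pixel; a pair of generators that invert each other exactly would give a loss of zero.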

A strength of CycleGAN and similar approaches [29, 57] is that they allow for building datasets very easily by searching for images on the web (e.g., Greek and Japanese vases) without additional requirements. However, this solution also has drawbacks. First, it cannot focus on the object to be modified and keep the background unaltered, which is crucial if we want to culturize an individual object in a complex scene. Second, the mapping of images returns better results when changes occur in the texture (such as from horse to zebra or from orange to apple) rather than when they occur in the shape. CycleGAN can modify or learn the shape of an object, but this can create artifacts and less realistic images.

As mentioned above, it would be useful if the network could additionally learn which parts of the scene to translate from S to T while leaving the background unaltered. The solution proposed in attention-guided GAN [40] achieves this by adding a mechanism to isolate a region of interest in the image. Based on the structure of CycleGAN and the concept of the consistency of the cycle, two attention networks are added, one for the generator G : S → T and the other for F : T → S, which learn to extract "attention maps". Attention maps provide semantic information by segmenting images into regions that shall or shall not be translated from S to T: to this end, each pixel is assigned a continuous value in the interval [0,1], making attention maps differentiable on the borders of the area of interest. After feeding the input image to the generator, attention-guided GAN applies the learned mask and then adds the background using the inverse of the mask: as such, attention networks are trained in parallel with generators. Similarly, the discriminator is trained to consider only attended regions. CycleGAN can be interpreted as an attention-guided GAN where the attention maps equal 1 everywhere.
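The masking step described above amounts to a per-pixel blend of the translated image and the original one. The sketch below assumes grayscale images and an attention map stored as nested lists of equal size; the function name is hypothetical and the attention map is taken as given rather than learned.

```python
def blend_with_attention(source, translated, attention):
    """Keep translated pixels where attention is high and original pixels
    (the background) where attention is low: a * G(x) + (1 - a) * x."""
    return [
        [a * t + (1.0 - a) * s for s, t, a in zip(row_s, row_t, row_a)]
        for row_s, row_t, row_a in zip(source, translated, attention)
    ]


source = [[0.0, 0.2], [0.4, 0.6]]
translated = [[1.0, 1.0], [1.0, 1.0]]

# Attention equal to 1 everywhere: the output is the translated image,
# which is the CycleGAN special case mentioned in the text.
all_ones = [[1.0, 1.0], [1.0, 1.0]]
full_translation = blend_with_attention(source, translated, all_ones)

# Attention equal to 0 everywhere: the image is left untouched.
all_zeros = [[0.0, 0.0], [0.0, 0.0]]
untouched = blend_with_attention(source, translated, all_zeros)
```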

This approach and similar ones [11, 56] might look like an improvement over CycleGAN since they allow modifying individual elements of a scene without requiring further constraints on the datasets: however, attention maps imply a higher computational load and training time. Additionally, attention-guided approaches have limitations when image-to-image translation involves changes in the objects' shape (typically required in image culturization, Figure 2) since they focus on keeping the image background unaltered.

Solutions exist to expand CycleGAN to deal with datasets that exhibit more evident changes in the shape of objects or include multiple instances of objects. To achieve this, InstaGAN [42] requires additional information on the instances to be modified, obtained through segmentation masks provided a priori: S and T must contain both images and masks for all elements of interest within images. A set of generators supplements the original generator: each generator operates on a segmentation mask and produces a translated mask in the target domain. As in attention-guided GAN, information on the elements to be modified prevents changes on the parts not to be modified, i.e., the background. The cross-entropy used in CGAN is substituted with the loss used in [38], and weighted terms are added, including the cycle-consistency loss and a background-preserving loss.

InstaGAN produces high-quality images where only selected regions are modified. However, it adds a new constraint: it requires a segmentation mask associated with each object. Unfortunately, not many datasets are built in this way, which proves problematic for image culturization based on custom-built datasets, typically composed of images downloaded from the web.

GAN-based approaches exist that place more emphasis on the dichotomy between high-level semantics (depending on the object's functionalities or affordances) and low-level features ("brushstrokes of cultural features"). UNsupervised Image-to-image Translation (UNIT) [37] is a representative of this class based on CoGAN [36]. CoGAN consists of two GANs: each GAN takes a random vector as input without being conditioned on any input image. Generators are trained to decode the noise and produce pairs of images in the two target domains X_1 and X_2, thus learning a joint distribution of multi-domain images from data. Discriminators are trained to distinguish generated images from real ones in the corresponding domains. Moreover, the weights of the first few layers of the generators (responsible for decoding high-level semantics) and those of the last few layers of the discriminators (encoding high-level semantics) are shared. In this way, the network capacity is reduced because it has fewer weights to update: differentiation between the two networks is only in low-level features, which make the two domains different. As such, CoGAN generates pairs of related images sharing high-level features (e.g., a smiling and a not-smiling version of the same face). However, CoGAN is unsuitable for image-to-image translation because it is fed with a random input: UNIT makes the network conditioned by adding variational autoencoders. Variational autoencoders E_1, E_2 [31] are trained to encode images from two different domains X_1 and X_2 into a shared latent space z, whereas generators G_1, G_2 and discriminators D_1, D_2 are trained to take z as input to generate images in the two domains X_1 and X_2, similarly to CoGAN.

The authors show that the image translation stream from X_1 to X_2 (and vice versa) can be ideally represented by the compositions x_2 = G_2(E_1(x_1)) and x_1 = G_1(E_2(x_2)), and that the proposed shared-latent space assumption implies a cycle-consistency constraint. As such, UNIT has benefits and constraints comparable with CycleGAN.

GAN evaluation [6] is a challenging task. The main problem is that GANs are trained to reach an equilibrium situation where the generator can "cheat" the discriminator. Still, the generator, taken alone, is not associated with a cost function to be minimized. Under these conditions, it is hard to predict when the generator will produce samples that best fit the target probability distribution. Researchers have therefore defined qualitative and quantitative tools to evaluate GANs.

Since subjective evaluation is vital in most applications, it is reasonable to start with a human qualitative assessment of the generated images. This is often performed by relying on crowdsourcing platforms such as Amazon Mechanical Turk [24, 40, 57, 59]. Quantitative evaluation is usually performed using metrics such as the Inception Score (IS) [52] and the Fréchet Inception Distance (FID) [21]; other metrics have been proposed [43]. IS evaluates generated images based on how Inception v3, a widely-used image recognition model that attains greater than 78.1% accuracy on ImageNet [54], classifies them as one of 1,000 known objects. Additionally, it evaluates whether there is sufficient diversity among the generated samples, addressing the so-called "mode collapse" problem that can occur during training: the generator learns to generate a specific image of the target domain and keeps on generating the same image over and over (even when varying the input) because that image is very good at deceiving the discriminator. FID also uses a pre-trained Inception v3 model to measure the quality of the generated images. Still, it does so differently, i.e., by comparing the statistics (mean and covariance) of the Inception v3 activations computed on generated versus real images.
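To make the two ingredients of IS concrete (confident classification and diversity), the score can be sketched from a list of per-image class probability vectors. The sketch below assumes those probabilities have already been produced by a classifier such as Inception v3; it uses only two classes for readability, and the function name is hypothetical.

```python
import math


def inception_score(class_probs):
    """exp of the average KL divergence between each image's predicted class
    distribution p(y|x) and the marginal distribution p(y) over all images."""
    n_images = len(class_probs)
    n_classes = len(class_probs[0])
    marginal = [sum(p[j] for p in class_probs) / n_images
                for j in range(n_classes)]
    kl_sum = 0.0
    for p in class_probs:
        for j in range(n_classes):
            if p[j] > 0.0:  # terms with p = 0 contribute nothing
                kl_sum += p[j] * math.log(p[j] / marginal[j])
    return math.exp(kl_sum / n_images)


# Confident and diverse predictions: each image is clearly one class,
# and the two images belong to different classes.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])

# Collapsed generator: every image is classified as the same class.
collapsed = inception_score([[1.0, 0.0], [1.0, 0.0]])
```

With two classes the score ranges from 1 (collapsed or uninformative output) to 2 (confident predictions spread evenly over both classes), which is why a generator that keeps producing the same image scores poorly.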

We are not evaluating a new GAN approach, and therefore IS and FID scores are not very relevant for our purpose. Instead, we compare different solutions on a qualitative basis to assess to what extent each solution can meet the requirements for image culturization. We start by observing that Pix2Pix and similar approaches require paired datasets, a very strong limitation given that image culturization must work on custom-built datasets not adhering to any specific requirement, e.g., images downloaded from the web. InstaGAN does not prove to be a feasible solution for similar reasons, since it requires a carefully designed dataset including images belonging to different cultural domains and the corresponding masks. We implemented CycleGAN, AttentionGAN, and UNIT, which do not pose constraints on the dataset, and then we qualitatively evaluated the results. Tests confirmed that AttentionGAN has strong limitations when image-to-image translation requires altering the image shapes (Figure 2). CycleGAN and UNIT might be good choices: after qualitative evaluation, and since CycleGAN is the basis for many other approaches (including InstaGAN, which might be reconsidered in the future), we finally select it as the best candidate for image culturization.

III. CULTURIZATION PIPELINE

This section describes the system and the process for training, testing, and run-time culturization of objects and environments from a source S to a target T cultural domain.

The whole process includes three phases, described in detail in the following: training; testing; culturization.

During the training phase, we build datasets and train the networks. Three fully automated steps are executed in a pipeline, managed by a script in Python.

• Choice of images and creation of datasets. The user enters the name of an object (e.g., vase, sofa, pillow, lamp, etc.), a culture (e.g., European, Indian, Moroccan, Chinese, etc.), and the number of images to download. The images are automatically downloaded from Google through the Selenium WebDriver API, a collection of open-source APIs to automate the testing of a web application: images shown in this article have been produced starting from datasets of 1,000 images.

During the testing phase, we map selected images from S to T , evaluate results, and possibly tune hyperparameters. Once again, three fully automated steps are executed in a pipeline.

• Choice of images to be used for testing and evaluation. Among the images automatically downloaded from Google, the user shall select a percentage of images to test the network (e.g., 10% of the whole dataset, not used for training), which are added to a "test repository" corresponding to the starting culture.

• Generation of culturized images. CycleGAN is fed with images in the test repository to generate the new images representing the objects in the target cultural domain. Only the trained generator is working here, whereas the discriminator is no longer needed.

• Quantitative and qualitative evaluation of results. The resulting images undergo a qualitative analysis performed by researchers who visually judge the quality of the output images in the target cultural domain. The FID score is also computed to confirm that results are acceptable. If the quality is not judged as sufficiently good, the network is trained again by tuning selected hyperparameters (typically, the number of epochs).
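The hold-out step in the first bullet above can be sketched as a simple random split. This is a minimal illustration under stated assumptions, not the script used by the authors: the file names, the `split_dataset` function, and the fixed random seed are all hypothetical.

```python
import random


def split_dataset(image_paths, test_fraction=0.1, seed=0):
    """Shuffle the downloaded images and hold out a fraction for testing.
    Returns (training_images, test_images); the two lists never overlap."""
    shuffled = list(image_paths)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names for a dataset of 1,000 downloaded images.
downloaded = [f"vase_{i:04d}.jpg" for i in range(1000)]
train_images, test_images = split_dataset(downloaded, test_fraction=0.1)
```

Fixing the seed keeps the test repository stable across retraining runs, so that results obtained with different hyperparameters are judged on the same images.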

If we want to culturize individual objects, it is now sufficient to feed CycleGAN with an image belonging to the source domain. However, if the objects to be culturized are part of an image showing a complex environment, four steps are executed in a pipeline (currently, not fully automated).

• Choice of an image of the environment. The image of an environment belonging to a cultural domain is provided.

• Preprocessing of images and object segmentation. The selected image is preprocessed using Faster-RCNN-Inception-V2 to extract all objects in the environment that are candidates for culturization.

• Generation of culturized images. CycleGAN is used to map each object extracted from the environment into its counterpart in the target domain.

• Re-insertion of the modified objects in the original environment. The culturized objects are used to patch the original image of the environment. Currently, this step is not yet automated but is performed manually through a supporting tool.

Since we use culturized images to support interaction with people, an additional step is worth mentioning: we instruct our humanoid robot Pepper to display the resulting image on its tablet for a culturally competent interaction with users (Figure 1). However, culturized images might be used in different ways depending on the application domain. Consequently, the following experimental session aims to investigate four general hypotheses about image culturization rather than focus on culturally-competent human-robot interaction.
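The extract, translate, and patch steps of the culturization phase can be sketched on toy data as follows. This is only an illustration: the re-insertion step is performed manually in the described system, so the automated `paste` below is a hypothetical simplification, bounding boxes are assumed to be already available from the object detector, and `translate` stands in for the trained generator.

```python
def crop(image, box):
    """Extract the region box = (top, left, bottom, right) from a nested-list image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def paste(image, patch, box):
    """Return a copy of image with patch written at the top-left corner of box."""
    top, left, _, _ = box
    result = [list(row) for row in image]
    for dy, patch_row in enumerate(patch):
        result[top + dy][left:left + len(patch_row)] = patch_row
    return result


def culturize_environment(image, boxes, translate):
    """Translate each detected object and patch it back into the environment."""
    result = image
    for box in boxes:
        result = paste(result, translate(crop(image, box)), box)
    return result


# Toy 4x4 "environment" with one detected object in the centre.
room = [[0] * 4 for _ in range(4)]
mark_object = lambda patch: [[1 for _ in row] for row in patch]
patched_room = culturize_environment(room, [(1, 1, 3, 3)], mark_object)
```

Only the pixels inside the bounding box change, and the original image is left unmodified, which mirrors the requirement that the background of the environment stays unaltered.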

In this work, experiments are performed with Italian participants only to simplify recruitment. Participants are shown images with different objects belonging to cultural domains generally referred to as European (E) or non-European (O for "other") and asked some questions (details in the following). Tests aim to verify the four general hypotheses below, which produce a set of specific hypotheses to be accepted or rejected through data acquisition and statistical analysis in the following sections.

• Hypothesis 1: objects are correctly recognized by Italian participants as belonging to the E culture. This happens both when the object originally belongs to the E culture and when it has been culturized from O to E.

• Hypothesis 2: European objects are generally preferred by Italian participants. This happens both when the object originally belongs to the E culture and when it has been culturized from O to E.

• Hypothesis 3: Italian participants who prefer E objects tend also to like objects culturized from O to E.

• Hypothesis 4: participants perceive environments as more realistic when objects in the image are culturized with GANs following the procedure in Section III-C, rather than when they are patched with unmodified objects downloaded from the Internet.

Concerning Hypothesis 2, preferences are expected to vary depending on each person: hypothesizing that Italian people will prefer European objects incurs the risk of stereotyping and, generally speaking, is not true. Different persons may be more or less attracted by objects belonging to their own or other cultures, and this also depends on individual objects. The same person may adore colorful Moroccan lamps and, at the same time, be particularly attracted by the design of a lamp produced in Sweden or France. We are aware of this and do not intend to make stereotyped claims. The objective of Hypothesis 2, together with Hypothesis 3, is to explore if there are people with whom using images that match their cultural background can be a winning strategy.

Data to test the four hypotheses above are collected through a Google Form. The form is fully anonymous and includes four sections, showing pictures and asking related questions.

• Demographic data: the form asks the respondent to declare their age (less than 20, 20-29, 30-39, 40-49, 50-59, 60-69, 70 and over) and whether they usually travel out of Italy (never or very rarely; yes, but mostly in Europe; yes, both in Europe and outside Europe);

• Recognizing culture and expressing preferences: four vases, four sofas, four pillows, and four lamps are shown in sequence. For each object, the respondent has to answer the following two questions on a 5-point Likert scale: "Q1: How close this object looks to the European culture and tradition?" (from 1: very far to 5: very close); "Q2: Do you like this object?" (from 1: not at all to 5: a lot). For each object class (vases, sofas, pillows, and lamps), the four objects shown to the respondent have been produced in different ways: one has been downloaded from the Internet as belonging to the European culture and tradition (E); one has been downloaded from the Internet as belonging to a different culture and tradition (Chinese, Indian or Moroccan, O); one is a European object that has been "culturized" using GANs (E → O); one is a non-European object that has been "culturized" using GANs (O → E).

• Comparing culturized and non-culturized objects: sixteen pairwise comparisons are shown, four of which concern vases, four sofas, four pillows, and four lamps. For each comparison, two objects are shown side by side: the original one (which can be either E or O) and its corresponding GAN counterpart (E → O or O → E, depending on the original one). The respondent has to answer the following two questions: "Q1: What object better represents the European culture and tradition?" (the one on the left or the one on the right); "Q2: What object do you like the most?" (the one on the left or the one on the right). A score of 1 is assigned to the winning object, and 0 is assigned to the loser. The order in which original and modified objects appear in the pair varies for each comparison. Figure 3 shows some of the pairs presented to respondents.

• Comparing realism of environments: eight pairwise comparisons are shown. For each comparison, two environments are presented to the respondent: an environment that has been modified by segmenting objects, culturizing them with GANs, and then pasting them back onto the original image (GCult); an environment that has been modified by substituting original objects with objects of different cultures downloaded from the Internet, pasted onto the original image (PCult). The respondent has to answer the following question: "Q1: What image looks more realistic?" (the one on the top or the one on the right); see Figure 4. A score of 1 is assigned to the winning environment, and 0 is assigned to the loser.

Notice that we show objects whose cultural belonging is emphasized. Greek vases are easily recognizable as belonging to the European culture, while non-European vases are mostly Chinese and Japanese. Figure 3 shows that, when culturizing vases through GANs, the original Chinese or Greek drawings are recognizable in their GAN-modified counterparts: a Chinese vase with Greek warriors and a Greek vase with an oriental tavern can be spotted in the figure. The European sofas we chose are primarily designed in classic style: one may object that this kind of sofa is not very common in ordinary European houses. However, we wanted to avoid international, modern-style sofas that may not be immediately recognizable as European. Pillows are probably less recognizable as belonging to the European or non-European cultures, since colorful pillows are also customary in ordinary European houses. Finally, lamps are the objects whose cultural belonging depends most on shape: see the moon-shaped Moroccan lamp versus its European counterpart, and the European "night table" lamp rethought in a Moroccan style.

The Google Form has been online from 18/09/2021 to 7/10/2021. It has been advertised both through public social networks (Facebook and Twitter) and student social networks of the University of Genova.

Overall, N=392 participants filled in the questionnaire, of whom: 7.9% are less than 20 years old; 21.9% are in the range 20-29; 13.8% in the range 30-39; 20.4% in the range 40-49; 20.9% in the range 50-59; 11.5% in the range 60-69; 3.6% are 70 years or older. Concerning travel, 26% of the participants declared that they never travel out of Italy, or very rarely; 48.5% frequently travel out of Italy, but mostly in Europe; 25.5% travel both in Europe and out of Europe 6 .

[...] belonging to the European culture) and Hypothesis 2 (E and O → E objects are preferred by Italian participants). Given the large sample size (N=392), checking the normality of the distribution is not required. Table I shows that, concerning Q1 ("How close this object looks to the European culture and tradition?"),

• for any object class, E objects achieve the highest average score;

• for any object class, O → E objects achieve the second highest average score.

For all object classes, the ANOVA indicates that the null hypothesis can be rejected. The Tukey HSD test indicates that all group averages, taken pairwise, are significantly different, with the only exception of the pair (E → O, O) in the vase, sofa, and pillow classes.
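As an illustration of the ANOVA step, the following minimal sketch computes the one-way F statistic over the four groups of an object class. The Likert scores are hypothetical and the function name `one_way_anova_f` is ours; the source does not state which software was used, and the p-value and the Tukey HSD post-hoc test are deliberately left out.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    group_means = [mean(g) for g in groups]
    # Variability of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical Q1 scores (1-5) for one object class; not the study's data.
scores = {
    "E":    [5, 4, 5, 4, 5],
    "O->E": [4, 3, 4, 3, 4],
    "O":    [2, 1, 2, 2, 1],
    "E->O": [2, 2, 1, 2, 2],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with the reported degrees of freedom is what leads to rejecting the null hypothesis of equal group means.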

As a confirmation, we then perform a one-tailed, two-sample Welch's t-test to compare E (the highest average score) with O → E (the second highest), and then O → E with O and E → O. In all tests, the null hypothesis is rejected in favour of the alternative hypothesis µ_1 > µ_2 with p < .05 for any object class.
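The Welch comparison can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the scores are hypothetical, the helper name `welch_one_tailed` is ours, and the one-tailed p-value uses a normal approximation to the t distribution, which is only reasonable for large samples such as N=392.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_one_tailed(a, b):
    """Test H1: mean(a) > mean(b) without assuming equal variances.

    Returns (t, df, approx_p), where df is the Welch-Satterthwaite
    estimate and approx_p is a normal approximation of the p-value.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    approx_p = 1.0 - NormalDist().cdf(t)  # assumption: large-sample approximation
    return t, df, approx_p

# Hypothetical Q1 scores for E and O->E objects; not the study's data.
e_scores = [5, 4, 5, 4, 5, 4, 5, 5]
oe_scores = [4, 3, 4, 3, 4, 3, 3, 4]
t, df, p = welch_one_tailed(e_scores, oe_scores)
print(f"t = {t:.2f}, df = {df:.1f}, approximate one-tailed p = {p:.5f}")
```

With small samples like the toy data above, an exact t distribution should be used instead of the normal approximation.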

Summarizing, with a focus on µ_E and µ_{O→E}, which are most relevant for Hypothesis 1,

• µ_E > µ_{O→E} > µ_O for vases, pillows, and lamps;

• µ_E > µ_{O→E} > µ_{E→O} for sofas;

• µ_E > µ_{O→E} > µ_O when considering all classes together.

Concerning Q2 ("Do you like this object?"),

• for any object class, E achieves the highest average score;

• for vases, the second highest average score is achieved by O → E; for all other classes the second highest score is achieved by O.

The one-tailed, two-sample Welch's t-test indicates that, for vases, the difference between E (highest average) and O → E (second highest average) is statistically significant with p < .05, and the same holds for the difference between O → E and O (third highest average). For sofas and pillows, the difference between E (highest average) and O (second highest average) is statistically significant with p < .05, and the same holds for the difference between O and O → E. For lamps, the [...]

[...] .5 or 1 depending on whether that object class/group won 0, 1, or 2 comparisons. After averaging over N respondents, the maximum score in each cell is 1 (which happens if a group, say E, is always selected by all respondents for that object), and the sum between two competing groups (say, E and E → O) is 1 as well.
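The scoring scheme described above can be sketched as follows. We assume, from the wording "won 0, 1, or 2 comparisons", that each respondent sees two comparisons per object class and group pair, so that the per-respondent score is the fraction of comparisons won; the win counts below are hypothetical.

```python
def respondent_score(wins, n_comparisons=2):
    """Map the number of comparisons won (0, 1, or 2) to a score in {0, .5, 1}."""
    return wins / n_comparisons

# Hypothetical data: how many of the two E vs. E->O comparisons each
# respondent resolved in favour of E.
e_wins = [2, 1, 2, 0, 1]

e_scores = [respondent_score(w) for w in e_wins]
# The competing group wins every comparison that E loses.
e_to_o_scores = [respondent_score(2 - w) for w in e_wins]

mean_e = sum(e_scores) / len(e_scores)
mean_e_to_o = sum(e_to_o_scores) / len(e_to_o_scores)
print(mean_e, mean_e_to_o)  # the two averages sum to 1
```

Because every comparison has exactly one winner, the two averages are complementary, which is why testing one of them against 0.5 is sufficient.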

Analyses are performed using a one-sample t-test: in the E vs. E → O comparisons, we check whether the average number of times that E has been selected satisfies µ_E > 0.5 with α = .05; in [...]

It can be observed that, concerning Q1 ("What object better represents the European culture and tradition?"),

• for any object class, E objects are selected, on average, more than half of the times;

• for any object class, O → E objects are selected, on average, more than half of the times.

The one-tailed, one-sample t-test indicates that

• µ_E > .5 with p < .05 for every object class, as well as for all objects taken together;

• µ_{O→E} > .5 with p < .05 for every object class, as well as for all objects taken together.

Concerning Q2 ("What object do you like the most?"),

• for vases, pillows, lamps, E objects are selected, on average, more than half of the times;

• the same happens for the O → E group, even if average scores are lower than those of the E group.

The one-tailed, one-sample t-test indicates that

• µ_E > .5 with p < .05 for vases, pillows, lamps, as well as for all objects taken together;

• µ_{O→E} > .5 with p < .05 for vases and pillows, µ_{O→E} = .5 with p < .05 for sofas and lamps. When considering all objects together, µ_{O→E} > .5 with p = .052.

Next, we perform a stratified analysis controlling for age: results of respondents with age < 30 (N=117) are in Table [...]

[...] IV returns ρ_{E,O→E} = .66, confirming that respondents giving high scores to E objects tend to favour O → E objects as well.

Table V shows average scores corresponding to the third section of the questionnaire. To test Hypothesis 4, eight pairwise comparisons are shown to respondents: the sum of the scores totalled by GCult and PCult is 1.
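The coefficient ρ_{E,O→E} reported above relates, for each respondent, the score given to E objects and the score given to O → E objects. The sketch below assumes a Pearson correlation (this excerpt does not say which coefficient was used) and hypothetical per-respondent averages; `pearson_r` is our own helper name.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-respondent average Q2 scores; not the study's data.
e_avg = [4.5, 3.0, 5.0, 2.5, 4.0]   # average score given to E objects
oe_avg = [4.0, 3.0, 4.5, 3.0, 3.5]  # average score given to O->E objects

rho = pearson_r(e_avg, oe_avg)
print(f"rho = {rho:.2f}")
```

A positive value indicates that respondents who rate E objects highly also tend to rate O → E objects highly, which is the pattern Hypothesis 3 looks for.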

Statistical analyses are performed using a one-sample t-test: we check whether the average number of times that the GCult environment has been selected satisfies µ_GCult > 0.5 with α = .05.
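A minimal sketch of this one-sample test is given below. It assumes that each respondent's GCult score is the fraction of the eight comparisons in which the GCult environment was chosen; the scores are hypothetical, `one_sample_t_greater` is our own helper name, and the p-value uses a normal approximation that is only adequate for large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_t_greater(scores, mu0=0.5):
    """Test H1: mean(scores) > mu0. Returns (t, approx_p)."""
    n = len(scores)
    standard_error = stdev(scores) / sqrt(n)
    t = (mean(scores) - mu0) / standard_error
    approx_p = 1.0 - NormalDist().cdf(t)  # assumption: large-sample approximation
    return t, approx_p

# Hypothetical per-respondent GCult scores (fraction of 8 comparisons won).
gcult_scores = [0.75, 0.625, 0.875, 0.5, 0.75, 1.0, 0.625, 0.75]

t_gcult, p_gcult = one_sample_t_greater(gcult_scores)
print(f"mean = {mean(gcult_scores):.3f}, t = {t_gcult:.2f}, approximate p = {p_gcult:.5f}")
```

Since the GCult and PCult scores sum to 1, a mean significantly above 0.5 for GCult directly implies a mean below 0.5 for PCult.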

It can be observed that, concerning Q1 ("What image looks more realistic?"):

• GCult environments are selected, on average, more than half of the time;

• the average tends to increase when performing a stratified analysis for age; in particular, the maximum value corresponds to respondents whose age is < 30.

The one-tailed, one-sample t-test indicates that µ_GCult > .5 with p < .001 in all cases considered.

All data collected show that responses may vary significantly depending on the object class considered (e.g., results are not the same for vases or pillows) and different objects within the same class (e.g., in some cases the original, non-European lamp is preferred to the GAN-modified version, in other cases the opposite is true). This is particularly evident when objects are not pairwise compared: people express their preferences, notwithstanding their cultural belonging, depending on their personal tastes or just because an object looks better than another thanks to the skill of the photographer. The pairwise comparisons in the second section somehow control for confounding variables by presenting two objects with very similar shape, dimensions, perspective, lighting and sometimes texture, and altering only the cultural context. However, based on results, we cannot exclude that choosing different objects for pairwise comparisons might produce different outcomes.

Given these considerations, some conclusions can be drawn, by limiting our claims to the specific set of pictures shown.

First, it is evident from responses to Q1 in the first and second sections that E and O → E objects are rated as belonging to the European culture to a higher degree than their non-European counterpart, which is in line with Hypothesis 1. Very interestingly, O → E objects achieve a lower average score than E ones: this is likely due to the fact that GANs preserve some elements of the source image in the translation. O → E are perceived by respondents as not perfectly matching their expectations for a European object.

Second, it is evident from the responses to Q2 in the first and second section that E objects are preferred, on average, to all other objects, which is in line with Hypothesis 2. The only cases for which this result is not confirmed are the lamp class in the first section and the sofa class in the second section. However, when focussing on O → E objects, it is not possible to draw similar conclusions: O → E vases are always preferred to O vases, but results may vary for other object classes. The individual preferences of respondents may explain this: due to immigration, it is very common in Italian shops and houses to see Moroccan, Indian, and Chinese pillows and lamps. People are accustomed to these objects and perceive them as familiar, and many respondents may have a favorable bias towards colorful objects coming from all around the world. A similar explanation may hold when considering the sofa class: the classic-style sofas shown in the questionnaire are not very common in ordinary Italian houses, where a more sober design usually prevails, and so their non-European counterpart may be preferred. Also, it is interesting that the strongest bias towards O → E objects is observed in the vase class. Greek vases are correctly perceived as European artifacts and, on average, they are preferred to Far-eastern vases. This is in line with the explanation above: in the authors' experience, decorated Far-eastern vases such as the ones shown in Figure 3 are less frequently encountered in ordinary Italian houses than Moroccan and Indian pillows and lamps. Then, when comparing Greek and Far-eastern vases, the choice might be determined more by cultural belonging than by individual preferences and familiarity.

Third, in both the first and the second section, younger respondents (age < 30) tend to prefer E and O → E objects: this traditionalist attitude looks coherent with the lesser experience with other cultures and the limited interest younger people may have in items of furniture. Many of them have never faced the problem of furnishing a house, possibly because they still live with their parents: in Italy, the average age at which people leave their parents' house is around 30.

Fourth, and very important, we found a positive correlation between respondents giving a high score to E objects and O → E objects. Even if drawing strong conclusions about the preferences of Italians is not possible (and not desirable as it might lead to stereotyped assumptions), this confirms our intention to show some people culturized images that match their background during interactions with robots.

Fifth, it is evident from the responses to Q1 in the third section that environments modified by segmenting objects, modifying them with GANs, and re-inserting them in the original image are perceived as more realistic. Some respondents explained their choice in a written comment, complaining that objects in non-GAN images have a weird perspective and do not blend well with the background. Not surprisingly, younger people (age < 30) appear more skilled in distinguishing between the GAN-modified and non-GAN environments, with a more marked preference for the former.

We introduced the concept of image "culturization" and proposed a process for object culturization based on GAN technology. Results have been tested with recruited Italian participants to verify four main hypotheses concerning their preferences towards objects belonging to different cultures and the realism of culturized environments. Overall, experiments motivate our intention to proceed further along this path: even if, as expected, not all participants prefer objects belonging to their cultural background, those who prefer European over non-European objects also tend to have a positive attitude towards objects that we culturized to be perceived as European.

The solutions we proposed have three main limitations. First, the process of culturizing images is not completely automated, since culturized objects need to be manually reinserted into the environment. This has no impact on the hypotheses tested with experiments. Second, the experimental results should be taken with a grain of salt. When preparing the questionnaires, we selected a subset of objects and environments. However, we did not validate our questionnaire to capture a hypothetical construct "positive attitude towards European objects". Therefore, we cannot ensure that results generalize to any set of objects. The pairwise comparisons showing the original and culturized versions of the same object may partially control for confounding variables. Still, we cannot exclude that showing different pairwise comparisons may yield different results. Third, we implicitly assumed that it is possible to culturize an environment by culturizing the objects that the environment contains. This conjecture needs to be revised: the texture, shape, and colors of walls and other architectural features may play a role as well. However, given current GAN technology, this is likely the best we can do: we plan to test more advanced techniques as future work.

Finally, future work will include human-robot interaction with culturized images, not yet systematically tested with recruited participants due to the Covid-19 pandemic.
