Title: The Use of AI for Thermal Emotion Recognition: A Review of Problems and Limitations in Standard Design and Data
Authors: Catherine Ordun, Edward Raff, Sanjay Purushotham
Date: 2020-09-22

With the increased attention on thermal imagery for Covid-19 screening, the public sector may believe there are new opportunities to exploit thermal as a modality for computer vision and AI. Thermal physiology research has been ongoing since the late nineties. This research lies at the intersections of medicine, psychology, machine learning, optics, and affective computing. We review the known factors of thermal vs. RGB imaging for facial emotion recognition, and we also propose that thermal imagery may provide a semi-anonymous modality for computer vision, compared to RGB, which has been plagued by misuse in facial recognition. However, the transition to adopting thermal imagery as a source for any human-centered AI task is not easy and relies on the availability of high-fidelity data sources across multiple demographics, along with thorough validation. This paper takes the reader on a short review of machine learning in thermal FER and the limitations of collecting and developing thermal FER data for AI training. Our motivation is to provide an introductory overview of recent advances in thermal FER and to stimulate conversation about the limitations in current datasets.

Computer vision algorithms that use data from the visible spectrum (e.g., RGB) face a variety of challenges when it comes to human Facial Emotion Recognition (FER) due to the representation of superficial facial features lying on the epidermis. Physiological responses from stress, fatigue, or other stimuli cannot be visualized in RGB but can be visualized through thermal imagery, due to the changes in temperature detected sub-cutaneously.
Thermal image data that can capture temperature changes correlated to human vital signs can be a powerful set of data for telemedicine applications, supporting healthcare providers as a diagnostic tool for assessing inflammation and stress (Kosonogov et al., 2017). Skin temperature can correlate to certain vital signs and offers a non-invasive method to remotely assess patients. As the cost of high resolution thermal sensors declines and more researchers release thermal FER datasets, there is great potential to apply thermal imagery for telemedicine purposes. Since the Covid-19 pandemic, governments around the world have begun using thermal sensors combined with AI tools for Covid temperature screening (Ting et al., 2020).

Figure 1: RGB, near infrared, and thermal images of a resting (top) and fatigued (bottom) face. In the thermal images, darker pixels correspond to colder and lighter to hotter. (Lopez, del Blanco, and Garcia, 2017)

From the U.K., China, Italy, and Australia to the U.S., multiple companies are offering the promise of integrated thermal sensing with facial recognition (FR) (Van Natta et al., 2020). We believe that with broader adoption of thermal FR, driven in part by Covid-19-related changes to HIPAA rules, it will only be natural for researchers to want to advance their technology toward emotion screening. We caution that before leaping to thermal FER, researchers should be fully aware of the restrictions and limitations of thermal imagery and the problems that may underlie existing thermal FER databases. The adoption of thermal imagery as a source for any human-centered AI task is not easy. Thus, the goal of this paper is to present the state of the literature and discuss the challenges hindering the full adoption of AI as a tool for thermal FER. When the public sector thinks about FER and facial recognition (FR), the go-to modality is the visible spectrum, usually encoded as RGB.
RGB images have dominated the area of FER, as indicated by the variety of well-known facial databases used in AI.¹ But FR using RGB databases has become a controversial area of computer science, requiring careful consideration of its flaws and the innate assumptions within the data and how it is applied (Martinez-Martin, 2019; Buolamwini and Gebru, 2018; Greene, Hoffmann, and Stark, 2019; Singer and Metz, 2019; Lohr, 2018). Beyond their original intended academic purposes, some RGB databases have been taken down in order to prevent industry FR training (Murgia, 2019). In the wake of the Black Lives Matter protests in June 2020, Microsoft and IBM discontinued their development of FR, while Amazon invoked a one-year moratorium on FR based on evidence of algorithmic discrimination against communities of color (Matsakis, 2020). Of particular value to the public sector is whether thermal imagery for FER affords any level of privacy protection and bias mitigation. The answer may stem from the separation of thermal imagery from other machine learning tasks known to increase recognition and decrease anonymity (Hammoud). We believe that long-wave Infrared Radiation (LWIR), used alone as a data source for FER, may be able to provide some form of anonymity for healthcare applications to minimize racial, ethnic, and potentially gender bias when compared to RGB for FER. Through its low, grey-scale resolution² and reliance on temperature vectors driven by underlying vasculature (Ioannou, Gallese, and Merla, 2014), rather than superficial skin tone, texture, and pigmentation, thermal imagery makes it more challenging to easily identify individuals. But a variety of issues around preserving privacy still remain. For example, anonymity may not be possible if thermal FER is combined with the machine learning task of FR, especially since thermal FR is well researched, with multiple methods proposed to detect and recognize individuals.
The concept of separating FR from other tasks is not uncommon. Van Natta et al. (2020) question whether, during Covid-19 temperature monitoring, there is even a need to conduct FR, given that the overall purpose is to identify infection as opposed to identity. It is important to caution that, although thermal FR is more challenging than in the visible domain, it is feasible to use thermal imagery as a "soft" biometric due to its invariance under lighting and pose (Reid et al., 2013; Friedrich and Yeshurun, 2002). For example, superficial vascular networks are unique to each person's face, as proposed by Buddharaju et al. (2007), and can be extracted through methods like anisotropic diffusion to identify minutiae points akin to fingerprints, as shown in Figure 3. Further, combining RGB with thermal can increase recognition accuracy. For example, Nguyen and Park (2016) used a combination of thermal and visible full-body images for gender detection, finding that their proposed method of score-level fusion (training two separate SVM classifiers) combining thermal and visible led to a decrease in error to a 14.672 equal error rate (EER) when compared to using thermal only (19.583 EER) and visible only (16.540 EER).

¹ CK+, FER 2013, FERET, EmotioNet, RECOLA, Affectiva-MIT Facial Expression Dataset, NovaEmotions, MultiPIE, McMaster Shoulder Pain, AffectNet, Aff-Wild2, the Japanese Female Facial Expression database, and CASME II for microexpressions.

² Thermal imaging manufacturers offer a variety of color palettes for visualizing temperature beyond "white hot," such as "iron bow" and "rainbow." It should be cautioned that some manufacturers offer fusion visualizations that fuse the RGB and thermal images together, thereby improving resolution.
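To make score-level fusion concrete, the sketch below fuses synthetic match scores from two hypothetical classifiers (one per modality) with a weighted sum and compares equal error rates. The score distributions, the equal weights, and the `eer` helper are illustrative assumptions, not Nguyen and Park's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic match scores standing in for two classifiers' outputs
# (e.g., one SVM per modality); distributions are invented for illustration.
gen_vis = rng.normal(0.70, 0.15, 200)   # visible modality, genuine pairs
imp_vis = rng.normal(0.40, 0.15, 200)   # visible modality, impostor pairs
gen_th  = rng.normal(0.65, 0.15, 200)   # thermal modality, genuine pairs
imp_th  = rng.normal(0.45, 0.15, 200)   # thermal modality, impostor pairs

def eer(genuine, impostor):
    """Approximate equal error rate: sweep thresholds and keep the point
    where the larger of false-accept/false-reject rates is smallest."""
    best = 1.0
    for t in np.linspace(0.0, 1.0, 1001):
        far = float(np.mean(impostor >= t))   # impostors wrongly accepted
        frr = float(np.mean(genuine < t))     # genuine pairs wrongly rejected
        best = min(best, max(far, frr))
    return best

# Score-level fusion: a weighted sum of the two modalities' scores.
w = 0.5
eer_vis   = eer(gen_vis, imp_vis)
eer_th    = eer(gen_th, imp_th)
eer_fused = eer(w * gen_vis + (1 - w) * gen_th,
                w * imp_vis + (1 - w) * imp_th)
# Averaging independent scores narrows both distributions, so the fused
# EER is typically lower than either single-modality EER.
```

In practice the fusion weights would be tuned on validation data; equal weights are used here purely to show the mechanics.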
Figure 3: Vascular-network minutiae extraction (Buddharaju et al., 2007)

In addition, there has been research in the computer and electrical engineering fields to develop sensor-level privacy for thermal sensors in situations where people need to be sensed and tracked, but not identified. Work by Pittaluga, Zivkovic, and Koppal (2016) demonstrated different techniques, including digitization that masks human temperature measurements, thereby obscuring any ability to detect faces, as shown in Figure 4; manipulating the sensor noise parameters as the thermal image is being generated; and algorithms to under- or overexpose specific pixels designated as "no capture" zones. Still in research, these techniques require different levels of hardware and firmware upgrades depending on the thermal sensor.

Figure 4: Digitization privacy in different scenes: digitization results in scenes with people, computers, and buildings. The left column shows the input 16-bit images and the right column the simulated output. (Pittaluga, Zivkovic, and Koppal, 2016)

Thermal imagery has additional technical advantages: it (1) is invariant to lighting conditions, unlike RGB, allowing the detection of physiological response (heat) in low light or total darkness; (2) correlates reliably and accurately with standard physiological measures like respiration and heart rate; (3) is non-invasive, i.e., requires no skin contact whatsoever, making it convenient, non-intrusive, and potentially relevant for non-communicative persons; (4) is resistant to intentional deceit, since physiological responses cannot be faked, whereas visible facial expressions can be controlled; and (5) is able to reveal facial disguises (i.e., wigs, masks), since these materials have high reflectivity and display as the brightest objects on thermograms, compared to human skin, which is among the darkest objects with low reflectivity (Pavlidis and Symosek, 2000).
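A minimal numpy sketch of the "no capture" idea described above: pixels whose raw counts fall in an (assumed) human-skin band are saturated to white, leaving the rest of the scene intact. The linear counts-to-temperature mapping and band limits are invented for illustration; real implementations of Pittaluga et al.'s techniques operate at the sensor/firmware level.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 16-bit radiometric frame: raw counts are assumed to map
# linearly to scene temperature (this mapping is sensor-specific).
frame = rng.integers(27000, 28500, size=(120, 160), dtype=np.uint16)  # cool background
frame[40:80, 60:100] = rng.integers(30500, 31500, size=(40, 40))      # a warm face

def redact_human_band(raw, lo=30000, hi=32000):
    """Overexpose pixels whose counts fall in the (assumed) human-skin band,
    mimicking a firmware-level 'no capture' zone."""
    out = raw.copy()
    out[(raw >= lo) & (raw <= hi)] = np.iinfo(raw.dtype).max  # saturate to white
    return out

private = redact_human_band(frame)
# The face region is saturated; the background pixels are untouched.
```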
In addition, thermal imagery offers physiological signals of social interactions from person to person. In terms of deceit detection, it is valuable to note that RGB images can also be used to detect microexpressions using databases like CASME II. Microexpressions are genuine, quick facial movements that may be uncontrollable or unnoticeable by the individual, and therefore have been studied as an indication of deception (Yan et al., 2014). The RGB images used for studying microexpressions, however, are different from standard RGB FR datasets. They consist of video sequences captured using spontaneous natural elicitation, captured at a high frame rate of 200 fps, and labeled with facial action units (FAUs), which are encoded combinations of facial movements based on Paul Ekman's Facial Action Coding System (FACS) (Ekman, 1999). A brief explanation of thermal radiation helps to understand how facial skin acts as a radiating surface. Thermal radiation is emitted by all objects above absolute zero (−273.15 °C). The emissivity of human skin is estimated at 0.98 to 0.99 (Yoshitomi et al., 2000). The principle of thermal image generation is well understood through the Stefan-Boltzmann law, which states that the total radiation emitted over time by a black body is proportional to T⁴, where T is temperature in Kelvins: W = εσT⁴, where W is radiant emittance (W/cm²), ε is emissivity, σ is the Stefan-Boltzmann constant (5.6705 · 10⁻¹² W/cm²K⁴), and T is temperature (K). A black body is an object that absorbs all electromagnetic radiation it comes in contact with. No electromagnetic radiation passes through the black body and none is reflected. Since no visible light is reflected or transmitted, the object looks black when visualized by thermal imagery, when it is cold. Thermal sensors respond to infrared radiation (IR) and produce visualizations of surface temperature.

Figure 5: Long-Wave IR falls in the wavelength range of 8 µm to 15 µm
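The Stefan-Boltzmann relation above can be checked numerically; the ~32 °C facial skin temperature below is an assumed typical value, not a figure from the paper.

```python
# Radiant emittance from the Stefan-Boltzmann law, W = eps * sigma * T^4,
# using the paper's units (sigma in W/cm^2 K^4).
SIGMA = 5.6705e-12   # Stefan-Boltzmann constant (W/cm^2 K^4)

def radiant_emittance(temp_k, emissivity):
    return emissivity * SIGMA * temp_k ** 4

# Facial skin at an assumed ~32 C (305.15 K) with emissivity 0.98:
w_skin = radiant_emittance(305.15, 0.98)   # about 0.048 W/cm^2
```

The T⁴ dependence is why small temperature differences across the face produce measurable contrast in a thermogram.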
Because LWIR operates in a sub-band of the electromagnetic spectrum (per Figure 5), it is invariant to illumination conditions, meaning it can operate in low light to complete darkness. By imaging temperature variations in response to emotionally induced stimuli such as videos or pictures, thermograms reveal genuine responses to social situations. This occurs through activation of the autonomic nervous system (ANS), where emotional arousal leads to a perfusion of blood vessels innervated at the surface of the skin (Ioannou, Gallese, and Merla, 2014). These images are called thermograms and are the data captured in thermal FER datasets, with labels based on the emotional response elicited (i.e., happiness, disgust, sadness, deceit, stress, etc.). Although today's need for touch-less systems is paramount, the concept of using thermograms for contact-less physiological monitoring is not new; it is rooted in the intersection of physiological research (Selinger 2016; Buddharaju 2007; Pavlidis 2000; Ioannou 2014) and affective computing (Wilder 1996; Yoshitomi 2000; Goulart 2019). These include applications for FER, where different emotions are detected from thermal facial images alone, in addition to person re-identification on thermal imagery for FR. Since 1996 (Wilder et al., 1996), there have been numerous studies evaluating how thermograms correlate with vital measures. In 2007, Pavlidis demonstrated that thermal imagery is a reliable measure for assessing emotional arousal, where different regions of the face (zygomaticus, frontal, orbital, buccal, oral, nasal) correlate with different emotional responses.
Thermal imagery also visualizes the physiology of perspiration (Pavlidis et al., 2012; Ebisch et al., 2012), cutaneous and subcutaneous temperature variations (Hahn et al., 2012; Merla et al., 2004), blood flow (Puri et al., 2005), cardiac pulse, and metabolic breathing patterns (Pavlidis et al., 2012), and has been used to monitor heat stress and exertion (Bourlai et al., 2012). The reliability of thermal temperature readings has been repeatedly shown to be consistent and to correlate accurately with gold-standard physiological measures of electrocardiography (ECG), piezoelectric thorax stripes for breathing monitoring, nasal thermistors, skin conductance, and galvanic skin response (GSR) (Sonkusare et al., 2019). We can even observe these changes with the naked eye, such as embarrassment causing a person to blush (Sonkusare et al., 2019), or fear leading to pallor (Kosonogov et al., 2017). Merla (Merla, 2014) offered a survey of thermal studies in psychophysiology from 1990 to 2013, demonstrating a series of emotional responses detected in thermal imagery such as the startle response, fear of pain, lie detection, mental workload, empathy, and guilt. These responses occur in different regions of the face, or ROIs. Salazar-Lopez found that high-arousal images elicited temperature increases at the tip of the nose (Salazar-López et al., 2015). Kosonogov (Kosonogov et al., 2017) found that the more arousing an image, the faster and greater the thermal response at the tip of the nose. He speculated that the speed and magnitude of these thermal responses were linked to autonomic adjustments normal to emotional situations. Zhu (Zhu, Tsiamyrtzis, and Pavlidis, 2007) found that deception was detectable through increased forehead temperature, and Puri (Puri et al., 2005) found the forehead to be correlated with stress. Social responses based on one-on-one personal contact can also be observed.
For example, Ebisch (Ebisch et al., 2012) found "affective synchronization" of facial thermal responses between mother and child, where distress temperatures at the tip of the nose were mimicked by the mother as she watched her child in distress. Fernandez (Fernández-Cuevas et al., 2015) summarizes related analysis by Ioannou, Gallese, and Merla (2014). Since 2000, beginning with (Yoshitomi et al., 2000), machine learning in thermal FER has grown slowly to include emotion classification by (Khan, Ingleby, and Ward, 2006; Nhan and Chau, 2009; Wang et al., 2014a; Jarlier et al., 2011; Wang et al., 2014b; Trujillo et al., 2005), with gradual adoption of AI methods such as neural networks. The ability to move away from manual, hand-crafted feature extraction to automatic learning through neural networks has already proven advantageous for thermal-to-visible image translation through GANs (Mallat et al., 2019; Kniaz et al., 2018; Chen and Ross, 2019).

Table 2 column legend: Year = publication year; Affect = expression type (Posed and Spont. mean basic discrete emotions); ROIs = facial regions of interest; Model = deep learning algorithm type; Dataset = name of database; Target = the predicted class (all papers identified were classification); Acc = best classification accuracy across models reported; Data = link to database provided if custom, or name of public database; Code = link to code provided; Params = model parameters disclosed in paper. Annotations of (-) indicate information not disclosed, and (+) that it was disclosed in the paper.

The works in Table 2, starting in 2010, indicate a slow evolution from manual feature extraction using geometric methods to learning latent representations using deep learning. These works do not consistently release code and have varied levels of explanation around experimental design and arousal stimulus, which we summarize in Table 3. This makes it challenging to reproduce, much less compare across, studies. Researchers in thermal emotion recognition such as Goulart et al.
(2019) agree, particularly since there is no standard thermal FER imaging benchmark dataset consistently used across studies. In an empirical review reproducing 255 machine learning papers, Raff (Raff, 2019) notes that papers which are scientifically sound and complete should be independently reproducible based solely on their explanations, details, and descriptions. Failures in reproducibility can occur when language or notation is unclear, when the algorithm is missing details about implementation or equations, and when nuanced details are left out. In Table 1 we catalog the few available (via request or publicly) thermal datasets that have been used for tasks including FR and FER. They vary in scope, and some do not have emotion labels at all, making it difficult to benchmark and standardize results that may eventually impact psychological and health-related decisions. One example of a recently developed thermal FER dataset is by Tufts University, shown in Figure 9. Some researchers have noticed the lack of variation across thermal FR datasets, which fail to account for diverse emotional states, alcohol intake or exercise, and ambient temperature, leading them to doubt the rigor of the reported results, especially in real-life conditions (Shoja Ghiass, 2014). Assuming that the lack of a comprehensive thermal FER benchmark dataset is one factor that hinders the advancement of AI research, we can begin exploring the challenges of designing such a dataset. But developing a thermal FER dataset is different from simply crawling the web for RGB faces. The collection of thermal FER data requires an experiment unto itself, needing institutional review board (IRB) approval, subject recruitment, experimental design, and specialized equipment. As a result, thermal FER datasets are expensive in terms of time and labor.
We have observed some trends across databases that, if addressed in the development of a single high-fidelity dataset, may carve a path for greater adoption of thermal AI FER studies. We justify these assertions based on research in the psycho-physiology domain, below. Video sequences capture the timing of the arc of expression onset and delay. It is important to capture the intensity and duration of expression, which has been found consistent with automatic movement and neuropsychological models (Tian, Kanade, and Cohn, 2005). Levenson (1988) indicated that the duration of an emotional response is 0.5-4 seconds. But Nguyen (Nguyen et al., 2013) cites mistakes in many of the leading thermal recognition databases. In the USTC-NVIE database, the data acquisition procedure left gaps of only 1-2 minutes between emotion clips, which is too short for participants to re-establish a neutral emotional state. Research indicates that for the thermal response (cutaneous skin temperature), there is a delay after stimulus that needs to be accounted for and recorded (Ioannou, Gallese, and Merla, 2014), and temperature change can occur in less than 30 seconds upon stimulation (Pavlidis et al., 2012). Temperature changes at the tip of the nose can occur as fast as 10 seconds after stimulus and last 20-30 seconds, regardless of distress or soothing (Ebisch et al., 2012). In a more recent paper, (Sonkusare et al., 2019) quantified the temporal dynamics of the thermal response against gold-standard measures like galvanic skin response (GSR), demonstrating that the thermal response occurred only 2 seconds later than GSR when subjects were exposed to an auditory stimulus. Static images without a time axis are incomplete and will fail to capture the full physiological signal and emotional response. Many existing thermal databases that are focused only on FR have discrete, posed affects based on the labeling defined by Ekman (Ekman, 1999).
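These timing constraints suggest recording ROI temperature traces over time rather than stills. Below is a rough sketch of detecting response onset on a synthetic nose-tip trace; the sampling rate, baseline window, 3-sigma rule, and response shape are all illustrative assumptions, not a published protocol.

```python
import numpy as np

FPS = 10  # assumed frame rate of the thermal video

# Synthetic nose-tip ROI trace: ~34.0 C baseline with slight drift, then a
# stimulus at t = 30 s followed by a slow 0.4 C drop (illustrative values).
t = np.arange(0, 90, 1 / FPS)
temp = 34.0 + 0.02 * np.sin(t)
temp[t >= 30] -= 0.4 * (1 - np.exp(-(t[t >= 30] - 30) / 8.0))

def response_onset(trace, fps, baseline_s=20, k=3.0):
    """Return the first time (s) the trace leaves baseline mean +/- k*std,
    or None if it never does. Baseline is the first `baseline_s` seconds."""
    base = trace[: int(baseline_s * fps)]
    mu, sd = base.mean(), base.std()
    outside = np.abs(trace - mu) > k * sd
    if not outside.any():
        return None
    return float(np.argmax(outside)) / fps   # index of first True sample

onset = response_onset(temp, FPS)   # a few seconds after the 30 s stimulus
```

A static image taken at, say, t = 31 s would miss both the onset lag and the 20-30 second decay that this kind of trace makes explicit.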
But affective researchers argue that spontaneous emotional reactions are more realistic since "people show blends of emotional displays; hence, the classification of human non-verbal affective feedback into a single basic-emotion category may not be realistic" (Gunes and Pantic, 2010; McDuff, Girard, and El Kaliouby, 2017). Further, multiple emotions typically occur, as opposed to a single discrete response. For example, in a 1993 study by Gross et al., 85 subjects self-reported a variety of feelings after watching a close-up arm amputation medical video (Gross and Levenson, 1993). Another argument against discrete labels is the possibility that people express emotions as internalizers or externalizers, meaning different people suppress emotional expression in different ways, making it difficult to truly capture expression in a basic, discrete manner (Gross and Levenson, 1993).

Figure 10: Multiple feelings self-reported after exposure to a high-arousal video (Gross and Levenson, 1993)

To elicit spontaneous responses, emotion researchers use static images such as the International Affective Picture System (Kosonogov et al., 2017) or short clips of emotional videos (Nguyen et al., 2013). In a recent 2019 study, Sonkusare et al. (2019) used an auditory stimulus, described in Figure 11, to mimic a startle response spontaneously.

Figure 11: Example of an emotional stimulus by Sonkusare et al. to elicit a spontaneous response. A calming ocean video clip was played for 60 seconds; a loud gunshot sound (80 dB) was played at 40 seconds to mimic a startle response. (Sonkusare et al., 2019)

Provide social or personal context

In a similar vein to spontaneous, natural emotion collection, providing social context in an experimental setting will change the nature of the emotion recorded. Context labeling to account for elicitation methods that are prompted spontaneously through personal elicitation (i.e.,
images, videos), versus social interaction with another person (or robot, per (Goulart et al., 2019)), may signal different physiological responses reflected in thermal imagery. Factors that influence these responses may include interpersonal distance, gaze direction, and opposite gender in the interaction (Kosonogov et al., 2017; Gunes and Pantic, 2010). A sociodynamic model of emotions (Mesquita and Boiger, 2014) asserts that emotions "emerge in interplay with and derive their specific function from the social context. This means that emotional experience and behavior will be differently constructed across various contexts". For example, Goulart (Goulart et al., 2019) analyzed the emotional responses of 17 children during the human-child robot interaction experiment shown in Figure 12. Using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), they inferred happiness and surprise as the most frequently expressed emotions, which were consistent with what the children self-reported upon interacting with the New-Mobile Autonomous Robot for Interaction with Autistics (N-MARIA) robot. In 2000, Yoshitomi (Yoshitomi et al., 2000) classified discrete affects by combining visible, thermal, and audio signals from 21 test subjects, achieving 85% accuracy. Zhu, Tsiamyrtzis, and Pavlidis (2007) discussed multimodal data as "cross scale" data for biomedical research, or interconnections of different types of data using AI to infer mappings even if some data is missing. In essence, both were developing multimodal machine learning models, where multiple modalities, or types of information, may be combined to increase the accuracy of models (Baltrušaitis, Ahuja, and Morency, 2018). The approach to collect pairs is not new.
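A hedged sketch of a PCA-plus-LDA pipeline of the kind Goulart et al. describe, implemented in plain numpy on synthetic ROI-temperature features. The feature values, class means, and component count are invented for illustration, not taken from their study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical features: mean temperatures over 5 facial ROIs for two
# affect classes (values invented to be mildly separable).
happy    = rng.normal([34.2, 33.8, 33.5, 34.0, 33.1], 0.2, size=(60, 5))
surprise = rng.normal([33.6, 34.1, 33.9, 33.4, 33.7], 0.2, size=(60, 5))
X = np.vstack([happy, surprise])
y = np.array([0] * 60 + [1] * 60)

# PCA: project the centered data onto the top 3 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:3].T

# Two-class LDA (Fisher discriminant) on the PCA scores.
m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
Sw = np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)   # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)                 # discriminant direction
threshold = w @ (m0 + m1) / 2
pred = (Z @ w > threshold).astype(int)
accuracy = (pred == y).mean()
```

Evaluating on the training data, as here, is only for demonstrating the mechanics; a real study would use held-out subjects.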
Nguyen collected thermal FER pairs for the KTFE database (Nguyen et al., 2013), and the Iris (Hammoud), Eurecom (Mallat and Dugelay, 2018), and University of Notre Dame (UND) databases also have pairs, which offer greater flexibility for different AI use cases like image translation for person re-identification. This includes research into thermal-to-visible GANs (Mallat and Dugelay, 2018; Kniaz et al., 2018; Chen and Ross, 2019; Zhang et al., 2018). Capturing paired RGB and LWIR images simultaneously, using a camera equipped with a dual sensor, offers a mapping between both modalities for an AI algorithm to learn.

Figure 13: Example of TV-GAN trained on multimodal pairs for thermal-to-visible image translation (Zhang et al., 2018)

Documenting the experimental setup is important in order to minimize bias in the resulting thermogram, which can be affected by a variety of environmental and human-subject conditions. Ioannou (Ioannou, Gallese, and Merla, 2014) articulates, in his paper on the potential and limitations of thermal imaging in physiology, that "Cutaneous thermal responses to external stimuli of psychophysiological valence could result in small temperature variations of the ROIs. Thus, it is extremely important to ensure that the observed temperature variations are not artifacts due to either environmental physiological causes or simply subject motion." Some of these factors can be minimized; the methods for doing so should be recorded and shared in the paper so that other thermal FER data collection trials can repeat or improve on them to control for these external factors.

Figure 14: Experimental setup for Iris dataset capture (Kong et al., 2007)

In Table 3 we provide a sample of experimental parameters from several thermal FER papers and show how they vary from paper to paper. This demonstrates the non-standard setups over the years of thermal FER research that could affect the reusability and generalization of these data for AI experiments.
But different papers vary in how thoroughly they document their experimental protocols, as shown for an example set of papers in Table 3. Multiple factors need to be managed in order to minimize environmental variables that influence thermal capture and can lead to misleading thermograms, such as: 1) cold or warm air, as well as humidity; 2) facial expressions (e.g., open mouth); 3) physical conditions (e.g., lack of sleep, alcohol, caffeine); 4) mental state (e.g., fear, stress, excitement); 5) glasses, which are opaque in thermal imagery; and 6) skin temperature variance throughout the day (Kosonogov et al., 2017). Fernandez et al. provide a comprehensive review of the environmental, individual, and technical factors that influence IR reliability, per Figure 15 (Fernández-Cuevas et al., 2015).

Figure 15: Factors influencing thermal imagery of humans (Fernández-Cuevas et al., 2015)

Experimental design also includes the demographics of recruited subjects. Very few details are provided about race and ethnicity in the papers shown in Table 3, with the exception of (Lopez, del Blanco, and Garcia, 2017), who indicated that nine out of 19 individuals were of Chinese ethnicity. Given the ethical problems of visible FR in failing to train algorithms on representative and balanced minority datasets, thermal FER researchers need to understand exactly which subjects are being included in the data and what underlying assumptions are being broadcast into training. Further, we have so far been discussing thermal FER on adults in the various papers introduced. Very few studies collect thermal FER data on children, limited to (Goulart et al., 2019) for child-robot interaction, (Ioannou et al., 2013) for guilt, and (Ebisch et al., 2012) for child-mother imprinting and mother-child vicarious autonomic response. With the exception of Panetta et al., none of the thermal databases we identified appear to include children, to the authors' knowledge, for thermal FER.
So far, much work is still needed to generate an ethnically and age-diverse thermal FER dataset. Lastly, the experimental set-up should also document technical methods aimed at normalizing the detected thermal face. For example, Wang et al. (2014a) describe using the Otsu threshold algorithm to binarize the thermal images, detect the face boundary, and remove the baseline temperature to minimize the effects of temperature changes in the environment. Similar methods were introduced by Friedrich and Yeshurun in 2002 (Friedrich and Yeshurun, 2002).

Table 4: Challenges, consequences, mitigations, and opportunities.

- Elicitation: Add spontaneous elicitation where possible, in addition to a discrete set. Opportunity: natural, "in the wild" expressions that offer accurate representations of emotion.
- Provide social or personal context: Thermal data collected without social stimuli may not be usable for social use cases. Mitigation: if appropriate, label social context, or if controlling for it, document how social response has been minimized. Opportunity: social-interaction thermal FER expressions, with labeled context and scenarios.
- Collect multimodal pairs: There is no opportunity to increase accuracy or learn from additional modality mappings if only one modality (thermal) is collected. Mitigation: may require a dual sensor, or an experimental design for simultaneous capture using two cameras. Opportunity: multimodal pairs for various social, spontaneously elicited thermal FER domains.
- Document experimental setup: Confounding through uncontrolled environmental variables can lead to misleading images. Mitigation: report, at minimum, the parameters shown in Table 3. Opportunity: a standard thermal FER experimental protocol for design and demographic documentation.
- Accounting for sensor differences: Untested margin of error for images collected using different thermal sensors. Mitigation: none; this is an open research question. Opportunity: assessment with optical engineers to determine the margin of error across sensors for human thermal FER.
Lastly, the cost of thermal sensors from vendors like FLIR has decreased over the past decade, with increasingly higher-quality resolution made accessible to the public. Prior papers have extensively used the Iris and Equinox (now discontinued) datasets. But with the release of more custom datasets, as shown in Table 1, is it fair to compare the output of thermal images from one sensor against another, which may have different optical properties? Or is it sufficient that each sensor operates in the LWIR band? Many researchers have used different thermal sensors over the years: Pavlidis detected anxiety in thermal imagery in 2000 using an uncooled thermal camera with a spectral band of 8µm-14µm manufactured by Raytheon (the ExplorIR model) (Pavlidis and Symosek, 2000); Nguyen in 2014 used a NEC R300 collecting in the 8µm-14µm band (Nguyen et al., 2013); Aureli in 2015 used a FLIR SC660, an uncooled microbolometer sensor that collects in the 7.5µm-13µm band (Aureli et al., 2015); and Eurecom researchers in 2018 used a FLIR Duo Pro, an uncooled VOx microbolometer sensor operating in the 7.5µm-13.5µm band (Mallat and Dugelay, 2018). Table 3 provides a selection of thermal cameras used across various thermal FER studies as examples of how the cameras vary from study to study. It is daunting to attempt to design a universal thermal FER benchmark dataset that can account for the myriad challenges we have described. Extensive funding for time, labor, and evaluation would be required. Some challenges are easier to mitigate than others; improving the documentation of experimental setup, possibly using templates by Gebru et al. (2018) and Mitchell et al. (2019), is easier than designing physiological stimuli. But there may be more feasible short-term solutions that emphasize carefully reviewing the limitations of individual datasets and annotating each with a new labeling system.
First, we have observed a number of custom datasets, as described in Table 2, and are confident that our review missed several proprietary, unpublished, non-English, or classified thermal FER datasets. As a result, there are likely multiple thermal FER databases available, all collected with different sets of subjects, experimental setups, and labeling. Offering these in a central online location would be one step toward inventorying the breadth of data already available worldwide.

Figure 16: Participants from the diverse multimodal dataset collected by the IRIS Lab in 2006 (Chang et al., 2006)

Second, combining multiple existing thermal FER datasets and labeling each by sensor, domain, posed or spontaneous emotion, resolution, presence of social context, and stimulus may be one step toward aggregating a larger database. Gathering training data across different datasets is not unusual in thermal FER, as previously noted when Wang et al. (2014a) combined the NVIE and Equinox datasets to train their DBM model. Both the first and second steps would require researchers to offer up and make their thermal FER datasets available. Third, despite our review of the thermal FR and FER literature, we struggled to identify any research evaluating the limits of obfuscating age, gender, ethnicity, and race using thermal imagery. Although some papers affirmed that their dataset consisted of diverse demographics (Chang et al., 2006), per Figure 16, none, to our knowledge, conducted quantitative tests with human reviewers and inter-rater statistics to test whether sensitive demographics could be masked. We believe that in order to assert that thermal imagery can afford any privacy protection and minimize bias, tests must be developed under IRB approval.
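Such a masking test could quantify, for example, whether human reviewers' guesses at a demographic trait from thermal faces agree beyond chance, using a standard inter-rater statistic such as Cohen's kappa. A minimal NumPy sketch follows; the function and example ratings are illustrative, not drawn from any study above.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.
    Kappa near 0 means agreement is no better than chance, the outcome
    one would hope for if thermal imagery truly masks a trait."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    labels = np.union1d(a, b)
    p_obs = float(np.mean(a == b))                     # observed agreement
    p_exp = float(sum(np.mean(a == l) * np.mean(b == l)
                      for l in labels))                # chance agreement
    if p_exp == 1.0:                                   # degenerate: one label
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)
```

For instance, two raters who always agree score kappa = 1.0, while two raters whose labels agree on half the items purely by chance score kappa = 0.0.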
More broadly, future work should take careful consideration of the scientific questions the research is tackling and the impact it may have in developing or prolonging undesired biases (Friedman and Nissenbaum, 1996). Biometrics-related research is inherently sensitive, and its solutions can be valuable to society (Jain, 2016). As such, researchers should make sure they are familiar with ethical concerns that have arisen in neighboring application areas (Ensign et al., 2018; Chouldechova, 2017; Kleinberg, Mullainathan, and Raghavan, 2016) and remain open to understanding new perspectives on how their research may be helpful or detrimental, and how it could be improved to reduce potential risks (Skirpan and Gorelick, 2017; Goldsmith and Burton, 2017; Sylvester and Raff, 2018).

In this paper, we introduced the advantages of using thermal imagery over RGB for FER and provided a survey of thermal FER AI papers, datasets, and selected samples of experimental design protocols. There are several technical benefits of using thermal imagery compared to RGB images for FER, one of which is potentially semi-anonymity. However, there are few labeled, standard thermal affective datasets available for AI training. We have provided a summary of the proposed challenges, with our insights on the consequences, mitigation, and opportunities for each, in Table 4.
References

Behavioral and facial thermal variations in 3- to 4-month-old infants during the still-face paradigm
Multimodal machine learning: A survey and taxonomy
Use of thermal imagery for estimation of core body temperature during precooling, exertion, and recovery in wildland firefighter protective clothing
Physiology-based face recognition in the thermal infrared spectrum
Gender shades: Intersectional accuracy disparities in commercial gender classification
An indoor and outdoor, multimodal, multispectral and multi-illuminant database for face recognition
Matching thermal to visible face images using a semantic-guided generative adversarial network
Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
Mother and child in synchrony: Thermal facial imprints of autonomic contagion
Basic emotions. Handbook of cognition and emotion
Runaway feedback loops in predictive policing
Classification of factors influencing the use of infrared thermography in humans: A review
Bias in computer systems
Seeing people in the dark: Face recognition in infrared images
Contact-free measurement of cardiac pulse based on the analysis of thermal imagery
Datasheets for datasets
Infrared face recognition: A comprehensive review of methodologies and databases
Why teaching ethics to AI practitioners is important
Emotion analysis in children through facial emissivity of infrared thermal imaging
Better, nicer, clearer, fairer: A critical assessment of the movement for ethical artificial intelligence and machine learning
SCface: Surveillance cameras face database
Emotional suppression: Physiology, self-report, and expressive behavior
Automatic, dimensional and continuous emotion recognition
Hot or not? Thermal reactions to social contact
Deepfake detection challenge
OTCBVS benchmark dataset collection
Fusion of visual and thermal signatures with eyeglass removal for robust face recognition
A comparative study of thermal face recognition methods in unconstrained environments
The autonomic signature of guilt in children: A thermal infrared imaging study
Thermal infrared imaging in psychophysiology: Potentialities and limits
50 years of biometric research: Accomplishments, challenges, and opportunities
Thermal analysis of facial muscles contractions
Automated facial expression classification and affect interpretation using infrared measurement of facial skin temperature variations
Inherent trade-offs in the fair determination of risk scores
ThermalGAN: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset
Multiscale fusion of visible and thermal IR images for illumination-invariant face recognition
A fully annotated thermal face database and its application for thermal facial expression recognition
Facial thermal variations: A new marker of emotional arousal
Emotion and the autonomic nervous system: A prospectus for research on autonomic specificity. Social psychophysiology: Theory and clinical applications
Deep facial expression recognition: A survey
Facial recognition is accurate
Detecting exercise-induced fatigue using thermal imaging and deep learning
A benchmark database of visible and thermal paired face images across multiple variations
Cross-spectrum thermal to visible face recognition based on cascaded image synthesis
Mom feels what her child feels: Thermal signatures of vicarious autonomic response while watching children in a stressful situation
What are important ethical implications of using facial recognition technology in health care? AMA Journal of Ethics
Amazon won't let police use its facial-recognition tech for one year
Large-scale observational evidence of cross-cultural differences in facial behavior
Emotion detection through functional infrared imaging: Preliminary results
Revealing psychophysiology and emotions through thermal infrared imaging
Emotions in context: A sociodynamic model of emotions
Model cards for model reporting
Microsoft quietly deletes largest public face recognition data set
Body-based gender recognition using images from visible and thermal cameras
A thermal facial emotion database and its analysis
Classifying affective states using thermal infrared imaging of the human face
A comprehensive database for benchmarking imaging systems
The imaging issue in an automatic face/disguise detection system
Interacting with human physiology
Fast by nature: How stress patterns define human experience and performance in dexterous tasks
Sensor-level privacy for thermal cameras
StressCam: Non-contact measurement of users' emotional states through thermal imaging
A step toward quantifying independently reproducible machine learning research
Soft biometrics for surveillance: An overview
The mental and subjective skin: Emotion, empathy, feelings and thermography
Face liveness detection using thermal face-CNN with external knowledge
Face recognition using infrared vision
Improved RGB-D-T based face recognition
Many facial-recognition systems are biased
The authority of "fair" in machine learning
Detecting changes in facial temperature induced by a sudden auditory stimulus based on deep learning-assisted face tracking
What about applied fairness?
Facial expression analysis. In Handbook of face recognition
Digital technology and Covid-19
Automatic feature localization in thermal images for facial expression recognition
A natural visible and infrared facial expression database for expression recognition and emotion inference
Emotion recognition from thermal infrared images using deep Boltzmann machine
Fusion of visible and thermal images for facial expression recognition. Frontiers of
Comparison of visible and infra-red imagery for face recognition
CASME II: An improved spontaneous micro-expression database and the baseline evaluation
Effect of sensor fusion for recognition of emotional states using voice, face image and thermal image of face
Directional binary code with application to PolyU near-infrared face database
TV-GAN: Generative adversarial network based thermal to visible face recognition
Forehead thermal signature extraction in lie detection

Acknowledgments

We thank the three anonymous reviewers for AAAI 2020 for their feedback and comments. We also thank Steve Escaravage from Booz Allen Hamilton for his review of this article. This work is supported by grant CRII (IIS-1948399) from the National Science Foundation.