key: cord-0515858-jw1aimkw authors: Das, Hari Prasanna; Spanos, Costas J. title: Conditional Synthetic Data Generation for Personal Thermal Comfort Models date: 2022-03-10 journal: nan DOI: nan sha: b078b51ef72ca6745c511ada046df20c71339dea doc_id: 515858 cord_uid: jw1aimkw
Personal thermal comfort models aim to predict an individual's thermal comfort response, instead of the average response of a large group. Recently, machine learning algorithms have shown enormous potential as candidates for personal thermal comfort models. However, within the normal settings of a building, personal thermal comfort data obtained via experiments are often heavily class-imbalanced: there are disproportionately many data samples for the "Prefer No Change" class compared with the "Prefer Warmer" and "Prefer Cooler" classes. Machine learning algorithms trained on such class-imbalanced data perform sub-optimally when deployed in the real world. To develop robust machine learning-based applications using such class-imbalanced data, as well as for privacy-preserving data sharing, we propose to implement a state-of-the-art conditional synthetic data generator to generate synthetic data corresponding to the low-frequency classes. Via experiments, we show that the generated synthetic data has a distribution that mimics the real data distribution. The proposed method can be extended for use with other smart building datasets/use-cases.
Humans spend more than 90% of their day indoors, where their well-being, performance and energy consumption are demonstrably linked to thermal comfort. However, a study shows that only 40% of commercial building occupants are satisfied with their thermal environment (Graham et al., 2021). A significant amount of research has been done to develop models that accurately predict thermal comfort metrics for occupants in a building.
Contrary to conventional group-based thermal comfort models, personal thermal comfort models (Liu, 2018) focus on developing thermal comfort predictors at the level of an individual building occupant. They have proved effective in human-centric cyber-physical systems for regulating building control systems efficiently, as well as for understanding the correlation between human factors affecting comfort. The general process is to conduct experiments with human subjects and collect their physiological signals along with environmental parameters, thermal sensations and thermal preferences. Prediction models are then trained to predict the thermal preference that governs the thermal comfort management actuators/controllers. Recently, machine learning models have been introduced to successfully predict thermal comfort. In real life, the thermal comfort data obtained is often highly class-imbalanced. For instance, in the experiment in , on average for each subject, around 65% of the data belonged to the "Prefer No Change" class, with the rest divided equally between the "Prefer Warmer" and "Prefer Cooler" classes. Machine learning algorithms require large amounts of varied data to perform well, and under such class imbalance they perform sub-optimally. In the case of buildings, obtaining significant amounts of real data for the low-frequency classes with human subjects is hard and expensive. To balance the classes, recent works have proposed undersampling the high-frequency class to match the count of the low-frequency classes, or oversampling the low-frequency classes to match the high-frequency class. The former method loses information, which is undesirable, and the latter risks overfitting. Another challenge is privacy: sharing thermal comfort data that are associated with users in a building raises privacy concerns.
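To make the two balancing baselines mentioned above concrete, the following is a minimal sketch of random undersampling and random oversampling. The toy labels (26/7/7, i.e. roughly the 65% majority split described above), the integer stand-ins for feature vectors, and all function names are illustrative assumptions, not the data or code used in this work.

```python
import random
from collections import Counter, defaultdict


def group_by_label(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_undersample(samples, labels, seed=0):
    """Drop samples at random until every class matches the rarest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        # rng.sample draws without replacement, so information is discarded.
        balanced.extend((s, label) for s in rng.sample(group, target))
    return balanced


def random_oversample(samples, labels, seed=0):
    """Duplicate samples at random until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        # Extra samples are exact copies, which is where overfitting risk arises.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((s, label) for s in group + extra)
    return balanced


# Hypothetical toy data: 65% "No Change", the rest split equally.
labels = ["No Change"] * 26 + ["Warmer"] * 7 + ["Cooler"] * 7
samples = list(range(len(labels)))  # stand-ins for feature vectors

under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
under_counts = Counter(label for _, label in under)
over_counts = Counter(label for _, label in over)
print(under_counts)
print(over_counts)
```

In this toy setting, undersampling keeps only 21 of the 40 samples, while oversampling reaches 78 samples that still contain only 40 distinct ones. This illustrates the information loss and the duplication that motivate generating new synthetic samples instead.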
To deal with the above challenges, we propose to generate conditional synthetic data for personal thermal comfort models. We use the conditional generative model proposed in Das et al. (2021b) to generate synthetic data for the "Prefer No Change", "Prefer Warmer" and "Prefer Cooler" classes. The inputs to the generative model are thermal comfort features, including physiological signals, temperature, humidity, clothing, activity levels and external parameters. The model extracts the feature representations corresponding to the individual classes, and generates new synthetic data by keeping the conditional feature representation intact and changing the local noise. Our results show that the proposed model is able to generate synthetic data that mimic the real data.
Synthetic data generation has been proposed to expand the diversity and amount of existing training data in many different fields, often to improve the robustness of machine learning models. A few examples follow. In healthcare, Ghorbani et al. (2019) propose a synthetic data generator based on a generative adversarial network (GAN; Goodfellow et al., 2014; Zou et al., 2019c) to improve the diversity and the amount of skin lesion images. Kohlberger et al. (2019) synthesize pathology images for cancer with realistic out-of-focus characteristics to evaluate general pathology images for focus quality issues. Han et al. (2019) propose synthetic generation to produce high-resolution artificial radiographs. For privacy-preserving data sharing, Xu et al. (2019) propose a method to model tabular data to enable their synthetic generation. In computer vision, Das et al. (2021a) propose synthetic data generation across multiple domains. In smart buildings, Quintana et al. (2020) used a conditional tabular GAN-based model for thermal comfort synthetic data generation.
We use a state-of-the-art conditional synthetic data generation model that has shown improved results over all baselines to generate thermal comfort synthetic data. Our model is based on the method proposed in Das et al. (2021b). Suppose we have N samples X with labels Y, with 3 possible thermal preference classes, Warmer/No Change/Cooler. We first train a classifier C (consisting of a feature extractor network, denoted by g(·), and a final fully-connected and softmax layer, denoted by h(·), i.e. C(x) = h(g(x))) to classify each input sample (in our case, a vector of thermal comfort features) as Warmer/No Change/Cooler. Mathematically, this step minimizes the classification loss between the predictions C(x) and the labels Y over the parameters of g and h, using backpropagation. By virtue of the training process, the classifier learns to discard local information and preserve the features necessary for classification (conditional information) towards the downstream layers. Once the classifier is trained, we freeze its parameters and use it to extract the conditional (Warmer/No Change/Cooler) feature representation z = g(x) (as a vector without spatial characteristics) at the output of the feature extractor network for an input sample x. The dimension of z is chosen such that dim(z) << dim(x). During the training phase for the flow model, the conditional feature representation z is fed to the conditional generative flow. The flow model is trained using maximum likelihood, transforming x to a local representation ν, conditioned on z. Since flow models are bijective mappings, the exact x can be reconstructed by the inverse flow with z and ν as inputs. During the generation phase, for an input sample x, we compute the conditional feature representation z. Keeping the conditional feature representation the same, we sample a new local representation ν̃ and generate a conditional synthetic sample x̃ by applying the inverse flow to z and ν̃. Here, x̃ has the same conditional (Warmer/No Change/Cooler) features as x, but a different local representation. An illustration of the proposed model is provided in Fig. 1.
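The encode, resample and invert procedure described above can be illustrated with a deliberately tiny conditional flow. This is a sketch under strong assumptions, not the architecture of Das et al. (2021b): the flow is a single diagonal affine map, the conditioning weights `W_SHIFT` and `W_LOG_SCALE` are fixed hypothetical numbers rather than learned parameters, the prior on ν is assumed to be standard normal, and `z` is a hand-picked stand-in for g(x) from the frozen classifier.

```python
import math
import random

# Hypothetical fixed weights standing in for a learned conditioning network.
# dim(z) = 2 and dim(x) = 3, so dim(z) < dim(x) as in the text.
W_SHIFT = [[2.0, 0.5], [-1.0, 1.5], [0.3, -0.7]]
W_LOG_SCALE = [[0.1, -0.2], [0.05, 0.3], [-0.15, 0.1]]


def condition(z):
    """Map the conditional representation z to a per-feature shift and log-scale."""
    shift = [sum(w * zj for w, zj in zip(row, z)) for row in W_SHIFT]
    log_scale = [sum(w * zj for w, zj in zip(row, z)) for row in W_LOG_SCALE]
    return shift, log_scale


def flow_forward(x, z):
    """x -> nu: remove the z-dependent shift and scale (training direction)."""
    shift, log_scale = condition(z)
    return [(xi - m) * math.exp(-s) for xi, m, s in zip(x, shift, log_scale)]


def flow_inverse(nu, z):
    """(z, nu) -> x: exact inverse of flow_forward (generation direction)."""
    shift, log_scale = condition(z)
    return [n * math.exp(s) + m for n, m, s in zip(nu, shift, log_scale)]


def log_likelihood(x, z):
    """log p(x | z) under a standard normal prior on nu (change of variables)."""
    _, log_scale = condition(z)
    nu = flow_forward(x, z)
    log_prior = sum(-0.5 * n * n - 0.5 * math.log(2.0 * math.pi) for n in nu)
    log_det = -sum(log_scale)  # log |det d(nu)/d(x)| for a diagonal affine map
    return log_prior + log_det


rng = random.Random(0)
x = [24.5, 31.0, 0.6]  # hypothetical thermal comfort feature vector
z = [0.8, -0.3]        # stand-in for g(x) from the frozen classifier

# Bijectivity: encoding and then decoding with the same z recovers x.
nu = flow_forward(x, z)
x_reconstructed = flow_inverse(nu, z)

# Generation: keep z, draw a fresh local representation, invert the flow.
nu_new = [rng.gauss(0.0, 1.0) for _ in x]
x_synthetic = flow_inverse(nu_new, z)
```

Maximum-likelihood training would adjust the conditioning weights to increase `log_likelihood` over the training set; that optimization loop is omitted here. A practical model would stack several invertible layers so that the map from ν to x is far more expressive than one affine transform.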
In our previous work , we conducted an experiment to collect physiological signals (e.g., skin temperature at various parts of the body, heart rate) of 14 subjects (6 female and 8 male adults) and environmental parameters (e.g., air temperature, relative humidity) for 2-4 weeks (at least 20 h per day). The subjects also took an online survey, where they reported their thermal sensation (on a scale of -3 to +3) and thermal preference (Warmer, Cooler, No Change), among other parameters. For this work, we generated synthetic data for the 3 thermal preference classes (Warmer, No Change, Cooler) for 5 of the subjects. We designed fully-connected neural networks for the feature extractor, classifier, and conditional generator blocks. A test set is held out from the real dataset to be used for quantitative testing. We then compare the classification performance (Warmer/No Change/Cooler) on this test set for a classifier trained on real data vs a classifier trained on the generated synthetic data. Since the datasets are imbalanced, we report Cohen's kappa, accuracy and AUC score (together referred to as classification metrics). The classification results for a classifier trained on the real data vs a classifier trained purely on conditional synthetic data, both tested on a hold-out set of real data, are given in Table 1. The classifier trained with synthetic data from our proposed model achieves classification performance close to that of the classifier trained on real data. This shows the capability of our method to generate synthetic samples with a distribution that closely matches the real conditional data distribution.
Table 1: Thermal preference classification performance with classifiers trained on real and synthetic data. The first number of the pair in each box is the performance of a classifier trained on real data, while the second number is that of a classifier trained on synthetic data generated by our proposed model.
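To show why Cohen's kappa is reported alongside accuracy on imbalanced data, the following sketch computes both metrics from scratch for two hypothetical predictors on a toy test set. The labels and predictions are made up for illustration and are not results from this work; AUC is omitted because it needs class scores rather than hard predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that independent draws from the two
    # marginal label distributions happen to coincide.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        # Only one label occurs in both lists; kappa is undefined, so this
        # sketch returns 0.0 by its own convention (an assumption).
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical imbalanced test set: 65% "No Change".
y_true = ["No Change"] * 13 + ["Warmer"] * 4 + ["Cooler"] * 3

# Predictor A always outputs the majority class.
y_majority = ["No Change"] * 20

# Predictor B makes a few mistakes but recovers some minority-class samples.
y_model = (
    ["No Change"] * 11 + ["Warmer"] * 2   # true "No Change"
    + ["Warmer"] * 3 + ["No Change"] * 1  # true "Warmer"
    + ["Cooler"] * 2 + ["No Change"] * 1  # true "Cooler"
)

acc_majority = accuracy(y_true, y_majority)
kappa_majority = cohens_kappa(y_true, y_majority)
acc_model = accuracy(y_true, y_model)
kappa_model = cohens_kappa(y_true, y_model)
print(acc_majority, kappa_majority)
print(acc_model, kappa_model)
```

The majority-class predictor reaches 65% accuracy while its kappa is zero, since all of its agreement is explained by chance. Accuracy alone would therefore overstate performance on data with this class balance.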
We presented preliminary results for thermal comfort synthetic data generation using a state-of-the-art conditional synthetic data generation model. The results show that the generative model is capable of generating synthetic data that are close in distribution to the real data. There are numerous directions for future work building on these preliminary results. The network architectures of the models can be improved (e.g., with ResNets) for better results. Various scenarios can be explored, such as mixing and interpolation in the latent space to generate unseen data. A similar methodology can be extended to synthetic data generation in several more smart building use cases (Zou et al., 2019b,a; Konstantakopoulos et al., 2019; Chen et al., 2021; Periyakoil et al., 2021; Das et al., 2019b, 2020; Liu, 2018; Donti and Kolter, 2021; Jin et al., 2018).
Machine learning for sustainable energy systems
DermGAN: Synthetic generation of clinical skin images with pathology
Lessons learned from 20 years of CBE's occupant surveys
Breaking medical data sharing boundaries by employing artificial radiographs. bioRxiv
Biscuit: Building intelligent system customer investment tool
Whole-slide image focus quality: Automatic assessment and impact on AI cancer detection
Design, benchmarking and explainability analysis of a game-theoretic framework towards energy efficiency in smart infrastructure
Personal thermal comfort models based on physiological parameters measured by wearable sensors
Personal thermal comfort models with wearable sensors
Decoupling global and local representations via invertible generative flows
Environmental exposures in Singapore schools: An ecological study
Balancing thermal comfort datasets: We GAN, but should we?
Modeling tabular data using conditional GAN
Machine learning empowered occupancy sensing for smart buildings
WiFi and vision multimodal learning for accurate and robust device-free human activity recognition
Consensus adversarial domain adaptation
B. Chen, P. Donti, K. Baker, J. Z. Kolter, and M. Berges. Enforcing policy feasibility constraints through differentiable projection for energy optimization. arXiv preprint arXiv:2105.08881, 2021.
H. P. Das, P. Abbeel, and C. J. Spanos. Dimensionality reduction flows. arXiv preprint arXiv:1908.01686, pages 1-10, 2019a.