key: cord-0328423-w9e95et0 authors: Dhurandhar, Amit; Cecchi, Guillermo A.; Meyer, Pablo title: Expansive Linguistic Representations to Predict Interpretable Odor Mixture Discriminability date: 2022-04-11 journal: bioRxiv DOI: 10.1101/2022.04.11.487927 sha: 0cef52d71bf5c43d850bf785e5c001f9991e7179 doc_id: 328423 cord_uid: w9e95et0

Language is often thought of as being poorly adapted to precisely describe or quantify smell and olfactory attributes. In this work, we show that semantic descriptors of odors can be implemented in a model to successfully predict odor mixture discriminability, an olfactory attribute. We achieved this by taking advantage of the structure-to-percept model we previously developed for monomolecular odorants, using chemical descriptors to predict pleasantness, intensity and 19 semantic descriptors such as ‘fish’, ‘cold’, ‘burnt’, ‘garlic’, ‘grass’ and ‘sweet’ for odor mixtures, followed by metric learning to obtain odor mixture discriminability. Through this expansion of the representation of olfactory mixtures, our Semantic model outperforms state-of-the-art methods by taking advantage of the intermediary semantic representations learned from human perception data to enhance and generalize the odor discriminability/similarity predictions. As 10 of the semantic descriptors were selected to predict discriminability/similarity, our approach meets the need of rapidly obtaining interpretable attributes of odor mixtures, as illustrated by the difficulty of finding olfactory metamers. More fundamentally, it also shows that language can be used to establish a metric of discriminability in the everyday olfactory space.

Language is often thought of as being poorly adapted to precisely describe or quantify smell, and in particular olfactory properties such as the similarity or the discriminability of two molecular mixtures.
Contrary to this idea, we have previously shown that for pure odors it is possible to build models that use the chemical structure of molecules [1] to predict the perceptual values of natural language attributes of smells [2]. Also, older studies have shown that a direct comparison of pure odor profiles based on 146 semantic descriptors [3], or on as few as 25 semantic descriptors [4], can be used to generate a distance between two pure odors that is highly correlated with their similarity ratings (r > 0.85). In these studies several metrics were used, such as Euclidean and Chi-squared distances or Tversky similarity, and the last of these was found to best match the similarity between two pure odors [4]. These past results show that semantic descriptors can indeed be used to quantify smell attributes, at least in the reduced universe of monomolecular odors. Odor mixtures, the real-world situation for olfactory perception, add an extra level of complexity because the overall perceptual qualities of a mixture are not the sum of the qualities of the molecules composing it. This has led to what is, in our opinion, the erroneous suggestion that perception in olfaction, as in vision and audition, is object-oriented and that odor-objects are buried in such odor mixtures [5, 6]. Recent studies have shown that a model using a simple set of 21 chemoinformatic structural descriptors can predict, with a significant correlation, the odor similarity and discriminability between mixtures of molecules (r = 0.31 to r = 0.51, both for equi-intense odors and for odors of varying intensity) [7]. In this work, we show that, besides chemoinformatic features, semantic descriptors of odors can also be implemented in a highly effective model to predict odor mixture discriminability and similarity. Through analysis of the semantic relationship between olfactory attributes, this allows an interpretation of why two mixtures of molecules smell similar, beyond having components in common.
We do this by taking advantage of the structure-to-percept model we previously developed for pleasantness, intensity and 19 semantic descriptors describing monomolecular odorants [1], to build a model able to predict the discriminability between any two odor mixtures as measured by a triangle test (see Figure 1). This approach not only meets the need of rapidly obtaining interpretable predictions of odor mixture discriminability, but also shows that language can be used to establish a metric in the everyday olfactory space.

Interpretable metric learning from semantic attributes

Instead of taking a direct approach to predict odor mixture discriminability from chemical descriptors [8, 7], we decided to implement a model with an intermediary step that expands the dimensions using semantic descriptors [1, 2] (see Figure 1). Discriminability experiments measure the percentage of subjects who can identify the odd odor when presented with 3 vials, 2 of which contain an identical odor mixture, sometimes diluted [9, 6]. The expansion takes advantage of a model we previously developed [1, 2] to predict, from the chemoinformatic structural descriptors of any molecule, the values for intensity, pleasantness and the 19 semantic descriptors (see Figure 2a Top). We then averaged each of these 21 values across all the molecules in the mixture, irrespective of their number, to obtain the perceptual values of the mixture. The underlying approximation is that the perceptual contribution of each molecule to the olfactory mixture is independent.

Figure caption: A discrimination test consists of identifying which one of the 3 vials has a different odor, and is quantified as the fraction of subjects that perform this task correctly. Top: Our solution consists of implementing a model that predicts semantic attributes of pure molecules from chemoinformatic features and then performing metric learning to obtain the discriminability value. Bottom: The existing solution consists of directly predicting discriminability, using for each mixture the values of 21 selected chemoinformatic features to calculate an angle distance between the two vectors. In both models the chemoinformatic features from the different molecules composing the mixture are integrated linearly.

Given the clear limitations of this approximation, and instead of trying several metrics as previously done [4], we decided to perform metric learning [10] to match the defined distance between the predicted perceptual values of any two mixtures to their experimentally measured discriminability. We implemented a Mahalanobis distance, that is, a weighted Euclidean distance between the semantic descriptors of the two mixtures in the olfactory perceptual space. Namely, the squared Mahalanobis distance between $x_i$ and $y_j$ is given by $d^2(x_i, y_j) = (x_i - y_j)^\top \Sigma^{-1} (x_i - y_j)$, where $\Sigma$ is a covariance matrix that we chose to be diagonal to allow interpretability of the fitted weights. Each descriptor weight was initialized to one and then fitted in order to predict the fraction of subjects that can discriminate a given pair of odor mixtures, obtained from an experimental dataset (see Figure 1 and Methods). Because in Bushdid et al. subjects discriminated up to 260 different intensity-matched odor mixtures, composed of 10 to 30 molecules and with a varying number of shared components (see Methods), it seemed a well-suited dataset on which to perform metric learning (see Figure 2a Middle). The weights obtained for 'intensity', 'pleasantness' and the 19 semantic descriptors when performing a 10-fold cross-validation scheme for the metric learning are shown by decreasing importance in Table 1. As we performed a Lasso regression to obtain the best metric fit using a minimal set of descriptors, 11 of the 21 weights are null and those descriptors do not contribute to the metric (see Table 1).
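The averaging and metric-learning steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume that the regression features are the per-descriptor squared differences between the two mixtures (so that a diagonal Mahalanobis distance becomes a linear model), and we use a small hand-written coordinate-descent Lasso so that the sketch needs only the standard library. All function names (`mixture_percept`, `squared_diff_features`, `fit_lasso`, `predict_discriminability`) and the regularization parameter `alpha` are hypothetical.

```python
from statistics import fmean


def mixture_percept(molecule_percepts):
    """Average predicted percepts over the molecules of one mixture.

    molecule_percepts holds one list of descriptor values per molecule.
    """
    return [fmean(column) for column in zip(*molecule_percepts)]


def squared_diff_features(x, y):
    """Per-descriptor squared differences (assumed feature construction)."""
    return [(a - b) ** 2 for a, b in zip(x, y)]


def soft_threshold(value, threshold):
    """Shrink value toward zero; this is what sets Lasso weights exactly to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(features, targets, alpha, n_sweeps=100):
    """Coordinate-descent Lasso with an unpenalized intercept.

    Minimizes (1 / 2n) * sum((t - bias - w . f)^2) + alpha * sum(|w|).
    """
    n, p = len(features), len(features[0])
    col_means = [fmean(column) for column in zip(*features)]
    target_mean = fmean(targets)
    # Centering features and targets lets the intercept be recovered at the end.
    centered = [[f - m for f, m in zip(row, col_means)] for row in features]
    centered_targets = [t - target_mean for t in targets]
    weights = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            norm = 0.0
            for row, t in zip(centered, centered_targets):
                # Prediction from every descriptor except descriptor j.
                fit_without_j = sum(
                    w * f for k, (w, f) in enumerate(zip(weights, row)) if k != j
                )
                rho += row[j] * (t - fit_without_j)
                norm += row[j] ** 2
            if norm > 0:
                weights[j] = soft_threshold(rho / n, alpha) / (norm / n)
            else:
                weights[j] = 0.0
    bias = target_mean - sum(w * m for w, m in zip(weights, col_means))
    return weights, bias


def predict_discriminability(x, y, weights, bias):
    """Shifted, diagonally weighted squared distance between two mixtures."""
    features = squared_diff_features(x, y)
    return bias + sum(w * f for w, f in zip(weights, features))
```

In this sketch each fitted weight plays the role of one diagonal entry of $\Sigma^{-1}$, and descriptors whose weight is shrunk to exactly zero drop out of the metric. In practice a library Lasso implementation with cross-validated regularization would replace `fit_lasso`.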
The regression was also performed with a constant term, defining a lower limit for the distance, whose value was 0.5185. The good performance of this metric learning as measured by Root Mean Square Error (RMSE) can be seen in Figure [...] Snitz 2). In both cases we excluded experiments where the mixtures were the same. Although in these datasets the subjects were asked to rate the similarity between mixtures from 1 to 100, we obtained the discriminability as 1 minus the similarity re-normalized between 0 and 1. The second external dataset was obtained from a discriminability experiment in a recently published article [7], where the odor mixtures were not intensity-matched and 50 pairs of different odor mixtures were drawn from a total of 120 mixtures composed of 10 to 30 molecules (Ravia). For all these heterogeneous external datasets, our Semantic model trained on data from Bushdid et al. fared similarly and was much better than the Direct model, whose RMSE [...]

Having a well-performing, validated model, we now interpret its implications. As observed in Table 1, the metric learning reduced the number of necessary attributes from 21 to 10, which are well distributed across the olfactory semantic space (see blue-colored attributes on the dendrogram in Fig. 3). Not only that, but the 11 attributes that were discarded are either neighbors of another attribute in the dendrogram, e.g. 'grass' and 'wood', or in its near vicinity, as if the Lasso regression were weeding out redundant terms as measured by their semantic similarity. This points to the relevance of using language to quantify the properties of smell. Also, it is interesting to note that 10 dimensions have come up in several studies as seemingly underlying the structure of the olfactory space [11].

Figure caption: Diagram showing how the Lasso metric learning for odor discriminability was able to extract 'intensity' and 9 optimal semantic descriptors out of 19, shown in the table by order of importance. The dendrogram below shows 131 olfactory descriptors ordered by semantic similarity, as measured by their cosine distance using vectors from word embeddings [12, 2]; the 10 optimal descriptors are shown in blue and are evenly distributed along the dendrogram. The 11 other descriptors, in red, were eliminated during the optimization for being redundant (next to a blue descriptor) or for not contributing to the metric. If the semantic attributes used to describe the mixtures are different from the ones in blue, a transformation using the cosine distance between descriptors, as used to construct the dendrogram, can be implemented to project their values onto the blue ones used in the model.

The Semantic model can also be used to study the olfactory space in two dimensions, exploring how the distance between 2 mixtures changes with the number of molecules and the overlap between the molecules composing them (Fig. 4a & [...]). The abrupt transition from discriminable to indiscriminable mixtures in the olfactory space may be a clue to why it has been so difficult to prove or disprove the existence of olfactory metamers, i.e. odors that smell the same but have no molecular components in common [7]. Hence, we next investigated in more detail the relationship between the Semantic model predictions for the learned metric and the actual discriminability values for the 3 datasets, for a total of [...]

Figure caption (fragment): [...] where no metamers can exist, given that the lower tail in the perceptual curve is absent and the Just Noticeable Difference (JND) is small; the green curve also has a small JND, but a lower tail allows the existence of metamers; and red is a curve with no lower tail but a large JND that allows the existence of metamers.

[...] between odors that are reliably discriminated (see also 4d). Indeed, the diagram of Figure 4d represents two different situations where metamers could exist and one where they do not.
If the JND is small, as in our case, a lower tail in the perceptual curve is necessary in order to theoretically be able to find metamers (green and blue curves in Fig. 4d). Indeed, if metamers exist in the transition part of the curve, its steepness means that even a small difference between mixtures can generate a large perceptual change, making metamers difficult to find. If the JND is large, then no lower tail is needed to allow the existence of metamers (red curve in Fig. 4d), as they can now be found in the transition phase of the curve.

We here describe a universal model to predict the discriminability and similarity between any pair of molecular mixtures. We achieved this by performing a linear integration of predictions of 21 olfactory attributes of pure odors, using molecular descriptors, to generate the percepts for odor mixtures, and then fitting a Mahalanobis distance to the discriminability score. This allowed us to expand the description of olfactory perception, and only 10 of the olfactory attributes were needed to perform better than the state of the art on several unseen datasets for the more relevant RMSE metric, and comparably for Pearson correlation. We also show that, as perceptual values can be converted, for any set of smell attributes, to the 19 semantic descriptors [2], the approach described here can be generalized. Overall, our results show that language can indeed be implemented in a general manner as a measure of smell attributes and can be used to explain properties of the olfactory perceptual space. For example, our model shows that the difficulty of finding olfactory metamers is due to the steepness, i.e. small JND, and the absence of a tail in the olfactory perceptual curve (Fig. 4a).
These results demonstrate that language can be used to establish an interpretable metric of odor discriminability in the everyday olfactory space, and hence open the possibility of establishing it as a more general tool to quantify the effect of known changes in olfactory perception in diseases such as Parkinson's [14], schizophrenia [15] and, more recently, COVID [16].

The predictions and validations for the Interpretable Semantic Olfactory Mixture Discriminability model were generated with the following steps:
1) Determine structural properties for each molecular mixture using Dragon descriptors
2) Predict the 'intensity', 'pleasantness' and 19 se- [...]

[...] the same were excluded. Although in these datasets the subjects were asked to rate the similarity between mixtures from 1 to 100, we obtained the discriminability as 1 minus the similarity re-normalized between 0 and 1. The Ravia dataset was obtained from a discriminability experiment in a recently published article [7], where the odor mixtures were not intensity-matched and 50 pairs of different odor mixtures were drawn from a total of 120 mixtures composed of 10 to 30 molecules (Ravia). Note that the model was trained to predict in the [0, 1] interval, but its values for certain pairs of odors go above this limit.

We now describe the details of implementation of the Semantic model. Predictions for the Direct model were kindly shared by Aron Ravia. In our method we want to learn an interpretable yet accurate metric that maps the similarity/discriminability between two odor mixtures in the semantic space to the experimentally measured human discriminability. Thus our method consists of the following steps:
1. Obtain semantic descriptors for odor mixtures. This could be done by using the olfactory models proposed in [1] to obtain mono-molecular predictions.
The mono-molecular predictions of the molecules composing a mixture are averaged along each of the 21 dimensions to obtain the perceptual prediction for the mixture.
[...]
4. Output the model M.

Note that the above model is interpretable, since each feature $f_i$ corresponds to a specific semantic descriptor, the same one that defines $x_i$ and $y_i$ (e.g. "fruity"). Moreover, the trained model is a sparse linear model, so weights are assigned to (few) individual $f_i$, making the final output easy to interpret. In a certain sense, we learn a shifted Mahalanobis distance, as the bias term may be non-zero, quantifying the human discriminability between the mixtures. The design choices make the process interpretable and, as shown in Fig. 4c, the semantic descriptors are selected so as to cover the semantic space and avoid redundancy: between "warm" and "cold" only the latter is chosen, and similarly between "grass" and "wood". A more general model would not necessarily maintain this interpretability, for example when using a complex non-linear model or a neural network instead of Lasso. Interpretability would also be lost if the new features were constructed as a concatenation of the original semantic features, leading to a high-dimensional feature space that is then used for model training.

[1] Predicting human olfactory perception from chemical features of odor molecules.
[2] Predicting natural language descriptions of mono-molecular odorants.
[3] Comparison of odors directly and through profiling.
[4] Olfactory quality: From descriptor profiles to similarities. Chemical Senses.
[5] The perception of odor objects in everyday life: a review on the processing of odor mixtures.
[6] Philosophy of olfactory perception.
[7] A measure of smell enables the creation of olfactory metamers.
[8] Predicting odor perceptual similarity from odor structure.
[9] Humans can discriminate more than 1 trillion olfactory stimuli.
[10] Metric learning: A survey. Foundations and Trends in Machine Learning.
[11] In search of the structure of human olfactory space.
[12] Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models.
[13] Elad Schneidman, and Noam Sobel. Perceptual convergence of multi-component mixtures in olfaction implies an olfactory white.
[14] Olfactory impairment predicts cognitive decline in early Parkinson's disease.
[15] Olfactory impairment in first-episode schizophrenia: a case-control study, and sex dimorphism in the relationship between olfactory impairment and psychotic symptoms.
[16] More than smell-COVID-19 is associated with severe impairment of smell, taste, and chemesthesis. Chemical Senses.
[17] Regularization and variable selection via the elastic net.

We thank Aaron Ravia for sharing his predictions and other materials, Pablo Polosecki for extensive discussions, and the editor and reviewers for their useful comments.