key: cord-0990280-9k3fdw4d authors: Flores-Gallegos, N. title: Rényi’s divergence as a chemical similarity criterion date: 2021-11-22 journal: J Math Chem DOI: 10.1007/s10910-021-01307-6 sha: 4f512a6d341aef6d41c47d390b0b2d9785269b6c doc_id: 990280 cord_uid: 9k3fdw4d In this work, a new version of Rényi’s divergence is presented. The expression obtained is used as a tool to identify molecules that could share some chemical or structural properties, and a data basis set of 1641 molecules is used in this study. Our results suggest that this new form of Rényi divergence could be a useful tool that will eventually permit complementary studies in which the main goal is to obtain molecules with similar properties. Perhaps, currently in the modern sciences, there are no clear barriers between fields; for example, let us consider two different areas such as biology and computer sciences which in the last decade formed a field called bioinformatics. A similar situation happens with chemistry, in where there are no clear separations between the modern chemistry, physics, mathematics, and computer sciences; in this regard, a new discipline of chemistry called cheminformatics or molecular informatics has emerged in recent years. In this area, one of the objectives is to find molecules that may have some chemical or physical properties of interest. In the last year, in which, for practically the entire population of the world, it was suggested to stay isolated at home, all the scientific community was called to participate in the 'JEDI Grand Challenge: Billion molecules against COVID-19'. In an announcement by The Joint European Disruptive Initiative that was published on the main page of the journal 'Molecular Informatics', one can read, 'The Joint European Disruptive Initiative has announced a May 1st launch of the Billion Molecules against Covid19 Grand Challenge, with awards of up to € 2 million for the winners.' for which the main objective is 'to screen billions of molecules with blocking interactions relevant to SARS-CoV-2 and fast-track the route to a therapeutic treatment. ' It is interesting to note that to address a problem such as the mentioned before, there are two necessary ingredients; the first, of course, is the equations, which must be defined using solid fundamentals, instead of empirical, heuristic, or statistical assumptions, and the second is the algorithms, that is, how can we improve the search of biological, chemical, or physical properties that a group of molecules could have in common to obtain a biological or chemical agent that permits to define of a set of molecules that may act as a good treatment or even a vaccine. In this sense, cheminformatics is an area that has a strong link with QSAR analysis [1] ; however, it is not common to find that such analyses are complemented by descriptors based on the formal fundamentals of theoretical chemistry. In this regard, perhaps the first proposal of a similarity coefficient based on electron density was done in the late 1970s by Carbó [2] . On the other hand, according to Willett [3] , a cheminformatics analysis has three components, (i) a structure representation, (ii) a weighting scheme, and (iii) a similarity coefficient; in the first case, the analysis consists of performing a comparison of the structure between two molecules, using as a criterion the structural representation of the molecules in 2D or 3D; in the second case, one has to define a fingerprint of several molecules considering the number of the apparition of certain functionals groups, and in the last case, the analysis consists of choosing certain coefficients defined in terms of the number of features present in the compounds to analyze. In this sense, clearly one can define a vast number of such coefficients [4] ; for example, in Todeschini's work, the authors analyzed 51 different coefficients [5] , which could be an attractive aspect to perform some studies in the field of medicinal chemistry [6] . However, it is worthwhile to mention that the majority of these coefficients were defined in a heuristic way. In this work, we propose the use of a new definition of Rényi's divergence that can be used as a tool to find molecules that could have some chemical or structural properties in common. In the following section, we present a brief background of Rényi's divergence and its mathematical properties. Since the publication of Shannon's work in the 1940s, information theory [7] has been applied in several areas of science and technology. In chemistry, the use of the ideas of information theory started in the 1970s, and these studies were principally focused on the study of atoms and simple systems using Shannon's entropy in position and momentum spaces; in the 1990s, with the advantages of computers, it was possible to extend these studies to simple molecules and some simple chemical processes. In this line, it was also possible to show that Shannon's entropy can be related to some important concepts of quantum chemistry, such as the chemical reactivity, electron correlation, and structural stability of the molecules [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] . On the other hand, in the 1960s, the idea of generalizing Shannon's entropy emerged, and today, one can find three generalizations of it: (i) Rényi's entropy [18] (ii) the Havrda and Charvát entropy [19] and (iii) Tsallis' entropy [20] . The first two were proposed in the 1960s, while the third one was proposed at the end of the 1980s. Considering a historical point of view, perhaps Rényi's work can be considered the first attempt to generalize Shannon's entropy. Since the 1990s, Rényi's entropy has been profusely used in the field of atomic and molecular physics [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] , and these works were focused on the study of the electron correlation phenomenon. Nevertheless, we would like to address the attention of the readers to another concept, also proposed by Rényi, which is the divergence, and it can be interpreted as a measure that compares and quantifies the difference between two probabilistic distributions. It is defined as and in which the sets {p i } and {q j } are probabilistic distributions that fulfill 0 ≤ p i ≤ 1 , 0 ≤ q j ≤ 1 , and ∑ n i=1 p i = 1 , ∑ m j=1 q j = 1 . In both cases, the term , can take values of = 0, 1, … , but with the exception that ≠ 1 . Here, k nm = nm , and n and m are the sizes of the sets p i and q j , respectively. These constants prevent the numerical overestimation of the sums (see "Appendix"). From Eqs. (1) and (2), one can infer easily that R ij, ≠ R ji, ; because of this, we need to ensure that the expression used to calculate the divergence is symmetric; therefore, our proposal for the Rényi's divergence is If one applies the logarithm properties to this equation, then it can be reduced to where k nm = nm ; Eq. (3) and consequently Eq. (4) both fulfill the following criteria [37] : 1. R is a continuous function of p 1 , p 2 , … , p n and q 1 , q 2 , … , q m . 2. R is permutationally symmetric; that is, the total value of this measure does not change if the elements p 1 , p 2 , … , p n and q 1 , q 2 , … , q m are labeled in a different way, which implies that the pairs (p 1 , q 1 ), (p 2 , q 2 ), … can permute among themselves. 3. R ≥ 0. 4. Based on the previous property, then one can infer that the minimum value of R is zero. In the field of information theory, it is well known that Rényi's entropy is a generalization of Shannon's entropy; in this regard, Rényi's divergence is a generalization of several divergence measures. For example, if one considers = 0 , Eq. (4) becomes, An interesting aspect of this equation is that if one applies the exponential function to obtain e −R =0 , one obtains a discrete version of Carbó's similarity coefficient, which is related to the overlap of two distributions [38] [39] [40] [41] . which is known as the Bhattacharyya coefficient [42] and has been used extensively in the field of computational sciences as a measure of overlap between two statistical distributions. When → 1 , Eq. In general, we can say that Eqs. (5), (6), (7) and (8) can be used as measures that permit the comparison of two different probabilistic distributions. In this regard, if one extrapolates these equations to the context of quantum chemistry, then one obtains a set of so-called quantum similarity coefficients. Thus, if one wants to express Eq. (4) in its quantum version, then it is necessary to rewrite it in terms of two diagonalized density matrices, namely, A and B . These density matrices are related to the quantum states of the systems A and B; therefore, we can write Eq. (4) in terms of the trace of A and B to obtain where the superscript indicates the quantum version of Eq. (4), which can be rewritten in the following compact and simplest form: To apply Eq. (9) to systems such as atoms, molecules, macromolecules, it is necessary to use two diagonalized density matrices, in which each element of their trace is an occupation number; in the theoretical chemistry field, the occupation numbers are subject to 0 ≤ k ≤ 2 , ∑ l k=1 k = N , where N is the electron number. It is convenient to clarify that such numbers are restricted to the values of 0, 1, 2 when they are obtained through mono-determinantal methodologies, such as HF or DFT; moreover, when they are obtained through multideterminantal methodologies as CI or CC, they are subject to 0 ≤ k ≤ 2 , where k ∈ ℝ and ∑ l k=1 k = N . Therefore, to use such numbers with Eq. (10) they have to be normalized as � k = k ∕N . This new set of numbers is subject to 0 ≤ ′ k ≤ 1 , and ∑ l k=1 � k = 1 , and now the set { � k } fulfills all criteria of a probabilistic distribution, which is one of the main ingredients of any information measure. Perhaps the most interesting aspect of Eq. (10) is that it permits comparison of the quantum information of two molecular systems, which are described by two density matrices, and such matrices, according to Dirac, completely describe the state of the quantum systems [44] . In this sense, it is not hard to see that Eq. (10) can represent a quantum superposition; that is, in the context of quantum mechanics, a quantum superposition is considered a sum of states to obtain a new state. For example, if one considers Eq. (9) with = 0 , it is evident that this equation is reduced to a sum of logarithms, and each logarithm has an argument for the states, A and B ; therefore, one can interpret the numerical results obtained with Eq. (10) also as a quantum superposition. On the other hand, in the context of density functional theory, there are three main concepts linked to chemical reactivity, hardness, softness, and chemical potential [45] , which are defined as where the chemical potential, , is define as and is Mulliken's electronegativity definition [46] . Equations (11) , (12) and (13) can be rewritten in terms of the electron affinity, A, and the ionization potential, I, as and Using Koopman's theorem and under the LCAO approximation, the last three equations can be rewritten in terms of the HOMO and LUMO energies as and In general, Eqs. (17), (18) and (19) are used as powerful tools to study how a molecule can react under different chemical conditions; for this reason, we consider that such equations are important to complement our study. In the following section, we present some numerical results when Eq. (10) is applied to a large set of molecules; our analysis is complemented by the numerical results of Eqs. (17) , (18) and (19) . To apply and to show the versatility of Eq. (10), a set of 1641 molecules defined by Blair and Thakkar [47] was chosen. Such molecules are closed-shell molecules, and according to the authors, the molecules contain up to 34 atoms and 246 electrons, which can be used to study the interrelationships among the properties of molecules of organic, biochemical, and pharmaceutical interest. To perform the calculations of this study, we used Eq. (10) with the Löwdin occupation numbers obtained with the CISD method and DGDZVP basis set, while to calculate the hardness, softness and chemical potential in terms of the HOMO and LUMO energies, we used the functional M062x with the DGDZVP basis set. All calculations were performed with Gaussian 09 [48] . For practical proposes, we chose in a random way one molecule of the set of molecules used. The molecule chosen was number 750, tetranitromethane, C(NO 2 ) 4 , which is a highly explosive chemical used as an oxidizer in rocket propellants and is used to increase the cetane number of diesel fuels in the manufacture of liquid explosives. It has been classified as possibly carcinogenic to humans [49] . In Tables 1, 2 , 3 and 4, we present the numerical results of Eq. (10) using the values of = 0, 0.5, 1, 2 , and in Figs. 2, 3, 4 and 5 are depicted the molecules that according to the numerical results of Eq. (10) may share some properties. At this point, we consider that it is opportune to mention that we report only the numerical results of the molecules that are similar, instead of the 1641 numerical values of q R for each value considered in this study; however, we present the general trends of q R in Fig. 1a -d. For each case, a percentage of difference of q R among the molecules was chosen because, for example, if one observes Fig. 1a , in which the location of the molecule 750 is highlighted with red lines and follows the horizontal line, one can appreciate that there is a dense section in which several molecules appear. For that reason, we selected the molecules with a difference of 0.0001%, while for q R with = 0.5, 1, 2 , the difference used was 0.2%, 0.3% and 0.1%. To obtain small sets of molecules, clearly, one can change these percentage values to increase the size of the sets; on the other hand, if one follows the vertical line on Fig. 1a , one observes that there are several isoelectronic molecules, with N = 98 , but in the majority of the cases, the value of q R differentiates all these molecules. In addition, in Fig. 1a -d, it is also possible to observe the effect of the coefficient on q R ; in such figures, we noticed that for values of ∈ [0, 1] , the general trends have a convex behavior, while for the interval ∈ (1, 2] , the trends are concave. Table 2 Numerical results of Eq. (10) with = 0.5 The molecule of reference was tetranitromethane with q R = 25.3553653711884 , N = 98 , = 0.1658800 , S = 3.0142271 , and = −0.287170 . Where, mol58: azulene, mol86: bicyclo-3,3,1-nonane, mol149: cinnamaldehyde, mol153: cis-bicyclo-6,1,0-nonane, mol239: difuro[3,2-b:2′,3′-d]furan, mol567: naphthalene, mol607: p-dichlorobenzene, mol776: trans-2,3-dimethylnorborane, mol815: vinylsulfonylbenzene, mol838: 1′H-1,2′-bipyrrole, mol860: 1-butoxybutane, mol948: 1-octanol, mol949: 1-p-tolylethanone, mol999: 1,1,1,2-tetramethoxyethane, mol1132: 1,4-dimethylnorborane, mol1335: 2,1H-pyrrol-2-yl-1H-pyrrole, mol1487: 3-methyl-1H-indole Table 3 Numerical results of Eq. (10) with = 1 The molecule of reference was tetranitromethane with q R = 10.378971157751 , N = 98 and = 0.1658800 , S = 3.0142271 , and = −0.287170 . mol204: di-t-butyl sulfoxide, mol563: naphthalen-1-amine, mol565: naphthalen-2-amine, mol815: vinylsulfonylbenzene, mol879: 1-chloro-1,3-butadiene Therefore we can say that depending on the value used, we can find more or fewer molecules because some could be or could not be in the same position. This can be inferred easily by comparing the position of the molecule of reference used, which is depicted in the figures mentioned previously. Observing the four tables, one can note that depending on the value used, different molecules are found that according to Eq. (10) are informationally equivalent. For example, with = 0 (see Table 1 and Fig. 2) , the molecules 750, 1041, 1153 and 1163 are equivalent under the perspective of q R . These molecules have structures constituted mainly by rings, and three of them present aromaticity, while the Table 4 Numerical results of Eq. (10) with = 2 The molecule of reference was tetranitromethane with q R = 5.48769619636878 , N = 98 and = 0.1658800 , S = 3.0142271 , and = −0.287170 . mol242: diisobutyl sulfone, mol546: N-pyrazin-2-ylmethanesulfonamide, mol879: 1-chloronaphthalene, mol1640: 9H-purine-2,6,8-trione structure of tetranitromethane is tetrahedral (see Fig. 2 ). From a chemical reactivity point of view, the molecules 1041, 1153, and 1163 have similar values of softness, hardness, and chemical potential; however, if we compare such values with the values of tetranitromethane, we find that molecule 1163 has the closest value of hardness and softness but that its value of chemical potential differs significantly. Thus, based on the numerical results, we conclude that 1R-1,2,2-trifluoroethoxybenzene is similar to tetranitromethane. In Table 2 are reported the numerical results of Eq. (10) with = 0.5 , and the graphic representation of the molecules of this case is shown in Fig. 3 . The first general observation that we note is that with this value of , it is possible to find more molecules in comparison with the previous case, but this case, the numerical values of q R present a percentage difference of 0.2%. As in the previous case, this set of molecules constitutes mostly rings, but in this case, there are two linear molecules present in the set (see Fig. 3 ). Now, if we compare the values of softness, hardness, and chemical potential, it is possible to reduce this set of molecules. In this case, we find that only molecules 815 and 838 have similar hardness values = 0.15184 [a.u.] and = 0.151855 [a.u.] in comparison with the value for tetranitromethane of = 0.16588 [a.u.]. A similar situation happens with the hardness; however, the chemical potential of molecules 815 and 838 differs by almost 0.1 a.u with respect to that of tetranitromethane. In Tables 3 and 4 , we present the results of q R . Using the values of = 1 and 2, as in the first case, these values of produce values of q R with a small percentage difference. In both sets are present molecules with rings, and all they have some of the most electronegative atoms; nevertheless, in the case of the set with = 2 we also find two linear molecules (see Figs. 4 and 5) . When a value of = 1 is considered, only molecules 204 and 815 have similar hardness and softness values, while Table 1 the chemical potential is different by more than 50% with respect to tetranitromethane. On the other hand, with the value of = 2 , the four molecules differ considerably in all their chemical reactivity parameters with respect to tetranitromethane. 3 Graphic representation of the molecules: mol58, mol86, mol149, mol153, mol239, mol567, mol607, mol776, mol815, mol838, mol860, mol948, mol949, mol999, mol1132, mol1335, mol1487, used in Table 2 In Table 5 , we summarize our findings of q R . In this case, it was possible to reduce the set of molecules to four molecules. Based on this table, we note that molecule 204, di-t-butyl sulfoxide, with = 1 , has hardness and softness values close to tetranitromethane, which is the molecule of reference. This result may be explained in terms of the structure of both molecules (see Fig. 6 ). In this figure, one can observe that both molecules are semispherical, while the rest of them have rings and are semiplanar. Another interesting observation of the results presented in Table 5 is that the numerical Fig. 4 Graphic representation of the molecules: mol204, mol1041, mol1153, mol1163, used in Table 3 Fig. 5 Graphic representation of the molecules: mol242, mol546, mol879, mol1640, used in Table 4 results of q R with = 0.5, 1 suggest that molecule 815, vinylsulfonylbenzene, has similar properties to tetranitromethane; however, we also note that while the hardness and softness in several cases match with the molecule of reference, the chemical potential does not. In conclusion, we can say that the version of Rényi's divergence presented in this work could be an effective tool used in conjunction with the descriptors of the chemical reactivity to complement studies where the main propose is to find some chemical systems that may share some properties. Fig. 6 Graphic representation of the molecules: mol204 mol815, mol838, mol1163, used in Table 5 In this work, we presented a new version of Rényi's divergence defined in terms of two diagonalized density matrices. The equation proposed in this work satisfies rigorously all the properties of any informational measure. Our results suggest that Eq. (10) with values of = 0.5 and = 1 produces the best sets of molecules that may share some properties. Finally, we consider that our expression of Renyi's divergence could be a good candidate to complement studies in which the main goal is the search for chemical systems with similar properties or as a tool to classify sets of molecules that may share some physical or chemical properties. Proc. Fourth Berkeley Symp Reviews in Computational Chemistry Density-Functional Theory of Atoms and Molecules Acknowledgements N. Flores-Gallegos wishes to thank the CONACyT, the PRODEP-SEP program for support. The author declare that he has no conflict of interest. In this appendix, we present an explicit development on how we transform Eq. (3) into Eq. (4); first let us consider Eq. (3), When one applies the logarithm properties, ln(a) + ln(b) = ln(ab) , then the term ab produces n × m operations; however, it is important to remark that the term ln(a) + ln(b) produces n + m operations. For this reason, it is necessary to introduce a constant k nm = nm in each term of Eq. (2) to avoid such inconsistency; in the following lines, we show how this constant, k nm , works, From this last equation, it is clear that,