A parametrized family of Tversky metrics connecting the Jaccard distance to an analogue of the normalized information distance

Bjørn Kjos-Hanssen, Saroj Niraula, Soowhan Yoon

2021-11-03

Abstract. Jiménez, Becerra, and Gelbukh (2013) defined a family of "symmetric Tversky ratio models" $S_{\alpha,\beta}$, $0\le\alpha\le 1$, $\beta>0$. Each function $D_{\alpha,\beta}=1-S_{\alpha,\beta}$ is a semimetric on the powerset of a given finite set. We show that $D_{\alpha,\beta}$ is a metric if and only if $0\le\alpha\le\frac12$ and $\beta\ge 1/(1-\alpha)$. This result is formally verified in the Lean proof assistant. The extreme points of this parametrized space of metrics are $J_1=D_{1/2,2}$, the Jaccard distance, and $J_{\infty}=D_{0,1}$, an analogue of the normalized information distance of M. Li, Chen, X. Li, Ma, and Vitányi (2004).

Distance measures (metrics) are used in a wide variety of scientific contexts. In bioinformatics, M. Li, Badger, Chen, Kwong, and Kearney [12] introduced an information-based sequence distance. In an information-theoretical setting, M. Li, Chen, X. Li, Ma, and Vitányi [13] rejected the distance of [12] in favor of a normalized information distance (NID). The Encyclopedia of Distances [3] describes the NID, on page 205 out of 583, as
$$\frac{\max\{K(x\mid y^*),\,K(y\mid x^*)\}}{\max\{K(x),\,K(y)\}},$$
where $K(x\mid y^*)$ is the Kolmogorov complexity of $x$ given a shortest program $y^*$ to compute $y$. It is equivalent to be given $y$ itself in hard-coded form:
$$\frac{\max\{K(x\mid y),\,K(y\mid x)\}}{\max\{K(x),\,K(y)\}}.$$
Another formulation (see [13, page 8]) is
$$\frac{K(x,y)-\min\{K(x),\,K(y)\}}{\max\{K(x),\,K(y)\}}.$$
The fact that the NID is in a sense a normalized metric is proved in [13]. Then in 2017, while studying malware detection, Raff and Nicholas [14] suggested the Lempel-Ziv Jaccard distance (LZJD) as a practical alternative to the NID. As we shall see, this is a metric. In a way this constitutes a full circle: the distance in [12] is itself essentially a Jaccard distance, and the LZJD is related to it as Lempel-Ziv complexity is to Kolmogorov complexity. In the present paper we aim to shed light on this back-and-forth by showing that the NID and Jaccard distances constitute the endpoints of a parametrized family of metrics.

For comparison, the Jaccard distance between two sets $X$ and $Y$, and our analogue of the NID, are as follows:
$$J_1(X,Y)=\frac{|X\setminus Y|+|Y\setminus X|}{|X\cup Y|},\qquad J_{\infty}(X,Y)=\frac{\max\{|X\setminus Y|,\,|Y\setminus X|\}}{\max\{|X|,\,|Y|\}}.$$
Our main result, Theorem 11, shows which interpolations between these two are metrics. Incidentally, the names of $J_1$ and $J_{\infty}$ come from the observation that they are special cases of $J_p$ given by
$$J_p(X,Y)=\frac{\bigl(|X\setminus Y|^p+|Y\setminus X|^p\bigr)^{1/p}}{\bigl(|X\setminus Y|^p+|Y\setminus X|^p\bigr)^{1/p}+|X\cap Y|}.$$
We conjecture that $J_p$ is a metric for each $p$, but shall not attempt to prove it here.

The way we arrived at $J_{\infty}$ as an analogue of the NID is via Lempel-Ziv complexity. While there are several variants [11, 18, 19], the LZ 1978 complexity [19] of a sequence is the cardinality of a certain set, the dictionary.

Definition 1. Let LZSet(A) be the Lempel-Ziv dictionary for a sequence A. We define the LZ-Jaccard distance LZJD by
$$\mathrm{LZJD}(A,B)=1-\frac{|\mathrm{LZSet}(A)\cap\mathrm{LZSet}(B)|}{|\mathrm{LZSet}(A)\cup\mathrm{LZSet}(B)|}.$$
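To make Definition 1 concrete, here is a minimal Go sketch (Go being the language used for our clustering experiments below); it builds an LZ78-style dictionary and evaluates the LZ-Jaccard distance on two strings. This is only an illustration: the parsing shown is one common variant of the 1978 scheme, and the function names and example strings are ours, not from the paper.

```go
package main

import "fmt"

// lzSet returns an LZ78-style dictionary of s as a set of phrases: the input
// is parsed greedily, each new phrase being the longest previously recorded
// phrase followed by one more character. Any trailing partial phrase is
// ignored in this sketch.
func lzSet(s string) map[string]bool {
	dict := make(map[string]bool)
	phrase := ""
	for _, c := range s {
		next := phrase + string(c)
		if dict[next] {
			phrase = next // keep extending a known phrase
		} else {
			dict[next] = true // record a new phrase
			phrase = ""
		}
	}
	return dict
}

// lzjd computes the LZ-Jaccard distance of Definition 1:
// 1 - |LZSet(A) ∩ LZSet(B)| / |LZSet(A) ∪ LZSet(B)|.
func lzjd(a, b string) float64 {
	A, B := lzSet(a), lzSet(b)
	var inter, union float64
	for w := range A {
		union++
		if B[w] {
			inter++
		}
	}
	for w := range B {
		if !A[w] {
			union++
		}
	}
	if union == 0 {
		return 0 // both dictionaries empty
	}
	return 1 - inter/union
}

func main() {
	a := "ababababab"
	b := "abbabbabba"
	fmt.Println("|LZSet(a)| =", len(lzSet(a)))
	fmt.Printf("LZJD(a,b) = %.3f\n", lzjd(a, b))
}
```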
It is shown in [12, Theorem 1] that the triangle inequality holds for a function which they call an information-based sequence distance. Later papers give it the notation d_s in [13, Definition V.1], and call their normalized information distance d. Raff and Nicholas [14] introduced the LZJD and did not discuss the appearance of d_s in [13, Definition V.1], even though they do cite [13] (but not [12]). Kraskov et al. [10, 9] use D and D' for continuous analogues of d_s and d in [13] (which they cite). The Encyclopedia calls it the normalized information metric, or Rajski distance [15]. This d_s was called d by [12]; see Table 1.

Table 1.
  Reference   Jaccard notation   NID notation
  [12]        d                  -
  [13]        d_s                d
  [9]         D                  D'
  [14]        LZJD               NCD

Remark 2. Ridgway [4] observed that the entropy-based distance D is essentially a Jaccard distance. No explanation was given, but we attempt one as follows. Suppose X_1, X_2, X_3, X_4 are i.i.d. Bernoulli(p = 1/2) random variables, Y is the random vector (X_1, X_2, X_3), and Z is (X_2, X_3, X_4). Then the entropy identity H(Y, Z) = H(Y) + H(Z) − I(Y; Z) becomes the Venn diagram relationship
|{X_1, X_2, X_3, X_4}| = |{X_1, X_2, X_3}| + |{X_2, X_3, X_4}| − |{X_2, X_3}|.
The relationship to Jaccard distance may not have been well known, as it is not mentioned in [9, 2, 12, 1].

A more general setting is that of STRM (Symmetric Tversky Ratio Models), Definition 10. These are variants of the Tversky index (Definition 4) proposed in [7].

Definition 3. A semimetric on a set X is a function d : X × X → R that satisfies the first three axioms of a metric space, but not necessarily the triangle inequality: d(x, y) ≥ 0, d(x, y) = 0 if and only if x = y, and d(x, y) = d(y, x).

Definition 4. For sets X and Y, the Tversky index with parameters α, β ≥ 0 is a number between 0 and 1 given by
$$S^T_{\alpha,\beta}(X,Y)=\frac{|X\cap Y|}{|X\cap Y|+\alpha|X\setminus Y|+\beta|Y\setminus X|}.$$
We also define the corresponding Tversky dissimilarity d^T_{α,β} by d^T_{α,β}(X, Y) = 1 − S^T_{α,β}(X, Y).

To motivate the results below, we include the following lemma without proof.

Lemma 5. Suppose d is a metric on a collection X of nonempty sets. Then $\tilde d$ is a metric on $\tilde{\mathcal X}$.

Theorem 6 (Gragera and Suppakitpaisarn [5, 6]) identifies the optimal constant ρ for which d^T_{α,β} satisfies a ρ-relaxed triangle inequality. Moreover ρ ≤ 1 is necessary, so Theorem 6 gives αβ ≥ 1.

We may note that overlap(X, Y) = |X ∩ Y| / min{|X|, |Y|} equals 1 whenever X ⊆ Y or Y ⊆ X, so that 1 − overlap is not a metric. The Sørensen-Dice coefficient is defined by 2|X ∩ Y| / (|X| + |Y|).

Definition 10 ([7, Section 2]). Let X be a collection of finite sets. We define S : X × X → R as follows. For sets X, Y ∈ X we define m(X, Y) = min{|X \ Y|, |Y \ X|}, M(X, Y) = max{|X \ Y|, |Y \ X|}, and
$$S_{\alpha,\beta}(X,Y)=\frac{|X\cap Y|+\mathrm{bias}}{|X\cap Y|+\mathrm{bias}+\beta\,\bigl(\alpha\,m(X,Y)+(1-\alpha)\,M(X,Y)\bigr)}.$$
The unbiased symmetric TRM is the case where bias = 0, which is the case we shall assume we are in for the rest of this paper. The Tversky semimetric D_{α,β} is defined by D_{α,β}(X, Y) = 1 − S(X, Y), or more precisely
$$D_{\alpha,\beta}(X,Y)=\frac{\beta\,\bigl(\alpha\,m(X,Y)+(1-\alpha)\,M(X,Y)\bigr)}{|X\cap Y|+\beta\,\bigl(\alpha\,m(X,Y)+(1-\alpha)\,M(X,Y)\bigr)}.$$
Note that for α = 1/2, β = 1, the STRM is equivalent to the Sørensen-Dice coefficient. Similarly, for α = 1/2, β = 2, it is equivalent to Jaccard's coefficient.

Our main result is (see Figure 1):

Theorem 11. Let 0 ≤ α ≤ 1 and β > 0. Then D_{α,β} is a metric if and only if 0 ≤ α ≤ 1/2 and β ≥ 1/(1 − α).

Theorem 11 gives the converse to Theorem 7, which was inspired by Gragera and Suppakitpaisarn:

Corollary 12. The Tversky dissimilarity d^T_{α,β} is a metric if and only if α = β ≥ 1.

Proof. Suppose the Tversky dissimilarity d^T_{α,β} is a semimetric. Let X, Y be sets with |X ∩ Y| = |X \ Y| = 1 and |Y \ X| = 0. Then symmetry gives d^T_{α,β}(X, Y) = α/(1 + α) = β/(1 + β) = d^T_{α,β}(Y, X), hence α = β. Write γ := α = β; with α₀ := 1/2 and β₀ := 2γ we have d^T_{γ,γ} = D_{α₀,β₀}. By Theorem 11, d^T_{γ,γ} is a metric if and only if β₀ ≥ 1/(1 − α₀). This is equivalent to 2γ ≥ 2, i.e., γ ≥ 1.

The truth or falsity of Corollary 12 does not arise in Gragera and Suppakitpaisarn's work, as they require α, β ≤ 1 in their definition of the Tversky index. We note that Tversky [17] only required α, β ≥ 0.
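As a quick numerical illustration of Definition 10 and Theorem 11 (not part of the formal development; the helper names and example sets are ours), the following Go sketch evaluates D_{α,β} on three small sets. It shows D_{1/2,2} agreeing with the Jaccard distance, and a violation of the triangle inequality for a parameter pair with β < 1/(1 − α), consistent with Theorem 11.

```go
package main

import "fmt"

// set is a finite set of ints represented as a membership map.
type set map[int]bool

// strm computes the unbiased symmetric Tversky ratio model dissimilarity
// D_{alpha,beta}(X,Y) = beta*(alpha*m + (1-alpha)*M) / (|X∩Y| + beta*(alpha*m + (1-alpha)*M)),
// with m = min(|X\Y|,|Y\X|) and M = max(|X\Y|,|Y\X|), as in Definition 10.
func strm(x, y set, alpha, beta float64) float64 {
	var inter, xOnly, yOnly float64
	for e := range x {
		if y[e] {
			inter++
		} else {
			xOnly++
		}
	}
	for e := range y {
		if !x[e] {
			yOnly++
		}
	}
	m, M := xOnly, yOnly
	if m > M {
		m, M = M, m
	}
	g := beta * (alpha*m + (1-alpha)*M)
	if inter+g == 0 {
		return 0 // X = Y (or both empty)
	}
	return g / (inter + g)
}

func main() {
	X, Y, Z := set{1: true}, set{2: true}, set{1: true, 2: true}

	// alpha = 1/2, beta = 2 lies in the region of Theorem 11 and recovers the
	// Jaccard distance: D(X,Z) = 0.5 = |X triangle Z| / |X union Z|.
	fmt.Println("D_{1/2,2}(X,Z) =", strm(X, Z, 0.5, 2))

	// Triangle inequality D(X,Y) <= D(X,Z) + D(Z,Y) for two parameter choices.
	for _, p := range [][2]float64{{0.5, 2}, {0.5, 0.5}} {
		a, b := p[0], p[1]
		lhs := strm(X, Y, a, b)
		rhs := strm(X, Z, a, b) + strm(Z, Y, a, b)
		// For (1/2, 2) the inequality holds (with equality here); for (1/2, 1/2),
		// which has beta < 1/(1-alpha), it fails, consistent with Theorem 11.
		fmt.Printf("alpha=%.2f beta=%.2f: %.3f <= %.3f ? %v\n", a, b, lhs, rhs, lhs <= rhs)
	}
}
```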
We now work toward the proof of Theorem 11, beginning with some lemmas.

Lemma 13. Let u, v, w, ε > 0. Then the inequality needed below holds.

Proof. It is of course equivalent to show an inequality which is clearly the case.

The proof of Theorem 15 below reduces to Theorem 14.

Proof (of Theorem 14). The only nontrivial task is to verify the triangle inequality. Define further functions u, v, and w. Since d is a metric, we have the triangle inequality for d. We proceed by forward reasoning: we need the truth of a chain of equivalent conditions, and by Lemma 13 we are done.

Theorem 15. For each α, the set of β for which D_{α,β} is a metric is upward closed.

Proof. Suppose D_{α,β₀} is a metric and ε = β − β₀ ≥ 0. Let a_{XY} := αm(X, Y) + (1 − α)M(X, Y). Since the upfront factor of β may be removed without loss of generality, the question reduces to Theorem 14.

Some convenient notation to be used below includes ᾱ := 1 − α; γ := βα, so that γ = α/ᾱ ≤ 1 when β = 1/ᾱ and α ≤ 1/2; and x∩y := |X ∩ Y|, x := |X|, x_y := |X \ Y|, etc.

The triangle inequality is then equivalent to 1 ≤ 2ᾱ, i.e., α ≤ 1/2; this gives the "only if" direction. Now let us show the "if" direction. The triangle inequality says
$$\alpha\min\{x_y,y_x\}+\bar\alpha\max\{x_y,y_x\}\le\alpha\min\{x_z,z_x\}+\bar\alpha\max\{x_z,z_x\}+\alpha\min\{z_y,y_z\}+\bar\alpha\max\{z_y,y_z\}.$$
By symmetry between x and y, we may assume that y ≤ x. Hence either y ≤ z ≤ x, y ≤ x ≤ z, or z ≤ y ≤ x. Thus our proof splits into three cases, I, II, and III.

Case I: y ≤ z ≤ x. We must show that α y_x + ᾱ x_y ≤ α z_x + ᾱ x_z + α y_z + ᾱ z_y. Since y_x ≤ y_z + z_x and x_y ≤ x_z + z_y, this holds for all α.

Case II: y ≤ x ≤ z. We want to show that α y_x + ᾱ x_y ≤ α x_z + ᾱ z_x + α y_z + ᾱ z_y. In terms of γ = α/ᾱ this says
$$\gamma y_x + x_y \le \gamma x_z + z_x + \gamma y_z + z_y,$$
that is, Cγ + D ≥ 0, where C := x_z + y_z − y_x and D := z_x + z_y − x_y. The identity x_y + y_z + z_x = x_z + z_y + y_x holds generally, since both sides count the elements that belong to exactly one of X, Y, Z once each, and count the elements that belong to exactly two of X, Y, Z once each. Since x ≤ z, it follows that C ≤ D.

Subcase II.1: C ≥ 0. Then Cγ + D ≥ D ≥ C ≥ 0, as desired.

Subcase II.2: C < 0. In order to show Cγ + D ≥ 0 for all 0 ≤ γ ≤ 1 it suffices that C + D ≥ 0, since then Cγ + D ≥ C + D ≥ 0; and indeed C + D ≥ 0, since y_x ≤ y_z + z_x and x_y ≤ x_z + z_y.

Case III: z ≤ y ≤ x. This is handled as in Case II with the roles of y and z interchanged. The corresponding statement C ≤ D says z_y + (z_x + x_y) ≤ y_z + (y_x + x_z), which holds by the reasoning from Case II, using now z ≤ y.

And now we show that D_{α,β} is not a metric when α > 1/2.

Proof. Consider the same example as in Theorem 16. Ignoring the upfront factor of β, we compute the relevant distances in our example; the triangle inequality is then equivalent to an inequality in n. Since α > 1/2, we have 2ᾱ < 1. Let n > βᾱ/(1 − 2ᾱ). Then the triangle inequality does not hold, so D_{α,β} is not a metric on the power set of {−(n − 1), −(n − 2), ..., 0, 1, 2}.

Proof (of Theorem 11). We saw in Theorem 16 that δ is a metric for 0 ≤ γ ≤ 1. (Recall that β = 1/(1 − α), so that γ = α/ᾱ.) Here d_{xy} := δ(X, Y) = γ min{x_y, y_x} + max{x_y, y_x} and a_{xy} := x∩y, so that D_{α,1/(1−α)} = d/(a + d). In general, if d is a metric and a is a function, we may hope that d/(a + d) is a metric. We shall use the observation, mentioned by [16], that in order to show
$$\frac{d_{xy}}{a_{xy}+d_{xy}}\le\frac{d_{xz}}{a_{xz}+d_{xz}}+\frac{d_{zy}}{a_{zy}+d_{zy}},$$
it suffices to show the following pair of inequalities:
$$\frac{d_{xy}}{a_{xy}+d_{xy}}\le\frac{d_{xz}+d_{zy}}{a_{xy}+d_{xz}+d_{zy}},\qquad(3)$$
$$\frac{d_{xz}+d_{zy}}{a_{xy}+d_{xz}+d_{zy}}\le\frac{d_{xz}}{a_{xz}+d_{xz}}+\frac{d_{zy}}{a_{zy}+d_{zy}}.\qquad(4)$$
Here (3) follows from d being a metric, i.e., d_{xy} ≤ d_{xz} + d_{zy}, since c ≥ 0 and 0 < a ≤ b imply a/(a + c) ≤ b/(b + c). Next, (4) would follow from a_{xy} + d_{zy} ≥ a_{xz} and a_{xy} + d_{xz} ≥ a_{zy}. By symmetry between x and y, and since a_{xy} = a_{yx} in our case, it suffices to prove the first of these, a_{xy} + d_{zy} ≥ a_{xz}. This is equivalent to
$$x\cap y+\gamma\min\{z_y,y_z\}+\max\{z_y,y_z\}\ge x\cap z,$$
which holds for all 0 ≤ γ ≤ 1 if and only if x∩y + max{z_y, y_z} ≥ x∩z. There are now two cases.

Case z ≥ y: We have x∩y + z_y ≥ x∩z, since any element of X ∩ Z is either in Y or not.

Case y ≥ z: We need x∩y + y_z ≥ x∩z. This is true since y_z ≥ z_y ≥ |(X ∩ Z) \ Y|.

The mutations of spike glycoproteins of coronaviruses are of great concern in connection with the new SARS-CoV-2 virus, which causes the disease COVID-19. We calculate several distance measures between peptide sequences for such proteins. The distance |A_0| + |A_1|, where A_i is the set of subwords of length 2 occurring in x_i but not in x_{1−i}, counts how many subwords of length 2 appear in one sequence and not the other. We used the Ward linkage criterion for producing Newick trees, using the hclust package for the Go programming language. The calculated phylogenetic trees were based on the metric Z_{2,α}. We found one tree isomorphism class each for 0 ≤ α ≤ 0.21, 0.22 ≤ α ≤ 0.36, and 0.37 ≤ α ≤ 0.5, respectively (Figure 2, Figure 3).
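For concreteness, the following Go sketch extracts the length-2 subword sets of two sequences and evaluates the unbiased STRM distance D_{α,1/(1−α)} on them; we are assuming here that Z_{2,α} is obtained from length-2 subword sets in essentially this way. The example sequences are arbitrary fragments rather than data from our experiments, and the Ward-linkage clustering via the hclust package is not reproduced.

```go
package main

import "fmt"

// subwords2 returns the set of length-2 subwords (2-mers) of s.
func subwords2(s string) map[string]bool {
	set := make(map[string]bool)
	for i := 0; i+2 <= len(s); i++ {
		set[s[i:i+2]] = true
	}
	return set
}

// strmDistance computes the unbiased symmetric Tversky ratio model distance
// D_{alpha,beta}(X,Y) = beta*(alpha*m + (1-alpha)*M) / (|X∩Y| + beta*(alpha*m + (1-alpha)*M)),
// where m and M are the min and max of |X\Y| and |Y\X| (Definition 10).
func strmDistance(x, y map[string]bool, alpha, beta float64) float64 {
	var inter, xOnly, yOnly float64
	for w := range x {
		if y[w] {
			inter++
		} else {
			xOnly++
		}
	}
	for w := range y {
		if !x[w] {
			yOnly++
		}
	}
	m, M := xOnly, yOnly
	if m > M {
		m, M = M, m
	}
	g := beta * (alpha*m + (1-alpha)*M)
	if inter+g == 0 {
		return 0 // identical subword sets (or both empty)
	}
	return g / (inter + g)
}

func main() {
	seqA := "MFVFLVLLPLVSSQCVNLT" // illustrative spike-protein-like fragments
	seqB := "MFVFLVLLPLVSSQCVNFT"
	A, B := subwords2(seqA), subwords2(seqB)
	for _, alpha := range []float64{0.0, 0.25, 0.5} {
		beta := 1 / (1 - alpha) // the boundary case beta = 1/(1-alpha) of Theorem 11
		fmt.Printf("alpha=%.2f  D=%.4f\n", alpha, strmDistance(A, B, alpha, beta))
	}
}
```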
In Figure 3 we also include the tree produced using the Levenshtein edit distance in place of Z_{2,α}. We see that the various intervals for α can correspond to "better" or "worse" agreement with other distance measures. Thus, we propose that rather than focusing on α = 0 and α = 1/2 exclusively, future work may consider the whole interval [0, 1/2].

Many researchers have considered metrics based on sums or maxima, but we have shown that these need not be considered in "isolation", in the sense that they form the endpoints of a family of metrics. More general set-theoretic metrics can be envisioned. The Steinhaus transform of δ, with β = 1/ᾱ, is
$$\delta'(X,Y)=\frac{2\,\delta(X,Y)}{\delta(X,\emptyset)+\delta(Y,\emptyset)+\delta(X,Y)}=\frac{2\bigl(\gamma\min\{x_y,y_x\}+\max\{x_y,y_x\}\bigr)}{(x+y)+\gamma\min\{x_y,y_x\}+\max\{x_y,y_x\}}.$$
A question for future research is whether this Steinhaus transform is more or less useful than what Jiménez et al. [7] considered. We can consider a general setting for potential metrics that contains both the Steinhaus transform of δ and the STRM metrics. In terms of m(X, Y) = min{x_y, y_x} and M(X, Y) = max{x_y, y_x}, we can consider
$$\Delta_{\gamma,s}:=\frac{\gamma m+M}{x\cap y+s\,(x\cup y)+(\gamma m+M)}.$$
When s = 0 this is our STRM metric. When s = 1 it is the Steinhaus transform, ignoring constant upfront factors.

Correctness of results. We have formally proved Theorem 11 in the Lean theorem prover, with a more streamlined proof than the one presented here. The Github repository can be found at [8].

References

[1] Clustering by compression.
[2] The Google similarity distance.
[3] Encyclopedia of distances.
[4] Mutual information - Wikipedia, the Free Encyclopedia.
[5] Semimetric properties of Sørensen-Dice and Tversky indexes.
[6] Relaxed triangle inequality ratio of the Sørensen-Dice and Tversky indexes.
[7] SOFTCARDINALITY-CORE: improving text overlap with distributional measures for semantic textual similarity.
[8] Lean project: a 1-parameter family of metrics connecting Jaccard distance to normalized information distance.
[9] Hierarchical clustering using mutual information.
[10] Hierarchical clustering based on mutual information.
[11] On the complexity of finite sequences.
[12] An information-based sequence distance and its application to whole mitochondrial genome phylogeny.
[13] The similarity metric.
[14] An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance.
[15] Entropy and metric spaces.
[16] Is the Jaccard distance a distance? MathOverflow.
[17] Features of similarity.
[18] A universal algorithm for sequential data compression.
[19] Compression of individual sequences via variable-rate coding.

Acknowledgments. This work was partially supported by grants from the Simons Foundation (#704836 to Bjørn Kjos-Hanssen) and Decision Research Corporation (University of Hawai'i Foundation Account #129-4770-4).