key: cord-0047464-m0hfczyz
authors: Wang, Yifeng; Yao, Hanguang; Wan, Liangpeng; Li, Hua; Jiang, Junjun; Zhang, Yun; Wu, Fangmin; Chen, Junfeng; Xue, Xingsi; Dai, Cai
title: Optimizing Hydrography Ontology Alignment Through Compact Particle Swarm Optimization Algorithm
date: 2020-06-22
journal: Advances in Swarm Intelligence
DOI: 10.1007/978-3-030-53956-6_14
sha: 5504b453e858985b47547446a1eab8ac504d9574
doc_id: 47464
cord_uid: m0hfczyz

With the explosive growth of data in the hydrographical domain, many hydrography ontologies have been developed and maintained to describe hydrographical features and the relationships between them. However, the existing hydrography ontologies are developed with varying project perspectives and objectives, which inevitably results in differences in knowledge representation. Determining the various relationships between entities in different ontologies offers the opportunity to link hydrographical data for multiple purposes, though research on this topic is in its infancy. Unlike traditional ontology alignment, whose cardinality is 1:1, i.e. one source ontology entity is mapped to one target ontology entity and vice versa, and whose relationship is equivalence, matching hydrography ontologies is a more complex task: the cardinality could be 1:1, 1:n or m:n, and the relationships could be equivalence or subsumption. To efficiently optimize the ontology alignment, in this paper, a discrete optimization model is first constructed for the ontology matching problem, and then a Compact Particle Swarm Optimization (CPSO) based matching technique is proposed to efficiently solve it. CPSO utilizes a compact real-value encoding and decoding mechanism and an objective-decomposing strategy to approximate PSO's evolving process, which dramatically reduces PSO's memory consumption and runtime while ensuring the solution's quality.
The experiment exploits the Hydrography dataset in the Complex track provided by the Ontology Alignment Evaluation Initiative (OAEI) to test our proposal's performance. The experimental results show that the CPSO-based approach can effectively reduce PSO's runtime and memory consumption, and determine high-quality hydrography ontology alignments.

To solve the water features' heterogeneity problem, many hydrography ontologies, such as the United States Geological Survey's Surface Water Ontology (SWO) [9], the Spanish National Geographic Institute (IGN)'s hydrOntology [14], the Hydro3 module from the University of Maine's HydroGazetteer [13] and the Cree surface water ontology [15], have been developed and maintained to describe hydrographic features and the relationships between them. However, these ontologies are independently developed with varying project perspectives and objectives; therefore, they might use the same terminology for different concepts, they may be at different levels of abstraction, they may not include all of the same concepts, and they may not even be in the same language [2]. For example, SWO describes surface water features from the perspective of the earth's terrain, the water bodies and the flows between them, while the Cree surface water ontology presents surface water features based on people's use of them for transportation via canoe, which largely focuses on water bodies' locations relative to one another. Matching ontologies aims at determining the various relationships between entities in different ontologies, which offers the opportunity to link hydrographical data for multiple purposes. Traditional ontology matching techniques [3, 16-18] are dedicated to finding simple alignments whose cardinality is 1:1, i.e. one source ontology entity is mapped to one target ontology entity and vice versa, with equivalence as the relationship, e.g. "Wetlands" in one ontology is equivalent to "Swamp Or Marsh" in another ontology.
However, matching hydrography ontologies is a more complex task, whose cardinality could be 1:1, 1:n or m:n and whose relationships could be equivalence or subsumption [19], e.g. "Watershed" in one ontology is a subclass of the union of "Lake Or Pond" and "Swamp Or Marsh" in another. Therefore, traditional ontology matchers are not able to determine high-quality ontology alignments, and the complex ontology matching problem is one of the open challenges in the ontology matching domain [11]. Inspired by the success of the Particle Swarm Optimization (PSO) based ontology matcher [1] in the ontology matching domain, we propose to use it to optimize the ontology alignment's quality. To this end, we propose a Compact PSO (CPSO), which uses a compact real-value encoding and decoding mechanism and an objective-decomposing strategy to execute the evolving process. The rest of the paper is organized as follows: Sect. 2 defines the hydrography ontology matching problem and proposes a hybrid similarity measure for calculating the similarity values between two hydrographical entities; Sect. 3 presents the CPSO in detail, including the compact encoding mechanism, the Probability Vector (PV) and the crossover operator; Sect. 4 presents the experimental configuration and results; finally, Sect. 5 draws the conclusion.

To evaluate the quality of an ontology alignment and the effectiveness of a matching approach, it is necessary to determine whether all correct correspondences have been discovered (completeness) and whether all discovered correspondences are correct (soundness). Normally, an alignment is assessed in terms of two measures, commonly known as precision and recall. Precision (or soundness) measures the fraction of the selected correspondences that are actually correct. Recall (or completeness) measures the fraction of correct correspondences discovered out of the total number of existing correct correspondences.
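The precision, recall and (α-weighted) f-measure discussed in this section, together with the reference-free approximation used later as the objective, can be sketched as follows. This is a minimal illustration: correspondences are assumed to be hashable tuples, and the normalization function φ(|A|) = |A|/max_size is an illustrative stand-in.

```python
# A minimal sketch of the alignment-quality measures, assuming
# correspondences are hashable tuples such as (source, target, relation).
def recall(A, R):
    """Fraction of the reference correspondences that were discovered."""
    return len(set(A) & set(R)) / len(R)

def precision(A, R):
    """Fraction of the discovered correspondences that are correct."""
    return len(set(A) & set(R)) / len(A)

def f_measure(A, R, alpha=0.5):
    """Alpha-weighted harmonic mean: alpha=0 gives recall, alpha=1 precision."""
    p, r = precision(A, R), recall(A, R)
    if p == 0.0 or r == 0.0:
        return 0.0
    return p * r / ((1 - alpha) * p + alpha * r)

# Reference-free approximation used when no reference alignment exists:
# recall is approximated by a normalized alignment size, and precision by
# the mean similarity of the found correspondences.
def approximate_f(similarities, max_size, alpha=0.5):
    approx_recall = len(similarities) / max_size   # phi: illustrative normalization
    approx_precision = sum(similarities) / len(similarities)
    return approx_precision * approx_recall / (
        (1 - alpha) * approx_precision + alpha * approx_recall)
```

With α = 0.5 the f-measure reduces to the familiar F1 score, while α = 0 and α = 1 recover recall and precision respectively, which is what CPSO's objective decomposition relies on.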
Maximum precision (no false positives) and maximum recall (no false negatives) refer to the absence of type I and type II errors, respectively. Although a precision of 1 means that all correspondences found are correct, it does not imply that all correct ones have been found. Analogously, a recall of 1 means that all the correct correspondences have been discovered, but it provides no information about the number of falsely identified ones. Therefore, precision and recall are often balanced against each other by the so-called f-measure, which is the weighted harmonic mean of recall and precision. Since the f-measure can better balance precision and recall, it is the most popular indicator used to measure the quality of an ontology alignment. Given a Reference Alignment (RA) R, which is the golden ontology alignment provided by the expert, and an alignment A, recall, precision and f-measure are respectively defined as follows [12]:

recall(A, R) = |R ∩ A| / |R|
precision(A, R) = |R ∩ A| / |A|
f_α(A, R) = (precision × recall) / ((1 − α) × precision + α × recall)

where α is the weight to trade off recall and precision (α = 0 yields recall, α = 1 yields precision, and α = 0.5 yields their uniformly weighted harmonic mean). Although recall, precision and f-measure can reflect the quality of the resulting alignment, they require domain experts to provide the reference alignment in advance, which is generally unknown for difficult real-life matching problems. Based on the observation that the more correspondences found and the higher the mean similarity value of the correspondences, the better the alignment quality [1], we utilize the following two metrics to approximate the f-measure:

recall'(A) = φ(|A|)
precision'(A) = (Σ_{i=1}^{|A|} δ_i) / |A|

where |A| is the number of correspondences in A, φ is a normalization function, and δ_i is the similarity value of the ith correspondence; recall' and precision' respectively approximate recall and precision. On this basis, the optimization model for the ontology matching problem is defined as follows:

max f(X)
s.t. X = (x_1, x_2, ..., x_{|O_1|})^T, x_i ∈ N, i = 1, 2, ..., |O_1|

where O_1 and O_2 are two ontologies, |O_1| and |O_2| are respectively the entity numbers of O_1 and O_2; x_i encodes the ith entity's correspondences, i.e.
the ith source entity is related with the target entities encoded by x_i; and the objective function is to maximize the alignment's f(). This work utilizes a hybrid similarity measure to calculate the similarity value between two hydrographical classes. Given two hydrographical classes, we first extract their labels and comments from the corresponding ontologies. Before calculating their similarity value, we utilize natural language processing techniques and BabelNet, which covers 271 languages and is an appropriate machine translation tool in the cross-lingual ontology matching domain, to process their labels l_1 and l_2 and comments c_1 and c_2. In particular, this process consists of the following successive steps:

- remove the numbers, punctuation and stop-words;
- split the strings into words;
- translate the words into English, and convert them into lower-case;
- lemmatize and stem the English words.

Then, four similarity values, i.e. sim(l_1, l_2), sim(c_1, c_2), sim(l_1, c_2) and sim(c_1, l_2), are calculated with soft TF-IDF [7], where two words are considered identical when they are literally the same or synonymous in the English WordNet [6], and the maximum one is selected as the two classes' similarity value. Next, CPSO for solving the ontology matching problem is presented in detail; it approximates the population-based PSO's evolving process through a PV [5]. We first describe the encoding mechanism and the PV, and then present the crossover operator and CPSO's pseudo-code. We utilize a real-number encoding mechanism. Since we need to encode an alignment, whose kernel elements are pairs of mapped concepts, in a solution, we can simply make use of the concepts' indices in the ontologies. In an individual, each gene bit encodes a positive integer that captures which target classes are mapped to the source class with the same index.
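The label/comment preprocessing steps listed above can be sketched as follows. This is only an illustration: the stop-word list is a tiny stand-in, and the translation and lemmatizing/stemming steps, which rely on BabelNet and a full NLP toolkit, are omitted.

```python
import re

# Tiny illustrative stop-word list (a real pipeline would use a full one).
STOP_WORDS = {"a", "an", "the", "of", "or", "and"}

def preprocess(text):
    """Sketch of the preprocessing pipeline for labels and comments."""
    # 1. remove numbers and punctuation
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2. split into words and convert to lower-case
    #    (the translation step into English is omitted here)
    words = text.lower().split()
    # 3. drop stop-words; lemmatizing/stemming would follow in the real pipeline
    return [w for w in words if w not in STOP_WORDS]
```

For example, `preprocess("Swamp Or Marsh (type 2)")` yields the token list `['swamp', 'marsh', 'type']`, which is then fed to the soft TF-IDF similarity computation.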
In an alignment, if the ith source class is mapped to the target classes with indices j_1, j_2, ..., j_n (j_n < |O_2|), the ith gene bit's value will be

(2, 3, 5, ..., primeNumber_{|O_2|}) ⊗ (j_1, j_2, ..., j_n, 0, 0, ...) = 2^{j_1} × 3^{j_2} × 5^{j_3} × ... × primeNumber_n^{j_n} × primeNumber_{n+1}^0 × ...,

i.e. the kth prime number raised to the kth mapped target index, with all remaining primes raised to the power 0. An example of the encoding mechanism is shown in Fig. 1. As can be seen from the figure, the source concept "exocrine pancreas" with index 1 is mapped to the target concepts "Exocrine Pancreas" with index 1 and "Oropharynx Epithelium" with index 4, and therefore the first gene bit's value is 2^1 × 3^4 × 5^0 × 7^0 = 162. The decoding process is shown in Algorithm 1. In the obtained alignment, we utilize two thresholds, an upper threshold and a lower threshold, to filter the class correspondences. When a class pair's similarity value is greater than the upper threshold, its correspondence's relationship is "equivalent"; when the similarity value is between the upper and lower thresholds, the relationship is "subsumption"; when the similarity value is lower than the lower threshold, the correspondence is removed. Next, for each object class, we utilize the approach proposed by D. Oliveira et al. [8] to further determine the relationships of "union" and "intersection" among its subsumed target classes. In CPSO, the Probability Vector (PV) [5] is used to represent the entire population. PV is defined as an n × 2 matrix [μ^t, δ^t], where t indicates the generation, and μ and δ are, respectively, vectors containing, for each gene, the mean and standard deviation values of a Gaussian Probability Distribution Function (PDF) truncated within the interval [−1, 1]. In particular, with the gene bit indexed by i, a truncated Gaussian PDF characterized by a mean value μ_i and a standard deviation δ_i is associated, whose formula is as follows:

PDF(x) = ( (√(2/π) / δ_i) · e^{−(x−μ_i)² / (2δ_i²)} ) / ( erf((μ_i + 1)/(√2 δ_i)) + erf((1 − μ_i)/(√2 δ_i)) ), x ∈ [−1, 1],

where erf is the error function [10].
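The prime-factorization encoding and the decoding of Algorithm 1 can be sketched as follows. The `n_slots` parameter, which bounds how many target classes one source class may map to, is an illustrative assumption of this sketch.

```python
def primes(n):
    """First n prime numbers by trial division (fine for small n)."""
    ps, cand = [], 2
    while len(ps) < n:
        if all(cand % p for p in ps):
            ps.append(cand)
        cand += 1
    return ps

def encode(target_indices, n_slots):
    """Gene value: the k-th prime raised to the k-th mapped target index."""
    padded = target_indices + [0] * (n_slots - len(target_indices))
    value = 1
    for p, j in zip(primes(n_slots), padded):
        value *= p ** j
    return value

def decode(value, n_slots):
    """Recover the mapped target indices as the prime-factor exponents."""
    indices = []
    for p in primes(n_slots):
        e = 0
        while value % p == 0:
            value //= p
            e += 1
        if e:  # exponent 0 means no mapping in that slot
            indices.append(e)
    return indices
```

On the paper's example, `encode([1, 4], 4)` gives 2^1 × 3^4 = 162, and `decode(162, 4)` recovers the target indices `[1, 4]`.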
From the PDF, the corresponding Cumulative Distribution Function (CDF), whose codomain is [0, 1], is constructed by means of Chebyshev polynomials [4]. In order to sample the design variable x_i from the PV, a random number rand(0, 1) is sampled from a uniform distribution; the inverse of the CDF at rand(0, 1) is then calculated, and its rounded value is x_i. In each generation, PV is updated to move toward the elite so that the newly generated solutions are closer to the elite. In particular, the update rules for each element of μ and δ are respectively given as follows:

μ_i^{t+1} = μ_i^t + (1/N_p)(winner_i − loser_i)
(δ_i^{t+1})² = (δ_i^t)² + (μ_i^t)² − (μ_i^{t+1})² + (1/N_p)(winner_i² − loser_i²)

where winner and loser are, respectively, the better and the worse of the two compared individuals, and N_p represents the (virtual) population scale. Given two individuals (called parents), the crossover operator generates one child, which is obtained by mixing the genes of the parents. Crossover is applied with a certain probability, a parameter of the PSO. Here, we use the widely used one-cut-point crossover operator. First, the two individuals are decoded to obtain the corresponding alignments, and a cut position is randomly determined, which cuts each parent alignment into two parts: a left part and a right part. Then, the left part of one parent and the right part of the other are combined to form the child. For the sake of clarity, the pseudo-code of the crossover operator is shown in Algorithm 2. The pseudo-code of CPSO is presented in Algorithm 3. CPSO first divides the problem into three sub-problems that respectively maximize f_{α=0}, f_{α=0.5} and f_{α=1}. We initialize three PVs and local best individuals for the three sub-problems, and then initialize the global best individual by selecting the one with the best f_{0.5} among them. In each generation, CPSO tries to solve each sub-problem by approximating PSO's position-updating strategy, i.e. crossing an individual over with the local best individual and the global best individual to obtain a new individual, and then using the new one to update the local best individual and the PV.
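The truncated-Gaussian sampling and the compact PV update described above can be sketched as follows. This is an illustrative sketch: bisection stands in for the Chebyshev-polynomial CDF inversion of [4], and the winner/loser update follows the compact-algorithm style of [5].

```python
import math
import random

def trunc_gauss_cdf(x, mu, delta):
    """CDF at x of a Gaussian N(mu, delta^2) truncated to [-1, 1]."""
    c = delta * math.sqrt(2.0)
    num = math.erf((x - mu) / c) - math.erf((-1.0 - mu) / c)
    den = math.erf((1.0 - mu) / c) - math.erf((-1.0 - mu) / c)
    return num / den

def sample_gene(mu, delta, rng=random):
    """Draw one gene value by inverting the truncated-Gaussian CDF.

    Bisection is used here in place of the Chebyshev-based inversion.
    """
    u = rng.random()
    lo, hi = -1.0, 1.0
    for _ in range(60):  # 60 halvings: far below float resolution
        mid = 0.5 * (lo + hi)
        if trunc_gauss_cdf(mid, mu, delta) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def update_pv(mu, delta, winner, loser, n_p):
    """Move one PV element toward the winner gene (compact update rule)."""
    mu_new = mu + (winner - loser) / n_p
    var_new = delta ** 2 + mu ** 2 - mu_new ** 2 + (winner ** 2 - loser ** 2) / n_p
    return mu_new, math.sqrt(max(var_new, 1e-12))
```

Because only the [μ, δ] pair per gene is stored instead of a whole swarm, the memory footprint is independent of the virtual population size N_p, which is the source of CPSO's memory savings.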
After solving each sub-problem, we try to update the global best individual with the three new local best individuals.

Input: maximum generation maxGen, crossover rate p_cr, virtual population N_p
Output: global best individual ind_elite
1: ** Initialization **
2: generation t = 0;
3: initialize PV_{α=0}, PV_{α=0.5} and PV_{α=1} by setting μ_i^t = 0 and δ_i^t = 10;
4: generate three individuals through PV_{α=0}, PV_{α=0.5} and PV_{α=1} to respectively initialize three local best individuals ind_{α=0,elite}, ind_{α=0.5,elite} and ind_{α=1,elite};
5: initialize ind_elite = opti{ind_{α=0,elite}, ind_{α=0.5,elite}, ind_{α=1,elite}};
6: ** Evolving Process **
7: while t < maxGen do
8:   ** Updating ind_{α=0,elite} **
9:   generate an individual ind_{α=0,new} through PV_{α=0};
10:  ind_{α=0,new} = crossover(ind_{α=0,new}, ind_{α=0,elite});
11:  ind_{α=0,new} = crossover(ind_{α=0,new}, ind_elite);
12:  update ind_{α=0,elite} and PV_{α=0} with ind_{α=0,new}; likewise update the α=0.5 and α=1 sub-problems; then update ind_elite with the three local best individuals, and set t = t + 1;
13: end while

In order to test the performance of CPSO, the experiment exploits the Hydrography dataset in the Complex track provided by the Ontology Alignment Evaluation Initiative (OAEI). The hydrography dataset is composed of 4 source ontologies (Hydro3, hydrOntology native, hydrOntology translated and Cree), each of which should be aligned to a single target Surface Water Ontology (SWO). The source ontologies vary in their similarity to the target ontology: Hydro3 is similar in both language and structure; hydrOntology native and hydrOntology translated are similar in structure, but hydrOntology translated is in Spanish rather than English; and Cree is very different in terms of both language and structure. We compare the alignment quality of CPSO with the PSO-based matcher [1] and OAEI's participants in Table 1, and we also compare the runtime and memory consumption of the PSO-based matcher and the CPSO-based matcher in Table 2. PSO's and CPSO's results are the mean values of thirty independent executions.
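The one-cut-point crossover of Algorithm 2, applied twice per sub-problem in each generation of Algorithm 3 above, can be sketched as follows. The representation is an assumption of this sketch: a decoded alignment is a list whose ith entry holds the target indices mapped to the ith source class.

```python
import random

def crossover(parent1, parent2, p_cr=0.6, rng=random):
    """One-cut-point crossover on two decoded alignments.

    With probability p_cr, splice parent1's left part with parent2's
    right part; otherwise return a copy of the first parent.
    """
    if rng.random() >= p_cr:
        return list(parent1)  # no crossover applied
    cut = rng.randrange(1, len(parent1))  # cut strictly inside the alignment
    return list(parent1[:cut]) + list(parent2[cut:])
```

The child always inherits the first gene from the first parent and the last gene from the second parent whenever crossover fires, since the cut point is chosen strictly inside the alignment.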
PSO's configuration follows its original literature [1], and CPSO uses the following parameters, which represent a trade-off setting obtained empirically to achieve the highest average alignment quality on all testing cases:

- Maximum generation: maxGen = 3000;
- Crossover probability: p_cr = 0.6;
- Virtual population: N_p = 20;
- Upper threshold: threshold_up = 0.8;
- Lower threshold: threshold_lo = 0.5.

As can be seen from Table 1, CPSO determines better alignments than the PSO-based matcher and OAEI's participants in terms of both precision and recall. As shown in Table 2, CPSO significantly improves the convergence speed and reduces the memory consumption, which demonstrates the effectiveness of the compact encoding mechanism and the compact evolutionary operators. CPSO combines the mechanisms of a classic PSO with competitive learning, and the results achieved by CPSO are better, and attained faster, than those of classic PSO. The gains in solution quality and algorithmic performance are achieved, respectively, thanks to CPSO's competitive learning, which effectively leads the algorithm toward the optimal solution, and to CPSO's simplicity, which does not require all the mechanisms of a PSO; the few steps of the algorithm are small and simple. To efficiently optimize the hydrography ontology alignment's quality, in this paper, a discrete optimization model is constructed for the ontology matching problem, and a CPSO-based ontology matching technique is presented to solve it. CPSO utilizes a compact real-value encoding mechanism to approximate PSO's evolving process, which dramatically reduces PSO's runtime and memory consumption while ensuring the solution's quality. The experimental results show the effectiveness of our proposal.
[1] Discrete particle swarm optimisation for ontology alignment
[2] Alignment of surface water ontologies: a comparison of manual and automated approaches
[3] Optimizing ontology alignment in vector space
[4] Rational Chebyshev approximations for the error function
[5] The compact genetic algorithm
[6] WordNet: a lexical database for English
[7] Extended Tversky similarity for resolving terminological heterogeneities across ontologies
[8] Improving the interoperability of biomedical ontologies with compound alignments
[9] An ontology design pattern for surface water features
[10] Error functions, Dawson's and Fresnel integrals
[11] Survey on complex ontology matching
[12] Foundation of evaluation
[13] Enhanced place name search using semantic gazetteers
[14] An approach to comparing different ontologies in the context of hydrographical information
[15] Toward an inclusive semantic interoperability: the case of Cree hydrographic features
[16] Efficient user involvement in semiautomatic ontology matching
[17] Optimizing ontology alignments through a memetic algorithm using both MatchFmeasure and unanimous improvement ratio
[18] Using memetic algorithm for instance coreference resolution
[19] A complex alignment benchmark: GeoLink dataset