key: cord-0043217-3c7zm8s6 authors: Zhou, Tao; Lim, Ee-Peng; Lee, Roy Ka-Wei; Zhu, Feida; Cao, Jiuxin title: Retrofitting Embeddings for Unsupervised User Identity Linkage date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47426-3_30 sha: 2d6c22eb82b6b978360ab970794993f939c25c53 doc_id: 43217 cord_uid: 3c7zm8s6 User Identity Linkage (UIL) is the problem of matching user identities across multiple online social networks (OSNs) which belong to the same person. The solutions to UIL problem facilitate cross-platform research on OSN users and enable many useful applications such as user profiling and recommendation. As the UIL labeled data are often lacking and costly to obtain, learning user embeddings for matching user identities using an unsupervised approach is therefore highly desired. In this paper, we propose a novel unsupervised UIL framework for enhancing existing user embedding-based UIL methods. Our proposed framework incorporates two key ideas, user-discriminative features and retrofitting embedding. The user-discriminative features enable us to differentiate a specific user identity from other users in its OSN. From the user-discriminative features, we derive pairs of similar user identities across OSNs for retrofitting the base user embeddings of existing UIL methods. Through extensive experiments on three real-world OSN datasets, we show that our framework can leverage user-discriminative features to improve the accuracy of different user embedding-based UIL methods significantly. The quantum of improvement can also be surprisingly good even for existing UIL methods with very poor matching accuracy. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-47426-3_30) contains supplementary material, which is available to authorized users. With rapid development of online social networks (OSNs), the number of OSN platforms increases quickly serving different user needs. The multiplicity of OSNs motivates the problem of User Identity Linkage (UIL), which aims to match user accounts from different OSN platforms belonging to the same persons. UIL addresses the issue of fragmented user information across platforms, which is important to cross-platform user profiling, recommendation applications and research on social networks including information diffusion, community analysis, and influential user modeling. We denote an OSN as G = (U, E), where U represents the set of user identities and E ⊆ U × U represents the set of links between user identities. Each user identity u i ∈ U is associated with some attributes, e.g., name, content, etc. Given two OSNs G s and G t as the source and target platforms respectively, the UIL task is to find for each user identity u s from U s a user identity u t from U t such that u s and u t belong to the same real person. While the problem is defined for a two-platform setting, it can be easily extended to more platforms. UIL methods can be classified into supervised, semi-supervised and unsupervised approaches. Most of the existing UIL methods adopt the supervised and semi-supervised approaches [19] . These approaches require ground truth labeled user identity pairs for training while the collecting of labeled data suffers from many problems. The unsupervised approach to UIL can avoid the issues from labeled data. It, however, has another set of challenges: 1) Unsupervised UIL method has to cope with multiple attributes with heterogeneity domains, preferably in a unified manner; 2) Discriminative cross-platform attribute similarities are needed to compare the attribute values; 3) The attribute importance need to be incorporated without pre-labeling. Our main research objective is to create unsupervised UIL methods that can cope with the above challenges. With recent advances in embedding techniques, user embeddings techniques have been shown to be effective in solving the UIL problem in the unsupervised approach [21] . The user embedding techniques essentially map every user identity (from any OSN) to a common embedding space. User identities with similar attribute values are expected to be mapped to similar locations. Hence, user embeddings can effectively address the first challenge. To address the remaining two challenges, we propose a general framework for unsupervised UIL using two main ideas, namely: (a) user-discriminative features, and (b) retrofitting embeddings. User-discriminative features are ones that are indicative of specific user identities in an OSN. Retrofitting embeddings is a technique, used largely in word embeddings, to modify an existing base embeddings of words to incorporate some synonym word-pair knowledge [3] . To the best of our knowledge, this paper is the first that introduce retrofitting to improve user embedding-based UIL techniques. Our framework is novel in that it can accommodate any unsupervised UIL method as user embeddings. The framework then improves the user embeddings of the existing method using user identity pairs obtained by pairing user identities that are similar based on user-discriminative features. Overview of Proposed Framework. Fig. 1 depicts our proposed UIL framework for two OSN platforms, Platforms 1 and 2. The choice of two platforms is to keep the description simple, as the framework can be easily generalized to handle more OSN platforms. The framework takes an existing user embedding as input, which we call the base embeddings. In the figure, user identities u1 and u2 are from different OSNs, and they are assigned with embeddings vectors v1 and v2 respectively in the base embedding space. The other user identities are similarly mapped into the same user embeddings space. In Step 1, the framework identifies user-discriminative features for each user identity in Platform 1, and does the same for Platform 2. In Step 2, we form a set of cross-platform similar user identity pairs by pairing user identities with some overlapping user-discriminative features. In Step 3, a set of similar cross-platform user identity pairs are used to retrofit the base user embeddings for final user identity linkage. As only qualified part of user pairs are considered for retrofitting, this process would be highly efficient. Research Contribution. We summarize our key contributions as follows: -We propose user-discriminative features to overcome the issues of multiple attributes of heterogeneous domains in different OSN platforms. We introduce a parameter to incorporate the importance of attribute in UIL. -We propose an unsupervised UIL algorithm based on retrofitting embeddings which take advantage of both base user embeddings and similar user-identity pairs. To the best of our knowledge, this is the first time retrofitting is used to achieve higher UIL accuracy for different base user embeddings. Moreover, retrofitting is highly efficient compared to the base user embedding learning. -We conduct extensive experiments on three real-world OSN datasets with many different settings. The results show the effectiveness of our methods. There are many UIL methods adopting the supervised approach [5, 6, [10] [11] [12] [14] [15] [16] 22] . They can be broadly classified into those using classification techniques and others using embedding techniques. The former typically extract features from user attributes (e.g., user name, profile description, content, and network) to train a classifier for predicting pairs of user identities to belong to the same users or not. For example, Zafarani et al. [22] proposed MOBIUS, a UIL method which utilizes username features and a Naive Bayes classifier. There are few recent works using the user embedding techniques for supervised UIL. Man et al. [10] proposed PALE, a supervised embedding based UIL method that utilizes network features. PALE employs network embeddings incorporating known pairs of matching user identities from different OSNs as anchor links. Mu et al. proposed another supervised method called ULink [12] to map known pairs of matching user identities to a common latent space. Semi-supervised approach considers both labeled and unlabeled pairs of matching user identities in model learning [1, 2, 8, 13, 18, 20, 23, 24] . HYDRA is a semi-supervised framework which models user behaviors and structure consistency [8] . COSNET is a semi-supervised method which utilizes local and global consistency across multiple networks [23] . Unlike the above methods, our proposed model adopts an unsupervised embedding approach to address the UIL problem. There are relatively fewer works on unsupervised UIL [4, 7, 17, 21] . Gao et al. [4] proposed CNL, an unsupervised method, to utilize multiple attributes to perform UIL in an incremental manner. Factoid Embedding [21] is the state-of-art unsupervised approach which utilizes multiple user attributes to learn a user embeddings for UIL. The method first constructs factoids from user attributes and learns user identity latent representation by embedding the factoids. However, the user embedding is learned based on only local information of OSNs. Cross platforms features have been ignored. The UIL results, therefore, could be poor if the local user attribute information is noisy. Our proposed framework addresses this limitation by introducing user-discriminative features. In this paper, we will demonstrate how Factoid Embeddings can be used as input base embeddings so as to achieve improved UIL results. Our proposed framework supports different types of base user embeddings. While these embeddings are input to our framework, we would like to introduce two important embeddings techniques: one for matching user identities based on a single user attribute, and another for matching based on multiple user attributes. Single Attribute-Based User Embeddings. Here we use username attribute to illustrate the single attribute-based user embeddings, also known as name embeddings. Note that the process could be applied to any other attribute which has similarity measure for two users. We define username similarity using cosine similarity on TF-IDF vectors of n-gram representation of usernames. We first collect all n-grams (n ∈ [2, 5] in this work) from username of user identities from all the OSN platforms. Next, we construct the n-gram TF-IDF vector of every user in U s ∪ U t . Then we define sim username (u i , u j ) as the cosine similarity between the TF-IDF vectors of users u i and u j denoted by w i and w j respectively. The username embeddings v are learnt by minimizing: Factoid Embeddings. It is an embedding UIL method making use of multiple user attributes [21] . Details are in Section A.2 of Supplementary Material. User-discriminative features are ones that help distinguishing a user identity from others in the same OSN. From the discriminative features of user identities, we derive the cross-platform similar user identity pairs. Each crossplatform similar user identity pair (u i , u j ) is assigned a similarity score s ij . The larger the s ij , the higher the likelihood that the u i and u j from different OSNs belong to the same user. As a user identity can have multiple attributes, we derive different types of user-discriminative features and the associated cross-platform similarity score s ij 's as follows. One can derive for other user attributes in a similar manner. User-Discriminative Name Features. We use n-grams in username to generate discriminative name feature. For each user identity u i in OSN platform G s , we collect n-grams (n ∈ [2, 5] ) in its username as NG s i . Let NG s be the set of all n-grams of platform G s = (U s , E s ), i.e., ∪ ui∈Us NG i . The set of userdiscriminative n-grams is then defined by DN s = {ng|ng ∈ NG s , |{u i |ng ∈ NG s i }| < t n } where t n is a pre-defined threshold to keep only the ngrams that are not popular among user identities. In a similar way, we define the userdiscriminative n-grams for the target OSN as DN t . Given user u i from G s and user u j from G t , we finally define the similarity score s ij by Jaccard Similarity, i.e.: User-Discriminative Content Features. Users may generate different types of content such as text and images. In this paper, we consider textual content attribute only. The calculation is similar with user-discriminative name features except we exchange n-grams with words in user content and use t c as the threshold for selecting content features not popular among users rather than t n . We denote the neighbor set of a user identity u in OSN platform G s as N s (u). If the degree of u , a neighbor of u, is large, u would be less important for identifying u because many user identities have u as their neighbor. We thus use the degree of user identity to determine user-discriminative neighbors. For u i ∈ U s , we define the user-discriminative neighbors of u i as DB s t d is a threshold to determine neighbors who do not have many social connections. Similarly, the user-discriminative neighbors of u j in OSN G s , DB t j is defined. Unlike the earlier attributes, it is not possible to expect overlapping neighbors between two OSN platforms. We adopt some base user embeddings to determine pairs of similar discriminative neighbors across platforms. The similar discriminative neighbor pairs, denoted as DP ij , according to the base embeddings. Intuitively, the number of unique user identity pairs in DP ij reflects the cross-platform identity similarity between u i and u j . Hence, s ij is defined as: The intuition of our method is to retrofit by pushing cross-platform similar user identity pairs closer in embedding space while keeping the other base user embedding vectors unchanged as much as possible. For each cross-platform user identity pair (u i , u j ), we would retrofit the affected user embedding vectors according to the cross-platform similarity score s ij . The larger s ij , the closer the retrofitted embedding vectors of u i and u j should be. For a user identity u i , letv i denote the base embedding vector of u i generated using base user embeddings. We use v i to denote the retrofitted embedding vector of u i , which needs to be learned. Let P be the set of cross-platform similar user identity pairs with s ij scores, i.e. We learn the retrofitted embedding vector v for all the u ∈ U s ∪ U t by minimizing the following objective function: where ϕ (a, b) is the cosine distance between vectors a and b, i.e. ϕ (a, b) = 1 − cos (a, b) . α (0 ≤ α ≤ 1) is the weight to adjust the degree of retrofitting. Stepwise Approach. As we want to use multiple discriminative features to retrofit the base user embedding, we adopt a stepwise approach to retrofit the user embedding iteratively, which is to regard the retrofitted embeddings as new base embeddings when another discriminative feature is applied. To comprehensively capture the similar user identities, we introduce a hierarchical approach to generate the user-discriminative features. We basically select a few thresholds t's, and derive different sets of user-discriminative features based on the thresholds. Each cross-platform user identity pair will then be assigned a set of scores s ij 's, one for each set of user-discriminative features (e.g., name). With cosine distance, we cannot apply the traditional approach in retrofitting to minimize the objective function. Instead, we use Stochastic Gradient Descent (SGD) for optimization. To apply SGD, we rewrite the objective function in Eq. 4 as following by moving the position of summation: where β is the weight to adjust the degree of retrofitting. In each iteration, we update the embedding v i and v j by the following rule: where γ is the learning rate. The detailed optimization is available in Section C of Supplementary Material. We summarize our retrofitting embedding in Algorithm 1 which excludes hierarchical user-discriminative feature for clarity. Each of the hierarchical features would need one step of retrofitting similar as lines 2-17. The retrofitting algorithm is efficient as only the user pairs with positive score will be used. Finally, once we have learned the retrofitted embedding vector v for all the u ∈ U s ∪ U t , we will compute the cosine similarity between two user's embedding vectors as linkage score for each user pair across platforms. The user pairs with larger scores are more likely to belong to the same underlying natural person. In the optimization objective function, weight β is an important parameter to control the learning of retrofitting embedding. Smaller β indicates the retrofitted embedding preserves more of the base user embedding, while a larger β will give more weight to the user-discriminative features to change the base embedding. Therefore, β should be set larger when the base user embedding has not shown strong ability in performing UIL, and the user-discriminative features should be given more weight to improve UIL. The choice of β for different userdiscriminative features would also balance the importance of multiple attributes. We evaluate our proposed framework using three real-world datasets, namely Instagram-Twitter (IG-TW ), Foursquare-Twitter (FQ-TW ), and IG-TW content datasets. These are social network datasets that are significantly larger than other social networks for UIL research involving. We start from a set of Singapore-based Twitter users and retrieve users who declared their Instagram or Foursquare accounts to construct the multi-platform datasets. From the IG-TW dataset, we extracted user identities with post content only into the IG-TW content dataset. The statistics of all the datasets are summarized in Table 1 . We compare methods based on the proposed framework with several other unsupervised baseline methods. The baseline methods include: -Weighted content embedding (CE weighted): This is similar to content embedding except that the user identity is represented by a weighted average of word vectors of words found in the content attribute. The weight of a word is defined by its TF-IDF value in the content. -Factoid Embedding (FE): This is the state-of-art unsupervised method [21] . In our experiment, FE makes use of all attributes in each dataset (FE for IG-TW content dataset is denoted as FE c for differentiation). Name embedding and content embedding are used as the attribute embeddings in FE when applying the name and content attributes respectively. We evaluate different RE's using our proposed framework with different baselines as their base user embeddings. We use RE q p to denote our proposed method with q as base user embeddings and p as the user-discriminative feature(s) used for retrofitting. These proposed methods are RE NE n , RE NE nb , RE CE c , RE F E n , RE F E nb , RE F Ec c and RE F Ec cb , where n, b and c denote user-discriminative features for name, network and content respectively. Note that the order of applying user-discriminative features depends on the base embedding. For the sake of showing the usefulness of user-discriminative features, we start retrofitting the more discriminative features before the less discriminative ones. When generating user-discriminative features, we need to configure the threshold for each attribute as defined in Sect. 3. In the following experiments, t n for name attribute is set to 5. For the content attribute, we adopt a hierarchical approach to generate user-discriminative features using multiple thresholds. The details will be elaborated in subsequent sections. Threshold t d is set to 20 for all the datasets, and t s in network attribute is set to be 0.4, 0.2 and 0.15 when FE, NE, and CE are used as the base user embedding respectively. The choice of t s is dependent on the similarity distribution of user pairs cross platforms. In retrofitting embedding, the parameter β needs to be configured. We set β to 1 when name attribute is used for retrofitting (a special case will be mentioned in later experiment sections). When the content attribute is employed, β is varied for the different settings in a hierarchical approach used to generate the user-discriminative features. For network attribute, β is set to be 15, 10 and 4 respectively for IG-TW, FQ-TW and IG-TW content datasets. The maximum number of iterations for the optimization is set to be 30,000. It is also interesting to note that our optimization process is fast as we only consider pairs of users who have non-zero similarity scores based on their user-discriminative features. All the methods are required to rank the ground truth matching identity as high as possible based on the linkage score (cosine similarity of embeddings). We use HitRate@K and Mean Reciprocal Rank(MRR) to evaluate the ranking. Tables 2 and 3 respectively. For IG-TW dataset, RE NE n slightly outperforms NE, indicating that the user-discriminative name features can improve the UIL accuracy even when and NE are similar. A possible reason is that FQ-TW dataset only contains screen name, and has less useful information that can be used for retrofitting. β has been set to a relatively lower value (0.08) to avoid introducing noise in this specific case. Even though it is difficult to improve using user-discriminative name feature, the retrofitting could still be controlled to retain the base user embedding performance. On the whole, our proposed methods using the retrofitted embeddings have outperformed the state-of-art embedding based methods. RE F E nb , which combines FE with user-discriminative name and network features, obtains the best performance in linking user identities across multiple platforms. Experiments with Content and Network Attributes. IG-TW content dataset contains both content and network attribute information. Thus, the experiment on this dataset involves the baselines CE, FE c , and REs using user-discriminative content and network features. The results of the experiment on IG-TW content dataset are shown in Table 4 . The first four methods make use of content-only information. Our method RE CE c , with CE as the base user embedding, has significantly outperformed the baselines. The FE c in Table 4 uses CE as its attribute embedding. We observe that FE c has obtained a significant improvement in performance over CE by introducing the network information. However, it is interesting to note that RE CE c outperforms FE c , and RE F Ec c has even further improved the performance with the use of better base embedding (i.e., In this paper, we propose a novel unsupervised user identity linkage (UIL) framework for enhancing existing UIL methods based on user embeddings techniques. Our proposed framework incorporates two key ideas, user-discriminative features and retrofitting embeddings. Our framework applies the user-discriminative features to derive pairs of cross-platform similar user identities for retrofitting the base user embeddings. Through extensive experiments on three real-world OSN datasets, we show that our proposed framework can leverage user-discriminative features to effectively improve the accuracy of different base user embeddings. For future work, we will conduct a more in-depth study on the parameters used in our framework, and will design methods to optimize them automatically. Matching user profiles across social networks Discovering links among social networks Retrofitting word vectors to semantic lexicons CNL: collective network linkage across heterogeneous social platforms Exploiting innocuous activity for correlating users across sites Identifying users across social tagging systems What's in a name?: an unsupervised approach to link users across communities Hydra: Large-scale social identity linkage via heterogeneous behavior modeling Topical word embeddings Predict anchor links across social networks via an embedding approach Ad-link: an adaptive approach for user identity linkage User identity linkage by latent user space modelling De-anonymizing social networks Identifying users across social networks based on dynamic core interests Entity matching in online social networks How unique and traceable are usernames? In: PETS Linking users across domains with location data: Theory and validation Controllable information sharing for user accounts linkage across multiple online social networks User identity linkage across online social networks: a review Mapping users across networks by manifold alignment on hypergraph Unsupervised user identity linkage via factoid embedding Connecting users across social media sites: a behavioralmodeling approach Cosnet: Connecting heterogeneous social networks with local and global consistency Cross-platform identification of anonymous identical users in multiple social media networks