key: cord-0853870-nn6qxwhm authors: Kim, Hyoung Joong; Kim, Suah; Lee, Sungho title: On privacy enhancement using u-indistinguishability to COVID-19 contact tracing approach in Korea date: 2021-05-21 journal: Data Science for COVID-19 DOI: 10.1016/b978-0-12-824536-1.00010-1 sha: cb81bf3ce3989a3b11f0ffb29117fa5ead2c992e doc_id: 853870 cord_uid: nn6qxwhm

South Korea's COVID-19 contact tracing is unique because detailed and personally sensitive information has been disclosed. As a result, privacy concerns and controversies have been raised. As long as the Korean format of contact information is to be shared, a more sophisticated technique is required to mitigate the risk of privacy breaches. To meet the requirement of minimum privacy infringement, technical solutions for privacy enhancement are needed. In this paper, a u-indistinguishability concept is proposed and its effectiveness is shown. By mixing at least u patients' quasi-identifiers into a cluster and their associated movement information into another cluster, linkability is weakened significantly.

It is widely agreed that South Korea successfully managed its COVID-19 outbreak [1]. As of April 25, 2020, there had been 10,718 confirmed cases of COVID-19 in South Korea, with a total of 240 deaths [2]. Since March 18, 2020, the number of daily confirmed patients had been held to about 10, most of them arriving from abroad [3]. In early May, a young man who had visited several bars and nightclubs tested positive; he is believed to be one of the individuals behind a new COVID-19 cluster and is one of the latest super spreaders. As a result, quarantine authorities remain on high alert as confirmed cases spread among nightclub patrons, but the public has full confidence in the ability of the Korea Centers for Disease Control and Prevention (KCDC) to quickly stabilize the situation. The country has flattened its coronavirus curve by adopting an aggressive trace, test, and treat strategy.
It is notable that South Korea calmed the coronavirus outbreak without lockdowns, roadblocks, or restrictions on movement. In addition, South Korea held a nationwide election on April 15, 2020, in the midst of the pandemic, without a single one of the more than 29 million voters becoming infected with the novel virus afterward [3, 4]. During the pandemic, both the Korea Baseball Organization and the K-League started their regular seasons, and their games have been broadcast live abroad.

After the success of South Korea and Singapore in controlling COVID-19, many countries are now considering introducing contact tracing systems. Contact tracing is the process of identifying carriers of a disease and the people with whom they have come into contact, and having them quarantine as needed [5]. The contact tracing system of KCDC is a centralized system (Fig. 36.1 shows the data used by KCDC). The KCDC system is more comprehensive than most because it uses not only smartphone logs but also other data such as credit card transactions and CCTV footage. Needless to say, privacy concerns and controversies have been raised.

Samarati [17] and Sweeney [14] introduced the k-anonymity concept. This measure makes each record indistinguishable from at least k − 1 other records. In other words, k-anonymity requires each equivalence class to contain at least k records. Table 36.1 shows an example with nine patients. Attackers will try to reidentify the patients by associating Table 36.1 with other databases, such as a voter registration list, as Sweeney did [14]. To make reidentification difficult, suppression and generalization techniques are used. The suppression technique replaces the low-order part of a value with "*" so that k records can be expressed with the same value. In Table 36.1, the top three patients are in their 20s. The suppression technique replaces the last digit of the age of these three patients with "*" (see the top three patients in Table 36.2).
The generalization technique replaces the individual values of attributes with a broader category. In Table 36.1, the bottom three patients are over 40. The generalization technique replaces their ages with the broader category ">40" (see the bottom three patients in Table 36.2). Suppression is a special case of generalization with a narrower range. For example, the age range "2*" produced by suppression is equivalent to the range "20–29" produced by generalization.

Table 36.1
No.  Age  ZIP code  Disease
1    29   12345     Cancer
2    22   12345     Cancer
3    25   12346     Cancer
4    32   23456     Heart disease
5    36   23459     Stomach ulcer
6    34   23491     Heart disease
7    42   54321     Epilepsy
8    51   54322     Flu
9    56   54320     Angina pectoris

Each QID tuple in ... [18].

Two attack models against k-anonymity have been proposed: the homogeneity attack and the background knowledge attack [15]. The homogeneity attack exploits the case where all the values of the sensitive attribute within a set of k records are similar or identical. Given the same disease name (i.e., cancer for the top three patients in Table 36.2), the attacker may guess that Alice, 29, living in an area with ZIP code 12345, has cancer. The background knowledge attack increases the accuracy of inference by using background information about the patient. For example, Bob has a Japanese friend (aged 36, ZIP code 23459) who has been admitted to a hospital. However, Bob does not know what illness his friend was hospitalized for. Assume that it is a well-known fact that Japanese people have an extremely low chance of heart disease. Then Bob can guess that his friend is suffering from a stomach ulcer.

To defeat these attack scenarios, the l-diversity concept has been proposed [15]. The l-diversity principle requires every equivalence class to contain at least l distinct sensitive values. The top three patients in ...

To defeat this attack scenario, the t-closeness concept has been proposed [16]. The t-closeness model treats the values of an attribute distinctly by taking into account the distribution of data values for that attribute.
An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole database is no more than a threshold t. The t-closeness model is similar to the l-diversity model in terms of the diversity of the sensitive attributes. However, the former is stricter than the latter because it also considers the distribution of the attributes. Table 36.5 shows that the top three patients' illnesses are diverse, and their distribution has entries of dermatology, respiratory, and gastroenterology. The same holds for the middle three and the bottom three patients. It shows that 3-closeness is achieved in Table 36.5.

A newer technique that supplements the k-anonymity principle blends a number of dummy identities with fake location records [19] in location-based service applications. This approach is appealing because it effectively achieves k-anonymity without sacrificing data quality [20]. Applications of blending methods such as k-anonymity, l-diversity, and t-closeness can be found in Refs. [21–23].

South Korea's COVID-19 policy is considered effective, but unique. The uniqueness claim stems from privacy infringement concerns about contact tracing and the sharing of movement data. Epidemiological investigators get a real-time data feed on patients from the KCDC. After the investigation, KCDC releases the patients' movement information, including their whereabouts, transportation used, people they contacted, times spent at ...

To deter a reidentification attack, a quasi-identifier masking technique called k-anonymity [14, 25, 26] has been proposed. A sensitive attribute blending technique called l-diversity [15, 22] and a sensitive attribute distribution assimilating technique called t-closeness [16, 22] have also been proposed. However, none of these techniques has been applied to the KCDC patient tracking approach.
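The k-anonymity and l-diversity checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the rows are those of Table 36.1, while the function names, the choice to keep three ZIP digits, and the use of the simple "distinct values" form of l-diversity are assumptions made here for the example.

```python
from collections import defaultdict

# Rows of Table 36.1: (patient no., age, ZIP code, disease).
RECORDS = [
    (1, 29, "12345", "Cancer"),
    (2, 22, "12345", "Cancer"),
    (3, 25, "12346", "Cancer"),
    (4, 32, "23456", "Heart disease"),
    (5, 36, "23459", "Stomach ulcer"),
    (6, 34, "23491", "Heart disease"),
    (7, 42, "54321", "Epilepsy"),
    (8, 51, "54322", "Flu"),
    (9, 56, "54320", "Angina pectoris"),
]


def generalize(age, zip_code, zip_digits=3):
    """Blur a quasi-identifier (hypothetical rule chosen for this sketch).

    Ages over 40 are generalized to ">40"; other ages have their last
    digit suppressed ("2*"). Only the first `zip_digits` ZIP digits are kept.
    """
    age_label = ">40" if age > 40 else f"{age // 10}*"
    zip_label = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return age_label, zip_label


def equivalence_classes(records):
    """Group sensitive values by their generalized quasi-identifier."""
    classes = defaultdict(list)
    for _, age, zip_code, disease in records:
        classes[generalize(age, zip_code)].append(disease)
    return dict(classes)


def k_anonymity(classes):
    """Largest k such that every class holds at least k records."""
    return min(len(values) for values in classes.values())


def l_diversity(classes):
    """Largest l such that every class holds at least l distinct values."""
    return min(len(set(values)) for values in classes.values())


classes = equivalence_classes(RECORDS)
print("k =", k_anonymity(classes), "l =", l_diversity(classes))
```

Under these assumed generalization rules, every class holds three records, but the class of patients in their 20s contains only one distinct disease, which is exactly the situation the homogeneity attack exploits.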
The KCDC QID includes gender, age (or year of birth), nationality, route of infection, date of confirmation, name of the admitting hospital (see Table 36.6), and the residential area name added by the local government (see Table 36.7). The KCDC sensitive attribute does not appear in Table 36.6, but it is obviously COVID-19 itself. Movement information contains more detailed data, including times, places, and activities (see Table 36.7). Since April 12, 2020, the KCDC guidelines have changed to release the movement information for only two weeks [24].

The movement information does not belong to the category of sensitive attributes described above. The movement data part (second column in Table 36.7) is called an auxiliary attribute in this paper. Auxiliary attributes include five pieces of information: WHO did WHAT, WHEN, WHERE, and BY. The WHO attribute is not explicitly in the auxiliary section, but is in the QID part. The WHAT attribute is also not in the auxiliary section, but is implied. The WHEN and WHERE attributes are explicitly in the auxiliary section. The BY attribute represents a means of transportation (e.g., by taxi, by subway). If there is no BY attribute, it means that the patient moved on foot.

Activity information (i.e., WHAT) guessed from the movement data is potentially very harmful to the patient [27]. For example, patient #3 in Table 36.6 suffered from a rumor of an extramarital affair. The speculation was based on the fact that he accompanied a Chinese woman (patient #28) to a hotel and a plastic surgery clinic twice [28]. The movement information is dangerous because activity information is inferred from it regardless of whether the inference is true. If a man in his 50s and a woman in her 30s went to a hotel together, it is not unusual in Korea to assume that this is because of an affair. This serious misunderstanding began simply because the two people's movements coincidentally overlapped.
Unfortunately, the place and time were the same and led to an inaccurate conclusion. Clearly, this demonstrates the need for a method that maintains the accuracy of data while preventing inference.

... [29]. Attackers can infer a great deal of information from mobility analysis, as demonstrated in the following hypothetical instance. Even if Carol does not know the actual details of the conversation between Alice and Bob, she can deduce the intention of the two from the meeting. Carol can infer that frequent communications or meetings between them denote the planning stage of an action. Observers can guess all sorts of information about the intentions and actions of the target.

The importance of the patient movement data is clear. These data are very useful for epidemiological investigation and disinfection. However, disclosing these data creates several problems; one of them is the possibility of activity inference. The problem began with the separate release of movement information for each patient. It was inferred that patients #3 and #28 had been having an affair because their movement information overlapped. Because the identities of the two patients (#3 and #28) have not been disclosed and have been pseudonymized, the accuracy of the movement information is not compromised by mixing their mobility data into one cluster. If combining two patients' mobility data is not enough, then bundle three patients' data, and if that is still not enough, mix the data of u patients. The number u represents the number of patients mixed together and can be adjusted according to the level of security required. The solution is simple and easy: anonymity loves company. The answer lies in mixing and unlinking: QIDs are mixed, and so are the (WHEN, WHERE, BY)-tuples, so that the linkability between them is blurred.

Table 36.7.
It is easy to identify the name of the café or restaurant since Table 36.7 discloses detailed information on location (i.e., the name of the subway station, which is blinded in this paper, and the exit number). All the people she contacted are identified. Contacts are identified through the patient's statements, CCTV footage, or credit card transaction verification. All close contacts should be tested for COVID-19. Disinfection is carried out in the places visited by the patient. The record for patient #31 in Dongjak does not disclose much information since the woman followed the self-isolation guidelines well. On the other hand, patient #16 stayed in Korea for a long time and met many people. In the table, his occupation is identified. Newspapers reported that he is an American actor appearing in "The Phantom of the Opera." A total of 8578 audience members who had watched the musical from March 18 to 31 were monitored by the Seoul Metropolitan Government. Out of the total of 128 cast members, 126 were confirmed negative. A Canadian ballerina (#9864) tested positive on March 31, and the American actor (#10028) was confirmed positive on April 2. The performance venue, Blue Square Interpark Hall, was closed after disinfection according to the guidelines. Table 36.7 does not disclose detailed information surrounding the patient, such as the venue, which is inferred or provided by the competent authorities.

The basic concept of k-anonymity, l-diversity, and t-closeness is the same: blurriness by mixing. The u-indistinguishability is no exception. Note that u-indistinguishability is conceptually very similar to l-diversity and t-closeness. The latter two mix the sensitive attributes as much as possible to prevent privacy infringement of the k subjects. Since there are k subjects and k sensitive attributes in each cluster, the total number of WHO-WHAT attribute combinations is $O(k^2)$.
However, each QID is associated with each sensitive attribute one-to-one. In addition, because of the suppression and generalization techniques, each cluster contains few distinct QIDs. For example, the top three QIDs in Table 36.5 are reduced to one representative QID (i.e., an unknown person of age 20–40 living in the area with ZIP code 123**). As a result, the total number of sensitive attributes pertaining to a particular person is around $O(k)$.

On the other hand, u-indistinguishability blends the WHO parts into one cluster and the WHEN and WHERE parts into another cluster, so that it is difficult to infer WHO did WHAT. Each cluster has at least u QID members and at least u pieces of movement information. The first column of Table 36.8 has two clusters. One cluster has three persons' information: two males aged 67 and 42, respectively, and a female aged 35. The other cluster also has three persons' information: two males aged 75 and 60, respectively, and a female aged 40. In this manner, QIDs and their associated auxiliary attributes are clustered to mix the movement information (see Table 36.8).

Table 36.8 shows an example of 3-indistinguishability. Each row (or cluster) has at least three patients; the first row holds their seven pieces of movement information and the second row holds four. Note that auxiliary attribute mixing eliminates the overlapping movement information, if any, of people who traveled together. Patients #3 and #28 in Table 36.6 had visited the same places (a plastic surgery clinic and a hotel) at the same time. Thus, if they are in the same cluster, the duplicated data pieces can be removed. QIDs in the example can be blurred by applying k-anonymity. Patients and their movement information should be carefully selected and mixed based on the philosophy of l-diversity and/or t-closeness, if possible.
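The mixing step described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' or KCDC's procedure: the function name `mix_clusters`, the record layout, the rule that the last cluster absorbs leftover patients, and all times and transport modes in the sample data are hypothetical. The QIDs and place names echo those mentioned for Table 36.8, but their pairing here is invented for the example.

```python
def mix_clusters(patients, u):
    """Split patients into clusters of at least u members and unlink
    each cluster's QIDs from its (WHEN, WHERE, BY) movement tuples.

    Assumption: each patient is a dict with a "qid" tuple and a
    "movements" list of (when, where, by) string tuples.
    """
    if u < 1 or len(patients) < u:
        raise ValueError("need at least u patients to form a cluster")
    n_clusters = len(patients) // u
    clusters = []
    for i in range(n_clusters):
        start = i * u
        # The last cluster absorbs any leftover patients, so every
        # cluster still has at least u members.
        end = (i + 1) * u if i < n_clusters - 1 else len(patients)
        members = patients[start:end]
        # Sorting discards the original row order, so position no longer
        # reveals which patient a movement tuple belongs to.
        who = sorted(p["qid"] for p in members)
        # A set removes movement tuples duplicated by people who
        # travelled together.
        movements = sorted({m for p in members for m in p["movements"]})
        clusters.append({"who": who, "movements": movements})
    return clusters


# Hypothetical sample data (times and transport are made up).
patients = [
    {"qid": ("M", 67), "movements": [("10:00", "barber shop", "on foot"),
                                     ("14:00", "cinema", "taxi")]},
    {"qid": ("M", 42), "movements": [("11:00", "golf club", "taxi")]},
    {"qid": ("F", 35), "movements": [("12:00", "cafe", "subway"),
                                     ("14:00", "cinema", "taxi")]},
]

for cluster in mix_clusters(patients, u=3):
    print(cluster["who"])
    print(cluster["movements"])
```

In this sample, two patients share the same cinema visit, so the mixed cluster lists it once. A real deployment would also need to choose which patients to group, which the text leaves to the implementer.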
The key point of this paper is u-indistinguishability: at least u patients and their mobility information are mixed into each cluster. It is up to the implementer to apply this technique in more detail. If the number of patients is large, u can be increased. If the number of patients is small, noise can be intentionally added to the movement information.

In the cases of k-anonymity, l-diversity, and t-closeness, there is a one-to-one mapping between the subjects in the first column and their associated attributes in the second column, as in Table 36.7. However, the first row of Table 36.8 shows that three subjects in the first column constitute a cluster, and seven attributes in the second column make another cluster. Similarly, the second row shows a cluster of three subjects and another cluster of four attributes.

Under the premise that KCDC's disclosure of contact tracing data is standard practice, this paper proposes a method named u-indistinguishability to protect privacy. As shown in Table 36.8, it is not clear WHO did WHAT. Attackers can still try to infer WHO did WHAT, but their accuracy will drop sharply, especially when u is large. The advantage of u-indistinguishability is that it makes the number of candidate WHO-WHAT combinations approximately $O(u^2)$ for u patients and their more or less u pieces of mobility information. However, the total number of mobility attributes pertaining to a specific person is around $O(k)$.

Attackers can conduct mobility analysis to infer WHO did WHAT. They will exploit all kinds of information from the mobility data columns. For example, Korean women generally do not go to a barber shop to get their hair done. Thus, it is not strange to assume that the barber shop is not included in the itinerary of the 40-year-old woman in Table 36.8. Of course, it is possible that she took her son to the barber shop, but the probability is very low.
Since it takes more than 20 min to get a haircut, it can be inferred that a man in Table 36.8 who was at the barber shop cannot also have visited a café or golf club at the same time. We can make this inference using the time information in the mobility data. As a result, careful information blending is required. Based on these assumptions, we can guess that a man visited the barber shop first and then the cinema alone, or visited the barber shop only. We can also guess that the man visited the barber shop first and then the cinema with another man, or another woman, or both of them. It is clear that, theoretically, the man's itinerary includes the barber shop, café, golf club, and/or cinema. However, since there are u patients, many scenarios can be constructed by combining all of them.

In this paper, u-indistinguishability is considered. The value u is a hyperparameter whose value has to be set before the learning process begins. In a more detailed model, (u, v)-indistinguishability can be used with two hyperparameters, u and v. The value v denotes the number of movement activities. The value v must be at least equal to u and is, in general, larger than u.

In this paper, the format used by the KCDC and local governments to share COVID-19 patient movement information is examined. The KCDC COVID-19 contact tracing system has been evaluated as having the potential to violate privacy with its new patient information disclosure format. The problem with this system is that it discloses too much information. Nevertheless, if Korea must continue to rely on this system, a sophisticated technique to reduce privacy breaches must be implemented. Under these circumstances, it is shown that the u-indistinguishability approach is sufficient to achieve the desired purpose by making it more difficult for inaccurate and unfair inferences to be made from the information provided during contact tracing. So far, South Korea has not applied the u-indistinguishability method.
Countries test tactics in 'war' against COVID-19
Coronavirus disease-19
South Korea reports no new domestic coronavirus cases
No New Domestic Coronavirus Cases, No Transmission from Election
For contact tracing to work
Information technology-based tracing strategy in response to COVID-19 in South Korea: privacy controversies
South Korean FM on the Coronavirus: 'We Have to Live With This risk
Big data, big tech, and protecting patient privacy
Assessment of the data sharing and privacy practices of smartphone apps for depression and smoking cessation
BlueTrace: A Privacy-Preserving Protocol for Community-Driven Contact Tracing across Borders. White Paper
The next generation of data-sharing in financial services: using privacy enhancing techniques to unlock new value
Computer and Information Security Handbook
k-anonymity: a model for protecting privacy
L-diversity: privacy beyond k-anonymity
t-closeness: privacy beyond k-anonymity and l-diversity
Protecting respondents' identities in microdata release
A General Algorithm for K-Anonymity on Dynamic Databases
An anonymous communication technique using dummies for location-based services
Location Privacy Protection in Mobile Networks
The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks
Anonymization of sensitive quasi-identifiers for l-diversity and t-closeness
Protecting trajectory from semantic attack considering k-anonymity, l-diversity, and t-closeness
South Korea's COVID-19 patient movement information release guidelines: too much detail
Local synthesis for disclosure limitation that satisfies probabilistic k-anonymity criterion
Anonymous scheme for privacy-preserving data collection in iot-based healthcare services systems
'More scary than coronavirus': South Korea's health alerts expose private lives
A 'travel log' of the times in South Korea: mapping the movements of coronavirus carriers

Thanks to Kari Karlsbjerg for proofreading the article and insightful suggestions. This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant number: KHIDIHI19C0785020020).