authors: Singhofer, Fabian; Garifullina, Aygul; Kern, Mathias; Scherp, Ansgar title: rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm date: 2021-05-18 Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows giving different weights to the relational and textual attributes during the anonymization process. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics. Researchers benefit from companies, hospitals, and other research institutions that share and publish their data, which can be used for predictions, analytics, or visualizations. However, data to be shared often contains Personally Identifiable Information (PII), which requires measures to comply with privacy regulations like the Health Insurance Portability and Accountability Act (HIPAA) for medical records in the United States or the General Data Protection Regulation (GDPR) in the European Union. One possible measure to protect PII is to anonymize all personal identifiers. Prior work considered such personal data to be name, age, email address, gender, sex, ZIP, and any other identifying numbers, among others [12, 16, 31, 34, 52]. Therefore, the field of Privacy-Preserving Data Publishing (PPDP) has been established, which assumes that a data recipient could be an attacker who might also have additional knowledge (e.g., by accessing public datasets or observing individuals). Data to be shared can be structured in the form of relational data or unstructured like free texts. Research in data mining and predictive models shows that a combination of structured and unstructured data leads to more valuable insights. One successful example involves data mining on COVID-19 datasets containing full texts of scientific literature and structured information about viral genomes. Zhao and Zhou [62] showed that linking the mining results can provide valuable answers to complex questions related to genetics, tests, and prevention of SARS-CoV-2. Moreover, the combination of structured and unstructured data can also be used to improve predictions of machine learning models. Teinemaa et al.
[54] developed a model for predictive process monitoring that benefits from adding unstructured data to structured data. Therefore, links within heterogeneous data should be preserved, even if anonymized. However, state-of-the-art methods focus either on anonymizing structured relational data [21, 32, 35, 41, 53, 55] or on anonymizing unstructured textual data [8, 16, 34, 45, 47, 50], but not on jointly anonymizing both. For example, for structured data the work by Sweeney [53] introduced the privacy concept k-anonymity, which provides a framework for categorizing attributes with respect to their risk of re-identification, attack models on structured data, as well as algorithms to optimize the anonymization process by reducing the information loss within the released version. For unstructured data like texts, much effort has been devoted to developing systems which can automatically recognize PII within free texts using rule-based approaches [40, 45, 50] or machine learning methods [8, 16, 24, 34] to allow for replacement in a subsequent step. To the best of our knowledge, the only work that aimed to exploit synergies between anonymizing texts and structured data is by Gardner and Xiong [16]. The authors transferred textual attributes to structural attributes and subsequently applied a standard anonymization approach. However, there is no recoding of the original text, i.e., there is no transfer back of the anonymized sensitive terms. Thus, essentially Gardner and Xiong [16] only anonymize structured data. Furthermore, there is no concept of information redundancy, which is needed for a joint anonymization, and there is no weighting parameter to control the influence of relational versus textual attributes.

Table 1: Running example of a de-normalized dataset D with relational and textual attributes. A* is an attribute directly identifying an individual. A_1, ..., A_5 are considered quasi-identifiers and do not directly reveal an individual. X is the textual attribute. See Table 3 for details on notations. (Table content omitted; its columns are the direct identifier A*, the relational attributes A_1, ..., A_5, and the textual attribute X.)

To illustrate the problem of a joint anonymization of textual and structured data, we consider an example from a blog dataset. As Table 1 indicates, a combined analysis relies on links between the structured and unstructured data. Therefore, it is important to generate a privacy-preserved, but also consistently anonymized release of heterogeneous datasets consisting of structured and unstructured data. Due to the nature of natural language, textual attributes might contain redundant information which is already available in a structured attribute. Anonymizing structured and unstructured parts individually neglects redundant information and leads to inconsistencies in the data, since the same information might be anonymized differently. Moreover, for privacy-preserving releases, assumptions on the knowledge of an attacker are made. Privacy might be at risk if the anonymization tasks are conducted individually and without sharing all information about an individual. We provide a formal problem definition and the software framework rx-anon for a joint anonymization of relational data with free texts. We experiment with two real-world datasets to demonstrate the benefits of the rx-anon framework. As baselines, we consider the scenarios where relational and textual attributes are anonymized alone, as is done by the traditional approaches. We show that we can reduce the information loss in texts under the k-anonymity model.
Furthermore, we demonstrate the influence of the parameter that influences the weight between relational and textual information and optimize the trade-off between relational and textual information loss. In summary, our work makes the following contributions: • We formalize the problem of anonymizing heterogeneous datasets composed of traditional relational and textual attributes under the -anonymity model and introduce the concept of redundant information. • We present an anonymization framework based on Mondrian [31] with an adapted partitioning strategy and recoding scheme for sensitive terms in textual data. To this end, we introduce the tuning parameter to control the share of information loss in relational and textual attributes in Mondrian. • We evaluate our approach by measuring statistics on partitions and information loss on two real-world datasets. We adapt the Normalized Certainty Penalty score to the problem of a joined anonymization of relational and textual data. Below, we discuss related works on data anonymization. We provide a problem formalization in Section 3 and introduce our joined de-anonymization approach rx-anon in Section 4. The experimental apparatus is described in Section 5. We report our results in Section 6. We discuss the results in Section 7, before we conclude. where records are grouped and each group is transformed such that their quasi-identifiers are equal. To achieve -anonymity, Samarati [46] studied suppression and generalization as efficient techniques to enforce privacy. In addition, Meyerson and Williams [38] and LeFevre et al. [31] have shown that optimal -anonymity in terms of information loss both in the suppression model and for the multidimensional case is -hard. Several algorithms have been developed to efficiently compute a -anonymous version of a dataset while keeping the information loss minimal. Sweeney [52] proposed a greedy approach with tuple suppression to achieve -anonymity. LeFevre et al. [31] suggested a top-down greedy algorithm Mondrian which implements multidimensional -anonymity using local recoding models. Ghinita et al. [17] showed how optimal multidimensional -anonymity can be achieved by reducing the problem to a one-dimensional problem which improves performance while reducing information loss. Based on the -anonymity model, several extensions have been introduced and studied, where -diversity and -closeness are most popular. Machanavajjhala et al. [35] introduced the model of -diversity to prevent homogeneity and background knowledge attacks on the -anonymity model. -diversity uses the concept of sensitive attributes to guarantee diversity of sensitive information within groups of records. Li et al. [32] introducedcloseness, which extends the idea of diversity by guaranteeing that the distribution within groups does not differ more than a threshold from the global distribution of sensitive attributes. While -anonymity was initially designed to be applied for a single table containing personal data (also called microdata), it has been transferred to different settings. Nergiz et al. [41] investigated the problem of anonymizing multi-relational datasets. They state that -anonymity in its original form cannot prevent identity disclosure neither on the universal view nor on the local view and therefore modified -anonymity to be applicable on multiple relations. Gong et al. [18] showed that regular -anonymity fails on datasets containing multiple entries for one individual (also called 1:M). 
To anonymize such data, they introduced ( , )-diversity as a privacy model which is capable of anonymizing 1:M datasets. Terrovitis et al. [55] applied -anonymity to transactional data. Given a set of items within a transaction, they treated each item to be a quasi-identifier as well as a sensitive attribute simultaneously. The solution introduces -anonymity which adapts the original concept of -anonymity and extends it by modeling the number of known items of the adversary in the transaction as . He and Naughton [21] proposed an alternative definition of -anonymity for transactional data where instead of guaranteeing that subsets are equal in at least transactions, they require that at least transactions have to be equal. Finally, Poulis et al. [42] showed how -anonymity can be applied to data consisting of relational and transactional data and stated that a combined approach is necessary to ensure privacy. Anonymization of Unstructured Data. In order for textual data to be anonymized, information in texts that may reveal individuals and therefore considered sensitive must be recognized. In recent work, two approaches have been used to extract so called sensitive terms in text. First, Sánchez et al. [47] proposed an anonymization method which makes use of the Information Content (IC) of terms. The IC states the amount of information a term provides and can be calculated as the probability that a term appears in a corpus. The reasoning behind using the IC of terms to detect sensitive information is that terms which provide high information tend to be also sensitive in a sense that an attacker will gain high amounts of information if those terms are disclosed. The advances in the field of Natural Language Processing (NLP) have been used to detect sensitive terms by treating them as named entities. Named Entity Recognition (NER) describes the task of detecting entities within texts and assigning types to them. Named entities reflect instances or objects of the real world, like persons, locations, organizations, or products among others and provide a good foundation for detecting sensitive information in texts. Therefore, recent work formulated and solved the detection of sensitive information as a NER problem [11, 16, 24, 34, 57] . Early work on NER to identify sensitive terms was based on rules and dictionaries [45, 50] . Sweeney [50] suggested a rule-based approach using dictionaries with specialized knowledge of the medical domain to detect Protected Health Information (PHI). Ruch et al. [45] introduced a system for locating and removing PHI within patient records using a semantic lexicon specialized for medical terms. Advances in machine learning led to new approaches on the de-identification of textual data. Gardner and Xiong [16] introduced an integrated system which uses Conditional Random Fields (CRF) to identify PII. Dernoncourt et al. [8] implemented a de-identification system with Recurrent Neural Networks (RNNs) achieving high scores in the 2014 Informatics for Integrating Biology and the Bedside (i2b2) challenge. Liu et al. [34] proposed a hybrid automatic de-identification system which incorporates subsystems using rules as well as CRFs and Bidirectional Long Short-Term Memory (BiLSTM) networks. They argued that a combined approach is preferable since entities such as phone numbers or email addresses can be detected using simple rules, while other entities such as names or organizations require trained models due to their diversity. 
Fundamental work on transformer neural networks established by Vaswani et al. [58] raises the question, whether transformers can also lead to advances in anonymizing free texts. Yan et al. [61] suggested to use transformers for NER tasks as an improvement to BiLSTM networks. In addition, Khan et al. [27] showed that transformer encoders can be used for multiple NLP tasks and for specific domains such as the biomedical domain. Finally, Johnson et al. [24] were first to propose a de-identification system using transformer models [58] . Their results indicate that transformers are competitive to modern baseline models for anonymization of free texts. In addition to the detection of sensitive information using NER, important related work is also on replacement strategies for such information in text. Simple strategies involve suppressing sensitive terms with case-sensitive placeholders [45] or with their types [40] . While those strategies are straightforward to implement, a disadvantage is loss of utility and semantics in the anonymized texts. More complex strategies use surrogates as consistent and grammatically acceptable replacements for sensitive terms [11, 57] . In contrast to the generation of surrogates, Sánchez et al. [47] used generalization to transform sensitive terms to a more general version in order to reduce the loss of utility while still hiding sensitive information. Work Using Synergies Between Both Fields. Anonymization of structured and unstructured data has mostly been considered in isolation. There were few works using synergies between both fields. Chakaravarthy et al. [5] brought a replacement technique for structured data to the field of unstructured texts. They used properties from -anonymity to determine the sensitive terms to be anonymized within a single document by investigating their contexts. Moreover, to the best of our knowledge, only Gardner and Xiong [16] studied the task of anonymizing heterogeneous datasets consisting of texts and structured data. They provided a conceptual framework including details on data linking, sensitive information extraction, and anonymization. However, their work has no concept of redundant information between structured and textual data, as we introduce in rx-anon. Furthermore, they have no weighting parameter to balance anonymization based on structural versus textual data like we do. Basically, Gardner and Xiong [16] transfer the problem of text anonymization to the structured world and then their approach forgets about where the attributes came from. They do not transfer back the anonymized sensitive terms to recode the original text. So the output of Gardner and Xiong [16] 's anonymization approach is just structured data, which lacks its original heterogeneous form. We propose a method on automatically anonymizing datasets which are composed of relational attributes and textual attributes. Our approach is unsupervised and applicable across different domains. In order to achieve this task, we need to explain the process of anonymization, formalize the problem of anonymizing heterogeneous data, and describe our anonymization algorithm which is based on -anonymity. For anonymizing a given dataset, multiple steps are necessary to provide a privacy-preserved release. We refer to release as the anonymized version of a given dataset, but a release does not necessarily have to be made publicly available. 
In general, the process of anonymization can be divided into three parts, namely preparation, anonymization, and verification [23] . In the preparation phase, the intended audience is assessed, attributes with their types are named, risks of re-identification attacks are analyzed, and the amount of anonymization is calculated based on the results of the prior steps. The next step involves the anonymization itself, where a dataset and determined parameters are taken as an input, and an anonymized dataset depicts the output. Finally, the verification step requires to assess that the required level of anonymization has been achieved (e. g., by removing all PII) while remaining the utility of the anonymized dataset. Depending on the dataset to be anonymized, there exist several different attributes which need to be anonymized. We have analyzed the literature and categorize the attributes with respect to the scale of the data (i. e., nominal, ordinal, ratio) and their cardinality of relation (i. e., one-to-one, one-to-many, and many-to-many). While the scale is important to know how attributes can be manipulated in order to achieve anonymity, the cardinality of relation provides information how attributes and individuals relate to each other. Table 2 contains a non-exhaustive list of attributes, which typically appear in datasets and are critical with respect to re-identification attacks. For the attributes listed in Table 2 we use four scales, namely nominal, ordinal, interval, and ratio. However, interval and ratio can be grouped together as numerical for the anonymization task. Moreover, the cardinality of a relation between an individual and the attribute is to be interpreted as follows: A one-to-many relation means that one individual can have multiple instances of an attribute (e. g., multiple credit card numbers), whereas many-toone depicts a scenario where many individuals have one property in common (e. g., place of birth). One-to-one and one-to-many attributes directly point to an individual and therefore are considered direct identifiers and must be removed prior to releasing a dataset. However, many-to-one and many-to-many attributes do not reveal an individual directly and therefore are called quasi-identifiers and might remain in an anonymized form in the released version of the dataset. Even though in Table 2 we present one exclusive cardinality of relation for each attribute, there are always cases where the cardinality of relation depends on context of attributes or whole datasets. An example is home address, where we state that it is a one-to-one attribute. However, this only holds if only one person of a household appears in the dataset. If multiple persons of a household appear in a dataset, we would need to consider it many-to-one. Moreover, if one individual might appear twice with different addresses (e. g., having two delivery addresses in a shop), it would be an one-to-many attribute. In order to provide a method for anonymizing heterogeneous data composed of relational and textual attributes, we first formalize this problem. In the remainder of this work, we will focus on the task of de-identification of heterogeneous datasets containing traditional relational as well as textual data. We aim to anonymize a dataset by hiding directly identifying attributes. To prevent classical record linkage attacks using quasi-identifying attributes, we use -anonymity as our privacy model [53] . 
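To make the direct versus quasi-identifier distinction from the cardinality discussion above concrete, the following minimal sketch encodes the rule in Python. The names and the simplification to a pure cardinality-based rule are assumptions for illustration, not part of the rx-anon implementation.

```python
# Sketch: classify an attribute as direct or quasi-identifier from its cardinality.
from enum import Enum

class Cardinality(Enum):
    ONE_TO_ONE = "one-to-one"
    ONE_TO_MANY = "one-to-many"
    MANY_TO_ONE = "many-to-one"
    MANY_TO_MANY = "many-to-many"

def identifier_class(cardinality: Cardinality) -> str:
    # One-to-one and one-to-many attributes point directly to an individual.
    if cardinality in (Cardinality.ONE_TO_ONE, Cardinality.ONE_TO_MANY):
        return "direct identifier (remove before release)"
    # Many-to-one and many-to-many attributes only narrow an individual down.
    return "quasi-identifier (keep in anonymized form)"

print(identifier_class(Cardinality.ONE_TO_MANY))   # e.g., credit card number
print(identifier_class(Cardinality.MANY_TO_ONE))   # e.g., ZIP code
```

As the home address example above shows, the cardinality itself can depend on the dataset, so such a rule would be a default that may be overridden per use case.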
In general, identification threats based on information within textual documents can be categorized into two categories, where the former poses explicit and the latter poses implicit information leakage [48]. Within texts, we adapt k-anonymity to prevent explicit information leakage, while keeping the structure of the texts as intact as possible to allow for text mining on implicit information. In other words, using our privacy model, an attacker shall not be able to identify an individual based on attributes, their values, or sensitive terms in texts. However, obfuscating personal writing style as discussed in [13, 37] exceeds this work and is therefore not considered. Table 3 provides an overview of the notations used.

Table 2: Non-exhaustive list of attributes to anonymize with scale and cardinality of relations (sorted by cardinality). Note, for the anonymization task the interval and ratio data can be grouped together as numerical.
Attribute | Scale | Cardinality of relation
Name [7, 36, 42, 56] | nominal | one-to-one
Social Security Number [36, 56] | nominal | one-to-one
Online identifier [7] | nominal | one-to-one
Passport Numbers [7, 36] | nominal | one-to-one
Home Address [7, 36, 56] | nominal | one-to-one
Credit Card Number [36] | nominal | one-to-many
Phone [36, 56] | nominal | one-to-many
Email Address [7, 36, 56] | nominal | one-to-many
License Plate Number [56] | nominal | one-to-many
IP Address [7, 36, 56] | nominal | one-to-many
Order Reference | nominal | one-to-many
Age [42, 56] | ratio | many-to-one
Sex / Gender [7, 42] | nominal | many-to-one
ZIP / Postcode [56] | nominal | many-to-one
Date of Birth [36, 56] | interval | many-to-one
Zodiac Sign [49] | ordinal | many-to-one
Weight [7, 36] | ratio | many-to-one
Race [7, 36] | nominal | many-to-one
Country | nominal | many-to-one
City [56] | nominal | many-to-one
Salary Figures [14] | ratio | many-to-one
Religion [7, 36] | nominal | many-to-one
Ethnicity [7] | nominal | many-to-one
Employment Information [36] | nominal | many-to-one
Place of Birth [36] | nominal | many-to-one
Skill | nominal | many-to-many
Activities [36] | nominal | many-to-many
Diagnosis / Diseases [14, 36] | nominal | many-to-many
Origin / Nationality [42] | nominal | many-to-many
Purchased Products [42] | nominal | many-to-many
Work Shift Schedules | nominal | many-to-many

Heterogeneous rx-dataset. Given a dataset D in the form of relations R_1, ..., R_n containing both relational and textual attributes, D contains all data we want to anonymize. We pre-process D for the anonymization process by using the natural join, i.e., D = R_1 ⋈ ... ⋈ R_n, where in our running example the first relation describes the individuals (id, gender, age, topic, sign), while the latter relation (id, date, text) contains the posts and links them to an individual with id being the foreign key. We call D an rx-dataset if one attribute A* directly identifies an individual, one or more traditional relational attributes A_1, ..., A_n contain single-valued data, and one textual attribute X is in D. In other words, an rx-dataset is any dataset which contains at least one directly identifying attribute, one or more quasi-identifying attributes, and one or more textual attributes. For the remainder of this work, we will use the term relational attributes for attributes we consider traditional relational, and textual attributes for attributes with textual values composed of multiple words or even sentences. In the example in Table 1, the relational attributes are the direct identifier id as well as the quasi-identifiers gender, age, topic, and date. The textual attribute is text. We call a row in D a tuple r. A minimal sketch of this join on the running example follows below.
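The following sketch illustrates the natural join with pandas. The concrete values are made up in the spirit of the running example, not the actual contents of Table 1, and pandas is only an assumption about how such a join could be implemented.

```python
# Sketch: build the de-normalized rx-dataset D = R1 ⋈ R2 via a join on the key id.
import pandas as pd

individuals = pd.DataFrame([                      # R1: (id, gender, age, topic, sign)
    {"id": 1, "gender": "male", "age": 36, "topic": "Engineering", "sign": "Leo"},
    {"id": 2, "gender": "female", "age": 24, "topic": "Arts", "sign": "Aries"},
])
posts = pd.DataFrame([                            # R2: (id, date, text)
    {"id": 1, "date": "2004-05-14",
     "text": "My name is Pedro, I'm a 36 years old engineer from Mexico"},
    {"id": 1, "date": "2004-05-20", "text": "Another post written a few days later."},
    {"id": 2, "date": "2004-06-01", "text": "First post about my art classes."},
])

# The natural join on the shared key id yields the de-normalized rx-dataset D,
# with one row per blog post and the individual's attributes repeated.
D = pd.merge(individuals, posts, on="id")
print(D)
```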
Relational attributes are single-valued and can be categorized into being nominal, ordinal, or numerical (i.e., ratio or interval, which are treated equally in the anonymization process). A textual attribute is any attribute whose domain is some form of free text. Therefore, we can state that r.X consists of an arbitrary sequence of tokens x = <t_1, ..., t_l>.

Sensitive Entity Types. We define E to be a set of entity types, where each value e ∈ E represents a distinct entity type (e.g., person or location) and each entity type is critical for the anonymization task. We then define a recognition function on texts as ψ : X → E. The recognition function detects sensitive terms in the text and assigns a sensitive entity type e ∈ E to each sensitive token t ∈ x. Moreover, we define a mapping function on the set of structural attributes as Φ : {A_1, ..., A_n} → E. The mapping function maps the attributes A_1, ..., A_n to a sensitive entity type in E, which is used to match redundant sensitive information with the text.

Redundant Sensitive and Non-redundant Sensitive Terms. Some sensitive information might appear in a textual as well as in a relational attribute. In order to consistently deal with those occurrences, we introduce the concept of redundant sensitive information. Redundant sensitive information is any sensitive term t ∈ r.X with ψ(t) = e for which a relational value v ∈ r.A_i with Φ(A_i) = e exists, where v = t. In other words, redundant sensitive information is duplicated information, i.e., information that has the same value and appears under the same sensitive entity type both in a relational attribute r.A_i and as a sensitive term in r.X. We introduce the attribute X′, which contains all non-redundant sensitive information of X. For the remainder of this work, attribute names with apostrophes indicate that these attributes contain the extracted sensitive entities with their types (see text′ in Table 4). We model X′ as a set-valued attribute since in the texts of r.X, zero or more sensitive terms can appear. Therefore, we explicitly allow empty sets to appear in r.X′ if no sensitive information appears in r.X. We then replace X in D with X′, so that the schema of D becomes {A*, A_1, ..., A_n, X′}.

Person-Centric View D* on the Dataset D. If a dataset D is composed of multiple relations, there might be multiple tuples which correspond to a single individual. In order to apply anonymization approaches on this dataset, we need to group the data in a person-centric view similar to Gong et al. [18], where one record (i.e., one row) corresponds to one individual. Therefore, we define D* to be a grouped and aggregated version of D. This means that we can retrieve D* from D as D* = γ_{A*; F_1(A_1), ..., F_n(A_n), F_X′(X′)}(D), where A* denotes a directly identifying attribute related to an individual used to group rows of individuals together, and the grouping operator γ concurrently applies a set of aggregation functions F_i and F_X′ defined on the relational attributes as well as on the sensitive textual terms X′. This aggregation operation should create a person-centric view of D by using appropriate aggregation functions F = {F_1, ..., F_n, F_X′} on the attributes. For relational attributes A_i, we use set as a suitable aggregation function, where two or more distinct values in A_i for one individual result in a set containing all distinct values. For set-based attributes like X′, we use the aggregation function union, which performs an element-wise union of all sets in X′ related to one individual. Table 4 presents a person-centric view of our initial example where each record represents one individual. A minimal sketch of the recognition function, the mapping function, and the redundancy check defined above is given below.
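The sketch below shows one possible realization of ψ, Φ, and the redundancy check with spaCy. The model name, the attribute-to-entity-type mapping, and the plain string equality used for matching are assumptions for illustration; they are not the exact functions of rx-anon.

```python
# Sketch: recognition function, mapping function, and redundant-term check.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any installed spaCy NER model

# Mapping function Phi: relational attribute -> sensitive entity type (illustrative values).
ATTRIBUTE_TO_ENTITY_TYPE = {"age": "DATE", "sign": "ZODIAC", "topic": "JOB"}

def recognize(text):
    """Recognition function psi: return (term, entity type) pairs for sensitive tokens."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def split_redundant(record, text_attr="text"):
    """Split the sensitive terms of one tuple r into redundant and non-redundant ones.
    A term is redundant if a relational attribute mapped to the same entity type
    holds exactly the same value, mirroring the definition above."""
    redundant, non_redundant = [], []
    for term, etype in recognize(record[text_attr]):
        same_typed_attrs = [a for a, t in ATTRIBUTE_TO_ENTITY_TYPE.items() if t == etype]
        if any(str(record[a]) == term for a in same_typed_attrs):
            redundant.append((term, etype))
        else:
            non_redundant.append((term, etype))
    return redundant, non_redundant
```

The non-redundant pairs returned by such a function would populate the set-valued attribute X′, while redundant terms are later recoded consistently with their relational counterparts.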
Dates as well as any non-redundant sensitive terms have been aggregated, as discussed.

k-anonymity in D*. Based on the notion of equivalence classes [42] and the definition of equality of set-based attributes [21], an equivalence class for D* can be defined as a partition P of records where for any two records r, s ∈ P it holds that (r.A_1, ..., r.A_n) = (s.A_1, ..., s.A_n) and r.X′ = s.X′. Thus, within an equivalence class each record has the same values for the relational attributes, and their sets of sensitive terms have the same values, too. Given our definition of equivalence classes, a person-centric dataset D* is said to be k-anonymous if all equivalence classes of D* have at least the size k. We refer to the k-anonymous version of D* as D′. D′ protects privacy by hiding direct identifiers. Moreover, since each of the quasi-identifying attributes and sensitive terms in texts appears at least k times, D′ also protects against record linkage attacks.

Using the definitions from Section 3, we present our anonymization approach rx-anon. We present how we pre-process our data to generate a person-centric view. We show how Mondrian [31], a recursive greedy anonymization algorithm, can be used to anonymize rx-datasets. Mondrian transforms a dataset into a k-anonymous version by partitioning the dataset into partitions of size at least k and afterwards recoding each partition individually. We introduce an alternative partitioning strategy called Global Document Frequency (GDF) as a baseline for partitioning a dataset with sensitive terms. We use the running example (Table 1) to show how an rx-dataset is transformed to a privacy-preserved version.

Prior to anonymizing an rx-dataset, it needs to be transformed into a person-specific view in order to apply k-anonymity. Using the running example from Table 1, we demonstrate the steps involved to create the person-centric view shown in Table 4. First, we identify sensitive terms in the texts and assign sensitive entity types to them. In the remainder of this work, we will use subscripts to indicate the entity type assigned to a sensitive term. Given the first row of the example in Table 1, the text is "My name is Pedro, I'm a 36 years old engineer from Mexico". The sensitive terms are Pedro_person, 36 years old_age, engineer_job, and Mexico_location. This analysis of texts is executed for all tuples in D, while there can be multiple sensitive terms from the same entity type within a text, or even no sensitive terms at all. In the next step, we find and mark redundant sensitive information using the results of the prior steps. Therefore, we perform row-wise analyses of relational values with sensitive terms to find links which actually represent the same information. In our example above, the sensitive term 36 years old_age depicts the same information as the value 36 in the relational attribute age. Therefore, this sensitive term in the textual attribute is marked as redundant and is not considered as new sensitive information during the anonymization algorithm. Non-redundant sensitive information is stored in the attribute text′. Finally, we build a person-centric view to have a condensed representation of all information available for each individual. Therefore, as described in Section 3, we group the data on a directly identifying attribute to get an aggregated dataset. In the example in Table 1, the directly identifying attribute A* is id. We use set as the aggregation function for the relational attributes; a minimal sketch of this aggregation step follows below.
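The following sketch shows how the person-centric view could be built with pandas. The column names (including text_prime for the set-valued attribute text′) follow the running example and are assumptions, as is the use of pandas itself.

```python
# Sketch: person-centric aggregation, grouping on the direct identifier id.
import pandas as pd

def person_centric_view(D: pd.DataFrame) -> pd.DataFrame:
    """Group on the direct identifier and aggregate values per individual:
    relational attributes become sets of distinct values, and the per-post sets
    of non-redundant sensitive terms (text') are united."""
    aggregations = {
        "gender": lambda v: set(v),
        "age": lambda v: set(v),
        "topic": lambda v: set(v),
        "sign": lambda v: set(v),
        "date": lambda v: set(v),
        "text_prime": lambda sets: set().union(*sets),  # element-wise union of term sets
    }
    return D.groupby("id").agg(aggregations).reset_index()
```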
Moreover, we collect all sensitive terms mentioned in the texts of one individual by performing a union of the sets of sensitive terms. Table 4 shows the person-centric view D* of our dataset D, which has been achieved by aggregating on the attribute id. Since the individuals with the ids 1 and 4 have blogged more than once on different dates, multiple dates have been aggregated as sets. Moreover, since those people also have blogged different texts on different days, all sensitive terms across all blog posts have been collected in the attribute text′.

Given a person-centric dataset D*, we want to build a k-anonymous version D′ by using the definitions of the previous section. In order to achieve anonymization, we adapt the two-step anonymization algorithm Mondrian by LeFevre et al. [31], which first decides on partitions P_1, ..., P_p (refer to Algorithm 1) and afterwards recodes the values of each partition to achieve k-anonymity. We use Global Document Frequency (GDF) partitioning as a baseline partitioning algorithm (see Algorithm 2), which uses sensitive terms and their frequencies to create a greedy partitioning based on the presence and absence of sensitive terms.

Modified Mondrian Partitioning with Weight Parameter λ. The first step of the algorithm is to find partitions of records with a partition size of at least k. LeFevre et al. [31] introduced multidimensional strict top-down partitioning where non-overlapping partitions are found based on all relational attributes. Moreover, they introduced the greedy strict top-down partitioning algorithm Mondrian. Starting with the complete dataset D* as an input, the partitioning algorithm chooses an attribute to split on and then splits the partition by median-partitioning. The authors suggest using the attribute which provides the widest normalized range given a sub-partition. For numerical attributes, the normalized range is defined from minimum to maximum. For categorical attributes, the normalized range is based on the number of distinct values in the sub-partition. In order to properly treat textual terms in this heuristic algorithm, we introduce a weight parameter λ to the modified Mondrian algorithm shown in Algorithm 1. λ can be a value between 0 and 1. It describes the priority to split partitions on relational attributes. λ = 1 means that the algorithm always favors splitting on relational attributes. λ = 0 leads to splits only based on sensitive terms in textual attributes. λ = 0.5 does not influence the splitting decisions and is therefore considered the default. The partitioning algorithm stops if no allowable cut can be made such that the criterion of k-anonymity holds for both sub-partitions. Therefore, we can stop splitting a partition P if |P| < 2k. A sketch of this weighted choice of the split attribute is given below.

Algorithm 1: Modified Mondrian partitioning with weight parameter λ (adapted from LeFevre et al. [31]). It applies a greedy strict top-down partitioning for relational attributes.

Global Document Frequency (GDF) Partitioning. Using the idea of a top-down strict partitioning algorithm, we propose with GDF a greedy partitioning algorithm using the presence and absence of sensitive terms. The main goal is to keep the same sensitive terms within the same partition. This is achieved by creating partitions with records which have sensitive terms in common. Algorithm 2 presents the GDF partitioning algorithm, which is based on sensitive terms and their frequencies. Similar to Algorithm 1, we start with the whole dataset as a single partition. Instead of splitting the partition using the median of a relational attribute (Mondrian partitioning), we split partitions on a chosen sensitive term.
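As forward-referenced above, the sketch below shows one way the λ-weighted choice of the split attribute could be realized. It is a simplified interpretation under stated assumptions, not the exact Algorithm 1: the span heuristics, names, and tie-breaking are illustrative.

```python
# Sketch: lambda-weighted choice between a relational attribute and a sensitive term.
def choose_split(partition, relational_attrs, global_ranges, term_counts, lam=0.5):
    """Pick a relational attribute or a sensitive term to split the partition on.

    partition        -- list of records (dicts) in the current partition
    relational_attrs -- names of numerical quasi-identifying attributes (categorical
                        attributes would use distinct-value counts analogously)
    global_ranges    -- {attribute: (overall min, overall max)} over the full dataset
    term_counts      -- {sensitive term: number of records in the partition containing it}
    lam              -- weight in [0, 1]; 1.0 always favors relational splits, 0.0 textual ones
    """
    def normalized_range(attr):
        values = [r[attr] for r in partition]
        lo, hi = global_ranges[attr]
        return (max(values) - min(values)) / (hi - lo) if hi > lo else 0.0

    best_attr = max(relational_attrs, key=normalized_range, default=None)
    best_term = max(term_counts, key=term_counts.get, default=None)

    rel_score = lam * (normalized_range(best_attr) if best_attr else 0.0)
    txt_score = (1 - lam) * (term_counts[best_term] / len(partition) if best_term else 0.0)
    return ("relational", best_attr) if rel_score >= txt_score else ("textual", best_term)
```

With lam = 1.0 the textual score is always zero, so a relational attribute is chosen whenever one exists; with lam = 0.0 the reverse holds, matching the behavior described for λ above. Returning to the GDF baseline, the split on the chosen sensitive term works as follows.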
While the first sub-partition contains only records, where the chosen sensitive term appears, the second sub-partition contains the remaining records. For choosing the next term to split on, multiple heuristics are possible. We propose to use the most frequently apparent sensitive term for the remaining texts in the partition as the term to split on. Taking the most frequent term allows us to keep the most frequently appearing term in a majority of texts while suppressing less frequently used terms. The term used to split is then removed and similar to Algorithm 1 the algorithm is recursively called using the first and second partition, respectively. GDF partitioning guarantees that records are partitioned such that sensitive terms in texts are tried to be kept by grouping records with same terms. Moreover, records with no or less frequently used sensitive terms are also included in one partition. Therefore, we build partitions with records which would prevent other partitions from being -anonymous. Example. Using the running example in Table 4 with = 2 and the GDF partitioning scheme, partitioning is achieved as follows. Starting with the initial complete dataset * (person-centric view) as the initial partition , we determine the most frequent term, which is either UK or engineer, both appearing twice. Without loss of generality, we assume engineer is chosen as the term to split on. Then we split = {1, 2, 3, 4, 5, 6} in = {1, 2} containing all records where engineer appears and = {3, 4, 5, 6} containing the remaining records. For , no allowable cut can be made since | | = 2. However, the algorithm continues with since | | = 4, and splits on UK as the most frequent term appearing twice in records within . This will lead to two new partitions Using = {4, 6} containing records where UK appears in the texts and = {3, 5} containing the remaining records. Finally, the algorithm results in an optimal partitioning with three partitions, each consisting of two records. In our case, we refer to optimal as a partition layout with the least amount of information loss within the textual attribute. Recoding. In the next step, each partition is transformed such that values of quasi-identifiers of records are indistinguishable. This process is called recoding. Recoding can either be global [4] or local [31, 59] . Local recoding generalizes values per equivalence class, but equal values from two equivalence classes might be recoded differently. In contrast, global recoding enforces that the same values are recoded equally throughout the entire dataset. Since global recoding requires a global replacement of values with appropriate recoded values, the search space for appropriate replacements may be limited [30] . Therefore, even though global recoding might result in more consistent releases of data, local recoding appears to be more powerful due to its variability in finding good replacements. There are different recoding schemes for the different scales of the attribute. Nominal and ordinal values are usually recoded using Domain Generalization Hierarchies (DGHs) as introduced by Sweeney [52] and used in multiple other works [17, 41, 42, 59] . A DGH describes a hierarchy which is used to generalize distinct values to a more general form such that within a partition all values transform to a single value in the DGH. Generating DGHs is usually considered a manual effort, while there already exist approaches on automatically generating concept hierarchies as introduced by Lee et al. 
[29] , which have also been used in work on anonymization [21] . Alternatively, nominal and ordinal attributes can also be recoded as sets containing all distinct items of one partition. For numerical attributes, LeFevre et al. [31] propose to use either mean or range as a summary statistic. Additionally, numerical attributes can also be recoded using ranges from minimum to maximum. Moreover, for dates El Emam et al. [12] propose an automated hierarchical recoding based on suppressing some information of a date value shown in Figure 1 . The leaf nodes represent actual dates appearing in the dataset (ref. to Table 1 ). Non-leaf nodes represent automatically generated values by suppressing information on each level. Since we use a strict-multidimensional partitioning scheme, we apply local recoding as suggested in Mondrian [31] . For numerical attributes, we use range as a summary statistic. For date attributes we use the automatically generated DGH by El Emam et al. [12] as shown in Figure 1 . Moreover, since generalization hierarchies for gender, topic, and sign are flat, we recode nominal and ordinal values as sets of distinct values. Example. After equivalence classes have been determined, relational attributes can be recoded. Table 5 shows how those recoding schemes are applied to the relational attributes of our running example. In addition, a -anonymous representation of the text attribute ′ has to be created. Terms, which are marked as redundant sensitive information, are replaced by the recoded value of its relational representatives. Using the anonymized version of our example in Table 5 , the age appearing in the text of the first row is recoded using the value of the attribute age of the same row. Moreover, non-redundant sensitive information is recoded using suppression with its entity type. If a sensitive information appears within all records of an equivalence class, retaining this information complies with our definition of -anonymity for set-valued attributes from Section 3. Therefore, it does not need to be suppressed (see sensitive term engineer in Table 5 ). However, if the same sensitive information is not appearing in every record within an equivalence class, this sensitive information (or the lack of it) violates our definition of -anonymity and must be suppressed. An example for such a violation in Table 5 is Mexico, which appears in the first record, but in no other record of its equivalence class. The result is the -anonymized dataset * . We evaluate our framework on two real-world datasets using the modified Mondrian partitioning algorithm with weighting parameter as well as the GDF partitioning baseline. We use to manipulate the splitting decisions in Mondrian as discussed in Section 4 and measure the resulting partitions as well as information loss. We require datasets that include a directly identifying attribute * , one or more quasi-identifying relational attributes , and one or more textual attributes containing sensitive information about individuals (refer to the definition of an -dataset in Section 3). We use the publicly available Blog Authorship Corpus and 515K Hotel Reviews Data in Europe datasets. Blog Authorship Corpus. The Blog Authorship Corpus 4 was originally used to create profiles from authors [49] but has also been used in privacy research for author re-identification [28] . 
After cleaning the input data from unreadable characters and others, the corpus contains 681, 260 blog posts from 19, 319 bloggers, which have been written by a single individual on or before 2006 and published on blogger.com. While the vast majority of blog posts are written in English language, the corpus contains some posts written in other languages. However, non-English blog posts are the minority and therefore do not have a significant impact on the experiment results. A row in the corpus consists of the id, gender, age, topic, and zodiac sign of a blogger as well as the date and the text of the published blog entry. Each row corresponds to one blog post written by one individual, but one individual might have written multiple blog posts. On average, one blogger has published 35 blog posts. We treat id as a direct identifier, gender, topic, and sign as categorical attributes, while age is treated as a numerical attribute. The attribute date is treated as a special case of categorical attribute where we recode dates using the automatically generated DGH shown in Figure 1 . The attribute text is used as the textual attribute. The attribute topic contains 40 different topics, including industry-unknown (indUnk). Age ranges from 13 to 48. Gender can be male or female. Sign can be one of the twelve astrological signs. In addition to the Blog Authorship Corpus, we run experiments on a second dataset to verify our observations. We chose to use a dataset containing reviews of European hotels. We refer to this dataset as the Hotel Reviews Dataset. Hotel Reviews Dataset. We use the 515K Hotel Reviews Data in Europe dataset 5 , called in the following briefly the Hotel Reviews Dataset, which contains 17 attributes, of which 15 attributes are relational and two attributes are textual. The textual attributes are positive and negative reviews of users. Among the relational attributes, we treat hotel name and hotel address as direct identifiers. The textual attributes are pre-processed and cleaned as described for the Blog Authorship Corpus. Negative and positive word count as well as tags are ignored and therefore considered insensitive attributes. The remaining attributes are treated as quasi-identifiers, with seven numerical, one date, and two nominal attributes. We recode all quasi-identifying attributes similar to the Blog Authorship Corpus. After preparing the Hotel Reviews Dataset, we have 512, 126 reviews for 1, 475 hotels remaining. 5 https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe As baselines, we consider the scenario where relational and textual attributes are anonymized independently. Usually, sensitive terms within a textual attribute are suppressed completely, which leads to total loss of utility of sensitive terms. With our experiments we want to show that we can improve, i. e., reduce the information loss in texts under the -anonymity model. Moreover, we want to optimize the trade-off between relational and textual information loss. Similar to experiments conducted in prior work [17, 18, 42] , we run our anonymization tool for different values of = 2, 3, 4, 5, 10, 20, and 50. Regarding our new weighting parameter , we used values between 0.0 and 1.0 in steps of 0.1. Sensitive entity types in texts are those detected by spaCy's English models trained on the OntoNotes5 corpus 6 . We added rule-based detectors for the entities MAIL, URL, PHONE, and POSTCODE. We treat all sensitive terms appearing under those entity types as quasi-identifiers. 
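The rule-based detectors for MAIL, URL, PHONE, and POSTCODE mentioned above can, for example, be added next to spaCy's statistical NER via an EntityRuler. The model name and the patterns in the sketch below are illustrative assumptions, not the exact rules used in our experiments.

```python
# Sketch: rule-based entities added before spaCy's statistical NER component.
import spacy

nlp = spacy.load("en_core_web_trf")  # assumption: the English transformer model
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "MAIL", "pattern": [{"LIKE_EMAIL": True}]},
    {"label": "URL", "pattern": [{"LIKE_URL": True}]},
    # Single-token patterns; multi-token phone numbers would need longer span patterns.
    {"label": "PHONE", "pattern": [{"TEXT": {"REGEX": r"^\+?\d[\d/-]{6,}$"}}]},
    {"label": "POSTCODE", "pattern": [{"TEXT": {"REGEX": r"^\d{5}$"}}]},
])

doc = nlp("Reach me at jane.doe@example.org or via https://example.org/contact.")
print([(ent.text, ent.label_) for ent in doc.ents])
```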
For each value of k, we conduct experiments using different partitioning strategies and parameter settings. In particular, we vary the weight parameter λ to tune Mondrian. To speed up experiment execution times, we ignore redundant sensitive information. Ignoring redundant sensitive information does not influence the experiment results, since both datasets do not provide a relevant amount of overlap between relational attributes and textual attributes. We use local recoding schemes for each experiment to make partitioning results comparable. For the evaluation, we analyze the anonymized dataset with respect to the corresponding partition sizes and information loss. In addition, we repeat the experiments by just considering location entities with entity type GPE (geopolitical entity). We use those experiments to showcase an anonymization task with reduced complexity. We chose location-based entities since they are present in blog posts as well as in hotel reviews. Therefore, they allow for a comparison of both datasets.

In order to evaluate our anonymization approach and compare results of partitioning, we introduce the following measures. In particular, we compare statistics on partitions as well as relational and textual information loss.

Statistics on Partitions. We are interested in the resulting partitions of the anonymized dataset. Number of splits (based on relational versus textual attributes): We evaluate how partitions are created, based on relational attributes versus textual attributes, and how λ influences splitting decisions. We expect that for λ < 0.5 we observe more splits on textual attributes and for λ > 0.5 more splits on relational attributes. Number of partitions and partition sizes: In addition to the number of splits, we want to evaluate the size of the resulting partitions since they are closely related to information loss. By the nature of k-anonymity, all partitions need to be at least of size k. Relatively large partitions with respect to k will tend to produce more information loss. Therefore, partition sizes closer to k will be favorable and increase utility. We evaluate resulting partitions by counting the number of partitions, as well as calculating the mean and standard deviation of partition sizes.

Information Loss (Adapted to Heterogeneous Datasets). Measuring the information loss of an anonymized dataset is a well-known practice for evaluating the amount of utility remaining in a published dataset. We use the Normalized Certainty Penalty (NCP) [59] to determine how much information loss has been introduced by the anonymization process. In particular, the NCP assigns a penalty to each data item in a dataset according to the amount of uncertainty introduced. We extend the definitions of NCP to the problem of anonymizing relational and textual data such that for one record r, the information loss is calculated as NCP(r) = (w_R · NCP_R(r) + w_X · NCP_X(r)) / (w_R + w_X), where w_R is the importance assigned to the relational attributes and NCP_R(r) denotes the information loss for the relational attributes of record r. Analogously, we define w_X and NCP_X(r) for the textual attribute. For our evaluation, we set w_R and w_X to 1, i.e., we weigh the loss stemming from relational data and textual data equally. Note that this decision is independent of the λ parameter, which decides which attribute or term is actually used for the splitting of the partitions. For relational attributes A = {A_1, ..., A_n}, we define the information loss NCP_R(r) = (Σ_{A_i ∈ A} NCP_{A_i}(r)) / |A|, where |A| denotes the number of relational attributes. A minimal sketch of these loss measures follows below.
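The sketch below illustrates the record-level and dataset-level NCP just introduced; it also anticipates the per-attribute and per-term losses defined in the remainder of this section (1 if a sensitive term is suppressed, 0 otherwise). The function names are illustrative assumptions.

```python
# Sketch: adapted NCP combining relational and textual loss with weights w_r and w_x.
def ncp_record(relational_losses, term_suppressed, w_r=1.0, w_x=1.0):
    """relational_losses: per-attribute NCP values in [0, 1] for one record r.
    term_suppressed: one boolean per sensitive term in r.X', True if suppressed."""
    ncp_r = sum(relational_losses) / len(relational_losses) if relational_losses else 0.0
    ncp_x = sum(term_suppressed) / len(term_suppressed) if term_suppressed else 0.0
    return (w_r * ncp_r + w_x * ncp_x) / (w_r + w_x)

def ncp_dataset(records):
    """records: (relational_losses, term_suppressed) pairs, one per record of D*."""
    return sum(ncp_record(r, t) for r, t in records) / len(records)

# Example: one record with two relational attributes and three sensitive terms,
# one of which had to be suppressed.
print(ncp_record([0.2, 0.0], [True, False, False]))  # (0.1 + 1/3) / 2, about 0.217
```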
NCP_{A_i}(r) is the information loss for a single attribute and depends on the type of the attribute. It can be calculated either using NCP_num for numerical attributes or NCP_cat for categorical attributes. Following the definition of the NCP [59], NCP_num for numerical values is defined as the extent of the recoded range relative to the extent of the attribute's domain, i.e., NCP_num(r) = (z_r − y_r) / (max(A_i) − min(A_i)), where [y_r, z_r] is the recoded range of r and max(A_i) and min(A_i) are the largest and smallest values of the attribute in the dataset. NCP_cat(r) is defined as |u| / |A_i|, where |A_i| is the number of distinct values of the attribute. For categorical values other than dates, |u| is the number of distinct values appearing in the recoded set. For date attributes, |u| denotes the number of leaves of the subtree below the recoded value (see Figure 1). For textual attributes, we define NCP_X(r) = (Σ_{s ∈ r.X′} NCP_s(r)) / |r.X′|, where for each sensitive term s we calculate the individual information loss NCP_s(r) and normalize it by the number of sensitive terms |r.X′|. We define the individual information loss for one sensitive term as NCP_s(r) = 1 if s is suppressed, and 0 otherwise. Finally, we can calculate the total information loss for an entire rx-dataset D* as NCP(D*) = (Σ_{r ∈ D*} NCP(r)) / |D*|, where for each record r the information loss NCP(r) is calculated and the sum is divided by the number of records |D*|.

We present the results regarding partition statistics and information loss. For detailed experimental results with plots and tables for all parameter values, we refer to the supplementary material. In particular, details on the influence of k and λ on the splitting decision for both datasets can be found in the supplementary material in Appendix A.3. To modify splitting decisions and therefore the distribution of information loss between relational and textual attributes, we introduced the tuning parameter λ to Mondrian partitioning. Thus, first, we verify how λ impacts splitting decisions. We count for a particular λ how often partitions are effectively split on a relational attribute and compare this metric to the number of splits on sensitive terms of textual attributes. We also evaluate the count of the resulting partitions and the partitions' sizes.

Partition Splits. Figure 2a shows the distribution of splitting decisions for experiments run on the Blog Authorship Corpus for k = 5 and λ = 0.0 to 1.0. As designed, λ = 1 results in splits only on relational attributes, whereas λ = 0 results in splits only on sensitive terms. As our results show, an unbiased run of Mondrian with λ = 0.5 causes partitions to be split mostly on relational attributes. Since the span of relational attributes is lower compared to sensitive terms, relational attributes provide the widest normalized range and are therefore favored to split on. For λ > 0.5, the majority of the weight for splitting is given to the relational attributes. Thus, there is no relevant change, since relational attributes are chosen almost every time throughout the partitioning phase. However, for λ < 0.5, we observe that more and more splits are made based on textual attributes. For λ = 0.4, already more than half of the splits are based on textual terms. If only locations, i.e., entities of type GPE, are considered, λ is not in all cases able to control the share of splits between relational and textual attributes, since low values for λ do not result in more splits on textual attributes. Plots are omitted for brevity here and can be found in the supplementary materials. For the Hotel Reviews Dataset, shown in Figure 2b, the number of splits is generally lower (see also partition sizes below), since it contains fewer records. Also, splitting on textual attributes is less likely for hotel reviews compared to blog posts. In the case of experiments considering only location entities, the impact of λ is even smaller and splits are mostly performed on relational attributes.
Details are provided in the supplementary materials. Table 6 provides statistics on partitions using Mondrian partitioning with varying as well as GDF partitioning for the Blog Authorship Corpus. In the table, count refers to the number of partitions produced under the specific values of and , while size refers to the average number of records per partition. The Mondrian partitioning algorithm produces the same partitioning layout for between 0.6 and 0.9. This observation matches statistics on partition splits, since for these values of the Mondrian algorithm decides to use the same attributes to split on. Furthermore, GDF partitioning is not able to generate partition sizes close to , compared to Mondrian partitioning. Table 7 shows the results when only location entities are considered. Here, = 0 leads to bigger and fewer partitions compared to other settings for . Comparing GDF to Mondrian with = 0, we observe that for low numbers of , GDF partitioning achieves in general smaller, but more variable partitions with regard to size. However, for larger values of , Mondrian partitioning achieves better distribution of partitions and therefore better distribution of sensitive terms. The results for the Hotel Reviews Dataset are shown in Table 8 . Table 9 shows the results for the location type only. We make the same observations as for the Blog Authorship Dataset. However, due to the lower number of records in the Hotel Reviews Dataset, the total count of partitions is comparatively smaller. In addition to the statistics on partition splits, counts, and sizes, we are interested in how the partitioning performs with respect to the introduced information loss measure. Figure 3a provides an overview on relational information loss (y-axis) for different values of between 2 and 50 (x-axis) for the Blog Authorship Corpus. Figure 3b shows the textual information loss . Results for between 0.6 and 0.9 are not plotted, since they are almost identical to the run using = 0.5. Figures 4a and 4b provide the information loss for experiments run on the Hotel Reviews dataset. Relational Information Loss. The information loss increases with larger throughout all experiments. Higher information loss is caused by having larger partitions and therefore higher efforts in recoding. Furthermore, we can state that information loss in the relational attributes increases if the tuning parameter decreases (see locations are considered, GDF partitioning as well as Mondrian partitioning with = 0 result in relatively high relational information loss compared to other experiment runs (see figures in the supplementary material). In both cases, the high relational information loss is caused by having partitions split only based on one option, namely the recognized sensitive locations appearing in the textual attribute (cf. previous section). Comparing with Figure 4a , we can state that relational information loss appears to be higher for the Hotel Reviews Dataset compared to the Blog Authorship Corpus. However, we still observe the same behavior where higher values of result in relatively lower relational information loss. Textual Information Loss. Analyzing the information loss in the textual attribute, see Figure 3b , one observation is that for values of ≥ 10 the information loss in texts tends to become 1. This equals suppressing all sensitive terms in texts. Moreover, our modified Mondrian partitioning performs better compared to the naive partitioning strategy GDF. 
GDF partitioning results in partitions with unequal and larger sizes and therefore ends up with large partitions, which significantly increase information loss. Moreover, GDF partitioning decides on splitting partitions taking a single global maximum (most frequent term) ignoring the multi-dimensionality and diversity of sensitive terms in texts. We make the same observations on the Hotel Reviews Dataset plotted in Figure 4b . However, information loss for ≤ 5 tends to be slightly lower. If only locations are considered, textual information loss in hotel reviews can significantly be reduced (see figures in the supplementary material). Since the Hotel Reviews Dataset only contains reviews for hotels in Europe, there is a limited number of locations that are included. This leads to significant preservation of sensitive terms even for values of ≤ 10. To get a deeper understanding of textual attributes on the anonymization process, we analyzed textual information loss on entity type level. Figure 5a provides an overview of information loss per different entity type extracted from text in the Blog Authorship Corpus for is 2 to 50 and a fixed = 0.2. It shows that there is a high information loss for most attributes, even for small . However, sensitive terms of type LANGUAGE may be reduced for values of ≤ 5. Since the number of distinct entities of type LANGUAGE is much lower compared to other entity types in the Blog Authorship Corpus like EVENT and PERSON, the entities (i. e., number of sensitive terms) of type LANGUAGE can be better preserved. We obtain similar results for Mondrian partitioning with other values of ≤ 0.4. We make the same observations on the Hotel Reviews dataset for both textual attributes, the positive reviews and negative reviews (see Figures 5b and 5c ). In addition to LANGUAGE entities, sensitive locations (GPE) can also be preserved for both textual attributes. Due to heterogeneity of sensitive terms in texts, by default, they are less likely considered to split on. By introducing the tuning parameter in our framework, we were able to control the Mondrian algorithm to preserve more information in either relational or textual attributes. Our experiments show that the partitioning parameter may be tuned in order to favor information preservation in textual attributes over relational attributes. We observe that a value of between 0.4 and 0.5 results in balanced splits, i. e., about the same number of splits are based on relational attributes versus textual terms. Our anonymization approach allows us to reduce the information loss in texts under the -anonymity privacy model. In contrast, in the related works [8, 34, 50] sensitive terms have been completely suppressed. Furthermore, our experiments show that for ≤ 5, not all sensitive terms need to be suppressed. In case of entities of type LANGUAGE, our approach could preserve about 60% for = 2 in the Blog Authorship Corpus (see Figure 5a ) and up to 80% of terms for = 2 in the Hotel Reviews Dataset (see Figures 5b and 5c) . Generally, when applying -anonymity on sensitive terms, it works better for texts from a specific domain (e. g., hotels) than cross-domain datasets (e. g., blogs), as the latter have a higher diversity. While our approach presents a general framework to anonymize heterogeneous data, our choices on detecting and comparing sensitive terms may have an impact on the experiments' outcomes. We consider all sensitive terms in the texts to be quasi-identifiers. 
However, in certain situations, sensitive entity types should, similar to relational attributes, also be distinguished into directly identifying and quasi-identifying attributes. Such a distinction between direct and quasi-identifiers is necessary in cases where texts include many names or other identifiers appearing for multiple records. Moreover, our rx-anon approach may over- or under-anonymize, depending on the accuracy of the detected sensitive terms. Over-anonymization describes the case where sensitive terms are falsely suppressed. It is caused by low precision and reduces the utility of the anonymized data. This happens when terms which do not pose any risk of identity disclosure are anonymized and the text loses important structure due to the missing terms. If sensitive terms are labeled with false entity types, they might also be anonymized falsely, since our strict definition of k-anonymity requires entity types to be equal as well. Under-anonymization describes the case where sensitive terms are falsely kept. This case is generally considered more critical than falsely suppressing terms and is related to low recall. If entities which should have been anonymized are not detected at all, the information they provide will appear in the released dataset and might reveal information which should not have been disclosed. We address this threat to validity by using the state-of-the-art NLP library spaCy to extract named entities from text. We use spaCy's recent transformer-based language model [9] for English (Version 3.0.0a0) 7 , which has an F1-score of 89.41 on NER tasks.

However, there are cases that we are still missing. There can be different writings of the same sensitive information, which leads to over-anonymization. For example, the capital city of Germany may be referred to by its actual name "Berlin" or indirectly as "Germany's capital". Our current system cannot resolve such a linkage. We refer to such cases as false negative matches. There may also be identical terms which actually have different semantics, which leads to under-anonymization. For example, consider the phrases "I live in Berlin" and "I love Berlin", which appear in two different records and happen to be grouped into the same partition. Our approach would treat both appearances of "Berlin" the same way, even though in the first case "Berlin" refers to a place of residence, while in the second case it is an expression of preference. We refer to such a scenario as false positive matches. To mitigate such false positive and false negative cases, one can integrate more advanced text matching functions into our rx-anon framework, depending on the requirements of a specific use case. For false negative matches, one may introduce synonym tables, semantic rules, and metrics such as the Levenshtein distance to cope with spelling mistakes. To cope with false positive matches, one suggestion is to consider the surrounding context by comparing Part-of-Speech tags and dependencies of terms within and across sentences. For example, one could use contextualized word vectors [9, 39]. Note that in this work, we focus on showing that heterogeneous data can be anonymized using our rx-anon approach and demonstrate the influence of the splitting parameter λ on the creation of the partitions.
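To make this extension point concrete, the following sketch shows one possible relaxed term comparison to mitigate false negative matches caused by spelling variations. It is a minimal illustration using only the Python standard library; the function name terms_match and the threshold value are hypothetical and not part of rx-anon.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    # Lower-case and strip whitespace so trivial variations
    # ("Berlin " vs. "berlin") do not break the match.
    return term.strip().lower()


def terms_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Decide whether two sensitive terms should be treated as the same.

    Exact matches after normalization are accepted directly; otherwise a
    similarity ratio (in the spirit of an edit-distance check) decides,
    which tolerates spelling mistakes such as "Gemany" vs. "Germany".
    """
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:
        return True
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


# A misspelled location would still be linked to its relational counterpart,
# while indirect references still require semantic matching.
print(terms_match("Germany", "Gemany"))            # True
print(terms_match("Berlin", "Germany's capital"))  # False
```

Such a relaxed comparison only addresses spelling variation; resolving indirect references like "Germany's capital" would still require the semantic mechanisms mentioned above.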
Using different extensions to rx-anon, such as other word matching functions, is enabled by our framework approach; they can be integrated and evaluated as required by a specific use case or dataset. Finally, false negative and false positive matches can also occur on redundant sensitive information. While false negative matches result in inconsistencies in the released data, false positive matches obfuscate the semantic meaning of sensitive terms in texts.

Our work has multiple implications which can be beneficial for other work. We showed that anonymizing unstructured text data can be achieved by extracting sensitive terms and casting the task into a structured anonymization problem. One may generalize this concept also to semi-structured data such as JSON documents. The idea of linking relational fields to attributes of other data types could be extended in order to retrieve a consistent and privacy-preserved version of heterogeneous JSON documents. In addition, tuning the partitioning using a parameter like λ is not only relevant in the context of anonymizing heterogeneous data, but could also be adapted to the attribute level to favor certain attributes over others. An adjustable attribute-level bias within the partitioning phase of Mondrian would allow users to prioritize the preservation of information in specific attributes. Suppose that one department within an organization shares data with a second department, which should conduct an age-based market analysis of sold products, but should not get access to the raw data and therefore receives an anonymized version. The department providing the data could then adjust the anonymization using a bias to preserve more information in the relevant attributes (i. e., age) and less information in others.

We introduced rx-anon as a step towards a framework for anonymizing hybrid documents consisting of relational as well as textual attributes. We have formally defined the problem of jointly anonymizing heterogeneous datasets by transferring sensitive terms in texts to an anonymization task on structured data, introduced the concept of redundant sensitive information, and the tuning parameter λ to control and prioritize information loss in relational and textual attributes. We have demonstrated the usefulness of rx-anon on the example of two real-world datasets using the privacy model k-anonymity [53]. Although extensive success has been achieved in anonymizing different types of data, there is limited work in the field of anonymizing heterogeneous data. Therefore, we would like to emphasize its importance and encourage researchers to investigate combined anonymization approaches for heterogeneous data to obtain consistent and privacy-preserved releases of data.

Data Availability and Reproducibility: The source code of rx-anon will be made publicly available to encourage reproduction and extension of our work. As a framework approach, rx-anon can be extended in all aspects of the anonymization pipeline, namely the partitioning, string matching, recoding, privacy model, and supported entity types. In particular, we are interested to see how anonymizing heterogeneous data can be achieved using other anonymization techniques than k-anonymity and using contextualized text similarity functions [9, 39]. A detailed discussion of the extensibility of our framework is provided in Appendix A.5.

The following sections contain extended experiment results.
In particular, we provide the numbers of distinct entities, give information about the runtime performance of our framework, share statistics relevant for partitioning, and present details on information loss. Table 10 provides an overview of the number of distinct terms appearing in the textual attributes. In general, the texts of the Blog Authorship Corpus contain significantly more distinct entities. Table 11 provides insights into the execution times of the experiments. Each experiment was executed on a single CPU core and did not require analyzing the texts, since the processed NLP state is read from cached results. For the experiments run on the Blog Authorship Corpus, execution times were significantly higher compared to the Hotel Reviews Dataset. One observation is that if only relational attributes are considered (Mondrian, λ = 1), execution times drop to a fraction of those of experiments where sensitive terms are considered during the partitioning phase. Considering memory consumption, running a single experiment on the Blog Authorship Corpus required 25.2 GB for all entities and 13.4 GB when considering only GPE entities (locations). For the Hotel Reviews Dataset, 5.4 GB and 4.2 GB were required, respectively.

In our experiments, we evaluate statistics on partition splits to gain insights into how λ influences the splitting decisions of Mondrian partitioning. Moreover, we also share statistics on the resulting partitions.

A.3.1 Partition Splits. Figure 6 provides an overview of the distribution of splitting decisions between relational and textual attributes for the experiments run on the Blog Authorship Corpus for all values of λ. The left column includes experiments considering all entities, while the right column presents results for experiments considering only location (GPE) entities. A noteworthy observation is that for a fixed λ, the number of splits on textual attributes decreases if k increases. Since we only consider valid splits, sensitive terms have to appear at least 2k times within a partition to be split on. Therefore, in the case of k = 50, sensitive terms are required to appear 100 times, which is less likely due to the heterogeneity of the blog post texts. If only locations, i. e., entities of type GPE, are considered, λ is not in all cases able to control the share of splits between relational and textual attributes, since low values of λ do not result in more splits on textual attributes. This effect is caused by the lack of multi-dimensionality. Since only one category of sensitive entity types is considered, Mondrian has only one option (namely splitting on sensitive terms of type GPE) to split on textual attributes. If splits on GPE terms fail (e. g., if there are none), Mondrian will ultimately continue to split on a relational attribute. Similarly, Figure 7 highlights the impact of λ on partition splits for the experiments run on the Hotel Reviews Dataset for all values of λ.

We present the results regarding partition statistics. Table 12 provides insights into the number of partitions as well as the average size and standard deviation of partition sizes for the experiments on the Blog Authorship Corpus considering all entities. Similarly, Table 13 provides an overview of the same metrics for the Blog Authorship Corpus considering only GPE entities (locations). Tables 14 and 15 share insights into the partition statistics for the Hotel Reviews Dataset.
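To illustrate how λ can bias the splitting decisions analyzed above, the sketch below scores relational and textual split candidates and lets λ weight the choice. This is a simplified illustration of the idea, not the rx-anon implementation; the function name choose_split, the scoring, and the record layout are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Illustrative record layout: relational quasi-identifiers are assumed to be
# numeric and pre-normalized to [0, 1]; "terms" holds the sensitive terms
# extracted from the record's text.
Record = Dict[str, object]


def choose_split(partition: List[Record],
                 relational_attrs: List[str],
                 k: int,
                 lam: float) -> Optional[Tuple[str, object]]:
    """Pick the next attribute or sensitive term to split a partition on.

    Relational candidates are scored by their value span, textual candidates
    by the relative frequency of a term. The parameter lam (lambda) weights
    relational against textual candidates: lam = 1 considers only relational
    splits, lam = 0 only textual splits.
    """
    candidates: List[Tuple[float, str, object]] = []

    # Relational candidates: wider spans are more attractive split targets.
    for attr in relational_attrs:
        values = [float(r[attr]) for r in partition]
        span = max(values) - min(values)
        if span > 0:
            candidates.append((lam * span, "relational", attr))

    # Textual candidates: following the validity condition above, a term is
    # only considered if it appears at least 2k times within the partition.
    term_counts = Counter(t for r in partition for t in r["terms"])
    for term, count in term_counts.items():
        if count >= 2 * k:
            candidates.append(((1 - lam) * count / len(partition), "textual", term))

    if not candidates:
        return None  # no valid split; the partition is kept as it is
    _, kind, target = max(candidates, key=lambda c: c[0])
    return kind, target
```

With lam = 1, only relational attributes are ever chosen, which matches the shorter execution times reported above for Mondrian with λ = 1.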
In addition to evaluating the resulting partitions, we are also interested in the actual information loss introduced by anonymizing a given dataset. Figures 12 and 13 provide the corresponding results on information loss. In addition to high-level charts on information loss, Figure 14 provides a detailed analysis of the information loss per entity type for the attribute text in the Blog Authorship Corpus. Additionally, Figure 15 and Figure 16 visualize the textual information loss per entity type for the attributes negative review and positive review, respectively.

As a framework approach, rx-anon enables several paths for future work. These include all aspects of the anonymization pipeline, namely the partitioning, string matching, recoding, privacy model, and supported entity types. We provide examples below.

Partitioning. We showed how decisions on partitions significantly influence the information loss. While the naive partitioning strategy GDF can deal with sparse but diverse sets of sensitive terms, there might be partitioning strategies better suited to attributes with such properties. It would be interesting to see whether clustering algorithms applied to sensitive terms lead to improved partitioning. Such clustering algorithms require a lower bound on the partition sizes of at least k. Abu-Khzam et al. [1] present a general framework for clustering algorithms with a lower bound on the cluster size.

String matching. The matching of relational and textual attributes currently uses an exact string match. Another interesting research topic building on our work is to investigate more sophisticated methods to find non-trivial links within the dataset, i. e., links which cannot be detected using simple string matching. Mechanisms to reveal non-trivial links are discussed by Hassanzadeh et al. [20]. They studied approximate string matching as well as semantic mechanisms based on ontologies and created a declarative framework and specification language to resolve links in relational data. Those mechanisms would also be applicable to finding links between relational data and sensitive entities. It would also be interesting to use string matching based on models using word embeddings [39] or transformer-based similarity functions [9].

Recoding. Our current recoding strategy for sensitive terms in texts uses suppression to generate a k-anonymous version of the texts. However, suppression tends to introduce more information loss than generalization. Therefore, it would be interesting to introduce and evaluate an automatic generalization mechanism for sensitive terms. One way to automatically generate DGHs for sensitive terms is to use hypernym trees as discussed by Lee et al. [29] and used by Anandan et al. [3] to anonymize texts.

Privacy model. We used k-anonymity as the privacy model to prevent identity disclosure. Even though k-anonymity establishes guarantees on privacy, it does not guard against attacks where adversaries have access to background knowledge. Differential privacy, introduced by Dwork [10], resists such attacks by adding noise to the data. Our rx-anon framework can be extended by using such an alternative privacy model. An interesting question would be how differentially private methods defined on relational data can be combined with work on creating differentially private representations of texts [13, 60].

Entity types. For the recognition of sensitive entities, we chose to use spaCy and its entity types trained on the OntoNotes5 corpus.
We chose the OntoNotes5 entity type scheme since it provides finer distinctions and therefore more semantics for entities compared to WikiNer annotations. However, there are still cases where an even more fine-grained entity recognition would reduce false positive matches. One example is the term "Georgia", which can refer to the country in the Caucasus, the U.S. state, or a city in Indiana. Ling and Weld [33] present a fine-grained set of 112 entity types which would cover this example, and state that applications benefit from the accuracy of a fine-grained entity recognition system.

We now present related work on the anonymization of other types of data. Moreover, we present an overview of existing regulations on anonymization and their view on PII. Finally, we present a non-exhaustive overview of available anonymization tools and frameworks.

B.1 Audio, Images, and Video. Even though this work only focuses on structured data and free text, recent work on the anonymization of other forms of data is worth mentioning. For the de-identification of images showing faces, Gross et al. [19] highlighted that pixelation and blurring offer poor privacy and suggested a model-based approach to protect privacy while preserving data utility. In contrast, recent work by Hukkelås et al. [22] applied machine learning methods by implementing a simple Generative Adversarial Network (GAN) to generate new faces to preserve privacy while retaining the original data distribution. For audio data, recent work focused either on the anonymization of the speaker's identity or of the speech content. Justin et al. [25] suggested a framework which automatically transfers speech into a de-identified version using different acoustic models for recognition and synthesis. Moreover, Cohn et al. [6] investigated the task of de-identifying spoken text by first using Automatic Speech Recognition (ASR) to transcribe texts, then extracting entities using NER, and finally aligning text elements to the audio and suppressing audio segments which should be de-identified. Additionally, recent work by Agrawal and Narayanan [2] showed that de-identification of people can also be applied to whole bodies within videos, whereas Gafni et al. [15] focused on live de-identification of faces in video streams. Finally, McDonald et al. [37] developed a framework for obfuscating writing styles which can be used by authors to prevent stylometry attacks that could reveal their identities. For unstructured text, their approach anonymizes writing styles in text documents by analyzing stylographic properties, determining features to be changed, ranking those features with respect to their clusters, and suggesting those changes to the user.

B.2 What is Personally Identifiable Information? In order to understand which fields should be anonymized, a common understanding of what Personally Identifiable Information (PII) is needs to be established. Therefore, we provide a broad overview of regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), as well as definitions by the National Institute of Standards and Technology (NIST).

B.2.1 Health Insurance Portability and Accountability Act. First, we consider the Health Insurance Portability and Accountability Act (HIPAA), which provides regulations to ensure the privacy of medical data in the USA [56]. Even though the HIPAA Privacy Rule uses the terminology Protected Health Information (PHI), we can in general transfer its identifiers to the domain of PII.
The HIPAA states that any information from the past, present, or future which is linked to an individual is considered PHI. In addition to domain experts defining PHI, the Safe Harbor Method defined in the HIPAA provides an overview of attributes which should be anonymized by removing them [56]. Those attributes are in particular:
(1) Names
(2) Geographic entities smaller than states (street address, city, county, ZIP, etc.)
(3) Dates (except year)
(4) Phone numbers
(5) Vehicle identifiers and serial numbers
(6) Fax numbers
(7) Device identifiers and serial numbers
(8) Email addresses
(9) URLs
(10) Social security numbers
(11) IP addresses
(12) Medical record numbers
(13) Biometric identifiers, including finger and voice prints
(14) Health plan beneficiary numbers

B.2.2 General Data Protection Regulation. Instead of using the term PII, the GDPR [7] refers to the term personal data. The regulation states that "'Personal data' means any information relating to an identified or identifiable natural person ..." [7]. Even though the GDPR does not explicitly state a list of attributes considered personal data, it provides some guidance on which properties are considered personal data. In particular, the GDPR states that personal data is any data which can identify an individual directly or indirectly "by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person" [7].

In contrast to the GDPR, the National Institute of Standards and Technology (NIST) provides guidance on protecting PII [36]. The NIST distinguishes PII into two categories. The first category includes "... any information that can be used to distinguish or trace an individual's identity ..." [36]. In particular, they list the following attributes:
• Name
• Social Security Number
• Date and place of birth
• Mother's maiden name
• Biometric records
Moreover, the NIST also labels "... any other information that is linked or linkable to an individual ..." as PII [36]. Examples of linked or linkable attributes are:
• Medical information
• Educational information
• Financial information
• Employment information

Multiple publicly available tools and frameworks for the anonymization of data have been released. ARX 8 is a comprehensive open-source software providing a graphical interface for anonymizing structured datasets [43, 44]. ARX supports multiple privacy and risk models, methods for transforming data, and concepts for analyzing the output data. Among the privacy models, it supports syntactic privacy models like k-anonymity, l-diversity, and t-closeness, but also semantic privacy models like ε-differential privacy. Moreover, Amnesia 9 is a flexible data anonymization tool which allows ensuring privacy on structured data. Amnesia supports k-anonymity for relational data as well as k^m-anonymity for datasets containing set-valued data fields. Finally, Privacy Analytics 10 offers a commercial Eclipse plugin which can be used to anonymize structured data. Besides tools for the de-identification of structured data, there also exist frameworks and modules to achieve anonymization. python-datafly 11 is a Python implementation of the Datafly algorithm introduced by Sweeney [52] as one of the first algorithms to transform structured data to match k-anonymity.
Additionally, Crowds 12 is an open-source Python module developed to de-identify a dataframe using the Optimal Lattice Anonymization (OLA) algorithm proposed by El Emam et al. [12] to achieve k-anonymity. Finally, an example implementation of the Mondrian algorithm [31] is available for Python 13 which shows how k-anonymity, l-diversity, and t-closeness can be used as privacy models. There are also multiple tools and frameworks for the de-identification of free text. NLM-Scrubber 14 is a freely available tool for the de-identification of clinical texts according to the Safe Harbor Method introduced in the HIPAA Privacy Rule. Moreover, the MITRE Identification Scrubber Toolkit (MIST) 15 is a suite of tools for identifying and redacting PII in free-text medical records [26]. deid 16 is a tool which allows the anonymization of free texts within the medical domain. Finally, deidentify 17 is a Python library developed especially for the de-identification of medical records and the comparison of rule-, feature-, and deep-learning-based approaches for the de-identification of free texts [57].

REFERENCES
[1] Clustering with Lower-Bounded Sizes: A General Graph-Theoretic Framework
[2] Person De-Identification in Videos
[3] t-Plausibility: Generalizing words to desensitize text
[4] Data Privacy through Optimal k-Anonymization
[5] Efficient techniques for document sanitization
[6] Audio Deidentification - a New Entity Recognition Task
[7] EU General Data Protection Regulation (GDPR)
[8] Deidentification of patient notes with recurrent neural networks
[9] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[10] Differential Privacy
[11] De-Identification of Emails: Pseudonymizing Privacy-Sensitive Data in a German Email Corpus
[12] A Globally Optimal k-Anonymity Method for the De-Identification of Health Data
[13] Generalised Differential Privacy for Text Document Processing
[14] Privacy-preserving data publishing: A survey of recent developments
[15] Live Face De-Identification in Video
[16] HIDE: An Integrated System for Health Information DE-identification
[17] Fast data anonymization with low information loss
[18] Anonymizing 1:M microdata with high utility
[19] Model-Based Face De-Identification
[20] A declarative framework for semantic link discovery over relational data
[21] Anonymization of Set-Valued Data via Top-Down, Local Generalization
[22] DeepPrivacy: A Generative Adversarial Network for Face Anonymization
[23] De-identification Guidelines for Structured Data
[24] Deidentification of free-text medical records using pre-trained bidirectional transformers
[25] Speaker de-identification using diphone recognition and speech synthesis
[26] De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports
[27] Morteza Ziyadi, and Mohamed AbdelHady. 2020. MT-BioNER: Multi-task Learning for Biomedical Named Entity Recognition using Deep Bidirectional Transformers
[28] Authorship attribution with thousands of candidate authors
[29] Automatic generation of concept hierarchies using WordNet
[30] Incognito: Efficient full-domain k-anonymity
[31] Mondrian Multidimensional K-Anonymity
[32] t-Closeness: Privacy Beyond k-Anonymity and l-Diversity
[33] Fine-grained entity recognition
[34] Deidentification of clinical notes via recurrent neural network and conditional random field
[35] L-diversity: Privacy beyond k-anonymity
[36] Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)
[37] Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization
[38] On the complexity of optimal k-anonymity
[39] Distributed Representations of Words and Phrases and their Compositionality
[40] Automated de-identification of free-text medical records
[41] MultiRelational k-Anonymity
[42] Anonymizing Data with Relational and Transaction Attributes
[43] Flexible data anonymization using ARX - Current status and challenges ahead
[44] ARX - A Comprehensive Tool for Anonymizing Biomedical Data
[45] Medical document anonymization with a semantic lexicon
[46] Protecting respondents' identities in microdata release
[47] Automatic general-purpose sanitization of textual documents
[48] Sanitization and Anonymization of Document Repositories
[49] Effects of age and gender on blogging
[50] Replacing personally-identifying information in medical records, the Scrub system
[51] Simple Demographics Often Identify People Uniquely
[52] Achieving k-Anonymity Privacy Protection Using Generalization and Suppression
[53] k-Anonymity: A Model for Protecting Privacy
[54] Predictive business process monitoring with structured and unstructured data
[55] Privacy-preserving anonymization of set-valued data
[56] Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule
[57] Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
[58] Attention Is All You Need
[59] Utility-based anonymization using local recoding
[60] Oluwaseyi Feyisetan, and Nathanael Teissier. 2020. A Differentially Private Text Perturbation Method Using Regularized Mahalanobis Metric
[61] TENER: Adapting Transformer Encoder for Named Entity Recognition
[62] Link Analysis to Discover Insights from Structured and Unstructured Data on COVID-19