title: Framework for a DLT Based COVID-19 Passport
authors: Chaudhari, Sarang; Clear, Michael; Tewari, Hitesh
date: 2020-08-03
journal: Computing Conference, 2021
DOI: 10.1007/978-3-030-80129-8_9

Uniquely identifying individuals across the various networks they interact with on a daily basis remains a challenge for the digital world that we live in, and therefore the development of secure and efficient privacy-preserving identity mechanisms has become an important field of research. In addition, the popularity of decentralised decision making networks such as Bitcoin has generated huge interest in making use of distributed ledger technology to store and securely disseminate end user identity credentials. In this paper we describe a mechanism that allows one to store the COVID-19 vaccination details of individuals on a publicly readable, decentralised, immutable blockchain, and makes use of a two-factor authentication system that employs biometric cryptographic hashing techniques to generate a unique identifier for each user. Our main contribution is the employment of a provably secure input-hiding, locality-sensitive hashing algorithm over an iris extraction technique, which can be used to authenticate users and anonymously locate vaccination records on the blockchain, without leaking any personally identifiable information to the blockchain.

Immunization is one of modern medicine's greatest success stories. It is one of the most cost-effective public health interventions to date, averting an estimated 2 to 3 million deaths every year. An additional 1.5 million deaths could be prevented if global vaccination coverage improves [13]. The current COVID-19 pandemic, which has resulted in millions of infections worldwide [3], has brought into sharp focus the urgent need for a "passport" like instrument, which can be used to easily identify a user's vaccination record, travel history etc. as they traverse the globe. However, such instruments have the potential to discriminate or create bias against citizens [10] if they are not designed with the aim of protecting the user's identity and/or any personal information stored about them on the system.

Given the large number of potential users of such a system and the involvement of many organizations in different jurisdictions, we need to design a system that is easy for end users to sign up to and that can be rolled out at a rapid rate. The use of hardware devices such as smart cards or mobile phones for storing such data is going to be financially prohibitive for many users, especially those in developing countries. Past experience has shown that such "hardware tokens" are sometimes prone to design flaws that only come to light once a large number of them are in circulation. Such flaws usually require remedial action in terms of software or hardware updates, which can prove to be very disruptive.

An alternative to the above dilemma is an online passport mechanism. An obvious choice for the implementation of such a system is a blockchain, which provides a "decentralized immutable ledger" that can be configured so that it can only be written to by authorised entities (i.e. there is no requirement for a hard computation such as proof-of-work (PoW) to be carried out for monetary reward), but can be queried by anyone.
However, one of the main concerns for such a system is: how does one preserve the privacy of users' data on a public blockchain while providing a robust mechanism to link users to their data records securely? In other words, one of the key requirements is to avoid having any personally identifiable information (PII) belonging to users stored on the blockchain.

In the subsequent sections we describe the key components of our system and the motivation that led us to use them. The three major components are: extraction of iris templates, a hashing mechanism to store them securely, and a blockchain. Finally, we present a formal description of our framework, which uses the aforementioned components as building blocks. First, however, we briefly discuss some related work and then some preliminaries with the definitions and notation that are used in the paper.

There has been considerable work on biometric cryptosystems and cancellable biometrics, which aims to protect biometric data when stored for the purpose of authentication; cf. [11]. Biometric hashing is one such approach that can achieve the desired property of irreversibility, although without salting it does not achieve unlinkability. Research in biometric hashing aimed at generating the same hash for different biometric templates from the same user is at an early stage, and existing work does not provide strong security assurances. Locality-sensitive hashing is the approach we explore in this paper, and it has been applied to biometrics in existing work; for example, a recent paper by Dang et al. [4] applies a variant of SimHash, a hash function we use in this paper, to face templates. However, the technique of applying locality-sensitive hashing to a biometric template has not, to the best of our knowledge, been employed in a system such as ours.

A quantity is said to be negligible with respect to some parameter λ, written negl(λ), if it is asymptotically bounded from above by the reciprocal of all polynomials in λ. For a probability distribution D, we denote by x ←$ D the fact that x is sampled according to D. We overload this notation for a set S, i.e. y ←$ S denotes that y is sampled uniformly from S. Let D_0 and D_1 be distributions. We denote by D_0 ≈_C D_1 and D_0 ≈_S D_1 the facts that D_0 and D_1 are computationally indistinguishable and statistically indistinguishable, respectively. We use the notation [k] for an integer k to denote the set {1, . . . , k}. Vectors are written in lowercase boldface letters. The abbreviation PPT stands for probabilistic polynomial time.

The entropy of a random variable is the average "information" conveyed by the variable's possible outcomes. A formal definition is as follows.

Definition 1. The entropy H(X) of a discrete random variable X which takes on the values x_1, . . . , x_n with respective probabilities Pr[X = x_1], . . . , Pr[X = x_n] is defined as

H(X) = − Σ_{i=1}^{n} Pr[X = x_i] · log Pr[X = x_i].

In this paper, the logarithm is taken to be base 2, and therefore we measure the amount of entropy in bits.

Iris biometrics is considered one of the most reliable techniques for implementing identification systems. For the ID of our system (discussed in the overview of our framework in Section 6), we needed an algorithm that can provide us with consistent iris templates, which exhibit not only low intra-class variability but also high inter-class variability.
This requirement is essential because we would expect the iris templates for the same subject to be similar, as these templates are subsequently hashed using the technique described in Section 4. Iris-based biometric techniques have received considerable attention in the last decade. One of the most successful techniques was put forward by John Daugman [5], but most of the current best-in-class techniques are patented and hence unavailable for open-source use. For the purposes of this paper, we have used the work of Libor Masek [7], which is an open-source implementation of a reasonably reliable iris recognition technique. Users can always opt for other commercial biometric solutions when deploying our work independently.

Masek's technique works on grey-scale eye images, which are processed in order to extract the binary template. First, a segmentation algorithm based on a Hough transform is used to localise the iris and pupil regions and to isolate the eyelids, eyelashes and reflections, as shown in Figures 1a and 1b. The segmented iris region is then normalised, as shown in Figure 1c. The iris features are extracted from the normalised image by one-dimensional Log-Gabor filters to produce a bitwise iris template and mask, as shown in Figure 1d. We denote the complete algorithm by Iris.ExtractFeatureVector, which takes a scanned image as input and outputs a binary feature vector fv ∈ {0,1}^n (this algorithm is called upon by our framework in Section 6). This data is then used for matching, where the Hamming distance is used as the matching metric. We have used the CASIA-Iris-Interval database [2] in Figure 1 and for some preliminary testing.

Table 1 (FAR and FRR for the CASIA-a data set) shows the performance of Masek's technique as reported by him in his original thesis [8]. The algorithm performs quite well for a threshold of 0.4, where the false acceptance rate (FAR) is 0.005 and the false rejection rate (FRR) is 0. These values are used when we present our results, comparing the FAR and FRR obtained when a biometric template is first hashed and the Hamming distance is then measured on the hashes, as opposed to directly measuring the Hamming distances between the original biometric templates. One would expect the performance of our work to improve as other methods become more effective at extracting consistent biometric templates.

As mentioned above, we rely on the open-source MATLAB code by Libor Masek. For each input image, the algorithm produces a binary template which contains the iris information, and a corresponding noise mask which identifies corrupt areas within the iris pattern and marks the corresponding bits in the template as corrupt. These extracted iris templates and their corresponding masks are 20 × 480 binary matrices each. In the original work, only those bits in the iris pattern that correspond to '0' bits in the noise masks of both iris patterns were used in the calculation of the Hamming distance. A combined mask is calculated and both templates are masked with it. Finally, the algorithm calculates the bitwise XOR, i.e. the distance, between the masked templates. In summary, Masek's algorithm has two stages: extraction, which produces a template and noise mask for each eye image, and matching, which masks both templates with the combined noise mask and computes the Hamming distance between them.

There are two major issues that we need to deal with before being able to use these templates in our system: template masking and conversion to a linear vector. The above matching technique requires two pairs of iris patterns and their corresponding masks in order to calculate the Hamming distance.
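The combined-mask matching just described can be sketched as follows. This is an illustrative reimplementation in Python, not Masek's original MATLAB code; the array shapes and the normalisation by the number of unmasked bits are assumptions made for the sketch.

import numpy as np

def masked_hamming_distance(template1, mask1, template2, mask2):
    # Bits set to 1 in either noise mask are corrupt and excluded from the comparison;
    # the distance is normalised by the number of usable (unmasked) bits.
    combined_mask = np.logical_or(mask1, mask2)
    usable = np.logical_not(combined_mask)
    differing = np.logical_and(np.logical_xor(template1, template2), usable)
    return differing.sum() / usable.sum()

# Example with random 20 x 480 templates and masks standing in for real scans.
rng = np.random.default_rng(0)
t1, t2 = rng.integers(0, 2, (2, 20, 480), dtype=np.uint8)
m1, m2 = rng.integers(0, 2, (2, 20, 480), dtype=np.uint8)
print(masked_hamming_distance(t1, m1, t2, m2))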
However, for the application we are targeting, at any time during verification the system would have to match the extracted template of an individual (i.e. template and mask) against a hashed template stored on the blockchain. This means that we cannot incorporate the above matching algorithm into our system. We have two choices to mitigate this problem and obtain the masked template for the remaining steps:

1. We can calculate the masked template independently for each sample, i.e. maskedtemplate_i = template_i ∧ ¬mask_i, so that the bits marked as corrupt by the sample's own mask are zeroed. An underlying assumption for this method is that the masks for an individual would be approximately the same in every sample. This assumption is not too far-fetched, as was clear from the preliminary analysis of our database. We will refer to these as type 1 templates.

2. We can maintain a global mask, defined as globalmask = ∨_i mask_i over all extracted masks mask_i of all images image_i belonging to the training database. At the time of verification, we generate the masked template as maskedtemplate = template ∧ ¬globalmask. This method has the added difficulty of finding the global mask, and it also discards more data from the iris patterns than directly using the respective mask_i for each template_i. But it helps in maintaining consistency among the masked templates. We will refer to these as type 2 templates.

In a follow-up to this paper, we will use and provide results for templates of both types, based on experiments we are conducting at the time of writing.

For our use case, we need a one-dimensional input stream which can be fed into the hashing algorithm discussed in the subsequent sections. For converting the masked template matrices into linear feature vectors, we have two naive choices: concatenating either the row vectors or the column vectors. Before deciding on the type of conversion, let us look at an important factor: rotational inconsistencies in the iris templates. Rotational inconsistencies are introduced by rotations of the camera, head tilts and rotations of the eye within the eye socket. The normalisation process does not compensate for these. In order to account for rotational inconsistencies, when the Hamming distance of two templates is calculated, one template is shifted left and right bit-wise, and a number of Hamming distance values are calculated from successive shifts. This bit-wise shifting in the horizontal direction corresponds to the rotation of the original iris region. This method was suggested by Daugman [5], and corrects misalignment in the normalised iris pattern caused by rotational differences during imaging. From the calculated Hamming distance values, only the lowest is taken, since this corresponds to the best match between two templates. Due to this, column-wise conversion seems like the most logical choice, as it allows us to easily rotate the binary linear feature vectors: shifting the linear vector by 20 bits corresponds to shifting the iris template by one column (recall that the dimension of the iris templates is 20 × 480).

Putting it all together, we define the steps of our algorithm Iris.ExtractFeatureVector, which we call upon later.

- Iris.ExtractFeatureVector(image):
  • (template, mask) = createiristemplate(image), where createiristemplate is Masek's open-source algorithm.
  • Obtain the masked template (either type 1 or type 2, as defined in Section 3.1).
  • Convert the masked template to a linear vector as in Section 3.2.
  • Output the binary linear vector fv ∈ {0,1}^n.

Note that the parameter n is a global system parameter measuring the length of the binary feature vectors output by Iris.ExtractFeatureVector.

To preserve the privacy of individuals on the blockchain, the biometric data has to be protected before being written to the ledger. Hashing is a good way to achieve this, but techniques such as SHA-256 and SHA-3 cannot be used, since the biometric templates that we extracted above can show differences across various scans for the same individual, and hence using those hash functions would produce completely different hashes. Therefore, we seek a hash function that generates "similar" hashes for similar biometric templates. This prompts us to explore locality-sensitive hashing (LSH), which has exactly this property. Various LSH techniques have been researched to identify whether files (i.e. byte streams) are similar based on their hashes. TLSH is a well-known LSH function that exhibits high performance and matching accuracy, but it does not provide a sufficient degree of security for our application. Below we assess another type of LSH function, which does not have the same runtime performance as TLSH but, as we shall see, exhibits provable security for our application and is therefore a good choice for adoption in our framework.

In the cryptographic definition of one-way functions, it is required that it is hard to find any preimage of the function. However, we can relax our requirements for many applications, because it does not matter if, for example, a random preimage can be computed, as long as it is hard to learn information about the specific preimage that was used to compute the hash. In this section, we introduce a property that captures this idea, a notion we call input hiding. Input hiding means that if we choose some preimage x and give the hash h = H(x) to an adversary, it is either computationally hard or information-theoretically impossible for the adversary to learn x or any partial information about x. This is captured in the following formal definition.

Definition 2 (Input hiding). A hash function H is input hiding if, for every PPT adversary A that is given h = H(x) for x sampled from the input distribution, the probability that A outputs x is at most negl(λ), where λ is the security parameter. If this holds even for computationally unbounded adversaries, we say that H is information-theoretically input hiding.

Random projection hashing, proposed by Charikar [1], preserves the cosine distance between two vectors in the output of the hash, such that two hashes are probabilistically similar depending on the cosine distance between the two preimage vectors. This hash function is called SimHash. We describe a slight variant of SimHash here, which we call S3Hash. In our variant, the m random vectors R = (r_1, . . . , r_m) are ternary, i.e. each r_i is sampled uniformly from {−1, 0, 1}^n, and the hash of an input x ∈ {0,1}^n is defined as

S3Hash_R(x) = (sgn(⟨r_1, x⟩), . . . , sgn(⟨r_m, x⟩)),

where sgn : Z → {0, 1} returns 0 if its integer argument is negative and returns 1 otherwise. Note that the notation ⟨·, ·⟩ denotes the inner product between the two specified vectors. Let x_1, x_2 ∈ {0, 1}^n be two input vectors. It holds for all i ∈ {1, . . . , m} that

Pr[h^(1)_i = h^(2)_i] = 1 − θ(x_1, x_2)/π,

where h^(1) = S3Hash_R(x_1), h^(2) = S3Hash_R(x_2) and θ(x_1, x_2) is the angle between x_1 and x_2. Therefore the similarity of the inputs is preserved in the similarity of the hashes. An important question is: is this hash function suitable for our application? The answer is in the affirmative, because it can be proved that the function information-theoretically obeys the input-hiding property that we defined in Section 4.1.
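Before turning to the proof, the following minimal sketch illustrates S3Hash as defined above. The parameter choices for n and m, the random-key generation and the example of two similar templates are assumptions made purely for illustration.

import numpy as np

def s3hash(x, R):
    # Bit i of the hash is sgn(<r_i, x>): 0 if the inner product is negative, 1 otherwise.
    return (R @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(1)
n, m = 9600, 256                          # e.g. a flattened 20 x 480 template and a 256-bit hash
R = rng.integers(-1, 2, size=(m, n))      # ternary random vectors, sampled once and shared
x1 = rng.integers(0, 2, size=n)           # a binary feature vector
x2 = x1.copy()
x2[:200] ^= 1                             # a "similar" vector: 200 bits flipped
h1, h2 = s3hash(x1, R), s3hash(x2, R)
print("hash Hamming distance:", int(np.sum(h1 != h2)), "out of", m)

Because x1 and x2 are close in angle, their hashes agree on most of the m positions, in line with the probability relation above.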
We recall that this property means that if we choose some binary vector x ∈ {0, 1}^n and give the hash h = S3Hash_R(x) to an adversary, it is either computationally hard or information-theoretically impossible for the adversary to learn x or any partial information about x. This property is sufficient in our application, since we only have to ensure that no information is leaked about the user's iris template. We now prove that our variant locality-sensitive hash function S3Hash is information-theoretically input hiding.

Theorem 1. Let X denote the random variable corresponding to the domain of the hash function. If H(X) ≥ m + λ, then S3Hash is information-theoretically input hiding, where λ is the security parameter and H(X) is the entropy of X.

Proof. The random vectors in R can be thought of as the coefficient vectors of a system of m linear equations in n unknowns on the left-hand side; on the right-hand side we have the m elements, one for each equation, which are the components of the hash, i.e. (h_1, . . . , h_m). Now the inner product is evaluated over the integers, and the sgn function maps an integer to an element of {0, 1} depending on its sign. The random vectors are chosen to be ternary. Suppose we choose a finite field F_p where p ≥ 2(m + 1) is a prime. Since there will be no overflow when evaluating the inner product in this field, a solution in this field is also a solution over the integers. We are interested only in the binary solutions. Because m < n, the system is underdetermined. Since there are n − m degrees of freedom in a solution, it follows that there are 2^(n−m) binary solutions, and each one is equally likely. Now let r denote the redundancy of the input space, i.e. r = n − H(X). The number of the 2^(n−m) solutions that are valid inputs is 2^(n−m−r). If 2^(n−m−r) > 2^λ, then the probability of an adversary choosing the "correct" preimage is negligible in the security parameter λ. For this condition to hold, it is required that n − m − r > λ (recall that r = n − H(X)), which follows if H(X) ≥ m + λ, as hypothesized in the statement of the theorem. It follows that, information-theoretically, an unbounded adversary has a negligible advantage in the input hiding definition.

Our initial estimates suggest that the entropy of the distribution of binary feature vectors output by Iris.ExtractFeatureVector is greater than m + λ for parameter choices such as m = 256 and λ = 128. A more thorough analysis, however, is deferred to future work.

We have run experiments with S3Hash applied to feature vectors obtained using our Iris.ExtractFeatureVector algorithm. The distance measure we use is the Hamming distance. The results of these experiments are shown in Table 2. The results for a threshold of 0.3 in particular indicate that our approach shows promise. We hope to make further improvements in future work.

A blockchain is used in the system for immutable storage of individuals' vaccination records. The blockchain we employ is a permissioned ledger to which blocks can only be added by authorized entities or persons such as hospitals, primary health care centers, clinicians etc. Such entities have to obtain a public-key certificate from a trusted third party and store it on the blockchain as a transaction before they are allowed to add blocks to the ledger. The opportunity to add a new block is controlled in a round-robin fashion, thereby eliminating the need to perform a computationally intensive PoW process.
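As an illustration of this permissioning model, the sketch below shows how an entity holding a registered key pair might sign a transaction and how a peer could verify it against the public key recorded on the chain. The choice of Ed25519 and the helper names are assumptions made for the sketch; the paper does not prescribe a particular signature scheme.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# An authorized entity (e.g. a clinic) holds a signing key; its public key is assumed
# to have been registered on the blockchain in an earlier certificate transaction.
clinic_key = Ed25519PrivateKey.generate()
onchain_public_keys = {"clinic-42": clinic_key.public_key()}  # stand-in for an on-chain lookup

tx_bytes = b'{"type": "rec", "payload": "..."}'  # serialized transaction to broadcast
signature = clinic_key.sign(tx_bytes)

def verify_transaction(sender, tx_bytes, signature):
    # A verifying node fetches the sender's registered public key and checks the signature.
    try:
        onchain_public_keys[sender].verify(signature, tx_bytes)
        return True
    except (KeyError, InvalidSignature):
        return False

print(verify_transaction("clinic-42", tx_bytes, signature))  # True for a valid signature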
Any transaction that is broadcast to the P2P network is signed by the entity that created it, and can be verified by all other nodes by downloading the public key of the signing entity from the blockchain.

We now describe an abstract interface for the permissioned blockchain that captures the functionality we need. Consider a set of parties P̃. A subset of parties P ⊂ P̃ is authorized to write to the blockchain. Each party P ∈ P has a secret key sk_P which it uses to authenticate itself and gain permission to write to the blockchain. How a party acquires authorization is beyond the scope of this paper. For our purposes, the permissioned blockchain consists of the following algorithms:

- Blockchain.Broadcast(P, sk_P, tx): On input a party identifier P that identifies the sending party, a secret key sk_P for party P and a transaction tx (whose form is described below), broadcast the transaction tx to the peer-to-peer network for inclusion in the next block. The transaction will be included iff P ∈ P.
- Blockchain.AnonBroadcast(sk_P, tx): On input a secret key sk_P for a party P and a transaction tx, anonymously broadcast the transaction tx to the peer-to-peer network for inclusion in the next block. The transaction will be included iff P ∈ P.

A transaction has the form (type, payload, party, signature). A transaction in an anonymous broadcast is of the form (type, payload, ⊥, ⊥). The payload of a transaction is interpreted and parsed depending on its type. In our application, there are two permissible types: 'rec' (a record transaction, which consists of a pair (ID, record)) and 'hscan' (a biometric hash transaction, which consists of a hash of an iris feature vector). This will become clear from context in the formal description of our framework in the next section, which makes use of the above interface as a building block. The final point is that a block is a pair (hash, transactions) consisting of the hash of the block and a set of transactions {tx_i}_{i∈[ℓ]}.

6 Our Framework

In this section we provide a formal description of our proposed framework, which makes use of the building blocks presented in the previous sections. Our proposed system utilises a two-factor authentication mechanism to uniquely identify an individual on the blockchain. The parameters required to recreate an identifier are based on information that "one knows" and biometric information that "one possesses". Figure 2 describes the overall algorithm that we employ in our proposed system. When a user presents themselves to an entity or organisation participating in the system, they are asked for their DoB (dd/mm/yyyy) and Gender (male/female/other). In addition, the organization captures a number of scans of the user's iris, and creates a hash H_1(fv) from the feature vector extracted from the "best" biometric scan data. Our system then combines the user's DoB and Gender with H_1(fv) to generate a unique 256-bit identifier (ID) for the user:

ID = H(DoB ∥ Gender ∥ H_1(fv)),

where H is a collision-resistant hash function with a 256-bit output. The algorithm tries to match the calculated hash H_1(fv) with existing "anonymous" hashes that are stored on the blockchain. It may get back a set of hashes that are somewhat "close" to the calculated hash. In that case the algorithm concatenates each returned hash (Match_i) with the user's DoB and Gender and hashes the result, as above, to produce a candidate ID. It then tries to match this ID with an ID in a vaccination record transaction on the blockchain. If a match is found, then the user is already registered on the system and has at least one vaccination record.
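A minimal sketch of the identifier derivation and candidate matching just described is shown below. The helper names, the use of SHA-256 as the outer collision-resistant hash, and the in-memory stand-ins for the on-chain 'hscan' and 'rec' transactions are assumptions made for illustration and are not part of the formal description in Figures 4 and 5.

import hashlib

def derive_id(dob: str, gender: str, iris_hash: bytes) -> bytes:
    # 256-bit identifier from the two "known" factors and the iris hash H1(fv).
    return hashlib.sha256(dob.encode() + gender.encode() + iris_hash).digest()

def hamming(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def lookup(dob, gender, iris_hash, anon_hashes, records, threshold=77):
    # anon_hashes: 'hscan' hashes already on the chain; records: ID -> vaccination records.
    # threshold is roughly 0.3 * 256 bits, following the experimental threshold reported earlier.
    candidates = [h for h in anon_hashes if hamming(h, iris_hash) <= threshold]
    for match in candidates:
        candidate_id = derive_id(dob, gender, match)
        if candidate_id in records:
            return records[candidate_id]   # user already registered
    return None                            # first presentation: register hscan and rec transactions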
When a match is found in this way, we may just wish to retrieve the user's records or add an additional record, e.g. when a booster dose has been administered to the user. However, if we go through the set of returned matches and cannot match the ID to an existing ID in a vaccination record on the blockchain, i.e. this is the first time the user is presenting to the service, then we store the iris scan hash data H_1(fv) as an anonymous record on the blockchain, and subsequently the ID and COVID-19 vaccination details for the user as a separate transaction. In each case the transaction is broadcast at a random interval on the blockchain peer-to-peer (P2P) network for it to be verified by other nodes in the system, and eventually added to a block on the blockchain. Uploading the two transactions belonging to a user at random intervals ensures that the transactions are stored in separate blocks on the blockchain, so that an attacker is not easily able to identify the relationship between the two. Figure 3 shows a blockchain in which there are three anonymous transactions (i.e. hashes of scan data) and three COVID-19 vaccination record transactions stored on the blockchain, pertaining to different users. The reader is referred to Section 4 for more details on how the hash is calculated in our system. Note that the storage of the anonymous hash data has to be carried out only once per registered user in the system.

We present a formal description of our framework in Figure 4 and Figure 5. Note that the algorithms described in these figures are intended to formally describe the fundamental desired functionality of our framework and are so described for ease of exposition and clarity; in particular, they are naive and non-optimized, and do not leverage the more efficient data structures that a real-world implementation would. Let H be a family of collision-resistant hash functions. The algorithms in Figure 4 are stateful (local variables that contain information retrieved from the blockchain persist between invocations).

In this paper we have detailed a framework to build a global vaccination passport using a distributed ledger. The main contribution of our work is to combine a locality-sensitive hashing mechanism with a blockchain to store the vaccination records of users. A variant of the SimHash LSH function is used to derive an identifier that leaks no personal information about an individual. The only way to extract a user's record from the blockchain is for the user to present themselves in person to an authorised entity, and to provide an iris scan along with other personal data in order to derive the correct user identifier. However, our research has raised many additional challenges and research questions whose resolution requires further investigation and experimentation, intended for future work. First and foremost, we need to improve the accuracy of the Iris.ExtractFeatureVector algorithm (e.g. deciding whether to use type 1 or type 2 template masking) and to accurately compute the entropy of the feature vectors. Furthermore, our variant of the SimHash algorithm, referred to in this paper as S3Hash, requires further analysis and evaluation, especially with respect to the domain of the random vectors r_i, which are restricted to be ternary in this paper. Additionally, we must choose a suitable blockchain. Finally, the overall protocol would benefit from a thorough security analysis in which not just privacy but other security properties are tested.
The blockchain based mechanism that we have proposed can also be used as a generalised healthcare management system [6, 12], with the actual data being stored off-chain for the purpose of efficiency. Once a user's identifier has been recreated, it can be used to pull all records associated with the user, thereby retrieving their full medical history. We are in the middle of developing a prototype implementation of the system and hope to present the results of our evaluation in a follow-on paper. At some time in the future, we hope to trial the system in the field, with the hope of rolling it out on a larger scale. Our implementation will be made open source.

References

Charikar, M.: Similarity estimation techniques from rounding algorithms.
Chinese Academy of Sciences' Institute of Automation (CASIA): Iris Image Database.
COVID-19 Dashboard by the Center for Systems Science and Engineering.
Dang et al.: Fehash: Full entropy hash for face template protection.
Daugman, J.: Statistical richness of visual phase information: update on recognizing persons by iris patterns.
Managing lifetime healthcare data on the blockchain.
Masek, L.: Recognition of human iris patterns for biometric identification. Final Year Project, The School of Computer Science and Software Engineering.
Masek, L.: MATLAB source code for a biometric identification system based on iris patterns. The School of Computer Science and Software Engineering.
COVID-19 immunity passports and vaccination certificates: scientific, equitable, and legal challenges.
A survey on biometric cryptosystems and cancelable biometrics.
Blockchain research beyond cryptocurrencies.