key: cord-0634855-0kogwfn5 authors: Alghazwi, Mohammed; Turkmen, Fatih; Velde, Joeri van der; Karastoyanova, Dimka title: Blockchain for Genomics: A Systematic Literature Review date: 2021-11-19 journal: nan DOI: nan sha: 9e0d617f29ccf6970d93ccbb6963a9d01efc73fc doc_id: 634855 cord_uid: 0kogwfn5

Human genomic data carry unique information about an individual and offer unprecedented opportunities for healthcare. The clinical interpretations derived from large genomic datasets can greatly improve healthcare and pave the way for personalized medicine. Sharing genomic datasets, however, poses major challenges, as genomic data differ from traditional medical data: they indirectly reveal information about descendants and relatives of the data owner and carry valid information even after the owner passes away. Therefore, stringent data ownership and control measures are required when dealing with genomic data. In order to provide secure and accountable infrastructure, blockchain technologies offer a promising alternative to traditional distributed systems. Indeed, the research on blockchain-based infrastructures tailored to genomics is on the rise. However, there is a lack of a comprehensive literature review that summarizes the current state-of-the-art methods in the applications of blockchain in genomics. In this paper, we systematically look at the existing work, both commercial and academic, and discuss the major opportunities and challenges. Our study is driven by five research questions that we aim to answer in our review. We also present our projections of future research directions, which we hope researchers interested in the area can benefit from.

The field of genomics holds great potential in enhancing healthcare. The genomic data produced through technologies such as high-throughput sequencing (HTS) can provide unique health-related information about each individual in a non-invasive manner and is crucial in the advancement of precision medicine [21].
It has been estimated that by 2025, between 100 million and as many as 2 billion human genomes could be sequenced [117] . At the same time as the amount of genomic data is soaring and the genomic technology is advancing, so are the challenges they pose. Who has access to the data [109] and how can they be kept safe? How can data be used and shared responsibly [81, 103] without losing the advantages of sharing for research and (future) patients? "The obligation to confidentiality" must be balanced with "the obligation to share" when it comes to genomic data. Some argue that sharing genomic data is an ethical obligation for those who benefited from sequencing their genome [64] . However, genomic data have certain characteristics that make them fundamentally different from traditional health records: they are long-lived as they carry valid information even after an individual passes away; they indirectly affect descendants and relatives of the data owner; they are large when the whole genome is sequenced (e.g. BAM file size 138 GB [118] ). Moreover, the genomic data are not only used in medical contexts but also in many others including forensics, insurance and pharmaceutical research to name a few. For instance, criminal investigators frequently employ DNA profiles stored in their forensic databases for criminal cases such as rape and murder. Similarly, pharmaceutical companies are investing hundreds of millions to gain access to genomic data towards developing new medicines. For instance, 23andMe [1] , a well-known biotechnology company, has been sharing its large genetic database with GlaxoSmithKline [129] . 
Authors' addresses: Mohammed Alghazwi, m.a.alghazwi@rug.nl, University of Groningen, Groningen, Netherlands; Fatih Turkmen, f.turkmen@rug.nl, University of Groningen, Groningen, Netherlands; Joeri van der Velde, k.j.van.der.velde@umcg.nl, University Medical Center Groningen, Groningen, Netherlands; Dimka Karastoyanova, d.karastoyanova@rug.nl, University of Groningen, Groningen, Netherlands.

Table 1
Paper | Focus
[83, 107] | Potential of blockchain in healthcare with genomics discussed as an example.
[98] | Potential of blockchain in genomics and a use-case proposal.
[110, 124, 125] | Potential of blockchain in genomics.
This paper | Presents a comprehensive systematic literature review of the current state-of-the-art methods in applying blockchain in genomics.

While the potential societal impact of the improper use of genetic information is immense, there is a significant public benefit in the adoption of genomic data usage in patient diagnosis, screening and treatment. The main technical issues that have been highlighted in the literature are the storage and sharing [7, 8] of genomic data. More specifically, these include the need for efficient compression techniques, the lack of harmonized (meta)data, and, perhaps more importantly, the lack of secure and privacy-preserving technical infrastructures to acquire, process and share genomic data. Another challenge is the lack of common terms and conditions in the metadata, which describe both the data (raw data format and notations) and the access requirements (informed consent, data transfer agreements); this decreases the efficiency of discovery and hinders data sharing [128]. In order to maximize the benefits of genomic data, the (meta)data need to conform to the FAIR (findable, accessible, interoperable, reusable) principles [29].

There is an increasing interest in applying blockchain technology in healthcare. This is evident in the increasing number of articles published each year since the emergence of this technology.
For instance, [58] provided an overview of the current trends in using blockchain in healthcare and showed the properties of blockchain that are most commonly used, and [2] classified each work applying blockchain in healthcare based on its use-case. Blockchain has also been proposed as a candidate approach to address many of the challenges in handling genomic data [98, 110, 124]. Its decentralization, immutability, and transparency properties make it an attractive option to solve the sharing and storage issues. In addition, it can be combined with privacy-preserving techniques to provide privacy, traceability, and integrity to the data being managed. In this review, we do not cover blockchain in general healthcare applications; our focus is on a subset of healthcare applications, namely genomics. Specifically, we focus on applications and solutions that utilize blockchain technology in managing, sharing, or processing human genomic data. Throughout the paper, we use the term "genomic data" to refer to data used in studying the human genome, which includes the scientific study of gene interactions and complex traits/diseases that are caused by a combination of genetic and environmental factors.

The use of blockchain in genomics is believed to have great potential, as various researchers have pointed out. These works, as shown in Table 1, focus on the potential and possible benefits of blockchain in genomics, and they list the possible use-cases with a few examples of existing work. For example, a recent paper [124] explored the opportunities and challenges of distributed ledger technology (DLT) in genomics by conducting a ranking-type Delphi study. However, instead of focusing on the potential and possible use-cases, the main objective of this paper is to present the current state-of-the-art methods in applying blockchain in the field of genomics. We look at the wide range of applications, the motivations for using blockchain, and the different approaches.
Finally, we discuss the limitations and future directions, which can serve as a basis for other researchers.

The review is structured as follows. First, the methodology used to conduct this review is explained in Section 2; then we give an overview of blockchain technology and genomic data storage/sharing in Section 3. In Section 4, we summarize the findings of this review by describing the current trend, application domains, and motivations for using blockchain in genomics. In addition, we discuss the different approaches and techniques used and give an overview of the challenges faced. Section 5 provides a discussion of our findings and the unexplored opportunities of applying blockchain in genomics.

The procedure used in this paper is in line with the methodology proposed by Kitchenham et al. [67]. We utilize the Systematic Literature Review (SLR) approach as it provides a clear and structured way to search the literature and extract relevant information. Figure 1 represents the methodology followed, which is adapted from [19]. The three main stages are planning the review, performing the review, and documentation.

The planning for this paper is presented in this section, with the first step being the definition of the research questions that we intend to address. Then, the development of the search protocol is presented to outline the search strategy, and finally, the selection criteria are laid out.

We formulated the main question "What are the application domains, motivations, approaches, and challenges when applying blockchain in genomic applications?", which can be split into the following specific and structured questions that will be addressed in this review:
• RQ1: What are the current research trends for the use of blockchain in genomics?
• RQ2: What are the application scenarios of using blockchain in genomics?
• RQ3: What are the benefits and advantages of using blockchain in these applications as described by the authors?
• RQ4: What are the elements of blockchain technology used in genomic applications? What are the approaches or combinations of technologies used?
• RQ5: What are the challenges and limitations when applying blockchain in genomics? Have these limitations been addressed by the authors? What has been specified for future research?

2.1.2 Search Protocol. The search strategy was defined and performed after the above set of research questions had been established. The selected data sources include both academic and non-academic sources, in order to cover the wide range of applications and approaches used in both academia and industry. For academic sources, we collected papers from the following 6 electronic databases:
• Google Scholar
• IEEE Xplore
• PubMed
• Springer SpringerLink
• Elsevier ScienceDirect
• ACM Digital Library

The search for relevant publications in these databases was performed using the query string defined below:
(blockchain OR "block chain" OR "distributed ledger" OR "smart contracts") AND (genomic OR genome OR genomics OR genes OR genetic OR genetics)

In addition, preprints were collected from arXiv (arxiv.org). For non-academic sources, we used the Google search engine to find reports, blogs, and code repositories and selected the relevant materials. This was done to find ongoing industry projects that considered the use of blockchain in genomics for commercial purposes. The set of keywords used to find these sources was selected based on the reviewers' background and knowledge related to blockchain and genomic data sharing. These keywords include the following: genomics, genomic data-sharing, blockchain, DLT, smart-contracts. The search was conducted in March 2021 and covered publications in the period 2009-2021.

The selection criteria (inclusion and exclusion) were defined prior to performing the search in order to eliminate non-relevant sources. The selection criteria are shown in Table 2.

2.2 Performing the review
2.2.1 Article Selection.
The initial search in the selected databases resulted in 752 papers. First, the titles and abstracts were screened to find and exclude duplicate and unrelated articles, which reduced this number to 61. An additional full-text analysis was performed on the remaining articles and, as a result, 21 articles were discarded. The remaining 40 articles were included in this review, as shown in Figure 2.

According to DisGeNET [36], there are more than 5000 diseases for which a risk level can be calculated by using the genetic information of an individual. Genomic data present an invaluable and unique source of information for understanding complex traits and diseases [48]. Traditionally, genomic sequencing (in particular whole genome sequencing) has been considered a costly and time-consuming process. Thanks to the developments in genome sequencing technology, today the time associated with whole genome sequencing is at the level of hours (e.g. one sample in one hour [34]) and the cost is less than 600 dollars [6]. Large datasets containing genomes and clinical data of individuals are becoming increasingly important to medical experts, as the analysis of diverse data contributes to detecting fine-grained biological insights essential to improving public health [14]. As a result, an increasing number of clinicians are including analysis results obtained with these technologies in their day-to-day practices, for example in the context of personalized medicine. Some of the frequent medical uses of genetic information include diagnostic and predictive DNA testing, with the option of integrating polygenic risk scores, where an individual's (and her relatives') predisposition to certain diseases such as breast cancer is screened through specific genes (e.g. BRCA2).

3.1.1 Genomic data sharing. Interpretation is a key component of genomics research.
Individual genomic variants can be interpreted in relation to specific signs or symptoms and multiple genomic variants can be assessed in relation to their collective impact on patients. Genetic (genotype) and clinical data (phenotype) can be combined to determine the best treatment for a patient. The quality of these interpretations is highly dependent on the data they are based on. Research and clinical knowledge sharing is essential to enable the refinement of interpretations. While policies and laws, which differ from country to country, allow the exchange of genomic data under certain conditions, genomics researchers have experienced how difficult and cumbersome the process is [77] . Another issue is the participation in genomics research which is currently low considering the requirements in genomics research. For instance, the number of participants in genome-wide association studies (GWAS) can reach over 1 million [89] . Current genomic data sharing methods depend on the level of privacy (and the task at hand) required as some parts of the genomic data are private while some others are not considered private. For instance, somatic variants in the human genome are not considered private as they cannot identify specific individuals or families. On the other hand, germ-line variants are unique to each person, and therefore, they require privacy protection. There are various genomic data exchange platforms that give researchers the ability to share genomic data publicly in order to advance the research in this field. Large organisations such as Clinical Genome (ClinGen) [24] and the Global Alliance for Genomics and Health (GA4GH) [46] have started the development of reliable resources to systematically define and interpret all human variation through broad data-sharing efforts. There are also large scale European efforts to promote/coordinate the cross-border collection, storage and sharing of human genome data in a secure way, e.g. 
Beyond 1 Million Genomes (B1MG) [126]. One of the proposed solutions is the genomic beacon project initiated by GA4GH. Genomic beacons ease the process of genomic data sharing through web services that answer queries about the presence of a specific allele in a genome. Institutions can launch their own beacons and connect to the project. There are currently over 100 beacons [44]. In general, beacons aim to respect data privacy by allowing the institutions to define their own access restrictions and authorization schemes. Recently, however, privacy researchers have pointed out that beacons do have privacy issues and that individuals can be identified from them even if a data anonymization technique is applied to the data [111].

3.1.2 Privacy of genomic data. In addition to being highly valuable, genomic data is highly sensitive as it may reveal information about an individual and/or his/her family. The susceptibility to certain diseases, the ancestral traits of an individual and the response to a drug are just a few of the use cases demonstrating the private nature of one's genetic information. The privacy concerns around genomic data are among the main reasons that limit its widespread use. In addition to the Health Insurance Portability and Accountability Act (HIPAA) [96] and the General Data Protection Regulation (GDPR) [25], there are tailored regulations that address these concerns, such as the Genetic Information Nondiscrimination Act (GINA) [26]. The conflicting need for genomic data to be both shared and private requires the use of privacy-preserving techniques, which allow processing of the data while preserving privacy. There is a plethora of research on privacy-preserving processing of genomic data that proposes the use of privacy-enhancing technologies such as homomorphic encryption [9, 38] and secret sharing [114, 134], among others. We refer the reader to surveys on the topic, such as [4], for a more systematic presentation.
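As a rough illustration of how secret sharing lets several parties compute on genomic data without revealing their individual inputs, the sketch below applies additive secret sharing to allele counts. It is a toy example and not one of the protocols proposed in [114, 134]; the modulus `PRIME`, the helper names `share` and `reconstruct`, and the three institutional counts are assumptions made purely for illustration.

```python
import secrets

# Public modulus (an assumption for this sketch); it must exceed any possible total.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the value they encode."""
    return sum(shares) % PRIME


# Hypothetical counts of one allele observed at three institutions.
counts = [12, 7, 30]
n = len(counts)

# Each institution splits its count and hands one share to every party.
all_shares = [share(c, n) for c in counts]

# Party j adds up the j-th shares it received; a single share reveals nothing
# about the count it came from.
partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]

# Combining the partial sums reveals only the aggregate count.
total = reconstruct(partial_sums)
print(total)  # 49
```

Only the aggregate is disclosed here; a realistic deployment would additionally need secure channels between the parties and protection against colluding or malicious participants, which this sketch omits.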
The size of genomic data, which ranges between 30 and 200 GB, is one of the main obstacles to the application of privacy-preserving techniques to genomic data processing. There are proposals to substantially reduce the storage footprint of genomic data, such as PetaGene's proposal to compress NGS datasets in FASTQ and BAM format, which provides on average a 60% reduction while preserving genotyping accuracy [102]. However, with large-scale collection efforts such as B1MG [126] on the horizon, an explosion of genomic data is expected. Furthermore, the large size of genomic data makes it difficult to store locally, and therefore cloud databases might be used, which introduces other privacy and security issues. In addition, the high-performance computation required for genomic data processing makes it difficult to perform locally as well. Finally, there are various security and privacy attacks on genomic data on current genomic data sharing platforms, as listed in [10]. This opens the field to alternative approaches that follow the laws and regulations around privacy and at the same time provide utility.

The detailed technical foundation of blockchain technology is outside the scope of this paper. However, it is important to shed light on some blockchain concepts, features, and terminology that will assist the understanding of how blockchain is applied to solve problems in handling genomic data. For an extensive treatment, we refer the reader to other articles such as Kolb et al. [68] and Zhang et al. [133].

Blockchain is the innovative technology behind Bitcoin, the first open-source decentralized digital currency system. The initial design and implementation were done by an unknown entity named Satoshi Nakamoto in 2008/2009 [92]. Blockchain stores and verifies transactions on a ledger that is distributed to all nodes in a peer-to-peer (P2P) network.
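To make the ledger idea concrete, the following minimal sketch groups transactions into blocks, links each block to its predecessor through a SHA-256 hash, and shows that modifying a recorded transaction is detectable. It is an illustrative toy, not taken from Bitcoin or from any of the reviewed systems: the function names `block_hash`, `make_block` and `chain_is_valid` and the sample transactions are hypothetical, and consensus, digital signatures and networking are omitted.

```python
import hashlib
import json


def block_hash(index, prev_hash, transactions):
    """Return the SHA-256 digest (hex) of a block's contents."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "transactions": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(index, prev_hash, transactions):
    """Build a block whose hash commits to its transactions and predecessor."""
    return {
        "index": index,
        "prev_hash": prev_hash,
        "transactions": transactions,
        "hash": block_hash(index, prev_hash, transactions),
    }


def chain_is_valid(chain):
    """Recompute every block hash and check each link to the previous block."""
    for i, block in enumerate(chain):
        expected = block_hash(
            block["index"], block["prev_hash"], block["transactions"]
        )
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


# Hypothetical ledger recording access events for a genomic dataset.
genesis = make_block(0, "0" * 64, ["ledger created"])
block1 = make_block(1, genesis["hash"], ["owner A grants lab B access to dataset 1"])
block2 = make_block(2, block1["hash"], ["lab B queries dataset 1"])
chain = [genesis, block1, block2]

print(chain_is_valid(chain))  # True

# Tampering with a recorded transaction breaks the stored hash of that block.
block1["transactions"][0] = "owner A grants lab C access to dataset 1"
print(chain_is_valid(chain))  # False
```

In a real network every node holds a copy of the chain, so an attacker would also have to recompute and replace all later blocks on most nodes for the change to go unnoticed.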
The transactions are organized into blocks which are protected by a combination of cryptographic techniques to ensure the integrity of the recorded transactions. A consensus protocol is then followed to validate the blocks, and the blocks that are successfully validated are added to the growing chain of blocks.

Although blockchain technology and Distributed Ledger Technology (DLT) are closely related, there is a difference. A distributed ledger is a ledger or a database that is spread across the nodes in the network and maintained by a group of peers, rather than a central agency. Blockchain is an implementation of DLT and, unlike a database, it consists of a chain of blocks. These data blocks are unique data structures that distinguish blockchains from other DLT types. Other implementations of DLT include Hashgraph [11] and Directed Acyclic Graph (DAG) [140].

Blockchain can be divided into a few distinct types, which have their own characteristics and directly reflect the network behavior. These types of blockchain can be classified into the following [120]:
(1) Public Blockchain (public & permissionless): a permissionless ledger in which any anonymous node can join the network, and no trust requirement is enforced by the network members. The transactions are publicly broadcast to all the nodes. Any node in the network can participate in the consensus mechanisms to validate the blocks. An example of this type of network is the Bitcoin blockchain [92].
(2) Private Blockchain (private & permissioned): a ledger that is managed by a single entity, and permission is required before any node can join the network. The access control mechanism provides a higher degree of privacy to the content of the blockchain transactions. Additionally, this type of blockchain provides higher performance in terms of block confirmations. An example of a platform of this type is MultiChain [45].
(3) Consortium/Hybrid Blockchain (public & permissioned): a ledger that is managed by a pre-selected group of nodes. Similar to private blockchains, nodes require permission to join the network. Blocks and transactions are validated when a chosen set of nodes reaches a consensus. The exact process depends on the pre-established rules of the consensus mechanism. An example of a platform of this type is Hyperledger [18].

Consensus in blockchain is the process of validating the blocks and their contents (transactions and code) in order to add them to the blockchain. This essentially solves the problem of allowing multiple parties that do not necessarily trust each other to agree on the state of a shared ledger. Consensus protocols are essential for the reliability of the blockchain. Proof-of-work (PoW) in Bitcoin was the first consensus protocol used in blockchain. This consensus mechanism is comparable to a competition where nodes (miners) try to solve the same puzzle (preimage to a hash function) to validate the transactions and generate a new block. The node that provides the correct solution to the proof-of-work receives a reward in the form of cryptocurrency such as Bitcoin. The newly generated block is then broadcast to all peers in the network and gets connected to the existing blockchain. PoW has been criticized for its energy waste and slow block confirmation. Various protocols have been developed to overcome some of the limitations of PoW, such as Practical Byzantine Fault Tolerance (PBFT) and Proof-of-stake (PoS). For a more detailed description of blockchain consensus protocols, we refer the reader to Xiao et al. [131].

Smart contracts have emerged recently on blockchain due to the popularity of the Ethereum platform. However, the concept of smart contracts dates back to 1997 and was proposed by Nick Szabo [122].
The concept has evolved since then, but the main objective is to allow a smart contract program to run in a decentralized network and modify the state of the system in an automated, trusted, and verifiable way without intermediaries. Blockchain has made it possible to implement this concept and use it in different settings including finance, Identity Management, and healthcare [66] . There is a variety of blockchain platforms that allow executing smart contracts using a number of programming languages and one of the most commonly used languages is Solidity. Briefly, writing a smart contract involves establishing a set of requirements and instructions that are automatically executed once these requirements are met. In contrast to written contracts, smart contracts are executed automatically, are publicly verifiable, and do not require any intermediaries. The security and privacy features of blockchain rely on the use of a number of cryptographic techniques. Some of these techniques were leveraged by the original bitcoin blockchain design, while others were added to subsequent blockchain implementations to enhance security and privacy. The basic security and privacy techniques utilized in the bitcoin blockchain ensure that the system meets the security and privacy related requirements of online transactions and prevents known vulnerabilities. These requirements include: consistency of the ledger, the integrity of the transactions/data, availability of the system, confidentiality of transactions, and users' anonymity [133] . Blockchain ensures the consistency of the ledger across multiple nodes through the use of a consensus mechanism discussed previously. The integrity of online transactions is essential and the underlying system used to facilitate them must be secured against malicious tampering of the data. 
Blockchain transactions are resistant to tampering by both the miners that confirm those transactions and external attackers that try to manipulate them. Because cryptographic hashing and digital signatures are used, any modification of the transaction data would be detected by checking the validity of the digital signature. In addition, tampering with blockchain transactions requires altering the data stored in all blockchain nodes, since the ledger of transactions is stored in all nodes in the network. Attacks such as distributed denial-of-service (DDoS) are not feasible because of the highly decentralized nature of the blockchain network. It is important to note that the more blockchain nodes there are in the network, the higher the resistance to attacks on availability.

Regarding the privacy aspect, blockchain provides pseudonymity through the use of public key infrastructure (PKI). The nodes or users are identified by their public addresses rather than their real identities. However, this fails to provide full anonymity, as there is a risk of linking the public address to real identities by observing the interactions between different parties [133]. Another privacy limitation of the original blockchain implementation is that transactions and their data are publicly visible. Ensuring the confidentiality of transactions and smart contract data expands the possible applications of blockchain to include those that handle private and sensitive information.

To overcome the aforementioned limitations, additional security and privacy techniques have been proposed, such as mixing, anonymous signatures, and Zero-Knowledge Proofs (ZKP) [15, 133]. Mixing services have been proposed as a solution to provide unlinkability to transactions [17, 87, 108]. The mixers swap users' coins and prevent tracing the movement of coins in the blockchain, thus providing unlinkability.
Anonymous signatures are another privacy cryptographic technique employed by certain blockchain applications to hide the identity of the signer. Anonymous signatures schemes such as group signatures [22] and ring signatures [105] conceal the identity of the signer among a group of users that signed a transaction/message. A further discussion on the additional privacy techniques that can be combined with blockchain is provided in Section 5.3. In this section, we present the outcomes of the review. The aim is to answer the research questions defined in Section 2. We first show the distribution of publications per year. Then, we list and classify each paper into sub-categories based on the application domain, and present our analysis on the motivation of using blockchain as described in the papers. We also present the methods and approaches employed in these papers. Finally, we list the open issues and challenges that we observed from the papers. To address RQ1, we analyzed the yearly trend in publications relating to the use of blockchain in genomics. This trend can be seen in Figure 3 , which shows an increasing interest in this topic. The interest started with a lot of commercial applications, but with time more academic research followed. Based on our findings, the first use of blockchain in genomics appeared in a commercial application, Genecoin [47], which started in 2014. The number of papers increased over the years with 18 papers in 2020 alone. The quick upward trajectory seen in Figure 3 is expected since blockchain is a relatively new technology that was introduced in 2009 and the implications of its use (e.g. for scalability and security/privacy) are just being studied in a non-cryptocurrency context. This section reports the range of existing blockchain-based solutions in genomics, which answers RQ2. 
Because the use of blockchain in genomics has attracted the attention of both academic and industrial communities, each with their own agenda on how this technology can be used, we classified the applications as shown in Figure 4 in two main categories: commercial and non-commercial applications. To clarify our classification process further, we distinguish between the two categories based on whether the blockchain is utilized for financial exchange in addition to data sharing. Commercial genomic marketplaces follow a business model and are generally aimed at facilitating the exchange of genomic data for financial benefits. In addition, these marketplaces are usually targeted at individual genomic data owners (individual users/customers or patients), and cryptocurrency or tokens are often used as incentives to promote data sharing. From our total number of selected papers (40), there are 27 papers with no commercial interests and 13 papers with a commercial motivation. Figure 5 shows the percentage of commercial vs. non-commercial papers on blockchain in genomic applications. Marketplaces. The commercialization of DNA sequencing by direct-to-consumer (DTC) companies has attracted an increasing number of customers in recent years. This is due to the technological advances that made it much cheaper and faster. One of the strategies to generate revenue for DTC companies such as 23andme is selling access to the collected DNA sequences to pharmaceutical companies. The fairness of this model raises questions in terms of the profit gained from buying this genomic data. Some argue that the profit should be passed onto the people, not the intermediaries [110] . As a result, there has been an increase in a new generation of companies that provides an open marketplace for genomic data sharing with the use of blockchain. Blockchain-based genomic marketplaces aim to cut the need for intermediaries and give the users control of their data. 
Individuals receive different types of incentives for selling or renting their genomic data. The most common incentive used in these marketplaces is cryptocurrency. Table 3 provides an overview of the incentives used, along with the employed blockchain platform and the services offered on current genomic marketplaces.

Genecoin [47] represents the first attempt at using blockchain in genomics when it was introduced in 2014. The company provides sequencing services through third-party labs. It then encrypts and stores the resulting DNA sequence in the Bitcoin blockchain. The company gives no motivation for collecting genomic data other than stating on its website that it is only gauging interest in this service [47].

Table 3
Marketplace | Blockchain platform | Incentive | Services
Genesy [20] | Private blockchain based on Hyperledger Fabric | Payment in fiat and cryptocurrency through Stellar [116] and Stripe [119] | Sequencing services, selling access to genomic data, and a blockchain-based ecosystem for the sharing of genomic data.
GenoBank [127] | Ethereum-based blockchain with non-fungible tokens (NFT) | Cryptocurrency: ERC-20 token | Control over genomic data with DNA crypto wallet, and secure platform to process the data.

Nebula Genomics [35] uses another business model, whereby people can upload their phenotypic data to the blockchain and earn tokens for doing so. Once they have accumulated enough tokens, they can use them to purchase a whole-genome sequence from Veritas Technologies, which is a partner company of Nebula Genomics. There are also alternatives: people can pay out of pocket for the sequencing, or a third party, such as a pharmaceutical company or research center, can subsidize the cost of sequencing. Of course, the last option is only possible for particular health profiles that are attractive enough for these companies.

LunaDNA [80] plans to use blockchain as a marketplace for genomic data.
Although the use of blockchain is still under development, they have already joined the Genetic Alliance, an advocacy group, with the goal to store this data in a cloud-based platform with security and privacy protections including access control and anonymization [30] . The company does not offer sequencing services but collects existing genomic data. In addition, it adopted a different model to incentivize users to share their data. Users receive company shares and therefore, they are part owners of the company and receive dividends once the aggregated data has value. LunaDNA believes that they are creating a community of part-owners, and in this community, the currency is the data. Shivom [130] is a blockchain-based ecosystem with libraries and data pipelines that are specific for genomic data. The platform connects researchers with DNA data that are controlled by individuals. The data owners are first anonymized and researchers can then leverage this data to conduct their research through the provided pipelines. The company aims to first protect patient data and accelerate medical and pharmaceutical research. Zenome [95] uses the Ethereum blockchain and its smart contracts to provide a marketplace for individuals to share or sell the right to access their genomic data to any interested parties such as researchers. The platform also allows users to store or buy computational resources from specific nodes in the network. These nodes are then rewarded with tokens called ZNA which stands for Zenome DNA tokens. Genomes.io [50] is another genomics blockchain company that allows consumers to securely store and manage their DNA data from the moment it is sequenced to when it is stored on the blockchain. This prevents any attempt to tamper with the data and at the same time gives data owners the chance to monetize their data. 
Genesy [20] aims to encourage data owners and organizations to collaborate by providing an ecosystem for managing the exchange of and access to genomic data. Genesy provides sequencing services and the ability to sell access to the generated data. The payment for these services is done through third-party APIs, namely Stellar [116] and Stripe [119], which allow both fiat and cryptocurrency transfers. Genesy utilizes a private blockchain based on Hyperledger Fabric [18] with the aim to grow beyond that and become a consortium blockchain managed by various organizations. The Genesy blockchain consists of multiple nodes that record the transactions as well as data. Sensitive data, including the user's personal data, are encrypted and stored within the blockchain, while larger genomic data are stored off-chain in external databases and cloud storage with hash pointers on the blockchain. GenoBank [127] is exploring the use of non-fungible tokens (NFT) for portability and data tracing. The proposed method is to assign each unique human genome a unique NFT. The NFT gives users full control of their data while at the same time enabling data owners to authorize data consumers (researchers) to perform analysis in multiple environments. Longenesis [79] is still under development and aims to provide a decentralized end-to-end marketplace for health data including blood test results, medical history and genetic profile. Users will be able to use the Longenesis platform to store their data and consent to participating in a specific medical study. Users can withdraw their consent at any time. In addition, using smart contracts, medical providers can offer to extend, modify or amend an agreement, which can be accepted or rejected by the users. LifeCODE.ai [63] is a blockchain-based platform with a focus on storing and managing genomic data.
The decentralized application (DApp) created by LifeCODE.ai facilitates trading of data through Ethereum's ERC-20 protocol that is implemented in the Quorum blockchain. The tokens are used to pay for access to patient data. To protect the privacy of the data, all health data stored in the blockchain network are encrypted. In addition, the data are owned by the individuals that submit them and all data movements are traceable. Applications. The selected studies on non-commercial applications of blockchain mainly focus on providing genomic data sharing for the advancement of research. Table 4 provides the list of scientific works we identified that applies a form of blockchain in relation to genomics. The selected studies fall into one or multiple of the following subcategories: data sharing, analysis, secure storage, access control, and logging. While each paper is categorized according to the main topic of research, overlaps occur. For example, one study focused on genomic data sharing for the purpose of performing analysis tasks on that data, and another paper focused on blockchain-based storage for the purpose of sharing the stored genomic data. In our analysis, we account for this overlap (as seen in Table 4 ). The majority of the papers we identified focused on using blockchain to support/build systems for multi-organizational or global sharing of genomic data. A noteworthy paper is the cancer gene trust (CGT) [52] , which demonstrates the benefits of blockchain in sharing genomic data, for the purpose of advancing cancer research. In addition, the authors launched a cohort study with a real patient dataset to illustrate the effectiveness of the CGT framework in terms of secure, efficient, cost-effective, open, and distributed sharing of genomic data. A similar approach is presented in [60] , but with additional mechanisms to distribute the whole-genome data. 
We also identified a set of papers that explored the use of blockchain in facilitating genomic data processing or analysis. Zhang et al [135] proposed an approach to perform a Genome-wide association study (GWAS) with a focus on privacy. The authors proposed performing GWAS by using a privacy-preserving sharing protocol (PPS) that enables genomic data sharing through the use of a gene fragmentation framework. The large genomic files are split into multiple fragments which are then distributed in a decentralized blockchain network to multiple service providers for storage, sharing, and analysis. This eliminates the possibility of one provider having the complete data, therefore, solving the issues related to centralization and privacy protection. Coinami [60] provided an alternative approach by incentivizing participants to perform HTS read mapping. The participants are given tokens as a reward. This replaces the traditional proof-of-work with HTS read mapping to validate blocks in the blockchain. Other approaches proposed combining genomic predictive modeling with blockchain to achieve a distributed model training. In [69, 70] , predictive models were trained in multiple organizations with blockchain coordinating the process in a decentralized way. Verification of the computation and analysis tasks performed by third parties is essential. In a blockchain setting, it is important to check the validity of the computation done by untrusted nodes in the network. In addition, the verification process should optimally be done with minimal computation resources. Zhang et al. [135] use a blockchain consensus protocol to validate the results of analysis or computation. Each job is assigned to multiple nodes and the outcome is compared. Then the result of the majority is considered correct. In [60] the computation results are checked by certified authority nodes in the blockchain network. The validity is ensured by randomly inserting pre-calculated decoy data. 
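The two verification strategies described above can be illustrated with a minimal sketch. It is not the implementation used in [135] or [60]; the function names `majority_result` and `passes_decoys`, the task identifiers, and the string-encoded results are all hypothetical and chosen only for illustration. The first function accepts the result reported by a strict majority of the nodes assigned to a job; the second accepts a submission only if every planted decoy task carries its pre-calculated answer.

```python
from __future__ import annotations

from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_result(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the result reported by a strict majority of nodes, or None.

    Hypothetical helper: each element of `results` is the outcome one node
    reported for the same job.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Require more than half of the nodes to agree on the same outcome.
    return value if count > len(results) / 2 else None


def passes_decoys(submitted: dict[str, str], decoys: dict[str, str]) -> bool:
    """Check a submission against pre-calculated decoy answers.

    `decoys` maps the identifiers of planted decoy tasks to their known
    answers; the worker node cannot tell decoys from real tasks.
    """
    return all(submitted.get(task_id) == answer
               for task_id, answer in decoys.items())


if __name__ == "__main__":
    print(majority_result(["chr1:100", "chr1:100", "chr2:7"]))
    print(passes_decoys({"real-1": "chr3:5", "decoy-1": "chr1:42"},
                        {"decoy-1": "chr1:42"}))
```

Majority voting multiplies the computation by the number of nodes assigned to each job, whereas decoy checking only costs the verifier the decoys it pre-computed, which is one way to read the requirement that verification use minimal computation resources.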
Among the selected papers, we found a group of papers that focused on using blockchain as a way to store genomic data securely. [57] utilized blockchain to store and query pharmacogenomics data. The authors illustrated the feasibility and efficiency of storing and accessing this data using the Ethereum blockchain and smart contracts. Each data record is inserted into the smart contract and assigned a unique ID to be used as mapping key. An index-based, multi-mapping approach is used to efficiently query the genomic data. The pharmacogenomics data used in this study is rather small in size compared to other common types of genomic data types. In [56] , the authors explored storing larger data files, specifically Sequence Alignment Map (SAM) files, which can be in the order of 10s of Gigabytes in size. This was achieved with a novel data structure that was built with the addition of data compression techniques and a private blockchain network. We found a small number of papers with a focus on using blockchain as means to provide and revoke access to genomic data in the form of consent management. Dwarna [84] is a web portal that harnesses blockchain for dynamic consent. The portal connects participants and researchers in a research partnership. The project incorporates GDPR and gives ownership of the data to the participants. The proposed architecture uses blockchain to record participants' consent. Storing consent in blockchain allows the participants to be the owner of the data i.e. third parties can only access the data when the owner of the data allows them to. In [16] , the authors consider consent for sharing individual genomic data as an instance of the Multi-Stakeholder Consent Management (MSCM) problem. This is due to the fact that each individual genome can reveal information not only about the owner of that genomic data but also about relatives. Therefore, to protect the privacy of the relatives of an individual, their consent must be taken into account. 
The authors in [16] , propose the use of blockchain to solve this consensus problem and obtain consent from multiple stakeholders. Another set of papers focused on the value of blockchain in building a global logging system. The 2018 IDASH competition [72] and specifically, Task1 of the competition explored the use of blockchain as a global logging system. Such a system can be used to provide an access log that records users' access to any data within any of the genomic data repositories in the system. A decentralized cross-site logging system has many advantages over traditional centralized internal logs that are currently common in practice. Most importantly, it eliminates the problem of a single point of failure and malicious changes to the logs. There were several participants [55, 82, 97, 101] in the competition and the submissions were evaluated based on specified criteria which include accuracy and speed. This is because the competition not only looked at the feasibility of blockchain as a cross-site logging system but also evaluated its performance and efficiency. The winner of this competition [55] showed that it is indeed feasible to utilize blockchain in building a cross-site genomic data access log. The performance of that solution is promising, and it is reasonable to assume that with additional improvements, such a system can be adopted for practical use. To address RQ3, we first identified the key blockchain features that are most desirable in genomic applications. These key features are incentives, decentralization, control of data, immutability, smart contracts, reliability, availability, transparency, and traceability. The motivation is different for each genomic application and it depends on the outlined requirements that are listed in each paper. Figure 6 shows the frequency of each key blockchain feature that motivated its use in the selected papers. Most papers list multiple benefits of using blockchain. 
The most highlighted feature is that it is an immutable and tamper-proof way to store data. In addition, decentralization and control of the data are highly mentioned benefits. The rest of this section provides a summary of the motivations to apply blockchain in the selected studies as described by the authors. A. Incentive (Cryptocurrency). An important mentioned benefit of using blockchain is the ability to build an incentive structure for sharing genomic data. This is especially relevant for genomic marketplaces where the objective is to create a fair ecosystem for the exchange of private data. The fairness is defined in terms of financial gain from data sharing (by data owners), and the aim is to allow the exchange of data for scientific research or other purposes without losing full control. Blockchain provides the required incentive structure for distributing genomic data in exchange for cryptocurrency or tokens. Additionally, an incentive scheme can be used to reward nodes in that network after completing a certain task such as sequence (HTS) read mapping in [60] . Any individual or organization is able to freely join and perform analysis tasks to gain tokens that could be redeemed for monetary value. B. Decentralization. The decentralized nature of blockchain networks was listed as an important feature in several papers. The consensus mechanism contributes greatly to the way blockchain is decentralized. It introduces a way for nodes in the network to reach an agreement without a central trusted authority. Decentralization provides several benefits depending on the use-case. In [52] decentralized open access is achieved by using blockchain combined with IPFS. Timely distribution of medical resources plays a significant role in the research and development of medical treatment especially in the event of a disease outbreak such as COVID19. 
However, technical limitations in applying reliable decentralized technologies are one of the barriers to achieving this, and they have kept such valuable data in silos and behind centralized servers. The authors in [52] claim that blockchain fills that gap and provides a reliable decentralized data sharing system. Decentralization is also useful in other cases such as the coordination of analysis tasks that are performed at multiple nodes/locations. As shown in [69, 70, 71, 73], blockchain can replace the need for a central server to intermediate the process of applying machine learning and combining the global model. This prevents the single point of failure/control that arises when a third party is used to coordinate, and removes the potential for this third party to breach the privacy of data by examining the aggregated statistics. A similar approach is shown in [135], where the motivation for using blockchain is to coordinate the process of performing GWAS studies and guarantee the authenticity and confirmation of all activities (transactions) within the decentralized network. C. Control of Data. Control of genomic data should ideally be given to the owner of the data (the patient) or a trusted third party acting on behalf of the owner, such as a doctor. The necessary consent and access management mechanisms in current centralized systems require considerable time and effort. A set of papers list the ability to control data as the main motivation for using blockchain. This is especially evident in genomic marketplaces that claim to allow individuals to control who has access to their data and for what purpose. As mentioned previously, there are also proposals (e.g. [85]) in which patient consent is stored on the blockchain to empower patients and enforce their control over their own data. D. Immutability. Immutable and tamper-proof data storage is the most desired property of blockchains in the selected papers.
The immutability property of blockchain prevents the loss and alteration of data records, which is essential in most genomic applications. The tamper-proof data structure of the blockchain, which relies on cryptographic hash pointers, prevents both accidental and intentional data tampering. Any change to a confirmed block would make the blockchain inconsistent and can be discovered by any node in the network. This ensures a reliable and consistent shared ledger among untrusted or semi-trusted parties in the network. Depending on the application requirements, on-chain storage can be leveraged to store data that needs to persist, such as recorded consent [85] and audit trails [55, 82, 97, 101]. However, data privacy must be carefully considered, since the immutability property applies to on-chain data that is shared across all nodes and can be openly viewed if not encrypted. E. Smart Contracts. Genesy [20] relies on smart contracts to allow access to the data and transfer the payment to the data owners. F. Reliability, availability, transparency and traceability. A small percentage of papers specifically listed these blockchain properties as the main motivation for using blockchain. Reliability and availability are essential in certain applications, such as the online model learning in [70], which requires data to be highly available to all nodes in the network. The importance of transparency and traceability is mostly highlighted in applications where the data owners are informed of how the data is accessed and by whom. These blockchain properties can be exploited to give data owners control and in turn gain their trust. For instance, [85] emphasized the importance of transparency and argued that patients are more willing to contribute their genomic data for research purposes when they are informed about the use of their data. There are various platforms [74], storage systems [13], and privacy-preserving techniques [15] tailored to blockchain-based genomic data solutions.
These technologies can be combined in different ways to deliver an application. In this section, we look at the different approaches used in genomic blockchains in order to answer RQ4. First, we present a general architecture that covers most of the existing genomic blockchain systems. We then use this architecture to guide our presentation of the discussed work. Most previous work, such as [133, 137], has discussed the architecture of blockchain in a general way, while other work has discussed the architecture for a specific application such as IoT [75]. In this section, we present an application-specific architecture tailored to blockchain-based genomic data applications. Blockchain technology has been used for a variety of genomic applications, as listed in the previous sections. Each of the selected studies has its own system architecture for the specific application at hand. However, most of the proposed solutions have architectural similarities that pave the way to generalization. In Figure 7, we present a generalized architecture for systems that use blockchain for genomic applications. This architecture is designed to summarize and cover a wide range of applications in genomics. Our proposed architecture consists of six layers: data collection layer, data storage layer, network layer, consensus layer, application layer, and presentation layer. The layers in this architecture are comparable to those in the existing blockchain literature, but with some modifications to effectively illustrate the architectural components exploited in genomic applications. The authors of [94] proposed a similar system architecture that consists of three layers: data gathering, storage, and application layer. We extended this architecture to include all layers within the blockchain. In the first layer of the architecture, we assume that each node is responsible for collecting and sorting the genomic data, which can come in different formats such as BAM or FASTQ.
These nodes in the systems can represent an individual, researchers, or organizations that want to share genomic data. Individuals can submit their own genomic data after being sequenced, which is the aim of most genomic marketplaces. Researchers and organizations can also obtain consent from patients to release their genomic data, or an anonymized version of it, to other researchers in the context of a particular disease. This is the case in [52] , where the de-identified genomic and clinical data are collected from cancer patients after consent is given. There are also other data types such as gene-drug interaction in [57] , and patient consent data [85] . After the collection, the data is sent to the next layer for storage. In the storage layer, the data can be stored in different ways depending on the requirements, which is discussed in section 4.4.3. The data is then broadcasted to the network using a specified network protocol. According to our analysis, the majority of the papers use a P2P network instead of the traditional client-server model. In the consensus layer, the nodes in the network come to an agreement on the state of the blockchain using a consensus protocol such as Proof-of-Work (PoW). The application layer is where smart contracts are written and deployed to facilitate various application functions which serve as the backend of the application. The presentation layer is responsible for interacting with smart contracts and blockchain in general. A critical step in designing a genomic blockchain system is the selection of a suitable blockchain platform that would deliver the required functionality for the intended application. In this section, we present our analysis of the blockchain platforms used in the selected studies. Our analysis is only based on solutions that included a prototype, a proof of concept, or an implementation to show the feasibility of blockchain in genomic applications. 
There are also some genomic blockchain studies that do not explicitly specify the blockchain platform or implementation, and others, especially commercial applications, that do not reveal their underlying platforms. These are not covered in our analysis. Although there are multiple types of distributed ledgers, such as Directed Acyclic Graph (DAG) and Hashgraph, blockchain is, to our knowledge, the only type discussed in the genomic literature. In addition, the majority of the papers proposed the use of either private or permissioned blockchains. Privacy, scalability, and cost are among the most cited reasons for this. The use of private blockchains lowers the risk of information leakage since data is only shared with a set of known semi-trusted individuals or institutions. In addition, private blockchains are more scalable and often use consensus mechanisms that do not require cryptocurrencies and transaction fees. Aside from custom-made blockchain implementations that are designed to fit specific application requirements, the two main platforms used in genomic blockchain solutions are MultiChain and Ethereum. All solutions submitted to the iDASH competition [55, 82, 97, 101] are based on MultiChain. The use of MultiChain in these papers allowed efficient on-chain storage of access logs. Additionally, MultiChain was used to implement ExplorerChain [69, 70] for the purpose of distributing the online machine learning models to the nodes in the permissioned blockchain network. Ethereum [27] is a blockchain platform that facilitates building smart contracts and decentralized applications (DApps) that run on the blockchain network. Ethereum focuses on adaptability and flexibility and, to achieve this, it supports a Turing-complete programming language for building smart contracts. Although the main Ethereum blockchain network is public, most of the selected papers use a private Ethereum blockchain.
In multiple genomic marketplaces such as [35, 37, 95, 127] an Ethereum-based blockchain was used. In these use-cases, Ethereum smart contracts are used to facilitate access to genomic data files and the distribution of cryptocurrency. The previously mentioned CGT framework [52] is another example solution that relies on Ethereum smart contracts for the distribution of genomic data. On-Chain Storage. Storing data on-chain is achieved by simply adding the data (in binary format) to the transaction which effectively makes it part of the chain itself. This will eventually make the data immutable and highly available as the transaction will be distributed to all nodes in the network. However, some blockchains, especially public blockchains have a strict limit on the size of each transaction making it difficult to store large data files. This is due to the fact that each full-node needs to have enough resources to store the ever-increasing amount of data being generated. On-chain data are also publicly accessible to all nodes in the network and therefore privacy of the stored data must be considered. On-chain storage is most suitable for small data types that require immutable and tamper-proof storage. Data types that are commonly stored on-chain are meta-data and small genomic data. Small data types such as audit trail and observations of gene-drug interactions can be effectively stored on-chain as shown in [55, 57, 82, 97, 101] . However, the majority of the papers only use on-chain storage for metadata [52, 69, 70, 73, 85, 98] . On the other hand, whole-genome data which are stored in files such as BAM or VCF are large in size and difficult to store on-chain. There are attempts to store this data on-chain, such as genecoin [47] , and SAMchain [56] . Genecoin [47] sends DNA kits to customers, and after sequencing the DNA sample by a third-party sequencing facility, the extracted full-genome data is encrypted and stored in the bitcoin blockchain. 
While this approach is technically possible, it is not practical for public blockchains. Speed and scalability are severely affected by the need for each node to replicate these large data files. On the other hand, SAMchain [56] uses a private blockchain (MultiChain) to store and share sequence alignment maps on-chain using nested database indexing and compression techniques. Indeed, one of the main motivations of this paper is to prove that efficient storage and analysis are possible with a private blockchain. Off-Chain Storage. The practical limitations of on-chain storage can be overcome by utilizing off-chain storage. In general, most off-chain storage techniques involve hashing (a piece of) data, which results in a small string that can be efficiently stored in the blockchain transactions or in a smart contract [13]. The actual data is then stored in a local centralized storage system or may be replicated between multiple nodes. Smart contracts and distributed hash tables (DHT) are two of the most common approaches for off-chain storage. Smart contracts can specify what the data is, who has access to it, and where it is stored. A DHT, on the other hand, is a network of storage nodes with an index that is itself distributed across the nodes. The index stores the information that points to where a specific piece of data is stored. The storage nodes can either store the data entirely or a piece of the distributed data. One of the most popular DHT-based solutions is the InterPlanetary File System (IPFS) [61]. An example is the CGT framework, which enables sharing of large genomic data in a distributed environment through the use of off-chain storage, namely IPFS [61]. Data is stored on the IPFS servers and only a strong hash (SHA-256) is added to the blockchain. The hash uniquely identifies the entire state of all data submitted by the steward at that exact point in time. Similar hashing of raw data is used in [136] to validate that off-chain data has not been tampered with.
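The hash-pointer pattern described above can be sketched in a few lines. This is an illustrative toy, not the CGT, Genie, or IPFS implementation: the in-memory dictionary `off_chain_store` stands in for IPFS or cloud storage, the list `ledger` stands in for blockchain transactions, and the function names `publish` and `verify` are hypothetical. Only the SHA-256 digest is recorded "on-chain"; the payload stays off-chain and is re-hashed on retrieval to detect tampering.

```python
from __future__ import annotations

import hashlib

# Hypothetical stand-ins: a dict for off-chain storage (e.g. IPFS or a cloud
# bucket) and a list for the sequence of on-chain transactions.
off_chain_store: dict[str, bytes] = {}
ledger: list[dict[str, str]] = []


def publish(data: bytes, owner: str) -> str:
    """Store the payload off-chain and record only its SHA-256 digest on-chain."""
    digest = hashlib.sha256(data).hexdigest()
    off_chain_store[digest] = data
    ledger.append({"owner": owner, "sha256": digest})
    return digest


def verify(digest: str) -> bool:
    """Re-hash the off-chain payload and compare it with the on-chain digest."""
    data = off_chain_store.get(digest)
    return data is not None and hashlib.sha256(data).hexdigest() == digest


if __name__ == "__main__":
    pointer = publish(b"@read1\nACGTACGT\n+\nIIIIIIII\n", owner="steward-A")
    print(verify(pointer))            # payload matches the recorded digest
    off_chain_store[pointer] = b"tampered"
    print(verify(pointer))            # mismatch reveals the modification
```

The on-chain record stays a fixed 32 bytes (64 hexadecimal characters) regardless of the size of the genomic file, which is why this pattern sidesteps the transaction size limits discussed for on-chain storage.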
Another notable off-chain storage technique, employed by Genesy [20] and CrypDist [98], is to use cloud storage and link the data to the data owners through cryptographic pointers stored in the blockchain. A transaction containing a hash pointer to the data in the cloud is created for every file. In this scenario, large BAM files are stored off-chain, while access metadata and patient metadata, such as phenotypic and environmental data, are stored on-chain. Compression. Data compression is an alternative approach to overcome the data storage limitations in blockchains. Compression reduces the dependence on large storage and the time/cost of transmitting large genomic data. The authors of [78] proposed a lossless compression algorithm called Blockchain Applied FASTQ and FASTA Lossless Compression (BAQALC) that provides efficient storage and transmission of next-generation DNA sequence data on a blockchain network. In addition to the previously mentioned PetaGene [102], which claims to achieve 60% and 90% savings for BAM and FASTQ formats respectively, there is the MPEG-G [91] initiative. While blockchain-relevant storage systems such as IPFS may have their own compression methods, it is reasonable to expect that more and more compression techniques specific to genomic data, as in [78], will be employed in blockchain storage systems themselves. In genomic applications, the privacy requirements vary depending on the data type. Somatic variants are generally considered to be non-private and do not require any privacy protection. On the other hand, germ-line variants are private; privacy protection is therefore essential and is enforced by existing regulations such as HIPAA and GDPR. In addition, genomic and medical information (extracted from EHR) are often combined for genotype-phenotype analysis.
This creates another privacy concern because, when multiple data points are combined, there is a risk of correlation attacks, and patient identity can be revealed by combining multiple identifying data points [93]. Given these privacy concerns, it is necessary to look closely at how privacy is currently achieved in genomic blockchains. In this section, we look at the security and privacy aspects of genomic blockchains. The majority of the papers rely on the security of the cryptographic protocols employed in the basic blockchain implementation, which include hash pointers, Merkle trees, digital signatures, a public key infrastructure (PKI), and a consensus protocol [133]. As previously discussed in Section 3.2.4, the combination of these techniques provides a robust decentralized system that can withstand malicious tampering. Privacy through data anonymization is one of the approaches used in the genomic blockchain literature. Both CrypDist [98] and CGT [52] achieve privacy by sharing only somatic variants and removing the personally identifiable patient data and private germ-line variants. While anonymization provides a certain level of privacy, it does not guarantee protection against future re-identification attacks [106]. Another approach to privacy is the use of private blockchains. Gursoy et al. [56] use a private blockchain that requires permission to access the data within the blockchain. With controlled access, the number of security issues is limited. According to the authors, it is also possible to store homomorphically encrypted data in SAMchain, which allows privacy-preserving computation on the data. However, the efficiency of this was not addressed. The studies in [69, 70, 71] address the privacy concerns of sharing patient data by distributing machine learning models to multiple institutions rather than sharing the data itself.
The authors use blockchain to coordinate the process of distributing the model instead of a central server that could potentially breach the confidentiality of the data. However, the authors point out that risks of re-identification still exist and that further privacy-preserving techniques such as differential privacy are required for optimal privacy protection. Genie [136] presents a blockchain-based solution to AI model training with the added security of a trusted execution environment, namely Intel Software Guard Extensions (SGX). The secure enclave is used to train the models, and privacy is therefore preserved by protecting the raw data while still allowing the sharing of insights from it. These security and privacy protection techniques are combined with the transparency, control, and verifiability of blockchain in the proposed solution. Zhang et al. [135] provided a blockchain approach to perform GWAS studies with privacy protection. The analysis is done through third parties, which provide the computing resources required to perform the analysis. Privacy preservation is achieved through a novel gene fragmentation framework. In the proposed framework, the gene sequence of one individual is fragmented into pieces and distributed to analysis nodes, which run a specified analysis on the given part of the data. The fragmentation lowers the probability of re-identification in each analysis node and is intended to make each fragment of the data unidentifiable. The system in [54] adds multiple cryptographic techniques to the data-sharing system. Homomorphic encryption is used to encrypt the data and process it in encrypted form. Differential privacy is also used to add another layer of privacy and prevent re-identification of individuals. This is done by adding noise to obfuscate the query results. The following is a list of current challenges and limitations in applying blockchain in genomics, which answers the predefined RQ5.
This has been noted as one of the major challenges in implementing solutions in most papers. Instability and lack of user-friendly interfaces are major barriers to public adoption and therefore limit the use of blockchain to tech-savvy individuals. For the adoption of blockchain in genomics, organizations need to integrate and connect blockchain platforms with existing non-blockchain platforms. This problem is aggravated by the wide range of blockchain implementations that are not necessarily compatible with each other. Interoperability between blockchains reduces the dependence on a single blockchain platform. Most of the presented solutions in genomics rely on a specific blockchain platform and its features. A multi-blockchain approach, which does not rely on a specific blockchain platform, would provide better scalability and remove the security risks associated with the platform used. However, this approach is currently complicated and involves complex cross-chain communication. Research in the area of blockchain interoperability is growing and multi-blockchain approaches might be feasible in the future [12]. Blockchains that support smart contracts enable building rich applications that are not limited to financial transactions. However, the increased functionality provided by smart contracts exposes the system to more possible attacks, such as the DAO attack [88] in 2016. The number of discovered smart contract vulnerabilities is increasing [23], and they can be costly, either in terms of financial loss or data privacy loss. Smart contracts were used in a number of papers to achieve various important functionality, such as granting access to private data. One of the challenges in deploying such smart contracts to handle actual patient data is the security risk associated with them. Following best practices and performing security audits might reduce this risk, and research in smart contract security is still ongoing [23].
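As a rough illustration of why such vulnerabilities can be costly, the following Python sketch simulates a re-entrancy flaw of the kind exploited in the DAO attack. It is a toy model only: the `Vault` class, the `drain` helper, and the account names are hypothetical and are not taken from any of the reviewed papers, and a real contract would be written in a smart contract language rather than Python. The `safe` flag switches between updating state after the external call (vulnerable) and before it (the commonly recommended ordering).

```python
class Vault:
    """Toy ledger standing in for a contract that holds pooled funds."""

    def __init__(self, safe):
        self.safe = safe        # True: update state before the external call
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount == 0 or self.total < amount:
            return
        if self.safe:
            self.balances[who] = 0   # state change first
            self.total -= amount
            on_receive(amount)       # external call last
        else:
            self.total -= amount
            on_receive(amount)       # external call before the balance is cleared
            self.balances[who] = 0


def drain(vault, attacker="mallory", max_depth=10):
    """Attacker callback that re-enters withdraw() while it is still running."""
    received = []

    def on_receive(amount):
        received.append(amount)
        if len(received) < max_depth:
            vault.withdraw(attacker, on_receive)

    vault.withdraw(attacker, on_receive)
    return sum(received)


def run(safe):
    vault = Vault(safe=safe)
    vault.deposit("honest", 90)
    vault.deposit("mallory", 10)
    return drain(vault), vault.total


print("vulnerable:", run(safe=False))
print("patched:   ", run(safe=True))
```

In the vulnerable ordering the attacker withdraws far more than was deposited because the balance is still non-zero during each nested call; in the patched ordering the nested call finds a zero balance and stops.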
Even though privacy in blockchain has been studied extensively in the literature, privacy issues with blockchain in genomic applications have not been fully addressed. There is a need to examine some areas of privacy, especially the anonymity of users and re-identification through correlation attacks. While re-identification is sometimes required in research settings, it is essential to prevent disclosure of patient data for any other purpose. For instance, re-identification is required when additional materials are needed for further research into the case. However, re-identification to reveal patients' private data should be prevented. The privacy challenges in genomic blockchains include the following: (1) Identity and transaction privacy: maintaining the user's private identity and not relating it to the transaction. Correlation attacks, in which the true identity could be revealed, are a privacy challenge in using public blockchains that requires further research. Ideally, identifying a user based on specific interactions with organizations should not be possible. Ongoing research on zero-knowledge proofs (ZKPs) [100] has demonstrated the possibility of achieving this in financial applications. (2) Re-identification risks: in the case where blockchain serves as open access to genomic data for research purposes, the process of anonymizing and obtaining consent for genomic data is time-consuming and requires an honest third party, as has been observed by [52]. This is difficult to scale when the number of patients is large. Moreover, risks of re-identification associated with open data sharing are still present even after the full de-identification process, as has been shown in [106]. While this process might follow the best practice in anonymizing the data, there is still a risk that more advanced re-identification attacks might emerge in the future.
Therefore, it is an open question whether other privacy-preserving mechanisms can be applied to ensure privacy against future attacks. Replicating a study to determine its validity is essential in genomics and in research in general. Researchers or auditors would want to find exactly the same data, unchanged, after a number of years in order to replicate the study. In cases where blockchain facilitates data sharing for scientific studies, it is important to address this since data in a decentralized network reside in multiple storage locations. This problem manifests especially in solutions that use off-chain storage. The data should follow the FAIR principles; however, there seems to be a lack of focus on this aspect in genomic blockchains. In addition, data redundancy might occur where patients exist in multiple organizations with different assigned IDs. A challenge with decentralized sharing/analysis of genomic data is the possibility of publishing duplicate data, where the same patient data is shared but with different anonymous identities. This is a problem especially if this data is intended for research purposes, as it can affect the validity of the study. The Cancer Gene Trust [52] highlighted this limitation and tried to eliminate this problem with a rule-based scoring system and two reviewers. There are existing techniques to identify duplicate records in different databases, such as [76], which performs privacy-preserving record linkage on several databases using secure multiparty computation. However, further research is still needed to address this problem in a decentralized blockchain network. 4.5.6 Verifiability. One of the issues associated with distributing processing tasks to untrusted parties is verifying the accuracy of the results. One possible solution is to outsource the same analysis task to multiple analysis nodes and then compare the results to ensure correctness.
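The redundancy-based check described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a method taken from the reviewed papers: the node functions, the `alt_allele_count` task, and the strict-majority quorum are all hypothetical choices made for the example.

```python
import hashlib
import json
from collections import Counter


def digest(result):
    """Canonical SHA-256 digest of a JSON-serializable result."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def accept_by_majority(results, quorum=None):
    """Return the result reported by at least `quorum` nodes, or None."""
    if not results:
        return None
    if quorum is None:
        quorum = len(results) // 2 + 1  # assumption: strict majority
    counts = Counter(digest(r) for r in results)
    top_digest, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None
    return next(r for r in results if digest(r) == top_digest)


# Hypothetical analysis nodes: each counts alternate alleles in a genotype vector.
def honest_node(genotypes):
    return {"alt_allele_count": sum(genotypes)}


def faulty_node(genotypes):
    return {"alt_allele_count": sum(genotypes) + 1}  # deliberately wrong


genotypes = [0, 1, 2, 1]
reports = [node(genotypes) for node in (honest_node, honest_node, faulty_node)]
accepted = accept_by_majority(reports)
print(accepted)
```

Comparing digests rather than full results keeps the comparison small enough to record on-chain. Note that nodes colluding to return the same wrong answer would still pass this check.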
However, the cost of this approach can be high, especially if the same task is distributed to a large number of analysis nodes. Therefore, a practical and scalable method to verify that outsourced computation/analysis is indeed correctly computed by untrusted nodes in a blockchain network is still an open issue that requires further examination. There are significant advancements in the field of verifiable computation [132] which can be explored in a blockchain setting. Cryptographic techniques such as homomorphic encryption [49] and zero-knowledge proofs [5] can be used to maintain verifiable results. These techniques might prove to be effective in overcoming the limitations in existing work in genomic blockchains. It is challenging to ensure that the patients (data owners) are able to securely manage their keys and identity, especially with data related to individual health. Once patients have full control over their data, education mechanisms must be put in place to provide them with valuable insights regarding best data management practices. Moreover, proper key management schemes need to be put in place along with mechanisms for "break glass" access to genomic/healthcare data in emergency settings. Challenges. The rise of genomic marketplaces raises some ethical concerns, as discussed by Ahmed et al. [3] and Defrancesco et al. [33]. These authors argue that informed consent is questionable when a monetary incentive is involved and that it can lead to mindless data sharing. It is not yet known whether these financial incentives would actually work in attracting more people to share their private data for research purposes, but perhaps alternative non-monetary incentives should be explored. For instance, Mofokeng et al. [90] showed how digital collectibles can be used to incentivize citizens to participate in wildlife conservation.
The authors, in collaboration with CryptoKitties, have created a non-fungible token (NFT) and a turtle-inspired CryptoKitty. Then, they put it up for sale on the blockchain and raised 25,000 for the conservation of wildlife. The buyer holds an immutable and unique digital asset which marks their contribution to wildlife. Therefore, further research into non-monetary incentives that encourage participation in genomic research for the purpose of advancing medical research might prove to be effective. In this section, we discuss the major findings of this review, the challenges, and the limitations, and propose several future research directions in applying blockchain in genomics. Our results suggest that there is an increasing interest in applying blockchain in genomics, as the number of publications has increased rapidly each year since the inception of blockchain and the developments in genomics. The use of blockchain technology in genomic applications is becoming common in both commercial and non-commercial settings. Commercial applications focus on the need for user control (i.e. data ownership) while at the same time enabling users to profit from the use and sharing of their data. Rewarding data owners with a form of cryptocurrency is commonly used as an incentive mechanism to attract more contributions from the users. Non-commercial applications provide solutions for sharing, processing/analysis, secure storage, access control, and access logging of genomic data. These solutions aim to facilitate easy, efficient, multi-organizational genomic data sharing for the purpose of advancing genomic research. The main motivations for using blockchain that were highlighted in the papers are immutability, decentralization, and access/usage control. Our findings indicate that private or permissioned blockchains are the most common blockchain types, and Multichain and Ethereum are the most common platforms used in genomics. The storage techniques used vary depending on the requirements.
On-chain storage is mainly used for small data types or hashes/pointers of the data that are stored off-chain. Off-chain storage is often used for large data files or for data that requires strict access control. In these cases, either cloud storage or decentralized file systems such as IPFS are used. Further investigation of data compression techniques suitable for decentralized storage, sharing and processing of genomic data is needed. Since genomic data is long-lived (i.e. valid for a very long time), data security and privacy are among the principal concerns. Most of the existing papers utilize existing protection mechanisms that are part of the blockchain itself, such as consensus mechanisms and digital signatures, for the purposes of integrity, confidentiality and availability. Privacy is protected by either anonymization or the use of private blockchains with controlled access. There are, however, solutions that propose adding cryptographic techniques to further protect the privacy of the data. More recent efforts aim to perform distributed genomic analysis through the use of smart contracts, which we believe is a promising line of research due to better privacy guarantees and data retention controls. While this review aims to provide a comprehensive overview of the current state of the art, limitations still exist. The main focus of this review is to cover blockchain applications in genomics; therefore, we do not cover other healthcare-related studies in which blockchain is proposed as a solution. Our search strategy aimed to capture only papers that specifically emphasize genomic data applications. We also focused on the most popular commercial genomic platforms, and therefore we only reviewed the first 200 results returned from the initial Google search. The documentation in white papers is somewhat limited and could change as the technology matures. We observed that details in some white papers change over time.
We tried to overcome this by including the details of the latest versions of these white papers. Additionally, we have not run any testing on the proposed solutions to verify the claims stated in the papers. To achieve large-scale deployments and adoption of blockchain in genomics, we point out that further exploration is needed for this technology to mature. There are multiple challenges and interesting problems in blockchain research that need to be addressed. In this section, we highlight some possible future research directions that we observed after conducting the review. Based on our results, the use of cryptographic protocols for privacy-preserving analysis and computation is limited in current blockchain-based approaches in genomics. Cryptographic protocols such as Multi-Party Computation (MPC) can enhance privacy and, when combined with blockchain and smart contracts, can help address the security, trust, and verifiability issues in distributed analytics [138]. Current research in this field has shown the feasibility of performing distributed privacy-preserving analysis on blockchain [51, 139]. Researchers can further investigate and develop new approaches for genomic applications. Furthermore, we observed that the solutions are all based on a limited number of blockchain platforms. While these platforms are popular and have proved to be effective, experimenting with other emerging blockchain platforms might show some interesting results. Another point to consider is that all papers included in this review propose the use of blockchain and not DLT. The use of DLT has been proposed for other applications such as IoT, but it seems to be missing in the genomic literature. Further research into the use of DLT is required to assess the feasibility of this technology in genomics. Attribute-based encryption (ABE) is yet another powerful technique that can be leveraged to enhance the security and privacy in blockchain solutions.
With ABE, users whose secret keys encode certain attributes can decrypt data that was encrypted with matching attributes. [121] proposed the use of ABE in verifying the authenticity of electronic health records (EHR). However, the combination of ABE with blockchain has not been utilized in current genomic applications, and future research can explore this cryptographic method, which remains largely unexplored in the blockchain literature. For example, ABE can be used to grant specific attributes to certain nodes or users in the blockchain network, which in turn allows them to access or perform specific tasks on genomic data. Off-chain computation is another opportunity for future research. Recent development in moving smart contract computations off-chain is promising and would open the door for more applications to be built on the blockchain. There is incentive-based verification of off-chain computations, such as Arbitrum [65], and other approaches based on cryptographic methods such as zero-knowledge proofs [39]. We believe there is an opportunity for researchers in genomic blockchains to experiment with these methods, perform various genomic data analyses such as GWAS, and determine their efficiency and privacy. Genomic big data are known to be difficult to move, and an emerging approach is to move the analysis pipelines to the data rather than moving the data itself. With this setup, organizations control the use of the data that resides in their own data repositories while at the same time sharing valuable insights from this data with researchers. Recent work, such as [69, 70], has shown the validity of this approach. However, this is limited to model training. Further research into the applicability of using blockchain for distributed analytics might also prove to be effective. We also see a lack of focus on the aspect of trust and the emerging technologies that can be used to build it, such as Decentralized Identifiers (DIDs) [104] and Verifiable Credentials (VCs) [115].
Blockchain can be utilized to build a decentralized trust infrastructure, which can help in addressing other challenges such as data integrity and privacy. Following the trust-over-ip principles [32] in genomic applications can provide a layer of trust when sharing data for research purposes, for instance, by sharing data only with those whose credentials match specific criteria. The use of blockchain in genomics is still in its early stages and there are many other use cases to be explored. We expect that the development of blockchain-based solutions would change the current genomic ecosystem. As one of the aims of using blockchain is to empower patients to control their data, their role in data sharing will be significant. Increased trust and automated processes provided by blockchain and smart contracts would scale the amount of data being shared. In addition, blockchain allows the design of incentives that can facilitate sharing, storing, and processing genomic data in a fair way and for the ultimate purpose of advancing our knowledge of the human genome. There is an increasing number of papers proposing blockchain-based solutions to enable the storage, sharing and processing of genomic data. A similar trend can be observed from the large number of commercial/non-commercial blockchain applications that aim to enable genomic data exchange. In this paper, we provided a comprehensive overview of the existing efforts in this area. Our study employed a taxonomy in which genomic applications of blockchain are classified into commercial and non-commercial applications. Non-commercial applications have been further categorized according to their specific goals, namely data sharing, analysis, secure storage, access control, and logging/auditing.
After providing certain details about each application, we described the advantages and drawbacks of the proposed approach and provided a comparison between the proposals concerning the blockchain platform selection and how data is stored, shared, and protected in each proposal. Software instability, interoperability, and the security risks associated with the rigidity of smart contracts are some of the highlighted challenges. Another challenge is protecting the privacy of the data and the identities of the data owners. Privacy-enhancing technologies such as those mentioned earlier can be beneficial when applied in these settings. Our results suggest that immutability and decentralization are the main motivations for blockchain use in this context. We observed that empowering data owners to control their data is a common argument in the papers. In most papers, blockchain is used to give control of the data to organizations or individuals (patients). We also observed that the applications or use-cases of blockchain in genomics are rather limited compared to financial applications, although there is a huge potential in exploring alternative use-cases. We identified a list of open issues/challenges from the perspective of the existing proposals (Section 4.5). We recommend experimenting with blockchain-based distributed analytics (i.e. processing) in genomics. In addition, evaluating the performance of various privacy-enhancing technologies such as homomorphic encryption, multi-party computation, zero-knowledge proofs, and off-chain computation can shed some light on the feasibility of privacy-preserving distributed analytics in blockchain networks. Another promising future direction is exploring blockchain-based trusted and verifiable access to genomic data. The trust-over-ip principles [32] can be utilized to create and manage decentralized identities for researchers.
Our general impression is that genomic applications on blockchain are still in their early stages of research and development and require further transformation both socially and technologically in order to be adopted. Recent efforts to promote the use of blockchain in genomics such as [59] can help accelerate the adoption of this technology by showcasing the potential and feasibility of using it in various genomic applications. 23andMe: DNA Genetic Testing & Analysis Blockchain technology in healthcare: a systematic review DNA Data Marketplace: An analysis of the ethical concerns regarding the participation of the individuals Privacy preserving processing of genomic data: A survey A certifying compiler for zero-knowledge proofs of knowledge based on -protocols 23andMe competitor Veritas Genetics slashes price of whole genome sequencing 40% to $600 Building the foundation for genomics in precision medicine Towards precision medicine Protecting and Evaluating Genomic Privacy in Medical Tests and Personalized Medicine Privacypreserving techniques of genomic data-a survey The swirlds hashgraph consensus algorithm: Fair, fast, byzantine fault tolerance 2020. A survey on blockchain interoperability: Past, present, and future trends Blockchain-based decentralized storage networks: A survey Emerging technologies towards enhancing privacy in genomic data sharing Privacy-preserving solutions for Blockchain: review and challenges Multi-Stakeholder Consent Management in Genetic Testing: A Blockchain-Based Approach Mixcoin: Anonymity for bitcoin with accountable mixes Architecture of the hyperledger blockchain fabric Exploring the ambient assisted living domain: a systematic review The Genesy Model for a Blockchain-based Fair Ecosystem of Genomic Data Precision medicine: functional advancements. 
Annual review of medicine Group signatures A survey on ethereum systems security: Vulnerabilities, attacks, and defenses European Commission General Data Protection Regulation Genetic Information and Nondiscrimination Act What is Ethereum? Retrieved 05.11.2020 from Consensys. 2021. Quorum blockchain platform A FAIR guide for data providers to maximise sharing of human genomic data A new wave of genomics for all ReGene: Blockchain backup of genome data and restoration of pre-engineered expressed phenotype The trust over ip stack Your DNA broker IMEC-Int News release -New genome analytics platform makes clinical genomics affordable for daily use in hospital Nebula Genomics: Blockchain-enabled genomic data sharing and analysis platform Manual for Using Homomorphic Encryption for Zokrates-scalable privacy-preserving off-chain computations Digital transformation and governance innovation for public biobanks and free/libre open source software using a blockchain technology The Exonum platform Federated discovery and sharing of genomic data using Beacons MultiChain Private Blockchain -White Paper In genetics, context matters Non-interactive verifiable computing: Outsourcing computation to untrusted workers Privacy-Preserving Statistical Analysis of Health Data Using Paillier Homomorphic Encryption and Permissioned Blockchain Blockchain-Authenticated Sharing of Genomic and Clinical Outcomes Data of Patients With Cancer: A Prospective Cohort Study Accelerating genomic data generation and facilitating genomic data access using decentralization, privacy-preserving technologies and equitable compensation Using blockchain to log genome dataset access: efficient storage and query Storing and analyzing a genome on a blockchain Using Ethereum blockchain to store and query pharmacogenomics data via smart contracts A systematic review of the use of blockchain in healthcare IDASH. 2021. 
IDASH PRIVACY & SECURITY WORKSHOP Coinami: a cryptocurrency with DNA sequence alignment as proof-of-work IPFS Powers the Distributed Web An AI driven Genomic Profiling System and Secure Data Sharing using DLT for cancer patients Application of a Blockchain Platform to Manage and Secure Personal Genomic Data: A Case Study of LifeCODE. ai in China Rethinking the ethical principles of genomic medicine services Arbitrum: Scalable, private smart contracts Blockchain smart contracts: Applications, challenges, and future trends. Peer-to-peer Networking and Applications Systematic literature reviews in software engineering-a systematic literature review. Information and software technology Core Concepts, Challenges, and Future Directions in Blockchain: A Centralized Tutorial The anatomy of a distributed predictive modeling framework: online learning, blockchain network, and consensus algorithm EX pectation P ropagation LO gistic RE g R ession on permissioned block CHAIN (ExplorerChain): decentralized online healthcare/genomics predictive model learning Fair compute loads enabled by blockchain: sharing models by alternating client and server roles 2020. iDASH secure genome analysis competition 2018: blockchain genomic data access logging, homomorphic encryption on GWAS, and DNA segment searching Privacy-preserving model learning on a blockchain network-of-networks Comparison of blockchain platforms: a systematic review and healthcare examples Bin Xiao, Songtao Guo, and Yuanyuan Yang. 2020. 
A survey of IoT applications in blockchain systems: Architecture, consensus, and traffic modeling Privacy-preserving record linkage in large databases using secure multiparty computation Barriers to accessing public cancer genomic data BAQALC: blockchain applied lossless efficient transmission of DNA sequencing data for next generation medical informatics From genetic privacy to open consent Efficient logging and querying for blockchain-based cross-site genomic dataset access audit Fit-for-purpose?'-challenges and opportunities for applications of blockchain technology in the future of healthcare Dwarna: a blockchain solution for dynamic consent in biobanking Dwarna: a blockchain solution for dynamic consent in biobanking Immutable DNA Sequence Data Transmission for Next Generation Bioinformatics Using Blockchain Technology Anonymous CoinJoin transactions with arbitrary values Understanding a revolutionary and flawed grand experiment in blockchain: the DAO attack A scientometric review of genome-wide association studies Future tourism trends: Utilizing non-fungible tokens to aid wildlife conservation A peer-to-peer electronic cash system Privacy in the genomic era Research Opportunities for E-health Applications with DNA Sequence Data using Blockchain Technology The Zenome Project: Whitepaper blockchain-based genomic ecosystem Department of Health & Human Services Leveraging blockchain for immutable logging and querying across multiple sites Realizing the potential of blockchain technologies in genomics Securing Genomics Data Using Blockchain Technology Non-Interactive Zero-Knowledge for Blockchain: A Survey Decentralized genomics audit logging via permissioned blockchain ledgering International data-sharing norms: from the OECD to the General Data Protection Regulation (GDPR) How to leak a secret Estimating the success of re-identifications in incomplete datasets using generative models How blockchain technology can change medicine Coinshuffle: Practical 
decentralized coin mixing for bitcoin Do patients and research subjects have a right to receive their genomic raw data? An ethical and legal analysis Blockchain-based platforms for genomic data sharing: a de-centralized approach in response to the governance problems Privacy risks from genomic data-sharing beacons Nazar Zaki, and Fida Dankar. 2020. A Layered Blockchain Framework for Healthcare and Genomics High-performance integrated virtual environment (HIVE) tools and applications for big data analysis Privately computing set-maximal matches in genomic data Verifiable credentials data model 1.0. W3C, W3C Candidate Recommendation Stellar -An Open Network for Money Big data: astronomical or genomical? Strand NGS: Guide to Storage and Computation Requirements Stripe -A Complete Payments Platform Conceptualizing blockchains: characteristics & applications A decentralizing attribute-based signature for healthcare blockchain Formalizing and securing relationships on public networks Proof of disease: A blockchain consensus protocol for accurate medical decisions and reducing the disease burden Beyond Data Markets: Opportunities and Challenges for Distributed Ledger Technology in Genomics Distributed Ledger Technology in genomics: a call for Europe European Union. 2020. 
Beyond 1M Genomes Privacy Laws, Genomic Data and Non-Fungible Tokens The FAIR Guiding Principles for scientific data management and stewardship How Drug Companies Are Using Your DNA To Make New Medicine A survey of distributed consensus protocols for blockchain networks A survey of verifiable computation Security and privacy on blockchain Secure distributed genome analysis for GWAS and sequence comparison computation Enabling privacy-preserving sharing of genomic data for GWASs in decentralized networks Genie: A Secure, Transparent Sharing and Services Platform for Genetic and Health Data Blockchain challenges and opportunities: A survey Secure multi-party computation on blockchain: An overview Using Secure Multi-Party Computation to Protect Privacy on a Permissioned Blockchain Directed Acyclic Graph as Tangle: an IoT Alternative to Blockchains Design of Truly Distributed Storage for Large Medical Datasets