key: cord-0077826-xl7zd6dd authors: Lin, Hao; Li, Xiaolei; Gao, Haoyu; Li, Jie; Wang, Yongsheng title: ISC-MTI: An IPFS and smart contract-based framework for machine learning model training and invocation date: 2022-05-07 journal: Multimed Tools Appl DOI: 10.1007/s11042-022-13163-w sha: 759acd6ace9def1063ec0f457b77ff97b30ffe51 doc_id: 77826 cord_uid: xl7zd6dd Due to centralized storage, centralization problems are common in machine learning model training and invocation, which makes train data and trained models extremely vulnerable to tampering and stealing. A safe framework for training and invoking models called ISC-MTI (IPFS (InterPlanetary File System) and Smart Contract-Based Method for Storage and Invocation of Machine Learning Mobel) is proposed in this paper. The framework uses IPFS as the storage solution, EOS (Enterprise Operation System) blockchain as the smart contract platform, RSA and AES as the implementation of encrypted communication. The Action responsible for invoking the training data and trained models in the smart contract and the model training, uploading, and invoking methods are designed. The experimental results demonstrate that ISC-MTI can improve the safety of model training and invocation with losing a little efficiency. Simultaneously, ISC-MTI can provide anti-theft model capabilities, traceability, tamper resistance, reliability, and privacy for the process. With the development of machine learning, the performance of machine learning models is getting higher [1, 14, 17] . Machine learning models are constantly being deployed to personal computers and company servers to provide public access interfaces based on MLaaS (Machine Learning as a service) for users to predict and classify their test samples [16] . One of the main obstacles to the development of machine learning is the security issues facing machine learning. Without safeguarding machine learning, machine learning is vulnerable to model theft, adversarial (escape) samples, software system vulnerabilities, training data pollution, model tampering, etc [6, 7, 12, 28, 33] . At present, centralization problems are common in the training and storage process of machine learning models, which make train data and models extremely vulnerable to tampering and stealing. Hackers can use tampered machine learning models to generate more malicious software or more deceptive vulnerabilities [13] . More seriously, private information in training data may be reversed from the tampered machine learning models [3] . Once the models are stolen, the privacy-sensitive data collected and the models trained by companies or research institutes will lose their original value, which will cause substantial economic losses to the companies or research institutes. Considering security issues facing machine learning models, we design a safe and reliable framework based on IPFS and smart contract for training and invoking of machine learning models. The framework has decentralization, resistance to deletion and tampering, traceability, and high reliability, which can effectively resist training data pollution, malicious theft to data and models, and model tampering. The main contributions of this paper are summarized as follows: -We propose a novel framework for training and invoking of machine learning called simply ISC-MTI. -In the model training phase of ISC-MTI, IPFS is used as a storage solution for training data and models. The models are uploaded to IPFS after being trained in memory with the training data read from IPFS. In the model invocation phase of ISCMTI, EOS blockchain is used to build train data and model invocation contracts, which stipulate that the users must pay tokens to apply model invocations. -Based on the custom hash algorithm provided by IPFS, we use RSA and AES for data encrypted transmission to prevent ISC-MTI from possible man-in-the-middle attacks. Moreover, we have designed the transmission rules of keys in ISC-MTI. -We validate the efficiency of ISC-MTI and conduct multiple safety analyses. The remainder of this paper is organized as follows: the related works are discussed in Section 2. Section 3 presents the selection of associated technologies we used in this paper. Section 4 describes the threat model. Section 5 describes the ISC-MIT framework. Section 6 presents the experimental results and safety analysis. Finally, Section 7 draws the conclusion and future research directions. When we need to use machine learning to train a model or use a trained model, the general process is shown in Fig. 1 [24] . The general process involves three roles: data owner, model trainer, and model user. Data owners provide data to the model trainers. According to the data, the model trainers select the algorithms, parameters, and other information to train models. The machine learning algorithms will generate some model files at the end of training. The model trainers will store the model files in the local hard disks or the third-party cloud platforms and provide the user interface for model users. Model users will be charged a certain fee when using the model, which will enable the model trainers to earn back the cost they invested in training the model. In some cases, a group of people are both data owners and data trainers. But in commercial scenarios, model users and model trainers will not be the same group of people. At present, there are two main types of research about protection methods for machine learning training and invocation, namely centralized protection methods and decentralized protection methods. The centralized protection methods focus on resisting adversarial sample attacks and utilizing encryption algorithms. Khoda et al. [9] noted that the common method for defending adversarial sample attacks is to train machine learning models with adversarial samples and normal samples simultaneously. They proposed two new sample combination methods--the method based on the distance from malware cluster center and the method based on a probability measure derived from kernel-based learning (KBL). The results show that the KBL improves detection accuracy by 6%. Xu et al. [31] proposed VerifyNet. It uses key sharing protocols to protect the integrity of training samples, thereby preventing malicious adversaries from tampering with training samples and calculation results. Sun et al. [22] and Xu et al. [32] studied how to use the encrypted training data to train the models based on homomorphic encryption. The experiment verified the feasibility of their methods in terms of efficiency and accuracy. Centralized protection methods are various and logically complex. These research assumes that hackers can easily steal the models and training data. As compared to defenders, hackers will always be in an advantageous position [8] . If attackers steal the models and training data, they will be able to research more novel, effective and targeted attack methods. The decentralized protection methods store training data and model files in decentralized distributed storage [19, 25, 29] . In the traditional cloud environment, the methods of model storage rely on cloud service providers and TPAs (Third Party Auditors) [15] . Because cloud service providers and TPAs are not completely trusted, cloud servers are vulnerable to DDoS (Distributed denial of service attack), server node failures, and other reasons, the training data and trained models in the cloud have risks such as integrity destroyed, privacy leaked, and single node failure causing model unavailability. For solving these problems, scholars proposed that protect related data with blockchain. Blockchain has the characteristics of decentralization, audit-ability, and non-tampering, which can significantly improve train data and model security. Due to the need to reach a consensus on the onchain data within the entire network, blockchain is not suitable for storing large files [23] . In the production environment, data size of train data and models is vast. Enterprises cannot afford the high storage cost of blockchain. If the blockchain combines with traditional distributed file systems such as HDFS (Hadoop Distributed File System) and HBase (Hadoop Database), it will cause partial centralization and weaken the protection capabilities of the blockchain for machine learning [34] . Besides, in the Internet of Things(IoT), Internet of Vehicles(IoV), and Industrial Internet of Things, equipment security is worse and more vulnerable to attacks [11, 35] . However, calculation ability of equipment in these environments is low and blockchain platforms such as Bitcoin and Ethereum cannot work well. Therefore, there is a need to develop a framework for the protection of machine learning model as a benchmark to: -safely store training data and trained models. -protect training process and invocation process of machine learning. -apply blockchain technology without high storage cost. -do not require high calculation ability of the equipment. The research and application of the framework can effectively protect machine learning related transactions. Furthermore, it is beneficial to improve data utilization and develop data sharing. In the big data era, we believe the framework is meaningful. The training data involved in machine learning can be divided into structured data and unstructured data. Common algorithm libraries and frameworks for machine learning include Scikit-learn, TensorFlow, Keras, etc. According to algorithm libraries and frameworks, the trained model files can be saved as txt, Ckpt, pb, HDF5 (Hierarchical Data File 5) and json, which store the structure, index, internal parameters, and other data of the models. For compatibility with all file types, we must choose a decentralized file system that supports object storage to store training data and trained models. Since 2010, many file systems are developing towards decentralization, such as GPFS (General Parallel File System), Ceph, and IPFS [2, 5] . Table 1 presents the comparison of these decentralized file systems. IPFS is superior to other file systems in terms of supporting multiple OSs, complete decentralization, automatic de-redundancy, and anti-DDOS ability. IPFS protocol stack is shown in Fig. 2 . IPFS has no storage capacity limitation and provides high-throughput content-addressed storage, which is perfectly suitable for protecting training data and model files involved in the machine learning training process. When data needs to be extracted, IPFS reads the data blocks in all nodes in parallel. Therefore, IPFS can meet the read and write performance requirements of the training process. IPFS nodes are located all over the world, which can guarantee enough backups. Hackers cannot attack all nodes simultaneously, enabling IPFS to store and share a large number of data and files that are secure. Moreover, IPFS has a version control function similar to Git, which facilitates management and utility of trained models. So we choose IPFS as the storage solution for ISC-MTI. Smart Contract is a computer protocol designed to spread, verify, and execute contracts in an information-based way. Smart contracts allow trusted transactions without TPAs, which are traceable and irreversible. Blockchain makes smart contracts have the advantages of programmability, decentralization, non-tampering, and traceability. In the era of blockchain 3.0, current smart contract platforms include Bitcoin, Ethereum, Hyperledger, and EOS [4, 18, 26] . Table 2 presents the comparison of these smart contract platforms. Among papers about existing smart contracts, a good deal of equipment is used for mining [20, 30, 35] , which is not suitable for the IoT, IoV, and Industrial Internet environments. EOS has low demand for equipment calculation ability and the TPS (Transaction Per Second) of EOS is among the best in current smart contract platforms. The structure of an EOS Blockchain node is shown in Fig. 3 . Cleos is client of EOS and supports interaction with users. Keosd is a wallet component used to save account keys. Nodeos is the daemon and core of EOS nodes, responsible for generating blocks. EOS can complete the generation of a block every 0.5 seconds. The operation of transactions and smart contracts on EOS does not need to consume tokens. Instead, EOS consumes resources including bandwidth, CPU, and RAM, which makes EOS more suitable for the IoT and IoV. EOS supports issuing new tokens with eosio.token contract (no need to script new contracts) and transferring tokens to another account by the issuer's account. So we choose EOS as the smart contract platform for ISC-MTI. Since third-party servers and third-party cloud storage are untrusted, the training data and trained models in the storage are not safe. In ISC-MIT, we assume data owners are credible, which means the data provided by the data owner are all trusted data. Model trainers and model users are not completely credible, who may maliciously use data and models without any cost. Man-inthe-MiddleAttack (MITM) may be encountered during the data upload and invocation phase. Comparison of current decentralized file systems The capability of adversaries is as follows: -adversaries can infer the training data by invoking the trained model multiple times. -adversaries can steal non-volatile data such as the data in hard disk. -adversaries can intercept network data communicated between the users and IPFS. This section presents the detailed description of ISC-MTI. Figure 4 illustrates the architecture of ISC-MTI. ISC-MTI still involves three roles: data owner, model trainer, and model user. Data owners upload training data to IPFS, and IPFS will return the storage credentials of the training data (i.e., unique hash values). Model trainers read training data by smart contracts and train machine learning models in memory. After model training, data such as the structure and parameters of the models will be uploaded to IPFS. For defending MITM, model data will be encrypted by RSA encryption algorithm and AES encryption algorithm. After the model data is uploaded to IPFS, the storage credentials will also be returned. Model users read and decrypt trained models by smart contracts and private keys. Without saving the model file to the local disk, input the test samples prepared by yourself to implement the model's invocation. In ISC-MTI, no trained model files will appear in the local disk or third-party cloud storage. The design maximizes the security of machine learning models by mandatory decentralization and non-tamperable features of IPFS [21] . If there is a need for higher security, the invocation results can be stored in the IPFS and the keys of encryption algorithms can be saved on the blockchain. In ISC-MTI, the invocations of training data and models are completed by Transactions. Since we adopt EOS, a Transaction involving training data and trained model invocation consists of one or more Actions. After the user submits a transaction, the corresponding Fig. 4 Architecture of ISC-MTI smart contract is responsible for processing the transaction in the WASM virtual machine. The execution process of the smart contract is shown in Fig. 5 . The enforceable and non-tampering feature of smart contracts can protect the invocation process of training data and models. These smart contracts will be deployed to an administrator account. Administrators need to use eosio.token to issue their own tokens. The standard for issuing tokens is ERC20. The model trainers and model users can purchase the invocation right of training data and trained model by transferring tokens. The pseudo code of Action responsible for invoking resources in the smart contract is shown in Algorithm 1. 1 When a user invokes resources, the user needs to send the Resources Invocation Action to the smart contract deployed on the administrator account. Resources Invocation Action stipulates that users must pay enough tokens to the model trainers to get the storage credentials of the resource. If the transfer fails, the corresponding error message will be thrown and the transaction will be interrupted. The storage credentials in IPFS can be generated by user-defined hash algorithms. But the hash value is still transmitted in plaintext. Attackers can steal or destroy the hash value by MITM to steal or destroy training data and trained models [32] . MITM for IPFS includes Network Sniffing, Domain Name System Attack, and Address Resolution Protocol Attack, as shown in Fig. 6 . To solve the problem, we use RSA encryption algorithm and AES encryption algorithm to encrypt training data and models. The key of RSA is long and has good security, but the computation of RSA is large. AES has low memory requirements and fast encryption speed. Considering the efficiency of model training, we use AES to encrypt data and use RSA to encrypt AES's key. The storage credentials of resources were encrypted by payers' public key. Assuredly, payers can decrypt them by own private key when they use the resources. We all know that the data security in hard disk is much lower than that in memory. In order to prevent the model files from being saved to the local disk, different training, uploading, and invocation schemes of models need to be designed in the light of model file types. Combining the content of 4.2 sections and 4.3 sections, related pseudo code about training, uploading, and invocation of models is shown in Algorithm 2. In particular, for TensorFlow, the convert method is designed according to the pb file, which is the way Google recommends to save the TensorFlow models. Due to AES is used to encrypt data, the computation complexity of algorithm 2 mainly depends on AES. Obviously, the computation complexity of AES is O(n). Where n is the length of the plaintext. We use Python and C++ to implement ISC-MTI and perform necessary performance analysis. The experiment computer has i7-8700 CPU, 16G memory, NVIDIA GT 730, 500M broadband and installed CentOS-6.10x64, python 3.7.1, go-ipfs 0.6.0. The smart contract platform is EOS test chain. For verifying the availability of ISC-MTI comprehensively, we choose multiple data and algorithms to perform experiments. The data and algorithms used in the experiment are shown in Table 3 2 . For all parameters of machine learning, we set them as the default values in the algorithm libraries. In the experiment, we verify the influences of using IPFS and smart contract for the model training process on the trained model accuracy and training efficiency. We compare ISC-MTI with the ordinary training method based on the accuracy and calculation time. The calculation time includes the time consumed by all steps such as data reading, model training, model uploading, and model invocation3. The ratio of the training set to the test set is 7:3. The number of iterations is all set to 100. In the performance experiment, the calculation formula of the accuracy for classification is as follows: Where n is the number of samples; n ture is the number of samples correctly divided. The calculation formula of the accuracy for regression is as follow Where x pred i is predictive value; x ture i is ture value. Due to the exchange between data and IPFS as well as smart contracts in ISC-MTI, some training efficiency will be lost. For measuring the loss, we define an efficiency loss rate L, which is as follow Where t I SC is the time to train a model using ISC-MTI; t ord is the time to train a model using ordinary training method (see Fig. 1 ). The results of performance analysis experiment are shown in Table 4 . The time difference between using original training method and ISC-MTI with the size of dataset and model is shown in Fig. 7 . In the case of small data size and model scale, using ISC-MTI will significantly reduce the training efficiency, simultaneously, L is very large. In the case of large data size and model scale, for example, the vehicle image data and 62259 kb model in test number 5, L will gradually decrease. This phenomenon is more obvious in Fig. 7 . Due to the computation complexity of ISC-MTI mainly depends on the encryption algorithms, the larger the model size, the more additional training time. Besides, if the experiments are conducted on the EOS permissionless blockchain, L will have a certain degree of reduction because the performance of the EOS permissionless blockchain is significantly higher than the EOS test chain. The second time-consuming factor in ISC-MIT is data uploading to IPFS. In terms of IPFS transmission speed, the maximum upload speed can reach 20.08 MB/S. The value is calculated when the file is uploaded to IPFS for the first time. Because of the de-redundancy function in IPFS, the first storage of data is the slowest. The time used to read data is very short and negligible. The result shows that ISC-MIT can meet user requirements for transmission speed. In terms of accuracy, experimental results prove that using ISC-MTI will not affect the performance of the machine learning model. The change in accuracy does not exceed 0.2%. In general, ISC-MTI is a training method that trades training efficiency for training safety and data safety--the longer time using ordinary training, the smaller negative impact of using ISC-MTI on training efficiency, which is all within the acceptable range of model trainers and model users. This experiment verifies the usability of smart contracts designed by us. In the experiment, the token name is HAPY; the resource price p r is 10 HAPY; the model invocation account is Tommy; the model training account is linhao; the account balance of tommy is 100 HAPY. After the smart contract is compiled and deployed, tommy sends Resources Invocation Action to the smart contract. The running result is shown in Fig. 8 . Since the account balance of tommy is greater than the p r , the transaction should be successful. According to the smart contract in Algorithm 1, tommy needs to transfer 10 HAPY to linhao. The transfer record is shown in Fig. 9 . It can be seen from the above test results that the smart contract designed in this article has achieved the expected goal well. Ability to prevent model theft In ISC-MTI, no model files are saved in the local disk or traditional distributed cloud storage. Smart contracts are used for mandatory access control of resources. The attackers cannot directly steal the model files. In ISC-MTI, training data and trained models are stored in IPFS in blocks. IPFS nodes are located all over the world and IPFS can ensure that the number of data backups is sufficient. Unless attackers break all nodes simultaneously, training data and trained models cannot be deleted and tampered with. The data in ISC-MTI is stored in completely decentralized storage, which can effectively resist DDOS attacks. Simultaneously, it effectively avoids the failure of single node. Since there is no need to introduce TPA for data auditing, data privacy can be guaranteed. We constructed a safe framework for machine learning training and invocation called ISC-MTI. We performed related experiments to analyze the impact of using ISCMTI on the model accuracy, training efficiency, and safety of model training. We concluded that, in the case of a slight loss of training efficiency, ISC-MTI can provide machine learning model training with anti-theft model capabilities, traceability, tamper resistance, reliability, and privacy. Potential applications Our framework can come in handy in the applications of machine learning which relying on big data. We give a few examples heuristically, as shown below -recognize COVID-19 based on Medical data sharing [27] . -road congestion prediction in IoV [10] . -user credit assessment using transaction data in the banks. We did not consider the credibility of the resources in transaction, that is, did not design the pre-buy communication mechanism between data owners, model trainers and model users. Furthermore, there is a lack of credit evaluation and incentive mechanism for data owners, model trainers and model users in our ISC-MTI. -Study the application of ISC-MTI in large-scale networks and deep models. Reduce L in the case of small data size. -Study the combination of ISC-MTI and swarm, which is a decentralised storage and communication system for a sovereign digital society. -Consider the security of smart contracts and improve the security of ISC-MTI at the contract layer. Moreover, we will focus on the dynamic security of smart contracts. Artificial intelligence and patentability: review and discussions Data storage in big data context: a survey Model inversion attacks that exploit confidence information and basic countermeasures Survey of smart contract technology and application based on blockchain Research and application progress of blockchain in area of data integrity protection Software, School Of and University, Fudan, software and cyber security--a survey A low-cost image encryption method to prevent model stealing of deep neural network 5G cyberspace security game Robust malware defense in industrial IoT applications using machine learning with selective adversarial samples MF-TCPV: a machine learning and fuzzy comprehensive evaluation-based framework for traffic congestion prediction and visualization PADL privacy-aware and asynchronous deep learning for IoT applications SafetyNet detecting and rejecting adversarial examples robustly 2020) a survey of attack, defense and related security analysis for deep reinforcement learning Robust cascaded skin detector based on AdaBoost A survey of information security incident handling in the cloud MLAas: machine learning as a service Crowd abnormality detection in video sequences using supervised convolutional neural network Blurring the lines between blockchains and database systems: the case of hyperledger fabric Utilizing block chain technology in various application areas of machine learning Monetization of IoT data using smart contracts Blockchain-based secure storage and access scheme for electronic medical records in IPFS A survey on encrypted image recognition models Survey of storage scalability on blockchain Survey on privacy preserving techniques for machine learning Decentralized transfer learning using blockchain IPFS for deep Learning Blockchain smart contract for scalable data sharing in IoT: a case study of smart agriculture Analytical review analysis for screening COVID-19 disease An intrusion detection method of data tampering attack in train control system based on KF Blockchain-based data transparency: issues and challenges Application of blockchain technology in distributed energy transaction VerifyNet secure and verifiable federated learning Multinomial logistic regression model based on homomorphic encryption Research on attacks and defenses towards machine learning systems Blockchain as a service: next generation of cloud services Smart contract-based access control for the internet of things Ethics approval This article does not contain any studies with human participants or animals performed by any of the authors. All authors declare that they do not have any conflict of interest.