Paper Title (use style: paper title) International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 69 A Secure Voice Signature Based Lightweight Authentication Approach for Remote Access Control Oladayo Olufemi Olakanmi Electrical and Electronic Engineering University of Ibadan Ibadan, Nigeria e-mail: Olakanmi.oladayo@ui.edu.ng Aminat Shodipo Embedded Systems and Security Research Group University of Ibadan Ibadan, Nigeria e-mail: Olakanmi.oladayo@ui.edu.ng Abstract—Crypto-authentication schemes become unreliable whenever the private key is compromised, making them unfit for any system or network that requires high level of confidentiality. Key compromise is inevitable due to the wider space of operability of key in most of the cryptography based authentication schemes. To improve the performance of any authentication system, there is need to narrow its key operability to the key owner, that is eliminating the influence of other parties in the key generation and operability. However, this is quite difficult especially for any network that is characterised with low computation and energy, and relied on the third party for key management. In this paper, we propose the adoption of Mel-Frequency Cepstral Coefficients (MFCC) based voice signature based authentication scheme with a taint of cryptography operation. The scheme extracts the user’s voice signature as the hash of the MFCC parameters of the voice and unique passcode which is used for the authentication. We used exclusive or to filter off remnant of noise from MFCC values without incurring extra hardware cost. The performance evaluation results show authentication accuracy of 87.8% at low computation cost and communication overhead. Keywords-MFCC, Authentication, Cryptosystem, Coding, Security, Access Control I. INTRODUCTION The adoption of cryptography in security system has not only enhanced the integrity of data and confidentiality but obviously contribute to the acceptability of most of the new technologies. However, cryptography based security schemes may not proffer universal security solution to most systems or networks. This is due to the restrictive processing power and memory resources, wider key generation procedure and operability that characterised these class of cryptograph based security solutions. Therefore, they may not be effective for some systems. Besides, key distribution is another issue in cryptography based security solutions, although public-key infrastructure can be used to eliminate problems involved with key distribution, however it comes with a lot of overheads. Therefore, it is very important to find ways to reduce the overheads yet not sacrificing other aspects of security. The major loophole of cryptography security solutions is key escrow. This is a cryptographic key exchange process in which a key is held in escrow, or stored, by a third party. The problem with this, is that a lost key or compromised by its original user(s) may be used to decrypt encrypted material. Key escrow is proactive through key disclosure laws. That is, it anticipates the need for access to keys. However, it also introduces new risks like loss of keys and legal issues such as involuntary self-incrimination which are likened to security weakness. Recently, attention has been shifted to the adoption of biometric solution to mitigate some of the security loopholes of cryptography solution. Biometric based technique may limit key generation and operability only to the users. However, some biometric based techniques have shortcomings which hinder their adoptability in access control. For example, replication is one of the major vulnerabilities of finger print based authentication systems. A few methods for fingerprint replication such as the use of grease stains on the scanner and/or latent fingerprint, the use of live finger, which is forceful amputated from the owner. MFCC is one of the popular algorithms for extracting features of speech in voice recognition. It is common to normalise their values when adopted in speech recognition systems. Efforts had been made to improve on its algorithm such as raising the log-mel- amplitudes to a suitable power before taking the DCT so as to nullify the effect of additive noise [9]. In this paper we used exclusive or (xor) to nullify the remnant additive noise in MFCC values and generates unique voice signature of the user in the authentication phase. DOI: 10.21307/ijanmc-2019-049 International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 70 II. RELATED WORK Voice recognition system has become a sophisticate security tool and a technology with great potential in authentication systems. It is the only biometric based recognition system that processes acoustic information contrary to other forms of biometrics, such as fingerprint, DNA, retina etc., that are image-based. Each human has their unique characteristic in speech and voice that can be captured and analyzed. Voice recognition system can be divided into two phases; identification and verification phases. Voice identification is used to decide who an unknown voice belongs to amongst a set of known speakers while voice verification accepts or rejects the identity claim of a speaker. Voice identification can also be sub-classified into text-dependent and text-independent identification. In text-dependent identification, the individual has to utter the same keyword both in the test and training phases. Meanwhile, text-independent identification properly identifies the speaker regardless of what is being said. Voice recognition system has a lot of applications, such as authentication in remote identification and verification, mobile banking, ATM transactions, and online transactions and reservations; information security in device logon, and application and database security; Law Enforcement such as forensic investigation, and surveillance applications. Several works had been done on voice recognition and its application to solve different access control problems. For example, Sidiqs et al. [1], proposed a speech recognition system to translate human speech to an action in machine. They used MFCC by splitting the input signal amplitudes into frames which are processed using the mel-filterbak. The results are made into a codebook which is used as an input symbol to form a model of every word. Jagan and Rameshin [3] also developed a MFCC and Dynamic Time Wrapping (DTW) based speech recognition system for feature extraction and pattern matching respectively. Drisya and Anish [2] used Bessel features as an alternative to MFCC and LPCC to develop a text-independent speaker identification system. The quasi-stationary nature of speech signal was represented by damped sinusoidal based function, and a Bessel features derived from the speech signal was used to create Gaussian mixture models for text independent speaker identification. Meanwhile, work in [4] introduced a new algorithm for extracting MFCC for speech recognition. The results showed that the new algorithm reduced the computation power and accuracy compared to the conventional algorithm. The ability to perform postural transitions such as sittostand is an accepted metric for functional independence. The number of transitions performed in real life situations provides useful clinical information for an individual recovering from lower extremity injury or surgery. Consequent to this, Sadra and Eric [5] proposed a new inertialsensor based approach to detect transitions using wavelet transform. Their approach is robust for supervised laboratory and ambient settings. Also, authors in [6] developed a speaker recognition system using a statistical model like Gaussian mixture model (GMM) to implement a recognizer. The features extracted from the speech signal were used to build a unique identity for each authorised user. Estimation and maximization algorithms were used for finding the maximum likelihood solution for a model with latent variables. Laryngeal diseases and vocal fold pathologies have strong impacts on the quality of the voice recognition system. In [7] a user friendly approach was proposed to discriminate between normal and abnormal voice. The feature extraction technique was applied on the voice signal in the time domain and in the frequency domain. Another work on voice recognition is speaker identification system proposed in [8]. In the work, a speaker recognition system was implemented using a combination of MFCC and Kekre’s Median Code book Generation Algorithm (KMCG). The MFCC algorithm was used for feature extraction while the KMCG algorithm plays important role in code book generation and feature matching. However, most of these works were directed to voice recognition systems. Our work is directed to how the voice signature can be combined with cryptography operation to evolve an efficient access control scheme with narrow operability to mitigate key compromise. III. VOICE SIGNATURE BASED AUTHENTICATION SCHEME FOR ACCESS CONTROL The voice signature based authentication scheme involves two phases; registration and authentication phases. The registration phase takes voice samples and pass codes of all the authorised individuals, processed it in order to reduce the overall bulk and complexity, then extracts MFCC (Mel-Frequency Cepstral Coefficients) before generating the voice signature. This phase is sub-divided into six divisions; elimination of silent frame, framing, hamming windowing, MFCC generation, and voice signature generation. Meanwhile, the verification phase regenerates the user’s voice signature and compare it with all the encrypted voice signature stored in the International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 71 memory of the access point’s memory using the k- means algorithm and correlation coefficient. Figure 1. Model of the remote voice based access control A. Registration Phase In the registration phase, voice samples are collected from a number of authorised users by the remote system. These voice samples are then pre- processed and their voice signature are obtained, encrypted and stored in the memory of the access control unit of the remote system. This stage consists of six sub-stages which are described below. To generate the voice signature all the stages must be executed in that order. 1) Elimination of silent frames It is pertinent to remove silent frames when processing speech in order to reduce the overall bulk and complexity of the speech signal. It is known that when humans are talking, it is very impossible not to have gaps or pauses in between words and sentences hence the importance of this phase. These gaps and pauses in the middle of speech increase the speech length or the number of frames to be processed, it therefore important to remove them. After proper studying of the spectrogram of speech, the amplitude for silent frames was pegged at 0.02 hence frames with amplitude less than 0.02 are removed. 2) Framing Framing sub-stage divides the continuous speech signal into frames of B samples, with adjacent frames being separated by A where . First frame consists of first B samples. Then, second frame begins with A samples after the first frame, and overlaps it by samples and so on. This continues until all the speech is accounted within one or more frames. Typically, B and A are chosen as 256 and 128 respectively. Figure 2 shows a speech signal after it has been framed. Figure 2. Speech signal after undergoing framing 3) Hamming Windowing After framing, Hamming windowing, as shown in Figure 4, is now used to tape the voice signal to zero at the beginning and the end, thereby reducing discontinuities in the signal. This helps to focus on the information at the centre of the frame as shown in Figure 4. For example, if the window is defined as and the speech signal as then the resultant signal after windowing is the signal defined as: . In this work, we used a hamming window of the form: (1) Figure 3. Hamming window before applying it on speech signal Figure 4. Speech signal after applying the Hamming Window International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 72 4) Fast Fourier Transform(FFT) FFT is then used to transform the speech signal from time domain to frequency domain in order to easily obtain the MFCC. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples . gives the fast Fourier transform of frame In general are complex numbers and we only consider their absolute values (magnitudes) because considering the phase produces very skewed results as shown in Figure 5 and 6. Figure 5. Audio signal with magnitude and phase Figure 6. Audio signal considering the magnitude only 5) Mel-Frequency Cepstral Coefficient Out of all the feature extraction algorithms that are available, MFCC is commonly used because of the way it closely mimics the natural human auditory system. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The best approach to simulate the subjective spectrum is to use a filter bank, spaced uniformly on the spectral properties of the signal for the given frame analysis. The Mel spectrum coefficients are converted back to time domain using the Discrete Cosine Transform (DCT ). The MFCC is calculated as: ; 0 ≤ n ≤ k – 1 (2) The first component is excluded from the DCT since it represents the mean value of the input signal and hence carries little speaker specific information [3]. Figure 7. Mel-filter bank 6) Voice Signature generation After obtaining the MFCC values, the access control unit then selects the maximum value and minimum value as the MFCC of the user, and calculate the user’s voice signature as the encryption of hash of the MFCC and passcode of the user. That is, voice signature is generated as: = (3) The operator acts as a noise filter since the two MFCC values and includes the same additive noise and xor operator is mutually exclusive. 7) Authentication Phase To access the system, the scheme requests for the user’s voice signal through the mobile device in order to re-generate an access voice signature. It then compares the re-generated access voice signature with all the voice signatures in the memory in order to authenticate the user. IV. PERFORMANCE EVALUATION A. Experimental setup One hundred and fifty two tests using 5 different users of mixed sex were used to test the efficiency of the voice based authentication scheme. Voice signatures of the five users (2 females and 3 males) saying the same sentence were extracted. International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 73 TABLE I. PERFORMANCE EVALUATION OF VOICE SIGNATURE BASED AUTHENTICATION APPROACH Trial No. of samples No.of False rejection No. of False acceptance % Accuracy User 1 30 3 1 87 User 2 30 3 2 83 User 3 30 2 0 93 User 4 30 1 1 93 User 5 32 4 2 83 Each person voice signature was matched with the rest of the voices signatures in the database. The accuracy of systems was determined in terms of false acceptances and false rejections. A false acceptance occurs when the system grants access to an unauthorised user while a false rejection occurs when the system denies the unauthorized user is granted access. The scheme was simulated on a simulation platform consisting of a mobile device (Samsung Galaxy S5 with a Quad-core 2.45 GHz processor, 2GB RAM, and Google Android 4.4.2 operating system, a PC with Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz processor as the remote access control system. We determine the computation and communication costs of the scheme. B. Result and Discussion The results of the performance of the scheme in terms of false acceptance and rejection ratios are shown in Table 1. It shows that the system is accurate with accuracy of 87.8%. Also, computation cost, in terms of the execution time by the scheme for points FFT, is obtained as shown in Figure 8. This shows that the computation cost increases as the number of users increases, and indicates that the scheme has low computation cost and can be easily adopted by any system that is characterised as computation and energy constraint system. Also, Figure 9 shows the energy consumption of the scheme in terms of number of cycles required. This also indicates that the proposed scheme is energy-aware since energy consumed by a processor is approximately proportional to number of cycles or frequency , and to the square of the processor voltage V [14]. Meanwhile, the communication overhead incurred for every authentication is 256 bits. This indicates that the scheme requires low bandwidth and there will never be congestion irrespective of the bandwidth of the communication channel. Figure 8. Computation cost (ms) Figure 9. Energy cost in terms of cycles V. CONCLUSION In this work, we demonstrated how voice signature can be used to developed access control scheme for remote system. We solved the effect of the additive noise on MFCC using to eliminate the congruent additive noise embedded in the maximum MFCC and minimum MFCC values. The MFCC, hash function and conventional pass-code are used generate voice signature from user’s voice signal. This is used to solve the problem of wider operability of key in cryptography based REFERENCES [1] Muslim Sidiq, Tjokorda Agung Budi W, Siti Saadah (2015). Design and Implementation of Voice Command Using MFCC and HMMs Method. 3rd International Conference on Information and Communication Technology ( ICoICT ). [2] Drisya Vasudev, Anish Babu (2014). Speaker identification using FBCC in Malayalam language. International Conference on Advances in Computing, Communications and Informatics (ICACCI). International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 74 [3] Jagan Mohan and Ramesh Babu(2014). Speech recognition using MFCC and DTW. 1st Int. Conference on Advances in Electrical Engineering, VIT, Vellore, India [4] Wei Han, Cheong-Fat Chan, Chiu-Sing Choy, Kong-Pang Pun (2006). An efficient MFCC extraction method in speech recognition. 2006 ISCAS, Proceedings of IEEE International Symposium on Circuits and Systems. [5] Sadra Hemmati, Eric Wade (2016). Detecting postural transitions: A robust wavelet-based approach. Proceeding of IEEE 38th Annual International Conference of the Engineering in Medicine and Biology Society (EMBC), pp. 3704-3707. [6] S. G. Bagul, R.K.Shastri (2013). Text Independent Speaker Recognition System Using GMM. International Conference on Human Computer Interactions (ICHCI), pp 1 - 5. [7] Manal Abdel Wahed (2014). Computer aided recognition of pathological voice. 31st National Radio Science Conference (NRSC), pp. 349 – 354. [8] H B Kekre, V A Bharadi, A R Sawant, Onkar Kadam, Pushkar Lanke, Rohit Lodhiya (2012). Speaker recognition using Vector Quantization by MFCC and KMCG clustering algorithm. International Conference on Communication, Information and Computing Technology ( ICCICT ). [9] Tyagi and C. Wellekens (2005). On desensitizing the Mel-Cepstrum to spurious spectral components for robust speech recognition. In proceeding of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 529-532. [10] Saqui Z., Salam N., Nair N., Pandey N. (2011) Voiceprint recognition system for remote authentication survey. International Journal of Hybrid Information Technology, Vol. 4, No.2. [11] Sunil A., Shruti A., Rama C. (2010) Prosodic Feature Based Text Dependent Speaker Recognition Using Machine Learning Algorithms. International Journal of Engineering Science and Technology, Vol. 2 No.10, Pp. 5150-5157. [12] Kirti A., and Minakshee P., (2013). Speech and speaker identification for password verification system. International Journal of Advanced Research in Electrical,Electronic and Instrumentation, Vol. 2, Issue 6. [13] Parrul, R., Dubey, B., (2012). Automatic speaker recognition system. International Journal of Advanced Computer Research, Vol. 2, No.4. [14] Wikipedia. (2003). CPU power dissipation. (http://en.wikipedia.org/wiki/CPU-power-dissipation)