BCNN: Binary Complex Neural Network
Yanfei Li, Tong Geng, Ang Li, Huimin Yu
March 28, 2021

Binarized neural networks, or BNNs, show great promise in edge-side applications with resource-limited hardware, but raise concerns of reduced accuracy. Motivated by complex neural networks, in this paper we introduce complex representation into BNNs and propose the Binary Complex Neural Network -- a novel network design that processes binary complex inputs and weights through complex convolution, yet still harvests the extraordinary computation efficiency of BNNs. To ensure a fast convergence rate, we propose a novel BCNN-based batch normalization function and weight initialization function. Experimental results on Cifar10 and ImageNet using state-of-the-art network models (e.g., ResNet, ResNetE and NIN) show that BCNN can achieve better accuracy compared to the original BNN models. BCNN improves BNN by strengthening its learning capability through complex representation and extending its applicability to complex-valued input data. The source code of BCNN will be released on GitHub.

Deep neural networks (DNNs) have recently achieved tremendous success in many computer vision applications. Aiming to replicate similar success for practical edge-side utilization, researchers are tweaking DNN models for resource-limited hardware. Binary neural networks (BNNs) [1], [2], which use only a single bit per neuron, stand out as one of the most promising approaches. To accommodate a constrained hardware budget, rather than reducing the number of neurons in a model, a BNN reduces the number of bits per neuron to the extreme: each element of the input, weight and activation of a BNN layer is merely a single binary value, representing +1 or -1. BNNs have demonstrated several appealing features for embedded utilization: (i) Execution efficiency: the natural mapping from a BNN neuron to a digital bit makes BNNs extremely hardware-friendly. From the computation perspective, every 32 or 64 full-precision dot-product calculations can be aggregated into a Boolean exclusive-nor plus a population-count operation [1], improving execution efficiency by more than 10x [3]. From the memory perspective, rather than using a 32-bit single-precision or a 16-bit half-precision floating-point value, each neuron in a BNN uses one bit, substantially improving the utilization of memory storage and bandwidth [4]. The combination of the two effects can bring over three orders of magnitude of latency reduction for single-image inference compared to full-precision DNNs on GPUs [3]. (ii) Low cost: due to simpler hardware logic (e.g., avoiding floating-point multipliers) and diminished memory demand, the hardware cost of implementing BNNs is much lower than that of DNNs [5], [6]. (iii) Energy efficiency: due to low-cost hardware operations and smaller chip area [7], BNN-based designs are very friendly to portable devices with limited battery life. (iv) Robustness: due to the discrete parameter space produced by weight binarization, BNNs show better robustness than normal DNNs [8], and certain properties can be formally verified [9], [10].
Because of these advantages, BNNs have been utilized for a variety of practical applications, such as auto-driving [11], COVID face-cover detection [12], smart agriculture [13], image enhancement [14], 3D object detection [15], etc. Although BNNs exhibit these attractive features, concerns have been raised over their reduced accuracy compared to DNNs, which is largely due to the loss of information in the binarization process and the reduced model capacity. Ever since the proposal of BNNs, continuous efforts have been made by the machine learning community to improve BNN accuracy [16]-[25], as briefly summarized in the next section.

Meanwhile, complex-valued neural networks [26] have been proposed as an enhancement of normal DNNs. Most existing DNNs adopt real-valued representations for the inputs and weights. However, considering the richer representational capacity [27], the better generalization capability [28], and the potential to facilitate noise-robust memory retrieval [29], deep complex networks have been formulated [26], in which the inputs, the weights and the outputs are all complex values. Correspondingly, the convolution, the batch normalization, the activation, etc. are reformulated as complex operations. It has been shown that complex networks can deliver comparable or even better accuracy than DNNs under the same model capacity. A particularly attractive feature of complex networks is the ability to embed phase information of the input data naturally into the network representation. Phase information is critical for deterministic signals, such as neuronal rhythms in the brain [30], polarimetric synthetic aperture radar (PolSAR) images [31], the Fourier representation of waves, speech data [32], multi-channel images like MRI [33], etc.

Considering the adoption of complex networks for terminal scenarios, in this work we propose a novel network called the Binary Complex Neural Network (BCNN) that integrates BNNs and complex neural networks. BCNN extends BNN with the richer representation capacity of complex numbers, while still conserving the high computation efficiency of BNNs for resource-limited hardware. In BCNN, the input, weight and output of a layer are all binary complex values, i.e., one of {1+i, 1−i, −1+i, −1−i}, using two bits per neuron: one for the real part and one for the imaginary part. We propose binary complex convolution, which follows the complex-number computation rules, but can still be calculated through the assembly-level xnor-popcnt machine instructions available in most hardware. Tackling the expensive computation cost of batch normalization in the original complex network [26], which involves matrix inversion and square-rooting, we propose a new batch normalization approach that significantly simplifies the computation logic while facilitating the convergence of training. Furthermore, we propose a BCNN weight initialization strategy to accelerate convergence and mitigate the chances of gradient explosion/vanishing. Evaluation results on the Cifar10 and ImageNet datasets show that BCNN can achieve better accuracy than BNNs using state-of-the-art models like ResNet [34], ResNetE [35], [36], and NIN [37]. Our contributions in this paper are:
1) We propose the concept of the binary complex number, including its dual-bit storage format and binary complex computation mechanism.
2) We propose the binary complex neural network (BCNN), including quadrant binarization and its gradient.
3) We propose a BCNN-oriented batch normalization function, significantly lowering the computation cost.
4) We propose a BCNN-oriented weight initialization function, facilitating better convergence during training.
5) Evaluations on the Cifar10 and ImageNet datasets show that BCNN can achieve better accuracy than original BNNs.

This paper is organized as follows. We summarize existing literature in Section-II, covering theoretical research about BNNs, the major approaches to enhance BNN accuracy, and complex neural networks. We present the design details of BCNN in Section-III, covering the definition of binary complex numbers (Section-III-A), the quadrant binarization function (Section-III-B), the batch normalization (Section-III-C), and the weight initialization (Section-III-D). We evaluate BCNN in Section-IV and conclude in Section-V.

We briefly discuss existing literature about BNNs and complex neural networks. The Binarized Neural Network (BNN) originally evolved from the Binarized Weight Network (BWN) [38], in which only the weights are binarized. The foundations of modern BNNs were laid by two cornerstone works [1], [2], in which the fundamental components of BNNs were proposed, including (1) the binarization function and its approximated gradient through the straight-through estimator (STE) on latent variables; (2) batch normalization, which is crucial for BNNs to be able to converge; and (3) the necessity of keeping full precision for the first and last layers. Anderson and Berg [39] later explained why BNNs can effectively approximate DNNs: (i) the binary vector obtained through binarization preserves the direction of the DNN real-valued vectors in the high-dimensional geometry space; (ii) the bit dot-product (popc(xnor())), through batch normalization, preserves the property of the original DNN dot-product; (iii) the real-valued convolution of the first layer can embed input images into a high-dimensional binary space, which can then be effectively handled by binary operations.

BNNs are generally criticized for reduced accuracy compared to their DNN counterparts because of (1) information loss due to input binarization and binary activation; (2) reduced model capacity due to weight binarization (1 bit per neuron); and (3) unsuitable network structures or training methodologies, as existing popular models and training strategies were mainly designed for real-valued DNNs. Correspondingly, existing works propose to enhance BNN training accuracy via: (1) Reducing information loss. This can be achieved by adding gain terms (i.e., scaling factors) to better approximate the DNN activation [16]. Gain terms are extracted based on the statistics of the inputs [17], [18], or gradually learned with the training [19], [20]. (2) Enhancing BNN model capacity. This is done by using multiple BNN components in the network (e.g., BENN [23] and Group-Net [22]), or by using more bits per neuron where each bit represents a basis. The basis can be fixed to powers of two (e.g., 1, 2, 4, 8, ...) [17], adjustable as a residual basis [18], [21], or even learned during training [19]. (3) Designing BNN-specific network structures and training methods.
As most existing network models were designed for DNNs, researchers started to design BNN-oriented network structures. These include ResNetE and BinaryDenseNet [24], which adopt more shortcuts for reusing information to maintain a rich information flow in the network, and MeliusNet [25], which conserves the mainstream information flow at full precision in the first 256 channels, while using a two-block BNN structure (i.e., a dense block with an improvement block) to learn and attach the learned results in a separate 64 channels. In this way, most information loss due to binarization can be avoided. Additionally, Bi-Real Net redirects the information-rich real-valued activation before binarization to the next block through a shortcut [35].

Other works concentrated on improving BNN training methodology. For example, focusing on the sign activation function and the STE gradient estimator, Alizadeh et al. showed that adapting the learning rate using second-moment methods was crucial for the successful adoption of the STE in BNN training, compared with other optimizers [40]. Darabi et al. proposed a variation of the derivative of a Swish-like activation in place of the STE mechanism to obtain more effective back-propagation functions [20]. Lahoud et al. proposed using a smooth activation function at the beginning of training and then gradually sharpening it into a binary representation equivalent to sign [41]. Hou et al. discussed loss-aware binarization, showcasing a proximal Newton algorithm with a diagonal Hessian approximation that can directly minimize the loss with respect to the binary weights [42]. Observing that existing BNNs use real-valued latent weights to accumulate small update steps, Helwegen et al. viewed the latent weights as inertia and introduced a BNN-specific optimizer called the Binary Optimizer (Bop) [43] for training. Focusing on other training aspects, Tang et al. used special regularization terms to encourage the latent floating-point variables to approach +1 and -1 during training [18], and Umuroglu et al. placed pooling before batch normalization and activation [44]. Mishra et al. guided the training of BNNs through a well-trained, full-precision teacher network by adjusting the loss function. Additional BNN works can be found in the two surveys [45], [46].

This work falls into the second category, aiming to enhance BNN model capacity by enhancing each neuron's expressibility. BCNN uses dual bits for each complex binary neuron. Nevertheless, it is fundamentally different from 2-bit quantization; each pair of bits here embeds binary complex calculation logic, which, as shown later, is capable of extracting more expressive features. To ensure fairness, in BCNN, without changing the model structure, we proportionally reduce the number of channels to ensure a consistent model size.

Complex numbers extend the one-dimensional real number line (i.e., -∞ to ∞) to the two-dimensional complex plane spanned by the real axis and the imaginary axis. Although complex numbers do not exist in the real world, their unique properties and computation rules usefully complement the representational power of real numbers, especially when phase information is present.
For example, in physics, complex numbers are more suitable for representing waves, as the coefficients are complex after the Fourier transform. In neuroscience, neuronal rhythms, which are crucial for neuronal communication, are characterized by the firing rate and the phase, and thus can be naturally expressed as complex numbers [30]. In geoscience, PolSAR images [31], [47] can offer much more comprehensive and robust information compared with pure SAR images. The scattering properties of PolSAR images can be naturally described by the complex-valued polarization scattering matrix, where the amplitude of each element corresponds to the back-scattering intensity of the electromagnetic wave from the target to the radar, and the phase corresponds to the distance between the sensor platform and the target. In biomedical science, being able to effectively handle phase information greatly facilitates MRI image reconstruction [33], [48].

Due to the richer representation and the need to process complex input signals with phases, there have long been efforts in constructing complex neural networks, dating back to the 1990s [49]-[52]. The most recent one, Deep Complex Networks (DCN), was proposed by Trabelsi et al. [26], which formulated the building blocks of complex-valued deep neural networks, including complex convolution, complex batch normalization, a complex weight initialization strategy, etc. DCN takes into consideration the correlation between the real part and the imaginary part of the complex inputs and weights, demonstrating its effectiveness on classification tasks and showing comparable or even superior performance to real-valued DNNs with only half of the real-valued network size.

This work is motivated by both DCN and BNN. We systematically integrate the two so that: (i) BCNN can show improved accuracy compared to BNNs, owing to its richer representation through binary complex numbers; (ii) BCNN can drastically reduce the execution cost compared with DCNs, which is particularly attractive for embedded and edge utilization, where low cost, small size, low energy, and real-time response are usually demanded; (iii) compared to BNNs, BCNN can naturally handle complex input signals, such as wave information directly from the sensors.

We present our binary complex neural network (BCNN) in this section. We first define the binary complex number and discuss its convolution process. We then present how to perform complex binarization and binary complex batch normalization. Finally, we propose a weight initialization strategy for BCNN.

Similar to a complex number z = x + iy that comprises a real part "x" and an imaginary part "iy", a binary complex number is defined as z = x + iy with x, y ∈ {−1, +1}, which can be encoded by two digital bits: the first implies whether the real part is −1 or +1, while the second implies whether the imaginary part is −i or +i. The bias is also a full-precision complex number, which is omitted in the following discussion for simplicity. Given a binary complex input z = x + iy and a binary complex weight W = A + iB, the dot product follows the complex calculation rules:

z · W = (x · A − y · B) + i(x · B + y · A)

Compared to a BNN binary dot-product, a BCNN dot-product incorporates 4 binary dot-products and two extra real-valued additions. In matrix notation, it is expressed as:

[Re(z · W)]   [A  −B] [x]
[Im(z · W)] = [B   A] [y]
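As a quick illustration of the decomposition described above, the following sketch (our own, not code from the paper) numerically checks that a binary complex dot product can be assembled from four binary dot products plus two real-valued additions; the tensor names are arbitrary.

```python
# Minimal numerical check of the binary complex dot-product decomposition.
import torch

torch.manual_seed(0)
n = 64
# Binary complex input z = x + iy and weight W = A + iB, with entries in {-1, +1}.
x = torch.randint(0, 2, (n,)).float() * 2 - 1
y = torch.randint(0, 2, (n,)).float() * 2 - 1
A = torch.randint(0, 2, (n,)).float() * 2 - 1
B = torch.randint(0, 2, (n,)).float() * 2 - 1

# Four binary dot products combined by two real-valued additions/subtractions.
real_part = x @ A - y @ B   # Re(z . W)
imag_part = x @ B + y @ A   # Im(z . W)

# Reference result using native complex arithmetic (torch.dot does not conjugate).
ref = torch.dot(torch.complex(x, y), torch.complex(A, B))
assert torch.isclose(real_part, ref.real)
assert torch.isclose(imag_part, ref.imag)
```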
Binarization in a BNN is the process of converting a full-precision real number into a binary number, +1 or −1, and is generally viewed as the non-linear activation function of BNNs. Binarization can be performed in two ways, known as deterministic and stochastic [1], [2]. The stochastic one can potentially offer better accuracy, but at the expense of a high implementation cost, whereas the deterministic one is merely a sign function:

sign(x) = +1 if x ≥ 0, and −1 otherwise

Most BNN works adopt the low-cost deterministic binarization function. Since sign is non-differentiable at 0 and its gradient is 0 everywhere else, direct back-propagation is infeasible. Prior works proposed the straight-through estimator (STE) for the back-propagation:

∂L/∂x = ∂L/∂sign(x) · 1_{|x| ≤ t}

where t is a clipping threshold, typically set to 1. Within the threshold, the gradient of the sign function is simply set to an identity function; the threshold is used to cancel the gradient when the inputs are getting too large, which assists the optimization process.

Binarization to a binary complex number converts a complex number into a binary complex number (i.e., one of {1+i, 1−i, −1+i, −1−i}). We propose quadrant binarization, where the output is determined by which quadrant of the two-dimensional Cartesian system the input complex number belongs to, as shown in Figure 1. Mathematically, the complex plane is a geometric representation of the complex numbers settled by the real axis x and the orthogonal imaginary axis y, where the two axes partition the plane into four quadrants, each bounded by two half-axes. Given four quadrants and four binary complex values, it is natural to link each quadrant with a binary complex value. The quadrant binarization is determined by the phase of a complex number, which is critical information in complex signals. This quadrant binarization essentially decouples the real part and the imaginary part so that both parts can be processed separately by an ordinary binarization. For the forward propagation, the binarization is:

QB(x + iy) = sign(x) + i · sign(y)

For the backward propagation, the gradient of the binarization goes through two STEs, applied over the two independent full-precision latent variables x and y:

∂L/∂x = ∂L/∂sign(x) · 1_{|x| ≤ 1},   ∂L/∂y = ∂L/∂sign(y) · 1_{|y| ≤ 1}

This keeps the simplicity of the binarization process for efficient hardware implementation and memory storage. Note that to improve accuracy, existing works propose various variants of the binarization function, such as scaling factors [16]-[20], approximated sign() functions [20], [41], etc. However, according to [24], no obvious accuracy improvement has been observed by adopting these strategies. Therefore, in this work, we use the original BNN binarization function as the baseline [1], [2], which achieves the best theoretical execution performance and model compression rate.

In BNNs, in addition to the 32x memory storage and bandwidth savings from using 1 bit per neuron (compared to 32 bits per floating-point neuron), the computation efficiency gain comes from the way the bit dot-product is performed: every 32- or 64-element binary dot-product in a DNN can be accomplished by a single exclusive-nor (xnor) operation followed by a population-count (popc) operation, leading to 10-16x speedups [3]:

a · b = 2 · popc(xnor(a, b)) − N,   for a, b ∈ {−1, +1}^N encoded as N bits

A key requirement here is to conserve this hardware-friendly property. Essentially, in BCNN,

z · W = (x · A − y · B) + i(x · B + y · A)

which implies that a BCNN dot-product can be computed by 4 BNN dot-products (i.e., popc-xnor) plus two full-precision additions. The 4 BNN dot-products can be executed in parallel, so the latency is theoretically close to that of one BNN dot-product.
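As a concrete reference for the quadrant binarization and its STE gradient described above, here is a minimal PyTorch sketch (our own illustration, not the authors' released code); the class name and the clipping threshold of 1 follow the description above.

```python
import torch

class QuadrantBinarize(torch.autograd.Function):
    """Binarize a complex activation (x + iy) to one of {1+i, 1-i, -1+i, -1-i}."""

    @staticmethod
    def forward(ctx, x, y):
        # x, y: full-precision latent real and imaginary parts.
        ctx.save_for_backward(x, y)
        xb = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
        yb = torch.where(y >= 0, torch.ones_like(y), -torch.ones_like(y))
        return xb, yb

    @staticmethod
    def backward(ctx, grad_xb, grad_yb):
        x, y = ctx.saved_tensors
        # Two independent straight-through estimators with clipping threshold 1.
        grad_x = grad_xb * (x.abs() <= 1).float()
        grad_y = grad_yb * (y.abs() <= 1).float()
        return grad_x, grad_y

# Usage: xb, yb = QuadrantBinarize.apply(x_latent, y_latent)
```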
In a BCNN convolution layer, within the input, weight and output tensors, we use the first half of the channels for the real part and the second half for the imaginary part. Specifically, if the input tensor has M complex feature maps, it is equivalent to an input with 2M real-valued feature maps, where the first M represent the real components x and the remaining M represent the imaginary components y. The same applies to the output tensor. Consequently, the weight tensor has size (N × M × k × k) × 2, where the former half holds the real part of the complex weight (i.e., A in the convolution formula above) and the latter half holds the imaginary part (i.e., B).

Batch normalization (BN) [53] was proposed to accelerate convergence and contribute to better training accuracy. In real-valued DNNs, BN first normalizes the input so that the mean becomes zero and the variance becomes one. It then adjusts the normalized input by scaling through a learnable gain factor and shifting through a learnable bias:

BN(r) = γ · (r − µ)/√(σ² + ε) + β

where r is the input, µ is the mean of the batch, σ² is the variance of the batch, γ is the learned scale, β is the learned shift, and ε is a tiny number for numerical stability. BN is important for DNNs, but it is vital for BNNs. In addition to normalizing the input, the learned gain and bias essentially increase the model capacity, or learning capability, of a BNN layer. Without BN, the training of BNNs is unlikely to converge.

Different from the real-valued formulation above, standardizing a complex input to a standard complex normal distribution is much more complicated, because in addition to normalizing the mean and variance, batch normalization in complex neural networks needs to ensure equal variance of the real and imaginary components. In the deep complex network [26], complex batch normalization is treated as a 2D whitening transformation, scaling the complex input by the square root of its variance along the real and imaginary components. This is achieved by multiplying the 0-centered data by the inverse square root of the covariance matrix:

z' = V^(−1/2) (z − E[z])

where z is the complex input, E[z] is the mean of z, and V is the 2 × 2 covariance matrix. The scaling parameter γ is a 2 × 2 positive semi-definite matrix with three degrees of freedom (γ_ri and γ_ir are the same). The shifting parameter β is also a complex parameter. γ_ri, γ_ir, and β are initialized to 0; γ_rr and γ_ii are initialized to 1/√2.

As shown above, complex batch normalization involves the computation of a matrix inversion and a matrix square root, which is too costly for the hardware. Besides, directly adopting this complex batch normalization approach in BCNN leads to poor training accuracy, or even non-convergence, as shown in Section IV. Consequently, we propose a novel batch normalization method called complex Gaussian batch normalization (CGBN), which is more efficient and lightweight. Our objective is to normalize the input complex signal to a standard complex normal distribution (CN) [54], [55]. A standard complex normal random variable, also known as a standard complex Gaussian random variable, is a complex random variable z whose real and imaginary parts are independent normally distributed random variables with mean equal to zero and variance equal to 1/2. In mathematical form, z ∼ CN(0, 1) implies:

Re(z) ∼ N(0, 1/2) and Im(z) ∼ N(0, 1/2), with Re(z) and Im(z) independent

Consequently, we can separately normalize the real part and the imaginary part of the input complex signal to a normal distribution with zero mean and 1/2 variance:

x' = (x − µ_x)/√(2(σ_x² + ε)),   y' = (y − µ_y)/√(2(σ_y² + ε))

The scaling parameter and shifting parameter are learnable complex values, and the complex Gaussian batch normalization is:

CGBN(z) = γ · z' + β,   with z' = x' + iy'

where both the scaling parameter γ and the shifting parameter β are complex values learned during training; γ is initialized to 1.
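The following is a minimal PyTorch sketch of the CGBN layer described above, assuming NCHW feature maps whose real and imaginary halves are handled as two separate tensors; the module and parameter names, the use of batch statistics only (no running averages), and the initialization of γ to 1 + 0i and β to 0 are our assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class ComplexGaussianBatchNorm(nn.Module):
    """Normalize each part to N(0, 1/2), then apply a learned complex scale/shift."""

    def __init__(self, num_complex_channels, eps=1e-5):
        super().__init__()
        c = num_complex_channels
        # Complex scale gamma = g_r + i*g_i (initialized to 1 + 0i, our assumption)
        # and complex shift beta = b_r + i*b_i (initialized to 0, our assumption).
        self.g_r = nn.Parameter(torch.ones(c))
        self.g_i = nn.Parameter(torch.zeros(c))
        self.b_r = nn.Parameter(torch.zeros(c))
        self.b_i = nn.Parameter(torch.zeros(c))
        self.eps = eps

    def forward(self, x, y):
        # x, y: (N, C, H, W) real and imaginary feature maps of the same layer.
        dims = (0, 2, 3)
        # Normalize each part to zero mean and variance 1/2 (standard CN(0, 1)).
        xn = (x - x.mean(dims, keepdim=True)) / torch.sqrt(
            2.0 * (x.var(dims, keepdim=True) + self.eps))
        yn = (y - y.mean(dims, keepdim=True)) / torch.sqrt(
            2.0 * (y.var(dims, keepdim=True) + self.eps))
        # Complex scale and shift: (g_r + i*g_i) * (xn + i*yn) + (b_r + i*b_i).
        g_r = self.g_r.view(1, -1, 1, 1)
        g_i = self.g_i.view(1, -1, 1, 1)
        out_r = g_r * xn - g_i * yn + self.b_r.view(1, -1, 1, 1)
        out_i = g_r * yn + g_i * xn + self.b_i.view(1, -1, 1, 1)
        return out_r, out_i
```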
This complex Gaussian batch normalization significantly reduces the computation complexity compared to the whitening-based formulation by avoiding the calculation of the inverse square root of the covariance matrix. The complex Gaussian batch normalization leads to faster convergence and converges for all of the models and datasets we have evaluated.

A proper weight initialization strategy can largely avoid the exploding or vanishing gradient problem during back-propagation and accelerate convergence during network training. Usually, weight initialization follows two rules: (i) the input and output have the same variance in the forward propagation; (ii) the gradients of the input and output have the same variance in the backward propagation. Two weight initialization strategies are broadly used for deep neural networks: Xavier [56] and He [57]. Xavier is suitable for symmetric activation functions such as tanh, softsign, etc.; the initial weight parameters follow a uniform distribution with zero mean and variance equal to 2/(fan_in + fan_out). He is specially designed for ReLU-like activation functions; the variance of the initialization distribution is 2/fan_in instead.

Following the Xavier [56] and He [57] approaches, the complex neural network [26] derives the variance of the complex weight parameters. In the complex weight initialization, a complex weight has a polar form:

W = |W| e^(iθ)

The variance of W is related to its amplitude, not its phase, so the amplitude |W| is set to follow the Rayleigh distribution, whose probability density function is f(x; σ) = (x/σ²) e^(−x²/(2σ²)) for x ≥ 0. For Xavier [56], to ensure Var(W) = 2/(fan_in + fan_out), the parameter is σ = 1/√(fan_in + fan_out); for He [57], to meet Var(W) = 2/fan_in, σ = 1/√fan_in. The phase θ is unrelated to the variance, so it is initialized to follow a uniform distribution between −π and π.

For BCNN, however, this complex weight initialization strategy does not work. After the quadrant binarization, the amplitude of a binary complex weight is always √2, which diminishes the effectiveness of the original initialization strategy, as will be shown in our evaluation (Section IV). Here, following the Xavier approach [56], we derive BCNN's weight initialization. As discussed in Section III-A, in a BCNN layer l, the complex output h_l = c_l + id_l is obtained by the convolution of the complex input z_l = x_l + iy_l and the complex weight W_l = A_l + iB_l (the complex bias is ignored for simplification). With f denoting the activation function, we have:

c_l = x_l ∗ A_l − y_l ∗ B_l,   d_l = x_l ∗ B_l + y_l ∗ A_l

The variances of the real part and the imaginary part can be written as:

Var(c_l) = fan_in · (Var(A_l)Var(x_l) + Var(B_l)Var(y_l)),   Var(d_l) = fan_in · (Var(B_l)Var(x_l) + Var(A_l)Var(y_l))

If C_in and C_out are the channel sizes of the complex input and output, then fan_in = k² × C_in and fan_out = k² × C_out. For the backward propagation, the gradient of the input is computed as:

∂L/∂x_l = ∂L/∂c_l ∗ Ã_l + ∂L/∂d_l ∗ B̃_l,   ∂L/∂y_l = ∂L/∂d_l ∗ Ã_l − ∂L/∂c_l ∗ B̃_l

The weights A and B here are C_in-by-k²C_out matrices, while the gradient weights Ã and B̃ are C_out-by-k²C_in matrices. For the input and output, the gradients of the real/imaginary parts should have the same variance. Under the assumption that, for complex feature maps, the variances of the gradient for the real part and the imaginary part are the same, we have:

Var(∂L/∂x_l) = fan_out · (Var(A_l) + Var(B_l)) · Var(∂L/∂c_l)

The forward condition therefore gives Var(A_l) = Var(B_l) = 1/(2 · fan_in), while for the backward pass we have Var(A_l) = Var(B_l) = 1/(2 · fan_out). As a compromise between the forward and backward conditions, we have:

Var(A_l) = Var(B_l) = 1/(fan_in + fan_out)

Therefore, in BCNN, we initialize the weight matrix following a normal distribution with mean µ = 0 and variance 1/(fan_in + fan_out).
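A minimal sketch of this initialization follows, assuming the real and imaginary parts of a binary complex convolution weight are stored as two separate tensors; the function and argument names are our own, not from the paper's code.

```python
import math
import torch

def bcw_init_(weight_real, weight_imag, kernel_size, c_in, c_out):
    """In-place BCW init: zero-mean normal with variance 1/(fan_in + fan_out)."""
    fan_in = kernel_size * kernel_size * c_in    # k^2 * complex input channels
    fan_out = kernel_size * kernel_size * c_out  # k^2 * complex output channels
    std = math.sqrt(1.0 / (fan_in + fan_out))
    with torch.no_grad():
        weight_real.normal_(mean=0.0, std=std)
        weight_imag.normal_(mean=0.0, std=std)
```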
In this section, we evaluate the performance of BCNN for image classification using three deep neural network models (NIN-Net, ResNet18, and ResNetE18) on two popular image classification datasets, CIFAR10 and ImageNet. We use PyTorch to train all models. First, we compare BCNN and BNN with similar architectures and the same parameter size. Then, we evaluate and compare three batch normalization methods and three weight initialization strategies for BCNN.

Complex-valued Input Data Generation: As the raw data in the Cifar10 and ImageNet datasets are real-valued, it is necessary to extend them to the complex domain first. Our BCNN adopts the learning-based methodology proposed in [26] to generate the imaginary parts. As shown in Figure 2, the imaginary parts are learned by a real-valued residual block; the concatenation of the raw real-valued data and the learned imaginary parts then serves as the complex-valued input. The two conv layers both use 1×1 kernels with 3 input and output channels, so the real-valued residual block is lightweight in terms of both computation and storage.

BCNN Model Configuration: Usually in BNNs, the first and the last layer are kept at full precision. Using full precision for the first layer conserves the maximum information flow from the input images. Using full precision for the last layer conserves the maximum state space before the final output, which is particularly meaningful for large datasets like ImageNet. For BCNN, we adopt the same strategy: a full-precision complex convolution for the first layer, and a full-precision real-valued layer for the last layer that treats the M complex inputs as 2M real-valued inputs. For a fair comparison, all networks used in our evaluation of BCNN and BNN have very similar architectures and the same model size. As BCNN uses complex-valued parameters, the model size of a BCNN is about twice that of a BNN with the same network configuration. Therefore, we tune the number of channels at each layer of BCNNs to approximately 1/√2 of the ones in BNNs.

Networks: Three networks are used for evaluation: NIN-Net, ResNet18 and ResNetE18. (a) NIN-Net consists of three stacked mlpconv layers followed by a spatial 2×2 max-pooling and a global average-pooling layer. For the Cifar10 dataset, we use the original NIN-Net proposed in [37]; for the ImageNet dataset, we use the enhanced version of NIN-Net proposed in [18], which enlarges the kernel sizes of the first four mlpconv layers from 1×1 to 3×3 and increases the output channels of the first two mlpconv layers from 96 to 128. (b) ResNet18 and ResNetE18: The blocks of ResNet18 and ResNetE18 are shown in Figure 3 and Figure 4 (the stride of the first convolution is 2; when the stride is 1, the bypass is an identity). ResNetE is a modified version of ResNet, which is equipped with extra shortcuts and adopts a full-precision downsampling conv layer. With these modifications, ResNetE can better preserve the information flow of the network and process low-precision data, especially binary data, more efficiently.

We first evaluate BCNN with NIN-Net [37] and ResNet18 [34] on the CIFAR-10 dataset. In training, the Adam optimizer is adopted with the initial learning rate set to 5e-3. All models are trained for up to 300 epochs. We adjust the learning rate during training by multiplying it by 0.2 at the 80th, 150th, 200th, 240th, and 270th epochs.
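For reference, the training schedule above corresponds to the following minimal PyTorch sketch; the placeholder model and the omitted data-loading loop are our simplifications, not details from the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)  # placeholder; a BCNN model would go here
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 150, 200, 240, 270], gamma=0.2)

for epoch in range(300):
    # ... one full training pass over CIFAR-10 (forward, loss, backward,
    #     optimizer.step()) would go here ...
    scheduler.step()
```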
Table I compares the accuracy of BCNNs, DNNs and BNNs. For NIN-Net and ResNet18, BCNN achieves 1.85% and 0.52% accuracy improvements, respectively, over BNNs with the same model sizes. Figure 5 and Figure 6 show the variation of training loss and testing loss from epoch 0 to epoch 300.

In Table II, we show the impact of different batch normalization methods and weight initialization strategies on BCNN. We evaluate three batch normalization methods. BN (batch normalization) is the standard batch normalization used in real-valued neural networks. CBN (complex batch normalization) was proposed in the deep complex network [26]. CGBN (complex Gaussian batch normalization) is proposed in this work for our BCNN. Compared with BN, the proposed CGBN improves the accuracy on both NIN-Net and ResNet18. For the Cifar-10 dataset, our experimental results show that CBN can reach higher accuracy than our CGBN. However, CBN is more computationally intensive and leads to non-convergence on large datasets, e.g., ImageNet. The detailed results on ImageNet will be given in the next section (NA in the table means non-convergence). Different parameter initialization schemes also affect the performance of BCNNs, as shown in Table II. BCW (binary complex weight initialization) is proposed for BCNN in this paper; Xavier is used for real-valued networks; Ray was proposed for the deep complex network [26]. The results show that the proposed initialization technique brings 1.55% and 1.27% accuracy improvements for NIN-Net and ResNet18, respectively.

In this section, we show the performance of BCNNs on the ImageNet dataset. We use the standard pre-processing: all images are resized to 256 × 256 and randomly cropped to 224 × 224 for training. The validation set uses a single center crop. We use the ADAM optimizer in training.

Table II: Test accuracy with respect to batch normalization and weight initialization strategies on Cifar10. CGBN refers to the complex Gaussian batch normalization proposed in this work. BN refers to the normal batch normalization method [53]. CBN refers to the baseline batch normalization approach proposed in the deep complex network [26]. BCW refers to the binary complex weight initialization proposed in this work. Xavier refers to the Xavier weight initialization strategy [56]. Ray refers to the complex weight initialization approach proposed in the deep complex network [26]. NA means non-convergence.

The accuracy comparison of BCNNs with different batch normalization techniques and weight initialization strategies is shown in Table IV. BCNN with the proposed CGBN always provides higher accuracy than BN for all models. For NIN-E, which is a relatively shallow network structure, CBN plus BCW achieves the highest accuracy. However, for the relatively deeper structures, e.g., ResNet18 and ResNetE18, CBN leads to non-convergence. We further evaluate the effects of different weight initialization strategies on the ImageNet dataset. As listed in Table IV, the proposed weight initialization BCW shows increased accuracy over Ray for NIN-Net. For ResNet18 and ResNetE18, the proposed BCW leads to faster convergence with higher accuracy than BNNs. As a comparison, BCNN with Ray cannot converge in our testing. Overall, we show that BCNN, together with the proposed batch normalization and weight initialization strategies, can achieve better accuracy on large datasets such as ImageNet.
As the next step, we will seek efficient implementations of BCNN on various hardware platforms, including GPUs [3], GPU Tensor Cores [4], FPGAs [7], [58] and ASICs [5], [6], for practical utilization in embedded systems and edge domains.

In this work we propose the binary complex neural network, which combines the advantages of both BNNs and complex neural networks. Compared to BNNs, it achieves enhanced training accuracy and is able to learn from complex data; compared to complex neural networks, it is much more computationally efficient, which is particularly beneficial for terminal scenarios such as smart edges and smart sensors. Future work includes the demonstration of BCNN on complex datasets, the implementation of BCNN on embedded hardware devices, and its practical applications.

References
[1] Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1
[2] Binarized neural networks
[3] BSTC: A novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets
[4] Accelerating binarized neural networks via bit-tensorcores in Turing GPUs
[5] O3BNN: An out-of-order architecture for high-performance binarized neural network inference with fine-grained pruning
[6] O3BNN-R: An out-of-order architecture for high-performance and regularized BNN inference
[7] LP-BNN: Ultra-low-latency BNN inference with layer parallelism
[8] Attacking binarized neural networks
[9] Formal analysis of deep binarized neural networks
[10] Verifying properties of binarized deep neural networks
[11] GPU-accelerated real-time stereo estimation with binary neural network
[12] BinaryCoP: Binary neural network-based COVID-19 face-mask wear and positioning predictor on edge devices
[13] An FPGA-based hardware/software design using binarized neural networks for agricultural applications: A case study
[14] Efficient super resolution using binarized neural network
[15] Binary volumetric convolutional neural networks for 3-D object recognition
[16] XNOR-Net: ImageNet classification using binary convolutional neural networks
[17] DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients
[18] How to train a compact binary neural network with high accuracy
[19] Towards accurate binary convolutional neural network
[20] BNN+: Improved binary network training
[21] ReBNet: Residual binarized neural network
[22] Structured binary neural networks for image recognition
[23] Binary ensemble neural network: More bits per network or more networks per bit
[24] BinaryDenseNet: Developing an architecture for binary neural networks
[25] MeliusNet: An improved network architecture for binary neural networks
[26] Deep complex networks
[27] Full-capacity unitary recurrent neural networks
[28] Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence
[29] Associative long short-term memory
[30] Neuronal synchrony in complex-valued deep networks
[31] Pixel-wise PolSAR image classification via a novel complex-valued deep fully convolutional network
[32] Phase-aware speech enhancement with deep complex U-Net
[33] Analysis of deep complex-valued convolutional neural networks for MRI reconstruction
[34] Deep residual learning for image recognition
[35] Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm
[36] Training competitive binary neural networks from scratch
[37] Network in network
[38] BinaryConnect: Training deep neural networks with binary weights during propagations
[39] The high-dimensional geometry of binary neural networks
[40] An empirical study of binary neural networks' optimisation
[41] Self-binarizing networks
[42] Loss-aware binarization of deep networks
[43] Latent weights do not exist: Rethinking binarized neural network optimization
[44] FINN: A framework for fast, scalable binarized neural network inference
[45] A review of binarized neural networks
[46] Binary neural networks: A survey
[47] Enhanced radar imaging using a complex-valued convolutional neural network
[48] DeepcomplexMRI: Exploiting deep residual network for fast parallel MR imaging with complex convolution
[49] Complex domain backpropagation
[50] Approximation by fully complex multilayer perceptrons
[51] The complex backpropagation algorithm
[52] Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology
[53] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[54] Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction)
[55] Second-order complex random vectors and normal distributions
[56] Understanding the difficulty of training deep feedforward neural networks
[57] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
[58] CQNN: A CGRA-based QNN framework