Experimental Quantum End-to-End Learning on a Superconducting Processor

Xiaoxuan Pan, Xi Cao, Weiting Wang, Ziyue Hua, Weizhou Cai, Xuegang Li, Haiyan Wang, Jiaqi Hu, Yipu Song, Dong-Ling Deng, Chang-Ling Zou, Re-Bing Wu, and Luyan Sun

March 17, 2022

Machine learning can be substantially powered by a quantum computer owing to its huge Hilbert space and inherent quantum parallelism. In the pursuit of quantum advantages for machine learning with noisy intermediate-scale quantum devices, it was proposed that the learning model can be designed in an end-to-end fashion, i.e., the quantum ansatz is parameterized by directly manipulable control pulses without circuit design and compilation. Such gate-free models are hardware friendly and can fully exploit limited quantum resources. Here, we report the first experimental realization of quantum end-to-end machine learning on a superconducting processor. The trained model can achieve 98% recognition accuracy for two handwritten digits (via two qubits) and 89% for four digits (via three qubits) in the MNIST (Modified National Institute of Standards and Technology) database. The experimental results exhibit the great potential of quantum end-to-end learning for resolving complex real-world tasks when more qubits are available.

Quantum computing [1] is revolutionizing the field of machine learning (ML) [2-4]. Powered by the quantum Fourier transform and amplitude amplification, provable speed-ups have been predicted for high-dimensional and big-data ML tasks using fault-tolerant quantum computers [5-9]. Even with noisy intermediate-scale quantum (NISQ) devices, quantum advantage is still promising, because the model expressibility can be substantially enhanced by the exponentially large feature space carried by multi-qubit quantum states [10, 11]. To deploy quantum machine learning algorithms on NISQ processors, the key step is to construct a parameterized quantum ansatz that can be trained by a classical optimizer. To date, most quantum ansatzes are realized by quantum neural networks (QNNs) [10, 12-22] that consist of layers of parameterized quantum gates, and successful experiments have been demonstrated on classification [23-25], clustering [26, 27], and generative [28-30] learning tasks. The gate-based QNN ansatz naturally incorporates the theory of quantum circuits, but the learning performance depends strongly on the architecture design and on the mapping of circuits to experimentally operable native gates. A structurally non-optimized QNN cannot fully exploit the limited quantum coherence resource, which is partially why high learning accuracy is hard to attain on NISQ devices without downsizing the training dataset. There is certainly much room for performance improvement through more hardware-efficient quantum ansatzes, e.g., via deep optimization of the circuit architecture [31] and qubit mapping strategies [32]. Recently, a hardware-friendly end-to-end learning scheme (in the sense that the model is trained as a whole instead of being divided into separate modules) was proposed [33], in which the gate-based QNN is replaced by natural quantum dynamics driven by coherent control pulses. This model requires little architecture design or system calibration, and no qubit mapping.
One can also jointly train a data encoder that automatically transforms classical data into quantum states via control pulses, which essentially simplifies the encoding process because quantum states no longer need to be prepared according to a hand-designed encoding scheme. More importantly, the natural control-to-state mapping involved in the encoding process introduces nonlinearity that is crucial for better model expressibility.

In this paper, we report the first experimental demonstration of quantum end-to-end machine learning using a superconducting processor, through the recognition of handwritten digits selected from the MNIST (Modified National Institute of Standards and Technology) dataset. Without downsizing the original 784-pixel images, the end-to-end learning model can be trained to achieve 98% accuracy with two qubits for the 2-digit classification task and 89% accuracy with three qubits for the 4-digit task, which are among the best experimental results reported on small-size quantum processors [34]. The demonstrated quantum end-to-end model can be easily scaled up for solving complex real-world learning tasks owing to its inherent hardware friendliness and efficiency.

The basic idea of end-to-end quantum learning is to parameterize the quantum ansatz by the physical control pulses that are usually applied to implement abstract quantum gates in variational quantum classifiers. In this way, a feedforward QNN can be constructed by the control-driven evolution of the quantum state |ψ(t)⟩, as follows [35]:

    i (d/dt)|ψ(t)⟩ = [ H_0 + Σ_{m=1}^{r} θ_m(t) H_m ] |ψ(t)⟩,    (1)

where H_0 is the static Hamiltonian, which involves the coupling between different qubits, and r is the number of pulsed control functions/channels in the quantum processor. For example, if there are M qubits in the QNN and each qubit is driven by c control functions (e.g., flux bias or microwave driving), we have r = c × M. Here, H_m is the control Hamiltonian associated with the m-th control pulse, which contains n sub-pulses over n sampling periods. The j-th sub-pulse is parameterized by θ_m(t_j), and hence we denote the m-th control pulse by θ_m = [θ_m(t_1), θ_m(t_2), ..., θ_m(t_n)]. The evolution of the quantum system under all the n-th control sub-pulses constitutes the n-th layer of the QNN.

[Fig. 1 caption (fragment): ... the inference controls θ_In are applied to drive |ψ^(k)(t_E)⟩ to |ψ^(k)(t_{E+I})⟩, which is then measured. The parameters in W^(k) and θ_In are updated for the next iteration according to the loss function L and its gradient obtained from the measurement. The circled numbers represent specific points in the data flow, whose corresponding learning performances are shown in Fig. 4. The top right is a false-colored optical image of the six-qubit device used in our experiment.]

We illustrate quantum end-to-end learning with a classification task based on the MNIST dataset. As shown in Fig. 1, an image of a handwritten digit is randomly selected from the training dataset D. In the k-th iteration, the sampled image is converted to a d = 784 dimensional vector x^(k), and y^(k) is the corresponding label. The input data x^(k) is transformed by a matrix W^(k) into the control variables θ_En^(k) = W^(k) x^(k). This constitutes a classical encoding block with r channels and E sub-pulses per channel, θ_En^(k) = [θ_1^(k), θ_2^(k), ..., θ_r^(k)], where each θ_m^(k) contains E sub-pulse amplitudes. The generated control pulses θ_En^(k) then automatically encode x^(k) into the quantum state |ψ^(k)(t_E)⟩ via the natural quantum state evolution of Eq. (1).
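To make the encoding step concrete, below is a minimal Python sketch of the classical encoding block, assuming the two-qubit configuration used later in the experiment (r = 4 control channels, E = 2 encoding sub-pulses); the random input image and the flat 10^−5 initialization are illustrative stand-ins, not the trained values.

```python
import numpy as np

# Minimal sketch of the classical encoder theta_En = W x.
# Dimensions follow the two-qubit task: d = 784 pixels, r = 4 channels, E = 2.
d, r, E = 784, 4, 2

W = np.full((r * E, d), 1e-5)            # trainable encoder matrix W
x = np.random.default_rng(0).random(d)   # a flattened 28x28 MNIST image in [0, 1]

theta_en = (W @ x).reshape(E, r)         # row j: sub-pulse amplitudes theta_m(t_j)
print(theta_en.shape)                    # (2, 4): E sub-pulses x r channels
```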
The inference controls θ_In^(k), which have the same form as θ_En^(k) but consist of I sub-pulses in each channel, are then applied to induce the quantum evolution from the encoded quantum state |ψ^(k)(t_E)⟩. The inference controls are introduced to improve the classification performance. Finally, the end-time quantum state |ψ^(k)(t_{E+I})⟩ is measured under an appropriate experimentally available positive operator O^(k) chosen according to the classical label y^(k), which gives the conditional probability (or confidence) of obtaining y^(k) for a given input x^(k):

    P(y^(k) | x^(k)) = ⟨ψ^(k)(t_{E+I})| O^(k) |ψ^(k)(t_{E+I})⟩.    (2)

The corresponding loss function is defined as

    L = 1 − (1/b) Σ_{k=1}^{b} P(y^(k) | x^(k)).    (3)

In the experiment, we select a batch of b samples in each iteration to reduce the fluctuation of L for faster convergence of the learning process. The gradient of the loss function L with respect to each encoding control θ_i(t_j) can be estimated by finite-difference numerical differentiation [36]. The gradient of L with respect to W^(k) can then be derived from the gradient of L with respect to θ_En^(k) [33]. Therefore, we can apply the widely used stochastic gradient-descent algorithm of machine learning to update W^(k) and θ_In^(k) by minimizing L on the training dataset D (see Supplementary Materials for details of the algorithms) [37]. Once the model is well trained, fresh samples from a testing dataset can be used to examine the recognition performance on handwritten digits.

The end-to-end model is demonstrated on a superconducting processor, as shown in Fig. 1. All qubits take the flux-tunable Xmon geometry and are driven through inductively coupled flux bias lines and capacitively coupled RF control lines [38-40]. Among the six qubits, Q_1, Q_2, Q_4, Q_5 are dispersively coupled to a half-wavelength coplanar cavity B_1, and Q_2, Q_3, Q_5, Q_6 are dispersively coupled to another cavity B_2. Each qubit is dispersively coupled to a quarter-wavelength readout resonator for high-fidelity single-shot readout, and all the resonators are coupled to a common transmission line for multiplexed readout. The qubits that are not relevant to the QNN are biased far away and can be excluded from the system Hamiltonian; the static Hamiltonian of the QNN can therefore be written in the interaction picture as

    H_0 = Σ_{p<q} J_qp (a_q† a_p + a_p† a_q) − Σ_q (E_C,q / 2) a_q† a_q† a_q a_q,

where J_qp is the coupling strength between the p-th and q-th qubits mediated by the bus cavity, E_C,q denotes the qubit anharmonicity, and a_q is the annihilation operator of the q-th qubit. Throughout this work, we set the encoding block to E = 2 layers, followed by an inference block with I = 2 layers. As shown in Fig. 1, for the q-th qubit in the n-th (n = 1, 2, 3, 4) layer of the QNN, there are c = 2 control parameters θ_{2q−1}(t_n) and θ_{2q}(t_n), which are associated with the control Hamiltonians H_{2q−1} = (a_q + a_q†)/2 (rotation along the x-axis of the Bloch sphere) and H_{2q} = (i a_q − i a_q†)/2 (rotation along the y-axis of the Bloch sphere), respectively. The control parameters are the variable amplitudes of the Gaussian envelopes of two resonant microwave sub-pulses, each of which has a fixed width of 4σ = 40 ns. All the quantum controls in the same time interval are exerted simultaneously. For an N-digit classification task, we take M = log_2 N + 1 qubits for the QNN: the classification results are mapped to the computational bases of the first log_2 N qubits (label qubits) by a majority vote over the collective measurements performed on the label qubits, while one additional qubit is introduced for better model expressibility. Therefore, the QNN in our experiment involves in total cM(E + I) = 8M control parameters.
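The following toy simulation sketches how the piecewise-constant controls of Eq. (1) generate the layered QNN evolution. It is not the calibrated device model: each transmon is truncated to two levels, ħ = 1, the coupling strength matches the J_35 value quoted below, and the Gaussian sub-pulse envelopes are idealized as constant amplitudes over the 40 ns layer duration.

```python
import numpy as np
from scipy.linalg import expm

# Toy two-qubit sketch of the control-driven QNN layers of Eq. (1).
sm = np.array([[0, 1], [0, 0]], dtype=complex)   # lowering operator a
I2 = np.eye(2, dtype=complex)
a1, a2 = np.kron(sm, I2), np.kron(I2, sm)

J = 2 * np.pi * 4.11e6                            # coupling strength (rad/s)
H0 = J * (a1.conj().T @ a2 + a2.conj().T @ a1)    # static coupling Hamiltonian

Hc = []                                           # control Hamiltonians, r = 4
for a in (a1, a2):
    Hc += [(a + a.conj().T) / 2,                  # x-axis rotation, H_{2q-1}
           (1j * a - 1j * a.conj().T) / 2]        # y-axis rotation, H_{2q}

def qnn_evolve(psi, theta_layers, dt=40e-9):
    """Apply one QNN layer per row of theta_layers (shape: n_layers x 4)."""
    for theta in theta_layers:                    # piecewise-constant controls
        H = H0 + sum(t * Hm for t, Hm in zip(theta, Hc))
        psi = expm(-1j * H * dt) @ psi
    return psi

psi0 = np.array([1, 0, 0, 0], dtype=complex)                # |gg>
theta = 1e7 * np.random.default_rng(1).normal(size=(4, 4))  # E + I = 4 layers
psi_out = qnn_evolve(psi0, theta)
```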
We perform the 2-digit ('0' and '2') classification task (N = 2) with Q_3 and Q_5 (M = 2). The working frequencies are 6.08 GHz and 6.45 GHz, respectively, which are also the flux sweet spots of the two qubits. The effective coupling strength is J_35/2π = 4.11 MHz. We take Q_5 as the label qubit and assign the classification result to '0' or '2' according to whether the measured probability of the |g⟩ or the |e⟩ state is larger. The end-to-end model is initialized with W = W_0 and θ_In = θ_0, where all elements of W_0 are 10^−5 and each element of θ_0 is tuned to induce a π/4 rotation of the respective qubit. The parameters are updated as follows. First, we obtain the loss function L according to Eq. (3) by measuring Q_5. We then perturb each control parameter in the control set {θ_En, θ_In} and obtain the corresponding gradient of L. The loss L and its gradient, averaged over a batch of two training samples (b = 2), are sent to a classical Adam optimizer [41] for updating W and θ_In. All control parameters are linearly scaled to the digital-to-analog converter levels of a Tektronix AWG70002A arbitrary waveform generator, operating at a sampling rate of 25 GS/s, to generate the resonant RF pulses directly. The control pulses, composed of in-phase and quadrature components, are sent to each qubit through the corresponding RF control line. To obtain the classification result, we repeat the procedure and measure the label qubit 5000 times.

In the 4-digit ('0', '2', '7', and '9') classification task (N = 4), we take Q_3, Q_5, and Q_6 (M = 3) to construct the QNN, whose working frequencies are 6.08 GHz, 6.45 GHz, and 6.19 GHz, respectively. Q_3 and Q_5 are measured for the classification output, and the target digits correspond to the four computational bases spanned by these two label qubits. The training procedure and algorithms are the same as those for the N = 2 task.

The typical training processes are shown in Figs. 2a-b. For clarity, the curves are smoothed by averaging each data point with its four neighboring points. For the 2-digit (4-digit) classification task, the experimental loss function L converges to 0.14 (0.22) within 300 (500) iterations. The training loss can potentially be reduced further by increasing the depth E of the encoding block [33]. For comparison, numerical simulations are also performed with the calibrated system Hamiltonian, the same batches of training samples, and the same parameter update algorithms. As shown in Figs. 2a-b, the simulations match the experiments well. The small deviation of the experimental data may be attributed to the simplified modeling of high-order couplings between the qubits and the control pulses [42], as well as to drifts of the system parameters. To examine the performance of the end-to-end learning, we experimentally test the generalizability of the trained end-to-end model with fresh testing samples (1000 for each digit) and count the frequencies of assigning these samples to the different digits (see Figs. 2c-f). The measured overall accuracies (i.e., the proportions of samples that are correctly classified) are 98.7% for the 2-digit task and 89.5% for the 4-digit task, which are consistent with the simulation results (98.2% and 88.9%, respectively) based on the experimentally identified Hamiltonian. The performance of the model also relies on the amount of entanglement gained in the quantum state, as analyzed below.
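Before turning to that analysis, the finite-difference gradient estimation and batch averaging described above can be summarized in the following host-side sketch; `confidence` is a hypothetical placeholder for running the pulse sequence on the processor and measuring P(y | x), not part of the authors' code.

```python
import numpy as np

# Sketch of loss and gradient estimation over one batch (cf. Algorithm 1
# in the Supplementary Materials). Since L = 1 - <P>, gradients carry a sign.
def loss_and_grads(confidence, W, theta_in, batch, delta=1e-2):
    gW, g_in, loss = np.zeros_like(W), np.zeros_like(theta_in), 0.0
    for x, y in batch:
        theta_en = W @ x                        # encoding controls theta_En
        p0 = confidence(theta_en, theta_in, y)  # measured P(y | x)
        loss += (1.0 - p0) / len(batch)
        g1 = np.zeros_like(theta_en)
        for i in range(theta_en.size):          # perturb each encoding control
            t = theta_en.copy(); t[i] += delta
            g1[i] = (confidence(t, theta_in, y) - p0) / delta
        for i in range(theta_in.size):          # perturb each inference control
            t = theta_in.copy(); t[i] += delta
            g_in[i] -= (confidence(theta_en, t, y) - p0) / delta / len(batch)
        gW -= np.outer(g1, x) / len(batch)      # chain rule: dL/dW = -g1 (outer) x
    return loss, gW, g_in
```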
When the number of QNN layers is fixed, the quantum state becomes more entangled under a longer total pulse length τ (which covers all E + I = 4 sub-pulses of the encoding and inference blocks), but coherence is gradually lost over the prolonged control duration due to the inevitable decoherence. We use the experimentally calibrated parameters to simulate the 2-digit classification process under different τ and different qubit coherence times T_1 and T_φ. As shown in Fig. 3, the average confidence 1 − L varies little with τ when T_1 or T_φ is sufficiently small, because the coherent control is overwhelmed by the strong decoherence. For larger T_1 or T_φ (e.g., T_1 = 20 µs), the average confidence initially increases with τ but decreases after reaching a peak. This trend clearly indicates a trade-off between the gained entanglement and the lost coherence, and thus τ, as well as the number of layers, should be chosen to strike the best balance.

The end-to-end learning scheme provides a seamless combination of quantum and classical computers through the joint training of the control-based QNN and the classical data encoder W. To understand their respective roles in the classification, we check how the data distribution varies along the flow x → θ_En → |ψ(t_E)⟩ → |ψ(t_{E+I})⟩ → y (see the circled points 1-4 in Fig. 1) in the 2-digit classification process. To facilitate the analysis, we use Linear Discriminant Analysis (LDA) [43], which projects high-dimensional data vectors into two clusters of points distributed on an optimally chosen line (see details in the Supplementary Materials). The LDA makes it easier to visualize and compare data distributions of different dimensionalities. The projected clusters are plotted in Fig. 4. In each subfigure, the distance between the centers of the two clusters is normalized, and hence we can quantify the classifiability by their standard deviations (i.e., the narrowness of the distributions). As can be seen in Figs. 4a-b, the classical data encoder W effectively reduces the original 784-dimensional vector x to an 8-dimensional vector of control variables θ_En, but the standard deviation increases from 0.1658 for the original dataset to 0.2903 for the transformed control pulses. Then the control-to-state mapping, which is both nonlinear and quantum, sharply reduces the standard deviation to 0.0919 for the encoded quantum state (Fig. 4c), while the subsequent quantum inference block makes no further improvement (Fig. 4d). These results indicate that the classical data encoder is responsible for compressing the high-dimensional input data, while the classification is mainly accomplished by the QNN.

It should be noted that no quantum advantage is claimed here with a small-size NISQ processor. However, we notice that end-to-end learning is very similar to quantum reservoir computing (QRC) [44, 45], in that both schemes exploit complex natural quantum dynamics for hard computing tasks; QRC has been proven to possess the universal approximation property [46] and a higher information processing capacity [47]. We conjecture that similar conclusions hold for quantum end-to-end learning, and we will explore them in future studies. In principle, the power of the QNN can be increased exponentially when more controllable qubits are available.
We expect to further improve the training efficiency of end-to-end learning on larger NISQ processors, and to develop more sophisticated ML applications (e.g., unsupervised and generative learning) on more complex datasets.

The device is fabricated on a 2-inch c-plane sapphire substrate. A 100-nm aluminum film is first deposited directly in a Plassys MEB 550S system without any pre-cleaning treatment. The resonators and control lines are then defined by photolithography followed by inductively-coupled-plasma etching. The wafer is diced into 8 mm × 8 mm chips for subsequent processing. Double-angle evaporation is applied to define the qubit junctions. All control signals are transmitted to the chip through SMA connectors, a welded printed circuit board, and wire-bonding lines. The marginal ground plane of the chip is densely bonded to the aluminum sample box. On-chip wire bonds are also applied to connect the different central ground areas of the chip to suppress spurious slot-line modes. For clarity, Fig. 1 of the main text shows a device that is similar to the one used in our experiment, but without the on-chip wire bonds.

All qubits (Q_1 - Q_6) are flux-tunable Xmon qubits [1-3] with individual driving lines and flux bias lines. Each qubit is dispersively coupled to a quarter-wavelength readout resonator for high-fidelity single-shot readout, and all the readout resonators are coupled to a common transmission line for multiplexed readout. Among the six qubits, Q_1, Q_2, Q_4, Q_5 are dispersively coupled to a half-wavelength coplanar cavity B_1, and Q_2, Q_3, Q_5, Q_6 are dispersively coupled to another cavity B_2. Qubits (the p-th and q-th ones) that couple to a common cavity share a cavity-mediated coupling strength J_qp = g_qB g_pB (Δ_qB + Δ_pB) / (2 Δ_qB Δ_pB) [4], where g_qB and Δ_qB denote the coupling strength and the detuning between Q_q and the bus cavity, respectively. The parameters of the qubits and the bus cavity used in the experiment are shown in Table I.

[Table I (fragment recovered from the source):
parameter                       Q_3     Q_5     Q_6     B_1
frequency (GHz) (sweet spot)    6.081   6.453   6.191
The remaining entries of Table I are not recoverable here.]

The assembled aluminum sample box is fixed to a copper plate that is anchored to the 10 mK base plate of a dilution refrigerator. Additional copper braiding lines between the sample box and the copper plate provide a better thermal connection. The sample box is also enclosed within a cylindrical magnetic shield. The measurement wiring outside the magnetic shield is shown in Fig. S1. The control signals for the high-frequency drives are properly attenuated and low-pass filtered below 10 GHz, and the flux bias lines for tuning the qubit frequencies are low-pass filtered below 1 GHz. The input readout signals are generated from the same local microwave source and modulated with different frequency tones provided by a Tektronix AWG5208. The readout signals are multiplexed through a transmission line that couples to all six readout resonators. The output readout signal is amplified with a Josephson parametric amplifier (JPA) [5, 6] working at the base temperature. A high-electron-mobility-transistor amplifier working at 4 K and an additional room-temperature amplifier are also used to improve the signal-to-noise ratio before detection with the digitizer at room temperature. The readout signal of each qubit is obtained by digitizing the corresponding down-converted output signal.
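As a quick numerical illustration of the cavity-mediated coupling formula above, the snippet below evaluates J_qp for hypothetical coupling strengths and detunings; the values are illustrative only, not the calibrated device parameters.

```python
# J_qp = g_qB * g_pB * (Delta_qB + Delta_pB) / (2 * Delta_qB * Delta_pB)
g_q, g_p = 50e6, 50e6          # hypothetical qubit-bus couplings g/2pi (Hz)
d_q, d_p = -1.2e9, -0.9e9      # hypothetical qubit-bus detunings Delta/2pi (Hz)

J_qp = g_q * g_p * (d_q + d_p) / (2 * d_q * d_p)
print(f"|J_qp|/2pi ~ {abs(J_qp)/1e6:.2f} MHz")   # a few MHz, same order as J_35
```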
The ground state |g⟩ or the excited state |e⟩ of each qubit is discriminated by applying a specific threshold value to each run of the measurement. The readout parameters of the three qubits used in the experiment are listed in Table II. We apply a Bayesian correction to the readout outcomes to suppress the qubit decay errors and the state discrimination errors caused by the insufficient separation of the |g⟩ and |e⟩ states. The final probability distribution over the computational bases, written as a column vector P̃, can therefore be obtained from the originally measured distribution P via

    P̃ = F^{-1} P,

where F is the calibrated readout assignment matrix whose element F_ij is the probability of registering outcome i when the qubits are prepared in basis state j.

To simulate the quantum dynamics of the real device, we use the following Markovian master equation [7], which includes the environment-induced relaxation and dephasing effects:

    dρ/dt = −i[H(t), ρ] + Σ_q γ_1q L[a_q]ρ + Σ_q γ_2q L[a_q† a_q]ρ.    (1)

The Lindbladian L[a_q] represents the relaxation of the q-th qubit with rate γ_1q = 1/T_1,q, and the Lindbladian L[a_q† a_q] represents the pure dephasing of the q-th qubit with rate γ_2q = 1/T_φ,q. Here a_q is the annihilation operator of the q-th qubit, and L[O]ρ = OρO† − (1/2){O†O, ρ}. The coherent part H(t) reads

    H(t) = Σ_{p<q} J_pq (a_p† a_q + a_q† a_p) − Σ_q (E_C,q / 2) a_q† a_q† a_q a_q + Σ_q [ θ_{2q−1}(t) (a_q + a_q†)/2 + θ_{2q}(t) (i a_q − i a_q†)/2 ],

where E_C,q is the anharmonicity of the q-th qubit, and J_pq is the dispersive coupling strength between qubits p and q. The functions θ_{2q−1}(t) and θ_{2q}(t) are the control fields of the q-th qubit that induce x-axis and y-axis rotations, respectively.

We use Eq. (1) to evaluate the influence of the relaxation and dephasing of the qubits on the learning performance, assuming that all qubits have identical coherence times T_1 and T_φ. Using the MNIST testing dataset, we examine the dependence of the classification accuracies on the control pulse length under different T_1 and T_φ; the results are listed in Tables III and IV, respectively. They collectively show that strong relaxation/dephasing tends to deteriorate the learning performance. Owing to the voting method used for assigning labels to input samples, an abrupt reduction to 50% can be observed in the tables when the coherence times are too short, because the qubits quickly decohere to their equilibrium state no matter what data samples are fed in. To see the trends more clearly, we display in Fig. 3 of the main text the average confidence 1 − L evaluated on the same testing dataset, which is consistent with the tables listed here.

The aim of linear discriminant analysis (LDA) is to project high-dimensional data vectors into two clusters of points distributed on an optimally chosen line so that they can be better visualized. Consider a dataset D = {x^(k), y^(k)}, where x^(k) ∈ R^n and y^(k) ∈ {0, 1}. Let D_0 and D_1 be the subsets associated with y^(k) = 0 and y^(k) = 1, respectively. The LDA can then be formulated as seeking a projector ω ∈ R^n that maximizes

    S(ω) = [ω^T (μ_0 − μ_1)]^2 / [ω^T (E_0 + E_1) ω],

where μ_c and E_c are the mean value and the covariance matrix of D_c, with c = 0, 1. The maximization of S guarantees that the distance between the mean values of the projected points of the two clusters is the largest relative to the variance within each cluster. The optimal projector is easily solved as

    ω* = (E_0 + E_1)^{-1} (μ_0 − μ_1),

where the matrix inverse may be replaced by the pseudo-inverse when E_0 + E_1 is singular. The above LDA can be directly applied to the analysis of the MNIST input data vector x and the converted encoding control variables θ_En. As for the D-dimensional complex vectors |ψ(t_E)⟩ and |ψ(t_{E+I})⟩, we can transform them into 2D-dimensional real vectors and apply the same LDA [8].
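A compact sketch of the LDA projection defined above, using the pseudo-inverse to cover the singular case; the synthetic Gaussian clusters in the usage lines are illustrative only.

```python
import numpy as np

# Fisher LDA: project two clusters onto omega = (E0 + E1)^+ (mu0 - mu1).
def lda_project(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    E0, E1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
    omega = np.linalg.pinv(E0 + E1) @ (mu0 - mu1)   # pseudo-inverse for safety
    return X0 @ omega, X1 @ omega                    # 1-D projected clusters

rng = np.random.default_rng(0)                       # toy 8-dimensional data
p0, p1 = lda_project(rng.normal(0.0, 1.0, (500, 8)),
                     rng.normal(0.5, 1.0, (500, 8)))
```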
To facilitate the comparison of data points living in different vector spaces, we further process the projected points after the LDA as

    p̃^(k) = [ p^(k) − (μ̄_0 + μ̄_1)/2 ] / |μ̄_1 − μ̄_0|,

where μ̄_c is the center of the projected cluster c; this normalizes the distance between the centers of the two data clusters and shifts the overall data center to the origin.

The algorithm for evaluating the loss function and calculating the gradients is shown in Algorithm 1, and the Adam optimizer used in our experiment is shown in Algorithm 2.

Algorithm 1: Calculate the loss function and the gradients with respect to the model parameters
  Input: W, θ_In, batch of training data {x^(k), y^(k)}
  P_out, g_W, g_In ← 0; δ ∈ R^+
  for k = 1 : b do
      P_0 = P(y^(k) | x^(k), W, θ_In)
      for i = 1 : Length(θ_En) do
          θ_1 = [θ_En(1), ..., θ_En(i) + δ, ...]
          g_1(i) = [P(y^(k) | x^(k), θ_1, θ_In) − P_0] / δ
      end for
      for i = 1 : Length(θ_In) do
          θ_2 = [θ_In(1), ..., θ_In(i) + δ, ...]
          g_2(i) = [P(y^(k) | x^(k), W, θ_2) − P_0] / δ
      end for
      g_W = g_W − (g_1 ⊗ x^(k)) / b    (outer product)
      g_In = g_In − g_2 / b
      P_out = P_out + P_0 / b
  end for
  return P_out, g_W, g_In

[Acknowledgments fragment recovered from the source: Key-Area Research and Development Program of Guangdong Province.]

References (main text):
[1] Quantum Computation and Quantum Information.
[2] Quantum machine learning.
[3] Machine learning & artificial intelligence in the quantum domain: a review of recent progress.
[4] Machine learning meets quantum physics.
[5] Quantum principal component analysis.
[6] Quantum support vector machine for big data classification.
[7] Quantum Boltzmann machine.
[8] Quantum-enhanced machine learning.
[9] A quantum machine learning algorithm based on generative models.
[10] A rigorous and robust quantum speed-up in supervised machine learning.
[11] Quantum machine learning in feature Hilbert spaces.
[12] Supervised learning with quantum-enhanced feature spaces.
[13] Parameterized quantum circuits as machine learning models.
[14] Circuit-centric quantum classifiers.
[15] Classification with quantum neural networks on near term processors.
[16] A quantum convolutional neural network on NISQ devices.
[17] Hybrid quantum-classical convolutional neural network model for COVID-19 prediction using chest X-ray images.
[18] A quantum approximate optimization algorithm.
[19] Fixed-angle conjectures for the quantum approximate optimization algorithm on regular MaxCut graphs.
[20] Generation of high-resolution handwritten digits with an ion-trap quantum computer.
[21] Learning and inference on generative adversarial quantum circuits.
[22] Parametrized quantum policies for reinforcement learning.
[23] An artificial neuron implemented on an actual quantum processor.
[24] Entanglement-based machine learning on a quantum computer.
[25] Nearest centroid classification on a trapped ion quantum computer.
[26] Experimental demonstration of quantum-enhanced machine learning in a nitrogen-vacancy-center system.
[27] Experimental realization of a quantum support vector machine.
[28] Quantum generative adversarial networks for learning and loading random distributions.
[29] Quantum generative adversarial learning in a superconducting quantum circuit.
[30] Training of quantum circuits on a hybrid quantum computer.
[31] Reinforcement learning for optimization of variational quantum circuit architectures.
[32] Power of data in quantum machine learning.
[33] End-to-End Quantum Machine Learning Implemented with Controlled Quantum Dynamics.
[34] Experimental realization of a quantum image classifier via tensor-network-based machine learning.
[35] Open quantum systems.
[36] General explicit difference formulas for numerical differentiation.
[37] Stochastic gradient boosting.
[38] Coherent Josephson Qubit Suitable for Scalable Quantum Integrated Circuits.
[39] Perfect Quantum State Transfer in a Superconducting Qubit Chain with Parametrically Tunable Couplings.
[40] Observation of topological magnon insulator states in a superconducting circuit.
[41] Adam: A method for stochastic optimization.
[42] Simple Pulses for Elimination of Leakage in Weakly Nonlinear Qubits.
[43] Modern applied statistics with S-PLUS.
[44] Opportunities in quantum reservoir computing and extreme learning machines.
[45] Quantum reservoir processing.
[46] Temporal information processing on noisy quantum computers.
[47] Quantum reservoir computing using arrays of Rydberg atoms.

* These two authors contributed equally to this work.

References (Supplementary Materials):
[1] Coherent Josephson Qubit Suitable for Scalable Quantum Integrated Circuits.
[2] Perfect Quantum State Transfer in a Superconducting Qubit Chain with Parametrically Tunable Couplings.
[3] Observation of topological magnon insulator states in a superconducting circuit.
[4] 10-Qubit Entanglement and Parallel Logic Operations with a Superconducting Circuit.
[5] Dispersive magnetometry with a quantum limited SQUID parametric amplifier.
[6] Broadband parametric amplification with impedance engineering: Beyond the gain-bandwidth product.
[7] Circuit quantum electrodynamics.
[8] Description of entanglement.
[9] Adam: A method for stochastic optimization.

Algorithm 2: Adam optimizer for updating the model parameters
  Input: training dataset {x, y}
  s_W, r_W, s_In, r_In ← 0
  W, θ_In ← W_0, θ_In,0
  lr = 10^−3, β_1 = 0.9, β_2 = 0.999, ε = 10^−8
  for k = 1 : N do
      sample a batch of b training data and obtain P_out, g_W, g_In from Algorithm 1
      for (g, s, r, p) ∈ {(g_W, s_W, r_W, W), (g_In, s_In, r_In, θ_In)} do
          s ← β_1 s + (1 − β_1) g;  r ← β_2 r + (1 − β_2) g ⊙ g
          ŝ ← s / (1 − β_1^k);  r̂ ← r / (1 − β_2^k)
          p ← p − lr · ŝ / (√r̂ + ε)
      end for
  end for

The gradient of the loss function L used for the parameter update in each iteration is obtained by averaging the gradients of the conditional probability P over a batch of randomly selected input samples (b = 2), which reduces the fluctuation of L for faster convergence of the learning process. For the inference block, the gradient g_In of L with respect to each parameter in θ_In is obtained directly by rerunning the experiment with a small change in θ_In and calculating the difference of P. As for the encoding block, the gradient g_W with respect to the elements of the classical matrix W is needed; to reduce the experimental cost, we can equivalently calculate g_W from the outer product between the measured gradient g_1 with respect to the encoding controls and the input data vector. Once the gradients with respect to the model parameters are obtained, we adopt the Adaptive Moment Estimation (Adam) update algorithm [9] to update the corresponding parameters. The Adam algorithm is popular for its efficiency and stability in the stochastic optimization of learning problems. In the Adam algorithm, lr, β_1, β_2, and ε are configuration parameters chosen at their default empirical values. The intermediate parameters s_W, s_In, r_W, r_In are passed to the next iteration so as to adaptively control the parameter updating rate.
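For completeness, a minimal Python rendering of one Adam update as used in Algorithm 2, with the hyperparameter values quoted above; this is a sketch of the standard algorithm, not the authors' implementation.

```python
import numpy as np

# One Adam step for a parameter block (W or theta_In); k counts from 1.
def adam_step(param, grad, s, r, k, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    s = b1 * s + (1 - b1) * grad            # first-moment (mean) estimate
    r = b2 * r + (1 - b2) * grad**2         # second-moment (variance) estimate
    s_hat = s / (1 - b1**k)                 # bias corrections
    r_hat = r / (1 - b2**k)
    param = param - lr * s_hat / (np.sqrt(r_hat) + eps)
    return param, s, r
```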