key: cord-0544672-h3cp8le1 authors: Choudhury, Sayantan; Dutta, Ankan; Ray, Debisree title: Chaos and Complexity from Quantum Neural Network: A study with Diffusion Metric in Machine Learning date: 2020-11-16 journal: nan DOI: nan sha: 4eba5f84209bad28884ba3f0f4bd8c88f16af73f doc_id: 544672 cord_uid: h3cp8le1

In this work, our prime objective is to study the phenomena of quantum chaos and complexity in the machine learning dynamics of a Quantum Neural Network (QNN). A Parameterized Quantum Circuit (PQC) in the hybrid quantum-classical framework is introduced as a universal function approximator to perform optimization with Stochastic Gradient Descent (SGD). We employ a statistical and differential geometric approach to study the learning theory of QNN. The evolution of the parametrized unitary operators is correlated with the trajectory of the parameters in the Diffusion metric. We establish the parametrized version of Quantum Complexity and Quantum Chaos in terms of physically relevant quantities, which are essential not only in determining the stability, but also in providing a very significant lower bound on the generalization capability of QNN. We explicitly prove that when the system executes limit cycles or oscillations in the phase space, the generalization capability of QNN is maximized. Moreover, a lower bound on the optimization rate is determined using the well-known Maldacena Shenker Stanford (MSS) bound on the Quantum Lyapunov exponent.

learning manifold, the probability of reaching a wide minimum increases, which would result in a minimum generalization error. But for a larger time constant, the number of minima visited decreases, so a trade-off results in a higher generalization error. To attain this optimal ability of the neural network to reach a large number of minima within an optimal time constant, the paper maps the learning trajectory, or unitary evolution, of QNN to the trajectory of the parameters using a Riemannian manifold called the diffusion metric. The diffusion metric was introduced by Fioresi et al. in [35]; it is constructed by perturbing the flat Minkowski space by the magnitude of the noise in the gradient of the loss function. We can observe how the manifold changes as the neural architecture changes. In doing so, we are able to correlate the optimal unitary evolution of QNN with the optimal, i.e. stationary-action, path of particles in the diffusion metric. The correlation has two implications: the first is the search for an optimal QNN architecture as mentioned before, while the other comes from a more theoretical high-energy-physics perspective. The optimal unitary evolution denotes the minimum number of computations required to generate the final unitary U_f from the initial unitary U_i, in other words the relative complexity [36-38] between the initial and final unitary, C(U_f, U_i). This measure of complexity in unitary space is directly correlated with the optimal trajectory of the parameters, i.e. the geodesic, in the parameter space. The correlation establishes the complexity as a function of the parameters of the QNN. After establishing the complexity, an extensive study of quantum chaos is carried out [36, 37, 39-44]. The Lyapunov stability of the neural network and its evolution establish how the neural architecture governs the stability of the network. Rather than an extensive study of the Lyapunov evolution, an extremal analysis in terms of the growth of the complexity has been carried out.
The Maldacena, Shenker, Stanford (MSS) bound [39, 45] on the Quantum Lyapunov exponent, given by λ ≤ 2π/β where β is the inverse equilibrium temperature, puts forward an interesting limit on the optimization rate. The QNN cannot optimize its parameters completely: there will always be a minimum deviation from the optimal parameters, with the equality holding for maximal-complexity systems. In this connection, the out-of-time-correlator (OTOC) has also been calculated using the universality relation between complexity and OTOC, given by C = − log(OTOC) ¶.

¶ The concept of the out-of-time-correlator (OTOC) is treated as a very important probe to quantify the amplitude of quantum chaos in terms of the Quantum Lyapunov exponent. In this paper, rather than explicitly computing the expression for the OTOC from first principles, we use the universality relation between complexity and OTOC to determine the OTOC in terms of the complexity. This implies that once the expression for the complexity is computed from the present set-up, the connection with quantum chaos can very easily be established using the mentioned universality relationship.

The paper considers a hybrid quantum-classical neural network framework based on PQCs [11, 12], optimizing quantum data with classical gradient-based algorithms like stochastic gradient descent (SGD). Throughout the paper, we have assumed that the length of the training dataset is large enough for the loss function to stabilize, i.e. the loss function doesn't change as we increase the length of the training set. This allowed us to avoid fluctuations due to sampling. The assumption of a large training dataset and the stabilization of the loss function is inspired by Bialek et al. [46], corresponding to quantum computation at thermal equilibrium [47]. The paper establishes the behavior of the noise in SGD, which is governed by the neural architecture and dataset of the QNN, using the Diffusion metric. The parameterized complexity of QNN is established by correspondence with the geodesic of the parameter trajectory in the Diffusion metric. The paper further analyses the stability of QNN using the Lyapunov exponent as a function of the neural architecture and the dataset.

The paper is divided into three sections, building up from the mathematical background in Section 2 to the analysis of stability using the Lyapunov exponent in Section 4. In Section 2, Parameterized Quantum Circuits are introduced as universal function approximators [48] and the analysis is performed as a quantum analog of the statistical learning theory of [46]. The Diffusion metric [35] is introduced in Section 3, correlating the learning trajectory of QNN with the evolution of the noise in SGD during training. After establishing the fundamental and mathematical concepts in Mathematical Background and Diffusion Metric, the paper determines the complexity as a function of the parameters in Parameterized Complexity. The complexity of QNN determines the Lyapunov exponents, and thus the stability analysis is established in Quantum Lyapunov Exponents.

We focus on executing supervised learning tasks using the Parameterized Quantum Circuit framework. In classical supervised learning, the model learns a map from an input dataset {x_i} to the output {ŷ_i}. The map represents a w-parameterized function y_i = m(x_i; w), which is optimized to be close to the output ŷ_i for every data index i belonging to the training dataset. The metric used to define the closeness of the parameterized function and the output is called the loss function.
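For orientation, the classical setup just described can be written down in a few lines. The linear model, the synthetic dataset, and all variable names below are illustrative choices made for this sketch and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: inputs x_i and target outputs y_hat_i (illustrative).
x = rng.uniform(-1.0, 1.0, size=(200, 3))
w_true = np.array([0.5, -1.2, 0.8])
y_hat = x @ w_true + rng.normal(0.0, 0.1, size=200)

def m(x, w):
    """w-parameterized function y_i = m(x_i; w); here simply a linear map."""
    return x @ w

def mse_loss(w, x, y_hat):
    """Mean-squared-error loss measuring the closeness of m(x_i; w) to y_hat_i."""
    return np.mean((m(x, w) - y_hat) ** 2)

w = np.zeros(3)
print("initial loss:", mse_loss(w, x, y_hat))
```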
The loss function can represent any metric like cross-entropy, likelihood loss, log loss, etc. [13]. Here, we consider the mean-squared-error loss. The main objective of the model is to optimize the loss function using a certain learning algorithm. These algorithms update the parameters w of the parameterized function m(x_i; w) to optimize the loss function. Here, the model optimizes the loss function employing stochastic gradient descent (SGD) as its learning algorithm. SGD is an iterative method for optimizing the loss function using the gradient of the loss function calculated from a randomly selected subset of the training dataset [13, 14]. This supervised learning scenario is valid for both classical and quantum neural networks.

In the quantum neural network context, the initial density matrix ρ_in(x_i) is created by encoding the input data stream {x_i} onto the Encoder Circuit unitary U_φ(x_i), which acts on the ground state |0⟩ [11, 12]. The unitary U_φ can be represented as a linear combination of the basis operators α spanning a K-dimensional space, with the basis functions φ(x) as coefficients. The Encoder Circuit unitary U_φ is therefore characterized by the basis functions φ_μ(x), where x is sampled independently and identically from the training dataset under a fixed probability distribution P(x) with variance σ²_η [15, 46]. Mathematically, the unitary U_φ is given by equation 2.1, using which the input density matrix ρ_in is defined in equation 2.2. Hence the input density matrix ρ_in is created from the input dataset using equations 2.1-2.2.

The quantum neural network (QNN) applies a parameterized unitary operator U_θ to the input density matrix to produce an output density matrix ρ_out at every epoch (or iteration). Here, it is important to note that, similar to the universal approximation theorem in artificial neural networks [48], there always exists a quantum circuit that can represent a target function within an arbitrarily small error. The parameterized quantum circuit will always be able to optimize to any arbitrarily small error, but the depth, or complexity, of the circuit increases. This optimization also doesn't guarantee the generalization capability of the quantum neural network, which would result in a high testing error. Motivated by the notion of deep neural networks and the unitary arrangement proposed by Beer et al. [7], we use a quantum neural network architecture of stacked unitary operators with L layers. The unitary operator of the whole quantum circuit, parameterized by θ, is given by equation 2.3, where U_i is the unitary at the i-th layer parameterized by the weights w. The unitary at the i-th layer can be expressed as a linear combination of the basis operators σ with w as its coefficients (equation 2.4). Combining equations 2.3-2.4, the parameterized unitary operator can be expressed as a linear combination of the basis operators σ spanning a P-dimensional space with the parameters θ as coefficients, as given in equation 2.5, where θ_ν = g_ν(w) is a function of the weights. The parameters θ get updated by the learning algorithm to optimize the loss function. The unitary operator U_θ given by equation 2.5 acts on the initial density matrix given by equation 2.2 to produce the output density matrix ρ_out. We measure the output density matrix ρ_out using the observer operator B to get an expected value of the observation B, given by Tr(Bρ_out).
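The encoding-circuit-measurement chain of equations 2.1-2.7 can be mimicked numerically. The sketch below is a minimal single-qubit toy, assuming the Pauli set {I, X, Y, Z} for both the encoder basis α and the circuit basis σ, an illustrative choice of basis functions φ_μ, and a trace normalization (not part of the text) to keep the density matrices well defined, since a generic linear combination of basis operators need not be exactly unitary.

```python
import numpy as np

# Pauli basis, used here both as the encoder basis alpha and the circuit basis sigma.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
basis = np.array([I2, X, Y, Z])        # K = P = 4

def encoder_unitary(x):
    """U_phi(x) = sum_mu phi_mu(x) alpha_mu; the phi_mu here are an illustrative choice."""
    phi = np.array([1.0, np.cos(x), np.sin(x), x ** 2])
    return np.tensordot(phi, basis, axes=1)

def density(op):
    """rho = U |0><0| U^dagger, normalised to unit trace for this toy model."""
    ket0 = np.array([[1.0], [0.0]], dtype=complex)
    rho = op @ ket0 @ ket0.conj().T @ op.conj().T
    return rho / np.trace(rho).real

def circuit_unitary(theta):
    """U_theta = sum_nu theta_nu sigma_nu."""
    return np.tensordot(theta, basis, axes=1)

def predict(x, theta, B):
    """Expected observation Tr(B rho_out) with rho_out = U_theta rho_in U_theta^dagger."""
    rho_in = density(encoder_unitary(x))
    U = circuit_unitary(theta)
    rho_out = U @ rho_in @ U.conj().T
    rho_out = rho_out / np.trace(rho_out).real
    return np.trace(B @ rho_out).real

B = Z                                   # observer operator (illustrative choice)
theta = np.array([0.7, 0.1, 0.0, 0.2])
print(predict(0.3, theta, B))
```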
The aim of QNN is to optimize this expected observation value towards a target observation value B̂ as an output. For every input dataset {x_i}, we consider a corresponding output observation dataset {B̂_i}. During the training period, the training dataset (x_i, B̂_i) is sampled under the distribution P(x). The parameterized unitary operator U_θ maps the {x_i}-encoded initial density matrix to the output density matrix. The observer maps this output density matrix to the loss function value f. Now, using a learning algorithm (here, stochastic gradient descent), the QNN must find a sub-space of the unitary operator U_θ̂ for which the loss function f is at its minimum. But again, this sub-space doesn't guarantee generalization over the dataset, or a minimum of the testing error. ‖

‖ There may be an overlapping sub-space for which the QNN could both optimize and generalize. Note that there can also be a situation where the sub-space of optimization doesn't overlap with the sub-space of generalization. In that case, an unfavorable trade-off takes place, and a different neural architecture or Encoder Circuit should be considered.

The framework for the Quantum Neural Network can be summarized, following equations 2.1-2.2 and 2.3-2.5, by equations 2.6-2.7, where 2.7 defines the loss function and N is the length of the training dataset. We introduce a mapping Ω, defined by equations 2.6-2.7, which maps Ω : θ → f at every epoch. It is important to note that there is a reduction in dimension, forming a surjective K-to-1 mapping. This results in many sub-spaces of optimal parameters for which the loss function f is minimum. Under a learning algorithm (here, stochastic gradient descent), the QNN changes the unitary operator at every epoch t by applying a mapping Θ : θ_t → θ_{t+1}. To optimize the loss function, the learning algorithm tends to choose the map Θ such that E(f(θ_t)) > E(f(θ_{t+1})). Under SGD the corresponding loss function tends to decrease iteratively at every epoch and finally reaches the unitary U_θ̂, characterized by θ̂. In other words, when the parameters optimize, θ → θ̂, the expected value of the observation B tends towards B̂, resulting in zero training error. The parameters θ̂ are the optimal parameters. So, the matrix B̂ can be represented as B̂_i = Tr(B U_θ̂ ρ^i_in U_θ̂†) + η, where we use the shorthand ρ^i_in = ρ_in(x_i) and η is Gaussian noise with mean zero and variance σ²_η, similar to the classical variant used in [46].

The neural network optimizes by updating its parameters using stochastic gradient descent (SGD). The learning algorithm, i.e. the mapping Θ : θ_t → θ_{t+1}, is given by equation 2.9, where Γ represents the learning rate. Rather than optimizing over the whole training dataset, we optimize over batches B, randomly picked from the whole training data. This not only reduces the computational cost but also adds stochasticity due to the random sampling of batches, which proves to be very essential in the generalization context and will be discussed later in Complexity & Stability. The batch size |B| is much smaller than the total length of the training dataset N. The stochasticity in SGD arises when |B| << N, which allows for higher stochasticity in the random sampling during training. When |B| ∼ N, the stochasticity is lost and SGD becomes simple (not stochastic!) gradient descent. Over the last few decades, SGD has experimentally been among the most efficient algorithms in terms of accuracy and computational cost, which has led to increasing interest and documentation in the computer-science community.
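A minimal sketch of the mini-batch SGD update Θ : θ_t → θ_{t+1} described above; the per-sample loss used here is a smooth stand-in for the QNN loss, the finite-difference gradient is purely for illustration, and all helper names (sample_loss, theta_star, and so on) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the per-sample QNN loss (Tr(B rho_out(x_i)) - B_hat_i)^2: any smooth
# per-sample loss with a minimum near theta_star illustrates the update rule.
theta_star = np.array([0.9, -0.3, 0.4, 0.1])

def sample_loss(theta, xi):
    return np.sum((theta - theta_star) ** 2) + 0.05 * np.sin(5 * xi) * theta[0]

def batch_loss(theta, batch):
    return np.mean([sample_loss(theta, xi) for xi in batch])

def grad_fd(theta, batch, eps=1e-5):
    """Finite-difference gradient of the batch loss (autodiff would be used in practice)."""
    g = np.zeros_like(theta)
    for mu in range(theta.size):
        d = np.zeros_like(theta)
        d[mu] = eps
        g[mu] = (batch_loss(theta + d, batch) - batch_loss(theta - d, batch)) / (2 * eps)
    return g

data = rng.uniform(-1.0, 1.0, size=1000)     # training inputs x_i, with |B| << N
Gamma, batch_size, epochs = 0.1, 16, 200     # learning rate and batch size

theta = np.zeros(4)
for t in range(epochs):
    batch = rng.choice(data, size=batch_size, replace=False)  # random mini-batch B
    theta = theta - Gamma * grad_fd(theta, batch)             # theta_{t+1} = theta_t - Gamma * grad f
print("optimized parameters:", theta)
```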
Recently, with the works of [14, 15, 35], there has been a surge of interest in analyzing SGD from a dynamical-systems perspective. We discuss this important aspect of SGD in Diffusion Metric. The loss function plays an essential part in the QNN framework, as it is the objective function and drives the gradient in the learning algorithm. Combining equations 2.1-2.2 and equations 2.5-2.7, the loss function can be expressed as in equation (2.10). Further simplifying the factor ∆ under the condition that N → ∞ (large N), as evaluated in [15, 46], we obtain a simplified expression whose expansion coefficient, the rank-4 tensor A∞, is given by equation 2.12. The tensor A∞ exists under the assumption that the expression in equation 2.12 thermalizes, or reaches equilibrium.

The basis functions of the Encoder Circuit unitary operator U_φ play a pivotal part in creating the initial density matrix ρ_in in equations 2.1-2.2. The basis functions are fixed for a given framework and are thus selected prior to any training. The selection of these basis functions is therefore important, as the input dataset gets encoded into them. The tensor A∞ signifies the relation between the Encoder Circuit and the dataset; hence, we call A∞ the Encoder-Dataset tensor. Here, we assume that the loss function stabilizes when the number of training data points is large enough to neglect the fluctuations. Continuing from equation 2.10 and replacing ∆ using equation 2.12, the loss function takes the simplified form of equation (2.13).

The above equation shows how the loss function is governed by the selection of the Encoder-Dataset tensor A∞, given the observation matrix B. An important observation from equation 2.13 is that when the parameters optimize, i.e. θ → θ̂, the loss function minimizes to a non-zero constant σ²_η, the variance of the sampling distribution P(x). This is easy to validate mathematically: since the loss function f is a mean-squared-error loss, the sampling of the ordered pairs (x_i, B̂_i) also contributes to the loss. Consider the sampling distribution P(x) to be a delta function with σ²_η → 0; then min f also tends to zero. Now, when we increase the variance of the distribution, or in other words increase the diversification of the training dataset, min f also increases. This also shows that the neural network will fail for a uniform sampling distribution with σ²_η → ∞; there has to be an underlying structure in the dataset for the neural network to optimize.

Previously, in Section 2, we discussed the notion of stochasticity in stochastic gradient descent (SGD) via equation 2.9. The hyper-parameters of SGD, i.e. the batch size |B| and the learning rate Γ, are crucial in achieving an optimal learning trajectory in the context of computational cost and accuracy [14]. The learning trajectory is the trajectory of the parameters in the learning manifold, starting from the initial condition and governed by the dynamical equation 2.9. A larger learning rate will skip minima in the learning manifold, thus reducing the probability of reaching better minima for optimization and generalization. Though a smaller batch size will reduce the computational cost, it also increases the stochastic behavior of SGD, enough to skip minima and increase both training and testing error.
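The dependence of this stochasticity on the batch size can be seen in a quick experiment: the variance of the mini-batch gradient shrinks as |B| grows and essentially vanishes when |B| ∼ N, recovering plain gradient descent. The least-squares model below is an illustrative stand-in for the QNN loss; only the scaling with |B| is the point.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dataset and loss standing in for the QNN training problem.
theta_star = np.array([0.9, -0.3, 0.4, 0.1])
X = rng.normal(size=(4000, 4))
y = X @ theta_star + rng.normal(0.0, 0.2, size=4000)

def batch_gradient(theta, idx):
    """Gradient of the mean-squared error over the mini-batch indexed by idx."""
    residual = X[idx] @ theta - y[idx]
    return 2.0 * (residual[:, None] * X[idx]).mean(axis=0)

theta = np.zeros(4)
for B in (4, 32, 256, 4000):                    # |B| -> N recovers plain gradient descent
    grads = np.stack([batch_gradient(theta, rng.choice(4000, size=B, replace=False))
                      for _ in range(500)])
    print(B, grads.var(axis=0).mean())          # gradient noise shrinks roughly as 1/|B|
```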
This brings the focus to finding a metric that quantifies the stochasticity in SGD, so as to analyze the effect of the hyperparameters and the Encoder Circuit, along with the neural architecture of QNN, on the behavior of the stochasticity in the learning trajectory. Fioresi et al. [35] showed that the Diffusion matrix D, which is essentially the covariance matrix of the gradient of the loss function, provides great insight into the stochastic nature of SGD. The diffusion matrix D becomes a null matrix when the learning trajectory is governed by simple (not stochastic!) gradient descent. This implies that when the matrix D is null, the sampling of the batches is irrelevant to the learning algorithm; at that point, the loss function has reached its critical point, or in other words, the model has learned the training dataset. The magnitude of the Diffusion matrix determines the amount of stochasticity of SGD.

The work [35] introduced a metric called the Diffusion metric, which is created by perturbing the Minkowski space of parameters by the magnitude of the noise in the stochastic gradient descent. Fioresi et al. [35] showed that the trajectory of the parameters θ governed by SGD follows a geodesic path in the diffusion metric under a potential given by V. Mathematically, the diffusion metric is constructed such that ε < 1/max λ_D, where λ_D is the set of eigenvalues of the diffusion matrix D and ε is the order of the perturbation to the Minkowski space. The Minkowski space corresponds to the diffusion matrix being D = 0, or in other words, to the learning trajectory being governed by simple gradient descent. Perturbing this Minkowski space with a weak perturbation distorts the straight-line geodesic path of the parameters governed by simple gradient descent. The straight-line path corresponds to a parameter trajectory with no excitation to explore other minima; distorting it increases the probability of finding a better minimum, resulting in better generalization. As mentioned earlier, the diffusion matrix is the covariance matrix of the gradient of the loss function f, as expressed mathematically in equation 3.2. To evaluate the diffusion matrix, we first evaluate the gradient of the loss function f, using which we compute the diffusion matrix; the dependency of θ_μ with respect to θ_ν is given by the Jacobian G^μ_ν, where {L(ν)} is the collection of indexes l for which ∂g_ν(w)/∂w_l ≠ 0, as shown in [15].

It is important to note that G changes with the epochs as the weights evolve with time. The matrix G measures the dependence between the different parameters, represented as coordinates. If the matrix G is a delta function, the parameters are independent and the parameter space corresponds to the Minkowski space. From the dynamical-systems perspective, the matrix G governs the dependence of a parameter on the other parameters, which in turn changes that parameter itself. One can correlate this scenario with many-body interactions with long-range hopping, where the hopping energy from lattice site i to j corresponds to the magnitude of the matrix element G^j_i. When the magnitude of each element of the matrix G is large enough, the hopping energy is large, making the disorder strength decrease, and ergodicity arises. On the other hand, when the magnitude of each element of the matrix G is small enough, the hopping energy is small, making the disorder strength increase; localization arises and ergodicity is lost.
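A sketch of the diffusion matrix D as the covariance of the per-sample gradients, again with an illustrative stand-in loss; it also shows D collapsing to (nearly) the null matrix once the model has learned the training data, as described above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Diffusion matrix D: covariance (over training samples) of the per-sample gradient
# of the loss. A realizable least-squares toy model stands in for the QNN loss.
theta_star = np.array([0.9, -0.3, 0.4, 0.1])
X = rng.normal(size=(2000, 4))                 # per-sample features
y = X @ theta_star                             # noiseless, realizable targets

def per_sample_grads(theta):
    residual = X @ theta - y                   # shape (N,)
    return 2.0 * residual[:, None] * X         # gradient of (x_i . theta - y_i)^2, shape (N, 4)

def diffusion_matrix(theta):
    return np.cov(per_sample_grads(theta), rowvar=False)

print(np.linalg.norm(diffusion_matrix(np.zeros(4))))   # large: SGD is genuinely stochastic
print(np.linalg.norm(diffusion_matrix(theta_star)))    # ~0: gradient noise vanishes once learned
```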
In the work [21], the inverse temperature β is defined in terms of the hyper-parameters of SGD. This motivated us to correlate the Jacobian matrix G with the hopping energy. The correlation provides a holistic phase diagram between the temperature T and the disorder strength W, similar to that of Ising-like models as shown in [49, 50]. The phase diagram provides a deeper understanding of equilibrium versus non-equilibrium systems in the artificial neural network context. The study of equilibrium and non-equilibrium aspects of artificial neural networks has been discussed in [21]. Similar to the assumption in equation 2.12 that the rank-4 tensor A∞ thermalizes, or reaches equilibrium, the diffusion matrix given in 3.2 also thermalizes. Using a treatment similar to [15, 46], the approximated diffusion matrix D∞ can be written down, where the matrix element Ḡ^δ_ζ = ∂θ*_δ/∂θ_ζ denotes the dependency of the complex conjugate of θ_δ on the parameter θ_ζ, and the rank-8 Encoder-Dataset tensor A∞ appearing in it is defined in equation 3.8.

The stochasticity of SGD changes with time, which provides a temporal variation of the magnitude of the perturbation to the Minkowski space. This perturbation of the approximated Diffusion metric can be correlated with the movement of masses on a Riemannian manifold where the parameters form the space-time coordinates. We thus shift the problem from the approximated Diffusion metric with parameters to the trajectory of particles on a Riemannian manifold in the presence of small random masses. The magnitude of these masses is given by the magnitude of the noise in SGD and therefore changes with time; the mass distribution on the Riemannian manifold changes as well. Now, imagine you are told to control the trajectory of a particle from an initial point to its final point by changing the mass distribution. The reward, or aim, of the particle is to visit a larger number of intermediate points while also reaching the target within a reasonable time. The number of intermediate points corresponds to the generalization capability of the neural network, and the time here is the training time the QNN takes to reach the optimal point. In the zero-mass configuration, the parameter trajectory would be a straight line, reaching the target in less training time but with less generalization capability. Changing the mass distribution increases the probability of the particle visiting more intermediate points, thus increasing the generalization capability. Analyzing the temporal distribution of mass therefore becomes important in controlling the particle trajectory so as to maximize its rewards.

Further simplifying the factor Ψ using the definitions of the Encoder-Dataset tensors in equations 2.12 and 3.8, we obtain an expression in terms of cov(a, b), the covariance between two vectors a and b. Using the properties of the Pauli matrices, the quantity Φ₁ can be simplified, and it is important to observe that Φ₁ is independent of the indices (ζ, η); similarly, the quantity Φ₂ is also independent of the indices (ζ, η). Thus the approximated diffusion matrix in 3.9 is a constant matrix with all elements equal to a constant number c, where c = D∞_ηζ for all indexes (η, ζ), as shown in 3.9. The eigenvalues of the matrix D∞ are λ_D = {0, 0, 0, 4c}, and the non-zero eigenvalue has to be a positive quantity. This may not always hold, as it depends entirely on the selected Encoder-Dataset tensor A∞.
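The constant structure of the approximated diffusion matrix and its eigenvalue spectrum {0, 0, 0, 4c} are easy to verify numerically. In the sketch below, the Minkowski signature diag(−1, 1, 1, 1) and the additive form of the perturbed metric are assumptions made for illustration, since the explicit expression for the metric is not reproduced in the text above.

```python
import numpy as np

c = 0.05                                    # the constant element D_inf_{eta zeta} = c
D_inf = c * np.ones((4, 4))                 # constant approximated diffusion matrix

eigvals = np.sort(np.linalg.eigvalsh(D_inf))
print(eigvals)                              # -> approximately [0, 0, 0, 4c]

# Assumed form of the weakly perturbed (diffusion) metric on the 4 parameters:
eta = np.diag([-1.0, 1.0, 1.0, 1.0])        # Minkowski metric of the unperturbed case
eps = 0.5 / eigvals[-1]                     # eps < 1 / max(lambda_D), as required
g = eta + eps * D_inf                       # Minkowski space plus small perturbation
print(g)
```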
Notice that when the rank-4 Encoder-Dataset tensor A∞ has negative values for all of its elements of the form A∞_{pp}^{aa}, with a, p ≤ 4, the inequality 3.12 cannot reach |∆θ| = 0. If the parameters were to optimize completely, i.e. |∆θ| = 0, the inequality would not hold, giving a contradiction. In such cases, a stricter inequality can be evaluated according to the elemental values of the tensor A∞, and thus a limit to the optimization can be evaluated. One can again correlate the inequality 3.12 with particles on a Riemannian manifold where the K = 4 parameters are the four space-time coordinates. In the space-time context, the inequality 3.12 shows that for certain metrics (or Encoder-Dataset tensors), the space-time coordinates can be restricted from reaching their final coordinate, i.e. θ̂, making the difference |∆θ| a non-zero quantity. It is certainly not surprising that such space-time restrictions are quite common, reflecting an interesting correlation with QNN.

At every epoch of training of the QNN, a particular unitary operator U_θ is prepared by the Parameterized Quantum Circuit. Brown and Susskind [37, 44, 51, 52] viewed the preparation of U_θ as a time series of discrete motions of an auxiliary particle on the special unitary SU group space. The particle starts at the identity operator I and ends at a target unitary operator U. The complexity of the unitary operator U_θ is the minimum number of operators required to create U_θ with the given circuit; mathematically, it is given by the geodesic on the SU group space. The QNN employs a unitary operator U_θ at every epoch, thus the complexity of the QNN changes with the epochs. In the QNN context, the final or target unitary operator is U_θ̂, which produces the minimum training error. The particle in group space travels from the initial unitary operator U_θ to the unitary operator U_θ̂. On the other hand, this corresponds to a particle in the Diffusion metric traveling from the initial parameter configuration θ_0 to the optimal parameter set θ̂, as discussed in Diffusion Metric. A consequence of this correspondence is that the geodesic path traveled by the particle in the Diffusion metric can be correlated with the complexity, i.e. the geodesic traveled in the group space. Reflecting the parameterized version of the approximated diffusion metric in equation 3.7, the complexity can be correlated with the parameters. Based on the parameterized complexity, one can further study quantum chaos and complexity in QNN.

The main objective of this section is to establish the complexity [36, 37, 52-58] as a function of the parameters. This is motivated by the parameterized version of the approximated diffusion metric presented in equation 3.7. The parameterized complexity is evaluated by correspondence with the diffusion metric introduced in [35]. Stochastic gradient descent follows a geodesic path on this diffusion metric, as discussed in [35] and given by equation 4.1, where D∞ is the approximated diffusion matrix and measures the degree of stochasticity. When the matrix D∞ becomes a null matrix, equation 4.1 represents simple gradient descent as the learning algorithm. The optimal parameter θ̂, obtained by integrating equation 4.1, can then be expressed accordingly, where T is a hypothetical total training time to reach the optimal parameters from the initial parameter set θ_0. Equation 2.5 in Mathematical Background correlates the parameter set θ with the unitary U_θ.
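As a rough numerical illustration of the complexity-geodesic correspondence sketched above, one can accumulate the line element of a recorded parameter trajectory in an assumed metric. Both the toy SGD trajectory and the (Euclidean plus constant perturbation) metric below are assumptions made for this sketch, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(4)

# Crude proxy for the complexity / geodesic-length correspondence: accumulate the
# line element sqrt(dtheta^T g dtheta) along a recorded SGD-like parameter trajectory.
c, eps = 0.05, 2.0
g = np.eye(4) + eps * c * np.ones((4, 4))   # assumed (Euclidean + perturbation) metric

# A toy trajectory theta_t relaxing towards theta_bar with SGD-like noise.
theta_bar = np.array([0.9, -0.3, 0.4, 0.1])
theta = np.zeros(4)
trajectory = [theta.copy()]
for t in range(300):
    theta = theta - 0.05 * 2.0 * (theta - theta_bar) + 0.01 * rng.normal(size=4)
    trajectory.append(theta.copy())
trajectory = np.array(trajectory)

dtheta = np.diff(trajectory, axis=0)
length = np.sum(np.sqrt(np.einsum("ti,ij,tj->t", dtheta, g, dtheta)))
print("geodesic-length proxy for the complexity:", length)
```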
The trajectory of a particle in group space from the initial unitary operator U_θ corresponds exactly to the trajectory of the parameters in the Diffusion metric from the initial parameter set θ_0 to θ̂, due to the linearity of equation 2.5. Using this correspondence, the evolution of the unitaries in the unitary space follows equation 4.3: starting with an initial unitary operator U_θ, the unitary evolves with the epochs, tending towards the target unitary operator U_θ̂. Susskind [37] discretized the special unitary group space into ε_0-balls, with the auxiliary particle taking discretized steps into these balls to correspond with the evolution of the unitaries. We assume a parameter set belongs to the optimal parameter set, θ ∈ θ̂, when the unitary corresponding to the parameters falls in the ε_0-ball, or in other words, |U_θ − U_θ̂| < ε_0. The work [35] showed that SGD follows a geodesic path in the diffusion metric at every epoch, which also corresponds to the complexity path on the group space. Based on the Complexity-Action conjecture [36, 37, 44, 51, 54-56, 58], we identify the complexity of the unitaries in the group space with the action on the diffusion metric. The Complexity-Action conjecture, as shown in [36, 37, 44, 51], implies that a change of complexity in group space reflects a change of action in the diffusion metric, and vice versa. The action on the diffusion metric is defined in [35] by equation 4.5.

From the Parameterized Quantum Circuit perspective, a change in the complexity will ensure a change in the unitary operator. This change in the unitary operator will cause a change in the parameter configuration in the diffusion metric, thus changing the action in the metric. Using equations 4.3 and 4.5, the unitary as a function of the complexity can be written as in equation 4.6, where V is the potential under which the parameters evolve. On the other hand, using equation 2.5, the gradient of the unitary with respect to a change in the complexity is given by equation 4.7, in terms of a newly introduced quantity. Therefore, equating equations 4.6 and 4.7, one concludes an equation that establishes the distribution of complexity as a function of the parameters, where G^{-1}(I − D∞)_{μν} represents the elemental value of the matrix G^{-1}(I − D∞). The complexity is thus given by the Parameterized Complexity expression in equation 4.10. This expression for the complexity C is difficult to evaluate exactly and analytically; we evaluate the complexity at specific epochs of the learning trajectory, as established in the next sub-section 4.2.

Using [42, 43, 45, 59-61], one can write down the relation C = −log(OTOC) between the OTOC and the complexity, together with OTOC = exp(−exp(λθ)) (4.11), which further implies the combined universal relation C = −log(OTOC) = exp(λθ), which holds particularly well in the context of the quantum description of chaotic phenomena. Here λ is identified as the Quantum Lyapunov exponent of quantum chaos and satisfies the well-known Maldacena Shenker Stanford (MSS) bound [39], λ ≤ 2π/β (equation 4.13), where the late-time equilibrium saturation temperature for maximal chaos is, in the present context, defined by equation (4.14) in terms of the learning rate Γ and the batch size |B| [21]. The equality sign, or the upper bound, corresponds to maximal chaos, where the late-time saturating behaviour can be observed.
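The universal relations and the MSS bound quoted above can be put together numerically. Since the explicit expression (4.14) for the saturation temperature is not reproduced in the text above, β is treated as an input here; all numerical values are illustrative.

```python
import numpy as np

# Universality relations C = -log(OTOC) and OTOC = exp(-exp(lambda * theta)),
# together with the MSS bound lambda <= 2*pi/beta.
beta = 8.0                       # inverse equilibrium temperature (assumed value)
lam = 0.5                        # candidate quantum Lyapunov exponent
theta = np.linspace(0.0, 2.0, 5)

otoc = np.exp(-np.exp(lam * theta))          # eq. (4.11)
C = -np.log(otoc)                            # = exp(lam * theta)
print("complexity:", C)
print("MSS bound respected:", lam <= 2 * np.pi / beta)
```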
Here one can compute the Lyapunov exponent in terms of the complexity, which can essentially be measured from the slope of the log(C) vs θ plot, presented in the later half of this paper. This further implies a corresponding bound on the complexity. The complexity expression in equation 4.10 is difficult to evaluate, so we analyze the complexity at certain critical learning epochs when the system is in a steady state, i.e. when the velocity of the parameters θ̇ = 0. We introduce a steady-state parameter set θ_ss during which the system is stationary. Thus, in the framework of QNN, the Lyapunov exponent is evaluated from the complexity for these epochs by the simplified expression in equation 4.17. We analyse equation 4.17 by taking the quantity p to its extremal values, i.e. first when p → 0 and then when p → ∞. Using this identification we get a simplified expression for the Lyapunov exponent in terms of a new function K(θ), given by equation (4.20); during this simplification we have used integration by parts in the second step.

This paper considers two extremal situations, i.e. when the quantity p(θ) → 0 and when p(θ) → ∞. First, a conditioned analysis of the stability of the system is performed using p(θ) → 0, for which the following observations can be made:
1. If p′(θ) is a finite negative quantity at the critical parameter set θ*, where p(θ*) = 0, then the Lyapunov exponent tends towards 0⁻, and the system stabilises with oscillations, or limit-cycle behaviour, in the phase space.
2. If p′(θ) is a finite positive quantity at the critical parameter set θ*, where p(θ*) = 0, then the Lyapunov exponent tends towards 0⁺, and the chaotic nature of the system arises with unstable limit cycles.
3. If p′(θ) = 0, then K(θ) can be further simplified; in this case K(θ) → −∞ and thus the Lyapunov exponent λ → 0⁻, stabilising the system with limit cycles.
So, in the limit p(θ) → 0, the system can inherently execute stable or unstable limit cycles in the phase space.

Now let us consider the other limiting condition, p(θ) → ∞, for which the Lyapunov exponent takes a simplified form that can be generalized to the set of Quantum Lyapunov exponents λ_μ = 1/(θ_ss − θ_0)_μ. It is important to note that this result is independent of p′(θ). Using equation 4.23, it is evident that when θ → θ̂, the Lyapunov exponent λ → ∞. However, the maximal Lyapunov exponent for maximally chaotic phenomena is set by the MSS bound [39, 45] (see the previous discussion). Thus the relative difference δθ = θ_ss − θ_0 is constrained, using equations 4.23 and 4.13, as δθ ≥ β/2π (inequality 4.24). The above inequality shows that the rate of optimization of QNN is restricted, and the maximum optimization rate corresponds to the maximally chaotic system. An optimization limit was already claimed in inequality 3.12, where the conditions were applied to the neural architecture and the observation matrix but held for the whole learning trajectory. Here, in inequality 4.24, no conditions are imposed on the neural architecture or the observation matrix, but the analysis applies at particular learning epochs of the training phase.
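A sketch of extracting the Lyapunov exponent from the slope of log(C) versus θ on synthetic complexity data, and of the resulting restriction on |θ_ss − θ_0| obtained by combining λ_μ = 1/(θ_ss − θ_0)_μ with the MSS bound; the complexity data below are generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic complexity curve C(theta) with multiplicative noise.
lam_true, beta = 0.4, 8.0
theta = np.linspace(0.1, 3.0, 40)
C = np.exp(lam_true * theta) * np.exp(0.02 * rng.normal(size=theta.size))

lam_est = np.polyfit(theta, np.log(C), 1)[0]     # slope of log(C) vs theta
print("estimated Lyapunov exponent:", lam_est)

# lambda = 1/delta_theta combined with the MSS bound lambda <= 2*pi/beta
# gives the minimum residual deviation delta_theta >= beta / (2*pi).
delta_theta_min = beta / (2 * np.pi)
print("minimum |theta_ss - theta_0|:", delta_theta_min)
```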
Optimization for an artificial neural network (ANN) with a large number of trainable parameters is empirically easy; such a network can fit any training dataset, resulting in zero training error. But zero training error doesn't ensure the generalization capability of the neural network [62-64]. When a neural network generalizes over a dataset, it understands the underlying structure of the dataset and thus reduces the difference between the training error and the testing error, called the generalization error. Not much has been discussed in the context of generalization or optimization for QNN as compared to ANN. Recently, [65] showed that a QNN with the same structure as the corresponding ANN will have better generalization properties. In this paper, we have mainly focused on the optimization property of QNN: we focused on the trajectory of the parameters towards the optimal parameter set in the learning manifold. But this addresses only one half of the picture, the training error.

We introduce the generalization capability of QNN inspired by the notion of generalization in ANN, as discussed in [14, 17, 18, 62, 63, 65]. As the QNN framework used in the paper is a quantum-classical hybrid, where the parameters are optimized by classical SGD, we can use the concepts of generalization from ANN. Thereby, we use the fact that the generalization capability of a neural network is associated with the variance of the parameters [14, 21]: higher generalization capability corresponds to higher variance. Intuitively, this takes into account the fact that a higher variance of the parameters increases the probability of finding a better optimal point, which would result in better generalization capability. Thus the variance of the parameters is a measure of the generalization of neural networks. The variance of the parameters when the system is in a steady-state condition can be bounded using, in the last step, the well-known Cauchy-Schwarz inequality; after working it out a bit, we derive the bound on the variance given in inequality 4.25, where θ̄ denotes the parameters averaged over their indexes.

The inequality 4.25 shows that when the Lyapunov exponent is minimum, λ → 0, the generalization capability is at its maximum. That is, when the system shows limit cycles with λ → 0 in phase space, the generalization capability reaches its maximum. Interestingly, the work [24] also argued that these oscillations in phase space are a crucial part of the stability of continuous memories in the human brain; the inequality 4.25 gives a theoretical perspective on this argument. Moreover, inequality 4.25 also shows that with an increase in the inverse temperature β, the generalization capability increases. Correlating with the phase diagram [49, 50], this corresponds to many-body localization or non-equilibrium states, as the work [21] showed for an artificial neural network. Along with oscillations, [24] argued that coherent phases like many-body localization also play a pivotal role in the stability of continuous memories. On the other hand, increasing the inverse temperature also corresponds to a slower convergence rate, as shown by equation 2.9. Thus there is a trade-off between convergence rate and generalization capability, as intuitively mentioned previously. The change in the nature of the Lyapunov exponent is due to the matrix G^{-1}: for p → 0, the matrix G → ∞ and the system can execute stable or unstable limit cycles, depending on the gradient p′(θ), with no significant chaos and maximum generalization capability.
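A toy illustration of the variance-generalization link discussed above: an ensemble of noisy SGD runs relaxing towards the optimal parameters, with the noise scale standing in for the effective SGD temperature. The dynamics and numbers are illustrative only and do not implement the bound of inequality 4.25.

```python
import numpy as np

rng = np.random.default_rng(6)

# Ensemble of noisy SGD runs around the optimal parameters; larger noise (higher
# effective temperature) yields a larger steady-state parameter variance, taken
# here as a crude proxy for the generalization capability.
theta_hat = np.array([0.9, -0.3, 0.4, 0.1])

def steady_state_variance(noise_scale, runs=200, steps=500, Gamma=0.05):
    finals = []
    for _ in range(runs):
        theta = np.zeros(4)
        for t in range(steps):
            theta = theta - Gamma * 2.0 * (theta - theta_hat) \
                    + noise_scale * rng.normal(size=4)
        finals.append(theta)
    return np.var(np.array(finals), axis=0).mean()

for noise in (0.001, 0.01, 0.05):      # larger noise ~ higher effective temperature
    print(noise, steady_state_variance(noise))
```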
But for p → ∞, using the Lyapunov exponent, the complexity C [59] can be written in terms of θ_ss,μ, the μ-th component of the steady-state parameter θ_ss, and a constant of integration k. The out-of-time-order correlator OTOC [42, 43, 45, 60, 61] is given by OTOC = exp(−C), so the OTOC can also be represented directly in terms of the parameters; the scrambling time t*, as shown in [37, 41-43, 61, 66], and the entropy S, as shown in [37, 44, 59, 61], then follow from this complexity. From Figure 4.1, it is important to observe that as the eigenvalues of the Encoder-Dataset tensor increase, the maximum complexity of the system also increases; in this way, the role of the Encoder-Dataset tensor A∞ can be interpreted.

The paper uses Parameterized Quantum Circuits (PQCs) in the hybrid quantum-classical framework to perform optimization of quantum data with a classical gradient-based learning algorithm, as a quantum analog of [46]. In doing so, the relation between the learning dynamics and the neural architecture of QNN is established. This relation is also used to establish the dependency of the noise in SGD on the neural architecture of QNN, using the Diffusion metric. Using the definition of complexity [36, 37, 52-58], the paper establishes the dependence of the complexity on the parameters. The parameterized Lyapunov exponent has been derived, which estimates the stability of the system. The MSS bound [39] on the maximum Lyapunov exponent establishes a lower bound on the degree of the optimization rate of the parameters. The paper also proves that when the system executes limit cycles or oscillations in the phase space, the generalization capability of QNN is maximized. This is consistent with the biological notion, argued by [24], that oscillations in phase space are important for the stability of the formation of continuous memories. The important contributions or results of the paper can be listed as follows:
• Correlation between the evolution of unitary operators in the unitary space and the trajectory of parameters in the Diffusion metric.
• Establishing Complexity, Lyapunov exponent, OTOC, and Entropy as functions of the parameters of QNN.
• Estimating the stability of QNN using the Lyapunov exponent.
• Proving that a QNN with limit cycles or oscillations in phase space will have maximum generalization capability.
• A lower bound on the optimization rate has been determined using the MSS bound.

Moreover, although neuroscience holds the fundamental architecture of neural networks, and despite the proposal of quantum processing in neurons by Fisher [30], not much progress has been made in understanding learning systems like human cognition from the perspective of quantum chaos and learning manifolds. Thus it becomes important not only to appreciate the application capability of QNN but also to analyze quantum learning systems through the lens of the statistical learning of QNN. A possible way of connecting the human brain with the models of neuroscience is to correlate the famous Hodgkin-Huxley model [25] with the parameters' trajectory. Reverse engineering the QNN model that would correspond to the Hodgkin-Huxley model can give much insight into the mechanism of the human brain.
Quantum supremacy using a programmable superconducting processor
Guang-Can Guo, and Guo-Ping Guo. 64-qubit quantum circuit simulation
Commercialize early quantum technologies
Quantum computing in the NISQ era and beyond
Quantum Coherence in logical quantum channels
A quantum-computing advantage for chemistry
Training deep quantum neural networks
Quantum perceptron models
Implementing perceptron models with qubits
Quantum gradient descent and Newton's method for constrained polynomial optimization
Quantum circuit learning
Parameterized quantum circuits as machine learning models
Deep Learning
Entropy-SGD: Biasing gradient descent into wide valleys
Geometry perspective of estimating learning capability of neural networks
Estimating information flow in deep neural networks
Opening the black box of deep neural networks via information
Emergent properties of the local geometry of neural loss landscapes
Sharp minima can generalize for deep nets
An analytic theory of generalization dynamics and transfer learning in deep linear networks
Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks
Information scrambling in quantum neural networks
Information flow in entangled quantum systems
Nonequilibrium landscape theory of neural networks
A quantitative description of membrane current and its application to conduction and excitation in nerve
Is there chaos in the brain? II. Experimental evidence and related models
Oscillations and chaos in neural networks: an exactly solvable model
Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos
Robust chaos in neural networks
Quantum cognition: The possibility of processing with nuclear spins in the brain
Neural architecture search with reinforcement learning
Random search and reproducibility for neural architecture search
Efficient neural architecture search via parameter sharing
A geometric interpretation of stochastic gradient descent using diffusion metrics
Holographic complexity equals bulk action?
Three lectures on complexity and black holes
Complexity, action, and black holes
A bound on chaos
Black holes and the butterfly effect
Slow scrambling in disordered quantum systems
Measuring the scrambling of quantum information
Unscrambling the physics of out-of-time-order correlators
Second law of quantum complexity
The cosmological OTOC: Formulating new cosmological micro-canonical correlation functions for random chaotic fluctuations in out-of-equilibrium quantum statistical field theory
Predictability, complexity, and learning
Universal quantum computation in thermal equilibrium
Approximation capabilities of multilayer feedforward networks
Many-body localization and quantum thermalization
Quantum phase transitions
Subsystem complexity and holography
Complexity and shock wave geometries
Models of quantum complexity growth
Holographic complexity equals which action
Aspects of the first law of complexity
Comments on holographic complexity
Circuit complexity for coherent states
Circuit complexity in quantum field theory
The generalized OTOC from supersymmetric quantum mechanics: Study of random fluctuations from eigenstate representation of correlation functions
Measuring signatures of quantum chaos in strongly-interacting systems
Quantum aspects of chaos and complexity from bouncing cosmology: A study with two-mode single field squeezed state formalism
Understanding generalization in deep learning via tensor methods
Understanding deep learning requires rethinking generalization
Exploring generalization in deep learning
Generalization study of quantum neural network
Towards the fast scrambling conjecture
The research fellowship of SC is supported by the J. C. Bose National Fellowship of Sudhakar Panda. SC also takes this opportunity to sincerely thank Sudhakar Panda for his constant support and for providing huge inspiration. SC would also like to thank the School of Physical Sciences, National Institute for Science Education and Research (NISER), Bhubaneswar for providing a work-friendly environment. SC particularly wants to give separate credit to all the members of EINSTEIN KAFFEE Berlin Alexanderplatz for providing a work-friendly environment, good espresso shots, and delicious chocolate and caramel cakes and cookies, which helped in writing most of this paper in that coffee shop over the last few months. SC also thanks all the members of our newly formed virtual international non-profit consortium "Quantum Structures of the Space-Time & Matter" (QASTM) for elaborate discussions. SC would also like to thank all the speakers of the QASTM zoominar series from different parts of the world (for the uploaded YouTube link see: https://www.youtube.com/playlist?list=PLzW8AJcryManrTsG-4U4z9ip1J1dWoNgd) for supporting this research forum by giving outstanding lectures and their valuable time during the COVID pandemic. Last but not least, we would like to acknowledge our debt to the people belonging to various parts of the world for their generous and steady support for research in the natural sciences.