title: Extraction of deterministic components for high frequency stochastic process -- an application from CSI 300 index
authors: Hui, Xianfei; Sun, Baiqing; Zhou, Yan; SenGupta, Indranil
date: 2022-04-06

This paper models the stochastic process of the price time series of the CSI 300 index in the Chinese financial market and analyzes the volatility characteristics of intraday high-frequency price data. In the new generalized Barndorff-Nielsen and Shephard model, the lag caused by the asynchrony of market information is considered, and the problem of the lack of long-range dependence is solved. To speed up the valuation process, several machine learning and deep learning algorithms are used to estimate parameters and evaluate forecast results. Tracking historical jumps of different magnitudes offers promising avenues for simulating dynamic price processes and predicting future jumps. Numerical results show that the deterministic component of the stochastic volatility process is captured over both short-term and longer-term windows. The findings may be of interest to investors and regulators who predict market dynamics based on realized volatility.

As is well known, financial fluctuations may come not only from the financial system itself, but also from other aspects of social and economic life. For example, COVID-19 has caused frequent and violent fluctuations in global financial markets (see [1] and [10]). In the post-COVID-19 era, affected by internal and external factors in the market, the prices of financial assets were unstable during the first half of 2021. Facing a more dynamic economic situation, practitioners and researchers are realizing the importance of the challenges and opportunities presented by financial fluctuations.
The volatility of a financial asset, i.e., the intensity of changes in its rate of return over a period of time, is unobservable [11]. The measurement of volatility, which describes the potential deviation from the expected value, is the core issue in the study of financial volatility. Accurate prediction of financial volatility is a key factor for successful financial asset pricing (see [28], [21]), economic forecasting [9], risk management [3], portfolio optimization [17], and quantitative investment [8]. Volatility analysis of financial time series is a practical method to study the laws of volatility and to estimate it. In recent years, new progress has been made in volatility estimation in high-frequency environments. A large body of literature focuses on three directions of high-frequency volatility estimation, namely, model establishment [26], model evaluation [4], and model application [7]. The asset price process [12] (continuous, with finite jumps, or with Lévy jumps) and the data characteristics [16] (whether there is microstructure noise and whether regular sampling is implemented) are the two main considerations in this field. A jump, i.e., an excessive movement of the asset price within a short period of time, is one of the key issues in research on asset price dynamics. Theoretically, when there are no jumps in the asset price, realized volatility is an unbiased and consistent estimator of the latent volatility. However, jumps in price volatility are widespread in capital markets. Jumps lead to consistent overestimation of the continuous component, so that realized volatility and realized range volatility are no longer unbiased and consistent estimators of the latent volatility.
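Concretely, the jump contribution can be isolated by comparing realized variance with a jump-robust estimator. A minimal sketch, using the standard bipower-variation scaling (the return values below are illustrative, not from the paper's data set):

```python
import math

def realized_variance(returns):
    """Sum of squared intraday log-returns; estimates total variation."""
    return sum(r * r for r in returns)

def bipower_variation(returns):
    """Jump-robust estimator: (pi/2) * sum of |r_i| * |r_{i+1}|."""
    mu1 = math.sqrt(2.0 / math.pi)  # E|N(0,1)|
    return mu1 ** -2 * sum(abs(a) * abs(b)
                           for a, b in zip(returns, returns[1:]))

def jump_component(returns):
    """Nonnegative jump part of realized variance."""
    return max(realized_variance(returns) - bipower_variation(returns), 0.0)
```

In the absence of jumps the two estimators converge to the same limit, so their difference isolates the jump part.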
In response to this jump phenomenon, the estimator of realized bipower variation, first proposed by Barndorff-Nielsen and Shephard, was used to decompose realized variation into continuous and jump components [6]. The Barndorff-Nielsen and Shephard (BN-S) model [5], which describes the random behavior of the price process in the field of non-parametric methods for high-frequency time series, is a popular stochastic volatility model with a Lévy process as the driving factor of the financial asset price. From an academic point of view, the classic BN-S model has many attractive properties. But its theoretical assumptions are not fully satisfied in many application scenarios. Problems such as the lack of long-range dependence may lead to the failure of the model in use. Recently, a variety of improvements to the basic model have been proposed, generalized BN-S models have been constructed, and applications in multiple dimensions, such as jump capture [19, 20], pricing [22, 23], and risk management [2, 25], have been implemented for random fluctuation processes of asset prices. Artificial intelligence in the big-data environment provides new tools for financial research and enriches previous research on volatility estimation [18]. Over the past few years, data-processing classifiers based on machine learning and deep learning have shown excellent performance in the field of financial prediction. Compared with machine learning [15], deep learning [30] has stronger optimization capabilities and more advantages when dealing with big data sets. In research on asset volatility under random uncertainty, the classic BN-S model, which includes a single OU process, is often constructed in the previous literature. However, the model fails in application due to the lack of long-range dependence.
Some existing studies solve this problem by superimposing OU processes, but the actual economic significance of the different stochastic processes is rarely considered. The generalized BN-S model has been used by many authors to study the volatility of daily sampled commodity prices in the US and European markets. There are fewer relevant studies on the Asia-Pacific market, and fewer still that use a one-minute sampling frequency. As one of the fastest-growing markets in the world, the Asia-Pacific securities market has attracted more and more attention [31]. The CSI 300 index covers most of the domestic market value of China (the largest economy in the Asia-Pacific region) and reflects the market's mainstream investment returns and changes in the trader structure. Using intraday samples with a high sampling frequency retains more market information and reveals more detailed fluctuation characteristics caused by the impact of various information on the market. This research focuses on the price dynamics of the CSI 300 index with a sampling frequency of 1 minute, and uses the generalized BN-S model to quantitatively analyze the volatility process of the financial time series and to capture the deterministic component of the random process of price fluctuations. The impact of overnight information [27] on the market is removed in data preprocessing. Our approach to exploring the stochastic process of asset price dynamics has several advantages. First, the deterministic element (θ) in the new model helps us fit stock index prices and dynamic volatility in a correlated but distinct way. Because a superposition of Lévy processes is considered, the lack of long-range dependence in the classical BN-S model is resolved.
Second, the new model realizes the estimation of the delay parameter (b) in the case where the jump in volatility, caused by a sluggish market response, is not synchronized with the jump in the asset price. Finally, the new model can be used to capture the deterministic components of the intraday price volatility of the CSI 300 stock index. It is easy to estimate the dynamic deterministic parameter with the help of machine learning and deep learning algorithms. This shows the application of data science in obtaining "deterministic components" from processes that are generally considered to be completely random. In general, the results offer promising avenues for simulating dynamic price processes and predicting future jumps. Numerical results show that the deterministic component of the stochastic volatility process is captured over both short-term and longer-term windows. The findings may be of interest to investors and regulators who predict market dynamics based on realized volatility. The paper is organized as follows. In Section 2, the generalized BN-S model is introduced. In Section 3, high-frequency CSI 300 stock index price data are selected as the research sample, the high-frequency financial time series are preprocessed, descriptive statistics of the data set are obtained, and the distribution of price fluctuations is analyzed. Based on the results obtained in Section 3, a deterministic component of the high-frequency price stochastic process is derived in Section 4 by using machine learning and deep learning algorithms for parameter analysis and estimation. In Section 5, a brief conclusion is provided. Financial time series of different assets share many common features (heavy-tailed distributions of log-returns, aggregational Gaussianity, quasi long-range dependence). Many of these facts are successfully captured by stochastic models with Lévy processes.
Lévy processes can be used to characterize the dynamic changes of financial asset price time series with jump processes. The Barndorff-Nielsen and Shephard (BN-S) model, a widely used stochastic model with Lévy processes, describes the stochastic behavior of random time series in the field of nonparametric methods for high-frequency time series. A brief introduction to this model is given as follows. Consider a frictionless financial market in which a risk-free asset with a constant rate of return r and a stock are traded up to a fixed horizon date T. The classical BN-S model assumes that the price process of a stock (or a commodity) S = (S_t)_{t≥0}, defined on a filtered probability space (Ω, F, (F_t)_{0≤t≤T}, P), is given by

S_t = S_0 exp(X_t),   (2.1)

where the log-return X_t is given by

dX_t = (μ + β σ_t²) dt + σ_t dW_t + ρ dZ_{λt},   (2.2)

where σ_t is the volatility at time t, the parameters μ, β, ρ ∈ R, and ρ ≤ 0. The variance process is given by

dσ_t² = −λ σ_t² dt + dZ_{λt},   σ_0² > 0,   (2.3)

where λ > 0. With respect to the probability measure P, the process W = (W_t) is a standard Brownian motion. Observe that the Ornstein-Uhlenbeck process in this model is driven by an increasing Lévy process and is a positive, mean-reverting random process. The process Z = (Z_{λt}) is the subordinator (also known as the background driving Lévy process or "BDLP"). The processes W and Z are independent of each other. Also, (F_t) is the usual augmentation of the filtration generated by (W, Z). Clearly, the process σ² = (σ_t²) is strictly positive. The classical BN-S model has excellent performance in describing the dynamic response of stable asset prices over a short time. It is commonly used to capture stylized features of time series observed in financial markets, such as semi-heavy tails, aggregational Gaussianity, quasi long-range dependence, and self-similarity. The topic of delayed response in the financial market has been studied in papers such as [13].
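As an illustrative sketch (all parameter values below are assumptions, not estimates from this paper), the classical variance process (2.3) can be simulated with an Euler scheme, taking Z to be a compound Poisson subordinator with exponential jump sizes:

```python
import math
import random

def poisson_sample(rng, mean):
    """Knuth's method; adequate for the small per-step means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_bns_variance(sigma2_0=0.04, lam=1.5, nu=2.0, jump_mean=0.01,
                          horizon=1.0, n_steps=1000, seed=42):
    """Euler scheme for d(sigma_t^2) = -lam * sigma_t^2 dt + dZ_{lam t},
    with Z a compound Poisson subordinator of intensity nu and mean
    jump size jump_mean (illustrative assumptions throughout)."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    sigma2, path = sigma2_0, [sigma2_0]
    for _ in range(n_steps):
        sigma2 -= lam * sigma2 * dt                     # mean reversion
        for _ in range(poisson_sample(rng, nu * lam * dt)):
            sigma2 += rng.expovariate(1.0 / jump_mean)  # positive jumps
        path.append(sigma2)
    return path
```

Because the subordinator only adds positive jumps, the simulated variance path stays strictly positive, in line with the model.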
On the other hand, the study in [29] handles this problem with a delayed price formula, where the price volatility takes the form σ(S_{t−b}), for some delay parameter b > 0. However, the parameter b there is also stochastic, which makes the resulting model unnecessarily involved. The results in [14] show that a superposition of Ornstein-Uhlenbeck (OU) type processes can achieve long-range dependence. A superposition of Lévy subordinators successfully fits the asynchronous changes of price and volatility in an interrelated but independent way. Following the previous results in [24], the structure of a generalized BN-S model is introduced as follows. The key point of our research is to capture deterministic components out of high-frequency price stochastic processes. As proposed in [24], suppose Z_t and Z*_t, with the same (finite) variance, are two independent Lévy subordinators. There exists a Lévy subordinator Z̃_{λt}, independent of W, such that

Z̃_{λt} = ρ̃ Z_{λt} + (1 − ρ̃) Z*_{λt}.

For 0 ≤ ρ̃ ≤ 1, Z and Z̃ are positively correlated Lévy subordinators. Assume that the dynamics of S_t are given by (2.1) and (2.2), where σ_t is given by

dσ_t² = −λ σ_t² dt + dZ̃_{λt}.   (2.6)

In (2.6), the OU process Z̃ = (Z̃_{λt}) is related to the corresponding Z in (2.3) and is also independent of W. In the following study, the delay parameter b and the long-range dependence property of the model are considered. As shown in [24], the price S = (S_t)_{t≥0} on a risk-neutral filtered probability space (Ω, F, (F_t)_{0≤t≤T}, P) is modeled by (2.1), and a convex combination of two independent subordinators Z and Z^(b) is implemented to express the dynamics of X_t in (2.2) by

dX_t = (μ + β σ_t²) dt + σ_t dW_t + ρ ((1 − θ) dZ_{λt} + θ dZ^(b)_{λt}),   (2.7)

where 0 ≤ θ ≤ 1 is a deterministic parameter and λ > 0 is the proportionality parameter. The processes Z_{λt} and Z^(b)_{λt} are independent Lévy processes. Compared to Z_{λt}, the process Z^(b)_{λt} corresponds to the greater Lévy intensity.
For instance, if the Lévy densities of Z and Z^(b) are given by ν₁ a e^{−ax} and ν₂ a e^{−ax}, respectively (for a > 0, ν₁ > 0, ν₂ > 0, and x > 0), then ν₂ > ν₁. Also, in (2.7) the processes W, Z and Z^(b) are independent, and (F_t) is the usual augmentation of the filtration generated by (W, Z, Z^(b)). The variance process corresponding to (2.3) in this case is given by

dσ_t² = −λ σ_t² dt + (1 − θ′) dZ_{λt} + θ′ dZ^(b)_{λt},   (2.8)

where θ′ ∈ [0, 1] is a deterministic parameter. For simplicity, assume θ′ = θ for the rest of this paper. The sum of (1 − θ′)dZ_{λt} and θ′dZ^(b)_{λt} is a Lévy process which is positively correlated with Z_{λt} and Z^(b)_{λt}. After a simple calculation, the solution of (2.8) can be explicitly written as

σ_t² = e^{−λt} σ_0² + ∫_0^t e^{−λ(t−s)} ((1 − θ) dZ_{λs} + θ dZ^(b)_{λs}).   (2.9)

This enforces positivity of σ_t². Thus, the process σ_t² is strictly positive and bounded from below by the deterministic function e^{−λt} σ_0². The instantaneous variance of the log-returns can be computed from (2.7). The short-range-dependence problem of the classical BN-S model is improved in the new model: the dynamics given by the new model incorporate long-range dependence. Assume that J_Z is the jump measure associated with the subordinator Z, and J_{Z^(b)} the jump measure associated with Z^(b). Considering the log-returns of the classical BN-S model and of the new model, the covariances of X_t and X_s, t > s, are given by (2.10) and (2.11), respectively (see [24] for the explicit expressions). When s takes a fixed value, for the classical BN-S model, Corr(X_t, X_s) rapidly becomes smaller as t increases. Such attenuation may cause the failure of the classical model in applications with a long time span. It can be seen that the classical BN-S model, when used to fit the random fluctuation process of risky assets, may produce inaccurate volatility simulation results, affected by the change of the time parameter t. On the other hand, for the new model the covariance of the log-returns X_t and X_s is given by (2.11). Owing to the value of the parameter θ, Corr(X_t, X_s) will never become "too small", because the value of t has an upper limit when s takes a fixed value.
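Returning to the variance dynamics: a minimal simulation sketch (again with purely illustrative parameters) of the generalized process (2.8) uses the convex combination (1 − θ)dZ + θdZ^(b), giving Z^(b) the greater Lévy intensity (ν₂ > ν₁). The discrete path respects the deterministic floor implied by (2.9):

```python
import math
import random

def poisson_sample(rng, mean):
    """Knuth's method for a Poisson sample with small mean."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_generalized_variance(sigma2_0=0.04, lam=1.5, theta=0.3,
                                  a=50.0, nu1=1.0, nu2=3.0,
                                  horizon=1.0, n_steps=1000, seed=7):
    """Euler scheme for d(sigma_t^2) = -lam*sigma_t^2 dt
       + (1-theta) dZ_{lam t} + theta dZ^(b)_{lam t},
    where Z and Z^(b) are compound Poisson subordinators with Exp(a)
    jump sizes and intensities nu1 < nu2 (values assumed, as in the
    exponential Levy-density example of the text)."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    sigma2, path = sigma2_0, [sigma2_0]
    for _ in range(n_steps):
        sigma2 -= lam * sigma2 * dt
        for weight, nu in ((1.0 - theta, nu1), (theta, nu2)):
            for _ in range(poisson_sample(rng, lam * nu * dt)):
                sigma2 += weight * rng.expovariate(a)
        path.append(sigma2)
    return path, dt
```

In the discrete scheme the deterministic floor is σ_0²(1 − λΔt)^n, the Euler analogue of the bound e^{−λt}σ_0² in (2.9).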
This is the main difference between the results of (2.10) and (2.11). The CSI 300 index is widely considered to have strong market representation. It covers most of China's domestic circulating market value and reflects the overall trend of China's Shanghai and Shenzhen markets. In particular, its constituent stocks include many mainstream investment stocks with market representation, liquidity, and trading activity. So it is often used to study the returns of mainstream investments and changes in financial price fluctuations in the market. The main purpose of this paper is to explore the volatility characteristics of intraday high-frequency price data, and then to study the quantitative indicators of the random fluctuation process of the financial time series. Considering the generality and breadth of application of the new model introduced in the previous section, the CSI 300 index price is taken as the empirical data for the analysis, and the corresponding intraday high-frequency data are selected as the research sample. Selecting research samples with a higher sampling frequency is conducive to maximally retaining market information. The intraday closing price data of the CSI 300 index on consecutive trading days from January 1, 2021 to June 30, 2021 are considered as the sample. The sampling frequency of this sample is 1 minute. The data set contains a total of 28,320 observations over 118 consecutive trading days (data source: Wind). The fluctuation curve of the historical data over time is shown in Figure 1. It is necessary to discuss the distribution characteristics of intraday price changes and yield fluctuations, which helps us explore the basic laws of the CSI 300 stock index time series fluctuations. In order to study the trend and distribution characteristics of the CSI 300 stock index over different time intervals, an intuitive way is chosen to visually analyze the data structure.
Figure 2 shows the moving average curves of the CSI 300 index under different time spans. Normally, the trading hours of the CSI 300 index are 9:30-11:30 and 13:00-15:00 Beijing time on each working day (the effective trading time per day is 4 hours in total). So four time spans of 1 minute, 30 minutes, 120 minutes (half a day), and 240 minutes (1 day) are chosen to observe the data set. In Figure 2, blue represents the price change curve per minute, and red represents daily price fluctuations. It can be clearly seen that the general trends of the two curves are similar, but they rarely coincide. The blue curve fluctuates more sharply than the red one, which shows that the intraday high-frequency data contain more market information than the closing price. The yellow line (representing the price change every 30 minutes) overlaps the blue line much more than the red line does. The green curve, which represents price changes every 2 hours, is more stable than the yellow line. These results confirm the above view, namely that a data set at a higher sampling frequency is more effective for finding the realized volatility estimator. Compared with previous studies, the data set used in this paper has more advantages. A lot of misleading information exists in unprocessed empirical data for various reasons. If unprocessed data are used directly, undesirable results such as a decrease in the prediction accuracy of the time series may follow. Therefore, the observed samples should be filtered before data analysis. Compared with the fluctuations in daily stock index yields and trading volume, the impact of overnight information on the market should not be ignored. Most of the price changes due to overnight information are concentrated within one minute of the opening.
In other words, price fluctuations within one minute of the opening do not represent changes in stock index fluctuations throughout the trading day. In order to avoid shocks to intraday fluctuations and abnormal data (such as high kurtosis, increasing outliers, etc.), the data within 1 minute of the opening should be excluded from the daily observation sample. After the overnight information has been digested by the market, the empirical data reflect the daily operation of the CSI 300 index price more accurately. In addition, we remove outliers and zero-value data from the observed data to keep the observed sample tidy. After preprocessing the empirical data according to the above filter conditions, the usable sample data are obtained (28,081 observations in total). The rejection rate of the sample data is 0.84%. This shows that the observed samples have both liquidity and validity, and that the intraday high-frequency price information is effectively preserved in the empirical data. The descriptive statistics of the sample are reported in Table 1. The fluctuation characteristics of CSI 300 prices can be seen from the statistical results in Table 1. We believe this is related to the sampling frequency: at the one-minute sampling frequency, the kurtosis of the sample is less than that of a normal distribution. The generalized BN-S model of Section 2 is suitable for discussing these data characteristics, because the Lévy processes in the model can characterize the dynamic changes of financial asset price time series with jump processes. Generally, the high-frequency price has a smaller range of changes than the low-frequency price over the same period of time. For example, the price change per minute is often smaller than the change every two minutes in the same upward trend.
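The filtering just described can be sketched as follows. The minute-label convention and the z-score outlier rule here are illustrative assumptions (the paper does not specify its exact outlier criterion):

```python
from statistics import mean, stdev

# First minute after each session open, assuming Beijing-time sessions
# 9:30-11:30 and 13:00-15:00 with bars stamped at the interval end.
OPENING_MINUTES = {"09:31", "13:01"}

def preprocess(observations, z_cut=5.0):
    """observations: list of (hhmm_label, closing_price) tuples.
    Drops the first post-open minute, zero/negative prices, and then
    z-score outliers beyond z_cut standard deviations."""
    kept = [(t, p) for t, p in observations
            if t not in OPENING_MINUTES and p > 0]
    prices = [p for _, p in kept]
    if len(prices) < 2:
        return kept
    m, s = mean(prices), stdev(prices)
    if s == 0:
        return kept
    return [(t, p) for t, p in kept if abs(p - m) <= z_cut * s]
```

Applied to the full sample, a filter of this kind should reject only a small fraction of observations, consistent with the 0.84% rejection rate reported above.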
In order to observe small changes in high-frequency data, the percentage of price change is more suitable as observational data than the price itself in the analysis of the volatility distribution of high-frequency data. The histogram of the CSI 300 price change percentage is shown in Figure 5. It can be seen that the CSI 300 price change statistics per minute, which do not follow the normal distribution, are mainly concentrated around 0 (both positive and negative values exist). The graph is skewed to the right, which indicates that there are more rising observations than falling ones in the overall sample. Volatility jumps with different amplitudes and frequencies exist in the CSI 300 sample. In the following section, the information on data characteristics shown in the above charts is used for learning and parameter estimation on the empirical data. Using the data analysis results of Section 3, the value of θ in the generalized BN-S model of Section 2 is found, and the deterministic component in the random process of high-frequency price fluctuations is captured in this section. In order to achieve these goals, a classification problem based on the historical data set is created by implementing the following steps.
Step 1. Index the available historical price data and price change percentage data per minute of the CSI 300 in chronological order.
Step 2. According to the data fluctuation characteristics obtained in Section 3, create new data structures from the historical data sets. Take the percentage changes of the closing price for 10 consecutive minutes as a row subset, stacking layer by layer. Divide the empirical data according to this rule to form a new CSI 300 index price data matrix.
Step 3.
Consider the volatility of the closing price per minute in the historical price data of the CSI 300 stock index, and determine the value of K that defines the big jumps in the high-frequency closing price fluctuations. Each time the closing price is K lower than that of the previous minute, it needs to be flagged (for example, if K = 0.1%, the date and time when the closing price of the CSI 300 index is 0.1% lower than the previous minute's price should be marked).
Step 4. Create a target column θ in the new data matrix and assign values. If there are at least two big jumps in the next 10 minutes, the parameter θ in the target column of the row is 1. Otherwise, the θ corresponding to the row is 0.
Step 5. Use several different machine learning and deep learning algorithms to learn from the empirical data sets and estimate the value of θ. Substitute the obtained value of θ into (2.8) in Section 2; this means that the deterministic component of the CSI 300 fluctuation random process is captured.
The variables involved in the above steps can be adjusted according to the characteristics of the data set. The adjustments may follow the reminders below.
1. In Step 2, if a multi-dimensional data structure is created by adjusting the division of the data, the effectiveness of the result will improve. In general, the more elements contained in a row subset, the more information is carried in the new matrix, and the accuracy of the result may improve. At the same time, this also increases the computational workload and reduces the predictable time span.
2. In Step 3, adjusting the value of K is an effective way to improve the results. Different values of K are suitable for different retrieval targets. Generally, the shorter the time, the smaller the change range of the observed data. A higher sampling frequency is often suited to a smaller value of K, which can identify more big jumps in the same period.
3.
In Step 4, the value of θ is related to the number of big jumps identified in a period. In the same data set, the more subset elements selected in Step 2, the more big jumps recognized in each row of the matrix, and the greater the possibility of θ = 1. Setting a small threshold for the number of identified big jumps will lead to a high probability of θ = 1 and a low probability of θ = 0.
The above steps can be used to calculate θ with reasonable accuracy, which shows that the steps are feasible. Referring to the data characteristics analyzed in Section 3, four sets of training and test dates are selected from the empirical data set to run the classifiers and to find the corresponding index in Step 1. The date selection is shown in Table 2. The fluctuation patterns of monthly data, weekly data, daily data, and intraday data are all considered in selecting the test samples. The analysis results in Section 3 show that the average price of sample 2 is the highest and that volatility was the most intense in February. Therefore, the daily data of February (T1), where the maximum volatility is located, is selected as the estimation sample. After experiencing huge ups and downs, the CSI 300 continued to fall in March, so the price changes in the first two weeks of sample 3 deserve attention and the related weekly data (T2) are estimated. A monthly data forecast within a more stable range (T3) is also worthy of attention. It is also meaningful to estimate the parameters on the intraday historical data of the last five days in the data set (T4); these daily data are selected to estimate the value of θ. In Step 5, it is worth noting that the estimation results of θ obtained by the 6 algorithms are not necessarily the same. The prediction results of different machine learning and deep learning algorithms often have different accuracies.
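Steps 1-4 above can be sketched as follows. The window length 10, the threshold K, and the two-jump labeling rule follow the text; treating a "big jump" as a one-minute drop of at least K percent is our reading of Step 3:

```python
def pct_changes(prices):
    """One-minute percentage changes of the closing price (Step 1)."""
    return [100.0 * (b - a) / a for a, b in zip(prices, prices[1:])]

def build_matrix(changes, window=10):
    """Step 2: stack consecutive `window`-minute change vectors row by row."""
    return [changes[i:i + window] for i in range(len(changes) - window + 1)]

def label_theta(rows, K=0.1, min_jumps=2):
    """Steps 3-4: theta = 1 if a row contains at least `min_jumps`
    one-minute drops of K percent or more, else theta = 0."""
    return [1 if sum(1 for c in row if c <= -K) >= min_jumps else 0
            for row in rows]
```

Step 5 then fits any classifier to the rows of the matrix with these θ labels as the target column.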
In order to avoid possible misjudgments and make the results more accurate, the prediction results of the various algorithms are evaluated. "Support" refers to the number of true samples of each class that appeared in the evaluation. "Precision" expresses the accuracy rate over all prediction results where θ takes 1 or 0; it is defined as the ratio of the number of accurate predictions of θ = 1 (θ = 0) to the number of all predictions of θ = 1 (θ = 0). "Recall" shows the efficiency with which θ = 1 (θ = 0) is accurately predicted; it represents the ratio of the quantity accurately predicted as θ = 1 (θ = 0) to the true quantity of θ = 1 (θ = 0). The accuracy of the parameter prediction results is represented by the values of "precision" and "recall". The harmonic mean of "precision" and "recall" is suitable for directly showing the predictive effect of the different algorithms; its value is denoted "F1-score". Several machine learning and deep learning algorithms are applied to the empirical data set over the above time periods, and the classification reports of the accuracy evaluation of θ are recorded in Table 3, Table 4, Table 5, and Table 6. It can be seen that the number of "support" entries in the reported results, not affected by the different algorithms, is related only to the time window T. It shows that the number of jumps can be accurately identified by the six algorithms used for the empirical price data of the CSI 300 index. Comparing the number of "supports" (T4
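The per-class metrics used above ("precision", "recall", "F1-score", "support") can be computed directly for the binary θ labels; a minimal sketch:

```python
def classification_report(y_true, y_pred):
    """Per-class precision, recall, F1, and support for labels {0, 1}."""
    report = {}
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)  # predicted as cls
        true_pos = sum(1 for t in y_true if t == cls)  # truly cls (support)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / true_pos if true_pos else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": true_pos}
    return report
```

Note that "support" depends only on the true labels in the test window, not on the classifier, which matches the observation above that it varies only with the time window T.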