title: Financial Vision Based Reinforcement Learning Trading Strategy
authors: Tsai, Yun-Cheng; Szu, Fu-Min; Chen, Jun-Hao; Chen, Samuel Yen-Chi
date: 2022-02-03

Recent advances in artificial intelligence (AI) for quantitative trading have produced systems with superhuman trading performance. However, a potential risk of AI trading is that its decisions are a "black box". Some AI computing mechanisms are complex and difficult to understand, and if we use AI without proper supervision, it may make wrong choices and incur huge losses. Hence, we need to ask questions about the AI "black box": why did the AI decide to act, or not act? Why should people trust the AI, or not? How can people fix its mistakes? These questions highlight the challenge of explainable AI in the trading field. Suppose investors want to predict the future transaction price, or its rise and fall, directly. In that case, the fatal assumption is that the training data set shares the same distribution as data that has not yet occurred. However, the real world does not tell us whether the subsequent data distribution will change. Because of this, even if researchers add a moving window to the training process, the machine learning obstacle of "prediction delay" is inevitable. Our method avoids this prediction-delay obstacle, and we further propose automatic trading by deep reinforcement learning.

Humans can judge candlestick patterns by instinct and determine trading strategies from the corresponding patterns in the financial market. Could a computer reason from what it sees as humans do? Our goal is to teach machines to understand what they see the way humans do: to recognize candlestick patterns, infer their geometry, and understand the market's relationships, actions, and intentions. In addition, we would like to know why AI models decide what they decide. One must seek explanations of the process behind a model's development, not just explanations of the model itself. Our study designs an explainable AI trading framework by combining financial vision with deep reinforcement learning. Hence, we propose a financial vision approach that understands the critical components of a candlestick, and what they indicate, in order to apply candlestick chart analysis to a trading strategy. We combine this with deep reinforcement learning to realize intuitive trading based on financial-vision surveillance of candlesticks: the system observes a large number of candlesticks and forms automatic responses to the patterns it recognizes. With these capabilities in automatic control, it is natural to consider reinforcement learning (RL) techniques in algorithmic trading, and indeed several works have tried to apply RL to trade financial assets automatically. One of the challenges in constructing an effective RL-based algorithmic trading system is to properly encode the input signal from which the agent makes decisions. With recent advances in convolutional neural networks (CNN), a potential scheme is to encode financial time series into images. This work proposes an algorithmic trading framework based on deep reinforcement learning and the Gramian Angular Field (GAF) encoding method. Our contributions are the following:
• Provide an algorithmic trading framework for the study of RL-based strategies.
• Demonstrate successful RL
trading agents with GAF-encoded inputs of price data and technical indicators.

What distinguishes AI trading is that the AI operates within a target environment (such as the Standard & Poor's 500) and learns without human supervision: machine intelligence determines when to place an order or to exit at a profit. The breakthrough of machine learning in AI trading is to use a new kind of unsupervised learning to formulate strategies by identifying features in the data. For example, the golden cross is a multi-feature signal, and the AI can backtest and learn it from more than 400 million transaction records spanning 20 years. As a result, AI robots can find highly profitable models and gain instant advantages through high-frequency computation. The global FinTech industry is already the next economic driving force for investment in many countries, and robo-advisors are already common in advanced European and American markets. We believe ordinary investors will gradually accept that financial advice no longer relies only on high-end franchised services reserved for wealthy clients, and that such services will generally serve the investing public with financial needs. The development of AI-to-AI transactions will give the financial industry a brighter future by upgrading the international AI transaction industry; the sector will fully upgrade to bring users a new investment and financial-management experience, creating unprecedented synergies.

Reinforcement learning can interact with the environment and is well suited to decision and control systems. Therefore, we used reinforcement learning to establish a trading strategy for the cryptocurrency and U.S. stock markets, avoiding the longstanding instability of deep learning price predictions. We found from experiments that a model trained on 15-minute Ethereum price data can, through transfer learning, learn candlestick patterns suitable for entering and exiting U.S. stock trades. Compared to the top ten most popular ETFs, the experimental results demonstrate superior performance. This study focuses on financial vision, explainable methods, and links to their programming implementations. We hope that our paper will serve as a reference both for superhuman trading performance and for explaining why a trading system makes the decisions it does.

The paper is organized as follows. In Section 2, we introduce the concepts of financial vision for candlestick pattern recognition. In Section 3, we introduce the RL background knowledge used in this work. In Section 4, we describe the proposed GAF-RL trading framework. In Section 5, we describe the experimental procedures and results in detail. Finally, we discuss the results in Section 6 and conclude in Section 7.

Technical analysis makes trading decisions, or generates signals, for a specific product from its historical data. The data include price and volume, which capture the market's reaction to the asset, and from them we try to find the direction of future changes. The idea rests on the reproducibility of people's behavior in the market: a large proportion of investors' trading decisions are psychological. By studying the past and others' trading behavior, and believing from experience that this behavior may appear again, traders make a rational choice. Charts are drawn from historical prices according to specific rules, and these features help traders to see the price trend.
The three more common types of charts are bar charts, line charts, and the most widely used, the candlestick chart. The candlestick originated in Japan in the 17th century and has been popular in Europe and the United States for more than a century, especially in the foreign exchange market. As the most popular chart in technical analysis, it is one every trader should understand. It is named after a candle, as shown in Figure 1. Each bar of a candlestick chart is drawn from the open, high, low, and close prices as follows: 1. Open price: the first price that occurs during the period; 2. High price: the highest price that occurs during the period; 3. Low price: the lowest price that occurs during the period; 4. Close price: the last price that occurs during the period. If the close price is higher than the open price, the candlestick is drawn with a white (rising) body; if lower, with a black (falling) body.

Volume analysis focuses on the relationship between price and volume, since money is the fundamental element that pushes prices. Even enormous volume does not necessarily lead to immediate price changes. However, the trading amount reflects the degree of attention the commodity receives in the market and can effectively dilute price volatility caused by artificial manipulation. It is a steady, lower-risk basis for judgment and is usually used alongside other strategy signals.

In this work, we employ the Gramian Angular Field (GAF) [10] method to encode the time series into images. First, the time series $X = \{x_1, x_2, \dots, x_n\}$ to be encoded is scaled into the interval $[0, 1]$ via the minimum–maximum scaling in Equation 1, $\tilde{x}_i = \frac{x_i - \min(X)}{\max(X) - \min(X)}$. The notation $\tilde{x}_i$ represents each normalized element of the normalized series $\tilde{X}$. The arc-cosine values $\phi_i = \arccos(\tilde{x}_i)$ of each $\tilde{x}_i$ are calculated, and these angles generate the GAF matrix with entries $G_{ij} = \cos(\phi_i + \phi_j)$. The generated $n \times n$ GAF matrix, where $n$ is the length of the sequence considered, is used as input to the CNN. This method keeps the temporal information while avoiding recurrent neural networks, which are computationally intensive. GAF encoding has been employed in various financial time-series problems [10, 11, 12].
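The following is a minimal sketch of the GAF encoding just described, written in NumPy. It assumes a single non-constant 1-D price window; it is an illustration of the method in [10], not the authors' exact implementation.

```python
# Minimal sketch of Gramian Angular Field encoding for one price window.
import numpy as np

def gaf_encode(series: np.ndarray) -> np.ndarray:
    """Encode a 1-D series of length n into an n x n GAF matrix."""
    x = np.asarray(series, dtype=float)
    # Min-max scale into [0, 1] (Equation 1); assumes max > min.
    x_scaled = (x - x.min()) / (x.max() - x.min())
    # Polar encoding: phi_i = arccos(x_i) for x_i in [0, 1].
    phi = np.arccos(x_scaled)
    # GAF matrix: G[i, j] = cos(phi_i + phi_j).
    return np.cos(phi[:, None] + phi[None, :])

# Example: encode a 10-bar closing-price window into a 10 x 10 image.
close = np.array([101.2, 101.5, 100.9, 101.1, 101.8,
                  102.0, 101.7, 101.9, 102.3, 102.1])
image = gaf_encode(close)   # shape (10, 10), values in [-1, 1]
```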
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns how to make decisions by interacting with an environment [6]. The RL model comprises an agent that performs an action based on the current state; the environment receives the action and returns feedback, which can be either a reward or a penalty. Once the agent receives the reward, it adjusts the function relating states to actions so as to maximize the overall expected return. That function can be a value function or a policy function. A value function gives the return obtained from taking a particular action in a specific state, so accurately estimating it is essential: underestimating or overestimating the value of certain states or actions degrades learning performance. A policy function aims at achieving the maximum expected return in a particular state; in the RL model, the mapping from states to return-maximizing actions is called the policy, and several advanced models optimize the policy function directly.

Concretely speaking, the agent interacts with an environment $E$ over a number of discrete time steps. At each time step $t$, the agent receives a state or observation $s_t$ from the environment $E$ and then chooses an action $a_t$ from a set of possible actions $A$ according to its policy $\pi$. The policy $\pi$ is a function that maps the state or observation $s_t$ to an action $a_t$. In general, the policy can be stochastic, meaning that given a state $s_t$, the action output is a probability distribution $\pi(a_t|s_t)$ conditioned on $s_t$. After executing the action $a_t$, the agent receives the state of the next time step $s_{t+1}$ and a scalar reward $r_t$. The process continues until the agent reaches a terminal state or a pre-defined stopping criterion (e.g., the maximum number of steps allowed). An episode is defined as the agent starting from a randomly selected initial state and following this process all the way to the terminal state or a stopping criterion.

We define the total discounted return from time step $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $\gamma \in (0,1]$ is the discount factor. In principle, $\gamma$ is chosen by the investigator to control how heavily future rewards weigh in the decision-making function. With a large $\gamma$, the agent weighs future rewards more heavily; with a small $\gamma$, future rewards are discounted quickly and immediate rewards weigh more. The goal of the agent is to maximize the expected return from each state $s_t$ during training. The action-value (Q-value) function $Q^{\pi}(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ under policy $\pi$. The optimal action-value function $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$ gives the maximal action-value across all possible policies. The value of state $s$ under policy $\pi$, $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$, is the agent's expected return when following policy $\pi$ from state $s$. Various RL algorithms are designed to find the policy that maximizes the value function; such algorithms are called value-based RL.

In contrast to value-based RL, which learns the value function and uses it as the reference to generate a decision at each time step, another kind of RL method is the policy gradient. In this method, the policy function $\pi(a|s;\theta)$ is parameterized by $\theta$, which is then optimized by gradient ascent on the expected total return $\mathbb{E}[R_t]$. A classic example of a policy gradient algorithm is REINFORCE [13]. In the standard REINFORCE algorithm, the parameters $\theta$ are updated along the direction $\nabla_{\theta} \log \pi(a_t|s_t;\theta)\, R_t$, which is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$. However, this estimate suffers from large variance, which makes training very hard. To reduce the variance while keeping the estimate unbiased, one can subtract a learned function of the state $b_t(s_t)$, known as the baseline, from the return; the update direction becomes $\nabla_{\theta} \log \pi(a_t|s_t;\theta)\,(R_t - b_t(s_t))$. A learned estimate of the value function is a common choice of baseline, $b_t(s_t) \approx V^{\pi}(s_t)$, and usually leads to a much lower-variance estimate of the policy gradient. When the approximate value function is used as the baseline, the quantity $R_t - b_t = Q(s_t,a_t) - V(s_t)$ can be seen as the advantage $A(s_t,a_t)$ of action $a_t$ in state $s_t$. Intuitively, the advantage measures how good or bad the action $a_t$ is compared to the average value $V(s_t)$ of that state. For example, if $Q(s_t,a_t)$ equals 10 at a given time step $t$, it is not clear whether $a_t$ is a good action. However, if we also know that $V(s_t)$ equals, say, 2, then we can infer that $a_t$ may not be bad. Conversely, if $V(s_t)$ equals 15, the advantage is $10 - 15 = -5$, meaning that the Q value of this action is well below the average $V(s_t)$ and the action is therefore not good.
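The sketch below illustrates the discounted return $R_t$ and the advantage estimate $A(s_t,a_t) = R_t - V(s_t)$ just discussed. The reward sequence and the critic values are made-up numbers for illustration, not experimental data.

```python
# Discounted return and advantage-with-baseline, as defined in the text.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

rewards = [0.0, 0.0, 1.0, -0.5, 2.0]   # one toy trajectory of rewards
values  = [0.3, 0.4, 0.9, 0.1, 1.5]    # baseline V(s_t) from a learned critic
returns = discounted_returns(rewards)
advantages = [R - V for R, V in zip(returns, values)]  # A(s_t, a_t) = R_t - V(s_t)
```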
This approach is called the advantage actor-critic (A2C) method, where the policy $\pi$ is the actor and the baseline, i.e. the value function $V$, is the critic [6].

In the policy gradient method, we optimize the policy according to the policy loss $L_{\text{policy}}(\theta) = \mathbb{E}_t[-\log \pi(a_t \mid s_t;\theta)\, A_t]$ via gradient descent. However, the training itself may suffer from instabilities: if the step size of the policy update is too small, training is too slow; if it is too large, training has high variance. Proximal policy optimization (PPO) [14] fixes this problem by limiting the size of the policy update at each training step. PPO introduces a clipped surrogate loss function that constrains the policy change to a small range with the help of a clip. Consider the ratio between the probability of action $a_t$ under the current policy and its probability under the previous policy, $q_t(\theta) = \frac{\pi(a_t \mid s_t;\theta)}{\pi(a_t \mid s_t;\theta_{\text{old}})}$. If $q_t(\theta) > 1$, the action $a_t$ is more probable under the current policy than under the old one; if $0 < q_t(\theta) < 1$, it is less probable under the current policy than under the old one. The new objective can then be defined as $\mathbb{E}_t[q_t(\theta)\, A_t]$, where $A_t$ is the advantage function. If the action under the current policy is much more probable than under the previous one, the ratio $q_t$ may be large, leading to a considerable policy update step. The original PPO algorithm [14] circumvents this problem by adding a constraint on the ratio, restricting it to the range 0.8 to 1.2. The modified objective is $\mathbb{E}_t\big[\min\big(q_t(\theta) A_t,\ \mathrm{clip}(q_t(\theta), 1-C, 1+C)\, A_t\big)\big]$, where $C$ is the clip hyperparameter (a common choice is 0.2). Finally, the value loss and entropy bonus are added to the total loss function as usual: $L_{\text{value}} = \mathbb{E}_t[(V(s_t;\theta) - R_t)^2]$ is the value loss, and $H = \mathbb{E}_t[H_t] = \mathbb{E}_t[-\sum_j \pi(a_j \mid s_t;\theta) \log \pi(a_j \mid s_t;\theta)]$ is the entropy bonus, which encourages exploration, giving the total loss $L = \mathbb{E}_t[-\min(\mathrm{surr}_1, \mathrm{surr}_2) + 0.5\,(V(s_t;\theta) - R_t)^2 - 0.01\, H_t]$ used in Algorithm 1.

Algorithm 1: PPO
  Define the number of total episodes M
  Define the maximum steps in a single episode S
  Define the update timestep U
  Define the update epoch number K
  Define the epsilon clip C
  Initialize the trajectory buffer T
  Initialize the timestep counter t
  Initialize two sets of model parameters θ and θ_old
  for episode = 1, 2, ..., M do
      Reset the environment and initialize state s_1
      for step = 1, 2, ..., S do
          Update the timestep t = t + 1
          Select the action a_t from the policy π(a_t | s_t; θ_old)
          Execute action a_t in the emulator and observe reward r_t and next state s_{t+1}
          Record the transition (s_t, a_t, log π(a_t | s_t; θ_old), r_t) in T
          if t = U then
              Calculate the discounted return R_t for each state s_t in the trajectory buffer T
              for k = 1, 2, ..., K do
                  Calculate the log probability log π(a_t | s_t; θ), state values V(s_t; θ), and entropy H_t
                  Calculate the ratio q_t = exp(log π(a_t | s_t; θ) - log π(a_t | s_t; θ_old))
                  Calculate the advantage A_t = R_t - V(s_t; θ)
                  Calculate surr_1 = q_t · A_t
                  Calculate surr_2 = clip(q_t, 1 - C, 1 + C) · A_t
                  Calculate the loss L = E_t[-min(surr_1, surr_2) + 0.5 (V(s_t; θ) - R_t)^2 - 0.01 H_t]
                  Update the agent policy parameters θ with gradient descent on the loss L
              end for
              Update θ_old to θ
              Reset the trajectory buffer T
              Reset the timestep counter t = 0
          end if
      end for
  end for
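The following is a minimal PyTorch sketch of the clipped surrogate loss from Algorithm 1. The 0.5 and 0.01 coefficients follow the algorithm; the tensor shapes, detaching of the advantage, and function name are illustrative assumptions rather than the authors' exact implementation.

```python
# Clipped surrogate loss L = E[-min(surr1, surr2) + 0.5 * (V - R)^2 - 0.01 * H].
import torch

def ppo_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
             returns: torch.Tensor, values: torch.Tensor,
             entropy: torch.Tensor, clip_c: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_logp - old_logp)            # q_t
    advantage = (returns - values).detach()           # A_t = R_t - V(s_t)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - clip_c, 1.0 + clip_c) * advantage
    policy_loss = -torch.min(surr1, surr2)            # clipped policy term
    value_loss = 0.5 * (values - returns).pow(2)      # critic regression term
    return (policy_loss + value_loss - 0.01 * entropy).mean()
```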
As Figure 2 shows, the framework has two main components: the GAF encoding module and the actor-critic PPO architecture. GAF encoding analyzes time sequences in a way that differs from traditional time-series treatment. To extract time-series characteristics from this other perspective, we must train the pattern-recognition model independently before the experiment until its accuracy reaches a sufficient level; the resulting probability distribution over patterns then becomes the input to the reinforcement learning model. This study used ETH/USD exchange data at 15-minute intervals from January 1, 2020, to July 1, 2020, as the basis for the environment design and performance calculation. This period is the training period for the agent, which repeatedly learns on the data and eventually obtains an optimal investment strategy. When it is time to train the PPO trading strategy, the initial market information and the pattern-identification information are fed into the PPO strategy model. The market information includes technical indicators, bid strategies, and the state of open interest and volume. Training of the PPO model takes place after a sufficient number of trajectories have been collected, and they are input into the actor-critic PPO model. The model has two heads: an action layer and a value layer. The action layer outputs three trading signals, namely buy, sell, and hold, and the strategy caps the position at three units at most. The value layer estimates the potential value of each state; its difference from the reward gives $\delta_t$, which is used to build the advantage function, and $V_{\mu}(S_t)$ and $V_{\mu}(S_{t+1})$ are generated simultaneously to update the value-layer gradient. In the PPO training framework, off-policy training means the old policy $\pi_{\theta_{\text{old}}}(a_t|s_t)$ interacts with the environment to produce importance samples, which yield the policy gradient used to update the policy network $\pi_{\theta}(a_t|s_t)$. After several rounds of sampling, the old policy weights are replaced by the weights of the new policy. The clipped surrogate loss function constrains the policy change to a small range to avoid too much divergence between the distributions of the two policies during training, which would make training unstable. The actions output by the action layer are used to interact with the environment at the next step, establishing the whole training cycle.
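As a hedged sketch of the actor-critic model just described, the code below defines an action layer that outputs buy/sell/hold probabilities and a value layer that outputs a state value. The input dimension (the eight pattern probabilities), the hidden width, and the shared trunk are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative actor-critic network with an action layer and a value layer.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int = 8, hidden: int = 64, n_actions: int = 3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.action_layer = nn.Sequential(nn.Linear(hidden, n_actions),
                                          nn.Softmax(dim=-1))   # buy / sell / hold
        self.value_layer = nn.Linear(hidden, 1)                  # V(s_t)

    def forward(self, state: torch.Tensor):
        h = self.shared(state)
        return self.action_layer(h), self.value_layer(h)

# Example: a state built from the eight candlestick-pattern probabilities.
probs, value = ActorCritic()(torch.rand(1, 8))
```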
The training procedure of the GAF-RL trading agent, whose states are the pattern-recognition probability distributions, is summarized below.

  Define the pattern-recognition probability distributions and max timestep P, T_max
  Define the environment state from the probability distribution, s = P
  Define the number of total episodes M
  Define the maximum steps in a single episode as the max timestep, S = T_max
  Define the update timestep U
  Define the update epoch number K
  Define the epsilon clip C
  Initialize the trajectory buffer T
  Initialize the timestep counter t
  Initialize two sets of model parameters θ and θ_old
  for episode = 1, 2, ..., M do
      Reset the environment and initialize state s_1 = P_1
      for step = 1, 2, ..., S do
          Update the timestep t = t + 1
          Select the action a_t from the policy π(a_t | s_t; θ_old)
          Execute action a_t in the emulator and observe reward r_t and next state s_{t+1} = P_{t+1}
          Record the transition (s_t, a_t, log π(a_t | s_t; θ_old), r_t) in T
          if t = U then
              Calculate the discounted return R_t for each state s_t in the trajectory buffer T
              for k = 1, 2, ..., K do
                  Calculate the log probability log π(a_t | s_t; θ), state values V(s_t; θ), and entropy H_t
                  Calculate the ratio q_t = exp(log π(a_t | s_t; θ) - log π(a_t | s_t; θ_old))
                  Calculate the advantage A_t = R_t - V(s_t; θ)
                  Calculate surr_1 = q_t · A_t
                  Calculate surr_2 = clip(q_t, 1 - C, 1 + C) · A_t
                  Calculate the loss L = E_t[-min(surr_1, surr_2) + 0.5 (V(s_t; θ) - R_t)^2 - 0.01 H_t]
                  Update the agent policy parameters θ with gradient descent on the loss L
              end for
              Update θ_old to θ
              Reset the trajectory buffer T
              Reset the timestep counter t = 0
          end if
      end for
  end for

5 Experiments and Results

The experimental results come in two parts: (1) the training of the GAF-CNN pattern recognizer and (2) the PPO-RL trading strategies. The training patterns are eight common candlestick patterns, and a sufficiently accurate pattern recognizer is trained in advance to serve as the environment of the PPO training framework in the second part. Figure 3 shows the training material, cryptocurrency ETH/USD 15-minute data from January 1, 2020, to July 1, 2020. This type of commodity has the advantages of high volatility, continuous trading hours, and significant trading volume, so there is enough information to learn more effective trading strategies. The test data span January 2019 to September 2020. In addition, the outbreak of the COVID-19 pandemic falls within this period, which tests whether the trading strategy is stable enough to act as an effective hedging tool. In the experiment, four commodities from different market conditions (bull, bear, and stable markets) were selected for comparison. Since many trading platforms do not charge fees, transaction fees are not considered. When market volatility is more violent, returns grow correspondingly in terms of profit performance; conversely, when market volatility decreases, the profit shrinks. Figure 4 shows that the cryptocurrency market is relatively volatile and easily affected by news, so the strategy was run on the cryptocurrency market from August 2020 to June 2021. Several times the market fluctuated rapidly because of celebrities supporting or opposing cryptocurrency. Nevertheless, the strategy still retained profit and hedging ability in such a market, which also shows that this trading strategy responds well in markets with severe fluctuations. The following period is the agent's performance evaluation period: the agent uses the trading strategy to form decisions, and the system accumulates the rewards obtained during the evaluation period as a reference for performance. Figure 5 shows that the two large bull-market assets (Nasdaq: GOOG and NYSE: SPY) have had tremendous growth momentum over the past decade. However, the experimental results show that they also suffered a certain degree of decline risk after the epidemic outbreak. During the epidemic, the whole model shifts to a conservative, short-biased strategy.
The density of trades decreases significantly, but the strategy is also excellent at catching rapid price pullbacks as buying opportunities. In terms of transaction frequency, there are about two or three transactions a week. Figure 6 shows the other two assets, which stayed in bear and stable markets (USCF: BNO and iShares MSCI Japan ETF: EWJ). BNO experienced a considerable decline during this period; the PPO agent turned directly to the short side, obtained a significant return, and bought back reasonably, and the profit from this short bet accounted for most of the total gain. For EWJ, however, the initial volatility of the index did not yield a significant return, showing the inadequacy of the trading strategy for that market's severe fluctuations. Nevertheless, in a word, this trading strategy based on pattern recognition can indeed provide appropriate hedging operations for investors when downside risks suddenly appear.

If license plate recognition is a mature commercial product of computer vision, then our financial vision is the trade surveillance system of financial markets. We use the idea of road cameras monitoring traffic violations to illustrate the core contribution and use of financial vision in trading. First, consider using computer vision to monitor violations at intersections, such as speeding, illegal turns, and running red lights. When a camera with license-plate recognition captures one of these particular patterns, the traffic team receives a notification, matches the plate against its registry, and sends the photo to the offender. Using financial vision to monitor the price changes of a target asset is like watching intersections: investors set specific candlestick patterns (for example, W bottom, M head, N-type) that require special attention and receive a pattern-hunter notification. How to trade on it is then a separate matter.

Moreover, how is this monitoring logic different from traditional price prediction? Traditional price prediction takes the prices over the past period $(0, t-1)$ as the supervised-learning input $X$ and the price at time $t$ as the output $Y$ (or uses the rise or fall directly as the learning target). This implies a fatal assumption: that the data distribution at future forecast times in the real market is consistent with the past historical data. Yet in any case the data are split into a training set and a test set (even when a moving window is used). In intersection surveillance, by contrast, the camera only cares about recognizing the offending vehicle's license plate; it does not predict whether a car will violate the regulations next time. The monitoring model is therefore built to recognize the license plate independently of the road conditions: the road conditions seen during $(0, t-1)$ are not treated as a supervised-learning input $X$, nor is the road condition at time $t$ treated as an output $Y$. Accordingly, our transaction monitoring system builds a financial vision model dedicated to learning the candlestick pattern map. When a candlestick pattern appears (just as the camera photographs a red-light runner and notifies the traffic team), the system lists it and informs the user.
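For concreteness, the sketch below shows the sliding-window supervised setup criticized above: past prices become the input $X$ and the next price becomes the label $Y$. The window length and the toy price path are assumptions for illustration only.

```python
# Sliding-window supervised dataset: X = prices[t-window:t], Y = prices[t].
import numpy as np

def sliding_window_dataset(prices: np.ndarray, window: int = 20):
    X = np.stack([prices[t - window:t] for t in range(window, len(prices))])
    Y = prices[window:]
    return X, Y

prices = np.cumsum(np.random.randn(500)) + 100.0   # toy price path
X, Y = sliding_window_dataset(prices)
# Any train/test split of (X, Y) implicitly assumes the future distribution
# matches the past -- the "prediction delay" obstacle discussed in the text.
```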
We live in an era when cryptocurrency is at a crossroads of transitional adaptation, requiring more sensitive approaches to keep up with the evolving financial market. By applying transfer learning to cryptocurrency trading behavior, this study explores the possibility of carrying the learned trading patterns over to transactions in traditional commodities. With candlestick analysis, the model attempts to discover feasible rules for establishing a specific pattern; the trading patterns are then used to broaden possible scenarios via deep reinforcement learning. We endeavor to understand profitable strategies and market sentiment under various conditions so as to satisfy the market's large demand for risk aversion. The experimental results reveal that more accurate judgments can be made when the commodity price fluctuates violently. Human beings, on the contrary, tend to be influenced by emotional bias on most occasions; that inclination suits them to more stable, long-term markets, while they tend to be blindsided and miscalculate the best entry point when volatile movements occur. Integrating the machine learning model can effectively compensate for this human deficiency and improve the efficiency of the trading process. Nevertheless, optimizing the trading frequency to consolidate the liquidity of different financial assets is another issue that deserves more attention in the dynamic financial market.

In current investment practice, investors want to predict the future transaction price, or its rise and fall, directly. The fatal assumption is that the training data set shares the same distribution as data that has not yet occurred. However, the real world does not tell us whether the subsequent data distribution will change; because of this, even if researchers add a moving window to the training process, the machine learning obstacle of "prediction delay" is inevitable. Just search for "machine learning predict stock price," and researchers can find articles full of pitfalls, all of which share this shortcoming. Therefore, our first contribution is not to make future predictions but to focus on current candlestick pattern detection, such as the Engulfing Pattern, Morning Star, and so on. However, a "candlestick pattern" is usually an impressionistic description; it cannot become a stylized trading strategy if investors cannot write a program that enumerates all of its characteristics. Even a trader with a sense of the market who knows which patterns signal entries and exits cannot keep an eye on all investment targets. Our second contribution therefore focuses on detecting trading entry and exit signals combined with related investment strategies. Finally, we found from experiments that a model trained on 15-minute Ethereum price data transfers, through transfer learning, to U.S. stock trading. Compared to the top ten most popular ETFs, the experimental results demonstrate superior performance. This study focuses on financial vision, explainable methods, and links to their programming implementations. We hope that our paper will serve as a reference both for superhuman trading performance and for explaining why a trading system makes the decisions it does.

References

Deep reinforcement learning for trading.
Practical deep reinforcement learning approach for stock trading.
AI and international trade.
Artificial intelligence for the real world.
Adaptive quantitative trading: an imitative deep reinforcement learning approach.
Reinforcement Learning: An Introduction.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
Peeking inside the black-box: a survey on explainable artificial intelligence (XAI).
Imaging time-series to improve classification and imputation.
Encoding candlesticks as images for pattern classification using convolutional neural networks.
Explainable deep convolutional candlestick learner.
Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Proximal policy optimization algorithms.
The Major Candlestick Signals. The Candlestick Forum LLC.

The eight patterns used in this study are described entirely in this section. The following eight figures illustrate the critical rules each pattern requires. On the left-hand side of each figure, the white candlestick represents a rising price and the black candlestick represents a dropping price. The arrow indicates the trend: an upward arrow indicates a positive trend, and a downward arrow a negative trend. The text descriptions on the right-hand side are the fundamental rules, taken from The Major Candlestick Signals [15].

Shooting Star Pattern: 1. The uptrend has been apparent. 2. The body of the first candle is white.
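As an illustration of how such candlestick rules can be expressed programmatically, the sketch below checks a bullish engulfing pattern. The conditions follow a common textbook formulation and are not necessarily the exact rules used in this study.

```python
# Illustrative rule check for one candlestick pattern (bullish engulfing).
from dataclasses import dataclass

@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float

def is_bullish_engulfing(prev: Candle, curr: Candle) -> bool:
    """Previous candle is black (falling), current candle is white (rising),
    and the current real body engulfs the previous real body."""
    return (prev.close < prev.open and
            curr.close > curr.open and
            curr.open <= prev.close and
            curr.close >= prev.open)

print(is_bullish_engulfing(Candle(10.5, 10.6, 10.0, 10.1),
                           Candle(10.0, 10.9, 9.9, 10.8)))   # True
```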