key: cord-0853202-jsf75jne
authors: Xu, Jia-Lang; Hsu, Ying-Lin
title: Analysis of agricultural exports based on deep learning and text mining
date: 2022-02-01
journal: J Supercomput
DOI: 10.1007/s11227-021-04238-w
sha: c284f9ac665a4472070c92d68fb35a69b72a4360
doc_id: 853202
cord_uid: jsf75jne

Agricultural exports are an important source of economic profit for many countries. Accurate month-on-month predictions of a country’s agricultural exports are key to understanding its domestic use and export figures, and they facilitate advance planning of exports, imports, and domestic use, along with the resulting adjustments to production and marketing. This study proposes a novel method for predicting the rise and fall of agricultural exports, called agricultural exports time series-long short-term memory (AETS-LSTM). The method applies Jieba word segmentation and Word2Vec to train word vectors and uses TF-IDF and word clouds to learn news-related keywords and finally obtain keyword vectors. This research explores whether the purchasing managers’ index (PMI) of each industry can be used effectively with the AETS-LSTM model to predict the rise and fall of agricultural exports. The results show that adding keyword vectors to the PMI values of the finance and insurance industry has a measurable impact on the prediction of the rise and fall of agricultural exports, improving prediction accuracy from 69.57% to 82.61%. Combining keyword vectors with PMI values improves prediction for the chemical/biological/medical, transportation equipment, wholesale, finance and insurance, food and textiles, basic materials, education/professional, science/technical, information/communications/broadcasting, transportation and storage, retail, electrical and machinery equipment, and electronic and optical categories, while accuracy for the accommodation and food service and the construction and real estate industries remained unchanged.
Therefore, the proposed method offers improved prediction capacity for agricultural exports month on month, allowing agribusiness operators and policy makers to evaluate and adjust domestic and foreign production and sales.

Artificial intelligence-related technologies, resources, and infrastructure have gradually matured and can now be easily applied to various fields to solve multiple problems with good effect, including time series, image processing, audio signal processing, and natural language processing. Time series are widely used in various fields, and long short-term memory (LSTM) models are preferred for deep learning analysis of time series. For example, Qin et al. [23] used time series to create a model for detecting abnormal behavior in controller area network (CAN) buses under tampering attacks. Tulensalo et al. [32] used local weather data to determine total local grid transmission losses. Shahid et al. [26] used time-series methods to predict the number of COVID-19 deaths and recovery cases in ten major countries. Tahvili et al. [30] proposed a natural language processing and data conversion approach that uses supervised learning methods to process and evaluate unbalanced data. Zhang et al. [40] applied text mining and natural language processing techniques to construction accident reports and used support vector machines (SVM), linear regression (LR), decision tree (DT), and other models plus an ensemble model to classify the causes of accidents.

Taiwan's early economic development was mainly based on agriculture, but with the transition to an industrial and technology-based economy, the importance of the agricultural sector gradually diminished. However, in order to guarantee food security and resource sustainability, the Taiwan government has begun to focus additional attention on the development of agriculture-related industries. What little arable land Taiwan has is not well-suited to large-scale agricultural operations, so farmers focus instead on cultivating agricultural products with high added value. This has led to steadily increasing exports, a trend the government hopes to encourage, allowing export figures to reflect actual income, in order to reduce the imbalance of domestic production and sales. The development of information technology allows for easy access to a wide range of information, and online news sources provide fast and convenient insight into what is happening at home or abroad. Wei et al. [33] suggested that purchasing managers' indexes (PMIs) can be effectively used to predict the prices of industrial stocks. Xu and Hsu [35] obtained good results using news related to climate change and oil prices to analyze and predict agricultural product prices. Chen and Gong [4] assessed the impact of global warming on the total factor productivity of agriculture, finding that global warming has an impact on agricultural products. Liu et al. (2020a, b) used multiple PMIs as auxiliary variables to predict coal mining accidents. Su et al. [27] suggested there is a positive two-way causal relationship between the prices of agricultural products and oil. Sun and Li [28] suggested that the global financial crisis and common borders had a significant effect on China's trade profits on agricultural exports to ASEAN countries.
This research investigates whether Taiwan's agricultural exports are impacted by international news on climate change, oil prices and other related matters, and changes in PMI for various industries. This research proposes an AETS-LSTM deep learning model, adjusting the characteristics and weight parameters of the learning target column, to successfully forecast future agricultural export trends.

The remainder of this research is arranged as follows: Sect. 2 reviews the literature on LSTM, text mining, and principal component analysis. Section 3 introduces the research architecture and methods used. Section 4 describes results and performance evaluation, and Sect. 5 presents conclusions.

The core of long short-term memory (LSTM) consists of three control gates: input, forget, and output. The input gate uses the input value and the value in the newly generated memory cell in an activation function to determine whether the value must be added to the long-term memory neuron. The forget gate determines whether the current value is a new topic or data that is the opposite of the current value and determines whether the value needs to be filtered or kept in the memory. The output gate determines whether the current value needs to be added to the output. The activation function of the output gate is usually the Sigmoid function. Finally, the activation function tanh is used to determine whether long-term memory should be added to the output. The tanh value falls within [-1, 1], with -1 ordering the removal of long-term memory, while 1 means it should be retained. Following Hochreiter and Schmidhuber [8], Fig. 1 shows the LSTM architecture, followed by Eqs. (1) to (6).

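$$f_h = \sigma\left(w_f \cdot [h_{h-1}, x_h] + b_f\right) \tag{1}$$

$$i_h = \sigma\left(w_i \cdot [h_{h-1}, x_h] + b_i\right) \tag{2}$$

$$\tilde{c}_h = \tanh\left(w_c \cdot [h_{h-1}, x_h] + b_c\right) \tag{3}$$

$$C_h = f_h \odot C_{h-1} + i_h \odot \tilde{c}_h \tag{4}$$

$$O_h = \sigma\left(w_o \cdot [h_{h-1}, x_h] + b_o\right) \tag{5}$$

$$h_h = O_h \odot \tanh\left(C_h\right) \tag{6}$$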

where $f_h$ is the forget gate value, $i_h$ is the input gate value, $O_h$ is the output gate value, $\tilde{c}_h$ is the memory cell candidate, $h_{h-1}$ is the output value of the previous step, and $x_h$ is the input value. $w_i$, $w_c$, $w_o$, $w_f$ and $b_i$, $b_c$, $b_o$, $b_f$ are, respectively, the weight matrices and bias vectors. $C_h$ is the memory cell state, and $\sigma$ is the Sigmoid activation function.
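The gate interactions described above can be made concrete with a minimal sketch of one step of a single-unit LSTM cell in the standard formulation. The function name `lstm_step`, the scalar weights, and the dictionary layout are illustrative assumptions, not the implementation used in this research.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w, b):
    """One step of a single-unit LSTM cell (scalar sketch, hypothetical layout).

    w[g] is a pair (weight on h_prev, weight on x) and b[g] a bias,
    for each gate g in "f" (forget), "i" (input), "c" (candidate), "o" (output).
    """
    def pre(g):
        # Weighted sum of the previous output and the current input.
        return w[g][0] * h_prev + w[g][1] * x + b[g]

    f = sigmoid(pre("f"))           # forget gate
    i = sigmoid(pre("i"))           # input gate
    c_tilde = math.tanh(pre("c"))   # memory cell candidate
    c = f * c_prev + i * c_tilde    # updated memory cell state
    o = sigmoid(pre("o"))           # output gate
    h = o * math.tanh(c)            # new output value, within (-1, 1)
    return h, c


# Illustrative values only: zero weights make every gate equal 0.5.
zero_w = {g: (0.0, 0.0) for g in "fico"}
zero_b = {g: 0.0 for g in "fico"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_w, b=zero_b)
```

With all weights at zero, half of the previous cell state is kept and nothing new is written, which shows how the forget and input gates scale the old and new memory contributions separately.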

Chen [2] proposed a nonlinear LSTM algorithm with good prediction accuracy for use in voltage prediction. Elsheikh et al. [6] proposed an LSTM model to predict the fresh water production of stepped solar distillers and conventional distillers. Kırbaş et al. [15] used LSTM to predict COVID-19 cases in Denmark, Belgium, Germany, France, the UK, Finland, Switzerland, and Turkey, evaluating model accuracy with performance indicators such as MSE, PSNR, RMSE, NRMSE, MAPE, and SMAPE; the final results indicated that LSTM provided the best accuracy. Liu et al. (2020a, b) proposed a model combining a deep neural network (DNN) and long short-term memory (LSTM) to solve the problem of developing a system model based on given input and output data to predict sinter chemical composition. Miao et al. [20] proposed an LSTM framework for fog forecasting using hourly meteorological elements. They found the LSTM framework to be more effective than traditional machine learning models in this application. Rahman et al. [24] proposed a novel diabetes classification model based on Conv-LSTM and used a grid search algorithm to perform hyperparameter optimization so that the applied model can find the best parameters. Tsantekidis et al. [31] proposed combining the ability of CNN to extract useful features with the ability of LSTM to analyze time series. They demonstrated that their proposed model outperformed various compared LSTM and CNN models within the prediction range of the test. Yan et al. [37] proposed an ON-LSTM model to determine the remaining service life of gears and compared its performance with those of LSTM, GRU, DLSTM and DNN. The proposed ON-LSTM model achieved the best short-term and long-term prediction accuracy of the compared models.

Written text follows very few hard and fast rules, so text mining techniques have difficulty identifying well-defined content in either long or short passages, even when that content is readily understood in daily life. TF-IDF is a common statistical weighting method that can be divided into two parts: TF (term frequency) and IDF (inverse document frequency). TF measures how frequently a term occurs, as in Eq. (7), while IDF measures a word's general importance, as in Eq. (8). TF-IDF is used to filter out common words while retaining important words, as in Eq. (9).

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{7}$$

where $n_{i,j}$ is the number of times a specific word or phrase $t_i$ appears in news item $j$, and $\sum_k n_{k,j}$ is the total count of all words or phrases in that news item.

$$idf_i = \log \frac{|D|}{1 + |\{ j : t_i \in d_j \}|} \tag{8}$$

where $|D|$ is the total number of news items, and $|\{ j : t_i \in d_j \}|$ is the number of news items that contain $t_i$. If a word is not contained in any news item, the denominator would be 0, so 1 is generally added to it.

$$tfidf_{i,j} = tf_{i,j} \times idf_i \tag{9}$$

Therefore, TF-IDF is used to weight the importance of specific words or phrases in specific news content.
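As a minimal sketch of the TF-IDF weighting described above, the following computes per-document scores with the add-one smoothing in the IDF denominator and then selects top-ranked keywords. The function names `tf_idf` and `top_keywords` and the toy token lists are illustrative assumptions; real input would be the Jieba-segmented news text.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. TF is the term count divided by the
    document length; IDF is log(N / (1 + document frequency)).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (n / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, n in counts.items()
        })
    return scores


def top_keywords(score, k=10):
    """Terms with the k highest scores (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, already-segmented documents (hypothetical data).
docs = [
    ["oil", "price", "oil"],
    ["climate", "price"],
    ["rice", "export", "price"],
]
scores = tf_idf(docs)
```

Note that with add-one smoothing a term appearing in every document, such as `price` here, receives a negative weight, which is how common words get filtered out of the top-ranked list.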

Relationships between words and phrases, such as synonyms, antonyms and corresponding words, are difficult to determine directly. Word2Vec addresses this by converting words into vectors that represent their meaning. Mikolov et al. [21] used large amounts of text data to represent the semantic meaning between words and phrases by means of their corresponding vectors. After embedding words in a space, words with similar meanings will have greater spatial proximity. The most common Word2Vec models are CBOW and Skip-gram. Skip-gram uses a given input word to predict the context, while CBOW uses a given context to predict the input word. Figures 2 and 3 show the CBOW and Skip-gram model architectures, respectively. Jain et al. (2021a) proposed a Cuckoo Search-eXtreme gradient boosting model and optimized the model to recommend airlines. They also (2021b) proposed a sparse self-attentive network-based aspect-aware model that can effectively predict consumer recommendation decisions. In addition, they (2021c) proposed a multi-label classification model for travel recommendation, and finally, they (2021d) conducted a systematic literature review of machine learning applications for consumer sentiment analysis in online reviews. Quamer et al. [22] proposed a self-attentive convolutional neural network model that can effectively perform sentence matching and natural language inference. Yen et al. [36] used the text in online news and stock forums to conduct text exploration and predict future financial performance. Choi et al. [3] proposed using text mining to analyze social network texts to identify cyber bullying. Jung and Lee [13] used text mining of keywords and citation information in academic papers to examine the information value of research scopes and trends.
Mosa [18] recast the mining of large-scale social media data as a multi-objective optimization (MOO) task for summary extraction, using the gravitational search algorithm (GSA) to optimize several objectives and generate brief social media summaries. To explore resident sentiment toward an urban waste classification policy, Wu et al. [34] used text mining methods to collect and analyze public comments on Weibo. Zhong et al. [39] proposed a four-step modeling approach: (1) using latent Dirichlet allocation to identify hazard topics, then (2) a convolutional neural network (CNN) algorithm for the automatic classification of such hazards, followed by (3) word co-occurrence networks (WCN) which determine the relationships between the hazards, and finally (4) a quantitative analysis of keywords through word cloud methods to create a visual overview of such hazards, thus providing managers with new knowledge and insights. Drury and Roche [5] applied text mining methods to a large number of papers and news reports cited in recent agricultural research papers, seeking to identify problems and potential applications.
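The difference between the CBOW and Skip-gram models mentioned above lies in how training examples are formed from a sliding window over the text. The following minimal sketch builds both kinds of examples; the function names, the window size, and the toy sentence are illustrative assumptions, and no actual embedding is trained here.

```python
def _neighbors(tokens, i, window):
    """Words within `window` positions of index i, excluding position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_pairs(tokens, window=2):
    """CBOW examples: the surrounding context predicts the center word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = _neighbors(tokens, i, window)
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: the center word predicts each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for context_word in _neighbors(tokens, i, window):
            pairs.append((center, context_word))
    return pairs


# Toy, already-segmented sentence (hypothetical data).
tokens = ["oil", "price", "affects", "export"]
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

CBOW yields one example per position with the whole context as input, whereas Skip-gram yields one example per (center, neighbor) pair, so it produces more, finer-grained examples from the same text.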

Principal component analysis (PCA) is a widely used unsupervised linear transformation technique that re-expresses the original data in a different form and also serves as a data processing step. PCA converts high-dimensional data into lower-dimensional data, thus reducing calculation time and memory space requirements to facilitate storage and analysis. For example, if the data have $q$ mutually correlated continuous variables $x_1, x_2, \ldots, x_q$, they can be converted into uncorrelated variables $y_i$ using the linear transformation in Eq. (10). In Eq. (11), the vector $\alpha_i$ is an eigenvector of matrix $A$, and $\lambda_i$ is the eigenvalue corresponding to the eigenvector $\alpha_i$, $i = 1, 2, \ldots, q$. In Eq. (12), $S$ is the cumulative proportion of variance explained by the first $r$ principal components:

$$y_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 + \cdots + \alpha_{iq} x_q \tag{10}$$

$$A \alpha_i = \lambda_i \alpha_i \tag{11}$$

$$S = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{q} \lambda_i} \tag{12}$$

where $r$ is chosen so that $S$ is at least 90%.
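The selection of the number of components from the cumulative proportion of variance can be sketched as follows. This assumes the eigenvalues have already been computed; the function names and the example eigenvalues are hypothetical and are not values from this research.

```python
def cumulative_variance(eigenvalues):
    """Cumulative proportion of variance for eigenvalues sorted descending."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    proportions = []
    for value in ordered:
        running += value
        proportions.append(running / total)
    return proportions


def choose_r(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative proportion
    of variance reaches the threshold."""
    for r, proportion in enumerate(cumulative_variance(eigenvalues), start=1):
        if proportion >= threshold:
            return r
    return len(eigenvalues)


# Hypothetical eigenvalues: cumulative proportions are 0.5, 0.8, 0.95, 1.0.
example = [5.0, 3.0, 1.5, 0.5]
r = choose_r(example)
```

In this toy case three components are needed, because the first two explain only 80% of the variance.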

Boubchir and Aourag [1] proposed a new multivariate technique with PCA and PLS to effectively test statistical influences on the stability of inverse perovskites and perovskites. Mahmoudi et al. (2020) used PCA to classify the spread of COVID-19 in France, Germany, Iran, Italy, Spain, the UK and the USA. Raj et al. [25] applied PCA for the simplified, economical and sensitive classification of vesicular and bronchial breath sounds. García-Gil et al. [7] proposed a new method using the open-source cluster computing framework Apache Spark and principal component analysis to reduce data volume, finding that high-dimensional data affect the algorithm's calculation time. Jolliffe and Cadima [14] suggested that PCA can be used to reduce data dimensionality, thus increasing data interpretability and minimizing information loss. Because the new variables are derived from the data set itself, PCA is an adaptive data analysis technique.

The experimental environment of this research comprises hardware and software. The operating system is 64-bit Windows 10, running on an i7 CPU with 24 GB of RAM and a GeForce GTX 1650 Ti graphics card. Python 3.6 is used for development.

This research uses Web crawlers to obtain international news, the PMI of each industry, and the export index of agricultural products. The collected data are preprocessed, merged, and then used to train the LSTM model. The overall process is shown in Fig. 4.

This research uses Web crawlers to collect news reports from the international news section of the ETtoday Web site for the period January 1, 2014, to December 31, 2019, using HTML tags to extract the news content and search terms such as agriculture, petroleum, and climate to filter for related topics.

To clarify and smooth the overall calculation and structure, the international news content data collected for this research were processed using Jieba word segmentation to facilitate subsequent data annotation.

Following Jieba segmentation, the TF-IDF method is used to identify the top 10 key words for each international news article.

Following Jieba segmentation, the Word2Vec method is used to vectorize words, allowing for the calculation of word similarity. Feature vector dimensions of 100, 200, and 300 were tested, and the resulting word vectors were consistent across these settings. Therefore, the Word2Vec feature vector is set at 100 dimensions, and CBOW is used for training.

Keyword terms were categorized by month, and then, the ten most critical words were determined by means of word cloud analysis, as shown in Fig. 5 .

Identified key words were then processed using the trained Word2Vec model to obtain the word vectors of ten key words. 

Because the generated keyword vectors have excessive dimensionality (ten keywords with 100 dimensions each, giving 1000 dimensions), PCA is used to reduce them to 55 dimensions while retaining an explained variance ratio of 90%. Therefore, in this research, the keyword vectors are reduced from 1000 to 55 dimensions.

The manufacturing and non-manufacturing purchasing managers' indexes for 2014 to 2019 are obtained from the Business Indicators Database Web site.

Manufacturing and non-manufacturing indexes were obtained for various industries including chemical/biological/medical, transportation equipment, accommodation and food service, wholesale, finance and insurance, food and textiles, basic materials, education/professional, scientific/technical, information/communications/broadcasting, transportation and storage, retail, electrical and machinery equipment, electronic and optical industry, and construction and real estate.

Public data for the total export value of agricultural products from 2014 to 2019 were obtained from Taiwan's Council of Agriculture.

This research obtained the fluctuation in total export value by subtracting the previous month's value from the present month's value.
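The month-on-month fluctuation and the resulting rise/fall target can be sketched as follows. The function names and the example values are hypothetical, and treating an unchanged month as a fall is an assumption made only for this illustration.

```python
def monthly_fluctuation(values):
    """Difference between each month's value and the previous month's value."""
    return [current - previous for previous, current in zip(values, values[1:])]


def rise_fall_labels(fluctuations):
    """1 for a rise, 0 otherwise (an unchanged month is counted as a fall
    here, which is an assumption of this sketch)."""
    return [1 if delta > 0 else 0 for delta in fluctuations]


# Hypothetical monthly export values, not data from this research.
exports = [100, 120, 90, 90]
deltas = monthly_fluctuation(exports)
labels = rise_fall_labels(deltas)
```

Because each fluctuation needs a preceding month, a series of n monthly values yields n - 1 fluctuations; 24 months of test data, for example, give 23 values.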

This research merged the various data, including agricultural export fluctuations, the PMI of each industry, the outlook for each industry in the coming six months, and the 55-dimensional keyword vectors obtained by PCA.

The data from 2014 to 2017 are insufficient for effective training. Following Yoon et al. [38], a time-series GAN was trained on the real time-series data to generate realistic synthetic samples, thereby providing sufficient data for training.

The input parameters of the proposed AETS-LSTM model are as follows: fluctuations in agricultural exports, the 55-dimensional keyword vectors after PCA, the PMI of each industry, and the outlook for each industry for the next six months. The output parameter is the fluctuation in agricultural exports over the following month. The training sample is the data from 2014 to 2017 after application of the time-series GAN algorithm. The test sample is the data from 2018 to 2019. Srivastava et al. [29] noted that Dropout is applied independently to each neuron in the hidden layer at each iteration and can mitigate overfitting. The position and rate of Dropout are also key factors. For example, too high a rate will cause too many neurons to be dropped, so the model will not learn the training characteristics, whereas too low a rate leaves the model prone to overfitting during training. Zeng et al. [41] noted that L1 or L2 regularized sparse representation methods have been applied in different fields, where a representation based on L1 regularization is sparser, while a representation based on L2 is simpler and faster. Table 1 shows the parameter settings used for training and modeling in this research.
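To illustrate how the Dropout rate affects what a layer passes on, the following is a minimal sketch of inverted dropout applied to a list of activations. The function name, the rate values, and the seeded random generator are illustrative assumptions and do not reflect the settings in Table 1.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation independently with
    probability `rate` and rescale the survivors by 1 / (1 - rate),
    so the expected value of each activation is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


# Hypothetical activations; the seed only makes the sketch repeatable.
unchanged = dropout([1.0, 2.0], rate=0.0, rng=random.Random(0))
half_dropped = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
```

A rate of 0 leaves the activations untouched, which offers no regularization, while a rate close to 1 zeroes almost every neuron and leaves little signal to learn from.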

First, this research uses the proposed AETS-LSTM model with the PMI of each industry to predict the rise and fall of agricultural exports. Figure 6 shows that the chemical/biological/medical, accommodation and food service, finance and insurance, basic materials, education/professional, science/technical, information/communications/broadcasting, transportation and storage, and retail industries produce better predictions of the rise and fall of agricultural product exports than those obtained using other industries. Next, this research feeds both the PMI of each industry and the keyword vectors into the proposed AETS-LSTM model to compare and evaluate predictions of the rise and fall of agricultural exports. The performance measures used are precision, recall, F-score, sensitivity, specificity, and accuracy, and the results are shown in Table 2 and Fig. 6. With the exception of the accommodation and food service and the construction and real estate industries, all industries showed improved performance with the AETS-LSTM model. Among them, the finance and insurance industry scored above 80% on all performance measures, and its prediction accuracy improved from 69.57% using the original PMI alone to 82.61%, which is better than other industries. Finally, this research compares the prediction results obtained using the proposed AETS-LSTM model with those of a neural network and an SVM model for the four industries with the best prediction results, namely the finance and insurance, transportation equipment, food and textile, and electrical and machinery equipment industries. Figure 7 shows that the proposed AETS-LSTM model achieves excellent prediction results for the rise and fall of agricultural exports.
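The performance measures named above can all be derived from a binary confusion matrix, as in the following minimal sketch. The function name and the confusion-matrix counts are hypothetical. The reported accuracies are arithmetically consistent with 16 and 19 correct predictions out of 23 (16/23 ≈ 69.57%, 19/23 ≈ 82.61%), but that count is an inference from the percentages and is not stated in the text.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures for a binary rise/fall classifier.

    Assumes every denominator is nonzero (each class occurs and is
    predicted at least once).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # recall is the same quantity as sensitivity
    specificity = tn / (tn + fp)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical counts: 19 of 23 predictions correct.
metrics = classification_metrics(tp=10, fp=2, fn=2, tn=9)
accuracy_percent = round(100 * metrics["accuracy"], 2)
```

Reporting sensitivity alongside recall is redundant for a binary task, since both are true positives divided by all actual positives; specificity adds the complementary view for the negative class.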

This research proposes a new AETS-LSTM model for effectively predicting agricultural export trends. Agricultural export trends are affected by many factors, such as news content, climate change, and the PMIs of various industries. Few studies have examined the impact of news content and PMI trends on agriculture. Experimental results show that PMI values for the finance and insurance industry combined with keyword vectors have a measurable impact on the prediction of the rise and fall of agricultural exports and can improve prediction accuracy from 69.57% to 82.61%. The prediction accuracy for the chemical/biological/medical, transportation equipment, wholesale, finance and insurance, food and textiles, basic materials, education/professional, science/technical, information/communications/broadcasting, transportation and storage, retail, electrical and machinery equipment, and electronic and optical industries can be improved by combining keyword vectors, while the prediction accuracy for the accommodation and food service and construction and real estate industries remained unchanged. Therefore, this research can enhance the understanding of agribusiness operators and policy makers with regard to the rise and fall of agricultural exports month on month and allow them to better evaluate and adjust domestic and foreign production and sales.

Taiwan's agricultural product export data are presented monthly, restricting the number of data points available, and relevant privacy laws restrict access to data beyond publicly available information, such as public news reports and Open Data resources. These data availability restrictions limit the achievable prediction accuracy. The current study focuses exclusively on agricultural exports, but future work could expand on the current results by applying the model structure to other types of exports.

Fig. 7 Comparison of algorithm accuracy for the top four industries

Materials genome project: The application of principal component analysis to the formability of perovskites and inverse perovskites

Voltages prediction algorithm based on LSTM recurrent neural network

Identification of key cyberbullies: A text mining and social network analysis approach

Response and adaptation of agriculture to climate change: Evidence from China

A survey of the applications of text mining for agriculture

Utilization of LSTM neural network for water production forecasting of a stepped solar still with a corrugated absorber plate

Principal components analysis random discretization ensemble for big data

Long short-term memory

Consumer recommendation prediction in online reviews using Cuckoo optimized machine learning models

SpSAN: sparse self-attentive network-based aspect-aware model for sentiment analysis

A multi-label ensemble predicting model to service recommendation from social media contents

A systematic literature review on machine learning applications for consumer sentiment analysis using online reviews

Research trends in text mining: Semantic network and main path analysis of selected journals

Principal component analysis: a review and recent developments

Comparative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches

An empirical study of early warning model on the number of coal mine accidents in China

Comprehensive system based on a DNN and LSTM for predicting sinter composition

A novel hybrid particle swarm optimization and gravitational search algorithm for multi-objective optimization of text mining

Principal component analysis to study the relations between the spread rates of COVID-19 in high risks countries

Application of LSTM for short term fog forecasting based on meteorological elements

Efficient estimation of word representations in vector space

SACNN: self-attentive convolutional neural network model for natural language inference

Application of controller area network (CAN) bus anomaly detection based on time series prediction

A deep learning approach based on convolutional LSTM for detecting diabetes

Nonlinear time series and principal component analyses: Potential diagnostic tools for COVID-19 auscultation

Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM

Do oil prices drive agricultural commodity prices? Further evidence in a global bio-energy context

The trade margins of Chinese agricultural exports to ASEAN and their determinants

Dropout: a simple way to prevent neural networks from overfitting

A novel methodology to classify test cases using natural language processing and imbalanced learning

Using deep learning for price prediction by exploiting stationary limit order book features

An LSTM model for power grid loss prediction

Are industry-level indicators more helpful to forecast industrial stock volatility? Evidence from Chinese manufacturing purchasing managers index

Attitude of Chinese public towards municipal solid waste sorting policy: a text mining study

The Impact of News Sentiment Indicators on Agricultural Product Prices

A Two-Dimensional Sentiment Analysis of Online Public Opinion and Future Financial Performance of Publicly Listed Companies

Long-term gear life prediction based on ordered neurons LSTM neural networks

Time-series generative adversarial networks

Hazard analysis: A deep learning and text mining framework for accident prevention

Construction site accident analysis using text mining and natural language processing techniques

An antinoise sparse representation method for robust face recognition via joint l1 and l2 regularization

The authors would like to thank the referees and the editors for their comments and valuable suggestions.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.