key: cord-0044720-xanp9nbu authors: McHugh, Orla; Liu, Jun; Browne, Fiona; Jordan, Philip; McConnell, Deborah title: Data-Driven Classifiers for Predicting Grass Growth in Northern Ireland: A Case Study date: 2020-05-18 journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems DOI: 10.1007/978-3-030-50146-4_23 sha: 0101ab9fbcbd1a358daeb46288bda100ea96adaa doc_id: 44720 cord_uid: xanp9nbu There are increasing pressures to combat climate change and improve sustainable land management. The agriculture industry is one of the most challenging areas for these changes, especially in Northern Ireland, as agriculture is one of the larger industries. Research has been carried out across the island of Ireland into methods of improving farm efficiency in multiple areas of farming, including livestock health, machinery improvements, and crop growth. Research has been carried out in this study into grass growth in the dairy farming sector, specifically within Northern Ireland. Grass growth prediction aims to inform farmers and policy makers in their decision-making process regarding sustainable land management in agriculture. The present work focuses on analysing and evaluating how data-driven classifiers can be used for grass growth prediction using the data related to soil content, weather, grass quality components etc. Four classifiers, namely Decision Trees, Random Forest, Naïve Bayes, and Neural Networks, are chosen for this purpose. Classification results based on a real-world data set are analysed and compared to evaluate and illustrate the performance and robustness of the classifiers. The results indicate that it is difficult to declare a single classifier with the highest performance and robustness. 
Nevertheless, it indicates that tree classification methods are better suited to the data to be studied, as opposed to probabilistic methods and weighted methods, e.g., the naïve Bayes classifier obtained a predictive performance of 78% when classifying spring seasonal grass growth data. reduce its carbon emission by 35% by 2030, to meet UK targets [2] . In NI, one of the main contributors to GHG emissions is the agricultural sector, which produces almost 30% of the total NI output [2] . Dairy farming is one of the largest agricultural industries in NI. According to the Committee on Climate Change, agricultural emissions in NI have continuously increased since 2009 despite efforts to improve the efficiency in dairy farming [2] . Therefore, it is vital that tools and support are provided to farmers and stakeholders within the industry to inform them on solutions and actions that can improve farming efficiency and reduce emissions. This study relates to the improvement of dairy farming efficiency by focusing on sustainable land management and examining grass growth which is one of the cheapest feed sources for livestock in NI [3] . Grass growth rates are variable across the year and depend on various factors, with some of the most influential factors being meteorological e.g., rainfall, solar radiation, and temperature. NI has a temperate climate that allows for a long growing season. Soil conditions such as temperature and moisture also have an influence on grass growth, curtailing growth particularly when soils are oversaturated or excessively dry. Other factors relating to management also impact grass growth such as fertiliser application, grazing intensity, and grazing rotation length. Grass related data have been collected by the Agri-Food and Biosciences Institute (AFBI) across NI in their GrassCheck project. 
AFBI is a research and development organisation that supports the Department for Agriculture, Environment, and Rural Affairs (DAERA) and other UK government bodies and public organisations. The GrassCheck project consists of farmer research gathered across 50 locations in NI including beef, sheep, dairy, and crop plot farming. The project will run for three years from 2018 to 2020 collecting grass growth data, grass quality data, grazing event data, and meteorological data. The authors of this research have performed an exploratory statistical analysis of the GrassCheck dataset detailed in [4] . In this study, the R programming language was used to provide a statistical overview, correlation analysis, and linear regression analysis of the GrassCheck data to identify the grass growth predictive features. A boxplot visualising the variance in the grass growth features (including pre-grazing cover, utilisation, and soil moisture) illustrated the variability of grass growth over an 8-month period in which data was recorded. A correlation analysis identified strong positive relations between offtake, pre-grazing cover and grass growth and strong negative relations between post-grazing cover and grass growth. Linear regression was performed on the GrassCheck dataset to determine which features had the greatest influence on grass growth. Using this method, pre-grazing cover and the available amount of grass to livestock (known as available) features were shown to be the best fit models when used as the explanatory variables. Other statistically significant features include offtake, utilisation, and month [4] . However, this study is still limited in finding the in-depth pattern for grass growth prediction. Advanced data analytics are expected to further enhance predictions by using, for example, data-driven classification models such as neural networks, naïve Bayes, and decision trees to expand on the exploratory statistics used to analyse grass growth data. 
Therefore, the aim of this research was to aid in understanding how various grassland features contribute to the prediction of grass growth, and to analyse and evaluate various classification models to deduce which are the most suitable for grass growth prediction. This paper is organised as follows: Sect. 2 provides an overview of related research in the area. Section 3 provides the methodology underpinning the research, with results and discussion presented in Sect. 4. Conclusions and future work are discussed in Sect. 5.
In Ireland, research has developed a grass growth prediction model for dairy-based farming [3, 5]. The Moorepark St. Gilles Grass Growth Model, known as the MoSt GG model, is a descriptive model providing insight into grass growth at paddock level in Ireland. The inputs to this model include forecasted meteorological data, management strategy information, and fertiliser application, specifically nitrogen (N). The outputs of this model include daily grass growth and N information such as the soil content, grass content, grass uptake, and nitrate leaching. The output from the model was compared to the output from an experimental farm in Cork, Ireland, for a period of two years. It was observed that, while the model improved on a previous model in some areas, i.e., better prediction of production per cutting date and per plot, it was not always accurate in others. For example, the predictions of N in grass content and of nitrate leaching were underestimated, potentially because the model does not consider previous years' management techniques. Although this model was not designed for NI, the same principles can be applied to aid in constructing a decision support system to support sustainability in NI. The decision support system could be expanded to make predictions to support farmers and policy makers in their decisions regarding sustainability.
Classification approaches, such as decision trees, artificial neural networks, and support vector machines, have been used in multiple research studies for different classification problems. These include agricultural issues such as crop disease prediction [6], crop yield prediction [7], and grassland biomass estimation [8]. This research discusses the various classification methods used in agricultural prediction. When predicting crop disease, multiple classification methods were used, including neural networks, naïve Bayes, random forest, decision trees, support vector machines, k-nearest neighbor, and ensemble models [6]. In that study, it was found that random forest and Gaussian naïve Bayes classifiers performed better than other classifiers when predicting binary data, while neural networks and random forest were better when predicting the original dataset. Multiple linear regression and density-based clustering methods have been used for crop yield prediction [7]. Multiple linear regression, neural networks, and adaptive neuro-fuzzy inference systems were also used in the area of grassland biomass estimation [8]. That research highlighted the use of the neuro-fuzzy system, as it performed better when estimating biomass than the artificial neural networks and the multiple linear regression.
The data used in this research is grassland data provided by AFBI from the GrassCheck project. Different features have been collected including grass growth, grass quality, grazing events, and meteorological data. The features within this dataset have been outlined in Table 1 below. There is a total of 4917 records that have been labelled using the classifications of High, Medium, and Low. The data have been binned into these labels based on calculating interquartile ranges on the grass growth value, where the lower quartile is 32.5 kg DM/Ha and the upper quartile is 75.9 kg DM/Ha.
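The quartile-based labelling described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `label_growth` is hypothetical, and the handling of values that fall exactly on a quartile (treated here as Medium) is an assumption, since the text does not state it.

```python
# Quartile thresholds quoted in the text (kg DM/Ha).
LOWER_QUARTILE = 32.5
UPPER_QUARTILE = 75.9


def label_growth(growth, lower=LOWER_QUARTILE, upper=UPPER_QUARTILE):
    """Return 'Low', 'Medium', 'High', or None when the growth value is missing.

    Assumption: values exactly equal to a quartile are labelled 'Medium'.
    """
    if growth is None:
        return None  # missing growth value -> unknown class
    if growth < lower:
        return "Low"
    if growth > upper:
        return "High"
    return "Medium"


# Invented example values, for illustration only.
sample = [12.0, 32.5, 54.3, 75.9, 90.1, None]
labels = [label_growth(g) for g in sample]
print(labels)  # ['Low', 'Medium', 'Medium', 'Medium', 'High', None]
```

Records for which the function returns `None` correspond to the unknown category, which the study removes before classification.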
Approximately 25% of the records are low, 50% are medium, and 25% are high. In numerical terms, this equates to 1146 low records, 2336 medium records, and 1153 high records. The prediction of grass growth can help farmers make management decisions about grazing, cutting, and other areas of farming in order to improve farm efficiency. For instance, knowing that there will be a low grass growth rate in the next month can allow farmers to ensure they have adequate stocks of feed concentrates to safeguard the wellbeing of their livestock. Some grass growth entries are missing, which results in 282 records with an unknown classification category. These unknown records were removed from analysis during this study. The missing values were introduced through the method of data collection, which relied on individual farmer input. Missing data was also introduced because grass growth was measured daily, while grass quality and grazing events were measured more sporadically. This meant that when the datasets were joined on the Farm ID and the Date, there were empty values wherever there were no recordings in the grass quality and grazing event datasets.
The following prediction models were chosen due to their ease of use, their popularity within the predictive analytics domain, and their application in the agricultural industry. Decision Tree (DT). DT classifies instances via a tree structure, where individual attributes are represented by nodes, and there are links between nodes. The DT calculates the information gained from each attribute and makes decisions based on which attribute has the most information gain [6]. Random Forest (RF). The RF can be described as a collection of individual decision trees that work together as an ensemble [6].
This classification model is useful as each individual decision tree is unlikely to make the same mistakes as the others, and therefore, the classification is safer from error. Naïve Bayes (NB). NB is a probabilistic classifier that assumes attributes are independent of each other and that they carry the same weight when making predictions [9]. Neural Network (NN). A NN is a classifier that, like a decision tree, uses nodes and links to make predictions. However, each node is assigned a weight, with priorities at each node split being given to the feature that has a larger weight [9].
Two case studies were carried out in the analysis of the GrassCheck dataset. Firstly, analysis of the 2018 dataset was performed where missing data were included (4917 instances, 19 features). Secondly, the same dataset was analysed where instances containing missing data were removed (107 instances, 19 features). The datasets were divided into seasonal data (winter data were excluded as there is no grass growth during these months and, therefore, no recordings take place). The dataset was divided into Spring (March-May), Summer (June-August), and Autumn (September-October). This resulted in imbalanced datasets as there is more likely to be high growth rates in summer months and lower growth rates towards the cooler times of year. To resolve this imbalance, the larger sets could have been reduced to the approximate size of the smallest set, resulting in even divisions. This method was not applied as it would result in a dataset that is too small to train classification models. Therefore, the Synthetic Minority Oversampling Technique (SMOTE) [10] was applied to the smallest set in the dataset in order to make synthetic data that resembled actual data in the dataset. In the Spring dataset, SMOTE was applied at 250%, which means the smallest set is increased by 250%.
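To make the percentage concrete, the following is a minimal pure-Python sketch of the SMOTE idea: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. It is a simplification rather than the implementation used in the study; the function name `smote`, the default `k=5`, and the random choice of base samples are assumptions made for illustration. With `percent=250`, a class of n samples gains 2.5n synthetic samples.

```python
import math
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic samples for `minority`, a list of numeric feature lists.

    percent=250 creates 2.5 synthetic samples per original sample (rounded),
    i.e. the class grows by 250%. Simplified sketch: base samples are drawn
    at random rather than visited in turn. Needs at least two samples.
    """
    rng = random.Random(seed)
    n_new = round(len(minority) * percent / 100)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Invented two-feature minority class, for illustration only.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1]]
new_points = smote(minority, percent=250, k=2)
print(len(minority), "original +", len(new_points), "synthetic samples")
```

Because every synthetic point is an interpolation between two real minority samples, the new data stay within the range of the observed values, which is why the synthetic records resemble the actual data.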
SMOTE was applied again at 35%, resulting in approximately balanced numbers of Low, Medium, and High labelled data (384, 376, 273, respectively). In the Summer dataset, SMOTE was applied at 150% and 90%, resulting in approximately balanced numbers of Low, Medium, and High data (642, 675, 665, respectively). In the Autumn dataset, SMOTE was applied at 100%, 250%, and 60%, resulting in approximately balanced numbers of Low, Medium, and High data (299, 292, 291, respectively). For each case study, ten-fold cross validation was carried out to evaluate each classifier. The evaluation metrics in this research are Kappa statistics, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Precision, Recall, F-Measure, and ROC Area. Several features in the dataset were determined to have no informational value, including Farm ID, Field, Conditions, Paddock, and Notes. As discussed in the statistical analysis performed on the data [4], some features have a stronger correlation with grass growth. This was further analysed using feature selection methods including the Pearson's Correlation Coefficient, Information Gain Attribute Evaluation, and Wrapper Subset Evaluation.
This section summarises the results and discusses the outcome of the experiments. The analysis performed consists of four classifiers used on multiple variations of the dataset. These include yearly divisions and period divisions of spring, summer, and autumn. Classification analysis was performed on the data using techniques including DT, RF, NN, and NB. The tables below display the outcome of the analysis by showing the percentage of correct predictions, Kappa statistics, MAE, RMSE, precision (P), recall, F-measure (FM), and ROC, respectively. The tables below (Table 2 and Table 3) display information from the classification of data over the year of 2018. Table 2 shows the evaluation metrics on the whole dataset without handling the missing data, which contains 4917 instances.
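Two parts of the evaluation protocol named above, the Kappa statistic and the ten-fold split, can be sketched as follows. This is an illustrative sketch under assumptions, not the tooling used in the study: the function names are hypothetical, the fold assignment is a simple unstratified interleaving, and the Kappa shown is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Agreement between two label lists, corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both lists contain a single class only.
    """
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold i holds out every k-th item from i."""
    for fold in range(k):
        test = list(range(fold, n_items, k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


# Invented labels, for illustration only: 4 of 6 predictions are correct.
actual = ["Low", "Low", "Medium", "Medium", "High", "High"]
predicted = ["Low", "Medium", "Medium", "Medium", "High", "Low"]
print(round(cohen_kappa(actual, predicted), 3))  # 0.5

folds = list(k_fold_indices(107, k=10))
print([len(test) for _, test in folds])
```

In the example, the raw accuracy is 0.67 but the Kappa is only 0.5, because a third of the agreement would be expected by chance given the label frequencies. This is why Kappa is reported alongside the percentage of correct predictions.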
Table 3 displays the evaluation metrics on the same dataset, but with all instances containing missing values removed, resulting in classification being performed on 107 instances. Table 2 shows that the tree classifiers, i.e., DT and RF, have the best metrics of the four classifiers. DT has the greatest Kappa statistic of 0.56, lowest MAE of 0.25, and highest recall and F-measure of 0.74 and 0.73, respectively. RF shows the best performance in RMSE with 0.35, precision with 0.75, and ROC area with 0.87. The results show that, overall, the RF classification method performed the best over the two datasets as its ROC area is the highest in both sets at 0.871 and 0.917. It also has the lowest RMSE across the classifiers at 0.3523 and 0.3211. However, RF is susceptible to influence by an imbalanced dataset, i.e., the Medium category is a larger set than the other two growth rates, resulting in a skewed output. The NN classifier showed the greatest improvement in terms of all metrics when missing data were removed from the dataset, e.g., from Table 2 to Table 3 its Kappa statistic increased from 0.33 to 0.59 while its MAE fell from 0.30 to 0.17, as a network's performance will increase when all features are available, i.e., when there are no missing data. The NB classifier showed little difference in performance between the full dataset and the dataset where missing values were removed, with no strong improvement observed. This is due to the assumption of independence of the attributes, as not all the attributes in these data are independent, e.g., offtake, available, and utilisation depend on the pre-grazing and post-grazing cover.
The tables below (Table 4, Table 6, and Table 8) display information from the classification of data pertaining to spring, summer, and autumn.
Table 5, Table 7, and Table 9 show the evaluation metrics on the spring, summer, and autumn datasets, respectively, where SMOTE has been applied to balance the classes in the datasets, resulting in balanced categories of low, medium, and high. All the classifiers improved from Table 4 to Table 5 in terms of correctly predicted instances once the dataset classes had been balanced, e.g., NB showed the largest increase of 15%, from 63 to 78%. However, this does not mean that NB is a good classification method for these data. Although it has shown the most improvement in all metrics, including MAE and RMSE (reduction of 0.09 and 0.10 respectively), it is the overall least successful when predicting the level of grass growth, again due to the assumption made of independence. RF could be considered the second-best improver after NB, as it improved the most of the remaining classifiers across most of the metrics. For instance, the Kappa statistic increased by 0.14, and the precision and recall increased by 0.062 and 0.063, respectively. Overall, there are minor differences between DT, RF, and NN as they have similar evaluation outputs across all of the metrics. In Table 6, DT has the largest percentage of correctly predicted instances at 77%. It also has the best performance in Kappa statistic, precision, recall, and F-measure (0.60, 0.77, 0.77, 0.76, respectively). However, when SMOTE is applied in Table 7 to balance the classes in this dataset, RF becomes the better classification method as it has the best metric value in Kappa, RMSE, precision, recall, F-measure, and ROC area (0.76, 0.28, 0.85, 0.84, 0.84, and 0.95 respectively). RF shows the greatest improvement in six of the eight metrics, including precision (0.75 to 0.85), recall (0.75 to 0.84), and F-measure (0.74 to 0.84). This means the classifier returns more accurate results. NN shows the greatest improvement in RMSE, where it reduced from 0.39 to 0.32, and in the ROC area, where it increased from 0.91 to 0.92.
Overall, all of the classification methods have improved in all of the evaluation metrics with the addition of synthetic data. Table 8 shows the evaluation metrics of the classifiers before SMOTE has been applied, and indicates that the DT classifier could be considered the most appropriate method due to its performance across the metrics. For example, it has the greatest Kappa statistic of 0.63, the highest precision and recall rates of 0.80 each, and the highest F-measure of 0.79. However, when the synthetic data were added to the dataset, RF became the most accurate classifier due to its performance in Kappa, RMSE, precision, recall, F-measure, and ROC area (0.76, 0.28, 0.84, 0.84, 0.84, and 0.95 respectively). Again, NB showed the greatest improvement out of the four classifiers, with improvements across all of the metrics including Kappa statistic (increase of 0.30), MAE (reduction of 0.13), and RMSE (reduction of 0.11). However, NB was the least accurate classification method for this dataset, as DT, RF, and NN performed better across all of the metrics, e.g., NB has an MAE of 0.22 when SMOTE was applied, while the other classifiers have an MAE of 0.17 or below. NN showed the greatest improvement in precision as it increased from 0.67 to 0.71, although the increases in each of the classifiers in this metric were very minor.
Three methods of feature selection were also carried out on the dataset with no missing values. The first was Correlation Attribute Evaluation, which is based on Pearson's Correlation Coefficient and ranks attributes by the strength of their correlation with the target class. The results of this method show there is more of a correlation between Dry Matter, Soil Moisture, Offtake, Pre-Grazing Cover, Available and the target category class. Attributes such as Total Rainfall, Week, Month, and Crude Protein have less of a correlation, as their values are closer to 0 than 1.
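For reference, Pearson's correlation coefficient between an attribute and a numerically encoded target can be computed as in the sketch below. The coefficient lies between -1 and 1, and values near 0 indicate little linear relationship, which is the sense in which the lower-ranked attributes above are "closer to 0". The encoding of the class as Low=0, Medium=1, High=2 and the example values are assumptions made for illustration; they are not taken from the GrassCheck data, and the evaluation tool used in the study may handle a nominal class differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical encoding of the target class for illustration.
CLASS_CODE = {"Low": 0, "Medium": 1, "High": 2}

# Invented attribute values and labels, for illustration only.
pre_grazing_cover = [1800.0, 2100.0, 2600.0, 2900.0, 3300.0, 3500.0]
growth_class = ["Low", "Low", "Medium", "Medium", "High", "High"]
target = [CLASS_CODE[c] for c in growth_class]

print(round(pearson(pre_grazing_cover, target), 2))
```

Ranking attributes by the absolute value of this coefficient gives an ordering of the kind reported above, with strongly correlated attributes at the top.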
Information Gain Attribute Evaluation with a ranker filter was also used for feature selection. In this method, attributes such as Date, Offtake, Dry Matter, Available, and Pre-Grazing Cover provide more information to the prediction of the target class. Attributes which provide less information for predicting included Acid Detergent Fiber, Dry Matter, and Post Grazing Cover, while attributes such as Month, WSC, Utilisation, Crude Protein, Air Temperature, Total Rainfall, and Post-Grazing Cover had no information gain, with a value of 0. Another method of feature selection used on the data was the Wrapper Subset Evaluation, using a Decision Tree with the Best First Ranker method. This method uses a decision tree to evaluate numerous subsets to determine the best subset. With this method, the merit of the best subset was 0.748, and the optimal number of folds for this dataset was found to be 5. Attributes identified as having the most significance included Week, ADF, ME, Utilisation and Total Rainfall. The feature selection methods described above have selected different attributes as being the most informative. Each of the classification methods was run again using the information provided by the feature selection methods. However, the results proved to be poorer when features were removed. A Python script was developed to perform the same feature selection methods, and difficulties were found due to the text fields (County, Event, Date) and negative numbers (Utilisation, Offtake, Available) in the dataset, which are problematic in the feature selection methods chosen. The negative numbers were normalised and the text fields were categorised, and the classification was run again. As before, each of the methods chose different features as important and there was no strong similarity between the selected features. In addition, features which would be designated important in real life (e.g., rainfall) were not ranked very highly and vice versa.
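The information gain score used in this section can be illustrated with a short sketch: it is the entropy of the class labels minus the weighted entropy that remains after splitting the records by an attribute's values. The function names and toy data are invented for illustration, and the sketch assumes the attribute is already categorical; numeric attributes would first need to be discretised, a step the text does not detail.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a categorical attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Invented toy data, for illustration only.
labels = ["Low", "Low", "High", "High"]
informative = ["wet", "wet", "dry", "dry"]      # separates the classes perfectly
uninformative = ["wet", "dry", "wet", "dry"]    # tells us nothing about the class

print(information_gain(informative, labels))    # 1.0
print(information_gain(uninformative, labels))  # 0.0
```

A gain of 0, as reported above for attributes such as Month, means that splitting on the attribute leaves the class distribution unchanged. Because the score depends on how values are grouped, the ranking can be sensitive to the pre-processing applied to the data.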
Therefore, further investigation is needed into feature selection methods that better suit the available dataset, along with more elaborate data pre-processing to classify and clean the data so that the above feature selection methods become feasible.
The present work focused on analysing and evaluating four data-driven classifiers for grass growth prediction using real grass data related to soil content, weather, grass quality components etc. From the above study, it was found that the tree classifiers, namely the DT and RF methods, were the better methods of classification. DT performed better in datasets which contained imbalance, such as in the seasonal divisions of spring, summer, and autumn. It consistently performed the best in Kappa statistics, precision, recall and F-measure across all seasonal data. RF performed consistently in RMSE and ROC area in both imbalanced and balanced datasets, with its best performance values as low as 0.27 (RMSE), and as high as 0.96 (ROC) on the spring dataset with synthetic data. This was the highest ROC value across all the classifiers, while the lowest value was 0.69, produced by NB on the autumn dataset. Once synthetic data were applied, and the imbalance was eradicated, RF became the overall best classifier in each of the experiments that was carried out. It consistently performed the best in Kappa, RMSE, precision, recall, F-measure, and ROC area. NB could be considered the least successful classification method, as its accuracy and evaluation metrics were well below those of the other three methods. The best performance from this classifier was on the spring dataset with SMOTE applied, in which its accuracy was 78%, while the other classifiers were 84% and 85% accurate on the same dataset. Other measures, including MAE (0.16), were good for the NB classifier; however, this was still the highest error rate on that dataset.
Other interesting results were produced by the DT classifier, as its performance fell across all of the evaluation metrics from the whole-year dataset to the dataset in which instances containing missing data were removed. This is due to the size of the dataset, as a small dataset of 107 instances does not contain enough information for the DT to make accurate decisions. The above study has demonstrated the good potential of using data analytics for grass growth prediction, although the overall performance of the four classifiers is not exceptional considering, for example, the accuracy rate. This is partially due to the quality of the data (missing data, imbalance, and uncertainty), and partially due to the study being limited to only four classifiers. More elaborate data preprocessing and cleaning methods can be used, and other types of classifiers can also be explored in future work. One limitation of this study is the inability to explain how the classifiers came to the conclusion of their prediction. At present, attributes are assigned greater weights when there is higher information gain, and lower weights when there is little information gain. Future work will consider the use of expert knowledge to assign weights to attributes, which will allow the conclusion to be better explained to users, and give definitive reasoning for the prediction. This study underpins research for aiding farmers and policy makers in their decisions regarding sustainable land management. The agriculture and farming industry of NI requires tools and strategies to encourage sustainable land management, especially due to its large contribution to gaseous emissions in NI. The study has highlighted the need for a system that can handle missing data and uncertainty.
The data-driven approach is expected to be combined with expert knowledge from the industry, and models must be integrated to enhance the overall performance and create a multilayer decision support system to support farmers and policymakers when making land sustainability decisions.
1. Climate Change Act
2. Committee on Climate Change: Reducing emissions in Northern Ireland
3. Weather forecasts to enhance an Irish grass growth model
4. A decision analytic framework and exploratory statistical case study analysis of grass growth in Northern Ireland
5. Development of the Moorepark St Gilles grass growth model (MoSt GG model): A predictive model for grass growth for pasture based systems
6. Predicting crop diseases using data mining approaches: classification
7. Analysis of crop yield prediction using data mining techniques
8. Modeling managed Grassland biomass estimation by using multitemporal remote sensing
9. Towards detecting crop diseases and pest by supervised learning
10. SMOTE: synthetic minority over-sampling technique