key: cord-0124337-7ufarwvm authors: Adhikari, Poonam; Kumar, Ritesh; Iyengar, S.R.S; Kaur, Rishemjit title: What a million Indian farmers say?: A crowdsourcing-based method for pest surveillance date: 2021-08-07 journal: nan DOI: nan sha: 74541f8a217b3e35b1a2f6929a62bfcbb72dd370 doc_id: 124337 cord_uid: 7ufarwvm

KDD Workshop on Data-driven Humanitarian Mapping, 27th ACM SIGKDD Conference, 2021, Virtual Conference. © 2021 Copyright held by the owner/author(s).

Many different technologies are used to detect pests in crops, such as manual sampling, sensors, and radar. However, these methods scale poorly: they fail to cover large areas and are uneconomical and complex. This paper proposes a crowdsourcing-based method that uses real-time farmer queries gathered over telephone for pest surveillance. We developed data-driven strategies by aggregating and analyzing historical data to find patterns and gain insight into future pest occurrence. We show that this can be an accurate and economical method for pest surveillance, capable of covering a large area with high spatio-temporal granularity. Forecasting the pest population will help farmers make informed decisions at the right time. It will also help the government and policymakers make the necessary preparations as and when required, and may help ensure food security.

Agriculture accounts for 20.19% of India's GDP [7], and on average it employs over 50% of the population as per the Indian economic survey [6]. Over the past three decades, India achieved remarkable
growth in agricultural production by expanding agricultural land and by using high-yielding crop varieties, fertilizers, and pesticides. However, the scope for increasing production by expanding the farming area is limited. Besides, production is restrained by several other factors, such as limited availability of irrigation water, erratic weather patterns, falling groundwater levels, decreasing fertile land (availability per capita), and potential damage to crops due to pests.

Pests have been responsible for extensive crop damage in countless ways, affecting overall production. Irrespective of the crop, pest attacks have been found in the roots, stalks, bark, stems, leaves, buds, flowers, and fruits of plants [15]. According to the global burden of crop loss [19], pests and diseases cause about 20-40% loss in agriculture. In India, pests cause an annual loss of about US $42.66 million in agriculture [29], implying that pest infestation is a serious problem. Although this problem has persisted for a long time, efforts to control and manage the damage have not been entirely successful. Understanding pest population dynamics is an effective step towards pest management that can reduce the use of harmful pesticides. However, monitoring pest populations is complex, as they fluctuate with time, crop, location, and season. A precise understanding of population dynamics is therefore a challenging task.

Traditionally, pest management practices include the use of pesticides such as fungicides, herbicides, and insecticides. However, these are toxic to human health and are responsible for contaminating the agricultural ecosystem [16]. Over the years, researchers have employed different sensor-based and IoT-based methods for pest detection and monitoring. These include acoustic sensors, radars, LiDAR [18], cameras [30], infrared sensors, etc. [27]. However, these methods lack scalability due to the low area coverage of sensors and are also expensive to deploy.
Satellite imagery coupled with machine learning techniques has also been used for pest detection [33]. Another approach is pest forecasting based on weather and topological information. Wu et al. [32] used a multifactor spatial interpolation model that takes weather information as one of its data sources, together with a geographic information system (GIS), for forecasting. Hooker et al. [20] used multiple linear regression on rainfall, humidity, and temperature data to predict pests in wheat plants. However, these studies each focus on a single plant or pest. There is a need for an economical and scalable method for collecting high-frequency, high-resolution data.

In this work, we propose a crowdsourced data collection and pest attack prediction methodology. Crowdsourced data has been shown to be a powerful tool for predicting infectious disease outbreaks such as influenza and Middle East respiratory syndrome [28] and dengue fever [21]. It can capture high-frequency information at low cost and provides wide spatial extent and high resolution. We examined crowdsourced data recorded by farmer call centers, popularly known as 'Kisan call centers' (KCC) [9], managed by the Ministry of Agriculture, Government of India; the data is available online at [12]. KCCs have been established throughout India to inform and advise farmers on their queries and have proved to be efficient, as 61.5% of farmers depend on KCC for information support on various agricultural issues [26]. They have been a successful model for establishing direct communication between experts/semi-experts and farmers. This big data consists of agricultural queries and can be used to investigate various problems faced by farmers [25]. Viswanath et al. [31] analyzed the KCC datasets to find the peak query hours. Mohapatra et al. [23] studied KCC data to answer the most common queries by generating FAQs, and Jain et al. [22] used KCC data to develop a chatbot for farmers.
However, to the best of our knowledge, no such work has been done on pest prediction using KCC data. Our method is a data-driven approach to gathering pest population information that is accurate and economical, and is capable of covering a large area with block-level granularity. We used time-varying agricultural queries from different districts of India to study pest population dynamics. Interestingly, the spatiotemporal mapping of the normalized query features accurately mimicked events unfolding in the real world, as verified against contemporary news articles and reports. Further, we developed machine learning models to forecast the pest population across India. This forecasting model will help not only in knowing the pest arrival time but also in identifying the period of occurrence. This could help to track and predict pest severity and provide novel signals to the government or other agencies engaged in this area to preempt any outbreak.

Our dataset consists of KCC queries for 2015-2020, comprising 10,981,793 queries from 31 states and 553 districts, collected from the Open Government Data (OGD) website [12]. Each record contains the season, crop category, query type, crop, the question asked by the farmer, the answer provided by the KCC executive, location information, and the time the query was raised. The season field indicates the crop season: Rabi, Kharif, or Zaid. The category field contains the query category, such as pulses, vegetables, etc. The location is divided into three fields: state name, district name, and block name. An example of the data is shown in Table 1.

KCC data is not systematically organized and maintained, which leads to many challenges, such as missing information, spelling mistakes, and the use of multiple code-mixed languages. We performed data cleaning by discarding queries with incomplete details, such as missing state names, district names, or creation dates.
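The cleaning rule just described (discard any query that lacks a state name, district name, or creation date) can be sketched as below. This is a minimal illustration, not the authors' code: the field names `StateName`, `DistrictName`, and `CreatedOn` are hypothetical stand-ins for whatever the KCC export actually uses, and the sample records are made up.

```python
# Hypothetical field names; the real KCC column names may differ.
REQUIRED_FIELDS = ("StateName", "DistrictName", "CreatedOn")


def drop_incomplete(records, required=REQUIRED_FIELDS):
    """Keep only records in which every required field is present and non-blank."""
    cleaned = []
    for rec in records:
        # A field counts as missing if it is absent, None, or only whitespace.
        if all(str(rec.get(field) or "").strip() for field in required):
            cleaned.append(rec)
    return cleaned


# Made-up records for illustration only.
sample = [
    {"StateName": "STATE A", "DistrictName": "DISTRICT 1",
     "CreatedOn": "2015-07-06", "QueryText": "whitefly in cotton"},
    {"StateName": "", "DistrictName": "DISTRICT 2",
     "CreatedOn": "2018-08-19", "QueryText": "armyworm in maize"},
    {"StateName": "STATE B", "DistrictName": None,
     "CreatedOn": "2016-01-10", "QueryText": "aphid in mustard"},
    {"StateName": "STATE C", "DistrictName": "DISTRICT 3",
     "QueryText": "stem borer in paddy"},  # no creation date at all
]

cleaned = drop_incomplete(sample)
print(len(sample), "records before cleaning,", len(cleaned), "after")
```

Treating blank strings the same as absent values is a design choice of this sketch; whether the original pipeline did so is not stated in the paper.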
Further, we translated the queries to English using the Google Translate API and corrected spelling using the TextBlob [14] Python library, for both the question and the KCC answer. An example is shown in Figure 1. To extract pest-related queries, we generated a list of possible names for each pest, allowing for a one-character error, and kept the queries in which a pest name was found either in the question raised by the farmer or in the answer given by the KCC executive. The flow chart for labelling the pest queries is shown in Figure 2. Following this procedure, we were left with 867,337 pest-related queries, which is 7.89% of the total. We then aggregated the data by date, district, and pest to calculate the frequency of daily pest queries from each district. As the pest frequency depends on the agricultural land, the pest frequency from a district was normalized by the gross cultivable area [5] of that district.

To quantify the seasonality of the pest population, we computed the temporal autocorrelation of the query frequency with a lagged version of itself, using [17]. The autocorrelation vs. lag plot can be used to find seasonality in pest attacks: a peak at a given lag indicates that the pattern repeats after a period equal to that lag.

Seasonal Auto-regressive Integrated Moving Average (SARIMA) is a forecasting method for continuous time series data with seasonal variation. We used the SARIMA model to forecast pest occurrences. As a first step, we checked the stationarity of the data using the Dickey-Fuller (DF) test [24], given by $y_i = \rho\, y_{i-1} + \epsilon_i$. Here, $y$ is the variable of interest, $i$ is the time index, $\rho$ is the autoregressive coefficient (a unit root corresponds to $\rho = 1$), and $\epsilon$ is the error term. For pest frequency time series for which the DF test indicated non-stationarity, we took the log transformation followed by iterative n-order differencing, as shown in equation 1, until stationarity was achieved. To confirm stationarity, we also checked for time dependence of the mean and standard deviation.
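The log transformation, differencing, and mean/standard-deviation check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names, the seasonal lag `m`, and the window size are assumptions of this sketch, and the natural log assumes strictly positive frequencies (days with zero queries would need an offset).

```python
import math
from statistics import mean, pstdev


def log_transform(series):
    """Natural log of each value; assumes every value is strictly positive."""
    return [math.log(y) for y in series]


def seasonal_difference(series, m):
    """Difference each value against the value m steps earlier."""
    return [series[i] - series[i - m] for i in range(m, len(series))]


def rolling_stats(series, window):
    """Rolling mean and population standard deviation over a fixed window."""
    means, stds = [], []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        means.append(mean(chunk))
        stds.append(pstdev(chunk))
    return means, stds


# Synthetic series (illustrative only): a 4-step seasonal pattern whose
# level grows by a factor of 1.5 every cycle, so it is not stationary.
pattern = [2.0, 6.0, 3.0, 9.0]
series = [pattern[i % 4] * 1.5 ** (i // 4) for i in range(24)]

# Log transform, then one round of seasonal differencing with lag m = 4.
diffed = seasonal_difference(log_transform(series), m=4)

# After the transformation the rolling mean and std should be flat.
means, stds = rolling_stats(diffed, window=4)
print("rolling mean range:", min(means), max(means))
print("largest rolling std:", max(stds))
```

On real data one round of differencing may not be enough; the procedure would be repeated, re-running the stationarity test each time, until the test passes.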
A series was treated as stationary when its rolling mean and rolling standard deviation were constant. The test statistics of the DF test are shown in Table 2a.

$$y'_i = y_i - y_{i-m} \qquad (1)$$

Here, $y_i$ denotes the pest frequency at time $i$, $y'_i$ the differenced series, and $m$ the seasonal difference.

To determine the optimal values of the SARIMA model parameters, i.e., the trend autoregression order (p), the trend moving average order (q), the seasonal autoregressive order (P), and the seasonal moving average order (Q), we performed a grid search. Table 2b shows the parameter combinations. The Akaike information criterion (AIC) was used as the performance metric.

We analyzed the five years of queries and found that the largest share came from the agriculture sector at 79.42%, followed by horticulture at 18.81%, with livestock, weather, and fishing below 1%. Among the different crops, cereals received the most queries with 20.9%, followed by pulses (5.53%), oil seeds (4.82%), fiber crops (3.61%), and millets (2.95%), with the rest below 2%. We further analyzed only the pest-related queries for 264 varieties of crops and observed that the top queries were related to paddy (dhan) with 16.2%, followed by cotton (kapas) with 8.66%, brinjal with 7.62%, sugarcane (noble cane) with 4.45%, and wheat with 4.29%, with the rest below 3%. We observe that Tamil Nadu receives the highest number of normalized pest queries, as shown in Figure 3.

To assess the veracity of our methodology for extracting useful information from queries, we compared the major pest attack events reported in popular news and reports with the spatio-temporal pest attack hot spots observed in our data. To this end, we scraped news articles from two of the most popular national newspapers, The Times of India (ToI) [11] and The Hindu [4], along with other sources such as DownToEarth. According to the report by DownToEarth [13], whiteflies were responsible for immense crop loss during 2015-2016 in the states of Punjab and Haryana.
We verified this report against our data by extracting whitefly queries for 2015-2016, as shown in Figure 4a. It shows that the states of Himachal Pradesh, Punjab, and Haryana received, on average, the maximum number of queries per unit cultivated area. This attack was also reported by The Hindu on August 25, 2015 [2], one month and 19 days after the first query was received in KCC. Similarly, our data show a bollworm attack on cotton (kapas) during 2016 in Andhra Pradesh and Maharashtra, as shown in Figure 4b, and the corresponding scenario was reported in [8]. In 2018, maize was reported to be affected by armyworm in the districts of Andhra Pradesh [3], and, as expected, KCC also received a large volume of maize-related queries about armyworm from Andhra Pradesh, as shown in Figure 4c. The attack was first reported in The Hindu [1] on October 8, 2018, one month and 20 days after the first such query appeared in KCC. The much-publicized locust attack of 2020 in the northern part of India, also reported in Wikipedia [10], was observed in locust-related queries from Rajasthan, Punjab, and Himachal Pradesh, as shown in Figure 4d. It, too, appeared earlier in the KCC dataset. These results show that crowdsourced data from KCC is helpful for analyzing pest dynamics, as it may be able to detect pest outbreaks earlier than they are reported in popular newspapers.

3.3.1 Pest Seasonality. We computed autocorrelation vs. lag plots for different pests with a minimum lag of one day. The autocorrelation plots of whitefly, bug, aphid, and stem borer are shown in Figure 5. We observed peaks at lags of six months or one year. The query frequency for whitefly, bug, aphid, and stem borer across months over the five years is shown in Figure 6. Figures 6a and 6b show that whitefly and bug display yearly seasonality: whitefly is observed from January to April and bugs in August and September.
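The seasonality readings in this subsection rest on autocorrelation-versus-lag curves. A minimal sketch of that computation is given below, using the standard sample autocorrelation estimator on a synthetic monthly series; the estimator, the function names, and the data are assumptions of this sketch and may differ from what was actually computed with [17].

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom


def dominant_lag(series, max_lag):
    """Lag in 1..max_lag at which the autocorrelation is highest."""
    return max(range(1, max_lag + 1), key=lambda k: autocorrelation(series, k))


# Synthetic monthly query counts over five years (illustrative only):
# one high month per year, low counts otherwise.
series = [10.0 if i % 12 == 2 else 1.0 for i in range(60)]

best = dominant_lag(series, max_lag=18)
print("dominant lag (months):", best)
print("autocorrelation at that lag:", autocorrelation(series, best))
```

A pest with two active periods per year would instead show its first strong peak near a lag of six months, which is how half-yearly and yearly patterns can be told apart on such a plot.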
Figures 6c and 6d show that aphids and stem borers exhibit half-yearly seasonality and are mainly found in February-March and August-September.

3.3.2 Forecasting. We split our pest time series data into 70% training and 30% testing. We performed a DF test on the training data; the results for aphid and termite are shown in Table 2a. In the next step, we used a grid search across the autoregressive (AR) and seasonal autoregressive (SAR) hyperparameters on the training data to select the best-fitting parameters based on the lowest AIC value (as shown in Table 2b). After setting the hyperparameters, a model was built by fitting the training data. We then used the trained model for prediction on the test data. The root mean square error, mean standard error, and confidence interval corresponding to the forecasting results for aphid, insect, bug, and stem borer are shown in Table 3.

This study has drawn attention to the fact that crowdsourced data can be considered a way to study pest population dynamics, as it mimics real-world events. Using this data, we investigated pest seasonality and found that pest populations differ over time and place. We used time-series pest data to produce a reliable forecast, encouraging practical application. Our method is easy to use and provides an alternative approach to studying pest attacks. In future, we intend to quantitatively validate our approach, possibly by detecting all pest outbreak stories in the top newspapers in a comprehensive manner. Early predictions of pest attacks can help us find effective ways to protect crops and achieve better harvests. They help in knowing the arrival time of a pest attack and in identifying its period of occurrence. Thus, forecasting the pest population will help farmers take appropriate action on a timely basis. It will also help farmers decide the amount of pesticide to use and plan appropriate control measures with maximum efficiency.
References
Armyworms march on
Cotton crop in Punjab comes under whitefly attack - OTHER STATES - The Hindu
Fall Armyworm attack: The damage done
The Hindu: Breaking News, India News, Sports News and Live Updates
ICRISAT-District Level Data
India economic survey 2018: Farmers gain as agriculture mechanisation speeds up, but more R&D needed - The Financial Express
India GDP sector-wise 2020 - StatisticsTimes
India: Pests love a warmer world | PreventionWeb
Locusts' attack in western Rajasthan leaves farmers high and dry, ruin lakhs of hectares of crops - India News
News - Latest News, Breaking News, Bollywood, Sports, Business and Political News | Times of India
Open Government Data (OGD) Platform India
Pest attacks on rise across India, yet no discussion on spurious pesticides
TextBlob: Simplified Text Processing - TextBlob 0.16.0 documentation
Agricultural pests of South Asia and their management
Persistence of pesticides-based contaminants in the environment and their effective degradation using laccase-assisted biocatalytic systems
Distribution-free statistical tests
Insect pest detection, migration and monitoring using radar and LiDAR systems. Innovative Pest Management Approaches for the 21st Century
Global burden of crop loss
Using weather variables pre- and post-heading to predict deoxynivalenol content in winter wheat
Correlation between Google Trends on dengue fever and national surveillance report in Indonesia
AgriBot: agriculture-specific question answer system
Using TF-IDF on Kisan call centre dataset for obtaining query answers
Augmented Dickey-Fuller test
Big Data Analytics and Visualization Techniques: A Case Study from Agriculture Domain
Sustainable models of Information Technology for agriculture and rural development
IoT Based Pest Controlling System for Smart Agriculture
Methods using social media and search queries to predict infectious disease outbreaks
Emerging issues of plant protection in India
Plant disease and pest detection using deep learning-based features
Hadoop and Natural Language Processing Based Analysis on Kisan Call Center (KCC) Data
Application of GIS technology in monitoring and warning system for crop diseases and insect pests
Damage mapping of powdery mildew in winter wheat with high-resolution satellite image