key: cord-0192316-55n1u2rf authors: Fan, Joshua; Bai, Junwen; Li, Zhiyun; Ortiz-Bobea, Ariel; Gomes, Carla P. title: A GNN-RNN Approach for Harnessing Geospatial and Temporal Information: Application to Crop Yield Prediction date: 2021-11-17 journal: nan DOI: nan sha: 970e8f35dba8f6571a4b608780952adc7f429f10 doc_id: 192316 cord_uid: 55n1u2rf Climate change is posing new challenges to crop-related concerns including food insecurity, supply stability and economic planning. As one of the central challenges, crop yield prediction has become a pressing task in the machine learning field. Despite its importance, the prediction task is exceptionally complicated since crop yields depend on various factors such as weather, land surface, soil quality as well as their interactions. In recent years, machine learning models have been successfully applied in this domain. However, these models either restrict their tasks to a relatively small region, or only study over a single or few years, which makes them hard to generalize spatially and temporally. In this paper, we introduce a novel graph-based recurrent neural network for crop yield prediction, to incorporate both geographical and temporal knowledge in the model, and further boost predictive power. Our method is trained, validated, and tested on over 2000 counties from 41 states in the US mainland, covering years from 1981 to 2019. As far as we know, this is the first machine learning method that embeds geographical knowledge in crop yield prediction and predicts the crop yields at county level nationwide. We also laid a solid foundation for the comparison with other machine learning baselines by applying well-known linear models, tree-based models, deep learning methods and comparing their performance. Experiments show that our proposed method consistently outperforms the existing state-of-the-art methods on various metrics, validating the effectiveness of geospatial and temporal information. Climate change (Houghton et al. 1990 ) has become a real and pressing challenge that poses many threats to our everyday life. Besides the evident extreme events (Trenberth, Fasullo, and Shepherd 2015) , climatic variations also gradually impact the yields of major crops (Zhao et al. 2017 ). Crop production is vulnerable and sensitive to fluctuations in climatic factors such as temperature, precipitation, soil, moisture and many other factors (Ortiz-Bobea, Knippenberg, and Chambers 2018) . As the planet gets warmer, all of these swiftly-changing factors could perturb annual crop yields. Many recent works urge rethinking crop production practices under climate change (Reynolds, Ortiz et al. 2010; Raza et al. 2019) , which motivates the crop yield prediction problem (Van Klompenburg, Kassahun, and Catal 2020) . Crop yield prediction can help with food security (Shukla et al. 2019) , supply stability (Garrett, Lambin, and Naylor 2013) , seed breeding (Ansarifar, Akhavizadegan, and Wang 2020) , and economic planning (Horie, Yajima, and Nakagawa 1992) . However, crop yield depends on numerous complex factors including weather, land, water, etc. While there exist specialized process-based models to simulate crop growth (Shahhosseini et al. 2021) , they often produce highly-biased predictions, require strong assumptions about management practices, and are computationally expensive (Leng and Hall 2020) . Therefore, in recent years, powerful yet inexpensive machine learning methods have been widely adopted in crop yield prediction and demonstrated impressive results (Romero et al. 2013; Dahikar and Rode 2014; Marko et al. 2016; You et al. 2017; Khaki et al. 2020; Khaki, Khalilzadeh, and Wang 2020) . Machine learning models (especially deep learning models) benefit from the large capacity, sophisticated non-linearity and mature techniques inherited from other application domains. Despite the enormous amount of machine learning papers for crop yield prediction, many of them share similar methods. Among around 70 papers we surveyed, 48 used neural networks, 10 used tree-based methods (e.g. decision tree, random forest), and 10 used linear regressions (e.g. lasso). These methods often only differ in location (US, Brazil, India) , study granularity (province, county, site/farm), crop types (soybean, corn), and time range (weeks to years). Similar findings are also reported in (Van Klompenburg, Kassahun, and Catal 2020). For many relatively small selfcollected datasets, simpler models are preferred. But these models may not perform well on a large and diverse region like the entire US. In this paper, we compare various machine learning techniques on a nationwide scale, and propose a novel graph-based framework, GNN-RNN, which integrates both geospatial and temporal knowledge into inference. We train and compare these methods on over 2,000 counties from 41 states in the US mainland, with data covering years 1981 to 2019. There are up to 49 climatic and soil factors for each county, including precipitation, temperature, wind, soil moisture, soil quality, etc. Most of these factors vary across time within the year, or vary across different soil layers. All features are publicly available from sources such as PRISM, NLDAS, and gSSURGO. Furthermore, although not every county has crops planted, USDA provides corn yield labels from 41 states, and soybean yields from 31 states. Recent works using machine learning demonstrate promising results for crop yield prediction (Khaki et al. 2020) . Nevertheless, these methods treat each county as an i.i.d. sample in their models, which is plausible in small regions but may not fully utilize the spatial structure of a larger region. For instance, if one county has a splendid harvest, the neighborhood counties are very likely to have high yields as well, which violates the independence assumption. It is also problematic to treat counties in the northern US and southern US as i.i.d. samples, as the distribution of their climatic and soil conditions is very different. We hence introduce graph neural networks (GNN) (Wu et al. 2020) to take into account the geographical relationships among counties. When the model makes a prediction for a county, it can combine the features from neighboring counties with its own features to boost the predictive power. GNN models have been successful in many tasks such as election prediction (Jia and Benson 2020) and COVID forecasting (Kapoor et al. 2020) . Additionally, we show that GNNs can work synergistically with RNNs to combine both geospatial and temporal information for prediction. We will show the novel GNN-RNN model can achieve superior performance in experiments. As far as we know, our work is the first to incorporate geographical knowledge into crop yield prediction. To further lay a solid foundation in this task, we compare various widely adopted machine learning methods with our method, including lasso (Tibshirani 1996) , gradient boosting tree (Friedman et al. 2001) , CNN, RNN, CNN-RNN (Khaki et al. 2020) . These are also predominant methods among the papers we surveyed. The experimental results show that GNN-based methods consistently outperform these existing models on the nationwide benchmark. On both RMSE and R 2 , our GNN-RNN outperforms the state-of-the-art CNN-RNN model by 10%. In crop yield prediction, we denote each county's climatic features by x c,t and ground-truth crop yield (for a particular crop) by y c,t ∈ R, where c, t represent county and year respectively. Each x c,t contains four types of features (detailed descriptions of these features can be found in the Experiments section): weather features x w c,t ∈ R nw×52 , land surface features x l c,t ∈ R n l ×52 , soil quality features x s c ∈ R ns×6 , and some extra features (e.g. crop production index) We denote the number of weather, land surface, soil quality, and extra variables as n w , n l , n s , n e respectively. Among these features, x w c,t , x l c,t change both spatially and temporally, while x s c , x e c are county-specific and remain stable over time. The goal is to predict y c,t given x c,t . Recent work (Khaki et al. 2020 ) also showed features from past years can help with the prediction, so we reformulate our task as predicting y c,t with {x c,t , x c,t−1 , ..., x c,t−∆t }. ∆t is the length of year dependency. If ∆t = 0, the model will not consider features from prior years. Regardless of whether the models use historical features or not, the first step is always to extract an embedding for each year from x c,t . Then a prediction can be made based on the embedding from the current year or the embeddings from the last few years. The four types of features x w c,t , x l c,t , x s c , x e c have different structures. Using a uniform neural network to extract the embedding may not effectively exploit the structure in the raw data. For example, weekly features x w c,t , x l c,t naturally incorporate a temporal order, but county-specific soil features x s c do not change temporally and are measured at different depths underground. Therefore, we use separate neural networks to process the differently structured-parts from x c,t : f wl (·) handles the features that vary over time. Since land surface features like soil moisture from x l c,t are weekly data closely related to weather, we concatenate x l c,t and x w c,t before further passing to f wl . Given the temporal order, an RNN or a CNN can be used for f wl to facilitate information aggregation along the time axis. On the other hand, f s (·) aggregates information along soil depths. We use CNN as the architecture for f s . x e c only contains six scalar values, so we directly pass it to the output embedding. The final embedding h c,t is the concatenation of h wl c,t , h s c , x e c . Though new crops are planted every year and yields primarily depend on climatic factors within one year, it has been observed that the trend and variations captured by recent history can be very informative for prediction (Khaki et al. 2020) . For example, crop yields have tended to increase over the past few decades due to improvements in technology and genetics (Ortiz-Bobea and Tack 2018). While data on the underlying technological improvement is unavailable (Khaki et al. 2020) , we can observe recent trends in crop yield. Our per-year embedding extraction makes it easy to incorporate historical knowledge. All we need is an RNN that reads the per-year embeddings from the current year and several prior years. The output from the last time step would be our prediction for the crop yield of the current year: where r(·) is an RNN, and h c,t is the embedding from year t for county c. The model described so far follows the CNN-RNN framework, which has previously been shown to outperform single-year NN models (Khaki et al. 2020 ). Eq. 2 shows how one can extend the use of embeddings from Eq. 1 temporally. Then a natural question is, Can we take advantage of the embeddings geospatially as well? Intuitively, if a county has good yields, nearby counties tend to have good yields as well. The weather and soil conditions should also transition smoothly across the continent. The additional features from neighboring counties could boost the prediction if used properly. A recent success in COVID-19 forecasting (Kapoor et al. 2020) with similar insights could further support incorporating geographical knowledge, where the graph-based representation learning greatly improves case prediction. Graph Neural Network Graph Neural Network (GNN) (Zhou et al. 2020 ) is a novel type of neural network proposed to unravel the complicated dependencies inherent in graph-structured data sources. Given its strong power in representation learning, GNN has demonstrated prominent applications in chemistry (Gilmer et al. 2017 where V is the set of nodes and E is the set of edges between nodes. In our crop yield prediction task, each node is a county. E is represented as a symmetric adja- N is the total number of counties. Each node is associated with x c,t for every year. GraphSAGE A popular GNN model, GraphSAGE, (Hamilton et al. 2017 ) is a general framework that lever-ages node feature information and learns node embeddings through aggregation from a node's local neighborhood. Unlike many other methods based on matrix factorization and normalization (Jia and Benson 2020), GraphSAGE simply aggregates the features from a local neighborhood, and is thus less computationally expensive. The features can be aggregated from a different number of hops or search depth. Therefore the model often generalizes better. GraphSAGE is suitable for crop yield prediction because most counties only border a few others and the adjacency matrix is sparse. It also provides flexible aggregation methods. Formally, for the l-th layer of GraphSAGE, where z (0) c,t = h c,t from Eq. 1, and l ∈ {0, 1, ..., L}. N (c) = {c , ∀A c,c = 1} is the set of neighboring counties for c. The aggregation function for the l-th layer is denoted g l (·), which could be mean, pooling, or graph convolution (GCN) function. In practice, we found mean or pooling are effective and computationally efficient. a (l) c,t is the aggregated embedding from the bordering counties. We concatenate a (l) c,t with the last layer's embedding z (l−1) c,t before the transformation using W (l) . σ(·) is a non-linear function. GNN-RNN The output embedding from GNN's last layer z (L) c,t thus extracts the information (e.g., weather, soil) from the whole local neighborhood for year t. To integrate the historical knowledge, we can do the same as in Eq. 2, by taking the GNN output embeddings from prior years: where z (L) c,t is the GNN embedding from year t . Loss Function We use log-cosh function as our objective: Log-cosh works similarly to mean square error, but is not as strongly affected by the occasional wildly incorrect prediction. It is also twice differentiable everywhere. Mini-batch training is adopted during optimization. Batch loss is the average log-cosh loss of all samples in a batch. As one of the early works, (Liu, Goering, and Tian 2001) attempted a shallow neural network on corn yield prediction. It was shown that neural networks could beat conventional regression algorithms (Drummond et al. 2003) . In recent years, owing to the development of deep learning, neural network-based models have become more prevalent in the crop yield prediction field (Dahikar and Rode 2014; Gandhi, Petkar, and Armstrong 2016). As we mentioned in the introduction, 48 out of 70 recent works we surveyed employed neural nets. More than half of the NN-based works adopted CNN and one fourth of them employed RNN. There are two groups of research directions according to different input sources. The first group of methods read remote sensing data such as satellite images or normalized difference vegetation index (NDVI), and used that to estimate the yields. (You et al. 2017 ) applied a deep Gaussian process to predict crop yields from a series of the multi-spectral satellite images. (Nevavuori, Narra, and Lipping 2019) employed deep CNNs to reduce the prediction uncertainty on RGB images. (Kim et al. 2019 ) did a case study in the Midwest and compared various AI models on satellite product datasets. 20 out of all the 70 papers primarily or only dealt with remote sensing data. These methods illuminate the use of the widely available remote sensing data, but it is hard to directly model the relations between crop yields and environmental factors that actually affect the yields. Therefore, another line of research aims at collecting environmental factors and directly training models with these factors as inputs. (Cakir, Kirci, and Gunes 2014) used temperature, rainfall and other meterological parameters to predict wheat production. (Khaki and Wang 2019) collected crop genotypes and environments to predict the performance of corn hybrids. More recently, CNN-RNN (Khaki et al. 2020) incorporated historical environment knowledge and proved benefits. Our GNN-RNN takes one more step by further adding neighborhood information. Dataset-wise, most papers have their own small-scale datasets, which vary largely in different dimensions. Scalewise, (González Sánchez et al. 2014 ) only studied a single zone in Mexico, while (Wang et al. 2018 ) studied the whole country of Argentina. Time-wise, (Wang et al. 2018) spanned over 5 years, while (You et al. 2017 ) investigated 13 years. It is thus hard to fairly compare models or validate the performance. There have been several efforts to evaluate models on a large and consistent dataset. The closest one to ours is the one from the CNN-RNN paper (Khaki et al. 2020) . However, they only used 13 states from the Corn Belt and used fewer features than us (for example, they did not use land surface data such as soil moisture). By contrast, we evaluate our models at a nationwide scale, using data from 41 states and 39 years. This forces our models to generalize to a diverse range of locations that have very different climatic and geographic conditions, instead of overfitting to a single region. We compare 11 representative machine learning models, including GNN and GNN-RNN, on US county-level crop yields for corn and soybean. We evaluate performance on three metrics: RMSE, R 2 , and correlation coefficient. Given a test year t, we use year t − 1 for validation and all the prior years for training. For example, if the test year is 2019, we train on data from years 1981-2017 (inclusive), validate on 2018 crop yields, and test on 2019 crop yields. Crop yield labels for corn and soybean are available from the USDA Crop Production Reports (USDA 2013) for numerous counties in the US. Not all counties report data in every year, but the coverage is still quite comprehensive. For example, for corn, all years between 1981 and 2003 have over 2,000 counties across 41 states reporting data. We train and evaluate our model on all counties where yield data is available. (When computing the loss, we ignore counties that do not have yield labels for that year.) We use a variety of climate, land surface, and soil quality variables as input features; these features are available for almost all counties in the contiguous 48 US states (3,107 counties in total 1 ). We draw 7 weather features from the PRISM climate mapping system (Daly and Bryant 2013) : precipitation, min/mean/max temperature, min/max vapor pressure deficit, and mean dewpoint temperature. These features are available at a 4 × 4 km grid for each day. We acquire 16 land surface features from the North American Land Data Assimilation System (NLDAS) (Xia et al. 2012) , which is a large-scale land surface model that closely simulates land surface parameters. These features include soil moisture content, moisture availability, and soil temperature (all at various soil depths), as well as observed weather variables such as wind speed and humidity. These variables are available at a 0.125 × 0.125 degree (∼ 14 km) spatial resolution, every hour. Soil quality features were acquired from the Gridded Soil Survey Geographic Database (gSSURGO) (Soil Survey Staff 2020), at a 30 × 30 meter resolution. These features include available water capacity, bulk density, and electrical conductivity, pH, and organic matter. Unlike the weather and land surface features, the gSSURGO soil quality features are fixed and do not change over time. In addition to the raw features, we use the raw sand, silt, and clay percentages to compute the "soil texture type" of each pixel based on the Natural Resources Conservation Service Soil Survey's classification scheme (Soil Survey 2021), and then compute the fraction of each county occupied by each soil texture type. In total, we have a total of 20 gSSURGO variables that are depth-dependent (so there are values for 6 different soil depth levels), and 6 "extra" variables which are not depthdependent (such as crop productivity indices). Finally, as in (Khaki et al. 2020) , we use the average crop yield (over all counties) of the previous year as an additional input feature, to capture the increasing trend in crop yield over time. A full list of the features can be found in the Appendix. All of these datasets were originally available as gridded raster data at a variety of spatial resolutions. We aggregated each feature to the county level by computing the weighted average of the variable over all grid cells that overlap with the county. Each grid cell is weighted by the percentage of the cell that lies inside the county, multiplied by the percentage of that grid cell which is cropland, pasture, or grassland; the land cover percentages are computed using the National Land Cover Database (Homer, Fry, and Barnes 2012) . In addition, the time-dependent variables (weather and land surface) were aggregated from daily to weekly frequency to make the prediction task more tractable. We consider two types of methods: (a) single-year methods that only use features from year t to predict yield for the same year t, and (b) 5-year methods that use features from a 5-year series (years {t − 4, t − 3, . . . , t}) to predict yield for year t. Single-year methods. We first consider methods that only use a single year of data to make predictions, to provide a fair comparison to the single-year GNN. For non-deep baseline methods, we select lasso, ridge regressor and gradient boosting regressor. For these methods, we flatten all the features from the entire year into a single feature vector, ignoring the temporal and soil-depth structure in the data. Next, we tried three baseline deep learning architectures for f wl (·): LSTM (Hochreiter and Schmidhuber 1997), GRU (Chung et al. 2014) , and 1-D CNN (Kalchbrenner, Grefenstette, and Blunsom 2014) . All of these methods process the weekly time-series of weather and land surface data within the year. We compare these methods with our single-year GNN model (Eq. 3), which incorporates geospatial context in making predictions. 5-year methods. For history-dependent models, we follow (Khaki et al. 2020 ) by considering a 5-year dependency for a consistent and fair comparison. Two baseline models using LSTM and GRU respectively handle the raw inputs {x c,t−∆t , ..., x c,t } directly with r(·). Specifically, they flatten the features for each year into a single vector (disregarding the weekly structure of the weather data or the depth structure of the soil data), and then feed the 5 year-vectors into the LSTM or GRU. The most recent CNN-RNN model (Khaki et al. 2020) pre-processes the raw features with a CNN (choosing CNN for f wl (·)) and then uses a LSTM to model the sequence embeddings as described in Eq. 2. Finally, the GNN-RNN model proposed in this paper (Eq. 4) still uses a CNN for f wl (·) to encode the raw features into an embedding for each year, then uses the GNN to refine the embeddings using information from the county's spatial context, and then passes those embeddings into an LSTM. We evaluate all methods on three popular metrics for regression: root mean square error (RMSE), the coefficient of determination (R 2 ), Pearson correlation coefficient (Corr). RMSE and R 2 tell us how well a regression model can predict the value of the response variable in absolute terms and percentage terms respectively. Note that our RMSE figures are expressed in units of the standard deviation of that crop's yield (across all years). Corr is essentially a normalized measurement of the covariance between two sets of data, and captures the strength of the linear correlation between true and predicted values. See Appendix for formal definitions. For the shallow models (ridge regression, lasso, and gradient boosting regressor), we used scikit-learn's implementations. For the baseline single-year models, we evaluated using LSTM, GRU, and CNN as f wl (·) to process the weekly weather and land surface data. For CNN, we used a 1-D CNN similar to the one in (Khaki et al. 2020 ), but we process all weather and land surface parameters together. The CNN contains series of 1D convolutions, ReLUs, and average pooling layers; this sequence is repeated four times. For all methods that use LSTM or GRU, we used PyTorch's implementation with 64 hidden states. The same CNN is used as the encoder for the weekly weather and land surface data in the CNN-RNN, GNN, and GNN-RNN models. (We also tried using an LSTM as the encoder for the weekly data for these models, but this did not improve results.) For all methods except for the 5-year LST-M/GRU, we processed the soil data using another small 1-D CNN (with three convolutional layers, and without average pooling), where the convolutions operate across 6 different soil depths. For the simple 5-year baseline models (LSTM and GRU), we fed the flattened feature vectors for each year through an LSTM or GRU, followed by a 2-layer fully connected network. For the GNN and GNN-RNN models, we used the implementation of GraphSAGE from the dgl library; we used a 2layer GNN, with edge dropout of 0.1. The adjacency graph of US counties is provided by the US Census Bureau. We used stochastic mini-batch training to train the model, where each layer samples 10 neighbors to receive messages from. We tried different aggregation functions and found that the "pooling" approach generally performed best. For all methods, we use the Adam optimizer (Kingma and Ba 2014) , sometimes with a mild cosine or step decay. We tried learning rates between 1e-5 and 1e-3, used a weight decay of 1e-5 or 1e-4, and a batch size of 32, 64, or 128. We trained the model for 100 to 200 epochs (until the validation loss clearly stopped improving). We chose the epoch and hyperparameter setting that produced the lowest RMSE on the validation year (the year before the test year). We ran the Table 1 : Evaluation results. For RMSE, lower is better; for R 2 and Corr, higher is better. We grouped the methods based on whether they use 1 year of data (1y) or 5 years of data (5y) to make predictions. GNN and GNN-RNN models 3 times with different random seeds to evaluate the variance in the results. The Appendix contains more details about hyperparameters. We evaluate the model on four test datasets: 2018 corn, 2018 soybean, 2019 corn, and 2019 soybean. These datasets span a wide geographic area, as well as differing growing conditions (2019 was a bad year due to the wet spring in the Midwest, which caused planting to be delayed). The results on these datasets are shown in Table 1 . For the methods that only use 1 year when making predictions, our GNN model clearly outperforms comparable baselines across all datasets and metrics (except for 2018 soybean Corr, where it is slightly worse than GRU). For the methods that use a history of 5 years, our GNN-RNN outperforms competing baselines in almost all cases (except for 2018 soybean Corr, where it is slightly worse than CNN-RNN). For example, in 2019, our corn yield prediction on R 2 score is 16% better than the prediction of a state-of-the-art work (Khaki et al. 2020 ). On average, we achieve a relative R 2 improvement of 10.44% over the recent CNN-RNN model, 16.16% over the 5-year LSTM, and a relative RMSE improvement of 9.6% over the CNN-RNN model, 13.18% over the 5-year LSTM. These indicate the importance of exploiting geospatial context in making these predictions. Figure 2 shows an example scatterplot of true vs. predicted corn yields for the test year 2018. The GNN-RNN model is able to capture differences in yield between counties quite well. One minor issue is that the model is not able to capture the counties with very high yields very well; the model almost never predicts a yield higher than 220, but there are actually several counties with a true yield higher than this. This may stem from the fact that such high yields were almost never seen before in the training years. We can also see these trends in the map (Figure 3 ). While the GNN-RNN model captures large-scale trends in crop yield very well, it sometimes outputs overly smooth predictions within a region, and under-predicts the area of high true yield in the Midwest. Improving the GNN's ability to detect fine-scale variations without smoothing them out is an important area for future work. We can see that crop yield prediction on a large scale is rather challenging, due to the complexity of the prediction task and the data involved. Each data point (one county/one year) has over 6,000 features, and standard models can easily overfit to noise in the data and fail to generalize. In order for the prediction task to be tractable, a model needs to take advantage of the various forms of structure in the data; temporal structure within a year (to capture weather patterns in different times of the year), temporal structure across years (to capture long-term trends such as technological improvements), and geospatial structure (to capture correlations between nearby county yields). Our GNN-RNN model is the first model to take all of these aspects into account when making crop yield predictions, and achieves superior performance compared with the existing state-of-the-art. In practice, crop yield predictions are most useful if they can be made well before harvest, as this gives time for markets to adapt, and humanitarian aid to be organized in cases of famine (You et al. 2017) . To simulate this, at test time only, for each county we mask out all weather and land surface features from after June 1 (week 22) of the test year, and replace them with the average values for that county during the training years. Then we pass the masked features through a pre-trained model to obtain predictions. The results for several methods for 2018 corn are presented in Table 2 . The graph-based models (GNN and GNN-RNN) clearly outperform competing baselines in this scenario, again illustrating the importance of utilizing geospatial context. In this paper, we propose a novel GNN-RNN framework to innovatively incorporate both geospatial and temporal knowledge into crop yield prediction, through graph-based deep learning methods. To our knowledge, our paper is the first to take advantage of the spatial structure in the data when making crop yield predictions, as opposed to previous approaches which assume that neighboring counties are independent samples. We conduct extensive experiments on large-scale datasets covering 41 US states and 39 years, and show that our approach substantially outperforms many existing state-of-the-art machine learning methods across multiple datasets. Thus, we demonstrate that incorporating knowledge about a county's geospatial neighborhood and recent history can significantly enhance the prediction accuracy of deep learning methods for crop yield prediction. [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] were not present in the original gSSURGO dataset. Rather, for each pixel, we used the raw silt, clay, and sand percentages to compute the "soil texture type" of that pixel, based on the National Resources Conservation Service Soil Survey's classification scheme (Soil Survey 2021). This classification scheme is depicted in Figure 5 . After classifying each pixel's soil texture type, we compute the fraction of each county that is occupied by each soil texture type. Extra features (x e c ) also come from the gSSURGO dataset (Soil Survey Staff 2020), but are not depthdependent. They are listed below: For all methods, we use the Adam optimizer (Kingma and Ba 2014) . For the CNN-RNN, GNN, and GNN-RNN methods, we tried many hyperparameter configurations, most intensively on the 2018 corn dataset. We tried learning rates from {1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3}, used a weight decay of 1e-5 or 1e-4, and a batch size of 32, 64, or 128. We tried using a mild cosine decay (with η min ∈ {1e-5, 1e-6}, T 0 ∈ {34, 100, 200}), or step decay (every 25 epochs, γ ∈ {0.5, 0.8}) for the learning rate scheduler. We trained the model for 100 to 200 epochs (until the validation loss clearly stopped improving). We chose the epoch and hyperparameter setting that produced the lowest RMSE on the validation year (the year before the test year). For the GNN and GNN-RNN models, we used the implementation of GraphSAGE from the dgl library; we used a 2-layer GNN, with edge dropout of 0.1. We used stochastic mini-batch training to train the model, where each layer samples 10 neighbors to receive messages from. We tried different aggregation functions, such as "mean" and "pooling." We ran the GNN-RNN model 3 times with random seeds {0, 1, 2} to evaluate the variance in the results. For the baseline models, we used seed 0. The final hyperparameter configurations are listed in the below tables. Batch size 128 Learning rate 1e-4 LR scheduling Step (25 epochs, γ = 0.5) Number of epochs 100 Weight decay 1e-5 Step (25 epochs, γ = 0.5) Number of epochs 100 Weight decay 1e-5 We evaluate our model across all counties in the test year with data. We use three standard regression metrics: RMSE, R 2 , and Pearson correlation coefficient (Corr). The RMSE is the square root of the mean squared error between the prediction and the true value: RM SE = c (y c − y c ) 2 N where y c is the true yield for county c, y c is the model's predicted yield for county c, and N is the total number for counties in the test set with yield data. In this paper, we further divide RMSE by the standard deviation of the current crop's yield (across all years), in order to make the results for different crops comparable. R 2 is a measure of how much the variation in the data can be explained by the model predictions. Formally, c (y c −ȳ) 2 whereȳ is the average yield across the entire test dataset. The top of the fraction is the sum of the squared residuals (difference between true yield and model prediction). The bottom is the total sum of squares (of the difference between the true yield and the average yield across the test dataset), which is proportional to the overall variance of the test data. The Pearson correlation coefficient (Corr) measures the strength of the linear correlation between the true and predicted values. The correlation between two variables x and y is given as Againx andȳ are the means of x and y respectively. We let x be the model prediction and y be the true yield. We are planning to submit an expanded version of this paper to a journal, such as in the field of agricultural economics. After the journal article is published, we plan to make the entire aggregated dataset and codebase publicly available. All of the raw data used comes from publicly available sources. For now, we have submitted a data appendix with a small subset of the aggregated county-level data (only the states of Illinois and Iowa), and a code appendix containing our implementation of the GNN-RNN model. Performance prediction of crosses in plant breeding through genotype by environment interactions Yield prediction of wheat in south-east region of Turkey by using artificial neural networks Empirical evaluation of gated recurrent neural networks on sequence modeling Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting Agricultural crop yield prediction using artificial neural network approach. International journal of innovative research in electrical, electronics The PRISM climate and weather system: an introduction Statistical and neural methods for site-specific yield prediction Protein Interface Prediction using Graph Convolutional Networks The elements of statistical learning Rice crop yield prediction using artificial neural networks Land institutions and supply chain configurations as determinants of soybean planted area and yields in Brazil Neural message passing for quantum chemistry Predictive ability of machine learning methods for massive crop yield prediction Inductive representation learning on large graphs Long short-term memory The National Land Cover Database Yield forecasting Climate change Residual correlation in graph neural network regression A convolutional neural network for modelling sentences Examining covid-19 forecasting using spatio-temporal graph neural networks Crop yield prediction using deep neural networks. Frontiers in plant science, 10: 621. Khaki, S.; et al. 2020. A cnn-rnn framework for crop yield prediction A comparison between major artificial intelligence models for crop yield prediction: Case study of the midwestern united states Adam: A method for stochastic optimization Predicting spatial and temporal variability in crop yields: an inter-comparison of machine learning, regression and process-based models A neural network for setting target corn yields Soybean varieties portfolio optimisation based on yield prediction Crop yield prediction with deep convolutional neural networks. Computers and electronics in agriculture Growing climatic sensitivity of US agriculture linked to technological change and regional specialization Is another genetic revolution needed to offset climate change impacts for US maize yields? Impact of climate change on crops adaptation and strategies to tackle its outcome: A review Adapting crops to climate change: a summary. Climate change and crop production Using classification algorithms for predicting durum wheat yield in the province of Buenos Aires. Computers and electronics in agriculture Few-Shot Learning with Graph Neural Networks Coupling machine learning and crop modeling improves crop yield prediction in the US Corn Belt Climate Change and Land: an IPCC special report on climate change, desertification, land degradation, sustainable land management, food security, and greenhouse gas fluxes in terrestrial ecosystems Soil Survey. 2021. Soil Texture Calculator. Soil Survey Staff. 2020. Gridded Soil Survey Geographic (gSSURGO) Database for the Conterminous United States Regression shrinkage and selection via the lasso Attribution of climate extreme events National Agricultural Statistics Service. United States Department of Agriculture Crop yield prediction using machine learning: A systematic literature review Deep transfer learning for crop yield prediction with remote sensing data A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation System project phase 2 (NLDAS-2): 1. Intercomparison and application of model products Deep gaussian process for crop yield prediction based on remote sensing data Temperature increase reduces global yields of major crops in four independent estimates Graph neural networks: A review of methods and applications. AI Open Min vapor pressure deficit Land surface features (x l c,t ) come from the NLDAS land surface model (Xia et al. 2012), with an original spatial resolution of 0.125 degrees (14 km) and a temporal resolution of hourly Precipitation hourly total (kg/m 2 ) Moisture availability (%) Moisture availability (%), 0-100 cm 4. Soil moisture content (kg/m 2 ) Soil moisture content (kg/m 2 ) Soil moisture content (kg/m 2 ) Soil moisture content (kg/m 2 ) Soil moisture content (kg/m 2 ) Soil moisture content (kg/m 2 ), 100-200cm 10. 2-m above ground specific humidity (kg/kg) 11. 2-m above ground temperature (K) The dataset has a 30-m spatial resolution for the continental U.S. These variables do not change over time. However, they vary with depths, which are measured at 6 soil depth layers (0-5cm, 5-15cm, 15-30cm, 30-60cm, 60-100cm, 100-200cm). Because soil quality at a given point can vary substantially within a county, accounting for the location of agricultural activity can be critical when constructing appropriate county-level soil variables. Thus, the Available water capacity of the dominant soil component 2. Bulk density Electrical conductivity of the dominant soil component 4. Organic matter This research was supported by USDA Cooperative Agreement 58-6000-9-0041 and USDA NIFA Hatch Project 1017421. We would like to thank Rich Bernstein for constructive suggestions and Samuel Porter for help in processing the gSSURGO dataset. We provide a list of all of the input features used in the model, grouped by source. Note that all features are spatially aggregated to the county level, using a weighted average (where each grid cell is weighted by the fraction of the cell that lies inside the county, multiplied by the percentage of the grid cell that is cropland/pasture/grassland). An example of this aggregation process is depicted in Figure 4 . Temporally, all time-dependent features are also aggregated to weekly frequency. Weather features (x w c,t ) come from the PRISM dataset (Daly and Bryant 2013) , with an original spatial resolution of 4 km and a temporal resolution of daily: We ran our code on Python 3.7, using the following libraries: PyTorch 1.8, DGL 0.7.1, NumPy 1.18.5, SciPy 1.2.0. We trained on NVIDIA Tesla V100 GPU with 16GB memory, and used 12 CPU threads for GNN-RNN. The GNN-RNN model takes roughly 8 hours to train for 100 epochs on our full US dataset. Here are maps and scatter plots showing example results for the GNN-RNN model on the other datasets.2019 corn: Fig. 6 describes the difference between the ground-truth corn yields for counties in 2019, and our predictions. To demonstrate the similarity, we plot their difference in the bottom figure. As shown in the bottom subfigure, almost all differences are all close to 0. Fig. 7 shows another plot of true-vs-predicted comparison. All the dots cling to the identity function, which means good prediction results.2019 soybeans: Fig. 8 and Fig. 9 are similar true-vspredicted plots for soybeans. The prediction results are also promising.