title: Presenting an extensive lab- and field-image dataset of crops and weeds for computer vision tasks in agriculture
authors: Beck, Michael A.; Liu, Chen-Yi; Bidinosti, Christopher P.; Henry, Christopher J.; Godee, Cara M.; Ajmani, Manisha
date: 2021-08-12

We present two large datasets of labelled plant images that are suited to the training of machine learning and computer vision models. The first dataset encompasses, as of the day of writing, over 1.2 million images of indoor-grown crops and weeds common to the Canadian Prairies and many US states. The second dataset consists of over 540,000 images of plants imaged in farmland. All indoor plant images are labelled by species, and we provide rich metadata at the level of individual images. This comprehensive database allows users to filter the datasets under user-defined specifications, such as crop type or plant age. Furthermore, the indoor dataset contains images of plants taken from a wide variety of angles, including profile shots, top-down shots, and angled perspectives. The images of plants in fields are all taken from a top-down perspective and usually contain multiple plants per image; metadata is also available for these images. In this paper we describe both datasets' characteristics with respect to plant variety, plant age, and number of images. We further introduce an open-access sample of the indoor dataset that contains 1,000 images of each species covered. These 14,000 images were selected such that they form a representative sample with respect to plant age and individual plants per species. This sample serves as a quick entry point to the dataset, allowing new users to explore the data on a small scale and find the parameters most useful for their application without having to deal with hundreds of thousands of individual images.

A sufficient amount of labelled data is critical for machine-learning-based models, and a lack of training data often forms the bottleneck in the development of new algorithms. This problem is magnified in digital agriculture, as the objects of interest, plants, vary widely in appearance depending on their growth stage, specific cultivar, and health. A plant's appearance also reacts to outside factors such as drought, time of day, temperature, humidity, and available sunlight. Furthermore, the correct classification of plants requires expert knowledge, which cannot easily be crowdsourced. All of this makes the labelling of plant data significantly harder than similar image-labelling tasks. Yet, as we witness the introduction of sensors [1-4], robotics [5-10], and machine learning [11-16] to agricultural applications, there is a strong demand for such training data. This research area, running under the names of precision agriculture, digital agriculture, smart farming, or Agriculture 4.0, has the potential to increase yields while reducing the usage of resources such as water, fertilizer, and herbicides [16-28]. This next revolution in agriculture is fuelled by data, in particular labelled image data with rich metadata.
In this paper we describe two datasets, each consisting of hundreds of thousands of images, suitable for machine learning and computer vision applications. The first dataset, the lab-data, consists of indoor-grown crops and weeds that have been imaged from a wide variety of angles. The plants selected are species common to farmland in the Canadian Prairies and many US states. This dataset consists of images of individual plants, images showing several plants with and without bounding boxes, and metadata for each image and plant. At the time of writing, more than 1.2 million images have been added to this dataset since April 2020. All of these images were captured and automatically labelled using a robotic system as in [29]. The second dataset, the field-data, consists of images taken in the field during the growing seasons of 2019 and 2020. These images show a top-down perspective of the crops (and weeds) as they grow in cultivated farmland. Table 1 gives an overview of both datasets.

Table 1. Overview of the two datasets.
Dataset     Content                        Size
Lab-data    Indoor-grown crops and weeds   over 1.2 million images
Field-data  Crops and weeds in farmland    540,000 images

We present here these two datasets with respect to some of their key metrics, such as the number of images per species. Furthermore, we provide a sample of the lab-data consisting of 14,000 images. This sample (around 1% of the total data) is intended to give researchers an overview of how the data is structured with respect to available plant ages and growing stages, as well as the number of individual plants imaged by the system. For this, we carefully selected images from the entire dataset such that the age distribution per species is preserved and a wide range of individual plants is represented. Following best practices for data accessibility, as outlined in [11], we give immediate open access to this data sample, provide the metadata, and give insight into how the data was collected. In terms of long-term data storage and accessibility, both full datasets will eventually be fully open-source, following a data management plan of stagewise release. Our goal is to provide researchers in digital agriculture with labelled data to facilitate data-driven innovation in the field.

The rest of this paper is structured as follows: In Chapter 2 we describe the general structure of both datasets and their metadata. In Chapter 3 we visualize and describe key characteristics of the datasets, such as the number of individual plants per species. In Chapter 4 we describe the structure of the sample. We conclude the paper in Chapter 5 with information on the planned growth of the datasets and how to obtain the sample and the original dataset.

The lab-data can be divided into four different kinds of files that relate to each other as follows:

• Plain images: These are the images as captured by the camera. They typically show several plants in the same image.
• Bounding box images: These images are the same as the plain images, except that they are overlaid with visible bounding boxes around the plants as calculated by the system. Plants too close to the border of the image, or overlapping too much with each other, are not bounded by the system.
• Single plant images: These are images cropped out of plain images according to the calculated bounding boxes. Only plants for which a bounding box has been drawn are cropped out as individual images.
• JSON-files: These files contain the metadata associated with each plain image; they are described in more detail below.

See Figure 1 for examples of a plain image, the respective image with bounding boxes, and cropped-out single plant images.
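As a minimal sketch of how these four file types relate, the snippet below crops single plant images out of a plain image using the bounding boxes stored in its JSON-file. The "bounding boxes" and "label" key names follow the metadata description in the next section, but the coordinate keys (x_min, y_min, x_max, y_max) and the file names are illustrative assumptions, not the dataset's actual schema.

```python
import json
from pathlib import Path

from PIL import Image

def crop_single_plants(image_path: str, json_path: str, out_dir: str) -> None:
    """Crop single plant images out of a plain image using its JSON metadata."""
    with open(json_path) as f:
        meta = json.load(f)
    image = Image.open(image_path)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    for i, box in enumerate(meta.get("bounding boxes", [])):
        # Assumed coordinate keys; PIL expects (left, upper, right, lower).
        crop = image.crop((box["x_min"], box["y_min"], box["x_max"], box["y_max"]))
        crop.save(Path(out_dir) / f"{box['label']}_{i:03d}.png")

crop_single_plants("plain_0001.png", "plain_0001.json", "single_plants/")
```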
Each JSON-file contains information about the plain image and the respective single plant images as follows:

• version: An internally used version number.
• room, institute, camera, lens: These fields encode the location and the camera equipment used.
• vertical res, horizontal res: The resolution in pixels along the vertical and horizontal image axis, respectively.
• camera pose: This field contains the subfields x, y, z, polar angle, and azimuthal angle. The first three subfields describe the camera position in cm with respect to an origin point inside the imageable volume of the system. The latter two describe the camera's pan (polar angle) and tilt (azimuthal angle). See Figure 3 for details.
• bounding boxes: This field contains a list of elements, each corresponding to one plant around which a bounding box was drawn by the system (see above for the description of bounding box images and single plant images). Each element of this list contains the same set of subfields, including the label of the cropped-out single plant image.

The field-data collection is similar in structure to the above. However, as there are no labels attached to individual plants, only the following fields are in use: version, file name, date, time, room, institute, camera, lens, label, vertical res, horizontal res, source file path. Note that the label field refers to the entire image, whereas for the lab-data it is associated with a cropped-out single plant image. The entry under label thus describes the type of crop that is cultivated on the farmland from which the image was taken. Indeed, as imaging also took place before the application of herbicides, weeds can be seen between the dominant crops on some images. The other fields in the JSON-files have the same interpretation and usage as above. An example of a field-data image is given in Figure 2.

We now describe the lab-data with respect to different metrics, followed by metrics on the field-data. Figure 4 shows the number of images taken by the system starting from April 2020 (imaging before April 2020 took place and that data is available on request; it is not included in this dataset because the imaging system itself was still under experimentation). As of this writing, we have taken more than 446,000 plain images (that is, images that contain multiple plants) and cropped out over 1.2 million single plant images from these. The system is run several times per week, producing thousands of new plain images. The data acquisition rate dropped significantly after October 2020, as access was restricted due to the COVID-19 pandemic. We anticipate the acquisition rate to rise again as accessibility of our facilities improves.

The third panel in Figure 4 shows the imaged plants' age distribution. We define age as the number of days elapsed between seeding the individual plant and the time the image was taken. The histogram uses a binning size of 7 days. The majority of plants were imaged when they were between 7 and 35 days old. This corresponds to the growth stage in farmland at which it is critical to distinguish between weeds and crops. Thus, our emphasis on plants of this age matches the data needed for important applications in digital agriculture, such as estimating the germination rate of crops, quantifying weeds, and automated weeding. By our definition of age, the germination time itself influences the age value in our metadata. Since germination times vary between species, so do their age distributions. For example, the age distribution of weeds is generally shifted towards "older" plants by one or more weeks compared to crops. Indeed, in our efforts to grow both, we have found that weeds require a longer time and more care to germinate. Most plants are only imaged up to the point at which they "outgrow" the system, i.e., where the plant's size and shape lead to overlaps and inaccuracies when calculating bounding boxes.
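As a rough sketch of how such an age distribution can be reproduced from the metadata, the snippet below counts single plant images of one species per 7-day age bin. The "age" key is a hypothetical stand-in; how plant age is actually encoded must be taken from the dataset's JSON schema.

```python
import json
from collections import Counter
from pathlib import Path

def age_histogram(json_dir: str, species: str, bin_days: int = 7) -> Counter:
    """Count single plant images of one species per age bin.

    Assumes each bounding-box entry carries a species 'label' and a plant
    'age' in days; adapt both keys to the dataset's actual JSON schema.
    """
    bins = Counter()
    for json_path in Path(json_dir).glob("*.json"):
        meta = json.loads(json_path.read_text())
        for box in meta.get("bounding boxes", []):
            if box.get("label") == species:
                bins[box["age"] // bin_days] += 1
    return bins

BIN_DAYS = 7
hist = age_histogram("lab_data/metadata/", "Canola", bin_days=BIN_DAYS)
for b in sorted(hist):
    print(f"{b * BIN_DAYS:3d}-{(b + 1) * BIN_DAYS - 1:3d} days: {hist[b]} images")
```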
Table 2 lists how many single plant images per species are in the dataset. The number varies strongly by species, due to the availability of seeds, germination success, and access to our facilities. In Table 3 we list how many individual plants we have imaged per species. Again, the numbers vary due to seed availability, germination success, and access to our facilities.

We now give a short description of the field-data. The collection of field-data was performed by imaging the field with a stereoscopic camera mounted on a tractor. The camera, pointed straight down, records a video as the tractor drives through the field. We chose one of the two video channels to extract frames as images; these images form the field-data collection. The number of images extracted is chosen such that consecutive images show some overlap with respect to the area imaged. We also provide the video data itself, so users can extract images under their own timing conditions or work on the video directly. Table 4 and Table 5 give a breakdown of the number of images extracted from the videos per month and per crop, respectively. Field-data from the 2020 growing season is further accompanied by metadata about temperature, wind speed, cloud coverage, and camera height above ground.

Table 4. Number of field images per month.
Year   Month    Image count
2019   June          45,954
2019   July          84,033
2020   May           14,084
2020   June         197,980
2020   July         167,896
2020   August        32,230

Table 5. Number of field images per crop.

To create a visual overview of the lab-data, we created a subsample of it that is structured as follows: For each species listed in Table 2 we selected 1,000 single plant images, so the subsample contains 14,000 images in total. Within each of these categories, we selected images such that the age distribution of the 1,000 images closely matches the age distribution of all available images for that species. In addition, we selected images such that all individual plants grown are represented in the subsample, with the following exceptions: 51 individual Common Bean plants are present in the subsample (instead of the 53 in the entire dataset), as well as 113 Canola plants (of 128), 51 Soybean plants (of 84), and 37 Wheat plants (of 47). The distribution of image dimensions (width, height) in the subsample resembles the size distribution of the entire dataset; however, we did not select images to directly optimize the sample under that criterion. The total size on disk of the subsample is approximately 2.2 GB. The subsample contains only single plant images, organized into one subfolder per species. We consider this subsample a good entry point into the entirety of the dataset, which can be used to train some initial models, for example simple models that differentiate between species or classes of species (e.g., monocots versus dicots, crops versus weeds).
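Because the subsample is organized as one subfolder per species, it can be loaded directly with standard folder-based dataset utilities for such an initial experiment. Below is a minimal sketch using torchvision; the root path is a placeholder and the resize resolution is an arbitrary choice.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# The subsample's layout (one subfolder per species) matches torchvision's
# ImageFolder convention, so species labels are inferred from folder names.
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # single plant images vary in size
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("plant_subsample/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(f"{len(dataset)} images across {len(dataset.classes)} species")
images, labels = next(iter(loader))  # one batch: (32, 3, 224, 224) tensor
```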
In this paper we presented an extensive dataset of labelled plant images. These images show crops and weeds common to the Canadian Prairies and northern US states. After describing the data structure, we presented a subsample that mirrors the full dataset in key characteristics but is smaller in overall size and thus more tractable. We are actively growing the dataset along several dimensions: new field- and lab-data is being acquired and processed as of this writing. Furthermore, additional data sources, such as 3D point clouds and hyperspectral scans, are being tested and developed. Additional field-data sources are also being explored, including imagery from UAVs and a semi-autonomous rover. Data from these sources will accompany the datasets presented in this paper in the near future.

The 14,000-image sample is available at https://doi.org/10.25739/rwcw-ex45 on the CyVerse Data Store, a portal for full data lifecycle management. The full dataset, which contains 1.2 million single plant images (and counting), is made available to researchers and industry through the data portal hosted by EMILI at http://emilicanada.com/ (Digital Agriculture Asset Map). The authors take Lobet's general critique [11] of data-driven research in digital agriculture (or any research field) seriously. We further created a datasheet following the guidelines of Gebru et al. [30].

References

[1] 3-d imaging systems for agricultural applications: a review.
[2] A survey of ranging and imaging techniques for precision agriculture phenotyping.
[3] Nanostructured (bio)sensors for smart agriculture.
[4] Evolution of internet of things (IoT) and its significant impact in the field of precision agriculture.
[5] Advances in robotic agriculture for crops.
[6] Agricultural robots for field operations. Part 2: Operations and systems.
[7] Agricultural robots for field operations. Part 2: Operations and systems.
[8] Agricultural robotics: The future of robotic agriculture.
[9] Research and development in agricultural robotics: a perspective of digital farming.
[10] Farming reimagined: a case study of autonomous farm equipment and creating an innovation opportunity space for broadacre smart farming.
[11] Image analysis in plant sciences: Publish then perish.
[12] Automated plant species identification: trends and future directions.
[13] Machine learning in agriculture: A review.
[14] Computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review.
[15] Deep learning in agriculture: A survey.
[16] A comprehensive review on automation in agriculture using artificial intelligence.
[17] Controlled comparison of machine vision algorithms for rumex and urtica detection in grassland.
[18] Deep learning with unsupervised data labeling for weed detection in line crops in UAV images.
[19] Analysis of morphology-based features for classification of crop and weeds in precision agriculture.
[20] Digital image processing techniques for detecting, quantifying and classifying plant diseases.
[21] Lights, camera, action: high-throughput plant phenotyping is ready for a close-up.
[22] Machine learning for high-throughput stress phenotyping in plants.
[23] High throughput phenotyping to accelerate crop breeding and monitoring of diseases in the field.
[24] High-throughput phenotyping.
[25] Citizen crowds and experts: observer variability in image-based plant phenotyping.
[26] Plant phenomics, from sensors to knowledge.
[27] The digitisation of agriculture: a survey of research activities on smart farming.
[28] Smart farming: Agriculture's shift from a labor intensive to technology native industry.
[29] An embedded system for the automated generation of labeled plant images to enable machine learning applications in agriculture.
[30] Datasheets for datasets.