title: LifeCLEF 2020 Teaser: Biodiversity Identification and Prediction Challenges
authors: Joly, Alexis; Goëau, Hervé; Kahl, Stefan; Botella, Christophe; Ruiz De Castaneda, Rafael; Glotin, Hervé; Cole, Elijah; Champ, Julien; Deneu, Benjamin; Servajean, Maximillien; Lorieul, Titouan; Vellinga, Willem-Pier; Stöter, Fabian-Robert; Durso, Andrew; Bonnet, Pierre; Müller, Henning
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_70

Abstract: Building accurate knowledge of the identity, the geographic distribution and the evolution of species is essential for the sustainable development of humanity, as well as for biodiversity conservation. However, the difficulty of identifying plants and animals in the field is hindering the aggregation of new data and knowledge. Identifying and naming living plants or animals is almost impossible for the general public and is often difficult even for professionals and naturalists. Bridging this gap is a key step towards enabling effective biodiversity monitoring systems. The LifeCLEF campaign, presented in this paper, has been promoting and evaluating advances in this domain since 2011. The 2020 edition proposes four data-oriented challenges related to the identification and prediction of biodiversity: (i) PlantCLEF: cross-domain plant identification based on herbarium sheets, (ii) BirdCLEF: bird species recognition in audio soundscapes, (iii) GeoLifeCLEF: location-based prediction of species based on environmental and occurrence data, and (iv) SnakeCLEF: image-based snake identification.

Accurately identifying organisms observed in the wild is an essential step in ecological studies. Unfortunately, observing and identifying living organisms requires high levels of expertise.
For instance, plants alone account for more than 400,000 different species and the distinctions between them can be quite subtle. Since the Rio Conference of 1992, this taxonomic gap has been recognized as one of the major obstacles to the global implementation of the Convention on Biological Diversity [6]. In 2004, Gaston and O'Neill [27] discussed the potential of automated approaches for species identification. They suggested that, if the scientific community were able to (i) produce large training datasets, (ii) precisely evaluate error rates, (iii) scale up automated approaches, and (iv) detect novel species, then it would be possible to develop a generic automated species identification system that would open up new vistas for research in biology and related fields.

Since the publication of [27], automated species identification has been studied in many contexts [26, 29, 30, 38, 41, 43, 44, 48]. This area continues to expand rapidly, particularly due to recent advances in deep learning [25, 28, 31, 42, 45-47]. In order to measure progress in a sustainable and repeatable way, the LifeCLEF [15] research platform was created in 2014 as a continuation and extension of the plant identification task [37] that had been run within the ImageCLEF lab [12] since 2011 [32-34]. LifeCLEF expanded the challenge by considering animals in addition to plants, and by including audio and video content in addition to images. LifeCLEF 2020 consists of four challenges (PlantCLEF, BirdCLEF, GeoLifeCLEF, and SnakeCLEF), which we now describe in turn.

Motivation: For several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria. These physical specimens are used to study the variability of species, their phylogenetic relationships, their evolution, and phenological trends.
One of the key steps in the workflow of botanists and taxonomists is to find the herbarium sheets that correspond to a new specimen observed in the field. This task requires a high level of expertise and can be very tedious. Developing automated tools to facilitate this work is thus of crucial importance. More generally, this will help to convert these invaluable centuries-old materials into FAIR [23] data.

The task will rely on a large collection of more than 60,000 herbarium sheets that were collected in French Guyana (the "Herbier IRD de Guyane", CAY [14]) and digitized in the context of the e-ReColNat project [8]. iDigBio [11] hosts millions of images of herbarium specimens. Several tens of thousands of these images, illustrating the French Guyana flora, will be used for the PlantCLEF task this year. A valuable asset of this collection is that several herbarium sheets are accompanied by a few pictures of the same specimen in the field. For the test set, we will use in-the-field pictures coming from different sources, including Pl@ntNet [20] and Encyclopedia of Life [9].

Task Description: The challenge will be evaluated as a cross-domain classification task. The training set will consist of herbarium sheets, whereas the test set will be composed of field pictures. To enable learning a mapping between the herbarium sheet domain and the field picture domain, we will provide both herbarium sheets and field pictures for a subset of species. The metrics used for the evaluation of the task will be the classification accuracy and the mean reciprocal rank.

Motivation: Monitoring birds by sound is important for many environmental and scientific purposes. Birds are difficult to photograph, and sound offers better possibilities for inventory coverage. A number of participatory science projects have focused on recording very large numbers of bird sounds, making it possible to recognize most species by their sound and to train deep learning models to automate this process.
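The two metrics chosen for the PlantCLEF evaluation described earlier, classification accuracy and mean reciprocal rank, are simple to compute from ranked predictions. A minimal sketch (function names and the toy data are our own illustrations, not part of the challenge kit):

```python
def top1_accuracy(ranked_preds, labels):
    """Fraction of test items whose top-ranked species is the true one."""
    hits = sum(1 for ranks, y in zip(ranked_preds, labels) if ranks[0] == y)
    return hits / len(labels)

def mean_reciprocal_rank(ranked_preds, labels):
    """Average of 1/rank of the true species (contributes 0 if absent)."""
    total = 0.0
    for ranks, y in zip(ranked_preds, labels):
        if y in ranks:
            total += 1.0 / (ranks.index(y) + 1)
    return total / len(labels)

# Toy example: 3 test pictures, species ranked by decreasing confidence.
preds = [["sp_a", "sp_b"], ["sp_b", "sp_a"], ["sp_c", "sp_a"]]
truth = ["sp_a", "sp_a", "sp_a"]
print(top1_accuracy(preds, truth))         # 1/3
print(mean_reciprocal_rank(preds, truth))  # (1 + 1/2 + 1/2) / 3 = 2/3
```

Mean reciprocal rank rewards systems that place the correct species near the top of the list even when it is not ranked first, which matters for a tool meant to shortlist candidate herbarium sheets for an expert.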
It was shown in previous editions of BirdCLEF [35, 36] that systems for identifying birds in mono-directional recordings now perform very well, and several mobile applications implementing this capability are emerging today. However, there is also interest in identifying birds in omnidirectional or binaural recordings. This would enable more passive monitoring scenarios, such as networks of static recorders that continuously capture the surrounding sound environment. The advantage of this type of approach is that it introduces less sampling bias than the opportunistic observations of citizen scientists. However, recognizing birds in such content is much more difficult because of the high vocal activity with signal overlap (e.g. during the dawn chorus) and the high levels of ambient noise.

Task Description: In 2020, two scenarios will be evaluated: (i) the recognition of all birds singing in a long sequence (up to one hour) of raw soundscapes that can contain tens of birds singing simultaneously, and (ii) chorus source separation in complex soundscapes that were recorded in stereo at a very high sampling rate (250 kHz). For the first scenario, participants will be asked to provide the time intervals of the recognized singing birds. Participants will be allowed to use any of the provided metadata complementary to the audio content (.wav format, 44.1 kHz, 48 kHz, or 96 kHz sampling rate). The task is focused on developing real-world applicable solutions and therefore requires participants to submit single models trained only on the mono-species recordings provided as training data. For the second task, on stereophonic recordings, the goal will be to determine the species singing in chorus simultaneously during a time interval. In contrast to the first task, participants are invited to run automatic source separation before, or jointly with, the bird species classification, taking advantage of the multi-channel recordings.
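A common first step for the soundscape scenario described above is to slice the hour-long recording into short fixed-length windows and compute a spectrogram per window before classification, so that detections can be reported as time intervals. A hedged sketch with SciPy (the 5-second window and STFT parameters are illustrative assumptions, not challenge requirements):

```python
import numpy as np
from scipy.signal import spectrogram

def soundscape_chunks(audio, sr, chunk_s=5.0):
    """Yield (start_time_s, log_spectrogram) pairs for fixed-length chunks.

    Each chunk would typically be fed to a classifier; a positive
    detection is then reported as the chunk's time interval.
    """
    step = int(chunk_s * sr)
    for start in range(0, len(audio) - step + 1, step):
        f, t, sxx = spectrogram(audio[start:start + step], fs=sr,
                                nperseg=1024, noverlap=512)
        # Log scaling makes quiet bird vocalizations more visible
        # against broadband ambient noise.
        yield start / sr, np.log1p(sxx)

# Toy example: 20 s of noise at 44.1 kHz yields four 5-second chunks.
sr = 44100
audio = np.random.randn(20 * sr)
chunks = list(soundscape_chunks(audio, sr))
print(len(chunks))  # 4
```

Real systems typically use overlapping windows and mel-scaled spectrograms, but the interval bookkeeping is the same.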
Participants will be allowed to use any other data in addition to the provided recordings, but will have to provide their scripts so that it can be verified that their solution is fully automatic. For both tasks, the evaluation measure will be the classification mean average precision (c-mAP) [39].

Motivation: Automatic prediction of the list of species most likely to be observed at a given location is useful for many scenarios related to biodiversity management and conservation. First, it could improve species identification tools (whether automatic, semi-automatic or based on traditional field guides) by reducing the list of candidate species observable at a given site. More generally, it could facilitate biodiversity inventories through the development of location-based recommendation services (e.g. on mobile phones), encourage the involvement of citizen scientist observers, and accelerate the annotation and validation of species observations to produce large, high-quality datasets. Last but not least, it could be used for educational purposes through biodiversity discovery applications with features such as contextualized educational pathways.

The challenge will rely on a collection of millions of occurrences of plants and animals in the US and France (primarily from GBIF [10], iNaturalist [13], Pl@ntNet [20] and a few expert collections). In addition to geo-coordinates and species name, each occurrence will be matched with a set of geographic images characterizing the local landscape and environment around the occurrence. In more detail, this will include: (i) high-resolution (about 1 m²/pixel) remotely sensed imagery (from NAIP [18] for the US and from IGN [2] for France), (ii) bio-climatic rasters (1 km²/pixel, from WorldClim [24]) and (iii) land cover rasters (30 m²/pixel from NLCD [17] for the US, 10 m²/pixel from CESBIO [3] for France). The occurrence dataset will be split into a training set with known species labels and a test set used for the evaluation.
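Matching an occurrence with the geographic images described above essentially amounts to cutting a window out of each raster around the occurrence's coordinates. A simplified NumPy sketch, assuming a plain equirectangular lat/lon grid (real challenge rasters would be read with a GIS library such as rasterio, and the function and variable names here are our own):

```python
import numpy as np

def extract_patch(raster, lat, lon, bounds, size=64):
    """Cut a size x size window centred on (lat, lon) from a 2-D raster.

    bounds = (lat_min, lat_max, lon_min, lon_max) of the raster extent;
    a regular lat/lon grid is assumed, which is a simplification.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    h, w = raster.shape
    row = int((lat_max - lat) / (lat_max - lat_min) * (h - 1))
    col = int((lon - lon_min) / (lon_max - lon_min) * (w - 1))
    half = size // 2
    # Clamp to the raster so occurrences near the edge still get a
    # full window rather than a truncated one.
    r0 = min(max(row - half, 0), h - size)
    c0 = min(max(col - half, 0), w - size)
    return raster[r0:r0 + size, c0:c0 + size]

# Toy bio-climatic raster covering a France-like extent.
rng = np.random.default_rng(0)
raster = rng.random((1000, 1200))
patch = extract_patch(raster, lat=46.5, lon=2.5,
                      bounds=(41.0, 51.5, -5.0, 10.0))
print(patch.shape)  # (64, 64)
```

One such patch per raster (imagery, climate, land cover) stacked together gives the multi-channel environmental tensor that a convolutional model can consume.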
For each occurrence (with geographic images) in the test set, the goal of the task will be to return a candidate set of species with associated confidence scores. The evaluation metrics will be the top-K accuracy (for different values of K) and a set-based prediction metric that will be specified later.

Motivation: Developing a robust system for identifying snake species from photographs is an important goal in biodiversity and global health. With venomous snakebite causing more than half a million deaths and disabilities annually, understanding the global distribution of the more than 3,700 species of snakes and differentiating species from images (particularly images of low quality) will significantly improve epidemiology data and treatment outcomes. The goals and usage of image-based snake identification are complementary to those of the other challenges: classifying snake species in images, predicting the list of species most likely to be observed at a given location, and eventually developing automated tools that can facilitate the integration of changing taxonomies and new discoveries.

Images of about 100 snake species from all around the world (between 300 and 150,000 images per species) will be aggregated from different data sources (including iNaturalist [13]). This will extend the dataset used in a previous challenge [22] hosted on the AIcrowd platform. The distribution of images across classes is highly imbalanced.

Task Description: Given the set of images and corresponding geographic location information, the goal of the task will be to return, for each image, a ranked list of species sorted according to the likelihood that they are in the image and might have been observed at that location.

All information about the timeline and participation in the challenges is provided on the LifeCLEF 2020 web pages [16]. The system used to run the challenges (registration, submission, leaderboard, etc.) is the AIcrowd platform [1].
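The SnakeCLEF ranking described above combines visual evidence with geographic plausibility. One simple way to realize such a fusion is to weight classifier confidences by a location-based prior; the product rule and all names below are our own assumptions for illustration, not the organizers' prescribed method:

```python
def rank_species(image_scores, location_prior):
    """Rank species by image confidence weighted by a geographic prior.

    image_scores: {species: classifier confidence for this image}
    location_prior: {species: probability of occurring at this location}
    A plain product rule; participants may of course fuse differently.
    """
    combined = {s: image_scores[s] * location_prior.get(s, 0.0)
                for s in image_scores}
    return sorted(combined, key=combined.get, reverse=True)

image_scores = {"naja_naja": 0.5, "boa_constrictor": 0.4, "vipera_berus": 0.1}
# Occurrence data say boa_constrictor is implausible at this location.
location_prior = {"naja_naja": 0.6, "boa_constrictor": 0.01,
                  "vipera_berus": 0.39}
print(rank_species(image_scores, location_prior))
# boa_constrictor is demoted below vipera_berus despite a higher image score
```

Exactly this interplay between image content and location is what distinguishes the task from plain image classification: a visually plausible but geographically impossible species should not top the list.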
The long-term societal impact of boosting research on biodiversity informatics is difficult to overstate. To fully reach its objective, an evaluation campaign such as LifeCLEF requires a long-term research effort so as to (i) encourage non-incremental contributions, (ii) measure consistent performance gaps, (iii) progressively scale up the problem, and (iv) enable the emergence of a strong community. The 2020 edition of the lab will support this vision and will include the following innovations:

- SnakeCLEF, a challenging new task related to the identification of snake species, will be introduced. It will be organized by the University of Geneva, Switzerland.
- The PlantCLEF task will focus on a brave new challenge: the identification of plant pictures based on a training set of digitized herbarium sheets (cross-domain image classification).
- The BirdCLEF challenge will be enriched with new annotated soundscape data and with challenging new high-sampling-rate stereophonic recordings (from the SEAMED, SMILES ANR, and SABIOD [21] projects).
- The GeoLifeCLEF challenge will be enriched with new plant and animal occurrences from two continents and high-resolution remotely sensed imagery.

The results of this challenge will be published in the proceedings of the CLEF 2020 conference [5] and in the CEUR-WS workshop proceedings [4].

Acknowledgements: This work is supported in part by the SEAMED PACA project, the SMILES project (ANR-18-CE40-0014), and an NSF Graduate Research Fellowship (DGE-1745301).

References:
- Carte d'occupation des sols
- Encyclopedia of Life
- National Agricultural Imagery Program
- Snake Species Identification Challenge
- Plant identification: experts vs. machines in the era of deep learning
- Sensor network for the monitoring of ecosystem: bird species recognition
- Automated species identification: why not?
- Plant identification using deep neural networks via optimization of transfer learning parameters
- Proceedings of the 1st Workshop on Machine Learning for Bioacoustics (ICML4B), ICML
- Proceedings of the Neural Information Processing Scaled for Bioacoustics, from Neurons to Big Data, NIPS International Conference
- Plant identification based on noisy web data: the amazing performance of deep learning
- The ImageCLEF 2013 plant identification task. In: CLEF, Valencia
- The ImageCLEF 2011 plant images classification task
- ImageCLEF 2012 plant images identification task
- Overview of BirdCLEF 2018: monophone vs. soundscape bird identification
- LifeCLEF bird identification task
- The ImageCLEF plant identification task
- Interactive plant identification based on social image data
- Overview of LifeCLEF 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI
- Overview of BirdCLEF 2019: large-scale bird recognition in soundscapes
- Contour matching for a fish recognition and migration-monitoring system
- Multi-organ plant classification based on convolutional and recurrent neural networks
- A toolbox for animal call recognition
- Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models
- The iNaturalist species classification and detection dataset
- Machine learning for image-based species identification
- Automated plant species identification: trends and future directions
- Automated identification of animal species in camera trap images