key: cord-024865-umrlsbh5
authors: Jiang, Shufan; Angarita, Rafael; Chiky, Raja; Cormier, Stéphane; Rousseaux, Francis
title: Towards the Integration of Agricultural Data from Heterogeneous Sources: Perspectives for the French Agricultural Context Using Semantic Technologies
date: 2020-04-29
journal: Advanced Information Systems Engineering Workshops
DOI: 10.1007/978-3-030-49165-9_8
sha:
doc_id: 24865
cord_uid: umrlsbh5
Sustainable agriculture is crucial to society since it aims at supporting the world’s current food needs without compromising future generations. Recent developments in Smart Agriculture and the Internet of Things have made it possible to collect unprecedented amounts of agricultural data with the goal of making agricultural processes better and more efficient, thus supporting sustainable agriculture. These data, coming from different types of IoT devices, can also be combined with relevant information published in online social networks and on the Web in the form of textual documents. Our objective is to integrate such heterogeneous data into knowledge bases that can support farmers in their activities, and to present global, real-time and comprehensive information to researchers. Semantic technologies and linked data offer a way to integrate data and to extract information automatically. This paper gives a brief review of current Semantic Web applications to agricultural corpora, then discusses the limits and potential of constructing and maintaining existing ontologies in the agricultural domain.
Recent advances in Information and Communication Technology (ICT) aim at tackling some of the most important challenges facing agriculture today [5]. Supporting the world's current food needs without compromising future generations through sustainable agriculture is a major challenge.
Indeed, among the topics surrounding sustainable agriculture, how to reduce the use and impact of pesticides without sacrificing the quantity or quality of the yield needed to feed a growing population is increasingly important [6]. Researchers have applied a wide range of technologies to tackle specific goals. These include: climate prediction in agriculture using simulation models [7]; making the production of certain types of grains more efficient and effective with computer vision and Artificial Intelligence [11]; soil assessment with drones [14]; and the IoT paradigm, in which connected devices such as sensors capture real-time data at the field level that, combined with Cloud Computing, can be used to monitor agricultural components such as soil, plants, animals, weather and other environmental conditions [16]. The use of such ICTs to improve farming processes is known as smart farming [18]. In the context of smart farming, IoT devices are both data producers and data consumers, and they produce highly structured data; however, these devices and the technologies presented above are far from being the only data sources. Important information related to agriculture can also come from other sources, such as official periodic reports and journals like the French Plant Health Bulletins (BSV, from the French name Bulletin de Santé du Végétal), social media such as Twitter, and farmers' experiences. The goal of the BSV is to: i) present a report on crop health, including stages of development, observations of pests and diseases, and the presence of related symptoms; and ii) provide an evaluation of the phytosanitary risk, according to the periods of crop sensitivity and the pest and disease thresholds. The BSV and other formal reports are semi-structured data.
In the agricultural context, Twitter, or any other social media platform, can be used for knowledge exchange about sustainable soil management [10]; it can also help the public understand agricultural issues and support risk and crisis communication in agriculture [1]. Farmer experiences (also known as old farming practices or ancestral knowledge) may be collected through interviews and participatory processes. Social media posts and farmer experiences are unstructured data. Figure 1 illustrates what this heterogeneous data coming from different sources may look like to farmers: information is not always explicit or timely. Our objective is to integrate such heterogeneous data into knowledge bases that can support farmers in their activities, and to present global, real-time and comprehensive information to researchers and interested parties. We present related work in Sect. 2, our initial approach in Sect. 3, and conclusions and perspectives in Sect. 4.
We classify existing works into two categories: information access and management in the plant health domain, and data integration in agriculture. In the first category, semantic annotation of the BSV focuses on extracting information from the traditional bulletins. Indeed, for more than 50 years, printed plant health bulletins have been distributed by region and by crop in France, giving information about the arrival and evolution of pests, pathogens and weeds, along with advice on preventive actions. These bulletins serve not only as agricultural alerts for farmers but also as documentation for those who want to study the historical data. The French National Institute for Agricultural Research (INRA) has been working towards publishing the bulletins as Linked Open Data [12], where BSV from different regions are centralized, tagged with crop type, region and date, and published on the Internet.
To organize the bulletins by crop usage in France, an ontology with 272 concepts was manually constructed. As the volume of concepts and relations grows, manual construction of ontologies will become too expensive [3]. Thus, ontology learning methods that automatically extract concepts and relationships should be studied. INRA has also introduced a method to model an ontology for crop observation [13]. The process is the following: 1) collect competency questions from researchers in agronomy; 2) construct the ontology corresponding to the requirements expressed in the competency questions; 3) ask semantic experts who did not participate in the design of the ontology to translate the competency questions into SPARQL queries in order to validate the ontology design. In this exercise, a model describing the appearance of pests was given but not instantiated; nevertheless, it could serve as a reference for the design of our future crop-pest ontology. Finally, Pest observer (http://www.pestobserver.eu/) is a web portal [15] that enables users to explore BSV with a combination of the following filters: crop, disease and pest; however, crop-pest relationships are not included. It relies on text-mining techniques to index BSV documents.
Regarding data integration in agriculture, AGRIS, the International System for Agricultural Science and Technology, states that many initiatives are being developed to return more meaningful data to users [4]. Some of these initiatives are: extracting keywords by crawling the Web to build the AGROVOC vocabulary, which covers all areas of interest of the Food and Agriculture Organization of the United Nations; and SemaGrow [9], an open-source infrastructure for linked open data (LOD) integration that federates SPARQL endpoints from different providers.
To extract pest- and insecticide-related relations, SemaGrow uses the Computer-aided Ontology Development Architecture (CODA) for RDF triplification of the results produced by the Unstructured Information Management Architecture (UIMA) when analysing unstructured content. Although INRA kick-started the categorization of the French crop bulletins using linked open data, and the SemaGrow project shed light on heterogeneous data integration using ontologies, both projects focused on processing formal and technical documents. Moreover, in the CODA application case, the IsPestOf rule was defined but not instantiated. Therefore, a global knowledge base that covers the crops, the natural hazards (including pests, diseases and climate variations) and the relations between them is still missing. There is also a growing need for a comprehensive and automatic approach to integrate knowledge from a wider variety of heterogeneous sources.
- Linguistic preprocessing: Unstructured and semi-structured textual data are passed through a linguistic preprocessing pipeline (sentence segmentation, tokenization, Part-of-Speech (POS) tagging, lemmatization) built with existing natural language processing (NLP) tools such as Stanford NLP (https://nlp.stanford.edu/), GATE (https://gate.ac.uk/) and UIMA (https://uima.apache.org/).
- Terms/concept detection: To the best of our knowledge, and according to our study of the state of the art, there is no ontology in French that models natural hazards and their relations with crops. Existing French thesauri such as French Crop Usage and AGROVOC can be used to filter collected data and serve as gazetteers. Linguistic rules represented by regular expressions can be used to extract temporal data. Recurrent neural networks (RNN), conditional random field (CRF) models and bidirectional long short-term memory (BiLSTM) networks have been applied to health-related named entity recognition on Twitter messages, with remarkable results [2].
Once the ontology is populated, it could provide knowledge and constraints for the extraction of terms [17].
- Relation detection: As with term/concept detection, there is initially no ontology. A basic strategy could be to use self-supervised methods such as Modified Open Information Extraction (MOIE): i) use WordNet-based semantic similarity and frequency distribution to identify related terms among the terms detected in the previous step; ii) slice the textual patterns between related terms [8]. Once the ontology is populated, it could contribute to calculating semantic similarities between detected terms in phase i).
New digital technologies allow farmers to predict the yield of their fields, to optimize their resources and to protect their fields from natural hazards, whether these are due to the weather, pests or diseases. This is a recent area where research is constantly evolving. In this paper we have introduced work relevant to our problem, namely the integration of several data sources to extract information related to natural hazards in agriculture. We then proposed an architecture based on ontology learning and ontology-based information extraction. In a first phase, we plan to build an ontology from Twitter data that contains vocabulary from the existing thesauri. To evaluate the constructed ontology, we will extract crops and pests from the learnt ontology and compare them with the tags in Pest observer. In the following iterations, we will work on ontology alignment strategies to update the ontology with data from other sources. To go further, multilingual ontology management that preserves spatio-temporal contexts should be investigated.
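As an illustration of the term/concept detection and relation detection steps of the proposed approach, the following minimal Python sketch shows gazetteer lookup, a regular-expression rule for temporal data, and the slicing of textual patterns between related terms. It is not the authors' implementation. Everything named here is an assumption made for illustration: the gazetteer entries, the dd/mm/yyyy date rule, and the function names `detect_terms`, `extract_dates` and `slice_patterns`. In particular, the WordNet-based similarity and frequency step of MOIE is replaced by a much simpler heuristic (a Pest term and a Crop term occurring in the same sentence), and no lemmatization is performed.

```python
import re

# Hypothetical gazetteer: a real system would load its entries from
# thesauri such as French Crop Usage or AGROVOC.
GAZETTEER = {
    "colza": "Crop",
    "orge": "Crop",
    "puceron": "Pest",
    "limace": "Pest",
}

# Illustrative linguistic rule for temporal data: dates written dd/mm/yyyy.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
TOKEN_RE = re.compile(r"\w+")


def detect_terms(sentence):
    """Return (start, end, surface form, label) for each gazetteer hit."""
    hits = []
    for match in TOKEN_RE.finditer(sentence):
        # Exact lowercase lookup; no lemmatization is applied here.
        label = GAZETTEER.get(match.group().lower())
        if label is not None:
            hits.append((match.start(), match.end(), match.group(), label))
    return hits


def extract_dates(sentence):
    """Return every substring matching the illustrative date rule."""
    return DATE_RE.findall(sentence)


def slice_patterns(sentence):
    """Return (left term, pattern, right term) for Pest/Crop pairs.

    The pattern is the text found between the two terms. Pairing any
    Pest with any Crop in the same sentence is a simplifying assumption
    that stands in for similarity-based selection of related terms.
    """
    hits = detect_terms(sentence)
    triples = []
    for i, left in enumerate(hits):
        for right in hits[i + 1:]:
            if {left[3], right[3]} == {"Pest", "Crop"}:
                pattern = sentence[left[1]:right[0]].strip()
                triples.append((left[2], pattern, right[2]))
    return triples


if __name__ == "__main__":
    example = "Attaque de puceron sur colza le 12/04/2020."
    print(detect_terms(example))
    print(extract_dates(example))
    print(slice_patterns(example))
```

Because the lookup is an exact match on lowercased tokens, an inflected form such as a plural would be missed; in the proposed approach this is what the lemmatization stage of the linguistic preprocessing pipeline would address before gazetteer lookup.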
[1] A little birdie told me about agriculture: best practices and future uses of Twitter in agricultural communications
[2] Ontology-based healthcare named entity recognition from Twitter messages using a recurrent neural network approach
[3] A crop-pest ontology for extension publications
[4] Discovering, indexing and interlinking information resources. F1000Research
[5] Information technology: the global key to precision agriculture and sustainability
[6] Appel à projets - durabilité des systèmes de productions agricoles alternatifs
[7] Advances in application of climate prediction in agriculture
[8] Automatic relationship extraction from agricultural text for ontology construction
[9] Designing innovative linked open data and semantic technologies in agro-environmental modelling
[10] The use of Twitter for knowledge exchange on sustainable soil management
[11] Computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review
[12] A methodology for the publication of agricultural alert bulletins as LOD
[13] Annotation sémantique pour une interrogation experte des Bulletins de Santé du Végétal
[14] Towards smart farming and sustainable agriculture with drones
[15] Open data platform for knowledge access in plant health domain: VESPA mining
[16] Internet of things (IoT) and cloud computing for agriculture: an overview
[17] Ontology-based information extraction: an introduction and a survey of current approaches
[18] Big data in smart farming - a review