key: cord-0625763-e947s6n5 authors: Olóndriz, David Amat; Puigdevall, Ponç Palau; Palau, Adrià Salvador title: FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions date: 2021-10-05 journal: nan DOI: nan sha: 1c5c023137de6097e0c73d406728e98439d3ae85 doc_id: 625763 cord_uid: e947s6n5

In this paper we introduce the Food Drinks and groceries Images Multi Lingual (FooDI-ML) dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names, descriptions and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprises 33 languages, including 870K samples in languages of countries from Eastern Europe and Western Asia, such as Ukrainian and Kazakh, which have so far been underrepresented in publicly available visio-linguistic datasets. The dataset also includes widely spoken languages such as Spanish and English. To assist further research, we include a benchmark over the text-image retrieval task using ADAPT, an existing state-of-the-art (SotA) technique.

The COVID-19 pandemic has accelerated the digitalisation of restaurants and the growth of the food delivery sector. National lockdowns have made it impossible to go to bars and restaurants, which has prompted many people to discover the possibility of ordering food and drinks online. Therefore, solving tasks such as text-image retrieval for food and drinks search engines has become increasingly important. The lack of large-scale multilingual datasets covering this domain [14] means that it is hard to build efficient search engines and recommender systems, especially for under-represented languages [23]. In this paper, we aim to help bridge this gap by offering a large multi-language dataset that covers several such languages, in addition to some common languages such as Spanish, French and English: the Food Drinks and groceries Images Multi Lingual dataset, FooDI-ML (github.com/Glovo/foodi-ml-dataset). FooDI-ML features data collected by a food and groceries delivery app: Glovo (https://glovoapp.com/).

In recent years there has been increased attention to visio-linguistic datasets. Besides classical vision tasks such as image classification and segmentation, image and text datasets can be used to learn multi-modal tasks, for example image captioning, text-to-image generation and text-image retrieval. The size of publicly available image and text datasets has grown steadily, from the flickr30K dataset [33], including 30K English-language examples (from now on, K stands for "thousand" and M for "million"), to the WIT dataset, a multi-language dataset extracted from Wikipedia that includes over 11M unique examples. To our knowledge, the dataset presented in this paper is the second largest multilingual visio-linguistic dataset available to date.

The structure of this paper is as follows: in Section 2 we review prior work and place our dataset in a broader research context. In Section 3 we describe the dataset in detail, including notable examples and proposed tasks. Section 4 presents a benchmark over the text-image retrieval task. Finally, Section 5 features the conclusion and a discussion on future work. Our main contributions are the following:
• We present the FooDI-ML dataset: over 2.8M food, drinks and grocery images together with over 9.5M natural language descriptions, store names, product names, and the collection sections in which they are included. It is the second largest multilingual dataset of this kind publicly available.
• We provide a dataset covering a rarely addressed domain: multi-language samples for food, drinks and groceries.
• The dataset covers languages that, despite having a large number of speakers, are underrepresented in training datasets. Therefore, it has the potential to reduce bias in food, drinks and grocery based search engines.
• We compute a train/test/validation split and two benchmark tasks to facilitate future research.

2 Prior work

Datasets including images coupled to natural language data are often referred to as visio-linguistic datasets [25]. The two pioneering datasets in the field were published just a few months apart from each other: the flickr30K and COCO datasets [33, 13]. Both datasets were collected by asking humans to label tens of thousands of images manually. This means that these datasets, albeit seminal, do not include an amount of data comparable to vision-only or text-only datasets, whose sample numbers routinely reach the order of 100M or more [27, 13, 26, 32]. In addition to their relatively smaller size, these initial datasets included exclusively English-language examples. This shortcoming spurred an effort to extend the work to other languages. Soon, new datasets, together with relabeled versions of the same datasets, appeared in several languages, amongst them German [6], Japanese [16], Chinese [12], Dutch [28], Italian [2] and Vietnamese [11]. In addition to this, crowd-sourced translations of the COCO and flickr30K datasets are relatively frequent and can be found in non-peer-reviewed github repositories and other sources (see, for example, the Spanish translation available on Kaggle: https://www.kaggle.com/colmejano/mscocoes-spanish-coco-captions).

The problem of dataset size and language coverage was not solved until recently, when two approaches (and corresponding datasets) were introduced, both of them based on web crawling: the Conceptual Captions dataset [24] and the Wikipedia-based Image Text Dataset [25]. Google's Conceptual Captions (CC) dataset includes 3.3M English examples obtained by web crawling of images and alt-text annotations. The samples in the dataset are then cleaned using a complex funnel of rules [24]. Another dataset from Google, the Open Images Dataset [10], provides 9M images annotated with 675K localized narratives. The Wikipedia-based Image Text Dataset (WIT) managed to avoid complex cleaning rules by relying on Wikipedia as a highly curated data source. In doing so, it managed to gather a larger amount of high-quality data. This provided the largest publicly available dataset to date, including more than 11M unique examples in more than 100 different languages [25].

The increase in size and language coverage achieved by WIT and CC was especially aimed at obtaining large pretrained cross-modal networks. This, however, did not solve the problem of specificity. The lack of domain-specific datasets made it hard to train high-performing cross-modal networks for some applications. In our dataset's domain - food and drink images and descriptions - this was identified as a problem for the task of image recipe retrieval [22]. In fact, the largest dataset available until recently contained only 101K samples pertaining to Chinese food recipes [4].
Other food datasets, such as the ISIA Food-500 dataset, contain up to 400K samples, but their visio-linguistic value is limited, as each image is related to a class and a food name rather than to a natural language description [15]. The FoodX-251 dataset (N=118K) [9], ChineseFoodNet (180K) [5], Food101 (101K) [3] and MAFood-121 (21K) [1] are other notable examples of datasets targeting food classification. Only ChineseFoodNet is multilingual, including both Chinese and English classes. To fill this void, Recipe1M+, a large dataset containing over 14M images and 1M English-language recipes, was made public [14]. Despite the number of text samples being relatively small, the large number of food images was an important step towards improving machine learning tasks in the food domain. However, this dataset covers only one language and focuses on food recipes instead of food descriptions and broader food categories. The latter are more likely to appear in marketplace applications, and therefore might be more useful in an industrial context. To compare our dataset with other existing datasets we follow the approach of the WIT paper, where the number of languages and the number of samples are the two main factors compared (see Table 1). Our dataset is the second largest available both in terms of languages and number of captioned samples.

FooDI-ML, or the Food Drinks and groceries Images Multi Lingual dataset, is a dataset of food, drinks and groceries images and descriptions collected from all partners operating in the Glovo app (https://glovoapp.com/) in the last six years. The dataset contains data from 37 countries, with a significant representation of 33 languages. Amongst them are some common languages such as French, English and Spanish (which is the most common language), but also some lower-resource languages such as Kazakh and Basque. Perhaps more interestingly, some widely spoken languages that are generally underrepresented in existing datasets, such as Ukrainian, are also present. The dataset's size places it as the second richest publicly available visio-linguistic dataset, featuring 2.8M images (amongst them 1.5M unique) and up to 9.5M unique text samples.

Each sample in the dataset corresponds to a different product that has been offered at some point in the Glovo app. There are 2.8M unique products (see Appendix B.2 for detailed statistics). Each product has up to five data points associated with it: the store name, the product name, the product description, the product image and the product "collection section". The latter is a meta category included in the Glovo app that restaurants and other sellers in the marketplace can use to organise their menu. Examples of collection sections are "drinks", "our pizzas", "desserts", etc. Note that collection sections are not standardised and can in general be chosen by the partner. See Fig. 1 for the store name, collection section, product name, product description and product image shown in the context of the Glovo application. The number of products (2.8M) is larger than the number of unique images, as repeated images are only counted once (although they can be repeated many times over the dataset). For details of how this data was collected, we refer to Sec. 3.1.
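To make the per-product structure concrete, the following sketch loads the annotations into a pandas DataFrame and checks the five fields described above. The file name and column names are illustrative assumptions rather than the exact released schema, which is documented in the repository (github.com/Glovo/foodi-ml-dataset).

```python
import pandas as pd

# Hypothetical file and column names -- the exact schema is documented in
# the FooDI-ML repository (github.com/Glovo/foodi-ml-dataset).
df = pd.read_csv("foodi_ml.csv")

FIELDS = [
    "country_code",         # country where the store operates
    "store_name",           # name of the restaurant or shop
    "collection_section",   # free-form menu section, e.g. "our pizzas"
    "product_name",         # name of the product
    "product_description",  # optional natural language description
    "image_path",           # location of the product image
]

# Sanity checks mirroring the statistics reported in the paper.
print("products:", len(df))
print("unique images:", df["image_path"].nunique())
print("description coverage:", df["product_description"].notna().mean())
```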
We collect data from all glovoapp partners. This data has been obtained and automatically saved in glovoapp's databases during the six years that Glovo has been in operation. Most of the data included in the dataset is generated organically. That is, glovoapp does not enforce strict compliance measures for the names of products, descriptions and collection sections included in the catalog. This means that the partners that offer their menus through the app can freely generate this information as long as it is not offensive and does not break glovoapp's content rules. The data featured in the dataset includes only those products present in the glovoapp that have an image associated with them. We decided not to include samples that contained no image information in order to have a fully multi-modal dataset. To maximise the usefulness of the dataset, we computed a hash of our samples (including an image hash) and deleted all identical samples. This reduced the dataset from its original size of 7.5M samples to 2.8M. The reason behind the large number of duplicated samples is the presence of large franchises in the markets where Glovo operates.

In this dataset we only include grocery partners that have food items in their catalog. This means that we exclude the majority of partners belonging to a new business vertical: q-commerce ("quick commerce"). These are mostly e-commerce partners with very large product selections and much less product variation across languages and countries. We intend to offer a curated version of this data in a future version of FooDI-ML. It is worth mentioning that all data presented here has been publicly available at some point through the Glovo app. This means that it could potentially have been collected through crawling techniques by third-party actors. Whether or not this has happened, this dataset has not been made public through other sources.

In order to maximise the usefulness of our dataset, we decided to perform minimal data post-processing. This is limited to (i) reducing the maximum size of the product images, and (ii) processing the store names in the dataset so that they are more useful for ML practitioners. The images present in FooDI-ML are scaled so that the largest side of the image is always equal to or smaller than 512 pixels while maintaining the aspect ratio. No other transformation is applied. We also perform cleaning and processing of our store names. The original dataset had a large number of stores that were used by glovoapp agents to back up store information after deactivation and to perform other tasks pertaining to the business' operations. This means that in these cases the store name field does not add additional information (as it normally contains dummy information such as "test" or "backup"). It also means that hierarchical information obtained from these stores is not always useful (for example, if one wants to understand the relation between the menu and the store name). In order to address this, we rename these store names with a generated string: "AS_N", which stands for "Auxiliary Store number N", where N is a positive integer. We also add a column in the dataset, "HIER" (for hierarchical), which is marked as True for those stores that have not been processed.

Our dataset includes samples belonging to up to 37 different countries in Europe, the Middle East, Africa and Latin America. Some languages such as Spanish (most of Latin America), English (in Nigeria, Ghana and Uganda), Russian (in several countries in Eastern Europe) and Portuguese (Brazil, Portugal) are prevalent due to their broad geographical presence. Others, like Ukrainian, Georgian and Italian, are present due to the big market share that those countries represent in the Glovo app.
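A minimal sketch of two of the processing steps described above: downscaling product images so that the longest side is at most 512 pixels, and removing exact duplicates by hashing each sample's text fields together with a hash of its image. The column names are the same illustrative ones used earlier; the released dataset already ships with these steps applied.

```python
import hashlib
import pandas as pd
from PIL import Image

MAX_SIDE = 512  # longest image side in the released dataset

def downscale(path_in: str, path_out: str) -> None:
    """Resize an image so its longest side is at most 512 px, keeping the aspect ratio."""
    img = Image.open(path_in)
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # shrinks in place, never enlarges
    img.save(path_out)

def sample_hash(row: pd.Series) -> str:
    """Hash the text fields of a sample together with a hash of its image bytes."""
    with open(row["image_path"], "rb") as f:
        image_digest = hashlib.md5(f.read()).hexdigest()
    text = "|".join(
        str(row[c])
        for c in ["store_name", "collection_section", "product_name", "product_description"]
    )
    return hashlib.md5((text + image_digest).encode("utf-8")).hexdigest()

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the first occurrence of each identical (text, image) sample."""
    return df.loc[~df.apply(sample_hash, axis=1).duplicated()]
```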
We obtain statistics on language using the fastText classifier [8], filtering out those languages that we know are not present in the country of operation. See details in Appendix B.1. Glovoapp does not enforce the presence of product descriptions in the menus of the partners that use the app. This results in only ∼34% of the samples (corresponding to ∼980K samples) including the product description field. In Fig. 3 we include a geographical overview of the quality of the dataset: which countries include a higher percentage of samples containing all fields. The fact that some samples do not include product descriptions should not discourage researchers. In most cases, the product name and collection section are sufficient to fully describe the product, and concatenating these two fields provides a satisfactory natural language description of the product (see the analysis in github.com/Glovo/foodi-ml-dataset/tree/main/notebooks).

FooDI-ML features samples from small restaurants and shops but also from large multinational food franchises. This means that the 2.8M samples are distributed amongst only 38K unique store names, of which 36K (the overwhelming majority) feature fewer than 150 samples. If we disregard store names marked as Auxiliary Stores, we see that 48% of all samples belong to store names featuring fewer than 150 samples, that is, local restaurants and shops. The remaining 52% belong to larger chains or to restaurants with very large, changing menus. This points towards a well-balanced dataset where the long tail of stores holds a significant percentage of the samples. Fig. 4 shows the stores with the most samples in descending order. As one would expect, grocery chains and large multinational brands top the list. This is due to the fact that grocery shops have much larger catalogs than restaurants, and that international brands may have the same product repeated many times over in different locations and different languages under the same store name.

We name "differential food samples" the groups of samples that essentially contain the same entity but with different ingredients. Samples of this kind coming from the same partner usually share image features such as illumination, background, object pose, etc. For example, in a burrito restaurant this can correspond to several pictures of the same burrito containing different combinations of ingredients (see Fig. 5). Such samples can be used to train saliency neural networks, ingredient generation algorithms, segmentation algorithms, etc. Differential food samples are common in our dataset, as most pizzerias, burger stores, burrito stores, salad places, etc. featuring descriptions include samples of this type. Although hard to compute precisely, we estimate that at least 11% of our samples that include a description belong to one of these food types and feature more than one sample per store (see github.com/Glovo/foodi-ml-dataset for the analysis). This produces tens of thousands of examples, enough to train an ML algorithm.

Another interesting subset of samples present in our dataset are grocery samples and samples from well-known fast-food companies. Many of these images are of high enough quality to be usable in optical character recognition tasks. Often, the text imprinted on the image also appears in the product description, allowing for some interesting applications such as solving saliency tasks (see Fig. 6).
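As noted above, when the product description is missing, concatenating the collection section and the product name already yields a usable natural language caption. A minimal sketch of this, reusing the illustrative column names from earlier:

```python
import pandas as pd

def build_caption(row: pd.Series) -> str:
    """Concatenate collection section, product name and, when present, the description."""
    parts = [row["collection_section"], row["product_name"], row["product_description"]]
    return ". ".join(p.strip() for p in parts if isinstance(p, str) and p.strip())

# df["caption"] = df.apply(build_caption, axis=1)
```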
One of the main contributions of this dataset is the multiplicity of languages that it covers. Especially useful is the fact that we offer product descriptions of similar images in different languages. For instance, one can easily find similar burgers from one of our partners in Russian, Spanish, Italian and English, with slightly different images and descriptions (see Fig. 7). It is hard to calculate the exact number of samples of this type. However, similar to differential food samples, a lower bound can be established from stores that are globally present, such as the big international chains shown in Fig. 4.

There are several ML tasks that can be performed on a visio-linguistic dataset. Amongst the most common are image captioning [29], text-to-image generation [19], text-image retrieval [30] and visual question answering [7]. The dataset that we present here is well-suited for all of them, and especially for the design of new food-specific tasks such as ingredient-based saliency and food image generation. From a practical standpoint, text-image retrieval and image-text retrieval are both very important in countries with underrepresented languages. In some developing countries, the main official language is not widely spoken, making it crucial to improve search engines in the minority languages.

Despite its large sample size and language coverage, FooDI-ML has some limitations.
• Some languages are still underrepresented due to a "lingua franca" effect. This happens both in developed and developing countries, for example in Spain, where Catalan and Basque have a much smaller presence than their proportion of speakers. In Nigeria, Ghana and Uganda, English is dominant while local languages such as Bantu and Hausa are not present in FooDI-ML. In summary, the dataset succeeds in including languages that are dominant in their countries (like Ukrainian) but underrepresented in public datasets, yet struggles to appropriately include minority languages. This can lead to the typical problems associated with the lack of training data: under-performance, bias and lack of generalisation.
• The issue of representation mentioned above is heavily influenced by market dynamics (which population groups are perceived to be more likely to use the app). This means that there is an additional danger of under-representing dialects perceived as representative of lower purchasing power populations.
• Glovoapp partners are given a high degree of freedom when choosing their images and text representations. This causes some problems: (i) a high proportion of the samples do not include product descriptions. In most cases this is not a problem, as the product name and collection section include enough information to obtain linguistic embeddings; however, in some cases, such as very well-known local dishes, it can be. (ii) There is no standardisation of product images, which means that in a minority of samples the same image is used to represent different products. We have observed this, for example, in some small pizzerias where an image with several pizzas is used to represent all pizza choices. (iii) Due to the very large amount of data and the freedom given to glovoapp partners and agents, it is possible that there are some low-quality tags and descriptions. This includes typos, informal words and language mixing.
• Although the dataset is well balanced between small and big volume partners, it is also true that fast food dominates the delivery sector. This means that pizzas, burgers, fries, etc. are over-represented in the dataset compared to the real local diet of the countries where glovoapp operates. Similarly, dishes that are harder to deliver, like cocktails, ice-cream cakes and steak, are less present than easily deliverable food like sushi. This can potentially cause the typical underperformance issue where fast food and easily deliverable food obtain better results in tasks such as image-text retrieval.
• Some global food types, such as pizza, burgers, sushi and salads, are present in all languages shown in Fig. 2. However, food types with a local footprint, such as "coca de recapte" (a traditional Catalan savoury pastry), are only present in local languages, in this case Catalan and Spanish. This can hinder the performance of some cross-language tasks.
• All images included in FooDI-ML are commercial images. At Glovo, stores are given the freedom to take and upload pictures independently, which leads to large image variability. That said, photos included in a commercial application will differ from images of food taken in a real setting. Therefore, models trained with this dataset might struggle to generalise to all kinds of food pictures. Expanding the dataset with other datasets, such as food datasets that rely on web crawling, might help to mitigate this issue.

We propose a train/test/validation split of 70/15/15, stratified across countries and across samples including product descriptions. We choose this stratification because the presence of a product description is a good indicator of a high-quality visio-linguistic sample, therefore ensuring that the evaluation and test metrics are representative and in line with the average quality of the samples. After this split, the train dataset contains 2M samples, and the test and validation splits contain 433K samples each. Table 2 shows the full statistics for the proposed split.

Text-image retrieval is a sub-task of cross-modal retrieval that consists of retrieving an image given a query text or set of captions. This retrieval is typically done by minimising a predefined distance between two vectors: one representing the image and one representing the text. Amongst other applications, text-image retrieval models are used to locate appropriate descriptions for a given image. In this work we use an existing SotA approach, ADAPT [31], to provide benchmark metrics for our dataset. ADAPT is designed to incorporate ideas from attention modules into multimodal embeddings. Amongst other dependencies, ADAPT uses Faster R-CNN [20] to generate image features (regions of interest). We slightly modify ADAPT in order to reduce its GPU memory footprint and make it trainable over our large dataset. Training details are available at https://github.com/Glovo/foodi-ml-dataset. We choose ADAPT for two reasons: (i) it is a high-performing, fast-to-train SotA technique (our dataset is large and thus resource-intensive), and (ii) it is designed to be used on general visio-linguistic datasets without complicated preprocessing steps. In line with previous work, we use retrieval performance (R@N) as the evaluation metric.
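R@N (recall at N) is the fraction of queries whose ground-truth match is ranked among the top N retrieved items. The sketch below computes it from a caption-to-image similarity matrix under the simplifying assumption of one matching image per caption, indexed on the diagonal; it illustrates the metric rather than reproducing ADAPT's exact evaluation code.

```python
import numpy as np

def recall_at_n(similarity: np.ndarray, n: int) -> float:
    """similarity[i, j] is the score between caption i and image j;
    the ground-truth image of caption i is assumed to be image j = i."""
    # Number of images scored strictly higher than the ground truth (0 = ranked first).
    ranks = (similarity > similarity.diagonal()[:, None]).sum(axis=1)
    return float((ranks < n).mean())

# Toy example with random scores for 1000 caption/image pairs.
sim = np.random.rand(1000, 1000)
print({f"R@{n}": recall_at_n(sim, n) for n in (1, 5, 10)})
```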
For Spain, the scores obtained are approximately one order of magnitude lower than those reported for COCO/Flickr30K. This mirrors results obtained with other large multi-modal datasets like WIT [25]. For the whole dataset, scores vary significantly depending on the task: t2i outperforms i2t. For retrieval (t2i), ADAPT performs significantly better than random but much worse than previously reported. This drop in performance is expected: ADAPT is heavily reliant on a variant of Faster R-CNN trained with images belonging to a similar domain as COCO/Flickr30K. Moreover, Faster R-CNN provides bounding box embeddings, which give information about the class and geometric disposition but not about object sub-components (i.e. ingredient features). The difference between Spain and Global is caused by the increase in complexity brought about by the presence of several languages.

We present a large multilingual visio-linguistic dataset covering 33 languages, 37 countries, over 2.8M images and 9.5M text samples: FooDI-ML. The dataset is obtained from the partners of a large food and grocery delivery company, Glovo. The data was collected over six years of its operations and has undergone minimal post-processing. Therefore, this is a domain-specific dataset covering mostly the case of food and drink products sold in restaurants. This dataset opens the door to several applications so far unavailable to the broader research community due to the lack of public datasets: for instance, multilingual image-based search engines for food and drink, refinement of existing pretrained models for the food and drink industry, and improvement of food image embeddings. Until now, the only comparable dataset in this domain was the Recipe1M+ dataset, which is limited to the English language. In addition to describing the dataset, we include an overview of notable samples and suggest a set of tasks that it can support. We also provide a train/test/validation split and a benchmark task.

As future work, we plan to release V2 of this dataset, including many more grocery products and also many products referring to marketplace products in general. We estimate that V2 will at least double the size of the dataset that we present here. We also plan to release some specialized tasks based on the food and drink domain.
References
[1] Regularized uncertainty-based multi-task learning model for food analysis
[2] Large scale datasets for image and video captioning in Italian
[3] Food-101: Mining discriminative components with random forests
[4] Deep-based ingredient recognition for cooking recipe retrieval
[5] ChineseFoodNet: A large-scale image dataset for Chinese food recognition
[6] Multimodal pivots for image caption translation
[7] Learning to reason: End-to-end module networks for visual question answering
[8] Bag of tricks for efficient text classification
[9] FoodX-251: A dataset for fine-grained food classification
[10] The Open Images Dataset V4
[11] UIT-ViIC: A dataset for the first evaluation on Vietnamese image captioning
[12] COCO-CN for cross-lingual image tagging, captioning, and retrieval
[13] Microsoft COCO: Common objects in context
[14] Recipe1M+: A dataset for learning cross-modal embeddings for cooking recipes and food images
[15] ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network
[16] Cross-lingual image caption generation
[17] World map / world atlas / atlas of the world including geography facts and flags
[18] Explore language knowledge in Europe
[19] Zero-shot text-to-image generation
[20] Faster R-CNN: Towards real-time object detection with region proposal networks
[21] Extract from an interview with Sanni Bundgaard, Stijn Aelbers, James Purcell, and Meena Bhandari. Translators without Borders
[22] Learning cross-modal embeddings for cooking recipes and food images
[23] Using deep learning for ranking in dish search
[24] Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
[25] WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning
[26] Revisiting unreasonable effectiveness of data in deep learning era
[27] YFCC100M: The new data in multimedia research
[28] DIDEC: The Dutch image description and eye-tracking corpus
[29] Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge
[30] CAMP: Cross-modal adaptive message passing for text-image retrieval
[31] Adaptive cross-modal embeddings for image-text alignment
[32] Billion-scale semi-supervised learning for image classification
[33] From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
[34] Complete list of country names

This work has been funded by glovoapp. There are no additional revenues related to this work. We thank all employees, agents, clients and partners of glovoapp for making this work possible. We would also like to thank the legal team, the central data science team and the infrastructure team in the company for being open to publishing this dataset. We also thank Maxim Khalilov from glovoapp for his useful comments.

FooDI-ML is open source and free to use under the Creative Commons BY-NC-SA license. Access to the dataset is available through: https://github.com/Glovo/foodi-ml-dataset.

We can think of two potentially negative impacts of our work:
• The under-representation of some minority languages and of languages perceived as belonging to a lower-income population can lead to lower task performance for the corresponding population sectors. This could lead to further disenfranchising of such population sectors (typically minorities). However, we would like to point out that this dataset makes available many under-represented languages and cuisine types that had previously been barred from existing datasets.
Therefore, the overall effect will be positive as long as the local languages of each country and their frequency are cross-referenced with the dataset to take this under-representation into consideration.
• The over-representation of fast food and easily deliverable food as compared with the typical local diet can lead to a higher performance for these food types. This could lead to an increase in the consumption of these food types, which can in turn lead to associated health issues.
The data included in the dataset is not of a personal nature and has been publicly available on the internet through our app at some point in time. Offensive content is forbidden by glovoapp's internal guidelines and removed when it appears. In general, due to the nature of the dataset (food, drinks and groceries), this has not been a significant concern.

B Appendix B: methods
The full code for the linguistic analysis is provided in https://github.com/Glovo/foodi-ml-dataset. In summary, the logic applied is the following:
1. We retrieve the languages spoken in each country where we operate from secondary sources of information (sources: [18, 21, 17]; information available at https://github.com/Glovo/foodi-ml-dataset).
2. For each text sample we obtain the three most probable languages using the fastText classifier [8].
3. We assign the first of these languages that belongs to the list of languages spoken in the country where the sample originates. If none does, we assign the most spoken language of that country.
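A minimal sketch of this assignment logic, assuming the off-the-shelf fastText language identification model (lid.176.bin). The country-to-language tables below are illustrative excerpts, not the full lists referenced above.

```python
import fasttext

# Pre-trained language identification model distributed by fastText.
model = fasttext.load_model("lid.176.bin")

# Illustrative excerpts of the country language tables described above.
SPOKEN_LANGUAGES = {"ES": {"es", "ca", "eu", "gl"}, "UA": {"uk", "ru"}, "KZ": {"kk", "ru"}}
MOST_SPOKEN = {"ES": "es", "UA": "uk", "KZ": "kk"}

def assign_language(text: str, country: str) -> str:
    """Return the most probable fastText language that is spoken in the country,
    falling back to the country's most spoken language."""
    labels, _ = model.predict(text.replace("\n", " "), k=3)
    for label in labels:  # labels look like "__label__es"
        lang = label.replace("__label__", "")
        if lang in SPOKEN_LANGUAGES.get(country, set()):
            return lang
    return MOST_SPOKEN[country]

print(assign_language("Hamburguesa con queso y bacon", "ES"))  # expected: "es"
```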