key: cord-0060317-4nd5npwt authors: Belcastro, Loris; Marozzo, Fabrizio; Talia, Domenico; Trunfio, Paolo title: Cloud Computing for Enabling Big Data Analysis date: 2021-03-23 journal: Cloud Computing and Services Science DOI: 10.1007/978-3-030-72369-9_4 sha: 1b1ae0c4efceb8f6be6b7cff56ae9e851187d7a6 doc_id: 60317 cord_uid: 4nd5npwt Every day billions of people access web sites, blogs, and social media. Often they use their mobile devices and produce huge amount of data that can be effectively exploited for extracting valuable information concerning human dynamics and behaviors. Such data, commonly referred as Big Data, contains rich information about user activities, interests, and behaviors, which makes it intrinsically suited to a very large set of applications. For getting valuable information and knowledge from such data in a reasonable time, novel scalable frameworks and data analysis techniques on Cloud systems have been developed. This paper aims at describing some recent Cloud-based frameworks and methodologies for Big Data processing that can be used for developing and executing several data analysis applications, including trajectory mining and sentiment analysis. The paper is organized in two main parts. The first part focuses on tools for developing and executing scalable data analysis applications on Clouds. The second part presents data analysis methodologies for extracting knowledge from large datasets. In the last years the ability to produce and gather data has increased exponentially. In the Internet of Things' era, huge amounts of digital data are generated by and collected from several sources, such as sensors, mobile devices, web applications and services. Moreover, with the large use of mobile devices, every day billions of people access web sites, blogs and social media producing a massive amount of digital data that can be effectively exploited to extract valuable information concerning many events, facts, human dynamics and behaviors. Such data, commonly referred as Big Data, contains valuable information about user activities, interests, and behaviors, which makes it intrinsically suited to a very large set of applications [6] . The huge amount of data generated, the speed at which it is produced, and its heterogeneity in terms of format, represent a challenge to the current storage, process and analysis capabilities. To extract value from such kind of data, novel distributed and Cloud-based frameworks and scalable data analysis techniques have been developed for capturing and analyzing complex and/or high velocity data. In this scenario, high performance computers, such as many and multi-core systems, Clouds, and multi-clusters, paired with parallel and distributed algorithms are commonly used by data analysts to tackle Big Data issues and get valuable information and knowledge in a reasonable time. In particular, Cloud computing systems provide large-scale computing infrastructures for complex high-performance applications, such as those that use advanced data analytics techniques for extracting useful information from large, complex datasets [29] . However, combining Big Data analytics techniques with scalable computing systems allows for obtaining new insights from data in a shorter time. This paper aims at presenting some recent Cloud-based frameworks and methodologies for Big Data processing that can be used for developing and executing several data analysis applications, including trajectory mining and sentiment analysis. 
The paper is organized in two main parts. The first part focuses on tools for developing and executing scalable data analysis applications on Clouds. The second part presents data analysis methodologies for extracting knowledge from large datasets. In particular, the paper is structured as follows. Section 2 presents the Data Mining Cloud Framework, designed for developing and executing distributed data analytics applications as workflows of services. In such an environment, datasets, analysis tools, data mining algorithms and knowledge models are implemented as single services that can be combined through a visual programming interface for designing distributed workflows to be executed on Clouds. Section 3 describes a high-level library for developing parallel data mining applications based on the extraction of useful knowledge from large datasets gathered from social media. The library aims at reducing the programming skills needed for implementing scalable social data analysis applications. Section 4 presents Nubytics, a Software-as-a-Service (SaaS) system that exploits Cloud facilities to provide efficient services for analyzing large datasets. The system allows users to import their data to the Cloud, extract knowledge models using high performance data mining services, and exploit the inferred knowledge to predict new data and behaviors. Section 5 describes SMA4TD, a methodology for discovering behavior and mobility patterns of users attending large-scale public events by analyzing social media posts. Section 6 presents a novel Region-of-Interest (RoI) mining technique that exploits the information contained in geotagged social media items (e.g., tweets, posts, photos or videos with geospatial information) to discover RoIs with high accuracy. Section 7 presents a methodology for discovering the polarization of social media users during election events characterized by the competition of political factions. The methodology uses an automatic incremental procedure based on neural networks for analyzing the posts published by social media users. Finally, Sect. 8 concludes the paper.

The Data Mining Cloud Framework (DMCF) [24] is a software system for designing and executing data analysis workflows on Clouds. DMCF supports a large variety of data analysis processes, including single-task applications, parameter sweeping applications [23], and workflow-based applications. A Web-based user interface allows users to compose their applications and submit them for execution to a Cloud platform, according to a Software-as-a-Service approach. The DMCF architecture has been designed to be implemented on different Cloud systems, so as to take advantage of the main Cloud computing features, such as elasticity of resource provisioning. In DMCF, at least one Virtual Web Server runs continuously in the Cloud, since it serves as the user front-end. In addition, users specify the minimum and maximum number of Virtual Compute Servers, which are in charge of executing the data mining tasks. DMCF can exploit auto-scaling features that allow dynamically spinning up or shutting down Virtual Compute Servers, based on the number of tasks ready for execution in the DMCF Task Queue. Since storage is managed by the Cloud platform, the number of storage servers is transparent to the user. Workflows may encompass all the steps of discovery based on the execution of complex algorithms and the access and analysis of scientific data.
In data-driven discovery processes, knowledge discovery workflows can produce results that confirm real experiments or provide insights that cannot be achieved in laboratories. In particular, DMCF allows programming workflow applications using two languages:

1. VL4Cloud (Visual Language for Cloud), a visual programming language that lets users develop applications by programming the workflow components graphically [26].
2. JS4Cloud (JavaScript for Cloud), a scripting language for programming data analysis workflows based on JavaScript [25].

Both languages use two key programming abstractions:

1. Data elements denote input files or storage elements (e.g., a dataset to be analyzed) or output files or storage elements (e.g., a data mining model).
2. Tool elements denote algorithms, software tools or services performing any kind of operation that can be applied to a data element (data mining, filtering, partitioning, etc.).

In particular, three different types of Tools can be used in a DMCF workflow:

1. A Batch Tool is used to execute an algorithm or a software tool on a Virtual Compute Server without user interaction. All input parameters are passed as command-line arguments.
2. A Web Service Tool is used to insert a Web service invocation into a workflow.
3. A MapReduce Tool is used to insert into a workflow the execution of a MapReduce algorithm or application running on a cluster of virtual servers [7].

For each Tool in a workflow, a Tool descriptor includes a reference to its executable, the required libraries, and the list of input and output parameters. Each parameter is characterized by name, description and type, and can be mandatory or optional. As an example, a MapReduce Tool descriptor is composed of two groups of parameters: generic parameters, which are used by the MapReduce runtime, and application parameters, which are associated with specific MapReduce applications. In the following, we list a few examples of generic parameters:

- mapreduce.job.reduces: the number of reduce tasks per job;
- mapreduce.job.maps: the number of map tasks per job;
- mapreduce.input.fileinputformat.split.minsize: the minimum size of the chunks that the map input should be split into.

Another key element is the task concept, which represents the unit of parallelism in our model. A task is a Tool included in a workflow that is intended to run in parallel with other tasks on a set of Cloud resources. According to this approach, VL4Cloud and JS4Cloud implement a data-driven task parallelism. This means that, as soon as a task does not depend on any other task in the same workflow, the runtime asynchronously spawns it to the first available virtual machine. A task T_j does not depend on a task T_i belonging to the same workflow (with i ≠ j) if T_j, during its execution, does not read any data element created by T_i. In VL4Cloud, workflows are directed acyclic graphs whose nodes represent Data and Tool elements. The nodes can be connected with each other through directed edges, establishing specific dependency relationships among them. When an edge is created between two nodes, a label is automatically attached to it to express the kind of relationship between the two nodes. Data and Tool nodes can be added to the workflow singularly or in array form. A data array is an ordered collection of input/output data elements, while a tool array represents multiple instances of the same tool.
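A minimal sketch of this data-driven rule is shown below (hypothetical Task objects with declared input and output data elements; it is not DMCF's actual runtime code): a task is ready to run as soon as none of its input data elements is still to be produced by another, not yet completed, task of the same workflow.

```java
import java.util.*;

// Hypothetical workflow task: the sets of data elements it reads and writes.
final class Task {
    final String name;
    final Set<String> inputs;   // data elements read by the task
    final Set<String> outputs;  // data elements created by the task
    boolean completed = false;

    Task(String name, Set<String> inputs, Set<String> outputs) {
        this.name = name;
        this.inputs = inputs;
        this.outputs = outputs;
    }
}

final class DataDrivenScheduler {
    // T_j depends on T_i iff T_j reads a data element created by T_i.
    static boolean dependsOn(Task tj, Task ti) {
        return tj.inputs.stream().anyMatch(ti.outputs::contains);
    }

    // A task is ready when every task it depends on has already completed.
    static List<Task> readyTasks(List<Task> workflow) {
        List<Task> ready = new ArrayList<>();
        for (Task tj : workflow) {
            if (tj.completed) continue;
            boolean blocked = workflow.stream()
                    .anyMatch(ti -> ti != tj && !ti.completed && dependsOn(tj, ti));
            if (!blocked) ready.add(tj);  // spawn to the first available virtual machine
        }
        return ready;
    }
}
```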
In its early versions, DMCF exploited the default storage provided by public Cloud infrastructures for all I/O operations. This implied that DMCF's I/O performance was limited by the performance of the storage provided by the Cloud provider. In [27] we discussed how to use the Hercules system within DMCF for storing the temporary data generated by workflow-based applications. Hercules is a highly scalable, in-memory, distributed storage system [18]. In a later work [22], we also used a data-aware scheduling runtime that exploits data locality during the execution of workflows. An experimental evaluation was carried out to assess the advantages of these strategies and to demonstrate the effectiveness of the solution. Using the proposed data-aware strategy and Hercules as a temporary storage service, I/O overhead was reduced by 55% compared to standard Azure storage-based execution, leading to a 20% reduction in the total execution time of the workflow. Figure 1 shows an example of a data analysis workflow developed using the visual workflow language of DMCF.

In JS4Cloud, workflows are defined by JavaScript code that interacts with Data and Tool elements through three functions:

1. Data Access, for accessing a Data element stored in the Cloud;
2. Data Definition, to define a new Data element that will be created at runtime as a result of a Tool execution;
3. Tool Execution, to invoke the execution of a Tool available in the Cloud.

Once the JS4Cloud workflow code has been submitted, an interpreter translates the workflow into a set of concurrent tasks by analyzing the dependencies in the code. The main benefits of JS4Cloud are:

1. It extends the well-known JavaScript language while using only its basic constructs (arrays, functions, loops);
2. It implements both data-driven task parallelism, which automatically spawns ready-to-run tasks to the Cloud resources, and data parallelism, through an array-based formalism;
3. These two types of parallelism are exploited implicitly, so that workflows can be programmed in a fully sequential way, which frees users from duties like work partitioning, synchronization and communication.

DMCF has been used to implement several Big Data analytics applications, including a workflow for discovering patterns and rules from trajectory data [2]. Figure 3 shows the VL4Cloud workflow that defines the steps of this application. An experimental evaluation was carried out on GPS datasets tracing the movement of taxis in the urban area of Beijing. The results showed that, due to the high complexity and large volumes of data involved in the application scenario, the trajectory pattern mining process takes advantage of the scalable execution environment offered by DMCF in terms of execution time, speed-up and scale-up. DMCF has also been used to implement a Cloud-based computing infrastructure for the analysis of SNP microarray data [1]. In particular, it was used to build a software tool (Cloud4SNP) for the parallel preprocessing and statistical analysis of pharmacogenomics SNP microarray data. The experimental evaluation showed efficient execution times and very good scalability. Moreover, the system implementation shows how the exploitation of a Cloud platform allows researchers and professionals to cope in an elastic way with the requirements of small as well as very large pharmacogenomics studies. DMCF also supports data classification workflows that include MapReduce computations.
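For such workflows, the generic parameters listed earlier for a MapReduce Tool descriptor (e.g., mapreduce.job.reduces) correspond to standard Hadoop job configuration keys. The sketch below shows how such a descriptor could be translated into a Hadoop job configuration (illustrative values; this is not DMCF code, and the application-specific part is only indicated by a comment):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapReduceToolLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Generic parameters taken from the MapReduce Tool descriptor (illustrative values).
        conf.setInt("mapreduce.job.reduces", 8);    // number of reduce tasks per job
        conf.setInt("mapreduce.job.maps", 16);      // number of map tasks per job
        conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                64L * 1024 * 1024);                 // minimum input split size (64 MB)

        Job job = Job.getInstance(conf, "mapreduce-tool");
        job.setJarByClass(MapReduceToolLauncher.class);
        // Application parameters (mapper/reducer classes, input/output paths, etc.)
        // would be set here from the application-specific part of the Tool descriptor.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```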
As an example, in [8] DMCF has been used to implement a MapReduce data analysis application for predicting flight delays. Every year approximately 20% of airline flights are delayed or canceled, mainly due to bad weather, carrier equipment issues, or technical airport problems. The goal of that application is to implement a predictor of the arrival delay of scheduled flights due to weather conditions. To run the workflow, we used a Hadoop cluster composed of 1 head node and 8 worker nodes, running on the Cloud servers used by the DMCF environment. With this setting, the turnaround time decreased from about 7 h using 2 workers to about 1.7 h using 8 workers, with a speedup that is very close to linear.

Several developers and researchers are working on the design and implementation of tools and algorithms for extracting useful information from data gathered from social media. In most cases the amount of data to be analyzed is so big that high-performance computers, such as many- and multi-core systems, Clouds, and multi-clusters, paired with parallel and distributed algorithms, are used by data analysts to reduce the response time to a reasonable value [9]. Several research projects address not only the data analysis task, but also the other data processing tasks needed for building social data applications. In particular, these projects aim at helping scientists to implement all the steps that compose social data mining applications without the need to implement common operations from scratch. ParSoDA (Parallel Social Data Analytics) [12] is a Java library that includes algorithms widely used to process and analyze data gathered from social media, with the goal of extracting different kinds of information (e.g., user mobility, user sentiments, topic trends and frequencies). ParSoDA defines a general structure for a social data analysis application that is formed by the following steps:

- Data acquisition: during this step, it is possible to run multiple crawlers in parallel; the collected social media items are stored on a distributed file system (HDFS [28]).
- Data filtering: this step filters the social media items according to a set of filtering functions.
- Data mapping: this step transforms the information contained in each social media item by applying a set of map functions.
- Data partitioning: during this step, data is partitioned into shards by a primary key and then sorted by a secondary key.
- Data reduction: this step aggregates all the data contained in a shard according to the provided reduce function.
- Data analysis: this step analyzes data using a given data analysis function to extract the knowledge of interest.
- Data visualization: in this final step, a visualization function is applied to the data analysis results to present them in the desired format.

For each of these steps, ParSoDA provides a predefined set of functions. For example, for the data acquisition step, ParSoDA provides crawling functions for gathering data from some of the most popular social media platforms (e.g., Twitter and Flickr), while for the data filtering step, ParSoDA provides functions for filtering geotagged items based on their position, time of publication, and contained keywords. Users are free to extend this set of functions with their own. Figure 4 presents the reference architecture and execution flow of a ParSoDA application that runs on the Hadoop [30] or Spark [31] framework.
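To give a concrete idea of the data filtering step, the sketch below shows the kind of filtering function that can be plugged into this structure: it keeps only the geotagged items that fall inside a bounding box, were posted within a time window, and contain a given keyword (the GeotaggedItem record is a hypothetical, simplified representation used for illustration; this is not ParSoDA's actual API).

```java
import java.time.Instant;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, simplified representation of a collected social media item.
record GeotaggedItem(double lat, double lon, Instant postedAt, Set<String> keywords) {}

final class GeoFilters {
    // Keep items inside a bounding box, posted in [from, to], and mentioning the given keyword.
    static Predicate<GeotaggedItem> inAreaTimeKeyword(
            double minLat, double maxLat, double minLon, double maxLon,
            Instant from, Instant to, String keyword) {
        return item ->
                item.lat() >= minLat && item.lat() <= maxLat &&
                item.lon() >= minLon && item.lon() <= maxLon &&
                !item.postedAt().isBefore(from) && !item.postedAt().isAfter(to) &&
                item.keywords().contains(keyword);
    }
}
```

In an application of this kind, such a predicate would be applied in parallel to the social media items stored on HDFS during the data filtering step.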
In this way, it is possible to implement several parallel and distributed data mining applications with high scalability. As shown in Fig. 4(a), user applications can utilize ParSoDA and other libraries (e.g., Mahout, MLlib). Applications can be executed on Hadoop or Spark, using YARN as resource manager and HDFS as distributed storage system. Figure 4(b) provides details on how applications are executed on a Hadoop or a Spark cluster. The cluster is formed by one or more master nodes and multiple worker nodes. Once a user application is submitted to the cluster, its steps are executed according to their order (i.e., data acquisition, data filtering, etc.). On a Hadoop cluster, some steps are inherently MapReduce-based, namely: data filtering, data mapping, data partitioning and data reduction. This means that all the functions used to perform these steps are executed within a MapReduce job that runs on a set of worker nodes. In particular, the data filtering and data mapping steps are wrapped within Hadoop Map tasks; the data partitioning step corresponds to the Hadoop Shuffle and Sort phases; the data reduction step is executed as a Hadoop Reduce task. The remaining steps (data acquisition, data analysis, and data visualization) are not necessarily MapReduce-based. This means that the functions associated with these steps could be executed in parallel on multiple worker nodes, or alternatively they could be executed locally by the master node(s). The latter case does not imply that execution is sequential, because a master node can make use of some other parallel runtime (e.g., MPI). On a Spark cluster, the main steps are executed in two Spark stages that run on a set of worker nodes. A stage is a set of independent tasks executing functions that do not require data shuffling (e.g., transformation and action functions). Specifically, data filtering and mapping are executed in the first stage (Stage 0), while data partitioning and reduction are executed in the second stage (Stage 1). Concerning the remaining steps (data acquisition, data analysis, and data visualization), the same considerations made for Hadoop apply to Spark.

Writing a parallel data analysis application from scratch usually requires deep programming skills and many lines of code. In fact, designing and implementing such applications poses a number of challenges to developers, such as parallelization of complex algorithms, reduction of communication costs, and optimization of memory usage. As demonstrated in [13], the use of ParSoDA leads to a drastic reduction of lines of code. In particular, ParSoDA allows programmers to save hundreds of lines of code in the application main (as the programmer needs to specify only the functions to be used and their parameters), in the data acquisition and data partitioning steps (where built-in functionalities are exploited), as well as in the data filtering, mapping, and reduction steps (where the programmer needs only to define the function logic). For the data analysis and visualization steps, we used the same code to invoke external libraries, which does not lead to a gain in terms of lines of code. However, for these steps, ParSoDA ensures many advantages in terms of usability. In fact, in the application main defined through ParSoDA, all the MapReduce jobs created for the different steps, such as the ones in the analysis and visualization steps, are automatically chained. This means that the output of a job is automatically used as input to the next step.
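For reference, the sketch below shows how the core filtering, mapping and reduction steps would look if wired by hand with the plain Spark Java API (hypothetical input paths and key extraction; this is not ParSoDA code): the filter and mapToPair transformations run within one stage, while reduceByKey introduces the shuffle that starts the next one.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ManualSocialDataJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("manual-social-data-job");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Data read from HDFS (one JSON-encoded social media item per line; hypothetical path).
            JavaRDD<String> items = sc.textFile("hdfs:///data/social_items");

            // Stage 0: data filtering and data mapping (no shuffle needed).
            JavaRDD<String> geotagged = items.filter(line -> line.contains("\"coordinates\""));
            JavaPairRDD<String, Integer> byUser =
                    geotagged.mapToPair(line -> new Tuple2<>(extractUserId(line), 1));

            // Stage 1: data partitioning and reduction (reduceByKey shuffles data by key).
            JavaPairRDD<String, Integer> postsPerUser = byUser.reduceByKey(Integer::sum);

            postsPerUser.saveAsTextFile("hdfs:///results/posts_per_user"); // hypothetical path
        }
    }

    // Hypothetical key extraction: a real application would parse the JSON item instead.
    private static String extractUserId(String jsonLine) {
        int start = jsonLine.indexOf("\"user_id\":");
        if (start < 0) return "unknown";
        int end = jsonLine.indexOf(',', start);
        if (end < 0) end = jsonLine.length();
        return jsonLine.substring(start + 10, end).trim();
    }
}
```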
In contrast, without ParSoDA, programmers need to manually control the execution flow among the different jobs. The scalability of ParSoDA has been evaluated by running the data analysis applications on a private Cloud infrastructure with 300 cores and 1.2 TB of RAM. In our experiments, the Spark version of ParSoDA was preferred since, as demonstrated in [10], it proved to be faster than the Hadoop version of the library.

Nubytics [16] is another system we developed to exploit Cloud facilities to provide scalable services for the analysis of very large datasets. The system allows users to import their data to the Cloud, extract knowledge models using high performance data mining services, and use the inferred knowledge to predict new data. Nubytics provides data classification and regression services that can be used in a variety of scientific and business applications. Scalability is ensured by a parallel computing approach that fully exploits the resources available on the Cloud. In addition, Nubytics is provided in accordance with the Software-as-a-Service (SaaS) model. This means that no installation is required on the user's machine: the Nubytics interface is accessed through a web browser, so it can be used from most devices, including desktop PCs, laptops, and tablets. This is a key feature for users who need ubiquitous and seamless access to scalable data analysis services, without needing to cope with the installation and system management issues of traditional analytics tools. Nubytics differs from general purpose data analysis frameworks like Azure ML, Hadoop and Spark, or data-oriented workflow management systems like ClowdFlows and DMCF, as it provides specialized services for data classification and prediction. These services are provided through a Web interface that allows designers and analysts to focus on the data analysis process without worrying about low-level programming details. This approach is similar to that adopted by BigML. However, Nubytics also focuses on scalability, by implementing an ad hoc parallel computing approach that fully exploits the distributed resources of a Cloud computing platform.

The Nubytics architecture includes storage and compute components. The storage components are:

- the Data Folder, which contains data sources and the results of data analysis, and the Tool Folder, which contains algorithms for data analysis and prediction;
- the Data Table, Tool Table and Task Table, which contain metadata information associated with data, tools, and tasks;
- the Task Queue, which contains the tasks to be executed.

The compute components are:

- Virtual Compute Servers, which execute the data analysis tasks;
- Virtual Web Servers, which host the system front end, i.e., the Nubytics web interface.

The architecture manages the submission and execution of data analysis tasks through the following steps:

1. Using the services provided by the front end, a user can configure and submit the desired data analysis task (e.g., training a classification model from a dataset).
2. Exploiting a data parallel approach, the system models the task as a set of parallel sub-tasks that are inserted into the Task Queue for processing.
3. Each idle Virtual Compute Server picks a sub-task from the Task Queue and concurrently starts its execution.
4. Each Virtual Compute Server gets the part of data assigned to it from the Data Folder where the original dataset is stored.
5. After sub-task completion, each Virtual Compute Server puts its result in the Data Folder.
6. The front end notifies the user as soon as the task has completed, and allows her/him to access the results.

The Nubytics front end is divided into three sections (Datasets, Tasks and Models), corresponding to the three groups of services provided by the system: dataset management, task management and model management. The datasets of a user are stored in a Cloud storage space associated with the user's account. The Datasets section provides several data management services, including: importing (uploading) a dataset from the user's terminal; exporting (downloading) a dataset to the user's terminal; listing and searching the available datasets; modifying the metadata of a dataset; and copying, deleting, or restoring a dataset. The Tasks section provides services for configuring, submitting and managing data analysis tasks. Two classes of tasks can be submitted: training tasks and prediction tasks. A training task takes as input a dataset and produces a classification or regression model from it. The goal of classification is to derive a model that classifies the instances of a dataset into one of several classes. Using a classification model, we can predict the membership of a new data instance to a given class from a set of predefined classes. The goal of regression is to build a model that associates a numerical value with the instances of a dataset. Therefore, a regression model can be used to forecast a quantitative value starting from the field values of a new data instance. The configuration of a training task is done by selecting the input dataset, the class field (which is categorical in case of classification and numerical in case of regression), and the predictive fields that must be considered for the analysis (they can be all the original dataset fields or a subset of them). A parallel computing approach is used to speed up the execution of training tasks. This is done using a data parallel approach that divides the original task into sub-tasks, assigns each sub-task to a different Virtual Compute Server on the Cloud, and joins the partial results computed by the multiple servers into a single model. A prediction task takes two input elements: a model generated by a training task, and a new dataset whose instances must be classified or predicted. As a result, the new dataset will include a new field containing the predicted class label (in case of classification) or numerical value (in case of regression) of each tuple. Also in this case, parallelism is exploited by performing the prediction task in parallel on multiple Cloud servers. Multiple tasks can be submitted to the system, and the user can monitor the status of each one through a task management interface, as shown by the screenshot in Fig. 5. For each task, the interface shows the task type (prediction or training), some information about its execution (start, end, and elapsed time), and the current status. Additional details on a task can be seen by selecting the corresponding row. For instance, the figure shows the Input Dataset and Output Model of the second task, which is a training task.

SMA4TD (Social Media Analysis for Trajectory Discovery) [17] is a methodology aimed at discovering behavior rules, correlations and mobility patterns of visitors attending large-scale events, through the analysis of a large number of social media posts. In particular, the main goals of the methodology are as follows.
- Discovery of the most visited places and most attended events: we analyze the collected data to discover the places that have been most visited by users, and the events that have been most attended by visitors during the observed period.
- Discovery of frequent sets of visited places and of attended events: we extract the sets of places that are most frequently visited together by users, and the sets of events that are most frequently attended together by visitors during the observed period.
- Discovery of mobility patterns and rules: we analyze the collected data to discover mobility behaviors among the places, and to extract useful knowledge (i.e., patterns, rules and regularities) about the attended events.
- Discovery of origin/destination flows: we study the mobility flows of people attending the events, evaluating which countries visitors came from and to which countries they moved after the events. In some cases, this information can give some insights about the touristic impact on the local territory.

The methodology is composed of seven steps: i) identification of the set of events; ii) identification of the places-of-interest where the events take place; iii) collection of the geotagged items related to the events and pre-processing; iv) identification of the users who published at least one of the geotagged items; v) pre-processing and creation of the input dataset; vi) data analysis and trajectory mining; and vii) results visualization.

The first two steps aim at defining the events E and the corresponding places-of-interest P. Specifically, during step 1, each event is described by the id of the place-of-interest (PoI) where it is located, the starting/ending time of the event, and other optional data (e.g., free/paid event, type of event, etc.). Step 2 is aimed at defining the geographical boundaries of the PoIs in P. This can be done in two ways: i) manually, by defining the boundaries of the PoIs (e.g., as polygons on a map); ii) automatically, using external services (e.g., cadastral maps [19]) or public web services providing the geographical boundaries of a place given its name (e.g., OpenStreetMap). The goal of step 3 is to collect all the geotagged items G posted during each event e_i ∈ E from the place p_i where e_i was held. Data collection is done by using the publicly available APIs provided by most social media platforms. The G dataset is pre-processed in order to clean, select and transform data to make it suitable for analysis. In particular, we first clean the collected data by removing all items with unreliable positions (e.g., items with coordinates that have been manually set by users or applications). Then, we proceed by selecting only the geotagged items posted by users who actually attended an event, by removing replies and favorites posted by other users. Finally, we transform the data by keeping one item per user per event, because we are only interested in knowing whether a user attended an event or not. The identification of users is the goal of step 4. This is done by extracting the set U of distinct users who published at least one geotagged item in G. Step 5 creates the input dataset D, whose records have the form <u_i, {e_i1, ..., e_ik}, optFields>, in which e_ij is the j-th event attended by user u_i, and optFields are optional descriptive fields (e.g., nationality, interests). After having built the input dataset D, it is analyzed for discovering behavior and mobility patterns of users attending the large-scale event under investigation. Specifically, we perform both associative and sequential analysis, as described in the following. Associative analysis is exploited with the goal of discovering, inside the data, the item values that occur together with high frequency.
The mechanisms of association allow identifying the conditions that tend to occur simultaneously, or the patterns that repeat under certain conditions. Applied to dataset D, we perform two associative mobility mining tasks: (i) frequent event sets discovery, aimed at extracting the sets of events (places) that are most frequently attended (visited) together by visitors during the whole observed large-scale event; and (ii) frequent event rules extraction, devoted to discovering frequent association rules among the events. On the other hand, sequential analysis algorithms are intended to discover the sequences of elements that occur most frequently in the data. Unlike associative analysis, sequential analysis relies on the time dimension and the chronological order in which the values appear in the data. In our case, this type of analysis is useful to discover the most frequent mobility patterns among the places, and/or the most frequent sequences of attended events. Moreover, if the observed period is extended to some days (or weeks) before/after the event time, we can also discover the origin/destination (i.e., country, city) of visitors, that is, which countries visitors came from and moved to after the event, which can be used to infer touristic insights. Finally, results visualization is performed through the creation of info-graphics aimed at presenting the results in a way that is easy to understand for the general public, without providing complex statistical details that may be hard for the intended audience to grasp. The graphic design is grounded on some of the most widely acknowledged principles underpinning a good info-graphic.

In this section we present the results obtained by analyzing geotagged posts of social media users attending the FIFA World Cup 2014 and EXPO 2015. FIFA World Cup 2014. During the FIFA World Cup 2014, we monitored the Twitter users attending the 64 matches played during the football competition and analyzed such data through the SMA4TD methodology to discover behaviors and frequent movements of fans [14]. In this case study, the places-of-interest are the stadiums in which the World Cup matches were played. The corresponding RoIs have been manually defined from a map as the smallest rectangles fully containing the boundaries of each stadium. For each match, we considered only the tweets posted from coordinates falling within the above defined RoIs during the matches. In total, about 526,000 tweets were collected from June 12 to July 13, 2014. We performed several analyses of user behavior. For example, we described how the number of people attending the matches changed over time. To do that, we report in Fig. 6 the trends and numbers (i) of the Twitter users we tracked attending the matches during the World Cup, and (ii) of the attendees officially published on the FIFA website. Specifically, Fig. 6 shows a time plot of the collected attendance data, in which the number of attendees is plotted for each match. It clearly shows that there are several peaks of participation during the competition, probably corresponding to matches that attracted more attention than others. Interestingly, in some cases the Twitter data peaks match the official attendance ones. We also studied the participation of fans in the matches. The results show that 71.3% of the fans attended a single match, 16% attended two matches, 6% attended three matches, and only 6.7% attended four or more matches.
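The frequent event set task mentioned above (task (i)) can be sketched as a simple support count over the dataset D. The example below counts co-attended event pairs and keeps those above a minimum support (toy data and a pair-only simplification for illustration; it is not the mining code used in the actual analysis [14]).

```java
import java.util.*;

final class FrequentEventPairs {
    // D maps each user to the set of events (matches) he/she attended.
    static Map<Set<String>, Integer> frequentPairs(Map<String, Set<String>> d, int minSupport) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Set<String> events : d.values()) {
            List<String> list = new ArrayList<>(events);
            for (int i = 0; i < list.size(); i++)
                for (int j = i + 1; j < list.size(); j++) {
                    Set<String> pair = new TreeSet<>(List.of(list.get(i), list.get(j)));
                    support.merge(pair, 1, Integer::sum);
                }
        }
        support.values().removeIf(count -> count < minSupport); // keep only frequent pairs
        return support;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> d = Map.of(
                "u1", Set.of("Colombia-Greece", "Colombia-Cote d'Ivoire"),
                "u2", Set.of("Colombia-Greece", "Colombia-Cote d'Ivoire", "Brazil-Mexico"),
                "u3", Set.of("Brazil-Mexico", "Croatia-Mexico"));
        // Prints {[Colombia-Cote d'Ivoire, Colombia-Greece]=2}
        System.out.println(frequentPairs(d, 2));
    }
}
```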
We also studied the most frequent paths of fans who attended two or three matches of the same team during the group stage. For example, the most frequent 2-match-set was {Colombia-Greece, Colombia-Cote d'Ivoire}, followed by {Brazil-Mexico, Croatia-Mexico} and by {Argentina-Bosnia, Argentina-Iran}, i.e., matches likely attended by fans of Colombia, Mexico and Argentina. Looking at their nationality, spectators were likely fans of Mexico, Brazil and Australia.

For the second case study, we monitored Instagram users who visited the EXPO 2015 pavilions, aiming at discovering mobility patterns inside the exhibition area, correlations among visits to pavilions, and the main origin/destination flows of visitors [15]. EXPO 2015 was a Universal Exposition held under the theme "Feeding the Planet, Energy for Life", which was hosted in Milan, Italy, from May 1st to October 31st, 2015. Exhibitors were individual countries, international organizations, civil society organizations and companies, for a total of 188 exhibition spaces. Some of the exhibitors were hosted inside individual (self-built) pavilions, while others were hosted inside shared pavilions. For the sake of uniformity, in this paper we use the term pavilion to indicate both an individual pavilion and a distinct area (assigned to a given exhibitor) of a shared pavilion. Cumulatively, about 22.2 million people visited the EXPO area and its pavilions during the six months of the exposition, making it the largest worldwide event of 2015. Visitors at EXPO used various social networks to share their experience with friends and followers. The set of events E considered in this scenario is composed of the showcases (each one organized by a country or organization/company) exhibited in the exposition spaces (generally referred to as pavilions in the following). Specifically, we consider E = {e_1, e_2, ..., e_188}, where each e_i is described by the properties <p_i, t_begin_i, t_end_i>, where p_i is the pavilion, t_begin_i is May 1st and t_end_i is October 31st. The places-of-interest to be considered are the pavilions. Specifically, we defined the PoI set P = {p_1, p_2, ..., p_188}, where each p_i is a pavilion that was used as an exhibition area during EXPO 2015. For each PoI, we drew its corresponding RoI as a rectangle bounding the pavilion area. Figure 7 shows a comparison between the trends and numbers of the Instagram visitors we tracked and the official visitor numbers published on the EXPO website. The observed period is May 1st - October 31st, but official numbers have been published only for the period starting in August, thus the corresponding curve has been traced only for the last three months. We used different scales for the Instagram visitor numbers and the EXPO visitor ones: on the right is the scale of the former, while on the left is the scale of the latter. In particular, Fig. 7(a) shows a time plot of the daily visits to EXPO. The trends are quite evident: initially (May and June) the visitors are relatively few; then, they grow significantly during the months of September and October. Moreover, there are several peaks of attendance, corresponding to visits that occurred during weekend days. By looking at the trends in the figure, a strong correlation (Pearson coefficient 0.7) can be noted between the official visitor numbers and those obtained from our analysis, which confirms the reliability of the results we obtained.
Figure 7(b) compares the Instagram and official visitor numbers, aggregated by day of the week (Pearson correlation 0.94). The results clearly show that during the weekend days there is a peak of visits, with the highest number of people registered on Saturdays.

Geotagged data gathered from social media can be used to discover interesting locations visited by users, called Places-of-Interest (PoIs). Since a PoI is generally identified by the geographical coordinates of a single point, it is hard to match it with user trajectories. Therefore, it is useful to define an area, called Region-of-Interest (RoI), to represent the boundaries of the PoI's area. RoI mining techniques are aimed at discovering Regions-of-Interest from PoIs and other data. G-RoI [11] is a novel RoI mining technique that exploits the information contained in geotagged social media items (e.g., tweets, posts, photos or videos with geospatial information) to discover the RoI of a PoI with high accuracy. Given a PoI p identified by a set of keywords, a geotagged item is associated with p if its text or tags contain at least one of those keywords. Starting from the coordinates of all the geotagged items associated with p, G-RoI calculates an initial convex polygon enclosing all such coordinates, and then iteratively reduces the area using a density-based criterion. Then, from all the convex polygons obtained at each reduction step, G-RoI adopts an area-variation criterion to choose the polygon representing the RoI for p. Let a PoI P be identified by one or more keywords K = {k_1, k_2, ...}. Let G_all be a set of geotagged items. Let G = {g_0, g_1, ...} be the subset of G_all obtained by applying a G-RoI preprocessing procedure that selects from G_all only the geotagged items associated with P, i.e., the text or tags of each g_i ∈ G contain at least one keyword in K. Let C = {c_0, c_1, ...} be a set of coordinates, where c_i represents the coordinates of g_i ∈ G. Thus, every c_i ∈ C represents the coordinates of a location from which a user created a geotagged item referring to P. Let cp_0 be a convex polygon enclosing all the coordinates in C, obtained by running the convex hull algorithm [3] on C and described by a set of vertices {v_0, v_1, ...}. To find the RoI R for P, the G-RoI algorithm uses two main procedures:

- G-RoI Reduction. Starting from cp_0, it iteratively reduces the area of the current convex polygon by deleting one of its vertices. A density-based criterion is adopted to choose the next vertex to be deleted. The density of a polygon is the ratio between the number of geotagged items enclosed by the polygon and its area. At each step, the procedure deletes the vertex that produces the polygon with the highest density among all the possible polygons. The procedure ends when it cannot further reduce the current polygon, and returns the set of convex polygons CP = {cp_0, ..., cp_n} obtained after the n steps performed.
- G-RoI Selection. It analyzes the set of convex polygons CP returned by the G-RoI reduction procedure, and selects the polygon representing the RoI R for PoI P. An area-variation criterion is adopted to choose R from CP. Given CP, the procedure identifies two subsets: a first subset {cp_0, ..., cp_cut-1} such that the area of any cp_i is significantly larger than the area of cp_i+1, and a second subset {cp_cut, ..., cp_n} such that the area of any cp_i is not significantly larger than the area of cp_i+1. The procedure returns cp_cut as the RoI R.
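The reduction procedure can be sketched as follows (a simplified, self-contained re-implementation for illustration, using the shoelace formula for the area and a ray-casting point-in-polygon test; it is not the actual G-RoI code of [11]).

```java
import java.util.*;

final class GRoiReduction {
    record Point(double x, double y) {}

    // Shoelace formula: area of a polygon given by its ordered vertices.
    static double area(List<Point> poly) {
        double a = 0;
        for (int i = 0; i < poly.size(); i++) {
            Point p = poly.get(i), q = poly.get((i + 1) % poly.size());
            a += p.x() * q.y() - q.x() * p.y();
        }
        return Math.abs(a) / 2;
    }

    // Ray-casting point-in-polygon test (sufficient for this sketch).
    static boolean contains(List<Point> poly, Point c) {
        boolean in = false;
        for (int i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
            Point a = poly.get(i), b = poly.get(j);
            if ((a.y() > c.y()) != (b.y() > c.y())
                    && c.x() < (b.x() - a.x()) * (c.y() - a.y()) / (b.y() - a.y()) + a.x())
                in = !in;
        }
        return in;
    }

    // Density of a polygon: number of enclosed geotagged items divided by its area.
    static double density(List<Point> poly, List<Point> items) {
        long inside = items.stream().filter(c -> contains(poly, c)).count();
        return inside / area(poly);
    }

    // Starting from the convex hull cp0, repeatedly delete the vertex whose removal yields
    // the densest polygon, until the polygon cannot be reduced further (a triangle is reached).
    // Returns the whole sequence CP = {cp0, ..., cpn}.
    static List<List<Point>> reduce(List<Point> cp0, List<Point> items) {
        List<List<Point>> cps = new ArrayList<>();
        List<Point> current = new ArrayList<>(cp0);
        cps.add(List.copyOf(current));
        while (current.size() > 3) {
            List<Point> best = null;
            double bestDensity = -1;
            for (int v = 0; v < current.size(); v++) {
                List<Point> candidate = new ArrayList<>(current);
                candidate.remove(v);
                double d = density(candidate, items);
                if (d > bestDensity) { bestDensity = d; best = candidate; }
            }
            current = best;
            cps.add(List.copyOf(current));
        }
        return cps;
    }
}
```

The companion selection procedure then examines the areas of the returned polygons to pick cp_cut, as explained below.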
This choice of cp_cut corresponds to identifying the corner point of a discrete L-curve [20] obtained by plotting the areas of all the convex polygons in CP on a Cartesian plane, as detailed later in this section. Without going into the algorithmic details, which can be found in [11], we briefly describe how the G-RoI reduction and selection procedures work through a real example, detecting the RoI of the Colosseum in Rome. In their posts and photos, social media users identify the Colosseum with different keywords, such as Coliseum, Coliseo, Colisée, and synonyms such as Flavian Amphitheatre or Amphitheatrum Flavium. All the geotagged items in our sample contain at least one of such keywords. From these posts, the 200 coordinates shown in Fig. 8(a) have been extracted. Given the coordinates, the G-RoI reduction procedure calculates the initial convex polygon cp_0 (shown in Fig. 8(b)), and then iteratively reduces the area. Figure 8(c) shows the polygon cp_1 obtained after the first step by deleting one of the vertices of cp_0. The G-RoI reduction procedure iterates until it cannot further reduce the current polygon. The output of the procedure is the set of convex polygons CP = {cp_0, cp_1, ..., cp_n} obtained at each step. The G-RoI selection procedure identifies the point p_cut that is located at the maximum distance (dist_max) from the reference line joining the first point and the last point under analysis (p_0 and p_n). If the set of points {p_cut, ..., p_n} follows a linear trend, i.e., there is no point below a threshold line at distance th from the reference line joining the points p_cut and p_n, then the procedure returns the polygon corresponding to p_cut as the RoI R (see Fig. 8(d)). Otherwise, the G-RoI selection procedure iterates by finding a new cut-off point from the set of points to the right of p_cut.

We experimentally evaluated the accuracy of G-RoI in detecting the RoIs associated with a set of PoIs. The analysis was carried out on 24 PoIs located in the center of Rome (St. Peter's Basilica, Colosseum, Circus Maximus, etc.) using about 1.2 million geotagged items published on Flickr from January 2006 to May 2016 in the areas under analysis. Specifically, we made several preliminary tests to find parameter values that perform effectively in that scenario, taking into account that the various PoIs are characterized by significant variability of shape, area and density (number of Flickr photos divided by area). In particular, the threshold th was set to 0.27. The experimental results showed that G-RoI is able to detect RoIs with high accuracy. Over the set of 24 PoIs in Rome, G-RoI achieved a mean precision of 0.78, a mean recall of 0.82, and a mean F1 score of 0.77.

IOM-NN (Iterative Opinion Mining using Neural Networks) [5] is a new methodology for estimating the polarization of public opinion on political events characterized by the competition of factions or parties. It can be considered an alternative to traditional opinion polls, since it is able to capture the opinion of a larger number of people more quickly and at a lower cost. In particular, IOM-NN uses an automatic incremental procedure based on feed-forward neural networks for analyzing the posts published by social media users. Starting from a limited set of classification rules, created from a small subset of hashtags that are notoriously in favor of specific factions, our methodology iteratively generates new classification rules. A classification rule determines whether a post is in favor of a faction based on the words/hashtags it contains.
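Conceptually, such a rule can be represented as a simple predicate over the words and hashtags of a post; the sketch below is purely illustrative (the names and the matching criterion are assumptions, not IOM-NN code).

```java
import java.util.Set;

// Illustrative classification rule: a post is considered in favor of the given faction
// when it contains at least minMatches of the rule's keywords/hashtags.
record ClassificationRule(String faction, Set<String> keywords, int minMatches) {
    boolean inFavorOf(Set<String> postTokens) {
        return keywords.stream().filter(postTokens::contains).count() >= minMatches;
    }
}
```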
Such rules are then used to determine the polarization of the social media users who wrote posts about the political event towards a faction. As shown in Fig. 9, the proposed methodology consists of three main steps: collection of posts, classification of posts, and determination of user polarization. It can be applied to different types of political events, such as: i) referendums, in which each faction supports one of the voting options; ii) parliament elections, in which a faction supports a party; iii) presidential elections, in which a faction supports a presidential candidate [21].

The posts are collected by using the keywords that people commonly use to refer to the political event E on social media. Such keywords K can be divided into two groups:

- K_context, which contains generic keywords that can be associated with E without referring to any specific faction in F;
- K⊕_F, which groups, for each faction f_i ∈ F, the set K⊕_fi of keywords used for supporting f_i (positive faction keywords).

The keywords in K are given as input to the public APIs provided by social media platforms, which allow collecting posts containing one or more of those keywords. The collected posts are pre-processed before the analysis. The output of this step is a collection of posts P. In particular, the posts are modified and filtered as follows:

- The text of posts is normalized by transforming it to lowercase and replacing accented characters with regular ones (e.g., IOVOTOSI or iovotosí → iovotosi).
- Words are stemmed to allow matching inflected forms (e.g., vote or votes or voted → vot).
- Stop words are removed from the text by using preset lists.
- All the posts written in a language different from the one(s) spoken in the nation(s) hosting the considered political event are filtered out.

The input of the algorithm for the classification of posts is composed of: the posts P generated in the previous step, the set of positive faction keywords K⊕_F, the maximum number of iterations max_iters, the minimum increment eps of classified posts at each iteration, and a threshold th. The output is a collection of posts C that have been classified in favor of a faction. As discussed in [4], the algorithm is divided into two parts. The first part performs the preliminary iteration (iteration 0). At this iteration, IOM-NN exploits the set of positive faction keywords (K⊕_F) for classifying a part of the posts. Specifically, it classifies a post in favor of a faction if it contains only positive keywords for that faction. In general, at the end of this iteration, only a small part of the posts is classified, since not all users use keywords in K⊕_F for declaring their support for a faction. The second part iteratively generates new classification rules for classifying further posts. At each iteration, such rules are inferred by exploiting the posts that have been classified at the previous iterations.

A further algorithm is used for determining the polarization of users. Its input is composed of: a collection of classified posts C, a filtering function filter with its parameters par_f, and a polarization function polarize with its parameters par_p. The output is composed of a collection of classified users U and a faction score S containing the polarization percentages for each faction. As a first step, the classified posts are aggregated by user to produce a dictionary C_U, which contains the list of classified posts P_u for each user u. Two empty variables are initialized for storing the output. For each pair <u, P_u> in C_U, the algorithm performs the following operations:

- It filters out all the pairs that do not match the criteria defined by the filter function. For example, users who published a number of posts below a given threshold are skipped.
- Using the classified posts P_u, it computes v_u, a vector containing the score of user u for each faction. The score vector is calculated using the polarize function.
- It adds the pair <u, v_u> to U.

Then, the algorithm calculates the overall faction score S as the normalized sum of the user score vectors in U. Finally, the output is returned.

In this section we describe and analyze a case study: the 2016 US presidential election, which was characterized by the rivalry between Hillary Clinton and Donald Trump. The analysis has been performed on data collected for ten US swing states: Colorado, Florida, Iowa, Michigan, Ohio, New Hampshire, North Carolina, Pennsylvania, Virginia, and Wisconsin. Overall, about 2.5 million tweets, posted by 521,291 users, were collected from October 10, 2016 to November 7, 2016 (the day before the election). From such data we filtered out all the tweets posted by users with an undefined location or with a location that does not belong to any of the considered states. In particular, for each faction f_i we defined three sets of keywords, containing respectively positive, negative and neutral keywords for faction f_i. For example, for the Hillary Clinton faction, the positive set K⊕_Clinton contains keywords used to clearly support her (e.g., #voteHillary), the negative set contains keywords used to speak negatively about her (e.g., #neverhillary), and the neutral set contains neutral keywords (e.g., clinton or democrats). IOM-NN exploits only the positive faction keywords (K⊕_fi) for classifying posts and then for determining the polarization of users. For this case study, the filter and polarize functions have been configured as follows. A user u is considered only if he/she fulfills the following criteria: i) u posted at least minPosts posts on the political event of interest; ii) there exists a faction f for which u has published more than 2/3 of his/her posts. For each user u, the polarize function returns a score vector containing the percentage of posts written by u in favor of the preferred faction f, and 0 for the other factions. Figure 10 shows how the user polarization algorithm works on some classified posts. For each user, the posts in favor of Clinton and Trump are counted. Users who fulfill the criteria of the filter function are added to the set of classified users U. Then, U is aggregated and normalized to obtain the vector S containing the overall polarization percentages. As shown in Fig. 11, IOM-NN was able to correctly identify the winning candidate in 8 out of 10 cases, outperforming the opinion polls, which correctly predicted 6 out of 10 states.

In science and business, scientists and professionals analyze huge amounts of data, commonly called Big Data, to extract information and knowledge useful for making new discoveries or for supporting decision processes. This can be done by exploiting Big Data analytics techniques and tools. In this scenario, Cloud computing represents a compelling solution for Big Data analytics, allowing faster data analysis, which means more timely results and, in turn, prompt data value. This paper presented some Cloud-based frameworks and methodologies for Big Data processing that we recently developed. They can be used for developing and executing different kinds of real-world data analysis applications. In particular, in the first part of the paper we presented some tools for developing and running Big Data applications on Clouds (i.e., DMCF, ParSoDA, and Nubytics).
To complete the view, in the second part of the paper we presented innovative methodologies for extracting useful information about mobility behaviors (i.e., SMA4TD and G-RoI) and the political sentiment of social media users (i.e., IOM-NN).

References

[1] Cloud4SNP: distributed analysis of SNP microarray data on the cloud
[2] Trajectory pattern mining for urban computing in the cloud
[3] The quickhull algorithm for convex hulls
[4] Discovering political polarization on social media: a case study
[5] Learning political polarization on social media using neural networks
[6] Programming models and systems for big data analysis
[7] Programming visual and script-based big data analytics workflows on clouds
[8] Using scalable data mining for predicting flight delays
[9] Big data analysis on clouds
[10] Appraising SPARK on large-scale social media analysis
[11] G-RoI: automatic region-of-interest detection driven by geotagged social media data
[12] ParSoDA: high-level parallel programming for social data mining
[13] A high-level programming library for mining social media
[14] Following soccer fans from geotagged tweets at FIFA World Cup
[15] Analyzing social media data to discover mobility patterns at EXPO 2015: methodology and results
[16] Nubytics: scalable cloud services for data analysis and prediction
[17] SMA4TD: a social media analysis methodology for trajectory discovery in large-scale events
[18] A hierarchical parallel storage system based on distributed memory for large scale systems
[19] Point of interest to region of interest conversion
[20] Analysis of discrete ill-posed problems by means of the L-curve
[21] Analyzing polarization of social media users and news sites during political campaigns
[22] A data-aware scheduling strategy for workflow execution in clouds
[23] A cloud framework for parameter sweeping data mining applications
[24] A cloud framework for big data analytics workflows on Azure
[25] JS4Cloud: script-based workflow programming for scalable data analysis on cloud platforms
[26] A workflow management system for scalable data mining on clouds
[27] Exploiting in-memory storage for improving workflow executions in cloud platforms
[28] The Hadoop distributed file system
[29] Data analysis in the cloud: models
[30] Hadoop: The Definitive Guide
[31] Apache Spark: a unified engine for big data processing