Design and Application of a Multi-Variant Expert System Using Apache Hadoop Framework


sustainability

Article


Muhammad Ibrahim * and Imran Sarwar Bajwa

Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan;
imran.sarwar@iub.edu.pk
* Correspondence: ibrahimbwp@gmail.com

Received: 20 October 2018; Accepted: 12 November 2018; Published: 19 November 2018

Abstract: Movie recommender expert systems are valuable tools to provide recommendation services to
users. However, the existing movie recommenders are technically lacking in two areas: first, the available
movie recommender systems give general recommendations; secondly, existing recommender systems
use either quantitative (likes, ratings, etc.) or qualitative data (polarity score, sentiment score, etc.)
for achieving the movie recommendations. A novel approach is presented in this paper that not
only provides topic-based (fiction, comedy, horror, etc.) movie recommendation but also uses both
quantitative and qualitative data to achieve a true and relevant recommendation of a movie for
a topic. The approach relies on SentiWordNet and tf-idf similarity measures to calculate the
polarity score from user reviews, which represents the qualitative aspect of the likeness of a movie.
Similarly, three quantitative variables (likes, ratings, and votes) are used to get a final
recommendation score. A fuzzy logic module decides the recommendation category based on this
final recommendation score. The proposed approach uses a big data technology, “Hadoop”, to handle
data diversity and heterogeneity in an efficient manner. An Android application collaborates with
a web-bot to use recommendation services and show topic-based recommendation to users.

Keywords: recommender systems; opinion mining; SentiWordNet; polarity scores

1. Introduction

Since the advent of web intelligence, artificial intelligence-based services, frameworks and
products have become popular in the World Wide Web. One of the key services of such web
intelligence applications is a recommendation system. In recent times, recommendation systems have
become popular in the domain of movies, music, books, restaurants, garments, mobile applications
and many other fields of life. Such recommendation systems filter huge amounts of structured and
unstructured data and predict the preference that a user would give to an item. In the last
decade, a few movie recommendation systems have been presented using conventional methods [1–3].
However, previous movie recommender systems lack various features and/or accuracy of true
recommendation. A majority of these recommender systems use quantitative variables (likes or ratings)
and a few others use qualitative variables (polarity score, etc.) [4–7]. This paper proposes an intelligent
and automated recommendation system that provides two-fold novelty. First, our recommender
system uses a multi-variant popularity matrix to recommend a suitable movie to a user on the basis
of both quantitative and qualitative variables to achieve true recommendations. Secondly, a fuzzy
logic-based module provides the final recommendation of movies in a particular field of user’s
choice (such as comedy, action, horror, fiction, etc.), whereas the currently available systems give
general recommendations.

In our multi-variant recommendation system, one of the challenges was opinion mining of users’
reviews to calculate a polarity score that shows the degree of likeness or dis-likeness of a movie by
a user [8–12]. Such polarity scores provide a qualitative aspect of users’ opinions about a particular
movie. Another challenge was how to handle the diversity of heterogeneous data, as the presented
approach uses both quantitative and qualitative data. To handle this issue, a big data solution involving
Hadoop was used in our approach because it efficiently handles data heterogeneity and data diversity.

Sustainability 2018, 10, 4280; doi:10.3390/su10114280 www.mdpi.com/journal/sustainability

Society has changed: people now own smartphones and are highly dependent on mobile applications.
Recommendation systems therefore need to communicate with smartphone apps so that users can
easily interact with the services and efficiently select the recommended items [9].
Therefore, in this paper, a recommender system is coupled with an Android application and a web-bot
offering open web services and merging movie data from linked data drawn from different
external resources.

Big data is defined by four dimensions represented by four V’s (volume, variety, velocity,
and veracity). Volume is represented by the amount of text data that we use to generate
recommendations. Variety represents the different types of data extracted from different sources
like blogs, Facebook, and Twitter as well as different review and opinion sites. Reviewers can write
their reviews, remarks, and feedback in any format (structured, semi-structured, or unstructured),
and all of these should be handled by the system. Velocity represents the speed of data generation
on the internet. Veracity represents the trustworthiness of the data.

A multi-variant recommendation system can benefit from a NoSQL environment by reducing
complexity, handling sparsity through factorization, ensuring scalability by using a powerful
server machine, and dealing with heterogeneity by using the Hadoop platform to handle big data
issues [13–15].

Ontology and linked data can be information sources for movies’ descriptions and are available
among Internet applications and are provided through the semantic queries of standard web
technologies, such as URIs, RDF, HTTP and the semantic web. The linked data from Google Places,
Trovacinema, Wikipedia and Netflix or linked movie databases (http://linkedmdb.org) are useful for
the recommender systems [16].

The work presented in [13] discusses a movie recommendation system that uses movie ratings
to recommend movies only in a general category. However, this work seriously lacks accuracy of
true recommendations. The reason for less accuracy in [13] is the use of only numeric data such as
likes and ratings. Such numeric features only cover the quantitative aspect of the users’ likeness.
However, the qualitative aspect of likeness of users is totally ignored in this work, which makes the
results of this work questionable. Here, it is important to mention that quantitative and qualitative
aspects of likeness of users can provide us with true recommendations. The qualitative aspect of
likeness can be achieved from text reviews that were not covered by the approach used in [13].
Moreover, this approach is tested on a single small dataset. Other issues with [13] are tabulated in
Table 1. There is a need for a multivariate approach that involves both quantitative and qualitative
aspects to finalize a recommendation and achieve highly accurate results. Table 1 presents the
differences between previous approaches and our multi-variant approach.

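Table 1. Deviation in different approaches.

| Source | Multi-Variants | Ratings | Votes | Likes | Polarity Scores | Tf-Idf | Fuzzy Logic | Multi-Data Sources |
|---|---|---|---|---|---|---|---|---|
| [13] | × | √ | × | √ | × | × | × | × |
| [17] | × | √ | × | √ | √ | × | √ | × |
| [18] | × | × | × | × | √ | × | × | × |
| [19] | × | × | × | × | × | × | × | × |
| Multi-variant approach | √ | √ | √ | √ | √ | √ | √ | √ |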


During the literature review of modern recommender systems, Hsieh’s movie recommender
system [13] was identified as the most relevant. This system uses the benefits of big data solutions
and also provides a mobile app to interact with the recommender system. However, the key shortcoming
of this work is the limited approach used to recommend movies. Major issues with this recommendation
system are discussed and a comparison with our approach is given in Table 2.

Table 2. Difference to Hsieh’s work [13].

| # | Hsieh’s Work | Our Approach |
|---|---|---|
| 1 | It is a general recommendation system for movies. | Our approach supports topic-wise recommendation of movies such as drama, comedy, action, horror, etc. |
| 2 | This approach only uses quantitative data (ratings and likes) for recommendation, which provides less accuracy. | Our approach uses both quantitative (votes, likes, etc.) and qualitative data (polarity score) for true recommendation of movies. |
| 3 | This approach relies on simplistic calculation based on similarity measures. No real decision-making approach is used, which makes the quality of results questionable. | Our approach uses a fuzzy logic approach for better decision making and true recommendations of movies. |
| 4 | True likeness of users is not reflected by this approach. | Our approach reflects true likeness of the users, as the qualitative aspect of likeness is also considered. |
| 5 | This approach is tested only on one limited dataset. | Our approach is tested on three large datasets. |
| 6 | This approach calculates recommendation on two quantitative variables. | Our approach calculates recommendation on three quantitative variables and one qualitative variable. |

The recommender systems field has made significant progress with many new techniques
proposed and new systems developed. However, modern systems still require significant
improvements to provide better recommendations. The major contribution to knowledge and novelty
of the work is outlined below:

i. A topic (action, comedy, horror, etc.) based recommendation is supported.
ii. Multi-variant (ratings, votes, likes and polarity score) parameters are used.
iii. Both quantitative and qualitative data is used for movie recommendations.
iv. Three external data resources are used for datasets (Metacritic, IMDB, and Fandango).
v. A web-bot used to fetch the web contents collaborates with the server.
vi. Movie descriptions are filtered and integrated from linked data or ontology (linkedmdb).
vii. The recommender system is developed using a NoSQL environment with Apache Hadoop.
viii. Fuzzy sets are established for movie ranking categorization.
ix. Front end App collaboration with the movie recommender through web services is supported.

The rest of this paper is organized as follows. Section 2 discusses related work, in which
recommender systems from different domains address the inherent problems of processing users’
data. Section 3 presents a multi-variant ranking model for a movie recommender that uses
a mobile application and Apache Hadoop. Experiments and results, and the evaluation of the
system, are discussed in Sections 4 and 5, respectively. The conclusion and future work are presented
in Sections 6 and 7, respectively.

2. Related Work

Recommendation services typically rely on customer reviews or customer ratings, and
such recommendations can provide a useful service for new customers. Emotional expressions,
social interaction and behavior changes of users have been studied on Twitter, allowing management
to distinguish clients who do or do not return [20]. Vox Civitas obtains responses from social media
(e.g., Twitter), which can support journalistic investigations in more effective ways [21]. In [22],
anomalies are removed and clean data are obtained that reflect the United Kingdom’s worst influenza.
Detection of noise in text (tweets) from micro-blogs is discussed in [23].

Sentiment analysis approaches can be used to extract sentiments associated with positive or
negative polarities for specific subjects from a document, instead of classifying the whole document as
positive or negative [24]. An NLP-based methodology of sentiment evaluation on users’ comments
has been used as a way to retrieve the best YouTube videos. The process works in four
steps. First, a review collection and preprocessing component extracts data (comments) from the
particular YouTube video, and language preprocessing is undertaken to prepare for the next step.
Second, the processed text goes through NLP-based methods to generate data sets. Subsequently,
the sentiment classifier (SentiStrength) is applied to the data sets to calculate the positivity and
negativity ratings. Finally, the standard deviation is applied to get the rating result [25].

Feature-level sentiment analysis, which is based on the idea that an opinion consists of a sentiment
(positive or negative) and a feature of a movie, is another approach. Each short comment is represented
as a sequence of sentiment words and underlying states [9,12]. A linear regression model (LRM),
a supervised machine learning technique, has been used to classify Twitter gossip (positive and
negative) and to predict the box-office revenue of different movies [8]. A neural network (NN) method
has been introduced for sentiment classification of large movie reviews. Recursive neural networks,
which handle sentence-level sentiment classification, are combined with a recurrent neural network
that analyzes the whole passage to create better results [26].

The vector space model (VSM) was used to implement the instance-based learning (IBL)
classification method. Text documents were treated as vectors in IBL algorithms to identify the
class (positive or negative review) of the document [27]. Sentiment analysis is negative when the text
includes some negative words, such as “bad acting, stilted dialog.” It is positive if the text includes
some positive words such as “it’s funny”. Instead of an exact rating, a suggestion is produced by
sentiment classification of the comments (polarity), which is then aggregated into a rating score
used to select the recommended list of popular movies [11,17]. For example, the management of
Starwood Hotels and Resorts uses the strength of social media to stay connected with guests, to guide
them, and to seek responses to the services provided [28].

Other related work includes typical recommender frameworks that build their computations on
different fuzzy set theoretic similarity measures (the fuzzy set extensions of the Jaccard index, cosine,
proximity or correlation similarity measures), and aggregation techniques for computing recommendation
confidence scores (the maximum-minimum or weighted-sum fuzzy set theoretic aggregation
methods) [4]. A content-based ranking strategy builds a sentiment graph from the collocation of
adjectives and uses the PageRank algorithm with a very small set of adjectives (such as ‘good’,
‘excellent’, etc.) to rank different movies using reviews of box office movies by users of a popular
movie review site [18]. With regard to the use of labels for movie recommendation, the German movie
website Moviepilot uses viewers, movie ratings, and labels attached to every movie. Labels are
assigned by a group of moderators, and viewers are then able to rate how well the labels fit each
movie [3]. Collaborative filtering was first applied to filtering the information in Usenet news [29].

Music recommender systems provide personalized music recommendations and Ringo Agent
was one of the first applications [30]. Content-based filtering recommends movies based on
a comparison between user profile data and content of movies. Content-based filtering is also called
cognitive-filtering. The recommendations are generated by matching users and movie content [4].
Collaborative filtering is also called social filtering. The fundamental rule behind collaborative filtering
is that if a user liked a certain category of movie in the past, then they may like similar movies in the
future. This information is used in deciding which movie to suggest [19,29,30]. Hybrid filtering is
a combined technique of content filtering and collaborative filtering [31].

The previous work discussed above suggests that most of the approaches used for
recommendation services, especially for movie recommendation, are uni-variant and use variables
such as ratings, which tend to provide results with low accuracy. There are other variables, including
likes, number of reviews, and the sentiment score of a review, that can help in achieving an accurate
and efficient recommendation, and in this paper we aim to use these new variables for the proposed
movie recommendation service.

3. Multi-Variant Expert System

The proposed approach works on the fetched data (scores and reviews) from a set of movie
websites and databases. The collected data is heterogeneous in nature, such as numeric data
(for example, number of votes, number of likes and number of ratings) and text data (for example,
user reviews of movies). A web crawler was developed to fetch structured and unstructured data and
store the fetched data in a NoSQL database on a server machine for further processing.

The approach works in two parallel streams. In the first stream, the text data (such as
movie reviews) is preprocessed using NLP modules, tf-idf algorithms and the SentiWordNet auxiliary
database to identify polarity scores of terms (lexicons) in the form of negative and positive scores.
All the movie reviews are processed to obtain an aggregate polarity score for each movie from each
participating external data source. In the second stream, all the numeric scores and weights
(ratings, votes and likes) of the movies are normalized and combined to achieve a weighted aggregate
of polarity scores. The result of a search query for movies is shown in the user interface of an Android
app (see Figure A1). The user query is sent to the server, which processes the request
by forwarding it to the web crawler. The web crawler module responds to the server’s request by
crawling the web for matching keywords (lexicons) and downloading the webpages to the server,
and then the server processes the data to generate a recommendation as shown in Figure 1.

Figure 1. Multi-variant expert system for movie recommendation.



3.1. NLP Module

Real-world data is generally incomplete and noisy, and is likely to contain irrelevant and
redundant information or errors. By pre-processing, raw unstructured data can be converted into
a structured, understandable form as shown in Figure 2. Since real-world data can contain
ambiguity and anomalies, it is necessary to remove these abnormalities before the actual analysis of
the data. The data was pre-processed to remove anomalies, to identify abbreviated forms of standard
English words, and to remove reviews written in languages other than English [32–34].


Figure 2. Preprocessing data using NLP module.

3.1.1. Tokenization

The given character stream is then separated into units called tokens. The tokens might be words,
numbers or punctuation marks. Tokenization does this by finding word boundaries. For example, the string
“The message of this film is simple” is tokenized as [The] [message] [of] [this] [film] [is] [simple] [35,36].
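As a minimal sketch of this step, the snippet below splits a review into word, number and punctuation tokens with a regular expression. The function name `tokenize` and the pattern are illustrative assumptions, not the tokenizer used in the system.

```python
import re

# Hypothetical pattern: words (with an optional internal apostrophe),
# numbers, or any single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]")


def tokenize(text):
    """Split a character stream into tokens at word boundaries."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The message of this film is simple"))
# ['The', 'message', 'of', 'this', 'film', 'is', 'simple']
```

A regular expression keeps the sketch dependency-free; a production tokenizer would need additional rules for abbreviations, URLs and emoticons, which are common in user reviews.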

3.1.2. Stemming (Lemmatization)

This step is optional; most stemming uses the Porter Stemmer. English words like “look” can be
inflected with a morphological suffix to produce “looks, looking, looked”. These share the same stem,
“look” [37,38].

3.1.3. Stop Word Removal

The most frequently used words do not convey much meaning, for example: “the, an, of, for, in ...”.
We used a small corpus-based library to exclude stop words from the input data. This library is
developed in Java.
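The filtering itself can be sketched in a few lines. The paper's library is written in Java and is not shown, so the Python version below is only an illustration, and the `STOP_WORDS` set is a hypothetical, deliberately tiny list.

```python
# Hypothetical stop-word list; a real corpus-based list would be much larger.
STOP_WORDS = {"the", "an", "a", "of", "for", "in", "is", "this"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "message", "of", "this", "film", "is", "simple"]
print(remove_stop_words(tokens))
# ['message', 'film', 'simple']
```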

3.1.4. POS-Tag Generation

The query was then analyzed and POS (part-of-speech) tags were generated for all the words in the
query. The resulting string of words and their POS tags is then tokenized on whitespace.
For example, the English sentence “This movie is so riddled” is POS-tagged
as [this/DT movie/NN is/VBZ so/RB riddled/JJ]. The Treebank Project
(URL: https://catalog.ldc.upenn.edu/docs/LDC95T7/treebank2.index.html) defines 36 POS tags [39],
e.g., determiner [DT], adjective [JJ] and adverb [RB].
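The whitespace tokenization of a tagged string can be sketched as follows. The snippet assumes the tagger has already produced the `word/TAG` string shown above; the helper name `parse_tagged` is hypothetical, and the tagging itself is not implemented here.

```python
def parse_tagged(tagged):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("this/DT movie/NN is/VBZ so/RB riddled/JJ")
adjectives = [word for word, tag in pairs if tag.startswith("JJ")]
print(pairs)
print(adjectives)  # ['riddled']
```

Selecting adjectives and adverbs this way is one plausible use of the tags, since those word classes tend to carry sentiment.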

3.2. Polarity Computation

3.2.1. Lexical Frequency Measuring

For this purpose, we applied a tf-idf frequency measure [40,41]; we first calculated the frequency of
the valid/important terms, which represents the number of times that term ‘t’ occurs in the document
(review) ‘d’, as in the following Equation (1):

$$\mathrm{tf}(t, d) = f(t, d) \tag{1}$$


After calculating tf, we calculated the idf (inverse document frequency) of the terms to obtain
information about how rare or common each term is across the documents (reviews). We used
Equation (2):

$$\mathrm{idf}(t, D) = \log \frac{N}{d_t} \tag{2}$$

where $d \in D$ and $t \in d$, $d_t$ is the number of documents that contain the term $t$, and
$N = |D|$ is the total number of documents in the corpus. The end results are then obtained by
applying Equation (3):

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{3}$$
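Equations (1)–(3) can be sketched directly in code. The snippet assumes each review is already a list of tokens, uses the natural logarithm (the base is not stated in the text), and returns an idf of zero for terms that occur in no document; the last two choices are assumptions made for this illustration.

```python
import math
from collections import Counter


def tf(term, doc):
    """Equation (1): raw count of `term` in the tokenized document `doc`."""
    return Counter(doc)[term]


def idf(term, docs):
    """Equation (2): log(N / d_t), with d_t the number of docs containing `term`."""
    d_t = sum(1 for doc in docs if term in doc)
    if d_t == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(docs) / d_t)


def tfidf(term, doc, docs):
    """Equation (3): product of tf and idf."""
    return tf(term, doc) * idf(term, docs)


reviews = [
    ["funny", "film"],
    ["bad", "acting", "bad"],
    ["funny", "funny", "dialog"],
]
print(tfidf("bad", reviews[1], reviews))    # 2 * log(3 / 1)
print(tfidf("funny", reviews[2], reviews))  # 2 * log(3 / 2)
```

A term that appears in every review gets an idf of zero under this definition, so very common words are discounted even when stop-word removal misses them.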

3.2.2. Polarity Identification

SentiWordNet 3.0 automatically annotates all WordNet 3.0 synsets according to their degrees of
positivity, negativity and neutrality. In this step, the SentiWordNet score was used in the sentiment
analysis of the documents (reviews) [42,43]. For this purpose, we applied Equation (4):

$$\text{Polarity\_term\_score} = \text{SentiWordNetScore} \times \text{Frequency}(\mathrm{tfidf}) \tag{4}$$

The SentiWordNetScore (positive or negative) of the term was multiplied by its frequency to get
the overall sentiment of the terms in the documents.
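A minimal sketch of Equation (4) follows. SentiWordNet stores separate positive and negative scores per synset; collapsing them into one signed score per term, as done here, is a simplifying assumption, and the values in `LEXICON` are made up for illustration rather than taken from SentiWordNet.

```python
# Hypothetical signed scores standing in for SentiWordNet lookups.
LEXICON = {"funny": 0.5, "bad": -0.625, "stilted": -0.25}


def polarity_term_score(term, tfidf_weight, lexicon=LEXICON):
    """Equation (4): sentiment score of a term scaled by its tf-idf weight.

    Terms missing from the lexicon are treated as neutral (score 0).
    """
    return lexicon.get(term, 0.0) * tfidf_weight


print(polarity_term_score("bad", 2.0))    # -1.25
print(polarity_term_score("funny", 1.0))  # 0.5
print(polarity_term_score("film", 3.0))   # 0.0
```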

The SentiWordNetScore of each term for all reviews of the movie is calculated and the score
(negative or positive) tells us how many terms are positively or negatively important in the review.
Then, all the positive terms scores are added to obtain the positive term’s weight, and also all the
negative terms scores are combined to obtain the negative term’s weight in a review. The polarities of
all the reviews of the movies from each participating website are calculated as follows.

Polarity of a Term

Applying the sign function P(t) to a term: if the SentiWordNetScore ($S_{t_i}$) of a term ($t_i$) of the
review ($r_i$) of movie ($m_j$) is less than zero, the term lies in the negative pole (nt); if it is greater
than zero, it lies in the positive pole (pt); and if it is equal to zero, it lies in the neutral pole (tn),
as shown in Equation (5).

$$
P(t) =
\begin{cases}
-S_{t_i}, & S_{t_i} < 0 \quad (nt = \text{negative term}) \\
S_{t_i}, & S_{t_i} = 0 \quad (tn = \text{neutral term}) \\
+S_{t_i}, & S_{t_i} > 0 \quad (pt = \text{positive term})
\end{cases}
\tag{5}
$$

Polarity of a Document (Reviews)

To calculate the polarity of a document (review), $\text{polarity}_{r_i}(R)$, the positive terms $pt_{r_i}$
and negative terms $nt_{r_i}$ of each document (review) from each participating website are aggregated
into $\text{positive\_term}_{r_i}(x)$ and $\text{negative\_term}_{r_i}(y)$, as shown in Equations (6) and (7).
Their difference is then taken to find the polarity of each document (review) by applying the sign
function f(r).

$$\text{positive\_term}_{r_i}(x) = \sum_{i=0}^{n} pt_{r_i} \tag{6}$$

$$\text{negative\_term}_{r_i}(y) = \sum_{i=0}^{n} nt_{r_i} \tag{7}$$

where $pt_{r_i} \wedge nt_{r_i} \in r_i$ and $r_i \in m_j$.
Polarity of a review is calculated as shown in Equations (8) and (9):

$$\text{polarity}_{r_i}(R) = \operatorname{sgn}\left[\, |x_{r_i}| - |y_{r_i}| \,\right] \tag{8}$$



$$
f(r) =
\begin{cases}
-p_{r_i}, & p_{r_i} < 0 \quad (nr = \text{negative review}) \\
p_{r_i}, & p_{r_i} = 0 \quad (rn = \text{neutral review}) \\
+p_{r_i}, & p_{r_i} > 0 \quad (pr = \text{positive review})
\end{cases}
\tag{9}
$$

Applying the sign function f(r) to each review: if the difference between the aggregated
$\text{positive\_term}_{r_i}(x)$ of review ($r_i$) of the movie ($m_j$) from website ($w_k$) and the aggregated
$\text{negative\_term}_{r_i}(y)$ is less than zero, the review lies in the negative pole (nr); if it is greater
than zero, it lies in the positive pole (pr); and if it is equal to zero, it lies in the neutral pole (rn).

Polarity of a Collection (Movie Reviews)

$\text{review\_positive\_score}_{m_j}(g)$ is the aggregated polarity score of the positive reviews $pr$ and
$\text{review\_negative\_score}_{m_j}(h)$ is the aggregated polarity score of the negative reviews $nr$ of
a particular movie $m_j$ from a particular website ($w_k$). They are used to calculate the polarity of the
movie from the participating sites, as given in Equations (10) and (11).

$$\text{review\_positive\_score}_{m_j}(g) = \sum_{i=0}^{n} pr_i \tag{10}$$

$$\text{review\_negative\_score}_{m_j}(h) = \sum_{i=0}^{n} nr_i \tag{11}$$

where $pr_i \wedge nr_i \in m_j$ and $m_j \in w_k$.
Polarity of a collection is calculated using Equation (12):

$$\text{polarity}_{m_j}(p) = \operatorname{sgn}\left[\, |g_{m_j}| - |h_{m_j}| \,\right] \tag{12}$$

where $g_{m_j} \wedge h_{m_j} \in m_j$ and $m_j \in w_k$. Here $w_k$ is a movie website such as IMDB.
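The two aggregation levels in Equations (6)–(12) share the same pattern: sum the positive scores, sum the negative scores, and take the sign of the difference of their magnitudes. The sketch below illustrates that pattern. It assumes each review contributes its sign (+1, 0 or -1) to the movie-level sums, which is one possible reading of Equations (10) and (11); all function names are hypothetical.

```python
def sgn(value):
    """Sign function: +1, 0 or -1."""
    return (value > 0) - (value < 0)


def balance(scores):
    """|sum of positive scores| - |sum of negative scores|."""
    positive = sum(s for s in scores if s > 0)
    negative = sum(s for s in scores if s < 0)
    return abs(positive) - abs(negative)


def review_polarity(term_scores):
    """Equations (6)-(9): polarity of one review from its term scores."""
    return sgn(balance(term_scores))


def movie_polarity(reviews):
    """Equations (10)-(12): polarity of a movie on one website.

    `reviews` is a list of per-review term-score lists.
    """
    return sgn(balance([review_polarity(r) for r in reviews]))


reviews = [[0.5, 0.25], [-1.25, 0.5], [0.75]]
print([review_polarity(r) for r in reviews])  # [1, -1, 1]
print(movie_polarity(reviews))                # 1
```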

3.3. Weighted Polarity Manipulation

Opinion mining determines the emotions (positive or negative) of textual communication on social
media, and typically examines them by simply extracting polarity scores from the review
(number of stars, thumbs up/down, votes, etc.). However, we used both the polarity score
and the weight score (rating, votes and likes) of the movies. First, we computed the aggregated polarity
score of each movie from each participating site, and then we averaged the aggregated polarity over
the total number of reviews of the respective movie on that site. Next, the average polarity scores
were aggregated across sites, and the total likes of the movie were combined with
weighted_average_polarity to find the aggregated_weighted_average_polarity. After that, the final score
of the movie was rescaled to get the ranked score and category of the movie. Equations (13)–(18) are
used in this computation.

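$$\text{aggregate\_polarity\_Score}_{m_j}(g) = \sum_{i=1}^{n} p_i \tag{13}$$

where $g_{m_j} \in m_j$, $m_j \in w_k$.

$$\text{weight}_{m_j} = \left[ (\text{votes}_{w_k}) + (\text{rating}_{w_k}) \right] \tag{14}$$

where $\text{weight}_{m_j} \in m_j$, $m_j \in w_k$.

$$\text{weighted\_average\_polarity}_{m_j}(a) = \frac{g_{m_j}}{n} + \left( \text{weight}_{m_j} \right) \tag{15}$$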



where $n$ is the number of reviews of movie ($m_j$) from movie website ($w_k$).

$$\text{aggregated\_weighted\_average\_polarity}_{m_j}(G) = \sum_{k=1}^{N} a_{w_k} + \text{likes}_{m_j} \tag{16}$$

$$\text{average\_aggregated\_weighted\_average\_polarity}_{m_j}(A) = \frac{G_{m_j}}{N} \tag{17}$$

where $a_{w_k} \in m_j$, $m_j \in w_k$, $w_k \in N$, and $N$ is the number of movie websites (Metacritic,
IMDB and Fandango), each of which has a huge collection of material.

$$\text{Rank\_score}_{m_j}(R) = \frac{r_{m_j} - \min(r_{m_j})}{\max(r_{m_j}) - \min(r_{m_j})} \times 10 \tag{18}$$

Here $R$ is the rescaled value of the normalized average aggregated score ($r_{m_j}$) of the movie ($m_j$)
from movie websites ($N$), used to rank the top five movies ($M$).
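The chain from Equation (13) to Equation (18) can be sketched as three small functions. The snippet assumes that votes, ratings and likes have already been normalized, that likes are added once after the sum over websites in Equation (16), and that identical scores rescale to zero; these are assumptions for illustration, and the function names are hypothetical.

```python
def weighted_average_polarity(review_polarities, votes, rating):
    """Equations (13)-(15) for one movie on one website."""
    g = sum(review_polarities)               # Eq. (13): aggregate polarity
    weight = votes + rating                  # Eq. (14): site weight
    return g / len(review_polarities) + weight  # Eq. (15)


def average_aggregated_score(site_scores, likes):
    """Equations (16)-(17): combine per-site scores and likes, then average."""
    aggregated = sum(site_scores) + likes    # Eq. (16), likes added once
    return aggregated / len(site_scores)     # Eq. (17)


def rank_scores(scores):
    """Equation (18): min-max rescale of movie scores to the range 0-10."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {movie: 0.0 for movie in scores}  # assumption: no spread
    return {m: (r - low) / (high - low) * 10 for m, r in scores.items()}


site_score = weighted_average_polarity([1, -1, 1, 1], votes=0.5, rating=0.25)
print(site_score)  # 1.25
print(rank_scores({"movie_a": 1.0, "movie_b": 2.0, "movie_c": 3.0}))
# {'movie_a': 0.0, 'movie_b': 5.0, 'movie_c': 10.0}
```

Because Equation (18) is a min-max rescale, the lowest-scoring movie in a result set always receives 0 and the highest always receives 10, so the rank score is relative to the movies being compared.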

3.4. Categorization

Movie genres are various forms or identifiable types, categories, classifications or groups of
movies (genre comes from the French word meaning “kind”, “category”, or “type”; see
http://www.filmsite.org/filmgenres.html).

The user can directly query for content (e.g., “news London”, “golf 1940”, “documentary
Alfred Hitchcock”, “movies tonight”). The querying is done on different fields describing the
content (e.g., title, creator, year, genre, language, location). The search results are presented as
organized according to concepts from common vocabularies (e.g., Time Ontology, Geo Ontology,
WordNet) [29,44–46].

3.4.1. WordNet

The WordNet library was used in our approach to find synonyms and alternative forms of query
terms, e.g., “weather” = {“weather report”, “weather forecast”, etc.}. Identification of synonyms in
data helps in obtaining better and more accurate results.

3.4.2. Geo Ontology

Finds related geographical areas, e.g., “London” = {City of London, Camden, Westminster,
Greenwich, Greater London, England, and UK}.

3.4.3. Time Ontology

Determines temporal context e.g., “tonight” = {18:00–24:00}, or “this week” = {10/02–17/02}.

3.4.4. TVA-CS

Finds related genres, e.g., “sports” = {sport reports, sport live, sport news, sport documentary,
football game, etc.}.
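The four vocabularies above all serve the same purpose: expanding a query term into related terms. A minimal sketch of that idea follows. The dictionaries are hypothetical in-memory stand-ins filled only with the examples given in the text; they are not the real WordNet, Geo Ontology, Time Ontology or TVA-CS resources, and `expand_query` is an illustrative name.

```python
from typing import Dict, List

# Hypothetical toy vocabularies; the entries are copied from the examples in the text.
SYNONYMS = {"weather": ["weather report", "weather forecast"]}
GEO = {"London": ["City of London", "Camden", "Westminster", "Greenwich",
                  "Greater London", "England", "UK"]}
TIME = {"tonight": ["18:00-24:00"]}
GENRES = {"sports": ["sport reports", "sport live", "sport news",
                     "sport documentary", "football game"]}


def expand_query(query: str) -> Dict[str, List[str]]:
    """Map each query term to itself plus any related terms found in the vocabularies."""
    expanded: Dict[str, List[str]] = {}
    for term in query.split():
        related = [term]
        for vocabulary in (SYNONYMS, GEO, TIME, GENRES):
            related.extend(vocabulary.get(term, []))
        expanded[term] = related
    return expanded
```

For example, `expand_query("movies tonight")` leaves “movies” unchanged and attaches the time window to “tonight”.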

3.5. Recommendation

The final recommendation is obtained by applying the fuzzy logic approach to the following fuzzy
set, which evaluates the final score and assigns the category of the movie.

Step 1 if final score ≥ 8 then Category: “A: Recommended”
Step 2 else if final score ≥ 6 then Category: “B: Top Recommended”
Step 3 else if final score ≥ 4 then Category: “C: Recommended Average”
Step 4 else if final score ≥ 2 then Category: “D: Least recommended”
Step 5 else Category: “F: Not recommended”
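The rules listed above are plain threshold tests on the final score, so they can be written as a single function. The sketch below implements only those listed rules, with the category labels copied verbatim; the text does not specify any fuzzy membership functions, so none are modelled here, and the function name `categorize` is an assumption.

```python
def categorize(final_score: float) -> str:
    """Return the recommendation category for a final score on the 0-10 scale."""
    if final_score >= 8:
        return "A: Recommended"
    if final_score >= 6:
        return "B: Top Recommended"
    if final_score >= 4:
        return "C: Recommended Average"
    if final_score >= 2:
        return "D: Least recommended"
    return "F: Not recommended"
```

The thresholds are checked from highest to lowest, so each score falls into exactly one category.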

Figure 3 shows the final recommendations of the movies in a particular category (such as comedy,
horror, fiction, etc.) in one of the five different classes. The user interface showing the output is
discussed in Appendix A and Figure A1.


Figure 3. Multi-variant ranked category.

4. Experimental Setup

4.1. NoSQL for Big Data Storage

The number of publicly available test corpora is quite limited, and those available are comparatively
small in terms of the number of text documents they contain. Thus, producing adequately precise
comparisons between reported performances is difficult. We therefore decided to build a new corpus,
and for this purpose we used three different external data source websites to extract a large number of
reviews, votes, rankings and likes. The data for our corpus was retrieved by a web bot implemented
in PHP. We wrote web page scraping scripts (the web-bot) that extract movie URLs matching the
user's query: if the query keywords (lexicons) are matched, the crawler downloads the page to
a NoSQL environment on a server machine using Hadoop; otherwise the page is discarded [13,14,47–52].
This procedure is depicted in Figure 4. The process of data extraction uses the following steps.

 
 
Figure 4. Data collection process from websites using Web Crawler.



Step 1 Receive URLs for a movie type, e.g., comedy, horror, fiction, etc.
Step 2 Match the keywords from the query against the page
Step 3 If the keywords match, then
Step 4 Download the web page
Step 5 Send it for storage
Step 6 Otherwise, discard the page
Step 7 Repeat Steps 2 to 6 until all the matching web pages are found.
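The steps above can be sketched as a simple filter loop. The authors' web bot was written in PHP; the Python version below is only an illustration of the control flow. The `fetch` callable is a hypothetical stand-in for the page download, and requiring all keywords to match (rather than any) is an assumption, since the text does not say which is used.

```python
from typing import Callable, Dict, Iterable, List


def crawl(urls: Iterable[str], keywords: List[str],
          fetch: Callable[[str], str]) -> Dict[str, str]:
    """Keep the pages whose text contains every query keyword; discard the rest."""
    wanted = [keyword.lower() for keyword in keywords]
    stored: Dict[str, str] = {}
    for url in urls:                                  # Step 1: candidate URLs
        page = fetch(url)                             # read the page so it can be inspected
        text = page.lower()
        if all(keyword in text for keyword in wanted):  # Steps 2-3: keyword match
            stored[url] = page                        # Steps 4-5: keep and send for storage
        # Step 6: a non-matching page is simply not stored
    return stored                                     # Step 7: loop ends when URLs run out
```

Passing `fetch` as a parameter keeps the sketch testable without network access; in a real crawler it would wrap an HTTP request and the storage step would write to the Hadoop environment.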

The web crawler (web-bot) downloads the web pages (crawled pages), from which it extracts further
content (meta tags) such as movie reviews, ratings, votes and likes; irrelevant pages are discarded,
as shown in Figure 4.

Computational processing can occur on data stored either in a file system (unstructured) or in
a database (structured), as shown in Figure 5. Apache Hadoop is widely used as a de facto platform
for such processing. It is an open-source software framework for processing big data on clusters of
commodity hardware.

 
Figure 5. Multi-variant recommendation system.

A multi-variant web agent is implemented in the Hadoop environment to handle the big
data generated by recommendation systems and to improve scalability and efficiency.
Figure 5 shows the interaction and computation in the Hadoop environment
between the Android user app, the web bot and the external participating data sites. Each
Mapper runs several routines, including porterStemmer(), tokenizer(), POStager(), polarityComputation(),
weightedRanking() and Webcrawler(). The hardware used in the implementation is discussed in Appendix B,
and the specifications of the server machine and the Android device are presented in Table A1.
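The map and reduce roles in this setup can be illustrated with a small in-process emulation. This is a sketch of the MapReduce pattern only: it does not use the Hadoop API, the toy lexicon is a hypothetical stand-in for SentiWordNet, and whitespace splitting stands in for the tokenizer() and porterStemmer() routines named above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical toy lexicon standing in for SentiWordNet polarity scores.
TOY_LEXICON = {"great": 1.0, "good": 0.5, "boring": -0.5, "awful": -1.0}

Key = Tuple[str, str]  # (movie_id, site_id)


def mapper(record: Tuple[str, str, str]) -> Iterator[Tuple[Key, float]]:
    """Emit one ((movie, site), polarity) pair per review."""
    movie_id, site_id, review = record
    tokens = review.lower().split()  # stand-in for tokenizer()
    polarity = sum(TOY_LEXICON.get(token.strip(".,!?"), 0.0) for token in tokens)
    yield (movie_id, site_id), polarity


def reducer(values: List[float]) -> Tuple[float, int]:
    """Sum the polarities of one key, as in Eq. (13), and count the reviews."""
    return sum(values), len(values)


def run_job(records: Iterable[Tuple[str, str, str]]) -> Dict[Key, Tuple[float, int]]:
    """Group mapper output by key (the shuffle step) and reduce each group."""
    grouped: Dict[Key, List[float]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}
```

In an actual Hadoop job the grouping is done by the framework between the map and reduce phases; `run_job` only imitates it so the sketch runs on one machine.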

4.2. Experiment and Results

We used three repositories whose reviews are regarded as trustworthy, namely IMDB, Metacritic and
Fandango, which contain all the required data (reviews and scores). These repositories provided
data for 1000 of the most popular movies (those with a significant number of votes and
ratings), with the movies and their reviews released in 2016 and 2017, as of 22 March 2017.

We computed the polarity score of the text data (reviews) from the corpus of movie reviews
fetched from each participating external data source site. This was
followed by data preprocessing, tf-idf classification and polarity identification using SentiWordNet to
compute the polarity score of each term of each document from each participating data source site.
The following tables show the fetched movie data that were processed and evaluated.

Table 3 lists the movie titles and the corresponding movie IDs.



Table 3. Movie IDs of the movies used in the experiments.

| Movie Title | Movie ID |
|---|---|
| Avengers: Age of Ultron (2015) | m1 |
| Cinderella (2015) | m2 |
| Ant-Man (2015) | m3 |
| Do You Believe? (2015) | m4 |
| Hot Tub Time Machine 2 (2015) | m5 |

Table 4 lists the external data source sites and the corresponding site IDs used in the
formulation.

Table 4. Movie database site IDs.

| Movie Database Site | Movie Database Site ID |
|---|---|
| Metacritic | w1 |
| IMDB | w2 |
| Fandango | w3 |

Some computed values, such as the polarity scores of the already labeled movie reviews from the
participating sites, are shown in Table 5.

Table 5. Calculated polarity scores.

| Movie ID | w1 | w2 | w3 |
|---|---|---|---|
| m1 | 21 | 29 | 3 |
| m2 | −17 | 16 | 26 |
| m3 | 2 | 2 | 4 |
| m4 | −2 | 2 | 13 |
| m5 | 1 | 19 | 7 |

The Metacritic and IMDB rating scores were normalized to a 0–5 scale, because Metacritic and
IMDB user ratings are out of ten stars whereas Fandango ratings are out of five stars.
These normalized values are given in Table 6.

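Table 6. Normalized and un-normalized ratings.

| Movie ID | Un-normalized w1 | Un-normalized w2 | Un-normalized w3 | Normalized w1 | Normalized w2 | Normalized w3 |
|---|---|---|---|---|---|---|
| m1 | 7.1 | 7.8 | 5 | 3.55 | 3.9 | 5 |
| m2 | 7.5 | 7.1 | 5 | 3.75 | 3.55 | 5 |
| m3 | 8.1 | 7.8 | 5 | 4.05 | 3.9 | 5 |
| m4 | 4.7 | 5.4 | 5 | 2.35 | 2.7 | 5 |
| m5 | 3.4 | 5.1 | 3.5 | 1.7 | 2.55 | 3.5 |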
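The normalization behind Table 6 is a linear rescaling from each site's own scale to five stars. A minimal sketch, with `normalize_rating` as an illustrative name rather than anything taken from the paper:

```python
def normalize_rating(rating: float, scale_max: float, target_max: float = 5.0) -> float:
    """Linearly rescale a rating given out of scale_max to a rating out of target_max."""
    return rating * target_max / scale_max
```

For instance, the Metacritic rating of m1 (7.1 out of 10) becomes 3.55, while a Fandango rating, already out of five, is unchanged.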

Metacritic_votes, IMDB_votes and Fandango_votes are the numbers of votes given by users to
particular movies on the respective movie websites; they are presented in Table 7.

We calculated and aggregated the polarity score from each participating movie site, took the
average of the polarity scores over the total number of reviews, and obtained the weighted average
polarity by adding the weights (normalized rating and votes) to each average polarity. Likes may also
represent the behavior of users, which affects the movie rating; that is why we included likes in our
model to present the multi-variant approach. We added Facebook likes to the weighted average
polarities to obtain the aggregated weighted average polarity for better recommendations. The computed
scores are shown in Table 8.

Table 7. Movie votes.

| Movie ID | w1 | w2 | w3 |
|---|---|---|---|
| m1 | 1330 | 271,107 | 14,846 |
| m2 | 249 | 65,709 | 12,640 |
| m3 | 627 | 103,660 | 12,055 |
| m4 | 31 | 3136 | 1793 |
| m5 | 88 | 19,560 | 1021 |

These multiple variants (votes, ratings and likes) are combined according to our model,
which ranks the movies from the corpora of the different movie data sources; the final scores
and categories are presented in Table 9.

The multi-variant score of movie m1, “Avengers: Age of Ultron”, is greater than six, which is
why it is categorized as “B”. The score of m2, “Cinderella”, is greater than four, so it lies in
category “C”. The scores of m3 and m4, “Ant-Man” and “Do You Believe?”, respectively, are greater
than six, so these are categorized as “B”, and the score of m5, “Hot Tub Time Machine 2”, is greater
than four, so it also lies in category “C”. Figure 6 shows the final ranking scores of the movies in
a particular category, such as fiction, horror, drama, etc.

Figure 7 shows the rating scores of particular movies from the three movie websites. We
compared the normalized rating scores across these sites and observed that the ratings from Fandango
are so consistently high that they are not informative on their own.


Figure 6. Multi-variant movie ranking.


Figure 7. Rating difference.



Table 8. Weighted average polarity scores.

| Movie ID | Reviews w1 | Reviews w2 | Reviews w3 | Aggregated Polarity w1 | Aggregated Polarity w2 | Aggregated Polarity w3 | Average Polarity w1 | Average Polarity w2 | Average Polarity w3 | Weighted Average Polarity w1 | Weighted Average Polarity w2 | Weighted Average Polarity w3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m1 | 66 | 1168 | 30 | 21 | 29 | 3 | 0.318 | 0.0248 | 0.1 | 1333.86 | 271,110 | 14,851.1 |
| m2 | 67 | 363 | 27 | −17 | 16 | 26 | 0.253 | 0.0440 | 0.962 | 252.49 | 65,712.5 | 12,645.9 |
| m3 | 64 | 605 | 24 | 2 | 2 | 4 | 0.031 | 0.003 | 0.166 | 631.08 | 103,663 | 12,060.1 |
| m4 | 22 | 69 | 22 | −2 | 2 | 13 | 0.090 | 0.0289 | 0.590 | 33.25 | 3138.72 | 1798.59 |
| m5 | 29 | 101 | 20 | 1 | 19 | 7 | 0.034 | 0.1881 | 0.35 | 89.73 | 19,562.7 | 1024.85 |



Table 9. Final scores and movie category.

| Movie ID | Likes | Aggregated Weighted Average Polarity | Final Ranking Score | Movie Category |
|---|---|---|---|---|
| m1 | 308,130 | 403,895.2977 | 7.44944873 | B |
| m2 | 331 | 26,534.68444 | 5.878236708 | C |
| m3 | 140,000 | 178,785.0504 | 6.979146632 | B |
| m4 | 97,000 | 98,656.85966 | 6.636052698 | B |
| m5 | 14,000 | 20,892.44087 | 5.740277373 | C |
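As a worked check, the m1 values in Tables 5 to 9 can be traced through the computation. The input numbers below are copied from those tables; the variable names are illustrative. The weighted average polarities follow Equation (15). The aggregated value reported in Table 9 appears to be reproduced when the three site values are averaged first and the likes are added afterwards, which differs from the order printed in Equations (16) and (17); this reading is an inference from the numbers, not something the text states.

```python
# Values for movie m1, copied from Tables 5-8 and the likes column of Table 9.
reviews = {"w1": 66, "w2": 1168, "w3": 30}
aggregated_polarity = {"w1": 21, "w2": 29, "w3": 3}
votes = {"w1": 1330, "w2": 271107, "w3": 14846}
normalized_rating = {"w1": 3.55, "w2": 3.9, "w3": 5.0}
likes = 308130

# Eq. (15): average polarity per review plus the weight (votes + rating) of the site.
weighted = {
    site: aggregated_polarity[site] / reviews[site] + votes[site] + normalized_rating[site]
    for site in reviews
}

# Assumed order of operations: average over the N = 3 sites, then add the likes.
site_average = sum(weighted.values()) / len(weighted)
aggregated = site_average + likes

print({site: round(value, 2) for site, value in weighted.items()})
print(round(aggregated, 4))
```

The same order of operations also matches the m2 and m3 rows of Table 9 to within the rounding of the Table 8 entries.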

Figure 8 shows the votes for the movies on the different participating websites, which also
reflect users' interest in the movies. The vote counts differ greatly between sites, so the result
would be distorted if only one site were selected. One site is not adequate for a ranking approach,
which is why we selected multiple variants from different sites.

 

Figure 8. Movie votes difference.

Figure 9 shows the differences in the weighted average polarity scores computed with the
multi-variant approach, which determine the categories.


Figure 9. Comparison scores of movies.

The time complexity was computed, and the observed running time on different machines was also recorded.
The worst-case time complexity of our approach is O(n), because n datasets are used
and each of the following operations takes one unit of time, i.e., each operation is O(1).
The observed computation times are presented in Table 10.



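Table 10. Observed computation time on different machines.

| CPU (Clock) | Move | Add. | Sub. | Mul. | Div. | Comp. | Speed |
|---|---|---|---|---|---|---|---|
| TMS320C30 (16.67 MHz) | 22 | 3n | 4 | 3 | 4 | 19 | 6 ms |
| MC68000 (16 MHz) | 5 | 20 | 20 | 70 | 160 | 15 | 59 ms |
| MC68000 (16 MHz) | 22 | 3n | 4 | 3 | 4 | 19 | 59 ms |
| Z80 (8 MHz) | 22 | 3n | 4 | 3 | 4 | 19 | 280 ms |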

5. Evaluation

Recall is defined as the number of relevant movies retrieved by a search divided by the total
number of existing relevant movies, while precision is defined as the number of relevant movies
retrieved by a search divided by the total number of movies retrieved by the search. Precision is
the proportion of recommendations that are good recommendations,

$$\text{Precision} = \frac{tp}{tp + fp} \tag{19}$$

and recall is the proportion of good recommendations that appear in the top recommendations,

$$\text{Recall} = \frac{tp}{tp + fn} \tag{20}$$

where

tp: the movie is predicted as interesting and the prediction is correct; it really is interesting.
tn: the movie is predicted as uninteresting and the prediction is correct; it really is uninteresting.
fp: the movie is predicted as interesting but the prediction is wrong; it is actually uninteresting.
fn: the movie is predicted as uninteresting but the prediction is wrong; it is actually interesting.

In the recommendation domain, a perfect precision score of 1.0 means that every movie
recommended in the list was good (although this says nothing about whether all good recommendations
were suggested), whereas a perfect recall score of 1.0 means that all good movies
were suggested in the list. Typically, when a recommender system is tuned to increase precision,
recall decreases as a result (or vice versa).

$$\text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{21}$$

Table 11 shows some outcomes of our recommendation system.

Table 11. Outcomes of multi-variant recommendation system.

| Evaluating Parameters | TP | TN | FP | FN |
|---|---|---|---|---|
| Aggregated Polarity | 220 | 356 | 278 | 146 |
| Weighted Average Polarity | 486 | 244 | 196 | 74 |
| Aggregated Weighted Average Polarity | 640 | 138 | 97 | 125 |
| Final Ranking Score | 953 | 11 | 16 | 5 |

Table 12 and Figure 10 report the results. For recommendations in this domain, the F-measure
combines the precision and recall measures into a single value that indicates the overall utility of the
recommendation list. One thousand movies were used as the example data set. Evaluation is
an important part of building a recommendation engine, since it can be used to empirically
discover improvements to a recommendation algorithm.

This research used the MovieLens 1K dataset. There are 943 users and 1000 movies; we used
the 1000 ratings, votes, likes and views from the users on the films to test the performance of
the proposed method.

The F-measure results distinguish the accuracy of our work from that of the others. The
multi-variant system provided an accuracy (F-measure) of about 98.6%.



Table 12. Weighted average polarity scores.

| Evaluating Parameters | Aggregated Polarity | Weighted Average Polarity | Aggregated Weighted Average Polarity | Multi-Variant |
|---|---|---|---|---|
| Precision | 0.3819 | 0.6658 | 0.8226 | 0.9886 |
| Recall | 0.4418 | 0.7126 | 0.8684 | 0.9835 |
| F-measure | 0.4097 | 0.6884 | 0.8449 | 0.9860 |
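Equations (19)–(21) can be checked against the reported numbers with a few lines of code. The sketch below uses the Final Ranking Score row of Table 11. One observation, offered as an assumption rather than a correction: the Table 12 values appear to be reproduced when the second and third numeric columns of Table 11 are read as fp and fn respectively, which suggests that the column headers of Table 11 may be permuted. The function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (19): share of recommended movies that are good recommendations."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Eq. (20): share of good movies that were recommended."""
    return tp / (tp + fn)


def f_score(p: float, r: float) -> float:
    """Eq. (21): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Final Ranking Score row of Table 11, in printed column order.
first, second, third, fourth = 953, 11, 16, 5

# Assumption: the second column is treated as fp and the third as fn.
p = precision(first, second)
r = recall(first, third)
f = f_score(p, r)

print(round(p, 4), round(r, 4), round(f, 4))
```

With this reading the rounded results agree with the Multi-Variant column of Table 12.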


Figure 10. Performance comparison of movie recommendation system.

6. Conclusions

This paper presented an intelligent and automated recommender system that provides accurate,
topic-based (action, comedy, horror, etc.) movie recommendations to users. The approach
relies on both quantitative and qualitative data to achieve authentic recommendations. The
quantitative data include ratings, votes, likes, etc., and the qualitative data is the polarity score that
is calculated from user reviews using NLP and opinion mining techniques. The developed application
was tested on three external data sources, Metacritic, IMDB and Fandango, and achieved
better results in terms of true recommendations than previous approaches. The presented
recommender system was developed using a NoSQL environment with Apache Hadoop to filter and
integrate movie descriptions from linked data or an ontology (linkedmdb). Our approach used fuzzy
logic for movie ranking categorization. A front-end Android application was designed for
the interaction of a user with the movie recommender through web services. When a user searches for
a movie through the mobile app, the server responds with a recommended list of
movies. Thus, users can make a decision before committing watch time to a movie and can conserve
other important resources, such as money and energy.

7. Future Work

Further work is required to enhance the system for both registered and unregistered viewers
or app users by adding more parameters, such as showbiz industry influence, movie quality,
movie trends and user-profile-based movie recommendation, in a NoSQL distributed
environment. Semantic and sentiment computation is required to find the semantic relation between
the movies and users as well as the psychological influence of movies.

Author Contributions: M.I. designed the algorithm and conducted the experiments. I.S.B. supervised the
research work.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.



Appendix A.

An Android app provides the following features in its user interface: users can request or
query a movie, and the app responds with a movie list, which is provided
by the server machine through its interaction with the app. Figure A1 illustrates the specification
of the mobile application.


Figure A1. Illustration of multi-variant recommendation.
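
As a rough illustration of the request and response flow described in Appendix A, the following
Python sketch shows an app-side query for a topic and a server-side handler that returns the matching
movies ordered by recommendation score. The catalogue, the field names (topic, score), and the
functions handle_request and query_movies are hypothetical and are not taken from the implemented
system. The real app reaches the server through web services, which this sketch replaces with
a direct function call.

```python
import json

# Hypothetical in-memory catalogue; titles, topics, and scores are invented for illustration.
MOVIES = [
    {"title": "Movie A", "topic": "comedy", "score": 0.82},
    {"title": "Movie B", "topic": "horror", "score": 0.67},
    {"title": "Movie C", "topic": "comedy", "score": 0.91},
    {"title": "Movie D", "topic": "comedy", "score": 0.40},
]


def handle_request(request_json, limit=10):
    """Server side (assumed): return movies for the requested topic, best score first."""
    request = json.loads(request_json)
    topic = request["topic"].strip().lower()
    matches = [m for m in MOVIES if m["topic"] == topic]
    matches.sort(key=lambda m: m["score"], reverse=True)
    return json.dumps({"topic": topic, "movies": matches[:limit]})


def query_movies(topic, limit=10):
    """App side (assumed): build the request and decode the server's reply."""
    reply = handle_request(json.dumps({"topic": topic}), limit)
    return [m["title"] for m in json.loads(reply)["movies"]]


print(query_movies("Comedy", limit=2))  # ['Movie C', 'Movie A']
```

Exchanging JSON in both directions keeps the app independent of how the server stores and ranks
movies, so a sketch like this could be pointed at a real web service without changing the app-side
parsing.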

Appendix B.

All experiments testing the performance and accuracy of the proposed approach were run on
an Intel i7 processor at 3.4 GHz with 8 GB of memory, running 64-bit Linux/Ubuntu 14.04.
The open-source NLTK toolkit (written in Python), the Stanford CoreNLP Natural Language
Processing Toolkit, and libraries [50–52], as well as Apache Hadoop 2.0, were used to deploy
the NoSQL environment for our movie recommendation system. Table A1 lists the hardware and
software specifications.

Table A1. Server machine and Android device specification.

Resources | Specification of Server Machine | Specification of Android Device
Processor | Intel i7 processor with 3.4 GHz clock rate | Qualcomm Snapdragon 835, 2.45 GHz octa-core Kryo 280 CPU, Adreno 540 GPU
RAM | 8 GB/machine | 4 GB
NoSQL | Hadoop 2.0, Apache Cassandra |
Operating System | Linux/Ubuntu 14.04 | Android 7.1.1
Storage | 1 TB | 128 GB (UFS 2.1)
Topology | Connected by gigabit Ethernet cable | 802.11ac Wi-Fi with MIMO, Bluetooth 5.0 LE
APP | Web-bot | Android app

References

1. Raigoza, J.; Karande, V. A Study and Implementation of a Movie Recommendation System in a Cloud-based
Environment. Int. J. Grid High Perform. Comput. 2017, 9, 25–36. [CrossRef]

2. Christakou, C.; Vrettos, S.; Stafylopatis, A. A hybrid movie recommender system based on neural networks.
Int. J. Artif. Intell. Tools 2007, 16, 771–792. [CrossRef]

3. Said, A.; Kille, B.; de Luca, E.W.; Albayrak, S. Personalizing Tags: A Folksonomy-Like Approach for
Recommending Movies. In Proceedings of the 2nd International Workshop on Information Heterogeneity
and Fusion in Recommender Systems, Chicago, IL, USA, 27 October 2011; pp. 53–56.

4. Zenebe, A.; Norcio, A.F. Representation, similarity measures and aggregation methods using fuzzy sets
for content-based recommender systems. Fuzzy Sets Syst. 2009, 160, 76–94. [CrossRef]

5. Singh, D.K.; Gangwar, A.; Sharma, A. Movie Recommendation System. Volume 4. Available online:
www.ijariit.com (accessed on 23 July 2018).

6. Wang, Z.; Yu, X.; Feng, N.; Wang, Z. An improved collaborative movie recommendation system using
computational intelligence. J. Vis. Lang. Comput. 2014, 25, 667–675. [CrossRef]

7. Jain, K.N.; Kumar, V.; Kumar, P.; Choudhury, T. Movie Recommendation System. In Intelligent Computing and
Information and Communication; Springer: Singapore, 2018; pp. 677–686.

8. Yessenov, K.; Misailovic, S. Sentiment Analysis of Movie Review Comments. Methodology 2009, 17, 1–7.

9. Bhuiyan, H.; Ara, J.; Bardhan, R.; Islam, R. Retrieving YouTube Video by Sentiment Analysis on User
Comment. In Proceedings of the 2017 IEEE International Conference on Signal and Image Processing
Applications (IEEE ICSIPA 2017), Kuching, Malaysia, 12–14 September 2017.

10. Singh, V.K.; Piryani, R.; Uddin, A.; Waila, P. Sentiment analysis of movie reviews: A new feature-based
heuristic for aspect-level sentiment classification. In Proceedings of the 2013 International Multi-Conference
on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), Kottayam, India,
22–23 March 2013.

11. Alsaqer, A.F.; Sasi, S. Movie Review Summarization and Sentiment Analysis using RapidMiner.
In Proceedings of the 2017 International Conference on Networks & Advances in Computational Technologies
(NetACT), Thiruvananthapuram, India, 20–22 July 2017.

12. Ouyang, C.; Liu, Y.; Zhang, S.; Yang, X. Features-level Sentiment Analysis of Movie reviews. Adv. Sci. Technol.
Lett. 2015, 81, 110–113.

13. Hsieh, M.Y.; Chou, W.K.; Li, K.C. Building a mobile movie recommendation service by user rating and APP
usage with linked data on Hadoop. Multimed. Tools Appl. 2017, 76, 3383–3401. [CrossRef]

14. Godhani, G.; Dhamecha, M. A Study on Movie Recommendation System Using Parallel Map Reduce Technology;
V.V.P. Engineering College: Rajkot, India, 2017; Volume 5.

15. Reza, M.; Sinha, A.; Nag, R.; Mohanty, P. CUDA-enabled Hadoop cluster for Sparse Matrix Vector
Multiplication. In Proceedings of the 2015 IEEE 2nd International Conference on Recent Trends in
Information Systems (ReTIS), Kolkata, India, 9–11 July 2015.

16. Castells, P.; Fernández, M.; Vallet, D. An Adaptation of the Vector-Space Model for Ontology-Based
Information Retrieval. IEEE Trans. Knowl. Data Eng. 2007, 19, 261–272. [CrossRef]

17. Wang, J.; Liu, T. Improving Sentiment Rating of Movie Review Comments for Recommendation.
In Proceedings of the 2017 IEEE International Conference on Consumer Electronics—Taiwan (ICCE-TW),
Taipei, Taiwan, 12–14 June 2017.

18. Wijaya, D.T.; Bressan, S. A Random Walk on the Red Carpet: Rating Movies with user reviews and pagerank.
In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA,
USA, 26–30 October 2008; ACM: New York, NY, USA, 2008; pp. 951–960.

19. Chang, A.; Liao, J.F.; Chang, P.C.; Teng, C.H.; Chen, M.H. Application of artificial immune systems combines
collaborative filtering in movie recommendation system. In Proceedings of the 2014 IEEE 18th
International Conference on Computer Supported Cooperative Work in Design (CSCWD), Hsinchu, Taiwan,
21–23 May 2014; pp. 277–282.

20. Tumasjan, A.; Sprenger, T.O.; Sandner, P.G.; Welpe, I.M. Predicting Elections with Twitter: What 140 Characters
Reveal about Political Sentiment. ICWSM 2010, 10, 178–185.

21. He, W.; Zha, S.; Li, L. Social media competitive analysis and text mining: A case study in the pizza industry.
Int. J. Inf. Manag. 2013, 33, 464–472. [CrossRef]

22. Murnane, E.L.; Counts, S. Unraveling abstinence and relapse: Smoking cessation reflected in social media.
In Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, Toronto,
ON, Canada, 26 April–1 May 2014; ACM: New York, NY, USA, April 2014; pp. 1345–1354.

23. Diakopoulos, N.; Naaman, M.; Kivran-Swaine, F. Diamonds in the rough: Social media visual analytics for
journalistic inquiry. In Proceedings of the 2010 IEEE Symposium on Visual Analytics Science and Technology,
Salt Lake City, UT, USA, 25–26 October 2010; pp. 115–122.

24. Baldwin, T.; Cook, P.; Lui, M.; MacKinlay, A.; Wang, L. How Noisy Social Media Text, How Diffrnt Social
Media Sources? In IJCNLP; The Association for Computational Linguistics: Stroudsburg, PA, USA, October
2013; pp. 356–364.

25. Corley, C.D.; Cook, D.J.; Mikler, A.R.; Singh, K.P. Text and structural data mining of influenza mentions in
web and social media. Int. J. Environ. Res. Public Health 2010, 7, 596–615. [CrossRef] [PubMed]

26. Asur, S.; Huberman, B.A. Predicting the future with social media. In Proceedings of the 2010
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Toronto,
ON, Canada, 31 August–3 September 2010; Volume 1, pp. 492–499.

27. Timmaraju, A.; Khanna, V. Sentiment Analysis on Movie Reviews using Recursive and Recurrent Neural
Network Architectures. Available online: https://cs224d.stanford.edu/reports/TimmarajuAditya.pdf
(accessed on 14 November 2018).

28. Sarker, A.; Ginn, R.; Nikfarjam, A.; O’Connor, K.; Smith, K.; Jayaraman, S.; Upadhaya, T.; Gonzalez, G.
Utilizing social media data for pharmacovigilance: A review. J. Biomed. Inform. 2015, 54, 202–212. [CrossRef]
[PubMed]

29. Resnick, P.; Iacovou, N.; Suchak, M.; Bergstrom, P.; Riedl, J. GroupLens: An open architecture for collaborative
filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative
Work (CSCW ’94), Chapel Hill, NC, USA, 22–26 October 1994; ACM: New York, NY, USA, 1994; pp. 175–186.

30. Shardanand, U.; Maes, P. Social information filtering: Algorithms for automating “word of mouth”.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’95); Katz, I.R., Mack, R.,
Marks, L., Rosson, M.B., Nielsen, J., Eds.; ACM Press/Addison-Wesley Publishing Co.: New York, NY, USA,
1995; pp. 210–217.

31. Lekakos, G.; Caravelas, P. A Hybrid Approach for Movie Recommendation; Springer Science + Business Media:
New York, NY, USA, 21 December 2006.

32. Tumsare, P.; Sambare, A.S.; Jain, S.R. Sentiment Analysis Approach for Movie Reviews of Natural Language.
Int. J. Res. Comput. Commun. Technol. 2014, 3, 256–261.

33. Kreutzer, J.; Witte, N. Opinion Mining Using SentiWordNet Semantic Analysis; HT 2013/14; Uppsala University:
Uppsala, Sweden, 2013.

34. Haddi, E.; Liu, X.; Shi, Y. The Role of Text Pre-processing in Sentiment Analysis. Procedia Comput. Sci.
2013, 17, 26–32. [CrossRef]

35. Webster, J.J.; Kit, C. Tokenization as the initial phase in NLP. In Proceedings of the 14th conference on
Computational linguistics, Nantes, France, 23–28 August 1992.

36. Vijayarani, S.; Janani, M.R. Text mining: Open source tokenization tools—An analysis. Adv. Comput. Intell.
Int. J. 2016, 3, 37–47.

37. Issac, B.; Jap, W.J. Implementing spam detection using Bayesian and Porter stemmer keyword
stripping approaches. In Proceedings of the TENCON 2009–2009 IEEE Region 10 Conference, Singapore,
23–26 November 2009; pp. 1–5.

38. Porter, M.F. An Algorithm for Suffix Stripping. J. Program. 1980, 14, 130–137. [CrossRef]

39. Alphabetical List of Part-Of-Speech Tags Used in the Penn Treebank Project. Available online:
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html (accessed on 11 June 2018).

40. Tf-idf Weighting. Available online:
https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html (accessed on 7 April 2018).

41. Hakim, A.A.; Erwin, A.; Eng, K.I.; Galinium, M.; Muliady, W. Automated document classification
for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF)
approach. In Proceedings of the 2014 6th International Conference on Information Technology and Electrical
Engineering (ICITEE), Yogyakarta, Indonesia, 7–8 October 2014; pp. 1–4.

42. Esuli, A.; Sebastiani, F. Sentiwordnet: A Publicly Available Lexical Resource for Opinion Mining. Available
online: http://nmis.isti.cnr.it/sebastiani/Publications/LREC06.pdf (accessed on 21 September 2018).

43. Baccianella, S.; Esuli, A.; Sebastiani, F. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis
and opinion mining. LREC 2010, 10, 2200–2204.

44. Penalver-Martinez, I.; Garcia-Sanchez, F.; Valencia-Garcia, R.; Rodriguez-Garcia, M.A.; Moreno, V.; Fraga, A.;
Sanchez-Cervantes, J.L. Feature-based opinion mining through ontologies. Expert Syst. Appl. 2014, 41, 5995–6008.
[CrossRef]

45. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [CrossRef]

46. Pedersen, T.; Patwardhan, S.; Michelizzi, J. WordNet::Similarity—Measuring the Relatedness of Concepts;
The Association for Computational Linguistics: Stroudsburg, PA, USA, 2004.

47. Bird, S.; Loper, E. NLTK: The Natural Language Toolkit; The Association for Computational Linguistics:
Stroudsburg, PA, USA, 2004.

48. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.J.; McClosky, D. The Stanford CoreNLP
Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics: System Demonstrations 2014, Baltimore, MD, USA, 22–27 June 2014.

49. Atserias, J.; Casas, B.; Comelles, E.; Gonzàlez, M.; Padró, L.; Padro, M. FreeLing 1.3: Syntactic and Semantic
Services in an Open-Source NLP Library; TALP Research Center Universitat Politècnica de Catalunya: Barcelona,
Spain, 2006.

50. Tiwari, J.; Pawar, M.; Pandey, A. A Hadoop based collaborative filtering recommender system accelerated on
GPU using OpenCL. Int. J. Eng. Sci. Res. Technol. 2017, 6, 195–209, Retrieved 5 September 2017.

51. Thangavel, S.K.; Thampi, N.S.; Johnpaul, C.I. Performance Analysis of Various Recommendation Algorithms
Using Apache Hadoop and Mahout. Int. J. Sci. Eng. Res. 2013, 4, 279–287.

52. Jose, A.V.; Jini, K.M. Personalized Movie Recommender System using Rank Boosting Approach on Hadoop.
IJIRST Int. J. Innov. Res. Sci. Technol. 2015, 2, 2349–6010.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
