Data mining models for student careers


Renza Campagni, Donatella Merlini ⇑, Renzo Sprugnoli, Maria Cecilia Verri
Dipartimento di Statistica, Informatica, Applicazioni, Università degli Studi di Firenze, Viale Morgagni 65, 50134 Firenze, Italy

Article info

Article history:
Available online 6 March 2015

Keywords:
Data mining
Educational data mining
Student careers
Clustering
Frequent pattern analysis

Abstract

This paper presents a data mining methodology to analyze the careers of university graduates.
We present different approaches based on clustering and sequential pattern techniques in order to
identify strategies for improving the performance of students and the scheduling of exams. We introduce an
ideal career as the career of an ideal student who has taken each examination just after the end of the
corresponding course, without delays. We then compare the career of a generic student with the ideal
one by using these techniques. Finally, we apply the methodology to a real case
study and interpret the results, which show that the more closely students follow the order given by the ideal
career, the better they perform in terms of graduation time and final grade.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

The large supply of data stored in the computer systems of several
companies, both public and private, has spurred the development of
new technologies for data management and analysis. Data mining
techniques originate in this context, with the aim of discovering hidden
and non-trivial relationships among information of various kinds. This
collection of techniques, used in different sectors, including the educational
environment, derives from traditional methods of data analysis
and is characterized by the ability to handle large amounts of
data.

In the field of education, educational data mining is a recent
research area that explores and analyzes the information stored
in student databases in order to understand and improve the per-
formance of the student learning process. Data are analyzed by
using statistical, machine learning and data mining algorithms,
with the aim of resolving problems of educational research and
improving the entire educational process. Recently there has been
an increase in the use of educational software tools and of
databases containing student information, so large
repositories of data reflecting how students learn are now available. In addition, the
use of the Internet in education has created the context of e-learning,
or web-based education, which continuously generates large
amounts of data concerning the interactions between teaching
and learning. Educational data mining tries to use all this
information to better understand learners and learning, and to
develop methodologies which, by integrating the data with theory,
make it possible to improve the educational process. Educational data
mining is a growing research area that involves researchers all over
the world from different and related research areas; since 2008,
an annual International Conference on Educational Data Mining
has been held (http://www.educationaldatamining.org).
Great efforts have been made to describe the
state of the art of this research area and, in recent years, several
survey papers have been published on the subject (Baker, 2010,
2014; Baker & Yacef, 2009; Luan, 2002; Peña-Ayala, 2014;
Romero & Ventura, 2010, 2013).

As already observed, data mining techniques are usually applied
to large data sets. In the context of education, however, we are
often faced with data sets corresponding to small groups of students
following the same curriculum. In a university context,
for example, even when a degree program is attended by
many students, the data of interest correspond to relatively small
data sets. A recent paper (Natek & Zwilling, 2014) focuses on
the study of data mining techniques applied to small data sets
concerning higher education institutions and concludes that the use of
these techniques in real-life situations is useful and promising and
can provide administrators with valuable decision-support tools.

Over the years, several data mining models have been designed
and implemented to analyze the performance of students. For
example, in Delavari, Shirazi, and Beikzadeh (2004) and Delavari,
Somnuk, and Beikzadeh (2008), a model is proposed which presents
the advantages of data mining technology in higher educational
systems; the authors give a sort of road map to help
institutions identify ways to improve their processes. In
Daimi and Miller (2009), the authors illustrate a classification

http://dx.doi.org/10.1016/j.eswa.2015.02.052
0957-4174/© 2015 Elsevier Ltd. All rights reserved.

⇑ Corresponding author.
E-mail addresses: renza.campagni@unifi.it (R. Campagni), donatella.merlini@unifi.it (D. Merlini),
renzo.sprugnoli@unifi.it (R. Sprugnoli), mariacecilia.verri@unifi.it (M.C. Verri).

Expert Systems with Applications 42 (2015) 5508–5521

Journal homepage: www.elsevier.com/locate/eswa

model to investigate the profile of students who are most likely to leave
university without completing their studies. In particular, they use some
classification algorithms implemented in the WEKA system (Witten,
Frank, & Hall, 2011). Recommendations of suitable courses for students
are analyzed with different approaches in Bydzovska and
Popelínský (2014) with the aim of predicting student success. In
Damaševičius (2010) a framework is proposed for mining educational
data using association rules. More recently, Romero, Zafra,
Luna, and Ventura (2013) propose the application of association
rule mining to improve quizzes and courses, and Saarela et al.
(2014) apply frequent itemset mining and association rule learning
to students previously grouped by clustering techniques. In
Guruler, Istanbullu, and Karahasan (2010), in order to explore the
factors having impact on the success of university students, a
system based on the decision tree classification technique is pre-
sented. Clustering is used in Campagni, Merlini, and Verri (2014)
for analyzing data concerning the evaluation of courses taken by
students, linked to their results in the corresponding exams. The
work presented in Dutt, Aghabozrgi, Ismail, and Mahroeian
(2015) reviews different clustering algorithms applied in the educational
data mining context, while Peña-Ayala (2014) is an interesting
review of recent educational data mining developments whose
contents are in turn analyzed by a data mining approach. As
already observed, data mining techniques have also been applied
in computer-based, e-learning and web-based educational systems
(Bouchet, Harley, & Trevors, 2013; Bogarín, Romero, Cerezo, &
Sánchez-Santillán, 2014; Castro, Vellido, Nebot, & Mugica, 2007;
Hämäläinen, Laine, & Sutinen, 2006; Koedinger, Cunningham,
Skogsholm, & Leber, 2008; Mostow & Beck, 2006; Merceron &
Yacef, 2005; Romero, Romero, Luna, & Ventura, 2010; Romero,
Ventura, & García, 2008; Romero, López, Luna, & Ventura, 2013).
The existing literature about the use of data mining in educational
systems is mainly concerned with techniques such as clustering,
classification and association rules (Damaševičius, 2010; Tan,
Steinbach, & Kumar, 2006; Witten et al., 2011; Wu & Kumar, 2009).

An academic curriculum usually defines a specific learning pro-
gram which puts some types of restrictions on how the students
are required to take courses. These constraints typically describe
a set of courses and a set of relationships between them. In
current practice, however, students have many degrees of freedom;
therefore helping students to choose courses, discovering patterns
and key courses, planning future courses and refining curricula
based on the feedback of students are important educational tasks,
as recently pointed out in Aher and Lobo (2013), Kardan, Sadeghi,
Ghidary, and Sani (2013), Méndez, Ochoa, and Chiluiza (2014) and
Pechenizkiy, Trcka, Bra, and Toledo (2012). The present work fits
into this context extending and unifying the results presented in
Campagni, Merlini, and Sprugnoli (2012a, 2012b, 2012c). In par-
ticular, we introduce the concept of ideal career, that is, the career
of a graduated student who takes every examination just after the
end of the corresponding course, without delay, and propose a data
mining methodology, based on clustering and sequential pattern
analysis, to study the student behavior by comparing student
careers with the ideal one. Sequential pattern analysis has been
used in the context of educational data mining mainly in
computer-based environments. For example, Soundranayagam and
Yacef (2010) explore the order in which students access e-learning
resources as they solve set assessment tasks, such as tests,
assignments and exams, and the links with students' learning. A
method to automatically detect collaborative patterns of student
and tutor dialogue moves is illustrated in D’Mello, Olney, and
Person (2010). Martinez, Yacef, Kay, Al-Qaraghuli, and
Kharrufa (2011) mine and cluster frequent patterns to compare
distinct behaviors between low and high achievement groups
around an interactive tabletop. A data mining methodology for
identifying and comparing learning behaviors from students' learning
interaction traces is presented in Kinnebrew, Loretz, and
Biswas (2013); in particular, the paper proposes an algorithm that
employs a novel combination of sequence mining techniques to
identify differentially frequent patterns between groups of students.
Guerra, Sahebi, Brusilovsky, and Lin (2014) model
and examine patterns of student behavior with parameterized
exercises. A recent study that proceeds in a direction similar
to ours is illustrated in Asif, Merceron, and Pathan (2014), where
the progression of a student is analyzed by defining a tuple that
shows how the results of a year stay the same, increase or decrease
compared to the first year.

The preprocessing phase is the first step in any data mining pro-
cess and allows us to transform the available data into a format
suitable for the analysis. The importance of this task has been
recently highlighted in Romero, Romero, and Ventura (2014). In
Section 2, we illustrate the preprocessing phase necessary to orga-
nize data for our analysis. A crucial aspect during this phase is the
insertion in the database of the reference to the semester in which
a course has been given by a teacher and the semester in which the
student has taken the corresponding exam. This information
allows us to define the ideal career together with the career of each
graduate student. We represent a career as a trajectory of points in
the plane. In particular, the ideal career is defined by a sequence of
points $s_I = ((0, e_0), (1, e_1), (2, e_2), \cdots, (n, e_n), (n+1, e_{n+1}))$, where $e_i$
is an exam identifier and $i$ its position in the career. The position
$i = 0$ denotes the starting point of the career while $i = n+1$
corresponds to the final examination given last by all students.
This particular career, without loss of generality, can be represented
by the bisecting line of the first quadrant (green lines in
Fig. 1). The career of a generic student $J$ is then represented by
a broken line, corresponding to the sequence of points
$t_J = ((0, e_{J_0}), (1, e_{J_1}), (2, e_{J_2}), \cdots, (n, e_{J_n}), (n+1, e_{J_{n+1}}))$, where $e_{J_i}$
is the identifier of the exam given by student $J$ at time $i$ (red lines
in Fig. 1). We then compute the distance between a generic career
$t_J$ and $s_I$ in different ways, by using for example the Bubblesort
distance, defined as the number of inversions in the permutation
relative to $t_J$, or the area between the lines $t_J$ and $s_I$. Finally, we
insert these values in the database.
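The two distances mentioned above can be illustrated with a minimal Python sketch. It assumes that each career is first rewritten as a list of ideal positions, so that the ideal career becomes the identity permutation. The function names, the exam identifiers, and the way the area is computed (exact integration of the vertical gap between the broken line and the bisector, including segments that cross it) are illustrative assumptions, not the authors' implementation.

```python
from fractions import Fraction


def to_positions(ideal_order, student_order):
    """Replace each exam identifier by its position in the ideal career."""
    ideal_pos = {exam: i for i, exam in enumerate(ideal_order)}
    return [ideal_pos[exam] for exam in student_order]


def bubblesort_distance(career):
    """Count inversions: pairs i < j with career[i] > career[j]."""
    n = len(career)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if career[i] > career[j])


def area_from_ideal(career):
    """Area between the broken line (i, career[i]) and the bisector y = x."""
    area = Fraction(0)
    for i in range(len(career) - 1):
        a = career[i] - i              # vertical gap at the left endpoint
        b = career[i + 1] - (i + 1)    # vertical gap at the right endpoint
        if a * b >= 0:
            # Segment stays on one side of the bisector: a trapezoid.
            area += Fraction(abs(a) + abs(b), 2)
        else:
            # Segment crosses the bisector: two triangles.
            area += Fraction(a * a + b * b, 2 * (abs(a) + abs(b)))
    return area


# Hypothetical exam identifiers; "start" and "final" play the roles
# of e_0 and e_{n+1} in the text.
ideal_order = ["start", "A", "B", "C", "D", "final"]
student_order = ["start", "B", "A", "D", "C", "final"]
career = to_positions(ideal_order, student_order)
```

In this toy example the student swaps two pairs of adjacent exams, so the Bubblesort distance is 2, while the ideal career has distance 0 under both measures.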

In Section 3, we analyze the preprocessed data with clustering
and sequential pattern techniques.

As regards cluster analysis, the idea is to explore the
database with the aim of understanding whether there is a relation
between the distance from the ideal career and the success of students.
This kind of analysis, accompanied by cluster validation, can
highlight different groups of students characterized by similar
distances and behaviors and can give some suggestions to improve
the organization of the laurea degree or to recommend precedence
relations among courses.
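As a rough illustration of this kind of analysis, the following sketch groups students by their distance from the ideal career and their graduation time. This passage does not name a clustering algorithm, so the use of plain k-means (Lloyd's algorithm) is an assumption, and the student records, the feature choice, and the initial centroids are hypothetical values chosen only to make the example reproducible.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on tuples of numbers; returns (labels, centroids)."""
    centroids = list(centroids)  # work on a copy of the initial centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2
                                  for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids


# Hypothetical students: (Bubblesort distance, semesters to graduate).
students = [(0, 6), (1, 6), (2, 7), (14, 10), (15, 11), (17, 12)]
labels, centers = kmeans(students, [students[0], students[3]])
```

On this toy data the first three students (small distance, short graduation time) end up in one cluster and the last three in the other. In practice the features would need to be scaled to comparable ranges, and the resulting clusters would need to be validated, as the text notes.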

Sequential pattern analysis aims to find relationships between
occurrences of sequential events, that is, to find if any specific
order of the occurrences exists. In this paper we consider as events
the exams taken by a student; the temporal information is the
semester in which the exam has been taken or the delay with
which it has been taken. We study an organization of the univer-
sity which allows students to take an exam in different sessions
after the end of the course, as in Italy. The temporal information
allows us to see the career of a student as a sequence $\langle s_1 s_2 \ldots s_m \rangle$
where each element $s_j$ is a collection of one or more exams taken
in the same semester or having the same delay. If we use the seme-
ster as temporal information, m indicates the number of semesters
in which a student takes exams; if we use the delay, m indicates
the maximum number of delays, in semesters, with which a stu-
dent takes one or more exams. By analyzing the sequential pat-
terns, we can explain some behaviors which may seem
