key: cord-0046918-8he5e5pu
authors: Rubio, Miguel A.
title: Automated Prediction of Novice Programmer Performance Using Programming Trajectories
date: 2020-06-10
journal: Artificial Intelligence in Education
DOI: 10.1007/978-3-030-52240-7_49
sha: 0648c01f030da44fe9e0f98ef9a65efa4377652a
doc_id: 46918
cord_uid: 8he5e5pu

Online programming courses have become widely available and host thousands of learners every year. In these courses, participants must solve programming exercises by submitting partial solutions and checking the outcome. The sequence of partial solutions submitted by a student constitutes the programming trajectory followed by the student. In our work, we define a supervised machine learning algorithm that takes as input these programming trajectories and predicts whether a student will successfully complete the next exercise. We have validated our model with two different datasets: the first one is a set of problems from the online learning platform Robomission with over one hundred thousand exercises submitted. The second one comprises one hundred thousand exercises submitted to the Hour of Code challenge. The results obtained indicate that our model can accurately predict the future performance of the students. This work provides not only a new method to represent students’ programming trajectories but also an efficient approach to predict the students’ future performance. Furthermore, the information provided by the model can be used to select the students that would benefit from an intervention.

Online programming courses have emerged as a popular way to introduce students to programming [1]. These courses present several advantages: they are easily accessible, and students face interesting challenges. Unfortunately, it is not feasible to provide individual support to each student due to the large number of students enrolled in these courses.
Automatic systems capable of providing adaptive support could enhance the students' experience and improve their success rate [2]. To build such systems, we need models capable of detecting students who are likely to fail [3-5]. These models could use the large datasets that students generate when completing programming tasks [6, 7]. Students usually submit several partial solutions before solving a task, creating a programming trajectory for each exercise [8, 9]. These programming trajectories can be analyzed by machine learning systems to find general patterns [10].

In this study we present a supervised machine learning model that predicts a student's future programming performance. The model takes the programming trajectory followed by the student and estimates the probability of the student successfully completing the next exercise. The model has been validated using two datasets obtained from two different online programming environments: Robomission [11] and the Hour of Code challenge from Code.org [12]. Our results indicate that this model can accurately predict whether a student will be able to successfully complete a programming exercise. The information provided by the model can be used to rank students in terms of their performance. Using this ranking, one can automatically select the group of students that would benefit most from an intervention.

In this study we worked with two different datasets. The first dataset is a set of programming trajectories submitted by students while completing one exercise in the Hour of Code challenge [13]. Additionally, for each student the dataset records whether the student successfully completed the next task. The exercises and their solutions are shown in Fig. 1. Piech et al. [8] describe this dataset in more detail. The second dataset comprises 85 programming tasks from the Robomission programming platform.
Effenberger [14] gives a thorough description of the dataset.

Our goal is to develop a supervised machine learning algorithm capable of predicting whether the student will successfully complete the next exercise. To this end we use the programming trajectories followed by the students, $T = \{w_0, w_1, \ldots, w_n\}$, where $w_0$ is the state before the student starts to work, $w_i$ are the code snapshots submitted by the student, and $w_n$ is the last snapshot.

The training phase is straightforward: all the programming trajectories present in the training dataset are assembled into a tree. Different branches of the tree contain information about different programming trajectories. Figure 2 describes the process of integrating a new trajectory $\{w_0, w_1, w_5\}$ into a tree. For each code snapshot present in the trajectory we check whether there is a branch in the tree with matching snapshots. If there is such a branch, we follow it while the partial solutions match. As soon as we find a partial solution ($w_5$ in this case) that is not present in the branch, a new branch is created. Once we have processed all the student trajectories to generate the tree, we store in each node the relevant parameters of the students that ended their programming trajectories in that node. In this study we stored the proportion of students that successfully completed the next exercise.

After assembling the tree, we can estimate the probability that a new student with trajectory $T_i$ will successfully complete the next exercise. To classify the student, we only need to compare this probability with the threshold that we have selected.

We have selected the Receiver Operating Characteristic (ROC) curve [15] and the area under the curve (AUC) to measure the performance of the classifier. We have used 10-fold cross-validation [16] stratified over students to compute them. We will compare our model's optimal performance with the results of a simple baseline model.
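The tree construction and lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the names `Node`, `TrajectoryTree`, `add` and `predict_proba` are hypothetical, snapshots are assumed to be hashable values (for example, normalized program text), and the root node stands for the state before the student starts to work. The text does not say how a trajectory that leaves the tree is handled; the sketch assumes a fallback to the deepest matched node where some training trajectory ended, and then to the overall success rate.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Sequence


@dataclass
class Node:
    """One code snapshot in the trajectory tree (hypothetical structure)."""
    children: Dict[Hashable, "Node"] = field(default_factory=dict)
    ended: int = 0      # training trajectories that ended at this node
    succeeded: int = 0  # of those, how many solved the next exercise


class TrajectoryTree:
    """Prefix tree over programming trajectories."""

    def __init__(self) -> None:
        self.root = Node()  # state before the student starts to work
        self.total = 0
        self.total_succeeded = 0

    def add(self, snapshots: Sequence[Hashable], solved_next: bool) -> None:
        """Follow matching branches; create a new branch at the first mismatch."""
        node = self.root
        for snap in snapshots:
            node = node.children.setdefault(snap, Node())
        node.ended += 1
        node.succeeded += int(solved_next)
        self.total += 1
        self.total_succeeded += int(solved_next)

    def predict_proba(self, snapshots: Sequence[Hashable]) -> float:
        """Estimated probability of solving the next exercise."""
        if self.total == 0:
            raise ValueError("the tree has no training trajectories")
        node = self.root
        best = self.root if self.root.ended else None
        for snap in snapshots:
            node = node.children.get(snap)
            if node is None:
                break  # assumption: stop at the first unseen snapshot
            if node.ended:
                best = node
        if best is not None:
            return best.succeeded / best.ended
        # Assumption: fall back to the overall training success rate.
        return self.total_succeeded / self.total
```

Classifying a student then reduces to comparing `predict_proba(...)` with the chosen threshold. Storing raw counts instead of the proportion itself keeps the tree easy to update when new trajectories arrive.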
Our baseline model expects the performance of both tasks, the one taken as input and the predicted one, to be the same.

We start by examining whether our model successfully detects students who fail the next exercise in the Hour of Code challenge. The left side of Fig. 3 shows that the ROC curve is systematically above the identity line (y = x). The area under the curve (AUC) of our model in this case is 0.77, with a 95% confidence interval (0.77-0.79). Both the AUC and its confidence interval lie above 0.5, indicating that our model performs better than a random classifier. Figure 3 also contains the main results for the baseline model and the optimal threshold. We can see that the baseline model is much closer to the bottom left corner of the figure than the optimal threshold.

The right side of Fig. 3 shows the AUC obtained for each task in the Robomission dataset versus the number of students that attempted each task. We performed a loess regression [16] to look for a correlation between AUC and the number of students. From the graph we can conclude that there is no such correlation. However, the variability of the AUC values depends on the number of students. When the number of students is below 500 the AUC values show high variability. For values over 500 the variability decreases markedly.

In this study we present a machine learning algorithm able to predict the future performance of novice programmers using their programming trajectories in just one exercise. The output of the model can be used to rank students according to their predicted performance. The data used by the model can be easily obtained in online programming environments. We have validated our model using two different datasets from two online learning platforms. Our results indicate that the model can classify students with reasonable accuracy. We have also found that the average performance of our model seems to be independent of the number of students attempting the task.
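To illustrate the AUC measure and the ranking use case, the following sketch computes the AUC directly from its rank interpretation (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half) and selects the students with the lowest predicted success probability. The function names and the toy data are hypothetical, and this pairwise computation is a stand-in chosen for clarity, not the ROC tooling used in the study.

```python
from typing import Dict, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


def rank_for_intervention(success_prob: Dict[str, float], k: int) -> List[str]:
    """Return the k students with the lowest predicted success probability."""
    ranked = sorted(success_prob, key=lambda s: (success_prob[s], s))
    return ranked[:k]


# Toy data (invented for illustration only).
success_prob = {"s1": 0.9, "s2": 0.2, "s3": 0.6, "s4": 0.4}
failed_next = {"s1": 0, "s2": 1, "s3": 0, "s4": 1}

# The positive class is "fails the next exercise", so the score is the risk.
ids = sorted(success_prob)
risk = [1.0 - success_prob[s] for s in ids]
toy_auc = auc(risk, [failed_next[s] for s in ids])
at_risk = rank_for_intervention(success_prob, k=2)
```

The pairwise loop is quadratic in the number of students, which is acceptable for a demonstration; a sort-based rank computation would be preferable for datasets of the size discussed here.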
1. Codewebs: scalable homework search for massive open online programming courses
2. Educational data mining and learning analytics in programming: literature review and case studies
3. Lightweight, early identification of at-risk CS1 students
4. A robust machine learning technique to predict low-performing students
5. Evaluating neural networks as a method for identifying students in need of assistance
6. Clustering and visualizing solution variation in massive programming classes
7. Data-driven hint generation in vast solution spaces: a self-improving python programming tutor
8. Autonomously generating hints by inferring problem solving policies
9. Programming pathway clustering using Tree Edit Distance
10. Stereotype modeling for problem-solving performance predictions in MOOCs and traditional courses
11. Towards making block-based programming activities adaptive
12. Hour of code: we can solve the diversity problem in computer science
13. Learnable programming: blocks and beyond
14. Blockly programming dataset
15. pROC: an open-source package for R and S+ to analyze and compare ROC curves
16. The Elements of Statistical Learning. SSS