key: cord-0682955-ad9rkckj authors: Gayman, C. M.; Jimenez, S. T.; Hammock, S.; Taylor, S.; Rocheleau, J. M. title: The Effects of Cumulative and Noncumulative Exams Within the Context of Interteaching date: 2021-09-08 journal: J Behav Educ DOI: 10.1007/s10864-021-09451-4 sha: 572e81342b51059eee221e8ecc247f89afe217c5 doc_id: 682955 cord_uid: ad9rkckj Interteaching is a behavioral teaching method that has been empirically shown to increase student learning outcomes. The present study investigated the effect of combining interteaching with cumulative versus noncumulative exams in two sections of an online asynchronous class. Interteaching was used in both sections of the course. The noncumulative exam section experienced weekly exams with test questions that only covered material learned in that week of class. The cumulative exam section was given weekly exams in which half of the questions were from material learned that current week and the other half were cumulative up to that point in the class. This was followed by a cumulative final exam given to both groups. All exam questions were multiple choice. On average, students in the cumulative exam group scored 4.91% higher on the final exam than students in the noncumulative exam group. Students exposed to weekly cumulative exams also earned more As and Bs on the final compared to the noncumulative exam group. Overall, our experiment provides evidence that interteaching may be further improved when combined with cumulative weekly exams. Although much of college instruction has traditionally been delivered in face-to-face lecture-style formats, student interest in online courses has continued to increase over the past several years. For example, enrollment for online courses grew by 29% between 2012 and 2018, and during the 2018-2019 school year 79% of all colleges in the USA offered online classes (Ruiz & Sun, 2021) . 
Additionally, online platforms have been critical to delivering instruction during the COVID-19 pandemic. To prevent the spread of infection, many colleges have offered online course formats in place of face-to-face instruction (Kaiser Family Foundation, 2020; McMurtrie, 2020). Consequently, there has been greater demand for research examining online teaching methods. One teaching method accumulating a growing body of empirical evidence supporting its efficacy is interteaching. Interteaching is a relatively new behavioral approach to college instruction in which contingencies are arranged in support of frequent student engagement and peer-to-peer collaboration, and lectures are reduced to a supportive, supplementary role (Gayman et al., 2018, 2020). In addition to producing better learning outcomes (for reviews, see Hurtado-Parrado et al., 2021; Querol et al., 2015; Saville et al., 2011; Sturmey et al., 2015), students have reported interteaching to be more enjoyable (Querol et al., 2015; Saville et al., 2011), and interteaching has been shown to improve long-term retention (Felderman, 2014; Saville et al., 2014). Last, interteaching offers unique benefits to instructors, who have reported interteaching class preparation to be less time-consuming than that for traditional teaching formats (Sturmey et al., 2015). Interteaching comprises six main components: (a) preparatory guides (prep guides), (b) small group discussions, (c) record sheets, (d) clarifying lectures, (e) quality points, and (f) frequent probes (Boyce & Hineline, 2002). Prep guides typically contain 10-30 items of varying complexity, from basic definitional questions to complex application questions. Students are expected to complete prep guides before class and, subsequently, are asked to discuss their answers to each question in small groups during class. During this time, the instructor offers guidance, answers questions, and provides supplementary explanations and examples.
After the small group discussion, students are asked to indicate which topics were most challenging via record sheets. The instructor uses information from the record sheets to develop and deliver clarifying lectures addressing difficult material. Next, quality points are awarded to students based on how well members of their discussion group perform on certain questions from the weekly exam. In this way, quality points are thought to encourage cooperative behaviors during small group discussion, as students may be more likely to shape peer responding if part of their grade depends on the adequacy of this discussion (Gayman et al., 2018, 2020). However, the impact of these contingencies may be negligible: component analyses have revealed quality points to be unnecessary to interteaching's overall effectiveness (Hurtado-Parrado et al., 2021; Saville & Zinn, 2009). Last, interteaching implements probes throughout each course's duration. Probes are short assessments composed of questions students have been made aware of in advance (Boyce & Hineline, 2002). Probes may include any style of question so long as they target material from prep guides. Further, it is recommended that they be conducted frequently and routinely (Felderman, 2014). Another empirically supported method for promoting learning and retention is the use of cumulative exams (Khanna et al., 2013; Lawrence, 2013). For example, Khanna et al. compared learning retention between cumulative and noncumulative final exams in a college course. Students who were administered a cumulative final scored higher on a separate content exam delivered in both the short and long term (once at the course's conclusion, and again 18 months later). Lawrence furthered this line of research by assessing whether cumulative weekly exams result in similar improvements.
In Lawrence's experiment, final exam performance of students in a cumulative weekly exam section was compared to that of students in a noncumulative weekly exam section. Students in the weekly cumulative exam section scored higher on both the cumulative final and a long-term retention assessment delivered two months after the course's conclusion, providing further support for cumulative exams' effectiveness. Notably, the exams in this study consisted of only 20% cumulative content. In response, Lawrence hypothesized that higher percentages, such as 50% cumulative content, may further improve learning and retention (although this assumption has yet to be empirically examined). Both probes' and cumulative exams' effectiveness may depend on their ability to induce the testing effect. The testing effect occurs when retention is improved as a result of testing previously learned material as opposed to restudying the same material (Roediger & Karpicke, 2006). Traditionally, exams are used to assess mastery, but they are also a means of promoting long-term retention (Chang & Wimmers, 2017; Khanna et al., 2013; Landrum, 2007; Petrowsky, 1999). Therefore, frequent testing would be expected to result in even greater gains in retention. Larsen et al. (2009) investigated this proposed relationship by comparing repeated testing to repeated studying between two sections of a college course. Students in the repeated testing group were administered three exams on the same content, while students in the repeated studying group were provided with three review sheets covering the same material. In both groups, exams and review sheets were provided at two-week intervals, and students took a final exam six months after the course's conclusion. Students in the repeated testing group performed 13% higher, on average, than those in the repeated study group, demonstrating the testing effect.
Although cumulative exams only partially assess previously covered content, the inclusion of at least some reiterative questions may function similarly to repeated testing in promoting course outcomes. Aside from the direct experience of repeating a test, it is possible that simple awareness of repeated assessment promotes long-term retention of exam-relevant content. For example, Szpunar et al. (2007) found students who expected to be retested on material were more likely to demonstrate mastery on later exams than students who were kept unaware of this possibility. The authors explained these findings by suggesting awareness may result in an increased tendency to form connections between topics (i.e., by focusing on the relatedness of material across chapters). Likewise, Szpunar et al. theorized a lack of expectation may work to devalue this practice, increasing students' probability of forgetting. The present study's purpose was twofold. First, although past research has demonstrated interteaching's effectiveness (Brown et al., 2014; Jones et al., 2019; Querol et al., 2015; Saville et al., 2011; Sturmey et al., 2015), no published study has evaluated its use in combination with cumulative exams, another empirically supported teaching intervention (Khanna et al., 2013; Lawrence, 2013). Thus, exam scores were compared across two sections of an interteaching course: one in which weekly exams were cumulative, and another in which they were noncumulative. Second, this combination has never been studied in an online course format. Therefore, the course was delivered in an asynchronous online format. A total of 92 undergraduate students participated in this experiment. Students were enrolled across two sections of a Psychology of Learning course, each of which lasted nine weeks and was taught by the first author in an asynchronous online format using the Canvas learning management platform.
Data from students who were repeating the course (n = 4) or did not consent (n = 11) were excluded from the study, leaving 77 participants' data to be included in the analysis. The noncumulative section included 39 participants (8 men, 31 women), while the cumulative section included 38 (2 men, 36 women). Additional demographic information for both groups is presented in Table 1. Further, past research suggests certain teaching interventions may offer greater benefits to students with lower GPAs than to those with moderate or high GPAs (e.g., Landrum, 2007; Saville et al., 2012). To test whether cumulative exams are more beneficial to those with moderate or low GPAs, self-reported cumulative GPA was requested and later validated via the university (after gaining student consent). Students in the noncumulative exam group were administered weekly exams consisting of 20 multiple-choice questions. Questions exclusively assessed novel material covered during the current week and were randomly selected from exam banks containing approximately 50 items per chapter. Students in the cumulative exam group were also administered 20-question multiple-choice exams each week. However, only 50% of exam questions covered novel material from the current week, while the remaining 50% assessed content covered in previous weeks, using questions pulled from the same exam pools used for the weekly chapter exams. For example, on exam two, 10 questions were pulled from material covered in week two, and 10 questions were pulled from material covered in week one. On exam three, 10 questions came from material covered in week three, 5 questions came from material covered in week two, and 5 questions came from material covered in week one. The final exam consisted of 35 multiple-choice questions randomly selected from chapter question banks comprised of questions that came from both the textbook exam bank and questions written by the instructor.
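The weekly question-sampling scheme described above can be sketched in code. The following is a minimal illustration, not the authors' actual implementation: the bank contents, function name, and the even split across prior weeks are assumptions based on the week-two and week-three examples given.

```python
import random

def build_cumulative_exam(banks, current_week, n_questions=20, seed=None):
    """Assemble a weekly exam: half new material, half spread as evenly
    as possible over previously covered weeks (an illustrative sketch of
    the scheme described in the text, not the authors' code)."""
    rng = random.Random(seed)
    n_new = n_questions // 2
    # half the questions assess novel material from the current week
    exam = rng.sample(banks[current_week], n_new)
    prior_weeks = list(range(1, current_week))
    if prior_weeks:
        # e.g., week 3: 5 questions from week 2 and 5 from week 1
        per_week, extra = divmod(n_questions - n_new, len(prior_weeks))
        for i, wk in enumerate(prior_weeks):
            k = per_week + (1 if i < extra else 0)
            exam.extend(rng.sample(banks[wk], k))
    return exam

# illustrative banks of 50 items per chapter/week (names are hypothetical)
banks = {wk: [f"w{wk}_q{i}" for i in range(50)] for wk in range(1, 4)}
exam3 = build_cumulative_exam(banks, current_week=3, seed=0)
```

For exam three this yields 10 current-week questions plus 5 each from weeks two and one, matching the example in the text.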
There were a total of 400 items in the test bank, and it is possible that students may have seen these questions on their previous weekly exams. However, students in both conditions had an equal probability of encountering previously seen test questions on the final exam. Weekly surveys were due 24 h after the deadline for each exam and included Likert scale questions asking students to rate their agreement with the following statements: "I thought the exams were difficult," "I am glad the exams were cumulative (or noncumulative)," "I crammed for the exam last week," and "After each exam, I disregarded any previously tested information and focused my attention on new information." The scale ranged from 1, "strongly disagree," to 7, "strongly agree." Finally, the survey asked students to indicate how long they spent preparing for the exam. The exit survey included questions asking students to rate their agreement with various statements, such as "I thought the final exam was difficult," "I am glad the final exam in this course was cumulative," and "I crammed for the final exam." The same 7-point Likert scale from the weekly exam surveys was used. Further, the exit survey prompted students to rate the quality of the teaching method used, the quality of the interteaching questions as an assignment, the quality of the discussion component of this course, and the quality of the clarifying lectures. The scale for these items ranged from 1, "poor," to 7, "excellent," with 4 indicating "average." Last, the survey asked students to answer two open-ended questions: "List the strengths of the course" and "List any suggestions for improvement." Informed consent was obtained the day after the add/drop period ended by sending students an email link to a survey in Qualtrics, an online survey platform. If students provided consent, the survey platform routed them to a demographic survey (see Table 1).
If students did not provide consent, the survey platform instead routed them to an alternate assignment that could be completed for the same amount of extra credit as the demographic survey. Until after grades were submitted for the class, the instructor was blind as to whether any student had provided consent and completed the demographic questionnaire or had completed the alternate extra credit assignment instead. Data were not included from students who did not consent to their data being used for the purposes of the study. Each week, the course covered one to two chapters from Chance's (2014) Learning and Behavior and ended with an exam. Students were able to access exam-relevant course material two weeks prior to its scheduled administration. In addition, a cumulative exam covering all chapters was administered during the final week of the nine-week course. All exams were monitored on video using Respondus Lockdown Browser with Monitor®, and students were not allowed to use books, notes, or any other materials during exams. Both sections of the course adopted an interteaching approach to instruction. Therefore, each week students (a) submitted answers to a 30-question prep guide, (b) discussed prep guide questions in groups on an asynchronous written discussion board, (c) uploaded a record sheet, and (d) watched a brief clarifying lecture delivered via PowerPoint. For additional details on the interteaching components of this study, see Gayman et al. (2018). Students were able to access their grade and see which questions they answered correctly and incorrectly on all exams immediately after completion. Students were not given the opportunity to retake exams, and they were not required to pass exams in order to move on to the next week of material. Correct answers to exam questions were made available for students to view 48 h after the exam deadline passed.
Exams and feedback were required to be viewed using Respondus Lockdown Browser®. After every weekly exam (see weekly exam conditions below), students completed a survey in the Canvas learning platform assessing their perceptions of that week's exam. After completion of the cumulative final, students completed another survey assessing their perceptions of the cumulative final exam. The present study utilized a group design in which noncumulative weekly exams were administered in one section, while cumulative weekly exams were delivered in the other. Since one group of students was exposed to cumulative weekly exams and the other group experienced noncumulative weekly exams, the questions on each exam covered different content. For example, in the cumulative exam group, half of the questions on the week two exam came from material covered in week one and the other half covered material from week two, while the noncumulative exam group received questions that came only from material covered in week two. The material covered each week also varied in difficulty; thus, exam scores varied to some extent because each group experienced questions covering different content on the weekly exams. To statistically control for this variability in exam scores, z-scores were calculated, which standardized the mean and standard deviation of each weekly exam (Gravetter et al., 2018). The primary dependent variable of interest was students' final exam scores. An α level of 0.05 was used for all statistical tests. In the absence of random assignment of participants to each condition, self-reported demographic characteristics were analyzed to determine whether there were significant differences between sections (see Table 1). A Fisher's exact test found no significant differences between groups regarding ethnicity, sex, class standing, relationship status, employment status, and parental status (p > 0.05).
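The z-score standardization used to put exams of differing difficulty on a common scale can be sketched as follows. The score values are illustrative, not study data.

```python
import statistics

def z_scores(scores):
    """Standardize one exam's scores to mean 0 and SD 1, so that exams
    covering content of different difficulty become comparable."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return [(x - mean) / sd for x in scores]

# illustrative raw percentage scores for one weekly exam
raw = [70, 80, 90, 100]
z = z_scores(raw)
```

Each weekly exam would be standardized separately, after which the two sections' distributions can be compared on the same scale.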
Additionally, independent-samples t tests found no significant differences between sections in the number of psychology courses taken, number of current credit hours, number of hours worked per week, number of children, age, cumulative GPA, and psychology GPA (p > 0.05). Exam 1 contained the same questions across both the cumulative and noncumulative conditions; therefore, it was excluded from the analysis. Table 2 summarizes the sample size, mean percentage correct, and standard deviation for exams 2-7 and the final for each group. Independent-samples t tests were used to determine whether there were differences between the groups on each exam (see Fig. 1). Students who experienced cumulative exams scored significantly lower on exam 7, t(70) = −2.46, p = 0.016, d = 0.59. No other significant differences were found between weekly exams. Since it was hypothesized that students who experienced cumulative weekly exams would score higher on the cumulative final than students who experienced noncumulative exams, a one-tailed test was used. An independent-samples t test supported our hypothesis: students who experienced cumulative exams all term scored significantly higher on the cumulative final, t(68) = 1.81, p = 0.037, d = 0.44. Further, visual inspection of the data demonstrated a trend in which these students earned more As and Bs on the final, while those who experienced noncumulative exams earned more Cs and Fs (see Fig. 2). Specifically, out of the 35 students in the cumulative exam group, 7 (20%) earned an A, 12 (34.28%) earned a B, 9 (25.71%) earned a C, and 1 (2.86%) earned an F, whereas 3 (8.57%) of the 35 students in the noncumulative exam group earned an A, 11 (31.42%) earned a B, 11 (31.42%) earned a C, and 6 (17.14%) earned an F.
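The group comparisons above rest on the independent-samples t statistic and Cohen's d. Below is a standard-library-only sketch of both computations under the pooled-variance model; the score lists are illustrative, and a complete hypothesis test would additionally need the t distribution's CDF to obtain a p-value.

```python
import statistics
from math import sqrt

def independent_t_and_d(a, b):
    """Pooled-variance independent-samples t statistic and Cohen's d
    (an illustrative sketch; not the authors' analysis code)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    # pooled variance weights each group's variance by its df
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))
    d = (ma - mb) / sqrt(sp2)  # Cohen's d: mean difference in SD units
    return t, d

# illustrative final-exam percentages, not the study's raw data
cumulative = [82, 78, 90, 74, 85, 88]
noncumulative = [75, 70, 84, 72, 80, 79]
t_stat, cohens_d = independent_t_and_d(cumulative, noncumulative)
```

For the directional hypothesis, a one-tailed test simply evaluates the resulting t against the upper tail of the t distribution with na + nb − 2 degrees of freedom.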
Although the trend of the cumulative exam group earning a better grade on the final than the noncumulative group did not hold for those who earned a D (17.14% and 11.43%, respectively), overall more students in the cumulative exam group passed the final (i.e., earned an A, B, or C) than in the noncumulative group (80% and 71.43%, respectively).

Fig. 1 Mean z-score for each exam and the cumulative final. One asterisk indicates p < 0.05 for a two-tailed test, while two asterisks indicate p < 0.05 for a one-tailed test. The error bars represent ± 1.0 standard error of the mean.

To determine the relationship between GPA and intervention effectiveness, participants were divided into thirds based on verified GPA (high, moderate, and low), and final exam scores were compared across groups. A 2 (cumulative, noncumulative) × 3 (high, moderate, low) between-subjects ANOVA found a main effect of GPA (F[2, 64] = 4.51, p = 0.015, ηp² = 0.12), where a Tukey's HSD post hoc test showed those with a high GPA (M = 83.49, SD = 8.32) scored significantly higher on the final exam than those with a low GPA (M = 74.04, SD = 12.66, p = 0.011, d = 0.87). Those with a moderate GPA (M = 77.96, SD = 11.45) did not differ significantly from those with a high (p = 0.195) or low (p = 0.417) GPA. No significant interaction was found, which suggests that the testing method (cumulative vs. noncumulative) did not have differential effects on students' final exam scores based on their GPA. Further, to determine whether students accurately reported their GPA, these scores were verified by the first author after obtaining informed consent; a significant, positive correlation was found between self-reported and verified GPA (r(7) = 0.553). After each weekly exam, students completed a survey asking them to rate statements regarding the previous week's exam along a Likert scale ranging from 1, "strongly disagree," to 7, "strongly agree." Table 3 lists descriptive statistics corresponding to each statement.
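One way to form the GPA thirds described above is a simple rank-order split. This sketch is an assumption about the procedure, since the authors do not specify how ties or uneven group sizes were handled.

```python
def split_into_tertiles(gpas):
    """Label each participant low/moderate/high by rank order of GPA
    (an illustrative sketch, not the authors' exact procedure)."""
    order = sorted(range(len(gpas)), key=lambda i: gpas[i])
    n = len(gpas)
    cut1, cut2 = n // 3, 2 * n // 3  # boundaries between the thirds
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = "low" if rank < cut1 else "moderate" if rank < cut2 else "high"
    return labels

# illustrative verified GPAs (hypothetical values)
gpas = [2.1, 3.9, 3.0, 2.5, 3.5, 2.8, 3.7, 3.2, 2.3]
groups = split_into_tertiles(gpas)
```

The resulting labels would then serve as the three-level between-subjects factor in the 2 × 3 ANOVA.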
Additionally, a series of independent-samples t tests compared group means for each statement and found no significant differences between conditions (p > 0.05). Students were also asked to report the amount of time they spent completing the group discussion and studying for the exam. To remove the influence of outliers, the data were checked for any students who reported times that were ± 3.0 standard deviations away from the mean (Gravetter et al., 2018). Three outliers were found and replaced with the mean. Independent-samples t tests indicated that the two sections did not differ in their reported times (p > 0.05). Collapsed across the two sections, students reported spending an average of 2.36 h (SD = 1.78) completing the group discussion, an average of 2.53 h (SD = 1.81) studying for each weekly exam (two outliers were replaced with the mean), and an average of 2.10 h (SD = 1.33) studying for the final (one outlier was replaced with the mean). In addition, students were asked to rate the quality of their group discussion each week when they completed their record sheet, using a Likert scale ranging from 1, "poor," to 7, "excellent." The quality of group discussion was rated slightly above average (M = 5.17, SD = 1.35). We also investigated whether students rated the quality of group discussion higher after repeated exposure throughout the semester. Ratings of group discussion tended to increase slightly each week; however, this relationship was not significant, r(475) = 0.027, p = 0.55.

Table 3 Descriptive statistics for the weekly survey across groups. Questions were adopted from Lawrence (2013). A seven-point Likert scale ranged from 1, "strongly disagree," to 7, "strongly agree." The p-values are reported from independent-samples t tests.

At the end of the semester, students completed a survey asking them to rate various aspects of interteaching on a Likert scale ranging from 1, "poor," to 7, "excellent."
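The ± 3.0 SD mean-substitution rule applied to the self-reported times can be sketched as follows. This is a single-pass version with illustrative data; the authors do not state whether the mean was recomputed after excluding the outlier, so that detail is an assumption.

```python
import statistics

def replace_outliers_with_mean(values, k=3.0):
    """Replace values more than k standard deviations from the mean
    with the mean (single-pass sketch of the rule described above)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [mean if abs(v - mean) > k * sd else v for v in values]

# illustrative self-reported study hours with one extreme value
hours = [2.0, 2.5, 1.5, 3.0, 2.0, 2.5, 1.0, 3.5, 2.0, 2.5,
         3.0, 1.5, 2.0, 2.5, 3.0, 2.0, 1.5, 2.5, 2.0, 40.0]
cleaned = replace_outliers_with_mean(hours)
```

Note that because an extreme value inflates the sample SD, small samples may fail to flag any point under this rule; with the larger illustrative sample above, the 40-hour report exceeds the threshold and is replaced.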
Specifically, they reported perceptions regarding the quality of interteaching, prep guide questions, group discussions, and clarifying lectures. Table 4 reports the descriptive statistics corresponding to each statement across conditions. Across both sections, students rated interteaching and its various components above average. Further, independent-samples t tests were used to determine whether group differences in students' ratings were statistically significant. Although students in the cumulative exam condition rated clarifying lectures higher than students in the noncumulative exam condition (t[64] = 2.010, p = 0.049, d = 0.50), no other statistically significant differences were found between groups on these measures. In the present study, student performance was compared across two sections of an asynchronous online Psychology of Learning course. Interteaching was combined with cumulative weekly exams in one section, and noncumulative weekly exams were delivered in the other. In both sections, the final exam was cumulative. On average, students in the cumulative exam group scored 4.91% higher, or almost half a letter grade, on the final exam than students in the noncumulative exam group. In addition, students in the cumulative exam group earned more final exam grades in the A and B range than the noncumulative exam group (whose final exam grades were more likely to fall in the C and F range). Given the accuracy of the self-reports of GPA in the current study and in Gayman et al. (2020), it may be possible to rely on student-reported GPA in future studies. Overall, our experiment provides evidence that interteaching may be further improved when combined with frequent cumulative exams, as opposed to frequent noncumulative exams. The learning improvements observed in the present study may result from cumulative exams' ability to provide reinforcement for students' continual rehearsal of past material.
In other words, students who rehearse past material more frequently are more likely to retain that knowledge over time. In comparison, courses with weekly noncumulative exams may fail to reinforce any added rehearsal until the end of the course, when the cumulative final is delivered. As a result, students in the cumulative exam condition may be more likely to engage in distributed practice in preparation for the final, which has been demonstrated to result in better retention than massed practice (otherwise known as cramming; Rohrer & Taylor, 2006; Seabrook et al., 2005). However, when student survey responses were analyzed in the current study, no significant differences were found between groups with respect to self-reported statements regarding "cramming" for the final or "disregarding" previously learned material. This is somewhat surprising, given that one would anticipate students who had been exposed to cumulative exams throughout the semester would report less pressure to "cram" for the final exam. It is possible that the question phrasing was too broad, as we did not provide students with an operational definition of "cramming" for the final, and this may have led to the lack of difference between groups on this survey question. Demand characteristics are another possible explanation: students may have reported to their instructor what they thought aligned with good or appropriate study habits. In addition, although our data support Lawrence's (2013) speculation that higher percentages of cumulative content, such as 50%, would effectively promote student learning, they do not directly answer whether higher percentages result in greater improvements than lower percentages, such as the 20% used in her study. In our study, the difference between the cumulative and noncumulative exam groups' mean scores was 4.91%, while the difference between these groups in Lawrence's (2013) study was only 2.90%.
While this comparison is merely anecdotal, a future study could make an experimental comparison by combining interteaching with weekly exams of varying proportions of cumulative content across sections. For example, the performance of a section given exams with 20% cumulative content could be compared to another given 50%, and yet another given 80%, to help determine which proportion best optimizes learning. Further, while our study provides evidence that combining interteaching with cumulative exams is an effective means of promoting short-term retention, it did not assess the combination's impact on longer-term retention. Past experiments on cumulative exams, such as those conducted by Khanna et al. (2013) and Lawrence (2013), assessed long-term retention by administering a follow-up exam 18 months and two months after the conclusion of their courses, respectively. However, our experiment's comparisons were restricted to performance on weekly exams and a cumulative final. Therefore, researchers interested in assessing longer-term retention could replicate the present study by additionally comparing student performance on a follow-up exam given after a few months' delay. In addition, future studies may further investigate the effect of weekly cumulative exams by including a control condition in which no weekly exams are administered. It is also worth noting that the present study's exams exclusively contained multiple-choice questions, which assess students' ability to recognize the correct response from a list of alternatives. However, other question styles, such as essay, short answer, or fill-in-the-blank, instead require students to recall the correct information (Gay, 1980). Another limitation is that the current study did not use novel final exam questions, although students from both conditions had an equal probability of encountering previously seen test questions.
Therefore, further replications could assess whether cumulative exams containing other question types and novel final exam questions produce similar or greater gains in learning and retention. Finally, due to the COVID-19 pandemic, many K-12 schools have transitioned their courses into online formats (Schwartz, 2020). However, our study specifically assessed the effectiveness of this intervention with college students. To begin assessing these findings' generality to other populations and settings, subsequent research could explore this intervention's effectiveness with high school students. In sum, our findings demonstrate that adding frequent cumulative exams to interteaching can lead to greater student learning in comparison to using frequent noncumulative exams. Additionally, this intervention provides further evidence for the effectiveness of interteaching in an asynchronous, online course. Given educators' increased interest in developing effective online instruction, this research serves to validate one method for improving student learning and retention in the face of rapid and unprecedented changes resulting from COVID-19.

Funding: No funding was received to assist with the preparation of this manuscript.

Data Availability: Data are publicly available at the second author's GitHub repository: https://github.com/ststill7/Cumulative-Exams-Data

References:
- Interteaching: A strategy for enhancing the user-friendliness of behavioral arrangements in the college classroom
- Interteaching: An evidence-based approach to instruction
- Effect of repeated/spaced formative assessments on medical school final exam performance
- Preliminary analysis of interteaching's frequent examinations component in the community college classroom
- The comparative effects of multiple-choice versus short-answer tests on retention
- Interteaching in an asynchronous online class
- A comparison of interteaching, lecture-based teaching, and lecture-based teaching with optional preparation guides in an asynchronous online classroom (Scholarship of Teaching and Learning in Psychology)
- Essentials of statistics for the behavioral sciences
- A systematic review and quantitative analysis on the effectiveness of interteaching
- The interteaching approach: Enhancing participation and critical thinking
- State COVID-19 data and policy actions
- Short- and long-term effects of cumulative finals on student learning
- Introductory psychology student performance: Weekly quizzes followed by a cumulative final exam
- Repeated testing improves long-term retention relative to repeated study: A randomised controlled trial
- Cumulative exams in the introductory psychology course
- Preparing for emergency online teaching
- The use of a comprehensive multiple-choice final exam in the macroeconomics principles course: An assessment
- A comprehensive review of interteaching and its impact on student learning and satisfaction
- Test-enhanced learning: Taking memory tests improves long-term retention
- The effects of overlearning and distributed practice on the retention of mathematics knowledge
- Distance education in college: What do we know from IPEDS? (NCES Blog)
- Interteaching and lecture: A comparison of long-term recognition memory
- Interteaching: Bringing behavioral education into the 21st century
- The relation between GPA and exam performance during interteaching and lecture
- Interteaching: The effects of quality points on exam scores
- Here's what teaching looks like under COVID-19
- Distributed and massed practice: From laboratory to classroom
- Inter-teaching: A systematic review
- Expectation of a final cumulative test enhances long-term retention

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Conflict of Interest: The authors have no relevant financial or non-financial interests to disclose.