Teaching modeling in introductory statistics: A comparison of formula and tidyverse syntaxes

Amelia McNamara

2022-01-31

This paper reports on a head-to-head comparison run in a pair of introductory statistics labs, one conducted fully in the formula syntax, the other in tidyverse. Analysis of incidental data from YouTube and RStudio Cloud shows interesting distinctions. The formula section appeared to watch a larger proportion of pre-lab YouTube videos, but to spend less time computing on RStudio Cloud. Conversely, the tidyverse section watched a smaller proportion of the videos and spent more time computing. Analysis of lab materials showed that tidyverse labs tended to be slightly longer in terms of lines in the provided RMarkdown materials, but not in minutes of the associated YouTube videos. The tidyverse labs exposed students to slightly more distinct R functions, but both labs relied on a quite small vocabulary of consistent functions, which can provide a starting point for instructors interested in teaching introductory statistics in R. Analysis of pre- and post-survey data shows no differences between the two labs, so students appeared to have a positive experience regardless of section. This work provides additional evidence for instructors looking to choose between syntaxes for introductory statistics teaching.

When teaching statistics and data science, it is crucial for students to engage authentically with data. The revised Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report provides recommendations for instruction, including "Integrate real data with a context and purpose" and "Use technology to explore concepts and analyze data" (GAISE College Report ASA Revision Committee et al. 2016). Many instructors have students engage with data using technology through in-class experiences or separate lab activities. An important pedagogical decision when choosing to teach data analysis is the choice of tool. There has long been a divide between 'tools for learning' and 'tools for doing' data analysis (McNamara 2015). Tools for learning include applets, and standalone software like TinkerPlots, Fathom, or their next-generation counterpart CODAP (Konold & Miller 2001, Finzer 2002, The Concord Consortium 2020). Tools for doing are used by professionals, and include software packages like SAS as well as programming languages like Julia, R, and Python.

Many tools for learning were inspired by Rolf Biehler's 1997 paper, "Software for Learning and for Doing Statistics" (Biehler 1997). In it, Biehler called for more attention to the design of tools used for teaching. In particular, he was concerned with on-ramps for students (ensuring the tool was not too complex) as well as off-ramps (allowing one tool to be used through an entire class, and to extend further) (Biehler 1997). At the time he wrote the paper it was quite difficult to teach using an authentic tool for doing, because these tools lacked technological or pedagogical on-ramps. However, recent developments in Integrated Development Environments (IDEs) and pedagogical advances have opened space for a movement to teach even novices statistics and data science using programming. In particular, curricula using Python and R have become popular.
In these curricula, educators make pedagogical decisions about what code to show students, and how to scaffold it. In both the Python and R communities, there have been movements to simplify syntax for students. For example, the UC Berkeley Data 8 course uses Python, including elements of the commonly-used matplotlib and numpy libraries as well as a specialized library written to accompany the curriculum called datascience (Adhikari et al. 2021, DeNero et al. 2020). The datascience library was designed to reduce complexity in the code. At the K-12 level, the language Pyret has been developed as a simplified version of Python to accompany the Bootstrap Data Science curriculum (Krishnamurthi et al. 2020). In R, the development of less-complex code for students has been under consideration for even longer (Pruim et al. 2011).

R offers non-standard evaluation, which allows package authors to create new 'syntax' for their packages (Morandat et al. 2012). In human language, syntax is the set of rules for how words and sentences should be structured. If you use the wrong syntax in human language, people will notice there is something wrong with how you structured your speech or writing. However, because human understanding is flexible, the listener will probably still understand the general idea you were trying to convey. Syntax in programming languages is more formal: it governs whether code will execute, run, or compile correctly. Using the wrong syntax usually means failing to get a result from the program. Typically, programming languages have only one valid syntax. For example, an aphorism about the language Python is "There should be one -- and preferably only one -- obvious way to do it" (Peters 2004). But non-standard evaluation in R has allowed there to be many obvious ways to do the same task. There is some disagreement over whether syntax is a precise term for these differences. Other terms suggested for these variations in valid R code are 'dialects,' 'interfaces,' and 'domain specific languages.' Throughout this paper, we use the term syntax as a shorthand for these concepts.

At present, there are three primary syntaxes used: base, formula, and tidyverse (McNamara 2018). The base syntax is used by the base R language (R Core Team 2020), and is characterized by the use of dollar signs and square brackets. The formula syntax uses the tilde to separate response and explanatory variable(s) (Pruim et al. 2017). The tidyverse syntax uses a data-first approach, and the pipe to move data between steps (Wickham et al. 2019). A comparison of using the three syntaxes for univariate statistics and displays can be seen in Code Block 1.1. This example code, like the rest in this paper, uses the palmerpenguins data (Horst et al. 2020). All three pieces of code accomplish the same tasks, and all three use the R language. But, the syntax varies considerably.

# base syntax
hist(penguins$bill_length_mm)
mean(penguins$bill_length_mm)

# formula syntax
gf_histogram(~bill_length_mm, data = penguins)
mean(~bill_length_mm, data = penguins)

# tidyverse syntax
ggplot(penguins) +
  geom_histogram(aes(x = bill_length_mm))
penguins %>%
  summarize(mean(bill_length_mm))

Code Block 1.1: Making a histogram of bill length from the penguins dataset, then taking the mean, using three different R syntaxes. Base syntax is characterized by the dollar sign, formula by the tilde, and tidyverse is dataframe-first. In order for this code to run as-is, missing (NA) values need to be dropped before the code is run.
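The original materials do not show the preprocessing step the caption refers to; a minimal sketch of one way to satisfy it, using tidyr::drop_na() (the exact approach used in the course is an assumption):

# One way to drop the missing values so Code Block 1.1 runs as-is; this
# preprocessing step is illustrative, not shown in the original materials.
library(palmerpenguins)
library(tidyr)
penguins <- drop_na(penguins, bill_length_mm)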
There is some agreement about pedagogical decisions for teaching R. In particular, most educators agree that in order to reduce cognitive load, instructors should only teach one syntax, and should be as consistent as possible about that syntax (McNamara et al. 2021). There is also some agreement that base R syntax is not the appropriate choice for introductory statistics, but there is widespread disagreement on whether the formula syntax or tidyverse syntax is better for novices. While there are strongly-held opinions on which syntax should be taught (Pruim et al. 2017, Cetinkaya-Rundel et al. 2022), there is relatively little empirical evidence to support these opinions.

In the realm of computer science, empirical studies by Andreas Stefik et al. have shown significant differences in the intuitiveness of languages, as well as error rates, based on language design choices (Stefik et al. 2011, Stefik & Siebert 2013). Thus, it seems likely there are language choices that could make data science programming easier (or harder) for users, particularly novices. Stefik's team is working to add data science functionality to their evidence-based programming language. As a first step toward understanding which elements of existing languages might be best to emulate, they ran an experiment comparing the three main R syntaxes (Rafalski et al. 2019). The study showed no statistically significant difference between any of the three syntaxes with regard to time to completion or number of errors. However, there were significant interaction effects between syntax and task, which suggested some syntaxes might be more appropriate for certain tasks (Rafalski et al. 2019). Beyond this, examining the results from the study with an eye toward data science pedagogy showed common errors made by students related to their conceptions of dataframes and variables. For example, one of the figures from Rafalski et al. (2019) shows real student code with errors. In the first line of code, the student gets everything correct using formula syntax, with the exception of the name of the dataframe. When that code does not work, they try again using base R syntax, but again get the dataframe name wrong. After both those failures, they appear to fall back on computer science knowledge and try syntax quite different from R. This is consistent with other studies of novice behavior in R (Roberts 2015). That work served as the inspiration for the longer comparison of multiple R syntaxes in the classroom context described in this paper.

The remainder of this paper is organized into three sections. Section 2 describes the setup of the classes, the participants (2.1) and their experience (2.2), and the content of the course under investigation (2.3). Section 3 contains results of the analysis, including a comparison of material lengths between the sections (3.2), the number of unique functions shown in each section (3.3), results from the pre- and post-survey (3.4), and analysis of YouTube (3.5) and RStudio Cloud (3.6) data. Finally, Section 4 discusses the results and opportunities for future study. All pedagogical materials used for the course under discussion are available on GitHub and are Creative Commons licensed, so they can be used or remixed by anyone who wants to use them (https://github.com/AmeliaMN/STAT220-labs). All code and anonymized data from this paper are also available on GitHub, for reproducibility (https://github.com/AmeliaMN/ComparingSyntaxForModeling).
Data analysis was performed in R, and the paper is written in RMarkdown. The categorical color palette was chosen using Colorgorical (Gramazio et al. 2017), and colors for the Likert scale plot are from ColorBrewer (Harrower & Brewer 2003). Example data used throughout the paper is from palmerpenguins (Horst et al. 2020). Code from the formula section uses the R packages loaded in that course, mosaic and ggformula (ggformula is now loaded automatically with mosaic) (Pruim et al. 2017, Kaplan & Pruim 2020). Code from the tidyverse section uses the functions from that course, the tidyverse and infer packages (Wickham et al. 2019, Bray et al. 2021).

The author ran this head-to-head comparison in her introductory statistics labs. The comparison was run twice, once in the Spring 2020 semester and once in the Fall 2020 semester. The disruption of COVID-19 to the Spring 2020 semester made the resulting data unusable, so this paper focuses on just Fall 2020 data. Data was collected from YouTube analytics for watch times, from RStudio Cloud for aggregated compute time, and from pre- and post-surveys of students. Participants were students enrolled in an introductory statistics course at a mid-sized private university. Participants for the pre- and post-survey were recruited from this pool after Institutional Research Board ethics review (University of St Thomas IRB 1605810-2).

In Fall 2020, the author taught two labs associated with the same lecture section, so all students saw the same lecture content. (A third lab was associated with the same lecture, using a different software, and was not considered.) Using random assignment (a coin flip), the author selected one lab section to be instructed using formula syntax, and one to be instructed using tidyverse syntax. The goal was to compare syntaxes head-to-head. Because the lab took place during the coronavirus pandemic, the instructor recorded YouTube videos of herself working through the pre-lab documents for each lab, and posted them in advance. Students watched the videos and worked through the associated pre-lab RMarkdown document on their own time, then came to synchronous class to ask questions and get help starting on the real lab assignment. Students used R through the online platform RStudio Cloud (RStudio PBC 2021).

The two labs were of the same size (n = 21 in both sections) and reasonably similar in terms of student composition. In both sections, approximately half of students were Business majors, with the other half a mix of other majors. For the pre-survey, n = 12 and n = 13 students consented to participate, and in the post-survey n = 8 and n = 13 responded. So, for paired analysis we have n = 8 for the formula section, and n = 13 for the tidyverse section. These sample sizes are very small, and because students could opt in, may suffer from response bias. However, because we have additional usage data from non-respondents, some elements of the data analysis include the full class sample sizes of n = 21.

To verify both groups of students had similar backgrounds, we compared the prior programming experience of the two groups of students. Table 1 shows results from the pre-survey.

                         formula   tidyverse
No                            10           9
Yes, but not with R            2           4

Table 1: Responses from pre-survey about prior programming experience.

While two additional students in the tidyverse section had prior programming experience,
the overall pattern was the same. The majority of students in both sections had no prior programming experience. For the students who had programmed before, none had prior experience with R. Three students had prior experience with Java, three with JavaScript, and a smaller number had experience with other languages, including C++ and Python.

Each week, the lab instructor prepared a "pre-lab" document in RMarkdown. The pre-lab covered the topics necessary to complete the standardized lab assignment done by all students across lab sections. Pre-lab documents included text explanations of statistical and R programming concepts, sample code, and blanks (both in the code and the text) for students to fill in as they worked. The instructor recorded YouTube videos of herself working through the pre-lab documents for each lab, and posted them in advance. Literature about flipped classrooms suggests shorter videos are better for student engagement, although there is no consensus about the ideal length for videos, with suggestions ranging from 5 to 20 minutes as a maximum length for a video (Zuber 2016, Beatty et al. 2019, Guo et al. 2014). The instructor attempted to keep the total number of minutes of video content below 20 each week. If video content became too long, the instructor split the content into multiple shorter videos. Students were told to watch the pre-lab video(s) and work through the RMarkdown document on their own time, then come to synchronous class to ask questions and get help starting on the real lab assignment. The topics covered in Fall 2020 included the following:

13. ANOVA: inference for more than two means, using the F distribution.
14. Chi-square: inference for more than two counts, using the χ² distribution.
15. Inference for regression: inference for the slope coefficient in simple linear regression, prediction and confidence intervals. Multiple regression.

Although this was a 15-week semester, there are only 12 lab topics. Labs were not held during the first week of classes or during Thanksgiving week. Additionally, there were two "lab assessments" to gauge student understanding of concepts within the context of their lab software. One took place during finals week; the other was scheduled in week 11.

One obvious question arising when considering the comparison of the two syntaxes is whether students performed better in one section or another. The IRB did not cover examining student work (an obvious avenue for further research), so we cannot look at student outcomes on a per-assignment basis. However, running a randomization test for a difference in overall mean lab grades showed no significant difference between the two sections. While there may have been interesting differences in grades depending on the topic of the lab, we at least know these differences averaged out in the end. Similarly, it would be interesting to know whether student attitudes about the instructor differed, based on the summative student evaluations completed by all students at the end of the semester. These evaluations are anonymous, and the interface only provides summary statistics. Again, a test for a difference in means showed no difference in mean evaluation score on the questions "Overall, I rate this instructor an excellent teacher." and "Overall, I rate this course as excellent."

The first question we seek to answer is whether the materials presented to students were of approximately the same length.
We can assess this based on the length of the pre-lab documents (in lines) and of the pre-lab videos (in minutes). Figure 1 shows the number of lines in each section's pre-lab document, per week. Lines in the RMarkdown document include the YAML header (consistent between documents), the descriptive text about processes (largely similar between documents), and the code in code chunks, which varies based on the syntax of the lab.

Figure 1: Length of pre-lab RMarkdown documents each week, in lines. Data has been adjusted for the formula section in weeks 8 and 9, because an instructor error led this section to have only one document combining both weeks' work.

Figure 2: Length of pre-lab videos each week. Outlines help delineate multiple videos for a single week.

Every attempt was made to align these RMarkdown documents, so the descriptive text was only changed when necessary to describe specific elements of the code. Similarly, if blank code chunks appeared in one lab, that was mirrored by a blank chunk in the other lab. Both labs' documents were styled using the styler package (Müller & Walthert 2022) to remove inconsistencies with spacing, assignment operators, and the like. Code was styled using the default style from the package, which is based on the tidyverse style guide (Wickham 2022). The tidyverse style guide has become the de facto style guide for R, as previously-existing style guides like the Google R Style Guide have largely decided to follow Wickham (2022).

Figure 1 indicates RMarkdown documents for the tidyverse section tended to be longer. We can compute a difference in lab lengths for each week, and compute the mean difference, which is 19 lines. Although we only have 12 labs' worth of data, the differences appeared relatively normally distributed. A 95% confidence interval for the difference in line length (computed using the Student's t-distribution) is (7, 31), and a bootstrap confidence interval computed with 5,000 bootstrap samples and using the percentile method is (9, 29). Both intervals indicate labs for the tidyverse section were longer, but only by a few lines. A slightly longer length for these labs makes sense, because tidyverse code is characterized by multiple short lines strung together into a pipeline with %>%, while the formula syntax typically has single function calls, sometimes with more arguments. The code shown in this paper is styled the same way the labs were, so any code comparison you see (for example, in Code Block 3.1 and Code Block 3.4) will show the difference in lines of code between the two syntaxes.

Then the question becomes whether the longer lengths of documents lent themselves to longer pre-lab videos. Figure 2 shows the video lengths, which appear more consistent between sections. Effort was made to ensure the maximum video length was approximately 20 minutes, and some weeks had multiple videos. Again, we can compute a pairwise difference in total video length (adding together multiple videos in weeks that had them), and compute the mean of that difference. That difference is 2 minutes (tidyverse videos being longer). The distribution of differences appeared to be right-skewed, which may impact results because the sample size is so low.
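The paired-difference intervals reported in this section can be computed in a few lines of base R. A minimal sketch, with made-up placeholder values standing in for the 12 weekly differences (not the actual data, which is available in the paper's GitHub repository):

# Illustrative placeholder values for 12 weekly differences in document
# length (tidyverse minus formula); not the real differences.
diffs <- c(30, 12, 46, 25, 8, 20, 15, 22, 5, 18, 14, 13)

# 95% confidence interval using the Student's t-distribution
t.test(diffs)$conf.int

# 95% percentile bootstrap interval with 5,000 bootstrap samples
boot_means <- replicate(5000, mean(sample(diffs, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))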
Using a t-distribution, we can compute a 95% confidence interval, which is (-0.3, 3.6), and thus contains 0. Alternatively, we could compute a bias-corrected 95% bootstrap confidence interval using 5,000 bootstrap samples. Because of the right skew to the data, the bootstrap distribution also appeared skewed. Bias correction should help adjust for this. The bootstrap interval is (0.23, 4). This interval does not contain 0, but 0.23 minutes is equivalent to 14 seconds, which is not practically significant. It appears that while tidyverse labs are longer in terms of lines of code, the corresponding videos are not meaningfully different in length.

One place where the labs are of particularly different lengths is in week 3, when the topic was exploratory data analysis for one and two categorical variables. For the formula section the RMarkdown document was 134 lines long, and the two videos totaled 28 minutes. The RMarkdown document for the tidyverse section was 180 lines long, and the videos totaled 35 minutes. There is a clear reason why. In the formula section, students found frequency tables and relative frequency tables with code as in Code Block 3.1 and Code Block 3.2.

tally(~island, data = penguins)
tally(~island, data = penguins, format = "percent")
tally(species ~ island, data = penguins)

Code Block 3.1: Making tables of one and two categorical variables using the formula syntax and mosaic::tally().

tally(species ~ island, data = penguins, format = "percent")

Code Block 3.2: Making a relative frequency table of two categorical variables using the formula syntax.

Again, reversing the order of the variables (this time, inside the dplyr::group_by()) changed the percentages, but it was more difficult to determine how the percents added up, because the data was in long format, rather than wide format. Compare Code Block 3.5 and Code Block 3.6 to see the effect of swapping the order of variables.

penguins %>%
  group_by(species, island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))

The explanation for the varying time is similar, as well. Week 10 focused on inference for two samples; that is, inference for a difference of proportions or a difference of means. While a difference of means makes it fairly easy to know which variable should go where (the quantitative variable is the response variable to take the mean of, and the categorical variable is the explanatory variable splitting it), with a difference of two proportions the concept comes back to thinking about two-way tables. Again, the tidyverse presentation of a "two-way table" made this more difficult to conceptualize. In the formula section, students saw code like that in Code Block 3.7.

tally(island ~ sex, data = penguins, format = "proportion")
prop.test(island ~ sex, data = penguins, success = "Biscoe")

Code Block 3.7: Making a two-way table and performing inference for a difference of proportions using the formula syntax. In order for this code to run as-is, Torgersen island has to be removed so there are just two categories in that variable.

The code for finding the point estimate using mosaic::tally() is quite similar to the code for performing inference using prop.test(). In the tidyverse, the code is not as consistent. Students in this section saw code like that shown in Code Block 3.8.

penguins %>%
  group_by(sex, island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))

penguins %>%
  prop_test(
    response = island,
    explanatory = sex,
    alternative = "two-sided",
    order = c("female", "male")
  )

Code Block 3.8: Making a 'two-way table' and performing inference for a difference of proportions using the tidyverse syntax. Again, the Torgersen island data has been removed beforehand.
In tidyverse syntax the code for finding the point estimate (dplyr's group_by(), summarize(), and then mutate()) is quite different from the code performing the inference (the infer::prop_test() function). And, the output from the inferential prop_test() function makes it harder to determine whether the code was correct. In the prop.test() output, sample estimates are provided, which allows you to check your work against a point estimate computed earlier. These discrepancies made it take longer to explain code in the tidyverse section for these topics. However, as we saw in Section 3.2, overall the length of videos did not appear meaningfully different between sections. Comparisons of RMarkdown document length and YouTube video length, as well as the corresponding reasons for those discrepancies, are the first hint of the computing time results to come in Section 3.6.

Since both sections relied on the use of RMarkdown documents, there is a wealth of text data to be explored. The instructor prepared the pre-lab documents with blanks, but also saved a 'filled-in' copy after recording the accompanying video. She also completed each lab assignment in an RMarkdown document to generate a key. Students in each section were also given an "All the R you need for intro stats" cheatsheet at the beginning of the semester. These cheatsheets (one for formula and one for tidyverse) were modeled on the cheatsheet of a similar name accompanying the mosaic package (Pruim et al. 2017). The cheatsheets aimed to include all code necessary for the entire semester, but were generated a priori. These varied documents allow us to use automated methods to analyze the number of unique functions shown in each section, using the getParseData() function from the built-in utils package (a sketch of this approach appears at the end of this subsection).

The cheatsheets given to students at the beginning of the semester contained 34 functions for the formula section and 42 functions for the tidyverse section. There was an overlap of 18 functions between the two cheatsheets. Of course, while teaching a real class, an instructor often has to improvise at least a little. So, it is also interesting to consider the number of functions actually shown throughout the course of the semester. To do this, we can consider the functions shown in the filled-in version of pre-lab documents the instructor ended up with after recording the associated instructional video. Considering this data, the formula section saw a total of 37 functions and the tidyverse section saw 50, again with an overlap of 18 functions between the two sections. These numbers make it appear as if in the formula section the instructor showed all functions from the cheatsheet, and then a few additional functions. However, there were actually several functions in the cheatsheet that were never shown in the actual class, and many more functions that appeared in the class that did not make it onto the cheatsheet. For a list of the functions used in both sections, see Appendix A.

In the tidyverse section, there were 9 functions shown in class that did not appear on the cheatsheet, and only 1 function on the cheatsheet that was not discussed in class. In the formula section, however, there were 10 functions shown in class that did not appear on the cheatsheet, as well as 7 functions on the cheatsheet that were not discussed in class. In both classes the majority of functions shown in class were on the cheatsheet. Similarly, both sections saw several common summary statistics, but in the formula section they used the function (e.g.
mean()) on its own, whereas in the tidyverse section base summary functions needed to be wrapped within summarize(). Students in the tidyverse section also saw slightly more summary statistic functions, because one lab called for the five-number summary. In the formula lab, students found the five-number summary as shown in Code Block 3.9.

favstats(~bill_length_mm, data = penguins)

Code Block 3.9: The mosaic::favstats() function provides many common summary statistics for one quantitative variable. The favstats() function automatically drops missing values.

This approach is particularly attractive because it deals with missing values as part of the standard output. In the tidyverse section, the instructor chose to show two approaches. Both approaches are in Code Block 3.10, and both needed to include drop_na() to deal with missing values. Past those similarities, the approaches are divergent.

penguins %>%
  drop_na(bill_length_mm) %>%
  summarize(
    min = min(bill_length_mm),
    Q1 = quantile(bill_length_mm, 0.25),
    median = median(bill_length_mm),
    Q3 = quantile(bill_length_mm, 0.75),
    max = max(bill_length_mm)
  )

penguins %>%
  drop_na(bill_length_mm) %>%
  pull(bill_length_mm) %>%
  summary()

Code Block 3.10: Two approaches for doing summary statistics of one quantitative variable in tidyverse syntax. The first is quite verbose; the second is more compact but introduces a function never seen again.

The instructor should have chosen a single solution to present to students, but was faced with a dilemma. The first tidyverse approach is very verbose, but it follows nicely from other summary statistics students had already seen, just adding a few more functions like min(), max(), and quantile(). The second solution is more concise, but it introduces the pull() function, which was never used again in the course. This brings up an important consideration when teaching coding: how many times students will see the same function. Because there is some cognitive load associated with learning a new function, and repetition helps move information from working memory to long-term memory, it is ideal for students to see each function at least twice (Lovett & Greenhouse 2000, McNamara et al. 2021). When analyzing the number of functions shown in each section, we found there were 7 functions shown only one time in the formula section, and 6 functions only shown once in the tidyverse section.

Overall, neither section appeared to expose students to an overwhelming number of functions. One argument against the use of the tidyverse in teaching is that it contains too many functions. However, when teaching a course (particularly an introductory one) an instructor never shows every function in a package. The tidyverse section saw 32 functions the formula section did not, as compared to the formula section's 19 unique functions, but that difference does not feel practically significant, particularly considering the way in which tidyverse operations often use helper functions. The comparison also underscores the fact that while instructors may say they are teaching tidyverse or formula syntax, they are ultimately teaching R. Both sections saw 18 common functions, many of them from base R.

The practice of analyzing the number of functions shown over the course of the semester was eye-opening. It will provide valuable information for the instructor the next time she teaches the course, as she can attempt to remove functions only shown once, and ensure the cheatsheets better match what is actually shown throughout the semester. The list of functions provided in Appendix A can also serve as a starting point for other instructors as they work to produce curricular materials for introductory statistics classes in R.
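As referenced above, the function counting relies on R's own parser. A minimal sketch of the approach, first extracting the code from an RMarkdown document with knitr::purl() (the file name is hypothetical, and this exact pipeline is a reconstruction rather than the paper's code):

# Extract the R code from a pre-lab RMarkdown document (file name is
# illustrative), then count the distinct functions called in it.
knitr::purl("prelab-01.Rmd", output = "prelab-01.R")
parsed <- getParseData(parse("prelab-01.R", keep.source = TRUE))
functions_shown <- unique(parsed$text[parsed$token == "SYMBOL_FUNCTION_CALL"])
length(functions_shown)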
As discussed in Section 2.1, the number of students who completed both the pre- and post-surveys was low (n = 8 for the formula section, and n = 13 for the tidyverse section), so there are major limitations to paired analysis. The majority of the survey was modeled on a pre- and post-survey used by the Carpentries, a global nonprofit teaching coding skills (Carpentries 2021). Questions ask respondents to use a 5-step Likert scale, from 1 (strongly disagree) to 5 (strongly agree), to rate their agreement with the following statements:

• I am confident in my ability to make use of programming software to work with data
• Having access to the original, raw data is important to be able to repeat an analysis
• Using a programming language (like R) can make me more efficient at working with data
• While working on a programming project, if I get stuck, I can find ways of overcoming the problem
• Using a programming language (like R) can make my analysis easier to reproduce
• I know how to search for answers to my technical questions online

In Figure 3, you can see a visualization of these Likert-scale questions. Based on results below, it became clear that there was no major difference between the two sections, so this plot shows the overall trends without breaking into the groups. Likely, the questions used by the Carpentries were inappropriate for this setting, and a different set of survey questions would have been more appropriate for this group. For example, this class did not include any explicit instruction on searching for answers online. This was an intentional choice, because novices typically struggle to identify which search results are relevant to their queries and get overwhelmed by the multitude of syntactic options they encounter. Instead, students with questions were referred to the "all the R you need" cheatsheet they had been given at the beginning of the semester, which attempted to summarize every function they would encounter. Likely, students still attempted to Google questions, which may be why the responses to the questions about searching online and overcoming problems got more negative over the course of the semester.

Because the questions were on Likert scales, it is not appropriate to compute an arithmetic mean of the differences, but median scores can be computed. To provide a broader picture of the distribution of responses, we also compute the 25th and 75th percentiles for each question. This information is most easily displayed as a boxplot. The boxplots in question can be seen in Figure 4, and a version of the plot broken down by section is in Figure 5. In both Figures 4 and 5, almost all boxes are centered at or touching 0 (meaning the median response did not change over the course of the semester), so there is no overall difference in medians for those questions. The one question where the boxes are not centered at 0 is "I am confident in my ability to make use of programming software to work with data." In Figure 5, the boxes for both sections are centered at a median of 1, meaning the median student answered one level up on the question at the end of the course. Both boxes (the middle 50% of the data) are fully positive, although the lower whisker (minimum value) for both includes zero. Although the sample sizes are quite small, we did attempt inference about a difference in medians between sections using a 95% bootstrap confidence interval with 5,000 bootstrap samples.
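A minimal sketch of that bootstrap, with made-up placeholder vectors standing in for each section's paired pre-to-post changes on this question (illustrative values, not the survey data):

# Placeholder pre-to-post changes on the confidence question, sized to
# match the paired samples (n = 8 and n = 13); values are invented.
formula_diff <- c(1, 0, 2, 1, 0, 1, 1, 2)
tidyverse_diff <- c(0, 1, 1, 2, 0, 1, 0, 1, 1, 2, 0, 1, 1)

# Bootstrap the difference in median change between sections
boot_diffs <- replicate(5000, {
  median(sample(tidyverse_diff, replace = TRUE)) -
    median(sample(formula_diff, replace = TRUE))
})
quantile(boot_diffs, c(0.025, 0.975))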
As expected, that interval contained 0, meaning there was no clear difference between the sections. We did not attempt an analogous inferential task using a theoretical distribution. As a follow-up, we computed a bootstrap confidence interval for the median using both sections together (the data as seen in Figure 4). Again, we used 5,000 bootstrap samples, and a bias-corrected confidence interval because the distribution appeared skewed. In this case, we computed a 99% confidence interval, to help correct for the fact that we cherry-picked the one question that looked significant. The 99% bias-corrected interval was (1, 2.4), indicating that students across sections improved in their confidence with programming over the course of the semester. It is somewhat heartening to know students improved their confidence in programming over the course of the semester, but since there was no clear difference between the sections, this does not provide any strong evidence for one syntax or the other.

In addition to the six questions asked on both the pre- and post-survey, the two surveys also had some unique questions. The pre-survey asked students to share what they were most looking forward to, and most nervous about. Both sections had similar responses. Students wrote they looked forward to "learning how to code!" and "Gaining a better understanding of how to analyze data." Beyond worries related to the pandemic, they expressed apprehension about "getting stuck," "using R," and "Figuring out how to do the programming and typing everything out."

On the post-survey, students were asked to report which syntax they had learned, with an option to respond "I don't know." All students in both sections correctly identified the syntax associated with their lab. Then, they were asked if they would have preferred to learn the other syntax. We hypothesized many students would say 'yes,' thinking the other syntax would have been easier or lack some feature they found frustrating. Surprisingly, though, the majority of students in both sections said 'no,' they preferred to learn the syntax they had been shown. Responses to this question are shown in Table 2. However, part of the explanation is likely that the students did not know what the other syntax looked like. Throughout the semester, the instructor was careful to only expose students to the syntax for the particular section. Several students asked to see the alternate syntax during office hours, but this was the exception and not the norm. An optional follow-up question asked students why they had responded the way they did. Responses to this question are shown in Table 3. Several students suggested a crossover design for the experiment would have allowed them to better compare. This is both a good direction for further work and a possible indication the students were listening during the chapter on experimental design.

Another question on the post-survey asked students "How was the experience of learning to program in R?" Overall, students seem to have positive sentiment toward learning R, whether in the formula or the tidyverse section. As seen in Figure 6, most students said either the experience was "not what I expected - in a good way" or "About what I expected - in a good way." Nothing from the survey responses seems to indicate a difference between the two sections. While the pre- and post-survey results do not suggest interesting differences, the incidental data from YouTube and RStudio Cloud provided some insights.
Section     Response
formula     I've heard that formula was more straightforward
formula     Because I am not familiar with it
formula     I thought the syntax that I learned worked well
formula     Do not really know what the difference is, but also Prof. M was a very good teacher.
formula     I have no idea what the differences are, so I don't really know how to answer this question.
tidyverse   As per my plan to study data Science in graduate school, I would have preferred learning both syntaxes
tidyverse   I really enjoyed this class and have learned a lot.
tidyverse   Tidy, is well tidy. When looking online the other syntax seemed more complex/abnormal
tidyverse   Im not sure what the benefit is.
tidyverse   I really enjoyed tidyverse, it was super easy to learn, and I liked the simplicity of the syntax
tidyverse   I'm not sure I wish we got to experience both so we could compare, maybe learn one for one half of the semester and the other for the other half?
tidyverse   I'm not sure of the difference and I had 0 experience of coding or using anything like r so I didn't have a preference as to which one I learned.

Table 3: Responses to the optional follow-up question asking why students would or would not have preferred to learn the other syntax, reproduced verbatim.

Figure 6: Responses to the question, "How was the experience of learning to program in R?"

Because of the format of the class, which was flipped such that students watched videos of pre-recorded content, we can study overall patterns of YouTube watch time. YouTube offers a data portal which allows for date targeting. We defined each week of the semester as running from Sunday to Saturday, which covered the time from when videos were released through to the time finished labs needed to be submitted (Fridays at 11:59 pm). For each week, we downloaded YouTube analytics data for the channel, and filtered the data to focus only on the videos related to the introductory statistics labs. YouTube analytics data includes the number of views for each video, the number of unique viewers, and total watch time. A "view" is defined as a person playing 30 seconds or more of the video, and unique viewers are counted using browser cookies. By limiting the data to a particular week, we were able to join it with data recording the length of the relevant videos. This allows us to calculate the approximate proportion of the videos watched by each student (sketched in the code at the end of this passage).

Data from YouTube is aggregated, and since videos were posted publicly, could contain viewers who were not enrolled in the class. As a way to check for possible inflated view counts from people not in the class, we checked view counts of lab videos on subsequent weeks. For example, we looked at the number of views on the "describing data" lab (assigned in week 2) during weeks 3-15. Students in the class would be unlikely to watch videos in a week they were not assigned, but the general population on the internet would be less targeted in their timing. Rarely did a video garner more than two views in a week that was not the assigned lab week. This indicates there may have been a very small number of non-student views on videos, but they are negligible. While the public nature of the videos means we do need to view these results with a level of skepticism, we can be reasonably sure the majority of viewers were students.

Studying the data displays some interesting trends. First, we can look at the number of unique watchers per video, seen in Figure 7.
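The watch-proportion calculation mentioned above amounts to a join and one derived column. A sketch with illustrative numbers (column names and values are assumptions, not the actual analytics export):

library(dplyr)

# Hypothetical weekly totals; the real values come from the YouTube
# analytics export joined to a table of video lengths.
analytics <- tibble(
  week = c(2, 3, 4),
  watch_time_min = c(400, 380, 350), # total minutes watched, all viewers
  video_min = c(25, 28, 22),         # total video length that week
  n_students = 21                    # enrolled students per section
)

analytics %>%
  mutate(prop_watched = watch_time_min / (video_min * n_students))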
Interestingly, at the start of the semester there are more unique viewers than enrolled students in the class, but as time goes on, the number of unique viewers levels out at slightly less than the number of enrolled students (n = 21 for both sections). The lower numbers later on make sense because some students were likely unengaged, or found it possible to do their lab work without watching the video. However, the high numbers at the start of the semester are puzzling. Perhaps students were viewing the videos from a variety of devices (phone, laptop, computer at school, etc.) when the semester began. If we assume all viewers were actually students (some students being counted as separate viewers because of different devices or cookie settings such as adblockers or private browsing), we can find an approximate proportion of video content watched, per student. This is shown in Figure 8.

Figure 8: Approximate proportion of video content watched per student, each week of the semester, for the formula and tidyverse sections.

It appears the proportion of video content watched is larger for the formula videos than for the tidyverse videos. We can characterize the difference by computing pairwise differences in the proportion of video watched for each week. The mean of this difference is 11 percentage points, indicating that on average the formula section watched approximately 11 percentage points more of the videos each week. The distribution of differences looked approximately normal, so we can determine if this difference is meaningful by constructing a 95% confidence interval. Using a t-distribution, the interval is (1.9, 20.6). A bootstrap distribution computed with 5,000 samples appears similar.

Initially, it seemed as though the discrepancy in watch proportions could be explained by the tidyverse videos being longer. But, as discussed in Section 3.2, there appears to be no meaningful difference between the length of videos in the two sections. No matter the explanation, this trend is particularly interesting when considered in conjunction with the RStudio Cloud usage patterns in the following section.

The other source of unexpected data came from RStudio Cloud usage logs. RStudio Cloud provides summary data per user in a project, aggregated by calendar month. This data includes all students (n = 42) enrolled in the two sections. The first thing we can look at is the total number of compute hours used per month, as in Table 4. RStudio Cloud instructor plans include 300 hours of compute time per month, and are charged an hourly overage fee ($0.10/hour, as of this writing) for any hours above that number. Since the instructor set up separate projects for each section, we can see usage hours per section. Within a single section, monthly compute hours rarely went above 300 hours per month, but for the two sections combined they always did. Note that data for the month of November is missing for the tidyverse section because of an oversight on the part of the author.

Data is also available on a per-student basis, aggregated by month. This data was downloaded using browser developer tools. This allows us to create Figure 9, which shows the distribution of hours of compute time per section, broken down by month. While the tidyverse section seemed to watch less of the provided videos each week (as discussed in Section 3.5), they appear to spend more time on RStudio Cloud per month. All the distributions in Figure 9 are right-skewed, with several students spending many more hours of compute time than the majority.
It is also important to note these numbers are likely inflated based on the way RStudio Cloud counts usage time. The spaces for both sections were allocated 1 GB of RAM and 1 CPU, so one hour of clock time on the space counted as one project hour (spaces with more RAM or CPU may consume more than one project hour per clock hour), but student usage often includes a fair amount of idle time. RStudio Cloud will put a project to sleep after 15 minutes without interaction, and based on observation of student habits it is likely almost every session ends with 15 minutes of idle time before the project sleeps. In a month with four labs, this can add up to at least an hour of project time that does not correspond to students actually using R. Nevertheless, because the numbers would be inflated in the same way in both sections, we can persist in comparing them.

Using data over the entire semester, students in the tidyverse section had a mean of 13.5 compute hours per month, and students in the formula section had a mean of 11.5 hours. We can also study these numbers per month, as seen in Table 5.

section      September    October       November   December
formula      10.4 (3.3)   13.9 (10.3)   9.4 (6)    7.7 (6)
tidyverse    7.7 (4.7)    17.1 (8.6)    missing    11.5 (7.2)

Table 5: Mean (standard deviation) hours of compute time per student, by section and month.

The mean compute time for both sections increases from September to October, likely because of the increased number of labs that month (two labs were due in September, five in October). Compute time then drops down again for the formula section, and continues downward. November data is missing for the tidyverse section, but time also appears to decrease in this section as months progress, although not to the same degree as in the formula section.

Whereas in the pre- and post-surveys we have quite small sample sizes, the RStudio Cloud data includes all students enrolled in the class. So, we can feel more confident performing inferential statistics. Data was collected at the student level over time, so it is necessary to use a mixed effects model to account for clustering within students. We also need to take into account the longitudinal nature of the data, so we included month as a predictor. We use the lme4 package to fit the linear mixed effects models (Bates et al. 2015). Initially, we fit an unconditional means model, to determine how much variability in compute time was due to differences between students, without considering differences over time or between sections. Based on the intraclass correlation coefficient, we can conclude 30% of the total variation in compute time is attributable to differences between students. After iterating through several candidate models, we arrived at a final model which predicts compute time per month (in hours) using section and month as fixed effect predictors, as well as an interaction effect between section and month. Student identifier was used as a random effect. This final model has the lowest AIC and BIC values of all candidate models. Results from the model can be seen in Table 6 (a code sketch of this model appears at the end of Section 4). The predicted values for each section/month combination match the means computed in Table 5.

It appears students in the tidyverse section spent more time on RStudio Cloud. We can concoct several different scenarios to explain this difference. In one, students in the tidyverse section were more engaged with their work, so spent more time playing with code in R. In another, students in the tidyverse section struggled to complete their work, so spent more time in R trying to get their lab material to work.
A more neutral third option is just that some of the tasks take more code to accomplish (as discussed in Section 3.2.1), so students needed more time to do their work. Because the usage data was collected incidentally after the fact, we have no information about which story is closer to the truth. A follow-up study might conduct semi-structured interviews with students after the completion of the class, to determine more about student experiences and work patterns. It would also be interesting to know if students who spent more time on RStudio Cloud received higher or lower grades on their assignments, but as discussed in Section 3.1, the IRB did not cover graded student work in that way. We do know the two sections did not have an overall difference in mean grade.

Since these results are from a pilot study, they should not be used without caveats. However, they do indicate that if instructors are worried about the amount of time assignments take to complete, they may want to consider using the formula syntax rather than the tidyverse syntax. The results can also be used by instructors attempting to ballpark how many usage hours their classes may take over the course of a month or a semester. Students in the tidyverse section used an average of 13.5 hours per month, and students in the formula section used an average of 11.5 hours. These numbers can be used to make back-of-the-envelope calculations on how much RStudio Cloud would end up costing for a class of a particular size.

This semester-long, head-to-head comparison of two sections of introductory statistics labs provides data comparing two popular R coding styles, the formula syntax and the tidyverse syntax. Pre- and post-survey analysis showed limited differences between the two sections, but analysis of other incidental data, including pre-lab document lengths and YouTube and RStudio Cloud data, presented interesting distinctions. Materials for the tidyverse section tended to be longer in lines of code (likely because of the convention of linebreaks after %>%), although not in terms of the length of the associated YouTube videos. Students in the tidyverse section watched a smaller proportion of the weekly pre-lab videos than students in the formula section, but spent more time computing on RStudio. Conversely, students in the formula section were watching a larger proportion of the pre-lab videos each week, but spending less time computing each month. These two insights are slightly contradictory: perhaps the formula section students found the concepts more complex as they were watching the videos, but then had an easier time applying them as they worked on the real lab.

There is much interesting further work that could be considered. As students suggested, a cross-over design where students saw one syntax for the first half of the semester and the other for the second half would allow for better comparisons. However, there are a few caveats here. First, anecdotal evidence from many instructors suggests it is best for students to see only one consistent syntax over the course of the semester. The other challenge is that the formula syntax tends to seep (albeit only minimally) into the tidyverse section. For example, when doing linear regression both sections saw the lm(y ~ x, data = data) formula syntax, because the instructor chose not to introduce the tidymodels package.
If a cross-over design used the existing materials from these classes, just swapping the final few weeks, students in the formula section would likely see more that was familiar to them than students in the tidyverse section. This could potentially be remedied by the inclusion of tidymodels for things like linear regression. In fact, because of this seepage, the tidyverse students almost did have a cross-over design. This may be why the number of hours of compute time for the tidyverse section remained consistent from November to December (even though there were fewer instructional weeks in December) while the formula section's hours of compute time decreased.

Another follow-up study that would be interesting to complete would look at student success in subsequent courses. Because tidyverse syntax is frequently used for higher-level courses, students who were in the tidyverse section may have an easier time in those later courses. However, most students in the classes under consideration here will not go on to take further statistics courses. So the takeaways about syntax choice may vary depending on the student population to which they will be applied.

One criticism of the tidyverse is how many functions the associated packages contain. Therefore, another interesting insight from this head-to-head comparison is the number of unique functions needed to cover a semester of introductory statistics in R. While the tidyverse section exposed students to 32 functions, compared to the 19 functions shown in the formula section, both labs focused on a relatively small number of functions. Because there were 12 labs in the semester, this averages out to approximately 3 functions per lab for the tidyverse section compared to an average of 2 functions shown in the formula section. The tidyverse section saw more unique functions, but both sections were limited to a small vocabulary of functions for the semester. We recommend instructors follow this approach regardless of syntax. Instructors should attempt to reduce the number of functions they expose students to over the course of a semester, particularly in an introductory class. This will help reduce cognitive load. The exercise of counting R functions in existing materials, using the getParseData() function, is one we recommend all instructors attempt, particularly before re-teaching a course. It can be eye-opening to discover how many functions you show students, and which functions are only used once.

We hope this work helps answer some initial questions about the impact of R syntax on teaching introductory statistics, while also raising further questions for future study. While some aspects of the analysis from these classes suggest the formula syntax is simpler for students to learn and use, there are still many course scenarios for which we believe the tidyverse syntax is the most appropriate choice. While formula syntax can be used throughout an entire semester of introductory statistics, it does not offer functionality for tasks like data wrangling. This means students who will go on to additional statistics or data science classes may be better served by an early introduction to tidyverse. However, in order to determine this conclusively, additional study would be needed. No matter which syntax an instructor chooses, it appears possible to limit the number of functions shown in a semester, and provide students with a positive learning experience.
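As referenced in Section 3.6, a minimal sketch of the final mixed effects model, fit with lme4, follows. The data frame and column names are illustrative assumptions (with simulated placeholder values); the actual anonymized data and code are in the paper's GitHub repository.

library(lme4)

# Simulated placeholder data: one row per student per month. The real
# analysis used the per-student RStudio Cloud usage described in Section 3.6.
set.seed(220)
usage <- data.frame(
  student = factor(rep(1:42, each = 4)),
  section = rep(c("formula", "tidyverse"), each = 84),
  month   = factor(rep(c("Sep", "Oct", "Nov", "Dec"), times = 42)),
  hours   = rnorm(168, mean = 12, sd = 5)
)

# Unconditional means model, used for the intraclass correlation
null_model <- lmer(hours ~ 1 + (1 | student), data = usage)

# Final model: section, month, and their interaction as fixed effects,
# with a random intercept per student
final_model <- lmer(hours ~ section * month + (1 | student), data = usage)
summary(final_model)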
Thanks to Sean Kross for his guidance about parsing R function data, Christina Knudson for her help with mixed effects modeling, and Nick Horton for his useful comments on the paper overall.

References

Interleaving Computational and Inferential Thinking: Data Science for Undergraduates at Berkeley
Fitting Linear Mixed-Effects Models Using lme4
Analysis of Student Use of Video in a Flipped Classroom
Software for Learning and for Doing Statistics
Infer: Tidy Statistical Inference
The Carpentries Survey Archives
An educator's perspective of the tidyverse
The Fathom experience: Is research-based development of a commercial statistics learning environment possible?
Guidelines for Assessment and Instruction in Statistics Education College Report
Colorgorical: Creating discriminable and preferable color palettes for information visualization
How video production affects student engagement: An empirical study of MOOC videos
ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps
palmerpenguins: Palmer Archipelago (Antarctica) penguin data
ggformula: Formula Interface to the Grammar of Graphics
TinkerPlots (version 0.23): Data Analysis Software
Data Science as a Route to AI for Middle- and High-School Students
Applying Cognitive Theory to Statistics Instruction
Bridging the Gap Between Tools for Learning and for Doing Statistics
R Syntax Comparison Cheatsheet
Computing in the Statistics Curriculum: Lessons Learned from the Educational Sciences (USCOTS 2021)
Evaluating the Design of the R Language: Objects and Functions For Data Analysis
styler: Non-invasive Pretty Printing of R Code
PEP 20 -- The Zen of Python
The mosaic package: Helping students 'think with data' using R
mosaic: Project MOSAIC Statistics and Mathematics Teaching Utilities
R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing)
A Randomized Controlled Trial on the Wild Wild West of Scientific Computing with Student Learners
Measuring Formative Learning Behaviors of Introductory Statistical Programming in R via Content Clustering
RStudio Cloud -- Do, Share, Teach, and Learn Data Science
An Empirical Investigation into Programming Language Syntax
An Empirical Comparison of the Accuracy Rates of Novices using the Quorum, Perl and Randomo Programming Languages
CODAP -- Common Online Data Analysis Platform
The Tidyverse Style Guide
Welcome to the Tidyverse
The flipped classroom: A review of the literature