Teaching for large-scale Reproducibility Verification
Vilhuber, Lars; Son, Hyuk Harry; Welch, Meredith; Wasser, David N.; Darisse, Michael
2022-03-31

We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.

The purpose of scientific publishing is the dissemination of robust research findings, exposing them to the scrutiny of peers. Key to this endeavor is documenting the provenance of those findings. Recent years have seen significant concerns expressed about the robustness of scientific results, commonly referred to as the "replication crisis" (Fanelli, 2018; Gall et al., 2017; Hamermesh, 2007; King, 1995). Various facets of the "crisis" have been explored (for just some of these, see Camerer et al., 2016; Hamermesh, 2007; Olken, 2015; Stodden et al., 2018). Various approaches and solutions have been called for and proposed by the National Academies (National Academies of Sciences, Engineering, and Medicine, 2019) and by committees of the National Science Foundation (Bollen et al., 2015), and many scientists have called for greater transparency of research practices and more assurance that published research is reproducible (Bell & Miller, 2013; Clemens, 2017; Coffman et al., 2017; Höffler, 2017b; Stodden et al., 2016). Learned societies have a role to play within this discussion, as do journals.

We should note here that the terms "reproducible" and "replicable" are not well-defined. Throughout this article, we use them as defined in Bollen et al. (2015) and National Academies of Sciences, Engineering, and Medicine (2019): (computational) reproducibility is "obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis" (National Academies of Sciences, Engineering, and Medicine, 2019, p. 36). Replicability is achieved in this context through a relaxation of certain constraints implicit in the definition of reproducibility, for instance by collecting new data or implementing different methods or code, and then "obtaining consistent results across studies aimed at answering the same scientific question [...] [obtaining] consistent results given the level of uncertainty inherent in the system under study" (ibidem). These definitions are broadly accepted in the social science and statistics communities nowadays, but other communities may use these terms somewhat differently (e.g., Heroux, 2015). We refer to "replication packages" as the collection of materials provided by authors to enable others to replicate the results, but which should be "reproducible" themselves.

For empirical articles, the foundations on which they rest (the data and their analysis) are external to the article and often to the journal in which they are published.
The data posting policies of many societies and journals, including the American Economic Association's pre-2019 policy (Bernanke, 2004), were and are intended to create a minimal framework from which to replicate empirical findings. Historically, they have often failed. In several studies (Camerer et al., 2016; Chang & Li, 2017; Höffler, 2017a), at least half of the replication packages associated with surveyed manuscripts failed to (fully) reproduce the results in the manuscript when re-run. Increasingly, societies and journals have therefore switched to verifying and monitoring these policies (Editors, 2021; Jacoby et al., 2017; Vilhuber, 2019).

The American Economic Association (AEA), the largest association of professional and academic economists in the world, with over 20,000 members located in 148 countries, has been at the forefront of such policies. It publishes eight journals, including one of the top five journals in the discipline, in addition to several well-respected field journals. Concerns about the reliability and robustness of economic research have circulated in the AEA's membership for more than 30 years (Dewald et al., 1986; McCullough & Vinod, 1999). The policy to require that articles provide copies of their replication materials, first implemented in 2004 (Bernanke, 2004), was highly innovative at the time and reflective of the membership's requests. Nevertheless, these early efforts improved the availability, but not necessarily the reproducibility, of these replication packages. In 2018, the Association appointed one of the authors of this article (Vilhuber) as the inaugural Data Editor (Duflo & Hoynes, 2018). The Data Editor, in turn, started verifying, prior to final acceptance of an article, the computational reproducibility of the results displayed in the manuscript. The AEA's endeavor is the largest in scale amongst the journals and societies conducting such verifications, having verified more than 1,000 articles since initiating such verifications two years ago.

The verification of replication packages, which checks not only the computational reproducibility of the provided materials but also verifies the documented provenance and completeness of such materials, is not a magical solution that will solve the "reproducibility crisis." Replication packages may be reproducible, but wrong (see the recent discussion surrounding Simonsohn et al., 2021). They do, however, reduce the cost to the scientific community of finding and assessing such issues (finding the issue documented in Simonsohn et al., 2021 would have been much harder without a complete replication package), and they can surface other issues much earlier. For instance, the problem underlying the recent retraction in the Journal of Finance (2021) would have been detected prior to publication, not two years after publication. In the case of restricted-access data, pre-publication verification, when possible, may sometimes be the only opportunity to conduct such checks. Whether conducting reproducibility checks prior to publication is the most efficient or most effective exercise remains an open issue, including within the AEA's discussions on this topic. In this article, we describe the AEA's activities and how they contribute to this discussion.

The AEA began conducting comprehensive pre-publication reproducibility verification for conditionally accepted manuscripts at its eight journals in 2019.
These checks are conducted by the Labor Dynamics Institute (LDI) Replication Lab, which was set up by the current AEA Data Editor (Vilhuber), and which we describe in more detail in the next section. The Lab hires and trains undergraduate students who are primarily responsible for performing the required checks. The Lab's work with students is not integrated into any curriculum. Nevertheless, we will argue that students acquire some of the key data and computational skills described in National Academies of Sciences (2018). These abilities are the result of both the Lab's training and the observation of numerous completed but imperfect research projects. We also argue that students gain "hands-on" experience with some of the key dimensions of data-oriented science. The typical student will work on dozens of replication packages during their time at the Lab, with increasing autonomy along the way. These packages are of varying levels of complexity and completeness, and the students are required to assess their compliance with evolving and multi-faceted standards. This combination of taught and experiential learning provides the students with a strong foundation in data and computational management. The goal of this article is to describe the setting, the selection process for students, and the actual workflow, and to sketch out some of the observed outcomes. We hope to show that, while the setting may currently be fairly unique in its scale and position within the academic publication cycle, it is feasible to implement in a broader setting, and can meaningfully contribute to students' data science education.

The AEA's Data and Code Availability Policy (DCAP) states that "[i]t is the policy of the American Economic Association to publish papers only if the data and code used in the analysis are clearly and precisely documented and access to the data and code is non-exclusive to the authors" (American Economic Association, 2019, 2020). To achieve the goal of improved (pervasive) computational reproducibility, the authors of conditionally accepted manuscripts are required to submit a replication package, consisting of all code used to process and analyze data, any data not available in an existing trusted repository, and a "README" describing data provenance and processing instructions. (The current requirements for authors are described in much detail at aeadataeditor.github.io/aea-de-guidance/, but have varied over time. Confidential data are not part of the replication package, but must be described in the README, and are regularly made available to the Data Editor privately. As of February 2022, the README should conform to Vilhuber, Connolly, et al. (2020), but this requirement, too, has varied over time.) Each replication package is assessed in terms of data provenance, clarity of the description, and computational reproducibility.

These checks are conducted by the LDI Replication Lab (henceforth "the Lab"). The Lab was created by one of the authors of this article, the AEA Data Editor (Vilhuber), based on earlier work starting in 2014, described in Kingi et al. (2018). The Lab hires and trains undergraduate students who are primarily responsible for performing the required checks. These students are supervised by the Data Editor, the assistant Data Editor, and a graduate student. In a given calendar year, a total of about 50 undergraduate students work in the Lab, with approximately 15-20 actively working on replications in a given week. Including the pilot phase, the Lab has hired over 100 Cornell University undergraduate research assistants (RAs) to complete reproducibility verification checks. While not a primary objective of the Lab's work for the AEA, the reliance on undergraduate students is intentional. The project provides undergraduate students with a unique opportunity to gain insights into the research and publication practices of hundreds of economists every year.
Students are trained on reproducibility techniques, best practices, and communication skills, addressing several of the key dimensions highlighted in the National Academies of Sciences (2018) report, namely "data acumen" and knowledge of some "computational foundations," "data management and curation," as well as "workflow and reproducibility" skills. Upon graduation, these students in turn take those skills into non-academic workplaces or graduate studies, thus seeding the next generation's improved practices. The Lab, therefore, attempts to target both ends of the academic publication pipeline: the "outlet," by improving replication packages of conditionally accepted publications, and the "inlet," by improving skills and attitudes of students at an early stage in their careers. This article will focus on the latter.

The type of activity conducted at the Lab (systematic computational reproducibility checks within a publication workflow) is relatively novel. In particular, we are not aware of other institutions conducting such activities with undergraduate students. Other institutions that conduct reproducibility checks, such as the Odum Institute and cascad (Pérignon et al., 2019), primarily conduct such work with graduate students and professionals.

The Lab relies on students having some prior experience with statistical software, so as to be productive in short order. Despite the growth of open source software in the last decade, the software used in top economics journals is still overwhelmingly proprietary (Stata, Matlab; see Figure 2 in Vilhuber, 2020). Furthermore, most usage of said software by economists is still strongly desktop-oriented. This is reflected in the replication packages received, and determines the required skill set of the students. However, the operating paradigms of most of these software packages remain quite similar. We have found that experience in any one of these packages is sufficient to allow students to follow instructions and conduct reproducibility verifications. Through prior experience, we have also found that knowledge of software more frequently used in computer science or engineering (for instance, Java or C++) was not particularly useful, driving our current requirement of experience with what we know to be similar computing paradigms. Of note, while we require some exposure to statistical software commonly used in the social sciences, students are not expected to master it, nor to be proficient programmers.

These requirements are described in job postings, circulated among various undergraduate student experience coordinators on campus, and published to the campus-wide employment opportunity website and the LDI website. The posted job description provides applicants with information on requirements, wage rate, and maximum hours. As of January 2022, undergraduate RAs in the Lab are hired at the starting end of the Level II (out of four) classification wage rate, which, as of 12/31/2021, is $13.45 per hour.
Hours are limited to 10 hours per week, which is less than the maximum number of hours allowed by Cornell University policy. Students are recruited from across campus, but most applicants are from the social sciences. Students apply with a cover letter and a CV, and are selected based primarily on observed and self-declared experience with one of the common statistical software packages used in the social sciences. The field they are majoring in plays no role, but is correlated with experience in these statistical software packages. Since most of the manuscripts stem from economics, most applicants have an interest in, and often a major in, economics. No explicit years-of-study requirement is targeted. In practice, we have hired sophomores, juniors, and seniors, but not freshmen. In the most recent two rounds, we had 48 applications, and selected 21 for training. Of the 48 applications, 21 mentioned economics as one of their majors, 7 statistics, 9 computer or information science, and 21 none of those. Of the 21 students selected for training, 16 had an economics major, 5 had a statistics major, and 4 had a computer or information science major, with 4 having declared none of those majors.

Although we announce the minimum requirements upon recruitment, the skill levels of the trainees are often heterogeneous, which could raise problems in conducting reproducibility assessments. Therefore, we train applicants before making final hiring decisions. The training is not remunerated, but is also not meant as a test. Rather, our intention with the training is to upskill all applicants. In practice, we retain over 90% of trained applicants for at least one semester, and nearly all attrition in the past has been voluntary. We currently provide training three times a year: before the fall and spring semesters, for students joining the Lab in that semester, as well as at the end of the spring semester for students joining the Lab as a summer job.

During this initial training, we provide instruction on all essential skills and knowledge necessary for the tasks. The base training includes an overview lecture on the context of reproducibility concerns in economics, provides knowledge on reproducible practices, data provenance, and data citations (Data Citation Synthesis Group & Martone, 2014), includes a presentation on the Vilhuber, Connolly, et al. (2020) README, basic instructions on command line and version control systems, and a detailed walkthrough of the assessment process, including how to prepare the final reproducibility report sent back to the authors. (The agenda for the January 2022 training can be viewed at labordynamicsinstitute.github.io/replicability-training/, most of the slides are available at labordynamicsinstitute.github.io/replicability-training-presentation, and the textual content of the training, which the presentation loosely follows, is available at labordynamicsinstitute.github.io/replicability-training-curriculum/.) The training also reinforces computational skills, but does not teach basic computational skills, since those were part of the selection criteria. Many of these topics are new to social science undergraduate students. Depending on circumstances, in particular during the unusual 2020-2021 period, base training has been conducted as an intense in-person 8-hour training day, as a sequence of 2-4 hour training sessions spread over multiple days on Zoom, or even as an 8-hour-long virtual training session.

The intense basic training is followed by a sequence of targeted test cases, interspersed with additional short lectures, reinforcing and deepening certain aspects (data provenance, debugging), as well as helping students acclimate to the Lab's task scheduling, reporting, and workflow systems.
Again, in adapting to changing circumstances over the last several semesters, these test cases and additional lectures have been concentrated into the three days immediately following the base training, stretched out over the next three weeks, or held over only the following week. Students work on each test case on their own, following detailed step-by-step instructions, then interact with more experienced peers (undergraduate students already employed by the Lab), and finally submit the report for each test case (as described in the Workflow) to the senior instructors and Lab leaders. Each case is discussed as a group before the next test case is presented and assigned, and students receive both generic feedback (commonly made errors or omissions) and individualized feedback, the latter generally prepared by one of the graduate assistants.

The first test case uses the dominant software package Stata, and introduces students to several small impediments to reproducibility. The case is a simple fake article with one table. The notion of "add-on packages" (libraries, etc., that need to be installed) is introduced. Stata, R, and many other statistical software packages rely on such add-ons, but they need to be specified for a replication package to be considered complete, since replicators may not have them installed. Authors of replication packages often neglect to identify such packages, and if a package is not available, the code will fail. The first test case uses one such package, but does not specify that it is needed. Students learn to recognize the type of error this generates, how to solve the problem, and how to document both the existence of the problem and the solution (a minimal sketch of the kind of setup script that addresses this is shown below). Second, students are introduced to the idea of publicly accessible data that cannot be provided as part of a replication package. In this case, the package uses a dataset made available by ICPSR to researchers at member institutions. The terms of use specify that the data should not be redistributed. Students therefore need to download the data themselves. They learn to document any requirements (such as registration requirements, costs, application procedures, etc.), and to evaluate whether the description of the access conditions in a README is complete and sufficient. Finally, this test case is the first exposure to the structured replication template, the use of git and Markdown, and how to navigate the workflow process.
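To make the first impediment concrete, the following is a minimal sketch, written by us for illustration, of the kind of setup script authors are asked to provide so that add-on packages are declared and installed up front. The package names and the local "ado" folder are assumptions chosen for this example; they are not part of the actual test case or of the Lab's materials.

```stata
* setup.do -- illustrative sketch only; package names and folder layout are
* assumptions for this example, not the Lab's actual materials.
* Purpose: declare and install the add-on (user-written) packages that the
* replication code relies on, so a replicator's run does not stop with an
* "unrecognized command" error.

* keep installed ado-files inside the replication package, leaving the
* replicator's own Stata installation untouched
capture mkdir "ado"
sysdir set PLUS "`c(pwd)'/ado"

* user-written packages required by the (hypothetical) analysis code
local ssc_packages "estout reghdfe outreg2"

* install each package from SSC only if its command is not already found
foreach pkg of local ssc_packages {
    capture which `pkg'
    if _rc != 0 {
        ssc install `pkg', replace
    }
}
```

Without such a script, running code that calls an uninstalled package produces Stata's characteristic "unrecognized command" error (typically r(199)); recognizing that error, installing the missing package, and documenting both the problem and the fix in the report is precisely the exercise in the first test case.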
The second test case uses the second-most frequent software in economics, MATLAB. Students are again provided with an article, except this time it is a real article, with all its complexity and idiosyncrasies. Students are now faced with identifying whether all code is provided for the tables, figures, and numbers in the article. It is a surprisingly frequent problem with replication packages that some significant part of the code is not provided. Sometimes the missing code is used for appendices, sometimes it is data cleaning code, and sometimes it is a key part of the overall replication package. In this case, both data cleaning code and code for some manuscript figures are missing. As for data, all the data appear to be provided, and students are challenged to identify whether data provenance is sufficiently documented. Numerically, it is a "small data" case, so it is actually possible to compare the source data with the provided data. Finally, the use of MATLAB pushes most students out of their comfort zone, and part of the training is to guide them to the similarities in the user interfaces of statistical software, rather than focusing on the differences in the programming languages. The second test case is also the second time that the workflow is navigated and the template used, instilling familiarity with the tools that will be used in the Lab.

The third test case introduces students to more data provenance issues, while making it harder to verify that all the data are present. The case is a real article that uses confidential data that cannot be shared. Thus, students are challenged to identify whether all the data appear to be documented, by reading not just the README, but also the data section in the article. The article relies on Scandinavian register data and includes a thorough discussion of these data (a luxury not always present in other articles that students will later encounter). Students assess whether the data are properly cited, and write the report without ever running any code.

Both the second and third test cases rely on articles that were subsequently published. These cases were chosen because they give students practice with common issues they will face, while not being too computationally difficult or time consuming. The second test case emphasizes some of the technical skills needed to evaluate reproducibility (running code in an unfamiliar program, identifying parts of code in relation to tables/figures, etc.), while the third test case helps students learn and practice how to identify data sources and verify packages for completeness without being able to run code. Students are shown how the final, published article and package differ from the earlier versions that they were provided with, and can see the impact that the reports can have on the clarity and reproducibility of scholarly publications.

As noted, each of the training test cases is accompanied by an initial presentation of the topic, significant independent work that relies on prepared documentation, a mentoring session with an experienced undergraduate replicator, and a debriefing session in which students provide their first impressions and instructors provide both specific and generalized feedback. At the end of the initial training, students join the regular Lab meetings and are assigned real cases. However, we generally take care to assign them cases that we believe to be easier to complete, slowly ramping up the complexity. We can do this because we have overlapping cohorts of students, with varying levels of experience, present in the Lab.

We now describe the generic workflow for the verification activities conducted in the Lab. A simplified workflow is shown in Figure 1; a more detailed workflow with specific instructions for students is publicly available and regularly updated. The Lab receives a request to verify a replication package, generating a case number.
Such replication packages generically consist of (or should consist of) computer code, possibly but rarely software, data that can be redistributed, and a document, generally referred to as the "README," describing the provenance of all data, including data not made available as part of the package, and the process to reproduce the results in the paper. Materials are usually deposited by authors at the AEA Data and Code Repository, though deposits made at other trusted repositories are also accepted (but are rare). The associated manuscript is also made available to the Lab.

The case is assigned to the next available student. At this stage, no allowance is made for student skills or experience. The first part consists of an assessment of the README, the manuscript, and the package contents. (We note that the AEA states that authors should use the standardized README template (Vilhuber, Connolly, et al., 2020). Many of the issues that the Lab encounters could be addressed by correct use of the README template.) Students assess whether data provenance is completely described, whether the manuscript and/or the README contain data citations (Data Citation Synthesis Group & Martone, 2014), and whether there appears to be a complete archive of the program code (consistent with the README). Data provenance is assessed and verified even when data are provided. When data are not provided, data provenance is even more important, as it must completely describe how the student can obtain the data. This can range from simple click-through downloads to complex application processes. The computer code and the README are assessed in terms of the software requirements. Finally, the computational resources are summarized, if reported in the README or deduced from other descriptions. This information is summarized in a preliminary report. One of the longest tasks is the data assessment task, since some manuscripts combine several dozen distinct data sources, and the quality of the description still ranges from excellent to poor.

Once the preliminary report is finalized, it may be discussed with senior Lab members (team leads, graduate students, Data Editor): Does the assigned student have the necessary skills? Does the Lab have access to the software? Does the Lab have access to the computing resources (specific OS, high-performance computing)? Does the Lab have access to the data, or can it obtain access to the data without an unreasonable delay? Do the computations run in a reasonable amount of time, given all the resources available? Based on these assessments, the case may be reassigned to a student with the requisite experience, or to a cooperating third party with access to data or computing resources, or, most frequently, it remains with the original student.

The student assigned to conduct the computational reproducibility check then downloads any necessary data, and follows the instructions in the README to run the programs provided. This can be as simple as running a single "main" controller program (a minimal sketch of such a script follows below), or as complex as running and baby-sitting dozens of programs, as well as conducting certain activities by hand. While much software can be automated, it is currently still very frequent to see automatable tasks conducted manually. Most often, this involves parsing computer output to generate tables (despite perfectly good code to do so programmatically) and generating maps using GIS software (despite perfectly good automated software to do so).
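For concreteness, here is a minimal sketch of the kind of top-level controller script that well-organized packages provide. It is our own illustration under assumed file and folder names (setup.do, code/01_clean_data.do, and so on), not a structure mandated by the AEA or taken from any particular package.

```stata
* main.do -- illustrative sketch; every file and folder name here is an
* assumption chosen for this example, not a prescribed AEA structure.
* A single entry point that runs the whole pipeline in order, so that a
* replicator can attempt to reproduce every result with one command.

clear all
set more off

* record when, and under which Stata version, the run took place
display "Run started:   `c(current_date)' `c(current_time)' (Stata `c(stata_version)')"

* set the project root once; all subsequent paths are relative to it
global ROOT "`c(pwd)'"

* 1. install required add-on packages (see the setup sketch earlier)
do "$ROOT/setup.do"

* 2. data preparation: build the analysis datasets from the raw inputs
do "$ROOT/code/01_clean_data.do"

* 3. analysis: estimation and simulations
do "$ROOT/code/02_analysis.do"

* 4. output: write the tables and figures referenced in the manuscript
do "$ROOT/code/03_tables_figures.do"

display "Run completed: `c(current_date)' `c(current_time)'"
```

When a package provides such an entry point and the README states the software requirements, the replicator mostly edits the root path (if needed) and lets the pipeline run; when it does not, the student must reconstruct the order of execution from the README and document every manual step in the report.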
It is the student's job to reproduce the manuscript's results following the authors' instructions as closely as possible, though this has some limits, in particular for very inefficient code or instructions. While all of the above steps appear to be very systematic, there are many subjective assessments. When are the manual steps too onerous? When is a deviation from stated computer runtimes unusual or long enough that action must be taken? The students are queried frequently about progress, and are required to discuss their active cases at twice-weekly meetings. In those meetings, all cases are discussed openly, and decisions are made. Those decisions do not necessarily come from the Data Editor, although he is present at most of these meetings. Students are encouraged to propose reasonable solutions, based on their experience, and increasingly make decisions autonomously. Input to such decisions may also come from team leads, or from graduate students. Students can also use a mailing list for peer-to-peer conversations and problems, and are encouraged to contact the student pre-approvers and the senior Lab members (Data Editor, assistant Data Editor, and graduate student) with more specific problems, which may be resolved in one-on-one meetings.

Once the results have (or have not) been reproduced, the student completes a report. While based on a template, the report has many free-form and narrative components, describing the steps the replicator undertook to achieve full computational reproducibility. Students are trained and mentored in how to convey steps in a concise but complete way, allowing the reader to understand fully what was done, and why it may not have worked. Students have access to a bank of frequently used "canned responses" that provide constructive feedback and a checklist of requests. (The canned responses can be found at https://perma.cc/U8MR-WEEZ; the latest version is available in the students' cloned template every time they initiate a new verification.) They are also mentored in providing an objective, positive, and dispassionate tone, as they are communicating with much more senior members of the profession. The report is reviewed by more experienced students ("pre-approvers"; we do not discuss the supplementary training for pre-approvers here), and then reviewed again by the Data Editor before being sent to authors. (A sample report, suitably anonymized, can be viewed at https://perma.cc/Z577-EEHG and is included in Appendix C.) Students get feedback on how they performed on each case, and when improvements need to be made to the report or to the methods employed, that feedback is provided both directly (privately) to the student and in the form of generalized and anonymized counsel in the group meetings.

Most reports require that authors improve their package. In some cases, in particular when computational reproducibility was not achieved for key tables and figures, the replication package will go through another cycle of review after resubmission by the authors. When possible, such re-submissions are assessed by the same replicator. This is done both because it is efficient (the replicator will already have downloaded all the necessary data, often many gigabytes, into their personal replication workspace) and because it provides validating feedback to the replicator, showing them that authors are (generally) quite responsive to constructive comments.
Students then write an abbreviated revision report, which goes through the same review cycle as the full initial report. The cycle described above is repeated for every new case assigned to the Lab. The typical case will be in the student's hands for about 10-14 days, though students may sometimes work on two or more cases simultaneously, as they wait for code to run or data access requests to be authorized, and some cases may take significantly longer. To give context, we refer the reader to the AEA Data Editor's annual reports (Vilhuber, 2019, 2021, 2022; Vilhuber, Turitto, et al., 2020). In the most recent year, the LDI Lab received 529 requests for 415 manuscripts (Vilhuber, 2022). The vast majority of manuscripts go through a single round of reviews, typically with minor changes requested (Vilhuber, 2022, Tables 2 and 3). The median time to full acceptance is between 4 and 6 weeks (reported in various annual reports by the AEA journals).

The recruitment, training, and regular activities have been refined in an iterative process since early 2018. After each training, we have reviewed the effectiveness of our methods, surveyed trainees about their perception of the training, and incorporated improvements into the next round. For instance, the peer-driven tutorial as part of the test cases was based on trainee feedback, and has proven both popular and effective. As students accumulate cases, we have observed fairly rapid gains in maturity and autonomy. The best students may be promoted to team leads, where they run the shorter check-in meetings mostly autonomously, are expected to provide a first level of support to their team members, and may be asked to be "pre-approvers." Students contribute regularly to overall improvements in processes and procedures, and are encouraged to contribute to a Wiki, with the intent of providing a knowledge base that is driven by the students, for the students. In certain circumstances, students may give short (verbal or written) tutorials or presentations at group meetings when they have solved a particularly thorny problem whose solution is of potential utility to all the students.

Two types of student-faculty interaction occur during the Lab activities in which the students participate. First, students meet twice a week with the faculty supervisor (the Data Editor) to discuss individual cases. These meetings are group meetings, and take as long as necessary to guide the student onto the right path. Where appropriate, one-on-one meetings with the faculty supervisor or graduate students are also scheduled, to solve thornier problems. The second type of interaction is more indirect. All reports written by students, once approved, are read by manuscript authors, and may be read by journal editors. Students also read the authors' responses to any requested changes. Authors and editors are most often academic faculty. Students do not, however, interact directly with authors and editors; in fact, this is explicitly forbidden by the Lab's rules of behavior. Thus, this indirect interaction with academic and private-sector faculty is always mediated via structured written communication.

The activities described here are not offered as a regular for-credit course. Rather, they are offered as an on-campus, research-oriented job. As such, they do not neatly fit into a curricular development or plan. However, our impression is that students gain valuable experience through participation in the Lab's activities.
This (admittedly subjective and potentially biased) opinion relies on a few observations. For one, almost all students stay with the Lab until graduation, and sometimes even beyond graduation. They also tend to continue at the Lab over the summer, despite having other summer jobs or internships (the hours are adjusted to accommodate such constraints). Such implicit "job satisfaction" indicates that students believe the experience is valuable, despite having access to higher-paying jobs during summers and post-graduation. Second, while most students come to us with some prior exposure to programming and data analysis, they largely need training on both the conceptual and technical skills necessary to assess reproducibility. This suggests that they are not currently acquiring these skills elsewhere in their coursework. While the various curricula on campus do offer some of these elements explicitly, and other courses likely embed these techniques within their more discipline-specific topics, there does seem to be a need to more generally expose undergraduate students to these techniques. The Lab's training, while not designed as general instruction in such techniques, seems to complement other curricular offerings. For example, while many students have read academic papers in other courses, they rarely have the skill set, upon joining the Lab, to identify and summarize the data sources used in a paper. This is a skill we train students in so that they can evaluate the existence or completeness of data citations.

We have not previously conducted a formal evaluation of the takeaways and experiences undergraduates have had after working for the Lab, and we do not systematically conduct exit interviews with undergraduates at the end of their job tenure. We have, however, conducted informal conversations with the goal of a formative evaluation of the initial training and later activities. A few graduates have reported that they have "learned a lot about reproducibility" in ways that "really helped me as a research assistant" (2021 economics graduate working for a large national economic consulting firm), and that they received "overwhelmingly positive feedback on my documentation method in code reviews, which is all thanks to my time with LDI" (2020 sociology graduate working for a non-profit research organization). A more formal survey of former Lab members is currently in the field, and we plan to report on outcomes in the future.

One of the difficulties of empirically measuring the effect of the explicit initial training and the implicit on-the-job training provided is the noisiness of most empirical measures. Students are not required to actively program, but they will learn programming techniques. No student will have been exposed to data citation principles prior to joining the Lab, and yet all will have learned and applied such principles by the end of the initial training, and refined them over time. The students' efficiency at conducting computational reproducibility checks will invariably increase, but the allocation of papers is not random, and the difficulty of papers varies so widely that any objective measure of time or effort will likely be too noisy to be useful. We continue to consider ways to measure this in the future.

When initially planning how to translate the AEA's verification activities into a feasible operation, the reliance on undergraduate students was intentional, for two reasons.
For one, we believed that, with proper training, undergraduates would provide a more cost-effective verification of the basic computational reproducibility of the packages we received than a rotating cadre of graduate students would. This has largely been borne out, though we do not provide evidence here on the financial underpinnings of the operation. More importantly, however, we intentionally forced ourselves to develop a training program for those undergraduates, believing that it would have utility for other universities, disciplines, journals, and other circumstances.

If an instructor wanted to directly integrate these activities into a curriculum, it is likely best to implement them as a formal course. This course should include the training that research assistants currently receive, as well as training in how to actively create reproducible packages on their own. The initial training alone accounts for about 14 hours of classroom instruction, plus significant "homework" time. With a straightforward and pedagogically valuable expansion of some of the themes that get short shrift in the current training because they are taught and learned in the first "real cases" (git, version control, objective communication skills), a course based on these materials would easily cover 21 hours of classroom training. Test cases could be expanded to have higher and more varied computational requirements, and can easily be based on real replication packages, suitably modified to highlight specific learning objectives. Student assessments could be based on recognition of the required components of reproducible packages, followed by written reports on actual replication packages, and concluding with the students creating their own replication package, based on either a research project from another course or a novel one developed as part of this course. If using a research project from another course, this replication course could be paired with, for example, an upper-level course with a term paper component. While it may seem attractive to embed some of the minor activities into an existing course, our experience with the particular student population from which we recruited leads us to conclude that the necessary start-up training is intensive enough to require a standalone course.

We note, however, that almost none of our recruits have taken more explicitly data-science-oriented courses, as far as we can tell. Our suggestions are thus likely not representative when more technically advanced students are included in the population of interest. We do, however, believe that what we observe is not specific to economics, and could be easily implemented in other (related) disciplines. Discussions with colleagues in sociology and policy suggest that the basic context and student skill set may be quite similar there. Amongst our recruits have been engineering, biostatistics, and sociology students, and they have performed just as well as the students with an economics major.

One final thought, though. Implementing the activities outlined here as part of a course is unlikely to also meet the needs of a journal for a reliable and timely verification service. Most journals would have verification needs at monthly or even weekly frequencies throughout the year, including when classes are not in session. A course would not even produce a continuous stream of verification reports throughout the semester.
A course might, however, serve as a feeder to a campus-wide verification service, similar to statistical consulting services with student workers.

We thank Hautahi Kingi, Sylverie Herbert, and Flavio Stanchi, who contributed to the earliest pilots of this effort. We thank the students who have worked for the Lab, and who have helped improve both the reproducibility of economics articles and the training and workflow for later participants. LV is the Data Editor of the American Economic Association, a paid position. All activities described herein are funded by the American Economic Association. HHS, MW, and DNW were graduate student assistants at various points during the time period described in the article, and MD is the assistant Data Editor; all were hired by LV and paid through the AEA's contract with Cornell University.

The following image of the job posting was captured on February 6, 2022. It is preserved at https://perma.cc/G969-2HT4.

• Contact: We have currently filled our available training slots for January. Please check back again later in the semester. Feel free to contact us at ldi@cornell.edu for inquiries any time.
• Remuneration: An hourly rate commensurate with experience will be offered. $14.25 per hour, up to 10 hours per week while in session; up to 20 hours per week during the summer.
• Goal: Ensure that supplementary materials for articles in a journal with a replication policy are (a) accessible, (b) reproduce the intended results, and (c) document results and findings.

The American Economic Association (AEA) monitors compliance with its Data and Code Availability Policy, under the leadership of the AEA Data Editor. LDI Replication Lab members will access pre-publication materials provided by authors, and assess how well these materials reproduce the results published in the manuscript or article. The provided materials and instructions will be assessed using a checklist. Authors' instructions will be followed (if possible), and success or failure to (i) perform the analysis and (ii) replicate the authors' results will be documented. Other related activities, such as literature search or tabulation of results, may also be assigned. Team work is encouraged, and activity will be supervised by a graduate student or faculty member. Team members must be at ease working in various computer environments (Windows Remote Desktop, local laptops) and software tools (statistical software, Git). This is ongoing work, and conditional on satisfactory work, continued employment (until graduation) is possible and desirable. Student status with Cornell is required. Some experience with empirical social science data analysis using statistical software is required.

The following report has been anonymized, but reflects the typical report sent back to authors. Length depends on the number of data sources (each data source is listed) and the number of programs (each is listed, together with its reproducibility status).

AEJPol-2019-xxxx.R1 "PAPER TITLE" Validation and Replication results

You may want to consult the Unofficial Verification Guidance for additional tips and criteria.

[NOTE] This is an amalgam of multiple reports, updated to reflect the template and guidance as of February 2022.

Thank you for your replication archive. All tables and figures are replicated, with one minor discrepancy in Table 1.
Conditional on making the requested changes to the manuscript and the openICPSR deposit prior to publication, the replication package is accepted.

In assessing compliance with our Data and Code Availability Policy, we have identified the following issues, which we ask you to address:

The World Bank collected both a baseline survey and an initial follow-up survey to assess short-run program impacts. Note that the raw survey datasets the authors have published have been modified slightly from the original raw data in order to remove PII. In particular, they have removed the recorded date of birth for all survey respondents (year and month of birth are still included). The original study describing the first two surveys was cited in the manuscript. It appears that the last follow-up survey was conducted by the authors. Access information is detailed in the README.

NOTES: The citation in the manuscript is to a description of the first two waves, not to the actual data.

[SUGGESTED] The authors provide data from Wave I, Wave II, and Wave III (conducted by the authors). Presumably, all three surveys are cataloged and stored in the World Bank Microdata catalog. Referencing the relevant entries from the Microdata catalog is strongly suggested, as it provides additional metadata and findability.

[REQUIRED] Please add data citations to the article. Guidance on how to cite data is provided in the AEA Sample References and in additional guidance.

• No analysis data file mentioned
• Analysis data files mentioned, not provided (explain reasons below)
• Analysis data files mentioned, provided. File names listed below.
• Title conforms to guidance (starts with "Data and Code for:" or "Code for:", is properly capitalized)
• Authors (with affiliations) are listed in the same order as on the paper

[REQUIRED] openICPSR should not have ZIP files visible. ZIP files should be uploaded to openICPSR via "Import from ZIP" instead of "Upload Files". Please delete the ZIP files, and re-upload using the "Import from ZIP" function. Detailed guidance is at https://aeadataeditor.github.io/aea-de-guidance/.

[NOTE] openICPSR metadata is sufficient. However:

[SUGGESTED] We suggest you update the openICPSR metadata fields marked as (highly recommended), in order to improve findability of your data and code supplement.

[SUGGESTED] We suggest you update the openICPSR metadata fields marked as (suggested), in order to improve findability of your data and code supplement. For additional guidance, see https://aeadataeditor.github.io/aea-de-guidance/data-deposit-aea-guidance.html.

Each figure in the article corresponds to specific lines in the provided programs; see "code-check.xlsx" (end of this report) and "Readme.pdf" for details. The main analysis programs are written in Stata and can be executed through the given master program MASTER.do.

[SUGGESTED] Please add a setup program that installs all packages as noted above. Please specify all necessary commands. An example of a setup file can be found at https://github.com/gslabecon/template/blob/master/config/config_stata.do

[REQUIRED] Please provide debugged code, addressing the issues identified in this report.

Code Check Table
Figure/Table #       Program                  Line Number   Replicated?
Table 1              table1.do                              minor difference in F-statistics
Table 2              table2.do                              Yes
Table 3              table3.do                              Yes
Table 4              table4.do                              Yes
Table 5              table5.do                              Yes
Table 6              table6.do                              Yes
Appendix Table 29    appendix_table29.do                    Yes
Appendix Table C

[REQUIRED] Please adjust your tables to account for the noted numerical discrepancies, or explain (in the README) discrepancies that a replicator should expect.

• There are no in-text numbers, or all in-text numbers stem from tables and figures.
• There are in-text numbers, but they are not identified in the code.

• Bugs in code: issues that were fixable by the replicator (but should be fixed in the final deposit).
• Code missing: in particular if it prevented the replicator from completing the reproducibility check.
• Data preparation code missing: should be checked if the missing code appears to be data preparation code.
• Code not functional: more severe than a simple bug; it prevented the replicator from completing the reproducibility check.
• Software not available to replicator: may happen for a variety of reasons, but in particular (a) when the software is commercial, and the replicator does not have access to a licensed copy, or (b) the software is open-source, but a specific version required to conduct the reproducibility check is not available.
• Insufficient time available to replicator: applicable when (a) running the code would take weeks or more, (b) running the code might take less time if sufficient compute resources were to be brought to bear, but no such resources can be accessed in a timely fashion, or (c) the replication package is very complex, and following all (manual and scripted) steps would take too long.
• Data missing: marked when data should be available, but was erroneously not provided, or is not accessible via the procedures described in the replication package.
• Data not available: marked when data requires additional access steps, for instance a purchase or application procedure.

References

Data and Code Availability Policy
Data and code availability policy
How to Persuade Journals to Accept Your Replication Paper
Editorial Statement
Report of the Subcommittee on Replicability in Science (Advisory Committee to the National Science Foundation Directorate for Social, Behavioral, and Economic Sciences)
Evaluating replicability of laboratory experiments in economics
A Preanalysis Plan to Replicate Sixty Economics Research Papers That Worked Half of the Time
Operationalizing the Replication Standard: A Case Study of the Data Curation and Verification Workflow for Scholarly Journals
The Meaning of Failed Replications: A Review and Proposal
A Proposal to Organize and Promote Replications
Joint Declaration of Data Citation Principles (tech. rep.). Force11
Replication in Empirical Economics: The Journal of Money, Credit and Banking Project
Report of the Search Committee to Appoint a Data Editor for the AEA
Supporting computational reproducibility through code review
Opinion: Is science really facing a reproducibility crisis, and do we need it to?
Highlights of the US National Academies Report on
The credibility crisis in research: Can economics tools help?
Viewpoint: Replication in economics
Editorial: ACM TOMS Replicated Computational Results Initiative
Replication and Economics Journal Policies
ReplicationWiki: Improving Transparency in Social Sciences Research. D-Lib Magazine
Retracted: Risk Management in Financial Institutions
Replication, Replication. PS: Political Science & Politics
The Reproducibility of Economics Research: A Case Study (Presentation)
The Numerical Reliability of Econometric Software
Data Science for Undergraduates: Opportunities and Options
Reproducibility and Replicability in Science
Promises and Perils of Pre-analysis Plans
Certify reproducibility with confidential data
& Anonymous (2021). Evidence of Fraud in an Influential Field Experiment About Dishonesty
Enhancing reproducibility for computational methods
An empirical analysis of journal policy effectiveness for computational reproducibility
Report by the AEA Data Editor
Reproducibility and Replicability in Economics
Report by the AEA Data Editor
Report by the AEA Data Editor. AEA Papers and Proceedings

Please add data citations to the article. Guidance on how to cite data is provided in the AEA Sample References and in additional guidance.
Please adjust your tables to account for the noted numerical discrepancies, or explain (in the README) discrepancies that a replicator should expect.

Action Items (openICPSR) (these are handled prior to publication, and are verified by the Data Editor):
• Please adjust your tables to account for the noted numerical discrepancies, or explain (in the README) discrepancies that a replicator should expect.
• The authors should check if the three surveys are cataloged and stored in the World Bank Microdata catalog. If yes, referencing the relevant entries from the Microdata catalog is required.
• ZIP files should be uploaded via "Import from ZIP" instead of "Upload Files". Please delete the ZIP files, and re-upload using the "Import from ZIP" function.

Current Employment Statistics: Dataset is not provided, but a link is provided in the README. Access conditions are not described. The data are cited in the references section of the manuscript and the README. Data citation: Bureau of Labor Statistics.

Carthago-Lion p-values are calculated in the program.

Please specify hardware requirements, and duration (execution time) for the last run, to allow replicators to assess the computational requirements.

Data checks: All datasets are present for this manuscript. Data can be read using Stata, and have data variable labels. Ran PII and results are reported in

Stated Requirements:
• No requirements specified
• Software Requirements specified as follows: Stata
• Computational Requirements specified as follows: Cluster size, disk size, memory size, etc.
• Time Requirements specified as follows: Length of necessary computation (hours, weeks, etc.)

While running a static code scanner for Stata, we identified possible Stata packages used in your code. Please verify, and adjust requirements accordingly.