key: cord-0174033-cncgotl1 authors: Wang, Liu; Li, Ruiqing; Zhu, Jiaxin; Bai, Guangdong; Wang, Haoyu title: When the Open Source Community Meets COVID-19: Characterizing COVID-19 themed GitHub Repositories date: 2020-10-23 journal: nan DOI: nan sha: 5f2c412b69debd0bf661b5670568bb584dbc8f9f doc_id: 174033 cord_uid: cncgotl1

Ever since the beginning of the outbreak of the COVID-19 pandemic, researchers from interdisciplinary domains have worked together to fight against the crisis. The open source community plays a vital role in coping with the pandemic, which is inherently a collaborative process. Plenty of COVID-19 related datasets, tools, software, and deep learning models have been created and shared in research communities with great effort. However, COVID-19 themed open source projects have not been systematically studied, and we are still unaware of how the open source community helps combat COVID-19 in practice. To fill this void, in this paper, we take the first step to study COVID-19 themed repositories in GitHub, one of the most popular collaborative platforms. We have collected over 67K COVID-19 themed GitHub repositories up to July 2020. We then characterize them from a number of aspects and classify them into six categories. We further investigate the contribution patterns of contributors and the development and maintenance patterns of the repositories. This study sheds light on the promising direction of adopting open source technologies and resources to rapidly tackle worldwide public health emergencies in practice, and reveals existing challenges for improvement.

The COVID-19 pandemic has quickly become a worldwide crisis. Since its outbreak, COVID-19 has attracted great attention from various research communities. Researchers from interdisciplinary domains work together to fight against the crisis. Beyond the medical domain, computer scientists have adopted advanced information techniques like machine learning to help medical practitioners deal with COVID-19 [1], [2], [3]. The open source community plays a vital role in coping with the pandemic, which is inherently a collaborative process. Existing research efforts necessitate that the datasets used for the studies be open sourced to promote the extension of and collaboration on the work in the fight against COVID-19. Thus, in the early stage of COVID-19, plenty of datasets, including statistics of COVID-19 cases, COVID-19 diagnosis datasets from X-ray images and cough sounds, and COVID-19 emotional and sentiment datasets from social media, have been created and shared in our research communities with great effort [4], [5]. Beyond the datasets, the open source community has shared a large number of tools (e.g., online trackers), software (e.g., contact tracing mobile apps), and deep learning models (e.g., diagnosis models and prediction models) to help combat COVID-19. GitHub, as the most popular collaborative platform in the open source community, has gained great attention during this pandemic. Most of the COVID-19 datasets and software are open sourced on it. By the time of this study, there are over 67K COVID-19 themed GitHub repositories. Although plenty of research studies in the software engineering community have analyzed GitHub from different perspectives [6], [7], [8], to the best of our knowledge, the COVID-19 themed GitHub repositories have not been systematically characterized.
We are still unaware of how the open source community helps combat COVID-19 in practice, and there remain a number of interesting directions to explore, e.g., the popularity and trends of COVID-19 themed repositories, the characteristics of contributors, the features and categories of COVID-19 repositories, and the development and maintenance behaviors.

This Work. In this paper, we present the first large-scale empirical study of COVID-19 themed repositories on GitHub. We first make efforts to harvest a comprehensive dataset of COVID-19 themed GitHub repositories. By the time of July 17, 2020, we have collected 67,079 repositories in total (see Section III-B). Leveraging the dataset, we perform a systematic analysis including popularity and trends analysis, contributor characterization, and national responsiveness analysis (see Section IV). Then, we take advantage of natural language processing (NLP) techniques to classify these repositories into six major categories, including data, contact tracing, toolkit, forecast & simulation, detection & diagnosis, and helpful in some ways (see Section V). We further perform an analysis to understand the development and maintenance behaviors of COVID-19 themed repositories (see Section VI). Our work is the first comprehensive study to demystify the reaction of the open source community to the pandemic. Our exploration gives a first impression on the landscape of the COVID-19 themed GitHub repositories, revealing some interesting observations:
• COVID-19 themed repositories emerged and grew rapidly along with the spread of the pandemic, demonstrating the open source community's rapid response to the pandemic.
• The aims of COVID-19 themed repositories cover various aspects. We found a wide diversity of repositories and constructed a taxonomy to classify them into six categories. Repositories are unevenly distributed across categories. Subtle differences arise when looking in detail on a per-category basis with regard to the boom time.
• Most COVID-19 repositories are developed rapidly, while the maintenance lifecycles are short-lived. The development process for most repositories is rapid and intense in the early stages after creation. However, most repositories are not well maintained over the long run, and soon become inactive. Besides, the activity of contributors is not significantly affected by the lockdown of cities.
Our results highlight the practical and potential value of open source technologies and resources in handling global crises. They also imply that advanced techniques and mechanisms of popularization, internationalization, data and software sharing, contribution gathering, etc., are required for more effective collaboration of the open source community in global emergencies. To boost future research, we have released the collected dataset to the research community at https://covid19-repos.github.io.

GitHub is one of the most popular collaborative platforms primarily used for software development. There have been millions of open source repositories in GitHub so far, which cover various topics including different languages, frameworks, applications, events, etc. Among them, the COVID-19 topic studied in this paper is a very special one in 2020, and even in the history of GitHub. The collaboration in GitHub is based on the git version control system. Developers can create their own copies of a repository (aka fork) and make local changes. Git commits record what they modify and when. To contribute back to the repository, developers request the maintainers to pull the code from their forks (aka pull-request).
Maintainers can collaborate with others to review the pull-request and decide whether to merge it or not. The developers of the merged code are contributors of the repository. GitHub also integrates some social features. Users are able to follow other users, and watch and star repositories. Users also have profiles with identifying information, e.g., their locations and companies. Due to its popularity, social features, and the availability of rich data, GitHub has attracted great attention from the software engineering community. Some studies have been conducted to understand GitHub repositories. Kalliamvakou et al. [9] highlighted important characteristics of GitHub repositories, e.g., most repositories have very few commits and are inactive, and a large portion of repositories are not for software development. Meanwhile, techniques were proposed to classify GitHub repositories [10], commits [11] and README files [12]. As collaboration is an inherent feature of software engineering projects, many studies make use of GitHub to investigate efficient collaboration models. Thanks to the rich data GitHub provides, many insights have been identified. Studies show that the pull-based collaboration model is more effective [13], although it faces a number of challenges [14]. Factors which impact the integration of pull-requests have been studied from many perspectives [15], [16], [17]. Pull-request reviewers are recommended [18], [19], and team structures are studied [20] to improve efficiency. The code and development techniques in GitHub are also analyzed. Cross-project code clones [21] and code quality [22] have been studied. The performance of test case prioritization techniques [23] and test driven development [24] has been evaluated. These studies have greatly enriched our understanding of the software development process in the open source community. In this paper, we intend to have an overview of GitHub repositories from similar aspects in the context of COVID-19, and obtain implications for tackling large-scale emergencies from the perspective of the open source community. A growing number of studies targeting COVID-19 have sprung up in the software engineering community. Neto et al. [25] investigated the impact of COVID-19 on software projects and software development professionals through mining software repositories and a survey study. A few studies seek to understand developer productivity at technology companies due to software developers' almost overnight migration to working from home. For example, Ford et al. [26] presented a survey study to understand the benefits and challenges of working from home and analyzed factors that affect developer productivity over time. Ralph et al. [27] used a survey to understand the effects of the pandemic on developers' wellbeing and productivity, and discussed how organizations may better support their employed developers. Bao et al. [28] conducted a case study on Baidu Inc. to understand both positive and negative impacts of working from home on developer productivity. To the best of our knowledge, no previous studies have systematically characterized the COVID-19 themed open source repositories on GitHub. We present the details of our characterization of COVID-19 themed GitHub repositories. We first describe our research questions, and then present the dataset used for our study. We first need to harvest a comprehensive dataset of COVID-19 themed repositories.
Our collection process is based on the official GitHub Search API, which is designed to facilitate a wide search for public repositories and retrieve metadata about them (e.g., stars, forks, etc.).

Collecting the related repositories. We first rely on a keyword matching based method to collect COVID-19 themed repositories. We used three search keywords (i.e., "COVID-19", "COVID19" and "coronavirus") as parameters to construct search queries respectively, such as "https://api.github.com/search/repositories?q=covid-19", and we can get a few pages of search results that best match "covid-19" (the query is case-insensitive). Due to the GitHub Search API's limitation of 1,000 results per search, we adopted a segmented approach, i.e., we narrow the results of one query using search qualifiers, make multiple queries, and finally integrate them. We used the repository creation time (i.e., the created qualifier) and the number of stars (i.e., the stars qualifier) as the primary and secondary qualifiers respectively. In this way, we have collected 60,591 results for "covid-19", 26,289 results for "covid19" and 11,116 results for "coronavirus", respectively. After the de-duplication process, we obtained a dataset that contains 68,269 unique repositories. Besides the meta information (e.g., id, owner, creation date, stars, programming language) of each repository, we also collected the profiles of its corresponding contributors, including their names, ids, locations, etc. We crawled these through the official API.

Filtering the dataset. Our keyword-based collection matches a keyword across the entire repository, such that wherever the keyword appears in a repository, we include it in our dataset. This may cause false positives. For example, a number of repositories that are irrelevant but with descriptions like "[TEMPORARILY SUSPENDED because of COVID-19]" would appear in our search results. We therefore conducted the following filtering process. 1) We started with checking the repository name. If the full name of the repository contains the COVID-19 wordings, e.g., covid-19, coronavirus, 2019-ncov, sars-cov-2, etc., or some anagrams of COVID-19, e.g., codiv19, condiv19, cord19, etc., we would assume it is COVID-19 themed and keep it.
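To make the segmented collection strategy above concrete, the following is a minimal sketch of how such narrowed queries could be issued against the GitHub Search API. The keyword, the date range and the use of unauthenticated requests are illustrative only; the stars qualifier used as a secondary qualifier is omitted here, and in practice an access token would be needed to stay within rate limits.

```python
import requests

SEARCH_URL = "https://api.github.com/search/repositories"

def search_page(keyword, created_range, page):
    # One narrowed query; the "created" qualifier keeps each query
    # under the Search API's 1,000-result cap.
    params = {
        "q": f"{keyword} created:{created_range}",
        "sort": "stars",
        "per_page": 100,
        "page": page,
    }
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("items", [])

# Illustrative example: repositories matching "covid-19" created in one week of March 2020.
repos = {}
for page in range(1, 11):  # at most 10 pages x 100 results = 1,000 results per query
    items = search_page("covid-19", "2020-03-01..2020-03-07", page)
    if not items:
        break
    for item in items:
        repos[item["id"]] = item  # de-duplicate across overlapping queries by repository id
```

Running one such loop per keyword and per creation-date segment, and merging the result dictionaries, yields a de-duplicated repository set in the spirit of the collection process described above.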
In this section, we present the overview analysis of COVID-19 themed repositories, including the trend of newly created repositories, their activeness, their popularity, and the distribution of the contributors.

Overall Trend. COVID-19 rapidly spread around the world and became a global pandemic in early 2020. As it spread, new repositories regarding the disease emerged in GitHub. We are curious about when these repositories were created and whether their creation is in line with the trend of the epidemic growth. Figure 1 presents the distribution of newly created repositories per day, from January to July 2020. It can be seen that the newly created repositories accounted for less than 0.2% before mid-January 2020, while the number started to increase sharply from mid-late January and peaked in mid-late March, after which there was a slowly decreasing trend. Over 65K (98%) repositories were created after March 1, 2020, and the day with the most creations was March 28, when 1,202 new repositories were created. Note that March is also the time when WHO declared COVID-19 a pandemic that was spreading around the world.

Activeness. We also pay attention to the activeness of the repositories during the pandemic. Through the updated_at and pushed_at indicators, which refer to the most recent update and push time of a repository, we summarize the distribution of the latest update date and push date of the repositories, as shown in Figure 1. The distribution of push time is almost the same as the update time (see the blue line and purple line in Figure 1), which is expected as the most common update behavior is a push. We thus only consider either of them in our subsequent trend analysis. Over 37K (55%) of the repositories were updated after May 1, 2020, and only 10K (15%) were updated after July 1, indicating that a large number of repositories lack continuous activeness.

Popularity. In most previous studies [29], [30], the popularity of GitHub repositories is usually measured by the number of received stars, because the stargazers button is an explicit feature for users to manifest their interest in or satisfaction with a repository. In addition, the number of forks can be seen as a proxy for the importance of a repository in GitHub, because forks are used either to propose changes to an existing repository or as a starting point for a new repository. Besides, we found that many COVID-19 themed repositories were built atop other repositories and show explicit links in their Readme.md files, which we call cross-repository references. Thus, our investigation will focus on these three aspects.

The number of received stars. We counted the number of repositories per star and found that 0-star repositories account for the vast majority, up to 79.6%, followed by 1-star repositories with 11.2%. Figure 2(a) shows the distribution of the number of repositories with more than one star. Obviously, it follows the typical power law distribution, i.e., there are only a few repositories with a large number of stars, while most repositories receive few stars. The most representative repository, CSSEGISandData/COVID-19 (https://github.com/CSSEGISandData/COVID-19), which is a COVID-19 data repository hosted by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, has received 23,352 stars by the time of our study.

The number of forks. The forks have almost the same power law distribution as the stars, which is consistent with our intuition (see Figure 2(b)). There is a high percentage of 0-fork repositories (88.5%), and the 0-fork and 1-fork repositories make up the vast majority (95%). The repository with the most forks is exactly the repository with the most stars, i.e., CSSEGISandData/COVID-19, which has been forked 14,392 times as of this writing.

Cross-repository Reference. In addition to the descriptions, we have collected the Readme.md files of all the repositories in our dataset. We extracted the links from these files and investigated the referencing relationships among the COVID-19 themed repositories. It turns out that 4,811 repositories contain links in their Readme to other repositories in our dataset, and 2,224 repositories have been referenced by others. Figure 3 presents the cross-reference relations among the repositories. The larger the node, the more the corresponding repository is referenced. Note that we have classified the repositories into six categories (to be detailed in Section V), and the color of a node represents the category of the corresponding repository. As can be seen in Figure 3, the top 7 most referenced repositories are of the data category. The most frequently referenced repository is again CSSEGISandData/COVID-19, which has been referenced by 1,640 repositories.
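As a rough illustration of how such cross-repository references could be extracted, the following sketch scans README text for links to other GitHub repositories; the regular expression and helper names are our own simplification, not the paper's implementation.

```python
import re

# Matches links such as https://github.com/CSSEGISandData/COVID-19
GITHUB_REPO_RE = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")

def referenced_repos(readme_text, dataset_full_names):
    """Return full names of dataset repositories that a README links to."""
    refs = set()
    for owner, name in GITHUB_REPO_RE.findall(readme_text):
        full_name = f"{owner}/{name}".removesuffix(".git")
        if full_name in dataset_full_names:  # only count references within our dataset
            refs.add(full_name)
    return refs

# Usage sketch: build the reference graph as {repo: set of referenced repos}.
# graph = {repo: referenced_repos(readme, all_names) for repo, readme in readmes.items()}
```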
Next, we investigate the distribution of the 67,985 collected contributors. As aforementioned, we gathered their location information during our data collection; unfortunately, 35,537 (52%) of them did not provide their location data. Among the location data provided, there are some invalid entries like "somewhere", "universe", "planet", "internet", "home", etc. Besides, different formats/representations of the location are provided, including city name only, country name only, and both. Thus we started with filtering invalid location names, and then used Geotext [31] to convert all the valid locations into country names. Finally, we acquired a total of 31,296 valid locations mapping to over 200 countries. A map displaying the country distribution of these contributors is shown in Figure 4. Considering that the emerging time of the COVID-19 pandemic varies from country to country, it is valuable to investigate whether the number of daily newly created repositories matches the number of daily new confirmed cases. On account of this, we conducted case studies in the top 6 countries with the largest number of contributors, i.e., the United States, India, Brazil, the United Kingdom, Canada and Germany. We define which country a repository belongs to by its contributors' locations. If all the contributors (one or more) of a repository are located in the same country, we treat the repository as belonging to that country; if the contributors of a repository are located in multiple countries, we take the repository as transnational. In this way, we worked out the results shown in Table I. Obviously, the distribution of repositories is in line with the distribution of contributors across countries. We notice that most of the repositories are contributed by developers of a single country, while only 1,490 of them are transnational repositories contributed by developers from multiple countries. We next investigate the correlation between the development process and the COVID-19 evolvement in each country. We collected the number of new infections per day as completely as possible for each country. We show the comparison chart of the 6 countries in Figure 5. In each plot, the red scatter shows the number of newly created repositories each day and the blue line presents the number of newly confirmed cases per day. Beyond this, we also take the latest update time into account and plot it as a green scatter. From the graphs, we can derive at least two interesting findings: 1) Regardless of country, the trend is relatively similar in both the number of newly created repositories and the number of latest updated repositories, and is even consistent with their overall trends (see Figure 1). This partly reflects a quick reaction to the public health emergency in these countries. 2) Taking the blue line into account, we can find that in some countries, the line and scatters show a strong consistency in trend (overall or in stages). For four countries, i.e., the US, UK, Canada and Germany, we can observe that the blue line, red scatter and green scatter have significant up-trends that almost overlap.

Answer to RQ1: There are over 67K COVID-19 related repositories in GitHub by the time of our study, and 98% of them were created after March 1, the time when the coronavirus became a pandemic. As the outbreak spread extensively, a significant increase in the creation and alteration of related repositories in GitHub instantly followed, indicating the open source community's rapid response to the pandemic.
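As a side note on the location-normalization step used in the contributor analysis above, a minimal sketch with the geotext package [31] could look as follows; the invalid-location list and the city-to-country fallback are illustrative assumptions, as the paper does not detail its exact rules.

```python
from geotext import GeoText

INVALID_LOCATIONS = {"somewhere", "universe", "planet", "internet", "home"}

def location_to_country(raw_location):
    """Map a free-form profile location string to a country, if possible."""
    if not raw_location or raw_location.strip().lower() in INVALID_LOCATIONS:
        return None
    places = GeoText(raw_location)
    if places.countries:              # a country name is mentioned directly
        return places.countries[0]
    if places.country_mentions:       # otherwise infer from a city name (ISO country code)
        return next(iter(places.country_mentions))
    return None

print(location_to_country("Berlin, Germany"))  # -> "Germany"
print(location_to_country("somewhere"))        # -> None
```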
In this section, we seek to understand the focus of the collected COVID-19 repositories. We first rely on manual efforts to create a taxonomy. Then, we adopt an automated classification method to classify them. We conduct a manual investigation to understand the subjects and contents of the COVID-19 themed repositories in order to create the taxonomy. As it is infeasible for us to inspect all collected repositories, we seek 200 representative ones for our manual analysis. As we discussed in Section IV-B, the number of stars partially indicates the popularity and quality of a repository, so we picked the top 200 starred ones. Through extensive manual analysis, we create a taxonomy including the following six categories:
C1 Data. A large portion of the 200 samples are data-related, including COVID-19 case statistics, image datasets, data visualization and analysis, etc.
C2 Contact tracing. Contact tracing is a key technique used by public health authorities to contact and give guidance to anyone who may have been exposed to COVID-19 cases. Some repositories in our collection are open-source contact tracing apps or frameworks. For example, the repository google/exposure-notifications-server implements the Exposure Notifications API and provides reference code for working with Android and iOS apps.
C3 Toolkit. A number of tracking toolkits for the COVID-19 epidemic have emerged in GitHub, including mobile apps, APIs, crawlers, Python packages, etc.
C4 Forecast & simulation. Repositories in this category provide models for forecasting, predicting and simulating the spread of the epidemic.
C5 Detection & diagnosis. Repositories in this category aim at COVID-19 detection and diagnosis, e.g., deep learning models built on medical images.
C6 Helpful in some ways. The remaining repositories support the fight against COVID-19 in various other ways.

We next seek to automatically classify the repositories.

Ground truth. We first built a ground truth dataset to be used for training our classification model. It is intended to include 600 labelled repositories, 100 per category. We built it by manually labeling the collected repositories one by one in descending order of the number of stars, until each category had been assigned 100 repositories. Considering that there are multiple languages in the descriptions, we used Google Translation to translate all the descriptions into English before labelling. The labeling was done by the first two authors independently, and a discussion was performed to reach a consensus on the labels in the end. After obtaining the ground truth, we utilize the labeled descriptions to build a classification model that enables us to classify other repositories.

Approach Overview. We adopted a distance-based classification methodology. Below we describe each step of our approach. 1) Keyword generation and feature set extraction. The first step of our approach is to generate keywords for each category such that the characteristics of each category can be represented by its keywords. Considering that most of the descriptions have few words, we merged all the 100 descriptions in a category into a single text, resulting in six documents as our training data. Each of them has its category name as its label. We added emojis and those general and thus indistinguishing words like "COVID-19" and "coronavirus" to the stopwords list (downloaded from www.nltk.org), so that they are removed from the training data prior to our keyword generation.
All terms in the training data are then used to calculate their weights in differentiating the six documents using TF-IDF [32], which is a widely used method to find out how important a word is in differentiating labelled documents from each other. We then took the top 20 keywords for each category and merged them into a feature set of length 117 (after deduplication). We list several representative keywords in Table II. 2) Classification. The similarity of documents can be expressed by the angle or distance between their feature vectors: the smaller the angle or distance, the more similar the two documents [33]. Thus, we combine TF-IDF and cosine similarity for classification. Given a repository to classify, we take its description along with the training documents as input to the TF-IDF method, which generates vectors including the weights of all terms that appear in all documents. These vectors are projected onto our feature set (of length 117), so that the feature vectors of all six categories and the tested repository are derived. We then measure their cosine distance, and the tested repository is classified into the nearest category. 3) Cross validation. We adopted ten-fold cross validation to measure its effectiveness. As shown in Table III, our method achieves a promising performance, with an overall accuracy of 92%. It is thus feasible to apply our classification model to large-scale repository classification. We applied our classification method to the rest of the dataset. We only focus on the six categories we have identified, as they are the most dominant and representative (recall that we identified them based on the top 200 starred repositories). There are also a number of repositories whose description is either very short (or even absent), unclear or uncharacteristic, such that it is infeasible to classify them into any of our categories. Therefore, we conducted a filtering on our dataset based on the 117 keywords we have obtained. We introduce a threshold n, and filter out the repositories whose description contains fewer than n keywords. As such, the key becomes figuring out an appropriate n. To this end, we analyzed the 600 labeled ground truth repositories. We learned that the minimum number of keywords included in a description is 1 (with 130 descriptions), the maximum is 17 (with 2 descriptions), and the average is 3.08. We also noticed that descriptions with 2 keywords are the most common (with 161 descriptions). It turns out that 21.6% of the repositories can potentially be classified by a single keyword, so we set n to 1. However, not every keyword has the same capacity to distinguish categories, so we grouped the keywords into two levels, i.e., primary and secondary. A primary keyword is one that by itself can be relied on to categorize a repository. We identified them from those descriptions containing only one keyword. Overall, we obtained 34 such keywords, including "stat", "data", "visual", "track", "forecast", "predict", "detect", etc.; the remaining 83 are taken as secondary keywords. With the obtained keywords, we filtered the dataset in such a way that, if the description contains at least one primary keyword or at least two secondary keywords, we keep it. Through this, we obtained 19,739 classifiable repositories. We sampled repositories to test this filtering, and found that only 5 out of 200 randomly selected repositories cannot be classified into any of our six categories, which is acceptable.
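A simplified sketch of this keyword-based classification pipeline, using scikit-learn's TF-IDF implementation and cosine similarity, is shown below. The category ordering, stopword handling and the projection onto the merged feature set are condensed here, and the concrete feature set of length 117 comes from the paper's own keyword selection rather than this code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Must follow the same order as the documents passed to build_classifier.
CATEGORIES = ["data", "contact tracing", "toolkit",
              "forecast & simulation", "detection & diagnosis", "helpful in some ways"]

def build_classifier(category_docs, stopwords, top_k=20):
    """category_docs[i] is the merged text of the 100 labelled descriptions of category i."""
    # Step 1: TF-IDF over the six merged documents to pick top-k keywords per category.
    vec = TfidfVectorizer(stop_words=list(stopwords))
    weights = vec.fit_transform(category_docs).toarray()   # shape: 6 x |vocabulary|
    terms = vec.get_feature_names_out()
    feature_set = set()
    for row in weights:
        feature_set.update(terms[i] for i in row.argsort()[::-1][:top_k])

    # Step 2: re-vectorize on the merged feature set and classify by cosine similarity.
    vec_feat = TfidfVectorizer(vocabulary=sorted(feature_set))
    category_vectors = vec_feat.fit_transform(category_docs)

    def classify(description):
        d = vec_feat.transform([description])
        sims = cosine_similarity(d, category_vectors)[0]
        return CATEGORIES[sims.argmax()]                    # nearest (most similar) category

    return classify
```

The stopword list here would be the NLTK English list extended with emojis and generic terms such as "COVID-19" and "coronavirus", as described above.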
Finally, we have these repositories categorized using our classification method. The result is shown in Table IV:

Category          C1       C2    C3     C4     C5   C6
# repositories    10,882   276   3,276  2,231

We next take a deeper dive into each category. First, we explore whether there are differences in the timing of the prevalence of different categories of repositories. Figure 6(a) shows the distribution of creation dates for all the repositories in each category separately. Most categories have similar trends that are consistent with the overall trend of creation time (see Figure 1), while there is a distinct difference for the "Detection and diagnosis" category, where the line peaks at the end of April. This indicates that the repositories involving detection and diagnosis elements flourished slightly later than other categories, which is reasonable. Next, we zoomed into several repositories of each category to examine their activities. For each category, we ranked the repositories by the number of stars and selected the top (correctly-classified) 100 of them for our study. We cloned these 600 top repositories and used the git log command to get the commit information of each repository. Similar to the above analysis on the prevalent creation time of each category (Figure 6(a)), we added up the number of commits per day for each category to see their committing trends over time, as shown in Figure 6(b). It can be seen that the "Data" category has far more commits than other categories, which is reasonable because those repositories providing data require constant updating. Besides, the peaks of "Contact tracing" and "Detection and diagnosis" are somewhat later than others. From the above two category-level comparisons we can discover that, for repositories with diverse aims and focus, their boom periods are quite likely to be different.

Answer to RQ2: We created a taxonomy and devised an automated approach to classify the repositories into categories. The Data category makes up the largest proportion of them, while the Contact tracing category accounts for the least. Subtle differences arise when looking in detail on a per-category basis with regard to the boom time.

In this section, we direct attention to the development and maintenance behaviors of these repositories. Over the past few months, many countries around the world went into lockdown to control the spread of COVID-19. We are interested in investigating whether the quarantine has an impact on the activity of the users on GitHub. Our study covers all 67,857 collected contributors. For each contributor, we collect the number of his/her contributions (e.g., commits, issues and pull requests) to GitHub repositories on each day of the last year, from August 4, 2019 to August 4, 2020, and use the number of contributions to represent his/her activeness. We analyzed the activity of contributors over the year through two indicators: (a) the total number of active contributors per day (i.e., the number of contributors with contributions greater than 0), and (b) the total number of contributions from all contributors per day.
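For illustration, given a table of per-contributor daily contribution counts, the two indicators could be aggregated as follows; the column names are assumptions made for this sketch.

```python
import pandas as pd

# contributions: one row per contributor per day, with columns
# "login", "date", "count" (number of commits/issues/pull requests that day).
def daily_activity_indicators(contributions: pd.DataFrame) -> pd.DataFrame:
    active = contributions[contributions["count"] > 0]
    return pd.DataFrame({
        # (a) number of contributors with at least one contribution that day
        "active_contributors": active.groupby("date")["login"].nunique(),
        # (b) total number of contributions from all contributors that day
        "total_contributions": contributions.groupby("date")["count"].sum(),
    }).fillna(0)
```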
We can examine the trends of the two indicators over time in three stages, as shown in Figure 7(a). First, before COVID-19 was discovered, they (the green and red lines) both exhibit a largely stable trend, with only a slight decline in late 2019 and early 2020, probably due to the Christmas and New Year holidays. The latter two stages are prior to and after the WHO's declaration. We can observe a clear upward trend in both lines when COVID-19 began to explode globally and the number of confirmed cases rose sharply in March. More specifically, we take two countries, i.e., the US and the UK, as case studies to analyze whether contributor activity is significantly associated with the corresponding epidemic at the national level. The results are displayed in Figure 7(b) and Figure 7(c). We can see that in both countries, as the blue line rises sharply, the green line has almost the same upward trend, which is consistent with the overall situation. These findings suggest that COVID-19 did not significantly disturb contributors' work in GitHub, and may have even given them more time and motivation to contribute. Beyond that, it is interesting to see that the trends in both indicators are cyclical, and upon our inspection the periodicity is one week.

We cloned the top 100 repositories with the most stars for each category and used the git log command to get their commit histories. We are interested in examining how commits evolve after a repository is created, so we performed clustering on the commit patterns of each repository over a period of time. For each repository, we retrieved the number of commits per day over two months (60 days) since it was created, took these counts as feature vectors and applied L2 normalization. Then we applied K-Means clustering. We used the Elbow Method to find the optimal value of k, which is 3. Thus we grouped the repositories into three clusters. Figure 8(a) presents the commit trends in each cluster. We can see that the majority of the commits are done in the first few days after creation. 1) Cluster 1 (in red) implies a relatively stable trend with not so many but continuous commits, reflecting the fact that some repositories have ongoing maintenance after early development. For example, a repository named "covid-19-data" provides data on the COVID-19 confirmed cases around the world. We checked its commit history and found that commits happened almost every day and the number of commits ranged from 1 to 20 within a day. 2) Cluster 2 (in green) indicates that the commits are intensive in the first few days after creation, but later the number decreases to nearly 0, which reflects the fact that some repositories are infrequently modified after the initial development is completed. For example, there is a repository called "covid19" designed to visualize and track the COVID-19 pandemic by country. It committed frequently in the first five days (an average of 27 commits per day), and then fewer and fewer until 0. 3) Cluster 3 (in blue) has a similar trend to cluster 2, except that its commits flourish earlier, i.e., on the first day after creation. For example, there is a repository called "infectiontracker", which implements an app for tracing chains of infection and for sharing timely information in the event of a COVID-19 infection, with a lot of commits (93) on the first day, then dropping to 0 sharply.

We next investigate the category-level distribution (see Table V):

Category                Cluster 1  Cluster 2  Cluster 3  Total
Data                    73         16         11         100
Contact tracing         34         19         47         100
Toolkit                 44         29         27         100
Forecast & simulation   56         22         22         100
Detection & diagnosis   39         29         32         100
Helpful in some ways    53         25         22         100
Total                   299        140        161        600

Most of the repositories in the "Data" category are clustered into cluster 1, which conforms to Figure 6(b). Half of the repositories tend to be inactive after a short period of vitality. Given that we choose repositories with a high number of stars, it is reasonable to believe that there exist many more repositories not well maintained for long periods of time.
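A minimal sketch of this clustering step, combining git log extraction with scikit-learn's K-Means, could look as follows; paths and parameters are illustrative, the first commit date is used to approximate the creation time, and the paper's choice of k = 3 via the Elbow Method is taken as given.

```python
import subprocess
from collections import Counter
from datetime import date, timedelta

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def commits_per_day(repo_path, days=60):
    """Daily commit counts for the first `days` days of a locally cloned repository."""
    dates = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--date=short", "--pretty=format:%ad"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    if not dates:
        return [0] * days
    counts = Counter(dates)
    start = date.fromisoformat(dates[0])          # first commit approximates creation time
    return [counts.get(str(start + timedelta(days=i)), 0) for i in range(days)]

def cluster_commit_trends(repo_paths, n_clusters=3):
    X = normalize(np.array([commits_per_day(p) for p in repo_paths], dtype=float), norm="l2")
    # In practice, the Elbow Method (inertia vs. k) is inspected before fixing n_clusters.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return model.labels_
```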
We are also interested in exploring the code development process, i.e., the variation of LOC over time. Similar to the analysis of commit evolution, for each repository we collected the LOC per day over two months (60 days) since it was created, and then likewise applied K-Means clustering. We also used the Elbow Method to find the optimal value of k, which is 5. We clustered the data into five clusters, and Figure 8(b) shows the pattern of each cluster. In general, the number of code lines shows an increasing trend, although the period of growth varies. Over 60% of the repositories (363) are clustered into cluster 1, which shows a tendency of code lines to increase to a certain extent very early on (within five days after creation) and remain stable afterwards, i.e., the code development is concentrated in the very early days with little or even no change thereafter. For example, there is a repository (https://github.com/rizmaulana/kotlin-mvvm-covid19) that added 5,398 lines of code on the first day of creation, then gradually increased to 10,182 lines by the 19th day, and has remained stable with a handful of changes since then. This suggests that most repositories are developed quickly but tend to remain stable after the initial development. Cluster 2 (164) and cluster 3 (60) manifest a general trend of steady rise, which fits our perception, although the periods of rapid growth are somewhat different and cluster 3 seems to take longer to develop. Cluster 4 (8) shows that the lines of code fluctuate considerably, especially in the first month. To understand this pattern, we manually checked the commit messages of several repositories in this cluster. We can observe the following major reasons: adjusting code or data, cleanup or refactoring, removing data and unused code, etc. For example, there is a repository (https://github.com/futurice/corona-simulations) whose lines of code suddenly reduced from 40,343 to 9,926 on the 25th day, and the message of the commit is "Massive cleanup and refactoring..... Moved all fonts and stylesheets to local. Removed some deprecated code and unneeded data files". Cluster 5 shows that the lines of code remain almost stable over a long period after creation, while increasing sharply at the end of the time period we focused on, implying that the code development process takes a longer time, like cluster 3. Note that there are only five repositories in cluster 5, indicating that this trend may be very uncommon. We then analyzed their commit history logs and found that they added/uploaded files or implemented a new feature on the day the lines of code soared, which are normal occurrences of code development. Note that we further investigated the distribution of the different LOC patterns across categories. We observe that they follow a similar distribution to the overall trend, i.e., clusters 1 and 2 are dominant.

Answer to RQ3: The activity of contributors is not significantly affected by the lockdown of cities. Instead, the quarantine has somewhat boosted contributors' engagement in GitHub. Most of the repository development process is prompt and intense in the early stages after creation. Some of the repositories undergo a longer development period (more than two months), and most of them are not well maintained over the long term.
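For completeness, a sketch of how the per-day LOC series underlying this clustering could be derived from git history with git log --numstat is given below; the parsing is simplified, binary files are skipped, and this is our reconstruction rather than the authors' script.

```python
import subprocess
from collections import defaultdict

def net_loc_per_day(repo_path):
    """Net lines added per day, from `git log --numstat` of a cloned repository."""
    lines = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--date=short",
         "--pretty=format:@%ad", "--numstat"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    per_day, day = defaultdict(int), None
    for line in lines:
        if line.startswith("@"):                  # commit header line carrying the date
            day = line[1:]
        elif line.strip():
            added, deleted, _path = line.split("\t", 2)
            if added != "-":                      # "-" marks binary files in numstat output
                per_day[day] += int(added) - int(deleted)
    return dict(per_day)                          # cumulative LOC is the running sum of values
```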
Our results show that open source technologies can be rapidly applied to tackle a worldwide public health emergency. A great many attempts from all over the world have been made on GitHub, covering various aspects from COVID-19 data and computer-aided diagnosis to daily life, and some of them started shortly after the appearance of COVID-19 and produced valuable deliverables with high numbers of stars and references. We notice that contributors' activities in GitHub are not significantly influenced by the lockdown of cities. Coping with a pandemic is an inherently collaborative process. The collaboration platform, GitHub, plays an important role in supporting software development and information sharing. It should be noticed that these technologies and resources can be very helpful in other emergencies too. Our findings also reveal some underlying challenges and suggest several potential directions for improvement, as follows: 1) A large proportion of repositories are data related; however, GitHub is built on Git and is not very suitable for sharing data. The platform could be extended or integrated with other data sharing sites for such requirements. 2) The majority of repositories are intra-national, and people from different countries may use only their native language. Internationalization is essential for information exchange in worldwide collaboration, and internationalization techniques can be applied. 3) Most repositories are not well maintained over the long run, and soon become inactive. It is challenging to convene a software project with a large number of participants in a short time. Mechanisms could be developed for project organization in emergencies. We recognize that our study carries several limitations. First, our investigation is limited by the repositories we identified. Although we make efforts to cover all the repositories containing any of the three most representative keywords we summarized, it is quite possible that some COVID-19 related repositories are overlooked by us. Nevertheless, we believe our collection has covered most of the available COVID-19 themed repositories on GitHub. Second, this paper aims to analyze how the open source community helps combat COVID-19; however, GitHub is not the only platform where people can share their open source projects, which might limit our observations. Third, even though a large number of repositories were collected, only a subset of all COVID-19 relevant repositories were classified and analyzed. The major reason is that most of them are inactive and we cannot understand these repositories based on an empty description or a description with very few words.

VIII. CONCLUSION This paper presents the first large-scale empirical study of COVID-19 themed repositories on GitHub. We make efforts to collect over 67K related repositories and characterize them from the perspectives of trend, popularity, contributors, etc. To further understand the development and maintenance behaviors of COVID-19 themed repositories, we propose an NLP-based method to classify them and then perform a per-category analysis. Our observations show the promising direction of applying open source technologies to tackle public health emergencies, and reveal some underlying challenges for improvement.
REFERENCES
[1] Lung infection quantification of COVID-19 in CT images with deep learning.
[2] Coronavirus (COVID-19) classification using CT images by machine learning methods.
[3] A novel medical diagnosis model for COVID-19 infection detection based on deep features and Bayesian optimization.
[4] COVID-19 open source data sets: A comprehensive survey.
[5] Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set.
[6] Automation, algorithms, and politics: Where do bots come from? An analysis of bot codes shared on GitHub.
[7] DevRank: Mining influential developers in GitHub.
[8] Understanding Java usability by mining GitHub repositories.
[9] The promises and perils of mining GitHub.
[10] HiGitClass: Keyword-driven hierarchical classification of GitHub repositories.
[11] GitcProc: A tool for processing and classifying GitHub commits.
[12] Categorizing the content of GitHub README files.
[13] Effectiveness of code contribution: From patch-based to pull-request-based tools.
[14] Work practices and challenges in pull-based development: The contributor's perspective.
[15] How does code style inconsistency affect pull request integration? An exploratory study on 117 GitHub projects.
[16] Studying pull request merges: A case study of Shopify's Active Merchant.
[17] On the relation between GitHub communication activity and merge conflicts.
[18] What factors influence the reviewer assignment to pull requests?
[19] Reviewer recommender of pull-requests in GitHub.
[20] An empirical study on the team structures in social coding using GitHub projects.
[21] Cross-project code clones in GitHub.
[22] A large-scale study of programming languages and code quality in GitHub.
[23] How do static and dynamic test case prioritization techniques perform on modern software systems? An extensive study on GitHub projects.
[24] Analyzing the effects of test driven development in GitHub.
[25] A deep dive on the impact of COVID-19 in software development.
[26] A tale of two cities: Software developers working from home during the COVID-19 pandemic.
[27] Pandemic programming.
[28] How does working from home affect developer productivity? A case study of Baidu during the COVID-19 pandemic.
[29] On the popularity of GitHub applications: A preliminary note.
[30] Understanding the factors that impact the popularity of GitHub repositories.
[31] Geotext.
[32] Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach.
[33] Classification of official letters using TF-IDF method.