title: Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
authors: Babuji, Yadu; Blaiszik, Ben; Brettin, Tom; Chard, Kyle; Chard, Ryan; Clyde, Austin; Foster, Ian; Hong, Zhi; Jha, Shantenu; Li, Zhuozhao; Liu, Xuefeng; Ramanathan, Arvind; Ren, Yi; Saint, Nicholaus; Schwarting, Marcus; Stevens, Rick; Dam, Hubertus van; Wagner, Rick
date: 2020-05-28

Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources, representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.

The Coronavirus Disease (COVID-19) pandemic, caused by transmissible infection of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus [1-4], has resulted in millions of diagnosed cases and over 353 000 deaths worldwide [5], straining healthcare systems and disrupting key aspects of society and the wider economy. In order to save lives and reduce societal effects, it is important to rapidly find effective treatments through drug discovery and repurposing efforts.

Here, we describe a public data release of 23 molecular datasets collected from community sources or created internally, representing over 4.2 B molecules. In addition to collecting the datasets from heterogeneous locations and making them available through a unified interface, we have enriched the datasets with additional context that would be difficult for many researchers to compute without access to significant HPC resources. For example, these data now include 2D and 3D molecular descriptors, computed molecular fingerprints, 2D images representing each molecule, and canonical simplified molecular-input line-entry system (SMILES) [6] structural representations, all intended to speed development of machine learning models. This data release encompasses information on the 4.2 B molecules and 60 TB of additional data. We intend to supplement this dataset in future releases with more datasets, further enrichments, tools to extract potential drugs from natural language text, and machine learning models to sift the best candidates for protein docking simulations from the billions of available molecules.
In the following, we first describe the datasets collected and the methodology used to generate the enriched datasets, and then discuss future directions.

We have collected molecules from the datasets listed in Table 1, each of which was either made available online by others or generated by our group. The collected datasets include some specifically collected for drug design (e.g., Enamine), known drug databases (e.g., DrugBank [7, 8], DrugCentral [9, 10], CureFFI [11]), antiviral collections (e.g., the CAS COVID-19 Antiviral Candidate Compounds [12] and the Lit COVID-19 dataset [13]), others that provide known decoys (e.g., the DUD-E directory of useful decoys), and further counterexamples including molecules used in other domains (e.g., QM9 [14, 15] and the Harvard Organic Photovoltaic Dataset [16, 17]). By aggregating these diverse datasets, including the decoys and counterexamples, we aim to give researchers maximal freedom to create training sets for specific use cases. Future releases will include additional data relevant to SARS-CoV-2 research.

Table 1 (excerpt):
  Key  Description                                                  Molecules
  …    (key and description truncated) [32, 33]                     97 545 266
  QM9  QM9 subset of GDB-17 [14, 15]                                 133 885
  REP  Repurposing-related drug/tool compounds [34, 35]              10 141
  SAV  Synthetically Accessible Virtual Inventory (SAVI) [36, 37]    265 047 097
  SUR  SureChEMBL dataset of molecules from patents [38, 39]         17 (count truncated)

Our data processing pipeline is used to compute different types of features and representations of billions of small molecules. The pipeline is first used to convert the SMILES representation for each molecule to a canonical SMILES, to allow for de-duplication and consistency across data sources. Next, for each molecule, three different types of features are computed: 1) molecular fingerprints that encode the structure of the molecule; 2) 2D and 3D molecular descriptors; and 3) 2D images of the molecular structure. These features are being used as input to various machine learning and deep learning models that will be used to predict important characteristics of candidate molecules, including docking scores, toxicity, and more.

Figure 1: The computational pipeline used to enrich the data collected from the included datasets. After collection, canonical SMILES, 2D and 3D molecular features, fingerprints, and images are computed for each molecule in each dataset. These enrichments simplify molecule disambiguation, ML-guided compound screening, similarity searching, and neural network training, respectively.

Table 2:
  Term        Description
  SOURCE-KEY  Identifies the source dataset: see the three-letter "Key" values in Table 1
  IDENTIFIER  A per-molecule identifier, either obtained from the source dataset or, if none is available, defined internally
  SMILES      A canonical SMILES for the molecule, as produced by Open Babel

We use Open Babel v3.0 [42] to convert the simplified molecular-input line-entry system (SMILES) specifications of chemical species obtained from various sources into a consistent canonical SMILES representation. We organize the resulting molecule specifications in one directory per source dataset, each containing one CSV file with columns [SOURCE-KEY, IDENTIFIER, SMILES], where SOURCE-KEY identifies the source dataset; IDENTIFIER is an identifier either obtained from the source dataset or, if none is available, defined internally; and SMILES is a canonical SMILES as produced by Open Babel. Identifiers are unique within a dataset, but may not be unique across datasets. Thus, the combination of (SOURCE-KEY, IDENTIFIER) is needed to identify molecules uniquely. We obtain the canonical SMILES by applying an Open Babel conversion command to each source file.
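The exact Open Babel command line used to produce the released files is not reproduced in the extracted text; as a hedged illustration only, the same canonicalization step can be performed with Open Babel's Python bindings (pybel):

```python
# Minimal sketch of SMILES canonicalization with Open Babel 3.x, assuming the
# openbabel Python bindings are installed. This illustrates the operation the
# pipeline performs; it is not the authors' production command.
from openbabel import pybel

def canonical_smiles(smiles: str) -> str:
    """Return Open Babel's canonical SMILES for an input SMILES string."""
    mol = pybel.readstring("smi", smiles)   # parse the input SMILES
    return mol.write("can").split()[0]      # "can" is Open Babel's canonical SMILES format

print(canonical_smiles("C1=CC=CC=C1O"))     # phenol, printed in canonical form
```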
We use RDKit [43] to compute a molecular fingerprint for each molecule. We organize these fingerprints in one directory per source dataset, each containing CSV files with columns [SOURCE-KEY, IDENTIFIER, SMILES, FINGERPRINT], where the first three columns are as defined in Table 2 and FINGERPRINT is a Base64-encoded representation of the fingerprint. In Figure 2, we show an example of how to load the fingerprint data from a batch file within an individual dataset using Python 3. Further examples of how to use fingerprints are available in the accompanying GitHub repository [44].

We generate molecular descriptors using Mordred [45] (version 1.2.0). The collected descriptors (∼1800 for each molecule) include both 2D and 3D molecular features. We organize these descriptors in one directory per source dataset, each containing one or more CSV files. Each row in a CSV file has columns [SOURCE-KEY, IDENTIFIER, SMILES, DESCRIPTOR_1, ..., DESCRIPTOR_N]. In Figure 3, we show how to load the data for an individual dataset (e.g., FFI) using Python 3 and explore its shape (Figure 3, left), and how to create a t-SNE embedding [46] to explore the molecular descriptor space (Figure 3, right).

Images for each molecule were generated using a custom script [44] that reads the canonical SMILES structure with RDKit, kekulizes the structure, handles conformers, draws the molecule with rdkit.Chem.Draw, and saves the result as a PNG image with size 128×128 pixels. For each dataset, individual pickle files are saved containing batches of 10 000 images for ease of use, with entries in the format (SOURCE, IDENTIFIER, SMILES, image in PIL format). In Figure 4, we show an example of loading and displaying image data from a batch file from the FFI dataset.

Providing access to such a large quantity of heterogeneous data (currently 60 TB) is challenging. We use Globus [47] to handle authentication and authorization, and to enable high-speed, reliable access to the data stored on the Petrel file server at the Argonne Leadership Computing Facility's (ALCF) Joint Laboratory for System Evaluation (JLSE). Access to these data is available to anyone following authentication via institutional credentials, an ORCID profile, a Google account, or many other common identities. Users can access the data through the web interface shown in Figure 5, which facilitates easy browsing, direct download of smaller files via HTTPS, and high-speed, reliable transfer of larger data files to a laptop or computing cluster via Globus Connect Personal or an instance of Globus Connect Server. There are more than 20 000 active Globus endpoints distributed around the world. Users may also access the data with a full-featured Python SDK. More details on Globus can be found at https://www.globus.org.

Figure 5: Data access with Globus. All data are stored on Globus endpoints, allowing users to access, move, and share the data through a web interface (pictured above), a REST API, or a Python client. The user here has just transferred the first three files of descriptors associated with the E15 dataset to an endpoint at UChicago.
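As a hedged illustration of how the fingerprint batch files described above might be read (the file name below is hypothetical, and the byte layout of the decoded fingerprint depends on how it was serialized; see the accompanying GitHub repository [44] for the exact scheme), the Base64-encoded fingerprints can be decoded as follows:

```python
# Sketch of reading one fingerprint batch CSV with columns
# [SOURCE-KEY, IDENTIFIER, SMILES, FINGERPRINT]. The path is a placeholder.
import base64
import csv

def load_fingerprint_batch(path):
    """Yield (source_key, identifier, smiles, fingerprint_bytes) tuples."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        # If the file begins with a header row, skip it here with next(reader).
        for source_key, identifier, smiles, fp_b64 in reader:
            yield source_key, identifier, smiles, base64.b64decode(fp_b64)

for source_key, identifier, smiles, fp in load_fingerprint_batch("FFI/fingerprints-0.csv"):
    print(source_key, identifier, smiles, f"{len(fp)} bytes")
    break  # inspect just the first record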
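Similarly, a sketch of loading the Mordred descriptor files for a small dataset such as FFI and embedding the descriptor space with t-SNE, in the spirit of Figure 3; the file name and column labels are assumptions, and the released files may be split across several CSVs per dataset:

```python
# Sketch of exploring the descriptor CSVs; requires pandas and scikit-learn.
import pandas as pd
from sklearn.manifold import TSNE

df = pd.read_csv("FFI/descriptors-0.csv")        # placeholder file name
print(df.shape)                                   # rows = molecules, cols = 3 metadata + ~1800 descriptors

# Keep numeric descriptor columns only; descriptors that Mordred could not
# compute become NaN after coercion and are dropped.
features = (df.drop(columns=["SOURCE-KEY", "IDENTIFIER", "SMILES"])
              .apply(pd.to_numeric, errors="coerce")
              .dropna(axis=1))

embedding = TSNE(n_components=2).fit_transform(features.values)
print(embedding.shape)                            # (number of molecules, 2)
```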
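The image batches can be read directly with pickle; a short sketch along the lines of Figure 4 (the file name is hypothetical):

```python
# Sketch of loading one image batch; each pickle file stores up to 10 000
# (SOURCE, IDENTIFIER, SMILES, PIL image) tuples.
import pickle

with open("FFI/images-0.pkl", "rb") as f:
    batch = pickle.load(f)

source, identifier, smiles, image = batch[0]
print(source, identifier, smiles, image.size)     # images are 128x128 pixels
image.save("example_molecule.png")                # write the first molecule image to disk
```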
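Finally, for scripted access, a hedged sketch of submitting a Globus transfer with the Python SDK (globus_sdk); the client ID, endpoint UUIDs, and file paths below are placeholders rather than the actual identifiers for the Petrel endpoint, which are listed on the data release web page:

```python
# Sketch of a Globus transfer using the globus_sdk native-app login flow.
# All identifiers below are placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SOURCE_ENDPOINT = "DATA-ENDPOINT-UUID"      # endpoint hosting the release
DEST_ENDPOINT = "YOUR-ENDPOINT-UUID"        # e.g., a Globus Connect Personal endpoint

# Interactive login to obtain a transfer access token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Please log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Authorization code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Submit a transfer of one (placeholder) file from the source to the destination.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))
tdata = globus_sdk.TransferData(tc, SOURCE_ENDPOINT, DEST_ENDPOINT)
tdata.add_item("/path/to/descriptors-0.csv", "/~/descriptors-0.csv")
print("Submitted task:", tc.submit_transfer(tdata)["task_id"])
```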
We have released to the community an open resource of molecular structures (as canonical SMILES), descriptors, 2D images, and fingerprints. We hope these data will contribute to the discovery of small molecules to combat the SARS-CoV-2 virus.

We expect forthcoming data releases to extend to molecular conformers; incorporate the results of natural language processing extractions of drugs from COVID-related literature; provide the results of molecular docking simulations against SARS-CoV-2 viral and host proteins; and include the trained machine learning models that the team is building to identify top candidates on which to run further, more expensive calculations.

All data and code links can be found at http://2019-ncovgroup.github.io/data/. Subsequent updates will be made available through the same web page, and further release papers will be issued as necessary. The code for the examples used in this paper can be found at https://github.com/globus-labs/covid-analyses.

The data generated have been prepared as part of the nCov-Group Collaboration, a group of over 200 researchers working to use computational techniques to address various challenges associated with COVID-19. We would like to thank all the researchers who helped to assemble the original datasets and who provided permission for redistribution.

This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Additional data storage and computational support for this research project has been generously provided by the following resources: the Petrel Data Service at the Argonne Leadership Computing Facility (ALCF), Frontera at the Texas Advanced Computing Center (TACC), and Comet at the San Diego Supercomputer Center (SDSC). The work leveraged data and computing infrastructure produced in other projects, including: ExaLearn and the Exascale Computing Project [48] (DOE Contract DE-AC02-06CH11357); Parsl, the parallel scripting library [49] (NSF 1550588); funcX, a distributed function-as-a-service platform [50] (NSF 2004894); Globus, data services for science (authentication, transfer, users, and groups; see globus.org for funding); and CHiMaD, the Materials Data Facility [51, 52] and Polymer Property Predictor Database [53] (NIST 70NANB19H005 and NIST 70NANB14H012).

For All Information. Unless otherwise indicated, this information has been authored by an employee or employees of UChicago Argonne, LLC, operator of Argonne National Laboratory with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies. While every effort has been made to produce valid data, by using this data, User acknowledges that neither the Government nor UChicago Argonne, LLC, makes any warranty, express or implied, of either the accuracy or completeness of this information or assumes any liability or responsibility for the use of this information. Additionally, this information is provided solely for research purposes and is not provided for purposes of offering medical advice. Accordingly, the U.S.
Government and UChicago Argonne, LLC, are not to be liable to any user for any loss or damage, whether in contract, tort (including negligence), breach of statutory duty, or otherwise, even if foreseeable, arising under or in connection with use of or reliance on the content displayed on this site.

References (titles as recovered; the list is incomplete in the source text):
Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2
An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 in human airway epithelial cell cultures and multiple coronaviruses in mice
Identification of potential treatments for COVID-19 through artificial intelligence-enabled phenomic analysis of human cells infected with SARS-CoV-2. bioRxiv
A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
COVID-19 Map, Johns Hopkins Coronavirus
SMILES. 2. Algorithm for generation of unique SMILES notation
DrugBank 5.0: A major update to the DrugBank database
DrugBank database
DrugCentral: Online drug compendium
Antiviral Candidate Compounds (2020)
A web-accessible database of experimentally determined protein-ligand binding affinities
Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking
DUD-E database of useful decoys
Enamine REAL database: Making chemical diversity real
970 million druglike small molecules for virtual screening in the chemical universe database GDB-13
GDB-13 small organic molecules up to 13 atoms
Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17
GDB-17 small organic molecules up to 17 atoms
Molecular sets (MOSES): A benchmarking platform for molecular generation models
PubChem substance and compound databases
PubChem, National Library of Medicine