key: cord-0702687-k90ynh2z
authors: Medina-Franco, José L.; Sánchez-Cruz, Norberto; López-López, Edgar; Díaz-Eufracio, Bárbara I.
title: Progress on open chemoinformatic tools for expanding and exploring the chemical space
date: 2021-06-18
journal: J Comput Aided Mol Des
DOI: 10.1007/s10822-021-00399-1
sha: 3b3b838b2f4560c0b6d5653ee432e6f87749c236
doc_id: 702687
cord_uid: k90ynh2z

The concept of chemical space is a cornerstone in chemoinformatics, and it has broad conceptual and practical applicability in many areas of chemistry, including drug design and discovery. One of its most considerable impacts is in the study of structure–property relationships, where the property can be a biological activity or any other characteristic of interest to a particular chemistry discipline. The chemical space is highly dependent on the molecular representation, which is also a cornerstone concept in computational chemistry. Herein, we discuss the recent progress on chemoinformatic tools developed to expand and characterize the chemical space of compound data sets using different types of molecular representations, generate visual representations of such spaces, and explore structure–property relationships in the context of chemical spaces. We emphasize the development of methods and freely available tools focusing on drug discovery applications. We also comment on the general advantages and shortcomings of using freely available and easy-to-use tools and discuss the value of using such open resources for research, education, and scientific dissemination.

Chemical space is a cornerstone concept in chemoinformatics. It serves as a framework to study the chemical compounds that populate, or might populate, the "chemical universe," i.e., all compounds that can exist. Although it seems a straightforward idea (in particular, if one associates the idea of the chemical space with the chemical universe), it is not easy to define uniquely. Other subjective and general notions frequently used in chemoinformatics are "similarity" [1], "diversity," "molecular or structural complexity" [2], "chemical beauty" [3], and "descriptors' usefulness," to name a few examples.

The notion of chemical space has numerous practical applications. In drug discovery, chemical space has provided a solid conceptual framework to guide diversity analysis, structure classification, library design, compound selection, and the assessment of structure-property and structure-activity relationships (SPR, SAR or SP(A)R), which is a fundamental practice in drug discovery [4]. As commented hereunder, the notion of chemical space is also related to computational chemogenomics, where one aims to predict (and then validate experimentally) the intersection between the chemical and biologically relevant space. Indeed, in the early 1960s, the quantitative analysis of SAR marked a significant milestone in the history of chemoinformatics and computer-aided drug design [5].

This Perspective aims to discuss advances in the development of chemoinformatic resources to characterize the chemical space of compound data sets using different types of molecular representations, generate visual representations of such spaces, and explore SP(A)R in the context of chemical spaces. In addition to analyzing the currently known chemical space, we comment on recent trends to augment the number of molecules that could be made. We emphasize the development of open tools focused on applications relevant to drug discovery. As part of the discussion, we comment briefly on the advantages and shortcomings of using freely available and user-friendly tools and on the value of using such tools in research, education, teaching, and scientific dissemination. This manuscript is organized into six main sections. After this introduction, Sect. 2 presents an overview of the concept of chemical space, providing examples of different definitions proposed in the literature. Section 3 covers advances on open resources to expand and describe the chemical space, e.g., augmenting the number of compounds either in stock or virtually available and calculating chemical descriptors. Section 4 presents advances on the concept and methods for the visual representation of the chemical space, including free web servers. Section 5 discusses progress on the exploration of SP(A)R in the context of chemical space, including the exploration of "StARs" (Structure-Activity Relationships) in chemical space. Section 6 presents the conclusions and future directions.

Chemical space is a subjective concept, and different definitions have been proposed, which have been reviewed elsewhere [4, 6]. For instance, Virshup et al. define chemical space as "An M-dimensional cartesian space in which compounds are located by a set of M physicochemical and/or chemoinformatic descriptors" [7]. Along the same lines, Arús-Pous et al. describe it as "a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical space where the position of each molecule is defined by its properties" [8]. Based on these notions, Fig. 1 shows what can be considered a "chemical space table," where the rows are the N chemical compounds themselves (identified by, for instance, a text identifier). The columns are the M descriptors that describe the compounds, defining the "M-dimensional cartesian space" of Virshup's definition.
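The "chemical space table" can be sketched in a few lines of Python. This is a minimal illustration, assuming three hypothetical compounds and three placeholder descriptors (the identifiers, descriptor names, and values are invented for the example and are not taken from Fig. 1). Each compound becomes a point in an M-dimensional space, and distances between rows can be computed once the columns are put on a common scale.

```python
import math

# Toy "chemical space table": N = 3 compounds (rows), M = 3 descriptors (columns).
# Identifiers and values are illustrative placeholders, not measured data.
descriptor_names = ["MW", "logP", "HBD"]
table = {
    "cmpd_A": [180.2, 1.2, 1.0],
    "cmpd_B": [194.2, -0.1, 0.0],
    "cmpd_C": [151.2, 0.5, 2.0],
}

def column_stats(rows):
    """Return per-column means and population standard deviations."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(m)
    ]
    return means, stds

def standardize(table):
    """Scale every descriptor to zero mean and unit variance."""
    ids = list(table)
    means, stds = column_stats([table[i] for i in ids])
    return {
        i: [(x - mu) / sd if sd > 0 else 0.0
            for x, mu, sd in zip(table[i], means, stds)]
        for i in ids
    }

def euclidean(u, v):
    """Euclidean distance between two points of the M-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

scaled = standardize(table)
d_ab = euclidean(scaled["cmpd_A"], scaled["cmpd_B"])
print(f"distance A-B in the scaled {len(descriptor_names)}-D space: {d_ab:.3f}")
```

Scaling is a design choice made here so that a descriptor with large numeric values (such as molecular weight) does not dominate the distance; other weighting schemes are equally possible.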

A common pitfall is that chemical space itself is frequently taken as equivalent to an image, i.e., a visual representation. Although data visualization plays a major role in many practical uses of chemical space, the chemical space itself is a subjective and general notion that depends primarily on the choice of the number and type of the descriptors that define the M-dimensional space. When a visualization method is not well suited to analyze a particular set of compounds and descriptors, it is always possible to analyze and extract information (and knowledge) from the chemical space using the full set (or relevant subsets) of the initial M dimensions. Unless only two or three descriptors define the M dimensions in Fig. 1 (M = 2 or 3, in which case the chemical space could be represented visually with a scatter plot), a method is required to project the M-dimensional space into two or three dimensions (2D/3D). Advances on the approaches to generate a visual representation of the chemical space, including the chemical space networks (which are coordinate-free), are addressed and cited in Sect. 4. Suppose one adds one or more columns to the table in Fig. 1, representing the values of biological activity evaluations. In that case, one can produce a data format to perform SAR studies, reminiscent of a QSAR table or SAR matrix. In light of the concept of polypharmacology and multi-target drug design, it is possible to explore structure-multiple activity relationships, e.g., "get SmARt" [9]. The "QSAR tables" have been the starting points for studies ranging from simple QSAR linear regression to the complex multivariate models now used in machine learning. Furthermore, QSAR tables are the basis of computational chemogenomics, a strategy to navigate the chemical and biologically relevant chemical space [10, 11].

The molecules (e.g., rows in the chemical space table in Fig. 1) typically used in drug discovery projects are small organic molecules (loosely defined as having a molecular weight below 1,000 Da, although they could be larger). These include natural products, which have a significant impact on drug discovery [12], and semi-synthetic compounds. However, other types of molecules are also of interest in drug discovery, such as therapeutic peptides and proteins [13, 14], antibodies, and metallodrugs [15, 16]. The representation of these types of compounds, particularly metallodrugs and organometallic molecules, is a major challenge in chemoinformatics. The representation and descriptors for (short) peptides and proteins are borderline between chemoinformatics and bioinformatics. For this Perspective, we will focus on the efforts to visualize the chemical space of mostly small organic molecules.

The descriptors (columns in the chemical space table in Fig. 1) can be any set of numbers that defines the space in an orderly (logical and rational) manner. The type of descriptors can be suited to define the desired space and apply the concept to an array of applications, depending on the project's goals. Molecular description and the type of descriptors are distinctive of the different informatic disciplines, in such a way that they contribute to shaping disciplines such as bioinformatics, chemoinformatics, biomedical informatics, etc. [17]. As commented in detail elsewhere, in chemoinformatics common descriptors are calculated based on linear notations that are well suited to manage many chemical compounds. It is also well known that there is no single descriptor or set of "best" descriptors, as they should be selected based on their performance on a specific task [18]. This is associated with the inductive learning process used in chemoinformatics (as opposed to the deductive learning used predominantly in quantum mechanics) [19].

Common types of descriptors that have been used to define the chemical space of small organic molecules include whole-molecule properties aimed at encoding the so-called "drug-like," "lead-like," ADME (absorption, distribution, metabolism, and excretion), toxicity, and other pharmaceutically relevant characteristics. Other major molecular representations are fingerprint-based descriptors of different designs (dependent and independent of the molecule) [20] and descriptors associated with sub-structures. Chemical space has also been approached using combined representations (e.g., hybrid fingerprints or combined molecular representations in general).

Beyond drug discovery, a recent application of physicochemical properties and molecular fingerprints to explore SPRs is to generate models that predict the smell of odorant molecules [21] .

As further commented below, a novel type of descriptors that have been used to explore chemical spaces is the ISIDA descriptors, used to navigate the chemical space of natural products [22, 23] .

Capecchi et al. recently proposed the molecular fingerprint MAP4 (MinHashed atom-pair fingerprint up to a diameter of four bonds). MAP4 has shown good performance in similarity searching and visual representation of the chemical space for small molecules and larger molecules such as peptides [24]. Reymond et al. recently used the MAP4 fingerprint to visualize the chemical space of natural products [25] and peptide libraries in the public domain [26].
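The MinHash idea behind fingerprints of this kind can be illustrated with a short, generic sketch. It is not the MAP4 implementation: the feature strings below are hypothetical stand-ins for atom-pair descriptions, and the hash family (seeded BLAKE2b) is an assumption made for the example. The sketch only shows how a fixed-length MinHash signature approximates the Jaccard (Tanimoto) similarity between two feature sets.

```python
import hashlib

def _hash(feature: str, seed: int) -> int:
    """Deterministic 64-bit hash of a feature string under a given seed."""
    digest = hashlib.blake2b(
        feature.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def minhash(features, n_perm=128):
    """For each of n_perm seeded hash functions, keep the minimum hash value.

    `features` must be a non-empty collection of strings.
    """
    return [min(_hash(f, seed) for f in features) for seed in range(n_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two MinHash signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity of two feature sets, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical feature sets (placeholders for atom-pair strings).
mol_1 = {"C|1|N", "C|2|O", "N|3|O", "C|1|C", "C|4|Cl"}
mol_2 = {"C|1|N", "C|2|O", "N|3|O", "C|1|C", "C|4|F"}

est = estimated_jaccard(minhash(mol_1), minhash(mol_2))
print(f"estimated: {est:.2f}, exact: {exact_jaccard(mol_1, mol_2):.2f}")
```

The estimate becomes more precise as `n_perm` grows, at the cost of a longer signature.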

Recently the in silico acid-based profile of small molecules has been used to explore the chemical space of small molecules with epigenetic activity [27] and natural products from different sources [28] .

There are reviews of open chemoinformatics resources for numerous applications [29, 30]. For instance, Singh et al. recently reviewed online web servers to perform virtual screening of small molecules and docking [31]. The authors reported 68 web applications in that review and classified them into target-fishing, ligand-based, and structure-based virtual screening. The review also covered compound databases that provide different information relevant to drug discovery, such as approved drugs, patented molecules, or commercially available small molecules. Wu et al. surveyed databases and software commonly used to predict ADME/Tox-related properties [32].

Regarding the use of free web servers, Table 1 outlines the advantages and disadvantages of using open-source programs and freely accessible web servers. Overall, a clear advantage over commercial software is that they provide resources for research groups with a limited budget [33] and support open science. Also, the correct use of open-source programs promotes data reproducibility and facilitates cross-comparisons. A general disadvantage or caution of free web servers and "easily accessible" software is that they can be used as black boxes: if they are used with no knowledge of the limitations of the tools, they might lead to poor interpretation. Also, "easy-to-use" software has the associated risk of being used to generate only data and not knowledge, and it might promote the practice of irrational use of computers for drug discovery. Herein, we do not aim to fully discuss these points, which are beyond the main goal of this manuscript, focused on the chemical space. Instead, we want to give a brief comment about this topic, which has been discussed openly in more detail elsewhere [34].

In the last few years, the chemical space has been growing rapidly: the number of compounds available in stock or that could be synthesized keeps increasing. Based on the concept of chemical space of Virshup et al. (vide supra), generating compounds could be graphically represented as incrementing the number of rows in the "chemical space table" of Fig. 1. Chemical databases systematically organize the information of chemical compounds, and such databases have played a key role in drug discovery [35]. Progress on the development of compound databases in the public domain for drug discovery applications has been reviewed recently, and the interested reader is directed to these publications [36, 37].

Virtual and make-on-demand libraries are having a significant impact on drug discovery. As pointed out by Walters, progress on the computer capabilities for generating and storing chemical compounds has increased the number of organic molecules that potentially could be synthesized [38] .

A prominent example of a freely available and large library is the Generated Databases (GDB) developed in the group of Reymond et al. [39]. The most recent version is GDB-17, which contains 166.4 billion compounds of up to 17 non-hydrogen atoms. It includes molecules not seen in the traditional medicinally relevant chemical space but with promising features to identify novel hit molecules [40].

Another recent development of an open resource to access purchasable or on-demand chemical libraries is ZINC20, which contains more than 9 million in-stock molecules and billions of new on-demand molecules [41]. Large-scale virtual screening of make-on-demand collections has led to the discovery of compounds with novel chemical scaffolds and submicromolar bioactivity [42]. Notably, ZINC20 includes resources to generate a visual representation of the chemical space of the so-called "ultra large-scale chemical database" [41].

Interestingly, the collection of compounds, so-called "dark chemical matter," represents a particular region of the chemical space that is mostly inactive [43] .

Another recent development is the increase in the availability of natural product collections in the public domain, which now surpass half a million molecules [44]. A notable advance in this area is the assembly of the public database COCONUT (COlleCtion of Open NatUral producTs) [45]. In response to the COVID-19 pandemic, large and small collections and data sets of natural products have been virtually screened to identify compounds potentially active against a number of molecular targets of SARS-CoV-2. In most cases, however, experimental validation of the computational hits has yet to be performed, as many publications were the result of "hype" and easy access to resources to conduct virtual screening.

Beyond the significant increase of chemical compounds that can be accessed (either in stock or readily accessible after synthesis), a common trend now is the generation of chemical compounds designed de novo using machine learning. This has been covered recently in excellent review papers [8, 46]. There have also been advances in the automated generation of short peptides for drug discovery applications. A recent example in this area is the development of the free web server D-Peptide Builder, which enumerates linear and cyclic combinatorial peptide libraries (Fig. 2) [47]. The server computes physicochemical properties of the newly enumerated peptides and provides tools to perform quantitative analysis of the structural diversity. D-Peptide Builder also enables a visual representation of the chemical space of the libraries and compares it with the chemical space of five preloaded compound data sets (including small molecules and peptides approved for clinical use, natural products, macrolides, and non-peptide protein-protein interaction modulators).
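Combinatorial enumeration of peptide libraries can be sketched as a Cartesian product over the building blocks allowed at each position. The snippet below is an illustrative sketch only and does not reproduce the algorithm of any server mentioned above: the function names are hypothetical, peptides are written as one-letter sequences, head-to-tail cyclic peptides are treated as equivalent under rotation, and the charge estimate is a deliberately crude approximation.

```python
from itertools import product

def enumerate_linear(blocks_per_position):
    """Yield every linear sequence built from one residue per position."""
    for combo in product(*blocks_per_position):
        yield "".join(combo)

def cyclic_canonical(seq):
    """Canonical form of a head-to-tail cyclic peptide:
    the lexicographically smallest rotation of the sequence."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

def enumerate_cyclic(blocks_per_position):
    """Enumerate cyclic peptides, merging sequences that are rotations
    of one another."""
    return sorted(
        {cyclic_canonical(s) for s in enumerate_linear(blocks_per_position)}
    )

def net_charge(seq):
    """Crude formal charge near pH 7: +1 per K/R, -1 per D/E
    (termini and histidine are ignored in this approximation)."""
    positive = sum(seq.count(x) for x in "KR")
    negative = sum(seq.count(x) for x in "DE")
    return positive - negative

# Hypothetical library: three positions, two allowed residues each.
positions = [["A", "K"], ["A", "K"], ["A", "K"]]
linear = list(enumerate_linear(positions))
cyclic = enumerate_cyclic(positions)
print(len(linear), "linear peptides;", len(cyclic), "distinct cyclic peptides")
for seq in cyclic:
    print(seq, net_charge(seq))
```

Because the library size is the product of the number of residues allowed at each position, it grows exponentially with peptide length, which is why generators are used above rather than fully materialized lists where possible.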

PepCoGen is also a free web server for generating peptides with a specific physicochemical profile [48] . In particular, the server generates all possible combinations of peptides by modifying the amino acids having a comparable physicochemical property profile at a given position.

In a separate work, the code of the Peptide Design Genetic Algorithm (PDGA) was made publicly available. PDGA is designed to generate peptide sequences of different topologies so that the generated sequences are similar to a given reference molecule, as measured with the macromolecule extended atom-pair fingerprint (MXFP), an atom-based fingerprint that considers the shape and pharmacophore features of the molecules [49]. The research group of Reymond has reviewed computational methods to design, generate, and visualize the chemical space of peptides [26].

In order to support teaching in chemoinformatics, a tutorial that describes how to enumerate virtual libraries was published recently [50] . The tutorial describes a step-by-step procedure for anyone interested in designing and building chemical libraries with or without experience in using computational tools.

In parallel to recent developments to enumerate, generate (synthesize), and make available chemical compounds (e.g., increase the number of rows in the "chemical space table" of Fig. 1, vide supra), there has been a lot of progress in the development of descriptors, e.g., augmenting the number of M dimensions or "columns" in Fig. 1. Of note, depending on the project's goals, one can generate a given finite set of descriptors to define the chemical space of the compounds under study. Thus, one can develop "different types of chemical spaces," e.g., defined by different sets of M descriptors (Fig. 1). Arguably, it has been commented that "different chemical spaces" are associated with different types of molecules (small molecules, biologics, polymers, materials, etc.) [46]. Under the latter notion, molecules of a different nature (like polymers, materials, etc.) would require a particular set of M descriptors.

Fig. 2 The graphical user interface of D-Peptide Builder: an example of a recent free web server to generate compounds. D-Peptide Builder enumerates combinatorial peptide libraries

To define or generate the M-descriptors and define the chemical space using open-source and freely available software, there are several tools that have been available in the public domain for several years now. Typical examples include MayaChemTools (chemistry toolkit) [51] , PaDEL-Descriptors [52] , and the 3D descriptors implemented in QuBiLs-MIDAS [53] , which was updated recently [54] . Additional free resources recently developed are briefly commented on hereunder.

PyDescriptors is a freely available set of 11,145 molecular descriptors that are easily interpretable and thus appropriate for QSAR studies [55]. PyDescriptors includes 1D, 2D, and 3D descriptors that encode atomic fragments, pharmacophoric patterns, and diverse fingerprints. It is a Python-based plugin implemented in PyMOL.

The Mordred package for Python contains 1,800 freely available 2D and 3D descriptors that are promising for chemoinformatic studies and SPR analysis [56]. The descriptors can be used for large molecules (e.g., maitotoxin, a large non-polymer natural product with a molecular weight of 3,422). The Python package can be installed and used on different platforms (Linux, Windows, macOS). In the original publication [56], the calculation of the Mordred descriptors was compared with PaDEL-Descriptors [52] and turned out to be faster.

Another recent development in descriptor calculation is ChemDes [57]. This is a public integrated web-based platform that calculates 2D and 3D descriptors and molecular fingerprints. It calculates 3,679 descriptors (BlueDesc, Chemopy, CDK, RDKit, and PaDEL) and 59 types of molecular fingerprints for small (drug-type) molecules. ChemDes is freely accessible, after registration, at http://www.scbdd.com/chemdes/ (accessed May 1st, 2021).

Overall, a critical and controversial point of chemical descriptors is their interpretability and physical meaning. In predictive models, it is open for discussion whether the descriptors only show a good statistical association between the chemical structure and the property of interest (e.g., biological activity), or whether they can actually explain or contribute to the causality of the activity as encoded by the chemical descriptors [58, 59].

Visualization of chemical space plays a key role in communicating and disseminating information with experts and non-experts within a research group, an organization, and the research community at large. In practice, chemical space is commonly studied together with a graphical representation of the descriptors, typically a low-dimensional graph (2D or 3D). Formally speaking, the chemical space (Fig. 1) could be unidimensional (1D), 2D, or 3D and, in those cases, can be represented straightforwardly using scatter plots. The challenge comes when there are four or more dimensions. To this end, different mathematical approaches to reduce dimensions and techniques for data visualization have been applied to project chemical information in low dimensions and then map another property, such as biological activity, on that low-dimensional representation. In the past few years, progress on data visualization has been reviewed by different authors [6, 60, 61]. However, generating meaningful, interpretable, and useful graphical representations of chemical space is not trivial. Visualization of the chemical space (in particular in light of the rapid expansion of the compounds that might populate the space) is an area of active research to develop or improve methods [62]. Representative novel developments in the visual representation of the chemical space using open-source and freely available resources are discussed hereunder.

The research group of Varnek et al. generated the so-called "Universal REACH map," an application of Generative Topographic Mapping (GTM) [63] to visualize the chemical space of chemicals from the Registration, Evaluation, Authorization and restriction of Chemicals (REACH) [64]. GTM produces 2D graphs on which each compound is represented with a data point. Ecotoxicological properties were mapped onto the 2D graph. The Universal REACH map was then used to classify and evaluate the properties of new chemicals projected onto the map, with a balanced accuracy from 0.60 to 0.78. In independent work, GTM was used to visualize a large library of 40 million fragment-like molecules [65] and the entire ZINC database of purchasable compounds, relative to 1.6 million biologically relevant molecules in ChEMBL [66]. A similar chemography approach using GTM was implemented to navigate the chemical space of 800 million organic molecules and identify "anti-CoV" regions [67]. More recently, GTM was used as a framework to visualize interactively the chemical space of a large database of natural products (COCONUT, vide supra) and ChEMBL [22]. The GTM maps were implemented into a freely available intuitive online tool called Natural Products Navigator (vide infra).
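Since balanced accuracy is the metric quoted for the map-based classification, a short sketch of how it is computed for a binary classifier may be useful. The labels below are hypothetical and unrelated to the REACH data; the point is only that balanced accuracy averages sensitivity and specificity, so it is not inflated by class imbalance the way plain accuracy is.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced example: four positives, one negative,
# and a classifier that always predicts the positive class.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

In this toy case the plain accuracy looks acceptable while the balanced accuracy reveals that the negative class is never recognized.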

ChemMaps is a methodology for the visual representation of chemical space. It is based on the similarity matrix of a compound data set, computed with molecular fingerprints and a similarity coefficient. ChemMaps follows the reference or satellite approach implemented in ChemGPS [68], with the working hypothesis that satellites are, in principle, molecules whose similarity to the rest of the molecules in the database provides sufficient information for generating a visualization of the chemical space. The code to generate ChemMaps is freely available [69].
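The satellite idea can be sketched as follows. This is a simplified illustration, not the published ChemMaps code: fingerprints are represented as sets of "on" bit indices, the Tanimoto coefficient is used as the similarity measure, and the satellites are picked by hand (how satellites are actually selected is not reproduced here). The resulting N × S matrix of similarities to S satellites, rather than the full N × N matrix, would then be the input to a dimension-reduction method.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def satellite_matrix(fingerprints, satellite_ids):
    """Rows: all compounds; columns: similarity to each satellite compound."""
    return {
        cid: [tanimoto(fp, fingerprints[s]) for s in satellite_ids]
        for cid, fp in fingerprints.items()
    }

# Hypothetical fingerprints (sets of on-bit indices).
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {2, 3, 4, 5},
    "m3": {3, 4, 5, 6},
    "m4": {10, 11, 12},
}
satellites = ["m1", "m4"]  # chosen by hand for this illustration

matrix = satellite_matrix(fingerprints, satellites)
for cid, row in matrix.items():
    print(cid, [round(x, 2) for x in row])
```

With S much smaller than N, the matrix has N × S entries instead of N × N, which is the practical appeal of the satellite approach for large data sets.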

Another methodological advance in the visualization of chemical space is given by virtual reality. Probst and Reymond developed a virtual reality chemical space of DrugBank where the user can interactively explore the contents of this database. The source code of the application is publicly available [70].

Chemical space networks (CSNs) represent another major conceptual advance to generate visual representations of the chemical space, as discussed in detail by Maggiora and Bajorath [71, 72]. A major feature of CSNs is that they are coordinate-free representations of the chemical space. An algorithm that readily transforms a multidimensional chemical space into CSNs, and that is also useful to explore SARs, has been developed [73]. CSNs have been used in many applications, including the assessment of molecules from patents [74].
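A coordinate-free representation of this kind can be sketched as a threshold network: compounds are nodes, and an edge joins two compounds whose similarity reaches a cutoff. The code below is a minimal sketch under stated assumptions (set-based fingerprints, Tanimoto similarity, and a cutoff of 0.5 chosen arbitrarily); it is not the algorithm of refs. [71–73].

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_csn(fingerprints, threshold=0.5):
    """Adjacency sets: an edge links two compounds with similarity >= threshold."""
    adjacency = {cid: set() for cid in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Group nodes into connected components with a depth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical fingerprints (sets of on-bit indices).
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {2, 3, 4, 5},
    "m3": {3, 4, 5, 6},
    "m4": {10, 11, 12},
}
network = build_csn(fingerprints, threshold=0.5)
print(connected_components(network))
```

Note that no coordinates are ever assigned to the compounds: the network topology alone carries the neighborhood information, and the choice of threshold controls how dense the network is.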

DataWarrior is a free stand-alone program that is being increasingly used for diverse chemoinformatics tasks, including data visualization [75, 76]. A recent version of DataWarrior (number 5.00) implemented t-SNE [77]. At the time of writing this manuscript (May 2021), the latest release of DataWarrior is 5.5.0.

Table 2 summarizes free web applications to visualize the chemical space of compound collections. The table includes ChemGPS-NP, one of the first free web applications developed to visualize the biologically relevant chemical space [78]. In addition to ChemGPS-NP, some of the web servers in the table are dedicated to the browsing and visualization of the chemical space of user-supplied compounds (e.g., ChemMap.com [79], tMAPs [80], Natural Products Navigator [22]). Other websites include other functionalities, such as D-Peptide Builder [47] and the Platform for Unified Molecular Analysis (PUMA) [81]. D-Peptide Builder is an application to enumerate chemical spaces of peptide combinatorial libraries and visualize chemical spaces. PUMA is a server that integrates the calculation of descriptors and the visual representation of the chemical space based on those descriptors. Both web servers are part of D-Tools, a set of free web applications for chemoinformatics (https://www.difacquim.com/d-tools/) [82]. The research group of Reymond has developed several of the free web applications in Table 2 for the interactive visualization of chemical space (https://gdb.unibe.ch/tools/).

Figure 3 shows an example of a visualization of chemical space using the free server PUMA (Table 2). The figure shows a principal component analysis based on six physicochemical properties of pharmaceutical interest of two focused libraries (targeting DNMT1 and epigenetic targets). The libraries represent commercial synthetic compounds that can be acquired from chemical vendors for experimental screening. In PUMA, the user supplies the SMILES strings of curated compound libraries, and the server computes the physicochemical properties (i.e., the descriptors) internally and then performs the principal component analysis. The user chooses to plot the first two or three principal components. From the lower left part of the graphical user interface (Fig. 3), the user can download from the server the raw data, the loadings, and a summary of the analysis. Full details of the server are described in [81].
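The principal component analysis step can be sketched without external libraries. The example below is not the server's code: it assumes a small hypothetical table of three properties (the names and values are placeholders) instead of six, standardizes the columns, and extracts the first two principal components of the correlation matrix by power iteration with deflation. It returns the scores (coordinates for a 2D plot), the loadings, and the variance captured by each component.

```python
import math
import random

def standardize(X):
    """Center each column and scale it to unit (sample) variance."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
        for j in range(m)
    ]
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(m)]
        for row in X
    ]

def covariance(Z):
    """Sample covariance matrix of column-centered data."""
    n, m = len(Z), len(Z[0])
    return [
        [sum(row[i] * row[j] for row in Z) / (n - 1) for j in range(m)]
        for i in range(m)
    ]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenpair(A, iters=500, seed=0):
    """Dominant eigenvalue/eigenvector of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in A]
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(A, v)))
    return eigenvalue, v

def pca(X, n_components=2):
    """Return (scores, loadings, variances) for the leading components."""
    Z = standardize(X)
    C = covariance(Z)
    loadings, variances = [], []
    for _ in range(n_components):
        lam, v = top_eigenpair(C)
        loadings.append(v)
        variances.append(lam)
        # Deflation: subtract the component just found from the matrix.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    scores = [[sum(z * c for z, c in zip(row, comp)) for comp in loadings]
              for row in Z]
    return scores, loadings, variances

# Hypothetical property table: columns are MW, logP, TPSA (placeholder values).
X = [
    [250.0, 1.5, 60.0],
    [310.0, 2.8, 45.0],
    [420.0, 4.1, 90.0],
    [180.0, 0.3, 75.0],
    [365.0, 3.6, 30.0],
]
scores, loadings, variances = pca(X)
for point in scores:
    print([round(x, 2) for x in point])
```

Power iteration is used here only to keep the sketch dependency-free; in practice a numerical library's eigendecomposition or singular value decomposition would be the usual choice.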

As commented above, since chemical space is defined by a set of M descriptors (Fig. 1) that encode the structural or other characteristics of the molecules, it can serve as a basis to analyze SPRs and SmARts if one adds one or more dimensions that describe the property (e.g., biological activity) of the compounds (i.e., the biological profile). Visually, the property (including the biological "activity") is usually mapped in the chemical space using a color (continuous color scale or categorical scheme) (Fig. 1), but it could be visually represented in different forms (e.g., shapes for categorical variables). The visualization of SP(A)R and "StARs" in chemical space has been commented on in the literature [61, 84]. Herein, we emphasize exemplary recent advances in this area.

Prof. Gerald Maggiora was one of the first investigators to start research on a general concept with high relevance in drug discovery, activity landscape modeling, with his founding Editorial on activity cliffs [85]: pairs of compounds with high structural similarity but unexpectedly large potency differences. Over the past few years, the concept, interpretation, and applications of activity cliffs have evolved, as reviewed by Bajorath et al. [86] [87] [88]. One of the most recent developments in the activity landscape concept has been the extension to model other properties of general interest beyond drug discovery [89].

To illustrate this point, Fig. 4a shows the Structure-Property Similarity (SPS) map for tubulin inhibitors generated with the free website Activity Landscape Plotter [90]. Each data point represents a pairwise comparison that shows the relationship between the difference in topological polar surface area (TPSA) and the molecular similarity. The data points are further distinguished by the SALI value [91], using a continuous color scale from a low value (green) to a high value (red). In this context, higher SALI values indicate pairs of compounds with a larger TPSA difference relative to their structural dissimilarity. In contrast, Fig. 4b shows a Dual-Property Difference (DPD) map, plotting all pairwise activity differences of tubulin inhibitors in the A-549 cell line (X-axis) and the HeLa cell line (Y-axis). Therefore, DPD maps facilitate the identification of compounds with selective and dual activity.

On the free web server, the 2D or 3D plot is interactive.
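The pairwise quantities plotted in an SPS map can be sketched as follows. The sketch assumes set-based fingerprints and the Tanimoto coefficient, uses hypothetical property values, and takes the commonly used definition of the structure-activity landscape index, SALI(i, j) = |P_i − P_j| / (1 − sim(i, j)); it is not the code of the Activity Landscape Plotter. For N compounds there are N(N − 1)/2 pairs, so 188 compounds give 188 × 187 / 2 = 17,578 comparisons, which matches the count quoted in the caption of Fig. 4.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sali(prop_i, prop_j, similarity):
    """SALI = |P_i - P_j| / (1 - sim).

    For identical fingerprints (sim = 1) the index is infinite when the
    properties differ and is set to 0.0 when they are equal.
    """
    diff = abs(prop_i - prop_j)
    if similarity >= 1.0:
        return float("inf") if diff > 0 else 0.0
    return diff / (1.0 - similarity)

def sps_points(fingerprints, properties):
    """One (id_i, id_j, similarity, |delta P|, SALI) tuple per compound pair."""
    points = []
    for a, b in combinations(sorted(fingerprints), 2):
        sim = tanimoto(fingerprints[a], fingerprints[b])
        diff = abs(properties[a] - properties[b])
        points.append((a, b, sim, diff, sali(properties[a], properties[b], sim)))
    return points

# Hypothetical fingerprints and TPSA-like property values.
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {2, 3, 4, 5},
    "m3": {3, 4, 5, 6},
    "m4": {10, 11, 12},
}
properties = {"m1": 60.0, "m2": 150.0, "m3": 65.0, "m4": 62.0}

# Rank pairs by SALI: the top pairs are candidate property cliffs.
for a, b, sim, diff, index in sorted(sps_points(fingerprints, properties),
                                     key=lambda p: p[4], reverse=True):
    print(a, b, round(sim, 2), diff, round(index, 1))
```

Plotting similarity on one axis and the property difference on the other, colored by the index, gives the kind of map described above; pairs with high similarity and a large difference stand out as cliffs.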

Using SPR graphs allows us to relate chemical structures with their properties, bioactivities, or other characteristics. For example, Fig. 4c shows a pair of compounds (13P and 11FF) that forms both a property cliff and a dual activity cliff. These compounds are structurally similar (0.470, using ECFP6 and the Tanimoto coefficient). However, their TPSA is different (property cliff). It is well documented that compounds with TPSA values > 140 Å² (like compound 11FF in Fig. 4c) tend to lose their ability to cross membranes, unlike compounds with TPSA values < 140 Å² (like compound 13P), which retain this ability [92]. This case study illustrates the similarity-property-activity relationship.

Constellation plots were developed to combine a substructure-based representation and classification of compounds with a coordinate-based representation of chemical space [93]. Constellation plots are 2D graphs that combine substructure-based clustering of compounds with a fingerprint-based similarity classification of the chemical scaffolds. The substructure-based clustering of the molecules is based on the concept of analog series-based scaffolds [94, 95]. Since the biological activity data (or any other property) can be mapped onto a Constellation plot, these 2D representations of the chemical space enable identifying whole regions in chemical space rich in SPR annotations: groups of molecules, aka "constellations" in chemical space. The groups of molecules rich in biological activity would be "bright StARs" in chemical space, different from "dark regions": groups of molecules with no biological activity [61].

Additionally, in the constellation plots, analog series with similar chemical structures are placed close together because they share similar X and Y coordinates in the 2D plots. In contrast, analog series with more different structures are far apart. Recently, López-López E. et al. proposed a methodology to navigate interactively/dynamically in the chemical space using constellation plots [96] by implementing the DataWarrior software [76]. All this allows applying filters for compounds, analog series, biological activity, and other properties of pharmaceutical interest using an intuitive platform that is well suited for all users (experts or non-experts on chemoinformatics tools). Figure 5 illustrates an example of a Constellation plot for a series of tubulin inhibitors. The plot shows 147 data points, each one representing an analog series. The size of the data point indicates the relative number of compounds in each analog series, and the color is the average activity of the compounds in the series, so that green-to-red colored dots point to analog series enriched with active molecules, hence more promising for further development. In contrast, cyan-to-blue colored dots indicate analog series with mostly inactive molecules. Full details of the study are described elsewhere [96].

Fig. 4 Property Landscapes of compounds with activity against Tubulin using cell-based inhibition data. a Structure-Property Similarity (SPS) map of 188 tubulin inhibitors that correspond to 17,578 pairwise comparisons. The property cliffs are displayed in the upper-right zone. Each data point was colored using a SALI value scale from green (low) to red (high); b Dual Property Difference (DPD) map of tubulin inhibitors. The dual active compounds are displayed in the upper right zone. Each data point was colored using a selectivity score from green (low) to red (high); c Example of a property and dual activity cliff

Constellation plots have been used to navigate the chemical space of high-throughput screening data of compounds consistently tested against the same panel of cell lines. In that work, Naveja et al. proposed a proof-of-concept method for finding consistent cell-selective analog series of chemical compounds and identified the so-called "luminaries in chemical space" [97].

For years the subjective but fundamental notion of chemical space has assisted drug discovery projects. Chemical space is also a cornerstone concept in chemoinformatics.

In the past few years, we have witnessed an expansion of the chemical space regarding the number of compounds that are known or can be synthesized in principle. As discussed in this Perspective, the ways in which chemical compounds can be represented and the number of public tools to compute descriptors are also growing. Open-source codes can be implemented in other public web servers, chemoinformatics suites, and desktop programs. In any case, the ready availability of compound libraries that are expanding the chemical space, together with the ready availability of tools to conduct virtual screening, e.g., in silico bioactivity profiling (or computer-assisted compound selection of the chemical space), favors the potential identification of small molecules with therapeutically relevant targets.

Fig. 5 Constellation plot of compounds with activity against Tubulin using cell-based inhibition data. The plot shows 147 data points, each one representing an analog series. The size of the data point indicates the relative number of compounds in each analog series, and the color is the average activity of the compounds in the series. Linking lines represent shared molecules between two analog series. Figure was adapted from López-López E. et al. [96]

Similar to the expansion of the chemical space (more compounds and more descriptors, i.e., enlarging the table in Fig. 1), novel free applications and open-source methods to generate visual representations of the chemical space are emerging and evolving. Recent developments include CSNs, TMAPs, GTMs, Constellation plots, and Chem-Maps. Virtual reality has started to facilitate the interactive exploration of chemical spaces. Some of these visualization tools have been implemented in freely available websites that enable the browsing of chemical spaces. Several methodologies aim to assist the analysis of SP(A)Rs and identify promising regions or clusters of compounds in chemical space.

Despite the numerous open-source and easily accessible ways to calculate molecular descriptors, the user has to apply them rationally: carefully preparing (curating) the compounds and then generating appropriate descriptors relevant to the problem in question. Considering the large chemical databases and large sets of descriptors available, one of the first and most critical questions is defining the chemical space to be explored by focusing on the type of compounds of interest and the type of descriptors. In several drug discovery applications, the choice of compounds and descriptors is dynamic: an iterative process where one explores different compounds and various descriptors until finding those that best suit the goals of the work.

We also want to encourage students, newcomers to the field, and users of free and easy-to-use tools and websites to properly use and interpret the concept of chemical space. Based on the topics discussed in this Perspective, chemical space is a subjective and complex notion that goes beyond nice and colorful graphs. Along these lines, we encourage newcomers to the field to select methods for the right reasons: not because they are "popular," but because they are thoroughly validated and properly documented. The interested reader is referred to the Opinion manuscript "Rationality over fashion and hype in drug design," where this and related points are discussed in more detail, and which is open for discussion with the scientific community [34].

Molecular similarity in medicinal chemistry

The many roles of molecular complexity in drug discovery

Quantifying the chemical beauty of drugs

Visualization of the chemical space in drug discovery

Chemistry in times of artificial intelligence

Progress in visual representations of chemical space

Stochastic Voyages into uncharted chemical space produce a representative library of all possible drug-like compounds

Exploring chemical space with machine learning

Getting smart in drug discovery: chemoinformatics approaches for mining structure-multiple activity relationships

Chemogenomic strategies to expand the bioactive chemical space

A perspective on computational chemogenomics

Natural products in drug discovery: advances and opportunities

Peptide therapeutics: current status and future directions

Computational avenues in oral protein and peptide therapeutics

Metallodrugs in medicinal inorganic chemistry

Metallodrugs are unique: opportunities and challenges of discovery and development

Informatics for chemistry, biology, and biomedical sciences

Molecular representations in AI-driven drug discovery: a review and practical guide

Chemoinformatics as a theoretical chemistry discipline

Introduction to molecular similarity and chemical space

Smiles to smell: decoding the structure-odor relationship of chemical compounds using the deep neural network approach

NP Navigator: a new look at the natural product chemical space

Isida property-labelled fragment descriptors

One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome

Assigning the origin of microbial natural products by chemical space map and machine learning

The acid/base characterization of molecules with epigenetic activity

Analysis of the acid/base profile of natural products from different sources

One hundred thousand mouse clicks down the road: selected online resources supporting drug discovery collected over a decade

Open chemoinformatic resources to explore the structure, properties and chemical space of molecules

Virtual screening web servers: designing chemical probes and drug candidates in the cyberspace

Computational approaches in preclinical studies on drug discovery and development. Front Chem 8:726

Computational chemistry on a budget: supporting drug discovery with limited resources

Rationality over fashion and hype in drug design

Chemical database techniques in drug discovery

Large compound databases for structure-activity relationships studies in drug discovery

Freely accessible databases of commercial compounds for high-throughput virtual screenings

Virtual chemical libraries

970 Million drug-like small molecules for virtual screening in the chemical universe database Gdb-13

The Generated Databases (GDBs) as a source of 3d-shaped building blocks for use in medicinal chemistry and drug discovery

ZINC20-a free ultralarge-scale chemical database for ligand discovery

Ultralarge library docking for discovering new chemotypes

Dark chemical matter as a promising starting point for drug lead discovery

Data resources for the computer-guided discovery of bioactive natural products

COCONUT Online: collection of open natural products database

Defining and exploring chemical spaces

D-Peptide Builder: a web service to enumerate, analyze, and visualize the chemical space of combinatorial peptide libraries

Peptide combination generator: a tool for generating peptide combinations

Populating chemical space with peptides using a genetic algorithm

Chemoinformatics-based enumeration of chemical libraries: a tutorial

Mayachemtools: an open source package for computational drug discovery

Padel-descriptor: an open source software to calculate molecular descriptors and fingerprints

Qubils-Midas: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps

Distributed and multicore Qubils-Midas Software V2.0: Computing Chiral, Fuzzy, Weighted and Truncated Geometrical Molecular Descriptors Based on Tensor Algebra

Pydescriptor: a new pymol plugin for calculating thousands of easily understandable molecular descriptors

Mordred: a molecular descriptor calculator

Chemdes: an integrated web-based platform for molecular descriptor and fingerprint computation

On the interpretation and interpretability of quantitative structure-activity relationship models

Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications

The chemical space project

Reaching for the bright stars in chemical space

Call for papers for the special issue: from reaction informatics to chemical space

Parallel generative topographic mapping: an efficient approach for big data handling

Visualization and analysis of the reach-chemical space with generative topographic mapping

Mapping of the available chemical space versus the chemical universe of lead-like compounds

Chemography: searching for hidden treasures

A chemographic audit of anti-coronavirus structure-activity information from public databases (ChEMBL)

ChemGPS-NPweb: chemical space navigation online

Chemmaps: towards an approach for visualizing the chemical space based on adaptive satellite compounds

Exploring Drugbank in virtual reality chemical space

Chemical space networks: a powerful new paradigm for the description of chemical space

Lessons learned from the design of chemical space networks and opportunities for new applications

Chemical space visualization: transforming multidimensional chemical spaces into similarity-based molecular networks

Exploring sets of molecules from patents and relationships to other active compounds in chemical space networks

Datawarrior: an evaluation of the open-source drug discovery tool

Datawarrior: an open-source program for chemistry aware data visualization and analysis

Visualizing data using T-SNE

ChemGPS-NP: tuned for navigation in biologically relevant chemical space

Exploring drug space with

Visualization of very large high-dimensional data sets as minimum spanning trees

Chemoinformatics: a perspective from an academic setting in Latin America

Atlascbs: a web server to map and explore chemico-biological space

Data structures and computational tools for the extraction of sar information from large compound sets

On outliers and activity cliffs-why QSAR often disappoints

Advancing the activity cliff concept, Part II

Increasing the public activity cliff knowledge base with new categories of activity cliffs

Advances in exploring activity cliffs

From qualitative to quantitative analysis of activity and property landscapes

Activity landscape plotter: a web-based application for the analysis of structure-activity relationships

Structure-activity landscape index: identifying and quantifying activity cliffs

Physicochemical parameters of recently approved oral drugs

Finding constellations in chemical space through core analysis

Computational method for the systematic identification of analog series and key compounds representing series and their biological activity profiles

A general approach for retrosynthetic molecular core analysis

Tubulin inhibitors: a chemoinformatic analysis using cell-based data

Consistent cell-selective analog series as constellation luminaries in chemical space


Conflicts of interest None.