key: cord-0065570-6tzlfv26
authors: Elkin, Magdalyn E.; Zhu, Xingquan
title: Community and topic modeling for infectious disease clinical trial recommendation
date: 2021-07-07
journal: Netw Model Anal Health Inform Bioinform
DOI: 10.1007/s13721-021-00321-7
sha: 8342ffe83f2a3cb151321b463d8fbc01219ab6a6
doc_id: 65570
cord_uid: 6tzlfv26

Clinical trials are crucial for the advancement of treatment and knowledge within the medical community. Although the ClinicalTrials.gov initiative has resulted in a rich source of information for clinical trial research, only a handful of analytic studies have been carried out to understand this valuable data source. Analysis of this database provides insight into emerging trends in clinical research. In this study, we propose to use network analysis to understand infectious disease clinical trial research. Our goal is to address two important issues related to clinical trials: (1) the concentrations and characteristics of infectious disease clinical trial research, and (2) recommendation of clinical trials to a sponsor (or an investigator). The first issue helps summarize clinical trial research related to particular diseases, and the second helps match clinical trial sponsors and investigators for information recommendation. Using 4228 clinical trials as the test bed, our study investigates 4864 sponsors and 1879 research areas characterized by Medical Subject Heading (MeSH) keywords. We use a network to characterize infectious disease clinical trials, and design a new community-topic-based link prediction approach to predict sponsors’ interests. Our design relies on network modeling of both clinical trial sponsors and keywords. For sponsors, we extract communities, each consisting of sponsors with coherent interests. For keywords, we extract topics, each containing semantically consistent keywords. The communities and topics are combined for accurate clinical trial recommendation. This study concludes that network analysis can substantially help the understanding of clinical trial research for effective summarization, characterization, and prediction.

Clinical trials carry out tests on human participants with respect to different interventions, including new medications or treatments, to understand and answer meaningful clinical questions (Friedman et al. 2015; Elkin and Zhu 2021). These studies are critical for discovering new ways to diagnose, treat, and reduce the risk of disease. Understanding the concentrations and characteristics of clinical trials in specific disease areas is important for researchers and industry to be aware of emerging trends. In addition, understanding clinical trial topics also makes it possible to recommend clinical trials to researchers, using shared knowledge such as common interests, community, and topics (Hurtado et al. 2016). For example, a recommendation engine can recommend relevant clinical trials to researchers by using shared study topics, so the researchers can be fully aware of existing/previous studies in the field.

The ClinicalTrials.gov database serves as a registry and results database for clinical trials all over the world. The database provides patients, researchers, and the public with easy access to past, current, and future clinical trials. ClinicalTrials.gov was created in 1997 as a registry of clinical trial information for federally and privately funded trials. The Food and Drug Administration Amendments Act (FDAAA) of 2007 defines Applicable Clinical Trials (ACTs), which are legally required to register with the ClinicalTrials.gov database. ACTs include the following: clinical investigations of any U.S. Food and Drug Administration (FDA) regulated drug/biological product, certain studies of FDA-regulated medical devices, investigational studies that have one or more sites in the US, FDA investigational new drug studies, and trials involving drug/biological/medical devices manufactured in the U.S. (Zarin et al. 2016). While the FDAAA specifies which clinical trials are legally required to register with the online ClinicalTrials.gov database, many trials are not legally obligated to do so. As of October 2020, the ClinicalTrials.gov database holds 355,127 clinical trials from over 217 countries (ClinicalTrials.gov 2020). The ClinicalTrials.gov database is an abundant source of clinical trial studies, with the longest history and largest complete data (Yang and Lee 2018). Unfortunately, the database is an underutilized information source for the health industry and life science research (Glass et al. 2014). Meanwhile, there has been an exponential increase in the number of registered clinical trials in the US. The global clinical trial market is expected to reach 65.2 billion dollars by 2025, growing at a compound annual growth rate of over 5.5% from 2017 to 2025 (CenterWatch Staff 2017). This growth rate is expected to increase due to the growing prevalence of diseases and incidence of new diseases. Such a trend naturally raises questions on how to better analyze and utilize existing clinical reports to benefit industry, academia, and individuals (Califf et al. 2012).

Determining the relationships and projecting trends in clinical trials can be a daunting task, considering that there are a large number of variables from different sub-domains. Network analytics are commonly used to understand the structure, development, and relationships of complex systems. Such analysis provides valuable information about the systems, such as link prediction, correlation, or degree distribution (Gundogan and Kaya 2017).

For example, a previous study modeled clinical trials as a collaboration network to understand the relationships between listed pharmaceutical companies, research institutes, and universities, and their mechanisms (Yang and Lee 2018). Another study created a bipartite graph from clinical trial reports from ClinicalTrials.gov to study patterns of interventions in depression trials. The authors transformed the bipartite network into a single-mode network, where intervention nodes are connected if they co-occur in a clinical trial (Bhavnani et al. 2010). This method was able to group together similar intervention methods while quantifying trends in depression interventions.

Network analysis is commonly used for drug repurposing research. Drug or disease networks can be created using expression patterns, disease pathology, protein interactions, or genetic data to find potential drugs to treat a disease of interest (Pushpakom et al. 2019). Such analysis can classify gene-disease associations with high accuracy, or identify drugs that have an effect against respiratory viral host targets (Pushpakom et al. 2019). A previous study created a disease-drug bipartite network, where a drug is connected to a disease if it is among the three most commonly used treatments for the disease. Using an internal-link-based link prediction method, the authors were able to predict drugs that treated the diseases in the network (Gundogan and Kaya 2017).

In our previous study, we proposed to use a bipartite network to represent clinical trial research entities and their relationships, and designed a community-based link prediction (CLP) method to model sponsors as communities and predict links for information recommendation (Elkin et al. 2019). Although effective, CLP cannot make recommendations to all sponsors, because a small portion of sponsors may be assigned to invalid communities, due to their sparse connections or specialized areas not shared by many others. As a result, sponsors in invalid communities cannot leverage information from peers within the same community for recommendation.

In this paper, we propose to use both communities and topics for clinical trial recommendation. For sponsors, we extract communities, each consisting of sponsors with coherent interests. For keywords, we extract topics, each containing semantically consistent keywords. By introducing topic-based link prediction, we are able to connect sparse research areas through topics, which provide a better similarity metric for comparing sponsors. The communities and topics are combined for accurate clinical trial recommendation.

The main contribution of the study is as follows. to recommend clinical trials to sponsors. The general framework can be extended to many other disease types or medical domains.

In our study, 4228 infectious disease clinical trial reports are downloaded, in XML format, from the ClinicalTrials.gov database as the test bed. The downloaded reports include past, current, and future clinical trials during 1991-2023. Because the main goal of our research is to understand the characteristics of infectious disease clinical trials (e.g. what are the main diseases studied in infectious disease clinical trials, who is interested in infectious disease, and what other areas they are interested in), we extract investigators/sponsors and clinical trial areas from two XML tags: (1) investigator information: ⟨overall_official⟩, and (2) area of clinical trials: Medical Subject Headings (MeSH) ⟨mesh_term⟩. An investigator is the individual (e.g. a physician or a researcher) who submits and is in charge of the underlying clinical trial. When an investigator name does not exist in the clinical trial report, the trial's sponsor is used instead. For simplicity, we refer to both investigators and sponsors as sponsors. Research areas are Medical Subject Headings (MeSH) terms, which roughly define the focused research topics of the underlying clinical trial. MeSH was created by the US National Library of Medicine as a method to describe a wide variety of biomedical topics in order to properly index articles in MEDLINE (Huang et al. 2011). In this study, the research area is determined by the intervention and condition MeSH words from the file. A clinical trial report often contains one or multiple sponsors, and multiple research areas.

Formally, we use $s$ to denote a sponsor and $k$ to denote a keyword of a research area. Likewise, we use $\mathcal{S}$ to denote the set of all sponsors, and $\mathcal{K}$ represents the set of all keywords (research areas). From our test bed, we extracted 4864 sponsors (i.e. $|\mathcal{S}| = 4864$) and 1878 research areas (i.e. $|\mathcal{K}| = 1878$).

Clinical trials involve complex sponsor and research area relationships. A sponsor may be interested in multiple closely related (or interdisciplinary) research areas, and results from one research area may be beneficial to other areas. The pair-wise bond between sponsors and research areas provides a bipartite relationship for analysis, so we use a bipartite network as the underlying data structure to support our analysis. Formally, a bipartite network $G = (V_1, V_2, E)$ is a graph whose node set $V$ can be partitioned into two disjoint sets ($V = V_1 \cup V_2$). No node belongs to both sets ($V_1 \cap V_2 = \emptyset$). In our research, sponsors represent one set of nodes and research areas represent the second set of nodes. An edge $e(s, k)$ connects a node $s$ in the sponsor node set to a node $k$ in the research area node set ($E \subset V_1 \times V_2$), and $E$ denotes the edge set of the graph. An example bipartite network is shown in Fig. 1a. The degree of a node, $\deg(v)$, is the number of edges incident to node $v$. In an undirected bipartite graph, $\deg(s)$ is the number of $k$ nodes that $s$ is connected to, and vice versa. In Fig. 1a, $\deg(s_1) = 3$.

If a clinical trial has multiple sponsors, edges are created from all sponsors to the trial's research areas. For each edge $e(s, k)$, a weight value $w_{s,k}$ represents the number of times a sponsor is connected to a research area. To decrease the sparsity of the network, MeSH words that contain a comma were separated into two research areas, e.g., "Influenza, Human" was separated into "Influenza" and "Human".
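The network construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout (keys `sponsors` and `mesh_terms`) and the function names are hypothetical, and it assumes an edge weight is incremented once per trial in which a sponsor and a research area appear together.

```python
from collections import Counter


def split_mesh_term(term):
    """Split a MeSH term on commas, e.g. "Influenza, Human" -> ["Influenza", "Human"]."""
    return [part.strip() for part in term.split(",") if part.strip()]


def build_bipartite_edges(trials):
    """Return {(sponsor, research_area): weight} for a list of trial records.

    Each record is a dict with the hypothetical keys "sponsors" and "mesh_terms".
    Assumption: the weight counts the trials linking a sponsor to a research area.
    """
    weights = Counter()
    for trial in trials:
        areas = set()
        for term in trial["mesh_terms"]:
            areas.update(split_mesh_term(term))
        # Every sponsor of the trial is linked to every research area of the trial.
        for sponsor in set(trial["sponsors"]):
            for area in areas:
                weights[(sponsor, area)] += 1
    return dict(weights)


def sponsor_degree(weights, sponsor):
    """deg(s): number of distinct research areas connected to a sponsor."""
    return len({k for (s, k) in weights if s == sponsor})


# Toy records (made up for illustration only).
trials = [
    {"sponsors": ["s1", "s2"], "mesh_terms": ["Influenza, Human", "Vaccines"]},
    {"sponsors": ["s1"], "mesh_terms": ["Influenza", "Oseltamivir"]},
]
edges = build_bipartite_edges(trials)
```

In this toy input, sponsor `s1` reaches "Influenza" through both trials, so that edge carries weight 2, while all other edges have weight 1.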

Community detection aims to find connected groups of nodes within a network. In Fig. 1a, the dot-dash line represents the split of the bipartite network into two communities such that $\mathbb{C}_1$ contains node set $\{s_1, s_2, s_3, k_1, k_2, k_3\}$ and $\mathbb{C}_2$ contains node set $\{s_4, s_5, k_4, k_5, k_6\}$. Network community detection was done using the LPAwb+ algorithm created by Beckett (2016). Communities are found as distinct modules, each consisting of a combination of the two node types in a weighted bipartite network. The goal is to maximize the modularity score for a weighted bipartite network, $Q_W$, defined in Eq. (1) (Beckett 2016; Dormann and Strauss 2014).

In Eq. (1), $s$ and $k$ are the node types, sponsors and research areas; $s_u$ is a sponsor node and $k_v$ is a research area node. The Kronecker delta function $\delta(s_u, k_v)$ equals one when nodes $s_u$ and $k_v$ are in the same module, or community, and zero otherwise. $\tilde{E}$ is the null-model matrix of interactions expected between two nodes, $W$ is the weighted incidence matrix, $y$ is the vector of incidence matrix row totals, and $z$ is the vector of column totals. The algorithm computes modules in two stages. In the first stage, sponsor nodes are updated using information from research area nodes, and research area nodes are updated using information from sponsor nodes. For a sponsor node $x$, its node label, $s_x$, is found by maximizing Eq. (2). Labels are updated until the modularity score, $Q_W$, no longer increases (Beckett 2016).

In the second stage, groups of communities are merged together. Each module consists of nodes sharing the same label. Communities are merged if merging increases network modularity. This is repeated until merging more communities does not increase network modularity further (Beckett 2016). Each community, $\mathbb{C}_c$, contains a distinct subset of $s$ and $k$ nodes, such that $\mathbb{C}_1 \cap \mathbb{C}_2 = \emptyset$.
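To make the objective concrete, the sketch below evaluates a weighted bipartite modularity for a given labelling. It assumes the usual form from Beckett (2016), $Q_W = \frac{1}{M}\sum_u\sum_v (W_{uv} - y_u z_v / M)\,\delta(s_u, k_v)$, where $M$ is the total edge weight; $M$, the function name, and the toy matrix are illustrative assumptions, and the code only scores a labelling rather than running the LPAwb+ label updates or merges.

```python
def weighted_bipartite_modularity(W, sponsor_labels, area_labels):
    """Score a labelling with Q_W = (1/M) * sum_{u,v} (W[u][v] - y[u]*z[v]/M) * delta.

    W is a weighted incidence matrix (rows: sponsors, columns: research areas).
    Assumed form of the modularity; M is the total edge weight.
    """
    y = [sum(row) for row in W]            # row totals, one per sponsor
    z = [sum(col) for col in zip(*W)]      # column totals, one per research area
    M = sum(y)                             # total edge weight
    q = 0.0
    for u, row in enumerate(W):
        for v, w_uv in enumerate(row):
            # Kronecker delta: only pairs placed in the same module contribute.
            if sponsor_labels[u] == area_labels[v]:
                q += w_uv - y[u] * z[v] / M
    return q / M


# Toy network: two sponsors share two areas, a third sponsor has its own area.
W = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]
q_split = weighted_bipartite_modularity(W, [0, 0, 1], [0, 0, 1])
q_merged = weighted_bipartite_modularity(W, [0, 0, 0], [0, 0, 0])
```

Splitting the toy network along its block structure scores higher than placing every node in a single module, which is the behavior the two-stage procedure exploits when it accepts only label changes and merges that increase modularity.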


Because infectious disease clinical trials cover many diverse research areas, it is important to determine the robustness of communities. The transitivity of social networks has been widely studied (Newmann 2001; Opsahl 2011). Transitivity characterizes connectivity in a network through the number of connections among connected nodes. It is measured by the fraction of connected triangles to the number of connected triplets (Newmann 2001). A triangle is where $V_1$ and $V_2$ are connected and are both connected to $V_3$. A connected triplet is where $V_1$ is connected to $V_2$, and $V_2$ is connected to $V_3$, and there is no connection between $V_1$ and $V_3$. To measure transitivity, the clustering coefficient, $C_c$, is often used (Newmann 2001; Opsahl 2011). This is frequently used in one-mode networks; an example of a one-mode network is shown in Fig. 1c. A high clustering coefficient indicates high robustness. If a graph is completely connected, i.e., all nodes connect to each other, $C_c = 1$. If the graph has no triangles, $C_c = 0$.

However, the global clustering coefficient cannot be applied to two-mode networks, such as a bipartite network (Fig. 1a). By definition, in a two-mode network, nodes in one set only connect to nodes in the other set, thus a triangle will never form (Opsahl 2011), as shown in Fig. 1b. So to determine robustness, we used two coefficients created for bipartite two-mode networks. The first is a global coefficient, $GC_c$, which measures the number of closed 4-paths compared to the number of 4-paths. A path is a sequence of connected distinct nodes. An open 4-path is one where the first and last nodes do not connect. In Fig. 1d (upper panel), nodes $k_2, s_2, k_4, s_5, k_6$ are on an open 4-path. A closed 4-path (also called a 4-cycle) is a path where the first and last nodes connect. In a bipartite graph, they are connected by a fifth node. In Fig. 1d (lower panel), nodes $k_2, s_2, k_4, s_5, k_5$ are on a closed 4-path, closed by $s_4$. A 4-cycle is the smallest cycle possible in a two-mode network. $GC_c = 1$ if all 4-paths in a bipartite network are closed, and 0 if all 4-paths are open (Opsahl 2011).

The second measure is the reinforcement coefficient, RC c , which measures the number of closed 3-paths compared to total 3-paths in the network. It's considered reinforcement between two sponsors rather than a measure of clustering between a group of sponsors. A high reinforcement coefficient indicates localized closeness in a bipartite network (Robbins and Alexander 2004) .

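$$GC_c = \frac{\#\ \text{of closed 4-paths}}{\#\ \text{of 4-paths}} \qquad (4)$$

$$RC_c = \frac{\#\ \text{of closed 3-paths}}{\#\ \text{of 3-paths}} \qquad (5)$$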

A community whose research areas only connect to one sponsor, or multiple sponsors only connect to one research area would not have a value for either GC c or RC c coefficient (an example is shown in Fig. 6b ). In this case, we consider this type of community as an invalid community.
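The validity check can be illustrated with a brute-force count of 3-paths. The sketch below is one possible reading of the reinforcement coefficient as the share of 3-paths $s$-$k$-$s'$-$k'$ that are closed by an edge between $s$ and $k'$; the function name and the toy edge lists are hypothetical, and the enumeration is only practical for small communities.

```python
def reinforcement_coefficient(edges):
    """Fraction of 3-paths s-k-s2-k2 that are closed by an edge (s, k2).

    `edges` is an iterable of (sponsor, research_area) pairs for one community.
    Returns None when the community contains no 3-path, i.e. no value exists.
    """
    areas_of = {}
    sponsors_of = {}
    for s, k in edges:
        areas_of.setdefault(s, set()).add(k)
        sponsors_of.setdefault(k, set()).add(s)

    total = closed = 0
    # Each 3-path has exactly one sponsor endpoint, so starting the walk
    # from that endpoint counts every path once.
    for s, s_areas in areas_of.items():
        for k in s_areas:
            for s2 in sponsors_of[k] - {s}:
                for k2 in areas_of[s2] - {k}:
                    total += 1
                    if k2 in s_areas:
                        closed += 1
    return closed / total if total else None


# Two sponsors sharing two research areas: every 3-path is closed.
dense = [("s1", "k1"), ("s1", "k2"), ("s2", "k1"), ("s2", "k2")]
# One sponsor linked to three areas that nobody else studies: no 3-path exists.
star = [("s1", "k1"), ("s1", "k2"), ("s1", "k3")]
```

Under this reading, a community made of a single sponsor and its private research areas (or of several sponsors attached to one research area) yields no 3-paths, so the function returns `None`, matching the notion of an invalid community.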

To accurately recommend/predict research areas interesting to a sponsor, we propose to use link prediction to find connections between a sponsor node s and a research area node k that currently do not exist. In Fig. 1a, the red dashed line with a question mark is a predicted link, which suggests that node s 2 is interested in node k 1 . Link prediction has been extensively studied, and many methods, such as similarity-based, supervised-learning-based, or collaborative-filtering-based approaches, have been used for it (Liben-Nowell and Kleinberg 2007). In the following, we first discuss existing collaborative-filtering-based link prediction, and then propose our community-based link prediction.

User-based collaborative filtering is generally performed to predict the votes of a user on a particular item by comparing the user to other users in a dataset who have a vote on the particular item (Breese et al. 1998). In this study, we predict the weight of the linkage between a sponsor and a research area. The highest predicted weight indicates the research area that is most interesting to the sponsor (e.g. the topic he/she may be interested in pursuing in the future). For the clinical trial bipartite network, we treat users as sponsor nodes (s) and items as research area nodes (k). Thus we predict P s,k , which indicates the weight value for sponsor s on research area k, as defined in Eq. (6). The highest value of P s,k over all k indicates the top predicted research area, and so on.

In Eq. (6), $D$ denotes a dataset used to determine sponsor $s$'s score, and $|D|$ is the number of sponsors in $D$. $\omega(s, i)$ denotes the similarity between two sponsors $s$ and $i$; $v_{i,k}$ denotes the weight value (vote) between sponsor $i$ and research area $k$, and $\kappa$ is a normalization parameter. $\bar{v}_i$ is the average weight of sponsor $i$, which is defined in Eq. (7) ($N_i$ denotes the set of research area nodes connecting to sponsor $s_i$) (Breese et al. 1998).

In summary, P s,k denotes sponsor node s's weight on research area k. P s,k is the average weight of sponsor s plus the similarity-weighted summation of all other sponsors' weights on research area k. The more similar two sponsor nodes are, the more similar their weights for research area k will be.

In this study, we used cosine similarity to measure the similarity between two sponsors a and b. Assume a and b are the vector representations of the sponsor of interest (a) and the sponsor to compare against (b), where a and b each denote an m-dimensional vector. The similarity between sponsors a and b is calculated as follows.
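The prediction rule and the similarity measure described above can be sketched together. This is a hedged illustration following the memory-based scheme of Breese et al. (1998): the predicted weight is the sponsor's average weight plus a normalized, similarity-weighted sum of other sponsors' deviations from their own averages. The data layout, the function names, and the choice to normalize by the sum of absolute similarities are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_weight(votes):
    """Average edge weight over the research areas a sponsor is connected to."""
    return sum(votes.values()) / len(votes) if votes else 0.0


def to_vector(votes, areas):
    """Keyword-based vector: edge weight per research area, 0 when there is no edge."""
    return [votes.get(a, 0) for a in areas]


def predict_weight(s, k, all_votes, areas):
    """Predicted weight of sponsor s on research area k.

    all_votes maps sponsor -> {research_area: weight}. Only sponsors that
    already have an edge to k contribute, as in user-based filtering.
    """
    numerator = 0.0
    norm = 0.0
    for i, votes_i in all_votes.items():
        if i == s or k not in votes_i:
            continue
        sim = cosine_similarity(to_vector(all_votes[s], areas), to_vector(votes_i, areas))
        numerator += sim * (votes_i[k] - mean_weight(votes_i))
        norm += abs(sim)
    kappa = 1.0 / norm if norm else 0.0   # assumed normalization
    return mean_weight(all_votes[s]) + kappa * numerator


# Toy weights (made up): s2 resembles s1 and also studies k3.
all_votes = {
    "s1": {"k1": 2, "k2": 1},
    "s2": {"k1": 2, "k2": 1, "k3": 3},
    "s3": {"k4": 1},
}
areas = ["k1", "k2", "k3", "k4"]
p_k3 = predict_weight("s1", "k3", all_votes, areas)
p_k4 = predict_weight("s1", "k4", all_votes, areas)
```

Here `s1` receives a higher predicted weight for `k3` than for `k4`, because the only sponsor linked to `k4` shares no research area with `s1` and therefore has zero similarity.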

In our study, we have observed that the sponsor-area relationship has strong community ties, where sponsors/investigators are very likely to be interested in research areas within the same community. This is mainly because biomedical research has strong domain requirements, where an investigator trained in one area is often specialized in only a limited number of relevant areas. Meanwhile, as interdisciplinary and cross-domain research continuously grows, more clinical trials involve teams of experts from multiple domains, which complicates the community structure in clinical trials.

Motivated by the above observations, we proposed a community link prediction method (CLP) to recommend links (Elkin et al. 2019), which includes three major components: (1) create a bipartite network from clinical trial reports; (2) detect communities from the bipartite network; and (3) apply user-based link prediction to each community to find links. The theme of CLP is to rely on the community to recommend clinical trial areas to each sponsor. For each sponsor $s$, CLP uses Eq. (6) to find the sponsor's potential interest in keyword $k$, by using all sponsors in the same community as $s$, with the similarity between two sponsors calculated using Eq. (8). For each sponsor $s$, its vector representation $\mathbf{s}$ is keyword-based using Eq. (9), where $e(s, k_i)$ denotes an edge connecting sponsor $s$ to keyword $k_i$, and $w_{s,k_i}$ denotes the weight of edge $e(s, k_i)$.

Limitation: Although effective, CLP suffers from two major limitations: (1) If a sponsor does not belong to a valid community, CLP cannot make recommendations to the sponsor, because there are no other sponsors in the same community with which to calculate similarity scores (using Eq. (8)). In our experiments, about 25% of sponsors are placed in invalid communities, so no recommendations can be found for them; and (2) if a sponsor has a very specific focus on some rare keywords not shared by many others, CLP may recommend inaccurately or fail (because of the sparsity).

To overcome the above two limitations, we propose a community-topic-based recommendation algorithm, which relies on communities and topics for recommendation. The employment of topics ensures that keywords with low/rare occurrences are connected to others through topics, so we can make accurate recommendations to sponsors with very specific research interests, as shown in Fig. 2.

Algorithm 1 (excerpt)
14: Q_W ← Maximizing modularity score of G using Eq. (1)
15: for each vertex x ∈ V do
16:     g_x ← Find its modularity-based label using Eq. (2)
17:     G ← G ∪ g_x
18: end for
19: until Convergence
20: C ← Find communities using modularity labels G
21: for each community c ∈ C do
22:     if GC_c or RC_c are valid using Eqs. (4) and (5) then
23:

The goal of topic detection is to group keywords together as topic groups. For example, "Penicillins", "Amoxicillin", and "Ceftriaxone" could be different keywords under a topic construct, "Antibiotics". Grouping keywords as topics has two major benefits: (1) it connects different keywords at the concept/semantic level; and (2) low-frequency keywords (rare research areas) can be linked to popular keywords, which tackles the sparsity challenge.

To detect topics, we use a hierarchical clustering method to combine keywords using their topological connections. To do so, we create a one-mode graph, $G_k$, consisting of keyword nodes only. Edges $E_k$ connect two keywords, $k_i$ and $k_j$, if both keywords co-occur in the same clinical trial report. Accordingly, a co-occurrence matrix $A_k$ is created as the weighted adjacency matrix of the one-mode keyword graph $G_k$. $A_k$ is a symmetric $m \times m$ matrix, where $m$ is the number of keywords. The entry $A_k(i, j)$ equals the number of clinical trials containing both keywords $k_i$ and $k_j$, and $A_k(i, j) = 0$ if no clinical trial contains both keywords. An example of a one-mode keyword network is shown in Fig. 1c.
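The co-occurrence matrix described above can be built directly from the keyword lists of the trials. The following is a minimal sketch with hypothetical names and toy data; it counts each pair of distinct keywords once per trial.

```python
from itertools import combinations


def cooccurrence_matrix(trial_keywords):
    """Return (keywords, A) where A[i][j] counts the trials containing both keywords.

    trial_keywords is a list of keyword lists, one per clinical trial.
    The diagonal is left at zero, since only pairs of distinct keywords are counted.
    """
    keywords = sorted({k for kws in trial_keywords for k in kws})
    index = {k: i for i, k in enumerate(keywords)}
    m = len(keywords)
    A = [[0] * m for _ in range(m)]
    for kws in trial_keywords:
        # Deduplicate within a trial so a pair is counted once per trial.
        for a, b in combinations(sorted(set(kws)), 2):
            A[index[a]][index[b]] += 1
            A[index[b]][index[a]] += 1   # keep the matrix symmetric
    return keywords, A


# Toy trials (made up for illustration).
toy_trials = [
    ["Mumps", "Measles", "Rubella"],
    ["Measles", "Rubella"],
    ["Tetanus"],
]
keywords, A = cooccurrence_matrix(toy_trials)
```

A keyword that appears alone in its only trial, such as "Tetanus" here, ends up with an all-zero row, which corresponds to an isolated node in the keyword graph.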

The Walktrap algorithm is applied to the keyword graph to group keyword nodes into topic groups (Pons and Latapy 2005). Walktrap determines structural similarities between nodes using random walks, which start at a randomly selected vertex and move randomly through the network by following edges. The distance between two vertices is determined from the random walks; two vertices within the same subgraph region will have a small distance. Similar nodes are merged together to form communities, and the merging process continues until all nodes are merged together.

To merge nodes into clusters, the keyword graph is first separated into $m$ clusters, each consisting of a single vertex. Each iteration merges two clusters into one cluster to create a new partition of the graph. The process completes when all nodes are joined into one cluster. This is an agglomerative hierarchical clustering algorithm. The iterations produce a sequence of partitions $P_1, \dots, P_i, \dots, P_m$, where $P_1$ is the partition of $m$ clusters each consisting of a single node and $P_m$ is the partition of one cluster consisting of $m$ nodes. Of these partitions, one has the best separation of clusters; this is determined by the modularity $Q(P)$ as defined in Eq. (10), where $e_C$ is the fraction of edges inside cluster $C$ and $a_C$ is the fraction of edges bound to cluster $C$. The partition with maximum $Q(P)$ gives the final node cluster structure (Pons and Latapy 2005).
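The selection criterion can be illustrated by scoring a candidate partition. The sketch below assumes the form $Q(P) = \sum_{C \in P} (e_C - a_C^2)$ from Pons and Latapy (2005), with $e_C$ taken as the fraction of edges inside $C$ and $a_C$ as the fraction of edge endpoints attached to $C$. It treats the graph as unweighted for simplicity, whereas the keyword graph in the text is weighted; the function name and toy graph are hypothetical.

```python
def partition_modularity(edges, cluster_of):
    """Q(P) = sum over clusters C of (e_C - a_C**2) for an unweighted graph.

    edges: list of (u, v) pairs; cluster_of: dict mapping node -> cluster id.
    e_C is the fraction of edges with both endpoints in C, and a_C is the
    fraction of edge endpoints that fall in C (assumed interpretation).
    """
    m = len(edges)
    inside = {}
    endpoints = {}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        endpoints[cu] = endpoints.get(cu, 0) + 1
        endpoints[cv] = endpoints.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(
        inside.get(c, 0) / m - (endpoints[c] / (2 * m)) ** 2
        for c in endpoints
    )


# Two triangles joined by a single bridge edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}
q_two = partition_modularity(toy_edges, two_clusters)
q_one = partition_modularity(toy_edges, one_cluster)
```

The partition that separates the two triangles scores higher than the single all-inclusive cluster, so it would be the one retained from the merge sequence in this toy case.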

This method can produce cluster groups that consist of only a single node, because a node is not merged into a cluster when doing so would decrease the partition's modularity score, $Q(P)$. In our research, we wanted to group keywords into topics to obtain more generalized groupings of keywords. Thus, if we had topic groups with only one node, those were ultimately merged into the cluster with the highest co-occurrence frequency. A topic cluster would consist of only a single keyword if that keyword did not appear in any other clinical trial.

To leverage the topics for clinical trial recommendation, we propose a community-topic-based link prediction (CTP) model, which combines topics and communities for recommendation, as shown in Algorithm 1.

Fig. 2 (caption fragment): ... dashed line with a question mark from sponsor $s_4$ to keyword $k_6$). In comparison, the community-topic approach finds communities and topics from sponsors and keywords, respectively. Although sponsor $s_4$ is in an invalid community, the existing linkage to topic 2 helps recommend a connection to $k_6$, which is within the same topic.


Topic detection: To leverage topics, we first create a topic matrix, $\mathbf{T} \in \mathbb{R}^{n \times \tau}$, with the $n$ sponsors as rows and the $\tau$ topic groups as columns. The value of $\mathbf{T}_{i,j}$ represents sponsor $s_i$'s interest in topic $j$, which is the total number of keywords in topic $j$ having an edge to sponsor $s_i$.

After the topic matrix is generated from the network, each row of the topic matrix is used as the vector representation of a sponsor, as defined in Eq. (12), with Eq. (8) being used to calculate the similarity between sponsors. So the similarity between two sponsors is based on common topics, instead of on common keywords as in CLP.
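A small sketch shows how topic-level vectors can relate sponsors that share no keyword. The input layout (a sponsor-to-keywords mapping and a keyword-to-topic assignment), the function name, and the toy data are hypothetical; each entry counts the keywords of a topic that are linked to the sponsor, as described above.

```python
def topic_matrix(sponsor_keywords, topic_of, n_topics):
    """Return (sponsors, T) with T[i][j] = number of sponsor i's keywords in topic j.

    sponsor_keywords: dict sponsor -> set of linked keywords.
    topic_of: dict keyword -> topic index in range(n_topics).
    """
    sponsors = sorted(sponsor_keywords)
    T = []
    for s in sponsors:
        row = [0] * n_topics
        for k in sponsor_keywords[s]:
            row[topic_of[k]] += 1
        T.append(row)
    return sponsors, T


# Toy data (made up): topic 0 groups three viral diseases, topic 1 holds Tetanus.
sponsor_keywords = {
    "s1": {"Mumps", "Measles"},
    "s2": {"Rubella", "Tetanus"},
}
topic_of = {"Mumps": 0, "Measles": 0, "Rubella": 0, "Tetanus": 1}
sponsors, T = topic_matrix(sponsor_keywords, topic_of, n_topics=2)
```

In this toy case the two sponsors have no keyword in common, so a keyword-based cosine similarity would be zero, while their topic rows both have mass on topic 0 and therefore yield a positive similarity.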

Community detection: Similarly to CLP, CTP also leverages the community to calculate a sponsor s's potential interest in keyword k by using all sponsors in the same community. In other words, a sponsor s is only compared to the sponsors within the same community to calculate their similarity using Eq. (8), which is calculated using vectors from the topic matrix. However, due to limitations of community detection algorithms and the uniqueness of network structures, not all sponsors will be placed in a valid community. Therefore, CTP combines topics and communities to make recommendations for all sponsors.

Community-topic combined recommendation: Due to high sparsity and specialization, invalid communities often consist of sponsors that are linked to keywords with low occurrences. With topic groups, these keywords are clustered in a way that can more accurately describe the similarity between two sponsors. The previous study determined that CLP has high recommendation accuracy, so if a sponsor is within a valid community, we still follow community-based link prediction. On the other hand, if a sponsor is within an invalid community (which consists of a single sponsor), we use topic-based global link prediction to recommend areas for the sponsor, as shown in Algorithm 1 (steps 28-39).
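The switch between the two modes can be sketched as a choice of comparison peers. This is an illustrative reading rather than the exact steps of Algorithm 1: it assumes that "global" prediction means comparing the sponsor against all other sponsors, and the function name, inputs, and toy assignments are hypothetical.

```python
def candidate_peers(sponsor, community_of, valid_communities):
    """Sponsors whose links are used to score recommendations for `sponsor`.

    community_of: dict sponsor -> community id.
    valid_communities: set of community ids that have a valid coefficient.
    """
    c = community_of[sponsor]
    if c in valid_communities:
        # Community-based prediction: compare only within the same community.
        return sorted(p for p in community_of if p != sponsor and community_of[p] == c)
    # Topic-based global prediction (assumed): compare against all other sponsors.
    return sorted(p for p in community_of if p != sponsor)


# Toy assignment: community 1 is valid, community 2 holds a single sponsor.
community_of = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
valid = {1}
peers_s1 = candidate_peers("s1", community_of, valid)
peers_s4 = candidate_peers("s4", community_of, valid)
```

In both branches the similarity itself would be computed on topic-matrix rows; only the pool of sponsors being compared changes.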

Algorithm 1 mainly includes two major components: (1) create the bipartite network and find the community label for each node (lines 3 to 27); and (2) recommend research areas to each sponsor (lines 28 to 41). Let n s denote the number of sponsor nodes and n k the number of research area nodes.

In order to find communities, Eq. (1) requires $O(n_s \cdot n_k)$ complexity to calculate the combinations between $n_s$ sponsor nodes and $n_k$ research area nodes. For each repetition, Eq. (2) needs to be calculated for all $n_s + n_k$ nodes in order to find their node labels. Therefore, the complexity of each repetition is $O(n_s \cdot n_k \cdot (n_s + n_k))$. The process repeats until it reaches convergence. Assuming it repeats $r$ times, the complexity is $O(r \cdot n_s \cdot n_k \cdot (n_s + n_k))$.

For recommendation, the loop from lines 28 to 39 is repeated for each sponsor and each research area, so the total complexity is $O(n_s \cdot n_k)$. The sorting on line 40 is based on all sponsor and keyword pairs, and its complexity is $O(n_s \cdot n_k \cdot \log(n_s \cdot n_k))$.

Because the log function $\log(n_s \cdot n_k)$ has a lower order of complexity than the linear function $n_s + n_k$, the complexity of the system is asymptotically bounded by $O(r \cdot n_s \cdot n_k \cdot (n_s + n_k))$, where $r$ is the number of repetitions until convergence.

The degree distributions of research area nodes and sponsor nodes are shown in Fig. 3a, b, respectively. Both follow a scale-free degree distribution with a long-tail phenomenon. This indicates that a majority of sponsors focus on a few research areas, and some research areas are studied by many different sponsors.

For research area nodes, the maximum deg(k) is 864. For sponsor nodes, the maximum deg(s) is 140. In total, there are 25 sponsors who all have deg(s) = 140. These sponsors are all connected to the same set of nodes, indicating they may have worked together on one or many clinical trials. Table 1 lists the top 20 k nodes by degree. The top k by degree (deg(k) = 864) is "Infection". The top 20 k nodes represent research areas in infectious disease research that receive a lot of attention from many sponsors. High on the top 20 list are "HIV Infections", "Acquired Immunodeficiency Syndrome" (AIDS), "Malaria", and "Tuberculosis". These represent the "big three" infectious diseases: malaria, tuberculosis (TB), and HIV/AIDS (Bourzac 2014). These three diseases combined accounted for 2.7 million deaths worldwide in 2018 (Prudêncio and Costa 2020). Hepatitis-related research areas ("Hepatitis", "Hepatitis A", "Hepatitis C") are also ranked high by degree. Hepatitis is responsible for 1.44 million deaths globally (Chen et al. 2015). The high ranking of these serious infectious diseases reflects serious research efforts to combat them.

Research areas such as "Anti-Bacterial Agents", "Vaccines", "Antibiotics", and "Antitubercular" all represent interventions that are ranked high because they are commonly used to treat/prevent infectious diseases. These research areas are broad and rank high because they are likely to be combined with disease research areas; for example, a sponsor may be researching vaccine development for HIV. While the research area nodes with large degree represent those with a lot of research attention, research area nodes with a smaller degree give information on research areas that are often overlooked. The majority of k nodes have deg(k) < 10, with median deg(k) = 3. Research areas with smaller degree often represent more uncommon/rare infectious disease research areas. However, in some cases, it has been shown that certain infectious diseases are disproportionately neglected. The Neglected Tropical Diseases (NTDs) represent a group of infectious diseases that are commonly found in low-income developing areas of the world. These diseases affect the poorest one-sixth of the world's population and have been neglected in research attention and funding. The more recent focus on the "big three" further reduced efforts towards these diseases (Feasey et al. 2010).

To compare and contrast research areas that receive differing amounts of research effort, Table 2 displays research area nodes of infectious diseases that are considered NTDs (CDC 2020; Feasey et al. 2010). These research areas represent infectious diseases that affect from 0.1 million (Trypanosomiasis) to 740 million people (Hookworm Infections) (Feasey et al. 2010). The number of years lost to disability and premature death, Disability-Adjusted Life-Years (DALYs), for NTDs is estimated at 56.6 million, compared to 84.5 million for HIV/AIDS, 46.5 million for Malaria, and 34.7 million for Tuberculosis (Hotez et al. 2007). As suggested by their classification as NTDs, the

Figure 4 shows a portion of the dendrogram resulting from the clustering process, with four topics (denoted by different colors). The first topic cluster, denoted in green, represents keywords all conceptually related to facial paralysis. The keywords Facial Paralysis, Paralysis, and Facial Nerve Diseases all describe facial paralysis. The first keyword, Bell Palsy, is a form of facial paralysis (de Almeida et al. 2014). The second topic group (colored in blue) represents keywords conceptually related to neuropathic pain. Postherpetic Neuralgia is neuropathic pain due to complications caused by Herpes Zoster Oticus virus, also known as shingles (Forbes et al. 2015). Pregabalin is a treatment for postherpetic neuralgia (Derry et al. 2019). The trigeminal nerve is responsible for facial sensation; trigeminal nerve injuries cause neuropathic facial pain (Edvinsson et al. 2020).

The third topic group, denoted by purple color, consists of serious bacterial or viral infections. Tetanus, Diphtheria and Pertussis (also known as whooping cough) are commonly vaccinated against together with the DTaP vaccine. Recently, a new vaccine (DTaP-IPV-Hib-HepB) was approved by the FDA to prevent Diphtheria, Tetanus, Pertussis, Polio, Haemophilus Influenzae type B and Hepatitis B (Oliver 2020).

The final topic group, denoted by pink color, represents three serious viral diseases. Mumps, Rubella and Measles are all caused by RNA viruses. Rubella is also known as German measles. Mumps and Measles have distinct symptoms, but the same vaccine, the measles-mumps-rubella (MMR) vaccine, immunizes against all three viral diseases (White et al. 2013).

Conceptually, the keywords in these topic groups represent a logical clustering. Keywords that are linked together at a lower height, such as Bell Palsy, Facial Paralysis and Paralysis, frequently appeared in the same clinical trials. Facial Nerve Diseases appeared with the first three keywords in only one clinical trial, so it is merged into the cluster at a greater height.

After applying the clustering method to the keyword graph, 169 topic groups are derived, with an average of 13 keywords per group. The largest topic includes 540 keywords, whereas the smallest contains 1 keyword (there are 14 single-keyword topic groups); this happens when a keyword appears only once in the set of clinical trials and no other keywords belong to that clinical trial. Figure 5 reports word clouds for two relatively large topic groups. The keywords in Figure 5a relate to an oncology construct with 236 keywords such as "Lymphoma", "T-Cell", "B-Cell", etc. The topic group also contains some treatments for cancer such as Hydrocortisone. Figure 5b shows a word cloud for topic 30, which consists of 36 keywords. These keywords represent an HIV treatment construct. Table 3 reports 10 selected small topic groups, the topic construct, and the respective keywords within the topic and their frequency within all clinical trials. The construct is a possible scientific construct for the keywords in the topic group, since the ultimate ground truth of the keyword groupings is unknown and left to interpretation. Since the keywords are grouped into topics based on co-occurrence, there are some cases where an odd keyword falls within a topic group. For example, all keywords in topic 17 are related to the urinary system. While "Stress" does not logically relate directly to the urinary system, it can play a role in overactive bladders (Lai et al. 2015). Since the keyword "Stress" did not appear in any other clinical trial report, it is ultimately clustered in topic 17.
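The role of co-occurrence in forming topic groups, including the single-keyword groups, can be illustrated with a small sketch. It counts how often keyword pairs share a trial and then groups keywords by connected components of the co-occurrence graph. Connected components are used here only as a simple stand-in for the clustering step; they are not the clustering method used in the study. The trial lists and the keyword "Lone Keyword" are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(trials):
    """Count how many trials each unordered keyword pair shares."""
    counts = Counter()
    for keywords in trials:
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


def connected_groups(trials):
    """Group keywords linked by any chain of co-occurrences."""
    neighbours = {}
    for keywords in trials:
        unique = set(keywords)
        for kw in unique:
            neighbours.setdefault(kw, set()).update(unique - {kw})
    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical keyword lists, one list per clinical trial.
trials = [
    ["Bell Palsy", "Facial Paralysis", "Paralysis"],
    ["Bell Palsy", "Facial Paralysis"],
    ["Paralysis", "Facial Nerve Diseases"],
    ["Lone Keyword"],
]

counts = cooccurrence_counts(trials)
groups = connected_groups(trials)
singletons = [g for g in groups if len(g) == 1]
```

A keyword that appears in a single trial with no companion keywords has no co-occurrence edges, so any co-occurrence-based grouping leaves it in a group of its own.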

Overall, the topic detection results show that topics are useful for finding groups of keywords that share similar or related semantic concepts. This is particularly beneficial for connecting sparse keywords to related groups, so that our method can recommend trials to sponsors with high research specificity.

Table 4 summarizes the detected infectious disease clinical trial communities. Overall, we found 478 communities ℂ, and 139 of them have valid GC c and RC c scores (these communities are listed as "Valid" in Table 4). In total, the valid communities contain 3,662 sponsor nodes (s) (75.38% of all sponsor nodes) and 1,304 research area nodes (k) (69.40% of all research area nodes), indicating that valid communities cover large portions of the network. For the valid communities, the global clustering coefficients GC c range from 0.4 to 1 with an average of 0.9814, and the reinforcement coefficients RC c range from 0.054 to 1 with an average of 0.728.

To show the structure of valid vs. invalid communities in the network, Fig. 6 demonstrates two separate communities. Figure 6a displays a valid community, ℂ 34 with 12 s nodes and 6 k nodes. Figure 6b displays an invalid community, ℂ 413 with 2 s nodes and 3 k nodes. Table 5 lists the research area nodes for the two communities.

The valid community, ℂ 34, as shown in Fig. 6a, has all closed 4-paths, thus GC 34 = 1. The reinforcement coefficient is lower, RC 34 = 0.4, due to the four k nodes that are connected to only one s node in the community. This indicates less localized clustering between sponsor nodes within ℂ 34. The invalid community, ℂ 413 (Fig. 6b), has two s nodes and three k nodes. Since one of the s nodes is connected to only one k node in the community, there are no 4-paths, thus GC 413 = NA. There exists a 3-path, but it is not closed, thus RC 413 = 0. Our analysis shows that a typical invalid community consists of only one or two sponsors from a single clinical trial.
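The reasoning for RC 413 = 0 can be checked with a small sketch. It assumes that the reinforcement coefficient is the fraction of 3-paths (s_a, k_a, s_b, k_b) that are closed by an edge between s_a and k_b, which is what the text above implies; the formal definition is given elsewhere in the paper and may differ in detail. The function name and the edge lists are hypothetical; the sparse example only mimics the described shape of ℂ 413 (two s nodes, three k nodes, one s node attached to a single k node).

```python
def reinforcement_coefficient(edges):
    """Fraction of 3-paths (s_a - k_a - s_b - k_b) closed by an edge s_a - k_b.

    Returns None when the network contains no 3-path.
    """
    areas_of = {}     # sponsor -> set of research areas
    sponsors_of = {}  # research area -> set of sponsors
    for s, k in edges:
        areas_of.setdefault(s, set()).add(k)
        sponsors_of.setdefault(k, set()).add(s)

    total = closed = 0
    # Enumerate every 3-path once, starting from its sponsor end.
    for s_a, first_areas in areas_of.items():
        for k_a in first_areas:
            for s_b in sponsors_of[k_a] - {s_a}:
                for k_b in areas_of[s_b] - {k_a}:
                    total += 1
                    if k_b in first_areas:  # edge s_a - k_b closes the path
                        closed += 1
    return closed / total if total else None


# Hypothetical sparse community: s2 touches only k1, so no 3-path is closed.
sparse = [("s1", "k1"), ("s1", "k2"), ("s1", "k3"), ("s2", "k1")]
# Hypothetical dense community: both sponsors share both research areas.
dense = [("s1", "k1"), ("s1", "k2"), ("s2", "k1"), ("s2", "k2")]

rc_sparse = reinforcement_coefficient(sparse)
rc_dense = reinforcement_coefficient(dense)
```

In the sparse example the two 3-paths start at s2 and end at k2 or k3, and neither is closed because s2 has no second research area.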

To validate the performance of the proposed clinical trial recommendation algorithm, we remove a small portion of connections from the networks to serve as benchmarks, and then compare how well different methods predict these "removed" links.

To create benchmark links for prediction, we generate the following three benchmark node sets, representing sponsor nodes with an increasing number of connections.

- [2, 6]: randomly select 100 sponsor nodes, where each selected sponsor has a minimum of 2 edges and a maximum of 6 edges. This set represents sponsors with a normal degree of connections (the majority of sponsors belong to this category, as shown in Figure 3b).
- (6, 10]: randomly select 100 sponsor nodes, where each selected sponsor has a minimum of 7 edges and a maximum of 10 edges. This set represents sponsors with a high degree of connections.
- (10, ∞): randomly select 100 sponsor nodes, where each selected sponsor has a minimum of 11 edges. This set represents sponsors with a very high degree of connections. The maximum degree of a sponsor node is 140, so up to 70 edges are removed within this node set.

Fig. 6 The structure of a valid community and an invalid community: a valid community, ℂ 34, consists of 18 nodes (|s| = 12, |k| = 6); and b invalid community, ℂ 413, consists of 5 nodes (|s| = 2, |k| = 3). The pink squares indicate sponsors and the blue circles indicate research areas

After creating the above three benchmark node sets, for each node in any of the selected sets, half of its edges are removed, and the removed edges are used as the benchmark edge set of the selected node set. After creating the subnetwork with removed edges, the corresponding topic matrix is created for recommendation. If a method predicts a research area that was previously removed, the prediction is accurate (i.e. the predicted result is the one that was removed). By doing so, we know the ground truth of the links and can therefore compare algorithm performance.
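The benchmark construction above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `make_benchmark`, its parameters, and the toy edge list are hypothetical, and "half of its edges" is assumed to round down (consistent with a degree-140 sponsor losing 70 edges).

```python
import random


def make_benchmark(edges, low, high, n_nodes, seed=0):
    """Pick sponsors with low <= degree <= high and hide half of their edges.

    Returns (kept_edges, removed_edges); removed_edges is the ground truth.
    """
    rng = random.Random(seed)
    areas_of = {}
    for s, k in edges:
        areas_of.setdefault(s, set()).add(k)

    eligible = sorted(s for s, ks in areas_of.items() if low <= len(ks) <= high)
    chosen = rng.sample(eligible, min(n_nodes, len(eligible)))

    removed = set()
    for s in chosen:
        ks = sorted(areas_of[s])
        # Assumption: remove floor(degree / 2) edges per selected sponsor.
        removed.update((s, k) for k in rng.sample(ks, len(ks) // 2))

    kept = [edge for edge in edges if edge not in removed]
    return kept, removed


# Hypothetical network: s1 has 4 edges, s2 has 1 edge, s3 has 8 edges.
edges = (
    [("s1", f"k{i}") for i in range(4)]
    + [("s2", "k0")]
    + [("s3", f"k{i}") for i in range(8)]
)

kept, removed = make_benchmark(edges, low=2, high=6, n_nodes=100)
```

For the open-ended node set, `high=float("inf")` can be passed so that only the lower bound applies.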

In the experiments, we employ the following baseline methods for comparisons:

- GLP and CLP: These two methods are from our previous study (Elkin et al. 2019), which use keywords and communities for recommendation.
- CTP: The proposed method, which combines communities and topics for recommendation.
- CTP t: A variant of the proposed CTP method, which removes the community detection module and only uses topics for recommendation.

The purpose of using CTP t is to carry out an ablation study: communities are removed in order to study their contribution to CTP's performance. In contrast, CLP only uses communities and does not use topics. Therefore, by comparing CLP vs. CTP t, we can understand whether topics play a more important role than communities for recommendation, or vice versa. It is worth noting that CLP relies on communities for recommendation, so it only works on sponsors within valid communities. GLP, CTP, and CTP t, on the other hand, work for all sponsors. Because some methods only work for valid communities, we carry out experiments comparing their performance on the Valid Community Network (which only consists of sponsors within valid communities) and on the whole network, respectively.

To compare the performance of GLP, CLP, CTP, and CTP t, three benchmark node sets are created on a subnetwork consisting only of nodes in valid communities. In each benchmark node set, 10 sponsor nodes are selected and half of their links are removed, with each method being used to predict links for recommendation. This is repeated 20 times, and the mean accuracy of the link prediction methods is reported in Figs. 7 and 8. Figure 7a shows the performance with respect to top-3 accuracy. This benchmark node set, [2, 6], is the only case in which CLP marginally outperforms CTP. GLP has the worst performance, and the addition of topic-based link prediction (CTP t) increases accuracy. Figure 7b shows the performance with respect to top-5 accuracy. As the number of removed edges increases from a maximum of 3 to a maximum of 5, overall accuracy decreases. CTP now outperforms CLP. Figure 8 shows the performance with benchmark node set (10, ∞). The maximum degree of a sponsor node is 140, thus sponsors in this node set had up to 70 edges removed. Generally, as the number of edges removed increases, the accuracy increases for all methods. CLP and CTP have similar performance, with CTP achieving higher accuracy, and CTP t achieves much better performance than GLP.

Fig. 7 Link prediction accuracy comparison on valid community networks, a using benchmark node set [2, 6], and b using benchmark node set (6, 10]. The x-axis denotes the top-k prediction, and the y-axis denotes the link prediction accuracy

The results can be summarized into two major findings: (1) the addition of topic-based link prediction significantly increases recommendation accuracy; and (2) the use of community-based link prediction increases accuracy.
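The evaluation loop behind these figures can be sketched as below. The exact accuracy formula is not stated in this section, so the sketch assumes that top-k accuracy is the fraction of a method's k highest-ranked research areas that are among the removed ones, averaged over repeated runs. The function names and the example predictions are hypothetical.

```python
def top_k_accuracy(ranked_predictions, removed_areas, k):
    """Share of the top-k predicted research areas that were actually removed."""
    top = ranked_predictions[:k]
    if not top:
        return 0.0
    hits = sum(1 for area in top if area in removed_areas)
    return hits / len(top)


def mean_accuracy(runs, k):
    """Average top-k accuracy over (ranked_predictions, removed_areas) pairs."""
    scores = [top_k_accuracy(pred, removed, k) for pred, removed in runs]
    return sum(scores) / len(scores)


# Hypothetical outputs of one method for two sponsors with removed links.
runs = [
    (["Malaria", "Vaccines", "Tuberculosis"], {"Malaria", "Tuberculosis"}),
    (["Hepatitis", "HIV Infections", "Vaccines"], {"Vaccines"}),
]

acc_top1 = mean_accuracy(runs, k=1)
acc_top3 = mean_accuracy(runs, k=3)
```

Running the same loop for each method on the same removed edges gives directly comparable curves over k.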

The whole network consists of all nodes, regardless of whether they are within a valid community. Because both topic-based and community-based link prediction increase accuracy, we validate our CTP method on the whole network. To do so, we create three benchmark node sets as previously described; in each node set, 20 sponsor nodes are selected and half of their edges are randomly removed. Then GLP, CTP, and CTP t are used to predict links. This is repeated 20 times. The average accuracy of link prediction for the three benchmark node sets is reported in Figs. 9 and 10. Figure 9a shows the accuracy with respect to top-3 prediction with benchmark node set [2, 6]. The addition of topic-based link prediction, CTP t, shows a major improvement compared to GLP. The combination of community-based and topic-based link prediction, CTP, shows the highest performance. Figure 9b demonstrates the accuracy with respect to top-5 prediction with benchmark node set (6, 10]. As with the valid community network, this benchmark node set has slightly decreased performance for all methods. GLP shows the lowest performance. CTP shows an increased advantage over CTP t. Figure 10 demonstrates the accuracy with benchmark node set (10, ∞) with respect to Top 1 to Top 70 prediction. This follows the trend in the valid community network, with accuracy increasing as the number of removed edges increases. Again, CTP has the highest performance and GLP has the lowest performance.

Fig. 8 Link prediction accuracy on the valid community network using benchmark node set (10, ∞). The x-axis denotes the top-k prediction, and the y-axis denotes the accuracy

Fig. 9 Link prediction accuracy comparison on valid community networks, a using benchmark node set [2, 6], and b using benchmark node set (6, 10]. The x-axis denotes the top-k prediction, and the y-axis denotes the link prediction accuracy

Overall, the results show that the addition of topic-based link prediction increases accuracy. A topic-based similarity metric provides a better basis for similarity comparison between two sponsors, increasing the accuracy of collaborative user filtering.

This study aims to characterize clinical trials using network analysis of sponsors and research areas. By modeling infectious disease clinical trials as a bipartite network between sponsors and research areas, we can differentiate infectious disease research areas receiving a lot of attention from those receiving little. The degree of a research area directly measures the number of sponsors studying the infectious disease. While more uncommon infectious diseases are expected to receive less research effort, some infectious disease research areas do receive disproportionate research effort. Our results show that NTD research areas have a considerably smaller degree than the "big three" (HIV/AIDS, Malaria, Tuberculosis) and Hepatitis. This is consistent with their classification as "Neglected". Similar analysis of research areas with a small degree can identify commonly overlooked infectious diseases.

Our previous research method demonstrated the high predictive power of community-based link prediction (Elkin et al. 2019). While CLP does improve performance, it cannot be used effectively on the whole network, because CLP requires that all nodes exist within a valid community. This study expands on our previous method by introducing topic-based modeling. Topics are found based on keywords, and the number of keywords in each topic is summarized per sponsor; this metric provides a better basis for similarity comparison between sponsors, increasing the accuracy of collaborative user filtering. Finding topics effectively groups together keywords that are not common within the network. If a sponsor only has connections to uncommon keywords, grouping them together can more accurately represent the similarity between two sponsors. This is demonstrated by the increased performance of CTP t compared to GLP within both the valid community network and the whole network. Topic-based similarity is therefore a more reliable similarity metric.
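The idea of counting keywords per topic for each sponsor and comparing sponsors on those counts can be sketched as follows. This is a minimal illustration, not the CTP implementation: cosine similarity and the similarity-weighted voting scheme are assumptions chosen for simplicity, and the function names, sponsors, and topic assignments are hypothetical.

```python
from math import sqrt


def topic_vector(areas, topic_of, n_topics):
    """Count how many of a sponsor's research areas fall in each topic."""
    vec = [0] * n_topics
    for area in areas:
        vec[topic_of[area]] += 1
    return vec


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(target, areas_of, topic_of, n_topics, k):
    """Rank unseen research areas by similarity-weighted votes of other sponsors."""
    target_vec = topic_vector(areas_of[target], topic_of, n_topics)
    scores = {}
    for other, areas in areas_of.items():
        if other == target:
            continue
        sim = cosine(target_vec, topic_vector(areas, topic_of, n_topics))
        for area in areas - areas_of[target]:
            scores[area] = scores.get(area, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


# Hypothetical topic assignment: topic 0 groups helminth infections, topic 1 HIV.
topic_of = {
    "Hookworm Infections": 0, "Ascariasis": 0, "Trichuriasis": 0,
    "HIV Infections": 1, "Anti-Retroviral Agents": 1,
}
areas_of = {
    "s1": {"Hookworm Infections"},
    "s2": {"Ascariasis", "Trichuriasis"},
    "s3": {"HIV Infections", "Anti-Retroviral Agents"},
}

top2 = recommend("s1", areas_of, topic_of, n_topics=2, k=2)
```

Sponsors s1 and s2 share no keyword, so a keyword-level comparison would treat them as unrelated; at the topic level they are identical in direction, which is what lets s2's research areas be recommended to s1.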

As shown with the invalid community ℂ 413, the network connections are sparse, indicating that using community structure for link prediction would not be accurate for invalid communities. Invalid communities often consist of only one or two s nodes, and with so few sponsors and such sparse connections, community-based link prediction may be unreliable. The keywords for ℂ 413, as listed in Table 5, are also found within topic 52. The grouping of keywords within a topic can provide more useful information about sponsors connected to these research areas than relying on community structure alone.

The performance of CTP within the valid community network is greater than that of CTP within the whole network. Since the performance of GLP is similar between the valid community network and the whole network, we can conclude that network size is not the determining factor. This demonstrates the high predictive power of link prediction within a valid community network. However, excluding nodes in this way is not always feasible, especially if the sponsor of interest belongs to an invalid community. The superior performance of CTP vs. CTP t demonstrates the power of using community information: if a node does belong to a valid community, using community-based link prediction will increase accuracy. Ultimately, CTP has increased performance because it utilizes both community and topic information.

For both the whole network and the valid community network, performance on benchmark node set (6, 10] is slightly lower than on benchmark node set [2, 6]. The majority of sponsor nodes (|s| = 4056) fall within benchmark node set [2, 6]. These sponsors may represent those who only have specialized or localized research interests. As deg(s) increases for nodes in benchmark node set (6, 10], the research areas become broader and link prediction accuracy is slightly reduced for all methods. In benchmark node set (10, ∞), accuracy increases as the number of links predicted increases. As deg(s) increases for a sponsor node, the likelihood increases that the node belongs to a highly localized dense community. For example, the 25 s nodes that have deg(s) = 140 all belong to a large dense community with 46 sponsors and 170 research areas. The dense connections effectively provide more information for each sponsor node and increase the link prediction accuracy, resulting in a gradual increase in accuracy for all methods. The difference between GLP and CTP is greatest at lower Top-k predictions. This demonstrates the ability of CTP to relay more of the necessary information regarding a sponsor node, which is needed when the connections are less dense.

Fig. 10 Link prediction accuracy on the whole network using benchmark node set (10, ∞). The x-axis denotes the top-k prediction, and the y-axis denotes the accuracy

Overall, these results suggest that link prediction is most beneficial for researchers in localized or specialized areas and for researchers with large degrees (i.e. many research areas shared by many other researchers). Meanwhile, link prediction accuracy declines for researchers who have a broader set of interests while maintaining a lower degree.

In our research, the topics are based on the keyword graph instead of node content. This means the original dataset itself is highly important for finding topics. For example, if more clinical trials contained the keyword "Stress", that would affect the keyword's placement in a topic group. Within the dataset used for this study, "Stress" was found in only one clinical trial, which determined its placement into a Urinary System topic group construct, as shown in Table 3. Using more information to enrich the networks can improve topic discovery and result in more accurate clinical trial recommendation.

In this study, we proposed to study relationships between investigators/sponsors and research areas in infectious disease clinical trials extracted from ClinicalTrials.gov, which is a valuable but underutilized data source. We used a bipartite graph to create infectious disease networks between sponsors and research areas, and studied the characteristics of the networks. The analysis of research area degree demonstrates the research effort given to individual infectious diseases. Our research shows that clinical trial research follows unique scale-free network characteristics: (1) researchers are highly specialized, with many of them working primarily on specific research areas, although a handful of researchers do work on many areas; (2) a small number of research areas are very commonly studied by many researchers, yet many research areas are studied by a small number of researchers. Overall, infectious disease research for the "big three" and Hepatitis receives large research effort and attention from sponsors, whereas infectious disease research for NTDs receives a smaller amount of sponsor attention.

For accurate clinical trial recommendation, we proposed to reduce sparsity in the data by extracting communities to group sponsors and using topics to model research areas. Combining communities and topics, we formed a link prediction task to recommend research areas for sponsors. Experiments and validations confirmed that, compared to our previous research, the proposed method is much more accurate in recommending links for infectious disease clinical trial research. The proposed method thus provides an accurate and reliable way to recommend clinical trial research areas to a sponsor.

Future research can focus on integrating additional relationships, such as drug keywords, into the network analysis, or on extending the proposed framework to other clinical trial areas, such as heart disease.

Improved community detection in weighted bipartite networks

Network analysis of clinical trials on depression: implications for comparative effectiveness research

Infectious disease: beating the big three

Empirical analysis of predictive algorithms for collaborative filtering

CDC (2020) Diseases - Neglected Tropical Diseases

Characteristics of clinical trials registered in ClinicalTrials.gov

From the big three to the big four

Pregabalin for neuropathic pain in adults

A method for detecting modules in quantitative bipartite networks

On behalf of the European Headache Federation School of Advanced Studies (EHF-SAS) (2020) The fifth cranial nerve in headaches

Predictive modeling of clinical trial terminations using feature engineering and embedding learning

Network analysis and recommendation for infectious disease clinical trial research

Neglected tropical diseases

A systematic review and meta-analysis of risk factors for postherpetic neuralgia

ClinicalTrials.gov: an underutilized source of research data about the design and conduct of commercial clinical trials

A link prediction approach for drug recommendation in disease-drug bipartite network

Control of neglected tropical diseases

Recommending MeSH terms for annotating biomedical articles

Topic discovery and future trend forecasting for texts

Correlation between psychological stress levels and the severity of overactive bladder symptoms

The link prediction problem for social networks

Scientific collaboration networks. I. Network construction and fundamental results

Licensure of a diphtheria and tetanus toxoids and acellular pertussis, inactivated poliovirus, haemophilus influenzae type b conjugate, and hepatitis B vaccine, and guidance for use in infants

Triadic closure in two-mode networks: Redefining the global and local clustering coefficients

Computing communities in large networks using random walks

Research funding after COVID-19

Drug repurposing: progress, challenges and recommendations

Report: Global clinical trials market is expected to reach 65.2B by 2025

Small worlds among interlocking directors: network structure and distance in bipartite graphs

Measles, mumps, and rubella

Long-term collaboration network based on clinicaltrials.gov database in the pharmaceutical industry

Trial reporting in clinicaltrials.gov-the final rule

Acknowledgements This research is sponsored by the US National Science Foundation through Grants IIS-2027339, IIS-1763452, CNS-1828181.