key: cord-0107652-92bzemor
authors: Benson, Austin R.; Veldt, Nate; Gleich, David F.
title: fauci-email: a json digest of Anthony Fauci's released emails
date: 2021-08-03
journal: nan
DOI: nan
sha: 63ce72e0541bac0c5e2873433606099c97cd8ea2
doc_id: 107652
cord_uid: 92bzemor

A collection of over 3000 pages of emails sent by Anthony Fauci and his staff was released in an effort to understand the United States government response to the COVID-19 pandemic. We describe how this email data was translated into a resource consisting of json files that make many future studies easy. Findings from our processed data include (i) successful organizational partitions using the simple mincut techniques in Zachary's karate club methodology, (ii) a natural example where the normalized cut and minimum conductance set are extremely different, and (iii) organizational groups identified by optimum modularity clusters that illustrate a working hierarchy. These example uses suggest the data will be useful for future research and pedagogical uses in terms of human and system behavioral interactions. We explain a number of ways to turn email information into a network, a hypergraph, a temporal sequence, and a tensor for subsequent analysis, as well as a few examples of such analysis.

Fauci's email release [Bettendorf and Leopold, 2021] includes approximately 1289 email threads with 2761 emails, including 101 duplicate emails among the threads. (These counts are exact only given the precise details of our data conversion strategy, including what is retained and excluded; see more below. Independent parsing and analysis may show slightly different counts.)

Each email thread begins with an email from Fauci and the thread is the chain of emails underlying his reply. There are 410 length 1 threads of outgoing mail only. Each email includes partially redacted text and is time-stamped, albeit from a mixture of time-zones that may not always be listed.

These raw data can be analyzed as a social network or graph, a temporal graph, a hypergraph, or a tensor. The most closely related existing dataset is the Enron email dump [Cohen, 2004]. For Fauci's email, we discuss a number of interesting findings in the data and provide them as an easy-to-use resource for continued exploration by others in the field. As was also the case for the Enron email dataset, there may be future releases of this data that correct errors. Our current parsing, for instance, has numerous OCR errors in the text pieces.

The graphs, networks, and hypergraphs that result from these data are small compared with the size of many modern datasets, yet they are not so small as to permit trivial analysis. This renders them a rich setting to investigate what can be ascertained from the data. Because the original emails are available, many of these findings are easy to assess in the documents themselves to understand where various graph features arise.

The goal of this manuscript is closer to that of a data manual than to an article that supports conclusions. We intend to highlight interesting findings in the data (sections 1.1-1.3) and demonstrate a variety of uses (section ). The processed datasets are available on github:

https://github.com/nveldt/fauci-email

· The main json digest derived from Bettendorf and Leopold [2021], which has senders and receivers of Fauci's email threads canonically labeled in an easy-to-process format (section ).
· Five graphs derived from the data (section . and table ) ranging from 46 to 869 vertices.
· A hypergraph derived from the emails themselves (section .) with 233 nodes and 254 hyperedges.
· A temporal sequence of adjacency matrices over 100 days from those 77 people where information can flow among all individuals in a temporally consistent sequence (section .).
· A tensor projection of the data designed to highlight the role of email carbon copy (CC) networks suitable for hypergraph centrality studies (section .) as well as a tensor representation of the data as sender, receiver, time, and word.

Summary of key people. We provide a briefly annotated list of key individuals to help contextualize some of our results.

anthony fauci Head of the National Institute of Allergy and Infectious Disease (NIAID), a group within the US National Institutes of Health (NIH).

patricia conrad Fauci's key special assistant and frequent email proxy.

jennifer routh Science communication editor in the NIAID division of the NIH.

greg folkers Anthony Fauci's chief of staff.

An early and well-known example of social network analysis was the study of a karate club by Zachary [1977]. A simple minimum cut analysis of this network predicted a future division of the club into two groups. We also found minimum cut analyses effective for weighted networks derived from the email exchanges. For instance, consider an undirected, weighted network based on senders and receivers of any email with fewer than 5 recipients, where edges and edge weights indicate the maximum number of emails sent along that edge or received along that edge (this is the tofrom-nofauci-nocc network in our detailed description, section .). We also remove Fauci from this network, both because Fauci is connected to almost everyone due to how the data were collected and because theories of structural holes in social networks suggest more meaningful analysis with Fauci removed [Burt, 1995]. Finally, we examine the minimum cut between Francis Collins (head of the NIH) and Patricia Conrad (Fauci's assistant). This cut roughly bisects the network into two pieces as shown in figure . There are 16 cut edges listed in the figure. This cut is largely preserved under multiple perturbations of the network structure (e.g., considering hypergraph projections, or including additional emails with more recipients weighted to scale edge importance with recipient list size). An interesting node in the cut list is Sheila Kaplan, a New York Times reporter. Her interactions with Collins and Conrad revolved around the New York Times' desire to interview Fauci for an article around March 16-18; this involved Kaplan discussing the issue with the NIH Office of Communication.
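The s-t minimum cut used above can be sketched in a few lines. This is a minimal illustration, assuming a plain Edmonds-Karp max-flow on a toy graph; the node names and weights are hypothetical, and this is not the authors' code or the actual tofrom-nofauci-nocc network.

```python
from collections import defaultdict, deque

def min_st_cut(edges, s, t):
    """Return (cut_value, source_side) for an undirected weighted graph.

    edges is an iterable of (u, v, weight). Max-flow is computed with
    Edmonds-Karp; the source side is the set reachable from s in the
    final residual graph.
    """
    cap = defaultdict(lambda: defaultdict(float))
    for u, v, w in edges:
        if u == v:
            continue  # self-loops never cross a cut
        cap[u][v] += w
        cap[v][u] += w  # undirected: capacity in both directions

    def bfs_parents():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0.0
    while True:
        parent = bfs_parents()
        if t not in parent:
            return total, set(parent)
        # Walk back from t to s to recover the augmenting path.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        total += push

# Toy example (hypothetical weights, not taken from the dataset).
toy_edges = [
    ("collins", "a", 3), ("collins", "b", 2), ("a", "b", 1),
    ("a", "conrad", 1), ("b", "c", 1), ("c", "conrad", 4),
]
cut_value, collins_side = min_st_cut(toy_edges, "collins", "conrad")
```

On real data one would load the weighted edge list from the repository and pass the canonical node names of Collins and Conrad as `s` and `t`.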

Indeed, handling media queries, scheduling, etc. reveals itself in another minimum cut. In the next example, we consider a weighted hypergraph projection (hypergraph-projection without CC as described below). Each email gives a single hyperedge among all senders and recipients, which is projected onto a clique. Edge weights are the number of emails on that edge. We further filter by removing nodes with only a single edge (ignoring weights). In contrast to the previous example, we leave Fauci in this network and compute a cut between Fauci and his assistant Patricia Conrad (figure ). This gives a set of 9 nodes involving media inquiries, including a Fox News anchor's repeated requests to interview Fauci.

Overall, minimum cut analysis is effective at finding meaningful partitions of this network and has the advantage of being a simple method.

FIGURE 1 - [...] The left layout is a force directed layout of the network whereas the right layout is a force directed layout designed to highlight groups in the optimal modularity partition of the network. Many of the cut edges are between nodes with high centrality values (routh, kadlec, billet are in the top 10 PageRank nodes on this graph).

Node labels recovered from the figure: goldner, shannah; hynds, joanna; figliola, mike; edwards, sara l; mcguffee, tyler ann; good-cohn, meredith; gathers, shirley; amerau, colin c; rom, colin; baier, bret; koerber, ashley; griffin, janelle; robinson, sae

FIGURE 2 - The minimum cut that separates Anthony Fauci from Patricia Conrad (Fauci's key assistant) in a hypergraph projected to a graph via clique expansion. Blue edges are cut in the solution and purple nodes are on the Conrad side whereas light red nodes are on the Fauci side. The left layout is a force directed layout of the network whereas the right layout is a force directed layout designed to highlight groups in the optimal modularity partition of the network. The nodes on the Conrad side of the cut largely deal with media inquiries and scheduling.

The graphs derived from the data are small enough to allow us to use combinatorial optimization techniques to solve classically hard problems optimally; that is, we need not use heuristics or approximations to study solutions. One surprising result here was a difference between the optimal normalized cut set and the optimal minimum conductance set. These measures are closely related and frequently interchanged when used in algorithms. This causes a perception that the sets identified by normalized cut optimization and conductance optimization should be similar. Here, we show a natural example where the result sets are extremely different (figure ). These use the simple, unweighted tofrom-nofauci graph from below with CC lists included and without Fauci. The optimal normalized cut set is a small group involved in setting up an interview for Fauci, another media interaction set. The optimal conductance set is a large group centered around Collins and other NIH groups. This example serves as a useful reminder that the precise details of the objective functions matter when applied to a specific dataset.

For the purposes of being precise, let G = (V, E) be an undirected, unweighted, and connected simple graph. The normalized cut, ncut, of a set S is

$$\mathrm{ncut}(S) = \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} + \frac{\mathrm{cut}(S)}{\mathrm{vol}(\bar{S})},$$

where cut(S) is the sum of weights of edges cut, vol(S) is the sum of weighted degrees of vertices in S, and $\bar{S} = V \setminus S$ is the complement set of vertices. In comparison, the conductance, φ, of a set S is

$$\phi(S) = \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}.$$

Although these two measures are different, we have

$$\phi(S) \le \mathrm{ncut}(S) \le 2\phi(S).$$

Both measures φ(S) and ncut(S) are NP-hard to optimize in general. Consequently, methods designed for approximating conductance often implicitly or explicitly solve problems for ncut instead of φ; the two measures differ by at most a factor of two, after all. Given the weighted graph loaded from data, we remove self-loop entries and edge weights to get the simple, undirected, unweighted network. (The result does not appear using the weights.) To find each solution set, we solve the combinatorial problem using Gurobi's mixed integer linear programming software. This terminates in a few seconds to minutes. The sets identified by the method are shown in figure ; this uses the modularity layout of the network. (See details in Appendix.) For comparison, we also show the s-t cut from Collins to Conrad, which identifies a group around Collins in this unweighted graph. (See section . for more discussion of s-t cuts and how we get a bigger partition in the weighted graph.)

FIGURE 3 - The optimal conductance and normalized cut sets from the undirected, unweighted tofrom-cc graph with 386 nodes and 588 undirected edges are extremely different. The minimum conductance set is about half the graph whereas the minimum normalized cut set is only 7 vertices. The graph layout is computed by emphasizing the groups in an optimal modularity partition of the network. We also show some other simple partitions of the network based on the s-t cut between Collins and Conrad, and also the spectral partition based on a sweepcut of the spectral partitioning eigenvector.
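To make the two objectives concrete, the following sketch evaluates conductance and normalized cut for vertex sets of a small unweighted graph and checks by brute force that ncut lies between φ and 2φ. The toy graph (two triangles joined by a bridge) is an assumption for illustration; the exact optimization in the text uses Gurobi, which this sketch does not attempt to reproduce.

```python
from itertools import combinations

def cut_and_volumes(edges, S):
    """Return (cut(S), vol(S), vol(complement)) for an unweighted simple graph."""
    S = set(S)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        for x in (u, v):
            if x in S:
                vol_in += 1
            else:
                vol_out += 1
        if (u in S) != (v in S):
            cut += 1
    return cut, vol_in, vol_out

def ncut(edges, S):
    cut, vol_in, vol_out = cut_and_volumes(edges, S)
    return cut / vol_in + cut / vol_out

def conductance(edges, S):
    cut, vol_in, vol_out = cut_and_volumes(edges, S)
    return cut / min(vol_in, vol_out)

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
nodes = sorted({x for e in edges for x in e})

S = {0, 1, 2}
phi_S = conductance(edges, S)   # 1 cut edge over volume 7
ncut_S = ncut(edges, S)         # 1/7 + 1/7

# Brute-force check of phi <= ncut <= 2*phi over all nonempty proper subsets.
subsets = [set(c) for k in range(1, len(nodes)) for c in combinations(nodes, k)]
sandwich_holds = all(
    conductance(edges, T) <= ncut(edges, T) + 1e-12
    and ncut(edges, T) <= 2 * conductance(edges, T) + 1e-12
    for T in subsets
)
```

The inequality only bounds the objective values; it says nothing about the minimizing sets being the same, which is the point of the example in the text.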

Since we are able to solve many of the combinatorial objectives on this network exactly, for the networks of senders and receivers excluding Fauci (tofrom-nofauci-nocc as a simple graph), we find that the optimal modularity clusters [Newman and Girvan, 2004] are characterized by nodes of high betweenness centrality [Freeman, 1977; Csardi and Nepusz, 2006] that identify functions and groups in the emails. See figure , where we label nodes with high betweenness centrality. Note the partitioning of agency heads (Collins, Redfield) and task coordinators (Birx, Farrar) as high betweenness nodes in distinct clusters. The clusters identified revolve around different agencies (NIH, CDC, WHO) or functional tasks (handling media requests, budgets), or involve email exchanges around a specific topic, for instance an editorial for the New England Journal of Medicine. Remember that Fauci is involved in almost all of the emails, so the interaction between Redfield, Collins, and Farrar is really modulated by Fauci as well, despite the appearance in this network otherwise.

Overall, this shows the power of this type of analysis to identify relevant structure in these networks with only a little information. In these networks, the optimal modularity partitions feature nodes with large betweenness centrality, showing another perspective on how this network appears to be constructed with local leaders as one might expect in a working hierarchy. See

FIGURE 4 - The optimal modularity partition of the network of senders and receivers alone (without Fauci) and reduced to a simple graph is indicated by the colored regions. There are 15 groups and the layout is designed to highlight the modularity groups (see Appendix). We show the 14 most central nodes by betweenness centrality scores in a large fontsize, which labels at least one vertex in all but 5 groups. The small fontsize labels on Abutaleb (rank 46), Awwad (rank 76), Beigel (rank 24), Cabezas (rank 28), and niaid news (rank 33) show key nodes in clusters that were not top 14 betweenness. Note that many of the agency heads and task leads are identified as key nodes in these networks (Collins, Redfield, Birx, Farrar).

awwad, david is the NIAID IT field manager [Wair, 2020].

niaid od am a mailing list that is frequently forwarded emails for discussion.

myles, renate is the deputy director for public affairs in the office of communication and public liaison.

lane, cliff a clinical director at NIAID.

billet, courtney was often CCed as a point of coordination for Fauci replying to reporters.

cabezas, miriam helped coordinate emergency budget requests for NIH.

Jason Leopold submitted a Freedom of Information Act request to obtain email surrounding the initial response of United States federal agencies, including the National Institutes of Health (NIH) and Centers for Disease Control (CDC), to the COVID-19 pandemic. The result was a 3234 page PDF document [Bettendorf and Leopold, 2021] consisting of emails that Anthony Fauci, the head of the National Institute of Allergy and Infectious Disease (NIAID), sent between approximately February 2020 and April 2020. Consequently, to be included in the data, the information must have been included in an email that Fauci sent. Many email clients include "reply data" in the email information; consequently, we are able to infer some amount of communication beyond what Fauci himself sent. For example, consider the email in figure . This shows a reply from Fauci to another group with a long CC list. This is in response to a previous email from the same group.

The PDF was converted to text and then formatted into a json digest. The final digest contains 2,761 emails among 1,303 individuals in 1,289 email threads.

The PDF was first converted to a text file with the pdftotext program. The text was then split into segments corresponding to email threads; the start of a thread was considered to be a from line with Fauci as sender that also began with a form feed character (indicating a new page of the pdf). The emails within a thread were found by from lines.

FIGURE 5 - The first page from the PDF file released as part of the Freedom of Information Act request regarding Fauci's email contains the entirety of Fauci's sent email including information (partially redacted) on the email Fauci was replying to. From this page, we are able to extract information on two emails: (i) an email from Fauci to Haskins with a CC to Selgrade, Crawford, and Conrad on 2020-03-06 and (ii) an email from Haskins to Fauci with a CC to Selgrade, Crawford, and Conrad on 2020-03-05. While we have the text of Fauci's email, the text of the original email is redacted.

The start of the emails contained clear delimiters for the sender, timestamp, recipient list, cc list, and subject (figure ). The body of the email was then taken to be all text after the subject and before the next email in the thread.
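The thread-splitting rule described above can be sketched as follows. This is a minimal illustration, not the authors' parser: it assumes sender lines literally start with `From:` and that pdftotext marks each new page with a form feed (`\f`); the sample text is invented.

```python
import re

# Assumed header shape: a line starting with "From:", optionally preceded by
# a form feed when the line opens a new PDF page.
FROM_LINE = re.compile(r"^\f?From:[ \t]*(.*)$", re.MULTILINE)

def split_threads(text):
    """Split converted text into threads; each thread is a list of email chunks."""
    threads = []
    matches = list(FROM_LINE.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunk = text[m.start():end].lstrip("\f")
        starts_page = m.start() == 0 or m.group(0).startswith("\f")
        if starts_page and "fauci" in m.group(1).lower():
            threads.append([chunk])        # new thread: Fauci at top of a page
        elif threads:
            threads[-1].append(chunk)      # earlier email quoted in the thread
    return threads

# Invented sample resembling the layout described in the text.
sample = (
    "From: Fauci, Anthony (NIH/NIAID) [E]\nSubject: A\nreply body\n"
    "From: Haskins, Melinda\nSubject: A\noriginal body\n"
    "\fFrom: Fauci, Anthony (NIH/NIAID) [E]\nSubject: B\nsingle email\n"
)
threads = split_threads(sample)
```

Real converted text contains OCR damage in the header keywords themselves, so a practical parser would need fuzzier matching than this literal pattern.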

Timestamps appeared in ten different formats that could be parsed by Python's datetime.strptime function. The main challenge was handling the numerous errors in the PDF to text conversion. For example, "Thursday" might appear as "Thu rsday" or the number 1 and letter l were often interchanged. Parsing the timestamp involved several general string substitutions and many manual rules for special cases. We successfully parsed timestamps for 86.5% of identified emails, and we omitted emails for which we could not parse a timestamp. The sender, recipient list, and cc list were handled similarly. For the recipient and cc lists, individuals were separated by the semicolon ';' (the cc list in figure  has two semicolons for the three individuals). Standardizing names involved both automation and considerable manual inspection. There were issues with text conversion; for instance, "fauci" was parsed into several textual variants, including "f auci," "f.aucl," "fa uci," "fa11ci," and "fauc i." Also, one individual could appear with multiple variants on their name or address. For example, the individual Cliff Lane appeared as "Lane, Cliff," "Cliff Lane," and "clane@niaid.nih.gov" in different emails. The standardization process was iterative. Given a tentative list of names, we used matching algorithms to find possible duplicates, and these were often checked by manually inspecting the PDF. Sometimes, emails were sent on behalf of someone else (e.g., Patricia Conrad on behalf of Anthony Fauci). We treated these as their own "names" rather than attributing to one of the parties. We omitted any emails where we could not identify a sender or at least one recipient, which occurred in 5.1% of the cases. The omissions were mostly caused by redactions or severe errors in the PDF to text conversion.
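The timestamp repair step can be illustrated with a small sketch. The two format strings and the repair rules below are assumptions chosen for illustration (the text reports ten formats and many manual rules, which are not listed here); only `datetime.strptime` itself comes from the text. Weekday and month names are parsed with the default English/C locale.

```python
import re
from datetime import datetime

# Hypothetical subset of formats; not the actual ten used for the dataset.
FORMATS = [
    "%A, %B %d, %Y %I:%M %p",   # e.g. Thursday, March 5, 2020 7:41 PM
    "%m/%d/%Y %I:%M %p",        # e.g. 3/6/2020 9:15 AM
]
WORDS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
         "Sunday", "January", "February", "March", "April", "May", "June",
         "July", "August", "September", "October", "November", "December"]

def repair(raw):
    """Apply general string substitutions for common conversion errors."""
    text = " ".join(raw.split())                     # normalize whitespace
    text = re.sub(r"(?<=\d)l|l(?=\d)", "1", text)    # letter l next to a digit
    for word in WORDS:
        # Allow one stray space between any two letters, e.g. "Thu rsday".
        pattern = r"\s?".join(word)
        text = re.sub(pattern, word, text, flags=re.IGNORECASE)
    return text

def parse_timestamp(raw):
    """Return a datetime, or None when no format matches (email is omitted)."""
    text = repair(raw)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

parsed = parse_timestamp("Thu rsday, March 5, 2020 7:4l PM")
unparsed = parse_timestamp("no timestamp here")
```

Returning `None` mirrors the choice in the text to omit emails whose timestamps could not be parsed rather than guess a date.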

We also identified federal organizations to which individuals belonged via designations in the email names (e.g., "NIH" appearing after all names in figure ). Organization affiliations were National Institutes of Health (NIH), Health and Human Services (HHS), Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), Office of the Secretary (OS), and the Executive Office of the President (EOP). Around 26.6% of individuals were identified as belonging to one of these organizations, and all of the memberships were manually verified.

The subsequent json files are suitable for many types of studies at the intersection of sociology and network science. We describe a few examples.

The data can be modeled in terms of a number of different networks that we describe here. Note that there are many other possible networks. For instance, although Fauci was removed from many of these networks, they all could have Fauci in them too.

repliedto-nofauci This is a weighted network that enumerates replied-to relationships. We have an edge from u to v if u replied to v's email and then weight the edge with the largest number of interactions in either direction. We remove Fauci from this view of the network to study the view without his emails. This network is an instance of a temporal motif network using a "replied-to" temporal motif [Paranjape et al., 2016] . We then remove everyone outside of the largest connected component.

tofrom-nofauci-nocc This is a weighted network that has an edge between the sender and recipients of an email (excluding the CC list), weighted by the largest number of interactions in either direction. In this network, we remove emails with more than 5 recipients to focus on work behavior instead of broadcast behavior. This omits, for instance, weekly emails that detail spending of newly allocated funds to address the pandemic that were often sent to around 20 individuals. We also remove everyone outside the largest connected component.

tofrom-nofauci This is the same network above, but expanded to include the CC lists in the number of recipients. The same limit of 5 recipients applies.
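The tofrom constructions above can be sketched as below. The record layout (`sender`, `recipients`, `cc` keys) and the node names are hypothetical; the actual json digest in the repository may be organized differently, so treat this as an outline of the rules rather than a loader for the real files.

```python
from collections import Counter

def tofrom_graph(emails, include_cc=False, max_recipients=5, drop=("fauci",)):
    """Undirected weights: the larger of the two directed email counts per pair."""
    directed = Counter()
    for e in emails:
        targets = list(e["recipients"]) + (list(e["cc"]) if include_cc else [])
        if len(targets) > max_recipients:
            continue  # skip broadcast-style emails
        for r in targets:
            if e["sender"] in drop or r in drop:
                continue
            directed[(e["sender"], r)] += 1
    weights = {}
    for (u, v), n in directed.items():
        key = tuple(sorted((u, v)))
        weights[key] = max(weights.get(key, 0), n)
    return weights

def largest_component(weights):
    """Node set of the largest connected component."""
    adj = {}
    for u, v in weights:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

# Invented records for illustration.
emails = [
    {"sender": "conrad", "recipients": ["collins"], "cc": []},
    {"sender": "conrad", "recipients": ["collins"], "cc": ["routh"]},
    {"sender": "collins", "recipients": ["conrad"], "cc": []},
    {"sender": "fauci", "recipients": ["conrad"], "cc": []},
    {"sender": "billet", "recipients": ["a", "b", "c", "d", "e", "f"], "cc": []},
    {"sender": "x", "recipients": ["y"], "cc": []},
]
g_nocc = tofrom_graph(emails)                  # tofrom-nofauci-nocc style
g_cc = tofrom_graph(emails, include_cc=True)   # tofrom-nofauci style
core = largest_component(g_cc)
```

The final graphs would keep only edges whose endpoints both lie in `core`.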

hypergraph-projection-nocc This is a weighted network that is a network projection of the email hypergraph where each email indicates a hyperedge among the sender and recipients. We then form the clique projection of the hypergraph, where each hyperedge induces a fully connected set of edges among all participants. The weight on an edge in the network is the number of hyperedges that share that edge. The graph is naturally undirected. Because this omits CC lists from each hyperedge, the graph can easily be disconnected if an email arrived via a CC edge. To focus the data analysis, we remove any individual who has only a single edge in the graph (with any weight).
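A clique projection of this kind can be sketched as follows. The hyperedges are invented, the single-edge filter is applied in one pass (whether the original processing iterates it is not stated, so that is an assumption), and the `distributed` option corresponds to weighting each clique edge by one over the number of pairs in the hyperedge.

```python
from collections import Counter
from itertools import combinations

def clique_projection(hyperedges, distributed=False):
    """Project each hyperedge onto a clique and accumulate edge weights."""
    weights = Counter()
    for he in hyperedges:
        nodes = sorted(set(he))
        k = len(nodes)
        if k < 2:
            continue
        # Standard: each hyperedge adds 1 per pair. Distributed: the clique
        # weights of one hyperedge sum to 1.
        w = 1.0 / (k * (k - 1) / 2) if distributed else 1
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += w
    return weights

def drop_single_edge_nodes(weights):
    """Remove nodes that touch only one edge (ignoring weights), in one pass."""
    degree = Counter()
    for u, v in weights:
        degree[u] += 1
        degree[v] += 1
    return {e: w for e, w in weights.items()
            if degree[e[0]] > 1 and degree[e[1]] > 1}

# Invented hyperedges: each tuple is the sender plus recipients of one email.
hyperedges = [("a", "b", "c"), ("a", "b"), ("c", "d")]
proj = clique_projection(hyperedges)
filtered = drop_single_edge_nodes(proj)
proj_distributed = clique_projection(hyperedges, distributed=True)
```

With distributed weights the total edge weight equals the number of hyperedges with at least two nodes, which keeps large emails from dominating the projected graph.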

hypergraph-projection This version of the network adds CCed recipients to the hyperedge for each email. This remains disconnected largely due to email lists and BCC-events in the data (see figure  for an instance of a list on page 128, and page 1508 in the PDF [Bettendorf and Leopold, 2021] for an instance of a BCC) even though Fauci remains in this data. Other disconnections are due to parsing errors. There are 35 nodes that are removed due to disconnection.

FIGURE 6 - An example email chain that produces a disconnected component. In this case, a mailing list "posted products" generated an email to multiple people, which was forwarded to Fauci. But Fauci is disconnected from the original email. This could be addressed by adding links based on the threading, although we did not pursue this avenue in our analysis.

These are all weighted networks. Consequently, we analyze them as both simple networks (with edge weights and self-loops removed) and the weighted networks depending on the type of analysis. Basic statistics of the networks are given in table .

PageRank and Degree centrality scores As an example use case, we can study how centrality changes with graph construction. PageRank and degree centrality are two heavily studied centrality measures for graphs. For undirected graphs, such as those we are studying, it is often the case that the two are highly related. For the 5 graphs we construct, after removing Fauci from each graph, we find considerable differences in a simple analysis of the rankings. See tables 3 and 4 for the top 10 rankings by PageRank and degree centrality in each of these 5 weighted, undirected graphs. There are also considerable differences between graphs, showing how each construction highlights different features of the network.

TABLE 2 - The 5 canonical graphs we derive from the email data along with some simple statistics. Each graph is connected, and there is a simple version without weights and self-loops along with a weighted version that has integer edge weights along with possible self-loops. The number of edges is the count of undirected edges, so there are twice this many nonzeros in the adjacency matrix of the simple graph. The weighted graph also has loops, which gives twice this many nonzeros plus the number of loops in the adjacency matrix. We also show the total volume (sum of weighted degrees) of the weighted graph along with max, median (med), and mean statistics on the degrees of the simple (deg) and weighted graphs (wdeg). Finally, we show the value of λ_2 associated with the normalized Laplacian matrix. The graph names with nofauci do not include Fauci's node and those with nocc omit the CC lists from the construction, whereas those without this treat CC lists equivalently with other recipients.

TABLE 3 - PageRank centrality rankings (with α = 0.85) in 5 different weighted graphs derived from the data. All the graphs are undirected, and Anthony Fauci has been removed from all of these graphs, rendering some of them disconnected. The values prefixing each name are the ranks in alternative orderings. The order of these is the same as the order of tables and the ordered list is shown in light gray to emphasize differences in other lists. For instance, antoniak is ranked 9 in the repliedto graph but ranked 444 in the hypergraph projection with CC and 83 in the tofrom with CCed nodes. Other individuals of note with large changes in rank include stover, redfield, shapiro, farrar, goldner, marston, folkers.

TABLE 4 - Degree centrality rankings in 5 different weighted graphs derived from the data. All the graphs are undirected, and Anthony Fauci has been removed from all of these graphs, rendering some of them disconnected. The values prefixing each name are the ranks in alternative orderings. The order of these is the same as the order of tables and the ordered list is shown in light gray to emphasize differences in other lists. Note, for instance, that awwad doesn't appear in any of the top 10 PageRank lists.

Other uses These graphs are used in the leading examples above in Section 1.
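The PageRank and degree comparison can be sketched with a plain power iteration. This is a minimal illustration assuming a connected, weighted, undirected graph given as a dictionary of edge weights and uniform teleportation with α = 0.85; the toy graph is invented, and the rankings in the tables were not produced by this code.

```python
from collections import defaultdict

def pagerank_and_degree(weights, alpha=0.85, iters=200):
    """Return (pagerank, weighted_degree) dictionaries for an undirected graph."""
    adj = defaultdict(dict)
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w
    wdeg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        nxt = {u: (1.0 - alpha) / n for u in adj}   # uniform teleportation
        for u, nbrs in adj.items():
            for v, w in nbrs.items():
                # u passes its score to neighbors in proportion to edge weight
                nxt[v] += alpha * rank[u] * w / wdeg[u]
        rank = nxt
    return rank, wdeg

# Invented star graph: hub "h" with three leaves.
toy = {("h", "a"): 1, ("h", "b"): 1, ("h", "c"): 1}
rank, wdeg = pagerank_and_degree(toy)
by_pagerank = sorted(rank, key=rank.get, reverse=True)
by_degree = sorted(wdeg, key=wdeg.get, reverse=True)
```

Comparing `by_pagerank` with `by_degree` for each of the five graphs gives the kind of rank differences summarized in the tables.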

The hypergraph-projection data is one example of a hypergraph analysis (as a projected graph). We now consider the email data as a hypergraph where each email is a hyperedge among the senders and recipients (excluding the CC entries), excluding Fauci. We remove any individual that does not have at least degree 5 in a clique expansion of the resulting graph. The largest connected component of the resulting hypergraph has 233 vertices and 254 hyperedges.

Differences between local diffusions A local diffusion in a graph or hypergraph answers the question: what else might be related to a given node in a graph or hypergraph? It's an instance of a relationship-by-transitivity-of-relationships study. Local diffusion analysis on hypergraphs has been a recently active area.

Here, we show how three closely related ideas around PageRank-like diffusions produce strikingly different results on this hypergraph, which indicates it's a useful tool for followup work on comparisons among the implications of these ideas. PageRank-like diffusions are quadratic or smoothed variations on cut problems for graphs and hypergraphs [Liu et al., 2021]. They can be seeded on a single node to generate a ranked list of other nodes based on relationship strength. We do this for a sparse PageRank diffusion on a graph projection of the hypergraph, a direct sparse PageRank diffusion on the hypergraph, and an unregularized PageRank diffusion on the hypergraph. (Sparse PageRank diffusions include extra regularization terms to encourage sparse solutions of the PageRank diffusion equations.) The difference in results is shown in table . There are far more differences than one would expect between these solutions. This indicates an area of further study. It is possible that simple parameter changes or other tools will show that these are more similar than is apparent from this simple experiment.

Hypergraph cuts compared with graph cuts Hypergraph cuts can be far more interesting than simple graph cuts [Veldt et al., 2020a]. Here, we show how hypergraph cuts in these data are more stable. We consider the same hypergraph, but where large hyperedges are removed via a max hyperedge size filter. We see a large difference in the graph cut in the clique projected hypergraph, but relatively little difference in the hypergraph st cut between Francis Collins and Patricia Conrad (figure ). This is true even for multiple ways of weighting a hyperedge in the clique projection.

[Figure 7 panels: columns are maximum hyperedge size 10, 15, and no limit; rows are standard clique weights and distributed clique weights.]

FIGURE 7 - These figures show that the hypergraph cuts are far more stable with respect to including large hyperedges compared with the graph cuts. The light blue nodes are in both the graph and hypergraph cut between Collins and Conrad (on the Collins side). The sole light red node is in the hypergraph cut but not in the graph cut. The green nodes are in the graph cut but not the hypergraph cut. (Black nodes are on the Conrad side of the cut.) In the top row, the graph cuts are formed by projecting each hyperedge to a clique and then solving an st cut problem in the graph. If instead the graph is formed by projecting each hyperedge to a clique and weighting each edge by 1/hyperedgesize-choose-2 (so the sum of weights in the clique is 1) then we arrive at similar results with the figures in the bottom row. Edge sizes show the various weights in the graph. Anecdotally, we note that Robert Redfield, the head of the CDC, is strongly associated with the large hyperedges that cause the graph cut to change.

Hypergraph cut flexibility As mentioned, hypergraph cut functions can be far more flexible than simple graph cut functions. One of the cut functions proposed by Veldt et al. [2020b] was the δ-linear penalty, which interpolates between the all-or-nothing hyperedge cut and the star-expansion hyperedge cut function. In Figure 8, we show nodes that switch sides while varying δ in this cut function in the hypergraph. This shows non-monotonic behavior.
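As a small illustration of how such a penalty interpolates, the sketch below evaluates a hypergraph cut under the thresholded form min(|A|, |e \ A|, δ), where A is the part of hyperedge e on one side. That formula is our reading of the δ-linear penalty and should be checked against Veldt et al. [2020b]; the code only evaluates the cut value of a given set and does not solve the minimum cut problem.

```python
def delta_linear_penalty(hyperedge, S, delta):
    """Penalty for splitting one hyperedge by the node set S (assumed form)."""
    members = set(hyperedge)
    inside = len(members & set(S))
    outside = len(members) - inside
    return min(inside, outside, delta)

def hypergraph_cut(hyperedges, S, delta):
    """Total cut value of S as the sum of per-hyperedge penalties."""
    return sum(delta_linear_penalty(e, S, delta) for e in hyperedges)

# Invented example: one five-node email split 2 versus 3, one uncut email.
hyperedges = [("a", "b", "c", "d", "e"), ("a", "b")]
S = {"a", "b"}
all_or_nothing = hypergraph_cut(hyperedges, S, delta=1)   # any split costs 1
intermediate = hypergraph_cut(hyperedges, S, delta=2)
star_like = hypergraph_cut(hyperedges, S, delta=10)       # min(|A|, |e \ A|)
```

With δ = 1 every split hyperedge costs the same, and for large δ the cost grows with the size of the smaller side, which is why nodes can change sides as δ varies.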

We use the tools and codes from [Veldt et al., 2020a,b; Liu et al., 2021] for these computations.

We processed the data into a set of directed edges for emails that were sent on the same day, restricted to the largest temporal strong component. This gave a sequence of 100 adjacency matrices, one for each day from February 1 2020 to May 5 2020, with a few other preliminary days (e.g. a September 4, 2018 email from Folkers to Fauci on CDC guidelines on aerosol protections for influenza and coronaviruses, Page 429). The first analysis we did was a temporal communicability analysis [Grindrod et al., 2011]. This analysis scores each node based on a weighted average of the length of email chains they start (broadcast-centrality) or receive (receive-centrality). The results are in table .

The second analysis was a temporal community analysis [Mucha et al., 2010]. This analysis assigns a community or group to each node at each time-point to reflect how the groups change over time. Formally, this is a modularity-like analysis on a temporally-linked graph; this allows the analysis to violate a strict arrow of time and foreshadow the future. The communities this analysis identifies show how the emails sent respond to various external events (figure ), although there are a few groups (e.g. the lime green group around April 20th, 2020) that are harder to resolve.

FIGURE 8 - As an example of the flexibility of hypergraph cuts, this figure shows nodes that change sides as δ is varied in a hypergraph cut between Francis Collins and Patricia Conrad. Dark red indicates the node is on the Collins side of the cut and light red is on the Conrad side. Note that the behavior is not monotonic and nodes can move back across the cut as δ increases.

TABLE - Temporal communicability scores [Grindrod et al., 2011] with parameter 0.02 show Fauci and Conrad as the top broadcast and receiver nodes, respectively. The light fontcolor indicates the rank in the sorted list and the dark fontcolor indicates the rank in the other list. The value after the name is the centrality score itself.

We also created a force directed animation of this dataset to illustrate the temporal modularity groups. This animation is available from our github repository https://github.com/nveldt/fauci-email/blob/master/figures/anim-mod.mp4.

Methods for temporal strong components The largest temporal strong component can be computed by building a reachability network among temporal paths and then finding the largest clique in the reachability network [Bhadra and Ferreira, 2003; Nicosia et al., 2012]. We did this and used the pmc software [Rossi et al., 2013] to find the largest clique. This gave a set of 77 nodes. Although finding the largest clique is NP-hard in general, in this case the largest clique has the same size as the largest network core, which means it is easy to validate. Consequently, this clique can be validated by finding the largest network core and then using a greedy heuristic clique finder inside that core to find the set of 77 vertices.

FIGURE 9 - A plot of the communities in a temporal modularity analysis of the network; the figure should be viewed zoomed in and studied for best effect. There are 7 groups, indicated by colors. Nodes are sorted by the number of distinct communities they are a part of, so the first few nodes switch between communities through the time-course of the emails. Community assignments are hidden until the node sends their first email and the small circles indicate days the individuals sent email along with 7 days after their last email. A few key dates are listed at top. The "Vice Pres" event is when Vice President Pence was appointed head of the Coronavirus Task Force; the first death of an American with COVID-19 was on Feb 28; there was a supplemental funding package passed on March 6, 2020; and there was a national emergency declaration on March 13, 2020. Fauci's node is highlighted in the middle.

Methods for temporal communicability. Let $A_1, \ldots, A_T$ be the sequence of adjacency matrices. Then the broadcast and receive temporal communicability scores are the row and column sums of the matrix $Q = \prod_{t=1}^{T} (I - \alpha A_t)^{-1}$, respectively. The matrices involved were all small (77 nodes) and we computed this by direct inversion of the matrix; this is in violation of the pedagogical dogma of numerical linear algebra classes and would have failed the final author in Gene Golub's numerical analysis class. 4

4 The use of inv was because the prod function in Julia cannot work with a factorization object directly for successive inverses. That same author will investigate strategies in this area as this is the second time this issue has arisen in the past few months.
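As a small illustration of the communicability product, the following pure-Python sketch forms Q as a time-ordered product of the inverses (I − αA_t)^{-1} and returns its row and column sums. It is not the Julia code in the repository; the helper names are hypothetical, the convention that A[i][j] = 1 means "i emailed j" is an assumption, and α is assumed small enough that each I − αA_t is invertible.

```python
def invert(M):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices)."""
    n = len(M)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            f = aug[r][col]
            if r != col and f != 0.0:
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def temporal_communicability(slices, alpha):
    """slices: list of n-by-n adjacency matrices (lists of lists), in time order."""
    n = len(slices[0])
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for A in slices:
        M = [[(1.0 if i == j else 0.0) - alpha * A[i][j] for j in range(n)]
             for i in range(n)]
        Q = matmul(Q, invert(M))      # time order matters: earlier slices on the left
    broadcast = [sum(row) for row in Q]        # row sums
    receive = [sum(col) for col in zip(*Q)]    # column sums
    return Q, broadcast, receive
```

With a contact 0→1 in the first slice and 1→2 in the second, Q picks up an entry of size α² at position (0, 2); reversing the slice order removes it, which is the sense in which the score respects time.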

Methods for temporal modularity. To compute temporal modularity, we used the Louvain algorithm [Blondel et al., 2008] directly on the slice-expanded modularity matrix (see reproduction details below). The modularity matrix slices were coupled with parameter 0.5, which Mucha et al. [2010] indicate is a reasonable default. We only briefly investigated sensitivity to this parameter, which can be tuned for different effects; we plan to explore that in the future.
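To make "slice-expanded modularity matrix" concrete, here is a minimal sketch in the spirit of the multislice formulation of Mucha et al. [2010]: each time slice contributes a diagonal block A_t − γ k k^T / (2 m_t), and each node is coupled to its own copy in the adjacent slice with weight ω. The function name, the restriction to symmetric (undirected) slices, the resolution γ = 1, and the choice to couple only neighboring slices are assumptions for illustration, not a description of the repository code.

```python
def slice_expanded_modularity(slices, omega=0.5, gamma=1.0):
    """Build the (n*T)-by-(n*T) multislice modularity matrix as lists of lists.

    slices: list of T symmetric n-by-n adjacency matrices.
    Node i in slice t has index t*n + i in the expanded matrix.
    """
    T, n = len(slices), len(slices[0])
    B = [[0.0] * (n * T) for _ in range(n * T)]
    for t, A in enumerate(slices):
        degree = [sum(row) for row in A]
        two_m = sum(degree)
        for i in range(n):
            for j in range(n):
                # Null-model term; skipped for empty slices to avoid dividing by zero.
                null = gamma * degree[i] * degree[j] / two_m if two_m else 0.0
                B[t * n + i][t * n + j] = A[i][j] - null
    # Assumed coupling: each node is tied to its copy in the next slice only.
    for t in range(T - 1):
        for i in range(n):
            B[t * n + i][(t + 1) * n + i] += omega
            B[(t + 1) * n + i][t * n + i] += omega
    return B
```

A modularity-maximizing heuristic such as Louvain can then be run on this matrix, treating each (node, slice) pair as a separate vertex.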

Here, we explore some higher-order structure in the emails through sender-receiver-CC interactions. We first found a maximal set of nodes where everyone participates in the sender, receiver, and CC roles with the other nodes in the set. Specifically, we examine all emails containing at least one recipient and at least one CC and find the set of discard nodes D corresponding to people who do not appear at least once as a sender, as a receiver, and as a CC in these emails. We then discard emails where a node in D is the sender, and omit nodes in D from the recipient and CC lists of the other emails. This process is repeated until there are no nodes in the discard set. In the end, there remained a set S of 44 nodes and 1,413 emails with a sender, at least one recipient, and at least one CC from S.
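The iterative pruning just described can be sketched as follows. The function name and the (sender, recipients, ccs) tuple format are hypothetical, and dropping an email once its recipient or CC list becomes empty is an assumption drawn from the requirement that surviving emails keep at least one recipient and one CC.

```python
def prune_to_full_participants(emails):
    """Iteratively keep only people who appear as sender, recipient, and CC.

    emails: iterable of (sender, recipients, ccs).
    Returns (kept_people, kept_emails).
    """
    current = [(s, set(r), set(c)) for s, r, c in emails]
    current = [e for e in current if e[1] and e[2]]
    while True:
        senders = {s for s, _, _ in current}
        recipients = set().union(*(r for _, r, _ in current))
        ccs = set().union(*(c for _, _, c in current))
        everyone = senders | recipients | ccs
        discard = everyone - (senders & recipients & ccs)
        if not discard:
            return everyone, current
        pruned = []
        for s, r, c in current:
            if s in discard:
                continue                      # drop emails sent by discarded people
            r, c = r - discard, c - discard   # strip discarded people from the lists
            if r and c:                       # assumption: need a recipient and a CC
                pruned.append((s, r, c))
        current = pruned
```

Each pass removes at least one person, so the loop stops after at most as many passes as there are people.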

We next constructed a 44 × 44 × 44 (non-symmetric) tensor T representing the email relationships of the nodes in S. Let $s_i$ represent the sender of the ith email and $r_i$ and $c_i$ the subsets of S who are recipients and CCs. Then the tensor entries measure the total email volume among the nodes, scaled by the number of email participants:

where I(·) is the indicator function. Finally, we computed the hypergraph H-eigenvector centrality scores [Benson, 2019] for T, given by a positive unit-1-norm (unit-sum) vector x such that

for all indices u and some scalar λ > 0. Since the first index of T corresponds to CC, the centrality scores measure how central each node is with respect to participation in that role (x would be the same if we permuted the second and third indices, so only the first index determines the interpretation of the centrality). Table  reports the top-10 nodes in terms of this centrality measure. Fauci is ranked only ninth, even though the entire dataset is constructed from his emails, because he appears in the CC position relatively rarely (Fauci is ranked first when the first index of the tensor corresponds to the sender or recipient role). Conrad is ranked first, which agrees with her central role in many graphs constructed from this dataset (tables 3 and 4). Folkers, Fauci's Chief of Staff, is ranked second.
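For readers who want to experiment with this kind of centrality, the sketch below runs a simple fixed-point iteration for a 3-way tensor stored as nested lists. It assumes the standard H-eigenpair condition, namely that the sum over v and w of T[u][v][w] x_v x_w equals λ x_u² for every u, and it assumes strictly positive tensor entries so that the iteration is well behaved. The function name is hypothetical, and this is not necessarily the solver used in the notebook; a tensor with many zeros may need extra care.

```python
import math


def h_eigenvector_centrality(T, max_iters=1000, tol=1e-12):
    """Fixed-point iteration x <- sqrt(T x x), renormalized to unit sum.

    T: n-by-n-by-n nested lists with positive entries (assumption).
    Returns (x, lam) with sum(x) == 1 and sum_{v,w} T[u][v][w] x_v x_w ~ lam * x_u**2.
    """
    n = len(T)

    def apply(x):
        return [sum(T[u][v][w] * x[v] * x[w]
                    for v in range(n) for w in range(n))
                for u in range(n)]

    x = [1.0 / n] * n
    for _ in range(max_iters):
        z = [math.sqrt(val) for val in apply(x)]
        total = sum(z)
        new_x = [val / total for val in z]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    y = apply(x)
    lam = sum(y) / sum(val * val for val in x)
    return x, lam
```

The index that is left free in the contraction (here the first) is the one whose role the scores describe, matching the remark above about the first index of T.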

Tensor text analysis We also release a tensor (fauci-email-tensor-words.json) that mirrors many analyses of the Enron email data [Cohen, 2004] where we examine interactions among sender, receivers, time, and words. This gives a 77 × 77 × 100 × 212 tensor of the most common words. However, we were unable to identify any useful processing of this tensor. Standard factorization analysis would often focus on individual hyperedges as the relevant factors. We leave this as a challenge for others.

Please remember that this is not all of Fauci's email from the relevant timeframe. We may update this document if we obtain more explicit documentation on what was included or excluded in the released dataset.

The processing of this data was automated. While we attempt to describe the major scenarios and edge-cases above and discuss how we handled them, please be aware that the information may be inaccurate. In terms of sociological findings for which they may be appropriate, these data should be used with care to understand nuances regarding the exact data collection and ingestion.

It is very likely that additional relevant correspondence took place over the phone and text messages that are not included in the data.

Note also that the text fields of our released data have many errors. This renders text analysis problematic and we leave text analysis to future studies.

Although this data is superficially similar to the frequently analyzed Enron data tensor, there are some critical differences. First, much of the email information was redacted. Second, we only have Fauci's view of the email rather than raw inbox dumps from multiple executives.

We found this data extremely interesting for its seemingly unique ability to show differences among closely related methods. We have highlighted many of those features. The data is also small and easy to process, even with combinatorial optimization tools that are infeasible on larger data. We hope it becomes a useful resource to others as well!

We show a few network layouts. These were computed by using the Fruchterman-Reingold layout algorithm [Fruchterman and Reingold, 1991] as implemented in igraph [Csardi and Nepusz, 2006]. In this paper, a force directed layout is the result of applying this algorithm to the unweighted, undirected graph. We also compute modularity-biased layouts by first computing an optimal modularity partition and then densifying edges within each optimal modularity cluster. This is done in an ad hoc fashion by adding uniform random edge noise within a cluster to increase the within-cluster edge density based on the modularity partition. (This makes the graph look more like a stochastic block model that encodes the modularity partition.) Then we proceed with the same layout algorithm on the edge-augmented graph. This causes the layout to show these groups more strongly than a straightforward spring layout would, although it has the potential to mislead and make groups appear stronger than they should given the edges alone. This is why we often show both layouts.
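The densification step behind the modularity-biased layouts can be sketched as below. The function name, the per-pair probability p_extra, and the fixed seed are assumptions for illustration; the paper only specifies that uniform random edge noise is added within each cluster. The layout itself is not shown: the augmented edge list would be passed to the same force directed layout routine as the original graph.

```python
import random


def densify_within_clusters(edges, clusters, p_extra, seed=0):
    """Add random within-cluster edges to bias a layout toward the partition.

    edges: iterable of undirected (u, v) pairs.
    clusters: iterable of node collections (e.g. a modularity partition).
    p_extra: probability of adding each within-cluster pair (assumed parameter).
    Returns a sorted list of undirected edges with each pair listed once.
    """
    rng = random.Random(seed)          # fixed seed keeps the layout input reproducible
    augmented = {tuple(sorted(e)) for e in edges}
    for cluster in clusters:
        members = sorted(cluster)
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                if rng.random() < p_extra:
                    augmented.add((members[a], members[b]))
    return sorted(augmented)
```

Because only within-cluster pairs are ever added, edges between clusters are exactly those of the original graph, which is also why the resulting picture can overstate how separated the groups are.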

The github repository contains all of the scripts we used for these figures in the final subdirectory. For instance, the PageRank results are produced by running pagerank-scores.jl. We omit an index as we hope that interested readers can easily identify the mapping between the script names and the examples in this document. As a small exception, we note that the tensor centrality analysis (section .) is in a Python notebook, demo-cc-recipient-sender-tensor-centrality.ipynb.

We also feature the same examples as demo files that would be more appropriate for explanatory use or as a basis for future studies.

The only heuristic computations which may be difficult to reproduce are the network layouts, which we sought to make as reproducible as possible by avoiding random seeds, and the Louvain-based modularity clustering [Blondel et al., 2008] , for which we used the HyperModularity code [Chodrow et al., 2021] without the randomization techniques. Towards those ends, we provide the clustering we found as the final/temporal-modularity-clusters.json file.

Sparse Seeded PageRank-HyperGraph

Three hypergraph eigenvector centralities. arXiv

Anthony Fauci's emails reveal the pressure that fell on one man

Complexity of connected components in evolving graphs and the computation of multicast trees in dynamic networks

Fast unfolding of communities in large networks. arXiv, physics

Structural Holes: The Social Structure of Competition

Generative hypergraph clustering: from blockmodels to modularity. arXiv

Enron email dataset. Online

The igraph software package for complex network research. InterJournal, Complex Systems: 1695

A set of measures of centrality based on betweenness. Sociometry

Graph drawing by force-directed placement. Software: Practice and Experience

Communicability across evolving networks

Strongly local hypergraph diffusions for clustering and semi-supervised learning. arXiv

Community structure in time-dependent, multiscale, and multiplex networks

Finding and evaluating community structure in networks. arXiv

Components in time-varying graphs. arXiv, physics.soc-ph:1106.2134

Motifs in temporal networks. arXiv

Mostofa Ali Patwary. Parallel maximum clique algorithms with applications to network analysis and storage. arXiv

Hypergraph cuts with general splitting functions. arXiv

Localized flow-based clustering in hypergraphs. arXiv

NIAID's Awwad gets Fauci ready for prime time

An information flow model for conflict and fission in small groups

a supplemental details

See table 8 for the files and brief associated descriptions of derived products.

fauci-email-graph.json: the json digest of threaded emails · section
fauci-email-repliedto.json: the repliedto network · section .
fauci-email-tofrom-5.json: the tofrom-nofauci-nocc network · section .
fauci-email-tofrom-cc-5.json: the tofrom-nofauci network · section .
fauci-email-hypergraph-projection.json: the hypergraph-proj-nofauci-nocc network · section .
fauci-email-hypergraph-projection-cc.json: the hypergraph-proj-nofauci network · section .
fauci-email-hypergraph.json: the hypergraph of senders and receivers · section .
fauci-email-bydate-sequence-tofrom.json: the temporal sequence of adjacency matrices · section .
cc-recipient-sender-tensor.json: the tensor studied in table
fauci-email-tensor-words.json: the tensor of senders, receivers, time, and words (for the top 212 words) that we did not get any meaningful analysis from
fauci-email-repliedto-products-simple.json: force directed layouts, modularity, conductance, and ncut partitions for the simple graph version of the network above
fauci-email-repliedto-products-weighted.json: force directed layouts, modularity, conductance, and ncut partitions for the weighted graph version of the network above
fauci-email-tofrom-5-products-simple.json: (same)
fauci-email-tofrom-5-products-weighted.json: (same)
fauci-email-tofrom-cc-5-products-simple.json: (same)
fauci-email-tofrom-cc-5-products-weighted.json: (same)
fauci-email-hypergraph-projection-products-simple.json: (same)
fauci-email-hypergraph-projection-products-weighted.json: (same)
fauci-email-hypergraph-projection-cc-products-simple.json: (same)
fauci-email-hypergraph-projection-cc-products-weighted.json: (same)

TABLE 8 - The full list of derived datasets and associated files that we produce from the raw PDF dump of Fauci's email.