key: cord-0828311-3sy2ca6z
authors: Manyumwa, Colleen Varaidzo; Emameh, Reza Zolfaghari; Tastan Bishop, Özlem
title: Alpha-Carbonic Anhydrases from Hydrothermal Vent Sources as Potential Carbon Dioxide Sequestration Agents: In Silico Sequence, Structure and Dynamics Analyses
date: 2020-10-29
journal: Int J Mol Sci
DOI: 10.3390/ijms21218066
sha: 34df2073bf45f1da4762b8e6985db2b52cde1ef8
doc_id: 828311
cord_uid: 3sy2ca6z

With the increase in CO(2) emissions worldwide and its dire effects, there is a need to reduce CO(2) concentrations in the atmosphere. Alpha-carbonic anhydrases (α-CAs) have been identified as suitable sequestration agents. This study reports the sequence and structural analysis of 15 α-CAs from bacteria, originating from hydrothermal vent systems. Structural analysis of the multimers enabled the identification of hotspot and interface residues. Molecular dynamics simulations of the homo-multimers were performed at 300 K, 363 K, 393 K and 423 K to unearth potentially thermostable α-CAs. Average betweenness centrality (BC) calculations confirmed the relevance of some hotspot and interface residues. The key residues responsible for dimer thermostability were identified by comparing fluctuating interfaces with stable ones, and were part of conserved motifs. Crucial long-lived hydrogen bond networks were observed around residues with high BC values. Dynamic cross correlation fortified the relevance of oligomerization of these proteins, thus the importance of simulating them in their multimeric forms. A consensus of the simulation analyses used in this study suggested high thermostability for the α-CA from Nitratiruptor tergarcus. Overall, our novel findings enhance the potential of biotechnology applications through the discovery of alternative thermostable CO(2) sequestration agents and their potential protein design.

The accumulating concentrations of greenhouse gases (GHGs) over the years have led to the warming of the earth's atmosphere [1] [2] [3] . CO 2 is considered to be one of the GHGs contributing to a significant amount of global warming, with the major source of these gases being the combustion of fossil fuels, rice paddies, and livestock fields [3] [4] [5] [6] . CO 2 concentrations have increased from approximately 280 ppm (parts per million) in the pre-industrial period to approximately 410 ppm in 2019 [7] . Thus, the discovery and implementation of mitigation strategies is crucial.

Sequestration of CO 2 via biomineralization, which involves the aqueous precipitation of minerals in the presence of CO 2 to form mineral carbonates, is a storage strategy currently being explored [8, 9] . The use of a catalyst in this process is important, because otherwise the hydration of CO 2 is very slow [10] . Some of the reactions that take place during biomineralization, however, require conditions, CO2 is very slow [10] . Some of the reactions that take place during biomineralization, however, require conditions, including high temperatures exceeding 100 °C and an alkaline pH [11, 12] . Consequently, thermo-alkali stable enzymatic catalysts are being sought.

Carbonic anhydrases (CAs) are enzymes responsible for the reversible catalytic reaction between CO2 and H2O. They are viable CO2 sequestration agents due to their fast CO2 hydration with catalytic turnover rates (kcat) exceeding 10 4 s −1 for the majority of CAs [11, 13, 14] . They are metalloenzymes containing a Zn 2+ metal ion in the active site, with some having shown the ability to utilize iron (Fe 2+ ) or cobalt (Co 2+ ) instead of Zn 2+ , and still maintain their catalytic activity [15] [16] [17] . Three major classes of the CA family, α, β, and γ, have been widely studied. The other classes identified include δ, ζ, η, θ, and ι-CAs, with the ι-CAs being the most recently discovered [18] [19] [20] [21] [22] .

CO2 hydration proceeds as shown in Reaction (1) ( Figure 1A) , resulting in the formation of carbonic acid which dissociates to bicarbonate and hydrogen ions. After being released from the catalytic site into the solvent, the bicarbonate ions further dissociate to carbonate ions (Reaction (2) in Figure 1A ) which then react with the metal ions (shown as X 2+ ), such as Ca 2+ or Mg 2+ , during biomineralization to form the mineral carbonate XCO3 [10, 23] . This is illustrated in Reaction (3) ( Figure  1A ). shows the disulfide core formed by four Cys residues in the tetramerization interface. The catalytic cavity is enlarged in (D), with CO2 binding pocket residues colored red, proton transfer residues in blue and Zn 2+ coordinating His residues colored green.

α-CAs are the most widely studied class of carbonic anhydrases, and have been found in mammals, algae, and plants, with numerous reports in bacteria as well [24] [25] [26] . Most human α-CAs have monomeric structures, in contrast to their bacterial counterparts which have been revealed to have a dimeric assembly [27, 28] . A unique tetrameric assembly, absent in all other α-CAs to date, has been observed for the α-CA from Thermovibrio ammonificans (TaCA, PDB ID: 4C3T) ( Figure 1B) , which is held together by a core of two disulfide bonds ( Figure 1C ) [29] . Each monomer has a functional independent active site, containing a Zn 2+ metal ion in tetrahedral coordination by three (3), while (B) shows the tetrameric structure of TaCA. (C) shows the disulfide core formed by four Cys residues in the tetramerization interface. The catalytic cavity is enlarged in (D), with CO 2 binding pocket residues colored red, proton transfer residues in blue and Zn 2+ coordinating His residues colored green.

α-CAs are the most widely studied class of carbonic anhydrases, and have been found in mammals, algae, and plants, with numerous reports in bacteria as well [24] [25] [26] . Most human α-CAs have monomeric structures, in contrast to their bacterial counterparts which have been revealed to have a dimeric assembly [27, 28] . A unique tetrameric assembly, absent in all other α-CAs to date, has been observed for the α-CA from Thermovibrio ammonificans (TaCA, PDB ID: 4C3T) ( Figure 1B) , which is held together by a core of two disulfide bonds ( Figure 1C ) [29] . Each monomer has a functional independent active site, containing a Zn 2+ metal ion in tetrahedral coordination by three His residues as well as a water molecule [14, [30] [31] [32] . They also have a hydrophobic pocket close to the active site where the CO 2 molecule is normally held during catalysis ( Figure 1D ). This pocket has been identified in most α-CAs, including the human CA II [33] [34] [35] [36] [37] .

Given the high environmental temperatures, CAs from hydrothermal vents may already possess attributes to withstand the extreme conditions in the CO 2 sequestration process, thus providing a viable alternative to engineering CAs. This has been observed in vitro on α-CAs from hydrothermal vent bacteria Caminibacter mediatlanticus (CmCA), Persephonella marina (PmCA) and TaCA, with CmCA and PmCA revealing stability up to 70 • C and 100 • C, respectively, and the wild-type and variants of TaCA appearing stable up to 95 • C [38] [39] [40] [41] . Another CA, found in previously isolated DNA from the Logatchev hydrothermal field (LOGACA), has also proven to be thermostable, withstanding temperatures up to 103 • C [35] . The structures of these α-CAs, except CmCA, have been experimentally solved and reported [29, 35, 42] . Consequently, the in silico analysis of the thermostability attributes of α-CAs from hydrothermal vents, is the focus of this study.

To date, in silico studies on α-CAs have been for monomeric forms of the proteins. The multimeric occurrence of these enzyme complexes, which is brought about by the interaction of residues between monomers, a region termed the interface or the buried surface area (BSA), is often overlooked [43, 44] . The BSA contributes to function as well as stability of the protein complex with a group of residues, termed "hotspot residues", contributing significantly to protein stability [45, 46] . This study considers their biological assemblies (BAs), revealing the relevance of simulating the CAs in their multimeric states. We report here computationally solved dimeric structures and the analysis of 12 α-CAs, including CmCA, from various bacteria coming from hydrothermal vent systems. Analysis was also included for LOGACA, PmCA and TaCA. Sequence alignments and motif analyses were performed to identify conserved and possible functionally important regions in the proteins, followed by phylogenetic tree construction, to view the evolutionary relationships amongst the CAs [47, 48] . Analysis of the CA interfaces revealed the presence of hotspot residues, present in conserved motifs, which contribute to the stability of these proteins. Similar results have been shown in Enterovirus capsids [49] . Furthermore, the importance of some interface residues in protein communication was fortified through average betweenness centrality (BC) analysis of molecular dynamics (MD) simulation trajectories at 300 K [50] . Hydrogen bond networks centered on high communication residues recognized in average BC analysis were identified through hydrogen bond analysis. Patterns of thermostability were monitored for the proteins through the radius of gyration (R g ), root mean square fluctuation (RMSF), and dynamic cross correlation (DCC) analysis of MD simulations at increasing temperatures, 300 K, 363 K, 393 K and 423 K, and were compared to previously characterized CmCA, LOGACA, PmCA and TaCA. Maintenance of active site cavity compactness, low residue fluctuations and high correlated motions at temperatures of 423 K were conspicuous for Nitratiruptor tergarcus' CA (NtCA). Analysis of inter-subunit hydrogen bonds at all four temperatures showed interface disruptions in some proteins at high temperatures, indicative of fluctuations and reduced thermostability. Overall, the computational approach combined data retrieval, sequence alignment, motif analysis, phylogenetic tree calculations, homology modeling of biological assemblies, protein-protein interface analysis coupled with molecular dynamics simulations, dynamic residue analysis and dynamic cross correlation; and thus brought novel aspects to the field of carbonic anhydrases as carbon dioxide sequestration agents. The novelty of our findings enhances the knowledge base of biotechnology applications through the discovery of alternative thermostable sequestration agents as well as their potential protein design.

Sequence and structural analysis of 15 α-CA proteins, of which 14 were from 13 bacteria and one from isolated DNA, was performed (Table 1 and Table S1 ). The organisms were previously isolated in or around hydrothermal vents, and were mainly Gram-negative [51] [52] [53] [54] [55] [56] [57] . They are classified under four groups (Table 1) : Aquificacea, Campylobacteria (formerly known as Epsilonproteobacteria), Deltaproteobacteria (δ-proteobacteria) and Gammaproteobacteria (γ-proteobacteria) [58, 59] . Campylobacteria are the most abundant in these locations [60] [61] [62] . Confirmation of the origin sites of these organisms was crucial to this study, as some genera are not confined to inhabiting hydrothermal environments alone. For example, the genus Vibrio is widespread in marine environments [64, 65] . Although Geothermobacter iron reducers are predominantly confined to hydrothermal systems [66] [67] [68] [69] , they have also previously been identified in paddy soils [70] . Bacteria from the families Hydrogenimonaceae, Nautiliaceae, and Nitratiruptoraceae are known to be found entirely in vent systems [71] .

The reference sequence from PmCA will be used for α-CA residue numbering going forward unless stated otherwise. All corresponding residues for other sequences are outlined in Table S2. 2.1.1. Multiple Sequence Alignment Reveals the Extent of Conservation in the α-CA Sequences Multiple sequence alignments (MSAs) are important in the identification of conserved regions, which are assumed to be structurally and functionally significant. The MSA produced by Tree-based Consistency Objective Function for Alignment Evaluation (T-Coffee) [72, 73] gave an accurate alignment of functional residues across all species and was used for further analysis of the α-CAs. The three His residues that coordinate the Zn 2+ metal ion in the active site, His107, His109 and His126, were conserved across all sequences and are denoted in Figure 2A by the cyan boxes. CAs of this class have been reported to contain Cys residues (Cys44 and Cys197) that form an intra-subunit disulfide bond affecting structural stability [29, 34, [74] [75] [76] [77] . These are usually the only two Cys residues in the sequence, indicated by the yellow boxes in the MSA, and were present in all the α-CA sequences sampled ( Figure 2A) . TaCA, however, possesses a third Cys residue in position 67 (TaCA numbering), which is responsible for its tetrameric quaternary structure [29] . This Cys residue forms a disulfide bond with another Cys in the same residue position of another subunit, resulting in a core of two disulfide bonds by four monomers ( Figure 1C ) [29] . The rest of the sequences, though of similar origins, and other previously characterized α-CAs, do not possess that particular Cys residue, as shown by the magenta-colored star in Figure 2A . The corresponding amino acid position in the other sequences is moderately variable, with the residue Lys or Asn being the most common substitute. A dimeric architecture, compared to TaCA's tetrameric structure, was thus assumed for the rest of the bacterial CAs during homology modelling (Section 2.2.1). residues in the sequence, indicated by the yellow boxes in the MSA, and were present in all the α-CA sequences sampled (Figure 2A) . TaCA, however, possesses a third Cys residue in position 67 (TaCA numbering), which is responsible for its tetrameric quaternary structure [29] . This Cys residue forms a disulfide bond with another Cys in the same residue position of another subunit, resulting in a core of two disulfide bonds by four monomers ( Figure 1C ) [29] . The rest of the sequences, though of similar origins, and other previously characterized α-CAs, do not possess that particular Cys residue, as shown by the magenta-colored star in Figure 2A . The corresponding amino acid position in the other sequences is moderately variable, with the residue Lys or Asn being the most common substitute. A dimeric architecture, compared to TaCA's tetrameric structure, was thus assumed for the rest of the bacterial CAs during homology modelling (Section 2.2.1). Proton transfer residues are indicated by the arrows above them and the other functional residues are color coded: cyan-Zn 2+ coordinating residues; green-CO 2 binding pocket residues; yellow-Cys-Cys disulfide bond residues. The magenta star indicates the Cys position in TaCA responsible for its tetramerization. (B): The evolutionary relationship amongst the 15 retrieved sequences was inferred using the Maximum Likelihood method under the WAG + G + I model and a 100% gap deletion. Bootstrap values from 1000 bootstrap replicates are shown as decimals at their respective nodes. The heat map for the all-versus-all pairwise sequence identity calculations, generated using the T-Coffee MSA, is displayed next to the phylogenetic tree with the magnitude of identity between sequences increasing from 0, shown by the blue color, to 1, shown by red. Classes A, B, C and D indicate the bacteria classes Deltaproteobacteria, Gammaproteobacteria, Campylobacteria and Aquificacea, respectively.

The hydrophobic pocket necessary for CO 2 binding has been reported in numerous α-CAs, and was present in all structures modelled [27, 30, 34, 78, 79] . It contains the hydrophobic residues Val128, Val138, Leu192, Val201, and Trp203 [29, 37, 75] . All these residues, except Leu192, were completely conserved across the sequences. In NtCA the Leu was substituted by Phe, which is also a hydrophobic amino acid, thus maintaining the hydrophobicity of the pocket. His82 has been reported to be involved in proton shuttling, and was conserved across all the sequences [29, 77] . This residue, along with other residues involved in proton transfer as indicated in Figure 2A , as well as Thr83, are part of a cavity including the active site and the hydrophobic CO 2 binding pocket ( Figure 1D ). The proton transfer residues form a pocket, referred to as the tertiary CO 2 binding pocket in hCA-II, which has also been previously observed binding the inhibitor acetazolamide in TaCA [28, 29, [80] [81] [82] .

The cell wall of Gram-negative bacteria is known to be encircled by an outer membrane with a periplasmic space in between [83] . The signal peptide identified in α-CAs is believed to be a useful coping mechanism for the secretion of the CAs, either into the periplasmic space or extracellularly. There it executes CO 2 hydration, thus, aiding the movement of bicarbonate through the cell membrane [14, 29, 84] . This is in contrast to the β-CA and γ-CA classes that are found in the cytoplasm [14] . Previously isolated bacterial α-CAs, including LOGACA, PmCA and TaCA, possess signal peptides at the N-terminal of the protein [29, 35, 85] . In this study, probable cleavage sites for the retrieved sequences were calculated by two signal peptide prediction programs. SignalBLAST [86] uses the BLAST package to predict signal peptides, comparing the query sequence to reference data, whereas Phobius [87] predictions are based on the hidden Markov model (HMM). Results are outlined in Table S3 . Predictions from Phobius were in agreement with those from SignalBLAST concerning the absence of a signal peptide in both Geothermobacter α-CAs, suggesting cytoplasmic localization for these particular CAs. This observation was also supported by motif analysis (Section 2.1.4) where Motif 11, which was perceived as an indication of a signal peptide, was found to be absent in both Geothermobacter spp. CAs. The signal peptide sequence (see Figure 2A ), was therefore excluded from the calculation of all structures except the two mentioned above during the modelling of α-CAs in Section 2.2.1. Residues excluded from the modelling of GEprmCA and GHr1CA were as a result of their absence in the template structure, which also possessed a signal peptide. Phylogenetic tree calculations, however, proceeded with complete sequences including the signal peptide.

The top three models had Bayesian Information Criterion (BIC) values of 7788, 7796 and 7806, respectively. The phylogenetic relationship amongst the retrieved α-CA sequences was inferred by MEGA under the best-fit protein substitution model, WAG + G + I, with a gap deletion of 100%. Phylogenetic analysis evidenced the clustering of the bacteria to match their all-versus-all pairwise sequence identities ( Figure 2B ). The pairwise identity heat map, which was derived from the MSA in Figure 2A , revealed high sequence identities in CAs belonging to similar lineages. CAs which belonged to the Sulfurovum genus had high sequence identities (above 80%) and clustered together. VaCA2 and VdCA were highly similar, in contrast to the comparison of either of the sequences to VaCA1. The Geothermobacter spp. CAs also showed high sequence identities (above 80%). Tree branching suggested that unclassified LOGACA belonged to the class Aquificacea and genus Persephonella. This α-CA showed the closest relation to PhCA, with a sequence identity of 87% and a branch bootstrap value of 0.912. Clustering patterns by some of the sequences align with phylogenetic tree calculation results obtained by Nakagawa et al. [57] . Common bacteria to both studies include H. thermophila, N. tergarcus, S. lithotrophicum, and Sulfurovum sp. NBC 37-1. Trees were calculated using 16s rRNA sequences and branching of these bacteria proved similar to those observed in this study [57] .

Motif analysis was performed to elucidate conserved patterns amongst the sequences. 14 unique motifs were identified for the α-CAs and generally, a high conservation of motifs across the sequences was perceived, with 11 being conserved in the dataset ( Figure 3 ). Several motifs have been identified as functionally important, as they contain residues that are critical to the function of the CAs (Table 2) . Motif numbering is aligned to results produced by Multiple Expectation Maximisation for Motif Elicitation (MEME) software [88] . Start and end positions of motifs for each sequence, as well as motif E-values, are outlined in Table S4 .

Functional residues previously identified were observed and highly conserved in the MSA of the α-CAs in this study and were located in various motifs ( Table 2) . Residues in the interface of the dimers (Section 2.2.2) also signified the structural importance of some motifs and are included in Table 2 . Asn80 and Lys85 are proton shuttling residues present in Motif 6, with HtCA possessing a Lys residue in place of Asn80. Lys85 position was occupied by either a Lys or Gln residue in the retrieved sequences. The α-CAs that possessed Gln in this position included both Geothermobacter CAs, all three Sulfurovum α-CAs and all three Vibrio α-CAs. The rest had Lys in that position. Functional importance of motifs 8, 9, 10, 12, 13 and 14 is unknown. Motif 11 was located at the N-terminal, except in the Geothermobacter spp. α-CAs, correlating with signal peptide predictions in Section 2.1.2, as this motif was observed to be part of a signal peptide with residues 1 to 16. phylogenetic tree calculation results obtained by Nakagawa et al. [57] . Common bacteria to both studies include H. thermophila, N. tergarcus, S. lithotrophicum, and Sulfurovum sp. NBC 37-1. Trees were calculated using 16s rRNA sequences and branching of these bacteria proved similar to those observed in this study [57] .

Motif analysis was performed to elucidate conserved patterns amongst the sequences. 14 unique motifs were identified for the α-CAs and generally, a high conservation of motifs across the sequences was perceived, with 11 being conserved in the dataset ( Figure 3 ). Several motifs have been identified as functionally important, as they contain residues that are critical to the function of the CAs ( Table 2 ). Motif numbering is aligned to results produced by Multiple Expectation Maximisation for Motif Elicitation (MEME) software [88] . Start and end positions of motifs for each sequence, as well as motif E-values, are outlined in Table S4 . blue-most hydrophobic residues (A, C, F, I, L, M, V and W); red-positively charged residues (K and R); green-polar, non-charged and non-aliphatic residues (N, Q, S and T); magenta-most acidic residues (D and E); light pink, orange, turquoise and yellow are for H, G, Y and P respectively. (C): Motifs are mapped onto the structure of Sulfurovum lithotrophicum (SlCA). Identified interface residues are shown as spheres and colored according to the color of the motif on which they are found. Motif web logos displayed in (C) are those for motifs containing interface residues. Motif 11 was omitted due to its absence in the modelled structures. Interface and hotspot residues are indicated by the black and red arrows, respectively. ; red-positively charged residues (K and R); green-polar, non-charged and non-aliphatic residues (N, Q, S and T); magenta-most acidic residues (D and E); light pink, orange, turquoise and yellow are for H, G, Y and P respectively. (C): Motifs are mapped onto the structure of Sulfurovum lithotrophicum (SlCA). Identified interface residues are shown as spheres and colored according to the color of the motif on which they are found. Motif web logos displayed in (C) are those for motifs containing interface residues. Motif 11 was omitted due to its absence in the modelled structures. Interface and hotspot residues are indicated by the black and red arrows, respectively. 

3D protein structures were calculated via homology modelling. These structures were then used for mapping sequence and motif information, structural analysis, and MD simulations which helped to understand the differences between monomeric and multimeric forms of the proteins, as well as the identification of key interface residues.

The crystal structure of LOGACA was used as the template for the dimeric structure calculations of all sequences retrieved except PmCA and TaCA, which already have crystal structures. LOGACA had a resolution of 2.5 Å and good coverage with all the CA sequences, which ranged between 87% and 95%. Sequence identities with the retrieved sequences varied between 47% and 84% (Table S5 ). The signal peptide sequence, also absent in LOGACA, was excluded from the models, along with a few residues that were not structurally available from the template and these are indicated by the red box in Figure 2A . The starting residue numbers for each of the structures after alignment trimming during modelling are noted in Table S3 along with suggested cleavage sites from the signal peptide prediction servers. Structures were modelled with an active site Zn 2+ metal ion tetrahedrally coordinated by three His residues and a water molecule. Coordinating His residues were all within the expected bond length of 2.2 Å to Zn 2+ [89] . The active site appeared to be open and accessible to the solvent as observed in most CAs, with the catalytic pocket extending from the outside of the protein to the metal ion [34, 90, 91] . Validation results from various programs revealed that the structures modelled were of good quality (Table S5 ). The z-DOPE (Discrete Optimized Protein Energy) scores were all below the quality threshold of −0.5, with structure quality improving as the score became more negative. The Ramachandran plots generated using PROCHECK [92] showed that 87% or more of residues for each of the structures were in the most favored regions. Verify3D [93] compares the protein sequence (1D) to the 3D structure, producing results as a percentage of residues with scores above 0.2. Percentages above 80% are considered a pass, and all α-CA structures passed this verification.

It was crucial to investigate the interface residues of the multimeric α-CAs and their interactions, particularly hotspot residues, mutation of which might result in destabilization of the protein [45, 94] .

The residues involved in interface formation, as well as hotspot residues as a subset of the interface residues, were identified using various web servers (Table S6 ). In TaCA, two different interfaces were considered; the dimerization interface, which is between two monomers forming the bacterial dimer, and the tetramerization interface, which is between the dimers responsible for tetramer formation. Henceforth, residues from a neighboring monomer will be signified by an asterisk (*).

It was interesting to observe marginal asymmetry in the spatial orientation of residues in the interface. On account of this, similar residues in different chains were found to be contributing differently to the interface binding energy, thus one could be identified as a hotpot residue but not the other. Complete conservation was observed in the sequence alignment ( Figure 2A) for Asn49, Arg187, Ala237 and Arg238, which appeared as interface residues across all sequences. The important inter-subunit interactions formed by these residues were deduced from the Protein Interactions Calculator (PIC) web server [95] . Asn49 was observed to form hydrogen bonds with a number of residues including Asn49 * and Ser189 * as well as Glu199 *. Variability in Ser189 position did not affect the presence of the Asn49-Ser189 * bond in the CAs. Glu199 also formed hydrogen bonds with the same residue in the neighboring monomer. In Geothermobacter spp., Glu199 was replaced by a Gln residue, but showed similar interactions in the interface. The Arg187-Ala237 * hydrogen bond in the interface was common to all proteins. The combination of Arg187-Asp102 and Asp102-Arg207 intra-subunit salt bridges (TaCA numbering) has been previously recognized in TaCA and is referred to as an ion-pair 'latch' [29] . This was observed for the α-CAs in this study, except the Geothermobacter spp. α-CAs, which lacked an Asp102-Arg207 salt bridge due to an Ile substitution in place of Arg207.

Sulfurovum lithotrophicum's α-CA (SlCA) contained all 14 motifs identified; thus, it was used as a representative structure in Figure 3C for the mapping of motifs and interface residues, which are outlined in Table S6 . This was in order to visualize the interface, the motif positions, and their interactions. High variability of most interface residues in Motifs 5 and 7, besides hotspot residue Asn49, was observed. Physicochemical properties in each position varied across the sequences with possible implications to individual stabilities of the proteins, which were searched further through MD simulation analyses (see Section 2.4.3). Characteristics of the interfaces obtained from the Protein Interfaces, Surfaces and Assemblies (PDBePISA) web server [96] , including the number of hydrogen bonds, salt bridges formed as well as the BSA of each model, are listed in Table S7 . Inter-subunit hydrogen bonds and salt bridges were also queried from the PIC web server [95] . Results from PDBePISA revealed the protein subunits to be held together by approximately 7-18 inter-subunit hydrogen bonds of moderate strength with H-bond distances varying between 2.6 Å and 3.7 Å. Salt bridges are bonds between groups of opposite charges between the negative carboxyl oxygen atom from either Asp or Glu residues and the positive nitrogen atom from either His, Lys or Arg residues within 4 Å of each other [97, 98] . Their presence in protein structures has been recognized as potentially stabilizing, and possibly contributing to and increasing their thermostability [99, 100] . Both PIC and PBDePISA concurred that CmCA, both Geothermobacter CAs, HtCA, VaCA2 and VdCA did not contain any inter-subunit salt bridges.

The BSA and the total solvent accessible area of each dimer was calculated in Å 2 and both are shown in Table S7 . CmCA, HtCA and PmCA had the largest buried surface areas of 2094.6 Å 2 , 2082.8 Å 2 and 2011 Å 2 , respectively. Dimeric buried interface residue areas are similar to those reported for the thermophilic α-CA species Sulfurihydrogenibium azorense (PDB ID: 4 × 5S) (2050 Å 2 ), with that of Sulfurihydrogenibium yellostonense (PDB ID: 4G7A) being much higher at 2300 Å 2 [34, 74] . In comparison to the retrieved sequences, the α-CA from the mesophile Neisseria gonorrhoea (PDB ID: 1KOP) was also put through PISA, and was observed to have a smaller buried surface area of 1691 Å 2 .

Although the analysis of static multimeric models gave an adequate representation of the important interactions within the structures, proteins often have a number of conformations in vivo, thus necessitating their simulation and analyses [101] . MD simulations allowed sampling of a wide range of these conformations, as shown by the root mean square deviation (RMSD) Kernel density estimation (KDE) plots in Figure 4 . All simulations had equilibrated by 40 ns (Figure S1 ), substantiating the 50 ns runs. Average BC and hydrogen bond analysis of the simulations at room temperature (300 K) gave a clearer picture of the residues with important roles during these conformational fluctuations. 

Dynamic residue network (DRN) analysis is an approach of MD-TASK designed to observe the changes in the communication of protein residues during an MD simulation [50] . The Cβ atoms (Cα for Gly) of residues are represented as nodes as part of this network and the shortest path to each node is calculated. Average BC calculations were performed to identify residues important in the communication of these α-CAs. Distributions of the average BC values across the proteins were positively skewed, thus relevance of high communication residues was considered for the top 5% residues showing the highest average BCs and these are listed in Table 3 . Some of the hotspot and interface residues that we identified (Section 2.2.2) were observed to have significant average BCs, displaying frequent usage and noteworthy importance in the functioning of the α-CAs. It was observed that the asymmetric behavior displayed during interface analysis of static structures was more pronounced in the MD simulations. Similar residues in different chains showed different average BC intensities ( Figure S2) , thus in all proteins, residues appearing above the 5% cut-off were 

Dynamic residue network (DRN) analysis is an approach of MD-TASK designed to observe the changes in the communication of protein residues during an MD simulation [50] . The C β atoms (C α for Gly) of residues are represented as nodes as part of this network and the shortest path to each node is calculated. Average BC calculations were performed to identify residues important in the communication of these α-CAs. Distributions of the average BC values across the proteins were positively skewed, thus relevance of high communication residues was considered for the top 5% residues showing the highest average BCs and these are listed in Table 3 . Some of the hotspot and interface residues that we identified (Section 2.2.2) were observed to have significant average BCs, displaying frequent usage and noteworthy importance in the functioning of the α-CAs. It was observed that the asymmetric behavior displayed during interface analysis of static structures was more pronounced in the MD simulations. Similar residues in different chains showed different average BC intensities ( Figure S2 ), thus in all proteins, residues appearing above the 5% cut-off were marginally different for monomers of the same protein. Hotspot residue Asn49 showed relatively high average BC compared to other residues in their respective structures in the average BC analysis of most proteins. Interface residue Asn194 (SlCA numbering), which was not found in α-CAs from Aquifaceae and NtCA, was also found above the cut-off. Residues with high average BCs in TaCA were observed to be located more towards the tetramerization interface, with the Cys residues that formed the two inter-subunit disulfide bonds displaying the highest usage. These disulfide bonds have been shown to be crucial for the thermostability of TaCA [29] . Val64 and Leu68 (TaCA numbering) had relatively high BCs and were observed to form hydrophobic interactions in the dimerization interfaces of both dimers. Contributing to the tetramerization interface was Lys65, forming hydrogen bonds with Lys247 in the opposite monomer in a different dimer.

Amongst the residues identified were also those previously identified in literature as crucial to the catalytic activity of the α-CAs. CO 2 binding pocket residue Trp211 (CmCA numbering) showed a high usage in most proteins. Of particular interest was Val201, which was observed to play a role in the dimer interface as well as operating as a CO 2 binding pocket residue. This occurrence was observed across all the proteins simulated with relatively high average BC. Functionality of the residues without annotation in Table 3 was unknown from previous literature and was further searched using hydrogen bond analysis.

Hydrogen bonds play a key role in protein stability [102, 103] . Analysis of intra-subunit bonds substantiated the presence of high communication residues identified in average BC analysis, whose function was not recognized from previous literature. These residues were queried from the results produced by cpptraj [104] and the networks shown in this study were constructed on the basis of persistent hydrogen bonds, i.e., bonds lasting more than 90% of the simulation, around the residues. All proteins were observed to have a hydrogen bond network around Glu124, which appeared in the top 5% for most CAs in average BC analysis. This network involved at least two high communication residues in the proteins, except in HtCA where the Glu residue was the only one. For all proteins, it included hydrogen bonds between Glu124 and residues His109 and His126, which are functional as Zn 2+ coordinating residues. GEprmCA and GHr1CA had the greatest number of high communication residues, i.e., four, in a single hydrogen bond network (Glu124, Ser140, Asn114 and Val53-GEprmCA numbering).

An illustration of GEprmCA's Glu network is shown in Figure 5A . The largest hydrogen bond network, with 11 residues interacting for most of the simulation, was observed for thermostable PmCA ( Figure 5B ) and LOGACA ( Figure S3K ) as well as for PhCA ( Figure S4D ), which all belong to the same class. Amongst these 11 residues were CO 2 binding pockets Trp203 and Leu192, both of which formed hydrogen bonds with high communication residue Ser191. The Leu192-Ser191-Trp203 network was also present, but observed to be separate from the Glu124 network in CmCA, HtCA, SlCA and SrCA ( Figures S3D,J and S4H,L) . Apart from its Glu network shown in Figure 5C , NtCA was observed to possess a second unique hydrogen bond network ( Figure 5D ), centered on Gln58 and including Leu242 (NtCA numbering), both which showed high usage in average BC analysis. Persistent hydrogen bonds were observed between Gln58 and hotspot residues Glu207 and Arg245, as well as intra-subunit disulfide bonds Cys205 and Arg237 (NtCA numbering). Interactions formed by Glu207 (NtCA numbering) in the interface were probed and mentioned in the interface analysis (Section 2.2.2) This additional network is presumed to be of weighty importance towards stability of the protein both within and across subunits. Hydrogen bond networks for the rest of the structures are shown in the (Supplementary Data Figures S3-S5 ).

Simulation of the proteins at increasing temperatures of 300 K, 363 K, 393 K and 423 K in this study unearthed other potentially thermostable α-CAs. Behavior patterns of the uncharacterized CAs were compared to those of CAs already pre-determined to be thermostable, i.e., CmCA, LOGACA, PmCA and TaCA, upon trajectory analyses. Similar patterns, and possibly those displaying higher thermostability properties, were seen for some of the proteins in these analyses, which included R g , RMSF, DCC and inter-subunit hydrogen bond analysis. Results from the analysis of the temperature MD simulations also further substantiated the importance of multimeric simulations.

analysis. Persistent hydrogen bonds were observed between Gln58 and hotspot residues Glu207 and Arg245, as well as intra-subunit disulfide bonds Cys205 and Arg237 (NtCA numbering). Interactions formed by Glu207 (NtCA numbering) in the interface were probed and mentioned in the interface analysis (Section 2.2.2) This additional network is presumed to be of weighty importance towards stability of the protein both within and across subunits. Hydrogen bond networks for the rest of the structures are shown in the supplementary data ( Figures S3-S5 ). Figure 5 . Hydrogen bond networks around some residues with high average BC (labelled in blue) confirmed through average BC analysis. Networks shown for GEprmCA (A), PmCA (B) and NtCA (C), respectively, pass through the conserved Glu132 (NtCA numbering). (D) shows the hydrogen bond network around NtCA's Gln58. All bonds were present for > 90% of the simulation. Functional residues are annotated as follows: hotspot residues-h; interface residues-i; intra-subunit disulfide Cys residues-d; CO 2 binding pocket residues-c; Zn 2+ coordinating residues-z.

The cavity with Zn 2+ contains all the catalytically important residues, including those forming the CO 2 binding pocket, His coordinating residues, and proton transfer residues. Maintenance of this cavity at high temperatures is thus perceived as important for the sustenance of catalytic activity in these extreme conditions. R g analysis, which assesses protein compactness, was used to monitor the compactness of (i) the separate chains and (ii) the catalytic cavity of each chain across all temperature trajectories; these are illustrated in Figure 6 . The length of the vertical lines represent the minimum and maximum R g exhibited at each temperature. This approach was recently used for the analysis of the active and allosteric sites of the main protease of SARS-CoV-2 [105] . Although the CAs are homo-multimers, the behavior of either the individual chains, the pockets, or both, was asymmetrical at one or more temperatures for all proteins. This was, however, less evident for CmCA, SlCA and TaCA. At all four temperature simulations, compactness of the catalytic cavities was maintained for CmCA, GEprmCA, NtCA, PmCA, SlCA and TaCA, an indication of resisting denaturation at high temperatures, and possibly the maintenance of catalytic activity. Decrease in compactness was noted for GHr1CA and HtCA's catalytic cavities as well as their chains with an increase in temperatures. Despite a general increase in R g for SNbcCA's chains A and B at 393 K, it was interesting to observe a maintenance of compactness for the catalytic pockets. This was also seen for GHr1CA's and SrCA's chain B at 423 K, as well as for VaCA2.

RMSF is a measure of average residue displacement across a trajectory compared to its position in the initial frame. The effects of increasing temperature on individual residue fluctuations in the multimeric structures were investigated using RMSF analysis. Rigidity of the residues at lower temperatures is a desirable characteristic for increased thermostability and is necessary to make up for fluctuations at high temperatures found in the CO 2 sequestration process [106, 107] . The N-terminal is known to be a generally flexible region and exhibited high residue fluctuations in most proteins, particularly at 393 K and 423 K ( Figure S6 ). This flexibility could also be accredited to the removal of the signal peptide identified in Section 2.1.2. Reference α-CAs CmCA and PmCA displayed low fluctuations at all four temperatures. These proteins are known to possess high thermal stability properties; thus, a similar effect was explored in the proteins simulated as a speculation of thermostability, and was observed for NtCA. At 423 K, marginally higher fluctuations were observed for LOGACA compared to lower temperatures, and a similar behavior was noted for GEprmCA, PhCA, SlCA, SrCA and VaCA1. The highest fluctuations were seen at 423 K for VdCA, HtCA, GHr1CA, HtCA and SNbcCA, in order of increasing fluctuations. These results implied a reduced tolerance to high temperatures due to losses in structural rigidity compared to the other α-CAs.

The secondary structures adopted by proteins during folding are important for their function and structural integrity; and consequently, protein unfolding when exposed to extreme conditions is often accompanied by secondary structure loss [108] [109] [110] . As such, the high fluctuations observed for GHr1CA, HtCA, SNbcCA, as well as VdCA, prompted the investigation of the occurrence of secondary structure elements of each residue over the 50 ns trajectories using Define Secondary Structure of Proteins (DSSP) analysis [111] . Particular attention was paid to the interface residues identified in Section 2.2.2, and this is shown in Figure 7 , along with the changes in RMSF as well as the average BC of each residue with increases in temperature. RMSF of interface residues for the rest of the structures is illustrated in Figure S7 . Loops in protein structures are generally known as regions of high flexibility. Most interface residues in the proteins in Figure 7 were observed to be present in loop regions upon DSSP analysis. Hydrophobic interactions formed between HtCA's Met48, found in Motif 5, and Ile51, which was not located on any motif, are suggested to prevent solvent accessibility from the N-terminal end of the interface. This is signified by the relatively lower fluctuations observed in the N-terminal region of HtCA at 423 K compared to its other interface residues at this temperature. These hydrophobic interactions were also conserved for CmCA, LOGACA, PhCA and PmCA. The last three CAs however, also contained hydrophobic residue Ile49 (PhCA numbering) thus lower fluctuations were observed at 423 K ( Figure S7 ) compared to HtCA which had hydrophilic Glu instead. Inevitably, disruption of the predominantly alpha helical structure involving Met48 and Ile51 (HtCA numbering), in both HtCA chains was observed at 423 K. Interface residues of HtCA in Motif 7 (Tyr66, Asp67 and Asp69) located on the opposite periphery to Motif 5 interface residues, were amongst the highest fluctuating residues in both chains. These residues create a polar environment in this region, allowing easy solvent accessibility. The remarkable stability of CmCA's interface towards Motif 7 was observed to be brought about by the hydrogen bonds formed by Asn68-Thr69, Thr64-Asp105 and Asn68-Asn70. It is also worth noting that the Cys residues involved in the tetramerization disulfide bridge in TaCA, which are located in Motif 7, barricade solvent accessibility to the dimerization interface, thus the exceptional interface stability was observed ( Figure S7 ). In VdCA, the degradation of the anti-parallel beta-sheet involving Ala58 was also seen in both chains at 423 K. Differences in the structural propensities of interface residues was also noted between chains, particularly for HtCA ( Figure 7B ), further highlighting the asymmetric chain behaviors.

Except for the hotspot residue Asn49, the interface residues in Motifs 5 and 7 were observed to be low communication residues. These residues are oriented closer to the N-terminal as well as being obscurely accessible for the solvent (see Figure 3C ), compared to those located closer to the C-terminal across all proteins. Asn49, which showed a higher average BC compared to these interface residues, was completely conserved across all sequences, whereas the rest in Motifs 5 and 7 displayed very low conservation. Val201 (HtCA numbering) was observed to have higher average BC compared to most of the interface residues at most temperatures. Hydrogen bond interactions in the interface are further analyzed in Section 2.4.4.

Following the identification of interface residues in Section 2.2.2, it was interesting to identify the occurrence of hydrogen bonds in the interface across trajectories with an increase in temperature ( Figures S8-S11) . During this analysis, it was important to differentiate between the hydrogen bonds contributing to interface stability from those formed by temporary proximity of residues due to fluctuations. The most important hydrogen bonds were identified as those maintained for a larger fraction of the trajectory, whereas numerous short-lived bonds were an indication of residue fluctuations in the interface. The latter was seen for proteins HtCA, SNbcCA and SrCA at 423 K, with an absence of persistent hydrogen bonds being observed at this temperature ( Figure 8 ). HtCA and SNbcCA interface residues were correspondingly showing large fluctuations at 423 K in Section 2.4.3. NtCA showed a number of hydrogen bonds present for most of the simulation at all four temperatures ( Figure S9 ), which led to the low fluctuation in the interface residues ( Figure S7 ). Hydrogen bonds mentioned for CmCA in Section 2.4.3 were observed at all four temperatures ( Figure S8) , thus, they are confirmed to contribute to interface stability. Upon a closer look at the results, it was observed that Val201, which was identified in the interface in Section 2.2.2 and had a high average BC, was not seen forming any hydrogen bonds in the interfaces of all the proteins simulated. Its high average BC values across all proteins are therefore accredited to its role in the CO 2 binding pocket. An increase in short-lived hydrogen bonds was accompanied by the loss of the two Asp60-Lys62 * long-lived hydrogen bonds in PmCA at 393 K and 423 K ( Figure S9 ). In TaCA, the tetramerization interface had adjacent monomers (chain A-D and chain B-C) showing short-lived hydrogen bonds between Lys65-Ala66 * and Lys65-Lys247 * only, with the latter having been identified in average BC analysis. The dimerization interfaces contained a consistent hydrogen bond pattern throughout the temperatures, with the most long-lived bonds forming between Lys65 and disulfide bonding Cys67 * ( Figure S11 ). Except for the hotspot residue Asn49, the interface residues in Motifs 5 and 7 were observed to be low communication residues. These residues are oriented closer to the N-terminal as well as being obscurely accessible for the solvent (see Figure 3C ), compared to those located closer to the C-terminal across all proteins. Asn49, which showed a higher average BC compared to these interface residues, was completely conserved across all sequences, whereas the rest in Motifs 5 and 7 displayed very low conservation. Val201 (HtCA numbering) was observed to have higher average BC compared to most of the interface residues at most temperatures. Hydrogen bond interactions in the interface are further analyzed in Section 2.4.4.

Following the identification of interface residues in Section 2.2.2, it was interesting to identify the occurrence of hydrogen bonds in the interface across trajectories with an increase in temperature ( Figures S8-S11) . During this analysis, it was important to differentiate between the hydrogen bonds contributing to interface stability from those formed by temporary proximity of residues due to 

To gain insight on the effect of temperature on the synchronized movements between the residues of each protein, the correlation coefficients of motions between the Cα atoms were calculated as described in Materials and Methods (Section 3.9). Correlation matrices were plotted as heatmaps to visualize these motion dynamics (Figure 9 ). Common to all proteins, highly synchronized motions were observed between chains, confirming communication through the interface. This observation also signified the importance of oligomerization to the thermostability of these enzymes, and more so, their simulation in these states. The highest correlated intra-and inter-subunit motions at 423 K were observed for GHr1CA, NtCA, PhCA and VaCA1. Interestingly, GHr1CA was amongst the CAs showing increased fluctuations in RMSF analysis at this temperature. Unsynchronized motions between chains observed at 423 K for HtCA, LOGACA and SNbcCA, as well as the reduced correlations between the chains in PmCA at 393 K and 423 K, were consistent with the increase in intermittent hydrogen bonds observed in Section 2.4.3. 

To gain insight on the effect of temperature on the synchronized movements between the residues of each protein, the correlation coefficients of motions between the C α atoms were calculated as described in Materials and Methods (Section 3.9). Correlation matrices were plotted as heatmaps to visualize these motion dynamics (Figure 9 ). Common to all proteins, highly synchronized motions were observed between chains, confirming communication through the interface. This observation also signified the importance of oligomerization to the thermostability of these enzymes, and more so, their simulation in these states. The highest correlated intra-and inter-subunit motions at 423 K were observed for GHr1CA, NtCA, PhCA and VaCA1. Interestingly, GHr1CA was amongst the CAs showing increased fluctuations in RMSF analysis at this temperature. Unsynchronized motions between chains observed at 423 K for HtCA, LOGACA and SNbcCA, as well as the reduced correlations between the chains in PmCA at 393 K and 423 K, were consistent with the increase in intermittent hydrogen bonds observed in Section 2.4.3. TaCA displayed a behavior slightly different from the dimeric structures. In this tetrameric structure, chains A and B, which are diagonally oriented to each other, were observed to exhibit higher synchronized motions compared to those between chains A and C, which form the bacterial dimer. The same was observed for chains C and D in comparison to the dimer formed by chains B and D. This observation corroborates with the average BC analysis results, where the highest communication was observed passing through the Cys residues forming the inter-subunit disulfide bonds. Low anti-correlated motions were present between the adjacent monomers, i.e., chain A-D and chain B-C, forming the dimers in TaCA possibly due to reduced interactions between the chains. TaCA displayed a behavior slightly different from the dimeric structures. In this tetrameric structure, chains A and B, which are diagonally oriented to each other, were observed to exhibit higher synchronized motions compared to those between chains A and C, which form the bacterial dimer. The same was observed for chains C and D in comparison to the dimer formed by chains B and D. This observation corroborates with the average BC analysis results, where the highest communication was observed passing through the Cys residues forming the inter-subunit disulfide bonds. Low anti-correlated motions were present between the adjacent monomers, i.e., chain A-D and chain B-C, forming the dimers in TaCA possibly due to reduced interactions between the chains.

The overall methodology is shown in Figure 10 . 

The overall methodology is shown in Figure 10 . Overall methodology followed in this study. BC, DCC, R g and RMSF refer to betweenness centrality, dynamic cross correlation, radius of gyration and root mean square fluctuation, respectively.

Retrieval of sequences for this study was conducted in two consecutive parts. In the first part, Position-Specific Iterated BLAST (PSI-BLAST) in the National Center for Biotechnology Information (NCBI) [112] was utilized for two iterations to identify sequences that were homologous to the query sequences from three organisms originating from hydrothermal vents. These were CmCA (NCBI Accession Number: WP_007474387), and previously crystallized PmCA and TaCA (NCBI Accession numbers: WP_015898908.1 and WP_013538320.1, respectively) [29, 42, 51] . The second part of retrieval involved searching for published literature on the origins of the organisms, and only those verified to be from hydrothermal vents were considered. Further sequences that were more than 40% similar to the query, with query coverage above 90% were selected.

Overall, 15 sequences were identified from the organisms that were previously isolated from hydrothermal vents (Table S1 ). These organisms included C. mediatlanticus [51] , Geothermobacter sp. EPR-M [67] , Geothermobacter sp. HR-1 [68] , Hydrogenimonas thermophila [113] , Nitratiruptor tergarcus [114] , Persephonella hydrogeniphila [53] , P. marina [40, 42] , Sulfurovum lithotrophicum [55] , Sulfurovum riftiae [56] , Sulfurovum sp. NBC37-1 [57, 115] , T. ammonificans [29] , Vibrio antiquarius [116] and Vibrio diabolicus [54] . Two different α-CA sequences from the bacterium V. antiquarius were included in the list of sequences, as well as the unclassified sequence of α-CA structure LOGACA. 3D structures of LOGACA, PmCA and TaCA (PDB IDs: 6EKI, 6IM3 and 4C3T, respectively) were downloaded from the Protein Data Bank (PDB) [35, 52, 117, 118] . Table S1 summarizes the accession numbers and query coverages, as well as E-values corresponding to α-CAs of each bacterium with the class containing a total of 15 sequences. Sequence lengths and percentage identities to the query sequences are also shown in the table.

Multiple sequence alignment (MSA) was performed using the Tree-based Consistency Objective Function for Alignment Evaluation (T-Coffee) [72, 73] and Multiple Alignment using Fast Fourier Transform (MAFFT) [119] alignment programs with default settings, and visualized in Jalview v2 [120] . An all-versus-all sequence identity matrix was produced for the best-fitting alignment by use of an in-house Python script, which tallied the number of identical amino acid positions between each sequence and all other sequences in the dataset, representing the total as a fraction of the sequence length in the matrix [121] . A heat map of the matrix was subsequently generated using GNU Octave v4.0.0 [122] .

All α-CA sequences in the dataset, except those for crystallized LOGACA, PmCA and TaCA, were simultaneously submitted for signal peptide prediction using two different prediction servers, Signal-Blast and Phobius [86, 87] . The consensus of the servers was used to discern the presence or the absence of a signal peptide in each of the proteins.

MEGA v7 [123] was employed for phylogenetic tree calculations using the T-Coffee MSA as the input. The tree models were generated with gap deletions of 90%, 95% and 100%, using the strong branch filter option. Bayesian Information Criterion (BIC) scores were used to rank the models, and the first 3 models displaying the lowest BIC scores were selected for phylogenetic tree generation. The top three models for all gap deletions were the Le and Gascuel (LG) model with discrete gamma distribution and invariable sites (LGGI), the Whelan and Goldman (WAG) model with discrete gamma distribution (WAGG), and the WAG model with discrete gamma distribution and invariable sites (WAGGI) [124, 125] . The Nearest-Neighbor-Interchange (NNI) maximum likelihood method was utilized, with the initial tree made by employment of the default neighbor joining (NJ/BioNJ) method. A total of nine trees were produced (3 models × 3 gap deletions) with 1000 bootstrap replications each. Dendroscope v3.5.9 [126] was used to visualize the phylogenetic trees, and the branching consistency of each tree with its corresponding consensus tree was checked. The best model and gap deletion were picked from the trees which showed similar branching and bootstrap values to their consensus trees.

Motif analysis was carried out using Multiple Expectation Maximisation for Motif Elicitation (MEME) v4.11 [88] , with a motif width of 3-20 amino acids, to identify conserved patterns in potentially important functional regions. A maximum of 20 motifs was searched. Overlapping motifs were identified using motif pairwise correlations calculated by the Motif Alignment and Search Tool (MAST) [127] . High correlations observed above 0.6 indicated motif redundancy, and these motifs were consequently not included in the results. Those that showed statistical insignificance (E-value > 0.05) were also removed from the results. An in-house MATLAB script which calculates the frequency of motif occurrence across the sequences was used to produce a heat map displaying motif conservation [121, 128] .

LOGACA (PDB ID: 6EKI) was selected as a template and used to model all α-CA proteins except for PmCA and TaCA, whose structures were obtained from the PDB for further analyses. 3D calculations proceeded, utilizing the slow refinement option in automodel by MODELLER v9.20 [129] . The α-CA proteins were modelled as dimers with 100 models being produced per sequence. The first 5 models exhibiting the lowest z-DOPE scores were chosen for further structural validation using Verify3D [93] , PROCHECK [92] as well as ProSA [130] servers. Motifs, generated using MEME suite, were mapped onto a representative structure using PyMOL v1.7.2.1 [131] , a molecular graphics program.

To identify interface residues of the multimeric proteins, all models and crystal structures were subjected to interface analysis using five different web servers: Hotregion [132] ; Knowledge-based FADE and Contacts (KFC) server [133] ; the Protein Interfaces, Surfaces and Assemblies (PDBePISA) web service [96] ; PPCheck [134] ; and Robetta [94] . The web servers, excluding PDBePISA, were further queried for hotspot residues in the interface. PDBePISA is a web-based tool that implements the graph theory method in analyzing macromolecular complex interfaces, producing information such as the solvent accessible surface areas, BSA and interface residues, as well as inter-subunit hydrogen bonds and salt bridges [96] . Hotregion and KFC make use of machine learning to predict interfacing residues. Robetta identifies interface residues and mutates them to alanine, then calculates the change in binding free energy (∆∆G bind ) upon mutation. A change in ∆∆G bind above 1 kcal/mol indicates that mutation of the residue results in destabilization, thus it is regarded as a hotspot residue [94] . PPCheck identifies hotspot residues using alanine scanning as well. Interface residues and hotspot were then selected by consensus approach, as such, for interface residues three out of five, and for hotspot residues three out of four web servers were required to agree. All these residues, along with motifs generated using MEME suite, were mapped onto a representative dimeric structure using PyMOL [131] .

Intra-subunit interactions were analyzed using the Protein Interactions Calculator (PIC) web server [95] . It uses standardized cut-offs for a wide range of interactions, including disulfide bridges, hydrophobic, aromatic-aromatic, ionic and cation-Pi interactions as well as hydrogen bonds. In this study, ionic interactions were queried with a cut-off distance of 4 Å.

MD simulations were performed on the dimeric (tetrameric for TaCA) assemblies of the proteins at four different temperatures; 300 K, 363 K, 393 K and 423 K. H++ server [135] was used to protonate the structures at pH 8.00 to mimic the alkaline environment of CO 2 sequestration. All His residues coordinating Zn 2+ were manually cross-checked for the correct protonation states. The first two His residues were confirmed to be HIDs, with the δ-nitrogen protonated and the third His residue an HIE, with the ε-nitrogen protonated. Zn 2+ parameters previously generated by Sanyanga et al. [82] using AmberTools17 [136] and Gaussian09 [137] were inferred onto the α-CA structures. The AMBER force field ff14SB [138] was utilized. Coordinate and topology files were produced using tleap [139] and the system was solvated with a cubic water box of 10 Å. ACPYPE (AnteChamber PYthon Parser interfacE) [140] was utilized for the generation of gmx files for minimization as well as subsequent equilibration and simulation using GROMACS 2016.1 [141] at each of the four temperatures. Steepest descent minimization was performed with an energy step size of 0.01, followed by NVT/canonical ensemble equilibration at 300 K, 363 K, 393 K and 423 K in order to stabilize the temperatures. NPT/isothermal-isobaric equilibration was performed at the temperature of the corresponding NVT, and a pressure of 1 atm followed by 50 ns MD simulations at the same temperature. All calculations were performed at Centre for High Performance Computing (CHPC) clusters in Cape Town (SA). All simulations ran with a time step of 2 fs and coordinates were saved every 10 ps during the simulation. Root mean square fluctuation (RMSF) values were calculated for the residues of each complex. The radius of gyration (R g )was calculated for the individual chains of each complex, and for the individual catalytic pockets. Hydrogen bond interactions across the multimers were identified using the hbond command in cpptraj [104] from AmberTools17 with a donor-acceptor distance cut-off of 3.5 Å. Define Secondary Structure of Protein (DSSP) analysis [111] of interface residues was performed using the secstruct command, also in cpptraj.

Betweenness centrality (BC) for each residue in the multimeric α-Cas and C β atoms (C α for Gly), was calculated by use of the MD-TASK tool [50] . BC measures the usage of a node by calculating how often the shortest paths from all other nodes pass through it. The script, calc_network.py, in MD-TASK was used for the calculation of BC over the last 20 ns of the trajectories with a threshold of 6.7 Å, followed by the calculation of average BC using the avg_network.py script over MD trajectories. Average BC for each protein was normalized, scaling all values into the range 0-1. High average BC values directly translate to high residue usage, substantiating its importance in protein communication.

Dynamic cross correlation (DCC) analysis was implemented for the C α atoms in each multimeric protein. Matrices were produced by MD-TASK software using the calc_correlation.py script over the entire 50 ns trajectories using the following equation: 

The urgency of the discovery of thermostable carbon dioxide sequestration agents motivated the present study of α-CAs from bacteria with hydrothermal vent origins. With most studies focusing on the monomeric forms of the enzymes, this study reflected markedly on the importance of the protein-protein interactions formed by the multimeric proteins for their function, as well as stability, at high temperatures.

Various computational techniques were successfully employed in the retrieval and analysis of a total of 15 CAs, including LOGACA, PmCA and TaCA, which had predetermined structures. The other structures were calculated using homology modelling and several validation programs deemed the dimeric structures to be of good quality. Sequence alignments and motif analysis revealed high levels of conservation amongst the CAs with all but two (Geothermobacter CAs) showing the presence of signal peptides. Hotspot and other interface residues were identified for the multimeric structures by various web servers. Importance of these residues in protein function was further fortified by their presence in the group of residues showing high usage via average BC calculations over MD simulation trajectories at 300 K. The detection of important intra-subunit hydrogen bond networks was also achieved. The unique network identified in NtCA, involving a combination of high communication residues, hotspot residues as well as Cys205 (NtCA numbering) which forms an intra-subunit disulfide bond, is proposed to have a significant contribution towards its thermostability.

Residue composition of the interface motifs proved crucial to protein thermostability after investigating behavior of the interface residues using RMSF, DSSP and average BC analyses of simulations at 300 K, 363 K, 393 K and 423 K. CmCA, LOGACA, PhCA and PmCA, which contained hydrophobic residues shielding the interface core from solvent accessibility, demonstrated lower fluctuations, and consequently higher thermostability properties. This was also accompanied by consistency in interface hydrogen bonds at high temperatures. All proteins showed high correlation between chains at one or more temperatures, further fortifying the importance of interface residues. TaCA portrayed a slightly different behavior compared to the rest of the CAs, with further confirmation of the two disulfide bonds in the tetramer interface as the main source of its thermostability. The importance of the catalytic cavity was also considered and monitored using R g analysis, with most proteins maintaining compactness of the cavities at high temperatures. It is also worth noting that, for all CAs, asymmetrical behavior was observed for monomers of the same protein throughout the analyses.

Ultimately, a consensus of all the methods utilized in this study revealed NtCA, besides those previously characterized, as the CA with the highest potential for thermostability, showing low protein residue fluctuations and high synchronized motions, as well as the maintenance of chain and cavity compactness at high temperatures. Runner-up candidates include GEprmCA, PhCA, SlCA and VaCA1. Overall, this study showed the importance of simulating these proteins in their multimeric assemblies and serves as a basis for in vitro analysis of these α-CAs to prove their usefulness in the biotechnology industry as thermostable CO 2 sequestration agents. It also revealed important features of the CAs responsible for thermostability, and thus serves as a basis for engineering of less thermostable dimeric CAs. Acknowledgments: Many thanks go to the Centre for High Performance Computing (CHPC) for computing time resources. The authors would also like to acknowledge T.A. Sanyanga for the constructive discussions.

The authors declare no conflict of interest. Vibrio antiquarius carbonic anhydrase 1 VaCA2

Vibrio antiquarius carbonic anhydrase 2 VdCA

Vibrio diabolicus carbonic anhydrase

Livestock as sources of greenhouse gases and its significance to climate change

Contribution of Working Groups I, II and III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change

accelerated carbonation of brucite in mine tailings for carbon sequestration

The progressive routes for carbon capture and sequestration

Carbonic anhydrase mediated carbon dioxide sequestration: Promises, challenges and future prospects

Revised methane emissions factors and spatially distributed annual carbon fluxes for global livestock. Carbon Balance Manag

National Oceanic and Atmospheric Administration

Biomineralization-based conversion of carbon dioxide to calcium carbonate using recombinant carbonic anhydrase

Microbially enhanced carbon capture and storage by mineral-trapping and solubility-trapping

Investigating the application of enzyme carbonic anhydrase for CO 2 sequestration purposes

Biotechnology for the acceleration of carbon dioxide capture and sequestration

The use of carbonic anhydrase to accelerate carbon dioxide capture processes

Carbonic anhydrase: New insights for an ancient enzyme

An overview of the alpha-, beta-and gamma-carbonic anhydrases from Bacteria: Can bacterial carbonic anhydrases shed new light on evolution of bacteria?

Biochemistry: A cadmium enzyme from a marine diatom

Cadmium-containing carbonic anhydrase CDCA1 in marine diatom Thalassiosira weissflogii

The γ class of carbonic anhydrases

Thermostable carbonic anhydrases in biotechnological applications

Thylakoid luminal θ-carbonic anhydrase critical for growth and photosynthesis in the marine diatom Phaeodactylum tricornutum

The many types of carbonic anhydrases in photosynthetic organisms

Efficient carbon dioxide sequestration by using recombinant carbonic anhydrase

A new widespread subclass of carbonic anhydrase in marine phytoplankton

Microbial carbonic anhydrases in biomimetic carbon sequestration for mitigating global warming: Prospects and perspectives

On carbon dioxide storage based on biomineralization strategies

Role of histidine 64 in the catalytic mechanism of human carbonic anhydrase II studied with a site-specific mutant

Carbonic anhydrase II based biosensing of carbon dioxide at high temperature: An analytical and MD simulation study

Crystal structure of the secretory form of membrane-associated human carbonic anhydrase IV at 2.8-A resolution

Structural analysis of inhibitor binding to human carbonic anhydrase II

The structure of a tetrameric α-carbonic anhydrase from Thermovibrio ammonificans reveals a core formed around intermolecular disulfides that contribute to its thermostability

Structure and mechanism of carbonic anhydrase

Structure and function of carbonic anhydrases

Catalytic mechanism of α-class carbonic anhydrases: CO 2 hydration and proton transfer

Structure and kinetics of human carbonic anhydrase II mutants at residue Val-121

Crystal structure of the most catalytically effective carbonic anhydrase enzyme known, SazCA from the thermophilic bacterium Sulfurihydrogenibium azorense

Structure of a hyperthermostable carbonic anhydrase identified from an active hydrothermal vent chimney

Mechanism of action of carbonic anhydrase substrate, sulfonamide, and anion binding

Solvent-Mediated proton transfer in catalysis by carbonic anhydrase

Heat-Stable Carbonic Anhydrases and Their Use

Bacterial extremo-α-carbonic anhydrases from deep-sea hydrothermal vents as potential biocatalysts for CO 2 sequestration

Highly thermostable carbonic anhydrase from Persephonella marina EX-H1: Its expression and characterization for CO 2 -sequestration applications

Engineering of Thermovibrio ammonificans carbonic anhydrase mutants with increased thermostability

Crystal structure of a highly thermostable α-carbonic anhydrase from Persephonella marina EX-H1

Reassessing buried surface areas in protein-protein complexes

Transient protein-protein interactions

Anatomy of hot spots in protein interfaces

The in silico prediction of hotspot residues that contribute to the structural stability of subunit interfaces of a picornavirus capsid

Horizontal transfer of β-carbonic anhydrase genes from prokaryotes to protozoans, insects, and nematodes. Parasites Vectors

Involvement of β-Carbonic Anhydrase Genes in Bacterial Genomic Islands and Their Horizontal Transfer to Protists

Interacting motif networks located in hotspots associated with RNA release are conserved in Enterovirus capsids

MD-TASK: A software suite for analyzing molecular dynamics trajectories

Caminibacter mediatlanticus sp. nov., a thermophilic, chemolithoautotrophic, nitrate-ammonifying bacterium isolated from a deep-sea hydrothermal vent on the Mid-Atlantic Ridge

Persephonella marina gen. nov., sp. nov. and Persephonella guaymasensis sp. nov., two novel, thermophilic, hydrogen-oxidizing microaerophiles from deep-sea hydrothermal vents

Persephonella hydrogeniphila sp. nov., a novel thermophilic, hydrogen-oxidizing bacterium from a deep-sea hydrothermal vent chimney

Vibrio diabolicus sp. nov., a new polysaccharide-secreting organism isolated from a deep-sea hydrothermal vent polychaete annelid, Alvinella pompejana

Sulfurovum lithotrophicum gen. nov., sp. nov., a novel sulfur-oxidizing chemolithoautotroph within the ε-Proteobacteria isolated from Okinawa trough hydrothermal sediments

Sulfurovum riftiae sp. nov., a mesophilic, thiosulfate-oxidizing, nitrate-reducing chemolithoautotrophic epsilonproteobacterium isolated from the tube of the deep-sea hydrothermal vent polychaete Riftia pachyptila

Deep-sea vent -proteobacterial genomes provide insights into emergence of pathogens

Comparative genomic analysis of the class Epsilonproteobacteria and proposed reclassification to Epsilonbacteraeota (phyl. nov.). Front. Microbiol. 2017, 8, 682

Addendum: Comparative genomic analysis of the class Epsilonproteobacteria and proposed reclassification to Epsilonbacteraeota (phyl

Single cell genomics-based analysis of gene content and expression of prophages in a diffuse-flow deep-sea hydrothermal system

Chemoautotrophy at deep-sea vents: Past, present, and future

Isolation and phylogenetic diversity of members of previously uncultivated ε-Proteobacteria in deep-sea hydrothermal fields

Genome-facilitated discovery of RND efflux pump-mediated resistance to cephalosporins in Vibrio spp. isolated from the mummichog fish gut

Comparative genomic analysis of Vibrio diabolicus and six taxonomic synonyms: A first look at the distribution and diversity of the expanded species

Thermophily in the Geobacteraceae: Geothermobacter ehrlichii gen. nov., sp. nov., a novel thermophilic member of the Geobacteraceae from the "Bag City" hydrothermal vent

Genome sequence of Geothermobacter sp. strain EPR-M, a deep-sea hydrothermal vent iron reducer

Genome Sequence of Geothermobacter sp. Strain HR-1, an Iron Reducer from the Lō 'ihi Seamount

Relative importance of chemoautotrophy for primary production in a light exposed marine shallow hydrothermal system

Geographic distance and amorphous iron affect the abundance and distribution of Geobacteraceae in paddy soils in China

The globally widespread genus Sulfurimonas: Versatile energy metabolisms and adaptations to redox clines

T-coffee: A novel method for fast and accurate multiple sequence alignment1

A web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension

X-ray structure of the first extremo-α-carbonic anhydrase', a dimeric enzyme from the thermophilic bacterium Sulfurihydrogenibium yellowstonense YO3AOP1

Crystal structure of carbonic anhydrase from Neisseria gonorrhoeae and its complex with the inhibitor acetazolamide

The unique structure of carbonic anhydrase αCA1 from Chlamydomonas reinhardtii

Carbonic anhydrase XII functions in health and disease

Engineering the hydrophobic pocket of carbonic anhydrase II

Structural basis for the inhibition of Helicobacter pylori α-carbonic anhydrase by sulfonamides

Binding of substrate CO 2 to the active site of human carbonic anhydrase II: A molecular dynamics study

Entrapment of carbon dioxide in the active site of carbonic anhydrase II

Mechanism of Action of Non-Synonymous Single Nucleotide Variations Associated with α-Carbonic Anhydrase II Deficiency

The bacterial cell envelope

Acatalytic" Carbonic Anhydrase-Related Protein XI Contains a Sequence Motif Present in the Neuropeptide Sauvagine: The Human CA-RPXI Gene (CA11) Is Embedded between the Secretor Gene Cluster and the DBP Gene at 19q13. 3. Genom

The complete sequence, expression in Escherichia coli, purification and some properties of carbonic anhydrase from Neisseria gonorrhoeae

High-performance signal peptide prediction based on sequence alignment techniques

A combined transmembrane topology and signal peptide prediction method

Tools for motif discovery and searching

Zinc coordination spheres in protein structures

Multiple binding modes of inhibitors to carbonic anhydrases: How to design specific drugs targeting 15 different isoforms?

Structure of α-carbonic anhydrase from the human pathogen Helicobacter pylori

PROCHECK: A program to check the stereochemical quality of protein structures

VERIFY3D: Assessment of protein models with three-dimensional profiles

PIC: Protein Interactions Calculator

Inference of macromolecular assemblies from crystalline state

Protein stabilization by salt bridges: Concepts, experimental approaches and clarification of some misunderstandings

ESBRI: A web server for evaluating salt bridges in proteins

Stabilizing Salt-Bridge Enhances Protein Thermostability by Reducing the Heat Capacity Change of Unfolding

Entropic contribution to enhanced thermal stability in the thermostable P450 CYP119

Dynamics, a Powerful Component of Current and Future in Silico Approaches for Protein Design and Engineering

Contribution of hydrogen bonds to protein stability

Software for processing and analysis of molecular dynamics trajectory data

Impact of Early Pandemic Stage Mutations on Molecular Dynamics of SARS-CoV-2 Mpro

Relationship of protein flexibility to thermostability

Hyperthermophilic Enzymes: Sources, Uses, and Molecular Mechanisms for

Importance of secondary structural specificity determinants in protein folding: Insertion of a native β-sheet sequence into an α-helical coiled-coil

In silico designing of an industrially sustainable carbonic anhydrase using molecular dynamics simulation

Exploring early stages of the chemical unfolding of proteins at the proteome scale

Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features

Hydrogenimonas thermophila gen. nov., sp. nov., a novel thermophilic, hydrogen-oxidizing chemolithoautotroph within the ε-Proteobacteria, isolated from a black smoker in a Central Indian Ridge hydrothermal field

Nitratiruptor tergarcus gen. nov., sp. nov. and Nitratifractor salsuginis gen. nov., sp. nov., nitrate-reducing chemolithoautotrophs of the ε-Proteobacteria isolated from a deep-sea hydrothermal system in the Mid-Okinawa Trough

Distribution, phylogenetic diversity and physiological characteristics of epsilon-Proteobacteria in a deep-sea hydrothermal field

Deep-sea hydrothermal vent bacteria related to human pathogenic Vibrio species

The protein data bank

Thermovibrio ammonificans sp. nov., a thermophilic, chemolithotrophic, nitrate-ammonifying bacterium from deep-sea hydrothermal vents

A novel method for rapid multiple sequence alignment based on fast Fourier transform

Jalview Version 2-a multiple sequence alignment editor and analysis workbench

Plasmodium falciparum Hop: Detailed analysis on complex formation with Hsp70 and Hsp90

GNU Octave version 4.0. 0 manual: A high-level interactive language for numerical computations

MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets

An improved general amino acid replacement matrix

A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach

Dendroscope: An interactive viewer for large phylogenetic trees

Combining evidence using p-values: Application to sequence homology searches

Human, vector and parasite Hsp90 proteins: A comparative bioinformatics analysis

Comparative Protein Structure Modeling Using Modeller

ProSA-web: Interactive web service for the recognition of errors in three-dimensional structures of proteins

PyMOL: An open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr

HotRegion: A database of predicted hot spot clusters

Interactive forecasting of protein interaction hot spots

PPCheck: A webserver for the quantitative analysis of protein-protein interfaces and prediction of residue hotspots

H++: A server for estimating pKas and adding missing hydrogens to macromolecules

Improving the accuracy of protein side chain and backbone parameters from ff99SB

ACPYPE antechamber python parser interface

High performance molecular simulations through multi-level parallelism from laptops to supercomputers

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.