key: cord-0298548-mvrudk4w
authors: Morel, Marie; Lemoine, Frédéric; Gascuel, Olivier
title: Sensitive Detection of Site-wise Convergent Evolution in Large Protein Alignments with ConDor
date: 2021-07-01
journal: bioRxiv
DOI: 10.1101/2021.06.30.450558
sha: 97321bb3d2ed01428a212305a91939df6e066402
doc_id: 298548
cord_uid: mvrudk4w

Evolutionary convergences are observed at all levels, from phenotype to DNA and protein sequences, and the changes observed at these different levels tend to be strongly correlated. Here we propose a simulation-based method to detect positions under convergent evolution in large protein alignments, without prior knowledge on the phenotype and environmental constraints. A phylogeny is inferred from the data and used in simulations to estimate the expected number of amino-acid changes in stable evolutionary constraints (null model) for each position. Similarly, we count the number of mutations towards the same amino acid in the data and test if they are occurring more often than expected. We applied our method to two real datasets: HIV reverse transcriptase and fish rhodopsin, and to HIV-like simulated data. On the latter, with known convergent events and substitution model, we detected on average two third of these events, with a low fraction of false positives. With HIV data, one knows that drug resistance mutations (DRMs) are convergent. Even without any knowledge of patient treatment status, we retrieved more than 70% of positions corresponding to known DRMs. On the rhodopsin dataset, four substitutions are supposed to be convergent, as they change the maximum wavelength absorption of the photoreceptor and occurred several times independently during evolution. We detected three of them. These results demonstrate the potential of the method to target specific mutations to be further studied experimentally or, for example, using a nonsynonymous/synonymous rate ratio approach. Our software named ConDor is available at http://condor.pasteur.cloud.

Convergent evolution can be defined as the independent acquisition of similar traits in distinct lineages over the course of evolution. The studied traits can be behavioral, morphological, molecular, etc. In each category, traits can be quantitative (size, length, etc.), binary (presence or absence of a given phenotype) or categorical (a trait is subdivided into several classes). Recently, several studies focused on the molecular level, following the hypothesis that convergent phenotypes generally result from the same genetic changes (Stern 2013; Rosenblum et al. 2014; Storz 2016) . At the protein level, we commonly distinguish parallel mutations (a change towards the same amino acid is observed from the same ancestral amino acids), convergent mutations (change towards the same amino acid, from different ancestral amino acids) and reversions (mutations that restore an amino acid previously lost during evolution).

Examples of evolutionary convergence at the molecular level have been highlighted in higher eukaryotes in relation to adaptation to certain environments (Muschick et al. 2012; Foll et al. 2014; Foote et al. 2015; Hill et al. 2019; Lu et al. 2020; Xu et al. 2020) , diet (Zhang 2006; Zhen et al. 2012; Ujvari et al. 2015; Hu et al. 2017) , metabolic or morphological changes and the acquisition of new abilities (Davies et al. 2012; Parker et al. 2013 ; Thomas and Hahn 2015; Parto and Lartillot 2018; Marcovitz et al. 2019; Chai et al. 2020; Lee et al. 2018 Jan 8) . Similarly, when submitted to constraints such as experimental conditions or drug treatments, microorganisms and viruses adapt and are likely to exhibit similar escapes. This has been demonstrated in HIV after exposure to antiviral drug treatments in several patients (Crandall et al. 1999) and within a single treated patient (Holmes et al. 1992) . Similarly, several authors found adaptive convergence in experimental populations of RNA viruses (Cuevas et al. 2002) and in pathogenic bacteria (van Ditmarsch et al. 2013) . In natural conditions, evolutionary convergence was found in viruses having experienced host shifts (Longdon et al. 2018 ) and changes of vector specificity (Tsetsarkin et al. 2007 ).

Several methods have been developed to detect convergent evolution at the molecular level (Zhang and Kumar 1997; Zhang 2006; Tamuri et al. 2009 ; Thomas and Hahn 2015; Chabrol et al. 2018; Rey et al. 2018; etc) . They are all based on the prior knowledge or observation of a convergent phenotype and aim to identify protein mutations that correlate with the presence of the converging trait. Two major types of approaches can be distinguished, depending on the scale at which evolutionary convergence is studied. Some approaches aim to identify which coding genes show mutations supporting a convergent phenotype, while others study which amino-acid changes can explain convergent changes at the scale of a single protein. Methods of the first category are commonly applied to eukaryotic and prokaryotic genomes and perform genome-wide analyses to detect convergent genes by considering simultaneously all positions of the corresponding protein sequences; for example, the methods developed by Zou et al, Thomas et al and Chabrol et al were applied to the search of genes responsible for echolocation in mammals (Thomas and Hahn 2015; Chabrol et al. 2018 ). In the second configuration, the coding genes responsible for the convergent phenotype have already been identified and the methods focus on the detection of converging mutations at the site level; for example, Zhang et al identified convergent and parallel mutations in stomach lysozyme sequences of foregut fermenters (Zhang and Kumar 1997) . Similarly, Zhang found parallel substitutions in colobine pancreatic ribonucleases (Zhang 2006) , and Rey et al found positions with convergent substitutions in the PEPC protein occurring jointly with the transition toward C4 metabolism in sedges (Rey et al. 2018 ).

As mentioned above, both types of approaches rely on the identification of mutations correlated with a convergent phenotype. More specifically, they rely on ancestral sequence reconstruction to detect independent changes that occurred at the same position towards the same amino acid within different lineages having the convergent phenotype of interest. Considering that a change towards the exact same amino acid could be too strict, since several amino acids can have similar physico-chemical properties, Rey et al looked for shifts in amino-acid profiles Rey et al. 2018) . Identifying those amino-acid changes or shifts is not sufficient as they could occur by chance. To statistically test the strength of convergent evolution at the site level, Chabrol et al defined a new "convergence index" and used simulations to estimate de distribution of this index in a null, non-convergent model (Chabrol et al. 2018 ). Rey et al selected convergent positions based on the log ratio of the posterior of the studied position assuming the convergent model (shift of the amino-acid profile) versus the one obtained with the null model (homogeneous amino-acid profile along all branches) (Rey et al. 2018) .

Testing the significance of convergent (or parallel or revertant) changes at the site level in proteins has many potential applications. In the case of complex eukaryotic or bacterial organisms, there are few examples of a single amino-acid change that could explain a convergent phenotype (Storz 2016) . However, in the case of viruses with rapid evolution, and whose (small) genomes are strongly constrained, only a few amino-acid changes are generally possible at a given position ) and site-wise convergent evolution is expected to be relatively frequent (Gutierrez et al. 2019) . Determining molecular changes that deviate from what is expected by chance can thus be indicative of adaptive phenomena. This is in fact what was observed with SARS-CoV-2, where one first identified mutations in the Spike protein, which were spreading within the viral population and appeared multiple times independently, before being demonstrated to be evolutionary advantageous for the virus (Korber et al. 2020; Martin et al. 2021; van Dorp et al. 2020) . Indeed, in viruses it is often easier to identify a mutation of interest than to observe the effects of that mutation given how difficult the phenotype of a virus and the environmental conditions in which it evolves are to access.

In some ways, identifying those mutations of interest presents similarities with the detection of positions under positive selection (Goldman and Yang 1994) . The idea is indeed to identify mutations that could be advantageous as they are found independently more often than expected under a neutral (or purifying) model of evolution. Positive selection can be inferred at a position if the rate of non-synonymous substitutions exceeds the rate of synonymous substitutions. These substitutions can be towards a specific amino acid or any change from the original amino acid. This is, for example, the case in immune avoidance where the trend to mutate towards any new amino acid at the antigenic positions is generally favorable and positively selected. Conversely, in the case of convergent evolution, we are interested in substitutions towards one or a few similar amino acids. We thus expect that positive selection is a necessary, but not sufficient condition for a position to be convergent. We shall see that our results confirm this intuition.

Here we propose a method designed to detect site-wise convergent evolution in large aminoacid alignments without prior knowledge of phenotype. This method performs detailed analysis at the gene/protein level, with typical application to viruses, but also to specific genes known to be involved in phenotypic convergence (Hill et al. 2019) . We are interested in changes towards a target amino acid regardless of the ancestral amino acids that lead to the difference in the extant amino acid sequences.

In other words, parallel, convergent and revertant mutations are considered indifferently and we consider different target amino acids as different events. The observed number of amino-acid changes is estimated using ancestral character reconstruction, and their expected number in a null model using computer simulations. In the following sections, we describe this approach that is implemented in a software and web service named ConDor (Convergence Detector; condor.pasteur.cloud). Its performance is assessed on HIV-like simulated datasets, on a real HIV reverse transcriptase dataset and on a fish rhodopsin dataset. We notably compare its performance in retrieving drug resistance mutations in HIV, with the results of a standard positive selection-based approach.

Our method identifies amino-acid mutations that emerged several times in independent lineages and occurred significantly more frequently than expected under a null model of evolution.

The workflow of the method is presented in Figure 1 . It is made up of four main steps: (1) estimate the parameters of the null model from the real data (phylogenetic tree, parameters of the substitution model, site-wise evolutionary rates, etc.); (2) infer ancestral amino acids and count the number of observed emergence events of mutations (EEMs) for every position and amino acid of interest in the real data; (3) simulate new datasets under the inferred null model and count simulated EEMs; (4) compare the observed and the simulated numbers of EEMs, and then determine which mutations occurred significantly more often than expected by chance, assuming the null model. Such mutations are considered as convergent events.

The null model is inferred from the input alignment. The selected substitution model, along with amino-acid frequencies, rates-across-sites distribution parameters, tree topology, branch lengths and site-wise evolutionary rates are assumed to represent the data without convergence. We make this assumption because using large alignments (>1000 sequences), we consider that mutations resulting from convergent evolution are rare enough to have a negligible influence on tree and parameter inference. The reconstructed phylogeny is then rooted using the provided outgroup. This is essential to infer the ancestral sequence at the root of the tree, run simulations starting from this sequence, and count simulated EEMs. Ancestral character reconstruction (ACR) is achieved using a maximum likelihood approach, implemented in PastML . We use the "maximum a posteriori" (MAP) method in which the state with the highest marginal posterior is selected at each tree node. Once all ancestral sequences are reconstructed and associated to the nodes in the phylogeny, we identify where independent amino-acid changes occurred in the tree and count them as explained in the subsection "Counting emergence events". This counting gives the observed number of EEMs for each alignment position and amino acid under study, that is, those that are observed at the given position enough time (≥12 sequences in our HIV experiments) and in more than 2 independent clades.

We then simulate the expected evolution without convergence of each position of the alignment many times (10,000 in our experiments). We do not use the root sequence reconstructed by ACR as a start, but draw amino acids based on their marginal posterior probabilities. Taking only Figure 1 : Flowchart of the method. The method takes as input an amino-acid alignment, which is used for inference of the null model (phylogenetic tree, substitution model and its parameters, site-wise evolutionary rates, etc.) and ancestral reconstruction. The reconstructed tree and (probabilistic) root sequence are used to simulate 10,000 alignments under the null model. The output is a list of amino-acid changes per position that seem to be convergent, as they emerge more often in the input alignment than in simulations. number of simulated EEMs ranges from 0 to 31 with an average of 12. From the observed number of EEMs and the distribution of simulated EEMs, we estimate a p-value for each observed mutation, which is equal to 0 in our M41L example. After correction for multiple testing, mutations with p-values lower than the rejection criterion are considered as resulting from convergent evolution.

The observed number of EEMs is inferred by ACR based on the input sequences, while the expected number of EEMs and its distribution are estimated from many simulations evolving the probabilistic root sequence along the inferred tree. In simulations, changes may appear which cannot be inferred by ACR, in particular when they are not transmitted to any tree leaf. In this case, the expected number of changes artificially deviates from the ACR-based "observed" number of EEMs.

This effect is even more pronounced on positions with rapid evolution since more changes are expected. Thus, only the changes transmitted to at least one leaf are counted in our method, since they are the only ones that could be found by ACR. Note, moreover, that generally we are only interested by the amino acids present in the available, actual sequences, and rarely by those that are never observed. In the tree illustrated in Figure 2 , we have 6 changes along the branches, which are represented either by a cross or by a NO symbol. The NO symbol stands for a change of the blue state towards the red one, but the red state is then lost in the subtree. With ACR, this node would have been either blue or yellow, but never red, while this might occur in simulations. Thus, in our counting, we do not consider this EEM towards the red state, and only the one in the upper subtree is counted.

The two yellow crosses mark changes transmitted to one leaf in the upper subtree and two tips in the bottom subtree; thus, both are counted. The two blue crosses mark a return to the ancestral state present at the tree root and are reversions. Even though we count EEMs without making the difference between convergent, parallel and revertant events, we retain the information during the counting process for interpretation afterwards.

To our knowledge, there is no convergent evolution model allowing simulating thousands of sequences without prior knowledge of the phenotype or environmental constraints. We therefore created our own convergent dataset inspired from a real case of convergence. Drug resistance mutations (DRMs) occur independently in patients receiving a drug treatment and thus are a perfect example of evolutionary convergence. In HIV, they are well characterized and studied since their emergence can lead to treatment failure and transmission of resistant viruses. They are mostly found in proteins targeted by the antiretroviral treatment: the protease, the reverse transcriptase and the integrase. The list of known DRMs affecting these proteins is publicly available at https://hivdb.stanford.edu/. DRMs are written as "XposY", with X the ancestral amino acid, "pos" the position of the substitution in the protein alignment, with numbering based on the reference sequence HXB2, and Y the mutated amino acid.

We extracted positions and sequences with DRMs from a real HIV polymerase amino-acid dataset (described in Material and Methods) and replaced the corresponding residues with gaps in the multiple sequence alignment (MSA). DRMs were retrieved from the Stanford University Drug resistance database "Essential DRM Data" lists (https://hivdb.stanford.edu/pages/poc.html). We then reconstructed a tree, estimated the substitution model parameters, and inferred the rates of evolution per position (see Material and Methods for details). This tree represents the evolutionary relationships between the sequences in the real data and its structure is not affected by the DRMs.

Such procedure is standard when inferring trees from HIV sequences, to avoid convergence perturbations of the inferred tree. We then simulated the evolution of HXB2 (GenBank TM accession number K03455) reference sequence (reverse transcriptase only) along this tree. We performed the simulations 5 times for robustness purposes, resulting in 5 MSAs without convergence. DRMs were then manually added in the sequences and positions where they were found in the real MSA. For example, mutation M41L was found in the real data in 211 sequences, so we replaced in the corresponding sequences of our simulated alignment, any amino acid found at position 41 (which turned out to all be methionine (M), the same as HXB2 at that position) by a leucine (L). Since we used the real data tree to create the synthetic dataset, we did not randomly place DRMs, which would make the detection task easy. Methionine and leucine are closely related, with a genetic barrier of 1 (M and L can exchange through a single nucleotide substitution), and the detection is much more difficult than between highly different amino acids (e.g., D and W with a genetic barrier of 3). As M41L is at the same position and in the same tip sequences in our synthetic dataset as in real data, detecting this convergence in the synthetic and real datasets should be of similar difficulty. This guarantees a certain realism of our simulated data. We implemented this insertion procedure for the 37 most common DRMs of our real dataset (i.e., present in ≥ 12 sequences and > 2 independent clades). The fiveresulting synthetic MSAs thus have no convergent events, except the "realistic" added DRMs.

We applied our approach on three datasets for which we knew a priori mutations due to convergent evolution: (1) an HIV-like synthetic dataset with "realistic" added DRMs; (2) a real HIV dataset of reverse transcriptase with 20% sequences with DRMs; and (3) a real dataset of fish rhodopsin, a light-sensitive receptor protein that is highly conserved but known to vary at certain positions among species depending on their environment. On simulated data, we know exactly which mutations are true convergent events or not. With real data, even if we know certain convergent events, some other mutations are likely convergent, but are unknown. In other words, we will be able to assess the rate of false positives with simulated data, but not with real data, where the method sensitivity in detecting the known convergence events will be the main criterion.

The dataset consists in 5 MSAs of 3,551 sequences and 250 amino acids each, mimicking HIV reverse transcriptase and simulated with HIVb model of evolution (Nickle et al. 2007 ). In total, 37 DRMs were placed in each of the MSAs at 27 distinct positions (see above). These DRMs are found in at least 12 sequences. The most common one, M184V, is found in 273 (8%) sequences. However, 19

DRMs are found in less than 1% of the sequences (i.e., in 12 to 33 sequences) so they are expected to be difficult to detect.

The model inferred from the datasets by ModelFinder (Kalyaanamoorthy et al. 2017 

HIVb, which is the model we used for generating them. Thus, the whole analysis (tree reconstruction, ACR and simulations) was first achieved with HIVb, which is the true model of evolution. In a second stage, we also used JTT (Jones et al. 1992) , to study the impact of model misspecification. We tested on average 441 mutations per dataset, the ones present in at least 12 sequences and with more than 2 EEMs.

Using HIVb, on average 27.4 mutations are found to be convergent by our method, 26 of which are true DRMs (out of 37 added DRMs). The numbers of EEMs for the DRMs range from 9 (K101P) to 225 (M184V). We detect DRMs with a higher number of EEMs better, and especially those with more than 30 EEMs that we detect 90% as convergent, as illustrated in Figure 3a . If there are several DRMs at one position, we often only detect the most frequent one(s). For example, at position 219 (Fig3a, bold characters) we detect mutations towards Q and E but not N. Similarly, at position 215, we do not detect mutations towards S and D. However, we detect mutation T215C although there are fewer EEMs towards C. This is explained by the low substitution rate between T and C (BLOSUM62 score = -1), and thus few changes are expected from T to C; 2 EEMs between T and C at position 215 are observed on average over the 10,000 null-model simulations, while 17 are found in the synthetic, (1) detected and (2) tested but not detected on the 5 synthetic HIV-like MSAs, analyzed with HIVb and JTT substitution models. We report the average for the 5 datasets and the standard deviation between parentheses. True positives are at the intersection between detected and DRMs, and false positives at the intersection between detected and non-DRMs (i.e., mutations resulting from the evolution of the root sequence under the null model).

In these experiments, we tested all mutations (regardless of their DRM status) exhibiting more than 2 EEMs and found in at least 12 sequences. The strongest effect can be seen on the number of false positives, which increases from 1.4 to 17.2 (Tab. 1). Compared to the number of negatives, this remains very low, with 4% of falsely detected random mutations (i.e., mutations resulting from the evolution under the null model) among more than 400. Moreover, as we see Figure 

This dataset consists in a MSA of truncated polymerase from HIV-1 subtype B. It was retrieved from the paper by Lemoine et al (2018) . After removal of recombinant sequences, it contains 3,546 sequences of 1,043 nucleotide positions that were translated into 347 amino acids. Among these 347 amino acids, 250 are on the reverse transcriptase and are analyzed here. Slightly more than 20% of the sequences have at least one known DRM (https://hivdb.stanford.edu/pages/poc.html) and, on average, the DRMs are found in ~11 sequences. The most common one, M184V, is found in 273

sequences. There are 37 DRMs present in at least 12 sequences, corresponding to those used to generate the synthetic datasets. They are distributed on 27 positions. We focused on these 37 DRMs to assess the performance of our approach, but, as already explained, we expect to detect other mutations, some being truly convergent but unknown, and some corresponding to false positives, likely located on fast positions, due to model misspecification.

The evolutionary model selected by ModelFinder (Kalyaanamoorthy et al. 2017 ) on this dataset is HIVb, with 'freerates' rates-across-site model and 9 rate categories. We tested 255 mutations in total: those present in at least 12 sequences and showing more than 2 EEMS. Among these, we detected 74 convergent events, after applying the Benjamini-Hochberg correction (Benjamini and Hochberg 1995) for multiple testing (non-corrected p-value threshold of 4e-4, corresponding to a corrected alpha level of 5%). Among these detections, 20 are DRMs, which represents 54% of true positives and is a higher proportion than what is expected by chance (Fisher's exact test p-value = 4.8e-4). The non-DRM detected events correspond to 11 mutations on fast evolving positions (Fig. 4a) , which are likely false positives, plus other events for which we cannot conclude using simple arguments (but see below). Regarding false negatives, 17 DRMs are not detected, 7 of which having a (non-corrected) p-value lower than 0.005, but higher than the significance threshold (4e-4). Interestingly, 4 of them are between amino acids with a low substitution rate, as shown in Figure 4b . Based on the substitution rate and small p-values, it could be possible to identify some false negatives, in particular Y188L with amino acids that are unlikely to substitute (BLOSUM62 score = -1). Positive selection analysis supports this approach (see below). ConDor is reduced by ~60% (10 fast + 18 found in the literature). 

Rhodopsin is a photosensitive protein pigment responsible for the eye's sensitivity to light. It is found in many vertebrates and has been shown to be under positive selection among species that evolve in different environments (Spady et al. 2005) . Depending on the habitat and the amount of light available, different amino acids are observed at the same positions, which result in variations in rhodopsin structure and different maximum wavelength absorption. Certain substitutions corresponding to these amino-acid changes have been described as resulting from convergent evolution. In particular D83N, E122Q, F261Y and A292S (using similar substitution encoding as with HIV) occurred several times independently ).

The dataset we used comes from a study in which the authors characterized substitution F261Y as convergent in fish rhodopsin, as a possible result from a transition from marine to brackish or freshwater environments (Hill et al. 2019) . It contains an alignment of 2,047 sequences with 308 amino-acid positions. The sequences have been classified by the authors in two groups: species found only in marine water and species that can evolve (exclusively or not) in brackish or fresh water. Species annotated within the habitats brackish or fresh water can therefore also be found in marine water.

The neutral model inferred by ModelFinder on this dataset was 'MtZoa' and 'freerates' with 8 rate categories. The reconstructed tree is well supported with three quarters of the bootstrap supports above 70%. We tested 355 substitutions (present in at least 12 sequences and showing more than 2 EEMs) with our method and 55 were retained as possibly convergent (15%). They are distributed over 49 positions out of the 136 tested positions, which means that 36% of the tested positions are detected as convergent (16% considering all alignment positions). We were able to recover substitutions F261Y, D83N and A292S, and reversion N83D was also found as convergent.

Substitution E122Q was not found as convergent since glutamine (Q) independently emerged only 3 times according to ACR, but emerged up to 7 times in simulations.

From the ancestral reconstruction of position 261 presented Figure 5 , Tyrosine (Y) emerged 20 times independently from the phenylalanine (F), which confirms the observation made in (Hill et al, 2019) . It is a reversion since Y is found as root amino acid at this position. After being acquired from amino acid F, amino acid Y shifts 3 times back to F without this change being detected as convergent.

Hill et al observed a strong correlation between amino acid Y at position 261 and the brackish or freshwater habitat (Hill et al. 2019) . To alleviate phylogenetic confounding factors, we reanalyzed this correlation using BayesTraits 'discrete dependent' model and found a strong dependence between these two traits (Bayes Factor = 58). Indeed, this substitution between F and Y is known to shift wavelength absorption of the rhodopsin (Yokoyama 2000) and could be advantageous in brackish or fresh water. We also found a significant correlation between habitat and the emergence of amino acid S at position 292, as well as a correlation between habitat and mutations D83N/N83D (see Supplementary Figures S1 and S2 ).

Among the 55 detected convergent events, 12 are found at fast positions and should be treated carefully (Supplementary Table S4 ). Interestingly, 9 of the detected events are reversions towards the amino acid reconstructed at the root by ACR, none of which is at fast positions. These 9 reversions most likely are true convergence events. We tested the correlation between the 55 predicted convergent events and the habitat and found a significant dependence for 32 of them (9 on fast positions), meaning that, again, a substantial fraction of these detections most likely corresponds to true convergence. This was confirmed by a literature review (see Supplementary Table S4 

Our method is implemented in a workflow, named ConDor, which is accessible through a web service: http://condor.pasteur.cloud. ConDor inputs consist of: (1) a protein alignment in fasta format;

(2) a file containing outgroup sequence identifiers; and (3) a newick file with a tree (rooted or not) whose tips are the same sequences. The workflow is executed on the Institut Pasteur cluster and takes ~160 minutes for a dataset of ~2,050 sequences and ~300 positions. This time corresponds to ~140 minutes for the model inference and ~20 minutes for the convergence detection, using 10,000 simulations per position and target amino acid. The output consists of all the tested mutations with several statistics such as their p-value and if they passed or not the threshold to be considered as convergent mutations. We also provide the evolutionary rate of the corresponding position, the nature of the mutation (convergent, parallel, revertant), the number of EEMs, the genetic barrier, the BLOSUM62 score, etc. All these statistics can be used to analyze further and select the most relevant mutations.

In this work, we have developed the ConDor approach, which makes it possible to detect evolutionary convergence at the resolution of a mutation, without prior knowledge of constraints and phenotype. We could retrieve more than 70% of the positions with DRMs on a real HIV dataset, and ~55% of all DRMs present in the sequences. On synthetic datasets mimicking the evolution of HIV, we showed that the detection power depended on the number of emergence-events of mutations and the exchangeability between amino acids. Even though ConDor was primarily developed for the analysis of viral datasets, it could detect several convergent mutations involved in the change in absorption wavelength in a fish rhodopsin dataset. These results confirm that our method detects realistic convergent evolution signal and could be applied to a broad range of organisms.

We tested the robustness of ConDor to model violation by using JTT (Jones et al. 1992 ) instead of HIVb (Nickle et al. 2007 ) as a neutral model of evolution for the study of synthetic HIV-like datasets.

In doing so, the sensitivity remained high, and we still detected ~70% of positions with DRMs.

However, the number of false positives increased, and thus we expect false positives with real data.

The number of false positives can be partly explained as ConDor tends to be biased towards mutations found at positions with very high evolutionary rates. Indeed, our method relies on how realistic simulations are as a null model. Our results show that for most positions and most mutations, we are close to what is observed in real data, and our simulations represent a satisfactory null model.

However, at certain fast positions we observed that simulations tend to differ from the real data, which resulted in an increased rate of false positives. Thus, we advise that ConDor detections should be cross-validated with approaches such as positive selection to decrease the number of potential false positives. More advanced substitution models, for example based on mixtures or some ideas derived from the CAT model ) also used in (Rey et al. 2018 ) but in a different setting, or mixture of matrix models which accounts for structural features of the positions or their evolutionary rate (Le et al. 2012) , should likely enhance our approach, make the simulations more realistic and lower the number of potentially erroneous detections.

ConDor was developed to detect convergent amino-acid changes and not convergent positions, which complicates the comparison with existing approaches based on convergent site detection (e.g. PCOC (Rey et al. 2018) , and to some extent positive selection methods ). An adaptation of ConDor to work at the position level could be an interesting feature to add to the program. Our approach is made possible since we are working at the scale of a single protein with thousands of sequences, which provides sufficient signal and detection power. By working on thousands or even millions of positions (e.g., with bacterial genomes), ConDor would likely lack the statistical power to work at the scale of a single mutation due to multiple testing. An extension of

ConDor to work at the gene level (similarly to (Chabrol et al. 2018) ), or to detect convergence within a sliding window, would certainly be a useful development.

We have designed this method for the study of specific genes, typically from viruses and microorganisms for which the phenotype is rarely available. Thus, we do not consider any prior knowledge of constraints and phenotype for our analyses. In the same way that methods consider that a position is convergent if it shows the same amino acid derived independently in species with a convergent phenotype (Foote et al. 2015; , knowledge of the phenotype could be added to ConDor and only positions meeting the above criterion could be tested and selected.

The HIV reverse transcriptase dataset we analyzed is based on the nucleotide alignment provided in (Lemoine et al. 2018 ) which we downloaded from https://github.com/evolbioinfo-/booster-workflows/tree/master/data/vih. This is an alignment of 9,147 HIV-1 group M polymerase sequences. The authors indicate that this alignment contains recombinant sequences, which we have removed based on the JPHMM output file provided in their supplementary data (https://github.com/ evolbioinfo/booster-workflows). We then extracted the B subtype sequences from this alignment using their annotation file "pol_nonrecombinant.txt". Finally, we added the reference sequence HXB2

(GenBank TM accession number K03455) from which the position numbering and the ancestral amino acids in the DRMs notation are derived. The resulting alignment contains 3,557 B subtype sequences and 1,043 nucleotide positions.

As outgroup, we downloaded the group M subtype reference alignment (user-defined range 2258-3300) from the Los Alamos HIV database (https://www.hiv.lanl.gov/content/index). We removed subtype B sequences from the outgroup and added the sequences to Lemoine et al MSA,

resulting in an MSA of 3,592 nucleotide sequences which we translated into amino acids.

After tree reconstruction and rooting (as explained in subsection "Tree reconstruction by maximum likelihood") we visually checked the rooting of the tree with Itol (Letunic and Bork 2019).

We removed from the MSA 11 sequences classified as B (but likely recombinant), which were placed among the outgroup and obtained a monophyletic group of 3,546 B subtype sequences and the corresponding rooted tree.

We used the sierrapy (https://github.com/hivdb/sierra-client/tree/master/python) fasta command to locate (positions and sequences) the DRMs present in the HIV-1 subtype B MSA of 3,557

sequences. The list of mutations used by sierrapy can be found at "https://hivdb.stanford.edu/hivdb/by-mutations". Then, we replaced positions with DRMs with unknowns in the corresponding sequences. The phylogeny, thus unaffected by DRMs, was reconstructed as explained in subsection "Tree reconstruction by maximum likelihood". This phylogeny was therefore different from the one representing the real data because the aim here was not the same. In the first case, with the real data, we did not remove the DRMs because we wanted to be agnostic of any convergence before applying

ConDor. In the second case, we wanted to control where the convergence was in order to quantify the percentage of true and false positives. To do this, we removed all traces of (known) convergence, thus removing the DRMs and reconstructing a phylogeny whose structure was affected as little as possible by convergence.

We rooted the tree using the same outgroup as before and we deleted 6 sequences of subtype B that were placed among the outgroup sequences. As result, we obtained a true HIV phylogeny of 3551 truncated polymerases of subtype B. We simulated the evolution of HXB2 (reverse transcriptase only) along this tree 5 times using a homemade simulator implemented in python.

We did not replace exactly the DRMs found by sierrapy, but used the same list as the one used for the real data, for comparison purposes. This list corresponds to the "Essential DRM Data" data from the Stanford University Drug resistance database (https://hivdb.stanford.edu/pages/poc.html), from which we selected only mutations found in at least 12 sequences and showing more than 2 EEMs, i.e., found in at least 3 distinct subtrees. The list of 37 DRMs can be found in Supplementary Table S1 .

Protein data of rhodopsin and fish habitat were retrieved from https://github.com/ Clupeaharengus/rhodopsin/tree/master/phylogeny_habitat. We extracted the 2,056 sequences from "spp_to_keep.txt" from the file "final_alignment.translated.fullrhodopsin.fasta" and removed 7 badly aligned sequences. We used the same sequences used for rooting than in (Hill et al. 2019 ) (Huso huso and Polyodon spathula). The habitat was provided in the file "rabo_allele_hab.tsv" from the repository provided in (Hill et al. 2019 ).

All phylogenies were reconstructed from the corresponding protein MSAs, using the following procedure and options. We used model finder (Kalyaanamoorthy et al. 2017 ), corresponding to parameter -m MFP in IQtree version 1.6.8 (Nguyen et al. 2015) , to select the model of sequence evolution (substitution matrix, gamma categories or freerates model, presence of invariant positions, etc.). Amino acid equilibrium frequencies were estimated by maximum likelihood using IQ-tree option +FO, and site-specific evolutionary rates were estimated using option -wsr.

Ancestral character reconstruction was achieved using PastML version 1.9.29.9 ) with option --prediction_metho MAP. We provided one parameter file (option -parameter) per position, in which are written (1) the amino acid frequencies for the whole alignment and (2) the scaling factor for the studied position, corresponding to the rate of evolution of the site as estimated by IQtree. The selected substitution matrix (HIVb, JTT, resp. MetaZoa) was given as input (--rate_matrix) using PastML option -m CUSTOM_RATE.

The whole method is implemented in a Nextflow pipeline (Tommaso et al. 2017 ) taking as input an amino-acid alignment, a rooted/unrooted tree and a file containing outgroup sequences identifiers. The python libraries numpy (Harris et al. 2020) , pandas (McKinney 2010) and scipy were used for data frames and matrices manipulations and for the statistic tools they provide. We used biopython (Cock et al. 2009 ) for sequences and alignments manipulations.

Tree traversals and analyses were achieved with ETE 3 (Huerta-Cepas et al. 2016). Graphics were obtained using matplotlib (Hunter 2007) and seaborn libraries. All MSA (translation to amino acids, subalignments, etc.) and trees manipulations (pruning, rooting, etc.) were achieved using goalign and gotree (Lemoine and Gascuel 2021) . Simulations and counting of EEMs were computed using homemade python scripts. Convergent candidate events were selected based on their p-value after correcting for multiple testing with a "Benjamini -Hochberg" correction, with a risk alpha of 0.05.

We used MEME 

In the rhodopsin dataset, correlations between fish habitat and mutations detected as convergent by ConDor were measured with BayesTraits 'discrete dependent' model . Prior to running the software, we transformed our data into discrete binary traits.

This way, marine habitat was annotated as 1 and fresh/brackish water as 0. Similarly, for a given position, the convergent change had the value 1 and the other amino acids at that position the value 0. We followed the procedure detailed in http://www.evolution.rdg.ac.uk/BayesTraitsV3/Files/-BayesTraitsV3.Manual.pdf to assess whether the dependence between both traits was more likely than their independence. The dependence hypothesis was retained if the Bayes factor was greater than 10. Priors for the transition rates were estimated using the output provided by the maximum likelihood models as described in the user guide.

Our MSAs, phylogenetic trees, scripts and results analysis are accessible from the Github repository https://github.com/mariemorel/condor-analysis.

Marie MOREL, Frédéric LEMOINE and Olivier GASCUEL Corresponding Authors: marie.morel@pasteur.fr, olivier.gascuel@mnhn.fr

• Figure S1 : Fish rhodopsin data, ancestral reconstruction of position 83 and corresponding contingency table p. 2-3

• Figure S2 : Fish rhodopsin data, ancestral reconstruction of position 292 and corresponding contingency table p. 4

• The visualisation corresponds to a compressed representation of the ancestral scenario, after performing a vertical merge such as defined by PastML . Each disk corresponds to a cluster within which all nodes and tips have the same common ancestor (a node, included in the cluster) and are predicted by ACR with the same state. Next to the inferred state (here N, D or G), is written the number of tips in the cluster. If there is no number, the represented "cluster" is a tip. Arrows represent independent emergence-events of mutations (EEMs). The blue and green circles around each cluster represent the percentage of tips in the cluster annotated in marine and freshwater/brackish water, respectively.

The ancestral amino acid at position 83 is an aspartate (D) which was lost independently 28 times towards asparagine (N) and twice towards glycine (G). The reversion from N towards D then occurred independently 31 times. In some clusters this switch between D and N occurred again leading to 32 EEMs towards N and 32 EEMs towards D in total. Both mutations were found as convergent with ConDor.

D clusters seem to be more frequently associated with brackish/fresh water whereas species in N clusters seem to be found in marine water. This distribution can also be observed with the contingency table displaying the number of species having D or N at position 83 in function of their habitat. To alleviate phylogenetic confounding factors, we reanalysed this correlation using BayesTraits 'discrete dependent' model and found a strong dependence between these two traits, with Bayes Factors equal to 104 (D83N) and 140 (N83D). The Bayes Factors are different because the data are not perfectly symmetrical between D83N and N83D, mainly due to the presence of other amino acids (G). Indeed, the input given to BayesTraits is binary with the convergent change taking the value 1 and any other amino acids at the corresponding position the value 0. The correlation found between the habitat and position 83 is confirmed by previous work which found that position 83 harboured key blue-shifting substitutions, with amino acid N found in deep-diving species and D in non-deep-diving species (Sugawara et al. 2005; Yokoyama et al. 2008 ).

This visualisation corresponds to a compressed representation of the ancestral scenario, after performing a vertical merge such as defined by PastML . Each disk corresponds to a cluster within which all nodes and tips have the same common ancestor (a node, included in the cluster) and are predicted by ACR with the same state. Next to the inferred state (here A, S, C, G or I), is written the number of tips in the cluster. If there is no number, the represented cluster is a tip. Arrows represent independent EEMs. The blue and green circles around each cluster represent the percentage of tips in the cluster annotated in marine and freshwater/brackish water, respectively.

The two most common amino acids at position 292 are alanine (A) which is the root amino acid and serine (S). We count 44 EEMs towards S (all from A) and 13 EEMs towards A. Mutation A292S was found convergent with ConDor. This mutation could be associated with marine water as most of the species in the S clusters are also found in marine water. This distribution can also be observed with the contingency table displaying the number of species having A or S at position 292 in function of their habitat. To alleviate phylogenetic confounding factors, we reanalysed this correlation using BayesTraits 'discrete dependent' model and found a strong dependence between these two traits with Bayes Factors equal to 70. This is consistent with previous works which found that amino acid S at position 292 could be an adaptation to the bathypelagic environment (Sugawara et al. 2005 M184V  273  124  Yes  K103N  221  143  Yes  M41L  211  45  Yes  T215Y  187  30  Yes  D67N  164  50  Yes  L210W  130  17  Yes  K70R  123  54  Yes  K219Q  78  29  Yes  G190A  64  50  Yes  T215F  60  22  Yes  V179D  58  47  No  Y181C  55  41  Yes  E138A  49  45  Yes  V108I  47  41  Yes  T69D  47  23  No  P225H  39  25  Yes  L74V  37  32  No  K219E  37  22  No  A62V  30  23  Yes  A98G  27  19  Yes  T215D  25  18  No  Q151M  22  9  Yes  L100I  22  17  No  K238T  22  11  Yes  T215C  21  16  No  T215S  21  18  No  F116Y  20  11  No  L74I  20  20  No  V75I  18  9  No  K101E  18  18  No  Y188L  18  17  No  H221Y  18  17  No  V179E  18  16  No  F77L  15  7  No  K219N  15  13  No  V75M  14  9  Yes  K101P 12 8 Yes Mutations: Mutations detected as convergent with ConDor on the real HIV-1 subtype B MSA of reverse transcriptase. The p-values are not corrected and below or equal to the acceptance threshold of 0.0004 after "Benjamini-Hochberg" correction. Fast: detected mutation is on a position that belong to the 5% positions with highest evolutionary rate on the whole dataset. Role of the mutation: possible role of the mutation found in the literature. Mutations that have "DRM" noted for the role of the mutation corresponds to the 37 DRMs that we used as true convergent mutations, and which can be found in Supplementary Table S1. Positions significantly detected with MEME to be under episodic positive selection ) on the real HIV-1 subtype B MSA of reverse transcriptase. The p-values are those provided by MEME and unmodified with an acceptance threshold at 0.05. ConDor detects: positions on which we also detected events with ConDor. DRM: position on which there is at least one DRM belonging to the list of 37 DRMs (see Supplementary Table S1 ). Fast: detected positions that belong to the 5% positions with highest evolutionary rate on the whole dataset. Over the 32 positions found to be under positive selection, 26 intersect with events found by ConDor. Among all (32) MEME detections, 11 are at positions with DRMs and 10 are at fast positions. Mutation: Mutations detected as convergent with ConDor on the fish rhodopsin dataset. The pvalues are those given as output by ConDor and are not corrected. The list of mutations corresponds to those whose p-value is less than or equal to the acceptance threshold of 0.0007 after "Benjamini-Hochberg" correction with an alpha risk of 0.05. Fast: detected mutation on a position that belong to the 5% positions with highest evolutionary rate on the whole dataset. Bayes Factor: Bayes factor calculated by BayesTraits 'discrete dependent' model . If the Bayes factor is greater than 10 (highlighted in bold), the mutation is significantly correlated with the habitat (marine or fresh/brackish water). Role of the mutation: possible role of the mutation found in the literature, with the reference.

We detected 55 events with ConDor, 12 of which are found at fast positions, 32 are correlated with marine or fresh/brackish water environments, 15 were already discussed in the literature. Positions detected by PCOC considering that the convergent clades were those annotated with fresh/brackish water. PCOC: combination of posterior probabilities of PC (Profile change) and OC (One change) model. PC: posterior probabilities of the PC model. The threshold for significance was set to 0.8; OC: posterior probabilities of OC model. The threshold for significance was set to 0.8; ConDor detects: If we detect a convergent mutation with ConDor at the given position.

All the detections are due to the changes in amino acid profiles (PC model) which means that species with the convergent phenotype shifted to a different vector of amino acid probabilities compared to their ancestors. These shifts occurred at 16 different positions, 4 of which intersect with ConDor detections. However, the OC model is not significant as the model shift did not occur at the beginning of the branch supporting the clade of convergent species. Since the OC model is not verified, there are no strictly convergent positions detected for the given phenotype (fresh/brackish water).

Characteristics of HIV-1 Natural Drug Resistance-Associated Mutations in Former Paid Blood Donors in Henan Province

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

HIV Reverse Transcriptase and Protease Genes Variability Can Be a Biomarker Associated with HIV and Hepatitis B or

Detecting the molecular basis of phenotypic convergence

Evidence of Echolocation in the Common Shrew from Molecular Convergence with Other Echolocating Mammals

Biopython: freely available Python tools for computational molecular biology and bioinformatics

Parallel evolution of drug resistance in HIV: failure of nonsynonymous/synonymous substitution rate ratio to detect selection

Molecular Basis of Adaptive Convergence in Experimental Populations of RNA Viruses

Parallel signatures of sequence evolution among hearing genes in echolocating mammals: an emerging model of genetic convergence

Convergent evolution of hyperswarming leads to impaired biofilm formation in pathogenic bacteria

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2

Epistatic interactions influence terrestrial-marine functional shifts in cetacean rhodopsin

Widespread Signals of Convergent Adaptation to High Altitude in Asia and America

Convergent evolution of the genomes of marine mammals

A codon-based model of nucleotide substitution for protein-coding DNA sequences

Parallel molecular evolution and adaptation in viruses

Array programming with NumPy

Recurrent convergent evolution at amino acid residue 261 in fish rhodopsin

Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient

Comparative genomics reveals convergent evolution between the bamboo-eating giant and red pandas

ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data

Matplotlib: A 2D Graphics Environment

A Fast Likelihood Method to Reconstruct and Visualize Ancestral Scenarios

The rapid generation of mutation data matrices from protein sequences

ModelFinder: fast model selection for accurate phylogenetic estimates

Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus

Modeling Protein Evolution with Several Amino Acid Replacement Matrices Depending on Site Rates

Empirical profile mixture models for phylogenetic reconstruction

Phylogenetic mixture models for proteins

Building superfast muscles: insights from molecular parallelism in fast-twitch muscle proteins in echolocating mammals

Renewing Felsenstein's phylogenetic bootstrap in the era of big data

Gotree/Goalign : Toolkit and Go API to facilitate the development of phylogenetic workflows

Interactive Tree Of Life (iTOL) v4: recent updates and new developments

Host shifts result in parallel genetic changes when viruses evolve in closely related species

Molecular convergent and parallel evolution among four high-elevation anuran species from the Tibetan region

Modulation of thermal noise and spectral sensitivity in Lake Baikal cottoid fish rhodopsins

A functional enrichment test for molecular convergent evolution finds a clear protein-coding signal in echolocating bats and whales

The emergence and ongoing convergent evolution of the N501Y lineages coincides with a major global shift in the SARS-CoV-2 selective landscape. Infectious Diseases (except HIV/AIDS)

Data Structures for Statistical Computing in Python

HIV reverse transcriptase gene mutations in anti-retroviral treatment naïve rural people living with HIV/AIDS

Detecting Individual Sites Subject to Episodic Diversifying Selection

Convergent Evolution within an Adaptive Radiation of Cichlid Fishes

IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies

HIV-Specific Probabilistic Models of Protein Evolution

Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters

Bayesian Analysis of Correlated Evolution of Discrete Characters by Reversible-Jump Markov Chain Monte Carlo

Genome-wide signatures of convergent evolution in echolocating mammals

Molecular adaptation in Rubisco: Discriminating between convergent evolution and positive selection using mechanistic and classical codon models

Evolution of Viral Genomes: Interplay Between Selection, Recombination, and Other Forces

Accurate Detection of Convergent Amino-Acid Evolution with PCOC

The Molecular Basis of Phenotypic Convergence

Adaptive Molecular Evolution in the Opsin Genes of Rapidly Speciating Cichlid Species

The genetic causes of convergent evolution

Causes of molecular convergence and parallelism in protein evolution

Identifying Changes in Selective Constraints: Host Shifts in Influenza

Determining the Null Model for Detecting Adaptive Convergence from Genomic Data: A Case Study using Echolocating Mammals

Nextflow enables reproducible computational workflows

A Single Mutation in Chikungunya Virus Affects Vector Specificity and Epidemic Potential

Widespread convergence in toxin resistance by predictable molecular evolution

Recreated Ancestral Opsin Associated with Marine to Freshwater Croaker Invasion Reveals Kinetic and Spectral Adaptation

SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python

Genomic Convergence in the Adaptation to Extreme Environments

Molecular evolution of vertebrate visual pigments

Evolution of Dim-Light and Color Vision Pigments

Parallel adaptive origins of digestive RNases in Asian and African leaf monkeys

Detection of convergent and parallel evolution at the amino acid sequence level

Parallel Molecular Evolution in an Herbivore Community

Are Convergent and Parallel Amino Acid Substitutions in Protein Evolution More Prevalent Than Neutral Expectations?

No Genome-Wide Protein Sequence Convergence for Echolocation

The HIV-1 reverse transcriptase polymorphism A98S improves the response to tenofovir disoproxil fumarate+emtricitabine-containing HAART both in vivo and in vitro

The molecular and cellular basis of rhodopsin retinitis pigmentosa reveals potential strategies for therapy

Impact of HIV-1 Reverse Transcriptase Polymorphism F214L on Virological Response to Thymidine Analogue-Based Regimens in Antiretroviral Therapy (ART)-Naive and ART-Experienced Patients

Characterization and Structural Analysis of Novel Mutations in Human Immunodeficiency Virus Type 1 Reverse Transcriptase Involved in the Regulation of Resistance to Nonnucleoside Inhibitors

Mutations and Polymorphisms Associated with Antiretroviral Drugs in HIV-1C-Infected African Patients

Epistatic interactions influence terrestrial-marine functional shifts in cetacean rhodopsin

Extended spectrum of HIV-1 reverse transcriptase mutations in patients receiving multiple nucleoside analog inhibitors

HIV-1 Reverse Transcriptase (RT) Polymorphism 172K Suppresses the Effect of Clinically Relevant Drug Resistance Mutations to Both Nucleoside and Non-nucleoside RT Inhibitors

The molecular basis for spectral tuning of rod visual pigments in deep-sea fish

A Fast Likelihood Method to Reconstruct and Visualize Ancestral Scenarios

2'-deoxy-4'-C-ethynyl-2-halo-adenosines active against drug-resistant human immunodeficiency virus type 1 variants

Polymorphisms and Mutational Covariation Associated with Death in a Prospective Cohort of HIV/AIDS Patients Receiving Long-Term ART in China

Genomic islands of speciation separate cichlid ecomorphs in an East African crater lake

Impact of HIV-1 reverse transcriptase polymorphism at codons 211 and 228 on virological response to didanosine

Orthologous Divergence and Paralogous Anticonvergence in Molecular Evolution of Triplicated Green Opsin Genes in Medaka Fish

Non-nucleoside reverse transcriptase inhibitor (NNRTI) cross-resistance: implications for preclinical evaluation of novel NNRTIs and clinical genotypic resistance testing

An experimental comparison of human and bovine rhodopsin provides insight into the molecular basis of retinal disease

Detecting Individual Sites Subject to Episodic Diversifying Selection

Group on HIV Drug Resistance

Emergence of the H208Y mutation in the reverse transcriptase (RT) of HIV-1 in association with nucleoside RT inhibitor therapy

Bayesian methods outperform parsimony but at the expense of precision in the estimation of phylogeny from discrete morphological data

Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters

Bayesian Analysis of Correlated Evolution of Discrete Characters by Reversible-Jump Markov Chain Monte Carlo

Multiple sites in HIV-1 reverse transcriptase associated with virological response to combination therapy

Impact of unreported HIV-1 reverse transcriptase mutations on phenotypic resistance to nucleoside and non-nucleoside inhibitors

Polymorphic mutations associated with the emergence of the multinucleoside/tide resistance mutations 69 insertion and Q151M

Parallel and convergent evolution of the dim-light vision gene RH1 in bats (Order: Chiroptera)

Parallelism of amino acid changes at the RH1 affecting spectral sensitivity among deep-water cichlids from Lakes Tanganyika and Malawi

The A62V and S68G mutations in HIV-1 reverse transcriptase partially restore the replication defect associated with the K65R mutation

Involvement of Novel Human Immunodeficiency Virus Type 1 Reverse Transcriptase Mutations in the Regulation of Resistance to Nucleoside Inhibitors

Recreated Ancestral Opsin Associated with Marine to Freshwater Croaker Invasion Reveals Kinetic and Spectral Adaptation

Critical amino acid replacements in the rhodopsin gene of 19 teleost species occupying different light environments from shallow-waters to the deep-sea

Variants Other than Aspartic Acid at Codon 69 of the Human Immunodeficiency Virus Type 1 Reverse Transcriptase Gene Affect Susceptibility to Nucleoside Analogs

The Genomes of Two Billfishes Provide Insights into the Evolution of Endothermy in Teleosts

Evolution of Dim-Light and Color Vision Pigments

Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates

Precise identification of a human immunodeficiency virus type 1 antigen processing mutant

We sincerely thank Anna Zhukova, Luc Blassel and Jakub Voznica for their help and suggestions. This work was supported by INCEPTION program (Convention ANR-16-CONV-0005; MM PhD grant) and by PRAIRIE program (Convention ANR-19-P3IA-0001; OG).