# Phylogenetic algorithms: How to interpret multiple ML trees from the same dataset?

There's still something that confuses me about how many of these algorithms work, and how results are presented in the literature.

Let's consider a maximum likelihood based algorithm like MrBayes or RAxML: users set a random number seed, which generates the starting tree. For many of our datasets, different seeds result in different ML results, as the algorithms are initialized with different trees.

I'm not entirely sure how one is supposed to interpret this, especially as my experience with ML methods is that the initial step is irrelevant to the global/local min/max in parameter space; the chains just take longer to converge.

How should one interpret these results? Are users supposed to run thousands of trees at different parameter values and then choose the best likelihood value? That seems rather ad hoc, as does bootstrapping, etc.

Is the dataset fundamentally flawed?

In short, you have chosen two examples that do not use maximum likelihood as you know it in other contexts. In most statistical contexts, the ML is a single number which can be calculated analytically, so it is always the same for a given data set. This is not the case for either MrBayes or RAxML, but for different reasons.

### MrBayes

The likelihood criterion in MrBayes is the marginal likelihood of the posterior, given effectively the data conditioned on the priors. This likelihood comes from a stochastic MCMC sampling of the parameter space. If all is well behaved, then the chains and/or runs will converge in the same general location. But then the different possible topologies need to be summarized in some way.

### RAxML

RAxML generates essentially random starting trees by random sequence additions to build up trees. The subtrees are then rearranged to find a "best" tree. Again, different starting points may lead to different best trees. But if all goes well, analyses will end up with the same tree. This process is described in this chapter.

In both cases, if you start in a different place, you may end up in a different place. There may be many trees that are, within some criterion, equally likely. If you are familiar with parsimony methods, the analogy is that of multiple equally parsimonious trees.
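The "start in a different place, end up in a different place" behaviour is not specific to trees. As a minimal sketch (a toy one-dimensional surface with two peaks, not a phylogenetic likelihood; the function `score` and the step size are made up for illustration), a greedy search only ever climbs to the nearest peak:

```python
import math

def score(x):
    """Toy 'likelihood surface' with a lower peak near x = -2 and a higher one near x = 3."""
    return math.exp(-(x + 2) ** 2) + 1.5 * math.exp(-(x - 3) ** 2)

def hill_climb(start, step=0.05, max_iter=10_000):
    """Greedy local search: move to a neighbouring point only if it scores higher."""
    x = start
    for _ in range(max_iter):
        best = max((x - step, x, x + step), key=score)
        if best == x:  # no neighbour is better: a local optimum
            break
        x = best
    return x

for start in (-4.0, 5.0):
    end = hill_climb(start)
    print(f"start={start:+.1f} -> end={end:+.2f}, score={score(end):.3f}")
```

Tree searches face the same problem in a much larger, discrete space, which is why several independent starting trees are usually compared.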

## Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data.

### Results

We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data.

### Conclusions

This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.

## Abstract

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.

Is the only file of interest. The reason this is true in this context is really complicated and you have to understand the statistics of likelihood and how they are interpreted within phylogeny to understand why. This file is simply the final output of a non-parametric bootstrap analysis performed by maximum likelihood.

What on earth is a non-parametric bootstrap?

A non-parametric bootstrap resamples the alignment positions with replacement. Thus, if we have alignment positions 1, 2, 3, 4, 5, a bootstrap resample for 2 replicates might be, for example, 1, 1, 3, 4, 5 and 2, 3, 3, 5, 5.
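A minimal sketch of that resampling step in plain Python. The toy alignment and the function name `bootstrap_alignment` are made up for illustration; this is not how RAxML implements it internally.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    Returns (sampled_column_indices, resampled_alignment).
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw as many columns as the alignment has, allowing repeats.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    resampled = {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}
    return cols, resampled

# Hypothetical 3-taxon, 5-site alignment.
toy = {"taxonA": "ACGTA", "taxonB": "ACGTT", "taxonC": "TCGAA"}
rng = random.Random(1)
for rep in (1, 2):
    cols, aln = bootstrap_alignment(toy, rng)
    print(f"replicate {rep}: positions {[c + 1 for c in cols]} -> {aln}")
```

Every taxon gets the same sampled columns, so each replicate is still a valid alignment; only the weighting of the sites changes.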

The ML algorithm will build a tree for each of replicates 1 and 2 and find the consensus between them. In any other context a bootstrap replicate is pretty meaningless, because it no longer reflects the true biological sequence. Thus, information on how the consensus was derived is not really of interest to us, provided we are confident this has been done correctly, viz. RAxML_bipartitionsBranchLabels.output_bootstrap.tre and RAxML_bipartitionsBranchLabels.output_bootstrap.tre.
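To make the "consensus" idea concrete, here is a rough sketch of how support values can be counted: for each grouping in the best tree, ask what percentage of replicate trees contain it. Assumptions for illustration only: trees are nested tuples of taxon names, and groupings are treated as rooted clades, whereas RAxML works with bipartitions of unrooted trees.

```python
from collections import Counter

def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found

def bootstrap_support(best_tree, replicate_trees):
    """Percentage of replicate trees that contain each clade of best_tree."""
    counts = Counter()
    for rep in replicate_trees:
        counts.update(clades(rep))
    n = len(replicate_trees)
    return {clade: 100.0 * counts[clade] / n for clade in clades(best_tree)}

# Hypothetical best tree and four bootstrap replicate trees.
best = (("A", "B"), ("C", "D"))
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
for clade, pct in bootstrap_support(best, replicates).items():
    print(sorted(clade), pct)
```

These percentages are the numbers that end up superimposed on the branches of the best tree.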

So why is this output of limited use?

There are situations in which this information is useful to some investigators, but it is not needed to assess the robustness of a tree topology. The only thing we want is a phylogram (bestTree) with the bootstrap values superimposed on it. We really don't need complicated stuff, such as the tree being represented as a polytomy (non-bifurcating tree), because we can just read the bootstraps to make that deduction (values >> 75%). In addition, there is no firm consensus on what bootstrap value constitutes robustness, but most agree that >80% is robust.

What output files have useful information in them?

The important information is in the files associated with "bestTree", the single maximum likelihood tree estimated from the intact, original alignment. The "info" file for this contains 3 really important parameters:

• -lnL (very important!)
• Gamma distribution parameter "alpha"
• PINVAR, the proportion of invariant sites

-lnL is the negative log-likelihood of the best phylogeny found, i.e. the tree with the highest likelihood (the probability of the data given the tree and model). The likelihood itself is usually a very small number, and there is an enormous amount of theory behind it.

The alpha parameter of the gamma distribution is the shape parameter of the distribution of mutation rates across sites. If it is very low (<1), the mutations are tightly clustered across the alignment and their distribution approximates a negative binomial distribution. This means that some sites don't mutate at all and a small number of sites mutate a lot. If it is very large, >200 (which is never observed), the distribution approximates the Poisson distribution, meaning the mutations are spread randomly across the alignment.
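A quick way to see what alpha does is to draw per-site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha) and look at how they are spread. This is only an illustrative simulation under that common parameterisation; the site count, the seed and the "slow site" cut-off of 0.1 are arbitrary choices.

```python
import random

def simulate_site_rates(alpha, n_sites=20000, seed=0):
    """Draw relative substitution rates for n_sites from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def summarize(rates):
    """Return (mean rate, variance of rates, fraction of nearly invariant sites)."""
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / n
    slow = sum(r < 0.1 for r in rates) / n
    return mean, var, slow

for alpha in (0.2, 5.0):
    mean, var, slow = summarize(simulate_site_rates(alpha))
    print(f"alpha={alpha}: mean={mean:.2f} variance={var:.2f} fraction of slow sites={slow:.2f}")
```

With a low alpha a large share of sites barely change while a few change a lot; with a higher alpha the rates bunch up around the mean.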

PINVAR is a straight percentage/frequency and simply means the proportion of sites that don't mutate.

How are they calculated?

PINVAR and alpha are not empirically calculated, i.e. if you look at an alignment and say 'no mutations at that position', PINVAR would of course agree but may also consider other sites invariant, depending on the phylogeny. These parameters are calculated by maximum likelihood, and you can begin to see why the calculation takes so long: alpha and PINVAR affect the tree topology (which affects -lnL), but the topology affects alpha and PINVAR. Thus, it is a multidimensional search of tree and parameter space.

So what stuff do I report in my Results?

Reporting -lnL is good technique and shows the reader you've done maximum likelihood; citing PINVAR and alpha from the gamma distribution helps (in 'Methods': parameters were calculated iteratively under maximum likelihood). This is only useful for bestTree. The -lnL, PINVAR and gamma alpha are also calculated for every single bootstrap replicate, but these values are of limited use: because we have resampled the data, only the consensus tree counts. Obviously, presenting the bootstrapped phylogram is extremely important.

Welcome to the technical world of phylogeny!

BTW, regarding the amino acid matrix you used: LG is in vogue right now.

How do I do it?

When I do this stuff it's via Biopython and ETE3. I capture the values within the pipeline and don't examine the output files of RAxML, because I generate my own.

## Results and discussion

### Comparison of MRP pseudo-sequence supertree and ML tree

To accurately determine the evolutionary relationships among SARS-CoV-2 isolates, we used both an MRP pseudo-sequence supertree and an ML tree for phylogenetic analysis of 102 SARS-CoV-2 genomes isolated all over the world, together with 5 SARS-CoV, 2 MERS-CoV, and 11 bat coronaviruses as outgroups. In the MRP pseudo-sequence supertree (Fig. 2), SARS-CoV and MERS-CoV were placed on one major branch, while SARS-CoV-2 belonged to another major branch. The divergent location of SARS-CoV-2 relative to SARS-CoV and MERS-CoV on the MRP pseudo-sequence supertree was consistent with the phylogenetic ML tree in this study (Supplementary Fig. S1). It was also supported by previous reports on the phylogeny of SARS-CoV-2 constructed from the whole genome [3,4,6]. However, some discrepancies are present between the MRP pseudo-sequence supertree and the ML tree. In particular, the MRP pseudo-sequence supertree analysis showed greater resolving power than the ML tree approach. Distinctive phylogenetic distances within the clades of SARS-CoV and SARS-CoV-2 in the MRP pseudo-sequence supertree explicitly presented the evolutionary relationships among coronaviruses. Also, the MRP pseudo-sequence supertree successfully identified the civet-sampled coronavirus AY572035 as the closest ancestor of the SARS-CoVs (Fig. 2), which is highly consistent with a previous study [35]. Moreover, the MRP pseudo-sequence supertree showed the detailed evolutionary relationships of SARS-CoV-2, with nine sub-branches identified, from Clade A to Clade I in Fig. 2. The reliability of the phylogenetic inference for SARS-CoV-2 in the supertree is supported by bootstrap values between 55 and 95. Conversely, coronaviruses clustered tightly within the clades of SARS-CoV and SARS-CoV-2 in the phylogenetic ML tree (Supplementary Fig. S1), with barely discernible branch lengths (less than 0.001).
It is worth noting that some bat coronaviruses sampled from the same animal host and/or the same sampling location displayed closer genetic distances in the MRP pseudo-sequence supertree, which is reasonable from an evolutionary perspective. However, bat coronaviruses showed no definitive evolutionary relationship in the phylogenetic ML tree. The major factor that determines the topology of the phylogenetic ML tree appears to be the ORF1ab gene, which makes up about 75% of the genome. This is supported by the fact that the evolutionary relationships in the whole-genome ML tree were similar to those in the source ML tree based on the sequence of the ORF1ab gene (Supplementary Fig. S1, Fig. 3a). Taken together, the phylogenetic supertree showed clear advantages for deciphering evolutionary relationships among coronaviruses.

Fig. 2. MRP pseudo-sequence supertree for SARS-CoV-2 constructed from protein source trees. The hosts and sampling locations of animal coronaviruses are enclosed in parentheses. The label of each SARS-CoV-2 virus combines the abbreviation of the sampling location, the sampling time, and the GenBank accession. The MERS-CoV clade, the SARS-CoV clade, and the nine clades of SARS-CoV-2 are highlighted and labeled. The numbers along the branches are bootstrap values, given as percentages of 1000 bootstrap resamplings.

Fig. 3. Source phylogenetic ML trees for phylogenetic supertree construction. (a) ORF1ab (b) Spike protein (c) M protein (d) N protein (e) E protein (f) ORF3a (g) ORF6 (h) ORF7a (i) ORF8. Clades of SARS-CoV-2 are in bold in all source phylogenetic ML trees. Bat virus MG996532 is written in red; MG772933 and MG772934 are in blue. Clades of SARS-CoV and MERS-CoV are highlighted in green and purple, respectively.

### Comparison of different supertrees of coronaviruses

Since the birth of supertree theory, many methods have been developed for constructing supertrees from source trees, including the MRP method [9,26], the most similar supertree algorithm (MSSA) method [36], average consensus [37], and the newly developed approximate maximum likelihood (ML) supertree method [30]. Among them, the MRP method is the most widely used, and the MRP pseudo-sequence supertree is derived from it. However, few of these methods have been used for constructing supertrees of viruses.

In this study, all of the approaches listed above were applied, in order to find which supertree approach best clarifies the phylogeny of coronaviruses. In the supertrees built by the MSSA (Supplementary Fig. S2) and average consensus (Supplementary Fig. S3) approaches, the SARS-CoV clade is located within the SARS-CoV-2 clade, which strongly suggests that these two approaches cannot provide a reliable phylogenetic signal for coronaviruses. Similarly, the ML supertree method is unsuitable for phylogenetic reconstruction here, because it failed to resolve the outgroup MERS-CoVs (Supplementary Fig. S4). Conversely, the supertrees obtained with the traditional MRP method and the MRP pseudo-sequence supertree method showed similar topologies (Supplementary Fig. S5, Fig. 2), providing a good separation among MERS-CoV, SARS-CoV, and SARS-CoV-2. The MRP pseudo-sequence method is relatively more suitable for phylogenetic reconstruction, as many taxa with the same sampling location and time are resolved in the same clade (Clades B, C, D, E and H). The suitability of the MRP pseudo-sequence supertree method for phylogenetic analysis can partly be ascribed to the removal of most unreliable bipartitions with low bootstrap values (< 55) during the reconstruction process. Retaining unreliable bipartitions resulted in an MRP supertree with a chaotic topology, especially in the SARS-CoV-2 clade (Supplementary Fig. S5). Moreover, the MRP pseudo-sequence supertree method can use various well-established phylogenetic algorithms to calculate branch lengths and bootstrap support from the MRP pseudo-sequences, which gives it an extra opportunity to construct an accurate phylogenetic supertree.
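The general idea of matrix representation with parsimony, combined with the support filter described above, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the pipeline used in this study: the input data structure, the function name `mrp_matrix` and the toy taxa are hypothetical, and only the threshold of 55 is taken from the text. Each retained grouping in a source tree becomes one pseudo-character, coded 1 for taxa inside the grouping, 0 for other taxa in that source tree, and ? for taxa absent from it.

```python
def mrp_matrix(source_trees, min_support=55):
    """Build an MRP pseudo-sequence matrix from a list of source trees.

    source_trees: list of dicts with
        "taxa":   set of taxon names present in that source tree
        "clades": list of (set_of_taxa_in_clade, bootstrap_support) pairs
    Returns a dict taxon -> string over {'1', '0', '?'}, one column per kept clade.
    """
    all_taxa = sorted(set().union(*(t["taxa"] for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        for clade, support in tree["clades"]:
            if support < min_support:
                continue  # drop weakly supported groupings
            for taxon in all_taxa:
                if taxon not in tree["taxa"]:
                    rows[taxon].append("?")  # taxon missing from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}

# Two hypothetical source trees with overlapping taxon sets.
trees = [
    {"taxa": {"A", "B", "C", "D"}, "clades": [({"A", "B"}, 90), ({"A", "B", "C"}, 40)]},
    {"taxa": {"A", "B", "E"}, "clades": [({"A", "E"}, 70)]},
]
for taxon, pseudo_seq in mrp_matrix(trees).items():
    print(taxon, pseudo_seq)
```

The resulting pseudo-sequences can then be analysed like any other character matrix, which is what allows branch lengths and bootstrap support to be computed for the supertree.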

In addition, an MRP pseudo-sequence supertree based on nucleic acid source trees was also constructed in this study (Supplementary Fig. S6); it inappropriately placed MERS-CoVs in the SARS-CoV-2 clade. The problem with the supertree based on nucleic acid source trees could be caused by the fact that coronaviruses recombine frequently [38], and some recombination breakpoints may misdirect the reconstruction of the supertree. In contrast, this problem could be avoided by constructing a supertree based on protein sequences (Fig. 2), which would exclude the breakpoints in non-coding regions and minimize the influence of nonsense and silent mutations in coding regions. Consequently, the protein-sequence based MRP pseudo-sequence supertree is more reliable and accurate.

### Evaluate the validity of MRP supertree on analyzing simulation-based viral genomic evolution

To test whether the MRP pseudo-sequence supertree is preferable for the analysis of coronavirus phylogenetics, we used the ALF simulation framework to compare the MRP supertree with the full-length genomic sequence ML tree. In comparison with the real tree generated by ALF (Supplementary Fig. S7a), both MRP supertrees correctly resolved the topology of the phylogenetic tree, yet the MRP pseudo-sequence supertree (Supplementary Fig. S7c) showed more reasonable branch lengths than the MRP supertree constructed by Clann (Supplementary Fig. S7d). Of particular interest, the taxon SE008 was placed at an inappropriate position, an inconsistent node, in the ML tree (Supplementary Fig. S7b). The poor performance of the ML method here can principally be attributed to the lateral gene transfer (LGT) events introduced in the simulation. This is supported by the fact that the ML method constructed a phylogenetic tree that fit the corresponding real tree generated by ALF well as long as there was no LGT in the simulation (data not shown). It is well known that virus evolution is a complex interaction between viruses and hosts, in which RNA viruses exhibit remarkable genomic flexibility. Factors affecting viral genomic flexibility include, but are not limited to, LGT among viruses and hosts, recombination, and gain and loss of genes [32]. Viral evolution is therefore so intricate that the current model could not simulate it precisely. In particular, LGT events in the evolution of SARS-CoV-2 cannot be ignored in the simulation process. In this respect, the MRP supertree showed its advantage over the full-length genomic sequence ML tree.

### Clues to the origin of the SARS-CoV-2

As the MRP pseudo-sequence supertree and the ML tree show, RaTG13 (MN996532), bat-SL-CoVZC45 (MG772933), bat-SL-CoVZXC21 (MG772934) and the SARS-CoV-2s formed one major clade (Fig. 2, Supplementary Fig. S1). In particular, RaTG13, isolated from the bat Rhinolophus affinis (Yunnan, China), is the closest relative of the SARS-CoV-2s, which agrees with the previously reported phylogeny of SARS-CoV-2s constructed from the whole genome [39,40]. However, the phylogenetic distance between the SARS-CoV-2s and RaTG13 was distinctly shown in the MRP pseudo-sequence supertree (Fig. 2); by contrast, it was barely observable in the phylogenetic ML tree constructed in this study (Supplementary Fig. S1) or in the previous report [39].

To interpret the different proximity between the SARS-CoV-2s and RaTG13 in the MRP pseudo-sequence supertree relative to the ML tree, we examined the 10 source ML trees (Fig. 3) on which the MRP pseudo-sequence supertree was built. Consistent with the results of the MRP pseudo-sequence supertree and the ML tree, RaTG13 (MN996532) is identified as the coronavirus closest to the SARS-CoV-2s in the source ML trees of five CDSs: ORF1ab, spike protein, N protein, ORF6 and ORF7a (Fig. 3a, b, d, g, h). By contrast, the bat coronaviruses MG772933 and MG772934, both of which were isolated from the bat Rhinolophus sinicus (Zhejiang, China) [41], were the nearest relatives of the SARS-CoV-2s in the source ML trees based on M protein, ORF3a, and ORF8 (Fig. 3c, f, i). In addition, phylogenetic analysis of the E protein sequence placed the SARS-CoV-2s, MN996532, MG772933, and MG772934 on the same branch (Fig. 3e). The inconsistent phylogenetic relationships obtained from different genes cast serious doubt on the reliability of single-gene based phylogenetic analysis.

In any case, the distinct phylogenetic results above strongly indicate that the rates of evolution of the different protein sequences in SARS-CoV-2s are highly non-uniform. There probably exists another bat coronavirus in a different species that is the closer ancestor of SARS-CoV-2, and/or SARS-CoV-2 had already evolved substantially in its animal host. What is clear is that the validity of treating RaTG13 as the direct ancestor of SARS-CoV-2 is seriously in question, although they share 96.5% genome sequence identity. Taking RaTG13 as the last common ancestor of SARS-CoV-2 would seriously mislead phylogenetic inference of SARS-CoV-2.

### Mutants and evolution of SARS-CoV-2

Within the MRP pseudo-sequence supertree, nine sub-branches were resolved in the SARS-CoV-2 clade, labeled clade A to clade I in Fig. 2, which were absent in the phylogenetic ML tree based on full-length genomic sequence analysis (Supplementary Fig. S1). The sub-branches display an evolutionary scenario of the SARS-CoV-2s in human hosts from December 2019 to March 2020 all around the world, at least based on the 102 SARS-CoV-2 isolates in this study. Interrogating the ten CDSs of the SARS-CoV-2s shows that diverse mutations are distributed across five viral proteins: ORF1ab, N protein, spike protein, ORF3a, and ORF8 (Table 1). At most mutation sites described in this study, the original amino acid was substituted by one with altered chemical properties, except for L1599F in ORF1ab (clade A), V62L in ORF8 (clade H), and I1606V in ORF1ab (clade D1). Most strikingly, SARS-CoV-2s from the USA displayed common mutations in clades A, C, D, F, H, and I, which cover a large number of the countries listed in this study, including Spain, Finland, Sweden, Italy, Brazil, Australia, and South Korea. In particular, detection of the identical mutation in the ORF3a protein (G251V) in clade I indicates that the G251V mutant had spread by January 2020 or earlier in Sweden, Italy, Brazil, Australia, and the USA.

The ORF1ab gene, which takes up 75% of the whole coronavirus genome, encodes a series of non-structural proteins (nsp), which assemble to facilitate viral replication and transcription. Mutations in the amino acid sequence of ORF1ab are present in most clades, including clades A, B, C, D1 (within D), and E, which involve SARS-CoV-2s from Spain, the USA, and China, but no identical mutation site was detected. Among them, a mutation from proline to leucine (P4715L) in ORF1ab is located in Nsp12. Notably, Nsp12 is considered a primary target for nucleotide analog antiviral inhibitors such as remdesivir. Thus, the P4715L mutation could possibly make anti-coronavirus treatment less effective [42,43].

The spike protein, responsible for viral entry into host cells, exhibited two mutated sites, in clade A (D614G) and clade F (H49Y), respectively. The D614G mutation site in the spike protein is located between the receptor-binding domain (451–509) and the polybasic cleavage site (682–685) [44], and could possibly regulate the ability of SARS-CoV-2s to bind the human host ACE2 receptor or be involved in other steps of host cell invasion. Further studies and clinical observations are needed to determine whether mutation sites on various proteins change the ability of the virus to infect and its pathogenicity.

## Conclusion

Given that the vast majority of publicly available sequence data from complex genomes is derived from large-scale partial gene sequencing projects, it would be a serious handicap to limit phylogenetic analyses to alignments derived only from full-length sequences. However, we have shown that the particular pattern of gappiness found in alignments of partial gene sequences needs to be handled with care in order to obtain accurate phylogenies. Both masking and model-based approaches to missing data show potential for improving the accuracy of the trees obtained from gappy alignments. Their performance will have to be compared to other approaches to deal with incomplete alignments [14,23]. Such methods will be critical for the application of techniques that rely upon large numbers of accurate gene trees, as is common in phylogenomics [4,6].

## Phylogenetic algorithms: How to interpret multiple ML trees from the same dataset?

In a previous post, Steven mentioned that one of the datasets from the Grass Phylogeny Working Group has played an unexpectedly prominent role in evaluation of hybridization network algorithms.

These algorithms work by trying to construct a network from a set of rooted trees with overlapping sets of taxa, and the GPWG dataset provides six such trees, one from each of six different molecular loci. This dataset seems to have been introduced into the network literature by Bordewich et al. (2007), although it had previously been used for evaluations of supertree methods (Salamin et al. 2002; Schmidt 2003).

The data used consist of DNA sequences of three nuclear loci and three chloroplast genes. The original publication also has data provided for morphology and restriction sites, but these have not been used for the network analyses. One reason for interest in this dataset is the possibility of reticulation signals between the nuclear and chloroplast data sources. There are 66 taxa, although nearly half of them are composites formed from data for several different species in the same genus, and only a few of the taxa have data for all six datasets (the number of taxa varies from 19-65 per dataset). The data available are summarized in Table 7.1 from Schmidt (2003).

An important point about these data is that in the original GPWG publication the six gene trees were strict consensus trees from maximum-parsimony analyses, and so they have quite a number of polychotomies. These polychotomies were intended by the authors [personal communication] to express uncertainty about the topologies of the trees.

However, this uncertainty is not shown in the trees that have been used for network evaluation. According to Bordewich et al., the trees that they (and everyone else) used were reconstructed using the fastDNAmL program (i.e. maximum likelihood), and were supplied by Heiko Schmidt (see Schmidt 2003, p. 74). As expected, there are no polychotomies in these ML trees and no indication of uncertain topology and, of course, the tree topologies are somewhat different from the parsimony trees.

An important consequence is that there is more incompatibility among the dichotomous maximum-likelihood trees than there is among the polychotomous maximum-parsimony trees. That is, many of the ML incompatibilities are related to uncertainties in the MP trees. Unfortunately, most of the network algorithms that have been evaluated using these data require strictly dichotomous trees.
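A small sketch may help to show why resolved trees conflict more than trees with polychotomies. For rooted trees, two clusters (sets of taxa descending from a node) are compatible if they are disjoint or one contains the other. The taxon names and clusters below are invented for illustration and are not taken from the GPWG data.

```python
from itertools import product

def compatible(a, b):
    """Two rooted clusters are compatible if they are disjoint or nested."""
    a, b = set(a), set(b)
    return a.isdisjoint(b) or a <= b or b <= a

def count_conflicts(clusters1, clusters2):
    """Number of cluster pairs (one from each tree) that cannot coexist in one tree."""
    return sum(not compatible(a, b) for a, b in product(clusters1, clusters2))

# Two fully resolved (dichotomous) gene trees on hypothetical taxa t1..t4.
resolved_1 = [{"t1", "t2"}, {"t1", "t2", "t3"}]
resolved_2 = [{"t2", "t3"}, {"t1", "t2", "t3"}]

# The same trees with the uncertain node collapsed into a polychotomy.
collapsed_1 = [{"t1", "t2", "t3"}]
collapsed_2 = [{"t1", "t2", "t3"}]

print("resolved trees: ", count_conflicts(resolved_1, resolved_2), "conflict(s)")
print("collapsed trees:", count_conflicts(collapsed_1, collapsed_2), "conflict(s)")
```

A network algorithm given the resolved pair has to explain the conflict, possibly with a reticulation, even though the collapsed pair shows there may be nothing to explain.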

Also, the root seems to create problems for these data. The GPWG trees are all rooted with this topology:
(Flagellaria,((Elegia,Baloskion),(Joinvillea,((Streptochaeta,Anomochloa),(Pharus,(ingroup))))))
However, the position of this 7-taxon outgroup relative to the rest of the taxa varies among the gene trees. That is, the connection between the outgroup and the ingroup differs between the gene trees. So, some of the incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes.

Some of the ML datasets available have trees with the same set of ingroup / outgroup relationships as the GPWG trees, for example those datasets available with the CASS algorithm. However, some of the ML trees presented in the literature seem to be rooted in quite a different place, and this place differs between the gene trees. For example, in the data as presented with the HybridInterleave program, which are presented as 15 pairs of subtrees rather than as six complete trees, not only are the gene trees apparently rooted in different places, but the different subsets presented of the same gene tree are also sometimes rooted in different places.

It seems to me that there are two consequences arising from these points: (i) it is unnecessarily hard to construct a network from the ML data (because not all of the data signals relate to reticulation), and (ii) the resulting networks (as published) look rather unrealistic to a biologist (there are far too many reticulation nodes). Perhaps this isn't the most realistic dataset to be using for the evaluation of network algorithms.

Another commonly used dataset is the Ranunculus data from Lockhart et al. (2001). In this dataset much of the incompatibility signal also seems to be associated with an uncertain position for the root (see Morrison 2011, Fig. 4.7). In this case there are two gene trees (one nuclear and one chloroplast) that have similar unrooted topologies but have different outgroup-derived root locations. Dealing with root uncertainty may thus be one of the biggest confounding problems when trying to identify reticulation events.

A nexus treefile with the original six GPWG (consensus parsimony) trees is available at:
http://acacia.atspace.eu/data/GPWG.tre

A dendroscope treefile with the six ML trees is available at:

Bordewich M., Linz S., St. John K., Semple C. (2007) A reduction algorithm for computing the hybridization number of two trees. Evolutionary Bioinformatics 3: 86-98.

Grass Phylogeny Working Group (2001) Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden 88: 373-457.

Lockhart P., McLenachan P.A., Havell D., Glenny D., Huson D., Jensen U. (2001) Phylogeny, radiation, and transoceanic dispersal of New Zealand alpine buttercups: molecular evidence under split decomposition. Annals of the Missouri Botanical Garden 88: 458-477.

Salamin N., Hodkinson T.R., Savolainen V. (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Systematic Biology 51: 136-150.

Schmidt H.A. (2003) Phylogenetic Trees From Large Datasets. PhD thesis, Heinrich Heine University, Düsseldorf.

Wu Y. (2010) Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics 26: i140-i148.

## Introduction

Mosasauroid reptiles sensu Bell [1] (mosasaurids + aigialosaurids) were a diverse and globally distributed clade of lizards that invaded freshwater and marine environments during the Late Cretaceous [1–5]. Although multiple reptilian clades have become secondarily adapted to aquatic habitats, mosasauroids were one of the few to become fully aquatic—feeding and spending most of their life cycle in aquatic environments [6]. Some of the most relevant aspects of mosasauroid morphology that illustrate their transition to an aquatic lifestyle are concentrated in a set of changes in their pelvic and pedal anatomy. These changes, such as loss of contact between the sacral vertebrae and the pelvis followed by a reduction in the number of sacrals, characterize the so called hydropelvic condition [7]. Additionally, the development of hyperphalangy in the autopodium, which aids in locomotion under water, constitutes the hydropedal condition [8]. These two conditions of the pelvic and pedal morphologies as observed in most mosasauroids contrast to the connection between sacrum and ilium (termed plesiopelvic), as well as the typical phalangeal formula (plesiopedal), as seen in most limbed squamates [7, 8].

Despite numerous previous studies on mosasauroid phylogeny and evolution of pelvic and pedal characters, it is still uncertain whether mosasauroids acquired their aquatic adaptations only once in their evolutionary history [1, 9, 10], or multiple times [7, 8, 11, 12]. The hypothesis of convergent evolution of aquatic adaptations in mosasauroids has been proposed, and given further support in the past decade, due to the incorporation of new taxa (e.g. Dallasaurus and Tethysaurus) into phylogenetic analyses of mosasauroids. However, some other studies (with a similar taxonomic sampling) still recover fully aquatic mosasaurs as forming a single clade [11, 13]—e.g. the clade Natantia of Bell [1], also recovered by Caldwell [9, 10].

One aspect common to all analyses published so far is that they used only traditional unweighted maximum parsimony. Nevertheless, incorporating multiple methods that take into account the effect of highly plastic characters on phylogenetic inference can provide an important additional test of hypotheses of mosasauroid interrelationships, and of the potentially homoplastic origin of fully aquatic forms. In the present study, we provide the first analysis of mosasauroid relationships based on traditional (unweighted) maximum parsimony using two different coding schemes: contingent (Co-UMP) and multistate codings (Mu-UMP). Additionally, we utilize methods designed to downweight homoplasy and/or take evolutionary rates along with branch lengths into consideration: parsimony under implied weighting (IWMP), maximum likelihood (ML) and Bayesian inference. The latter methods should provide a more robust phylogenetic assessment of the recently proposed convergent evolution of aquatically adapted features than traditional maximum parsimony. We also comment on the benefits and limitations of likelihood methods in phylogenetic investigations using morphological data, and on their potential application to the study of fossil lineages.

Distance- and ML-based algorithms using reversible models can't find the root of trees. A classical method to root a tree is to use an outgroup (not outlier), which is a species/sequence known to directly descend from the root of the tree. In your case, it is relatively easy: add a chicken or fish ortholog to your dataset and put the root on the chicken/fish branch.
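As a rough sketch of what "put the root on the outgroup branch" means in practice, the unrooted tree can be held as a list of edges and then written out starting from the branch that leads to the outgroup. The edge-list representation, the function name `root_on_outgroup` and the example taxa are assumptions for illustration; real tools such as ETE3 or Biopython have their own rooting functions.

```python
def root_on_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to `outgroup`.

    edges: list of (node, node) pairs; leaves are taxon names, internal nodes
    carry arbitrary labels. Returns a Newick string (topology only).
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    if len(neighbours.get(outgroup, [])) != 1:
        raise ValueError("outgroup must be a leaf of the tree")

    def subtree(node, parent):
        # Everything reachable from `node` without going back through `parent`.
        children = [n for n in neighbours[node] if n != parent]
        if not children:
            return node
        return "(" + ",".join(sorted(subtree(c, node) for c in children)) + ")"

    attachment = neighbours[outgroup][0]
    return "(" + outgroup + "," + subtree(attachment, outgroup) + ");"

# Hypothetical unrooted tree: internal nodes are "x" and "y".
edges = [("x", "human"), ("x", "mouse"), ("x", "y"), ("y", "dog"), ("y", "chicken")]
print(root_on_outgroup(edges, "chicken"))
```

The same unrooted tree rooted on a different leaf gives a different set of clades, which is why the choice of outgroup matters.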

Outgroup rooting works well for single-copy genes used in species tree construction (e.g. ADH). However, it doesn't always work. The culprit is the "known" part. Say a gene has two copies, A and B, in vertebrates. Copy A was lost in rodents and copy B was lost in primates. If you choose a chicken A gene as the outgroup, the correct tree should be ((primate-A,chicken-A),rodent-B). Without knowing the true history, you may forcefully put the root at the chicken-A branch and build a wrong tree, ((primate-A,rodent-B),chicken-A).

There are a few other tree rooting methods. An easy approach is to put the root at the longest branch in the tree, assuming the presence of a molecular clock. When the species tree is known, you can root a gene tree by minimizing the number of gene duplication/loss events in the history. I generally prefer the latter approach when the relevant information is available.

## Section 2.6: Models and comparative methods

For the rest of this book I will introduce several models that can be applied to evolutionary data. I will discuss how to simulate evolutionary processes under these models, how to compare data to these models, and how to use model selection to discriminate amongst them. In each section, I will describe standard statistical tests (when available) along with ML and Bayesian approaches.

One theme in the book is that I emphasize fitting models to data and estimating parameters. I think that this approach is very useful for the future of the field of comparative statistics for three main reasons. First, it is flexible: one can easily compare a wide range of competing models to one's data. Second, it is extendable: one can create new models and automatically fit them into a preexisting framework for data analysis. Finally, it is powerful: a model fitting approach allows us to construct comparative tests that relate directly to particular biological hypotheses.