How To Draw Trees With Raxml Output

RAxML hands-on session

Welcome to the RAxML hands-on session. I will assume that you are running some flavor of Linux/Unix operating system and that you are familiar with some basic Linux/Unix commands. To get all example datasets, type "wget http://sco.h-its.org/exelixis/resource/download/hands-on/Hands-On.tar.bz2" or download here.

Graphical user interface

You may also test the new graphical user interface of RAxML that is available here.

Step 1: Download and installation

Go to Alexis' GitHub repository and download the latest RAxML version. When the download is completed type:
unzip standard-RAxML-master.zip

This will create a directory called standard-RAxML-master
Change into this directory by typing
cd standard-RAxML-master
Initially, we will need to compile the RAxML executables; if you don't know what a compiler is or why we need to do this, see here.
Because RAxML is available as a sequential, vectorized, parallel, and vectorized parallel program (all included in the same source code), we will have to compile, that is, generate executables for, the various versions of RAxML.

To obtain the sequential version of RAxML type:
make -f Makefile.gcc
This will generate an executable called raxmlHPC. If you then want to compile an additional version of RAxML, always make sure to type rm *.o before you use another Makefile; this removes the *.o files (compiled object files) left over from the previous build. Assume we are using a laptop with two cores and want to compile all RAxML versions that will run on it. We would type:

make -f Makefile.gcc -> generates a binary called raxmlHPC
rm *.o
make -f Makefile.SSE3.gcc -> generates a binary called raxmlHPC-SSE3; this is also a sequential version of RAxML that, however, exploits a type of very fine-grained parallelism in the processor. See here for more information on SSE3 and vector instructions.
rm *.o
make -f Makefile.PTHREADS.gcc -> generates a binary called raxmlHPC-PTHREADS that can run in parallel on several cores that share the same memory (that's the case on all common multi-core desktop and laptop computers). Pthreads is a library for generating and managing a specific sort of light-weight processes which are called threads.
rm *.o
make -f Makefile.SSE3.PTHREADS.gcc -> generates a Pthreads version of RAxML called raxmlHPC-PTHREADS-SSE3 as described above, but it also uses vector instructions inside each core/processor. Usually the vectorized versions of the code (with SSE3) are faster than the non-vectorized ones because more arithmetic instructions can be executed per processor clock cycle.
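
The build sequence above can also be scripted. The following is a minimal sketch in Python, assuming the four Makefile names listed above and a make program on the PATH; the function name build_all and the dry_run switch are illustrative choices and not part of RAxML.

```python
import glob
import os
import subprocess

# Makefiles named in the steps above, in the order they are used there.
MAKEFILES = [
    "Makefile.gcc",
    "Makefile.SSE3.gcc",
    "Makefile.PTHREADS.gcc",
    "Makefile.SSE3.PTHREADS.gcc",
]


def build_all(source_dir, makefiles=MAKEFILES, dry_run=True):
    """Build every RAxML variant, deleting stale *.o files before each build.

    With dry_run=True nothing is deleted or executed; the make commands
    that would be run are only returned.
    """
    commands = []
    for makefile in makefiles:
        # Equivalent of "rm *.o": object files from the previous variant
        # must not be reused by the next Makefile.
        for obj in glob.glob(os.path.join(source_dir, "*.o")):
            if not dry_run:
                os.remove(obj)
        command = ["make", "-f", makefile]
        commands.append(command)
        if not dry_run:
            subprocess.run(command, cwd=source_dir, check=True)
    return commands


for command in build_all("standard-RAxML-master"):
    print(" ".join(command))
```

Calling build_all("standard-RAxML-master", dry_run=False) would actually run the builds; the default only prints the plan, which is a safer starting point.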

Once we are done with compiling the code we can either execute it locally in the directory where the executables are by typing, for instance,
./raxmlHPC
or copy all the executables into our local directory of binary files. My username is stamatak, so we would copy all executables into the respective binary directory by typing
cp raxmlHPC* ~stamatak/bin/
When this is done I can then execute RAxML in any directory by typing
raxmlHPC
without the ./ in front of it. To get an overview of available commands type:
raxmlHPC -h
We will also need some tree visualization tool to look at the output trees. I always use Dendroscope because it can handle large trees and is easy to install. Please go to the Dendroscope homepage and follow the installation instructions provided there.

Step 2: Test Datasets

Here is a list of test datasets that we will use and need to download:

Alignment of binary characters
Alignment of DNA characters

a partition file that will allow us to partition the dataset into two partitions: one for the 1st and 2nd position and one for the 3rd DNA position
another partition file that will allow us to partition the dataset into two partitions from position 1-30 and position 31-60
a secondary structure file that will allow us to use RNA secondary structure models for this DNA dataset (strictly speaking it would need to be RNA, but this is just an example so it doesn't matter)

Alignment of protein characters
Alignment of multi-state characters
Alignment of DNA and protein characters

a respective partition file

A binary backbone constraint tree for our DNA dataset
A multi-furcating constraint tree for our DNA dataset

Step 3: Getting Started

Let's get started with a simple ML search on binary data:
raxmlHPC -m BINGAMMA -p 12345 -s binary.phy -n T1
and open the resulting tree called RAxML_result.T1 with Dendroscope. BINGAMMA tells RAxML that we are using binary data and that we want to use the GAMMA model of rate heterogeneity. The file name appendix passed via -n is arbitrary; for this run all output files will be named RAxML_FileTypeName.T1, for instance. Every RAxML run will require a different run ID in order to avoid overwriting previous results.

We could also type:
raxmlHPC-SSE3 -m BINCAT -p 12345 -s binary.phy -n T2
CAT is a memory- and time-efficient approximation for the standard GAMMA model of rate heterogeneity. It is particularly convenient for saving memory on very large datasets. However, it should not be used for datasets with fewer than, say, 50-100 taxa. Also, it should not be confused with mixture models in phylogenetics. WARNING: Likelihood values obtained from the CAT model should not be used as a basis for comparing trees by means of their likelihood values!

As starting trees RAxML uses randomized stepwise addition parsimony trees, i.e., it will not generate the same starting tree every time. You can force it to do so either by providing a fixed random number seed:
raxmlHPC -m BINGAMMA -p 12345 -s binary.phy -n T3
or by passing a starting tree
raxmlHPC -m BINGAMMA -t startingTree.txt -s binary.phy -n T4
If you want to conduct multiple searches for the best tree you can type:
raxmlHPC -m BINGAMMA -p 12345 -s binary.phy -# 20 -n T5
Here RAxML will carry out 20 ML searches on 20 randomized stepwise addition parsimony trees.

We can now do the same thing for the DNA and protein data:
raxmlHPC -m GTRGAMMA -p 12345 -s dna.phy -# 20 -n T6
and
raxmlHPC -m PROTGAMMAWAG -p 12345 -s protein.phy -# 20 -n T7
For protein data we may choose among a set of standard models with fixed transition rates that are typically estimated from a large number of real-world datasets: DAYHOFF, DCMUT, JTT, MTREV, WAG, RTREV, CPREV, VT, BLOSUM62, MTMAM, LG. We may also choose to use empirical base frequencies drawn from the alignment (by essentially just counting the frequency of occurrence of the amino acids), rather than use the pre-defined base frequencies that come with the models:
raxmlHPC -m PROTGAMMAJTTF -p 12345 -s protein.phy -n T8
The corresponding model names are DAYHOFFF, DCMUTF, JTTF, MTREVF, WAGF, RTREVF, CPREVF, VTF, BLOSUM62F, MTMAMF, LGF. Here we may also use the CAT approximation of rate heterogeneity again:
raxmlHPC -m PROTCATJTTF -p 12345 -s protein.phy -n T9

A question that arises is of course which protein model to select. What I typically do is the following: I generate a reasonable reference tree and then just compute the likelihood scores (without conducting a tree search) under all available models. I then simply use the model that yields the best score under GAMMA. There is also a perl script available to do this automatically here.
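
The final selection step of this procedure is just a maximum over log-likelihood scores. A minimal sketch in Python follows, assuming the scores have already been collected by hand from the individual runs; the numbers below are made up for illustration and best_model is a hypothetical helper, not a RAxML feature.

```python
def best_model(log_likelihoods):
    """Return the model whose log-likelihood is highest (closest to zero)."""
    if not log_likelihoods:
        raise ValueError("no scores given")
    return max(log_likelihoods, key=log_likelihoods.get)


# Hypothetical log-likelihood scores on one fixed reference tree under GAMMA.
scores = {"WAG": -3412.7, "JTT": -3405.2, "LG": -3398.9}
print(best_model(scores))
```

Log-likelihoods are negative, so the best score is the largest value, not the largest absolute value.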

Note that in more recent RAxML versions the best protein model with respect to the likelihood on a fixed, reasonable tree can also be determined automatically by RAxML itself using the following command:
raxmlHPC -p 12345 -m PROTGAMMAAUTO -s protein.phy -n AUTO

It is also possible to estimate a GTR model for protein data, but this should not really be used on small datasets because there is not enough data to estimate all 189 rates in the matrix:
raxmlHPC -p 12345 -m PROTGAMMAGTR -s protein.phy -n GTR
Here is the command to conduct an ML search on multi-state morphological datasets:
raxmlHPC -p 12345 -m MULTIGAMMA -s multiState.phy -n T10
There are different models available for multi-state characters that can be specified via -K ORDERED|MK|GTR. The default is GTR, so we just executed a multi-state inference under the GTR model. For MK we can execute
raxmlHPC -p 12345 -m MULTIGAMMA -s multiState.phy -K MK -n T11
and for ordered states
raxmlHPC -p 12345 -m MULTIGAMMA -s multiState.phy -K ORDERED -n T12

Step 4: Bootstrapping

Now let's conduct a simple bootstrap analysis. Initially, let's try to find the best-scoring ML tree for a DNA alignment. We refer to this as the best-scoring tree because the ML search problem is computationally hard and we can thus generally not find the optimal tree under ML for a given alignment.

Let's execute:
raxmlHPC -m GTRGAMMA -p 12345 -# 20 -s dna.phy -n T13
This command will generate 20 ML trees on distinct starting trees and also print the tree with the best likelihood to a file called RAxML_bestTree.T13. Now we will want to get support values for this tree, so let's conduct a bootstrap search:
raxmlHPC -m GTRGAMMA -p 12345 -b 12345 -# 100 -s dna.phy -n T14
We need to tell RAxML that we want to do bootstrapping by providing a bootstrap random number seed via -b 12345 and the number of bootstrap replicates we want to compute via -# 100. Note that RAxML also allows for automatically determining a sufficient number of bootstrap replicates; in this case you would replace -# 100 by one of the bootstrap convergence criteria -# autoFC, -# autoMRE, -# autoMR, -# autoMRE_IGN.

Having computed the bootstrap replicate trees that will be printed to a file called RAxML_bootstrap.T14 we can now use them to draw bipartitions on the best ML tree as follows:
raxmlHPC -m GTRCAT -p 12345 -f b -t RAxML_bestTree.T13 -z RAxML_bootstrap.T14 -n T15
This call will produce two output files that can be visualized with Dendroscope: RAxML_bipartitions.T15 (support values assigned to nodes) and RAxML_bipartitionsBranchLabels.T15 (support values assigned to branches of the tree). Note that for unrooted trees the correct representation is actually the one with support values assigned to branches and not nodes of the tree!
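
The three calls of this step depend on each other through their run IDs and output file names, which is easy to get wrong when typing them by hand. Below is a minimal sketch in Python that only assembles the three command lines shown above; the helper names bootstrap_workflow and replicate_arg are hypothetical, and only flags and file names that appear in this section are used.

```python
# Bootstrap convergence criteria accepted in place of a replicate count.
CONVERGENCE = {"autoFC", "autoMRE", "autoMR", "autoMRE_IGN"}


def replicate_arg(value):
    """Validate the value passed to -#: a positive integer or a criterion."""
    text = str(value)
    if text in CONVERGENCE or (text.isdigit() and int(text) > 0):
        return text
    raise ValueError("invalid replicate specification: " + text)


def bootstrap_workflow(alignment, model="GTRGAMMA", seed=12345,
                       searches=20, replicates=100,
                       run_ids=("T13", "T14", "T15")):
    """Return the ML search, bootstrap, and bipartition-drawing commands."""
    best_id, boot_id, draw_id = run_ids
    seed = str(seed)
    return [
        # 1. ML searches; the best tree goes to RAxML_bestTree.<best_id>.
        ["raxmlHPC", "-m", model, "-p", seed, "-#", str(searches),
         "-s", alignment, "-n", best_id],
        # 2. Standard bootstrap; replicates go to RAxML_bootstrap.<boot_id>.
        ["raxmlHPC", "-m", model, "-p", seed, "-b", seed,
         "-#", replicate_arg(replicates), "-s", alignment, "-n", boot_id],
        # 3. Draw the bootstrap bipartitions on the best ML tree.
        ["raxmlHPC", "-m", "GTRCAT", "-p", seed, "-f", "b",
         "-t", "RAxML_bestTree." + best_id,
         "-z", "RAxML_bootstrap." + boot_id, "-n", draw_id],
    ]


for command in bootstrap_workflow("dna.phy"):
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) once the printed commands look right.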

We can also use the bootstrap replicates to build consensus trees. RAxML supports strict, majority rule, and extended majority rule consensus trees:

strict consensus: raxmlHPC -m GTRCAT -J STRICT -z RAxML_bootstrap.T14 -n T16
majority rule: raxmlHPC -m GTRCAT -J MR -z RAxML_bootstrap.T14 -n T17
extended majority rule: raxmlHPC -m GTRCAT -J MRE -z RAxML_bootstrap.T14 -n T18

Step 5: Rapid Bootstrapping

Because bootstrapping is very compute-intensive, RAxML also offers a rapid bootstrapping algorithm that is one order of magnitude faster than the standard algorithm discussed above. To invoke it type, e.g.:
raxmlHPC -m GTRGAMMA -p 12345 -x 12345 -# 100 -s dna.phy -n T19
So the only difference here is that you use -x instead of -b to provide a bootstrap random number seed. Otherwise you can also choose different models of substitution and use the bootstrap convergence criteria with rapid bootstrapping as well.

The nice thing about rapid bootstrapping is that it allows you to do a complete analysis (ML search + bootstrapping) in one single step by typing
raxmlHPC -f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -s dna.phy -n T20
If called like this RAxML will do 100 rapid bootstrap searches and 20 ML searches, and return the best ML tree with support values to you via one single program call.

Step 6: Partitioned Analyses

A common task is to conduct partitioned analyses. We need to pass the information about partitions to RAxML via a simple plain text file that is passed via the -q parameter. For a simple partition of our DNA dataset we can type:
raxmlHPC -m GTRGAMMA -p 12345 -q simpleDNApartition.txt -s dna.phy -n T21
The file simpleDNApartition.txt partitions the alignment into two regions as follows:

DNA, p1=1-30
DNA, p2=31-60

p1 and p2 are just arbitrarily chosen names for the partitions. We also need to tell RAxML what kind of data the partition contains (see below). If we partition the dataset like this, the alpha shape parameter of the Gamma model of rate heterogeneity, the empirical base frequencies, and the evolutionary rates in the GTR matrix will be estimated independently for every partition. If we type:
raxmlHPC -M -m GTRGAMMA -p 12345 -q simpleDNApartition.txt -s dna.phy -n T22
RAxML will also estimate a separate set of branch lengths for every partition. If we want to do a more elaborate partitioning by 1st, 2nd and 3rd codon position we can execute:
raxmlHPC -m GTRGAMMA -p 12345 -q dna12_3.partition.txt -s dna.phy -n T23
The partition file now looks like this:

DNA, p1=1-60\3,2-60\3
DNA, p2=3-60\3

Here, we infer distinct model parameters jointly for all 1st and 2nd positions in the alignment and separately for the 3rd position in the alignment. We can of course also use partitioned datasets that contain both DNA and protein data, e.g.:
raxmlHPC -m GTRGAMMA -p 12345 -q dna_protein_partitions.txt -s dna_protein.phy -n T24
The partition file looks as follows:

DNA, p1 = 1-50
WAG, p2 = 51-110

Here we are telling RAxML that partition p1 is a DNA partition (for which GTR+GAMMA will be used) and that partition p2 is a protein partition for which WAG+GAMMA will be used. Note that the parameter -m is now only used to extract the desired model of rate heterogeneity, which will be used for all partitions, i.e., we could also type:
raxmlHPC -m PROTGAMMAWAG -p 12345 -q dna_protein_partitions.txt -s dna_protein.phy -n T25
which will be exactly equivalent. If we want to use a different protein substitution model for p2 we may use a partition file that looks like this:

DNA, p1 = 1-50
JTTF, p2 = 51-110

Now JTT with empirical base frequencies will be used. The format is analogous for binary partitions, e.g., assuming that p1 is a binary partition we would write

BIN, p1 = 1-50

and for multi-land partitions, e.yard.,

MULTI, p1 = ane-50

Don't forget to specify your substitution model for multi-state regions via -K (the chosen model will then be applied to all multi-state partitions).
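
To see which alignment sites a partition line actually selects, in particular with the backslash stride notation used for codon positions (e.g. 1-60\3 meaning every third site from 1 to 60), the line can be unpacked by hand. Here is a minimal sketch in Python; expand_sites and parse_partition_line are hypothetical helpers written for this tutorial, they only cover the line formats shown above, and they are not RAxML's own parser.

```python
def expand_sites(spec):
    """Expand a site specification such as r'1-60\\3,2-60\\3' into site numbers."""
    sites = set()
    for part in spec.split(","):
        # "1-60\3" -> span "1-60", stride "3"; no backslash means stride 1.
        span, _, stride = part.strip().partition("\\")
        start, _, end = span.partition("-")
        step = int(stride) if stride else 1
        last = int(end) if end else int(start)
        sites.update(range(int(start), last + 1, step))
    return sorted(sites)


def parse_partition_line(line):
    """Split 'TYPE, name = sites' into (data type, partition name, site list)."""
    data_type, _, rest = line.partition(",")
    name, _, spec = rest.partition("=")
    return data_type.strip(), name.strip(), expand_sites(spec)


data_type, name, sites = parse_partition_line(r"DNA, p1=1-60\3,2-60\3")
print(data_type, name, sites[:6])
```

A quick sanity check for any partition file is that the site lists of all partitions together cover every alignment column exactly once.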

Step 7: Secondary Structure Models

Specifying secondary structure models for an RNA alignment works slightly differently because we read in a plain RNA alignment and then need to tell RAxML, by an additional text file that is passed via -S, which RNA alignment sites need to be grouped together. We do this in a standard bracket notation written into a plain text file, e.g., our DNA test alignment has 60 sites, thus our secondary structure file needs to contain a string of 60 characters like this one:

..................((.......))............(.........)........

The '.' symbol indicates that this is just a normal RNA site while the brackets indicate stems. Obviously, the number of opening and closing brackets must match. In addition, it is also possible to specify pseudoknots with additional symbols: <>[]{} for example:

..................((.......)).......{....(....}....)........
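
Two properties of such a string are easy to check before handing it to RAxML: its length must equal the number of alignment sites, and every bracket type must be balanced. The following is a minimal sketch in Python; check_structure is a hypothetical helper, and it treats each bracket type independently so that crossing pseudoknot brackets are accepted. The short strings are made-up examples rather than the 60-site file from the test data.

```python
# Opening bracket -> closing bracket for stems and pseudoknots.
PAIRS = {"(": ")", "<": ">", "[": "]", "{": "}"}


def check_structure(structure, n_sites):
    """Return True if the string has n_sites symbols and balanced brackets."""
    if len(structure) != n_sites:
        return False
    openers = {close: open_ for open_, close in PAIRS.items()}
    depth = {open_: 0 for open_ in PAIRS}
    for symbol in structure:
        if symbol == ".":
            continue  # ordinary, unpaired site
        if symbol in PAIRS:
            depth[symbol] += 1
        elif symbol in openers:
            depth[openers[symbol]] -= 1
            if depth[openers[symbol]] < 0:
                return False  # closing bracket without a partner
        else:
            return False  # symbol that is not part of the notation
    return all(count == 0 for count in depth.values())


print(check_structure("..((..))..", 10))
print(check_structure("..({..)}..", 10))
print(check_structure("..((..)...", 10))
```

The first two calls describe a plain stem and a stem crossed by a pseudoknot and both pass; the third has an unmatched opening bracket and fails.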

In terms of models there are 6-state, 7-state and 16-state models for accommodating secondary structure that are specified via -A. Available models are: S6A, S6B, S6C, S6D, S6E, S7A, S7B, S7C, S7D, S7E, S7F, S16, S16A, S16B. The default is the GTR 16-state model (-A S16). In RAxML the same nomenclature as in PHASE is used, so please consult the PHASE manual for a nice and detailed description of these models.

For our small example datasets we would run a secondary structure analysis like this:
raxmlHPC -m GTRGAMMA -p 12345 -S secondaryStructure.txt -s dna.phy -n T26
A common question is whether secondary structure models can also be partitioned. This is presently not possible. However, you can partition the underlying RNA data, e.g., use two partitions for our DNA dataset as before. What RAxML will do internally, though, is to generate a third partition for secondary structure that does not take into account that distinct secondary structure site pairs may stem from different partitions of the alignment.

Step 8: Using the Pthreads Version

Using the Pthreads version is fairly straightforward. The only really important thing you need to know is that you should never run it with more threads (light-weight processes) than you have cores (processors/CPUs) available on your system! This may lead to serious performance degradation. Assume that you have 4 cores but start 5 threads. In this case two threads will be continuously competing to get compute time on one core and thereby also slow down the remaining threads that will be waiting for the two adversary threads to finish their dispute.
In order to run the Pthreads version you only need to use the right executable (raxmlHPC-PTHREADS or raxmlHPC-PTHREADS-SSE3) and specify one additional parameter, the number of threads you want to use via -T, e.g.:
raxmlHPC-PTHREADS -T 2 -p 12345 -m PROTGAMMAWAG -s protein.phy -n T27
This will run nicely on my laptop that has two cores. If you want to see them running, type top and then 1, which will show you the computational load on all cores of your computer. You will see that both cores are running at about 100% to compute likelihoods on trees.
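
The rule of never starting more threads than cores can be enforced when the command line is assembled. A minimal sketch in Python follows; safe_thread_count is a hypothetical helper, and note the assumption that os.cpu_count() is a good proxy for the core count: it reports logical CPUs, which on machines with hyper-threading can be larger than the number of physical cores.

```python
import os


def safe_thread_count(requested, cores=None):
    """Cap the requested thread count at the number of available cores."""
    if cores is None:
        # Logical CPU count; may exceed the number of physical cores.
        cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


threads = safe_thread_count(5, cores=4)
command = ["raxmlHPC-PTHREADS", "-T", str(threads), "-p", "12345",
           "-m", "PROTGAMMAWAG", "-s", "protein.phy", "-n", "T27"]
print(" ".join(command))
```

With 4 cores and 5 requested threads the value passed to -T becomes 4, which avoids the situation described above where two threads compete for one core.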

Step 9: Constraint Trees

When you are passing trees to RAxML it usually expects them to be bifurcating. The rationale for this is that a multifurcating tree actually represents a set of bifurcating trees and it is unclear how to properly resolve the multifurcations in general. Also, for computing the likelihood of a tree we need a bifurcating tree, therefore resolving multi-furcations, either at random or in some more clever way, is a necessary prerequisite for computing likelihood scores.
I personally have a strong dislike for constraint trees because they bias the analysis a priori using some biological knowledge that may not necessarily represent the signal coming from the data one is analyzing. The only purpose for which they may be useful is to assess various hypotheses of monophyly by imposing constraint trees and then conducting likelihood-based significance tests to compare the trees that were generated by the various constraints.
Overall RAxML offers two types of constraint trees: binary backbone constraints and multifurcating constraint trees.
In a backbone constraint we pass a bifurcating tree to RAxML whose structure will not be changed at all and just add those taxa in the alignment that are not contained in the binary backbone constraint to the tree via an ML estimate (see below).

For our DNA dataset we may specify a backbone constraint like this: "(Mouse, Rat, (Human, Whale));" and type:
raxmlHPC-SSE3 -p 12345 -r backboneConstraint.txt -m GTRGAMMA -s dna.phy -n T28
Multi-furcating constraints are slightly different in that they maintain the monophyletic structure of the backbone, but obviously taxa within a monophyletic clade may be moved around. Also, taxa that are not contained in the multi-furcating constraint tree, which need not be comprehensive, may be placed into any branch of the tree via ML.

For our DNA dataset we can specify a multi-furcating constraint tree like this: "(Mouse, Rat, Frog, Loach, (Human, Whale,Carp));" and type:
raxmlHPC-SSE3 -p 12345 -g multiConstraint.txt -m GTRGAMMA -s dna.phy -n T29
Obviously, it doesn't make sense to specify a binary backbone constraint that contains all taxa (given that the backbone will remain fixed there is nothing to rearrange), while for multifurcating constraints it makes sense, for instance to resolve the multi-furcations.

A frequent misconception about how constraint trees work in RAxML. A frequent problem users encounter is RAxML behavior when they use incomplete constraint trees, i.e., when the constraint does not contain all taxa in the alignment. Consider an alignment with 25 taxa in which you want to enforce two phylogenetic groups for taxa A1, A2,...,A5 and B1, B2. Let's assume that the other taxa (not contained in the constraint) are called X1, X2, ..., X17.
Now if you specify a constraint like this: "((A1,A2,A3,A4,A5),B1,B2);" it is not clear how the remaining taxa (X1,...,X17) shall behave. In RAxML I have decided that they can be inserted anywhere in the tree and potentially within the monophyletic groups A and B. The property that is maintained (or shall be maintained if the implementation is correct) is that, in every tree using the above constraint, you will always find a branch that will split the taxon set such that taxa A1,...,A5 are on one side of that branch and taxa B1, B2 are on the other side of that branch. The positions of the taxa X1,...,X17 are completely irrelevant in this case, since your constraint only says something about A and B. If you don't want the X-taxa to appear inside the groups A and B you will need to specify a comprehensive constraint (including all taxa) like this: "((A1,A2,A3,A4,A5),(B1,B2),(X1,X2,...,X17));".
Please take a look at the tree below that has been built with the constraint "((A1,A2,A3,A4,A5),B1,B2);" and convince yourself that it really respects the constraint according to the definition used in RAxML:
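
For readers who would rather check this definition programmatically than by eye, the property "there is a branch with all A taxa on one side and all B taxa on the other" can be tested on any Newick tree. The sketch below is written in Python under loud assumptions: it is not RAxML code, the helper names are hypothetical, taxon names must not contain brackets, commas or whitespace, and the example trees are small made-up cases rather than the 25-taxon tree discussed above.

```python
import re


def leaf_sets(newick):
    """Return, for every branch, the set of taxa in the subtree below it."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", newick)
    stack, clades, previous = [], [], None
    for token in tokens:
        if token == "(":
            stack.append(set())
        elif token == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif token not in (",", ";") and previous != ")":
            # A leaf label; drop an optional ":branch_length" suffix.
            name = token.split(":")[0]
            if name:
                clades.append(frozenset([name]))
                if stack:
                    stack[-1].add(name)
        # Tokens right after ")" are support values or lengths: ignored.
        previous = token
    return clades


def respects_constraint(newick, group_a, group_b):
    """True if some branch has all of group_a on one side, group_b on the other."""
    a, b = set(group_a), set(group_b)
    for clade in leaf_sets(newick):
        if (a <= clade and not (b & clade)) or (b <= clade and not (a & clade)):
            return True
    return False


tree_ok = "((A1,(A2,X1)),(B1,(B2,X2)),X3);"
tree_bad = "((A1,B1),(A2,B2),X1);"
print(respects_constraint(tree_ok, ["A1", "A2"], ["B1", "B2"]))
print(respects_constraint(tree_bad, ["A1", "A2"], ["B1", "B2"]))
```

In the first tree X1 sits inside group A and X2 next to B2, yet a separating branch still exists, which is exactly the behavior described above; in the second tree no such branch exists.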

Step 9.1 Constraining sister taxa

Now, if you want to constrain potential sister taxa to strict monophyly, e.g., mouse and rat, while other taxa will be allowed to freely move around the tree during ML topology optimization, you could do the following: Assume Mouse and Rat shall be forced to be monophyletic. You can specify a multi-furcating backbone constraint (containing, e.g., all taxa in dna.phy) as follows: "(Cow, Carp, Chicken, Human, Loach, Seal, Whale, Frog, (Rat, Mouse));" and store this in a file monophylyConstraint.txt. Then you can call:
raxmlHPC-SSE3 -p 12345 -g monophylyConstraint.txt -m GTRGAMMA -s dna.phy -n T29monophyly
Assume that now you would like to force the Frog and the Mouse to be monophyletic. Here you'd write: "(Cow, Carp, Chicken, Human, Loach, Seal, Whale, Rat, (Frog, Mouse));" and store this in a file called, e.g., weirdMonophyly.txt and execute:
raxmlHPC-SSE3 -p 12345 -g weirdMonophyly.txt -m GTRGAMMA -s dna.phy -n T29weird_monophyly
If the constraint doesn't make much sense (e.g., Frog sister to Mouse) you will get a worse likelihood score for this, and you can then use likelihood tests to determine whether the two alternative hypotheses yield trees with significantly different LnL scores.

Step 10: Outgroups

An outgroup is essentially just a tree drawing option that allows you to root the tree at the branch leading to that outgroup. If your outgroup contains more than one taxon it may occur that the outgroups cease to be monophyletic; in this case RAxML will print out a respective warning. For our DNA dataset we can specify a single outgroup like this:
raxmlHPC-SSE3 -p 12345 -o Mouse -m GTRGAMMA -s dna.phy -n T30
If we want Mouse and Rat to be our outgroups we just type:
raxmlHPC-SSE3 -p 12345 -o Mouse,Rat -m GTRGAMMA -s dna.phy -n T31