P U Z Z L E 2.5.1: User Manual


1 October 1996

 

PUZZLE is (c) 1995-96 by Korbinian Strimmer and Arndt von Haeseler

Zoologisches Institut
Universitaet Muenchen
Luisenstrasse 14
D-80333 Muenchen
Germany

Voice: +49-89-5902-327
Fax: +49-89-5902-474

Email: strimmer@zi.biologie.uni-muenchen.de
Email: arndt@zi.biologie.uni-muenchen.de

 


LEGAL STUFF

The PUZZLE software package and its accompanying documentation are provided as
is, without guarantee of support or maintenance. The copyright holders make no
express or implied warranty of any kind with respect to this software, including
implied warranties of merchantability or fitness for a particular purpose, and
are not liable for any damages resulting in any way from its use.

Everyone is granted permission to copy and redistribute this software package
and its accompanying documentation, provided that:

1) All copies are identical to the original
2) No charge is made for this software or works derived from it, with the
exception of a distribution fee to cover the cost of materials and/or
transmission.

If you plan to redistribute PUZZLE on the Internet or on CD-ROM please
obtain permission of the authors before. Permission to use portions of
the source code will be granted on request.


QUICK READING

For those of you who don't have much time to spend on reading long manuals:
if you read the two sections "INPUT/OUTPUT CONVENTIONS" and "QUICK START"
you should be able to use the PUZZLE program, especially if you have some
experience in using the PHYLIP programs. The other sections just give more
details about options etc. that you may read at a later time.


INTRODUCTION

PUZZLE is an maximum likelihood program written in ANSI C for analysing
nucleotide and amino acid sequence data. As tree search procedure it
uses the quartet puzzling algorithm (Strimmer and von Haeseler 1996),
a procedure that can be applied to data sets with a large number of taxa.
Quartet puzzling automatically computes estimates of support for each
internal branch and returns both completely resolved and multifurcating
trees depending on the phylogenetic quality of the data. PUZZLE also
computes maximum likelihood distances for all popular models of sequence
evolution (TN, HKY, F84, and submodels; Dayhoff, JTT, mtREV) and maximum
likelihood branch lengths on user specified trees. PUZZLE also estimates
the TN/HKY model parameters for a given data set and tries to select the
optimal model of sequence evolution for your data.


INPUT/OUTPUT CONVENTIONS

Sequence input:

PUZZLE requests sequence input in PHYLIP INTERLEAVED format (sometimes also
called PHYLIP 3.4 format). Many sequence editors and alignment programs
(e.g. CLUSTAL W) are able to save data in this format. Take a look at the
three example files that come with the PUZZLE program ("globin.a", "wolf.n",
"atp6.a"), and you know how the sequence data should look like. The default
name of the sequence input file is "infile". If an "infile" is not present
PUZZLE will prompt the user for an alternative file name. Taxa names in the
"infile" are allowed to contain blanks, but these blanks will be converted to
underscores "_". Sequences can be in upper or lower case, any spaces or
control characters are ignored. The dot "." is recognized as matching
character, it can be used in all sequences (of course not in the first
sequence!). For nucleotides valid symbols are A, C, G, T and U, and for
amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y.
All other visible characters (incuding gaps, question marks etc.) are treated
as N (DNA/RNA) or X (amino acids). The first taxon of the data set is
considered the default outgroup.

General output:

All results of a PUZZLE analysis are written to the file "outfile". If you have
the option "show all unresolved quartets" invoked you'll also get a file called
"outqlist" that list all these quartets.

Distance output:

PUZZLE automatically computes maximum likelihood distances. They are written
in the "outfile" and in the separate file "outdist". The format is PHYLIP
compatible, you can for example use the "outdist" file directly to do a
neighbor-joining tree with the corresponding PHYLIP programs.

Tree output:

The quartet puzzling tree showing support values for each internal branch
and with corresponding branch lengths is plotted as ASCII drawing in the
"outfile". The same tree is written into the "outtree" file. The output
tree convention follows the one adopted by the CLUSTAL W team: the tree topology
is described by the usual round brackets (a,b,(c,d)); and the branch lengths
are written after the colon a:0.22,b:0.33. To be able to display support
values for each branch simultaneosly with branch lengths these values
are written as internal labels, i.e. they follow directly after each node
before the branch lengths. Here is an example:

(Gibbon:0.1393,((Human:0.0414,Chimpanzee:0.0538)99:0.0175,Gorilla:0.0577)
98:0.0531,Orangutan:0.1003);

With TreeView and TreeTool it is possible to view this tree with its
branch lengths AND simultaneously with the support values for the internal
branches (here 98% and 99%). Note, the PHYLIP programs DRAWTREE and
DRAWGRAM may also be used with the CLUSTAL W treefile format but the
current versions simply ignore the internal labels and print only the
tree topology with branch lengths (sorry for the wrong information
given here in the manual of version 2.5!). All these tree drawing
programs are also able to process multifurcating trees.

Tree input:

PUZZLE also reads input trees (optionally!). The default name for the file
containing the input tree is "intree" but if you choose the input tree option
and there's no "intree" present you will be prompted for an alternative name.
The format of the input tree file is identical to the tree in the "outtree"
file. However, it is sufficient to provide the tree topology only, you don't
need to specify branch lengths (that are ignored anyway) or internal labels
(that are read, stored, and written back to the "outtree" file). It is
important that the input tree is unrooted, and that taxa names in the input
tree file don't contain blanks (use underscores!). The format of the "intree"
file is easy: just write the tree into the file (there are no additional numbers
indicating how many trees because PUZZLE reads only one tree).


QUICK START

Prepare your sequence input file "infile" and optionally your tree input file
"intree". Then start the PUZZLE program. Depending on your data PUZZLE will
choose automatically the nucleotide or the amino acid mode. If more than 85%
of the characters (not counting the - and ?) in the sequence are from A, C, G,
T, U or N, the sequence will be assumed to be nucleotide. If your data set
contains amino acids PUZZLE recognizes whether you have amino acids encoded on
mtDNA or on nuclear DNA, and selects accordingly one of the corresponding models
of amino acid evolution. If your data set contains nucleotides the model of
sequence evolution chosen is the Tamura-Nei (1993) model. The corresponding
parameters need not to be specified, they will be automatically estimated by a
maximum likelihood procedure from the data. If PUZZLE detects an "intree"
file it automatically switches to the input tree mode. Then, a menu
(PHYLIP "look and feel") appears with all recommended options set. Ususally,
the preselected options are very well adapted to your data but you can still
change all available options at this stage. Then type "y" at the input
prompt and start the analysis. If a particular computation takes more than
15 minutes PUZZLE prints out an estimate of the remaining computing time for
this task. When the analysis is finished you will find output files
corresponding to your analysis (e.g. "outfile", "outtree", "outdist",
"outqlist") in the same folder of the input files.

To obtain a publication quality picture of the output tree most conveniently,
you can use the TreeView program by Roderic Page that is available free of
charge and runs on personal computers (Macintosh and MS-Windows). It can be
retrieved from

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

TreeView understands the CLUSTAL W treefile conventions, reads multifurcating
trees and is able to simultaneosly display branch lengths and quartet puzzling
support values for each branch. Open the "outtree" file with TreeView, choose
"Phylogram" to draw branch lengths, and select "Show internal edge labels".

If you work on a SUN workstation you can use the TreeTool program to display
and manipulate PUZZLE trees (ftp://rdp.life.uiuc.edu/rdp/programs/TreeTool).


AVAILABLE OPTIONS

Once you have started the PUZZLE program you see a PHYLIP style menu with
all the options preselected that PUZZLE thinks are optimal for your data
set. This is how it may look like:

Settings for this analysis:

k Tree search? Quartet puzzling
n Number of puzzling steps? 1000
u Show unresolved quartets? No
o Selected outgroup? Thylacinus
d Type of sequence input data? Nucleotides
m Model of sequence evolution? TN (Tamura-Nei 1993)
p Constrain TN model to F84 model? No
t Transition/transversion parameter? Estimate from data set
r Y/R transition parameter? Estimate from data set
f Nucleotide frequencies? Estimate from data set

Are these settings OK (type y or choose from the menu):

By typing the indicated letters you can either change values or enter some
parameters. Some options (e.g. "m") can be invoked several times so that
one can switch between a number of different possibilities for that option.

PUZZLE cannot always make the correct guess for the available options of
the program. Therefore, you have the possibility to change all options
before starting the maximum likelihood analysis (by typing "y" at the input
prompt). In the following we describe all options in alphabetical order
that PUZZLE provides. Note, not all of them are accessible at the same time:

"d" option: Data type. The user can specify whether nucleotide or amino
acid sequences serve as input.

"f" option: Base frequencies. The maximum likelihood calculation needs the
frequency of each nucleotide/amino acid/doublet as input. PUZZLE
estimates these values from the sequence input data. This option
allows specification of other values.

"k" option: Tree search. With this option you can decide how to search trees
in the PUZZLE maximum likelihood program. Currently you can choose
to do a quartet puzzling search (Strimmer and von Haeseler 1996),
a fast heuristic tree search based on quartets, or you can supply
your own tree to compute branch lengths and ML values for it, or
you can chose to forget about trees and simply compute only a
maximum likelihood distance matrix for your data set.

"m" option: Model selection. To model the evolution of nucleotides PUZZLE
offers the Tamura-Nei (TN) model, the Hasegawa et al. (HKY) model,
and the Schoeniger-von Haeseler (SH) model. The SH model describes
the evolution of pairs of dependent nucleotides (pairs are the first
and the second nucleotide, the third and the fourth nucleotide etc).
Note that the Jukes-Cantor (1969), the Felsenstein (1981), and the
Kimura (1980) model are all special cases of the HKY model. By
setting all base frequencies to 0.25 the Kimura model is obtained.
If in addition the transition/transversion parameter is set to 0.5
one gets the Jukes-Cantor model. The Felsenstein model results
from the HKY model by setting only the transition/transversion
parameter to 0.5 while still considering base composition.
For amino acid sequence data the Dayhoff et al. (Dayhoff) model,
the Jones et al. (JTT) model, and the Adachi and Hasegawa (mtREV)
model are implemented in PUZZLE. The mtREV model describes evolu-
tion of amino acids encoded on mtDNA.

"n" option: Number of puzzling steps. Generally, the more taxa are used the
more puzzling steps are advised, but the default value 1000
should be OK for most data sets.

"o" option: Outgroup. For plotting the (unrooted) quartet puzzling tree an
outgroup must be specified. The default outgroup is the first
taxon of the "infile".

"p" option: Constrain the TN model to the F84 model. This option is only
available for the Tamura-Nei model. With this option you can
enter the expected transition-transversion ratio for the F84
model, and PUZZLE computes the corresponding parameters of the
TN model (this depends on your data file!). This is basically
a PHYLIP compatibility mode. You can enter here the same
parameter as you would in the PHYLIP programs DNADIST (select
"maximum likelihood distances") or DNAML, and then PUZZLE works
with the same model as would the PHYLIP programs (compare for
example the resulting maximum likelihood distances!).

"r" option: Y/R transition parameter. This option is only available for the
TN model of sequence evolution. This parameter is the ratio of
the rates for pyrimidine transitions and purine transitions.
In the notation of paper by Tamura and Nei (1993) this parameter
corresponds to alpha2/alpha1.
You don't need to specify this parameter as PUZZLE can estimate
it from the given data set.

"s" option: Symmetrize doublet frequencies. This option is only available for
the SH model. With this option the doublet frequencies are sym-
metrized. For example, the frequencies of "AT" and "TA" are set to
the average of both frequencies.

"t" option: Transition/transversion parameter. For nucleotide data only.
This parameter is the ratio of the base frequency independent rate
for transitions and twice the base frequency independent rate for
transversions. If sequences evolve for example according to a Kimura
(1980) model of sequence evolution then this parameter simply equals
the ratio of the number of expected transitions to the number of
expected transversion. For other nucleotide models the parameter
is very similiar to the expected transition/transversion ratio.
You don't need to specify this parameter as PUZZLE can estimate
it from the given data set.

"u" option: Show unresolved quartets. During the run of a quartet puzzling
analysis PUZZLE display the number of unresolved quartets for
your data set. An unresolved quartet is a quartet where the maximum
likelihood values for each of the three possible quartet topologies
are so similar that it is not possible to prefer one of them
(Strimmer, Goldman, von Haeseler 1996). If you select this option
you'll get a detailed list of these starlike quartets.
Note, for some data sets you may have a lot of unresolved quartets
In this case the printout of all unresolved quartets is probably not
very useful and also needs a lot of disk space.


INTERPRETATION AND HINTS

Quartet puzzling support values for each internal branch:

The quartet puzzling tree search automatically generates support values for
each internal branch. In principle, these values have the same practical
meaning as bootstrap values. Indeed, it turns out that PUZZLE gives you
estimates of support that are even numerically very similar to corresponding
neighbor-joining bootstraps. This means that branches showing a QP reliability
from 90% to 100% are very strongly supported. In principle one can of course
also trust branches with lower reliability but in this case it is advisable to
check how well the respective branch does in comparison to other branches in the
tree (relative reliability). It is also important if you have a branch with a
low confidence to check the alternative groupings that are not included in the
QP tree (they are all listed in the outfile!). There should be a significant
gap between the lowest reliability value of the QP tree and the most frequent
grouping that is not included in the QP tree. For example, if you have a support
value of 60% and the not-included grouping occurs with a frequency of 20% then
the 60% support for the branch is OK.

Percentage of unresolved quartets:

PUZZLE tells you the number and the percentage of completely unresolved
maximum likelihood quartets for your data set. An unresolved quartet is
a quartet where the maximum likelihood values for each of the three possible
quartet topologies are so similar that it is not possible to prefer one of
them (Strimmer, Goldman, von Haeseler 1996). The percentage of the unresolved
quartets among all possible quartets is an indicator of the suitability of the
data for phylogenetic analysis. A high percentage usually results in a highly
multifurcating quartet puzzling tree. If you have only few unresolved quartets
we recommend to invoke option "u" to get a list of all these quartets.

Automatic parameter estimations:

For the Tamura-Nei (TN) and the Hasegawa et al. (HKY) model of nucleotide
evolution PUZZLE estimates the corresponding parameters before starting a
tree reconstruction analysis from the data. Usually this works quite fine
but if you have a reason to assume a different set of parameters to be correct
you should use them and not the estimated ones. This is because under certain
circumstances the automated estimator may fail to give the correct estimate.
For example, if you have very similar sequences and if it happens that you have
no transversion in your data set (or in a specific quartet) then the estimated
transition-transversion parameter may be much larger than the true one. You
should also keep in mind that for each parameter the program has ainternal lower
and upper bound. If you have, e.g., a (unrealistic) transition-transversion
ratio of 80 you will have a hard time to have it estimated by PUZZLE.
Details of this estimation procedure will be published elsewhere
(Strimmer and von Haeseler, in preparation).


OTHER (HIDDEN) FEATURES AND PERFORMANCE

For nucleotide data PUZZLE computes the expected transition/transversion ratio
and the expected pyrimidine transition/purine transition ratio corresponding to
the selected model. Note that base frequencies play a role in the calculation
of these values. PUZZLE also checks with the help of a 5% level chi-square-test
whether the base composition of each taxon is identical with the average base
composition that serves as a basis for the maximum likelihood computation.
The deviating taxa are listed in the output file.

A hidden feature of PUZZLE version 2.5.1 is the employment of an more advanced
weighting scheme of quartets (Strimmer, Goldman, and von Haeseler 1996)
leading to a even more better performance of QP than described in the
original paper (Strimmer and von Haeseler).

There are some other hidden features in the source code: all the maximum
likelihood maximization is done using no derivatives whatsoever (this is only
slightly slower than Newton-Raphson using second derivatives but numerically
much more stable!). If you take a closer look at the source code, you'll
see that we have already developped all procedures that are necessary to
consider rate heterogenity and rate correlation in conjunction with a discrete
Gamma model. However, they are by now not very well tested and for sure not in
their final form, so we opted to make the corresponding PUZZLE options
silent and inactive. We hope to finally include rate heterogenity and rate
correlation in a future version of PUZZLE.


LIMITS

PUZZLE can (theoretically) perform quartet puzzling with up to 568 taxa.
In that case 4,291,262,010 quartets have to be examined. In ANSI C this
number is just below the maximal possible integer number (unsigned long int).
At least 32,767 sites should be possible (ANSI C lower limit for normal
integer numbers), depending on your compiler more sequence positions might
be possible.

Computation time and RAM capacity is also an important factor. As all possible
quartets have to be examined the execution time is expected to grow with n^4,
n being the number of taxa. Information about every quartet is stored in RAM.
For large data sets you need a large RAM capacity. If you run out of memory
you will get the following error message:

ERROR: Memory allocation failure in ...

In this case try to give your application more memory (on a Mac the standard
memory partition size of PUZZLE is 3000K, under MS-DOS you might encounter
problems with the famous 640K barrier ), ask your local expert for help.
If this doesn't help decrease the number of taxa. In summary, on a standard
machine (1995 UNIX workstation) with 32 to 64 MB RAM PUZZLE can easily do
maximum likelihood tree searches including estimation of support values
for data sets with 50-100 taxa. More taxa are also possible but probably not
very useful (star tree!).


OTHER PROGRAMS

There are a number of other programs to reconstruct phylogenetic relationships.
Here are the addresses of some WWW pages providing links to most of them:

http://phylogeny.arizona.edu/tree/programs/programs.html
http://evolution.genetics.washington.edu/software.html
http://www.ebi.ac.uk/biocat/Phylogeny.html
http://hgc6.sph.uth.tmc.edu:8080/doc.dir/software.html


GETTING PUZZLE

PUZZLE runs on all popular platforms (MacOS, MS-DOS, VMS, UNIX). It is
distributed electronically over the Internet by the European Bioinformatics
Institute (Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK):

ftp://ftp.ebi.ac.uk/pub/software/dos/puzzle (MS-DOS)
ftp://ftp.ebi.ac.uk/pub/software/mac/puzzle (Macintosh)
ftp://ftp.ebi.ac.uk/pub/software/unix/puzzle (UNIX)
ftp://ftp.ebi.ac.uk/pub/software/vms/puzzle (VMS)

Precompiled executables are provided for MacOS and MS-DOS. For UNIX and VMS
system specific files for automated compilation are provided. Please see the
seperate PUZZLE installation guide for further details of how to make PUZZLE
run on your system.

 

REFERENCES

Adachi, J. 1995. Modeling of molecular evolution and maximum likelihood
inference of molecular phylogeny. PhD dissertation.
Institute of Statistical Mathematics, Tokyo.

Adachi, J. and M. Hasegawa. 1996. Model of amino acid substitution in
proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.

Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978.
A model of evolutionary change in proteins. In: Dayhoff, M. O. (ed.)
Atlas of Protein Sequence Structur, Vol. 5, Suppl. 3.
National Biomedical Research Foundation, Washington DC, pp. 345-352.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum
likelihood approach. J. Mol. Evol. 17: 368-76.

Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c.
Distributed by the author. Department of Genetics, University of
Washington, Seattle.

Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA.
J. Mol. Evol. 22: 160-174.

Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules.
In: Munro, H. N. (ed.) Mammalian Protein Metabolism,
New York: Academic Press, pp. 21-132.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation
of mutation data matrices from protein sequences.
CABIOS 8: 275-282.

Kimura, M. 1980. A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences.
J. Mol. Evol. 16: 111-120.

Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide
substitutions in the control region of mitochondrial DNA in humans
and chimpanzees. Mol. Biol. Evol. 10: 512-526.

Tamura K. 1994. Model selection in the estimation of the number of
nucleotide substitutions. Mol. Biol. Evol. 11: 154-157.

Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving
the sensitivity of progressive multiple sequence alignment through sequence
weighting, positions-specific gap penalties and weight matrix choice.
Nucleic Acids Research 22: 4673-4680.

Schoeniger, M. and A. von Haeseler. 1994. A stochastic model for the
evolution of autocorrelated DNA sequences. Molecular Phylogenetics and
Evolution 3: 240-247.

Strimmer, K. and A. von Haeseler. 1996. Quartet puzzling: a quartet
maximum likelihood method for reconstructing tree topologies.
Mol. Biol. Evol. 13: 964-969.

Strimmer, K., N. Goldman, and A. von Haeseler. 1996. Bayesian probabilities
and quartet puzzling. Submitted to Mol. Biol. Evol.


VERSION HISTORY

2.5.1 Fix of a bug (present only in version 2.5) related to computation of
the variance of the maximum likelihood branch lengths that caused
occasional crashs of PUZZLE on some systems when applied to data sets
containing many very similar sequences. Drop of support for non-FPU
Macintosh version. Corrections in manual.

2.5 THIS IS THE VERSION OF PUZZLE REFERRED TO IN THE ADDENDUM TO THE ORIGINAL
QP PAPER (Strimmer, Goldman, and von Haeseler 1996).

Bug fixes in ML engine, computation of ML distances and ML branch lengths,
optional input of a user tree, F84 model added, estimation of all TN model
parameters and corresponding standard deviations, CLUSTAL W treefile
convention adopted allowing to show branch lengths and QP support values
simultaneously, display of unresolved quartets, update of mtREV matrix,
source code more compatible with some almost-ANSI compilers, more
safety checks in the code.

2.4 Automatic data type recognition, chi-square-test on base composition,
automatic selection of best amino acid model, estimation of transition-
transversion parameter, ASCII plot of QP tree in outfile.

2.3 THIS IS THE VERSION OF PUZZLE REFERRED TO IN THE MAIN QP PAPER
(Strimmer and von Haeseler 1996).

More models, many usability improvements, built-in consensus tree
routines, more supported systems, bug fixes, no more dependencies
of input order. First EBI distributed version.

2.2 New optimized data structure requiring much less computer memory

2.2 New data structure for storing quartet information and bug fixes.

2.1 Bug fixes concerning algorithm and transition/transversion parameter.

2.0 Complete revision merging the maximum likelihood and the quartet puzzling
routines into one user friendly program. First electronic distribution.

1.0 First public release, presented at the phylogenetic workshop 1995
in Bielefeld, FRG (15-17 June 1995).