Partimatrix
A program for calculating Partition Matrices

Copyright 1997 by Ingrid Jakobsen, Australian National University

Permission is granted to use, copy, modify and distribute this program
and accompanying documentation provided no fee is charged, and this
copyright notice is not removed. No representations are made about the
suitability of this software for any purpose.

I'm currently (2006) at University of Queensland
email i.jakobsen@uq.edu.au

Contents
********
List of supplied files
Compiling partimatrix
Compiling on systems without X11 graphics
Input format
Converting sites to binary partitions
Sorting and selecting partitions
Clustering within a partition
References

Supplied Files
**************
Instructions    This file
partimatrix.c   main partition matrix program
pmXPlot.c       X11 graphics functions
XPlot.h         X11 graphics header file
Makefile
plainpmx.c      partimatrix program without X11 windows
pseudaut.gde    Alignment of primate pseudoautosomal boundary
                sequence in gde flat file format.

Compiling partimatrix
*********************
Before the program is compiled, it may be necessary to edit some of the
#define statements at the beginning of the program. These define the
maximum number of sequences that can be analysed and the shape and size
of the paper that the PostScript output will be printed on.

The maximum number of sequences should be changed only via SUPERLONG.
The number of sequences that can be analysed is SUPERLONG * 32 + 1.
So, since SUPERLONG currently is 5, 161 sequences can be analysed.
You may wish to increase SUPERLONG for large alignments. On the other
hand, if you know you will always have fewer sequences, it may be worth
reducing SUPERLONG, particularly on computers with less memory. It may
also need to be reduced if you are trying to analyse a very long
alignment.

I have set the papersize to be a compromise between US letter and A4,
so the matrix should fit on either papersize. All sizes are in points:
  72 points = 1 inch
 100 points = about 3.5 cm

              A4   US let.  Folio  Legal    A3    A5    B4    B5
PAPERWIDTH    593  611      593    611      841   419   728   515
PAPERLENGTH   841  791      934    1007     1190  593   1031  728

PAPEREDGE     I recommend no smaller than 15.
FONT          I recommend no smaller than 4. Unfortunately, a matrix
              may be too wide to print with a larger font size. There
              are some tricks to get around this (described later).

The Makefile also needs to be edited to correspond to the local system.
In particular, the location of the X11 library files must be indicated.
Then type "make" and partimatrix should compile.

Compiling without X11
*********************
You will only need one file: plainpmx.c. You may need to edit the
#define values for number of sequences and papersize, as described
above. The program should compile using, for example, the GNU C
compiler:

  gcc plainpmx.c -o partimatrix -lm

This version of the program does not let you see the partition matrix
while the program is running. This makes it hard to experiment with
different partition orders, removing partitions, etc. If there is a
local graphics package that can be used to create a module similar to
the X11 module, I would recommend using it. The alternative is to save
a number of PostScript files with different sort orders and then, if
you want to save paper and toner, view the PostScript using something
like GhostScript.

As a shortcut, we have found that the two most useful sorting orders
are "least to most conflict" (in the (o) menu) and sorting based on
similarity scores (in the (s) menu).

Input format
************
The sequence data must be DNA or RNA, and already aligned. The minimum
number of sequences is four. If the file contains more than MAXSPECIES
(currently 161) sequences, the extra ones will be ignored.
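
The limit of 161 sequences follows from the SUPERLONG setting described
under "Compiling partimatrix". As a minimal sketch of that arithmetic
(written for this file, not copied from partimatrix.c; BITS_PER_WORD,
max_sequences and superlong_needed are names invented here for
illustration), one might work out a suitable SUPERLONG like this:

```c
#include <assert.h>

/* Assumed values mirroring the text; the actual #define lines in
   partimatrix.c may be written differently. */
#define SUPERLONG 5        /* value quoted in this file */
#define BITS_PER_WORD 32   /* hypothetical name for the factor 32 */

/* Largest number of sequences for a given SUPERLONG, using the
   formula stated in this file: SUPERLONG * 32 + 1. */
int max_sequences(int superlong)
{
    assert(superlong > 0);
    return superlong * BITS_PER_WORD + 1;
}

/* Smallest SUPERLONG that accommodates nseq sequences.
   We need superlong * 32 + 1 >= nseq, so superlong is
   (nseq - 1) / 32 rounded up. */
int superlong_needed(int nseq)
{
    assert(nseq >= 2);
    return (nseq - 1 + BITS_PER_WORD - 1) / BITS_PER_WORD;
}
```

For example, max_sequences(SUPERLONG) reproduces the 161 quoted above,
and an alignment of 200 sequences would need SUPERLONG set to at
least superlong_needed(200), which is 7.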
The input format is fasta block format and similar formats:

>Sequence1
ACTACGGGATAACAATT
AGATACAGATACGAGAG
AGTCTCTC
>Sequence2
ACTACGCGATAATGATT
AG--ACAGATAGCAGAG
?GTCCCTC
>Sequence3
-----GGGATTACGATG
AG--ACAG?TACCAGAG
AGTCTCTT

Each sequence is introduced using a marker character '>', followed by
the sequence name. This line may contain other comments following the
name, but there should be no sequence on the name line. The entire
sequence appears on the following line(s), prior to the marker and name
for the next sequence. Gaps in the alignment should be marked using '-'
characters, unknowns using '?'. The marker character can be any
non-alphanumeric character such as: > ; # % (except '?' or '-').

If any sequence has a leading gap (such as Sequence3 above) the gap
information can be included on the name line, as in GDE flat file
format:

>Sequence3(5)
GGGATTACGATGAG--A
CAG?TACCAGAGAGTCT
CTT

When the input file is read in, you will be asked if there are any such
offsets in the file.

Converting sites to binary partitions
*************************************
The partition matrix principle is explained in the paper. Note that
sites with more than two distinct nucleotides are converted to
transversion sites (R:A/G vs Y:C/T/U) automatically. Also note that
sites where more than one sequence is unknown or a gap are not
converted to binary partitions, although they are included in the final
partition matrix output. If one sequence is unknown at a site, the
sequence is half assigned to each of the groups defined by the
remaining sequences. In the final matrix, the two partitions are shown
by a diagonal line through the site/partition square.

Sorting and selecting partitions
********************************
The program can order the partitions in a number of ways.

Use option (o) to order the partitions based on their support and
conflict.
The support score indicates how many sites are identical to a
partition, thus sorting from high to low support allows you to see
which partitions are observed frequently. The conflict score is based
on the number of sites inconsistent with each partition. Sorting from
low to high conflict can help reveal parts of the phylogeny that are
not changing along the alignment. You can also order based on the
support minus conflict. For a partition to score highly, it must have
both a number of identical sites and few inconsistent sites. This value
is also called "Predictor of Bootstrap".

In the original description, a Hadamard Transform was used on the
support and conflict scores, in effect to correct for multiple hits
(see Lento 1995). The partimatrix program does not perform a Hadamard
Transform for two reasons. Firstly, the Hadamard Transform is intensely
memory-hungry, and the memory required doubles for each sequence added
to the alignment. So although computer memory keeps increasing, data
can be generated faster than computing technology advances. Secondly,
the theory of the Hadamard Transform assumes there is one phylogenetic
history that describes the entire alignment. Most people constructing a
partition matrix are trying to avoid making that assumption about their
data.

Use option (s) to sort the partitions based on their similarity scores,
or alternatively just calculate similarity scores but retain the
current order of partitions. The similarity score of two partitions is
based on how many sites are consistent or inconsistent with both
partitions. The score is normalised, based on how many sites would be
expected to be "shared" in this way if the order was random. A positive
similarity score means more sites are shared than expected; a negative
similarity score means that sites tend to be consistent with one
partition and inconsistent with the other. The partitions are ordered
so that adjacent partitions have as high a similarity score as
possible.
This can in some cases mean that partitions with negative similarity
scores are put next to each other, because they have to go somewhere.
The text file output will warn you, as well as reporting the similarity
scores between each pair of partitions.

Use option (m) to manually order the partitions. You can choose to swap
pairs of partitions, or you can specify a new order for all the
partitions at once if you have a lot of rearranging to do.

Use option (r) to remove partitions. You can remove partitions based on
their support or conflict scores, for example all those with very low
support, or you can specify particular partitions to remove.

If you are analysing a lot of sequences, it is quite possible that
there are too many partitions to show on the width of one page. What
you can do in this case is remove partitions. For example, if you want
to see all partitions sorted in order of least to most conflict, first
remove partitions with the most conflict until there are few enough
partitions to print on the width of one page, and save that PostScript
file. Then, if you try to remove all partitions, you will be given the
option of returning to the original full set of partitions. Do so, and
now remove those partitions that were included in the first file, and
if necessary, again remove partitions until the remaining ones fit on
one page width. You will have to engage in a bit of real cutting and
pasting to create the full-width partition matrix from the various
PostScript files.

You may find that when there are a large number of partitions, many
have low support scores, and that you will get a clearer picture by
only looking at a subset of partitions anyway.

Clustering within a partition
*****************************
Within one partition, it may appear that sites consistent with and
inconsistent with the partition are clustered.
The (c) option will count clustering in a partition two different ways:
the number of and length of "runs" of sites of one colour, and the
number of times a site of one colour is followed by a site the same
colour or the other colour. These counts can be tested statistically,
for example using the tests described in the Appendix.

References
**********
Jakobsen, I.B., Wilson, S.R. & Easteal, S.
The partition matrix: Exploring variable phylogenetic signals along
nucleotide sequence alignments
Mol. Biol. Evol. 14: 474-484 (1997)

For background on compatibility matrices, see:

Sneath, P.H.A., Sackin, M.J. & Ambler, R.P.
Detecting evolutionary incompatibilities from protein sequences
Syst. Zool. 24: 311-332 (1975)

Jakobsen, I.B. & Easteal, S.
A program for calculating and displaying compatibility matrices as an
aid in determining reticulate evolution in molecular sequences
CABIOS 12: 291-295 (1996)

For background on support and conflict for partitions (LentoPlot) see:

Lento, G.M., Hickson, R.E., Chambers, G.K. & Penny, D.
Use of spectral analysis to test hypotheses on the origin of pinnipeds
Mol. Biol. Evol. 12: 28-52 (1995)
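
As a rough illustration of the two kinds of count made by the (c)
option (see "Clustering within a partition"), here is a minimal C
sketch. It is not the code used in partimatrix.c: the ClusterCounts
type and count_clustering function are invented for this example, it
assumes each site has already been coded as 1 (consistent with the
partition) or 0 (inconsistent), and it records only the longest run
rather than every run length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary type; not taken from partimatrix.c. */
typedef struct {
    int runs;         /* number of maximal runs of one colour */
    int longest_run;  /* length of the longest run */
    int same;         /* adjacent pairs with the same colour */
    int different;    /* adjacent pairs with different colours */
} ClusterCounts;

/* colours[i] is 1 if site i is consistent with the partition and
   0 if it is inconsistent (an assumed encoding). */
ClusterCounts count_clustering(const int *colours, size_t n)
{
    ClusterCounts c = {0, 0, 0, 0};
    int current = 0;   /* length of the run currently being scanned */
    size_t i;

    assert(colours != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i > 0 && colours[i] == colours[i - 1]) {
            c.same++;          /* same colour as the previous site */
            current++;
        } else {
            if (i > 0)
                c.different++; /* colour changed between neighbours */
            c.runs++;          /* a new run starts at this site */
            current = 1;
        }
        if (current > c.longest_run)
            c.longest_run = current;
    }
    return c;
}
```

For a non-empty sequence of sites the number of runs is always one more
than the number of colour changes, so the two counts are related but
are reported separately here, as in the description of the (c) option.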