/*---------------------------------------------------------------------------- Program name : partimatrix.c Author : Ingrid Jakobsen. X11 code : Hugh Fisher Copyright 1997 by Ingrid Jakobsen, Australian National University Permission is granted to use, copy, modify and distribute this program provided no fee is charged, and this copyright notice is not removed. No representations are made about the suitability of this software for any purpose. Notes on modifying the program code *********************************** The #define statements at the beginning of the program may need to be modified for your requirements before it is compiled. Apart from the PostScript modifications, they assume familiarity with the program, at least as far as judging the need for modification. Many sequences : If very large numbers of sequences need to be analysed and you wish to recompile the program to handle that, increase the value of SUPERLONG SUPERLONG in its #define. The actual number of sequences that can be handled is SUPERLONG * 32 + 1, (i.e. MAXSPECIES) so for example SUPERLONG = 4 gives you 129 sequences. Please do not adjust MAXSPECIES (MAXSPECIES) directly, only through SUPERLONG. LONGBITS = 32 is the guarantee provided by the ANSI C standard. (LONGBITS) It can be altered for a specific machine if you have good reason to believe it has longer unsigned ints. PostScript : If you need to adjust the program's PostScript output for your local papersize, you need to change the following #define and recompile. Here are the point values for some common papersizes: A4 US letter Folio Legal A3 A5 B4 B5 PAPERWIDTH 593 611 593 611 841 419 728 515 PAPERLENGTH 841 791 934 1007 1190 593 1031 728 PAPEREDGE Most laserprinters in my experience are not capable of printing much closer than 15 point from the edge, but PAPEREDGE can be increased if larger margins are desirable (for publication for example). FONT Larger font size may be appropriate for small datasets, but smaller than 4 point is basically illegible. Summary : The program asks for the number of sequences and their length (each sequence) followed by the name of the file with the data and the character used to mark sequence names. It assumes a fasta/GDE flatfile type format, for example: >Sequence1 AGTCATGACTGAAATGA >Sequence2 AGTCATTTCTGAAATGA Fasta has the character '>' as shown, GDE flat '#' and for example mega can be edited into this format easily. The sequences should already be aligned. Once the sequences have been read in, the program looks at the data site-by-site and decides which are the interesting sites. Basically a site must have more than two distinct nucleotides, present in more than 1 sequence each. If the site is expressible as a binary partition the appropriate binary partition has its support increased by 1.0. If there are gaps or ?'s the program generally assumes the site is too difficult to give a binary score to but saves it for later consistency testing. If there are 3 or 4 nucleotides the program tries to make the site a transversion site, except if there is only 1 of either pyrimidines or pyrines, when it applies the "1?" rule. 1? rule If there is 1 ?, (or one sequence with a gap) the program assigns 0.5 support to each of the partitions created by turning the ? into one of the observed nucleotides. Some examples: (5 sequences) TTCC? - 0.5 scores for TTCCT and TTCCC (1? rule) TTCCG - 0.5 scores for TTCCT and TTCCC (1? rule) TTTCG - ignored as only T occurs more than once. TTTAG - treated as YYYRR - scored (transversion) TTGCA - treated as YYRYR - scored (transversion) TC?AG - treated as YYYRR and YYRRR 0.5 scores each (transversion followed by 1? rule) Once all partitions have been found and support scored, the conflict scores against each partition "con" is found: All other partitions that are inconsistent are found and their "sup" scores are added to create that partition's "con" score. The final "con" scores are normalised by dividing by the sum of all "con" scores and multiplying by the sum of "sup" scores - so the total of each kind of score is the same over all partitions. The main matrix compares each partition with each site and sees that site is identical to, consistent with, or inconsistent with, the partition in question. The partitions can then be sorted according to their scores, either largest to smallest support or smallest to largest conflict. Partitions can also be ordered based on their similarity to each other - so that partitions with a similar distribution of black and white along the sequence end up next to each other. As well as sorting, it is also possible to remove selected partitions, for example low scorers, or manually swap the order of partitions. The program has a variety of output - it can print out partitions and their scores, along with the "difficult" sites, it can display the partition matrix on the screen if X display is available, and finally produce PostScript output with the matrix, and the partition scores presented in the "LentoPlot" format. ----------------------------------------------------------------------------*/ #include #include #include #include #include #include /* The following three lines are for X11 graphic display the Xlib and Xutils have been omitted as they are included by XPlot.c and this way I can almost read alint output :-) */ /* #include */ /* #include */ #include "XPlot.h" #define NAMELEN 13 /* allow 13 character names */ /* define for dealing with large numbers of species. Only the first value should be altered to increase number of species the program can handle. */ #define SUPERLONG 5 /* number of longs in BigLong */ #define LONGBITS 32 /* required number of bits in a long */ #define MAXSPECIES SUPERLONG * LONGBITS + 1 /* maximum species */ /* #define for PostScript output paper size. */ #define PAPERWIDTH 593 /* 593 = The width of A4 paper in points */ #define PAPERLENGTH 791 /* 791 = The length of US letter in points */ #define PAPEREDGE 20 /* 15 = Minimum edges not to print on */ #define FONT 4 /* 4 point smallest nice font size */ /* Structure Declarations */ typedef struct { unsigned long int sub[SUPERLONG]; } BigLong; struct site_index { int site; int partid; int halfpart; }; struct partition { BigLong part; float sup; float con; int partid; }; struct keep_site { int site; char data[MAXSPECIES +1]; }; /* Function Declartions */ FILE *GetNumandLen(int *Num, int *Len, char *Marker, int *gdeformat); int ReadData(FILE *input, int NumSeq, int SeqLen, char Marker, int gdeformat, char **names, char *Data); int PartitionData(int Num, int Len, char *Data, struct site_index *index, int *nice, struct partition *scores, int *hard, struct keep_site *tough); BigLong MakeBinary(int Num, char *array); char *UnMakeBinary(int Num, BigLong binary, char *holder); int BigEqual(BigLong one, BigLong other); void MinusScore(int nice, struct partition *scores); int MakeMatrix(int Num, int sites, struct site_index *site, int OrigNice, int Nice, struct partition *scores, int Hard, struct keep_site *tough, char **matrix); char Assign(int consis); int MaxButOne(int i, int *array); int Consistency(int max, char *string1, char *string2, char *live); int BinaryCons(BigLong one, BigLong other); int PartSim(int Num, int Nice, int sites, struct partition *scores, char **matrix); void HeapSortPartition(int size, struct partition *scores); int RemovePart(int Num, int Nice, struct partition *scores); void SwapPart(int Num, int Nice, struct partition *scores); int ClusCount(int Num, int Nice, struct partition *scores, int Sites, char **matrix); int WriteTextSum(int Num, char **name, int SiteNum, int Nice, struct partition *scores, struct site_index *pos, char **matrix, int Hard, struct keep_site *tough); int WritePostScript(int NumSeq, char **name, int SiteNum, int NiceSite, struct partition *scores, struct site_index *pos, char **matrix); void PostScriptBlurb(FILE *Out, int page, int le, int bot, int ri, int up); char *BintoHex(BigLong num, int len, char *home); void *MyMalloc(size_t size); void *MyRealloc(void *old, size_t size); int main(void) { int NumSeq = 0, SeqLen = 0; int RetCode; char Marker = 'A'; int gdeform = 0; FILE *Data; int SiteNum, NiceNum = 0 , HardNum = 0; int NowNice, oldnice; char **seq_names; char *store_data; struct site_index *inform; struct partition *scores; struct keep_site *hardsites; char **site_matrix; int i; char option = 'o'; char optst[21]; /* Find out how many sequences and how long they are */ Data = GetNumandLen(&NumSeq, &SeqLen, &Marker, &gdeform); if (Data == 0) return(1); /* Set up data arrays of the appropiate size. */ seq_names = (char **)MyMalloc( sizeof(char *) * NumSeq ); for (i=0; i < NumSeq; i++) seq_names[i] = (char *)MyMalloc(sizeof(char) * (NAMELEN+1) ); store_data = (char *)MyMalloc( sizeof(char *) * NumSeq * SeqLen); inform = (struct site_index *)MyMalloc(sizeof(struct site_index) * SeqLen); scores = (struct partition *)MyMalloc(sizeof(struct partition) * SeqLen); hardsites = (struct keep_site *)MyMalloc(sizeof(struct keep_site) * SeqLen); /* Read in the data for analysis */ RetCode = ReadData(Data,NumSeq, SeqLen, Marker, gdeform, seq_names, store_data); if (RetCode) /* something went wrong reading the data */ return(RetCode); /* Identify partitions for the interesting sites */ SiteNum = PartitionData(NumSeq, SeqLen, store_data, inform, &NiceNum, scores, &HardNum, hardsites); if (SiteNum == 0 ) /* Nothing to do (or an error) */ { printf("Unable to continue - no informative sites could be found.\n"); return(1); } if (NiceNum == 0) { printf("No site (of %d) was identical to a binary partition.\n", SiteNum); return(1); } /* Ditch all the memory we don't need anymore (phew!) */ inform = (struct site_index *)MyRealloc(inform, sizeof(struct site_index) * SiteNum); scores = (struct partition *)MyRealloc(scores, sizeof(struct partition) * NiceNum); hardsites = (struct keep_site *)MyRealloc(hardsites, sizeof(struct keep_site) * HardNum); free( store_data ); /* Any wild ideas for processing the partitions scores through a Hadamard Transform should probably be inserted here. Only datasets with a limited number of sequences can be done in this way - on the SunSparc 5, 25 sequences is the current maximum, since memory has to be allocated for an array with 2**24 floats. */ /* Calculate the negative scores for partitions and normalise */ MinusScore(NiceNum, scores); printf("\n%d sites were informative, %d not suitable for partitions\n", SiteNum, HardNum); printf("There are %d different binary partitions.\n", NiceNum); /* Allocate memory for the partimatrix itself */ site_matrix = (char **)MyMalloc( sizeof(char *) * (NiceNum+1) ); for (i=0; i <= NiceNum; i++) site_matrix[i] = (char *)MyMalloc(sizeof(char) * (SiteNum+1) ); NowNice = NiceNum; /* The next bit is interactive output choices. The reason it starts with 'o' set (i.e tries to order before any options are presented) is that we need to build a matrix before displaying it */ while (option != 'q') { if (option == 'o') /* sort the data and build the matrix */ { HeapSortPartition(NowNice, scores); RetCode = MakeMatrix(NumSeq, SiteNum, inform, NiceNum, NowNice, scores, HardNum, hardsites, site_matrix); if (RetCode) /* Something went wrong building the matrix */ { printf("Matrix building unsuccessful\n"); return(RetCode); } } else if (option == 't') /* Save features of partimatrix as text */ { RetCode = WriteTextSum(NumSeq, seq_names, SiteNum, NowNice, scores, inform, site_matrix, HardNum, hardsites); if (RetCode) return(RetCode); } else if (option == 'd') /* Display matrix on screen */ { printf("\n\nThe matrix will be displayed in a separate window.\n"); printf("When you have examined it, click the mouse once anywhere"); printf(" in the\nwindow to continue the program.\n"); RetCode = ShowXPlot(SiteNum, NowNice, site_matrix); if (RetCode) printf("Matrix cannot be displayed\n"); } else if (option == 'p') /* PostScript output */ { RetCode = WritePostScript(NumSeq, seq_names, SiteNum, NowNice, scores, inform, site_matrix); if (RetCode) return(RetCode); } else if (option == 'r') /* Remove partitions from consideration*/ { oldnice = NowNice; NowNice = RemovePart(NumSeq, oldnice, scores); if (NowNice < 1) { printf("No partitions remain - the program will end except\n"); printf(" (r) will restore all partitions > "); fgets(optst, 20, stdin); if ( tolower(optst[0]) == 'r') NowNice = NiceNum; else return(1); } RetCode = MakeMatrix(NumSeq, SiteNum, inform, NiceNum, NowNice, scores, HardNum, hardsites, site_matrix); if (RetCode) return(RetCode); } else if (option == 'm') /* manual swapping of partition order */ { SwapPart(NumSeq, NowNice, scores); RetCode = MakeMatrix(NumSeq, SiteNum, inform, NiceNum, NowNice, scores, HardNum, hardsites, site_matrix); if (RetCode) return(RetCode); } else if (option == 's') /* Similarity Scores for partitions */ { RetCode = PartSim(NumSeq, NowNice, SiteNum, scores, site_matrix); if (RetCode) return(RetCode); RetCode = MakeMatrix(NumSeq, SiteNum, inform, NiceNum, NowNice, scores, HardNum, hardsites, site_matrix); if (RetCode) return(RetCode); } else if (option == 'c') /* Runs and clusters in a partition */ { RetCode = ClusCount(NumSeq, NowNice, scores, SiteNum, site_matrix); if (RetCode) return(RetCode); } else printf("\nThat is not an available option, try again\n"); printf("\nYou may now chose to:\n"); printf(" (d) Display the current matrix on screen (X display)\n"); printf(" (p) Save the current matrix as a PostScript file\n"); printf(" (t) Save a summary of the current matrix as a text file\n\n"); printf(" (s) Order the partitions based on similarity scores\n"); printf(" (o) Order the partitions based on support and conflict\n"); printf(" (m) Inspection and manual ordering of partitions \n"); printf(" (r) Remove partitions, for example with low scores\n\n"); printf(" (c) Determine clustering of sites within a partition\n"); printf(" (q) Quit partimatrix\n\n"); printf("Enter your choice > "); fgets(optst, 20, stdin); option = tolower(optst[0]); } /* The end. Tidy up and go home. */ for (i = 0; i <= NiceNum; i++) free( site_matrix[i]); free( site_matrix ); free( inform ); free( scores ); free( hardsites ); return(RetCode); } /* end of main */ /* Function Name : GetNumandLen() Function Summary : This function will ask the user the name of the input file, and determine how many sequences will be read in, and how many bases are in each sequence. */ FILE *GetNumandLen(int *Num, int *Len, char *Marker, int *gde) { int TmpLen, MaxNum, MaxLen; char Filename[120]; int linebegin, seqline; FILE *InFile; char ch; int i; int offset = 0; printf("This program calculates a partimatrix from aligned DNA sequences"); printf("\n\nThe sequences should be in FASTA format or similar (eg gde)"); printf("\nUnknown or ambiguous characters should be marked with 'N' or "); printf("'?'\nSites with more than 2 different nucleotides will be \n"); printf("treated as transversion sites\n\n"); /* Ask the user for the name of the data file */ printf("Please enter the name of the input file > "); fgets(Filename, 120, stdin); for (i = 0; Filename[i] != '\n'; i++); Filename[i] = '\0'; /* Try to open the input file for reading. If the file cannot be opened, print an error message and return to main() with an error code */ if ((InFile = fopen (Filename, "r")) == NULL) { printf("Unable to open input file %s\n", Filename); printf("Check you are in the correct directory\n"); return(0); } MaxLen = 0; MaxNum = 0; TmpLen = 0; linebegin = 1; seqline = 0; while (!feof(InFile)) { while ((ch = getc(InFile)) != '\n' && !feof(InFile)) { if (linebegin == 1) { if (*Marker == 'A') { if (ispunct(ch)) { *Marker = ch; MaxNum = 1; seqline = 0; /* Ask user if this is gde format and we need to look for offsets */ printf("Are any initial gaps marked with "); printf("a number in brackets (y/n) > "); fgets(Filename, 40, stdin); if (toupper(Filename[0]) == 'Y') *gde = 1; } else { printf("Marker character not found!\n"); printf("The file should begin with a "); printf("marker character e.g. > # ; \n"); return(0); } } else if (ispunct(ch) && ch == *Marker) { MaxNum++; TmpLen = 0; seqline = 0; } else seqline = 1; } linebegin = 0; /* if we are on a line with sequence, count in the sequence */ if (seqline) { if (isalnum(ch) || ch == '-' || ch == '?') TmpLen++; } /* if we are in gde format look for offsets on the nameline. */ else if (*gde) { if ( ch == ')' ) offset = 0; if ( offset == 1 ) TmpLen = TmpLen * 10 + (ch - '0'); if ( ch == '(' ) offset = 1; } } if (TmpLen > MaxLen) MaxLen = TmpLen; linebegin = 1; } if (MaxNum > MAXSPECIES) { printf("%d sequences found in input file\n", MaxNum); printf("It is not possible to store binary partitions for "); printf("more than\n%d sequences, which is therefore ", MAXSPECIES); printf("all that will be done\n"); printf("Recompile the program for more sequences if desired\n"); MaxNum = MAXSPECIES; } /* Once the values have been stuck in, copy them into the appropriate variables */ *Num = MaxNum; *Len = MaxLen; printf("Analysing %d sequences %d bases long\n", MaxNum, MaxLen); /* Reset the file to the beginning for the sequence reading program. rewind() is meant to reset the pointer to the beginning of the file and clear any error messages floating around */ rewind(InFile); return(InFile); } /* GetNumandLen */ /* Function Name : ReadData() Function Summary : This function wil read the data when it arrives in this format: all the sequences are in the one file and the names of the sequences are delimited by a marker character such as '#' or '%' Modified 11Apr95 : To read GDE flatfile format with initial offset in brackets rather than explicit. */ int ReadData(FILE *InFile, int NumSeq, int SeqLen, char Marker, int gde, char **names, char *Data) { char ch, ch2, nameholder[161]; int offset, i, j; /* This is the main loop... while there haven't been enough sequences read in yet, keep trying to read another one... */ for (i = 0; i < NumSeq; i++) { /* Look for the marker character. If we can't find it, print an error message and return to main() */ while (isspace(ch = fgetc (InFile)) ) ; if (feof (InFile)) { printf("Unexpected end-of-file encountered in input file\n"); printf("It appears to be in sequence %d\n", i+1); return(1); } else if (ch != Marker) { printf("Error in input file - Marker character %c was ", Marker); printf("not found\nMissing for sequence %d\n", i+1); return(1); } /* read in the sequence name - it should be the next thing in the file. If it's GDE there may be a number in brackets corresponding to the offset. */ fgets(nameholder, 160, InFile); for (j = 0; nameholder[j] != '\n'; j++) ; nameholder[j] = '\0'; offset = 0; if (gde) { j = 0; while ((nameholder[j] != '(' ) && j < 159) j++; if (nameholder[j] == '(' ) { offset = atoi(nameholder+j+1); nameholder[j] = '\0'; } } /* cut off the name after the first double space, for neatness */ for (j=0; nameholder[j] != '\0'; j++) if (isspace(nameholder[j]) && isspace(nameholder[j+1])) nameholder[j] = '\0'; strncpy(names[i], nameholder, NAMELEN); names[i][NAMELEN] = '\0'; if (feof (InFile)) { printf("End of File encountered in sequence %d\n", i+1); return(1); } /* Fill in the front gap if there is one */ if (offset) printf("Gap at front of sequence %s filled with '-' \n", names[i]); for (j = 0; j < offset ; j++) Data[i*SeqLen + j] = '-'; /* Try to read in the sequence */ while ( !feof (InFile) && j < SeqLen) { ch = fgetc (InFile); if ( ch == '-' || isalnum(ch) || ch == '?') { ch2 = toupper(ch); if ( ch2 == 'U') Data[i*SeqLen + j] = 'T'; else if ( ch2 == 'A' || ch2 == 'C' || ch2 == 'G' || ch2 == 'T' || ch2 == '-') Data[i*SeqLen + j] = ch2; else Data[i*SeqLen + j] = '?'; j++; } /* if the character is not sequence, make sure it's not the marker of the next sequence. */ else if ( ch == Marker) { printf("Gap at end of sequence %s filled with '-'\n", names[i]); ungetc(ch, InFile); /* fill in the rest of the sequence with '-'s */ for( ;j < SeqLen; j++) Data[i*SeqLen + j] = '-'; break; } } /* Check the last sequence entered is long enough */ if ( j < SeqLen) { printf("Gap at end of sequence %s filled with '-'\n", names[i]); for( ;j < SeqLen; j++) Data[i*SeqLen + j] = '-'; } } /* close the input file and return to main() */ fclose(InFile); return(0); } /* ReadData */ /* * Function name : PartitionData * Summary : This goes through the raw data, looks for * informative sites, building up a site_index array. * "Easy" sites add support to the partition score * array, while "difficult" sites may provide 0.5 support * but in either case are added to keep_sites so they * can be evaluated for consistency with the partitions * later. The function returns the number of sites. * Needed Functions: MyMalloc, MaxButOne, MakeBinary, BigEqual */ int PartitionData(int Num, int Len, char *Data, struct site_index *inform, int *NiceP, struct partition *scores, int *HardP, struct keep_site *tough) { int raw, i, j, k; int count[6]; char *tempsite; int states = 0; char left, behind; int nhold, flag; int tally = 0; BigLong parti; int nice = 0, hard = 0; /* First, do a bit of mallocing */ tempsite = (char *)MyMalloc(sizeof(char) * Num); /* Go through and discover which are the informative sites */ for (raw = 0; raw < Len; raw++) { for (i = 0 ; i < 6; i++) count[i] = 0; states = 0; /* leaf through all the sequences, add each one's character to the tally ambiguous codes were converted to ? when data was read in. NOTE count[0] = A, 1 = C, 2 = G, 3 = T, 4 = ?, 5 = - */ for (j = 0; j < Num; j++) { if (Data[j*Len +raw] == 'A') count[0]++; else if (Data[j*Len +raw] == 'C') count[1]++; else if (Data[j*Len +raw] == 'G') count[2]++; else if (Data[j*Len +raw] == 'T') count[3]++; else if (Data[j*Len +raw] == '?') count[4]++; else if (Data[j*Len +raw] == '-') count[5]++; else /* something wrong here !! */ { printf("UNRECOGNISED CHARACTER!!! ERROR\n\n"); return(0); } } /* this is a rough check on which are the interesting sites. They get remembered by adding them to the site_index structure */ for (i = 0; i < 4; i++) if (count[i] > 0) states++; if (states < 2) /* not a variable site */ continue; for ( k = 0; k < Num; k++) tempsite[k] = Data[k*Len +raw]; /* If there are three or four nucleotides, treat as transversions */ if ( states > 2 ) { if ( count[0] + count[2] == 1 ) { count[4]++; for ( k = 0; k < Num; k++) if (tempsite[k] == 'A' || tempsite[k] == 'G') tempsite[k] = '?'; count[0] = 0; count[2] = 0; } else if (count[1] + count[3] == 1) { count[4]++; for ( k = 0; k < Num; k++) if (tempsite[k] == 'C' || tempsite[k] == 'T') tempsite[k] = '?'; count[1] = 0; count[3] = 0; } else /* It's a fullblown transversion site */ { for ( k = 0; k < Num; k++) { if (tempsite[k] == 'G') { tempsite[k] = 'A'; count[0] = count[0] + count[2]; count[2] = 0; } if (tempsite[k] == 'T') { tempsite[k] = 'C'; count[1] = count[1] + count[3]; count[3] = 0; } } } } /* Double check - there should only be two informative types of sites left. If not, disaster!!! */ states = 0; for (i = 0; i < 4; i++) if (count[i] > 0) states++; if (states != 2) { printf("ERROR! site %d not adapting satisfactorily\n", raw+1); return(0); } if (MaxButOne(4, count) == 1) continue; /* Sites which can't be used for partition scores and must be remembered separately */ if ( count[5] + count[4] > 1 ) { tough[hard].site = raw+1; for (k = 0 ; k < Num; k++) tough[hard].data[k] = Data[k*Len +raw]; hard++; inform[tally].site = raw+1; inform[tally].partid = -1; inform[tally].halfpart = -1; tally++; } /* Sites which can be used for half/half scores but keep as hard */ else if (count[4] + count[5] == 1) { left = 'X'; tough[hard].site = raw+1; for (k = 0 ; k < Num; k++) tough[hard].data[k] = Data[k*Len +raw]; hard++; inform[tally].site = raw+1; /* find the single unclear sequence and group the other seqs */ for (k = 0; k < Num; k++) { if (tempsite[k] == '?' || tempsite[k] == '-') nhold = k; else if (left == 'X') left = tempsite[k]; else if (tempsite[k] != left) behind = tempsite[k]; } /* assign the singleton to the first group */ tempsite[nhold] = left; parti = MakeBinary(Num, tempsite); flag = 0; for ( i = 0; i < nice; i++) if (BigEqual(scores[i].part, parti)) { scores[i].sup = scores[i].sup + 0.5; inform[tally].partid = i; flag = 1; } if (flag == 0) { scores[nice].part = parti; scores[nice].sup = 0.5; scores[nice].con = 0.0; scores[nice].partid = nice; inform[tally].partid = nice; nice++; } /* assign the singleton to the second group */ tempsite[nhold] = behind; parti = MakeBinary(Num, tempsite); flag = 0; for( i = 0; i < nice; i++) if (BigEqual(scores[i].part, parti)) { scores[i].sup = scores[i].sup + 0.5; inform[tally].halfpart = i; flag = 1; } if (flag == 0) { scores[nice].part = parti; scores[nice].sup = 0.5; scores[nice].con = 0.0; scores[nice].partid = nice; inform[tally].halfpart = nice; nice++; } tally++; } /* Sites which can be used for nice partitions. Wow! */ else { parti = MakeBinary(Num, tempsite); flag = 0; for( i = 0; i < nice; i++) if (BigEqual(scores[i].part, parti)) { scores[i].sup++; inform[tally].partid = i; flag = 1; } if (flag == 0) { scores[nice].part = parti; scores[nice].sup = 1.0; scores[nice].con = 0.0; scores[nice].partid = nice; inform[tally].partid = nice; inform[tally].halfpart = -1; nice++; } inform[tally].site = raw+1; tally++; } } *NiceP = nice; *HardP = hard; return(tally); } /* PartitionData */ /* Function Name : BigEqual Summary : Check if two BigLongs are identical */ int BigEqual(BigLong one, BigLong other) { int i; for (i = 0; i < SUPERLONG; i++) if (one.sub[i] != other.sub[i]) return(0); return(1); } /* BigEqual */ /* Function Name :MakeBinary Summary : Convert a given number of characters into the corresponding binary partition, using the last character as the equivalent of "0" and anything else (hopefully only one char) as "1" */ BigLong MakeBinary(int Num, char *array) { BigLong binary; int i; char zero; int region; unsigned long int holder = 1; /* set everything equal to zero first. Partly to ensure that comparisons don't fail because bits I haven't set are compared */ for (i = 0; i < SUPERLONG; i++) binary.sub[i] = 0; /* the last species defines the partition called '0' */ zero = array[Num-1]; /* now do the actual binary creation */ for (i=0 ; i < Num-1 ; i++) { /* Find out which region we're currently in, and reset everything if it's a new region, otherwise build powers of two. */ region = i/LONGBITS; if ((i%LONGBITS) == 0) holder = 1; else holder = holder * 2; /* Now check if the array entry should be coded as a 1 or a 0. Adding an appropriate power of two is equivalent to setting that bit to 1, otherwise it stays 0. */ if (array[i] != zero) binary.sub[region] = binary.sub[region] + holder; } return(binary); } /* MakeBinary */ /* Function Name :UnMakeBinary Summary : Convert a binary partition into a character array with 'A' for less common and 'B' for more common. Each binary digit is compared and then rightshifted so the next digit can be compared */ char *UnMakeBinary(int Num, BigLong binary, char *holder) { int i; int region; int count = 0; for (i = 0; i < Num-1; i++) { /* find out what region we're in first */ region = i/LONGBITS; /* Then do comparison on the last digit in that region */ if (binary.sub[region] & 01) { holder[i] = 'X'; count++; } else holder[i] = 'Y'; /* Shift everything one place to the right, to do the next */ binary.sub[region] >>= 1; } /* Set the last character to Y (since Y is 0 ) and a \0 to end nicely */ holder[Num-1] = 'Y'; holder[Num] = '\0'; /* now, or X and Y, make the less common A and the more common B */ if (count*2 > Num) for (i = 0; i < Num; i++) { if (holder[i] == 'X' ) holder[i] = 'B'; else holder[i] = 'A'; } else for (i = 0; i < Num; i++) { if (holder[i] == 'X' ) holder[i] = 'A'; else holder[i] = 'B'; } return(holder); } /* UnMakeBinary */ /* Function Name :MinusScore Summary :given a struct partition with the partitions, go through and give the appropriate conflict scores to each partition (based on the support score each one already has). It assumes the conflict scores are already initialised to zero (or other appropriate value) Needed Functions : BinaryCons() */ void MinusScore(int nice, struct partition *scores) { int i, j; double supscore = 0.0, conscore = 0.0; /* Do a pairwise comparison of all partitions in the structure And add each's support score to the other's conflict score if they are inconsistent - generating raw conflict scores */ for (i = 1; i < nice; i++) for (j = 0; j < i; j++) if (BinaryCons(scores[i].part, scores[j].part ) ) { scores[i].con += scores[j].sup; scores[j].con += scores[i].sup; } /* Normalise the con scores to sum to the sup scores over all partitions */ for (i = 0; i < nice; i++) { supscore += scores[i].sup; conscore += scores[i].con; } if (conscore < 1) { printf("No significant conflict among partitions"); return; } for (i = 0; i < nice; i++) scores[i].con = scores[i].con * supscore / conscore; return; } /* MinusScore */ /* Function Name :BinaryCons Summary :Given two bipartitions, return 0 if they are consistent, 4 if not. The partitions are expressed in binary format. Pass contains "votes" for 0,1 1,0 1,1. Only two are allowed votes for consistency to be the case. */ int BinaryCons(BigLong one, BigLong other) { unsigned long int compare; int i; int pass[3] = {0, 0 ,0}; int key = 0; /* do a binary comparison of the ints, one at a time. Keep track of which kinds of binary forms we've seen using "pass" */ for (i = 0; i< SUPERLONG; i++) { /* if both BigLongs are equal to zero in a region, we don't have to do anything to them. Hence the test to see if at least one is not zero in that region */ if ((one.sub[i]) || (other.sub[i])) { if (one.sub[i] == 0) pass[0]++; /* i.e. 0,1 positions present */ else if (other.sub[i] == 0) pass[1]++; /* i.e. 1,0 positions present */ else /* BOTH are non-zero in this int */ { compare = one.sub[i] & other.sub[i]; if (one.sub[i] == other.sub[i]) { pass[2]++; /* 1,1 */ } else if (compare == one.sub[i]) { pass[0]++; /* 0,1 */ pass[2]++; /* 1,1 (since both non-zero) */ } else if (compare == other.sub[i]) { pass[1]++; /* 1,0 */ pass[2]++; /* 1,1 */ } else if (compare == 0) { pass[0]++; /* 0,1 */ pass[1]++; /* 1,0 */ } else /* at least one int isn't consistent, so it fails the entire thing */ return(4); /* 0,1 1,0 AND 1,1 in one int */ } } } /* Are the individual binary consistencies, consistent? There must be at most two categories of "pass" with non-zero score. */ for (i = 0; i < 3 ; i++) if (pass[i] == 0) key++; if (key >= 1) /* They are indeed consistent */ return(0); else return(4); /* They are inconsistent */ } /* BinaryCons */ /* Function Name :HeapSortPartition Summary :Sorts the entries in the partition structure in order of highest support score first through to smallest, or conflict scores, smallest to largest (depending on user input) - sdcu Uses the HeapSort algorithm from _Numerical Recipes_ The principle is first to "hire" each value as a "worker" below a "boss". If either of the "workers" is better (i.e. higher or lower, depending on the sort in progress) they get the "boss"'s position and the boss moves down to become one of the workers. Each worker in turn becomes the boss for another two workers, applying the same rules, until all values have been "hired". In the binary hierachy created the boss is always better than both workers below, for each boss. Then the top boss (i.e. the highest/lowest value) retires (I've used "sack" for this :-)) and the best worker is promoted to be the new boss, which has to be one of those immediately below the boss. Then this top boss is retired, etc. */ void HeapSortPartition(int size, struct partition *scores) { int hire, sack, i, j; struct partition hold; int sdcu = 0; char reply[6]; printf("\nWould you like to sort the partitions by:\n"); printf(" (a) largest to smallest support score\n"); printf(" (b) smallest to largest conflict score\n"); printf(" (c) largest to smallest (support minus conflict) ... PB\n"); printf(" (z) no sorting, leave as is\n"); while (!sdcu) { printf("Enter your choice > "); fgets(reply, 5, stdin); if (toupper(reply[0]) == 'A') sdcu = 1; else if (toupper(reply[0]) == 'B') sdcu = 2; else if (toupper(reply[0]) =='C') sdcu = 3; else if (toupper(reply[0]) == 'Z') return; } hire = size/2 + 1; sack = size; while ( hire > 1 || sack > 1) { if (hire > 1) { hire--; hold = scores[hire-1]; } else { hold = scores[sack-1]; scores[sack-1] = scores[0]; sack--; if (sack == 1) { scores[0] = hold; return; /* function returns here */ } } i = hire; j = hire + hire; while (j <= sack) { if ( j < sack) if ((sdcu == 1 && scores[j-1].sup > scores[j].sup) || (sdcu == 2 && scores[j-1].con < scores[j].con) || sdcu == 3 && scores[j-1].sup - scores[j-1].con > scores[j].sup - scores[j].con) j++; if ((sdcu == 1 && hold.sup > scores[j-1].sup) || (sdcu == 2 && hold.con < scores[j-1].con) || sdcu == 3 && hold.sup - hold.con > scores[j-1].sup - scores[j-1].con) { scores[i-1] = scores[j-1]; i = j; j = j+j; } else j = sack+1; } scores[i-1] = hold; } } /* HeapSortPartition */ /* Function name : MakeMatrix Summary : creates the consistency matrix of partitions and parsimonious sites. Needed Functions: MyMalloc, UnMakeBinary, Consistency, BinaryCons */ int MakeMatrix(int Num, int sites, struct site_index *site, int OrigNice, int Nice, struct partition *scores, int Hard, struct keep_site *tough, char **matrix) { int part, len; char *partdata; char *dummy; int hardcount; int consis; int i; BigLong sitepart; partdata = (char *)MyMalloc(sizeof(char) * (Num+1)); dummy = (char *)MyMalloc(sizeof(char) * (Num+1)); hardcount = 0; for (len = 0; len < sites; len++) { if (site[len].site == tough[hardcount].site) { for (part = 0; part < Nice; part++) { partdata = UnMakeBinary(Num, scores[part].part, partdata); if (site[len].partid == -1) consis = Consistency(Num, tough[hardcount].data, partdata, dummy); else if (site[len].halfpart == -1) { printf("Error: sites out of synch\n\n"); return(1); } else if (site[len].partid == scores[part].partid) consis = -1; else if (site[len].halfpart == scores[part].partid) consis = -1; else consis = Consistency(Num, tough[hardcount].data, partdata, dummy); matrix[part][len] = Assign(consis); } hardcount++; } else { /* First find the binary representation of the correct partition for this site */ for (i = 0; i < OrigNice; i++) if (site[len].partid == scores[i].partid) sitepart = scores[i].part; /* Now go through the partitions again and assign the right character to each partition relative to that site */ for (part = 0; part < Nice; part++) { if (site[len].partid == scores[part].partid) consis = -2; else consis = BinaryCons(sitepart, scores[part].part); matrix[part][len] = Assign(consis); } } } if (Hard != hardcount) { printf("ERROR analysing sites - incorrect number of stored sites\n"); return(1); } free(dummy); free(partdata); return(0); } /* MakeMatrix */ /* Function name : Assign Summary : Given a numerical consistency assign the arbitrary character used for the PostScript and other matrix reference -2 identical 'i' -1 half-identity 'j' 0 consistent 'c' 1+ inconsistent 'm' */ char Assign(int consis) { char result; if (consis == -2) result = 'i'; else if (consis == -1) result = 'j'; else if (consis == 0) result = 'c'; else result = 'm'; return(result); } /* Assign */ /* Function name : MaxButOne Summary : finds the next-largest number in an array of numbers. */ int MaxButOne(int i, int *count) { int j; int max = 1; int next = 1; for (j = 0; j < i; j++) { if (count[j] > next) { if (count[j] > max) { next = max; max = count[j]; } else next = count[j]; } } return(next); } /* MaxButOne */ /* Function name : Consistency Summary : Given all the possible character states at two sites, decide if they form a consistent tree. This naturally assumes that the character states are in the same order, corresponding to some arbitrary order of the sequences analysed. NOTE Return 0 if they are consistent, number of pairs remaining if not. This is opposite "reticulate" at least at time of writing. I think this is nicer, and "reticulate" may eventually use this convention also. The pairs with '-' (gaps) or '?' (unknown) are ignored, eliminated at the first step. Then duplicate pairs are eliminated (these will stuff up the algorithm otherwise) Then the algorithm, as per "reticulate", is employed. */ int Consistency(int max, char *pair1, char *pair2, char *live) { int i, j; int remain; int change = 1; int flag; remain = max; /* We can kill pairs with gaps or unknowns - they are uninformative */ for (i = 0; i < max; i++) { if (pair1[i] == '-' || pair1[i] == '?' || pair2[i] == '-' || pair2[i] == '?') { live[i] = '0'; remain--; } else live[i] = '1'; } /* eliminate duplicate pairs - the match eliminates the pair further on in the arbitrary order of sequences */ for (i = 0; i < max; i++) for (j = 0; j < i; j++) if (live[i] == '1') if (pair1[i] == pair1[j] && pair2[i] == pair2[j]) { live[i] = '0'; remain--; } /* Test the remaining living pairs, kill them if one of the parts of the pair is unique. Keep going until there are two or less pairs left - this is a consistent pair OR until we go through a round where we haven't been able to kill anything (change becomes 0) - this is an inconsistent pair */ while ((remain > 0) && (change > 0)) { change = 0; for ( i = 0; i < max ; i++) { if (live[i] == '1') { flag = 0; for( j = 0; j < max ; j++) if ((i != j) && live[j] == '1') if (pair1[i] == pair1[j]) flag = 1; if (flag == 0) { live[i] = '0'; remain--; change++; } else { flag = 0; for(j= 0; j < max ; j++) if ((i != j) && live[j] == '1') if (pair2[i] == pair2[j]) flag = 1; if (flag == 0) { live[i] = '0'; remain--; change++; } } /* if flag == 0 else */ } /* live[i] == 1 */ } /* for i 0 to max */ } /* while */ return(remain); } /* Consistency */ /* Function Name : RemovePart() Function Summary : Allow user to remove selected partitions, for example those with low support, or high conflict And user-selected partitions can also be removed. Needed functions : UnMakeBinary */ int RemovePart(int Num, int Nice, struct partition *scores) { char opt = 'l', str[10]; char supcon = 's'; /* s = support c = conflict */ double cutoff = 1.1; int top, bot, newnice; int victim; char deleting = 'Y'; char holder[MAXSPECIES]; struct partition holding; /* newnice is the value returned by the function. Set it equal to Nice in case it is not changed */ newnice = Nice; /* Let the user select the appropriate option, continuing until they finish ( 'f' ) */ while ( opt != 'f' && newnice > 0) { printf("\nPartitions have scores of support and conflict. The support"); printf(" score is the number\nof sites identical to that partition, "); printf("the conflict score is proportional to\nthe number of sites "); printf("inconsistent with that partition"); printf("\nThe options available are:\n"); printf(" (l) remove partitions with low support (or conflict)\n"); printf(" (h) remove partitions with high conflict (or support)\n"); printf(" (s) remove specific partitions\n"); printf(" (f) Finished removing partitions\n\n"); printf("Enter your choice > "); fgets(str, 9, stdin); opt = tolower(str[0]); if (opt == 'l') /* Cut out low-scoring partitions */ { printf("Remove partitions based on a low score "); printf("in support or conflict (s/c) > "); fgets(str, 9, stdin); if (toupper(str[0]) == 'C') supcon = 'c'; else supcon = 's'; printf("Partitions with scores BELOW (not equal to) the value "); printf("you enter are removed.\nEnter the value > "); fgets(str, 9, stdin); if (atof(str) > 0) cutoff = atof(str); printf("Using a cutoff value of %.2f\n", cutoff); /* This function just goes through and separates the values into those greater than the cutoff and those less than. It may contain some redundancy (values checked twice) but it was so surprisingly fast and neat to code that I decided to leave it like this. Should make it easier to tell if something goes wrong, I hope. */ top = 0; bot = Nice; while (top < bot) { if (supcon == 's' && scores[top].sup >= cutoff) top++; else if (supcon == 's' && scores[bot -1].sup < cutoff) bot--; else if (supcon == 'c' && scores[top].con >= cutoff) top++; else if (supcon == 'c' && scores[bot -1].con < cutoff) bot--; else { holding = scores[top]; scores[top] = scores[bot -1]; scores[bot -1] = holding; } } /* At this point (top = bot) all entries < top are the ones we want to keep, = top through to < Nice are the ones to throw out. But check with the user first, in case all partitions have been marked to be cut, or they decide they don't want that many partitions removed */ printf("%d partitions have scores greater than the cutoff\n", top); printf("%d partitions have scores in %s below the cutoff\n", Nice - top, supcon == 'c'?"conflict":"support" ); if (!top) printf("WARNING: all partitions marked for removal!\n"); printf("Do you want to cut the low scoring partitions out > "); fgets(str, 9, stdin); if (toupper(str[0]) == 'Y') { newnice = top; scores = (struct partition *)MyRealloc(scores, sizeof(struct partition) * newnice); } } else if (opt == 'h') /* remove high-scorers in exactly the same way*/ { printf("Remove partitions based on a high score "); printf("in support or conflict (s/c) > "); fgets(str, 9, stdin); if (toupper(str[0]) == 'S') supcon = 's'; else supcon = 'c'; printf("Partitions with scores ABOVE (not equal to) the value "); printf("you enter are removed.\nEnter the value > "); fgets(str, 9, stdin); if (atof(str) > 0) cutoff = atof(str); printf("Using a cutoff value of %.2f\n", cutoff); top = 0; bot = Nice; while (top < bot) { if (supcon == 's' && scores[top].sup <= cutoff) top++; else if (supcon == 's' && scores[bot -1].sup > cutoff) bot--; else if (supcon == 'c' && scores[top].con <= cutoff) top++; else if (supcon == 'c' && scores[bot -1].con > cutoff) bot--; else { holding = scores[top]; scores[top] = scores[bot -1]; scores[bot -1] = holding; } } printf("%d partitions have scores less than the cutoff\n", top); printf("%d partitions have scores in %s above the cutoff\n", Nice - top, supcon == 'c'?"conflict":"support" ); if (!top) printf("WARNING: all partitions marked for removal!\n"); printf("Do you want to go ahead and cut the high scorers out > "); fgets(str, 9, stdin); if (toupper(str[0]) == 'Y') { newnice = top; scores = (struct partition *)MyRealloc(scores, sizeof(struct partition) * newnice); } } else if (opt == 's') /* remove selected partitions */ { while (deleting == 'Y') { if (newnice < 2) printf("WARNING: only one partition remains!\n"); printf("Specify a partition between 1 and %d inclusive> ", newnice); fgets(str, 9, stdin); victim = atoi(str); if (victim > 0 && victim <= newnice) { printf("Your chosen partition is :\n%s\n", UnMakeBinary(Num , scores[victim-1].part, holder)); printf("The support score is %f and conflict score %f\n", scores[victim-1].sup, scores[victim-1].con); printf("Do you wish to go ahead and delete this partiti"); printf("on (y/n) > "); fgets(str, 9, stdin); if (toupper(str[0]) == 'Y') { holding = scores[newnice-1]; scores[newnice-1] = scores[victim-1]; scores[victim -1] = holding; newnice = newnice - 1; } } else printf("That is not a possible partition \n"); printf("Would you like to delete another partition (y/n) > "); fgets(str, 9, stdin); deleting = (toupper(str[0])); } } else if (opt != 'f') /* anything other than quit */ printf("\nThat is not an available option, sorry\n"); } printf("NOTE: This option has changed the order of the partitions.\n"); printf("You may need to re-order the remaining partitions\n"); return(newnice); } /* RemovePart */ /* * Function Name : SwapPart * Summary : Allow user to manually rearrange the order of * the partitions by swapping */ void SwapPart(int Num, int Nice, struct partition *scores) { char option = 'l'; char Answer[260]; int temp, first, second; char strin[MAXSPECIES]; int *neworder, *newmark; int typed, i , j, k; char *next; struct partition holding; neworder = (int *)MyMalloc(sizeof(int) * Nice); newmark = (int *)MyMalloc(sizeof(int) * Nice); /* provide a brief introduction to manual swapping of partitions */ printf("The partitions in the dataset can be displayed in any order.\n"); printf("This option allows display and change of the current order.\n"); printf("Each partition has a constant hexadecimal number identifying it,"); printf("\nhowever to simplify typing, the order of the partitions is "); printf("used here.\nThe 'l' option shows the current order and "); printf("hexadecimal identifiers.\nUse the numbers of the current "); printf("order to specify your chosen order.\nCheck the new order with "); printf("'l' again\n"); do { /* find out what the user would like to do */ printf("\nThe options available are : \n"); printf(" (l) List partitions in current order \n"); printf(" (s) Swap a pair of partitions \n"); printf(" (a) Provide a new order for all the partitions \n"); printf(" (f) Finished manual ordering of partitions \n"); printf("\nEnter your choice > "); fgets(Answer, 20, stdin); option = tolower(Answer[0]); if (option == 'l') { printf("The partitions are :\n"); printf("order hexadecimal identifier\n"); for (temp = 0; temp < Nice; temp++) printf("%5d %s\n", temp+1, BintoHex(scores[temp].part, Num/4 +1, strin)); /* In case I need this, it's here * * printf("Score for is %.2f against %.2f\n", * scores[temp].sup, scores[temp].con); */ } else if (option == 's') { printf("Enter the number of one partition (1 to %d) > ", Nice); fgets(Answer, 20, stdin); temp = atoi(Answer); if (temp > 0 && temp <= Nice) first = temp; else first = 0; printf("Enter the other partition (1 to %d) > ", Nice); fgets(Answer, 20, stdin); temp = atoi(Answer); if (temp > 0 && temp <= Nice) second = temp; else second = 0; if (first == 0 || second == 0 || first == second) printf("Those are not valid partitions to swap\n"); else /* swap them around in the scores structure */ { holding = scores[first-1]; scores[first-1] = scores[second-1]; scores[second-1] = holding; } } else if (option == 'a') { typed = 0; printf("Type in the new order. Leave at least one space between"); printf(" each number.\nHit the key when the line gets"); printf(" long (max. 200 characters)\nNumbers outside the range "); printf("(1 to %d) or repeats will be ignored\n", Nice); while (typed < Nice) { printf("%d more > ", Nice - typed); fgets(Answer, 260, stdin); j = 0; while (Answer[j] != '\0') { next = Answer + j; temp = atoi(next); k = 1; for (i = 0; i < typed; i++) if (temp == neworder[i]) k = 0; /* No good */ if ( temp < 1 || temp > Nice) k = 0; /* No good */ if (k) { neworder[typed] = temp; typed++; } /* move forward to the next number on the line*/ while (isgraph(Answer[j])) j++; while (isspace(Answer[j])) j++; } } /* Sort the partitions into the new order. */ for (i = 0; i < Nice ; i++) newmark[i] = 0; for (i = 0; i < Nice ; i++) { if (newmark[i]) continue; /* this is a loop we've done*/ holding = scores[i]; j = i; k = neworder[i] - 1; do { scores[j] = scores[k]; newmark[k] = 1; j = k; k = neworder[j] - 1; } while (k != i); scores[j] = holding; } } else if (option != 'f') printf("Sorry, that is not an available option.\n\n"); } while (option != 'f'); free(neworder); free(newmark); return; } /*SwapPart */ /* Function Name : PartSim Summary : Calculate similarity scores, by comparing partitions pairwise. then use the scores to order the partitions so that partitions next to each other have as high similarity scores as possible. I.e. the final order's similarity score matrix should have a high sum along the diagonal. */ int PartSim(int Num, int Nice, int sites, struct partition *scores, char **matrix) { FILE *Out; char Filename[120]; char *sscend; char *warn; int ss = 0; int *ssindex; int count[2]; int obssim; double **pq; double obsshared; double expshared; double normalisor; double **Simscore; int **adjacent; int groups, pone, ptwo, gone, gtwo; double maxscore; int nor, oldneigh, negpairs; int *Neworder; int *newmark; struct partition holding; int sort; char pstr[MAXSPECIES]; int hexlength; int i, j, k; pq = (double **)MyMalloc( sizeof(double *) * Nice); Simscore = (double **)MyMalloc(sizeof(double *) * Nice); adjacent = (int **)MyMalloc(sizeof(int *) * Nice); Neworder = (int *)MyMalloc(sizeof(int) * Nice); newmark = (int *)MyMalloc(sizeof(int) * Nice); for (i = 0; i < Nice; i++) { pq[i] = (double *)MyMalloc( sizeof(double) * 2); Simscore[i] = (double *)MyMalloc(sizeof(double) * Nice); adjacent[i] = (int *)MyMalloc(sizeof(int) * 3); } ssindex = (int *)MyMalloc(sizeof(int) * sites); warn = (char *)MyMalloc( sizeof(char) * Nice); hexlength = Num/4 + 1; /* provide a bit of background and give the option between sorting and just getting the matrix of distances given the current order */ printf("This option calculates the similarity scores of partitions to "); printf("each\nother based on the pattern of consistency and inconsisten"); printf("cy of sites\nwith each partition. The similarity scores are then "); printf("used to create an order\nof partitions in which adjacent partiti"); printf("ons have as high similarity as\npossible. The partitions can be "); printf("placed in this order or left as they were.\nIn any case, the "); printf("similarity scores will be saved in a file\n"); printf("Sort partitions into the similarity order (y/n) > "); fgets(Filename, 40, stdin); if (toupper(Filename[0]) == 'Y') sort = 1; else sort = 0; /* Prompt the user for the name of the output file... */ printf("\nEnter the name of the output file to be used (.ssc) > "); gets(Filename); sscend = strstr(Filename, ".ssc"); if (sscend == 0) strcat(Filename, ".ssc"); /* try to open the output file. If no can do, return 1 to Main. */ if ((Out = fopen (Filename, "w")) == NULL) { printf("Unable to open Output file - %s\n", Filename); return(1); } /* Determine how many sites to really use - all those inconsistent with at least one partition */ for (i = 0; i < sites; i++) { k = 0; for (j = 0; j < Nice && k == 0; j++) if (matrix[j][i] == 'm') k = 1; if (k == 1) { ssindex[ss] = i; ss++; } } /* print a header for the file */ fprintf(Out, "Similarity scores for partitions in partimatrix\n"); fprintf(Out, "%d sequences, %d partitions, %d sites\n\n", Num, Nice, sites); fprintf(Out, "Statistics performed on %d sites, those with inconsistency\n", ss); fprintf(Out, "\nFraction of sites consistent for each partition\n"); /* Go through each partition and count the number of consistent and inconsistent sites. Print out a summary */ for (i = 0; i < Nice; i ++) { count[0] = 0; count[1] = 0; for (j= 0; j < ss; j++) { if (matrix[i][ssindex[j]] == 'm') count[1]++; else count[0]++; } if (count[0] == ss) warn[i] = 'c'; /* every site is consistent */ else if (count[1] == ss) warn[i] = 'm'; /* every site is inconsistent :-) */ else warn[i] = 'g'; /* it's a "good" part with real fractions*/ pq[i][0] = (double)count[0]/ss; pq[i][1] = (double)count[1]/ss; if (!sort) fprintf(Out," %s %.3f\n", BintoHex(scores[i].part, hexlength, pstr), pq[i][0] ); } /* Now go through the pairwise comparisons of partitions. Print out the results as we go, for unsorted matrices */ if (!sort) { fprintf(Out, "\nMatrix of Similarity Scores between pairs of"); fprintf(Out, " partitions\n"); } for ( i = 0; i < Nice-1; i++) { /* Put in spaces to make the matrix upper triangular */ if (!sort) for (k = 0; k < i; k++) { fprintf(Out, " "); if ((k+1)%20 == 0) fprintf(Out, "\n"); } for (j = i+1 ; j < Nice; j++) { obssim = 0; /* Tally all those sites where the partitions are either both consistent or both inconsistent with the site*/ for (k = 0; k < ss; k++) { if (matrix[i][ssindex[k]] == 'm') { if (matrix[j][ssindex[k]] == 'm') obssim++; } else if (matrix[j][ssindex[k]] != 'm') obssim++; } obsshared = (float)obssim/ss; expshared = pq[i][0] * pq[j][0] + pq[i][1] * pq[j][1]; if (warn[i] != 'g' || warn[j] != 'g') Simscore[i][j] = 0.0; else { normalisor = sqrt(expshared * (1- expshared)/ss ); Simscore[i][j] = (obsshared - expshared) / normalisor; } if (!sort) { fprintf(Out, "%4.1f ", Simscore[i][j]); if ((j)%20 == 0) fprintf(Out, "\n"); } } if (!sort) fprintf(Out, "\n"); } if (!sort) fprintf(Out, "\n\n"); /* Find the order of the partitions based on the similarity scores. */ /* First initialise stuff */ for (i = 0; i < Nice; i++) { adjacent[i][0] = -1; adjacent[i][1] = -1; adjacent[i][2] = -1; } groups = 0; negpairs = 0; for (i = 0; i < Nice-1; i++) { maxscore = -1000.0; pone = -1; ptwo = -1; for (j = 0; j < Nice-1; j++) { if (adjacent[j][2] == -1) for (k = j+1; k < Nice; k++) if (adjacent[k][2] == -1 && (adjacent[k][0] == -1 || adjacent[k][0] != adjacent[j][0])) if (Simscore[j][k] >= maxscore) { maxscore = Simscore[j][k]; pone = j; ptwo = k; } } if (pone > -1 && ptwo > -1) { if (maxscore < 0) negpairs++; /* We're creating a new group */ if (adjacent[pone][1] == -1 && adjacent[ptwo][1] == -1) { adjacent[pone][0] = groups; adjacent[ptwo][0] = groups; groups++; adjacent[pone][1] = ptwo; adjacent[ptwo][1] = pone; } /* the first one is already in a group */ else if (adjacent[ptwo][1] == -1) { adjacent[ptwo][0] = adjacent[pone][0]; adjacent[ptwo][1] = pone; adjacent[pone][2] = ptwo; } /* the second is already in a group */ else if (adjacent[pone][1] == -1) { adjacent[pone][0] = adjacent[ptwo][0]; adjacent[pone][1] = ptwo; adjacent[ptwo][2] = pone; } /* they are both in groups; the groups need to be joined */ else { gone = adjacent[pone][0]; gtwo = adjacent[ptwo][0]; if (gone < gtwo) for (j = 0; j < Nice; j++) { if (adjacent[j][0] == gtwo) adjacent[j][0] = gone; else if (adjacent[j][0] > gtwo) adjacent[j][0]--; } else /* gtwo < gone */ for (j = 0; j < Nice; j++) { if (adjacent[j][0] == gone) adjacent[j][0] = gtwo; else if (adjacent[j][0] > gone) adjacent[j][0]--; } groups--; adjacent[pone][2] = ptwo; adjacent[ptwo][2] = pone; } } else /* the remaining scores are all less than -1000!! We are done*/ { for (j = 0; j < Nice ; j++) if (adjacent[j][0] == -1) { adjacent[j][0] = groups; groups++; } i = Nice; } } /* print the results of the grouping */ printf("Ordering of partitions based on similarity scores complete\n"); nor = 0; if (groups > 1 || negpairs > 0) { fprintf(Out, "The partitions could not all be joined into order with "); fprintf(Out, "non-negative scores.\nThe breaks are marked\n"); } else fprintf(Out, "The following order of partitions was found:\n"); oldneigh = -1; while (nor < Nice) { for (j = 0; j < Nice; j++) if (adjacent[j][2] == -1 && adjacent[j][0] >= 0) /* we've found an "end" of the matter */ { if (groups > 1 && nor > 0) fprintf(Out, "Negative score between previous and next\n"); if (sort) fprintf(Out, " %s %.3f\n", BintoHex(scores[j].part, hexlength, pstr), pq[j][0]); else fprintf(Out, " %d %s\n", j+1, BintoHex(scores[j].part, hexlength, pstr) ); Neworder[nor] = j; nor++; oldneigh = j; adjacent[j][0] = -1; j = Nice; } if (oldneigh == -1) /* this is a bug ... */ { printf("Partitions have been missed in the similarity routine\n"); printf("This is a bug in the program\n"); fflush(Out); fclose(Out); return(5); } while (oldneigh > -1) { for (j = 0; j < Nice; j++) if (adjacent[j][0] > -1) if (adjacent[j][1] == oldneigh || adjacent[j][2] == oldneigh) { if ((j < oldneigh && Simscore[j][oldneigh] < 0) || (j > oldneigh && Simscore[oldneigh][j] < 0)) { fprintf(Out, "Negative score between previous and next\n"); } if (sort) fprintf(Out, " %s %.3f\n", BintoHex(scores[j].part, hexlength, pstr), pq[j][0]); else fprintf(Out, "%d %s\n", j+1, BintoHex(scores[j].part, hexlength, pstr) ); Neworder[nor] = j; nor++; oldneigh = j; adjacent[j][0] = -1; j = Nice+1; } if (j == Nice) /* no match was found */ { if (adjacent[oldneigh][2] == -1) /* this is what we expect*/ oldneigh = -1; else /* bizarre error */ { printf("WARNING: ordering failed!!\n"); fprintf(Out, "ERROR: The previous partition should also b"); fprintf(Out, "e next to partition %d %s\n", adjacent[oldneigh][2]+1, BintoHex(scores[adjacent[oldneigh][2]+1].part, hexlength, pstr) ); oldneigh = -1; } } } } if (sort) /* print out the matrix using the newly-defined order */ { fprintf(Out, "\nMatrix of Similarity Scores between pairs of"); fprintf(Out, " partitions\n"); for (i = 0; i < Nice-1 ; i++) { for (k = 0; k < i; k++) { fprintf(Out, " "); if ((k+1)%20 == 0) fprintf(Out, "\n"); } for (j = i+1; j < Nice; j++) { if (Neworder[i] < Neworder[j]) maxscore = Simscore[Neworder[i]][Neworder[j]]; else maxscore = Simscore[Neworder[j]][Neworder[i]]; fprintf(Out, "%4.1f ", maxscore); if ((j)%20 == 0) fprintf(Out, "\n"); } fprintf(Out, "\n"); } fprintf(Out, "\n\n"); } /* Sort the actual partitions into the new order. */ if (sort) { for (i = 0; i < Nice ; i++) newmark[i] = 0; for (i = 0; i < Nice ; i++) { if (newmark[i]) continue; /* this is a loop we've done*/ holding = scores[i]; j = i; k = Neworder[i]; do { scores[j] = scores[k]; newmark[k] = 1; j = k; k = Neworder[j]; } while (k != i); scores[j] = holding; } } /* tidy up */ fflush(stdout); for (i = 0; i < Nice; i++) { free(pq[i]); free(Simscore[i]); free(adjacent[i]); } free(pq); free(Simscore); free(adjacent); free(Neworder); free(ssindex); free(warn); fflush(Out); fclose(Out); return(0); } /* PartSim */ /* Function : ClusCount Summary : Given a partition, count number of runs/clusters and the adjacent similarities, for statistical analysis. */ int ClusCount(int Num, int Nice, struct partition *scores, int Sites, char **matrix) { FILE *Out; char Filename[120]; char Answer[10]; char *cluend; int white, black, wwcount, bbcount, wbcount, run; int *whrun, *blrun; char *pstr; int hexlength; int parttotest; char another = 'Y'; int i, j, k; hexlength = Num/4 + 1; pstr = (char *)MyMalloc( sizeof(char) * (hexlength+1)); whrun = (int *)MyMalloc( sizeof(int) * (Sites+1)); blrun = (int *)MyMalloc( sizeof(int) * (Sites+1)); /* Introduce the concept */ printf("This option counts clustering and runs along individual partition"); printf("s\nThe partitions should be referred to using the first number "); printf("in this list\n\n"); for (i = 0; i < Nice; i++) { pstr = BintoHex(scores[i].part, hexlength, pstr); printf("%d %s\n", i+1, pstr); } /* Prompt the user for the name of the output file... */ printf("\nEnter the name of the output file to be used (.clu) > "); gets(Filename); cluend = strstr(Filename, ".clu"); if (cluend == 0) strcat(Filename, ".clu"); /* try to open the output file. If no can do, return 1 to Main. */ if ((Out = fopen (Filename, "w")) == NULL) { printf("Unable to open Output file - %s\n", Filename); return(1); } do { printf("Enter the number of a partition, or 0 to finish > "); fgets(Answer, 10, stdin); parttotest = atoi(Answer); if (parttotest > Nice || parttotest < 0) { printf("That is not a partition\n"); another = 'Y'; } else if (parttotest) { pstr = BintoHex(scores[parttotest-1].part, hexlength, pstr); white = 0; black = 0; for (j = 0; j <= Sites; j++) { whrun[j] = 0; blrun[j] = 0; } wwcount = 0; bbcount = 0; wbcount = 0; if (matrix[parttotest-1][0] == 'm') { black = 1; run = -1; } else { white = 1; run = 1; } /* Go along the partition, counting up the relevant scores */ for (j = 1; j < Sites; j++) { if (matrix[parttotest-1][j] == 'm') { black++; if (run > 0) { whrun[run]++; run = -1; wbcount++; } else { whrun[0]++; run--; bbcount++; } } else { white++; if (run < 0) { blrun[-run]++; run = 1; wbcount++; } else { blrun[0]++; run++; wwcount++; } } } /* Add the last run to the appropriate place */ if (run > 0) whrun[run]++; else blrun[-run]++; /* print out result into file */ fprintf(Out, "*** Partition %s ***\n", pstr); fprintf(Out, "%d sites: %d consistent, %d inconsistent\n", Sites, white, black); fprintf(Out, "Observed runs:\n"); fprintf(Out, "Length Consist Inconsist\n"); for (k = 0; k <= Sites; k++) if (whrun[k] || blrun[k]) fprintf(Out,"%6d %6d %6d\n", k, whrun[k], blrun[k]); fprintf(Out, "\nConsecutive sites: (ww) %d (bb) %d (wb) %d\n", wwcount, bbcount, wbcount); fprintf(Out, "*** *** ***\n"); fprintf(Out, "\n"); printf("Counts for partition added to file\n"); } else another = 'N'; } while (another == 'Y'); fflush(Out); fclose(Out); return(0); } /* Function Name : WriteTextSum() Function Summary : This function writes a text summary of the created matrix. This is partly to save hand counting of sites in the PostScript output, partly to allow those without access to PostScript printers to still get something meaningful for their money. Functions Needed : BintoHex, UnMakeBinary */ int WriteTextSum(int Num, char **name, int Sites, int Nice, struct partition *scores, struct site_index *pos, char **matrix, int Hard, struct keep_site *tough) { FILE *Out; char Filename[120]; char *parchk; char parend[5] = ".par"; int ident, consis, incon; int tally, temp; char *nstr , *pstr; int partc; int page, endpage, index, dig; int i, j, k; nstr = (char *)MyMalloc(sizeof(int) * (Num+1)); pstr = (char *)MyMalloc(sizeof(int) * (Num+1)); /* Prompt the user for the name of the output file. Add '.par' to the file name if the user hasn't already. */ printf("\nEnter the name of the output file (%s) > ", parend); fgets(Filename, 120, stdin); for (i = 0; Filename[i] != '\n'; i++); Filename[i] = '\0'; parchk = strstr(Filename, parend); if (parchk == 0) strcat(Filename, parend); /* try to open the output file. If no can do, return 1 to main(). */ if ((Out = fopen (Filename, "w")) == NULL) { printf("Unable to open Output file - %s\n", Filename); return(1); } fprintf(Out, "Summary of partition matrix for the following species:\n"); /* List the names of the species used, in order, so that the later data is intelligible */ for (i = 0; i < Num; i++) fprintf(Out, "%4d %s\n", i+1, name[i]); fprintf(Out, "\n"); partc = (Num-1)/4 +1; fprintf(Out, "Partitions are referred to in hexadecimal.\n"); fprintf(Out, "The graphical equivalent of each number is shown:\n\n"); for ( i = 0 ; i < Nice ; i++) { nstr = UnMakeBinary(Num, scores[i].part, nstr); for (j = 0 ; j < Num ; j++) { if (nstr[j] == 'A') nstr[j] = '*'; else nstr[j] = '.'; } pstr = BintoHex(scores[i].part, partc, pstr); if (Num < 60) fprintf(Out, "%s %s\n", pstr, nstr); else { fprintf(Out, "%s", pstr); for (j = 0; j < Num ; j++) { if (j%60 == 0) fprintf(Out, "\n "); fprintf(Out, "%c", nstr[j]); } fprintf (Out, "\n"); } } fprintf(Out, "\n"); fprintf(Out, "For each partition, support and conflict scores are shown,"); fprintf(Out, " followed by\nthe hexadecimal code, and the number of sites"); fprintf(Out, " identical, consistent(excl ident)\nand inconsistent with"); fprintf(Out, " each partition:\n\n"); fprintf(Out, " Support Conflict "); for (j = 4; j < partc; j++) fprintf(Out, " "); fprintf(Out, "Part Ident Consis Incon\n"); for ( i= 0; i < Nice; i++) { pstr = BintoHex(scores[i].part, partc, pstr); ident = 0; consis = 0; incon = 0; for (j = 0; j < Sites ; j++) { if (matrix[i][j] == 'i') ident++; else if (matrix[i][j] == 'c' || matrix[i][j] == 'j') consis++; else incon++; } fprintf(Out, "%8.1f %9.2f %s", scores[i].sup, scores[i].con, pstr); for (j = partc; j < 4; j++) fprintf(Out, " "); fprintf(Out, " %5d%7d%7d\n", ident, consis, incon ); } fprintf(Out, "\n"); fprintf(Out, "Site positions identical to each partition\n"); if (Hard) fprintf(Out, "Half sites are in parentheses\n"); for (i = 0; i < Nice; i++) { fprintf(Out, "Partition %s:\n", BintoHex(scores[i].part, partc, pstr)); tally = 0; for (j = 0 ; j < Sites; j++) { if (matrix[i][j] == 'i') { fprintf(Out, " %d", pos[j].site); temp = pos[j].site; while (temp) { tally++; temp = temp/10; } tally = tally + 2; if (tally > 70) { fprintf(Out, "\n"); tally = 0; } } else if (matrix[i][j] == 'j') { fprintf(Out, " (%d)", pos[j].site); temp = pos[j].site; while (temp) { tally++; temp = temp/10; } tally = tally + 4; if (tally > 70) { fprintf(Out, "\n"); tally = 0; } } } fprintf(Out, "\n"); } /* If there are no hard sites to print out, we are finished */ if (!Hard) { free(nstr); free(pstr); fflush(Out); fclose(Out); return(0); } /* Print the "Hard" sites - Only put 50 sequences in each block... */ fprintf(Out, "\nThe sites that did not contribute to partition scores,\n"); fprintf(Out, "and those which scored as half on two partitions:\n\n"); for( page = 0 ; page < Num ; page += 50) { endpage = page + 50; if (endpage > Num) endpage = Num; fprintf(Out, "Sequence "); index = endpage; while (index > 0) { for (dig = 1; dig <= index; dig *= 10) ; if (dig != 1) dig = dig/10; for (i = page ; i < endpage ; i++) { fprintf(Out, "%d", ((i+1)/dig)%10 ); if ((i+1)%5 == 0 ) fprintf(Out, " "); } fprintf(Out, "\n "); index = index/10; } fprintf(Out, "\n\n"); /* write out each site number followed by the appropriate residues */ for ( j = 0 ; j < Hard ; j++) { fprintf(Out, "%8d ", tough[j].site ); for ( k = page ; k < endpage ; k++) { fprintf(Out, "%c", tough[j].data[k]); if ( (k+1)%5 == 0 ) fprintf(Out, " "); } fprintf(Out, "\n"); } } /* Tidy up, close the file, and return */ free(nstr); free(pstr); fflush(Out); fclose(Out); return(0); } /* Function Name : WritePostScript() Function Summary : This function writes the created matrix as a PostScript file. Formatting integers are: starw = width of the "coloured in" partition display partc = number of digits required to display the partition numbers in hexidecimal partw = width of the partition numbers above maxplot = height of the tallest score bar maxlow = 'height' of the tallest minus score bar maxdeep = width of the partitions namedeep = width of the space between (where the species names have to fit) matr1 = point to start writing the matrix itself matr2 = ditto, when the first row is different. sites1 = sites in the first column sites2 = sites in later columns format is the print style: 0 LentoPlot heads first col, hex numbers head other cols 1 LentoPlot and partdis first, hex numbers head other cols 2 LentoPlot and partdisplay head all columns. Functions Needed : PostScriptBlurb, BintoHex, UnMakeBinary */ int WritePostScript(int seqs, char **name, int sites, int nice, struct partition *scores, struct site_index *pos, char **matrix) { FILE *OutFile; char Filename[120]; char *pschk; char psend[5] = ".ps"; int onepage, page = 0; int starw, partc, partw, maxplot, maxlow, bar; int maxdeep, namedeep, matr1, matr2, matrnow; int temp, format = 1; int formrow, totrow, rowpp, currow = 1; int lhr, curlhr, ppage = 0; int lebound, dobound, ribound, upbound; int sites1, sites2, fitto, x, y, sofar = 0; char reply[5]; char bin[MAXSPECIES]; char str[MAXSPECIES]; float maxscore = 0, maxminus = 0; int i, j; /* Check the matrix is a decent size before anything else happens */ if (sites < 2 || nice < 2) { printf("\nNo PostScript file can be made: \n"); printf("There are %d sites and %d partitions\n", sites, nice); return(0); } /* Also try to make sure it's not too wide to fit on the paper. These numbers will be used extensively later in the subroutine. maxdeep is the width of the matrix, namedeep of the names down the side. Both have to fit on a piece of paper (less its edges). If both fit more than once, we can have several rows! This is what 'formrow' says. */ maxdeep = nice * FONT; namedeep = (int)(FONT * NAMELEN *.65) +1; formrow = (int)((PAPERWIDTH - PAPEREDGE - PAPEREDGE) / (maxdeep +namedeep)); if (formrow <= 0) { printf("The matrix is too wide to fit on the available papersize.\n"); printf("Consider removing some partitions, for example low scores\n"); printf("Otherwise recompile the program for a smaller font size,"); printf("or larger paper\n"); return(0); } /* Prompt the user for the name of the output file. Add '.ps' to the file name if the user hasn't already. */ printf("\nEnter the name of the output file (%s) > ", psend); fgets(Filename, 120, stdin); for (i = 0; Filename[i] != '\n'; i++); Filename[i] = '\0'; pschk = strstr(Filename, psend); if (pschk == 0) strcat(Filename, psend); /* try to open the output file. If no can do, return 1 to main(). */ if ((OutFile = fopen (Filename, "w")) == NULL) { printf("Unable to open Output file - %s\n", Filename); return(1); } /* Set up the magic numbers for formatting this mess */ starw = (seqs * FONT) + 1; partc = (int)(seqs-1)/4 +1; partw = (int)(partc * FONT * .65) + 1; maxplot = (int)(maxdeep/3) + 1; if (maxplot > (PAPERLENGTH - 2*PAPEREDGE) /8) maxplot = (PAPERLENGTH - 2*PAPEREDGE) /8; bar = 3*PAPEREDGE + partw + maxplot; /* Scale the sup and con scores (using maxplot from above) */ for (i = 0 ; i < nice; i++) { if (scores[i].sup > maxscore) maxscore = scores[i].sup; if (scores[i].con > maxminus) maxminus = scores[i].con; } maxminus = maxminus / maxscore; /* maxminus is now scaled to about 1 */ maxscore = maxplot / maxscore; maxlow = (int)(maxplot * maxminus) + 2; /* This is where the matrix would start on the page. If it's too far down the page (i.e. we're never going to fit any sites onto this) ditch the graphic display of partitions (starw) */ matr1 = 3*PAPEREDGE + maxplot + maxlow + starw + 2*partw; if (matr1 > PAPERLENGTH - PAPEREDGE) { format = 0; matr1 = matr1 - starw - partw; printf("Because the dataset is so large, it is not possible to "); printf("display the\npartitions graphically on the page.\n"); printf("The partition numbers, LentoPlot and sites will still "); printf("be printed\n"); } else if (matr1*2 > PAPERLENGTH - PAPEREDGE) { printf("The dataset is very large and including the graphic display "); printf("of partitions\ntakes more than half the page. This can be "); printf("omitted, while the LentoPlot\nwill be printed in either "); printf("case. Include partition display (y/n) > "); fgets(reply, 5, stdin); if (toupper(reply[0]) == 'N') { format = 0; matr1 = matr1 - starw - partw; } } /* In subsequent rows the matrix will generally start just after the hexadecimal numbers. But adjust this value so it lines up with matr1 (i.e. is a multiple of FONT different) */ matr2 = 3*PAPEREDGE + partw; temp = (matr2 - matr1)%FONT; matr2 = matr2 + temp; /* Try to figure out how the sites fit in rows. All these equations work, honestly. They were originally calculated for A4 paper, but can now be reset to any papersize by changing the default values up the top and recompiling the program */ sites1 = (int)((PAPERLENGTH - PAPEREDGE - matr1)/FONT); if (sites1 < 0) { printf("Warning: large dataset. May print out very badly!\n"); sites1 = 0; } sites2 = (int)((PAPERLENGTH - PAPEREDGE - matr2)/FONT); /* Discover how many rows of blocks are needed to fit everything in */ if (sites <= sites1) /* They all fit in one row */ { totrow = 1; sites1 = sites; } /* Next section: if sites1 is at least three quarters sites2 this means the sites take up most of the page length and we can afford to "head" each column if desired. DO NOT check this option if we've already ditched the graphic display! */ else if (sites1*4 > sites2*3 && format) { printf("Would you like the partitions and lentoplot to head each "); printf("column of matrix?\n Otherwise they will head only the first"); printf("column\nPartitions above each column (y/n) > "); fgets(reply, 5, stdin); if (toupper(reply[0]) == 'Y') { format = 2; matr2 = matr1; sites2 = sites1; totrow = (int)((sites-1)/sites1) +1; } else /* the user only wants one copy of lentoplot and partitions */ { totrow = 2 + (int)((sites - sites1 - 1)/sites2); } } else /* the lentoplot and partitions take up too much room to print on each block */ totrow = 2 + (int)((sites - sites1 - 1)/sites2); /* Find how many rows per page to actually print */ if (totrow == 1) { rowpp = 1; onepage = 1; } else if ( totrow <= formrow) { rowpp = totrow; temp = (matr1 - matr2)/FONT; if (temp*(totrow-1) < sites) { sites2 = (sites + temp + totrow - 1)/totrow; sites1 = sites2 - temp; } onepage = 1; } else { rowpp = formrow; onepage = 0; } /* How far to the left hand side of the paper do we start? */ lhr = (int)(PAPERWIDTH/2) - (int)((rowpp*(maxdeep+namedeep))/2); upbound = PAPERLENGTH - 3*PAPEREDGE; lebound = lhr; ribound = lhr + (maxdeep + namedeep)*rowpp; dobound = PAPERLENGTH - matr1 - sites1*FONT; /* Stick all the boring PostScript headers, fonts etc into the file */ PostScriptBlurb(OutFile, onepage, lebound, dobound, ribound, upbound); /* set the current matrix print starting point to matr1 */ matrnow = matr1; fitto = sites1; /* Put the actual data into the PostScript file. */ while (currow <= totrow) { /* If it's a new page, put the page header stuff and move to the left*/ if ( formrow == 1 || (currow%formrow) == 1) { page++; fprintf(OutFile, "\n\n%%%%Page: ? %d\n", page); fprintf(OutFile, "/savepage save def\n\n"); fprintf(OutFile, "0 %d translate\n", PAPERLENGTH); fprintf(OutFile, "-90 rotate\n\n"); curlhr = lhr; } else curlhr = curlhr + maxdeep + namedeep; /* print the partitions on the edge of the paper */ fprintf(OutFile, "\n/Courier findfont %d scalefont setfont\n", FONT); for ( i = 0 ; i < nice ; i++) fprintf(OutFile, "(%s) %d %d ms \n", BintoHex(scores[i].part, partc, str), 3*PAPEREDGE, curlhr +1 + i*FONT ); /* Setup row 1 with the partitions, lentoscores etc, or each row if requested earlier (format = 2). If we've got a big dataset, (format = 0) don't do the graphic print of partitions, just the lentoplot */ if ((currow == 1) || format == 2) { if (format) { for (i = 0 ; i < nice ; i++) fprintf(OutFile, "(%s) %d %d ms\n", BintoHex(scores[i].part, partc, str), bar + maxlow + starw, curlhr+1 + i*FONT ); fprintf(OutFile, "\n/Ingrid-Blocks findfont %d ", FONT); fprintf(OutFile, "scalefont setfont\n"); for (i = 0 ; i < nice ; i++) fprintf(OutFile, "(%s) %d %d ms\n", UnMakeBinary(seqs, scores[i].part, bin), bar + maxlow, curlhr + i*FONT); } /* This part prints the lentoplot - a graphic display of scores for and against a given partition */ fprintf(OutFile, "\n%.2f setlinewidth\n", (float)FONT/16); for ( i= 0; i < nice; i++) { fprintf(OutFile, "%d %d %d %d whitecol\n", bar, curlhr + i*FONT, -(int)(scores[i].sup*maxscore), FONT); fprintf(OutFile, "%d %d %d %d blackcol\n", bar, curlhr + i*FONT, (int)(scores[i].con*maxscore), FONT); } fprintf(OutFile, "\n"); } /* Print the actual matrix of consistencies */ fprintf(OutFile, "\n/Ingrid-Blocks findfont %d scalefont setfont\n", FONT); for (i = 0; i < nice; i++) { fprintf(OutFile, "("); for (j = sofar ; j < fitto; j++) { fputc(matrix[i][j], OutFile); if (((j-sofar+1)%70) == 0) fprintf(OutFile, "\\\n"); } fprintf(OutFile, ") %d %d ms\n", matrnow, (curlhr + i*FONT)); } /* Now for the final bits on the PostScript page, most notably all the site numbers, at right angles. */ if ((currow%formrow) == 0 || currow == totrow) { fprintf(OutFile, "\n 90 rotate\n 0 -%d translate\n", PAPERLENGTH); fprintf(OutFile, "/Courier findfont %d scalefont setfont\n",FONT); x = lhr+1 + nice*FONT; if ((page == 1) || format == 2 ) { /* Only print the names of sequences if there was room for them! */ if (format) { y = PAPERLENGTH - bar - maxlow - FONT +1; for (j = 0 ; j < seqs; j++) fprintf(OutFile, "(%s) %d %d ms\n", name[j], x, y-j*FONT); y = PAPERLENGTH - matr1 - FONT +1; } else y = PAPERLENGTH - matr1 - FONT +1; } else y = PAPERLENGTH - matr2 - FONT +1; for ( j = ppage ; j < fitto ; j++) { fprintf(OutFile, "(%d) %d %d ms\n", pos[j].site, x, y); if ((j - sites1 + 1)%sites2 == 0) { x = x + maxdeep + namedeep; y = PAPERLENGTH - matrnow - FONT + 1; } else y = y - FONT; } ppage = fitto; fprintf(OutFile, "\n\nsavepage restore\nshowpage\n"); fprintf(OutFile, "%%%%PageTrailer\n"); } /* Set up as needed for the next block of matrix. */ /* move block starting point as needed */ matrnow = matr2; sofar = fitto; fitto = fitto + sites2; if (fitto > sites) fitto = sites; currow ++; } /* Tidy up, close the file, and return */ if (onepage) fprintf(OutFile, "%%%%Trailer\n%%%%EOF\n"); else fprintf(OutFile, "%%%%Trailer\n%%%%Pages: %d\n%%%%EOF\n", page); fflush(OutFile); fclose(OutFile); return(0); } /* Function :BintoHex Summary : given a BigLong, a length, and a string space, build a hexadecimal representation of the BigLong as many characters as length into the string, padded with spaces at the front. */ char *BintoHex(BigLong num, int len, char *home) { int region, part; int i; home[len] = '\0'; i = len - 1; for (region = 0; region < SUPERLONG; region++) { for (part = 0; part < LONGBITS/4; part++) { switch (num.sub[region]%16) { case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: case 8: case 9: home[i] = '0' + num.sub[region]%16; break; case 10: home[i] = 'A'; break; case 11: home[i] = 'B'; break; case 12: home[i] = 'C'; break; case 13: home[i] = 'D'; break; case 14: home[i] = 'E'; break; case 15: home[i] = 'F'; break; default: /* this shouldn't happen !! */ home[i] = '*'; break; } num.sub[region] = num.sub[region]/16; i--; if ( i < 0) break; } if (i < 0) break; } /* Tidy up the font so they look more like numbers */ i = 0; while ( home[i] == '0') { home[i] = ' '; i++; } return(home); } /* Function :PostScriptBlurb Summary :Writes all the header stuff that has to go first in the PostScript file, particularly the custom font, "Ingrid-Blocks" which draws squares and circles and other partimatrix symbols. The font follows the example font on page 282-3 of The PostScript Reference Manual (2nd ed) closely. There are lots of Copyright and Trademark restrictions on the use of PostScript but as far as I can tell what I'm doing here is totally legit usage in Adobe's eyes. */ void PostScriptBlurb(FILE *Out, int onepage, int le, int bo, int wi, int hi) { time_t now; char username[120]; int i = 0; char encap[] = "EPSF-3.0"; if (!onepage) strcpy(encap, ""); printf("Please enter your name > "); fgets(username, 120, stdin); while (username[i] != '\n') i++; username[i] = '\0'; time(&now); fprintf(Out, "%%!PS-Adobe-3.0 %s\n", encap); fprintf(Out, "%%%%BoundingBox: %d %d %d %d \n", le, bo, wi, hi); fprintf(Out, "%%%%Title: PartiMatrix \n"); fprintf(Out, "%%%%Creator: partimatrix.c \n"); fprintf(Out, "%%%%For: %s \n", username); fprintf(Out, "%%%%CreationDate: %.24s\n", ctime(&now)); if (onepage) fprintf(Out, "%%%%Pages: 1 \n"); else fprintf(Out, "%%%%Pages (atend) \n"); fprintf(Out, "%%%%DocumentNeededResources: font Courier\n"); fprintf(Out, "%%%%DocumentSuppliedResources: font Ingrid-Blocks \n"); fprintf(Out, "%%%%+ procset Parti-Abbrev 0 0\n"); fprintf(Out, "%%%%EndComments \n\n"); fprintf(Out, "%%%%BeginProlog \n"); fprintf(Out, "%%%%IncludeResource: font Courier\n"); fprintf(Out, "%%%%BeginResource: font Ingrid-Blocks \n"); fprintf(Out, "12 dict begin \n"); fprintf(Out, "/FontType 3 def \n"); fprintf(Out, "/FontMatrix [.001 0 0 .001 0 0] def \n"); fprintf(Out, "/FontBBox [0 0 1000 1000] def \n"); fprintf(Out, "/Encoding 256 array def \n"); fprintf(Out, "0 1 255 {Encoding exch /.notdef put} for \n"); fprintf(Out, "Encoding 109 /black put %% m = 109 \n"); fprintf(Out, "Encoding 99 /white put %% c = 99 \n"); fprintf(Out, "Encoding 105 /dot put %% i = 105 \n"); fprintf(Out, "Encoding 106 /diag put %% j = 106 \n"); fprintf(Out, "Encoding 65 /circle put %% A = 65 \n"); fprintf(Out, "Encoding 66 /ring put %% B = 66 \n"); fprintf(Out, "/CharProcs 7 dict def \n"); fprintf(Out, "CharProcs begin \n"); fprintf(Out, " /.notdef { } def \n"); fprintf(Out, " /black \n"); fprintf(Out, " {0 0 moveto 1000 0 lineto 1000 1000 lineto \n"); fprintf(Out, " 0 1000 lineto closepath fill} bind def \n"); fprintf(Out, " /white \n"); fprintf(Out, " {0 0 moveto 1000 0 lineto 1000 1000 lineto\n"); fprintf(Out, " 0 1000 lineto 0 0 lineto 50 50 lineto 50 950 lineto\n"); fprintf(Out, " 950 950 lineto 950 50 lineto 50 50 lineto\n"); fprintf(Out, " closepath fill} bind def\n"); fprintf(Out, " /dot \n"); fprintf(Out, " {0 0 moveto 1000 0 lineto 1000 1000 lineto\n"); fprintf(Out, " 0 1000 lineto 0 0 lineto 50 50 lineto 50 950 lineto\n"); fprintf(Out, " 950 950 lineto 950 50 lineto 50 50 lineto\n"); fprintf(Out, " closepath fill \n"); fprintf(Out, " 500 500 110 0 360 arc fill} bind def\n"); fprintf(Out, " /diag \n"); fprintf(Out, " {0 0 moveto 1000 0 lineto 1000 1000 lineto \n"); fprintf(Out, " 0 1000 lineto 0 0 lineto 50 50 lineto 50 950 lineto\n"); fprintf(Out, " 950 950 lineto 950 50 lineto 50 50 lineto \n"); fprintf(Out, " closepath fill \n"); fprintf(Out, " 50 0 moveto 1000 950 lineto 950 1000 lineto \n"); fprintf(Out, " 0 50 lineto closepath fill} bind def\n"); fprintf(Out, " /circle \n"); fprintf(Out, " {500 500 450 0 360 arc fill} bind def\n"); fprintf(Out, " /ring \n"); fprintf(Out, " {500 500 450 0 360 arc 500 500 350 360 0 arcn\n"); fprintf(Out, " closepath fill} bind def\n"); fprintf(Out, "end %% of CharProcs \n"); fprintf(Out, "/BuildGlyph { \n 1000 0\n 0 0 1000 1000\n"); fprintf(Out, " setcachedevice \n exch /CharProcs get exch \n"); fprintf(Out, " 2 copy known not {pop /.notdef} if \n"); fprintf(Out, " get exec \n} bind def\n"); fprintf(Out, "/BuildChar { \n 1 index /Encoding get exch get\n"); fprintf(Out, " 1 index /BuildGlyph get exec \n} bind def\n"); fprintf(Out, "currentdict \nend %%of font dictionary\n"); fprintf(Out, "/Ingrid-Blocks exch definefont pop \n"); fprintf(Out, "%%%%EndResource\n\n"); fprintf(Out, "%%%%BeginResource: procset Parti-Abbrev 0 0 \n"); fprintf(Out, "/ms {moveto show} bind def\n"); fprintf(Out, "/blackcol %%takes x y width height\n"); fprintf(Out, " {gsave 4 2 roll newpath moveto \n"); fprintf(Out, " dup 0 exch rlineto exch \n"); fprintf(Out, " 0 rlineto 0 exch neg rlineto \n"); fprintf(Out, " closepath fill grestore \n"); fprintf(Out, " } bind def \n"); fprintf(Out, "/whitecol %%takes x y width height\n"); fprintf(Out, " {gsave 4 2 roll newpath moveto \n"); fprintf(Out, " dup 0 exch rlineto exch \n"); fprintf(Out, " 0 rlineto 0 exch neg rlineto \n"); fprintf(Out, " closepath stroke grestore \n"); fprintf(Out, " } bind def \n"); fprintf(Out, "%%%%EndResource\n"); fprintf(Out, "%%%%EndProlog\n"); return; } /* * Function: MyMalloc * Summary : malloc with error checking like everyone else has to have. * If the system can't allocate memory of the appropriate size, * it breaks out of the program altogether. */ void *MyMalloc(size_t size) { void *temp; temp = malloc(size); if (temp == 0) { fprintf(stderr, "\n Unable to allocate %d bytes from the system \n", size); fprintf(stderr, "Try running the program with a smaller dataset\n"); exit(1); } return(temp); } /* MyMalloc */ /* * Function: MyRealloc * Summary : You mean you can't figure this out, if you're still * reading this far??? (Hint: see MyMalloc) */ void *MyRealloc(void *old, size_t size) { void *temp; temp = realloc(old, size); if (temp == 0) { fprintf(stderr, "\n Unable to reallocate %d bytes from the system \n", size); fprintf(stderr, "Try running the program with a smaller dataset\n"); exit(1); } return(temp); } /* MyRealloc */