## Using Needleman Wunsch Distances

The results presented for 16sRNA use Smith Waterman Gotoh alignment distances. The initial analysis used Needleman Wunsch (NW) global alignment, but this was found to give confusing results, as illustrated by the figure below, which projects the random 100K sample to 3D using DA-SMACOF. One does not see approximately globular clusters but rather long cigar-shaped structures. It was determined that this was a feature of NW and not of the analysis technology. A linear structure reflects a term in the distance between sequences that is one-dimensional in nature, and there is one obvious one-dimensional feature, namely sequence length. We did a further analysis of the data to show that the cigar shape comes from a term in the distance between sequences that is proportional to the difference in sequence length. Measured from a long sequence, shorter sequences lie at larger distances than longer ones.

Figure 1: DA-SMACOF of 100K samples of 16sRNA using Needleman Wunsch distances
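The length-difference term described above can be reproduced with a toy global alignment. The following is a minimal sketch, not the scoring used in this analysis: it assumes unit costs for mismatches and gaps (so the NW cost reduces to edit distance), and the `reference` string and the `nw_cost` helper are hypothetical names introduced only for illustration. Because a global alignment must account for every base of both sequences, the cost can never be smaller than the difference in length.

```python
def nw_cost(a, b, mismatch=1, gap=1):
    """Minimal global (Needleman Wunsch style) alignment cost.

    Assumed scoring: match = 0, mismatch = 1, gap = 1 per base (linear gaps).
    """
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Hypothetical toy sequence, not taken from the 16sRNA sample
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGC"

length_gaps = []
costs = []
for keep in (len(reference), 22, 17, 12):
    fragment = reference[:keep]  # identical bases, only shorter
    length_gaps.append(len(reference) - len(fragment))
    costs.append(nw_cost(reference, fragment))

for gap_len, cost in zip(length_gaps, costs):
    print(f"length difference {gap_len:2d} -> global alignment cost {cost}")
```

In this toy case the fragments contain no substitutions at all, yet the global cost grows in step with the length difference, which is the kind of one-dimensional term that would stretch a cluster into a line.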

We illustrate this with 8 sequences taken along the length of one of the cigars. We calculate effects relative to a reference sequence, SRR042354.5177 (#55679), that is long (507 base pairs). This and the 8 other points on the cigar are shown in the figure below, with the longest sequence, 55679, at the top.

Figure 2: 9 points along a Needleman Wunsch cigar

In Table 1 we show the NW and Euclidean (after mapping to 3D) distances from the 8 sequences to 55679. There is strong agreement, confirming the accuracy of the DA-SMACOF mapping.

Table 1: Comparison of NW (marked pd) and Euclidean (marked ed) distances for 8 sequences compared to 55679.
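A comparison of this kind can be sketched as follows. This is an illustrative outline only: the coordinates and distances are synthetic placeholders, not the values in Table 1, and the helper names (`euclidean`, `pearson`, `pd_dist`, `ed_dist`) are hypothetical. The idea is to compute the Euclidean distance of each mapped 3D point from the reference point and check how closely it tracks the original alignment distance.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Synthetic placeholder data (NOT the values from Table 1):
# eight points spaced along a straight line away from the reference point.
ref_point = (0.0, 0.0, 0.0)
mapped_points = [(0.1 * k, 0.02 * k, 0.0) for k in range(1, 9)]
pd_dist = [0.1 * k for k in range(1, 9)]  # stand-in for alignment distances

ed_dist = [euclidean(ref_point, p) for p in mapped_points]
r = pearson(pd_dist, ed_dist)
print(f"correlation between pd and ed: {r:.6f}")
```

A correlation close to 1 between the two columns would indicate that the 3D mapping preserves the distances to the reference sequence.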

In Table 2 we list features of the NW alignment, showing a steadily increasing number of mismatches as we move along the cigar.

Table 2: Features of NW alignment score for 8 sequences compared to 55679. The mismatches increase along the cigar and are negatively correlated with the length of the shorter sequence.

Comparing Tables 1 and 2 with Figure 2, we deduce that the cigar shape is an artifact of sequence length bias in the NW distance computation. This effect is much reduced with the Smith Waterman Gotoh distance computation, although some sequence length effects still appear to remain.
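Why a local alignment is less sensitive to length can be seen in a toy example. The sketch below is a simplified Smith Waterman scorer with linear gap penalties, not the Gotoh affine-gap variant used for the presented results, and the scoring values, the `reference` string, and the normalization `1 - score / (match * shorter length)` are all assumptions made for illustration rather than the distance definition used in this analysis.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith Waterman, linear gaps; assumed scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def local_distance(a, b, match=2):
    """Hypothetical normalized distance: 0 when the shorter sequence matches fully."""
    return 1.0 - sw_score(a, b, match=match) / (match * min(len(a), len(b)))


# Hypothetical toy sequence, not taken from the 16sRNA sample
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGC"

distances = []
for keep in (len(reference), 22, 17, 12):
    fragment = reference[:keep]  # identical bases, only shorter
    distances.append(local_distance(reference, fragment))
    print(f"fragment length {keep:2d} -> local distance {distances[-1]:.3f}")
```

Under these assumptions the unaligned tail of the longer sequence carries no penalty, so truncated copies of the reference stay at distance zero regardless of their length. Real data would still show residual length effects, for example through the normalization chosen.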