## Introduction

The goal of this project is to view a phylogenetic tree in 3D space together with the clustering result, so that domain experts can check the clustering result against the phylogeny and identify interesting features of the OTUs.

The project has two runs. The first uses the previous Fungi project; information about this run can be found here.

The second is an additional run on the previous fungi2 data with improved algorithms. Information about this run can be found here.

The data were assembled from the following parts:

1) A sequence alignment of AM fungal sequences downloaded from a recent large-scale phylogeny of AM fungi, retaining only sequences that contained at least a portion of the 28S rRNA gene;

2) Sequences from GenBank that had confident species attribution, added to supplement the species coverage within the sequence dataset;

3) Representative sequences for known AM fungal species obtained from spores using 454 sequencing (Roche, Indianapolis, IN) of the variable and phylogenetically informative D2 domain of the 28S rRNA gene. The representative sequences were found using the following methods:

- Centers for each cluster (read more)

## Sequence Alignment

To evaluate how different sequence lengths affected the correspondence between phylogenetic trees and clustering, we created two datasets with sequences that shared the same starting location on the 28S rRNA gene: one dataset contained longer sequences, and the other contained shorter sequences.

We first trimmed the multiple sequence alignment (MSA) and retained only the unique sequences that spanned an extended region beyond the D2 domain (dataset 1, roughly 675 bases long without gaps). From that subset, we then retained only the unique sequences that spanned the 454 sequencing start site and the average end position of the 454 sequences (roughly 425 bases long without gaps).

Finally, we added the representative 454 sequences to this trimmed alignment using MAFFT to create dataset 2. This gave the following MSAs:

1) Dataset 1 (999nts): 801 sequences from here and 505 sequences from GenBank, for a total of 1306 sequences.

2) Dataset 2 (599nts with 454 optimized): 514 sequences from here, 380 sequences from GenBank, and 126 representative 454 sequences, for a total of 1020 sequences.

For this phylogenetic comparison test we selected a smaller set of sequences that still represents the expected range of genetic variability within AM fungi.

## Phylogenetic Tree

We created a maximum likelihood unrooted phylogenetic tree from the multiple sequence alignment (MSA) with RAxML (Stamatakis 2006), using 100 iterations with the general time reversible (GTR) nucleotide substitution model and gamma rate heterogeneity (GTRGAMMA). The 2D phylogram, displayed using FigTree, is shown below:

Figure 1: Maximum likelihood phylogenetic tree from dataset 2 that is collapsed into clades at the genus level as denoted by colored triangles at the end of the branches. Branch lengths denote levels of sequence divergence between genera and nodes are labeled with bootstrap confidence values. 454 sequences from spores that are not part of another clade are denoted with the label ‘454 sequence from spore’. Two sequences in the Claroideoglomus clade are instead attributed to Rhizophagus, and one sequence in the Funneliformis clade is instead attributed to Septoglomus (denoted by arrows at the blunt end of the colored triangles).

## Dimension Reduction

In general, we used both pairwise sequence alignment and multiple sequence alignment to measure distances between sequences, then used multidimensional scaling (MDS) to reduce the dimension and visualize the tree in 3D space. Different distance measurements and MDS algorithms, however, yield different results.

**Distance Measurement**

We used MSA, Smith-Waterman, and Needleman-Wunsch as the sequence alignment methods. Distances were calculated using percentage identity (PID).
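As an illustration of a PID-based distance, the following minimal sketch computes `1 - PID` for two already-aligned sequences. The function name `pid_distance` is hypothetical, and the choice of denominator (all aligned columns except those that are gaps in both sequences) is an assumption; several PID definitions are in use, and the source does not say which one was applied.

```python
def pid_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Return 1 - PID for two aligned sequences of equal length.

    Assumption: PID = identical columns / compared columns, where columns
    that are gaps in BOTH sequences are skipped. Other definitions exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - matches / compared


# Four identical columns out of five compared columns.
example = pid_distance("ACGT-A", "ACGA-A")
```

Applying such a function to every pair of sequences would give the symmetric distance matrix that the dimension reduction step takes as input.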

We used the Mantel test in order to evaluate whether pairs of experimental treatments retained the same structure of sequence differences between them.

Comparisons were then made to the RAxML distance matrix from the same dataset. The Mantel tests were performed using the vegan package in R (version 3.0.2, R Core Team 2013). None of the tests had p-values greater than 0.001, suggesting that all of the measured correlations were likely significant despite the increased type I error (false-positive) rate that can occur with Mantel tests.
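To make the idea of the Mantel test concrete, here is a minimal Python sketch of a permutation-based version. It is not the vegan implementation used for the results above; the function name `mantel_test`, the Pearson correlation, the one-sided p-value, and the permutation count are all assumptions chosen for illustration.

```python
import numpy as np


def mantel_test(d1, d2, permutations=999, seed=0):
    """Correlate two square distance matrices and estimate a p-value.

    Assumptions: Pearson correlation on the upper triangles, and a
    one-sided p-value from jointly permuting rows and columns of d2.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    if d1.shape != d2.shape or d1.shape[0] != d1.shape[1]:
        raise ValueError("inputs must be square matrices of the same size")
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)  # each pair counted once
    x = d1[upper]
    r_obs = np.corrcoef(x, d2[upper])[0, 1]

    rng = np.random.default_rng(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        d2_perm = d2[np.ix_(perm, perm)]  # relabel the sequences in d2
        if np.corrcoef(x, d2_perm[upper])[0, 1] >= r_obs:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return r_obs, p_value
```

With 999 permutations the smallest attainable p-value is 0.001, which is why permutation p-values are usually reported as bounds rather than exact values.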

Figure 2: Mantel comparison between the distances generated by three sequence alignment methods (MSA, SWG, NW) and the distances from RAxML. A higher correlation indicates a better result. SWG and NW give results very similar to MSA.

**MDS algorithm**

We used WDA-SMACOF for the dimension reduction in our final result because it can reliably find the global optimum of the STRESS value. We use the sum of branch lengths (edge sum) to measure the accuracy of the dimension reduction method used for spherical phylogram generation; a lower sum of branch lengths means a better result.
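For readers unfamiliar with STRESS minimization, the sketch below shows plain, unweighted SMACOF with a random start. It is deliberately not WDA-SMACOF (no weighting and no deterministic annealing), so it may stop in a local optimum; the function names, iteration count, and seed are assumptions for illustration only.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix for the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(x, delta):
    """Raw STRESS: squared error between embedded and target distances."""
    d = pairwise_distances(x)
    upper = np.triu_indices(len(x), k=1)
    return float(((d[upper] - delta[upper]) ** 2).sum())


def smacof(delta, dim=3, iterations=300, seed=0):
    """Unweighted SMACOF (assumption: unit weights, random initial layout)."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    x = np.random.default_rng(seed).standard_normal((n, dim))
    for _ in range(iterations):
        d = pairwise_distances(x)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(d > 0, delta / d, 0.0)
        b = -ratio                               # off-diagonal entries
        b[np.diag_indices(n)] = ratio.sum(axis=1)  # diagonal = row sums
        x = b @ x / n                            # Guttman transform
    return x
```

Each Guttman transform cannot increase the STRESS value, so running more iterations from the same start never gives a worse layout, although only annealing-style variants aim at the global optimum.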

Figure 3: Comparison of the sum of branch lengths for three MDS methods (WDA-SMACOF, Levenberg–Marquardt (LMA), EM-SMACOF), using distance input generated from three types of sequence alignment on the 599nts with 454 optimized dataset. WDA-SMACOF always has the lowest sum of branch lengths, indicating that it generates the best spherical phylogram.

## Interpolative Joining

This is the algorithm we use to generate the spherical phylogram in 3D from the dimension reduction result and the phylogenetic tree from RAxML.

- Input data are from the dataset here, including the dimension reduction result and phylogenetic tree.
- For each pair of siblings in the phylogenetic tree:
  - Find their parent
  - Compute the distance from the parent to all the other sequences
  - Run interpolation to find the parent's coordinates in 3D space
  - Connect the parent to the two siblings with edges
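The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the parent's distances to the other sequences are taken as a caller-supplied vector (the source does not say how they are computed), the interpolation is a simple unit-weight iterative majorization for a single point, and all function names are hypothetical.

```python
import numpy as np


def interpolate_point(fixed, target, iterations=200):
    """Place one new point so its distances to `fixed` approximate `target`.

    Assumption: unit-weight iterative majorization with the existing
    points held fixed; the start is nudged off the centroid.
    """
    fixed = np.asarray(fixed, dtype=float)
    target = np.asarray(target, dtype=float)
    k = len(fixed)
    centroid = fixed.mean(axis=0)
    x = centroid + 1e-3
    for _ in range(iterations):
        diff = x - fixed
        dist = np.linalg.norm(diff, axis=1)
        safe = np.where(dist > 0, dist, 1.0)  # avoid division by zero
        ratio = np.where(dist > 0, target / safe, 0.0)
        x = centroid + (ratio[:, None] * diff).sum(axis=0) / k
    return x


def join_siblings(coords, a, b, parent_distances):
    """Return the parent's 3D position and the edges to siblings a and b.

    `parent_distances[i]` is the assumed distance from the parent to
    sequence i; how it is derived is outside this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    parent = interpolate_point(coords, parent_distances)
    edges = [(parent, coords[a]), (parent, coords[b])]
    return parent, edges


def edge_sum(edges):
    """Sum of Euclidean edge lengths (the 'sum of branch lengths')."""
    return float(sum(np.linalg.norm(p - q) for p, q in edges))
```

In a full pipeline the newly placed parent would be added to the set of points so that it can in turn be paired with its own sibling, repeating until the tree is fully connected.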

The detailed algorithms can be found here.

## References

## Technologies Used

- Twister
- MPI.NET
- Dimension Reduction with Deterministic Annealing SMACOF
- Dimension Reduction by Interpolation
- Clustering by Deterministic Annealing Pairwise Techniques
- Smith Waterman Gotoh Distance Computation
- .NET Bio (formerly Microsoft Biology Foundation)

Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research.

This is a SALSAHPC project