Introduction

The goal of this project is to view a phylogenetic tree in 3D space alongside the clustering result, so that domain experts can verify the clustering against the phylogeny and identify interesting features of the OTUs.

This project comprises two runs. The first used data from the previous Fungi project; information about this run can be found here.

The second is an additional run on the previous fungi2 data with improved algorithms. Information about this run can be found here.

The data were assembled from the following parts:

1) A sequence alignment of AM fungal sequences downloaded from a recent large-scale phylogeny of AM fungi, from which we retained only sequences that contained at least a portion of the 28S rRNA gene;

2) Sequences from GenBank that had confident species attribution in order to supplement the species coverage within the sequence dataset;

3) Representative sequences for known AM fungal species obtained from spores using 454 sequencing (Roche, Indianapolis, IN) of the variable and phylogenetically informative D2 domain of the 28S rRNA gene. The representative sequences were found using the following methods:


Sequence Alignment

In order to evaluate how different sequence lengths affected the correspondence between phylogenetic trees and clustering, we then created two datasets with sequences that shared the same starting location on the 28S rRNA gene: one dataset contained longer sequences, and the other contained shorter sequences.

We first trimmed the multiple sequence alignment (MSA) and retained only the unique sequences that spanned an extended region beyond the D2 domain (dataset 1, roughly 675 bases long without gaps). From that subset we then retained only the unique sequences that spanned the 454 sequencing start site and the average end position of the 454 sequences (roughly 425 bases long without gaps).

Finally, we added the representative 454 sequences to this trimmed alignment using MAFFT as described above to create dataset 2. The resulting MSAs were:

1) Dataset 1 (999nts): 801 sequences from here and 505 sequences from GenBank, for a total of 1306 sequences.

2) Dataset 2 (599nts with 454 optimized): 514 sequences from here, 380 sequences from GenBank, and 126 representative 454 sequences, for a total of 1020 sequences.

For this phylogenetic comparison test we selected a smaller set of sequences that still represents the expected range of genetic variability within AM fungi.


Phylogenetic Tree

We created an unrooted maximum likelihood phylogenetic tree from the multiple sequence alignment (MSA) with RAxML (Stamatakis 2006), using 100 iterations and the general time reversible (GTR) nucleotide substitution model with gamma rate heterogeneity (GTRGAMMA). The 2D phylogram rendered with FigTree is shown below:




Figure 1: Maximum likelihood phylogenetic tree from dataset 2 that is collapsed into clades at the genus level as denoted by colored triangles at the end of the branches. Branch lengths denote levels of sequence divergence between genera and nodes are labeled with bootstrap confidence values. 454 sequences from spores that are not part of another clade are denoted with the label ‘454 sequence from spore’. Two sequences in the Claroideoglomus clade are instead attributed to Rhizophagus, and one sequence in the Funneliformis clade is instead attributed to Septoglomus (denoted by arrows at the blunt end of the colored triangles).

Dimension Reduction

We used both pairwise sequence alignment and multiple sequence alignment to measure the distances between sequences, then used multidimensional scaling (MDS) to reduce the dimension and visualize the tree in 3D space. Different distance measures and MDS algorithms yield different results, so both choices are compared in the subsections that follow.

Distance Measurement

We used MSA, Smith-Waterman (SWG) and Needleman-Wunsch (NW) as sequence alignment methods. The distances were calculated using percentage identity (PID).
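To make the PID-based distance concrete, here is a minimal sketch for one pair of already aligned sequences. The page does not say which PID denominator was used, so the convention below (count every column where at least one sequence has a residue, and take the distance as 1 minus the matching fraction) is an assumption, and `pid_distance` is a hypothetical helper name.

```python
def pid_distance(seq_a, seq_b, gap="-"):
    """Distance between two aligned sequences as 1 - fraction of identical columns.

    Assumption: columns that are gaps in both sequences are ignored, and a
    residue aligned to a gap counts as a mismatch. Other PID conventions exist.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - matches / compared


# Example: three of the four compared columns match.
d = pid_distance("ACGT", "ACGA")
```

Applying such a function to every pair of sequences would give the symmetric distance matrix that the dimension reduction step takes as input.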

We used the Mantel test in order to evaluate whether pairs of experimental treatments retained the same structure of sequence differences between them.

Comparisons were then made to the RAxML distance matrix from the same dataset. The Mantel tests were performed using the vegan package in R (version 3.0.2, R Core Team 2013). None of the tests had p-values greater than 0.001, suggesting that all of the measured correlations were likely significant despite the increased type I error (false-positive) rate that can occur with Mantel tests.
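The tests themselves were run with the vegan package in R. Purely as an illustration of what a Mantel test computes, the sketch below correlates the upper triangles of two distance matrices and estimates a one-sided p-value by permuting the rows and columns of one matrix together. The function names, the Pearson statistic, and the default of 999 permutations are assumptions made for this example, not settings taken from the project.

```python
import math
import random


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, p) for two square symmetric distance matrices."""
    identity = list(range(len(dist_a)))
    flat_b = upper_triangle(dist_b, identity)
    r_obs = pearson(upper_triangle(dist_a, identity), flat_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the items of matrix A
        if pearson(upper_triangle(dist_a, order), flat_b) >= r_obs:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return r_obs, (hits + 1) / (permutations + 1)
```

With 999 permutations the smallest p-value this procedure can report is 1/1000 = 0.001.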


Figure 2: Mantel test comparison between the distances generated by three sequence alignment methods (MSA, SWG, NW) and the RAxML distances. A higher correlation means a better result. SWG and NW give results very similar to MSA.


MDS algorithm

We used WDA-SMACOF for the dimension reduction in our final result because it can reliably find global optima of the STRESS value. We use the sum of branch lengths (edge sum) of the resulting spherical phylogram to measure the accuracy of the dimension reduction method. A lower sum of branch lengths means a better result.
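The two quantities named above can be written down compactly. The sketch below assumes the usual weighted raw STRESS, the sum over pairs of w_ij (d_ij(X) - delta_ij)^2, and takes the edge sum as the total Euclidean length of the tree edges in the embedded space. It only evaluates a given embedding and does not implement WDA-SMACOF; the function and parameter names are hypothetical.

```python
import math


def stress(target, coords, weights=None):
    """Weighted raw STRESS of an embedding.

    target  -- square matrix of input distances (delta_ij)
    coords  -- list of points, one per sequence
    weights -- optional square matrix w_ij; all ones if omitted
    """
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weights is None else weights[i][j]
            total += w * (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
    return total


def edge_sum(coords, edges):
    """Sum of branch lengths: total Euclidean length of the listed edges.

    edges -- iterable of (u, v) index pairs into coords
    """
    return sum(math.dist(coords[u], coords[v]) for u, v in edges)
```

Under these definitions, an embedding that reproduces every input distance exactly has a STRESS of zero.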


Figure 3: Comparison of the sum of branch lengths for three MDS methods (WDA-SMACOF, Levenberg–Marquardt (LMA), EM-SMACOF), using distance input generated from three different types of sequence alignment on the 599nts with 454 optimized dataset. WDA-SMACOF has the lowest sum of branch lengths in every case, indicating that it generates the best spherical phylogram.


Interpolative Joining

This is the algorithm we use to generate the spherical phylogram in 3D from the dimension reduction result and the phylogenetic tree produced by RAxML.
  1. Input data are from the dataset here, including the dimension reduction result and phylogenetic tree.
  2. For each pair of siblings in the phylogenetic tree:
    1. Find their parent
    2. Compute the distances from the parent to all the other sequences
    3. Run interpolation to find the parent's coordinates in 3D space
    4. Connect the parent to the two siblings with edges

The detailed algorithms can be found here.
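As a rough illustration of the joining steps listed in this section, the sketch below places one parent node. Two parts are assumptions, since this page does not specify them: the parent-to-sequence distances use the neighbor-joining style rule d(parent, k) = (d(i, k) + d(j, k) - d(i, j)) / 2, and the interpolation is a plain single-point majorization update with unit weights, not the project's weighted interpolative MDS. All function names are hypothetical.

```python
import math


def parent_distances(dist, i, j):
    """Assumed rule: distance from the parent of i and j to every other node k."""
    return {k: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
            for k in range(len(dist)) if k not in (i, j)}


def interpolate(points, targets, n_iter=500):
    """Place one new point so that its distances to the fixed points
    approximate the targets (single-point majorization, unit weights)."""
    n, dim = len(points), len(points[0])
    x = [sum(p[a] for p in points) / n for a in range(dim)]  # start at centroid
    for _ in range(n_iter):
        new = [0.0] * dim
        for p, d in zip(points, targets):
            norm = max(math.dist(x, p), 1e-12)  # guard against division by zero
            for a in range(dim):
                new[a] += p[a] + d * (x[a] - p[a]) / norm
        x = [v / n for v in new]
    return x


def join_siblings(coords, dist, i, j):
    """Return the parent's coordinates and the two edges to its children."""
    d_parent = parent_distances(dist, i, j)
    others = sorted(d_parent)
    parent = interpolate([coords[k] for k in others],
                         [d_parent[k] for k in others])
    return parent, [(parent, coords[i]), (parent, coords[j])]
```

Repeating `join_siblings` bottom-up over the tree, and treating each placed parent as a new node with its own row of distances, would produce the internal nodes and edges of a 3D phylogram under these assumptions.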

References

  • Yang Ruan, Geoffrey House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Proceedings of C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, May 26-29, 2014.
  • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of IEEE eScience 2013, Beijing, China, Oct. 22-Oct. 25, 2013. (Best Student Innovation Award)
  • Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. Proceedings of ACM-BCB 2012, Orlando, Florida, ACM, Oct. 7-Oct. 10, 2012.
  • Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox. HyMR: a Hybrid MapReduce Workflow System. Proceedings of ECMLS’12 of ACM HPDC 2012, Delft, Netherlands, ACM, Jun. 18-Jun. 22, 2012.
  • Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox. Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets, BMC Bioinformatics 2012, 13(Suppl 2):S9.



    Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research.


    This is a SALSAHPC project