Introduction

The goal of million sequence clustering was to find clusters for large sequence sets with around million sequences. We classified two data sets in this regard of which the details are given below.


Process

The outline of the clustering process for a given sequence set is as follows (for more details see 16S rRNA Process or fungi process)

  1. Pick a sub set of sequences - called sample sequence set. We used 100K sequences in sample
  2. Perform pairwise alignment on 1. and produce a distance matrix
  3. Perform pairwise clustering and multidimensional scaling on 2. to produce a 3 dimensional view of the sample sequences with different coloring on clusters found from pairwise clustering program
  4. Refine the clusters found in 2. to form spatially compact regions - called Mega-regions aiming to get at most ~100K sequences in each Mega-region as processing time for larger samples was too great.
  5. Assign each remaining (the ones not in the sample sequence set) sequence approximately to a Mega-region based on the pairwise distance between the sequence and the sequence representing the centers of nodes of decomposed trees (see figure 1 below) in a Mega-region. This is extremely fast as can use hierarchical algorithm in 3D mapped space. The result is illustrated in figure 2 below.
  6. For each Mega-region of 5.
    1. Run pairwise alignment and produce distance matrix
    2. Run multidimensional scaling on 6.1
    3. Run pairwise clustering on 6.1
    4. Produce 3 dimensional view based on 6.2 and 6.3
    5. If refinements are necessary for clustering repeat from step 6.3 for such clusters The visual depiction of clusters (mapped to 3D) makes it much easier to decide on reliability of results and decide when refinement is warranted). The final result of this is illustrated in figure 3 below
  7. Use Interpolative Joining algorithm to generate Spherical Phylogram as a phylogenetic tree in 3D dimension.

Note each step was run in parallel on a number of cores that depended on problem size but was up to 1000 cores in number.


figure 1

Figure 1. An Oct-tree built on top of 3D projection of initial sample of 100K sequences. The terminal nodes of Oct-tree define the points to be used in interpolation


figure 2

Figure 2. Result of interpolating full sample to the Oct-tree centers illustrated in figure 1


figure 3

Figure 3: The result of a full clustering and 3D projection for one of the Mega-regions defined with a complete set of sequences in analysis shown in figure 2.

figure 4

Figure 4: The result of a Spherical Phylogram generated by using a phylogenetic tree from RAxML. The 3D projecttion used SWG to calculate the pairwise distances. The phylogenetic tree is shown in 2D at here.


Dependence of results on sequence length

We also studied the effect of sequence length in final clustering results as presented under following links


Sequence alignment

We used the local alignment algorithm, Smith-Waterman to align sequences. We also tried using global alignment with Needleman-Wunsch, but discontinued due to the reasons discussed here.


3D Visualizations

The pictures are screen snapshots of data displayed in a 3D visualizer. You can see the full 3D versions by downloading PlotViz found at download page and using it to open .pviz files labeled as "plot" of download sections.


References

  • Yang Ruan, Geoffrey House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Proceedings of C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, May 26-29, 2014.
  • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of IEEE eScience 2013, Beijing, China, Oct. 22-Oct. 25, 2013. (Best Student Innovation Award)
  • Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. Proceedings of ACM-BCB 2012, Orlando, Florida, ACM, Oct. 7-Oct. 10, 2012.
  • Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox. HyMR: a Hybrid MapReduce Workflow System. Proceedings of ECMLS’12 of ACM HPDC 2012, Delft, Netherlands, ACM, Jun. 18-Jun. 22, 2012
  • Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox. Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets, BMC Bioinformatics 2012, 13(Suppl 2):S9.

  • Technologies Used


    Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research


    This is a SALSAHPC project