In this project we found clusters for a 16S rRNA data set with nearly a million sequences.


Data Set

The initial data set was received from Dr. Mina Rho in Indiana University. The details are as follows.


  1. Pick 100K random sample
    -- 100k_sample_fasta.txt
  2. The remaining is called out-sample sequences
    -- 584769_outSample_fasta.txt
  3. Run pairwise local sequence alignment (Smith-Waterman) on 1.
  4. Run DA(Deterministic Annealing)-SMACOF on 3.
  5. Run Deterministic Annealing pairwise clustering on 3.
  6. Produce plot from 4. and 5.
  7. Refine 6. for spatially compact Mega-regions
  8. Assign out-sample sequences in 2. to Mega-regions of 7. based on a nearest neighbour approach
  9. Extract sequences for each such Mega-region from full unique sequence set
  10. For each Mega-region sequence set
    1. Run pairwise local sequence alignment (Smith-Waterman)
    2. Run DA(Deterministic Annealing)-SMACOF on 10.1
    3. Run Deterministic Annealing pairwise clustering on 10.1
    4. Produce plot from 10.2 and 10.3
    5. While refinements necessary for 10.4
      1. Extract distances for necessary sub clusters
      2. Run Deterministic Annealing pairwise clustering on 10.5.1
      3. Merge results of 10.5.2 with 10.3
      4. Produce plot from 10.2 and 10.5.3
      5. Go to 10.5

Dependence of results on sequence length

The (cluster) structure of data is dependent in detail on length of sequences. This is documented in a set of plots for each Mega-region that divides set of sequences into ten groups based on length. We show typical visualizations of full color coded sample and that with only top 2 or 3 samples displayed. The visualizations in "lengthwise" galaxy fall into pairs -- identical 2D projections for all lengths and their restriction to top 2 or 3 sequences.
The distance ranges were:

Cluster Status

We classified clusters into 6 different types

Cluster Status Meaning
1 Good Clean Cluster
2 Clean Cluster but some refinement could be useful
3 Debris
4 Debris: More Data might show a cluster
5 Part of a Complicated Structure
6 Part of a Complicated Structure but some refinement could be useful

Note "clustering program" puts all sequences into 1 and only 1 cluster. So sequences scattered in background are put in clusters. Thus categories 3 and 4 label "clusters" which are really scattered sequences filling void between clusters.

Note some debris sequences are tails to "real" clusters.
Category 2 and 6 correspond to cases where further refinement could be useful -- especially with increased statistics.
Category 5 and 6 correspond to large structures -- significantly bigger than clusters in categories 1 and 2. The mapping to three dimensions typically shows structure in structures of category 5 and 6 but naively this structure does not look like lots of small families.

Use of Needleman Wunsch Alignment/Dissimilarity Computation

Originally we started project using Needleman Wunsch alignment between sequences but we switched to Smith Waterman Gotoh for reasons discussed here. However Mega-region 2 was so clear (see Figure 1 region 2 below) that we did use Needleman Wunsch results to identify it and get project started without delay from redoing alignment. This leads to very minor glitches in sequence assignment which is based on Smith Waterman Gotoh for other mega-regions.


  • Yang Ruan, Geoffrey House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Proceedings of C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, May 26-29, 2014.
  • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of IEEE eScience 2013, Beijing, China, Oct. 22-Oct. 25, 2013. (Best Student Innovation Award)
  • Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. Proceedings of ACM-BCB 2012, Orlando, Florida, ACM, Oct. 7-Oct. 10, 2012.
  • Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox. HyMR: a Hybrid MapReduce Workflow System. Proceedings of ECMLS’12 of ACM HPDC 2012, Delft, Netherlands, ACM, Jun. 18-Jun. 22, 2012
  • Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox. Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets, BMC Bioinformatics 2012, 13(Suppl 2):S9.

  • Technologies Used

    Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research

    This is a SALSAHPC project