Introduction

This is a re-run of the analysis of the previous fungi data with improved algorithms. The overall steps remained the same, except for the following additions.



The following is an image of the new interpolation results after region refinements. Note the introduction of a new mega region, region 7.



A snapshot of the full scale collage, constructed from the MDS plots of each region, is given below.



Summary of Differences

The following table summarizes the change in the number of points in each mega region. More details on this are available here.

Region  In Old and New  In Old NOT New  In New NOT Old  Old Total  New Total
0       109158          2071            16640           111229     125798
1       40634           9601            6405            50235      47039
2       112163          2460            12046           114623     124209
3       34480           7694            4732            42174      39212
4       34315           39570           446             73885      34761
5       36399           2044            5471            38443      41870
6       12967           2485            9745            15452      22712
7       0               0               10440           0          10440
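
The columns of this table are plain set arithmetic on the sequence identifiers assigned to a region in the old and new runs. The following is a minimal sketch, assuming each run provides a collection of identifiers per region; the function name `region_diff` and the toy identifiers are hypothetical and are not taken from the project's tools.

```python
def region_diff(old_ids, new_ids):
    """Count the overlap and differences between two runs of one region."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "in_old_and_new": len(old_ids & new_ids),
        "in_old_not_new": len(old_ids - new_ids),
        "in_new_not_old": len(new_ids - old_ids),
        "old_total": len(old_ids),
        "new_total": len(new_ids),
    }


# Toy identifiers (hypothetical), standing in for real sequence names.
old_region = {"s1", "s2", "s3"}
new_region = {"s2", "s3", "s4", "s5"}
row = region_diff(old_region, new_region)
print(row)
```

By construction, Old Total = In Old and New + In Old NOT New, and New Total = In Old and New + In New NOT Old. Every row of the table satisfies this; for region 0, 109158 + 2071 = 111229 and 109158 + 16640 = 125798.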

Data Set

The initial data set was received from Dr. Haixu Tang at Indiana University. The details are as follows.


Process

  1. Pick a random sample of 100K sequences from the unique sequences with lengths greater than 200
    -- allreads_uniques_gt200_440641_random_100k_0.txt
  2. The remaining sequences are called out-sample sequences
  3. Run pairwise local sequence alignment (Smith-Waterman) on 1.
  4. Run DA(Deterministic Annealing)-SMACOF on 3.
  5. Run Deterministic Annealing pairwise clustering on 3.
  6. Produce plot from 4. and 5.
  7. Refine 6. for spatially compact Mega-regions
  8. Assign out-sample sequences in 2. to Mega-regions of 7. based on a nearest neighbour approach
  9. Extract sequences for each such Mega-region from full unique sequence set
  10. For each Mega-region sequence set
    1. Run pairwise local sequence alignment (Smith-Waterman)
    2. Run DA(Deterministic Annealing)-SMACOF on 10.1
    3. Run MDSasChisq in SMACOF mode on 10.1 with sample points of the region fixed to locations from 4
    4. Run Deterministic Annealing pairwise clustering on 10.1
    5. Produce plot from 10.2 and 10.4
    6. While refinements necessary for 10.5
      1. Extract distances for necessary sub clusters
      2. Run Deterministic Annealing pairwise clustering on 10.6.1
      3. Merge results of 10.6.2 with 10.4
      4. Produce plot from 10.2 and 10.6.3
      5. Go to 10.6
    7. Produce final region specific plot from 10.2 and 10.6
    8. Produce final region specific fixed run plot from 10.3 and 10.6
    9. Classify clusters into three groups (see cluster status) for clusters in 10.7
    10. Find cluster centers for each clean cluster in MDSs of 10.7 and 10.8
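
Step 8 above can be illustrated with a small sketch. The process only states that a nearest neighbour approach was used, so everything else here is an assumption: the function names, the data layout, and in particular `toy_distance`, which is a simple mismatch count standing in for whatever alignment-based distance the real pipeline used.

```python
def toy_distance(a, b):
    """Hypothetical stand-in distance: position-wise mismatches plus length gap."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def assign_regions(out_sample, sample_regions, distance=toy_distance):
    """Give each out-sample sequence the region of its closest sample sequence.

    out_sample:     dict mapping a sequence name to its sequence
    sample_regions: dict mapping a sample sequence to its mega-region number
    """
    assignments = {}
    for name, seq in out_sample.items():
        nearest = min(sample_regions, key=lambda s: distance(seq, s))
        assignments[name] = sample_regions[nearest]
    return assignments


# Toy data (hypothetical): two sample sequences in regions 0 and 1.
sample_regions = {"ACGTACGT": 0, "TTTTGGGG": 1}
out_sample = {"q1": "ACGTACGA", "q2": "TTTTGGGC"}
regions = assign_regions(out_sample, sample_regions)
print(regions)  # {'q1': 0, 'q2': 1}
```

This brute-force version compares every out-sample sequence against every sample sequence, which is only practical for small inputs.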

Fixed MDS Plots

Each mega region has a subset of sequences coming from the sample set (see process for details). We ran MDS for each region while fixing these sample sequences to a set of known positions. These known positions were determined by running MDS on sample sequences only.
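
The idea of fixing some points during MDS can be sketched as follows. This is not the MDSasChisq or SMACOF implementation used in the project; plain gradient descent on raw stress is swapped in to keep the example short, and the function name `fixed_mds`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def fixed_mds(target, fixed, dim=2, steps=500, rate=0.05, seed=0):
    """Embed points so pairwise distances approach `target`.

    target: square matrix of desired pairwise distances
    fixed:  dict mapping a point index to coordinates that must not move
    """
    rng = random.Random(seed)
    n = len(target)
    points = [
        list(fixed[i]) if i in fixed else [rng.uniform(-1, 1) for _ in range(dim)]
        for i in range(n)
    ]
    for _ in range(steps):
        for i in range(n):
            if i in fixed:
                continue  # fixed points keep their known positions
            grad = [0.0] * dim
            for j in range(n):
                if j == i:
                    continue
                d = math.dist(points[i], points[j]) or 1e-12
                # Gradient of (d - target)^2 with respect to point i.
                coeff = 2.0 * (d - target[i][j]) / d
                for k in range(dim):
                    grad[k] += coeff * (points[i][k] - points[j][k])
            for k in range(dim):
                points[i][k] -= rate * grad[k]
    return points


# Toy check: three corners of a unit square are fixed, the fourth is free.
true_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = [[math.dist(a, b) for b in true_xy] for a in true_xy]
fixed = {0: true_xy[0], 1: true_xy[1], 2: true_xy[2]}
layout = fixed_mds(target, fixed)
print(layout[3])  # should end up close to (1, 1)
```

Because the fixed points never move, every region embedded this way shares the coordinate frame defined by the sample-only MDS run.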


Full Scale Collage

Regions are a device for breaking down what would otherwise be a very large computation, and they do not necessarily correspond to a biological categorization. We therefore wanted a mechanism to produce a single plot containing points from all regions, so that biologists can get a better view of the sequences as a whole. The solution was to use the fixed points MDS plots described above.


Cluster Centers

Once we were satisfied with the clustering results, we found a sequence to represent each cluster, which we denote as its cluster center. Three methods were used to find centers, resulting in three candidate center sequences per cluster. Each of these candidates was then evaluated based on its position in the cluster (in the fixed points plots), keeping only the sequence that appeared to best represent the cluster. If more than one best center was found, we picked the one with the longest sequence length. See the Refined Cluster Centers page for more information.
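
The final selection rule, including the tie-break on sequence length, can be sketched as below. The numeric `score` is an assumption made for illustration (lower meaning a better position in the cluster); the text does not say how position was quantified, and the names `Candidate` and `pick_center` are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    method: str      # which of the three methods proposed this center
    sequence: str    # the candidate center sequence
    score: float     # hypothetical position score, lower is better


def pick_center(candidates):
    """Keep the best-scoring candidate; break ties by longest sequence."""
    best = min(c.score for c in candidates)
    tied = [c for c in candidates if c.score == best]
    return max(tied, key=lambda c: len(c.sequence))


# Toy candidates (hypothetical): two tie on score, the longer one wins.
candidates = [
    Candidate("method_a", "ACGTACGT", 0.2),
    Candidate("method_b", "ACGTACGTACGT", 0.2),
    Candidate("method_c", "ACGTACGTACGTACGT", 0.7),
]
center = pick_center(candidates)
print(center.method)  # method_b
```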

A description of the three methods is as follows.


Dependence of Results on Sequence Length

An extensive study of length dependency was done with the previous run of this data. We did not repeat it in this run because the set of sequences is the same and we expected similar results. Details of the previous analysis are available here.


Cluster Status

We classified clusters into three types:

Cluster Status  Meaning
1               Good Clean Cluster
2               Clean Cluster but some refinement could be useful
3               Debris

Note that the clustering program puts every sequence into one and only one cluster, so sequences scattered in the background are also assigned to clusters. Category 3 therefore labels "clusters" that are really scattered sequences filling the void between clusters.

Note that some debris sequences are tails of "real" clusters.
Category 2 corresponds to cases where further refinement could be useful, especially with increased statistics.




    Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research.


    This is a SALSAHPC project