## Introduction

The goal of million sequence clustering was to find clusters for large sequence sets with around million sequences. We classified two data sets in this regard of which the details are given below.

- 16S rRNA Sequences

-- This set contained a total of 1160946 16S sRNA sequences where 684769 of them were unique - Fungi Sequences

-- This set contained a total of 957387 fungi sequences where 482158 of them were unique and we considered the ones with lengths greater than 200 resulting 446041 sequences. - Fungi Phylogenetic Tree

-- We proposed a displaying method called Spherical Phylogram to display the phylogenetic tree in 3D space. The sequences includes the representative sequences found by clustering fungi sequences, and sequences from GenBank.

## Process

The outline of the clustering process for a given sequence set is as follows (for more details see 16S rRNA Process or fungi process)

- Pick a sub set of sequences - called sample sequence set. We used 100K sequences in sample
- Perform pairwise alignment on 1. and produce a distance matrix
- Perform pairwise clustering and multidimensional scaling on 2. to produce a 3 dimensional view of the sample sequences with different coloring on clusters found from pairwise clustering program
- Refine the clusters found in 2. to form spatially compact regions - called Mega-regions aiming to get at most ~100K sequences in each Mega-region as processing time for larger samples was too great.
- Assign each remaining (the ones not in the sample sequence set) sequence approximately to a Mega-region based on the pairwise distance between the sequence and the sequence representing the centers of nodes of decomposed trees (see figure 1 below) in a Mega-region. This is extremely fast as can use hierarchical algorithm in 3D mapped space. The result is illustrated in figure 2 below.
- For each Mega-region of 5.
- Run pairwise alignment and produce distance matrix
- Run multidimensional scaling on 6.1
- Run pairwise clustering on 6.1
- Produce 3 dimensional view based on 6.2 and 6.3
- If refinements are necessary for clustering repeat from step 6.3 for such clusters The visual depiction of clusters (mapped to 3D) makes it much easier to decide on reliability of results and decide when refinement is warranted). The final result of this is illustrated in figure 3 below
- Use Interpolative Joining algorithm to generate Spherical Phylogram as a phylogenetic tree in 3D dimension.

Note each step was run in parallel on a number of cores that depended on problem size but was up to 1000 cores in number.

Figure 1. An Oct-tree built on top of 3D projection of initial sample of 100K sequences. The terminal nodes of Oct-tree define the points to be used in interpolation

Figure 2. Result of interpolating full sample to the Oct-tree centers illustrated in figure 1

Figure 3: The result of a full clustering and 3D projection for one of the Mega-regions defined with a complete set of sequences in analysis shown in figure 2.

Figure 4: The result of a Spherical Phylogram generated by using a phylogenetic tree from RAxML. The 3D projecttion used SWG to calculate the pairwise distances. The phylogenetic tree is shown in 2D at here.

## Dependence of results on sequence length

We also studied the effect of sequence length in final clustering results as presented under following links

- 16S rRNA length dependency explanation and gallery
- Fungi length dependency explanation and gallery

## Sequence alignment

We used the local alignment algorithm, Smith-Waterman to align sequences. We also tried using global alignment with Needleman-Wunsch, but discontinued due to the reasons discussed here.

## 3D Visualizations

The pictures are screen snapshots of data displayed in a 3D visualizer. You can see the full 3D versions by downloading PlotViz found at download page and using it to open .pviz files labeled as "plot" of download sections.

## References

## Technologies Used

- Twister
- MPI.NET
- Dimension Reduction with Deterministic Annealing SMACOF
- Dimension Reduction by Interpolation
- Clustering by Deterministic Annealing Pairwise Techniques
- Smith Waterman Gotoh Distance Computation
- .NET Bio (formerly Microsoft Biology Foundation)

Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research

This is a SALSAHPC project