Sequence Clustering

Pipelining Applications to Classify Biological Sequences

Introduction

The problem at hand is to classify biological sequences, either DNA or protein, into clusters of similar sequences. Such a classification helps biologists analyze the characteristics of these sequences in their domain of work. Classification begins with the calculation of a value that indicates the similarity between each pair of sequences. This value is usually known as the distance between the two sequences. A matrix of pairwise distances is thus generated and fed to an implementation of a clustering algorithm. The clustering algorithm assigns a group number to each sequence such that sequences with similar distances to other sequences fall into the same group. In theory, these steps complete the classification. A visualization step, however, dramatically increases the usefulness of the results for analytical purposes. Therefore, a point visualization based on the same distance matrix is attached to the chain of algorithms. The goal of this algorithm is to find 3-dimensional coordinates for each sequence such that the distance between each pair of points is the same as, or close to, the distance between the corresponding two sequences in the distance matrix. A visualization tool is then able to produce a picture depicting the clustering of sequences.
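The chain described above can be illustrated with a minimal sketch. It is not the SALSA implementation: the mismatch-fraction distance, the threshold-based grouping, and the `stress` measure are simple stand-ins chosen for illustration, and all function names are hypothetical.

```python
import math
from itertools import combinations

def mismatch_distance(a, b):
    """Toy distance: fraction of differing positions (equal-length sequences only)."""
    if len(a) != len(b):
        raise ValueError("toy distance expects equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Symmetric matrix holding the distance for each pair of sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = mismatch_distance(seqs[i], seqs[j])
    return d

def threshold_clusters(d, cutoff):
    """Stand-in clustering: group sequences linked by distances below cutoff."""
    n = len(d)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and d[i][j] < cutoff:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

def stress(d, coords):
    """Squared mismatch between matrix distances and 3D point distances.

    A visualization step would search for coords that make this small.
    """
    return sum((d[i][j] - math.dist(coords[i], coords[j])) ** 2
               for i, j in combinations(range(len(d)), 2))

sequences = ["ACGTACGT", "ACGTACGA", "TTGCAACC", "TTGCAACG"]
matrix = distance_matrix(sequences)
labels = threshold_clusters(matrix, cutoff=0.3)
print(labels)  # group number assigned to each sequence
```

The group numbers play the role of the clustering output, while `stress` expresses the visualization goal: coordinates are good when point distances reproduce the matrix distances.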

Application Pipeline

The classification process requires several pieces of software to work in a pipeline, chief among them the algorithm implementations. The particular implementations used in the SALSA Group are based on MPI.NET and can be run on a Microsoft HPC Cluster. Configuring the algorithms and running them on a cluster of machines requires manual effort. We are currently working to reduce this burden by introducing user-friendly software and software-as-services. At present we have developed a job submission tool, which enables us to configure sequence classification tasks and schedule them onto in-house Microsoft HPC Clusters. The following subsections describe the use of this tool in detail.

Main Interface

The main interface has two modes depending on the type of execution, i.e. local or remote. Local execution is mainly for testing purposes and generates a .bat script file containing the information needed to run the MPI-based applications on the local machine. The remote mode is the preferred one; it schedules the job on a cluster of machines. Figure 1 presents the interface for local mode and Figure 2 presents the interface for remote mode.
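As a rough idea of what a generated local script might contain, the sketch below builds .bat text that runs each pipeline step under `mpiexec -n`. The executable names, the convention of passing the configuration file as an argument, and the function `build_local_script` are assumptions made for illustration; the actual script produced by the tool is not documented here.

```python
def build_local_script(steps, num_processes):
    """Return .bat text that runs each (executable, config) step under mpiexec."""
    if num_processes < 1:
        raise ValueError("at least one MPI process is required")
    lines = ["@echo off"]
    for exe, config in steps:
        # -n sets the number of MPI processes to launch
        lines.append(f'mpiexec -n {num_processes} "{exe}" "{config}"')
    return "\r\n".join(lines) + "\r\n"

# Hypothetical executable names, run in pipeline order.
steps = [
    ("Apps\\Alignment.exe", "Config\\config.xml"),
    ("Apps\\PairwiseClustering.exe", "Config\\config.xml"),
    ("Apps\\MDS.exe", "Config\\config.xml"),
]
script = build_local_script(steps, num_processes=4)
print(script)
```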

File:NumberedMainInterface.png

Figure 1: Main Interface (Local Mode)

  1. Project Name
    This is the name of the project. A project denotes one sequence classification configuration.
  2. Project Directory
    This is the directory on the local machine where the project is created.
  3. Input File
    This is the input file that starts the sequence of applications. Usually it is a sequence file in FASTA format. If a distance matrix is already available, it is possible to skip the distance calculation and proceed directly to clustering and dimension reduction. In that case the input file is a distance matrix in binary format.
  4. Applications
    This is the place to select and configure the application pipeline of interest. The first application performs sequence alignment, which produces the distance matrix. The present implementation supports global alignment through the Needleman-Wunsch algorithm and local alignment through the Smith-Waterman algorithm. The latter has two implementations, one from the SALSA Group itself and the other from the Microsoft Biology Foundation. The second and third applications in the sequence are the Pairwise Clustering and Multi-Dimensional Scaling algorithm implementations from the SALSA Group. Each application is configurable via the blue “Configure” link. The configuration interfaces are presented in a later section.
  5. Execution Type
    This is the mode of intended execution. The figure shown is for local execution. Note: this rudimentary mode is advisable only for testing with a small number of sequences.
  6. Number of Processes
    This is present only in the local execution mode. The value indicates the number of MPI processes to invoke when running on the local machine.
  7. Generate
    This generates the project in the local directory specified in the Project Directory text box.
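To make the first pipeline stage more concrete, the sketch below parses FASTA text and computes a Needleman-Wunsch global alignment score with a linear gap penalty. It is a textbook version, not the SALSA or Microsoft Biology Foundation code; the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function names are assumptions. How the tool converts alignment results into distances is not described here, so that step is left out.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping only one row of the DP table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

fasta_text = """>seq1
ACGT
>seq2
AGT
"""
records = read_fasta(fasta_text)
score = needleman_wunsch_score(records["seq1"], records["seq2"])
print(score)  # 2: three matches and one gap
```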

File:NumberedMainInterfaceRemote.png

Figure 2: Main Interface (Remote Mode)

  1. Execution Type
    This is the mode of intended execution. This figure presents the interface for remote execution on a cluster.
  2. Select Head Node
    This is an automatically generated list of available Microsoft HPC Cluster head nodes. Use this combo box to pick the desired cluster.
  3. Select Compute Nodes
    This is a list of available compute nodes for the selected cluster. You can pick any number of compute nodes from the list. The “Select All”, “Inverse”, and “Clear” links make it easier to select nodes.
  4. Target Directory
    This is the directory to be used on the cluster, i.e. on both the head node and the compute nodes. The project is copied to this directory before being scheduled.
  5. Submit
    This generates the project in the local directory, as in local execution mode. Additionally, it copies the generated project to the specified head node and schedules a job on the cluster.
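For readers who want a command-line picture of the remote settings above, the sketch below assembles a Microsoft HPC Pack `job submit` command from a head node, a set of compute nodes, and a target directory. This is an assumption-laden illustration: the tool's own submission mechanism is not documented here, the node names, path, and `run.bat` are hypothetical, and the flags should be checked against the HPC Pack documentation for the installed version.

```python
def build_submit_command(head_node, compute_nodes, target_dir, command):
    """Build an argument list for the HPC Pack `job submit` command."""
    if not compute_nodes:
        raise ValueError("select at least one compute node")
    return [
        "job", "submit",
        f"/scheduler:{head_node}",                      # cluster head node
        f"/requestednodes:{','.join(compute_nodes)}",   # nodes the job may use
        f"/numnodes:{len(compute_nodes)}",              # allocate whole nodes
        f"/workdir:{target_dir}",                       # working directory on the cluster
    ] + list(command)

# Hypothetical cluster names and paths, for illustration only.
cmd = build_submit_command(
    "headnode01",
    ["node01", "node02"],
    r"C:\salsa\MyProject",
    ["run.bat"],
)
print(" ".join(cmd))
```

The function only builds the argument list; passing it to `subprocess.run` would require a machine with the HPC Pack client utilities installed.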

Application Configuration Interface

Each application is configurable via the “Configure” link. The configuration for all the applications is stored inside the project directory as a single .xml file. Each configuration interface automatically picks the appropriate section from the file and presents it in a familiar property editor dialog. A typical configuration interface for an application is shown in Figure 3.

File:NumberedApplicationConfigurationNew.png

Figure 3: Configuration Interface

  1. Property Display Options
    This enables the user to either sort the properties alphabetically or organize them into categories.
  2. Load Existing
    This is a convenient feature for loading an existing configuration. The tool automatically corrects I/O information such as paths and the total number of data points. Thus, loading a configuration from a different job is risk free as long as the algorithm parameters are acceptable for the job you are currently configuring.
  3. Property Editor
    This is the property view, which enables the user to view and edit values for different properties for the particular application.
  4. OK, Cancel, Reset Buttons
    These buttons perform the tasks implied by their names. Reset in particular discards any current changes to the configuration and restores the values of the last known configuration.
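The “Load Existing” behavior can be pictured with the following sketch, which reads one application's section from a configuration file and overrides only the I/O values with those of the current job. The XML layout and every element name (`Configuration`, `PairwiseClustering`, `InputFile`, `DataPoints`, `Threshold`) are hypothetical; the real schema of the configuration file is not shown in this document.

```python
import xml.etree.ElementTree as ET

def load_existing(existing_xml, section_name, current_io):
    """Read one application's settings, replacing I/O values with current ones."""
    root = ET.fromstring(existing_xml)
    section = root.find(section_name)
    if section is None:
        raise KeyError(f"no section named {section_name!r}")
    settings = {child.tag: child.text for child in section}
    settings.update(current_io)  # algorithm parameters are left untouched
    return settings

# Hypothetical configuration taken from a different job.
other_job_xml = """
<Configuration>
  <PairwiseClustering>
    <InputFile>C:\\old\\Input\\distance.bin</InputFile>
    <DataPoints>1000</DataPoints>
    <Threshold>0.01</Threshold>
  </PairwiseClustering>
</Configuration>
"""

settings = load_existing(
    other_job_xml,
    "PairwiseClustering",
    {"InputFile": "C:\\new\\Input\\distance.bin", "DataPoints": "250"},
)
print(settings)
```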

Project Structure

The previous sections described the process of generating a project. The structure of the generated project is given below to show where the different files are located.

File:ProjectStructure.png

Figure 4: Project Structure

A folder with the name of the project represents the project in the file system. The folder contains four subfolders, i.e. Apps, Config, Input, and Output. The Apps folder contains the required set of application binaries. The Config folder holds the configuration XML file, named config.xml by default. The selected input file is copied into the Input folder. Any output from the applications is stored in the Output folder. Note: at present the tool does not support retrieving the results of a remote execution automatically. Therefore, the output must be copied manually from the cluster when necessary.
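The layout above could be reproduced by hand with a short script such as the sketch below. The function `create_project` and the placeholder contents of config.xml are assumptions for illustration; the tool also copies application binaries into Apps, which is omitted here.

```python
import shutil
import tempfile
from pathlib import Path

SUBFOLDERS = ("Apps", "Config", "Input", "Output")

def create_project(project_dir, name, input_file):
    """Create the project folder with its four subfolders and copy the input file."""
    root = Path(project_dir) / name
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Placeholder only; the real config.xml is written by the tool.
    (root / "Config" / "config.xml").write_text("<Configuration />\n", encoding="utf-8")
    shutil.copy(input_file, root / "Input")
    return root

with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "sequences.fasta"
    fasta.write_text(">seq1\nACGT\n", encoding="utf-8")
    project = create_project(tmp, "MyProject", fasta)
    created = sorted(p.name for p in project.iterdir())
    input_copied = (project / "Input" / "sequences.fasta").exists()
    print(created)
```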

Future Work

The next step for this tool is to move towards a Service Oriented Architecture (SOA) for job submission to a remote cluster. This effort is already in progress and we have successfully implemented some features as services. This tool will function as a desktop version of the SALSA Portal once the services are fully built and integrated.