DryadLINQ Tutorial

SALSA Group
PTI Indiana University
June 17th 2010

Contents

  1. Introduction
    1. Dryad
    2. DryadLINQ
    3. Prerequisites
  2. DryadLINQ Programming Guide
    1. Important DryadLINQ Objects
    2. Line Count Example
      1. How to Run Application
      2. DryadLINQ config.xml
      3. Partition File
      4. Code Analysis
    3. MapReduce Example
  3. References and Resources

1. Introduction

  1. Dryad

    The Dryad runtime is designed to support data parallel computing which includes various kinds of applications, like machine learning, data mining, and scientific computations. Dryad is a general-purpose distributed computing engine. It supports the direct acyclic graph (DAG) job execution plan, which subsumes the MapReduce model. Dryad simplifies the task of implementing distributed applications by addressing following issues: resource scheduling, job monitoring, input data transferring, fault tolerance, tasks scheduling optimization. Dryad can run on top of few cluster services in Windows platform, including Windows HPC Server 2008, Cosmos, and Azure. Besides, it supports high level language like DryadLINQ and SCOPE. Figure 1 describes the architecture of Dryad/DryadLINQ.


    Figure 1. Dryad Architecture

  2. DryadLINQ

    DryadLINQ is a compiler which translates LINQ programs to Dryad distributed computations that run on cluster service. LINQ is a built in declarative programming language for data manipulation in .NET framework. Visual Studio 2008 provides integrated support for LINQ programs. It is not necessary for DryadLINQ programmers to have the knowledge about Dryad and Windows HPC Server. Besides, the programmers do not even have to know much about parallel or distributed computation. Thus DryadLINQ users can write the data parallel applications without too much effort.

  3. Prerequisites

    In this tutorial, we mainly focus on the academic release version of Dryad/DryadLINQ run on Windows HPC cluster. We assume the both Windows HPC cluster and Dryad/DryadLINQ have been installed. For more detailed information about installing Dryad/DryadLINQ please reference to “DryadLINQ Installation and Configuration Guide” We also assume that you are familiar with the basics of LINQ programming, and so focus on how to use the DryadLINQ extensions to LINQ. If you are new to LINQ programming, see “LINQ: .NET Language Integrated Query”.

DryadLINQ Programming Guide

The DryadLINQ programming model is based on LINQ, which is a query technology for .NET applications. LINQ defines a set of general-purpose operators that allow applications to declaratively express query operations such as traversal, filtering, and projection. LINQ was introduced with .NET Framework Version 3.5, and is included with Visual Studio 2008. DryadLINQ applications use the LINQ programming model, and most of a DryadLINQ application typically consists of standard LINQ code. DryadLINQ extends LINQ to allow applications to efficiently execute LINQ queries as distributed applications on a Windows HPC cluster.

  1. Important DryadLINQ Objects
  2. IEnumerable<T> object Used by LINQ to represent data. IEnumerable<T> is an iterator over the associated data on the local computer. IQueryable<T> object Used by DryadLINQ to represent data. IQueryable<T> inherits from IEnumerable<T>, but it represents a query to be executed by a LINQ provider, rather than an iterator that is executed locally. During evaluation, the query are translated into a Dryad job, performs the computation as a distributed application on a cluster. partition A subset of the input data that require processing. Each Dryad task processes a separate partition. PartitionedTable<T> object The objects inherit from IQueryable<T>. It is used by DryadLINQ to represent persistent data as an IQueryable<T> collection.

  3. Line Count Example
  4. We made a simple LineCount DryadLINQ application which counts the number of lines within one text file. The text file had been splitted into two partitions which are stored on two compute nodes respectively on the Madrid-headnode Windows HPC cluster.

            Using  System;
            Using  System.Collections.Generic;
            Using  System.Linq;
            Using  System.Text;
            Using  LinqToDryad;
            Public  static  class  Program {
                Static  void Main (string [] args){
                    String  uri  =  @”file://\\Madrid-headnode\DryadData\Hui\LineCount\input.pt”;
                    PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(uri);
                    Console.WriteLine(“there are {0} lines in input.txt”, table.Count());
                }
            }
        
    1. How to Run Application
      1. Create “LineCount” C# console project in Visual Studio
      2. Add the reference to “LinqToDryad.dll” file. The DLL is located in the DryadLINQ installation’s lib folder—typically C:\Program Files\Microsoft Research DryadLINQ\lib
      3. Create a local DryadLINQConfig.xml file, put this file in the bin directory where your executable file will be launched. Etc. C:\Users\lihui\Documents\Visual Studio 2008\Projects\LineCount\LineCount\bin\Debug
      4. Put two input data partition files under the default DryadData share directory on two compute nodes respectively.
      5. Create the input.pt partition file, and add some lines to record the information of partitions, like network paths of partitions, file size.
      6. Build the application and press Ctrl+F5 to run it.

    2. DryadLINQ config.xml
    3. Each DryadLINQ project must include a local configuration file named DryadLinqConfig.xml. An example configuration file is given below.

                  <DryadLinqConfig>
                      <DryadLinqRoot>C:\Program Files\Microsoft Research DryadLINQ</DryadLinqRoot>
                      <ClusterName>MADRID-HEADNODE</ClusterName>
                      <Cluster name="MADRID-HEADNODE"
                               schedulertype="Hpc"
                               dryadoutputdir="file://\\MADRID-HEADNODE\XC\lihui\LineCount"
                               partitionuncdir = "XC\output"
                              />
                  </DryadLinqConfig>
              
      1. DryadLinqRoot: Path to global DryadLinqConfig.xml file
      2. ClusterName: The name Windows HPC cluster
      3. Dryadoutputdir: Path to the default folder for output tables
      4. Partitionuncdir: Local path on each compute node that stores the partition files

    4. Partition File
    5. Currently, Dryad did not port with distributed file system. Programmers need partition the files themselves and distribute input data partitions across the cluster before running DryadLINQ applications. The input data partitions’ location and other related information are stored in the partition file. An example partition file is given below.

      \DryadData\Hui\LineCount\partition
      2
      0,1000,madrid-101
      1,1000,madrid-102

      1. The first line indicates the name prefix of each partition. The partitions must all be placed in the same directory on all the machines. In this example each file will be in the \DryadData\Hui\LineCount\. And its name will have the form partition.XXXXXXXX. Here XXXXXXXX is an 8-digit hexadecimal number.
      2. The second line is the number of partitions, in this example is 2
      3. Each line that follows describes a partition:
        1. The partition number, in decimal
        2. The partition size in bytes
        3. Finally, a comma-separated list of machines. A partition may be replicated on several machines, for fault-tolerance.

    6. Code Analysis
    7. The main function creates a partitioned table by calling the Get method of the PartitionedTable class. The table is a sequence of pre-defined DryadLINQ lineRecord objects. It implements both IEnumerable and IQueryable interfaces. When you look into the partition file input.pt, there are two input data partitions on two compute nodes. When the lineCount application is submitted to Windows HPC scheduler via DryadLinq provider, there are two Dryad tasks processing those two partitions respectively on two machines where the partitions located.

  5. MapReduce Example
  6.         using System;
            using System.Collections.Generic;
            using System.Linq;
            using System.Text;
            using LinqToDryad;
    
            namespace MapReduce
            {
                public struct Pair
                {
                    private string word;
                    private int count;
                    public Pair(string w, int c)
                    {
                        word = w;
                        count = c;
                    }
                    public int Count { get { return count; } }
                    public string Word { get { return word; } }
                    public override string ToString()
                    {   return word + ":" + count.ToString();   }
                }//Pair
                public class Program
                {
                    static void Main(string[] args)
                    {
                        string uri = @"file://\\madrid-  headnode\DryadData\hui\LineCount\input.pt";
                        PartitionedTable<LineRecord> inputTable = PartitionedTable.Get<LineRecord>(uri);
                        IQueryable<string> words = inputTable.SelectMany(x => x.line.Split(' '));
                        IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x);
                        IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count()));
                        IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count);
                        IQueryable<Pair> topPairs = ordered.Take(3); //top 3 most words
                        For each (Pair words in topPairs)
                            Console.WriteLine(words.ToString());
                        Console.ReadLine();
                        return;
                    }
                }
            }
        

    DryadLINQ provides a simple, straightforward way to implement MapReduce operations. Here is a simple word count MapReduce application that just counts the word frequency of text files. The partition file and input data partitions are the same as those in the lineCount example. The way how to run this example is also the same as previous one.

    Code Analysis

    1. The Pair structure is used to store the <key, value> pair, where key is the “word” object, value is the “count” object.
    2. Applies the SelectMany operator to inputTable to transform the collection of lines into collection of words
      1. The String.Split method converts the line into a collection of words.
      2. SelectMany concatenates the collections created by Split into a single IQueryable <string> collection named words, which represents all the words in the file.
    3. Performs the Map part of the operation by applying GroupBy to the words object. The GroupBy operation groups elements with the same key, which is defined by the selector delegate. This creates a higher order collection whose elements are groups. In this case, the delegate is an identity function, so the key is the word itself, and the operation creates a groups collection that consists of groups of identical words.
    4. Performs the Reduce part of the operation by applying Select to groups This operation reduces the groups of words from Step 3 to an IQueryable <Pair> collection named counts that represents the unique words in the file and how many instances there are of each word. Each key value in groups represents a unique word, so Select creates one Pair object for each unique word. IGrouping.Count returns the number of items in the group, so each Pair object’s Count member is set to the number of instances of the word.
    5. Applies OrderByDescending to counts. This operation sorts the input collection in descending order of frequency and creates an ordered collection named ordered.
    6. Applies Take to ordered to create an IQueryable <Pair> collection named topPair, which contains the 3 most common words in the input file and their frequency.
  7. References and Resources
    1. Dryad and DryadLINQ: An Introduction
    2. Dryad and DryadLINQ Installation and Configuration Guide
    3. DryadLINQ Programming Guide
>