

MapReduce Streaming

 

Although the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages such as Python, shell scripts, or C++. Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

Streaming runs a MapReduce job from the command line. You specify a map script, a reduce script, an input, and an output. Streaming takes care of the MapReduce details, such as splitting your job into separate tasks and executing the map tasks where the data is stored. Hadoop Streaming works a little differently from the Java API: your program is not presented with one record at a time. Instead, it reads records line by line from standard input and writes its results to standard output, so you have to iterate over the input yourself.
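As a minimal sketch of what iterating over the input yourself can look like, the following Python word count reads lines, emits tab-separated key/value records, and sums the counts per key. The script name wordcount.py and all function names are hypothetical and are not part of the example on Dumbo. The reducer assumes its input arrives sorted by key, which is how streaming delivers records to a reduce task.

```python
#!/usr/bin/env python3
"""Word-count mapper and reducer for Hadoop Streaming (illustrative sketch)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per word; the script loops over the input itself."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Sum the counts per key; assumes the records arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in one process for a quick local check."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: 'wordcount.py map' as mapper, 'wordcount.py reduce' as reducer.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Calling run_locally on a few lines of text is a cheap way to check the logic before submitting a job, since it mimics the sort that happens between the map and reduce phases.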

 

The main streaming options are:

          • -input – The data in HDFS that you want to process.
          • -output – The directory in HDFS where you want to store the output.
          • -mapper – The script, command line, or process that you want to use as your mapper.
          • -reducer – The script, command, or process that you want to use as your reducer.
          • -file – Makes the mapper, reducer, or combiner executable available locally on the compute nodes.

            There is a Hadoop Streaming example at /share/apps/examples/hadoop-streaming on Dumbo. The README file explains how to run the example and where to find hadoop-streaming.jar.

            Steps to copy the example:

            cp -r /share/apps/examples/ $HOME/example/ 

            cd $HOME/example/hadoop-streaming

            An example of how to run a Hadoop Streaming job:

            export HADOOP_LIPATH=/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib

            hadoop jar $HADOOP_LIPATH/hadoop-mapreduce/hadoop-streaming.jar \
                -numReduceTasks 2 \
                -file $HOME/example/hadoop-streaming/src \
                -mapper src/mapper.sh \
                -reducer src/reducer.sh \
                -input /user/<net_id>/book.txt \
                -output /user/<net_id>/example.out
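If you prefer to submit the job from a script, the same command can be assembled as an argument list. The sketch below is one possible way to do this, not part of the Dumbo example: the helper build_streaming_command and its parameter names are hypothetical, and it uses only the options that appear in the command above.

```python
import os
import shlex


def build_streaming_command(jar, input_path, output_path, mapper, reducer,
                            ship=(), num_reduce_tasks=None):
    """Return the argument list for a 'hadoop jar <streaming jar> ...' call."""
    cmd = ["hadoop", "jar", jar]
    if num_reduce_tasks is not None:
        cmd += ["-numReduceTasks", str(num_reduce_tasks)]
    for path in ship:
        # Each shipped file or directory gets its own -file option.
        cmd += ["-file", path]
    cmd += ["-mapper", mapper, "-reducer", reducer,
            "-input", input_path, "-output", output_path]
    return cmd


# Library path taken from the export line above.
lib = "/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib"
cmd = build_streaming_command(
    jar=lib + "/hadoop-mapreduce/hadoop-streaming.jar",
    input_path="/user/<net_id>/book.txt",    # replace <net_id> with your own
    output_path="/user/<net_id>/example.out",
    mapper="src/mapper.sh",
    reducer="src/reducer.sh",
    ship=[os.path.expandvars("$HOME/example/hadoop-streaming/src")],
    num_reduce_tasks=2,
)

# Print the command for inspection; subprocess.run(cmd, check=True) would submit it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces or special characters.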

             

            (Please contact hpc@nyu.edu to learn more)