
What is Hadoop?

Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core of Hadoop consists of HDFS and Hadoop's implementation of MapReduce.

What is HDFS? 

HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

What is Map-Reduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Let's go to the slide deck for more information:


Phases in MapReduce

A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and a value, which is either the data itself or a pointer to the location of that data. The mapping and reducing functions receive not just values, but (key, value) pairs. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly.

Every MapReduce job consists of at least three parts:

  • The driver 
  • The Mapper 
  • The Reducer 

Mapping Phase

The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually into an output data element.

The Map function divides the input into ranges using the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer.

Mapping creates a new output list by applying a function to individual elements of an input list.
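To make this concrete, here is a minimal plain-Python sketch of the mapping idea (no Hadoop involved; the `mapper` function and sample lines are illustrative, not part of the Hadoop API):

```python
# A plain-Python sketch of the mapping phase: apply a function to each
# input element and emit (key, value) pairs. All names are illustrative.

def mapper(line):
    """Transform one input line into a list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

lines = ["The quick brown fox", "the lazy dog"]
pairs = [kv for line in lines for kv in mapper(line)]
print(pairs[:3])  # [('the', 1), ('quick', 1), ('brown', 1)]
```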


Reducing Phase

Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list, combines these values, and returns a single output value.

The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys and combine them to solve the problem.

Reducing a list iterates over the input values to produce an aggregate value as output.
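As a plain-Python sketch (the `reducer` function and sample pairs are illustrative; in Hadoop the framework performs the sort-and-group step between map and reduce):

```python
from itertools import groupby
from operator import itemgetter

def reducer(key, values):
    """Combine all values observed for one key into a single output value."""
    return (key, sum(values))

# The framework sorts mapper output by key before handing it to reducers;
# sorting here stands in for that shuffle step.
pairs = sorted([("the", 1), ("dog", 1), ("the", 1)], key=itemgetter(0))
results = [reducer(k, (v for _, v in grp))
           for k, grp in groupby(pairs, key=itemgetter(0))]
print(results)  # [('dog', 1), ('the', 2)]
```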

MapReduce Data Flow 


What is dumbo?

Dumbo is the standalone Hadoop cluster running Cloudera Enterprise (CDH 5.9.0). Cloudera Enterprise (CDH) combines Apache Hadoop with a number of other open-source projects to create a single, massively scalable system where you can unite storage with an array of powerful processing and analytic frameworks.


To access dumbo, the Hadoop cluster


On the terminal, enter the commands below:

  • cd /users/NetID
  • mkdir .ssh
  • cd .ssh
  • touch config
  • vi config
  • Press ESC, then i (insert mode)


Copy the lines below into the config file:

Host hpctunnel
      ForwardX11 yes
      LocalForward 8025
      User NetID
      DynamicForward 8118

Host dumbo
      HostName localhost
      Port 8025
      ForwardX11 yes
      User NetID


Steps to connect to the Hadoop cluster (i.e., dumbo) when you are on the NYU campus:

From Windows (PuTTY):
Step 1: Use hostname "" with port 22 in PuTTY. Provide your credentials.
Step 2: Then run "ssh <net id>". It will connect to the dumbo cluster.

From a terminal (Mac/Linux):
Step 1: From the terminal, run "ssh <net id>".
Step 2: Then run "ssh <net id>". It will connect to the dumbo cluster.


What are the components of the dumbo cluster @NYU, and what can they be used for?

Let's look at the web UIs for a better understanding:

Cloudera Manager: the administration console for managing and monitoring the cluster's services

Hue: a web interface for browsing HDFS, running queries, and inspecting jobs

Resource Manager: the YARN web UI for tracking submitted applications and their resource usage


Commands for HDFS & MapReduce:

To copy a file from the local filesystem into HDFS (these three forms are equivalent):

hadoop fs  -put  <filename_in_lfs>  <hdfs_name>

hadoop fs  -copyFromLocal  <filename_in_lfs>  <hdfs_name>

hdfs dfs  -put  <filename_in_lfs>  <hdfs_name>

To copy a file from HDFS back to the local filesystem (these two forms are equivalent):

hadoop fs  -get  <hdfs_name>  <filename_in_lfs>

hadoop fs  -copyToLocal  <hdfs_name>  <filename_in_lfs>

To list the contents of your HDFS home directory:

hadoop fs -ls





General usage of the hadoop command:

usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME


To compile Java files with Maven:

mvn clean install



To run a MapReduce job, supply the jar, the driver class, the HDFS input file, and an output directory:

hadoop jar <jarfilename>.jar <DriverClassName> <ip_file_in_HDFS> <op_dir_name>



To list running jobs (either command works):

hadoop job -list

yarn application -list

To kill a running job:

hadoop job -kill <job_id>

yarn application -kill <application_id>



Example Map-Reduce job:


Example 1:

Word Count: The objective here is to count the number of occurrences of each word by using key-value pairs.


Step 1:

ssh into dumbo


Step 2:  

Copy example1 folder to /home/<net_id>/ 

cp -r /share/apps/Tutorials/Tutorial1/example1 /home/<net_id>/

cd /home/<net_id>/example1

It includes 5 files:

  • book.txt – the input file
  • the reducer source file
  • the mapper source file
  • the driver source file
  • WordCount.jar – a jar file compiled by the Java compiler, which can then be used to run the MapReduce job


Step 3:

Place the book.txt file onto HDFS

hadoop fs -put /home/<net_id>/example1/book.txt /user/<net_id>/book.txt


Step 4: 

Compile the code with the Java compiler and create a jar file from the generated class files.

export HADOOP_LIPATH=/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib

javac -classpath $HADOOP_LIPATH/hadoop/*:$HADOOP_LIPATH/hadoop-0.20-mapreduce/*:$HADOOP_LIPATH/hadoop-hdfs/*   *.java

jar cvf WordCount.jar *.class


Step 5:

Run the mapreduce job using WordCount.jar

hadoop jar  WordCount.jar WordCount /user/<net_id>/book.txt /user/<net id>/wordcountoutput


Step 6: 

Check output by accessing HDFS directories

hadoop fs -ls /user/<net_id>/wordcountoutput

hadoop fs -cat /user/<net_id>/wordcountoutput/part-r-00000


hadoop fs -getmerge /user/<net_id>/wordcountoutput $HOME/output.txt

cat $HOME/output.txt
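For reference, the logic a word-count job implements can be sketched as a local plain-Python simulation (the `word_count` helper is illustrative, not the cluster code):

```python
from collections import defaultdict

def word_count(lines):
    """Map: emit (word, 1); shuffle: group by word; reduce: sum the counts."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return dict(counts)

print(word_count(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```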

Example 2: 

Standard Deviation: The objective is to find the standard deviation of the lengths of the words.

Step 1: 

Copy the example2 folder to /home/<net_id>/

cp -r /share/apps/Tutorials/Tutorial1/example2  /home/<net_id>/

example2.txt - Input file

StandardDeviation.jar - compiled jar file


Step 2:

Place the example2.txt file onto HDFS

hadoop fs -put /home/<net_id>/example2/example2.txt /user/<net_id>/example2.txt


Step 3:

Run the mapreduce job using StandardDeviation.jar

hadoop jar  StandardDeviation.jar  wordstandarddeviation /user/<net_id>/example2.txt /user/<net_id>/standarddeviationoutput


Step 4:

Check output by accessing HDFS directories

hadoop fs -ls  /user/<net_id>/standarddeviationoutput

hadoop fs -cat  /user/<net_id>/standarddeviationoutput/part-r-00000
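The statistic this job computes can be checked locally with a small plain-Python sketch (the `word_length_stddev` helper is illustrative and assumes the population standard deviation):

```python
from math import sqrt

def word_length_stddev(lines):
    """Population standard deviation of the lengths of all words."""
    lengths = [len(word) for line in lines for word in line.split()]
    mean = sum(lengths) / len(lengths)
    return sqrt(sum((x - mean) ** 2 for x in lengths) / len(lengths))

print(word_length_stddev(["ab abcd"]))  # lengths 2 and 4, mean 3 -> 1.0
```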


More examples : 

Summarization Patterns

1. Numerical Summarization 
2. Inverted Index Summarization 
3. Counting with Counters


Filtering Patterns

1. Filtering 
2. Bloom Filtering 
3. Top Ten 
4. Distinct 


Data Organization Patterns 

1. Structured to Hierarchical
2. Partitioning 
3. Binning 
4. TotalOrderSorting 
5. Shuffling 


Join Patterns

1. Reduce Side Join 
      • UserJoin
      • UserJoinBloomFilter
2. Replicated Join 
      • ReplicatedUserJoin
3. Composite Join 
      • CompositeUserJoin
4. Cartesian Product 
      • CartesianCommentComparison


Example 3:

Step 1:  

Create a directory to work with example3

mkdir /home/<net_id>/example3

cd /home/<net_id>/example3


Step 2: 

Copy the input to local directory, then to HDFS.

cp /share/apps/Tutorials/Tutorial1/example3/MapReduce-master/examples/inputComments.xml /home/<net_id>/example3/


hadoop fs -put /home/<net_id>/example3/inputComments.xml /user/<net_id>/


Step 3: 

Clone the git repository to create a local copy of the code:

git clone


cd /home/<net_id>/example3/MapReduce


Step 4: 

Build/compile using Maven. Make sure pom.xml is present in the current directory. This command generates the "target" directory.

/opt/maven/bin/mvn install

cd target


Step 5: 

Extract the jar file. This command creates the directory "com".

jar -xvf MapReduce-0.0.1-SNAPSHOT.jar


Step 6: 

Execute the process using the class files created in directory "com".

export JAVA_CLASS=com/eftimoff/mapreduce/summarization/numerical/average

hadoop jar MapReduce-0.0.1-SNAPSHOT.jar $JAVA_CLASS/Average /user/<net_id>/inputComments.xml /user/<net_id>/AverageOutput


Step 7:

Check output by accessing HDFS directories. 

hadoop fs -ls /user/<net_id>/AverageOutput

hadoop fs -cat /user/<net_id>/AverageOutput/part-r-00000

(Note: Twitter sentiment analysis can be done using this cluster. It requires Java for the MapReduce step and a Pig script to sort the Twitter users by number of tweets. The next steps would be setting up an Oozie workflow and observing the analysis in Hue. To learn more about sentiment analysis, please contact.)


MapReduce Streaming


Even though the Hadoop framework is written in Java, programs for Hadoop need not be written in Java; they can also be developed in other languages like Python, shell scripts, or C++. Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Streaming runs a MapReduce job from the command line. You specify a map script, a reduce script, an input, and an output. Streaming takes care of the MapReduce details, such as splitting your job into separate tasks and executing the map tasks where the data is stored. Hadoop streaming works a little differently from the Java API: your program is not presented with one record at a time; you iterate over the input yourself.


  • -input – the data in HDFS that you want to process
  • -output – the directory in HDFS where you want to store the output
  • -mapper – the program, script, or command that you want to use for your mapper
  • -reducer – the program, script, or command that you want to use for your reducer
  • -file – make the mapper, reducer, or combiner executable available locally on the compute nodes
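As a sketch of what a streaming mapper/reducer pair for word count might look like (in a real job, `map_stream` and `reduce_stream` would live in two separate scripts that read `sys.stdin` and are passed via -mapper and -reducer; the names here are illustrative):

```python
def map_stream(lines):
    """Mapper: emit one tab-separated 'word<TAB>1' pair per input word."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reduce_stream(lines):
    """Reducer: input arrives sorted by key; sum the counts per word."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# Local simulation of map -> shuffle/sort -> reduce:
pairs = sorted(map_stream(["the quick the"]))
print(list(reduce_stream(pairs)))  # ['quick\t1', 'the\t2']
```

In a real streaming job, the framework performs the sort between the two stages; the `sorted()` call above stands in for that shuffle.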

There is an example of Hadoop streaming at /share/apps/examples/hadoop-streaming on Dumbo. The README file explains how to run the example and where to find hadoop-streaming.jar.

Steps to copy the example:

cp -r /share/apps/examples/ $HOME/example/

cd $HOME/example/hadoop-streaming

Command used to run a MapReduce job using streaming:

export HADOOP_LIPATH=/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib

hadoop jar $HADOOP_LIPATH/hadoop-mapreduce/hadoop-streaming.jar -numReduceTasks 2 -file $HOME/example/hadoop-streaming/src -mapper src/ -reducer src/ -input /user/<net_id>/book.txt -output /user/<net_id>/example.out


(Please contact to learn more)
