
Dumbo - Hadoop cluster

Dumbo is our 48-node Hadoop cluster, running Cloudera CDH 5.9.0 (Hadoop 2.6.0 with YARN).

All HPC users also have an account on Dumbo. (See Getting or renewing an HPC account for instructions on getting an account. If you are enrolled in a class using the clusters you may already have an account; try logging in first.)

Your $HOME directory on Dumbo is different from the $HOME on the HPC cluster, Prince. See the FAQ for a way to set the command prompt to show which cluster you are on.
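One minimal way to do this (a sketch; the FAQ may describe a slightly different approach) is to include the hostname in your shell prompt via ~/.bashrc:

    # add to ~/.bashrc so the prompt shows user@host and the current directory,
    # making it obvious whether you are on dumbo or prince
    export PS1='[\u@\h \W]\$ '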

 

Hardware specifications

 

System Name: Dumbo

Master nodes: "babar" and "hathi", each with 2x 12-core Intel "Haswell" (c. 2014) CPUs, 256GB memory, and an 8TB RAID1 disk.

Login and Compute nodes:

  • 2 login/edge nodes, "dumbo0" and "dumbo1". Logging in to "dumbo" gets a randomly selected login node.
  • 44 compute nodes, "compute-#-#", each with 2x 8-core Intel "Haswell" (c. 2014) CPUs, 128GB memory, and 16x 2TB disks for HDFS.

Network: 10Gb Ethernet

Operating System: Linux (CentOS 6.9)

File systems:

  • /home - 40GB quota per user, common $HOME
  • HDFS - 1.4PB configured, 128MB block size

Logging In

If you wish to use the Hadoop Web UIs, you will need to set up dynamic port forwarding before logging in.

Dynamic port forwarding

HDFS and other Hadoop components have Web-based user interfaces you can access with a web browser. However, all HPC clusters are behind the HPC bastion host and not visible to the NYU network or the outside world.

To make them visible, first configure Firefox to use a SOCKS proxy server.

  1. Start Firefox
  2. Under "Preferences", select "Advanced" and then "Network".
  3. Click "Settings" under "How Firefox connects to the Internet".
  4. Select "Manual proxy configuration", and set SOCKS Host to "localhost" on port 8118 (you can choose any port you like, but you must use the same port in the ssh command below)
  5. Check "SOCKSv5" and "Remote DNS"
  6. Set up your SSH tunnel as per the instructions here (Windows / Mac)
    For Mac/Linux, add the following lines to .ssh/config:

    # first we create the tunnel, with instructions to pass incoming
    # packets on ports 8025 and 8026 through it and to specific
    # locations
    Host hpcgwtunnel
       HostName gw.hpc.nyu.edu
       ForwardX11 no
       LocalForward 8025 dumbo.hpc.nyu.edu:22
       LocalForward 8026 prince.hpc.nyu.edu:22
       User NetID 
    # next we create an alias for incoming packets on the port. The
    # alias corresponds to where the tunnel forwards these packets
    Host dumbo
      HostName localhost
      Port 8025
      ForwardX11 yes
      User NetID

    Host prince
      HostName localhost
      Port 8026
      ForwardX11 yes
      User NetID

     

    For Windows:

    Please follow the instructions on the link and then move on to the next step: Logging in from a Windows workstation

    In the "Tunnels" dialog of your hpctunnel, add a new forwarded port, but this time the Source port is "8118" and check "Dynamic" instead of "Local" 

  7. Now, after you have logged in to hpc.nyu.edu, you can log in to dumbo in the same way you log in to Prince.
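With the Mac/Linux configuration above, a typical session looks like the sketch below; the -D option supplies the SOCKS proxy on port 8118 that the Firefox settings expect (you could equally add a "DynamicForward 8118" line to the hpcgwtunnel entry):

    # terminal 1: open the tunnel to the bastion host and start the SOCKS proxy
    ssh -D 8118 hpcgwtunnel
    # terminal 2: log in to dumbo through the tunnel (uses the "dumbo" alias)
    ssh dumbo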

Hadoop Web UIs

Two very useful UIs are the Cloudera Manager (http://babar.es.its.nyu.edu:7180/) and the Hue interface (http://babar.es.its.nyu.edu:8888/). Enter either URL in the Firefox browser configured with the SOCKS proxy above to access the UI.

Cloudera Manager is an administrator tool, but you can view the status of the cluster by logging in as "nyu" with the password "nyu".

The Hue interface login is your NetID and password.

Storage on Dumbo

File systems available for use:

The NYU Hadoop clusters have multiple filesystems for users' files. Each filesystem is configured differently to serve a different purpose.

  • /home ($HOME) - Program development space; for storing small files you want to keep long term, e.g. source code and scripts. Not flushed. Allocation: 40GB per user.
  • /scratch ($SCRATCH) - Best suited to large files. Files not accessed for 60 days are flushed. Allocation: 5TB per user.
  • HDFS (no environment variable) - Hadoop Distributed File System on Dumbo. Not flushed. Allocation: 60TB per user.
  • /archive ($ARCHIVE) - Long-term storage. Not flushed. Allocation: 2TB per user.
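To check how much of each allocation you are using, standard tools work; for example (the HDFS path assumes your HDFS home directory is /user/<NetID>):

    du -sh $HOME                       # current usage in /home
    du -sh $SCRATCH                    # current usage in /scratch
    hadoop fs -du -s -h /user/$USER    # current usage in your HDFS home directory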

 

Moving files between your workstation and HDFS

There are three ways to do this:

  1. Use the SSH tunnel to scp files between your workstation and $HOME, and use 'hadoop fs -put' and 'hadoop fs -get' to move them between $HOME and HDFS (see the sketch after this list). If you do this, please remove extraneous files from $HOME afterwards.
  2. Use the SSH tunnel and the NFS Gateway to scp files directly between your workstation and HDFS.

    The NFS Gateway is currently down - please use method 1 or 3 instead.

  3. Use the file browser option in the Hue interface, located in the top dashboard. On the right-hand side you will find options to upload and download files to and from your workstation.
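Here is a sketch of method 1; the filenames are placeholders, and the scp command assumes the "dumbo" SSH alias from the tunnel configuration above:

    # on your workstation: copy the file into $HOME on dumbo through the tunnel
    scp mydata.csv dumbo:~/
    # on dumbo: move it from $HOME into HDFS, and pull results back out
    hadoop fs -put ~/mydata.csv /user/$USER/
    hadoop fs -get /user/$USER/results.csv ~/
    # tidy up $HOME afterwards
    rm ~/mydata.csv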

Large data transfers between your workstation and Dumbo

Use Globus to transfer files to and from NYU HPC storage. Transfer the data to /scratch on Prince first; the same /scratch folder is also available on the dumbo login nodes.

  • /scratch is shared between multiple clusters (Prince and dumbo), but /home is different on Prince and dumbo.
  • The Globus server is only available on Prince, so transfer data to /scratch with Globus on Prince and then copy the data from /scratch to HDFS or to your /home folder on dumbo (see the example below).
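For example (paths are illustrative), once the Globus transfer has finished you can copy the data into HDFS from a dumbo login node:

    hadoop fs -put /scratch/$USER/bigdata.tar /user/$USER/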

The Globus project aims to provide powerful tools for scientific data management, helping researchers focus on their domain subjects and solve data-intensive research problems. Globus has matured into a platform that enables grid computing by connecting computing resources distributed globally across organizational boundaries; universities, national laboratories and computing facilities all use its services.

File permissions and Access Control Lists

Users can share files with others using ACLs.

An access control list (ACL) gives per-file, per-directory and per-user control over who has which permissions on a file. You can see the ACL for a file or directory with the getfacl command.

To modify permissions for files or directories, use setfacl.
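A short sketch (the directory name and the NetID abc123 are just examples); the equivalent 'hadoop fs' subcommands manage ACLs on HDFS paths:

    getfacl shared_dir                             # show the current ACL
    setfacl -m u:abc123:rx shared_dir              # let user abc123 read/enter it
    hadoop fs -getfacl /user/$USER/shared          # same idea on HDFS
    hadoop fs -setfacl -m user:abc123:r-x /user/$USER/shared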

Using Hive

Hive defaults to using the "default" database, which means that tables you create there might clash with tables others have created. We therefore highly recommend creating and using your own database, e.g.:
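For example (the database name is just a placeholder), from a login or compute node:

    hive -e "CREATE DATABASE IF NOT EXISTS my_netid_db;
             USE my_netid_db;
             CREATE TABLE demo (id INT, name STRING);"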

Please make sure to set the Hive heap size when your query launches a MapReduce job.
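One common way to do this (an assumption; the exact setting recommended for Dumbo may differ) is to raise the Hive client heap before starting the shell:

    # give the Hive client a 2GB heap (value is illustrative), then start hive
    export HADOOP_HEAPSIZE=2048
    hive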

Currently we do not have a way to automatically choose a suitable compute node to connect to, so you should pick one at random. The Cloudera Manager Web UI can help: the front page has charts showing host CPU and memory usage, and it is wise to avoid hosts that already show high usage.

You can see a list of all compute hosts by clicking the "Hosts" link on the top dashboard of the Cloudera Manager Web UI main page.
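Putting it together, a session might look like this (the compute node name is hypothetical; pick a lightly loaded one as described above):

    ssh compute-1-10    # from a dumbo login node
    hive                # start the Hive shell on that host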

Using Impala

To connect to Impala, use a hostname where an Impala daemon is running. In fact, Impala daemons run on all of Dumbo's compute (data) nodes, so connecting to any compute node will work.

Here is the process to connect with impala-shell:
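A sketch (the compute node name is illustrative):

    # either log in to a compute node and start the shell there...
    ssh compute-1-10
    impala-shell
    # ...or point impala-shell at a specific Impala daemon from a login node
    impala-shell -i compute-1-10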

Using Python

Python defaults to version 2.6 on dumbo. If you need Python with libraries such as NumPy, SciPy, Pandas, NLTK, Gensim, or Geopandas, follow the steps below, which point to Python 2.7.11 with the required libraries. Contact hpc@nyu.edu to request additional libraries.
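The steps below are a sketch; the exact module name is an assumption, so check 'module avail python' (or ask hpc@nyu.edu) for the name actually installed on dumbo:

    module avail python                # list the available Python builds
    module load python/gnu/2.7.11      # hypothetical module name
    python --version                   # should now report Python 2.7.11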

NOTE: If you want to work with Hadoop Streaming, refer to Big Data Tutorial 1: MapReduce for the MapReduce Hadoop Streaming execution process.

Using Spark

Spark is a data-processing framework that operates on distributed data collections. Hadoop and Apache Spark are both big-data frameworks, but they serve different purposes: Hadoop is essentially a distributed data infrastructure that spreads massive data collections across multiple nodes within a cluster of commodity servers, whereas Spark is a data-processing tool that operates on those distributed collections.

Refer to Big Data Tutorial 3 to learn more about Spark.

Here is the process to work with Spark on Dumbo:
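A minimal sketch of running Spark against Dumbo's YARN cluster (the script name and resource numbers are illustrative):

    # interactive Scala shell on YARN
    spark-shell --master yarn
    # interactive Python shell on YARN
    pyspark --master yarn
    # batch submission of a PySpark script
    spark-submit --master yarn --deploy-mode cluster \
        --num-executors 4 --executor-memory 4G mytask.py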


Hadoop Job Status

You can list the currently running jobs on the Hadoop cluster using the 'yarn' script. Running the yarn script without any arguments prints the description of all commands.
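For example:

    yarn                      # print the description of all yarn commands
    yarn application -list    # list applications currently running on the cluster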

You may need to kill a running job because it has started malfunctioning or, in the worst case, is stuck in an infinite loop. Get the application ID and then kill it as shown below.
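For example (the application ID below is a placeholder; copy the real one from 'yarn application -list'):

    yarn application -kill application_1234567890123_0042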


Kafka

Kafka is a messaging system. In the real world, data exists on many systems in parallel, all of which need to interact with Hadoop and with each other. The situation quickly becomes complex, ending with multiple data systems talking to one another over many channels, each of which requires its own communication method to move the data. Kafka can instead be used as a central messaging bus: all incoming data is first placed in Kafka, and all outgoing data is read from Kafka, organized into named "topics". Kafka centralizes communication between "producers" of data and "consumers" of that data.

The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers. All Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic. A consumer of topics pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
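As an illustration, Kafka ships console tools for creating topics and for acting as a producer or consumer. The hostnames, ports, and topic name below are placeholders (check the actual broker and ZooKeeper endpoints in Cloudera Manager), and older Kafka versions take --zookeeper instead of --bootstrap-server for the consumer:

    # create a topic (ZooKeeper endpoint is a placeholder)
    kafka-topics --create --zookeeper zookeeper-host:2181 \
        --replication-factor 1 --partitions 1 --topic my-test-topic
    # push messages into the topic
    kafka-console-producer --broker-list broker-host:9092 --topic my-test-topic
    # read messages back out
    kafka-console-consumer --bootstrap-server broker-host:9092 \
        --topic my-test-topic --from-beginning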

Zeppelin

Apache Zeppelin is a web-based interactive computational environment that can use Apache Spark as a backend; in some sense it is like the IPython Notebook. Zeppelin is installed on dumbo.

Note that Dumbo is a YARN cluster, not a standalone Spark cluster.

Here is the process to work with Zeppelin on Dumbo.

The procedure to start and stop it on dumbo login nodes is:
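A sketch of the idea (the Zeppelin install path is an assumption; check where Zeppelin actually lives on dumbo):

    /usr/lib/zeppelin/bin/zeppelin-daemon.sh start   # start your Zeppelin instance
    /usr/lib/zeppelin/bin/zeppelin-daemon.sh stop    # stop it when you are done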

If you have dynamic port forwarding set up on your workstation, you can browse your notebook using the URL:

Alternatively, if you have X11 set up (see the instructions for Windows and Mac / Linux computers), you can also run Firefox on a dumbo login node and have the Firefox window appear on your workstation:
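For example, since the "dumbo" SSH alias defined above already enables X11 forwarding:

    ssh dumbo      # ForwardX11 is set in the config above (or use ssh -X)
    firefox &      # the browser window appears on your workstation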

 

Please refer back to this page in the near future for instructions on how to use Oozie, Flume and Kafka.

 
