Dumbo - Hadoop cluster
Dumbo is our 48-node Hadoop cluster, running Cloudera CDH 5.9.0 (Hadoop 2.6.0 with Yarn).
All HPC users also have an account on Dumbo. (See Getting or renewing an HPC account for instructions to get an account. If you are inrolled in a class using the clusters you may already have an account, try logging in first).
Your $HOME directory on Dumbo is different that the $HOME on the HPC cluster, Prince. See FAQ for a way to set the command prompt to show which cluster you are on.
"babar" and "hathi"
2x12-core Intel "Haswell" (c 2014) CPUs
8TB RAID1 disk
|Login and Compute nodes|
2 login/edge nodes: "dumbo0" and "dumbo1". Logging in to "dumbo" gets a randomly selected login node.
44 compute nodes: "compute-#-#".
2x8-core Intel "Haswell" (c 2014) CPUs
16x2 TB disk for HDFS
10 Gb Ethernet
Linux (Centos 6.9)
/home - 40GB quota per user, common $HOME
HDFS - 1.4 PB configured. 128MB block size
If you wish to use Hadoop Web UIs, you will need to setup dynamic port forwarding before logging in.
Dynamic port forwarding
HDFS and other Hadoop components have Web-based user interfaces you can access with a web browser. However, all HPC clusters are behind the HPC bastion host and not visible to the NYU network or the outside world.
To make them visible, first configure Firefox to use a SOCKS proxy server.
- Start Firefox
- Under "Preferences", select "Advanced" and then "Network":
- Next click "Settings" for "How Firefox connects to the Internet"
- Select "Manual proxy configuration", and set SOCKS Host to "localhost" on port 8118 (you can choose any port you like, but you must use the same port in the ssh command below)
- Check "SOCKSv5" and "Remote DNS"
Please follow the instructions on the link and then move on to the next step: Logging in from a Windows workstation
In the "Tunnels" dialog of your hpctunnel, add a new forwarded port, but this time the Source port is "8118" and check "Dynamic" instead of "Local"
- Now, after you have logged in to hpc.nyu.edu, you can log in to dumbo in the same way you log in to Prince.
Hadoop Web UIs
Two very useful UIs are the Cloudera Manager (http://babar.es.its.nyu.edu:7180/) and Hue Interface (http://babar.es.its.nyu.edu:8888/). Please enter either link on the firefox web browser to access the UI.
Cloudera Manager is an administrator tool, but you can view the status of the cluster by logging in as "nyu" with the password "nyu".
The hue interface login is your NetId and Password.
Storage on Dumbo
Files Systems for usage:
The NYU Hadoop clusters have multiple filesystems for users files. Each filesystem is configured differently to serve a different purpose.
|$HOME||Program development space; storing small files you |
want to keep long term , e.g. source code, scripts.
Best suited to large files.
Files not accessed
|HDFS||-NA-||Hadoop Distributed File System on Dumbo||NO||60TB|
Moving files between your workstation and HDFS
There are three ways to do this:
Use the SSH tunnel to scp between your workstation and $HOME, and use 'hadoop fs -put' and 'hadoop fs -get' to move between $HOME and HDFS. If you do this, please remove extraneous files for $HOME afterwards:
Use the SSH tunnel and NFSGateway to scp files directly between your workstation and HDFS:
NFSGateway is currently down - please use method 1 or 3 instead
- Use the Hue Interface file browser option which is located in the top dashboard. On the right right hand side you will have an option to upload and download files to and from your workstation.
Huge Data Transfer between your workstation and Dumbo
Using Globus to transfer files to and from NYU HPC storage. Use Globus to transfer the data to /scratch on Prince first. The same /scratch folder is also available on dumbo login nodes.
- /scratch is shared between multiple clusters i.e., Prince and dumbo. But /home is different on Prince and dumbo.
- Globus server is only available on Prince. So, transfer data to /scratch with Globus on Prince and then copy the data from /scratch to your hdfs or /home folder of dumbo.
The Globus project aims at providing powerful tools for scientific data management, to help researchers to focus on their domain subjects and solve data intensive research problems. Globus has been grown maturely to enable grid computing by connecting computing resources distributed globally across organizational boundary. Universities, national laboratories and computing facilities are using services of Globus.
File permissions and Access Control Lists
Users can share files with others using ACL's.
An access control list (or ACL) gives per-file, per-directory and per-user control over who have permissions on the files. You can see the ACL for a file or directory with the getfacl command:
To modify permissions for files or directories, use setfacl.
Hive defaults to using the "default" database. This means that tables you create in this database might clash with tables others have created. Therefore, we highly recommend creating and using your own database, eg:
Please make sure to enter the Hive Heap size with your query for a MapReduce job.
Currently we do not have a way to automatically choose a suitable compute node to connect to - you should choose one at random. The Cloudera Manager WebUI can help, on the front page there are charts showing host CPU usage and host memory usage. It is wise to avoid hosts which already show high CPU or memory usage!
You can see a list of all compute hosts by clicking the "Hosts" link on the top dashboard of the Cloudera Manager WebUI main page.
To connect to impala use hostname where impala daemon is running. Infact, impala daemons are running on all slaves/datanodes on compute nodes of dumbo. So, connecting to any compute nodes, impala will work.
Here is the process to connect to Impala-shell.
Python defaults to version 2.6 on dumbo. Incase there is need to work on python with libraries i.e., NumPy, SciPy, Pandas, NLTK, Gensim, Geopandas, follow below steps which points to Python 2.7.11 with required libraries. Contact email@example.com for installing new libraries.
NOTE: If you want to work with Hadoop Streaming, Refer Big Data Tutorial 1: MapReduce for MapReduce Hadoop Streaming execution process.
Spark is a data-processing framework that operates on the distributed data collections. Hadoop and Apache Spark are both big-data frameworks, which really serve the different purposes. Hadoop is essentially a distributed data infrastructure, It distributes massive data collections across multiple nodes within a cluster of commodity servers where as Spark is a data-processing tool.
Refer Big Data Tutorial 3 to learn more about Spark.
Here is the process to work with Spark on Dumbo:
Using Spark 2.2
Process to work with Spark 2.2 on Dumbo:
Hadoop Job Status
Please find the list of current running jobs on hadoop cluster using 'Yarn' script. Running the yarn script without any arguments prints the description for all commands.
To kill a currently running job because the submitted job started malfunctioning or in worst case scenario, job is stuck in infinite loop. Get the Job ID and then Kill it as given below.
Kafka is a messaging system. In the real world data exists on many systems in parallel, all of which need to interact with Hadoop and with each other. The situation quickly becomes more complex, ending with a system where multiple data systems are talking to one another over many channels. Infact, each of these channels require their own communication method to move the data. So, we can use Kafka as a central messaging bus. All incomming data is first placed in Kafka and all outgoing data is read from Kafka which is called as "Topic". Kafka centralizes communication between "producers" of data and "consumers" of that data.
The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers. All Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic and if you wish to read a message you read it from a specific topic. A consumer of topics pulls messages off of a Kafka topic while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker.
Apache Zeppelin is a web-based interactive computational environment that could use Apache Spark as a backend. In some sense it is like the IPython Notebook. Zeppelin is installed on dumbo.
Dumbo is a yarn cluster, not a standalone spark cluster.
Here is the process to work with Zeppelin on Dumbo.
The procedure to start and stop it on dumbo login nodes is:
Suppose you have dynamic port forwarding set up on your workstation, you can browse your notebook using URL:
Please refer back to this page in the near future for instructions on how to use Oozie, Flume and Kafka.