
Introduction to the Prince Cluster

A Linux cluster comprises hundreds of computing nodes interconnected by high-speed networks, with the Linux operating system running on each node individually. The resources are shared among many users for their technical or scientific computing purposes. Slurm is a cluster software layer built on top of the interconnected nodes that orchestrates the nodes' computing activities, so that users can view the cluster as a unified, enhanced and scalable computing system. Users of the NYU HPC clusters come from many departments, with various disciplines, subjects and computing projects, and so impose very diverse requirements regarding hardware, software resources and processing parallelism. Users submit jobs, which compete for computing resources. Slurm is a resource manager and job scheduler designed to allocate resources and schedule jobs. It is open-source software with a large user community, and is installed on many of the Top 500 supercomputers.

This tutorial assumes you have a NYU HPC account. If not, you may apply for an account here.

It also assumes you are comfortable with the Linux command-line environment. To learn about Linux, please read Tutorial 1.

Prince computing nodes

Nodes  Cores/Node  CPU Type                      Memory Available To Jobs (GB)  Names
68     28          Intel(R) Broadwell @ 2.60GHz  125                            c[01-17]-[01-04]
32     28          Intel(R) Broadwell @ 2.60GHz  250                            c[18-25]-[01-04]
32     20          Intel(R) Haswell @ 2.60GHz    62                             c[26-27]-[01-16]*
9      28          Intel(R) Broadwell @ 2.60GHz  250                            gpu-[01-09]**
4      28          Intel(R) Broadwell @ 2.60GHz  125                            gpu-[10-13]***

*   c[26-27]-[01-16] represents two sets of nodes: c26-01 to c26-16, and c27-01 to c27-16
**  4 Tesla K80 cards on each of these GPU nodes
*** 4 GeForce GTX 1080 cards on each of these GPU nodes

File systems

Space     Environment Variable  Purpose                                      Flushed                                    Allocation/User
/archive  $ARCHIVE              Long-term storage                            NO                                         2 TB
/home     $HOME                 Small files, code                            NO                                         20 GB
/beegfs
/scratch  $SCRATCH              File staging - frequent writing and reading  YES. Files unused for 60 days are deleted  5 TB


For more details on the nodes and the file systems' hardware configuration, please see the page "Cluster - Prince".

Return to Top

The Prince picture

NOTE: The cluster nodes can still access the internet directly. This may be useful when copying data from servers outside the NYU network.

NOTE: Alternatively, instead of logging in to the bastion hosts, you can use the VPN to get inside NYU's network and access the HPC clusters directly. Instructions on how to install and use the VPN client are available here.

NOTE: You can't do anything on the bastion hosts except ssh to the HPC clusters (Prince, Dumbo).

Return to Top 

Connecting to Prince

Logging onto the Prince cluster and submitting jobs is analogous to the triple jump, the Olympic event which originated in ancient Greece. First, open a terminal on your Mac workstation. If your workstation is outside the NYU network, follow these three steps:

  1. Hop  - from your workstation, ssh onto one bastion host, which can be either hpc.nyu.edu or hpc2.nyu.edu
  2. Step - from any bastion host, ssh to the Prince cluster login node prince.hpc.nyu.edu
  3. Jump - from any login node, run command "sbatch" or "srun" to submit jobs which will land on the computing node(s)

If you are inside the NYU network, the first step, 'hop', can be omitted.
See for instance a complete HPC session:

ITSs-Air-3:~ johd$ ssh johd@hpc.nyu.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           WARNING:  UNAUTHORIZED PERSONS ........ DO NOT PROCEED
           ~~~~~~~   ~~~~~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~~
 This computer system is operated by New York University (NYU) and may be
 accessed only by authorized users.  Authorized users are granted specific,
 limited privileges in their use of the system.  The data and programs
 in this system may not be accessed, copied, modified, or disclosed without
 prior approval of NYU.  Access and use, or causing access and use, of this
 computer system by anyone other than as permitted by NYU are strictly pro-
 hibited by NYU and by law and may subject an unauthorized user, including
 unauthorized employees, to criminal and civil penalties as well as NYU-
 initiated disciplinary proceedings.  The use of this system is routinely
 monitored and recorded, and anyone accessing this system consents to such
 monitoring and recording.  Questions regarding this access policy should be
 directed (by e-mail) to askits@nyu.edu or (by phone) to 212-998-3333.
 Questions on other topics should be directed to COMMENT (by email) or to
 212-998-3333 by phone.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
johd@hpc.nyu.edu's password: 
Last login: Wed Jan 11 09:59:54 2017 from ool-18befec8.dyn.optonline.net


[johd@hpc ~]$ ssh prince.hpc.nyu.edu
johd@prince.hpc.nyu.edu's password: 
Last login: Sat Jan 14 11:45:42 2017 from hpc2.es.its.nyu.edu


[johd@log-1 ~]$ sbatch --wrap "hostname; echo 'hello, this is a test'"
Submitted batch job 9870


[johd@log-1 ~]$ exit

For access from a Windows workstation using PuTTY, please read below.


Step 1.

Enter "hpc2.nyu.edu" for the host name, and leave the port at the default "22". If you want, you may enter
a name for the saved session, e.g. "hpc2", and click "Save" to reuse it next time. Hit "Open".

Click "Yes" when a window like the one below shows up.

 

Step 2.

Enter your NetID username and password. This will get you onto the bastion host hpc2.

 

Step 3.

On hpc2, enter the command "ssh prince.hpc.nyu.edu" (or "ssh prince" for short), answer "yes" to the question,
and type your NetID password. If everything goes smoothly, you will land on one of the Prince login nodes!

 

 

 

Return to Top

Describing Slurm commands

Submit jobs - [sbatch]

Batch job submission is accomplished with the command sbatch. As with qsub in Torque, we create a bash script to describe our job's requirements: what resources we need (memory and CPUs requested), what software and processing we want to run, and where to send the job's standard output and error, etc. After a job is submitted, Slurm finds the suitable resources, schedules and drives the job execution, and reports the outcome back to the user. The user can then return to look at the output files.

Example-1:
In the first example, we create a small bash script, run it locally, then submit it as a job to Slurm using sbatch, and compare the results.  
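The expanded example is not reproduced here, but the idea can be sketched as follows; the script name hello.s and its contents are hypothetical stand-ins:

```shell
# hello.s -- a tiny batch script (name and contents are illustrative)
cat > hello.s <<'EOF'
#!/bin/bash
#SBATCH --job-name=helloTest
#SBATCH --nodes=1 --ntasks=1
#SBATCH --time=5:00
hostname
echo 'Hello from my first Slurm script'
EOF

# run it locally first: bash treats the #SBATCH lines as plain comments
bash hello.s

# then submit it from a Prince login node; the same commands now run on
# a compute node, so the hostname printed in the slurm-<jobid>.out file
# differs from the one printed locally
# sbatch hello.s
```

Comparing the local output with the job's output file makes the difference between the login node and the compute node visible.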

 

Example-2:

Follow the recipe below to submit a job. The job can be used later as an example for practicing how to check job status. In our test its running time was about 7 minutes.
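Spelled out as commands, the recipe looks like this; the /share/apps path is the tutorial directory named in the script explanation further below, and the target directory matches the one the script cd's into:

```shell
# stage the example in the directory the job script uses
mkdir -p /scratch/$USER/mytest1
cd /scratch/$USER/mytest1

# copy the MATLAB input and the job script from the tutorial directory
cp /share/apps/Tutorials/slurm/example/thtest.m .
cp /share/apps/Tutorials/slurm/example/run-matlab.s .

# submit; Slurm prints the new job's ID
sbatch run-matlab.s
```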

Below is the content of the bash script "run-matlab.s" just used in the job submission:


#!/bin/bash
#
##SBATCH --nodes=1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=1:00:00
#SBATCH --mem=10GB
#SBATCH --job-name=myMatlabTest
#SBATCH --mail-type=END
##SBATCH --mail-user=bob.smith@nyu.edu
#SBATCH --output=slurm_%j.out

module purge
module load matlab/2016b

cd /scratch/$USER/mytest1
cat thtest.m | srun matlab -nodisplay


For reference - Explanation of script


The script is given in /share/apps/Tutorials/slurm/example. Below is an annotated version with detailed explanation of the SBATCH directives used in the script:

#!/bin/bash
# This line tells the shell how to execute this script, and is unrelated 
# to SLURM.
   
# at the beginning of the script, lines beginning with "#SBATCH" are read by
# SLURM and used to set queueing options. You can comment out a SBATCH 
# directive with a second leading #, eg:
##SBATCH --nodes=1
   
# we need 1 node, will launch a maximum of one task. The task uses 2 CPU cores  
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
   
# we expect the job to finish within 1 hour. If it takes longer than 1
# hour, SLURM can kill it: 
#SBATCH --time=1:00:00
   
# we expect the job to use no more than 10GB of memory:
#SBATCH --mem=10GB
   
# we want the job to be named "myMatlabTest" rather than something generated 
# from the script name. This will affect the name of the job as reported
# by squeue: 
#SBATCH --job-name=myMatlabTest
 
# when the job ends, send me an email at this email address.
# replace with your email address, and uncomment that line if you really need to receive an email.
#SBATCH --mail-type=END
##SBATCH --mail-user=bob.smith@nyu.edu
   
# both standard output and standard error are directed to the same file.
# It will be placed in the directory I submitted the job from and will
# have a name like slurm_12345.out
#SBATCH --output=slurm_%j.out
 
# once the first non-comment, non-SBATCH-directive line is encountered, SLURM 
# stops looking for SBATCH directives. The remainder of the script is  executed
# as a normal Unix shell script
  
# first we ensure a clean running environment:
module purge
# and load the module for the software we are using:
module load matlab/2016b
  
# the script will have started running in $HOME, so we need to move into the 
# directory we just created earlier
cd /scratch/$USER/mytest1
  
# now start the Matlab job:
cat thtest.m | srun matlab -nodisplay
# Leave a few empty lines in the end to avoid occasional EOF trouble.

The job has been submitted successfully, and as the example box shows, its job ID is 11615. Usually we should let the scheduler decide which nodes jobs run on. If there is a need to request a specific set of nodes, use the nodelist directive, e.g. '#SBATCH --nodelist=c09-01,c09-02'.

Return to Top

Check cluster status - [sinfo, squeue]

The sinfo command gives information about the cluster status, by default listing all the partitions. Partitions group computing nodes into logical sets, which serve various functions such as interactivity, visualization and batch processing.

 

A partition is a group of nodes. A partition can be made up of nodes with a specific feature or functionality, such as nodes equipped with GPU accelerators (the gpu partition). A partition can have specific parameters, such as how long jobs may run. So partitions can be thought of as "queues" in other batch systems. Partitions may overlap.

sinfo by default prints information aggregated by partition and node state. As shown above, there are four partitions, namely c01_25, c26, c27 and gpu. The partition marked with an asterisk is the default one. Apart from the two lines with node state 'mix', which means some CPU cores are occupied, all other nodes are idle.

See two useful sinfo command examples: the first lists the nodes in the idle state in the gpu partition; the second outputs information in a node-oriented format.
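The two examples can be written with standard sinfo options as:

```shell
# 1. list only the idle nodes in the gpu partition
sinfo --partition=gpu --states=idle

# 2. node-oriented, long-format output: one line per node
sinfo --Node --long
```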

Return to Top

The squeue command lists jobs which are in the running, pending, completing or other states. It can also display jobs owned by a specific user or with a specific job ID.

Run 'man sinfo' or 'man squeue' to see the explanations for the results.

 

Return to Top

Check job status - [squeue, sstat, sacct]

With the job ID in hand, we can track the job's status throughout its lifetime. The job first appears in the Slurm queue in the PENDING state. When its required resources become available and it gets its turn by priority, the job is allocated resources and transits to the RUNNING state. If the job runs to the end and completes successfully, it goes to the COMPLETED state; otherwise it ends in the FAILED state. Use squeue -j <jobID> to check a job's status.

Most of the columns in the output of the squeue command are self-explanatory. 

The column "ST" in the middle is the job status, which can be :

  • PD - pending: waiting for resource allocation
  • S  - suspended
  • R  - running
  • F  - failed: non-zero exit code or other failures
  • CD - completed: all processes terminated with zero exit code
  • CG - completing: in the completing process, some processes may still be alive

The column "NODELIST(REASON)" at the end shows either the nodes a running job occupies, or the reason a pending job is waiting, which can be:

  • JobHeldUser:            the job is held by the user
  • Priority:               higher priority jobs exist
  • Resources:              waiting for resources to become available
  • BeginTime:              start time not reached yet
  • Dependency:             wait for a depended job to finish
  • QOSMaxCpuPerUserLimit:  number of CPU core limit reached

You may select which columns to display and how wide each should be; the width is an integer placed between '%.' and the format letter, e.g. '%.10M'.
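For example, using standard squeue format codes (%i job ID, %j job name, %t compact state, %M elapsed time, %R node list/reason):

```shell
# show my jobs, with the elapsed-time column 10 characters wide
squeue -u $USER --format="%.10i %.20j %.4t %.10M %R"
```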


Run the command sstat to display various information about a running job/step. Run the command sacct to check accounting information for jobs and job steps in the Slurm log or database. Both commands have a '--helpformat' option to check which output columns are available.

Type "man <command>" to look up detailed usage on the manual pages of command squeue, sstat and sacct.

Return to Top

Cancel a job - [scancel]

Things can go wrong, or turn out in unexpected ways. Should you decide to terminate a job before it finishes, scancel is the tool to help.
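Some common forms (the job ID 11615 and job name are the ones from the sbatch example above):

```shell
scancel 11615                 # cancel one job by ID
scancel --name=myMatlabTest   # cancel by job name
scancel -u $USER              # cancel all of your own jobs
```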


Look at job results

Job results include the job execution logs (standard output and error) and, of course, any output data files defined when submitting the job. Log files are created in the working directory, and output data files in your specified directory. Examine the log files with a text viewer or editor to gain a rough idea of how the execution went. Open the output data files to see exactly what result was generated. Run the sacct command to see resource usage statistics. Should you decide that the job needs to be rerun, submit it again with sbatch using a modified batch script and/or updated execution configuration. Iteration is one characteristic of a typical data analysis!

Return to Top

Software and Environment Modules

Environment Modules is a tool for managing multiple versions and configurations of software packages, and is used by many HPC centers around the world. With Environment Modules, software packages are installed away from the base system directories, and for each package an associated modulefile describes what must be altered in a user's shell environment - such as the $PATH environment variable - in order to use the software package. The modulefile also describes dependencies and conflicts between this software package and other packages and versions.

To use a given software package, you load the corresponding module. Unloading the module afterwards cleanly undoes the changes that loading the module made to your environment, thus freeing you to use other software packages that might have conflicted with the first one.

Working with software packages on the NYU HPC clusters.
 

Command                      Functionality
module avail                 Check what software packages are available
module whatis module-name    Find out more about a software package
module help module-name      Show more detailed help for a software package
module show module-name      See exactly what effect loading the module will have
module list                  Check which modules are currently loaded in your environment
module load module-name      Load a module
module unload module-name    Unload a module
module purge                 Remove all loaded modules from your environment
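A typical session combining these commands might look like this (matlab/2016b is the module version used earlier in this tutorial):

```shell
module purge               # start from a clean environment
module avail matlab        # which MATLAB versions are installed?
module load matlab/2016b   # load one of them
module list                # confirm it is loaded
module unload matlab/2016b # cleanly undo the environment changes
```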

Return to Top 

Running interactive jobs 

The majority of jobs on the Prince cluster are submitted with the sbatch command and executed in the background. These jobs' steps and workflows are predefined by users, and their execution is driven by the scheduler system.

There are cases when users need to run applications interactively (interactive jobs). Interactive jobs allow the users to enter commands and data on the command line (or in a graphical interface), providing an experience similar to working on a desktop or laptop. Examples of common interactive tasks are:

  • Editing files
  • Compiling and debugging code
  • Exploring data, to get a rough idea of its characteristics
  • Getting graphical windows to run visualization
  • Running software tools in interactive sessions

Since the login nodes of the Prince cluster are shared by many users, running interactive jobs that require significant computing and IO resources on the login nodes impacts many other users.

Interactive jobs on Prince Login nodes

Running compute and IO intensive interactive jobs on the Prince login nodes is not allowed. Jobs may be removed without notice. 

Instead of running interactive jobs on the login nodes, users can run them on Prince compute nodes using SLURM's srun utility. Running interactive jobs on compute nodes does not impact other users, and in addition provides access to resources that are not available on the login nodes, such as interactive access to GPUs, high memory, or exclusive access to all the resources of a compute node. There is no partition on Prince reserved for interactive jobs.

Through srun, SLURM provides rich command-line options for requesting resources from the cluster for interactive jobs. Please see the examples with short accompanying explanations in the code block below, which should cover many use cases.

srun has an option "--x11" which enables X forwarding, so programs using a GUI can be used during an interactive session (provided you have X forwarding to your workstation set up). If necessary, please read the wiki pages on how to set up X forwarding for Windows and Linux / Mac workstations. NOTE: X forwarding is not required for the Slurm tutorial classroom exercises.
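The examples referred to above can be sketched as follows; the resource amounts are arbitrary illustrations, not recommendations:

```shell
# a basic interactive shell: 1 node, 1 task with 4 cores, 8GB, 2 hours
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --mem=8GB \
     --time=2:00:00 --pty /bin/bash

# an interactive session with one Tesla K80 GPU
srun --gres=gpu:k80:1 --partition=gpu --pty /bin/bash

# an interactive session with X forwarding, for GUI programs
srun --x11 --pty /bin/bash

# leave the session (and release its resources) with:
exit
```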

Return to Top

Running R batch jobs

Long-running and big data-crunching jobs ought to be submitted as batch jobs, so that they run in the background with Slurm driving their execution. Below are an R script "example.R" and a job script which can be used with the sbatch command to send a job to Slurm:
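As a sketch of the pair of files; the contents of example.R and the r module version are hypothetical stand-ins, so adjust the module line to what 'module avail r' shows on Prince:

```shell
# example.R -- a tiny R script standing in for real work
cat > example.R <<'EOF'
x <- rnorm(1000)
print(summary(x))
EOF

# run-R.s -- the batch wrapper submitted with sbatch
cat > run-R.s <<'EOF'
#!/bin/bash
#SBATCH --job-name=myRTest
#SBATCH --nodes=1 --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem=2GB
#SBATCH --output=slurm_%j.out

module purge
module load r/intel/3.3.2    # hypothetical version string

cd /scratch/$USER/myRtest
R --no-save -q -f example.R
EOF
```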

Run the job using "sbatch".

Return to Top

R Interactive session

The following example shows how to work with an interactive R session on a compute node:
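A sketch of such a session; the r module version is a hypothetical stand-in:

```shell
# from a login node, request an interactive shell on a compute node
srun --nodes=1 --ntasks=1 --mem=4GB --time=1:00:00 --pty /bin/bash

# once on the compute node, load R and start it
module purge
module load r/intel/3.3.2
R
# ...work interactively, then quit R with q() and the shell with exit
```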

Return to Top

Running GPU jobs

To request one GPU card, use SBATCH directives in job script:

#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

To request a specific card type, use e.g. --gres=gpu:k80:1 or --gres=gpu:p1080:1. As an example, let's submit an Amber job. Amber is a molecular dynamics software package. The recipe is:
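Spelled out as commands; the source path assumes the Amber files sit in the tutorial example directory named earlier (/share/apps/Tutorials/slurm/example):

```shell
# stage the Amber inputs in the directory the job script cd's into
mkdir -p /scratch/$USER/myambertest
cd /scratch/$USER/myambertest

# the three Amber input files plus the job script
cp /share/apps/Tutorials/slurm/example/inpcrd .
cp /share/apps/Tutorials/slurm/example/prmtop .
cp /share/apps/Tutorials/slurm/example/mdin .
cp /share/apps/Tutorials/slurm/example/run-amber.s .

sbatch run-amber.s
```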

From the tutorial example directory we copy over Amber input data files "inpcrd", "prmtop" and "mdin", and the job script file "run-amber.s". The content of the job script "run-amber.s" is:


#!/bin/bash
#
#SBATCH --job-name=myAmberJobGPU
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --mem=3GB
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

module purge
module load amber/openmpi/intel/16.06

cd /scratch/$USER/myambertest
pmemd.cuda -O


 
The demo Amber job should take ~2 minutes to finish once it starts running. When the job is done, several output files are generated. Check the one named "mdout", which has a section most relevant here:

 

|--------------------- INFORMATION ----------------------
| GPU (CUDA) Version of PMEMD in use: NVIDIA GPU IN USE.
|                    Version 16.0.0
|
|                      02/25/2016
[......]

|------------------- GPU DEVICE INFO --------------------
|
|            CUDA_VISIBLE_DEVICES: 0
|   CUDA Capable Devices Detected:      1
|           CUDA Device ID in use:      0
|                CUDA Device Name: Tesla K80
|     CUDA Device Global Mem Size:  11439 MB
| CUDA Device Num Multiprocessors:     13
|           CUDA Device Core Freq:   0.82 GHz
|
|--------------------------------------------------------

Return to Top

Running array jobs

Using a job array you may submit many similar jobs with almost identical requirements. This reduces the load on both the user and the scheduler system. Job arrays can only be used with batch jobs. Usually the only difference in requirements among the jobs in a job array is the input file or files. Please follow the recipe below to try the example. There are 5 input files, named 'sample-1.txt' through 'sample-5.txt' in sequential order. With the single command "sbatch --array=1-5 run-jobarray.s", you submit 5 jobs, each processing one of these input files individually.

The content of the job script 'run-jobarray.s' is copied below:


#!/bin/bash
#
#SBATCH --job-name=myJobarrayTest
#SBATCH --nodes=1 --ntasks=1
#SBATCH --time=5:00
#SBATCH --mem=1GB
#SBATCH --output=wordcounts_%A_%a.out
#SBATCH --error=wordcounts_%A_%a.err

module purge
module load python/intel/2.7.12

cd /scratch/$USER/myjarraytest
python wordcount.py sample-$SLURM_ARRAY_TASK_ID.txt



A job array submission sets the environment variable SLURM_ARRAY_TASK_ID, which is unique for each job in the array. It is usually embedded somewhere in the script, so that at run time each job's unique value is incorporated into producing a proper file name. Also as shown above, two additional placeholders, %A and %a, denoting the job ID and the task ID (i.e. the job array index) respectively, are available for specifying a job's stdout and stderr file names.
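To see how the variable expands in the script's last line, you can imitate one array task by hand (Slurm normally sets the variable itself):

```shell
# pretend we are task 3 of the array
SLURM_ARRAY_TASK_ID=3
echo "python wordcount.py sample-$SLURM_ARRAY_TASK_ID.txt"
# prints: python wordcount.py sample-3.txt
```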

Return to Top