This page assumes you are familiar with the Mercer cluster and its workload manager, PBS.


Mercer, the cluster most of you are familiar with, is in the process of being retired. Prince is the new HPC cluster, in production as of Spring 2017.

Introduction to Prince

 

Compute Resources Mercer vs Prince

 

 

Mercer Compute Nodes

Nodes | Cores/Node | CPU Type                    | Memory  | Network    | OS
16    | 12         | Westmere @ 2.67GHz          | 23 GB   | QDR 40Gb/s | CentOS 6.3
8     | 12         | Westmere @ 2.67GHz          | 46 GB   | QDR 40Gb/s | CentOS 6.3
68    | 12         | Westmere @ 2.67GHz          | 23 GB   | QDR 40Gb/s | CentOS 6.3
64    | 8          | Nehalem @ 2.67GHz           | 23 GB   | QDR 40Gb/s | CentOS 6.3
9     | 16         | Sandy Bridge @ 2.0GHz (GPU) | 126 GB  | QDR 40Gb/s | CentOS 6.3
8     | 20         | Ivy Bridge @ 2.5GHz (GPU)   | 126 GB  | FDR 56Gb/s | CentOS 6.3
64    | 20         | Ivy Bridge @ 3.0GHz         | 62 GB   | FDR 56Gb/s | CentOS 6.3
112   | 20         | Ivy Bridge @ 3.0GHz         | 62 GB   | QDR 40Gb/s | CentOS 6.3
48    | 20         | Ivy Bridge @ 3.0GHz         | 189 GB  | QDR 40Gb/s | CentOS 6.3
4     | 20         | Haswell @ 3.1GHz            | 505 GB  | FDR 56Gb/s | CentOS 6.3
2     | 20         | Ivy Bridge @ 3.0GHz         | 1490 GB | FDR 56Gb/s | CentOS 6.3

Cluster totals: 395 nodes, 5,312 cores, 25 TB total memory

Prince Compute Nodes

Nodes | Cores/Node | CPU Type                    | Memory  | Network     | OS
68    | 28         | Broadwell @ 2.60GHz         | 125 GB  | EDR 100Gb/s | CentOS 7.2
32    | 28         | Broadwell @ 2.60GHz         | 250 GB  | EDR 100Gb/s | CentOS 7.2
32    | 20         | Haswell @ 2.60GHz           | 62 GB   | FDR 56Gb/s  | CentOS 7.2
9     | 28         | Broadwell @ 2.60GHz (GPU)   | 250 GB  | EDR 100Gb/s | CentOS 7.2
4     | 28         | Broadwell @ 2.60GHz (GPU)   | 126 GB  | EDR 100Gb/s | CentOS 7.2
8     | 20         | Ivy Bridge @ 2.5GHz (GPU)   | 126 GB  | FDR 56Gb/s  | CentOS 7.2
64    | 20         | Ivy Bridge @ 3.0GHz         | 62 GB   | FDR 56Gb/s  | CentOS 7.2
112   | 20         | Ivy Bridge @ 3.0GHz         | 62 GB   | QDR 40Gb/s  | CentOS 7.2
48    | 20         | Ivy Bridge @ 3.0GHz         | 189 GB  | QDR 40Gb/s  | CentOS 7.2
4     | 20         | Haswell @ 3.1GHz            | 505 GB  | FDR 56Gb/s  | CentOS 7.2
2     | 20         | Ivy Bridge @ 3.0GHz         | 1490 GB | FDR 56Gb/s  | CentOS 7.2

Cluster totals: 383 nodes, 8,716 cores, 48 TB total memory

 

Filesystems on Prince

Space       | Environment variable | Purpose                                     | Flushed?                                   | Allocation (per user / total)
/archive    | $ARCHIVE             | Long-term storage                           | NO                                         | 2 TB / 700 TB
/scratch    | $SCRATCH             | File staging - frequent writing and reading | YES. Files unused for 60 days are deleted  | 5 TB / 1.1 PB
/share/apps | N/A                  | Software directory                          | NO                                         | N/A
/home       | $HOME                | Small files, code                           | NO                                         | 20 GB / 43 TB
/beegfs     | $BEEGFS              | File staging - frequent writing and reading | YES. Policy TBD                            | 2 TB / 500 TB

Fields highlighted in green indicate resources available on both Prince and Mercer.
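These environment variables are set in your login environment on Prince, so you can refer to the spaces without typing full paths, for instance (the archive file name below is just a placeholder):

$ cd $SCRATCH
$ cp my_results.tar.gz $ARCHIVE/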

 

For more details on Prince's hardware configuration, please see the page "Cluster - Prince".

 

Connecting to Prince

As with Mercer, if you are on the NYU network you can connect to the Prince login node prince.hpc.nyu.edu directly. If you are off campus, first log in to the bastion host hpc.nyu.edu (or hpc2.nyu.edu), and from there connect to the login node.

ITSs-Air-3:~ johd$ ssh prince.hpc.nyu.edu

johd@prince.hpc.nyu.edu's password: 

[johd@log-0 ~]$ 

ITSs-Air-3:~ johd$ ssh johd@hpc.nyu.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           WARNING:  UNAUTHORIZED PERSONS ........ DO NOT PROCEED
           ~~~~~~~   ~~~~~~~~~~~~~~~~~~~~          ~~~~~~~~~~~~~~
 This computer system is operated by New York University (NYU) and may be
 accessed only by authorized users.  Authorized users are granted specific,
 limited privileges in their use of the system.  The data and programs
 in this system may not be accessed, copied, modified, or disclosed without
 prior approval of NYU.  Access and use, or causing access and use, of this
 computer system by anyone other than as permitted by NYU are strictly pro-
 hibited by NYU and by law and may subject an unauthorized user, including
 unauthorized employees, to criminal and civil penalties as well as NYU-
 initiated disciplinary proceedings.  The use of this system is routinely
 monitored and recorded, and anyone accessing this system consents to such
 monitoring and recording.  Questions regarding this access policy should be
 directed (by e-mail) to askits@nyu.edu or (by phone) to 212-998-3333.
 Questions on other topics should be directed to COMMENT (by email) or to
 212-998-3333 by phone.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
johd@hpc.nyu.edu's password: 
Last login: Wed Jan 11 09:59:54 2017 from ool-18befec8.dyn.optonline.net


[johd@hpc ~]$ ssh prince.hpc.nyu.edu
johd@prince.hpc.nyu.edu's password: 
Last login: Sat Jan 14 11:45:42 2017 from hpc2.es.its.nyu.edu

[johd@log-1 ~]$

Slurm Job Scheduler

The Prince cluster uses a new job scheduler, Slurm (Simple Linux Utility for Resource Management). Slurm is an open-source, fault-tolerant, and highly efficient cluster management and job scheduling system.

Basic Terminology

 

Partition - a logical group of compute nodes. It can be thought of as a job queue with a set of constraints, such as a job size limit, a job time limit, and the users permitted to run jobs on it. Partitions may or may not overlap.

Job - a unit of execution, started by the user from the command line with sbatch, srun or salloc.

Step - a set of tasks within a job, created with the srun command. If srun is invoked from the command line, the resulting interactive job has a single step. If srun is called from a batch script, each srun invocation becomes a new step of the job. Steps run sequentially, although each step can start multiple parallel tasks (see the sketch after this list).

Task - a single process in a job. If a job has steps, each step has at least one task.

Array - a set of similar jobs, submitted automatically and almost simultaneously with the same initial requirements.
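
As a rough illustration of how jobs, steps and tasks relate, here is a minimal sketch of a batch script that runs two steps (the job name and resource numbers are arbitrary placeholders):

#!/bin/bash
#SBATCH --job-name=steps_demo
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# Step 0: srun launches 4 parallel tasks (one per requested task slot)
srun hostname

# Step 1: starts only after step 0 has finished; a single task this time
srun --ntasks=1 echo "second step"

Submitting this script with sbatch creates one job with two steps; step 0 runs four tasks and step 1 runs one.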

 

Translation from PBS to Slurm

 

User Commands              | PBS/Torque                                        | Slurm
Job submission             | qsub [script_file]                                | sbatch [script_file]
Job deletion               | qdel [job_id]                                     | scancel [job_id]
Job status (by job)        | qstat [job_id]                                    | squeue -j [job_id]
Job status (by user)       | qstat -u [user_name]                              | squeue -u [user_name]
Queue list                 | qstat -Q                                          | squeue
Node list                  | pbsnodes -l all                                   | sinfo -N OR scontrol show nodes
Cluster status             | qstat -a                                          | squeue
Graphical cluster status   | pbstop                                            | slurmtop

Environment                | PBS/Torque                                        | Slurm
Starting directory on compute node | User's home directory                     | The working (submit) directory
Job ID                     | $PBS_JOBID                                        | $SLURM_JOBID
Submit Directory           | $PBS_O_WORKDIR                                    | $SLURM_SUBMIT_DIR
Submit Host                | $PBS_O_HOST                                       | $SLURM_SUBMIT_HOST
Node List                  | $PBS_NODEFILE                                     | $SLURM_JOB_NODELIST
Job Array Index            | $PBS_ARRAYID                                      | $SLURM_ARRAY_TASK_ID

Job Specification          | PBS/Torque                                        | Slurm
Script directive           | #PBS                                              | #SBATCH
Job Name                   | -N [name]                                         | --job-name=[name]
Node Count                 | -l nodes=[count]                                  | -N [min[-max]]
CPU Count                  | -l ppn=[count] OR -l mppwidth=[PE_count]          | -n [count]
Memory Size                | -l mem=[MB]                                       | --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T]
Wall Clock Limit           | -l walltime=[hh:mm:ss]                            | -t [min] OR -t [days-hh:mm:ss]
Standard Output File       | -o [file_name]                                    | -o [file_name]
Standard Error File        | -e [file_name]                                    | -e [file_name]
Combine stdout/err         | -j oe (both to stdout) OR -j eo (both to stderr)  | (use -o without -e)
Tasks Per Node             | -l mppnppn [PEs_per_node]                         | --tasks-per-node=[count]
CPUs Per Task              | N/A                                               | --cpus-per-task=[count]
Event Notification         | -m abe                                            | --mail-type=[events]
Email Address              | -M [address]                                      | --mail-user=[address]
Job Dependency             | -d [job_id]                                       | --depend=[state:job_id]
Queue                      | -q [queue]                                        | -p [queue]
Job Arrays                 | -t [array_spec]                                   | --array=[array_spec]

The table above was adapted from the SchedMD Rosetta Stone: https://slurm.schedmd.com/rosetta.pdf

 

Submit jobs - [sbatch]

Batch jobs are submitted with the sbatch command. As with qsub in Torque, we create a bash script describing the job requirements: the resources needed, the software to run, how much memory and how many CPUs to use, and where to send the job's standard output and error. After a job is submitted, Slurm finds suitable resources, schedules and drives the job execution, and reports the outcome back to the user, who can then come back and look at the output files.

Converting a PBS batch script to a Slurm batch script is generally a straightforward process: each #PBS directive maps to the corresponding #SBATCH option from the translation table above, as illustrated below.
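
Here is a minimal sketch of a Slurm batch script with the equivalent PBS directives shown as comments (the job name, resource values and program name are placeholders, not recommendations):

#!/bin/bash
# PBS equivalent: #PBS -N myTest
#SBATCH --job-name=myTest
# PBS equivalent: #PBS -l nodes=1:ppn=4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
# PBS equivalent: #PBS -l mem=4gb
#SBATCH --mem=4GB
# PBS equivalent: #PBS -l walltime=01:00:00
#SBATCH --time=01:00:00
# PBS equivalent: #PBS -o myTest.out
#SBATCH --output=myTest.out

# Unlike PBS, Slurm starts the job in the submit directory,
# so there is no need for 'cd $PBS_O_WORKDIR'.
module purge
./my_program   # placeholder for the actual executable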

          

 

You can find the full list of job specification options in the sbatch man page:

$ man sbatch

 

Monitor cluster/job status - [sinfo, squeue, sstat, sacct]

 
The sinfo command reports the cluster status, by default listing all partitions. Partitions group compute nodes into logical sets serving different purposes, such as interactive use, visualization and batch processing.
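
For illustration, the default output looks roughly like this (partition names are taken from the description below; node counts, time limits and node names are placeholders, not actual cluster data):

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c01_25*      up 7-00:00:00      2    mix c01-[01-02]
c01_25*      up 7-00:00:00     98   idle c01-[03-25],c02-[01-25],...
c26          up 7-00:00:00     32   idle c26-[01-32]
c27          up 7-00:00:00      1    mix c27-01
c27          up 7-00:00:00     31   idle c27-[02-32]
gpu          up 2-00:00:00     21   idle gpu-[01-21]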

 

By default, sinfo prints information aggregated by partition and node state. As shown above, there are four partitions, namely c01_25, c26, c27 and gpu; the partition marked with an asterisk is the default one. Apart from the two lines with node state 'mix', which means some of those nodes' CPU cores are occupied, all other nodes are idle. Some of the common node states are listed in the table below:

 

Node State  | Meaning
allocated   | The node has been allocated to one or more jobs
completing  | All jobs associated with this node are in the process of COMPLETING
down        | The node is unavailable for use
drained     | The node is unavailable for use per system administrator request
idle        | The node is not allocated to any jobs and is available for use
mixed       | The node has some of its CPUs ALLOCATED while others are IDLE
reserved    | The node is in an advanced reservation and not generally available

 

The squeue command lists jobs that are running, pending, completing, and so on. It can also display only the jobs owned by a specific user or a specific job ID.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9874     c01_25 model_ev   johd  R      17:00      4  c13-[01-04]
              9868       gpu relases-   xh814  R   17:45:45      1  gpu-01
              9869       gpu amberGPU   xh814  R    1:30:19      1  gpu-01
              9873       gpu  pemed_1    johd  R      17:08      1  gpu-02
 
$ squeue -u johd
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9874    c01_25 model_ev     johd  R      22:19    4  c13-[01-04]
$ squeue -j 9877
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9877       gpu  pemed_1   johd  R       0:10      1  gpu-02

Most of the columns in the output of the squeue command are self-explanatory. 

The column "ST" in the middle is the job state, which can be:

  • PD - pending: waiting for resource allocation
  • S  - suspended
  • R  - running
  • F  - failed: non-zero exit code or other failures
  • CD - completed: all processes terminated with zero exit code
  • CG - completing: in the process of completing; some processes may still be alive

The last column, "NODELIST(REASON)", shows the nodes allocated to a running job or, for a pending job, the reason it is waiting, which can be (an scontrol example follows the list):

  • JobHeldUser:            the job was held by the user

  • Priority:               higher priority jobs exist
  • Resources:              waiting for resources to become available
  • BeginTime:              start time not reached yet
  • Dependency:             waiting for a job it depends on to finish
  • QOSMaxCpuPerUserLimit:  the per-user limit on CPU cores has been reached
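
To see the full record for a single job, including the reason it is still pending, you can also query scontrol (the job ID below is just the one used in the squeue examples above):

$ scontrol show job 9877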

Run the sstat command to display various information about a running job or step. Run the sacct command to check historical accounting information about jobs and job steps from the Slurm log or database. Both commands accept a '--helpformat' option that lists the available output columns.
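
For example, to look at a running job and then at its accounting record (the job ID is taken from the squeue examples above, and the format columns are just a reasonable starting selection):

$ sstat -j 9874 --format=JobID,AveCPU,AveRSS,MaxRSS
$ sacct -j 9874 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
$ sacct --helpformat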

 

Cancel a job - [scancel]

Things can go wrong, or at least not the way you expected. Should you decide to terminate a job before it finishes, scancel is the tool for that.

$ squeue -j 9877
             JOBID PARTITION     NAME     USER ST      TIME  NODES NODELIST(REASON)
              9877       gpu  pemed_1   johd  R       9:04      1  gpu-02
$ scancel 9877

Running Interactive Jobs

You can run an interactive job with Slurm using the srun command, for example:
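
A minimal sketch (the resource values are placeholders; adjust them to what your work actually needs):

$ srun --nodes=1 --ntasks-per-node=1 --mem=4GB --time=02:00:00 --pty /bin/bash

This allocates the requested resources on a compute node and opens a shell there; type exit to end the interactive job and release the allocation.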

 

Running array jobs

Using a job array, you can submit many similar jobs with almost identical requirements, which makes life easier both for users and for the scheduler. Job arrays can only be used with batch jobs. Usually the only difference among the jobs in an array is the input file or files.

The example below describes running an array of read-alignment jobs with bowtie. First, we create a batch script:
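
Here is a minimal sketch of what such a script, array.sh, might look like; the module version, bowtie index name and input file naming scheme (sample_1.fastq ... sample_10.fastq) are assumptions to be adapted to your own data:

#!/bin/bash
#SBATCH --job-name=bowtie_array
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8GB
#SBATCH --time=02:00:00
#SBATCH --output=bowtie_%A_%a.out
#SBATCH --array=1-10

module purge
module load bowtie/intel/1.2.0   # assumed module name; check 'module avail bowtie'

# Each array task selects its own input file via SLURM_ARRAY_TASK_ID
INPUT=sample_${SLURM_ARRAY_TASK_ID}.fastq
bowtie -p $SLURM_CPUS_PER_TASK -S my_index $INPUT > sample_${SLURM_ARRAY_TASK_ID}.sam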

 

 

Then we submit it with the sbatch command:

$ sbatch array.sh
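
sbatch responds with the ID of the array job; the individual tasks then appear in squeue with the array index appended after an underscore (for example 12345_1, 12345_2, and so on). A quick way to watch them:

$ squeue -u $USER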

 

 

 
