Warning

Running jobs from /home is a serious violation of HPC policy. Any user who intentionally violates this policy will have their account suspended. The /home SSDs are not designed to serve as scratch disks; using them this way will wear them out quickly.

Warning

You cannot submit jobs until you pass our online training: Dalma Training

Prerequisite

Make sure you know basic Linux usage. Useful links:

Usage Model

Dalma is accessed through a dedicated set of login nodes, which are designed for lightweight, short tasks. Access to the compute, GPU and visualization nodes for production runs is controlled by the workload manager Slurm. Production jobs are submitted to Slurm from the login nodes; Slurm then schedules and runs them on the compute nodes.
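For example, a minimal end-to-end workflow looks like the sketch below (my-job.sh is a placeholder for your own job script):

Code Block
languagebash
titleSketch of the Dalma Workflow
# From your own machine, connect to a login node
ssh <your-NetID>@dalma.abudhabi.nyu.edu
# On the login node, submit a job script to Slurm (my-job.sh is a placeholder)
sbatch my-job.sh
# Check the status of your jobs in the queue
squeue -u <your-NetID>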

 

Gliffy Diagram: Dalma Access Model

Getting and Renewing an Account

Please follow the instructions here: Accounts

Access

Once you have an HPC account, you are ready to access the cluster. In the simplest case, use ssh in your terminal:

Code Block
languagebash
themeMidnight
titleLogin Dalma
ssh <your-NetID>@dalma.abudhabi.nyu.edu

If you use Windows, or are outside the NYU AD/NY network, follow the instructions here: Access Dalma.

 

Storage System

Right after logging in to Dalma, you are automatically placed in $HOME. Dalma storage consists of 4 filesystems: /home/<Net-ID>, /fastscratch/<Net-ID>, /scratch/<Net-ID> and /archive/<Net-ID>, which can be referenced through the environment variables $HOME, $FASTSCRATCH, $SCRATCH and $ARCHIVE respectively. $ARCHIVE can NOT be accessed directly on login or compute nodes (see the tutorial here for usage: The guide to Archive on Dalma).

Code Block
languagebash
themeMidnight
titleAccess different filesystems using environment variables
# It's much less error-prone to access the different filesystems using environment variables than absolute paths
# For example
cd $HOME
# is equivalent to 
cd /home/<your-NetID>
# or
ls $SCRATCH

Usage of the filesystems is summarized below.

Warning
titleSummary of storage

Submit your job and prepare your input / output in $SCRATCH

Put your source code, applications and executables in $HOME. NO JOBS SHOULD BE RUN FROM $HOME

Back up your data in $ARCHIVE

Contact us for usage in $FASTSCRATCH
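For example, a common pattern is to keep your executable and job script in $HOME, stage the input in a run directory under $SCRATCH, and submit from there. A sketch (the directory, input and script names are placeholders):

Code Block
languagebash
titleSketch: Running from $SCRATCH
# Create a working directory for this run in $SCRATCH (my-run is a placeholder)
mkdir -p $SCRATCH/my-run
cd $SCRATCH/my-run
# Stage the input files next to where the job will run
cp $HOME/my-project/input.dat .
# Submit the job from $SCRATCH, never from $HOME
sbatch $HOME/my-project/my-job.sh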

If you encounter a 'disk quota exceeded' error or similar, you have breached the disk quota, either in data size or in number of files, on one or more of the filesystems. Running myquota in a terminal on Dalma shows your current usage and quotas.

Note
titleHOME is limited.

The quota of $HOME is only 5GB. Run myquota in a terminal on Dalma to check your current usage and quota.

We urge our users to clean up their storage.

Warning
titleBackup your FASTSCRATCH and SCRATCH

Files not accessed for more than 90 days on $FASTSCRATCH and $SCRATCH will be deleted.

Backing up your data is your own responsibility. For example, if you delete something accidentally, we unfortunately cannot recover it.

 

 

Data Transfer

You can use either the terminal or FileZilla to transfer your data to and from Dalma, as instructed here: File Transfer using rsync and File Transfer using FileZilla.
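For example, a typical rsync invocation to copy a local directory into your $SCRATCH looks like the sketch below (run it on your own machine; my-data is a placeholder):

Code Block
languagebash
titleSketch: Copying Data to $SCRATCH with rsync
# -a: archive mode, -v: verbose, -P: keep partial transfers and show progress
rsync -avP my-data/ <your-NetID>@dalma.abudhabi.nyu.edu:/scratch/<your-NetID>/my-data/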

Hardware Overview

Warning

Users cannot request all of the physical memory of a node in their job scripts; some memory is reserved for the system. See the table below.

Node Type                | Number of Nodes | Hardware per Node                           | Maximum Memory per Node a User Can Request | Note
Standard Compute         | 236             | 128GB, 28 cores, Broadwell                  | 112GB                                      |
Fat                      | 8               | 192GB memory, 12 cores, Westmere            | 180GB                                      |
Super Fat                | 1               | 1TB memory, 32 cores, Westmere              | 1000GB                                     |
Ultra Fat                | 1               | 2TB memory, 72 cores, Broadwell             | 2000GB                                     | Consult with us for access to this node
GPU (BuTinah)            | 16              | 96GB memory, 12 cores, Westmere, NVIDIA GPU | 90GB                                       |
Login                    | 4               | 128GB, 28 cores, Broadwell                  |                                            |
Visualization (incoming) |                 |                                             |                                            |
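If your job needs a specific amount of memory, you can request it explicitly in the job script with the --mem directive, staying within the per-node limits in the table above. A sketch for a standard compute node:

Code Block
languagebash
titleSketch: Requesting Memory per Node
# Request the full 112GB available to users on a standard compute node
#SBATCH --mem=112G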

 

 

Software Overview

A new Module Environment is now available on Dalma, as part of the User Centric Approach we have been promoting at NYUAD to manage the software stack. This new Module Environment, NYUAD 3.0, overcomes the flaws of the traditional modules environment when it is used to manage complex modern software environments.

First, check which applications are available:

Code Block
languagebash
themeMidnight
# Run the following commands after logging in to Dalma
module avail

Then select the desired software to load. The following example shows how to load a self-sufficient, single-application environment for Gromacs.

Code Block
languagebash
themeMidnight
titleLoad a self-sufficient, single-application environment for Gromacs
# Run the following commands after logging in to Dalma
module load NYUAD/3.0
module load gromacs
# or use the full module name
module load gromacs/5.0.4

The following example shows how to load an environment for compiling source code from scratch.

Code Block
languagebash
themeMidnight
titleLoad GCC, OpenMPI and FFTW for Compiling Source Code
# Run the following commands after logging in to Dalma
module load NYUAD/3.0
module load gcc
# multiple modules could be loaded in one line
module load openmpi fftw3

If you cannot find a certain version of a software package (for example, you are looking for Python 3 but only Python 2 appears to be available), run the following commands to make all modules visible first.

Code Block
languagebash
themeMidnight
# Run the following commands after logging in to Dalma
module load all
module avail python
--------------------------------------- /share/apps/NYUAD/modules/ALL -------------------------------
python/2.7.11 python/3.5.1

As you can see, Python 3 is now visible. You can load it by specifying the full module name.

Code Block
languagebash
themeMidnight
module load python/3.5.1

 


Batch System

The batch system on Dalma is Slurm (Simple Linux Utility for Resource Management), a free, open-source resource manager originally developed at LLNL. As on most supercomputers, production jobs on Dalma are submitted to the batch system. To submit a job, you create a submission script in which you specify your resource requirements. Before jobs are dispatched to run, they wait in partitions for available processing resources. There are partitions for various types of use: the parallel partition allocates entire nodes to a job (i.e. only one job per node), while the serial partition allows multiple jobs to share one node.

Interactive Sessions

Computationally heavy jobs are not allowed on login nodes. Use an interactive session instead. To start an interactive session, use the srun command:

Code Block
srun --pty -n 1 /bin/bash

To exit the interactive session, type Ctrl+d, or 

Code Block
exit
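srun accepts the same resource options as batch jobs, so you can also ask for a specific partition, task count or time limit for the interactive session. A sketch (the values are illustrative):

Code Block
titleSketch: Interactive Session with Explicit Resources
# 4 tasks on the serial partition, for at most 1 hour
srun -p serial -n 4 -t 01:00:00 --pty /bin/bash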

 

Available Partitions (Queues)

The most commonly used partitions are:

  1. serial: for jobs using no more than 1 node.
  2. parallel: for jobs using more than 1 node.
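To see the partitions and their current state, you can use the standard Slurm command sinfo:

Code Block
titleListing Partitions with sinfo
# Show partitions, time limits and node states
sinfo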

Job Limit

Run 

Code Block
themeMidnight
slurm-show-my-limits.sh

to check your job limits. 

Writing a Batch Script

A job script is a text file describing the job and the resources it requires. Slurm has its own directives, but it is similar in many ways to PBS or LSF. Moreover, Slurm maintains good compatibility with PBS scripts; in many cases a PBS script is directly acceptable.

Warning

You cannot submit jobs until you pass our online training: Dalma Training

Warning

Any job with ntasks <= 28 should use #SBATCH -p serial. Any job with ntasks > 28 should use #SBATCH -p parallel and set ntasks to be divisible by 28 (if not using MPI-OpenMP hybrid parallelization).
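For an MPI-OpenMP hybrid job, it is the product of tasks and threads per task that should fill whole nodes, rather than ntasks alone. A sketch of the relevant directives (the values are illustrative):

Code Block
languagebash
titleSketch: MPI-OpenMP Hybrid Resource Request
# 8 MPI tasks x 7 OpenMP threads each = 56 cores = 2 full 28-core nodes
#SBATCH -p parallel
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=7
# Match the OpenMP thread count to the cores allocated per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK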

Serial Job Example

  1. A typical Slurm serial job script looks like this. Let's say you save it as serial-job.sh:

    Code Block
    languagebash
    themeRDark
    titleTypical Serial Job Script in Slurm
    #!/bin/bash
    #SBATCH -p serial
    # Set number of tasks to run
    #SBATCH --ntasks=1
    # Walltime format hh:mm:ss
    #SBATCH --time=00:30:00
    # Output and error files
    #SBATCH -o job.%J.out
    #SBATCH -e job.%J.err
    
    # **** Put all #SBATCH directives above this line! ****
    # **** Otherwise they will not take effect! ****
    #
    # **** Actual commands start here ****
    # Load modules here (safety measure)
    module purge
    # You may need to load gcc here .. This is application specific
    # module load gcc 
    # Replace this with your actual command. 'serial-hello-world' for example
    hostname

    Below, you will find a generic Slurm job script with a gentle explanation of each directive.

  2. Then you can submit the saved job script serial-job.sh with:

    Code Block
    titleSubmitting a Serial Job
    sbatch serial-job.sh

Parallel Job Example

Warning

You cannot submit jobs until you pass our online training: Dalma Training

  1. A typical Slurm parallel job script looks like this. Let's say you save it as parallel-job.sh:

    Code Block
    languagebash
    themeMidnight
    titleTypical Parallel Job Script in Slurm
    #!/bin/bash
    #SBATCH -p parallel
    # Set number of tasks to run
    # To maximize the performance, set ntasks to be divisible by 28, e.g., 56, 84...
    #SBATCH --ntasks=56
    # Walltime format hh:mm:ss
    #SBATCH --time=00:30:00
    # Output and error files
    #SBATCH -o job.%J.out
    #SBATCH -e job.%J.err
    
    # **** Put all #SBATCH directives above this line! ****
    # **** Otherwise they will not take effect! ****
    #
    # **** Actual commands start here ****
    # Load modules here (safety measure)
    module purge
    # You may need to load gcc here .. This is application specific
    # module load gcc 
    # Replace this with your actual command. 'srun roms' for example
    srun hostname
  2. Then you can submit the saved job script parallel-job.sh with:

    Code Block
    titleSubmitting a Parallel Job
    sbatch parallel-job.sh

Submitting a Job

Warning

Please be aware that submitting jobs is only possible from login nodes at the moment. Contact us if you need help.

The sbatch command is used for submitting jobs. A simple example:

Code Block
titleMinimal Example of Job Submission
sbatch <jobscript>

After submission, sbatch returns the ID of the new job. Once the job is scheduled to run, the script is executed on the first compute node of the allocation.
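For example, submitting the serial script from above (the job ID shown is illustrative):

Code Block
titleExample Output of sbatch
$ sbatch serial-job.sh
Submitted batch job 31408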

Checking Job Status

Before and During Job Execution

To list your current jobs in the queue:

Code Block
languagebash
titleList all current jobs for a user
squeue -u <your-NetID>

Example output:

Code Block
[gh50@login-0-1 ~]$ squeue -j 31408
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             31408   ser_std  job1.sh     gh50  R       0:02      1 compute-21-4
This means that job 31408 has been running (ST: R) for 2 seconds on node compute-21-4.

For more verbose information, use scontrol show job.
Code Block
titleGetting Verbose Information on a Job
scontrol show job <jobid>

After Job Execution

Once the job has finished, it can no longer be inspected with squeue or scontrol show job. At this point, you can inspect the job with sacct.

Code Block
languagebash
titleChecking a Job
sacct -j <jobid>

The following command gives you extremely verbose information on a job.

Code Block
titleGetting Verbose Information on a Job
sacct -j <jobid> -l

 

Canceling a Job

If you decide to end a job prematurely, use scancel:

Code Block
scancel <jobid>
Warning
titleUse with Caution

To cancel all jobs from your account, run the following in a terminal on Dalma:

Code Block
scancel -u <NetID>

 

 
