Section
Column
width20%

Include Page
Quick Links
Quick Links

Panel

Include Page
FAQ Index
FAQ Index

Column

Anchor
CLUSTERNAME
CLUSTERNAME
Where am I?

Expand
titleHow can I show the cluster name in the command prompt?

The login nodes of each cluster have names like "login-0-1". You can add the following idiom to your $HOME/.bashrc file to set the prompt to the name of the cluster:

Code Block
# what cluster am I on?
CLUSTERNAME=NYU-HPC
ip=$(host $(hostname))
[[ $ip =~ 10\.0\.254\.[0-9]+$ ]] && CLUSTERNAME=Prince
[[ $ip =~ 10\.0\.255\.[0-9]+$ ]] && CLUSTERNAME=Dumbo
PS1='['$CLUSTERNAME': \u@\h \W]\$ '

Anchor
HPCACCOUNT
HPCACCOUNT
HPC Account

Expand
titleHow can I continue to access HPC after I leave/graduate from NYU?

As long as a full-time NYU faculty member is willing to sponsor your HPC account, you remain eligible for NYU HPC access. If your NYU Home status has expired, ask your advisor at NYU to send an email to its.accounts@nyu.edu requesting that your NYU Home and Google Apps access be extended for another year for HPC access. After that, to reset your NYU password, please go to the page
https://start.nyu.edu/

Then you can follow the instructions in the NYU HPC wiki to renew your HPC account:
https://wikis.nyu.edu/display/NYUHPC/Getting+or+renewing+an+HPC+account

Anchor
SOMETHING_WRONG
SOMETHING_WRONG
Something went wrong!
Anchor
EXPIRED
EXPIRED

Expand
titleMy account expired! What should I do? Is my data gone forever?

You can renew your account even after it has expired, see Getting or renewing an HPC account for how to renew. Data on /home, /archive and /work is deleted 90 days after account expiration. Data on /scratch is deleted after 60 days of not being used, whether your account is current or expired.

Anchor
LS_SCRATCH
LS_SCRATCH

Expand
titleWhy is "ls" on /scratch so slow?
Lustre stores the file itself and the file metadata (its name, size, etc.) separately. When you issue a simple 'ls' command, a remote procedure call (RPC) is made to the metadata server (MDS), which returns a list of the files in the current directory. If you add certain options, such as -l or --color=tty, then for each file in the list, ls calls stat() on that file. Each stat() call involves an RPC to the MDS and an RPC to the object storage server (OSS) which holds the file itself. These RPCs, especially those to the OSS, can take a long time.
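For example, on a large directory a bare listing is much faster than one that stats every file (the path is illustrative):

Code Block
# Fast: a single RPC to the MDS returns the file names
ls /scratch/$USER/big_dir

# Slow: one stat() per file, each needing further RPCs to the MDS and OSS
ls -l /scratch/$USER/big_dir

# If your shell aliases ls to 'ls --color=tty', bypass the alias
\ls /scratch/$USER/big_dir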

Anchor
CANT_LOGIN
CANT_LOGIN

Expand
titleI can't log in
Include Page
i cannot login
i cannot login

Anchor
SSH_KEY_WARN
SSH_KEY_WARN

Expand
titleWhen trying to login, I get warnings about "HOST IDENTIFICATION HAS CHANGED"
Recent versions of OSX and Ubuntu ship a newer version of ssh than the NYU HPC clusters run. You can prevent the warnings with two steps:
  1. Update .ssh/config on your workstation according to the example here.
  2. Delete your .ssh/known_hosts file. The first time you connect to each host you will then be asked to confirm it as a new host; you can safely answer "yes".
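The exact example lives on the linked page; one commonly used ~/.ssh/config entry for this kind of client/server version mismatch pins the older host key algorithm (a sketch, not necessarily the wiki's exact settings):

Code Block
# ~/.ssh/config - re-enable the older host key algorithm for the HPC hosts only
Host *.hpc.nyu.edu
    HostKeyAlgorithms +ssh-dss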

Anchor
SCRATCH_FLUSHING
SCRATCH_FLUSHING

Expand
titleWhat happened to my data on /scratch?
The /scratch filesystem is a short-term filesystem providing fast I/O for running jobs. It is not backed up; moreover, files which remain unused for a period of time are flushed. See Storage July 2017 - especially /scratch policy - for further information.
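To see which of your files are at risk, you can check access times yourself (the 60-day threshold is illustrative; see the policy page for the current value):

Code Block
# List files under your /scratch area not accessed in the last 60 days
find /scratch/$USER -type f -atime +60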

Anchor
LIB_WIFI
LIB_WIFI

Expand
titleQ: In the library, my wireless connection keeps dropping out. How can I fix it?
Include Page
library wifi
library wifi

Anchor
MODULE_CMD
MODULE_CMD

Expand
titleQ: I used "module load" and it failed with a "module: command not found" error
 
Excerpt Include
module command not found error
module command not found error
nopaneltrue

Anchor
TTY_WARNING
TTY_WARNING

Expand
titleWarning: no access to tty (Bad file descriptor), Thus no job control in this shell
It's harmless and does not indicate an error. This innocuous warning simply means you are running a script (rather than a binary) under a job that has no access to a TTY. In other words, you cannot interrupt it (^C), suspend it (^Z) or use other interactive commands, because there is no screen or keyboard attached to the job. It can be safely ignored.

Anchor
NO_DISPLAY
NO_DISPLAY

Expand
titleI get an error "Warning: no display specified." when I use the -X flag with ssh
Include Page
Preparing your Mac for X
Preparing your Mac for X
Include Page
Preparing Windows for X
Preparing Windows for X

Anchor
KILLED_JOB
KILLED_JOB

Expand
titleWho killed my job, and why?
The most likely culprit is the batch system. The other prime suspect is us, the HPC system administrators.

If the batch system killed your job, it probably did so because the job exceeded the amount of memory or CPU time requested (which may have been some default, if you did not explicitly request these). See Submitting a job - Resource limits for help on this, and this overview of scheduling for a more general introduction.

If we killed your job, it was probably to prevent the system from crashing, which can happen when a job's runtime behavior puts a certain type of load on the system. The most common trouble is non-optimal use of the /scratch filesystem, described at the bottom of the table on the Storage page. The next most common reason is that you were running your job on the login node rather than through the batch system - see Running jobs - Nodes for more on this. There are a few circumstances where the login node is the only option - especially archiving files to your $ARCHIVE directory - if you experience trouble with this, please contact us.

The /scratch filesystem is configured for large-block I/O, such as sequential reading and writing of large files. Individual I/O operations are relatively costly, however, so programs using frequent, small I/O accesses put a heavy load on the metadata servers, which in extreme cases can cause the system to become unstable. The system administrators generally detect this quickly and may kill a job whose I/O characteristics are stretching the capabilities of the filesystem (if this happens, we will contact you to help configure your job for better stability and performance).

Anchor
LOGIN_JOBS
LOGIN_JOBS

Expand
titleI got an email "Please do not run jobs on login nodes"
 The login nodes are a shared resource intended for editing scripts, compiling small amounts of code and moving data about. This is enforced via process size limits - if a command on a login node runs for too long or uses too much memory, the system will kill it and send you this email. The likely causes are:
  • You are trying to run a simulation interactively instead of via the batch system. Please read Running jobs on the NYU HPC clusters, especially Writing and submitting a job for how to do this. If you need interactive use, read Submitting a job - Working interactively
  • You are running a command on the login node that takes longer or needs more memory than you expected. Common causes include:
    • Opening a large (more than 1GB) file with Vim (try using "less" instead)
    • Compiling a large and complex source file
    • Compressing a large file with gzip
In most cases the best solution is to start an interactive session on a compute node, requesting sufficient resource for the task at hand - see Submitting a job - Working interactively
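A minimal sketch of requesting an interactive session under Torque (the resource values are illustrative, not recommendations):

Code Block
# Request an interactive session on a compute node: 1 core, 8gb memory, 4 hours
qsub -I -l nodes=1:ppn=1,mem=8gb,walltime=04:00:00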

Anchor
RUNNING_JOBS
RUNNING_JOBS
Running Jobs
Anchor
RESOURCES
RESOURCES

Expand
titleWhat resources can and should I request?

 

Excerpt Include
Requesting resources
Requesting resources
nopaneltrue

For further details, see Submitting a job - Resource limits

Anchor
JOB_DEPS
JOB_DEPS

Expand
titleCan I make sure a job gets executed only after another one completes?

A: Yes

Include Page
qsub job dependencies
qsub job dependencies
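A minimal sketch of the usual Torque idiom (the script names are illustrative; details are on the included page):

Code Block
# Capture the job ID of the first job...
JOBID=$(qsub first_job.pbs)
# ...then hold the second job until the first completes successfully
qsub -W depend=afterok:$JOBID second_job.pbs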

Anchor
JOB_OUTPUT
JOB_OUTPUT

Expand
titleWhere did my job output go?

Excerpt Include
Job stdout and stderr
Job stdout and stderr
nopaneltrue
 

Anchor
GPU
GPU

Expand
titleHow do I use GPUs?

 

Excerpt Include
Running jobs - GPU
Running jobs - GPU
nopaneltrue

Anchor
NODE_LOGIN
NODE_LOGIN

Expand
titleHow do I log in to a specific node?

A1: You can ssh to a specific login node. The login nodes on bowery are named login-0-0, login-0-1, login-0-2, login-0-3.

Panel

$ ssh login-0-0

A2: You can ssh to a specific compute node if you have a job running on it. To find out which nodes your job is running on use:

Panel

$ qstat -n jobid

You will see the usual qstat output followed by a list of the nodes and cores your job is allocated to. The list will look something like:

Code Block
compute-6-3/11+compute-6-3/10

In this example, the list shows cores 11 and 10 on node compute-6-3.

Anchor
MONITOR_JOB
MONITOR_JOB

Expand
titleHow can I ensure my resource-intensive job is running smoothly?

A: After submitting jobs, you can locate where they are executing by running pbstop -u NetID. You can then monitor them by logging in to the corresponding compute nodes and running top, which shows both CPU and memory consumption. If your job leaves little memory (or even little swap) free, you should increase the "ppn" number in your PBS script, or consider taking advantage of the nodes with larger memory.
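For example (the NetID and node name are illustrative):

Code Block
# Find the compute nodes your jobs are running on
pbstop -u abc123
# Log in to one of those nodes and watch CPU and memory usage
ssh compute-6-3
top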

Excerpt Include
Monitoring jobs - pbstop
Monitoring jobs - pbstop
nopaneltrue
  (more...)

Anchor
LONG_JOB
LONG_JOB

Expand
titleMy job will take longer than 48 hours, what should I do?

For long-running jobs, we suggest using Checkpoint-Restart to split the job into a series of shorter jobs. We realize though that this is not always possible - if you need the walltime limit for a job extended, contact us.  

Anchor
NEED_SERVICE
NEED_SERVICE

Expand
titleMy job needs (MySQL, some other service) to be running

Contact us.  

Expand
titleI want to run a job at 9am every day

(still to come) 

Anchor
STATA
STATA

Expand
titleHow do I run a STATA job?

(more...)

Anchor
GAUSSIAN
GAUSSIAN

Expand
titleHow do I run a Gaussian job?

(more...)

Anchor
MATLAB
MATLAB

Expand
titleHow do I run a Matlab job?

(more...)

Anchor
R
R

Expand
titleHow do I run an R job?

Excerpt Include
Running jobs - R
Running jobs - R
nopaneltrue
(more...)

Anchor
Mathematica
Mathematica

Expand
titleHow do I run a Mathematica job?

(more...)

Anchor
PARALLEL_JULIA
PARALLEL_JULIA

Expand
titleHow do I start a multinode parallel job that is NOT MPI (eg Julia)?

The MPI modules on Mercer are built with support for the batch system. However, third-party MPI libraries and parallel software (such as Julia) may not be.

To launch a Julia job on multiple nodes you can use the node list provided by $PBS_NODEFILE:

Code Block
julia --machinefile $PBS_NODEFILE ./my_code.jl

For an MPI job that does not use the MPI modules on Mercer:

Code Block
# -np needs a process count; here, one process per slot listed in $PBS_NODEFILE
mpirun -np $(wc -l < $PBS_NODEFILE) --machinefile $PBS_NODEFILE ./my_mpi_exec.exe

To run one multithreaded MPI process per node (hybrid MPI/OpenMP), see Running jobs - MPI.

Note

Parallel libraries other than the MPI modules on Mercer do not normally have support for Torque, and consequently do not play nicely with other jobs. For this reason, the queue for multinode jobs sets the "#PBS -n" (node-exclusive) option, so multinode jobs do not share nodes with any other job.

(thanks Spencer for the Julia tip!)

Anchor
PDF
PDF

Expand
titleHow can I view a PDF file on Prince?

Panel

$ evince myfile.pdf

You need to have logged in with X forwarding enabled, as evince is an X application. See Logging in to the NYU HPC Clusters for how to do this.
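If you have not yet logged in with X forwarding, a minimal sketch (the NetID is illustrative):

Code Block
# Log in with X forwarding enabled, then open the PDF
ssh -X netid@prince.hpc.nyu.edu
evince myfile.pdf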

Anchor
USING_SOFTWARE
USING_SOFTWARE
Using Software

Include Page
snippet - Usage examples
snippet - Usage examples

Anchor
DATA
DATA
Managing Data

Expand
titleHow much of my file/space quota have I used?

On Prince, enter 'myquota' at the prompt to see how much space you have used and how much remains available on each filesystem.

Expand
titleHow do I give my colleague access to my files?

 

Excerpt Include
snippet - File permissions and Access Control Lists
snippet - File permissions and Access Control Lists
nopaneltrue
(more...)
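A minimal sketch using POSIX ACLs (the NetID and path are illustrative):

Code Block
# Give user abc123 read access to a directory tree on /scratch
setfacl -R -m u:abc123:rX /scratch/$USER/shared_dir
# Verify the resulting permissions
getfacl /scratch/$USER/shared_dir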

Expand
titleHow do I get the best transfer speed to or from BuTinah?

Excerpt Include
Transferring files between NYU NY and NYUAD BuTinah
Transferring files between NYU NY and NYUAD BuTinah
nopaneltrue
(more...)

Expand
titleI have a huge amount of data that I want to compress for storage or transfer

 Mercer has 'pigz', which is a parallel version of gzip. To use it:

Panel

module load pigz/intel/2.3.1

pigz --help
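For example, to compress a file with several cores at once (the thread count is illustrative):

Code Block
# Compress using 8 parallel threads
pigz -p 8 big_file.tar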

Expand
titleMy workflow uses thousands of small files, how should I manage them?

Include Page
Managing large numbers of files
Managing large numbers of files

Expand
titleI want to keep a folder on the HPC cluster in sync with a folder on my workstation

Excerpt Include
Keeping directories in sync with rsync
Keeping directories in sync with rsync
nopaneltrue

Click here for more about using rsync on the NYU HPC clusters 
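A minimal sketch (the paths and NetID are illustrative):

Code Block
# Mirror a local project folder to /scratch, copying only files that changed
rsync -av ~/project/ netid@prince.hpc.nyu.edu:/scratch/netid/project/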

Expand
titleHow do I transfer files to the Dumbo cluster from a Windows workstation?

First, download and install the WinSCP tool from here. If you are inside the NYU network (on campus), simply open WinSCP and fill in the connection fields (host name, your NetID and password).

If you are outside the NYU network, one option is to set up and use the VPN; after that you can use WinSCP as described above. Another option is to start an SSH tunnel on the workstation - we have instructions on how to do this for Windows workstations. Once your SSH tunnel to dumbo is set up and started, open WinSCP and point it at the local end of the tunnel.

Expand
titleHow can I share data on Prince with external (non-NYU) collaborators?

Please use the Globus data sharing feature. As long as external collaborators have Globus access, they can download files from directories you set up for data sharing on Prince. The setup procedure is described on this page - https://docs.globus.org/how-to/share-files/

Expand
titleHow do I use RCLONE to copy files between Prince storage and Google Drive, etc.?

Please see this wiki page dedicated to RCLONE usage.

Anchor
NOTEBOOK
NOTEBOOK
Jupyter Notebooks

Expand
titleHow do I connect from a Windows computer to a notebook running inside a Slurm job on the Prince cluster?
To get access to the Jupyter notebook from your Windows computer, assuming:
  • You are within the NYU network (if not, you can set up the VPN to get onto it).
  • The Slurm job stdout file (here slurm-9999999.out) gives an instruction like:
    ssh -L 6217:localhost:6217 <netid>@prince.hpc.nyu.edu


Start a new PuTTY session, then follow the steps below:
1. Enter hostname 'prince.hpc.nyu.edu' and port 22.
2. In the left-hand panel of the PuTTY configuration, click 'SSH' then 'Tunnels'. Enter the source port and destination given by the ssh instruction above (here, source port 6217 and destination localhost:6217), then click 'Add'.
3. Click 'Open' to start the connection.
4. Enter your username and password to get onto a Prince login node.
5. In your browser, enter the URL provided in the Slurm stdout file.