Where am I?
The login nodes of each cluster have names like "login-0-1". You can add the following idiom to your $HOME/.bashrc file to set the prompt to the name of the cluster:
Something went wrong!
You can renew your account even after it has expired; see Getting or renewing an HPC account for how to renew.
Data on /home, /archive and /work is deleted 90 days after account expiration. Data on /scratch is deleted after 60 days of not being used, whether your account is current or expired.
When you run the 'ls' command, a remote procedure call (RPC) is made to the metadata server (MDS), which returns a list of the files in the current directory. If you add certain options, such as --color=tty, then for each file in the list, ls must also call stat() on that file. The stat() call involves an RPC to the MDS and an RPC to the object storage server (OSS) which holds the file itself. These RPCs, especially those to the OSS, can take a long time.
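One way to avoid the per-file stat() calls is to run ls without the options that trigger them. The following is a minimal sketch, assuming GNU ls and assuming the extra options come from a shell alias (your environment may differ):

```shell
# Show whether "ls" is an alias that adds options such as --color=tty.
type ls

# A leading backslash bypasses any alias, so only file names are listed.
\ls

# Alternatively, switch coloring off explicitly for this one call.
ls --color=never
```

Both forms list the same names; they simply skip the coloring that needs per-file information.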
I get a message about bad permissions
SSH is fussy about permissions on the $HOME/.ssh directory. It should have rwx permission for the owner, and no permission at all for anyone else. You can set these permissions with the command:
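In octal notation, rwx for the owner and nothing for anyone else is mode 700. A minimal sketch, assuming the directory is at the default location $HOME/.ssh:

```shell
# Create the directory if it is missing, then restrict it to the owner.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Confirm the result: the mode column should read drwx------
ls -ld "$HOME/.ssh"
```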
.ssh/config on your workstation according to the example here.
- Delete your .ssh/known_hosts file. You will then be asked about connecting to a new host the first time you connect; you can safely answer "yes".
OSX roams "aggressively", which means that if it can see multiple access points it will abandon a working connection to pursue another one which might be better. The Bobst library supports a lot of wireless users and thus has many wireless access points, and Mac laptops behave like an undisciplined child in a candy store, authenticating to one point only to then disconnect and try another. Some actions which might help:
- Under Network preferences->WiFi->Advanced, remove all NYU networks except "nyu" from the "preferred networks" list, and move "nyu" to the top of the list
If that fails, you can disable the aggressive roaming from the terminal, with the command:
A2: If you are using Windows:
- If you are running the PeerGuardian personal firewall software, switch it off (it disables DHCP). Otherwise:
Recent versions of Windows take a supposedly-more-secure but also less reliable approach to authenticating to a wireless network, which causes network connections to be dropped unnecessarily. A pop-up bubble in the bottom corner of the screen which says "please re-enter your password" is an indication that this is happening.
- Instead of using the Windows-supplied wifi drivers, download and install the most recent driver from the manufacturer of your wireless-network-interface card
A3: If all this fails:
Come see the DS helpers - they may have another trick or two up their sleeve.
If your script (or interactive environment) uses
bash (the default) or
If your script (or interactive environment) uses
In the case of a PBS job script, add one of the above lines before the first "module" command in your script.
If you are seeing the error in an interactive shell, run one of the above commands at the prompt, then attempt the "module load" command again.
Preparing your Mac for X
If you wish to use any software with a graphical interface, you will need an X server. This is a software package that draws on your local screen windows created on a remote computer (such as an NYU HPC cluster).
- Download and install XQuartz
Preparing your Windows workstation for X
If you wish to use any software with a graphical interface, you will need an X server. This is a software package that draws on your local screen windows created on a remote computer (such as an NYU HPC cluster). There are a couple of options out there:
- We recommend Cygwin/X. Instructions for downloading and installing it can be found here.
Before starting PuTTY you will need to have the X server running, which you can start by double-clicking the "XWin Server" shortcut under Cygwin-X on the Start Menu. You may wish to add this to your Windows Startup folder so it runs automatically after starting Windows.
- Another good option is Xming. Installation instructions can be found on its web site.
As with Cygwin/X, you will need to launch Xming before starting PuTTY.
You will also need to download and install PuTTY SSH if you have not already.
$ARCHIVE directory - if you experience trouble with this, please contact us.
The /scratch filesystem is configured for large-block I/O, such as sequential reading and writing of large files. However, individual I/O operations are relatively costly, so programs using frequent, small I/O accesses will put a heavy load on the metadata servers, which in extreme cases can cause the system to become unstable. The system administrators generally detect this quickly and may kill a job whose I/O characteristics are stretching the capabilities of the filesystem (if this happens, we will contact you to help configure your job for better stability and performance).
- You are trying to run a simulation interactively instead of via the batch system. Please read Running jobs on the NYU HPC clusters, especially Writing and submitting a job for how to do this. If you need interactive use, read Submitting a job - Working interactively
- You are running a command on the login node that takes longer or needs more memory than you expected. Common causes include:
- Opening a large (more than 1GB) file with Vim (try using "less" instead)
- Compiling a large and complex source file
- Compressing a large file with gzip
For further details, see Submitting a job - Resource limits
This page is retained from an earlier version of the HPC wiki only for reference.
A1: You can ssh to a specific login node. The login nodes on bowery are named
A2: You can ssh to a specific compute node if you have a job running on it. To find out which nodes your job is running on use:
You will see the usual qstat output followed by a list of the nodes and cores your job is allocated to. The list will look something like:
In this example, the list shows cores 11 and 10 on node
A: After submitting jobs, you can find out where they are executing by running pbstop -u NetID. You can then monitor these jobs by logging in to the corresponding compute nodes and running top, which shows both CPU and memory consumption. If you find little memory left (or even little swap left) due to your job, you should increase the "ppn" number in your PBS script, or consider using the nodes with larger memory.
(still to come)
There are usage examples for many popular software packages in /share/apps/examples on Prince:
batch - An example batch job
blcr - Checkpoint-Restart facility for long jobs
comsol - Computational Fluid Dynamics
c-sharp - Language for the .NET/mono runtime environment
fluent - Computational Fluid Dynamics / Multiphysics package
gaussian - Chemistry package
matlab - For mathematical exploration
namd - Molecular dynamics
qchem-amber - Molecular dynamics
r - Interpreted language for statistics work
resource-usage - Shows minute-by-minute CPU and memory usage of a program
stata - Statistics package
Basic R jobs
Multiple versions of R are installed in the HPC environment. To check which are available, on Prince:
Suppose we want to use 3.3.2. Run these commands:
We first clean up the environment with 'module purge'. Then we load the selected R version and list what is loaded in the current environment. We can see that R 3.3.2 is indeed loaded along with its dependency modules. Let's try this basic R example. We name it "example.R":
Below is the terminal output from running it on Prince:
What is shown above is a simple demo case on the login nodes. For real interactive analysis, users are encouraged to run on compute nodes, using the 'srun' command to request dedicated resources, e.g.:
Besides running analysis interactively, long-running and data-intensive jobs ought to be submitted to the batch system, Slurm. The "example.R" script can be submitted to Slurm to run in batch mode.
Copy the example files to your newly created directory.
Below is what the example looks like:
Then create an sbatch job script:
Once the sbatch script file is ready, it can be submitted to the job scheduler using sbatch. After the job completes successfully, check the output log file for detailed output.
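As an illustration of such a job script, here is a minimal sketch, not the site's official example. The file name run-R.sbatch, the job name, the resource values and the module name r/intel/3.3.2 are all assumptions; check 'module avail' for the real module name and adjust the resources to your job.

```shell
# Write a small Slurm job script to a file (the name run-R.sbatch is hypothetical).
cat > run-R.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=example-R
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2GB
#SBATCH --time=00:10:00
#SBATCH --output=example-R_%j.out

# Start from a clean environment, then load R.
module purge
module load r/intel/3.3.2   # hypothetical module name; check "module avail"

# Run the example script non-interactively.
Rscript example.R
EOF

# Submit it on the cluster with:  sbatch run-R.sbatch
```

With these settings, Slurm replaces %j in the output file name with the job ID, so each run writes its own log file.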
The MPI modules on Mercer are built with support for the batch system. However, third-party MPI libraries and parallel software (such as Julia) may not be.
To launch a Julia job on multiple nodes you can use the node list provided by $PBS_NODEFILE:
For an MPI job that does not use the MPI modules on Mercer:
To run one multithreaded MPI process per node (hybrid MPI/OpenMP), see Running jobs - MPI.
Parallel libraries other than the MPI modules on Mercer do not normally have support for Torque; consequently, they do not play nicely with other jobs. For this reason, the queue for multinode jobs sets the "#PBS -n" (node-exclusive) option. Multinode jobs therefore do not share nodes with any other job.
(thanks Spencer for the Julia tip!)
You need to have logged in with X forwarding enabled, as evince is an X application. See Logging in to the NYU HPC Clusters for how to do this.
On Prince, enter 'myquota' at the prompt to see how much space you have used and how much is available on each filesystem.
An access control list (or ACL) gives per-file, per-directory and per-user control over who can read, write and execute files. You can see the ACL for a file or directory with the
To modify permissions for files or directories, use setfacl. For a detailed description, see 'man setfacl'. In the example below, I give read permission on dummy.txt to user
For setting execute permission on files (useful for scripts, and for allowing directories to be entered), chmod is still used.
For faster transfer between the HPC clusters at NYU in NYC and the BuTinah cluster at NYUAD, use scp over port 922. This will route the transfer over a high-bandwidth ANKABUT link rather than the default low-bandwidth MPLS link. The speed difference is greatest when pulling files from BuTinah to NYU NY.
Transferring many small files will still be slow - you will get better performance if you tar small files into a single archive, and transfer the tar file.
The default user environment on bowery sets an alias for scp which does this automatically, so in most cases you can skip over this section. If you are finding that file transfers between NYUAD and NYU are slow, you can check whether you are using the alias with 'which scp'. If the response is not '/usr/local/bin/scp_wrapper.sh', you should follow the instructions below.
scp over port 922 directly with the following commands, initiated from any of the NYU HPC clusters in NYC:
Pushing to BuTinah:
Pulling from BuTinah:
Mercer has 'pigz', which is a parallel version of gzip. To use it:
Managing large numbers of files
Filesystems generally - and high-performance filesystems such as Lustre especially - perform best with a small to moderate number of medium to large files. Some specific issues to be aware of are:
- Lustre ($SCRATCH) gets performance mainly by striping, that is, distributing a large file across several disks and several "object storage servers". File metadata operations, on the other hand, do not have much parallelism available. So a few large read or write operations are vastly faster than many small reads or writes. This is true for reads and writes within a single file as well as for reads or writes on many files.
- If your job does many small I/O operations, it might be better to copy the file to $PBS_JOBTMP or $PBS_MEMDISK at the start of the job, and open the local copy of the file.
- (But for large reads and writes, $SCRATCH is likely to be faster than local disk)
The default stripe count on $SCRATCH is 4, so each file is striped across disks on 4 object storage servers. If you have a folder filled with files each smaller than 1MB, it is better not to stripe them. You can set the stripe count on a folder (under $SCRATCH) to 1 with:
- Finding a file within a folder is a serial operation. And the more files in a folder, the longer it takes. With several thousand files, even 'ls' on that folder can take several minutes and may affect responsiveness of the filesystem for other users.
- If you have more than about 1000 files in one folder, distribute them over a number of subfolders. The best performance will be when the number of subfolders is the square root of the total number of files (eg, for 1 million files, 1000 subfolders each containing 1000 files)
- The backup regime on /archive is optimized for small numbers of large files - 1000 files of 1 kB each take 1000 times as long to back up as 1 file of 1 MB! Too many small files can prevent the backup from completing in time.
When archiving a collection of small files, please tar the files first. You can send a collection of files to /archive with the command:
And fetch it again with:
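The round trip can be sketched as follows. This is an illustration under stated assumptions, not the site's exact commands: the directory names my_results and restored are hypothetical, and the fallback value for ARCHIVE exists only so the sketch can be tried away from the cluster, where $ARCHIVE is normally already set.

```shell
# ARCHIVE is normally set by the cluster environment; fall back to a
# local stand-in directory (hypothetical) when it is not defined.
ARCHIVE="${ARCHIVE:-$PWD/archive-demo}"
mkdir -p "$ARCHIVE"

# Hypothetical directory holding many small files.
mkdir -p my_results
printf 'sample\n' > my_results/run1.txt

# Send: bundle the whole directory into one compressed file on /archive.
tar -czf "$ARCHIVE/my_results.tar.gz" my_results

# List the archive contents to confirm what was stored.
tar -tzf "$ARCHIVE/my_results.tar.gz"

# Fetch: unpack the archive into a separate (hypothetical) directory.
mkdir -p restored
tar -xzf "$ARCHIVE/my_results.tar.gz" -C restored
```

The backup system then sees a single file rather than every small file individually.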
To replicate on Mercer a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running:
The host name followed by a colon tells rsync that the (in this case) destination is on another host. If your username on the other host is different to the username on the current host, you can specify the remote username with
Note the backslash in \$USER - this instructs the shell not to expand $USER to your local (on your workstation) username. An equivalent command is:
To copy in the other direction, from /scratch on Mercer to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):
Only those files not already up-to-date on your workstation will be copied.
First, download and install the WinSCP tool from here. If you are inside the NYU network (on campus), simply open WinSCP and fill in all the fields:
If you are outside the NYU network, one option is to set up and use VPN. After that you can use WinSCP as described above. Another option is to start an SSH tunnel on the workstation. We have instructions on how to do this for Windows workstations. Once your SSH tunnel is set up and started on dumbo, open WinSCP and fill in the fields as shown below: