
Independence Day weekend - reduced HPC support


From Thursday 7/2 until Monday 7/5, HPC staff will be mostly unavailable to answer support requests. The clusters will be running as usual but we will not be able to answer requests for assistance until Monday, and we will have limited capacity to quickly fix node or batch system issues.

Dumbo Login Node Down


Hello Users,

The Dumbo cluster's login node (dumbo1) is currently down. Please use "ssh dumbo0" to access the Dumbo (Hadoop) cluster.

New /scratch


Following the Mercer /scratch upgrade, batch job outputs are no longer staged in $HOME/.pbs_spool.

If you are using "#PBS -o" or "#PBS -e" to set the location for stdout and stderr files, and the location does not exist, you may see job failures you did not see earlier, as the missing location will be discovered early in the job rather than at the end. If your job fails with exit status -9, please check that stdout and stderr are going to a directory which already exists.

If you experience any other new problems since the upgrade, please let us know at hpc@nyu.edu.

New K80 GPUs


NVidia has donated 8 nodes, each with 4 NVidia K80 GPU cards; these are now available for use (nodes gpu-24-9 through gpu-24-16). To use either Titan or K80 GPUs, specify:

-l nodes=1:ppn=1:gpus=1

And to specifically use the K80 nodes:

-l nodes=1:ppn=1:gpus=1:k80

or Titan nodes:

-l nodes=1:ppn=1:gpus=1:titan
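
For example, in a job script the GPU request can be combined with the other resource limits (a sketch; the walltime value is a placeholder):

#PBS -l nodes=1:ppn=1:gpus=1:k80
#PBS -l walltime=04:00:00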

The old Tesla GPUs in gpu-11-0 through gpu-11-3 have now been retired.

See Running jobs - GPU for more

Welcome to High Performance Computing (HPC) at New York University. 

NYU HPC, within IT, operates and supports high performance computing resources and assists the NYU research community in their use. HPC resources are open to NYU faculty and staff, and faculty-sponsored students, and may be used for class instruction. IT is also available to partner with faculty as they seek funding for research with substantial technology components - see HPC Stakeholders and also ITS Research Computing. We can also assist in access to and collaboration with a number of national and state HPC facilities.

Getting and Renewing Access


For how to log in, see Logging in to the NYU HPC Clusters

Who is eligible for an HPC account?

NYU HPC resources are available at no charge to full-time NYU faculty (other than NYU Medical School) and to all other NYU staff and students with full-time NYU faculty sponsorship (more...)

Getting an account on the NYU HPC clusters

First you need a valid NYU NetID. Your HPC sponsor can request one for you here. You also need a valid NYU Google account to receive emails, as does your HPC sponsor - contact us if you need assistance with this.

Next you need a faculty sponsor.

Finally, log into the NYU Identity Management service and follow the link to "Request HPC account". We have a walkthrough of the process here.

Renewing your HPC account

Each year, non-faculty users must renew their HPC account by filling in the account renewal form from the NYU Identity Management service. See Renewing your HPC account with IIQ for a walk-through of the process.

Information for faculty who sponsor HPC users

You can request a NetID for your student or collaborator here. The request form has additional information about affiliates.

Each year, your sponsored users must renew their account. You will need to approve the renewal by logging into the NYU Identity Management service. We have a walkthrough of the process, with screenshots, here.

Pre-approving a list of netids for class HPC accounts

Faculty (who can sponsor HPC accounts) can pre-approve requests in bulk - this is intended to streamline the process of registering a class to use the HPC facilities. Faculty can set this up via the NYU Identity Management service. We also have a walkthrough of the process here.

Getting an account with one of NYU's partners

NYU partners with many state and national facilities offering a variety of HPC systems and expertise. Contact us for assistance setting up a collaboration with any of these.

The Open Science Data Cloud
Provides 1TB of free storage for science data. We encourage researchers to publish datasets associated with published research as "Public Data" on OSDC.

The NY State High Performance Computing Consortium (hpc^2)
Provides high performance computing resources for New York State industry and academic institutions:

Rensselaer Polytechnic Institute
Stony Brook University - Dave Ecker
University at Buffalo
Brookhaven National Lab
NYSERNet

The Extreme Science and Engineering Discovery Environment (XSEDE)
The most advanced, powerful, and robust collection of integrated advanced digital resources and services in the world; a single virtual system that scientists can use to interactively share computing resources, data, and expertise.

Open Science Grid
A national, distributed computing grid for data-intensive research.

The Common Solutions Group
for cooperative exploration of common solutions to IT challenges in higher education

The Open Science Project
is dedicated to writing and releasing free and Open Source scientific software. 

NYSERNet
is a private not-for-profit corporation created to foster science and education in New York State

The National Science Foundation
An independent federal agency created by Congress in 1950 "to promote the progress of science; to advance the national health, prosperity, and welfare; to secure the national defense."

Oak Ridge National Laboratory
The Department of Energy's largest science and energy laboratory.

Argonne National Laboratory
One of the U.S. Department of Energy's largest research centers. It is also the nation's first national laboratory, chartered in 1946.

TOP500 Supercomputer Sites
A project started in 1993 to provide a reliable basis for tracking and detecting trends in high-performance computing. 

HPC Stakeholders


NYU Research Technology Services (RTS) supports and encourages a model of hosting and managing clusters for research groups or departments in return for making unused cluster cycles available to the general NYU research community. These research groups and departments are our HPC Stakeholders, for whom NYU HPC manages hardware and provides priority access. Our current stakeholders are CGSB, CNS, CDS, the Kussell Lab and CAOS.

If you are interested in becoming a stakeholder, please contact us at hpc@nyu.edu for details before you purchase your cluster. We can discuss your needs and work with you in the planning and purchase of hardware.

(more...)

Compute and Storage Facilities


NYU HPC has retired its older clusters (Union Square and Cardiac), and has merged the hardware that was Bowery into its newest and primary cluster, Mercer.

  • Mercer has 4 login and 394 compute nodes:

    | Number of nodes | CPU type and speed | Cores per node | GPUs per node | Total memory per node | Memory available to jobs | Node names | Node set name(s) |
    | 112 | Intel Xeon E-2690v2 (Ivy Bridge) x86_64 3.0GHz (2014) | 20 | - | 64GB | 62GB | compute-14-* to compute-20-* | ivybridge_20p_64GB_3000MHz, ivybridge |
    | 48 | Ivy Bridge x86_64 3.0GHz (2014) | 20 | - | 192GB | 189GB | compute-21-* to compute-23-* | ivybridge_20p_192GB_3000MHz |
    | 68 | Westmere x86_64 2.67GHz (2010) | 12 | - | 24GB | 23GB | compute-4-* to compute-8-7 | westmere_12p_24GB_2670MHz, dell_westmere |
    | 8 | Westmere x86_64 2.67GHz (2010) | 12 | - | 48GB | 46GB | compute-8-8 to compute-8-15 | westmere_12p_48GB_2670MHz |
    | 16 | Westmere x86_64 2.67GHz (2010) | 12 | - | 96GB | 93GB | compute-9-* | westmere_12p_96GB_2670MHz |
    | 64 | Westmere x86_64 3.07GHz (2011) | 12 | - | 48GB | 46GB | compute-12-* and compute-13-* | westmere_12p_48GB_3070MHz |
    | 64 | Nehalem x86_64 2.67GHz (2009) | 8 | - | 24GB | 23GB | compute-0-* to compute-3-* | nehalem_8p_24GB_2670MHz |
    | 1 | Nehalem x86_64 2.27GHz (2009) | 16 | - | 256GB | 250GB | compute-10-0 | nehalem_16p_256GB_2270MHz |
    | 1 | Westmere x86_64 2.67GHz (2011) | 32 | - | 1TB | 1000GB | compute-10-1 | westmere_32p_1024GB_2670MHz |
    | 4 | Westmere x86_64 2.67GHz (2011) | 12 | 1 x NVidia Tesla M2070 | 24GB | 23GB | compute-11-* | westmere_12p_24GB_2670MHz_Tesla, tesla |
    | 9 | Sandy Bridge x86_64 2.0GHz (2014) | 16 | 4 x NVidia Titan | 128GB | 126GB | gpu-23-* | sandybridge_16p_128GB_2000MHz_Titan, titan |

    To restrict a job to a specific subset of nodes, you can request the node set name as a feature, e.g.:

    #PBS -l feature=ivybridge_20p_64GB_3000MHz

    You can see a map of nodes and usage with pbstop. The diagram below indicates which nodes belong to which of the above nodesets:
     

  • Hydra is a 3-node cluster with 24 Intel Sandy Bridge cores and 8 GPUs per node. 75% of Hydra is reserved for use by the NYU Center for Data Science (CDS) (see HPC Stakeholders).
  • Dumbo is an older, 70-node Intel Xeon cluster running Hadoop. Our Hadoop support is still at "pilot" stage, and is "best effort" rather than guaranteed service. We plan to update the hardware for Hadoop in the near future.
  • BuTinah, located at and managed by NYU Abu Dhabi, has 4 login and 537 compute nodes each with 12 Intel Westmere cores and either 48 GB or 192 GB memory.

Access to the primary clusters is as indicated in the following diagram. The diagram also indicates what file storage NYU HPC provides.


The NYU HPC clusters have five filesystems for users' files. Each filesystem is configured differently to serve a different purpose:

 

/home ($HOME)
Purpose: Program development space; storing small files you want to keep long term, e.g. source code and scripts.
Visibility: login and compute nodes. Starting with the installation of Mercer we have a unified /home filesystem served from the same 7420 storage system as /archive and /work.
Backed up? Yes (ASCII filenames only)
Flushed? No
Allocation: 20GB (unified /home, mounted on Mercer)
Cost for additional storage: N/A
Total size: 600TB (unified /home, space shared with /archive and /work)
File system: NFS (ZFS)

/archive ($ARCHIVE)
Purpose: Long-term storage, mounted only on login nodes. Best for large files; please tar collections of small files when archiving. Groups may request a common aggregate archive space.
Visibility: login nodes only. Common to all clusters.
Backed up? Yes (ASCII filenames only)
Flushed? No
Allocation: 2TB
Cost for additional storage: $500/year for 1TB
Total size: 600TB (shared with /work and unified /home)
File system: ZFS

/scratch ($SCRATCH)
Purpose: Computational work space. Best suited to large, infrequent reads and writes. Files are deleted after 60 days without use.
Visibility: login and compute nodes. Common to all clusters.
Backed up? No
Flushed? Files not accessed for 60 days
Allocation: 5TB; inode quota: 1 million (see Policy)
Cost for additional storage: N/A
Total size: 410TB
File system: Lustre

/work ($WORK)
Purpose: Medium-term, non-backed-up storage mounted on login and compute nodes.
Visibility: login and compute nodes.
Backed up? No
Flushed? No
Allocation: 500GB
Cost for additional storage: N/A
Total size: 600TB (shared with /archive and unified /home)
File system: ZFS

/state/partition1 ($PBS_JOBTMP)
Purpose: Small, node-local filesystem cleaned up at the end of each Torque job. For small, frequent reads and writes. The environment variable is defined in batch jobs (via the qsub wrapper).
Visibility: compute nodes only. Local to each compute node.
Backed up? No
Flushed? At the end of each job
Allocation: Varies; generally >100GB
Cost for additional storage: N/A
Total size: Varies
File system: ext3

$PBS_MEMDISK
Purpose: Optional, node-local memory filesystem. Like $PBS_JOBTMP but smaller and faster. See here for usage.
Visibility: compute nodes only. Local to each compute node.
Backed up? No
Flushed? At the end of each job
Allocation: Default 8GB; a specific amount can be requested (but must fit within node memory)
Cost for additional storage: N/A
Total size: Varies
File system: tmpfs or ramfs

Only files and directories with ASCII-only filenames are backed up. Our backup system does not handle unicode in file or directory names; such files and directories (including all files and directories under them) will be bypassed.

Important: Of all the spaces above, only /scratch should be used for computational purposes. Please do not write to /home when running jobs, as it can easily be filled up.

Note: The capacity of the /home file system varies from cluster to cluster. Unlike /scratch and /archive, /home is not mounted across clusters: each cluster has its own /home, its own user base and its own /home allocation policy.

To purchase additional storage, send email to hpc@nyu.edu.

See Clusters and Storage for more information.

Logging in to the NYU HPC Clusters 


The HPC clusters are not directly visible to the internet; you must first log in to a bastion host named hpc.nyu.edu. (Note that from the clusters you can still access the internet - see Transferring files to and from the HPC clusters.)

The diagram below illustrates the login path.

In a nutshell

Logging in is a two-step process:

  1. First log in to the bastion host, hpc.nyu.edu. From a Mac or Linux workstation this is a simple terminal command (replace my_netid with your NetID) - see the sketch below. Your password is the same one you use for NYU Home.

    Windows users will need to use PuTTY; see here for instructions.


    You can't do anything on the bastion host except ssh to the cluster.

  2. Next, log in to the cluster itself. For Mercer this is a second ssh command from the bastion host, also shown in the sketch below.
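
A minimal sketch of both steps (my_netid is a placeholder for your NetID; the login alias mercer is assumed here, analogous to dumbo0 for Dumbo above):

$ ssh my_netid@hpc.nyu.edu     # step 1: from your workstation to the bastion host
$ ssh mercer                   # step 2: from the bastion host to the Mercer cluster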

The full story

First you need to ensure your workstation has the necessary software and settings to connect to the clusters, and to use graphical interfaces. We have instructions for Windows and Mac users.

Next, here's how to log in from a Windows / Linux / Mac workstation.

SSH tunneling for easier login and data transfer

The two-stage access can be inconvenient, especially when transferring files to and from the clusters. Secure direct access and file transfer is possible by setting up SSH tunneling from your workstation to the HPC clusters. We have instructions on setting this up for Windows, Mac and Linux.

 Logging in with passwordless SSH

Once you are using SSH tunneling, it is possible to configure your workstation and accounts to avoid typing your password each time. Instructions for this will be added here soon.

What can I do on the login node? (Can I run <insert-software-name-here>?)

The login nodes are for preparing, submitting and monitoring scripts, analyzing results, moving data around and code development and simple compilation. They are not suitable for running computational workloads - for this use the batch system.

Compiling a large source codebase, especially with heavy use of optimization or -ipo (interprocedural optimization), can use much memory and CPU time. In such circumstances it is best to use the batch system for compilation too, perhaps via an interactive batch job.
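
For example, a heavy compilation could be run in an interactive batch session requested along these lines (a sketch; the core count and walltime are placeholders):

$ qsub -I -l nodes=1:ppn=4,walltime=02:00:00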

Finding and Using Software


A variety of commercial and open-source software is available on the NYU HPC clusters, and can be accessed via Environment Modules.
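
For example (a sketch; exact module names and versions vary, so check module avail first):

$ module avail           # list the software available via modules
$ module load matlab     # add a package to your environment
$ module list            # show the modules currently loaded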

Important


The login nodes are not suitable for computational work; they are a limited, shared resource for preparing and submitting computational jobs, developing and compiling software, and managing data. Computational work should instead be run via the batch system.

 Using NYU software on your own computer

NYU HPC hosts licenses for a number of commercial software packages which are suitable for workstation as well as HPC use, such as Matlab, COMSOL and Mathematica.  Contact us about accessing these packages.

 Getting new software installed on the HPC clusters

If you need a free or open source software package which is not currently available on the HPC clusters, contact us. Usually we can install it for you, or suggest an alternative which is already available.

Our ability to buy and install commercial software depends on the cost and on how widely it will be used. We may also be able to host licenses or share costs with you in return for making the software available also to the NYU research community, so if you need a specific commercial package contact us to discuss it.

 Compiling and developing software

Intel and GNU compilers are available on the clusters. For most code, we recommend the Intel compilers.
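
A sketch of a typical build with the Intel compiler (the module name intel and the source file are placeholders; check module avail for the exact name and version):

$ module load intel
$ icc -O2 -o myprog myprog.c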

For debugging we have the GNU debugger gdb, the Intel debugger idb and Totalview by Roguewave. Debugging is best performed with an interactive batch session.

There is more about compiling and debugging on the old wiki pages.

 Usage examples on Mercer

 There are usage examples for many popular software packages in /share/apps/examples on Mercer:

  • batch - An example batch job
  • blcr  - Checkpoint-Restart facility for long jobs
  • comsol  - Multiphysics simulation package
  • c-sharp  - Language for the .NET/mono runtime environment
  • fluent  - Computational Fluid Dynamics package
  • gaussian - Chemistry package
  • matlab  - For mathematical exploration
  • namd  - Molecular dynamics
  • qchem-amber  - Molecular dynamics
  • r  - Interpreted language for statistics work
  • resource-usage  - Shows minute-by-minute CPU and memory usage of a program
  • stata - Statistics package
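
To try one of these, copy it into your own space first, for example (a sketch using the batch example and your scratch space):

$ cp -r /share/apps/examples/batch $SCRATCH/
$ cd $SCRATCH/batch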

Managing data: Storage, collaboration and moving data around


Filesystems, their optimal usage and your space allocation are described under Storage.

Quotas

On Mercer, enter 'myquota' at the prompt to see how much space you have used and available on each filesystem.  

Security and collaboration: file permissions and ACL on NYU HPC clusters

By default, only you can edit, or even see, your files. You can grant permission for your colleagues to see or edit files with setfacl, and you can check the permissions on a file or directory with getfacl.

An access control list (or ACL) gives per-file, per-directory and per-user control over who can read, write and execute files. You can see the ACL for a file or directory with the getfacl command:

$ getfacl myfile.txt

To modify permissions for files or directories, use setfacl. For a detailed description, see 'man setfacl'. In the example below, I give read permission on myfile.txt to user bob123:

$ setfacl -m u:bob123:r myfile.txt

 For setting execute permission on files - useful for scripts, and for allowing directories to be entered - chmod is still used.
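
For example (a sketch; the file and directory names are placeholders):

$ chmod u+x myscript.sh    # make a script executable by you
$ chmod g+x mydir          # allow members of the group to enter a directory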

(more...)

Transferring files to and from the HPC clusters

To transfer data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel on the workstation. We have instructions for this for Windows, Mac and Linux workstations.

Once you have an SSH tunnel, you can transfer files to and from the HPC clusters - including  BuTinah at NYUAD.
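
For example, with a tunnel that forwards a local port to the cluster, a transfer might look like this (a sketch: the port 8025 and the destination path are placeholders that depend on how you set up your tunnel):

$ scp -P 8025 mydata.tar.gz my_netid@localhost:/scratch/my_netid/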

Running jobs with qsub: How to use the batch system


Working on the HPC clusters is not the same as working at a desktop workstation: in order to provide high performance computing to many users simultaneously, computational work must be packaged into a job - a script specifying what resources the job will need and the commands necessary to perform the work - and submitted to the system to be run without further input from the user. The system then schedules and runs the job on a dedicated portion of the cluster. (Note that there is a way to work interactively within this model, for work which cannot be scripted, such as debugging).

Job Scheduling

On the NYU clusters, Torque and Moab manage the running and scheduling of jobs. As a user you will interact mostly with Torque, which accepts and runs job scripts and manages and monitors the cluster's compute resources. Moab does the heavy thinking: the planning of which job should be run where and when.


Avoid requesting vastly more CPUs, memory or walltime than you actually need. Jobs needing fewer resources are easier to schedule - in our scheduling diagram, a job requiring just 1 CPU for 1 hour could be inserted into the gap on Node 1 CPU 4. Smaller jobs are also more likely to receive priority when being scheduled.

Note that a small overestimate, such as 10%-20%, is wise, lest your job run out of time and be killed before it finishes, but requesting several times what you need will result in longer queueing time for your job and less efficient system utilization for everybody.

(more...)

Login and Compute Nodes

Note that certain filesystems are visible to the login or compute nodes but not both: specifically at NYU /archive is not visible to the compute nodes, while /state/partition1 is visible and local only to individual compute nodes.

Important


Do not run computationally-heavy or long-running jobs on the login nodes! Not only will you get poor performance, but the heavy resource usage of such jobs also impacts others' ability to use the login nodes for their intended purposes. If you need to run a job interactively (for example, when debugging), please do so through an interactive batch session.

(more...)

Queues

Not all jobs can be run at once - the cluster is finite! - so when jobs are submitted they are placed into a queue. When a "space" becomes available in the schedule Moab looks down the queue for the first job that will fit into the space. 

Jobs are not necessarily placed at the end of the queue - Moab uses the priority (discussed here) to determine where in the queue a job should be placed.


At NYU HPC, shorter jobs are given higher priority.

There is more than one queue. Each queue is configured for different types of jobs and has resource limits and priorities set accordingly. If you do not specify a queue to submit to, Torque will use the resources requested to select a queue for you. Frequently this is the best option, however in some circumstances you are better off explicitly specifying a queue.

 You can see the list of queues with the command "qstat -q", and you can see more detail about a specific queue with "qstat -Qf queue-name". 

(more...)

Writing a Job Script

(more...)
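
A minimal sketch of a Torque job script (the job name, resource values and program name are placeholders):

#!/bin/bash
#PBS -N myjob                              # job name
#PBS -l nodes=1:ppn=1,walltime=01:00:00    # 1 core on 1 node for 1 hour
#PBS -l mem=2GB                            # memory request
#PBS -j oe                                 # merge stdout and stderr into one file

cd $PBS_O_WORKDIR        # run from the directory the job was submitted from
# module load ...        # load any software environment the job needs
./myprog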

Submitting a Job

Jobs are submitted with the qsub command:

$ qsub options job-script

The options tell Torque information about the job, such as what resources will be needed. These can be specified in the job-script as PBS directives, or on the command line as options, or both (in which case the command line options take precedence should the two contradict each other). For each option there is a corresponding PBS directive with the syntax:

#PBS option

For example, you can specify that a job needs 2 nodes and 8 cores on each node by adding to the script the directive:
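
#PBS -l nodes=2:ppn=8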

or as a command-line option to qsub when you submit the job: 

$ qsub -l nodes=2:ppn=8 my_script.q
(more...)

Monitoring Jobs

To see the status of a single job - or a list of specific jobs - pass the Job IDs to qstat, as in the following example: 

$ qstat 3593014 3593016
Job id          Name             User            Time Use S Queue
--------------- ---------------- --------------- -------- - -----
3593014         model_scen_1     ab123           7:23:47  R s48
3593016         model_scen_1     ab123           7:23:26  R s48

Most of the fields in the output are self-explanatory. The second-to-last column, "S", is the job status, which can be:

  • Q meaning "Queued"
  • H meaning "Held" - this may be the result of a manual hold or of a job dependency
  • R meaning "Running"
  • C meaning "Completed". After the job finishes, it will remain with "completed" status for a short time before being removed from the batch system.

Other, less common job status flags are described in the manual (man qstat).

The program pbstop, available on the login nodes, shows which jobs are currently running on which nodes and cores of a cluster.

Jobs belonging to a single user can be highlighted by launching pbstop with the -u switch:

pbstop -u <NetID>

(of course, replace <NetID> with your NYU NetID). Or, you can use the alias "me":

pbstop -u me

When you start pbstop you see something like the annotated screenshot below. You might need to resize your terminal to make it all fit: 

(more...)

Canceling a Job

To kill a running job, or remove a queued job from the queue, use qdel:

$ qdel jobid

To cancel ALL of your jobs:

$ qdel all

Tutorials, FAQs and how to get help


For help with any aspect of scientific or high performance computing on the NYU HPC clusters, email us at hpc@nyu.edu.

We are developing a set of tutorials to help NYU HPC users make the most of the facilities. Tutorials are suitable for self-directed learning and are also periodically run as classes in the library. NYU Data Services also provides tutorials for a range of scientific software - for dates and times of upcoming HPC classes see the calendar on the left, or see NYU Data Services for a wider schedule of classes.

Currently available HPC tutorials are:

Getting started on Mercer - how to login and migrate your workflow from our older clusters onto our newest cluster, Mercer.

Tutorial 1: Introduction to Unix/Linux

Tutorial 2: Getting Started in the NYU HPC environment

Tutorial 3: Using NYU HPC Effectively

The NYU HPC qsub tutorial is also available.

 

Getting Started on Dumbo: How to login

Tutorial 1: MapReduce

Tutorial 2: Hive

 

FAQ

Something went wrong!

Why does running "ls" on /scratch take so long?

I can't login

When trying to login, I get warnings about "HOST IDENTIFICATION HAS CHANGED"

What happened to my data on /scratch?

In the library, my wireless connection keeps dropping out. How can I fix it?

I'm getting a "module: command not found" error

Warning: no access to tty (Bad file descriptor), Thus no job control in this shell

I get an error "Warning: no display specified." when I use -X flag with ssh

Who killed my job, and why?

I got an email "Please do not run jobs on login nodes"

Running jobs

What resources can and should I request?

Can I make sure a job gets executed only after another one completes?

How do I log in to a specific node?

How can I make sure my job is running smoothly?

My job will take longer than 48 hours, what should I do?

My job needs (MySQL, some other service) to be running

I want to run a job at 9am every day

Using software

How do I run ... (esp. software that needs a license)

a STATA job?

a Gaussian job?

a Matlab job?

I can't find (some software package)

Can you install (some software package)?

How can I view a PDF file on Mercer?

Managing data

How much of my file/space quota have I used?

How do I give my colleague access to my files?

How do I get the best transfer speed to or from BuTinah?

I have a huge amount of data that I want to compress for storage or transfer

Monthly Maintenance Window


To provide the best possible service, ITS must regularly update and perform routine maintenance on its systems and networks. Some of these activities require that the affected systems and networks be shut down. While this work is essential, we also recognize that it presents an inconvenience. To enable those who use these systems to better plan for maintenance, we have guidelines for scheduling routine maintenance and upgrades to the HPC clusters as described below.

A MONTHLY SCHEDULED MAINTENANCE OF UP TO 12 HOURS WILL BE TAKEN, IF NEEDED, BEGINNING AT 8AM ON THE FIRST MONDAY OF EACH MONTH

Major scheduled maintenance and upgrade activities will take place, if needed, once per month, beginning at 8am on the first Monday of the month. The maintenance period may often be brief or not used at all, but can last up to 12 hours if that much time is needed to complete the work.

We have chosen early morning on the first Monday of each month for our maintenance work because it is a low-usage period on our clusters.

A notification will be sent to all HPC account holders announcing any scheduled maintenance work in advance.

A WEEKLY SCHEDULED MAINTENANCE OF UP TO FOUR HOURS (MONDAY 8 AM to NOON) MAY BE USED TO ADDRESS SMALLER MAINTENANCE AND UPGRADE NEEDS.

This time will not be used if not needed.

Featured Research


 More featured research using NYU HPC
