The NYU HPC clusters have five filesystems for users' files. Each filesystem is configured differently to serve a different purpose:

 

/home ($HOME)

    Purpose: Program development space; storing small files you want to keep long term, e.g. source code and scripts. Starting with the installation of Mercer we have a unified /home filesystem served from the same 7420 storage system as /archive and /work.
    Visibility: login and compute nodes.
    Backed up? Yes (ASCII filenames only)
    Flushed? No
    Allocation: 20GB (unified /home, mounted on Mercer)
    Cost for additional storage: N/A
    Total size: 600TB (unified /home, space shared with /archive and /work)
    File system: NFS (served from ZFS)

/archive ($ARCHIVE)

    Purpose: Long-term storage, mounted only on login nodes. Best for large files; please tar collections of small files when archiving. Groups may request a common aggregate archive space.
    Visibility: login nodes only. Common to all clusters.
    Backed up? Yes (ASCII filenames only)
    Flushed? No
    Allocation: 2TB
    Cost for additional storage: $500/year for 1TB
    Total size: 600TB (shared with /work and unified /home)
    File system: ZFS

/scratch ($SCRATCH)

    Purpose: Computational work space. Best suited to large, infrequent reads and writes. Files are deleted after 60 days without use.
    Visibility: login and compute nodes. Common to all clusters.
    Backed up? No
    Flushed? Files not accessed for 60 days
    Allocation: 5TB; inode quota: 1 million (see the /scratch Policy section below)
    Cost for additional storage: N/A
    Total size: 410TB
    File system: Lustre

/work ($WORK)

    Purpose: Medium-term storage (not backed up), mounted on login and compute nodes.
    Visibility: login and compute nodes.
    Backed up? No
    Flushed? No
    Allocation: 500GB
    Cost for additional storage: N/A
    Total size: 600TB (shared with /archive and unified /home)
    File system: ZFS

/state/partition1 ($PBS_JOBTMP)

    Purpose: Small, node-local filesystem cleaned up at the end of each Torque job; for small, frequent reads and writes. The environment variable is defined in batch jobs (via the qsub wrapper).
    Visibility: compute nodes only. Local to each compute node.
    Backed up? No
    Flushed? At the end of each job
    Allocation: Varies; generally >100GB
    Cost for additional storage: N/A
    Total size: Varies
    File system: ext3

$PBS_MEMDISK

    Purpose: Optional, node-local memory filesystem. Like $PBS_JOBTMP but smaller and faster. See here for usage.
    Visibility: compute nodes only. Local to each compute node.
    Backed up? No
    Flushed? At the end of each job
    Allocation: Default 8GB; a specific amount can be requested (but must fit within node memory)
    Cost for additional storage: N/A
    Total size: Varies
    File system: tmpfs or ramfs

Only files and directories with ASCII-only filenames are backed up. Our backup system does not handle Unicode in file or directory names; such files and directories (including everything under them) will be bypassed.

Important: Of all these spaces, only /scratch should be used for computational work. Please do not write to /home when running jobs, as it can easily fill up.

Note: Capacity of the /home file system varies from cluster to cluster. Unlike /scratch and /archive, the /home file system is not mounted across clusters. Each cluster has its own /home, its own user base and its own /home allocation policy.

To purchase additional storage, send email to hpc@nyu.edu.

Moving files across filesystems

To move a file to a different folder on the same filesystem, mv is best as it need not copy the file contents. However, when moving files from one filesystem to another it is best to use rsync, not mv. The file contents will need to be copied regardless, and if mv fails it can damage both the original and the copy. You can avoid this risk by using rsync to ensure the file is correctly copied before deleting it from the original location.
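
A minimal sketch of this copy-then-delete pattern, using hypothetical folder names:

    # copy the folder to the destination filesystem first
    rsync -av $HOME/myproject/ $SCRATCH/myproject/
    # only after confirming the copy arrived complete and correct, remove the original
    rm -r $HOME/myproject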

See How to copy files to and from the HPC clusters for some tips about using rsync.

How to use /scratch

The /scratch filesystem is configured for large-block I/O, such as sequential reading and writing of large files. However, individual I/O operations are relatively costly, so programs using frequent, small I/O accesses will put a heavy load on the metadata servers, which in extreme cases can cause the system to become unstable. The system administrators generally detect this quickly and may kill a job whose I/O characteristics are stretching the capabilities of the filesystem (if this happens, we will contact you to help configure your job for better stability and performance).

If frequent I/O is unavoidable, we recommend using the node-local, temporary filesystem on /state/partition1 for those files. Note that this filesystem is cleaned up at the end of each job, so your job will need to copy any files to and from it before exiting.
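
A minimal sketch of this pattern in a Torque batch script (the program, folder and file names are only illustrations):

    #!/bin/bash
    #PBS -l nodes=1:ppn=1,walltime=01:00:00
    # stage the input onto the node-local disk
    cp $SCRATCH/myjob/input.dat $PBS_JOBTMP/
    cd $PBS_JOBTMP
    # frequent, small I/O now hits the local disk instead of /scratch
    $HOME/myjob/my_program input.dat > output.dat
    # copy results back before the job ends and $PBS_JOBTMP is cleaned up
    cp output.dat $SCRATCH/myjob/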

[Diagram: accessibility of each filesystem from the login and compute nodes]

Filesystems generally, and Lustre (/scratch) especially, do not cope well with large numbers of files in a single directory - operations using that directory will be slow and impact /scratch performance for other users. Please limit the number of files or folders in a single directory to about 1000 - if you need more than this, it is best to split the directory over multiple folders.

Managing large numbers of files

Filesystems generally - and high-performance filesystems such as Lustre especially - perform best with a small to moderate number of medium to large files. Some specific issues to be aware of are:

On $SCRATCH

  • Lustre ($SCRATCH) gets its performance mainly from striping - distributing a large file across several disks and several "object storage servers". File metadata operations, on the other hand, do not have much parallelism available, so a few large read or write operations are vastly faster than many small reads or writes. This is true for reads and writes within a single file as well as for reads or writes on many files.
    • If your job does many small I/O operations, it might be better to copy the file to $PBS_JOBTMP or $PBS_MEMDISK at the start of the job, and open the local copy of the file.
    • (But for large reads and writes, $SCRATCH is likely to be faster than local disk)

  • The default stripe count on $SCRATCH is 4, so each file is striped across disks on 4 object storage servers. If you have a folder filled with files each smaller than 1MB, it is better not to stripe them. You can set the stripe count on a folder (under $SCRATCH) to 1 with:
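
    For example, to turn off striping on a (hypothetical) folder of small files:

        lfs setstripe -c 1 $SCRATCH/my_small_files_dir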

  • Finding a file within a folder is a serial operation, and the more files in a folder, the longer it takes. With several thousand files, even 'ls' on that folder can take several minutes and may affect the responsiveness of the filesystem for other users.
    • If you have more than about 1000 files in one folder, distribute them over a number of subfolders (see the sketch after this list). The best performance will be when the number of subfolders is the square root of the total number of files (e.g., for 1 million files, 1000 subfolders each containing 1000 files).
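
A minimal sketch of splitting a flat folder into subfolders of about 1000 files each, assuming the files match a pattern like out_*.dat (the folder and file names are only illustrations):

    # hypothetical folder containing many small output files
    cd $SCRATCH/my_big_folder
    i=0
    for f in out_*.dat; do
        d=$(printf "sub%03d" $(( i / 1000 )))   # sub000, sub001, ...
        mkdir -p "$d"
        mv "$f" "$d/"
        i=$(( i + 1 ))
    done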

On $ARCHIVE

  • The backup regime on /archive is optimized for small numbers of large files - 1000 files of 1KB each take 1000 times as long to back up as a single 1MB file! Too many small files can prevent the backup from completing in time.
    • When archiving a collection of small files, please tar the files first. You can send a collection of files to /archive with the command:
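
      For example, for a results folder my_run_dir in $SCRATCH (the folder name is only an example):

          cd $SCRATCH
          tar cvf $ARCHIVE/my_run_dir.tar my_run_dir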

      And fetch it again with:
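
          cd $SCRATCH
          # recreates my_run_dir under the current directory
          tar xvf $ARCHIVE/my_run_dir.tar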

/scratch Policy

The /scratch storage system is a shared resource that needs to run as efficiently as possible for the benefit of all. All HPC account holders have a /scratch disk space quota of 5TB and an inode quota of 1 million. There is no system backup for data in /scratch; it is the user's responsibility to back up data. We cannot recover any data in /scratch, including files lost to system crashes or hardware failure, so it is important to make copies of your important data regularly.

  • All inactive files older than 60 days will be removed. It is a policy violation to use scripts to change the file access time. Any user found violating this policy will have their HPC account locked; a second violation may result in your HPC account being turned off.
  • We strongly urge users to clean up their data in /scratch regularly; to reduce /scratch usage, back up any files you need to retain to /archive or elsewhere.
  • All users will be asked to clean up if total /scratch usage rises above 75%, since usage above that level degrades /scratch filesystem performance.
  • We retain the right to clean up files on /scratch at any time if it is needed to improve system performance.

Some recommendations:

  • Do not put important source code, scripts, libraries or executables in /scratch. These important files should be stored in /home.
  • Do not create soft links from /home to folders in /scratch as a shortcut for /scratch access.
  • We strongly suggest working with a small number of large files instead of many small files.
  • /scratch is optimized for infrequent, large reads and writes. For temporary files accessed frequently while a job runs, please use the local disk on the compute node, or even the RAM filesystem on the compute node, to reduce the I/O load on the /scratch file system.

How to use /archive

The $ARCHIVE filesystem is intended for longer-term storage of simulation results.

The backup system used for $ARCHIVE can only handle ASCII characters in filenames; if a filename contains non-ASCII characters, the file will not be backed up.

 

  • The backup regime on /archive is optimized for small numbers of large files - when archiving a collection of small files, please tar the files first. You can send a collection of files to /archive with the command:
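
    For example, assuming a results folder my_run_dir under $SCRATCH (the names are only illustrations):

        cd $SCRATCH
        tar cvf $ARCHIVE/my_run_dir.tar my_run_dir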

And fetch it again with:
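
        cd $SCRATCH
        tar xvf $ARCHIVE/my_run_dir.tar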

 

  • You can extract just a portion of the tar file:
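
        cd $SCRATCH
        tar xvf $ARCHIVE/my_run_dir.tar my_run_dir/subdir1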

This will create a directory $SCRATCH/my_run_dir/subdir1 and put the tarred contents of subdir1 into it.

 

  • To extract my_run_dir/subdir1 directly into scratch (without the my_run_dir/ prefix), use --strip-components=<n>
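
        cd $SCRATCH
        # --strip-components=1 removes the leading my_run_dir/ from the extracted paths
        tar --strip-components=1 -xvf $ARCHIVE/my_run_dir.tar my_run_dir/subdir1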

 

  • Finally, to see the contents of a tar file:
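
        tar tvf $ARCHIVE/my_run_dir.tar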

 

 

Group Quotas on /archive

HPC accounts include an allocation of 2 TB of storage in /archive. An HPC sponsor may request that his/her quota and the quota of his/her research group be combined to create a single larger space. Some conditions:

  • Requests must be made by the sponsor
  • All of the members of the group must share the same sponsor
  • All group members must be active users of the HPC system

The sponsor's account will hold the full quota and each individual's quota will be set to 0.

Requests will be considered by HPC management and assessed on the basis of need and available capacity.

The maximum size of a group quota is 10 TB. Additional storage can be added for $500/TB/year (based on availability).

To apply for a group quota please use the form at this link.   You will receive a response to your request within 1 week.

Extra archive space on /work

Users now have a default quota of 500GB on /work, which is NFS mounted on the Mercer login and compute nodes.

/work is intended as a medium-term archive space which is visible to the compute nodes - jobs are still best run in /scratch, and large I/O during a run should go to /scratch. /work is mounted on the compute nodes via NFS, which has higher latency and lower bandwidth than the Infiniband connection to /scratch. Files on /work are not flushed, but neither are they backed up.


Automatic File Deletion Policy

The table below describes the policy concerning the automatic deletion of files.

/home - none

/archive - none

/scratch - Files may be deleted as needed without warning if required for system productivity.

/work - none

ALL - All /home and /archive files associated with expired accounts will be automatically deleted 90 days after account expiration. /scratch files will automatically be deleted no later than 30 days after account expiration.

Recovering files from backup

Recovering files from backup of /home or /archive

Your $HOME and $ARCHIVE directories are regularly backed up. You can access snapshots of your $HOME or $ARCHIVE by looking in a "very hidden" (even ls -a will not display it) folder called $HOME/.zfs/snapshot (or $ARCHIVE/.zfs/snapshot).

When you run 'ls -a' in this folder you will see a list of hidden folders, each named with the date and time of the snapshot it contains.

Each of these folders contains your $HOME (or $ARCHIVE) as it was at the time and date of the snapshot. You can copy files from the snapshot back to $HOME, $SCRATCH or another location.

The files you copy back will be the files as they were on that date! If you copy them back to their original location you may overwrite any changes you have made since then.

You cannot edit files in the snapshot or add anything to it - it is a strictly read-only snapshot of a past state of your files.
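
A minimal sketch of restoring a single file from a snapshot (the snapshot folder name and the file name are placeholders; run the first command to see the snapshot names that actually exist):

    # list the available snapshots
    ls -a $HOME/.zfs/snapshot
    # copy one file from a chosen snapshot back into $HOME under a new name
    cp $HOME/.zfs/snapshot/<snapshot_name>/my_script.py $HOME/my_script.py.restored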

 
