
Managing Data

Overview

On large clusters, data is in constant use: it is moved, staged, analyzed and sometimes deleted. Managing your data properly can determine the success of your research, since it directly affects access speed and available space, not only for you but also for others. The NYU HPC storage system eases the migration and analysis of research data by providing three storage filesystems: /home, /scratch, and /archive. Each is described below, along with its intended use. It is important to familiarize yourself with our policies and storage systems as described below.

Storage Allocations and Policies

Each individual researcher is assigned a standard storage allocation, or quota, on /home, /scratch and /archive. Researchers who use more than their allocated space will be blocked from submitting jobs until they clean up and reduce their usage or, in the case of /archive, purchase additional storage. The chart below shows the storage allocations for individual accounts and the cost of additional /archive space.

/home ($HOME)
    Purpose: Program development space; storing small files you want to keep long term, e.g. source code, scripts.
    Visibility: Login and compute nodes. Each cluster has a separate /home filesystem.
    Backed up: Yes
    Flushed: No
    Allocation: 5GB
    Cost for additional storage: N/A
    Total size: ~1TB*
    File system: NFS

/archive ($ARCHIVE)
    Purpose: Long-term storage, mounted only on login nodes.
    Visibility: Login nodes only. Common to all clusters.
    Backed up: Yes
    Flushed: No
    Allocation: 2TB**
    Cost for additional storage: $500/year for 1TB
    Total size: 200TB
    File system: ZFS

/scratch ($SCRATCH)
    Purpose: Computational work space.
    Visibility: Login and compute nodes. Common to all clusters.
    Backed up: No
    Flushed: Files not accessed for 60 days
    Allocation: 5TB; inode quota: 1 million (see the /scratch Policy below)
    Cost for additional storage: N/A
    Total size: 301TB
    File system: Lustre

/work ($WORK)
    Purpose: Medium-term, non-backed-up storage mounted on Bowery login and compute nodes. Currently for HPC stakeholders only.
    Visibility: Login and compute nodes. Only on Bowery.
    Backed up: No
    Flushed: No
    Allocation: 500GB
    Cost for additional storage: N/A
    Total size: 200TB, shared with /archive
    File system: ZFS

Environment variables

Currently the environment variables $ARCHIVE and $SCRATCH are only defined in login shells - that is, during interactive sessions. To use them in batch jobs you will need to add the "-l" flag to bash or csh, e.g. in the first line of your job script, as sketched below.
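For example, a minimal PBS job script sketch (the resource requests, directory and program names are placeholders only) in which bash is started as a login shell so that $SCRATCH is defined inside the batch job:

    #!/bin/bash -l
    #PBS -l nodes=1:ppn=1,walltime=01:00:00
    #PBS -N example_job

    # Because of the -l flag above, bash runs as a login shell, so $SCRATCH and
    # $ARCHIVE are defined (note that /archive itself is only mounted on login nodes).
    cd $SCRATCH/my_run_dir
    ./my_program > output.log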

Important: Of all the spaces above, only /scratch should be used for computational work. Please do not write to /home when running jobs, as it can easily fill up.

*Note:  Capacity of the /home file system varies from cluster to cluster. Unlike /scratch and /archive, the /home file system is not mounted across clusters. Each cluster has its own /home, its own user base and /home allocation policy.   

To purchase additional storage, send email to hpc@nyu.edu.

/scratch Policy

The /scratch storage system is a shared resource that needs to run as efficiently as possible for the benefit of all. All HPC account holders have a /scratch disk space quota of 5TB and an inode quota of 1 million. There is no system backup for data in /scratch; backing up your data is your own responsibility. We cannot recover any data in /scratch, including files lost to system crashes or hardware failure, so it is important to make copies of your important data regularly.

  • All inactive files older than 60 days will be removed. It is a policy violation to use scripts to change file access times. Any user found violating this policy will have their HPC account locked; a second violation may result in the account being turned off.
  • We strongly urge users to clean up their data in /scratch regularly, backing up files they need to retain to /archive or elsewhere (a sketch follows this list).
  • All users will be asked to clean up if total /scratch usage rises above 75%, since usage above that level degrades /scratch file system performance.
  • We retain the right to clean up files on /scratch at any time if needed to maintain system performance.
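For instance, working on a login node (where /archive is mounted), a /scratch directory could be packed into a tar archive on /archive, or mirrored there with rsync; the directory and file names below are placeholders:

$ tar czf /archive/$USER/my_project_backup.tar.gz /scratch/$USER/my_project
$ rsync -av /scratch/$USER/my_project /archive/$USER/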

Some recommendations:

  • Do not put important source code, scripts, libraries or executables in /scratch. These important files should be stored in /home.
  • Do not create soft links in /home pointing to folders in /scratch as a way of accessing /scratch.
  • We strongly suggest working with a few large files rather than many small files.
  • For temporary files that are accessed frequently while a job is running, use the local disk on the compute node, or even a RAM-based file system on the compute node, to reduce the I/O load on the /scratch file system; a sketch of this follows the list.
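A rough sketch of this pattern in a job script is shown below. The node-local path (/tmp), the program name and the file names are assumptions only; check with the HPC team for the recommended local-scratch location on your cluster.

    # Create a job-specific directory on the compute node's local disk
    LOCALTMP=/tmp/$USER/$PBS_JOBID
    mkdir -p $LOCALTMP
    cd $LOCALTMP

    # Stage input from /scratch, run, then copy results back to /scratch once at the end
    cp /scratch/$USER/my_run_dir/input.dat .
    ./my_program input.dat > output.dat
    cp output.dat /scratch/$USER/my_run_dir/
    rm -rf $LOCALTMP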

Group Quotas on /archive

HPC accounts include an allocation of 2 TB of storage in /archive. An HPC sponsor may request that his/her quota and the quota of his/her research group be combined to create a single larger space. Some conditions:

  • Requests must be made by the sponsor
  • All of the members of the group must share the same sponsor
  • All group members must be active users of the HPC system

The sponsor's account will hold the full quota and each individual's quota will be set to 0.

Requests will be considered by HPC management and assessed on the basis of need and the availability of storage.

The maximum size of a group quota is 10 TB. Additional storage can be added for $500/TB/year (subject to availability).

To apply for a group quota please use the form at this link.   You will receive a response to your request within 1 week.

Extra archive space on /work

Users now have a default quota of 500GB on /work, which is NFS mounted on the Mercer login and compute nodes.

/work is intended as a medium-term archive space which is visible to the compute nodes - jobs are still best run in /scratch, and large I/O during a run should go to /scratch. /work is mounted on the compute nodes via NFS, which has higher latency and lower bandwidth than the Infiniband connection to /scratch. Files on /work are not flushed, but neither are they backed up.
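For example (the directory name is a placeholder), results that are finished but still needed could be moved out of /scratch into /work with rsync:

$ rsync -av /scratch/$USER/finished_run /work/$USER/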


Automatic File Deletion Policy

The table below describes the policy concerning the automatic deletion of files.

/home: none

/archive: none

/scratch: Files may be deleted as needed, without warning, if required for system productivity.

All spaces: All /home and /archive files associated with expired accounts will be automatically deleted 90 days after account expiration; /scratch files will be automatically deleted no later than 30 days after account expiration.

File systems Usage Practices

To optimize system performance and to make sure your work is not interrupted, move data to the appropriate storage file system on a regular basis. Migrating data from /scratch and /home to the /archive file system is the easiest way to keep space uncluttered. User IDs based on NYU NetIDs are synchronized across clusters and file systems, which lets data be moved quickly. You can simply copy data into the proper file system or synchronize directories between the file systems.

/scratch and /archive are available on all clusters and you can use any of the cluster login nodes to move files from /scratch to /archive or vice versa.  Since /archive is not visible to the compute nodes, you may need to stage data from /archive to /scratch for analysis.
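For example, from any cluster login node you can stage a dataset from /archive to /scratch before a run and copy the results back afterwards (the directory names are placeholders; replace NetID with your NetID):

$ cp -r /archive/NetID/my_dataset /scratch/NetID/
$ cp -r /scratch/NetID/my_results /archive/NetID/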

/home directories and files are unique to each cluster. Therefore, moving files between /home and either /scratch or /archive must be done while logged into the cluster that owns that /home directory.

How to Use /scratch Effectively

/scratch uses the Lustre filesystem, which is optimized for large, sequential data access; under such workloads /scratch has been benchmarked at ~3GB/s. Because of its parallel nature, /scratch has some idiosyncrasies that can make it seem difficult to use. NICS at the University of Tennessee and Oak Ridge has written some very good documentation on using the Lustre filesystem. If your application relies heavily on IO, or you have any questions about how to best use /scratch, the following links may prove helpful (a brief striping example follows them):

  • Lustre FAQ
  • IO Tips
  • IO Best Practices
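As one illustration of Lustre-specific tuning (the directory name and stripe count here are assumptions; suitable values depend on your file sizes and access pattern), the lfs tool can show and set the striping of a directory so that large files are spread across several storage targets:

$ lfs getstripe /scratch/$USER/big_files       # show the current striping settings
$ lfs setstripe -c 4 /scratch/$USER/big_files  # new files in this directory will use 4 stripes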

If you have any questions, or need any assistance with /scratch, feel free to email the HPC help desk.

Checking Your Disk Usage

/scratch & /home

You can use the command "myquota" to check your usage on /scratch and /home as shown below.

For example,

$ myquota

Disk quotas for user yz437 (uid 1389130):
Filesystem  kbytes      quota        limit   grace  files    quota    limit  grace
 /scratch/    1376  5368709120  6442450944      -     10    1000000  1001000     -

Disk quotas for user yz437 (uid 1389130):
Filesystem                   blocks    quota    limit  grace  files  quota  limit  grace
crunch.local:/export/apps     16920  5242880  5242880           988      0      0

 

There are two blocks showing space usage for /scratch and /home, respectively.

For the /scratch filesystem, usage is reported in kilobytes. When your usage reaches the soft limit (5368709120 KB, i.e. 5TB), you will get warnings. When your usage reaches the hard limit (6442450944 KB, i.e. 6TB), you will no longer be able to create files. After exceeding the soft limit you have a one-week (1w) grace period to move or delete files before your usage is locked out. For more information check the page /scratch Quota Policy.

For the filesystem crunch.local:/export/apps (/home), the soft limit (5242880 KB, i.e. 5GB) is the same as the hard limit. When your usage reaches this limit, you will no longer be able to create files and will have to move or delete files before you can continue working. Since the soft limit is the same as the hard limit, there is no grace period.

/home & /archive

The easiest way to check your disk usage on /home and /archive is the du (display disk usage statistics) Unix command. To check your utilization on the /home and /archive filesystems:

$ du -sh /archive/NetID
$ du -sh /home/NetID

Note

Make sure that you replace "NetID" with your NetID.

Note

/archive is a dynamically mounted (automounted) filesystem. In other words, it is only mounted upon request. Before you use the du command, you may need to issue the ls (list directory contents) Unix command on your directory for the filesystem to mount.

The following sequence of commands for the NetID tester12 shows you the expected results:

$ du -sh /archive/tester12
du: `/archive/tester12': No such file or directory
$ ls /archive/tester12
$ du -sh /archive/tester12
2.0K /archive/tester12

For /archive usage, you can also use the df command:

$ df -h /archive/NetID

For example,

$ df -h /archive/yz437

Filesystem                        Size  Used  Avail  Use%  Mounted on
10.0.1.19:/export/archive/yz437   2.0T     0   2.0T     0%  /archive/yz437



Or, using the $USER environment variable in place of your NetID:

$ df -h /archive/$USER

Moving files to/from the cluster

Often you will need to move files between an NYU cluster and another outside machine, such as a desktop machine or an account at a different university or national lab. 

For security reasons, the cluster login nodes do not accept incoming remote connections from outside the HPC network, i.e., you may not push files to the cluster login nodes directly. To move files to any of the clusters, you must be logged into that cluster and then pull the files from the other host. The steps you'll use depend on whether the originating system has incoming scp/sftp services enabled. Most production Unix systems have incoming scp/sftp enabled, while most Windows and Mac desktops do not.

Secure Copy (scp) is a means of securely transferring computer files between a local and a remote host, or between two remote hosts. It is based on the Secure Shell (SSH) protocol. For examples of scp syntax, please see http://www.hypexr.org/linux_scp_help.php

Copying files from another Unix system (with incoming scp/sftp enabled):

  1. From the bastion host, log into the cluster you want to copy the files to.
  2. Once logged into the cluster, copy the file from the remote host to the desired directory on the login node using scp.

$ ssh NetID@usq.nyu.edu
$ scp NetID@yourhost.xxx:filename .

Copying files to/from BuTinah from your system:

  1. To achieve better performance, please use the alternate TCP port 922, which is routed over the high-bandwidth ANKABUT link, instead of the default TCP port 22, which is routed over a lower-bandwidth MPLS link.
Pushing to BuTinah:

$ scp -P 922 filename NetID@butinah.abudhabi.nyu.edu:~/

Pulling from BuTinah:

$ scp -P 922 NetID@butinah.abudhabi.nyu.edu:~/filename .

Note

Make sure that you replace "NetID" with your NetID.

Copying files from a system without incoming scp/sftp enabled (e.g. most Windows and Macs)

Enabling incoming scp/sftp on Windows, Mac and Linux through SSH Tunneling (recommended):

If you would like to copy data directly between your Windows, Mac or Linux system and the clusters, bypassing the intermediary bastion host, you can use SCP through SSH Tunneling.

Using the HPC bastion host as an intermediary data depository:

If you are moving a small number of small data files, you can use the HPC bastion host as an intermediary data depository, as shown below. Don't use this method if you are moving either a large number of files or large data files.

  1. Using an SSH client, copy your files from your desktop machine to the bastion host - e.g. with WinSCP from Windows, or from a Mac terminal session with the following:

    $ scp YOUR_FILE NetID@hpc.nyu.edu:~/.

  2. From the bastion host, log into the login node of the cluster you plan to copy the files to, then copy the file from the bastion host to the cluster:

    $ ssh NetID@hpc.nyu.edu
    $ ssh NetID@bowery.nyu.edu
    $ scp NetID@hpc.nyu.edu:filename .

    Note

    Make sure that you replace "NetID" with your NetID.

    Note

    scp uses the dot "." to represent the target directory during a copy session; this is the local directory from which you are initiating the scp session. If you want to place the target files or directories in a different location, give the full directory path instead, as in the example below.
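    For example, to place the pulled file directly in a /scratch directory (the destination path here is just a placeholder):

    $ scp NetID@hpc.nyu.edu:filename /scratch/NetID/my_run_dir/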

   

Keeping directories in sync with rsync

For managing source code that you are developing and running on multiple computers (including the NYU HPC clusters), we highly recommend using a revision control system such as git. This can help keep your source code synchronized between different computers, and works over the SSH tunnel between your workstation and the NYU HPC clusters. We aim to provide a tutorial on revision control for NYU users soon; in the meantime there is some good introductory material at the git website and a good reference guide at Stack Overflow.

For replicating datasets between the HPC clusters and your workstation, or between two filesystems on an HPC cluster, rsync offers powerful functionality beyond that of cp or scp. With rsync you can copy directories between your workstation and the HPC clusters - or between different filesystems - in such a way that permissions and file modification timestamps are preserved, and only files which have changed are transferred.

For Windows users

Windows does not have a good native terminal application, so you may be unable to use rsync. There are rsync-like applications for Windows, but we do not have experience with them. One promising option is to run rsync from Cygwin.

The basic usage of rsync is:

rsync [options] source [source] destination

Where source is a list of one or more source files or directories to copy and destination is a directory into which to copy source. Commonly useful options are:

  • -a "archive" mode - permissions and timestamps of the source are replicated at the destination.
  • -v "verbose".
  • -n "dry run" - don't actually do anything, just indicate what would be done.
  • -C "follow CVS ignore conventions" - more on this below.

Whether rsync treats destination as a new name for the copy of source, a parent directory into which to copy source, or a parent directory into which to place the contents of source, depends on the exact context of the command. For this reason, it is highly advisable to first run rsync with -n and -v to see exactly what rsync will do before issuing the "real" command, eg:

$ rsync -nav source destination
$ rsync -av source destination

For detailed information about rsync type "man rsync" at the command line.

Some examples of rsync usage:

  • To replicate in your /scratch area a directory tree you have saved in /work:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ rsync -av my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ ls -F /scratch/$USER/my_run_dir
    my_input_data/


    There is now a copy of my_input_data directory under /scratch/$USER/my_run_dir 

    If you append / to source, rsync will copy the contents of source rather than the source directory itself:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ rsync -av my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ ls -F /scratch/$USER/my_run_dir
    file1
    file2 

 

  • To replicate on Mercer a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running:

    $ hostname
    my_workstation
    $ ls -F
    my_input_data/
    $ rsync -av my_input_data mercer:/scratch/\$USER/my_run_dir

    The host name followed by a colon tells rsync that the destination (in this case) is on another host. If your username on the other host is different from your username on the current host, you can specify the remote username with username@remotehost:.
    Note the backslash in \$USER - this instructs the local shell not to expand $USER to your username on your workstation, so it is expanded on the remote host instead. An equivalent command is:

    $ ls -F
    my_input_data/
    $ rsync -av my_input_data NetID@mercer:/scratch/NetID/my_run_dir

  • To copy in the other direction, from /scratch on Mercer to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):

    $ hostname
    my_workstation
    $ rsync -av mercer:/scratch/\$USER/my_run_dir my_results
    $ ls my_results

    Only those files not already up-to-date on your workstation will be copied.

 

 

Ignoring certain files

 

The -C option tells rsync to follow CVS conventions about ignoring certain files. For example, when copying a tree of source code, you probably want the .c, .f and .h files but not the .o files. The conventions are described fully in the man page (man rsync). In summary, when -C is used the following files are ignored:

  • Any file or directory whose name matches any of:
    RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/


  • Any file whose name matches a pattern listed in the environment variable CVSIGNORE. This environment variable takes a list of patterns separated by spaces, such as the default list above. When defining CVSIGNORE you will need to enclose the definition in quotation marks. For example, to skip Fortran output to unnamed unit numbers (whose files have names like fort.99) and netCDF files whose names end in "intermediate.nc", set CVSIGNORE as follows (note that this syntax is for bash):

    $ export CVSIGNORE="fort.?? *.intermediate.nc"

     

  • Any file whose name matches a pattern listed in the file $HOME/.cvsignore, or in a file named .cvsignore within a directory being copied. This file has the same contents as $CVSIGNORE, but with one pattern per line, for example:

    $ cat .cvsignore
    fort.??
    *.intermediate.nc
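Putting this together, a dry run followed by the real copy with the ignored files skipped might look like the following (the source directory name and the mercer host alias are placeholders, and an SSH Tunnel is assumed as in the earlier examples):

$ rsync -navC my_src_tree mercer:/scratch/\$USER/
$ rsync -avC my_src_tree mercer:/scratch/\$USER/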

