
 

When you are inside the NYU network

This page assumes that you are outside the NYU network. When you are inside the NYU network, things are much simpler - SSH tunneling is not necessary!

  • For WinSCP users, choose SFTP, prince.hpc.nyu.edu and port 22 to connect to the Prince login node, or SFTP, dumbo.es.its.nyu.edu and port 22 to connect to Dumbo.
  • With scp on Linux / Mac, the command can be as simple as "scp myfile.c NetID@prince.hpc.nyu.edu:/scratch/NetID/".

 

Starting the SSH tunnel before you begin

To copy data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel on the workstation. We have instructions for this for Windows, Mac and Linux workstations.

Transferring files to and from a Windows workstation with WinSCP

First you will need to install WinSCP. You can download it from here. Once you have set up the SSH tunnel you need not repeat that process, but at the beginning of each session you will need to start the tunnel by launching the shortcut you created during the setup.

Once your tunnel is running, you can launch WinSCP from the Start menu. You will be presented with a dialog like the one below.

During the SSH setup you created a tunnel, one end of which was probably at "localhost, port 8026". The hostname and port number in this dialog should be the same as you used for Step 2: Logging in via your SSH tunnel (again, probably "localhost" and "8026").

You can enter your user name (NetID) under "User name" if you wish. If you don't, WinSCP will ask for your user name and password when you click "Login".

Save the session by clicking "Save", so that next time you need not re-enter these details.

Click "Login". You will be asked for a username and password if you did not already enter them - these are your NetID and password.

Next you will see a window with two panes: the one on the left is your workstation and the one on the right is the HPC cluster. You can copy files from one to the other by dragging and dropping them.

Transferring files to and from a Mac workstation with Fugu

Fugu is a graphical frontend for the text-based Secure File Transfer Protocol (SFTP) client that ships with Mac OS X. SFTP is similar to FTP, but the entire session is encrypted, meaning nothing, including passwords, is sent in the clear. Fugu provides a friendly, intuitive interface for OpenSSH's SFTP client, featuring secure drag-and-drop file transfers, remote file editing and remote file preview. Fugu allows you to take advantage of SFTP's strong encryption without having to sacrifice ease of use. Additionally, Fugu can make secure file transfers using Secure Copy (SCP), and can create SSH tunnels to facilitate secure interaction between machines.

Step 1:

Download and install Fugu (https://faq.oit.gatech.edu/content/prism-how-install-fugu-mac – Universal binary version must be used with OSX, not the one on sourceforge which gives a 'PowerPC apps not supported' error).

Step 2:

Start Fugu.  Select SSH > new SSH tunnel

  • Create tunnel to: prince

  • Service or port: 22

  • Local port: 8026

  • Tunnel host: gw.hpc.nyu.edu

  • Username: NetID 

  • Port:

Step 3:

In the SFTP window:

  • Connect to: localhost

  • Username: NetID 

  • Port: 8026

Step 4:

  • Click Connect and enter your NetID password.

  • Drag and drop files to copy them to and from the cluster.


Transferring files to and from Google Drive with rclone

rclone ("rsync for cloud storage") is a command-line program to sync files and directories to and from cloud storage systems such as Google Drive, Amazon Drive, S3, B2 etc. rclone can be invoked in one of three modes:

  • Copy mode to just copy new/changed files
  • Sync (one way) mode to make a directory identical
  • Check mode to check for file hash equality

rclone is available on the Prince cluster as the module rclone/1.35.

Please try these options:

$ rclone --transfers=32 --checkers=16 --drive-chunk-size=16384k --drive-upload-cutoff=16384k copy source:sourcepath dest:destpath

These options work well for files from 1 GB up to 250 GB. Keep in mind that Google Drive rate-limits uploads to 2 files per second, so transfers of many small files do not work well. If you have many small jobs, tar the parent directory of their folders, split the tar file into 100 GB chunks, and upload the chunks to Google Drive.
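
The tar-and-split approach described above can be sketched as follows. This is a minimal sketch rather than a site-specific recipe: the function names pack_in_chunks and unpack_chunks, the ".tar.part_" suffix and the paths in the usage note are hypothetical, and the chunk size is an argument so that the sketch can be tried on small data first.

```shell
# Hypothetical helper: archive a directory and cut the archive into fixed-size chunks.
# Usage: pack_in_chunks <directory> <chunk_size> <output_prefix>
pack_in_chunks() {
    src_dir=$1
    chunk_size=$2      # handed to split -b, for example 100G
    out_prefix=$3
    # Stream the archive straight into split so no single huge tar file is written.
    tar -cf - -C "$(dirname "$src_dir")" "$(basename "$src_dir")" \
        | split -b "$chunk_size" - "${out_prefix}.tar.part_"
}

# Hypothetical helper: rejoin the chunks and unpack them into a target directory.
# Usage: unpack_chunks <output_prefix> <target_directory>
unpack_chunks() {
    in_prefix=$1
    target_dir=$2
    mkdir -p "$target_dir"
    # The shell lists part_aa, part_ab, ... in order, so cat restores the archive.
    cat "${in_prefix}.tar.part_"* | tar -xf - -C "$target_dir"
}
```

For example, "pack_in_chunks /scratch/NetID/my_jobs 100G /scratch/NetID/my_jobs" would leave a set of my_jobs.tar.part_* files that can then be uploaded with rclone copy; after downloading them again, unpack_chunks rebuilds the directory.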

 

Step 1:

Log in to Prince:
$ ssh -Y NetID@prince.hpc.nyu.edu
If necessary, please read the wiki page on how to log in to the HPC clusters.

Step 2:

First, load the rclone module with the command:
$ module load rclone/1.35

Step 3:

Configure rclone and set up remote access to your Google Drive with the command:
$ rclone config

This will try to open the config file, and you will see output like that below. You can select one of the options (here we show how to set up a new remote):

2017/02/24 10:21:00 Config file "/home/ad95/.rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n

Enter n for a new remote connection and give it a name.

name> remote1

Then choose the type of storage for which you are setting up the remote (here we show how to set up a remote for Google Drive, which is option 7):

Type of storage to configure.
Choose a number from below, or type in your own value
 1 / Amazon Drive
   \ "amazon cloud drive"
 2 / Amazon S3 (also Dreamhost, Ceph, Minio)
   \ "s3"
 3 / Backblaze B2
   \ "b2"
 4 / Dropbox
   \ "dropbox"
 5 / Encrypt/Decrypt a remote
   \ "crypt"
 6 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
 7 / Google Drive
   \ "drive"
 8 / Hubic
   \ "hubic"
 9 / Local Disk
   \ "local"
10 / Microsoft OneDrive
   \ "onedrive"
11 / Openstack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
12 / Yandex Disk
   \ "yandex"
Storage> 7

Then you see a few messages like the ones below:

Google Application Client Id - leave blank normally.
client_id> (just press enter key here) 
Google Application Client Secret - leave blank normally.
client_secret> (just press the enter key here)

Because you are accessing the cluster remotely, you have to decline auto config, i.e. select option n:

Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine or Y didn't work
y) Yes
n) No
y/n> n

You will see a message similar to the one below:

If your browser doesn't open automatically go to the following link: https://accounts.google.com/o/oauth2/auth?client_id=202264815644.apps.googleusercontent.com&redirect_uri=urn...

Log in and authorize rclone for access.

Open this URL in a browser on your workstation, log in, and authorize rclone to access your Google Drive. Once that is done, you will see a screen that displays a verification code.
Copy this code from the browser and paste it into the terminal. Once the terminal accepts the verification code, it displays the options below:

y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

Select y if everything looks correct for the remote, or e to edit it.
You can also view the existing remotes.

Step 4:

Transfer files to Google Drive with the command below:
$ rclone copy <source_folder> <remote_name>:<name_of_folder_on_gdrive>

For example:
$ rclone copy /home/user1 remote1:backup_home_user1

Step 5:

The files are transferred, and you can find them on your Google Drive.
Note: rclone only copies new files or files that differ from those already on Google Drive.
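
The copy and check modes listed at the start of this section can be combined into a more cautious upload. The following is a minimal sketch, assuming the rclone module is loaded and a remote has been configured as in Step 3. The function name upload_and_verify is hypothetical; --dry-run and the check subcommand are standard rclone features, but their exact output in version 1.35 is not shown here.

```shell
# Hypothetical helper: preview, copy, then verify a folder against a remote.
# Usage: upload_and_verify <source_folder> <remote_name>:<folder_on_gdrive>
upload_and_verify() {
    src=$1
    dest=$2
    # Preview: report what would be transferred without copying anything.
    rclone copy --dry-run "$src" "$dest" || return 1
    # Real transfer: only new or changed files are copied.
    rclone copy "$src" "$dest" || return 1
    # Check mode: compare the files in the source and the destination.
    rclone check "$src" "$dest"
}
```

With the remote from the example above, this would be called as "upload_and_verify /home/user1 remote1:backup_home_user1".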

Transferring files to and from a Linux or Mac workstation with scp

If you are comfortable with the command line, scp is a simple and direct way to transfer files between hosts.

The command line procedure described below is the same on Linux as on Mac. The first step is to set up the SSH tunnel.

You will need to start your tunnel before you can log in or transfer files over it - this is also covered in the guide to setting up SSH tunneling.

In a new terminal, go to the directory on your workstation that you wish to transfer files to or from:

$ cd ~/my_data

On Unix systems - including OSX - the "~" symbol is a shortcut to your home directory.

The syntax of the scp command is:

$ scp source destination

Either source or destination can be on another (remote) host; indicate this by prefixing the path with "hostname:". If your user name on the remote host differs from your user name on the host you are working on, you can specify it with "username@hostname:".

During the SSH tunneling setup, you created aliases for each of the NYU HPC clusters in your .ssh/config file. To scp files through the tunnel you will refer to the alias for the HPC cluster of interest, for example:

$ scp my_script.sh dumbo:

$ scp my_source_file.c prince:my_code/

$ scp my_data_file.nc prince:/scratch/NetID/

$ scp prince:my_code/my_source_file.c .

Note that we don't use "username@prince" because your username was specified in the .ssh/config file.

Important

Spaces matter - there must be no spaces between the remote hostname and the path on that host. If the file you wish to transfer has spaces in its name, you must escape each space with a backslash (\). For example:

$ ls

file with spaces in its name.c

$ scp file\ with\ spaces\ in\ its\ name.c prince:

If you use tab-completion to save typing the full name, the tab-completion will insert the backslashes correctly.

Better still: Don't put spaces in file names, even when you are using Windows or Mac!
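
Quoting the file name is an alternative to escaping each space. The short demonstration below uses a hypothetical helper, count_args, to show that both spellings reach a command as a single argument, while the unprotected name is split into several. It only concerns file names on the local side of an scp command.

```shell
# Hypothetical helper: print how many arguments the shell passed in.
count_args() {
    echo $#
}

count_args file\ with\ spaces\ in\ its\ name.c     # escaped: arrives as one argument
count_args "file with spaces in its name.c"        # quoted: arrives as one argument
count_args file with spaces in its name.c          # unprotected: split at every space
```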

Specifying a directory on the remote host

In the first example above we copied the file "my_script.sh" from the current directory on the workstation (the local host) to the home directory on dumbo (the remote host).

In the second example, we copied "my_source_file.c" from the current directory on the workstation to the directory "my_code", which is a subdirectory of our home directory on prince. This destination directory must already exist! The trailing "/" tells scp to expect a subdirectory, so if the subdirectory does not already exist on prince, scp will exit with an error message like "scp: my_code/: Is a directory". If you omit the trailing "/", scp assumes you wish to copy "my_source_file.c" to your home directory on prince but give it a new name "my_code". 

In the third example, the destination directory begins with a forward slash (/), which means the path is absolute rather than relative to your home directory. This allows us to copy to a destination outside of our home directory, such as /scratch.

The fourth example transfers a file from a remote host (prince) to the current directory (.) on the local host.

Transferring multiple files using wildcards

The more general usage of scp allows multiple files to be copied to a single destination, provided the destination is a directory, for example:

$ scp source_file_1.c source_file_2.c prince:my_code/

$ scp *.c prince:my_code/

In the first example we explicitly copy two source files to a directory on prince. In the second example we use wildcard expansion to copy all files with the suffix ".c" to prince. Note that the shell replaces "*.c" with a list of files matching that pattern, so the actual command which gets executed is more like the first example.

To use wildcard expansion on the remote host, we must escape the wildcards with a backslash (\) to prevent the local shell from attempting to expand them. For example:

$ scp prince:my_code/\*.c .

The above command will pass "*.c" to Prince to be expanded there. This is equivalent to a command like:

$ scp prince:my_code/source_file_1.c prince:my_code/source_file_2.c .
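
The difference between local and remote expansion can be seen without contacting any remote host, by putting echo in front of the scp command. This is a small sketch run in a temporary directory; the file names are the hypothetical ones from the examples above.

```shell
# Work in a temporary directory holding two hypothetical source files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch source_file_1.c source_file_2.c

# Unescaped: the local shell replaces *.c with the matching local file names.
echo scp *.c prince:my_code/

# Escaped: the pattern is left untouched, so scp would receive it as written.
echo scp prince:my_code/\*.c .
```

The first echo prints the two file names in place of "*.c"; the second prints the pattern itself. Prefixing a command with echo in this way is a cheap check of what scp will actually be given.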

If you are aiming to keep a directory on the remote and local host synchronized, rsync may be better suited to your needs.

Transferring over a non-default port

There is a brief introduction to ports in the section on setting up SSH tunneling.

The default port used by scp is port 22; however, you can specify a different port with the -P switch. This is useful for faster transfer to and from dalma. If you initiated a once-off SSH tunnel, you will not have an alias for the HPC cluster, so to use the tunnel you give localhost as the remote host and the local port you chose for the tunnel as the port, for example:

$ scp -P 8021 source_file_1.c NetID@localhost:my_code

This is not the recommended way to use SSH tunneling.

Fetching data from the Internet

The login and compute nodes of the NYU HPC clusters can access the Internet, even though they cannot be directly accessed from the Internet. You can, therefore, directly download datasets from the HPC clusters. The HPC clusters have a few command-line tools for downloading data:

  • scp or sftp: to fetch data from a remote system - such as dalma
  • wget: for fetching webpages, and also effective for downloading files linked from webpages (such as some_package.tar.gz)
    There is a helpful usage guide with examples at thegeekstuff.
  • curl: a newer tool with similar capabilities to wget and more. The documentation is available here and thegeekstuff also has a curl usage guide with examples.
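
A download run from a login node might look like the following sketch. The function name fetch and the URL in the usage note are placeholders, not real resources; the options used (wget -c, and curl -L -O -C -) are standard flags for following redirects and resuming an interrupted download.

```shell
# Hypothetical helper: download a URL into the current directory,
# resuming a partial download if one is already present.
fetch() {
    url=$1
    if command -v wget >/dev/null 2>&1; then
        wget -c "$url"               # -c: continue a partially downloaded file
    else
        curl -L -O -C - "$url"       # -L: follow redirects, -O: keep the remote name, -C -: resume
    fi
}
```

It would be called as, for example, "fetch https://example.com/some_package.tar.gz" from the directory in /scratch where the file should land.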

Transferring files between NYU NY and NYUAD Dalma

For faster transfer between the HPC clusters at NYU in NYC and the Dalma cluster at NYUAD, use scp over port 922. This will route the transfer over a high-bandwidth ANKABUT link rather than the default low-bandwidth MPLS link. The speed difference is greatest when pulling files from Dalma to NYU NY.

Transferring many small files will still be slow - you will get better performance if you tar small files into a single archive, and transfer the tar file.

You can scp over port 922 directly with the following commands, initiated from any of the NYU HPC clusters in NYC:

Pushing to dalma

$ scp -P 922 filename NetID@dalma.abudhabi.nyu.edu:~/

Pulling from dalma

$ scp -P 922 NetID@dalma.abudhabi.nyu.edu:~/filename .

Setting an alias for fast data transfer between Dalma and NYC

To save typing this each time, and to support faster transfers between NYUAD and NYU when using rsync, you can prepare an alias in your ~/.ssh/config file on the NYU HPC cluster:

Host dalma
Hostname dalma.abudhabi.nyu.edu

Host dalma-fast
Hostname dalma.abudhabi.nyu.edu
Port 922

Now you can log in to dalma from that HPC cluster with:

$ ssh dalma

And scp files to and from dalma over the fast link with:

$ scp my_data.nc dalma-fast:
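
The same alias works with rsync, because rsync runs over ssh and so picks up the port from ~/.ssh/config. A minimal sketch, assuming the dalma-fast alias above is in place; the function name sync_to_dalma and the directory names in the usage note are hypothetical.

```shell
# Hypothetical helper: preview, then perform, an rsync over the dalma-fast alias.
# Usage: sync_to_dalma <local_directory> <directory_on_dalma>
sync_to_dalma() {
    src=$1
    dest=$2
    # Dry run first (-n) to see what would be transferred.
    rsync -nav "$src" "dalma-fast:$dest" || return 1
    # Real transfer, using the port set for the dalma-fast alias.
    rsync -av "$src" "dalma-fast:$dest"
}
```

For example, "sync_to_dalma my_run_dir results" would copy my_run_dir into the directory results under your home directory on dalma.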

Keeping directories in sync with rsync

For managing source code that you are developing and running on multiple computers (including the NYU HPC clusters), we highly recommend using a revision control system such as git. This can help keep your source code synchronized between different computers, and works over the SSH tunnel between your workstation and the NYU HPC clusters. We aim to provide a tutorial on revision control for NYU users soon; in the meantime, there is some good introductory material at the git website and a good reference guide at stackoverflow.

For replicating datasets between the HPC clusters and your workstation, or between two filesystems on an HPC cluster, rsync offers powerful functionality beyond that of cp or scp. With rsync you can copy directories between your workstation and the HPC clusters - or between different filesystems - in such a way that permissions and file modification timestamps are preserved, and that only files which have changed are transferred.

For Windows users

Windows does not have a good native terminal application, so you may be unable to use rsync. There are rsync-like applications for Windows, but we do not have experience with them. One promising option is to run rsync from Cygwin.

The basic usage of rsync is:

rsync [options] source [source] destination

Where source is a list of one or more source files or directories to copy and destination is a directory into which to copy source. Commonly useful options are:

  • -a "Archive" mode - permissions and timestamps of the source are replicated at the destination.
  • -v "Verbose".
  • -n "Dry run" - don't actually do anything, just indicate what would be done.
  • -C "Follow CVS ignore conventions" - more on this below.

Whether rsync treats destination as a new name for the copy of source, a parent directory into which to copy source, or a parent directory into which to place the contents of source, depends on the exact context of the command. For this reason, it is highly advisable to first run rsync with -n and -v to see exactly what rsync will do before issuing the "real" command, e.g.:

$ rsync -nav source destination
$ rsync -av source destination

For detailed information about rsync type "man rsync" at the command line.

Some examples of rsync usage

  • To replicate in your /scratch area a directory tree you have saved in /work:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ rsync -av my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ ls -F /scratch/$USER/my_run_dir
    my_input_data/

    There is now a copy of my_input_data directory under /scratch/$USER/my_run_dir
    If you append / to source, rsync will copy the contents of source rather than the source directory itself:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ rsync -av my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ ls -F /scratch/$USER/my_run_dir
    file1
    file2 

  • To replicate on prince a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running: 

    $ hostname
    my_workstation
    $ ls -F
    my_input_data/
    $ rsync -av my_input_data prince:/scratch/\$USER/my_run_dir

    The host name followed by a colon tells rsync that the destination (in this case) is on another host. Note the backslash in \$USER - it stops the local shell from expanding $USER to your username on your workstation.
    If your username on the other host is different from your username on the current host, you can specify the remote username with username@remotehost:. An equivalent command that spells out the NetID is:

    $ ls -F
    my_input_data/
    $ rsync -av my_input_data NetID@prince:/scratch/NetID/my_run_dir

  • To copy in the other direction, from /scratch on prince to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):

    $ hostname
    my_workstation
    $ rsync -av prince:/scratch/\$USER/my_run_dir my_results
    $ ls my_results

    Only those files not already up-to-date on your workstation will be copied.
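
The effect of the backslash in \$USER in the examples above can be checked locally with echo before running rsync. In this sketch the user name local_name is a made-up value, set only for the demonstration, and nothing is sent to prince.

```shell
# Pretend the workstation user name differs from the NetID (hypothetical value).
USER=local_name

# Unescaped: the local shell substitutes the workstation user name.
echo prince:/scratch/$USER/my_run_dir

# Escaped: the text $USER is passed through unchanged by the local shell.
echo prince:/scratch/\$USER/my_run_dir
```

The first line printed contains local_name, which would be the wrong directory on prince; the second still contains the literal text $USER.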

Ignoring certain files

The -C option tells rsync to follow CVS conventions about ignoring certain files. For example, when copying a tree of source code, you probably want the .c, .f and .h files but not the .o files. The conventions are described fully in the man page (man rsync). In summary, when -C is used the following files are ignored:

  • Any file or directory whose name matches any of:
    RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/

     

  • Any file whose name matches a pattern listed in the environment variable CVSIGNORE. This environment variable takes a list of patterns separated by spaces, such as the default list above. When defining CVSIGNORE you will need to enclose the definition in quotation marks. For example, to skip Fortran output to unnamed unit numbers (whose files have names like fort.99) and NetCDF files whose names end in ".intermediate.nc", set CVSIGNORE as follows (note that this syntax is for bash):

    $ export CVSIGNORE="fort.?? *.intermediate.nc"

  • Any file whose name matches a pattern listed in the file $HOME/.cvsignore, or in a file named .cvsignore within a directory being copied. This file has contents as per $CVSIGNORE, but with one pattern per line, for example:

    $ cat .cvsignore
    fort.??
    *.intermediate.nc
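
Before relying on a pattern list, it can help to check which file names it would catch. The sketch below does this with the shell's own pattern matching, which agrees with rsync for simple names like these but is not guaranteed to match rsync's rules in every case. The function name matches_cvsignore and the test file names are hypothetical.

```shell
# Hypothetical helper: succeed if a file name matches any pattern in CVSIGNORE.
matches_cvsignore() {
    name=$1
    set -f                      # stop the shell expanding the patterns against local files
    for pattern in $CVSIGNORE; do
        case "$name" in
            $pattern) set +f; return 0 ;;
        esac
    done
    set +f
    return 1
}

export CVSIGNORE="fort.?? *.intermediate.nc"
for f in fort.99 fort.100 run1.intermediate.nc run1.final.nc; do
    if matches_cvsignore "$f"; then
        echo "$f: ignored"
    else
        echo "$f: copied"
    fi
done
```

Here fort.99 and run1.intermediate.nc are reported as ignored, while fort.100 (three characters after the dot, so "fort.??" does not match) and run1.final.nc are reported as copied. A dry run with rsync -navC remains the authoritative check.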