
Copying Files To/From Prince

This page is being updated to reflect copying files to/from the new cluster Prince.

Before you start

To copy data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel on the workstation. We have instructions for this for Windows, Mac and Linux workstations.

Transferring files to and from a Windows workstation with WinSCP

First you will need to install WinSCP. You can download it from here.

The instructions below are demonstrated in this video tutorial. Note that the first half of the video shows how to set up the SSH tunnel, as you have already done. Once you have set up the SSH tunnel you need not repeat that process, but at the beginning of each session you will need to start the tunnel by launching the shortcut you created during the setup.

Once your tunnel is running, you can launch WinSCP from the Start menu. You will be presented with a dialog like the one below. Select "Session".

During the SSH setup you created a tunnel, one end of which was probably at "localhost, port 8023". The hostname and port number in this dialog should be the same as you used for Step 2: Logging in via your SSH tunnel (again, probably "localhost" and "8023").

You can enter your user name (NetID) under "User name" if you wish. If you don't, WinSCP will ask for your user name and password when you click "Login".

Save the session by clicking "Save", so that next time you need not re-enter these details.

Click "Login". You will be asked for a username and password if you did not already enter them - these are your NetID and password.

Next you will see a window with two panes: the one on the left is your workstation and the one on the right is the HPC cluster. You can copy files from one to the other by dragging and dropping them.

Transferring files to and from a Linux or Mac workstation with scp

When you are comfortable with the command line interface, scp over the command line is the easiest and best way to transfer files between hosts. For Mac users who prefer a GUI, we have video guides for:

The command line procedure described below is the same on Linux as on Mac. We have a video guide for this too. Note that the first half of the video shows setting up the SSH tunnel, which you have already done.

You will need to start your tunnel before you can log in or transfer files over it - this is also covered in the guide to setting up SSH tunneling.

Transferring files between hosts with scp

In a new terminal, go to the directory on your workstation that you wish to transfer files to or from:

$ cd ~/my_data

On Unix systems - including OS X - the "~" symbol is a shortcut to your home directory.

The syntax of the scp command is:

$ scp source destination

Either source or destination can be on another (remote) host; indicate this by prefixing the path with "hostname:". If your user name on the remote host differs from your user name on the host you are working on, you can specify it with "username@hostname:".

During the SSH tunneling setup, you created aliases for each of the NYU HPC clusters in your .ssh/config file. To scp files through the tunnel you will refer to the alias for the HPC cluster of interest, for example:

$ scp my_script.q mercer:

$ scp my_source_file.c mercer:my_code/

$ scp my_data_file.nc mercer:/scratch/NetID/

$ scp mercer:my_code/my_source_file.c .

Note that we don't use "username@mercer" because your username was specified in the .ssh/config file (eg Windows, Mac/Linux).

Important

Spaces matter - there must be no spaces between the remote hostname and the path on that host. If the file you wish to transfer has spaces in its name, you must escape each space with a backslash (\). For example:

$ ls

file with spaces in its name.c

$ scp file\ with\ spaces\ in\ its\ name.c mercer:

If you use tab-completion to save typing the full name, the tab-completion will insert the backslashes correctly.

Better still: Don't put spaces in file names, even when you are using Windows or Mac!
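Quoting the name is an alternative to escaping each space. The sketch below uses printf in place of scp, so that it runs without a tunnel, to show that both forms pass the same single argument to the command, while the unprotected name is split into separate words. The file name is the one from the example above; the variable names are only for illustration.

```shell
# Both spellings produce one argument that contains spaces.
escaped=$(printf '%s\n' file\ with\ spaces\ in\ its\ name.c)
quoted=$(printf '%s\n' "file with spaces in its name.c")

# Unquoted and unescaped, the shell splits the name into separate arguments;
# printf prints one line per argument, so counting lines counts arguments.
split_count=$(printf '%s\n' file with spaces in its name.c | wc -l | tr -d ' ')

echo "escaped: $escaped"
echo "quoted:  $quoted"
echo "unprotected name splits into $split_count arguments"
```

With scp, the quoted form of the earlier command would be: scp "file with spaces in its name.c" mercer: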

Specifying a directory on the remote host

In the first example above we copied the file "my_script.q" from the current directory on the workstation (the local host) to the home directory on mercer (the remote host).

In the second example, we copied "my_source_file.c" from the current directory on the workstation to the directory "my_code", which is a subdirectory of our home directory on mercer. This destination directory must already exist! The trailing "/" tells scp to expect a directory, so if the directory does not already exist on mercer, scp will exit with an error message like "scp: my_code/: Is a directory". If you omit the trailing "/" and "my_code" does not exist, scp copies "my_source_file.c" to your home directory on mercer and gives it the new name "my_code".

In the third example, the destination directory begins with a forward slash (/), which means the path is absolute rather than relative to your home directory. This allows us to copy to a destination outside of our home directory, such as /scratch.

The fourth example transfers a file from a remote host (mercer) to the current directory (.) on the local host.

Transferring multiple files using wildcards

The more general usage of scp allows multiple files to be copied to a single destination, provided the destination is a directory, for example:

$ scp source_file_1.c source_file_2.c mercer:my_code/

$ scp *.c mercer:my_code/

In the first example we explicitly copy two source files to a directory on mercer. In the second example we use wildcard expansion to copy all files with the suffix ".c" to mercer. Note that the shell replaces "*.c" with a list of files matching that pattern, so the actual command which gets executed is more like the first example.

To use wildcard expansion on the remote host, we must escape the wildcards with a backslash (\) to prevent the local shell from attempting to expand them. For example:

$ scp mercer:my_code/\*.c .

The above command will pass "*.c" to Mercer to be expanded there. This is equivalent to a command like:

$ scp mercer:my_code/source_file_1.c mercer:my_code/source_file_2.c .
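To see what the local shell does with a wildcard before scp ever runs, you can put echo in front of the command. The sketch below does this in a temporary directory with hypothetical file names; nothing is transferred.

```shell
# Work in a scratch directory so the demonstration touches nothing else.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch source_file_1.c source_file_2.c notes.txt

# Unescaped: the local shell replaces *.c with the matching local files.
local_expansion=$(echo scp *.c mercer:my_code/)

# Escaped: the pattern is passed through untouched, for the remote host to expand.
remote_pattern=$(echo scp mercer:my_code/\*.c .)

echo "$local_expansion"
echo "$remote_pattern"
```

The first line printed lists the two .c files explicitly; the second still contains the literal "*.c".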

If you are aiming to keep a directory on the remote and local host synchronized, rsync may be better suited to your needs.

Transferring over a non-default port

There is a brief introduction to ports in the section on setting up SSH tunneling.

The default port used by scp is port 22, but you can specify a different port with the -P switch. This is useful for faster transfer to and from BuTinah. If you initiated a one-off SSH tunnel, you will not have an alias for the HPC cluster, so to use the tunnel you address the remote host as the chosen port on your workstation, for example:

$ scp -P 8021 source_file_1.c NetID@localhost:my_code

This is not the recommended way to use SSH tunneling.

Fetching data from the Internet

The login and compute nodes of the NYU HPC clusters can access the Internet, even though they cannot be directly accessed from the Internet. You can, therefore, directly download datasets from the HPC clusters. The HPC clusters have a few command-line tools for downloading data:

Transferring files between NYU NY and NYUAD BuTinah

For faster transfer between the HPC clusters at NYU in NYC and the BuTinah cluster at NYUAD, use scp over port 922. This will route the transfer over a high bandwidth ANKABUT link rather than the default low bandwidth MPLS link. The speed difference is greatest when pulling files from BuTinah to NYU NY.

Transferring many small files will still be slow - you will get better performance if you tar small files into a single archive, and transfer the tar file.
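A minimal sketch of bundling small files before a transfer follows. The directory and file names are hypothetical, and gzip compression (-z) is an optional choice rather than a requirement.

```shell
# Build a scratch directory holding several small files (hypothetical names).
work=$(mktemp -d)
mkdir "$work/small_files"
for i in 1 2 3 4 5; do
    echo "sample $i" > "$work/small_files/part_$i.txt"
done

# Pack the directory into one compressed archive. -C sets the directory tar
# works from, so the archive stores relative paths only.
tar -czf "$work/small_files.tar.gz" -C "$work" small_files

# List the archive contents to confirm what would be transferred.
file_count=$(tar -tzf "$work/small_files.tar.gz" | grep -c 'part_')
echo "archive holds $file_count files"
```

The single archive can then be transferred with scp and unpacked at the destination with tar -xzf.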

The default user environment on bowery sets an alias for scp which does this automatically, so in most cases you can skip over this section. If you are finding that file transfers between NYUAD and NYU are slow, you can check whether you are using the alias with 'which scp'. If the response is not '/usr/local/bin/scp_wrapper.sh', you should follow the instructions below.

You can scp over port 922 directly with the following commands, initiated from any of the NYU HPC clusters in NYC:

Pushing to BuTinah:

$ scp -P 922 filename NetID@butinah.abudhabi.nyu.edu:~/

Pulling from BuTinah:

$ scp -P 922 NetID@butinah.abudhabi.nyu.edu:~/filename .

Setting an alias for fast data transfer between BuTinah and NYC

To save typing this each time, and to support faster transfers between NYUAD and NYU when using rsync, you can prepare an alias in your ~/.ssh/config file on the NYU HPC cluster:

Host butinah
Hostname butinah.abudhabi.nyu.edu
 
Host butinah-fast
Hostname butinah.abudhabi.nyu.edu
Port 922

Now you can log in to BuTinah from that HPC cluster with:

$ ssh butinah

And scp files to and from BuTinah over the fast link with:

$ scp my_data.nc butinah-fast:
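To check what an alias resolves to without connecting, OpenSSH clients from version 6.8 onwards can print the resolved settings with ssh -G. The sketch below writes the two entries above to a temporary file, used here as a stand-in for ~/.ssh/config, and queries it. It assumes an OpenSSH client is installed and skips the check otherwise.

```shell
# Write the two Host entries to a throwaway config file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host butinah
Hostname butinah.abudhabi.nyu.edu

Host butinah-fast
Hostname butinah.abudhabi.nyu.edu
Port 922
EOF

# ssh -G prints the settings ssh would use for a host, without connecting.
if command -v ssh >/dev/null 2>&1; then
    ssh -F "$cfg" -G butinah-fast 2>/dev/null | grep -E '^(hostname|port) ' || true
else
    echo "ssh client not found; skipping the check"
fi
```

Against your real configuration the equivalent command would be ssh -G butinah-fast, which should report port 922 if the alias is set up as above.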

Keeping directories in sync with rsync

For managing source code that you are developing and running on multiple computers (including the NYU HPC clusters), we highly recommend using a revision control system such as git. This can help keep your source code synchronized between different computers, and works over the SSH tunnel between your workstation and the NYU HPC clusters. We aim to provide a tutorial on revision control for NYU users soon; in the meantime, there is some good introductory material at the git website and a good reference guide at Stack Overflow.

For replicating datasets between the HPC clusters and your workstation, or between two filesystems on an HPC cluster, rsync offers powerful functionality beyond that of cp or scp. With rsync you can copy directories between your workstation and the HPC clusters - or between different filesystems - in such a way that permissions and file modification timestamps are preserved, and only files which have changed are transferred.

For Windows users

Windows does not have a good native terminal application, so you may be unable to use rsync. There are rsync-like applications for Windows, but we do not have experience with them. One promising option is to run rsync from Cygwin.

The basic usage of rsync is:

rsync [options] source [source] destination

Where source is a list of one or more source files or directories to copy and destination is a directory into which to copy source. Commonly useful options are:

  • -a "Archive" mode - permissions and timestamps of the source are replicated at the destination.
  • -v "Verbose".
  • -n "Dry run" - don't actually do anything, just indicate what would be done.
  • -C "Follow CVS ignore conventions" - more on this below.

Whether rsync treats destination as a new name for the copy of source, a parent directory into which to copy source, or a parent directory into which to place the contents of source, depends on the exact form of the command. For this reason, it is highly advisable to first run rsync with -n and -v to see exactly what rsync will do before issuing the "real" command, e.g.:

$ rsync -nav source destination
$ rsync -av source destination

For detailed information about rsync type "man rsync" at the command line.

Some examples of rsync usage:

  • To replicate in your /scratch area a directory tree you have saved in /work:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ rsync -av my_input_data /scratch/$USER/my_run_dir
    building file list ... done
    my_input_data/file1
    my_input_data/file2

    $ ls -F /scratch/$USER/my_run_dir
    my_input_data/


    There is now a copy of the my_input_data directory under /scratch/$USER/my_run_dir.

    If you append / to source, rsync will copy the contents of source rather than the source directory itself:

    $ cd /work/$USER/
    $ ls -F
    my_input_data/
    $ rsync -nav my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ rsync -av my_input_data/ /scratch/$USER/my_run_dir
    building file list ... done
    file1
    file2

    $ ls -F /scratch/$USER/my_run_dir
    file1
    file2 

 

  • To replicate on Mercer a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running:

    $ hostname
    my_workstation
    $ ls -F
    my_input_data/
    $ rsync -av my_input_data mercer:/scratch/\$USER/my_run_dir

    The host name followed by a colon tells rsync that the destination (in this case) is on another host. Note the backslash in \$USER - this instructs the local shell not to expand $USER to your username on your workstation, so that it is expanded on the remote host instead.
    If your username on the other host is different to the username on the current host, you can specify the remote username with username@remotehost: and write the path out in full. An equivalent command is:

    $ ls -F
    my_input_data/
    $ rsync -av my_input_data NetID@mercer:/scratch/NetID/my_run_dir

  • To copy in the other direction, from /scratch on Mercer to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):

    $ hostname
    my_workstation
    $ rsync -av mercer:/scratch/\$USER/my_run_dir my_results
    $ ls my_results

    Only those files not already up-to-date on your workstation will be copied.
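The incremental behaviour can be tried out safely between two local directories before using it over the tunnel. The sketch below uses temporary directories as stand-ins for the remote and workstation sides; all paths and file contents are hypothetical, and the rsync steps are skipped if rsync is not installed.

```shell
# Local stand-ins for the remote run directory and the workstation copy.
work=$(mktemp -d)
mkdir -p "$work/my_run_dir" "$work/my_results"
echo "result A" > "$work/my_run_dir/file1"
echo "result B" > "$work/my_run_dir/file2"

if command -v rsync >/dev/null 2>&1; then
    # First pass copies everything.
    rsync -a "$work/my_run_dir" "$work/my_results"

    # Change one file, then preview with a dry run (-n): nothing is copied yet.
    echo "result B, revised" > "$work/my_run_dir/file2"
    rsync -nav "$work/my_run_dir" "$work/my_results"

    # The real pass brings the copy up to date.
    rsync -av "$work/my_run_dir" "$work/my_results"
else
    echo "rsync not found; skipping the demonstration"
fi
```

In the verbose output of the last two commands, the unchanged file1 should not be listed.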

 

 

Ignoring certain files

 

The -C option tells rsync to follow CVS conventions about ignoring certain files. For example, when copying a tree of source code, you probably want the .c, .f and .h files but not the .o files. The conventions are described fully in the man page (man rsync). In summary, when -C is used the following files are ignored:

  • Any file or directory whose name matches any of:
    RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/


  • Any file whose name matches a pattern listed in the environment variable CVSIGNORE. This environment variable takes a list of patterns separated by spaces, such as the default list above. When defining CVSIGNORE you will need to enclose the definition in quotation marks. For example, to skip Fortran output to unnamed unit numbers (whose files have names like fort.99) and netcdf files whose names end in ".intermediate.nc", set CVSIGNORE as follows (note that this syntax is for bash):

    $ export CVSIGNORE="fort.?? *.intermediate.nc"

     

  • Any file whose name matches a pattern listed in the file $HOME/.cvsignore, or in a file named .cvsignore within a directory being copied. This file has contents as per $CVSIGNORE, but with one pattern per line, for example:

    $ cat .cvsignore
    fort.??
    *.intermediate.nc
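Before relying on a pattern, it can help to check which file names it matches. For simple name patterns like the two above, the shell's case statement uses the same ? and * wildcards, so the sketch below approximates the matching; rsync itself remains the final authority. The helper is_ignored and the sample file names are hypothetical.

```shell
# Return success if the name matches one of the two example patterns.
is_ignored() {
    case "$1" in
        fort.??|*.intermediate.nc) return 0 ;;
        *) return 1 ;;
    esac
}

# Classify a few sample names and remember the ones that would be skipped.
ignored=""
for name in fort.99 fort.7 run1.intermediate.nc run1.final.nc main.f; do
    if is_ignored "$name"; then
        echo "skipped: $name"
        ignored="$ignored $name"
    else
        echo "copied:  $name"
    fi
done
```

Note that fort.7 is not matched, because "??" requires exactly two characters after the dot.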

     

     

 

 

 

 

 
