Copying Files To/From Prince
This page is being updated to reflect copying files to/from the new cluster Prince.
Before you start
To copy data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel on the workstation. We have instructions for this for Windows, Mac and Linux workstations.
Transferring files to and from a Windows workstation with WinSCP
First you will need to install WinSCP. You can download it from here.
The instructions below are demonstrated in this video tutorial. Note that the first half of the video shows how to set up the SSH tunnel, as you have already done. Once you have set up the SSH tunnel you need not repeat that process, but at the beginning of each session you will need to start the tunnel by launching the shortcut you created during the setup.
Once your tunnel is running, you can launch WinSCP from the Start menu. You will be presented with a dialog like the one below. Select "Session".
During the SSH setup you created a tunnel, one end of which was probably at "localhost, port 8023". The hostname and port number in this dialog should be the same as you used for Step 2: Logging in via your SSH tunnel (again, probably "localhost" and "8023").
You can enter your user name (NetID) under "User name" if you wish. If you don't, WinSCP will ask for your user name and password when you click "Login".
Save the session by clicking "Save", so that next time you need not re-enter these details.
Click "Login". You will be asked for a username and password if you did not already enter them - these are your NetID and password.
Next you will see a window with two panes: the one on the left is your workstation and the one on your right is the HPC cluster. You can copy files from one to the other by dragging and dropping them.
Transferring files to and from a Linux or Mac workstation with scp
Once you are comfortable with the command line interface, scp over the command line is the easiest and best way to transfer files between hosts. For Mac users who prefer a GUI, we also have video guides.
The command line procedure described below is the same on Linux as on Mac. We have a video guide for this too; note that the first half of the video shows setting up the SSH tunnel, which you have already done.
You will need to start your tunnel before you can log in or transfer files over it - this is also covered in the guide to setting up SSH tunneling.
Transferring files between hosts with scp
In a new terminal, go to the directory on your workstation that you wish to transfer files to or from:
On Unix systems - including OS X - the "~" symbol is a shortcut to your home directory.
The syntax of the scp command is:
Either the source or the destination can be on another (remote) host, by prefixing the path with "hostname:". If your user name on the remote host differs from your user name on the host you are working on, you can specify it with "username@hostname:".
During the SSH tunneling setup, you created aliases for each of the NYU HPC clusters in your .ssh/config file. To scp files through the tunnel you will refer to the alias for the HPC cluster of interest, for example:
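The example commands from the original page are not reproduced here. The sketch below reconstructs four transfers that match the descriptions in the following section, assuming bowery is an alias defined in your .ssh/config. The names my_data.dat and results.txt and the /scratch/my_netid path are illustrative, not taken from the original. Each command is prefixed with echo so that the block only prints what would be run; remove echo to perform the transfer.

```shell
# "bowery" is assumed to be an alias defined in ~/.ssh/config.
cluster=bowery

# 1. Copy a file to your home directory on the cluster.
echo scp my_script.q "${cluster}:"

# 2. Copy a file into an existing subdirectory of your home directory.
echo scp my_source_file.c "${cluster}:my_code/"

# 3. Copy to an absolute path (hypothetical directory outside your home).
echo scp my_data.dat "${cluster}:/scratch/my_netid/"

# 4. Copy a file from the cluster to the current local directory.
echo scp "${cluster}:results.txt" .
```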
Spaces matter - there must be no spaces between the remote hostname and the path on that host. If the file you wish to transfer has spaces in its name, you must escape each space with a backslash (\). For example:
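As a small illustration of the escaping, the sketch below creates a file with spaces in its name in a scratch directory and shows how the name is written on the command line. The file name is made up for the example, and the scp command is only printed (via echo), not run.

```shell
# Work in a throwaway directory so nothing in your own files is touched.
cd "$(mktemp -d)"

# Each space in the name is escaped with a backslash.
touch my\ results\ file.txt
ls my\ results\ file.txt

# The same escaping applies to the local file name given to scp.
# Remove "echo" to perform the transfer ("bowery" is an assumed alias).
echo scp my\ results\ file.txt bowery:
```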
If you use tab-completion to save typing the full name, the tab-completion will insert the delimiters correctly.
Better still: Don't put spaces in file names, even when you are using Windows or Mac!
Specifying a directory on the remote host
In the first example above we copied the file "my_script.q" from the current directory on the workstation (the local host) to the home directory on bowery (the remote host).
In the second example, we copied "my_source_file.c" from the current directory on the workstation to the directory "my_code", which is a subdirectory of our home directory on bowery. This destination directory must already exist! The trailing "/" tells scp to expect a subdirectory, so if the subdirectory does not already exist on bowery, scp will exit with an error message like "scp: my_code/: Is a directory". If you omit the trailing "/", scp assumes you wish to copy "my_source_file.c" to your home directory on bowery but give it a new name "my_code".
In the third example, the destination directory begins with a forward slash (/), which means the path is absolute rather than relative to your home directory. This allows us to copy to a destination outside of our home directory.
The fourth example transfers a file from a remote host (bowery) to the current directory (.) on the local host.
Transferring multiple files using wildcards
The more general usage of scp allows multiple files to be copied to a single destination, provided the destination is a directory, for example:
In the first example we explicitly copy two source files to a directory on bowery. In the second example we use wildcard expansion to copy all files with the suffix ".c" to mercer. Note that the shell replaces "*.c" with a list of files matching that pattern, so the actual command which gets executed is more like the first example.
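The wildcard expansion described above can be observed locally by letting echo print the command after the shell has expanded it. This is a sketch in a scratch directory: the file names and the my_code directory are illustrative, and bowery and mercer are assumed to be aliases from your .ssh/config. Remove echo to perform the transfers.

```shell
# Create a few sample files in a throwaway directory.
cd "$(mktemp -d)"
touch main.c util.c notes.txt

# Explicit list of source files, copied to a directory on the cluster.
echo scp main.c util.c bowery:my_code/

# The shell expands *.c before scp runs, so this prints:
#   scp main.c util.c mercer:my_code/
echo scp *.c mercer:my_code/
```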
To use wildcard expansion on the remote host, we must escape the wildcards with a backslash (\) to prevent the local shell from attempting to expand them. For example:
The above command will pass "*.c" to Mercer to be expanded there. This is equivalent to a command like:
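A minimal sketch of the remote wildcard case, assuming mercer is an alias from your .ssh/config and my_code is a directory in your home directory there (both illustrative). The commands are only printed; remove echo to run them.

```shell
# The backslash stops the local shell from expanding the wildcard,
# so scp receives the pattern unchanged and the remote host expands it.
echo scp mercer:my_code/\*.c .

# Quoting the remote path has the same effect as the backslash.
echo scp 'mercer:my_code/*.c' .
```

Both lines print the same command, with the "*.c" pattern intact.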
If you are aiming to keep a directory on the remote and local host synchronized, rsync may be better suited to your needs.
Transferring over a non-default port
There is a brief introduction to ports in the section on setting up SSH tunneling.
The default port used by scp is port 22; however, you can specify a different port with the -P switch. This is useful for faster transfer to and from BuTinah. If you initiated a once-off SSH tunnel then you will not have an alias for the HPC cluster, so to use the tunnel your remote host is the chosen port on your workstation, for example:
This is not the recommended way to use SSH tunneling.
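A transfer through a once-off tunnel, as described above, might look like the following sketch. It assumes the local end of the tunnel is port 8023 on localhost (the value suggested during the SSH setup); my_netid is a placeholder for your NetID. The command is only printed; remove echo to run it.

```shell
# Local end of the SSH tunnel; substitute the port you chose.
port=8023

# Note the capital -P: scp uses -P for the port.
echo scp -P "$port" my_script.q my_netid@localhost:
```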
Fetching data from the Internet
The login and compute nodes of the NYU HPC clusters can access the Internet, even though they cannot be directly accessed from the Internet. You can, therefore, directly download datasets from the HPC clusters. The HPC clusters have a few command-line tools for downloading data:
sftp: to fetch data from a remote system - such as BuTinah
wget: for fetching webpages, and also effective for downloading links from webpages. There is a helpful usage guide for wget with examples at thegeekstuff.
curl: a newer tool with similar capabilities to wget and more. The documentation is at http://curl.haxx.se/docs/manpage.html and thegeekstuff also has a curl usage guide with examples.
Transferring files between NYU NY and NYUAD BuTinah
For faster transfer between the HPC clusters at NYU in NYC and the BuTinah cluster at NYUAD, use scp over port 922. This will route the transfer over a high bandwidth ANKABUT link rather than the default low bandwidth MPLS link. The speed difference is greatest when pulling files from BuTinah to NYU NY.
Transferring many small files will still be slow - you will get better performance if you tar small files into a single archive, and transfer the tar file.
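Bundling small files before a transfer can be sketched as follows. The directory and file names are made up for the example, and everything happens in a scratch directory, so nothing is actually transferred.

```shell
# Build a sample directory of small files in a throwaway location.
work=$(mktemp -d)
mkdir "$work/many_small_files"
for i in 1 2 3; do
    echo "sample $i" > "$work/many_small_files/part_$i.txt"
done

# Pack the whole directory into a single archive...
tar -cf "$work/many_small_files.tar" -C "$work" many_small_files

# ...and list its contents to check it before transferring the one tar file.
tar -tf "$work/many_small_files.tar"
```

On the receiving host, "tar -xf many_small_files.tar" unpacks the directory again.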
The default user environment on bowery sets an alias for scp which does this automatically, so in most cases you can skip over this section. If you are finding that file transfers between NYUAD and NYU are slow, you can check whether you are using the alias with 'which scp'. If the response is not '/usr/local/bin/scp_wrapper.sh', you should follow the instructions below.
You can use scp over port 922 directly with the following commands, initiated from any of the NYU HPC clusters in NYC:
Pushing to BuTinah:
Pulling from BuTinah:
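The original commands are not reproduced on this page, so the following is only a sketch of the two directions. The host name butinah.example.org is a placeholder (the real BuTinah host name is not given here), and my_archive.tar is an illustrative file name. The commands are only printed; remove echo to run them.

```shell
# Placeholder host name: substitute the real BuTinah login host.
butinah_host="butinah.example.org"

# Pushing to BuTinah over port 922.
echo scp -P 922 my_archive.tar "${butinah_host}:"

# Pulling from BuTinah over port 922.
echo scp -P 922 "${butinah_host}:my_archive.tar" .
```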
Setting an alias for fast data transfer between BuTinah and NYC
To save typing this each time, and to support faster transfers between NYUAD and NYU when using rsync, you can prepare an alias in your ~/.ssh/config file on the NYU HPC cluster:
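An alias of this kind might look like the sketch below. The alias name butinah and the host name butinah.example.org are placeholders, since the original entry is not reproduced here; only the port number 922 comes from the text above. The entry is written to a scratch file for inspection rather than to your real ~/.ssh/config.

```shell
# Write the example entry to a throwaway file instead of ~/.ssh/config.
config_preview=$(mktemp)
cat > "$config_preview" <<'EOF'
# Hypothetical alias for the fast link to BuTinah
Host butinah
    HostName butinah.example.org
    Port 922
EOF

# Review the entry before copying it into ~/.ssh/config by hand.
cat "$config_preview"

# With such an entry in place, the alias replaces the host name and port:
#   ssh butinah
#   scp my_archive.tar butinah:
```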
Now you can log in to BuTinah from that HPC cluster with:
You can also scp files to and from BuTinah over the fast link with:
Keeping directories in sync with rsync
For managing source code that you are developing and running on multiple computers (including the NYU HPC clusters), we highly recommend using a revision control system such as git. This can help keep your source code synchronized between different computers, and works over the SSH tunnel between your workstation and the NYU HPC clusters. We aim to soon provide a tutorial on revision control for NYU users. In the meantime, there is some good introductory material at the git website and a good reference guide at stackoverflow.
For replicating datasets between the HPC clusters and your workstation, or between two filesystems on an HPC cluster, rsync offers powerful functionality beyond that of scp. With rsync you can copy directories between your workstation and the HPC clusters - or between different filesystems - in such a way that permissions and file modification timestamps are preserved, and that only files which have changed are transferred.
For Windows users
Windows does not have a good native terminal application so you may be unable to use rsync. There are rsync-like applications for Windows, but we do not have experience with them. One promising option is to run rsync from Cygwin.
The basic usage of rsync is:
source is a list of one or more source files or directories to copy and destination is a directory into which to copy source. Commonly useful options are:
-a "Archive" mode - permissions and timestamps of the source are replicated at the destination.
-n "dry run" - don't actually do anything, just indicate what would be done.
-C "follow CVS ignore conventions" - more on this below.
Whether rsync treats destination as a new name for the copy of source, a parent directory into which to copy source, or a parent directory into which to place the contents of source, depends on the exact context of the command. For this reason, it is highly advisable to first run rsync with -n and -v to see exactly what rsync will do before issuing the "real" command, e.g.:
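A dry run can be tried safely on a scratch directory, as in the sketch below. The directory names are made up for the example, and the rsync call is skipped if rsync is not installed.

```shell
# Set up a small source tree and an empty destination.
work=$(mktemp -d)
mkdir -p "$work/my_data" "$work/backup"
echo "sample" > "$work/my_data/results.txt"

if command -v rsync >/dev/null 2>&1; then
    # -n: dry run, -a: archive mode, -v: list what would be transferred.
    # Nothing is copied; rsync only reports what it would do.
    rsync -n -a -v "$work/my_data" "$work/backup/"
fi
```

If the listing looks right, repeating the command without -n performs the copy.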
For detailed information about rsync type "man rsync" at the command line.
Some examples of rsync usage:
To replicate in your /scratch area a directory tree you have saved elsewhere:
There is now a copy of that directory tree in your /scratch area.
If you append / to source, rsync will copy the contents of source rather than the source directory itself:
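The effect of the trailing slash can be checked locally. The sketch below copies the same source tree twice within a scratch directory, once without and once with the trailing slash; all names are made up for the example, and the rsync calls are skipped if rsync is not installed.

```shell
# One source tree and two empty destinations in a throwaway directory.
work=$(mktemp -d)
mkdir -p "$work/src/sub" "$work/dest_a" "$work/dest_b"
echo "sample" > "$work/src/sub/data.txt"

if command -v rsync >/dev/null 2>&1; then
    # No trailing slash: the directory itself is copied.
    # Result: dest_a/src/sub/data.txt
    rsync -a "$work/src" "$work/dest_a/"

    # Trailing slash: only the contents of the directory are copied.
    # Result: dest_b/sub/data.txt
    rsync -a "$work/src/" "$work/dest_b/"
fi
```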
To replicate on Mercer a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running:
The host name followed by a colon tells rsync that the (in this case) destination is on another host. If your username on the other host is different to the username on the current host, you can specify the remote username by prefixing the host name with "username@".
Note the backslash in \$USER - this instructs the shell not to expand $USER to your local (on your workstation) username. An equivalent command is:
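The difference the backslash makes can be seen by printing the commands rather than running them. In this sketch mercer is assumed to be an alias from your .ssh/config, and my_data and the /scratch/$USER destination are illustrative. Remove echo to run a command.

```shell
# Escaped: the literal text $USER is sent to the remote host,
# which expands it to your username there.
echo rsync -av my_data mercer:/scratch/\$USER/

# Single quotes protect $USER from the local shell in the same way.
echo rsync -av my_data 'mercer:/scratch/$USER/'

# Unescaped: the local shell substitutes your workstation username
# before rsync runs, which may not match your username on the cluster.
echo rsync -av my_data mercer:/scratch/$USER/
```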
To copy in the other direction, from /scratch on Mercer to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):
Only those files not already up-to-date on your workstation will be copied.
Ignoring certain files
The -C option tells rsync to follow CVS conventions about ignoring certain files. For example, when copying a tree of source code, you probably want the .h files but not the .o files. The conventions are described fully in the man page (man rsync). In summary, when -C is used the following files are ignored:
Any file or directory whose name matches any of:
RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/
Any file whose name matches a pattern listed in the environment variable CVSIGNORE. This environment variable takes a list of patterns separated by spaces, such as the default list above. When defining CVSIGNORE you will need to enclose the definition in quotation marks. For example, to skip Fortran output to unnamed unit numbers (whose files have names like fort.99) and netcdf files whose name ends in ".nc", you would set CVSIGNORE as follows (note that this syntax is for BASH):
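A definition matching that description might look like the following sketch. The two patterns are derived from the examples in the text (fort.99 and netcdf files ending in ".nc").

```shell
# Quotation marks keep the space-separated patterns together and stop
# the shell from expanding the wildcards when the variable is set.
export CVSIGNORE="fort.* *.nc"

# Confirm the value that rsync -C will see.
echo "$CVSIGNORE"
```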
Any file whose name matches a pattern listed in the file $HOME/.cvsignore, or in a file named .cvsignore within a directory being copied. This file has contents as per $CVSIGNORE, but with one pattern per line, for example:
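As a sketch of such a file, the block below writes a .cvsignore with the same two illustrative patterns (fort.* and *.nc), one per line. It writes into a scratch directory standing in for a project directory, so an existing $HOME/.cvsignore is not overwritten.

```shell
# A throwaway directory standing in for the directory being copied.
project=$(mktemp -d)

# One pattern per line; the quoted 'EOF' prevents any shell expansion.
cat > "$project/.cvsignore" <<'EOF'
fort.*
*.nc
EOF

cat "$project/.cvsignore"
```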