When you are inside NYU network
The article on this wiki page assumes that you are outside of NYU network. When you are within NYU network, it is much simpler - SSH tunneling is not necessary!
Starting up the SSH tunnel before you begin
Transferring files to and from a Windows workstation with WinSCP
First you will need to install WinSCP. You can download it from here. Once you have set up the SSH tunnel you need not repeat that process, but at the beginning of each session you will need to start the tunnel by launching the shortcut you created during the setup.
Once your tunnel is running, you can launch WinSCP from the Start menu. You will be presented with a dialog like the one below.
During the SSH setup you created a tunnel, one end of which was probably at "localhost, port 8026". The hostname and port number in this dialog should be the same as you used for Step 2: Logging in via your SSH tunnel (again, probably "localhost" and "8026").
You can enter your user name (NetID) under "User name" if you wish. If you don't, WinSCP will ask for your user name and password when you click "Login".
Save the session by clicking "Save", so that next time you need not re-enter these details.
Click "Login". You will be asked for a username and password if you did not already enter them - these are your NetID and password.
Next you will see a window with two panes: the one on the left is your workstation and the one on the right is the HPC cluster. You can copy files from one to the other by dragging and dropping them.
Transferring files to and from a Mac workstation with FUGU
Fugu is a graphical frontend for the text-based Secure File Transfer Protocol (SFTP) client that ships with Mac OS X. SFTP is similar to FTP, but the entire session is encrypted, meaning nothing, including passwords, is sent in the clear. Fugu provides a friendly, intuitive interface for OpenSSH's SFTP client, featuring secure drag-and-drop file transfers, remote file editing and remote file preview. Fugu allows you to take advantage of SFTP's strong encryption without having to sacrifice ease of use. Additionally, Fugu can make secure file transfers using Secure Copy (SCP), and can create SSH tunnels to facilitate secure interaction between machines.
Step 1: Download and install Fugu (https://faq.oit.gatech.edu/content/prism-how-install-fugu-mac - the Universal binary version must be used with OSX, not the one on SourceForge, which gives a 'PowerPC apps not supported' error).
Step 2: Start Fugu. Select SSH > New SSH Tunnel and enter:
Create tunnel to: prince
Service or port: 22
Local port: 8026
Tunnel host: gw.hpc.nyu.edu
Username: NetID
Step 3: In the SFTP window, enter:
Connect to: localhost
Username: NetID
Port: 8026
Step 4: Click Connect and enter your NetID password. Drag and drop files to copy to and from the cluster.
Transferring files to and from Google drive with RCLONE
rclone ("rsync for cloud storage") is a command line program to sync files and directories to and from cloud storage systems such as Google Drive, Amazon Drive, S3, B2 etc. rclone can be invoked in one of three modes:
- Copy mode to just copy new/changed files
- Sync (one-way) mode to make the destination directory identical to the source
- Check mode to check for file hash equality
rclone is available on the Prince cluster as module rclone/1.35.
Please try with these options:
$ rclone --transfers=32 --checkers=16 --drive-chunk-size=16384k --drive-upload-cutoff=16384k copy source:sourcepath dest:destpath
These options work well for file sizes from about 1GB up to 250GB. Keep in mind that Google Drive rate-limits uploads to 2 files/sec, so transfers of many small files don't work well. If you have many small files, tar the parent directory of those folders, split the tar file into 100GB chunks, and upload the chunks into Google Drive.
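As a sketch of that tar-and-split workflow (my_small_files and the remote name remote1 are hypothetical; substitute your own directory and configured remote):

```shell
# tar the parent directory of the small files (my_small_files is a hypothetical name)
tar -cf my_small_files.tar my_small_files/
# split the archive into 100GB chunks with numeric suffixes (.00, .01, ...)
split -b 100G -d my_small_files.tar my_small_files.tar.
# upload each chunk to the rclone remote
for chunk in my_small_files.tar.*; do
    rclone copy "$chunk" remote1:backup
done
```

On the receiving end, the chunks can be reassembled with `cat my_small_files.tar.* > my_small_files.tar`.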
Login to Prince:
$ ssh -Y NetID@prince.hpc.nyu.edu
If necessary please read the wiki page on how to login to HPC clusters.
First we load the rclone module:
$ module load rclone/1.35
Configure rclone and set up remote access to your Google Drive using the command:
$ rclone config
You can select one of the options (here we show how to set up a new remote).
2017/02/24 10:21:00 Config file "/home/ad95/.rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
You enter n for a new remote connection and give it a name.
Then you choose the type of storage for which you are setting up the remote (here we show the method for setting up a remote for Google Drive, which is option 7):
Type of storage to configure.
Choose a number from below, or type in your own value
 1 / Amazon Drive
   \ "amazon cloud drive"
 2 / Amazon S3 (also Dreamhost, Ceph, Minio)
   \ "s3"
 3 / Backblaze B2
   \ "b2"
 4 / Dropbox
   \ "dropbox"
 5 / Encrypt/Decrypt a remote
   \ "crypt"
 6 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
 7 / Google Drive
   \ "drive"
 8 / Hubic
   \ "hubic"
 9 / Local Disk
   \ "local"
10 / Microsoft OneDrive
   \ "onedrive"
11 / Openstack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
12 / Yandex Disk
   \ "yandex"
Then you see a few messages like the ones below:
Google Application Client Id - leave blank normally.
client_id> (just press enter key here)
Google Application Client Secret - leave blank normally.
client_secret> (just press the enter key here)
Since you are accessing the cluster remotely, you have to select remote config, i.e. option n.
Use auto config?
* Say Y if not sure
* Say N if you are working on a remote or headless machine or Y didn't work
You will see a message similar to the one below:
If your browser doesn't open automatically go to the following link: https://accounts.google.com/o/oauth2/auth?client_id=202264815644.apps.googleusercontent.com&redirect_uri=urn...
Log in and authorize rclone for access.
You have to open this URL in your workstation's browser and authenticate to your Google Drive account. Once that is done you will get a screen that displays a secret key / verification code. Copy this key from the browser and paste it into the terminal. Once the terminal accepts the verification code it displays the options below:
y) Yes this is OK
e) Edit this remote
d) Delete this remote
You can select y if everything seems okay with the remote, or you can edit or delete it. You can also view the existing remotes.
Transfer files to Google Drive using the command below:
$ rclone copy <source_folder> <remote_name>:<name_of_folder_on_gdrive>
It looks something like below:
$ rclone copy /home/user1 remote1:backup_home_user1
The files are transferred and you can find the files on your Google drive.
Note: rclone only copies new files or files that differ from those already on Google Drive.
Transferring files to and from a Linux or Mac workstation with scp
If you are comfortable with the command line interface, scp is the easiest and best way to transfer files between hosts. The command line procedure described below is the same on Linux as on Mac. The first thing to do is to set up the SSH tunnel.
In a new terminal, go to the directory on your workstation that you wish to transfer files to or from:
On Unix systems - including OSX - the "~" symbol is a shortcut to your home directory.
The syntax of the scp command is:
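The code example is missing here; the general form of scp is probably best sketched as follows (square brackets mark optional parts):

```shell
scp [options] [[user@]host:]source [[user@]host:]destination
```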
The source or destination can be on another (remote) host: prefix the path with "hostname:". If your user name on the remote host differs from your user name on the host you are working on, you can specify it with "username@hostname:".
During the SSH tunneling setup, you created aliases for each of the NYU HPC clusters in your .ssh/config file. To scp files through the tunnel you will refer to the alias for the HPC cluster of interest, for example:
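The examples referred to below as the first through fourth were probably along these lines (my_data.tar, my_results.tar and mynetid are hypothetical names; the other file names come from the surrounding text):

```shell
scp my_script.sh dumbo:                    # 1st: to your home directory on dumbo
scp my_source_file.c prince:my_code/      # 2nd: to an existing subdirectory on prince
scp my_data.tar prince:/scratch/mynetid/  # 3rd: to an absolute path outside your home directory
scp prince:my_results.tar .               # 4th: from prince to the current local directory
```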
Note that we don't use "username@prince", because your username was already specified in the .ssh/config file.
Spaces matter - there must be no spaces between the remote hostname and the path on that host. If the file you wish to transfer has spaces in its name, you must delimit the spaces with a backslash (\). For example:
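For instance (the file name is hypothetical):

```shell
scp my\ file.txt prince:
```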
If you use tab-completion to save typing the full name, the tab-completion will insert the delimiters correctly.
Better still: Don't put spaces in file names, even when you are using Windows or Mac!
Specifying a directory on the remote host
In the first example above we copied the file "my_script.sh" from the current directory on the workstation (the local host) to the home directory on dumbo (the remote host).
In the second example, we copied "my_source_file.c" from the current directory on the workstation to the directory "my_code", which is a subdirectory of our home directory on prince. This destination directory must already exist! The trailing "/" tells scp to expect a subdirectory, so if the subdirectory does not already exist on prince, scp will exit with an error message like "scp: my_code/: Is a directory". If you omit the trailing "/", scp assumes you wish to copy "my_source_file.c" to your home directory on prince but give it a new name, "my_code".
In the third example, the destination directory begins with a forward slash (/), which means the path is absolute rather than relative to your home directory. This allows us to copy to a destination outside of our home directory, such as /scratch.
The fourth example transfers a file from a remote host (prince) to the current directory (".") on the local host.
Transferring multiple files using wildcards
The more general usage of scp allows multiple files to be copied to a single destination, provided the destination is a directory, for example:
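A sketch of the two examples discussed below (file1.c and file2.c are hypothetical names; my_code is the directory used earlier in this section):

```shell
scp file1.c file2.c prince:my_code/
scp *.c prince:my_code/
```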
In the first example we explicitly copy two source files to a directory on prince. In the second example we use wildcard expansion to copy all files with the suffix ".c" to prince. Note that the shell replaces "*.c" with a list of files matching that pattern, so the actual command which gets executed is more like the first example.
To use wildcard expansion on the remote host, we must delimit the wildcards with a backslash (\) to prevent the local shell from attempting to expand them. For example:
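A sketch of that command, again using the hypothetical my_code directory:

```shell
scp prince:my_code/\*.c .
```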
The above command will pass "*.c" to Prince to be expanded there. This is equivalent to a command like:
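That is, after expansion on prince, something like (file names hypothetical):

```shell
scp prince:my_code/file1.c prince:my_code/file2.c .
```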
If you are aiming to keep a directory on the remote and local host synchronized, rsync may be better suited to your needs.
Transferring over a non-default port
There is a brief introduction to ports in the section on setting up SSH tunneling.
The default port used by scp is port 22, but you can specify a different port with the -P switch. This is useful for faster transfer to and from dalma. If you initiated a once-off SSH tunnel then you will not have an alias for the HPC cluster, so to use the tunnel your remote host is the chosen port on your workstation, for example:
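For example, if your once-off tunnel ends at local port 8026 (as in the tunnel setup earlier on this page):

```shell
scp -P 8026 my_script.sh NetID@localhost:
```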
This is not the recommended way to use SSH tunneling
Fetching data from the Internet
The login and compute nodes of the NYU HPC clusters can access the Internet, even though they cannot be directly accessed from the Internet. You can, therefore, directly download datasets from the HPC clusters. The HPC clusters have a few command-line tools for downloading data:
sftp: to fetch data from a remote system - such as dalma
wget: for fetching webpages, and also effective for downloading files linked from webpages. There is a helpful wget usage guide with examples at thegeekstuff.
curl: a newer tool with capabilities similar to wget and more. The documentation is available here, and thegeekstuff also has a curl usage guide with examples.
Transferring files between NYU NY and NYUAD Dalma
For faster transfer between the HPC clusters at NYU in NYC and the Dalma cluster at NYUAD, use scp over port 922. This routes the transfer over a high-bandwidth ANKABUT link rather than the default low-bandwidth MPLS link. The speed difference is greatest when pulling files from Dalma to NYU NY.
Transferring many small files will still be slow - you will get better performance if you tar small files into a single archive, and transfer the tar file.
You can use scp over port 922 directly with the following commands, initiated from any of the NYU HPC clusters in NYC:
Pushing to dalma
Pulling from dalma
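A sketch of those commands, assuming dalma.abudhabi.nyu.edu is the Dalma login host and my_data.tar is a hypothetical file (check the Dalma documentation for the exact hostname):

```shell
# pushing to dalma
scp -P 922 my_data.tar NetID@dalma.abudhabi.nyu.edu:
# pulling from dalma
scp -P 922 NetID@dalma.abudhabi.nyu.edu:my_data.tar .
```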
Setting an alias for fast data transfer between Dalma and NYC
To save typing this each time, and to support faster transfers between NYUAD and NYU when using rsync, you can prepare an alias in your ~/.ssh/config file on the NYU HPC cluster:
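A sketch of such an alias (the HostName is an assumption; substitute your own NetID):

```
Host dalma
    HostName dalma.abudhabi.nyu.edu
    Port 922
    User NetID
```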
Now you can log in to dalma from that HPC cluster, and scp files to and from dalma over the fast link, using the alias:
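For example, with the alias in place (my_data.tar is a hypothetical file):

```shell
ssh dalma                # log in over the fast link
scp my_data.tar dalma:   # push a file
scp dalma:my_data.tar .  # pull a file
```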
Keeping directories in sync with rsync
For managing source code that you are developing and running on multiple computers (including the NYU HPC clusters), we highly recommend using a revision control system such as git. This can help keep your source code synchronized between different computers, and works over the SSH tunnel between your workstation and the NYU HPC clusters. We aim to soon provide a tutorial on revision control for NYU users, in the meantime there is some good introductory material at the git website and a good reference guide at stackoverflow.
For replicating datasets between the HPC clusters and your workstation, or between two filesystems on an HPC cluster, rsync offers powerful functionality beyond that of scp. With rsync you can copy directories between your workstation and the HPC clusters - or between different filesystems - in such a way that permissions and file modification timestamps are preserved, and only files which have changed are transferred.
For Windows users
Windows does not have a good native terminal application so you may be unable to use rsync. There are rsync-like applications for Windows, but we do not have experience with them. One promising option is to run rsync from Cygwin.
The basic usage of rsync is:
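The usage line is missing here; it is probably of this general form:

```shell
rsync [options] source destination
```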
source is a list of one or more source files or directories to copy and destination is a directory into which to copy source. Commonly useful options are:
-a : "Archive" mode - permissions and timestamps of the source are replicated at the destination.
-n : "Dry run" - don't actually do anything, just indicate what would be done.
-C : Follow CVS ignore conventions - more on this below.
Whether rsync treats destination as a new name for the copy of source, a parent directory into which to copy source, or a parent directory into which to place the contents of source depends on the exact context of the command. For this reason, it is highly advisable to first run rsync with -n and -v to see exactly what rsync will do before issuing the "real" command, eg:
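For example, a dry run of a copy into /scratch (my_project is a hypothetical directory name):

```shell
rsync -nav ~/my_project /scratch/$USER/
```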
For detailed information about rsync type "man rsync" at the command line.
Some examples of rsync usage:
To replicate in your /scratch area a directory tree you have saved in your home directory:
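The missing command was probably something like this (my_project is a hypothetical directory name):

```shell
rsync -av ~/my_project /scratch/$USER/
```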
There is now a copy of the directory tree in your /scratch area.
If you append / to source, rsync will copy the contents of source rather than the source directory itself:
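A sketch with a trailing slash on the source (again, my_project is hypothetical; the destination directory must exist):

```shell
rsync -av ~/my_project/ /scratch/$USER/my_project/
```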
To replicate on prince a data directory you have on your workstation - assuming you are entering these commands on a local terminal on your workstation, and you have an SSH Tunnel set up and running:
The host name followed by a colon tells rsync that the (in this case) destination is on another host. If your username on the other host is different to the username on the current host, you can specify the remote username with username@remotehost:
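The stripped example likely resembled the following (my_data is a hypothetical directory; NetID stands in for your NYU NetID):

```shell
rsync -av my_data NetID@prince:/scratch/\$USER/
```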
Note the backslash in \$USER - this instructs the shell not to expand $USER to your local (on your workstation) username. An equivalent command is:
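That is, spelling out the remote username instead of letting the remote shell expand $USER (names hypothetical):

```shell
rsync -av my_data NetID@prince:/scratch/NetID/
```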
To copy in the other direction, from /scratch on prince to your workstation (again, from a local terminal on your workstation and across an SSH Tunnel):
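Again a sketch (my_data is hypothetical):

```shell
rsync -av prince:/scratch/\$USER/my_data .
```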
Only those files not already up-to-date on your workstation will be copied.
Ignoring certain files
The -C option tells rsync to follow CVS conventions about ignoring certain files. For example, when copying a tree of source code, you probably want the .h files but not the .o files. The conventions are described fully in the man page (man rsync). In summary, when -C is used the following files are ignored:
Any file or directory whose name matches any of:
RCS SCCS CVS CVS.adm RCSLOG cvslog.* tags TAGS .make.state .nse_depinfo *~ #* .#* ,* _$* *$ *.old *.bak *.BAK *.orig *.rej .del-* *.a *.olb *.o *.obj *.so *.exe *.Z *.elc *.ln core .svn/
Any file whose name matches a pattern listed in the environment variable CVSIGNORE. This environment variable takes a list of patterns separated by spaces, such as the default list above. When defining CVSIGNORE you will need to enclose the definition in quotation marks. For example, to skip Fortran output to unnamed unit numbers (whose files have names like fort.99) and netcdf files whose name ends in ".nc", you can set CVSIGNORE as follows (note that this syntax is for bash):
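A sketch of such a definition (the ".nc" suffix for netcdf files is an assumption based on the surrounding text):

```shell
export CVSIGNORE="fort.* *.nc"
```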
Any file whose name matches a pattern listed in the file $HOME/.cvsignore, or in a file named .cvsignore within a directory being copied. This file has contents as per $CVSIGNORE, but with one pattern per line, for example:
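For example, a .cvsignore file mirroring the CVSIGNORE setting above might contain:

```
fort.*
*.nc
```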