Submitting Jobs to Deepthought


Quick Start

To submit a job to Deepthought, first connect to one of the login nodes by sshing to login.deepthought.umd.edu.
Next, you'll need to create a job script. This is just a simple shell script that will specify the necessary job parameters and then run your program.

Here's an example of a simple script, which we'll call test.sh:

#PBS -lwalltime=1:00
#PBS -lncpus=4

hostname
date

The first two lines specify parameters to the scheduler. The first, walltime, specifies the maximum amount of time you expect your job to run. The walltime parameter is in the form HH:MM:SS. If you leave off the leading fields, the ones you provide are taken to be the smallest units, so a walltime of 1:00 is equal to one minute. You should specify a reasonable estimate for this number: if it is too large, your job may not be scheduled appropriately, and if it is too small, your job will be terminated before it completes.

The second line tells the scheduler how many processors your job needs; here, ncpus=4 requests four. Most of the examples in this document use the more detailed form nodes=N:ppn=M instead. In that form, nodes tells the scheduler how many nodes you want your job to run on, and ppn defines how many processors you need on each node. Currently the definition of nodes is rather misleading: if you specify a ppn value smaller than the number of processors on a given node, you may end up with fewer actual nodes than you requested. For example, if you specify nodes=2:ppn=2 and a 4-CPU machine is available, you'll be allocated all 4 CPUs on that one node. If you want to be sure to get multiple machines, specifying ppn=4 is your best bet. For a more detailed method of specifying CPU/machine requirements, check out the examples section.

The remaining lines in the file are just standard commands; replace them with whatever your job requires. In this case, once the job runs, it will print the hostname and the date to the output file. By default the script will be run in whatever shell you use to log in to the cluster, so if your normal shell is tcsh then the script will be run inside tcsh. If you want to change this, check out the examples section.

To submit your job, pick a queue that fits your needs and then submit the job. We'll choose the serial queue for this test. (The serial queue is the default queue, but for this example we'll specify it anyway.)

deepthought:~: qsub -q serial test.sh 
4178.deepthought.umd.edu

The number that is returned to you is the identifier for the job, and you should use that anytime you want to find out more information about your job. For information on how to verify that your job is running, see the section Monitoring Your Jobs.

Once your job completes, unless you've specified otherwise, your output and any errors that occur will be written to two files in the same directory from which you submitted your job. The files will be named with the same name as your job script, with .eNNNN and .oNNNN appended, where the Ns are replaced by the job identifier.

Note that by default when you log in to the cluster, you are sitting in your home directory, and all output and submissions will be transferred to and from your home directory. For best performance, you should consider running your jobs from a space set aside for them. See Files and Storage and the qsub example on Running Your Job in a Different Directory for more information.

Here's what you should see when your job completes:

deepthought:~: cat test.sh.o4178
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
compute-2-39.deepthought.umd.edu
Mon Jan 22 11:13:09 EST 2007

deepthought:~: cat test.sh.e4178
term: Undefined variable.

As you can see in the output files above, the script ran and printed the hostname and date as specified by the job script. The few error messages that you see above are expected and can be ignored.
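
If you submit jobs from your own scripts, it can be handy to work out the output file names ahead of time. The following is only a sketch of the naming rule described earlier (script name plus .oNNNN and .eNNNN); job_output_files is a made-up helper name, not a command provided on the cluster.

```shell
#!/bin/sh
# Print the stdout and stderr file names for a job, following the rule
# <script name>.o<job number> and <script name>.e<job number>.
# job_output_files is a hypothetical helper, not a cluster command.
job_output_files() (
    script=$1
    jobnum=${2%%.*}     # "4178.deepthought.umd.edu" -> "4178"
    printf '%s.o%s\n%s.e%s\n' "$script" "$jobnum" "$script" "$jobnum"
)

# Example using the job identifier returned by qsub earlier in this section.
job_output_files test.sh 4178.deepthought.umd.edu
```

Since qsub prints the job identifier when you submit, you can capture that output in a shell variable and pass it straight to a helper like this one.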

Choosing a Queue

The queues on Deepthought are laid out in a manner that allows short running jobs to take priority over longer jobs. This means that if two jobs are waiting in the queue, the higher priority job will run first. Note however that if a job is already running in a queue, it will be allowed to run to completion before the next job is started. The only exception to this rule is for jobs in the serial queue, which will be preempted (evicted) if a higher priority job comes along.

In addition to queue priorities, users of the cluster with paid allocations (users that contribute money or resources to the cluster) get priority over non-paying users. All users are provided with a certain number of service units (SUs) as determined by the HPCC Allocations and Advisory Committee. In addition, "free" usage of the cluster is provided to users with paid or non-paid allocations, assuming cycles are available. "Free" jobs run at low priority and will be preempted (evicted) if a higher-priority job comes along.

To specify a queue, use the -q option to qsub. Note: paid users will also need to specify their high-priority account in order to take advantage of their elevated priority. If no account is specified, the default priorities will be used. See the section Job Accounting for information on how to specify an alternate account.

The queues are as follows:
queue            #nodes        wallclock        priority  notes
                 min   max     min   max
debug            -     2       -     15 min     high      always available; use for interactive jobs
wide-debug       5     100%    -     15 min     high
narrow-med       -     20%     -     8 hr       med
wide-short       5     100%    -     2 hr       med
narrow-long      -     20%     -     3 days     low
narrow-extended  -     25%     -     2 weeks    low       paid allocations only
med-extended     5     50%     -     1 week     low       paid allocations only
wide-med         5     100%    -     8 hr       low
ib               2     100%    -     1 week     high      InfiniBand connected hosts
serial           -     100%    -     unlimited  very low  free; preemptable

Job Submission Examples

When submitting a job with qsub, if you request more than one node, you'll need to know how to get your jobs to use all of the nodes that you've been assigned. The scheduler will run your script on the first node in the list. It's up to you to decide what to do with the remaining nodes. The scheduler sets the variable $PBS_NODEFILE, which contains the name of a file that lists all of the nodes that you've been assigned.
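
To see what you've actually been given, you can summarise the node file from inside your job script. This is a rough sketch that assumes the file contains one line per assigned processor (as in the node file listing shown under Interactive and Debug Jobs); nodefile_summary is a made-up helper name.

```shell
#!/bin/sh
# Summarise a PBS node file: number of processor slots and distinct nodes.
# Assumes one line per assigned processor.
# nodefile_summary is a hypothetical helper, not a cluster command.
nodefile_summary() (
    nodefile=$1
    slots=$(awk 'NF { n++ } END { print n + 0 }' "$nodefile")
    nodes=$(sort -u "$nodefile" | awk 'NF { n++ } END { print n + 0 }')
    echo "processors=$slots nodes=$nodes"
)

# Inside a job, the scheduler sets $PBS_NODEFILE; elsewhere this does nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    nodefile_summary "$PBS_NODEFILE"
fi
```

The processor count is the number you would otherwise hard-code as the process count for an MPI launcher.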

Submitting an MPI Job Using OpenMPI

OpenMPI is the preferred MPI unless your application specifically requires one of the alternate MPI variants. OpenMPI automatically "knows" about the contents of $PBS_NODEFILE and as such you don't need to include it on the command line. OpenMPI is also compiled to support all of the various interconnect hardware, so for nodes with fast transport (Infiniband/Myrinet), the fastest interface will be selected automatically.

The following example will run the MPI executable alltoall on each of four processors on ten different nodes. Note that you will need to add the command tap -q openmpi-gnu to your .cshrc.mine file to set up your environment properly to run OpenMPI. For further information on the tap command check out the section Setting Up Your Environment.

#PBS -l nodes=10:ppn=4
#PBS -l walltime=00:00:60

mpirun -np 40 alltoall

Submitting an MPI Job Using LAM

The following example will run the MPI executable alltoall on each of four processors on ten different nodes. Note that you will need to add the command tap -q lam-gnu (or one of the other MPI flavors) to your .cshrc.mine file to set up your environment properly to run LAM. For further information on the tap command check out the section Setting Up Your Environment.

#PBS -l nodes=10:ppn=4
#PBS -l walltime=00:00:60

lamboot $PBS_NODEFILE

mpirun C alltoall

lamhalt

If you see errors in your output of the form "LAM failed to execute a LAM binary on the remote node X", it is most likely because you failed to add the appropriate tap command to your .cshrc.mine file.

Submitting an MPI Job Using MPICH

The following example will run the MPI executable alltoall on each of four processors on ten different nodes. Note that you will need to add the command tap -q mpich-gnu (or one of the other MPI flavors) to your .cshrc.mine file to set up your environment properly to run MPICH. For further information on the tap command check out the section Setting Up Your Environment.

Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933. (DO NOT use the example provided; make up your own secret word.)

#PBS -l nodes=10:ppn=4
#PBS -l walltime=00:00:60

mpdboot -n 10 -f $PBS_NODEFILE

mpiexec -n 40 alltoall

mpdallexit
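
The .mpd.conf file only needs to be created once, before you submit your first MPICH job. Here's one possible way to do it with a randomly generated secret word. This is a sketch, not an official procedure: make_mpd_conf is a made-up helper name, and restricting the file to owner-only access is our assumption about what mpd expects (it holds a secret, so it should not be readable by others in any case).

```shell
#!/bin/sh
# Create an MPD configuration file containing a random secret word.
# make_mpd_conf is a hypothetical helper; the default path is ~/.mpd.conf.
make_mpd_conf() (
    conf=${1:-$HOME/.mpd.conf}
    if [ -e "$conf" ]; then
        echo "$conf already exists; leaving it unchanged" >&2
        exit 0
    fi
    umask 077       # new file is readable and writable by the owner only
    # 16 random bytes printed as 32 hexadecimal characters
    secret=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    printf 'MPD_SECRETWORD=%s\n' "$secret" > "$conf"
)

# Run once on a login node:
#   make_mpd_conf
```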

Submitting a Non-MPI job

The following example will run a command on each of the nodes in the assigned list. It uses ssh to communicate between nodes. If your shell is csh/tcsh, use this:

#PBS -l nodes=10:ppn=4
#PBS -l walltime=00:00:60

foreach node (`cat $PBS_NODEFILE`)
   ssh $node hostname
end

And if your shell is sh/ksh/bash, use this:

#PBS -l nodes=10:ppn=4
#PBS -l walltime=00:00:60

for node in `cat $PBS_NODEFILE`; do
   ssh $node hostname
done
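
Note that the node file normally lists a node once for each processor assigned on it, so the loops above will run the command several times per node. If you'd rather run it once per node, something along these lines may help. It's a sketch for sh/ksh/bash; run_once_per_node and the REMOTE_SHELL override (handy for dry runs) are made-up names.

```shell
#!/bin/sh
# Run a command once on each distinct node listed in the node file.
# REMOTE_SHELL is a hypothetical override for dry runs; it defaults to ssh.
run_once_per_node() (
    nodefile=$1
    shift
    for node in $(sort -u "$nodefile"); do
        ${REMOTE_SHELL:-ssh} "$node" "$@"
    done
)

# Inside a job script:
#   run_once_per_node "$PBS_NODEFILE" hostname
```

Setting REMOTE_SHELL=echo before calling the function prints the commands that would be run instead of running them.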

More Examples- Choosing Your Hardware

The cluster is made up of various different kinds of machines. Some machines have different numbers and speeds of processors, different amounts of memory and disk, and different network interconnects. The following examples show how to pick the hardware on which you want your job to run. Note, however, that unless you have a specific need for a particular hardware type, it's best to leave your job specification as generic as possible. This will give the scheduler the widest possible selection of machines and may get your jobs completed faster. For more information on the hardware, see here.

Specifying Processor Speed Requirements

Currently when specifying a processor, it is only possible to specify one specific processor speed. For instance, you can request all 3 GHz nodes, but you cannot say "give me all nodes faster than 2 GHz". Each available speed has its own tag; the tag for 3 GHz processors is mhz3000. So to specify that you want 3 GHz processors for your job, you can do the following:

#PBS -l nodes=2:ppn=4:mhz3000
#PBS -l walltime=00:00:60

myjob

Specifying Node/CPU/Memory Requirements

Depending on the requirements of your job, you may need to give the scheduler more specific information about those requirements so that it can better assign you the resources that you need. By default, unless told otherwise, the scheduler will pack as many of your jobs as it can onto a given node. (You'll never share a node with someone else's jobs, but your own jobs are fair game for packing.) So, for instance if you have two jobs in the queue where you've specified nodes=2:ppn=2, both of these jobs can be scheduled simultaneously onto the same 4-processor machine.

If you want to request a specific amount of memory for your job, try something like the following:

#PBS -l nodes=1:ppn=4
#PBS -l mem=1024mb

myjob

This example requests a single 4-processor node with 1GB (1024MB) of memory.

Specifying the Amount of Scratch Space Needed

If your job requires more than a small amount of local scratch space, it would be a good idea to specify how much you need when you submit the job so that the scheduler can assign appropriate nodes to you.

Most of the nodes currently have at least 30GB of scratch space, and some have as much as 250GB available. Scratch space is currently mounted as /tmp. Scratch space will be cleared once your job completes.

The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.

#PBS -l nodes=1:ppn=4
#PBS -l file=5gb

myjob
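
Since scratch space is cleared when the job completes, anything you want to keep has to be copied out before the script ends. Here's a rough sketch of that pattern for sh/ksh/bash. The helper name run_in_scratch and the file names output.txt and errors.txt are made up for illustration; substitute whatever your job actually produces, and remember that the files must fit within the file= limit you requested.

```shell
#!/bin/sh
# Run a command inside a private directory under /tmp, copy its output
# back to a result directory, then remove the scratch directory.
# run_in_scratch, output.txt and errors.txt are hypothetical names.
run_in_scratch() (
    result_dir=$1
    shift
    scratch=$(mktemp -d /tmp/scratch.XXXXXX) || exit 1
    ( cd "$scratch" && "$@" > output.txt 2> errors.txt )
    status=$?
    cp "$scratch/output.txt" "$scratch/errors.txt" "$result_dir"/
    rm -rf "$scratch"
    exit "$status"      # report the command's own exit status
)

# Inside a job script (the result directory here is only an example):
#   run_in_scratch /data/dt-raid5/bob/my_program myjob
```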

More Examples- Other Job Options

Email Options

If you want to be notified via email about your job, you can add the -mXX option to your description file. To receive mail when the job starts, replace the Xs with the letter b. To receive mail when your job completes, replace the Xs with the letter e. You may use both letters if you like, and you'll get two email messages. By default, you will always be sent email if your job is aborted by the scheduler for any reason. The completion email will tell you the exit status of your job as well as the amount of resources the job consumed. Note that the CPU time and memory usage numbers provided in this email are unreliable at best.

By default, the email messages will be sent to your Glue account. If you'd like them to go elsewhere, add the -M option followed by a comma-separated list of addresses.

#PBS -l walltime=00:00:60
#PBS -mbe -Mbob@myhost.com,jane@yourhost.com

date

Running Your Job in a Different Shell

By default, your job script will be run through whatever shell is your default shell. To change this, you'll need to add the -S option to your description file. Also note that when using the bash shell, you must explicitly run your .profile script, as it is not run for you automatically. If you have tap commands in your submit script, this is especially important because tap is defined in .profile. If you're using tcsh you don't need to worry about this.

The following example changes to using /bin/bash as the execution shell.

#PBS -lwalltime=00:01:00
#PBS -S /bin/bash

. ~/.profile   # only needed for bash shell

date
hostname

Running Your Job in a Different Directory

The working directory in which your job runs will be your home directory, unless you specify otherwise. So, even if you're sitting in /data/dt-raid5/bob/my_program when you submit your job, when the job runs, it will look in your home directory for any files that don't have a full pathname specified. To change this behavior, you'll need to add the -d argument to your job description file.

Also note that if you are using MPI, you may also need to add either the -wd option for LAM (mpirun) or the -wdir option for MPICH (mpiexec) to specify the working directory.

The following example (using LAM) switches the working directory to /data/dt-raid5/bob/my_program.

#PBS -lwalltime=00:01:00
#PBS -d /data/dt-raid5/bob/my_program

lamboot $PBS_NODEFILE

mpirun -wd /data/dt-raid5/bob/my_program C alltoall

lamhalt

Specifying the Amount of Time Your Job Will Run

When submitting a job, it is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes. However, if you specify a time that is too long, you may run the risk of having your job sit in the queue for longer than it should, as the scheduler attempts to find available resources on which to run your job.

To specify your estimated runtime, use the walltime parameter. This value should be specified in the form HHH:MM:SS. If your job is expected to run over multiple days, simply convert the number of days into hours; for example, a 3-day job would have a walltime value of 72:00:00. You may leave off the leading fields if you like, so a walltime of 15:00 represents 15 minutes. Note also that while the scheduler may show walltimes in the form DD:HH:MM:SS when you view the queue status, this format will not be accepted when you submit a job.

If you do not specify a walltime, the default (maximum) permitted walltime for the queue will be used. See the section entitled Choosing a Queue for more information on queues and their assigned limits.

The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.

#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:00:60

hostname
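
If you'd rather not do the days-to-hours arithmetic by hand, a small helper can do it for you. This is just a sketch; to_walltime is a made-up name, and its arguments must be plain decimal integers without leading zeros.

```shell
#!/bin/sh
# Convert days, hours, minutes and seconds into the H:MM:SS form that
# qsub accepts. Arguments must be plain decimal integers (no leading zeros).
# to_walltime is a hypothetical helper, not a cluster command.
to_walltime() (
    total=$(( (($1 * 24 + $2) * 60 + $3) * 60 + $4 ))
    printf '%d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
)

to_walltime 3 0 0 0     # three days: prints 72:00:00
to_walltime 0 0 15 0    # fifteen minutes: prints 0:15:00
```

The result can be passed on the qsub command line, for example with -lwalltime=$(to_walltime 3 0 0 0) in sh/ksh/bash.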

Monitoring Your Jobs

To verify that your job is running, you can use the command showq. For example:

deepthought:~: showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4178                  kevin    Running     4    00:01:00  Mon Jan 22 11:13:09

     1 Active Job        4 of  236 Processors Active (1.69%)
                         1 of   59 Nodes Active      (1.69%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

If your job shows up in the ACTIVE JOBS section as shown above, your job should be off and running.
If your job shows up in the IDLE JOBS section, that means that there currently are insufficient resources available to run your job. Check to make sure you haven't requested more processors than you need, and that you've specified a reasonable walltime. If you see lots of jobs in the ACTIVE JOBS section, it's probable that you'll just need to wait for someone else's job to finish before yours can start.
If your job shows up in the BLOCKED JOBS section, it most likely means that you did not have a sufficient amount of time remaining in your CPU allocation to run the job. Either specify a smaller walltime, or obtain an additional allocation. See the section Diagnosing Job Problems for further information.
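
If you check on jobs often, you may want a script to tell you which of the three sections a job is in. The following sketch reads showq output on standard input and assumes the layout shown above (section headers starting with ACTIVE JOBS, IDLE JOBS and BLOCKED JOBS, and the job number in the first column). job_section is a made-up helper name.

```shell
#!/bin/sh
# Report which showq section (ACTIVE, IDLE or BLOCKED) lists a given job.
# Reads showq output on standard input; assumes the layout shown above.
# job_section is a hypothetical helper, not a cluster command.
job_section() {
    awk -v id="$1" '
        /^ACTIVE JOBS/  { section = "ACTIVE" }
        /^IDLE JOBS/    { section = "IDLE" }
        /^BLOCKED JOBS/ { section = "BLOCKED" }
        $1 == id        { print section; found = 1; exit }
        END             { if (!found) print "NOT FOUND" }
    '
}

# On a login node:
#   showq | job_section 4178
```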

To find out more detailed information about your job, use the checkjob command. This command will show you which specific nodes were allocated to your job, and it will also show you the job requirements you specified when you submitted the job.

deepthought:~: checkjob 4209

checking job 4209

State: Running
Creds:  user:kevin  group:wheel  account:kevin  class:serial  qos:serial
WallTime: 00:00:00 of 00:01:00
SubmitTime: Tue Jan 23 10:33:55
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Tue Jan 23 10:33:56
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [prod]
Allocated Nodes:
[compute-2-39.deeptho:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE PREEMPTOR
Attr:        PREEMPTEE

Reservation '4209' (00:00:00 -> 00:01:00  Duration: 00:01:00)
PE:  1.00  StartPriority:  200

If you want to view the output of your job while it is running, you can use the command qpeek. This command can be used to view both the standard output and standard error streams from your job, and can also be used to follow the output as it occurs.

deepthought:~: qpeek
qpeek:  Peek into a job's output spool files

Usage:  qpeek [options] JOBID

Options:
  -c      Show all of the output file ("cat", default)
  -h      Show only the beginning of the output file ("head")
  -t      Show only the end of the output file ("tail")
  -f      Show only the end of the file and keep listening ("tail -f")
  -#f     Show only the last # lines and keep listening ("tail -#f")
  +0f     Show all of the file and keep listening ("tail +0f")
  -#      Show only # lines of output ("tail -#")
  -e      Show the stderr file of the job
  -o      Show the stdout file of the job (default)
  -?      Display this help message

deepthought:~: qpeek 4209

...this is sample output from job 4209...

deepthought:~: qpeek -e 4209

...this is sample error messages from job 4209...

deepthought:~: qpeek -f 4209

...this is sample output from job 4209, the command will not exit, and
will continue to show output as it is generated...

Cancelling Your Jobs

To cancel your job before it completes, use the canceljob command.

deepthought:~: canceljob 7274


job '7274' cancelled

Job Accounting

All users of the cluster are provided with at least one allocation account. Paid users also receive a second, high-priority account. Jobs charged to the high-priority account take precedence over non-paid jobs. No job will preempt another job regardless of priority, with the exception of jobs in the serial queue, which will be preempted by any job with a higher priority. To determine the balance in your account(s), use the command mybalance.

deepthought:~: mybalance
Project  Machines Balance  
-------- -------- -------- 
test    ANY      12093571
test-hi ANY      71999976

This shows you the balance remaining (in seconds) in all of the accounts to which you are authorized to charge. The account with the -hi suffix is your high-priority account.
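
Balances in seconds are hard to read at a glance, so you may find it easier to convert them to whole hours. The sketch below reads mybalance output on standard input and assumes the layout shown above (two header lines, then project, machines and balance columns). balance_hours is a made-up helper name.

```shell
#!/bin/sh
# Convert the balances printed by mybalance from seconds to whole hours.
# Reads mybalance output on standard input; skips the two header lines.
# balance_hours is a hypothetical helper, not a cluster command.
balance_hours() {
    awk 'NR > 2 && NF >= 3 { printf "%s %d\n", $1, int($3 / 3600) }'
}

# On a login node:
#   mybalance | balance_hours
```

With the sample output above, this would report 3359 hours for test and 19999 hours for test-hi.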

To submit jobs to an account other than your default (standard-priority) account, use the -A option to qsub.

deepthought:~: qsub -A test-hi test.sh
4194.deepthought.umd.edu

Diagnosing Job Problems

If your job doesn't run and ends up in the BLOCKED JOBS section, you can use the checkjob command to get more information about why your job isn't running.

deepthought:~: checkjob 4195

[ ... deleted for brevity ... ]

job is deferred.  Reason:  NoResources  (cannot create reservation for job '4195' (intital reservation attempt))
Holds:    Defer  (hold reason:  NoResources)
PE:  232.00  StartPriority:  200
cannot select job 4195 for partition DEFAULT (job hold active)

In this example, we see that the job was deferred because there are insufficient resources available to run the job. Once sufficient resources become available, the job will run automatically.

If instead, you see the following as part of the checkjob output, it means that the job you are trying to run will exceed the allocation you have remaining. This may simply be because you did not specify a walltime as part of your job specification. If your specifications are correct, you can either resubmit your job to your standard-priority account, or to the free serial queue, or you can request an additional allocation from the committee.

deepthought:~: checkjob 4204

[ ... deleted for brevity ... ]

job is deferred.  Reason:  BankFailure  (cannot debit job account)
Holds:    Defer  (hold reason:  BankFailure)
PE:  32.00  StartPriority:  200
cannot select job 4204 for partition DEFAULT (job hold active)

If none of the above conditions apply, and your job is listed in the IDLE JOBS section, keep the following in mind:

  1. When a user runs a job, their job will never share a node with a different user. This is to prevent one user's job from interfering with another user's job. Once a user has access to a node, there's no way to prevent them from using all of the available memory, disk, or processors. This means that some processors on a given node may not always be used. However, if a single user submits multiple jobs, those jobs will be packed onto nodes if the job resource requirements allow this.
  2. In addition to the number of processors, a user may also request a certain amount of memory or certain amount of disk space when submitting their job. For example, a user may know that their job needs 4G of RAM for their process to run. So to someone viewing the queue, it may appear that the node is being used "inefficiently", but in reality, it is not. In this case, even though a node may have 4 processors, if it only has 4G of RAM total, only one 4G job is going to run on it.
  3. Remember that this is a shared system. There's no guarantee that jobs submitted will run immediately. Jobs submitted using high-priority accounts will run first, followed by the standard priority accounts, with the 'free' serial jobs getting whatever's left over. The showq command lists jobs according to priority order, with the highest priority jobs listed first.
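
If you want to check deferral reasons from a script, you can pull the reason out of the checkjob output. This sketch assumes the "job is deferred.  Reason:" line looks like the two examples above; defer_reason and explain_deferral are made-up helper names, and only the two reasons described in this section are recognised.

```shell
#!/bin/sh
# Extract the deferral reason (for example NoResources or BankFailure)
# from checkjob output read on standard input.
# defer_reason and explain_deferral are hypothetical helpers.
defer_reason() {
    sed -n 's/^job is deferred\. *Reason: *\([A-Za-z]*\).*/\1/p'
}

# Print a one-line hint based on the two reasons described above.
explain_deferral() {
    case $(defer_reason) in
        NoResources) echo "NoResources: the job will start once resources are free" ;;
        BankFailure) echo "BankFailure: the job would exceed the remaining allocation" ;;
        "")          echo "no deferral reason found" ;;
        *)           echo "unrecognised deferral reason" ;;
    esac
}

# On a login node:
#   checkjob 4204 | explain_deferral
```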

Interactive and Debug Jobs

The individual compute nodes generally do not allow direct shell access. This can be problematic if you want to test out your code on the exact processor on which your code will run. If you only need a single node for compiling and debugging purposes, the compute nodes with names ending in -0 are always available for remote shell access from the head node. Currently there are two nodes available, compute-1-0 and compute-2-0.

If you need shell access to additional nodes, and some are available, you can ask the scheduler to assign them to you with qsub -I. Assuming your requirements are met, you will be given a shell on the first node, and on that node, $PBS_NODEFILE will be set to the name of a file containing the list of nodes to which you now have access. You can then ssh to and between any of the nodes in that list, and you can also ssh to all of your assigned nodes from the head node.

For example, if you want to request two separate nodes, try this:

deepthought:~: qsub -lnodes=2:ppn=4 -lwalltime=00:15:00 -I
qsub: waiting for job 4216.deepthought.umd.edu to start
qsub: job 4216.deepthought.umd.edu ready

DISPLAY not set.

compute-2-39:~: cat $PBS_NODEFILE
compute-2-39.deepthought.umd.edu
compute-2-39.deepthought.umd.edu
compute-2-39.deepthought.umd.edu
compute-2-39.deepthought.umd.edu
compute-2-38.deepthought.umd.edu
compute-2-38.deepthought.umd.edu
compute-2-38.deepthought.umd.edu
compute-2-38.deepthought.umd.edu

compute-2-39:~: ssh compute-2-38 date
Tue Jan 23 11:22:48 EST 2007

Setting Up Your Environment

In order to provide a cluster that can support a wide variety of users, many software packages are available for your use. However, in order to simplify each individual user's environment, a large number of those packages are not included in your default PATH.

Your account as provided gives you access to the basic tools needed to submit and monitor jobs, access basic Gnu compilers, etc. It is HIGHLY suggested that you DO NOT remove or modify the dot files (.cshrc, .profile, etc) that are provided for you. Instead, add any customizations you need to the alternate set of files described here. If you choose to modify the system default files, you run the risk of losing any systemwide changes that are necessary to keep your account running smoothly.

For packages that are not included in your default environment, the tap command is provided. When run, this command will modify your current session by adding the appropriate entries to your PATH, MANPATH, and LD_LIBRARY_PATH, and will set any other variables necessary to ensure the proper functioning of the package in question. Note that these changes are temporary and only exist until you log out. If you want to have tap run for you automatically, add the command tap -q <package> to your .cshrc.mine file. (The -q argument prevents tap from displaying any text output when it runs, which can confuse some shells.) If you run the tap command without any arguments, it will provide a list of available packages. Note that many of these packages are not accessible on the cluster by default. If you want access to them, let us know and, if possible, we'll make them available.

For example, if you want to run Matlab, you'll want to do the following. Notice that Matlab is not available until after the tap command has been run.

deepthought:~: matlab
matlab: Command not found.
deepthought:~: tap matlab
----------------------------------------------------------------------

    This is a shortcut to the default version of Matlab available 
    on your platform. 

    Run command "matlab" to start up the program,
    or "matlab -h" to see various command-line options.

    There may be other versions of Matlab available.  Please check
    the Dash/KDE menu for specific versions of Matlab.

----------------------------------------------------------------------
deepthought:~: matlab

                              < M A T L A B >
                  Copyright 1984-2005 The MathWorks, Inc.
                   Version 7.0.4.352 (R14) Service Pack 2
                              January 29, 2005

 
  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.
 
>> 

Available Software

Software packages that are available on the cluster include:

Package Name   Description
ansys100       Ansys 10.0
blast          BLAST 2.2.18
cap3           CAP3 compiled with Intel compilers
clustalw       ClustalW 1.83 compiled with Intel compilers
cns            CNS 1.2 compiled with Intel compilers
fftw           FFTW 2.1.5 (with MPI extensions, built with lam-gnu)
garli          GARLI 0.951 compiled with Intel compilers (single process version)
garli-mpi      GARLI 0.942 compiled with Intel compilers and OpenMPI (MPI version)
gromacs        GROMACS version 3.3.3 compiled with Intel compilers
gsl            GNU Scientific Library version 1.8
hdf            HDF 4.2r1
hdf5           HDF 1.6.5
intel          Intel Compilers 10.1.008, MKL 10.0.011
intel-mpi      Intel MPI 3.1 - note that this does NOT work with Infiniband
java           Java 1.5.0_11
java6          Java 1.6.0_04
lam-gnu        LAM 7.1.2 compiled with Gnu compilers
lam-intel      LAM 7.1.2 compiled with Intel compilers
lapack         LAPACK 3.1.0 (This is the reference implementation - non-optimised)
               It includes the BLAS library as well.
lucy           lucy 1.19 compiled with Intel compilers
mathematica60  Mathematica 6.0
matlab         Matlab 7.0.4
matlab2007b    Matlab 7.5.0
modeltest      modeltest 3.7 compiled with Intel compilers
               also includes MrModeltest 2.2
mpich-gnu      MPICH 1.0.4p1 compiled with Gnu compilers
mpich-intel    MPICH 1.0.4p1 compiled with Intel compilers
mrbayes        MrBayes 3.1.2 compiled with Intel compilers (single process version)
mrbayes-mpi    MrBayes 3.1.2 compiled with Intel compilers and OpenMPI (MPI version)
muscle         MUSCLE 3.6 compiled with Intel compilers
namd           NAMD 2.6
openmpi-gnu    OpenMPI 1.2.5 compiled with Gnu compilers
openmpi-intel  OpenMPI 1.2.5 compiled with Intel compilers
openmpi-pgi    OpenMPI 1.2.5 compiled with PGI compilers
netcdf         NetCDF 3.6.1
paml           PAML 4b compiled with Intel compilers
povray         POV-Ray 3.6
R              R 2.4.1
xplor-nih      xplor-nih 2.19

Files and Storage

On the cluster, you have several options available to you regarding where files are stored. Your home directory is private to you, and should be used as little as possible for data storage. Your home directory is on a slower disk and is not optimized for high speed access.

Because much of the data generated on the cluster is of a transient nature and because of its size, data stored in the /data partitions is not backed up. This data resides on RAID protected filesystems, however there is always a small chance of loss or corruption. If you have critical data that must be saved, be sure to copy it elsewhere.

There are several general purpose areas that are intended for storage of computational data. These areas are accessible to all users of the cluster and as such you should be sure to protect any files or directories you create there. See Securing Your Data for more information.

The areas are:

Path             Description
/data/dt-vol0    This is a RAID5 filesystem, currently approximately 2TB
/data/dt-vol1    This is a RAID5 filesystem, currently approximately 2TB
/data/dt-raid5   This is a RAID5 filesystem, currently approximately 500GB
/data/dt-raid10  This is a RAID10 filesystem, currently approximately 500GB
The remaining filesystems are for members of the CLFS group only:
/data/dt-vol2    This is a RAID5 filesystem, currently approximately 1.7TB
/data/dt-vol3    This is a RAID5 filesystem, currently approximately 1.7TB
/data/dt-vol4    This is a RAID5 filesystem, currently approximately 1.7TB
/data/dt-vol5    This is a RAID5 filesystem, currently approximately 1.7TB

The filesystems will perform differently depending on their usage, so you may want to try both the RAID5 and RAID10 filesystems to see which works best for your application.

Please remember that you are sharing these filesystems with other researchers and other groups. If you have data residing there that you don't need, please remove it promptly. If you know you are going to create large files, make sure there is sufficient space available in the filesystem you are using. You can check this yourself with the df command:

deepthought:~: df -h /data/dt-vol0
Filesystem            Size  Used Avail Use% Mounted on
g20-fs1.deepthought.umd.edu:/export/data/vol0
                      1.8T  850G  888G  49% /a/g20-fs1/data/dt-vol0

This output shows that there are currently 888 GB of free space available on /data/dt-vol0.
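
If you want a script to check for space before it starts writing, you can ask df for machine-friendly output. The sketch below uses the standard -P and -k options so that each filesystem is reported on a single line in 1024-byte blocks (avoiding the wrapped line seen above). free_kb and has_free_space are made-up helper names.

```shell
#!/bin/sh
# Print the free space, in kilobytes, on the filesystem holding a path.
# free_kb and has_free_space are hypothetical helpers.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if at least the given number of kilobytes is free at the path.
has_free_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example: require roughly 5GB before starting.
#   has_free_space /data/dt-vol0 5242880 || echo "not enough space"
```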

If you have a Glue account and you want to share your data back and forth with that account, you can access it at /glue_homes/<username>. Note that you cannot have jobs read or write directly from your Glue directory; you'll need to copy data back and forth by hand as needed.

Securing Your Data

Your home directory as configured is private and only you have access to it. Any directories you create outside your home directory are your responsibility to secure appropriately. If you are unsure of how to do so, please contact hpcc-help@umd.edu for additional assistance.

If you're a member of a group, you'll want to make sure that you give your group access to these directories, and you may want to consider setting your umask so that any files you create automatically have group read and write access. To do so, add the line umask 002 to your .cshrc.mine file.
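
As a starting point, here's a sketch of how you might create directories with suitable permissions in the shared areas. The helper names are made up, and the modes are only suggestions: 700 gives access to you alone, while 770 gives you and the directory's group full access and everyone else none. Check with ls -ld that the directory belongs to the group you intend.

```shell
#!/bin/sh
# Create a directory that only you can access (mode 700).
# make_private_dir and make_group_dir are hypothetical helpers.
make_private_dir() {
    mkdir -p "$1" && chmod 700 "$1"
}

# Create a directory that you and its group can read and write (mode 770).
make_group_dir() {
    mkdir -p "$1" && chmod 770 "$1"
}

# Examples (the paths are illustrative only):
#   make_private_dir /data/dt-vol0/bob
#   make_group_dir /data/dt-vol0/ourgroup
```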