Opteron/InfiniPath cluster

Send bug reports, suggestions etc. to Matt.

System overview

The cluster is a 44 compute node machine with an InfiniPath (InfiniBand) interconnect.

Each compute node contains

  1. 2 x Dual-Core Opteron Processor 2216 @ 2.4 GHz (4 CPU cores per node)
  2. 4 GB RAM
  3. Approx. 66 GB local storage mounted as /var/scratch
  4. A Pathscale InfiniPath (InfiniBand) network interface card (NIC) for high-bandwidth IPC (MPI)
  5. Gigabit (1000 Mb/s) Ethernet for TCP/IP
The compute nodes are named node001, node002, ..., node043, node045 (there is no node044), and are connected to each other and to the additional nodes listed below through both InfiniPath and Gigabit Ethernet switches.
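For scripting across the machine, the node-naming scheme (node001 through node043 plus node045, skipping node044) can be generated in the shell; a minimal sketch:

```shell
# Generate the 44 compute node names: node001..node043 and node045.
# node044 does not exist, so it is skipped.
for n in $(seq -f 'node%03g' 1 45); do
    [ "$n" = "node044" ] && continue
    echo "$n"
done
```

This is handy, for example, as the outer loop of an ssh-based sweep over the compute nodes.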

Additionally, there are two special-purpose nodes:
  1. A dedicated head node, which has the same processor, RAM and NIC configuration as the compute nodes.  This is the only node that has a network connection to the external world.
  2. A dedicated I/O node, storage01, that hosts 4.5 TB of RAID storage which is NFS-mounted by the head node and all of the compute nodes.
The OS on all nodes is CentOS 4.5 Linux, currently running with a 2.6.9 kernel. 

IMPORTANT!!  As is the case with all of our current and past clusters, it is crucial that all users be cognizant and considerate of the needs and usage patterns of other users.  In addition, it is every user's responsibility to practice "responsible computing", which includes, but is not limited to, keeping disk usage under control and ensuring that jobs do not significantly impact overall system performance, particularly command-line responsiveness on the head node.  Bear in mind that this cluster is first and foremost for use by Joerg Rottler and his group members: we numerical relativists are guests on the machines, and need to act as such!

Compiler choices

The software environment on the head node includes four separate compiler suites: PGI, Intel, PathScale and GNU.
There are man pages for all but the GNU compilers.

Here are sample invocations for simple (single source file) builds of optimized F77 and C executables that link against one of the group's standard libraries, using each compiler suite.

PGI:

% pgf77 -L/usr/local/pgi/lib -fast foo.f -lbbhutil -o foo
% pgcc -L/usr/local/pgi/lib -fast foo.c -lbbhutil -o foo

Intel:

% ifort -L/usr/local/intel/lib -O3 foo.f -lbbhutil -o foo
% icc -L/usr/local/intel/lib -O3 foo.c -lbbhutil -o foo

PathScale:

% pathf90 -L/usr/local/pathscale/lib -O3 -fno-second-underscore foo.f -lbbhutil -o foo
% pathcc -L/usr/local/pathscale/lib -O3 foo.c -lbbhutil -o foo

GNU:

% f77 -L/usr/local/lib -O3 -fno-second-underscore foo.f -lbbhutil -o foo
% gcc -L/usr/local/lib -O3 foo.c -lbbhutil -o foo
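Since only the command names and library path differ between suites, the choices can be captured in a small helper script; a sketch (the script and its name are ours, not something installed on the cluster):

```shell
#!/bin/sh
# pick-compilers.sh -- hypothetical helper: set F77/CC/LDFLAGS for a
# given suite, matching the sample invocations above.  Usage:
#   pick-compilers.sh [pgi|intel|path|gnu]
suite=${1:-gnu}
case "$suite" in
    pgi)   F77=pgf77;   CC=pgcc;   LDFLAGS=-L/usr/local/pgi/lib ;;
    intel) F77=ifort;   CC=icc;    LDFLAGS=-L/usr/local/intel/lib ;;
    path)  F77=pathf90; CC=pathcc; LDFLAGS=-L/usr/local/pathscale/lib ;;
    gnu)   F77=f77;     CC=gcc;    LDFLAGS=-L/usr/local/lib ;;
    *)     echo "unknown suite: $suite" >&2; exit 1 ;;
esac
echo "F77=$F77 CC=$CC LDFLAGS=$LDFLAGS"
```

Sourcing such a script (or eval-ing its output) gives a makefile-friendly way to switch suites without retyping paths.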

If you are using tcsh, you can use the following aliases to set environment variables such as F77, F90, CC, CXX, LDFLAGS, etc. to values appropriate for the various compilers:
Note that executing any of these aliases echoes which variables are set, and to what values.  Should you wish to execute one of these aliases in your ~/.cshrc file, in order to define a default compilation environment at login time, be sure to redirect standard output and standard error to suppress the echoing.  E.g.

pathopt >& /dev/null

Also observe that the PathScale folk take their licensing seriously.  When a user invokes one of the compilers (pathcc, pathf90, etc.), a license lease is issued and, independent of how long the compilation actually takes, the lease will not expire for something like 5 minutes.  Since we currently have only a single-concurrent-lease license, this means that no other user will be able to use the compiler for at least 5 minutes.  Thus, don't be surprised to see the error message:
** Subscription: Unable to find a server.  The PathScale products cannot run without a subscription.  
Please see for details.
For more information, please rerun with -subverbose
when trying to use a Pathscale compiler.  Unfortunately, at the current time, there's nothing an ordinary user can do about this but wait it out.
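One way to "wait it out" without babysitting the terminal is to wrap the compile in a retry loop; the sketch below is our own workaround, not part of the PathScale tools:

```shell
# retry CMD [ARGS...]: run CMD, retrying up to 5 times, pausing
# RETRY_DELAY seconds (default 60, roughly one lease interval) between
# attempts.  Hypothetical workaround for the single-lease license:
# keep retrying until the current holder's lease expires.
retry() {
    tries=5
    while ! "$@"; do
        tries=$((tries - 1))
        [ "$tries" -le 0 ] && return 1
        echo "command failed (license busy?); retrying..." >&2
        sleep "${RETRY_DELAY:-60}"
    done
}

# Example:
#   retry pathcc -L/usr/local/pathscale/lib -O3 foo.c -lbbhutil -o foo
```

Note this retries on any compile failure, so check that your code actually builds (e.g. with gcc) before leaning on it.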

Serial (single processor) job submission using PBS

Follow these steps to submit and run single-processor jobs on the cluster under PBS:
  1. Build your executable using your favorite compiler suite, as sketched above.
  2. Create a PBS script file.  You can see the contents of a basic template file, which needs occurrences of 'XXX' replaced appropriately, HERE.
  3. Submit the job to the queue using qsub. E.g.

    % qsub serial.pbs

    Currently, there is only a single queue on the system, which handles both serial and parallel jobs.
  4. Monitor the status of your job using qstat, delete it from the queue using qdel, etc.
See the man pages for qsub, qstat, qdel, etc. for full details concerning the syntax and semantics of the PBS commands.
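For orientation, a minimal serial PBS script has roughly the following shape; the job name, limits and executable name are illustrative placeholders, not site policy (use the template file referenced above for the definitive version):

```shell
#!/bin/sh
# serial.pbs -- minimal single-processor PBS script (illustrative values)
#PBS -N myjob                 # job name (placeholder)
#PBS -l nodes=1:ppn=1         # one core on one node
#PBS -l walltime=12:00:00     # requested wall-clock limit (placeholder)
#PBS -j oe                    # merge stdout and stderr into one file

# PBS starts jobs in $HOME; change to the directory qsub was run from.
cd "${PBS_O_WORKDIR:-.}"
./foo                         # the serial executable built earlier
```

The `#PBS` lines are directives read by qsub, not shell commands, so the script remains an ordinary shell script you can also run by hand for testing.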

Parallel (MPI based) job submission using PBS

TO COME!!

Supported by CIFAR, NSERC, CFI, BCKDF and UBC.