Submitting (Parallel) Batch Jobs on the Myrinet Nodes Using PBS

0. Important Note

Please refer to the policies for RUNNING I/O INTENSIVE JOBS before submitting parallel jobs on the cluster.

1. Overview

The Myrinet interconnect on the cluster provides the capability for high-performance, fine-grained parallelism for a wide class of problems. Consequently, those jobs that truly require parallelism, and that run efficiently in parallel, will be given priority on the Myrinet nodes.

Users should note that, currently, the only low-level software installed on the cluster to support parallelism is MPI. Thus, it is anticipated that, at least initially, virtually all of the parallel jobs run on the Myrinet nodes will be MPI-based.

Principally because unbalanced load factors are much more of a concern for "truly" parallel tasks than for trivially parallelizable ones (such as a sequence of jobs having different input parameters), USERS ARE REQUESTED TO RUN MYRINET JOBS ONLY VIA PBS (i.e. BATCH). PLEASE DO NOT LAUNCH INTERACTIVE JOBS ON THE MYRINET NODES.

At least initially there will be no job quotas or other restrictions on how users use the batch system. Management reserves the right to impose such limits if and when contention for resources becomes severe.

IMPORTANT!! NEW POLICY: ALL USERS PLEASE READ THE FOLLOWING

Because the cluster now has only 3 non-Myrinet nodes (i.e. 6 non-Myrinet processors), and given that there will still be considerable demand for processors for serial jobs, users should feel free to submit a reasonable number of serial (i.e. single-processor) jobs via the Myrinet queue. Please submit such jobs only after the gig-nodes are saturated. Also, "reasonable" is loosely defined here: if qstat and pbsnodes show that there are a lot of idle Myrinet processors and no queued Myrinet batch jobs, then you should feel free to submit more jobs than if the Myrinet queue is very busy.
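
For example, before topping up with serial jobs you might check the current state of the Myrinet partition with something like the following (an illustrative sketch using standard PBS commands; queue and node naming are site-specific):

head% qstat -a                                 # list all jobs, running and queued
head% pbsnodes -a | grep -c 'state = free'     # count nodes that PBS currently considers free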

AGAIN, HOWEVER, DO NOT RUN INTERACTIVELY ON THE MYRINET NODES, EVEN THOSE THAT USED TO BE GIG NODES. Such behaviour will lead to PBS confusion, overloading of nodes, and nasty e-mail messages from management.

Similarly, at least until the WestGrid UBC/TRIUMF cluster comes back on-line and stabilizes, parallel users should avoid the temptation of completely saturating the machine with their parallel jobs. In particular, note that at least for the time being, parallel jobs should still be restricted to 32 or fewer processors.

As always, the cardinal rule of cluster usage is to be AWARE and CONSIDERATE of other users.

Code development

As much as possible, the Myrinet nodes should be used for production runs. Users who are developing MPI-based parallel codes are urged to use the old PIII cluster for that purpose.

Resource usage

There are 54 dual Myrinet nodes, for a total of 108 processors. As noted above, users are requested to restrict their parallel jobs to 32 or fewer processors for the time being. Users should be aware that the performance of their MPI jobs may depend on the load factor of the nodes on which they are running; in particular, when the Myrinet partition becomes heavily loaded, job throughput may drop. Users should feel free to contact management (Matt) to report such behaviour, since it may be difficult to assess externally.

Users should also be aware that since the Myrinet upgrade on February 5, 2004, there are now two versions of the Myrinet card in the cluster. Nodes 001 through 050 have 'C' cards, while nodes 051 through 054 have previous-generation 'B' cards, which have somewhat lower peak performance than the 'C' cards. However, in a multi-grid benchmark that management has run on

  1. 4 nodes with B cards
  2. 4 nodes with C cards
  3. 2 nodes with B cards and 2 nodes with C cards
there was essentially no difference in execution time among the three cases. Nonetheless, if you suspect some degradation of your code's performance when running on node051, node052 or node054, let Matt know of the circumstances immediately.

2. Compiling and Submitting an MPI Job

We illustrate the steps in compiling and submitting an MPI job via PBS using a simple MPI demo program that computes an approximation to Pi = 3.14159... The sample code, along with a Makefile and a PBS job script, can be downloaded as the gzipped tar archive cpi-mpi-myr-intel.tar.gz. Note that this specific example uses a C-language program and the Intel C/C++ compilers; examples for additional language and compiler combinations are given below.

After downloading the above archive to a convenient location within your home directory on the cluster, unpack it using the tar command:

head% tar zxf cpi-mpi-myr-intel.tar.gz
This will create a directory cpi-mpi-myr-intel that should have contents as follows:
head% cd cpi-mpi-myr-intel
head% ls
Makefile  Makefile.commented  cpi.c  cpi.pbs
Note that Makefile and Makefile.commented are functionally identical; the latter has additional comments explaining the structure of the makefile.

To compile and link the test program, simply type make or make cpi:

head% make
/opt/gmpi.intel/bin/mpicc  -I/usr/local/intel/include -O3 -tpp7 -c cpi.c
/opt/gmpi.intel/bin/mpicc  -O3 -tpp7 -O3 -tpp7 -L/usr/local/intel/lib cpi.o -o cpi 
As mentioned in the commented version of the makefile, /opt/gmpi.intel/bin/mpicc is a script that essentially functions as a front-end to the Intel C/C++ compiler, icc, and which, among other things, ensures linkage with the proper version of the MPI library during the load phase.
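
For instance, to compile a single-file MPI program of your own outside of the makefile (myprog.c below is just a placeholder name), you could invoke the wrapper directly with the same flags the makefile uses:

head% /opt/gmpi.intel/bin/mpicc -O3 -tpp7 -o myprog myprog.c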

Now that the executable cpi has been created, you can submit a batch job to run it on several processors over Myrinet using the PBS qsub command. We'll do this via the batch script file cpi.pbs, which you can use as a template for your own submissions. (Note that cpi.pbs is a bash script; you can equally well use a tcsh script should you so wish.)
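
For reference, such a script looks something like the following sketch. The resource directives, walltime, and mpirun invocation shown here are illustrative assumptions typical of a PBS/MPICH-GM setup; consult the actual cpi.pbs from the archive for the authoritative version.

#!/bin/bash
# Hypothetical sketch of a Myrinet MPI job script; adapt names and limits to the local setup.
# Request 2 dual-processor nodes (4 processes) and a 10-minute walltime (assumed syntax).
#PBS -N cpi
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# PBS lists the allocated processors, one per line, in $PBS_NODEFILE.
NPROCS=$(wc -l < $PBS_NODEFILE)

# The mpirun location and flags are assumptions; the real cpi.pbs may invoke MPI differently.
/opt/gmpi.intel/bin/mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./cpi

To submit the job, pass the script to qsub: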

head% qsub cpi.pbs
181.head
The output "181.head" from the qsub command indicates that the batch job has been assigned the job identifier 181---the job ID assigned to your submission will almost certainly differ. (You'll need the job ID should you want to cancel a job once it's been submitted to the queue; see below.)

Provided that there are enough free Myrinet nodes available, the batch job will run quickly (you can check its status using the qstat command), and upon termination, will leave an additional two files, suffixed by the job ID, in the submission directory.
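
While the job is queued or running, you can monitor it by giving qstat the job ID (or run qstat with no arguments to list all jobs):

head% qstat 181

Once the job has finished, listing the submission directory shows the new files: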

head% ls
Makefile  Makefile.commented  cpi*  cpi.c  cpi.e181  cpi.o  cpi.o181  cpi.pbs
The new files, cpi.o181 and cpi.e181, contain the standard output and standard error, respectively, from the batch job.

As a final note, should you wish to stop a batch job's execution, or remove it from the execution queue, use the qdel command, supplying the job ID as its single argument. For example, the above job could have been terminated via
head% qdel 181

3. Language and Compiler Combinations

  1. Intel C Compiler
  2. Intel Fortran 77/90 Compiler
  3. PGI C Compiler
  4. PGI Fortran 77 Compiler

4. Additional Information

  1. Man pages for basic PBS commands
  2. Local Commands
  3. MPI Documentation
  4. Myrinet Documentation
