Running Interactive Jobs on the Gigabit Nodes

1. Overview

The three gigabit nodes (node055, node056 and node057) are available for interactive use, in much the same way as the nodes of the old PIII cluster are. Users should feel free to start production jobs on whichever of the gig nodes are available, using the procedure outlined below.

Note that since the Myrinet upgrade on February 5, 2004, the following directions may seem a little over the top, given that there are now only three gigabit nodes. Chalk this up to management's desire to minimize the amount of .html modifications needed.

Also note that, particularly until the WestGrid UBC/TRIUMF cluster is back on-line and stable, users should feel free to submit a reasonable number of serial jobs via the Myrinet batch queue as described in this PAGE.

As is the case with the old cluster, there are no system-imposed limits to how many processors a single user can use at a given time, but users are expected to be aware of and considerate of the needs of other users, and management reserves the right to impose restrictions should contention for resources become severe.

2. Determining Available Nodes

First, log in to the head node, vnfe4.physics.ubc.ca. Because the gig compute nodes are accessible only via a private internal network, all access to those nodes from the outside world must be made through the head:

your-workstation% ssh user@vnfe4.physics.ubc.ca

Once you have logged in, use the avail command (see HERE for full usage information) to list the gig nodes in order of increasing load factor (i.e. from least to most busy):

head% avail
node055 0.00 0.00 0.00
node056 0.00 0.00 0.00
node057 0.00 0.00 0.00

Note that it will take a few seconds for the avail command to complete, as it must connect to all of the gig nodes and execute the uptime command on each. The three columns of numbers listed by avail are the 1-, 5- and 15-minute load averages on the respective gig nodes. Roughly speaking, we can interpret the load average as the number of CPU intensive jobs that are currently running on the machine. Thus, a load average of 0.00 means that the machine is idle, while load averages of 1.00 and 2.00 indicate that one or two jobs, respectively, are running on the node. Since each node has two processors, a load average of 2.00 means that the node is essentially completely saturated.

You should choose the nodes on which you will run from the start of the listing produced by avail, unless all nodes have 2.00 load averages, in which case you will have to wait until one or more nodes become available. DO NOT initiate a job on a machine which already has a load average of 2.00; this will only slow down overall throughput on the cluster.
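
The selection rule above can be sketched as a small filter. This is only an illustration, not part of the cluster's tooling: free_nodes is a hypothetical helper name, and it assumes input in the four-column format that avail prints above (node name followed by the 1-, 5- and 15-minute load averages).

```shell
# Hypothetical helper (not a cluster-provided command): read avail-style
# lines ("node load1 load5 load15") on stdin and print the names of the
# nodes whose 1-minute load average is below 2.00.
free_nodes() {
    awk 'NF >= 4 && $2 + 0 < 2.00 { print $1 }'
}

# Sample input in the format shown above. On the head node you would
# instead run:  avail | free_nodes
printf '%s\n' \
    'node055 0.00 0.00 0.00' \
    'node056 1.02 0.98 0.75' \
    'node057 2.00 2.01 1.99' | free_nodes
```

With the sample input, the filter prints node055 and node056 and drops the saturated node057. Because avail already sorts by increasing load, the first name printed is the least busy candidate.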

3. Initiating a Job on a Node

Once you have determined an available node, log in to it using the rsh command:

head% rsh node055

Again, note that the rsh command must be executed from the head node (not e.g. from your local workstation), since the compute nodes are connected via a private local network. Once you have logged into the node, run top for a few seconds to confirm that the load average is 1.00 or less, since some other user may have recently "claimed" the node. Then, simply start up your job as you would on any Unix workstation:

node055% cd some-dir
node055% some-command

NOTE: If you background the job, then subsequently log out from the node before execution of the process is complete, the job will continue to run in the background, i.e. you do not have to explicitly nohup the job.
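
As a rough alternative to eyeballing top, the "load average is 1.00 or less" check can be scripted on the node itself. The sketch below is an assumption-laden illustration: load_ok is a hypothetical helper, and it assumes the node exposes /proc/loadavg (as Linux systems do), whose first field is the 1-minute load average.

```shell
# Hypothetical helper: succeed if the 1-minute load average ($1) is at or
# below the limit ($2, default 1.00, the threshold used in the text above).
load_ok() {
    awk -v load="$1" -v limit="${2:-1.00}" \
        'BEGIN { exit !(load + 0 <= limit + 0) }'
}

# Assumption: /proc/loadavg exists (Linux). Its first field is the
# 1-minute load average; the remaining fields are ignored here.
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    if load_ok "$load1"; then
        echo "load $load1: room for one more job"
    else
        echo "load $load1: node looks busy, choose another"
    fi
fi
```

A load average is a smoothed figure, so a job started moments ago may not yet be reflected in it; running top for a few seconds, as described above, remains the more reliable check.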

4. User Guidelines

In order to maximize the usefulness of the gigabit nodes, users should abide by the following guidelines:

BE AWARE AND CONSIDERATE OF OTHER USERS.
Treat the cluster as a remote computing environment:
- If at all possible, develop and debug on another Linux/Unix system, and, especially during the construction period, treat ALL storage on the cluster (not just the compute nodes) as volatile.
- If at all possible, build the executable from source on the cluster, to avoid problems with (e.g.) machine-dependent run-time support.
- From time to time, nodes in the cluster may have to be taken down for reboots on very short notice. Users running very long jobs that do not periodically checkpoint themselves do so at their own peril. There is currently no system-wide mechanism for suspending, then restarting, jobs. Management accepts no responsibility for lost time and/or data.
If demand for the compute nodes is high, minimize the amount of development work you do on those nodes.
DO NOT start long jobs (more than a few minutes) if there are already two jobs running on the node. Use avail and top to determine the number of CPU-intensive jobs that are currently running.
DO start one additional job on a node that is currently running a job, unless there are completely free nodes. (Again, use avail and top.)
DO NOT start a job that will result in total memory usage on a node exceeding 90%. (Once again, use top to see what percentage of memory each currently running process is using.)
BE AWARE AND CONSIDERATE OF OTHER USERS.
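
The 90% memory guideline amounts to simple arithmetic, sketched below. mem_fits is a hypothetical helper, and the percentages in the example are made up; in practice the per-process figures would be read from the %MEM column of top, and the figure for your own job is an estimate you supply. Summing per-process percentages can overcount memory shared between processes, so treat the result as a conservative guide.

```shell
# Hypothetical helper: sum the %MEM values read from stdin (one per line),
# add the share expected for the new job ($1), and succeed only if the
# total stays at or below 90.
mem_fits() {
    awk -v extra="$1" '{ total += $1 } END { exit !(total + extra <= 90) }'
}

# Made-up example: running processes use 35.0% and 20.5% of memory, and
# the new job is expected to need about 30%.
if printf '%s\n' 35.0 20.5 | mem_fits 30; then
    echo "total stays within the 90% guideline"
else
    echo "total would exceed 90%; use another node"
fi
```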
