PIII/Linux Cluster Homepage
Over 1,750,000 CPU Hours
Important: The information contained below is subject to
change without warning. Please e-mail
Matthew Choptuik immediately if you encounter problems using the
cluster. Please check here FREQUENTLY while using the cluster.
Access | Use |
Tips & Tricks |
Warnings | Steve Plotkin's MACHINES
- MAY 2, 2007, 12 NOON:
All compute nodes of the cluster, except for those belonging to Steve
Plotkin, are now off-line, per the message below.
- APRIL 23, 2007: IMPORTANT!!!
After 7+ years of service, the
vn.physics.ubc.ca cluster will very shortly be decommissioned!!
Since the cluster has been almost completely idle, it is not expected
that this will present much of a hardship to any users of the facility.
Beginning later this week, management will start the process of
shutting down the compute nodes and moving them out of the co-location
room in the LS Klinck Building in order to free up space.
The single most important issue that users of the cluster may need to
deal with is the disposition of files that are currently stored on one
of the front-end machines, vnfe1.physics.ubc.ca
and on one of the following partitions: /d/vnfe1/home,
/d/vnfe1/home2, /d/vnfe1/home3, /d/vnfe3/home and /d/vnfe3/home2. Users with
useful and/or important files on any of those partitions are asked to
offload such files as soon as is feasible. Note, however, that it
is management's intent to continue to run the two front-ends (vnfe1 and vnfe3) for a period of at least a
few months following the cluster shutdown, so that, even after
decommissioning, files located on the front-ends will be available for
some time. In addition, this web site will continue to be
maintained for some time to come, and will be updated as necessary to
provide information concerning access to the front ends and other matters.
Please contact Matt Choptuik (email@example.com) should you have any
questions/concerns about this matter.
- See HERE
for a recent snapshot of
usage on the cluster.
- See HERE
for recent node loads.
- See HERE for usage summary by user.
- See HERE
for disk usage summary by partition and user. PLEASE try to keep
partitions below 80% usage.
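A quick command-line check for the 80% guideline can be sketched as follows (the function name is a hypothetical example; pipe real data in with `df -P | check_usage`):

```shell
# List mount points at or above 80% usage from `df -P`-style input.
check_usage() {
    awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= 80) print $6, $5 "%" }'
}

# Example against sample `df -P` output:
check_usage <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000000 850000 150000 85% /d/vnfe1/home
/dev/sda2 1000000 400000 600000 40% /d/vnfe1/home2
EOF
```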
- To date, this cluster has had about 190
"crashes" (mean time between node crashes: 1.9).
- JULY 27, 8:45 AM:
MPI is MOSTLY functional again on
the cluster, as the following table indicates:
        C    F77   C++   F90
PGI     Yes  Yes   ?     ?
INTEL   Yes  No    ?     ?
where '?' denotes 'unknown/who uses that stuff anyway?'.
Management will continue to work on the Intel/F77 and other issues,
but, e.g., 3 of the 4 examples below should work again.
- JANUARY 26 12:00 NOON: IMPORTANT!! ALL USERS OF
THE CLUSTER SHOULD RE-READ THE FOLLOWING!!
SUMMARY OF WHAT HAS CHANGED WITH THE UPGRADE
- There are now two
(rather than one) distinct partitions (partitions / and /home) on each
of the machines (front-ends and compute nodes). The / partition
occupies about 20% of the space of the physical partition, /home about
80%. This has two key consequences of which all users MUST be
aware, and that some users will need to act on SOON.
- The /home partitions on vnfe1 and vnfe3 are NOW 20% SMALLER THAN THEY USED TO BE.
THIS MEANS THAT
USERS MUST REDUCE THE AMOUNT OF INFORMATION THAT THEY STORE ON THIS
CLUSTER. Note that the WestGrid
HSM (Hierarchical Storage Management) Facility is
perfect for all of your long-term, high-volume storage needs. If
you don't yet have a WestGrid account, get your supervisor to take the
time that it will take to fill out the on-line application for a group
account (she/he doesn't ever have to log into a WestGrid computer, just
fill out the form); that group account will be assigned a project,
which you then can use to create your WestGrid account.
- /tmp on all the compute nodes is mounted on /, so /tmp is
not ideal for high-volume, LOCAL, disk storage. Thus, what would
normally be the /home partition on the compute nodes has been renamed
/scratch and given the attributes of a /tmp directory. On ANY
machine (front-end or node) to which you login, the directory
/scratch/$USER should be automatically created. If this is NOT
the case, please report that fact to management immediately.
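The auto-creation step amounts to the sketch below (the real hook lives in the system login scripts; the function name and the configurable root are assumptions so the sketch can be tried anywhere):

```shell
# Ensure a per-user scratch directory exists, creating it if needed.
# On the cluster the root would be /scratch; here it is a parameter.
ensure_scratch() {
    dir="$1/$2"
    [ -d "$dir" ] || mkdir -p "$dir"
    echo "$dir"
}

# On the cluster this would amount to:  ensure_scratch /scratch "$USER"
```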
- PLEASE DO ALL HIGH-VOLUME
I/O ON THE LOCAL PARTITIONS, i.e., on /scratch.
- USERS WHO VIOLATE
THIS POLICY WILL BE DEEMED NEOPHYTES AND WILL BE SUBJECT TO
HAVING ALL OF THEIR
PROCESSES ON THE CLUSTER SUMMARILY TERMINATED.
- The cluster is now running Mandrake 10.1 with 2.6
kernels. In 1999, when we last did an install of the OS (Mandrake
6.1), we simply installed everything. This is no longer an easy
task, so if you find some software and/or capability that is missing,
please feel free to report that fact to management, and we will do our
best to rectify the situation, provided that the software and its use
is compatible with the operating principles of the cluster.
- The PGI compiler suite,
version 5.2-2, is installed and
available on ALL machines (not
just the front ends as previously,
license now served off VNP4). If your compilation is QUICK, then
by all means, compile on a node. If not, COMPILE ON YOUR
FRONT-END. Since vnfe2 no longer exists, if you used to use vnfe2
as your front-end you will need to figure out (via cd; pwd) which is
"your" front-end (i.e. the one on which your NFS home directory for the
cluster is physically mounted).
- BUILDING MPI APPS WITH
MPICH: The version of MPICH that
has been compiled with the PGI compilers is mpich.1.2.6
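A typical build-and-run session looks something like the sketch below (`mpicc` and `mpirun` are the standard mpich-1.2.6 wrapper names; the machinefile name and helper function are hypothetical examples — check `which mpicc` on your front-end for the actual install):

```shell
# Typical MPICH workflow (illustrative, not executed here):
#   mpicc -O2 -o myapp myapp.c
#   mpirun -np 8 -machinefile machines ./myapp
# Helper that assembles the launch command without executing it:
mpirun_cmd() {
    echo "mpirun -np $1 -machinefile $2 $3"
}

mpirun_cmd 8 machines ./myapp
```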
- The INTEL compiler suite,
version 8.1, is installed and
available on ALL machines
(license served off VNP4). If your
compilation is QUICK (i.e. 1-20 seconds of compile time), then by all
means compile on a node. If not, COMPILE ON YOUR FRONT-END. Read
3. above if you don't know which machine is your front-end.
- BUILDING MPI APPS WITH
MPICH: The version of MPICH
that has been compiled with the Intel compilers is mpich.1.2.6
- PARALLEL CONVENIENCE FEATURES: a utility
based on Matt's old private script, Mpirun, is available to help you
easily launch interactive parallel jobs. See HERE for
details; the associated per-user setup is automatically created for you
at login, should it not already exist.
SUMMARY OF WHAT HASN'T CHANGED WITH THE UPGRADE
- ANY OF THE (AMENDED) BASIC
OPERATIONAL RULES (SEE SYSTEM USE SECTION BELOW).
- THIS STILL ISN'T A SYSTEM FOR COMPUTING NEOPHYTES.
All users MUST login to the cluster
machines using ssh (the secure shell).
Note that the cluster is now running OpenSSH Version 3.9p1,
Protocols 1.5 and 2.0.
Please also see the Warnings below.
- See HERE for a complete list of
currently available machine names and IP numbers
- 2 front-end nodes, 62 compute nodes. Each compute node has dual
PIII 850 MHz processors and 512 MByte of RAM.
- Use vnN (/d/vnfe1/home/matt/scripts/vnN) to
see currently active compute
nodes (caveat emptor)
- If you are in Steven Plotkin's group, see HERE.
At least while the cluster is under construction (and possibly after
that), the cluster will be operated essentially as a cluster of
workstations. To this end, you will be able to ssh directly to
any of the compute nodes, as well as the front end nodes, and do pretty
much everything on a compute node that you would on a front end node.
There is currently NO BATCH SYSTEM on the cluster. Users should
feel free to interactively
start a reasonable number of production jobs on whatever machine(s) they see fit.
Use ruptime on one of the nodes to see load averages on all
machines in the cluster.
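Picking an idle node then comes down to sorting the `ruptime` output by load; a small sketch (the parsing assumes the usual `ruptime` column layout, which may vary between systems; pipe real data in with `ruptime | least_loaded`):

```shell
# Print the up machine with the lowest 1-minute load average.
least_loaded() {
    awk '/ up / { load = $(NF-2); gsub(",", "", load); print load, $1 }' |
        sort -n | head -n 1
}

# Example against sample ruptime output:
least_loaded <<'EOF'
vn1   up  9+07:22,  2 users,  load 0.95, 0.90, 0.88
vn2   up  3+01:10,  0 users,  load 0.02, 0.05, 0.01
vn3  down 0+02:45
EOF
```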
In order to maximize the usefulness of the cluster, users should
abide by the following guidelines:
Send mail to Matthew Choptuik, if there
is software you wish to have installed. Please include a description of
the software, and, if possible, a distribution site from which it can be obtained.
- BE AWARE AND CONSIDERATE OF OTHER USERS.
- Please ensure that you have a valid .forward file in your
home directory on the cluster so that mail sent to you will actually get to you.
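Setting one up is a one-liner, sketched below (the helper function is hypothetical and takes the home directory as a parameter only so the sketch can be tested anywhere; the address is a placeholder you must replace):

```shell
# Write a .forward file so that cluster mail follows you elsewhere.
set_forward() {
    printf '%s\n' "$2" > "$1/.forward"
}

# On the cluster:  set_forward "$HOME" "you@your.real.address"
```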
- Minimize the amount of network traffic to and from the cluster.
The cluster's current link to the outside world is 100 MegaBIT/s maximum---about
10 Mbyte/s. Extremely large files should definitely be moved to and fro
during off-peak hours Pacific Time.
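At roughly 10 MByte/s, a back-of-the-envelope transfer-time estimate (best case, since the link is shared) can be sketched as:

```shell
# Seconds to move a file of the given size in MBytes over a ~10 MByte/s
# link (ideal case; real throughput on the shared link is lower).
est_seconds() {
    awk -v mb="$1" 'BEGIN { printf "%.0f\n", mb / 10 }'
}

est_seconds 1000   # a 1 GByte file: about 100 seconds, best case
```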
- Treat the cluster as a remote computing environment:
- If at all possible, develop and debug on another Linux/Unix system.
- If at all possible, build the executable from source on
the cluster, to avoid problems with (e.g.) machine-dependent libraries.
- From time-to-time, nodes in the cluster may have to be
taken down for reboots on very short notice. Users running very
long jobs that do not periodically checkpoint themselves do so at their
own peril. There is currently no system-wide mechanism for
suspending, then restarting, jobs. Management accepts no responsibility
for lost time and/or data.
- DO NOT use the front-end nodes for long jobs (more than a few
CPU minutes), except by arrangement with the management.
- If demand for the compute nodes is high, minimize the amount of
development work you do on those nodes.
- DO NOT start long jobs (more than a few minutes) if there
are already two jobs running on the node. Use top to
determine the number of CPU intensive jobs that are currently running.
- DO start one additional job on a node that is
currently running a job, unless there are completely free nodes. (Again
use ruptime and top.)
- DO NOT start a job that will result in total memory usage
on a node exceeding 90%. (Once again, use top to see
what percentage of memory a currently running process is using.)
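One way to check the headroom is to total the %MEM column of `ps aux`; a sketch (the function name is hypothetical, and the column position assumes the standard `ps aux` layout; pipe real data in with `ps aux | mem_used`):

```shell
# Sum the %MEM column (field 4) of `ps aux`-style input.
mem_used() {
    awk 'NR > 1 { total += $4 } END { printf "%.1f\n", total }'
}

# Example: two jobs at 45% each leave no room for a third under the 90% rule.
mem_used <<'EOF'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
matt 101 99.0 45.0 1 1 ? R 10:00 5:00 ./job1
matt 102 99.0 45.0 1 1 ? R 10:01 4:58 ./job2
EOF
```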
- WATCH YOUR DISK USAGE, particularly on the front end
nodes. This is not a system for computing neophytes, and users unable
or unwilling to keep /home directories under control will be [severe
punishment to be determined later].
- SCRATCH SPACE: Each user has a scratch directory /scratch/$USER
that should be used for local storage on the compute nodes. PLEASE
DO NOT WRITE LARGE DATA FILES TO /tmp ITSELF, and PLEASE WATCH YOUR
SCRATCH USAGE, ESPECIALLY IF /scratch's PARTITION IS 80% FULL OR MORE.
Scratch should be used especially in those cases when a user has many
I/O intensive jobs running simultaneously since, in such instances,
writing to the NFS mounted partitions can easily swamp the front-ends.
- BE AWARE AND CONSIDERATE OF OTHER USERS.
Unless otherwise specified, all software is available on all
machines (both front-end and compute nodes)
Maintained by firstname.lastname@example.org.
Supported by CIAR, CFI and NSERC.