######################################################################
vn Crash reports
######################################################################
CRASH_1
######################################################################
(1) Tue Nov 16 20:24:20 PST 1999

vn1 was manually rebooted after Matt noticed it had hung a couple of
hours earlier. Matt couldn't wait for the fsck to finish, so got the
monitor attached just about the time the machine was finally coming up.
Probably coincidence.

Nov 16 18:24:16 vn1 -- MARK --
Nov 16 18:44:16 vn1 -- MARK --
Nov 16 19:00:18 vn1 sshd[18725]: log: Closing connection to 142.103.234.31
Nov 16 19:01:01 vn1 anacron[20549]: Updated timestamp for job `cron.hourly' to 1999-11-16
Nov 16 19:01:01 vn1 PAM_pwdb[20553]: (su) session opened for user news by (uid=9)
Nov 16 19:01:01 vn1 PAM_pwdb[20553]: (su) session closed for user news

SHOULD LOOK AT THE CRONTABS (??), BUT NOTHING OBVIOUS IN THE LOG

:n image/*.1/var/log/messages.991116

(2) Tue Nov 16 21:27:09 PST 1999

Things look pretty stable.

vn1 up 1:09, 0 users, load 0.08, 0.11, 0.10
.
.
vnfe1 up 4+06:17, 2 users, load 0.98, 0.46, 0.27
vnfe2 up 4+06:17, 0 users, load 0.11, 0.14, 0.08
vnfe3 up 4+06:17, 0 users, load 0.10, 0.13, 0.08
######################################################################
CRASH_2
######################################################################
Thu Nov 18 19:45:23 PST 1999
############################################################
(1) vnfe1 ran out of processes (too many defunct), had to be manually reset

1316 ? Z 0:00 [kwmsound ]
######################################################################
CRASH_3
######################################################################
Thu Nov 18 20:24:57 PST 1999
############################################################
(1) vn18 down

Fri Nov 19 11:11:53 PST 1999
Put vn18's hard drive into vn48, vn18 now up, vn48 down.
######################################################################
CRASH_4
######################################################################
Fri Nov 19 08:11:01 PST 1999
######################################################################
Stopped vn3 changed BIOS settings (ignore ERRORS) rebooted OK
Stopped vn5 changed BIOS settings (ignore ERRORS) rebooted OK
######################################################################
CRASH_5
######################################################################
Fri Nov 19 12:07:56 PST 1999
############################################################
(1) vn44 down

Probably time to take all machines down and check the BIOS.

# From /var/tmp/messages
Nov 18 11:01:00 vn44 PAM_pwdb[8211]: (su) session opened for user news by (uid=9)
Nov 18 11:01:00 vn44 PAM_pwdb[8211]: (su) session closed for user news
Nov 18 11:14:46 vn44 sshd[8232]: log: Connection from 142.103.237.17 port 975
Nov 18 11:14:46 vn44 sshd[8232]: log: RSA authentication for root accepted.
Nov 18 11:14:46 vn44 sshd[8232]: log: ROOT LOGIN as 'root' from vn17.physics.ubc.ca
Nov 18 11:14:46 vn44 sshd[8234]: log: executing remote command as root: date
Nov 18 11:14:46 vn44 sshd[8232]: log: Closing connection to 142.103.237.17
Nov 18 11:23:49 vn44 sshd[8250]: log: Connection from 142.103.237.225 port 972
Nov 18 11:23:49 vn44 sshd[8250]: log: RSA authentication for root accepted.
Nov 18 11:23:49 vn44 sshd[8250]: log: ROOT LOGIN as 'root' from vnfe1.physics.ubc.ca
Nov 18 11:23:49 vn44 sshd[8252]: log: executing remote command as root: date
Nov 18 11:23:49 vn44 sshd[8250]: log: Closing connection to 142.103.237.225
Nov 18 11:40:00 vn44 sshd[8268]: log: Connection from 128.83.131.6 port 1022
Nov 18 11:40:04 vn44 sshd[8268]: log: Password authentication for ehonda accepted.
Nov 18 11:40:39 vn44 kernel: nfs_dentry_delete: res_r0=4.45/phicore.dat: ino=2707513, count=2, nlink=1 Nov 18 11:40:40 vn44 kernel: nfs_dentry_delete: .ssh/known_hosts: ino=786482, count=2, nlink=1 Nov 18 11:40:40 vn44 kernel: nfs_dentry_delete: res_r0=4.45/esum.dat: ino=2707516, count=2, nlink=1 Nov 18 11:42:32 vn44 sshd[8268]: log: Closing connection to 128.83.131.6 Nov 18 11:48:14 vn44 sshd[6358]: log: Generating new 768 bit RSA key. Nov 18 11:48:16 vn44 sshd[6358]: log: RSA key generation complete. Nov 18 14:30:09 vn44 syslogd 1.3-3: restart. Nov 18 14:30:09 vn44 syslog: syslogd startup succeeded Nov 18 14:30:09 vn44 kernel: klogd 1.3-3, log source = /proc/kmsg started. Nov 18 14:30:09 vn44 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Nov 18 14:30:09 vn44 syslog: klogd startup succeeded ###################################################################### CRASH_6/7 ###################################################################### Fri Nov 19 7:00:00 PST 1999 (Approximately) (1) vn38/vn44 down for (0:42 / 2:56) respectively, note vn44 down for second time. Fri Nov 19 08:09:01 PST 1999 # Connecting cables to vn38 # BIOS settings look OK, machine starts but WAS halted and DID NOT restart # Connecting cables to vn44, which is also halted # Same story, BIOS settings looked kosher (all errors, AC power restart), # changed Primary detection to auto, and disabled Power on on modem activity # just for kicks. # vn38 is up ... 
log excerpt Nov 19 04:54:23 vn38 -- MARK -- Nov 19 05:01:00 vn38 anacron[1719]: Updated timestamp for job `cron.hourly' to 1999-11-19 Nov 19 05:01:00 vn38 PAM_pwdb[1723]: (su) session opened for user news by (uid=9) Nov 19 05:01:00 vn38 PAM_pwdb[1723]: (su) session closed for user news Nov 19 05:14:23 vn38 -- MARK -- Nov 19 05:34:23 vn38 -- MARK -- Nov 19 05:54:23 vn38 -- MARK -- Nov 19 06:01:00 vn38 anacron[1759]: Updated timestamp for job `cron.hourly' to 1999-11-19 Nov 19 06:01:00 vn38 PAM_pwdb[1763]: (su) session opened for user news by (uid=9) Nov 19 06:01:00 vn38 PAM_pwdb[1763]: (su) session closed for user news Nov 19 06:14:23 vn38 -- MARK -- Nov 19 06:15:34 vn38 sshd[1784]: log: Connection from 128.83.131.6 port 1022 Nov 19 06:15:35 vn38 sshd[1784]: log: Rhosts with RSA host authentication accepted for ehonda, ehonda on einstein.ph.utexas.edu. Nov 19 06:16:48 vn38 sshd[1784]: log: Closing connection to 128.83.131.6 Nov 19 08:28:29 vn38 syslogd 1.3-3: restart. Nov 19 08:28:29 vn38 syslog: syslogd startup succeeded Nov 19 08:28:29 vn38 kernel: klogd 1.3-3, log source = /proc/kmsg started. Nov 19 08:28:29 vn38 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Nov 19 08:28:29 vn38 syslog: klogd startup succeeded Nov 19 08:28:30 vn38 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp. Nov 19 08:28:30 vn38 kernel: Symbols match kernel version 2.2.13. Nov 19 08:28:30 vn38 kernel: Loaded 123 symbols from 6 modules. 
Nov 19 08:28:30 vn38 kernel: Linux version 2.2.13-7mdksmp (root@kenobi.mandrakesoft.com) (gcc version 2.95.1 19990816 (release)) #1 SMP Wed Sep 15 16:38:50 CEST 1999 # vn44 coming up, but was taken down dirty (vn38 apparently not, but that may # have been due to lack of activity) # Nope, had to do a manual fsck on /dev/hda1, next time use # yes | /dev/hda1 Nov 19 04:01:00 vn44 anacron[982]: Updated timestamp for job `cron.hourly' to 1999-11-19 Nov 19 04:01:00 vn44 PAM_pwdb[986]: (su) session opened for user news by (uid=9) Nov 19 04:01:00 vn44 PAM_pwdb[986]: (su) session closed for user news Nov 19 04:02:00 vn44 anacron[1009]: Updated timestamp for job `cron.daily' to 1999-11-19 Nov 19 04:02:00 vn44 PAM_pwdb[1013]: (su) session opened for user news by (uid=9) Nov 19 04:02:33 vn44 PAM_pwdb[1013]: (su) session closed for user news Nov 19 08:40:39 vn44 syslogd 1.3-3: restart. Nov 19 08:40:39 vn44 syslog: syslogd startup succeeded Nov 19 08:40:39 vn44 kernel: klogd 1.3-3, log source = /proc/kmsg started. Nov 19 08:40:39 vn44 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Nov 19 08:40:39 vn44 syslog: klogd startup succeeded Nov 19 08:40:39 vn44 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp. Nov 19 08:40:39 vn44 kernel: Symbols match kernel version 2.2.13. Nov 19 08:40:39 vn44 kernel: Loaded 123 symbols from 6 modules. 
Nov 19 08:40:39 vn44 kernel: Linux version 2.2.13-7mdksmp (root@kenobi.mandrakesoft.com) (gcc version 2.95.1 19990816 (release)) #1 SMP Wed Sep 15 16:38:50 CEST 1999
######################################################################
RECOMMENDATION: Next time either one goes down, stuff disk into
tail-end node (if necessary) and send to Varsity
######################################################################
CRASH_8
######################################################################
Fri Nov 19 14:59:48 PST 1999

vn44 down

(1) Disconnected vn44 and sent back to Varsity with other "defective" nodes
############################################################
CRASH_9
######################################################################
Fri Nov 19 15:43:42 PST 1999

(1) vn38 down (surprise!!), make vn43 -> vn38, vnN -> 42

DONE
############################################################
CRASH_10
######################################################################
Wed Nov 24 06:25:52 PST 1999

(1) vn43 down

System DOA as per vn38 etc. previously
Shutting down vn59, swapped in vn43's disks, bringing up as vn43
vn43 up, vn59 labelled for shipout, vn60 still here!!

DONE
############################################################
CRASH_11
############################################################
Tue Nov 23 11:47:26 PST 1999

Connecting vn49 ... vn60.
All power on, but vn59's fan is not running vn59 went back into shop but came back Nov 30 with fan still not running Wed Dec 1 11:56:21 PST 1999 Machine back as vn63 ############################################################ CRASH_12 ############################################################ Wed Dec 1 10:54:23 PST 1999 vn59 down 2:00 vn60 down 1:54 # Almost certainly nodes which died before # Bill's coming over to do on-site replacement of power # supplies # Taking down vn61, vn62 <-> vn59, vn60 # As matt@vnfe1 viw vnN N=60 vnMakeMPIMachines 1 60 ############################################################ CRASH_13 ############################################################ # vn59 is down (was vn61 this morning) # Take down vn63 <-> vn59 Thu Dec 2 11:14:13 PST 1999 # vn63's power supply replaced # vn63 comes up OK, needs secondary configuration ############################################################ CRASH_14 ############################################################ Tue Dec 7 22:03:49 PST 1999 # vn20 is down (and machine room is warm, complain to staff) # but still running, can't get video out of it, get # brutal, just as well since it's some kind of kernel panic. Remains to # be seen whether or not it's heat related. Dec 7 01:01:00 vn20 PAM_pwdb[17801]: (su) session opened for user news by (uid=9) Dec 7 01:01:01 vn20 PAM_pwdb[17801]: (su) session closed for user news Dec 7 01:03:25 vn20 kernel: eth0: Transmit timed out: status 0050 0000 at 11/11 command 000ca000. Dec 7 01:03:25 vn20 kernel: eth0: Trying to restart the transmitter... . . . 
*** about 15872 times *** Tue Dec 7 23:17:26 PST 1999 ############################################################ CRASH_15 ############################################################ Wed Dec 8 15:33:14 PST 1999 (1) vn61 went incommunicado while we were installing vn64 Still pingable, connecting video, login screen, but can't get control via mouse or keyboard, hard reboot comes up Dec 8 14:01:00 vn61 PAM_pwdb[26099]: (su) session closed for user news Dec 8 17:01:12 vn61 sshd[673]: log: Generating new 768 bit RSA key. Dec 8 17:01:12 vn61 sshd[673]: log: RSA key generation complete. Dec 8 17:20:41 vn61 -- MARK -- Dec 8 17:40:41 vn61 -- MARK -- Dec 8 17:52:01 vn61 sshd[26129]: log: Connection from 128.83.131.6 port 1021 Dec 8 17:52:03 vn61 sshd[26129]: log: Rhosts with RSA host authentication accepted for ehonda, ehonda on einstein.ph.utexas.edu. Dec 8 17:52:10 vn61 kernel: nfs_dentry_delete: Cactus/B: ino=4102169, count=2, nlink=3 Dec 8 15:44:30 vn61 syslogd 1.3-3: restart. Dec 8 15:44:30 vn61 syslog: syslogd startup succeeded Dec 8 15:44:30 vn61 kernel: klogd 1.3-3, log source = /proc/kmsg started. NFS problem?? ############################################################ CRASH_16 ############################################################ Fri Dec 10 20:58:27 PST 1999 (1) vn43 incommunicado (really need to get remote re-boot figured out, also faster disk recovery) Hard reboot, back up at Fri Dec 10 22:00:13 PST 1999 Same problem as CRASH_14 (kernel: eth0, and lots of error messages in log! ... buggy kernel??) 
Dec 10 19:01:02 vn43 PAM_pwdb[32270]: (su) session closed for user news Dec 10 19:12:34 vn43 -- MARK -- Dec 10 19:32:34 vn43 -- MARK -- Dec 10 19:52:34 vn43 -- MARK -- Dec 10 20:01:01 vn43 anacron[32306]: Updated timestamp for job `cron.hourly' to 1999-12-10Dec 10 20:01:01 vn43 PAM_pwdb[32310]: (su) session opened for user news by (uid=9) Dec 10 20:01:02 vn43 PAM_pwdb[32310]: (su) session closed for user news Dec 10 20:12:34 vn43 -- MARK -- Dec 10 20:32:34 vn43 -- MARK -- Dec 10 20:46:54 vn43 kernel: eth0: Transmit timed out: status 0050 0000 at 2/2 command 000ca000. Dec 10 20:46:54 vn43 kernel: eth0: Trying to restart the transmitter... Dec 10 20:46:59 vn43 kernel: eth0: Transmit timed out: status 0050 0000 at 2/2 command 000ca000. Dec 10 20:46:59 vn43 kernel: eth0: Trying to restart the transmitter... Dec 10 20:47:04 vn43 kernel: eth0: Transmit timed out: status 0050 0000 at 2/2 co ############################################################ CRASH_17 ############################################################ Sat Dec 11 12:09:43 PST 1999 (1) vn33 down, odds are on eth0 problem, had one previously vn33 down 1:28 ############################################################ Replaced network cards vn1, vn6, vn20, vn33, vn39, vn43, vn50 ############################################################ ############################################################ CRASH_18 ############################################################ Tue Dec 14 18:17:33 PST 1999 (1) vn50 down at 17:31?? Still ping-able, not manifestly eth0 problem vn50 down 0:46 Dec 14 05:59:44 vn50 sshd[21692]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. 
Dec 14 05:59:44 vn50 sshd[21694]: log: executing remote command as user matt Dec 14 05:59:45 vn50 sshd[21692]: log: Closing connection to 142.103.237.225 Dec 14 06:01:00 vn50 anacron[21718]: Updated timestamp for job `cron.hourly' to 1999-12-14Dec 14 06:01:00 vn50 PAM_pwdb[21722]: (su) session opened for user news by (uid=9) Dec 14 06:01:00 vn50 PAM_pwdb[21722]: (su) session closed for user news Dec 14 06:17:58 vn50 -- MARK -- Dec 14 06:37:58 vn50 -- MARK -- Dec 14 06:38:26 vn50 sshd[659]: log: Generating new 768 bit RSA key. Dec 14 06:38:26 vn50 sshd[659]: log: RSA key generation complete. Dec 14 06:57:58 vn50 -- MARK -- Dec 14 07:01:00 vn50 anacron[21757]: Updated timestamp for job `cron.hourly' to 1999-12-14Dec 14 07:01:00 vn50 PAM_pwdb[21761]: (su) session opened for user news by (uid=9) Dec 14 07:01:00 vn50 PAM_pwdb[21761]: (su) session closed for user news Dec 14 07:17:58 vn50 -- MARK -- Dec 14 07:37:58 vn50 -- MARK -- Follow up??? But 'atsci' on node at about 14:30 Dec 14 14:48:18 vn50 PAM_pwdb[22460]: (rsh) session opened for user atsci by (uid=0) Dec 14 14:26:23 vn52 PAM_pwdb[22382]: (rsh) session closed for user atsci Dec 14 14:27:41 vn52 pam_rhosts_auth[22395]: allowed to atsci@vnfe3.physics.ubc.ca as atsci Dec 14 14:27:41 vn52 PAM_pwdb[22395]: (rsh) session opened for user atsci by (uid=0) Dec 14 14:27:42 vn52 kernel: nfs_dentry_delete: process/1: ino=3563562, count=2, nlink=2 Dec 14 14:31:12 vn52 kernel: eth0: Transmit timed out: status 0050 0000 at 872459111/872459111 command 000ca000. Dec 14 14:31:12 vn52 kernel: eth0: Trying to restart the transmitter... ############################################################ CRASH_19 ############################################################ Wed Dec 15 18:18:05 PST 1999 (1) Josh apparently hung up vn60 Dec 15 20:26:42 vn60 sshd[662]: log: Generating new 768 bit RSA key. Dec 15 20:26:43 vn60 sshd[662]: log: RSA key generation complete. 
Dec 15 20:45:37 vn60 -- MARK -- Dec 15 20:58:41 vn60 sshd[24387]: log: Connection from 142.103.237.227 port 1023 Dec 15 20:58:45 vn60 sshd[24387]: log: Password authentication for atsci accepted. Dec 15 18:01:00 vn60 anacron[24412]: Updated timestamp for job `cron.hourly' to 1999-12-15Dec 15 18:01:00 vn60 PAM_pwdb[24416]: (su) session opened for user news by (uid=9) Dec 15 18:01:01 vn60 PAM_pwdb[24416]: (su) session closed for user news Dec 15 18:45:34 vn60 syslogd 1.3-3: restart. Dec 15 18:45:34 vn60 syslog: syslogd startup succeeded Dec 15 18:45:34 vn60 kernel: klogd 1.3-3, log source = /proc/kmsg started. Dec 15 18:45:34 vn60 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Dec 15 18:45:34 vn60 syslog: klogd startup succeeded ############################################################ CRASH_20 ############################################################ Thu Dec 16 09:23:24 PST 1999 (1) vn17 rebooted itself via VnHello, screwed up clock in the process, Roman Petryk was primary victim. Dec 16 08:08:14 vn17 sshd[12488]: log: ROOT LOGIN as 'root' from vnfe1.physics.ubc.ca Dec 16 08:08:15 vn17 sshd[12490]: log: executing remote command as root: cat /var/log/messages.vnHello Dec 16 08:08:16 vn17 sshd[12488]: log: Closing connection to 142.103.237.225 Dec 16 08:16:45 vn17 gpm[535]: Error in protocol Dec 16 08:16:47 vn17 innd: innd shutdown succeeded Dec 16 08:16:48 vn17 innd: actived -9 succeeded Dec 16 08:16:49 vn17 xfs: xfs shutdown succeeded . . . Dec 16 08:17:10 vn17 crond: crond shutdown succeeded Dec 16 08:17:11 vn17 lpd: lpd shutdown succeeded Dec 16 08:17:13 vn17 kernel: Kernel logging (proc) stopped. Dec 16 08:17:13 vn17 kernel: Kernel log daemon terminating. Dec 16 08:17:14 vn17 syslog: klogd shutdown succeeded Dec 16 08:17:15 vn17 exiting on signal 15 Dec 17 00:18:56 vn17 syslogd 1.3-3: restart. Dec 17 00:18:56 vn17 syslog: syslogd startup succeeded Dec 17 00:18:56 vn17 kernel: klogd 1.3-3, log source = /proc/kmsg started. 
Dec 17 00:18:56 vn17 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp
Dec 17 00:18:56 vn17 syslog: klogd startup succeeded
############################################################
CRASH_21
############################################################
Fri Dec 17 14:31:32 PST 1999

vnHello: All appears well on vn36.physics.ubc.ca at Fri Dec 17 14:16:31 PST 1999
vnHello: ---------------------------------------------------------------------
vnHello: Executing on vn36.physics.ubc.ca at Fri Dec 17 14:30:01 PST 1999
vnHello: Rebooting vn36.physics.ubc.ca at Fri Dec 17 14:31:32 PST 1999
vnHello: ---------------------------------------------------------------------
vnHello: Executing on vn36.physics.ubc.ca at Fri Dec 17 14:45:00 PST 1999

# But no trace in log file of receiver hang-up

Dec 17 14:28:30 vn36 sshd[26999]: log: Rhosts with RSA host authentication accepted for root, matt on vnfe1.physics.ubc.ca.
Dec 17 14:28:30 vn36 sshd[26999]: log: ROOT LOGIN as 'root' from vnfe1.physics.ubc.ca
Dec 17 14:28:30 vn36 sshd[27001]: log: executing remote command as root: cdi; setenv CFLAGS "-O3"; setenv FFLAGS "-O3"; Installz jvs
Dec 17 14:28:49 vn36 sshd[26999]: log: Closing connection to 142.103.237.225
Dec 17 14:31:35 vn36 gpm[531]: Error in protocol
Dec 17 14:31:38 vn36 innd: innd shutdown succeeded
Dec 17 14:31:38 vn36 innd: actived -9 succeeded

# Ethan was running 'bubbles' at the time ...
############################################################
CRASH_22
############################################################
Wed Dec 22 17:25:53 PST 1999

[matt@vn5 ~]$ down
rar0502 down 39+23:24
vn45 down 4:09

Dec 22 17:01:01 vn40 PAM_pwdb[15845]: (su) session opened for user news by (uid=9)
Dec 22 17:01:02 vn40 PAM_pwdb[15845]: (su) session closed for user news
Dec 22 17:05:31 vn40 sshd[15872]: log: Connection from 142.103.237.225 port 999
Dec 22 17:05:31 vn40 sshd[15872]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Dec 22 17:05:31 vn40 sshd[15874]: log: executing remote command as user matt Dec 22 17:05:34 vn40 sshd[15872]: log: Closing connection to 142.103.237.225 Dec 22 17:13:30 vn40 sshd[15894]: log: Connection from 142.103.237.225 port 1000 Dec 22 17:13:30 vn40 sshd[15894]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 17:13:30 vn40 sshd[15896]: log: executing remote command as user matt Dec 22 17:13:33 vn40 sshd[15894]: log: Closing connection to 142.103.237.225 Dec 22 17:21:26 vn40 sshd[15939]: log: Connection from 142.103.237.225 port 1001 Dec 22 17:21:26 vn40 sshd[15939]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 17:21:26 vn40 sshd[15941]: log: executing remote command as user matt Dec 22 17:21:28 vn40 sshd[15939]: log: Closing connection to 142.103.237.225 Dec 22 17:29:19 vn40 sshd[15959]: log: Connection from 142.103.237.225 port 1001 Dec 22 17:29:19 vn40 sshd[15959]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 17:29:20 vn40 sshd[15961]: log: executing remote command as user matt Dec 22 17:29:22 vn40 sshd[15959]: log: Closing connection to 142.103.237.225 Dec 22 17:35:55 vn40 sshd[657]: log: Generating new 768 bit RSA key. Dec 22 17:35:55 vn40 sshd[657]: log: RSA key generation complete. Dec 22 17:37:32 vn40 sshd[16007]: log: Connection from 142.103.237.225 port 999 Dec 22 17:37:32 vn40 sshd[16007]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 17:37:32 vn40 sshd[16009]: log: executing remote command as user matt Dec 22 17:37:35 vn40 sshd[16007]: log: Closing connection to 142.103.237.225 Dec 22 17:43:38 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:43:38 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:43:43 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. 
Dec 22 17:43:43 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:43:48 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:43:48 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:43:53 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:43:53 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:43:58 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:43:58 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:03 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:03 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:08 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:08 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:13 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:13 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:18 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:18 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:23 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:23 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:28 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:28 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:33 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. Dec 22 17:44:33 vn40 kernel: eth0: Trying to restart the transmitter... Dec 22 17:44:38 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000. 
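# The eth0 hangs above all leave the same signature in the log: a
# "Transmit timed out" message repeated every few seconds. Rather than pasting
# the run, it can be reduced to one line per host. The following is only a
# sketch in Python; the names TIMEOUT_RE and summarize_timeouts are
# hypothetical, and it assumes the message format stays exactly as in the
# excerpts in this file.

```python
import re

# Matches syslog entries such as:
#   Dec 22 17:43:38 vn40 kernel: eth0: Transmit timed out: status 0050 0000 ...
# finditer() is used so that entries are found even when several of them
# share one physical line.
TIMEOUT_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: eth0: Transmit timed out"
)


def summarize_timeouts(text):
    """Return {host: (first_stamp, last_stamp, count)} for eth0 timeouts."""
    summary = {}
    for match in TIMEOUT_RE.finditer(text):
        host = match.group("host")
        stamp = match.group("stamp")
        if host in summary:
            first, _, count = summary[host]
            summary[host] = (first, stamp, count + 1)
        else:
            summary[host] = (stamp, stamp, 1)
    return summary


if __name__ == "__main__":
    sample = (
        "Dec 22 17:37:35 vn40 sshd[16007]: log: Closing connection to 142.103.237.225\n"
        "Dec 22 17:43:38 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000.\n"
        "Dec 22 17:43:38 vn40 kernel: eth0: Trying to restart the transmitter...\n"
        "Dec 22 17:43:43 vn40 kernel: eth0: Transmit timed out: status 0050 0000 at 8/8 command 000ca000.\n"
    )
    for host, (first, last, count) in summarize_timeouts(sample).items():
        print(host, "first:", first, "last:", last, "count:", count)
```

# The first timestamp per host is the useful number: it marks when the
# interface stopped transmitting, which can then be compared with the
# reported downtime.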
############################################################ CRASH_23 ############################################################ Wed Dec 22 17:25:53 PST 1999 [matt@vn5 ~]$ down rar0502 down 40+00:01 vn40 down 0:17 vn45 down 4:42 Dec 22 12:56:35 vn45 sshd[18057]: log: executing remote command as user matt Dec 22 12:56:37 vn45 sshd[18055]: log: Closing connection to 142.103.237.225 Dec 22 13:01:01 vn45 anacron[18098]: Updated timestamp for job `cron.hourly' to 1999-12-22 Dec 22 13:01:01 vn45 PAM_pwdb[18102]: (su) session opened for user news by (uid=9) Dec 22 13:01:02 vn45 PAM_pwdb[18102]: (su) session closed for user news Dec 22 13:04:23 vn45 sshd[18130]: log: Connection from 142.103.237.225 port 996 Dec 22 13:04:23 vn45 sshd[18130]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 13:04:23 vn45 sshd[18132]: log: executing remote command as user matt Dec 22 13:04:26 vn45 sshd[18130]: log: Closing connection to 142.103.237.225 Dec 22 13:12:11 vn45 sshd[18152]: log: Connection from 142.103.237.225 port 997 Dec 22 13:12:11 vn45 sshd[18152]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Dec 22 13:12:11 vn45 sshd[18154]: log: executing remote command as user matt Dec 22 13:12:14 vn45 sshd[18152]: log: Closing connection to 142.103.237.225 Dec 22 13:18:01 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:01 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:06 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:06 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:11 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:11 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:16 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. 
Dec 22 13:18:16 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:21 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:21 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:26 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:26 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:31 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:31 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:36 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:36 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:41 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:41 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:46 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:46 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:51 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:51 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:18:56 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:18:56 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:01 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:01 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:06 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:06 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:11 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:11 vn45 kernel: eth0: Trying to restart the transmitter... 
Dec 22 13:19:16 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:16 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:21 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:21 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:26 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:26 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:31 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:31 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:36 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:36 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:41 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:41 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:46 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:46 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:51 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:51 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:19:56 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:19:56 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:20:01 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:20:01 vn45 kernel: eth0: Trying to restart the transmitter... Dec 22 13:20:06 vn45 kernel: eth0: Transmit timed out: status 0050 0000 at 0/0 command 000ca000. Dec 22 13:20:06 vn45 kernel: eth0: Trying to restart the transmitter... 
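# The "down" listings give a duration, not a time of failure. Subtracting the
# duration from the time the listing was taken gives an estimate of when the
# node dropped out, which can be checked against the first eth0 timeout in
# its log. A minimal Python sketch follows; it assumes "down" prints
# ruptime-style durations ("H:MM" or "D+HH:MM"), and the names DOWN_RE,
# parse_down and went_down_at are hypothetical.

```python
import re
from datetime import datetime, timedelta

# "<host> down [<days>+]<hours>:<minutes>", e.g. "vn45 down 4:09"
# or "rar0502 down 39+23:24".
DOWN_RE = re.compile(
    r"^(?P<host>\S+)\s+down\s+(?:(?P<days>\d+)\+)?(?P<hours>\d{1,2}):(?P<mins>\d{2})$"
)


def parse_down(line):
    """Return (host, timedelta) for a 'down' entry, or None if it is not one."""
    match = DOWN_RE.match(line.strip())
    if match is None:
        return None
    duration = timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("mins")),
    )
    return match.group("host"), duration


def went_down_at(report_time, line):
    """Estimate when a host went down, given when the listing was taken."""
    parsed = parse_down(line)
    if parsed is None:
        return None
    host, duration = parsed
    return host, report_time - duration


if __name__ == "__main__":
    # Listing from CRASH_22, taken Wed Dec 22 17:25:53 PST 1999.
    report = datetime(1999, 12, 22, 17, 25, 53)
    host, when = went_down_at(report, "vn45 down 4:09")
    print(host, when)  # vn45 1999-12-22 13:16:53
```

# For vn45 the estimate lands about a minute before the first timeout logged
# above (13:18:01), which is as close as a listing with one-minute resolution
# can be expected to get.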
############################################################
CRASH_24
############################################################
Thu Dec 23 07:16:47 PST 1999

[matt@vnfe1 vn]$ down
rar0502 down 40+13:20
vn5 down 11:57

# (Machine apparently went down literally 5 minutes after I left the machine room)
# Still pingable; as suspected, Frans seems to have been the victim this
# time (NFS)

Dec 22 19:20:58 vn5 sshd[14170]: log: Connection from 142.103.237.225 port 1008
Dec 22 19:20:58 vn5 sshd[14170]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Dec 22 19:20:58 vn5 sshd[14172]: log: executing remote command as user matt
Dec 22 19:20:59 vn5 sshd[14170]: log: Closing connection to 142.103.237.225
Dec 22 19:22:53 vn5 kernel: nfs_lookup: ads/cs_gauss1 ino=1409068 in use, count=2, nlink=2
Dec 22 19:22:53 vn5 kernel: show_dentry: ads/cs_gauss1, d_count=3(unhashed)
Dec 22 19:23:14 vn5 kernel: nfs_dentry_delete: ads/cs_gauss_1: ino=1409068, count=2, nlink=2
Dec 22 19:23:14 vn5 kernel: nfs_lookup: ads/cs_gauss_1 ino=1409068 in use, count=2, nlink=2
Dec 22 19:23:14 vn5 kernel: show_dentry: ads/cs_gauss1, d_count=3(unhashed)
Dec 22 19:23:24 vn5 kernel: __nfs_fhget: inode 1409071 still busy, i_count=2
Dec 22 19:23:24 vn5 kernel: __nfs_fhget: killing cs_gauss1/.id0.swp filehandle
Dec 23 07:20:53 vn5 syslogd 1.3-3: restart.
Dec 23 07:20:53 vn5 syslog: syslogd startup succeeded
Dec 23 07:20:53 vn5 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Dec 23 07:20:53 vn5 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp
Dec 23 07:20:53 vn5 syslog: klogd startup succeeded
Dec 23 07:20:53 vn5 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp.

# Thu Dec 23 07:24:33 PST 1999
############################################################
CRASH_25
############################################################
Sat Dec 25 05:55:54 PST 1999

(1) Hung vn8 yesterday ...
# AT KINCK, RE-BOOT SINGLE USER, Restore /etc/sysconfig/network-scripts
# /etc/sysconfig/network-scripts.

#!/bin/sh
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
.
.
.
# Until we figure out the real solution
route add default gw 142.103.237.254 eth0 || echo "rc.local: route add default gw 142.103.237.254 eth0 failed"

if [ -f /usr/local/sbin/sshd ]; then
    /usr/local/sbin/sshd &
fi
if [ -f /usr/local/bin/ntpd ]; then
    /usr/local/bin/ntpd &
fi
############################################################
CRASH_26
############################################################
Sun Dec 26 02:45:39 PST 1999

(1) vn35 and vn55 (with new drivers) have rebooted themselves---still
    have lockup, but now with no error message??

# SEE README.KERNEL, downloaded diagnostic programs (eepro-diag, mii-diag),
# installed (but note, have to compile somewhere where kernel source has
# been configured, e.g. bh6) in ~matt/scripts and incorporated in
# vnHELLO watchdog
############################################################
CRASH_27
############################################################
Mon Dec 27 07:23:45 PST 1999

(1) vn3 rebooted at Dec 27 02:46

Dec 27 02:39:42 vn3 sshd[29989]: log: Closing connection to 142.103.237.225
Dec 27 02:46:53 vn3 gpm[529]: Error in protocol

# Messages in /var/log/messages.vnHello seem to indicate that Ethernet
# card is OK ... need better diagnostics??
############################################################
CRASH_28
############################################################
Mon Dec 27 16:22:38 PST 1999

(1) vn16 has been down for about 30 minutes, automatic reboot apparently
    not kicking in

vn16 down 0:30

"Typical" eth0: hang ...
Dec 27 15:46:31 vn16 sshd[2302]: log: executing remote command as user matt
Dec 27 15:46:33 vn16 sshd[2300]: log: Closing connection to 142.103.237.225
Dec 27 15:54:18 vn16 sshd[2326]: log: Connection from 142.103.237.225 port 1003
Dec 27 15:54:18 vn16 sshd[2326]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Dec 27 15:54:18 vn16 sshd[2328]: log: executing remote command as user matt
Dec 27 15:54:21 vn16 sshd[2326]: log: Closing connection to 142.103.237.225
Dec 27 15:55:36 vn16 kernel: eth0: Transmit timed out: status 0050 0000 at 12/12 command 000ca000.
Dec 27 15:55:36 vn16 kernel: eth0: Trying to restart the transmitter...
Dec 27 15:55:41 vn16 kernel: eth0: Transmit timed out: status 0050 0000 at 12/12 command 000ca000.
Dec 27 15:55:41 vn16 kernel: eth0: Trying to restart the transmitter...
Dec 27 15:55:46 vn16 kernel: eth0: Transmit timed out: status 0050 0000 at 12/12 command 000ca000.
Dec 27 15:55:46 vn16 kernel: eth0: Trying to restart the transmitter...
############################################################
CRASH_30
############################################################
(1) vn64 has problems
Dec 29 10:04:18 vn64 sshd[27985]: log: Closing connection to 142.103.237.225
Dec 29 10:12:04 vn64 sshd[28008]: log: Connection from 142.103.237.225 port 1004
Dec 29 10:12:05 vn64 sshd[28008]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Dec 29 10:12:05 vn64 sshd[28010]: log: executing remote command as user matt Dec 29 10:12:07 vn64 sshd[28008]: log: Closing connection to 142.103.237.225 Dec 29 10:16:52 vn64 modprobe: can't locate module lo:0 Dec 29 10:16:52 vn64 modprobe: can't locate module lo:1 # Rebooted remotely, not clear what happened ############################################################ CRASH_31 ############################################################ Sat Jan 1 16:42:08 PST 2000 (1) vn4 didn't come back after reboot No video/keyboard response, hard re-boot linux-new single linux-new # Came back ok, log looks strange, no boot-up messages ############################################################ CRASH_32 ############################################################ Mon Jan 3 07:43:57 PST 2000 (1) Mijan managed to hang vn1 pretty easily !!ssh root@vn1.physics.ubc.ca cat /tmp/log Jan 3 06:01:00 vn1 anacron[26664]: Updated timestamp for job `cron.hourly' to 2000-01-03 Jan 3 06:22:01 vn1 -- MARK -- Jan 3 06:41:13 vn1 sshd[26682]: log: Connection from 128.118.147.187 port 9273 Jan 3 06:41:19 vn1 sshd[26682]: log: Password authentication for mijan accepted. Jan 3 06:55:19 vn1 sshd[26755]: log: Connection from 142.103.237.225 port 1008 Jan 3 06:55:19 vn1 sshd[26755]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Jan 3 06:56:07 vn1 sshd[26755]: log: Closing connection to 142.103.237.225 Jan 3 06:57:50 vn1 sshd[27199]: log: Connection from 142.103.237.1 port 1019 Jan 3 06:57:52 vn1 sshd[27199]: log: RSA authentication for mijan accepted. Jan 3 06:57:52 vn1 sshd[27203]: log: executing remote command as user mijan Jan 3 06:58:56 vn1 sshd[27199]: fatal: Connection closed by remote host. Jan 3 06:59:15 vn1 sshd[27619]: log: Connection from 142.103.237.1 port 1002 Jan 3 06:59:16 vn1 sshd[27619]: log: RSA authentication for mijan accepted. 
Jan 3 06:59:16 vn1 sshd[27623]: log: executing remote command as user mijan Jan 3 07:01:02 vn1 anacron[27704]: Updated timestamp for job `cron.hourly' to 2000-01-03 Jan 3 07:55:10 vn1 syslogd 1.3-3: restart. Jan 3 07:55:10 vn1 syslog: syslogd startup succeeded Jan 3 07:55:10 vn1 kernel: klogd 1.3-3, log source = /proc/kmsg started. ############################################################ CRASH_33 ############################################################ Mon Jan 3 09:38:36 PST 2000 (1) vnfe2 down, mijan apparent culprit, will take this chance to reboot to attempt to fix processor non-detection No clear sign of what may have gone wrong Jan 3 08:30:32 vnfe2 sshd[13512]: log: executing remote command as user matt Jan 3 08:30:32 vnfe2 sshd[13510]: log: Closing connection to 142.103.175.48 Jan 3 08:31:46 vnfe2 sshd[13575]: log: Connection from 128.118.147.187 port 9440 Jan 3 08:31:49 vnfe2 sshd[13575]: log: Password authentication for mijan accepted. Jan 3 08:45:30 vnfe2 sshd[13817]: log: Connection from 142.103.175.48 port 1020 Jan 3 08:45:31 vnfe2 sshd[13817]: log: RSA authentication for matt accepted. Jan 3 08:45:31 vnfe2 sshd[13819]: log: executing remote command as user matt Jan 3 08:45:31 vnfe2 sshd[13817]: log: Closing connection to 142.103.175.48 Jan 3 09:00:30 vnfe2 sshd[13849]: log: Connection from 142.103.175.48 port 1020 Jan 3 09:00:31 vnfe2 sshd[13849]: log: RSA authentication for matt accepted. Jan 3 09:00:31 vnfe2 sshd[13851]: log: executing remote command as user matt Jan 3 09:00:31 vnfe2 sshd[13849]: log: Closing connection to 142.103.175.48 Jan 3 09:01:00 vnfe2 anacron[13867]: Updated timestamp for job `cron.hourly' to 2000-01-03 Jan 3 09:15:30 vnfe2 sshd[14167]: log: Connection from 142.103.175.48 port 1020 Jan 3 09:15:30 vnfe2 sshd[14167]: log: RSA authentication for matt accepted. 
Jan 3 09:15:30 vnfe2 sshd[14169]: log: executing remote command as user matt Jan 3 09:15:31 vnfe2 sshd[14167]: log: Closing connection to 142.103.175.48 Jan 4 09:48:46 vnfe2 syslogd 1.3-3: restart. Jan 4 09:48:46 vnfe2 syslog: syslogd startup succeeded Jan 4 09:48:46 vnfe2 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jan 4 09:48:46 vnfe2 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Jan 4 09:48:46 vnfe2 syslog: klogd startup succeeded ############################################################ CRASH_34 ############################################################ Wed Jan 19 12:19:46 PST 2000 (1) Luis hung up vn12 with Cactus Jan 19 11:32:43 vn12 sshd[2007]: log: executing remote command as user matt Jan 19 11:32:45 vn12 sshd[2005]: log: Closing connection to 142.103.237.225 Jan 19 11:35:48 vn12 sshd[1020]: fatal: Connection closed by remote host. Jan 19 11:36:17 vn12 sshd[2025]: log: Connection from 142.103.237.225 port 1016 Jan 19 11:36:17 vn12 sshd[2025]: log: Rhosts with RSA host authentication accepted for luisl, luisl on vnfe1.physics.ubc.ca. Jan 19 11:36:17 vn12 sshd[2027]: log: executing remote command as user luisl Jan 19 11:36:26 vn12 sshd[603]: log: Generating new 768 bit RSA key. Jan 19 11:36:26 vn12 sshd[603]: log: RSA key generation complete. Jan 19 12:14:53 vn12 syslogd 1.3-3: restart. Jan 19 12:14:53 vn12 syslog: syslogd startup succeeded Jan 19 12:14:53 vn12 kernel: klogd 1.3-3, log source = /proc/kmsg started. ############################################################ CRASH_35 ############################################################ Sun Jan 30 05:26:20 PST 2000 (1) Matt hung vn27 running 'wave2d'; should re-instate watchdog? BUT ... ssh still works (Thanks Jason), and we can remotely reboot ############################################################ CRASH_36 ############################################################ Sun Jan 30 20:17:54 PST 2000 # Looks like vn8 will need "on-site" attention. 
Definitely # have problem with "defunct" processes (some may be due # to premature/improper MPI job termination) Jan 30 20:10:30 vn8 sshd[16470]: log: Connection from 142.103.237.8 port 1022 Jan 30 20:10:30 vn8 sshd[16470]: log: RSA authentication for root accepted. Jan 30 20:10:30 vn8 sshd[16470]: log: ROOT LOGIN as 'root' from vn8.physics.ubc.ca Jan 30 20:16:18 vn8 exiting on signal 15 Jan 31 08:52:53 vn8 syslogd 1.3-3: restart. Jan 31 08:52:53 vn8 syslog: syslogd startup succeeded Jan 31 08:52:53 vn8 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jan 31 08:52:53 vn8 kernel: Inspecting /boot/System.map cd /d/vnfe1/home/matt/debug/rnpl/wave2d sola free t 12 # 16-way parallelism with vn8 134.700u 9.240s 2:27.01 97.9% 0+0k 0+0io 17569pf+0w ############################################################ CRASH_37 ############################################################ (1) Hung vn36 up running wave2d demo with output Jan 31 13:46:24 vn36 sshd[19148]: log: executing remote command as user matt Jan 31 13:46:25 vn36 sshd[19146]: log: Closing connection to 142.103.237.225 Jan 31 13:47:44 vn36 sshd[19071]: fatal: Connection closed by remote host. Jan 31 16:05:01 vn36 syslogd 1.3-3: restart. Jan 31 16:05:01 vn36 syslog: syslogd startup succeeded Jan 31 16:05:01 vn36 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jan 31 16:05:01 vn36 kernel: Inspecting /boot/System.map Jan 31 16:05:01 vn36 syslog: klogd startup succeeded ############################################################ CRASH_38 ############################################################ Thu Feb 3 14:31:22 PST 2000 (1) vn52 had weird out of memory error etc. reported by Daub ... rebooted Feb 3 14:27:09 vn52 sshd[31363]: log: Connection from 142.103.234.31 port 1015 Feb 3 14:27:10 vn52 sshd[31363]: log: Rhosts with RSA host authentication accepted for matt, matt on laplace.physics.ubc.ca. 
Feb 3 14:27:46 vn52 sshd[31383]: log: Connection from 142.103.237.225 port 999 Feb 3 14:27:46 vn52 sshd[31383]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Feb 3 14:27:46 vn52 sshd[31385]: log: executing remote command as user matt Feb 3 14:27:46 vn52 kernel: Unable to load interpreter Feb 3 14:27:46 vn52 sshd[31383]: log: Closing connection to 142.103.237.225 Feb 3 14:28:06 vn52 sshd[31417]: log: Connection from 142.103.237.52 port 1020 Feb 3 14:28:06 vn52 sshd[31417]: fatal: Connection closed by remote host. Feb 3 14:28:08 vn52 sshd[31363]: log: Closing connection to 142.103.234.31 Feb 3 14:28:15 vn52 gpm[529]: Error in protocol Feb 3 14:28:20 vn52 sshd[31335]: log: Closing connection to 142.103.237.52 Feb 3 14:28:21 vn52 innd: innd shutdown succeeded Feb 3 14:28:21 vn52 innd: actived -9 succeeded Feb 3 14:28:22 vn52 xfs: xfs shutdown succeeded ############################################################ CRASH_39 ############################################################ Thu Feb 10 09:39:23 PST 2000 Lothar reported MPI problems with vn10; although I could run some MPI applications from/using vn10, did get unexpected p4_error: msgs at times, rebooted vn10 From lothar@triumf.ca Thu Feb 10 09:27:02 2000 Looks like something is wrong with vn10, I couldn't start mpirun from it, even though otherwise it seemed to be fine... I switched to another node, and everything works fine.. Lothar # Again, nothing obvious in logs Feb 10 09:37:27 vn10 sshd[1525]: log: executing remote command as user matt Feb 10 09:37:29 vn10 sshd[1523]: log: Closing connection to 142.103.237.225 Feb 10 09:38:42 vn10 sshd[624]: log: Closing connection to 142.103.234.31 Feb 10 09:38:46 vn10 sshd[1551]: log: Connection from 142.103.175.105 port 1016 Feb 10 09:38:46 vn10 sshd[1551]: log: RSA authentication for root accepted. 
Feb 10 09:38:46 vn10 sshd[1551]: log: ROOT LOGIN as 'root' from dsl105.net.ubc.ca Feb 10 09:38:56 vn10 gpm[529]: Error in protocol Feb 10 09:38:59 vn10 xfs: xfs shutdown succeeded Feb 10 09:38:59 vn10 gpm: Shutting down gpm mouse services: Feb 10 09:38:59 vn10 gpm: gpm Feb 10 09:38:59 vn10 gpm: Feb 10 09:38:59 vn10 rc: Stopping gpm succeeded Feb 10 09:38:59 vn10 nfs: Shutting down NFS services: succeeded Feb 10 09:39:00 vn10 mountd[407]: Caught signal 15, un-registering and exiting. Feb 10 09:39:03 vn10 nfs: rpc.mountd shutdown succeeded Feb 10 09:39:09 vn10 kernel: nfsd: terminating on signal 9 Feb 10 09:39:09 vn10 last message repeated 7 times Feb 10 09:39:09 vn10 kernel: nfsd: last server exiting Feb 10 09:39:10 vn10 nfs: nfsd shutdown succeeded Feb 10 09:39:12 vn10 nfs: rpc.rquotad shutdown succeeded Feb 10 09:39:13 vn10 rpc.statd[275]: Caught signal 15, un-registering and exiting. Feb 10 09:39:14 vn10 nfs: rpc.statd shutdown succeeded Feb 10 09:39:16 vn10 rwhod: rwhod shutdown succeeded Feb 10 09:39:16 vn10 postfix: Shutting down postfix: Feb 10 09:39:16 vn10 postfix: postfix Feb 10 09:39:16 vn10 rc: Stopping postfix succeeded Feb 10 09:39:17 vn10 sendmail: sendmail shutdown failed Feb 10 09:39:19 vn10 inet: inetd shutdown succeeded Feb 10 09:39:20 vn10 atd: atd shutdown succeeded Feb 10 09:39:22 vn10 crond: crond shutdown succeeded Feb 10 09:39:24 vn10 lpd: lpd shutdown succeeded Feb 10 09:39:24 vn10 kernel: Kernel logging (proc) stopped. Feb 10 09:39:24 vn10 kernel: Kernel log daemon terminating. Feb 10 09:39:27 vn10 syslog: klogd shutdown succeeded Feb 10 09:39:27 vn10 exiting on signal 15 Feb 10 10:41:02 vn10 syslogd 1.3-3: restart. Feb 10 10:41:02 vn10 syslog: syslogd startup succeeded Feb 10 10:41:02 vn10 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 10 10:41:02 vn10 kernel: Inspecting /boot/System.map Feb 10 10:41:02 vn10 syslog: klogd startup succeeded Feb 10 10:41:02 vn10 kernel: Loaded 6360 symbols from /boot/System.map. 
Feb 10 10:41:02 vn10 kernel: Symbols match kernel version 2.2.13.
Feb 10 10:41:02 vn10 kernel: Loaded 123 symbols from 6 modules.
Feb 10 10:41:02 vn10 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999
Feb 10 10:41:02 vn10 kernel: Intel MultiProcessor Specification v1.1
Feb 10 10:41:02 vn10 kernel: Virtual Wire compatibility mode.
# MPI test
cd /d/vnfe1/home/matt/debug/rnpl/wave2d
sola t 10
# Runs on 8-processors OK
############################################################
CRASH_40
############################################################
Fri Feb 11 19:19:05 PST 2000
After complaints from Frans and Inaki about memory problems on vn37, vn57 and vn58, reboot vn57
Feb 10 16:09:16 vn57 sshd[17823]: log: executing remote command as user matt
Feb 10 16:09:19 vn57 sshd[17821]: log: Closing connection to 142.103.237.225
Feb 10 16:17:11 vn57 sshd[17847]: log: Connection from 142.103.237.225 port 1002
Feb 10 16:17:11 vn57 sshd[17847]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Feb 10 16:17:12 vn57 sshd[17849]: log: executing remote command as user matt
Feb 10 16:17:13 vn57 kernel: Unable to load interpreter
Feb 10 16:17:13 vn57 last message repeated 2 times
Feb 10 16:17:14 vn57 sshd[17847]: log: Closing connection to 142.103.237.225
Feb 10 16:25:00 vn57 sshd[17868]: log: Connection from 142.103.237.225 port 986
Feb 10 16:25:00 vn57 sshd[17868]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
. . .
############################################################
CRASH_41, CRASH_42, CRASH_43, CRASH_44
############################################################
Look in Rtop from 1200-1400
Hi Matt,
Just in case you didn't realised vn32, and vn35, vn56 went down. At least that's what ruptime says. I wonder if it was my fault!
I was logging in, trying to run my program on them, and they stop working, Just frozen (I was coping a file 1 lines long)!!! vn39, vn27 and vn11 seem to have some kind of problem also because I cannot run my progrma neither. I think there is no enough free memory on them. vn.physics.ubc.ca Compute Node Status: Sat Feb 12 13:30:00 PST 2000 The following nodes are down: 1: vn32 down 0:58 2: vn35 down 1:00 3: vn56 down 1:23 All pingable but not telnetable. Hard reboot of vn32, vn35, vn56 vn39 Idle, soft reboot can't hurt vn27 vn11 #----------------------------------------------------------------------- vn32 date is off vnSetdate Feb 12 12:09:56 vn32 sshd[19590]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Feb 12 12:09:57 vn32 sshd[19592]: log: executing remote command as user matt Feb 12 12:10:00 vn32 sshd[19590]: log: Closing connection to 142.103.237.225 Feb 12 12:29:48 vn32 PAM_pwdb[19619]: (login) session opened for user suqin by (uid=0) Feb 12 12:29:48 vn32 -- suqin[19619]: LOGIN ON 2 BY suqin FROM vn30 Feb 12 12:34:53 vn32 sshd[19651]: log: Connection from 142.103.234.22 port 1020 Feb 12 12:34:57 vn32 PAM_pwdb[19653]: (login) session opened for user suqin by (uid=0) Feb 12 12:34:57 vn32 sshd[19651]: log: Password authentication for inaki accepted. Feb 12 12:34:57 vn32 -- suqin[19653]: LOGIN ON 1 BY suqin FROM vn30 Feb 13 06:16:58 vn32 syslogd 1.3-3: restart. Feb 13 06:16:58 vn32 syslog: syslogd startup succeeded Feb 13 06:16:58 vn32 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 13 06:16:58 vn32 kernel: Inspecting /boot/System.map Feb 13 06:16:58 vn32 syslog: klogd startup succeeded Feb 13 06:16:59 vn32 kernel: Loaded 6360 symbols from /boot/System.map. Feb 13 06:16:59 vn32 kernel: Symbols match kernel version 2.2.13. Feb 13 06:16:59 vn32 kernel: Loaded 123 symbols from 6 modules. #----------------------------------------------------------------------- vn35 About the same as above .. 
nothing obvious, memory problem suspected #----------------------------------------------------------------------- vn56 Feb 12 11:54:34 vn56 sshd[17670]: log: Closing connection to 142.103.237.225 Feb 12 12:01:00 vn56 anacron[17699]: Updated timestamp for job `cron.hourly' to 2000-02-12 Feb 12 12:02:33 vn56 PAM_pwdb[17704]: (login) session opened for user suqin by (uid=0) Feb 12 12:02:33 vn56 -- suqin[17704]: LOGIN ON 0 BY suqin FROM vn50 Feb 12 12:02:34 vn56 PAM_pwdb[17704]: (login) session closed for user suqin Feb 12 12:02:50 vn56 sshd[17715]: log: Connection from 142.103.237.225 port 994 Feb 12 12:02:50 vn56 sshd[17715]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca. Feb 12 12:02:50 vn56 sshd[17717]: log: executing remote command as user matt Feb 12 12:02:51 vn56 kernel: Unable to load interpreter Feb 12 12:02:51 vn56 sshd[17715]: log: Closing connection to 142.103.237.225 Feb 12 12:02:55 vn56 PAM_pwdb[17714]: (login) session opened for user suqin by (uid=0) Feb 12 12:02:55 vn56 -- suqin[17714]: LOGIN ON 0 BY suqin FROM vn50 Feb 12 12:02:55 vn56 PAM_pwdb[17714]: (login) session closed for user suqin Feb 12 12:03:10 vn56 sshd[653]: log: Generating new 768 bit RSA key. Feb 12 12:03:10 vn56 sshd[653]: log: RSA key generation complete. Feb 12 12:07:49 vn56 sshd[17732]: log: Connection from 142.103.237.225 port 1007 Feb 12 12:07:51 vn56 sshd[17732]: log: RSA authentication for inaki accepted. Jan 12 14:18:21 vn56 syslogd 1.3-3: restart. Jan 12 14:18:21 vn56 syslog: syslogd startup succeeded Jan 12 14:18:21 vn56 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jan 12 14:18:21 vn56 kernel: Inspecting /boot/System.map Jan 12 14:18:21 vn56 syslog: klogd startup succeeded Jan 12 14:18:22 vn56 kernel: Loaded 6360 symbols from /boot/System.map. Jan 12 14:18:22 vn56 kernel: Symbols match kernel version 2.2.13. Jan 12 14:18:22 vn56 kernel: Loaded 123 symbols from 6 modules. 
Jan 12 14:18:22 vn56 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999
Jan 12 14:18:22 vn56 kernel: Intel MultiProcessor Specification v1.1
Jan 12 14:18:22 vn56 kernel: Virtual Wire compatibility mode.
############################################################
CRASH_45
############################################################
Sat Feb 12 17:45:45 PST 2000
(1) vnfe1 hung up, had to be hard rebooted, looks like my indiscriminate killing of 'fpi' might have been the problem!
vnfe1 kernel: find_fh_dentry: 08:01, 991277/991298 not found -- need full search!
# Maybe should think about going to new kernels (?)
Feb 12 17:12:13 vnfe1 kernel: lookup_by_inode: ino 991298 not found in demoMPIPGI
Feb 12 17:12:13 vnfe1 kernel: find_fh_dentry: 08:01, 991277/991298 not found -- need full search!
Feb 12 17:12:13 vnfe1 kernel: lookup_by_inode: ino 991298 not found in demoMPIPGI
Feb 12 17:12:13 vnfe1 kernel: find_fh_dentry: 08:01, 991277/991298 not found -- need full search!
Feb 12 17:12:13 vnfe1 kernel: lookup_by_inode: ino 991298 not found in demoMPIPGI
Feb 12 17:12:13 vnfe1 kernel: find_fh_dentry: 08:01, 991277/991298 not found -- need full search!
Feb 12 17:12:13 vnfe1 kernel: lookup_by_inode: ino 991298 not found in demoMPIPGI
Feb 12 17:42:43 vnfe1 syslogd 1.3-3: restart.
Feb 12 17:42:43 vnfe1 syslog: syslogd startup succeeded
Feb 12 17:42:43 vnfe1 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Feb 12 17:42:43 vnfe1 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp
Feb 12 17:42:43 vnfe1 syslog: klogd startup succeeded
Feb 12 17:42:43 vnfe1 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp.
Feb 12 17:42:43 vnfe1 kernel: Symbols match kernel version 2.2.13.
Feb 12 17:42:43 vnfe1 kernel: Loaded 137 symbols from 7 modules.
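# Aside (not part of the original notes): a minimal Python sketch for measuring
# how long a node was silent before each "syslogd ... restart" line in excerpts
# like the vnfe1 one above. The function names (parse_stamp, restart_gaps) and
# the year argument are assumptions for illustration only; syslog lines carry
# no year, so one has to be supplied.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line (year is assumed)."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def restart_gaps(lines, year=2000):
    """Return (last_stamp, restart_stamp, gap) for every syslogd restart line."""
    gaps = []
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamp = parse_stamp(line, year)
        # A restart line looks like: "... syslogd 1.3-3: restart."
        if "syslogd" in line and "restart" in line and previous is not None:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


# Lines copied from the vnfe1 excerpt; a real run would read the messages file.
sample = [
    "Feb 12 17:12:13 vnfe1 kernel: lookup_by_inode: ino 991298 not found in demoMPIPGI",
    "Feb 12 17:42:43 vnfe1 syslogd 1.3-3: restart.",
    "Feb 12 17:42:43 vnfe1 syslog: syslogd startup succeeded",
]

for last, restart, gap in restart_gaps(sample):
    print(f"silent from {last:%b %d %H:%M:%S} to {restart:%b %d %H:%M:%S} ({gap})")
```

# The gap is only an upper bound on the hang: it runs from the last logged
# message to the restart, and says nothing about the cause.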
############################################################ CRASH_46 ############################################################ Mon Feb 14 09:33:32 PST 2000 (1) Lothar reports MPI problem with vn21 Hi, the inclusion of node 21 caused a segmentation fault in my program, while otherwise it's running fine.. Lothar # As root@vn21 reboot vnSetdate # LOGS SHOW NOTHING Feb 14 09:25:40 vn21 sshd[15029]: log: Connection from 142.103.237.225 port 1002 Feb 14 09:25:40 vn21 sshd[15029]: log: RSA authentication for idle accepted. Feb 14 09:25:40 vn21 sshd[15031]: log: executing remote command as user idle Feb 14 09:25:42 vn21 sshd[15029]: log: Closing connection to 142.103.237.225 Feb 14 09:27:13 vn21 pam_rhosts_auth[15050]: allowed to lothar@vn16.physics.ubc.ca as lothar Feb 14 09:27:13 vn21 PAM_pwdb[15050]: (rsh) session opened for user lothar by (uid=0) Feb 14 09:27:13 vn21 PAM_pwdb[15050]: (rsh) session closed for user lothar Feb 14 09:30:53 vn21 sshd[653]: log: Generating new 768 bit RSA key. Feb 14 09:30:54 vn21 sshd[653]: log: RSA key generation complete. Feb 14 09:32:34 vn21 sshd[15068]: log: Connection from 142.103.234.31 port 1018 Feb 14 09:32:34 vn21 sshd[15068]: log: Rhosts with RSA host authentication accepted for root, matt on laplace.physics.ubc.ca. Feb 14 09:32:34 vn21 sshd[15068]: log: ROOT LOGIN as 'root' from laplace.physics.ubc.ca Feb 14 09:32:55 vn21 sshd[15068]: log: Closing connection to 142.103.234.31 Feb 14 09:32:57 vn21 gpm[529]: Error in protocol . . . Feb 14 09:33:13 vn21 kernel: Kernel log daemon terminating. Feb 14 09:33:15 vn21 syslog: klogd shutdown succeeded Feb 14 09:33:15 vn21 exiting on signal 15 Feb 15 01:34:41 vn21 syslogd 1.3-3: restart. Feb 15 01:34:41 vn21 syslog: syslogd startup succeeded Feb 15 01:34:42 vn21 kernel: klogd 1.3-3, log source = /proc/kmsg started. 
Feb 15 01:34:42 vn21 kernel: Inspecting /boot/System.map Feb 15 01:34:42 vn21 syslog: klogd startup succeeded Feb 15 01:34:42 vn21 kernel: Loaded 6360 symbols from /boot/System.map. # MPI test cd /d/vnfe1/home/matt/debug/rnpl/wave2d sola t 10 # Runs on 8-processors OK ############################################################ CRASH_47 ############################################################ Mon Feb 14 12:33:01 PST 2000 (1) Lothar reports problems with vn55, pingable but cannot telnet, sshd Note vn55 previously rebooted itself Dec 26 when watchdog was still operating Mon Feb 14 13:11:10 PST 2000 Hard rebooting vn55 # Would be nice to have error messages in the log for a change # but we get what we pay for # As root@vn55 date (looked OK) vnSetdate # kdm problem ?? Feb 14 12:13:49 vn55 PAM_pwdb[7750]: (rsh) session closed for user lothar Feb 14 12:13:49 vn55 PAM_pwdb[8089]: (rsh) session closed for user lothar Feb 14 12:13:50 vn55 PAM_pwdb[9008]: (rsh) session closed for user lothar Feb 14 12:14:09 vn55 pam_rhosts_auth[13940]: allowed to lothar@vn16.physics.ubc.ca as lothar Feb 14 12:16:37 vn55 PAM_pwdb[10970]: (rsh) session closed for user lothar Feb 14 12:16:37 vn55 PAM_pwdb[10984]: (rsh) session closed for user lothar Feb 14 12:16:37 vn55 PAM_pwdb[8994]: (rsh) session closed for user lothar Feb 14 12:16:37 vn55 PAM_pwdb[8682]: (rsh) session closed for user lothar Feb 14 12:16:37 vn55 PAM_pwdb[9319]: (rsh) session closed for user lothar Feb 14 12:16:37 vn55 kdm[662]: Server for display :0 terminated unexpectedly: 2304 Feb 14 12:17:09 vn55 kdm[662]: Server for display :0 terminated unexpectedly: 1 Feb 14 12:17:11 vn55 kdm[662]: server unexpectedly died Feb 14 13:15:19 vn55 syslogd 1.3-3: restart. Feb 14 13:15:19 vn55 syslog: syslogd startup succeeded Feb 14 13:15:19 vn55 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 14 13:15:19 vn55 kernel: Inspecting /boot/System.map Feb 14 13:15:19 vn55 syslog: klogd startup succeeded # MPI test ?? 
ssh matt@vn55 cd /d/vnfe1/home/matt/debug/rnpl/wave2d sola t 10 # Runs on 8-processors OK ############################################################ CRASH_48 ############################################################ Tue Feb 15 10:01:36 PST 2000 (1) From Lothar vn30 produced the dreaded p4_ memory errors again.. # As root@vn30 ps -elf > /tmp/PS # Note ... there are a LOT of hung etc. processes! F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 100 S root 1 0 0 60 0 - 287 add_ti Jan01 ? 00:00:03 init [5] 040 S root 2 1 0 60 0 - 0 down_t Jan01 ? 00:00:02 [kflushd] 040 S root 3 1 0 60 0 - 0 add_ti Jan01 ? 00:00:50 [kupdate] 040 S root 4 1 0 60 0 - 0 get_ca Jan01 ? 00:00:00 [kpiod] 040 S root 5 1 0 60 0 - 0 add_ti Jan01 ? 00:00:26 [kswapd] 140 S bin 233 1 0 60 0 - 289 add_ti Jan01 ? 00:00:00 portmap 040 S root 264 1 0 60 0 - 0 down_t Jan01 ? 00:01:42 [rpciod] 040 S root 265 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [lockd] 140 S root 275 1 0 60 0 - 296 add_ti Jan01 ? 00:00:00 [rpc.statd] 140 S root 304 1 0 60 0 - 337 add_ti Jan01 ? 00:00:39 syslogd 140 S root 314 1 0 60 0 - 356 mm_all Jan01 ? 00:00:00 klogd 040 S daemon 329 1 0 60 0 - 293 add_ti Jan01 ? 00:00:00 /usr/sbin/atd 040 S root 344 1 0 60 0 - 342 add_ti Jan01 ? 00:00:05 crond 140 S root 359 1 0 60 0 - 327 add_ti Jan01 ? 00:00:00 inetd 140 S root 374 1 0 60 0 - 340 add_ti Jan01 ? 00:00:00 [lpd] 040 S root 397 1 0 60 0 - 284 add_ti Jan01 ? 00:00:00 [rpc.rquotad] 040 S root 407 1 0 60 0 - 295 add_ti Jan01 ? 00:00:00 [rpc.mountd] 040 S root 419 1 0 60 0 - 0 add_ti Jan01 ? 00:00:17 [nfsd] 040 S root 420 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 421 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 422 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 423 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 424 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 425 1 0 60 0 - 0 add_ti Jan01 ? 00:00:00 [nfsd] 040 S root 426 1 0 60 0 - 0 add_ti Jan01 ? 
00:00:00 [nfsd] 140 S root 441 1 0 60 0 - 325 sk_att Jan01 ? 00:07:11 rwhod 100 S root 506 1 0 60 0 - 424 add_ti Jan01 ? 00:00:18 /usr/lib/postfix/master 100 S postfix 511 506 0 60 0 - 460 add_ti Jan01 ? 00:00:15 qmgr -l -t fifo -u 140 S root 529 1 0 60 0 - 298 add_ti Jan01 ? 00:04:04 gpm -t ps/2 040 S postgres 538 1 0 60 0 - 1151 add_ti Jan01 ? 00:00:00 [postmaster] 000 S root 545 1 0 60 0 - 432 do_exi Jan01 ? 00:00:00 [safe_mysqld] 100 S root 552 545 0 60 0 - 850 add_ti Jan01 ? 00:00:00 [mysqld] 040 S root 564 552 0 60 0 - 850 add_ti Jan01 ? 00:00:50 [mysqld] 040 S root 565 564 0 60 0 - 850 sigsus Jan01 ? 00:00:00 [mysqld] 040 S xfs 566 1 0 60 0 - 864 add_ti Jan01 ? 00:00:00 xfs -port -1 140 S root 600 1 0 60 0 - 399 add_ti Jan01 ? 00:01:36 /usr/local/bin/ntpd 140 S root 601 1 0 60 0 - 544 add_ti Jan01 ? 00:05:09 /usr/local/sbin/sshd 100 S root 606 1 0 60 0 - 281 add_ti Jan01 tty1 00:00:00 [mingetty] 100 S root 607 1 0 60 0 - 281 add_ti Jan01 tty2 00:00:00 [mingetty] 100 S root 608 1 0 60 0 - 281 add_ti Jan01 tty3 00:00:00 [mingetty] 100 S root 609 1 0 60 0 - 281 add_ti Jan01 tty4 00:00:00 [mingetty] 100 S root 610 1 0 60 0 - 281 add_ti Jan01 tty5 00:00:00 [mingetty] 100 S root 611 1 0 60 0 - 281 add_ti Jan01 tty6 00:00:00 [mingetty] 100 S root 612 1 0 60 0 - 1502 add_ti Jan01 ? 00:00:00 [prefdm] 100 S root 616 612 0 60 0 - 2844 add_ti Jan01 ? 00:08:56 /etc/X11/X -auth /usr/X11R6/lib/X11/xdm/authdir/A:0-I6PDgY 040 S root 617 612 0 60 0 - 1550 add_ti Jan01 ? 00:12:30 -:0 100 S root 4504 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 4505 4504 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 4516 4505 0 60 0 - 6002 add_ti Jan20 ? 00:42:20 [rphase] 040 S lothar 4517 4516 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 4518 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 4519 4518 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 4530 4519 0 60 0 - 6002 add_ti Jan20 ? 
00:42:30 [rphase] 040 S lothar 4531 4530 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 5177 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 5178 5177 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 5189 5178 0 60 0 - 6002 add_ti Jan20 ? 00:09:48 [rphase] 040 S lothar 5190 5189 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 5191 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 5192 5191 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 5203 5192 0 60 0 - 6002 add_ti Jan20 ? 00:09:52 [rphase] 040 S lothar 5204 5203 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 5429 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 5430 5429 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 5441 5430 1 60 0 - 6002 add_ti Jan20 ? 07:33:03 [rphase] 040 S lothar 5442 5441 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 5443 359 0 60 0 - 548 add_ti Jan20 ? 00:00:00 [in.rshd] 100 S lothar 5444 5443 0 60 0 - 483 sigsus Jan20 ? 00:00:00 [tcsh] 000 S lothar 5455 5444 1 60 0 - 6002 add_ti Jan20 ? 07:34:19 [rphase] 040 S lothar 5456 5455 0 60 0 - 5721 add_ti Jan20 ? 00:00:00 [rphase] 100 S root 9752 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 9753 9752 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 9764 9753 0 60 0 - 6003 add_ti Jan21 ? 01:53:39 [rphase] 040 S lothar 9765 9764 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 9766 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 9767 9766 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 9778 9767 0 60 0 - 6003 add_ti Jan21 ? 01:53:59 [rphase] 040 S lothar 9779 9778 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 10763 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 10764 10763 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 10775 10764 0 60 0 - 6002 add_ti Jan21 ? 01:24:53 [rphase] 040 S lothar 10776 10775 0 60 0 - 5721 add_ti Jan21 ? 
00:00:00 [rphase] 100 S root 10777 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 10778 10777 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 10789 10778 0 60 0 - 6002 add_ti Jan21 ? 01:25:09 [rphase] 040 S lothar 10790 10789 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 11389 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 11390 11389 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 11401 11390 0 60 0 - 6002 add_ti Jan21 ? 00:04:48 [rphase] 040 S lothar 11402 11401 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 11403 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 11404 11403 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 11415 11404 0 60 0 - 6002 add_ti Jan21 ? 00:04:50 [rphase] 040 S lothar 11416 11415 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 11451 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 11452 11451 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 11463 11452 0 60 0 - 6002 add_ti Jan21 ? 02:17:54 [rphase] 040 S lothar 11464 11463 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 11465 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 11466 11465 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 11477 11466 0 60 0 - 6002 add_ti Jan21 ? 02:18:21 [rphase] 040 S lothar 11478 11477 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 12476 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 12477 12476 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 12488 12477 0 60 0 - 6002 add_ti Jan21 ? 02:08:03 [rphase] 040 S lothar 12489 12488 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 12490 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 12491 12490 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 12502 12491 0 60 0 - 6002 add_ti Jan21 ? 02:08:31 [rphase] 040 S lothar 12503 12502 0 60 0 - 5721 add_ti Jan21 ? 
00:00:00 [rphase] 100 S root 16818 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 16819 16818 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 16830 16819 0 60 0 - 6002 add_ti Jan21 ? 01:15:23 [rphase] 040 S lothar 16831 16830 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 16832 359 0 60 0 - 548 add_ti Jan21 ? 00:00:00 [in.rshd] 100 S lothar 16833 16832 0 60 0 - 483 sigsus Jan21 ? 00:00:00 [tcsh] 000 S lothar 16844 16833 0 60 0 - 6002 add_ti Jan21 ? 01:15:39 [rphase] 040 S lothar 16845 16844 0 60 0 - 5721 add_ti Jan21 ? 00:00:00 [rphase] 100 S root 20588 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 [in.rshd] 100 S lothar 20589 20588 0 60 0 - 483 sigsus Jan22 ? 00:00:00 [tcsh] 000 S lothar 20600 20589 0 60 0 - 6002 add_ti Jan22 ? 01:12:03 [rphase] 040 S lothar 20601 20600 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 [rphase] 100 S root 20602 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 [in.rshd] 100 S lothar 20603 20602 0 60 0 - 483 sigsus Jan22 ? 00:00:00 [tcsh] 000 S lothar 20614 20603 0 60 0 - 6002 add_ti Jan22 ? 01:12:19 [rphase] 040 S lothar 20615 20614 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 [rphase] 100 S root 21121 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 [in.rshd] 100 S lothar 21122 21121 0 60 0 - 483 sigsus Jan22 ? 00:00:00 [tcsh] 000 S lothar 21133 21122 0 60 0 - 6002 add_ti Jan22 ? 00:04:45 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4427 -p4amslave 040 S lothar 21134 21133 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 [rphase] 100 S root 21135 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 [in.rshd] 100 S lothar 21136 21135 0 60 0 - 483 sigsus Jan22 ? 00:00:00 [tcsh] 000 S lothar 21147 21136 0 60 0 - 6002 add_ti Jan22 ? 00:04:48 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4427 -p4amslave 040 S lothar 21148 21147 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 [rphase] 100 S root 21198 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 21199 21198 0 60 0 - 483 sigsus Jan22 ? 
00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 \-p4amslave 000 S lothar 21210 21199 0 60 0 - 6002 add_ti Jan22 ? 01:11:40 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 -p4amslave 040 S lothar 21211 21210 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 -p4amslave 100 S root 21212 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 21213 21212 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 \-p4amslave 000 S lothar 21224 21213 0 60 0 - 6002 add_ti Jan22 ? 01:11:55 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 -p4amslave 040 S lothar 21225 21224 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4471 -p4amslave 100 S root 21739 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 21740 21739 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 \-p4amslave 000 S lothar 21751 21740 0 60 0 - 6002 add_ti Jan22 ? 00:44:14 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 -p4amslave 040 S lothar 21752 21751 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 -p4amslave 100 S root 21753 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 21754 21753 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 \-p4amslave 000 S lothar 21765 21754 0 60 0 - 6002 add_ti Jan22 ? 00:44:23 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 -p4amslave 040 S lothar 21766 21765 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4515 -p4amslave 100 S root 22077 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 22078 22077 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 \-p4amslave 000 S lothar 22089 22078 0 60 0 - 6002 add_ti Jan22 ? 01:10:47 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 -p4amslave 040 S lothar 22090 22089 0 60 0 - 5721 add_ti Jan22 ? 
00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 -p4amslave 100 S root 22091 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 22092 22091 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 \-p4amslave 000 S lothar 22103 22092 0 60 0 - 6002 add_ti Jan22 ? 01:11:01 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 -p4amslave 040 S lothar 22104 22103 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4559 -p4amslave 100 S root 22643 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 22644 22643 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 \-p4amslave 000 S lothar 22655 22644 0 60 0 - 6002 add_ti Jan22 ? 00:46:29 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 -p4amslave 040 S lothar 22656 22655 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 -p4amslave 100 S root 22657 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 22658 22657 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 \-p4amslave 000 S lothar 22669 22658 0 60 0 - 6002 add_ti Jan22 ? 00:46:42 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 -p4amslave 040 S lothar 22670 22669 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4603 -p4amslave 100 S root 23013 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 23014 23013 0 60 0 - 483 sigsus Jan22 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 \-p4amslave 000 S lothar 23025 23014 0 60 0 - 6002 add_ti Jan22 ? 00:40:25 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 -p4amslave 040 S lothar 23026 23025 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 -p4amslave 100 S root 23027 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S lothar 23028 23027 0 60 0 - 483 sigsus Jan22 ? 
00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 \-p4amslave 000 S lothar 23039 23028 0 60 0 - 6002 add_ti Jan22 ? 00:40:36 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 -p4amslave 040 S lothar 23040 23039 0 60 0 - 5721 add_ti Jan22 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 4647 -p4amslave 100 S root 23479 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S murashov 23480 23479 0 60 0 - 481 sigsus Jan22 ? 00:00:00 csh -c /d/vnfe2/home/murashov/RUN/2x1/../../vasp.4.4/vasp vn11.physics.ubc.ca 36 000 S murashov 23490 23480 0 60 0 - 39147 add_ti Jan22 ? 00:33:54 /d/vnfe2/home/murashov/RUN/2x1/../../vasp.4.4/vasp vn11.physics.ubc.ca 3675 -p4a 040 S murashov 23491 23490 0 60 0 - 2194 add_ti Jan22 ? 00:00:00 /d/vnfe2/home/murashov/RUN/2x1/../../vasp.4.4/vasp vn11.physics.ubc.ca 3675 -p4a 100 S root 23492 359 0 60 0 - 548 add_ti Jan22 ? 00:00:00 in.rshd 100 S murashov 23493 23492 0 60 0 - 481 sigsus Jan22 ? 00:00:00 csh -c /d/vnfe2/home/murashov/RUN/2x1/../../vasp.4.4/vasp vn11.physics.ubc.ca 36 000 S murashov 23503 23493 0 60 0 - 39157 add_ti Jan22 ? 00:32:04 [vasp] 040 S murashov 23504 23503 0 60 0 - 2194 add_ti Jan22 ? 00:00:00 /d/vnfe2/home/murashov/RUN/2x1/../../vasp.4.4/vasp vn11.physics.ubc.ca 3675 -p4a 100 S root 16486 359 0 60 0 - 548 add_ti Jan23 ? 00:00:00 in.rshd 100 S lothar 16487 16486 0 60 0 - 483 sigsus Jan23 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 \-p4amslave 000 S lothar 16498 16487 0 60 0 - 6003 add_ti Jan23 ? 01:09:22 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 -p4amslave 040 S lothar 16499 16498 0 60 0 - 5721 add_ti Jan23 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 -p4amslave 100 S root 16500 359 0 60 0 - 548 add_ti Jan23 ? 00:00:00 in.rshd 100 S lothar 16501 16500 0 60 0 - 483 sigsus Jan23 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 \-p4amslave 000 S lothar 16512 16501 0 60 0 - 6003 add_ti Jan23 ? 
01:09:02 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 -p4amslave 040 S lothar 16513 16512 0 60 0 - 5721 add_ti Jan23 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3232 -p4amslave 100 S root 17009 359 0 60 0 - 548 add_ti Jan23 ? 00:00:00 in.rshd 100 S lothar 17010 17009 0 60 0 - 483 sigsus Jan23 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 \-p4amslave 000 S lothar 17021 17010 0 60 0 - 6003 add_ti Jan23 ? 00:04:33 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 -p4amslave 040 S lothar 17022 17021 0 60 0 - 5721 add_ti Jan23 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 -p4amslave 100 S root 17023 359 0 60 0 - 548 add_ti Jan23 ? 00:00:00 in.rshd 100 S lothar 17024 17023 0 60 0 - 483 sigsus Jan23 ? 00:00:00 tcsh -c /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 \-p4amslave 000 S lothar 17035 17024 0 60 0 - 6003 add_ti Jan23 ? 00:04:32 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 -p4amslave 040 S lothar 17036 17035 0 60 0 - 5721 add_ti Jan23 ? 00:00:00 /d/vnfe1/home/lothar/mpiphase/rphase vn16 3279 -p4amslave 100 S postfix 23241 506 0 60 0 - 410 add_ti Jan24 ? 00:00:00 pickup -l -t fifo 140 S root 23292 601 0 61 0 - 590 add_ti Jan24 ? 
00:00:00 /usr/local/sbin/sshd
100 S root 23294 23292 0 67 0 - 517 sigsus Jan24 pts/0 00:00:00 -tcsh
100 R root 23309 23294 0 76 0 - 645 - Jan24 pts/0 00:00:00 ps -elf

# Rebooting, nothing obvious in logs

# MPI test
ssh matt@vn30
cd /d/vnfe1/home/matt/debug/rnpl/wave2d
sola free t 10
# Runs on 8 processors OK

############################################################
CRASH_49, CRASH_50, CRASH_51
############################################################
Tue Feb 15 17:15:42 PST 2000

(1) Kendal reports MPI problems on vn16, vn23, vn24

# Although all are idle and communicable, all have
# messy process state (as previously)
# Rebooting all three
# vn23, vn24 taking a long time to come back, will have to
# go back on site
# In machine room: vn23, vn24 on, but un-pingable; will connect monitor to vn23
# and hard reboot, ditto vn24

# From root@vn23:/var/log/messages
Feb 15 16:33:49 vn23 PAM_pwdb[25672]: (rsh) session closed for user wkb
Feb 15 16:35:19 vn23 sshd[25674]: log: Connection from 142.103.237.225 port 1022
Feb 15 16:35:19 vn23 sshd[25674]: log: RSA authentication for idle accepted.
Feb 15 16:35:19 vn23 sshd[25676]: log: executing remote command as user idle
Feb 15 16:35:21 vn23 sshd[25674]: log: Closing connection to 142.103.237.225
Feb 15 16:42:09 vn23 pam_rhosts_auth[25697]: allowed to wkb@vn21.physics.ubc.ca as wkb
. . .
Feb 15 16:49:33 vn23 sshd[25763]: log: ROOT LOGIN as 'root' from laplace.physics.ubc.ca Feb 15 16:49:43 vn23 gpm[477]: Error in protocol Feb 15 16:49:50 vn23 innd: innd shutdown succeeded Feb 15 16:49:50 vn23 innd: actived -9 succeeded Feb 15 16:49:51 vn23 xfs: xfs shutdown succeeded Feb 15 16:49:51 vn23 gpm: Shutting down gpm mouse services: Feb 15 16:49:51 vn23 gpm: gpm Feb 15 16:49:51 vn23 gpm: Feb 15 16:49:51 vn23 rc: Stopping gpm succeeded Feb 15 16:49:52 vn23 rwhod: rwhod shutdown succeeded Feb 15 16:49:52 vn23 postfix: Shutting down postfix: Feb 15 16:49:52 vn23 postfix: postfix Feb 15 16:49:52 vn23 rc: Stopping postfix succeeded Feb 15 16:49:52 vn23 sendmail: sendmail shutdown failed Feb 15 16:49:52 vn23 sshd[25763]: log: Closing connection to 142.103.234.31 Feb 15 16:49:53 vn23 inet: inetd shutdown succeeded Feb 15 16:49:54 vn23 atd: atd shutdown succeeded Feb 15 16:49:55 vn23 crond: crond shutdown succeeded Feb 15 16:49:55 vn23 lpd: lpd shutdown succeeded Feb 15 16:49:56 vn23 kernel: Kernel logging (proc) stopped. Feb 15 16:49:56 vn23 kernel: Kernel log daemon terminating. Feb 15 16:49:57 vn23 syslog: klogd shutdown succeeded Feb 15 16:49:58 vn23 exiting on signal 15 Feb 15 17:11:21 vn23 syslogd 1.3-3: restart. Feb 15 17:11:21 vn23 syslog: syslogd startup succeeded Feb 15 17:11:21 vn23 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 15 17:11:21 vn23 kernel: Inspecting /boot/System.map Feb 15 17:11:21 vn23 syslog: klogd startup succeeded Feb 15 17:11:21 vn23 kernel: Loaded 6360 symbols from /boot/System.map. Feb 15 17:11:21 vn23 kernel: Symbols match kernel version 2.2.13. Feb 15 17:11:21 vn23 kernel: Loaded 123 symbols from 6 modules. 
Feb 15 17:11:21 vn23 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 # From root@vn24:/var/log/messages Feb 15 16:50:23 vn24 lpd: lpd shutdown succeeded Feb 15 16:50:24 vn24 kernel: Kernel logging (proc) stopped. Feb 15 16:50:24 vn24 kernel: Kernel log daemon terminating. Feb 15 16:50:25 vn24 syslog: klogd shutdown succeeded Feb 15 16:50:26 vn24 exiting on signal 15 Feb 16 09:12:51 vn24 syslogd 1.3-3: restart. Feb 16 09:12:51 vn24 syslog: syslogd startup succeeded Feb 16 09:12:51 vn24 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 16 09:12:51 vn24 kernel: Inspecting /boot/System.map Feb 16 09:12:51 vn24 syslog: klogd startup succeeded Feb 16 09:12:51 vn24 kernel: Loaded 6360 symbols from /boot/System.map. Feb 16 09:12:51 vn24 kernel: Symbols match kernel version 2.2.13. Feb 16 09:12:51 vn24 kernel: Loaded 123 symbols from 6 modules. Feb 16 09:12:51 vn24 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 ############################################################ CRASH_52 ############################################################ Wed Feb 23 20:05:31 PST 2000 (1) vn23 pingable, but otherwise incommunicado, Dave's single processor Cactus job may be the culprit?? vn23 does not appear to be responding. it might be my fault, as i was running a cactus job on it (single processor, no mpi) earlier. i killed the job, did a few other things, and then the shell didn't respond. Feb 23 16:38:31 vn23 sshd[20941]: log: Rhosts with RSA host authentication accepted for dave, dave on vnfe1.physics.ubc.ca. Feb 23 16:42:48 vn23 sshd[20961]: log: Connection from 142.103.237.225 port 1020 . . . 
Feb 23 18:00:02 vn23 sshd[21216]: log: Closing connection to 142.103.237.225 Feb 23 18:01:00 vn23 anacron[21246]: Updated timestamp for job `cron.hourly' to 2000-02-23 Feb 23 18:07:44 vn23 sshd[21254]: log: Connection from 142.103.237.225 port 1017 Feb 23 18:07:44 vn23 sshd[21254]: log: RSA authentication for idle accepted. Feb 23 18:07:44 vn23 sshd[21256]: log: executing remote command as user idle Feb 23 18:07:45 vn23 sshd[21254]: log: Closing connection to 142.103.237.225 Feb 23 20:08:35 vn23 syslogd 1.3-3: restart. Feb 23 20:08:35 vn23 syslog: syslogd startup succeeded Feb 23 20:08:35 vn23 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 23 20:08:35 vn23 kernel: Inspecting /boot/System.map Feb 23 20:08:35 vn23 syslog: klogd startup succeeded Feb 23 20:08:35 vn23 kernel: Loaded 6360 symbols from /boot/System.map. Feb 23 20:08:35 vn23 kernel: Symbols match kernel version 2.2.13. Feb 23 20:08:35 vn23 kernel: Loaded 123 symbols from 6 modules. Feb 23 20:08:35 vn23 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 Feb 23 20:08:35 vn23 kernel: Intel MultiProcessor Specification v1.1 Feb 23 20:08:35 vn23 kernel: Virtual Wire compatibility mode. Feb 23 20:08:35 vn23 kernel: OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000 Feb 23 20:08:35 vn23 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Feb 23 20:08:35 vn23 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Feb 23 20:08:35 vn23 kernel: I/O APIC #2 Version 17 at 0xFEC00000. Feb 23 20:08:35 vn23 kernel: Processors: 2 Feb 23 20:08:35 vn23 kernel: mapped APIC to ffffe000 (fee00000) Feb 23 20:08:35 vn23 kernel: mapped IOAPIC to ffffd000 (fec00000) Feb 23 20:08:35 vn23 kernel: Detected 451030267 Hz processor. Feb 23 20:08:35 vn23 kernel: Console: colour VGA+ 80x25 Feb 23 20:08:35 vn23 kernel: Calibrating delay loop... 
448.92 BogoMIPS cd /d/vnfe1/home/matt/debug/rnpl/wave2d sola free t 10 # Works with 32, 8 processors ############################################################ CRASH_53 ############################################################ Thu Feb 24 10:21:59 PST 2000 [root@vnfe1]# down rar0502 down 103+16:25 vnfe2 down 1:33 # Pingable, but incommunicado # Hard reboot # Nothing obvious in log Feb 24 06:01:00 vnfe2 anacron[4973]: Updated timestamp for job `cron.hourly' to 2000-02-24 Feb 24 06:11:52 vnfe2 -- MARK -- Feb 24 06:31:52 vnfe2 -- MARK -- Feb 24 06:51:52 vnfe2 -- MARK -- Feb 24 07:01:00 vnfe2 anacron[4993]: Updated timestamp for job `cron.hourly' to 2000-02-24 Feb 24 07:11:52 vnfe2 -- MARK -- Feb 24 07:31:53 vnfe2 -- MARK -- Feb 24 07:51:24 vnfe2 sshd[5007]: log: Connection from 128.83.114.226 port 1017 Feb 24 07:51:25 vnfe2 sshd[5007]: log: RSA authentication for dave accepted. Feb 24 07:51:52 vnfe2 sshd[5042]: log: Connection from 128.83.114.226 port 1016 Feb 24 07:51:53 vnfe2 sshd[5042]: log: RSA authentication for dave accepted. Feb 24 07:56:14 vnfe2 sshd[615]: log: Generating new 768 bit RSA key. Feb 24 07:56:14 vnfe2 sshd[615]: log: RSA key generation complete. Feb 24 08:00:13 vnfe2 sshd[5078]: log: Connection from 128.83.114.226 port 1018 Feb 24 08:00:14 vnfe2 sshd[5078]: log: RSA authentication for dave accepted. Feb 24 08:01:00 vnfe2 anacron[5137]: Updated timestamp for job `cron.hourly' to 2000-02-24 Feb 24 08:11:53 vnfe2 -- MARK -- Feb 24 08:31:53 vnfe2 -- MARK -- Feb 25 10:31:06 vnfe2 syslogd 1.3-3: restart. Feb 25 10:31:06 vnfe2 syslog: syslogd startup succeeded Feb 25 10:31:06 vnfe2 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 25 10:31:06 vnfe2 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Feb 25 10:31:06 vnfe2 syslog: klogd startup succeeded Feb 25 10:31:06 vnfe2 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp. Feb 25 10:31:06 vnfe2 kernel: Symbols match kernel version 2.2.13. 
Feb 25 10:31:06 vnfe2 kernel: Loaded 137 symbols from 7 modules. Feb 25 10:31:06 vnfe2 kernel: Linux version 2.2.13-7mdksmp (root@kenobi.mandrakesoft.com) (gcc version 2.95.1 19990816 (release)) #1 SMP Wed Sep 15 16:38:50 CEST 1999 Feb 25 10:31:06 vnfe2 kernel: Intel MultiProcessor Specification v1.4 Feb 25 10:31:06 vnfe2 kernel: Virtual Wire compatibility mode. Feb 25 10:31:06 vnfe2 kernel: OEM ID: INTEL Product ID: Lancewood APIC at: 0xFEE00000 Feb 25 10:31:06 vnfe2 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Feb 25 10:31:06 vnfe2 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Feb 25 10:31:06 vnfe2 kernel: I/O APIC #2 Version 17 at 0xFEC00000. Feb 25 10:31:06 vnfe2 kernel: Processors: 2 Feb 25 10:31:06 vnfe2 kernel: mapped APIC to ffffe000 (fee00000) Feb 25 10:31:06 vnfe2 kernel: mapped IOAPIC to ffffd000 (fec00000) Feb 25 10:31:06 vnfe2 kernel: Detected 447693220 Hz processor. Feb 25 10:31:06 vnfe2 kernel: Console: colour VGA+ 80x25 Feb 25 10:31:06 vnfe2 kernel: Calibrating delay loop... 445.64 BogoMIPS Feb 25 10:31:06 vnfe2 kernel: Memory: 516852k/524224k available (1100k kernel code, 424k reserved, 5440k data, 72k init) Feb 25 10:31:06 vnfe2 kernel: VFS: Diskquotas version dquot_6.4.0 initialized Feb 25 10:31:06 vnfe2 kernel: Pentium-III serial number disabled. Feb 25 10:31:06 vnfe2 kernel: Checking 386/387 coupling... OK, FPU using exception 16 error reporting. Feb 25 10:31:06 vnfe2 kernel: Checking 'hlt' instruction... OK. Feb 25 10:31:06 vnfe2 kernel: POSIX conformance testing by UNIFIX Feb 25 10:31:06 vnfe2 kernel: mtrr: v1.35a (19990819) Richard Gooch (rgooch@atnf.csiro.au) Feb 25 10:31:06 vnfe2 kernel: Pentium-III serial number disabled. ############################################################ CRASH_54 ############################################################ (1) Tue Feb 29 12:09:43 PST 2000 rar0502 down 108+18:13 vnfe3 down 1:03 Pingable ... incommunicado ... hook up monitor # Dave was last on machine??? (Why, Dave?) 
Feb 29 08:27:24 vnfe3 -- MARK -- Feb 29 08:35:38 vnfe3 sshd[14513]: log: Connection from 142.103.237.225 port 1017 Feb 29 08:35:38 vnfe3 sshd[14513]: log: Rhosts with RSA host authentication accepted for root, jason on vnfe1.physics.ubc.ca. Feb 29 08:35:38 vnfe3 sshd[14513]: log: ROOT LOGIN as 'root' from vnfe1.physics.ubc.ca Feb 29 08:35:45 vnfe3 sshd[670]: log: Generating new 768 bit RSA key. Feb 29 08:35:46 vnfe3 sshd[670]: log: RSA key generation complete. Feb 29 08:37:00 vnfe3 sshd[14513]: log: Closing connection to 142.103.237.225 Feb 29 08:47:24 vnfe3 -- MARK -- Feb 29 09:01:00 vnfe3 anacron[14545]: Updated timestamp for job `cron.hourly' to 2000-02-29 Feb 29 09:27:24 vnfe3 -- MARK -- Feb 29 09:47:24 vnfe3 -- MARK -- Feb 29 10:01:00 vnfe3 anacron[14568]: Updated timestamp for job `cron.hourly' to 2000-02-29 Feb 29 10:27:24 vnfe3 -- MARK -- Feb 29 10:47:25 vnfe3 -- MARK -- Feb 29 11:01:00 vnfe3 anacron[14590]: Updated timestamp for job `cron.hourly' to 2000-02-29 Feb 29 11:05:43 vnfe3 sshd[14619]: log: Connection from 141.142.7.4 port 13114 Feb 29 11:05:46 vnfe3 sshd[14619]: log: Password authentication for dave accepted. Feb 29 11:05:46 vnfe3 sshd[14621]: log: executing remote command as user dave Feb 29 11:05:48 vnfe3 sshd[14619]: log: Closing connection to 141.142.7.4 Mar 1 05:22:08 vnfe3 syslogd 1.3-3: restart. Mar 1 05:22:08 vnfe3 syslog: syslogd startup succeeded Mar 1 05:22:08 vnfe3 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 1 05:22:08 vnfe3 kernel: Inspecting /boot/System.map-2.2.13-7mdksmp Mar 1 05:22:08 vnfe3 syslog: klogd startup succeeded Mar 1 05:22:08 vnfe3 kernel: Loaded 6360 symbols from /boot/System.map-2.2.13-7mdksmp. Mar 1 05:22:08 vnfe3 kernel: Symbols match kernel version 2.2.13. Mar 1 05:22:08 vnfe3 kernel: Loaded 137 symbols from 7 modules. 
Mar 1 05:22:08 vnfe3 kernel: Linux version 2.2.13-7mdksmp (root@kenobi.mandrakesoft.com) (gcc version 2.95.1 19990816 (release)) #1 SMP Wed Sep 15 16:38:50 CEST 1999 Mar 1 05:22:08 vnfe3 kernel: Intel MultiProcessor Specification v1.4 Mar 1 05:22:08 vnfe3 kernel: Virtual Wire compatibility mode. Mar 1 05:22:08 vnfe3 kernel: OEM ID: INTEL Product ID: Lancewood APIC at: 0xFEE00000 Mar 1 05:22:08 vnfe3 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Mar 1 05:22:08 vnfe3 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Mar 1 05:22:08 vnfe3 kernel: I/O APIC #2 Version 17 at 0xFEC00000. . . . vnSetdate jj ntpd ############################################################ CRASH_55 ############################################################ (1) vn10 pingable, incommunicado rar0502 down 109+18:23 vn10 down 0:40 Wed Mar 1 12:39:26 PST 2000 (2) Hard reboot in machine room ... don't know what if anything Dave did? Mar 1 11:56:11 vn10 sshd[18921]: log: executing remote command as user idle Mar 1 11:56:13 vn10 sshd[18919]: log: Closing connection to 142.103.237.225 Mar 1 11:57:14 vn10 sshd[18947]: log: Connection from 142.103.237.225 port 1007 Mar 1 11:57:14 vn10 sshd[18947]: log: Rhosts with RSA host authentication accepted for dave, dave on vnfe1.physics.ubc.ca. Mar 1 12:01:00 vn10 anacron[18972]: Updated timestamp for job `cron.hourly' to 2000-03-01 Mar 1 12:03:37 vn10 sshd[18976]: log: Connection from 142.103.237.225 port 1008 Mar 1 12:03:37 vn10 sshd[18976]: log: RSA authentication for idle accepted. Mar 1 12:03:37 vn10 sshd[18978]: log: executing remote command as user idle Mar 1 12:03:39 vn10 sshd[18976]: log: Closing connection to 142.103.237.225 Mar 1 12:11:01 vn10 sshd[18999]: log: Connection from 142.103.237.225 port 1008 Mar 1 12:11:01 vn10 sshd[18999]: log: RSA authentication for idle accepted. 
Mar 1 12:11:01 vn10 sshd[19001]: log: executing remote command as user idle Mar 1 12:11:02 vn10 sshd[18999]: log: Closing connection to 142.103.237.225 Mar 1 12:18:25 vn10 sshd[19023]: log: Connection from 142.103.237.225 port 1008 Mar 1 12:18:25 vn10 sshd[19023]: log: RSA authentication for idle accepted. Mar 1 12:18:25 vn10 sshd[19025]: log: executing remote command as user idle Mar 1 12:18:27 vn10 sshd[19023]: log: Closing connection to 142.103.237.225 Mar 1 12:25:52 vn10 sshd[19047]: log: Connection from 142.103.237.225 port 1008 Mar 1 12:25:52 vn10 sshd[19047]: log: RSA authentication for idle accepted. Mar 1 12:25:52 vn10 sshd[19049]: log: executing remote command as user idle Mar 1 12:25:53 vn10 sshd[19047]: log: Closing connection to 142.103.237.225 Mar 1 12:30:51 vn10 sshd[18947]: log: Closing connection to 142.103.237.225 Mar 1 12:33:19 vn10 sshd[19075]: log: Connection from 142.103.237.225 port 1003 Mar 1 12:33:19 vn10 sshd[19075]: log: RSA authentication for idle accepted. Mar 1 12:33:19 vn10 sshd[19077]: log: executing remote command as user idle Mar 1 12:33:20 vn10 sshd[19075]: log: Closing connection to 142.103.237.225 Mar 1 12:40:45 vn10 sshd[19098]: log: Connection from 142.103.237.225 port 1012 Mar 1 12:40:45 vn10 sshd[19098]: log: RSA authentication for idle accepted. Mar 1 12:40:45 vn10 sshd[19100]: log: executing remote command as user idle Mar 1 12:40:47 vn10 sshd[19098]: log: Closing connection to 142.103.237.225 Mar 1 13:42:31 vn10 syslogd 1.3-3: restart. Mar 1 13:42:32 vn10 syslog: syslogd startup succeeded Mar 1 13:42:32 vn10 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 1 13:42:32 vn10 kernel: Inspecting /boot/System.map Mar 1 13:42:32 vn10 syslog: klogd startup succeeded Mar 1 13:42:32 vn10 kernel: Loaded 6360 symbols from /boot/System.map. Mar 1 13:42:32 vn10 kernel: Symbols match kernel version 2.2.13. Mar 1 13:42:32 vn10 kernel: Loaded 123 symbols from 6 modules. 
Mar 1 13:42:32 vn10 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 Mar 1 13:42:32 vn10 kernel: Intel MultiProcessor Specification v1.1 Mar 1 13:42:32 vn10 kernel: Virtual Wire compatibility mode. Mar 1 13:42:32 vn10 kernel: OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000 Mar 1 13:42:32 vn10 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Mar 1 13:42:32 vn10 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Mar 1 13:42:32 vn10 kernel: I/O APIC #2 Version 17 at 0xFEC00000. Mar 1 13:42:32 vn10 kernel: Processors: 2 Mar 1 13:42:32 vn10 kernel: mapped APIC to ffffe000 (fee00000) Mar 1 13:42:32 vn10 kernel: mapped IOAPIC to ffffd000 (fec00000) Mar 1 13:42:32 vn10 kernel: Detected 451030267 Hz processor. Mar 1 13:42:32 vn10 kernel: Console: colour VGA+ 80x25 Mar 1 13:42:32 vn10 kernel: Calibrating delay loop... 448.92 BogoMIPS Mar 1 13:42:32 vn10 kernel: Memory: 517348k/524224k available (980k kernel code, 424k reserved, 5408k data, 64k init) Mar 1 13:42:32 vn10 kernel: VFS: Diskquotas version dquot_6.4.0 initialized Mar 1 13:42:32 vn10 kernel: Checking 386/387 coupling... OK, FPU using exception 16 error reporting. Mar 1 13:42:32 vn10 kernel: Checking 'hlt' instruction... OK. Mar 1 13:42:32 vn10 kernel: POSIX conformance testing by UNIFIX Mar 1 13:42:32 vn10 kernel: mtrr: v1.35a (19990819) Richard Gooch (rgooch@atnf.csiro.au) Mar 1 13:42:32 vn10 kernel: per-CPU timeslice cutoff: 100.00 usecs. Mar 1 13:42:32 vn10 kernel: CPU1: Intel Pentium III (Katmai) stepping 03 Mar 1 13:42:32 vn10 kernel: calibrating APIC timer ... Mar 1 13:42:32 vn10 kernel: ..... CPU clock speed is 451.0104 MHz. Mar 1 13:42:32 vn10 kernel: ..... system bus clock speed is 100.2243 MHz. Mar 1 13:42:32 vn10 kernel: Booting processor 0 eip 2000 Mar 1 13:42:32 vn10 kernel: Calibrating delay loop... 450.56 BogoMIPS Mar 1 13:42:32 vn10 kernel: OK. 
Mar 1 13:42:32 vn10 kernel: CPU0: Intel Pentium III (Katmai) stepping 03 Mar 1 13:42:32 vn10 kernel: Total of 2 processors activated (899.48 BogoMIPS). Mar 1 13:42:32 vn10 kernel: enabling symmetric IO mode... ...done. ############################################################ CRASH_56, CRASH_57 ############################################################ Thu Mar 2 07:43:25 PST 2000 (1) Matt hung the following nodes running too-huge MPI jobs (and ctrl-C-ing out etc.) vn13 vn51 [root@vnfe1]# date; down Thu Mar 2 08:26:34 PST 2000 rar0502 down 110+14:30 vn13 down 0:48 vn51 down 0:56 # Hard reboot of vn13, vn51 vnSetdate ntptimeset #----------------------------------------------------------- !!ssh vn13 cat /tmp/log #----------------------------------------------------------- Mar 2 08:03:36 vn13 -- MARK -- Mar 2 08:12:46 vn13 ntpd[600]: time reset 1.139896 s Mar 2 08:12:46 vn13 ntpd[600]: synchronisation lost Mar 2 08:23:37 vn13 -- MARK -- Mar 2 08:31:10 vn13 syslogd 1.3-3: restart. Mar 2 08:31:10 vn13 syslog: syslogd startup succeeded Mar 2 08:31:10 vn13 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 2 08:31:10 vn13 kernel: Inspecting /boot/System.map Mar 2 08:31:10 vn13 syslog: klogd startup succeeded Mar 2 08:31:10 vn13 kernel: Loaded 6360 symbols from /boot/System.map. Mar 2 08:31:10 vn13 kernel: Symbols match kernel version 2.2.13. Mar 2 08:31:10 vn13 kernel: Loaded 123 symbols from 6 modules. #----------------------------------------------------------- !!ssh vn51 cat /tmp/log #----------------------------------------------------------- Mar 2 07:29:30 vn51 sshd[30157]: log: RSA authentication for idle accepted. 
Mar 2 07:29:30 vn51 sshd[30159]: log: executing remote command as user idle Mar 2 07:29:32 vn51 sshd[30157]: log: Closing connection to 142.103.237.225 Mar 2 07:30:14 vn51 pam_rhosts_auth[30183]: allowed to matt@vn13.physics.ubc.ca as matt Mar 2 07:30:14 vn51 PAM_pwdb[30183]: (rsh) session opened for user matt by (uid=0) Mar 2 08:31:19 vn51 syslogd 1.3-3: restart. Mar 2 08:31:19 vn51 syslog: syslogd startup succeeded Mar 2 08:31:20 vn51 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 2 08:31:20 vn51 kernel: Inspecting /boot/System.map Mar 2 08:31:20 vn51 syslog: klogd startup succeeded Mar 2 08:31:20 vn51 kernel: Loaded 6360 symbols from /boot/System.map. # NEED TO GET A WATCHDOG RUNNING AGAIN (rsh or MPI based though) ############################################################ CRASH_58 ############################################################ Fri Mar 3 22:29:07 PST 2000 (1) Inaki apparently hung vn10 up via some 'mv' command !!ssh root@vn10 cat /tmp/log Mar 3 16:47:20 vn10 sshd[24917]: log: RSA authentication for inaki accepted. Mar 3 16:49:12 vn10 sshd[24938]: log: Connection from 142.103.237.225 port 1006 Mar 3 16:49:12 vn10 sshd[24938]: log: RSA authentication for idle accepted. Mar 3 16:49:12 vn10 sshd[24940]: log: executing remote command as user idle Mar 3 16:49:13 vn10 sshd[24938]: log: Closing connection to 142.103.237.225 Mar 3 16:57:05 vn10 sshd[24965]: log: Connection from 142.103.237.225 port 1006 Mar 3 16:57:06 vn10 sshd[24965]: log: RSA authentication for idle accepted. Mar 3 16:57:06 vn10 sshd[24967]: log: executing remote command as user idle Mar 3 16:57:07 vn10 sshd[24965]: log: Closing connection to 142.103.237.225 Mar 3 17:01:00 vn10 anacron[24996]: Updated timestamp for job `cron.hourly' to 2000-03-03 Mar 3 17:05:00 vn10 sshd[25000]: log: Connection from 142.103.237.225 port 1006 Mar 3 17:05:00 vn10 sshd[25000]: log: RSA authentication for idle accepted. 
Mar 3 17:05:00 vn10 sshd[25002]: log: executing remote command as user idle Mar 3 17:05:01 vn10 sshd[25000]: log: Closing connection to 142.103.237.225 Mar 3 23:25:41 vn10 syslogd 1.3-3: restart. Mar 3 23:25:41 vn10 syslog: syslogd startup succeeded Mar 3 23:25:41 vn10 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 3 23:25:41 vn10 kernel: Inspecting /boot/System.map Mar 3 23:25:41 vn10 syslog: klogd startup succeeded Mar 3 23:25:42 vn10 kernel: Loaded 6360 symbols from /boot/System.map. Mar 3 23:25:42 vn10 kernel: Symbols match kernel version 2.2.13. Mar 3 23:25:42 vn10 kernel: Loaded 123 symbols from 6 modules. Mar 3 23:25:42 vn10 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 Mar 3 23:25:42 vn10 kernel: Intel MultiProcessor Specification v1.1 Mar 3 23:25:42 vn10 kernel: Virtual Wire compatibility mode. Mar 3 23:25:42 vn10 kernel: OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000 Mar 3 23:25:42 vn10 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Mar 3 23:25:42 vn10 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Mar 3 23:25:42 vn10 kernel: I/O APIC #2 Version 17 at 0xFEC00000. Mar 3 23:25:42 vn10 kernel: Processors: 2 Mar 3 23:25:42 vn10 kernel: mapped APIC to ffffe000 (fee00000) Mar 3 23:25:42 vn10 kernel: mapped IOAPIC to ffffd000 (fec00000) Mar 3 23:25:42 vn10 kernel: Detected 451025105 Hz processor. Mar 3 23:25:42 vn10 kernel: Console: colour VGA+ 80x25 Mar 3 23:25:42 vn10 kernel: Calibrating delay loop... 448.92 BogoMIPS Mar 3 23:25:42 vn10 kernel: Memory: 517348k/524224k available (980k kernel code, 424k reserved, 5408k data, 64k init) Mar 3 23:25:42 vn10 kernel: VFS: Diskquotas version dquot_6.4.0 initialized Mar 3 23:25:42 vn10 kernel: Checking 386/387 coupling... OK, FPU using exception 16 error reporting. 
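
# Sketch (not part of the original logs). In most entries above the useful
# number is the gap between the last syslog line before a hang and the
# following "syslogd 1.3-3: restart." line. A minimal Python sketch that
# pulls that gap out of a messages file might look like the following.
# Assumptions: classic "Mon DD HH:MM:SS host ..." timestamps with no year
# (so the year is passed in; without it "Feb 29" would not parse), English
# month names, and hypothetical function names parse_stamp / hang_windows.
# A clean shutdown ("exiting on signal 15") also shows up as a window, so
# the result still needs a human look.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; None if absent."""
    parts = line.split(None, 3)
    if len(parts) < 4:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        # Not a syslog line (ps output, comments, section headers, ...)
        return None


def hang_windows(lines, year):
    """Return (host, last_seen, restarted) for each syslogd restart line.

    last_seen is the timestamp of the previous line logged by the same host.
    """
    last_seen = {}
    windows = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        host = line.split()[3]
        if "syslogd" in line and "restart" in line and host in last_seen:
            windows.append((host, last_seen[host], stamp))
        last_seen[host] = stamp
    return windows


if __name__ == "__main__":
    # Two lines taken from the CRASH_58 excerpt above.
    sample = [
        "Mar 3 17:05:01 vn10 sshd[25000]: log: Closing connection to 142.103.237.225",
        "Mar 3 23:25:41 vn10 syslogd 1.3-3: restart.",
    ]
    for host, last, back in hang_windows(sample, 2000):
        print(f"{host} silent for {back - last}")  # vn10 silent for 6:20:40
```

# The window is only an upper bound on the hang time: it also includes
# however long the node sat dead before someone hard-rebooted it.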
############################################################
CRASH_59
############################################################
FUTURE_ACTION Could just run watchdog pgm to re-start inetd instead of reboot??

Tue Mar 7 15:21:31 PST 2000

rar0502 down 115+21:25
vn13 down 2:41

hi matt,

sorry about this, but vn13 just went down. my cactus code ran, and
finished. (not sure how long ago it finished because i was out of the
office for an hour or so.) i then issued the deadly mv command to change
a directory name, and it locked up. i ensured that no shells were open
to the directory in question. the command was executed from vn13.
should this be a problem? in the future i'll try to only change
directory names from vnfe1 i guess.

sorry for the hassle,
dave

-----------------------------------------------------------------------
David Neilsen dave@dirac.ph.utexas.edu
Center for Relativity, University of Texas at Austin

Life is too short to occupy oneself with the slaying of the slain
more than once. ---Thomas Henry Huxley, 1861
-----------------------------------------------------------------------

From dave@galileo.ph.utexas.edu Tue Mar 7 12:50:44 2000
Received: from galileo.ph.utexas.edu (galileo.ph.utexas.edu [128.83.114.127]) by laplace.physics.ubc.ca (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id MAA11602 for ; Tue, 7 Mar 2000 12:50:43 -0800
Received: (from dave@localhost) by galileo.ph.utexas.edu (8.9.3/8.9.3) id OAA16520 for matt@laplace.physics.ubc.ca; Tue, 7 Mar 2000 14:50:43 -0600 (CST)
Date: Tue, 7 Mar 2000 14:50:43 -0600 (CST)
From: David Neilsen
Message-Id: <200003072050.OAA16520@galileo.ph.utexas.edu>
To: matt
Subject: final commands on vn13
Status: R

matt,

here are the last commands i executed:

( Cactus code runs and stops )
determinant = 0 in ginv. Sorry. So long...
FORTRAN STOP
[dave@vn13 exe]$ cd fs
[dave@vn13 fs]$ ls
[dave@vn13 fs]$ cd ..
[dave@vn13 exe]$ ls
[dave@vn13 exe]$ mv fs fs_nompi
[dave@vn13 exe]$ ls
[ vn13 dies here...]
dave Mar 7 08:27:18 vn13 sshd[5109]: log: RSA authentication for idle accepted. Mar 7 08:27:18 vn13 sshd[5111]: log: executing remote command as user idle Mar 7 08:27:20 vn13 sshd[5109]: log: Closing connection to 142.103.237.225 Mar 7 08:30:09 vn13 sshd[5135]: log: Connection from 142.103.237.225 port 1011 Mar 7 08:30:09 vn13 sshd[5135]: log: Rhosts with RSA host authentication accepted for dave, dave on vnfe1.physics.ubc.ca. Mar 7 08:32:44 vn13 sshd[604]: log: Generating new 768 bit RSA key. . . . Mar 7 12:40:02 vn13 sshd[5923]: log: Connection from 142.103.237.225 port 1005 Mar 7 12:40:02 vn13 sshd[5923]: log: RSA authentication for idle accepted. Mar 7 12:40:02 vn13 sshd[5925]: log: executing remote command as user idle Mar 7 12:40:04 vn13 sshd[5923]: log: Closing connection to 142.103.237.225 Mar 7 15:26:50 vn13 syslogd 1.3-3: restart. Mar 7 15:26:50 vn13 syslog: syslogd startup succeeded Mar 7 15:26:50 vn13 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 7 15:26:50 vn13 kernel: Inspecting /boot/System.map Mar 7 15:26:50 vn13 syslog: klogd startup succeeded Mar 7 15:26:50 vn13 kernel: Loaded 6360 symbols from /boot/System.map. Mar 7 15:26:50 vn13 kernel: Symbols match kernel version 2.2.13. Mar 7 15:26:50 vn13 kernel: Loaded 123 symbols from 6 modules. Mar 7 15:26:50 vn13 kernel: Linux version 2.2.13-7Pmdksmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Dec 24 10:04:34 PST 1999 Mar 7 15:26:50 vn13 kernel: Intel MultiProcessor Specification v1.1 Mar 7 15:26:50 vn13 kernel: Virtual Wire compatibility mode. Mar 7 15:26:50 vn13 kernel: OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000 Mar 7 15:26:50 vn13 kernel: Processor #1 Pentium(tm) Pro APIC version 17 Mar 7 15:26:50 vn13 kernel: Processor #0 Pentium(tm) Pro APIC version 17 Mar 7 15:26:50 vn13 kernel: I/O APIC #2 Version 17 at 0xFEC00000. 
Mar 7 15:26:50 vn13 kernel: Processors: 2
Mar 7 15:26:50 vn13 kernel: mapped APIC to ffffe000 (fee00000)

############################################################
CRASH_60
############################################################
Tue Mar 7 15:35:46 PST 2000

(1) Matt brought vn13 down via 'mv', as per Dave, Inaki ...

[matt@vn13 wave2d]$ mv Archive Archive.O
[matt@vn13 wave2d]$ pwd
/d/vnfe1/home/matt/debug/rnpl/wave2d
[matt@vn13 wave2d]$ mv Archive.O Archive
[matt@vn13 wave2d]$ pwd
/d/vnfe1/home/matt/debug/rnpl/wave2d
[matt@vn13 wave2d]$ cd ..
[matt@vn13 rnpl]$ ls
fwave3d/ wave2d/ wave2d_0/ wave3d1o/
[matt@vn13 rnpl]$ pwd
/d/vnfe1/home/matt/debug/rnpl
[matt@vn13 rnpl]$ pwd
/d/vnfe1/home/matt/debug/rnpl
[matt@vn13 rnpl]$ ls
fwave3d/ wave2d/ wave2d_0/ wave3d1o/
[matt@vn13 rnpl]$ mv wave2d wave2d.O
[matt@vn13 rnpl]$ Pwd
matt@vn13.physics.ubc.ca:/d/vnfe1/home/matt/debug/rnpl
[matt@vn13 rnpl]$ cd wave2d.O wave2d
chdir: Too many arguments.
[matt@vn13 rnpl]$ mv !*

############################################################
CRASH_61
############################################################
Fri Mar 10 18:55:09 PST 2000

rar0502 down 119+00:58
vn64 down 0:56

(1) Scott H. hung up with 'mv'

############################################################
CRASH_62
############################################################
Sat Mar 18 15:31:12 PST 2000

(1) Hung up vn13 with mpptest running on vn13/vn17 (after interrupt),
    maybe should recompile with latest version of driver?

FUTURE_ACTION Check mpptest on bh...

[root@vnfe1]# date
Sat Mar 18 16:18:14 PST 2000
[root@vnfe1]# ping vn13
PING vn13.physics.ubc.ca (142.103.237.13): 56 data bytes
64 bytes from 142.103.237.13: icmp_seq=0 ttl=255 time=0.4 ms
64 bytes from 142.103.237.13: icmp_seq=1 ttl=255 time=0.2 ms
--- vn13.physics.ubc.ca ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.2/0.3/0.4 ms
[root@vnfe1]# telnet !$
telnet vn13
Trying 142.103.237.13...
Connected to vn13.physics.ubc.ca.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

ssh root@vn13 cat /tmp/log

Mar 18 15:28:05 vn13 sshd[3233]: log: Closing connection to 142.103.237.225
Mar 18 15:28:11 vn13 sshd[3264]: log: Connection from 142.103.237.225 port 1017
Mar 18 15:28:11 vn13 sshd[3264]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Mar 18 15:28:11 vn13 sshd[3266]: log: executing remote command as user matt
Mar 18 15:28:13 vn13 sshd[3279]: log: Connection from 142.103.237.225 port 1015
Mar 18 15:28:13 vn13 sshd[3279]: log: Rhosts with RSA host authentication accepted for matt, matt on vnfe1.physics.ubc.ca.
Mar 18 15:28:13 vn13 sshd[3281]: log: executing remote command as user matt
Mar 18 16:22:54 vn13 syslogd 1.3-3: restart.
Mar 18 16:22:54 vn13 syslog: syslogd startup succeeded
Mar 18 16:22:54 vn13 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Mar 18 16:22:54 vn13 kernel: Inspecting /boot/System.map-2.2.14-Psmp
Mar 18 16:22:54 vn13 syslog: klogd startup succeeded
Mar 18 16:22:54 vn13 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp.
Mar 18 16:22:54 vn13 kernel: Symbols match kernel version 2.2.14.
Mar 18 16:22:54 vn13 kernel: Loaded 124 symbols from 6 modules.
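# SKETCH toward the watchdog idea (nothing like this was actually running):
# in CRASH_62 the hung vn13 still answered ping and even accepted the telnet
# connection, so a watchdog probably should not treat a ping reply as proof
# that a node is usable. One way to separate the cases is to compare the
# ping result with the result of running a trivial remote command. The names
# classify_node and probe_node are hypothetical, the 20-second limit is
# arbitrary, and the `timeout` utility is assumed to be available.

```shell
# classify_node PING_STATUS REMOTE_STATUS
# Both arguments are exit statuses (0 = success). Prints one of:
#   down  no ping reply
#   hung  answers ping but could not run a remote command
#   up    both checks passed
classify_node() {
    if [ "$1" -ne 0 ]; then
        echo down
    elif [ "$2" -ne 0 ]; then
        echo hung
    else
        echo up
    fi
}

# probe_node HOST: hypothetical probe, defined here but not called.
# Uses rsh, as suggested in the watchdog note; ssh would work the same way.
probe_node() {
    host=$1
    ping_status=1
    remote_status=1
    if ping -c 1 "$host" >/dev/null 2>&1; then
        ping_status=0
        # Bound the wait: on a hung node the remote command may never return.
        if timeout 20 rsh "$host" true >/dev/null 2>&1; then
            remote_status=0
        fi
    fi
    echo "$host $(classify_node "$ping_status" "$remote_status")"
}

# The CRASH_62 situation: ping succeeded, remote command did not.
classify_node 0 1
```

# The last line prints "hung". What to do about a hung node (restart a
# service, or reboot) is deliberately left out of the sketch.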
############################################################
CRASH_63, CRASH_64
############################################################
Thu Mar 23 11:47:52 PST 2000

vn20 down 0:45

vn.physics.ubc.ca usage: Thu Mar 23 10:56:06 PST 2000
LAST TIME STAMP 2000:03:23:1054.10

    NODE  PID USER  PRI NI SIZE  RSS SHARE STAT LIB %CPU %MEM  TIME COMMAND
17: vn20  739 inaki  20  0 183M 183M   280 R      0 48.4 36.3 2086m svlasov
31: vn20  742 inaki  18  0 183M 183M   280 R      0 46.9 36.3 2086m svlasov

Thu Mar 23 12:40:37 PST 2000

[matt@vnfe1 ~]$ ping vn20
PING vn20.physics.ubc.ca (142.103.237.20): 56 data bytes
--- vn20.physics.ubc.ca ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
[matt@vnfe1 ~]$ down
rar0502 down 131+18:44
vn20 down 1:38

# Hard reboot ... verified restart with monitor
# Didn't come back, reconnecting monitor, and possibly rebooting (network probs,
# no route to host?)
# No, looks like fsck barfed, and that RAM disks might be an idea after all
# booted with linux-up OK, then rebooted with new

FUTURE_ACTION: Figure out RAM disk support

set N=vn20
ssh root@$N "hostname -s; uname -a; date; vnSetdate; ntptimeset; jj whod; ruptime | grep $N"

# RECEIVER HANG-UPS ???
# KERNEL PROBLEMS: NFS ???
Mar 23 10:54:53 vn20 sshd[16449]: log: Closing connection to 142.103.237.225 Mar 23 11:00:00 vn20 kernel: 6d5f6461 78657475 7365645f 796f7274 705f5f00 65726874 6d5f6461 78657475 Mar 23 11:00:00 vn20 kernel: Call Trace: [] [lockd:nlmsvc_invalidate_client_Rsmp2gig_b13232b7+11022/14664] [eepro100:eepro100_drv_id+7965414/48557234] Mar 23 11:00:00 vn20 kernel: [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-500612/5484] [eepro100:eepro100_drv_id+2377506/54145142] [eepro100:eepro100_drv_id+2377478/54145170] [eepro100:eepro100_drv_id+2377506/54145142] [eepro100:eepro100_drv_id+2377478/54145170] [eepro100:eepro100_drv_id+2377506/54145142] [eepro100:eepro100_drv_id+2377478/54145170] [eepro100:eepro100_drv_id+7965421/48557227] Mar 23 11:00:00 vn20 kernel: [lockd:nlmsvc_invalidate_client_Rsmp2gig_b13232b7+11094/14664] [eepro100:eepro100_drv_id+2377506/54145142] [lockd:nlmsvc_invalidate_client_Rsmp2gig_b13232b7+11127/14664] [eepro100:eepro100_drv_id+2377506/54145142] [eepro100:eepro100_drv_id+2377478/54145170] [eepro100:eepro100_drv_id+2377477/54145171] [eepro100:eepro100_drv_id+2377477/54145171] [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-500612/5484] . . . Mar 23 11:00:09 vn20 kernel: [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-459651/5484] [eepro100:eepro100_drv_id+4643124/51879524] [eepro100:eepro100_drv_id+4574271/51948377] [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-366884/5484] [eepro100:eepro100_drv_id+4574271/51948377] [eepro100:eepro100_drv_id+5872122/50650526] <1>Unable to handle kernel paging request at virtual address a0050000 Mar 23 11:00:09 vn20 kernel: current->tss.cr3 = 039e4000, %cr3 = 039e4000 Mar 23 11:00:09 vn20 kernel: *pde = 1f7a8063 Mar 23 11:00:09 vn20 kernel: *pte = 00000000 Mar 23 11:01:00 vn20 kernel: stuck on TLB IPI wait (CPU#1) Mar 23 11:02:02 vn20 kernel: eth0: Transmit timed out: status 7048 0000 at 827158/827170 commands 000ca000 000ca000 000ca000. 
Mar 23 11:02:02 vn20 kernel: nfs: server vnfe1 not responding, still trying Mar 23 11:02:03 vn20 kernel: nfs: server vnfe1 not responding, still trying Mar 23 11:02:04 vn20 kernel: eth0: Transmit timed out: status 2050 0000 at 827158/827171 . . . # Apparently also powered vn19 down by accident ! set N=vn19 ssh root@$N "hostname -s; uname -a; date; vnSetdate; ntptimeset; jj whod; ruptime | grep $N" ############################################################ CRASH_65 ############################################################ (1) vn20 down again Inaki?? # Hard reboot ... verified restart with monitor # Fsck barfed again, but at tail end of check FUTURE_ACTION: Figure out RAM disk support set N=vn20 ssh root@$N "hostname -s; uname -a; date; vnSetdate; ntptimeset; jj whod; ruptime | grep $N" Mar 28 18:16:42 vn20 sshd[7474]: log: Closing connection to 142.103.237.225 Mar 28 18:20:01 vn20 kernel: Unable to handle kernel paging request at virtual address 64801092 Mar 28 18:20:01 vn20 kernel: current->tss.cr3 = 00ed6000, %cr3 = 00ed6000 Mar 28 18:20:01 vn20 kernel: *pde = 00000000 Mar 28 18:20:01 vn20 kernel: Oops: 0000 Mar 28 18:20:01 vn20 kernel: CPU: 1 Mar 28 18:20:01 vn20 kernel: EIP: 0010:[<64801092>] Mar 28 18:20:01 vn20 kernel: EFLAGS: 00010282 Mar 28 18:20:01 vn20 kernel: eax: 00001d48 ebx: 83860000 ecx: 83861fa0 edx: 00000031 Mar 28 18:20:01 vn20 kernel: esi: 08051730 edi: 08051e8a ebp: 347ffffd esp: 83861fcd Mar 28 18:20:01 vn20 kernel: ds: 0018 es: 0018 ss: 0018 Mar 28 18:20:01 vn20 kernel: Process crond (pid: 7495, process nr: 47, stackpage=83861000) Mar 28 18:20:01 vn20 kernel: Stack: 30000000 8a080517 6808051e be7ffffd 2b000000 2b000000 be000000 58000000 Mar 28 18:20:01 vn20 kernel: 232ab60d 02000000 54000002 2b7ffff8 0c000000 000000c0 0d400000 00000060 Mar 28 18:20:01 vn20 kernel: 0e400000 00000000 0e400000 000000a0 0f400000 00000040 0f400000 000000e0 Mar 28 18:20:01 vn20 kernel: Call Trace: [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-222084/5484] 
[eepro100:eepro100_drv_id+183024/56339624] [eepro100:eepro100_drv_id+510704/56011944] [timer_bh+708/1008] [eepro100:eepro100_drv_id+903920/55618728] [tcp_rcv_established+100/1500] [eepro100:eepro100_drv_id+1231600/55291048] Mar 28 18:20:01 vn20 kernel: [FPU_div+1192/1236] [eepro100:eepro100_drv_id+1559280/54963368] [lockd:nlmclnt_proc_Rsmp2gig_9d12044b+-222084/5484] [eepro100:eepro100_drv_id+183024/56339624] [eepro100:eepro100_drv_id+510704/56011944] [timer_bh+708/1008] [eepro100:eepro100_drv_id+903920/55618728] [tcp_rcv_established+100/1500] Mar 28 18:20:01 vn20 kernel: [eepro100:eepro100_drv_id+1231600/55291048] [FPU_div+1192/1236] [eepro100:eepro100_drv_id+1559280/54963368] [eepro100:eepro100_drv_id+805740/55716908] [eepro100:eepro100_drv_id+3090373/53432275] [eepro100:eepro100_drv_id+712942/55809706] [eepro100:eepro100_drv_id+5281302/51241346] [eepro100:eepro100_drv_id+4479380/52043268] . . . ############################################################ CRASH_66 ############################################################ Sun Apr 16 05:57:52 PDT 2000 (1) vn20 down again, need to replace NIC card? Could still ping, login etc. but clearly something awry Warm in machine room again, 21.6C at the UPSes. Complained Hard reboot of vn20 with monitor attached Apr 15 16:56:38 vn20 sshd[18770]: log: RSA authentication for idle accepted. 
Apr 15 16:56:38 vn20 sshd[18772]: log: executing remote command as user idle Apr 15 16:56:41 vn20 sshd[18770]: log: Closing connection to 142.103.237.225 Apr 15 17:00:00 vn20 kernel: Unable to handle kernel paging request at virtual address 64801092 Apr 15 17:00:00 vn20 kernel: current->tss.cr3 = 0fd0c000, %cr3 = 0fd0c000 Apr 15 17:00:00 vn20 kernel: *pde = 00000000 Apr 15 17:00:00 vn20 kernel: Oops: 0000 Apr 15 17:00:00 vn20 kernel: CPU: 1 Apr 15 17:00:00 vn20 kernel: EIP: 0010:[<64801092>] Apr 15 17:00:00 vn20 kernel: EFLAGS: 00010282 Apr 15 17:00:00 vn20 kernel: eax: 00004969 ebx: 83860000 ecx: 83861fa0 edx: 0000002d Apr 15 17:00:00 vn20 kernel: esi: 08051730 edi: 08051e8a ebp: 347ffffd esp: 83861fcd Apr 15 17:00:00 vn20 kernel: ds: 0018 es: 0018 ss: 0018 Apr 15 17:00:00 vn20 kernel: Process crond (pid: 18791, process nr: 84, stackpage=83861000) Apr 15 17:00:00 vn20 kernel: Stack: 30000000 8a080517 7808051e be7ffffd 2b000000 2b000000 be000000 58000000 Apr 15 17:00:00 vn20 kernel: 232ab60d 06000000 64000002 2b7ffff8 8b000000 168b0875 85ba048b 890474c0 Apr 15 17:00:00 vn20 kernel: 47418a04 72fc7d3b 08758bea 8d044e89 5f5eec65 c35dec89 90909090 90909090 Apr 15 17:00:00 vn20 kernel: Call Trace: [eepro100:eepro100_drv_id+2378005/54144643] [eepro100:eepro100_drv_id+2378485/54144163] [lockd:nlmsvc_invalidate_client_Rsmp2gig_b13232b7+11164/14664] [eepro100:eepro100_drv_id+3607807/52914841] [eepro100:eepro100_drv_id+211582/56311066] [eepro100:eepro100_drv_id+3594915/52927733] [eepro100:eepro100_drv_id+3593388/52929260] Apr 15 17:00:00 vn20 kernel: [eepro100:eepro100_drv_id+3817358/52705290] [eepro100:eepro100_drv_id+7941365/48581283] [eepro100:eepro100_drv_id+3609147/52913501] [eepro100:eepro100_drv_id+3607807/52914841] [eepro100:eepro100_drv_id+1247606/55275042] [eepro100:eepro100_drv_id+4846742/51675906] [__ksymtab_net_families+7/8] [__ksymtab_net_families+7/8] Apr 15 17:00:00 vn20 kernel: [eepro100:eepro100_drv_id+3730181/52792467] 
[eepro100:eepro100_drv_id+3725104/52797544] [eepro100:eepro100_drv_id+7922812/48599836] [eepro100:eepro100_drv_id+3723444/52799204] [eepro100:eepro100_drv_id+7922812/48599836] [eepro100:eepro100_drv_id+3556175/52966473] [__ksymtab_net_families+7/8] [__ksymtab_net_families+7/8] Apr 15 17:00:00 vn20 kernel: [eepro100:eepro100_drv_id+7922821/48599827] [eepro100:eepro100_drv_id+7922814/48599834] [eepro100:eepro100_drv_id+1271961/55250687] [eepro100:eepro100_drv_id+1271961/55250687] [__ksymtab_net_families+7/8] [eepro100:eepro100_drv_id+3591851/52930797] [eepro100:eepro100_drv_id+3589330/52933318] [eepro100:eepro100_drv_id+3579612/52943036] Apr 15 17:00:00 vn20 kernel: [eepro100:eepro100_drv_id+3600305/52922343] [eepro100:eepro100_drv_id+3407104/53115544] [eepro100:eepro100_drv_id+3407104/53115544] [eepro100:eepro100_drv_id+3804447/52718201] [eepro100:eepro100_drv_id+1626559/54896089] [__ksymtab_datagram_poll+4/8] [eepro100:eepro100_drv_id+6819491/49703157] [eepro100:eepro100_drv_id+3594933/52927715] ############################################################ CRASH_67 ############################################################ Sun Apr 16 07:24:27 PDT 2000 (1) vn59 down, apparently when I was in machine room? Bad power plug connection, need to send apology to whomever was running on it. ############################################################ CRASH_68 ############################################################ Thu Jul 6 06:57:53 PDT 2000 (1) NFS problems with vnfe1? After some flailing, got vnfe1 back up in single user mode, removed NFS mounts, exports Load average looks sane now (had been in the 20's when things hung up!) 8:01am up 5 min, 2 users, load average: 0.15, 0.34, 0.18 (2) Looks like vnfe3 had problems as well? Nope ... was apparently just a time out fortunately. TO_DO Check on PGI Compilers Not working (3) Everything re-mounted on vnfe1 Re-exporting (will need to do explicit re-mount on nodes etc.) 
umount vnfe1:/home vnfe1:/home2 [root@vnfe1]# pwd /d [root@vnfe1]# ls -lt */*/Usage -rw-r--r-- 1 root root 43 Jul 6 05:36 vnfe1/home/Usage -rw-r--r-- 1 root root 72 Jul 5 05:29 vnfe3/home2/Usage -rw-r--r-- 1 root root 363 Jul 5 05:29 vnfe3/home/Usage -rw-r--r-- 1 root root 140 Jul 5 05:26 vnfe2/home/Usage -rw-r--r-- 1 root root 66 Jul 5 05:26 vnfe2/home2/Usage -rw-r--r-- 1 root root 154 Jul 5 05:26 vnfe1/home2/Usage ... and vnfe1/home/Usage is empty ... so try running manually? vnNCommand 'umount vnfe1:/home vnfe1:/home2; mount -a' vn1 vn2 vn3 vn4 vn7 vnNCommand 'df | grep vnfe1' # This is going to be a *big* mess! vnNCommand 'kill -9 `ps -elf | grep stocki | grep -v grep|/tmp/nth 4`; ps -elf | grep stocki | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep inaki | grep -v grep|/tmp/nth 4`; ps -elf | grep inaki | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep ehonda | grep -v grep|/tmp/nth 4`; ps -elf | grep ehonda | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep tzenova | grep -v grep|/tmp/nth 4`; ps -elf | grep tzenova| grep -v grep' vnNCommand 'kill -9 `ps -elf | grep petryk | grep -v grep|/tmp/nth 4`; ps -elf | grep petryk | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep minghe | grep -v grep|/tmp/nth 4`; ps -elf | grep minghe | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep matt | grep -v grep|/tmp/nth 4`; ps -elf | grep matt | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep lothar | grep -v grep|/tmp/nth 4`; ps -elf | grep lothar | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep idle | grep -v grep|/tmp/nth 4`; ps -elf | grep idle | grep -v grep' vnNCommand 'kill -9 `ps -elf | grep demo | grep -v grep|/tmp/nth 4`; ps -elf | grep demo | grep -v grep' -elf | /tmp/nth 3 | sort | uniq vnNCommand 'umount vnfe1:/home vnfe1:/home2; mount -a; df | grep vnfe1' # Affected users ehonda inaki lothar matt minghe petryk stocki tzenova >>> Executing as root@142.103.237.7 umount: /d/vnfe1/home: device is busy umount: vnfe1:/home2: not found vnfe1:/home2 
17066300 13305248 2872258 82% /d/vnfe1/home2

# SO ALL THAT IS REALLY REQUIRED TO CLEAN UP AFTER THIS SORT OF THING IS A SCRIPT
# WHICH GOES THROUGH ALL USERS WITH HOMES ON GIVEN MACHINE AND KILLS ALL
# JOBS RUNNING ON THE NODES

# vnfe1 came back up with only one proc recognized again
# rebooted, changed BIOS setting, rebooted

[root@vnfe1]# date `ssh root@vnfe2 date +%m%d%H%M%Y.%S`
Thu Jul 6 09:32:54 PDT 2000
[root@vnfe1]# grep proc /proc/cpuinfo
processor : 0
processor : 1

vnCommand 'cd ~matt/scripts; wc nth'

# All looks OK

Thu Jul 6 09:37:43 PDT 2000

############################################################
CRASH_69
############################################################
Tue Jul 11 14:24:47 PDT 2000

(1) vnfe1 dead, hard rebooted in machine room

date `ssh root@vnfe2 date +%m%d%H%M%Y.%S`

LOOKS LIKE IT WAS A PROBLEM WITH THE TAPE DRIVES

Jul 11 13:15:57 vnfe1 kernel: scsi : aborting command due to timeout : pid 4952613, scsi1, channel 0, id 0, lun 0 Test Unit Ready 00 00 00 00 00
Jul 11 13:15:57 vnfe1 kernel: SCSI host 1 abort (pid 4952613) timed out - resetting
Jul 11 13:15:57 vnfe1 kernel: SCSI bus is being reset for host 1 channel 0.
Jul 11 13:15:58 vnfe1 kernel: SCSI host 1 channel 0 reset (pid 4952613) timed out - trying harder
Jul 11 13:15:58 vnfe1 kernel: SCSI bus is being reset for host 1 channel 0.
Jul 11 13:16:00 vnfe1 kernel: SCSI host 1 abort (pid 4952613) timed out - resetting
Jul 11 13:16:00 vnfe1 kernel: SCSI bus is being reset for host 1 channel 0.
Jul 11 13:16:03 vnfe1 kernel: SCSI host 1 channel 0 reset (pid 4952613) timed out - trying harder
Jul 11 13:16:03 vnfe1 kernel: SCSI bus is being reset for host 1 channel 0.
Jul 11 13:16:05 vnfe1 kernel: SCSI host 1 reset (pid 4952613) timed out again -
Jul 11 13:16:05 vnfe1 kernel: probably an unrecoverable SCSI bus or device hang.
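# SKETCH of the clean-up script wished for in the CRASH_68 notes (this is not
# what was run; the per-user vnNCommand lines were typed by hand). It assumes
# the usual `ps -elf` layout, with the owner in column 3 and the PID in
# column 4, which matches the log's use of `/tmp/nth 3` for users and
# `/tmp/nth 4` for PIDs. Matching the owner column exactly avoids a pitfall
# of `grep matt`, which also hits any command line that merely contains the
# string, e.g. a path under /d/vnfe1/home/matt. The names pids_of_user and
# kill_user_jobs are made up; how the script gets run on every node is left
# to vnNCommand, whose internals are not shown in this log.

```shell
# pids_of_user USER: read `ps -elf` output on stdin and print the PID
# (column 4) of every process whose owner column (3) is exactly USER.
# The first line is skipped because it is the header.
pids_of_user() {
    awk -v user="$1" 'NR > 1 && $3 == user { print $4 }'
}

# kill_user_jobs USER...: kill every process owned by each listed user
# on the local node. Defined here but not called.
kill_user_jobs() {
    for user in "$@"; do
        pids=$(ps -elf | pids_of_user "$user")
        if [ -n "$pids" ]; then
            # Unquoted on purpose: one argument per PID.
            kill -9 $pids
        fi
    done
}

# Dry run on canned input (invented sample lines, not from the cluster).
pids_of_user matt <<'EOF'
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
000 S matt 101 1 0 60 0 - 100 wait 09:00 ? 00:00:00 wave2d
000 S idle 102 1 0 60 0 - 100 wait 09:00 ? 00:00:00 grep matt
000 R matt 103 1 0 60 0 - 100 - 09:00 ? 00:00:00 mpptest
EOF
```

# The dry run prints 101 and 103; PID 102 is skipped even though "matt"
# appears in its command line. On a node the call would be along the lines
# of: kill_user_jobs ehonda inaki lothar matt minghe petryk stocki tzenova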
vnNCommand 'kill -9 `ps -elf | grep stocki | grep -v grep|/tmp/nth 4`; ps -elf | grep stocki | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep ehonda | grep -v grep|/tmp/nth 4`; ps -elf | grep ehonda | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep tzenova | grep -v grep|/tmp/nth 4`; ps -elf | grep tzenova| grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep petryk | grep -v grep|/tmp/nth 4`; ps -elf | grep petryk | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep minghe | grep -v grep|/tmp/nth 4`; ps -elf | grep minghe | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep matt | grep -v grep|/tmp/nth 4`; ps -elf | grep matt | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep lothar | grep -v grep|/tmp/nth 4`; ps -elf | grep lothar | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep idle | grep -v grep|/tmp/nth 4`; ps -elf | grep idle | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep inaki | grep -v grep|/tmp/nth 4`; ps -elf | grep inaki | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep demo | grep -v grep|/tmp/nth 4`; ps -elf | grep demo | grep -v grep'
vnNCommand 'kill -9 `ps -elf | grep fransp | grep -v grep|/tmp/nth 4`; ps -elf | grep fransp | grep -v grep'

vnNCommand 'umount vnfe1:/home vnfe1:/home2; mount -a; df | grep vnfe1'

# Problem nodes
None

# Update web page
ehonda fransp inaki lothar matt minghe petryk stocki tzenova

############################################################
CRASH_70
############################################################
Thu Jul 27 01:43:59 PDT 2000

vn42 down 1:17

# Still ping'able
# Hard reboot

set N=vn42
ssh root@$N "hostname -s; uname -a; date; vnSetdate; ntptimeset; jj whod; ruptime | grep $N"

# Nothing obvious in log

Jul 27 00:00:38 vn42 sshd[12257]: log: Closing connection to 142.103.237.225
Jul 27 00:01:01 vn42 anacron[12282]: Updated timestamp for job `cron.hourly' to 2000-07-27
Jul 27 00:10:00 vn42 sshd[12286]: log: Connection from 142.103.237.225 port 1014
Jul 27 00:10:00 vn42 sshd[12286]: log: RSA authentication
for idle accepted. Jul 27 00:10:01 vn42 sshd[12290]: log: executing remote command as user idle Jul 27 00:10:04 vn42 sshd[12286]: log: Closing connection to 142.103.237.225 Jul 27 00:19:20 vn42 sshd[606]: log: Generating new 768 bit RSA key. Jul 27 00:19:20 vn42 sshd[606]: log: RSA key generation complete. Jul 27 00:19:45 vn42 sshd[12312]: log: Connection from 142.103.237.225 port 1011 Jul 27 00:19:45 vn42 sshd[12312]: log: RSA authentication for idle accepted. Jul 27 00:19:45 vn42 sshd[12314]: log: executing remote command as user idle Jul 27 00:19:48 vn42 sshd[12312]: log: Closing connection to 142.103.237.225 Jul 27 00:49:35 vn42 syslogd 1.3-3: restart. Jul 27 00:49:35 vn42 syslog: syslogd startup succeeded Jul 27 00:49:35 vn42 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jul 27 00:49:35 vn42 kernel: Inspecting /boot/System.map-2.2.14-Psmp Jul 27 00:49:35 vn42 syslog: klogd startup succeeded ############################################################ CRASH_71, CRASH_72 ############################################################ Sat Aug 12 08:08:30 PDT 2000 vn11 down 9:07 vn2 down 8:44 # Temperature 22C (whined about the A/C) # Hard re-booting vn2 Aug 11 22:54:25 vn2 sshd[27567]: log: executing remote command as user idle Aug 11 22:54:27 vn2 sshd[27565]: log: Closing connection to 142.103.237.225 Aug 11 23:01:00 vn2 anacron[27595]: Updated timestamp for job `cron.hourly' to 2000-08-11 Aug 11 23:02:55 vn2 sshd[27599]: log: Connection from 142.103.237.225 port 1021 Aug 11 23:02:56 vn2 sshd[27599]: log: RSA authentication for idle accepted. Aug 11 23:02:56 vn2 sshd[27601]: log: executing remote command as user idle Aug 11 23:02:58 vn2 sshd[27599]: log: Closing connection to 142.103.237.225 Aug 11 23:16:24 vn2 -- MARK -- Aug 11 23:24:17 vn2 sshd[27627]: log: Connection from 142.103.234.22 port 1021 Aug 11 23:24:18 vn2 sshd[27627]: log: RSA authentication for cwlai accepted. Aug 12 07:59:40 vn2 syslogd 1.3-3: restart. 
Aug 12 07:59:40 vn2 syslog: syslogd startup succeeded Aug 12 07:59:40 vn2 kernel: klogd 1.3-3, log source = /proc/kmsg started. Aug 12 07:59:40 vn2 kernel: Inspecting /boot/System.map-2.2.14-Psmp Aug 12 07:59:40 vn2 syslog: klogd startup succeeded Aug 12 07:59:40 vn2 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Aug 12 07:59:40 vn2 kernel: Symbols match kernel version 2.2.14. Aug 12 07:59:40 vn2 kernel: Loaded 124 symbols from 6 modules. Aug 12 07:59:40 vn2 kernel: Linux version 2.2.14-Psmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Mar 17 23:30:57 PST 2000 # Hard re-booting vn11 (so, suspect that Kevin may be able to hang a node?) Aug 11 22:55:00 vn11 sshd[18932]: log: Closing connection to 142.103.237.225 Aug 11 23:01:00 vn11 anacron[18962]: Updated timestamp for job `cron.hourly' to 2000-08-11 Aug 11 23:02:17 vn11 sshd[18966]: log: Connection from 142.103.237.1 port 1021 Aug 11 23:02:18 vn11 sshd[18966]: log: RSA authentication for cwlai accepted. Aug 11 23:02:18 vn11 sshd[18968]: log: executing remote command as user cwlai Aug 12 08:04:45 vn11 syslogd 1.3-3: restart. Aug 12 08:04:45 vn11 syslog: syslogd startup succeeded Aug 12 08:04:45 vn11 kernel: klogd 1.3-3, log source = /proc/kmsg started. Aug 12 08:04:45 vn11 kernel: Inspecting /boot/System.map-2.2.14-Psmp Aug 12 08:04:45 vn11 syslog: klogd startup succeeded Aug 12 08:04:45 vn11 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Aug 12 08:04:45 vn11 kernel: Symbols match kernel version 2.2.14. 
############################################################ CRASH_73 ############################################################ Wed Aug 30 05:56:58 PDT 2000 (1) Problem with vn55 64: vn55 up 163+23:05, 0 users, load 14.60, 14.51, 14.17 Can still ping, but can't ssh, telnet Aug 30 04:21:17 vn55 pam_rhosts_auth[6067]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:17 vn55 PAM_pwdb[6067]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:17 vn55 pam_rhosts_auth[6080]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:17 vn55 PAM_pwdb[6080]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:18 vn55 pam_rhosts_auth[6093]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:18 vn55 PAM_pwdb[6093]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:19 vn55 pam_rhosts_auth[6106]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:19 vn55 PAM_pwdb[6106]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:20 vn55 pam_rhosts_auth[6119]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:20 vn55 PAM_pwdb[6119]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:22 vn55 pam_rhosts_auth[6132]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:22 vn55 PAM_pwdb[6132]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:24 vn55 pam_rhosts_auth[6145]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:24 vn55 PAM_pwdb[6145]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:26 vn55 pam_rhosts_auth[6158]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:26 vn55 PAM_pwdb[6158]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:28 vn55 pam_rhosts_auth[6171]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:28 vn55 PAM_pwdb[6171]: (rsh) session opened for user suqin by (uid=0) Aug 30 04:21:31 vn55 pam_rhosts_auth[6184]: allowed to suqin@vn23.physics.ubc.ca as suqin Aug 30 04:21:31 vn55 PAM_pwdb[6184]: (rsh) session opened for user 
suqin by (uid=0) Aug 30 04:30:36 vn55 sshd[6202]: log: Connection from 142.103.237.225 port 1014 Aug 30 04:30:39 vn55 sshd[6202]: log: RSA authentication for idle accepted. Aug 30 04:30:44 vn55 sshd[6204]: log: executing remote command as user idle Aug 30 04:31:15 vn55 sshd[6202]: log: Closing connection to 142.103.237.225 Aug 30 04:32:17 vn55 sshd[606]: log: Generating new 768 bit RSA key. Aug 30 04:32:20 vn55 sshd[606]: log: RSA key generation complete. Aug 30 04:42:44 vn55 sshd[6225]: log: Connection from 142.103.237.225 port 1013 Aug 30 04:42:48 vn55 sshd[6225]: log: RSA authentication for idle accepted. Aug 30 04:42:54 vn55 sshd[6227]: log: executing remote command as user idle Aug 30 04:43:38 vn55 sshd[6225]: log: Closing connection to 142.103.237.225 Aug 30 04:54:54 vn55 sshd[6251]: log: Connection from 142.103.237.225 port 1014 Aug 30 04:54:57 vn55 sshd[6251]: log: RSA authentication for idle accepted. Aug 30 04:54:58 vn55 sshd[6253]: log: executing remote command as user idle Aug 30 04:55:10 vn55 sshd[6251]: log: Closing connection to 142.103.237.225 Aug 30 05:01:05 vn55 anacron[6281]: Updated timestamp for job `cron.hourly' to 2000-08-30 Aug 30 05:10:02 vn55 kernel: Unable to load interpreter Aug 30 05:15:06 vn55 kernel: Unable to load interpreter Aug 30 05:20:10 vn55 kernel: Unable to load interpreter Aug 30 05:30:03 vn55 kernel: Unable to load interpreter ############################################################ CRASH_74 ############################################################ Fri Sep 22 17:06:01 PDT 2000 (1) vn8 rebooted about two hours ago?? 
############################################################ CRASH_75 ############################################################ Wed Nov 15 10:20:07 PST 2000 (1) vn20 down In machine room, hard reboot of vn20 # vn20 back up # No obvious symptoms Nov 15 00:12:02 vn20 sshd[26914]: log: Closing connection to 142.103.237.225 Nov 15 00:22:08 vn20 sshd[26936]: log: Connection from 142.103.237.225 port 1022 Nov 15 00:22:09 vn20 sshd[26936]: log: RSA authentication for idle accepted. Nov 15 00:22:09 vn20 sshd[26938]: log: executing remote command as user idle Nov 15 00:34:59 vn20 -- MARK -- Nov 15 00:48:48 vn20 sshd[554]: log: Generating new 768 bit RSA key. Nov 15 00:48:50 vn20 sshd[554]: log: RSA key generation complete. Nov 15 01:14:59 vn20 -- MARK -- Nov 15 01:34:59 vn20 -- MARK -- Nov 15 01:54:59 vn20 -- MARK -- Nov 15 02:14:59 vn20 -- MARK -- Nov 15 02:34:59 vn20 -- MARK -- Nov 15 02:54:59 vn20 -- MARK -- Nov 15 03:14:59 vn20 -- MARK -- Nov 15 03:34:59 vn20 -- MARK -- Nov 15 03:54:59 vn20 -- MARK -- Nov 15 04:14:59 vn20 -- MARK -- Nov 15 04:34:59 vn20 -- MARK -- Nov 15 04:54:59 vn20 -- MARK -- Nov 15 05:14:59 vn20 -- MARK -- Nov 15 05:34:59 vn20 -- MARK -- Nov 15 05:54:59 vn20 -- MARK -- Nov 15 06:14:59 vn20 -- MARK -- Nov 15 06:34:59 vn20 -- MARK -- Nov 15 06:54:59 vn20 -- MARK -- Nov 15 07:14:59 vn20 -- MARK -- Nov 15 07:34:59 vn20 -- MARK -- Nov 15 07:54:59 vn20 -- MARK -- Nov 15 08:14:59 vn20 -- MARK -- Nov 15 08:34:59 vn20 -- MARK -- Nov 15 08:54:59 vn20 -- MARK -- Nov 15 09:14:59 vn20 -- MARK -- Nov 15 09:34:59 vn20 -- MARK -- Nov 15 09:37:36 vn20 sshd[26936]: fatal: Connection closed by remote host. Nov 15 09:39:24 vn20 sshd[27058]: log: Connection from 142.103.237.225 port 1011 Nov 15 09:39:24 vn20 sshd[27058]: log: RSA authentication for idle accepted. Nov 15 09:39:25 vn20 sshd[27060]: log: executing remote command as user idle Nov 15 11:04:02 vn20 syslogd 1.3-3: restart. 
Nov 15 11:04:02 vn20 syslog: syslogd startup succeeded ############################################################ CRASH_76 ############################################################ Sun Nov 26 05:32:42 PST 2000 # Electrical work in Klinck yesterday vn5 down # Hard reboot, had bad disk, needed manual fsck fsck /dev/hda1 Nov 26 03:51:56 vn5 sshd[2667]: log: Closing connection to 142.103.237.225 Nov 26 03:59:08 vn5 sshd[2688]: log: Connection from 142.103.237.225 port 979 Nov 26 03:59:08 vn5 sshd[2688]: log: RSA authentication for idle accepted. Nov 26 03:59:08 vn5 sshd[2690]: log: executing remote command as user idle Nov 26 03:59:09 vn5 sshd[2688]: log: Closing connection to 142.103.237.225 Nov 26 04:01:00 vn5 anacron[2718]: Updated timestamp for job `cron.hourly' to 2000-11-26 Nov 26 04:02:00 vn5 anacron[2726]: Updated timestamp for job `cron.daily' to 2000-11-26 ############################################################ CRASH_77 ############################################################ Sun Nov 26 17:54:46 PST 2000 vn43 down # sdewekker@vn36.physics.ubc.ca Nov 26 15:42:24 vn43 sshd[15818]: log: Closing connection to 142.103.237.225 Nov 26 15:43:56 vn43 pam_rhosts_auth[15839]: allowed to sdewekker@vn36.physics.ubc.ca as sdewekker Nov 26 15:43:56 vn43 PAM_pwdb[15839]: (rsh) session opened for user sdewekker by (uid=0) Nov 26 15:44:31 vn43 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000014 Nov 26 15:44:31 vn43 kernel: current->tss.cr3 = 00101000, %cr3 = 00101000 Nov 26 15:44:31 vn43 kernel: *pde = 00000000 Nov 26 15:44:31 vn43 kernel: Oops: 0000 Nov 26 15:44:31 vn43 kernel: CPU: 0 Nov 26 15:44:31 vn43 kernel: EIP: 0010:[try_to_free_buffers+18/132] Nov 26 15:44:31 vn43 kernel: EFLAGS: 00010207 ############################################################ CRASH_78 ############################################################ Sat Dec 2 17:46:06 PST 2000 # After power-down/power-up vn28 doesn't see any hard drive, trying # to 
diagnose in Hennings, but may have to go into the shop
Jason replaced hard-drive, reset BIOS, all seems well
############################################################
CRASH_79
############################################################
Sun Dec 3 06:29:22 PST 2000
# Shortly after power-down/up
vn20 down (again!)
# Hard-reboot, had disk errors, needed manual check
fsck /dev/hda1
# Subsequent reboot panicked, but next one OK
jcohena running?
Dec 2 21:24:31 vn20 sshd[1741]: log: executing remote command as user idle
Dec 2 21:24:34 vn20 sshd[1739]: log: Closing connection to 142.103.237.225
Dec 2 21:31:30 vn20 sshd[1765]: log: Connection from 142.103.237.225 port 1003
Dec 2 21:31:30 vn20 sshd[1765]: log: RSA authentication for idle accepted.
Dec 2 21:31:30 vn20 sshd[1767]: log: executing remote command as user idle
Dec 3 06:39:20 vn20 syslogd 1.3-3: restart.
Dec 3 06:39:20 vn20 syslog: syslogd startup succeeded
Dec 3 06:39:20 vn20 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Dec 3 06:39:20 vn20 kernel: Inspecting /boot/System.map-2.2.14-Psmp
Dec 3 06:39:20 vn20 syslog: klogd startup succeeded
############################################################
CRASH_80
############################################################
Thu Dec 14 04:03:24 PST 2000
# vn23 down at about 10PM last evening
Thu Dec 14 07:58:30 PST 2000
# In machine room, hard re-boot of vn23
# Came up OK, i.e. no manual fsck required
# Suqin on
Dec 13 20:57:59 vn23 sshd[13356]: log: Closing connection to 142.103.237.225
Dec 13 21:01:00 vn23 anacron[13386]: Updated timestamp for job `cron.hourly' to 2000-12-13
Dec 13 21:05:47 vn23 sshd[13391]: log: Connection from 142.103.237.225 port 999
Dec 13 21:05:47 vn23 sshd[13391]: log: RSA authentication for idle accepted.
Dec 13 21:05:47 vn23 sshd[13393]: log: executing remote command as user idle Dec 13 21:05:48 vn23 sshd[13391]: log: Closing connection to 142.103.237.225 Dec 13 21:08:35 vn23 PAM_pwdb[13413]: (login) session opened for user suqin by (uid=0) Dec 13 21:08:35 vn23 -- suqin[13413]: LOGIN ON 0 BY suqin FROM vn18 ############################################################ CRASH_81 ############################################################ Sun Dec 17 10:53:46 PST 2000 # vn5's load average going through the roof, can't ssh # (was installing testing/nwchem) # Jason installed hard drive in one of the bh machines, did # a manual fsck, and seems to have fixed it up ... # From /var/log/messages Dec 17 10:30:39 vn5 sshd[11374]: log: executing remote command as root: ls -lt /usr/local/nwchem Dec 17 10:30:51 vn5 sshd[11372]: log: Closing connection to 142.103.237.225 Dec 17 10:32:10 vn5 kernel: attempt to access beyond end of device Dec 17 10:32:10 vn5 kernel: 03:01: rw=0, want=943206972, limit=12691318 Dec 17 10:32:10 vn5 kernel: dev 03:01 blksize=1024 blocknr=943206971 sector=1886413942 size=1024 count=1 ############################################################ CRASH_82 ############################################################ Sun Feb 4 12:43:29 PST 2001 vn20's load average at about 60, can still log in, but can't, e.g. cd to ~idle reboot minghe, fransp, ytwang were running ############################################################ CRASH_83 ############################################################ Mon Feb 5 13:21:35 PST 2001 vn5 had to be re-booted (could do so remotely), starting to look like disk might be bad? 
Feb 4 23:25:30 vn5 kernel: free_one_pmd: bad directory entry 00000002 Feb 4 23:25:30 vn5 kernel: free_one_pmd: bad directory entry 00000004 Feb 4 23:25:30 vn5 last message repeated 8 times Feb 4 23:25:30 vn5 PAM_pwdb[24504]: (rsh) session closed for user roman Feb 4 23:30:00 vn5 kernel: free_one_pmd: bad directory entry 00000004 Feb 4 23:30:00 vn5 last message repeated 22 times Feb 4 23:30:00 vn5 kernel: free_one_pmd: bad directory entry 00000006 Feb 4 23:30:00 vn5 kernel: free_one_pmd: bad directory entry 00000004 Feb 4 23:30:00 vn5 last message repeated 15 times Feb 4 23:31:45 vn5 sshd[24547]: log: Connection from 142.103.237.225 port 1011 Feb 4 23:31:45 vn5 sshd[24547]: log: RSA authentication for idle accepted. Feb 4 23:31:45 vn5 kernel: free_one_pmd: bad directory entry 00000004 Feb 4 23:31:45 vn5 last message repeated 22 times Feb 4 23:31:45 vn5 kernel: free_one_pmd: bad directory entry 00000006 Feb 4 23:31:45 vn5 kernel: free_one_pmd: bad directory entry 00000004 ############################################################ CRASH_84, CRASH_85 ############################################################ Sat Feb 24 08:46:54 PST 2001 vn5, vn10 down vn10's was powered off, rocker switch did not bring it up, unplugging and reseating power cord did ------------------------------------------------------------ vn10 log excerpt ------------------------------------------------------------ Feb 24 00:23:27 vn10 sshd[2186]: log: Closing connection to 142.103.237.225 Feb 24 00:28:58 vn10 PAM_pwdb[2208]: (login) session opened for user suqin by (uid=0) Feb 24 00:28:58 vn10 -- suqin[2208]: LOGIN ON 0 BY suqin FROM vn1 Feb 24 00:29:53 vn10 sshd[2326]: log: Connection from 142.103.237.225 port 1013 Feb 24 00:29:53 vn10 sshd[2326]: log: RSA authentication for idle accepted. 
Feb 24 00:29:53 vn10 sshd[2328]: log: executing remote command as user idle Feb 24 00:29:56 vn10 sshd[2326]: log: Closing connection to 142.103.237.225 Feb 24 00:36:27 vn10 sshd[3177]: log: Connection from 142.103.237.225 port 1013 Feb 24 00:36:28 vn10 sshd[3177]: log: RSA authentication for idle accepted. Feb 24 00:36:28 vn10 sshd[3179]: log: executing remote command as user idle Feb 24 00:36:29 vn10 sshd[3177]: log: Closing connection to 142.103.237.225 Feb 24 08:51:40 vn10 syslogd 1.3-3: restart. Feb 24 08:51:40 vn10 syslog: syslogd startup succeeded Feb 24 08:51:40 vn10 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 24 08:51:40 vn10 kernel: Inspecting /boot/System.map-2.2.14-Psmp ------------------------------------------------------------ vn5 log excerpt ------------------------------------------------------------ Feb 24 00:43:03 vn5 sshd[2456]: log: Connection from 142.103.237.225 port 1018 Feb 24 00:43:03 vn5 sshd[2456]: log: RSA authentication for idle accepted. Feb 24 00:43:03 vn5 sshd[2458]: log: executing remote command as user idle Feb 24 00:43:05 vn5 sshd[2456]: log: Closing connection to 142.103.237.225 Feb 24 00:45:38 vn5 sshd[606]: log: Generating new 768 bit RSA key. Feb 24 00:45:39 vn5 sshd[606]: log: RSA key generation complete. Feb 24 00:49:17 vn5 PAM_pwdb[2482]: (login) session opened for user suqin by (uid=0) Feb 24 00:49:17 vn5 -- suqin[2482]: LOGIN ON 0 BY suqin FROM vn49 Feb 24 08:54:15 vn5 syslogd 1.3-3: restart. Feb 24 08:54:15 vn5 syslog: syslogd startup succeeded Feb 24 08:54:16 vn5 kernel: klogd 1.3-3, log source = /proc/kmsg started. Feb 24 08:54:16 vn5 kernel: Inspecting /boot/System.map-2.2.14-Psmp Feb 24 08:54:16 vn5 syslog: klogd startup succeeded Feb 24 08:54:16 vn5 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. 
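# Most of the excerpts in this file are read the same way: find the last
# entry logged before a "syslogd 1.3-3: restart." line and see how long the
# node was silent. A minimal sketch of that check follows. It is not one of
# the vn* scripts: the function name restart_gaps and the year argument are
# made up for illustration (syslog timestamps carry no year), and it assumes
# an English locale for the month names.

```python
from datetime import datetime


def restart_gaps(lines, year):
    """Yield (host, last_seen, restarted, gap) for each syslogd restart.

    `year` is an assumption supplied by the caller, because syslog
    timestamps such as "Feb 24 08:51:40" do not include one.
    """
    last = {}  # host -> timestamp of the most recent entry seen so far
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # not a "Mon DD HH:MM:SS host message" line
        month, day, clock, host, message = parts
        try:
            stamp = datetime.strptime(
                f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # operator note or banner, not a log entry
        is_restart = message.startswith("syslogd") and "restart" in message
        if is_restart and host in last:
            yield host, last[host], stamp, stamp - last[host]
        last[host] = stamp


# Two entries taken from the vn10 excerpt above.
sample = [
    "Feb 24 00:36:29 vn10 sshd[3177]: log: Closing connection to 142.103.237.225",
    "Feb 24 08:51:40 vn10 syslogd 1.3-3: restart.",
]
for host, seen, back, gap in restart_gaps(sample, 2001):
    print(host, "silent from", seen, "to", back, "gap", gap)
```

# The gap is only an upper bound on the downtime: a node that hung with
# nothing to log would look the same as one that was powered off.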
############################################################
CRASH_86
############################################################
Sat Feb 24 15:53:09 PST 2001
vn10 down again, put in call to Bill, this is one of the machines
which had new power supply installed
Sat Feb 24 16:38:40 PST 2001
Power supply replaced
############################################################
CRASH_87
############################################################
Sun Feb 25 07:13:06 PST 2001
vn5's load average through the roof again (as per CRASH_81
Sun Dec 17 10:53:46 PST 2000) ... bad disk?
Feb 25 04:22:00 vn5 anacron[5907]: Updated timestamp for job `cron.weekly' to 2001-02-25
Feb 25 04:23:02 vn5 kernel: kmem_free: Bad obj addr (objp=9235a5a0, name=buffer_head)
Feb 25 04:23:02 vn5 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Feb 25 04:23:02 vn5 kernel: current->tss.cr3 = 00101000, %cr3 = 00101000
Feb 25 04:23:02 vn5 kernel: *pde = 00000000
Feb 25 04:23:02 vn5 kernel: Oops: 0002
Feb 25 04:23:02 vn5 kernel: CPU: 0
Feb 25 04:23:02 vn5 kernel: EIP: 0010:[kmem_cache_free+371/408]
Feb 25 04:23:02 vn5 kernel: EFLAGS: 00010286
Feb 25 04:23:02 vn5 kernel: eax: 0000003d ebx: 9235a5a0 ecx: 0000003b edx: 0000004b
Feb 25 04:23:02 vn5 kernel: esi: 9ffef740 edi: 00000286 ebp: 00000018 esp: 9ffcdf70
Feb 25 04:23:02 vn5 kernel: ds: 0018 es: 0018 ss: 0018
Feb 25 04:23:02 vn5 kernel: Process kswapd (pid: 5, process nr: 6, stackpage=9ffcd000)
Feb 25 04:23:02 vn5 kernel: Stack: 9235a660 80503570 9235a5fc 9ffef740 8012a9b5 9ffef740 9235a5a0 9235a5a0
Feb 25 04:23:02 vn5 kernel: 9235a660 8012b373 9235a5a0 9235a5a0 80503570 000007fe 00000030 9ffcc000
Feb 25 04:23:02 vn5 kernel: 8011fc9e 80503570 0000000d 00000006 80124f32 00000006 00000030 9ffcc000
Feb 25 04:23:02 vn5 kernel: Call Trace: [put_unused_buffer_head+33/76] [try_to_free_buffers+75/132] [shrink_mmap+218/304] [do_try_to_free_pages+66/160] [tvecs+7118/13280] [kswapd+103/156] [get_options+0/116]
Feb 25 04:23:02
vn5 kernel: [kernel_thread+35/48] Feb 25 04:23:02 vn5 kernel: Code: c7 05 00 00 00 00 00 00 00 00 eb 10 90 56 53 68 9e f4 1b 80 Feb 25 04:23:03 vn5 kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000001c Feb 25 04:23:03 vn5 kernel: current->tss.cr3 = 0fced000, %cr3 = 0fced000 Feb 25 04:23:03 vn5 kernel: *pde = 00000000 Feb 25 04:23:03 vn5 kernel: Oops: 0000 Feb 25 04:23:03 vn5 kernel: CPU: 1 Feb 25 04:23:03 vn5 Feb 25 09:02:39 vn5 syslogd 1.3-3: restart. ############################################################ CRASH_88 ############################################################ Tue Feb 27 18:46:31 PST 2001 vn5 down 5:51 Jason re-seated everything, if it goes down again we'll seriously consider swapping out the disk ############################################################ CRASH_89 ############################################################ vn32 down 12:44 Inaki re-booted, nothing obvious in log Mar 23 22:16:23 vn32 sshd[9289]: log: Connection from 142.103.237.225 port 985 Mar 23 22:16:24 vn32 sshd[9289]: log: RSA authentication for idle accepted. Mar 23 22:16:24 vn32 sshd[9291]: log: executing remote command as user idle Mar 23 22:16:25 vn32 sshd[9289]: log: Closing connection to 142.103.237.225 Mar 24 11:27:34 vn32 syslogd 1.3-3: restart. Mar 24 11:27:34 vn32 syslog: syslogd startup succeeded Mar 24 11:27:34 vn32 kernel: klogd 1.3-3, log source = /proc/kmsg started. Mar 24 11:27:34 vn32 kernel: Inspecting /boot/System.map-2.2.14-Psmp ############################################################ CRASH_90 ############################################################ vn25 down Jason rebooted, disk in bad shape ... 
Memory replaced, system back up with new disk Apr 20
############################################################
CRASH_91
############################################################
vn20
Also taken down
############################################################
CRASH_92
############################################################
Sat Apr 21 09:20:30 PDT 2001
vn20 down after secondary upgrade
############################################################
CRASH_93
############################################################
Mon Apr 23 14:19:28 PDT 2001
vn5 unresponsive with symptoms much as per CRASH_87
Memory replaced/Fan replaced
############################################################
CRASH_94
############################################################
Thu Apr 26 19:02:39 PDT 2001
vn43 down
Fri Apr 27 11:28:22 PDT 2001
# In machine room, hard-reboot ... Needed manual
fsck /dev/hda1
# Lots of errors, after reboot clock was off by an hour
ntptimeset
vnSetdate; hwclock --systohc; hwclock --show; ntptimeset
# Nothing obvious in log
Apr 26 18:36:17 vn43 PAM_pwdb[19584]: (rsh) session opened for user ghlim by (uid=0)
Apr 26 18:36:18 vn43 PAM_pwdb[19584]: (rsh) session closed for user ghlim
Apr 26 18:36:23 vn43 pam_rhosts_auth[19590]: allowed to ghlim@vnfe1.physics.ubc.ca as ghlim
Apr 26 18:36:23 vn43 PAM_pwdb[19590]: (rsh) session opened for user ghlim by (uid=0)
Apr 26 18:36:24 vn43 PAM_pwdb[19590]: (rsh) session closed for user ghlim
Apr 26 18:37:01 vn43 sshd[19596]: log: Connection from 142.103.237.226 port 1023
Apr 26 18:37:02 vn43 sshd[19596]: log: Rhosts with RSA host authentication accepted for roman, roman on vnfe2.physics.ubc.ca.
Apr 27 10:40:39 vn43 syslogd 1.3-3: restart.
Apr 27 10:40:39 vn43 syslog: syslogd startup succeeded
Apr 27 10:40:39 vn43 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Apr 27 10:40:39 vn43 kernel: Inspecting /boot/System.map-2.2.14-Psmp Apr 27 10:40:39 vn43 syslog: klogd startup succeeded ############################################################ CRASH_95, CRASH_96, CRASH_97 ############################################################ vn.physics.ubc.ca Compute Node Status: Mon Apr 30 09:15:00 PDT 2001 The following nodes are down: 1: vn13 down 9:39 2: vn26 down 9:38 3: vn31 down 9:54 #--------------------------------------------------------------------- # Hard reboot of vn13 #--------------------------------------------------------------------- vnSetdate; hwclock --systohc; hwclock --show; killall ntpd; ntpd; jj ntpd; ntptimeset # messages: Apr 29 23:30:00 vn13 CROND[2097]: (root) CMD ( /sbin/rmmod -as) Apr 29 23:30:00 vn13 CROND[2098]: (root) CMD (date >> /tmp/DATE) Apr 29 23:36:11 vn13 pam_rhosts_auth[2101]: allowed to suqin@vn4.physics.ubc.ca as suqin Apr 29 23:36:11 vn13 PAM_pwdb[2101]: (rsh) session opened for user suqin by (uid=0) Apr 29 21:32:20 vn13 portmap[1553]: connect from 169.237.91.88 to dump(): request from unauthorized host Name: qc2.ucdavis.edu Address: 169.237.91.88 # secure: Apr 29 23:36:11 vn13 in.rshd[2101]: connect from 142.103.237.4 #--------------------------------------------------------------------- # Hard reboot of vn26 #--------------------------------------------------------------------- vnSetdate; hwclock --systohc; hwclock --show; killall ntpd; ntpd; jj ntpd; ntptimeset # messages: Apr 29 23:30:00 vn26 CROND[25209]: (root) CMD ( /sbin/rmmod -as) Apr 29 23:30:00 vn26 CROND[25210]: (root) CMD (date >> /tmp/DATE) Apr 29 23:38:51 vn26 sshd[25141]: fatal: Connection closed by remote host. Apr 29 23:39:00 vn26 sshd[25213]: log: Connection from 142.103.17.37 port 1094 Apr 29 23:39:01 vn26 sshd[25213]: log: Password authentication for fengxs accepted. Apr 29 23:40:00 vn26 CROND[25233]: (root) CMD ( /sbin/rmmod -as) Apr 30 09:13:58 vn26 syslogd 1.3-3: restart. 
Apr 30 09:13:58 vn26 syslog: syslogd startup succeeded
# messages:
# secure:
Apr 29 23:40:05 vn26 in.rshd[25234]: connect from 142.103.237.4
#---------------------------------------------------------------------
# Hard reboot of vn31
#---------------------------------------------------------------------
vnSetdate; hwclock --systohc; hwclock --show; killall ntpd; ntpd; jj ntpd; ntptimeset
# messages:
Apr 29 23:23:38 vn31 PAM_pwdb[21411]: (login) session opened for user fengxs by (uid=0)
Apr 29 23:23:38 vn31 pam_console[21411]: can't find device or X11 socket to examine for 0
Apr 30 09:25:26 vn31 syslogd 1.3-3: restart.
Apr 30 09:25:26 vn31 syslog: syslogd startup succeeded
Apr 30 09:25:26 vn31 kernel: klogd 1.3-3, log source = /proc/kmsg started.
# secure:
Apr 29 23:23:33 vn31 in.telnetd[21410]: connect from 142.103.237.26
#---------------------------------------------------------------------
# From logs on vn4
#---------------------------------------------------------------------
Apr 29 23:28:52 vn4 PAM_pwdb[20586]: (login) session opened for user suqin by (uid=0)
Apr 29 23:28:52 vn4 pam_console[20586]: can't find device or X11 socket to examine for 0
# But these messages don't seem too unusual

Dear Suqin and Xiaosi:

Last evening just before 11:30PM, nodes vn16, vn26 and vn31 crashed.
The first to go down was vn31, and it looks like it went down just
after you telnet'ed in from vn26. At about the same time, it looks
like you were starting an MPI job which included vn16, Suqin.

Was there anything unusual about the jobs which you tried to start
last evening and/or did you notice any behaviour out of the ordinary?

Thanks ... Matt
############################################################
CRASH_98
############################################################
Sat Jun 9 12:07:17 PDT 2001
vn64 down, switch has been picking up errors since May 26,
will get replacement NIC from Bill.
Mon Jun 11 14:27:48 PDT 2001 # Swapped out card ############################################################ CRASH_99 ############################################################ Wed Jun 20 14:03:12 PDT 2001 Had to reboot vnfe3 due to un-resolved system problem ... Too many open files in system? ############################################################ CRASH_100 ############################################################ Thu Jun 21 08:47:43 PDT 2001 1: vn52 down 10:47 Hard reboot in machine room # Suqin's MPI job? Jun 20 21:44:49 vn52 sshd[18237]: log: Connection from 142.103.237.225 port 880 Jun 20 21:44:49 vn52 sshd[18237]: log: RSA authentication for idle accepted. Jun 20 21:44:49 vn52 sshd[18239]: log: executing remote command as user idle Jun 20 21:44:50 vn52 sshd[18237]: log: Closing connection to 142.103.237.225 Jun 20 21:45:00 vn52 CROND[18259]: (root) CMD (date >> /tmp/DATE) Jun 20 21:46:18 vn52 pam_rhosts_auth[18261]: allowed to suqin@vn5.physics.ubc.ca as suqin Jun 20 21:46:18 vn52 PAM_pwdb[18261]: (rsh) session opened for user suqin by (uid=0) Jun 21 08:17:00 vn52 syslogd 1.3-3: restart. Jun 21 08:17:00 vn52 syslog: syslogd startup succeeded Jun 21 08:17:00 vn52 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jun 21 08:17:00 vn52 kernel: Inspecting /boot/System.map-2.2.14-Psmp Jun 21 08:17:00 vn52 syslog: klogd startup succeeded Jun 21 08:17:00 vn52 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Jun 21 08:17:00 vn52 kernel: Symbols match kernel version 2.2.14. ############################################################ CRASH_101 ############################################################ Sat Aug 4 17:36:31 PDT 2001 vn48 down 3:06 # Pingable, not responsive # Hard reboot # Suqin *may* have been on, but not obviously. 
No users other than idle # In Rtop files since 2001:08:04:0646 Aug 4 14:30:19 vn48 sshd[3827]: log: Connection from 142.103.237.225 port 1019 Aug 4 14:30:20 vn48 sshd[3827]: log: RSA authentication for idle accepted. Aug 4 14:30:20 vn48 sshd[3829]: log: executing remote command as user idle Aug 4 14:30:21 vn48 sshd[3827]: log: Closing connection to 142.103.237.225 Aug 4 14:30:51 vn48 PAM_pwdb[3849]: (login) session opened for user suqin by (uid=0) Aug 4 14:30:51 vn48 pam_console[3849]: can't find device or X11 socket to examine for 0 Aug 4 16:42:05 vn48 syslogd 1.3-3: restart. Aug 4 16:42:05 vn48 syslog: syslogd startup succeeded Aug 4 16:42:05 vn48 kernel: klogd 1.3-3, log source = /proc/kmsg started. Aug 4 16:42:05 vn48 kernel: Inspecting /boot/System.map-2.2.14-Psmp Aug 4 16:42:05 vn48 syslog: klogd startup succeeded Aug 4 16:42:05 vn48 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Aug 4 16:42:05 vn48 kernel: Symbols match kernel version 2.2.14. ############################################################ CRASH_102 ############################################################ Wed Sep 5 12:07:32 PDT 2001 (1) Everything down after power failure 06:30-10:15 vnfe1's second disk apparently fried Got it back working after taking it out, reinstalling ############################################################ CRASH_103 ############################################################ Mon Sep 10 08:42:23 PDT 2001 (1) vn43 apparently self-re-booted at about 06:30, nothing apparent in logs Luis running on it at the time ############################################################ CRASH_104 ############################################################ Mon Oct 1 09:09:55 PDT 2001 vn43 down 7:50 (1) vn43 pingable but otherwise incommunicado Oct 1 01:13:24 vn43 kernel: swap_free: offset exceeds max Oct 1 01:13:24 vn43 last message repeated 10 times Oct 1 01:13:24 vn43 kernel: swap_duplicate: entry 80000000, offset exceeds max Oct 1 01:13:24 vn43 kernel: VM: killing 
process sed Oct 1 01:13:24 vn43 kernel: swap_free: offset exceeds max Oct 1 01:13:24 vn43 kernel: swap_free: offset exceeds max Oct 1 01:13:26 vn43 sshd[32378]: log: Closing connection to 142.103.237.225 ############################################################ CRASH_105 ############################################################ Sun Nov 4 08:53:33 PST 2001 vn4 down 17:36 (1) vn4 pingable but otherwise incommunicado Hard reboot ... nothing obvious in logs Nov 3 15:04:22 vn4 sshd[23807]: log: Closing connection to 142.103.237.225 Nov 3 15:10:00 vn4 CROND[23829]: (root) CMD ( /sbin/rmmod -as) Nov 3 15:13:23 vn4 sshd[23830]: log: Connection from 142.103.237.225 port 1017 Nov 3 15:13:23 vn4 sshd[23830]: log: RSA authentication for idle accepted. Nov 3 15:13:23 vn4 sshd[23832]: log: executing remote command as user idle Nov 3 15:13:26 vn4 sshd[23830]: log: Closing connection to 142.103.237.225 Nov 3 15:15:00 vn4 CROND[23852]: (root) CMD (date >> /tmp/DATE) Nov 3 15:20:00 vn4 CROND[23855]: (root) CMD ( /sbin/rmmod -as) Nov 4 10:59:24 vn4 syslogd 1.3-3: restart. Nov 4 10:59:24 vn4 syslog: syslogd startup succeeded Nov 4 10:59:24 vn4 kernel: klogd 1.3-3, log source = /proc/kmsg started. Nov 4 10:59:24 vn4 kernel: Inspecting /boot/System.map-2.2.14-Psmp Nov 4 10:59:24 vn4 syslog: klogd startup succeeded Nov 4 10:59:24 vn4 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Nov 4 10:59:24 vn4 kernel: Symbols match kernel version 2.2.14. Nov 4 10:59:24 vn4 kernel: Loaded 124 symbols from 6 modules. 
Nov 4 10:59:24 vn4 kernel: Linux version 2.2.14-Psmp (root@vn1.physics.ubc.ca) (gcc version pgcc-2.91.66 19990314 (egcs-1.1.2 release)) #1 SMP Fri Mar 17 23:30:57 PST 2000 Nov 4 10:59:24 vn4 kernel: Intel MultiProcessor Specification v1.1 ############################################################ CRASH_106 ############################################################ Thu Nov 15 11:42:40 PST 2001 vn43 down at about 1:00 AM last evening Nov 15 01:04:07 vn43 sshd[31560]: log: Connection from 142.103.237.43 port 1023 Nov 15 01:04:07 vn43 sshd[31560]: log: Rhosts with RSA host authentication accepted for fransp, fransp on vn43.physics.ubc.ca. Nov 15 01:04:07 vn43 sshd[31562]: log: executing remote command as user fransp Nov 15 01:04:07 vn43 sshd[31560]: log: Closing connection to 142.103.237.43 Nov 15 11:17:51 vn43 syslogd 1.3-3: restart. Nov 15 11:17:51 vn43 syslog: syslogd startup succeeded Nov 15 11:17:51 vn43 kernel: klogd 1.3-3, log source = /proc/kmsg started. Nov 15 11:17:51 vn43 kernel: Inspecting /boot/System.map-2.2.14-Psmp Nov 15 11:17:51 vn43 syslog: klogd startup succeeded ############################################################ CRASH_107 ############################################################ Wind storm last night with gusts to 100 km/h multiple power outages from 3AM - 3PM Everything down Fri Dec 14 18:02:03 PST 2001 Everything back vnNbgCommand vnSetdate vnNbgCommand ntpd vnNbgCommand hwclock --systohc vnallCommand ntptimeset > /tmp/ntp vnallbgCommand 'killall rwhod; rwhod' ############################################################ CRASH_108 ############################################################ vn.physics.ubc.ca Compute Node Status: Tue Dec 18 09:45:00 PST 2001 The following nodes are down: 1: vn51 down 7:59 # Hard reboot Dec 18 01:45:00 vn51 CROND[29740]: (root) CMD (date >> /tmp/DATE) Dec 18 01:48:06 vn51 sshd[29743]: Accepted rsa for idle from 142.103.237.225 port 4971 Dec 18 01:48:06 vn51 modprobe: can't locate module net-pf-10 Dec 
18 10:01:49 vn51 syslogd 1.3-3: restart. Dec 18 10:01:49 vn51 syslog: syslogd startup succeeded Dec 18 10:01:49 vn51 kernel: klogd 1.3-3, log source = /proc/kmsg started. ############################################################ CRASH_109 ############################################################ vn.physics.ubc.ca Compute Node Status: Sun Jan 20 08:45:00 PST 2002 The following nodes are down: 1: vn62 down 9:52 Pingable but otherwise incommunicado Plischke was running 2 jobs at the time, but may have been something zheqiong was doing which caused the crash Jan 19 22:40:00 vn62 CROND[10241]: (root) CMD ( /sbin/rmmod -as) Jan 19 22:42:22 vn62 sshd[10242]: Accepted password for zheqiong from 24.83.22.197 port 1164 Jan 19 22:42:22 vn62 modprobe: can't locate module net-pf-10 Jan 19 22:42:22 vn62 last message repeated 2 times Jan 19 22:45:00 vn62 CROND[10259]: (root) CMD (date >> /tmp/DATE) Jan 19 22:45:36 vn62 sshd[10261]: Accepted rsa for idle from 142.103.237.225 port 1744 Jan 19 22:45:36 vn62 modprobe: can't locate module net-pf-10 Jan 19 22:45:37 vn62 last message repeated 3 times Jan 19 22:50:00 vn62 CROND[10286]: (root) CMD ( /sbin/rmmod -as) Jan 19 22:53:00 vn62 sshd[10287]: Accepted rsa for idle from 142.103.237.225 port 1811 Jan 19 22:53:00 vn62 modprobe: can't locate module net-pf-10 Jan 19 22:53:00 vn62 last message repeated 3 times Jan 20 09:46:39 vn62 syslogd 1.3-3: restart. Jan 20 09:46:39 vn62 syslog: syslogd startup succeeded Jan 20 09:46:39 vn62 kernel: klogd 1.3-3, log source = /proc/kmsg started. Jan 20 09:46:39 vn62 kernel: Inspecting /boot/System.map-2.2.14-Psmp Jan 20 09:46:39 vn62 syslog: klogd startup succeeded ############################################################ CRASH_110 ############################################################ (1) vn43 down (frans running v. 
large memory job) 118: vn43 5 root 20 0 0 0 0 RW 0 28.6 0.0 25:52 kswapd 119: vn43 23461 fransp 20 0 428M 424M 812 R 0 19.7 84.0 24:15 graxi_ad_F Log full of kernel messages: Warning: Permanently added 'vn43.physics.ubc.ca' (RSA1) to the list of known hosts. Apr 8 07:01:02 vn43 last message repeated 3 times Apr 8 07:01:02 vn43 modprobe: can't locate module net-pf-10 Apr 8 07:01:03 vn43 kernel: stuck on TLB IPI wait (CPU#1) Apr 8 07:01:03 vn43 last message repeated 3 times Apr 8 07:01:02 vn43 CROND[14550]: (root) CMD (run-parts /etc/cron.hourly) Apr 8 07:01:03 vn43 kernel: stuck on TLB IPI wait (CPU#1) Apr 8 07:01:05 vn43 last message repeated 14 times Apr 8 07:01:05 vn43 anacron[14563]: Updated timestamp for job `cron.hourly' to 2002-04-08 Apr 8 07:09:23 vn43 sshd[14578]: Accepted rsa for idle from 142.103.237.225 port 4551 Apr 8 07:09:23 vn43 modprobe: can't locate module net-pf-10 Apr 8 07:10:00 vn43 kernel: stuck on TLB IPI wait (CPU#1) Apr 8 07:10:00 vn43 last message repeated 9 times SWAP OUT MEMORY NEXT TIME (ORDER MEMORY FROM BILL?) ############################################################ CRASH_111 ############################################################ Tue May 7 23:40:14 PDT 2002 (1) vn9 had kernel error May 5, needed reboot, Scott reports The reboot didn't work. After I /sbin/reboot , it came up , checked the disk, then said that it had problems with the check, then gave me two choices: a) give root password for maintenance; or b) CTRL-D for normal bootup. When I do either, it just reboots the machine, thus doing neither. The next time it comes up, it has to check the disk again, and again it fails. I've done this about 5 times now, with no change. Everytime, it complains about "Duplicate/bad blocks" at various inodes on /dev/hda1. Wed May 8 08:56:23 PDT 2002 (1) In machine room, boot vn9 lilo: linux_up single Automatic check of /dev/hda1 fails, enter root password, get prompt fsck /dev/hda1 fails with kernel panic. Best to fix from other machine. 
Disk seems OK, replaced memory, OK. ############################################################ CRASH_112 ############################################################ Mon June 17 (1) Scott reboots vnfe3 ############################################################ CRASH_113 ############################################################ Wed Jul 17 07:26:09 PDT 2002 1: vn2 down 11:55 Wed Jul 17 08:31:01 PDT 2002 In machine room, presumably will hard reboot cooperon.physics.ubc.ca:/export/data24/cooperon still getting mounted, need to delete Brought up single, removed /etc/fstab entry, seems OK ntptimeset vnSetdate ntptimeset # Absolutely nothing in logs Jul 16 19:12:57 vn2 modprobe: can't locate module net-pf-10 Jul 16 19:12:57 vn2 last message repeated 4 times Jul 16 19:15:00 vn2 CROND[6915]: (root) CMD (date >> /tmp/DATE) Jul 16 19:20:00 vn2 CROND[6928]: (root) CMD ( /sbin/rmmod -as) Jul 16 19:20:52 vn2 sshd[6929]: Accepted publickey for idle from 142.103.237.225 port 2640 ssh2 Jul 16 19:20:52 vn2 modprobe: can't locate module net-pf-10 Jul 16 19:20:52 vn2 last message repeated 4 times Jul 17 07:43:38 vn2 syslogd 1.3-3: restart. Jul 17 07:43:38 vn2 syslog: syslogd startup succeeded Jul 17 07:43:38 vn2 kernel: klogd 1.3-3, log source = /proc/kmsg started. grep 'vn2 ' 2002:07:16:1904.59 72: vn2 6201 ghlim 16 0 1244 1244 436 R 0 48.6 0.2 80:55 rest4 92: vn2 27696 plischke 16 0 36064 35M 472 R 0 48.1 6.9 4456m bincl48 # Mailed ghlim and plischke ############################################################ CRASH_114 ############################################################ Thu Sep 12 20:47:19 PDT 2002 vn27 down 0:30 Overduin running on both processors at time, sent e-mail Fri Sep 13 08:30:46 PDT 2002 # In machine room, no lights on vn27, but fan still running. # Hard reset with keyboard, terminal Dead in the water, fan on, but otherwise no indication of power. Taking back to Hennings, will have Scott/Jason look at it (?), but then call Varsity. 
Mon Sep 16 16:54:16 PDT 2002 Power supply, replaced; vn27 re-inserted and apparently OK although vnfe1 partitions cannot be mounted due to existing problem with vnfe1's exports. ############################################################ CRASH_115 ############################################################ Tue Oct 1 08:26:58 PDT 2002 vn43 down 11:39 Hard reboot in machine room Needed manual fsck /dev/hda1 Nothing obvious in logs, but this is the 7th time vn43 has crashed, MEMORY? ############################################################ CRASH_116 ############################################################ Tue Oct 29 13:25:21 PST 2002 (1) vnfe1 hung at about 12 noon. Hard reboot OK, and fixed NFS mounting problems. Still some problems with df on certain nodes, probably due to hard-mounts of cooperon FS's in /etc/mtab Nothing obvious in logs Oct 29 11:42:12 vnfe1 sshd[31086]: Accepted publickey for murray from 142.103.237.24 port 3543 ssh2 Oct 29 11:44:06 vnfe1 sshd[31241]: Accepted publickey for matt from 142.103.237.225 port 3214 ssh2 Oct 29 11:52:05 vnfe1 sshd[31520]: Accepted publickey for matt from 142.103.237.225 port 3287 ssh2 Oct 29 13:18:23 vnfe1 syslogd 1.3-3: restart. Oct 29 13:18:23 vnfe1 syslog: syslogd startup succeeded Oct 29 13:18:23 vnfe1 kernel: klogd 1.3-3, log source = /proc/kmsg started. 
Oct 29 13:18:23 vnfe1 kernel: Inspecting /boot/System.map-2.2.14-Psmp Oct 29 13:18:23 vnfe1 syslog: klogd startup succeeded ############################################################ CRASH_117 ############################################################ Tue Oct 29 13:25:21 PST 2002 (1) vn24 also incommunicado Nothing apparent in logs Oct 29 13:50:57 vn24 modprobe: can't locate module net-pf-10 Oct 29 13:51:13 vn24 sshd[27068]: Accepted publickey for root from 142.103.237.225 port 1753 ssh2 Oct 29 13:51:13 vn24 modprobe: can't locate module net-pf-10 Oct 29 13:52:20 vn24 sshd[27085]: Accepted publickey for root from 142.103.237.225 port 1817 ssh2 Oct 29 13:52:20 vn24 modprobe: can't locate module net-pf-10 Oct 29 14:03:20 vn24 syslogd 1.3-3: restart. Oct 29 14:03:20 vn24 syslog: syslogd startup succeeded Oct 29 14:03:20 vn24 kernel: klogd 1.3-3, log source = /proc/kmsg started. Oct 29 14:03:20 vn24 kernel: Inspecting /boot/System.map-2.2.14-Psmp Oct 29 14:03:20 vn24 syslog: klogd startup succeeded ############################################################ CRASH_118 ############################################################ Thu Oct 31 07:26:35 PST 2002 (1) vn40 un-pingable Hard re-boot in machine room OK Kernel problem? 
Oct 31 03:46:36 vn40 sshd[15060]: Accepted publickey for idle from 142.103.237.225 port 1212 ssh2 Oct 31 03:46:36 vn40 modprobe: can't locate module net-pf-10 Oct 31 03:50:00 vn40 CROND[15082]: (root) CMD ( /sbin/rmmod -as) Oct 31 03:54:20 vn40 sshd[15083]: Accepted publickey for idle from 142.103.237.225 port 1279 ssh2 Oct 31 03:54:21 vn40 modprobe: can't locate module net-pf-10 Oct 31 04:00:00 vn40 kernel: stuck on TLB IPI wait (CPU#1) Oct 31 04:00:00 vn40 last message repeated 15 times Oct 31 04:00:00 vn40 CROND[15106]: (root) CMD ( /sbin/rmmod -as) Oct 31 04:00:00 vn40 kernel: stuck on TLB IPI wait (CPU#1) Oct 31 04:00:00 vn40 last message repeated 14 times Oct 31 04:00:00 vn40 CROND[15107]: (root) CMD (date >> /tmp/DATE) ############################################################ CRASH_119 ############################################################ Thu Nov 28 19:26:30 PST 2002 vn34 down 1:06 Thu Nov 28 21:29:56 PST 2002 In machine room, hard-reboot, OK Nothing obvious in logs Nov 28 18:13:03 vn34 modprobe: can't locate module net-pf-10 Nov 28 18:13:03 vn34 modprobe: can't locate module net-pf-10 Nov 28 18:20:00 vn34 CROND[1584]: (root) CMD ( /sbin/rmmod -as) Nov 28 18:21:18 vn34 sshd[1587]: Accepted publickey for idle from 142.103.237.225 port 1768 ssh2 Nov 28 18:21:18 vn34 modprobe: can't locate module net-pf-10 Nov 28 18:21:18 vn34 modprobe: can't locate module net-pf-10 Nov 28 21:39:18 vn34 syslogd 1.3-3: restart. Nov 28 21:39:18 vn34 syslog: syslogd startup succeeded Nov 28 21:39:18 vn34 kernel: klogd 1.3-3, log source = /proc/kmsg started. ############################################################ CRASH_120 ############################################################ Tue Dec 3 09:13:48 PST 2002 vn47 down 2:15 Tue Dec 3 10:14:21 PST 2002 In machine room, hard re-boot, OK Nothing obvious in logs. 
Dec 3 06:56:44 vn47 sshd[18666]: Accepted publickey for idle from 142.103.237.225 port 1677 ssh2
Dec 3 06:56:44 vn47 modprobe: can't locate module net-pf-10
Dec 3 06:56:44 vn47 modprobe: can't locate module net-pf-10
Dec 3 07:00:00 vn47 CROND[18689]: (root) CMD ( /sbin/rmmod -as)
Dec 3 07:01:00 vn47 CROND[18691]: (root) CMD (run-parts /etc/cron.hourly)
Dec 3 07:01:01 vn47 anacron[18694]: Updated timestamp for job `cron.hourly' to 2002-12-03
Dec 3 10:24:20 vn47 syslogd 1.3-3: restart.
Dec 3 10:24:20 vn47 syslog: syslogd startup succeeded
Dec 3 10:24:20 vn47 kernel: klogd 1.3-3, log source = /proc/kmsg started.
Dec 3 10:24:20 vn47 kernel: Inspecting /boot/System.map-2.2.14-Psmp
Dec 3 10:24:20 vn47 syslog: klogd startup succeeded
Dec 3 10:24:20 vn47 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp.

############################################################
CRASH_121, CRASH_122
############################################################
Unable to ssh into vn14, vn15

Mon Dec 30 12:14:24 PST 2002
In machine room, hard reboot of vn14, vn15

vn14 /var/log/messages

Dec 25 04:02:01 vn14 syslogd 1.3-3: restart.
Dec 25 04:10:00 vn14 crond[365]: (CRON) error (can't fork)
Dec 25 04:20:00 vn14 crond[365]: (CRON) error (can't fork)
Dec 25 04:30:00 vn14 crond[365]: (CRON) error (can't fork)
Dec 25 04:40:00 vn14 crond[365]: (CRON) error (can't fork)

vn15 /var/log/messages

Dec 25 17:34:39 vn15 modprobe: can't locate module net-pf-10
Dec 25 17:39:00 vn15 sshd[19200]: Accepted publickey for scn from 142.103.234.165 port 58248 ssh2
Dec 25 17:39:00 vn15 sshd[19200]: Disconnecting: fork failed: Resource temporarily unavailable
Dec 25 17:39:00 vn15 kernel: request_module[net-pf-10]: fork failed, errno 11
Dec 25 17:39:00 vn15 kernel: request_module[net-pf-10]: fork failed, errno 11
Dec 25 17:39:00 vn15 sshd[19201]: Accepted publickey for scn from 142.103.234.165 port 58249 ssh2

############################################################
CRASH_123
############################################################
Tue May 20 15:59:17 PDT 2003

vnfe3 had problems, required hard re-boot, and there may be a
CPU problem ...

We're having some troubles with vnfe3. Kevin first alerted me to the
problem this afternoon when he noticed that the tape drive was not
responding since there was a zombie 'tar cvf /dev/tape' that could not
be killed. After trying everything I could think of (even tried and
failed to kill the 'rmt' processes), I "hard ejected" the tape by
holding down on the eject button. This ejected the tape, but when I
went to ssh to it I couldn't; nor could I ping it. I plugged the
monitor and keyboard into it and found that it was completely
unresponsive to keyboard input and the monitor displayed many "memory
addresses" a la (<4324234482> <43284234332> <0053445893> ...) or
something like that.
Then, when I rebooted it (a hard reboot), I noticed that immediately after the SCSI BIOS initialization and right before the LILO prompt, the following message was displayed for about 0.5 seconds (it's not verbatim, but I believe it's what is displayed): Error: Processor 1 Error: Processor 2 Error: Processor 1 Failed FRB level 3 timer Error: Processor 2 Failed FRB level 3 timer It booted up fine after checking the disk. I noticed many messages like the following in the logs: May 18 04:02:03 vnfe3 kernel: lockd: couldn't bind to server 142.103.237.225 - retrying. May 18 04:02:38 vnfe3 last message repeated 14 times May 18 04:03:43 vnfe3 last message repeated 26 times May 18 04:04:48 vnfe3 last message repeated 26 times May 18 04:05:53 vnfe3 last message repeated 26 times May 18 04:06:43 vnfe3 last message repeated 21 times May 18 04:06:46 vnfe3 kernel: scsi : aborting command due to timeout : pid 142320027, scsi1, channel 0, id 0, lun 0 Read Block Limits 00 00 00 00 00 May 18 04:06:48 vnfe3 kernel: lockd: couldn't bind to server 142.103.237.225 - retrying. Sorry if I unwittingly caused the problem, but I think it may be a CPU problem because of the message at bootup. I can't search the net from vnfe4 for some reason, so I'll try to look up this message when I go back to bh7. Hopefully, it's not a big problem... 
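The `last message repeated N times` lines in the vnfe3 excerpt above hide how often lockd actually retried. What follows is a minimal sketch that folds those repeat counts back into per-message totals. It assumes the usual syslog layout (month, day, time, host, message) and a single host per excerpt. The name `tally_messages` and the `SAMPLE` list are illustrative only and are not part of the vn tool set.

```python
import re
from collections import Counter

# month, day, hh:mm:ss, host, then the message text
LINE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")
REPEAT = re.compile(r"^last message repeated (\d+) times$")

# Lines copied from the CRASH_123 excerpt.
SAMPLE = [
    "May 18 04:02:03 vnfe3 kernel: lockd: couldn't bind to server 142.103.237.225 - retrying.",
    "May 18 04:02:38 vnfe3 last message repeated 14 times",
    "May 18 04:03:43 vnfe3 last message repeated 26 times",
    "May 18 04:04:48 vnfe3 last message repeated 26 times",
    "May 18 04:05:53 vnfe3 last message repeated 26 times",
    "May 18 04:06:43 vnfe3 last message repeated 21 times",
    "May 18 04:06:46 vnfe3 kernel: scsi : aborting command due to timeout : "
    "pid 142320027, scsi1, channel 0, id 0, lun 0 Read Block Limits 00 00 00 00 00",
    "May 18 04:06:48 vnfe3 kernel: lockd: couldn't bind to server 142.103.237.225 - retrying.",
]


def tally_messages(lines):
    """Count each message, crediting 'repeated N times' to the previous message."""
    counts = Counter()
    previous = None
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # not a syslog line (banner, comment, blank)
        message = match.group(5)
        repeat = REPEAT.match(message)
        if repeat:
            if previous is not None:
                counts[previous] += int(repeat.group(1))
        else:
            counts[message] += 1
            previous = message
    return counts


for message, total in tally_messages(SAMPLE).most_common():
    print(total, message)
```

On this excerpt the lockd retry message accounts for 115 occurrences against a single SCSI timeout, so the NFS lock problem was already well under way before the tape command was aborted.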
############################################################
CRASH_124
############################################################
Wed Jun 25 09:11:45 PDT 2003

Nodes vn49-vn64 inclusive have been down for about a day

vn.physics.ubc.ca Compute Node Status: Wed Jun 25 09:00:00 PDT 2003

The following nodes are down:

 1: vn49 down 21:09
 2: vn50 down 21:10
 3: vn51 down 21:08
 4: vn52 down 21:09
 5: vn53 down 21:10
 6: vn54 down 21:08
 7: vn55 down 21:08
 8: vn56 down 21:10
 9: vn57 down 21:08
10: vn58 down 21:08
11: vn59 down 21:09
12: vn60 down 21:10
13: vn61 down 21:08
14: vn62 down 21:10
15: vn63 down 21:10
16: vn64 down 21:09

Suspect problem with UPS and/or associated circuit.

Hey Matt,

So, Kevin and I went over to the cluster and found that, indeed,
vn49-vn64 were turned off. The workers removing the asbestos had
unknowingly flipped the circuit breaker of one of the UPS's. I informed
them what happened and they promised to be more careful. Dave said that
we should report any damage, loss of time or resources to him so that
he can charge the contractors for it.

scott n.

# Hacked vnN to generate vn49 - vn64
vnCommand vnSetdate
vnCommand ntpd
vnCommand ntptimeset
vnCommand df
OK

############################################################
CRASH_125
############################################################
Mon Jun 30 09:44:50 PDT 2003

vn14 down 2+05:00

Scott N hard rebooted---nothing apparent in logs

There was a signal to the monitor, but nothing was displayed so I did a
hard reboot. It checked the disk fine and I can't find anything wrong
with it by scanning through the logs a little.

??

############################################################
CRASH_126
############################################################
Wed Jul 2 09:12:39 PDT 2003

vn25 down 1:04

From cwlai@warp.physics.ubc.ca Wed Jul 2 18:42:22 2003

Hi Matt,

For the record, here's a summary of what happened to vn25 today:

I went over to check vn25 this morning at around 10:30am.
The connection seemed fine and the light on the network card looked the
same as on the other nodes. However, I couldn't ssh to or ping vn25, so
I did a hard reboot of vn25. It didn't fix the problem. After talking
to Scott I found the extension cords and connected the monitor and
keyboard to vn25. There were a couple of failures on reboot, and I
couldn't use '/bin/ls' or 'which', although cd and pwd worked fine.

We took vn25 over to Henn 414 and tried to boot it with the rescue
disk. But the error message

   Kernel panic: No init found. Try passing init= option to kernel.

showed up, as Scott reported in the previous email. I used a linux
distribution which fits on a single floppy to boot up the system and
mounted the harddrive. After checking the log messages and finding no
clues as to what might have caused the problem, we ran e2fsck to fix
filesystem errors. But the same failure as in the morning appeared when
we rebooted from the harddrive.

Finally we did a clean installation of Mandrake 6.1, and I installed
ssh to /usr/local and ran sshd on it. Scott added the daemon to
/etc/rc.d/rc.local and now we can ssh to vn25 as root, and only root
can ssh to vn25 right now since that's the only account created on the
node.

Let us know if there is any follow up we need to do.

# Old version of sshd is installed, need to perform secondary set-up,
# installation, but will defer for the time-being

############################################################
CRASH_127
############################################################
Mon July 7

Electricians threw a breaker again, took down nodes vnfe[13], vn1-vn16

All back OK

############################################################
CRASH_128, CRASH_129, CRASH_130
############################################################
vn9 vn16 vn39 down after cluster relocation with bad power supplies?
############################################################ CRASH_131 ############################################################ Mon Jul 14 10:28:14 PDT 2003 UPS powering vn7-vn8, vn18-vn29 went off-line, Dave Jones diagnosed as twist connector not being locked. vn25 didn't come back, apparently same problem as previously. Should reinstall on spare disk. ############################################################ CRASH_132 ############################################################ Fri Jul 25 08:32:56 PDT 2003 After power shutdown, noticed that vn16 wasn't booting, Varsity picked it up, diagnosed as BIOS problem, reloaded defaults, and now it appears to boot OK Updated vnN vnDistEtc motd csh.cshrc passwd shadow hosts.allow mkdir -p /d/vnfe4/home ############################################################ CRASH_133 ############################################################ Tue Sep 30 16:02:54 PDT 2003 vn25 down 2:17 Hard reboot in machine room Back OK Nothing apparent in logs. Kendal running on it at the time? ############################################################ CRASH_134 ############################################################ Thu Oct 9 09:40:30 PDT 2003 Can't ssh into vn25, although can telnet to it, reboot came back OK ############################################################ CRASH_135 ############################################################ Wed Oct 15 08:41:23 PDT 2003 vn25 down 11:31 Can't ping ... In machine room hard reboot Needed to manually fsck /dev/hda1 Oct 14 12:01:00 vn25 CROND[1285]: (root) CMD (run-parts /etc/cron.hourly) Oct 14 12:01:01 vn25 kernel: free_one_pmd: bad directory entry 00000002 Oct 14 12:01:01 vn25 kernel: free_one_pmd: bad directory entry 00000002 Oct 14 12:01:01 vn25 anacron[1288]: Updated timestamp for job `cron.hourly' to 2003-10-14 Oct 14 12:10:00 vn25 CROND[1293]: (root) CMD ( /sbin/rmmod -as) ... so may be some problem with disk/IO subsystem? 
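Most entries in this log come down to reading /var/log/messages around routine cron and modprobe chatter. A minimal sketch of filtering that chatter so that lines like the `free_one_pmd` errors above stand out. The `ROUTINE` pattern list is an assumption drawn from messages that recur in the excerpts recorded here. sshd logins are deliberately not treated as routine, since some entries care about who was logged in. `unusual_lines` and `SAMPLE` are hypothetical names.

```python
import re

# Assumed "routine" patterns: they recur in nearly every excerpt in this log.
ROUTINE = (
    re.compile(r"CROND\[\d+\]:"),
    re.compile(r"anacron\[\d+\]:"),
    re.compile(r"modprobe: can't locate module net-pf-10"),
)

# Lines copied from the CRASH_135 excerpt.
SAMPLE = [
    "Oct 14 12:01:00 vn25 CROND[1285]: (root) CMD (run-parts /etc/cron.hourly)",
    "Oct 14 12:01:01 vn25 kernel: free_one_pmd: bad directory entry 00000002",
    "Oct 14 12:01:01 vn25 kernel: free_one_pmd: bad directory entry 00000002",
    "Oct 14 12:01:01 vn25 anacron[1288]: Updated timestamp for job `cron.hourly' to 2003-10-14",
    "Oct 14 12:10:00 vn25 CROND[1293]: (root) CMD ( /sbin/rmmod -as)",
]


def unusual_lines(lines):
    """Return the non-blank lines that match none of the ROUTINE patterns."""
    kept = []
    for line in lines:
        text = line.strip()
        if text and not any(pattern.search(text) for pattern in ROUTINE):
            kept.append(text)
    return kept


for line in unusual_lines(SAMPLE):
    print(line)
```

The filter only hides known noise. An empty result corresponds to "nothing obvious in logs" and does not rule out a hardware fault.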
############################################################ CRASH_136 ############################################################ Mon Oct 20 08:52:42 PDT 2003 vn25 down Can't ping ... In machine room hard reboot Needed to manually fsck /dev/hda1 Several messages such as Oct 19 19:19:28 vn25 kernel: free_one_pmd: bad directory entry 00000001 again (possibly) implicating disk? ############################################################ CRASH_137 ############################################################ vn25 down 15:35 Time to replace disk? Wed Nov 19 13:36:20 PST 2003 # vn25 had hard drive replaced, needs re-installation ############################################################ CRASH_138 ############################################################ Thu Nov 6 13:55:25 PST 2003 (1) vn44 down, no disk light, need to get Varsity to look into it as well as vn25 Wed Nov 19 13:36:05 PST 2003 # vn44 had power supply and front fan replaced. Back up. ############################################################ CRASH_139 ############################################################ Thu Nov 6 13:55:28 PST 2003 (1) Accidentally powered down vn45 ############################################################ CRASH_140 ############################################################ Mon Nov 17 09:37:53 PST 2003 (1) vn17 down, no disk or power light, need to send to Varsity Hi Jody: In addition to node032 in the new cluster, we have three bad nodes in the old cluster that we'd like you guys to have a look at it. They are vn17 - no power light or disk light vn25 - have had recurrent problems with disk, possibly needs a new one? vn44 - no disk light I've pulled them out of the cluster and left them by the A/C unit so that they can be picked up when you next come out to the machine room. Thanks ... Matt Wed Nov 19 13:34:22 PST 2003 # vn17 had power supply and front fan replaced. Back up. 
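Status lines such as `vn25 down 11:31` or `vn14 down 2+05:00` give the elapsed downtime, not the time of failure. A minimal sketch of converting them, assuming the figure after `down` reads as `[days+]hours:minutes` in the style of ruptime output. `parse_downtime` and `estimated_failure` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta

# Optional "days+" prefix, then hours:minutes, e.g. "11:31" or "2+05:00".
DOWNTIME = re.compile(r"^(?:(\d+)\+)?(\d{1,2}):(\d{2})$")


def parse_downtime(text):
    """Convert '[days+]hours:minutes' into a timedelta."""
    match = DOWNTIME.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized downtime: {text!r}")
    days, hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


def estimated_failure(report_time, downtime):
    """Subtract the reported downtime from the time the report was written."""
    return report_time - parse_downtime(downtime)


# CRASH_135: report stamped Wed Oct 15 08:41:23 PDT 2003, "vn25 down 11:31"
print(estimated_failure(datetime(2003, 10, 15, 8, 41, 23), "11:31"))
# CRASH_125: report stamped Mon Jun 30 09:44:50 PDT 2003, "vn14 down 2+05:00"
print(estimated_failure(datetime(2003, 6, 30, 9, 44, 50), "2+05:00"))
```

The result is only an estimate of when the node stopped answering. It is a starting point for choosing which stretch of /var/log/messages to read.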
############################################################ CRASH_141 ############################################################ Sat Nov 29 17:29:40 PST 2003 (1) vn25 crashed during secondary installation ############################################################ CRASH_142 ############################################################ Fri Dec 12 11:05:01 PST 2003 (1) All machines down last evening due to power outage on campus. ############################################################ CRASH_143 ############################################################ Tue Jan 6 ??:??:?? PST 2004 (1) vn20 down, suspected memory problems, Varsity took it into the shop and apparently found mem. problem, Jan 16 ############################################################ CRASH_144 ############################################################ Sun Apr 18 17:26:25 PDT 2004 (1) vn20 down again, ping-able, but can't telnet/ssh Hard reboot in machine room Back up, nothing apparent in logs ############################################################ CRASH_145 ############################################################ Mon May 24 06:53:40 PDT 2004 (1) vn22 down, ping-able, but can't telnet/ssh Send message to Pal ############################################################ CRASH_146 ############################################################ Thu Jun 10 05:22:52 PDT 2004 (1) vn11 down, ping-able, but can't telnet/ssh Send message to Pal ############################################################ CRASH_147 ############################################################ Wed Jun 16 19:51:25 PDT 2004 (1) vn52 down, ping-able, but can't telnet/ssh Send message to Pal ############################################################ CRASH_148 ############################################################ Wed Jun 23 10:10:28 PDT 2004 (1) vn34 down, ping-able, but can't telnet/ssh ############################################################ CRASH_149 
############################################################ Fri Jul 9 15:33:21 PDT 2004 (1) vn11 down, can't ping Sent message to Pal Pal rebooted Sat AM, down again Mon Jul 12 11:31:45 PDT 2004 vn11 down 1+18:28 vn22 down 1+09:15 vn11 to shop Wed Jul 14 15:40:07 PDT 2004 vn11 back from Varsity, replaced power supply, $143 including $50.00 labor. ############################################################ CRASH_150 ############################################################ Mon Jul 12 11:31:59 PDT 2004 vn22 down 1+09:16 # Reboot (last went down May 2?, 2004) # Nothing obvious in logs, but trinat recently logged in ############################################################ CRASH_151 ############################################################ Mon Aug 23 11:51:32 PDT 2004 vn1 down 1+13:50 # As root@vn1 vnSetdate ntpd # Nothing obvious in log, but indications that someone has recently been # trying to hack in ? ############################################################ CRASH_152 ############################################################ Wed Aug 25 14:34:00 PDT 2004 vn25 down 3:44 # Pal finds that machine has no power (possible P/S?), phones # Varsity Wed Sep 1 18:14:52 PDT 2004 # Pal gets new (400 W) power supply installs, back on line ############################################################ CRASH_153 ############################################################ # Sun Sep 19 11:46:41 PDT 2004 # Cluster BIT reboot (see README) ############################################################ CRASH_154 ############################################################ Tue Sep 28 11:37:52 PDT 2004 # vnfe2 was down, Pal rebooted vnSetdate # Nothing apparent in logs ############################################################ CRASH_155 ############################################################ Sat Oct 9 20:37:47 PDT 2004 vnfe2 down 6:17 Sun Oct 10 11:59:11 PDT 2004 vnfe2 down 21:38 # In machine room, hard reboot, comes up but hangs on # NFS, need to bring it up 
single-user Sun Oct 10 12:24:37 PDT 2004 # Takes forever for NFS to come up, gets through export but # hangs on mountd # Third hard reboot into single user, but have to go read # fellowship apps. Tue Oct 12 02:38:33 PDT 2004 # vnfe2 up and running ... vnCommand 'mount -a; df' # ... proceeding SLOWLY but SURELY ############################################################ CRASH_156 ############################################################ # vn16 rebooted itself? Apparently, nothing in logs [root@vn16]# uptime 10:28am up 2 days, 22:30, 1 user, load average: 1.00, 1.00, 0.93 /var/log/messages Oct 10 10:59:32 vn16 PAM_pwdb[2260]: (login) session opened for user suqin by (uid=0) Oct 10 10:59:32 vn16 pam_console[2260]: can't find device or X11 socket to examine for 0 Oct 10 11:00:01 vn16 CROND[2274]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:01:00 vn16 CROND[2276]: (root) CMD (run-parts /etc/cron.hourly) Oct 10 11:01:00 vn16 anacron[2279]: Updated timestamp for job `cron.hourly' to 2004-10-10 Oct 10 11:10:00 vn16 CROND[2284]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:20:00 vn16 CROND[2286]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:30:00 vn16 CROND[2288]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:40:00 vn16 CROND[2290]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:50:00 vn16 CROND[2292]: (root) CMD ( /sbin/rmmod -as) Oct 10 11:08:38 vn16 syslogd 1.3-3: restart. Oct 10 11:08:38 vn16 syslog: syslogd startup succeeded Oct 10 11:08:38 vn16 kernel: klogd 1.3-3, log source = /proc/kmsg started. Oct 10 11:08:38 vn16 kernel: Inspecting /boot/System.map-2.2.14-Psmp Oct 10 11:08:38 vn16 syslog: klogd startup succeeded Oct 10 11:08:38 vn16 kernel: Loaded 6407 symbols from /boot/System.map-2.2.14-Psmp. Oct 10 11:08:38 vn16 kernel: Symbols match kernel version 2.2.14. 
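In the vn16 excerpt above, the `syslogd 1.3-3: restart.` line is the only trace of the reboot, and its timestamp (11:08:38) is earlier than the cron entry that precedes it (11:50:00). A minimal sketch that finds each restart marker, reports the last line logged before it, and flags a clock that appears to have stepped backwards. It assumes the lines are in file order and fall within one calendar year, since syslog stamps carry no year. `find_restarts` and `SAMPLE` are illustrative names.

```python
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Lines copied from the CRASH_156 excerpt, in file order.
SAMPLE = [
    "Oct 10 11:40:00 vn16 CROND[2290]: (root) CMD ( /sbin/rmmod -as)",
    "Oct 10 11:50:00 vn16 CROND[2292]: (root) CMD ( /sbin/rmmod -as)",
    "Oct 10 11:08:38 vn16 syslogd 1.3-3: restart.",
    "Oct 10 11:08:38 vn16 syslog: syslogd startup succeeded",
]


def stamp(line):
    """Return (month, day, hour, minute, second) for a syslog line."""
    month, day, clock = line.split()[:3]
    hour, minute, second = (int(part) for part in clock.split(":"))
    return (MONTHS[month], int(day), hour, minute, second)


def find_restarts(lines):
    """Yield (last line before restart, restart line, clock went backwards?)."""
    events = []
    previous = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if " syslogd " in text and text.endswith("restart."):
            backwards = previous is not None and stamp(text) < stamp(previous)
            events.append((previous, text, backwards))
        previous = text
    return events


for before, restart, backwards in find_restarts(SAMPLE):
    print("last entry before restart:", before)
    print("restart marker:           ", restart)
    print("clock stepped backwards:  ", backwards)
```

A backwards step here is a reason to rerun the clock-setting commands used elsewhere in this log (vnSetdate, ntptimeset) before trusting later timestamps on that node.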
############################################################
CRASH_157
############################################################
# Sat Oct 23 08:10:33 PDT 2004
# vn16 down again, needs to go into the shop
# Varsity can find nothing wrong

############################################################
CRASH_158
############################################################
# Sun Nov 7 13:14:16 PST 2004
# vnfe2 unstable, multiple reboots etc.
# Need to copy vnfe2:{/home,/home2} somewhere, then take it
# offline (will have to temporarily make vn1, e.g.,
# vnfe2 so that ntpd, PG compilers etc. will work)

Mon Nov 8 09:51:23 PST 2004
# In machine room with 10 minutes until an appt.; will reboot
# vnfe2 into single user mode

# Mon Nov 15 10:49:52 PST 2004
# vnfe2 back from shop, Tony reseated second CPU and replaced
# power supply

# Back up OK
cat /proc/cpuinfo
# OK

# TODO
# Reenable vnfe2:{/home,/home2} mounts
rrvi /etc/fstab

# As matt@bh0
rrvi /etc/fstab

# Continued to have problems. Had Pal take out half of the
# memory. OK. Had Pal reinsert the other half. OK.

# So apparently memory simply had to be reseated

Wed Nov 24 07:44:27 PST 2004
# Nope ... looks like there are still problems

############################################################
CRASH_159, CRASH_160, CRASH_161
############################################################
# 5 node crashes in past 24 hours

vn34 vn44 vn47 vn62 vn62

From psandhu@physics.ubc.ca Wed Nov 10 13:49:10 2004
Date: Wed, 10 Nov 2004 13:15:22 -0800 (PST)
From: Pal Sandhu
To: Matthew W. Choptuik
Subject: vn44

Hi Matt, this is Pal. I rebooted vn44. It looks like it went down last
night at around 10pm. Also vn34 and vn47 went down last night at about
7pm but I had roland reboot those last night.
############################################################ CRASH_162, CRASH_163 ############################################################ # vn62 cd /home/matt/system/vn/Rtop test -f /tmp/rr && /bin/rm /tmp/rr && touch /tmp/rr foreach N (vn34 vn44 vn47 vn62) echo "+++++++++++++++++ NODE $N +++++++++++++++++++++" >> /tmp/rr grep $N 2004:11:10* >> /tmp/rr grep $N 2004:11:11* >> /tmp/rr end more /tmp/rr Rcat /tmp/rr +++++++++++++++++ NODE vn34 +++++++++++++++++++++ 2004:11:10:1336.42:42: vn34 926 roland 19 0 360M 360M 956 R 0 48.7 71.2 365:09 vlasov 2004:11:10:1336.42:69: vn34 634 watsma 17 0 58652 57M 916 R 0 47.3 11.3 512:04 neuralnetlm 2004:11:10:1336.42:104: vn34 1278 idle 3 0 1036 1036 840 R 0 2.3 0.2 0:00 top 2004:11:10:1344.19:36: vn34 926 roland 19 0 361M 359M 956 R 0 48.6 71.1 368:45 vlasov 2004:11:10:1344.19:56: vn34 634 watsma 19 0 58656 57M 916 R 0 47.7 11.3 515:44 neuralnetlm 2004:11:10:1344.19:102: vn34 1301 idle 3 0 1036 1036 840 R 0 2.3 0.2 0:00 top 2004:11:10:1351.41:24: vn34 926 roland 19 0 361M 359M 956 R 0 49.7 71.1 372:12 vlasov 2004:11:10:1351.41:68: vn34 634 watsma 18 0 58652 57M 916 R 0 46.8 11.3 519:22 neuralnetlm 2004:11:10:1351.41:104: vn34 1324 idle 3 0 1036 1036 840 R 0 1.9 0.2 0:00 top 2004:11:10:1358.56:65: vn34 926 roland 18 0 361M 360M 956 R 0 46.9 71.2 375:37 vlasov 2004:11:10:1358.56:66: vn34 634 watsma 19 0 58652 57M 916 R 0 46.9 11.3 522:58 neuralnetlm 2004:11:10:1358.56:93: vn34 1347 idle 3 0 1036 1036 840 R 0 2.3 0.2 0:00 top 2004:11:10:1406.11:26: vn34 634 watsma 13 0 58656 57M 916 R 0 49.3 11.3 526:33 neuralnetlm 2004:11:10:1406.11:73: vn34 926 roland 12 0 361M 360M 956 R 0 45.9 71.2 379:03 vlasov 2004:11:10:1406.11:107: vn34 1376 idle 2 0 1036 1036 840 R 0 1.9 0.2 0:00 top 2004:11:10:1413.25:35: vn34 634 watsma 20 0 58656 57M 916 R 0 48.6 11.3 530:08 neuralnetlm 2004:11:10:1413.25:74: vn34 926 roland 19 0 364M 362M 956 R 0 43.8 71.6 382:24 vlasov 2004:11:10:1413.25:103: vn34 1399 idle 3 0 1036 1036 840 R 0 1.9 0.2 0:00 top 
2004:11:10:1420.38:50: vn34 926 roland 19 0 365M 364M 956 R 0 47.3 72.0 385:43 vlasov 2004:11:10:1420.38:51: vn34 634 watsma 17 0 58656 57M 916 R 0 47.3 11.3 533:46 neuralnetlm 2004:11:10:1420.38:82: vn34 1436 idle 3 0 1036 1036 840 R 0 2.8 0.2 0:00 top 2004:11:10:1428.08:55: vn34 634 watsma 18 0 58656 57M 916 R 0 47.3 11.3 537:25 neuralnetlm 2004:11:10:1428.08:73: vn34 926 roland 15 0 364M 362M 956 R 0 43.9 71.6 389:07 vlasov 2004:11:10:1428.08:102: vn34 1457 idle 3 0 1036 1036 840 R 0 1.9 0.2 0:00 top 2004:11:10:1435.21:44: vn34 634 watsma 18 0 58656 57M 916 R 0 48.0 11.3 541:00 neuralnetlm 2004:11:10:1435.21:73: vn34 926 roland 16 0 364M 362M 956 R 0 41.3 71.6 392:24 vlasov 2004:11:10:1435.21:103: vn34 1480 idle 3 0 1036 1036 840 R 0 1.9 0.2 0:00 top 2004:11:10:1442.37:48: vn34 634 watsma 18 0 58656 57M 916 R 0 47.9 11.3 544:42 neuralnetlm 2004:11:10:1442.37:70: vn34 926 roland 18 0 367M 365M 956 R 0 46.4 72.3 395:50 vlasov 2004:11:10:1442.37:101: vn34 1503 idle 2 0 1036 1036 840 R 0 1.9 0.2 0:00 top 2004:11:10:1450.01:36: vn34 926 roland 17 0 365M 363M 956 R 0 49.0 71.9 399:09 vlasov 2004:11:10:1450.01:37: vn34 634 watsma 17 0 58664 57M 916 R 0 49.0 11.3 548:17 neuralnetlm 2004:11:10:1450.01:105: vn34 1527 idle 2 0 1036 1036 840 R 0 1.9 0.2 0:00 top +++++++++++++++++ NODE vn44 +++++++++++++++++++++ 2004:11:10:1420.38:128: vn44 628 root 1 0 3860 3860 3524 S 0 0.9 0.7 0:08 prefdm 2004:11:10:1450.01:77: vn44 905 roland 12 0 3468 3468 988 R 0 37.1 0.6 0:14 vlasov +++++++++++++++++ NODE vn47 +++++++++++++++++++++ 2004:11:10:1336.42:91: vn47 730 roland 19 0 371M 370M 884 R 0 38.2 73.2 357:50 vlasov 2004:11:10:1336.42:116: vn47 1101 idle 4 0 1032 1032 840 R 0 1.9 0.1 0:00 top 2004:11:10:1344.19:85: vn47 730 roland 20 0 369M 368M 884 S 0 39.9 72.9 360:34 vlasov 2004:11:10:1344.19:124: vn47 1124 idle 2 0 1032 1032 840 R 0 1.4 0.1 0:00 top 2004:11:10:1351.41:81: vn47 730 roland 11 0 369M 368M 884 R 0 37.1 72.9 363:17 vlasov 2004:11:10:1351.41:115: vn47 1147 idle 3 0 1032 
1032 840 R 0 1.4 0.1 0:00 top 2004:11:10:1358.56:81: vn47 730 roland 11 0 370M 369M 884 S 0 38.7 73.0 365:59 vlasov 2004:11:10:1358.56:116: vn47 1170 idle 3 0 1032 1032 840 R 0 1.4 0.1 0:00 top 2004:11:10:1406.11:79: vn47 730 roland 11 0 370M 369M 884 R 0 38.7 73.1 368:40 vlasov 2004:11:10:1406.11:118: vn47 1216 idle 3 0 1032 1032 840 R 0 1.4 0.1 0:00 top 2004:11:10:1413.25:75: vn47 730 roland 14 0 371M 370M 884 R 0 41.3 73.3 371:22 vlasov 2004:11:10:1413.25:98: vn47 1244 idle 3 0 1032 1032 840 R 0 1.9 0.1 0:00 top 2004:11:10:1420.38:74: vn47 730 roland 14 0 374M 373M 884 R 0 38.0 73.8 374:08 vlasov 2004:11:10:1420.38:99: vn47 1277 idle 3 0 1032 1032 840 R 0 1.9 0.1 0:00 top 2004:11:10:1428.08:74: vn47 730 roland 13 0 373M 372M 884 R 0 38.2 73.6 376:49 vlasov 2004:11:10:1428.08:96: vn47 1298 idle 3 0 1032 1032 840 R 0 1.9 0.1 0:00 top 2004:11:10:1435.21:74: vn47 730 roland 12 0 373M 372M 884 S 0 39.1 73.6 379:29 vlasov 2004:11:10:1435.21:109: vn47 1321 idle 3 0 1032 1032 840 R 0 1.4 0.1 0:00 top 2004:11:10:1442.37:77: vn47 730 roland 19 0 373M 372M 884 R 0 36.8 73.6 382:17 vlasov 2004:11:10:1442.37:88: vn47 1345 idle 3 0 1032 1032 840 R 0 2.3 0.1 0:00 top 2004:11:10:1450.01:78: vn47 730 roland 11 0 376M 375M 884 R 0 36.6 74.2 385:00 vlasov 2004:11:10:1450.01:92: vn47 1368 idle 4 0 1032 1032 840 R 0 2.3 0.1 0:00 top +++++++++++++++++ NODE vn62 +++++++++++++++++++++ 2004:11:10:1428.08:123: vn62 620 root 2 0 1464 444 296 S 0 0.9 0.0 254:58 prefdm 2004:11:10:1435.21:121: vn62 620 root 2 0 1464 444 296 S 0 0.9 0.0 255:01 prefdm 2004:11:10:1442.37:123: vn62 620 root 2 0 1464 444 296 S 0 0.9 0.0 255:04 prefdm 2004:11:10:1450.01:124: vn62 620 root 1 0 1464 444 296 S 0 0.9 0.0 255:08 prefdm # Sent message to Roland asking him to cease and desist with (Current) parallel # vlasov, and have installed cron job to killall vlasov jobs from roo@vnfe1 # every 5 minutes ############################################################ CRASH_164 
############################################################
Fri Dec 10 09:15:51 PST 2004

vn2 down 13:23

# In machine room, fan not running, nor on vn3.
# Power supply most likely on vn2

# Immediate action ... check all fans etc.

############################################################
CRASH_165
############################################################
Fri Dec 10 09:25:42 PST 2004

# vn3: Fan not running, but machine is. Take down and
# offline possibly for local diagnosis and service

############################################################
CRASH_166 ...
############################################################
# Lots of fans not running

############################################################
CRASH_175
############################################################
# vn45 down while I'm at Outing Lodge

Tue Dec 21 10:41:59 PST 2004
# In machine room, tons of RPC msgs, no response, hard
# reboot, had a chat with Dave, Jeremy's machine still idle

############################################################
CRASH_176
############################################################
Fri Jan 28 10:57:10 PST 2005

# vn20 problematic after upgrade, kernel fault during
# post-install.

# Check memory, had memory replaced previously

# Take out of cluster and hand it over to Pal

Hi Pal, Ben:

vn20 needs some attention. I suspect there may be memory problems.
Please open it up, take half the memory out and then bring it back up.
If it continues to misbehave, we'll swap in the removed half and
repeat.

What's the story with vn25?

Thanks, Matt

#-----------------------------------------------------------
#-----------------------------------------------------------
Sun Jan 30 13:39:00 PST 2005
#-----------------------------------------------------------
#-----------------------------------------------------------
# TODO: Will swap identities of vn20 and vn64

down
vn20 down 9:40

# Frans currently running on vn64, have asked him to clear
# off.
Sun Jan 30 14:56:17 PST 2005
# Now in machine room, will kill Frans if he hasn't done so yet.

foreach u (XXX)
   vnallbgCommand "ps -elf | grep $u | grep -v grep | nth 4 | pre kill -9 | csh"
end

# Right on, right on, Frans; has killed them gh...'s

# As root@vn64

# Lock running rampant
wall "Shutdown in 120 seconds"; sleep 120; shutdown -h now

# Bring up vn64 as vn20 in single user mode, reconfigure network (two
# occurrences of 20 -> 64 in IP address and hostname) via drakconf

# OK?

# Search and destroy all known_hosts
vnallbgCommand '/bin/rm ~/.ssh/known_hosts'

# Bonzai!!
vnallbgCommand 'vnallbgCommand date'

# As matt@bh0
cd .ssh
# After some hacking on it
Reset_known_hosts

# As idle@vnfe1
# define tall alias to run cpi on all nodes

# tall (1) ... OK
# tall (2) ... OK
# tall (3) ... OK
# tall (4) ... OK ... Wall clock
Sun Jan 30 15:33:03 PST 2005
Sun Jan 30 15:33:48 PST 2005
# tall (5) ... OK ... Wall clock
Sun Jan 30 15:33:57 PST 2005
Sun Jan 30 15:34:39 PST 2005

# Verify identity on vn20, look for vestiges of vn64
# As root@vn20
cd /etc
find . -exec grep -il vn64 {} \; > /tmp/vn64
./hosts
./motd
./hosts.orig
./ntp.conf

vi `cat /tmp/vn64`
#./hosts
#./motd
#./hosts.orig
#./ntp.conf
# All OK

# Continue with transmutation of vn64 -> vn20

# Cable set 2 -> vn64 (nee vn20)
# Reconfigured via drakconf
# Up as vn64

# As root@vn64
cd /etc
find . -exec grep -il vn20 {} \; > /tmp/vn20
vi /tmp/vn20
# OK

# Put vn64 back in pool,
# As idle@vnfe1
# tall (1)
tall
Sun Jan 30 15:52:45 PST 2005
Sun Jan 30 15:53:30 PST 2005
# tall (2)
Sun Jan 30 15:59:28 PST 2005
Sun Jan 30 16:00:13 PST 2005
# tall (3)
Sun Jan 30 16:03:42 PST 2005
Sun Jan 30 16:05:00 PST 2005

# As root@vn20
hostname vn20 localhost
strace hostname -s
# vi /etc/hosts
# Fixed that up

# vn64 hung up again, could try running the uniprocessor kernel,
# HAS Boot up message PXE-M04: Hooking bootstrap interrupt at 18h
# ... is this normal?
Should check vn63 # As root@vn63 # wall "Shutdown in 30 seconds"; sleep 30; shutdown -h now # Root@vn63 has no such message, so have Pal, Ben and/or # Varsity look at it # Keep vn63 and vn64 down so that BIOSes can be compared etc. From matt@physics.ubc.ca Sun Jan 30 16:36:52 2005 Date: Sun, 30 Jan 2005 16:32:37 -0800 (PST) From: Matthew W. Choptuik To: Ben Gutierrez , Pal Sandhu Cc: Matthew Choptuik , Tony Subject: Update on old cluster Hi guys ... after a bit more sleuthing, I am hypothesizing that there may be a BIOS problem with vn64 (previously vn20). On boot up, it reports PXE-M04: Hooking bootstrap interrupt at 18h and then on the next screen complains about the BIOS update. vn64, e.g., does no such thing. I've turned both vn6[34] off, and disconnected their ethernet cables. Please 0) Bring up vn64 yourself (off the network) and verify that you see the above mentioned error messages 1) Put the rest of the memory back in, see if that has any effect on the messages 2) In any case, but particularly if putting the mem back in has no effect, check vn63 BIOS in detail against vn64 (Version, everything). If there are any differences, get in touch with Varsity. Ditto if you can't find any differences---have them come and get both vn64 (which doesn't work) and vn63 (which does) and look for BIOS differences. Let me know immediately if you have any questions about this. Thanks, Matt -------------------------------------------------------------------------- Matthew W. Choptuik|Dept. of Phys. 
& Astronomy, UBC|6224 Agricultural Road Vancouver BC, V6T 1Z1, Canada|Voice: (604) 822-2412|Fax: (604) 822-5324 choptuik@physics.ubc.ca|http://laplace.physics.ubc.ca/Members/matt/ #----------------------------------------------------------- Tue Feb 8 10:49:33 PST 2005 #----------------------------------------------------------- # vn64 (nee vn20) back from the shop, but # # 1) Can't get keyboard control via KVM # 2) Reports 384M (which turns out to be right, # 3) DOESN'T RECOGNIZE NETWORK CARD. Phone Bill/Tony, Tony comes # out instantly, verifies that we have 1 x 256M + 1 x 128M # Send Pal and Ben message re memory [root@localhost]# grep LOWMEM messages* messages:Jan 30 15:41:16 vn64 kernel: 255MB LOWMEM available. messages:Jan 30 16:19:10 vn64 kernel: 255MB LOWMEM available. messages:Feb 3 16:19:55 vn64 kernel: 383MB LOWMEM available. messages:Feb 8 10:18:42 vn64 kernel: 383MB LOWMEM available. messages:Feb 8 10:22:28 vn64 kernel: 383MB LOWMEM available. messages:Feb 8 11:00:51 vn64 kernel: 383MB LOWMEM available. messages.1:Jan 28 14:26:58 vn20 kernel: 127MB LOWMEM available. messages.1:Jan 28 16:05:53 vn20 kernel: 255MB LOWMEM available. messages.2:Jan 20 12:09:56 localhost kernel: 511MB LOWMEM available. messages.2:Jan 20 12:39:12 localhost kernel: 511MB LOWMEM available. messages.2:Jan 20 13:04:30 localhost kernel: 511MB LOWMEM available. messages.2:Jan 20 13:13:54 vn20 kernel: 511MB LOWMEM available. messages.2:Jan 20 13:35:10 vn20 kernel: 511MB LOWMEM available. messages.2:Jan 20 18:17:01 vn20 kernel: 511MB LOWMEM available. # AS root@vn64 reboot # Fix up /etc/hosts which ends up with vn64 as an alias for 127.0.0.1 vi /etc/hosts ntptimeset df # umask IS getting set to 2, per Scott's message, must be in start up # *directory* (sigh) # csh.login! (rookie mistake) # ... which we will simply eliminate. 
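# NOTE (sketch only, not one of the cluster scripts): the LOWMEM grep
# above is read by eye to see when the installed memory changed. The
# same check can be done mechanically. The function name lowmem_changes
# is made up for illustration, and the input is assumed to be syslog
# lines in chronological order (oldest first, e.g. messages.2, then
# messages.1, then messages), which is NOT the order grep printed them.

```shell
# Hypothetical helper: read "kernel: NNNMB LOWMEM available." lines on
# stdin and print one line each time the reported size changes for a host.
lowmem_changes() {
    awk '
    /LOWMEM available/ {
        host = ""; mb = ""
        # The host is the field before "kernel:", the size the field after.
        for (i = 2; i < NF; i++) {
            if ($i == "kernel:") { host = $(i - 1); mb = $(i + 1) }
        }
        if (host == "") next
        if ((host in last) && last[host] != mb)
            print host ": " last[host] " -> " mb
        last[host] = mb
    }'
}
```

# Changes are tracked per hostname, so the vn20 -> vn64 rename shows up
# as two separate histories rather than one.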
# Some nodes still have /home, which is a problem vn6 vn25 vn52 vn64 # Quick fix is to set umask in .cshrc, but better fix is to # use stub csh.login # Now seems to be OK, put Lock on vn64, ask Ben and Pal to # get the memory back in Wed Feb 9 13:11:58 PST 2005 # Memory back ############################################################ CRASH_177 ############################################################ Sat Feb 12 04:50:10 PST 2005 # vn64 down again vn64 down 0:48 # As matt@vnfe1 viw vnN # comment-disable vn40 Sun Feb 13 08:01:45 PST 2005 # In machine room, KVM monitor still on vn64, as well as # original 'remote' keyboard Crashed with kernel error in synch_inode EIP is at __single_synch_inode File system # This is a sick puppy. Swap vn64 and some comparable bh machine ########################################################### CRASH_178 ############################################################ # vn6 has gone down twice (ping-able, no shell etc.) # swap with vn63 # Frans running on vn63, swap with vn6 root@vn6:/etc/motd root@vn63:/etc/motd Fri Mar 18 16:47:02 PST 2005 # Taking vn6 down # As root@vn6 shutdown -h now # As matt@vnfe1 ping vn6 # Dead, OK ... Need vn63 to drain, Frans still running. top - 16:48:34 up 43 days, 28 min, 2 users, load average: 0.93, 0.73, 0.65 Tasks: 67 total, 2 running, 65 sleeping, 0 stopped, 0 zombie Cpu(s): 3.3% us, 0.3% sy, 30.0% ni, 66.3% id, 0.1% wa, 0.0% hi, 0.0% si Mem: 514684k total, 493212k used, 21472k free, 79308k buffers Swap: 511520k total, 0k used, 511520k free, 241532k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4411 fransp 15 0 28028 12m 5468 R 3.9 2.6 143:05.72 gh3d_b_L1 9252 root 15 0 1992 884 1788 R 2.0 0.2 0:00.01 top # OK # TODO # As root@vn63 shutdown -h now # Swap vn6/vn63 ... Tue Mar 29 19:28:47 PST 2005 # vn63 now idle (already!) ... swap vn6 and vn63 # As root@vn63 shutdown -h now # ... 
off to machine room, then to office to pick up stuff and work a bit
# before packing up for Baylor trip
# Had forgotten that vn31 was also down. Will try a reboot
# on it first
# vn31 on KVM:2
# vn31 has no disk light etc, power supply
# After struggling with the KVM cables, brought vn6 up in
# single user mode, used drakconf to change to vn63,
# rebooted in single user mode, unmunged /etc/hosts (since,
# e.g. 142.103.237.63 is not identified with vn63)
# Remove everyone's known_hosts files
# after archiving them
# As root@vnfe1
ls /d/*/home*/*/.ssh/known_hosts
/bin/rm -f /d/*/home*/*/.ssh/known_hosts
# vn6 -> vn63 looks successful ...
# ...take out of cluster and put on power supplies
# vn63 -> vn6
# As root@vn63
shutdown -h now
# boot: linux single
cd /etc
cp hosts hosts.real
drakconf
# configure network
cd /etc
mv hosts hosts.drakconf
cp hosts.real hosts
reboot
# boot: linux single
# OK
exit
# See README for continuation
Tue Mar 29 21:14:54 PST 2005
#-----------------------------------------------------------------------
# TODO (Pal and/or Ben and/or ...) Diagnose problems with
# vn63 and vn64 (nee vn6 and vn20)
#-----------------------------------------------------------------------
###########################################################
CRASH_179
############################################################
# vn25 has load average of merely 159, not surprisingly,
# sshd is refusing connections
#
# Have had recurrent problems with this node
#
Sun Apr 10 06:57:24 PDT 2005
# In machine room, power off, attach KVM cables to port 2,
# KVM to port 2, power on, don't switch KVM until Linux starts
# booting.
#
# As root@vn25
# Strange date, 3 hours off
date
Sun Apr 10 03:59:23 PDT 2005
vnSetdate
ntp
ntptimeset
# OK
# Now, where's the Rtop archive
wa Rtop
# Testing MPI connectivity,
# All not well, vn6 down?
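# NOTE (sketch only): a quick way to answer "is vn6 down?" before
# blaming MPI is a ping sweep over the node list. This checks network
# reachability only, not MPI itself. The name dead_nodes and the PROBE
# override are made up for illustration; PROBE defaults to the same
# "ping -c 1" used elsewhere in this log.

```shell
# Hypothetical helper: print the name of every node that fails the probe.
# PROBE may be set to another command that takes a hostname as its
# argument and returns non-zero when the host is unreachable.
dead_nodes() {
    for n in "$@"; do
        # Unquoted on purpose so that "ping -c 1" splits into words.
        ${PROBE:-ping -c 1} "$n" >/dev/null 2>&1 || echo "$n"
    done
}
```

# Run from cron with the node names as arguments, any output means at
# least one node did not answer.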
# Need cron job to test MPI connectivity ########################################################### CRASH_180 ############################################################ # vn6, Matt took out the power cable (accidentally of course!) # reboot OK # date, ntp OK # As matt@vnfe1 cdex Scan vn6 # OK? ########################################################### CRASH_181 ############################################################ Fri Apr 15 07:35:41 PDT 2005 # vn25 down vn25 down 1+09:53 # Ben rebooted and it came back OK (I'm in Banff @ BIRS) ########################################################### CRASH_182 ############################################################ Mon Apr 18 08:16:31 PDT 2005 # vn25 down vn25 down 13:28 # Ben rebooted 11 hours after message sent and it came # back OK (still in Banff) ########################################################### CRASH_183 ############################################################ Tue Apr 19 08:21:32 PDT 2005 ############################################################ # Guess what! 
vn25 down
down
vn25 down 4:17
# Message to Ben and Pal, leave it, will look at it when I get back
# TODO
Fri Apr 29 14:06:31 PDT 2005
# Ben put new memory (2 x 256MB) in
# OK
vnSetdate
# As matt@vnfe1
nodes
# Updated vnN, ~matt/scripts/mp_func/
Scan
# OK
Hammer
# OK
# Run with vn1 vn43 vn25 vn12
# OK
###########################################################
CRASH_184
############################################################
Wed Apr 20 19:00:19 PDT 2005
############################################################
# vn52 down
vn52 down 14:59
# Message to Ben and Pal to investigate
# Ben revived by swapping the P/S with one of vn63/vn64
###########################################################
CRASH_185
############################################################
# Mon Apr 25 12:08:24 PDT 2005
# vn43 down
vn25 down 2+20:33
vn43 down 3+02:19
# whowason indicates that nothing was running at time
# of crash
Fri Apr 29 13:52:27 PDT 2005
# On KVM switch can still initiate root login, hangs,
# as does shutdown via Ctrl-Alt-Del. Hard reboot
# Now is hanging on NFS
# Reseat eth0 cable, drakconf eth0, OK
###########################################################
CRASH_186 CRASH_187 CRASH_188 CRASH_189
############################################################
Sat Jul 30 07:49:40 PDT 2005
vn22 down 18:45
vn29 down 18:43
vn32 down 18:42
vn41 down 19:07
# In machine room
foreach n (vn22 vn29 vn32 vn41)
   ping -c 1 $n
end
# All dead in the water, presumably power supplies or
# equivalent. Synchrony interesting, but certainly not
# unheard of (vis a vis our Zippy supplies)

Dear Tony and Jody:

4 nodes from our OLD cluster (no longer on warranty) died almost
simultaneously yesterday. I don't know if there was anything going on
in the machine room at that time (since there's some construction
here), but in any case, I suspect that they will all need new power
supplies. I've pulled them from the cluster, and left them on the
floor in front of the AC unit.
They are labelled vn22, vn29, vn32 and vn41. Would appreciate if you
could pick them up next time you are on campus, or if you want to
chance it, just bring out four power supplies. In fact you can go
ahead and charge six replacement power supplies to my usual account,
since I'm sure we'll need them down the road in any case.

Let myself and Ben know when you plan to be over at the cluster room
if you will be bringing parts, otherwise you should be able to pick
the stuff up from the machine room by yourself.

Thanks, Matt

# As matt@vnfe1
viw vnN
grep '#' ~matt/scripts/vnN
#!/bin/sh
# Numeric addresses for nodes
# TODO
:n ~/scripts/mp_func
cat < 200) ergo flaky response
# kills processes, then notices that / is read-only
# reboot
# As matt@vnfe1
ping vn41
# ... and doesn't look like it's coming back
# TODO: -> machine room and investigate
Sat Sep 10 15:57:14 PDT 2005
# In machine room, vn41 takes REAL hard reboot, comes back
# with disk errors, may have to run fsck manually on
# /dev/h.*d.*, NOPE maybe not, reboots clean, we should be back
# in bidness
Sat Sep 10 15:59:44 PDT 2005
# ... and we are, check time, ntp, df OK
# As root@vn41
ntptimeset
# OK
viw vnN
# OK
vnCommand date
# LOGS ARE USELESS, END OF INCIDENT
############################################################
CRASH_193
############################################################
Fri Sep 30 16:04:35 PDT 2005
# In machine room
vn6 down 1+04:10
# KVM:2 on vn6
# Looks like a P/S ... fan, but no disk light, decable,
# send message to Jason and Ben
whowas on vn6
# Nothing, so no need to send message on that account
# As matt@bh0
cds
vi vnN
#142.103.237.6
Sun Oct 2 10:10:20 PDT 2005
# Jason and Ben cannibalize vn63/vn64 and get vn6
# back up
############################################################
CRASH_194
############################################################
Mon Oct 3 14:43:02 PDT 2005
# /etc on vn41 is read only, hope that we can reboot
# immediately!
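# NOTE (sketch only): a read-only / can be confirmed from the mount
# table before rebooting. One common cause is the kernel remounting a
# filesystem read-only after disk errors, but this log does not say
# that is what happened on vn41. The name ro_mounts is made up for
# illustration; input is assumed to be in /proc/mounts format.

```shell
# Hypothetical helper: print every mount point whose option list
# contains the bare option "ro".
# Input fields: device mountpoint fstype options dump pass
ro_mounts() {
    awk '{
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "ro") { print $2; break }
    }'
}
```

# On a live Linux node this would be run as: ro_mounts < /proc/mounts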
# As root@vn41
reboot
# Hook it up to KVM:2
# Needed hard reboot, and had filesystem errors that seem to
# be fixed up
whowason vn41
# As root@vn41
jj ntp
# OK,
ntptimeset
Your clock is off by 0.0319875 seconds. (142.103.237.225) [15/15]
vnDistEtc passwd shadow group hosts.allow hosts.deny
# OK
############################################################
CRASH_195
############################################################
Thu Oct 6 14:10:31 PDT 2005
# /etc on vn41 is read only, hope that we can reboot
# immediately!
# vn41 down again. Deemed flaky and will thus exchange
# identity with vn62
# As root@vn62
sleep 60; shutdown -h now
# Old vn62 now up as vn41
# Need to reset known_hosts TODO: Post message about this
# better
# As matt@bh0
cd .ssh
Reset_known_hosts
# Take vn62 out of the active roster
cds
vi vnN
#vn62
make export
# Old vn41 (vn62) back up, disk was OK, logs
# Installed vn62:/etc/motd
#################################################################
# PLEASE STAY OFF THIS NODE UNTIL FURTHER NOTICE!! #
#################################################################
vi /var/log/messages
Oct 4 08:10:10 vn41 sshd[8695]: Failed password for invalid user adam from ::ffff:139.223.200.144 port 32887 ssh2
Oct 4 08:10:10 vn41 sshd[8693]: Failed password for sync from ::ffff:82.103.129.98 port 54809 ssh2
Oct 4 08:10:12 vn41 sshd[8697]: Invalid user adam from ::ffff:139.223.200.144
Oct 4 08:10:12 vn41 sshd[8697]: error: Could not get shadow information for NOUSER
Oct 4 08:10:12 vn41 sshd[8697]: Failed password for invalid user adam from ::ffff:139.223.200.144 port 32959 ssh2
Oct 4 08:10:12 vn41 kernel: swap_free: Bad swap offset entry 00080000
Oct 4 08:10:12 vn41 kernel: swap_free: Bad swap offset entry 00080000
Oct 4 08:10:12 vn41 kernel: ------------[ cut here ]------------
Oct 4 08:10:12 vn41 kernel: kernel BUG at mm/rmap.c:407!
Oct 6 14:40:13 vn41 syslogd 1.4.1: restart.
Oct 6 14:40:13 vn41 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 6 14:40:13 vn41 kernel: Inspecting /boot/System.map-2.6.8.1-10mdksmp Oct 6 14:40:13 vn41 partmon: Checking if partitions have enough free diskspace: # TODO: KERNEL BUG Fri Oct 7 10:51:22 PDT 2005 # Ben replaced memory and ordered new set. # Machine up and running # As root@vn62 vnSetdate jj ntp ntptimeset # OK df # OK # Update /etc/motd and Web page # As matt@vnfe1 cds vi vnN # enable vn62 make export make export vnDistEtc passwd shadow group ############################################################ CRASH_196 ############################################################ Wed Oct 12 06:44:58 PDT 2005 down vn62 down 4+02:37 # So memory swap apparently didn't work. viw vnN cds; make export # disable vn62 etc cp motd.2005.10.07 motd.2005.10.12 vi !$ CP !$ motd vnDistEtc motd ############################################################ CRASH_197 ############################################################ Sat Oct 22 14:13:06 PDT 2005 vn42 down 4+08:35 vn61 down 3+03:38 vn62 down 14+10:10 # vn42 has been down for several days. # Hard reboot brings it back vr 42 # As root@vn42 vnSetdate jj ntp ntptimeset Your clock is off by 0.3737720 seconds. (142.103.237.227) [15/15] df # OK, but nothing obvious in logs !!ssh root@vn42.physics.ubc.ca cat /tmp/l Oct 18 05:30:41 vn42 sshd[23026]: Accepted publickey for idle from ::ffff:142.103.237.225 port 59949 ssh2 Oct 18 05:35:00 vn42 CROND[23092]: (mail) CMD (/usr/bin/python -S /usr/lib/mailman/cron/gate_news) Oct 18 05:35:40 vn42 sshd[23096]: Accepted publickey for idle from ::ffff:142.103.237.225 port 60015 ssh2 Oct 19 08:03:02 vn42 syslogd 1.4.1: restart. Oct 19 08:03:02 vn42 kernel: klogd 1.4.1, log source = /proc/kmsg started. Oct 19 08:03:02 vn42 kernel: Inspecting /boot/System.map-2.6.8.1-10mdksmp Oct 19 08:03:02 vn42 partmon: Checking if partitions have enough free diskspace: Oct 19 08:03:02 vn42 kernel: Loaded 18312 symbols from /boot/System.map-2.6.8.1-10mdksmp. 
############################################################
CRASH_198
############################################################
# Wed Nov 2 10:37:50 PST 2005
# vn16 was incommunicado, in machine room on KVM:2, can log
# in, but ping vn15 complains about lack of buffer space,
# soft reboot
# As matt@vnfe1
ssh root@vn16
df
# OK
ntptimeset
Your clock is off by -0.5378585 seconds. (142.103.237.225) [15/15]
# EOI
############################################################
CRASH_199
############################################################
# Wed May 17 12:21:04 PDT 2006
#
# Just noticed that vn24 and vn66 (one of Steve P's) are
# also down
# As matt@vnfe1
cd etc
cp motd motd.2006.05.17
vi !$
#-----------------------------------------------------------
Sat May 20 11:05:57 PDT 2006
#-----------------------------------------------------------
# Jason has replaced PS, and machine is back on line, vn66
# was literally off-line, as in unplugged from the switch
############################################################
CRASH_200
############################################################
Mon Jun 5 11:26:41 PDT 2006
down
vn62 down 240+07:23
vn63 down 4+01:28
vn64 down 2+01:03
vn65 down 4+01:21
ping -c 1 vn62
# Alive
ping -c 1 vn63
# DEAD
ping -c 1 vn64
# DEAD
ping -c 1 vn65
# DEAD
# vn63 not booting, no doubt since I clobbered the
# /etc/fstab via foreach loop
# TODO Will need rescue CD to recover on
# TODO: vn63
# TODO: vn64
# TODO: vn65
#
############################################################
CRASH_202
############################################################
Tue Jun 27 10:19:39 PDT 2006
# vn36 down, apparently with bad P/S
# Sent phone calls and TODO: message to Jason
############################################################
CRASH_203
############################################################
# In course of investigating vn36 state (see above), as well
# as continued work on the still-out-of-svc Plotkin machines,
# managed to hard reboot vnfe1 via the
A/C plug about 3 times # # Plug wasn't fully seated, and too much tension on cable. # Alleviated and we're back on line ############################################################ CRASH_204 ############################################################ # vn16 is down # As root@vnfe1 date; down Mon Jul 3 13:12:54 PDT 2006 vn16 down 5+20:15 vn62 down 268+09:10 vn63 down 32+03:14 vn65 down 32+03:08 vn67 down 24+00:20 # No green diode on front ... on back, power toggle off, # A/C cord out, sleep 1 minute, reconnect A/C and network, # power on brings it back # As root@vnfe1 ping vn16 # OK # As root@vn16 ntptimeset # TODO: FIX vnSetdate Mon Jul 3 13:18:17 PDT 2006 df Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3221080 2518956 57% / /dev/hda6 6581928 561668 6020260 9% /scratch vnfe1:/home 10958172 9104224 1853948 84% /d/vnfe1/home vnfe1:/home2 17496684 14284992 3211692 82% /d/vnfe1/home2 vnfe1:/home3 17496684 9248864 8247820 53% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 784928668 177979636 82% /d/vnfe4/home bh0:/home 149336320 123715336 18035096 88% /d/bh0/home # From root@vn16:/var/log/messages ... !!ssh root@vn16.physics.ubc.ca cat /tmp/l Jun 27 16:55:00 vn16 CROND[8186]: (mail) CMD (/usr/bin/python -S /usr/lib/mailman/cron/gate_news) Jun 27 16:55:48 vn16 sshd[8190]: Accepted publickey for idle from 142.103.237.225 port 38372 ssh2 Jun 27 17:00:00 vn16 CROND[8256]: (mail) CMD (/usr/bin/python -S /usr/lib/mailman/cron/gate_news) Jul 3 13:16:55 vn16 syslogd 1.4.1: restart. Jul 3 13:16:55 vn16 kernel: klogd 1.4.1, log source = /proc/kmsg started. Jul 3 13:16:55 vn16 kernel: Inspecting /boot/System.map-2.6.8.1-10mdksmp Jul 3 13:16:56 vn16 partmon: Checking if partitions have enough free diskspace: Jul 3 13:16:56 vn16 kernel: Loaded 18312 symbols from /boot/System.map-2.6.8.1-10mdksmp. 
Jul 3 13:16:56 vn16 kernel: Symbols match kernel version 2.6.8. P # ... so nothing apparent in logs Mon Jul 3 13:20:21 PDT 2006 # END OF INCIDENT ############################################################ CRASH_205 ############################################################ Thu Jul 6 11:59:20 PDT 2006 # Had to hard reboot vn63 after SNAFU vis a vis assigning # its IP to one of Steve P's machines by mistake (vn66) # As root@vn63 date; uptime; df Thu Jul 6 12:03:26 PDT 2006 12:03:26 up 3 min, 1 user, load average: 0.21, 0.11, 0.03 Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda5 239334880 98952784 128224508 44% / vnfe1:/home 10958176 9117216 1840960 84% /d/vnfe1/home vnfe1:/home2 17496688 14285176 3211512 82% /d/vnfe1/home2 vnfe1:/home3 17496688 9248864 8247824 53% /d/vnfe2/home vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438800 789346536 173561768 82% /d/vnfe4/home boson:/mandrake 115377640 107921416 1595312 99% /mandrake # TODO: Restore full /etc/fstab Thu Jul 6 12:03:35 PDT 2006 # END OF INCIDENT ############################################################ CRASH_206 ############################################################ Wed Aug 2 10:36:09 PDT 2006 # As root@vnfe3 (TODO: rwhod working on vnfe1 w/o reboot!!) down vn16 down 13:11 # Attached KVM 2, screen displayed kernel panic, hardware reset # XXX: Needed a manual fix of /dev/hda1 ... OK # Machine back up with no obvious boot time probs ... 
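# NOTE (sketch only): the df listings in these incidents are checked by
# eye. A filter like the one below would flag nearly full filesystems
# (e.g. boson:/mandrake at 99% in the vn63 listing above). The name
# df_over is made up for illustration, and the input is assumed to be
# one filesystem per line; use "df -P" to avoid wrapped entries.

```shell
# Hypothetical helper: read df output on stdin and print
# "filesystem use%" for every entry at or above the given percentage.
# Header lines and vnCommand banner lines are skipped because their
# fifth field is not a number followed by "%".
df_over() {
    awk -v limit="$1" '
    $5 ~ /^[0-9]+%$/ {
        use = $5
        sub(/%$/, "", use)
        if (use + 0 >= limit + 0) print $1, $5
    }'
}
```

# Example use on one node: df -P | df_over 90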
# As root@vn16 vnSetdate Wed Aug 2 10:42:28 PDT 2006 df Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226416 2513620 57% / /dev/hda6 6581928 561668 6020260 9% /scratch vnfe1:/home 10958172 8767808 2190364 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810601964 152306340 85% /d/vnfe4/home bh0:/home 149336320 124649360 17101072 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889852 61651312 44% /eminem vn67:/xtina 114349864 92289504 16251660 86% /xtina vi /var/log/messages # Had lost logging on July 23, so system was probably shaking # itself to pieces in any case # As matT@vnfe3 vnCommand 'df' | tee /tmp/vn-df Rcat /tmp/vn-df !!ssh matt@vnfe3.physics.ubc.ca cat /tmp/vn-df Warning: Permanently added 'vnfe3.physics.ubc.ca' (RSA) to the list of known hosts. 
>>> Executing as root@142.103.237.1 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3352932 2387104 59% / /dev/hda6 6581928 1547428 5034500 24% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602048 152306256 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.2 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3227108 2512928 57% / /dev/hda6 6581928 537456 6044472 9% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602048 152306256 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.3 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226864 2513172 57% / /dev/hda6 6581928 521580 6060348 8% /scratch vnfe1:/home 10958172 8767812 2190360 81% 
/d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602048 152306256 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina >>> Executing as root@142.103.237.4 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3228208 2511828 57% / /dev/hda6 6581928 526384 6055544 8% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602048 152306256 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.5 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226860 2513176 57% / /dev/hda6 6581928 524584 6057344 8% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 
810602048 152306256 85% /d/vnfe4/home vn62:/sting 238299068 90915128 135278968 41% /sting vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina bh0:/home 149336320 124649364 17101068 88% /d/bh0/home >>> Executing as root@142.103.237.6 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3232208 2507828 57% / /dev/hda6 6581928 500464 6081464 8% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602048 152306256 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.7 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226756 2513280 57% / /dev/hda6 6581928 795948 5785980 13% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602060 152306244 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 
191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.8 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226696 2513340 57% / /dev/hda6 6581928 550196 6031732 9% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602060 152306244 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.9 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226612 2513424 57% / /dev/hda6 6581928 591564 5990364 9% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602060 152306244 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as 
root@142.103.237.10 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3226964 2513072 57% / /dev/hda6 6581928 642328 5939600 10% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602060 152306244 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.11 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3227232 2512804 57% / /dev/hda6 6581928 601760 5980168 10% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2 vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2 vnfe4:/home 1014438796 810602060 152306244 85% /d/vnfe4/home bh0:/home 149336320 124649364 17101068 88% /d/bh0/home vn63:/mariah 239334880 99066624 128110668 44% /mariah vn64:/whitney 239334880 114070436 113106856 51% /whitney vn65:/elvis 191264348 8175008 173373632 5% /elvis vn66:/eminem 114349864 46889856 61651308 44% /eminem vn67:/xtina 114349864 92289508 16251656 86% /xtina vn62:/sting 238299068 90915128 135278968 41% /sting >>> Executing as root@142.103.237.12 Filesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 6047220 3218752 2521284 57% / /dev/hda6 6581928 696248 5885680 11% /scratch vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home 
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.13
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226988 2513048 57% /
/dev/hda6 6581928 721420 5860508 11% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.14
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226728 2513308 57% /
/dev/hda6 6581928 596152 5985776 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.15
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226856 2513180 57% /
/dev/hda6 6581928 600160 5981768 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.16
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226416 2513620 57% /
/dev/hda6 6581928 561668 6020260 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.17
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226544 2513492 57% /
/dev/hda6 6581928 574644 6007284 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.18
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3483908 2256128 61% /
/dev/hda6 6581928 577112 6004816 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.19
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3646692 2093344 64% /
/dev/hda6 6581928 3336888 3245040 51% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.20
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226088 2513948 57% /
/dev/hda6 6581928 581092 6000836 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602064 152306240 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.21
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3227504 2512532 57% /
/dev/hda6 6581928 588544 5993384 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602072 152306232 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.22
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226392 2513644 57% /
/dev/hda6 6581928 840336 5741592 13% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602072 152306232 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.23
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226288 2513748 57% /
/dev/hda6 6581928 618004 5963924 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602072 152306232 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.24
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3235496 2504540 57% /
/dev/hda6 6581928 587356 5994572 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.25
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6040288 3449524 2283928 61% /
/dev/hda6 13156256 1304388 11851868 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.26
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226196 2513840 57% /
/dev/hda6 6581928 590700 5991228 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.27
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226504 2513532 57% /
/dev/hda6 6581928 601152 5980776 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.28
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3232768 2507268 57% /
/dev/hda6 6581928 529980 6051948 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.29
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3225880 2514156 57% /
/dev/hda6 6581928 631908 5950020 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602076 152306228 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.30
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226144 2513892 57% /
/dev/hda6 6581928 978300 5603628 15% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.31
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3227392 2512644 57% /
/dev/hda6 6581928 586664 5995264 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.32
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3225736 2514300 57% /
/dev/hda6 6581928 583312 5998616 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915128 135278968 41% /sting
>>> Executing as root@142.103.237.33
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226864 2513172 57% /
/dev/hda6 6581928 552384 6029544 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.34
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226084 2513952 57% /
/dev/hda6 6581928 787672 5794256 12% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.35
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226560 2513476 57% /
/dev/hda6 6581928 482164 6099764 8% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.36
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3227848 2512188 57% /
/dev/hda6 6581928 419304 6162624 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.37
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226420 2513616 57% /
/dev/hda6 6581928 480920 6101008 8% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.38
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226444 2513592 57% /
/dev/hda6 6581928 578000 6003928 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.39
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226788 2513248 57% /
/dev/hda6 6581928 734012 5847916 12% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.40
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226132 2513904 57% /
/dev/hda6 6581928 431816 6150112 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.41
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226684 2513352 57% /
/dev/hda6 6581928 460652 6121276 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.42
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3225136 2514900 57% /
/dev/hda6 6581928 479700 6102228 8% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602084 152306220 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.43
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226868 2513168 57% /
/dev/hda6 6581928 742928 5839000 12% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602088 152306216 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.44
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226532 2513504 57% /
/dev/hda6 6581928 480960 6100968 8% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602088 152306216 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.45
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226536 2513500 57% /
/dev/hda6 6581928 445604 6136324 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602088 152306216 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.46
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226044 2513992 57% /
/dev/hda6 6581928 407348 6174580 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602088 152306216 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.47
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226148 2513888 57% /
/dev/hda6 6581928 702172 5879756 11% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810602088 152306216 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.48
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3225988 2514048 57% /
/dev/hda6 6581928 448532 6133396 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810603392 152304912 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.49
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226216 2513820 57% /
/dev/hda6 6581928 450944 6130984 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607276 152301028 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.50
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226040 2513996 57% /
/dev/hda6 6581928 682580 5899348 11% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607276 152301028 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651308 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.51
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226988 2513048 57% /
/dev/hda6 6581928 457060 6124868 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.52
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 4154972 1585064 73% /
/dev/hda6 6581928 148420 6433508 3% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.53
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226120 2513916 57% /
/dev/hda6 6581928 436368 6145560 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.54
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226128 2513908 57% /
/dev/hda6 6581928 321000 6260928 5% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
>>> Executing as root@142.103.237.55
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226264 2513772 57% /
/dev/hda6 6581928 414552 6167376 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
>>> Executing as root@142.103.237.56
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226792 2513244 57% /
/dev/hda6 6581928 411460 6170468 7% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.57
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226304 2513732 57% /
/dev/hda6 6581928 324372 6257556 5% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.58
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226644 2513392 57% /
/dev/hda6 6581928 327320 6254608 5% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.59
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226288 2513748 57% /
/dev/hda6 6581928 584616 5997312 9% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.60
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3226152 2513884 57% /
/dev/hda6 6581928 629184 5952744 10% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110668 44% /mariah
vn64:/whitney 239334880 114070436 113106856 51% /whitney
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.61
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda1 6047220 3225776 2514260 57% /
/dev/hda6 6581928 347384 6234544 6% /scratch
vnfe1:/home 10958172 8767812 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496684 14293744 3202940 82% /d/vnfe1/home2
vnfe1:/home3 17496684 11856208 5640476 68% /d/vnfe2/home
vnfe3:/home 10958172 9420480 1537692 86% /d/vnfe3/home
vnfe3:/home2 17066300 14655236 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438796 810607284 152301020 85% /d/vnfe4/home
bh0:/home 149336320 124649364 17101068 88% /d/bh0/home
vn65:/elvis 191264348 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889860 61651304 44% /eminem
vn67:/xtina 114349864 92289508 16251656 86% /xtina
vn62:/sting 238299068 90915132 135278964 41% /sting
>>> Executing as root@142.103.237.62
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 238299068 90915132 135278964 41% /
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607288 152301024 85% /d/vnfe4/home
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110672 44% /mariah
vn64:/whitney 239334880 114070432 113106856 51% /whitney
vn65:/elvis 191264352 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651304 44% /eminem
vn67:/xtina 114349864 92289504 16251656 86% /xtina
>>> Executing as root@142.103.237.63
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 239334880 99066624 128110668 44% /
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607296 152301016 85% /d/vnfe4/home
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vn62:/sting 238299072 90915136 135278968 41% /sting
vn64:/whitney 239334880 114070432 113106856 51% /whitney
vn65:/elvis 191264352 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651304 44% /eminem
vn67:/xtina 114349864 92289504 16251656 86% /xtina
boson:/mandrake 115377640 83677928 25838800 77% /mandrake
>>> Executing as root@142.103.237.64
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 239334880 114070436 113106856 51% /
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607296 152301016 85% /d/vnfe4/home
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vn63:/mariah 239334880 99066624 128110672 44% /mariah
vn65:/elvis 191264352 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651304 44% /eminem
vn67:/xtina 114349864 92289504 16251656 86% /xtina
boson:/mandrake 115377640 83677928 25838800 77% /mandrake
vn62:/sting 238299072 90915136 135278968 41% /sting
>>> Executing as root@142.103.237.65
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda5 191264348 8175008 173373632 5% /
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607296 152301016 85% /d/vnfe4/home
vn62:/sting 238299072 90915136 135278968 41% /sting
vn63:/mariah 239334880 99066624 128110672 44% /mariah
vn64:/whitney 239334880 114070432 113106856 51% /whitney
vn66:/eminem 114349864 46889856 61651304 44% /eminem
vn67:/xtina 114349864 92289504 16251656 86% /xtina
boson:/mandrake 115377640 83677928 25838800 77% /mandrake
>>> Executing as root@142.103.237.66
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda5 114349864 46889860 61651304 44% /
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607296 152301016 85% /d/vnfe4/home
vn62:/sting 238299072 90915136 135278968 41% /sting
vn63:/mariah 239334880 99066624 128110672 44% /mariah
vn64:/whitney 239334880 114070432 113106856 51% /whitney
vn65:/elvis 191264352 8175008 173373632 5% /elvis
vn67:/xtina 114349864 92289504 16251656 86% /xtina
boson:/mandrake 115377640 83677928 25838800 77% /mandrake
>>> Executing as root@142.103.237.67
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda5 114349864 92289508 16251656 86% /
vnfe1:/home 10958176 8767816 2190360 81% /d/vnfe1/home
vnfe1:/home2 17496688 14293744 3202944 82% /d/vnfe1/home2
vnfe1:/home3 17496688 11856208 5640480 68% /d/vnfe2/home
vnfe3:/home 10958176 9420480 1537696 86% /d/vnfe3/home
vnfe3:/home2 17066304 14655240 1522272 91% /d/vnfe3/home2
vnfe4:/home 1014438800 810607296 152301016 85% /d/vnfe4/home
bh0:/home 149336320 124649360 17101072 88% /d/bh0/home
vn62:/sting 238299072 90915136 135278968 41% /sting
vn63:/mariah 239334880 99066624 128110672 44% /mariah
vn64:/whitney 239334880 114070432 113106856 51% /whitney
vn65:/elvis 191264352 8175008 173373632 5% /elvis
vn66:/eminem 114349864 46889856 61651304 44% /eminem
boson:/mandrake 115377640 83677928 25838800 77% /mandrake
>>> Executing as root@142.103.237.69

# All OK and END OF INCIDENT
Wed Aug 2 10:46:08 PDT 2006

# make export

############################################################
CRASH_207
############################################################
Mon Aug 28 08:02:04 PDT 2006
############################################################

# As root@vnfe3
down
vn16 down 21+18:29

# Hard reboot with machine on KVM 2
ping vn16

# As root@vn16
date
vnSetdate
cd /var/log
view messages
# NOTHING APPARENT IN LOG --- MEMORY?

############################################################
CRASH_208
############################################################
Fri Sep 15 10:33:53 PDT 2006

# vn16 has been down for over 2 weeks, in machine room,
# kernel panic on console
#
# XXX-TODO: Replace memory

############################################################
CRASH_209
############################################################