3 PGI CDK 4.1 Release Notes

3.1 Supported Systems and Licensing
3.2 PGI CDK 4.1 Contents
3.3 New Features
3.4 New Compiler Options
3.5 Problems Corrected in Release 4.1
3.6 PGCC C and C++ Compiler Notes
3.7 The PGI CDK 4.1 and libpthread
3.8 The PGI CDK 4.1 and glibc
3.9 The PGI BLAS LAPACK Libs
3.10 OpenMP Tutorial
3.11 Debugging with PGDBG

This document describes changes between PGI CDK 4.1 and previous releases, as well as late-breaking information not included in the current printing of the PGI User's Guide.

3.1 Supported Systems and Licensing

PGI CDK 4.1 is supported on systems using Intel Pentium, Pentium Pro/II/III/4, or compatible processors, including AMD Athlon/AthlonXP, running Linux with a kernel version of 2.2.10 or above. This includes versions of Linux that use glibc 2.1.x or later, such as Red Hat 6.0 to 9.0 and SuSE 6.1 to 8.2.

The PGI CDK Fortran, C, and C++ compilers are license-managed. The open source components of the PGI CDK, including MPI-CH, ScaLAPACK, and PBS, are not license-managed. For the PGI CDK compilers, the FLEXlm license manager controls the number of simultaneous users. When the PGI CDK compilers are first installed, they are usable for 15 days without a license key. Please contact PGI to obtain a permanent license key as soon as possible.

To make the PGI CDK compilers operational, you will need to follow the installation instructions in Section 1 above, including installation of the license daemon.

3.2 PGI CDK 4.1 Contents

Release 4.1 of the PGI CDK consists of the following PGI compilers and tools:

PGHPF data parallel High Performance Fortran Compiler
PGF90 native OpenMP and auto-threading Fortran 90 Compiler
PGF77 native OpenMP and auto-threading F77 Compiler
PGCC native OpenMP and auto-threading ANSI and K&R C compiler
PGC++ native OpenMP and auto-threading ANSI C++ Compiler
PGPROF graphical profiler
PGDBG graphical debugger

the following open source clustering utilities:

MPI-CH version 1.2.5, an implementation of the Message-Passing Interface (MPI) standard, compiled for use with the PGI compilers under Linux systems with a kernel revision of 2.2.10 or higher
ScaLAPACK linear algebra math library for distributed-memory systems, including BLACS version 1.1 (the Basic Linear Algebra Communication Subroutines) and ScaLAPACK version 1.7 for use with MPI-CH and the PGI compilers under Linux systems with a kernel revision of 2.2.10 or higher.
PBS portable batch queuing system from Veridian, version 2.3.15. PBS is automatically configured and built upon installation to ensure compatibility with the system on which it is installed

and the following documentation and tutorial materials:

OSC Training Materials - an extensive set of HTML-based parallel and scientific programming training materials developed by the Ohio Supercomputer Center
Complete online HTML Documentation for the PGI compilers and tools
Online UNIX man pages for all of the supplied software
A hard-copy CD-ROM media kit including the PGI User's Guide, MPI The Complete Reference, Volume 1, the High Performance Fortran Handbook, How to Build a Beowulf, and a printed copy of these release notes.

3.3 New Features

The following new features are included in the PGI CDK 4.1:

The MPI-CH libraries have been updated to version 1.2.5.
Both ssh and rsh are now supported for cluster process control.
The Rogue Wave Standard Template Library has been replaced with STLport, version 4.5. Rogue Wave is NOT supported in release 4.1.

3.4 New Compiler Options

No new compiler options are present in this release, though some existing options have been corrected to execute properly. As in the 4.0 release notes, the options for SSE/SSE2 type instructions are described here. For general performance purposes, we recommend users try these three sets of options first:

-fast, -fastsse, and -fastsse -Mnontemporal.

* -Mscalarsse - a newly introduced technology for CPUs that support SSE and/or SSE2 type instructions. Pentium III and AthlonXP CPUs support SSE instruction types, while the Pentium 4 CPU also supports the SSE2 instruction types. In prior releases only the vectorizing optimizations used these instructions, and now they are being utilized in all coding opportunities. Note: some versions of Linux have assemblers that do not support the newer SSE style instructions. This switch should only be used if your assembler `as' accepts these instructions. Older versions of Linux (for example, Red Hat 6.2) do not accept the SSE2 type instructions.

* -fastsse - This switch is equivalent to `-fast -Mscalarsse -Mvect=sse -Mcache_align -Mflushz', a set of optimizations that frequently work well together to improve performance. It is meant only for machines with SSE instruction support, such as the Pentium III and AthlonXP. It also works on the Pentium 4, where it uses both SSE and SSE2 instruction types. See the note above about your assembler version.

* -Mnontemporal - This switch is used together with the -fastsse or -Mvect=sse options. Some programs run slower with -fastsse because of the prefetching it uses; -Mnontemporal selects a different data movement scheme. When tuning code, it is good practice to try `-fastsse -Mnontemporal' in addition to `-fastsse' alone, to see which is faster.

3.5 Problems Corrected in Release 4.1

The following problems are corrected in the current release. A description of the problem is given, but some problems can only be described in general terms because of complexity or confidentiality.

Technical Problem Reports (TPRs) Corrected in 4.1-1
TPR	Lang	Description	Symptom
2823	pgf90	Reading real `3.2e4' with `I' edit descriptor in list directed integer input did not cause error message.	No error message produced during execution.
2826	pgf90	Intrinsic call with a derived data type caused an internal compiler error.	`pgf90 -c -Mfree xx.f` `ICE: mkexpr1: bad id 14 (xx.f: 185)`
2835	pgf90	Fails to detect error with -Mstandard `read(...,...,SIZE=n) ...` which should read `read(...,...,SIZE=n,ADVANCE='NO') ...`	No compiler error detected.
2837	pgf90	A merge operation causes a false -Mbounds error	`0: Subscript out of range for ...`
2839	pgcc	pgcc generates invalid assembly output with -Kpic, such as `cvtsi2ss $88,%xmm0`.	`pgcc xx.c -fastsse -Kpic` `Error: suffix or operands invalid for 'cvtsi2ss'`
2847	ALL	__STDC__, __PGI, and similar symbols cannot be undefined; pgxx -U__STDC__ fails.	__STDC__, __PGI and other symbols remain defined.
2849	pgf90 pgf77	programs compiled -I8 fail unless IO specifiers like IOSTAT are declared integer4 or logical4	compiler error messages
2851	pgf90	-Mstandard did not flag as an error the PGI f90 extension that allows derived type members to be allocatable.	pgf90 -Mstandard x.f does not report errors.
2855	pgf90 pgCC	pgf90 -U does not work. Related to 2847.	pgf90 -U xxx does not become -undef xxx
2865	pgCC	s=NULL;cout<<s fails	program fails
2867	pgCC	pgCC --one_instantiation_per_object p.cc	`C++ prelinker: executing --inf-loops`
2872	pgf90	Array constructors with 2 or more implied DOs using the same DO index fail.	`x = (/ (x(j), j=Nx/2+1, Nx), (x(j), j=1, Nx/2) /)` fails
2873	pgf90	A function declared in the interface block of a module, and later passed as an ARGUMENT to another subprogram from a program contained in that module, causes an internal compiler error.	`Internal compiler error. Errors in Lowering`
2874	pgf90	SCALE function in f90 gives wrong results for double precision.	wrong answers
2881	pgf90	`DOT_PRODUCT(CONJG(a(k1:k2)),u(i(k1:k2)))` returns wrong results	wrong results
2882	pgCC	C++ ignores `pragma omp' in templates	1 thread in omp areas
2902	pgf90	-fastsse causes internal compiler error	`ICE:replace_invar: nonzero subsc stride`
2907	pgf90	matmul gives wrong complex answers	wrong answers
2912	pgf90	program causes Internal Compiler Error	`ICE. exp_ref:IM_BASE op#2 not based sym`
2916	pgf77 pgf90	`OPEN(...,...,...,CONVERT='NATIVE') should ignore -byteswapio`	results swapped when they shouldn't be.

3.6 PGCC C and C++ Compiler Notes

The Rogue Wave Standard Template Library has been replaced with STLport, version 4.5. Rogue Wave is no longer supported. Users should look at the STLport license for any usage issues.

3.7 The PGI CDK 4.1 and libpthread

Previous releases of the PGI CDK Linux compiler products have included a customized version of libpthread.so called libpgthread.so. The purpose of this library is to give the user more thread stack space to run OpenMP and -Mconcur compiled programs. With Red Hat 8.0 and equivalent releases, libpthread.so and libpthread.a provide `re-sizeable' thread stack areas. In these cases:

1. The file $PGI/linux86/lib/libpgthread.so is a soft link to /usr/lib/libpthread.so.

2. Instead of `setenv MPSTKZ 256M', for example, to increase the libpgthread.so thread stack area, the shell command `limit stacksize 256M' now applies to thread stacks.

3.8 The PGI CDK 4.1 and glibc

Release 4.1 of the PGI CDK compilers and tools is built and validated under Linux kernels 2.2.10 through 2.4.x. Distributions of Linux from Red Hat 6.0 to 9.0 and SuSE 6.1 to 8.2 incorporate revision 2.2.10 or greater of the Linux kernel and glibc2.1.x or greater. If you are using a version of Linux that is supported by the 4.1 CDK release, the PGI installation script will automatically detect it and modify your installation as appropriate for that system.

3.9 The PGI BLAS LAPACK Libs

Precompiled versions of the BLAS and LAPACK math libraries are included for all supported Linux systems in the files $PGI/linux86/lib/libblas.a and $PGI/linux86/lib/liblapack.a. These can be linked in to your applications by simply placing the -llapack -lblas options on the link line:

% pgf77 myprog.F -llapack -lblas

Note that these libraries are compiled with switches that are relatively optimal but fully portable across the various IA32 architectures. In particular, they do not take advantage of SSE/SSE2 instructions, or prefetch instructions. If you would like to rebuild libblas.a and liblapack.a on a CPU that supports SSE/SSE2 type instructions, PGI recommends using the following options:

-fast -pc 64 -Mvect=sse -Mcache_align -Kieee

NOTE: slmach.f and dlmach.f must be compiled -O0!

If you would like to rebuild libblas.a and liblapack.a on an AMD Athlon, PGI recommends using the following options:

-fast -pc 64 -Mvect=prefetch -Kieee

As above, slmach.f and dlmach.f must be compiled -O0.

3.10 OpenMP Tutorial

A self-guided online tutorial is available to help you become familiar with OpenMP parallelization directives. In particular, the tutorial takes the user step by step through the process of parallelizing the NAS FT benchmark using OpenMP directives. The tutorial can be found at:

    ftp://ftp.pgroup.com/pub/SMP

You can download this file using a web browser, and unpack the file using the following commands:

       % gunzip fftpde.tar.gz
       % tar xvf fftpde.tar

Change directories to the fftpde sub-directory, and follow the instructions in the README file.

3.11 Debugging with PGDBG

3.11.1 PGDBG 4.1 Features

Note: Most of this information also appeared in the PGI CDK 4.0 Release Notes. It will remain in the release notes until it has been incorporated into a future release of the PGDBG User's Guide. PGDBG has received a number of corrections and is now supported under ssh, but otherwise its features have not changed since 4.0.

PGDBG 4.1 can debug SMP OpenMP (or Linux pthread) programs, as well as multiprocess cluster programs executed via mpirun. The PGI license file restricts the total number of threads and processes that PGDBG will debug.

PGDBG 4.1 supports ssh as well as rsh. Set the new environment variable PGRSH to ssh or rsh to select the communication method.

PGDBG's parallel debug capabilities are extensively documented at http://www.pgroup.com/docs.htm or at $PGI/doc/index.htm. This documentation is intended to supplement Chapter 15 of the PGI User's Guide.

The following enhancements are included in PGDBG 4.1:

* Combined Multi-process and Multi-thread Support

* SSH support

- be sure to `export PGRSH=ssh'

* Multi-process Support

- Process Ids obtained from mpirun

- Same source and debug info used for all processes

- full process control

- process grouping

- informative messages regarding state and location

* Process Control

- concise control of groups of processes

- process synchronization

- configurable process stop and wait modes

- serial, and process-only debug modes

* OpenMP & Linuxthread Support

- threads identified by OpenMP logical CPU ID

- automatic thread detection and attach

- full thread control in parallel regions

- thread grouping

- line level debugging preserved when a thread enters a parallel region, enters a serial region, hits an OpenMP barrier, hits an OpenMP synchronize statement, or enters an OpenMP sections program section

- informative messages regarding thread state and location

* Thread Control

- concise control of groups of threads

- thread synchronization

- configurable thread stop and wait modes

- serial, and threads-only debug modes

* GUI Enhancements

- Thread sub-window. Lists each thread by its logical CPU ID. Displays for each thread its state and stop location. Threads are grouped by parent process.

- Program I/O sub-window. Pops up automatically when program prints to stdout. The program I/O sub-window can also be raised from the Window menu.

- Output written to stdout by the process being debugged is no longer block buffered.

- Process grid. Displays each process as a color coded button in a grid. Click on a grid element to refresh the GUI in the scope of that process. Each grid element is numbered with the process's logical ID.

- Process grouping. Control processes in groups.

- Thread grid. Displays each thread as a color coded button in a grid. Click on a grid element to refresh the GUI in the scope of that thread. Each grid element is numbered with the thread's logical CPU ID.

- Thread grouping. Control threads in groups.

* Other Enhancements

- Better support for Fortran arrays and pointers

3.11.2 PGDBG 4.1 Technical Information

Here are a number of details not documented in the PGDBG User's Guide.

3.11.2.1 Threads and Signals

PGDBG intercepts all signals sent to any of the threads in a multi-threaded program, and passes them on according to that signal's disposition maintained by PGDBG (see the catch, ignore commands).

If a thread runs into a busy loop, or if the program deadlocks, type control-C in the debugger command window to interrupt the threads. This causes SIGINT to be sent to all threads. By default PGDBG does not relay SIGINT to any of the threads, so in most cases program behavior is not affected.

Sending a SIGINT (control-C) to a program while it is in the middle of initializing its threads (calling omp_set_num_threads(), or entering a parallel region) may kill some of the threads if the signal is sent before each thread is fully initialized. Avoid sending SIGINT in these situations. When the number of threads employed by a program is large, thread initialization may take a while.

3.11.2.2 Signals Used Internally by PGDBG

SIGTRAP indicates a breakpoint has been hit. A message is displayed whenever a thread hits a breakpoint. SIGSTOP is used internally by PGDBG. Its use is mostly invisible to the user. Changing the disposition of these signals in PGDBG will result in undefined behavior.

Reserved Signals: On linux86, the thread library uses SIGRT1, SIGRT3 to communicate among threads internally. In the absence of real-time signals in the kernel, SIGUSR1, SIGUSR2 are used. Changing the disposition of these signals in PGDBG will result in undefined behavior.

3.11.3 Scoping

Nested Subroutines

To reference a nested subroutine you must qualify its name with the name of its enclosing function using the scoping operator @.

For example:

subroutine subtest (ndim)
integer(4), intent(in) :: ndim
integer, dimension(ndim) :: ijk
call subsubtest ()
contains
    subroutine  subsubtest ()
    integer :: I
    i=9
    ijk(1) = 1
    end subroutine subsubtest
    subroutine  subsubtest2 ()
    ijk(1) = 1
    end subroutine subsubtest2
end subroutine subtest           
program testscope
integer(4), parameter :: ndim = 4
call subtest (ndim)
end program testscope

pgdbg> break subtest@subsubtest
breakpoint set at: subsubtest line: 8 in "ex.f90" address: 0x80494091
pgdbg> names subtest@subsubtest 
i = 0
pgdbg> decls subtest@subsubtest 
arguments:
variables:
integer*4 i;
pgdbg> whereis subsubtest
function:       "ex.f90"@subtest@subsubtest

Fortran 90 Modules

To access a member mm of a Fortran 90 module M you must qualify mm with M using the scoping operator @. If the current scope is M the qualification can be omitted.

For example:

module M
implicit none
real mm
contains
subroutine stub
print *,mm
end subroutine stub
end module M
program test
use M
implicit none
call stub()
print *,mm
end program test

pgdbg> Stopped at 0x80494e3, function MAIN, file M.f90, line 13
#13:       call stub()
pgdbg> which mm
"M.f90"@m@mm
pgdbg> print "M.f90"@m@mm
0
pgdbg> names m
mm = 0
stub = "M.f90"@m@stub
pgdbg> decls m
real*4 mm;
subroutine stub();
pgdbg> print m@mm
0
pgdbg> break stub
breakpoint set at: stub line:6 in "M.f90" address: 0x8049446      1
pgdbg> c
Stopped at 0x8049446, function stub, file M.f90, line 6
Warning: Source file M.f90 has been modified more recently than object file
#6:           print *,mm
pgdbg> print mm
0
pgdbg>

3.11.4 Lexical Blocks

Line numbers are used to name lexical blocks. The line number of the first instruction contained by a lexical block indicates the start scope of the lexical block.

Below variable var is declared in the lexical block starting at line 5. The lexical block has the unique name "lex.c"@main@5. The variable var declared in "lex.c"@main@5 has the unique name "lex.c"@main@5@var.

For example:

lex.c:
main()
{
    int var = 0;
    {
        int var = 1;
        printf("var %d\n",var);
    }
    printf("var %d\n",var);
}
pgdbg> n
Stopped at 0x8048b10, function main, file
/home/pete/pgdbg/bugs/workon3/ctest/lex.c, line 6
#6:         printf("var %d\n",var);
pgdbg> print var
1
pgdbg> which var
"lex.c"@main@5@var
pgdbg> whereis var
variable:       "lex.c"@main@var
variable:       "lex.c"@main@5@var
pgdbg> names "lex.c"@main@5
var = 1

3.11.5 Private Variables

PGDBG understands private variables with some restrictions. In particular, inspecting private variables while debugging FORTRAN programs is not supported.

Private variables in C must be declared in the enclosing lexical block of the parallel region in order for them to be visible using PGDBG.

For example:

{
    #pragma omp parallel    
    {
        int i;
        ...
        /* i is private to 'this' thread */
        ...
    }
}

In the above case, i would be visible inside PGDBG for each thread. However, in the following example, i is not visible inside PGDBG:

{
    int i;
    #pragma omp parallel private(i)  
    {
        ...
        /* i is private to 'this' thread 
           but not visible within PGDBG */
        ...
    }
}

A private variable of a thread A is accessed by switching the current thread to A and using the name (qualified if necessary) of the private variable.

3.11.6 Graphical User Interface (GUI) Notes

3.11.6.1 Setting the Font

Use the xlsfonts command to list all fonts installed on your system, then choose one you like. For this example, we choose a sony font that is completely specified by the following string:

-sony-fixed-medium-r-normal--24-230-75-75-c-120-iso8859-1

There are two ways to set the font that your PGDBG GUI uses.

1. Use your .Xresources file:

Xpgdbg*font : <chosen font>
pgdbg*font : <chosen font>

For example:

pgdbg*font : -sony-fixed-medium-r-normal--24-230-75-75-c-120-iso8859-1

You will have to merge these changes into your X environment for them to take effect. You can use the following command:

       % xrdb -merge $HOME/.Xresources

2. Use the command-line option -fn <font>. For example:

% pgdbg -fn -sony-fixed-medium-r-normal--0-0-100-100-c-0-jisx0201.1976-0...

3.11.6.2 Control-C from GUI

For control-C to interrupt the program being debugged, the active window must be the command window (upper window), where the PGDBG prompt appears.

3.11.6.3 Shared Object Files

PGDBG supports debugging of dynamically linked executables that reference shared object files created using the compilers. If the executable being debugged is dynamically linked, PGDBG will report when each shared object is loaded and/or unloaded.

For example:

  pgdbg> ...
  pgdbg> n
  Stopped at 0x8048bee, function main, file   
  dynload.c, line 36
  #36: handle = dlopen("libpetesSO2.so",RTLD_NOW);
  pgdbg> n
  libpetesSO2.so loaded by ld-linux.so.2.
  Stopped at 0x8048c31, function main, file
  dynload.c, line 41
  #41:       if (handle){
  pgdbg> n
  Stopped at 0x8048c37, function main, file
  dynload.c, line 42
  #42:         dlclose(handle);
  pgdbg> n
  libpetesSO2.so unloaded by ld-linux.so.2.
  Stopped at 0x8048c42, function main, file
  dynload.c, line 45
  #45:     }
  pgdbg> ...

The global symbols defined by a dynamically linked shared object are visible during a PGDBG debug session. These symbols are currently available only without type and line number information. The machine level PGDBG commands (breaki, dump, hwatch, disasm, etc) are useful for inspecting these symbols. Each symbol is available with respect to the load status of its defining shared object.

For example, dynamically-linkable Position Independent Code (PIC) is implemented using a Procedure Linkage Table (PLT) and Global Offset Table (GOT). Each PIC function is bound lazily at run-time. If a function has not been linked dynamically, PGDBG reports the address of its PLT entry as its address. If a function has been linked dynamically, PGDBG reports the virtual address of the function itself. So, PGDBG reports the current or "effective" address of symbols with respect to dynamic linking and loading. PGDBG treats global symbols defined in shared objects in a similar way. The address of a global variable may be the address of its GOT entry or an absolute address, depending in part on its load status.