Glossary

Acquire Hint

This hint applies to ld instructions. The load instruction becomes visible before all later data references; however, prior data references may become visible later.

Ambiguous Memory Accesses

Ambiguous memory accesses are a pair of memory accesses that may refer to the same address in memory.

Big-endian

A method of storing data so that the most significant byte appears in a lower-numbered location in memory.

Branch

The Intel® Itanium® architecture supports several types of branches. These include conditional and unconditional branches (jumps), function calls and returns, and loop branches.

Branch Handling

A branch instruction that is mispredicted incurs a misprediction penalty. The misprediction penalty gets higher as the depth and width of processors grow.

Cardinality

The range of numbers a data item can count.

Code Emission

The process of emitting the sequence of instructions for a function. Code emission can be done in text as a sequence of assembly instructions, or can be done in binary form into a .obj file.

Comparison Relations (crel)

The two source operands of the compare (cmp) instructions are compared for one of the following ten relations (crel):

*crel*	a related to b
eq	a==b
ne	a!=b
lt/ltu	a<b	signed/unsigned
le/leu	a<=b	signed/unsigned
gt/gtu	a>b	signed/unsigned
ge/geu	a>=b	signed/unsigned
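
The signed/unsigned distinction in the table matters because the same 64-bit pattern can order differently under the two interpretations. The following is a minimal C sketch of that difference, not Itanium code; the function names lt_signed and lt_unsigned are hypothetical.

```c
#include <stdint.h>

/* Signed "less than": operands are treated as two's-complement values. */
int lt_signed(int64_t a, int64_t b)
{
    return a < b;
}

/* Unsigned "less than": the same bit patterns are treated as unsigned. */
int lt_unsigned(uint64_t a, uint64_t b)
{
    return a < b;
}
```

With a = -1 (all bits set) and b = 1, the signed relation holds but the unsigned one does not, because the all-ones pattern is the largest unsigned value.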

Control Dependency

An instruction is control dependent if it depends on a branch instruction to execute.

Copy Propagation

Eliminates unnecessary assignments by using the value assigned to a variable in place of the variable itself. In many cases, the compiler can avoid using a register.
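
As an illustration only, the two hypothetical C functions below show the source-level effect of copy propagation. A compiler performs this on its internal representation; the function names are assumptions made for the example.

```c
/* Before: y is only a copy of x. */
int before_copy_propagation(int x, int z)
{
    int y = x;       /* copy assignment */
    return y + z;    /* use of the copy */
}

/* After: the use of y is replaced by x, so y and its register disappear. */
int after_copy_propagation(int x, int z)
{
    return x + z;
}
```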

Cycle Break

The cycle break (;;) indicates the end of an instruction group. It is placed in the code by the assembly writer or by the compiler.

Data Dependency

Instructions are considered to be data dependent if the first produces a result that is used by the second, or if the second instruction is data dependent on the first through a third instruction. Dependent instructions cannot be executed in parallel. You cannot change the execution sequence of dependent instructions.

Dead Store Elimination

Seeks to ensure that there is no store to the same memory location twice without an intervening read from that location.
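
A minimal C sketch of the idea follows. The names write_twice, write_once and result_of are hypothetical, and the example assumes the location is not volatile and not observed by anything else between the two stores.

```c
/* Before: the first store is dead, since it is overwritten with no read in between. */
void write_twice(int *p)
{
    *p = 1;   /* dead store */
    *p = 2;
}

/* After dead store elimination: only the final store remains. */
void write_once(int *p)
{
    *p = 2;
}

/* Helper for comparison: runs a writer on a local cell and returns the result. */
int result_of(void (*writer)(int *))
{
    int cell = 0;
    writer(&cell);
    return cell;
}
```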

Floating-point Comparison Relations (frel)

The two source operands of the floating-point compare (fcmp) instructions are compared for one of the following 12 relations (frel):

*frel*	f2 related to f3	*frel*	f2 related to f3
eq	f2==f3	neq	!(f2==f3)
lt	f2<f3	nlt	!(f2<f3)
le	f2<=f3	nle	!(f2<=f3)
gt	f2>f3	ngt	!(f2>f3)
ge	f2>=f3	nge	!(f2>=f3)
unord	f2?f3	ord	!(f2?f3)
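
The negated relations are listed separately because they are not the same as the opposite positive relation when an operand is a NaN. The C sketch below models three of the relations on the host machine; it is not Itanium code, and the function names are hypothetical.

```c
#include <math.h>

/* unord: true when at least one operand is a NaN. */
int frel_unord(double f2, double f3)
{
    return isunordered(f2, f3) != 0;
}

/* nlt: !(f2 < f3). True for unordered operands. */
int frel_nlt(double f2, double f3)
{
    return !(f2 < f3);
}

/* ge: f2 >= f3. False for unordered operands. */
int frel_ge(double f2, double f3)
{
    return f2 >= f3;
}
```

For ordered operands nlt and ge agree; with a NaN operand, nlt is true while ge is false.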

Floating-point Status Register (FPSR)

The Intel® Itanium® architecture provides four separate status fields (sf0-sf3) enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.

The FPSR contains the four status fields and a traps field that traps the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits which must be 0.

Fortran

In Fortran written for the Intel® Itanium® architecture, all pointers are 64-bit quantities.

General Register Stack

96 general registers, starting at r32, used to pass parameters to the called procedure and store local variables for the currently executing procedure.

Hide Memory Latency

The Intel® Itanium® architecture provides the means to hide memory latencies by:

allowing the compiler to schedule loads earlier in the code.
enabling memory hierarchy cache management.

High-level Optimizations

Include

resource optimizations: load-pair generation and loop unrolling
memory hierarchy optimizations: data prefetching, loop and register blocking, linear loop transformations, and scalar replacements

IA-32 Architecture

IA-32 is Intel’s 32-bit and 16-bit instruction set architecture supported on the Pentium® and P6 family of processors. See the Intel Architecture Software Developer’s Manual, Volume 2, “Instruction Set Reference Manual”, Order Number 243191, for detailed information.

Intel® Itanium® Architecture

The Itanium architecture is Intel's 64-bit architecture. It also provides full compatibility with Intel's 32-bit architecture, known as IA-32.

Intel® Itanium® Architecture Software Developer's Manual

The Intel® Itanium® Architecture Software Developer's Manual

Order numbers:

Volume 1 rev. 1.1: 245317-002
Volume 2 rev. 1.1: 245318-002
Volume 3 rev. 1.1: 245319-002
Volume 4 rev. 1.1: 245320-002

Immediate

An immediate is a numeric instruction operand.

Improve Branch Handling

The Intel® Itanium® architecture improves branch handling by:

providing the means to minimize branches in the code and to increase branch prediction rate for the remaining branches.
providing specific support for typical branches.

Increase ILP

The Itanium® architecture increases ILP by:

providing more architectural resources: large register files, and a 3-instruction wide word.
enabling the compiler/assembly writer to explicitly indicate parallelism.

Induction Variable

In their simplest form, induction variables are variables whose successive values form an arithmetic progression over some part of a program, usually a loop. Usually the loop's iterations are counted by an integer-valued variable that proceeds upward (or downward) by a constant amount with each iteration.

Instruction Level Parallelism

The ability to execute many instructions in parallel in multiple functional units during the same cycle.

Instruction Pointer (IP)

The 64-bit instruction pointer holds the address of the bundle of the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed. Branch instructions set the IP to a new value. The IP is always 16-byte aligned.

LC Application Register

The Loop Count (LC) register is a 64-bit counter used in counted loops. LC is decremented by counted-loop type branches.

ldf-Load Floating-point

Itanium® architecture assembly instruction that loads a single floating-point value into a register.

ldfp-Load Floating-point Pair

Itanium® architecture assembly instruction that loads two floating-point values into two registers simultaneously.

Little-endian

A method of storing data so that the least significant byte appears in a lower-numbered location in memory.
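
The two byte orders can be sketched in C as follows. The helpers load_be32, load_le32 and host_is_little_endian are hypothetical names for this example; they assemble a 32-bit value from four bytes in each order and probe the order used by the host.

```c
#include <stdint.h>
#include <string.h>

/* Big-endian: the most significant byte is at the lowest address. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Little-endian: the least significant byte is at the lowest address. */
uint32_t load_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Returns 1 if the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint32_t value = 0x01020304u;
    unsigned char bytes[4];
    memcpy(bytes, &value, sizeof value);
    return bytes[0] == 0x04;
}
```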

Loop Branch

The branch from the "bottom" of the loop to the "top" of the loop. The branch, if taken, continues the loop computation. If the branch is not taken, control exits out of the loop.

Loop Unrolling

A method used to improve the parallelism of a loop. The loop instructions are replicated and the end code adjusted to eliminate the branch.
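
A minimal C sketch of unrolling by a factor of four is shown below. The function names are hypothetical, and the unroll factor is an arbitrary choice for the example. The unrolled version executes one loop branch per four elements and uses four independent accumulators, which exposes parallelism; a short remainder loop handles the leftover elements.

```c
/* Original loop: one loop branch per element. */
long sum_rolled(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by four: the body is replicated and the end code is adjusted. */
long sum_unrolled4(const int *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long sum = s0 + s1 + s2 + s3;
    for (; i < n; i++)        /* remainder iterations */
        sum += a[i];
    return sum;
}
```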

Memory Disambiguation

The process of determining whether two or more pointers are pointing to the same memory location. In C/C++, it is possible to make two or more memory references access the same memory location. In Fortran, memory ambiguity is not a problem, due to language semantics.
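
The C sketch below illustrates why the compiler must disambiguate. In store_then_load (a hypothetical name), the value returned depends on whether p and q refer to the same location, so the final load cannot be replaced by the constant 1 unless the compiler can prove the two pointers are distinct. The two demo helpers are also assumptions made for the example.

```c
/* The result depends on whether p and q alias. */
int store_then_load(int *p, int *q)
{
    *p = 1;
    *q = 2;      /* may or may not overwrite *p */
    return *p;   /* must be reloaded unless p and q are known to differ */
}

/* Both pointers refer to the same object. */
int demo_aliased(void)
{
    int x = 0;
    return store_then_load(&x, &x);
}

/* The pointers refer to different objects. */
int demo_distinct(void)
{
    int x = 0, y = 0;
    return store_then_load(&x, &y);
}
```

In C99, declaring both parameters with the restrict qualifier is one way for the programmer to promise that the pointers do not alias.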

Memory Latency

The time that elapses between the issue of a load instruction and the moment when the result of that instruction can be used.

Hide memory latencies: Intel® Itanium® architecture provides the means to hide memory latencies by:

allowing the compiler to schedule loads earlier in the code
enabling memory hierarchy cache management

Modular Code Support

The Intel® Itanium® architecture supports the current compiler trend to produce modular code by providing specific hardware support for function calls and returns.

Modulo-scheduled Counted Loops

For modulo-scheduled counted loops, the calculation of whether the branch is taken or not depends on the Loop Count application register and on the epilog condition: whether the Epilog Counter application register is greater than one or not.

Use the modulo-scheduled counted loop instructions br.ctop and br.cexit when the loop decision is located at the bottom of the loop body and therefore a taken branch will continue the loop while a fall through branch will exit the loop.

These instructions are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.

Modulo-scheduled While Loops

For modulo-scheduled while loops, the calculation of whether the branch is taken or not depends on the qualifying predicate and on the epilog condition: whether the Epilog Counter application register is greater than one or not.

Use the modulo-scheduled while loop instructions br.wtop and br.wexit when the loop decision is located somewhere other than the bottom of the loop and therefore a fall-through branch will continue the loop and a taken branch will exit the loop.

Multiple Status Fields Registers

The Intel® Itanium® architecture supports 4 sets of control and status fields with the first being the main set. The multiple sets allow intermediate calculations to be performed on the alternate sets.

Multiply and Accumulate Instructions (fma)

The Intel® Itanium® architecture supports various arithmetic floating-point instructions to meet the common needs. For example, a floating-point multiply and add (fma), multiply and subtract (fms) and many more.

The fma instruction, with its four operands (f = a * b + c) forms the basis of all the floating-point arithmetic.

When c=0, this is a multiply
When b=1, this is an add
When b=1 and c=0, this is a normalization

The fma instruction also provides improved accuracy in multiply and add operations, since there is only one rounding stage, after the add.

NaT Bit/NaT Value (Not a Thing)

The NaT bit and NaTVal enable propagating exception tokens in general and floating-point registers:

General registers have an additional NaT bit. When the NaT bit is set to true (1) the value stored in the register is not valid.
Floating-point registers use a special instance of NaN, called NaTVal. NaTVal is used to propagate valid/invalid results of speculative loads of floating-point data.

Normal Compare Type

The normal (no ctype) compare instruction writes the compare result to one target, and the complement to the other.

Parallel Compare Types

The OR, AND, and OR and complement (or.andcm) compare instructions either write a specific answer to the predicate registers or leave them unchanged, depending on the result of the compare operation. This allows multiple simultaneous OR-type or multiple simultaneous AND-type compares to target the same predicate register.

Pointer-precision data types

Data types that are the same size as pointers.

POINTER_32

POINTER_32 is a 32-bit pointer.
In Win32, this is a native pointer.
In Win64, POINTER_32 is created by truncating a 64-bit pointer. All pointers are 64-bit on any 64-bit platform.

POINTER_64

POINTER_64 is a 64-bit pointer. In Win32, POINTER_64 is created by sign extending a 32-bit pointer. In Win64, this is a native pointer. Note that no assumptions should be made about pointer sign bits.
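
The sign extension and truncation described for POINTER_64 and POINTER_32 can be sketched with fixed-width integers in C. The function names are hypothetical, and the example assumes a two's-complement host; it models the bit patterns only and does not use real pointers.

```c
#include <stdint.h>

/* Widen a 32-bit pointer value to 64 bits by sign extension. */
uint64_t sign_extend_32_to_64(uint32_t p32)
{
    return (uint64_t)(int64_t)(int32_t)p32;
}

/* Narrow a 64-bit pointer value to 32 bits by truncation. */
uint32_t truncate_64_to_32(uint64_t p64)
{
    return (uint32_t)p64;
}
```

A 32-bit value with its top bit set, such as 0x80000000, widens to 0xFFFFFFFF80000000, which is one reason code should not make assumptions about the upper bits of a pointer.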

Polymorphism

The ability of one data item to have a different type depending on the way in which it is used.

Postpass Scheduling

Scheduling performed after register allocation in the backend of the compiler. The register allocator may introduce spills, or may get rid of MOV instructions. Blocks where such changes have been made are re-scheduled by the postpass scheduler.

Predicate Registers

64 one-bit predicate registers enable controlling the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:

validating/invalidating instructions
eliminating branches in if/then/else logic blocks

There are:

16 static predicate registers
48 rotating predicate registers for controlling software pipelining

Instructions that are not explicitly preceded by a predicate default to the first predicate register, p0, which is read-only and always true (1).

Predication

The conditional execution of instructions based on their predicate. When the predicate is true (1), the instruction is executed. When it is false (0), the instruction is treated as a NOP.
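
The following C sketch models how predication can replace an if/then/else branch. It is a software model only: the function names are hypothetical, the variables p1 and p2 stand in for a predicate and its complement, and a C compiler is free to generate branches for this code anyway.

```c
/* Branching form. */
int abs_diff_branch(int a, int b)
{
    if (a > b)
        return a - b;
    else
        return b - a;
}

/* Predicated form: each operation is guarded by a predicate
   and has no effect when that predicate is false. */
int abs_diff_predicated(int a, int b)
{
    int result = 0;
    int p1 = (a > b);   /* compare result */
    int p2 = !p1;       /* complement */
    if (p1) result = a - b;   /* executes only under p1 */
    if (p2) result = b - a;   /* executes only under p2 */
    return result;
}
```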

Prediction Strategy Hint

A prediction strategy hint describes how the processor should predict conditional branches. Depending on the value of the hint, the processor can predict the branch as a taken branch, can not predict it, or can base the prediction on a specified predicate which is set up in advance.

Procedure Frame

The subset of stacked registers visible to a procedure. The procedure frame contains a predefined number of input and output registers, to a maximum of 96 registers.

Procedure Stack

A contiguous array of memory locations, commonly referred to as “the stack”, used in many processors to save the state of the calling procedure, pass parameters to the called procedure and store local variables for the currently executing procedure.

Qualifying Predicate

A predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.

RAW (Read-After-Write) Dependency Violation

A type of data dependency between two instructions in one instruction group. The later instruction reads data from the same register to which the earlier instruction wrote.

Example:

add r4=r5,r6
mov r9=r4

A RAW data dependency exists between r4 in the first line and r4 in the second line.

Register Load and Store Instructions

Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. The Intel® Itanium® architecture supports loads and stores of all data types. Because registers are written as 64-bit, loads are zero-extended. Stores always write the exact number of bytes for the required format.

Release Hint

This hint is applicable to a st instruction. The store instruction becomes visible after all prior data references, however later data references may become visible earlier.

Representative Workload

The work performed is typical of the stress on the system under normal operating conditions.

Rotating Registers

Registers which are rotated by one register position on each loop execution. The logical names of the registers are rotated in a wrap-around fashion, so that logical register X is logical register X+1 after one rotation. The predicate, floating-point and general registers can be rotated.
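
The renaming can be modeled in C with a small array and a rotation offset. This is a sketch under stated assumptions: the file size NUM_ROTATING, the type RotatingFile and all function names are hypothetical and do not reflect the architectural register counts. A value written to logical register X is read back from logical register X+1 after one rotation, with wrap-around.

```c
#define NUM_ROTATING 8   /* hypothetical size, not the architectural count */

typedef struct {
    int phys[NUM_ROTATING];  /* physical storage */
    int base;                /* rotation offset, kept in [0, NUM_ROTATING) */
} RotatingFile;

/* Map a logical register number to its current physical slot. */
static int phys_index(const RotatingFile *rf, int logical)
{
    return (logical + rf->base) % NUM_ROTATING;
}

void rot_write(RotatingFile *rf, int logical, int value)
{
    rf->phys[phys_index(rf, logical)] = value;
}

int rot_read(const RotatingFile *rf, int logical)
{
    return rf->phys[phys_index(rf, logical)];
}

/* One rotation: decrement the offset modulo the file size. */
void rot_rotate(RotatingFile *rf)
{
    rf->base = (rf->base + NUM_ROTATING - 1) % NUM_ROTATING;
}

/* Write to logical X, rotate once, then read logical X+1 (wrapping). */
int value_after_rotation(int logical, int value)
{
    RotatingFile rf = { {0}, 0 };
    rot_write(&rf, logical, value);
    rot_rotate(&rf);
    return rot_read(&rf, (logical + 1) % NUM_ROTATING);
}
```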

Scaling Pointers

Use these pointers when casting a pointer to an integer for pointer arithmetic.

Scoreboarding

Technique that enables instructions to execute out of order when sufficient resources exist, and when no data dependencies exist.

The processor maintains a table that indicates the status of instructions and the registers to which they are writing.

Critical data dependency violations arise from any of the following:

Read After Write (RAW) data dependence
antidependence of name (WAR)
output dependence (WAW).

WAR and WAW benefit from register renaming, which leaves us with the RAW true dependency. Scoreboarding enables maximum concurrency limited only by the true RAW dependency and structural dependency violations.

Example:

.mfi
   nop.m 0
   fma f29=f28,f27,f26
   nop.i 0;;
.mfi
   nop.m 0
   fma f30=f29,f27,f26
   nop.i 0

On issue of the fma, the target register is marked "invalid data". This marking is removed once the operation has finished, four cycles later, and the valid result can be accessed.

If an instruction tries to read the data before the "invalid data" tag is removed, the new operation stalls until the data is ready.

The data in f29 isn't ready because the fma is a scoreboarded operation. Therefore the second fma stalls for three cycles.
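
The stall arithmetic in this example can be written out as a small C helper. The name stall_cycles is hypothetical, and the sketch assumes the consumer would otherwise issue one cycle after the producer, as the stop between the two bundles implies.

```c
/* Cycles a consumer waits when it issues issue_distance cycles after a
   producer whose result becomes available latency cycles after issue. */
int stall_cycles(int latency, int issue_distance)
{
    int stall = latency - issue_distance;
    return stall > 0 ? stall : 0;
}
```

With a four-cycle fma and an issue distance of one cycle, the helper yields the three-cycle stall described above.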

SIMD

Single Instruction Multiple Data (SIMD) technique. This technique speeds up performance by using one instruction to process multiple data elements in parallel.

Software Pipelining

Software pipelining is a method that enables the processor to execute, in any given time, several instructions in various stages of the loop.

Spatial Locality

Data with spatial locality is data with memory addresses close to the data or instructions currently in use.

Speculation

To hide memory access latencies, advanced load instructions (ld.a) move potentially data dependent loads earlier in the code, and control-speculative load instructions (ld.s) hoist loads above conditional branches.

Stage Predicates

Predicates that turn on or off instructions in a software-pipelined loop. A software-pipelined loop has several stages. Each instruction is executed in a particular stage and is predicated by the stage predicate corresponding to that stage.

Strength Reduction

Replaces expensive operations such as multiplications and divisions with less expensive ones such as additions and subtractions.
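
A minimal C sketch of strength reduction applied to a loop follows. The function names are hypothetical. In the second version the product i * stride is replaced by the variable offset, which advances by a constant amount on each iteration and is therefore an induction variable.

```c
/* Before: one multiplication per iteration. */
long sum_scaled_mul(int n, long stride)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += i * stride;
    return sum;
}

/* After: the multiplication is replaced by a running addition. */
long sum_scaled_add(int n, long stride)
{
    long sum = 0;
    long offset = 0;           /* always equals i * stride */
    for (int i = 0; i < n; i++) {
        sum += offset;
        offset += stride;
    }
    return sum;
}
```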

Templates

The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop.

The 24 available templates are listed below.

M - is a memory function
I - is an integer function
F - is a floating point function
B - is a branch function
L - is a function involving a long immediate
"s" indicates a stop.

* L+X is an extended type that is dispatched to the I-unit.

MII
MIsI
MLX*
MMI
MsMI
MFI
MMF
MIB
MBB
BBB
MMB
MFB

MIIs
MIsIs
MLXs*
MMIs
MsMIs
MFIs
MMFs
MIBs
MBBs
BBBs
MMBs
MFBs

Temporal Locality

Data with temporal locality is data that is likely to be reused. The older the data, the less likely the program is to use it again.

Trip Count

Loop count.

ulps

A measure of the error between an infinitely precise result and the actual machine result.
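
One common way to express such an error is as a count of representable values between two results (units in the last place). The C sketch below is an approximation under stated assumptions: the name ulp_distance is hypothetical, both inputs must be finite, non-negative IEEE 754 double values, and the reference value is itself a double rather than an infinitely precise result.

```c
#include <stdint.h>
#include <string.h>

/* Number of representable doubles between a and b.
   Assumes finite, non-negative IEEE 754 binary64 inputs. */
int64_t ulp_distance(double a, double b)
{
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);   /* reinterpret the bit patterns */
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```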

Unconditional Compare Type

The unconditional (unc) compare instruction first initializes both predicate targets to 0, independent of the qualifying predicate. It then operates the same as the normal type, writing the compare result to one target, and the complement to the other.
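
The difference between the normal and unconditional types shows up when the qualifying predicate is false. The C model below is a sketch, not Itanium code; the type PredPair and the functions cmp_normal and cmp_unc are hypothetical names. The argument qp is the qualifying predicate, result is the outcome of the comparison, and old holds the previous values of the two target predicates.

```c
typedef struct {
    int p1;   /* first target predicate */
    int p2;   /* second target predicate */
} PredPair;

/* Normal type: a false qualifying predicate leaves both targets unchanged. */
PredPair cmp_normal(int qp, int result, PredPair old)
{
    if (qp) {
        old.p1 = result;
        old.p2 = !result;
    }
    return old;
}

/* Unconditional type: both targets are cleared first, regardless of qp. */
PredPair cmp_unc(int qp, int result, PredPair old)
{
    old.p1 = 0;
    old.p2 = 0;
    if (qp) {
        old.p1 = result;
        old.p2 = !result;
    }
    return old;
}
```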

Uniform Data Model (UDM)

The Uniform Data Model (UDM) proposes to use identically named data types for both the Win32 and Win64 environments. By using this model, you can maintain a single source code development environment for both Win32 and Win64, provided no architecture specific design features are implemented.

WAW (Write-After-Write) Dependency Violation

A type of data dependency between two instructions in one instruction group. The two instructions write to the same register.

Example:

add r4=r5,r6
add r4=r5,r6

A WAW data dependency exists between r4 in the first line and r4 in the second line.

Assembly Directives

Predefined Section Directives

The predefined section directives define and switch between commonly-used sections. A predefined section directive creates a new section with the default flags and type attributes, and makes that section the current section. The predefined section directive mnemonics are the same as the section names.
The table below lists the predefined section directives and their default flags and type attributes.

Directive/ Section Name	Flags	Type	Usage
.text	"ax"	"progbits"	Read-only object code
.data	"wa"	"progbits"	Read-write initialized long data
.sdata	"was"	"progbits"	Read-write initialized short data
.bss	"wa"	"nobits"	Read-write uninitialized long data
.sbss	"was"	"nobits"	Read-write uninitialized short data
.rodata	"a"	"progbits"	Read-only long data (literals)
.srodat	"as"	"progbits"	Read-only short data (literals)
.comment	""	"progbits"	Comments in the object file

Include File Directive

To include the contents of another file in the current source file, use the .include directive in the following format:

.include "filename"

Where "filename" specifies a string constant. If the specified filename is an absolute pathname, the file is included.

Procedure Directives

The .proc and .endp directives combine code belonging to the same procedure.

The .proc directive marks the beginning of a procedure, and the .endp directive marks the end of a procedure. A single procedure may consist of several disjointed blocks of code. Each block should be individually bracketed with these directives. Name operands within a procedure can be used only for that specific procedure.

The following code sequence shows the basic format of a procedure:

.proc name,...
name:		    // label
...		    // instructions in procedure
.endp name,...

Where name represents one or more entry points of the procedure. Each entry point has a different name.
Name operands of the .endp directive are ignored.

Symbol Scope Declaration

Symbols are declared with global, weak, or local scope. Symbol scopes are used to resolve symbol references within one object file or between multiple object files. Symbol scopes are placed in the object file symbol table, and any reference to a symbol is resolved at link time. By default, symbols have a local scope, where they are available only to the assembly-language source file in which they are defined.

Declaring Local Scope

References to symbols with a local scope are resolved from within the object file in which the symbols are declared. Local symbols with the same name in different object files do not refer to the same entity.

Symbols have a local scope by default, so it is not necessary to declare symbols with local scopes. However, the .local directive is available for completeness. The .local directive has the following format:

.local name,name, ...

Where name represents a symbol name.

Declaring Global Scope

References to symbols with a global scope are resolved within the object file in which the symbols are declared, and within other object files. Global symbols with the same name in different object files refer to the same entity.

To declare one or more symbols with a global scope, use the .global directive. These symbols are flagged as global symbols for the linkage editor. The .global directive has the following format:

.global name,name, ...

Where name represents a symbol name.

Alignment Statement

The assembler automatically aligns instructions and data objects on the appropriate boundaries within a section. It aligns bundles on 16-byte boundaries, and data objects according to their size. The assembler does not align string data, since they are byte arrays.

To disable automatic alignment of data objects in data allocation statements, add the .ua completer after the data allocation mnemonic, for example, data4.ua.

Each section has an alignment attribute that is determined by the largest aligned object within the section.

Section location counters are not aligned automatically. To align the location counter in the current section to a specified alignment boundary, use the .align statement.

The .align statement has the following format:

.align expression

Where expression is an integer that specifies the alignment boundary for the location counter in the current section. The integer must be a power of two.
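
The effect of aligning a location counter to a power-of-two boundary can be sketched in C. The helpers is_power_of_two and align_up are hypothetical names for this example and are not part of the assembler; align_up assumes its boundary argument has already passed the power-of-two check.

```c
#include <stdint.h>

/* A valid alignment boundary is a nonzero power of two. */
int is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round counter up to the next multiple of boundary (a power of two). */
uint64_t align_up(uint64_t counter, uint64_t boundary)
{
    return (counter + boundary - 1) & ~(boundary - 1);
}
```

For example, a counter of 13 aligned to a 16-byte boundary advances to 16, while a counter already at 32 stays at 32.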

The .align statement enables the assembler to reserve space in any section type, including a "nobits" section. During program execution time the contents of a "nobits" section are initialized as zero by the operating system program loader. When using the .align statement in any other section type, the assembler initializes the reserved space with zeros for non-executable sections, and with a NOP pattern for executable sections.

Code Examples

C Code Example

The following example presents an opportunity to load data from memory before the control dependency.

#include <stddef.h>   /* NULL */

int add5(int *a)
{
  if (a==NULL)
    return (-1);
  else
    return (*a+5);   /* this load is control dependent on the test above */
}
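
The opportunity can be modeled in portable C as follows. This is a software sketch only: the type SpecInt and the functions speculative_load and add5_speculative are hypothetical, and the nat field merely imitates a deferred-exception token. The load and the add are performed before the test, and the result is used only when the test confirms that the pointer was valid.

```c
#include <stddef.h>

/* A loaded value together with a token marking it as invalid. */
typedef struct {
    int value;
    int nat;    /* 1 when the load could not be performed */
} SpecInt;

/* Models a speculative load: a bad address defers the fault instead of raising it. */
static SpecInt speculative_load(const int *a)
{
    SpecInt r = { 0, 1 };
    if (a != NULL) {
        r.value = *a;
        r.nat = 0;
    }
    return r;
}

int add5_speculative(const int *a)
{
    SpecInt t = speculative_load(a);   /* load moved above the test */
    int sum = t.value + 5;             /* computed early; ignored if t.nat is set */
    if (a == NULL)
        return -1;
    return sum;                        /* t.nat is known to be 0 on this path */
}
```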
