Glossary

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z


Assembly Directives
Predefined Section Directives
Include File Directive
Procedure Directives
Symbol Scope Declaration
Declaring Local Scope
Declaring Global Scope
Alignment Statement

C Code Example

A
Acquire Hint
Ambiguous Memory Accesses

back to top



B
Big-endian
Branch
Branch Handling

back to top



C
Cardinality
Code Emission
Comparison Relations (crel)
Control Dependency
Copy Propagation
Cycle Break

back to top



D
Data Dependency
Dead Store Elimination

back to top



F
Floating-point Comparison Relations (frel)
Floating-point Status Register (FPSR)
Fortran

back to top



G
General Register Stack

back to top



H
Hide Memory Latency
High-level Optimizations

back to top



I
IA-32 Architecture
Induction Variable
Intel® Itanium® Architecture
Intel® Itanium® Architecture Software Developer's Manual
Immediate
Improve Branch handling
Increase ILP
Instruction Level Parallelism (ILP)
Instruction Pointer (IP)

back to top



L
LC Application Register
ldf-Load Floating-point
ldfp-Load Floating-point Pair

Little-endian
Loop Branch
Loop Unrolling

back to top



M
Memory disambiguation
Memory latency
Modular Code Support
Modulo-scheduled Counted Loops
Modulo-scheduled While Loops
Multiple Status Fields Registers
Multiply and Accumulate Instructions (fma)

back to top



N
NaT Bit/NaT Value (Not a Thing)
Normal Compare Type

back to top



P
Parallel Compare Types
Pointer-precision data types
POINTER_32
POINTER_64
Polymorphism
Postpass Scheduling
Predicate Registers
Predication
Prediction Strategy Hint
Procedure Frame
Procedure stack

back to top



Q
Qualifying Predicate

back to top



R
RAW (Read-After-Write) Dependency Violation
Register Load and Store Instructions
Release Hint
Representative Workload
Rotating Registers

back to top



S
Scaling Pointers
Scoreboarding
SIMD
Spatial Locality
Speculation
Software Pipelining
Stage Predicates
Strength Reduction

back to top



T
Templates
Temporal Locality
Trip Count

back to top



U
ulps
Unconditional Compare Type

Uniform Data Model (UDM)

back to top




W
WAW (Write-After-Write) Dependency Violation

back to top



 


Acquire Hint

This hint applies to ld instructions. The load becomes visible before all later data references; prior data references, however, may become visible later.

back to top




Ambiguous Memory Accesses

Ambiguous memory accesses are a pair of memory accesses that may refer to the same address in memory.

back to top




Big-endian

A method of storing data so that the most significant byte appears in a lower-numbered location in memory.

back to top





Branch

The Intel® Itanium® architecture supports several types of branches. These include conditional and unconditional branches (jumps), function calls and returns, and loop branches.

back to top




Branch Handling

A branch instruction that is mispredicted incurs a misprediction penalty. The misprediction penalty gets higher as the depth and width of processors grow.

back to top




Cardinality

The range of numbers a data item can count.

back to top




Code Emission

The process of emitting the sequence of instructions for a function. Code emission can be done in text as a sequence of assembly instructions, or can be done in binary form into a .obj file.

back to top




Comparison Relations (crel)

The two source operands of the compare (cmp) instructions are compared for one of the following ten relations (crel):

crel      a related to b
eq        a==b
ne        a!=b
lt/ult    a<b     (signed/unsigned)
le/ule    a<=b    (signed/unsigned)
gt/ugt    a>b     (signed/unsigned)
ge/uge    a>=b    (signed/unsigned)

back to top




Control Dependency

An instruction is control dependent if it depends on a branch instruction to execute.

back to top




Copy Propagation

Eliminates unnecessary assignments by using the value assigned to a variable in place of the variable itself. In many cases, the compiler can avoid using a register.
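
As a rough illustration, the following hand-written C sketch shows the effect of copy propagation at source level. It is not actual compiler output, and the function names are hypothetical.

```c
/* Before: t is only a copy of x, yet it needs its own storage. */
int before_copy_propagation(int x, int y)
{
    int t = x;        /* copy assignment */
    return t + y;     /* use of the copy */
}

/* After: the use of t is replaced by x, so the assignment
   (and the register that would have held t) disappears. */
int after_copy_propagation(int x, int y)
{
    return x + y;
}
```

Both functions return the same value for every input; only the number of assignments differs.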

back to top




Cycle Break

The cycle break (;;) indicates the end of an instruction group. It is placed in the code by the assembly writer or by the compiler.

back to top




Data Dependency

Instructions are considered to be data dependent if the first produces a result that is used by the second, or if the second instruction is data dependent on the first through a third instruction. Dependent instructions cannot be executed in parallel. You cannot change the execution sequence of dependent instructions.

back to top




Dead Store Elimination

Seeks to ensure that there is no store to the same memory location twice without an intervening read from that location.
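
The idea can be sketched in C as follows. This is a hand-written before/after pair with hypothetical function names, assuming p does not point to volatile memory; a real compiler performs the transformation on its internal representation.

```c
/* Before: the first store is overwritten before anything reads it. */
void before_dse(int *p, int a, int b)
{
    *p = a;       /* dead store: no read of *p follows before the next store */
    *p = b;
}

/* After: only the store whose value can be observed remains. */
void after_dse(int *p, int a, int b)
{
    (void)a;      /* a no longer contributes to the result */
    *p = b;
}
```

After either call, the location holds b, so the first store in the original version did no useful work.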

back to top




Floating-point Comparison Relations (frel)

The two source operands of the floating-point compare (fcmp) instructions are compared for one of the following 12 relations (frel):

frel     f2 related to f3     frel     f2 related to f3
eq       f2==f3               neq      !(f2==f3)
lt       f2<f3                nlt      !(f2<f3)
le       f2<=f3               nle      !(f2<=f3)
gt       f2>f3                ngt      !(f2>f3)
ge       f2>=f3               nge      !(f2>=f3)
unord    f2?f3                ord      !(f2?f3)

back to top




Floating-point Status Register (FPSR)

The Intel® Itanium® architecture provides four separate status fields (sf0-sf3) enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.

The FPSR contains the four status fields and a traps field that traps the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits which must be 0.

back to top




Fortran

In Fortran written for the Intel® Itanium® architecture, all pointers are 64-bit quantities.

back to top




General Register Stack

96 general registers, starting at r32, used to pass parameters to the called procedure and store local variables for the currently executing procedure.

back to top




Hide Memory Latency

The Intel® Itanium® architecture provides the means to hide memory latencies by:

back to top




High-level Optimizations

Include

back to top




IA-32 Architecture

IA-32 is Intel’s 32-bit and 16-bit instruction set architecture supported on the Pentium® and P6 family of processors. See the Intel Architecture Software Developer’s Manual, Volume 2 “Instruction Set Reference Manual”, Order Number 243191, for detailed information.


Intel® Itanium® Architecture

The Itanium architecture is Intel's 64-bit architecture. It also provides full compatibility with Intel's 32-bit architecture, known as IA-32.

back to top




Intel® Itanium® Architecture Software Developer's Manual

The Intel® Itanium® Architecture Software Developer's Manual

Order numbers:

back to top




Immediate

An immediate is a numeric instruction operand.

back to top




Improve Branch handling

The Intel® Itanium® architecture improves branch handling by:

back to top




Increase ILP

The Itanium® architecture increases ILP by:

back to top




Induction Variable

In their simplest form, induction variables are variables whose successive values form an arithmetic progression over some part of a program, usually a loop. Usually the loop's iterations are counted by an integer-valued variable that proceeds upward (or downward) by a constant amount with each iteration.

back to top




Instruction Level Parallelism (ILP)

The ability to execute many instructions in parallel in multiple functional units during the same cycle.

back to top




Instruction Pointer (IP)

The 64-bit instruction pointer holds the address of the bundle that contains the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed. Branch instructions set the IP to a new value. The IP is always 16-byte aligned.

back to top




LC Application Register

The Loop Count (LC) register is a 64-bit counter used in counted loops. LC is decremented by counted-loop-type branches.

back to top




ldf-Load Floating-point

Itanium® architecture assembly instruction that loads a single floating-point value into a register.

back to top




ldfp-Load Floating-point Pair

Itanium® architecture assembly instruction that loads two floating-point values into two registers simultaneously.

back to top





Little-endian

A method of storing data so that the least significant byte appears in a lower-numbered location in memory.
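
To make the byte order concrete, the following C sketch stores the same 32-bit value in little-endian order and, for contrast, in big-endian order. The helper names are hypothetical and the code does not depend on the byte order of the host machine.

```c
#include <stdint.h>

/* Little-endian: least significant byte at the lowest address. */
void store_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)((v >> 8) & 0xFF);
    out[2] = (uint8_t)((v >> 16) & 0xFF);
    out[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Big-endian: most significant byte at the lowest address. */
void store_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)((v >> 24) & 0xFF);
    out[1] = (uint8_t)((v >> 16) & 0xFF);
    out[2] = (uint8_t)((v >> 8) & 0xFF);
    out[3] = (uint8_t)(v & 0xFF);
}
```

For the value 0x11223344, store_le32 places 0x44 at index 0, while store_be32 places 0x11 there.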

back to top




Loop Branch

The branch from the "bottom" of the loop to the "top" of the loop. The branch, if taken, continues the loop computation. If the branch is not taken, control exits out of the loop.

back to top





Loop Unrolling

A method used to improve the parallelism of a loop. The loop instructions are replicated and the end code adjusted to eliminate the branch.
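
A minimal C sketch of the technique, assuming an unroll factor of four (the factor and the function names are illustrative choices, not taken from this glossary). Partial unrolling such as this reduces the number of loop branches rather than removing them all; a cleanup loop handles the leftover elements.

```c
#include <stddef.h>

/* Original loop: one loop branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: one loop branch per four elements, and four
   independent accumulators that can be updated in parallel. */
long sum_unrolled4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* cleanup for n not divisible by four */
        s += a[i];
    return s;
}
```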

back to top




Memory Disambiguation

The process of determining whether two or more pointers are pointing to the same memory location. In C/C++, it is possible to make two or more memory references access the same memory location. In Fortran, memory ambiguity is not a problem, due to language semantics.
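
The following small C function (a hypothetical example) shows why the question matters: its result depends on whether the two pointers refer to the same location, so the compiler may not reorder the second store and the load unless it can prove that p and q are distinct.

```c
/* Returns 1 when p and q point to different objects,
   and 2 when they point to the same object. */
int store_store_load(int *p, int *q)
{
    *p = 1;
    *q = 2;       /* overwrites *p only if q aliases p */
    return *p;
}
```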

back to top




Memory latency

The time that elapses between the issue of a load instruction and the moment when the result of that instruction can be used.

Hide memory latencies: Intel® Itanium® architecture provides the means to hide memory latencies by:

back to top




Modular Code Support

The Intel® Itanium® architecture supports the current compiler trend to produce modular code by providing specific hardware support for function calls and returns.

back to top




Modulo-scheduled Counted Loops

For modulo-scheduled counted loops, the calculation of whether the branch is taken or not depends on the Loop Count application register and on the epilog condition: whether the Epilog Counter application register is greater than one or not.

Use the modulo-scheduled counted loop instructions br.ctop and br.cexit when the loop decision is located at the bottom of the loop body and therefore a taken branch will continue the loop while a fall through branch will exit the loop.

These instructions are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.
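
The branch decision described above can be modeled in C roughly as follows. This is a simplified sketch, not processor documentation: the struct and function names are hypothetical, register rotation and predicate updates are omitted, and the setup (LC = iterations - 1, Epilog Counter = number of pipeline stages) is an assumed convention for a loop with at least one iteration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the loop-control registers. */
struct loop_regs {
    uint64_t lc;   /* Loop Count application register */
    uint64_t ec;   /* Epilog Counter application register */
};

/* Returns true when the modeled br.ctop is taken (the loop continues). */
bool model_br_ctop(struct loop_regs *r)
{
    if (r->lc != 0) {      /* more iterations to start */
        r->lc--;
        return true;
    }
    if (r->ec > 1) {       /* epilog: drain the remaining stages */
        r->ec--;
        return true;
    }
    if (r->ec == 1)        /* last epilog pass: fall through */
        r->ec--;
    return false;
}

/* Counts passes through the loop body for n >= 1 iterations. */
unsigned count_kernel_passes(uint64_t n, uint64_t stages)
{
    struct loop_regs r = { n - 1, stages };
    unsigned passes = 1;            /* the body runs once before the first branch */
    while (model_br_ctop(&r))
        passes++;
    return passes;
}
```

Under these assumptions, a loop of n iterations with s stages passes through the body n + s - 1 times before the branch falls through.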

back to top




Modulo-scheduled While Loops

For modulo-scheduled while loops, the calculation of whether the branch is taken or not depends on the qualifying predicate and on the epilog condition: whether the Epilog Counter application register is greater than one or not.

Use the modulo-scheduled while loop instructions br.wtop and br.wexit when the loop decision is located somewhere other than the bottom of the loop and therefore a fall-through branch will continue the loop and a taken branch will exit the loop.

These instructions are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.

 

back to top




Multiple Status Fields Registers

The Intel® Itanium® architecture supports 4 sets of control and status fields with the first being the main set. The multiple sets allow intermediate calculations to be performed on the alternate sets.

back to top




Multiply and Accumulate Instructions (fma)

The Intel® Itanium® architecture supports various floating-point arithmetic instructions to meet common needs, for example floating-point multiply and add (fma), multiply and subtract (fms), and many more.

The fma instruction, with its four operands (f = a * b + c), forms the basis of all floating-point arithmetic.

The fma instruction also provides improved accuracy in multiply and add operations, since there is only one rounding stage, after the add.

back to top




NaT Bit/NaT Value (Not a Thing)

The NaT bit and NaTVal enable propagating exception tokens in general and floating-point registers:

back to top




Normal Compare Type

The normal (no ctype) compare instruction writes the compare result to one target, and the complement to the other.

back to top




Parallel Compare Types

The OR, AND, and OR-and-complement (or.andcm) compare instructions either write a specific value to the predicate registers or leave them unchanged, depending on the result of the compare operation. This allows multiple simultaneous OR-type or multiple simultaneous AND-type compares to target the same predicate register.

back to top




Pointer-precision data types

Data types that are the same size as pointers.

back to top




POINTER_32

POINTER_32 is a 32-bit pointer.
In Win32, this is a native pointer.
In Win64, POINTER_32 is created by truncating a 64-bit pointer. All pointers are 64-bit on any 64-bit platform.

back to top




POINTER_64

POINTER_64 is a 64-bit pointer. In Win32, POINTER_64 is created by sign extending a 32-bit pointer. In Win64, this is a native pointer. Note that no assumptions should be made about pointer sign bits.
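
The truncation and sign extension mentioned for POINTER_32 and POINTER_64 can be illustrated on plain integers. This is only a sketch in portable C with hypothetical helper names; it does not use the Windows types themselves.

```c
#include <stdint.h>

/* Truncation: keep the low 32 bits of a 64-bit pointer value. */
uint32_t truncate_to_32(uint64_t p64)
{
    return (uint32_t)(p64 & 0xFFFFFFFFu);
}

/* Sign extension: copy bit 31 into bits 32..63. */
uint64_t sign_extend_to_64(uint32_t p32)
{
    uint64_t wide = p32;
    if (p32 & 0x80000000u)
        wide |= 0xFFFFFFFF00000000ull;
    return wide;
}
```

A 32-bit value with bit 31 set therefore widens to a value whose upper 32 bits are all ones, which is one reason code should not make assumptions about pointer sign bits.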

back to top



 


Polymorphism

The ability of one data item to have a different type depending on the way in which it is used.

back to top




Postpass Scheduling

Scheduling performed after register allocation in the backend of the compiler. The register allocator may introduce spills, or may get rid of MOV instructions. Blocks where such changes have been made are re-scheduled by the postpass scheduler.

back to top




Predicate Registers

64 one-bit predicate registers enable controlling the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:

There are:

An instruction that is not explicitly preceded by a predicate defaults to the first predicate register, pr0, which is read-only and always true (1).

back to top




Predication

The conditional execution of instructions based on their predicate. When the predicate is true (1), the instruction is executed. When it is false (0), the instruction is treated as a NOP.
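
C has no predicate registers, so the following is only an analogy with hypothetical function names: the second function mimics predicated code by computing a predicate and its complement once and letting each guarded operation take effect only when its predicate is true.

```c
/* Branching form: control flow selects which subtraction runs. */
int abs_diff_branch(int a, int b)
{
    if (a > b)
        return a - b;
    else
        return b - a;
}

/* Predicated form: both operations are guarded, and an operation
   whose predicate is false leaves the result untouched, like a NOP. */
int abs_diff_predicated(int a, int b)
{
    int p_true = (a > b);       /* predicate */
    int p_false = !p_true;      /* complementary predicate */
    int r = 0;
    r = p_true ? (a - b) : r;   /* takes effect only if p_true */
    r = p_false ? (b - a) : r;  /* takes effect only if p_false */
    return r;
}
```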

back to top




Prediction Strategy Hint

A prediction strategy hint describes how the processor should predict conditional branches. Depending on the value of the hint, the processor can predict the branch as a taken branch, can not predict it, or can base the prediction on a specified predicate which is set up in advance.

back to top




Procedure Frame

The subset of stacked registers visible to a procedure. The procedure frame contains a predefined number of input and output registers, to a maximum of 96 registers.

back to top




Procedure stack

A contiguous array of memory locations, commonly referred to as “the stack”, used in many processors to save the state of the calling procedure, pass parameters to the called procedure, and store local variables for the currently executing procedure.

back to top




Qualifying Predicate

A predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.

back to top




RAW (Read-After-Write) Dependency Violation

A type of data dependency between two instructions in one instruction group. The later instruction reads data from the same register to which the earlier instruction wrote.

Example:
add r4=r5,r6
mov r9=r4

A RAW data dependency exists between r4 in the first line and r4 in the second line.

back to top




Register Load and Store Instructions

Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. The Intel® Itanium® architecture supports loads and stores of all data types. Because registers are written as 64-bit, loads are zero-extended. Stores always write the exact number of bytes for the required format.

back to top




Release Hint

This hint applies to st instructions. The store becomes visible after all prior data references; later data references, however, may become visible earlier.

back to top




Representative Workload

A workload in which the work performed is typical of the stress on the system under normal operating conditions.

back to top




Rotating Registers

Registers which are rotated by one register position on each loop execution. The logical names of the registers are rotated in a wrap-around fashion, so that logical register X is logical register X+1 after one rotation. The predicate, floating-point and general registers can be rotated.
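
The renaming can be pictured with a small C model. Everything here is an assumption made for illustration: the region size of eight, the struct layout, and the function names are hypothetical, and logical register numbers are counted from the start of the rotating region.

```c
#define NUM_ROT 8   /* size of the modeled rotating region (an assumption) */

struct rot_file {
    long phys[NUM_ROT];   /* physical registers */
    unsigned rrb;         /* current rotation offset */
};

/* Map a logical register number to its physical slot. */
static unsigned phys_index(const struct rot_file *f, unsigned logical)
{
    return (logical + f->rrb) % NUM_ROT;
}

void rot_write(struct rot_file *f, unsigned logical, long v)
{
    f->phys[phys_index(f, logical)] = v;
}

long rot_read(const struct rot_file *f, unsigned logical)
{
    return f->phys[phys_index(f, logical)];
}

/* One rotation: the offset is decremented modulo the region size,
   so the value that was in logical X is now seen as logical X+1. */
void rot_rotate(struct rot_file *f)
{
    f->rrb = (f->rrb + NUM_ROT - 1) % NUM_ROT;
}
```

No data moves during a rotation; only the mapping from logical names to physical registers changes, and the last logical register wraps around to the first.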

back to top




Scaling Pointers

Use these pointers when casting a pointer to an integer for pointer arithmetic.

 

back to top




Scoreboarding

Technique that enables instructions to execute out of order when sufficient resources exist, and when no data dependencies exist.

The processor maintains a table that indicates the status of instructions and the registers to which they are writing.

Critical data dependency violations arise from any of the following:

WAR and WAW benefit from register renaming, which leaves us with the RAW true dependency. Scoreboarding enables maximum concurrency limited only by the true RAW dependency and structural dependency violations.

Example:

.mfi
   nop.m 0
   fma f29=f28,f27,f26
   nop.i 0;;
.mfi
   nop.m 0
   fma f30=f29,f27,f26
   nop.i 0

On issue of the fma, the target register is marked "invalid data". This marking is removed once the operation has finished, four cycles later, and the valid result can be accessed.

If an instruction tries to read the data before the "invalid data" tag is removed, the new operation stalls until the data is ready.

The data in f29 isn't ready because the fma is a scoreboarded operation. Therefore the second fma stalls for three cycles.

 

back to top




SIMD

Single Instruction Multiple Data (SIMD) technique. This technique speeds up performance by using one instruction to process multiple data elements in parallel.

back to top




Software Pipelining

Software pipelining is a method that overlaps loop iterations, so that at any given time the processor executes instructions from several iterations, each in a different stage of the loop.

back to top




Spatial Locality

Data with spatial locality is data with memory addresses close to the data or instructions currently in use.

back to top




Speculation

To hide memory access latencies, advanced load instructions (ld.a) move potentially data dependent loads earlier in the code, and control-speculative load instructions (ld.s) hoist loads above conditional branches.

back to top




Stage Predicates

Predicates that turn on or off instructions in a software-pipelined loop. A software-pipelined loop has several stages. Each instruction is executed in a particular stage and is predicated by the stage predicate corresponding to that stage.

back to top




Strength Reduction

Replaces expensive operations such as multiplications and divisions with less expensive ones such as additions and subtractions.
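
A minimal C sketch of the transformation, with hypothetical function names: the multiplication by the loop index is replaced by a running sum that is updated with one addition per iteration.

```c
#include <stddef.h>

/* Before: one multiplication per iteration. */
void fill_multiply(long *out, size_t n, long stride)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (long)i * stride;
}

/* After: the product is carried in v and updated by addition. */
void fill_add(long *out, size_t n, long stride)
{
    long v = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = v;
        v += stride;
    }
}
```

Both functions write the same values as long as the products stay within the range of long.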

back to top





Templates

The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop.

The 24 available templates are listed below.

M - is a memory function
I - is an integer function
F - is a floating point function
B - is a branch function
L - is a function involving a long immediate
"s" indicates a stop.

* L+X is an extended type that is dispatched to the I-unit.

Without final stop    With final stop
MII                   MIIs
MIsI                  MIsIs
MLX*                  MLXs*
MMI                   MMIs
MsMI                  MsMIs
MFI                   MFIs
MMF                   MMFs
MIB                   MIBs
MBB                   MBBs
BBB                   BBBs
MMB                   MMBs
MFB                   MFBs

back to top




Temporal Locality

Data with temporal locality is data that is likely to be reused. The older the data, the less likely the program is to use it again.

back to top




Trip Count

Loop count: the number of times the body of a loop is executed.

back to top




ulps

Units in the last place. A measure of the error between an infinitely precise result and the actual machine result.
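
One common way to count the distance between two machine results in ulps is to compare their bit patterns. The sketch below is an illustration under stated assumptions, not a library routine: it assumes IEEE 754 double precision and two finite, non-negative inputs, and the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Number of representable doubles between a and b (0 if equal).
   Valid only for finite, non-negative IEEE 754 doubles, whose bit
   patterns are ordered the same way as their values. */
uint64_t ulp_distance(double a, double b)
{
    uint64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);   /* reinterpret the bits without aliasing issues */
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}
```

For example, 1.0 and the next larger double, 1.0 + 0x1p-52, are one ulp apart.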

back to top




Unconditional Compare Type

The unconditional (unc) compare instruction first initializes both predicate targets to 0, independent of the qualifying predicate. It then operates the same as the normal type, writing the compare result to one target, and the complement to the other.
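
The difference between the two compare types shows up only when the qualifying predicate is false. The following C model is a simplified sketch with hypothetical names; it represents the two predicate targets as a pair of booleans.

```c
#include <stdbool.h>

/* The two predicate targets of a compare. */
struct pred_pair {
    bool p1;
    bool p2;
};

/* Normal type: when the qualifying predicate is false,
   the instruction is a NOP and the targets keep their values. */
void cmp_normal(struct pred_pair *t, bool qp, bool result)
{
    if (qp) {
        t->p1 = result;
        t->p2 = !result;
    }
}

/* Unconditional type: both targets are cleared first,
   independent of the qualifying predicate. */
void cmp_unc(struct pred_pair *t, bool qp, bool result)
{
    t->p1 = false;
    t->p2 = false;
    if (qp) {
        t->p1 = result;
        t->p2 = !result;
    }
}
```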

back to top




Uniform Data Model (UDM)

The Uniform Data Model (UDM) proposes to use identically named data types for both the Win32 and Win64 environments. By using this model, you can maintain a single source code development environment for both Win32 and Win64, provided no architecture-specific design features are implemented.

back to top




WAW (Write-After-Write) Dependency Violation

A type of data dependency between two instructions in one instruction group. The two instructions write to the same register.

Example:
add r4=r5,r6
add r4=r5,r6

A WAW data dependency exists between r4 in the first line and r4 in the second line.

back to top




Assembly Directives


Predefined Section Directives

The predefined section directives define and switch between commonly-used sections. A predefined section directive creates a new section with the default flags and type attributes, and makes that section the current section. The predefined section directive mnemonics are the same as the section names.
The table below lists the predefined section directives, and their default flags and type attributes.

Directive/Section Name    Flags    Type          Usage
.text                     "ax"     "progbits"    Read-only object code
.data                     "wa"     "progbits"    Read-write initialized long data
.sdata                    "was"    "progbits"    Read-write initialized short data
.bss                      "wa"     "nobits"      Read-write uninitialized long data
.sbss                     "was"    "nobits"      Read-write uninitialized short data
.rodata                   "a"      "progbits"    Read-only long data (literals)
.srodat                   "as"     "progbits"    Read-only short data (literals)
.comment                  ""       "progbits"    Comments in the object file

back to top




Include File Directive

To include the contents of another file in the current source file, use the .include directive in the following format:

.include "filename"

Where "filename" specifies a string constant. If the specified filename is an absolute pathname, the file is included.

back to top




Procedure Directives

The .proc and .endp directives combine code belonging to the same procedure.

The .proc directive marks the beginning of a procedure, and the .endp directive marks the end of a procedure. A single procedure may consist of several disjoint blocks of code. Each block should be individually bracketed with these directives. Name operands within a procedure can be used only for that specific procedure.

The following code sequence shows the basic format of a procedure:

.proc name,...
name:		    // label
...		    // instructions in procedure
.endp name,...

Where name represents one or more entry points of the procedure. Each entry point has a different name.
Name operands of the .endp directive are ignored.

back to top




Symbol Scope Declaration

Symbols are declared with global, weak, or local scope. Symbol scopes are used to resolve symbol references within one object file or between multiple object files. Symbol scopes are placed in the object file symbol table, and any reference to a symbol is resolved at link time. By default, symbols have a local scope, where they are available only to the current assembly-language source file in which they are defined.

back to top




Declaring Local Scope

References to symbols with a local scope are resolved from within the object file in which the symbols are declared. Local symbols with the same name in different object files do not refer to the same entity.

Symbols have a local scope by default, so it is not necessary to declare symbols with local scopes. However, the .local directive is available for completeness. The .local directive has the following format:

.local name,name, ...

Where name represents a symbol name.

back to top




Declaring Global Scope

References to symbols with a global scope are resolved within the object file in which the symbols are declared, and within other object files. Global symbols with the same name in different object files refer to the same entity.

To declare one or more symbols with a global scope, use the .global directive. These symbols are flagged as global symbols for the linkage editor. The .global directive has the following format:

.global name,name, ...

Where name represents a symbol name.

back to top




Alignment Statement

The assembler automatically aligns instructions and data objects on the appropriate boundaries within a section. It aligns bundles on 16-byte boundaries, and data objects according to their size. The assembler does not align string data, since strings are byte arrays.

To disable automatic alignment of data objects in data allocation statements, add the .ua completer after the data allocation mnemonic, for example, data4.ua.

Each section has an alignment attribute that is determined by the largest aligned object within the section.

Section location counters are not aligned automatically. To align the location counter in the current section to a specified alignment boundary, use the .align statement.

The .align statement has the following format:

.align expression

Where expression is an integer that specifies the alignment boundary for the location counter in the current section. The integer must be a power of two.

The .align statement enables the assembler to reserve space in any section type, including a "nobits" section. During program execution time the contents of a "nobits" section are initialized as zero by the operating system program loader. When using the .align statement in any other section type, the assembler initializes the reserved space with zeros for non-executable sections, and with a NOP pattern for executable sections.
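
The arithmetic behind aligning a location counter to a power-of-two boundary can be sketched in C as follows. This illustrates the calculation only, with hypothetical function names; it is not the assembler's implementation.

```c
#include <stdint.h>

/* Nonzero when x is a power of two, as .align requires of its operand. */
int is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round counter up to the next multiple of boundary.
   boundary must be a power of two. */
uint64_t align_up(uint64_t counter, uint64_t boundary)
{
    return (counter + boundary - 1) & ~(boundary - 1);
}
```

With a boundary of 16, a counter of 1 advances to 16, while a counter that is already 16 stays where it is; the bytes skipped in between are the space the assembler reserves.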

back to top




Code Examples


C Code Example

The following example presents an opportunity to load data from memory before the control dependency.

#include <stddef.h>   /* NULL */

int add5(int *a)
{
  if (a==NULL)
    return (-1);
  else
    return (*a+5);
}

back to top