Instruction Scheduling for the Intel® Itanium® 2 Processor

On the Itanium® processor, an MM instruction (variable shifts, etc.) must complete before an integer instruction (ALU, st, ld) uses its output; otherwise the pipeline is flushed. A pipeline flush costs ten cycles, so to avoid it the compiler must insert blocks of nops with stop bits after shift operations. These blocks are needed because the MM instructions have an average latency of four cycles: the integer instructions that use the outputs of the MM instructions are placed at least four cycles after the issue of the MM instructions.

On the Itanium 2 processor, these operations are scoreboarded, removing the risk of flushing the pipeline. Therefore:

The latency for such use is three cycles instead of four.
A subsequent use that issues before the data is ready simply stalls until it is.
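
The difference between the two processors can be illustrated with a small cycle-count model. This is only a sketch under simplifying assumptions, not a description of the actual hardware or compiler: the function names are hypothetical, `distance` is taken to mean the number of cycles between the issue of the MM instruction and the issue of the integer instruction that consumes its result, and the latencies and the flush penalty are the figures quoted above.

```python
# Toy cycle model. All names are hypothetical; the numbers come from the text above.
ITANIUM_MM_LATENCY = 4      # Itanium: consumer must issue at least 4 cycles later
ITANIUM_FLUSH_PENALTY = 10  # Itanium: cost of a pipeline flush
ITANIUM2_MM_LATENCY = 3     # Itanium 2: scoreboarded, 3-cycle latency


def itanium_nop_padding(distance: int) -> int:
    """Cycles of nops the compiler must add so the consumer is far enough away."""
    return max(0, ITANIUM_MM_LATENCY - distance)


def itanium_unpadded_penalty(distance: int) -> int:
    """Cycles lost on Itanium if the consumer is too close and no padding is added.

    Assumption: a too-early use costs exactly one flush penalty.
    """
    return ITANIUM_FLUSH_PENALTY if distance < ITANIUM_MM_LATENCY else 0


def itanium2_stall(distance: int) -> int:
    """Cycles the consumer stalls on Itanium 2 while waiting for the MM result."""
    return max(0, ITANIUM2_MM_LATENCY - distance)


if __name__ == "__main__":
    for distance in range(1, 6):
        print(
            f"distance={distance}: "
            f"Itanium padding={itanium_nop_padding(distance)}, "
            f"Itanium unpadded penalty={itanium_unpadded_penalty(distance)}, "
            f"Itanium 2 stall={itanium2_stall(distance)}"
        )
```

In this model, a consumer scheduled one cycle after the MM instruction needs three cycles of padding on the Itanium processor (or risks the ten-cycle flush), while on the Itanium 2 processor it stalls for two cycles with no padding at all.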

The example on the next page shows a comparison of the assembly code generated with and without the -G2 option.