Cacheability Support Using Streaming SIMD Extensions

The following intrinsics provide ways to make efficient use of the cache.

void _mm_prefetch(char * p, int i )

(uses PREFETCH)

Loads one cache line of data from address p to a location "closer" to the processor. The value i specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used, corresponding to the type of prefetch instruction.
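
As an illustration of how the hint constant is passed, the following C sketch sums an array while prefetching ahead of the element currently being read. The function name sum_with_prefetch and the distance PREFETCH_AHEAD are assumptions made for this example and are not part of the intrinsics; a suitable prefetch distance depends on the processor and the workload.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumed tuning value: how many floats ahead of the current position
   to prefetch. 16 floats = 64 bytes, a common cache line size. */
#define PREFETCH_AHEAD 16

/* Hypothetical helper: sums n floats, requesting data ahead of use. */
float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* Issue one prefetch per group of PREFETCH_AHEAD elements,
           and only for addresses that lie inside the array. */
        if (i % PREFETCH_AHEAD == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint: the result of the function is the same with or without it, so the call can be added or removed while tuning without affecting correctness.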

void _mm_stream_pi(__m64 * p, __m64 a )

(uses MOVNTQ)

Stores the data in a to the address p without polluting the caches. Because this intrinsic operates on an MMX register, you must empty the multimedia state afterward. See the topic The EMMS Instruction: Why You Need It and When to Use It.
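
A minimal sketch of this sequence in C follows, assuming a compiler that still provides the MMX intrinsics (some 64-bit toolchains do not). The helper name stream_store_u64 is hypothetical. The call to _mm_empty issues EMMS once the MMX work is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_pi, _mm_sfence; also pulls in the MMX intrinsics */

/* Hypothetical helper: writes the 64-bit value (hi:lo) to *dst with a
   non-temporal store, then clears the multimedia state. */
void stream_store_u64(uint64_t *dst, uint32_t hi, uint32_t lo)
{
    assert(dst != NULL);
    __m64 v = _mm_set_pi32((int)hi, (int)lo); /* lo goes in the low 32 bits */
    _mm_stream_pi((__m64 *)dst, v);           /* store that bypasses the caches */
    _mm_sfence();                             /* order the store before later stores */
    _mm_empty();                              /* EMMS: empty the multimedia state */
}
```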

void _mm_stream_ps(float * p, __m128 a )

(uses MOVNTPS)

Stores the data in a to the address p without polluting the caches. The address must be 16-byte-aligned.

void _mm_sfence(void)

(uses SFENCE)

Guarantees that every preceding store is globally visible before any subsequent store.
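
The streaming store and the store fence are often used together. The following C sketch fills a buffer with non-temporal stores and then issues a fence so that the stores are globally visible before any store that follows. The helper name fill_streaming is hypothetical, and the sketch assumes that the destination is 16-byte-aligned and that the element count is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: sets dst[0..n) to value without polluting the caches.
   Assumes dst is 16-byte-aligned and n is a multiple of 4. */
void fill_streaming(float *dst, size_t n, float value)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* alignment required by _mm_stream_ps */
    assert(n % 4 == 0);
    __m128 v = _mm_set1_ps(value);        /* four copies of value */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);        /* non-temporal store of four floats */
    _mm_sfence();                         /* make the stores globally visible */
}
```

A caller can obtain suitably aligned storage with, for example, a C11 declaration such as _Alignas(16) float buf[64];.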

void _mm_pause(void)

The execution of the next instruction is delayed for an implementation-specific amount of time. The instruction does not modify the architectural state. This intrinsic can provide an especially significant performance gain and is described in more detail below.

PAUSE Intrinsic

The PAUSE intrinsic is used in spin-wait loops on processors that implement dynamic execution (especially out-of-order execution). In a spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. With dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin loop.

Example of a loop with the PAUSE instruction:

spin_loop:  pause
            cmp eax, A
            jne spin_loop

In the above example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed.

get_lock:   mov eax, 1
            xchg eax, A      ; Try to get lock
            cmp eax, 0       ; Test if successful
            jne spin_loop

critical_section:
            <critical_section code>
            mov A, 0         ; Release lock
            jmp continue

spin_loop:  pause            ; Spin-loop hint
            cmp A, 0         ; Check lock availability
            jne spin_loop
            jmp get_lock

continue:   <other code>

Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Because PAUSE is backward compatible with all existing IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors execute PAUSE as a NOP, but on processors that use PAUSE as a hint there can be a significant performance benefit.
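
The same test-and-test-and-set pattern can be sketched in C with the _mm_pause intrinsic. This is an illustration under stated assumptions rather than a reference implementation: it uses C11 atomics in place of the XCHG instruction, and the type spin_lock_t and the functions spin_lock_acquire and spin_lock_release are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <immintrin.h>  /* _mm_pause */

/* Hypothetical lock type: 0 = free, 1 = held, like location A above. */
typedef struct { atomic_int state; } spin_lock_t;

void spin_lock_acquire(spin_lock_t *lock)
{
    assert(lock != NULL);
    for (;;) {
        /* Try to get the lock (the role of xchg in the assembly version). */
        if (atomic_exchange_explicit(&lock->state, 1, memory_order_acquire) == 0)
            return;
        /* The attempt failed: spin on plain loads, with PAUSE as the
           spin-loop hint, until the lock appears free, then retry. */
        while (atomic_load_explicit(&lock->state, memory_order_relaxed) != 0)
            _mm_pause();
    }
}

void spin_lock_release(spin_lock_t *lock)
{
    assert(lock != NULL);
    atomic_store_explicit(&lock->state, 0, memory_order_release);
}
```

As in the assembly version, the spin occurs only after an attempt to get the lock has failed, so the uncontended path never executes PAUSE.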