proj-oot-old-150618-intelOptimizeManualNotes

Intel Optimization Manual notes

notes from http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

as of Feb 2015

All the Coding Rules

Chapter 3 has a bunch of bold, italicized, numbered "Assembly/Compiler Coding Rule"s and "User/Source coding rule"s sprinkled through it.

the rules are:

Chapter 3.4: OPTIMIZING THE FRONT END

Assembly/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.

Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
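
At the source level, this rule usually means writing a select as an expression rather than an `if`, so the compiler can lower it to CMOV. A minimal sketch (function names are mine, not from the manual):

```c
#include <stdint.h>

/* Branchless select: compilers typically lower this ternary to CMOV on
   x86, trading an unpredictable control dependence for a data
   dependence. Only worth it when the branch really mispredicts often. */
static inline int32_t select_min(int32_t a, int32_t b)
{
    return (a < b) ? a : b;   /* data dependence instead of control flow */
}

/* Explicit bit-mask variant that is branchless even without compiler
   help: mask is all-ones when a < b, else zero. */
static inline int32_t select_min_bits(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);   /* 0 or -1 (all ones)          */
    return (b & ~mask) | (a & mask);    /* picks a when mask is -1     */
}
```

Whether the compiler actually emits CMOV depends on the target and optimization level; inspect the generated code when it matters.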

Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.

Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns.

Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases code size or if the function is small and the call site is frequently executed.

Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.

Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inlining to reduce the call depth.

Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.

Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.

Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.

Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.

User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch to a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
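
In C, this "peeling" amounts to testing for the dominant target with a cheap conditional branch before falling back to the indirect call. A sketch with hypothetical handlers:

```c
typedef int (*handler_fn)(int);

static int handle_add(int x) { return x + 1; }   /* the common target  */
static int handle_neg(int x) { return -x; }      /* a rare target      */

/* Peeled dispatch: the hot target is reached via a direct,
   well-predicted conditional branch; only rare targets pay the
   indirect-branch prediction cost. */
static int dispatch(handler_fn fn, int x)
{
    if (fn == handle_add)       /* common case peeled into a compare   */
        return handle_add(x);   /* direct call, trivially predicted    */
    return fn(x);               /* remaining targets stay indirect     */
}
```

This only pays off when profiling shows one or two targets dominate and correlate with history, as the rule says.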

Assembly/Compiler Coding Rule 15. (H impact, M generality) Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the execution time of the loop.

Assembly/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling loops excessively; this may thrash the trace cache or instruction cache.

Assembly/Compiler Coding Rule 17. (M impact, M generality) Unroll loops that are frequently executed and have a predictable number of iterations to reduce the number of iterations to 16 or fewer. Do this unless it increases code size so that the working set no longer fits in the trace or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
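
A by-4 unroll with a remainder loop is the usual shape of rules 15-17 at the source level; the separate accumulators also break the add dependence chain (example is mine, not from the manual):

```c
#include <stddef.h>

/* Unrolled-by-4 sum: one loop branch per four elements, and four
   independent accumulators so the adds can overlap in the OoO core. */
static long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {            /* main unrolled body      */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)                      /* remainder iterations    */
        s += a[i];
    return s;
}
```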

Assembly/Compiler Coding Rule 18. (ML impact, M generality) To improve fetch/decode throughput, give preference to the memory flavor of an instruction over the register-only flavor of the same instruction, if the instruction can benefit from micro-fusion.

Assembly/Compiler Coding Rule 19. (M impact, ML generality) Employ macro-fusion where possible using instruction pairs that support macro-fusion. Prefer TEST over CMP if possible. Use unsigned variables and unsigned jumps when possible. Try to logically verify that a variable is non- negative at the time of comparison. Avoid CMP or TEST of MEM-IMM flavor when possible. However, do not add other instructions to avoid using the MEM-IMM flavor.

Assembly/Compiler Coding Rule 20. (M impact, ML generality) Software can enable macro fusion when it can be logically determined that a variable is non-negative at the time of comparison; use TEST appropriately to enable macro-fusion when comparing a variable with 0.

Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values.

Assembly/Compiler Coding Rule 22. (M impact, ML generality) Ensure that instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line; avoid using these instructions to operate on 16-bit data, and upcast short data to 32 bits instead.

Assembly/Compiler Coding Rule 23. (MH impact, MH generality) Break up a loop's long sequence of instructions into loops of shorter instruction blocks of no more than the size of the LSD.

Assembly/Compiler Coding Rule 24. (MH impact, M generality) Avoid unrolling loops containing LCP stalls, if the unrolled block exceeds the size of LSD.

Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET).

Assembly/Compiler Coding Rule 26. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.

Assembly/Compiler Coding Rule 27. (M impact, MH generality) Avoid using prefixes to change the size of immediate and displacement.

Chapter 3.5: OPTIMIZING THE EXECUTION CORE

Assembly/Compiler Coding Rule 28. (M impact, H generality) Favor single-micro-operation instructions. Also favor instructions with shorter latencies.

Assembly/Compiler Coding Rule 29. (M impact, L generality) Avoid prefixes, especially multiple non-0F-prefixed opcodes.

Assembly/Compiler Coding Rule 30. (M impact, L generality) Do not use many segment registers.

Assembly/Compiler Coding Rule 31. (M impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four μops and require multiple cycles to decode. Use sequences of simple instructions instead.

Assembly/Compiler Coding Rule 32. (MH impact, M generality) Use push/pop to manage stack space and address adjustments between function calls/returns instead of enter/leave. Using the enter instruction with non-zero immediates can incur significant delays in the pipeline in addition to misprediction.

Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.

Assembly/Compiler Coding Rule 34. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.

Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction.

Assembly/Compiler Coding Rule 36. (M impact, ML generality) Use dependency-breaking-idiom instructions to set a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the condition codes must be preserved, move 0 into the register instead. This requires more code space than using XOR and SUB, but avoids setting the condition codes.

Assembly/Compiler Coding Rule 37. (M impact, MH generality) Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

Assembly/Compiler Coding Rule 38. (M impact, M generality) Try to use zero extension or operate on 32-bit operands instead of using moves with sign extension.

Assembly/Compiler Coding Rule 39. (ML impact, L generality) Avoid placing instructions that use 32-bit immediates which cannot be encoded as sign-extended 16-bit immediates near each other. Try to schedule μops that have no immediate immediately before or after μops with 32-bit immediates.

Assembly/Compiler Coding Rule 40. (ML impact, M generality) Use the TEST instruction instead of AND when the result of the logical AND is not used. This saves μops in execution. Use a TEST of a register with itself instead of a CMP of the register to zero; this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.

Assembly/Compiler Coding Rule 41. (ML impact, M generality) Eliminate unnecessary compare with zero instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be certain that any code transformations made do not introduce problems with overflow.

Assembly/Compiler Coding Rule 42. (H impact, MH generality) For small loops, placing loop invariants in memory is better than spilling loop-carried dependencies.

Assembly/Compiler Coding Rule 43. (M impact, ML generality) Avoid introducing dependences with partial floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD XMMREG1, XMMREG2 instruction instead.

Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for an unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.

Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.

User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or SIMD data type, to enable more parallelism with the use of a (longer) SIMD vector. For example, use single precision instead of double precision where possible.

User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. Especially avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a future iteration, something which is called a lexically backward dependence.
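
A sketch of the nesting this rule asks for, using a column recurrence (array names and sizes are mine): the loop carrying the dependence stays outer, leaving the inner loop with fully independent iterations that the compiler can vectorize.

```c
#define N 8

/* a[i][j] depends on a[i][j-1], i.e. the recurrence runs along j.
   Keeping the j (dependent) loop OUTER makes the inner i loop free of
   inter-iteration dependencies, as the rule requires. */
static void scan_columns(float a[N][N], const float b[N][N])
{
    for (int j = 1; j < N; j++)          /* carried dependence lives here */
        for (int i = 0; i < N; i++)      /* independent, vectorizable     */
            a[i][j] = a[i][j - 1] + b[i][j];
}
```

With the nesting reversed (i outer, j inner), the innermost loop would carry the recurrence and serialize.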

User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.

User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop) variable expressions simple.

Chapter 3.6: OPTIMIZING MEMORY ACCESSES

Assembly/Compiler Coding Rule 46. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

Assembly/Compiler Coding Rule 47. (H impact, M generality) Pass parameters in registers instead of on the stack where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is optimized in hardware by providing the value to the load directly from the memory order buffer without the need to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long latency operation.

Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.

Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.

Assembly/Compiler Coding Rule 50. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary. This is better than incurring the penalties of a failed store-forward.
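
A toy C version of the shift/mask extraction this rule describes (the function and its little-endian assumption are mine): read the whole aligned 32-bit word that contains the field, then shift the wanted bytes down instead of issuing a misaligned narrow load.

```c
#include <stdint.h>

/* Extract the 16-bit field that starts `off` bytes (0..2) into an
   already-loaded, aligned 32-bit word. Assumes little-endian byte
   order, so byte `off` is bit position 8*off. */
static uint16_t extract_u16(uint32_t word, unsigned off)
{
    return (uint16_t)(word >> (8 * off));   /* truncation does the mask */
}
```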

Assembly/Compiler Coding Rule 51. (MH impact, ML generality) Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.

Assembly/Compiler Coding Rule 52. (H impact, MH generality) Where it is possible to do so without incurring other penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially avoid including a store forward on a loop-carried dependence chain.

Assembly/Compiler Coding Rule 53. (M impact, MH generality) Calculate store addresses as early as possible to avoid having stores block loads.

User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Assembly/Compiler Coding Rule 54. (H impact, M generality) Try to arrange data structures such that they permit sequential access.

User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing within a cache line (64 bytes).

Assembly/Compiler Coding Rule 55. (H impact, M generality) Make sure that the stack is aligned at the largest multi-byte granular data type boundary matching the register width.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

User/Source Coding Rule 8. (H impact, ML generality) Consider using a special memory allocation library with address offset capability to avoid aliasing.

User/Source Coding Rule 9. (M impact, M generality) When padding variable declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines, suggesting an offset of 128 bytes or more.

Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its most likely target, and place the data after an unconditional branch.

Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a code page as instructions. This is very likely to happen when execution is following an indirect branch that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the data elsewhere, or inserting an illegal opcode or a PAUSE instruction immediately after the indirect branch. Note that the latter two alternatives may degrade performance in some circumstances.

Assembly/Compiler Coding Rule 58. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.

Assembly/Compiler Coding Rule 59. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.
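
Loop fission per this rule, sketched in C (names and the particular arithmetic are mine): a loop writing six output streams is split into two loops writing three each, keeping each loop at four or fewer distinct store streams.

```c
#include <stddef.h>

/* Fissioned form of a loop that originally wrote a..f in one body:
   each resulting loop writes at most 3 distinct arrays (cache lines). */
static void fill_fissioned(size_t n, int *a, int *b, int *c,
                           int *d, int *e, int *f, const int *src)
{
    for (size_t i = 0; i < n; i++) {   /* first loop: 3 output streams */
        a[i] = src[i];
        b[i] = src[i] * 2;
        c[i] = src[i] + 1;
    }
    for (size_t i = 0; i < n; i++) {   /* second loop: 3 output streams */
        d[i] = src[i] - 1;
        e[i] = src[i] * 3;
        f[i] = src[i] ^ 1;
    }
}
```

The cost is reading `src` twice; fission only wins when the store-stream pressure, not the extra loads, is the bottleneck.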

User/Source Coding Rule 10. (H impact, H generality) Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.

User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.

User/Source Coding Rule 12. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.

Chapter 3.8: FLOATING-POINT

User/Source Coding Rule 13. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 and more advanced SIMD instruction sets (e.g. AVX) with appropriate switches. Favor scalar SIMD code generation to replace x87 code generation.

User/Source Coding Rule 14. (H impact, ML generality) Make sure your application stays in range to avoid denormal values and underflows.

User/Source Coding Rule 15. (M impact, ML generality) Usually, math libraries take advantage of the transcendental instructions (for example, FSIN) when evaluating elementary functions. If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and the SSE2 instructions.

User/Source Coding Rule 16. (H impact, ML generality) Denormalized floating-point constants should be avoided as much as possible.

Assembly/Compiler Coding Rule 60. (H impact, M generality) Minimize changes to bits 8-12 of the floating-point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of bits in FCW) leads to delays that are on the order of the pipeline depth.

Assembly/Compiler Coding Rule 61. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision, and infinity bits.

Assembly/Compiler Coding Rule 62. (H impact, L generality) Minimize the number of changes to the precision mode.

Assembly/Compiler Coding Rule 63. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency then their X87 counterpart and they eliminate the overhead associated with the management of the X87 register stack.

Chapter 8.4 THREAD SYNCHRONIZATION

User/Source Coding Rule 17. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.

User/Source Coding Rule 18. (M impact, L generality) Replace a spin lock that may be acquired by multiple threads with pipelined locks such that no more than two threads have write accesses to one lock. If only one thread needs to write to a variable shared by two threads, there is no need to use a lock.

User/Source Coding Rule 19. (H impact, M generality) Use a thread-blocking API in a long idle loop to free up the processor.

User/Source Coding Rule 20. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).

User/Source Coding Rule 21. (M impact, ML generality) Place each synchronization variable alone, separated by 128 bytes or in a separate cache line.
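
In C11 this separation can be enforced with an alignment specifier (the struct is my illustration): each lock gets its own 128-byte region, so contending threads never ping-pong one line.

```c
#include <stdalign.h>

/* alignas(128) gives the struct 128-byte alignment, which also rounds
   its size up to 128 - one synchronization variable per region, as the
   rule asks. */
typedef struct {
    alignas(128) volatile int lock;
} padded_lock;

static padded_lock locks[4];   /* each element starts a new 128B region */
```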

User/Source Coding Rule 22. (H impact, L generality) Do not place any spin lock variable to span a cache line boundary.

8.5 SYSTEM BUS OPTIMIZATION

User/Source Coding Rule 23. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth.

User/Source Coding Rule 24. (M impact, L generality) Avoid excessive use of software prefetch instructions and allow automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and unnecessarily increase bus utilization if used inappropriately.

User/Source Coding Rule 25. (M impact, M generality) Consider using overlapping multiple back-to-back memory reads to improve effective cache miss latencies.

User/Source Coding Rule 26. (M impact, M generality) Consider adjusting the sequencing of memory references such that the distribution of distances of successive cache misses of the last level cache peaks towards 64 bytes.

User/Source Coding Rule 27. (M impact, M generality) Use full write transactions to achieve higher data throughput.

Chapter 8.6 MEMORY OPTIMIZATION

User/Source Coding Rule 28. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously.
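
Cache blocking in its simplest form, a tiled transpose (tile size and names are mine; BLK would be tuned per the sizing guidance above):

```c
#include <stddef.h>

#define BLK 32   /* tile edge: tune so the working tiles fit the target cache share */

/* Blocked (tiled) transpose of an n x n matrix: each BLK x BLK tile of
   src and dst is touched while resident in cache, instead of striding
   across entire rows and missing on every column step. */
static void transpose_blocked(size_t n, const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += BLK)
        for (size_t jj = 0; jj < n; jj += BLK)
            for (size_t i = ii; i < ii + BLK && i < n; i++)
                for (size_t j = jj; j < jj + BLK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```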

User/Source Coding Rule 29. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, software should also minimize data sharing across bus domains.

User/Source Coding Rule 30. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread.

Chapter 8.7 FRONT END OPTIMIZATION

User/Source Coding Rule 31. (M impact, L generality) Avoid excessive loop unrolling to ensure the LSD is operating efficiently.

Chapter 9. 64-BIT MODE CODING GUIDELINES

Assembly/Compiler Coding Rule 64. (H impact, M generality) Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers.

Assembly/Compiler Coding Rule 65. (M impact, MH generality) When they are needed to reduce register pressure, use the 8 extra general purpose registers for integer code and 8 extra XMM registers for floating-point or SIMD code.

Assembly/Compiler Coding Rule 66. (ML impact, M generality) Prefer 64-bit by 64-bit integer multiplies that produce 64-bit results over multiplies that produce 128-bit results.

Assembly/Compiler Coding Rule 67. (ML impact, M generality) Stagger accessing the high 64-bit result of a 128-bit multiply after accessing the low 64-bit results.

Assembly/Compiler Coding Rule 68. (ML impact, M generality) Use the 64-bit versions of multiply for 32-bit integer multiplies that require a 64 bit result.
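
At the source level this rule corresponds to widening one operand before the multiply (function name is mine), which lets the compiler emit a single 64-bit IMUL rather than a widening sequence:

```c
#include <stdint.h>

/* 32-bit x 32-bit -> 64-bit product: casting one operand first makes
   the whole multiply 64-bit, so the full result is kept without
   truncation and without a 128-bit widening multiply. */
static uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

Note that `(uint64_t)(a * b)` would be wrong: the multiply would happen in 32 bits and overflow before the cast.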

Assembly/Compiler Coding Rule 69. (ML impact, M generality) Use the 64-bit versions of add for 64-bit adds.

Assembly/Compiler Coding Rule 70. (L impact, L generality) If software prefetch instructions are necessary, use the prefetch instructions provided by SSE.

CHAPTER 10 SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING

SSE4.2 Coding Rule 5. (H impact, H generality) Loop-carried dependency that depends on the ECX result of PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM for address adjustment must be minimized. Isolate code paths that expect the ECX result to be 16 (bytes) or 8 (words), and replace these values of ECX with constants in address adjustment expressions to take advantage of memory disambiguation hardware.

Chapter 11 INTEL AVX INTRINSICS

Assembly/Compiler Coding Rule 71. (H impact, H generality) Whenever 256-bit AVX code and 128-bit SSE code might execute together, use the VZEROUPPER instruction whenever a transition from “Modified/Unsaved” state is expected.

Assembly/Compiler Coding Rule 72. (H impact, H generality) Add VZEROUPPER instruction after 256-bit AVX instructions are executed and before any function call that might execute SSE code. Add VZEROUPPER at the end of any function that uses 256-bit AVX instructions.

Assembly/Compiler Coding Rule 73. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment.

Assembly/Compiler Coding Rule 74. (M impact, H generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment. (my note: i think they meant 16-byte for this one)

Assembly/Compiler Coding Rule 75. (M impact, M generality) Use Blend instructions in lieu of shuffle instruction in AVX whenever possible.

User/Source Coding Rule 32. Factor in the precision and rounding characteristics of FMA instructions when replacing multiply/add operations executed with non-FMA instructions.

User/Source Coding Rule 33. Factor in result-dependency, latency of FP add vs. FMA instructions when replacing FP add operations with FMA instructions.

CHAPTER 12 INTEL TSX

User/Source Coding Rule 34. When using RTM for implementing lock elision, always test for the lock inside the transactional region.

Tuning Suggestion 14. Don't use an RTM wrapper if the lock variable is not readable in the wrapper.

User/Source Coding Rule 35. RTM abort handlers must provide a valid, tested, non-transactional fallback path.

Tuning Suggestion 16. Lock Busy retries should wait for the lock to become free again.

Chapter 14.3 For Intel Atom

Chapter 14.3.1 Intel Atom Frontend

Assembly/Compiler Coding Rule 1. (MH impact, ML generality) For Intel Atom processors, minimize the presence of complex instructions requiring MSROM to take advantage of the optimal decode bandwidth provided by the two decode units.

Assembly/Compiler Coding Rule 2. (M impact, H generality) For Intel Atom processors, keeping the instruction working set footprint small will help the front end take advantage of the optimal decode bandwidth provided by the two decode units.

Assembly/Compiler Coding Rule 3. (MH impact, ML generality) For Intel Atom processors, avoiding back-to-back X87 instructions will help the front end take advantage of the optimal decode bandwidth provided by the two decode units.

Chapter 14.3.2 Intel Atom Execution Core

Assembly/Compiler Coding Rule 4. (M impact, H generality) For Intel Atom processors, place a MOV instruction between a flag producer instruction and a flag consumer instruction that would have incurred a two-cycle delay. This will prevent partial flag dependency.

Assembly/Compiler Coding Rule 5. (MH impact, H generality) For Intel Atom processors, LEA should be used for address manipulation; but software should avoid the following situations, which create dependencies from ALU to AGU: an ALU instruction (instead of LEA) for address manipulation or ESP updates; an LEA for ternary addition or non-destructive writes which do not feed address generation. Alternatively, hoist the producer instruction more than 3 cycles above the consumer instruction that uses the AGU.

Assembly/Compiler Coding Rule 6. (M impact, M generality) For Intel Atom processors, sequence an independent FP or integer multiply after an integer multiply instruction to take advantage of pipelined IMUL execution.

Assembly/Compiler Coding Rule 7. (M impact, M generality) For Intel Atom processors, hoist the producer instruction for the implicit register count of an integer shift instruction before the shift instruction by at least two cycles.

Assembly/Compiler Coding Rule 8. (M impact, MH generality) For Intel Atom processors, LEA, simple loads and POP are slower if the input is smaller than 4 bytes.

Assembly/Compiler Coding Rule 9. (MH impact, H generality) For Intel Atom processors, prefer SIMD instructions operating on XMM registers over X87 instructions using the FP stack. Use packed single-precision instructions where possible. Replace packed double-precision instructions with scalar double-precision instructions.

Assembly/Compiler Coding Rule 10. (M impact, ML generality) For Intel Atom processors, library software performing sophisticated math operations like transcendental functions should use SIMD instructions operating on XMM registers instead of native X87 instructions.

Assembly/Compiler Coding Rule 11. (M impact, M generality) For Intel Atom processors, enable DAZ and FTZ whenever possible.

Assembly/Compiler Coding Rule 12. (H impact, L generality) For Intel Atom processors, use divide instruction only when it is absolutely necessary, and pay attention to use the smallest data size operand.

Assembly/Compiler Coding Rule 13. (MH impact, M generality) For Intel Atom processors, prefer a sequence of MOVAPS+PALIGNR over MOVUPS. Similarly, MOVDQA+PALIGNR is preferred over MOVDQU.

Assembly/Compiler Coding Rule 14. (MH impact, H generality) For Intel Atom processors, ensure data are aligned in memory to their natural size. For example, 4-byte data should be aligned to a 4-byte boundary, and so on. Additionally, smaller accesses (less than 4 bytes) within a chunk may experience delays if they touch different bytes.

Assembly/Compiler Coding Rule 15. (H impact, ML generality) For Intel Atom processors, use segments with the base set to 0 whenever possible; avoid a non-zero segment base address that is not aligned to a cache line boundary at all cost.

Assembly/Compiler Coding Rule 16. (H impact, L generality) For Intel Atom processors, when using non-zero segment bases, use DS, FS, and GS; string operations should use the implicit ES.

Assembly/Compiler Coding Rule 17. (M impact, ML generality) For Intel Atom processors, favor using ES, DS, SS over FS, GS with a zero segment base.

Assembly/Compiler Coding Rule 18. (MH impact, M generality) For Intel Atom processors, "bool" and "char" values should be passed onto and read off the stack as 32-bit data.

Assembly/Compiler Coding Rule 19. (MH impact, M generality) For Intel Atom processors, favor the register form of PUSH/POP and avoid using LEAVE; use LEA to adjust ESP instead of ADD/SUB.

Chapter 15: Intel Silvermont

Intel Silvermont Frontend:

Tuning Suggestion 1. Use the perfmon counter MS_DECODED.MS_ENTRY to find the number of instructions that need the MSROM (the count will include any assist or fault that occurred).

Assembly/Compiler Coding Rule 1. (M impact, M generality) Try to keep the I-footprint small to get the best reuse of the predecode bits.

Tuning Suggestion 2. Use the perfmon counter DECODE_RESTRICTION.PREDECODE_WRONG to count the number of times that a decode restriction reduced instruction decode throughput because predecoded bits are incorrect.

Assembly/Compiler Coding Rule 2. (MH impact, H generality) To achieve more than one instruction per cycle of throughput, minimize the use of instructions that have the following characteristics: (i) use the MSROM, (ii) have more than 3 escape/prefix bytes, (iii) are more than 8 bytes long, or (iv) are back-to-back branches.

User/Source Coding Rule 1. (M impact, M generality) Keep the per-iteration instruction count below 29 when considering the loop unrolling technique on short loops with high iteration counts.

Tuning Suggestion 3. Use the BACLEARS.ANY perfmon counter to see if loop unrolling is causing too much pressure. Use the ICACHE.MISSES perfmon counter to see if loop unrolling is having an excessive negative effect on the instruction footprint.

Intel Silvermont Execution Core:

Assembly/Compiler Coding Rule 3. (M impact, M generality) Use CMP/ADD/SUB instructions to compute branch conditions instead of INC/DEC instructions whenever possible.

Assembly/Compiler Coding Rule 4. (M impact, M generality) Favor SSE floating-point instructions over x87 floating point instructions.

Assembly/Compiler Coding Rule 5. (MH impact, M generality) Run with exceptions masked and the DAZ and FTZ flags set (whenever possible).

Tuning Suggestion 5. Use the perfmon counter MACHINE_CLEARS.FP_ASSIST to see if floating-point exceptions are impacting program performance.

User/Source Coding Rule 2. (M impact, L generality) Use divides only when really needed, and take care to use the correct data size and sign so that you get the most efficient execution.

Tuning Suggestion 6. Use the perfmon counters UOPS_RETIRED.DIV and CYCLES_DIV_BUSY.ANY to see if the divides are a bottleneck in the program.

User/Source Coding Rule 3. (M impact, M generality) Use PALIGNR when stepping through packed single-precision elements.



H impact and H generality rules

Chapter 3.4: OPTIMIZING THE FRONT END

Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.

Chapter 3.6: OPTIMIZING MEMORY ACCESSES

Assembly/Compiler Coding Rule 46. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.

User/Source Coding Rule 10. (H impact, H generality) Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.

User/Source Coding Rule 12. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.

Chapter 8.6 MULTICORE MEMORY OPTIMIZATION

User/Source Coding Rule 28. (H impact, H generality) Use cache blocking to improve locality of data access. Target one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a block size that allows all the logical processors serviced by a cache to share that cache simultaneously.

User/Source Coding Rule 30. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread.

CHAPTER 10 SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING

SSE4.2 Coding Rule 5. (H impact, H generality) Loop-carried dependencies that depend on the ECX result of PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM for address adjustment must be minimized. Isolate code paths that expect the ECX result to be 16 (bytes) or 8 (words), and replace these values of ECX with constants in address adjustment expressions to take advantage of the memory disambiguation hardware.

Chapter 11.1 INTEL AVX INTRINSICS

Assembly/Compiler Coding Rule 71. (H impact, H generality) Whenever 256-bit AVX code and 128-bit SSE code might execute together, use the VZEROUPPER instruction whenever a transition from “Modified/Unsaved” state is expected.

Assembly/Compiler Coding Rule 72. (H impact, H generality) Add VZEROUPPER instruction after 256-bit AVX instructions are executed and before any function call that might execute SSE code. Add VZEROUPPER at the end of any function that uses 256-bit AVX instructions.

H impact or H generality and at least M impact and M generality (and not included in previous)

Chapter 3.4: OPTIMIZING THE FRONT END

Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.

Assembly/Compiler Coding Rule 15. (H impact, M generality) Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the execution time of the loop.

Assembly/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling loops excessively; this may thrash the trace cache or instruction cache.

Chapter 3.5: OPTIMIZING THE EXECUTION CORE

Assembly/Compiler Coding Rule 28. (M impact, H generality) Favor single-micro-operation instructions. Also favor instructions with shorter latencies.

Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.

Assembly/Compiler Coding Rule 42. (H impact, MH generality) For small loops, placing loop invariants in memory is better than spilling loop-carried dependencies.

User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or SIMD data type, to enable more parallelism with the use of a (longer) SIMD vector. For example, use single precision instead of double precision where possible.

Chapter 3.6: OPTIMIZING MEMORY ACCESSES

Assembly/Compiler Coding Rule 47. (H impact, M generality) Pass parameters in registers instead of on the stack where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is optimized in hardware by providing the value to the load directly from the memory order buffer without the need to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long latency operation.

Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.

Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.

Assembly/Compiler Coding Rule 52. (H impact, MH generality) Where it is possible to do so without incurring other penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially avoid including a store forward on a loop-carried dependence chain.

User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Assembly/Compiler Coding Rule 54. (H impact, M generality) Try to arrange data structures such that they permit sequential access.

Assembly/Compiler Coding Rule 55. (H impact, M generality) Make sure that the stack is aligned at the largest multi-byte granular data type boundary matching the register width.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

Chapter 3.8: FLOATING-POINT

Assembly/Compiler Coding Rule 60. (H impact, M generality) Minimize changes to bits 8-12 of the floating-point control word. Changes among more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of the bits in the FCW) lead to delays that are on the order of the pipeline depth.

Chapter 8.4 MULTICORE THREAD SYNCHRONIZATION

User/Source Coding Rule 17. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.

User/Source Coding Rule 19. (H impact, M generality) Use a thread-blocking API in a long idle loop to free up the processor.

User/Source Coding Rule 23. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth.

chapter ???

User/Source Coding Rule 29. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, software should also minimize data sharing across bus domains.

Chapter 9. 64-BIT MODE CODING GUIDELINES

Assembly/Compiler Coding Rule 64. (H impact, M generality) Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers.

Assembly/Compiler Coding Rule 73. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment.

Assembly/Compiler Coding Rule 74. (M impact, H generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment. (my note: I think they meant 16-byte for this one)

Chapter 14.3.1 Intel Atom Frontend

Assembly/Compiler Coding Rule 2. (M impact, H generality) For Intel Atom processors, keeping the instruction working set footprint small will help the front end take advantage of the optimal decode bandwidth provided by the two decode units.

Chapter 14.3.2 Intel Atom Execution Core

Assembly/Compiler Coding Rule 4. (M impact, H generality) For Intel Atom processors, place a MOV instruction between a flag producer instruction and a flag consumer instruction that would have incurred a two-cycle delay. This will prevent partial flag dependency.

Assembly/Compiler Coding Rule 5. (MH impact, H generality) For Intel Atom processors, LEA should be used for address manipulation; but software should avoid the following situations, which create dependencies from the ALU to the AGU: an ALU instruction (instead of LEA) used for address manipulation or ESP updates; an LEA used for ternary addition or non-destructive writes that do not feed address generation. Alternatively, hoist the producer instruction more than 3 cycles above the consumer instruction that uses the AGU.

Assembly/Compiler Coding Rule 9. (MH impact, H generality) For Intel Atom processors, prefer SIMD instructions operating on XMM registers over x87 instructions using the FP stack. Use packed single-precision instructions where possible. Replace packed double-precision instructions with scalar double-precision instructions.

Assembly/Compiler Coding Rule 14. (MH impact, H generality) For Intel Atom processors, ensure data are aligned in memory to their natural size. For example, 4-byte data should be aligned to a 4-byte boundary, and so on. Additionally, smaller accesses (less than 4 bytes) within a chunk may experience delays if they touch different bytes.

Assembly/Compiler Coding Rule 2. (MH impact, H generality) To achieve more than one instruction per cycle of throughput, minimize the use of instructions that have the following characteristics: (i) use the MSROM, (ii) have more than 3 escape/prefix bytes, (iii) are more than 8 bytes long, or (iv) are back-to-back branches.