Bayle Shanks's website: proj-plbook-plChOptimizationImpl

Table of Contents for Programming Languages: a survey

 Since you mentioned escape analysis: would it be possible to optimize x ++ y to avoid copying of x (just change the last tail pointer to point to y), if x is not used anywhere else? Does GHC do this?    Roman Cheplyaka Feb 27 '11 at 13:47

@Roman: No, GHC's escape analysis only applies to local let-bindings. There is no interprocedural analysis going on. For your example, you'd need to statically ensure that there is no other pointer anywhere in the heap that points to x or any of its successors. You'd need linear or uniqueness types to prove something like that. nominolo Feb 27 '11 at 14:00

-- http://stackoverflow.com/questions/5132350/how-do-haskell-compilers-decide-whether-to-allocate-on-the-heap-or-the-stack#comment5759432_5132862

https://en.wikipedia.org/wiki/Hash_consing

https://shipilev.net/jvm-anatomy-park/

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/hoopl-haskell10.pdf

---

"some keywords to get you started are “constant propagation”, “common subexpression elimination”, “loop invariant code motion”, “global value numbering”, “strength reduction”, “scalar replacement of aggregates”, “dead code elimination”, and “loop unrolling”." -- [1]

---

the "performance" section of this blog post talks about some late-stage language optimizations that were particularly helpful for one application:

https://blogs.msdn.microsoft.com/dotnet/2018/08/20/bing-com-runs-on-net-core-2-1/

---

https://v8.dev/blog/v8-lite

---

example:

http://blog.kevmod.com/2020/05/python-performance-its-not-just-the-interpreter/ discussion: https://news.ycombinator.com/item?id=23235930

---

Links:

http://www.mikedrivendevelopment.com/2018/06/what-i-learned-making-my-own-jit.html
a lot of stuff written by Thomas Schilling ("nominolo"), e.g. http://nominolo.blogspot.com/2012/07/implementing-fast-interpreters.html
a lot of stuff written by Mike Pall (of LuaJIT? fame) ("mikemike"), e.g. https://www.reddit.com/user/mikemike
https://blog.mozilla.org/nnethercote/2020/09/08/how-to-speed-up-the-rust-compiler-one-last-time/
- https://news.ycombinator.com/item?id=24403351
https://c9x.me/compile/bib/ has some links to optimization stuff

---

JIT

https://carolchen.me/blog/jits-intro/ https://carolchen.me/blog/jits-impls/ https://blog.erlang.org/a-first-look-at-the-jit/

---

"Most of the heavy-duty optimizations in modern industrial compilers happen on the linear IR representation anyway." [2]

---

pizlonator 1 day ago [–]

The lack of condition codes is a big deal for anyone relying on overflow checked arithmetic, like modern safe languages that do this for all integer math by default, or dynamic languages where it’s necessary for the JIT to speculate that the dynamic “number” type (which in those languages is either like a double or like a bigint semantically) is being used as an integer.

RISC-V means three instructions instead of two in the best case. It requires five or more instead of two in bad cases. That’s extremely annoying since these code sequences will be emitted frequently if that’s how all math in the language works.

Also worth noting, since this always comes up, that these things are super hard for a compiler to optimize away. JSC tries very aggressively but only succeeds a minority of the time (we have a backwards abstract interpreter based on how values are used, a forward interpreter that uses a simplified octagon domain to prove integer ranges, and a bunch of other things - and it’s not enough). So, even with very aggressive compilers you will often emit sequences to check overflow. It’s ideal if this is just a branch on a flag because this maximizes density. Density is especially important in JITs; more density means not just better perf but better memory usage since JITed instructions use dirty memory.

tom_mellior 1 day ago [–]

> JSC [...] a forward interpreter that uses a simplified octagon domain to prove integer ranges

Off-topic, but could you point me to more details on this?

...

pizlonator 1 day ago [–]

I didn’t call it octagon when I wrote it because I didn’t know that I reinvented a shitty/awesome version (shitty because it misses optimization opportunities, awesome because by missing them it converges super fast while still nailing the hard bits).

Look at DFGIntegerRangeOptimizationPhase?.cpp

---

about alloca:

https://www.gnu.org/software/libc/manual/html_node/Advantages-of-Alloca.html https://www.gnu.org/software/libc/manual/html_node/Disadvantages-of-Alloca.html https://stackoverflow.com/questions/1018853/why-is-the-use-of-alloca-not-considered-good-practice

---

https://www.amazon.ca/Advanced-Compiler-Design-Implementation-Muchnick/dp/1558603204

---

" What are the most valuable GCC optimizations for x86-64?

A decent RA
Code selection

GCC-9.0, i7-9700K under FC29 SPECInt2000 Est. GCC -O2 GCC -O0 + simple RA + combiner -fno-inline 5458 4342 (80%) -finline 6141 4339 (71%) " -- [3]

---

see "Major Types of Optimizations" slide for a good intro to some optimizations:

https://passlab.github.io/CSCE513/notes/lecture03_ISA_Principles.pdf

---

Which optimizations are most important?

"The ~8 passes to write if you're going to bother. •Inline, Unroll (& Vectorize), CSE, DCE, Code Motion, Constant Fold, Peephole ... Many compilers just do those, get ~80% best-case perf. " -- [4]

---

QBE does at least:

"Copy elimination. Sparse conditional constant propagation. Dead instructions elimination. Registerization of small stack slots. Split spiller and register allocator thanks to SSA form. (Simpler and faster than graph coloring.) Smart spilling heuristic based on loop analysis. Linear register allocator with hinting. "

---

" MIR project competitors

I've been studying JITs for a long time. Before starting the MIR project, I thought about adapting the existing code. Here is a comparison of MIR with the simple compiler projects I considered:

Project SLOC License IR type Major Optimizations Output MIR 16K C MIT non-SSA Inlining, GCSE, SCCP, RA, CP, DCE, LA Binary code DotGNU? LibJIT? 80K C LGPL non-SSA Only RA and primitive CP Binary code .NET RyuJIT? 360K C++ MIT SSA MIR ones minus SCCP plus LICM, Range Binary code QBE 10K C MIT SSA MIR ones plus aliasing minus inlining Assembler LIBFirm 140K C LGPL2 SSA RA, Inlining, DCE, LICM, CP, LA, Others Assembler CraneLift? 70K Rust Apache SSA DCE, LICM, RA, GVN, CP, LA Binary Code

Here are the abbreviations used in the table:

    CP: Copy propagation
    DCE: Dead code elimination
    GCSE: Global common sub-expression elimination
    GVN: Global value numbering
    LA: Finding loops in CFG
    LICM: Loop invariant code motion
    RA: Register allocation
    Range: Range analysis
    SCCP: Sparse conditional constant propagation

Others include:

    Partial redundancy elimination
    Loop unrolling
    Scalar replacement
    Tail recursion elimination
    Expression reassociation
    Function cloning
    Strength reduction
    Loop optimizations
    Load store motion
    Jump threading

After studying these projects closely, I decided to start the MIR project. The biggest cons of the existing projects were their size, the fact that it would be hard for me to achieve my goals using them, and that I can not control the projects. Also, the smaller source size of the MIR project makes studying the code easier for other people and reduces the effort required to maintain it. " -- [5]

"So what are the most valuable optimizations? The most important optimizations are those for effectively exploiting the most commonly used CPU resources: instructions and registers. Therefore, the most valuable optimizations are good register allocation (RA) and instruction selection.

Recently, I did an experiment by switching on only a fast and simple RA and combiner in GCC. There are no options to do this, I needed to modify GCC. (I can provide a patch if somebody is interested.) Compared to hundreds of optimizations in GCC-9.0 with -O2, these two optimizations achieve almost 80% performance on an Intel i7-9700K machine under Fedora Core 29 for real-world programs through SpecCPU?, one of the most credible compiler benchmark suites:

SPECInt2000 Est. GCC -O2 GCC -O0 + simple RA + combiner -fno-inline 5458 4342 (80%) -finline 6141 4339 (71%) " -- [6]

" MIR generator

Currently, the most interesting component, at least for me, is the MIR generator producing optimized machine code. Figure 9 shows how it works: [mir-gen] Figure 9: The process the MIR generator follows.">

Here is the process in more detail. First, we simplify MIR as much as possible, which means only using register indirect memory addressing. It also means that immediate instruction operands are used only in move instructions. During this pass, we also remove unused instructions through local value numbering.

Then, we inline function calls represented by the MIR instructions inline and calland, after that, build the control-flow graph (CFG)—in other words, the basic blocks and edges between them. Next, we transform code to reuse already calculated values in global scope through so-called global common subexpression elimination (GCSE). This step is an important optimization for inlined code.

The next pass mostly removes instructions found redundant on the previous pass through a dead code elimination optimization. Then, we do sparse conditional constant propagation (SCCP). This step is also an important optimization for inlined code when there are constant call arguments. We also find that there can be constant expression values.

Constant values can switch off CFG paths during any execution. Switching off these paths can make other values constant, too. SCCP is a combined optimization, which means its result is better than performing the two separate optimizations of constant propagation and removing unused CFG paths.

After that, we run the machine-dependent code. Most x86 instructions, for example, additionally require the instruction result operand to be the same as one of the input operands. Therefore, the machine-dependent code is run to produce two-operand instructions. This code also generates additional MIR instructions for call parameter passing and returns.

The next step is to find natural loops in the MIR code CFG. This information will be used for the subsequent register allocation. Then, we calculate live information for MIR variables. In the compiler world, this technique is known as a backward data-flow problem.

Once that work is complete, we calculate program points where MIR variables live. This info is used in a fast register allocator that assigns target hard registers, or stack slots, to MIR variables and removes various copy instructions. After the register assignment, we rewrite the MIR code, changing variables to the assigned hard registers or stack slots.

Then, we try to combine pairs of data-dependent MIR instructions into ones whose forms can represent a machine instruction. This is an instruction selection task. This pass also does copy propagation optimization.

After instruction combining, there are usually many instructions whose output is never used. We delete such instructions. And then, finally, we generate machine code in memory. Each MIR instruction is encoded by a machine instruction. For AMD64, the encoding can be complicated. " -- [7]

"Function inlining is the most important optimization for better JIT performance. Method calls are expensive and inlining permits optimizations in a bigger scope. " -- [8]

Compared to MIR:

QBE "has the same optimizations as MIR-generator plus aliasing but QBE has no inlining" -- [9]
"LIBJIT has fewer optimizations: only copy propagation and register allocation" -- [10]
"RyuJIT? optimizations is basically MIR-generator optimizations minus SCCP" -- [11]

MIR does "Only the most valuable optimization usage:

    function inlining
    global common sub-expression elimination
    variable renaming
    register pressure sensitive loop invariant code motion
    sparse conditional constant propagation
    dead code elimination
    code selection
    fast register allocator with implicit coalescing hard registers and stack slots for copy elimination" -- [12]

MIR JIT compiler pipeline, mostly quoted from https://github.com/vnmakarov/mir (note: there are three optimization levels, -O0 (least optimized), -O1, -O2 (most optimized); -O1 is the default; below, i annotate all steps that are not present in -O1 with their optimization level):

Simplify: lowering MIR
Inline: inlining MIR calls
(-O2 only) Build CFG: building Control Flow Graph (basic blocks and CFG edges)
(-O2 only) Build SSA: Building Single Static Assignment Form by adding phi nodes and SSA edges to operands (We use a form of Braun's algorithm to build SSA (M. Braun et al. "Simple and Efficient Construction of Static Single Assignment Form"))
(-O2 only) Copy Propagation: SSA copy propagation keeping conventional SSA form and removing redundant extension instructions
(-O2 only) Global Value Numbering: Removing redundant instructions through GVN
(-O2 only) Dead Code Elimination: removing instructions with unused outputs
(-O2 only) Sparse Conditional Constant Propagation: constant propagation and removing death paths of CFG
(-O2 only) Out of SSA: Removing phi nodes and SSA edges (We keep SSA in conventional form all the time to make out-of-SSA pass trivial)
Machinize: run machine-dependent code transforming MIR for calls ABI, 2-op instructions, etc
(-O1 and -O2 only) Find Loops: finding natural loops and building loop tree
Build Live Info: calculating live in and live out for the basic blocks
(-O1 and -O2 only) Build Live Ranges: calculating program point ranges for registers
Assign Registers: fast RA for -O0 or priority-based linear scan RA for -O1 and above
Rewrite: transform MIR according to the assign using reserved hard regs
(-O1 and -O2 only) Combine (code selection): merging data-depended instructions into one
(-O1 and -O2 only) Dead Code Elimination: removing instructions with unused outputs
Generate Machine Instructions: run machine-dependent code creating machine instructions

alternate to CFG is VSDG:

Optimizing compilation with the Value State Dependence Graph Alan C. Lawrence https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf

https://thoughtstreams.io/dobesv/value-state-dependence-graph/

---

https://github.com/naver/lispe/wiki/2.5--LispE-vs.-Python%3A-A-Stochastic-Gradient-Descent-Comparison

---