Thursday, 21 December 2023

Instruction patterns and compiler passes

Previous: Machine Descriptions           Next: Forcing Memory Constants               Up: Intro  

There are several ways to create an instruction pattern, define_insn, define_expand, define_split, etc.  Which one you choose can depend on which compiler pass you expect them to be used in.

RTL pattern matching

An important realisation I made early on is that in all but the final stages of compilation, gcc is using the RTL to drive aspects like as register allocations and optimisations.  The compiler knows nothing about the opcodes - it doesn’t even see them - they are emitted directly to the output file.  So while it feels like the opcode output is the most important thing to focus on, I think getting the instruction and RTL patterns right is actually more likely to get the compiler to output efficient code.  Knowing when to use each method of defining a pattern took me a while to understand.

In the TMS9900 back-end I have used these patterns:

  • define_insn
  • define_expand
  • define_insn_and_split
  • define_peephole2

define_insn

The standard instruction pattern is defined using define_insn.  This is generally straightforward but does have limitations.  When there are many alternatives and options the output code can become convoluted.  It can also become very repetitive when slight different insns have to share similar methods (signed and unsigned mul/div for example).

The named insns are used initially to build the RTL block list.  Later passes match patterns and will use named patterns and unnamed patterns.

define_expand

The define_expand instruction pattern is one that I liked and used whenever I could.  In many cases, a complicated pattern can be expanded into a number of simpler insns that emit one or at most two opcodes.  Expanding like this also enables the optimiser to make more informed decisions as to how to reduce the RTL block list.

Expands defined with define_expand do have a major limitation though in that they are expanded in the first pass of the compiler, and not referred to again.  This means if a later pass of the compiler, say the combiner, generates RTL that matches an expand, the expand won’t be used.  Another limitation of expand is that it can't expand into another expand.  For these reason, expands have limited utility in some cases.

NOTE: if define_expand has a pattern that is generated by an optimising pass of the compiler, the pattern will not get matched.  define_expand patterns are only used before any optimising passes.

define_insn_and_split

Splits are another method of achieving an expansion of a complex RTL into multiple insns.  In some cases, a split can be used to “preprocess” a pattern.  There is an example here.  An example in the TMS9900 back-end is the conversion of immediates into constant memory references.  Since most opcodes, such as MOV, in TMS9900 can’t use immediate values, immediates must be first loaded into a register.  We could use an expand for this, but if a later compiler pass then sees an opportunity to reduce two separate MOV instructions, the expand will be lost.  The split gets around this.

If an insn is preprocessing an RTL that is then later used as an insn, the define_insn_and_split pattern can be used to combine both together.  I haven't found a use case for using define_split on its own in the current backend.

define_peephole2

Peepholes can be used to match patterns where we can make an optimisation that is not obvious to the compiler due to the nature of the TMS9900.  They are defined using define_peephole2 (define_peephole is deprecated apparently).  An example of this is where gcc generates code to copy a value from one memory location to another via a register.  The TMS9900 can do this directly.  So this peephole:

(define_peephole2
  [(set (match_operand:QI 0 "register_operand" "")
        (match_operand:QI 1 "memory_operand" ""))
   (set (match_operand:QI 2 "memory_operand" "")
        (match_dup 0))]
  "peep2_reg_dead_p(2, operands[0])"
  [(set (match_operand:QI 2 "memory_operand" "")
        (match_operand:QI 1 "memory_operand" ""))]
  {
    tms9900_debug_operands ("peep-movqi-mem-mem", NULL_RTX, operands, 3);
  }
)

matches two set patterns for bytes.  The first is moving a memory_operand to a register_operand and the second is moving that register to another memory_operand.  The operands are number 0, 1 and 2.  Operands are only defined once so the second reference to the register is replaced with match_dup 0.  

There is a condition here that operands[0] (the register) is dead because we don't want to eliminate a load of a value into a reg that is later needed.  So only eliminate if the register is indeed only a temporary location.

NOTE: Peepholes don't have names which can make it hard to see when they are used.  I added a debug statement to each one so I can see them in the output file with the -minline_rtl switch.

NOTE: I've used peepholes very sparingly and focused instead on creating patterns that correctly match as many use cases as possible.  Previously, peepholes were created much more liberally, but "premature optimisation ...."

Problems during compiler passes

The compiler makes many passes, but these are the ones I’ve only found errors most likely to occur:
  • expand - The expand pass expands all the define_expand patterns.  
  • combine - The combine pass looks for insns that can be merged together.
  • reload - The reload pass reduces the register usage and spills values to the stack when it can't allocate a register
  • ira - The ira (integrated register allocator) actually does the register allocation.  
  • peephole2 - The peephole pass looks for sequences of instructions that match a pattern.
  • final - emits the actual assembly

Expand and insn lists

As the compiler does its thing, it generates lists of insns for each function.  Because it doesn't actually emit any code until a function has been completely parsed (as far as I can tell) we know a lot about the function even before we have emitted any code.  For example, we know if it is a leaf (doesn't call any other functions) and don't need to save R11 in the function prologue.

Insns are numbered using a UID.  In the dumps for each pass, each insn is listed with its UID and the UID of the previous and next insn.  This is useful for finding where insns were combined or eliminated.

If we compiled this code fragment:

void f (void)
{
    int x = 3;
}

gcc would create a list of blocks (with only one block in this case).  In the file gccdump.128r.expand we would see: 

(insn 6 5 11 3 <stdin>:3 (set (mem/c/i:HI (reg/f:HI 17 virtual-stack-vars) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) -1 (nil))

This does look confusing initially, but essentially it says it is an insn, numbered 6, previous is 5, next is 11, input is stdin line 3 and it is doing a "set" of a HI reg to a symbol.  It has allocated a pseudo-reg 17 for now to hold the result.  Any reg higher than the highest physical reg (r15 in our case) is a pseudo-reg.  This will be allocated physical regs later in the ira and reload passes.

The compiler will look for a named insn called "movhi" (move a halfword) to do this operation.  Our implementation of movhi is a define_expand which emits a CLR or a SETO if it can or otherwise emits a tms9900_movhi insn.  The expand also replaces an immediate with a const if required.

In the next pass (gccdump.133r.vregs) we can see a reference to tms9900_movhi and an allocation on the stack frame (R9 = frame pointer):

(insn 6 5 11 3 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 9 r9) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

and in gccdump.174r.ira

(insn 6 3 14 2 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 10 r10) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

Combiner

In this pass, the combiner looks for sequences of patterns in the insn list and tries to match these with other patterns (named and unnamed) in the machine description or eliminate unnecessary insns.  In the output file gccdump.159r.combine it is common to see entries like (note 1 0 5 NOTE_INSN_DELETED) to say an insn has been removed from the list.

In this and other passes, the compiler tries to match patterns provided by define_insn, define_split, etc, but not define_expand.

An example would be a NOT(AND) sequence where we can do this with just the SZC instruction.  The insn:

; This handles reverse-order not-and combinations                                                                     
(define_insn "*not_andhi" 
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,rR>,Q,Q") 
        (and:HI (not:HI (match_operand:HI 1 "nonimmediate_operand" "rR>,Q,rR>,Q"))                                    
                (match_operand:HI 2 "nonimmediate_operand" "0,0,0,0")))]                                              
  ""                                                                                                                  
  { 
    tms9900_debug_operands ("*not_andhi", insn, operands, 3);                                                        
    return "szc  %1, %0";                                                                                             
  }
  [(set_attr "length" "2,4,4,6")])  

is used to ensure this C code:

int x;
int y;
void f()
{
 x = x&~y;
}

produces this output:

def f
f
szc  @y, @x
b    *r11

Reload / IRA

Up until these passes, pseudo regs may be used in the pattern list.  A pseudo reg is any reg that has a value of 16 or higher.  Earlier passes allocate as many of these as they need and its up to reload to replace them with physical regs.  Doing this can mean moving insns around or spilling to the stack.

These passes sometimes throw up errors about constraints or sometimes for inline assembly. "error: ‘asm’ operand requires impossible reload" is one that was recently reported for example.

Peephole

The peephole pass examines sequences of patterns in the RTL list and tries to find matching peephole pattern definitions.  This pass is quite late in the process so optimisations and register allocations are already done.  Scratch regs etc cannot be allocated at this point.

Final

The final pass is the only that actually emits the opcodes.  It seemed strange to me at first, but nothing is actually output until this pass and none of the code in any insn that generates code is executed until this pass.  Up to this point, all work has been done only on the insn rtx list.



No comments:

Post a Comment

I published the URL to this blog on atari age.  The posts are in reverse chronological order but the best place to start is the beginning .