Thursday, 21 December 2023

Instruction patterns and compiler passes

Previous: Machine Descriptions           Next: Forcing Memory Constants               Up: Intro  

There are several ways to create an instruction pattern, define_insn, define_expand, define_split, etc.  Which one you choose can depend on which compiler pass you expect them to be used in.

RTL pattern matching

An important realisation I made early on is that in all but the final stages of compilation, gcc is using the RTL to drive aspects like as register allocations and optimisations.  The compiler knows nothing about the opcodes - it doesn’t even see them - they are emitted directly to the output file.  So while it feels like the opcode output is the most important thing to focus on, I think getting the instruction and RTL patterns right is actually more likely to get the compiler to output efficient code.  Knowing when to use each method of defining a pattern took me a while to understand.

In the TMS9900 back-end I have used these patterns:

  • define_insn
  • define_expand
  • define_insn_and_split
  • define_peephole2

define_insn

The standard instruction pattern is defined using define_insn.  This is generally straightforward but does have limitations.  When there are many alternatives and options the output code can become convoluted.  It can also become very repetitive when slight different insns have to share similar methods (signed and unsigned mul/div for example).

The named insns are used initially to build the RTL block list.  Later passes match patterns and will use named patterns and unnamed patterns.

define_expand

The define_expand instruction pattern is one that I liked and used whenever I could.  In many cases, a complicated pattern can be expanded into a number of simpler insns that emit one or at most two opcodes.  Expanding like this also enables the optimiser to make more informed decisions as to how to reduce the RTL block list.

Expands defined with define_expand do have a major limitation though in that they are expanded in the first pass of the compiler, and not referred to again.  This means if a later pass of the compiler, say the combiner, generates RTL that matches an expand, the expand won’t be used.  Another limitation of expand is that it can't expand into another expand.  For these reason, expands have limited utility in some cases.

NOTE: if define_expand has a pattern that is generated by an optimising pass of the compiler, the pattern will not get matched.  define_expand patterns are only used before any optimising passes.

define_insn_and_split

Splits are another method of achieving an expansion of a complex RTL into multiple insns.  In some cases, a split can be used to “preprocess” a pattern.  There is an example here.  An example in the TMS9900 back-end is the conversion of immediates into constant memory references.  Since most opcodes, such as MOV, in TMS9900 can’t use immediate values, immediates must be first loaded into a register.  We could use an expand for this, but if a later compiler pass then sees an opportunity to reduce two separate MOV instructions, the expand will be lost.  The split gets around this.

If an insn is preprocessing an RTL that is then later used as an insn, the define_insn_and_split pattern can be used to combine both together.  I haven't found a use case for using define_split on its own in the current backend.

define_peephole2

Peepholes can be used to match patterns where we can make an optimisation that is not obvious to the compiler due to the nature of the TMS9900.  They are defined using define_peephole2 (define_peephole is deprecated apparently).  An example of this is where gcc generates code to copy a value from one memory location to another via a register.  The TMS9900 can do this directly.  So this peephole:

(define_peephole2
  [(set (match_operand:QI 0 "register_operand" "")
        (match_operand:QI 1 "memory_operand" ""))
   (set (match_operand:QI 2 "memory_operand" "")
        (match_dup 0))]
  "peep2_reg_dead_p(2, operands[0])"
  [(set (match_operand:QI 2 "memory_operand" "")
        (match_operand:QI 1 "memory_operand" ""))]
  {
    tms9900_debug_operands ("peep-movqi-mem-mem", NULL_RTX, operands, 3);
  }
)

matches two set patterns for bytes.  The first is moving a memory_operand to a register_operand and the second is moving that register to another memory_operand.  The operands are number 0, 1 and 2.  Operands are only defined once so the second reference to the register is replaced with match_dup 0.  

There is a condition here that operands[0] (the register) is dead because we don't want to eliminate a load of a value into a reg that is later needed.  So only eliminate if the register is indeed only a temporary location.

NOTE: Peepholes don't have names which can make it hard to see when they are used.  I added a debug statement to each one so I can see them in the output file with the -minline_rtl switch.

NOTE: I've used peepholes very sparingly and focused instead on creating patterns that correctly match as many use cases as possible.  Previously, peepholes were created much more liberally, but "premature optimisation ...."

Problems during compiler passes

The compiler makes many passes, but these are the ones I’ve only found errors most likely to occur:
  • expand - The expand pass expands all the define_expand patterns.  
  • combine - The combine pass looks for insns that can be merged together.
  • reload - The reload pass reduces the register usage and spills values to the stack when it can't allocate a register
  • ira - The ira (integrated register allocator) actually does the register allocation.  
  • peephole2 - The peephole pass looks for sequences of instructions that match a pattern.
  • final - emits the actual assembly

Expand and insn lists

As the compiler does its thing, it generates lists of insns for each function.  Because it doesn't actually emit any code until a function has been completely parsed (as far as I can tell) we know a lot about the function even before we have emitted any code.  For example, we know if it is a leaf (doesn't call any other functions) and don't need to save R11 in the function prologue.

Insns are numbered using a UID.  In the dumps for each pass, each insn is listed with its UID and the UID of the previous and next insn.  This is useful for finding where insns were combined or eliminated.

If we compiled this code fragment:

void f (void)
{
    int x = 3;
}

gcc would create a list of blocks (with only one block in this case).  In the file gccdump.128r.expand we would see: 

(insn 6 5 11 3 <stdin>:3 (set (mem/c/i:HI (reg/f:HI 17 virtual-stack-vars) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) -1 (nil))

This does look confusing initially, but essentially it says it is an insn, numbered 6, previous is 5, next is 11, input is stdin line 3 and it is doing a "set" of a HI reg to a symbol.  It has allocated a pseudo-reg 17 for now to hold the result.  Any reg higher than the highest physical reg (r15 in our case) is a pseudo-reg.  This will be allocated physical regs later in the ira and reload passes.

The compiler will look for a named insn called "movhi" (move a halfword) to do this operation.  Our implementation of movhi is a define_expand which emits a CLR or a SETO if it can or otherwise emits a tms9900_movhi insn.  The expand also replaces an immediate with a const if required.

In the next pass (gccdump.133r.vregs) we can see a reference to tms9900_movhi and an allocation on the stack frame (R9 = frame pointer):

(insn 6 5 11 3 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 9 r9) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

and in gccdump.174r.ira

(insn 6 3 14 2 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 10 r10) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

Combiner

In this pass, the combiner looks for sequences of patterns in the insn list and tries to match these with other patterns (named and unnamed) in the machine description or eliminate unnecessary insns.  In the output file gccdump.159r.combine it is common to see entries like (note 1 0 5 NOTE_INSN_DELETED) to say an insn has been removed from the list.

In this and other passes, the compiler tries to match patterns provided by define_insn, define_split, etc, but not define_expand.

An example would be a NOT(AND) sequence where we can do this with just the SZC instruction.  The insn:

; This handles reverse-order not-and combinations                                                                     
(define_insn "*not_andhi" 
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,rR>,Q,Q") 
        (and:HI (not:HI (match_operand:HI 1 "nonimmediate_operand" "rR>,Q,rR>,Q"))                                    
                (match_operand:HI 2 "nonimmediate_operand" "0,0,0,0")))]                                              
  ""                                                                                                                  
  { 
    tms9900_debug_operands ("*not_andhi", insn, operands, 3);                                                        
    return "szc  %1, %0";                                                                                             
  }
  [(set_attr "length" "2,4,4,6")])  

is used to ensure this C code:

int x;
int y;
void f()
{
 x = x&~y;
}

produces this output:

def f
f
szc  @y, @x
b    *r11

Reload / IRA

Up until these passes, pseudo regs may be used in the pattern list.  A pseudo reg is any reg that has a value of 16 or higher.  Earlier passes allocate as many of these as they need and its up to reload to replace them with physical regs.  Doing this can mean moving insns around or spilling to the stack.

These passes sometimes throw up errors about constraints or sometimes for inline assembly. "error: ‘asm’ operand requires impossible reload" is one that was recently reported for example.

Peephole

The peephole pass examines sequences of patterns in the RTL list and tries to find matching peephole pattern definitions.  This pass is quite late in the process so optimisations and register allocations are already done.  Scratch regs etc cannot be allocated at this point.

Final

The final pass is the only that actually emits the opcodes.  It seemed strange to me at first, but nothing is actually output until this pass and none of the code in any insn that generates code is executed until this pass.  Up to this point, all work has been done only on the insn rtx list.



Tuesday, 19 December 2023

TMS9900 Machine Description

Previous: Coding in C                Next: Instruction Patterns               Up: Intro  

Insns

The machine description file gcc/config/tms9900.md is the primary definition of the TMS9900 backend.  Machine descriptions for gcc in general are documented here.  They use the .md extension (an unfortunate clash with markdown) and essentially list a series of instructions with patterns (called "insns") that emit assembly opcodes to achieve a result.  The machine description is parsed by the compiler build system and a number of header and source code files are emitted which are then built in to the compiler.  In many places in the md file it is possible to include C code.  Names of named insns are used to generate a function called gen_<insn-name> so names must follow the same rules as C function naming (can't include symbols such as "-" or "$" for example).  The insn patterns are used by the compiler to match the RTL generated by the compiler front-end.

Insn names

Insns can be named or unnamed.  Named insns are either standard names (like "movqi") or else should have a unique name that won't clash with another backend, typically by prepending the back end name (e.g. "tms9900_movqi").  Unnamed insns can simply have an empty name ("") or a name starting with an asterisk ("*my_insn"), in which case the name is emitted in debug output but otherwise treated as empty.  In early compilation, gcc will look for named insns to match patterns in the RTL.  For example, it will look for "movqi" if it wants to move a byte or "addhi3" if it wants to add 16-bit values.  In later stages, it will try to match patterns which is where unnamed insns come in.

NOTE: only named insns can be referred to by define_expand, define_split, etc.

Library calls

When the compiler can't find a named insn to perform an operation it will often instead emit a library call so floating point methods are defined in libgcc.  This can't be done for fundamental insns like "movqi" but can be done for example to add two 32-bit values.  If gcc can't find an insn called "addsi3" it will emit "bl __addsi3" instead.

Sometimes, the compiler will try to generate its own code for missing insns.  For example this code:

long x;
int y;
long z;
 
void f(void)
{
   z = x>>y;
}

compiled without optimisation will generate a sequence of about 30 opcodes inline to do the 32-bit shift, but if compiled with -Os, the output will simply be "bl @__ashrsi3".  This is a very significant space saver but does add a few cycles for the call and return.

Operands

Each insn takes several operands, typically 2 or 3.  Operand values have a mode which indicates their size.  The size of standard int (SI) in gcc is 32-bits.  A half-word (HI) is 16 bits and a quarter-word (QI) is 8 bits.  Other types like double-word (DI) are not used in the TMS9900 backend.  There are also float and double (SF and DF) modes but there are no insns defined for these since the TMS9900 does not have any floating point hardware capability.  One unusual mode is the condition code (CC) which is set by compare operations.  This is implicitly "clobbered" by insns in gcc-4.4.0.  The TMS9900 does an implicit compare to zero on most instructions.

The operand modes and count are usually appended to the end of the insn so "extendqihi2" means perform an extend operation where the src is QI (8 bits), the dest is HI (16 bits) and there are two operands.

Constraints and lengths

The type of the operand is defined using a single letter called a constraint.  Constraints can be grouped together and comma-separated into alternatives. The values use in the TMS9900 backend are:

  • r = a register
  • R = a register indirect reference to a memory location (e.g. MOV *R4 ....)
  • > = like R, but with a post increment (e.g. MOV *R4+....)
  • Q = a label address of a memory location (e.g. MOV @LABEL ...)
  • i = an immediate value (e.g. LI R1,>1234)
  • n = a constant number, excludes labels and other constants
  • 0 = not really a constraint.  It just means this operand must be the same as operand[0] for this alternative.  It could also be 1, 2... to copy constraints for other operands.

For r, R, > and Q, the opcode is usually the same, as the addressing mode doesn't change the opcode length, but the length of the instruction varies if using Q, so it makes sense to break these out in separate constraint "alternatives".  Each alternative can have a slightly different implementation and/or can have a different length.

In addition, some values can use shorter methods for set, add, etc.  Loading a zero into a register can be achieved with "CLR Rx" instead of "LI Rx,0"; loading -1 can be done with "SETO rx"; adding 1 to a reg can be done with "INC Rx" instead of "AI Rx,1" and so on.  Each of these opcodes is 2 bytes shorter than an immediate.  To make these easier to detect as alternatives, these constraints are also defined:

  • L = +2 or -2; used to emit INCT or DECT
  • M = -1; use to emit SETO or DEC
  • N = +1; used to emit INC
  • O = 0; used to emit CLR

NOTE: It is important with these constraints that they appear in the alternatives list before a broader constraint like 'i'.  For example, the addhi3 operand2 constraints are:

   (match_operand:HI 2 "general_operand" "rR>,Q,LMN,LMN,i,rR>,Q, rR>,Q")

Because 'i' will match any immediate value so must be after the 'LMN' constraints which are more specific.

Constraints can also be preceded by modifiers.  The equals ("=") modifier means this value is set.  This is always operand[0] in the TMS9900 backend.  The percent ("%") modifier means operands are commutative, i.e. can be swapped because A+B is the same as B+A, etc.  Sometimes this makes the compiler's life easier.  

Having commutative operands is actually particularly important for the TMS9900 since arithmetic operators only take two operands.  So there is no opcode to do for example x=a+b.  Instead, we need to do x=a;x+=b.  For this reason, generally, operand[1] uses the "0" constraint to say it must match operand[0].

Predicates

When declaring an insn, each operand is given a predicate that gcc uses to match operand types.  The predicates used in the TMS9900 backend are:

  • general_operand - anything will match
  • register_operand - only a register will match
  • nonimmediate_operand - anything except an immediate (constant) value will match
  • const_int_operand - only a constant integer will match
  • memory_operand - only a memory reference (R or Q, see below) will match
The pattern for an insn may look like this:

(set (match_operand:HI 0 "register_operand" "=r") ... )

The predicate is used first to find a matching pattern and then the constraint is used to identify which alternative to use.  

NOTE: a constraint list MUST contain all alternatives that match.  A pattern like:

(set (match_operand:HI 0 "general_operand" "=rR,Q") ... )

can cause a compiler error since the predicate will match an immediate value but gcc won't be able to find an alternative to use since there is no "i" constraint.

Extending and truncating

Changing an operand to a larger width is called an extend.  Reducing it a shorter width is called a truncate.  There is only way to truncate an operand but extending can be zero extend or sign extend.  We can extend from QI to HI, HI to SI or QI to SI and truncate from SI to HI, SI to QI or HI to QI.  This is gives us 9 unique insn patterns we need to implement!

Truncating or extend from SI to HI and vice versa is quite straightforward.  It's just a matter of discarding a word or adding a zero word or an inverted word (>FFFF) if sign extending a negative value.

Generally, we can't truncate or extend a memory reference, only a register.  There is usually no use case for this anyway as a memory location never changes size.  But a value in a register could be loaded as a QI and then accessed as a HI given that gcc thinks low-order byte values are accessible in a wider word.

Going from SI to QI and vice versa needs to be done in two steps, SI to HI and then SI to QI, so these are candidates for define_expand.  More on define_expand later. To be safe, I use a scratch reg for the intermediate value.  For example:

(define_expand "zero_extendqisi2"
  [(set (match_operand:SI 0 "register_operand" "")
        (zero_extend:SI (match_operand:QI 1 "nonimmediate_operand" "")))]
  ""
  {
    tms9900_debug_operands ("zero_extendqisi2", NULL_RTX, operands, 2);
    rtx tmp = gen_reg_rtx (HImode);
    emit_insn (gen_zero_extendqihi2 (tmp, operands[1]));
    emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
    DONE;
  }
)

NOTE: define_expand C code branches must end in either a DONE or a FAIL, or the insn pattern will remain in the list.  These are macros that perform a goto in the generated code.  Omitting these won't cause any error to be emitted but will cause really weird behaviour in the output that will take forever to track down.  Been there, done that.

The trickier conversions are QI to HI and vice versa.  For HI to QI truncation we don't care about the low-order byte and we can emit "SWPB" to just change the order of the bytes.  To extend QI to HI we do care and must fill the high-order byte with zeros or a sign extension.  Sign extension is done using an arithmetic shift "SRA Rx,8"and zero extension with a logical shift  "SRL Rx,8".

Lengths, jumps and branches

Each "define_insn" and (not "define_expand" or "define_split") has an array of length values at the end, like this:

 [(set_attr "length" "4,2,4,4,6")] 

These lengths are the number of bytes emitted by each of the alternatives identified by the constraints.  The number of length attributes must exactly match the number of alternatives (e.g. 5 alternatives must have 5 lengths).  These lengths are used to determine the distance to a label for a jump instruction.  In TMS9900 we can jump backwards by 256 bytes or forward by 254 bytes.  If a longer jump is needed we need to emit a branch ("B") instruction.  

In some cases we can't determine the exact length of an insn because it may vary depending on the circumstances at the time.  For example, we may have to emit an additional SWPB to correct a dropped truncate.  In these cases I've made the length to be the worst case length.  This does mean we will occasionally emit a branch where a jump would have been sufficient but this should be rare.

The function output_branch() in gcc/config/tms9900/tms9900.c determines what opcodes to output for all conditional jumps and will replace a jump with an inverted jump over a branch if distance requires it.

Tailcalls

If the optimising compiler sees that the end sequence of two functions are identical, it will emit insns for only one of the functions and issue a branch instruction in the other and use a common epilogue and return.  This is called a tailcall.  It is identified in the md by checking the macro SIBLING_CALL_P(insn).

Defining insns

The main way to define an insn is by using "define_insn".  To define the insn "movhi", which moves one word to another, we include this in the md file:

(define_insn "movhi"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,rR>,Q,  Q,r")
        (match_operand:HI 1 "general_operand"      "rR>, Q, rR>,Q,i"))]
  ""
  {
    if (which_alternative == 4)
    {
      return("li   %0, %1");
    }
    else
    {
      return("mov  %1, %0");
    }
  }
  [(set_attr "length" "2,4,4,6,4")])

Here we have defined the name (movhi), the operation (set) and described the two operands.  The first (destination) operand is a "non_immediate" operand, i.e. any type except an immediate.  The second (source) is a "general_operand" which can be anything.  There are 5 alternatives defined.  The Q constraint adds 2 bytes to the length of the opcode but r, R and > are the same length, so each combination of rR> and Q are broken out as separate alternatives.  "MOV r,r" is 2 bytes but "MOV @label1,@label2" would be 6 bytes.  The final constraint is for immediates.  Since TMS9900 does not allow immediate values to be used with MOV we instead emit load immediate (LI) for alternative number 4.  This operation is 4 bytes long.

The empty string after the pattern is the condition.  This insn will only be used if the condition is met.  This is usually empty.

NOTE: The entire insn is surrounded by a pair of brackets ().  The pattern and the set_attr are each surrounded by a pair of square brackets [].  Each part of the pattern is surrounded by brackets as well ().  The nested brackets gets confusing.  My most common cause of build failures is a missing or misplaced closing bracket.

Emitting code

There are many ways to emit code in the md file.  The example above uses C code as indicated by the opening brace ("{").  If instead of a brace, there is a double quote, the text is output directly.  For an insn like abs which has only one operand, could have done this:

(define_insn "abshi2"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,Q")
        (abs:HI (match_operand:HI 1 "nonimmediate_operand" "0,0")))]
  ""
  "abs  %0"
  [(set_attr "length" "2,4")])

Or we could also list instructions for each alternative like this (although there is no point as the same instruction is used for both alternatives):

(define_insn "abshi2"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,Q")
        (abs:HI (match_operand:HI 1 "nonimmediate_operand" "0,0")))]
  ""
  "@"
  "abs  %0"
  "abs  %0"
  [(set_attr "length" "2,4")])

Or within the C code, we could just do { return "abs %0"; }

Or within C, we could call you can use { output_asm_insn ("abs %0", operands); return ""; }

Or if you only have a short piece of C code you could add a double quote followed by an asterisk and then a C statement with any double quotes escaped like this: "* return \"abs %0\""

I think there are just too many alternatives here tbh.  I think mixing and matching methods of emitting output is going to be error prone.  I personally like consistency so have tried to always use C code to emit opcodes.  I also try and avoid mixing output_asm_insn("abs %0") and return "abs %0" in the same insn.

NOTE: since the MD file is compiled to C code, omitting a return will cause random output.  Been there done that too.



Thursday, 14 December 2023

Coding machine descriptions in C

Previous: TMS9900                Next: Machine Description               Up: Intro  

Very often it is convenient to write a fragment of C code inline in the tms9900.md file.  Sometimes if a fragment gets too big or will be used in a few places, it can be placed as a function in the tms9900.c file with a corresponding prototype in tms9900-protos.h.  There are a few tips and tricks I picked up about writing C code in this environment.

rtx objects

The compiled RTL is represented internally by an object called "rtx".  This object is used for lots of things, insns, operands, constants, etc.  There are several macros defined to get information about an rtx object.

Testing whether an rtx is something is done using macros that have _P appended.  So to test if an rtx is an insn, you can call INSN_P(rtx); to test if it is a reg you can call REG_P(rtx); to test for numbers use CONST_INT_P(rtx), and so on.  For the general case, GET_CODE(rtx) will return an enum with the type of an rtx.  GET_RTX_NAME(code) will return a printable string containing the code name.  

NOTE: I've found the code name to reliably match the enum so when debugging it is much easier to find an insn with a name of "plus" than it is to wade through the enum definition file trying to figure out which one is number 573. [1]

If it is a reg or memory location, it will have a mode.  In C code, the modes are defined by an enum as QImode, HImode, SImode, etc.  To get the mode of an rtx, the macro GET_MODE(rtx) can be used.  To get a printable name, the macro GET_MODE_NAME(mode) can be used.

Registers

Registers have defines for convenience such as HARD_R0_REGNUM or FRAME_POINTER_REGNUM (aka HARD_R9_REGNUM).  The stack pointer is also available through a global rtx called stack_pointer_rtx.

NOTE: be sure to wrap register values with gen_rtx_REG() before passing to any functions that expect an rtx parameter.  I'm sure there will be a warning, but it will get lost in the big sea of warnings and then gcc will just crash.

Emitting opcodes

C code can emit opcodes.  This is done in functions tms9900_expand_prologue() for example.  The emit_insn() function will emit an insn rtx into the output list.  Each named insn in the md file will have a corresponding function with a "gen_" prefix so you can call gen_movhi() and so on to emit these insns.  The functions take an array of rtx objects as parameters for operands.

Operands

Operands are passed to insns as an array of rtx objects.  So operand[0] is the first rtx and so on.  I did once have a concern that we shouldn't modify rtx objects passed to insns but that seems to be a valid thing to and many other backends do it all the time.  This is useful if we want to replace an operand with something else or swap them around, etc.

NOTE: be careful not to access operands with a higher array index than your insn defines.  As with any C code there is no bounds checking on the array.  If an insn takes 3 operands and you try to access operand[3] the compiler will probably (make that hopefully) seg fault.

[1] I just made that up, I've no idea what the code is for PLUS.



Friday, 8 December 2023

The TMS9900

Previous: Scope                Next: Coding in C              Up: Intro  

Background

The TMS9900 has some unique features among processors of its era.  Firstly, it is a true 16-bit CPU - a rarity in the 1970's.  All operations are natively 16-bit - the address bus doesn't even have a least significant bit!  In fact, some ops like MPY and DIV have a src or dst that is 32-bit values.  

One oddity is that the CPU has no general-purpose registers on-chip. Instead, it has a workspace pointer register that defines a location in memory where 16 16-bit general-purpose registers reside. These are numbered R0 through R15.

Details of the CPU are already detailed comprehensively elsewhere, so these are some highlights that are relevant to building a GCC backend:

  • Registers in memory.  The workspace pointer (WP) points to a memory location where general-purpose registers reside.  Changing the WP provides a mechanism for an instant context switch.  Very handy for interrupt vectors.
  • No stack.  There is no dedicated stack pointer in the TMS9900.  By convention, R10 is often used.  If we also decide the stack grows downward, then a post-increment instruction (e.g. MOV *R10+, R1) can be used to implement a POP. Unfortunately, we have no post-decrement addressing mode so a PUSH needs two instructions (DEC R10, MOV R1, *R10).
  • Big-endian.   In itself, being big-endian is not terribly unusual but coupled with the registers-in-memory aspect does present some unique challenges.
  • Restricted byte instructions.  Many instructions have both a word and a byte variant (e.g. MOV/MOVB) but many also do not (NEG, ABS, INV).
  • Unsigned mul/div.  The multiply (MPY) and divide (DIV) opcodes only take unsigned operands.  To do signed mul/div/mod we need to first note the signs of the operands, then take absolute values, and finally correct the sign of the results.
  • Incomplete set of conditional jumps.  TMS9900 is missing some conditional jump instructions.  For example, it has JL, JLE, JLT for logical less, less or equal and less than but no JLTE for less-than-or-equal.  This means sometimes having to emit an additional negative jump.
  • No AND instruction.  There is no binary &value function in TMS9900.  The closest is SZCB which is really &~value so INV Rx; SZCB Rx is needed.  This can cause problems when the compiler actually does try to do &~value as it will try and emit INV Rx; INV Rx; SZCB Rx.  Also INV is word only.

Registers usage

While the 16 general purpose registers are generally fully orthogonal, some have additional special functions:

  • R0 - used as shift count register in shift instructions, also cannot be used as an index reg
  • R11 - return address from a branch and link (BL) instruction
  • R12 - base address for CRU operations
  • R13,R14,R15 - status, workspace and return address from a BLWP instruction (e.g. an interrupt)

Of these, we don't really care about R12 thru R15 as we don't emit any CRU operations or BLWP but we do take special care when using R0 and R11.

In addition for gcc, we need a stack pointer and a base pointer. R9 and R10 are used by convention for these purposes.

R12 has another function as a static chain register but this hasn't been implemented yet.

8-bit bytes and 16-bit words

So far, these peculiarities pose no great challenge.  But there is one oddity that has taken much if not most of the effort in stabilising gcc and this is the nature of registers that are big-endian and stored in RAM.  Insomnia had found byte order issues early on and these still persist today.

Specifically, as with any big-endian system, the most significant byte is stored at the lower address in a word.  The sequence:

    LI R1, >ABCD
    MOV R1,@>1234

will place the word >ABCD into memory address >1234.  Being big-endian, >AB is placed in >1234 and >CD is placed in >1235.   All good so far, but if we then do:

    CLR  R2
    MOVB >1234, R2

We might logically expect R2 to contain the 16-bit value >00AB but unfortunately, instead, it contains >AB00.  It's logical - it has moved a byte into the lower byte address of the register - but not intuitive.  When gcc comes along this becomes a major issue as gcc assumes that a register accessed with a byte operation can subsequently be used as a word reg, but we cannot mix and match 8 and 16-bit datatypes on the TMS9900 without explicit conversions.  While gcc does emit conversion instructions (insns) to extend and truncate values, it can in certain cases decide to omit these, as we will see later, which causes major headaches.

Long (32-bit) values

32-bit integer values are handled by allocating two consecutive registers or memory locations.  For basic operations like MOV this is straightforward.  For ADD or S, it requires also checking the carry from the low word and adding it to the high word.  For MPY and DIV it is much more complicated so these have been relegated to libgcc, as have 32-bit shift operations.

Register allocation

Scratch registers

In some insns, it is necessary to allocate a temporary, or scratch, register to hold an intermediate value. Ideally, any insns that need a scratch reg can declare that they “clobber” a scratch reg.  In the TMS9900 backend, I have prevented the compiler from using R0 as a general reg.  There are a couple of reasons for this: R0 cannot be used as an index register (MOV R1,@label[R0] doesn’t do anything) and R0 is needed as the shift count for shift operations.  Since we know R0 is generally available for us, we can and do use it as a scratch reg for several insns.  This simplifies the md patterns slightly and lowers the chances of a failure to allocate a scratch reg.  One use of R0 for example is to do a comparison to zero. Since MOV does an implicit comparison to zero, emitting MOV Rx,R0 is a low-cost way to compare a register to zero.  But why not move the register to itself?  Well we could (and I did for a short while), but if comparing say a memory location to zero then a couple of issues arise.  Doing MOV @label,@label is 6 bytes long but also may have unintended side effects.  It is common for some memory locations to have different read and write behaviours.  An example is writing to the cartridge ROM addresses>6000 to >601E which causes a bank switch on some hardware.  Comparing one of these to zero using the label move was causing an inadvertent bank switch!

Parameters

The tms9900.h file defines R1 thru R8 as registers available to pass parameters.  Since most functions have 8 or fewer parameters, this allows almost all parameter passing to happen in registers. Parameters 9 onwards, or any parameters in a variable parameter list, are passed on the stack instead.  I am not sure if it is safe to use any of R1 through R8 for other purposes if it is known that the parameter count is less than 8.  I’m also not sure if it is safe to modify registers passed as parameters.  In any assembly library calls, I have saved any regs that are modified just to be safe.

General registers

R12 thru R15 are allocated as general regs but marked as non-volatile.  The compiler will save the existing value of these regs to the stack if it uses them in a function.  Optimised code will often use R9 as well when it can optimise away the stack frame.  I did try to mark R12-R15 as general regs, but I saw the compiler tended to spill to the stack instead for some reason so reverted them back to nvols.


Friday, 1 December 2023

Scope

Previous: Intro           Next: TMS9900

Tools and libraries

The tools in binutils seem very mature.  I have not found any issues with the assembler, linker, objdump of other tools.  I've done a little bit of work in libgcc but only when it blocks compiler development.

I have steered clear of libc and have no plans to do any work there.  Tursi's libti99 provides printf() and other functions which has been enough for me.

GCC version

One aspect of picking up a development activity that start 13 years ago, is that many of the tools have changed, including the compiler itself.   Gcc-4.4.0 was the latest in 2010 but in 2023, gcc-13.2.0 is the latest.  While tools for retro computing are never going to go out of date (because they already are), environments do change.  Building gcc-4.4.0 with gcc-12.2.0 throws up lots of warnings for example.  But so far nothing has actually broken.

A lot has changed gcc4 and gcc13.  A lot hasn’t changed though and the basics of the machine description is still the same.  So moving to gcc13 doesn’t need a complete rewrite thankfully.  But enough has changed to make the upgrade non-trivial.  All code is now compiled as C++ which immediately causes several compilation errors.  Several #define options give a compilation error that the option is "poisoned" (presumably deprecated long ago). Other significant changes include a change to the way condition codes are handled.  Instead of being implied, they now have to explicitly called out as clobbers to the condition code.  Reload is particularly sensitive to clobbering of condition codes, so it is necessary to make several insns into define_insn_and_split with a condition of “reload done”.  This is documented here.

The size of the compiler has grown very significantly as well.  The line count went up an order of magnitude, with a corresponding increase in build time.  A clean build of gcc13 takes close to an hour on my desktop.  For these reasons, the only currently released versions of the tms9900 are for the gcc-4.4.0 compiler.  There is a dev branch for gcc13 though and initial tests have shown significantly better code output from gcc13.  Once gcc4 is stabilised, my the plan is to rebase the gcc13 branch and continue development there.

Development Philosophy

My approach is a little different to Insomnia's in a few respects.  I'm trying to keep things simple and keep my focus narrow.  I also believe that “Premature optimization is the root of all evil”.  The principles I'm following are:

  • Keep the scope narrow. Just the compiler. Happily, the binutils seem quite stable and I haven't had to touch anything there.  I'm not (yet) building any libraries apart from a few essentials in libgcc.
  • Deliver functional first and optimal later.  In the past, it looks like a lot of effort has gone into peepholes and other optimisations.  While we are working in a very constrained environment, fine tuning can happen later once we have a fully functional base to work with.
  • Avoid changes to GCC.  I'm trying very hard to avoid patching the compiler.  In theory, a backend for gcc should be confined to just a machine description file, a header file to define the target machine and optionally some C support functions.  GCC is something like 3 million lines of code in size.  I haven't enough time left on earth to ramp up on all the internals and certainly not enough to debug any breakage I may cause by tweaking the internals!
  • Keep the machine description simple.  No lengthy sequences of opcodes in insns - these can be placed in library functions.  No insns defined for any complex functions including 32-bit ops, floating point, etc.
  • Limit the use cases.  The compiler has no support for CRU operations or for interrupts or BLWP.  These are best handled using inline asm calls.  The inline asm can of course wrap C functions as well.

I published the URL to this blog on atari age.  The posts are in reverse chronological order but the best place to start is the beginning .