Thursday, 29 February 2024

I published the URL to this blog on atari age.  The posts are in reverse chronological order but the best place to start is the beginning.

Wednesday, 28 February 2024

Dev, Release, Install, Test, Debug

Previous: Missing Extends                           Up: Intro  

There are a few scripts in the repo for installing gcc and binutils as well as scripts to create new patches and debian packages.

I work exclusively on Linux (debian bookworm currently) and have a script to create a .deb package which can be used for debian, ubuntu and others.  The .deb package installs bins, libs, etc, in /opt/tms9900-gcc by default.  Many users though are on Windows or MacOS and I am relying on others to build packaged releases for those OSs since I can't.

Development

The src files are contained in the "dev" directory.  Only modifications to gcc and binutils are stored in the repo.  This means only storing about 50 files as opposed to over 60,000.  The dev directory contains two copies of any files that have been modified.  The original files for gcc are in gcc-4.4.0-orig and the modified files are in gcc-4.4.0.  Similarly for binutils.  New files (including "gcc/config/tms9900/tms9900.md") have a corresponding empty file in the orig directory.  This allows diff to create a patch that includes all changes and all additions to the stock gcc-4.4.0 distribution.

Files to be edited are mostly in "dev/gcc-4.4.0/gcc/config/tms9900".  This dir contains the md file as well as a few C and H files.  It also contains some library asm functions which get included in libgcc.a

To rebuild the compiler in the dev hierarchy, from the "dev/gcc-4.4.0/build" directory, you can use the following commands to build and install gcc and libgcc:

    make all-gcc
    make install-gcc
    make all-target-libgcc
    make install-target-libgcc

This assumes of course this directory has previously been configured for builds, from the build directory, using:

    ../configure --prefix <target-dir> --target=tms9900 --enable-languages=c,c++

Release

The script mkpatches.sh will generate new patch files.  It takes a version number as a parameter.  It will  update the gcc DATESTAMP and REVISION files with the current date and the supplied version number.  The patch revision can be queried in code by accessing the __TMS9900_PATCH_MAJOR__ and  __TMS9900_PATCH_MINOR__ macros which return integer values.

Install

The primary install tool is a script called install.sh.  It downloads gcc-4.4.0 and binutils-2.19.1, applies the patches to them, builds the compiler and toolchain, and installs to the specified directory.  This script will create a directory called build in the pwd and do all its work in there.  It marks progress by creating progress files such as ".gcc_patched" and ".gcc_built".  To do a clean build from updated patches, the easiest thing to do is "rm -rf build" or at the very least remove the progress files.

Test

There are a number of unit test files in the "test" directory.  These are broken up so each fits into a single cartridge ROM (8KiB).  A single EA5 test binary is on the TODO list.  The tests depend on libti99 to provide printf and other functions.  The file "tap.c" contains some basic functions to run tests and produce TAP output.  For example, the test_subreg.c file produces this output:



As issues are reported in the forums on Atari age, I add new tests to reproduce these and run them before each release.

Debug

If all goes well, you don't need this step.  But of course you probably will.  Finding out where things go wrong is of course very important.  Usually, this involves compiling a single module and examining the output.  I have found it easier to invoke the cc1 part of the compiler instead of running gcc.  cc1 takes input from stdin and creates (by default) an assembly output file called "gccdump.s".  My debug compile script looks like this:

    #!/bin/bash
    INCLUDE=-I../libti99i
    CFLAGS="$2 $3 $4 $5 $6 -Wall -da -std=c99 -fomit-frame-pointer     $INCLUDE"
    ../dev/gcc-4.4.0/build/gcc/cc1 $CFLAGS < $1

I can then add or not add options such as "-Os" as needed.  The "-da" flag to cc1 creates dump files of the various compiler passes, such as "gccdump.128r.expand", "gccdump.159r.combine" and so on.  These are not the easiest files to read but comparing them to each other can be useful to find where various changes to the output occurred.

I also added a command line switch called "-minline_rtl" which outputs the RTL patterns to the asm_out_file (the file being created as cc1 runs) as comments.  I found this very helpful to understand why certain patterns matched and some didn't.  For example, a standard function prologue might have output like this:

    ; addhi3-20 : (insn 20 4 21 <stdin>:6 (set (reg/f:HI 10 r10)
    ;         (plus:HI (reg/f:HI 10 r10)
    ;             (const_int -2 [0xfffffffffffffffe]))) 63 {addhi3}     (nil))
     dect r10
    
    ; movhi-21 : (insn 21 20 22 <stdin>:6 (set (mem:HI (reg/f:HI 10     r10) [0 S2 A16])
    ;         (reg:HI 11 r11)) 12 {tms9900_movhi} (expr_list:REG_DEAD     (reg:HI 11 r11)
    ;         (nil)))
       mov  r11, *r10

Here I can see that addhi3 insn number 20 emitted the DECT and movhi3 insn number 21 emitted the MOV.

ASIDE: Note the REG_DEAD note on R11 after the MOV.  This means the compiler does not need the value in R11 any more after this insn.  This is sometimes useful to know as we avoid use of scratch regs and so on when we know it is safe to destroy the contents of a reg.

NOTE: writing tests that work with -Os or -O2 can be tricky.  The compiler is very crafty at eliminating code that it sees produces invariant results.  A function that adds two constants together and assigns the result to a local variable will just get reduced down to assigning the result for example.  Declaring vars as non static and external to the test function ensures the compiler can't assume they hold particular values.  Assigning test values in a separate function also ensures it can't make assumptions about their contents.  Declaring functions as non static ensures they don't get inlined or eliminated.

Dummy md

I found it useful also to create a dummy md file and this is included in the repo.  This file creates no output, just comments.  The output is only to help me understand why some patterns get matched and others don't and so on.  It also lets me play around with constraints and predicates without having to worry about alternatives or lengths.



Thursday, 1 February 2024

Missing extend and truncate insns

Previous: Forcing Memory Constants            Next: Dev,Release,Install                Up: Intro  

Throughout the past few months, there have been frequently reports of cases where the compiler missed an extend or truncate insn.  A simple example is where a short and char are added together.  Insomia's blog shows that I'm not the only one who wrestled with this problem:

Various fixes have been tried including adding a separate subreg pass to the compiler to detect when subregs have been inserted.  I've disabled this pass as I don't think modifying the compiler is the way to go and in any case it didn't seem to be working as intended anyway.

There didn't seem to me to be a pattern to the dropped extends until I noticed it often happened when doing binary "and" operation.  So while this C code:

char c;

int i;
int z;

void f()
{
  z = (c + i) & 0x1f;
}

correctly emitted this assembly unoptimised:

def f
f
movb @c, r1
movb r1, r2
sra  r2, 8
mov  @i, r1
a    r2, r1
andi r1, >1F
mov  r1, @z
b    *r11


BUT when using -Os or -O2, it emitted this:

def f
f
movb @c, r1
a    @i, r1
andi r1, >1F
mov  r1, @z
b    *r11


Shorter, but in fact too short.  Why did the optimiser drop the extend "SRA r2,8"?  It turns out it is because of the mask in the binary and.  The compiler determined that the bits in >1F don't affect the high order byte of the result so the extend was redundant.  But this is based on the (faulty) assumption that the low byte of the register is written to MOVB whereas we know in the TMS9900 it is the MSB that is changed.

Finding exactly where in GCC this reduction happens would be like searching for a needle in a haystack (it happens in the combine pass, but gcc/combine.c alone is 13KLOC) and modifying gcc is not something I want to do.

Looking through the GCC dumps I could see that insn 6 and 7 in gccdump.158r.dce :

(insn 6 3 7 2 <stdin>:7 (set (reg:HI 22 [ c+-1 ])
        (sign_extend:HI (mem/c/i:QI (symbol_ref:HI ("c") <var_decl 0x7f7e81e82000 c>) [0 c+0 S1 A8]))) 17 {extendqihi2} (nil))

(insn 7 6 8 2 <stdin>:7 (set (reg:HI 23)
        (plus:HI (reg:HI 22 [ c+-1 ])
            (mem/c/i:HI (symbol_ref:HI ("i") <var_decl 0x7f7e81e820a0 i>) [2 i+0 S2 A16]))) 63 {addhi3} (expr_list:REG_DEAD (reg:HI 22 [ c+-1 ])
        (nil)))

were replaced in gccdump.159r.combine with:

(insn 7 6 8 2 <stdin>:7 (set (reg:HI 23)
        (plus:HI (mem/c/i:HI (symbol_ref:HI ("i") <var_decl 0x7f7e81e820a0 i>) [2 i+0 S2 A16])
            (subreg:HI (mem/c/i:QI (symbol_ref:HI ("c") <var_decl 0x7f7e81e82000 c>) [0 c+0 S1 A8]) 0))) 63 {addhi3} (nil))   
 
So I could see the extend had been replaced with a subreg.  There was a message at the start of the combine dump that said:

Failed to match this instruction:
(set (reg:HI 23)
    (plus:HI (sign_extend:HI (mem/c/i:QI (symbol_ref:HI ("c") <var_decl 0x7f7e81e82000 c>) [0 c+0 S1 A8]))
        (mem/c/i:HI (symbol_ref:HI ("i") <var_decl 0x7f7e81e820a0 i>) [2 i+0 S2 A16])))

So it is possible I could have provided a pattern that combines plus and sign_extend and this would have satisfied the combiner.  But I didn't want to duplicate every pattern with variants that include extends.

Instead, looking further on, I could see that in gccdump.174r.ira the subreg ref disappeared and was replaced with an offset of -1:

(insn 7 17 8 2 <stdin>:7 (set (reg:HI 1 r1 [23])
        (plus:HI (reg:HI 1 r1 [+-1 ])
            (mem/c/i:HI (symbol_ref:HI ("i") <var_decl 0x7f7e81e820a0 i>) [2 i+0 S2 A16]))) 63 {addhi3} (nil))

This makes sense.  The register allocator (IRA) has to allocate real regs to replace subregs.  As far as it is concerned, R1 with an offset of -1 contains the byte it wants.  The macro REG_OFFSET returns this values, so at least we have a way to detect when an extend has been dropped.  So I added an emit of SRA whenever I saw an offset of -1.  But then I got another bug report for code like this:

int f (char c, unsigned int len)
{
    return c+len;
}

which was emitting an SRA where one already existed.  Looking at gccdump.s I could see:

def f
f

; extendqihi2-8 : (insn 8 5 9 <stdin>:2 (set (reg:HI 1 r1 [orig:26 c+-1 ] [26])
;         (sign_extend:HI (reg:QI 1 r1 [ c ]))) 17 {extendqihi2} (nil))

sra  r1, 8

; addhi3-14 : (insn 14 9 20 <stdin>:4 (set (reg/i:HI 1 r1)
;         (plus:HI (reg:HI 1 r1 [orig:26 c+-1 ] [26])
;             (reg:HI 2 r2 [ len ]))) 63 {addhi3} (expr_list:REG_DEAD (reg:HI 2 r2 [ len ])
;         (nil)))

So here the compiler actually did extend and the offset refers to the original reg (pseudo reg 26), not the current reg.  So I had to add another condition:

  (ORIGINAL_REGNO (operand) == REGNO (operand)) 

And eventually I had a function called tms9900_operand_subreg_offset() that detects dropped extends or truncates.  If it returns true, the caller must emit a corrective opcode.

A call to this function has been added to addhi3, subhi3 and andhi3 and they take corrective action if it returns true.  I subsequently noticed that these insns could have an offset present in either operands[1] or operands[2] so I had to make two calls in each.  

But so far, touch wood, I haven't had any more reports of missed truncates or extends.



Monday, 29 January 2024

Forcing memory constants

Previous: Instruction Patterns                Next: Missing Extends               Up: Intro  

The TMS9900 can do some operations directly with constants, but not that many.  There is LI to load a reg, CI to compare, AI to add and so on.  But there are many things it can't do, such as load a constant directly to a memory location, compare a memory value to a constant, or do anything at all with byte constants.

I realised at some point that it was more efficient in many cases just to store a literal value in code memory and refer to it.  So comparing a byte value went from:

        li       r1,>400
        cb       @label,r1

to this:

LC0
        byte 4
        ...
cb       @label, @LC0

So 8 bytes are reduced to 5 bytes and 2 instructions to 1 instruction.  And if LC0 gets referenced again, the savings are even more.  It also fixed some issues with greater than or less than that had arisen when comparing a byte to a constant loaded into a reg.

This was done using the define_insn_and_split pattern as follows:

(define_insn_and_split "cmpqi"
  [(set (cc0)
        (compare (match_operand:QI 0 "nonimmediate_operand" "=g")
                 (match_operand:QI 1 "general_operand"      "g")))]
  ""
  "cb   %0, %1"
  "CONST_INT_P (operands[1])"
  [(set (cc0)
        (compare (match_dup 0)
                 (match_dup 1)))]
  {
    tms9900_debug_operands ("split_cmpqi", NULL_RTX, operands, 2);
    int val = INTVAL (operands[1]) & 0xff;
    operands[1] = force_const_mem (QImode, GEN_INT (val));
  }
  [(set_attr "length" "6")]
)

In gcc13, this can become a define_insn_and_rewrite to avoid duplicating the pattern but this works well in gcc4.  The insn part just emits CB in all cases.  The split part has a condition of CONST_INT_P (operands[1]) in which case it replaces operands[1] with a force_mem_const which emits a constant into memory.  So effectively converting the "i" constraint into a "Q" constraint.

NOTE: for some reason that I haven't yet figured out, adding proper constraints causes the compiler error "unable to generate reloads" to be emitted.  This happens even if all constraint types are included.  Since I know the immediate case has been rewritten, I just replaced the constraints with "g" and this worked.  There must be something I'm missing between "g" and "rR>Qi".  It does mean the length is too long for some alternatives but that's only a very minor inefficiency.

NOTE: I found that it is important to call GEN_INT(val) without any mask.  Adding a mask with 0xff caused a compiler assert in combine.c:do_SUBST() when it compares the value against trunc_int_for_mode().  Just removing the mask solved this issue.





Thursday, 21 December 2023

Instruction patterns and compiler passes

Previous: Machine Descriptions           Next: Forcing Memory Constants               Up: Intro  

There are several ways to create an instruction pattern, define_insn, define_expand, define_split, etc.  Which one you choose can depend on which compiler pass you expect them to be used in.

RTL pattern matching

An important realisation I made early on is that in all but the final stages of compilation, gcc is using the RTL to drive aspects like as register allocations and optimisations.  The compiler knows nothing about the opcodes - it doesn’t even see them - they are emitted directly to the output file.  So while it feels like the opcode output is the most important thing to focus on, I think getting the instruction and RTL patterns right is actually more likely to get the compiler to output efficient code.  Knowing when to use each method of defining a pattern took me a while to understand.

In the TMS9900 back-end I have used these patterns:

  • define_insn
  • define_expand
  • define_insn_and_split
  • define_peephole2

define_insn

The standard instruction pattern is defined using define_insn.  This is generally straightforward but does have limitations.  When there are many alternatives and options the output code can become convoluted.  It can also become very repetitive when slight different insns have to share similar methods (signed and unsigned mul/div for example).

The named insns are used initially to build the RTL block list.  Later passes match patterns and will use named patterns and unnamed patterns.

define_expand

The define_expand instruction pattern is one that I liked and used whenever I could.  In many cases, a complicated pattern can be expanded into a number of simpler insns that emit one or at most two opcodes.  Expanding like this also enables the optimiser to make more informed decisions as to how to reduce the RTL block list.

Expands defined with define_expand do have a major limitation though in that they are expanded in the first pass of the compiler, and not referred to again.  This means if a later pass of the compiler, say the combiner, generates RTL that matches an expand, the expand won’t be used.  Another limitation of expand is that it can't expand into another expand.  For these reason, expands have limited utility in some cases.

NOTE: if define_expand has a pattern that is generated by an optimising pass of the compiler, the pattern will not get matched.  define_expand patterns are only used before any optimising passes.

define_insn_and_split

Splits are another method of achieving an expansion of a complex RTL into multiple insns.  In some cases, a split can be used to “preprocess” a pattern.  There is an example here.  An example in the TMS9900 back-end is the conversion of immediates into constant memory references.  Since most opcodes, such as MOV, in TMS9900 can’t use immediate values, immediates must be first loaded into a register.  We could use an expand for this, but if a later compiler pass then sees an opportunity to reduce two separate MOV instructions, the expand will be lost.  The split gets around this.

If an insn is preprocessing an RTL that is then later used as an insn, the define_insn_and_split pattern can be used to combine both together.  I haven't found a use case for using define_split on its own in the current backend.

define_peephole2

Peepholes can be used to match patterns where we can make an optimisation that is not obvious to the compiler due to the nature of the TMS9900.  They are defined using define_peephole2 (define_peephole is deprecated apparently).  An example of this is where gcc generates code to copy a value from one memory location to another via a register.  The TMS9900 can do this directly.  So this peephole:

(define_peephole2
  [(set (match_operand:QI 0 "register_operand" "")
        (match_operand:QI 1 "memory_operand" ""))
   (set (match_operand:QI 2 "memory_operand" "")
        (match_dup 0))]
  "peep2_reg_dead_p(2, operands[0])"
  [(set (match_operand:QI 2 "memory_operand" "")
        (match_operand:QI 1 "memory_operand" ""))]
  {
    tms9900_debug_operands ("peep-movqi-mem-mem", NULL_RTX, operands, 3);
  }
)

matches two set patterns for bytes.  The first is moving a memory_operand to a register_operand and the second is moving that register to another memory_operand.  The operands are number 0, 1 and 2.  Operands are only defined once so the second reference to the register is replaced with match_dup 0.  

There is a condition here that operands[0] (the register) is dead because we don't want to eliminate a load of a value into a reg that is later needed.  So only eliminate if the register is indeed only a temporary location.

NOTE: Peepholes don't have names which can make it hard to see when they are used.  I added a debug statement to each one so I can see them in the output file with the -minline_rtl switch.

NOTE: I've used peepholes very sparingly and focused instead on creating patterns that correctly match as many use cases as possible.  Previously, peepholes were created much more liberally, but "premature optimisation ...."

Problems during compiler passes

The compiler makes many passes, but these are the ones I’ve only found errors most likely to occur:
  • expand - The expand pass expands all the define_expand patterns.  
  • combine - The combine pass looks for insns that can be merged together.
  • reload - The reload pass reduces the register usage and spills values to the stack when it can't allocate a register
  • ira - The ira (integrated register allocator) actually does the register allocation.  
  • peephole2 - The peephole pass looks for sequences of instructions that match a pattern.
  • final - emits the actual assembly

Expand and insn lists

As the compiler does its thing, it generates lists of insns for each function.  Because it doesn't actually emit any code until a function has been completely parsed (as far as I can tell) we know a lot about the function even before we have emitted any code.  For example, we know if it is a leaf (doesn't call any other functions) and don't need to save R11 in the function prologue.

Insns are numbered using a UID.  In the dumps for each pass, each insn is listed with its UID and the UID of the previous and next insn.  This is useful for finding where insns were combined or eliminated.

If we compiled this code fragment:

void f (void)
{
    int x = 3;
}

gcc would create a list of blocks (with only one block in this case).  In the file gccdump.128r.expand we would see: 

(insn 6 5 11 3 <stdin>:3 (set (mem/c/i:HI (reg/f:HI 17 virtual-stack-vars) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) -1 (nil))

This does look confusing initially, but essentially it says it is an insn, numbered 6, previous is 5, next is 11, input is stdin line 3 and it is doing a "set" of a HI reg to a symbol.  It has allocated a pseudo-reg 17 for now to hold the result.  Any reg higher than the highest physical reg (r15 in our case) is a pseudo-reg.  This will be allocated physical regs later in the ira and reload passes.

The compiler will look for a named insn called "movhi" (move a halfword) to do this operation.  Our implementation of movhi is a define_expand which emits a CLR or a SETO if it can or otherwise emits a tms9900_movhi insn.  The expand also replaces an immediate with a const if required.

In the next pass (gccdump.133r.vregs) we can see a reference to tms9900_movhi and an allocation on the stack frame (R9 = frame pointer):

(insn 6 5 11 3 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 9 r9) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

and in gccdump.174r.ira

(insn 6 3 14 2 <stdin>:6 (set (mem/c/i:HI (reg/f:HI 10 r10) [0 x+0 S2 A16])
        (mem/u/c/i:HI (symbol_ref/u:HI ("*LC0") [flags 0x2]) [0 S2 A16])) 12 {tms9900_movhi} (nil))

Combiner

In this pass, the combiner looks for sequences of patterns in the insn list and tries to match these with other patterns (named and unnamed) in the machine description or eliminate unnecessary insns.  In the output file gccdump.159r.combine it is common to see entries like (note 1 0 5 NOTE_INSN_DELETED) to say an insn has been removed from the list.

In this and other passes, the compiler tries to match patterns provided by define_insn, define_split, etc, but not define_expand.

An example would be a NOT(AND) sequence where we can do this with just the SZC instruction.  The insn:

; This handles reverse-order not-and combinations                                                                     
(define_insn "*not_andhi" 
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,rR>,Q,Q") 
        (and:HI (not:HI (match_operand:HI 1 "nonimmediate_operand" "rR>,Q,rR>,Q"))                                    
                (match_operand:HI 2 "nonimmediate_operand" "0,0,0,0")))]                                              
  ""                                                                                                                  
  { 
    tms9900_debug_operands ("*not_andhi", insn, operands, 3);                                                        
    return "szc  %1, %0";                                                                                             
  }
  [(set_attr "length" "2,4,4,6")])  

is used to ensure this C code:

int x;
int y;
void f()
{
 x = x&~y;
}

produces this output:

def f
f
szc  @y, @x
b    *r11

Reload / IRA

Up until these passes, pseudo regs may be used in the pattern list.  A pseudo reg is any reg that has a value of 16 or higher.  Earlier passes allocate as many of these as they need and its up to reload to replace them with physical regs.  Doing this can mean moving insns around or spilling to the stack.

These passes sometimes throw up errors about constraints or sometimes for inline assembly. "error: ‘asm’ operand requires impossible reload" is one that was recently reported for example.

Peephole

The peephole pass examines sequences of patterns in the RTL list and tries to find matching peephole pattern definitions.  This pass is quite late in the process so optimisations and register allocations are already done.  Scratch regs etc cannot be allocated at this point.

Final

The final pass is the only that actually emits the opcodes.  It seemed strange to me at first, but nothing is actually output until this pass and none of the code in any insn that generates code is executed until this pass.  Up to this point, all work has been done only on the insn rtx list.



Tuesday, 19 December 2023

TMS9900 Machine Description

Previous: Coding in C                Next: Instruction Patterns               Up: Intro  

Insns

The machine description file gcc/config/tms9900.md is the primary definition of the TMS9900 backend.  Machine descriptions for gcc in general are documented here.  They use the .md extension (an unfortunate clash with markdown) and essentially list a series of instructions with patterns (called "insns") that emit assembly opcodes to achieve a result.  The machine description is parsed by the compiler build system and a number of header and source code files are emitted which are then built in to the compiler.  In many places in the md file it is possible to include C code.  Names of named insns are used to generate a function called gen_<insn-name> so names must follow the same rules as C function naming (can't include symbols such as "-" or "$" for example).  The insn patterns are used by the compiler to match the RTL generated by the compiler front-end.

Insn names

Insns can be named or unnamed.  Named insns are either standard names (like "movqi") or else should have a unique name that won't clash with another backend, typically by prepending the back end name (e.g. "tms9900_movqi").  Unnamed insns can simply have an empty name ("") or a name starting with an asterisk ("*my_insn"), in which case the name is emitted in debug output but otherwise treated as empty.  In early compilation, gcc will look for named insns to match patterns in the RTL.  For example, it will look for "movqi" if it wants to move a byte or "addhi3" if it wants to add 16-bit values.  In later stages, it will try to match patterns which is where unnamed insns come in.

NOTE: only named insns can be referred to by define_expand, define_split, etc.

Library calls

When the compiler can't find a named insn to perform an operation it will often instead emit a library call so floating point methods are defined in libgcc.  This can't be done for fundamental insns like "movqi" but can be done for example to add two 32-bit values.  If gcc can't find an insn called "addsi3" it will emit "bl __addsi3" instead.

Sometimes, the compiler will try to generate its own code for missing insns.  For example this code:

long x;
int y;
long z;
 
void f(void)
{
   z = x>>y;
}

compiled without optimisation will generate a sequence of about 30 opcodes inline to do the 32-bit shift, but if compiled with -Os, the output will simply be "bl @__ashrsi3".  This is a very significant space saver but does add a few cycles for the call and return.

Operands

Each insn takes several operands, typically 2 or 3.  Operand values have a mode which indicates their size.  The size of standard int (SI) in gcc is 32-bits.  A half-word (HI) is 16 bits and a quarter-word (QI) is 8 bits.  Other types like double-word (DI) are not used in the TMS9900 backend.  There are also float and double (SF and DF) modes but there are no insns defined for these since the TMS9900 does not have any floating point hardware capability.  One unusual mode is the condition code (CC) which is set by compare operations.  This is implicitly "clobbered" by insns in gcc-4.4.0.  The TMS9900 does an implicit compare to zero on most instructions.

The operand modes and count are usually appended to the end of the insn so "extendqihi2" means perform an extend operation where the src is QI (8 bits), the dest is HI (16 bits) and there are two operands.

Constraints and lengths

The type of the operand is defined using a single letter called a constraint.  Constraints can be grouped together and comma-separated into alternatives. The values use in the TMS9900 backend are:

  • r = a register
  • R = a register indirect reference to a memory location (e.g. MOV *R4 ....)
  • > = like R, but with a post increment (e.g. MOV *R4+....)
  • Q = a label address of a memory location (e.g. MOV @LABEL ...)
  • i = an immediate value (e.g. LI R1,>1234)
  • n = a constant number, excludes labels and other constants
  • 0 = not really a constraint.  It just means this operand must be the same as operand[0] for this alternative.  It could also be 1, 2... to copy constraints for other operands.

For r, R, > and Q, the opcode is usually the same, as the addressing mode doesn't change the opcode length, but the length of the instruction varies if using Q, so it makes sense to break these out in separate constraint "alternatives".  Each alternative can have a slightly different implementation and/or can have a different length.

In addition, some values can use shorter methods for set, add, etc.  Loading a zero into a register can be achieved with "CLR Rx" instead of "LI Rx,0"; loading -1 can be done with "SETO rx"; adding 1 to a reg can be done with "INC Rx" instead of "AI Rx,1" and so on.  Each of these opcodes is 2 bytes shorter than an immediate.  To make these easier to detect as alternatives, these constraints are also defined:

  • L = +2 or -2; used to emit INCT or DECT
  • M = -1; use to emit SETO or DEC
  • N = +1; used to emit INC
  • O = 0; used to emit CLR

NOTE: It is important with these constraints that they appear in the alternatives list before a broader constraint like 'i'.  For example, the addhi3 operand2 constraints are:

   (match_operand:HI 2 "general_operand" "rR>,Q,LMN,LMN,i,rR>,Q, rR>,Q")

Because 'i' will match any immediate value so must be after the 'LMN' constraints which are more specific.

Constraints can also be preceded by modifiers.  The equals ("=") modifier means this value is set.  This is always operand[0] in the TMS9900 backend.  The percent ("%") modifier means operands are commutative, i.e. can be swapped because A+B is the same as B+A, etc.  Sometimes this makes the compiler's life easier.  

Having commutative operands is actually particularly important for the TMS9900 since arithmetic operators only take two operands.  So there is no opcode to do for example x=a+b.  Instead, we need to do x=a;x+=b.  For this reason, generally, operand[1] uses the "0" constraint to say it must match operand[0].

Predicates

When declaring an insn, each operand is given a predicate that gcc uses to match operand types.  The predicates used in the TMS9900 backend are:

  • general_operand - anything will match
  • register_operand - only a register will match
  • nonimmediate_operand - anything except an immediate (constant) value will match
  • const_int_operand - only a constant integer will match
  • memory_operand - only a memory reference (R or Q, see below) will match
The pattern for an insn may look like this:

(set (match_operand:HI 0 "register_operand" "=r") ... )

The predicate is used first to find a matching pattern and then the constraint is used to identify which alternative to use.  

NOTE: a constraint list MUST contain all alternatives that match.  A pattern like:

(set (match_operand:HI 0 "general_operand" "=rR,Q") ... )

can cause a compiler error since the predicate will match an immediate value but gcc won't be able to find an alternative to use since there is no "i" constraint.

Extending and truncating

Changing an operand to a larger width is called an extend.  Reducing it a shorter width is called a truncate.  There is only way to truncate an operand but extending can be zero extend or sign extend.  We can extend from QI to HI, HI to SI or QI to SI and truncate from SI to HI, SI to QI or HI to QI.  This is gives us 9 unique insn patterns we need to implement!

Truncating or extend from SI to HI and vice versa is quite straightforward.  It's just a matter of discarding a word or adding a zero word or an inverted word (>FFFF) if sign extending a negative value.

Generally, we can't truncate or extend a memory reference, only a register.  There is usually no use case for this anyway as a memory location never changes size.  But a value in a register could be loaded as a QI and then accessed as a HI given that gcc thinks low-order byte values are accessible in a wider word.

Going from SI to QI and vice versa needs to be done in two steps, SI to HI and then SI to QI, so these are candidates for define_expand.  More on define_expand later. To be safe, I use a scratch reg for the intermediate value.  For example:

(define_expand "zero_extendqisi2"
  [(set (match_operand:SI 0 "register_operand" "")
        (zero_extend:SI (match_operand:QI 1 "nonimmediate_operand" "")))]
  ""
  {
    tms9900_debug_operands ("zero_extendqisi2", NULL_RTX, operands, 2);
    rtx tmp = gen_reg_rtx (HImode);
    emit_insn (gen_zero_extendqihi2 (tmp, operands[1]));
    emit_insn (gen_zero_extendhisi2 (operands[0], tmp));
    DONE;
  }
)

NOTE: define_expand C code branches must end in either a DONE or a FAIL, or the insn pattern will remain in the list.  These are macros that perform a goto in the generated code.  Omitting these won't cause any error to be emitted but will cause really weird behaviour in the output that will take forever to track down.  Been there, done that.

The trickier conversions are QI to HI and vice versa.  For HI to QI truncation we don't care about the low-order byte and we can emit "SWPB" to just change the order of the bytes.  To extend QI to HI we do care and must fill the high-order byte with zeros or a sign extension.  Sign extension is done using an arithmetic shift "SRA Rx,8"and zero extension with a logical shift  "SRL Rx,8".

Lengths, jumps and branches

Each "define_insn" and (not "define_expand" or "define_split") has an array of length values at the end, like this:

 [(set_attr "length" "4,2,4,4,6")] 

These lengths are the number of bytes emitted by each of the alternatives identified by the constraints.  The number of length attributes must exactly match the number of alternatives (e.g. 5 alternatives must have 5 lengths).  These lengths are used to determine the distance to a label for a jump instruction.  In TMS9900 we can jump backwards by 256 bytes or forward by 254 bytes.  If a longer jump is needed we need to emit a branch ("B") instruction.  

In some cases we can't determine the exact length of an insn because it may vary depending on the circumstances at the time.  For example, we may have to emit an additional SWPB to correct a dropped truncate.  In these cases I've made the length to be the worst case length.  This does mean we will occasionally emit a branch where a jump would have been sufficient but this should be rare.

The function output_branch() in gcc/config/tms9900/tms9900.c determines what opcodes to output for all conditional jumps and will replace a jump with an inverted jump over a branch if distance requires it.

Tailcalls

If the optimising compiler sees that the end sequence of two functions are identical, it will emit insns for only one of the functions and issue a branch instruction in the other and use a common epilogue and return.  This is called a tailcall.  It is identified in the md by checking the macro SIBLING_CALL_P(insn).

Defining insns

The main way to define an insn is by using "define_insn".  To define the insn "movhi", which moves one word to another, we include this in the md file:

(define_insn "movhi"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,rR>,Q,  Q,r")
        (match_operand:HI 1 "general_operand"      "rR>, Q, rR>,Q,i"))]
  ""
  {
    if (which_alternative == 4)
    {
      return("li   %0, %1");
    }
    else
    {
      return("mov  %1, %0");
    }
  }
  [(set_attr "length" "2,4,4,6,4")])

Here we have defined the name (movhi), the operation (set) and described the two operands.  The first (destination) operand is a "non_immediate" operand, i.e. any type except an immediate.  The second (source) is a "general_operand" which can be anything.  There are 5 alternatives defined.  The Q constraint adds 2 bytes to the length of the opcode but r, R and > are the same length, so each combination of rR> and Q are broken out as separate alternatives.  "MOV r,r" is 2 bytes but "MOV @label1,@label2" would be 6 bytes.  The final constraint is for immediates.  Since TMS9900 does not allow immediate values to be used with MOV we instead emit load immediate (LI) for alternative number 4.  This operation is 4 bytes long.

The empty string after the pattern is the condition.  This insn will only be used if the condition is met.  This is usually empty.

NOTE: The entire insn is surrounded by a pair of brackets ().  The pattern and the set_attr are each surrounded by a pair of square brackets [].  Each part of the pattern is surrounded by brackets as well ().  The nested brackets gets confusing.  My most common cause of build failures is a missing or misplaced closing bracket.

Emitting code

There are many ways to emit code in the md file.  The example above uses C code as indicated by the opening brace ("{").  If instead of a brace, there is a double quote, the text is output directly.  For an insn like abs which has only one operand, could have done this:

(define_insn "abshi2"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,Q")
        (abs:HI (match_operand:HI 1 "nonimmediate_operand" "0,0")))]
  ""
  "abs  %0"
  [(set_attr "length" "2,4")])

Or we could also list instructions for each alternative like this (although there is no point as the same instruction is used for both alternatives):

(define_insn "abshi2"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=rR>,Q")
        (abs:HI (match_operand:HI 1 "nonimmediate_operand" "0,0")))]
  ""
  "@"
  "abs  %0"
  "abs  %0"
  [(set_attr "length" "2,4")])

Or within the C code, we could just do { return "abs %0"; }

Or within C, we could call you can use { output_asm_insn ("abs %0", operands); return ""; }

Or if you only have a short piece of C code you could add a double quote followed by an asterisk and then a C statement with any double quotes escaped like this: "* return \"abs %0\""

I think there are just too many alternatives here tbh.  I think mixing and matching methods of emitting output is going to be error prone.  I personally like consistency so have tried to always use C code to emit opcodes.  I also try and avoid mixing output_asm_insn("abs %0") and return "abs %0" in the same insn.

NOTE: since the MD file is compiled to C code, omitting a return will cause random output.  Been there done that too.



Thursday, 14 December 2023

Coding machine descriptions in C

Previous: TMS9900                Next: Machine Description               Up: Intro  

Very often it is convenient to write a fragment of C code inline in the tms9900.md file.  Sometimes if a fragment gets too big or will be used in a few places, it can be placed as a function in the tms9900.c file with a corresponding prototype in tms9900-protos.h.  There are a few tips and tricks I picked up about writing C code in this environment.

rtx objects

The compiled RTL is represented internally by an object called "rtx".  This object is used for lots of things, insns, operands, constants, etc.  There are several macros defined to get information about an rtx object.

Testing whether an rtx is something is done using macros that have _P appended.  So to test if an rtx is an insn, you can call INSN_P(rtx); to test if it is a reg you can call REG_P(rtx); to test for numbers use CONST_INT_P(rtx), and so on.  For the general case, GET_CODE(rtx) will return an enum with the type of an rtx.  GET_RTX_NAME(code) will return a printable string containing the code name.  

NOTE: I've found the code name to reliably match the enum so when debugging it is much easier to find an insn with a name of "plus" than it is to wade through the enum definition file trying to figure out which one is number 573. [1]

If it is a reg or memory location, it will have a mode.  In C code, the modes are defined by an enum as QImode, HImode, SImode, etc.  To get the mode of an rtx, the macro GET_MODE(rtx) can be used.  To get a printable name, the macro GET_MODE_NAME(mode) can be used.

Registers

Registers have defines for convenience such as HARD_R0_REGNUM or FRAME_POINTER_REGNUM (aka HARD_R9_REGNUM).  The stack pointer is also available through a global rtx called stack_pointer_rtx.

NOTE: be sure to wrap register values with gen_rtx_REG() before passing to any functions that expect an rtx parameter.  I'm sure there will be a warning, but it will get lost in the big sea of warnings and then gcc will just crash.

Emitting opcodes

C code can emit opcodes.  This is done in functions tms9900_expand_prologue() for example.  The emit_insn() function will emit an insn rtx into the output list.  Each named insn in the md file will have a corresponding function with a "gen_" prefix so you can call gen_movhi() and so on to emit these insns.  The functions take an array of rtx objects as parameters for operands.

Operands

Operands are passed to insns as an array of rtx objects.  So operand[0] is the first rtx and so on.  I did once have a concern that we shouldn't modify rtx objects passed to insns but that seems to be a valid thing to and many other backends do it all the time.  This is useful if we want to replace an operand with something else or swap them around, etc.

NOTE: be careful not to access operands with a higher array index than your insn defines.  As with any C code there is no bounds checking on the array.  If an insn takes 3 operands and you try to access operand[3] the compiler will probably (make that hopefully) seg fault.

[1] I just made that up, I've no idea what the code is for PLUS.



I published the URL to this blog on atari age.  The posts are in reverse chronological order but the best place to start is the beginning .