Compiler Back End

Well, the complete code base is somehow a compiler backend for LLVM. Here, we really speak about the final code generation passes that you may find in src/backend.

As explained in the scalar IR presentation, we bet on a very simple scalar IR to make it easy to parse and modify. The idea is to fix the unrelated problem (very Gen specific) where we can i.e. when the code is generated.

The code generation in the compiler backend is classically divided into four steps

  • Instruction selection (defined in src/backend/gen_insn_selection.*pp). We expose an interface for the instruction selection engine. We implemented a very simple selection (called SimpleSelection) that does a quick and dirty one-to-many instruction generation.

  • Register allocation (defined in src/backend/gen_reg_allocation.*pp). The code implements a linear scan allocator on the code selected in the previous pass. See below for more details about register vector allocations.

  • Instruction scheduling. This one is not done yet. We just output the same instruction order as the program order. Note that we plan to implement an adaptive scheduling between register allocation and instruction selection (to avoid spilling as much as possible)

  • Instruction encoding. This is the final step that encodes the program into Gen ISA.

Instruction selection

Usually, the instruction selection consists in mapping p instructions to q ISA instructions under a cost driven model. Each basic block is therefore tiled into some numbers of groups of ISA instructions such that the final cost is minimized.

The literature is particularly dense on the subject. Compilers usually use today either tree matching methods or selection DAG techniques (as LLVM backends do)

The instruction selection is still a work in progress in our compiler and we only implement the most stupid (and inefficient) technique: we simply generate as many instructions as we need for each individual IR instructions. Since we do not support immediate sources, this in particular leads to really ugly looking code such as mov (16) r2:f 1.f. It is still a work in progress.

Other than that, the instruction selection is really a book keeping structure. We basically output SelectionInstruction objects which are the 1-to-1 mapping of Gen ISA encoding functions defined in src/backend/gen_encoder.*pp.

However, the SelectionInstruction still use unallocated virtual registers and do not use vectors but simply tuples of virtual registers.

Register allocation

The register allocation actually consists in two steps:

  1. Handling the vector for all the instructions that require them

  2. Performing the register allocation itself

Step 1 consists in scanning all the vectors required by sends. Obviously, the same register may be used in different vectors and that may lead to interferences. We simply sort the vectors from the largest to the smallest and allocate them in that order. As an optimization we also identify sub-vectors i.e. vectors included in larger ones and no not allocate them.

The code may be largely improved in particular if we take into account liveness interferences as well. Basically, a register may be part of several vectors if the registers that are not in both vectors at the same location are not alive at the same time.

This is still a work in progress. Code is right now handled by method GenRegAllocator::allocateVector.

Step 2 performs the register allocation i.e. it associates each virtual register to one (or several) physical registers. The first thing is that the Gen register file is very flexible i.e. it can (almost) be freely partitioned. To handle this peculiarity, we simply implemented a free list based generic memory allocator as done with RegisterFilePartitioner in src/backend/context.cpp.

We provide two directions of memory allocation. From tail to head direction is used for normal register, and from head to tail is for the curbe payload register allocation.

We then simply implemented a linear scan allocator (see gen_reg_allocation.cpp). The spilling is implemented in the same file. The heuristics we used is the register's end point. It always try to spill the register with largest liveness end point if possible. Although Gen support to spill 4 SIMD8 register at once, we only support one currently. Need to optimize it latter, at least for the vectors' spilling. Maybe a new pass in the backend to find opportunity to gatter more spilled register into one contiguous area is also worth to do. We also can consider the spill register's interval to do smarter scratch memory allocation to reduce scratch memory requirement.

Instruction scheduling

Pre register allocation instruction scheduling is not implemented. Although it may reduce register pressure but it may also increase register dependencies. We need to think about a trade-off mechanism to do this optimization.
Post register allocation scheduling has been implemented, and could get about 8% performance improvement. But those cycles data are based on experiments not accurate. We may need to tweak it when we get more information.

Instruction encoding

This is mostly done in src/backend/gen_context.cpp and src/backend/gen_encoder./*pp. This is mostly glue code and it is pretty straightforward. We just forward the selection code using the physically allocated registers. There is nothing special here. Just boilerplate.

There are plenty of huge macro instructions in the gen_context.cpp currently. Most of them are for the long/double support on a Gen platform which doesn't support long/double in the hardware level. We may need to clean up and move those non-hardware related functions into upper layer. Too many huge instruction which will totally make the register spilling and dead code elimination harder and inefficient.