Falcon LLVM backend

This is a port of llvm to the Falcon microprocessor. It consists of the following parts:

Current status: llvm and clang are slowly starting to work; lld, lldb, compiler-rt are not yet supported.

General Environment

Falcon is an MCU with very limited code and data space, thus lots of features are not supported, or should be avoided due to their space cost.

There are 4 revisions of Falcon hardware: v0, v3, v4, v5. Also, Falcons have an optional crypto coprocessor. clang needs to be given the target explicitly on invocation (otherwise it will compile for your host machine):

clang --target=falcon3 -c code.c

The supported targets are:

The target needs to exactly match the MCU version the code will be run on.

All freestanding C features are supported, but some should be avoided. C++ support is present, but limited.

Some C features are implemented via library calls - using such a feature in a program will incur a one-time space cost, as the necessary library function will need to be linked into the final image. Such features will also be slow at run time.

Basic C Features

Types have the following sizes:

Floating point types are not natively supported by hardware. Any attempt to use them will lower to calls to a soft-float library, and thus they should be avoided.
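Where fractional values are needed, fixed-point arithmetic is a common way to stay clear of the soft-float library. The following is only a sketch of that idea, not something provided by this backend: the q8_8 type and the helper names are made up for illustration. A Q8.8 value lives in a 16-bit integer, so the product of two of them is a signed 16×16 -> 32 multiplication. Overflow of the result is not handled.

```c
#include <stdint.h>

/* Hypothetical Q8.8 fixed-point type: 8 integer bits, 8 fraction bits. */
typedef int16_t q8_8;

#define Q8_8_ONE 256 /* the value 1.0 in Q8.8 */

/* Convert a whole number to Q8.8 (caller keeps it within -128..127). */
static inline q8_8 q8_8_from_int(int v) {
    return (q8_8)(v * Q8_8_ONE);
}

/* Multiply two Q8.8 values. The 16x16 product fits in 32 bits;
 * dividing by 256 drops the extra 8 fraction bits. */
static inline q8_8 q8_8_mul(q8_8 a, q8_8 b) {
    int32_t wide = (int32_t)a * (int32_t)b;
    return (q8_8)(wide / Q8_8_ONE);
}
```

For example, 1.5 is stored as 384 and 2.0 as 512, and q8_8_mul(384, 512) yields 768, which is 3.0.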

The Falcon registers are 32-bit, and can be used as 8-bit and 16-bit as well - thus 8/16/32-bit integer types are preferred. 64-bit and 128-bit integer types are supported, but the compiler will implement them by lowering into 32-bit chunks.
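To give a feeling for what lowering into 32-bit chunks means, here is a rough sketch of a 64-bit addition written with 32-bit operations only. The struct and function names are hypothetical, and the sequence the compiler actually emits may differ; the point is only that each wide operation becomes a few narrow ones.

```c
#include <stdint.h>

/* Hypothetical pair of 32-bit halves standing in for one 64-bit value. */
struct u64_parts {
    uint32_t lo;
    uint32_t hi;
};

/* Add two 64-bit values using only 32-bit operations. */
struct u64_parts add64(struct u64_parts a, struct u64_parts b) {
    struct u64_parts r;
    r.lo = a.lo + b.lo;
    uint32_t carry = r.lo < a.lo; /* 1 if the low half wrapped around */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```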

The following operations on 8/16/32-bit types are natively supported by Falcon and are fast - use them with no regrets:

The corresponding 64-bit and 128-bit operations are reasonably efficient as well, since they lower to short sequences of 32-bit operations.

Multiplication

Falcon supports two kinds of multiplication natively:

  • unsigned 16×16 -> 32
  • signed 16×16 -> 32

If possible, aim for multiplications that can be implemented by one of the above instructions, e.g.:

short sa, sb, sr;
unsigned short usa, usb;
signed char ca, cb, cr;
unsigned char uca, ucb;
int a, b, r;

// signed 16×16
r = sa * sb;
r = sa * 1234;
r = sa * -1234;
r = sa * cb;
r = sa * ucb; // unsigned char losslessly converts to signed short
r = ca * cb;
r = ca * ucb;
// unsigned 16×16
r = usa * usb;
r = usa * 1234;
r = usa * ucb;
r = uca * ucb;
// any multiplication with truncation
sr = a * b;
cr = a * b;

// the following will NOT be covered by one instruction:
r = sa * usa; // one input signed, other unsigned
r = sa * 0x8000; // const too large for signed
r = usa * -3; // const negative
r = usa * 0x12345; // const too large

Type casts to short or unsigned short can be useful here.

If a 32-bit multiplication cannot be implemented by a single instruction, the compiler will cut it into multiple 16×16 -> 32 pieces. If one of the operands fits in signed or unsigned 16 bits, this is still useful information to the compiler (it will be able to omit some multiplications) - so use a type cast in that case as well, if possible.
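The splitting described above can be pictured roughly as follows. This is a sketch of the arithmetic, not the compiler's actual output; mul_u16 stands in for the native unsigned 16×16 -> 32 instruction, and all function names are made up for illustration.

```c
#include <stdint.h>

/* Stand-in for the native unsigned 16x16 -> 32 multiplication. */
static uint32_t mul_u16(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Full 32x32 -> 32 multiplication: three 16x16 pieces.
 * The a_hi * b_hi piece only affects bits 32 and up, so it is dropped. */
uint32_t mul32_via_16(uint32_t a, uint32_t b) {
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)b, b_hi = (uint16_t)(b >> 16);
    uint32_t low = mul_u16(a_lo, b_lo);
    uint32_t cross = mul_u16(a_lo, b_hi) + mul_u16(a_hi, b_lo);
    return low + (cross << 16);
}

/* If one operand is known to fit in 16 bits, its high half is zero
 * and one of the three pieces disappears. */
uint32_t mul32_by_u16(uint32_t a, uint16_t b) {
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    return mul_u16(a_lo, b) + (mul_u16(a_hi, b) << 16);
}
```

In real code there is no need to write this by hand: a cast such as a * (unsigned short)b gives the compiler the same information.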

64-bit multiplications will be converted to library calls.

Division

Starting with v3, Falcon natively supports 32-bit unsigned division and modulus. Thus, when doing a divide, try to make the operands unsigned.

The compiler can convert divisions by a constant to multiplications or shifts. It's also a good idea to use unsigned operands in this case - the emitted sequences will be shorter.
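To see why unsigned operands give shorter sequences, compare division by 8 written out by hand for both cases. This is only an illustration of the arithmetic with hypothetical function names; it assumes that >> on a negative int32_t is an arithmetic shift, which is implementation-defined in C but holds on clang.

```c
#include <stdint.h>

/* Unsigned division by 8: a single logical shift. */
uint32_t udiv8(uint32_t x) {
    return x >> 3;
}

/* Signed division by 8, rounding toward zero as C requires.
 * Negative inputs need a bias of 7 added before the shift.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
int32_t sdiv8(int32_t x) {
    int32_t bias = (x >> 31) & 7; /* 7 if x is negative, else 0 */
    return (x + bias) >> 3;
}
```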

A library call will be inserted for all unsupported divisions, i.e.:

  • any division on v0
  • operands larger than 32 bits
  • signed operands

Byte swap

Byte swapping 16-bit numbers can be done in a single instruction, 32-bit numbers in three instructions. This is wired to the usual bswap builtins.
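A minimal example, assuming the "usual bswap builtins" are the standard clang/GCC __builtin_bswap16 and __builtin_bswap32 (the wrapper names are made up for illustration):

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t x) {
    return __builtin_bswap16(x);
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t x) {
    return __builtin_bswap32(x);
}
```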

Population count, bit scans

clang supports builtins for popcnt and bit scans. They will work just fine on Falcon, but will call library functions - there is no native support here.
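For reference, the builtins in question are the standard clang/GCC ones. The wrappers below are hypothetical helpers; they also guard the zero input, for which __builtin_ctz and __builtin_clz are undefined.

```c
#include <stdint.h>

/* Number of set bits. */
unsigned count_set_bits(uint32_t x) {
    return (unsigned)__builtin_popcount(x);
}

/* Index of the lowest set bit, or 32 if x is zero. */
unsigned lowest_set_bit(uint32_t x) {
    return x ? (unsigned)__builtin_ctz(x) : 32u;
}

/* Number of leading zero bits, or 32 if x is zero. */
unsigned leading_zeros(uint32_t x) {
    return x ? (unsigned)__builtin_clz(x) : 32u;
}
```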

Memory

Falcon has two memory spaces: code and data. They have independent addressing - this means a global variable may happen to have the same address as a function. Also, it's not possible to read data from the code space in Falcon code.

Unaligned memory accesses are not supported. Attempts to load/store unaligned data will silently corrupt the data.
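When data may sit at an unaligned address (for example a field inside a packed buffer), one safe approach is to assemble the value from individual bytes instead of casting the pointer. A sketch with hypothetical helper names follows; the little-endian byte order is a choice made for this example, so pick whatever order your data format uses.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any address, one byte at a time. */
uint32_t load_u32_le(const uint8_t *p) {
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Write a 32-bit value to any address in little-endian byte order. */
void store_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}
```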

Falcon-specific features

Assembly syntax

The following operand types are accepted:

  • %<register>: a register operand, which can be one of:
    • %r0 - %r15: the 32-bit GPRs
    • %r0h - %r15h: the low 16-bit words of corresponding GPRs
    • %r0b - %r15b: the low bytes of corresponding GPRs
    • %iv0, %iv1, %tv: interrupt and trap vectors
    • %sp - the stack pointer
    • %pc - the program counter (read only)
    • %xcbase, %xdbase - DMA transfer code and data base
    • %flags - the flags register (individual bits can also be named, see below)
    • %cx - the crypto transfer mode
    • %cauth - secure mode entry control
    • %xports - DMA transfer port selection
    • %tstat - trap status (v3+ only)
    • %p0 - %p7: the 1-bit general-purpose predicate bits in %flags
    • %ccc, %cco, %ccs, %ccz - the condition code bits in %flags
    • %ie0, %ie1, %ie2 - the interrupt enables in %flags (%ie2 only valid on v4+)
    • %sie0, %sie1, %sie2 - the saved interrupt enables in %flags (%sie2 only valid on v4+)
    • %ta - trap active bit in %flags
  • immediates: as usual, can be specified as expressions involving symbols. The expression may be used as-is, or transformed using one of the modifiers:
    • <expression>: expression is used as the immediate as-is. If it doesn't fit in the range of immediates accepted by the instruction, an error occurs. If the instruction has two forms with different immediate ranges allowed, and the expression is an assembly-time constant, the smallest one that can contain the value is used. If it's not an assembly-time constant, the larger one is used.
    • %u8(<expression>): like above, but the immediate form with 8-bit unsigned immediate is used. If the instruction has no such form, it's an error.
    • %u16(<expression>): like above, but the immediate form with 16-bit unsigned immediate is used.
    • %u24(<expression>): like above, but the immediate form with 24-bit unsigned immediate is used.
    • %u32(<expression>): like above, but the immediate form with 32-bit unsigned immediate is used.
    • %s8(<expression>): like above, but the immediate form with 8-bit signed immediate is used.
    • %s16(<expression>): like above, but the immediate form with 16-bit signed immediate is used.
    • %lo16(<expression>): takes the low 16 bits of expression, and treats it as a signed 16-bit number. Can be used together with %hi16 or %hi8 to assemble a 32-bit or 24-bit number, respectively.
    • %hi16(<expression>): takes the high 16 bits of expression, and treats it as an unsigned 16-bit number. Can be used together with %lo16 to assemble a 32-bit number.
    • %hi8(<expression>): takes the high 16 bits of expression, and treats it as an unsigned 8-bit number. If it doesn't fit, it's an error. Can be used together with %lo16 to assemble a 24-bit number.
  • (%<register>) - memory or IO space, addressed by a register.
  • <expression>(%<register>) - memory or IO space, addressed by a register with immediate offset.
  • %<register>(%<register>) - memory or IO space, addressed by a register with register indexing. The first register (outside the parentheses) is the index, and is multiplied by the access size. The second register is the base.

Global registers

Some Falcon registers can be accessed directly as global variables. You need to declare them yourself, like this:

// In global scope
// Interrupt enables and saved enables
register _Bool ie0 asm("ie0");
register _Bool ie1 asm("ie1");
register _Bool sie0 asm("sie0");
register _Bool sie1 asm("sie1");
// Trap active
register _Bool ta asm("ta");
// Interrupt and trap vectors
register void *iv0 asm("iv0");
register void *iv1 asm("iv1");
register void *tv asm("tv");
// Trap status
register uint32_t tstat asm("tstat");
// Crypto stuff
register uint32_t cx asm("cx");
register uint32_t cauth asm("cauth");

The data transfer related registers are not exposed this way - instead, they're written automatically when data transfer instructions are executed.

It is also possible to reserve up to 8 GPRs and 6 predicate registers for use as fast global variables. To do that, pass the number of reserved registers to clang like this:

clang -mglobal-gprs=3 -mglobal-preds=2

The reserved GPRs will be allocated from the following, in order: r8, r7, r6, r5, r4, r3, r2, r1. The reserved predicates will be allocated from the following, in order: p7, p6, p5, p4, p3, p2. They can be accessed as follows:

// In global scope
register _Bool var1 asm("p7");
register int var2 asm("r8");
// 8/16-bit subregisters may be used as well.
register short var3 asm("rh7");
register char var4 asm("rb6");

Warning

Reserving global registers changes the ABI. Code compiled with different reserved register settings cannot be linked together.

Calling conventions

For functions other than interrupt handlers, the following register assignments are used:

  • r0: call-saved; in functions involving VLAs or alloca, used as a frame pointer (and thus not available for inline assembly).
  • r1 - r8: call-saved, unless reserved for global variables.
  • r9: call-clobbered
  • r10 - r15: call-clobbered, used to pass arguments and return values.
  • p0 - p1: call-clobbered, used to pass arguments and return values.
  • p2 - p3: call-clobbered, used to pass arguments, unless reserved for global variables.
  • p4 - p7: call-saved, unless reserved for global variables.

The stack pointer maintains 4-byte alignment, and no more. On entry to a function, the stack pointer points to the return address, potentially followed by arguments that didn't fit in registers. Functions pop the return address off the stack when returning.

Arguments are handled as follows:

  • for _Bool arguments, the first free register from p0 - p3 range is used. If no such registers are free (all have been used for previous arguments or global variables), zero extend to 32 bits, and treat as an int.
  • align argument size up to 4 bytes. For each 4-byte chunk in turn, assign the first free register in r10 - r15 range. After all registers are used, dump all remaining chunks consecutively on stack, starting right after the return address.
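The two rules above can be modelled with a small host-side program. This is just an illustration of the assignment order, not code taken from the backend: all type and function names are hypothetical, sizes are assumed to be non-zero, reserved_preds is assumed to be at most 6, and only the location of each argument's first 4-byte chunk is recorded.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one argument, for illustration only. */
struct arg_desc {
    bool is_bool; /* argument has type _Bool */
    size_t size;  /* size in bytes (ignored for _Bool) */
};

enum arg_kind { ARG_IN_PRED, ARG_IN_GPR, ARG_ON_STACK };

/* Location of the first 4-byte chunk of an argument. */
struct arg_loc {
    enum arg_kind kind;
    unsigned index; /* pN, rN, or byte offset past the return address */
};

void assign_args(const struct arg_desc *args, size_t n,
                 unsigned reserved_preds, struct arg_loc *out) {
    unsigned pred_limit = 4; /* p0 - p3 carry _Bool arguments */
    if (reserved_preds > 4)
        pred_limit -= reserved_preds - 4; /* p3, then p2, are reserved */
    unsigned next_pred = 0;
    unsigned next_gpr = 10; /* r10 - r15 carry everything else */
    unsigned stack_off = 0;

    for (size_t i = 0; i < n; i++) {
        if (args[i].is_bool && next_pred < pred_limit) {
            out[i].kind = ARG_IN_PRED;
            out[i].index = next_pred++;
            continue;
        }
        /* A _Bool without a free predicate is treated as a 32-bit int. */
        size_t size = args[i].is_bool ? 4 : args[i].size;
        size_t chunks = (size + 3) / 4; /* round size up to 4 bytes */
        for (size_t c = 0; c < chunks; c++) {
            struct arg_loc loc;
            if (next_gpr <= 15) {
                loc.kind = ARG_IN_GPR;
                loc.index = next_gpr++;
            } else {
                loc.kind = ARG_ON_STACK;
                loc.index = stack_off;
                stack_off += 4;
            }
            if (c == 0)
                out[i] = loc;
        }
    }
}
```

Note that in this model a large argument can end up split between the last registers and the stack, which is what "for each 4-byte chunk in turn" implies.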

Return values are handled as follows:

  • _Bool is returned in p0.
  • <=32-bit scalars are returned in r10.
  • 64-bit scalars are returned in r10 - r11.
  • 128-bit scalars are returned in r10 - r13.
  • aggregates are returned in memory - caller allocates this memory and passes its address as a hidden first parameter.

va_list is a simple void*, pointing to the next 4-byte chunk to read. If a function invokes va_start, it'll do the following:

  • if r10-r15 are covered by fixed arguments, va_start simply returns the address of the first vararg on stack
  • otherwise:
    • pop the return address to a register
    • push r15-rX onto stack, where rX is the first register containing varargs
    • va_start will return the stack pointer as of after the push
    • on return, drop the pushed arguments, and branch directly to the return address in a register instead of using ret

Note that _Bool cannot be passed as a vararg (the usual C promotion rules make it promote to int).
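From the C side none of this is visible: variadic functions are written with the standard <stdarg.h> macros, and the compiler takes care of the layout. A minimal example (sum_ints is a made-up name) that also shows a _Bool being read back as an int:

```c
#include <stdarg.h>

/* Sum `count` int arguments passed as varargs. */
int sum_ints(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int); /* a promoted _Bool is read as int too */
    va_end(ap);
    return total;
}
```

A call such as sum_ints(2, (_Bool)1, 41) passes the _Bool as an int, per the usual promotion rules.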

Interrupt handlers

Adding an interrupt attribute to a function makes it suitable for use as an interrupt handler (i.e. it will save all registers it uses and use iret to return). Such functions should have no parameters and a void return type, and shouldn't be called directly by other functions. Their addresses are suitable for writing into the iv0, iv1 and tv registers:

void handle_interrupt(void) __attribute__((__interrupt__)) {
  // ...
}

void init_interrupts(void) {
  iv1 = handle_interrupt;
}

XXX: maybe expose tstat register in a nicer way here? Add a way to inspect register state for traps?

Noreturn

A function marked as _Noreturn will use an optimized prologue setup (or rather, lack of it) - call-saved registers will not be saved. Use it on a main function for good effect.

Halt instruction

The Falcon halt instruction can be executed as follows:

_Noreturn void halt_processor() {
  __falcon_hlt();
}

This instruction will halt the MCU (requiring a reboot of the MCU) and trigger a host interrupt (see MCU documentation for more details).

Wait instruction

The Falcon wait instruction can be executed as follows:

_Noreturn void wait_forever() {
  __falcon_wait(1);
}

register _Bool should_sleep asm("p7");
// Not _Noreturn: returns once an interrupt handler clears should_sleep.
void wait_for_interrupt(void) {
  __falcon_wait(should_sleep);
}

void handle_interrupt(void) __attribute__((__interrupt__)) {
  // ...
  should_sleep = 0;
}

The __falcon_wait function should only be called with constant 1 (to wait forever) or a global register variable of type _Bool - otherwise, there's no way for an interrupt handler to change the flag it's waiting for.

IO space access

The IO space can be accessed by builtins as well:

// Write 1 to reg 0x1234
__falcon_iowr(0x1234, 1);
// With barrier.
__falcon_iowrb(0x1234, 1);
int val = __falcon_iord(0x1234);
// With barrier.
int val2 = __falcon_iordb(0x1234);

Data transfers

The builtins take care of setting xdbase, xcbase, xports registers themselves:

__falcon_xdwait();
__falcon_xdbar();
__falcon_xcwait();
uint32_t ext_offset;
uint32_t ext_base;
uint8_t ext_port;
void *ptr;
uint8_t size_lg2;
__falcon_xdst(ext_port, ext_base, ext_offset, ptr, size_lg2);
__falcon_xdld(ext_port, ext_base, ext_offset, ptr, size_lg2);

// For falcon0
__falcon_xcld(ext_port, ext_base, ext_offset, ptr);

// For falcon3+
void *cptr;
uint16_t phys_addr;
__falcon_xcld(ext_port, ext_base, cptr, phys_addr);

TLB functions

The v3+ TLB instructions are likewise exposed as builtins:

void *cptr;
__falcon_itlb(cptr);
uint32_t r1 = __falcon_ptlb(cptr);
uint32_t r2 = __falcon_vtlb(cptr);

Traps

To cause an illegal instruction trap, use the usual clang builtin:

__builtin_trap();

The v3+ trap instructions are not exposed on the C level, since it's not clear what the calling convention for traps should be. If you manage to find a use for them, use inline assembly, e.g.:

register uint32_t parameter asm("r12");
asm volatile ("trap2" : : "r"(parameter));

C++

Most C++ features are supported - this includes classes, virtual functions, overloading, templates, inline functions. Use them to your heart's delight (but be careful with templates and inlines, space is expensive).

RTTI and exception handling are not supported, due to space constraints. Don't use them.

Static and global variables with constructors are not supported, since they require library support - don't use them.

[XXX: at least global variables with constructors would be nice]