DefAssembler ⭐

DefAssembler is an assembler (an Assembly compiler) for the x86-64 instruction set that is made to run in real time in the browser. For this purpose it is written in JavaScript, and it’s designed to integrate incremental changes to the source code efficiently.

I started working on DefAssembler in early 2021. The original motivation, and the main reason it’s used today, is to aid code-golfing in Assembly. More specifically, it was written to allow code.golf to add Assembly as a language. It took around 3 months (between February and May) to develop the assembler to the point where it could be run on the site, and I’ve been maintaining the project since then.

Here’s a demo of the assembler, integrated into the CodeMirror editor (full page here):


SYS_WRITE = 1
SYS_EXIT = 60
STDOUT_FILENO = 1

# Printing
.data
buffer: .string "Hello, world!\n"
bufferLen = . - buffer

.text
mov $SYS_WRITE, %eax
mov $STDOUT_FILENO, %edi
mov $buffer, %esi
mov $bufferLen, %edx
syscall

# Looping
.data
digit: .byte '0', '\n'

.text
mov $10, %bl
numberLoop:
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    mov $digit, %esi
    mov $2, %edx
    syscall

    incb (%rsi)
    dec %bl
    jnz numberLoop

# Accessing arguments
pop %rbx
pop %rax

argLoop:
    dec %ebx
    jz endArgLoop

    pop %rsi
    mov %rsi, %rdi

    mov $-1, %ecx
    xor %al, %al
    repnz scasb

    not %ecx
    movb $'\n', -1(%rsi, %rcx)

    mov %ecx, %edx
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    syscall

    jmp argLoop
endArgLoop:

mov $SYS_EXIT, %eax
mov $0, %edi
syscall

The binary output of each assembled line is displayed on its right; notice that if you change any line, its output (and sometimes the outputs of other lines) will change too.

An important aspect of the assembler is how it incorporates incremental changes. For example, when you change a line in the source code, only that line will be parsed and compiled, rather than the entire document. Other instructions will only be changed to adjust instruction offsets (internally this process is called “recompilation,” although it doesn’t include re-parsing the source code of the instructions).

To see this in action, you can press F3 in the CodeMirror editor to enable debug mode. In this mode each instruction is highlighted, and when a change is made, the portion of the source code that is re-parsed is highlighted. The rest of the code isn’t read for this change.
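
As a rough illustration of the idea, here is a minimal JavaScript sketch. It is not DefAssembler’s actual code: the names encodeLine, assemble, fixAddresses and editLine are hypothetical, and the toy encodeLine just emits one byte per non-space character instead of real machine code. It also only shifts addresses after an edit, whereas the real “recompilation” step may need to re-encode instructions whose offsets changed.

```javascript
// Hypothetical model: each line keeps its source text, its encoded bytes and its address.
// parseCount only exists to show how often the "parser" runs.
let parseCount = 0;

// Stand-in for the real parser/encoder (an assumption, not DefAssembler's API).
function encodeLine(source) {
    parseCount++;
    // Toy "encoding": one byte per non-whitespace character.
    return Array.from(source.replace(/\s+/g, ""), c => c.charCodeAt(0) & 0xff);
}

// Recompute addresses from startIndex onwards without re-parsing anything.
function fixAddresses(lines, startIndex) {
    let address = 0;
    if (startIndex > 0) {
        const previous = lines[startIndex - 1];
        address = previous.address + previous.bytes.length;
    }
    for (let i = startIndex; i < lines.length; i++) {
        lines[i].address = address;
        address += lines[i].bytes.length;
    }
}

// Full assembly: every line is parsed once.
function assemble(sources) {
    const lines = sources.map(source => ({ source, bytes: encodeLine(source), address: 0 }));
    fixAddresses(lines, 0);
    return lines;
}

// Incremental edit: only the changed line is re-parsed;
// the lines after it merely receive new addresses.
function editLine(lines, index, newSource) {
    lines[index].source = newSource;
    lines[index].bytes = encodeLine(newSource);
    fixAddresses(lines, index + 1);
}
```

In this sketch, editing one line of a three-line program calls encodeLine exactly once more, no matter how many lines follow the edit.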

DefAssembler, like the GNU Assembler (gas), supports both AT&T and Intel syntaxes (syntices?). AT&T syntax is the default, but both syntaxes are supported simultaneously, and you can switch back and forth at will (even in the middle of the code!) using the .att_syntax and .intel_syntax directives, as seen here:


.intel_syntax
add eax, 20
push rbx
mov [325], al
xor esi, [rbx + rax * 4]

.att_syntax
add $20, %eax
push %rbx
mov %al, 325
xor (%rbx, %rax, 4), %esi
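
One simple way to picture the switching is as a piece of parser state that each directive updates. The JavaScript sketch below is only an assumption about how this could work, not DefAssembler’s implementation; tagSyntax is a hypothetical helper that labels every non-directive line with the syntax in effect when it is reached.

```javascript
// Hypothetical helper: walk the source top to bottom and record which
// syntax is active for each instruction line. AT&T is the default.
function tagSyntax(sourceLines, initialSyntax = "att") {
    let syntax = initialSyntax;
    const tagged = [];
    for (const rawLine of sourceLines) {
        const text = rawLine.trim();
        const firstWord = text.split(/\s+/)[0];
        if (firstWord === ".intel_syntax") {
            syntax = "intel";
        } else if (firstWord === ".att_syntax") {
            syntax = "att";
        } else if (text !== "") {
            tagged.push({ text, syntax });
        }
    }
    return tagged;
}
```

For the example above, the first four instructions would be tagged "intel" and the last four "att".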

Instruction listings

Another aspect of the assembler I’m pretty proud of is the compressed instruction set. One of the original solutions for DefAssembler’s purpose was ass-js, another x86-64 assembler for JavaScript. However, one thing I didn’t like about it was that it kept a complete JSON file for each mnemonic (instruction name), which I found a bit wasteful.

For DefAssembler, I came up with a much more concise and tightly packed format for instructions, which looks something like this (this is a modified excerpt from the core/mnemonicList.js file in the DefAssembler source code):

add
04 i R_0bw
83.0 Ib rwlq
05 iL R_0l
80.0 i rbwl
05 iL R_0q
81.0 IL rq
00 Rbwlq r
02 r Rbwlq

mov
88 Rbwlq r
8A r Rbwlq
C7.0 Il Rq
C7.0 iL mq
B0+8.o i Rbwlq
C6.0 i rbwl

jmp
EB-2 jbl
FF.4 rQ
FF.5 mf

inc:FE.0 rbwlq
dec:FE.1 rbwlq

Each instruction listing consists of a line or list of lines, with each line encoding a different variation of the instruction. Each variation is used under different circumstances, typically depending on the list of operands supplied to the instruction.

For example, the add instruction has 8 variations:

The first one, 04 i R_0bw, is chosen if the instruction is supplied an immediate operand (i) followed by either an al or ax register (R_0bw: R means register type, _0 means it must be a register of id 0 (al/ax/eax/rax), and bw means it can either be byte- or word-sized). When this variation is chosen, the opcode used for the encoding of this instruction is 04. For example:

add $32, %al

The second one, 83.0 Ib rwlq, is chosen if the first operand is a byte-sized immediate (Ib; the capital I means the immediate is treated as signed) and the second operand is a register or memory operand (r) of size word, long or quad (wlq). When this variation is chosen, the encoded opcode will be 83, with an extension field of 0. For example:

add $60, %ebx

And so on! Note that technically an instruction like add $40, %ax fits both the first and second variations. However, the variations are checked for each instruction in top-to-bottom order, so the first variation will “catch” the instruction before the second.
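
To make the matching order concrete, here is a small JavaScript sketch that decodes just the two variations described above and picks the first one that fits. It is a simplification under stated assumptions, not the decoder from core/mnemonicList.js: parseVariation, parseOperandSpec, operandFits and pickVariation are hypothetical names, only the i, I, r and R operand kinds are handled, a spec with no size letters accepts any size, and operands are assumed to arrive as plain objects such as { type: "reg", id: 0, size: "w" }.

```javascript
// Decode one operand spec such as "R_0bw", "Ib", "rwlq" or "i".
function parseOperandSpec(text) {
    const match = /^([iIrR])(?:_(\d+))?([bwlq]*)$/.exec(text);
    if (!match) throw new Error("Unsupported operand spec: " + text);
    const [, kind, regId, sizes] = match;
    return {
        kind,                                           // i/I: immediate, R: register, r: register or memory
        regId: regId === undefined ? null : Number(regId),
        sizes                                           // "" means "any size" in this sketch
    };
}

// Decode one variation line such as "83.0 Ib rwlq".
function parseVariation(line) {
    const [opcodeText, ...operandTexts] = line.trim().split(/\s+/);
    const [opcodeHex, extensionText] = opcodeText.split(".");
    return {
        opcode: parseInt(opcodeHex, 16),
        extension: extensionText === undefined ? null : Number(extensionText),
        operands: operandTexts.map(parseOperandSpec)
    };
}

// Check a single operand against a single spec.
function operandFits(spec, operand) {
    if (spec.sizes !== "" && !spec.sizes.includes(operand.size)) return false;
    switch (spec.kind) {
        case "i":
        case "I":
            return operand.type === "imm";
        case "R":
            return operand.type === "reg" && (spec.regId === null || spec.regId === operand.id);
        case "r":
            return operand.type === "reg" || operand.type === "mem";
        default:
            return false;
    }
}

// Variations are tried top to bottom; the first fit wins.
function pickVariation(variations, operands) {
    return variations.find(variation =>
        variation.operands.length === operands.length &&
        variation.operands.every((spec, i) => operandFits(spec, operands[i]))
    ) || null;
}
```

With the listing ["04 i R_0bw", "83.0 Ib rwlq"], the operands of add $40, %ax fit both entries, but pickVariation returns the 04 entry because it comes first; add $60, %ebx falls through to the 83.0 entry.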

To save time and memory, each instruction listing is only decoded when the assembler parses that instruction’s name for the first time; until then it’s simply stored as a string.
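
The lazy decoding described here can be sketched in JavaScript as a cache in front of the raw strings. This is an illustration only: rawListings, decodedListings, decodeCount and getVariations are hypothetical names, the storage layout (one newline-separated string per mnemonic) is an assumption, and the “decoding” is reduced to splitting each line into tokens.

```javascript
// Listings start out as plain strings (assumed layout: one variation per line).
const rawListings = new Map([
    ["add", "04 i R_0bw\n83.0 Ib rwlq"],
    ["inc", "FE.0 rbwlq"]
]);

// Decoded listings are filled in on demand.
const decodedListings = new Map();
let decodeCount = 0; // only here to show how often decoding happens

function getVariations(mnemonic) {
    // Already decoded: reuse the cached result.
    if (decodedListings.has(mnemonic)) return decodedListings.get(mnemonic);

    const raw = rawListings.get(mnemonic);
    if (raw === undefined) return null; // unknown mnemonic

    // First use: decode the string once and remember the result.
    decodeCount++;
    const variations = raw.split("\n").map(line => line.trim().split(/\s+/));
    decodedListings.set(mnemonic, variations);
    return variations;
}
```

A mnemonic that never appears in the source (inc, in a program that only uses add) is never decoded and keeps costing only the size of its string.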