Defectus

DefAssembler ⭐

DefAssembler is an assembler (an Assembly compiler) for the x86-64 instruction set, built to run in real time in the browser. For this purpose it is written in JavaScript, and it’s designed to integrate incremental changes to the source code efficiently.

I started working on DefAssembler in early 2021. The original motivation, and the main reason it’s used today, is to aid code-golfing in Assembly. More specifically, it was written to allow code.golf to add Assembly as a language. It took around 3 months (between February and May) to develop the assembler to the point where it could be run on the site, and I’ve been maintaining the project since then.

Here’s a demo of the assembler, integrated into the CodeMirror editor (full page here):

SYS_WRITE = 1
SYS_EXIT = 60
STDOUT_FILENO = 1

# Printing
.data
buffer: .string "Hello, world!\n"
bufferLen = . - buffer
.text
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    mov $buffer, %esi
    mov $bufferLen, %edx
    syscall

# Looping
.data
digit: .byte '0', '\n'
.text
    mov $10, %bl
numberLoop:
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    mov $digit, %esi
    mov $2, %edx
    syscall
    incb (%rsi)
    dec %bl
    jnz numberLoop

# Accessing arguments
    pop %rbx
    pop %rax
argLoop:
    dec %ebx
    jz endArgLoop
    pop %rsi
    mov %rsi, %rdi
    mov $-1, %ecx
    xor %al, %al
    repnz scasb
    not %ecx
    movb $'\n', -1(%rsi, %rcx)
    mov %ecx, %edx
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    syscall
    jmp argLoop
endArgLoop:
    mov $SYS_EXIT, %eax
    mov $0, %edi
    syscall

The binary output of each assembled line is displayed on its right; notice that if you change any line, its output (and sometimes the outputs of other lines) will change too.

An important aspect of the assembler is how it incorporates incremental changes. For example, when you change a line in the source code, only that line will be parsed and compiled, rather than the entire document. Other instructions will only be changed to adjust instruction offsets (internally this process is called “recompilation,” although it doesn’t include re-parsing the source code of the instructions).
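To make the idea concrete, here is a minimal sketch of this kind of incremental update. It is not DefAssembler's actual code: `Program`, `encodeLine`, `updateAddresses` and `replaceLine` are hypothetical names, and the "encoder" is a stand-in that emits one byte per non-space character so the example stays self-contained.

```javascript
// Hypothetical stand-in for a real encoder: it counts how often it is called
// and pretends each non-space character assembles to one byte.
let parseCount = 0;
function encodeLine(source) {
    parseCount++;
    return Uint8Array.from(source.replace(/\s/g, ""), ch => ch.charCodeAt(0));
}

class Program {
    constructor(lines) {
        // Initial assembly: every line is parsed exactly once.
        this.lines = lines.map(source => ({ source, bytes: encodeLine(source), address: 0 }));
        this.updateAddresses(0);
    }

    // Recompute addresses from `start` onward without re-parsing anything.
    updateAddresses(start) {
        let address = 0;
        if (start > 0) {
            const previous = this.lines[start - 1];
            address = previous.address + previous.bytes.length;
        }
        for (let i = start; i < this.lines.length; i++) {
            this.lines[i].address = address;
            address += this.lines[i].bytes.length;
        }
    }

    // Edit one line: only that line goes through the parser again.
    replaceLine(index, source) {
        const line = this.lines[index];
        const oldLength = line.bytes.length;
        line.source = source;
        line.bytes = encodeLine(source);
        // Later lines only need to move if the edited line changed size.
        if (line.bytes.length !== oldLength)
            this.updateAddresses(index + 1);
    }
}

const program = new Program(["mov $1, %eax", "syscall", "ret"]);
const parsesAfterLoad = parseCount;
program.replaceLine(0, "mov $60, %eax");
console.log(parseCount - parsesAfterLoad);             // lines re-parsed by the edit
console.log(program.lines.map(line => line.address));  // shifted addresses
```

A real assembler has more to do in the second step (a jump may need a different encoding once its target moves, for instance), but the split is the same: parse the edited line, then patch offsets elsewhere.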

To see this in action, you can press F3 in the CodeMirror editor to enable debug mode. In this mode each instruction is highlighted, and when a change is made, the portion of the source code that is re-parsed is highlighted. The rest of the code isn’t read for this change.

DefAssembler, similar to the GNU Assembler (gas), supports both AT&T and Intel syntaxes (syntices?). AT&T syntax is the default, but both are supported simultaneously, and you can switch back and forth at will (even in the middle of the code!) using the .att_syntax and .intel_syntax directives, as seen here:

.intel_syntax
add eax, 20
push rbx
mov [325], al
xor esi, [rbx + rax * 4]

.att_syntax
add $20, %eax
push %rbx
mov %al, 325
xor (%rbx, %rax, 4), %esi
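One way to picture mid-file switching is as a piece of parser state that the directives flip. The sketch below is only an illustration under that assumption. `tagSyntax` is a hypothetical helper, not part of DefAssembler, and it ignores any arguments the directives might take.

```javascript
// Hypothetical helper: tag each source line with the syntax in effect.
function tagSyntax(source) {
    let syntax = "att"; // AT&T is the default
    const tagged = [];
    for (const rawLine of source.split("\n")) {
        const line = rawLine.trim();
        if (line === "") continue;
        // A directive switches the mode for every line that follows it.
        if (line.startsWith(".intel_syntax")) { syntax = "intel"; continue; }
        if (line.startsWith(".att_syntax")) { syntax = "att"; continue; }
        tagged.push({ line, syntax });
    }
    return tagged;
}

const tagged = tagSyntax(`
add $20, %eax
.intel_syntax
add eax, 20
.att_syntax
push %rbx
`);
console.log(tagged);
```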

Instruction listings

Another aspect of the assembler I’m pretty proud of is the compressed instruction set. One of the original solutions for DefAssembler’s purpose was ass-js, another x86-64 assembler for JavaScript. However, one thing I didn’t like about it was that it kept a complete JSON file for each mnemonic (instruction name), which I found a bit wasteful.

For DefAssembler, I came up with a much more concise and tightly-packed format for instructions, which looks something like this (this is a modified excerpt from the core/mnemonicList.js file in the DefAssembler source code):

add
04 i R_0bw
83.0 Ib rwlq
05 iL R_0l
80.0 i rbwl
05 iL R_0q
81.0 IL rq
00 Rbwlq r
02 r Rbwlq

mov
88 Rbwlq r
8A r Rbwlq
C7.0 Il Rq
C7.0 iL mq
B0+8.o i Rbwlq
C6.0 i rbwl

jmp
EB-2 jbl
FF.4 rQ
FF.5 mf

inc:FE.0 rbwlq
dec:FE.1 rbwlq

Each instruction listing consists of a line or list of lines, with each line encoding a different variation of the instruction. Each variation is used under different circumstances, typically depending on the list of operands supplied to the instruction.

For example, the add instruction has 8 variations:

And so on! Note that technically an instruction like add $40, %ax fits both the first and second variations. However, the variations are checked for each instruction in top-to-bottom order, so the first variation will “catch” the instruction before the second.
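The top-to-bottom rule amounts to a first-match search. Here is a small sketch of that behavior. The `variations` table and `pickVariation` are hypothetical, and the predicates are made-up conditions chosen so that both match add $40, %ax; they are not decoded from the listing format above.

```javascript
// Hypothetical variation table, in priority order.
// The predicates are illustrative only.
const variations = [
    { name: "first", matches: op => op.register === "al" || op.register === "ax" },
    { name: "second", matches: op => op.immediate >= -128 && op.immediate <= 127 },
];

function pickVariation(list, operands) {
    // Array.prototype.find returns the first element that passes the test,
    // so earlier entries take priority over later ones.
    return list.find(variation => variation.matches(operands));
}

// Both predicates accept these operands, but the first one wins.
console.log(pickVariation(variations, { immediate: 40, register: "ax" }).name);
// Only the second predicate accepts these operands.
console.log(pickVariation(variations, { immediate: 40, register: "bx" }).name);
```

Ordering the table this way means the most specific (and usually shortest) encodings can simply be listed first, with no separate ranking step.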

To save time and memory, each instruction listing is only decoded when the instruction’s name is parsed for the first time by the assembler; until then it’s simply stored as a string.
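This decode-on-first-use pattern can be sketched as a small cache. The names here (`rawListings`, `decodeListing`, `getVariations`) are hypothetical, and the "decoder" only splits each line into fields rather than interpreting the real format.

```javascript
// Raw listings stay as plain strings until a mnemonic is first requested.
const rawListings = new Map([
    ["inc", "FE.0 rbwlq"],
    ["dec", "FE.1 rbwlq"],
]);
const decoded = new Map();
let decodeCount = 0;

// Hypothetical decoder: splits each line into whitespace-separated fields.
function decodeListing(raw) {
    decodeCount++;
    return raw.split("\n").map(line => line.trim().split(/\s+/));
}

function getVariations(mnemonic) {
    if (!decoded.has(mnemonic)) {
        const raw = rawListings.get(mnemonic);
        if (raw === undefined) return undefined; // unknown mnemonic
        decoded.set(mnemonic, decodeListing(raw));
    }
    return decoded.get(mnemonic);
}

getVariations("inc");
getVariations("inc"); // served from the cache, no second decode
console.log(decodeCount);
console.log(decoded.has("dec")); // "dec" was never used, so it is still a string
```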