Introduction to Assembly

Assembly languages describe the machine code of a computer in simple, human-readable syntax. In these languages, each machine instruction (i.e. a sequence of bytes that is executed by the processor) is represented by a single command. The assembly language on code.golf (hereinafter referred to as "Assembly") represents the machine language of the x86-64 architecture.

Syntaxes

There are two main branches of syntax used for Assembly: Intel syntax and AT&T syntax. The assembler used by code.golf, DefAssembler, supports both of these syntaxes, defaulting to AT&T syntax unless otherwise stated using directives. In this article, when a difference between the two syntaxes emerges, a table that illustrates this difference will appear.

Instructions

Each instruction in Assembly consists of a mnemonic (the instruction's name, e.g. add, sub, mov) and a comma-separated list of operands. Note that in AT&T syntax, the order of the operands in each instruction is reversed.

The number, types, sizes, and purposes of the operands given to an instruction depend on its mnemonic. For example, the instruction push accepts a single operand, the instruction mov accepts 2, and the instruction imul accepts anywhere from 1 to 3 operands. To understand what each instruction does and which operands it accepts, its mnemonic should be looked up in Volume 2 of the Intel 64 Software Developer's Manuals (I recommend using a more straightforward index, like Félix Cloutier's x86 instruction reference).

Operands

An operand is an argument passed to an instruction, for example which values to read and where to write the result. There are 3 common types of operands in Assembly (in reality there are about 10 types, but the ones not listed here are rarely used):

Registers - small units of data used to store temporary values on the processor. They can either be read from or written to, or in some cases both. In x86-64, there are 16 different registers, each consisting of 64 bits. It is also possible to read from and write to each register's lower 32, 16, or 8 bits (in some cases the top 8 bits of the 16 lowest bits can also be accessed).

Click here for a full list of registers

Name	64-bit	32-bit	16-bit	8-bit	High 8-bit	Notes
Accumulator	`rax`	`eax`	`ax`	`al`	`ah`	This register is implicitly multiplied and divided by the `mul` and `div` instructions. It is also used by `lods`, `stos`, `scas`, `xlat` and `cmpxchg`, and for `syscall` it both selects the syscall ID and holds the return value of the syscall.
Base	`rbx`	`ebx`	`bx`	`bl`	`bh`	This register is used implicitly in the `xlat` instruction.
Counter	`rcx`	`ecx`	`cx`	`cl`	`ch`	This register allows for short-form loops with the `loop`/`loope`/`loopne` instruction, as well as short-form zero checks with `jrcxz`/`jecxz`. It is also used by string instructions with a `rep`/`repe`/`repne` prefix as a limit on how many times to execute the instruction. The register is overwritten by `syscall`.
Data	`rdx`	`edx`	`dx`	`dl`	`dh`	This register is often paired up with the accumulator, e.g. rdx:rax, to act as an extension of that register (for example in the `mul` and `div` instructions). It also serves as the third argument in `syscall`.
Source index	`rsi`	`esi`	`si`	`sil`		This register is used by the `movs`, `lods`, and `cmps` string instructions. It also serves as the second argument in `syscall`.
Destination index	`rdi`	`edi`	`di`	`dil`		This register is used by the `movs`, `stos`, `cmps`, and `scas` string instructions. It also serves as the first argument in `syscall`.
Stack pointer	`rsp`	`esp`	`sp`	`spl`		This register holds the address of the stack, a portion of the computer's memory dedicated to storing temporary values (written to with `push` and read from by `pop`). It is the only register that has a non-zero value at the start of the program, and in general, its value should not be manually changed. It is affected by the `push`, `pop`, `enter` and `leave` instructions.
Base pointer	`rbp`	`ebp`	`bp`	`bpl`		This register typically serves as an address holder for the base of a stack frame (a portion of the stack holding the local memory of a function), but it may be used for any other purpose. It is used by the `enter` and `leave` instructions.
	`r8`	`r8d`	`r8w`	`r8b`		This register serves as the fifth argument to `syscall`.
	`r9`	`r9d`	`r9w`	`r9b`		This register serves as the sixth argument to `syscall`.
	`r10`	`r10d`	`r10w`	`r10b`		This register serves as the fourth argument to `syscall`.
	`r11`	`r11d`	`r11w`	`r11b`		This register is overwritten by `syscall`.
	`r12`	`r12d`	`r12w`	`r12b`
	`r13`	`r13d`	`r13w`	`r13b`
	`r14`	`r14d`	`r14w`	`r14b`
	`r15`	`r15d`	`r15w`	`r15b`

AT&T syntax	Intel syntax
Registers are prefixed with a percent sign (`%`)	Registers are written without a prefix
E.g. `%rax`, `%ebx`, `%cx`, `%dl`	E.g. `rax`, `ebx`, `cx`, `dl`

Immediates - constant numbers that are stored as part of the instruction. They provide an immediate value that can be quickly read by the processor without accessing memory. They are written using arithmetic expressions, which can include number literals, string literals and/or symbols (macro-like static values that are stored as names for easy access).

AT&T syntax	Intel syntax
Immediates are prefixed with a dollar sign (`$`). Expressions not prefixed with a dollar sign are treated as memory offsets (see below).	Immediates are written without a prefix. If the immediate contains a symbol name, however, the expression must be prefixed with the word `OFFSET`; otherwise, it is treated as a memory offset (see below).
E.g. `$23`, `$(15 + 6) * 2`, `$symbol`	E.g. `23`, `(15 + 6) * 2`, `OFFSET symbol`

Memory expressions - calculated addresses which instruct the processor to read from and write to certain addresses in the computer's temporary memory (RAM). These consist of 3 parts, all of which are summed by the processor to calculate the final address:

Displacement - a numerical value giving an offset to the address
Base register - a 64-bit register whose value is added to the address
Index register - another 64-bit register, whose value can also be scaled up by 1, 2, 4, or 8

Each of these parts is optional, however at least one is required to form a valid memory expression.

The syntax for a memory expression is as follows:

AT&T syntax	Intel syntax
`<displacement>(<base>, <index>, <scale>)`	`<displacement>[<base> + <index> * <scale>]`
E.g. `23(%rsi, %rax, 4)`	E.g. `23[rsi + rax * 4]`

Note that in some cases, the size of a memory expression may be ambiguous; if this is the case, the assembler will issue an error stating this. To disambiguate the size:

AT&T syntax	Intel syntax
The instruction's name should have a letter at the end specifying the size	The memory expression should be prefixed with the size name
E.g. `incl (%rax)`	E.g. `inc LONG [rax]`

Flags

In addition to the general-purpose registers, x86-64 processors also store a "flags" register. This register holds an array of bits, each of which represents a flag - a boolean value indicating some useful information about the results of the previous instruction. These flags are:

Carry - set to 1 if the last instruction caused an arithmetic carry that was truncated in the result (e.g. with the shl instruction the most-significant bit may be shifted out of a register, causing this flag to turn on). This flag can be manipulated using stc, clc and cmc. It is also read by adc, sbb, rcl, and rcr.
Zero - set to 1 if the last instruction generated a 0 value.
Overflow - set to 1 if the truncation done by the last instruction caused the sign bit (the most-significant bit) to erroneously flip over.
Sign - set to the sign bit (the most-significant bit) of the value generated by the last instruction (equivalently, set to 1 if the last instruction generated a negative value).
Direction - determines if the string instructions should increase (0) or decrease (1) the rsi and rdi registers. This flag can be manipulated using std and cld.
Parity - set to 1 if there is an even number of 1 bits in the value generated by the last instruction.

Conditional branching in Assembly is done through conditional jump instructions, which jump to the given location depending on the flags (e.g. jz (jump if zero) will jump only if the zero flag is set to 1). Some conditional jumps also check multiple flags to form simpler conditions; these can be paired with the cmp instruction to create simple arithmetic conditions. For example, the following program will halt if and only if edx is less than or equal to ebx:

AT&T syntax	Intel syntax
`label: cmp %ebx, %edx jnle label`	`label: cmp edx, ebx jnle label`

Note that not all instructions cause the flags to change; for example, while arithmetic instructions (e.g. add, sub, xor etc.) always affect the flags, mov, push, pop and xchg do not affect any flags.

The flags register can be manually read from and written to with pushf, popf, lahf, and sahf, however this is generally not recommended.

Prefixes

In some cases, a prefix may also be added before the mnemonic. Prefixes give the processor more info on how to execute the instruction. For example, in repne scasb, the instruction scasb (which increments rdi and compares the byte at the address pointed by it to al) will execute in a loop so long as al and the byte at rdi are not equal (repeat while not equal). In other words, it will find the first byte after the address rdi that is equal to al.

Directives

Along with executable code, it's also possible to embed raw data in an Assembly program. These are sequences of bytes (e.g. strings, integers, arrays) that will be loaded into the program's memory along with the code; they may be written to and read from by the program's instructions. This data can be added through the use of directives. These are pseudo-instructions that tell the assembler (the program that turns the assembly code into an executable file) to do certain things besides encoding instructions. For example, the following directives generate, in order, a byte, a sequence of words, a quad-word, and a string:

AT&T syntax	Intel syntax
`.byte 5`	`db 5`
`.word 0, 1, 1, 2`	`dw 0, 1, 1, 2`
`.quad 0x0123456789ABCDEF`	`dq 0123456789ABCDEFh`
`.ascii "Hello, world!"`	`db "Hello, world!"`

Directives can also be used to divide the bytes of a program into sections, which are distinct regions of memory that have different permissions (e.g. the .text section can be read, executed, and modified, the .data section can be read and modified but not executed, and the .rodata section can only be read):

AT&T syntax	Intel syntax
`.text`	`section .text`
`.data`	`section .data`
`.section .rodata`	`section .rodata`

Finally, directives can be used to set the syntax of a program: .intel_syntax sets the syntax to Intel, and .att_syntax sets the syntax to AT&T. These can be used anywhere within the program and in any order.

I/O

Input and output are not defined by Assembly; they way they are handled depends on the operating system the program runs under. In code.golf, the assembly programs are run under Linux, which uses system calls (which are executed via the syscall instruction) for program output. For a list of available system calls see the Linux System Call Table for x86-64 (note that sys_write is the only one you really need to know on code.golf). On Linux, the arguments passed to a program appear at the top of the stack when the program begins; to get the next argument, simply pop it off the stack (the popped value is an address pointing to the argument string).

Sample code

AT&T syntax Intel syntax

AT&T syntax	Intel syntax
.att_syntax SYS_WRITE = 1 SYS_EXIT = 60 STDOUT_FILENO = 1 # Printing .data buffer: .string "Hello, World!\n" bufferLen = . - buffer .text mov $SYS_WRITE, %eax mov $STDOUT_FILENO, %edi mov $buffer, %esi mov $bufferLen, %edx syscall # Looping .data digit: .byte '0', '\n' .text mov $0, %bl numberLoop: mov $SYS_WRITE, %eax mov $STDOUT_FILENO, %edi mov $digit, %esi mov $2, %edx syscall incb (%rsi) inc %bl cmp $10, %bl jl numberLoop # Accessing arguments pop %rbx pop %rax argLoop: dec %rbx jz endArgLoop pop %rsi mov %rsi, %rdi mov $-1, %ecx mov $0, %al repne scasb not %ecx movb $'\n', -1(%rsi, %rcx) mov %ecx, %edx mov $SYS_WRITE, %eax mov $STDOUT_FILENO, %edi syscall jmp argLoop endArgLoop: mov $SYS_EXIT, %eax mov $0, %edi syscall	.intel_syntax SYS_WRITE = 1 SYS_EXIT = 60 STDOUT_FILENO = 1 ; Printing section .data buffer db "Hello, World!\n" bufferLen = $ - buffer section .text mov eax, OFFSET SYS_WRITE mov edi, OFFSET STDOUT_FILENO mov esi, OFFSET buffer mov edx, OFFSET bufferLen syscall ; Looping section .data digit db '0', '\n' section .text mov bl, 0 numberLoop: mov eax, OFFSET SYS_WRITE mov edi, OFFSET STDOUT_FILENO mov esi, OFFSET digit mov edx, 2 syscall inc BYTE [rsi] inc bl cmp bl, 10 jl numberLoop ; Accessing arguments pop rbx pop rax argLoop: dec rbx jz endArgLoop pop rsi mov rdi, rsi mov ecx, -1 mov al, 0 repne scasb not ecx mov BYTE [rsi + rcx - 1], '\n' mov edx, ecx mov eax, OFFSET SYS_WRITE mov edi, OFFSET STDOUT_FILENO syscall jmp argLoop endArgLoop: mov eax, OFFSET SYS_EXIT mov edi, 0 syscall

.att_syntax
SYS_WRITE = 1
SYS_EXIT = 60
STDOUT_FILENO = 1

# Printing
.data
buffer: .string "Hello, World!\n"
bufferLen = . - buffer

.text
mov $SYS_WRITE, %eax
mov $STDOUT_FILENO, %edi
mov $buffer, %esi
mov $bufferLen, %edx
syscall

# Looping
.data
digit: .byte '0', '\n'

.text
mov $0, %bl
numberLoop:
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    mov $digit, %esi
    mov $2, %edx
    syscall

    incb (%rsi)
    inc %bl
    cmp $10, %bl
    jl numberLoop

# Accessing arguments
pop %rbx
pop %rax

argLoop:
    dec %rbx
    jz endArgLoop

    pop %rsi
    mov %rsi, %rdi

    mov $-1, %ecx
    mov $0, %al
    repne scasb

    not %ecx
    movb $'\n', -1(%rsi, %rcx)

    mov %ecx, %edx
    mov $SYS_WRITE, %eax
    mov $STDOUT_FILENO, %edi
    syscall

    jmp argLoop
endArgLoop:

mov $SYS_EXIT, %eax
mov $0, %edi
syscall

.intel_syntax
SYS_WRITE = 1
SYS_EXIT = 60
STDOUT_FILENO = 1

; Printing
section .data
buffer db "Hello, World!\n"
bufferLen = $ - buffer

section .text
mov eax, OFFSET SYS_WRITE
mov edi, OFFSET STDOUT_FILENO
mov esi, OFFSET buffer
mov edx, OFFSET bufferLen
syscall

; Looping
section .data
digit db '0', '\n'

section .text
mov bl, 0
numberLoop:
    mov eax, OFFSET SYS_WRITE
    mov edi, OFFSET STDOUT_FILENO
    mov esi, OFFSET digit
    mov edx, 2
    syscall

    inc BYTE [rsi]
    inc bl
    cmp bl, 10
    jl numberLoop

; Accessing arguments
pop rbx
pop rax

argLoop:
    dec rbx
    jz endArgLoop

    pop rsi
    mov rdi, rsi

    mov ecx, -1
    mov al, 0
    repne scasb

    not ecx
    mov BYTE [rsi + rcx - 1], '\n'

    mov edx, ecx
    mov eax, OFFSET SYS_WRITE
    mov edi, OFFSET STDOUT_FILENO
    syscall

    jmp argLoop
endArgLoop:

mov eax, OFFSET SYS_EXIT
mov edi, 0
syscall

Golfing tips

Transferring values between registers

The shortest way to move a value from one 64-bit register to another is using push and pop:

AT&T syntax	Intel syntax
`push %rax # 50 pop %rsi # 5E`	`push rax ; 50 pop rsi ; 5E`

For any registers except r8-r15 (for which an additional REX byte is required), this will result in a 2-byte encoding.

If the registers are 32-bit and one of them is eax, this may be shortened to a single byte with an xchg instruction:

AT&T syntax	Intel syntax
`xchg %eax, %esi # 96`	`xchg eax, esi ; 96`

Avoiding the REX prefix

An additional byte called the REX prefix is added in certain encodings. It can be easily identified, as it always has the form 4X (where X is between 0 and F). It provides auxiliary information about the registers used in an instruction. You should usually seek to avoid it.

The following cases cause a REX prefix to appear:

Usage of the r8-r15 registers
Usage of the spl, bpl, sil and dil registers
Usage of 64-bit registers/values (except in certain instructions, such as push and pop)

Avoiding the word-size prefix

Additionally, instructions that use 16-bit registers entail the encoding of a 66 prefix; for this reason they should generally be avoided.

Use the `al` register for arithmetic operations with immediates

For arithmetic instructions (e.g. add, sub, and, or, xor), the 8-bit size form of the instruction with an immediate and a register (e.g. add cl, 4, xor bl, 20), is 1 byte shorter if the register being used is al.

AT&T syntax	Intel syntax
`add $1, %cl # 80 C1 01 add $1, %bl # 80 C3 01 add $1, %al # 04 01`	`add cl, 1 ; 80 C1 01 add bl, 1 ; 80 C3 01 add al, 1 ; 04 01`

Placing a 1 value into a register

In holes that have no arguments, the value at the top of the stack (argc) has the value 1. This makes it convenient to pop into registers where this value is significant (typically rdi, as it's used for write syscalls as the file descriptor).

Using argv[0] as memory

The second value from the top of the stack holds a pointer to argv[0], which is a string containing the program's name (in DefAssembler this is always /tmp/asm). This memory area can be used for relatively small memory operations, such as printing numbers.

AT&T syntax Intel syntax

AT&T syntax	Intel syntax
`pop %rdi # 5F pop %rsi # 5E movl $'Heyo', (%rsi) # C7 06 48 65 79 6F mov $4, dl # B2 04 mov $1, al # B0 01 syscall # 0F 05`	`pop rdi ; 5F pop rsi ; 5E mov LONG [rsi], 'Heyo' ; C7 06 48 65 79 6F mov dl, 4 ; B2 04 mov al, 1 ; B0 01 syscall ; 0F 05`

pop %rdi                # 5F
pop %rsi                # 5E

movl $'Heyo', (%rsi)    # C7 06 48 65 79 6F
mov $4, dl              # B2 04
mov $1, al              # B0 01
syscall                 # 0F 05

pop rdi                   ; 5F
pop rsi                   ; 5E

mov LONG [rsi], 'Heyo'    ; C7 06 48 65 79 6F
mov dl, 4                 ; B2 04
mov al, 1                 ; B0 01
syscall                   ; 0F 05

Zeroing registers

The shortest way to set a register's value to 0 is xoring or subing it with itself, as in:

AT&T syntax	Intel syntax
`xor %ecx, %ecx # 31 C9 sub %eax, %eax # 29 C0`	`xor ecx, ecx ; 31 C9 sub eax, eax ; 29 C0`

Note that specifying a 64-bit size in this case has no effect; writing to a 32-bit size register will always set the top 32 bits of that register's 64-bit counterpart to 0.

Zeroing `edx`

The edx register is a special case; it can be zeroed using a single byte instruction, namely cdq (cltd in AT&T syntax), if the value in eax is non-negative (if it is negative, edx will be set to -1). This is especially useful before performing a div instruction, where the value in edx may affect the result of the operation.

Placing values divisible by 256 in 16-bit size registers

Occasionally, it is useful to place values divisible by 256 (or just arbitrary large values) in certain registers. This can be made shorter by writing to the register's high-byte part.

Instead of:

AT&T syntax	Intel syntax
`mov $512, %cx # 66 B9 00 02`	`mov cx, 512 ; 66 B9 00 02`

Use:

AT&T syntax	Intel syntax
`mov $2, %ch # B5 02`	`mov ch, 2 ; B5 02`

Advanced golfing tips

General information

Program layout and startup

The program as a whole is always loaded at 0x400000. However, it's possible to set a start address with .globl _start; _start: and it's possible to create extra sections to move the actual bytes in the solution with .section directive (you need to use .section <name>, "wax" to make them readable, writable and executable). Up to 70 custom sections seem to be supported, allowing to have a section that starts at addresses up to 0x448000 (it's not clear why more than 70 sections don't seem to work); this can be useful if a shorter way of initializing an address is available for such values. Common symbols up to around .comm x, 0xf78000000 (not clear why there seems to be such a limit) can be declared to enlarge the bss section and thus make a lot of address space usable at no code size cost, but it seems impossible to have any data be placed after the bss section. Note that .text (i.e. the program code) is writable by default unlike normal Linux executables.

Programs are ran as usual by the Linux kernel, meaning that at the start the registers are zeroed except for %esp (as better described in the x86-64 psABI), the stack contains argc (1 for no argument), then argv[0] which is always "/tmp/asm" and then the pointers to the arguments terminated by 0, followed by another 0 due to no environment being present, then the ELF auxiliary entries. The arguments starting with argv[0] are laid out sequentially in memory separated by '\0', starting at a variable offset after the end of the ELF auxilary entries on the stack and after the arguments, the null-terminated executable name, "/tmp/asm" is present, followed by 8 null bytes and then the stack ends and access past that point will cause a page fault.

Program structure

The example code is quite a poor example, because it doesn't matter what happens after you write correct output (as long as you don't randomly make extra correct write syscalls), so you don't need to call exit or in fact usually do anything after making the write syscall (if you do, then use hlt).

You also usually don't need to terminate the argument loop in any way and can thus usually ignore argc, as your solution will generally crash after the last argument since you will pop a null pointer, which is perfectly fine. Furthermore, sometimes you only need to pop and ignore the argument count since it can be fine to pass argv[0] to your argument logic.

You can either compose the whole output in memory and write it, or use multiple write calls.

Debugging

To debug code, add an hlt or ud2 instruction to have the evaluator fail at that point, dumping register values. If you want to inspect anything, move it to the r8-r15 registers. You can also run the code locally under a debugger such as gdb.

If you see "exit status 31", you probably made a system call with incorrect value.

General instruction selection

Optimal code will usually be composed mostly of 1 and 2-byte instructions, and 3+ bytes instructions should be checked for optimization opportunities.

It's best to avoid the address size, operand size and REX prefixes when possible, which means trying to not use extra x86-64 registers, avoiding the address size modifier (which is pretty much useless anyway) by specifying 64-bit registers in address, trying to avoid 16-bit operations and 64-bit operations that need REX (pretty much all of them except for push, pop, loop, jrcxz)

Registers should of course be selected to minimize moves and minimize instruction length (in particular, operations with al and immediates are shorter). System calls and several instructions use specific registers, so rename your registers accordingly. Remember that syscall clobbers rcx (and rax and r11).

These instructions have been reported by either CatsAreFluffy, JakobVarmose or lyphyser as being used in their solutions: adc, add, aesenclast, and, bsf, bswap, bt, btr, bts, call, cbtw, clc, cld, cltd, cmc, cmovCC, cmp, cmpsb, cmpxchg, cpuid, cqto, crc32, dec, div, hlt, fadd, fdiv, fild, fistp, fld, fmul, fstp, fsub, idiv, imul, inc, int $0x80, jCC, jmp, jrcxz, lahf, lea, lodsb, lodsl, lodsq, lodsw, loop, loope, loopne, mov, movsb, movsbl, movsbq, movsl, movsq, movsw, movsxd, movups, movzbl, movzwl, mul, neg, not, or, pcmpistrm, pdep, pop, popcnt, popf, push, pushf, rcl, rcr, rdrand, rep, repe, repne, ret, rol, ror, sahf, sar, sbb, scasb, scasl, scasq, scasw, setCC, shl, shr, smsw, stc, std, stosb, stosl, stosw, sub, syscall, test, xadd, xchg, xlatb, xor

Others potentially useful could be other x87 instructions and andn, bextr, blsi, blsmsk, blsr, bsr, btc, bzhi, cmpxchg16b, lzcnt, movbe, mulx, pext, sarx, shld, shlx, shrd, shrx, tzcnt, xgetbv, xrstor, xsave.

The following directives can be useful: .ascii, .asciz, .byte, .comm, .data, .globl _start, .long, .word, .section

SSE and VEX/EVEX-encoded instructions can perhaps be useful in some cases, although ALU and x87 instructions is usually shorter for integer and floating point respectively.

Hole solving workflow

If the algorithm is complex or you are unsure about it, you may write a solution in C (or perhaps Rust or Zig) first until it works. You can also compile it to assembly to get a first version, even though this is less useful than it seems as you will have to mostly rewrite it anyway to be competitive. When compiling, try both gcc and clang, try both 64-bit and 32-bit (to force it to not use the 64-bit registers that need a prefix) and try different optimization levels (usually -Oz is best since it's designed to aggressively optimize code size).
Find the most similar hole you have solved and copy the code, removing the hole-specific details
Write your solution until it works (this is the "draw the rest of the owl" part). When writing code, mark any instructions that may not be necessary or could be shortened with a comment (mostly instructions that clear registers or initialize them and could be shortened if the register happens to be clear or have the upper bits clear). Possibly write a comment describing what registers are used for. This might involve writing and running an extra program to generate data if applicable.
If your solution unexpectedly doesn't work, debug it as described in the "debugging" section and correct it until it works
For each of the instructions you marked and any others you notice, try to delete them one by one and see if the solution still works (you could also write a script that tries to delete every single instruction in the program)
Check the code for any extra unnecessary address size, REX prefix or debugging code and remove them
Check if it is possible to shorten the code by reallocating registers
Check any instruction longer than 2 bytes for optimization opportunities, and look for other optimization opportunities like the ones described in this page
Check any unconditional jumps, reorganize the code flow so that they jump forwards, and then replace them with disabling bytes as described in "unconditional forward jumps"
Check whether any constant data can be changed to start or end within another constant, within the program code or within the zeroes before and after a section (you may need to create extra sections to have zeroes before)
Check for any jumps that are longer than 2 bytes due to having a displacement outside -128 to 127 and fix them as described in "Long jumps"

Note on preserving a C-like entrypoint

If you're writing syscall-oriented C as a starting point, here's some (unoptimized) code to get the arguments out as arguments to a pseudo-main:

__attribute__((naked))
void _start(void) {
    __asm__ __volatile__ (
        "movq (%%rsp), %%rdi\n"     /* argc */
        "leaq 8(%%rsp), %%rsi\n"    /* argv */
        "call _real_start\n"        /* call main */
        "movq %%rax, %%rdi\n"       /* return value (for debugging)*/
        "movq $60, %%rax\n"         /* sys_exit */
        "syscall\n"               
        :
        :
        :
    );
}
int _real_start(int argc, char *argv[]) {
    /* main */
    return argc;
}

It can be compiled with gcc -nostdlib. Similar code should work on other source languages.

Options for main program tasks

Writing output from memory to stdout

There are 3 ways to make a write syscall: syscall, int 0x80 and sysenter.

syscall is the default one, works with 64-bit addresses, and needs eax = edi = 1, esi = buffer, edx = length. It sets rcx to the return rip value and r12 to the return flags value, and rax to the return value (which is the length assuming it succeeded)

int 0x80 is the other main one, only works with 32-bit addresses, and needs eax = 4, ebx = 1, ecx = buffer, edx = length. It doesn't support 64-bit addresses, but has the advantage of not clobbering extra registers, which can be better if you call it in a loop.

sysenter, which is probably not useful, is like int 0x80, but it returns to an address in the vdso with code that pops three registers and returns.

In argument-less holes, the 1 constant can be created with a simple pop rax. The length can be computed by subtraction, or alternatively the length can be constant and the start can be computed by subtraction if needed. When starting the output at -1 (32-bit), if an extra byte is written, the length can then be computed by simply copying the output end pointer to the rdx register (since the subtraction would be subtracting zero after adjusting for the extra byte written).

Reading the input and writing the output in memory

There are a few ways to access memory sequentially, to read the input or write the output. Usually string instructions are the best for reading input, while the output depends on the exact hole.

Use lods to read values from rsi into rax, or scas to compare values from rdi with rax or stos to write values into rdi, possibly backwards with std. 1 byte per 8-bit or 32-bit reads, 2-byte for 16-bit or 64-bit reads, but has register restrictions, doesn't set flags
Set a register to the address and use simple reads/writes (possibly with vector instructions) and increment/decrement it. If you are reading flags, you can compare with cmp or you can get both value and flags adding or oring into a zeroed register. If you are writing, you can use xchg to zero the register you are writing from. Needs 4 bytes for 8-bit access+increment (with inc/dec), 5 bytes for 32-bit access+increment, 6 bytes for 16-bit and 64-bit access/increment. You can save a byte in larger than 8-bit accesses if you can use rax as index register and incrementing with add al somehow works. One extra byte if reading from a 64-bit address for the REX prefix on the increment.
Instead of addressing with a single register, use (%r?x,%r.x) addressing with two registers. Uses an extra byte for the SIB addressing, but keeps the start and index around, and can let you make the write system call with no register setup
Point rsp to the start of the data and use pop to read 64-bit or 16-bit chunks forwards or push to write 64-bit or 16-bit units backwards. Needs 1 byte for 64-bit access, 2 bytes for 16-bit accesses, 6 bytes for 8-bit accesses, 7 bytes for 32-bit accesses.

If writing backwards, not using a separate index registers can save code in the syscall by using rsi as the register. If writing forwards, using a separate index register can save code in the write syscall if using rsi as the immutable start offset,

Placing the output in memory

The output can be put in many places, which are mostly chosen so that initializing the address is cheaper than the simple 5-byte immediate load. Also, if you have a precomputed table, you may want to put the table at the end of the program, followed by the output buffer with index-based addressing, so that a single register can access both the table and the output.

At 4GB, going backwards. This requires 0 bytes for initialization using an already zeroed register if going backwards. You need to write .comm x, 0xf78000000 or similar to reserve a common symbol to make all the 32-bit address space after 0x400000 usable.
At 4GB, going forwards. This requires 2 bytes for initialization with decl, uses 64-bit addresses. You need to write .comm x, 0xf78000000 or similar to reserve a big enough common symbol.
At the lower 32-bits of the stack pointer. You need to write .comm x, 0xf78000000 or similar to reserve a big enough common symbol. This requires 2 bytes for initialization (mov %esp, %e?x) or 0 bytes plus 1 for each access due to the address override prefix if using %esp. Compared to dec %e?x, has the advantage of staying in 32-bit space, but can rarely fail if %esp is too low and doesn't allow some tricks to save instructions when computing the length
In the argument area on the stack. This requires 1 byte for initialization (pop %r?x), 2 if not popping argc or 0 if already popping argv[0]
On the stack from the stack pointer. This requires 0 bytes for initialization if using rsp as index, or 2 bytes to move rsp to another register (using push %rsp; pop %r?p)
At the end of the program code. This requires 2 bytes for initialization in rcx with the syscall trick (use .globl _start; _start right before the final syscall and then jmp to the actual code; the syscall will set rcx to the address after the syscall), 4 with the syscall trick plus a move, or 5 bytes for a 32-bit immediate load.
Overwriting the program code. This requires 2 bytes for initialization in rcx with the syscall trick (start the program with a syscall), or 4 with the syscall trick plus a move or with bts $22, %reg
At some other easily constructible 32-bit address, using .comm x, 0xf78000000 to make it valid. The main way is to get the one address with either mov %esp, %e?x or dec e?x and then double it with lea (%r?x, %r?x), %e?x or similar. There are a lot of other ways to do this, such as the 3-byte add $-128, %reg, 4-byte bts $23, %reg, the 2-byte cpuid, 3-byte imul imm, reg, reg, etc.

Note that placements on the stack and at 4GB going forwards consist of 64-bit addresses that don't fit in 32 bits, which means that you must use syscall instead of int 0x80 and you must increment/decrement with REX prefixes or string instructions.

Common subroutines

Writing integers with a newline

Backwards output

Backwards output is usually shorter if only integers need to be printed because, compared to forwards output with string operations, it saves a "dec" at the beginning of the program to initialize the output register, a move from edi to esi, and the actual output routine is shorter (17 base size compared to 18 for stack-based forwards output).

Here is an example of a program structure to write integers backwards:

# required at the start of the file
.comm x, 0xf78000000

# sample code
mov $123456789, %eax

### insert code to print integers here

# sample code to print it
mov $1, %al # or pop %eax if the hole is argumentless
mov %eax, %edi
sub %esi, %edx
syscall

17-20: Division loop with direct output

Example (for numbers > 0 in eax):

mov $10, %bl

mov $'\n'-'0', %dl

for_digit_in_num:
add $'0', %dl
dec %esi
xchg %dl, (%rsi)
cmp %eax, %edx
div %ebx
jc for_digit_in_num

General version:

# compute number and put it into register <R>

addr32 loopne [...] # if <D>=l, replacing a j(n)e in the program, if there is one to replace
dec %<OUT> # if <D>=l and there is no j(n)e in the program to replace, add 1 to the cost, recorded as "+l" in <#>
dec %<OUT> # if <D>=d

cltd # if <#> has "+c" and cltd is needed, add 1 to the cost

mov $10, %<DIVbyte> # DIVbyte is DIV if DIV is a 8-bit register or the low byte of DIV otherwise

<NEWLINE>

.byte <S> # if <#> has "+s"/S ds non-empty and can't jump into the loop for free, add 1 to the cost

for_digit_in_num:

# D/A/P/T in order specified by OPS field; those in parenthesis must be skipped in the first iteration, either with a free jump or the skip disable above

D: div %<DIV> # preserves CF (undocumented)
[D: idiv %<DIV> # if S!=84, preserves CF (undocumented)]

# REM is ah if DIV is a 8-bit register, or dl otherwise
A: add $'0', %<REM> if A=add or A=a/x OR A=op or A=*
A: lea '0'(%rdx), %edx if REM=dl and (A=lea or A=*)
[A: or $'0', %<REM> if A=* or A=op]
[A: xor $'0', %<REM> if A=add or A=a/x OR A=op or A=*; need to change - into ^ in NEWLINE]
[A: adc $'0', %<REM> if (A=add or A=op or A=*) and CF=0 is guaranteed)]

P: dec %<OUT>; xchg %<REM>, (%<OUT>) # if J!=loopne
P: xchg %<REM>, -1(<OUT>) # if J=loopne, OUT=ecx

T: <TEST>

<J> for_digit_in_num

Size:

Base size is 12 + {1 if J!=loopne} + sizeof(D) + sizeof(NEWLINE) + sizeof(TEST)
1 extra byte if # has "+c" and cltd has to be added
1 extra byte if # has "+d" and dec %ecx needs to be used instead of addr32 loop because there is no suitable je/jne to replace
1 extra byte if # has "+s" and the skip disable in S has to be used rather than jumping for free inside the loop by having an outer loop jump inside the print loop and starting with _start inside the outer loop (note that sometimes it's possible to jump inside an instruction in the loop to clear flags), or 2 extra bytes if # has "+ss" and a 2-byte jump is used instead of a disable
If the output is written multiple times, then xchg no longer clears the register, so an extra cltd or xor may need to be added at 0-2 bytes extra cost
For input in rax, the input is clobbered, so 2 extra bytes are needed if it needs to be saved, which also needed in general if the number needs to be moved to the input register (or only 1 if xchg can be used)

Notes:

DIV can be replaced with cl/ch/ecx if it's not the output register at no cost, or if DIV!=bh with other registers at cost of extra REX prefixes
OUT=esi can be replaced with edi or ebx or ecx if unused, or other registers at cost of extra SIB bytes (for esp and ebp) or REX prefixes
64-bit integers can be printed by using div %rbx instead of div %ebx, at cost of one more byte; note that 32-bit registers usually still work (by luck) for the tests, otherwise another REX prefix is needed there
Bignums can be printed by replacing the division with a division loop from MSB to LSB
It's possible to compare with 0x80000001 and use jo instead of comparing with 1 and using jnc
Numbers larger or equal than 2^31 don't work if cltd is needed since it will set %edx to -1 instead of 0

R=bl, DIV=bh, OUT=esi:

Usually the best if having the input in %bl rather than %al/%ax/%eax saves code size.

#	Num	D	NEWLINE	S	OPS	A	TEST	J	Conds
17+s	0-255		or %ebx,%eax	0f	(AT)PD	op	cmp $1,%al	jnc	eax = 0
17+s	1-255		mov %ebx,%eax	41	(A)PDT	*	test %eax,%eax	jnz	clobber r12

R=eax, DIV=bl/ebx, OUT=ecx:

Usually the best if 0 or numbers up to 2559 need to be printed and ecx is not otherwise needed.

#	Num	D	NEWLINE	S	OPS	DIV	A	TEST	J	Conds
17+cd	any	l	mov %bl,(%rcx)		DAPT	ebx	*	test %eax,%eax	loopne
17+d	0-2559	l	mov %bl,(%rcx)		DAPT	bl	*	test %al,%al	loopne

R=al/ax/eax, DIV=bl/ebx, OUT=esi:

Usually the best if the conditions for using the other options don't hold.

#	Num	R	NEWLINE	S	OPS	DIV	A	TEST	J
17	1-255	al	mov $'\n'‑'0',%ah		APTD	bl	a/x	cmp %al,%ah	jc
17+c	>0	eax	mov $'\n'‑'0',%dl		APTD	ebx	a/x	cmp %eax,%edx	jc
17+s	0-255	ax	or %bl,%ah	0f	(AT)PD	bl	op	cmp $1,%al	jnc
18	0-2^31	eax	mov $'\n'‑'0',%dl		APTD	ebx	add	sbb %eax,%edx; cltd	jc

R=al/ax/eax, DIV=bl/ebx, OUT=esi with any number, other variants:

#	Num	R	NEWLINE	S	OPS	DIV	A	TEST	J	Conds
18	any	eax	mov $'\n'‑'0',%dl		APTD	ebx	add	sbb %eax, (%rsi,%rsi,4)	jc	no rewrite
17+s	any	eax	mov %ebx,%edx	a9	(AT)PD	ebx	*	cmp %edi,%eax	jnc	edi=1
17+s	any	eax	mov %ebx,%edx	0f	(AT)PD	ebx	*	cmp %eax,%e?x	jc	e?x=0,#
17+s	any	eax	mov %ebx,%edx	81	(TA)PD	ebx	lea	cmp %eax,%e?x	jc	e?x=0,##
17+cs	any*	eax	or %ebx,%edx	0f	(AT)PD	ebx	*	cmp $1,%al	jnc
17+s	any*	eax	mov %ebx,%edx	0f	(AT)PD	ebx	*	cmp $1,%al	jnc	CF=OF=0
17+s	any*	eax	or %ebx,%edx	84	(AT)PD	ebx	lea	cmp $1,%al	jnc	###
17+s	0-255	ax	mov %bl,%ah	0f	(AT)PD	bl	op	cmp %al,%?[lh]	jc	?[lh]=0,#

Note that these are better than the "sbb %eax, %edx; cltd" variant only if you need numbers larger than 2^31 or if the skip is free due to a free jump that can be changed to point after the skip, or if you need CF=1 after the loop.

# = CF=1 and OF=0

## = (%rcx) must point to a 32-bit value less than around 0x30528d??; alternatively using the load encoding of cmp, (%rbx) must be so

### = (%rbp) must be valid

* = no decimal prefix of the number must be a multiple of 256

R=al/ax/eax, DIV=bl/ebx, OUT=esi for positive numbers, other variants:

#	Num	R	NEWLINE	OPS	DIV	A	TEST	J	Conds
17+c	>0	eax	mov $'\n'‑'0',%dl	APTD	ebx	a/x	cmp %edi,%eax	jnc	edi=1
17+c	>0*	eax	mov $'\n'‑'0',%dl	APTD	ebx	a/x	cmp $1,%al	jnc
17+c	>0**	eax	mov $'\n'-'0',%dl	APDT	ebx	a/x	cmp %eax,%edx	jnz
17+c	>0***	eax	mov $'\n'-'0',%dl	APDT	ebx	a/x	cmp %e?x,%edx	jnz	e?x=0
17+c	>0***	eax	mov $'\n'-'0',%dl	APDT	ebx	a/x	cmp %?[lh],%dl	jnz	?[lh]=0
17	1-255	al	mov $'\n'‑'0',%ah	APTD	bl	a/x	cmp $1,%al	jnc
17	1-255	al	mov $'\n'‑'0',%ah	APDT	bl	a/x	test %eax,%eax	jnz
18+cs	>0	eax	mov $'\n'‑'0',%dl	APTD	ebx	a/x	cmp $1,%eax	jnc

There are usually only better than the main variants if you need different flags after the loop.

# = CF=1 and OF=0

* = no decimal prefix of the number must be a multiple of 256

** = most significant two digits different

*** = no zero digits

Different options that are usually worse with OUT=esi:

#	Num	R	D	NEWLINE	S	OPS	DIV	A	TEST	J
17+ss	any	eax		mov %ebx,%edx	?	(TDA)P	ebx	lea	cmp %ebx,%eax	jnc
17+cs	>0	eax		mov $'\n'-'0',%dl	84	(D)APT	ebx	a/x	test %eax,%eax	jnz
19+c	any	eax	d	mov %bl,(%rsi)		DAPT	ebx	*	test %eax,%eax	jnz
19	0-2559	ax	d	mov %bl,(%rsi)		DAPT	bl	*	test %al,%al	jnz

Generally worse or equivalent variants with OUT=esi:

#	Num	R	D	NEWLINE	S	OPS	DIV	A	TEST	J	Conds
17+c	>0**	eax		mov $'\n',%dl		PDTA	ebx	lea	cmp %eax,%edx	jnz
17+s	any	eax		mov %ebx,%edx	b9	(DAT)P	ebx	*	cmp %edi,%eax	jnc	edi=1,clobber ecx
17+s	any	eax		mov %ebx,%edx	a9	(DAT)P	ebx	*	cmp %edi,%eax	jnc	edi=1,ecx=ptr
19+c	any	eax	d	mov %bl, (%rsi)		TDAP	ebx	lea	cmp %ebx,%eax	jnc

This is what the S disables do (note that it's better to avoid them and jump inside the loop from an outer loop, starting the program inside the outer loop):

S	skips	Turns into	Notes
0f	AT	jo 0x...	needs OF=0, set by the or
41	A	add $'0', %r12b	clobbers r12
81	TA	cmp %eax, %e?x; lea '0'(%rdx), %edx	cmp $0x30528d??, (%rcx)
84	D	test %dh, %bh; rep
84	TA	test %cl, 0x13c3052(%rbp)	0x13c3052(%rbp) must be valid pointer
a9	AT	test $..., %eax; clc
a9	DAT	test $..., %eax; xor %bh, (%rcx); clc	rcx must be valid pointer
b9	DAT	mov $..., %ecx; xor %bh, (%rcx); clc	clobbers rcx

Constructing variants

Ops order rules:

Must be DAP or APD or PDA since adjustment must be between division and printing (but PDA ends with edx!=0 and is thus often not useful)
If PDT then DTA or TAD or ADT since A must not be between D and T as it would interfere with testing remainder=0

Implications of ops order:

If T is followed by P or A when not testing remainder, then a disable that includes T and not the following operation must be used, since otherwise moving T after A or P gives a strictly better result due to less flags clobbering
If T < A, lea (and thus ebx) must be used, since otherwise A clobbers all flags
If T < D, jnc must be used, since DIV clobbers non-CF flags
If T < P , jnc or loope must be used, since without loope P uses dec %esi clobbers non-CF flags and thus requires jnc
If D < P,T or if T,P < D, quotient=0 must be tested
If PDT quotient=remainder=0 must be tested
If TDP, then quotient<10 must be tested
If D < P then to print the newline either a dec+mov '\n',(%rsi) must be used, or loopne+mov '\n', (%ecx) must be used, or a disable must be used
If q=0 or q=r=0 and jne can be used, there is no point in using jnc
If q=r=0, jnc can't be used
A < D or A < P is highly preferable since otherwise edx!=0 at the end of the loop, which can require extra code

Hence these are the possibilities:

DAPT q=0 with jnz/loope, bl/ebx
(DAT)P q=0 with jnc/loope, bl/ebx
DTAP q=0 with jnc, lea, ebx
TDAP q<10 with jnc, lea, ebx and dec or a disable - can't find a disable for TDA and a TD disable doesn't work with lea due to carry into dh
APDT q=r=0 with jnz/loope, bl/ebx
APTD q=0 with jnc
(AT)PD q=0
(T)APD q=0

These end with edx!=0:

PDTA q=r=0 with lea, ebx
PDAT q=r=0 - NO, A interferes with remainder testing
PTDA q=0 with lea, jnc, ebx
(T)PDA q=0

17: for integers between 10 and 99, number in `al`

mov $10, %bl

div %bl
add $'00', %ax
sub $3, %esi
mov %bl, 2(%rsi)
xchg %ax, (%rsi)

Forwards output

Here is an example of a program structure to write integers forwards:

# required at the start of the file
.comm x, 0xf78000000

# initialize edi to 0xffffffff
dec %edi
push %rdi

# sample code
mov $123456789, %eax

### insert code to print integers here

# sample code to print it
# this will skip the last newline, but it's ok because trailing whitespace is ignored
pop %rsi
mov %edi, %edx # edi is length-1
mov $1, %al # or pop %eax if the hole is argumentless
mov %eax, %edi
syscall

18: for any integer, number in `eax`, clobbers `edx`

mov $10, %bl
push $'\n'-'0'

for_digit_in_num:
cltd
div %ebx
push %rdx
test %eax, %eax
jnz for_digit_in_num

for_char_in_output:
pop %rax
add $'0', %al
stosb
jnc for_char_in_output

17-18: for integers between 0 and 99, number in `al`/`eax`

mov $10, %bl
div %ebx # or div %bl

test %al, %al
je no_high_digit
add $'0', %al
stosb
no_high_digit:

xchg %edx, %eax # or mov %ah, %al
add $'0', %al
stosb

xchg %ebx, %eax
stosb

12: for integers between 10 and 99, number in `al`

mov $10, %bl

div %bl
add $'00', %ax
stosw
xchg %ebx, %eax
stosb

Parsing integers

This code will parse a tuple of integers separated by space and terminated by newline; variants are possible.

# mov $..., %cl

for_number_in_input:
# use mov %edx, %e?x for just 2 numbers

cltd
for_digit_in_input_number:
lodsb
sub $'0', %al
jb break_digit_in_input_number
imul $10, %edx, %edx
add %eax, %edx
jmp for_digit_in_input_number
break_digit_in_input_number:

push %rdx # delete this for just 2 numbers

# or loop for_number_in_input
# can often just use jp here instead of cmp; jne
cmp $'\n'-'0', %al
jne for_number_in_input

# pop as many numbers as are present in the input
# or have them already in %edx and %e?x for just 2 numbers
pop %rbx
pop %rdx

Printing a string from a table of strings

The version with backwards start offset is shorter if it needs to be used once, while the forwards one should be shorter if called multiple times, and the one with terminators might be best in special conditions.

With a table of start offsets, backwards (shortest)

S: # must be at offset 0

s2: .ascii "ef"
s1: .ascii "bcd"
s0: .ascii "a"
s_:

.globl _start
_start:

for_arg_in_args:
syscall
rip:

lea S-rip(%rcx), %eax

# put index in edx, can also reallocate register

first_value = 0
movzwl table-rip-first_value(%rcx,%rdx), %ecx
sub %ch, %cl
xchg %al, %ch
xchg %eax, %esi
rep movsb

[...]
jmp for_arg_in_args

table: .byte s_ - S, s0 - S, s1 - S
# omit last entry since it is zero

With a table of start offsets

S: # must be at offset 0

s0: .ascii "a"
s1: .ascii "bcd"
s2: .ascii "ef"
s3:

table: .byte s0 - S, s1 - S, s2 - S, s3 - S

.globl _start
_start:
syscall
rip:

mov %ecx, %esi

# put index in eax, can also reallocate register

first_value = 0
mov table-rip-first_value(%rcx,%rax), %edx
movb %dl, %sil
movzbl %dh, %ecx
sub %dl, %cl
rep movsb

esi can be computed in many other equivalent ways, and it's possible to add the byte instead of replacing the lower byte; however, all variants seem to be the same code size, and this variant allows to only do the mov %ecx, %esi once if the code is executed multiple times.

With terminators

s0: .asciz "a"
s1: .asciz "bcd"
s2: .asciz "ef"

.globl _start
_start:
syscall
rip:

mov %ecx, %esi
# or lea s0(%rcx), %esi if the table doesn't start at 0

# put index in ecx, you'll need a mov for other registers
for_earlier_str_in_table:
for_byte_in_earlier_str:
lodsb
test %al, %al
jne for_byte_in_earlier_str
loop for_earlier_str_in_table

# turn stosb into test $0xaa, %al
.byte 0xa8
for_byte_in_str:
stosb
lodsb
test %al, %al
jne for_byte_in_str

Another approach:

push %rdi
mov %ecx, %edi
# or lea s0(%rcx), %esi if the table doesn't start at 0

# put index in eax

for_earlier_str_in_table:
mov $0xff, %cl
repne scasb
dec %eax
jne for_earlier_str_in_table

mov %edi, %esi
pop %rdi

# turn stosb into test $0xaa, %al
.byte 0xa8
for_byte_in_str:
stosb
lodsb
test %al, %al
jne for_byte_in_str

Non-sequential strings

The approaches above require to store the strings sequentially: removing this constraint costs an extra byte per string, but can save memory if several strings are substrings of each other, or otherwise present in other constants.

It is possible instead to use 2 bytes per string for start and length and adapt the routines by adding a scale of 2 in the address computation and removing the length subtraction (or in the terminated case, remove the first loop and instead directly load the start).

Fixed-length string table

Another approach is to reserve a constant number of bytes in the string table, allowing to compute the start address by multiplication or preferably with SIB byte scale; if the strings are not fixed-length, then an additional table storing the lengths can be used.

This will of course be profitable if the strings are constant length or close to it.

String hashing

String hashing can be done with crc32 or with FNV hashing.

Divisibility check

In addition to using div, it is possible to use a subtraction loop to check for divisibility, like this:

divisibility_loop:
sub %e?x, %e?x
je is_divisor
jns divisibility_loop

Or this:

divisibility_loop:
sub %e?x, %e?x
js not_divisor
jne divisibility_loop

Instead of:

cltd # if edx != 0
div %e?x
test %edx, %edx
je is_divisor

Advantages compared to division:

Saves one byte if cltd would be required for division
Can use any register or memory location for the number to divide rather than having to use eax
Can divide by immediates without extra instructions to load them

Disadvantages compared to division:

Doesn't produce a quotient unless 2-4 more bytes are used
Doesn't produce a nonnegative remainder (but rather a nonpositive one) unless 2 more bytes are used
Execution time is proportional to quotient rather than constant

Common tasks and optimizations

Control flow

Loops

There are 3 main ways to loop for a given number of iterations:

Backwards with loop. This is the shortest, but can only loop backwards in ecx with 0 excluded, and doesn't set flags in the decrement. You can also get a conditional exit for free with loopne/loope.
Backwards without loop (using dec or sub to set flags for the comparison). This can save one byte if not using eax as the loop since initialization with an immediate is cheaper than comparison with an immediate.
Forwards without loop (using cmp to set flags for the comparison)

It's also possible to loop until a stop condition (if there is no natural condition, it's possible to match one of the bytes in the last output, ideally checking %al since it's shortest):

With a normal conditional jump
Using loopne to get a free rcx/ecx decrement - note that you can use .byte 0x67 before the loopne to get a 32-bit decrement (this goes well with backwards int $0x80 output starting at 4GB)

Conditional jumps and flags

It's possible to shorten conditional branches in several ways:

Put the value in rcx and use jrcxz or loop
Replace a mov into a cleared register with an add, or, xor or possibly adc so that flags are set for free
Reuse flags generated from a previous operation instead of doing test or cmp
test a register with itself instead of using cmp with 0 when checking for 0
Use shr $1, ror $1 or rcr $1 to check the lowest bit instead of test $1, if possible. shr $1 can be useful because it also clears it.
Use loope or loopne to get a free rcx/ecx decrement in addition to a je/jne jump - note that you can use .byte 0x67 before the loopne to get a 32-bit decrement

Flags can be saved and restored with pushf/popf or sahf/lahf.

Sometimes it is possible to restructure comparison to use the carry flag (e.g. checking for zero with cmp $1, %e?x and then using the carry flag), which allows to preserve flags across dec, inc, div and idiv.

A 2-byte loop dummy; dummy: instruction can be used to decrement rcx preserving flags.

If using multiple conditional jumps, try to find a single instruction (usually a cmp/subtraction) with an appropriate value that can set flags suitable for all of them at once.

The parity flag can be used to check whether there is an even or an odd number of 1 bits in the lowest byte.

The sign flag can often be used to check for special values.

The div and idiv instruction are documented to have undefined effect on flags: in fact, it seems they clear PF, ZF, SF, set AF and preserve CF, OF and other flags. The imul instruction is documented to have an undefined effect on non-CF/OF flags; in fact, it seems that it preserves them.

Note that a single cmp instruction with x lets you discriminate, with only conditional jumps, between x - 2 (jb/jl/js and jnp), x - 1 (jb/jl/js and jp), x (je), x + 1 (ja/jg and jnp) and x + 3 (ja/jg and jp).

Among range of values, a cmp with x lets you discriminate between numeric intervals like this (assuming x between 1 and 0x7f):

0..x-1	x	x..0x7f	0x80..(0x7f+x)	(0x80+x)..0xff
CF, SF	ZF		OF	SF
jb	jz	jg	jo	jnb+js

jnz	jz	jnz	jnz	jnz
jc	jnc	jnc	jnc	jnc
jb	jae	jae	jae	jae
jbe	jbe	ja	ja	ja
jle	jle	jg	jle	jle
jl	jge	jge	jl	jl
jno	jno	jno	jo	jno
js	jns	jns	jns	js

For negative x = 0x100 - y:

0..x-0x81	x-0x80..0x7f	0x80..x	x	x..0xff
0..0x7f-y	0x80-y..0x7f	0x80..-y	-y	-y..0xff
CF	CF, SF, OF	CF, SF	ZF
jna+jns	jo	jl	jz	ja

jnz	jnz	jnz	jz	jnz
jc	jc	jc	jnc	jnc
jb	jb	jb	jae	jae
jbe	jbe	jbe	jbe	ja
jge	jge	jl	jge	jge
jg	jg	jle	jle	jg
jno	jo	jno	jno	jno
jns	js	js	jns	jns

In addition, the parity flag further splits each interval into two sets.

Applying the above, you can for instance:

Match 0, 1, 2, 3, 5, 0x80, 0x81, 0x82, 0x83 with a single cmp with 2, then 4, 6, 7, 8, 10, 0x85, 0x86, 0x87, 0x88 with a cmp with 7 and so on
Match 0x7a, 0x7b, 0x7c, 0x7d, 0xfa, 0xfb, 0xfc, 0xfd, 0xff with a single cmp with 0xfc.

Unconditional short forward jumps

The simplest way to jump forward unconditionally is the 2-byte jmp instruction. However, it can often be done in one byte by outputting a byte that causes the machine code in the next instruction to change to an instruction that does nothing relevant, possibly setting flags desirably.

It's also of course even better to reorder the code to eliminate unconditional jumps. If that's impossible, it's good to see if there are opportunities to reorder the code so that the previously described optimization can be applied.

The following generally applicable disabling bytes are available:

1-byte: test $imm8, %al or mov $imm8, %?[lh]
4-byte: test $imm32, %eax or mov $imm32, %e?x
Single instructions: REX prefix (usually)

In general, a great way to find such bytes is to try them all, since there are only 256 choices. Here is a script to do so:

#/bin/bash
#asm="$(</dev/stdin)"
asm="$1"

out="$(mktemp)"
if ! as --64 -o "$out" <<<"$asm"; then
	echo "Input fails to assemble" >&2
	exit 1
fi

disasm="$(objdump --no-show-raw-insn --no-addresses -w -d "$out"|grep -E '^\s+')"

re="$2"
if test -z "$re"; then
re='\(bad\)|.byte|\.\.|'"$(tr '\n' ';' <<<"$disasm"|sed -re '
s/\s+/ /g
s/\s*;\s*/;/g
s/^;*\s*//
s/\s*;*$//
s/;/\n/g
s/[^^[:space:]]/[&]/g
s/\n/|/g
s/([a-zA-Z0-9_]\])\s+(\[[a-zA-Z0-9])/\1\\s+\2/g
s/([a-zA-Z0-9_]\])\s+(\[[a-zA-Z0-9])/\1\\s+\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[a-zA-Z0-9])/\1\\s*\2/g
s/([a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/\s/\\s*/g
s/\^/\\^/g
')"
fi

echo "All except matching: $re"

for((i=0;i<256;++i)); do
	echo ".byte $i"$'\n'"$asm"|as --64 -o "$out";
	dis="$(objdump -w -d "$out")"
	if ! grep -qE "$re" <<<"$dis"; then
		sed -e '/^Disassembly/d; /^\s*$/d; /:\s*file format/d;' <<<"$dis"
	fi
done

Non-short jumps

In long programs, some jumps may have a displacement outside the -128 to 127 bytes range, which will result in a 5-byte or 6-byte instruction instead of a 2-byte one.

You can try to solve the issue as follows:

Rearrange (or optimize!) the program so that you don't have jumps that are so long
Change the jump to jump to another jump so that both jumps are short. This only needs 2 or 4 bytes, but you need an existing jump to the same destination or a place to put the jump without needing to jump around or disable it.
Compute the jump destination by setting the lower byte in a register holding an instruction pointer and then using the "jCC *%r?x" jump instruction variant. This is particularly useful for the backward jump that closes the argument loop.
Save the jump destination by executing a syscall earlier at the jump destination and then using jCC *%rcx
Have a data table start in the trailing ff bytes of the jump

Function calls and bytecode/bitcode interpretation

Function calls are in most cases not efficient, since it takes 5 bytes for call symbol, plus possible argument setup: it is often better to structure repeated tasks as a loop and read the input values from an arrays.

If using function calls anyway, 2-byte call *%r?x is preferable to 5-byte call symbol, and so it can be efficient to, using the syscall trick, set up a register pointing to program data and offset it so that it points to the most used function: this will allow to both address your data, if any, relative to that register and call that function with two bytes.

If structuring the repeated tasks as a loop is not straightforward, it's usually possible to first conceptually structure the program as a sequence of function calls, then transform the functions in a single meta-function that dispatches to the functions, then transform the calls to a meta function into a loop that reads data from arrays, and then finally inline everything.

When reading data in this fashion, it can be beneficial to compress the data, effectively creating a bytecode or bitcode interpreter.

In this case, there is of course a tradeoff between a more powerful interpreter that takes more code but allows shorter bytecode and a simpler interpreter with longer bytecode, and a lot of experimentation can be done to find the best tradeoff.

The most generic tradeoff is the one between byte-aligned bytecode (which wastes space if you need than a multiple of 8 bits), word-aligned bitcode (i.e. each opcode is N bits and goes across byte boundaries, but it don't straddle across "word" boundaries where words are usually 16/32/64-bit, or even 128/256/512 with SSE/AVX2/AVX-512 - for instance you can have 5 6-bit opcodes in a 32-bit word, wasting 2 bits every 32), fully unaligned bitcode (which wastes little space, but is the most complex to read) and even forms of arithmetic/range coding (optimal, but most complex, probably not worth it unless there's a hole with a ton of data).

Here are some tips for using bytecode/bitcode:

It is possible to test and consume a flag by putting it in the low bit and using shr/sar $1, %reg followed by testing the carry flag (other shift/rotate instruction work as well); note that shr instructions also puts the high bit in the overflow flag and you can test for zero and parity as well
It is possible to split a bitfield with 2 value by shifting it so that the higher part ends up in ah and the lower in al
It's usually faster to test for the end of bytecode by checking for zero somewhere in the loop rather than using a counter; of course, this requires to engineer the bytecode so that there are no zeroes in normal bytecode. Note that you don't need to specify the final zero as long as the bytecode is at the end of a section
You might to be able to choose the encoding so that the start or end of bytecode is identical to either the end or start of program code or another data constant, so that you can save bytes by overlapping

Using instructions unusually

In addition to writing and reading output and using lookup tables for xlatb, string instructions can shorten more general tasks (as long as the registers are valid pointers):

rsi can be incremented with lods
rdi can be incremented with scas
rsi and rdi can be both incremented with cmps
rep lods can add ecx to esi, optionally scaled by 1/2/4/8
rep scas can add ecx to edi, optionally scaled by 1/2/4/8
rep cmps can add ecx to both esi and edi, optionally scaled by 1/2/4/8
Reading a byte at ebx can be done with xlatb if eax is clear

The loop instruction can be used to decrement rcx preserving flags with a target on the next instruction.

The cpuid instruction, especially in the lahf; cpuid variant, can be used to clear all main registers at once.

The syscall instruction can be used to get the current IP in rcx.

Aliasing/overlapping instructions

It is sometimes possible to save bytes by having instructions overlap, or equivalently by writing code that jumps in the middle of an instruction; this is best accomplished by using ".byte" to encode the relevant code so that labels can be placed appropriately, although computed labels are also possible.

This technique heavily depends on the exact instruction encodings and order, so it may be needed to try to permute instructions and to use equivalent opcodes and different instruction forms.

Unconditional short forward jumps / instruction disables

As explained above, it can be very useful to perform forward jumps in 1 byte by choosing a byte that results in do-nothing instructions that overlap the instructions to be jumped over.

Aliasing pre-loop instructions with the loop (e.g. eliding arbitrary instruction immediates)

In addition to disabling some instructions in a loop, aliasing pre-loop instructions with the start of the loop can be useful in other cases to save 1 byte (or in theory more, but it's unlikely to be possible), in particular when the pre-loop instruction does something useful and the in-loop instructions are useless or harmful in the first loop iteration.

A particularly important case is when an instruction takes an immediate that can be arbitrarily chosen in a given range, such as in initialization of byte registers in short holes (or holes with fixed output size < 256), where the exact value is not important, because extra output is cut off by explicitly setting the length before the syscall.

Jumping in the middle of an instruction (e.g. to set flags)

Sometimes it is possible to jump in the middle of an instruction that comes before a label rather than to the label in case the resulting instruction is valid and of the desired size; this can sometimes be useful to be able to conditionally jump and change flags, and also possibly in other cases.

Saving leading or trailing 00 bytes in instructions

If an instruction starts or ends with 00 bytes, it may be possible to save those bytes by putting the instruction at the start of end of a section (by using custom sections like .section mytext23, "wax").

In the trailing case and in the leading case if it's not the start of the program, this means that zero padding is executed, which means executing 00 00 instructions, which are add %al, (%rax) and thus require rax to be a valid pointer (and clobberable if al != 0).

Optimizing arbitrary immediates / removing degrees of freedom

Sometimes a solution can have immediate constants that can be set to a range of values, or in other words there are "degrees of freedom" in the solution.

It can be beneficial to remove those degrees of freedom, by selecting values that allow to have a shorter solution. These are the possible ways:

Find a way to rearrange previous code to leave an acceptable value in the required register
Alias the immediate with the start of a loop, as described before
Load only the high 8-bit register part, if a multiple of 256 is acceptable
Use a value that can be produced cheaply; for example, copying %esp into the register for a large value, or using one of the instructions that make a special value
Select a value that results in leaving desired values in registers at the end of computation, thus saving later instructions that set those registers (these are usually the registers needed to perform the final print syscall). This might require exhaustive search by running the program with any possible initial value to find those that produce the desired final values

Another kind of degree of freedom is that if a program ends with an unconditional jump to the start (i.e. those that process arguments and don't need any one-time initialization), then the whole program can be "rotated" arbitrarily preserving semantics by defining _start to where it should start. This can be useful to alias the start or end of the program code with a data table, or to get a somewhat arbitrary constant (limited by program length and instruction granularity) in %cl for free by selecting the offset at which the syscall instruction is placed.

Short vs long numeric holes

Non-long numeric holes can usually be done in less code than long versions due to:

Obviously, shorter instructions to load the initial value for backwards iteration
Shorter instructions when applying immediates to %al instead of %eax
The ability to select a different shorter variant of integer printing code
cltd not needed before div
The edx register being available due to using 8-bit division instead of 32-bit, and possibly extra 8-bit registers being available
The ability to set the output length directly via mov $..., %dl rather than using sub %esi/%ecx, %edx at the same code size, which then allows to start backwards iteration at any point and thus either initialize with mov $..., %ch/%bh or mov from another constant-initialized register, or even starting with a decrement from 0 allowing to use no code at all to initialize the loop counter (which is usually in %bl since loop will decrement to 2^64-1 or 2^32-1 with an addr32 prefix), or using instruction immediate aliasing

Instructions

Floating point and the x87

The x87 is the shortest way to do floating point since all x87 instructions use simple 2 byte ModR/M encodings, while "ps" SSE variants take 3 bytes, other non-AVX-512 SSE/AVX forms take 4 bytes and AVX-512 EVEX-coded instructions take 6 bytes.

x87 instructions are 2 or more bytes shorter for these tasks:

Division by a potentially zero value (no check required, the x87 just gives an infinite or QNaN result)
Fixed point multiplication and division (saves shifts for scaling)
Any computations that actually require a variable exponent
Arithmetic on IEEE-754 floating point values
80-bit memory loads and stores (1 operation instead of 2)
Converting to/from packed BCD (although that seems not very useful)
Taking the absolute value (fabs vs test+js+neg)
Trigonometric functions, exponentiation, logarithm, square root

x87 instructions are 1 byte shorter for these tasks:

64-bit non-wrapping add/sub/mul/div/rem without immediates that fit the x87 stack model, loads and stores (saves REX)
16-bit memory loads and stores (saves operand size prefix)
Stack-like push/pop, if it fits the FPU register stack (no/op instead of push/pop)
Conditional moves (fcmov vs cmov)

x87 instructions are the same length (but can save GPR registers):

32-bit memory loads and stores
8/16/32-bit non-wrapping add/sub/mul/div/rem without immediates that fit the x87 stack model
Setting and testing a boolean, setting done once (set with fld1, test with fucomi and jpe/jpo vs mov + test + je/jne)
Setting and testing a boolean, setting done multiple times, pointer available (set with fldenv (%ptr), clear with fninit, test with fucomi + jp/jpe vs mov + test + j/jne)
Operations with immediates that happen to be pointed by a non-rsp register (e.g. 1 in argument-less holes or the byte following a syscall instructions)
Arithmetic with values computed by ALU instructions, if they can be put in (%r?x) for free

x87 instructions are 1 byte longer for these tasks:

Operations with immediate 1 in argument-less holes or 2 in single argument holes (%(rsp) requires an extra byte)

x87 instructions are 2 byte longer for these tasks:

Pushing/popping to the stack, if it doesn't fit the FPU register stack (dummy push/pop + FP load/store)
Loads at (%rsi) or stores at (%rdi) with increment (dummy string instructon + FP load/store)
Increment/decrement by 1 the first time (fld1 + faddp/fsubp vs inc/dec)
Getting both quotient and remainder of a division (2 ops instead of 1)
Operations with immediates, if immediate is 0/1 or pointer to code already available (either need an extra instruction like fld1 or need at least 1 byte of memory and 1 byte for the displacement to the immediate)
Arithmetic with values computed by ALU instructions if they can't be put in (%r?X) for free (load and store to/from memory)

x87 instructions are more than 2 bytes longer for these tasks:

Operations with immediates, if immediate is not 0/1 and a pointer to code is not already available (.byte + syscall + displacement)
Wrapping arithmetic (needs explicit remainder computation)
Looping for a given number of iterations (fld1 + fadd/fsub + fucomi + jcc + loading the limit or initial value vs mov + loop)
8-bit memory loads and stores (can't do directly)
Conversion between integers and floating point in memory (FP load + store)
Bignum arithmetic (no carry flag or extending/two-word multiplication and division)

Constructing constants in x87 registers

Note that when specifying floating-point immediates, usually only the top 2 bytes of a 32-bit floating point number need to be specified, and the others can be zeroes coming from the ending padding of the previous section.

Bytes	Instruction	Values	Condition
0	fi... (%r?x)	0	?r?x pointing to clear memory
+1	fi... disp(%rsp)	1	argument-less holes
+1	fi... disp(%rsp)	2	1 argument holes
2	fldz	0
2	fld1	1
+2-9	`fi... disp8(%r?x); label: .byte $imm, ...`	8 to 64-bit integers or floating point values	code pointer available
4	fld1; fadd %st(0), %st(0)	2
4	fldpi; frndint	3
4	fld1; fchs	-1
+4-11	`syscall; fi... disp8(%rcx); label: .byte $imm, ...`	8 to 64-bit integers or floating point values	syscall valid
+5-12	`fi... disp32(%rip); .section .data.imm, "a"; label: .byte $imm, ...`	8 to 64-bit integers or floating point values
6	fld1; fadd %st(0), %st(0); fadd %st(0), %st(0)	4
6	fldpi; frndint; fadd %st(0), %st(0)	6
6	fld1; fchs; fadd %st(0), %st(0)	-2
6	fld1; fchs; f2xm1	-0.5

Effective addressing with ModR/M and SIB bytes

r8-r15 registers need an extra REX byte if not already present

rsp can never be used as an index register, but all other forms are possible, albeit sometimes with an extra byte (or 3-4 extra bytes in case of (,%r?x,#), which can usually be replaced with (%r?x,%r?x,#) where the first register is cleared).

RIP-relative addressing is usually suboptimal since there is no disp8 variant; instead, use syscall at the start of loop and then index from ecx which will have the address following the syscall instruction.

# can be 1, 2, 4 or 8; by specifying the same register as both base and index, you can multiply it by 2, 3, 5 or 9.

An address size prefix can be used to use 32-bit registers in the address calculation: this is usually only useful when referring to esp or derived values in edi or esi to store a table at the address corrisponding to the lower 32-bits of esp and only if done up to 2 times (otherwise moving to a register is shorter).

Bytes	Value	Condition
+0	`%e?x`
+0	`(%r?x)`	not `rbp` or `rsp` or `r12` or `r13`

+1	`disp8(%r?x)`	not `rsp` or `r12`
+1	`(%r?x,%r?x,#)`	base (1st reg) not `rbp` or `r13`, index (2nd reg) not `rsp`
+1	`(%rsp)`
+1	`(%rbp)`

+2	`disp8(%r?x,%r?x,#)`	index (2nd reg) not `rsp`
+2	`disp8(%rsp)`
+2	`disp8(%r12)`

+4	`disp32(%r?x)`	not `rsp` or `r12`
+4	`disp32(%rip)`

+5	`disp32(%r?x,%r?x,#)`	index (2nd reg) not `rsp`
+5	`disp32(,%r?x,#)`	not `rsp`
+5	`addr32`
+5	`disp32(%rsp)`
+5	`disp32(%r12)`

Copying or moving values between registers

Bytes	Instruction	Values	Side effect	Condition
1	`xchg %e?x, %eax`	32-bit values	exchanges data	only to/from `eax`

2	`xchg %e?x, %e?x`	32-bit values	exchanges data
2	`push %r?x; pop %r?x`	64-bit values		needs stack
2	`xchg %r?x, %rax`	64-bit values	exchanges data	only to/from `rax`
2	`mov %?[lh], %?[lh]`	8-bit values
2	`mov %e?x, %e?x`	32-bit values
2	`lea (%r?x), %e?x`	32-bit values
2	`add/or/xor %e?x, %e?x`	32-bit values	sets flags	clear register
2	`add/or/xor %?[lh], %?[lh]`	8-bit values	sets flags	clear register
2	(`{load} mov %e?x, %e?x`)	32-bit values		(encode with .byte, not supported by defasm, it's the form where the r/m operand is the source)
2	(`movsxd %e?x, %e?x`)	32-bit values		(encode with .byte, not supported by defasm, it's movsxd without a REX prefix)
2	(`{load} mov %?[lh], %?[lh]`)	8-bit values		(encode with .byte, not supported by defasm, it's the form where the r/m operand is the source)

3	`mov %r?x, %r?x`	64-bit values
3	`lea (%r?x), %r?x`	64-bit values
3	`add/or/xor %r?x, %r?x`	64-bit values	sets flags	clear register
3	`xchg %r?x, %r?x`	64-bit values	exchanges data
3	`mov %?x, %?x`	16-bit values
3	`add/or/xor %?x, %?x`	16-bit values	sets flags	clear register
3	`xchg %?x, %?x`	16-bit values	exchanges data
3	`movzbl/movsbl %?[lh], %e?x`	8-bit into 32-bit values	extends
3	`movslq %?[lh], %r?x`	32-bit into 64-bit values	extends
3	`movzwl/movswl %?x, %e?x`	16-bit into 32-bit values	extends
3	`movsxd %e?x, %r?x`	32-bit into 64-bit values	sign extends
3	(`{load} mov %r?x, %r?x`)	64-bit values		(encode with .byte, not supported by defasm, it's the form where the r/m operand is the source)
3	(`{load} mov %?x, %?x`)	16-bit values		(encode with .byte, not supported by defasm, it's the form where the r/m operand is the source)

4	`movzbw/movsbw %?[lh], %r?x`	8-bit into 16-bit values	extends
4	`movsbq %?[lh], %r?x`	8-bit into 64-bit values	extends
4	`movswq %?x, %r?x`	16-bit into 64-bit values	extends

Constructing constants

Note: this list may not be exhaustive

Common generic patterns

Bytes	Instruction	Values	Condition
2	`mov $imm, %?[lh]`	8-bit values	clear register
2	`mov $imm, %?h`	16-bit values divisible by 256	clear register
3	`push $imm; pop %r?x`	8-bit values, sign-extended to 64-bit	needs stack
3	`lea imm(%r?x), %e?x`	8-bit values, sign-extended to 32-bit	r?x = 0
4	`mov $imm16, %?x`	16-bit values	clear register
5	`mov $imm32, %e?x`	32-bit values
10	`mov $imm64, %r?x`	64-bit values

Common patterns for specific constants

Bytes	Instruction	Values	Condition
1	`cltd`	0	in `edx`, if `eax` is zero or positive
1	`pop %r?x`	1	needs stack, only if there are no input arguments, only once
1	`xchg %eax, %edi`	1 (for syscall)	clobberable eax=1 anywhere in the program
2	`xor %e?x, %e?x`	0
2	`dec %e?x`	-1 (32-bit)	clear register
2	`mul %e?x`	0 in `eax` and `edx`	`%e?x` must be zero
2	`sbb %e?x, %e?x`	-1 (32-bit)	if the carry flag is set
3	`stc; sbb %e?x, %e?x`	-1 (32-bit)
3	`lahf; cpuid`	zeroes in `eax`, `ebx`, `ecx`, `edx`
3	`imul $imm, %other, %e?x`	32-bit multiples of another constant in `other`

Addresses

syscall is free in holes with multiple arguments where you write output in a loop for each argument, since you can start the argument loop with the syscall.

%esp and %rsp can be used directly as addresses, and argument addresses can be popped off the stack.

Note that the program start address is actually present on the stack as part of the AT_ENTRY auxv entry (at 0xc8(%rsp) in argument-less holes); however, the offset from either rsp or argv[0] doesn't fit in an 8-bit immediate, and it's thus not clear if there is any short generic way of reading it (the naive mov $0xc8, %bl; mov (%rsp,%rbx), %ebx is 5 bytes, the same as simply loading the value with a mov).

Bytes	Instruction	Values	Condition
0-2	`syscall`	`rip`	in `rcx`, also sets `r11` to flags and `rax`
1	`pop %r?x`	argv[0]	must be 2nd pop instruction
2	`mov %esp, %e?x`	even <4G address usually after end of program but before other areas
2	`dec %e?x`	`0xffffffff`, odd <4G address after end of program	clear register
2	`sbb %e?x, %e?x`	`0xffffffff`, odd <4G address after end of program	if the carry flag is set
2-4	`syscall; mov %ecx, %e?x`	`rip`	also sets `r11` to flags and `rax`
3	`stc; sbb %e?x, %e?x`	`0xffffffff`, odd <4G address after end of program
3-5	`syscall; lea imm8(%rcx), %e?x`	`rip+imm8`	also sets `r11` to flags and `rax`
4	`bts $22, %e?x`	`0x400000`	clear register
4	`btc $22, %e?x`	`0x400000`	clear register
4	`mov $0x4?, %?h; bswap %e?x`	`0x4?0000`	clear register except `?h`
4	`lahf; shl $13, %eax`	`0x400000`	in `eax` clear register, clear flags
5	`mov $address, %e?x`	any address

Other patterns

Bytes	Instruction	Values	Condition
1	`cltd`	-1	in `edx`, if `eax` is negative
1	`lahf`	2 plus flags	in `ah`, if all clear or the flag register value otherwise

2	`inc %e?x`	1	clear register
2	`mov %esp, %e?x`	large value usable as address
2	`int $0x80`	-4 (64-bit)	in `eax`, if `eax` clear
2	`syscall`	0x202 plus flags	in `r11`
2	`pushf; pop %r?x`	0x202 plus flags	needs stack
2	`cpuid`	zeroes in `eax`, `ebx`, `ecx`, `edx`	for most values of `eax` != 0
2	`cpuid`	a bunch of constants	assuming `eax` = 0
2	`div %eax`	1 in `eax`	assuming `edx` = 0 and `eax` != 0
2	`rdtsc`	time-dependent value	in `eax` and in `edx`
2	`fnstsw %ax`	0 in `ax`
2	`mul %?[lh]`	0 in `ax`	`%?[lh]` must be zero
2	`mov %?s, %e?x`	0	preserves flags, `%?s`=`%ds`, `%es`, on some OSes `%fs`, `%gs`
2	`mov %cs, %e?x`	0x33	OS-dependent
2	`mov %ss, %e?x`	0x2b	OS-dependent

3	`dec %r?x`	-1 (64-bit)	clear register
3	`mov $imm, %al; cwtl`	8-bit values	in `eax`, if `ah` is clear
3	`lea $imm8(%other), %e?x`	32-bit values close to another constant in `other`
3	`stc; pushf; pop %r?x`	0x203 plus flags
3	`smsw %e?x`	0x80050033	only sets low 32 bits (in common OSes)
3	`rdpkru`	0x55555554	in `eax`, if `ecx` clear (OS-dependent?)
3	`str %e?x`	0x40	only sets low 32 bits (OS-dependent?)
3	`xgetbv`	0x207	in `eax`, if `ecx` clear (OS-dependent?)
3	`add $-imm, %e?x`	negative 8-bit values, sign-extended to 32-bit	clear register
3	`rdrand %e?x`	random value
4	`cltd; mov $imm, %dl`	8-bit values	in `edx`, if `eax` positive

4	`xor %e?x, %e?x; mov $imm, %?l`	8-bit values	in `eax`, `ebx`, `ecx`, `edx`
4	`lea $imm8(%other), %r?x`	64-bit values close to another constant in `other`
4	`imul $imm, %other, %r?x`	64-bit multiples of another constant in `other`
4	`bts $bit, %e?x`	32-bit powers of 2	clear register
4	`stc; sbb %r?x, %r?x`	-1 (64-bit)
4	`add $-imm, %r?x`	negative 8-bit values, sign-extended to 64-bit	clear register
4	`syscall; mov $imm, %cl`	any program offset in a 256-byte window	assuming syscall does nothing
4	`mov $5, %al; cpuid`	`eax` = `ebx` = 0x40, `ecx` = 3, `edx` = 0x11
4	`mov $6, %al; cpuid`	`eax` = 4, `ebx` = 0, `ecx` = 1, `edx` = 0

5	`bts $bit, %r?x`	64-bit powers of 2	clear register

6	`push $-imm32; pop %r?x`	negative 32-bit values, sign-extended to 64-bit	needs stack

7	`mov $-imm32, %r?x`	negative 32-bit values, sign-extended to 64-bit

LSL/LAR/VERR/VERW behavior

Input	LAR	LSL	VERR	VERW
0x20-0x23	0xcffb00	0xffffffff	ZF=1	ZF=0
0x28-0x2b	0xcff300	0xffffffff	ZF=1	ZF=1
0x30-0x33	0xaffb00	0xffffffff	ZF=1	ZF=0
0x78-0x7b	0x40f500	0x0-0xf	ZF=1	ZF=0
others	0	0	ZF=0	ZF=0

cpuid values (on code.golf machine)

Functions not listed return all zeroes.

eax(in)	eax	ebx	ecx	edx
`0`	`10`	`68747541`	`444d4163`	`69746e65`
`1`	`a60f12`	`100800`	`7ef8320b`	`178bfbff`
`5`	`40`	`40`	`3`	`11`
`6`	`4`	`0`	`1`	`0`
`b`	`0`	`0`	`ef`	`0`
`80000000`	`80000028`	`68747541`	`444d4163`	`69746e65`
`80000001`	`a60f12`	`0`	`75c237ff`	`2fd3fbff`
`80000002`	`20444d41`	`657a7952`	`2037206e`	`30303737`
`80000003`	`432d3820`	`2065726f`	`636f7250`	`6f737365`
`80000004`	`20202072`	`20202020`	`20202020`	`202020`
`80000005`	`ff48ff40`	`ff48ff40`	`20080140`	`20080140`
`80000006`	`5c002200`	`6c004200`	`4006140`	`1009140`
`80000007`	`0`	`3b`	`0`	`6799`
`80000008`	`3030`	`791ef257`	`400f`	`10000`
`8000000a`	`1`	`8000`	`0`	`1ebfbcff`
`80000019`	`f048f040`	`f0400000`	`0`	`0`
`8000001a`	`6`	`0`	`0`	`0`
`8000001b`	`bff`	`0`	`0`	`0`
`8000001e`	`4`	`102`	`0`	`0`
`8000001f`	`1`	`b3`	`0`	`0`
`80000021`	`62fcf`	`15c`	`0`	`0`
`80000022`	`7`	`84106`	`3`	`0`
`80000026`	`0`	`0`	`ef`	`4`

Hole-specific tips

See Spigot algorithms for several mathematical constant holes.

Special functionality

BCD integers via the FPU

It's possibly to convert an integer into a packed BCD (meaning that every 4 bits hold a decimal digit) with this code:

push %rax
fildq (%rsp)
fbstp (%rsp)
pop %rax

There is also fbld for the inverse operation.

This is some example code for writing integers, which is however longer than the alternatives listed above.

23-24: Any integer, value in memory, output pointer in any register, clobberable pointer in any register

fildl (%r.x)
fbstp (%r?x)

dec %esi
mov $'\n', (%rsi)

for_digit_in_num:
mov (%r?x), %al # can save 1 byte with xlatb if eax clear and r?x is rbx
and $0xf, %al
or $'0', %al
dec %esi
mov %al, (%rsi)  # or xchg if you use xlatb
shrl $4, (%r?x)
jne for_digit_in_num

Entering 32-bit mode (and 16-bit mode)

While by default programs run in 64-bit mode, i.e. with a 64-bit code segment, it is also possible to switch to 32-bit compatibility mode by executing a far jump/call/return to selector 0x23 (or using sysenter, which returns in 32-bit mode, but seems unavailable on codegolf's AMD processor), which will make available the 1-byte BCD instructions (AA*, DA*) and the 1-byte encodings for INC and DEC (while of course disabling all 64-bit features).

Unfortunately the shortest code sequence to do so seems to be 6-9 bytes. The base version is push $0x23; call to32; [...]; to32: .byte 0x48; to64: .byte 0xcb, using push $0x33; call to64 to return to 64-bit mode.

It can be shortened to 6 bytes as .section mytext1, "wax"; push $0x23; .byte 0xe8, 0x1; .section mytext2, "wax"; .byte 0x48, 0xcb which requires rax to be a valid pointer and an even number of bytes to be present in section mytext1 before this: this will result in the call executing 00 00 = add %al, (%rax) instructions until the far return in the next section; then the return will be one byte back, so 00 48 cb = add %cl,-0x35(%eax) will be executed, continuing execution after the far return once in 32-bit mode.

It is also possible to enter 16-bit mode the same way, by first using the modify_ldt syscall to set up a 16-bit code segment, and jumping there, although of course this is quite unlikely to be profitable.

Deleted solutions

jakobvarmose deleted all their solutions, which included several optimal assembly solutions.

These are known possible solutions that might be better than the current optimal solution.

Hole	Bytes
ASCII Table	74 ← 75 ← 76 ← 78 ← 80 ← 81 ← 82 ← 83 ← 94 ← 95 ← 97 ← 98 ← 100 ← 102 ← 288
Foo Fizz Buzz Bar	82 ← 83 ← 84 ← 86 ← 87 ← 88 ← 90 ← 92 ← 93 ← 94 ← 96 ← 97 ← 98 ← 99 ← 100 ← 101 ← 102 ← 103 ← 129
Quine	28 ← 29 ← 30 ← 31 ← 32 ← 33 ← 34 ← 35 ← 36 ← 38 ← 40 ← 41 ← 42 ← 43 ← 44 ← 45 ← 46 ← 48 ← 50 ← 64 ← 70 ← 73 ← 81 ← 85 ← 90 ← 113 ← 119 ← 121
Rule 110	51 ← 53 ← 54 ← 58 ← 60 ← 62 ← 68 ← 74
United States	77 ← 78 ← 79 ← 80 ← 81 ← 82 ← 90 ← 91 ← 92 ← 93 ← 94 ← 96 ← 135 ← 137 ← 138 ← 150 ← 151 ← 152 ← 154 ← 155 ← 156
Vampire Numbers	118 ← 120 ← 121 ← 123 ← 124 ← 126 ← 130 ← 131 ← 133 ← 137 ← 140 ← 141 ← 142 ← 153 ← 154 ← 175 ← 176 ← 177 ← 222 ← 230 ← 232 ← 239 ← 242 ← 244 ← 247 ← 373 ← 1083 ← 1084

Introduction to Assembly

Syntaxes

Instructions

Operands

Flags

Prefixes

Directives

I/O

Sample code

Golfing tips

Transferring values between registers

Avoiding the REX prefix

Avoiding the word-size prefix

Use the al register for arithmetic operations with immediates

Placing a 1 value into a register

Using argv[0] as memory

Zeroing registers

Zeroing edx

Placing values divisible by 256 in 16-bit size registers

Advanced golfing tips

General information

Program layout and startup

Program structure

Debugging

General instruction selection

Hole solving workflow

Note on preserving a C-like entrypoint

Options for main program tasks

Writing output from memory to stdout

Reading the input and writing the output in memory

Placing the output in memory

Common subroutines

Writing integers with a newline

Backwards output

17-20: Division loop with direct output

Constructing variants

17: for integers between 10 and 99, number in al

Forwards output

18: for any integer, number in eax, clobbers edx

17-18: for integers between 0 and 99, number in al/eax

12: for integers between 10 and 99, number in al

Parsing integers

Printing a string from a table of strings

With a table of start offsets, backwards (shortest)

With a table of start offsets

With terminators

Non-sequential strings

Fixed-length string table

String hashing

Divisibility check

Common tasks and optimizations

Control flow

Loops

Conditional jumps and flags

Unconditional short forward jumps

Non-short jumps

Function calls and bytecode/bitcode interpretation

Using instructions unusually

Aliasing/overlapping instructions

Unconditional short forward jumps / instruction disables

Aliasing pre-loop instructions with the loop (e.g. eliding arbitrary instruction immediates)

Jumping in the middle of an instruction (e.g. to set flags)

Saving leading or trailing 00 bytes in instructions

Optimizing arbitrary immediates / removing degrees of freedom

Short vs long numeric holes

Instructions

Floating point and the x87

Constructing constants in x87 registers

Effective addressing with ModR/M and SIB bytes

Copying or moving values between registers

Constructing constants

Common generic patterns

Common patterns for specific constants

Addresses

Other patterns

LSL/LAR/VERR/VERW behavior

cpuid values (on code.golf machine)

Hole-specific tips

Special functionality

BCD integers via the FPU

Use the `al` register for arithmetic operations with immediates

Zeroing `edx`

17: for integers between 10 and 99, number in `al`

18: for any integer, number in `eax`, clobbers `edx`

17-18: for integers between 0 and 99, number in `al`/`eax`

12: for integers between 10 and 99, number in `al`