Introduction to Assembly
Assembly languages describe the machine code of a computer in simple, human-readable syntax. In these languages, each machine instruction (i.e. a sequence of bytes that is executed by the processor) is represented by a single command. The assembly language on code.golf (hereinafter referred to as "Assembly") represents the machine language of the x86-64 architecture.
Syntaxes
There are two main branches of syntax used for Assembly: Intel syntax and AT&T syntax. The assembler used by code.golf, DefAssembler, supports both of these syntaxes, defaulting to AT&T syntax unless otherwise stated using directives. In this article, when a difference between the two syntaxes emerges, a table that illustrates this difference will appear.
Instructions
Each instruction in Assembly consists of a mnemonic (the instruction's name, e.g. add
, sub
, mov
) and a comma-separated list of operands. Note that in AT&T syntax, the order of the operands in each instruction is reversed.
The number, types, sizes, and purposes of the operands given to an instruction depend on its mnemonic. For example, the instruction push
accepts a single operand, the instruction mov
accepts 2, and the instruction imul
accepts anywhere from 1 to 3 operands. To understand what each instruction does and which operands it accepts, its mnemonic should be looked up in Volume 2 of the Intel 64 Software Developer's Manuals (I recommend using a more straightforward index, like Félix Cloutier's x86 instruction reference).
Operands
An operand is an argument passed to an instruction, for example which values to read and where to write the result. There are 3 common types of operands in Assembly (in reality there are about 10 types, but the ones not listed here are rarely used):
-
Registers - small units of data used to store temporary values on the processor. They can either be read from or written to, or in some cases both. In x86-64, there are 16 different registers, each consisting of 64 bits. It is also possible to read from and write to each register's lower 32, 16, or 8 bits (in some cases the top 8 bits of the 16 lowest bits can also be accessed).
Click here for a full list of registers
Name 64-bit 32-bit 16-bit 8-bit High 8-bit Notes Accumulator rax
eax
ax
al
ah
This register is implicitly multiplied and divided by the mul
anddiv
instructions. It is also used bylods
,stos
,scas
,xlat
andcmpxchg
, and forsyscall
it both selects the syscall ID and holds the return value of the syscall.Base rbx
ebx
bx
bl
bh
This register is used implicitly in the xlat
instruction.Counter rcx
ecx
cx
cl
ch
This register allows for short-form loops with the loop
/loope
/loopne
instruction, as well as short-form zero checks withjrcxz
/jecxz
. It is also used by string instructions with arep
/repe
/repne
prefix as a limit on how many times to execute the instruction. The register is overwritten bysyscall
.Data rdx
edx
dx
dl
dh
This register is often paired up with the accumulator, e.g. rdx:rax, to act as an extension of that register (for example in the mul
anddiv
instructions). It also serves as the third argument insyscall
.Source index rsi
esi
si
sil
This register is used by the movs
,lods
, andcmps
string instructions. It also serves as the second argument insyscall
.Destination index rdi
edi
di
dil
This register is used by the movs
,stos
,cmps
, andscas
string instructions. It also serves as the first argument insyscall
.Stack pointer rsp
esp
sp
spl
This register holds the address of the stack, a portion of the computer's memory dedicated to storing temporary values (written to with push
and read from bypop
). It is the only register that has a non-zero value at the start of the program, and in general, its value should not be manually changed. It is affected by thepush
,pop
,enter
andleave
instructions.Base pointer rbp
ebp
bp
bpl
This register typically serves as an address holder for the base of a stack frame (a portion of the stack holding the local memory of a function), but it may be used for any other purpose. It is used by the enter
andleave
instructions.r8
r8d
r8w
r8b
This register serves as the fifth argument to syscall
.r9
r9d
r9w
r9b
This register serves as the sixth argument to syscall
.r10
r10d
r10w
r10b
This register serves as the fourth argument to syscall
.r11
r11d
r11w
r11b
This register is overwritten by syscall
.r12
r12d
r12w
r12b
r13
r13d
r13w
r13b
r14
r14d
r14w
r14b
r15
r15d
r15w
r15b
AT&T syntax Intel syntax Registers are prefixed with a percent sign ( %
)Registers are written without a prefix E.g. %rax
,%ebx
,%cx
,%dl
E.g. rax
,ebx
,cx
,dl
-
Immediates - constant numbers that are stored as part of the instruction. They provide an immediate value that can be quickly read by the processor without accessing memory. They are written using arithmetic expressions, which can include number literals, string literals and/or symbols (macro-like static values that are stored as names for easy access).
AT&T syntax Intel syntax Immediates are prefixed with a dollar sign ( $
). Expressions not prefixed with a dollar sign are treated as memory offsets (see below).Immediates are written without a prefix. If the immediate contains a symbol name, however, the expression must be prefixed with the word OFFSET
; otherwise, it is treated as a memory offset (see below).E.g. $23
,$(15 + 6) * 2
,$symbol
E.g. 23
,(15 + 6) * 2
,OFFSET symbol
-
Memory expressions - calculated addresses which instruct the processor to read from and write to certain addresses in the computer's temporary memory (RAM). These consist of 3 parts, all of which are summed by the processor to calculate the final address:
- Displacement - a numerical value giving an offset to the address
- Base register - a 64-bit register whose value is added to the address
- Index register - another 64-bit register, whose value can also be scaled up by 1, 2, 4, or 8
Each of these parts is optional, however at least one is required to form a valid memory expression.
The syntax for a memory expression is as follows:
AT&T syntax Intel syntax <displacement>(<base>, <index>, <scale>)
<displacement>[<base> + <index> * <scale>]
E.g. 23(%rsi, %rax, 4)
E.g. 23[rsi + rax * 4]
Note that in some cases, the size of a memory expression may be ambiguous; if this is the case, the assembler will issue an error stating this. To disambiguate the size:
AT&T syntax Intel syntax The instruction's name should have a letter at the end specifying the size The memory expression should be prefixed with the size name E.g. incl (%rax)
E.g. inc LONG [rax]
Flags
In addition to the general-purpose registers, x86-64 processors also store a "flags" register. This register holds an array of bits, each of which represents a flag - a boolean value indicating some useful information about the results of the previous instruction. These flags are:
- Carry - set to 1 if the last instruction caused an arithmetic carry that was truncated in the result (e.g. with the
shl
instruction the most-significant bit may be shifted out of a register, causing this flag to turn on). This flag can be manipulated usingstc
,clc
andcmc
. It is also read byadc
,sbb
,rcl
, andrcr
. - Zero - set to 1 if the last instruction generated a 0 value.
- Overflow - set to 1 if the truncation done by the last instruction caused the sign bit (the most-significant bit) to erroneously flip over.
- Sign - set to the sign bit (the most-significant bit) of the value generated by the last instruction (equivalently, set to 1 if the last instruction generated a negative value).
- Direction - determines if the string instructions should increase (0) or decrease (1) the
rsi
andrdi
registers. This flag can be manipulated usingstd
andcld
. - Parity - set to 1 if there is an even number of 1 bits in the value generated by the last instruction.
Conditional branching in Assembly is done through conditional jump instructions, which jump to the given location depending on the flags (e.g. jz
(j
ump if z
ero) will jump only if the zero flag is set to 1). Some conditional jumps also check multiple flags to form simpler conditions; these can be paired with the cmp
instruction to create simple arithmetic conditions. For example, the following program will halt if and only if edx
is less than or equal to ebx
:
AT&T syntax | Intel syntax |
---|---|
|
|
Note that not all instructions cause the flags to change; for example, while arithmetic instructions (e.g. add
, sub
, xor
etc.) always affect the flags, mov
, push
, pop
and xchg
do not affect any flags.
The flags register can be manually read from and written to with pushf
, popf
, lahf
, and sahf
, however this is generally not recommended.
Prefixes
In some cases, a prefix may also be added before the mnemonic. Prefixes give the processor more info on how to execute the instruction. For example, in repne scasb
, the instruction scasb
(which increments rdi
and compares the byte at the address pointed by it to al
) will execute in a loop so long as al
and the byte at rdi
are not equal (rep
eat while n
ot e
qual). In other words, it will find the first byte after the address rdi
that is equal to al
.
Directives
Along with executable code, it's also possible to embed raw data in an Assembly program. These are sequences of bytes (e.g. strings, integers, arrays) that will be loaded into the program's memory along with the code; they may be written to and read from by the program's instructions. This data can be added through the use of directives. These are pseudo-instructions that tell the assembler (the program that turns the assembly code into an executable file) to do certain things besides encoding instructions. For example, the following directives generate, in order, a byte, a sequence of words, a quad-word, and a string:
AT&T syntax | Intel syntax |
---|---|
.byte 5 |
db 5 |
.word 0, 1, 1, 2 |
dw 0, 1, 1, 2 |
.quad 0x0123456789ABCDEF |
dq 0123456789ABCDEFh |
.ascii "Hello, world!" |
db "Hello, world!" |
Directives can also be used to divide the bytes of a program into sections, which are distinct regions of memory that have different permissions (e.g. the .text
section can be read, executed, and modified, the .data
section can be read and modified but not executed, and the .rodata
section can only be read):
AT&T syntax | Intel syntax |
---|---|
.text |
section .text |
.data |
section .data |
.section .rodata |
section .rodata |
Finally, directives can be used to set the syntax of a program: .intel_syntax
sets the syntax to Intel, and .att_syntax
sets the syntax to AT&T. These can be used anywhere within the program and in any order.
I/O
Input and output are not defined by Assembly; they way they are handled depends on the operating system the program runs under. In code.golf, the assembly programs are run under Linux, which uses system calls (which are executed via the syscall
instruction) for program output. For a list of available system calls see the Linux System Call Table for x86-64 (note that sys_write
is the only one you really need to know on code.golf). On Linux, the arguments passed to a program appear at the top of the stack when the program begins; to get the next argument, simply pop
it off the stack (the popped value is an address pointing to the argument string).
Sample code
AT&T syntax | Intel syntax |
---|---|
|
|
Golfing tips
Transferring values between registers
The shortest way to move a value from one 64-bit register to another is using push
and pop
:
AT&T syntax | Intel syntax |
---|---|
|
|
For any registers except r8
-r15
(for which an additional REX byte is required), this will result in a 2-byte encoding.
If the registers are 32-bit and one of them is eax
, this may be shortened to a single byte with an xchg
instruction:
AT&T syntax | Intel syntax |
---|---|
|
|
Avoiding the REX prefix
An additional byte called the REX prefix is added in certain encodings. It can be easily identified, as it always has the form 4X
(where X
is between 0 and F). It provides auxiliary information about the registers used in an instruction. You should usually seek to avoid it.
The following cases cause a REX prefix to appear:
- Usage of the
r8
-r15
registers - Usage of the
spl
,bpl
,sil
anddil
registers - Usage of 64-bit registers/values (except in certain instructions, such as
push
andpop
)
Avoiding the word-size prefix
Additionally, instructions that use 16-bit registers entail the encoding of a 66
prefix; for this reason they should generally be avoided.
Use the al
register for arithmetic operations with immediates
For arithmetic instructions (e.g. add
, sub
, and
, or
, xor
), the 8-bit size form of the instruction with an immediate and a register (e.g. add cl, 4
, xor bl, 20
), is 1 byte shorter if the register being used is al
.
AT&T syntax | Intel syntax |
---|---|
|
|
Placing a 1 value into a register
In holes that have no arguments, the value at the top of the stack (argc) has the value 1. This makes it convenient to pop
into registers where this value is significant (typically rdi
, as it's used for write
syscalls as the file descriptor).
Using argv[0] as memory
The second value from the top of the stack holds a pointer to argv[0]
, which is a string containing the program's name (in DefAssembler this is always /tmp/asm
). This memory area can be used for relatively small memory operations, such as printing numbers.
AT&T syntax | Intel syntax |
---|---|
|
|
Zeroing registers
The shortest way to set a register's value to 0 is xor
ing or sub
ing it with itself, as in:
AT&T syntax | Intel syntax |
---|---|
|
|
Note that specifying a 64-bit size in this case has no effect; writing to a 32-bit size register will always set the top 32 bits of that register's 64-bit counterpart to 0.
Zeroing edx
The edx
register is a special case; it can be zeroed using a single byte instruction, namely cdq
(cltd
in AT&T syntax), if the value in eax
is non-negative (if it is negative, edx
will be set to -1). This is especially useful before performing a div
instruction, where the value in edx
may affect the result of the operation.
Placing values divisible by 256 in 16-bit size registers
Occasionally, it is useful to place values divisible by 256 (or just arbitrary large values) in certain registers. This can be made shorter by writing to the register's high-byte part.
Instead of:
AT&T syntax | Intel syntax |
---|---|
|
|
Use:
AT&T syntax | Intel syntax |
---|---|
|
|
Advanced golfing tips
General information
Program layout and startup
The program as a whole is always loaded at 0x400000
. However, it's possible to set a start address with .globl _start; _start:
and it's possible to create extra sections to move the actual bytes in the solution with .section
directive (you need to use .section <name>, "wax"
to make them readable, writable and executable). Up to 70 custom sections seem to be supported, allowing to have a section that starts at addresses up to 0x448000
(it's not clear why more than 70 sections don't seem to work); this can be useful if a shorter way of initializing an address is available for such values. Common symbols up to around .comm x, 0xf78000000
(not clear why there seems to be such a limit) can be declared to enlarge the bss section and thus make a lot of address space usable at no code size cost, but it seems impossible to have any data be placed after the bss section. Note that .text (i.e. the program code) is writable by default unlike normal Linux executables.
Programs are ran as usual by the Linux kernel, meaning that at the start the registers are zeroed except for %esp (as better described in the x86-64 psABI), the stack contains argc (1 for no argument), then argv[0] which is always "/tmp/asm" and then the pointers to the arguments terminated by 0, followed by another 0 due to no environment being present, then the ELF auxiliary entries. The arguments starting with argv[0] are laid out sequentially in memory separated by '\0', starting at a variable offset after the end of the ELF auxilary entries on the stack and after the arguments, the null-terminated executable name, "/tmp/asm" is present, followed by 8 null bytes and then the stack ends and access past that point will cause a page fault.
Program structure
The example code is quite a poor example, because it doesn't matter what happens after you write correct output (as long as you don't randomly make extra correct write syscalls), so you don't need to call exit or in fact usually do anything after making the write syscall (if you do, then use hlt
).
You also usually don't need to terminate the argument loop in any way and can thus usually ignore argc, as your solution will generally crash after the last argument since you will pop a null pointer, which is perfectly fine. Furthermore, sometimes you only need to pop and ignore the argument count since it can be fine to pass argv[0] to your argument logic.
You can either compose the whole output in memory and write it, or use multiple write calls.
Debugging
To debug code, add an hlt
or ud2
instruction to have the evaluator fail at that point, dumping register values. If you want to inspect anything, move it to the r8-r15 registers. You can also run the code locally under a debugger such as gdb.
If you see "exit status 31", you probably made a system call with incorrect value.
General instruction selection
Optimal code will usually be composed mostly of 1 and 2-byte instructions, and 3+ bytes instructions should be checked for optimization opportunities.
It's best to avoid the address size, operand size and REX prefixes when possible, which means trying to not use extra x86-64 registers, avoiding the address size modifier (which is pretty much useless anyway) by specifying 64-bit registers in address, trying to avoid 16-bit operations and 64-bit operations that need REX (pretty much all of them except for push
, pop
, loop
, jrcxz
)
Registers should of course be selected to minimize moves and minimize instruction length (in particular, operations with al
and immediates are shorter). System calls and several instructions use specific registers, so rename your registers accordingly. Remember that syscall
clobbers rcx
(and rax
and r11
).
These instructions have been reported by either CatsAreFluffy
, JakobVarmose
or lyphyser
as being used in their solutions: adc, add, aesenclast, and, bsf, bswap, bt, btr, bts, call, cbtw, clc, cld, cltd, cmc, cmovCC, cmp, cmpsb, cmpxchg, cpuid, cqto, crc32, dec, div, hlt, idiv, imul, inc, int $0x80, jCC, jmp, jrcxz, lahf, lea, lodsb, lodsl, lodsq, lodsw, loop, loope, loopne, mov, movsb, movsbl, movsbq, movsl, movsq, movsw, movsxd, movups, movzbl, movzwl, mul, neg, not, or, pcmpistrm, pdep, pop, popcnt, popf, push, pushf, rcl, rcr, rdrand, rep, repe, repne, ret, rol, ror, sahf, sar, sbb, scasb, scasl, scasq, scasw, setCC, shl, shr, stc, std, stosb, stosl, stosw, sub, syscall, test, xadd, xchg, xlatb, xor
Others potentially useful could be: andn, bextr, blsi, blsmsk, blsr, bsr, btc, bzhi, cmpxchg16b, lzcnt, movbe, mulx, pext, sarx, shld, shlx, shrd, shrx, smsw, tzcnt, xgetbv, xrstor, xsave
The following directives can be useful: .ascii, .asciz, .byte, .comm, .data, .globl _start, .long, .word, .section
SSE and VEX/EVEX-encoded instructions can perhaps be useful in some cases.
Hole solving workflow
- If the algorithm is complex or you are unsure about it, you may write a solution in C (or perhaps Rust or Zig) first until it works. You can also compile it to assembly to get a first version, even though this is less useful than it seems as you will have to mostly rewrite it anyway to be competitive. When compiling, try both gcc and clang, try both 64-bit and 32-bit (to force it to not use the 64-bit registers that need a prefix) and try different optimization levels (usually -Oz is best since it's designed to aggressively optimize code size).
- Find the most similar hole you have solved and copy the code, removing the hole-specific details
- Write your solution until it works (this is the "draw the rest of the owl" part). When writing code, mark any instructions that may not be necessary or could be shortened with a comment (mostly instructions that clear registers or initialize them and could be shortened if the register happens to be clear or have the upper bits clear). Possibly write a comment describing what registers are used for. This might involve writing and running an extra program to generate data if applicable.
- If your solution unexpectedly doesn't work, debug it as described in the "debugging" section and correct it until it works
- For each of the instructions you marked and any others you notice, try to delete them one by one and see if the solution still works (you could also write a script that tries to delete every single instruction in the program)
- Check the code for any extra unnecessary address size, REX prefix or debugging code and remove them
- Check if it is possible to shorten the code by reallocating registers
- Check any instruction longer than 2 bytes for optimization opportunities, and look for other optimization opportunities like the ones described in this page
- Check any unconditional jumps, reorganize the code flow so that they jump forwards, and then replace them with disabling bytes as described in "unconditional forward jumps"
- Check whether any constant data can be changed to start or end within another constant, within the program code or within the zeroes before and after a section (you may need to create extra sections to have zeroes before)
- Check for any jumps that are longer than 2 bytes due to having a displacement outside -128 to 127 and fix them as described in "Long jumps"
Options for main program tasks
Writing output from memory to stdout
There are 3 ways to make a write
syscall: syscall
, int 0x80
and sysenter
.
syscall
is the default one, works with 64-bit addresses, and needs eax
= edi
= 1, esi
= buffer, edx
= length. It sets rcx
to the return rip
value and r12
to the return flags value, and rax
to the return value (which is the length assuming it succeeded)
int 0x80
is the other main one, only works with 32-bit addresses, and needs eax
= 4, ebx
= 1, ecx
= buffer, edx
= length. It doesn't support 64-bit addresses, but has the advantage of not clobbering extra registers, which can be better if you call it in a loop.
sysenter
, which is probably not useful, is like int 0x80, but it returns to an address in the vdso with code that pops three registers and returns.
In argument-less holes, the 1 constant can be created with a simple pop rax
. The length can be computed by subtraction, or alternatively the length can be constant and the start can be computed by subtraction if needed. When starting the output at -1 (32-bit), if an extra byte is written, the length can then be computed by simply copying the output end pointer to the rdx
register (since the subtraction would be subtracting zero after adjusting for the extra byte written).
Reading the input and writing the output in memory
There are a few ways to access memory sequentially, to read the input or write the output. Usually string instructions are the best for reading input, while the output depends on the exact hole.
- Use
lods
to read values fromrsi
intorax
, orscas
to compare values fromrdi
withrax
orstos
to write values intordi
, possibly backwards withstd
. 1 byte per 8-bit or 32-bit reads, 2-byte for 16-bit or 64-bit reads, but has register restrictions, doesn't set flags - Set a register to the address and use simple reads/writes (possibly with vector instructions) and increment/decrement it. If you are reading flags, you can compare with
cmp
or you can get both value and flags adding or oring into a zeroed register. If you are writing, you can usexchg
to zero the register you are writing from. Needs 4 bytes for 8-bit access+increment (withinc
/dec
), 5 bytes for 32-bit access+increment, 6 bytes for 16-bit and 64-bit access/increment. You can save a byte in larger than 8-bit accesses if you can userax
as index register and incrementing withadd al
somehow works. One extra byte if reading from a 64-bit address for the REX prefix on the increment. - Instead of addressing with a single register, use
(%r?x,%r.x)
addressing with two registers. Uses an extra byte for the SIB addressing, but keeps the start and index around, and can let you make the write system call with no register setup - Point
rsp
to the start of the data and usepop
to read 64-bit or 16-bit chunks forwards orpush
to write 64-bit or 16-bit units backwards. Needs 1 byte for 64-bit access, 2 bytes for 16-bit accesses, 6 bytes for 8-bit accesses, 7 bytes for 32-bit accesses.
If writing backwards, not using a separate index registers can save code in the syscall by using rsi
as the register. If writing forwards, using a separate index register can save code in the write syscall if using rsi
as the immutable start offset,
Placing the output in memory
The output can be put in many places, which are mostly chosen so that initializing the address is cheaper than the simple 5-byte immediate load. Also, if you have a precomputed table, you may want to put the table at the end of the program, followed by the output buffer with index-based addressing, so that a single register can access both the table and the output.
- At 4GB, going backwards. This requires 0 bytes for initialization using an already zeroed register if going backwards. You need to write
.comm x, 0xf78000000
or similar to reserve a common symbol to make all the 32-bit address space after0x400000
usable. - At 4GB, going forwards. This requires 2 bytes for initialization with
decl
, uses 64-bit addresses. You need to write.comm x, 0xf78000000
or similar to reserve a big enough common symbol. - At the lower 32-bits of the stack pointer. You need to write
.comm x, 0xf78000000
or similar to reserve a big enough common symbol. This requires 2 bytes for initialization (mov %esp, %e?x
) or 0 bytes plus 1 for each access due to the address override prefix if using %esp. Compared todec %e?x
, has the advantage of staying in 32-bit space, but can rarely fail if %esp is too low and doesn't allow some tricks to save instructions when computing the length - In the argument area on the stack. This requires 1 byte for initialization (
pop %r?x
), 2 if not popping argc or 0 if already popping argv[0] - On the stack from the stack pointer. This requires 0 bytes for initialization if using
rsp
as index, or 2 bytes to moversp
to another register (usingpush %rsp; pop %r?p
) - At the end of the program code. This requires 2 bytes for initialization in
rcx
with the syscall trick (use.globl _start; _start
right before the finalsyscall
and thenjmp
to the actual code; thesyscall
will setrcx
to the address after thesyscall
), 4 with the syscall trick plus a move, or 5 bytes for a 32-bit immediate load. - Overwriting the program code. This requires 2 bytes for initialization in
rcx
with the syscall trick (start the program with asyscall
), or 4 with the syscall trick plus a move or withbts $22, %reg
- At some other easily constructible 32-bit address, using
.comm x, 0xf78000000
to make it valid. The main way is to get the one address with eithermov %esp, %e?x
ordec e?x
and then double it withlea (%r?x, %r?x), %e?x
or similar. There are a lot of other ways to do this, such as the 3-byteadd $-128, %reg
, 4-bytebts $23, %reg
, the 2-bytecpuid
, 3-byteimul imm, reg, reg
, etc.
Note that placements on the stack and at 4GB going forwards consist of 64-bit addresses that don't fit in 32 bits, which means that you must use syscall instead of int 0x80 and you must increment/decrement with REX prefixes or string instructions.
Common subroutines
Writing integers
General code structure
When writing integers, it's usually a good idea to write the output backwards, since it makes the routine easier.
Here is an example of a program structure to do so:
# required at the start of the file
.comm x, 0xf78000000
# sample code
mov $123456789, %eax
### insert code to print integers here
# sample code to print it
mov $1, %al # or pop %eax if the hole is argumentless
mov %eax, %edi
sub %esi, %edx
syscall
For any integer
Output pointer in esi
cltd # only needed if edx != 0
mov $10, %bl
dec %esi
mov %bl, (%rsi)
for_digit_in_num:
div %ebx
dec %esi
add $'0', %dl
xchg %dl, (%rsi)
test %eax, %eax
jnz for_digit_in_num
If the output is overwritten multiple times, the xchg needs to be possibly replaced and augmented with cltd
to clear edx
.
Output pointer in ecx
cltd # only needed if edx != 0
mov $10, %bl
dec %ecx
mov %bl,(%rcx)
digits:
# can also use 8-bit division if number is <= 2559
div %ebx
add $'0',%dl
xchg %dl,-1(%rcx)
test %eax,%eax
loopne digits
This is to be used with int $0x80
. Saves one byte compared to the syscall/esi version, but uses ecx
that is often used as a loop counter instead.
For integers between 1 and 2^32-1
cltd # only needed if edx != 0
mov $10, %bl
mov $'\n'-'0',%dl
# turn div %ebx into test %dh, %bh, use first iteration to print the \n
.byte 0x84
for_digit_in_num:
div %ebx
dec %esi
add $'0', %dl
xchg %dl, (%rsi)
test %eax, %eax
jnz for_digit_in_num
If the output is overwritten multiple times, the xchg needs to be possibly replaced and augmented with cltd
to clear edx
.
For integers between 0 and 2559 (useful if edx
!= 0)
Output pointer in esi
mov $10, %bl
dec %esi
mov %bl, (%rsi)
for_digit_in_num:
div %bl
dec %esi
add $'0', %ah
xchg %ah, (%rsi)
test %al, %al
jnz for_digit_in_num
If the output is overwritten multiple times, the xchg needs to be possibly replaced and augmented with cltd
to clear edx
.
Output pointer in ecx
mov $10, %bl
dec %ecx
mov %bl, (%rcx)
digits:
div %bl
add $'0', %ah
xchg %ah, -1(%rcx)
test %al, %al
loopne digits
This is to be used with int $0x80
. Saves one byte compared to the syscall/esi version, but uses ecx
that is often used as a loop counter instead.
For integers between 0 and 255
Number in al
, bl
= 0
or $10, %bl # clear CF and OF
mov %bl, %ah
# turns add $'0',%ah; cmp $1,%al into jo 0x13c30c4
.byte 0x0f
for_digit_in_num:
add $'0', %ah
cmp $1, %al # set CF if al == 0
dec %esi # preserves CF
xchg %ah, (%rsi) # preserves CF
div %bh # preserves CF (undocumented)
jnc for_digit_in_num
Number in bl
(or cl
or dl
), eax
= 0
mov $10, %bh
or %ebx, %eax # clear CF and OF
# turns add $'0',%ah; cmp $1,%al into jo 0x13c30c4
.byte 0x0f
for_digit_in_num:
add $'0', %ah
cmp $1, %al # set CF if al == 0
dec %esi # preserves CF
xchg %ah,(%rsi) # preserves CF
div %bh # preserves CF (undocumented)
jnc for_digit_in_num
For integers between 1 and 255
Number in al
mov $10, %bl
mov $'\n'-'0',%ah
for_digit_in_num:
add $'0', %ah
dec %esi
xchg %ah, (%rsi)
div %bl
# break if both the quotient and remainder are 0
test %eax, %eax
jnz for_digit_in_num
If the output is overwritten multiple times, the xchg needs to be possibly replaced and augmented with an instruction to clear ah
.
Number in bl
(or cl
or dl
)
mov $10, %bh
mov %ebx, %eax
# set ah = bh = 10 = '\n'
# turn add $'0', %ah into add $'0', %r12b
.byte 0x41
for_digit_in_num:
add $'0', %ah
dec %esi
xchg %ah, (%rsi)
div %bh
# break if both the quotient and remainder are 0
test %eax, %eax
jnz for_digit_in_num
Parsing integers
This code will parse a tuple of integers separated by space and terminated by newline; variants are possible.
# mov $..., %cl
for_number_in_input:
# use mov %edx, %e?x for just 2 numbers
cltd
for_digit_in_input_number:
lodsb
sub $'0', %al
jb break_digit_in_input_number
imul $10, %edx, %edx
add %eax, %edx
jmp for_digit_in_input_number
break_digit_in_input_number:
push %rdx # delete this for just 2 numbers
# or loop for_number_in_input
# can often just use jp here instead of cmp; jne
cmp $'\n'-'0', %al
jne for_number_in_input
# pop as many numbers as are present in the input
# or have them already in %edx and %e?x for just 2 numbers
pop %rbx
pop %rdx
Printing a string from a table of strings
The version with backwards start offset is shorter if it needs to be used once, while the forwards one should be shorter if called multiple times, and the one with terminators might be best in special conditions.
With a table of start offsets, backwards (shortest)
S: # must be at offset 0
s2: .ascii "ef"
s1: .ascii "bcd"
s0: .ascii "a"
s_:
.globl _start
_start:
for_arg_in_args:
syscall
rip:
lea S-rip(%rcx), %eax
# put index in edx, can also reallocate register
first_value = 0
movzwl table-rip-first_value(%rcx,%rdx), %ecx
sub %ch, %cl
xchg %al, %ch
xchg %eax, %esi
rep movsb
[...]
jmp for_arg_in_args
table: .byte s_ - S, s0 - S, s1 - S
# omit last entry since it is zero
With a table of start offsets
S: # must be at offset 0
s0: .ascii "a"
s1: .ascii "bcd"
s2: .ascii "ef"
s3:
table: .byte s0 - S, s1 - S, s2 - S, s3 - S
.globl _start
_start:
syscall
rip:
mov %ecx, %esi
# put index in eax, can also reallocate register
first_value = 0
mov table-rip-first_value(%rcx,%rax), %edx
movb %dl, %sil
movzbl %dh, %ecx
sub %dl, %cl
rep movsb
esi
can be computed in many other equivalent ways, and it's possible to add the byte instead of replacing the lower byte; however, all variants seem to be the same code size, and this variant allows to only do the mov %ecx, %esi
once if the code is executed multiple times.
With terminators
s0: .asciz "a"
s1: .asciz "bcd"
s2: .asciz "ef"
.globl _start
_start:
syscall
rip:
mov %ecx, %esi
# or lea s0(%rcx), %esi if the table doesn't start at 0
# put index in ecx, you'll need a mov for other registers
for_earlier_str_in_table:
for_byte_in_earlier_str:
lodsb
test %al, %al
jne for_byte_in_earlier_str
loop for_earlier_str_in_table
# turn stosb into test $0xaa, %al
.byte 0xa8
for_byte_in_str:
stosb
lodsb
test %al, %al
jne for_byte_in_str
Another approach:
push %rdi
mov %ecx, %edi
# or lea s0(%rcx), %esi if the table doesn't start at 0
# put index in eax
for_earlier_str_in_table:
mov $0xff, %cl
repne scasb
dec %eax
jne for_earlier_str_in_table
mov %edi, %esi
pop %rdi
# turn stosb into test $0xaa, %al
.byte 0xa8
for_byte_in_str:
stosb
lodsb
test %al, %al
jne for_byte_in_str
Non-sequential strings
The approaches above require to store the strings sequentially: removing this constraint costs an extra byte per string, but can save memory if several strings are substrings of each other, or otherwise present in other constants.
It is possible instead to use 2 bytes per string for start and length and adapt the routines by adding a scale of 2 in the address computation and removing the length subtraction (or in the terminated case, remove the first loop and instead directly load the start).
Fixed-length string table
Another approach is to reserve a constant number of bytes in the string table, allowing to compute the start address by multiplication or preferably with SIB byte scale; if the strings are not fixed-length, then an additional table storing the lengths can be used.
This will of course be profitable if the strings are constant length or close to it.
String hashing
String hashing can be done with crc32
or with FNV hashing.
Divisibility check
In addition to using div
, it is possible to use a subtraction loop to check for divisibility, like this:
divisibility_loop:
sub %e?x, %e?x
je is_divisor
jns divisibility_loop
Or this:
divisibility_loop:
sub %e?x, %e?x
js not_divisor
jne divisibility_loop
Instead of:
cltd # if edx != 0
div %e?x
test %edx, %edx
je is_divisor
Advantages compared to division:
- Saves one byte if
cltd
would be required for division - Can use any register or memory location for the number to divide rather than having to use
eax
- Can divide by immediates without extra instructions to load them
Disadvantages compared to division:
- Doesn't produce a quotient unless 2-4 more bytes are used
- Doesn't produce a nonnegative remainder (but rather a nonpositive one) unless 2 more bytes are used
- Execution time is proportional to quotient rather than constant
Common tasks and optimizations
Control flow
Loops
There are 3 main ways to loop for a given number of iterations:
- Backwards with
loop
. This is the shortest, but can only loop backwards inecx
with 0 excluded, and doesn't set flags in the decrement. You can also get a conditional exit for free withloopne
/loope
. - Backwards without
loop
(usingdec
orsub
to set flags for the comparison). This can save one byte if not usingeax
as the loop since initialization with an immediate is cheaper than comparison with an immediate. - Forwards without
loop
(usingcmp
to set flags for the comparison)
It's also possible to loop until a stop condition (if there is no natural condition, it's possible to match one of the bytes in the last output, ideally checking %al since it's shortest):
- With a normal conditional jump
- Using
loopne
to get a freercx
/ecx
decrement - note that you can use .byte 0x67 before theloopne
to get a 32-bit decrement (this goes well with backwardsint $0x80
output starting at 4GB)
Conditional jumps and flags
It's possible to shorten conditional branches in several ways:
- Put the value in
rcx
and usejrcxz
orloop
- Replace a
mov
into a cleared register with anadd
,or
,xor
or possiblyadc
so that flags are set for free - Reuse flags generated from a previous operation instead of doing
test
orcmp
test
a register with itself instead of usingcmp
with 0 when checking for 0- Use
shr $1
,ror $1
orrcr $1
to check the lowest bit instead oftest $1
, if possible.shr $1
can be useful because it also clears it. - Use
loope
orloopne
to get a freercx
/ecx
decrement in addition to a je/jne jump - note that you can use .byte 0x67 before theloopne
to get a 32-bit decrement
Flags can be saved and restored with pushf
/popf
or sahf
/lahf
.
Sometimes it is possible to restructure comparison to use the carry flag (e.g. checking for zero with cmp $1, %e?x
and then using the carry flag), which allows to preserve flags across dec
, inc
, div
and idiv
.
A 2-byte loop dummy; dummy:
instruction can be used to decrement rcx
preserving flags.
If using multiple conditional jumps, try to find a single instruction (usually a cmp/subtraction) with an appropriate value that can set flags suitable for all of them at once.
The parity flag can be used to check whether there is an even or an odd number of 1 bits in the lowest byte.
The sign flag can often be used to check for special values.
The div
and idiv
instruction are documented to have undefined effect on flags: in fact, it seems they clear all flags except CF, which is preserved. The imul
instruction is documented to have an undefined effect on non-CF/OF flags; in fact, it seems that it preserves them.
Note that a single cmp
instruction with x
lets you discriminate, with only conditional jumps, between x - 2
(jb/jl/js and jnp), x - 1
(jb/jl/js and jp), x
(je), x + 1
(ja/jg and jnp) and x + 3
(ja/jg and jp).
Among range of values, a cmp
with x
lets you discriminate between numeric intervals like this (assuming x
between 1 and 0x7f):
0..x-1 | x | x..0x7f | 0x80..(0x7f+x) | (0x80+x)..0xff |
---|---|---|---|---|
CF, SF | ZF | OF | SF | |
jb | jz | jg | jo | jnb+js |
jnz | jz | jnz | jnz | jnz |
jc | jnc | jnc | jnc | jnc |
jb | jae | jae | jae | jae |
jbe | jbe | ja | ja | ja |
jle | jle | jg | jle | jle |
jl | jge | jge | jl | jl |
jno | jno | jno | jo | jno |
js | jns | jns | jns | js |
For negative x
= 0x100 - y
:
0..x-0x81 | x-0x80..0x7f | 0x80..x | x | x..0xff |
---|---|---|---|---|
0..0x7f-y | 0x80-y..0x7f | 0x80..-y | -y | -y..0xff |
CF | CF, SF, OF | CF, SF | ZF | |
jna+jns | jo | jl | jz | ja |
jnz | jnz | jnz | jz | jnz |
jc | jc | jc | jnc | jnc |
jb | jb | jb | jae | jae |
jbe | jbe | jbe | jbe | ja |
jge | jge | jl | jge | jge |
jg | jg | jle | jle | jg |
jno | jo | jno | jno | jno |
jns | js | js | jns | jns |
In addition, the parity flag further splits each interval into two sets.
Applying the above, you can for instance:
- Match 0, 1, 2, 3, 5, 0x80, 0x81, 0x82, 0x83 with a single cmp with 2, then 4, 6, 7, 8, 10, 0x85, 0x86, 0x87, 0x88 with a cmp with 7 and so on
- Match 0x7a, 0x7b, 0x7c, 0x7d, 0xfa, 0xfb, 0xfc, 0xfd, 0xff with a single cmp with 0xfc.
Unconditional short forward jumps
The simplest way to jump forward unconditionally is the 2-byte jmp
instruction. However, it can often be done in one byte by outputting a byte that causes the machine code in the next instruction to change to an instruction that does nothing relevant, possibly setting flags desirably.
It's also of course even better to reorder the code to eliminate unconditional jumps. If that's impossible, it's good to see if there are opportunities to reorder the code so that the previously described optimization can be applied.
The following generally applicable disabling bytes are available:
- 1-byte:
test $imm8, %al
ormov $imm8, %?[lh]
- 4-byte:
test $imm32, %eax
ormov $imm32, %e?x
- Single instructions: REX prefix (usually)
In general, a great way to find such bytes is to try them all, since there are only 256 choices. Here is a script to do so:
#/bin/bash
#asm="$(</dev/stdin)"
asm="$1"
out="$(mktemp)"
if ! as --64 -o "$out" <<<"$asm"; then
echo "Input fails to assemble" >&2
exit 1
fi
disasm="$(objdump --no-show-raw-insn --no-addresses -w -d "$out"|grep -E '^\s+')"
re="$2"
if test -z "$re"; then
re='\(bad\)|.byte|\.\.|'"$(tr '\n' ';' <<<"$disasm"|sed -re '
s/\s+/ /g
s/\s*;\s*/;/g
s/^;*\s*//
s/\s*;*$//
s/;/\n/g
s/[^^[:space:]]/[&]/g
s/\n/|/g
s/([a-zA-Z0-9_]\])\s+(\[[a-zA-Z0-9])/\1\\s+\2/g
s/([a-zA-Z0-9_]\])\s+(\[[a-zA-Z0-9])/\1\\s+\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[a-zA-Z0-9])/\1\\s*\2/g
s/([a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/([^a-zA-Z0-9_]\])\s*(\[[^a-zA-Z0-9])/\1\\s*\2/g
s/\s/\\s*/g
s/\^/\\^/g
')"
fi
echo "All except matching: $re"
for((i=0;i<256;++i)); do
echo ".byte $i"$'\n'"$asm"|as --64 -o "$out";
dis="$(objdump -w -d "$out")"
if ! grep -qE "$re" <<<"$dis"; then
sed -e '/^Disassembly/d; /^\s*$/d; /:\s*file format/d;' <<<"$dis"
fi
done
Non-short jumps
In long programs, some jumps may have a displacement outside the -128 to 127 bytes range, which will result in a 5-byte instruction instead of a 2-byte one.
You can try to solve the issue as follows:
- Rearrange (or optimize!) the program so that you don't have jumps that are so long
- Change the jump to jump to another jump so that both jumps are short. This only needs 2 or 4 bytes, but you need an existing jump to the same destination or a place to put the jump without needing to jump around or disable it.
- Compute the jump destination by setting the lower byte in a register holding an instruction pointer and then using the "jCC *%r?x" jump instruction variant. This is particularly useful for the backward jump that closes the argument loop.
- Save the jump destination by executing a syscall earlier at the jump destination and then using
jCC *%rcx
- Have a data table start in the trailing
ff
bytes of the jump
Function calls and bytecode/bitcode interpretation
Function calls are in most cases not efficient, since it takes 5 bytes for call symbol
, plus possible argument setup: it is often better to structure repeated tasks as a loop and read the input values from an arrays.
If using function calls anyway, 2-byte call *%r?x
is preferable to 5-byte call symbol
, and so it can be efficient to, using the syscall trick, set up a register pointing to program data and offset it so that it points to the most used function: this will allow to both address your data, if any, relative to that register and call that function with two bytes.
If structuring the repeated tasks as a loop is not straightforward, it's usually possible to first conceptually structure the program as a sequence of function calls, then transform the functions in a single meta-function that dispatches to the functions, then transform the calls to a meta function into a loop that reads data from arrays, and then finally inline everything.
When reading data in this fashion, it can be beneficial to compress the data, effectively creating a bytecode or bitcode interpreter.
In this case, there is of course a tradeoff between a more powerful interpreter that takes more code but allows shorter bytecode and a simpler interpreter with longer bytecode, and a lot of experimentation can be done to find the best tradeoff.
The most generic tradeoff is the one between byte-aligned bytecode (which wastes space if you need than a multiple of 8 bits), word-aligned bitcode (i.e. each opcode is N bits and goes across byte boundaries, but it don't straddle across "word" boundaries where words are usually 16/32/64-bit, or even 128/256/512 with SSE/AVX2/AVX-512 - for instance you can have 5 6-bit opcodes in a 32-bit word, wasting 2 bits every 32), fully unaligned bitcode (which wastes little space, but is the most complex to read) and even forms of arithmetic/range coding (optimal, but most complex, probably not worth it unless there's a hole with a ton of data).
Here are some tips for using bytecode/bitcode:
- It is possible to test and consume a flag by putting it in the low bit and using
shr/sar $1, %reg
followed by testing the carry flag (other shift/rotate instruction work as well); note thatshr
instructions also puts the high bit in the overflow flag and you can test for zero and parity as well - It is possible to split a bitfield with 2 value by shifting it so that the higher part ends up in
ah
and the lower inal
- It's usually faster to test for the end of bytecode by checking for zero somewhere in the loop rather than using a counter; of course, this requires to engineer the bytecode so that there are no zeroes in normal bytecode. Note that you don't need to specify the final zero as long as the bytecode is at the end of a section
- You might to be able to choose the encoding so that the start or end of bytecode is identical to either the end or start of program code or another data constant, so that you can save bytes by overlapping
Using instructions unusually
In addition to writing and reading output and using lookup tables for xlatb
, string instructions can shorten more general tasks:
rsi
can be incremented withlods
rdi
can be incremented withscas
rsi
andrdi
can be both incremented withcmps
rep lods
can addecx
toesi
, optionally scaled by 1/2/4/8rep scas
can addecx
toedi
, optionally scaled by 1/2/4/8rep cmps
can addecx
to bothesi
andedi
, optionally scaled by 1/2/4/8- Reading a byte at
ebx
can be done withxlatb
if eax is clear
The loop
instruction can be used to decrement rcx
preserving flags with a target on the next instruction.
The cpuid
instruction, especially in the lahf; cpuid
variant, can be used to clear all main registers at once.
The syscall
instruction can be used to get the current IP in rcx.
Instructions
Effective addressing with ModR/M and SIB bytes
r8
-r15
registers need an extra REX byte if not already present
rsp
can never be used as an index register, but all other forms are possible, albeit sometimes with an extra byte (or 3-4 extra bytes in case of (,%r?x,#)
, which can usually be replaced with (%r?x,%r?x,#)
where the first register is cleared).
RIP-relative addressing is usually suboptimal since there is no disp8 variant; instead, use syscall
at the start of loop and then index from ecx
which will have the address following the syscall
instruction.
#
can be 1, 2, 4 or 8; by specifying the same register as both base and index, you can multiply it by 2, 3, 5 or 9.
An address size prefix can be used to use 32-bit registers in the address calculation: this is usually only useful when referring to esp
or derived values in edi
or esi
to store a table at the address corrisponding to the lower 32-bits of esp
and only if done up to 2 times (otherwise moving to a register is shorter).
Bytes | Value | Condition |
---|---|---|
+0 | %e?x |
|
+0 | (%r?x) |
not rbp or rsp or r12 or r13 |
+1 | disp8(%r?x) |
not rsp or r12 |
+1 | (%r?x,%r?x,#) |
base (1st reg) not rbp or r13 , index (2nd reg) not rsp |
+1 | (%rsp) |
|
+1 | (%rbp) |
|
+2 | disp8(%r?x,%r?x,#) |
index (2nd reg) not rsp |
+2 | disp8(%rsp) |
|
+2 | disp8(%r12) |
|
+4 | disp32(%r?x) |
not rsp or r12 |
+4 | disp32(%rip) |
|
+5 | disp32(%r?x,%r?x,#) |
index (2nd reg) not rsp |
+5 | disp32(,%r?x,#) |
not rsp |
+5 | addr32 |
|
+5 | disp32(%rsp) |
|
+5 | disp32(%r12) |
Copying or moving values between registers
Bytes | Instruction | Values | Side effect | Condition |
---|---|---|---|---|
1 | xchg %e?x, %eax |
32-bit values | exchanges data | only to/from eax |
2 | xchg %e?x, %e?x |
32-bit values | exchanges data | |
2 | push %r?x; pop %r?x |
64-bit values | needs stack | |
2 | xchg %r?x, %rax |
64-bit values | exchanges data | only to/from rax |
2 | mov %?[lh], %?[lh] |
8-bit values | ||
2 | mov %e?x, %e?x |
32-bit values | ||
2 | lea (%r?x), %e?x |
32-bit values | ||
2 | add/or/xor %e?x, %e?x |
32-bit values | sets flags | clear register |
2 | add/or/xor %?[lh], %?[lh] |
8-bit values | sets flags | clear register |
2 | (movsxd %e?x, %e?x ) |
32-bit values | (encode with .byte, not supported by defasm) | |
3 | mov %r?x, %r?x |
64-bit values | ||
3 | lea (%r?x), %r?x |
64-bit values | ||
3 | add/or/xor %r?x, %r?x |
64-bit values | sets flags | clear register |
3 | xchg %r?x, %r?x |
64-bit values | exchanges data | |
3 | mov %?x, %?x |
16-bit values | ||
3 | add/or/xor %?x, %?x |
16-bit values | sets flags | clear register |
3 | xchg %?x, %?x |
16-bit values | exchanges data | |
3 | movzbl/movsbl %?[lh], %e?x |
8-bit into 32-bit values | extends | |
3 | movslq %?[lh], %r?x |
32-bit into 64-bit values | extends | |
3 | movzwl/movswl %?x, %e?x |
16-bit into 32-bit values | extends | |
4 | movzbw/movsbw %?[lh], %r?x |
8-bit into 16-bit values | extends | |
4 | movsbq %?[lh], %r?x |
8-bit into 64-bit values | extends | |
4 | movswq %?x, %r?x |
16-bit into 64-bit values | extends |
Constructing constants
Note: this list may not be exhaustive
Common generic patterns
Bytes | Instruction | Values | Condition |
---|---|---|---|
2 | mov $imm, %?[lh] |
8-bit values | clear register |
2 | mov $imm, %?h |
16-bit values divisible by 256 | clear register |
3 | push $imm; pop %r?x |
8-bit values, sign-extended to 64-bit | needs stack |
4 | mov $imm16, %?x |
16-bit values | clear register |
5 | mov $imm32, %e?x |
32-bit values | |
10 | mov $imm64, %r?x |
64-bit values |
Common patterns for specific constants
Bytes | Instruction | Values | Condition |
---|---|---|---|
1 | cltd |
0 | in edx , if eax is zero or positive |
1 | pop %r?x |
1 | needs stack, only if there are no input arguments, only once |
2 | xor %e?x, %e?x |
0 | |
2 | dec %e?x |
-1 (32-bit) | clear register |
2 | mul %e?x |
0 in eax and edx |
%e?x must be zero |
2 | sbb %e?x, %e?x |
-1 (32-bit) | if the carry flag is set |
3 | stc; sbb %e?x, %e?x |
-1 (32-bit) | |
3 | lahf; cpuid |
zeroes in eax , ebx , ecx , edx |
|
3 | imul $imm, %other, %e?x |
32-bit multiples of another constant in other |
Addresses
syscall is free in holes with multiple arguments where you write output in a loop for each argument, since you can start the argument loop with the syscall.
%esp
and %rsp
can be used directly as addresses, and argument addresses can be popped off the stack.
Bytes | Instruction | Values | Condition |
---|---|---|---|
0-2 | syscall |
rip |
in rcx , also sets r11 to flags and rax |
1 | pop %r?x |
argv[0] | must be 2nd pop instruction |
2 | mov %esp, %e?x |
even <4G address usually after end of program but before other areas | |
2 | dec %e?x |
0xffffffff , odd <4G address after end of program |
clear register |
2 | sbb %e?x, %e?x |
0xffffffff , odd <4G address after end of program |
if the carry flag is set |
2-4 | syscall; mov %ecx, %e?x |
rip |
also sets r11 to flags and rax |
3 | stc; sbb %e?x, %e?x |
0xffffffff , odd <4G address after end of program |
|
3-5 | syscall; lea imm8(%rcx), %e?x |
rip+imm8 |
also sets r11 to flags and rax |
4 | bts $22, %e?x |
0x400000 |
clear register |
4 | btc $22, %e?x |
0x400000 |
clear register |
4 | mov $0x4?, %?h; bswap %e?x |
0x4?0000 |
clear register except ?h |
4 | lahf; shl $13, %eax |
0x400000 |
in eax clear register, clear flags |
5 | mov $address, %e?x |
any address |
Other patterns
Bytes | Instruction | Values | Condition |
---|---|---|---|
1 | cltd |
-1 | in edx , if eax is negative |
1 | lahf |
2 plus flags | in ah , if all clear or the flag register value otherwise |
2 | inc %e?x |
1 | clear register |
2 | mov %esp, %e?x |
large value usable as address | |
2 | int $0x80 |
-4 (64-bit) | in eax , if eax clear |
2 | syscall |
0x202 plus flags | in r11 |
2 | pushf; pop %r?x |
0x202 plus flags | needs stack |
2 | cpuid |
zeroes in eax , ebx , ecx , edx |
for most values of eax != 0 |
2 | cpuid |
a bunch of constants | assuming eax = 0 |
2 | div %eax |
1 in eax |
assuming edx = 0 and eax != 0 |
2 | rdtsc |
time-dependent value | in eax and in edx |
3 | dec %r?x |
-1 (64-bit) | clear register |
3 | mov $imm, %al; cwtl |
8-bit values | in eax , if ah is clear |
3 | lea $imm8(%other), %e?x |
32-bit values close to another constant in other |
|
3 | stc; pushf; pop %r?x |
0x203 plus flags | |
3 | smsw %e?x |
0x80050033 | (in common OSes) |
3 | rdpkru |
0x55555554 | in eax , if ecx clear (OS-dependent?) |
3 | str %e?x |
0x40 | (OS-dependent?) |
3 | xgetbv |
0x207 | in eax , if ecx clear (OS-dependent?) |
3 | add $-imm, %e?x |
negative 8-bit values, sign-extended to 32-bit | clear register |
3 | rdrand %e?x |
random value | |
4 | xor %eax, %eax; mov $imm, %al |
8-bit values | in eax |
4 | lea $imm8(%other), %r?x |
64-bit values close to another constant in other |
|
4 | imul $imm, %other, %r?x |
64-bit multiples of another constant in other |
|
4 | bts $bit, %e?x |
32-bit powers of 2 | clear register |
4 | stc; sbb %r?x, %r?x |
-1 (64-bit) | |
4 | add $-imm, %r?x |
negative 8-bit values, sign-extended to 64-bit | clear register |
4 | syscall; mov $imm, %cl |
any program offset in a 256-byte window | assuming syscall does nothing |
4 | mov $5, %al; cpuid |
eax = ebx = 0x40, ecx = 3, edx = 0x11 |
|
4 | mov $6, %al; cpuid |
eax = 4, ebx = 0, ecx = 1, edx = 0 |
|
5 | bts $bit, %r?x |
64-bit powers of 2 | clear register |
6 | push $-imm32; pop %r?x |
negative 32-bit values, sign-extended to 64-bit | needs stack |
7 | mov $-imm32, %r?x |
negative 32-bit values, sign-extended to 64-bit |
Special functionality
BCD integers via the FPU
It's possibly to convert an integer into a packed BCD (meaning that every 4 bits hold a decimal digit) with this code:
push %rax
fildq (%rsp)
fbstp (%rsp)
pop %rax
There is also fbld
for the inverse operation
Entering 32-bit mode (and 16-bit mode)
While by default programs run in 64-bit mode, i.e. with a 64-bit code segment, it is also possible to switch to 32-bit compatibility mode by executing a far jump/call/return to selector 0x23 (or using sysenter
, which returns in 32-bit mode, but seems unavailable on codegolf's AMD processor), which will make available the 1-byte BCD instructions (AA*, DA*) and the 1-byte encodings for INC and DEC (while of course disabling all 64-bit features).
Unfortunately the shortest code sequence to do so seems to be 6-9 bytes. The base version is push $0x23; call to32; [...]; to32: .byte 0x48; to64: .byte 0xcb
, using push $0x33; call to64
to return to 64-bit mode.
It can be shortened to 6 bytes as .section mytext1, "wax"; push $0x23; .byte 0xe8, 0x1; .section mytext2, "wax"; .byte 0x48, 0xcb
which requires rax
to be a valid pointer and an even number of bytes to be present in section mytext1 before this: this will result in the call executing 00 00
= add %al, (%rax)
instructions until the far return in the next section; then the return will be one byte back, so 00 48 cb
= add %cl,-0x35(%eax)
will be executed, continuing execution after the far return once in 32-bit mode.
It is also possible to enter 16-bit mode the same way, by first using the modify_ldt syscall to set up a 16-bit code segment, and jumping there, although of course this is quite unlikely to be profitable.