GCN1, GCN2, GCN3 ISA Quick Reference Card

Program Control
s_endpgm - Terminates the wavefront
s_nop - Does nothing.
s_sleep - Sleep for s0*64 cycles (64..448)
s_trap, s_rfe - Trap handler call/return
s_setprio - sets wavefront priority 0..3
s_sethalt - sets the halt bit. Ignored while priv=1
s_setvskip - skips the next N vertex instructions.
s_branch - Unconditional jump
s_cbranch_xxx - Conditional jump. Conditions: (vccz, vccnz, execz, execnz, scc0, scc1, cdbgsys, cdbguser, cdbgsys_and_user)
s_cbranch_i_fork, s_cbranch_g_fork, s_cbranch_join - Fork/join for complex control-flow graphs.
s_barrier - Forces each wavefront to wait until all other wavefronts reach the same instruction.
s_waitcnt - Waits for a specific counter. vmCnt: Vector-memory operation, returns in order. lgkmCnt: LDS, GDS, Scalar-memory, sendmsg, returns out of order. expCnt: Export, GDS.

Scalar Integer Arithmetic
s_add, s_sub, s_addk - Add, sub. SCC=carry
s_addc, s_subb - D=S1+S2+SCC. SCC=carry
s_abs - D=abs(S1). SCC=result nonzero
s_absdiff - D=abs(S1-S2). SCC=result nonzero
s_min, s_max - SCC= first_operand==result
s_mul, s_mulk - 32bit signed multiply
s_sext - Sign extend from 8 or 16 bits to 32
s_cselect - D = SCC ? S1 : S2
s_cmov, s_cmovk - if(SCC) D = S1
s_cmp_xx, s_cmpk_xx - 32bit signed/unsigned compare. xx={eq,ne,gt,ge,lt,le}
s_cmp_xx_U64 - 64 bit unsigned compare xx={eq,ne}
s_bitcmp0, s_bitcmp1 - Bit compare
s_mov, s_movk - Move
s_and, s_or, s_xor, s_not - basic logic operations
s_nand, s_nor, s_xnor - result is negated
s_andn2, s_orn2 - 2nd operand is negated
s_lshl, s_lshr, s_ashr - bit shift operations
s_bfm - Bit field mask. S1= size, S2=pos
s_bfe - Bit field extract. S1=data, S2[5:0]=offset, S2[22:16]=width
s_wqm - Whole Quad Mode. If any bit in a group of 4 is 1, set the resulting group of 4 bits to 1.
s_quadmask - similar to wqm, except it produces only 8 bit result from 8*4bit input.
s_brev - reverse bits.
s_bcnt0, s_bcnt1 - Count 0/1 bits
s_ff0, s_ff1 - find first 0/1 bit. -1 if not found
s_flbit - find last bit. D = the number of zeroes before the first one starting from MSB. -1 if none. signed operand: finds the first nonSign bit from MSB.
s_bitset0, s_bitset1 - Set specific bit.
s_op_saveexec - D = EXEC, EXEC = S1 <op> EXEC, SCC = EXECNZ. op={and,or,xor,andn2,orn2,nand,nor,xnor}
s_movrels, s_movreld - Move a value into an SGPR relative to M0. S:source / D:destination index is incremented by M0.
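The bit-manipulation entries above (s_bfm, s_bfe, s_wqm, s_quadmask, s_ff1, s_flbit) can be modelled in a few lines. This is a minimal Python sketch of the 32-bit unsigned variants as described on this card, not a cycle-accurate emulator. The function names are hypothetical and only mirror the mnemonics, and masking the s_bfm operands to 5 bits is an assumption.

```python
MASK32 = 0xFFFFFFFF

def s_bfm_b32(size, pos):
    """Bit field mask: `size` ones shifted left by `pos` (assumed 5-bit operands)."""
    return (((1 << (size & 31)) - 1) << (pos & 31)) & MASK32

def s_bfe_u32(data, s2):
    """Unsigned bit field extract: offset = s2[5:0], width = s2[22:16]."""
    offset = s2 & 0x3F
    width = (s2 >> 16) & 0x7F
    if width == 0:
        return 0
    return ((data & MASK32) >> offset) & ((1 << width) - 1)

def s_wqm_b32(x):
    """Any set bit in a group of 4 sets the whole group of 4 in the result."""
    out = 0
    for quad in range(8):
        if (x >> (4 * quad)) & 0xF:
            out |= 0xF << (4 * quad)
    return out

def s_quadmask_b32(x):
    """Like wqm, but each 4-bit group collapses to one bit (8-bit result)."""
    out = 0
    for quad in range(8):
        if (x >> (4 * quad)) & 0xF:
            out |= 1 << quad
    return out

def s_ff1_i32_b32(x):
    """Index of the first 1 bit counting from the LSB, -1 if not found."""
    x &= MASK32
    return (x & -x).bit_length() - 1 if x else -1

def s_flbit_i32_b32(x):
    """Number of zeroes before the first 1 counting from the MSB, -1 if none."""
    x &= MASK32
    return 32 - x.bit_length() if x else -1
```

For example, `s_bfe_u32(0x12345678, (8 << 16) | 4)` extracts 8 bits starting at bit 4.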

Scalar Special
s_tracedata - Send M0 as user data to thread-trace
s_sendmsg, s_sendmsghalt - sends an interrupt to the host
s_dcache_inv - Invalidate entire L1 K cache (GCN3: data cache)
s_dcache_wb - Write-back of dirty data.
s_icache_inv - Invalidate entire L1 I cache
s_memtime - reads 64bit counter. lgkmCnt should be 0.
s_mem_realtime - reads 64bit realtime counter. lgkmCnt should be 0.
s_{inc/dec}perflevel - Inc/dec performance counter.
s_getreg, s_setreg, s_setreg_imm32 - reads/writes a HW_REG. Must add s_nop between consecutive s_setreg to the same register. SIMM16 format: hwreg(hwReg, offset, size)
1:HW_REG_MODE: r/w [1:0]:single round mode, [3:2] double r.m. (0:nearest even, 1:+inf, 2:-inf, 3:towards0) [5:4]:single denormal mode, [7:6]:double d.m. (0:flush input and output denorms, 1:allow in/flush out, 2:flush in/allow out, 3:allow in and out)
4:HW_REG_HW_ID: r debug. [3:0]:wave buffer slot, [5:4]:SIMD id, [11:8]:CU id, [12]:Shader array id (within SE), [14:13]:Shader Engine id, [19:16]:Thread-group id
5:GPR_ALLOC: r [5:0]:Vbase/4, [13:8]:Vsize/4, [21:16]:Sbase/8, [27:24]:Ssize/8
6:LDS_ALLOC: r [7:0]:LDSbase, [20:12]:LDSsize (256 byte units)
7:IB_STS: r [3:0]:vmCnt, [6:4]:expCnt, [10:8]:lgkmCnt, [14:12]:valuCnt (no of VALU instrs outstanding)
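A value read with s_getreg can be split into the fields listed above with a small table-driven decoder. The sketch below is illustrative Python, not part of any toolchain. The field names are made up for readability, and only the bit ranges come from this card.

```python
# (high bit, low bit) per field, copied from the HW_REG list above.
# Field names are hypothetical labels chosen for this sketch.
HW_REG_FIELDS = {
    "MODE": {
        "sp_round": (1, 0), "dp_round": (3, 2),
        "sp_denorm": (5, 4), "dp_denorm": (7, 6),
    },
    "HW_ID": {
        "wave_slot": (3, 0), "simd_id": (5, 4), "cu_id": (11, 8),
        "sh_id": (12, 12), "se_id": (14, 13), "tg_id": (19, 16),
    },
    "GPR_ALLOC": {
        "vgpr_base_div4": (5, 0), "vgpr_size_div4": (13, 8),
        "sgpr_base_div8": (21, 16), "sgpr_size_div8": (27, 24),
    },
    "LDS_ALLOC": {"lds_base": (7, 0), "lds_size": (20, 12)},
    "IB_STS": {
        "vm_cnt": (3, 0), "exp_cnt": (6, 4),
        "lgkm_cnt": (10, 8), "valu_cnt": (14, 12),
    },
}

def decode_hw_reg(name, value):
    """Return a dict of field name -> raw field value for one HW_REG."""
    fields = HW_REG_FIELDS[name]
    return {
        field: (value >> lo) & ((1 << (hi - lo + 1)) - 1)
        for field, (hi, lo) in fields.items()
    }
```

The decoder returns raw field values, so the GPR_ALLOC entries stay in units of 4 VGPRs or 8 SGPRs.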
s_set_gpr_idx_<xx> - Sets GCN3 VGPR Indexing parameters. <off>: Disable. <on>:Enable and set index(m0[4:0]=S1) and operand mask(m0[15:12]=SIMM4, (src0,src1,src2,dst)). <idx> sets idx only. <mode> sets operand mask only.

Scalar Memory
s_load_dword{,x2,x4,x8,x16} sdst, base, offset - reads 1..16 dwords from base+offset. Offset can be literal dword offset or sgprs byte offset(truncated to dword).
s_buffer_load_dword{,x2..x16} - Uses a buffer resource. res.{baseAddr, Stride, numRecords} fields are used. Stride, NumRecords is for clamping the addr. Truncated to dwords.
Alignment: x2: must be aligned to even SGPR. x4,x8,x16 must be aligned to 4.
Literal offset: interpreted as dwords on GCN1/2, bytes on GCN3. In HetPas it needs to be specified explicitly.
GCN3 only: s_store_dword, s_buffer_store_dword, GLC option. Store can only use literal or m0.
s_atc_probe{_buffer} - Probe or prefetch an address into the SQC data cache.
Wait: s_waitcnt lgkmCnt ordered within type only.
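Because the literal offset changes units between generations, a code generator has to convert it. A minimal Python sketch, assuming the caller works in byte offsets; the helper name and the choice to reject misaligned offsets rather than truncate them are illustrative, not taken from any assembler:

```python
def smem_literal_offset(byte_offset, gcn_gen):
    """Convert a byte offset to the literal offset for the given GCN generation.

    GCN1/2 encode the literal in dwords, GCN3 encodes it in bytes.
    """
    if byte_offset < 0 or byte_offset % 4:
        raise ValueError("offset must be a non-negative multiple of 4 bytes")
    if gcn_gen in (1, 2):
        return byte_offset // 4
    if gcn_gen == 3:
        return byte_offset
    raise ValueError("unknown GCN generation: %r" % (gcn_gen,))
```

The encodable range of the literal also differs between generations and is not checked here.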

Vector inputs
v0..v255 - vector regs
s0..s101 - scalar regs. Only 1 allowed per instruction, but can be reused. If there is a scalar input, literal 32 is not allowed.
vcc_lo/hi, exec_lo/hi, m0 - special sgprs
vccz, execz, scc - scalar flags
LDS_direct - read from addr = m0[15:0], data type m0[18:16] {0:ubyte, 1:ushort, 2:dword, 4:sbyte, 5:sshort}
[-16..64] - integer inline constants
+/-[0.5, 1.0, 2.0, 4.0] - inline float constants
1/(2*PI) - only on GCN3. Instr dependent precision.
[Literal] - 32bit const from instruction stream. (no S read allowed if used)
3a,b: Negate: All 3 inputs can be negated. (GCN3: output too.)
3a,b: Omod: output can be multiplied by [0.5, 1.0, 2.0, 4.0]
3a,b: Clamp: output clamped to [-1.0..1.0] range. (GCN3: integers are clamped min/max int. cmp: if clamp is set, it signals on float exceptions too. floats: clamps to [0.0..1.0] (really?!))
3a: Abs: All 3 inputs can be absed.
GCN3 VGPR Indexing: Can be enabled with s_set_gpr_*. Special rules apply to the following: v_{read/write}lane, v_readfirstlane, v_mac_*, v_madak, v_madmk, v_*sh*_rev, v_cvt_pkaccum, SDWA.
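For the LDS_direct input, m0 carries both the address and the data type. The following Python sketch packs such an m0 value from the layout given above (addr in m0[15:0], data type in m0[18:16]); the helper name is hypothetical.

```python
# Data type codes for LDS_direct, as listed on this card.
LDS_DIRECT_TYPES = {"ubyte": 0, "ushort": 1, "dword": 2, "sbyte": 4, "sshort": 5}

def lds_direct_m0(addr, dtype):
    """Build the m0 value for an LDS_direct read of `dtype` at byte address `addr`."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("LDS_direct address must fit in m0[15:0]")
    return (LDS_DIRECT_TYPES[dtype] << 16) | addr
```

For instance, a dword read at address 0x100 needs m0 = 0x00020100.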

Vector Memory
Simple operation:
1. Prepare a buffer resource: r0,r1 = 48bit byte address.
2. set stride: r1 |= 4<<16 for dword access.
3. set num_records to max: r2=0xFFFFFF00
4. set default num/data format: r3=0x002A7204
Use the following:
buffer_{load,store}_<size> vData, vIndex, iOffset idxen - loads/stores data. <size> can be one of {ubyte, sbyte, ushort, sshort, dword, dwordx2, dwordx3, dwordx4}. vIndex is scaled with res.stride, iOffset is a byte offset. If res.TID, then index+=thread_id[5:0]. GLC option meaning: skip L1 cache. LDS option: Can load directly to LDS_offset = m0[15:0]
Atomics: buffer_atomic_* - GLC=return prev val, _x2=64bit
swap - dst=vData
cmpswap, fcmpswap - if(dst==vData[1]) dst=vData[0]
add, sub - dst +=/-= vData
radd, rsub - dst = src +/- dst
umin, umax, smin, smax, fmin, fmax - dst = minmax(dst, vData)
and, or, xor - bitwise logic operations.
inc - dst = (dst>=vData) ? 0 : dst+1
dec - dst = (dst==0 || dst>vData) ? vData : dst-1
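The wrap-around behaviour of inc and dec is easy to misread, so here is a small Python model of the formulas above on 32-bit unsigned values. Each function returns the new memory value; with GLC set the instruction would additionally return the previous one. The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def atomic_inc(dst, vdata):
    """dst = (dst >= vData) ? 0 : dst + 1"""
    return 0 if dst >= vdata else (dst + 1) & MASK32

def atomic_dec(dst, vdata):
    """dst = (dst == 0 || dst > vData) ? vData : dst - 1"""
    return vdata if (dst == 0 or dst > vdata) else dst - 1

def atomic_cmpswap(dst, vdata):
    """if (dst == vData[1]) dst = vData[0]; vdata is a (new, compare) pair."""
    new_value, compare = vdata
    return new_value if dst == compare else dst

def run(op, start, vdata, steps):
    """Apply `op` repeatedly and collect the successive memory values."""
    values, dst = [], start
    for _ in range(steps):
        dst = op(dst, vdata)
        values.append(dst)
    return values
```

With vData = 3, `run(atomic_inc, 0, 3, 5)` cycles through 1, 2, 3, 0, 1, so vData acts as the inclusive upper bound of the counter.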
Cache invalidation:
buffer_wbinvl1 - Write back and invalidate L1. Always returns ACK to shader. _sc (_vol) - Only for MTYPE SC and GC.
Wait: s_waitcnt vmCnt, ordered

Buffer Resource Descriptor
r0:0, 48bits - Base address
r1:16, 14bits - Stride (0..16K)
r1:30 - L1 texture cache swizzle
r1:31 - Swizzle AOS
r2, 32bits - Num_records
r3, 4*3bits - select xyzw. 0=0, 1=1, 4=R, 5=G, 6=B, 7=A
r3:12, 3bits - Num format (float, int, ...)
r3:15, 4bits - Data format (no of fields, size of each field)
r3:19, 2bits - Element size (2,4,8,16 bytes). Used for swizzling.
r3:21, 2bits - Index stride (8,16,32,64) Used for swizzling.
r3:23 - TID. Add thread-id to the index.
r3:24 - ATC. 0=resource is in GPUVM, 1=resource is in ATC mem
r3:25 - Hash 1=addresses are hashed for better cache perf.
r3:26 - Heap 1=out-of-range if offset=0 or >=num_records.
r3:27, 3bits - MTYPE - Memory type - controls cache behavior.
r3:30, 2bits - Type - 0=buffer
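The descriptor layout above, combined with the four "Simple operation" steps from the Vector Memory section, can be sketched as a pack/decode pair. This is illustrative Python under the bit positions given on this card; the function and field names are hypothetical, and the default values are the ones quoted in those steps.

```python
def make_buffer_resource(base_addr, stride=4, num_records=0xFFFFFF00,
                         r3=0x002A7204):
    """Pack a 4-dword buffer resource: 48-bit base, stride at r1:16, defaults from the card."""
    r0 = base_addr & 0xFFFFFFFF
    r1 = ((base_addr >> 32) & 0xFFFF) | ((stride & 0x3FFF) << 16)
    r2 = num_records & 0xFFFFFFFF
    return [r0, r1, r2, r3 & 0xFFFFFFFF]

# (shift, width) of each r3 field, following the table above.
R3_FIELDS = {
    "dst_sel_x": (0, 3), "dst_sel_y": (3, 3),
    "dst_sel_z": (6, 3), "dst_sel_w": (9, 3),
    "num_format": (12, 3), "data_format": (15, 4),
    "element_size": (19, 2), "index_stride": (21, 2),
    "tid": (23, 1), "atc": (24, 1), "hash": (25, 1), "heap": (26, 1),
    "mtype": (27, 3), "type": (30, 2),
}

def decode_r3(r3):
    """Split dword 3 of a buffer resource into its fields."""
    return {name: (r3 >> shift) & ((1 << width) - 1)
            for name, (shift, width) in R3_FIELDS.items()}
```

Decoding the default r3=0x002A7204 this way gives select x=R(4), y=0, z=0, w=1, num format 7, data format 4, element size code 1 and index stride code 1.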

Required NOPs
First instruction | NOPs | Second instruction
s_setreg | 2 | s_???reg <same reg>
s_setvskip | 2 | s_getreg mode
s_setreg mode.vskip | 2 | any VOP
VALU sets VCC or EXEC | 5 | VALU uses EXECZ or VCCZ as data
VALU writes sgpr/vcc | 4 | v_{read/write}lane using the same sgpr/vcc as lane select
VALU writes VCC | 4 | v_div_fmas
VM store/atomic | 1 | overwrite vData
VALU writes sgpr | 5 | VMEM reads that sgpr !!!
SALU writes m0 | 1 | GDS, s_sendmsg, s_tracedata
VALU writes vgpr | 2 | VALU DPP reads that vgpr
VALU writes EXEC | 5 | VALU DPP op
Mixed use of VCC:alias vs. sgpr# (readlane) | 1 | VALU reads VCC (except for carry-in usage)
s_setreg, s_trapsts | 1 | s_rfe, s_rfe_restore
VALU writes sgpr | any SALU

Data Share Operations
m0 usage: m0 clamps final byte address. No clamping: m0=-1
Single address: addr = LDS_BASE+vAddr+iOffset
Double address: addr = LDS_BASE+vAddr+(iOffset0, iOffset1)
ds_read_{b32,b64,b96,b128,u8,i8,u16,i16} - Read one value per thread, sign extend to dword, if signed.
ds_read2_{b32,b64} - Read 2 values. (double address mode)
ds_read2{st64}_{b32,b64} - Read 2 values, st64: offset*=64
ds_write_{b32,b64,b96,b128,b8,b16} - Write 1 value.
ds_write2{st64}_{b32,b64} - Write 2 values.
ds_wrxchg2{st64}_rtn_{b32,b64} - Exchange gpr with lds.

Special DS ops:
ds_swizzle_b32 - Exchange data across wavefront, no data is written to LDS
ds_consume - Consume entries from a buffer.
ds_append - Append one or more entries to a buffer.
ds_ordered_count - Increment an append counter. Operation is done in order of wavefront creation.
ds_permute_b32 - Forward permute. Does not write any lds memory. LDS[dst]=src0, result=LDS[thread_id]
ds_bpermute_b32 - Backward permute. No write to lds. LDS[thread_id]=src0, result=LDS[dst]
ds_condxchg_rtn_b32 - Conditional write exchange (?)
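The difference between the forward and backward permute above is the direction of the indexing: forward is a scatter, backward is a gather. A minimal Python model on lane indices follows. It makes several simplifying assumptions that are not on this card: dst holds lane numbers rather than byte addresses (the hardware addresses are taken to be lane*4), EXEC is ignored, lanes that nobody writes in the forward case read back 0, and on a collision the highest lane wins.

```python
def ds_permute_b32(dst, src, lanes=64):
    """Forward permute (scatter): lane i sends src[i] to lane dst[i]."""
    tmp = [0] * lanes
    for lane in range(lanes):
        tmp[dst[lane] % lanes] = src[lane]
    return tmp

def ds_bpermute_b32(dst, src, lanes=64):
    """Backward permute (gather): lane i fetches src from lane dst[i]."""
    return [src[dst[lane] % lanes] for lane in range(lanes)]
```

With 4 lanes, dst=[1,2,3,0] and src=[10,11,12,13], the forward permute rotates the data up by one lane and the backward permute rotates it down by one lane.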

Global wave synch:
ds_gws_init threadcnt offset0:ID gds - initialize a barrier. Optimal wavefront count is 8 per compute unit.
ds_gws_barrier v0 offset0:ID gds - does the actual barrier. Between init and barrier there must be a few hundred cycles. More barrier IDs can be issued by overlapping them.
ds_gws_sema_{v/br/p} - Undocumented.
ds_gws_sema_release_all - Release all wavefronts waiting on this semaphore. Id in offset0[4:0]

Atomic DS ops:
add, sub, rsub, inc, dec: u32, u64, rtn, src2
min, max: u32, i32, f32, u64, i64, f64, rtn, src2
and, or, xor, mskor: b32, b64, rtn
cmpst: b32, f32, b64, f64, rtn

Vector - Move
cndmask : b32
mov : b32
movrel{d,s,sd} : b32
readfirstlane : b32
readlane : b32
writelane : b32

Vector - Arithmetic
add : f16, f32, f64, u16, u32
sub, subrev : f16, f32, u16, u32
addc, subb, subbrev : u32
exp : f16, f32
{exp,log,mac,mad,mul}_legacy : f32
ceil, floor, fract, trunc, rndne : f16, f32, f64
fma : f16, f32, f64
log : f16, f32
mac : f16, f32
mad : f16, f32, i16, u16, i32_i24, i64_i32, u32_u24, u64_u32
madak, madmk : f16, f32
max, min : f16, f32, f64, i16, i32, u16, u32
max3, min3, med3 : f32, i32, u32
mul : f16, f32, f64, u16, u32, i32_i24, u32_u24
mul_hi : i32, i32_i24, u32, u32_u24
mul_lo : u32, i32, u16
rcp, rsq, sqrt : f16, f32, f64
rcp_iflag : f32
sin, cos : f16, f32

Vector - Bitwise
alignbit, alignbyte : b32
and, not, or, xor : b32
ashrrev : i16, i32, i64
bcnt : u32_b32
bfe : i32, u32
bfi, bfm, bfrev : b32
ffbh : i32, u32
ffbl : b32
lshlrev, lshrrev : b16, b32, b64
mbcnt_hi, mbcnt_lo : u32_b32
perm : b32

Vector - Conversion
cvt_f16 : f32, i16, u16
cvt_f32 : f16, f64, i32, u32, ubyte{0..3}
cvt_f64 : f32, i32, u32
cvt_flr : i32_f32
cvt_i16 : f16
cvt_i32 : f32, f64
cvt_off : f32_i4
cvt_pk : i16_i32, u16_u32, u8_f32
cvt_pkaccum : u8_f32
cvt_pknorm : i16_f32, u16_f32
cvt_pkrtz : f16_f32
cvt_rpi : i32_f32
cvt_u16 : f16
cvt_u32 : f32, f64

Vector - Miscellaneous
cube{id,ma,sc,tc} : f32
div_{fixup,fmas,scale} : f16, f32, f64
frexp_exp : i16_f16, i32_f32, i32_f64
frexp_mant : f16, f32, f64
ldexp : f16, f32, f64
lerp : u8
mov_fed : b32
mqsad : u32_u8
mqsad_pk, qsad_pk : u16_u8
msad : u8
sad : u8, u16, u32
sad_hi : u8
trig_preop : f64

Vector - Compare
cmp_{op16} : f32, f64
cmpx_{op16} : f32, f64
cmp_{op8} : i32, i64, u32, u64, i16, u16
cmpx_{op8} : i32, i64, u32, u64, i16, u16
cmp_class : f32, f64, f16
cmpx_class : f32, f64, f16

{op16} : F, LT, EQ, LE, GT, LG, GE, O, U, NGE, NLG, NGT, NLE, NEQ, NLT, T
{op8} : F, LT, EQ, LE, GT, NE(LG), T
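The sixteen float compare operations differ mainly in how they treat NaN operands. The Python table below is a sketch of the usual reading of these mnemonics (O = ordered, U = unordered, LG = less or greater, N-prefixed = logical negation, NEQ = not equal). It relies on Python float comparisons returning False whenever an operand is NaN, and it models neither exceptions nor the EXEC update of cmpx.

```python
import math

def _unordered(a, b):
    """True when at least one operand is NaN."""
    return math.isnan(a) or math.isnan(b)

# Hypothetical lookup table: mnemonic suffix -> predicate on two floats.
OP16 = {
    "F":   lambda a, b: False,
    "LT":  lambda a, b: a < b,
    "EQ":  lambda a, b: a == b,
    "LE":  lambda a, b: a <= b,
    "GT":  lambda a, b: a > b,
    "LG":  lambda a, b: a < b or a > b,
    "GE":  lambda a, b: a >= b,
    "O":   lambda a, b: not _unordered(a, b),
    "U":   lambda a, b: _unordered(a, b),
    "NGE": lambda a, b: not (a >= b),
    "NLG": lambda a, b: not (a < b or a > b),
    "NGT": lambda a, b: not (a > b),
    "NLE": lambda a, b: not (a <= b),
    "NEQ": lambda a, b: not (a == b),
    "NLT": lambda a, b: not (a < b),
    "T":   lambda a, b: True,
}
```

In this model GE and NLT agree on ordered inputs and differ only when a NaN is involved: `OP16["GE"](nan, 1.0)` is False while `OP16["NLT"](nan, 1.0)` is True.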

HetPas specific
- SMRD/SMEM literal offset: Must specify ofsByte or ofsDWord as it became architecture dependent.
- v_add/sub: both i32 and u32 work. (GCN3: renamed i32->u32)