Hack64 Wiki
Other Titles
The RSP (Reality Signal Processor) is part of the N64's Reality Co-Processor (RCP), which is a slave processor to the main R4300i CPU.
The RSP contains two 4-kilobyte caches referred to as IMEM (Instruction Memory) and DMEM (Data Memory).
As the names suggest, IMEM can only hold RSP instructions and DMEM can only hold data. These two memories are the only ones the RSP itself can access directly.
Cache | Virtual address range |
---|---|
DMEM | 0x04000000:0x04000FFF |
IMEM | 0x04001000:0x04001FFF |
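
The address ranges in the table can be expressed as a small lookup. The following is a minimal sketch; `rsp_region` is a hypothetical helper name introduced here for illustration, and the constants are copied directly from the table.

```python
# Address ranges copied from the table above.
DMEM_START, DMEM_END = 0x04000000, 0x04000FFF
IMEM_START, IMEM_END = 0x04001000, 0x04001FFF


def rsp_region(addr):
    """Return (name, offset) if addr falls in DMEM or IMEM, else None.

    Hypothetical helper, not part of any real toolchain.
    """
    if DMEM_START <= addr <= DMEM_END:
        return ("DMEM", addr - DMEM_START)
    if IMEM_START <= addr <= IMEM_END:
        return ("IMEM", addr - IMEM_START)
    return None
```

For instance, `rsp_region(0x04001004)` gives `("IMEM", 4)`, the second instruction word in IMEM.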
There are 32 general-purpose vector registers, each 128 bits wide. Each register holds eight 16-bit elements.
VU register format
Element | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Bytes | 0-1 | 2-3 | 4-5 | 6-7 | 8-9 | 10-11 | 12-13 | 14-15 |
A VU register can be accessed in different ways depending on the instruction.
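
To make the element/byte relationship concrete, here is a rough sketch that models a register as 16 bytes. The function names are hypothetical, and storing the high byte of each element first (big-endian) is an assumption of this sketch rather than something the register table states.

```python
def elements_to_bytes(elements):
    """Pack eight 16-bit elements into 16 bytes, element 0 first."""
    if len(elements) != 8:
        raise ValueError("a vector register holds exactly 8 elements")
    raw = bytearray()
    for value in elements:
        # Assumption: the high byte of each element is stored first.
        raw += (value & 0xFFFF).to_bytes(2, "big")
    return bytes(raw)


def bytes_to_elements(raw):
    """Split 16 bytes back into eight 16-bit elements."""
    if len(raw) != 16:
        raise ValueError("a vector register is exactly 16 bytes")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, 16, 2)]
```

Under this layout, element `n` occupies bytes `2*n` and `2*n + 1`, which is why the load/store commands (byte index) and the computational commands (element index) can refer to the same data with different index values.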
For most computational instructions, like add or multiply, vector elements are used. This allows the RSP to perform 8 computational operations in one instruction. For example,
    // Let's say that $v0 contains [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
    // and $v1 contains [0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100]
    vadd $v2, $v0, $v1 // Vector add instruction
    // $v2 now contains [0x0100, 0x0101, 0x0102, 0x0103, 0x0104, 0x0105, 0x0106, 0x0107]
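
The same lane-by-lane addition can be sketched in Python. This follows the simplified VADD pseudo code from the command table on this page (add, then keep the low 16 bits). `vadd_lanes` is a hypothetical name, and any carry or overflow handling of the real instruction is deliberately not modelled.

```python
def vadd_lanes(vs, vt):
    """Add two 8-element vectors lane by lane, wrapping each lane to 16 bits."""
    if len(vs) != 8 or len(vt) != 8:
        raise ValueError("both operands must have 8 elements")
    return [(a + b) & 0xFFFF for a, b in zip(vs, vt)]


v0 = [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
v1 = [0x0100] * 8
v2 = vadd_lanes(v0, v1)
```

Here `v2` ends up holding the same eight values as `$v2` in the assembly example.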
You can also perform operations on subsets of a register by using scalar halves (h) and scalar quarters (q).
Short form | Long form | Description |
---|---|---|
$v0 | $v0[01234567] | Every element in the vector |
$v0[0q] | $v0[00224466] | Even elements |
$v0[1q] | $v0[11335577] | Odd elements |
$v0[0h] | $v0[00004444] | First and Fifth elements |
$v0[1h] | $v0[11115555] | Second and Sixth elements |
$v0[2h] | $v0[22226666] | Third and Seventh elements |
$v0[3h] | $v0[33337777] | Fourth and Eighth elements |
$v0[0] | $v0[00000000] | First element only |
$v0[1] | $v0[11111111] | Second element only |
$v0[2] | $v0[22222222] | Third element only |
$v0[3] | $v0[33333333] | Fourth element only |
$v0[4] | $v0[44444444] | Fifth element only |
$v0[5] | $v0[55555555] | Sixth element only |
$v0[6] | $v0[66666666] | Seventh element only |
$v0[7] | $v0[77777777] | Eighth element only |
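
The short-form to long-form mapping in the table can be generated mechanically. The sketch below is illustrative only; `expand_specifier` and `long_form` are hypothetical names, and the specifier is passed as the text between the brackets (an empty string meaning no specifier).

```python
def expand_specifier(spec=""):
    """Return the 8 source element indices selected by a short-form specifier."""
    if spec == "":
        return list(range(8))                      # every element
    if spec.endswith("q"):
        k = int(spec[:-1])
        if k not in (0, 1):
            raise ValueError("quarter specifier must be 0q or 1q")
        return [k + (i & 0b110) for i in range(8)]
    if spec.endswith("h"):
        k = int(spec[:-1])
        if k not in range(4):
            raise ValueError("half specifier must be 0h..3h")
        return [k + (i & 0b100) for i in range(8)]
    k = int(spec)
    if k not in range(8):
        raise ValueError("element specifier must be 0..7")
    return [k] * 8                                 # single element


def long_form(spec=""):
    """Render the long form used in the table, e.g. '1q' -> '11335577'."""
    return "".join(str(i) for i in expand_specifier(spec))
```

For example, `long_form("2h")` produces `"22226666"`, matching the `$v0[2h]` row.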
For example,
    // Computing 2 distances between 2 pairs of 3D points:
    // Distance = sqrt((xa-xb)² + (ya-yb)² + (za-zb)²)
    // $v0 = [ xa1, ya1, za1, 0, xa2, ya2, za2, 0 ]
    // $v1 = [ xb1, yb1, zb1, 0, xb2, yb2, zb2, 0 ]
    vsub $v2, $v0, $v1 // Subtract $v1 from $v0
    // $v2 = [ xa1-xb1, ya1-yb1, za1-zb1, 0, xa2-xb2, ya2-yb2, za2-zb2, 0 ]
    vmudh $v2, $v2, $v2 // Multiply $v2 by itself
    // $v2 = [ (xa1-xb1)², (ya1-yb1)², (za1-zb1)², 0, (xa2-xb2)², (ya2-yb2)², (za2-zb2)², 0 ]
    // To make things simpler: xN' = (xaN-xbN)², yN' = (yaN-ybN)², and zN' = (zaN-zbN)²
    // Now we add the y terms to the x terms using $v2[1q]
    // $v2[1q] = [y1', y1', 0, 0, y2', y2', 0, 0]
    vadd $v2, $v2, $v2[1q]
    // $v2 = [ x1' + y1', 2(y1'), z1', 0, x2' + y2', 2(y2'), z2', 0 ]
    // Then we add the z terms to the x+y terms using $v2[2h]
    // $v2[2h] = [z1', z1', z1', z1', z2', z2', z2', z2']
    vadd $v2, $v2, $v2[2h]
    // $v2 = [ x1' + y1' + z1', 2(y1') + z1', 2(z1'), z1', x2' + y2' + z2', 2(y2') + z2', 2(z2'), z2' ]
    // To get the square root of X, you have to compute 1/SQRT(X) and multiply it by X.
    // Note: we only care about elements 0 and 4 at this point.
    vrsq $v3, $v2[0h] // Calculate 1/SQRT
    // rsq1 = 1/sqrt(x1' + y1' + z1'), rsq2 = 1/sqrt(x2' + y2' + z2')
    // $v3 = [ rsq1, rsq1, rsq1, rsq1, rsq2, rsq2, rsq2, rsq2 ]
    vmudh $v3, $v3, $v2[0h] // Multiply to get final answer
    // sqrt1 = sqrt(x1' + y1' + z1'), sqrt2 = sqrt(x2' + y2' + z2')
    // $v3 = [sqrt1, sqrt1, sqrt1, sqrt1, sqrt2, sqrt2, sqrt2, sqrt2]
    // The distance for the first set of points is in $v3[0] and the distance for the second set is in $v3[4].
    // Do note that these 6 instructions will take 24 cycles due to data hazards stalling the pipeline.
    // Inserting a few unrelated instructions in-between these instructions will get you better performance.
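
The data flow of this example can be checked with a small Python model. This is only a sketch of the lane arithmetic: it uses ordinary floating-point numbers instead of the RSP's 16-bit fixed-point formats, it models `vmudh` and `vrsq` as a plain multiply and `1/sqrt`, and the names `BROADCAST`, `pick`, and `distances` are hypothetical. It will raise `ZeroDivisionError` if either pair of points coincides.

```python
import math

# Element selections copied from the short-form/long-form table.
BROADCAST = {
    "1q": [1, 1, 3, 3, 5, 5, 7, 7],
    "2h": [2, 2, 2, 2, 6, 6, 6, 6],
    "0h": [0, 0, 0, 0, 4, 4, 4, 4],
}


def pick(vec, spec):
    """Build the 8-lane operand selected by a scalar quarter/half specifier."""
    return [vec[i] for i in BROADCAST[spec]]


def distances(v0, v1):
    """Mirror the six steps of the assembly example on Python lists."""
    v2 = [a - b for a, b in zip(v0, v1)]                # vsub
    v2 = [a * a for a in v2]                            # vmudh (plain multiply here)
    v2 = [a + b for a, b in zip(v2, pick(v2, "1q"))]    # add the y terms
    v2 = [a + b for a, b in zip(v2, pick(v2, "2h"))]    # add the z terms
    sums = pick(v2, "0h")                               # only lanes 0 and 4 matter
    v3 = [1.0 / math.sqrt(s) for s in sums]             # vrsq (1/sqrt here)
    v3 = [r * s for r, s in zip(v3, sums)]              # multiply back
    return v3[0], v3[4]


d1, d2 = distances([1, 2, 3, 0, 0, 0, 0, 0], [4, 6, 3, 0, 2, 3, 6, 0])
```

With these inputs the two point pairs are 5 and 7 units apart, so `d1` and `d2` come out at approximately 5.0 and 7.0.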
For Vector Load/Store Commands, [e] refers to the byte index within the vector register, not the element index.
Vector Load and Store Commands
Command | Definition | Pseudo code |
---|---|---|
LBV vt[e], offset(base) | Load Byte into Vector Register | vt[e] = (u8)DMEM[base + offset]; |
LSV vt[e], offset(base) | Load Short into Vector Register | vt[e] = (u16)DMEM[base + offset]; |
LLV vt[e], offset(base) | Load Long into Vector Register | vt[e] = (u32)DMEM[base + offset]; |
LDV vt[e], offset(base) | Load Double into Vector Register | vt[e] = (u64)DMEM[base + offset]; |
LQV vt[0], offset(base) | Load Quad into Vector Register | vt[0] = (u128)DMEM[base + offset]; |
LRV vt[0], offset(base) | Load Quad (Rest) into Vector Register | vt[0] = (u128)DMEM[base + offset]; // Stops loading when (base + offset) % 16 == 0 |
LTV vt[e], offset(base) | Load Transpose into Vector Register | TODO |
LFV vt[e], offset(base) | Load Packed Fourth into Vector Register | for(i in [0..3]) vt[e+i*2] = ((u8)DMEM[base+offset+i*2]) << 7; |
LHV vt[0], offset(base) | Load Packed Half into Vector Register | for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i*4]) << 7; |
LPV vt[0], offset(base) | Load Packed Bytes into Vector Register | for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 8; |
LUV vt[0], offset(base) | Load Unsigned Packed Bytes into Vector | for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 7; |
SBV vt[e], offset(base) | Store Byte From Vector Register | DMEM[base + offset] = (u8)vt[e]; |
SSV vt[e], offset(base) | Store Short From Vector Register | DMEM[base + offset] = (u16)vt[e]; |
SLV vt[e], offset(base) | Store Long From Vector Register | DMEM[base + offset] = (u32)vt[e]; |
SDV vt[e], offset(base) | Store Double From Vector Register | DMEM[base + offset] = (u64)vt[e]; |
SQV vt[0], offset(base) | Store Quad From Vector Register | DMEM[base + offset] = (u128)vt[0]; |
SRV vt[0], offset(base) | Store Quad (Rest) From Vector Register | DMEM[base + offset] = (u128)vt[0]; // Stops writing when (base + offset) % 16 == 0 |
STV vt[e], offset(base) | Store Transpose from Vector Register | TODO |
SFV vt[e], offset(base) | Store Packed Fourth from Vector Register | for(i in [0..3]) DMEM[base+offset+i*2] = (u8)(vt[e+i*2] >> 7); |
SHV vt[0], offset(base) | Store Packed Half from Vector Register | for(i in [0..7]) DMEM[base+offset+i*4] = (u8)(vt[i*2] >> 7); |
SPV vt[0], offset(base) | Store Packed Bytes from Vector Register | for(i in [0..7]) DMEM[base+offset+i] = (u8)(vt[i*2] >> 8); |
SUV vt[0], offset(base) | Store Unsigned Packed Bytes from Vector | for(i in [0..7]) DMEM[base+offset+i] = (u8)(vt[i*2] >> 7); |
SWV vt[0], offset(base) | Store Wrapped from Vector Register | TODO |
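
A few of the simpler rows can be modelled directly from their pseudo code. The sketch below treats DMEM and a vector register as Python `bytearray` objects; all function names are hypothetical. It assumes the access stays inside both the register and DMEM (out-of-range indices raise `IndexError` rather than modelling hardware behaviour), and it assumes each 16-bit element is stored high byte first.

```python
DMEM_SIZE = 0x1000  # 4 kilobytes


def load_bytes(vreg, e, dmem, addr, size):
    """Model LBV/LSV/LLV/LDV: copy `size` bytes (1, 2, 4 or 8) to byte index e."""
    for k in range(size):
        vreg[e + k] = dmem[addr + k]


def store_bytes(vreg, e, dmem, addr, size):
    """Model SBV/SSV/SLV/SDV: copy `size` bytes from byte index e to DMEM."""
    for k in range(size):
        dmem[addr + k] = vreg[e + k]


def load_packed(vreg, dmem, addr, shift):
    """Model LPV (shift=8) or LUV (shift=7): one source byte per element."""
    for i in range(8):
        value = dmem[addr + i] << shift
        vreg[i * 2] = value >> 8        # assumption: high byte first
        vreg[i * 2 + 1] = value & 0xFF


dmem = bytearray(DMEM_SIZE)
dmem[0x10:0x12] = b"\x12\x34"
v0 = bytearray(16)
load_bytes(v0, 4, dmem, 0x10, 2)   # like: lsv $v0[4], 0x10($r0)
```

After the last line, bytes 4 and 5 of `v0` (that is, element 2) hold `0x1234`, which illustrates that `[e]` is a byte index for these commands.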
In the pseudo code for some of the following commands, the variable j depends on the scalar element specifier e. (Note that e is a 4-bit number.)
    if (e == 0):
        j = i;
    elseif ((e & 0b1110) == 0b0010): // Scalar Quarter
        j = (e & 0b0001) + (i & 0b1110);
    elseif ((e & 0b1100) == 0b0100): // Scalar Half
        j = (e & 0b0011) + (i & 0b1100);
    elseif ((e & 0b1000) == 0b1000): // Scalar Whole
        j = e & 0b0111;
    endif
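
The selection rule above translates almost line for line into Python. This is a sketch with a hypothetical function name; the value `e == 1` is not covered by the pseudo code, so the sketch rejects it instead of guessing.

```python
def source_lane(e, i):
    """Return j, the vt element used for destination lane i, given 4-bit e."""
    if not 0 <= e <= 0b1111 or not 0 <= i <= 7:
        raise ValueError("e must be 4 bits and i must be 0..7")
    if e == 0:
        return i                                   # whole vector
    if (e & 0b1110) == 0b0010:                     # scalar quarter
        return (e & 0b0001) + (i & 0b1110)
    if (e & 0b1100) == 0b0100:                     # scalar half
        return (e & 0b0011) + (i & 0b1100)
    if (e & 0b1000) == 0b1000:                     # scalar whole
        return e & 0b0111
    raise ValueError("e value not covered by the pseudo code")
```

For example, `[source_lane(0b0110, i) for i in range(8)]` yields `[2, 2, 2, 2, 6, 6, 6, 6]`, the same pattern as `$v0[2h]` in the element table.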
For Vector Computational Commands, [e] refers to the element index.
Vector Computational Commands
Command | Definition | Pseudo code |
---|---|---|
VNOP | Vector Null Instruction | Does nothing. Used for padding. |
VABS vd, vs, vt[e] | Vector Absolute Value of Short Elements | for(i in [0..7]) { if (vs[i] < 0) vd[i] = -vt[j]; else if (vs[i] == 0) vd[i] = 0; else vd[i] = vt[j]; } |
VADD vd, vs, vt[e] | Vector Add of Short Elements | for(i in [0..7]) vd[i] = (vs[i] + vt[j]) & 0xFFFF; |
VADDC vd, vs, vt[e] | Vector Add of Short Elements With Carry | for(i in [0..7]) vd[i] = vs[i] + vt[j]; |
VSUB vd, vs, vt[e] | Vector Subtraction of Short Elements | for(i in [0..7]) vd[i] = (vs[i] - vt[j]) & 0xFFFF; |
VSUBC vd, vs, vt[e] | Vector Subtraction of Short Elements With Carry | for(i in [0..7]) vd[i] = vs[i] - vt[j]; |
VAND vd, vs, vt[e] | Vector AND of Short Elements | for(i in [0..7]) vd[i] = vs[i] & vt[j]; |
VOR vd, vs, vt[e] | Vector OR of Short Elements | for(i in [0..7]) vd[i] = vs[i] | vt[j]; |
VXOR vd, vs, vt[e] | Vector XOR of Short Elements | for(i in [0..7]) vd[i] = vs[i] ^ vt[j]; |
VNAND vd, vs, vt[e] | Vector NAND of Short Elements | for(i in [0..7]) vd[i] = ~(vs[i] & vt[j]); |
VNOR vd, vs, vt[e] | Vector NOR of Short Elements | for(i in [0..7]) vd[i] = ~(vs[i] | vt[j]); |
VNXOR vd, vs, vt[e] | Vector NXOR of Short Elements | for(i in [0..7]) vd[i] = ~(vs[i] ^ vt[j]); |
VSAR vd, vs, vt[e] | Vector Accumulator Read (and Write) | TODO |
VCH vd, vs, vt[e] | Vector Select Clip Test High | TODO |
VCL vd, vs, vt[e] | Vector Select Clip Test Low | TODO |
VCR vd, vs, vt[e] | Vector Select Crimp Test Low | TODO |
VEQ vd, vs, vt[e] | Vector Select Equal | TODO |
VNE vd, vs, vt[e] | Vector Select Not Equal | TODO |
VGE vd, vs, vt[e] | Vector Select Greater Than or Equal | TODO |
VLT vd, vs, vt[e] | Vector Select Less Than | TODO |
VMACF vd, vs, vt[e] | Vector Multiply-Accumulate of Signed Fractions | TODO |
VMACQ vd, vs, vt[e] | Vector Accumulator Oddification | TODO |
VMACU vd, vs, vt[e] | Vector Multiply-Accumulate of Unsigned Fractions | TODO |
VMADH vd, vs, vt[e] | Vector Multiply-Accumulate of High Partial Products | TODO |
VMADL vd, vs, vt[e] | Vector Multiply-Accumulate of Low Partial Products | TODO |
VMADM vd, vs, vt[e] | Vector Multiply-Accumulate of Mid Partial Products | TODO |
VMADN vd, vs, vt[e] | Vector Multiply-Accumulate of Mid Partial Products | TODO |
VMUDH vd, vs, vt[e] | Vector Multiply of High Partial Products | TODO |
VMUDL vd, vs, vt[e] | Vector Multiply of Low Partial Products | TODO |
VMUDM vd, vs, vt[e] | Vector Multiply of Mid Partial Products | TODO |
VMUDN vd, vs, vt[e] | Vector Multiply of Mid Partial Products | TODO |
VMULF vd, vs, vt[e] | Vector Multiply of Signed Fractions | TODO |
VMULQ vd, vs, vt[e] | Vector Multiply MPEG Quantization | TODO |
VMULU vd, vs, vt[e] | Vector Multiply of Unsigned Fractions | TODO |
VRNDN vd, vs, vt[e] | Vector Accumulator DCT Rounding (Negative) | TODO |
VRNDP vd, vs, vt[e] | Vector Accumulator DCT Rounding (Positive) | TODO |
VRCP vd[de], vt[e] | Vector Element Scalar Reciprocal (Single Precision) | TODO |
VRCPL vd[de], vt[e] | Vector Element Scalar Reciprocal (Double Prec. Low) | TODO |
VRSQ vd[de], vt[e] | Vector Element Scalar SQRT Reciprocal | TODO |
VRSQH vd[de], vt[e] | Vector Element Scalar SQRT Reciprocal (Double Prec. High) | TODO |
VRSQL vd[de], vt[e] | Vector Element Scalar SQRT Reciprocal (Double Prec. Low) | TODO |
VMOV vd[de], vt[e] | Vector Element Scalar Move | TODO |
VMRG vd[de], vt[e] | Vector Select Merge | TODO |
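
Some of the rows that have pseudo code can be tried out with a short Python model. This is a sketch under simplifying assumptions: the function names are hypothetical, `vt` is assumed to be already expanded to 8 lanes (so `j == i`), elements are kept as unsigned 16-bit values, and overflow corner cases of the real hardware are not modelled.

```python
MASK = 0xFFFF


def to_signed(x):
    """Interpret a 16-bit value as a signed number."""
    x &= MASK
    return x - 0x10000 if x & 0x8000 else x


def vabs(vs, vt):
    """VABS per the table: negate, zero, or copy vt depending on the sign of vs."""
    out = []
    for s, t in zip(vs, vt):
        s = to_signed(s)
        if s < 0:
            out.append((-to_signed(t)) & MASK)
        elif s == 0:
            out.append(0)
        else:
            out.append(t & MASK)
    return out


def vand(vs, vt):
    return [(a & b) & MASK for a, b in zip(vs, vt)]


def vnor(vs, vt):
    # Python's ~ yields a negative int, so mask back down to 16 bits.
    return [~(a | b) & MASK for a, b in zip(vs, vt)]


def vnxor(vs, vt):
    return [~(a ^ b) & MASK for a, b in zip(vs, vt)]
```

Note that passing the same register as both operands, as in `vabs(v, v)`, gives the ordinary absolute value of each element.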
Avoiding pipeline stalls in software can be accomplished by understanding the following rules.
1.) VU register destination writes 4 cycles later (need 3 cycles between load and use). This applies to vector computational instructions, vector loads, and coprocessor 2 moves (mtc2). For example,
    llv $v0[0], 0x00($r0) | llv $v0[0], 0x00($r0)
    /* Stall here */ | llv $v2[0], 0x04($r0)
    /* Stall here */ | llv $v3[0], 0x08($r0)
    /* Stall here */ | llv $v4[0], 0x0C($r0)
    vadd $v1, $v0, $v0 | vadd $v1, $v0, $v0
Both of these sections of code take the same number of cycles on the RSP. It is better to have unrelated instructions in between the vector opcodes to improve efficiency, even though it might reduce code readability.
2.) SU register load takes 3 cycles (need 2 cycles between load and use). This applies to SU loads and coprocessor moves (mfc0, cfc2, mfc2). SU computational results are available in the next cycle.
    lw $t0, 0x00($r0) | lw $t0, 0x00($r0)
    /* Stall here */ | lw $t4, 0x04($r0)
    /* Stall here */ | add $t6, $t2, $t1
    addi $t1, $t0, 1 | addi $t1, $t0, 1
Like rule #1, inserting unrelated instructions between a load and a use is more efficient. However, the following is OK due to pipeline forwarding:
    addi $t1, $t0, 1
    sw $t1, 0x00($r0)
3.) Any load followed by any store 2 cycles later causes a one-cycle bubble. Coprocessor moves (mtc0, mfc0, mtc2, mfc2, ctc2, cfc2) count as both loads and stores.
    lw $t0, 0x40($r0)
    addi $t1, $t1, 0x01
    /* Stall here due to the lw above */
    sw $t1, 0x30($r0)
4.) A branch target not 64-bit aligned always single issues.
5.) Branches:
- Can dual issue (with preceding instruction).
- No branch instruction permitted in a delay slot.
- Delay slot always single issues.
- Taken branch causes a 1 cycle bubble.
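
Rules 1 and 2 can be explored with a very small scheduling model. This is a rough sketch, not a cycle-accurate simulator: it assumes one instruction issues per cycle, ignores dual issue, branches, and the load/store bubble of rule 3, and uses hypothetical names throughout. An instruction is described as `(destination, sources, latency)`.

```python
VU_LATENCY = 4        # rule 1: VU destination written 4 cycles later
SU_LOAD_LATENCY = 3   # rule 2: SU load takes 3 cycles
SU_ALU_LATENCY = 1    # rule 2: SU computational result available next cycle


def issue_cycles(program):
    """Return the cycle in which each instruction issues.

    An instruction waits until all of its source registers are ready.
    """
    ready = {}        # register name -> first cycle its value can be used
    next_free = 0     # earliest cycle the next instruction could issue
    issued = []
    for dest, sources, latency in program:
        cycle = next_free
        for src in sources:
            cycle = max(cycle, ready.get(src, 0))
        issued.append(cycle)
        if dest is not None:
            ready[dest] = cycle + latency
        next_free = cycle + 1
    return issued


# Rule 1 example: stalled version versus interleaved version.
stalled = [("v0", [], VU_LATENCY), ("v1", ["v0"], VU_LATENCY)]
interleaved = [
    ("v0", [], VU_LATENCY),
    ("v2", [], VU_LATENCY),
    ("v3", [], VU_LATENCY),
    ("v4", [], VU_LATENCY),
    ("v1", ["v0"], VU_LATENCY),
]
```

In this model the dependent `vadd` issues in cycle 4 in both versions, so the three extra loads in the interleaved version cost nothing. A chain of six dependent vector instructions issues at cycles 0, 4, 8, 12, 16, and 20, which is consistent with the 24-cycle figure quoted for the distance example.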