Reality Signal Processor Overview (MIPS R4300)

The RSP is a part of the N64's Reality Co-Processor (RCP), which is a slave processor to the main R4300i CPU.


Differences from the R4300i

  • Runs at 2/3 the speed of the R4300i, specifically at 62.5 MHz.
  • Does not have a floating-point unit, but instead has a vector unit to do operations in parallel.
  • Executes instructions from IMEM (4 KB), so the program counter is only 12 bits wide. Any higher bits in an address are ignored.
  • Cannot directly access RDRAM; the programmer must initiate a DMA to transfer data between RDRAM and DMEM (4 KB).
  • The general purpose registers are 32 bits wide, not 64 bits.
  • Has no support for interrupts, traps, or exceptions.
  • Modified Instructions:
    • ADD/ADDU, ADDI/ADDIU, and SUB/SUBU behave identically within each pair, since the RSP does not generate an overflow exception.
    • BREAK does not generate a trap; instead it sets bits in the RSP status register and signals an interrupt to the CPU.
  • Missing Instructions:
    • All 64-bit instructions
      • LD, SD, LDC1, LDC2, SDC1, SDC2, DADDI, DADDIU, DSLLV, DSRLV, DSRAV, DMULT, DMULTU, DDIV, DDIVU, DADD, DADDU, DSUB, DSUBU, DSLL, DSRL, DSRA, DSLL32, DSRL32, DSRA32
    • Load/store left & right, and load linked
      • LDL, LDR, LWL, LWR, LWU, SWL, SDL, SDR, SWR, LL, LLD
    • Store conditionals
      • SC, SCD
    • Likely branches
      • BEQL, BNEL, BLEZL, BGTZL, BLTZL, BGEZL, BLTZALL, BGEZALL
    • HI and LO register moves
      • MFHI, MTHI, MFLO, MTLO
    • Multiply/Divide
      • MULT, MULTU, DIV, DIVU
    • SYSCALL, since the RSP does not generate exceptions
    • SYNC
    • Branch on co-processor
      • BCzF, BCzT
    • Trap
      • TGE, TGEU, TLT, TLTU, TEQ, TNE, TGEI, TGEIU, TLTI, TLTIU, TEQI, TNEI
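
The 12-bit program counter and the missing overflow exception can be modelled in a few lines. The following Python sketch is illustrative only: the helper names rsp_pc and rsp_add32 are made up for this page, and it assumes that the upper address bits are simply masked off, as described in the list above.

```python
IMEM_SIZE = 0x1000  # 4 KB of instruction memory


def rsp_pc(address):
    """Return the IMEM offset the RSP would fetch from (hypothetical helper)."""
    return address & (IMEM_SIZE - 1)  # only the low 12 bits are kept


def rsp_add32(a, b):
    """Model ADD and ADDU alike: 32-bit wraparound, never an exception."""
    return (a + b) & 0xFFFFFFFF


print(hex(rsp_pc(0x04001080)))        # 0x80
print(hex(rsp_add32(0x7FFFFFFF, 1)))  # 0x80000000
```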

IMEM and DMEM

The RSP contains two 4-kilobyte caches referred to as IMEM (Instruction Memory) and DMEM (Data Memory).

As the names suggest, IMEM can only be used for RSP instructions and DMEM can only be used for data. These caches can only be accessed by the RSP, and cannot be accessed externally.

Cache   Virtual address range
DMEM    0x04000000:0x04000FFF
IMEM    0x04001000:0x04001FFF
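
As a small illustration of the address ranges in the table, the Python sketch below maps an address to the memory it falls in and its offset inside that memory. The function name classify is hypothetical and only the two ranges listed above are handled.

```python
DMEM_BASE, DMEM_END = 0x04000000, 0x04000FFF
IMEM_BASE, IMEM_END = 0x04001000, 0x04001FFF


def classify(address):
    """Return (region, offset) for an address in the table above, else None."""
    if DMEM_BASE <= address <= DMEM_END:
        return ("DMEM", address - DMEM_BASE)
    if IMEM_BASE <= address <= IMEM_END:
        return ("IMEM", address - IMEM_BASE)
    return None


print(classify(0x04001010))  # falls inside IMEM
```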

Vector Unit

Registers

There are 32 general purpose vector registers, each of them 128-bits wide.

VU register format
Element   0     1     2     3     4     5       6       7
Bytes     0-1   2-3   4-5   6-7   8-9   10-11   12-13   14-15
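
The relationship between elements and bytes can be written down directly: element n covers bytes 2n and 2n+1. The Python sketch below uses hypothetical helper names, and the big-endian byte order inside each element is an assumption, not something stated on this page.

```python
def element_bytes(element):
    """Byte indices covered by a 16-bit element (0..7) of a 128-bit register."""
    if not 0 <= element <= 7:
        raise ValueError("element must be in 0..7")
    return (2 * element, 2 * element + 1)


def split_register(raw):
    """Split 16 raw bytes into 8 elements, ASSUMING big-endian byte order."""
    if len(raw) != 16:
        raise ValueError("a VU register is 16 bytes wide")
    return [int.from_bytes(raw[2 * i:2 * i + 2], "big") for i in range(8)]


print(element_bytes(3))
print([hex(x) for x in split_register(bytes(range(16)))])
```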

A VU register can be accessed in different ways depending on the instruction.

For most computational instructions, like add or multiply, vector elements are used. This allows the RSP to perform 8 computational operations in one instruction. For example,

// Let's say that $v0 contains [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
//            and $v1 contains [0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100]

vadd $v2, $v0, $v1 // Vector add instruction

// $v2 now contains [0x0100, 0x0101, 0x0102, 0x0103, 0x0104, 0x0105, 0x0106, 0x0107]
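
The same lane-wise addition can be reproduced on a PC to check expected values. This Python sketch follows the simplified pseudo code given for VADD on this page (wrap each lane to 16 bits) and makes no attempt to model carry flags or the accumulator; the function name vadd is only illustrative.

```python
def vadd(vs, vt):
    """Lane-wise add of two 8-element vectors, wrapping each lane to 16 bits."""
    return [(a + b) & 0xFFFF for a, b in zip(vs, vt)]


v0 = [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
v1 = [0x0100] * 8
v2 = vadd(v0, v1)
print([f"0x{x:04X}" for x in v2])
```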

You can also operate on subsets of a register using scalar halves (h) and scalar quarters (q).

Short form   Long form       Description
$v0          $v0[01234567]   Every element in the vector
$v0[0q]      $v0[00224466]   Even elements
$v0[1q]      $v0[11335577]   Odd elements
$v0[0h]      $v0[00004444]   First and Fifth elements
$v0[1h]      $v0[11115555]   Second and Sixth elements
$v0[2h]      $v0[22226666]   Third and Seventh elements
$v0[3h]      $v0[33337777]   Fourth and Eighth elements
$v0[0]       $v0[00000000]   First element only
$v0[1]       $v0[11111111]   Second element only
$v0[2]       $v0[22222222]   Third element only
$v0[3]       $v0[33333333]   Fourth element only
$v0[4]       $v0[44444444]   Fifth element only
$v0[5]       $v0[55555555]   Sixth element only
$v0[6]       $v0[66666666]   Seventh element only
$v0[7]       $v0[77777777]   Eighth element only
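
The long forms in the table follow a simple pattern, which the Python sketch below reproduces from the short-form text. The function name expand and the idea of passing the selector as a string are assumptions made for illustration; an assembler would encode the selector numerically.

```python
def expand(selector=""):
    """Map a short-form selector such as "1q", "2h" or "5" to 8 source lanes."""
    if selector == "":
        return list(range(8))                    # plain $v0: every element
    if selector.endswith("q"):
        n = int(selector[:-1])                   # 0 or 1
        return [(i & ~1) + n for i in range(8)]  # same lane for each pair
    if selector.endswith("h"):
        n = int(selector[:-1])                   # 0..3
        return [(i & ~3) + n for i in range(8)]  # same lane for each group of 4
    return [int(selector)] * 8                   # one element broadcast to all


for name in ("", "0q", "1q", "2h", "5"):
    print(name.ljust(2), "".join(str(j) for j in expand(name)))
```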

For example,

// Computing 2 distances between 2 pairs of 3D points:
// Distance = sqrt((xa-xb)² + (ya-yb)² + (za-zb)²)
// $v0 = [ xa1, ya1, za1, 0, xa2, ya2, za2, 0 ]
// $v1 = [ xb1, yb1, zb1, 0, xb2, yb2, zb2, 0 ]

vsub $v2, $v0, $v1 // Subtract $v1 from $v0

// $v2 = [ xa1-xb1, ya1-yb1, za1-zb1, 0, xa2-xb2, ya2-yb2, za2-zb2, 0 ]

vmudh $v2, $v2, $v2 // Multiply $v2 by itself

// $v2 = [ (xa1-xb1)², (ya1-yb1)², (za1-zb1)², 0, (xa2-xb2)², (ya2-yb2)², (za2-zb2)², 0 ]

// To make things simpler: xN' = (xaN-xbN)², yN' = (yaN-ybN)², and zN' = (zaN-zbN)²

// Now we add the y terms to the x terms using $v2[1q]
// $v2[1q] = [y1', y1', 0, 0, y2', y2', 0, 0]

vadd $v2, $v2, $v2[1q]

// $v2 = [ x1' + y1', 2(y1'), z1', 0, x2' + y2', 2(y2'), z2', 0 ]

// Then we add the z terms to the x+y terms using $v2[2h]
// $v2[2h] = [z1', z1', z1', z1', z2', z2', z2', z2']

vadd $v2, $v2, $v2[2h]

// $v2 = [ x1' + y1' + z1', 2(y1') + z1', 2(z1'), z1', x2' + y2' + z2', 2(y2') + z2', 2(z2'), z2' ]

// To get the square root of X, you have to compute 1/SQRT(X) and multiply it by X.
// Note: we only care about elements 0 and 4 at this point.

vrsq $v3, $v2[0h] // Calculate 1/SQRT

// rsq1 = 1/sqrt(x1' + y1' + z1'), rsq2 = 1/sqrt(x2' + y2' + z2')
// $v3 = [ rsq1, rsq1, rsq1, rsq1, rsq2, rsq2, rsq2, rsq2 ]

vmudh $v3, $v3, $v2[0h] // Multiply to get final answer

// sqrt1 = sqrt(x1' + y1' + z1'), sqrt2 = sqrt(x2' + y2' + z2')
// $v3 = [sqrt1, sqrt1, sqrt1, sqrt1, sqrt2, sqrt2, sqrt2, sqrt2]

// The distance for the first set of points is in $v3[0] and the distance for the second set is in $v3[4].

// Do note that these 6 instructions will take 24 cycles due to data hazards stalling the pipeline.
// Inserting a few unrelated instructions in-between these instructions will get you better performance.
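
To check the data flow of the listing above, the steps can be replayed in Python. This is only a sketch of the arithmetic: it uses floating point instead of the RSP's fixed-point formats, treats vrsq as a per-lane operation exactly as the comments above do, and assumes the two points of each pair are distinct (otherwise the reciprocal square root divides by zero). All names are illustrative.

```python
import math

Q1 = [1, 1, 3, 3, 5, 5, 7, 7]  # lanes selected by [1q]
H0 = [0, 0, 0, 0, 4, 4, 4, 4]  # lanes selected by [0h]
H2 = [2, 2, 2, 2, 6, 6, 6, 6]  # lanes selected by [2h]


def select(v, lanes):
    """Build the operand seen by an instruction for a given lane pattern."""
    return [v[j] for j in lanes]


def distances(va, vb):
    """Replay the six instructions and return the two distances."""
    d = [a - b for a, b in zip(va, vb)]                 # vsub
    sq = [x * x for x in d]                             # vmudh (square)
    sq = [a + b for a, b in zip(sq, select(sq, Q1))]    # vadd with [1q]
    sq = [a + b for a, b in zip(sq, select(sq, H2))]    # vadd with [2h]
    rsq = [1.0 / math.sqrt(x) for x in select(sq, H0)]  # vrsq
    out = [r * x for r, x in zip(rsq, select(sq, H0))]  # vmudh (x * 1/sqrt(x))
    return out[0], out[4]


print(distances([3, 4, 0, 0, 1, 2, 2, 0], [0] * 8))
```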

Commands

For Vector Load/Store Commands, [e] refers to the byte index within the vector register, not the element index.
Vector Load and Store Commands
Command                   Definition                                Pseudo code
LBV vt[e], offset(base)   Load Byte into Vector Register            vt[e] = (u8)DMEM[base + offset];
LSV vt[e], offset(base)   Load Short into Vector Register           vt[e] = (u16)DMEM[base + offset];
LLV vt[e], offset(base)   Load Long into Vector Register            vt[e] = (u32)DMEM[base + offset];
LDV vt[e], offset(base)   Load Double into Vector Register          vt[e] = (u64)DMEM[base + offset];
LQV vt[0], offset(base)   Load Quad into Vector Register            vt[0] = (u128)DMEM[base + offset];
LRV vt[0], offset(base)   Load Quad (Rest) into Vector Register     vt[0] = (u128)DMEM[base + offset];
                                                                    // Stops loading when (base + offset) % 16 == 0
LTV vt[e], offset(base)   Load Transpose into Vector Register       TODO
LFV vt[e], offset(base)   Load Packed Fourth into Vector Register   for(i in [0..3]) vt[e+i*2] = ((u8)DMEM[base + offset + i*2]) << 7;
LHV vt[0], offset(base)   Load Packed Half into Vector Register     for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i*4]) << 7;
LPV vt[0], offset(base)   Load Packed Bytes into Vector Register    for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 8;
LUV vt[0], offset(base)   Load Unsigned Packed Bytes into Vector    for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 7;
SBV vt[e], offset(base)   Store Byte from Vector Register           DMEM[base + offset] = (u8)vt[e];
SSV vt[e], offset(base)   Store Short from Vector Register          DMEM[base + offset] = (u16)vt[e];
SLV vt[e], offset(base)   Store Long from Vector Register           DMEM[base + offset] = (u32)vt[e];
SDV vt[e], offset(base)   Store Double from Vector Register         DMEM[base + offset] = (u64)vt[e];
SQV vt[0], offset(base)   Store Quad from Vector Register           DMEM[base + offset] = (u128)vt[0];
SRV vt[0], offset(base)   Store Quad (Rest) from Vector Register    DMEM[base + offset] = (u128)vt[0];
                                                                    // Stops writing when (base + offset) % 16 == 0
STV vt[e], offset(base)   Store Transpose from Vector Register      TODO
SFV vt[e], offset(base)   Store Packed Fourth from Vector Register  for(i in [0..3]) DMEM[base + offset + i*2] = (u8)(vt[e+i*2] >> 7);
SHV vt[0], offset(base)   Store Packed Half from Vector Register    for(i in [0..7]) DMEM[base + offset + i*4] = (u8)(vt[i*2] >> 7);
SPV vt[0], offset(base)   Store Packed Bytes from Vector Register   for(i in [0..7]) DMEM[base + offset + i] = (u8)(vt[i*2] >> 8);
SUV vt[0], offset(base)   Store Unsigned Packed Bytes from Vector   for(i in [0..7]) DMEM[base + offset + i] = (u8)(vt[i*2] >> 7);
SWV vt[0], offset(base)   Store Wrapped from Vector Register        TODO
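
The packed loads and stores are easier to follow with concrete values. The Python sketch below transcribes the pseudo code for LPV, LUV and SPV from the table; it models a vector register as a list of eight 16-bit elements (byte index i*2 in the table is element i here) and DMEM as a bytearray. Address wrapping and alignment are not modelled, and the function names are illustrative only.

```python
def lpv(dmem, addr):
    """Load Packed Bytes: each byte lands in the upper byte of an element."""
    return [(dmem[addr + i] << 8) & 0xFFFF for i in range(8)]


def luv(dmem, addr):
    """Load Unsigned Packed Bytes: each byte is shifted left by 7 instead."""
    return [(dmem[addr + i] << 7) & 0xFFFF for i in range(8)]


def spv(vec, dmem, addr):
    """Store Packed Bytes: write the upper byte of each element back out."""
    for i in range(8):
        dmem[addr + i] = (vec[i] >> 8) & 0xFF


dmem = bytearray(range(0x10, 0x20)) + bytearray(16)
packed = lpv(dmem, 0)
spv(packed, dmem, 16)  # round trip: bytes 16..23 now equal bytes 0..7
print([hex(x) for x in packed])
print([hex(x) for x in luv(dmem, 0)])
```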
In the pseudo code for some of the following commands, the variable j depends on the scalar element selector e.
(Note that e is a 4-bit number.)

if (e == 0):
  j = i;
elseif ((e & 0b1110) == 0b0010): // Scalar Quarter
  j = (e & 0b0001) + (i & 0b1110);
elseif ((e & 0b1100) == 0b0100): // Scalar Half
  j = (e & 0b0011) + (i & 0b1100);
elseif ((e & 0b1000) == 0b1000): // Scalar Whole
  j = e & 0b0111;
endif
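
The selection rule above translates almost line for line into Python, which makes it easy to print the lane pattern for every encoding. This is a direct transcription, not a hardware reference: the function name source_lane is made up, and because the pseudo code does not say what happens for e == 1, the sketch raises an error for that value rather than guessing.

```python
def source_lane(e, i):
    """Return j, the lane of vt used for output lane i, given the 4-bit e."""
    if e == 0:
        return i
    if (e & 0b1110) == 0b0010:  # scalar quarter
        return (e & 0b0001) + (i & 0b1110)
    if (e & 0b1100) == 0b0100:  # scalar half
        return (e & 0b0011) + (i & 0b1100)
    if (e & 0b1000) == 0b1000:  # scalar whole
        return e & 0b0111
    raise ValueError("e == 1 is not covered by the pseudo code")


for e in (0, 2, 3, 4, 7, 8, 15):
    print(f"e={e:2d}", "".join(str(source_lane(e, i)) for i in range(8)))
```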
For Vector Computational Commands, [e] refers to the element index.
Vector Computational Commands
Command               Definition                                                  Pseudo code
VNOP                  Vector Null Instruction                                     Does nothing. Used for padding.
VABS vd, vs, vt[e]    Vector Absolute Value of Short Elements                     for(i in [0..7]) {
                                                                                      if (vs[i] < 0)
                                                                                          vd[i] = -vt[j];
                                                                                      else if (vs[i] == 0)
                                                                                          vd[i] = 0;
                                                                                      else if (vs[i] > 0)
                                                                                          vd[i] = vt[j];
                                                                                  }
VADD vd, vs, vt[e]    Vector Add of Short Elements                                for(i in [0..7]) vd[i] = (vs[i] + vt[j]) & 0xFFFF;
VADDC vd, vs, vt[e]   Vector Add of Short Elements With Carry                     for(i in [0..7]) vd[i] = vs[i] + vt[j];
VSUB vd, vs, vt[e]    Vector Subtraction of Short Elements                        for(i in [0..7]) vd[i] = (vs[i] - vt[j]) & 0xFFFF;
VSUBC vd, vs, vt[e]   Vector Subtraction of Short Elements With Carry             for(i in [0..7]) vd[i] = vs[i] - vt[j];
VAND vd, vs, vt[e]    Vector AND of Short Elements                                for(i in [0..7]) vd[i] = vs[i] & vt[j];
VOR vd, vs, vt[e]     Vector OR of Short Elements                                 for(i in [0..7]) vd[i] = vs[i] | vt[j];
VXOR vd, vs, vt[e]    Vector XOR of Short Elements                                for(i in [0..7]) vd[i] = vs[i] ^ vt[j];
VNAND vd, vs, vt[e]   Vector NAND of Short Elements                               for(i in [0..7]) vd[i] = ~(vs[i] & vt[j]);
VNOR vd, vs, vt[e]    Vector NOR of Short Elements                                for(i in [0..7]) vd[i] = ~(vs[i] | vt[j]);
VNXOR vd, vs, vt[e]   Vector NXOR of Short Elements                               for(i in [0..7]) vd[i] = ~(vs[i] ^ vt[j]);
VSAR vd, vs, vt[e]    Vector Accumulator Read (and Write)                         TODO
VCH vd, vs, vt[e]     Vector Select Clip Test High                                TODO
VCL vd, vs, vt[e]     Vector Select Clip Test Low                                 TODO
VCR vd, vs, vt[e]     Vector Select Crimp Test Low                                TODO
VEQ vd, vs, vt[e]     Vector Select Equal                                         TODO
VNE vd, vs, vt[e]     Vector Select Not Equal                                     TODO
VGE vd, vs, vt[e]     Vector Select Greater Than or Equal                         TODO
VLT vd, vs, vt[e]     Vector Select Less Than                                     TODO
VMACF vd, vs, vt[e]   Vector Multiply-Accumulate of Signed Fractions              TODO
VMACQ vd, vs, vt[e]   Vector Accumulator Oddification                             TODO
VMACU vd, vs, vt[e]   Vector Multiply-Accumulate of Unsigned Fractions            TODO
VMADH vd, vs, vt[e]   Vector Multiply-Accumulate of High Partial Products         TODO
VMADL vd, vs, vt[e]   Vector Multiply-Accumulate of Low Partial Products          TODO
VMADM vd, vs, vt[e]   Vector Multiply-Accumulate of Mid Partial Products          TODO
VMADN vd, vs, vt[e]   Vector Multiply-Accumulate of Mid Partial Products          TODO
VMUDH vd, vs, vt[e]   Vector Multiply of High Partial Products                    TODO
VMUDL vd, vs, vt[e]   Vector Multiply of Low Partial Products                     TODO
VMUDM vd, vs, vt[e]   Vector Multiply of Mid Partial Products                     TODO
VMUDN vd, vs, vt[e]   Vector Multiply of Mid Partial Products                     TODO
VMULF vd, vs, vt[e]   Vector Multiply of Signed Fractions                         TODO
VMULQ vd, vs, vt[e]   Vector Multiply MPEG Quantization                           TODO
VMULU vd, vs, vt[e]   Vector Multiply of Unsigned Fractions                       TODO
VRNDN vd, vs, vt[e]   Vector Accumulator DCT Rounding (Negative)                  TODO
VRNDP vd, vs, vt[e]   Vector Accumulator DCT Rounding (Positive)                  TODO
VRCP vd[de], vt[e]    Vector Element Scalar Reciprocal (Single Precision)         TODO
VRCPL vd[de], vt[e]   Vector Element Scalar Reciprocal (Double Prec. Low)         TODO
VRSQ vd[de], vt[e]    Vector Element Scalar SQRT Reciprocal                       TODO
VRSQH vd[de], vt[e]   Vector Element Scalar SQRT Reciprocal (Double Prec. High)   TODO
VRSQL vd[de], vt[e]   Vector Element Scalar SQRT Reciprocal (Double Prec. Low)    TODO
VMOV vd[de], vt[e]    Vector Element Scalar Move                                  TODO
VMRG vd[de], vt[e]    Vector Select Merge                                         TODO
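
A few of the rows with pseudo code can be tried out directly. The Python sketch below transcribes VAND, VNOR and VABS from the table, with the lane pattern for vt passed in as a list (for example [0]*8 to broadcast the first element). It follows the simplified pseudo code only and wraps results to 16 bits; flags, the accumulator and any special cases of the real hardware are not modelled, and the function names are illustrative.

```python
ALL_LANES = list(range(8))


def to_signed(x):
    """Interpret a 16-bit value as a signed short."""
    return x - 0x10000 if x & 0x8000 else x


def vand(vs, vt, lanes=ALL_LANES):
    return [vs[i] & vt[j] for i, j in enumerate(lanes)]


def vnor(vs, vt, lanes=ALL_LANES):
    return [~(vs[i] | vt[j]) & 0xFFFF for i, j in enumerate(lanes)]


def vabs(vs, vt, lanes=ALL_LANES):
    """The sign of vs[i] selects -vt[j], 0 or vt[j], as in the table."""
    out = []
    for i, j in enumerate(lanes):
        s = to_signed(vs[i])
        if s < 0:
            out.append((-vt[j]) & 0xFFFF)
        elif s == 0:
            out.append(0)
        else:
            out.append(vt[j])
    return out


print([hex(x) for x in vand([0xFF00] * 8, [0x0FF0] * 8)])
print([hex(x) for x in vabs([0xFFFF, 0, 1, 1, 1, 1, 1, 1], [5] * 8)])
```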

Hazards

Mary Jo's Rules

Avoiding pipeline stalls in software can be accomplished by understanding the following rules.

1.) VU register destination writes 4 cycles later (need 3 cycles between load and use). 
    This applies to vector computational instructions, vector loads, and coprocessor 2 moves (mtc2).
    
      For example,
    
      llv $v0[0], 0x00($r0) | llv $v0[0], 0x00($r0)
      /* Stall here */      | llv $v2[0], 0x04($r0)
      /* Stall here */      | llv $v3[0], 0x08($r0)
      /* Stall here */      | llv $v4[0], 0x0C($r0)
      vadd $v1, $v0, $v0    | vadd $v1, $v0, $v0
    
      Both of these sections of code take up the same number of cycles in the RSP. It is better to have 
      unrelated instructions in-between the vector opcodes to improve efficiency, even though it might
      reduce code readability. 
    
2.) SU register load takes 3 cycles (need 2 cycles between load and use). This applies to SU loads 
    and coprocessor moves (mfc0, cfc2, mfc2). SU computational results are available in the next cycle.
    
      lw $t0, 0x00($r0) | lw $t0, 0x00($r0)
      /* Stall here */  | lw $t4, 0x04($r0)
      /* Stall here */  | add $t6, $t2, $t1
      addi $t1, $t0, 1  | addi $t1, $t0, 1
    
      Like rule #1, inserting unrelated instructions between a load and a use is more efficient. 
      However, the following is ok due to pipeline forwarding:
    
      addi $t1, $t0, 1
      sw $t1, 0x00($r0)
    
    
3.) Any load followed by any store 2 cycles later causes a one-cycle bubble. Coprocessor moves 
    (mtc0, mfc0, mtc2, mfc2, ctc2, cfc2) count as both loads and stores.
    
      lw $t0, 0x40($r0)
      addi $t1, $t1, 0x01
      /* Stall here due to the lw above */
      sw $t1, 0x30($r0)
    
4.) A branch target not 64-bit aligned always single issues.

5.) Branches:
  - Can dual issue (with preceding instruction).
  - No branch instruction permitted in a delay slot.
  - Delay slot always single issues.
  - Taken branch causes a 1 cycle bubble.
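
Rules 1 and 2 can be explored with a very small scheduling model. The Python sketch below is a simplification made up for this page: it issues one instruction per cycle, stalls an instruction until all of its source registers are ready, and uses the latencies quoted in rules 1 and 2 (4 cycles for VU results, 3 for SU loads, 1 for SU computational results). Dual issue and rules 3 to 5 are deliberately ignored, so it is not a cycle-accurate RSP model.

```python
# Latencies taken from rules 1 and 2 above; "alu" models SU computational results.
LATENCY = {"vu": 4, "load": 3, "alu": 1}


def issue_cycles(program):
    """Return the cycle at which each instruction issues.

    program is a list of (kind, dest, sources) tuples; dest is None for stores.
    """
    ready = {}   # register name -> first cycle at which it can be read
    cycle = 0    # earliest cycle at which the next instruction may issue
    issued = []
    for kind, dest, sources in program:
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        issued.append(start)
        if dest is not None:
            ready[dest] = start + LATENCY[kind]
        cycle = start + 1
    return issued


# The two columns of the rule 1 example: stalls versus unrelated loads.
stalled = [("vu", "v0", []), ("vu", "v1", ["v0", "v0"])]
filled = [("vu", "v0", []), ("vu", "v2", []), ("vu", "v3", []),
          ("vu", "v4", []), ("vu", "v1", ["v0", "v0"])]
print(issue_cycles(stalled))
print(issue_cycles(filled))
```

In both cases the final vadd issues in the same cycle, which is the point of rule 1. Applied to the six dependent vector instructions of the distance example, the same model gives issue cycles 0, 4, 8, 12, 16 and 20, with the last result ready at cycle 24, in line with the 24 cycles quoted there.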
rsp.txt · Last modified: 2019/04/06 22:15 by David