Reality Signal Processor Overview (MIPS R4300)

The RSP is a part of the N64's Reality Co-Processor (RCP), which is a slave processor to the main R4300i CPU.

Differences from the R4300i

Runs at 2/3 the speed of the R4300i, specifically at 62.5 MHz.
Does not have a floating-point unit, but instead has a vector unit to do operations in parallel.
Runs instructions from IMEM (4 KB), so the program counter is only 12 bits wide. Any higher bits in an address are ignored.
Cannot directly access RDRAM; the programmer must initiate a DMA to transfer data between RDRAM and DMEM (4 KB).
The general purpose registers are 32 bits wide, not 64 bits.
Has no support for interrupts, traps, or exceptions.
Modified Instructions:
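- ADD/ADDU, ADDI/ADDIU, and SUB/SUBU behave identically within each pair, since the RSP does not generate an overflow exception. (SLTI and SLTIU never raise an overflow exception and still differ in signed versus unsigned comparison.)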
- BREAK does not generate a trap; instead, it sets bits in the RSP status register and signals an interrupt to the CPU.
Missing Instructions:
- All 64-bit instructions
  - LD, SD, LDC1, LDC2, SDC1, SDC2, DADDI, DADDIU, DSLLV, DSRLV, DSRAV, DMULT, DMULTU, DDIV, DDIVU, DADD, DADDU, DSUB, DSUBU, DSLL, DSRL, DSRA, DSLL32, DSRL32, DSRA32
- Load/store left and right, and load linked
  - LDL, LDR, LWL, LWR, LWU, SWL, SDL, SDR, SWR, LL, LLD
- Store conditionals
  - SC, SCD
- Likely branches
  - BEQL, BNEL, BLEZL, BGTZL, BLTZL, BGEZL, BLTZALL, BGEZALL
- HI and LO register moves
  - MFHI, MTHI, MFLO, MTLO
- Multiply/Divide
  - MULT, MULTU, DIV, DIVU
- SYSCALL, since the RSP does not generate exceptions
- SYNC
- Branch on co-processor
  - BCzF, BCzT
- Trap
  - TGE, TGEU, TLT, TLTU, TEQ, TNE, TGEI, TGEIU, TLTI, TLTIU, TEQI, TNEI
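Two of the differences listed above, the 12-bit program counter and the lack of an overflow exception, can be pictured with a small model. This is only a sketch: `next_pc`, `jump_target`, and `rsp_add` are hypothetical helper names chosen for illustration, and it assumes word-aligned addresses and a PC that simply wraps within IMEM.

```python
IMEM_SIZE = 0x1000        # 4 KB of instruction memory
PC_MASK = IMEM_SIZE - 1   # only the low 12 bits of the PC exist


def next_pc(pc: int) -> int:
    """Advance by one instruction; the PC wraps inside IMEM."""
    return (pc + 4) & PC_MASK


def jump_target(address: int) -> int:
    """Higher address bits are ignored when jumping."""
    return address & PC_MASK


def rsp_add(a: int, b: int) -> int:
    """ADD and ADDU both wrap modulo 2**32; no overflow exception is raised."""
    return (a + b) & 0xFFFFFFFF
```

For instance, under these assumptions a jump to 0x04001080 lands at IMEM offset 0x080, and adding 1 to 0x7FFFFFFF yields 0x80000000 rather than trapping.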

IMEM and DMEM

The RSP contains two 4-kilobyte caches, referred to as IMEM (Instruction Memory) and DMEM (Data Memory).

Like the names suggest, IMEM can only be used for RSP instructions and DMEM can only be used for data. These caches can only be accessed by the RSP, and cannot be accessed externally.

Cache	Virtual address range
DMEM	0x04000000:0x04000FFF
IMEM	0x04001000:0x04001FFF
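The address ranges in the table can be turned into a small lookup. The sketch below is illustrative; `decode` is a hypothetical helper, and it only covers the two ranges listed above.

```python
DMEM_BASE = 0x04000000
IMEM_BASE = 0x04001000
MEM_SIZE = 0x1000  # each memory is 4 KB


def decode(address: int) -> tuple[str, int]:
    """Map an address from the table to (memory name, byte offset)."""
    if DMEM_BASE <= address < DMEM_BASE + MEM_SIZE:
        return ("DMEM", address - DMEM_BASE)
    if IMEM_BASE <= address < IMEM_BASE + MEM_SIZE:
        return ("IMEM", address - IMEM_BASE)
    raise ValueError(f"0x{address:08X} is outside DMEM and IMEM")
```

Because each memory is exactly 4 KB, the byte offset always fits in 12 bits.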

Vector Unit

Registers

There are 32 general purpose vector registers, each of them 128 bits wide.

VU register format
Element 0		Element 1		Element 2		Element 3		Element 4		Element 5		Element 6		Element 7
Byte 0	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6	Byte 7	Byte 8	Byte 9	Byte 10	Byte 11	Byte 12	Byte 13	Byte 14	Byte 15

A VU register can be accessed in different ways depending on the instruction.

For most computational instructions, like add or multiply, vector elements are used. This allows the RSP to perform 8 computational operations in one instruction. For example,

// Let's say that $v0 contains [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
//           and $v1 contains [0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100, 0x0100]

vadd $v2, $v0, $v1 // Vector add instruction

// $v2 now contains [0x0100, 0x0101, 0x0102, 0x0103, 0x0104, 0x0105, 0x0106, 0x0107]
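The same lane-by-lane addition can be modeled on a host machine. This is a simplified sketch: the `vadd` function below is a stand-in that wraps each lane to 16 bits and ignores carry flags and any clamping the hardware may apply.

```python
def vadd(vs: list[int], vt: list[int]) -> list[int]:
    """Add eight 16-bit lanes pairwise, wrapping each result to 16 bits."""
    return [(a + b) & 0xFFFF for a, b in zip(vs, vt)]


v0 = [0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007]
v1 = [0x0100] * 8
v2 = vadd(v0, v1)
print([f"0x{x:04X}" for x in v2])
```

One call processes all eight lanes, which mirrors how a single vector instruction replaces eight scalar additions.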

You can also operate on a subset of a register's elements by using scalar halves (h), scalar quarters (q), or a single element index.

Short form	Long form	Description
$v0	$v0[01234567]	Every element in the vector
$v0[0q]	$v0[00224466]	Even elements
$v0[1q]	$v0[11335577]	Odd elements
$v0[0h]	$v0[00004444]	First and Fifth elements
$v0[1h]	$v0[11115555]	Second and Sixth elements
$v0[2h]	$v0[22226666]	Third and Seventh elements
$v0[3h]	$v0[33337777]	Fourth and Eighth elements
$v0[0]	$v0[00000000]	First element only
$v0[1]	$v0[11111111]	Second element only
$v0[2]	$v0[22222222]	Third element only
$v0[3]	$v0[33333333]	Fourth element only
$v0[4]	$v0[44444444]	Fifth element only
$v0[5]	$v0[55555555]	Sixth element only
$v0[6]	$v0[66666666]	Seventh element only
$v0[7]	$v0[77777777]	Eighth element only

For example,

// Computing 2 distances between 2 pairs of 3D points:
// Distance = sqrt((xa-xb)² + (ya-yb)² + (za-zb)²)
// $v0 = [ xa1, ya1, za1, 0, xa2, ya2, za2, 0 ]
// $v1 = [ xb1, yb1, zb1, 0, xb2, yb2, zb2, 0 ]

vsub $v2, $v0, $v1 // Subtract $v1 from $v0

// $v2 = [ xa1-xb1, ya1-yb1, za1-zb1, 0, xa2-xb2, ya2-yb2, za2-zb2, 0 ]

vmudh $v2, $v2, $v2 // Multiply $v2 by itself

// $v2 = [ (xa1-xb1)², (ya1-yb1)², (za1-zb1)², 0, (xa2-xb2)², (ya2-yb2)², (za2-zb2)², 0 ]

// To make things simpler: xN' = (xaN-xbN)², yN' = (yaN-ybN)², and zN' = (zaN-zbN)²

// Now we add the y terms to the x terms using $v2[1q]
// $v2[1q] = [y1', y1', 0, 0, y2', y2', 0, 0]

vadd $v2, $v2, $v2[1q]

// $v2 = [ x1' + y1', 2(y1'), z1', 0, x2' + y2', 2(y2'), z2', 0 ]

// Then we add the z terms to the x+y terms using $v2[2h]
// $v2[2h] = [z1', z1', z1', z1', z2', z2', z2', z2']

vadd $v2, $v2, $v2[2h]

// $v2 = [ x1' + y1' + z1', 2(y1') + z1', 2(z1'), z1', x2' + y2' + z2', 2(y2') + z2', 2(z2'), z2' ]

// To get the square root of X, you have to compute 1/SQRT(X) and multiply it by X.
// Note: we only care about elements 0 and 4 at this point.

vrsq $v3, $v2[0h] // Calculate 1/SQRT

// rsq1 = 1/sqrt(x1' + y1' + z1'), rsq2 = 1/sqrt(x2' + y2' + z2')
// $v3 = [ rsq1, rsq1, rsq1, rsq1, rsq2, rsq2, rsq2, rsq2 ]

vmudh $v3, $v3, $v2[0h] // Multiply to get final answer

// sqrt1 = sqrt(x1' + y1' + z1'), sqrt2 = sqrt(x2' + y2' + z2')
// $v3 = [sqrt1, sqrt1, sqrt1, sqrt1, sqrt2, sqrt2, sqrt2, sqrt2]

// The distance for the first set of points is in $v3[0] and the distance for the second set is in $v3[4].

// Do note that these 6 instructions will take 24 cycles due to data hazards stalling the pipeline.
// Inserting a few unrelated instructions in-between these instructions will get you better performance.
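To check the lane bookkeeping in the walk-through above, the steps can be replayed on a host machine. The sketch below is an illustration only: it uses Python floats instead of the RSP's 16-bit fixed-point values, and `select` and `distances` are hypothetical helper names.

```python
import math

Q1 = [1, 1, 3, 3, 5, 5, 7, 7]  # lanes read by $v[1q]
H2 = [2, 2, 2, 2, 6, 6, 6, 6]  # lanes read by $v[2h]
H0 = [0, 0, 0, 0, 4, 4, 4, 4]  # lanes read by $v[0h]


def select(v, pattern):
    """Broadcast lanes of v according to an element pattern."""
    return [v[k] for k in pattern]


def distances(va, vb):
    d = [a - b for a, b in zip(va, vb)]                 # vsub
    sq = [x * x for x in d]                             # vmudh (square)
    sq = [x + y for x, y in zip(sq, select(sq, Q1))]    # vadd with [1q]
    sq = [x + z for x, z in zip(sq, select(sq, H2))]    # vadd with [2h]
    total = select(sq, H0)                              # lanes 0 and 4 hold the sums
    rsq = [0.0 if t == 0 else 1.0 / math.sqrt(t) for t in total]  # vrsq
    return [r * t for r, t in zip(rsq, total)]          # vmudh (1/sqrt(X) * X)


# Two point pairs: (3,4,0) vs (0,0,0) and (1,2,2) vs (0,0,0)
result = distances([3, 4, 0, 0, 1, 2, 2, 0], [0] * 8)
print(result[0], result[4])
```

As in the assembly version, only lanes 0 and 4 of the final result are meaningful.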

Commands

For Vector Load/Store Commands, [e] refers to the byte index of the vector register, not the element index.

Vector Load and Store Commands
Command	Definition	Pseudo code
LBV vt[e], offset(base)	Load Byte into Vector Register	vt[e] = (u8)DMEM[base + offset];
LSV vt[e], offset(base)	Load Short into Vector Register	vt[e] = (u16)DMEM[base + offset];
LLV vt[e], offset(base)	Load Long into Vector Register	vt[e] = (u32)DMEM[base + offset];
LDV vt[e], offset(base)	Load Double into Vector Register	vt[e] = (u64)DMEM[base + offset];
LQV vt[0], offset(base)	Load Quad into Vector Register	vt[0] = (u128)DMEM[base + offset];
LRV vt[0], offset(base)	Load Quad (Rest) into Vector Register	vt[0] = (u128)DMEM[base + offset]; // Stops loading when (base + offset) % 16 == 0
LTV vt[e], offset(base)	Load Transpose into Vector Register	TODO
LFV vt[e], offset(base)	Load Packed Fourth into Vector Register	for(i in [0..3]) vt[e+i*2] = ((u8)DMEM[base+offset+i*2]) << 7;
LHV vt[0], offset(base)	Load Packed Half into Vector Register	for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i*4]) << 7;
LPV vt[0], offset(base)	Load Packed Bytes into Vector Register	for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 8;
LUV vt[0], offset(base)	Load Unsigned Packed Bytes into Vector	for(i in [0..7]) vt[i*2] = ((u8)DMEM[base + offset + i]) << 7;
SBV vt[e], offset(base)	Store Byte From Vector Register	DMEM[base + offset] = (u8)vt[e];
SSV vt[e], offset(base)	Store Short From Vector Register	DMEM[base + offset] = (u16)vt[e];
SLV vt[e], offset(base)	Store Long From Vector Register	DMEM[base + offset] = (u32)vt[e];
SDV vt[e], offset(base)	Store Double From Vector Register	DMEM[base + offset] = (u64)vt[e];
SQV vt[0], offset(base)	Store Quad From Vector Register	DMEM[base + offset] = (u128)vt[0];
SRV vt[0], offset(base)	Store Quad (Rest) From Vector Register	DMEM[base + offset] = (u128)vt[0]; // Stops writing when (base + offset) % 16 == 0
STV vt[e], offset(base)	Store Transpose from Vector Register	TODO
SFV vt[e], offset(base)	Store Packed Fourth from Vector Register	for(i in [0..3]) DMEM[base+offset+i*2] = (u8)(vt[e+i*2] >> 7);
SHV vt[0], offset(base)	Store Packed Half from Vector Register	for(i in [0..7]) DMEM[base+offset+i*4] = (u8)(vt[i*2] >> 7);
SPV vt[0], offset(base)	Store Packed Bytes from Vector Register	for(i in [0..7]) DMEM[base+offset+i] = (u8)(vt[i*2] >> 8);
SUV vt[0], offset(base)	Store Unsigned Packed Bytes from Vector	for(i in [0..7]) DMEM[base+offset+i] = (u8)(vt[i*2] >> 7);
SWV vt[0], offset(base)	Store Wrapped from Vector Register	TODO
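The packed byte commands can be pictured with a short host-side model. This sketch follows the LPV, LUV, and SPV pseudo code from the table, treats a vector register as a list of eight 16-bit elements (byte index i*2 is element i), and ignores alignment and wrap-around behavior. The function names are hypothetical stand-ins for the instructions.

```python
def lpv(dmem: bytes, addr: int) -> list[int]:
    """Load Packed Bytes: each byte lands in the upper 8 bits of an element."""
    return [(dmem[addr + i] << 8) & 0xFFFF for i in range(8)]


def luv(dmem: bytes, addr: int) -> list[int]:
    """Load Unsigned Packed Bytes: each byte is shifted left by 7 instead of 8."""
    return [(dmem[addr + i] << 7) & 0xFFFF for i in range(8)]


def spv(elements: list[int]) -> bytes:
    """Store Packed Bytes: write the upper 8 bits of each element."""
    return bytes((e >> 8) & 0xFF for e in elements)


dmem = bytes([0x00, 0x10, 0x20, 0x40, 0x80, 0xC0, 0xF0, 0xFF])
packed = lpv(dmem, 0)
unsigned = luv(dmem, 0)
```

Under this model SPV undoes LPV, while LUV leaves the top bit of each element clear so that byte values stay non-negative when read as signed 16-bit numbers.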

In the pseudo code for some of the following commands, the variable j depends on the scalar element selector e.
(Note that e is a 4-bit number.)

if (e == 0):
  j = i;
elseif ((e & 0b1110) == 0b0010): // Scalar Quarter
  j = (e & 0b0001) + (i & 0b1110);
elseif ((e & 0b1100) == 0b0100): // Scalar Half
  j = (e & 0b0011) + (i & 0b1100);
elseif ((e & 0b1000) == 0b1000): // Scalar Whole
  j = e & 0b0111;
endif
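The selection rule above can be written as a runnable sketch and compared against the short form/long form table in the Registers section. `select_lane` and `expand` are hypothetical names, and treating e == 1 like e == 0 (every element) is an assumption, since the pseudo code does not cover that value.

```python
def select_lane(e: int, i: int) -> int:
    """Return the source lane j of vt that lane i reads, for a 4-bit selector e."""
    if (e & 0b1000) == 0b1000:                   # scalar whole: one element for all lanes
        return e & 0b0111
    if (e & 0b1100) == 0b0100:                   # scalar half
        return (e & 0b0011) + (i & 0b1100)
    if (e & 0b1110) == 0b0010:                   # scalar quarter
        return (e & 0b0001) + (i & 0b1110)
    return i                                     # e == 0 (and, by assumption, e == 1)


def expand(e: int) -> list[int]:
    """List the source lane used by each of the eight destination lanes."""
    return [select_lane(e, i) for i in range(8)]


for e in (0, 2, 3, 4, 6, 8, 15):
    print(f"e={e:2d} ->", "".join(str(j) for j in expand(e)))
```

The printed patterns correspond to the long forms in the table, for example 00224466 for the even-element quarter and 22226666 for the third half.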

For Vector Computational Commands, [e] refers to the element index.

Vector Computational Commands
Command	Definition	Pseudo code
VNOP	Vector Null Instruction	Does nothing. Used for padding.
VABS vd, vs, vt[e]	Vector Absolute Value of Short Elements	for(i in [0..7]) { if (vs[i] < 0) vd[i] = -vt[j]; else if (vs[i] == 0) vd[i] = 0; else if (vs[i] > 0) vd[i] = vt[j]; }
VADD vd, vs, vt[e]	Vector Add of Short Elements	for(i in [0..7]) vd[i] = (vs[i] + vt[j]) & 0xFFFF;
VADDC vd, vs, vt[e]	Vector Add of Short Elements With Carry	for(i in [0..7]) vd[i] = vs[i] + vt[j];
VSUB vd, vs, vt[e]	Vector Subtraction of Short Elements	for(i in [0..7]) vd[i] = (vs[i] - vt[j]) & 0xFFFF;
VSUBC vd, vs, vt[e]	Vector Subtraction of Short Elements With Carry	for(i in [0..7]) vd[i] = vs[i] - vt[j];
VAND vd, vs, vt[e]	Vector AND of Short Elements	for(i in [0..7]) vd[i] = vs[i] & vt[j];
VOR vd, vs, vt[e]	Vector OR of Short Elements	for(i in [0..7]) vd[i] = vs[i] | vt[j];
VXOR vd, vs, vt[e]	Vector XOR of Short Elements	for(i in [0..7]) vd[i] = vs[i] ^ vt[j];
VNAND vd, vs, vt[e]	Vector NAND of Short Elements	for(i in [0..7]) vd[i] = ~(vs[i] & vt[j]);
VNOR vd, vs, vt[e]	Vector NOR of Short Elements	for(i in [0..7]) vd[i] = ~(vs[i] | vt[j]);
VNXOR vd, vs, vt[e]	Vector NXOR of Short Elements	for(i in [0..7]) vd[i] = ~(vs[i] ^ vt[j]);
VSAR vd, vs, vt[e]	Vector Accumulator Read (and Write)	TODO
VCH vd, vs, vt[e]	Vector Select Clip Test High	TODO
VCL vd, vs, vt[e]	Vector Select Clip Test Low	TODO
VCR vd, vs, vt[e]	Vector Select Crimp Test Low	TODO
VEQ vd, vs, vt[e]	Vector Select Equal	TODO
VNE vd, vs, vt[e]	Vector Select Not Equal	TODO
VGE vd, vs, vt[e]	Vector Select Greater Than or Equal	TODO
VLT vd, vs, vt[e]	Vector Select Less Than	TODO
VMACF vd, vs, vt[e]	Vector Multiply-Accumulate of Signed Fractions	TODO
VMACQ vd, vs, vt[e]	Vector Accumulator Oddification	TODO
VMACU vd, vs, vt[e]	Vector Multiply-Accumulate of Unsigned Fractions	TODO
VMADH vd, vs, vt[e]	Vector Multiply-Accumulate of High Partial Products	TODO
VMADL vd, vs, vt[e]	Vector Multiply-Accumulate of Low Partial Products	TODO
VMADM vd, vs, vt[e]	Vector Multiply-Accumulate of Mid Partial Products	TODO
VMADN vd, vs, vt[e]	Vector Multiply-Accumulate of Mid Partial Products	TODO
VMUDH vd, vs, vt[e]	Vector Multiply of High Partial Products	TODO
VMUDL vd, vs, vt[e]	Vector Multiply of Low Partial Products	TODO
VMUDM vd, vs, vt[e]	Vector Multiply of Mid Partial Products	TODO
VMUDN vd, vs, vt[e]	Vector Multiply of Mid Partial Products	TODO
VMULF vd, vs, vt[e]	Vector Multiply of Signed Fractions	TODO
VMULQ vd, vs, vt[e]	Vector Multiply MPEG Quantization	TODO
VMULU vd, vs, vt[e]	Vector Multiply of Unsigned Fractions	TODO
VRNDN vd, vs, vt[e]	Vector Accumulator DCT Rounding (Negative)	TODO
VRNDP vd, vs, vt[e]	Vector Accumulator DCT Rounding (Positive)	TODO
VRCP vd[de], vt[e]	Vector Element Scalar Reciprocal (Single Precision)	TODO
VRCPL vd[de], vt[e]	Vector Element Scalar Reciprocal (Double Prec. Low)	TODO
VRSQ vd[de], vt[e]	Vector Element Scalar SQRT Reciprocal	TODO
VRSQH vd[de], vt[e]	Vector Element Scalar SQRT Reciprocal (Double Prec. High)	TODO
VRSQL vd[de], vt[e]	Vector Element Scalar SQRT Reciprocal (Double Prec. Low)	TODO
VMOV vd[de], vt[e]	Vector Element Scalar Move	TODO
VMRG vd[de], vt[e]	Vector Select Merge	TODO

Hazards

Mary Jo's Rules

Avoiding pipeline stalls in software can be accomplished by understanding the following rules.

1.) VU register destination writes 4 cycles later (need 3 cycles between load and use). 
    This applies to vector computational instructions, vector loads, and coprocessor 2 moves (mtc2).
    
      For example,
    
      llv $v0[0], 0x00($r0) | llv $v0[0], 0x00($r0)
      /* Stall here */      | llv $v2[0], 0x04($r0)
      /* Stall here */      | llv $v3[0], 0x08($r0)
      /* Stall here */      | llv $v4[0], 0x0C($r0)
      vadd $v1, $v0, $v0    | vadd $v1, $v0, $v0
    
      Both of these sections of code take up the same number of cycles in the RSP. It is better to have 
      unrelated instructions in-between the vector opcodes to improve efficiency, even though it might
      reduce code readability. 
    
2.) SU register load takes 3 cycles (need 2 cycles between load and use). This applies to SU loads 
    and coprocessor moves (mfc0, cfc2, mfc2). SU computational results are available in the next cycle.
    
      lw $t0, 0x00($r0) | lw $t0, 0x00($r0)
      /* Stall here */  | lw $t4, 0x04($r0)
      /* Stall here */  | add $t6, $t2, $t1
      addi $t1, $t0, 1  | addi $t1, $t0, 1
    
      Like rule #1, inserting unrelated instructions between a load and a use is more efficient. 
      However, the following is ok due to pipeline forwarding:
    
      addi $t1, $t0, 1
      sw $t1, 0x00($r0)
    
    
3.) Any load followed by any store 2 cycles later causes a one-cycle bubble. Coprocessor moves 
    (mtc0, mfc0, mtc2, mfc2, ctc2, cfc2) count as both loads and stores.
    
      lw $t0, 0x40($r0)
      addi $t1, $t1, 0x01
      /* Stall here due to the lw above */
      sw $t1, 0x30($r0)
    
4.) A branch target not 64-bit aligned always single issues.

5.) Branches:
  - Can dual issue (with preceding instruction).
  - No branch instruction permitted in a delay slot.
  - Delay slot always single issues.
  - Taken branch causes a 1 cycle bubble.
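Rules 1 and 2 can be explored with a toy scheduler. This is a deliberately simplified sketch, not a cycle-accurate model: it assumes single issue, ignores rules 3 to 5, and describes each instruction with a hypothetical `(destination, sources, latency)` tuple.

```python
VU_LATENCY = 4       # rule 1: VU destination is usable 4 cycles after issue
SU_LOAD_LATENCY = 3  # rule 2: SU load result is usable 3 cycles after issue
SU_ALU_LATENCY = 1   # SU computational result is usable in the next cycle


def schedule(program):
    """Return (issue cycle of each instruction, cycle when all results are ready)."""
    ready = {}   # register name -> first cycle its value can be read
    cycle = 0    # earliest cycle the next instruction may issue
    issue = []
    for dest, sources, latency in program:
        # Stall until every source register is available.
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        issue.append(start)
        if dest is not None:
            ready[dest] = start + latency
        cycle = start + 1
    return issue, max(ready.values(), default=cycle)


# Rule 1 example: load followed directly by its use, versus with filler loads.
stalled = [("v0", [], VU_LATENCY), ("v1", ["v0"], VU_LATENCY)]
filled = [("v0", [], VU_LATENCY), ("v2", [], VU_LATENCY), ("v3", [], VU_LATENCY),
          ("v4", [], VU_LATENCY), ("v1", ["v0"], VU_LATENCY)]
print(schedule(stalled), schedule(filled))
```

In this model both versions issue the final vadd on cycle 4, so the three extra loads in the second version cost no additional cycles.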
