1 Introduction

这门课在 UNNC 被分两个部分教学：系统架构 + 网络。系统架构部分使用的教材是 MK Computer Organization and Design 5th Edition 2。

2 Hierarchy, Components & Technology

2.1 Overview

Computer components
Computer systems hierarchy
Technology trends

2.2 Computer Components

2.2.1 Computer Components

Input
- Mouse
- Keyboard
- Touchpad
- ...
Output
- Monitor
- Printer
- ...
Memory
- Volatile main memory
  - Random Access Memory (RAM)
  - CPU cache
- Non-volatile secondary memory
  - Read Only Memory (ROM)
  - Flash memory
  - Optical disk (CDROM, DVD)
Processor (CPU)
- Datapath: the component that performs arithmetic operations
- Control: the component that commands the datapath, memory and I/O devices according t othe instructions of the program

The IAS (von Neumann Machine)

The IAS (von Neumann Machine)

2.2.2 Below Your Program

Application software
- Written in high-level language
System software
- Compiler
  - HLL code => machine code
- Operating system
  - Handling basic I/O operations
  - Managing memory & storage
  - Scheduling tasks & sharing resources
Hardware
- Processor
- Memory
- I/O controller

2.2.3 Levels of Programming Code

Compiler: a program that translates high-level language statements into assembly language statements
Instruction: a command that computer hardware understands and obeys
Assembler: a program that translates a symbolic version of instructions into the binary versions
Assembly language: a symbolic representation of machine instructions
Machine language: a binary representation of machine instructions

2.3 Computer Systems Hierarchy

2.3.1 Computer Systems Hierarchy

2.3.2 Instruction Set Architecture (ISA)

ISA
- A well-defined hardware/software interface
The "contract" between software and hardware
- Allow communications
- Standardize instructions, machine language bit patterns, operation modes, storage locations, etc
Advantage
- Allow different implementations of the same architecture
Disadvantage
- Sometimes prevent adding new innovations
Modern ISAs
- 80x86, MIPS, ARM, PowerPC

2.4 Technology Trends

2.4.1 Basic Logic

Because of the on-or-off nature of the binary information and signal routing the computer uses, an efficient electronic switch is required:

Vacuum tube
Transistor
Integrated circuit (IC)): combine dozens to hundreds of transistor on a single chip
Very large-Scale integrated (VLSI) circuit: a device containing hundreds of thousands to millions of transistors
Ultra large-scale integrated circuit: a device containing over millions of transistors

2.4.2 Trends

Year	Technology	Relative Performance/Cost
1951	Vacuum tube	1
1965	Transistor	35
1975	IC	9*10^3
1995	VLSIC	2.4*10^6
2013	ULSIC	2.5*10^11

2.4.3 Microprocessor

Microprocessor: computer CPU on a single chip (IC)

2.4.4 Modern Multicore Processor

Multicore processor: a single chip contains more than one microprocessor core

2.4.5 Manufacturing Chips

3 Computer Performance

3.1 Overview

Computer performance
Power wall
SPEC Benchmark
Amdahl's Law

3.2 Computer Performance

3.2.1 Response Time and Throughput

Response time (execution time)
- The total time to complete a task
Throughput (brandwidth)
- The number of tasks completed per unit time
Effects
- Replacing the processor in a computer with a faster version
- Adding additional processor to a system that uses multiple processors for separate tasks

3.2.2 Relative Performance

To maximize performance => minimize execution

1	Performance = 1 / Execution time_x

Computer X is n times faster than computer Y:

1	Performance_x / Performance_y = Execution time_y / Execution time_x = n

3.2.3 Measuring Performance

Response time
- The total time to complete a task
- Disk accesses, memory accesses, I/O activities, etc
CPU execution time (CPU time)
- The actual time the CPU spends computing for a specific task
- Exclude time spent waiting for I/O or running other programs
- Further divided into:
  - User CPU time: the CPU time spent in a program itself
  - System CPU time: the CPU time spent in the OS performing tasks on behalf of the program

3.2.4 CPU Clocking

Clock: determines when events take place in hardware
Clock signal: produced by an external oscillator circuit
Clock cycles (clock ticks, clock periods): the time for one clock period, usually of the processor clock, which runs at a constant rate (like 250ps)
Clock frequency (clock rate): cycles per second, which is the inverse of the clock period (like 4.0GHz)

3.2.5 CPU Time

1 2	CPU Time = CPU Clock Cycles * Clock Cycle Time = CPU Clock Cycles / Clock Rate

To improve performance:

Reducing the number of clock cycles
Increasing clock rate
Trade-off: the number of clock cycles & the clock cycle time

3.2.6 Instruction Performance

The execution time of instructions: determined by the number of instructions in a program
The instruction count for a program: determined by program, ISA and compiler
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute, determined by CPU hardware

1	CPU Clock Cycles = Instruction Count * CPI

1 2	CPU Time = Instruction Count * CPI * Clock Cycle Time = Instruction Count * CPI / Clock Rate

If different instruction classes take different numbers of cycles:

1
2
3

               n
Clock Cycles = ∑(CPI_i * Instruction Count_i)
              i=1

Weighted average CPI:

CPI = CPU Clock Cycles / Instruction Count
      n
    = ∑(CPI_i * Instruction Count_i / Instruction Count)
     i=1

Instruction Count: relative frequency

3.2.7 Understanding Program Performance

Algorithm
- Affects insturction count
- Affects CPI (possibly) - by favouring slower or faster instructions
Programming language
- Affects instructions count
- Affects CPI - beacause of its own features
Compiler
- Affects instructions count
- Affects CPI
ISA
- Affects instructions count
- Affects CPI, clock rate

3.3 Power Wall

Complememtray metal oxide semiconductor (CMOS): the dominant technology for integrated circuits
Dynamic energy: the primary source of energy consumption for CMOS, consumed when transistors switch states from 0 to 1 and vice versa.
1
Energy ∝ Capacitive load * Voltage^2

The energy of a single transition:

1	Energy ∝ 0.5 * Capacitive load * Voltage^2

The power required per transistor:

1	Power ∝ 0.5 * Capacitive load * Voltage^2 * Frequency

3.4 SPEC Benchmark

System Performance Evalucation Cooperative (SPEC): the standard sets of benchmarks for modern computer systems
SPEC CPU2006

SPEC CPU2006
SPEC Power Benchmark

SPEC Power Benchmark

3.5 Amdah's Law

3.5.1 Amdah's Law

Amdah's Law: the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.

The execution time of the program after the improvement:

1	T_improved = T_affected / improvement factor + T_unaffected

3.5.2 Law of Diminishing Returns

3.5.3 Low Power at Idle

(NEED FURTHER READING)

4 MIPS32 Programming

4.1 Overview

MIPS introduction
MIPS structure
MIPS syntax & instruction
MIPS basic arithmetic
MIPS branching
MIPS logical operations

4.2 MIPS Introduction

4.2.1 Introduction

Microprocessor without Interlocked Processor States (MIPS)
- ISA based on Reduced Instruction Set Computing (RISC) CPU design strategy
- Simple instruction, fast execution
- 1980s 32-bit, 1991 64-bit

4.2.2 Bits, Bytes and Words

Bit: binary digit (either 0 or 1)
Byte: a sequence of bits (having 8 bits in length since the mid 1960's to get 256 possibilities)
Word: amount of data computer can process in one step. (Most CPUs have a word size of 32 or 64 bits today. On the 32-bit MIPS, a word is 4 bytes long.)

4.3 MIPS Structure

4.3.1 MIP32 Structure

1 word = 32 bits = 4 bytes
Addresses are 32-bit words => (2^32 different locations)
Registers are 32-bit words

4.3.2 Fetch-Execute Cycle

Fetch the instruction: fetch the instruction from the memory address that is currently stored in the program counter (PC), and store it in the instruction regitser (IR).
Decode the instruction: the encoded instruction presented in the IR is interpreted by the decoder
Execute the instruction
Write back

Repeat {
    Fetch (PC):
        Fetch the instruction word (at PC)
        Instruction decoded
        Calculate next instruction address
    Execute (ALU, Registers and Control):
        Read operands
        Executes the operations
        Write/store results
}

4.4 MIPS Syntax & Instruction

4.4.1 MIPS Programming: "Hello, world!"

		.data
msg:	.asciiz "Hello, world!\n"
		.text
		.globl main
main:
		la $a0, msg		# load label msg
		li $v0, 4		# load immediate
		syscal			# print it
		li $v0, 10
		syscall			# exit

4.4.2 MIPS Basic Syntax

Assembler directives begin with a "."
- .data
  - Start assembling data used by the program
- .text
  - Start assembling program instructions
- .asciiz
  - Place a null-terminated ASCII string in memory
Labels are names followed by a colon ":"
- Descriptive names chosen by programers
- The assembler will assign memory addresses to labels for later reference
- The label "main" is the entry of the program
Machine instructions
- Lines after "main:" contain symbolic machine instructions
Comments begin with a "#"

4.4.3 MIPS Registers

Name	Register Number	Use
$zero	0	Constant 0
$at \| 1 \| Reserved for assembler \| \| \$v0-$v1	2-3	Function select, return result
$a0-$a3 \| 4-7 \| Arguments \| \| \$t0-$t7	8-15	temporaries
$s0-$s7 \| 16-23 \| saved temporaries \| \| \$t8-$t9	24-25	temporaries
$k0-$k1	26-27	kernel registers
$gp	28	global heap pointer
$sp	29	stack pointer
$fp	30	frame pointer
$ra	31	return address

4.4.4 MIPS Instruction Field

General R-format:

	op	rs	rt	rd	shamt	funct
bits	6	5	5	5	5	6

op - operation code (basic operation of the instruction)
rs - first source register operand
rt - second source register operand
rd - destination register operand
shamt - shift amount (used in shift instruction)
funct - function code (select the variant of the operation in the op code field)

General I-format:

	op	rs	rt	constant or address
bits	6	5	5	16

4.4.5 MIPS Instruction Format

General syntax:

3 Operands	2 Operands	Other
op dst, src, src	op dst, src	op
op dst, src, imm	op dst, imm	op src

dst - destination register
src - source register
imm - immediate value (16 bits)

4.5 MIPS Arithmetic

4.5.1 Add

1 2	# rd = rs1 + rs2 add rd, rs1, rs2

Example: add $s0, $t0, $t1

	Before	After
$s0	4	11
$t0	5	5
$t1	6	6

4.5.2 Subtract

1 2	# rd = rs1 - rs2 sub rd, rs1, rs2

Example: sub $s0, $t0, $t1

	Before	After
$s0	4	-1
$t0	5	5
$t1	6	6

4.5.3 Add Immediate

1 2	# rd = rs1 + imm addi rd, rs1, imm

Example: addi $s0, $t0, 100

	Before	After
$s0	4	105
$t0	5	5

4.5.4 Assignment

1 2	move a, b #Assign: a = b

The instruction is accepted by MIPS assembler, but it is called the pseudo-instruction, which will be translated into:
1
add a, b, $zero

4.5.5 System Call

To request an operating-system-like service:

Load the service number in register $v0
Load argument values, if any, in $a0, …, $a3 (or $f12 for floating point values)
Issue the SYSCALL instruction
Return values and if any, put the result in register $v0 (or $f0 for floating point values)

May destroy $v0-$v1, $a0-$a3, $t0-$t9, $ra, but always preserves $s0-$s7

(System services)

4.6 MIPS Branching

4.6.1 Branch Instructions

beq a, b, l - Branch on equal
- Goto instruction at label l if a==b, otherwise continue
bne a, b, l - Branch on not equal
- Goto instruction at label l if a!=b, otherwise continue
j l - Jump to
- Jump to instruction at label l

4.6.2 If-then-else

Example:

i	j	f	g	h
$s0	$s1	$s2	$s3	$s4

# if (i==j) then {f=g+h} else {f=g-h}

    bne $s0, $s1, Else
    add $s2, $s3, $s4
    j Exit
Else: sub $s2, $s3, $s4
Exit: ...

4.6.3 While

Example:

i	j
$s0	$s1

# while (i!=0) {i=i-1; j=j+3;}

loop: beq $s0, $zero, wexit
      addi $s0, $s0, -1
      addi $s1, $s1, 3
      j loop
wexit: ...

4.6.4 Other Comparisons

slt a, b, c - set on less than
- Set register a to 1, if b<c, otherwise to 0. (a = b<c ? 1 : 0)
slti a, b, n - set on less than immediate
- Set register a to 1, if b<n, otherwise to 0. (a = b<n ? 1 : 0)

4.6.5 Example: Maximum of Two Numbers

		.text
main:	li $v0, 5
		syscall				# read integer
		move $s0, $v0		# store in s0
		li $v0, 5
		syscall				# read integer
		slt $t0, $v0, $s0	# if $v0<$s0, $t0=1
		bne $t0, $zero, out	# goto out, if $t0!=0, ($t0=1, $v0<$s0)
		move $s0, $v0		# $s0=$v0, otherwise ($v0>=$s0)
out:	move $a0, $s0		# print max
		li $v0, 1
		syscall				# print integer
		li $v0, 10
		syscall				# exit

4.7 MIPS Logical Operations

4.7.1 Bitwise Logical Operations

OR, AND, NOR, XOR, ...

4.7.2 Logical Operators

or, and, nor, xor, ...

4.7.3 MIPS Logical Operations

# rd = rs and rt
and rd, rs, rt

# rd = rs and imm
andi rd, rs, imm

# rd = rs or rt
or rd, rs, rt

# rd = rs or imm
ori rd, rs, imm

# rd = rs nor rt
nor rd, rs, rt

# no nori operation

# rd = not rs (implementing negation using nor)
nor rd, rs, $zero

# rd = rs xor rt
xor rd, rs, rt

# rd = rs xor imm
xori rd, rs, imm

5 Sign Numbers, Overflow, Multiplication and Division

5.1 Overview

Sign numbers
Overflow
MIPS shift opreations
MIPS multiplication
MIPS division

5.2 Sign Numbers

5.2.1 4-Bit Signed Encodings

(See details in CSF.)

Binary	Unsigned	Sign and Magnitude	1's Complement	2's Complement	Excess-8
0000	0	0	0	0	-8
0001	1	1	1	1	-7
0010	2	2	2	2	-6
0011	3	3	3	3	-5
0100	4	4	4	4	-4
0101	5	5	5	5	-3
0110	6	6	6	6	-2
0111	7	7	7	7	-1
1000	8	-0	-7	-8	0
1001	9	-1	-6	-7	1
1010	10	-2	-5	-6	2
1011	11	-3	-4	-5	3
1100	12	-4	-3	-4	4
1101	13	-5	-2	-3	5
1110	14	-6	-1	-2	6
1111	15	-7	-0	-1	7

5.2.2 Sign Extension

Sign extension: the operation of increasing the number of bits of a binary number while preserving the number's sign(+/-) and value

Examples:

Befor Extension	After Extension
00 1010	0000 0000 0000 1010
11 1111 0001	1111 1111 1111 0001

5.3 Overflow

5.3.1 Problem

Overflow: happens when the last carry(1) cannot be accommodated.

Situations:

Operation	Result
+ + +	-
- + -	+
+ - -	-
- - +	+

5.3.2 Solution in MIPS

Exception Program Counter (EPC): a register in MIPS to contain the address of the instruction that caused the exception.

The instruction move from system control (mfc0) is used to copy EPC into a general-purpose register so that MIPS software has the option of returning to the offending instruction via a jump register instruction.

Causing exceptions on overflow:
- add
- addi
- sub
Not causing exceptions on overflow:
- addu (add unsigned)
- addiu
- subu

5.3.3 References

MIPS Overflow Exception

MIPS can trap on overflow, but unlike many other computers, there is no conditional branch to test overflow. A sequence of MIPS instructions can discover overflow. For signed addition, the sequence is the following:

addu $t0, $t1, $t2          # $t0 = sum, but don’t trap

# if (signs of $t1 & $t2 differ) then {goto No_overflow}
xor $t3, $t1, $t2           # $t3 = signs of $t1 & $t2 differ ? <0 : >0
slt $t3, $t3, $zero         # $t3 = $t3<0 ? 1 : 0
bne $t3, $zero, No_overflow # $t3 == 1 && {goto No_overflow}

# if (signs of $t1 & $t2 equal, but sign of sum differs) then {goto Overflow}
xor $t3, $t0, $t1           # $t3 = sign of sum differs ? <0 : >0
slt $t3, $t3, $zero         # $t3 = $t3<0 ? 1 : 0
bne $t3, $zero, Overflow    # $t3 == 1 && {goto Overflow}

For unsigned addition ($t0 = $t1 + $t2), the test is

addu $t0, $t1, $t2      # $t0 = sum, but don’t trap

nor $t3, $t1, $zero     # $t3 = NOT $t1 = (2's comp of $t1) - 1 = 2^32 - $t1 - 1
sltu $t3, $t3, $t2      # $t3 = (2^32 – $t1 – 1) < $t2 ? 1 : 0
                        # ⇒ 2^32 – 1 < $t1 + $t2
bne $t3,$zero,Overflow  # if(2^32 –1<$t1+$t2) goto overflow

5.4 MIPS Shift Operations

5.4.1 Shift Operations

Examples:

Original Bits	Shifted Left	Shifted Right
0101 (5)	1010 (10)	0010 (2)
1100 (-4)	1000 (-8)	0110 (6) => 1110 (-2)
1011 (-5)	0110 (6) => 1110(-2) [overflow]	0101 (5) => 1101 (-3)

For unsigned numbers:
- Shift left corresponds to multiplication by 2
- Shift right corresponds to division by 2, ignoring the remainder
For signed numbers:
- On 2's complement, shift left is multiplication by 2 if no overflow
- (Shift right corresponds to division by 2 given sign extension, plus the remainder?)
- Arithmetical shift right performs sign extension when shifting (to be half)

5.4.2 MIPS Logical Shift Operations

# Shift left logical
# $s0 = $s1 << 2
sll $s0, $s1, 2

# Shift right logical
# $s0 = $s1 >> 3
srl $s0, $s1, 3

5.4.3 MIPS Arithmetic Shift Operations

# Shift right arithmetical
# $s0 = $s1 >> 2 (performing sign extension)
sra $s0, $s1, 2

# No need for a sla when using 2's complement number representation

5.5 Multiplication

5.5.1 Multiplication

m-bit number * n-bit number = m+n-bit number

Example: 1101 * 1011 (Long Multiplication)

           1 1 0 1           13
*          1 0 1 1           11
------------------
           1 1 0 1
         1 1 0 1
       0 0 0 0
+    1 1 0 1
------------------
=  1 0 0 0 1 1 1 1          143

5.5.2 MIPS Multiplication Instructions

MIPS stores the 64-bit result of the multiplication of two 32-bit registers in two special 32-bit registers Hi (32~63 bits) and Lo (0~31 bits).

# Hi~Lo = $s1 * $s2 (signed)
mult $s1, $s2

# Hi~Lo = $s1 * $s2 (unsigned)
multu $s1, $s2

# Store the lower 32-bit of the result of the previous operation in register $s0
mflo $s0

# Store the higher 32-bit of the result of the previous operation in register $s0
mfhi $s0

# Since we are commonly only interested in the 32-bit result of a multiplication, the sequence:
mult $s1, $s2
mflo $s0
# can be encoded as one operation:
mul $s0, $s1, $s2

# Note that mul doesn't check for overflow, but a pseudo-insturtion multiply with overflow `mulo` does

5.6 MIPS Division

5.6.1 Division

dividend = quotient * divisor + remainder

Algorithm: repeatedly try to subtract the left-shifted divisor from the dividend, shifting the divisor right each time

Example:

(Long Division Decimal:)

(Long Division Binary:)

Carry out division for absolute values of the dividend and divisor
Negate the quotient if the signs disagree
The remainder must have the same signs as dividend

5.6.2 MIPS Division Instructions

MIPS stores the 64-bit result of the division of two 32-bit registers in two special 32-bit registers Hi (remainder) and Lo (quotient).

# Hi~Lo = $s1 ÷ $s2 (signed)
div $s1, $s2

# Hi~Lo = $s1 ÷ $s2 (unsigned)
divu $s1, $s2

# Store the quotient of the previous operation in register $s0
mflo $s0

# Store the remainder of the previous operation in register $s0
mfhi $s0

# The instruction:
div $s0, $s1, $s2
# corresponds to:
div $s1, $s2
mflo $s0
# i.e.: $s0 = $s1 / $s2

6 Machine Code, Addressing

6.1 Overview

Machine code VS. assembly language
MIPS instruction encodings
Memory organization
MIPS addressing modes
Stack and Heap
Dynamic data

6.2 Machine Code VS. Assembly Language

# Assembly
add $s0, $s1, $s2

# Hexadecimal (machine code)
0232 8020

# Binary
0000 0010 0011 0010 1000 0000 0010 0000

6.3 MIPS Instruction Encodings

6.3.1 R-format - Operands Are Registers Only

	op	rs	rt	rd	shamt	funct
bits	6	5	5	5	5	6

op - operation code (basic operation of the instruction)
rs - first source register operand
rt - second source register operand
rd - destination register operand
shamt - shift amount (used in shift instruction)
funct - function code (select the variant of the operation in the op code field)

6.3.2 I-format - Operands Contain Immediate

	op	rs	rd	imm
bits	6	5	5	16

op - operation code
rs - first source register operand
rt - second source register operand
imm - immediate constant

6.3.3 J-format - Operands Are Immediate Only

	op	imm
bits	6	26

op - operation code
imm - immediate constant

6.3.4 Examples

``` # Assembly add $s0, $s1, $s2

Hexadecimal (machine code)

0232 8020


  |   op   |  rs   |  rt   |  rd   | shmt  | funct  |
  | :----: | :---: | :---: | :---: | :---: | :----: |
  | 000000 | 10001 | 10010 | 10000 | 00000 | 100000 |
  |   R    |  $s1  |  $s2  |  $s0  |       |  add   |

- ```
  # Assembly
  sub $s0, $s1, $s2

  # Hexadecimal (machine code)
  0233 8022

op	rs	rt	rd	shmt	funct
000000	10001	10010	10000	00000	100010
R	$s1	$s2	$s0		sub

``` # Assembly addi $t0, $t1, 100

Hexadecimal (machine code)

2128 0064


  |   op   |  rs   |  rd   |         imm         |
  | :----: | :---: | :---: | :-----------------: |
  | 001000 | 01001 | 01000 | 0000 0000 0110 0100 |
  |  addi  |  $t1  |  $t0  |                     |

### 6.3.5 Decoding Machine Code

![MIPS opcode map](https://tvax4.sinaimg.cn/large/006tNc79gy1flc1g5d0irj30ym13s163.jpg)

- Convert hexadecimal to binary to find the op fields
- Look at the op filed (bits 31:26) to determine the operation
- Reformat the binary instruction into corresponding format fields
- Decode the operation type, instruction by checking the field values

### 6.3.6 Register VS Memory

- Processor can access registers directly
- Limit of 32 registers
- Most programs require much more data than can be held in registers (like text, code and images)
- Extra data must kept in memory
  - **Load**: transfer data from memory to register
  - **Store**: transfer data from register to memory

![](https://tvax4.sinaimg.cn/large/006tNc79gy1flf08zfgmlj30ko09cglz.jpg)

(Memory hierarchy)

> #### Key Differences Between Register and Memory
>
> 1. The primary difference between register and memory is that register **holds the data that the CPU is currently processing** whereas, the memory **holds the data the that will be required for processing**.
> 2. The Register ranges from **32-bits register to 64-bits register** whereas, the memory capacity ranges from some **GB** to some **TB**.
> 3. The processor accesses register **faster** than the memory.
> 4. Computers registers are **accumulator register, program counter, instruction register, address register**, etc. On the other hands, memory is referred as the main memory of the computer which is **RAM**.

## 6.4 Memory Organization

### 6.4.1 Memory

Memory is organized as a sequence of bytes.

| Address |    Value    |
| :-----: | :---------: |
|    0    | 8-bit value |
|    1    | 8-bit value |
|    2    | 8-bit value |
|   ...   |     ...     |

Registers hold 32 bits (= 4 bytes = 1 word).

| Address |    Value     |
| :-----: | :----------: |
|    0    | 32-bit value |
|    4    | 32-bit value |
|    8    | 32-bit value |
|   ...   |     ...      |

### 6.4.2 Data Transfer Instruction: Load Word

# $t0 = Memory[$t1+n] lw $t0, n($t1)

The address has to be divisible by 4 (word aligned).

The offset n has to fit within 16 bit, i.e. -32768 <= n <= 32767.


Example: `lw $t0, 4($s0)`

=> `$t0 = Memory[$s0 + 4] = Memory[10000000_(16) + 4] = 110`

|                       |     Before     |     After      |
| :-------------------: | :------------: | :------------: |
| Memory[10000000_(16)] |      100       |      100       |
| Memory[10000004_(16)] |      110       |      110       |
|          $s0          | 10000000\_(16) | 10000000\_(16) |
|          $t0          |       10       |    **110**     |

### 6.4.3 Data Transfer Instruction: Store Word

Memory[$t1+n] = $t0

sw $t0, n($t1)

The address has to be divisible by 4 (word aligned).

The offset n has to fit within 16 bit, i.e. -32768 <= n <= 32767.


Example: `sw $t0, 4($s0)`

=> `Memory[$s0 + 4] = Memory[10000000_(16) + 4] = $t0 = 10`

|                       |     Before     |     After      |
| :-------------------: | :------------: | :------------: |
| Memory[10000000_(16)] |      100       |      100       |
| Memory[10000004_(16)] |      110       |     **10**     |
|          $s0          | 10000000\_(16) | 10000000\_(16) |
|          $t0          |       10       |      110       |

### 6.4.4 Transferring Bytes and Halfwords

- ASCII characters: 1 byte
- Halfword: 16 bits (= 2 bytes)

Load (unsigned) byte

lb rt, address lbu rt, address

Load (unsigned) halfword

lh rt, address lhu rt, address

Store byte

sb rt, address

Store halfword

sh rt, address

The addresses for halfwordshave to be divisible by 2.

There is no restriction for byte addresses.


_Complete data transfer instruction list see: pp A-49 ~ A-80._

### 6.4.5 32-Bit Immediate and Addresses

- Problem:

  - MIPS ISA uses fixed size 32-bit instructions, so there are always < 32 bits available for any instruction arguments
  - However, sometimes 32-bit constants and addresses are necessary

- Solution:

  - Large constants and 32-bit addresses have to be broken into pieces and then reassembled into a register

- Example:

  - Load the following 32-bit constant into register `$s0`:

    `0000 0100 1011 0010 0001 0000 1101 0011`

# STEP 1: load upper immediate
# lui $s0, imm
lui $s0, 1202   # 1202 = 0000 0100 1011 0010

# Now, $s0 = 0000 0100 1011 0010 0000 0000 0000 0000

# STEP 2: add(or) the lower immediate
# ori $s0, $s0, imm
ori $s0, $s0, 4307   # 4307 = 0001 0000 1101 0011

# Now, $s0 = 0000 0100 1011 0010 0001 0000 1101 0011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67

    _Do not use `addi` in STEP 2. Because `addi` is with sign-extend. If the immediate value is negative, then the upper 16 bits will be all 1's, and adding this will ruin the upper 16 bits._

## 6.5 MIPS Addressing Modes

### 6.5.1 Register Addressing

Register addressing: a source or destination operand is specified as content of one of the registers `$0`~`$31`.

E.g. `add $s0, $s1, $s2`

### 6.5.2 Immediate Addressing

Immediate addressing: a numeric value embedded in the instruction is the actual operand.

E.g. `addi $s0, $s1, 100`

### 6.5.3 PC-Relative addressing

PC-relatvie addressing: a data or instruction memory location is specified as an offset relatvie to the incremented PC.

(PC refers to special purpose register , it stores the address of next instruction to be fetched: the operand `address = PC + offset`. Usually used in conditional branch.)

E.g. `bne $s0, $s1, Label`

### 6.5.4 Base Addressing

Base addressing: a data or instruction memory location is specified as a signed offset from a register. The address of the operand is the sum of the immediate and the value in a register.

E.g. `lw $t0, 4($s0)`

### 6.5.5 Pseudodirect Addressing

Pseudodirect addressing: the memory address is (mostly) embedded in the J-type instructions.

E.g. `j Label`

### 6.5.6 Addressing Mode vs Instruction Type

- Addressing mode: how an address (memory or register) is determined
- Instruction type: how an instruction is put together

## 6.6 Stack and Heap

Two available areas of memory used to store variables are commonly referred to as the stack and the heap.

### 6.6.1 Stack

- Store function local variables
- Store temporary variables
- Pass the arguments for next function call
- Return the address when function ends

### 6.6.2 Heap

- A much larger pool of memory than stack
- Must be explicitly allocated and deallocated by the programmer manually
- _This is completely unrelated to the Heap data structure (?)_

## 6.7 Dynamic Data

### 6.7.1 MIPS Memory Allocation

To allocate additional memory, use `sbrk` system call (9).

Example: allocate 16 bytes (= 4 words) of memory and returns the address of the new memory in $v0.

li $a0, 16 li $v0, 9 syscall # sbrk


### 6.7.2 MIPS Memory Deallocation

There is no system call to deallocate memory in MIPS. Either does JAVA. While JAVA uses <u>garbage collection</u> technique to reclaim unused memory automatically.

# 7 Floating Point Numbers

## 7.1 Overview

- Floating Point Representation
- Arithmetic for Floats
- Error Propagation
- MIPS Floating Point Unit

## 7.2 Floating Point Representation

### 7.2.1 Scientific Notation

General form:

![Floating Point Representation](https://tvax4.sinaimg.cn/large/006tNc79gy1flffkh8rqmj30p4078q3i.jpg)

where:

- `m` = significand
- `b` = base number (any positive integer)
- `e` = exponent (an integer)

E.g.

- `2.73*10^(-7)` is in normalized scientific notation
- `0.273*10^(-6)` is not in normalized scientific notation
- `20.73*10^(-8)` is not in proper scientific notation

### 7.2.2 Normalized Scientific Notation for Binary

General form:

![Normalized Scientific Notation for Binary](https://tvax1.sinaimg.cn/large/006tNc79gy1flffusxbl1j30ty09041a.jpg)

E.g.

- `11.101 = 1.1101 * 2^1` is normalized form
- `11.101 = 11.101 * 2^0` is not normalized form

### 7.2.3 IEEE 754 Floating Point Standard

IEEE 754 Floating Point Standard: a type of sign and magnitude representation

![IEEE 754 Floating Point Standard](https://tvax4.sinaimg.cn/large/006tNc79gy1flfg04fu8mj30q002474l.jpg)

|                           |   S   | Biased Exponent | Fraction |
| :-----------------------: | :---: | :-------------: | :------: |
| Single-precision (32-bit) | 1 bit |     8 bits      | 23 bits  |
| Double-precision (64-bit) | 1 bit |     11 bits     | 52 bits  |

- **S**: sign bit of the fraction (0 for +, 1 for -)
- **Biased Exponent**:

  - In <u>excess representation</u> for fast sorting (2's complement with negative exponent will make it look like a big number)

    > The designers of IEEE 754 also wanted a floating-point representation that could be easily processed by integer comparisons, especially for sorting. This desire is why the sign is in the most significant bit, allowing a quick test of less than, greater than, or equal to 0. (It’s a little more complicated than a simple integer sort, since this notation is essentially sign and magnitude rather than two’s complement.)

  - A fixed value called <u>bias</u> is subtracted from the field to get the true exponent value

  - `Bias = 2^(k-1) - 1`, where `k` is the number of bits in exponent field. E.g. Single precision, `bias = 127`; double precision, `bias = 1023`.

  - `Biased Exponent = Actual Exponent + Bias`. E.g. `Actual Exponent = Biased Exponent - Bias = 0000 0101 - Biad = 5 - 127 = -122`

Example: binary to single-precision floating point format

`-10.101`

`-10.101 = (-1) * (1.0101) * 2^1`

`Biased Exponent = 1 + 127 = 128 = 1000 0000`

|  S  | Biased Exponent |           Fraction           |
| :-: | :-------------: | :--------------------------: |
|  1  |    1000 0000    | 0101 0000 0000 0000 0000 000 |

> Special values (single precision):
>
> - Smallest positive normalized number: `1.0 * 2^(-126)`
> - Least negative normalized number: `-1.0 * 2^(-126)`
> - Largest positive normalized number: `1.1111 1111 1111 1111 1111 111 * 2^127 = (2-2^(-23)) * 2^127`
> - Most negative normalized number: `-1.1111 1111 1111 1111 1111 111 * 2^127 = -(2-2^(-23)) * 2^127`
> - ±∞: `± 1.0 * 2^128`
> - NaN: `1.f * 2^128` where `f ≠ 0`

### 7.2.5 Overflow, Underflow, Infinities and NaN

- Overflow: result > largest representable number (`127` for single-precision) => ±∞
- Underflow: result < smallest normalized number (`-126` for single-precision) => denormalized number
- Infinities: `1/0`
- NaN: `0.0/0.0`, `±∞/±∞`, `0*±∞`, `-∞+∞`, `sqrt(-1.0)`, `log(-1.0)`

> The purpose of NaNs is to allow programmers to postpone some tests and decisions to a later time in the program when they are convenient.

### 7.2.6 Binary Floating Point ⇌ Decimal Floating Point

Examples:

- Binary => Decimal

10.101 = 12^1 + 02^0 + 12^(-1) + 02^(-2) + 12^(-3) = 2 + 0 + 0.5 + 0 + 0.125 = 2.625
1
2
3
4
5

- Decimal => Binary

- Method 1 :

3.75 = 2^1 + 20 + 2^(-1) + 2(-2) = 11.11 0.625 = 2^(-1) + 2^(-3) = 0.101

1
2
3


- Method 2:

0.5_[10] = (1/2)[10] = (1/2^1)[10] = (0.1*2^0)_[2] = (1.0*2^(-1))_[2]

-0.4375_[10] = (-7/16)_[10]
             = (-7/2^4)_[10]
             = (-0.0111*2^0)_[2]
             = (-1.11*2^(-2))_[2]
1
2
3

- Method 3:


27.1875_[10] = 27_[10] + 0.1875_[10]
             = 11011_[2] + 0.1875_[10]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

    |        |  \*2  | Integer Part |
    | :----: | :---: | :----------: |
    | 0.1875 | 0.375 |      0       |
    | 0.375  | 0.75  |      0       |
    |  0.75  |  1.5  |      1       |
    |  0.5   |  1.0  |      1       |

    => `0.1875_[10] = 0.0011_[2]`

    => `27.1875_[10] = 11011.0011`

  - _`0.3` in decimal cannot be expressed exactly by finite digits in binary_

- Decimal => Binary => Single-precision binary floating point format

  - ```
    -13.825_[10] = -( 13_[10] + 0.825_[10] )
                 = -( 1101_[2] + 0.825_[10] )


| 0.825 | Integer |
| :---: | :-----: |
| 1.65  |    1    |
|  1.3  |    1    |
|  0.6  |    0    |
|  1.2  |    1    |
|  0.4  |    0    |
|  0.8  |    0    |
|  1.6  |    1    |
|  1.2  |    1    |
|  ...  |   ...   |

1
2
3
4
                        .  .
-13.825_[10] = -1101.1101001_[2]
                         .  .
             = -(1.1011101001 * 2^3)_[2]


`Biased Exponent = 3 + 127 = 130 = 1000 0010`

|  S  | Biased Exponent |           Fraction           |
| :-: | :-------------: | :--------------------------: |
|  1  |    1000 0010    | 101 1101 0011 0011 0011 0011 |

7.3 Arithmetic for Floats

7.3.1 Floating Point Addition

Steps:

Shifted right the smaller number until its exponent matches the larger number
Add the significands
Normalize the sum
Overflow or underflow ? => Exception
Round the sum
Need further normalization ? => goto 3

Example:

  1.0*2^(-1) - 1.11*2^(-2)  (using 4-bit precision)
= 1.000*2^(-1) - 0.111*2^(-1)
= 0.001*2^(-1)
= 1.000*2^(-4)

7.3.2 Floating Point Multiplication

Steps:

Add the exponents
Multiply the significands
Normalize the product
Overflow or underflow ? => Exception
Round the product
Need further normalization ? => goto 3
Sign the product

Example:

  - 1.0*2^(-1) * 1.11*2^(-2)  (using 4-bit precision)
= - 1.110*2^(-3)

  1.1*2^3 * 1.11*2^4  (using a 3-bit fraction)
= 10.101*2^7
= 1.010*2^8

7.3.3 IEEE 754 Rounding

Rounding modes:

Round Up
Round Down
Toward Zero
Round to Even

(MIPS and Java use round to even by default)

7.4 Error Propagation

7.4.1 Floating Point Arithmetic Errors

Sources of floating point arithmetic errors:

Rounding
Approximate representation

Denote:

m: the number to be represented
n: m's approximation

The absolute error: a = |m - n|

The relative error: r = a / |m|

The true value m ∈ [n-a, n+a] ∪ [n(1-r), n(1+r)]

7.4.2 Error Propagation

// Addition or subtraction sums absolute error of operands
(x ± a_x) + (y ± a_y) = (x + y) ± (a_x + a_y)

// Multiplication or division sums relative error of operands
// The much smaller (r_x * r_y) can be ignored
x(1 ± r_x) * y(1 ± r_y) = (x * y)(1 ± (r_x + r_y) ± r_x * r_y)

Example: What is the absolute error in y^2 - 4xz?

1 2	y^2 * (1 ± 2r_y) - 4xz(1 ± (r_x + r_z)) = (y^2 - 4xz) ± (2 y^2 * r_y + 4xz(r_x + r_z))

7.5 MIPS Floating Point Unit

7.5.1 IEEE 754 for MIPS

Floating Point Unit (FPU): performs IEEE operations (MIPS core refers to FPU as coprocessor 1)

FPU features 32 single precision registers ($f0, $f1, …, $f31) or as 16 pairs of double precision registers ($f0, $f2, …, $f30, here $fi actually stands for the pair $fi and $f(i+1)).

FPU instructions does not raise exceptions. So checking for ±∞ or NaN is needed.

FPU defaults to round to even.

7.5.2 Floating-Point Instructions in MIPS

General format:
- mmm.s: single precision
- mmm.d: double precision
Addition:
- add.s fd, fs, ft e.g. add.s $f0, $f1, $f2
- add.d fd, fs, ft e.g. add.d $f0, $f2, $f4
Subtraction
- sub.s
- sub.d
Multiplication
- mul.s
- mul.d
Divsion
- div.s
- div.d

7.5.3 Load/Store for Floating Point

// Load word copr. 1 (32-bit word)
// lwc1 fd, n(rs)
lwc1 $f1, 100($s2)   // $f1 = Memory[$s2 + 100]

// Store word copr. 1 (32-bit word)
// swc1 fd, n(rs)
swc1 $f1, 100($s2)   // Memory[$s2 + 100] = $f1
// Address rs+n must be word aligned

// To store/load doubles
lwc1 $f0, 0($s0)
lwc1 $f1, 4($s0)
// Or with pseudo-instruction
l.d $f0, 0($s0)

No encoding for immediate floating-point operands.

Must be placed in .data segment due to too many bytes.

Assembler directives: .float n or .double n.

7.5.4 Copying Data Between FPU and CPU

// Move from copr. 1
// mfc1 rd, fs
mfc1 $t0, $f7   // copy content of $f7 to $t0

// Move to copr. 1
// mtc1 $zero, $fs
mtc1 $zero, $f12   // set $f12 := 0

Only the bit pattern is copied, not the actual value that it represents.

7.5.5 Floating Point I/O

(See full list on 4.5.5)

7.5.6 Example: Area of a Circle

  		.data
pi:		.double 3.141592653589793
		.text
		.globl main
main:	li $v0, 7				# read_double
		syscall					# radius < -user input
		la $a0, pi				# load address
		l.d $f12, 0($a0)		# load double at address $a0 into $f12, $f12 := pi
		mul.d $f12, $f12, $f0	# $f12 := $f12 * r
		mul.d $f12, $f12, $f0	# $f12 := $f12 * r
		li $v0, 3				# print_double
		syscall					# print area

7.5.7 Floating Point Comparison

General format:

c.x.s: single
c.x.d: double

where x can be:

eq: equal
neq: not equal
lt: less than
le: less than or equal
gt: greater than
ge: greater than or equal

7.5.8 Floating Point Branch

bclt label - Branch on coprocessor 1 true
- Goto label if flag is true, otherwise continue
bclf label - Branch on coprocessor 1 false
- Goto label if flag is false, otherwise continue

Example: jump to islt if the contents of $f0<$f1 as single precision floating point numbers

1 2	c.lt.s $f0, $f1 bclt islt

8 MIPS Procedure

8.1 Overview

MIPS Procedure
Recursive Procedure

8.2 Procedures

8.2.1 Procedures

Definition: protion of code, within a larger program, which appear frequently
Help to
- reduce code duplication
- improve code re-usability
- decompose complex programs into manageable parts
Other names
- methods (Java, OO languages)
- functions (C, C++, Haskell)
- routines, subroutines (not popular anymore)

8.2.2 Callee vs Caller

Callee: a function called by a caller
Caller: a functoin calling a callee

8.2.3 MIPS Procedure Calling Convention I

Arguments are passed in registers $a0 ~ $a3
The return value is passed in register $v0 (and $v1 if necessary)
The return address of the caller is stored in register $ra
jal label: jump and link
- Effect: call a procedure
- Side effects:
  - $ra = PC + 4
  - PC = [label]
js rs: jump register
- Effect: jump to the address specified by the register rs
- (To return to the caller using jr $ra)

8.2.4 Example: x * x + y * y

		.text
f:		# procedure f, args in x=$a0, y=$a1
		mul $v0, $a0, $a0	# $v0 = x*x
		mul $s0, $a1, $a1	# $s0 = y*y
		add $v0, $v0, $s0	# $v0 = x*x+y*y
		jr $ra				# return to caller, where $ra contains the return address of the instruction `move $a0, $v0`)

main:	li $a0, 3			# $a0 = 3
		li $a1, 4			# $a1 = 4
		jal f				# jump and link f
		move $a0, $v0		# $a0 = $v0 = 25
		li $v0, 1			# print_int 25
		syscall
		li $v0, 10			# exit
		syscall

8.2.5 MIPS Procedure Calling Convention II

To use any registers within the procedure:
- Save what is in these registers at the start of the procedure
- Restore these values at the end of the procedure
The callee has to restore: $s0 ~ $s7, $ra, $sp
The caller is responsible itself for most other registers: e.g. $t0 ~ $t7, $a0 ~ $a3

We use a stack to do this.

8.2.6 The Stack

In MIPS, stack is stored in memory (last-in, first-out [lifo] list)
The stack pointer $sp
- points to the last location on stack
- contains the address of the next free location
Each procedure is associated with a call frame with a frame pointer $fp
- points to the first word of procedure frame
- does not change during the execution

8.2.7 Stack: Push & Pop

8.2.8 Stack in MIPS

To push $s0 onto the stack ($s0 is placed on the stack):

1 2	addi $sp, $sp, -4 # allocate space on the stack sw $s0, 0($sp) # save $s0 -- push

To pop $s0 from the stack ($s0 is taken from the stack):

1 2	lw $s0, 0($sp) # restore $s0 -- pop addi $sp, $sp, 4 # deallocate space

8.2.9 MIPS Procedure Calling Convention III

Caller
- Push any of $a0 ~ $a3, $v0 ~ $v1 and $t0 ~ $t0 if necessary
- Place arguments in $a0 to $a3 if needed
- Make the call using jal callee
- Pop saved registers and/or extra arguments off stack
Callee
- Push any of $ra, $s0 ~ $s7 that may be overwrritten
- Perform desired task
- Place result in $v0 and $v1
- Pop above registers off the stack
- Return to caller with jr $ra

8.2.10 Argument Passing

First four arguments: $a0 ~ $a3
Remaining arguments: put on stack by the caller
Example: fun(int x0, int x1, …, int x7)

8.2.11 Return Values

First two return values: $v0, ($v1)
Remaining return values: put on stack by the procedure, and popped from the stack by the caller

8.2.12 MIPS Procedure Call: Good Strategy

Caller
- At call time
  - Put arguments in $a0 ~ $a3
  - Save any caller-save temporaries
Callee
- At entry
  - Allocate all stack space
  - Save $ra, $s0 ~ $s3
- At exit
  - Restore $ra, $s0 ~ $s3
  - Deallocate all stack space
  - Put return value in $v0
Caller
- After return
  - Retrieve return value from $v0
  - Restore any caller-save temporaries

8.2.13 Example: g(x) = f(x, x+1) + f(x+2, x+3)

f:		# procedure f, args x=$a0, y=$a1, returns x*x+y*y
		mul $v0, $a0, $a0
		mul $t0, $a1, $a1
		add $v0, $a0, $t0		# $v0 = x*x+y*y
		jr $ra

g:		# procedure g, arg x=$a0, returns f(x, x+1) + f(x+2, x+3)
		addi, $sp, $sp, -8		# allocate space on the stack
		sw $s0, 0($sp)			# save $s0 -- push
		sw $ra, 4($sp)			# save $ra -- push

		addi $a1, $a0, 1		# $a1 = x + 1
		jal f					# f(x, x+1)

		move $s0, $v0			# $s0 = f(x, x+1)

		addi $a0, $a1, 1		# $a0 = x + 2
		addi $a1, $a0, 1		# $a1 = x + 3
		jal f					# f(x+2, x+3)

		add $s0, $s0, $v0		# $s0 = f(x, x+1) + f(x+2, x+3)
		move $v0, $s0			# $v0 = f(x, x+1) + f(x+2, x+3)

		lw $s0, 0($sp)			# restore $s0 -- pop
		lw $ra, 4($sp)			# restore $ra -- pop
		addi $sp, $sp, 8		# deallocate space
		jr $ra

main:	# prints out g(3)
		li $a0, 3				# $a0 = 3
		jal g

		move $a0, $v0			# $a0 = f(x, x+1) + f(x+2, x+3)

		li $v0, 1				# print_int
		syscall

		li $v0, 10				# exit
		syscall

8.3 Recursive Procedure

8.3.1 Example: Factorial

# int fact(int n): return n<=0 ? 1 : n * fact(n-1);
# assume $a0 = n

fact:		addi $sp, $sp, -8		# space for two words
			sw $ra, 4($sp)
			sw $a0, 0($sp)

			li $v0, 1				# $v0 = 1
			blez $a0, fact_return	# if ($a0<=0) then {goto fact_return}

			addi $a0, $a0, -1		# n--

			jal fact

tmpl:		lw $a0, 0($sp)			# retrieve original n
			mul $v0, $v0, $a0		# n * fact(n-1)

fact_return:lw $ra, 4($sp)
			addi $sp, $sp, 8
			jr $ra

9 Piplining

9.1 Overview

Pipeline
Pilined Implementation
Obstacles to pipelining

9.2 Pipelining

9.2.1 Functional Units and Their Timings

Executing a MIPS instruction can take up to five functional units:

Type	Duration	Stage	Use
Memory	200ps	IF (Instruction Fetch)	get instruction from memory
Memory	200ps	MEM (Memory Access)	data memory read or write
ALU	200ps	EX (Execute)	ALU operation
Register	100ps	ID (Instruction Decode)	get source register operands
Register	100ps	WB (Write-Back)	result to destination register

9.2.2 Memory Hierarchy

9.2.3 Critical Paths and Instruction Timings

Instruction Class	IF	ID	EX	MEM	WB	Total
R-format(`add`, `sub`, `AND`, `OR`, `slt`)	200	100	200		100	600
Load Word (`lw`)	200	100	200	200	100	800
Store Word (`sw`)	200	100	200			500
Branch (`beq`)	200	100	200			500
Jump (`j`)	200					200

(This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.)

9.2.4 Pipelining

Single-cycle, nonpipelined execution in top versus pipelined execution in bottom.

In this case, we see a fourfold speed-up on average time between instructions, from 800 ps down to 200 ps. The pipeline stage times of a computer are limited by the slowest resource, either the ALU operation or the memory access. We assume the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. We use this assumption throughout this chapter.

Pipelining: an implementation technique in which multiple instructions are overlapped in execution.

Executing n (n >> s) instructions takes:(n + (s-1)) * t / s ≈ nt / s.

Where:

t is the timing of every instruction
s is the number of the stages of every instruction
t/s is the timing of a cycle
s-1 is the further cycles that the n-th instruction needs to complete

9.2.5 Designing ISAs for Pipelining

MIPS instructions are the same length. => much easier to fetch instructions in the first pipeline stage, and to decode them in the second stage.
MIPS has only a few instruction formats, with the source register fields being located in the same place in each instruction. => the second stage can begin reading the register file at the same time that the hardware is determining what type of instruction was fetched.
MIPS memory operands only appear in loads or stores. => the execute stage can be used to calculate the memory address, and then access memory in the following stage.
MIPS operands must be aligned in memory. => the requested data can be transferred between processor and memory in a single pipeline stage.

9.2.6 Pipeline Hazards

Structural Hazard: When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.
- E.g. In MIPS pipeline with a single memory: both load/store and insrtuction fetch require memory access at the same time
- Solutions:
  - Avoid scheduling instructions that would create sturctural hazards
  - Hardware includes control logic that stalls until earlier instruction is no longer using contended resource
  - Duplicate hardware to design so that each instruction can access to independent resources at the same time.
Data Hazard: When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.
- Solutions:
  - Avoid
  - Stall
  - Forwarding / Bypassing: A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmervisible registers or memory.
    - E.g. EX forwarding: add $s0, $t0, $t1, sub $t2, $s0, $t3
      
      Graphical representation of forwarding.
      
      Extra hardware to take result directly from ALU output when it is computed instead of waiting for it to be stored in a register.
      
      Shading on the right half of the register file or memory means the element is read in that stage.
      
      Shading of the left half means it is written in that stage.
      
      MEM has a white background because add does not access the data memory.
    - E.g. MEM forwarding: lw $s0, 20($t1), sub $t2, $s0, $t3
      
      We need a stall even with forwarding when an R-format instruction following a load tries to use the data.
      
      This figure shows an important pipeline concept, officially called a pipeline stall, but often given the nickname bubble.
      
      Without the stall, the path from memory access stage output to execution stage input would be going backward in time, which is impossible. This figure is actually a simplification, since we cannot know until after the subtract instruction is fetched and decoded whether or not a stall will be necessary.
Control Hazard / Branch Hazard: When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.
- Solutions:
  - Stall
    
    Pipeline showing stalling on every conditional branch as solution to control hazards.
    
    This example assumes the conditional branch is taken, and the instruction at the destination of the branch is the OR instruction. There is a one-stage pipeline stall, or bubble, after the branch. In reality, the process of creating a stall is slightly more complicated. The effect on performance, however, is the same as would occur if a bubble were inserted.
  - Branch Prediction: A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
    
    Predicting that branches are not taken as a solution to control hazard.
    
    The top drawing shows the pipeline when the branch is not taken. The bottom drawing shows the pipeline when the branch is taken. As we noted previously, the insertion of a bubble in this fashion simplifies what actually happens, at least during the first clock cycle immediately following the branch.
    
    If the prediction is wrong, the pipeline control must ensure that the instructions wrongly executed have no effectand, and must restart the pipeline from the proper branch address
    - Static Branch Prediction:
      - Predict "taken":
        
        Backwards branches (likely to be a loop)
        
        Unconditional branches (or jumps)
      - Predict "not taken":
        
        Forward branches (likely to be if-then control)
    - Dynamic Branch Prediction:
      - Keep track of historic branch decisions
      - Over 90% prediction accuracy
  - Delayed Decision (actually used by the MIPS architecture): The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay. It is hidden from the MIPS assembly language programmer because the assembler can automatically arrange the instructions to get the branch behavior desired by the programmer. MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction. In our example, the add instruction before the branch in Figure 4.31 does not affect the branch and can be moved after the branch to fully hide the branch delay. Since delayed branches are useful when the branches are short, no processor uses a delayed branch of more than one cycle. For longer branch delays, hardware-based branch prediction is usually used.

9.2.7 Example: Reordering Code to Avoid Pipeline Stalls

# a = b + e;
# c = b + f;
# Assuming all variables are in memory and are addressable as offsets from $t0.

lw $t1, 0($t0)			# load b from $t0+[0]
lw $t2, 4($t0)			# load e from $t0+[4]
add $t3, $t1, $t2		# a = b + e
sw $t3, 12($t0)			# store a into $t0+[12]
lw $t4, 8($t0)			# load f from $t0+[8] *
add $t5, $t1, $t4		# c = b + f
sw $t5, 16($t0)			# store c into $t0+[16]

# After

lw $t1, 0($t0)			# load b from $t0+[0]
lw $t2, 4($t0)			# load e from $t0+[4]
lw $t4, 8($t0)			# load f from $t0+[8] *
add $t3, $t1, $t2		# a = b + e
sw $t3, 12($t0)			# store a into $t0+[12]
add $t5, $t1, $t4		# c = b + f
sw $t5, 16($t0)			# store c into $t0+[16]

10 Pointers and References

10.1 Overview

Pointers
Reference
Arrays
Strings

10.2 Pointers

10.2.1 Pointers

Pointer: a variable containing the address of another variable.

In C: (Assume c in memory at address 0x10000000, p in $a0, x in $s0.)

int c = 100;
int *p;
p = &c;
x = *p;
*p = 200;

In MIPS:

lui $a0, 0x10000000		# p = &c = 0x10000000
lw $s0, 0($a0)			# x = *p = c = 100
addi $t0, $zero, 200	# $t0 = 200
sw $t0, 0($a0)			# *p = $t0 = 200

10.3 References

10.3.1 References

In Java:

class Int{
  public int n;
  public Int(int m) { n = m; }
}
public class Reference{
  public static void main(String args[]){
    Int a = new Int(1);			// create an Intobject with n==1
    Int b = new Int(2);			// create an Intobject with n==2
    a = b;						// copy the reference
    b.n = 3;
    System.out.println(a.n);	// 3
  }
}

In MIPS:

		.data
aa:		.word 3
bb:		.word 2
		.text
		.globl main
main:	la $a0, aa			# load address of aa into $a0
		la $a1, bb			# load address of bb into $a1
		jal swap
		...					# print a, b
swap:	lw $t0, ($a0)
		lw $t1, ($a1)
		sw $t1, ($a0)
		sw $t0, ($a1)
		jr $ra

10.4 Arrays

10.4.1 Arrays

		.data
as:		.word 3, 1, 4, 1, 5, 9, 2, 6
		.text
main:	la $t1, as			# load address of as into $t1
		lw $t2, 24($t1)		# load word from $t1+24 (as[6]) into $t4

		.data
asx:	.word 1, 2, 3
bsx:	.word 5, 4
		.text
		.globl main
main:	la $s0, asx
		la $s1, bsx
		move $s0, $s1		# $s0 = $s1 = the address of bsx
		li $t0, 7
		sw $t0, 0($s1)		# bsx[0] = 7
		lw $a0, 0($s0)
		li $v0, 1			# print bsx[0]
		syscall
		li $v0, 10			# exit
		syscall

# return the max value of an array
			.data
as:			.word 3, 1, 4, 1, 5, 9, 2, 6
			.text
			.globl main
main:		addi $sp, $sp, -4		# allocate the space on stack
			sw $ra, 0($sp)			# save the return address -- push

			la $a0, as				# $a0 = &as
			li $a1, 8				# $a1 = 8 (the length of the array)
			jal array_max			# array_max(&as, 8)

			move $a0, $v0			# $a0 = $v0
			li $v0, 1				# print_int
			syscall

			lw $ra, 0($sp)			# restore return address -- pop
			addi $sp, $sp, 4		# deallocate the space on stack

			jr $ra

array_max:	# $a0: array address, $a1: array length
			li $v0, 0x80000000		# $v0 = MIN_VALUE = 2147483648
			li $t0, 0				# counter $t0 = i = 0
			j am_cond

am_loop:	sll $t1, $t0, 2			# shift left logical, $t1 = 4 * i
			add $t1, $a0, $t1		# $t1 = &as + 4 * i
			lw $t1, ($t1)			# $t1 = as[&as + 4 * i]
			addi $t0, $t0, 1		# i++
			bge $v0, $t1, am_cond	# if ($v0 >= as[&as + 4 * i]) then {goto am_cond}
			move $v0, $t1			# $v0 = as[&as + 4 * i]

am_cond:	blt $t0, $a1, am_loop	# i < length && {goto am_loop}
			jr $ra					# back to the main function

# Clearing an array using reference
# clear (int array[], int size)
# {
# 	int i;
# 	for (i = 0; i < size; i++)
# 	{
# 		array[i] = 0;
# 	}
# }

clear:		# $a0: &array; $a1: size
			move $t0, $zero			# counter $t0 = i = 0

loop:		sll $t1, $t0, 2			# $t1 = 4 * i
			add $t2, $a0, $t1		# $t2 = &array[0] + 4 * i
			sw $zero, 0($t2)		# array[&array[0] + 4 * i] = 0
			addi $t0, $t0, 1		# i++

			slt $t3, $t0, $a1		# $t3 = i<size ? 1 : 0
			bne $t3, $zero, loop	# $t3 == 0 && {goto loop}

			jr $ra

# Using "mul" or "add" in the loop to compute the new index i

# Clearing an array using pointer
# clear (int *array, int size)
# {
#   int *p;
#   for (p = &array[0]; p<&array[0]; p++)
#   {
#     *p = 0;
#   }
# }

clear:		# $a0: &array; $a1: size
			move $t0, $a0			# int *p = &array[0]
			sll $t1, $a1, 2			# $t1 = 4 * size
			add $t2, $a0, $t1		# $t2 = &array[0 + size] = &array[size]

loop:		sw $zero, 0($t0)		# *p = 0
			addi $t0, $t0, 4		# p = p + 4

			slt $t3, $t0, $t2		# $t3 = p < &array[size] ? 1 : 0
			bne $t3, $zero, loop	# $t3 != 0 && {goto loop}

			jr $ra

# Operating on the pointer directly in the loop

10.5 Strings

10.5.1 Strings

Example: printing the length of a string

In C:

int strlen (char *s)
{
  int len = 0;
  while(*s != 0)
  {
    s++;
    l++;
  }
  return len;
}
int main ()
{
  printf("%d\n", strlen("hello"));
}

In MIPS:

				.data
hello:			.asciiz "hello"
				.text
				.globl main
main:			la $a0, hello				# char *s = "hello"
				jar strlen

				move $a0, $v0
				li $v0, 1					# print_int
				syscall
				li $v0, 10					# exit
				syscall

strlen:			# $a0: pointer to the string
				li $v0, 0					# len = 0
				j strlen_cond

strlen_loop:	addi $v0, $v0, 1			# len++

strlen_cond:	lbu $t0, ($a0)				# Load byte unsigned $t0 = *a0
				addi $a0, $a0, 1			# s++
				bne $t0, $zero, strlen_loop	# $t0 != 0 && {strlen_loop}

Name	Register Number	Use
$zero	0	Constant 0
\(at \| 1 \| Reserved for assembler \| \| \$v0-\)v1	2-3	Function select, return result
$a0-\(a3 \| 4-7 \| Arguments \| \| \$t0-\)t7	8-15	temporaries
$s0-\(s7 \| 16-23 \| saved temporaries \| \| \$t8-\)t9	24-25	temporaries
$k0-$k1	26-27	kernel registers
$gp	28	global heap pointer
$sp	29	stack pointer
$fp	30	frame pointer
$ra	31	return address