System & Architecture I: Computer Organization and Design

1 Introduction

这门课在 UNNC 被分两个部分教学:系统架构 + 网络。系统架构部分使用的教材是 MK Computer Organization and Design 5th Edition 2

2 Hierarchy, Components & Technology

2.1 Overview

  • Computer components
  • Computer systems hierarchy
  • Technology trends

2.2 Computer Components

Computer Components
Computer Components

2.2.1 Computer Components

  • Input
    • Mouse
    • Keyboard
    • Touchpad
    • ...
  • Output
    • Monitor
    • Printer
    • ...
  • Memory
    • Volatile main memory
      • Random Access Memory (RAM)
      • CPU cache
    • Non-volatile secondary memory
      • Read Only Memory (ROM)
      • Flash memory
      • Optical disk (CDROM, DVD)
  • Processor (CPU)
    • Datapath: the component that performs arithmetic operations
    • Control: the component that commands the datapath, memory and I/O devices according t othe instructions of the program

The IAS (von Neumann Machine)

The IAS (von Neumann Machine)
The IAS (von Neumann Machine)

2.2.2 Below Your Program

  • Application software
    • Written in high-level language
  • System software
    • Compiler
      • HLL code => machine code
    • Operating system
      • Handling basic I/O operations
      • Managing memory & storage
      • Scheduling tasks & sharing resources
  • Hardware
    • Processor
    • Memory
    • I/O controller

2.2.3 Levels of Programming Code

Levels of Programming Code
Levels of Programming Code
  • Compiler: a program that translates high-level language statements into assembly language statements
  • Instruction: a command that computer hardware understands and obeys
  • Assembler: a program that translates a symbolic version of instructions into the binary versions
  • Assembly language: a symbolic representation of machine instructions
  • Machine language: a binary representation of machine instructions

2.3 Computer Systems Hierarchy

2.3.1 Computer Systems Hierarchy

Computer Systems Hierarchy
Computer Systems Hierarchy

2.3.2 Instruction Set Architecture (ISA)

Instruction set
Instruction set
  • ISA
    • A well-defined hardware/software interface
  • The "contract" between software and hardware
    • Allow communications
    • Standardize instructions, machine language bit patterns, operation modes, storage locations, etc
  • Advantage
    • Allow different implementations of the same architecture
  • Disadvantage
    • Sometimes prevent adding new innovations
  • Modern ISAs
    • 80x86, MIPS, ARM, PowerPC

2.4.1 Basic Logic

Because of the on-or-off nature of the binary information and signal routing the computer uses, an efficient electronic switch is required:

  • Vacuum tube
  • Transistor
  • Integrated circuit (IC)): combine dozens to hundreds of transistor on a single chip
  • Very large-Scale integrated (VLSI) circuit: a device containing hundreds of thousands to millions of transistors
  • Ultra large-scale integrated circuit: a device containing over millions of transistors
Year Technology Relative Performance/Cost
1951 Vacuum tube 1
1965 Transistor 35
1975 IC 9*10^3
1995 VLSIC 2.4*10^6
2013 ULSIC 2.5*10^11

2.4.3 Microprocessor

Microprocessor: computer CPU on a single chip (IC)

2.4.4 Modern Multicore Processor

Multicore processor: a single chip contains more than one microprocessor core

2.4.5 Manufacturing Chips

Manufacturing Chips
Manufacturing Chips

3 Computer Performance

3.1 Overview

  • Computer performance
  • Power wall
  • SPEC Benchmark
  • Amdahl's Law

3.2 Computer Performance

3.2.1 Response Time and Throughput

  • Response time (execution time)
    • The total time to complete a task
  • Throughput (brandwidth)
    • The number of tasks completed per unit time
  • Effects
    • Replacing the processor in a computer with a faster version
    • Adding additional processor to a system that uses multiple processors for separate tasks

3.2.2 Relative Performance

To maximize performance => minimize execution

1
Performance = 1 / Execution time_x

Computer X is n times faster than computer Y:

1
Performance_x / Performance_y = Execution time_y / Execution time_x = n

3.2.3 Measuring Performance

  • Response time
    • The total time to complete a task
    • Disk accesses, memory accesses, I/O activities, etc
  • CPU execution time (CPU time)
    • The actual time the CPU spends computing for a specific task
    • Exclude time spent waiting for I/O or running other programs
    • Further divided into:
      • User CPU time: the CPU time spent in a program itself
      • System CPU time: the CPU time spent in the OS performing tasks on behalf of the program

3.2.4 CPU Clocking

  • Clock: determines when events take place in hardware
  • Clock signal: produced by an external oscillator circuit
  • Clock cycles (clock ticks, clock periods): the time for one clock period, usually of the processor clock, which runs at a constant rate (like 250ps)
  • Clock frequency (clock rate): cycles per second, which is the inverse of the clock period (like 4.0GHz)

3.2.5 CPU Time

1
2
CPU Time = CPU Clock Cycles * Clock Cycle Time
= CPU Clock Cycles / Clock Rate

To improve performance:

  • Reducing the number of clock cycles
  • Increasing clock rate
  • Trade-off: the number of clock cycles & the clock cycle time

3.2.6 Instruction Performance

  • The execution time of instructions: determined by the number of instructions in a program
  • The instruction count for a program: determined by program, ISA and compiler
  • Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute, determined by CPU hardware
1
CPU Clock Cycles = Instruction Count * CPI
1
2
CPU Time = Instruction Count * CPI * Clock Cycle Time
= Instruction Count * CPI / Clock Rate

If different instruction classes take different numbers of cycles:

1
2
3
               n
Clock Cycles = ∑(CPI_i * Instruction Count_i)
i=1

Weighted average CPI:

1
2
3
4
CPI = CPU Clock Cycles / Instruction Count
n
= ∑(CPI_i * Instruction Count_i / Instruction Count)
i=1

Instruction Count: relative frequency

3.2.7 Understanding Program Performance

  • Algorithm
    • Affects insturction count
    • Affects CPI (possibly) - by favouring slower or faster instructions
  • Programming language
    • Affects instructions count
    • Affects CPI - beacause of its own features
  • Compiler
    • Affects instructions count
    • Affects CPI
  • ISA
    • Affects instructions count
    • Affects CPI, clock rate

3.3 Power Wall

  • Complememtray metal oxide semiconductor (CMOS): the dominant technology for integrated circuits

  • Dynamic energy: the primary source of energy consumption for CMOS, consumed when transistors switch states from 0 to 1 and vice versa.

    1
    Energy ∝ Capacitive load * Voltage^2
  • The energy of a single transition:

    1
    Energy ∝ 0.5 * Capacitive load * Voltage^2
  • The power required per transistor:

    1
    Power ∝ 0.5 * Capacitive load * Voltage^2 * Frequency

3.4 SPEC Benchmark

  • System Performance Evalucation Cooperative (SPEC): the standard sets of benchmarks for modern computer systems

  • SPEC CPU2006

    SPEC CPU2006
    SPEC CPU2006
  • SPEC Power Benchmark

    SPEC Power Benchmark
    SPEC Power Benchmark

3.5 Amdah's Law

3.5.1 Amdah's Law

  • Amdah's Law: the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. It is a quantitative version of the law of diminishing returns.

  • The execution time of the program after the improvement:

    1
    T_improved = T_affected / improvement factor + T_unaffected

3.5.2 Law of Diminishing Returns

3.5.3 Low Power at Idle

(NEED FURTHER READING)

4 MIPS32 Programming

4.1 Overview

  • MIPS introduction
  • MIPS structure
  • MIPS syntax & instruction
  • MIPS basic arithmetic
  • MIPS branching
  • MIPS logical operations

4.2 MIPS Introduction

4.2.1 Introduction

  • Microprocessor without Interlocked Processor States (MIPS)
    • ISA based on Reduced Instruction Set Computing (RISC) CPU design strategy
    • Simple instruction, fast execution
    • 1980s 32-bit, 1991 64-bit

4.2.2 Bits, Bytes and Words

  • Bit: binary digit (either 0 or 1)
  • Byte: a sequence of bits (having 8 bits in length since the mid 1960's to get 256 possibilities)
  • Word: amount of data computer can process in one step. (Most CPUs have a word size of 32 or 64 bits today. On the 32-bit MIPS, a word is 4 bytes long.)

4.3 MIPS Structure

4.3.1 MIP32 Structure

MIP32 Structure
MIP32 Structure
  • 1 word = 32 bits = 4 bytes
  • Addresses are 32-bit words => (2^32 different locations)
  • Registers are 32-bit words

4.3.2 Fetch-Execute Cycle

  • Fetch the instruction: fetch the instruction from the memory address that is currently stored in the program counter (PC), and store it in the instruction regitser (IR).
  • Decode the instruction: the encoded instruction presented in the IR is interpreted by the decoder
  • Execute the instruction
  • Write back
1
2
3
4
5
6
7
8
9
10
Repeat {
Fetch (PC):
Fetch the instruction word (at PC)
Instruction decoded
Calculate next instruction address
Execute (ALU, Registers and Control):
Read operands
Executes the operations
Write/store results
}

4.4 MIPS Syntax & Instruction

4.4.1 MIPS Programming: "Hello, world!"

1
2
3
4
5
6
7
8
9
10
		.data
msg: .asciiz "Hello, world!\n"
.text
.globl main
main:
la $a0, msg # load label msg
li $v0, 4 # load immediate
syscal # print it
li $v0, 10
syscall # exit

4.4.2 MIPS Basic Syntax

  • Assembler directives begin with a "."
    • .data
      • Start assembling data used by the program
    • .text
      • Start assembling program instructions
    • .asciiz
      • Place a null-terminated ASCII string in memory
  • Labels are names followed by a colon ":"
    • Descriptive names chosen by programers
    • The assembler will assign memory addresses to labels for later reference
    • The label "main" is the entry of the program
  • Machine instructions
    • Lines after "main:" contain symbolic machine instructions
  • Comments begin with a "#"

4.4.3 MIPS Registers

Name Register Number Use
$zero 0 Constant 0
\(at | 1 | Reserved for assembler | | \$v0-\)v1 2-3 Function select, return result
$a0-\(a3 | 4-7 | Arguments | | \$t0-\)t7 8-15 temporaries
$s0-\(s7 | 16-23 | saved temporaries | | \$t8-\)t9 24-25 temporaries
$k0-$k1 26-27 kernel registers
$gp 28 global heap pointer
$sp 29 stack pointer
$fp 30 frame pointer
$ra 31 return address

4.4.4 MIPS Instruction Field

General R-format:

op rs rt rd shamt funct
bits 6 5 5 5 5 6
  • op - operation code (basic operation of the instruction)
  • rs - first source register operand
  • rt - second source register operand
  • rd - destination register operand
  • shamt - shift amount (used in shift instruction)
  • funct - function code (select the variant of the operation in the op code field)

General I-format:

op rs rt constant or address
bits 6 5 5 16

4.4.5 MIPS Instruction Format

General syntax:

3 Operands 2 Operands Other
op dst, src, src op dst, src op
op dst, src, imm op dst, imm op src
  • dst - destination register
  • src - source register
  • imm - immediate value (16 bits)

4.5 MIPS Arithmetic

4.5.1 Add

1
2
# rd = rs1 + rs2
add rd, rs1, rs2

Example: add $s0, $t0, $t1

Before After
$s0 4 11
$t0 5 5
$t1 6 6

4.5.2 Subtract

1
2
# rd = rs1 - rs2
sub rd, rs1, rs2

Example: sub $s0, $t0, $t1

Before After
$s0 4 -1
$t0 5 5
$t1 6 6

4.5.3 Add Immediate

1
2
# rd = rs1 + imm
addi rd, rs1, imm

Example: addi $s0, $t0, 100

Before After
$s0 4 105
$t0 5 5

4.5.4 Assignment

1
2
move a, b
#Assign: a = b

The instruction is accepted by MIPS assembler, but it is called the pseudo-instruction, which will be translated into:

1
add a, b, $zero

4.5.5 System Call

To request an operating-system-like service:

  • Load the service number in register $v0
  • Load argument values, if any, in $a0, …, $a3 (or $f12 for floating point values)
  • Issue the SYSCALL instruction
  • Return values and if any, put the result in register $v0 (or $f0 for floating point values)

May destroy \(v0-\)v1, \(a0-\)a3, \(t0-\)t9, $ra, but always preserves \(s0-\)s7

SPIM System Calls
SPIM System Calls

(System services)

4.6 MIPS Branching

4.6.1 Branch Instructions

  • beq a, b, l - Branch on equal
    • Goto instruction at label l if a==b, otherwise continue
  • bne a, b, l - Branch on not equal
    • Goto instruction at label l if a!=b, otherwise continue
  • j l - Jump to
    • Jump to instruction at label l

4.6.2 If-then-else

Example:

i j f g h
$s0 $s1 $s2 $s3 $s4
1
2
3
4
5
6
7
# if (i==j) then {f=g+h} else {f=g-h}

bne $s0, $s1, Else
add $s2, $s3, $s4
j Exit
Else: sub $s2, $s3, $s4
Exit: ...

4.6.3 While

Example:

i j
$s0 $s1
1
2
3
4
5
6
7
# while (i!=0) {i=i-1; j=j+3;}

loop: beq $s0, $zero, wexit
addi $s0, $s0, -1
addi $s1, $s1, 3
j loop
wexit: ...

4.6.4 Other Comparisons

  • slt a, b, c - set on less than
    • Set register a to 1, if b<c, otherwise to 0. (a = b<c ? 1 : 0)
  • slti a, b, n - set on less than immediate
    • Set register a to 1, if b<n, otherwise to 0. (a = b<n ? 1 : 0)

4.6.5 Example: Maximum of Two Numbers

1
2
3
4
5
6
7
8
9
10
11
12
13
14
		.text
main: li $v0, 5
syscall # read integer
move $s0, $v0 # store in s0
li $v0, 5
syscall # read integer
slt $t0, $v0, $s0 # if $v0<$s0, $t0=1
bne $t0, $zero, out # goto out, if $t0!=0, ($t0=1, $v0<$s0)
move $s0, $v0 # $s0=$v0, otherwise ($v0>=$s0)
out: move $a0, $s0 # print max
li $v0, 1
syscall # print integer
li $v0, 10
syscall # exit

4.7 MIPS Logical Operations

4.7.1 Bitwise Logical Operations

OR, AND, NOR, XOR, ...

4.7.2 Logical Operators

or, and, nor, xor, ...

4.7.3 MIPS Logical Operations

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# rd = rs and rt
and rd, rs, rt

# rd = rs and imm
andi rd, rs, imm

# rd = rs or rt
or rd, rs, rt

# rd = rs or imm
ori rd, rs, imm

# rd = rs nor rt
nor rd, rs, rt

# no nori operation

# rd = not rs (implementing negation using nor)
nor rd, rs, $zero

# rd = rs xor rt
xor rd, rs, rt

# rd = rs xor imm
xori rd, rs, imm

5 Sign Numbers, Overflow, Multiplication and Division

5.1 Overview

  • Sign numbers
  • Overflow
  • MIPS shift opreations
  • MIPS multiplication
  • MIPS division

5.2 Sign Numbers

5.2.1 4-Bit Signed Encodings

(See details in CSF.)

Binary Unsigned Sign and Magnitude 1's Complement 2's Complement Excess-8
0000 0 0 0 0 -8
0001 1 1 1 1 -7
0010 2 2 2 2 -6
0011 3 3 3 3 -5
0100 4 4 4 4 -4
0101 5 5 5 5 -3
0110 6 6 6 6 -2
0111 7 7 7 7 -1
1000 8 -0 -7 -8 0
1001 9 -1 -6 -7 1
1010 10 -2 -5 -6 2
1011 11 -3 -4 -5 3
1100 12 -4 -3 -4 4
1101 13 -5 -2 -3 5
1110 14 -6 -1 -2 6
1111 15 -7 -0 -1 7

5.2.2 Sign Extension

Sign extension: the operation of increasing the number of bits of a binary number while preserving the number's sign(+/-) and value

Examples:

Befor Extension After Extension
00 1010 0000 0000 0000 1010
11 1111 0001 1111 1111 1111 0001

5.3 Overflow

5.3.1 Problem

Overflow: happens when the last carry(1) cannot be accommodated.

Situations:

Operation Result
+ + + -
- + - +
+ - - -
- - + +

5.3.2 Solution in MIPS

Exception Program Counter (EPC): a register in MIPS to contain the address of the instruction that caused the exception.

The instruction move from system control (mfc0) is used to copy EPC into a general-purpose register so that MIPS software has the option of returning to the offending instruction via a jump register instruction.

  • Causing exceptions on overflow:
    • add
    • addi
    • sub
  • Not causing exceptions on overflow:
    • addu (add unsigned)
    • addiu
    • subu

5.3.3 References

  • MIPS Overflow Exception

MIPS can trap on overflow, but unlike many other computers, there is no conditional branch to test overflow. A sequence of MIPS instructions can discover overflow. For signed addition, the sequence is the following:

1
2
3
4
5
6
7
8
9
10
11
addu $t0, $t1, $t2          # $t0 = sum, but don’t trap

# if (signs of $t1 & $t2 differ) then {goto No_overflow}
xor $t3, $t1, $t2 # $t3 = signs of $t1 & $t2 differ ? <0 : >0
slt $t3, $t3, $zero # $t3 = $t3<0 ? 1 : 0
bne $t3, $zero, No_overflow # $t3 == 1 && {goto No_overflow}

# if (signs of $t1 & $t2 equal, but sign of sum differs) then {goto Overflow}
xor $t3, $t0, $t1 # $t3 = sign of sum differs ? <0 : >0
slt $t3, $t3, $zero # $t3 = $t3<0 ? 1 : 0
bne $t3, $zero, Overflow # $t3 == 1 && {goto Overflow}

For unsigned addition ($t0 = $t1 + $t2), the test is

1
2
3
4
5
6
addu $t0, $t1, $t2      # $t0 = sum, but don’t trap

nor $t3, $t1, $zero # $t3 = NOT $t1 = (2's comp of $t1) - 1 = 2^32 - $t1 - 1
sltu $t3, $t3, $t2 # $t3 = (2^32 – $t1 – 1) < $t2 ? 1 : 0
# ⇒ 2^32 – 1 < $t1 + $t2
bne $t3,$zero,Overflow # if(2^32 –1<$t1+$t2) goto overflow

5.4 MIPS Shift Operations

5.4.1 Shift Operations

Examples:

Original Bits Shifted Left Shifted Right
0101 (5) 1010 (10) 0010 (2)
1100 (-4) 1000 (-8) 0110 (6) => 1110 (-2)
1011 (-5) 0110 (6) => 1110(-2) [overflow] 0101 (5) => 1101 (-3)
  • For unsigned numbers:

    • Shift left corresponds to multiplication by 2
    • Shift right corresponds to division by 2, ignoring the remainder
  • For signed numbers:

    • On 2's complement, shift left is multiplication by 2 if no overflow
    • (Shift right corresponds to division by 2 given sign extension, plus the remainder?)
    • Arithmetical shift right performs sign extension when shifting (to be half)

5.4.2 MIPS Logical Shift Operations

1
2
3
4
5
6
7
# Shift left logical
# $s0 = $s1 << 2
sll $s0, $s1, 2

# Shift right logical
# $s0 = $s1 >> 3
srl $s0, $s1, 3

5.4.3 MIPS Arithmetic Shift Operations

1
2
3
4
5
# Shift right arithmetical
# $s0 = $s1 >> 2 (performing sign extension)
sra $s0, $s1, 2

# No need for a sla when using 2's complement number representation

5.5 Multiplication

5.5.1 Multiplication

m-bit number * n-bit number = m+n-bit number

Example: 1101 * 1011 (Long Multiplication)

1
2
3
4
5
6
7
8
9
           1 1 0 1           13
* 1 0 1 1 11
------------------
1 1 0 1
1 1 0 1
0 0 0 0
+ 1 1 0 1
------------------
= 1 0 0 0 1 1 1 1 143

5.5.2 MIPS Multiplication Instructions

MIPS stores the 64-bit result of the multiplication of two 32-bit registers in two special 32-bit registers Hi (32~63 bits) and Lo (0~31 bits).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# Hi~Lo = $s1 * $s2 (signed)
mult $s1, $s2

# Hi~Lo = $s1 * $s2 (unsigned)
multu $s1, $s2

# Store the lower 32-bit of the result of the previous operation in register $s0
mflo $s0

# Store the higher 32-bit of the result of the previous operation in register $s0
mfhi $s0

# Since we are commonly only interested in the 32-bit result of a multiplication, the sequence:
mult $s1, $s2
mflo $s0
# can be encoded as one operation:
mul $s0, $s1, $s2

# Note that mul doesn't check for overflow, but a pseudo-insturtion multiply with overflow `mulo` does

5.6 MIPS Division

5.6.1 Division

dividend = quotient * divisor + remainder

Algorithm: repeatedly try to subtract the left-shifted divisor from the dividend, shifting the divisor right each time

Example:

(Long Division Decimal:)

Long Division Decimal
Long Division Decimal

(Long Division Binary:)

Long Division Binary
Long Division Binary
  • Carry out division for absolute values of the dividend and divisor
  • Negate the quotient if the signs disagree
  • The remainder must have the same signs as dividend

5.6.2 MIPS Division Instructions

MIPS stores the 64-bit result of the division of two 32-bit registers in two special 32-bit registers Hi (remainder) and Lo (quotient).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Hi~Lo = $s1 ÷ $s2 (signed)
div $s1, $s2

# Hi~Lo = $s1 ÷ $s2 (unsigned)
divu $s1, $s2

# Store the quotient of the previous operation in register $s0
mflo $s0

# Store the remainder of the previous operation in register $s0
mfhi $s0

# The instruction:
div $s0, $s1, $s2
# corresponds to:
div $s1, $s2
mflo $s0
# i.e.: $s0 = $s1 / $s2

6 Machine Code, Addressing

6.1 Overview

  • Machine code VS. assembly language
  • MIPS instruction encodings
  • Memory organization
  • MIPS addressing modes
  • Stack and Heap
  • Dynamic data

6.2 Machine Code VS. Assembly Language

1
2
3
4
5
6
7
8
# Assembly
add $s0, $s1, $s2

# Hexadecimal (machine code)
0232 8020

# Binary
0000 0010 0011 0010 1000 0000 0010 0000

6.3 MIPS Instruction Encodings

6.3.1 R-format - Operands Are Registers Only

op rs rt rd shamt funct
bits 6 5 5 5 5 6
  • op - operation code (basic operation of the instruction)
  • rs - first source register operand
  • rt - second source register operand
  • rd - destination register operand
  • shamt - shift amount (used in shift instruction)
  • funct - function code (select the variant of the operation in the op code field)

6.3.2 I-format - Operands Contain Immediate

op rs rd imm
bits 6 5 5 16
  • op - operation code
  • rs - first source register operand
  • rt - second source register operand
  • imm - immediate constant

6.3.3 J-format - Operands Are Immediate Only

op imm
bits 6 26
  • op - operation code
  • imm - immediate constant

6.3.4 Examples

  • ``` # Assembly add $s0, $s1, $s2

    Hexadecimal (machine code)

    0232 8020

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12

    | op | rs | rt | rd | shmt | funct |
    | :----: | :---: | :---: | :---: | :---: | :----: |
    | 000000 | 10001 | 10010 | 10000 | 00000 | 100000 |
    | R | $s1 | $s2 | $s0 | | add |

    - ```
    # Assembly
    sub $s0, $s1, $s2

    # Hexadecimal (machine code)
    0233 8022

    op rs rt rd shmt funct
    000000 10001 10010 10000 00000 100010
    R $s1 $s2 $s0 sub
  • ``` # Assembly addi $t0, $t1, 100

    Hexadecimal (machine code)

    2128 0064

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59

    | op | rs | rd | imm |
    | :----: | :---: | :---: | :-----------------: |
    | 001000 | 01001 | 01000 | 0000 0000 0110 0100 |
    | addi | $t1 | $t0 | |

    ### 6.3.5 Decoding Machine Code

    ![MIPS opcode map](https://tvax4.sinaimg.cn/large/006tNc79gy1flc1g5d0irj30ym13s163.jpg)

    - Convert hexadecimal to binary to find the op fields
    - Look at the op filed (bits 31:26) to determine the operation
    - Reformat the binary instruction into corresponding format fields
    - Decode the operation type, instruction by checking the field values

    ### 6.3.6 Register VS Memory

    - Processor can access registers directly
    - Limit of 32 registers
    - Most programs require much more data than can be held in registers (like text, code and images)
    - Extra data must kept in memory
    - **Load**: transfer data from memory to register
    - **Store**: transfer data from register to memory

    ![](https://tvax4.sinaimg.cn/large/006tNc79gy1flf08zfgmlj30ko09cglz.jpg)

    (Memory hierarchy)

    > #### Key Differences Between Register and Memory
    >
    > 1. The primary difference between register and memory is that register **holds the data that the CPU is currently processing** whereas, the memory **holds the data the that will be required for processing**.
    > 2. The Register ranges from **32-bits register to 64-bits register** whereas, the memory capacity ranges from some **GB** to some **TB**.
    > 3. The processor accesses register **faster** than the memory.
    > 4. Computers registers are **accumulator register, program counter, instruction register, address register**, etc. On the other hands, memory is referred as the main memory of the computer which is **RAM**.

    ## 6.4 Memory Organization

    ### 6.4.1 Memory

    Memory is organized as a sequence of bytes.

    | Address | Value |
    | :-----: | :---------: |
    | 0 | 8-bit value |
    | 1 | 8-bit value |
    | 2 | 8-bit value |
    | ... | ... |

    Registers hold 32 bits (= 4 bytes = 1 word).

    | Address | Value |
    | :-----: | :----------: |
    | 0 | 32-bit value |
    | 4 | 32-bit value |
    | 8 | 32-bit value |
    | ... | ... |

    ### 6.4.2 Data Transfer Instruction: Load Word

    # \(t0 = Memory[\)t1+n] lw \(t0, n(\)t1)

The address has to be divisible by 4 (word aligned).

The offset n has to fit within 16 bit, i.e. -32768 <= n <= 32767.

1
2
3
4
5
6
7
8
9
10
11
12
13
14

Example: `lw $t0, 4($s0)`

=> `$t0 = Memory[$s0 + 4] = Memory[10000000_(16) + 4] = 110`

| | Before | After |
| :-------------------: | :------------: | :------------: |
| Memory[10000000_(16)] | 100 | 100 |
| Memory[10000004_(16)] | 110 | 110 |
| $s0 | 10000000\_(16) | 10000000\_(16) |
| $t0 | 10 | **110** |

### 6.4.3 Data Transfer Instruction: Store Word

Memory[$t1+n] = $t0

sw \(t0, n(\)t1)

The address has to be divisible by 4 (word aligned).

The offset n has to fit within 16 bit, i.e. -32768 <= n <= 32767.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

Example: `sw $t0, 4($s0)`

=> `Memory[$s0 + 4] = Memory[10000000_(16) + 4] = $t0 = 10`

| | Before | After |
| :-------------------: | :------------: | :------------: |
| Memory[10000000_(16)] | 100 | 100 |
| Memory[10000004_(16)] | 110 | **10** |
| $s0 | 10000000\_(16) | 10000000\_(16) |
| $t0 | 10 | 110 |

### 6.4.4 Transferring Bytes and Halfwords

- ASCII characters: 1 byte
- Halfword: 16 bits (= 2 bytes)

Load (unsigned) byte

lb rt, address lbu rt, address

Load (unsigned) halfword

lh rt, address lhu rt, address

Store byte

sb rt, address

Store halfword

sh rt, address

The addresses for halfwordshave to be divisible by 2.

There is no restriction for byte addresses.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

_Complete data transfer instruction list see: pp A-49 ~ A-80._

### 6.4.5 32-Bit Immediate and Addresses

- Problem:

- MIPS ISA uses fixed size 32-bit instructions, so there are always < 32 bits available for any instruction arguments
- However, sometimes 32-bit constants and addresses are necessary

- Solution:

- Large constants and 32-bit addresses have to be broken into pieces and then reassembled into a register

- Example:

- Load the following 32-bit constant into register `$s0`:

`0000 0100 1011 0010 0001 0000 1101 0011`

# STEP 1: load upper immediate
# lui $s0, imm
lui $s0, 1202   # 1202 = 0000 0100 1011 0010

# Now, $s0 = 0000 0100 1011 0010 0000 0000 0000 0000

# STEP 2: add(or) the lower immediate
# ori $s0, $s0, imm
ori $s0, $s0, 4307   # 4307 = 0001 0000 1101 0011

# Now, $s0 = 0000 0100 1011 0010 0001 0000 1101 0011
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67

_Do not use `addi` in STEP 2. Because `addi` is with sign-extend. If the immediate value is negative, then the upper 16 bits will be all 1's, and adding this will ruin the upper 16 bits._

## 6.5 MIPS Addressing Modes

### 6.5.1 Register Addressing

Register addressing: a source or destination operand is specified as content of one of the registers `$0`~`$31`.

E.g. `add $s0, $s1, $s2`

### 6.5.2 Immediate Addressing

Immediate addressing: a numeric value embedded in the instruction is the actual operand.

E.g. `addi $s0, $s1, 100`

### 6.5.3 PC-Relative addressing

PC-relatvie addressing: a data or instruction memory location is specified as an offset relatvie to the incremented PC.

(PC refers to special purpose register , it stores the address of next instruction to be fetched: the operand `address = PC + offset`. Usually used in conditional branch.)

E.g. `bne $s0, $s1, Label`

### 6.5.4 Base Addressing

Base addressing: a data or instruction memory location is specified as a signed offset from a register. The address of the operand is the sum of the immediate and the value in a register.

E.g. `lw $t0, 4($s0)`

### 6.5.5 Pseudodirect Addressing

Pseudodirect addressing: the memory address is (mostly) embedded in the J-type instructions.

E.g. `j Label`

### 6.5.6 Addressing Mode vs Instruction Type

- Addressing mode: how an address (memory or register) is determined
- Instruction type: how an instruction is put together

## 6.6 Stack and Heap

Two available areas of memory used to store variables are commonly referred to as the stack and the heap.

### 6.6.1 Stack

- Store function local variables
- Store temporary variables
- Pass the arguments for next function call
- Return the address when function ends

### 6.6.2 Heap

- A much larger pool of memory than stack
- Must be explicitly allocated and deallocated by the programmer manually
- _This is completely unrelated to the Heap data structure (?)_

## 6.7 Dynamic Data

### 6.7.1 MIPS Memory Allocation

To allocate additional memory, use `sbrk` system call (9).

Example: allocate 16 bytes (= 4 words) of memory and returns the address of the new memory in $v0.

li $a0, 16 li $v0, 9 syscall # sbrk

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105

### 6.7.2 MIPS Memory Deallocation

There is no system call to deallocate memory in MIPS. Either does JAVA. While JAVA uses <u>garbage collection</u> technique to reclaim unused memory automatically.

# 7 Floating Point Numbers

## 7.1 Overview

- Floating Point Representation
- Arithmetic for Floats
- Error Propagation
- MIPS Floating Point Unit

## 7.2 Floating Point Representation

### 7.2.1 Scientific Notation

General form:

![Floating Point Representation](https://tvax4.sinaimg.cn/large/006tNc79gy1flffkh8rqmj30p4078q3i.jpg)

where:

- `m` = significand
- `b` = base number (any positive integer)
- `e` = exponent (an integer)

E.g.

- `2.73*10^(-7)` is in normalized scientific notation
- `0.273*10^(-6)` is not in normalized scientific notation
- `20.73*10^(-8)` is not in proper scientific notation

### 7.2.2 Normalized Scientific Notation for Binary

General form:

![Normalized Scientific Notation for Binary](https://tvax1.sinaimg.cn/large/006tNc79gy1flffusxbl1j30ty09041a.jpg)

E.g.

- `11.101 = 1.1101 * 2^1` is normalized form
- `11.101 = 11.101 * 2^0` is not normalized form

### 7.2.3 IEEE 754 Floating Point Standard

IEEE 754 Floating Point Standard: a type of sign and magnitude representation

![IEEE 754 Floating Point Standard](https://tvax4.sinaimg.cn/large/006tNc79gy1flfg04fu8mj30q002474l.jpg)

| | S | Biased Exponent | Fraction |
| :-----------------------: | :---: | :-------------: | :------: |
| Single-precision (32-bit) | 1 bit | 8 bits | 23 bits |
| Double-precision (64-bit) | 1 bit | 11 bits | 52 bits |

- **S**: sign bit of the fraction (0 for +, 1 for -)
- **Biased Exponent**:

- In <u>excess representation</u> for fast sorting (2's complement with negative exponent will make it look like a big number)

> The designers of IEEE 754 also wanted a floating-point representation that could be easily processed by integer comparisons, especially for sorting. This desire is why the sign is in the most significant bit, allowing a quick test of less than, greater than, or equal to 0. (It’s a little more complicated than a simple integer sort, since this notation is essentially sign and magnitude rather than two’s complement.)

- A fixed value called <u>bias</u> is subtracted from the field to get the true exponent value

- `Bias = 2^(k-1) - 1`, where `k` is the number of bits in exponent field. E.g. Single precision, `bias = 127`; double precision, `bias = 1023`.

- `Biased Exponent = Actual Exponent + Bias`. E.g. `Actual Exponent = Biased Exponent - Bias = 0000 0101 - Biad = 5 - 127 = -122`

Example: binary to single-precision floating point format

`-10.101`

`-10.101 = (-1) * (1.0101) * 2^1`

`Biased Exponent = 1 + 127 = 128 = 1000 0000`

| S | Biased Exponent | Fraction |
| :-: | :-------------: | :--------------------------: |
| 1 | 1000 0000 | 0101 0000 0000 0000 0000 000 |

> Special values (single precision):
>
> - Smallest positive normalized number: `1.0 * 2^(-126)`
> - Least negative normalized number: `-1.0 * 2^(-126)`
> - Largest positive normalized number: `1.1111 1111 1111 1111 1111 111 * 2^127 = (2-2^(-23)) * 2^127`
> - Most negative normalized number: `-1.1111 1111 1111 1111 1111 111 * 2^127 = -(2-2^(-23)) * 2^127`
> - ±∞: `± 1.0 * 2^128`
> - NaN: `1.f * 2^128` where `f ≠ 0`

### 7.2.5 Overflow, Underflow, Infinities and NaN

- Overflow: result > largest representable number (`127` for single-precision) => ±∞
- Underflow: result < smallest normalized number (`-126` for single-precision) => denormalized number
- Infinities: `1/0`
- NaN: `0.0/0.0`, `±∞/±∞`, `0*±∞`, `-∞+∞`, `sqrt(-1.0)`, `log(-1.0)`

> The purpose of NaNs is to allow programmers to postpone some tests and decisions to a later time in the program when they are convenient.

### 7.2.6 Binary Floating Point ⇌ Decimal Floating Point

Examples:

- Binary => Decimal

10.101 = 12^1 + 02^0 + 12^(-1) + 02^(-2) + 12^(-3) = 2 + 0 + 0.5 + 0 + 0.125 = 2.625
1
2
3
4
5

- Decimal => Binary

- Method 1 :

3.75 = 2^1 + 2
0 + 2^(-1) + 2(-2) = 11.11 0.625 = 2^(-1) + 2^(-3) = 0.101
1
2
3

- Method 2:

0.5_[10] = (1/2)[10] = (1/2^1)[10] = (0.1*2^0)_[2] = (1.0*2^(-1))_[2]

-0.4375_[10] = (-7/16)_[10]
             = (-7/2^4)_[10]
             = (-0.0111*2^0)_[2]
             = (-1.11*2^(-2))_[2]
1
2
3

- Method 3:

27.1875_[10] = 27_[10] + 0.1875_[10] = 11011_[2] + 0.1875_[10]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

| | \*2 | Integer Part |
| :----: | :---: | :----------: |
| 0.1875 | 0.375 | 0 |
| 0.375 | 0.75 | 0 |
| 0.75 | 1.5 | 1 |
| 0.5 | 1.0 | 1 |

=> `0.1875_[10] = 0.0011_[2]`

=> `27.1875_[10] = 11011.0011`

- _`0.3` in decimal cannot be expressed exactly by finite digits in binary_

- Decimal => Binary => Single-precision binary floating point format

- ```
-13.825_[10] = -( 13_[10] + 0.825_[10] )
= -( 1101_[2] + 0.825_[10] )
| 0.825 | Integer | | :---: | :-----: | | 1.65 | 1 | | 1.3 | 1 | | 0.6 | 0 | | 1.2 | 1 | | 0.4 | 0 | | 0.8 | 0 | | 1.6 | 1 | | 1.2 | 1 | | ... | ... |
1
2
3
4
                        .  .
-13.825_[10] = -1101.1101001_[2]
. .
= -(1.1011101001 * 2^3)_[2]
`Biased Exponent = 3 + 127 = 130 = 1000 0010` | S | Biased Exponent | Fraction | | :-: | :-------------: | :--------------------------: | | 1 | 1000 0010 | 101 1101 0011 0011 0011 0011 |

7.3 Arithmetic for Floats

7.3.1 Floating Point Addition

Floating Point Addition
Floating Point Addition

Steps:

  1. Shifted right the smaller number until its exponent matches the larger number
  2. Add the significands
  3. Normalize the sum
  4. Overflow or underflow ? => Exception
  5. Round the sum
  6. Need further normalization ? => goto 3

Example:

1
2
3
4
  1.0*2^(-1) - 1.11*2^(-2)  (using 4-bit precision)
= 1.000*2^(-1) - 0.111*2^(-1)
= 0.001*2^(-1)
= 1.000*2^(-4)

7.3.2 Floating Point Multiplication

Floating Point Multiplication
Floating Point Multiplication

Steps:

  1. Add the exponents
  2. Multiply the significands
  3. Normalize the product
  4. Overflow or underflow ? => Exception
  5. Round the product
  6. Need further normalization ? => goto 3
  7. Sign the product

Example:

1
2
3
4
5
6
  - 1.0*2^(-1) * 1.11*2^(-2)  (using 4-bit precision)
= - 1.110*2^(-3)

1.1*2^3 * 1.11*2^4 (using a 3-bit fraction)
= 10.101*2^7
= 1.010*2^8

7.3.3 IEEE 754 Rounding

Rounding modes:

  • Round Up
  • Round Down
  • Toward Zero
  • Round to Even

(MIPS and Java use round to even by default)

7.4 Error Propagation

7.4.1 Floating Point Arithmetic Errors

Sources of floating point arithmetic errors:

  • Rounding
  • Approximate representation

Denote:

  • m: the number to be represented
  • n: m's approximation

The absolute error: a = |m - n|

The relative error: r = a / |m|

The true value m[n-a, n+a] ∪ [n(1-r), n(1+r)]

7.4.2 Error Propagation

1
2
3
4
5
6
// Addition or subtraction sums absolute error of operands
(x ± a_x) + (y ± a_y) = (x + y) ± (a_x + a_y)

// Multiplication or division sums relative error of operands
// The much smaller (r_x * r_y) can be ignored
x(1 ± r_x) * y(1 ± r_y) = (x * y)(1 ± (r_x + r_y) ± r_x * r_y)

Example: What is the absolute error in y^2 - 4xz?

1
2
  y^2 * (1 ± 2*r_y) - 4xz(1 ± (r_x + r_z))
= (y^2 - 4xz) ± (2 * y^2 * r_y + 4xz(r_x + r_z))

7.5 MIPS Floating Point Unit

7.5.1 IEEE 754 for MIPS

Floating Point Unit (FPU): performs IEEE operations (MIPS core refers to FPU as coprocessor 1)

FPU features 32 single precision registers ($f0, $f1, …, $f31) or as 16 pairs of double precision registers ($f0, $f2, …, $f30, here $fi actually stands for the pair $fi and $f(i+1)).

FPU instructions does not raise exceptions. So checking for ±∞ or NaN is needed.

FPU defaults to round to even.

7.5.2 Floating-Point Instructions in MIPS

  • General format:
    • mmm.s: single precision
    • mmm.d: double precision
  • Addition:
    • add.s fd, fs, ft e.g. add.s $f0, $f1, $f2
    • add.d fd, fs, ft e.g. add.d $f0, $f2, $f4
  • Subtraction
    • sub.s
    • sub.d
  • Multiplication
    • mul.s
    • mul.d
  • Divsion
    • div.s
    • div.d

7.5.3 Load/Store for Floating Point

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// Load word copr. 1 (32-bit word)
// lwc1 fd, n(rs)
lwc1 $f1, 100($s2) // $f1 = Memory[$s2 + 100]

// Store word copr. 1 (32-bit word)
// swc1 fd, n(rs)
swc1 $f1, 100($s2) // Memory[$s2 + 100] = $f1
// Address rs+n must be word aligned

// To store/load doubles
lwc1 $f0, 0($s0)
lwc1 $f1, 4($s0)
// Or with pseudo-instruction
l.d $f0, 0($s0)

No encoding for immediate floating-point operands.

Must be placed in .data segment due to too many bytes.

Assembler directives: .float n or .double n.

7.5.4 Copying Data Between FPU and CPU

1
2
3
4
5
6
7
// Move from copr. 1
// mfc1 rd, fs
mfc1 $t0, $f7 // copy content of $f7 to $t0

// Move to copr. 1
// mtc1 $zero, $fs
mtc1 $zero, $f12 // set $f12 := 0

Only the bit pattern is copied, not the actual value that it represents.

7.5.5 Floating Point I/O

Floating Point I/O
Floating Point I/O

(See full list on 4.5.5)

7.5.6 Example: Area of a Circle

1
2
3
4
5
6
7
8
9
10
11
12
  		.data
pi: .double 3.141592653589793
.text
.globl main
main: li $v0, 7 # read_double
syscall # radius < -user input
la $a0, pi # load address
l.d $f12, 0($a0) # load double at address $a0 into $f12, $f12 := pi
mul.d $f12, $f12, $f0 # $f12 := $f12 * r
mul.d $f12, $f12, $f0 # $f12 := $f12 * r
li $v0, 3 # print_double
syscall # print area

7.5.7 Floating Point Comparison

General format:

  • c.x.s: single
  • c.x.d: double

where x can be:

  • eq: equal
  • neq: not equal
  • lt: less than
  • le: less than or equal
  • gt: greater than
  • ge: greater than or equal

7.5.8 Floating Point Branch

  • bclt label - Branch on coprocessor 1 true
    • Goto label if flag is true, otherwise continue
  • bclf label - Branch on coprocessor 1 false
    • Goto label if flag is false, otherwise continue

Example: jump to islt if the contents of $f0<$f1 as single precision floating point numbers

1
2
c.lt.s $f0, $f1
bclt islt

8 MIPS Procedure

8.1 Overview

  • MIPS Procedure
  • Recursive Procedure

8.2 Procedures

8.2.1 Procedures

  • Definition: protion of code, within a larger program, which appear frequently
  • Help to
    • reduce code duplication
    • improve code re-usability
    • decompose complex programs into manageable parts
  • Other names
    • methods (Java, OO languages)
    • functions (C, C++, Haskell)
    • routines, subroutines (not popular anymore)

8.2.2 Callee vs Caller

  • Callee: a function called by a caller
  • Caller: a functoin calling a callee

8.2.3 MIPS Procedure Calling Convention I

  • Arguments are passed in registers $a0 ~ $a3
  • The return value is passed in register $v0 (and $v1 if necessary)
  • The return address of the caller is stored in register $ra
  • jal label: jump and link
    • Effect: call a procedure
    • Side effects:
      • $ra = PC + 4
      • PC = [label]
  • js rs: jump register
    • Effect: jump to the address specified by the register rs
    • (To return to the caller using jr $ra)

8.2.4 Example: x * x + y * y

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
		.text
f: # procedure f, args in x=$a0, y=$a1
mul $v0, $a0, $a0 # $v0 = x*x
mul $s0, $a1, $a1 # $s0 = y*y
add $v0, $v0, $s0 # $v0 = x*x+y*y
jr $ra # return to caller, where $ra contains the return address of the instruction `move $a0, $v0`)

main: li $a0, 3 # $a0 = 3
li $a1, 4 # $a1 = 4
jal f # jump and link f
move $a0, $v0 # $a0 = $v0 = 25
li $v0, 1 # print_int 25
syscall
li $v0, 10 # exit
syscall

8.2.5 MIPS Procedure Calling Convention II

  • To use any registers within the procedure:
    • Save what is in these registers at the start of the procedure
    • Restore these values at the end of the procedure
  • The callee has to restore: $s0 ~ $s7, $ra, $sp
  • The caller is responsible itself for most other registers: e.g. $t0 ~ $t7, $a0 ~ $a3

We use a stack to do this.

8.2.6 The Stack

Structure of the Stack
Structure of the Stack
  • In MIPS, stack is stored in memory (last-in, first-out [lifo] list)
  • The stack pointer $sp
    • points to the last location on stack
    • contains the address of the next free location
  • Each procedure is associated with a call frame with a frame pointer $fp
    • points to the first word of procedure frame
    • does not change during the execution

8.2.7 Stack: Push & Pop

Push a Word
Push a Word
Pop a Word
Pop a Word

8.2.8 Stack in MIPS

  • To push $s0 onto the stack ($s0 is placed on the stack):

    1
    2
    addi $sp, $sp, -4	# allocate space on the stack
    sw $s0, 0($sp) # save $s0 -- push
  • To pop $s0 from the stack ($s0 is taken from the stack):

    1
    2
    lw $s0, 0($sp)		# restore $s0 -- pop
    addi $sp, $sp, 4 # deallocate space

8.2.9 MIPS Procedure Calling Convention III

  • Caller
    • Push any of $a0 ~ $a3, $v0 ~ $v1 and $t0 ~ $t0 if necessary
    • Place arguments in $a0 to $a3 if needed
    • Make the call using jal callee
    • Pop saved registers and/or extra arguments off stack
  • Callee
    • Push any of $ra, $s0 ~ $s7 that may be overwrritten
    • Perform desired task
    • Place result in $v0 and $v1
    • Pop above registers off the stack
    • Return to caller with jr $ra

8.2.10 Argument Passing

  • First four arguments: $a0 ~ $a3

  • Remaining arguments: put on stack by the caller

  • Example: fun(int x0, int x1, …, int x7)

8.2.11 Return Values

  • First two return values: $v0, ($v1)
  • Remaining return values: put on stack by the procedure, and popped from the stack by the caller

8.2.12 MIPS Procedure Call: Good Strategy

  • Caller
    • At call time
      • Put arguments in $a0 ~ $a3
      • Save any caller-save temporaries
  • Callee
    • At entry
      • Allocate all stack space
      • Save $ra, $s0 ~ $s3
    • At exit
      • Restore $ra, $s0 ~ $s3
      • Deallocate all stack space
      • Put return value in $v0
  • Caller
    • After return
      • Retrieve return value from $v0
      • Restore any caller-save temporaries

8.2.13 Example: g(x) = f(x, x+1) + f(x+2, x+3)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
f:		# procedure f, args x=$a0, y=$a1, returns x*x+y*y
mul $v0, $a0, $a0
mul $t0, $a1, $a1
add $v0, $a0, $t0 # $v0 = x*x+y*y
jr $ra

g: # procedure g, arg x=$a0, returns f(x, x+1) + f(x+2, x+3)
addi, $sp, $sp, -8 # allocate space on the stack
sw $s0, 0($sp) # save $s0 -- push
sw $ra, 4($sp) # save $ra -- push

addi $a1, $a0, 1 # $a1 = x + 1
jal f # f(x, x+1)

move $s0, $v0 # $s0 = f(x, x+1)

addi $a0, $a1, 1 # $a0 = x + 2
addi $a1, $a0, 1 # $a1 = x + 3
jal f # f(x+2, x+3)

add $s0, $s0, $v0 # $s0 = f(x, x+1) + f(x+2, x+3)
move $v0, $s0 # $v0 = f(x, x+1) + f(x+2, x+3)

lw $s0, 0($sp) # restore $s0 -- pop
lw $ra, 4($sp) # restore $ra -- pop
addi $sp, $sp, 8 # deallocate space
jr $ra

main: # prints out g(3)
li $a0, 3 # $a0 = 3
jal g

move $a0, $v0 # $a0 = f(x, x+1) + f(x+2, x+3)

li $v0, 1 # print_int
syscall

li $v0, 10 # exit
syscall

8.3 Recursive Procedure

8.3.1 Example: Factorial

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# int fact(int n): return n<=0 ? 1 : n * fact(n-1);
# assume $a0 = n

fact: addi $sp, $sp, -8 # space for two words
sw $ra, 4($sp)
sw $a0, 0($sp)

li $v0, 1 # $v0 = 1
blez $a0, fact_return # if ($a0<=0) then {goto fact_return}

addi $a0, $a0, -1 # n--

jal fact

tmpl: lw $a0, 0($sp) # retrieve original n
mul $v0, $v0, $a0 # n * fact(n-1)

fact_return:lw $ra, 4($sp)
addi $sp, $sp, 8
jr $ra

9 Piplining

9.1 Overview

  • Pipeline
  • Pilined Implementation
  • Obstacles to pipelining

9.2 Pipelining

9.2.1 Functional Units and Their Timings

Executing a MIPS instruction can take up to five functional units:

Type Duration Stage Use
Memory 200ps IF (Instruction Fetch) get instruction from memory
Memory 200ps MEM (Memory Access) data memory read or write
ALU 200ps EX (Execute) ALU operation
Register 100ps ID (Instruction Decode) get source register operands
Register 100ps WB (Write-Back) result to destination register

9.2.2 Memory Hierarchy

Memory Hierarchy
Memory Hierarchy

9.2.3 Critical Paths and Instruction Timings

Instruction Class IF ID EX MEM WB Total
R-format(add, sub, AND, OR, slt) 200 100 200 100 600
Load Word (lw) 200 100 200 200 100 800
Store Word (sw) 200 100 200 500
Branch (beq) 200 100 200 500
Jump (j) 200 200

(This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.)

9.2.4 Pipelining

Single-cycle, nonpipelined execution in top versus pipelined execution in bottom.
Single-cycle, nonpipelined execution in top versus pipelined execution in bottom.

In this case, we see a fourfold speed-up on average time between instructions, from 800 ps down to 200 ps. The pipeline stage times of a computer are limited by the slowest resource, either the ALU operation or the memory access. We assume the write to the register file occurs in the first half of the clock cycle and the read from the register file occurs in the second half. We use this assumption throughout this chapter.

Pipelining: an implementation technique in which multiple instructions are overlapped in execution.

Executing n (n >> s) instructions takes:(n + (s-1)) * t / s ≈ nt / s.

Where:

  • t is the timing of every instruction
  • s is the number of the stages of every instruction
  • t/s is the timing of a cycle
  • s-1 is the further cycles that the n-th instruction needs to complete

9.2.5 Designing ISAs for Pipelining

  • MIPS instructions are the same length. => much easier to fetch instructions in the first pipeline stage, and to decode them in the second stage.
  • MIPS has only a few instruction formats, with the source register fields being located in the same place in each instruction. => the second stage can begin reading the register file at the same time that the hardware is determining what type of instruction was fetched.
  • MIPS memory operands only appear in loads or stores. => the execute stage can be used to calculate the memory address, and then access memory in the following stage.
  • MIPS operands must be aligned in memory. => the requested data can be transferred between processor and memory in a single pipeline stage.

9.2.6 Pipeline Hazards

  • Structural Hazard: When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.

    • E.g. In MIPS pipeline with a single memory: both load/store and insrtuction fetch require memory access at the same time
    • Solutions:
      • Avoid scheduling instructions that would create sturctural hazards
      • Hardware includes control logic that stalls until earlier instruction is no longer using contended resource
      • Duplicate hardware to design so that each instruction can access to independent resources at the same time.
  • Data Hazard: When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.

    • Solutions:

      • Avoid

      • Stall

      • Forwarding / Bypassing: A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmervisible registers or memory.

        • E.g. EX forwarding: add $s0, $t0, $t1, sub $t2, $s0, $t3

          Graphical representation of forwarding.
          Graphical representation of forwarding.

          Extra hardware to take result directly from ALU output when it is computed instead of waiting for it to be stored in a register.

          Shading on the right half of the register file or memory means the element is read in that stage.

          Shading of the left half means it is written in that stage.

          MEM has a white background because add does not access the data memory.

        • E.g. MEM forwarding: lw $s0, 20($t1), sub $t2, $s0, $t3

          We need a stall even with forwarding when an R-format instruction following a load tries to use the data.
          We need a stall even with forwarding when an R-format instruction following a load tries to use the data.

          This figure shows an important pipeline concept, officially called a pipeline stall, but often given the nickname bubble.

          Without the stall, the path from memory access stage output to execution stage input would be going backward in time, which is impossible. This figure is actually a simplification, since we cannot know until after the subtract instruction is fetched and decoded whether or not a stall will be necessary.

  • Control Hazard / Branch Hazard: When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.

    • Solutions:

      • Stall

        Pipeline showing stalling on every conditional branch as solution to control hazards.
        Pipeline showing stalling on every conditional branch as solution to control hazards.

        This example assumes the conditional branch is taken, and the instruction at the destination of the branch is the OR instruction. There is a one-stage pipeline stall, or bubble, after the branch. In reality, the process of creating a stall is slightly more complicated. The effect on performance, however, is the same as would occur if a bubble were inserted.

      • Branch Prediction: A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.

        Predicting that branches are not taken as a solution to control hazard.
        Predicting that branches are not taken as a solution to control hazard.

        The top drawing shows the pipeline when the branch is not taken. The bottom drawing shows the pipeline when the branch is taken. As we noted previously, the insertion of a bubble in this fashion simplifies what actually happens, at least during the first clock cycle immediately following the branch.

        If the prediction is wrong, the pipeline control must ensure that the instructions wrongly executed have no effectand, and must restart the pipeline from the proper branch address

        • Static Branch Prediction:
          • Predict "taken":
            • Backwards branches (likely to be a loop)
            • Unconditional branches (or jumps)
          • Predict "not taken":
            • Forward branches (likely to be if-then control)
        • Dynamic Branch Prediction:
          • Keep track of historic branch decisions
          • Over 90% prediction accuracy
      • Delayed Decision (actually used by the MIPS architecture): The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay. It is hidden from the MIPS assembly language programmer because the assembler can automatically arrange the instructions to get the branch behavior desired by the programmer. MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction. In our example, the add instruction before the branch in Figure 4.31 does not affect the branch and can be moved after the branch to fully hide the branch delay. Since delayed branches are useful when the branches are short, no processor uses a delayed branch of more than one cycle. For longer branch delays, hardware-based branch prediction is usually used.

9.2.7 Example: Reordering Code to Avoid Pipeline Stalls

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# a = b + e;
# c = b + f;
# Assuming all variables are in memory and are addressable as offsets from $t0.

lw $t1, 0($t0) # load b from $t0+[0]
lw $t2, 4($t0) # load e from $t0+[4]
add $t3, $t1, $t2 # a = b + e
sw $t3, 12($t0) # store a into $t0+[12]
lw $t4, 8($t0) # load f from $t0+[8] *
add $t5, $t1, $t4 # c = b + f
sw $t5, 16($t0) # store c into $t0+[16]

# After

lw $t1, 0($t0) # load b from $t0+[0]
lw $t2, 4($t0) # load e from $t0+[4]
lw $t4, 8($t0) # load f from $t0+[8] *
add $t3, $t1, $t2 # a = b + e
sw $t3, 12($t0) # store a into $t0+[12]
add $t5, $t1, $t4 # c = b + f
sw $t5, 16($t0) # store c into $t0+[16]

10 Pointers and References

10.1 Overview

  • Pointers
  • Reference
  • Arrays
  • Strings

10.2 Pointers

10.2.1 Pointers

Pointer: a variable containing the address of another variable.

In C: (Assume c in memory at address 0x10000000, p in $a0, x in $s0.)

1
2
3
4
5
int c = 100;
int *p;
p = &c;
x = *p;
*p = 200;

In MIPS:

1
2
3
4
lui $a0, 0x10000000		# p = &c = 0x10000000
lw $s0, 0($a0) # x = *p = c = 100
addi $t0, $zero, 200 # $t0 = 200
sw $t0, 0($a0) # *p = $t0 = 200

10.3 References

10.3.1 References

In Java:

1
2
3
4
5
6
7
8
9
10
11
12
13
class Int{
public int n;
public Int(int m) { n = m; }
}
public class Reference{
public static void main(String args[]){
Int a = new Int(1); // create an Intobject with n==1
Int b = new Int(2); // create an Intobject with n==2
a = b; // copy the reference
b.n = 3;
System.out.println(a.n); // 3
}
}

In MIPS:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
		.data
aa: .word 3
bb: .word 2
.text
.globl main
main: la $a0, aa # load address of aa into $a0
la $a1, bb # load address of bb into $a1
jal swap
... # print a, b
swap: lw $t0, ($a0)
lw $t1, ($a1)
sw $t1, ($a0)
sw $t0, ($a1)
jr $ra

10.4 Arrays

10.4.1 Arrays

1
2
3
4
5
		.data
as: .word 3, 1, 4, 1, 5, 9, 2, 6
.text
main: la $t1, as # load address of as into $t1
lw $t2, 24($t1) # load word from $t1+24 (as[6]) into $t4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
		.data
asx: .word 1, 2, 3
bsx: .word 5, 4
.text
.globl main
main: la $s0, asx
la $s1, bsx
move $s0, $s1 # $s0 = $s1 = the address of bsx
li $t0, 7
sw $t0, 0($s1) # bsx[0] = 7
lw $a0, 0($s0)
li $v0, 1 # print bsx[0]
syscall
li $v0, 10 # exit
syscall
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# return the max value of an array
.data
as: .word 3, 1, 4, 1, 5, 9, 2, 6
.text
.globl main
main: addi $sp, $sp, -4 # allocate the space on stack
sw $ra, 0($sp) # save the return address -- push

la $a0, as # $a0 = &as
li $a1, 8 # $a1 = 8 (the length of the array)
jal array_max # array_max(&as, 8)

move $a0, $v0 # $a0 = $v0
li $v0, 1 # print_int
syscall

lw $ra, 0($sp) # restore return address -- pop
addi $sp, $sp, 4 # deallocate the space on stack

jr $ra

array_max: # $a0: array address, $a1: array length
li $v0, 0x80000000 # $v0 = MIN_VALUE = 2147483648
li $t0, 0 # counter $t0 = i = 0
j am_cond

am_loop: sll $t1, $t0, 2 # shift left logical, $t1 = 4 * i
add $t1, $a0, $t1 # $t1 = &as + 4 * i
lw $t1, ($t1) # $t1 = as[&as + 4 * i]
addi $t0, $t0, 1 # i++
bge $v0, $t1, am_cond # if ($v0 >= as[&as + 4 * i]) then {goto am_cond}
move $v0, $t1 # $v0 = as[&as + 4 * i]

am_cond: blt $t0, $a1, am_loop # i < length && {goto am_loop}
jr $ra # back to the main function
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Clearing an array using reference
# clear (int array[], int size)
# {
# int i;
# for (i = 0; i < size; i++)
# {
# array[i] = 0;
# }
# }

clear: # $a0: &array; $a1: size
move $t0, $zero # counter $t0 = i = 0

loop: sll $t1, $t0, 2 # $t1 = 4 * i
add $t2, $a0, $t1 # $t2 = &array[0] + 4 * i
sw $zero, 0($t2) # array[&array[0] + 4 * i] = 0
addi $t0, $t0, 1 # i++

slt $t3, $t0, $a1 # $t3 = i<size ? 1 : 0
bne $t3, $zero, loop # $t3 == 0 && {goto loop}

jr $ra

# Using "mul" or "add" in the loop to compute the new index i
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Clearing an array using pointer
# clear (int *array, int size)
# {
# int *p;
# for (p = &array[0]; p<&array[0]; p++)
# {
# *p = 0;
# }
# }

clear: # $a0: &array; $a1: size
move $t0, $a0 # int *p = &array[0]
sll $t1, $a1, 2 # $t1 = 4 * size
add $t2, $a0, $t1 # $t2 = &array[0 + size] = &array[size]

loop: sw $zero, 0($t0) # *p = 0
addi $t0, $t0, 4 # p = p + 4

slt $t3, $t0, $t2 # $t3 = p < &array[size] ? 1 : 0
bne $t3, $zero, loop # $t3 != 0 && {goto loop}

jr $ra

# Operating on the pointer directly in the loop

10.5 Strings

10.5.1 Strings

Example: printing the length of a string

In C:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
int strlen (char *s)
{
int len = 0;
while(*s != 0)
{
s++;
l++;
}
return len;
}
int main ()
{
printf("%d\n", strlen("hello"));
}

In MIPS:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
				.data
hello: .asciiz "hello"
.text
.globl main
main: la $a0, hello # char *s = "hello"
jar strlen

move $a0, $v0
li $v0, 1 # print_int
syscall
li $v0, 10 # exit
syscall

strlen: # $a0: pointer to the string
li $v0, 0 # len = 0
j strlen_cond

strlen_loop: addi $v0, $v0, 1 # len++

strlen_cond: lbu $t0, ($a0) # Load byte unsigned $t0 = *a0
addi $a0, $a0, 1 # s++
bne $t0, $zero, strlen_loop # $t0 != 0 && {strlen_loop}