Usually, the yield of a manufacturing process is the proportion of working dies per wafer (not per ingot). This pertains not only to CPUs but to any integrated circuit. This is the reason why ASICs are so expensive.
Execution time (also called response time) is how long it takes to do a task, and throughput is how much work is done per unit time. Throughput is roughly the bandwidth of a processor, while response time is specific to a single task.
For example, if you have multiple cores, the execution time for a task that’s single threaded or is difficult to parallelize effectively will not change, but the throughput of the multi-core chip will be higher than one that is single-core.
So how do we measure performance? Let us define performance by:
\[P(x) = \frac{1}{T(x)}\]for some task $x$; that is, performance is inversely proportional to execution time. For example, if it took 10s to run a task $x$ on some machine $A$ and 15s on $B$, then one can say $A$ is 1.5x faster than $B$, because:
\[\frac{P^A(x)}{P^B(x)} = \frac{T^B(x)}{T^A(x)} = \frac{15}{10} = 1.5\]How can we modify CPU time? This requires us to be familiar with a couple of terms:
Suppose $T$ is the CPU time, $C$ is the number of clock cycles required for some task, and $P$ is the clock period (the duration of one cycle):
\[T = CP\]What can we do to decrease the CPU time? Well, we can either decrease the number of cycles or the duration of each period. Decreasing the number of cycles could be done using vectorized instructions (doing multiple operations at once, though this is not necessarily the whole story) or simply devising a better algorithm. To decrease the period, one can overclock a processor to run at a higher clock rate in GHz (and thus a lower period).
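As a quick sanity check with made-up numbers: a 2 GHz clock has a period of $P = 0.5$ ns, so a task that needs $C = 10^9$ cycles takes
\[T = CP = 10^9 \times 0.5\,\text{ns} = 0.5\,\text{s}\]and halving either factor halves the CPU time.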
So far, we have assumed that each instruction costs one cycle; we had some $C$ that we needed to lower. In reality, a single instruction may require multiple cycles. If a program uses $N$ distinct instructions, and the $k$-th instruction is used $N_k$ times and requires $I_k$ clock cycles each, then $C$ can be computed as:
\[C = \sum_{k=1}^{N} I_k N_k\]So now our vectorized instruction assumption isn’t necessarily true. If we can vectorize $K$ instructions, we don’t necessarily get a $K$x speedup, since the vectorized instruction may also cost more clock cycles on average.
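Here is a small sketch of this bookkeeping in C; the instruction mix below (the $I_k$ and $N_k$ values) is entirely made up for illustration:

```c
#include <stdio.h>

int main(void) {
    // Hypothetical mix: I[k] = cycles per instruction, N[k] = times used.
    int  I[] = {1, 2, 4};       // e.g., add, load, vectorized op
    long N[] = {500, 300, 50};  // how often each kind of instruction runs
    int  n   = sizeof I / sizeof I[0];

    long C = 0, insts = 0;
    for (int k = 0; k < n; k++) {   // C = sum_k I_k * N_k
        C     += (long)I[k] * N[k];
        insts += N[k];
    }
    printf("C = %ld cycles over %ld instructions (avg CPI = %.2f)\n",
           C, insts, (double)C / insts);
    return 0;
}
```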
Improving algorithms runs into a limit described by Amdahl’s law:
\[T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}\]This is simply saying that whatever you’re trying to optimize, you’ll never get below $T_{unaffected}$, or in other words, $T_{improved} \geq T_{unaffected}$. So the moral of the story of this law is that if you want to optimize algorithms, make the common case fast.
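For a concrete (made-up) case: if 80% of a 10s runtime is affected and we speed that part up by 4x,
\[T_{improved} = \frac{8\,\text{s}}{4} + 2\,\text{s} = 4\,\text{s}\]an overall speedup of only 2.5x; even an infinite improvement factor leaves the 2s of unaffected time, capping the speedup at 5x.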
CISC vs. RISC ISAs will have higher and lower $P$’s (clock periods), respectively.

The power wall is a phenomenon where our current chip designs see only marginal improvements from further voltage reduction and heat reduction. How else can we reduce power? As we make denser and denser chips, it becomes harder to cool chips that are drawing in so much power.
Clock rate and power consumption have historically grown together, but lately we have switched to multi-core processor designs: we keep the clock rate the same but add more cores, each of which is optimized for lower power consumption. This is how one can keep power consumption roughly the same while still increasing performance. However, the definition of performance is now slightly different… How can we still evaluate it?
One type of benchmark uses elapsed time over multiple programs, and takes the geometric mean of all the runtimes. An example is SPEC CPU2006.
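Concretely, for $n$ benchmark runtimes $t_1, \dots, t_n$ (SPEC first normalizes each runtime against a reference machine), the reported score is the geometric mean:
\[\left( \prod_{i=1}^{n} t_i \right)^{1/n}\]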
There are two main ISA classes: RISC and CISC, which stand for Reduced Instruction Set Computer and Complex Instruction Set Computer, respectively.
There is some grey area between RISC and CISC, since Intel actually “compiles” its instructions into “micro-ops”, which is RISC-like.
MIPS is a RISC ISA, and it has the following registers:

- `$t0, $t1, ... $t9` for temporary values. (regs 8-15, then 24-25)
- `$s0, $s1, ... $s7` for saved registers. (regs 16-23)

Its memory layout looks like:

```
-------- High address (0xffffffff)
| stack
|   |
|   v
|   ^
|   |
| heap
| uninitialized data
| initialized static data
| text
-------- Low address (0x00000000)
```
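A quick way to poke at this layout is to print an address from each region; the exact values vary by OS, architecture, and ASLR, so treat this as a sketch rather than a guarantee:

```c
#include <stdio.h>
#include <stdlib.h>

int initialized = 1;   // initialized static data
int uninitialized;     // uninitialized data (bss)

int main(void) {
    int local;                              // lives on the stack
    int *dynamic = malloc(sizeof *dynamic); // lives on the heap

    printf("text:   %p (main)\n",          (void *)main);
    printf("data:   %p (initialized)\n",   (void *)&initialized);
    printf("bss:    %p (uninitialized)\n", (void *)&uninitialized);
    printf("heap:   %p (malloc'd)\n",      (void *)dynamic);
    printf("stack:  %p (local)\n",         (void *)&local);

    free(dynamic);
    return 0;
}
```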
We can see that if the most significant byte is at the lowest address (big-endian, which MIPS traditionally is), then, for example, an unsigned 32-bit integer array holding { 128, 256 } in the stack is laid out byte-by-byte as:

```
[ 00 00 00 80 = 128,  00 00 01 00 = 256 ]
  low  ->  high        low  ->  high
```
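A minimal sketch to check which byte order your own machine uses (traditional MIPS is big-endian, while most desktop CPUs today are little-endian):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 256;                        // 0x00000100
    unsigned char *bytes = (unsigned char *)&word;

    for (int i = 0; i < 4; i++)                 // print from low address to high
        printf("%02x ", bytes[i]);

    // Big-endian prints 00 00 01 00; little-endian prints 00 01 00 00.
    printf("(%s-endian)\n", bytes[1] == 0x01 ? "little" : "big");
    return 0;
}
```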
For example, let’s translate the following C code:
int f, g, h, i, j;
// Some initialization
f = (g + h) - (i + j); // we focus on this line!
Into MIPS instructions:
# $s0 = f, $s1 = g, $s2 = h, $s3 = i, $s4 = j.
add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1
Immediate operands are literal constants. For example:
addi $s3, $s3, 4 # s3 += 4;
addi $s2, $s1, -1 # No subi, so s2 = s1 - 1;
Most of the time, our immediate operands are small constants, so we can shrink the constant field in our instruction from a full 32 bits to something smaller.
In particular, 0 has its own register that always holds its value: the `$zero` register (e.g., `add $t2, $s1, $zero` copies `$s1` into `$t2`). This proves to make the ISA simpler while sacrificing 1 register slot.
Conditional operations make up the `if` statements in C. Here are some examples of conditional operators:

- `beq rs, rt, L1` is equivalent to `if(rs == rt) goto L1;`
- `bne rs, rt, L1` is equivalent to `if(rs != rt) goto L1;`
- `j L1` is equivalent to `goto L1;`
- `slt rd, rs, rt` is equivalent to `if(rs < rt) rd = 1; else rd = 0;` Use `sltu` if you’re comparing unsigned numbers.
- `slti rt, rs, constant` is equivalent to `if(rs < constant) rt = 1; else rt = 0;` Use `sltiu` if you’re using unsigned numbers.

As expected, conditional operations will change the program counter register, and therefore it is hard to optimize a block of code that has jump statements.
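To make the mapping concrete, here is a sketch of how a compiler might lower an `if/else` into the branch-and-label style above, written in C with the corresponding MIPS in comments; the register assignments are assumptions for illustration only:

```c
// Assume $s0 = f, $s1 = g, $s2 = h, $s3 = i, $s4 = j.
void lower(int g, int h, int i, int j) {
    int f;
    // Original: if (i == j) f = g; else f = h;
    if (i != j) goto Else;   // bne $s3, $s4, Else
    f = g;                   // add $s0, $s1, $zero
    goto Exit;               // j   Exit
Else:
    f = h;                   // add $s0, $s2, $zero
Exit:
    (void)f;                 // suppress unused-variable warnings
}
```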
A basic block is a sequence of instructions with no branches in or out (except at its entry and exit), and the compiler can easily optimize within these blocks.
In memory, R-format looks like:
[op (6 bits),
rs (5 bits),
rt (5 bits),
rd (5 bits),
shamt (5 bits),
funct (6 bits)]
- The op code (`op`) is one that indicates what format of an instruction we’re running in. (For example, all R-format instructions will have the same op code.)
- `rs, rt, rd` are for source, second source, and destination register numbers respectively.
- `shamt` is how many shifts to perform. This is used for shift instructions only.
- `funct` describes what function to perform (for example, `add`).

For example:
add $t0, $s1, $s2
is translated into:
[0, # add is R-format, and op code for R is 0
17, # s1 is mapped to 17 as register number
18, # s2 is mapped to 18
8, # t0 is mapped to 8
0, # no shift is necessary; this isn't a shift command
32] # 32 = function id for add
= 0x02324020
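As a sanity check, here is a small sketch (the helper name `encode_r` is my own) that packs the six fields into a word with shifts and ORs:

```c
#include <stdint.h>
#include <stdio.h>

// Pack the R-format fields into one 32-bit word.
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    // add $t0, $s1, $s2  ->  op=0, rs=17, rt=18, rd=8, shamt=0, funct=32
    printf("0x%08x\n", encode_r(0, 17, 18, 8, 0, 32)); // prints 0x02324020
    return 0;
}
```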
I stands for immediate, as in immediate operands.
In memory, I-format looks like:
[op (6 bits),
rs (5 bits),
rt (5 bits),
constant/addr (16 bits)]
Same idea as R-format, except the last field is `constant/addr`, holding 16 bits. But 16 bits isn’t enough to hold all 32 bits of a word. We will learn the details of how to represent 32-bit constants later, but tldr: it requires 2 instructions.
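Following the same pattern as `encode_r` above, a hypothetical `encode_i` helper just masks the immediate down to its 16-bit field, which is exactly why a full 32-bit constant can’t fit in one instruction:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper, analogous to encode_r: pack the I-format fields.
uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, int32_t imm) {
    return (op << 26) | (rs << 21) | (rt << 16) | ((uint32_t)imm & 0xFFFFu);
}

int main(void) {
    // addi $s3, $s3, 4  ->  op=8, rs=19, rt=19, imm=4
    printf("0x%08x\n", encode_i(8, 19, 19, 4)); // prints 0x22730004
    return 0;
}
```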