Usually, the yield of a manufacturing process is the proportion of working dies per wafer (not per ingot). This pertains not only to CPUs but to any integrated circuit. This is the reason why ASICs are so expensive.
Execution time (also called response time) is how long it takes to do a task, and throughput is how much work is done per unit time. Throughput is roughly the bandwidth of a processor, while response time is specific to a single task.
For example, if you have multiple cores, the execution time for a task that’s single threaded or is difficult to parallelize effectively will not change, but the throughput of the multi-core chip will be higher than one that is single-core.
So how do we measure performance? Let us define performance by:
\[P(x) = \frac{1}{T(x)}\]for some task $x$; that is, performance is inversely proportional to execution time. For example, if it took 10s to run a task $x$ on some machine $A$ and 15s on $B$, then one can say $A$ is 1.5x faster than $B$, because:
\[\frac{P^A(x)}{P^B(x)} = \frac{T^B(x)}{T^A(x)} = \frac{15}{10} = 1.5\]How can we modify CPU time? This requires us to be familiar with a couple of terms:
Suppose $T$ is the CPU time, $C$ is the number of clock cycles required for some task, and $P$ is the clock period (the duration of one cycle):
\[T = CP\]What can we do to decrease the CPU time? Well, we can either decrease the number of cycles or the duration of each period. Decreasing the number of cycles could be done using vectorized instructions (doing multiple operations at once, though this is not necessarily the whole story) or simply devising a better algorithm. To decrease the period, one can overclock a processor to run at a higher clock rate in GHz (and thus a lower period).
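As a quick sanity check with made-up numbers: a 2 GHz clock has a period of $P = 0.5$ ns, so a task that needs $C = 10^9$ cycles takes
\[T = CP = 10^9 \times 0.5\,\text{ns} = 0.5\,\text{s}\]and halving either factor halves the CPU time.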
So far, we have assumed that each instruction costs one cycle; we had some $C$ that we needed to lower. In reality, a single instruction may require multiple cycles. If a program uses $N$ distinct instructions, and the $k$-th instruction is used $N_k$ times and requires $I_k$ clock cycles each, then $C$ can be computed as:
\[C = \sum_{k=1}^{N} I_k N_k\]So now our vectorized instruction assumption isn’t necessarily true. If we can vectorize $K$ instructions, we don’t necessarily get a $K$x speedup, since the vectorized instruction may also cost more clock cycles on average.
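Here is a small sketch of this bookkeeping in C; the instruction mix below (the $I_k$ and $N_k$ values) is entirely made up for illustration:

```c
#include <stdio.h>

int main(void) {
    // Hypothetical mix: I[k] = cycles per instruction, N[k] = times used.
    int  I[] = {1, 2, 4};       // e.g., add, load, vectorized op
    long N[] = {500, 300, 50};  // how often each kind of instruction runs
    int  n   = sizeof I / sizeof I[0];

    long C = 0, insts = 0;
    for (int k = 0; k < n; k++) {   // C = sum_k I_k * N_k
        C     += (long)I[k] * N[k];
        insts += N[k];
    }
    printf("C = %ld cycles over %ld instructions (avg CPI = %.2f)\n",
           C, insts, (double)C / insts);
    return 0;
}
```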
Improving algorithms runs into a limit described by Amdahl’s law:
\[T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}\]This is simply saying that whatever you’re trying to optimize, you’ll never get below $T_{unaffected}$, or in other words, $T_{improved} \geq T_{unaffected}$. So the moral of the story of this law is that if you want to optimize algorithms, make the common case fast.
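For a concrete (made-up) case: if 80% of a 10s runtime is affected and we speed that part up by 4x,
\[T_{improved} = \frac{8\,\text{s}}{4} + 2\,\text{s} = 4\,\text{s}\]an overall speedup of only 2.5x; even an infinite improvement factor leaves the 2s of unaffected time, capping the speedup at 5x.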
CISC vs. RISC ISAs will have higher and lower $P$’s (clock periods), respectively.

The power wall is a phenomenon where our current chip designs see only marginal improvements from further voltage reduction and heat reduction. How else can we reduce power? As we make denser and denser chips, it becomes harder to cool chips that are drawing in so much power.
Clock rate and power consumption have historically grown together, but lately we have switched to multi-core processor designs: we keep the clock rate the same but add more cores, each of which is optimized for lower power consumption. This is how one can keep power consumption roughly the same while still increasing performance. However, the definition of performance is now slightly different… How can we still evaluate it?
One type of benchmark uses elapsed time over multiple programs, and takes the geometric mean of all the runtimes. An example is SPEC CPU2006.
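Concretely, for $n$ benchmark runtimes $t_1, \dots, t_n$ (SPEC first normalizes each runtime against a reference machine), the reported score is the geometric mean:
\[\left( \prod_{i=1}^{n} t_i \right)^{1/n}\]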
There are two main ISA classes: RISC and CISC, which stand for Reduced Instruction Set Computer and Complex Instruction Set Computer, respectively.
There is some grey area between RISC and CISC, since Intel actually “compiles” its instructions into “micro-ops”, which is RISC-like.
MIPS is a RISC ISA, and it has the following registers:

- `$t0, $t1, ... $t9` for temporary values. (regs 8-15, then 24-25)
- `$s0, $s1, ... $s7` for saved registers. (regs 16-23)

Its memory layout looks like:

```
-------- High address (0xffffffff)
| stack
|   |
|   v
|   ^
|   |
| heap
| uninitialized data
| initialized static data
| text
-------- Low address (0x00000000)
```
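A quick way to poke at this layout is to print an address from each region; the exact values vary by OS, architecture, and ASLR, so treat this as a sketch rather than a guarantee:

```c
#include <stdio.h>
#include <stdlib.h>

int initialized = 1;   // initialized static data
int uninitialized;     // uninitialized data (bss)

int main(void) {
    int local;                              // lives on the stack
    int *dynamic = malloc(sizeof *dynamic); // lives on the heap

    printf("text:   %p (main)\n",          (void *)main);
    printf("data:   %p (initialized)\n",   (void *)&initialized);
    printf("bss:    %p (uninitialized)\n", (void *)&uninitialized);
    printf("heap:   %p (malloc'd)\n",      (void *)dynamic);
    printf("stack:  %p (local)\n",         (void *)&local);

    free(dynamic);
    return 0;
}
```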
We can see that if the most significant byte is at the lowest address (big-endian, which MIPS traditionally is), then, for example, an unsigned 32-bit integer array holding { 128, 256 } in the stack is laid out byte-by-byte as:

```
[ 00 00 00 80 = 128,  00 00 01 00 = 256 ]
  low  ->  high        low  ->  high
```
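A minimal sketch to check which byte order your own machine uses (traditional MIPS is big-endian, while most desktop CPUs today are little-endian):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 256;                        // 0x00000100
    unsigned char *bytes = (unsigned char *)&word;

    for (int i = 0; i < 4; i++)                 // print from low address to high
        printf("%02x ", bytes[i]);

    // Big-endian prints 00 00 01 00; little-endian prints 00 01 00 00.
    printf("(%s-endian)\n", bytes[1] == 0x01 ? "little" : "big");
    return 0;
}
```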
For example, let’s translate the following C code:
int f, g, h, i, j;
// Some initialization
f = (g + h) - (i + j); // we focus on this line!
Into MIPS instructions:
# $s0 = f, $s1 = g, $s2 = h, $s3 = i, $s4 = j.
add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1
Immediate operands are literal constants. For example:
addi $s3, $s3, 4 # s3 += 4;
addi $s2, $s1, -1 # No subi, so s2 = s1 - 1;
Most of the time, our immediate operands are small constants, so we can shrink the constant field in our instruction from a full 32 bits to something smaller.
In particular, 0 has its own register that always holds its value: the `$zero` register (e.g., `add $t2, $s1, $zero` copies `$s1` into `$t2`). This proves to make the ISA simpler while sacrificing 1 register slot.
Conditional operations make up the `if` statements in C. Here are some examples of conditional operators:

- `beq rs, rt, L1` is equivalent to `if(rs == rt) goto L1;`
- `bne rs, rt, L1` is equivalent to `if(rs != rt) goto L1;`
- `j L1` is equivalent to `goto L1;`
- `slt rd, rs, rt` is equivalent to `if(rs < rt) rd = 1; else rd = 0;` Use `sltu` if you’re comparing unsigned numbers.
- `slti rt, rs, constant` is equivalent to `if(rs < constant) rt = 1; else rt = 0;` Use `sltiu` if you’re using unsigned numbers.

As expected, conditional operations will change the program counter register, and therefore it is hard to optimize a block of code that has jump statements.
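To make the mapping concrete, here is a sketch of how a compiler might lower an `if/else` into the branch-and-label style above, written in C with the corresponding MIPS in comments; the register assignments are assumptions for illustration only:

```c
// Assume $s0 = f, $s1 = g, $s2 = h, $s3 = i, $s4 = j.
void lower(int g, int h, int i, int j) {
    int f;
    // Original: if (i == j) f = g; else f = h;
    if (i != j) goto Else;   // bne $s3, $s4, Else
    f = g;                   // add $s0, $s1, $zero
    goto Exit;               // j   Exit
Else:
    f = h;                   // add $s0, $s2, $zero
Exit:
    (void)f;                 // suppress unused-variable warnings
}
```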
A basic block is a sequence of instructions with no branches in or out (except at its entry and exit), and the compiler can easily optimize within these blocks.
In memory, R-format looks like:
[op (6 bits),
rs (5 bits),
rt (5 bits),
rd (5 bits),
shamt (5 bits),
funct (6 bits)]
- The op code (`op`) is one that indicates what format of an instruction we’re running in. (For example, all R-format instructions will have the same op code.)
- `rs, rt, rd` are for source, second source, and destination register numbers respectively.
- `shamt` is how many shifts to perform. This is used for shift instructions only.
- `funct` describes what function to perform (for example, `add`).

For example:
add $t0, $s1, $s2
is translated into:
[0, # add is R-format, and op code for R is 0
17, # s1 is mapped to 17 as register number
18, # s2 is mapped to 18
8, # t0 is mapped to 8
0, # no shift is necessary; this isn't a shift command
32] # 32 = function id for add
= 0x02324020
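As a sanity check, here is a small sketch (the helper name `encode_r` is my own) that packs the six fields into a word with shifts and ORs:

```c
#include <stdint.h>
#include <stdio.h>

// Pack the R-format fields into one 32-bit word.
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    // add $t0, $s1, $s2  ->  op=0, rs=17, rt=18, rd=8, shamt=0, funct=32
    printf("0x%08x\n", encode_r(0, 17, 18, 8, 0, 32)); // prints 0x02324020
    return 0;
}
```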
I stands for immediate, as in immediate operands.
In memory, I-format looks like:
[op (6 bits),
rs (5 bits),
rt (5 bits),
constant/addr (16 bits)]
Same idea as R-format, except the last field is `constant/addr`, holding 16 bits. But 16 bits isn’t enough to hold all 32 bits of a word. We will learn the details of how to represent 32-bit constants later, but tldr: it requires 2 instructions.
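Following the same pattern as `encode_r` above, a hypothetical `encode_i` helper just masks the immediate down to its 16-bit field, which is exactly why a full 32-bit constant can’t fit in one instruction:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper, analogous to encode_r: pack the I-format fields.
uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, int32_t imm) {
    return (op << 26) | (rs << 21) | (rt << 16) | ((uint32_t)imm & 0xFFFFu);
}

int main(void) {
    // addi $s3, $s3, 4  ->  op=8, rs=19, rt=19, imm=4
    printf("0x%08x\n", encode_i(8, 19, 19, 4)); // prints 0x22730004
    return 0;
}
```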