Question

Posted on Aug 26, 2024

Question 1: You are tasked with designing a new processor microarchitecture, and you are trying to figure out how best to allocate your hardware resources. Which of the hardware and software techniques you learned in Chapter 2 should you apply? You have a list of latencies for the functional units and for memory, as well as some representative code. Your boss has been somewhat vague about the performance requirements of your new design, but you know from experience that, all else being equal, faster is usually better. Start with the basics. The table provides a sequence of instructions and a list of latencies.

a) What would be the baseline performance (in cycles, per loop iteration) of the code sequence in the table, if no new instruction's execution could be initiated until the previous instruction's execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall for lack of the next instruction, but only one instruction per cycle can be issued. Assume the branch is taken, and that there is a one-cycle branch delay slot.

b) Think about what latency numbers really mean: they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a "producer" followed by a "consumer") will execute correctly. But not all instruction pairs have a producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in the table require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the code with stalls inserted where necessary to accommodate the stated latencies. (Hint: An instruction with latency "+2" needs 2 stall cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction has latency 1+0, meaning zero extra wait states. So latency 1+1 implies 1 stall cycle; latency 1+N has N extra stall cycles.)

c) Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require?

d) Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require?

e) In the multiple-issue design of (d), you may have recognized some subtle issues. Even though the two pipelines have the exact same instruction repertoire, they are neither identical nor interchangeable, because there is an implicit ordering between them that must reflect the ordering of the instructions in the original program. If instruction N+1 begins execution in Execution Pipe 1 at the same time that instruction N begins in Pipe 0, and N+1 happens to require a shorter execution latency than N, then N+1 will complete before N (even though program ordering would have implied otherwise). Recite at least two reasons why that could be hazardous and will require special considerations in the microarchitecture. Give an example of two instructions from the code in the table that demonstrate this hazard.

f) Reorder the instructions to improve performance of the code in the table. Assume the two-pipe machine in (c), and that the out-of-order completion issues of (d) have been dealt with successfully. Just worry about observing true data dependencies and functional unit latencies for now. How many cycles does your reordered code take?
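
The counting rules in parts (a) to (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table from the question is not reproduced in this post, so the four-instruction `PROGRAM` and its latencies below are invented, and the names `Instr`, `baseline_cycles`, and `issue_cycles` are hypothetical. The sketch models in-order issue with stalls on true data dependencies only; it ignores the branch delay slot, structural hazards, and the out-of-order completion issues raised in part (e).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str     # assembly text, for display only
    dest: str     # destination register ("" if none, e.g. a store or branch)
    srcs: tuple   # source registers read by the instruction
    extra: int    # the N in "latency 1+N"


# Invented example program and latencies (NOT the table from the question).
PROGRAM = [
    Instr("LD   F2, 0(R1)",  "F2", ("R1",),      3),
    Instr("MULD F4, F2, F0", "F4", ("F2", "F0"), 4),
    Instr("ADDI R1, R1, 8",  "R1", ("R1",),      0),
    Instr("ADDD F6, F4, F2", "F6", ("F4", "F2"), 1),
]


def baseline_cycles(program):
    """Part (a) style: each instruction waits for the previous one to finish,
    so every instruction costs its full 1+N cycles."""
    return sum(1 + ins.extra for ins in program)


def issue_cycles(program, width=1):
    """Parts (b)/(c) style: in-order issue, up to `width` instructions per
    cycle, stalling only until the source operands are available.
    Returns the 1-based issue cycle of each instruction."""
    ready = {}                 # register -> first cycle a consumer may issue
    issued = []
    prev_cycle, used = 0, 0    # last issue cycle and slots used in it
    for ins in program:
        # Earliest cycle allowed by true data dependencies.
        earliest = max([ready.get(r, 1) for r in ins.srcs], default=1)
        # In-order issue: never earlier than the previous instruction.
        cycle = max(earliest, prev_cycle, 1)
        if cycle == prev_cycle and used == width:
            cycle += 1         # all issue slots of that cycle are taken
        used = used + 1 if cycle == prev_cycle else 1
        prev_cycle = cycle
        if ins.dest:
            # A 1+N producer issued in cycle c feeds a consumer in c+1+N.
            ready[ins.dest] = cycle + 1 + ins.extra
        issued.append(cycle)
    return issued


print(baseline_cycles(PROGRAM))        # 12
print(issue_cycles(PROGRAM, width=1))  # [1, 5, 6, 10]
print(issue_cycles(PROGRAM, width=2))  # [1, 5, 5, 10]
```

For the single-issue case, the gaps between consecutive issue cycles are the stall cycles to show in the answer: here the `LD` with latency 1+3 forces three stalls before the dependent `MULD`, while the independent `ADDI` fills one of the four wait cycles that the `MULD` would otherwise impose on the `ADDD`. To apply the same counting to the real exercise, the actual instruction sequence and latencies from the table would have to be substituted for `PROGRAM`.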