Question
Special-Purpose Superscalar

Assumptions: i) single-cycle pipelining, ii) 5-cycle instruction latency, and iii) five pipeline stages.

We build a four-lane RISC 1.0 pipeline by laying down four one-lane 'fdxmw' pipelines parallel to each other. This gives a concurrency of 5 in time and a concurrency of 4 in space.

a) [5 marks] {Pi} is a restricted class of programs. Every member Pj of the group can be decomposed into four independent (million-instruction) threads (these threads do not communicate or synchronize with one another). Assuming no stalls, what is the speedup of the four-lane 'fdxmw' pipeline over the one-lane, unpipelined 'fdxmw' datapath? Show work.

b) [5 marks] By what factor must fetch bandwidth be increased for the four-lane machine? By what factor must result bandwidth be increased assuming each result is written to memory? Explain in a few words.

c) [5 marks] Of course, there are _intrathread_ stalls. Executing program P4 shows the following average number of stalls per instruction in each lane: <0.15, 0.20, 0.10, 0.25>. What is this more realistic speedup on P4? Show work.

d) [5 marks] {Qi} is a restricted class of programs. Every member Qj of the group can be decomposed into four _dependent_ (million-instruction) threads (these threads do communicate and/or synchronize with one another). Unlike cores, boxes can not be turned on and off. How does the power efficiency of running one of the Qj compare to running one of the Pj? Explain.
Step by Step Solution
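The arithmetic behind parts a) and c) can be sketched in a few lines of Python. This is one reading of the question, not an official solution, and it rests on assumptions that are not stated in the original: the unpipelined datapath spends 5 cycles on every instruction and runs all four threads back to back; each pipelined lane retires one instruction per cycle in steady state (the 4-cycle pipeline fill is ignored); a stall adds one cycle; and the program finishes only when its slowest lane finishes. All function and constant names are made up for illustration.

```python
from fractions import Fraction

# Values taken from the question.
PIPELINE_STAGES = 5                  # 'fdxmw' has five stages, 5-cycle latency
LANES = 4                            # four one-lane pipelines side by side
INSTRUCTIONS_PER_THREAD = 1_000_000  # each thread is a million instructions
P4_STALLS = [Fraction(15, 100), Fraction(20, 100),
             Fraction(10, 100), Fraction(25, 100)]


def unpipelined_cycles(total_instructions):
    """Assumption: the unpipelined datapath needs 5 cycles per instruction."""
    return Fraction(PIPELINE_STAGES * total_instructions)


def lane_cycles(instructions, stalls_per_instruction):
    """Assumption: steady-state CPI is 1 plus the average stalls per
    instruction; the pipeline fill time is ignored."""
    return instructions * (1 + stalls_per_instruction)


def speedup(stalls_per_lane):
    """Speedup of the four-lane pipeline over the one-lane unpipelined
    datapath, assuming the slowest lane sets the finish time."""
    baseline = unpipelined_cycles(LANES * INSTRUCTIONS_PER_THREAD)
    slowest_lane = max(lane_cycles(INSTRUCTIONS_PER_THREAD, s)
                       for s in stalls_per_lane)
    return baseline / slowest_lane


ideal = speedup([Fraction(0)] * LANES)   # part a): no stalls
realistic = speedup(P4_STALLS)           # part c): measured stalls on P4
```

Under these assumptions `ideal` is 20 (5 from pipelining times 4 from the lanes) and `realistic` is 16, because the lane with 0.25 stalls per instruction needs 1.25 million cycles while the baseline needs 20 million. If the intended model instead averages throughput across lanes rather than waiting for the slowest one, the figure for part c) would differ. For part b), the same steady-state picture suggests that fetch and result bandwidth each grow by a factor of 4 relative to a one-lane pipeline, since four instructions enter and four results leave per cycle.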