Question:

In this exercise, you will explore performance trade-offs between three processors that each employ different types of multithreading (MT). Each of these processors is superscalar, uses in-order pipelines, requires a fixed three-cycle stall following all loads and branches, and has identical L1 caches. Instructions from the same thread issued in the same cycle are read in program order and must not contain any data or control dependences.
■ Processor A is a superscalar simultaneous MT architecture, capable of issuing up to two instructions per cycle from two threads.
■ Processor B is a fine-grained MT architecture, capable of issuing up to four instructions per cycle from a single thread and switches threads on any pipeline stall.

■ Processor C is a coarse-grained MT architecture, capable of issuing up to eight instructions per cycle from a single thread and switches threads on an L1 cache miss.
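The three configurations above can be written down as data, which makes the differences in issue width and thread-switch policy easier to compare. This is only an illustrative sketch: the `Processor` and `Event` names are hypothetical, and treating an L1 miss as a pipeline stall on Processor B is an assumption, not something the exercise states.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    FIXED_STALL = "three-cycle stall after a load or branch"
    L1_MISS = "L1 cache miss"


@dataclass(frozen=True)
class Processor:
    name: str
    issue_width: int        # maximum instructions issued per cycle
    threads_per_cycle: int  # threads that may issue in the same cycle
    switch_events: frozenset  # events that trigger a thread switch

    def switches_on(self, event):
        return event in self.switch_events


PROCESSORS = [
    # SMT issues from two threads at once, so no switch events are modelled.
    Processor("A (simultaneous MT)", 2, 2, frozenset()),
    # Assumption: an L1 miss also counts as a pipeline stall on B.
    Processor("B (fine-grained MT)", 4, 1,
              frozenset({Event.FIXED_STALL, Event.L1_MISS})),
    Processor("C (coarse-grained MT)", 8, 1, frozenset({Event.L1_MISS})),
]

for p in PROCESSORS:
    print(p.name, p.issue_width, p.switches_on(Event.FIXED_STALL))
```

Under this model, only Processor B changes threads in response to the fixed load and branch stalls, while Processor C changes threads only on an L1 miss.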
Our application is a list searcher, which scans a region of memory, bounded by the addresses in x16 and x17, for a specific value stored in x9. It is parallelized by dividing the search space into four equal-sized contiguous blocks and assigning one search thread to each block (yielding four threads). Most of each thread’s runtime is spent in the unrolled loop body listed at the end of this question.


Assume the following:
■ A barrier is used to ensure that all threads begin simultaneously.
■ The first L1 cache miss occurs after two iterations of the loop.
■ None of the beq branches is taken.
■ The blt branch is always taken.
■ All three processors schedule threads in a round-robin fashion.
Determine how many cycles are required for each processor to complete the first two iterations of the loop.
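A useful first step before counting cycles is to tally the instruction mix of one iteration, since every load and branch is followed by the fixed three-cycle stall. The sketch below does only that tally; it is not a cycle-accurate model of any of the three processors. The `classify` and `instruction_mix` helpers are hypothetical names, and `LOOP_BODY` is the loop from the question with the scan artifacts cleaned up.

```python
from collections import Counter

LOOP_BODY = """
loop: lw x1,0(x16)
lw x2,8(x16)
lw x3,16(x16)
lw x4,24(x16)
lw x5,32(x16)
lw x6,40(x16)
lw x7,48(x16)
lw x8,56(x16)
beq x9,x1,match0
beq x9,x2,match1
beq x9,x3,match2
beq x9,x4,match3
beq x9,x5,match4
beq x9,x6,match5
beq x9,x7,match6
beq x9,x8,match7
DADDIU x16,x16,#64
blt x16,x17,loop
"""


def classify(mnemonic):
    """Map a mnemonic to the categories the exercise cares about."""
    m = mnemonic.lower()
    if m == "lw":
        return "load"
    if m in ("beq", "blt"):
        return "branch"
    return "alu"


def instruction_mix(asm):
    mix = Counter()
    for line in asm.strip().splitlines():
        text = line.split(":")[-1].strip()  # drop a leading label, if any
        if text:
            mix[classify(text.split()[0])] += 1
    return mix


mix = instruction_mix(LOOP_BODY)
per_iteration = sum(mix.values())
stalling = mix["load"] + mix["branch"]  # each is followed by a 3-cycle stall

print(mix["load"], mix["branch"], mix["alu"])  # 8 9 1
print(per_iteration, stalling)                 # 18 17
print(2 * per_iteration)                       # 36 instructions in two iterations
```

Each iteration therefore contains 18 instructions, 17 of which trigger the fixed stall, so the stall behavior rather than the issue width is likely to dominate the cycle count.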

loop: lw     x1,0(x16)
      lw     x2,8(x16)
      lw     x3,16(x16)
      lw     x4,24(x16)
      lw     x5,32(x16)
      lw     x6,40(x16)
      lw     x7,48(x16)
      lw     x8,56(x16)
      beq    x9,x1,match0
      beq    x9,x2,match1
      beq    x9,x3,match2
      beq    x9,x4,match3
      beq    x9,x5,match4
      beq    x9,x6,match5
      beq    x9,x7,match6
      beq    x9,x8,match7
      DADDIU x16,x16,#64
      blt    x16,x17,loop
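For readers less familiar with the assembly, here is a rough functional model of what the list searcher computes. It is a sketch under stated assumptions: memory is modelled as a Python list indexed by word rather than by byte address, the region length is assumed to divide evenly among the threads, the four blocks are searched one after another rather than by real threads, and `split_into_blocks` and `search_block` are hypothetical names.

```python
def split_into_blocks(start, end, n_threads=4):
    """Divide [start, end) into n_threads equal contiguous blocks."""
    size = (end - start) // n_threads  # assumes an even division
    return [(start + i * size, start + (i + 1) * size)
            for i in range(n_threads)]


def search_block(memory, start, end, target, unroll=8):
    """Scan one block, comparing up to `unroll` words per loop iteration."""
    i = start
    while i < end:                              # corresponds to the blt
        chunk = memory[i:min(i + unroll, end)]  # the eight loads
        for offset, value in enumerate(chunk):  # the eight beq comparisons
            if value == target:
                return i + offset
        i += unroll                             # the pointer increment
    return None


memory = list(range(64))
blocks = split_into_blocks(0, len(memory))
results = [search_block(memory, lo, hi, target=37) for lo, hi in blocks]
print(blocks)
print(results)
```

In this toy run the value 37 lies in the third block, so only the third search reports a hit.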
