Question:

This part of our case study will focus on the amount of instruction-level parallelism available to the run time hardware scheduler under the most favorable execution scenarios (the ideal case). (Later, we will consider less ideal scenarios for the run time hardware scheduler as well as the amount of parallelism available to a compiler scheduler.) For the ideal scenario, assume that the hash table is initially empty. Suppose there are 1024 new data elements, whose values are numbered sequentially from 0 to 1023, so that each goes in its own bucket (this reduces the problem to a matter of updating known array locations!). Figure 3.15 shows the hash table contents after the first three elements have been inserted, according to this "ideal case." Since the value of element[i] is simply i in this ideal case, each element is inserted into its own bucket.
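To make the ideal case concrete, here is a minimal sketch of why sequential values land in separate buckets. It is not the code of Figure 3.14: the table size of 1024, the low-bits hash function, and the sorted per-bucket chains are all assumptions made for illustration, and `insert_all` is a hypothetical helper.

```python
NUM_BUCKETS = 1024  # assumed table size (one bucket per ideal-case element)


def hash_index(value: int) -> int:
    """Assumed hash function: keep the low 10 bits of the value."""
    return value & (NUM_BUCKETS - 1)


def insert_all(values):
    """Insert values into sorted per-bucket chains and count chain-walk steps."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    walk_steps = 0
    for v in values:
        chain = buckets[hash_index(v)]
        pos = 0
        # Walk the chain until the first entry greater than v (sorted insert).
        while pos < len(chain) and chain[pos] <= v:
            pos += 1
            walk_steps += 1
        chain.insert(pos, v)
    return buckets, walk_steps


buckets, walk_steps = insert_all(range(1024))
print(max(len(c) for c in buckets), walk_steps)  # 1 0
```

Under these assumptions no chain is ever walked, so each insertion touches only its own bucket, which is what lets consecutive insertions proceed independently.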
For the purposes of this case study, assume that each line of code in Figure 3.14 takes one execution cycle (its dependence height is 1) and, for the purposes of computing ILP, takes one instruction. These (unrealistic) assumptions are made to greatly simplify bookkeeping in solving the following exercises. Note that the for and while statements execute on each iteration of their respective loops, to test whether the loop should continue. In this ideal case, most of the dependences in the code sequence are relaxed and a high degree of ILP is therefore readily available. We will later examine a general case, in which the realistic dependences in the code segment reduce the amount of parallelism available.
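With one cycle per line, the bookkeeping reduces to placing each instruction one cycle after its latest producer and dividing the instruction count by the length of the longest chain. The sketch below shows that calculation on a toy graph. The graph, the node names, and the helpers `asap_schedule` and `average_ilp` are hypothetical; the graph is not the dependence graph of Figure 3.14, and unlimited issue width is assumed.

```python
def asap_schedule(deps):
    """Return the earliest cycle (starting at 1) for each instruction.

    deps maps an instruction name to the instructions whose results it needs.
    Every instruction takes one cycle and issue width is unlimited.
    """
    cycle = {}

    def earliest(node):
        if node not in cycle:
            preds = deps.get(node, [])
            # An instruction runs one cycle after its latest producer.
            cycle[node] = 1 + max((earliest(p) for p in preds), default=0)
        return cycle[node]

    for node in deps:
        earliest(node)
    return cycle


def average_ilp(deps):
    """Return (instructions / cycles, cycles) for the dataflow-limited schedule."""
    cycle = asap_schedule(deps)
    total_cycles = max(cycle.values())
    return len(cycle) / total_cycles, total_cycles


# Toy example: two iterations of a loop whose counter (i1 -> i2) is the only
# loop-carried dependence; a and b form a short chain inside each iteration.
deps = {
    "i1": [], "a1": ["i1"], "b1": ["a1"],
    "i2": ["i1"], "a2": ["i2"], "b2": ["a2"],
}
print(average_ilp(deps))  # (1.5, 4)
```

Six instructions finish in four cycles here because the second iteration can start as soon as its counter value is available, one cycle after the first.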
Further suppose that the code is executed on an "ideal" processor with infinite issue width, unlimited renaming, "omniscient" knowledge of memory access disambiguation, branch prediction, and so on, so that the execution of instructions is limited only by data dependence. Consider the following in this context:
a. Describe the data (true, anti, and output) and control dependences that govern the parallelism of this code segment, as seen by a run time hardware scheduler. Indicate only the actual dependences (i.e., ignore dependences between stores and loads that access different addresses, even if a compiler or processor would not realistically determine this). Draw the dynamic dependence graph for six consecutive iterations of the outer loop (for insertion of six elements), under the ideal case. In this dynamic dependence graph, we are identifying data dependences between dynamic instances of instructions: each static instruction in the original program has multiple dynamic instances due to loop execution. The following definitions may help you find the dependences related to each instruction:
• Data true dependence: On the results of which previous instructions does each instruction immediately depend?
• Data anti dependence: Which instructions subsequently write locations read by the instruction?
• Data output dependence: Which instructions subsequently write locations written by the instruction?
• Control dependence: On what previous decisions does the execution of a particular instruction depend (in what case will it be reached)?
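The three data dependence definitions can be sketched as set intersections over what each instruction reads and writes. This is a simplified illustration only: the `Instr` record, the `classify` helper, and the three example statements are hypothetical, and the pairwise check does not account for intervening writers, so it can report a dependence that is not the immediate one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    reads: frozenset
    writes: frozenset


def classify(earlier: Instr, later: Instr):
    """Return the dependence kinds from `earlier` to `later` in program order."""
    kinds = []
    if earlier.writes & later.reads:
        kinds.append("true")    # later reads what earlier wrote
    if earlier.reads & later.writes:
        kinds.append("anti")    # later overwrites what earlier read
    if earlier.writes & later.writes:
        kinds.append("output")  # both write the same location
    return kinds


# Hypothetical dynamic instances, not the statements of Figure 3.14.
s1 = Instr("hash_index = i & mask", frozenset({"i"}), frozenset({"hash_index"}))
s2 = Instr("i = i + 1", frozenset({"i"}), frozenset({"i"}))
s3 = Instr("hash_index = i & mask", frozenset({"i"}), frozenset({"hash_index"}))

print(classify(s1, s2))  # ['anti']
print(classify(s2, s3))  # ['true']
print(classify(s1, s3))  # ['output']
```

With register renaming, only the "true" edges would remain as constraints on the schedule; the anti and output edges arise from reusing a name, not from a value flowing between instructions.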
b. Assuming the ideal case just described, and using the dynamic dependence graph you just constructed, how many instructions are executed, and in how many cycles?
c. What is the average level of ILP available during the execution of the for loop?
d. In part (c) we considered the maximum parallelism achievable by a run-time hardware scheduler using the code as written. How could a compiler increase the available parallelism, assuming that the compiler knows that it is dealing with the ideal case? Think about the primary constraint that prevents executing more iterations at once in the ideal case. How can the loop be restructured to relax that constraint?
e. For simplicity, assume that only variables i, hash_index, ptrCurr, and ptrUpdate need to occupy registers. Assuming general renaming, how many registers are necessary to achieve the maximum achievable parallelism in part (b)?
f. Assume that in your answer to part (a) there are 7 instructions in each iteration. Now, assuming a consistent steady-state schedule of the instructions in the example and an issue rate of 3 instructions per cycle, how is execution time affected?
g. Finally, calculate the minimal instruction window size needed to achieve the maximal level of parallelism.
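For the issue-rate question in part (f), one common way to reason about a steady-state schedule is to take the larger of two lower bounds: the issue limit (instructions per iteration divided by issue width) and the loop-carried recurrence limit. The sketch below encodes that reasoning. The function names are hypothetical, and the default `recurrence_cycles=1` (one new iteration per cycle when issue is unlimited) is an assumption, not something stated in the exercise text.

```python
from fractions import Fraction


def steady_state_cycles_per_iteration(instrs_per_iter, issue_width, recurrence_cycles=1):
    """Lower bound on cycles per iteration in steady state.

    The bound is the larger of the issue limit and the recurrence limit.
    """
    issue_bound = Fraction(instrs_per_iter, issue_width)
    return max(issue_bound, Fraction(recurrence_cycles))


def slowdown(instrs_per_iter, issue_width, recurrence_cycles=1):
    """Execution-time ratio of the issue-limited machine to the unlimited one."""
    limited = steady_state_cycles_per_iteration(
        instrs_per_iter, issue_width, recurrence_cycles
    )
    return limited / Fraction(recurrence_cycles)


print(steady_state_cycles_per_iteration(7, 3))  # 7/3
print(slowdown(7, 3))  # 7/3
```

Exact fractions are used so that a bound such as 7/3 is not rounded; an actual schedule still needs a whole number of cycles for any finite number of iterations.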

Step by Step Solution

a. Figure L20 shows the dependence graph for the C code in Figure 3.14. Each node in Figure L20 corresponds to one line of C code in Figure 3.14. Each node 6 in Figure L20 starts an iteration of the for ...

