
Question


1. [100 pts] In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y. Initially, f2 holds the constant a, x1 is set to the base address of array X, and x2 is set to the base address of array Y:

The table below shows the number of intervening clock cycles needed to avoid a stall.

a. [20 pts] Show all the data, anti, and output dependences. Note: dependent instructions may not be next to each other.

b. [40 pts] Assume a single-issue pipeline. Show how the loop would look both unscheduled by the compiler and after compiler scheduling, including any stalls or idle clock cycles. What is the execution time (in cycles) per element of the result vector Y, unscheduled and scheduled?

c. [40 pts] Assume a single-issue pipeline. Unroll the loop by a factor of 3 to schedule it without any stalls, collapsing the loop overhead instructions. Show the unrolled and scheduled instruction sequence. What is the execution time per element of the result? You can assume that the number of iterations is always a multiple of the unrolled loop body.
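For orientation, here is a minimal Python sketch of the computation the question describes. It is not the assembly listing from the exercise and it does not model pipeline stalls. It only shows what DAXPY computes and what unrolling by a factor of 3 looks like at the source level. The function names `daxpy` and `daxpy_unrolled3` are hypothetical, chosen for illustration.

```python
def daxpy(a, x, y):
    """Reference DAXPY: update y in place so that y[i] = a * x[i] + y[i]."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_unrolled3(a, x, y):
    """Same computation with the loop body unrolled by a factor of 3.

    Assumes len(x) is a multiple of 3, as part (c) of the exercise permits.
    """
    n = len(x)
    if n % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    for i in range(0, n, 3):
        # One index update and one loop test now serve three elements,
        # which is the "collapsed loop overhead" of part (c).
        # Group the loads first, then the multiplies, then the adds and stores,
        # so that each operation is separated from the one that depends on it.
        x0, x1, x2 = x[i], x[i + 1], x[i + 2]
        y0, y1, y2 = y[i], y[i + 1], y[i + 2]
        p0, p1, p2 = a * x0, a * x1, a * x2
        y[i], y[i + 1], y[i + 2] = p0 + y0, p1 + y1, p2 + y2
    return y
```

The three element updates inside the unrolled body are independent of one another. That independence is what lets a scheduler interleave them to hide load and floating-point latencies. The actual cycle counts depend on the latency table from the exercise, which is not reproduced here.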

