Question

Assume you have the following code:

void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    int length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}

and you modify the code to use 4-way loop unrolling and four parallel accumulators. Measurements for this function on the x86-64 architecture show that it achieves a CPE of 2.0 for all types of data.

Assuming the model of the Intel i7 architecture shown in class (one branch unit, two arithmetic units, one load unit, and one store unit), the performance of this loop with any arithmetic operation cannot get below a CPE of 2.0 because of ______.

When the same 4 x 4 code is compiled for the IA32 architecture, it achieves a CPE of 2.75, worse than the CPE of 2.25 achieved with just four-way unrolling. The most likely reason for this is ______.
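
For reference, the modification the question describes (4-way unrolling with four parallel accumulators) might look roughly like the sketch below. This is only an illustration, not the course's actual code: the name inner4_4x4, the vec_rec struct, the choice of long for data_t, and the bodies of vec_length and get_vec_start are all assumptions made so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the course's vector abstraction. */
typedef long data_t;

typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

static long vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Inner product with 4-way unrolling and four parallel accumulators.
   The name inner4_4x4 is made up for this sketch. */
void inner4_4x4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    long limit = length - 3;   /* last index at which a full group of 4 fits */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    assert(vec_length(v) == length);

    /* Main loop: four elements per iteration, each feeding its own
       accumulator so the four dependency chains can overlap. */
    for (i = 0; i < limit; i += 4) {
        sum0 += udata[i]     * vdata[i];
        sum1 += udata[i + 1] * vdata[i + 1];
        sum2 += udata[i + 2] * vdata[i + 2];
        sum3 += udata[i + 3] * vdata[i + 3];
    }

    /* Cleanup loop: the remaining 0 to 3 elements. */
    for (; i < length; i++) {
        sum0 += udata[i] * vdata[i];
    }

    *dest = (sum0 + sum1) + (sum2 + sum3);
}
```

Two things are worth noticing when reading the sketch against the question. Every element still needs two loads, one from udata and one from vdata, no matter how far the loop is unrolled. The loop also keeps quite a few values live at once (two pointers, the index, the limit, and four accumulators), which matters on a machine with few registers. With a floating-point data_t, splitting the sum across accumulators can also change rounding slightly.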