Question
Assume the following loop is to be executed on a 4-unit VLIW processor that can execute an instruction on any execution unit. Show how a compiler would schedule the original loop and a version unrolled 4 times. Assume the processor has as many architectural registers as required, a latency of 3 cycles for LD operations, and a latency of 2 cycles for DIVs and ADDs. Assume the branch delay of the processor is long enough that all operations in one iteration complete before the next iteration starts. As in other VLIW problems, assume that the compiler examines all possible operation orderings to find one that fits into the fewest instructions. Compare how much faster the unrolled loop is than the original loop:
loop:
LD r1, (r2)
LD r3, (r4)
LD r5, (r6)
ADD r1, r1, r3
ADD r1, r1, r5
DIV r1, r1, r7
ST (r0), r1
ADD r2, #4, r2
ADD r4, #4, r4
ADD r6, #4, r6
ADD r0, #4, r0
BR loop
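A useful starting point for the schedule is a pair of lower bounds on the instruction count of the original loop: one from the longest dependence chain and one from the issue width. The Python sketch below computes both under assumptions that are not in the question: a result produced by an operation with latency n issued in cycle c is usable in cycle c + n, ST and BR have a latency of 1, and only true (read-after-write) dependences are modeled. The names `LATENCY`, `LOOP`, `earliest_issue`, and `lower_bounds` are hypothetical and chosen for illustration. This is not a full scheduler and does not produce the final answer.

```python
from math import ceil

# Latencies from the problem statement. ST and BR are not specified there,
# so a latency of 1 is ASSUMED for both.
LATENCY = {"LD": 3, "ADD": 2, "DIV": 2, "ST": 1, "BR": 1}

# One loop iteration: name -> (opcode, names of the ops whose results it reads).
# Only true (read-after-write) dependences are modeled (an assumption).
LOOP = {
    "ld_r1":  ("LD",  []),
    "ld_r3":  ("LD",  []),
    "ld_r5":  ("LD",  []),
    "add_a":  ("ADD", ["ld_r1", "ld_r3"]),   # ADD r1, r1, r3
    "add_b":  ("ADD", ["add_a", "ld_r5"]),   # ADD r1, r1, r5
    "div":    ("DIV", ["add_b"]),            # DIV r1, r1, r7
    "st":     ("ST",  ["div"]),              # ST (r0), r1
    "inc_r2": ("ADD", []),
    "inc_r4": ("ADD", []),
    "inc_r6": ("ADD", []),
    "inc_r0": ("ADD", []),
    "br":     ("BR",  []),
}


def earliest_issue(ops):
    """Earliest issue cycle (1-based) of each op, ignoring the issue width."""
    cycle = {}
    for name, (_, deps) in ops.items():  # dicts preserve program order
        cycle[name] = max(
            (cycle[d] + LATENCY[ops[d][0]] for d in deps), default=1
        )
    return cycle


def lower_bounds(ops, width=4):
    """Return (dependence-chain bound, issue-width bound) on instruction count."""
    latency_bound = max(earliest_issue(ops).values())
    resource_bound = ceil(len(ops) / width)
    return latency_bound, resource_bound


if __name__ == "__main__":
    print(earliest_issue(LOOP)["st"])  # 10
    print(lower_bounds(LOOP))          # (10, 3)
```

Under these assumptions the store cannot issue before cycle 10 (3 + 2 + 2 + 2 cycles of latency ahead of it), while 12 operations on 4 units need only 3 instructions, so the original loop is limited by latency rather than by issue width. Unrolling helps because the independent chains of different iterations can fill the otherwise empty slots.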