Question:

We saw that our measurements of the prefix-sum function psum1 (Figure 5.1) yield a CPE of 9.00 on a machine where the basic operation to be performed, floating-point addition, has a latency of just 3 clock cycles. Let us try to understand why our function performs so poorly.

The following is the assembly code for the inner loop of the function:

Inner loop of psum1
a in %rdi, i in %rax, cnt in %rdx
.L5:
  vmovss  -4(%rsi,%rax,4), %xmm0

Perform an analysis similar to those shown for combine3 (Figure 5.14) and for write_read (Figure 5.36) to diagram the data dependencies created by this loop, and hence the critical path that forms as the computation proceeds. Explain why the CPE is so high.

Figure 5.1

/* Compute prefix sum of vector a */
void psum1

Figure 5.14

[Data-flow graphs for the inner loop of combine3, panels (a) and (b): load, mul, add, cmp, and jne operations on %rax, %rdx, and %xmm0]

Figure 5.36

[Data-flow graphs for the inner loop of write_read, panels (a) and (b): s_addr, s_data, load, add, sub, and jne operations on %rax, %rdi, %rsi, and %rdx]


Inner loop of psum1
a in %rdi, i in %rax, cnt in %rdx
.L5:                                      loop:
  vmovss  -4(%rsi,%rax,4), %xmm0            Get p[i-1]
  vaddss  (%rdi,%rax,4), %xmm0, %xmm0       Add a[i]
  vmovss  %xmm0, (%rsi,%rax,4)              Store at p[i]
  addq    $1, %rax                          Increment i
  cmpq    %rdx, %rax                        Compare i:cnt
  jne     .L5                               If !=, goto loop

Step by Step Solution


We can see that this function has a write/read dependency ...
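A minimal C sketch may help make this dependency concrete. The Figure 5.1 listing is cut off in this copy, so psum1 below is a reconstruction assumed from the loop comments in the assembly (get p[i-1], add a[i], store at p[i]). The second function, psum1_reg, is a hypothetical variant introduced only for contrast and is not part of the original question. In psum1, each iteration loads the value p[i-1] that the previous iteration has just stored, so a store, a load, and a floating-point addition are chained from one iteration to the next. In psum1_reg, the running sum is carried in a local variable, so the store to p[i] no longer feeds the next iteration's addition.

```c
/* Assumed reconstruction of psum1: the source listing is truncated,
   so this follows the comments on the assembly loop. */
void psum1(const float a[], float p[], long n)
{
    long i;
    if (n <= 0)
        return;                 /* guard added for this sketch */
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i - 1] + a[i]; /* reads back what the last iteration stored */
}

/* Hypothetical variant (name is illustrative): carry the running sum
   in a local variable instead of rereading it from p[]. */
void psum1_reg(const float a[], float p[], long n)
{
    long i;
    float last_val;
    if (n <= 0)
        return;
    last_val = p[0] = a[0];
    for (i = 1; i < n; i++) {
        float val = last_val + a[i]; /* depends only on the previous sum */
        p[i] = val;                  /* store is off the loop-carried chain */
        last_val = val;
    }
}
```

Both functions perform the same additions in the same order, so they produce identical results. Whether a given compiler keeps last_val in a register, and what CPE the variant reaches, depends on the compiler and the machine, so any timing difference should be measured rather than assumed.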
