
Question:

We saw that our measurements of the prefix-sum function psum1 (Figure 5.1) yield a CPE of 9.00 on a machine where the basic operation to be performed, floating-point addition, has a latency of just 3 clock cycles. Let us try to understand why our function performs so poorly.

The following is the assembly code for the inner loop of the function:

Inner loop of psum1
a in %rdi, i in %rax, cnt in %rdx
.L5:
  vmovss -4(%rsi,%rax,4), %xmm0

Perform an analysis similar to those shown for combine3 (Figure 5.14) and for write_read (Figure 5.36) to diagram the data dependencies created by this loop, and hence the critical path that forms as the computation proceeds. Explain why the CPE is so high.
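As a starting point for that analysis, the following is a minimal sketch in C, not the book's code. It assumes, from the one visible instruction (a load from -4(%rsi,%rax,4)), that %rsi holds the destination array and that each iteration reads the previous result back from memory. The names psum_mem and psum_reg are hypothetical. In psum_mem, the load of p[i-1] must wait for the store made by the previous iteration, so the loop-carried chain plausibly runs store, then load, then add, which would be consistent with a CPE well above the 3-cycle addition latency. psum_reg keeps the running sum in a local variable, so only the addition is carried from one iteration to the next.

```c
/* Hypothetical reconstruction: every iteration reads p[i-1] back from memory,
 * so it depends on the store performed by the previous iteration. */
void psum_mem(const float a[], float p[], long n)
{
    if (n <= 0)
        return;
    p[0] = a[0];
    for (long i = 1; i < n; i++)
        p[i] = p[i - 1] + a[i];   /* load p[i-1], add a[i], store p[i] */
}

/* Variant for comparison: the running sum lives in a local variable, so the
 * only value carried between iterations is the result of the addition. */
void psum_reg(const float a[], float p[], long n)
{
    if (n <= 0)
        return;
    float acc = a[0];             /* running prefix sum */
    p[0] = acc;
    for (long i = 1; i < n; i++) {
        acc += a[i];              /* loop-carried dependency: one add */
        p[i] = acc;               /* store is off the critical path */
    }
}
```

Both functions compute the same prefix sums; calling each on a small array such as {1, 2, 3, 4} and comparing the outputs is a quick sanity check before timing them. Whether a compiler already performs the second transformation on its own depends on aliasing assumptions, so measured results may differ from this sketch.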

Figure 5.1

/* Compute prefix sum of vector a */
void psum1

Figure 5.14

[Figure: data-flow diagrams, panels (a) and (b). Operations: load, mul, add, cmp, jne. Registers: %rax, %rdx, %xmm0. Memory operand: data[i].]

Figure 5.36

[Figure: data-flow diagrams, panels (a) and (b). Operations: s_addr, s_data, load, add, sub, jne. Registers: %rax, %rdi, %rsi, %rdx.]



Related Book

Computer Systems: A Programmer's Perspective

ISBN: 9781292101767

3rd Global Edition

Authors: Randal E. Bryant, David R. O'Hallaron
