When we use gcc to compile combine3 with command-line option -02, we get code with substantially better
Question:
When we use gcc to compile combine3 with command-line option -02, we get code with substantially better CPE performance than with -01:
We achieve performance comparable to that for combine4, except for the case of integer sum, but even it improves significantly. On examining the assembly code generated by the compiler, we find an interesting variant for the inner loop:
We see that, besides some reordering of instructions, the only difference is that the more optimized version does not contain the vmovsd implementing the read from the location designated by dest (line 2).
A. How does the role of register %xmm0 differ in these two loops?
B. Will the more optimized version faithfully implement the C code of combine3, including when there is memory aliasing between dest and the vector data?
C. Either explain why this optimization preserves the desired behavior, or give an example where it would produce different results than the less optimized code.
Step by Step Answer:
Computer Systems A Programmers Perspective
ISBN: 9781292101767
3rd Global Edition
Authors: Randal E. Bryant, David R. O'Hallaron