
Question


Optimizing Cache Performance via Advanced Techniques

Concepts illustrated by this case study

■ Non-blocking Caches

■ Compiler Optimizations for Caches

■ Software and Hardware Prefetching

■ Calculating Impact of Cache Performance on More Complex Processors

The transpose of a matrix interchanges its rows and columns; this is illustrated below:

Here is a simple C loop to show the transpose:
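A minimal sketch of such a loop, assuming square n × n matrices held in flat arrays. The function name `transpose_simple` and the flat indexing are illustrative choices, not taken from the original listing. With this indexing the elements of one row sit next to each other in memory, so which of the two accesses ends up strided depends on the storage order being assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward transpose: out[j][i] = in[i][j] for every element.
   in and out are flat arrays of n*n doubles and must be distinct. */
void transpose_simple(size_t n, const double *in, double *out) {
    assert(in != out);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            out[j * n + i] = in[i * n + j];
        }
    }
}
```

One of the two matrices is walked along consecutive addresses while the other is walked with a stride of n elements, which is what the cache questions below are about.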

Assume that both the input and output matrices are stored in the row major order (row major order means that the row index changes fastest). Assume that you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (don’t worry about cache conflicts) least recently used (LRU) replacement L1 data cache with 64 byte blocks. Assume that the L1 cache misses or prefetches require 16 cycles and always hit in the L2 cache, and that the L2 cache can process a request every two processor cycles. Assume that each iteration of the inner loop above requires four cycles if the data are present in the L1 cache. Assume that the cache has a write-allocate fetch-on-write policy for write misses. Unrealistically, assume that writing back dirty cache blocks requires 0 cycles.
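The sizes implied by these assumptions can be worked out with a few helper functions. This is only a sketch of the arithmetic, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* How many elements fit in one cache block (line). */
size_t elems_per_line(size_t line_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0 && line_bytes % elem_bytes == 0);
    return line_bytes / elem_bytes;
}

/* Storage needed for one n x n matrix. */
size_t matrix_bytes(size_t n, size_t elem_bytes) {
    return n * n * elem_bytes;
}

/* How many blocks (lines) the cache can hold at once. */
size_t lines_in_cache(size_t cache_bytes, size_t line_bytes) {
    assert(line_bytes > 0 && cache_bytes % line_bytes == 0);
    return cache_bytes / line_bytes;
}
```

For the numbers in the problem, `elems_per_line(64, 8)` is 8, `matrix_bytes(256, 8)` is 524288 (512 KB per matrix, 32 times the cache size), and `lines_in_cache(16 * 1024, 64)` is 256. A strided walk over 256 elements of one matrix therefore touches 256 different blocks, as many as the whole L1 can hold.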

[10/15/15/12/20] <2.2> For the simple implementation given above, this execution order would be nonideal for the input matrix; however, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a. [10] <2.2> What should be the minimum size of the cache to take advantage of blocked execution?

b. [15] <2.2> How do the relative number of misses in the blocked and unblocked versions compare in the minimum sized cache above?

c. [15] <2.2> Write code to perform a transpose with a block size parameter B which uses B × B blocks.

d. [12] <2.2> What is the minimum associativity required of the L1 cache for consistent performance independent of both arrays’ position in memory?

e. [20] <2.2> Try out blocked and nonblocked 256 × 256 matrix transpositions on a computer. How closely do the results match your expectations based on what you know about the computer’s memory system? Explain any discrepancies if possible.
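For part (c), one possible shape for a blocked transpose is sketched here. It assumes flat n × n arrays of doubles that do not overlap; the name `transpose_blocked` and the edge handling for n not divisible by B are illustrative additions rather than part of the exercise.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose an n x n matrix in B x B tiles.
   in and out are flat arrays of n*n doubles and must not overlap. */
void transpose_blocked(size_t n, size_t B, const double *in, double *out) {
    assert(B > 0);
    for (size_t ii = 0; ii < n; ii += B) {
        for (size_t jj = 0; jj < n; jj += B) {
            /* Clamp the tile at the matrix edge when B does not divide n. */
            size_t i_end = (ii + B < n) ? ii + B : n;
            size_t j_end = (jj + B < n) ? jj + B : n;
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
}
```

Each tile touches at most B rows of the input and B rows of the output, so the working set per tile is bounded by the tile size rather than by n. For the 256 × 256 case with 64-byte blocks, B = 8 makes each tile row exactly one cache block wide.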
 

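$$
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
A_{11} & A_{21} & A_{31} & A_{41} \\
A_{12} & A_{22} & A_{32} & A_{42} \\
A_{13} & A_{23} & A_{33} & A_{43} \\
A_{14} & A_{24} & A_{34} & A_{44}
\end{bmatrix}
$$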

Step by Step Solution


There are 3 Steps involved in it

Step: 1

a. The minimum size of the cache to take advantage of blocked execution should be 64 bytes. This is because the block size of the L1 cache needs to be e...
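As a cross-check on parts (a) and (b), the miss counts can be estimated with a small model of a fully associative LRU cache. This is a sketch under stated assumptions: the input matrix starts at address 0, the output follows it directly, both are aligned to 64-byte blocks, and a write miss fetches the block (write-allocate). The names `Cache`, `cache_access` and `transpose_misses` are hypothetical. Under this model a tile needs 8 input blocks plus 8 output blocks resident at once, that is 16 blocks or 1 KB, which is this sketch's own estimate and differs from the 64-byte figure quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N = 256, ELEM_BYTES = 8, LINE_BYTES = 64, MAX_LINES = 256 };

/* Fully associative LRU cache model: tags plus last-use timestamps. */
typedef struct {
    uint64_t tag[MAX_LINES];
    uint64_t last_use[MAX_LINES];
    int      valid[MAX_LINES];
    int      num_lines;
    uint64_t clock;
    uint64_t misses;
} Cache;

static void cache_init(Cache *c, int num_lines) {
    assert(num_lines > 0 && num_lines <= MAX_LINES);
    memset(c, 0, sizeof *c);
    c->num_lines = num_lines;
}

/* Touch one byte address; on a miss, fill an empty way or evict the LRU one. */
static void cache_access(Cache *c, uint64_t addr) {
    uint64_t tag = addr / LINE_BYTES;
    int victim = 0;
    c->clock++;
    for (int k = 0; k < c->num_lines; k++) {
        if (c->valid[k] && c->tag[k] == tag) {
            c->last_use[k] = c->clock;   /* hit: refresh recency */
            return;
        }
    }
    c->misses++;
    for (int k = 0; k < c->num_lines; k++) {
        if (!c->valid[k]) { victim = k; break; }
        if (c->last_use[k] < c->last_use[victim]) victim = k;
    }
    c->valid[victim] = 1;
    c->tag[victim] = tag;
    c->last_use[victim] = c->clock;
}

/* Assumed layout: input at byte 0, output immediately after the input. */
static uint64_t in_addr(int i, int j) {
    return ((uint64_t)i * N + (uint64_t)j) * ELEM_BYTES;
}
static uint64_t out_addr(int i, int j) {
    return ((uint64_t)N * N + (uint64_t)i * N + (uint64_t)j) * ELEM_BYTES;
}

/* Misses for a transpose walked in b x b tiles; b == N is the unblocked loop. */
uint64_t transpose_misses(int num_lines, int b) {
    Cache c;
    assert(b > 0 && N % b == 0);
    cache_init(&c, num_lines);
    for (int ii = 0; ii < N; ii += b)
        for (int jj = 0; jj < N; jj += b)
            for (int i = ii; i < ii + b; i++)
                for (int j = jj; j < jj + b; j++) {
                    cache_access(&c, in_addr(i, j));   /* read  input[i][j]  */
                    cache_access(&c, out_addr(j, i));  /* write output[j][i] */
                }
    return c.misses;
}
```

With a 16-block (1 KB) cache, `transpose_misses(16, 8)` gives 16384 misses for the blocked version and `transpose_misses(16, 256)` gives 73728 for the unblocked one, a ratio of 9 to 2. With a single 64-byte block, `transpose_misses(1, 8)` gives 131072: every access misses, because reads and writes alternate between two different blocks.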




