Question

1 Approved Answer

Posted on Sep 21, 2024

ADVANCED COMPUTER ARCHITECTURE CPCS504 Assignment1 Spring 2021 /1442 Due Date 1st March 2021 Chapter1: 1. In Example 1 of Section 1.2.1, we assumed that the

image text in transcribed

ADVANCED COMPUTER ARCHITECTURE CPCS504 Assignment1 Spring 2021 /1442 Due Date 1st March 2021 Chapter1: 1. In Example 1 of Section 1.2.1, we assumed that the cache miss penalty was 20 cycles. With modern processors running at a frequency of 1 to 3 GHz, the cache miss penalty can reach several hundred cycles (we will see how this can be somewhat mitigated by a cache hierarchy). Keeping all other parameters, the same as in the example, plot CPI vs.cache miss penalty cost when the latter varies between 20 and 500 cycles (choose appropriate intervals). Do your computations argue for the threat of a "memory wall" whereby loading instructions and data could potentially dominate the execution time? Chapter2: 2. Assume that you want to use only one ALU for the basic five-stage pipeline of Section 2.1.2. A problem arises when you implement branches in that the ALU needs to be used twice: once for branch target computation and once for making the register comparison. Design the data path and control unit with this constraint. What are the performance implications? 3. As a designer you are asked to evaluate three possible options for an on-chip write-through data cache. Some of the design options (associativity) and performance consequences (miss rate, miss penalty) are described in the table below: Data cache options Miss rate Miss penalty Cache A: Direct mapped 0.08 4 cycles Cache B: Two-way set-associative 0.04 6 cycles Cache C: Four-way set-associative 0.02 8 cycles (a) Assume that load instructions have a CPI of 1.2 if they hit in the cache and a CPI of 1.2 + (miss penalty) otherwise. All other instructions have a CPI of 1.2. The instruction mix is such that 20% of instructions are loads. What is the CPI for each configuration (you'll need to keep three decimal digits, i.e., compute the CPI as x.xxx)? Which cache would you choose if CPI is the determining factor? (b) Assume now that if the direct-mapped cache is used, the cycle time is 20 ns. If the two-way set-associative cache is used, the cycle time is 22 ns. If the four-way is used, the cycle time is 24 ns. What is the average time per instruction? Which cache would you choose if average time per instruction is the determining factor? (c) In the case of the two-way set-associative cache, the replacement algorithm is LRU. In the case of the four-way set-associative cache, the replacement algorithm is such that the most recently used (MRU) line is not replaced the choice of which of the other three is replaced is random and not part of the logic associated with each line in the cache. Indicate what bits are needed to implement the replacement algorithms for each line. 4. Assume that it takes 1 cycle to access and return the information on a data-TLB hit and 1 cycle to access and return the information on a data-cache hit. A data-TLB miss takes 300 cycles to resolve, and a data-cache miss takes 100 cycles to resolve. The data-TLB hit rate is 0.99, and the data-cache hit rate is 0.95. What is the average memory access time for a load data reference? Now the data cache is an L1 D-cache and is backed up by an L2 unified cache. Every data reference that misses in L1 has a 60% chance of hitting in L2. A miss in L1 followed by a hit in L2 has a latency of 10 cycles. What is the average memory access time for a load data reference in this new configuration? Chapter3: 5. Consider the in-order-issue/in-order-completion" execution sequence shown in Figure 16.14. Esecute Write Tweede 12 1 11 13 15 IS 12 14 16 15 12 14 11 3 4 5 6 7 3 16 13 14 15 15 IS 16 Figure 16.14 An In-Order Issue, In-Order Completion Execution Sequence a. Identify the most likely reason why 12 could not enter the execute stage until the fourth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue out-of-order completion" fix this? If so, which? b. Identify the reason why 16 could not enter the write stage until the nineth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue/out-of-order completion" fix this? If so, which? 6. Consider the following sequence of instructions, where the syntax consists of an opcode followed by the destination register followed by one or two source registers: 0 ADD R3, R1, R2 1 LOAD R6, [R3) 2 AND R7, R5, 3 3 ADD Ri, R6, R7 4 SRL R7, RO, 8 OR R2, R4, R7 SUB R5, R3, R4 7 ADD RO, R1, 10 LOAD R6, [F SUB R2, R1, R6 10 AND R3, R7, 15 5 6 5 8 9 Assume the use of a four-stage pipeline: fetch, decode/issue, execute, write back. Assume that all pipeline stages take one clock cycle except for the execute stage. For simple integer arithmetic and logical instructions, the execute stage takes one cycle, but for a LOAD from memory, five cycles are consumed in the execute stage. If we have a simple scalar pipeline but allow out-of-order execution, we can construct the following table for the execution of the first seven instructions: Decode Execute Write Back Instruction 0 1 2 3 4 Fetch 0 1 2 3 4 2 3 4 5 4 5 10 6 11 7 10 12 5 8 9 The entries under the four pipeline stages indicate the clock cycle at which each instruction begins each phase. In this program, the second ADD instruction (instruction 3) depends on the LOAD instruction (instruction 1) for one of its operands, r6. Because the LOAD instruction takes five clock cycles, and the issue logic encounters the dependent ADD instruction after two clocks, the issue logic must delay the ADD instruction for three clock cycles. With an out-of-order capability, the processor can stall instruction 3 at clock cycle 4, and then move on to issue the following three independent instructions, which enter execution at clocks 6, 8, and 9. The LOAD finishes execution at clock 9, and so the dependent ADD can be launched into execution on clock 10. a. Complete the preceding table. b. Redo the table assuming no out-of-order capability. What is the savings using the capability? c. Redo the table assuming a superscalar implementation that can handle two instructions at a time at each stage. ADVANCED COMPUTER ARCHITECTURE CPCS504 Assignment1 Spring 2021 /1442 Due Date 1st March 2021 Chapter1: 1. In Example 1 of Section 1.2.1, we assumed that the cache miss penalty was 20 cycles. With modern processors running at a frequency of 1 to 3 GHz, the cache miss penalty can reach several hundred cycles (we will see how this can be somewhat mitigated by a cache hierarchy). Keeping all other parameters, the same as in the example, plot CPI vs.cache miss penalty cost when the latter varies between 20 and 500 cycles (choose appropriate intervals). Do your computations argue for the threat of a "memory wall" whereby loading instructions and data could potentially dominate the execution time? Chapter2: 2. Assume that you want to use only one ALU for the basic five-stage pipeline of Section 2.1.2. A problem arises when you implement branches in that the ALU needs to be used twice: once for branch target computation and once for making the register comparison. Design the data path and control unit with this constraint. What are the performance implications? 3. As a designer you are asked to evaluate three possible options for an on-chip write-through data cache. Some of the design options (associativity) and performance consequences (miss rate, miss penalty) are described in the table below: Data cache options Miss rate Miss penalty Cache A: Direct mapped 0.08 4 cycles Cache B: Two-way set-associative 0.04 6 cycles Cache C: Four-way set-associative 0.02 8 cycles (a) Assume that load instructions have a CPI of 1.2 if they hit in the cache and a CPI of 1.2 + (miss penalty) otherwise. All other instructions have a CPI of 1.2. The instruction mix is such that 20% of instructions are loads. What is the CPI for each configuration (you'll need to keep three decimal digits, i.e., compute the CPI as x.xxx)? Which cache would you choose if CPI is the determining factor? (b) Assume now that if the direct-mapped cache is used, the cycle time is 20 ns. If the two-way set-associative cache is used, the cycle time is 22 ns. If the four-way is used, the cycle time is 24 ns. What is the average time per instruction? Which cache would you choose if average time per instruction is the determining factor? (c) In the case of the two-way set-associative cache, the replacement algorithm is LRU. In the case of the four-way set-associative cache, the replacement algorithm is such that the most recently used (MRU) line is not replaced the choice of which of the other three is replaced is random and not part of the logic associated with each line in the cache. Indicate what bits are needed to implement the replacement algorithms for each line. 4. Assume that it takes 1 cycle to access and return the information on a data-TLB hit and 1 cycle to access and return the information on a data-cache hit. A data-TLB miss takes 300 cycles to resolve, and a data-cache miss takes 100 cycles to resolve. The data-TLB hit rate is 0.99, and the data-cache hit rate is 0.95. What is the average memory access time for a load data reference? Now the data cache is an L1 D-cache and is backed up by an L2 unified cache. Every data reference that misses in L1 has a 60% chance of hitting in L2. A miss in L1 followed by a hit in L2 has a latency of 10 cycles. What is the average memory access time for a load data reference in this new configuration? Chapter3: 5. Consider the in-order-issue/in-order-completion" execution sequence shown in Figure 16.14. Esecute Write Tweede 12 1 11 13 15 IS 12 14 16 15 12 14 11 3 4 5 6 7 3 16 13 14 15 15 IS 16 Figure 16.14 An In-Order Issue, In-Order Completion Execution Sequence a. Identify the most likely reason why 12 could not enter the execute stage until the fourth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue out-of-order completion" fix this? If so, which? b. Identify the reason why 16 could not enter the write stage until the nineth cycle. Will "in-order issue/out-of-order completion" or "out-of-order issue/out-of-order completion" fix this? If so, which? 6. Consider the following sequence of instructions, where the syntax consists of an opcode followed by the destination register followed by one or two source registers: 0 ADD R3, R1, R2 1 LOAD R6, [R3) 2 AND R7, R5, 3 3 ADD Ri, R6, R7 4 SRL R7, RO, 8 OR R2, R4, R7 SUB R5, R3, R4 7 ADD RO, R1, 10 LOAD R6, [F SUB R2, R1, R6 10 AND R3, R7, 15 5 6 5 8 9 Assume the use of a four-stage pipeline: fetch, decode/issue, execute, write back. Assume that all pipeline stages take one clock cycle except for the execute stage. For simple integer arithmetic and logical instructions, the execute stage takes one cycle, but for a LOAD from memory, five cycles are consumed in the execute stage. If we have a simple scalar pipeline but allow out-of-order execution, we can construct the following table for the execution of the first seven instructions: Decode Execute Write Back Instruction 0 1 2 3 4 Fetch 0 1 2 3 4 2 3 4 5 4 5 10 6 11 7 10 12 5 8 9 The entries under the four pipeline stages indicate the clock cycle at which each instruction begins each phase. In this program, the second ADD instruction (instruction 3) depends on the LOAD instruction (instruction 1) for one of its operands, r6. Because the LOAD instruction takes five clock cycles, and the issue logic encounters the dependent ADD instruction after two clocks, the issue logic must delay the ADD instruction for three clock cycles. With an out-of-order capability, the processor can stall instruction 3 at clock cycle 4, and then move on to issue the following three independent instructions, which enter execution at clocks 6, 8, and 9. The LOAD finishes execution at clock 9, and so the dependent ADD can be launched into execution on clock 10. a. Complete the preceding table. b. Redo the table assuming no out-of-order capability. What is the savings using the capability? c. Redo the table assuming a superscalar implementation that can handle two instructions at a time at each stage