Question
When an instruction reaches the head of the Re-Order Buffer (ROB), its ready flag, which is set when it passes the writeback stage, is checked. If the ready flag is set, the instruction retires, i.e., exits the processor. However, retirement may be delayed when that instruction is a load with a pending cache miss. Recall that when a load misses in the cache, it may take a long time to bring its data from memory into the cache; until then, the load cannot complete its writeback stage and cannot exit the processor. The problem is that the processor permits only in-order retirement, so while that load sits at the head of the ROB, no other instruction is allowed to exit. This causes a complete stall at the back-end of the pipeline, and if the front-end keeps admitting instructions into the pipeline, the front-end stalls in no time as well.

Now, assume that we allow this load, and every cache-missing load, to exit the processor with a bogus value, say zero, before its actual value arrives from memory into the cache. Then we can also allow the subsequent instructions to exit the processor in program order. When the actual value of the load arrives, we replay all of the instructions from the writeback of that load onwards, this time with the correct value. Since we replay every instruction after the miss-pending load, we do not expect a direct performance gain from such a mechanism; instead of waiting for the load to receive its value from memory, we simply run a stream of instructions with a bogus value for that load. Note, however, that some of those follower instructions are load-dependent: they are consumers of the load's value. Because they run with a bogus value, their results will also be incorrect.
Therefore, we somehow need to mark them so that we do not access invalid memory addresses because of them. Meanwhile, other instructions are completely independent of the load; they produce correct results and may, in turn, trigger load and store instructions with correct addresses. Here lies the indirect impact of this mechanism on performance: if a load or store that does not depend on the first load arrives, it initiates a valid cache access, which may well miss the cache too. Although that instruction misses, it has a prefetch-like effect and brings its data into the cache. Once the load at the head of the ROB receives its actual data, the same instructions are replayed a second time; but this time the data is already in the cache, and, as a result, the application runs faster.

Assume an IBM PowerPC-like datapath, where the Re-Order Buffer (ROB), Physical Register File (PRF), and Architectural Register File (ARF) are separate structures. Show your design for implementing the mechanism described above. What modifications to the pipeline stages and datapath structures are needed? Do we need extra structures to make this mechanism really work? If so, what are these structures? Explain your design in detail.
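To make the intended behavior concrete, here is a minimal Python sketch of the bogus-value, poison-bit, and replay idea described above. It is an illustration under assumptions, not a datapath design: the three-operation toy ISA (`loadi`, `load`, `add`), the `execute` function, and the `poison` set standing in for per-register poison bits are all hypothetical names introduced for this sketch.

```python
def execute(prog, memory, cache, bogus_addr=None):
    """One pass over the program. If bogus_addr is given, the load from that
    address retires with a bogus 0 and poisons its destination register;
    instructions whose sources are poisoned propagate the poison, and loads
    with a poisoned address register suppress their memory access."""
    regs, poison, filled = {}, set(), []   # filled: lines brought into the cache
    for op, dst, *rest in prog:
        if op == 'loadi':                  # load from a literal address
            addr = rest[0]
            if addr == bogus_addr:         # miss-pending load: retire with bogus 0
                regs[dst] = 0
                poison.add(dst)
                continue
            if addr not in cache:          # real miss: prefetch-like fill
                cache.add(addr)
                filled.append(addr)
            regs[dst] = memory[addr]
            poison.discard(dst)
        elif op == 'load':                 # address taken from a register
            src = rest[0]
            if src in poison:              # poisoned address: no memory access
                regs[dst] = 0
                poison.add(dst)
                continue
            addr = regs[src]
            if addr not in cache:
                cache.add(addr)
                filled.append(addr)
            regs[dst] = memory[addr]
            poison.discard(dst)
        else:                              # 'add': propagate poison via sources
            s1, s2 = rest
            regs[dst] = regs[s1] + regs[s2]
            if s1 in poison or s2 in poison:
                poison.add(dst)
            else:
                poison.discard(dst)
    return regs, poison, filled

memory = {100: 7, 200: 3}
cache = set()
prog = [
    ('loadi', 'r1', 100),        # miss-pending load at the ROB head
    ('add',   'r2', 'r1', 'r1'), # load-dependent: poisoned in the first pass
    ('loadi', 'r3', 200),        # independent load: valid access, fills the cache
    ('add',   'r4', 'r3', 'r3'), # independent: correct even in the first pass
]
# First pass: r1 retires with a bogus 0; the independent load prefetches 200.
regs1, poison1, filled1 = execute(prog, memory, cache, bogus_addr=100)
# The pending miss has now filled the cache with line 100; replay everything.
cache.add(100)
regs2, poison2, filled2 = execute(prog, memory, cache)
```

In the first pass `r1` and `r2` carry bogus, poisoned values while the independent load fills line 200; in the replay, all results are correct, no register remains poisoned, and no cache fill is needed, which is the indirect performance gain the question describes.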