Questions and Answers of Computer Architecture
Based on the illustration of an iPhone shown in Figure 2.4, draw a system model for an iPhone. Figure 2.4
Repeat Exercise 6.1 for a subtract instruction. Data from Exercise 6.1: The steps that the Little Man performs are closely related to the way in which the CPU actually executes instructions. Draw a flow
Repeat Exercise 6.1 for a branch on positive instruction. Data from Exercise 6.1: The steps that the Little Man performs are closely related to the way in which the CPU actually executes instructions.
Consider the business shown in Figure 2.5. Assume that the business is located in a single building. Show a block diagram for a backbone-based network configuration that will allow efficient use of
Identify the binary sequence that is represented by the Manchester encoded sequence shown in Figure 14E.2.
Consider a representation of a work organization or school with which you are familiar. Identify the major components that characterize the primary operations within the organization and draw a
The steps that the Little Man performs are closely related to the way in which the CPU actually executes instructions. Draw a flow chart that carefully describes the steps that the Little Man follows
Draw side-by-side flow diagrams that show how the Little Man executes a store instruction and the corresponding CPU fetch–execute cycle.
Generally, the distance that a programmer wants to move from the current instruction location on a BRANCH ON CONDITION is fairly small. This suggests that it might be appropriate to design the BRANCH
Draw a flow diagram that shows step by step the process for converting a mixed number in a base other than 10 to decimal.
Without writing a program, predict the ORD (binary) value for your computer system for the letter “A”, for the letter “B”, for the letter “C”. How did you know? Might the value be
What base is the student in the chapter cartoon using to perform his addition?
Extend the simple program shown in Section 6.3 to accept three inputs from a user, add them, and output the result.
Using the register operations indicated in this chapter, show the fetch–execute cycle for an instruction that produces the 2’s complement of the number in A. Show the fetch–execute cycle for an
Carefully draw a diagram that represents the binary sequence 00101110100010. Now, below your original diagram, draw the Manchester encoded representation of the same sequence.
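For reference on the Manchester-coding questions above, the short C sketch below (the helper name manchester and the main driver are illustrative, not from the text) prints the Manchester form of a bit string under the IEEE 802.3 convention, where a 0 bit is sent as a low-to-high transition (written 01) and a 1 bit as high-to-low (written 10); some texts use the opposite convention, so check the figure in your chapter.

#include <stdio.h>

/* Manchester-encode an ASCII bit string, IEEE 802.3 convention:
 * 0 -> 01 (low-to-high transition), 1 -> 10 (high-to-low). */
static void manchester(const char *bits)
{
    for (const char *p = bits; *p; p++)
        fputs(*p == '0' ? "01" : "10", stdout);
    putchar('\n');
}

int main(void)
{
    manchester("00101110100010");  /* the sequence from the exercise above */
    return 0;
}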
What is the 4B/5B encoding for the binary sequence 1101000011001101?
Compute the effective CPI for an implementation of an embedded RISC-V CPU using Figure A.29. Assume we have made the following measurements of average CPI for each of the instruction types: Figure
Compute the effective CPI for RISC-V using Figure A.29 and the table above. Average the instruction frequencies of bzip and hmmer to obtain the instruction mix. You may assume that all other
Compute the effective CPI for an implementation of a RISC-V CPU using Figure A.29. Assume we have made the following measurements of average CPI for each of the instruction types: Figure A.29
Compute the effective CPI for RISC-V using Figure A.29 and the table above. Average the instruction frequencies of perlbench and sjeng to obtain the instruction mix. Figure A.29
Compiler optimizations may result in improvements to code size and/or performance. Consider one or more of the benchmark programs from the SPEC CPU2017 or the EEMBC benchmark suites. Use the RISC-V
Consider the following fragment of C code: Assume that A and B are arrays of 64-bit integers, and C and i are 64-bit integers. Assume that all data values and their addresses are kept in memory (at
Consider the following fragment of C code: Assume that R, G, B, Y, U, and V are arrays of 64-bit integers. Assume that all data values and their addresses are kept in memory (at addresses 1000, 2000,
For the following, we consider instruction encoding for instruction set architectures. a. Consider the case of a processor with an instruction length of 14 bits and with 64 general-purpose registers
For the following assume that integer values A, B, C, D, E, and F reside in memory. Also assume that instruction operation codes are represented in 8 bits, memory addresses are 64 bits, and register
The design of RISC-V provides for 32 general-purpose registers and 32 floating-point registers. If registers are good, are more registers better? List and discuss as many trade-offs as you can that
Consider a C struct that includes the following members: Note that for C, the compiler must keep the elements of the struct in the same order as given in the struct definition. For a 32-bit machine,
Many computer manufacturers now include tools or simulators that allow you to measure the instruction set usage of a user program. Among the methods in use are machine simulation, hardware-supported
Newer processors such as Intel's i7 Kaby Lake include support for AVX2 vector/multimedia instructions. Write a dense matrix multiply function using single-precision values and compile it with
For the SGEMM code developed above for the i7 processor, include the use of AVX2 intrinsics to improve the performance. In particular, try to vectorize your code to better utilize the AVX hardware.
The RISC-V processor is open source and boasts an impressive collection of implementations, simulators, compilers, and other tools. See riscv.org for an overview of tools, including spike, a
Gcc targets most modern instruction set architectures (see www.gnu.org/software/gcc/). Create a version of gcc for several architectures that you have access to, such as x86, RISC-V, PowerPC, and
Power efficiency has become very important for modern processors, particularly for embedded systems. Create a version of gcc for two architectures that you have access to, such as x86, RISC-V,
Your task is to compare the memory efficiency of four different styles of instruction set architectures. The architecture styles are: ■ Accumulator—All operations occur between a single register
Use the four different instruction set architecture styles from above, but assume that the memory operations supported include register indirect as well as direct addressing. Invent your own assembly
The size of displacement values needed for the displacement addressing mode or for PC-relative addressing can be extracted from compiled applications. Use a disassembler with one or more of the SPEC
The value represented by the hexadecimal number 5249 5343 5643 5055 is to be stored in an aligned 64-bit double word. a. Using the physical arrangement of the first row in Figure A.5, write the value
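A quick way to make the byte-ordering question above concrete is to store the 64-bit value in memory and dump its bytes in address order; the C sketch below does that (note that 0x5249534356435055 is the ASCII string "RISCVCPU"). The output depends on whether the host is little- or big-endian, which mirrors the two physical arrangements in Figure A.5.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint64_t v = 0x5249534356435055ULL;   /* ASCII "RISCVCPU" as one 64-bit value */
    unsigned char bytes[8];
    memcpy(bytes, &v, sizeof v);          /* capture the in-memory representation */

    /* Print bytes from lowest address to highest.
     * Little-endian host: 55 50 43 56 43 53 49 52 ("UPCVCSIR" as characters).
     * Big-endian host:    52 49 53 43 56 43 50 55 ("RISCVCPU"). */
    for (int i = 0; i < 8; i++)
        printf("%02X%c", bytes[i], i == 7 ? '\n' : ' ');
    return 0;
}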
The relative frequency of different addressing modes impacts the choice of addressing modes supported by an instruction set architecture. Figure A.7 illustrates the relative frequency of addressing
Consider typical applications for desktop, server, cloud, and embedded computing. How would instruction set architecture be impacted for machines targeting each of these markets?
You are trying to appreciate how important the principle of locality is in justifying the use of a cache memory, so you experiment with a computer having an L1 data cache and a main memory (you
For the purpose of this exercise, we assume that we have a 512-byte cache with 64-byte blocks. We will also assume that the main memory is 2 KB large. We can regard the memory as an array of 64-byte
Cache organization is often influenced by the desire to reduce the cache's power consumption. For that purpose we assume that the cache is physically distributed into a data array (holding the data),
We compare the write bandwidth requirements of write-through versus write-back caches using a concrete example. Let us assume that we have a 64 KB cache with a line size of 32 bytes. The cache will
You are building a system around a processor with in-order execution that runs at 1.1 GHz and has a CPI of 1.35 excluding memory accesses. The only instructions that read or write data from memory are
We want to observe the following calculation. The memory layout of arrays a, b, c, and d is displayed below (each has 512 4-byte-wide integer elements). The above calculation employs a for loop that runs
Increasing a cache's associativity (with all other parameters kept constant) statistically reduces the miss rate. However, there can be pathological cases where increasing a cache's associativity
Whereas larger caches have lower miss rates, they also tend to have longer hit times. Assume a direct-mapped 8 KB cache has a 0.22 ns hit time and miss rate m1; also assume a 4-way associative 64 KB
Use the following code fragment: Assume that the initial value of x3 is x2+396. a. Data hazards are caused by data dependences in the code. Whether a dependency causes a hazard depends on the machine
Suppose the branch frequencies (as percentages of all instructions) are as follows: a. We are examining a four-stage pipeline where the branch is resolved at the end of the second cycle for
For these problems, we will explore a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. There is a
Construct a table like that shown in Figure C.21 to check for WAW stalls in the RISC-V FP pipeline of Figure C.30. Do not consider FP divides. Figure C.21, Figure C.30
It is critical that the scoreboard be able to distinguish RAW and WAR hazards, because a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand
Implement and run the Skippy algorithm on a disk drive of your choosing. a. Graph the results of running Skippy. Report the manufacturer and model of your disk. b. What is the minimal transfer
Assume that you have a RAID 4 system with six disks. Draw a simple diagram showing the layout of blocks across disks for this RAID system.
Imagine that instead of a read-only workload, you now have a write-only workload on a RAID 1 array. a. Describe how you can use queuing theory to model this system and workload. b. Given this system
Finally, we put theory into practice by developing a user-level tool to guard against file corruption. Assume you are to write a simple set of tools to detect and repair data integrity. The first tool
Assume the constants shown as follows. Write code for RISC-V and RV64V. Assume the starting addresses of tiPL, tiPR, clL, clR, and clP are in RtiPL, RtiPR, RclL, RclR, and RclP, respectively. Do not
Consider the possibility of unrolling the loop and mapping multiple iterations to vector operations. Assume that you can use scatter-gather loads and stores (vldi and vsti). How does this affect the
Now assume we want to implement the MrBayes kernel on a GPU using a single thread block. Rewrite the C code of the kernel using CUDA. Assume that pointers to the conditional likelihood and transition
With CUDA we can use coarse-grain parallelism at the block level to compute the conditional likelihood of multiple nodes in parallel. Assume that we want to compute the conditional likelihood from
Convert your code from Exercise 4.6 into PTX code. How many instructions are needed for the kernel? Exercise 4.6: With CUDA we can use coarse-grain parallelism at the block level to compute the
Consider the following code, which multiplies two vectors that contain single-precision complex values: Assume that the processor runs at 700 MHz and has a maximum vector length of 64. The load/store
In this problem, we will compare the performance of a vector processor with a hybrid system that contains a scalar processor and a GPU-based coprocessor. In the hybrid system, the host processor has
Section 4.5 discussed the reduction operation that reduces a vector down to a scalar by repeated application of an operation. A reduction is a special type of a loop recurrence. An example is shown
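As background for the reduction question above: a reduction is a loop whose accumulator carries a dependence from one iteration to the next, which is what complicates straightforward vectorization. A minimal C sketch of a sum reduction follows; the names and array size are illustrative, not taken from the text.

#include <stdio.h>

#define N 64

/* Sum reduction: each iteration reads the sum produced by the previous one,
 * so the loop carries a recurrence through the accumulator. */
static double reduce_sum(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++)
        a[i] = (double)i;
    printf("sum = %g\n", reduce_sum(a, N));  /* 0 + 1 + ... + 63 = 2016 */
    return 0;
}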
The following kernel performs a portion of the finite-difference time-domain (FDTD) method for computing Maxwell’s equations in a three-dimensional space, part of one of the SPEC06fp
In this exercise, we will examine several loops and analyze their potential for parallelization. a. Does the following loop have a loop-carried dependency? b. In the following loop, find all the true
For this programming exercise, you will write and characterize the behavior of a CUDA kernel that contains a high amount of data-level parallelism but also contains conditional execution behavior.
For each part of this exercise, the initial cache and memory state are assumed to initially have the contents shown in Figure 5.37. Each part of this exercise specifies a sequence of one or more CPU
Some applications read a large dataset first and then modify most or all of it. The base MSI coherence protocol will first fetch all of the cache blocks in the Shared state and then be forced to
Code running on a single core and not sharing any variables with other cores can suffer some performance degradation because of the snooping coherence protocol. Consider the following two iterative
For each part of this exercise, assume that initially all cache lines are invalid, and the data in memory Mi is the byte i (0x00 …). For each of the following parts, ■ Show the final state (i.e.,
The directory protocol used in 5.9 (based on Figure 5.20) assumes that the directory controller receives requests, sends invalidates, receives modified data, sends modified data to requester if block
In problem 5.9, it was assumed that all transactions on the system were serially executed, which is both unrealistic and inefficient in a DSM multicore. We now relax this condition. We will require
Use the routing and delay information described earlier and trace how the following groups of transactions will progress in the system (assume that all accesses are misses). a. C0: R, M7 C2: W, M2
On a read miss, a cache might overwrite a line in the shared (S) state without notifying the directory that owns the corresponding memory block. Alternatively, it will notify the directory so that it
Consider the following code segments running on two processors P1 and P2. Assume A and B are initially 0. a. If the processors adhere to the sequential consistency (SC) model, what are the
Consider the following code segments running on two processors P1 and P2. Assume A and B are initially 0. Explain how an optimizing compiler might make it impossible for B to be ever set to 2 in a
In a processor implementing an SC consistency model, the data cache is augmented with a data prefetch unit. Will that alter the execution results of the SC implementation? Why or why not?
Assume that the following code segment is executed on a processor that implements partial store order (PSO). a. Augment the code with synchronization primitives to make it emulate the behavior of a
Sequential consistency (SC) requires that all reads and writes appear to have executed in some total order. This may require the processor to stall in certain cases before committing a read or write
Repeat part (a) of problem 5.19 under an SC model on a processor that has a read prefetch unit. Assume a read prefetch was triggered 20 cycles in advance of the write operation. Problem 5.19: Sequential
In this exercise, we examine the effect of the interconnection network topology on the CPI of programs running on a 64-processor distributed-memory multiprocessor. The processor clock rate is 2.0
Show how the basic snooping protocol of Figure 5.6 can be changed for a write-through cache. What is the major hardware functionality that is not needed with a write-through cache compared with a
Please answer the following problems: a. Add a clean exclusive state to the basic snooping cache coherence protocol (Figure 5.6). Show the protocol in the finite state machine format used in the
An application is calculating the number of occurrences of a certain word in a very large number of documents. A very large number of processors divided the work, searching the different documents.
In this chapter, data-level parallelism has been discussed as a way for WSCs to achieve high performance on large problems. Conceivably, even greater performance can be obtained by using high-end
To achieve a lower OPEX, one appealing alternative is to use low-power versions of servers to reduce the total electricity required to run the servers; however, similar to high-end servers, low-power
Servers that have different operating modes offer opportunities for dynamically running different configurations in the cluster to match workload usage. Use the data in Figure 6.35 for the
Discuss the trade-offs and benefits of the two options in Exercise 6.3, assuming a constant workload being run on the servers. Exercise 6.3: Servers that have different operating modes offer
Unlike high-performance computing (HPC) clusters, WSCs often experience significant workload fluctuation throughout the day. Discuss the trade-offs and benefits of the two options in Exercise 6.3,
The TCO model presented so far abstracts away a significant amount of lower-level detail. Discuss the impact of these abstractions on the overall accuracy of the TCO model. When are these
One of the challenges in provisioning a WSC is determining the proper power load, given the facility size. As described in the chapter, nameplate power is often a peak value that is rarely
One assumption in the TCO model is that the critical load of the facility is fixed, and the number of servers fits that critical load. In reality, due to the variations of server power based on load,
WSCs are often used in an interactive manner with end users, as mentioned in Section 6.5. This interactive usage often leads to time-of-day fluctuations, with peaks correlating to specific time
Discuss some options to better utilize the excess servers during the off-peak hours or find ways to save costs. Given the interactive nature of WSCs, what are some of the challenges to aggressively
Propose one possible way to improve TCO by focusing on reducing server power. What are the challenges to evaluating your proposal? Estimate the TCO improvements based on your proposal. What are some
One of the important enablers of WSC is ample request level parallelism, in contrast to instruction- or thread-level parallelism. This question explores the implication of different types of
When a cloud computing service provider receives jobs consisting of multiple Virtual Machines (VMs) (e.g., a MapReduce job), many scheduling options exist. The VMs can be scheduled in a round-robin
MapReduce enables large amounts of parallelism by having data-independent tasks run on multiple nodes, often using commodity hardware; however, there are limits to the level of parallelism. For
WSC programmers often use data replication to overcome failures in the software. Hadoop HDFS, for example, employs three-way replication (one local copy, one remote copy in the rack, and one remote