Questions and Answers of Computer Architecture
Although request-level parallelism allows many machines to work on a single problem in parallel, thereby achieving greater overall performance, one of the challenges is how to avoid dividing the
One trend in high-end servers is toward the inclusion of nonvolatile flash memory in the memory hierarchy, either through solid-state disks (SSDs) or PCI Express-attached cards. Typical SSDs have a
Caching is heavily used in some WSC designs to reduce latency, and there are multiple caching options to satisfy varying access patterns and requirements.a. Let’s consider the design options for
Imagine you are the site operation and infrastructure manager of an Alexa.com top site and are considering using Amazon Web Services (AWS). What factors do you need to consider in determining whether
Figure 6.12 shows the impact of user-perceived response time on revenue, and motivates the need to achieve high throughput while maintaining low latency.a. Taking Web search as an example, what are
The efficiency of typical power supply units (PSUs) varies as the load changes; for example, PSU efficiency can be about 80% at 40% load (e.g., output 40 W from a 100-W PSU), 75% when the load is
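The efficiency figures in the question above come down to one ratio. A minimal sketch, using only the numbers the snippet states (the 100-W PSU and the 80%-at-40%-load point); anything past the truncation is not modeled:

```python
# PSU efficiency = output power / input power, so the wall draw for a given
# load is output / efficiency. Numbers below are from the question snippet.
def psu_input_power(output_w, efficiency):
    """Power drawn from the wall to deliver output_w at the given efficiency."""
    return output_w / efficiency

# 40 W output at 80% efficiency costs 50 W at the wall.
print(psu_input_power(40, 0.80))  # 50.0
```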
Power stranding is a term used to refer to power capacity that is provisioned but not used in a datacenter. Consider the data presented in Figure 6.37 [Fan, Weber, and Barroso, 2007] for different
Section 6.7 discussed the use of per-server battery sources in the Google design. Let us examine the consequences of this design.a. Assume that the use of a battery as a mini-server-level UPS is
For this exercise, consider a simplified equation for the total operational power of a WSC as follows:a. Assume an 8 MW datacenter at 80% power usage, electricity costs of $0.10 per kilowatt-hour,
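Since the question's full equation is truncated, the sketch below covers only the electricity-cost term it sets up, under the stated assumptions (8 MW capacity, 80% usage, $0.10 per kilowatt-hour):

```python
# Annual electricity cost: capacity (kW) x utilization x hours/year x $/kWh.
def annual_electricity_cost(capacity_mw, utilization, dollars_per_kwh,
                            hours_per_year=8760):
    return capacity_mw * 1000 * utilization * hours_per_year * dollars_per_kwh

# 8 MW at 80% usage and $0.10/kWh comes to roughly $5.6M per year.
print(annual_electricity_cost(8, 0.80, 0.10))
```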
As discussed in this chapter, the cooling equipment in WSCs can themselves consume a lot of energy. Cooling costs can be lowered by proactively managing temperature. Temperature-aware workload
Energy proportionality (sometimes also referred to as energy scale-down) is the attribute of the system to consume no power when idle, but more importantly gradually consume more power in proportion
This exercise illustrates the interactions of energy proportionality models with optimizations such as server consolidation and energy-efficient server designs. Consider the scenarios shown in
Consider the following two breakdowns of the power consumption of a server: (1) CPU, 50%; memory, 23%; disks, 11%; networking/other, 16%; (2) CPU, 33%; memory, 30%; disks, 10%; networking/other, 27%.a. Assume a
Pitt Turner IV et al. presented a good overview of datacenter tier classifications. Tier classifications define site infrastructure performance. For simplicity, consider the key differences as shown
Based on the observations in Figures 6.12 and 6.13, what can you say qualitatively about the trade-offs between revenue loss from downtime and costs incurred for uptime?Figure 6.12:Figure 6.13:
Some recent studies have defined a metric called TPUE, which stands for “true PUE” or “total PUE.” TPUE is defined as PUE * SPUE.PUE, the power utilization effectiveness, is defined in
Two benchmarks provide a good starting point for energy-efficiency accounting in servers—the SPECpower_ssj2008 benchmark (available at http://www.spec.org/power_ssj2008/) and the JouleSort metric
Figure 6.1 is a listing of outages in an array of servers.When dealing with the large scale of WSCs, it is important to balance cluster design and software architectures to achieve the required
Look up the current prices of standard DDR4 DRAM versus DDR4 DRAM that has error-correcting code (ECC). What is the increase in price per bit for achieving the higher reliability that ECC provides?
a. Consider a cluster of servers costing $2000 each. Assuming an annual failure rate of 5%, an average of an hour of service time per repair, and replacement parts requiring 10% of the system cost
The Open Compute project at www.opencompute.org provides a community to design and share efficient designs for warehouse-scale computers. Look at some of the recently proposed designs. How do they
Assume that the MapReduce job from Page #438 in Section 6.2 is executing a task with 2^40 bytes of input data, 2^37 bytes of intermediate data, and 2^30 bytes of output data. This job is entirely
Imagine you have created a web service that runs very well (responds within 100 ms latency) 99% of the time, and has performance issues 1% of the time (maybe the CPU went into a lower power state and
Matrix multiplication is a key operation supported in hardware by the TPU. Before going into details of the TPU hardware, it’s worth analyzing the matrix multiplication calculation itself. One
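Before looking at the TPU hardware, the calculation itself is just the triple loop below; a minimal reference version (the function name is mine, not from the text):

```python
# Dense matrix multiply: C[i][j] = sum over k of A[i][k] * B[k][j].
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    assert len(A[0]) == k, "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```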
Consider the neural network model MLP0 from Figure 7.5. That model has 20 Mweights in five fully connected layers (neural network researchers count the input layer as if it were a layer in the stack,
Consider the first convolutional layer of AlexNet, which uses a 7 × 7 convolutional kernel, with an input feature depth of 3 and an output feature depth of 48. The original image width is
The TPU uses fixed-point arithmetic (sometimes also called quantized arithmetic, with overlapping and conflicting definitions), where integers are used to represent values on the real number line.
In addition to tanh, another s-shaped smooth function, the logistic sigmoid function y=1 / (1+exp(–x)),is commonly used as an activation function in neural networks. A common way to implement them
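The identity sigmoid(x) = (1 + tanh(x/2)) / 2 is why a single tabulated activation can serve both functions; a quick numerical check:

```python
import math

def sigmoid(x):
    """Logistic sigmoid, y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_via_tanh(x):
    # tanh and the logistic sigmoid are the same curve, shifted and scaled.
    return (1.0 + math.tanh(x / 2.0)) / 2.0

for x in (-2.0, 0.0, 3.0):
    assert abs(sigmoid(x) - sigmoid_via_tanh(x)) < 1e-12
print(sigmoid(0.0))  # 0.5
```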
One popular family of FPGAs, the Virtex-7 series, is built by Xilinx. A Virtex-7 XC7VX690T FPGA contains 3,600 25×18-bit integer multiply-add "DSP slices." Consider building a TPU-style design on
Amazon Web Services (AWS) offers a wide variety of “computing instances,” which are machines configured to target different applications and scales. AWS prices tell us useful data about the Total
As shown in Figure 7.34 (but simplified to fewer PEs), each Pixel Visual Core includes a 16 × 16 set of full processing elements, surrounded by an additional two layers of "simplified"
Consider a case in which each of the eight cores on a Pixel Visual Core device is connected through a four-port switch to a 2D SRAM, forming a core+memory unit. The remaining two ports on the switch
The first Anton molecular dynamics supercomputer typically simulated a box of water that was 64 Å on a side. The computer itself might be approximated as a box with 1 m side length. A single
The Anton communication network is a 3D, 8 × 8 × 8 torus, where each node in the system has six links to neighboring nodes. Latency for a packet to transit a single link is about 50 ns. Ignore
What extra complexities may arise if the messages can be adaptively rerouted on the links? For example, a coherency message from core M1 directory controller to C2 (expressed in binary as M001
Using the sample program results in Figure 2.33:a. What are the overall size and block size of the second-level cache?b. What is the miss penalty of the second-level cache?c. What is the
What is the read latency experienced by a memory controller on a row buffer miss?
What is the latency experienced by a memory controller on a row buffer hit?
If the memory channel supports only one bank and the memory access pattern is dominated by row buffer misses, what is the utilization of the memory channel?
Assuming a 100% row buffer miss rate, what is the minimum number of banks that the memory channel should support in order to achieve a 100% memory channel utilization?
Assuming a 50% row buffer miss rate, what is the minimum number of banks that the memory channel should support in order to achieve a 100% memory channel utilization?
Assume that we are executing an application with four threads and the threads exhibit zero spatial locality, that is, a 100% row buffer miss rate. Every 200 ns, each of the four threads
From these questions, what have you learned about the benefits and downsides of growing the number of banks?
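The bank-count questions above all reduce to one overlap argument: while one bank spends a full row cycle servicing a miss, other banks can keep the data bus busy. A sketch with illustrative timings (the actual row-cycle and transfer times are in the omitted figure, so the 50 ns and 5 ns below are assumptions):

```python
# Banks needed so back-to-back data transfers fully cover one row cycle.
def min_banks_for_full_utilization(row_cycle_ns, transfer_ns):
    return -(-row_cycle_ns // transfer_ns)  # ceiling division

# Example: 50 ns row cycle, 5 ns per transfer -> 10 banks keep the bus busy.
print(min_banks_for_full_utilization(50, 5))  # 10
```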
In the default configuration, a rank consists of eight ×8 2 Gb DRAM chips. A rank can also comprise sixteen ×4 chips or four ×16 chips. You can also vary the capacity of each DRAM chip—1 Gb, 2 Gb, and 4 Gb.
Now let’s turn our attention to memory power. Download a copy of the Micron power calculator from this link: https://www.micron.com//media/documents/products/power-calculator/ddr3_power_calc.xlsm.
The following questions investigate the impact of small and simple caches using CACTI and assume a 65 nm (0.065 μm) technology. (CACTI is available in an online form at
You are investigating the possible benefits of a way-predicting L1 cache. Assume that a 64 KB four-way set associative single-banked L1 data cache is the cycle time limiter in a system. For an
You have been asked to investigate the relative performance of a banked versus pipelined L1 data cache for a new microprocessor. Assume a 64 KB two-way set associative cache with 64-byte blocks. The
You are designing a write buffer between a write-through L1 cache and a write-back L2 cache. The L2 cache write data bus is 16 B wide and can perform a write to an independent cache address every
A cache acts as a filter. For example, for every 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality that they cannot be serviced by a 2 MB cache. The 2 MB
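The filtering claim above is simple rate arithmetic: only 20 accesses per 1000 instructions escape the 2 MB cache to the next level. For instance:

```python
# Traffic that passes the cache "filter", per the rates in the question.
def misses_past_cache(instructions, misses_per_kilo_instructions=20):
    return instructions / 1000 * misses_per_kilo_instructions

# A billion instructions generate 20 million accesses beyond the 2 MB cache.
print(misses_past_cache(1_000_000_000))  # 20000000.0
```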
Consider a 16 MB 16-way L3 cache that is shared by two programs A and B. There is a mechanism in the cache that monitors cache miss rates for each program and allocates 1–15 ways to each program
You are designing a PMD and optimizing it for low energy. The core, including an 8 KB L1 data cache, consumes 1 W whenever it is not in hibernation. If the core has a perfect L1 cache hit rate, it
You are designing a PMD that is optimized for low power. Qualitatively explain the impact on cache hierarchy (L2 and memory) power and overall application energy if you design an L2 cache
The ways of a set can be viewed as a priority list, ordered from high priority to low priority. Every time the set is touched, the list can be reorganized to change block priorities. With this view,
In a processor that is running multiple programs, the last-level cache is typically shared by all the programs. This leads to interference, where one program’s behavior and cache footprint can
A large multimegabyte L3 cache can take tens of cycles to access because of the long wires that have to be traversed. For example, it may take 20 cycles to access a 16 MB L3 cache. Instead of
Consider a desktop system with a processor connected to a 2 GB DRAM with error-correcting code (ECC). Assume that there is only one memory channel of width 72 bits (64 bits for data and 8 bits for
A sample DDR2 SDRAM timing diagram is shown in Figure 2.34. tRCD is the time required to activate a row in a bank, and column address strobe (CAS) latency (CL) is the number of cycles required to read
Assume that a DDR2-667 2 GB DIMM with CL = 5 is available for $130 and a DDR2-533 2 GB DIMM with CL = 4 is available for $100. Assume that two DIMMs are used in a system, and the rest of the system costs
You are provisioning a server with eight-core 3 GHz CMP that can execute a workload with an overall CPI of 2.0 (assuming that L2 cache miss refills are not delayed). The L2 cache line size is 32
Consider a processor that has four memory channels. Should consecutive memory blocks be placed in the same bank, or should they be placed in different banks on different channels?
A large amount (more than a third) of DRAM power can be due to page activation (see http://download.micron.com/pdf/technotes/ddr2/TN4704.pdf and http://www.micron.com/systemcalc). Assume you are
Virtual machines (VMs) have the potential for adding many beneficial capabilities to computer systems, such as improved total cost of ownership (TCO) or availability. Could VMs be used to provide the
Virtual machines can lose performance from a number of events, such as the execution of privileged instructions, TLB misses, traps, and I/O. These events are usually handled in system code. Thus one way of
With the adoption of virtualization support on the x86 architecture, virtual machines are actively evolving and becoming mainstream. Compare and contrast the Intel VT-x and AMD’s AMD-V
Since instruction-level parallelism can also be effectively exploited on in-order superscalar processors and very long instruction word (VLIW) processors with speculation, one important reason for
In this exercise, you will explore performance trade-offs between three processors that each employ different types of multithreading (MT). Each of these processors is superscalar, uses in-order
In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus
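The DAXPY loop the exercise analyzes (double-precision a·X plus Y) is, in scalar form:

```python
# DAXPY: y[i] = a * x[i] + y[i], the loop whose ILP the exercise studies.
def daxpy(a, x, y):
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```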
In this exercise, we will look at how variations on Tomasulo’s algorithm perform when running the loop from Exercise 3.14. The functional units (FUs) are described in the following table.Assume the
Tomasulo’s algorithm has a disadvantage: only one result can compute per clock per CDB. Use the hardware configuration and latencies from the previous question and find a code sequence of no more
Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always four cycles and the
Consider a branch-target buffer that has penalties of zero, two, and two clock cycles for correct conditional branch prediction, incorrect prediction, and a buffer miss, respectively. Consider a
Figure 1.26 gives hypothetical relevant chip statistics that influence the cost of several current chips. In the next few exercises, you will be exploring the effect of different possible design
They will sell a range of chips from that factory, and they need to decide how much capacity to dedicate to each chip. Imagine that they will sell two chips. Phoenix is a completely new architecture
Your colleague at AMD suggests that, since the yield is so poor,you might make chips more cheaply if you released multiple versions of the same chip, just with different numbers of cores. For
A cell phone performs very different tasks, including streaming music, streaming video, and reading email. These tasks perform very different computing tasks. Battery life and overheating are two
As mentioned in Exercise 1.4, cell phones run a wide variety of applications. We’ll make the same assumptions for this exercise as the previous one, that it isExercise 1.4A cell phone performs very
General-purpose processors are optimized for general-purpose computing. That is, they are optimized for behavior that is generally found across a large number of applications. However, once the domain
One challenge for architects is that the design created today will require several years of implementation, verification, and testing before appearing on the market. This means that the architect
Availability is the most important consideration for designing servers, followed closely by scalability and throughput.a. We have a single processor with a failure in time (FIT) of 100. What is the
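FIT and MTTF convert directly: FIT counts failures per 10^9 device-hours, so the FIT of 100 in part (a) corresponds to an MTTF of 10^7 hours. As a check:

```python
# FIT (failures in time) is failures per 1e9 hours of operation,
# so MTTF in hours is simply 1e9 / FIT.
def mttf_hours(fit):
    return 1e9 / fit

print(mttf_hours(100))  # 10000000.0 hours, roughly 1100 years
```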
In a server farm such as that used by Amazon or eBay, a single failure does not cause the entire system to crash. Instead, it will reduce the number of requests that can be satisfied at any one
In this exercise, assume that we are considering enhancing a quad-core machine by adding encryption hardware to it. When computing encryption operations, it is 20 times faster than the normal mode of
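Questions like this one resolve via Amdahl's law; a sketch using the stated 20× encryption speedup (the fraction of time spent encrypting is in the truncated part of the question, so the 0.5 below is only illustrative):

```python
# Amdahl's law: overall speedup given the fraction of time that is enhanced.
def overall_speedup(enhanced_fraction, enhancement_speedup):
    return 1.0 / ((1.0 - enhanced_fraction)
                  + enhanced_fraction / enhancement_speedup)

# If half the time were spent encrypting, a 20x unit yields about 1.9x overall.
print(overall_speedup(0.5, 20))
```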
When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a
Your company has just bought a new 22-core processor,and you have been tasked with optimizing your software for this processor.You will run four applications on this system, but the resource
When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: percentage of the application that can be parallelized and the cost of
Why does the Itanium have more than one type of NOP (no operation), for example nop.i and nop.b?
Consider the following generic code. Indicate all the potential data hazards in this code.
LDR r1, [r2]
LDR r3, [r4]
ADD r6, r1, r3
SUB r7, r1, r3
MUL r3, r6, r7
STR [r2], r3
The throughput of a superscalar processor depends on the number of instructions it can issue per clock cycle. Early superscalar processors were able to issue two instructions per clock. Later
IA64 general-purpose registers r0 to r127 have 65 bits, because they include a special NaT (not a thing) bit in addition to the normal 64 data bits. What is the purpose of the NaT bit and how is
Explain the meaning of the term memory disambiguation in the context of memory access.
What is the effect of the .crel completer when it is suffixed to an IA64 comparison instruction, for example, cmp.crel p1, p2 = r5, r6
Translate the following into IA64 code making use of predication.
for (i = 0; i < 100; i++) {
    if (X[i] == Y[i])
        Y[i] = Y[i] + 1;
    else
        X[i] = X[i] - 1;
}
A superscalar processor can be best thought of as a pipelined processor where the pipeline is replicated. For example, a four-way superscalar processor has four parallel pipelines allowing four
All other factors being equal, which is easier to design: a superscalar processor or a VLIW processor?
An existing superscalar processor with 32 registers in the register file is redesigned and the instruction length increased from 32 to 64 bits which allows 1,024 general-purpose registers. Providing
We said (when discussing the potential limitations of VLIW computers) "If a VLIW with m operations per bundle has an instruction with an n-cycle latency, up to m·n − 1 execution slots are lost."
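The quoted bound is worth making concrete: over the n cycles the long-latency instruction occupies, an m-wide machine exposes m·n issue slots, and in the worst case only the one holding that instruction does useful work:

```python
# Worst-case idle slots for an n-cycle-latency instruction in an m-wide VLIW:
# n cycles expose m*n slots, minus the single slot doing useful work.
def lost_slots(m, n):
    return m * n - 1

print(lost_slots(4, 3))  # 11
```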
Why is a stack machine a particularly poor candidate for the application of superscalar principles? A stack machine makes use of push and pop operations, and its data processing operations are
What is cache memory?
Why do computers use cache memory?
What is the meaning of the following terms. a. Temporal locality. b. Spatial locality.
Showing 100–200 of 1390