PART 1
GENERAL BACKGROUND
In the following sections we’ll describe the problems and the architecture to be simulated, the simulation environment, the performance metrics, and the applications whose behaviour we want to measure.
The Cache Memory and the Multiprocessor Systems
THE CACHE MEMORY: WHY, WHAT AND HOW
Over the last decades, the trend in processor technologies has not been adequately supported by
advances in memory industries: clearly there is a processor-memory performance gap that computer
architects must try to close. This, together with the guideline that smaller memories are faster but more expensive per byte, led to a hierarchy based on memories of different speeds and sizes. The goal is to provide a
memory system with cost almost as low as the cheapest level of memory and speed almost as fast as
the fastest level. Cache is the name generally given to the first level of the memory hierarchy
encountered once the address leaves the CPU.
The principle of locality is the second consideration behind the idea of caches: researchers
discovered that nearly all code was extremely repetitive in nature. If anything repetitive can be
stored in a small, very high-speed memory, then the effect of wait states can be limited only to the
less repetitive portions of the program, which can be stored in slower, less expensive memory. We
distinguish between the spatial and the temporal locality. The first regards a phenomenon by which
most computer code executes out of a small area repetitively; this space is not necessarily in a
single address range of main memory but may be spread around quite significantly. Temporal locality refers to the fact that accesses to a given location cluster in time rather than being spread through time: a processor is much more likely to need a memory location that it accessed 10 cycles ago than one it accessed 10,000 cycles ago (only about 60 µs on a 166 MHz Pentium!). Without locality in time and in space, caches would not be able to work.
For reasons of cost and speed, all the mechanisms that ensure the currently useful parts of main memory are copied into the cache must be implemented in hardware.
As said before, our systems need a deep memory structure; in order to understand the performance differences between the levels, we show below a memory hierarchy for a common system with typical sizes and access times. The size of each layer is of course limited by the price per byte.
• Registers: a few to a few tens of words; about 2 ns
• Cache
♦ On-chip (SRAM): today’s microprocessors have between 8 and 64 Kbytes; these caches are quite fast because no off-chip access is needed, but they are slower than registers since it is not possible to read and write them simultaneously; here the size is also limited because the cache shares the chip with the processor
♦ Off-chip (economical SRAM): from 256 Kbytes to 16 Mbytes; about 10 ns
• Main Memory (DRAM): between 16 and 512 Mbytes; about 70 ns
• Hard Magnetic Disks: from hundreds to thousands of Mbytes; tens of ms
• Magnetic Tape: practically unlimited size; several seconds
The behaviour of the cache memory results from the collaboration of a number of components, illustrated in figure 1 and described below.
The cache data memory is exactly the small and fast memory we were referring to before; it is used to store replicas of instructions and data that would otherwise be fetched more slowly from the main memory.
The number of bytes (or memory words) it can contain is the cache size.
The cache directory is a list of the main memory addresses of the data stored in corresponding
locations of the cache data memory. This means that with each cache location, not only is data
stored, but so is an address, making the combined directory and data RAMs behave like a single,
very wide memory.
The bus buffers are now controlled in such a way that if the cache can supply a copy of the
requested main memory location (we call that case cache hit), then the main memory is not allowed
to put its data onto the CPU’s data pins. The opposite of a hit is a cache miss.
Finally there is the cache controller, whose task is to tie the whole system together, using a certain
policy.
When an address is presented to the cache directory, a table is interrogated and, if a matching entry
is found, another address is output, showing where the location exists in the data memory. In this
way the CPU is fooled: from its perspective the data always come from the main memory, although sometimes they arrive faster than at other times. It is also important to notice that the cache must
intercept all of the CPU’s signals, both inputs and outputs, and determine whether these signals are
to be routed to/from main memory or whether they should stay local to the cache. In a way, the
cache is an insulating layer between the CPU and the outside world.
The goal of the cache is then to shorten the time the CPU has to wait to get data from the main memory. Good results require that the miss ratio, defined as the fraction of accesses that are not found in the cache, be as low as possible. The miss penalty, that is, the additional clock cycles needed to service a miss, must also be kept as low as possible. Hit time is the time to hit in the cache.
In particular, the formula to evaluate cache performance is the following:
Average memory access time = Hit time + Miss rate × Miss penalty

Figure 1: The components of a cache memory (CPU, cache directory, cache controller, cache data memory, address and data buffers, system bus)
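To make the formula concrete, the following small C++ sketch computes the average memory access time for a single-level cache; the numbers used in the example are purely hypothetical, not measurements from any real system.

```cpp
#include <iostream>

// Average memory access time (AMAT) for a single-level cache, following the
// formula above. Purely illustrative; the inputs below are hypothetical.
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main() {
    // e.g. 1-cycle hit time, 4% miss rate, 50-cycle miss penalty -> 3 cycles
    std::cout << amat(1.0, 0.04, 50.0) << " cycles\n";
    return 0;
}
```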
A problem that cache designers have to face is associativity, that is, where a given memory block may be placed. If any main memory block has the full freedom to be replicated anywhere
within the cache, then the cache is said to be fully associative. Instead when a block has only one
place it can appear in the cache, we say that we have a direct mapped cache. These are the two
extremes, but the most popular mapping algorithm in today’s cache design is set associative (within
any set the entry is associative): the address which is output from the processor is first divided
between the block address and the block offset. The block frame address is further split into the
equivalent of a prefix and a suffix at a location determined by the size and architecture of the cache.
The lower address bits are called the set bits since they contain the set address. The remaining bits
are compared against the directory entry for the selected set and are referred to as the tag bits.
Figure 2 illustrates a possible address partition.
If there are n blocks in a set, the cache placement is called n-way set associative; then direct
mapped is simply one-way set associative and a fully associative cache with m blocks could be
called m-way set associative.
The block offset field selects the desired data from the block, the set field selects the set, and the tag
field is compared for a hit. It is common procedure to add a valid bit to the tag in the cache array to
say whether or not the entry contains a valid address; if the bit is not set, there cannot be a match on
this address. Moreover, all possible tags are searched in parallel because speed is critical.
If the total cache size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the set field and increasing the size of the tag. The formula relating all these values is given below, while figure 3 shows the differences among the three types of mapping.
In the example in the figure the cache has eight blocks and the memory has sixteen blocks; real caches, by contrast, contain hundreds of block frames and real memories contain millions of blocks.
Assume the address in question is 12. The three options for caches are shown left to right. In fully
associative, block 12 from memory can go into any of the eight block frames of the cache. With
direct mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative,
which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4).
With two blocks per set, this means block 12 can be placed either in block 0 or block 1 in the cache.
2^(set bits) = Cache size / (Block size × Set associativity)

Figure 2: How a memory address is subdivided (the block address, made of the tag and the set fields, followed by the block offset)
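As an illustration of how the fields are derived from this formula, here is a minimal C++ sketch that splits an address for a hypothetical 32 Kbyte, 4-way set associative cache with 32-byte blocks (all sizes assumed to be powers of two); it is not the code of our simulator.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical geometry: 32 Kbyte cache, 32-byte blocks, 4-way set associative.
const std::uint32_t CACHE_SIZE = 32 * 1024;
const std::uint32_t BLOCK_SIZE = 32;
const std::uint32_t ASSOC      = 4;
const std::uint32_t NUM_SETS   = CACHE_SIZE / (BLOCK_SIZE * ASSOC);  // 2^(set bits) = 256

int main() {
    std::uint32_t addr   = 0x12345678u;                       // an arbitrary example address
    std::uint32_t offset = addr % BLOCK_SIZE;                 // block offset field
    std::uint32_t set    = (addr / BLOCK_SIZE) % NUM_SETS;    // set (index) field
    std::uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);    // remaining high-order bits
    std::printf("tag=0x%x set=%u offset=%u\n", tag, set, offset);
    return 0;
}
```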
Another decision that computer architects have to make in designing a cache memory system regards the replacement algorithm. Every time the CPU cannot find what it needs in the cache, it must perform a main memory access: a commonly used location may then be displaced by another commonly used location, causing what is called thrashing. Except for direct mapped caches, where there is no choice, the cache controller must select a block to be replaced by the desired data, and several algorithms have been proposed. The two primary strategies are Random and Least Recently Used (LRU): in the latter, the block replaced is the one that has been unused for the longest time. Often it is not worth increasing the circuit complexity needed to implement LRU, and Random is chosen; at other times LRU is only approximated.
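The following sketch shows one possible way to choose the victim block within a set under the two strategies; the data structure (a per-way timestamp of the last access, updated by the controller on every hit) is a hypothetical simplification of what a real controller, which usually only approximates LRU, would keep.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One set of an n-way set associative cache; last_used[w] holds the time of the
// most recent access to way w.
struct Set {
    std::vector<std::uint64_t> last_used;
};

// LRU: evict the way whose last access is the oldest.
int lru_victim(const Set& s) {
    int victim = 0;
    for (std::size_t w = 1; w < s.last_used.size(); ++w)
        if (s.last_used[w] < s.last_used[victim])
            victim = static_cast<int>(w);
    return victim;
}

// Random: no history needed at all, just pick any way.
int random_victim(const Set& s) {
    return std::rand() % static_cast<int>(s.last_used.size());
}
```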
Usually reads dominate processor cache accesses, but a high performance design cannot neglect the speed of writes, which normally take longer than reads. Another complication is that the processor also specifies the size of the write, so only that portion of a block can be changed; reads, in contrast, can access more bytes than necessary without harm. The write policies often distinguish cache designs. There are two basic options when writing to the cache:
• Write through: the information is written to both the block in the cache and to the block in the
lower-level memory.
• Write back: the information is written only to the block in the cache. The modified cache block
is written to main memory only when it is replaced.
To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether the block is dirty, that is modified, or clean. Both
the strategies have their advantages, but since with write back some writes don’t go to memory, it
uses less memory bandwidth and therefore it is the most attractive in multiprocessors. Write
through, instead, helps in keeping the cache consistent with lower levels of the memory hierarchy.
Moreover, since the data are not needed on a write, for both policies there are two common options on a write miss. With write allocate, the behaviour is similar to a read miss: the block is loaded into the cache and then written. The other option, usually called no-write allocate, modifies the block only in the lower level and does not load it into the cache.
This decision, as the others, is strongly dependent on the applications for which caches are to be used.
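To summarise the combinations discussed above, here is a toy C++ model that counts the traffic a write sends to the lower level under write-through/no-write-allocate and write-back/write-allocate; it is a deliberately simplified sketch (block granularity, unlimited capacity), not an accurate cache model.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model at block granularity with unlimited capacity: enough to show where
// the two write policies generate traffic towards the lower level.
struct Block { bool dirty = false; };

enum class WritePolicy { WriteThrough, WriteBack };

struct Cache {
    std::unordered_map<std::uint32_t, Block> blocks;
    WritePolicy policy;
    unsigned long memory_writes = 0;              // writes sent to the lower level

    void write(std::uint32_t block_addr) {
        auto it = blocks.find(block_addr);
        if (it == blocks.end()) {                                  // write miss
            if (policy == WritePolicy::WriteBack)
                it = blocks.emplace(block_addr, Block{}).first;    // write allocate
            else { ++memory_writes; return; }                      // no-write allocate
        }
        if (policy == WritePolicy::WriteThrough) ++memory_writes;  // memory kept up to date
        else it->second.dirty = true;                              // written back on replacement
    }
};
```

With write back, the dirty block would reach memory only when it is eventually replaced (replacement is omitted in the sketch), which is precisely why this policy uses less bus bandwidth.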
Figure 3: The differences among the mapping algorithms (fully associative, direct mapped and 2-way set associative placement of memory block 12 in an eight-block cache)
Having a single cache is not always a good idea, because the processor needs both data and instructions. Although one cache could try to supply both, it can become a bottleneck. Separate caches are found
in most recent processors: one cache is dedicated to instructions and another to data. This fact also
offers the opportunity of optimising each cache separately: different capacities, block sizes, and
associativities may lead to better performance.
As shown above, speed is a critical factor in these systems: particularly a commonly used technique
to reduce the miss penalty is to build more cache levels, usually two. By adding another level of
cache between the original cache and the memory, the first-level cache can be small enough to
match the clock cycle time of the fast CPU, while the second-level cache can be large enough to
capture many accesses that would go to main memory, thereby lessening the effective miss penalty.
With the new structure, the performance can be evaluated using the more general formula:
Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1
with
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
When planning to introduce the second level we have to be aware that the secondary cache will very probably be idle most of the time, while a primary cache is almost never idle. But sometimes this is a good price to pay in order to have a more efficient memory system.
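As a purely illustrative example with hypothetical values, take Hit time_L1 = 1 cycle, Miss rate_L1 = 4%, Hit time_L2 = 10 cycles, Miss rate_L2 = 20% and Miss penalty_L2 = 100 cycles. Then Miss penalty_L1 = 10 + 0.2 × 100 = 30 cycles and the average memory access time is 1 + 0.04 × 30 = 2.2 cycles, compared with 1 + 0.04 × 100 = 5 cycles if every first-level miss had to go directly to main memory.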
The crucial parameters to play with in secondary caches are the size of the blocks and the size of the
cache itself. These considerations also concern whether all data in the first-level cache are always present in the second-level cache (the multilevel inclusion property). Inclusion is desirable because
consistency can be determined just by checking the second level, but it also involves some
drawbacks. Multiprocessor systems always use the inclusion property.
Summarising the second-level cache considerations, the essence of cache design is balancing fast
hits and few misses: this leads to larger caches with higher associativity and larger blocks.
Another important factor is the number of misses. To analyse it, misses are usually divided into three categories, and reducing one of them often increases another (one common way a simulator can attribute misses to these categories is sketched after the list):
• Compulsory Misses: the very first access to a block cannot be in the cache, so the block must be
brought into the cache.
• Capacity Misses: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks being discarded and later retrieved.
• Conflict Misses: if the block placement strategy is set associative or direct mapped, conflict
misses (in addition to compulsory and capacity misses) will occur because a block can be
discarded and later retrieved if too many blocks map to its set.
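The sketch below shows one common way for a simulator to attribute each miss to one of these categories: a miss is counted as compulsory if the block has never been referenced before, as a capacity miss if a fully associative LRU cache of the same total size (modelled separately; here just a hypothetical boolean input) would also have missed, and as a conflict miss otherwise.

```cpp
#include <cstdint>
#include <unordered_set>

// Miss classification sketch. 'fully_assoc_lru_missed' must be supplied by a
// separate, hypothetical model of a fully associative LRU cache of the same
// total size, run on the same reference stream.
enum class MissType { Compulsory, Capacity, Conflict };

std::unordered_set<std::uint64_t> ever_seen;   // every block address referenced so far

MissType classify_miss(std::uint64_t block_addr, bool fully_assoc_lru_missed) {
    if (ever_seen.insert(block_addr).second)   // first reference ever to this block
        return MissType::Compulsory;
    if (fully_assoc_lru_missed)                // the same-size ideal cache missed too
        return MissType::Capacity;
    return MissType::Conflict;                 // only the restricted mapping caused it
}
```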
The goal in cache design is to optimise the performance for a given set of applications: architects achieve it by careful, quantitative analysis, trying to tune the listed factors in the best fashion and perhaps
using other tricks that reference [2] clearly explains.
In this section we referred mainly to [2] and [3], in which the reader can find many more details and related considerations.
THE CACHE COHERENCE PROBLEM
As seen in the previous section, most modern processors use caches to hide the growing disparity
between processor and memory speeds. On a uniprocessor, the effectiveness of a cache depends
primarily on the hit rate, which in turn depends on such factors as cache and working set sizes, the
amount of temporal and spatial locality in the reference stream, the degree of associativity in the
cache, and the replacement and write policies.
One reason for adding a cache to a system has not yet been mentioned: in a multiprocessor, caches reduce the traffic each processor puts on the bus to main memory and thus improve the effective bus bandwidth.
We will see next that the use of cache memories to reduce a processor’s main memory bandwidth is a very important benefit and should become even more important in the near future with the growing acceptance of multiprocessor architectures as a means of dramatically improving system performance. Multiple-processor systems can get into trouble if bus utilisation approaches 100%, and they perform worse than systems with fewer processors once that limit is reached.
Since microprocessors are likely to remain the dominant uniprocessor technology, the logical way
to improve performance beyond a single processor is by connecting multiple microprocessors
together. This is likely to be more cost-effective than designing a custom processor.
The most prevalent form of parallel architecture is the multiprocessor of small or moderate scale
that provides a global physical address space and symmetric access to all of main memory from any
processor, often called a symmetric multiprocessor (SMP) or tightly coupled system. Every
processor has its own cache, and all the processors and memory modules attach to the same
interconnect, which is usually a shared bus (Figure 4). Task allocation is performed by software at
run time, offering complete flexibility in the range of application programs that can be run on the
machine, while the shared address space programming model is supported directly by hardware.
SMPs dominate the server market and are becoming more common on the desktop. They are also
important building blocks for larger-scale systems.
Figure 4: Basic structure of a shared-memory multiprocessor (several processors, each with one or more levels of cache, connected to main memory and the I/O system by a shared bus)
This kind of architecture is used where the required scalability is up to a few dozen processors. To
support larger processor counts, memory must be distributed among the processors rather than
centralised. Distributing the memory among the nodes has two major benefits. First, it is a cost-
effective way to scale the memory bandwidth, if most of the accesses are to the local memory in the
node. Second, it reduces the latency for accesses to the local memory. The key disadvantage is that
communicating data between processors becomes somewhat more complex and has higher latency.
We will not go deeper into the discussion of these systems, often called distributed shared-memory (DSM) or non-uniform memory access (NUMA); the interested reader is invited to take a look at [2] and [29].
The reason that SMP systems need caches is not hard to understand. Each processor has its own cache, and the more often a memory access is satisfied by the cache, the less often the processor uses the bus to access main memory. This helps to keep the bus from saturating; that is, utilisation is lowered to the point that the bus remains available in a timely manner to those processors that need it. The amount
of bus bandwidth required by a cached processor is strongly related to the cache’s miss rate, so
designers of multiple-processor systems focus much more attention on miss rates than would
designers of single-processor systems. A designer of a single-processor system might hesitate to
spend extra money to improve the hit ratio of a cache design from 96% to 98%. In a multiple-
processor system, however, the difference between a hit rate of 96% and a hit rate of 98% translates
to a difference in miss rates of 4% and 2%, that is, a factor of two. In other words, one design would require half the bus bandwidth of the other, so twice as many processors could be used in the system before the effects of saturation would be noticed. Moreover, caches play an essential role in reducing the
average data access time as seen by the processors.
A critical challenge due to the addition of caches to multiple-processor systems is named cache
coherence. The problem arises when copies of the same memory block are present in the caches of
one or more processors; if a processor writes to and hence modifies that memory block, then, unless
special action is taken, the other processors will continue to access the old, stale copy of the block
that is in their caches. The example in table 1 illustrates the problem.
Time  Event                   Cache contents  Cache contents  Memory contents
                              for CPU A       for CPU B       for location X
0                                                             1
1     CPU A reads X           1                               1
2     CPU B reads X           1               1               1
3     CPU A stores 0 into X   0               1               0
Table 1: Example of a coherency problem
Cache coherence problems arise even in uniprocessors when I/O operations occur, because data,
using DMA, is moved between memory and peripheral components without involving the processor.
Since I/O operations are much less frequent than memory operations, several coarse solutions have
been adopted in uniprocessors. In multiprocessors, reading and writing of shared variables by
different processors is expected to be a frequent event since it is the way that multiple processes
belonging to a parallel application communicate with each other.
Fortunately, the techniques and support used to solve the multiprocessor cache coherence problem
also solve the I/O coherence problem. Essentially all microprocessors today provide support for
multiprocessor cache coherence. The operating system itself benefits greatly from transparent,
hardware-supported coherence of its data structures.
Informally, we could say that a memory system is coherent if any read of a data item returns the
most recently written value of that data item.
More formally, a multiprocessor memory system is coherent if the results of any execution of a
program are such that, for each location, it is possible to construct a hypothetical serial order of all
operations to the location (i.e. put all reads/writes issued by all processes into a total order) that is
consistent with the results of the execution and in which
1. operations issued by any particular process occur in the order in which they were issued to the
memory system by that process, and
2. the value returned by each read operation is the value written by the last write to that location in the serial order.
Two properties are implicit in the definition of coherence: write propagation means that writes
become visible to other processes; write serialisation means that all writes to a location (from the
same or different processes) are seen in the same order by all processes.
For a more detailed discussion about the coherence issue, we suggest [1].
Small-scale multiprocessors solve the problem with a hardware solution, by introducing a protocol to keep the caches coherent. These protocols are called cache-coherence protocols (CCP) and they work by tracking the sharing state of each data block. Two classes of protocols, which use different techniques to track the sharing status, are in use:
• Directory based: The sharing status of a block of physical memory is kept in just one location,
called the directory.
• Snooping: Every cache that has a copy of the data from a block of physical memory also has a
copy of the sharing status of the block, and no centralised state is kept. The caches are usually
on a shared-memory bus, and all cache controllers monitor or snoop on the bus to determine
whether or not they have a copy of a block that is requested on the bus.
Snooping protocols became popular with multiprocessors using microprocessors and caches
attached to a single shared memory because these protocols can use a pre-existing physical
connection to interrogate the status of the caches.
Directory-based protocols, which will not be analysed further here, are well described in [2].
There are two ways to maintain the coherence requirement. One method is to ensure that a
processor has exclusive access to a data item before it writes that item. This style of protocol is
called a write invalidate protocol because it invalidates other copies on a write. The alternative is to
update all the cached copies of a data item when that item is written. This type of protocol is called
a write update or write broadcast protocol. Because bus and memory bandwidth is usually the
commodity most in demand in a bus-based multiprocessor, invalidation has become the protocol of
choice for almost all implementations.
The key to implementing an invalidate protocol in a small-scale machine is the use of the bus to
perform invalidates. To perform an invalidate the processor simply acquires bus access and
broadcasts the address to be invalidated on the bus. All processors continuously snoop on the bus
watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding cache block is invalidated.
In addition to invalidating outstanding copies of a cache block that is being written into, we also
need to locate a data item when a cache miss occurs. As said before, since write-back caches
generate lower requirements for memory bandwidth, they are greatly preferable in a multiprocessor,
despite the slight increase in complexity.
For a write-back cache the problem of finding the most recent data value is quite hard, since it can
be in a cache rather than in memory. Fortunately, write-back caches can use the same snooping
scheme both for cache misses and for writes. Each processor snoops every address placed on the
bus. If a processor finds it has a dirty copy of the requested cache block, it provides that cache block
in response to the read request and causes the main memory access to be aborted.
At this point, we just need to introduce an example protocol. We’ll discuss the simplest; many others have since been proposed that build on its features but change some characteristics, mainly depending on the applications they were created for.
The protocol is named MSI, after the three states it uses; it is a “write invalidate” protocol with a “write-back after first write” memory-write policy.
A bus based cache coherence protocol is usually implemented by incorporating a finite state
controller in each node. This controller responds to requests from the processor and from the bus,
changing the state of the selected cache block, as well as using the bus to access data or to
invalidate it.
The finite state machine associated to the protocol is shown in figure 5.
Cache Blocks States:
1. MODIFIED (M): also called DIRTY, only this cache has a valid copy of the block, and the copy
in the memory is stale; read/write
2. SHARED (S): the block is present in an unmodified state in this cache, main memory is up-to-
date, and zero or more other caches may also have an up-to-date (shared) copy; read only
3. INVALID (I): the block is either not present in this cache or its copy here is not valid
Before a shared or invalid block can be written and placed in the modified state, all the other
potential copies must be invalidated via a read-exclusive bus transaction.
When the processor tries to read or write a memory block that doesn’t exist in the cache, a block
currently in the cache will have to be replaced by the newly requested block, and if the existing
block is in the modified state, its contents will have to be written back to main memory.
Figure 5: The finite state machine associated with the cache coherence protocol MSI (states M, S and I; transitions labelled with the observed event and the generated action, e.g. PrRd/BusRd, PrWr/BusRdX, BusRd/Flush, BusRdX/Flush, PrRd/-, BusRd/-, BusRdX/-)
Bus transactions:
• Bus Read (BusRd): This transaction is generated by a Processor Read (PrRd) that misses in the
cache, and the processor expects a data response as a result. The cache controller puts the
address on the bus and asks for a copy that it does not intend to modify. The memory system
(possibly another cache) supplies the data.
• Bus Read Exclusive (BusRdX): This transaction is generated by a Processor Write (PrWr) to a
block that is either not in the cache or is in the cache but not in the modified state. The cache
controller puts the address on the bus and asks for an exclusive copy that it intends to modify.
The memory system (possibly another cache) supplies the data. All other caches are invalidated.
Once the cache obtains the exclusive copy, the write can be performed in the cache. This
transaction serves to order the write as well as cause the invalidations and hence ensure that the
write becomes visible to others (write propagation). The processor may require an
acknowledgement as a result of this transaction.
• Bus Write Back (BusWB): This transaction is generated by a cache controller on a write back; the processor does not know about it and does not expect a response. The cache controller puts the address and the contents of the memory block on the bus. The main memory is updated with the latest contents.
The new action needed to support write-back protocols is that, in addition to changing the state of
cached blocks, a cache controller can intervene in an observed bus transaction and flush the
contents of the referenced block from its cache onto the bus rather than allowing the memory to
supply the data. The cache controller can also initiate bus transactions, supply data for write backs,
or pick up data supplied by the memory system.
State transitions (a code sketch of these rules follows the list):
• A processor read to a block that is invalid (or not present) causes a BusRd transaction to service the miss.
♦ The newly loaded block is promoted, moved up in the state diagram, from invalid to the shared state in the requesting cache, whether or not any other cache holds a copy.
♦ Any other caches with the block in the shared state observe the BusRd but take no special action, allowing main memory to respond with the data.
♦ If a cache has the block in the modified state (there can only be one) and it observes a BusRd transaction on the bus, then it must get involved in the transaction since the copy in main memory is stale. This cache flushes the data onto the bus, in lieu of memory, and demotes its copy of the block to the shared state. The memory and the requesting cache both pick up the block.
• Writing into an invalid block is a write miss, which is serviced by first loading the entire block and then modifying the desired bytes within it. The write miss generates a read-exclusive bus transaction.
♦ The block of data returned by the read exclusive is promoted to the modified state, and the desired bytes are then written into it.
♦ All other cached copies of the block are invalidated, thereby granting the requesting cache exclusive ownership of the block.
♦ If another cache later requests exclusive access, then in response to its BusRdX transaction this block will be invalidated (demoted to the invalid state) after flushing the exclusive copy to the bus.
• Writing into a shared block is treated essentially like a write miss, using a read-exclusive bus transaction to acquire exclusive ownership.
♦ The data that comes back in the read exclusive can be ignored in this case, unlike when writing to an invalid or not-present block, since it is already in the cache. The block in the requesting cache transitions to the modified state.
• Writes to a block while it is in the modified state and reads to a block not in the invalid state generate no bus transactions.
• A replacement of a block from a cache logically demotes the block to invalid (not present) by removing it from the cache. A replacement therefore causes the state machines for two blocks to change states in the cache.
♦ If the block being replaced was in the modified state, the replacement transition from M to I generates a write-back transaction.
♦ No special action is taken by the other caches on this transaction.
♦ If the block being replaced was in the shared or invalid state, then it does not itself cause any transaction on the bus.
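The transition rules just listed can be condensed into a small function. The following C++ sketch covers only the state changes and the symbolic bus actions of figure 5 for a single block; data movement, replacements and the bus arbitration a real controller needs are omitted.

```cpp
// States, observed events and symbolic bus actions for one cache block.
enum class State  { M, S, I };
enum class Event  { PrRd, PrWr, BusRd, BusRdX };
enum class Action { None, IssueBusRd, IssueBusRdX, Flush };

// Apply one event to the block's state and return the action the controller
// would generate, following the MSI rules above.
Action msi_transition(State& s, Event e) {
    switch (s) {
    case State::I:
        if (e == Event::PrRd) { s = State::S; return Action::IssueBusRd;  }
        if (e == Event::PrWr) { s = State::M; return Action::IssueBusRdX; }
        return Action::None;                    // bus events leave an invalid block alone
    case State::S:
        if (e == Event::PrWr)   { s = State::M; return Action::IssueBusRdX; }
        if (e == Event::BusRdX) { s = State::I; return Action::None; }       // invalidated
        return Action::None;                    // PrRd and BusRd: stay shared, no action
    case State::M:
        if (e == Event::BusRd)  { s = State::S; return Action::Flush; }      // supply data
        if (e == Event::BusRdX) { s = State::I; return Action::Flush; }      // supply and invalidate
        return Action::None;                    // PrRd and PrWr hit locally
    }
    return Action::None;
}
```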
With write-back protocols, a block can be written many times before the memory is actually
updated.
A read may obtain data not from memory but rather from a writer’s cache, and in fact it may be this
read rather than a replacement that causes memory to be updated.
Write hits do not appear on the bus, so the concept of a write being performed with respect to other
processors is a little different. In fact, to say that a write is being performed means that the write is
being “made visible”. A write to a shared or invalid block is made visible by the bus read-exclusive
transaction it triggers. The writer will “observe” the data in its cache after this transaction. The write
will be made visible to other processors by the invalidations that the read exclusive generates, and
those processors will experience a cache miss before actually observing the value written. Write hits
to a modified block are visible to other processors but again are observed by them only after a miss
through a bus transaction.
Thus, in the MSI protocol, the write to a nonmodified block is performed, or made visible, when the BusRdX transaction occurs, and the write to a modified block is made visible when the block is updated in the writer’s cache.
A problem arises with our MSI if we consider a sequential application running on a multiprocessor.
Such multiprogrammed use in fact constitutes the most common workload on small-scale
multiprocessors. When the process reads and then modifies a data item, the MSI protocol generates two bus transactions even though there are never any sharers. The first is a BusRd that gets the
memory block in S state, and the second is a BusRdX that converts the block from S to M state. By
adding a state that indicates that the block is the only (exclusive) copy but is not modified and by
loading the block in this state, we can save the latter transaction since the state indicates that no
other processor is caching the block.
The solution to this problem led to the birth of a new protocol, MESI, which in some situations can be optimised in further ways; other protocols have then been created. A full treatment of a number of them can be found in [3].
We believe that in these few pages we have presented the background needed to understand the environment in which we will be working and the problems we have to face. However, much more can be found in the literature: in particular, [1], [2] and [3] present very interesting analyses and discussions.
Performance Evaluation Methodology
There are three possible approaches for performance evaluation of computer architecture:
simulation, building analytical models, or prototype construction. Prototype construction is time-
consuming and expensive, and therefore something that is generally avoided until a late stage of
product development. Analytical methods have been explored primarily because of their speed, but
usually they are less accurate than simulations and do not allow for the possibility of trying different
software scheduling and execution schemes.
Therefore the preferred method for performance evaluation today is the use of simulators, since
they are more flexible and cost-efficient than building prototypes and more accurate than other
known methods. Moreover simulation allows for detailed statistics collection, providing a better
understanding of the trade-offs involved and facilitating further performance tuning. In order to get an idea of the factors involved in building a multiprocessor system, in our case its cache system, it can be useful to build a less complex simulator that is still sufficiently detailed to include the important hardware- and software-related issues.
The simulation methodologies can be classified into three groups according to how the workload is
simulated. Program-driven simulators use real programs as their workload, either as unmodified
binaries or in a format specific to the simulator. Distribution-driven workloads make use of random
variables of, for instance, memory references, locality, and program behaviour; typically the
stochastic parameters are based on statistics from a reference system.
Our work instead uses a trace-driven simulator: this category uses traces of input events, such as memory references, captured on an existing system, but does not execute any code itself.
A simulator with program-driven load mimics the target system closely. Real application programs
are executed on top of the simulator. The advantage is flexibility in target system models and
accurate results for the applications used in the simulations. A very important weakness is that such
a simulator tends to be computationally expensive and it is not always practical to construct one.
The construction of such a simulator is often extremely complicated, due to the need to recreate the
entire execution environment; this is especially true for complex program systems including system
software. A particular type of simulator using program-driven workload is an execution-driven
simulator, where the code is executed on a real processor and the memory system is simulated. The
execution is interrupted at predefined interaction points (such as access to shared data or message
passing, this depends on the target architecture), and the simulator takes over. The advantage of this
approach is speed, but it is considerably less flexible.
A distribution-driven workload is based on statistics on system behaviour. Simulation is fast and it
is easy to tweak system parameters in order to see what would happen in various application
scenarios. The difficulty lies in obtaining accurate parameters and in modelling how these parameters interact.
A simulator with a trace-driven workload uses as input memory reference traces generated by programs running on an existing system.
A primary advantage of this method, shared with distribution-driven and analytical models, is that no code execution is performed; thus the complexity of correctly modelling an instruction set and all the devices needed to create the execution environment for real programs is avoided.
The major drawback is that the reliability of the results is heavily dependent on the traces.
Everything works well as long as the target system is fairly similar to the system the traces were
captured from. Using traces captured on a uniprocessor for a multiprocessor simulation is more difficult because, for example, the execution order will be different on the target system.
Furthermore, the traces need to be large to obtain as high a probability of valid results as possible. [25] claims that it is possible to concatenate different traces under certain circumstances. The criterion to be satisfied is that the number of new references is almost constant over time and across traces; this requirement is not often met, and it is then quite difficult to obtain data sets large enough to work with.
However, trace-driven simulation is the most widely used method to evaluate caches, even if it demands large amounts of storage and computer time. Several techniques have been proposed to improve the simulations: [13] is fully dedicated to this topic, and [1] also dedicates a chapter to it.
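As an indication of how simple the workload side of a trace-driven simulator can be, here is a minimal C++ sketch of the input loop; the one-record-per-line format ("R" or "W" followed by a hexadecimal address) is an assumption made for the example and does not correspond to any particular trace format.

```cpp
#include <cstdio>

// Input loop of a trace-driven simulator; the trace format assumed here is hypothetical.
int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s tracefile\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "r");
    if (!f) { std::perror("fopen"); return 1; }

    char op = 0;
    unsigned long long addr = 0, reads = 0, writes = 0;
    while (std::fscanf(f, " %c %llx", &op, &addr) == 2) {
        if (op == 'R') ++reads; else ++writes;
        // cache.access(addr, op == 'W');   // hand the reference to the (hypothetical) cache model
    }
    std::fclose(f);
    std::printf("references: %llu reads, %llu writes\n", reads, writes);
    return 0;
}
```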
Performance Metrics
The final task of the simulator is to collect statistics about the cache behaviour and to relate them to the parameters characterising the system. In this fashion we can identify the bottlenecks and determine trade-offs between the different configurations and policies. All the parameters we are interested in are non-time related, meaning that we do not collect timing information; however, the extension would be rather easy, using analytical models and statistics about the different delays encountered.
Our main concern when evaluating performance is the traffic on the bus, which is the factor that most limits an efficient utilisation of multiprocessors.
Of course the primary source of traffic is misses. In order to see how the different components behave, we distinguish between reads and writes, and between data and instructions. Moreover, it is useful to understand how helpful the cache hierarchy is: we therefore want to report how many accesses are satisfied in the first level and how many in the second level.
We are also interested in how the different policies affect performance. For the most common types of cache coherence protocols we identified four categories of bus accesses: read misses, invalidations (including write misses), flushes and writebacks. The first three mainly concern coherency traffic, while the last deals with the replacement of blocks.
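A possible way to organise these counters in the simulator is sketched below; the structure and field names are illustrative, not the actual implementation.

```cpp
// Counters for the four categories of bus accesses listed above.
struct BusStats {
    unsigned long read_misses   = 0;
    unsigned long invalidations = 0;   // includes the write misses
    unsigned long flushes       = 0;
    unsigned long writebacks    = 0;

    unsigned long coherency_traffic() const {
        return read_misses + invalidations + flushes;
    }
    unsigned long total() const {
        return coherency_traffic() + writebacks;   // writebacks come from replacements
    }
};
```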
By analysing and comparing the collected data we are able to better tune the parameters determining the configuration of the cache system and the adopted strategies. For example, a high number of writebacks when using LRU means that either the associativity has to be increased or it is not worth introducing the complexity of the replacement algorithm.
Similar conclusions can be drawn by considering the other output values.