Cache Related Issues in Multiprocessors

 

Calin Cascaval

Dept. of Computer Science

Univ. of Illinois at Urbana-Champaign

Why Caches?

Key observation: programs exhibit locality of reference, i.e., a program accesses a set of locations repeatedly, then moves to another set, and so on.
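As a small illustration (a C sketch added here, not from the original slides), traversing a matrix in the order it is laid out in memory exploits spatial locality, while traversing it column by column does not:

```c
#include <stddef.h>

#define N 512

/* Row-major traversal: consecutive iterations touch consecutive
   addresses, so every word of each fetched cache line gets used. */
long sum_rowmajor(int m[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations stride by N ints,
   touching a different cache line on almost every access. */
long sum_colmajor(int m[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Self-check: both orders compute the same sum; only the
   cache behaviour differs. */
static int demo[N][N];
int demo_sums_match(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            demo[i][j] = i + j;
    return sum_rowmajor(demo) == sum_colmajor(demo);
}
```

Both functions return the same result; on a machine with line-based caches, only the row-major version makes full use of each line it brings in.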

What are Caches?

Fast, expensive memories (SRAMs) that operate at, or close to, the speed of the processor.

About Caches

Caches can be organized in hierarchies: caches closer to the processor are smaller and faster, while caches closer to memory are bigger, slower, and cheaper per byte.

A cache line is the unit of transfer between memory and cache; its size is usually a power of 2. Line-based transfer is based on two observations:

Larger is not necessarily better!

Locality

Locality (cont.)

The best solution for matrix multiplication is tiling:

do jj = 1, N, T
  do ii = 1, N, T
    do kk = 1, N, T
      do j = jj, min(jj+T-1, N)
        do i = ii, min(ii+T-1, N)
          if (kk .eq. 1) c(i, j) = 0
          do k = kk, min(kk+T-1, N)
            c(i, j) = c(i, j) + a(i, k) * b(k, j)
          end do
        end do
      end do
    end do
  end do
end do

 

Important: T (the tile size) should be chosen such that the blocks fit in the cache.
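The same tiling idea can be sketched in C (a hedged sketch added here: N and T are illustrative values, chosen so that N is a multiple of T and three T-by-T blocks of doubles fit comfortably in cache):

```c
#include <string.h>

#define N 64
#define T 16  /* illustrative tile size; pick T so the tiles fit in cache */

/* c = a * b with all three loops tiled. c is zeroed up front so each
   kk-tile can accumulate partial sums safely. N is assumed to be a
   multiple of T, so no min() bounds clipping is needed. */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += T)
        for (int ii = 0; ii < N; ii += T)
            for (int kk = 0; kk < N; kk += T)
                for (int j = jj; j < jj + T; j++)
                    for (int i = ii; i < ii + T; i++)
                        for (int k = kk; k < kk + T; k++)
                            c[i][j] += a[i][k] * b[k][j];
}

/* Self-check: identity * b must reproduce b exactly. */
static double ia[N][N], ib[N][N], ic[N][N];
int matmul_tiled_check(void) {
    for (int i = 0; i < N; i++) {
        ia[i][i] = 1.0;
        for (int j = 0; j < N; j++) ib[i][j] = (double)(i + j);
    }
    matmul_tiled(ia, ib, ic);
    return ic[3][5] == 8.0 && ic[N - 1][N - 1] == (double)(2 * (N - 1));
}
```

Each (ii, jj, kk) iteration works on three T-by-T blocks, so the data touched by the inner three loops stays resident in the cache while it is reused.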

Caches in SMPs

Cache Coherence

The system must provide a coherent, uniform view of the memory to all processors, despite the presence of local, private cache storage.

Options:

Cache Coherence (cont.)

Another classification:

 

Smaller lines are better as units of cache coherence!

What is False Sharing?

Two processors share a multi-word block because they need to access two different words that happen to lie in the same cache block [Tor90].

If one of the accesses to the block is a write, false sharing can induce a large number of cache misses (invalidations).

False sharing is an artifact introduced by data collocation: it depends on the cache block (line) size and on the particular placement of data in memory.
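A minimal C illustration of the layout problem (the 64-byte line size and the structure names are assumptions added here, not from the slides): per-thread counters packed into one cache line are falsely shared, while aligning each counter to its own line removes the conflict.

```c
#include <stdalign.h>  /* C11 alignas */

#define LINE 64        /* assumed cache-line size in bytes */
#define NTHREADS 4

/* Falsely shared: all four counters fit in one cache line, so a write
   by any thread invalidates the line in every other thread's cache,
   even though no word is actually shared. */
struct counters_shared {
    long count[NTHREADS];
};

/* Padded and aligned: each counter occupies its own cache line, so
   threads updating different counters never touch the same line. */
struct padded_counter {
    alignas(LINE) long count;
};
struct counters_padded {
    struct padded_counter c[NTHREADS];
};
```

The fix trades memory (one line per counter instead of one line total) for the elimination of invalidation traffic; this is exactly the space/traffic trade-off the software solutions below exploit.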

Example

Hardware Solutions

Michel Dubois, Jonas Skeppstedt, Livio Ricciulli, Krishnan Ramamurthy, and Per Stenström [Dub93]

Write-through cache

Example

Hardware Solutions (continued)

Write-back caches

Example

Software Solutions

Torrellas [Tor90] proposes the following "data placement optimizations," although they were not implemented in a compiler:

Software Solutions (cont.)

Eggers and Jeremiassen [EJ91] measure and identify false sharing in several applications written in C. By applying transformations to eliminate false sharing, they reduce false-sharing misses by 40% to 75%, for a total cache-miss reduction of 20% to 30%.

Basic idea: data restructuring transformations such that

Software Solutions (cont.)

The transformations:

Group and Transpose

Indirection

Pad and Align
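These transformations can be sketched in C (array names and sizes below are illustrative assumptions, not from the slides). "Group and transpose" regroups per-processor data that was interleaved across processors so that each processor's data is contiguous, and then pads it to a line boundary:

```c
#include <stdalign.h>  /* C11 alignas */

#define LINE 64        /* assumed cache-line size in bytes */
#define NPROC 8
#define NSTATS 4

/* Before: statistics are grouped by kind, so processor p's entries are
   scattered one per row, and neighbouring processors' entries share
   cache lines — every update falsely shares a line with other
   processors. */
int stats_by_kind[NSTATS][NPROC];

/* After group-and-transpose plus pad-and-align: each processor owns one
   contiguous, line-aligned record, so no two processors ever write the
   same cache line. */
struct proc_stats {
    alignas(LINE) int stat[NSTATS];
};
struct proc_stats stats_by_proc[NPROC];
```

"Indirection" achieves a similar effect by storing per-processor pointers to separately allocated (and separately aligned) records instead of embedding the records in one array.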

Heuristics

Used to decide which transformations are applied to which data structures.

Factors:

Software Solutions (cont.)

Granston [Gra94] develops a loop-transformation theory for eliminating false sharing.

Transformations are:

 

Conclusions

Hardware techniques exist for eliminating false sharing, although nobody has implemented them in real machines.

Software techniques have been implemented in parallelizing compilers, but false sharing continues to be a problem.

Reducing false sharing is beneficial because it reduces the number of cache misses and also the coherence traffic.

References

[Tor90] J. Torrellas, M. Lam, J. Hennessy, Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates, International Conference on Parallel Processing (ICPP), 1990

[EJ91] S. Eggers, T. Jeremiassen, Eliminating False Sharing, ICPP, 1991

[Dub93] M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, P. Stenström, The Detection and Elimination of Useless Misses in Multiprocessors, ISCA, 1993

[Gra94] E. Granston, Toward a Compile-Time Methodology for Reducing False Sharing and Communication Traffic in Shared Virtual Memory Systems, Languages and Compilers for Parallel Computing, Springer-Verlag, 1994

 

 

Machine Problem on the Web site