## Software Logging under Speculative Parallelization

María Jesús Garzarán, Milos Prvulovic<sup>†</sup>, José María Llabería<sup>‡</sup>, Víctor Viñals, Lawrence Rauchwerger<sup>§</sup>, and Josep Torrellas<sup>†</sup>

Universidad de Zaragoza, Spain {garzaran,victor}@posta.unizar.es http://www.cps.unizar.es/deps/DIIS/gaz

<sup>‡</sup>Universitat Politècnica de Catalunya, Spain llaberia@ac.upc.es

Abstract

Speculative parallelization aggressively runs hardto-analyze codes in parallel. Speculative tasks generate unsafe state, which is typically buffered in caches. Often, a cache may have to buffer the state of several tasks and, as a result, it may have to hold multiple versions of the same variable. Modifying the cache to hold such multiple versions adds complexity and may increase the hit time. It is better to use logging, where the cache only stores the last versions of variables while the log keeps the older ones. Logging also helps to reduce the size of the speculative state to be retained in caches.

This paper explores efficient software-only logging for speculative parallelization. We show that such an approach is very attractive for programs with tasks that create multiple versions of the same variable. Using simulations of a 16-processor CC-NUMA, we show that the execution time of such programs on a system with software logging is on average 36% shorter than on a system where caches can only hold a single version of any given variable. Furthermore, execution takes only 10% longer than in a system with hardware support for logging.

## **1** Introduction

Speculative thread-level parallelization attempts to extract parallelism from hard-to-analyze codes like those with pointers, indirectly-indexed structures, interprocedural dependences, or input-dependent patterns. The idea is to break the code into tasks and speculatively run some of them in parallel. A combination of software and hardware support tracks memory accesses at run time and, if a dependence violation occurs, the state is repaired and parallel execution resumes. Many different schemes have been <sup>†</sup>University of Illinois at Urbana-Champaign {prvulovi, torrellas}@cs.uiuc.edu http://iacoma.cs.uiuc.edu

<sup>§</sup>Texas A&M University rwerger@cs.tamu.edu

proposed, ranging from software-only [6, 14, 15] to hardware-based [3, 5, 7, 10, 11, 12, 13, 16, 18, 19, 21, 23], and targeting small systems [5, 7, 11, 12, 16, 19] or scalable ones [3, 6, 13, 14, 15, 18, 21, 23].

As a speculative task runs, it generates state that cannot be merged with main memory because it is unsafe. Different schemes handle the buffering of speculative state differently. In some schemes, each task uses its own private range of storage addresses [6, 14, 15, 22]. In most schemes, however, tasks buffer the speculative state dynamically in caches [3, 5, 11, 13, 18] or write buffers [7, 19]. If the cache or buffer overflows due to conflicts or insufficient capacity, the processor has to stall or squash the task.

The size of the speculative state to be buffered by a processor depends on the working set size of individual tasks and on the load imbalance between tasks. Indeed, task load imbalance may force a processor to buffer the state of several speculative tasks at a time, which increases the overall speculative state size. Many of these tasks may be finished, but are still speculative because a predecessor task is still running.

Buffering the speculative state of several tasks at a time may be challenging. Specifically, such state may contain individual variables that have been written by several tasks. Such variables are common in applications that have quasi-privatization access patterns. In this case, the buffer must be organized to hold several speculative versions of the same variable. Furthermore, for these variables, it is preferable to keep the last version handier, since it is more likely to be needed next.

Past work on speculative parallelization takes different approaches to handle this multi-version prob-

Many schemes do not address this issue. lem. Thus, it must be assumed that the processor stalls or squashes the task before creating a second local version of a speculative variable. Other schemes propose redesigning the cache to hold multiple speculative versions of the variable, e.g. in different ways of a set-associative cache [3, 17]. This approach complicates cache operation and may increase cache hit time. Finally, other schemes propose to store only last versions in the cache and automatically displace older speculative versions to a hardware-managed undo log [21, 23]. With logging, before a task overwrites a speculative version generated by a previous task, the hardware saves the version in the log. Logging is attractive because it reduces the size of the speculative state to be retained in caches and keeps last versions handier. However, this solution requires non-trivial hardware support.

Since logging has many advantages but also a hardware cost, this paper explores buffering multiple versions through efficient software-only logging. Logs are declared as plain user data structures and are managed in software. We present one efficient implementation. Simulations of a 16-processor CC-NUMA show that software logging is very attractive for programs with tasks that create multiple versions of the same variable. The execution time of such programs on a system with software logging is on average 36% shorter than on a system where caches can only hold a single version of any given variable. Furthermore, execution takes only 10% longer than in a system with hardware support for logging.

This paper is organized as follows: Section 2 introduces speculative parallelization and versioning; Section 3 introduces the speculation protocol used; Section 4 presents efficient software logging; Section 5 discusses our evaluation environment; and Section 6 presents the evaluation.

# 2 Speculative Parallelization and Versioning

#### 2.1 Basics of Speculative Parallelization

When several tasks run under speculative parallelization, they have a relative order imposed by the sequential code they come from. Consequently, we use the terms predecessor and successor tasks. If we give increasing IDs to successor tasks, the lowest-ID task still running is called non-speculative, while its successors are called speculative.

The set of variables that a speculative task writes is typically kept buffered away in the cache [3, 5, 11, 13, 18] or write buffer [7, 19]. These variables cannot be merged with main memory because they are unsafe. They are called the speculative state. Only when the task becomes non-speculative can its speculative state be merged with main memory.

When the non-speculative task finishes, it commits. Any state that it kept buffered can be merged with memory and the non-speculative status is passed to a successor task. When a speculative task finishes, it cannot commit. The processor that ran it can start to execute another speculative task, but the cache has to be able to hold speculative state from the two (or more) uncommitted tasks. Thus, in order to distinguish which of these tasks produced a particular cached variable, we associate a task-ID field with each variable (or line) in the cache.

## 2.2 Multiple Local Speculative Versions

In some cases, the speculative tasks that share a given cache as a reservoir for their speculative state may try to generate multiple versions of the same variable. This occurs, for example, in codes with quasi-privatization access patterns. In this case, if we have a simple cache, we may decide to support only a single version of each variable and, when a second local version is about to be created, stall the processor or squash the task.

One alternative approach that has been proposed is to redesign the cache to hold multiple versions of the same variable at a time [3, 17]. The cache must be able to buffer several lines with the same address tag but different task-IDs. For example, Figure 1-(a) shows a cache with three versions of line 0x400 generated by task-IDs i, j, and k. These lines can go into different ways of the same set in a set-associative cache [3, 17].

Unfortunately, this approach adds complexity to the cache. In addition, the extra comparisons needed affect a sensitive timing path and may increase the cache hit latency. Moreover, since all the versions share the cache, the chance of line displacements due to capacity or conflict increases. The result may be lower performance, since existing schemes typically prevent the displacement of speculative versions to memory by either squashing the task or stalling the processor.

A final shortcoming of this approach is that it makes it equally hard to access any of the versions of a given variable. Instead, we would like to be able to access the *last* version of the variable faster. Such a version is the one generated by the youngest task



Figure 1: Two ways of keeping multiple local speculative versions.

that ran on the processor and wrote the variable. It is the version that will be needed to satisfy any subsequent load by this task or younger ones.

The older versions are much less likely to be accessed. For example, they may be accessed by a read request from an old task running on a second processor. Since we expect these events to be relatively infrequent, we could afford to make accesses to non-last versions a bit slower.

Our proposed approach is to keep *last* versions in the cache, and copy *non-last* versions to a software *log structure*, mapped to the virtual space of the application. A natural implementation of the log is a list of records that are placed in memory contiguously in real time. In general, a log record includes the previous version of the variable (before it is overwritten), its address, the producer task-ID, and the overwriting task-ID. Log records can be displaced from the cache and possibly even bypass it to minimize space contention with last versions. Figure 1-(b) shows a cache with its associated log organization in memory. Version k is the last one.

## **3** Speculation Protocol Used

To give a concrete example of how a software log could be used, we use a speculative parallelization protocol similar to the one in [21]. That protocol included a hardware-based logging scheme presented in [23] that worked in the background with fairly little overhead. In this paper, we take that protocol and explore an inexpensive software implementation of logging.

In [21], speculative accesses are marked with special load and store instructions that trigger the protocol. In each node, the first speculative access to a shared data page prompts the OS to allocate a page of *local time stamps* in the local memory of the node. These time stamps will record, for each word, the ID of the youngest local task that writes the word (*PMaxW*), and the ID of the youngest local task that reads it without writing it first (*PMaxR1st*). The latter operation is also called *exposed load*. These local time stamps are needed by the protocol, and are automatically updated by dependence-detecting hardware with small overhead [21].

# 4 Efficient Software Logging

#### 4.1 Log Operations

A logging system must support four operations, namely saving a new record in the log (*Insertion*), finding a record in the log (*Retrieval*), unwinding the log to undo tasks (*Recovery*), and freeing up log records after their information is useless for retrieval or recovery (*Recycle*).

Figure 2 shows simple per-processor software structures that we use for logging. The log buffer is broken down into fixed-sized sectors that will be used to log individual tasks. The compiler estimates the size of the sectors and log buffer based on the number of writes in a task and the number of tasks per processor that are likely to be uncommitted at a time, respectively.



Figure 2: Simple per-processor software structures that we use for logging.

When a task starts running, it is dynamically assigned an entry in the Task Pointer Table and one sector in the Log Buffer. Free sectors are obtained from the Free Sector Stack. Each entry in the Task Pointer Table has two pointer fields: Next that points to the next entry to fill, and End that points to the end entry to check for overflow. If the task needs more entries than a sector, we dynamically assign another sector and link it to the previous one, while we set the Overflow bit and update the End pointer. If the Free Sector Stack runs out of entries, we resize the Log Buffer and Stack accordingly.

**Insertion.** Insertion is the most overhead-sensitive operation since it occurs frequently. At compile time, the compiler instruments stores in the code with instructions to save a log record. As shown in Figure 2, a record includes the following information about the variable that is about to be updated: its virtual address (the only one the software knows), its value before the update, and its producer Task-ID (*PMaxW*). *PMaxW* is obtained from the local timestamp page (Section 3). After the record is inserted at run time, the Next pointer is incremented. At the end of a task, all the records that it generated are in contiguous locations in one or more sectors, easily retrievable through the Task Pointer Table with the task ID.

**Recycle.** When a processor finishes a task, it tries to commit it [21]. Based on the resulting value of the commit point, it can identify which of the entries in its Task Pointer Table correspond to committed tasks. The data in those tasks' sectors will not be needed in the future and, therefore, the sectors can be recycled. Consequently, we invalidate the corresponding Task Pointer Table entries and return the sectors to the Free Sector Stack.

**Recovery** and **Retrieval.** Recovery occurs when we need to undo tasks after the detection of an out-oforder RAW dependence. Retrieval occurs when an in-order RAW dependence cannot simply be satisfied by the underlying coherence protocol because the requested version is not in a cache: another task running on the producer processor has overwritten the variable, pushing the desired version into the log. Since these two cases happen infrequently, we solve them with software exception handlers that access the logs.

#### 4.2 Insertion Overhead

Inserting a record in the log of Figure 2 involves collecting the items to save, saving them in sequence using the *Next* pointer, and advancing the pointer. Figure 3 shows the MIPS assembly instructions added before every speculative store that we log. All memory accesses in the figure are non-speculative. Overall, we need 9 instructions: 1 to check for sector overflow, 6 to collect and insert the information, 1 to increment the pointer, and 1 to update the cached time stamp.

Figure 3 shows two special instructions, namely *load half-word time stamp (lh\_ts)* and *store half-word time stamp (sh\_ts)*. These are special instructions that load the 16-bit *PMaxW* local time stamp of the variable, and update it (the cached copy only), respectively. *lh\_ts* loads the time stamp so that it can be saved in the log. *sh\_ts* updates the cached copy with the ID of the executing task. This is done to prevent the cached copy from becoming stale. The reason is that the time stamp pages in memory are read and updated in hardware by the speculation protocol as it tries to detect dependence violations (Section 3). They cannot be updated in software.

We can reduce the overhead of the instrumentation by noting that the log only needs to save the value overwritten by the *first store* to the variable in the task. Consequently, for variables accessed with speculative accesses, we can modify the instrumentation in Figure 3 to dynamically test whether or not a store is a first store in the task, and log only if it is. Such testing can be done by comparing the *PMaxW* of the variable with the ID of the executing task. If they are the same, the store is not a first store. A detailed discussion of this topic is beyond the scope of this paper.

#### 4.3 Alternative: Hardware-Only Logging

In the evaluation section, we will compare our software logging system to a hardware-only implementation of logging described in [23]. The latter uses a logging module embedded in the directory controller of each node. Log record insertion is done in the background with no overhead visible to the program. Similarly, recycling has practically no overhead. The log is kept in memory, therefore avoiding cache pollution.

# 5 Evaluation Methodology

## 5.1 Simulation Environment

We use an execution-driven simulation system based on MINT [20] to model in detail a CC-NUMA with 16 nodes. Each node contains a fraction of the shared memory and directory, as well as a 4-issue dynamic

|            | ; offse<br>bgt<br>allo | $r(r_3) = address of the variar1, r2, insertion$                                                                                                                                                                                                                           | ne variable to update. r5 = ID of the executing task<br>; check for sector overflow |  |  |  |  |
|------------|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|--|--|--|--|
| insertion: |                        | ur4, r3, offset; compute address of variabler4, 0(r2); store in logtsr4, offset(r3); load the 16-bit <i>PMaxW</i> time stampr4, 4(r2); store as a full word in logr4, offset(r3); load value of variabler4, 8(r2); store in logur2, r2, log_record_size; increment pointer |                                                                                     |  |  |  |  |
|            | addu                   | r4, r3, offset                                                                                                                                                                                                                                                             | ; compute address of variable                                                       |  |  |  |  |
|            | SW                     | r4, 0(r2)                                                                                                                                                                                                                                                                  | ; store in log                                                                      |  |  |  |  |
|            | lh_ts                  | r4, offset(r3)                                                                                                                                                                                                                                                             | ; load the 16-bit <i>PMaxW</i> time stamp                                           |  |  |  |  |
|            | SW                     | r4, 4(r2)                                                                                                                                                                                                                                                                  |                                                                                     |  |  |  |  |
|            | lw                     | r4, offset(r3)                                                                                                                                                                                                                                                             | ; load value of variable                                                            |  |  |  |  |
|            | SW                     | r4, 8(r2)                                                                                                                                                                                                                                                                  | ; store in log                                                                      |  |  |  |  |
|            | addu                   | r2, r2, log_record_size                                                                                                                                                                                                                                                    | ; increment pointer                                                                 |  |  |  |  |
|            |                        | r5, offset(r3)                                                                                                                                                                                                                                                             | ; update cached PMaxW                                                               |  |  |  |  |
|            |                        |                                                                                                                                                                                                                                                                            |                                                                                     |  |  |  |  |

Figure 3: Instructions added before an instrumented speculative store.

superscalar. The processor has a 32-entry instruction window and 4 Int, 2 FP, and 2 Ld/St units. It supports 8 pending loads and 16 stores. It also has a 2K-entry BTB with 2-bit saturating counters. Each node has a 2-way 32-Kbyte L1 D-cache and a 4-way 2-Mbyte L2, both with 64-byte lines and a write-back policy. Contention is accurately modeled. The average nocontention round-trip latencies from the processor to the on-chip L1 cache, L2 cache, memory in the local node, and memory in a remote node that is 2 and 3 protocol hops away are 1, 12, 60, 208 and 291 cycles, respectively.

We use release consistency and a cache coherence protocol like that of DASH. Pages of shared data are allocated round-robin across the nodes. We choose this allocation because our applications have irregular access patterns and the tasks are dynamically scheduled; it is virtually impossible to optimize shared data allocation at compile time. Private data are allocated locally.

For speculation, we use the protocol of Section 3. In the evaluation, we simulate all software overheads, including allocation and recycling of log sectors, and the dynamic scheduling and committing of tasks. We wrote software handlers for parallel recovery after a dependence violation and to retrieve data from logs. In addition, a processor that allocates a page of time stamps is penalized with 4,000 cycles.

#### 5.2 Workload

We use a set of scientific applications where a large fraction of the code is not analyzable by a parallelizing compiler. These applications are: *Apsi* from SPECfp2000 [8], *Track* and *Bdna* from Perfect [1], *Dsmc3d* from HPF-2 [4], *P3m* from NCSA, and *Tree* from Univ. of Hawaii [9]. We use the Polaris parallelizing compiler [2] to identify the non-analyzable sections and prepare them for speculative parallelization. The source of non-analyzability is that the de-

pendence structure is either too complicated or unknown because it depends on input data. For example, the code often has doubly-subscripted accesses to arrays. The code also has sections that have complex control flow, with conditionals that depend on array values and jump to code sections that modify the same or other arrays. In these sections, Polaris marks the speculative references, which will trigger speculation protocol actions. Polaris also identifies store instructions that may need to be logged, and we instrument them according to Section 4.

Table 1 shows the non-analyzable sections in each application. These sections are loops. The table lists the weight of these loops relative to the total *sequential* execution time of the application (%Tseq), with the I/O excluded. This value, which is obtained on a single-processor Sun Ultra 5 workstation, is on average 51.4%. The table also shows the number of invocations of these loops during program execution, the average number of instructions per invocation, and the average number of instructions per iteration. Note that all the data presented in the evaluation, including speedups, refer only to the code sections in the table.

# 6 Evaluation

We compare multiprocessors that support no logging, software logging, or hardware logging. Under no logging, a node can only hold in its cache hierarchy a single version for each variable; if the processor is about to overwrite a local version produced by an uncommitted task, the processor stalls. For software logging, we use our scheme of Sections 4.1 and 4.2. Finally, for hardware logging, we use the support of Section 4.3.

Figure 4 compares the execution time of these three systems, called *NoLog*, *Sw*, and *Hw*, respec-

| Appl    | Non-Analyzable             | % of | # of  | Iters per | Instruc  |
|---------|----------------------------|------|-------|-----------|----------|
|         | Sections (Loops)           | Tseq | Invoc | Invoc     | per Iter |
| P3m     | pp_do100                   | 56.5 | 1     | 97336     | 69165    |
| Tree    | accel_do10                 | 79.1 | 41    | 1024      | 28746    |
| Apsi    | run_do[20,30,40,50,60,100] | 29.3 | 900   | 63        | 102639   |
| Bdna    | actfor_do240               | 44.2 | 1     | 1499      | 103339   |
| Dsmc3d  | move3_goto100              | 41.2 | 80    | 46777     | 5442     |
| Track   | nlfilt_do300               | 58.1 | 56    | 502       | 5577     |
| Average |                            | 51.4 | 180   | 24533     | 52484    |

Table 1: Application characteristics. In *Apsi*, we use an input grid of 512x1x64. In *P3m*, while the loop has 97,336 iterations, we only use the first 9,000 iterations in the evaluation. Finally, in *Dsmc3d*, the data corresponds to unrolling the loop 15 times.

tively. They run on 16 processors. For each application, the bars are normalized to *NoLog* and broken down into execution of instructions (*Useful*), waiting on data, control, and structural pipeline hazards (*Hazard*), synchronization (*Sync*), waiting on data from the memory system (*Memory*), and stall when attempting to overwrite an uncommitted version in *NoLog* (*Stall*). A sixth category, measuring the execution of software handlers for data recovery and retrieval, is too small to be seen. Finally, the numbers on top of each bar show the speedup relative to the sequential execution of the code, with all the application data placed on the local memory of the single active processor.

A comparison between *NoLog* and *Sw* reveals the benefits of software logging. With software logging, processors do not stall when they overwrite local uncommitted versions. Thus, *Stall* disappears. However, software logging introduces extra instructions and memory system accesses to generate and maintain the log. As a result, it tends to have higher *Use-ful* and *Memory* times. Indirectly, the other times (*Hazard* and *Sync*) may also increase.

To understand these results, note that logging is most beneficial in applications that have both quasiprivatization access patterns and load imbalance. P3m, Tree, Apsi, and Bdna have quasi-privatization patterns and, of them, P3m and Tree have the largest imbalance. These observations are consistent with Figure 4. The figure shows that P3m, Tree and, to a lesser extent, Apsi and Bdna have Stall under NoLog. Sw removes all Stall and speeds up these applications, especially P3m and Tree. For Dsmc3d and Track, the difference between NoLog and Sw is an indirect effect of the different data layouts, which cause different cache conflicts, and the different execution timings, which result in different dependence violations found at run time. Overall, software logging is effective: Sw is on average 36% faster than *NoLog* for the four applications with quasi-privatization patterns. If we take the average of all the applications, the speedup of the 16-processor execution increases from 4.3 under *NoLog* to 6.1 under *Sw*.

The Hw system also eliminates the *Stall* time like the *Sw* system. Furthermore, it induces negligible overheads. The cost, of course, is special hardware support. Comparing *Sw* to *Hw*, we see the overhead of software logging. From the figure, we see that this overhead is very modest. Indeed, for the four applications with quasi-privatization patterns, the average overhead is only 9% of the *Sw* execution time. Therefore, we conclude that software logging is efficient as well as effective.

# 7 Conclusion

A good solution to buffer speculative state with multi-version variables is to enhance a cache hierarchy with logs. In this paper we show that software logging is inexpensive and delivers high performance for applications with quasi-privatization patterns and load imbalance. Using simulations of a 16processor CC-NUMA, we show that the execution time of such applications on a system with software logging is on average 36% shorter than on a system where caches can only hold a single version of any given variable. Furthermore, execution takes only 10% longer than in a system with hardware support for logging.

# Acknowledgments

This work was supported in part by the National Science Foundation under grants CCR-9970488, EIA-9975018, EIA-0081307, and EIA-0072102; by the CICYT of Spain under grant TIC98-0511-C02; and by gifts from IBM and Intel.



Figure 4: Execution time on a 16-node multiprocessor with different logging supports.

## References

- M. Berry, D. Chen, P. Koss, D. Kuck, and L. Pointer. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers. *International Journal of Supercomputer Applications*, 3(3):5–40, Fall 1989.
- [2] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Advanced Program Restructuring for High-Performance Computers with Polaris. *IEEE Computer*, 29(12):78–82, December 1996.
- [3] M. Cintra, J. F. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In *Proceedings of the 27th Annual International Symposium on Computer Architecture*, pages 13–24, June 2000.
- [4] I. Duff, R. Schreiber, and P. Havlak. HPF-2 Scope of Activities and Motivating Applications. Technical Report CRPC-TR94492, Rice University, November 1994.
- [5] S. Gopal, T. N. Vijaykumar, J. E. Smith, and G. S. Sohi. Speculative Versioning Cache. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture, pages 195–205, February 1998.
- [6] M. Gupta and R. Nim. Techniques for Speculative Run-Time Parallelization of Loops. In *Proceedings of Supercomputing 1998*, November 1998.

- [7] L. Hammond, M. Willey, and K. Olukotun. Data Speculation Support for a Chip Multiprocessor. In 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 58–69, October 1998.
- [8] J.L. Henning. SPEC CPU2000: Measuring Performance in the New Millenium. *IEEE Computer*, 33(7):28–35, July 2000.
- [9] Joshua E. Barnes. Treecode. Technical report, Institute for Astronomy, University of Hawaii, 1994.
- [10] T. Knight. An Architecture for Mostly Functional Languages. In ACM Lisp and Functional Programming Conference, pages 500–519, August 1986.
- [11] V. Krishnan and J. Torrellas. A Chip-Multiprocessor Architecture with Speculative Multithreading. *IEEE Trans. on Computers, Special Issue on Multithreaded Architectures*, pages 866–880, September 1999.
- [12] P. Marcuello, A. Gonzalez, and J. Tubella. Speculative Multithreaded Processors. In *Proceedings of the 1998 International Conference on Supercomputing*, July 1998.
- [13] M. Prvulovic, M. J. Garzaran, L. Rauchwerger, and J. Torrellas. Removing Architectural Bottlenecks to the Scalability of Speculative Parallelization. In *Proceedings of the 28th Annual International Symposium on Computer Architecture*, July 2001.
- [14] L. Rauchwerger and D. Padua. The LRPD Test: Speculative Run-Time Parallelization of

Loops with Privatization and Reduction Parallelization. In *Proceedings of the SIGPLAN* 1995 Conference on Programming Language Design and Implementation, pages 218–232, June 1995.

- [15] P. Rundberg and P. Stenstrom. Low-Cost Thread-Level Data Dependence Speculation on Multiprocessors. In *Fourth Workshop on Multithreaded Execution, Architecture and Compilation*, December 2000.
- [16] G. S. Sohi, S. Breach, and S. Vajapeyam. Multiscalar Processors. In *Proceedings of the 22nd Annual International Symposium on Computer Architecture*, pages 414–425, June 1995.
- [17] J.G. Steffan, C. B. Colohan, and T. C. Mowry. Architectural Support for Thread-Level Data Speculation. Technical report, CMU-CS-97-188, School of Computer Science, Carnegie Mellon University, November 1997.
- [18] J.G. Steffan, C.B. Colohan, A. Zhai, and T. C. Mowry. A Scalable Approach to Thread-Level Speculation. In *Proceedings of the 27th Annual International Symposium on Computer Architecture*, pages 1–12, June 2000.
- [19] J.Y. Tsai, J. Huang, C. Amlo, D. Lilja, and P.C.Yew. The Superthreaded Processor Architecture. *IEEE Trans. on Computers, Special Issue on Multithreaded Architectures*, 48(9):881–902, September 1999.
- [20] J. Veenstra and R. Fowler. MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'94), pages 201– 207, January 1994.
- [21] Y. Zhang. Hardware for Speculative Parallelization in DSM Multiprocessors. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, May 1999.
- [22] Y. Zhang, L. Rauchwerger, and J. Torrellas. Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors. In *Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA)*, pages 162– 174, February 1998.

[23] Y. Zhang, L. Rauchwerger, and J. Torrellas. Hardware for Speculative Parallelization of Partially-Parallel Loops in DSM Multiprocessors. In *Proceedings of the 5th International Symposium on High-Performance Computer Architecture (HPCA)*, pages 135–139, January 1999.