PLDI 2011 Tutorial:

Program Optimization through Loop Vectorization

Date and Time TBD

Abstract

Most modern microprocessors include vector extensions that exploit the fine-grain parallelism of vector operations. Programmers can take advantage of these extensions through vectorizing compilers or by programming them explicitly with intrinsics. A significant fraction of the peak performance of many of today's machines comes from their vector extensions.

This tutorial covers techniques to make the best possible use of vector extensions. We will discuss:

  1. Compiler limitations and the manual transformations that the programmer must apply to overcome them.
  2. The use of intrinsics to explicitly program vector extensions.

Compiler limitations and manual transformations will be illustrated using code snippets from real applications and common kernels processed by the IBM xlc and Intel icc compilers. In each case, we will show the report generated by the compiler, the execution time, and the speedup obtained from vectorization. The source code of all the examples will be distributed so that attendees can run the examples and measure their performance on their own machines.

The only prerequisite for this tutorial is familiarity with the C language.

Presenters

Maria J. Garzaran

Maria Jesus Garzaran is a research assistant professor in the Computer Science Department of the University of Illinois at Urbana-Champaign and a member of the Intel-Microsoft Universal Parallel Computing Research Center (UPCRC). Her research focuses on thread-level speculation, automatic performance tuning, parallel programming, and reliability. She has published in some of the most selective conferences, such as PLDI, HPCA, ISCA, PPoPP, PACT, and CGO. She was program co-chair of the Workshop on Languages and Compilers for Parallel Computing and of the Workshop on Programming Models for Ubiquitous Parallelism. She received a Ph.D. in Computer Science from the Universidad de Zaragoza, Spain, in 2002 and is the recipient of that university's 2002 Best PhD Thesis Award. She is a member of IEEE and ACM. Additional information can be found on her website.

Saeed Maleki

Saeed Maleki is a PhD student in the Department of Computer Science at the University of Illinois at Urbana-Champaign, working under the supervision of Profs. David Padua and Maria Garzaran. He received his B.S. in Mathematics and Computer Science from Sharif University of Technology, Tehran, Iran, in 2008. His main areas of interest are vectorization, parallelization, and autotuning of scientific applications.

William D. Gropp

William Gropp received his B.S. in Mathematics from Case Western Reserve University in 1977, an M.S. in Physics from the University of Washington in 1978, and a Ph.D. in Computer Science from Stanford in 1982. He held the positions of assistant (1982-1988) and associate (1988-1990) professor in the Computer Science Department at Yale University. In 1990, he joined the Numerical Analysis group at Argonne, where he was a Senior Computer Scientist in the Mathematics and Computer Science Division, a Senior Scientist in the Department of Computer Science at the University of Chicago, and a Senior Fellow in the Argonne-Chicago Computation Institute. From 2000 through 2006, he was also Deputy Director of the Mathematics and Computer Science Division at Argonne. In 2007, he joined the University of Illinois at Urbana-Champaign as the Paul and Cynthia Saylor Professor in the Department of Computer Science. Since 2008, he has also been Deputy Director for Research of the Institute for Advanced Computing Applications and Technologies at the University of Illinois. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. He has played a major role in the development of the MPI message-passing standard: he is co-author of MPICH, the most widely used implementation of MPI, was a chapter author in the MPI Forum for both MPI-1 and MPI-2, and has written many books and papers on MPI, including "Using MPI" and "Using MPI-2". He is also one of the designers of the PETSc parallel numerical library and has developed efficient and scalable parallel algorithms for the solution of linear and nonlinear equations. Gropp is a Fellow of ACM and IEEE, received the Sidney Fernbach Award from the IEEE Computer Society in 2008, and is a member of the National Academy of Engineering.

David Padua

David Padua is the Donald Biggar Willett Professor of Computer Science at the University of Illinois at Urbana-Champaign, where he has been a faculty member since 1985. He has served as a program committee member, program chair, or general chair for more than 40 conferences and workshops. He served on the editorial board of the IEEE Transactions on Parallel and Distributed Systems, as editor-in-chief of the International Journal of Parallel Programming (IJPP), and as steering committee chair of ACM SIGPLAN's Principles and Practice of Parallel Programming. He is a member of the editorial boards of the Journal of Parallel and Distributed Computing, ACM Transactions on Programming Languages and Systems (TOPLAS), and IJPP. His areas of interest include compilers, machine organization, and parallel computing. He has published more than 140 papers in those areas. He is a fellow of the IEEE and the ACM.

Description

The subject of this tutorial is the programming of the vector or SIMD (Single Instruction, Multiple Data) units available in most of today's microprocessors. The tutorial will start with an overview of the instruction sets, the cost model of microprocessor vector units, and the limitations of today's vector extensions, such as the requirement that vectors be loaded and stored with unit stride.
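To make the unit-stride constraint concrete, here is a small sketch (the function and array names are ours, not taken from the tutorial materials) contrasting a loop that maps naturally onto vector loads and stores with one that does not:

    /* Stride 1: contiguous accesses map directly onto SIMD loads and stores. */
    void unit_stride(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1.0f;
    }

    /* Stride 2: the accesses are not contiguous, so the compiler must pack
       and unpack elements, and will often decline to vectorize the loop. */
    void stride_two(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 2)
            a[i] = b[i] + 1.0f;
    }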

Both automatic and manual vectorization will be discussed. Although compilers have advanced considerably since the time of the first vectorizers, there are still many cases in which they are not effective. We will describe the type of analysis that today's compilers perform to determine whether a code segment (typically a loop) can be vectorized, as well as the limitations in the implementation of that analysis that preclude vectorization by the compiler. We will present examples of compiler reports listing the loops that were vectorized and the reason why each of the other loops was not. We will then discuss the transformations and code modifications that the programmer can apply in response to the compiler report to enable automatic vectorization. These changes include the addition of annotations specifying, for example, that a certain pair of arrays do not share elements (that is, there is no aliasing) or that the first element of an array is properly aligned. They also include manual code transformations such as loop interchange, scalar expansion, node splitting, loop peeling, and loop distribution.
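As a small illustration of the aliasing annotation mentioned above (a sketch with hypothetical names, not one of the tutorial examples), the C99 restrict qualifier asserts that two arrays do not overlap, removing the aliasing assumption that would otherwise block vectorization:

    /* Without restrict, the compiler must assume that a write to a[i]
       could modify some b[j], and will typically not vectorize the loop. */
    void vadd(float *restrict a, const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }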

We will also cover vectorization through compiler intrinsics that map directly to the SIMD instructions of the machine. In some cases, source-level transformations that enable automatic compiler vectorization are not possible or yield only small performance gains; the programmer may then need to resort to manual vectorization.
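As a taste of this style of programming, here is a minimal sketch of the previous loop written with x86 SSE intrinsics. It assumes n is a multiple of 4 and that both arrays are 16-byte aligned; a production version would need a scalar cleanup loop for the remaining elements:

    #include <xmmintrin.h>   /* SSE intrinsics */

    void vadd_sse(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);          /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&a[i], _mm_add_ps(va, vb)); /* 4 additions at once */
        }
    }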

The tutorial will be driven by examples. For each example, we will describe the outcome produced by the Intel icc and IBM xlc compilers and the performance speedups obtained by each on one or more platforms. The examples are written in C and have been taken from different sources, including real applications and common kernels.

Speedups obtained from compiler vectorization will be shown for each code. We will discuss the cases where higher speedups result from a change in the access pattern during vectorization, which enables better use of the hardware prefetcher and/or the cache hierarchy.

Outline

  1. Introduction
  2. Data Dependences (Definition)
  3. Overcoming limitations to SIMD-vectorization
     Limitations and their corresponding solutions are described in this section. Examples using xlc and icc are presented; in each case we show the report generated by the compiler, the execution time, and the speedup resulting from vectorization.
    1. Data Dependences (Acyclic and Cyclic dependence graphs)
      • Loop transformations such as the use of compiler directives, loop distribution, statement reordering, node splitting, scalar expansion, loop peeling, and loop interchange are discussed here; a loop-distribution sketch follows the outline.
      • Reductions
      • Loop induction variables
    2. Data Alignment
    3. Aliasing
    4. Non-unit strides
    5. Conditional Statements
  4. Vectorization with intrinsics
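As a preview of the loop transformations listed in item 3, the following sketch (illustrative code, not one of the tutorial examples) shows loop distribution. In the original loop, statement S2 reads a value produced by S1 in an earlier iteration, which some compilers will not vectorize as written; because the dependence from S1 to S2 is acyclic, distributing the loop preserves the semantics and leaves two loops that each vectorize cleanly:

    /* Before: S2 reads a[i-1], written by S1 in the previous iteration. */
    void before(float *a, float *d, const float *b, const float *c, int n)
    {
        for (int i = 1; i < n; i++) {
            a[i] = b[i] + c[i];      /* S1 */
            d[i] = a[i] + a[i - 1];  /* S2 */
        }
    }

    /* After distribution: all of S1 executes before all of S2. */
    void after(float *a, float *d, const float *b, const float *c, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = b[i] + c[i];
        for (int i = 1; i < n; i++)
            d[i] = a[i] + a[i - 1];
    }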

Target Audience

The intended audience is C or Fortran programmers; no prior experience with vectorization is necessary. Attendees will come out of this tutorial with a good understanding of the manual transformations and code changes that can enable automatic compiler vectorization, and they will learn the foundations of writing vector code manually with intrinsics.