Special Session: Auto-Tuning for Multicore and GPU (ATMG)

in Conjunction with the IEEE MCSoC-12

The University of Aizu, Aizu-Wakamatsu, Japan
September 20-22, 2012

ATMG Workshop Advance Program
MCSoC-12 workshop page: http://www.ieee-mcsoc.org/program.html

11:00-12:20 Session 1: GPGPU and Programming Environment
Session Chair: Takahiro Katagiri (The University of Tokyo, Japan)
11:00-11:25 Invited Speaker 1
* Hiroyuki Takizawa (Tohoku University, Japan)
    "Software Evolution for System Architecture Revolution"
Today, HPC system architectures are becoming more and more complicated, mainly to improve power efficiency. New-generation HPC systems will have more compute nodes, more processor cores, and accelerators such as GPUs and MICs. Existing application programs, on the other hand, have been developed without considering such complicated hardware configurations. Thus, application programs need to be adapted to the new systems to achieve high performance. Changes in system architecture can be revolutionary and may require an application program to change drastically to fully exploit the potential of a new system. A typical example is so-called GPU computing, in which the hot kernels of an existing application must be rewritten in a new programming model such as CUDA or OpenCL. In general, significantly modifying a large-scale application program is very difficult and error-prone. Therefore, we need to improve application programs incrementally, a process we call "software evolution." We also need a systematic way to support this software evolution in response to the system architecture revolution, because the programming it requires is likely to be labor-intensive and error-prone. Motivated by this, we have started a new research project to establish a programming framework that systematically supports software evolution. In this presentation, I will talk about the basic concept and current status of the project.
11:30-11:55 * Yaohung Tsai (National Taiwan University, Taiwan), Ray-Bing Chen (National Cheng-Kung University, Taiwan), and Weichung Wang (National Taiwan University, Taiwan)
    "Tuning Block Size for QR Factorization on CPU-GPU Hybrid Systems"
11:55-12:20 * Kazuya Matsumoto, Naohito Nakasato, Stanislav G. Sedukhin (The University of Aizu, Japan)
    "Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU"

10:40-12:30 Session 2: Numerical Computation and Auto-tuning
Session Chair: Hiroyuki Takizawa (Tohoku University, Japan)
10:40-11:05 Invited Speaker 2
* Daisuke Takahashi (University of Tsukuba, Japan)
    "Automatic Tuning for Parallel FFTs on Clusters of Multi-Core Processors"
In this talk, an automatic performance tuning method for parallel fast Fourier transforms (FFTs) is presented. The six-step FFT algorithm can be transformed into a recursive six-step FFT algorithm to reduce the number of cache misses. Since the optimal depth of recursion may depend on the problem size, a method is proposed to determine the depth of recursion that minimizes the number of cache misses. In addition, automatic tuning of all-to-all communication is also implemented. Performance results of parallel FFTs with automatic performance tuning on clusters of multi-core processors are reported.
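The depth-of-recursion search described in the abstract is an instance of empirical auto-tuning: rather than modeling the cache, the tuner times the real code for each candidate parameter on the target machine and keeps the fastest. A minimal sketch of this pattern (the function names and the toy cost model below are illustrative, not taken from the talk):

```python
import time

def autotune(kernel, candidates, trials=3):
    """Return the candidate tuning parameter with the lowest measured runtime.

    `kernel` is any callable taking one tuning parameter, e.g. a recursion
    depth or a block size.  Timing actual executions on the target machine
    is the core idea of empirical auto-tuning.
    """
    best_param, best_time = None, float("inf")
    for p in candidates:
        # Best-of-N timing reduces noise from OS scheduling jitter.
        t = min(_timed(kernel, p) for _ in range(trials))
        if t < best_time:
            best_param, best_time = p, t
    return best_param

def _timed(kernel, p):
    start = time.perf_counter()
    kernel(p)
    return time.perf_counter() - start

# Toy stand-in for an FFT whose cost depends on the recursion depth `d`;
# here depth 2 is artificially the cheapest setting.
def toy_kernel(d):
    sum(range(200000 * (abs(d - 2) + 1)))

print(autotune(toy_kernel, [1, 2, 3]))  # depth 2 should win
```

In a production FFT library this search would typically run once per problem size at plan-creation or install time, and the chosen depth would be cached for subsequent transforms.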
11:10-11:35 * Akira Imakura, Tetsuya Sakurai (University of Tsukuba, Japan), Kohsuke Sumiyoshi (Numazu College of Technology, Japan), and Hideo Matsufuru (High Energy Accelerator Research Organization, Japan)
    "An Auto-Tuning Technique of the Weighted Jacobi-Type Iteration used for Preconditioners of Krylov Subspace Methods"
11:40-12:05 * Satoshi Ito, Satoshi Ohshima, and Takahiro Katagiri (The University of Tokyo, Japan)
    "SSG-AT: An Auto-tuning Method of Sparse Matrix-vector Multiplication for Semi-Structured Grids -An Adaptation to OpenFOAM-"
12:05-12:30 Invited Speaker 3
* Shoaib Kamil (University of California at Berkeley, USA)
    "Bridging the Productivity-Performance Gap with Selective Embedded Just-in-Time Specialization"
Domain-expert "productivity programmers" desire scalable application performance, but usually must rely on "efficiency programmers" who are experts in explicit parallel programming to achieve it. Since such efficiency programmers are rare, to maximize reuse of their work we wish to encapsulate their expertise into mini-compilers for domain-specific embedded languages (DSELs) glued together by a common high-level host language that is familiar to productivity programmers. The SEJITS (Selective Embedded Just-In-Time Specialization) methodology enables embedding these mini-compilers in widely-used productivity languages such as Python, Ruby, and Lua by leveraging features of these languages (such as good Foreign Function Interfaces, introspection, and metaprogramming) and external optimizing compiler toolchains. SEJITS combines DSELs and code generation with auto-tuning, enabling programmers to build high-performance productive DSELs in modern productivity languages. This talk outlines our proof-of-concept, called Asp (Asp is SEJITS for Python), which strives to make developing DSEL compilers easy. I will present results for a number of implemented DSELs and applications across a variety of domains, including machine learning, stencil computations, and graph algorithms. Results show these compilers can obtain up to 98% of peak performance, work well with existing software packages, and can be used to obtain high parallel performance across domains and architectures, all while programming in a productive high-level language.