Auto-Tuning for Multicore and GPU (ATMG)
In conjunction with the IEEE MCSoC-12
The University of Aizu, Aizu-Wakamatsu, Japan
ATMG Workshop Advance Program
September 20-22, 2012
MCSoC12 workshop page: http://www.ieee-mcsoc.org/program.html
11:00-12:20  Session 1: GPGPU and Programming Environment
Session Chair: Takahiro Katagiri (The University of Tokyo, Japan)
Invited Speaker 1
* Hiroyuki Takizawa (Tohoku University, Japan)
"Software Evolution for System Architecture Revolution"
Today, HPC system architectures are becoming more and more complicated, mainly to improve power efficiency. New-generation HPC systems will have more compute nodes, more processor cores, and accelerators such as GPUs and MICs. On the other hand, existing application programs have been developed without considering such complicated hardware configurations. Thus, application programs need to be adapted to the new systems to achieve high performance. Changes in system architecture can be revolutionary and may require an application program to change drastically to fully exploit the potential of a new system. A typical example is so-called GPU computing, in which hot kernels of an existing application must be rewritten under a new programming model such as CUDA or OpenCL. In general, it is very difficult and error-prone to significantly modify a large-scale application program. Therefore, we need to improve the application program incrementally, through so-called "software evolution." We also need a systematic way to support this software evolution in response to system architecture revolutions, because the programming it requires is likely to be labor-intensive and error-prone. Motivated by this, we have started a new research project to establish a programming framework that systematically supports software evolution. In this presentation, I will talk about the basic concept and current status of the project.
* Yaohung Tsai (National Taiwan University, Taiwan), Ray-Bing Chen (National Cheng-Kung University, Taiwan), and Weichung Wang (National Taiwan University, Taiwan)
"Tuning Block Size for QR Factorization on CPU-GPU Hybrid Systems"
* Kazuya Matsumoto, Naohito Nakasato, and Stanislav G. Sedukhin (The University of Aizu, Japan)
"Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU"
10:40-12:30  Session 2: Numerical Computation and Auto-tuning
Session Chair: Hiroyuki Takizawa (Tohoku University, Japan)
Invited Speaker 2
* Daisuke Takahashi (University of Tsukuba, Japan)
"Automatic Tuning for Parallel FFTs on Clusters of Multi-Core Processors"
In this talk, an automatic performance tuning method for parallel fast
Fourier transforms (FFTs) is presented. The six-step FFT algorithm can be
transformed into a recursive six-step FFT algorithm to reduce the number
of cache misses. Since the optimal depth of recursion may depend on the
problem size, a method to determine the optimal depth of recursion that
minimizes the number of cache misses is proposed. In addition, an
automatic tuning of all-to-all communication is also presented.
Performance results of parallel FFTs with automatic performance tuning on
clusters of multi-core processors are reported.
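The depth-selection step described above can be sketched as an empirical search: time the kernel at each candidate recursion depth and keep the fastest. This is a minimal illustration only; `toy_transform` is a hypothetical stand-in for the recursive six-step FFT kernel, whose actual structure and cache behavior are not given in the abstract.

```python
import time

def toy_transform(data, depth):
    """Stand-in kernel: recursively splits the work `depth` times.
    (Hypothetical placeholder for the recursive six-step FFT.)"""
    if depth == 0 or len(data) <= 1:
        return [x * 2 for x in data]  # leaf computation (placeholder)
    half = len(data) // 2
    return (toy_transform(data[:half], depth - 1) +
            toy_transform(data[half:], depth - 1))

def tune_depth(data, max_depth):
    """Return the recursion depth with the smallest measured runtime."""
    best_depth, best_time = 0, float("inf")
    for depth in range(max_depth + 1):
        t0 = time.perf_counter()
        toy_transform(data, depth)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_depth, best_time = depth, elapsed
    return best_depth

best = tune_depth(list(range(1024)), max_depth=5)
```

In practice the tuner would minimize measured cache misses (or runtime as a proxy) per problem size and cache the chosen depth, since the abstract notes the optimum depends on the problem size.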
* Akira Imakura, Tetsuya Sakurai (University of Tsukuba, Japan), Kohsuke Sumiyoshi
(Numazu College of Technology, Japan), and Hideo Matsufuru (High Energy Accelerator
Research Organization, Japan)
"An Auto-Tuning Technique of the Weighted Jacobi-Type Iteration used for
Preconditioners of Krylov Subspace Methods"
* Satoshi Ito, Satoshi Ohshima, and Takahiro Katagiri (The University of Tokyo, Japan)
"SSG-AT: An Auto-tuning Method of Sparse Matrix-vector Multiplication for
Semi-Structured Grids -An Adaptation to OpenFOAM-"
Invited Speaker 3
* Shoaib Kamil (University of California at Berkeley, USA)
"Bridging the Productivity-Performance Gap with
Selective Embedded Just-in-Time Specialization"
Domain-expert "productivity programmers" desire scalable application
performance, but usually must rely on "efficiency programmers" who are
experts in explicit parallel programming to achieve it. Since such
efficiency programmers are rare, to maximize reuse of their work we
wish to encapsulate their expertise into mini-compilers for
domain-specific embedded languages (DSELs) glued together by a common
high-level host language that is familiar to productivity programmers.
The SEJITS (Selective Embedded Just-In-Time Specialization)
methodology enables embedding these mini-compilers in widely-used
productivity languages such as Python, Ruby, and Lua by leveraging
features of these languages (like good Foreign Function Interfaces,
introspection, and metaprogramming) and external optimizing compiler
toolchains. SEJITS combines DSELs and code generation with
auto-tuning, enabling programmers to build high-performance productive
DSELs in modern productivity languages. This talk outlines our
proof-of-concept, called Asp (Asp is SEJITS for Python), which strives
to make developing DSEL compilers easy. I will present results for a
number of implemented DSELs and applications across a variety of
domains, including machine learning, stencil computations, and graph
algorithms. Results show these compilers can obtain up to 98% of peak
performance, work well with existing software packages, and can be
used to obtain high parallel performance across domains and
architectures, all while programming in a productive high-level
language.
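The specialization idea in the abstract can be illustrated with Python's own introspection and metaprogramming: generate specialized source at call time, compile it, and cache the result. This is only a toy sketch; a real SEJITS specializer emits C or CUDA through an external toolchain, and none of the names below are part of the actual Asp API.

```python
_cache = {}

def specialize_stencil(width):
    """Generate and compile a 1-D sum stencil unrolled for `width`.
    (Illustrative only; not the Asp API.)"""
    if width not in _cache:
        # Build specialized source with the stencil fully unrolled.
        terms = " + ".join("a[i + %d]" % k for k in range(width))
        src = ("def kernel(a):\n"
               "    return [%s for i in range(len(a) - %d)]\n"
               % (terms, width - 1))
        ns = {}
        # Compile the generated source with the built-in toolchain
        # (a real specializer would invoke an external C compiler here).
        exec(compile(src, "<generated>", "exec"), ns)
        _cache[width] = ns["kernel"]
    return _cache[width]

kernel3 = specialize_stencil(3)
# kernel3([1, 2, 3, 4]) -> [6, 9]
```

The cache mirrors the "just-in-time" aspect: specialization cost is paid once per parameter value, after which the productivity-language caller gets the specialized kernel transparently.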