The Fourteenth International Workshop on
Accelerators and Hybrid Emerging Systems

To be held in conjunction with
38th IEEE International Parallel and Distributed Processing Symposium
San Francisco, California, USA
May 27, 2024

Opening Remarks

10:30 am - 10:40 am

Session 1: High-Performance Computing

10:40 am - 12:00 pm

Session Chair: Shintaro Iwasaki, Meta

  • 10:40 am - 11:00 am
    Performance Versus Maintainability: A Case Study of SCREAM on Frontier
    James White
  • 11:00 am - 11:30 am
    ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels
    Ali Tehranijamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, and Ali Jannesari
  • 11:30 am - 12:00 pm
    Alternative Quadrant Representations with Morton Index and AVX2 Vectorization for AMR Algorithms within the p4est Software Library
    Mikhail Kirilin and Carsten Burstedde

Lunch Break

12:00 pm - 1:00 pm

  • Lunch will not be provided by the conference.


1:00 pm - 2:00 pm

Block-based GPU Programming with Triton

Philippe Tillet, OpenAI

Abstract: Traditional single instruction, multiple threads (SIMT) programming with CUDA can be daunting to machine learning researchers in need of fast custom kernels. This can significantly slow down the evaluation of novel research ideas that cannot be neatly decomposed into a set of pre-built, vendor-optimized primitives. In this talk, we will shed light on an alternative programming model which, while relatively high-level, aims to be more expressive than common graph compilers (e.g., XLA, Torch-Inductor) and to enable the use of custom data structures (e.g., linked lists, block-sparse tensors). We will specifically discuss the design and implementation of Triton, a mid-level programming language that uses block-based abstractions to simplify kernel development for researchers without deep GPU programming expertise.

Bio: Philippe Tillet first began working with GPUs in 2011 as a contributor to the ViennaCL library. He then received his B.S. from Telecom SudParis (France) in 2012, his M.S. from NCTU (Taiwan) in 2014, and his Ph.D. from Harvard University in 2020. He joined OpenAI full time in 2020 to pursue his work on the Triton compiler, a project he started in 2018 after being frustrated by the difficulty of writing auto-tuners for matrix multiplications in CUDA. Since then, he has grown the Triton language into a reference for the block-based programming model and used it to write all the training kernels that were used by GPT-4.

Session 2: Accelerating AI/ML Workloads

2:00 pm - 3:10 pm

Session Chair: Carl Pearson, Sandia National Laboratories

  • 2:00 pm - 2:30 pm
    Avoiding Training in the Platform-Aware Optimization Process for Faster DNN Latency Reduction
    Raúl Marichal, Ernesto Dufrechou, and Pablo Ezzatti
  • 2:30 pm - 2:50 pm
    A Comparative Study on Simulation Frameworks for AI Accelerator Evaluation
    Christoffer Åleskog, Håkan Grahn, and Anton Borg
  • 2:50 pm - 3:10 pm
    Extending the SYCL Joint Matrix for Binarized Neural Networks
    Zheming Jin

Closing Remarks

3:10 pm - 3:20 pm


All presentations will be in person. Presenters should plan for a 25-minute talk (full papers) or a 15-minute talk (short papers), followed by 5 minutes for questions.