Methods for Characterization and Analysis of Voltage Margins in Modern CPUs, GPUs, and FPGAs
Thursday, June 17, 2021, Virtual Event
Afternoon tutorial held in conjunction with ISCA 2021
12PM - 3PM (EDT/New York)
6PM - 9PM (CET/Brussels)
12AM - 3AM (CST/Beijing)
Modern large-scale computing systems employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. Conservative design margins in these platforms are intended to guarantee correct execution of the software layers under various operating conditions: they account for worst-case voltage noise (L di/dt), harsh environmental conditions, workload variability, inherent within-die variability among different components of the same chip, and die-to-die variability among different manufactured chips. However, such conservative guard-banding of voltage and frequency limits energy efficiency. In this tutorial, we will present recent methods and studies on system-level voltage-margin characterization and identification that improve energy efficiency while guaranteeing the correctness of software execution. In particular, the tutorial will cover our findings based on comprehensive experimental analysis of commercial off-the-shelf CPUs, GPUs, and FPGAs, as detailed below.
CPUs (led by University of Athens)
- We will present the main challenges of the massive process of characterization and identification of the design margins. Such a process aims to identify different types of variability of modern multicore CPUs (across cores, chips and workloads). Also, it aims to analyze the system behavior in scaled conditions (types of observed malfunctions).
- We will present findings about the magnitude of power and energy that can be saved through the exploitation of the margins and variability.
- We will discuss the effectiveness of dedicated micro-viruses for the identification of the safe Vmin and failure probabilities compared to classic benchmark workloads.
- We will quantify the importance of clock frequency, thread/core allocation, and workload behavior for energy-efficient operation at scaled voltage, and show how the system layers can exploit this quantification for task allocation aiming at either energy reduction or balanced energy-performance operation.
- The analysis is based on real system measurements in different multicore server CPU chips mainly based on ARMv8 and x86 architectures.
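As a rough illustration of the kind of characterization described in the list above, the following minimal sketch steps a supply voltage down from its nominal value until a repeated workload run first fails, and then estimates the dynamic-power reduction at the resulting safe Vmin. It is not the tutorial authors' methodology or tooling. The names `find_safe_vmin`, `dynamic_power_saving`, and `simulated_run_ok`, as well as all voltage values, are hypothetical, and the saving estimate assumes that dynamic power at a fixed frequency scales with the square of the supply voltage.

```python
from typing import Callable


def find_safe_vmin(run_ok: Callable[[int], bool],
                   nominal_mv: int,
                   floor_mv: int,
                   step_mv: int = 5,
                   repeats: int = 10) -> int:
    """Return the lowest voltage (in mV) at which every repeated run passed.

    The voltage is lowered from nominal in fixed steps; the search stops
    at the first level where any of the repeated runs fails.
    """
    safe_mv = nominal_mv
    voltage_mv = nominal_mv - step_mv
    while voltage_mv >= floor_mv:
        if not all(run_ok(voltage_mv) for _ in range(repeats)):
            break  # first failing level: keep the last fully passing one
        safe_mv = voltage_mv
        voltage_mv -= step_mv
    return safe_mv


def dynamic_power_saving(nominal_mv: int, scaled_mv: int) -> float:
    """Fractional dynamic-power saving at fixed frequency, assuming P ~ V^2."""
    return 1.0 - (scaled_mv / nominal_mv) ** 2


def simulated_run_ok(voltage_mv: int) -> bool:
    """Stand-in for a real workload run (hypothetical failure threshold)."""
    return voltage_mv >= 905


# Illustrative values only; they are not measurements from any real chip.
vmin_mv = find_safe_vmin(simulated_run_ok, nominal_mv=980, floor_mv=800)
saving = dynamic_power_saving(980, vmin_mv)
print(f"safe Vmin: {vmin_mv} mV, estimated dynamic-power saving: {saving:.1%}")
```

On real hardware, `run_ok` would have to set the voltage through a platform-specific interface, run the workload or micro-virus, and classify the outcome (for example crashes, hangs, or silent data corruptions), which this sketch deliberately leaves out.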
GPUs (led by Harvard University and Shanghai Jiao Tong University)
- We will present findings about our measurement-based study on the power/energy/performance benefits of reducing the operating margin on the GPUs.
- We will explain the main challenge in reducing the operating margin on GPUs: fast-changing voltage noise.
- We will describe the integrated simulation platform for modeling the performance/power/voltage of GPUs.
- We will present a novel system paradigm for the reduced voltage operation of GPUs.
FPGAs (led by BSC)
- We will discuss the energy-reliability trade-off for multiple components of FPGAs under reduced-voltage operations below the vendor-set default value. We will also discuss the voltage guard-banding feature that we experimentally explored for different technologies of Xilinx FPGAs.
- We will present and discuss the comprehensive characterization of undervolting-related errors in FPGA on-chip memories. Consequently, we will discuss effective techniques such as ECC to mitigate these undervolting faults.
- We will demonstrate FPGA undervolting on state-of-the-art CNN accelerators, focusing on the trade-off between energy savings and CNN accuracy loss. We will also discuss an effective frequency underscaling technique to mitigate the accuracy loss caused by reduced-voltage FPGA operation.
- We will discuss our undervolting study on the High-Bandwidth Memory (HBM) packaged with the FPGAs, including the voltage guardbanding, fault characterization in such memories, and the effect of data patterns and memory stacking.
Dimitris Gizopoulos, George Papadimitriou (University of Athens)
Osman Unsal, Behzad Salami (BSC)
Vijay Janapa Reddi (Harvard University)
Jingwen Leng (Shanghai Jiao Tong University)
The target audience of the tutorial includes researchers and practitioners interested in energy efficiency through voltage margins identification and exploitation and the corresponding reliability considerations, for general purpose CPUs as well as for hardware accelerators on GPUs and FPGAs.
Dimitris Gizopoulos (email@example.com) is Professor at the Department of Informatics & Telecommunications of the National & Kapodistrian University of Athens in Greece where he leads the Computer Architecture Laboratory. The group's research focuses on the dependability, the energy-efficiency and the performance of computer architectures. Gizopoulos has published more than 180 papers in top-tier conferences and journals, has served and is currently serving as Associate Editor for several IEEE and ACM Transactions and Magazines and as member of several Program, Organizing and Steering Committees of IEEE and ACM conferences. Gizopoulos is an IEEE Fellow, a Golden Core member of the IEEE Computer Society and a Senior ACM member. He has presented several conference tutorials at ISCA, MICRO, DSN, DATE.
George Papadimitriou (firstname.lastname@example.org) is a Postdoctoral Researcher at the Dept. of Informatics & Telecommunications of the University of Athens. He received his MSc in Computer Systems Technology and his PhD in Computer Architecture from the University of Athens. His research focuses on dependability and energy-efficient computer architectures, microprocessor reliability, functional correctness of hardware designs and design validation of microprocessors and microprocessor-based systems, in which he has published more than 25 papers in international conferences and journals. He has participated in tutorials in these areas at MICRO and ISCA.
Osman Unsal (email@example.com) co-leads the Computer Architecture for Parallel Paradigms research group at Barcelona Supercomputing Center. His main research interests are in computer architecture, fault-tolerance, energy efficiency and transactional memory. He has published 160 papers in conferences and journals in the topics of Computer Architecture, VLSI design, Parallel Computing and Programming Models. Previously he was with Intel Microprocessor Research Labs, and he co-led the BSC-Microsoft Research Center from 2006 to 2014. He received the B.S. degree from Istanbul Technical University, the M.S. degree from Brown University and the Ph.D. degree from University of Massachusetts, Amherst.
Behzad Salami (firstname.lastname@example.org) is a resident Researcher with the Computer Science Department at Barcelona Supercomputing Center (BSC) and an affiliated research member of the SAFARI Research Group at ETH Zurich. He received a Ph.D. degree (with honors) in computer architecture from the Universitat Politecnica de Catalunya (UPC) in 2018. He participated as a researcher in several EU-funded research projects, including LEGaTO, AXLE, and EuroEXA, and also led a technology transfer project as the PI. He has received several awards and grants for his research activities, including the HiPEAC paper award, the HiPEAC collaboration grant, the Tetramax technology transfer grant, and the I4MS-SAE certificate of excellence. His research interests are reconfigurable computing, low-power and fault-resilient hardware accelerators, and processing near- and in-memory systems.
Vijay Janapa Reddi (email@example.com) is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to that, he was an Associate Professor at The University of Texas at Austin in the Department of Electrical and Computer Engineering. His research interests include computer architecture and runtime systems, specifically in the context of autonomous machines and mobile and edge computing systems. He received a Ph.D. in computer science from Harvard University.
Jingwen Leng (firstname.lastname@example.org) is a tenure-track Associate Professor in the John Hopcroft Computer Science Center and Computer Science Department at Shanghai Jiao Tong University. He received his Ph.D. from the University of Texas at Austin, where he focused on improving the efficiency and resiliency of general-purpose GPUs. He is currently interested in taking a holistic approach to optimizing the performance, efficiency, and reliability of heterogeneous computing systems.
DATE 2021 - "Understanding Power and Reliability of High-Bandwidth Memory with Voltage Underscaling", S. Nabavi, B. Salami, O. Unsal, A. Cristal, H. Sarbazi-Azad, O. Mutlu, in 24th Design, Automation and Test in Europe Conference (DATE 2021), February 2021.
DSN 2020 - "An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration", B. Salami, E. Baturay, I. Yuksel, F. Koch, O. Ergin, A. Cristal, O. Unsal, H. Sarbazi-Azad, O. Mutlu, in 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2020), June 2020.
HPCA 2020 - "Asymmetric Resilience: A System Architecture for Transient Error Recovery in Accelerator-Rich Processors", J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, Q. Chen, M. Guo, V. Reddi, in International Symposium on High Performance Computer Architecture (HPCA), 2020.
HPCA 2019 - "Adaptive Voltage/Frequency Scaling and Core Allocation for Balanced Energy and Performance on Multicore CPUs", G. Papadimitriou, A. Chatzidimitriou, D. Gizopoulos, IEEE International Symposium on High-Performance Computer Architecture (HPCA 2019), Washington, DC, USA, February 2019.
PDP 2019 - "Evaluating Built-In ECC of FPGA On-Chip Memories for the Mitigation of Undervolting Faults", B. Salami, O. Unsal, A. Cristal, in 27th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP 2019), February 2019.
MICRO 2018 - "Comprehensive Evaluation of Supply Voltage Underscaling in FPGA on-Chip Memories", B. Salami, O. Unsal, A. Cristal, in 51st IEEE/ACM International Symposium on Microarchitecture (MICRO 2018), October 2018.
ISPASS 2018 - "Micro-Viruses for Fast System-Level Voltage Margins Characterization in Multicore CPUs", G. Papadimitriou, A. Chatzidimitriou, M. Kaliorakis, Y. Vastakis, D. Gizopoulos, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2018), Belfast, Northern Ireland, United Kingdom, April 2018.
DSN 2018 - "Measuring and Exploiting Guardbands of Server-Grade ARMv8 CPU Cores and DRAMs", K. Tovletoglou, L. Mukhanov, G. Karakonstantis, A. Chatzidimitriou, G. Papadimitriou, M. Kaliorakis, D. Gizopoulos, Z. Hadjilambrou, Y. Sazeides, A. Lampropulos, S. Das, P. Vo, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2018), Luxembourg, June 2018.
CAL 2018 - "Statistical Analysis of Multicore CPUs Operation in Scaled Voltage Conditions", M. Kaliorakis, A. Chatzidimitriou, G. Papadimitriou, and D. Gizopoulos, IEEE Computer Architecture Letters (CAL 2018), Volume: 17, Issue: 2, February 2018.
MICRO 2017 - "Harnessing Voltage Margins for Energy Efficiency in Multicore CPUs", G. Papadimitriou, M. Kaliorakis, A. Chatzidimitriou, D. Gizopoulos, P. Lawthers, and S. Das, IEEE/ACM International Symposium on Microarchitecture (MICRO 2017), Cambridge, MA, USA, October 2017.
MICRO 2015 - "Safe Limits on Voltage Reduction Efficiency in GPUs: A Direct Measurement Approach", J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, and V. J. Reddi, in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2015.
HPCA 2015 - "GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures", J. Leng, Y. Zu, and V. J. Reddi, in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2015.
ISLPED 2014 - "GPUVolt: modeling and characterizing voltage noise in GPU architectures", J. Leng, Y. Zu, M. Rhu, M. S. Gupta, and V. J. Reddi, in International Symposium on Low Power Electronics and Design (ISLPED), 2014.
ISCA 2013 - "GPUWattch: enabling energy optimizations in GPGPUs", J. Leng, T. H. Hetherington, A. ElTantawy, S. Z. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, in ACM/IEEE International Symposium on Computer Architecture (ISCA), 2013.