Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

29 August 2023

Papers citing "Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library"

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance | Hiroyuki Ootomo, Rio Yokota | 38 32 0 | 07 Mar 2022
tcFFT: Accelerating Half-Precision FFT through Tensor Cores | Bin-Rui Li, Shenggan Cheng, James Lin | 16 12 0 | 23 Apr 2021
A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels | Peng Chen, Mohamed Wahib, Shin'ichiro Takizawa, Ryousei Takano, Satoshi Matsuoka | 28 22 0 | 14 Jul 2019
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking | Zhe Jia, Marco Maggioni, Benjamin Staiger, D. Scarpazza | 48 309 0 | 18 Apr 2018
NVIDIA Tensor Core Programmability, Performance & Precision | Stefano Markidis, Steven W. D. Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter | 36 372 0 | 11 Mar 2018
In-Datacenter Performance Analysis of a Tensor Processing Unit | N. Jouppi, C. Young, Nishant Patil, David Patterson, Gaurav Agrawal, ..., Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon | 213 4,626 0 | 16 Apr 2017