Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs
Abstract
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and how they compare with integer formats in terms of accuracy-hardware cost trade-offs, remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
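To make the minifloat idea concrete, here is a minimal NumPy sketch of rounding a tensor to a generic sign/exponent/mantissa minifloat, assuming an IEEE-like exponent bias, subnormal support, and saturation at the largest representable magnitude with no reserved Inf/NaN codes. The function name and parameters are illustrative, not the paper's operator library.

```python
import numpy as np

def minifloat_quantize(x, exp_bits, man_bits):
    """Round x to the nearest value representable with a 1-bit sign,
    `exp_bits` exponent bits, and `man_bits` mantissa bits (a sketch,
    not the authors' FPGA implementation)."""
    bias = 2 ** (exp_bits - 1) - 1
    # Largest normal value; assumes the top exponent code is not
    # reserved for Inf/NaN, as is common in sub-8-bit formats.
    max_exp = 2 ** exp_bits - 1 - bias
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -man_bits)
    min_exp = 1 - bias  # smallest normal exponent; below it, subnormals

    sign = np.sign(x)
    mag = np.abs(x)
    # Per-element exponent, floored at the subnormal range.
    exp = np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny)))
    exp = np.clip(exp, min_exp, None)
    # Quantization step: one unit in the last mantissa place at this exponent.
    step = 2.0 ** (exp - man_bits)
    q = np.round(mag / step) * step
    return sign * np.clip(q, 0.0, max_val)

# Example: a hypothetical FP4 format with 2 exponent and 1 mantissa bit.
x = np.array([0.07, 0.5, 1.3, 6.0])
print(minifloat_quantize(x, exp_bits=2, man_bits=1))  # [0.  0.5 1.5 6. ]
```

Varying `exp_bits` and `man_bits` at a fixed total bit width is what spans the minifloat design space the paper explores against integer quantization.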
- Publication:
- arXiv e-prints
- Pub Date:
- November 2023
- DOI:
- 10.48550/arXiv.2311.12359
- arXiv:
- arXiv:2311.12359
- Bibcode:
- 2023arXiv231112359A
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Artificial Intelligence;
- Computer Science - Hardware Architecture;
- Computer Science - Machine Learning;
- Computer Science - Performance
- E-Print:
- Accepted at the International Conference on Field-Programmable Logic and Applications (FPL) 2024. Revised with updated results