Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography
Abstract
In Digital Holography (DH), the object distance must be extracted from the hologram in order to reconstruct the object's amplitude and phase. This step, called auto-focusing, is conventionally solved by first reconstructing a stack of images at candidate distances and then evaluating the sharpness of each reconstructed image with a focus metric such as entropy or variance. The distance corresponding to the sharpest image is taken as the focal position. This approach, while effective, is computationally demanding and time-consuming. In this paper, the distance is instead determined by Deep Learning (DL). Two DL architectures are compared: Convolutional Neural Network (CNN) and Vision Transformer (ViT). Both treat auto-focusing as a classification problem. Compared to a first attempt [11] in which the distance between two consecutive classes was 100 $\mu$m, our proposal drastically reduces this distance to 1 $\mu$m. Moreover, ViT reaches accuracy similar to that of CNN and is more robust.
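The conventional auto-focusing search described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the angular spectrum method for numerical propagation, uses the variance of the amplitude as the focus metric, and operates on a complex field rather than a recorded intensity hologram. The function names (`propagate`, `autofocus`, `distance_to_class`) and all parameters (wavelength, pixel pitch, candidate distances) are hypothetical and do not come from the paper. The last helper shows one plausible way to turn a distance into a class label when classes are spaced 1 $\mu$m apart.

```python
import numpy as np


def propagate(field, distance, wavelength, pixel_pitch):
    """Propagate a complex field over `distance` (angular spectrum method)."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_pitch)
    fy = np.fft.fftfreq(ny, d=pixel_pitch)
    fxx, fyy = np.meshgrid(fx, fy)  # both of shape (ny, nx)
    # Squared longitudinal frequency; non-positive values are evanescent waves.
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    transfer = np.exp(1j * kz * distance) * (arg > 0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)


def autofocus(field, distances, wavelength, pixel_pitch):
    """Back-propagate to every candidate distance and keep the sharpest result.

    Sharpness is the variance of the reconstructed amplitude (an assumed
    choice; the maximum marks the focus for amplitude objects).
    """
    scores = []
    for d in distances:
        reconstruction = propagate(field, -d, wavelength, pixel_pitch)
        scores.append(float(np.var(np.abs(reconstruction))))
    best = int(np.argmax(scores))
    return float(distances[best]), scores


def distance_to_class(distance, d_min, step=1e-6):
    """Map a distance to a class index for classes spaced `step` metres apart."""
    return int(round((distance - d_min) / step))
```

Every candidate distance costs two FFTs plus a metric evaluation, so the cost of this search grows linearly with the number of candidates. That is the expense a single forward pass of a classifier is meant to avoid.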
- Publication: arXiv e-prints
- Pub Date: August 2021
- DOI:
- arXiv: arXiv:2108.09147
- Bibcode: 2021arXiv210809147C
- Keywords: Computer Science - Computer Vision and Pattern Recognition; Electrical Engineering and Systems Science - Image and Video Processing
- E-Print: 6 pages, 11 figures, ICCCR 2022 Conference