Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography

doi:10.48550/arXiv.2108.09147

In Digital Holography (DH), the object distance must be extracted from a hologram in order to reconstruct the object's amplitude and phase. This step is called auto-focusing. It is conventionally solved by first reconstructing a stack of images at candidate distances and then evaluating the sharpness of each reconstructed image with a focus metric such as entropy or variance. The distance corresponding to the sharpest image is taken as the focal position. This approach, while effective, is computationally demanding and time-consuming. In this paper, the distance is determined by Deep Learning (DL). Two DL architectures are compared: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). Both treat auto-focusing as a classification problem. Compared to a first attempt [11] in which the distance between two consecutive classes was 100 $\mu$m, our proposal drastically reduces this distance to 1 $\mu$m. Moreover, ViT reaches similar accuracy and is more robust than CNN.
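The conventional auto-focusing procedure described in the abstract (reconstruct a stack of images, score each with a focus metric, keep the sharpest) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the angular spectrum propagator, the use of amplitude variance as the metric, and the function names `angular_spectrum_propagate`, `variance_focus_metric`, and `autofocus` are all assumptions made for the example. It also assumes the complex field at the sensor plane is available, whereas a recorded hologram is an intensity image that would typically need preprocessing first.

```python
import numpy as np


def angular_spectrum_propagate(field, wavelength, pixel_size, distance):
    """Propagate a complex field by `distance` (metres) with the angular spectrum method."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_size)
    fy = np.fft.fftfreq(ny, d=pixel_size)
    FX, FY = np.meshgrid(fx, fy)
    # Squared longitudinal spatial frequency; non-positive values are evanescent waves.
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Free-space transfer function, with evanescent components suppressed.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)


def variance_focus_metric(image):
    """Variance of the amplitude; assumed here to peak at focus for amplitude objects."""
    return float(np.var(np.abs(image)))


def autofocus(field, wavelength, pixel_size, candidate_distances):
    """Back-propagate to every candidate distance and return the best-scoring one."""
    scores = []
    for z in candidate_distances:
        # Reconstruction is a propagation by -z from the sensor plane.
        reconstruction = angular_spectrum_propagate(field, wavelength, pixel_size, -z)
        scores.append(variance_focus_metric(reconstruction))
    best_index = int(np.argmax(scores))
    return candidate_distances[best_index], scores
```

The cost of this search grows linearly with the number of candidate distances, since each candidate requires a full reconstruction. That cost is what the classification approach in the paper aims to avoid.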

Publication:

arXiv e-prints

Pub Date:

August 2021

DOI:

10.48550/arXiv.2108.09147

arXiv:

arXiv:2108.09147

Bibcode:

2021arXiv210809147C

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Electrical Engineering and Systems Science - Image and Video Processing

E-Print:

6 pages, 11 figures, ICCCR 2022 Conference
