CNN-based Local Vision Transformer for COVID-19 Diagnosis

Deep learning can assist doctors in identifying COVID-19 infections quickly and accurately. Recently, the Vision Transformer (ViT) has shown great potential for image classification owing to its global receptive field. However, because it lacks the inductive biases inherent to CNNs, a ViT-based structure yields limited feature richness and is difficult to train. In this paper, we propose a new structure called Transformer for COVID-19 (COVT) to improve the performance of ViT-based architectures on small COVID-19 datasets. It uses a CNN as a feature extractor to effectively extract local structural information, and introduces average pooling into ViT's Multilayer Perceptron (MLP) module to capture global information. Experiments show the effectiveness of our method on two COVID-19 datasets and the ImageNet dataset.
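
The abstract does not say how average pooling is wired into the MLP module, so the following is only a minimal, dependency-free sketch of one possible reading: each token passes through the usual two-layer MLP, and the hidden activations averaged over all tokens are added back to every token before the second layer. The additive fusion, the GELU activation, the function names (`mlp_with_avg_pool`, `linear`, `random_matrix`) and all sizes are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def gelu(x):
    # tanh approximation of the GELU activation (assumed; the abstract names no activation)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(tokens, weight, bias):
    # tokens: list of N vectors of length d_in
    # weight: d_in x d_out nested list, bias: list of length d_out
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) + bias[j] for j in range(len(bias))]
        for t in tokens
    ]


def mlp_with_avg_pool(tokens, w1, b1, w2, b2):
    # Token-wise first layer followed by the activation.
    hidden = [[gelu(v) for v in row] for row in linear(tokens, w1, b1)]
    # Average pooling over the token axis gives one global summary vector.
    n = len(hidden)
    pooled = [sum(col) / n for col in zip(*hidden)]
    # Hypothetical fusion: add the global summary to every token's hidden vector.
    mixed = [[h + p for h, p in zip(row, pooled)] for row in hidden]
    # Token-wise second layer projects back to the input width.
    return linear(mixed, w2, b2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim, num_tokens = 8, 16, 4  # illustrative sizes only
w1, b1 = random_matrix(dim, hidden_dim, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(hidden_dim, dim, rng), [0.0] * dim
tokens = random_matrix(num_tokens, dim, rng)

out = mlp_with_avg_pool(tokens, w1, b1, w2, b2)
print(len(out), len(out[0]))
```

In this sketch the pooled vector is the only path by which one token can influence another inside the MLP, which is the sense in which pooling would contribute global information; in the proposed structure the input tokens would come from CNN feature maps rather than random values.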

Publication: arXiv e-prints
Pub Date: July 2022
DOI: 10.48550/arXiv.2207.02027
arXiv: arXiv:2207.02027
Bibcode: 2022arXiv220702027X
Keywords: Electrical Engineering and Systems Science - Image and Video Processing; Computer Science - Computer Vision and Pattern Recognition
E-Print: 5 pages, 4 figures
