Food Classification using Joint Representation of Visual and Textual Data

doi:10.48550/arXiv.2308.02562

Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses a modified version of EfficientNet with the Mish activation function for image classification and the traditional BERT transformer-based network for text classification. The proposed network and other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101. The experimental results show that the proposed network outperforms the other methods: compared with the second-best performing method, a significant difference in accuracy of 11.57% and 6.34% is observed for image and text classification, respectively. We also compare performance in terms of accuracy, precision, and recall for text classification using both machine learning and deep learning-based models. The comparative analysis of the prediction results for both images and text demonstrates the efficiency and robustness of the proposed approach.
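For readers unfamiliar with the Mish activation named in the abstract, the following is a minimal sketch of its standard definition, mish(x) = x * tanh(softplus(x)). It is not taken from the paper: the function names `softplus` and `mish` are illustrative, and how the authors integrate Mish into their modified EfficientNet is not described in this record.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Standard Mish activation: x * tanh(softplus(x)).

    Illustrative scalar version only; an assumption-free definition,
    not the authors' implementation.
    """
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Print a few sample points to show the smooth, non-monotonic shape.
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({value:+.1f}) = {mish(value):+.6f}")
```

Mish behaves like the identity for large positive inputs and approaches zero from below for large negative inputs, which is why it is often used as a smooth drop-in replacement for ReLU-style activations.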

Publication:

arXiv e-prints

Pub Date:

August 2023

DOI:

10.48550/arXiv.2308.02562

arXiv:

arXiv:2308.02562

Bibcode:

2023arXiv230802562M

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Artificial Intelligence;
Computer Science - Computers and Society;
Computer Science - Machine Learning

E-Print:

Updated results and discussions are to be posted, and some sections need to be expanded
