Multimodal Chain-of-Thought Reasoning in Language Models
Abstract
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT, which incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation from answer inference. In this way, answer inference can leverage higher-quality rationales generated from multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% → 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at https://github.com/amazon-science/mm-cot.
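To make the two-stage framework concrete, below is a minimal Python sketch of the inference flow described in the abstract: stage 1 generates a rationale from the question text fused with vision features, and stage 2 appends that rationale to the input before inferring the answer. The `Example`, `Seq2SeqModel`, `rationale_model`, and `answer_model` names are hypothetical placeholders, not the paper's actual API; the real implementation in the linked repository fuses vision features (e.g., DETR patch embeddings) into a T5 encoder.

```python
# Hypothetical sketch of two-stage Multimodal-CoT inference.
# The model interfaces below are placeholders standing in for the
# paper's fused vision-language seq2seq models.

from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Example:
    question: str
    context: str
    options: List[str]
    image_features: object  # e.g., patch features from a vision encoder


class Seq2SeqModel(Protocol):
    def generate(self, text: str, image_features: object) -> str: ...


def multimodal_cot(example: Example,
                   rationale_model: Seq2SeqModel,
                   answer_model: Seq2SeqModel) -> str:
    # Stage 1: generate a rationale conditioned on the text input
    # fused with the image features.
    stage1_input = (f"Question: {example.question}\n"
                    f"Context: {example.context}\n"
                    f"Options: {', '.join(example.options)}")
    rationale = rationale_model.generate(stage1_input,
                                         example.image_features)

    # Stage 2: append the generated rationale to the original input
    # and infer the final answer, again using multimodal fusion.
    stage2_input = stage1_input + f"\nRationale: {rationale}"
    return answer_model.generate(stage2_input, example.image_features)
```

Separating the two stages means the answer model conditions on an explicit, already-generated rationale rather than having to produce rationale and answer in one pass, which is the design choice the abstract credits for the accuracy gain.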
- Publication: arXiv e-prints
- Pub Date: February 2023
- DOI: 10.48550/arXiv.2302.00923
- arXiv: arXiv:2302.00923
- Bibcode: 2023arXiv230200923Z
- Keywords:
  - Computer Science - Computation and Language
  - Computer Science - Artificial Intelligence
  - Computer Science - Computer Vision and Pattern Recognition