Pegasus-v1 Technical Report

doi:10.48550/arXiv.2404.14687

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, and to offer nuanced comprehension of video content across various lengths. The report gives an overview of Pegasus-1's architecture, training strategies, and performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, in order to give readers a balanced view of its current state and future direction.

Publication: arXiv e-prints

Pub Date: April 2024

DOI: 10.48550/arXiv.2404.14687

arXiv: arXiv:2404.14687

Bibcode: 2024arXiv240414687J

Keywords: Computer Science - Multimedia; Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
