HiLight: Technical Report on the Motern AI Video Language Model
Abstract
This technical report presents the implementation of a state-of-the-art video encoder for video-text modal alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) aligning the video and text modalities, and (2) providing a convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report discusses the underlying concepts and the final solution developed while implementing the task.
- Publication:
- arXiv e-prints
- Pub Date:
- July 2024
- DOI:
- 10.48550/arXiv.2407.07325
- arXiv:
- arXiv:2407.07325
- Bibcode:
- 2024arXiv240707325W
- Keywords:
- Computer Science - Computer Vision and Pattern Recognition;
- Computer Science - Computation and Language;
- Computer Science - Multimedia;
- Electrical Engineering and Systems Science - Image and Video Processing