HiLight: Technical Report on the Motern AI Video Language Model

doi:10.48550/arXiv.2407.07325

This technical report presents the implementation of a state-of-the-art video encoder for video-text modality alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) alignment of the video and text modalities; (2) a convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report discusses the underlying concepts and the final solution developed while implementing the task.

Publication:

arXiv e-prints

Pub Date:

July 2024

DOI:

10.48550/arXiv.2407.07325

arXiv:

arXiv:2407.07325

Bibcode:

2024arXiv240707325W

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Computation and Language;
Computer Science - Multimedia;
Electrical Engineering and Systems Science - Image and Video Processing

Abstract