Detection-Fusion for Knowledge Graph Extraction from Videos

doi:10.48550/arXiv.2501.00136

One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
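As a rough illustration of the two-stage idea described in the abstract (first select individuals, then predict relations between pairs of them), the following minimal sketch assembles a knowledge graph as a list of subject-predicate-object triples. It is not the authors' model: the function `build_graph`, the threshold values, the `toy_relation_scorer` lookup table, and the example scores are all hypothetical stand-ins for what would be learned neural components.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_graph(
    individual_scores: Dict[str, float],
    relation_scorer: Callable[[str, str], Dict[str, float]],
    individual_threshold: float = 0.5,  # assumed value, not from the paper
    relation_threshold: float = 0.5,    # assumed value, not from the paper
) -> List[Triple]:
    """Assemble triples from per-individual and per-pair scores."""
    # Stage 1: keep the individuals whose detection score passes the threshold.
    detected = [
        name for name, score in individual_scores.items()
        if score >= individual_threshold
    ]

    triples: List[Triple] = []
    # Stage 2: score candidate predicates for every ordered pair of individuals.
    for subj, obj in permutations(detected, 2):
        for predicate, score in relation_scorer(subj, obj).items():
            if score >= relation_threshold:
                triples.append((subj, predicate, obj))
    return triples


def toy_relation_scorer(subj: str, obj: str) -> Dict[str, float]:
    """Hypothetical stand-in for a learned relation predictor."""
    table = {("person", "guitar"): {"plays": 0.9, "holds": 0.4}}
    return table.get((subj, obj), {})


# Made-up detection scores for a single video.
scores = {"person": 0.95, "guitar": 0.8, "dog": 0.1}
graph = build_graph(scores, toy_relation_scorer)
print(graph)
```

Because the output is a set of discrete triples rather than free text, it can be compared against a reference annotation by exact matching, which is the kind of machine-processable evaluation the abstract contrasts with natural language captions.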

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2501.00136

arXiv:

arXiv:2501.00136

Bibcode:

2025arXiv250100136D

Keywords:

Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Artificial Intelligence;
Computer Science - Machine Learning

E-Print:

12 pages, To be submitted to a conference
