Combining Audio and Image Sequence for Video Moment Retrieval by Natural Language


Research field: Artificial Intelligence



Conference: International Conference on Artificial Intelligence and Soft Computing


Abstract

Video moment retrieval by natural language aims to locate the segment (moment) of a video that is most relevant to a textual description given in natural language. However, existing methods rely only on image sequence analysis and neglect the information carried by the audio. The main objective of this study is therefore to combine both kinds of features (from image and audio) to make retrieval more comprehensive and robust. To this end, the model is built on aligned audio and image sequence extractors whose features are related to the textual description to retrieve the desired moment of the video. We propose a weakly supervised model that uses attention mechanisms and the audio component for video moment retrieval by natural language. Results demonstrate that the proposed model outperforms the current state of the art in the mIoU metric by more than 27%, while also decreasing the response time of video moment retrieval (reducing the computational complexity from polynomial to linear).
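
The abstract reports its results in mIoU (mean Intersection over Union). As a minimal sketch, assuming the usual temporal definition (the overlap between the predicted and the ground-truth segment divided by the span the two cover together, averaged over all queries), the metric could be computed as below. The names `Segment`, `temporal_iou`, and `mean_iou` are hypothetical and do not come from the paper, whose exact evaluation protocol may differ.

```python
from typing import List, Tuple

# Hypothetical representation of a moment: (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over union of two time intervals."""
    # Length of the overlap; zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], truths: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and of equal length")
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```

For instance, a prediction of (0, 10) seconds against a ground truth of (5, 15) seconds overlaps for 5 seconds out of a combined span of 15 seconds, giving an IoU of 1/3.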


Authors

Luís G. de Souza
Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil

Sílvio R. R. Sanches
Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil

Pedro H. Bugatti
Department of Computing, Federal University of Technology - Parana, Cornelio Procopio, PR, Brazil

Paper information

Publication year: 2025
Citations: 0
Country of publication: Brazil
Site: Springer