Research field: Artificial Intelligence
Conference: International Conference on Artificial Intelligence and Soft Computing
Video moment retrieval by natural language aims to locate the segment (moment) of a video that is most relevant to a textual description. However, existing methods rely only on image sequence analysis and neglect the information carried by the audio. The main objective of this study is therefore to combine both kinds of features (image and audio) to make retrieval more comprehensive and robust. To this end, the model is built on aligned audio and image sequence extractors whose outputs are related to the textual description in order to retrieve the desired moment of the video. We propose a weakly supervised model that uses attention mechanisms and the audio component for video moment retrieval by natural language. Results demonstrate that the proposed model outperforms the current state of the art on the mIoU metric by more than 27%, while also decreasing the response time of video moment retrieval (reducing the computational complexity from polynomial to linear).
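For readers unfamiliar with the mIoU metric cited above, the following is a minimal sketch of how it is commonly computed for moment retrieval: the temporal intersection over union between each predicted segment and its ground-truth segment, averaged over all queries. The function names `temporal_iou` and `mean_iou` and the `(start, end)` representation in seconds are assumptions made for illustration; the abstract does not describe the authors' exact evaluation protocol, which may differ.

```python
from typing import List, Tuple

# A moment is assumed to be a (start_seconds, end_seconds) pair; this
# representation is illustrative and not taken from the paper.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over union of two time intervals."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], truths: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
    ground_truth = [(5.0, 15.0), (0.0, 10.0), (10.0, 15.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

In this sketch a prediction that half-overlaps its target contributes an IoU of 1/3, an exact match contributes 1, and a disjoint prediction contributes 0, so the metric rewards precise localization rather than mere overlap.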
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Brazil |
| Site | Springer |
| Likes | 0 |