Research field: Safety
Conference: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data
When interacting with a robot, the first thing we usually do is wake it up with a wake word. Wake words such as ‘XiaoduXiaodu’ and ‘Hey, Siri’ undoubtedly degrade the interaction experience between us and robots. In this work, we focus on interacting with the robot without the use of wake words, even when the user is not within the robot’s field of view. To accomplish this task, we propose a multimodal activation detection model (MADM), which consists of three parts: primary feature extraction, high-level feature fusion, and fused feature classification. The first part extracts primary feature vectors from the raw video and audio. The second part uses our proposed local variable weight fusion strategy to convert the primary features into high-level features and combine them into fused features for classification. The third part uses a fully connected neural network to classify the fused features and determine whether a response is required. To evaluate MADM, we constructed a dataset containing 7992 short videos recorded by 99 invited volunteers. Extensive experiments demonstrate the effectiveness of our model and the necessity of a feature fusion strategy.
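
The three-stage flow described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the abstract does not specify the local variable weight fusion strategy, so the sketch substitutes a simple softmax gate over the two modalities as a stand-in. The class name `ToyActivationDetector`, all dimensions, and the random untrained weights are hypothetical, and the primary feature vectors are assumed to have been extracted already by some video and audio front end.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Random, untrained parameters (illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class ToyActivationDetector:
    """Hypothetical stand-in for the fusion and classification stages."""

    def __init__(self, video_dim, audio_dim, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        # Stage 2a: project each modality's primary features to high-level features.
        self.video_proj = init_layer(rng, hidden_dim, video_dim)
        self.audio_proj = init_layer(rng, hidden_dim, audio_dim)
        # Stage 2b: per-modality gate scores (ASSUMPTION: a softmax gate,
        # not the paper's local variable weight fusion strategy).
        self.video_gate = init_layer(rng, 1, hidden_dim)
        self.audio_gate = init_layer(rng, 1, hidden_dim)
        # Stage 3: fully connected classifier over the fused features.
        self.classifier = init_layer(rng, 1, hidden_dim)

    def fuse(self, video_feat, audio_feat):
        hv = relu(linear(video_feat, *self.video_proj))
        ha = relu(linear(audio_feat, *self.audio_proj))
        scores = [linear(hv, *self.video_gate)[0], linear(ha, *self.audio_gate)[0]]
        wv, wa = softmax(scores)  # input-dependent modality weights summing to 1
        fused = [wv * v + wa * a for v, a in zip(hv, ha)]
        return fused, (wv, wa)

    def predict(self, video_feat, audio_feat):
        fused, weights = self.fuse(video_feat, audio_feat)
        logit = linear(fused, *self.classifier)[0]
        prob = 1.0 / (1.0 + math.exp(-logit))  # probability that a response is required
        return prob, weights


# Stage 1 is assumed done elsewhere: these are placeholder primary feature vectors.
model = ToyActivationDetector(video_dim=6, audio_dim=3)
prob, (w_video, w_audio) = model.predict([0.2] * 6, [0.5] * 3)
print(f"respond probability={prob:.3f}, video weight={w_video:.3f}, audio weight={w_audio:.3f}")
```

Because the weights are random and untrained, the printed probability carries no meaning; the sketch only shows how input-dependent modality weights could feed a single fused vector into a fully connected classifier.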
| Publication year | 2023 |
|---|---|
| Citations | 0 |
| Country of publication | Andorra, China |
| Site | Springer |
| Likes | 0 |