PSNet: position-shift alignment network for image caption


Research field: Verification



Venue: International Journal of Multimedia Information Retrieval


Abstract

Recently, Transformer-based models have gained increasing popularity in image captioning. The global attention mechanism of the Transformer facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, directly fusing the two feature types may inevitably introduce semantic noise, caused by the non-synergistic relationship between region and grid features; meanwhile, the additional detector required to extract region features also decreases the efficiency of the model. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both features. Concretely, we embed a simple detector, DETR, into the model and extract region features from the grid features to improve model efficiency. Moreover, we propose a P-shift alignment module to address the semantic noise caused by the non-synergistic relationship between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset, and the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
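To make the "direct fusion" baseline that the abstract criticizes more concrete, the sketch below concatenates grid tokens and region tokens into one sequence and runs a single global self-attention pass over it, so every grid cell attends to every region and vice versa. This is a minimal illustration under stated assumptions: it uses one attention head, identity query/key/value projections, and toy 2-dimensional features, and the helper names (`self_attention`, `direct_fusion`) are hypothetical. It is not PSNet's P-shift alignment module, whose details are not given here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K and V are the tokens themselves (identity projections),
    which keeps the example small; real models learn these projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for q in tokens:
        # Similarity of this query token to every token in the joint sequence.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a convex combination of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        outputs.append(out)
    return outputs


def direct_fusion(grid: List[Vector], region: List[Vector]) -> List[Vector]:
    """Hypothetical direct fusion: concatenate grid and region tokens, attend globally."""
    if grid and region and len(grid[0]) != len(region[0]):
        raise ValueError("grid and region features must share the same dimension")
    return self_attention(grid + region)


# Usage: a flattened 2x2 grid of features plus one detected region (toy values).
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
region = [[2.0, 0.0]]
fused = direct_fusion(grid, region)
print(len(fused), len(fused[0]))
```

Because the region token is mixed into every grid position with no notion of where the region actually lies on the grid, misaligned regions can leak unrelated content into grid cells; this is one plausible reading of the "semantic noise" that an alignment step is meant to reduce.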


Authors
Lixia Xue

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Awen Zhang

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Ronggui Wang

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China


📄 Paper information

Year of publication: 2023
Citations: 0
Country of publication: Andorra
Source: Springer
