ULTRON: Unifying Local Transformer and Convolution for Large-Scale Image Retrieval


Research field: Verification



Conference: Asian Conference on Computer Vision


Abstract

In large-scale image retrieval, the primary goal is to extract discriminative features and embed them into global image representations. Previous methods based on CNNs effectively learn local features and create robust representations, leading to strong performance. Transformers that excel in learning global context, however, often struggle to extract fine details and therefore do not perform well in large-scale landmark recognition. In this paper, we propose a novel hybrid architecture named ULTRON, which combines transformer blocks with local self-attention and a convolution-based encoder. Our local transformer block contains an advanced self-attention mechanism that enhances the spatial context awareness of key features and updates the value features by considering broader information within fixed-size regional windows. In addition, we have designed a channel-wise dilated convolution that adjusts dilation per channel, enabling effective multiscale feature learning while robustly capturing local features. We focus on learning local contexts throughout the entire network and effectively blending these contexts in the attention-based pooling process. This approach generates a powerful global representation that includes local information, relying solely on classification loss without requiring additional modules to capture local features. Experimental results demonstrate that our model outperforms previous works due to effectively embedding local features into a global representation.
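As a rough illustration of the channel-wise dilated convolution idea mentioned in the abstract, the sketch below applies a depthwise 3x3 filter to each channel with its own dilation rate. It is not the authors' implementation: the abstract does not specify the kernel size, the dilation rates, or the padding, so the 3x3 kernel, the zero padding, and the names `channelwise_dilated_conv`, `kernels`, and `dilations` are all assumptions made for this example. Pure Python is used so that the sketch runs without any dependencies.

```python
def channelwise_dilated_conv(x, kernels, dilations):
    """Depthwise 3x3 cross-correlation with a separate dilation per channel.

    x:         nested lists of shape [C][H][W]
    kernels:   nested lists of shape [C][3][3] (assumed kernel size)
    dilations: list of C positive integers, one dilation rate per channel
    Out-of-range taps are treated as zero (zero padding), so the output
    has the same spatial size as the input.
    """
    out = []
    for channel, kernel, d in zip(x, kernels, dilations):
        h, w = len(channel), len(channel[0])
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for ki in range(3):
                    for kj in range(3):
                        # Tap positions are spread apart by this channel's dilation.
                        ii = i + (ki - 1) * d
                        jj = j + (kj - 1) * d
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[ki][kj] * channel[ii][jj]
                result[i][j] = acc
        out.append(result)
    return out


# Hypothetical toy input: two 5x5 channels, each with a single 1.0 at the center.
impulse = [[1.0 if (r, c) == (2, 2) else 0.0 for c in range(5)] for r in range(5)]
x = [impulse, impulse]
ones = [[1.0] * 3 for _ in range(3)]

# Channel 0 uses dilation 1 (narrow context), channel 1 uses dilation 2 (wider context).
out = channelwise_dilated_conv(x, [ones, ones], [1, 2])
```

With the impulse input, channel 0 responds in the 3x3 neighborhood around the center, while channel 1 responds at positions two pixels apart, reaching the corners of the 5x5 map. This shows how different channels can cover different receptive-field scales within a single layer.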


Author Profile
Minseong Kweon

School of Mechanical Engineering, Pusan National University, Busan, Republic of Korea

Korea
Author Profile
Jinsun Park

School of Computer Science and Engineering, Pusan National University, Busan, Republic of Korea

Andorra

📄 Paper Information

Publication year: 2024
Citations: 0
Publication countries: Andorra, Korea
Site: Springer
