Research field: Artificial Intelligence
Venue: Signal, Image and Video Processing
Convolutional architectures have been highly successful across vision tasks, and their inherent inductive bias makes learning efficient. That same bias, however, may cap the performance they can reach. Vision transformers (ViTs), by contrast, rely on more flexible self-attention layers and have recently surpassed CNNs in image classification. Yet ViTs often require resource-intensive pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we propose an efficient integration of CNNs and ViTs via a hierarchical stage-wise transformer. We introduce convolutional operations for precise feature extraction and devise a distinct module hierarchy that captures both local and global features. The CNN-based encoder and the Transformer-based segmentation network run in parallel. To mitigate the feature misalignment that arises when CNN and Transformer features are combined, we introduce an adaptive feature fusion module. We evaluate the method on several widely used benchmark datasets, where it effectively addresses this misalignment. These gains come without significant computational overhead.
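The abstract does not specify how the adaptive feature fusion module works, so the following is only a minimal sketch of one common way such a module could be built, not the authors' design. It assumes the Transformer branch emits a token sequence that must be reshaped onto the CNN branch's spatial grid, and that a per-channel sigmoid gate computed from globally pooled statistics decides how much of each branch to keep. The names `tokens_to_grid`, `channel_means`, `adaptive_fuse`, `gate_weights`, and `gate_bias` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def tokens_to_grid(tokens, height, width):
    """Reshape a token sequence [N][C] into a channel-first map [C][H][W]."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    channels = len(tokens[0])
    return [
        [[tokens[r * width + c][ch] for c in range(width)] for r in range(height)]
        for ch in range(channels)
    ]


def channel_means(fmap):
    """Global average pooling: one mean per channel of a [C][H][W] map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(cnn_map, trans_tokens, gate_weights, gate_bias):
    """Blend the two branches with a per-channel gate in (0, 1).

    gate_weights[ch] = (w_cnn, w_trans) and gate_bias[ch] are hypothetical
    parameters that would be learned in practice; here they are plain floats.
    """
    height, width = len(cnn_map[0]), len(cnn_map[0][0])
    # Assumed alignment step: put the tokens on the same grid as the CNN map.
    trans_map = tokens_to_grid(trans_tokens, height, width)
    cnn_stats = channel_means(cnn_map)
    trans_stats = channel_means(trans_map)
    fused = []
    for ch, (w_c, w_t) in enumerate(gate_weights):
        g = sigmoid(w_c * cnn_stats[ch] + w_t * trans_stats[ch] + gate_bias[ch])
        fused.append([
            [g * cnn_map[ch][r][c] + (1.0 - g) * trans_map[ch][r][c]
             for c in range(width)]
            for r in range(height)
        ])
    return fused
```

With zero gate weights and zero bias the gate is 0.5 for every channel, so the output is the plain average of the two branches. A large positive bias pushes the gate toward 1 and the output toward the CNN features alone.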
| Publication year | 2023 |
|---|---|
| Citations | 5 |
| Country of publication | China |
| Site | Springer |