Research field: Artificial Intelligence
Conference: 2025 7th International Conference on Intelligent Sustainable Systems (ICISS)
Vision Transformer (ViT) is an image recognition model built on the transformer architecture, which has numerous advantages over Convolutional Neural Networks (CNNs). It offers improved accuracy, scalability, flexibility, global context, and transferability. ViT can handle images of different sizes and aspect ratios, making it more versatile than a CNN. It processes an entire image at once, which allows it to capture global context information and long-range dependencies. Additionally, a ViT pre-trained on large amounts of image data can be transferred to other image recognition tasks, making it a useful tool for transfer learning. This paper describes the differences between ViT and CNN and how ViT splits images into patches for classification. ViT applies positional encoding to the patch features, which removes the need for convolutional filters. The proposed implementation achieved a final top-1 prediction accuracy of 93%.
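
The patch-splitting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, which is not shown here. It assumes a small grayscale image stored as nested lists, non-overlapping square patches, and a plain sequence index standing in for positional information. The names `split_into_patches` and `add_position_index` are hypothetical helpers introduced only for illustration. In a full model each flattened patch would typically also be linearly projected and combined with a positional embedding, which is omitted here.

```python
from typing import List, Tuple

Image = List[List[float]]  # H x W grayscale image as nested lists (illustrative assumption)


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping, flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def add_position_index(patches: List[List[float]]) -> List[Tuple[int, List[float]]]:
    """Pair each flattened patch with its row-major position in the sequence."""
    return list(enumerate(patches))


# Toy 4 x 4 image where each pixel value equals row * 4 + col.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
tokens = add_position_index(patches)
```

With a 4 x 4 image and a patch size of 2, the sketch yields four patches of four values each. The position index is what lets a sequence model tell the patches apart once the 2D layout has been flattened.
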
| Publication year | 2025 |
|---|---|
| Citations | 239 |
| Country of publication | Andorra |
| Site | IEEE |