Object Detection using Vision Transformer and Deep Learning for Computer Vision Applications


Research field: Artificial Intelligence



Conference: 2025 7th International Conference on Intelligent Sustainable Systems (ICISS)


Abstract

Vision Transformer (ViT) is an image recognition model based on the transformer architecture, which has numerous advantages over Convolutional Neural Networks (CNNs). It offers improved accuracy, scalability, flexibility, global context, and transferability. ViT can handle images of different sizes and aspect ratios, making it more versatile than a CNN. It processes an entire image at once, which allows it to capture global context information and long-range dependencies. Additionally, ViT's pre-training on large amounts of image data can be transferred to other image recognition tasks, making it a useful tool for transfer learning. This paper describes the differences between ViT and CNN and how ViT splits images into patches for classification. ViT applies positional encoding to the patch features, which avoids the need for convolutional filters. The proposed implementation achieved a top-1 prediction accuracy of 93%.
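
The patch-splitting and positional-encoding steps mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 32x32 input, the patch size of 8, the embedding width of 64, and the random projection and position embeddings (which would be learned parameters in a trained ViT) are all assumptions chosen for illustration.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches, projection, position_embeddings):
    """Linearly project each patch and add one position embedding per patch."""
    return patches @ projection + position_embeddings


# Hypothetical sizes, not taken from the paper.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = split_into_patches(image, patch_size=8)

embed_dim = 64
# Random stand-ins for parameters that a real ViT would learn during training.
projection = rng.normal(size=(patches.shape[1], embed_dim))
position_embeddings = rng.normal(size=(patches.shape[0], embed_dim))

tokens = embed_patches(patches, projection, position_embeddings)
print(patches.shape, tokens.shape)
```

Under these assumptions the image yields a 4x4 grid of patches, each flattened to 8 * 8 * 3 values. The position embeddings are what tell the transformer where each patch sat in the image, since the flattened token sequence carries no spatial layout by itself.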


Author Profile
Sudarshan S

Electronics and Telecommunication Engineering Center for Computer Vision Research (CCVR) RV College of Engineering® Bengaluru India

Andorra
Author Profile
Mohana

Computer Science and Engineering Center for Computer Vision Research (CCVR) RV College of Engineering® Bengaluru India

Andorra
Author Profile
Ambika G

Electronics and Telecommunication Engineering Center for Computer Vision Research (CCVR) RV College of Engineering® Bengaluru India

Andorra

Paper Information

Publication year: 2025
Citations: 239
Country of publication: Andorra
Site: IEEE