Abstract: In recent years, the Transformer has gradually become a mainstream architecture in computer vision, where its strong expressiveness and high parallelism allow it to match the performance of convolutional neural networks (CNNs). However, applying the attention mechanism to computer vision currently faces two main problems: high computational complexity and the need for large amounts of training data. To address these issues, a category-query-based visual Transformer model (OB_ViT) is proposed. Its innovation lies in two aspects: the introduction of learnable category queries and a loss function based on the Hungarian algorithm. Specifically, learnable category queries are fed to the decoder as input, allowing the model to reason about the relationship between target categories and the global image context. In addition, the Hungarian algorithm enforces unique predictions, ensuring that each category query learns exactly one target category. Experiments on the CIFAR-10 and 5-class Flowers image classification datasets show that, compared with ViT and ResNet50, OB_ViT achieves notably higher accuracy with fewer parameters; on CIFAR-10, for example, it uses 15% fewer parameters while improving accuracy by 22%.
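To make the two ideas concrete, the following minimal PyTorch sketch pairs learnable category queries feeding a Transformer decoder with a Hungarian-matching loss computed via scipy's linear_sum_assignment. This is an illustration under stated assumptions, not the paper's implementation: the names CategoryQueryDecoder and hungarian_loss, and all hyperparameters, are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

class CategoryQueryDecoder(nn.Module):
    """Transformer decoder driven by learnable category queries (a sketch)."""
    def __init__(self, num_classes=10, num_queries=10, d_model=256,
                 nhead=8, num_layers=2):
        super().__init__()
        # One learnable embedding per category query
        # (analogous to DETR's object queries).
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Each query is classified into one of num_classes
        # plus an extra "no class" slot.
        self.classifier = nn.Linear(d_model, num_classes + 1)

    def forward(self, patch_features):
        # patch_features: (B, N_patches, d_model) from the ViT encoder.
        b = patch_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(q, patch_features)  # queries attend to the image context
        return self.classifier(out)            # (B, num_queries, num_classes + 1)

def hungarian_loss(logits, target_classes):
    # logits: (num_queries, num_classes + 1) for one image;
    # target_classes: (num_targets,) ground-truth class indices on the same device.
    probs = logits.softmax(-1)
    # Cost of matching query i to target j is the negative probability
    # query i assigns to target j's class; the Hungarian algorithm then
    # finds the one-to-one assignment with minimal total cost.
    cost = -probs[:, target_classes].detach().cpu().numpy()
    q_idx, t_idx = linear_sum_assignment(cost)
    no_class = logits.size(-1) - 1
    matched = torch.full((logits.size(0),), no_class,
                         dtype=torch.long, device=logits.device)
    matched[torch.as_tensor(q_idx)] = target_classes[torch.as_tensor(t_idx)]
    return F.cross_entropy(logits, matched)

As in DETR-style set prediction, queries left unmatched are supervised toward the extra "no class" slot; it is this one-to-one assignment that forces each category query to specialize on a single target category.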
JIANG Chunyu, WANG Wei. Research on visual Transformers based on class queries[J]. Journal of Jilin Institute of Chemical Technology, 2024, 41(3): 62-67.