Abstract:
Applications of the pure Transformer model to sequences of image patches have achieved promising results, comparable to those of Convolutional Neural Networks (CNNs), the leading models for computer vision tasks. However, one gap of Vision Transformers is their need for large volumes of training data, which makes their behaviour on smaller-scale datasets worth investigating. Despite their rapid advances and wide range of applications, they still lag behind in the field of 3D images. In general, images of low resolution also pose problems for the model's learning curve.
Hence, this study leverages the capability of Vision Transformers (ViTs) to capture global relationships and long-range interdependencies within an image, with the aim of achieving performance comparable to the benchmark established by the MedMNIST3D v2 family of datasets, which offers small-scale images at both high and low resolutions. Previous studies have demonstrated a plethora of methods for treating 3D images, increasing the interest in applying ViT models trained from scratch to this data modality, specifically on small-scale datasets.
The binary classification experiment on the VesselMNIST3D dataset was implemented by treating each 3D image as a video in which the third dimension represents the number of frames, as sketched below. This introduces temporal information for the model to learn, enriching the relationships across spatial information at a higher dimension.
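As a minimal sketch of this reinterpretation (assuming 28x28x28 VesselMNIST3D volumes and an illustrative NumPy layout in which the depth axis becomes the frame axis; the helper name is hypothetical and not the study's own code):

```python
import numpy as np

def volume_to_video(volume: np.ndarray) -> np.ndarray:
    """Reinterpret a (D, H, W) volume as a clip of D single-channel frames."""
    # The depth axis D plays the role of time; a channel axis is appended
    # so every slice becomes one (H, W, 1) frame of the "video".
    frames = np.expand_dims(volume, axis=-1)     # (D, H, W, 1)
    return frames.astype(np.float32) / 255.0     # scale voxel intensities to [0, 1]

# Dummy volume with the same shape as a VesselMNIST3D sample (28x28x28).
dummy_volume = np.random.randint(0, 256, size=(28, 28, 28), dtype=np.uint8)
clip = volume_to_video(dummy_volume)
print(clip.shape)  # (28, 28, 28, 1): 28 frames of 28x28 single-channel images
```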
The study provides a robustness experiment demonstrating the strong performance of the vanilla Video Vision Transformer model, which scores on average 0.877 in Area Under the Curve (AUC) and 0.916 in Accuracy (ACC) across 30 independent experiments.
The study further demonstrates that pretraining the model at a higher resolution improves its learning capacity at the lower resolution, which boosted the AUC score by 3%. The study spans multiple levels of interpretation and cautions for proper inferential results, in order to make the Vision Transformer model competitive in its weak areas, within a domain in pressing need of growth.