Facebook AI Research (FAIR) recently open-sourced Multiscale Vision Transformers (MViT), a deep-learning model for computer vision based on the Transformer architecture. MViT contains several internal resolution-reduction stages and outperforms other Transformer vision models while requiring less compute power, achieving new state-of-the-art accuracy on several benchmarks.
The FAIR team described the model and several experiments in a blog post. MViT modifies the standard Transformer attention scheme, incorporating a pooling mechanism that reduces the visual resolution, while increasing the feature representation, or channel, dimension. In contrast to other computer vision (CV) models based on the Transformer, MViT does not require pre-training and contains fewer parameters, thus requiring less compute power at inference time. In a series of experiments, FAIR showed that MViT outperforms previous work on common video-understanding datasets, including Kinetics, Atomic Visual Actions (AVA), Charades, and Something-Something. According to the researchers,
Though much more work is needed, the advances enabled by MViT could significantly improve detailed human action understanding, which is a crucial component in real-world AI applications such as robotics and autonomous vehicles. In addition, innovations in video recognition architectures are an essential component of robust, safe, and human-centric AI.
Most deep-learning CV models are based on the Convolutional Neural Network (CNN) architecture. Inspired by the structure of the animal visual cortex, a CNN contains several hidden layers that reduce the spatial dimensions of the input image while increasing the channel dimension; the output of each layer is called a feature map. Video-processing models are often based on CNNs extended in the time dimension, taking multiple image frames as input. With the recent success of the Transformer architecture in natural language processing (NLP) tasks, many researchers have explored the application of Transformers to vision problems, such as Google’s Vision Transformer (ViT). Unlike CNNs, however, these Transformer-based architectures do not change the resolution of their internal feature maps, which results in models with very large numbers of parameters, requiring extensive pre-training on large datasets.
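The shrinking-resolution, growing-channel progression of CNN feature maps can be sketched with a toy stage: a 1x1 channel-mixing step followed by 2x2 average pooling. This is purely illustrative (the stage function, weights, and sizes here are invented for the example, not taken from FAIR's code):

```python
import numpy as np

def conv_stage(x, w):
    # Toy CNN stage: a 1x1 "convolution" (per-pixel channel mixing via w)
    # followed by 2x2 average pooling, which halves each spatial dimension.
    H, W, C = x.shape
    x = x @ w                          # (H, W, C_out): channels set by w
    C_out = w.shape[1]
    return x.reshape(H // 2, 2, W // 2, 2, C_out).mean(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))   # toy 32x32 RGB "image"
for c_in, c_out in [(3, 16), (16, 32), (32, 64)]:
    x = conv_stage(x, rng.standard_normal((c_in, c_out)))
    print(x.shape)
# spatial dims shrink 32 -> 16 -> 8 -> 4 while channels grow 3 -> 16 -> 32 -> 64
```

Each stage's output is one of the "feature maps" described above; stacking stages produces the multi-scale pyramid that MViT borrows from CNNs.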
The key idea of MViT is combining the attention mechanism of Transformers with the multi-scale feature maps of CNN-based models. MViT accomplishes this by introducing a scale stage after a sequence of Transformer attention blocks. The scale stage reduces the spatial dimension of its input 4x by applying a pooling operation before applying attention, a combined operation termed Multi-head Pooling Attention (MHPA). The output of the MHPA layer is then processed by a multilayer perceptron (MLP) layer that doubles the channel dimension. The combination of these two operations “roughly preserves the computational complexity across stages.”
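The pool-then-attend idea can be sketched in a few lines of numpy. This is a hypothetical single-head simplification with identity Q/K/V projections and a random MLP weight, meant only to show the shape arithmetic, not FAIR's MHPA implementation:

```python
import numpy as np

def pool_map(x, stride=2):
    # Average-pool an (H, W, C) feature map with the given stride in each
    # spatial dimension; stride=2 cuts the token count 4x.
    H, W, C = x.shape
    return x.reshape(H // stride, stride, W // stride, stride, C).mean(axis=(1, 3))

def pooling_attention(x, stride=2):
    # Sketch of pooling attention: pool first, then self-attend over the
    # reduced token sequence, then widen channels with an MLP.
    H, W, C = x.shape
    tokens = pool_map(x, stride).reshape(-1, C)   # (H*W/4, C) token sequence
    q, k, v = tokens, tokens, tokens              # identity projections for brevity
    scores = q @ k.T / np.sqrt(C)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # MLP (a fixed random projection here) doubles the channel dimension
    rng = np.random.default_rng(0)
    w_mlp = rng.standard_normal((C, 2 * C)) / np.sqrt(C)
    return (out @ w_mlp).reshape(H // stride, W // stride, 2 * C)

y = pooling_attention(np.ones((8, 8, 16)))
print(y.shape)  # (4, 4, 32): spatial resolution down 4x, channels doubled
```

Because attention cost grows quadratically with the token count, quartering the tokens while doubling the channels is what keeps the per-stage compute roughly constant, as the paper's quote above notes.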
The research team trained MViT models of various sizes and evaluated their performance on benchmarks against a baseline “off-the-shelf” ViT model. The small MViT model outperformed the baseline by 7.5 percentage points on the Kinetics-400 dataset while using 5.5x fewer FLOPs. On the Kinetics-600 dataset, the large MViT model set a new state-of-the-art accuracy of 83.4% with 8.4x fewer parameters and 56.0x fewer FLOPs than the baseline. The team also investigated transfer learning, pre-training their models on the Kinetics datasets and then evaluating on AVA, Charades, and Something-Something. In all these scenarios, MViT outperformed previous models. Finally, the team showed that MViT could also perform image recognition, simply by using a single input frame. Again, MViT outperformed other Transformer models while using fewer parameters and FLOPs.
In the Pyramidion, a trainable pooling was applied between layers, introducing the bottleneck gradually along the encoding process…As in MViT, it leads to better results & complexity.
The MViT code and pre-trained models are available as part of FAIR’s PySlowFast video-understanding codebase.