Scattering Vision Transformer: Spectral Mixing Matters

Badri N. Patro, Vijay Srinivas Agneeswaran

Microsoft

[NeurIPS 2023] [NeurIPS PPT] [Code]

Abstract

Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state-of-the-art for base versions) and SVT-H-L reaches 85.7% (again state-of-the-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation, and it outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flowers, and Stanford Cars.
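To make the two ideas in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a simple block-average split (low-frequency part plus a high-frequency residual) in place of the scattering network, and it uses `np.einsum` to illustrate Einstein-multiplication token and channel mixing. All function names and weight shapes below are hypothetical, chosen only for illustration.

```python
import numpy as np

def split_low_high(x):
    """Split an (H, W, C) map into a low-frequency part and a high-frequency residual.

    Assumption: a 2x2 block average stands in for the low-pass branch.
    H and W must be even.
    """
    h, w, c = x.shape
    low = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))   # (H/2, W/2, C)
    high = x - np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)   # (H, W, C)
    return low, high

def merge_low_high(low, high):
    """Exact inverse of split_low_high: upsample the low part and add the residual."""
    return np.repeat(np.repeat(low, 2, axis=0), 2, axis=1) + high

def mix_channels(x, w_channel):
    """Channel mixing via Einstein summation; w_channel has shape (C, C)."""
    return np.einsum("hwc,cd->hwd", x, w_channel)

def mix_tokens(x, w_token):
    """Token mixing via Einstein summation; w_token has shape (H*W, H*W)."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)
    return np.einsum("nc,nm->mc", tokens, w_token).reshape(h, w, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
low, high = split_low_high(x)

# Unlike plain pooling, keeping the residual makes the split invertible.
assert np.allclose(merge_low_high(low, high), x)
```

Pooling alone would keep only `low` and discard `high`, which is the information loss the abstract refers to. Keeping both parts lets the original map be recovered exactly, and the einsum-based mixers can then be applied to whichever component a model chooses to gate.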

NeurIPS 2023 Poster

SVT Main Diagram

SOTA result on the ImageNet1K dataset for image size 224 x 224

SVT Performance Plots (Hierarchical)

SVT Performance Plots (Vanilla)

Filter Visualization

High Filter Visualization: All Six Directional Components

BibTeX

@article{patro2023svt,
  author  = {Patro, Badri N. and Agneeswaran, Vijay Srinivas},
  title   = {Scattering Vision Transformer: Spectral Mixing Matters},
  journal = {arXiv preprint arXiv:2311.02446},
  year    = {2023}
}