ViViT (Video Vision Transformer)

The ViViT model was introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository.

Disclaimer: The team releasing ViViT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

ViViT extends the Vision Transformer (ViT) to video: it extracts spatio-temporal tokens from the input video and encodes them with a series of transformer layers.

We refer to the paper for details.
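
As a rough illustration of how a video clip becomes a token sequence, the sketch below counts non-overlapping "tubelets" (small space-time blocks, the tokenization the paper calls tubelet embedding). The clip size (32 frames at 224x224 pixels) and the tubelet size (2 frames by 16x16 pixels, which is how the "16x2" in the google/vivit-b-16x2-kinetics400 checkpoint name is read here) are assumptions made for illustration, and `count_tubelet_tokens` is a hypothetical helper, not part of any library.

```python
def count_tubelet_tokens(num_frames, height, width, tubelet=(2, 16, 16)):
    """Return the number of non-overlapping tubelets in a clip.

    `tubelet` is (frames, pixels high, pixels wide). Hypothetical helper
    for illustration only.
    """
    t, h, w = tubelet
    if num_frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (num_frames // t) * (height // h) * (width // w)


# Assumed input: 32 frames of 224x224 pixels.
tokens = count_tubelet_tokens(32, 224, 224)
print(tokens)  # 16 * 14 * 14 = 3136
```

Under these assumptions a single clip yields a few thousand tokens, far more than a single image in ViT, which is one reason the paper also studies factorised variants of the model.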

Intended uses & limitations

The model is mostly intended to be fine-tuned on a downstream task, such as video classification. See the model hub for fine-tuned versions on a task that interests you.
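
A minimal sketch of preparing the model for fine-tuning on your own classes, assuming the Hugging Face `transformers` library and the checkpoint id google/vivit-b-16x2-kinetics400. The helper names `build_label_maps` and `load_for_finetuning` are hypothetical; the training loop itself is left out.

```python
def build_label_maps(class_names):
    """Map integer ids to class names and back, in the given order."""
    id2label = dict(enumerate(class_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(class_names, checkpoint="google/vivit-b-16x2-kinetics400"):
    """Load the pretrained backbone with a newly initialised classification head.

    The checkpoint id is an assumption; swap in the one you want to start from.
    """
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import VivitForVideoClassification

    id2label, label2id = build_label_maps(class_names)
    return VivitForVideoClassification.from_pretrained(
        checkpoint,
        id2label=id2label,
        label2id=label2id,
        # The pretrained head has a different number of outputs, so allow
        # it to be dropped and re-initialised for the new label set.
        ignore_mismatched_sizes=True,
    )
```

For example, `load_for_finetuning(["walking", "running"])` would return a model whose final layer has two outputs and still needs to be trained on labelled clips before its predictions mean anything.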

How to use

For code examples, see the documentation.
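
As a starting point, here is a hedged sketch of video classification with the `transformers` classes `VivitImageProcessor` and `VivitForVideoClassification`. It assumes the checkpoint id google/vivit-b-16x2-kinetics400, a 32-frame input clip, and frames that have already been decoded into NumPy arrays; `sample_frame_indices` and `classify_clip` are hypothetical helper names.

```python
def sample_frame_indices(num_video_frames, clip_len=32):
    """Return `clip_len` evenly spaced frame indices covering the whole video."""
    if num_video_frames < 1 or clip_len < 2:
        raise ValueError("need at least one video frame and clip_len >= 2")
    last = num_video_frames - 1
    return [(i * last) // (clip_len - 1) for i in range(clip_len)]


def classify_clip(frames, checkpoint="google/vivit-b-16x2-kinetics400"):
    """Classify a clip given as a list of HxWx3 uint8 NumPy frames.

    The checkpoint id and the expected clip length are assumptions.
    """
    # Imported lazily so sample_frame_indices works without these packages.
    import torch
    from transformers import VivitForVideoClassification, VivitImageProcessor

    processor = VivitImageProcessor.from_pretrained(checkpoint)
    model = VivitForVideoClassification.from_pretrained(checkpoint)
    model.eval()

    # The processor resizes and normalises the frames and stacks them
    # into a single batched tensor.
    inputs = processor(list(frames), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_id = logits.argmax(-1).item()
    return model.config.id2label[predicted_id]
```

In practice you would decode the video with a library of your choice, keep the frames at `sample_frame_indices(total_frames)`, and pass that list to `classify_clip`, which returns the predicted class name.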

BibTeX entry and citation info

@misc{arnab2021vivit,
      title={ViViT: A Video Vision Transformer}, 
      author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
      year={2021},
      eprint={2103.15691},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

