I am a Research Scientist at Google DeepMind working primarily on multimodal understanding and generation. I completed my PhD with Philip Torr at the University of Oxford, where I focused on deep structured models for pixel-level scene understanding. Prior to that, I completed my undergraduate degree at the University of Cape Town.

An up-to-date list of publications is available on Google Scholar.
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
Neural Information Processing Systems (NeurIPS), 2024

Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Neural Information Processing Systems (NeurIPS), 2024

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team

Streaming Dense Video Captioning
Xingyi Zhou*, Anurag Arnab*, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2024

Time-, Memory- and Parameter-Efficient Visual Adaptation
Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab
Computer Vision and Pattern Recognition (CVPR), 2024
Highlight paper

End-to-End Spatio-Temporal Action Localisation with Video Transformers
Alexey A Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lucic, Cordelia Schmid, Anurag Arnab
Computer Vision and Pattern Recognition (CVPR), 2024

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim
Computer Vision and Pattern Recognition (CVPR), 2024
Highlight paper

Pixel Aligned Language Models
Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2024

PaLI-X: On Scaling up a Multilingual Vision and Language Model
Google
Computer Vision and Pattern Recognition (CVPR), 2024

VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael Ryoo
Computer Vision and Pattern Recognition (CVPR), 2024

Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
International Conference on Computer Vision (ICCV), 2023

UnLoc: A Unified Framework for Video Localization Tasks
Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
International Conference on Computer Vision (ICCV), 2023

Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
Neural Information Processing Systems (NeurIPS), 2023

How Can Objects Help Action Recognition?
Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2023

Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
Computer Vision and Pattern Recognition (CVPR), 2023

Scaling Vision Transformers to 22 Billion Parameters
Google Research
International Conference on Machine Learning (ICML), 2023

Adaptive Computation with Elastic Input Sequence
Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You
International Conference on Machine Learning (ICML), 2023

PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov*, Anurag Arnab*, Krzysztof Marcin Choromanski, Mario Lucic, Yi Tay, Mostafa Dehghani*
Transactions on Machine Learning Research (TMLR), 2022

Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer*, Alexey Gritsenko*, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby
European Conference on Computer Vision (ECCV), 2022

M&M Mix: A Multimodal Multiview Transformer Ensemble
Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid
Winner of the Epic Kitchens Action Recognition Challenge at CVPR 2022

Multiview Transformers for Video Recognition
Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2022

End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2022

Learning with Neighbor Consistency for Noisy Labels
Ahmet Iscen, Jack Valmadre, Anurag Arnab, Cordelia Schmid
Computer Vision and Pattern Recognition (CVPR), 2022

The Efficiency Misnomer
Mostafa Dehghani*, Anurag Arnab*, Lucas Beyer*, Ashish Vaswani, Yi Tay*
International Conference on Learning Representations (ICLR), 2022

Scenic: A JAX library for Computer Vision Research and Beyond
Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay
Computer Vision and Pattern Recognition (CVPR) Demo, 2022

ViViT: A Video Vision Transformer
Anurag Arnab*, Mostafa Dehghani*, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Schmid
International Conference on Computer Vision (ICCV), 2021

Unified Graph Structured Models for Video Understanding
Anurag Arnab, Chen Sun, Cordelia Schmid
International Conference on Computer Vision (ICCV), 2021

Compressive Visual Representations
Kuang-Huei Lee*, Anurag Arnab*, Sergio Guadarrama, John Canny, Ian Fischer*
Conference on Neural Information Processing Systems (NeurIPS), 2021

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
Conference on Neural Information Processing Systems (NeurIPS), 2021

Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Conference on Neural Information Processing Systems (NeurIPS), 2021

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
European Conference on Computer Vision (ECCV), 2020

Dynamic Graph Message Passing Networks
Li Zhang, Dan Xu, Anurag Arnab, Philip H.S. Torr
Computer Vision and Pattern Recognition (CVPR), 2020
Oral presentation

Meta-Learning Deep Visual Words for Fast Video Object Segmentation
Harkirat Singh Behl, Mohammad Najafi, Anurag Arnab, Philip H.S. Torr
Intelligent Robots and Systems (IROS), 2020
NeurIPS Machine Learning for Autonomous Driving Workshop, 2019

Exploiting Temporal Context for 3D Human Pose Estimation In The Wild
Anurag Arnab*, Carl Doersch*, Andrew Zisserman
Computer Vision and Pattern Recognition (CVPR), 2019

Dual Graph Convolutional Network for Semantic Segmentation
Li Zhang*, Xiangtai Li*, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, Philip H.S. Torr
British Machine Vision Conference (BMVC), 2019

Weakly- and Semi-Supervised Panoptic Segmentation
Qizhu Li*, Anurag Arnab*, Philip H.S. Torr
European Conference on Computer Vision (ECCV), 2018

On the Robustness of Semantic Segmentation Models to Adversarial Attacks
Anurag Arnab, Ondrej Miksik, Philip H.S. Torr
Computer Vision and Pattern Recognition (CVPR), 2018
Pattern Analysis and Machine Intelligence (PAMI), 2019

Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation
Anurag Arnab, Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Måns Larsson, Alexander Kirillov, Bogdan Savchynskyy, Carsten Rother, Fredrik Kahl, Philip H.S. Torr
IEEE Signal Processing Magazine, 2018

Revisiting Deep Structured Models for Pixel-Level Labeling with Gradient-Based Inference
Måns Larsson, Anurag Arnab, Shuai Zheng, Philip H.S. Torr, Fredrik Kahl
SIAM Journal on Imaging Sciences, 2018

Pixelwise Instance Segmentation with a Dynamically Instantiated Network
Anurag Arnab, Philip H.S. Torr
Computer Vision and Pattern Recognition (CVPR), 2017

Holistic, Instance-level Human Parsing
Qizhu Li*, Anurag Arnab*, Philip H.S. Torr
British Machine Vision Conference (BMVC), 2017

A Projected Gradient Descent Method for CRF Inference allowing End-To-End Training of Arbitrary Pairwise Potentials
Måns Larsson, Anurag Arnab, Fredrik Kahl, Shuai Zheng, Philip H.S. Torr
Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2017

Higher Order Conditional Random Fields in Deep Neural Networks
Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, Philip H.S. Torr
European Conference on Computer Vision (ECCV), 2016

Bottom-up Instance Segmentation using Deep Higher-Order CRFs
Anurag Arnab, Philip H.S. Torr
British Machine Vision Conference (BMVC), 2016

Joint Object-Material Category Segmentation from Audio-Visual Cues
Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip H.S. Torr
British Machine Vision Conference (BMVC), 2015

SemanticPaint: A Framework for the Interactive Segmentation of 3D Scenes
Stuart Golodetz, Michael Sapienza, Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, Anurag Arnab, Victor Adrian Prisacariu, Olaf Kaehler, Carl Yuheng Ren, David W. Murray, Shahram Izadi, Philip H.S. Torr
ACM SIGGRAPH 2015 Emerging Technologies, 2015 (live demo)
arXiv 1510.03727, 2015

Pixel-level Scene Understanding with Deep Structured Models
Anurag Arnab
University of Oxford, 2019

Advanced Architectures for Vision
Invited talk at the African Computer Vision Summer School (ACVSS) in Nairobi, Kenya. July 2024.
[Slides]

Large-Scale Video Understanding with Transformers
Invited talk at GIST Workshop for Accelerating Intelligence at GIST, South Korea. December 2022.
Invited talk at Google Visits POSTECH at POSTECH, South Korea. December 2022.
[Slides]

Large-Scale Video Understanding with Transformers
Invited talk at Holistic Video Understanding Workshop at CVPR. June 2022.
[Slides]

Winning entry to the Epic Kitchens Action Recognition Challenge
Invited talk at Epic Kitchens Workshop at CVPR. June 2022.
[Slides]

Video Understanding with Imperfect Data
Invited talk at Learning from Limited and Imperfect Data (L2ID) workshop at CVPR. June 2021.
[Slides]

Transformers: A Review, and Recent Developments in Vision
Invited lecture at Deep Learning Indaba X Tanzania. June 2021.
[Slides]

Structured Models for Video Understanding
Invited talk at Ulsan National Institute of Science and Technology (UNIST), South Korea. June 2021
[Slides]

Video Understanding in the Wild with Incomplete Supervision
Invited talk at 1st Visual Intelligence Seminar at Fudan University, China. January 2021
[Slides]

Scene Understanding with Deep Structured Models
Invited talk at University of Warsaw. January 2020
[Slides]

Learning from Weak Supervision: Panoptic Segmentation and 3D Human Pose Estimation
Invited talk at Learning from Imperfect Data Workshop at CVPR. June 2019
[Slides]

Pixelwise Instance Segmentation with a Dynamically Instantiated Network
ETH Zurich, August 2017
[Slides]

Holistic Scene Understanding with Deep Learning and Dense Random Fields
Invited tutorial at Deep Learning Meets Model Optimization and Statistical Inference at European Conference on Computer Vision (ECCV), October 2016.
[Slides]

Joint Object-Material Category Segmentation from Audio-Visual Cues
Vision and Learning Seminar (Online), February 2016
[Video]

Joint Object-Material Category Segmentation from Audio-Visual Cues
CVSSP Seminar, University of Surrey, November 2015
[Slides]