2110.06915
Object-Region Video Transformers
13 October 2021
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
Papers citing "Object-Region Video Transformers"
50 / 66 papers shown
Title
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
51
0
0
30 Mar 2025
Extending Video Masked Autoencoders to 128 frames
N. B. Gundavarapu
Luke Friedman
Raghav Goyal
Chaitra Hegde
Eirikur Agustsson
...
Mikhail Sirotenko
Ming Yang
Tobias Weyand
Boqing Gong
Leonid Sigal
82
1
0
20 Nov 2024
SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining
Ruiqi Xian
Xiyang Wu
Tianrui Guan
Xijun Wang
Boqing Gong
Dinesh Manocha
ViT
39
0
0
26 Sep 2024
Mamba Fusion: Learning Actions Through Questioning
Zhikang Dong
Apoorva Beedu
Jason Sheinkopf
Irfan Essa
Mamba
70
2
0
17 Sep 2024
ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers
Luoyu Mei
Shuai Wang
Yun Cheng
Ruofeng Liu
Zhimeng Yin
Wenchao Jiang
Shuai Wang
Wei Gong
30
5
0
02 Sep 2024
Classification Matters: Improving Video Action Detection with Class-Specific Attention
Jinsung Lee
Taeoh Kim
Inwoong Lee
Minho Shim
Dongyoon Wee
Minsu Cho
Suha Kwak
46
0
0
29 Jul 2024
Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Rui Qian
Shuangrui Ding
Dahua Lin
OCL
52
1
0
09 Jul 2024
PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition
Y. Hao
Diansong Zhou
Zhicai Wang
Chong-Wah Ngo
Meng Wang
ViT
32
4
0
03 Jul 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&Ro
VLM
45
23
0
17 Jun 2024
A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection
Matthew Korban
Peter Youngs
Scott T. Acton
ViT
29
6
0
13 May 2024
Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition
Xunsong Li
Pengzhan Sun
Yangcen Liu
Lixin Duan
Wen Li
43
3
0
18 Apr 2024
Learning Correlation Structures for Vision Transformers
Manjin Kim
Paul Hongsuck Seo
Cordelia Schmid
Minsu Cho
ViT
37
7
0
05 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
41
32
0
01 Apr 2024
Learning Causal Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition
Yuke Li
Guangyi Chen
Ben Abramowitz
Stefano Anzellotti
Donglai Wei
TTA
40
1
0
20 Feb 2024
Memory Consolidation Enables Long-Context Video Understanding
Ivana Balažević
Yuge Shi
Pinelopi Papalampidi
Rahma Chaabouni
Skanda Koppula
Olivier J. Hénaff
99
22
0
08 Feb 2024
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition
Guangzhao Dai
Xiangbo Shu
Wenhao Wu
Rui Yan
Jiachao Zhang
VLM
24
5
0
18 Jan 2024
TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding
Yun-Hai Liu
Haolin Yang
Xu Si
Ling Liu
Zipeng Li
Yuxiang Zhang
Yebin Liu
Li Yi
59
22
0
16 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023
ST(OR)^2: Spatio-Temporal Object Level Reasoning for Activity Recognition in the Operating Room
Idris Hamoud
Muhammad Abdullah Jamal
V. Srivastav
Didier Mutter
N. Padoy
Omid Mohareri
21
2
0
19 Dec 2023
Encoding Surgical Videos as Latent Spatiotemporal Graphs for Object and Anatomy-Driven Reasoning
Aditya Murali
Deepak Alapatt
Pietro Mascagni
Armine Vardazaryan
Alain Garcia
Nariaki Okamoto
Didier Mutter
N. Padoy
MedIm
36
7
0
11 Dec 2023
CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee
Jongseo Lee
Jinwoo Choi
EgoV
35
12
0
30 Nov 2023
DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding
Kyungho Bae
Geo Ahn
Youngrae Kim
Jinwoo Choi
30
2
0
30 Nov 2023
Object-based (yet Class-agnostic) Video Domain Adaptation
Dantong Niu
Amir Bar
Roei Herzig
Trevor Darrell
Anna Rohrbach
22
1
0
29 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
33
80
0
27 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
15
14
0
31 Oct 2023
Opening the Vocabulary of Egocentric Actions
Dibyadip Chatterjee
Fadime Sener
Shugao Ma
Angela Yao
VLM
41
16
0
22 Aug 2023
Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos
Sanket Thakur
Cigdem Beyan
Pietro Morerio
Vittorio Murino
Alessio Del Bue
26
11
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
26
19
0
15 Aug 2023
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
Shuangrui Ding
Peisen Zhao
Xiaopeng Zhang
Rui Qian
H. Xiong
Qi Tian
ViT
29
16
0
08 Aug 2023
Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun
Calvin Luo
Xingyi Zhou
Anurag Arnab
Cordelia Schmid
OCL
LRM
ViT
35
3
0
17 Jul 2023
Multimodal Distillation for Egocentric Action Recognition
Gorjan Radevski
Dusan Grujicic
Marie-Francine Moens
Matthew Blaschko
Tinne Tuytelaars
EgoV
23
23
0
14 Jul 2023
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Syed Talal Wasim
Muhammad Uzair Khattak
Muzammal Naseer
Salman Khan
M. Shah
F. Khan
ViT
51
19
0
13 Jul 2023
Free-Form Composition Networks for Egocentric Action Recognition
Haoran Wang
Qinghua Cheng
Baosheng Yu
Yibing Zhan
Dapeng Tao
Liang Ding
Haibin Ling
EgoV
52
0
0
13 Jul 2023
A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Zhonghan Zhao
Wenhao Chai
Shengyu Hao
Wenhao Hu
Guanhong Wang
Shidong Cao
Min-Gyoo Song
Jenq-Neng Hwang
Gaoang Wang
32
17
0
07 Jul 2023
How can objects help action recognition?
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
35
14
0
20 Jun 2023
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Roei Herzig
Donghyun Kim
...
Rameswar Panda
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
38
52
0
31 May 2023
Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective
Thanh-Dat Truong
Khoa Luu
EgoV
27
10
0
25 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig
Alon Mendelson
Leonid Karlinsky
Assaf Arbelle
Rogerio Feris
Trevor Darrell
Amir Globerson
VLM
32
31
0
10 May 2023
Modelling Spatio-Temporal Interactions for Compositional Action Recognition
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
43
1
0
04 May 2023
SVT: Supertoken Video Transformer for Efficient Video Understanding
Chen-Ming Pan
Rui Hou
Hanchao Yu
Qifan Wang
Senem Velipasalar
Madian Khabsa
ViT
21
0
0
01 Apr 2023
CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection
Lei Chen
Zhan Tong
Yibing Song
Gangshan Wu
Limin Wang
25
3
0
28 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Jungin Park
Jiyoung Lee
K. Sohn
ViT
21
37
0
17 Mar 2023
EgoViT: Pyramid Video Transformer for Egocentric Action Recognition
Chen-Ming Pan
Zhiqi Zhang
Senem Velipasalar
Yi Tian Xu
ViT
12
1
0
15 Mar 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
32
8
0
20 Feb 2023
AIM: Adapting Image Models for Efficient Video Action Recognition
Taojiannan Yang
Yi Zhu
Yusheng Xie
Aston Zhang
Cheng Chen
Mu Li
ViT
58
144
0
06 Feb 2023
OAMixer: Object-aware Mixing Layer for Vision Transformers
H. Kang
Sangwoo Mo
Jinwoo Shin
VLM
39
4
0
13 Dec 2022
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig
Ofir Abramovich
Elad Ben-Avraham
Assaf Arbelle
Leonid Karlinsky
Ariel Shamir
Trevor Darrell
Amir Globerson
41
16
0
08 Dec 2022
Interaction Region Visual Transformer for Egocentric Action Anticipation
Debaditya Roy
Ramanathan Rajendiran
Basura Fernando
36
15
0
25 Nov 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Rameswar Panda
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
48
70
0
21 Nov 2022
Students taught by multimodal teachers are superior action recognizers
Gorjan Radevski
Dusan Grujicic
Matthew Blaschko
Marie-Francine Moens
Tinne Tuytelaars
21
1
0
09 Oct 2022