Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2204.02547
Cited By
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
6 April 2022
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation"
50 / 56 papers shown
Title
MetaFormer Is Actually What You Need for Vision
Weihao Yu
Mi Luo
Pan Zhou
Chenyang Si
Yichen Zhou
Xinchao Wang
Jiashi Feng
Shuicheng Yan
170
911
0
22 Nov 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
111
45
0
21 Sep 2021
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Jiawei Chen
C. Ho
ViT
77
77
0
20 Aug 2021
Vision-Language Transformer and Query Generation for Referring Segmentation
Henghui Ding
Chang-rui Liu
Suchen Wang
Xudong Jiang
78
266
0
12 Aug 2021
Full-Duplex Strategy for Video Object Segmentation
Ge-Peng Ji
Deng-Ping Fan
Keren Fu
Zhe Wu
Jianbing Shen
Ling Shao
VOS
118
133
0
06 Aug 2021
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation
Tianrui Hui
Shaofei Huang
Si Liu
Zihan Ding
Guanbin Li
Wenguan Wang
Jizhong Han
Fei Wang
67
49
0
14 May 2021
Visual Saliency Transformer
Nian Liu
Ni Zhang
Kaiyuan Wan
Ling Shao
Junwei Han
ViT
297
359
0
25 Apr 2021
Self-supervised Video Object Segmentation by Motion Grouping
Charig Yang
Hala Lamdouar
Erika Lu
Andrew Zisserman
Weidi Xie
VOS
OCL
80
161
0
15 Apr 2021
Weakly Supervised Video Salient Object Detection
Wangbo Zhao
Jing Zhang
Long Li
Nick Barnes
Nian Liu
Junwei Han
60
61
0
06 Apr 2021
Locate then Segment: A Strong Pipeline for Referring Image Segmentation
Ya Jing
Tao Kong
Wei Wang
Liang Wang
Lei Li
Tieniu Tan
70
136
0
30 Mar 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
222
2,150
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
453
21,439
0
25 Mar 2021
Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
Hao Zhang
Aixin Sun
Wei Jing
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
149
86
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
84
300
0
22 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
124
664
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
389
2,053
0
09 Feb 2021
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li-xin Yuan
Yunpeng Chen
Tao Wang
Weihao Yu
Yujun Shi
Zihang Jiang
Francis E. H. Tay
Jiashi Feng
Shuicheng Yan
ViT
133
1,939
0
28 Jan 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
659
41,103
0
22 Oct 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
227
5,080
0
08 Oct 2020
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
Míriam Bellver
Carles Ventura
Carina Silberer
Ioannis V. Kazakos
Jordi Torres
Xavier Giró-i-Nieto
VOS
70
33
0
01 Oct 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
421
13,048
0
26 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
118
503
0
01 May 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
145
440
0
02 Apr 2020
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed
Jia Deng
MDE
244
2,625
0
26 Mar 2020
Distilling Knowledge from Graph Convolutional Networks
Yiding Yang
Jiayan Qiu
Xiuming Zhang
Dacheng Tao
Xinchao Wang
210
232
0
23 Mar 2020
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Liujuan Cao
Chenglin Wu
Cheng Deng
Rongrong Ji
ObjD
253
291
0
19 Mar 2020
Motion-Attentive Transition for Zero-Shot Video Object Segmentation
Tianfei Zhou
Shunzhou Wang
Yi Zhou
Yazhou Yao
Jianwu Li
Ling Shao
VOS
175
189
0
09 Mar 2020
Motion Guided Attention for Video Salient Object Detection
Haofeng Li
Guanqi Chen
Guanbin Li
Yizhou Yu
93
167
0
16 Sep 2019
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye
Mrigank Rochan
Zhi Liu
Yang Wang
EgoV
52
477
0
09 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
79
1,246
0
03 Apr 2019
Cross-task weakly supervised learning from instructional videos
Dimitri Zhukov
Jean-Baptiste Alayrac
R. G. Cinbis
David Fouhey
Ivan Laptev
Josef Sivic
SSL
124
249
0
19 Mar 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.8K
94,891
0
11 Oct 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
Youngjae Yu
Jongseok Kim
Gunhee Kim
91
345
0
07 Aug 2018
Stacked Cross Attention for Image-Text Matching
Kuang-Huei Lee
Xi Chen
G. Hua
Houdong Hu
Xiaodong He
89
1,151
0
21 Mar 2018
Actor and Action Video Segmentation from a Sentence
Kirill Gavrilyuk
Amir Ghodrati
Zhenyang Li
Cees G. M. Snoek
VLM
73
149
0
20 Mar 2018
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
Deqing Sun
Xiaodong Yang
Ming-Yuan Liu
Jan Kautz
3DPC
259
2,445
0
07 Sep 2017
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
115
946
0
04 Aug 2017
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen
George Papandreou
Florian Schroff
Hartwig Adam
SSeg
232
8,473
0
17 Jun 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
713
131,652
0
12 Jun 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
235
8,019
0
22 May 2017
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
139
1,248
0
02 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
75
558
0
14 Apr 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
75
827
0
28 Mar 2017
Recurrent Multimodal Interaction for Referring Image Segmentation
Chenxi Liu
Zhe Lin
Xiaohui Shen
Jimei Yang
Xin Lu
Alan Yuille
EgoV
73
239
0
23 Mar 2017
Deformable Convolutional Networks
Jifeng Dai
Haozhi Qi
Yuwen Xiong
Yi Li
Guodong Zhang
Han Hu
Yichen Wei
201
5,334
0
17 Mar 2017
FusionSeg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos
Enis Berk Çoban
Bo Xiong
Michael I. Mandel
VOS
118
382
0
19 Jan 2017
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
413
10,494
0
21 Jul 2016
TGIF: A New Dataset and Benchmark on Animated GIF Description
Yuncheng Li
Yale Song
Liangliang Cao
Joel R. Tetreault
Larry Goldberg
A. Jaimes
Jiebo Luo
79
271
0
10 Apr 2016
Segmentation from Natural Language Expressions
Ronghang Hu
Marcus Rohrbach
Trevor Darrell
VLM
EgoV
74
435
0
20 Mar 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
2.2K
194,020
0
10 Dec 2015
1
2
Next