Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

6 April 2022
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
arXiv: 2204.02547

Papers citing "Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation"

50 / 56 papers shown
MetaFormer Is Actually What You Need for Vision
Weihao Yu
Mi Luo
Pan Zhou
Chenyang Si
Yichen Zhou
Xinchao Wang
Jiashi Feng
Shuicheng Yan
170
911
0
22 Nov 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM, ViT
111
45
0
21 Sep 2021
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Jiawei Chen
C. Ho
ViT
77
77
0
20 Aug 2021
Vision-Language Transformer and Query Generation for Referring Segmentation
Henghui Ding
Chang-rui Liu
Suchen Wang
Xudong Jiang
78
266
0
12 Aug 2021
Full-Duplex Strategy for Video Object Segmentation
Ge-Peng Ji
Deng-Ping Fan
Keren Fu
Zhe Wu
Jianbing Shen
Ling Shao
VOS
118
133
0
06 Aug 2021
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation
Tianrui Hui
Shaofei Huang
Si Liu
Zihan Ding
Guanbin Li
Wenguan Wang
Jizhong Han
Fei Wang
67
49
0
14 May 2021
Visual Saliency Transformer
Nian Liu
Ni Zhang
Kaiyuan Wan
Ling Shao
Junwei Han
ViT
297
359
0
25 Apr 2021
Self-supervised Video Object Segmentation by Motion Grouping
Charig Yang
Hala Lamdouar
Erika Lu
Andrew Zisserman
Weidi Xie
VOS, OCL
80
161
0
15 Apr 2021
Weakly Supervised Video Salient Object Detection
Wangbo Zhao
Jing Zhang
Long Li
Nick Barnes
Nian Liu
Junwei Han
60
61
0
06 Apr 2021
Locate then Segment: A Strong Pipeline for Referring Image Segmentation
Ya Jing
Tao Kong
Wei Wang
Liang Wang
Lei Li
Tieniu Tan
70
136
0
30 Mar 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
222
2,150
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
453
21,439
0
25 Mar 2021
Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
Hao Zhang
Aixin Sun
Wei Jing
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
149
86
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
84
300
0
22 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
124
664
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
389
2,053
0
09 Feb 2021
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li-xin Yuan
Yunpeng Chen
Tao Wang
Weihao Yu
Yujun Shi
Zihang Jiang
Francis E. H. Tay
Jiashi Feng
Shuicheng Yan
ViT
133
1,939
0
28 Jan 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
659
41,103
0
22 Oct 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
227
5,080
0
08 Oct 2020
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
Míriam Bellver
Carles Ventura
Carina Silberer
Ioannis V. Kazakos
Jordi Torres
Xavier Giró-i-Nieto
VOS
70
33
0
01 Oct 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT, 3DV, PINN
421
13,048
0
26 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM, VLM, OffRL, AI4TS
118
503
0
01 May 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
145
440
0
02 Apr 2020
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed
Jia Deng
MDE
244
2,625
0
26 Mar 2020
Distilling Knowledge from Graph Convolutional Networks
Yiding Yang
Jiayan Qiu
Xiuming Zhang
Dacheng Tao
Xinchao Wang
210
232
0
23 Mar 2020
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Liujuan Cao
Chenglin Wu
Cheng Deng
Rongrong Ji
ObjD
253
291
0
19 Mar 2020
Motion-Attentive Transition for Zero-Shot Video Object Segmentation
Tianfei Zhou
Shunzhou Wang
Yi Zhou
Yazhou Yao
Jianwu Li
Ling Shao
VOS
175
189
0
09 Mar 2020
Motion Guided Attention for Video Salient Object Detection
Haofeng Li
Guanqi Chen
Guanbin Li
Yizhou Yu
93
167
0
16 Sep 2019
Cross-Modal Self-Attention Network for Referring Image Segmentation
Linwei Ye
Mrigank Rochan
Zhi Liu
Yang Wang
EgoV
52
477
0
09 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM, SSL
79
1,246
0
03 Apr 2019
Cross-task weakly supervised learning from instructional videos
Dimitri Zhukov
Jean-Baptiste Alayrac
R. G. Cinbis
David Fouhey
Ivan Laptev
Josef Sivic
SSL
124
249
0
19 Mar 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM, SSL, SSeg
1.8K
94,891
0
11 Oct 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
Youngjae Yu
Jongseok Kim
Gunhee Kim
91
345
0
07 Aug 2018
Stacked Cross Attention for Image-Text Matching
Kuang-Huei Lee
Xi Chen
G. Hua
Houdong Hu
Xiaodong He
89
1,151
0
21 Mar 2018
Actor and Action Video Segmentation from a Sentence
Kirill Gavrilyuk
Amir Ghodrati
Zhenyang Li
Cees G. M. Snoek
VLM
73
149
0
20 Mar 2018
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
Deqing Sun
Xiaodong Yang
Ming-Yu Liu
Jan Kautz
3DPC
259
2,445
0
07 Sep 2017
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
115
946
0
04 Aug 2017
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen
George Papandreou
Florian Schroff
Hartwig Adam
SSeg
232
8,473
0
17 Jun 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
713
131,652
0
12 Jun 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
235
8,019
0
22 May 2017
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
139
1,248
0
02 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
75
558
0
14 Apr 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
75
827
0
28 Mar 2017
Recurrent Multimodal Interaction for Referring Image Segmentation
Chenxi Liu
Zhe Lin
Xiaohui Shen
Jimei Yang
Xin Lu
Alan Yuille
EgoV
73
239
0
23 Mar 2017
Deformable Convolutional Networks
Jifeng Dai
Haozhi Qi
Yuwen Xiong
Yi Li
Guodong Zhang
Han Hu
Yichen Wei
201
5,334
0
17 Mar 2017
FusionSeg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos
Suyog Dutt Jain
Bo Xiong
Kristen Grauman
VOS
118
382
0
19 Jan 2017
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
413
10,494
0
21 Jul 2016
TGIF: A New Dataset and Benchmark on Animated GIF Description
Yuncheng Li
Yale Song
Liangliang Cao
Joel R. Tetreault
Larry Goldberg
A. Jaimes
Jiebo Luo
79
271
0
10 Apr 2016
Segmentation from Natural Language Expressions
Ronghang Hu
Marcus Rohrbach
Trevor Darrell
VLM, EgoV
74
435
0
20 Mar 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun
MedIm
2.2K
194,020
0
10 Dec 2015