ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2011.05049
  4. Cited By
Human-centric Spatio-Temporal Video Grounding With Visual Transformers

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

10 November 2020
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
ArXivPDFHTML

Papers citing "Human-centric Spatio-Temporal Video Grounding With Visual Transformers"

37 / 37 papers shown
Title
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
113
6
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
145
41
0
31 Dec 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM
VGen
VLM
67
6
0
07 Nov 2024
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour
David Harrison
Maxwell Horton
Jeffrey Marker
Houman Bedayat
Sachin Mehta
Mohammad Rastegari
Mahyar Najibi
Saman Naderiparizi
MQ
93
3
0
14 Oct 2024
ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
ORDNet: Capturing Omni-Range Dependencies for Scene Parsing
Shaofei Huang
Si Liu
Tianrui Hui
Jizhong Han
Yue Liu
Jiashi Feng
Shuicheng Yan
3DPC
OffRL
90
15
0
11 Jan 2021
Linguistic Structure Guided Context Modeling for Referring Image
  Segmentation
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Tianrui Hui
Si Liu
Shaofei Huang
Guanbin Li
Sansi Yu
Faxi Zhang
Jizhong Han
63
153
0
01 Oct 2020
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Shaofei Huang
Tianrui Hui
Si Liu
Guanbin Li
Yunchao Wei
Jizhong Han
Luoqi Liu
Yue Liu
EgoV
64
181
0
01 Oct 2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form
  Sentences
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
Zhu Zhang
Zhou Zhao
Yang Zhao
Qi. Wang
Huasheng Liu
Lianli Gao
60
115
0
19 Jan 2020
PPDM: Parallel Point Detection and Matching for Real-time Human-Object
  Interaction Detection
PPDM: Parallel Point Detection and Matching for Real-time Human-Object Interaction Detection
Yue Liao
Si Liu
Fei Wang
Yanjie Chen
Chen Qian
Jiashi Feng
101
267
0
30 Dec 2019
Learning 2D Temporal Adjacent Networks for Moment Localization with
  Natural Language
Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
Songyang Zhang
Houwen Peng
Jianlong Fu
Jiebo Luo
66
469
0
08 Dec 2019
UNITER: UNiversal Image-TExt Representation Learning
UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
VLM
OT
97
447
0
25 Sep 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
142
1,661
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from
  Transformers
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
227
2,474
0
20 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal
  Pre-training
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
198
900
0
16 Aug 2019
Fusion of Detected Objects in Text for Visual Question Answering
Fusion of Detected Objects in Text for Visual Question Answering
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter
54
173
0
14 Aug 2019
Sentence Specified Dynamic Video Thumbnail Generation
Sentence Specified Dynamic Video Thumbnail Generation
Yiitan Yuan
Lin Ma
Wenwu Zhu
53
30
0
12 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
130
1,948
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
217
3,667
0
06 Aug 2019
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Zhenfang Chen
Lin Ma
Wenhan Luo
Kwan-Yee K. Wong
78
103
0
06 Jun 2019
MAN: Moment Alignment Network for Natural Language Moment Retrieval via
  Iterative Graph Adjustment
MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment
Da Zhang
Xiyang Dai
Xin Eric Wang
Yuan-fang Wang
L. Davis
58
305
0
30 Nov 2018
MAC: Mining Activity Concepts for Language-based Temporal Localization
MAC: Mining Activity Concepts for Language-based Temporal Localization
Runzhou Ge
J. Gao
Kan Chen
Ram Nevatia
67
179
0
21 Nov 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.5K
94,511
0
11 Oct 2018
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
Huijuan Xu
Kun He
Bryan A. Plummer
Leonid Sigal
Stan Sclaroff
Kate Saenko
CLIP
61
323
0
13 Apr 2018
MAttNet: Modular Attention Network for Referring Expression
  Comprehension
MAttNet: Modular Attention Network for Referring Expression Comprehension
Licheng Yu
Zhe Lin
Xiaohui Shen
Jimei Yang
Xin Lu
Joey Tianyi Zhou
Tamara L. Berg
ObjD
97
825
0
24 Jan 2018
Localizing Moments in Video with Natural Language
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
107
946
0
04 Aug 2017
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual
  Actions
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu
Chen Sun
David A. Ross
Carl Vondrick
C. Pantofaru
...
G. Toderici
Susanna Ricco
Rahul Sukthankar
Cordelia Schmid
Jitendra Malik
VGen
97
1,028
0
23 May 2017
TALL: Temporal Activity Localization via Language Query
TALL: Temporal Activity Localization via Language Query
J. Gao
Chen Sun
Zhenheng Yang
Ram Nevatia
120
819
0
05 May 2017
Action Tubelet Detector for Spatio-Temporal Action Localization
Action Tubelet Detector for Spatio-Temporal Action Localization
Vicky Kalogeiton
Philippe Weinzaepfel
V. Ferrari
Cordelia Schmid
66
325
0
04 May 2017
Dense-Captioning Events in Videos
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
134
1,242
0
02 May 2017
Spatio-temporal Person Retrieval via Natural Language Queries
Spatio-temporal Person Retrieval via Natural Language Queries
Masataka Yamaguchi
Kuniaki Saito
Yoshitaka Ushiku
Tatsuya Harada
64
58
0
26 Apr 2017
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
Licheng Yu
Hao Tan
Joey Tianyi Zhou
Tamara L. Berg
ObjD
91
275
0
30 Dec 2016
Semi-Supervised Classification with Graph Convolutional Networks
Semi-Supervised Classification with Graph Convolutional Networks
Thomas Kipf
Max Welling
GNN
SSL
577
28,999
0
09 Sep 2016
Modeling Context in Referring Expressions
Modeling Context in Referring Expressions
Licheng Yu
Patrick Poirson
Shan Yang
Alexander C. Berg
Tamara L. Berg
125
1,261
0
31 Jul 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense
  Image Annotations
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
194
5,726
0
23 Feb 2016
Generation and Comprehension of Unambiguous Object Descriptions
Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao
Jonathan Huang
Alexander Toshev
Oana-Maria Camburu
Alan Yuille
Kevin Patrick Murphy
ObjD
112
1,345
0
07 Nov 2015
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal
  Networks
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren
Kaiming He
Ross B. Girshick
Jian Sun
AIMat
ObjD
465
62,122
0
04 Jun 2015
Coherent Multi-Sentence Video Description with Variable Level of Detail
Coherent Multi-Sentence Video Description with Variable Level of Detail
Anna Rohrbach
Marcus Rohrbach
Weijian Qiu
Annemarie Friedrich
Sikandar Amin
Mykhaylo Andriluka
Manfred Pinkal
Bernt Schiele
69
218
0
24 Mar 2014
1