ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1909.11059
  4. Cited By
Unified Vision-Language Pre-Training for Image Captioning and VQA

Unified Vision-Language Pre-Training for Image Captioning and VQA

24 September 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
    MLLM
    VLM
ArXivPDFHTML

Papers citing "Unified Vision-Language Pre-Training for Image Captioning and VQA"

50 / 238 papers shown
Title
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
44
221
0
27 Feb 2023
Understanding Social Media Cross-Modality Discourse in Linguistic Space
Understanding Social Media Cross-Modality Discourse in Linguistic Space
Chunpu Xu
Hanzhuo Tan
Jing Li
Piji Li
24
7
0
26 Feb 2023
Can Pre-trained Vision and Language Models Answer Visual
  Information-Seeking Questions?
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
55
80
0
23 Feb 2023
Retrieval-augmented Image Captioning
Retrieval-augmented Image Captioning
R. Ramos
Desmond Elliott
Bruno Martins
VLM
36
29
0
16 Feb 2023
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot
  Image Captioning
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Zhuolin Yang
Ming-Yu Liu
Zihan Liu
V. Korthikanti
Weili Nie
...
Yuke Zhu
Mohammad Shoeybi
Bryan Catanzaro
Chaowei Xiao
Anima Anandkumar
VLM
RALM
34
39
0
09 Feb 2023
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
  Semantic Consistency
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
Pengzhen Ren
Changlin Li
Hang Xu
Yi Zhu
Guangrun Wang
Jian-zhuo Liu
Xiaojun Chang
Xiaodan Liang
49
43
0
31 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with
  Pre-Training Methods
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
34
15
0
05 Jan 2023
Enhancing Multi-modal and Multi-hop Question Answering via Structured
  Knowledge and Unified Retrieval-Generation
Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation
Qian Yang
Qian Chen
Wen Wang
Baotian Hu
Min Zhang
44
24
0
16 Dec 2022
Controllable Image Captioning via Prompting
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
24
23
0
04 Dec 2022
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual
  Grounding
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen
Ronghang Hu
Xinlei Chen
Matthias Nießner
Angel X. Chang
29
52
0
01 Dec 2022
What do you MEME? Generating Explanations for Visual Semantic Role
  Labelling in Memes
What do you MEME? Generating Explanations for Visual Semantic Role Labelling in Memes
Shivam Sharma
Siddhant Agarwal
Tharun Suresh
Preslav Nakov
Md. Shad Akhtar
Tanmoy Charkraborty
VLM
35
18
0
01 Dec 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative
  Latent Attention
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
26
9
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
36
17
0
21 Nov 2022
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating
  Unified Vision Language Model
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Sheng Tang
Yaqing Wang
Zhenglun Kong
Tianchi Zhang
Yao Li
Caiwen Ding
Yanzhi Wang
Yi Liang
Dongkuan Xu
33
32
0
21 Nov 2022
Incorporating Pre-training Paradigm for Antibody Sequence-Structure
  Co-design
Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design
Kaiyuan Gao
Lijun Wu
Jinhua Zhu
Tianbo Peng
Yingce Xia
...
Shufang Xie
Tao Qin
Haiguang Liu
Kun He
Tie-Yan Liu
40
9
0
26 Oct 2022
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Jihyeon Janel Lee
Wooyoung Kang
Eun-Sol Kim
CoGe
24
3
0
19 Oct 2022
Novel 3D Scene Understanding Applications From Recurrence in a Single
  Image
Novel 3D Scene Understanding Applications From Recurrence in a Single Image
Shimian Zhang
Skanda Bharadwaj
Keaton Kraiger
Yashasvi Asthana
Hong Zhang
R. Collins
Yanxi Liu
38
1
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in
  Vision-Language Pre-training
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
32
63
0
14 Oct 2022
Machine Generated Text: A Comprehensive Survey of Threat Models and
  Detection Methods
Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods
Evan Crothers
Nathalie Japkowicz
H. Viktor
DeLMO
54
107
0
13 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale
  Pre-training
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
42
4
0
10 Oct 2022
TVLT: Textless Vision-Language Transformer
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
54
28
0
28 Sep 2022
Paraphrasing Is All You Need for Novel Object Captioning
Paraphrasing Is All You Need for Novel Object Captioning
Cheng Yang
Yao-Hung Hubert Tsai
Wanshu Fan
Ruslan Salakhutdinov
Louis-Philippe Morency
Yu-Chiang Frank Wang
51
4
0
25 Sep 2022
PACT: Perception-Action Causal Transformer for Autoregressive Robotics
  Pre-Training
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
Rogerio Bonatti
Sai H. Vemprala
Shuang Ma
Felipe Vieira Frujeri
Shuhang Chen
Ashish Kapoor
44
22
0
22 Sep 2022
DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality
  Attention
DM2^22S2^22: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention
Shunsuke Kitada
Yuki Iwazaki
Riku Togashi
Hitoshi Iyatomi
26
1
0
07 Sep 2022
Every picture tells a story: Image-grounded controllable stylistic story
  generation
Every picture tells a story: Image-grounded controllable stylistic story generation
Holy Lovenia
Bryan Wilie
Romain Barraud
Samuel Cahyawijaya
Willy Chung
Pascale Fung
26
8
0
04 Sep 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
  Pretraining
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
56
159
0
25 Aug 2022
Interpreting Song Lyrics with an Audio-Informed Pre-trained Language
  Model
Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
Yixiao Zhang
Junyan Jiang
Gus Xia
S. Dixon
38
9
0
24 Aug 2022
Visual Subtitle Feature Enhanced Video Outline Generation
Visual Subtitle Feature Enhanced Video Outline Generation
Qi Lv
Ziqiang Cao
Wenrui Xie
Derui Wang
Jingwen Wang
...
Yuan-Fang Li
Min Cao
Wenjie Li
Sujian Li
Guohong Fu
VGen
31
0
0
24 Aug 2022
Quality Not Quantity: On the Interaction between Dataset Design and
  Robustness of CLIP
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Thao Nguyen
Gabriel Ilharco
Mitchell Wortsman
Sewoong Oh
Ludwig Schmidt
CLIP
VLM
55
99
0
10 Aug 2022
LATTE: LAnguage Trajectory TransformEr
LATTE: LAnguage Trajectory TransformEr
A. Bucker
Luis F. C. Figueredo
Sami Haddadin
Ashish Kapoor
Shuang Ma
Sai H. Vemprala
Rogerio Bonatti
LM&Ro
46
59
0
04 Aug 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient
  Image-Text Matching and Retrieval
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
16
21
0
29 Jul 2022
Learning Visual Representation from Modality-Shared Contrastive
  Language-Image Pre-training
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
27
47
0
26 Jul 2022
GRIT: Faster and Better Image captioning Transformer Using Dual Visual
  Features
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
ViT
41
106
0
20 Jul 2022
Explicit Image Caption Editing
Explicit Image Caption Editing
Zhen Wang
Long Chen
Wenbo Ma
G. Han
Yulei Niu
Jian Shao
Jun Xiao
25
12
0
20 Jul 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Yuqi Liu
Pengfei Xiong
Luhui Xu
Shengming Cao
Qin Jin
44
114
0
16 Jul 2022
ZoDIAC: Zoneout Dropout Injection Attention Calculation
ZoDIAC: Zoneout Dropout Injection Attention Calculation
Zanyar Zohourianshahzadi
Jugal Kalita
36
0
0
28 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language
  Models
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
58
229
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
35
124
0
15 Jun 2022
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning
  Tasks
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
Tuan Dinh
Yuchen Zeng
Ruisu Zhang
Ziqian Lin
Michael Gira
Shashank Rajput
Jy-yong Sohn
Dimitris Papailiopoulos
Kangwook Lee
LMTD
55
128
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
Multimodal Learning with Transformers: A Survey
Peng Xu
Xiatian Zhu
David Clifton
ViT
79
531
0
13 Jun 2022
Language Models are General-Purpose Interfaces
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
35
96
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional
  MoEs
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
39
66
0
09 Jun 2022
CLIP4IDC: CLIP for Image Difference Captioning
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo
Tong Wang
Jorma T. Laaksonen
VLM
29
27
0
01 Jun 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
64
529
0
27 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally
  Spreading Out Disinformation
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
26
12
0
25 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case
  Study in Self-Rationalization
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
36
4
0
24 May 2022
Novel Multicolumn Kernel Extreme Learning Machine for Food Detection via
  Optimal Features from CNN
Novel Multicolumn Kernel Extreme Learning Machine for Food Detection via Optimal Features from CNN
G. Tahir
Tahir Chu
C. K. Loo
13
2
0
15 May 2022
Automated Audio Captioning: An Overview of Recent Progress and New
  Challenges
Automated Audio Captioning: An Overview of Recent Progress and New Challenges
Xinhao Mei
Xubo Liu
Mark D. Plumbley
Wenwu Wang
31
38
0
12 May 2022
Learning to Answer Visual Questions from Web Videos
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
42
34
0
10 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM
MLLM
62
97
0
05 May 2022
Previous
12345
Next