arXiv:2102.02779
Unifying Vision-and-Language Tasks via Text Generation
4 February 2021
Jaemin Cho
Jie Lei
Hao Tan
Joey Tianyi Zhou
MLLM
Papers citing
"Unifying Vision-and-Language Tasks via Text Generation"
50 / 368 papers shown
Location-Aware Visual Question Generation with Lightweight Models
Nicholas Collin Suwono
Justin Chih-Yao Chen
Tun-Min Hung
T. Huang
I-Bin Liao
Yung-Hui Li
Lun-Wei Ku
Shao-Hua Sun
18
4
0
23 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
30
0
0
20 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
22
1
0
18 Oct 2023
Beyond Segmentation: Road Network Generation with Multi-Modal LLMs
Sumedh Rasal
Sanjay K. Boddhu
35
5
0
15 Oct 2023
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
Jiachen Li
Qiaozi Gao
Michael Johnston
Xiaofeng Gao
Xuehai He
Suhaila Shakiah
Hangjie Shi
R. Ghanadan
William Yang Wang
LM&Ro
27
12
0
14 Oct 2023
VizAbility: Enhancing Chart Accessibility with LLM-based Conversational Interaction
Joshua Gorniak
Yoon Kim
Donglai Wei
Nam Wook Kim
32
8
0
14 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
45
25
0
07 Oct 2023
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
Zhizheng Zhang
Wenxuan Xie
Xiaoyi Zhang
Yan Lu
34
10
0
07 Oct 2023
Demystifying Embedding Spaces using Large Language Models
Guy Tennenholtz
Yinlam Chow
Chih-Wei Hsu
Jihwan Jeong
Lior Shani
Azamat Tulepbergenov
Deepak Ramachandran
Martin Mladenov
Craig Boutilier
28
11
0
06 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
20
7
0
05 Oct 2023
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
19
1
0
28 Sep 2023
Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere
Felipe del Rio
Andres Carvallo De Ferari
Carlos Aspillaga
Eugenio Herrera-Berg
Cristian Buc Calderon
DiffM
27
0
0
27 Sep 2023
Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan
Bingquan Shen
MLLM
37
1
0
27 Sep 2023
Survey of Social Bias in Vision-Language Models
Nayeon Lee
Yejin Bang
Holy Lovenia
Samuel Cahyawijaya
Wenliang Dai
Pascale Fung
VLM
47
16
0
24 Sep 2023
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
Renqiu Xia
Bo-Wen Zhang
Hao Peng
Hancheng Ye
Xiangchao Yan
Peng Ye
Botian Shi
Yu Qiao
Junchi Yan
14
0
0
20 Sep 2023
Towards Artificial General Intelligence (AGI) in the Internet of Things (IoT): Opportunities and Challenges
Fei Dou
Jin Ye
Geng Yuan
Qin Lu
Wei Niu
...
Hongyue Sun
Yunli Shao
Changying Li
Tianming Liu
Wenzhan Song
AI4CE
37
29
0
14 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
38
7
0
05 Sep 2023
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
Cheng Shi
Sibei Yang
VLM
19
21
0
03 Sep 2023
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
Fengxiang Bie
Yibo Yang
Zhongzhu Zhou
Adam Ghanem
Minjia Zhang
...
Pareesa Ameneh Golnari
David A. Clifton
Yuxiong He
Dacheng Tao
Shuaiwen Leon Song
EGVM
33
18
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhengyuan Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
28
8
0
31 Aug 2023
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
Tao Chen
Zexiong Lin
Hui Li
Jiayi Ji
Yiyi Zhou
Guanbin Li
Rongrong Ji
21
0
0
22 Aug 2023
Whether you can locate or not? Interactive Referring Expression Generation
Fulong Ye
Yuxing Long
Fangxiang Feng
Xiaojie Wang
31
4
0
19 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu
Yanyang Li
M. Lyu
Liwei Wang
VLM
32
15
0
18 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
13
5
0
17 Aug 2023
Improving Joint Speech-Text Representations Without Alignment
Cal Peyser
Zhong Meng
Ke Hu
Rohit Prabhavalkar
Andrew Rosenberg
Tara N. Sainath
M. Picheny
Kyunghyun Cho
VLM
31
4
0
11 Aug 2023
RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic
Saleem Ahmed
Bhavin Jawade
Shubham Pandey
S. Setlur
Venugopal Govindaraju
21
5
0
03 Aug 2023
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
Qiang-feng Zhou
Chaohui Yu
Shaofeng Zhang
Sitong Wu
Zhibin Wang
Fan Wang
34
27
0
03 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
61
42
0
30 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
25
4
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
26
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
32
25
0
13 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
37
126
0
11 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
Yuwei Bao
B. Lattimer
J. Chai
CLL
43
1
0
05 Jul 2023
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Bang-ju Yang
Fenglin Liu
Zheng Li
Qingyu Yin
Chenyu You
Bing Yin
Yuexian Zou
VLM
33
5
0
05 Jul 2023
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
A. S. Penamakuri
Manish Gupta
Mithun Das Gupta
Anand Mishra
37
7
0
29 Jun 2023
VisText: A Benchmark for Semantically Rich Chart Captioning
Benny J. Tang
Angie Boggust
Arvind Satyanarayan
28
76
0
28 Jun 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
54
556
0
23 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
34
18
0
20 Jun 2023
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang
Kaipeng Zhang
Yuying Ge
Wenqi Shao
Zeyue Xue
Yu Qiao
Ping Luo
DiffM
21
5
0
20 Jun 2023
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Md Zahid Hasan
Jiajing Chen
Jiyang Wang
Mohammed Shaiqur Rahman
Ameya Joshi
Senem Velipasalar
C. Hegde
Anuj Sharma
S. Sarkar
VLM
46
18
0
16 Jun 2023
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
Kelly O. Marshall
Minh Pham
Ameya Joshi
Anushrut Jignasu
Aditya Balu
Adarsh Krishnamurthy
A. Hegde
CLIP
18
3
0
14 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
26
53
0
13 Jun 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
VLM
15
1
0
03 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip Torr
Volker Tresp
VPVLM
VLM
34
17
0
03 Jun 2023
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning
Abisek Rajakumar Kalarani
P. Bhattacharyya
Niyati Chhaya
Sumit Shekhar
CoGe
VLM
19
9
0
01 Jun 2023
Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs
Mingyang Zhou
Yi Ren Fung
Long Chen
Christopher Thomas
Heng Ji
Shih-Fu Chang
23
11
0
29 May 2023
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
Noam Rotstein
David Bensaid
Shaked Brody
Roy Ganz
Ron Kimmel
VLM
26
27
0
28 May 2023
Decoding the Underlying Meaning of Multimodal Hateful Memes
Ming Shan Hee
Wen-Haw Chong
Roy Ka-Wei Lee
32
33
0
28 May 2023