ResearchTrend.AI
Unifying Vision-and-Language Tasks via Text Generation


4 February 2021
Jaemin Cho
Jie Lei
Hao Tan
Joey Tianyi Zhou
    MLLM

Papers citing "Unifying Vision-and-Language Tasks via Text Generation"

50 / 368 papers shown
Location-Aware Visual Question Generation with Lightweight Models
Nicholas Collin Suwono
Justin Chih-Yao Chen
Tun-Min Hung
T. Huang
I-Bin Liao
Yung-Hui Li
Lun-Wei Ku
Shao-Hua Sun
18
4
0
23 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
30
0
0
20 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
22
1
0
18 Oct 2023
Beyond Segmentation: Road Network Generation with Multi-Modal LLMs
Sumedh Rasal
Sanjay K. Boddhu
35
5
0
15 Oct 2023
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
Jiachen Li
Qiaozi Gao
Michael Johnston
Xiaofeng Gao
Xuehai He
Suhaila Shakiah
Hangjie Shi
R. Ghanadan
William Yang Wang
LM&Ro
27
12
0
14 Oct 2023
VizAbility: Enhancing Chart Accessibility with LLM-based Conversational Interaction
Joshua Gorniak
Yoon Kim
Donglai Wei
Nam Wook Kim
32
8
0
14 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
45
25
0
07 Oct 2023
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
Zhizheng Zhang
Wenxuan Xie
Xiaoyi Zhang
Yan Lu
34
10
0
07 Oct 2023
Demystifying Embedding Spaces using Large Language Models
Guy Tennenholtz
Yinlam Chow
Chih-Wei Hsu
Jihwan Jeong
Lior Shani
Azamat Tulepbergenov
Deepak Ramachandran
Martin Mladenov
Craig Boutilier
28
11
0
06 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
20
7
0
05 Oct 2023
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
19
1
0
28 Sep 2023
Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness
Valentin Barriere
Felipe del Rio
Andres Carvallo De Ferari
Carlos Aspillaga
Eugenio Herrera-Berg
Cristian Buc Calderon
DiffM
27
0
0
27 Sep 2023
Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan
Bingquan Shen
MLLM
37
1
0
27 Sep 2023
Survey of Social Bias in Vision-Language Models
Nayeon Lee
Yejin Bang
Holy Lovenia
Samuel Cahyawijaya
Wenliang Dai
Pascale Fung
VLM
47
16
0
24 Sep 2023
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
Renqiu Xia
Bo-Wen Zhang
Hao Peng
Hancheng Ye
Xiangchao Yan
Peng Ye
Botian Shi
Yu Qiao
Junchi Yan
14
0
0
20 Sep 2023
Towards Artificial General Intelligence (AGI) in the Internet of Things (IoT): Opportunities and Challenges
Fei Dou
Jin Ye
Geng Yuan
Qin Lu
Wei Niu
...
Hongyue Sun
Yunli Shao
Changying Li
Tianming Liu
Wenzhan Song
AI4CE
37
29
0
14 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
38
7
0
05 Sep 2023
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
Cheng Shi
Sibei Yang
VLM
19
21
0
03 Sep 2023
RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model
Fengxiang Bie
Yibo Yang
Zhongzhu Zhou
Adam Ghanem
Minjia Zhang
...
Pareesa Ameneh Golnari
David A. Clifton
Yuxiong He
Dacheng Tao
Shuaiwen Leon Song
EGVM
33
18
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhengyuan Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
28
8
0
31 Aug 2023
MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization
Tao Chen
Zexiong Lin
Hui Li
Jiayi Ji
Yiyi Zhou
Guanbin Li
Rongrong Ji
21
0
0
22 Aug 2023
Whether you can locate or not? Interactive Referring Expression Generation
Fulong Ye
Yuxing Long
Fangxiang Feng
Xiaojie Wang
31
4
0
19 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu
Yanyang Li
M. Lyu
Liwei Wang
VLM
32
15
0
18 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
13
5
0
17 Aug 2023
Improving Joint Speech-Text Representations Without Alignment
Cal Peyser
Zhong Meng
Ke Hu
Rohit Prabhavalkar
Andrew Rosenberg
Tara N. Sainath
M. Picheny
Kyunghyun Cho
VLM
31
4
0
11 Aug 2023
RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic
Saleem Ahmed
Bhavin Jawade
Shubham Pandey
S. Setlur
Venugopal Govindaraju
21
5
0
03 Aug 2023
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
Qiang-feng Zhou
Chaohui Yu
Shaofeng Zhang
Sitong Wu
Zhibin Wang
Fan Wang
34
27
0
03 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
61
42
0
30 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
25
4
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
26
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
32
25
0
13 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
37
126
0
11 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
Yuwei Bao
B. Lattimer
J. Chai
CLL
43
1
0
05 Jul 2023
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Bang-ju Yang
Fenglin Liu
Zheng Li
Qingyu Yin
Chenyu You
Bing Yin
Yuexian Zou
VLM
33
5
0
05 Jul 2023
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
A. S. Penamakuri
Manish Gupta
Mithun Das Gupta
Anand Mishra
37
7
0
29 Jun 2023
VisText: A Benchmark for Semantically Rich Chart Captioning
Benny J. Tang
Angie Boggust
Arvind Satyanarayan
28
76
0
28 Jun 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
54
556
0
23 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
34
18
0
20 Jun 2023
Align, Adapt and Inject: Sound-guided Unified Image Generation
Yue Yang
Kaipeng Zhang
Yuying Ge
Wenqi Shao
Zeyue Xue
Yu Qiao
Ping Luo
DiffM
21
5
0
20 Jun 2023
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Md Zahid Hasan
Jiajing Chen
Jiyang Wang
Mohammed Shaiqur Rahman
Ameya Joshi
Senem Velipasalar
C. Hegde
Anuj Sharma
S. Sarkar
VLM
46
18
0
16 Jun 2023
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
Kelly O. Marshall
Minh Pham
Ameya Joshi
Anushrut Jignasu
Aditya Balu
Adarsh Krishnamurthy
A. Hegde
CLIP
18
3
0
14 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
26
53
0
13 Jun 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
VLM
15
1
0
03 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip Torr
Volker Tresp
VPVLM
VLM
34
17
0
03 Jun 2023
"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning
Abisek Rajakumar Kalarani
P. Bhattacharyya
Niyati Chhaya
Sumit Shekhar
CoGe
VLM
19
9
0
01 Jun 2023
Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs
Mingyang Zhou
Yi Ren Fung
Long Chen
Christopher Thomas
Heng Ji
Shih-Fu Chang
23
11
0
29 May 2023
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
Noam Rotstein
David Bensaid
Shaked Brody
Roy Ganz
Ron Kimmel
VLM
26
27
0
28 May 2023
Decoding the Underlying Meaning of Multimodal Hateful Memes
Ming Shan Hee
Wen-Haw Chong
Roy Ka-Wei Lee
32
33
0
28 May 2023