2102.02779
Unifying Vision-and-Language Tasks via Text Generation
4 February 2021
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
Papers citing "Unifying Vision-and-Language Tasks via Text Generation"
50 / 368 papers shown
Title
A Large Vision-Language Model based Environment Perception System for Visually Impaired People
Zezhou Chen
Zhaoxiang Liu
Kai Wang
Kohou Wang
Shiguo Lian
50
0
0
25 Apr 2025
SignX: The Foundation Model for Sign Recognition
Sen Fang
Chunyu Sui
Hongwei Yi
C. Neidle
Dimitris N. Metaxas
SLR
35
0
0
22 Apr 2025
ChartQA-X: Generating Explanations for Charts
Shamanthak Hegde
Pooyan Fazli
H. Seifi
20
0
0
17 Apr 2025
FLIP Reasoning Challenge
Andreas Plesner
Turlan Kuzhagaliyev
Roger Wattenhofer
AAML
VLM
LRM
72
0
0
16 Apr 2025
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Yuyang Ji
Haohan Wang
LRM
37
0
0
14 Apr 2025
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Haoqiang Lin
Haokun Wen
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
52
14
0
25 Mar 2025
Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
Zichen Miao
Wei Chen
Qiang Qiu
90
1
0
24 Mar 2025
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He
Qihang Yu
Qihao Liu
Liang-Chieh Chen
68
0
0
13 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Pandeng Li
Yun Zheng
Liwei Wang
ObjD
VLM
56
0
0
03 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
35
2
0
01 Mar 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
39
6
0
24 Feb 2025
Natural Language Supervision for Low-light Image Enhancement
Jiahui Tang
Kaihua Zhou
Zhijian Luo
Yueen Hou
43
0
0
11 Jan 2025
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
Risa Shinoda
Kuniaki Saito
Shohei Tanaka
Tosho Hirasawa
Yoshitaka Ushiku
31
1
0
23 Dec 2024
Consistency of Compositional Generalization across Multiple Levels
Chuanhao Li
Zhen Li
Chenchen Jing
Xiaomeng Fan
Wenbo Ye
Yuwei Wu
Yunde Jia
CoGe
79
0
0
18 Dec 2024
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
Jinghan Sun
Dong-mei Wei
Zhe Xu
Donghuan Lu
Hong Liu
Hong Wang
Sotirios A. Tsaftaris
Steven G. McDonagh
Yefeng Zheng
Liansheng Wang
MedIm
94
0
0
18 Dec 2024
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Yunbin Tu
Liang-Sheng Li
Li Su
Qingming Huang
75
0
0
18 Dec 2024
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
Cong Wei
Yujie Zhong
Haoxian Tan
Y. Liu
Zheng Zhao
Jie Hu
Yujiu Yang
VOS
MLLM
VLM
LRM
88
1
0
26 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
46
28
0
30 Oct 2024
Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant
A. S. Penamakuri
Anand Mishra
24
1
0
24 Oct 2024
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura
Ahmed Heakl
Omkar Thawakar
Ali Alharthi
Ines Riahi
Abduljalil Saif
Jorma T. Laaksonen
F. Khan
Salman Khan
Rao Muhammad Anwer
45
1
0
24 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
L. Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
49
3
0
24 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
21
0
0
23 Oct 2024
Offline Evaluation of Set-Based Text-to-Image Generation
Negar Arabzadeh
Fernando Diaz
Junfeng He
EGVM
32
0
0
22 Oct 2024
Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Yu Zhao
Hao Fei
Xiangtai Li
L. Qin
Jiayi Ji
Hongyuan Zhu
Meishan Zhang
M. Zhang
Jianguo Wei
DiffM
26
1
0
20 Oct 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach
Rory Young
Nicolas Pugeault
AAML
57
3
0
14 Oct 2024
Leveraging Customer Feedback for Multi-modal Insight Extraction
Sandeep Sricharan Mukku
Abinesh Kanagarajan
Pushpendu Ghosh
Chetan Aggarwal
27
0
0
13 Oct 2024
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Yifei Xing
Xiangyuan Lan
Ruiping Wang
D. Jiang
Wenjun Huang
Qingfang Zheng
Yaowei Wang
Mamba
35
0
0
08 Oct 2024
Generalizable Prompt Tuning for Vision-Language Models
Qian Zhang
VLM
VPVLM
50
0
0
04 Oct 2024
Natural Language Generation for Visualizations: State of the Art, Challenges and Future Directions
Enamul Hoque
Mohammed Saidul Islam
29
2
0
29 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
36
1
0
19 Sep 2024
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli
Andrey Barsky
Mohamed Ali Souibgui
Artemis LLabres
Marco Bertini
Dimosthenis Karatzas
36
3
0
14 Sep 2024
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
25
0
0
28 Aug 2024
LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description
Yizhang Jin
Jian Li
Jiangning Zhang
Jianlong Hu
Zhenye Gan
Xin Tan
Yong Liu
Yabiao Wang
Chengjie Wang
Lizhuang Ma
25
3
0
09 Aug 2024
MSG-Chart: Multimodal Scene Graph for ChartQA
Yue Dai
Soyeon Caren Han
Wei Liu
16
1
0
09 Aug 2024
Are Bigger Encoders Always Better in Vision Large Models?
Bozhou Li
Hao Liang
Zimo Meng
Wentao Zhang
VLM
38
3
0
01 Aug 2024
Advancing Chart Question Answering with Robust Chart Component Recognition
Hanwen Zheng
Sijia Wang
Chris Thomas
Lifu Huang
37
1
0
19 Jul 2024
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan
Chaofeng Chen
Yiping Ke
Xinjiang Wang
Litong Feng
Wayne Zhang
VLM
36
23
0
17 Jul 2024
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram
Ming-Liang Zhang
Zhong-Zhi Li
Fei Yin
Liang Lin
Cheng-Lin Liu
LRM
22
5
0
10 Jul 2024
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
Yuxuan Wang
Yijun Liu
Fei Yu
Chen Huang
Kexin Li
Zhiguo Wan
Wanxiang Che
VLM
CoGe
35
5
0
01 Jul 2024
RAVEN: Multitask Retrieval Augmented Vision-Language Learning
Varun Nagaraj Rao
Siddharth Choudhary
Aditya Deshpande
R. Satzoda
Srikar Appalaraju
RALM
VLM
55
4
0
27 Jun 2024
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu
Yi Ren Fung
Sha Li
Yixin Wan
Kai-Wei Chang
Heng Ji
39
5
0
20 Jun 2024
Enhancing Question Answering on Charts Through Effective Pre-training Tasks
Ashim Gupta
Vivek Gupta
Shuo Zhang
Yujie He
Ning Zhang
Shalin S Shah
25
2
0
14 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
55
0
0
13 Jun 2024
Zoom and Shift are All You Need
Jiahao Qin
36
2
0
13 Jun 2024
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Jinhao Li
Haopeng Li
S. Erfani
Lei Feng
James Bailey
Feng Liu
VLM
29
3
0
05 Jun 2024
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering
Tao Li
Linjun Shou
Xuejun Liu
34
0
0
03 Jun 2024
Image Captioning via Dynamic Path Customization
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Yiyi Zhou
Xiaopeng Hong
Yongjian Wu
Rongrong Ji
34
0
0
01 Jun 2024
Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs
Mohammed Saidul Islam
Raian Rahman
Ahmed Masry
Md Tahmid Rahman Laskar
Mir Tafseer Nayeem
Enamul Hoque
LRM
ELM
36
4
0
01 Jun 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
41
14
0
28 May 2024
Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Xiaolin Chen
Liqiang Nie
Mohan S. Kankanhalli
LRM
23
8
0
27 May 2024