Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.14100
Cited By
GIT: A Generative Image-to-text Transformer for Vision and Language
27 May 2022
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"GIT: A Generative Image-to-text Transformer for Vision and Language"
50 / 405 papers shown
Title
Text-Guided Image Clustering
Andreas Stephan
Lukas Miklautz
Kevin Sidak
Jan Philip Wahle
Bela Gipp
Claudia Plant
Benjamin Roth
11
4
0
05 Feb 2024
Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents
Zelong Li
Wenyue Hua
Hao Wang
He Zhu
Yongfeng Zhang
LLMAG
72
19
0
01 Feb 2024
Benchmarking Large Multimodal Models against Common Corruptions
Jiawei Zhang
Tianyu Pang
Chao Du
Yi Ren
Bo-wen Li
Min-Bin Lin
MLLM
30
14
0
22 Jan 2024
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Kohei Uehara
Nabarun Goswami
Hanqin Wang
Toshiaki Baba
Kohtaro Tanaka
...
Takagi Naoya
Ryo Umagami
Yingyi Wen
Tanachai Anakewat
Tatsuya Harada
LRM
28
2
0
18 Jan 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang
Bohan Zhuang
Qi Wu
14
11
0
12 Jan 2024
SnapCap: Efficient Snapshot Compressive Video Captioning
Jianqiao Sun
Yudi Su
Hao Zhang
Ziheng Cheng
Zequn Zeng
Zhengjue Wang
Bo Chen
Xin Yuan
32
1
0
10 Jan 2024
VLLaVO: Mitigating Visual Gap through LLMs
Shuhao Chen
Yulong Zhang
Weisen Jiang
Jiangang Lu
Yu Zhang
VLM
54
2
0
06 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
43
8
0
06 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
27
9
0
03 Jan 2024
Detours for Navigating Instructional Videos
Kumar Ashutosh
Zihui Xue
Tushar Nagarajan
Kristen Grauman
29
6
0
03 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM
VLM
LLMAG
46
210
0
03 Jan 2024
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen
Yi Hong
MedIm
20
1
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
K. Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
35
12
0
01 Jan 2024
Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy
Weijian Mai
Jian Zhang
Pengfei Fang
Zhijun Zhang
51
9
0
31 Dec 2023
Open-Vocabulary Video Relation Extraction
Wentao Tian
Zheng Wang
Yu Fu
Jingjing Chen
Lechao Cheng
23
2
0
25 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
44
1
0
21 Dec 2023
Pixel Aligned Language Models
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
45
15
0
14 Dec 2023
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
Yafei Hu
Quanting Xie
Vidhi Jain
Jonathan M Francis
Jay Patrikar
...
Xiaolong Wang
Sebastian A. Scherer
Z. Kira
Fei Xia
Yonatan Bisk
LM&Ro
AI4CE
34
63
0
14 Dec 2023
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Xinpeng Wang
Xiaoyuan Yi
Han Jiang
Shanlin Zhou
Zhihua Wei
Xing Xie
33
13
0
13 Dec 2023
Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI
Kai Huang
Boyuan Yang
Wei Gao
34
1
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
26
20
0
13 Dec 2023
NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations
Yuichi Inoue
Yuki Yada
Kotaro Tanahashi
Yu Yamaguchi
29
17
0
11 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
30
2
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
25
2
0
07 Dec 2023
Mitigating Open-Vocabulary Caption Hallucinations
Assaf Ben-Kish
Moran Yanuka
Morris Alper
Raja Giryes
Hadar Averbuch-Elor
MLLM
VLM
23
6
0
06 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
47
29
0
05 Dec 2023
Transformer-Based Deep Learning Model for Bored Pile Load-Deformation Prediction in Bangkok Subsoil
S. Youwai
Chissanupong Thongnoo
AI4CE
12
0
0
05 Dec 2023
UPOCR: Towards Unified Pixel-Level OCR Interface
Dezhi Peng
Zhenhua Yang
Jiaxin Zhang
Chongyu Liu
Yongxin Shi
Kai Ding
Fengjun Guo
Lianwen Jin
31
10
0
05 Dec 2023
Uni3DL: Unified Model for 3D and Language Understanding
Xiang Li
Jian Ding
Zhaoyang Chen
Mohamed Elhoseiny
35
3
0
05 Dec 2023
Lenna: Language Enhanced Reasoning Detection Assistant
Fei Wei
Xinyu Zhang
Ailing Zhang
Bo-Wen Zhang
Xiangxiang Chu
MLLM
LRM
29
23
0
05 Dec 2023
Object Recognition as Next Token Prediction
Kaiyu Yue
Borchun Chen
Jonas Geiping
Hengduo Li
Tom Goldstein
Ser-Nam Lim
40
9
0
04 Dec 2023
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Jialin Wu
Xia Hu
Yaqing Wang
Bo Pang
Radu Soricut
MoE
19
14
0
01 Dec 2023
Segment and Caption Anything
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLM
VLM
28
18
0
01 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq R. Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
41
45
0
30 Nov 2023
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Zineng Tang
Ziyi Yang
Mahmoud Khademi
Yang Liu
Chenguang Zhu
Mohit Bansal
LRM
MLLM
AuLLM
54
44
0
30 Nov 2023
GELDA: A generative language annotation framework to reveal visual biases in datasets
Krish Kabra
Kathleen M. Lewis
Guha Balakrishnan
VLM
16
1
0
29 Nov 2023
Transformer Based Model for Predicting Rapid Impact Compaction Outcomes: A Case Study of Utapao International Airport
S. Youwai
Sirasak Detcheewa
AI4CE
15
0
0
29 Nov 2023
PALM: Predicting Actions through Language Models
Sanghwan Kim
Daoji Huang
Yongqin Xian
Otmar Hilliges
Luc Van Gool
Xi Wang
VLM
22
10
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
38
0
0
28 Nov 2023
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng
Hang Zhang
Guanzheng Chen
Xin Li
Shijian Lu
Chunyan Miao
Li Bing
VLM
MLLM
95
198
0
28 Nov 2023
Incorporating granularity bias as the margin into contrastive loss for video captioning
Jiayang Gu
Fengming Yao
21
0
0
25 Nov 2023
T-Rex: Counting by Visual Prompting
Qing Jiang
Feng Li
Tianhe Ren
Shilong Liu
Zhaoyang Zeng
Kent Yu
Lei Zhang
18
11
0
22 Nov 2023
LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions
Songhao Han
Le Zhuo
Yue Liao
Si Liu
VLM
26
14
0
20 Nov 2023
Extraction and Summarization of Explicit Video Content using Multi-Modal Deep Learning
Shaunak Joshi
Raghav Gaggar
19
0
0
17 Nov 2023
Efficient End-to-End Visual Document Understanding with Rationale Distillation
Wang Zhu
Alekh Agarwal
Mandar Joshi
Robin Jia
Jesse Thomason
Kristina Toutanova
34
2
0
16 Nov 2023
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder
Abdelrahman Mohamed
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Alcides Alcoba Inciarte
Muhammad Abdul-Mageed
VLM
MLLM
27
7
0
15 Nov 2023
Multiple-Question Multiple-Answer Text-VQA
Peng Tang
Srikar Appalaraju
R. Manmatha
Yusheng Xie
Vijay Mahadevan
46
5
0
15 Nov 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
K. Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG
LM&Ro
35
101
0
13 Nov 2023
Analyzing Modular Approaches for Visual Question Decomposition
Apoorv Khandelwal
Ellie Pavlick
Chen Sun
45
4
0
10 Nov 2023
Previous
1
2
3
4
5
6
7
8
9
Next