2205.14100
GIT: A Generative Image-to-text Transformer for Vision and Language
27 May 2022
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
Papers citing
"GIT: A Generative Image-to-text Transformer for Vision and Language"
50 / 405 papers shown
Title
Deep learning in medical image registration: introduction and survey
Ahmad Hammoudeh
Stéphane Dupont
MedIm
19
4
0
01 Sep 2023
Towards Real Time Egocentric Segment Captioning for The Blind and Visually Impaired in RGB-D Theatre Images
Khadidja Delloul
S. Larabi
32
2
0
26 Aug 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLM
CLIP
44
13
0
25 Aug 2023
Vision Transformer Adapters for Generalizable Multitask Learning
Deblina Bhattacharjee
Sabine Süsstrunk
Mathieu Salzmann
ViT
21
8
0
23 Aug 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
60
9
0
23 Aug 2023
ViCo: Engaging Video Comment Generation with Human Preference Rewards
Yuchong Sun
Bei Liu
Xu Chen
Ruihua Song
Jianlong Fu
VGen
22
2
0
22 Aug 2023
ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations
Sreyan Ghosh
Chandra Kiran Reddy Evuru
Sonal Kumar
Utkarsh Tyagi
Sakshi Singh
Sanjoy Chowdhury
Dinesh Manocha
OOD
30
1
0
19 Aug 2023
Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani
Nikos Deligiannis
13
5
0
17 Aug 2023
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
Dario Cioni
Lorenzo Berlincioni
Federico Becattini
A. Bimbo
DiffM
24
9
0
14 Aug 2023
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Xiaofei Wang
Manthan Thakker
Zhuo Chen
Naoyuki Kanda
Sefik Emre Eskimez
Sanyuan Chen
M. Tang
Shujie Liu
Jinyu Li
Takuya Yoshioka
26
79
0
14 Aug 2023
Encode-Store-Retrieve: Enhancing Memory Augmentation through Language-Encoded Egocentric Perception
Junxiao Shen
John J. Dudley
Per Ola Kristensson
RALM
20
0
0
10 Aug 2023
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models
Zheng Ma
Mianzhi Pan
Wenhan Wu
Ka Leong Cheng
Jianbing Zhang
Shujian Huang
Jiajun Chen
VLM
CoGe
23
3
0
06 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
45
607
0
04 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
27
10
0
02 Aug 2023
Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences
Di Yang
Hongyu Chen
Xinglin Hou
T. Ge
Yuning Jiang
Qin Jin
36
7
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
61
42
0
30 Jul 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng-Tao Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
LM&Ro
LRM
30
1,100
0
28 Jul 2023
Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang
Ce Zhang
Yushun Tang
Z. He
VLM
MLLM
CLIP
36
15
0
28 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
Towards a Visual-Language Foundation Model for Computational Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Ivy Liang
...
Andrew Zhang
L. Le
Georg Gerber
Anil V. Parwani
Faisal Mahmood
VLM
MedIm
40
46
0
24 Jul 2023
GIST: Generating Image-Specific Text for Fine-grained Object Classification
Kathleen M. Lewis
Emily Mu
Adrian V. Dalca
John Guttag
VLM
29
7
0
21 Jul 2023
FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback
Ashish Singh
Prateek R. Agarwal
Zixuan Huang
Arpita Singh
Tong Yu
Sungchul Kim
Victor S. Bursztyn
N. Vlassis
Ryan A. Rossi
36
6
0
20 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
32
25
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
39
87
0
11 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
34
126
0
11 Jul 2023
Self-Adaptive Sampling for Efficient Video Question-Answering on Image-Text Models
Wei Han
Hui Chen
MingSung Kan
Soujanya Poria
24
1
0
09 Jul 2023
Reading Between the Lanes: Text VideoQA on the Road
George Tom
Minesh Mathew
Sergi Garcia
Dimosthenis Karatzas
C. V. Jawahar
25
6
0
08 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
Multi-Similarity Contrastive Learning
Emily Mu
John Guttag
Maggie Makar
SSL
32
2
0
06 Jul 2023
Garbage in, garbage out: Zero-shot detection of crime using Large Language Models
Anj Simmons
Rajesh Vasa
VLM
AILaw
10
3
0
04 Jul 2023
CLIPAG: Towards Generator-Free Text-to-Image Generation
Roy Ganz
Michael Elad
VLM
28
7
0
29 Jun 2023
Seeing in Words: Learning to Classify through Language Bottlenecks
Khalid Saifullah
Yuxin Wen
Jonas Geiping
Micah Goldblum
Tom Goldstein
VLM
13
2
0
29 Jun 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
42
598
0
27 Jun 2023
FunQA: Towards Surprising Video Comprehension
Binzhu Xie
Sicheng Zhang
Zitang Zhou
Bo-wen Li
Yuanhan Zhang
Jack Hessel
Jingkang Yang
Ziwei Liu
36
20
0
26 Jun 2023
Large Multimodal Models: Notes on CVPR 2023 Tutorial
Chunyuan Li
MLLM
VLM
19
20
0
26 Jun 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLM
ObjD
VLM
41
698
0
26 Jun 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu
Kevin Qinghong Lin
Linjie Li
Jianfeng Wang
Yaser Yacoob
Lijuan Wang
VLM
MLLM
22
241
0
26 Jun 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Yunhang Shen
Yulei Qin
Mengdan Zhang
...
Xiawu Zheng
Ke Li
Xing Sun
Zhenyu Qiu
Rongrong Ji
ELM
MLLM
42
766
0
23 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
18
2
0
21 Jun 2023
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
S. Hall
F. G. Abrantes
Hanwen Zhu
Grace A. Sodunke
Aleksandar Shtedritski
Hannah Rose Kirk
CoGe
21
39
0
21 Jun 2023
Dense Video Object Captioning from Disjoint Supervision
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
31
3
0
20 Jun 2023
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Simone Bianco
Luigi Celona
Marco Donzella
Paolo Napoletano
34
18
0
20 Jun 2023
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Peng-Tao Xu
Wenqi Shao
Kaipeng Zhang
Peng Gao
Shuo Liu
Meng Lei
Fanqing Meng
Siyuan Huang
Yu Qiao
Ping Luo
ELM
MLLM
33
159
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Jiaheng Liu
VLM
CLIP
30
8
0
15 Jun 2023
Training Multimedia Event Extraction With Generated Images and Captions
Zilin Du
Yunxin Li
Xu Guo
Yidan Sun
Boyang Albert Li
DiffM
21
7
0
15 Jun 2023
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao
Lei Ji
Luowei Zhou
Kevin Lin
Joya Chen
Zihan Fan
Mike Zheng Shou
MLLM
27
72
0
14 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
21
53
0
13 Jun 2023
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation
Jeremy Gwinnup
Kevin Duh
VLM
22
3
0
12 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT
MedIm
24
171
0
11 Jun 2023
Optimizing ViViT Training: Time and Memory Reduction for Action Recognition
Shreyank N. Gowda
Anurag Arnab
Jonathan Huang
ViT
18
4
0
07 Jun 2023