Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.14100
Cited By
GIT: A Generative Image-to-text Transformer for Vision and Language
27 May 2022
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"GIT: A Generative Image-to-text Transformer for Vision and Language"
50 / 405 papers shown
Title
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
39
3
0
29 May 2024
SketchDeco: Decorating B&W Sketches with Colour
Chaitat Utintu
Pinaki Nath Chowdhury
Aneeshan Sain
Subhadeep Koley
A. Bhunia
Yi-Zhe Song
DiffM
34
3
0
29 May 2024
Think Before You Act: A Two-Stage Framework for Mitigating Gender Bias Towards Vision-Language Tasks
Yunqi Zhang
Songda Li
Chunyuan Deng
Luyi Wang
Hui Zhao
31
0
0
27 May 2024
Enhancing Adverse Drug Event Detection with Multimodal Dataset: Corpus Creation and Model Development
Pranab Sahoo
Ayush Kumar Singh
Sriparna Saha
Aman Chadha
S. Mondal
30
2
0
24 May 2024
How Culturally Aware are Vision-Language Models?
Olena Burda-Lassen
Aman Chadha
Shashank Goswami
Vinija Jain
VLM
39
0
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
82
42
0
23 May 2024
StackOverflowVQA: Stack Overflow Visual Question Answering Dataset
Motahhare Mirzaei
Mohammad Javad Pirhadi
Sauleh Eetemadi
26
0
0
17 May 2024
Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video
Tomoya Sugihara
Shuntaro Masuda
Ling Xiao
Toshihiko Yamasaki
43
3
0
14 May 2024
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Enxin Song
Wenhao Chai
Tianbo Ye
Jenq-Neng Hwang
Xi Li
Gaoang Wang
VLM
MLLM
37
30
0
26 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
47
20
0
22 Apr 2024
Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images
Ali Naseh
Katherine Thai
Mohit Iyyer
Amir Houmansadr
47
5
0
21 Apr 2024
ECOR: Explainable CLIP for Object Recognition
Ali Rasekh
Sepehr Kazemi Ranjbar
Milad Heidari
Wolfgang Nejdl
VLM
46
4
0
19 Apr 2024
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Zixuan Gong
Qi Zhang
Guangyin Bao
Lei Zhu
Ke Liu
Liang Hu
Duoqian Miao
44
9
0
19 Apr 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Yuchi Wang
Shuhuai Ren
Rundong Gao
Linli Yao
Qingyan Guo
Kaikai An
Jianhong Bai
Xu Sun
DiffM
VLM
46
6
0
16 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
39
3
0
16 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
45
6
0
14 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
52
4
0
11 Apr 2024
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno DÍncà
E. Peruzzo
Massimiliano Mancini
Dejia Xu
Vidit Goel
Xingqian Xu
Zhangyang Wang
Humphrey Shi
N. Sebe
53
31
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Ouguzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
50
25
0
10 Apr 2024
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Fulong Ma
Weiqing Qi
Guoyang Zhao
Linwei Zheng
Sheng Wang
Yuxuan Liu
Ming-Yu Liu
79
9
0
10 Apr 2024
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen
Ryan Cotterell
Michael Y. Hu
Tal Linzen
Aaron Mueller
Candace Ross
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
Chengxu Zhuang
34
22
0
09 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
44
20
0
09 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
83
88
0
08 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
41
32
0
01 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
51
6
0
28 Mar 2024
Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study
Atsushi Teramoto
Ayano Michiba
Yuka Kiriyama
Tetsuya Tsukamoto
K. Imaizumi
H. Fujita
MedIm
24
1
0
26 Mar 2024
Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content
Zhicheng Du
Zhaotian Xie
Huazhang Ying
Likun Zhang
Peiwu Qin
21
0
0
23 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
58
4
0
21 Mar 2024
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
Chengxu Zhuang
Evelina Fedorenko
Jacob Andreas
32
2
0
21 Mar 2024
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
Jielin Qiu
William Jongwon Han
Winfred Wang
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Christos Faloutsos
Lei Li
Lijuan Wang
VLM
61
2
0
19 Mar 2024
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Paul S. Scotti
Mihir Tripathy
Cesar Kadir Torrico Villanueva
Reese Kneeland
Tong Chen
...
Charan Santhirasegaran
Jonathan Xu
Thomas Naselaris
Kenneth A. Norman
Tanishq Mathew Abraham
48
35
0
17 Mar 2024
Autonomous Monitoring of Pharmaceutical R&D Laboratories with 6 Axis Arm Equipped Quadruped Robot and Generative AI: A Preliminary Study
Shunichi Hato
Nozomi Ogawa
31
1
0
15 Mar 2024
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng
Bohan Zhou
Yicheng Feng
Ye Wang
Zongqing Lu
VLM
MLLM
46
7
0
14 Mar 2024
Visual Decoding and Reconstruction via EEG Embeddings with Guided Diffusion
Dongyang Li
Chen Wei
Shiying Li
Jiachen Zou
Quanying Liu
DiffM
37
18
0
12 Mar 2024
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yueping Jiang
MLLM
39
18
0
12 Mar 2024
Embodied Understanding of Driving Scenarios
Yunsong Zhou
Linyan Huang
Qingwen Bu
Jia Zeng
Tianyu Li
Hang Qiu
Hongzi Zhu
Minyi Guo
Yu Qiao
Hongyang Li
LM&Ro
62
31
0
07 Mar 2024
Multimodal Transformer for Comics Text-Cloze
Emanuele Vivoli
Joan Lafuente Baeza
Ernest Valveny Llobet
Dimosthenis Karatzas
38
4
0
06 Mar 2024
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Iryna Hartsock
Ghulam Rasool
46
62
0
04 Mar 2024
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
Mathilde Caron
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
34
5
0
04 Mar 2024
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
43
8
0
28 Feb 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
34
24
0
28 Feb 2024
All in an Aggregated Image for In-Image Learning
Lei Wang
Wanyu Xu
Zhiqiang Hu
Yihuai Lan
Shan Dong
Hao Wang
Roy Ka-Wei Lee
Ee-Peng Lim
VLM
45
1
0
28 Feb 2024
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu
Kai Zhang
Yuan Li
Zhiling Yan
Chujie Gao
...
Yue Huang
Hanchi Sun
Jianfeng Gao
Lifang He
Lichao Sun
VLM
VGen
EGVM
75
259
0
27 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
28
6
0
25 Feb 2024
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
Yusu Qian
Haotian Zhang
Yinfei Yang
Zhe Gan
83
26
0
20 Feb 2024
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering
Jihyung Kil
Farideh Tavazoee
Dongyeop Kang
Joo-Kyung Kim
LRM
31
2
0
16 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLM
MLLM
27
12
0
13 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
42
20
0
08 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
22
5
0
08 Feb 2024
Previous
1
2
3
4
5
6
7
8
9
Next