1411.5726
Cited By
CIDEr: Consensus-based Image Description Evaluation
20 November 2014
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
Papers citing
"CIDEr: Consensus-based Image Description Evaluation"
50 / 2,183 papers shown
Title
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Qirui Jiao
Daoyuan Chen
Yilun Huang
Yaliang Li
Ying Shen
VLM
113
8
0
08 Aug 2024
UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation
Niyati Rawal
Roberto Bigazzi
Lorenzo Baraldi
Rita Cucchiara
LM&Ro
99
1
0
08 Aug 2024
Dual-path Collaborative Generation Network for Emotional Video Captioning
Cheng Ye
Weidong Chen
Jingyu Li
Li Zhang
Zhendong Mao
126
1
0
06 Aug 2024
Multitask and Multimodal Neural Tuning for Large Models
Hao Sun
Yu Song
Jihong Hu
Yen-Wei Chen
Lanfen Lin
VLM
44
0
0
06 Aug 2024
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
Xianyu Chen
Ming Jiang
Qi Zhao
69
3
0
05 Aug 2024
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
Koki Maeda
Tosho Hirasawa
Atsushi Hashimoto
Jun Harashima
Leszek Rybicki
Yusuke Fukasawa
Yoshitaka Ushiku
101
0
0
05 Aug 2024
A Novel Evaluation Framework for Image2Text Generation
Jia-Hong Huang
Hongyi Zhu
Yixian Shen
Stevan Rudinac
A. M. Pacces
Evangelos Kanoulas
75
9
0
03 Aug 2024
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
108
6
0
31 Jul 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu
Xinyue Cai
Jiayi Ji
Jiale Li
Oucheng Huang
Gen Luo
Hao Fei
Xiaoshuai Sun
Rongrong Ji
MLLM
158
13
0
31 Jul 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Xiaowei Chi
Yatian Wang
Aosong Cheng
Pengjun Fang
Zeyue Tian
...
Wenhan Luo
Qifeng Chen
Shanghang Zhang
Qi-fei Liu
Yi-Ting Guo
135
7
0
30 Jul 2024
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
Dhruv Verma
Debaditya Roy
Basura Fernando
86
1
0
30 Jul 2024
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
75
7
0
29 Jul 2024
A Labeled Ophthalmic Ultrasound Dataset with Medical Report Generation Based on Cross-modal Deep Learning
Jing Wang
Junyan Fan
Meng Zhou
Yanzhu Zhang
Mingyu Shi
38
1
0
26 Jul 2024
HICEScore: A Hierarchical Metric for Image Captioning Evaluation
Zequn Zeng
Jianqiao Sun
Hao Zhang
Tiansheng Wen
Yudi Su
Yan Xie
Zhengjue Wang
Boli Chen
101
3
0
26 Jul 2024
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
74
10
0
22 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLM
LRM
137
13
0
22 Jul 2024
Navigation Instruction Generation with BEV Perception and Large Language Models
Sheng Fan
Rui Liu
Wenguan Wang
Yi Yang
94
9
0
21 Jul 2024
Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Md Zarif Hossain
Ahmed Imteaj
VLM
AAML
66
6
0
20 Jul 2024
Downstream-Pretext Domain Knowledge Traceback for Active Learning
Beichen Zhang
Liang-Sheng Li
Zheng-Jun Zha
Jiebo Luo
Qingming Huang
72
0
0
20 Jul 2024
Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance
Yongshuo Zhu
Lu Li
Keyan Chen
Chenyang Liu
Fugen Zhou
Z. Shi
78
4
0
19 Jul 2024
NeuroBind: Towards Unified Multimodal Representations for Neural Signals
Fengyu Yang
Chao Feng
Daniel Wang
Tianye Wang
Ziyao Zeng
...
Hyoungseob Park
Pengliang Ji
Han Zhao
Yuanning Li
Alex Wong
113
10
0
19 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
66
0
0
18 Jul 2024
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
Leyang Shen
Gongwei Chen
Rui Shao
Weili Guan
Liqiang Nie
MoE
81
12
0
17 Jul 2024
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Yilin Ye
Shishi Xiao
Xingchen Zeng
Wei Zeng
114
5
0
17 Jul 2024
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
Yunbin Tu
Liang-Sheng Li
Li Su
Chenggang Yan
Qin Huang
102
7
0
16 Jul 2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Pengxiang Li
Zhi Gao
Bofei Zhang
Tao Yuan
Yuwei Wu
Mehrtash Harandi
Yunde Jia
Song-Chun Zhu
Qing Li
VLM
MLLM
102
6
0
16 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
72
1
0
16 Jul 2024
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick
Rama Chellappa
Subhashini Venugopalan
111
21
0
12 Jul 2024
Controllable Navigation Instruction Generation with Chain of Thought Prompting
Xianghao Kong
Jinyu Chen
Wenguan Wang
Hang Su
Xiaolin Hu
Yi Yang
Si Liu
LRM
105
9
0
10 Jul 2024
LEMoN: Label Error Detection using Multimodal Neighbors
Haoran Zhang
Aparna Balagopalan
Nassim Oufattole
Hyewon Jeong
Yan Wu
Jiacheng Zhu
Marzyeh Ghassemi
128
0
0
10 Jul 2024
Vision-Language Models under Cultural and Inclusive Considerations
Antonia Karamolegkou
Phillip Rust
Yong Cao
Ruixiang Cui
Anders Søgaard
Daniel Hershcovich
VLM
117
8
0
08 Jul 2024
OneDiff: A Generalist Model for Image Difference Captioning
Erdong Hu
Longteng Guo
Tongtian Yue
Zijia Zhao
Shuning Xue
Jing Liu
VLM
121
2
0
08 Jul 2024
Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference
Kai Shen
Lingfei Wu
Siliang Tang
Fangli Xu
Bo Long
Yueting Zhuang
Jian Pei
70
0
0
06 Jul 2024
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
61
6
0
05 Jul 2024
M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
Florian Schneider
Sunayana Sitaram
VLM
81
12
0
04 Jul 2024
Multi-Modal Video Dialog State Tracking in the Wild
Adnen Abdessaied
Lei Shi
Andreas Bulling
59
2
0
02 Jul 2024
Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation
Pablo Messina
René Vidal
Denis Parra
Álvaro Soto
Vladimir Araujo
MedIm
113
4
0
02 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
178
23
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
98
15
0
01 Jul 2024
From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
Nan Xu
Fei Wang
Sheng Zhang
Hoifung Poon
Muhao Chen
139
7
0
01 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
108
67
0
30 Jun 2024
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang
Wentao Yang
Songxuan Lai
Zecheng Xie
Lianwen Jin
94
21
0
27 Jun 2024
MatchTime: Towards Automatic Soccer Game Commentary Generation
Jiayuan Rao
Haoning Wu
Chang-rui Liu
Yanfeng Wang
Weidi Xie
88
8
0
26 Jun 2024
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
Jiafeng Liang
Shixin Jiang
Zekun Wang
Haojie Pan
Zerui Chen
Zheng Chu
Ming Liu
Ruiji Fu
Zhongyuan Wang
Bing Qin
69
3
0
26 Jun 2024
X-ray Made Simple: Lay Radiology Report Generation and Robust Evaluation
Kun Zhao
Chenghao Xiao
Chen Tang
Bohao Yang
Kai Ye
Noura Al Moubayed
Liang Zhan
Chenghua Lin
86
0
0
25 Jun 2024
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
Yuting Mei
Linli Yao
Qin Jin
65
1
0
24 Jun 2024
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Gregor Geigle
Radu Timofte
Goran Glavaš
83
0
0
20 Jun 2024
From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
Yusuke Hirota
Ryo Hachiuma
Chao-Han Huck Yang
Yuta Nakashima
VLM
89
4
0
20 Jun 2024
Adaptable Logical Control for Large Language Models
Honghua Zhang
Po-Nien Kung
Masahiro Yoshida
Guy Van den Broeck
Nanyun Peng
71
10
0
19 Jun 2024
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Rushikesh Zawar
Shaurya Dewan
Andrew F. Luo
Margaret M. Henderson
Michael J. Tarr
Leila Wehbe
VGen
CoGe
76
1
0
19 Jun 2024