1411.5726
CIDEr: Consensus-based Image Description Evaluation
20 November 2014
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
Papers citing
"CIDEr: Consensus-based Image Description Evaluation"
50 / 2,183 papers shown
Title
Towards Retrieval-Augmented Architectures for Image Captioning
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Alessandro Nicolosi
Rita Cucchiara
VLM
77
12
0
21 May 2024
Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images
Xiaofei Yu
Yitong Li
Jie Ma
DiffM
88
0
0
21 May 2024
A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges
Huangjun Shen
Liangying Shao
Wenbo Li
Zhibin Lan
Zhanyu Liu
Jinsong Su
83
3
0
21 May 2024
A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data
Xinyi Wang
Grazziela Figueredo
Ruizhe Li
Wei Emma Zhang
Weitong Chen
Xin Chen
MedIm
ViT
112
2
0
21 May 2024
Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning
Zishan Gu
Fenglin Liu
Changchang Yin
Ping Zhang
LRM
LM&MA
102
0
0
19 May 2024
MICap: A Unified Model for Identity-aware Movie Descriptions
Haran Raajesh
Naveen Reddy Desanur
Zeeshan Khan
Makarand Tapaswi
74
4
0
19 May 2024
Automated Radiology Report Generation: A Review of Recent Advances
Phillip Sloan
Philip Clatworthy
Edwin Simpson
Majid Mirmehdi
81
21
0
17 May 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
212
338
0
16 May 2024
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
103
57
0
14 May 2024
The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective
Andrew Shin
Yusuke Mori
Kunitake Kaneko
VGen
EGVM
51
2
0
13 May 2024
Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores
Kiyoon Jeong
Woojun Lee
Woongchan Nam
Minjeong Ma
Pilsung Kang
60
2
0
02 May 2024
LLM-AD: Large Language Model based Audio Description System
Peng Chu
Jiang Wang
Andre Abrantes
59
4
0
02 May 2024
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang
Zhiding Yu
Xiaohui Jiang
Shiyi Lan
Min Shi
Nadine Chang
Jan Kautz
Ying Li
Jose M. Alvarez
LRM
99
48
0
02 May 2024
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe
Sunayana Rane
Zachary Berger
Yonatan Bitton
Jaemin Cho
...
Zarana Parekh
Jordi Pont-Tuset
Garrett Tanzer
Su Wang
Jason Baldridge
114
63
0
30 Apr 2024
Pre-training on High Definition X-ray Images: An Experimental Study
Tianlin Li
Yuehang Li
Wentao Wu
Jiandong Jin
Yao Rong
Bowei Jiang
Chuanfu Li
Jin Tang
MedIm
ViT
LM&MA
127
3
0
27 Apr 2024
MRScore: Evaluating Radiology Report Generation with LLM-based Reward System
Yunyi Liu
Zhanyu Wang
Yingshu Li
Xinyu Liang
Lingqiao Liu
Lei Wang
Luping Zhou
LM&MA
28
3
0
27 Apr 2024
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Yuhang Huang
Zihan Wu
Chongyang Gao
Jiawei Peng
Xu Yang
73
2
0
26 Apr 2024
Step Differences in Instructional Video
Tushar Nagarajan
Lorenzo Torresani
VGen
101
5
0
24 Apr 2024
What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini
Mustafa Shukor
Matthieu Cord
Laure Soulier
Benjamin Piwowarski
138
23
0
24 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
133
22
0
22 Apr 2024
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction
Shiyi Zhang
Sule Bai
Guangyi Chen
Lei Chen
Jiwen Lu
Junle Wang
Yansong Tang
103
10
0
22 Apr 2024
Movie101v2: Improved Movie Narration Benchmark
Zihao Yue
Yepeng Zhang
Ziheng Wang
Qin Jin
VGen
104
1
0
20 Apr 2024
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
Yang Luo
Zangwei Zheng
Zirui Zhu
Yang You
53
5
0
19 Apr 2024
Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
Fengyi Fu
Shancheng Fang
Weidong Chen
Zhendong Mao
ViT
VGen
56
4
0
19 Apr 2024
Resilience through Scene Context in Visual Referring Expression Generation
Simeon Junker
Sina Zarrieß
49
1
0
18 Apr 2024
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization
Yongdong Luo
Haojia Lin
Xiawu Zheng
Yigeng Jiang
Chia-Wen Lin
Jie Hu
Guannan Jiang
Songan Zhang
Rongrong Ji
67
0
0
17 Apr 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Yuchi Wang
Shuhuai Ren
Rundong Gao
Linli Yao
Qingyan Guo
Kaikai An
Jianhong Bai
Xu Sun
DiffM
VLM
106
9
0
16 Apr 2024
Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
Kai Chen
Yanze Li
Wenhua Zhang
Yanxin Liu
Pengxiang Li
...
Xinhai Zhao
Zhenguo Li
Dit-Yan Yeung
Huchuan Lu
Xu Jia
ELM
MLLM
116
37
0
16 Apr 2024
AIGeN: An Adversarial Approach for Instruction Generation in VLN
Niyati Rawal
Roberto Bigazzi
Lorenzo Baraldi
Rita Cucchiara
GAN
86
4
0
15 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
97
7
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
104
9
0
14 Apr 2024
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
Yuichi Inoue
Kento Sasaki
Yuma Ochi
Kazuki Fujii
Kotaro Tanahashi
Yu Yamaguchi
VLM
59
5
0
11 Apr 2024
Multi-Image Visual Question Answering for Unsupervised Anomaly Detection
Jun Li
Cosmin I. Bercea
Philipp Muller
Lina Felsner
Suhwan Kim
Daniel Rueckert
Benedikt Wiestler
Julia A. Schnabel
57
3
0
11 Apr 2024
Audio Dialogues: Dialogues dataset for audio and music understanding
Arushi Goel
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
AuLLM
100
5
0
11 Apr 2024
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
Minkuk Kim
Hyeon Bae Kim
Jinyoung Moon
Jinwoo Choi
Seong Tae Kim
71
25
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
80
32
0
10 Apr 2024
UMBRAE: Unified Multimodal Brain Decoding
Weihao Xia
Raoul de Charette
Cengiz Öztireli
Jing-Hao Xue
74
9
0
10 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
133
101
0
08 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Massimiliano Mancini
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
79
2
0
08 Apr 2024
Would Deep Generative Models Amplify Bias in Future Models?
Tianwei Chen
Yusuke Hirota
Mayu Otani
Noa Garcia
Yuta Nakashima
88
15
0
04 Apr 2024
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
Hao Wu
Huabin Liu
Yu Qiao
Xiao Sun
3DV
34
11
0
03 Apr 2024
Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM
Zhe Liu
Chunyang Chen
Junjie Wang
Mengzhuo Chen
Boyu Wu
Yuekai Huang
Jun Hu
Qing Wang
60
14
0
03 Apr 2024
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
A. M. H. Tiong
Junqi Zhao
Boyang Albert Li
Junnan Li
Guosheng Lin
Caiming Xiong
79
9
0
03 Apr 2024
MotionChain: Conversational Motion Controllers via Multimodal Prompts
Biao Jiang
Xin Chen
C. Zhang
Fukun Yin
Zhuoyuan Li
Gang Yu
Jiayuan Fan
VGen
LRM
96
11
0
02 Apr 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
89
7
0
01 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
107
42
0
01 Apr 2024
Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving
Akshay Gopalkrishnan
Ross Greer
Mohan M. Trivedi
VLM
96
25
0
28 Mar 2024
Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis
Chenyang Liu
Keyan Chen
Haotian Zhang
Zipeng Qi
Zhengxia Zou
Z. Shi
67
34
0
28 Mar 2024
Semantic Map-based Generation of Navigation Instructions
Chengzu Li
Chao Zhang
Simone Teufel
R. Doddipatla
Svetlana Stoyanchev
73
2
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
148
7
0
28 Mar 2024