1412.2306
Deep Visual-Semantic Alignments for Generating Image Descriptions
7 December 2014
A. Karpathy
Li Fei-Fei
Papers citing
"Deep Visual-Semantic Alignments for Generating Image Descriptions"
50 / 54 papers shown
Title
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Boyang Deng
Songyou Peng
Kyle Genova
Gordon Wetzstein
Noah Snavely
Leonidas Guibas
Thomas Funkhouser
HAI
424
0
0
11 Apr 2025
Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
Zehong Ma
Hao Chen
Wei Zeng
Limin Su
Shiliang Zhang
AI4TS
110
0
0
10 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
166
0
0
03 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
139
0
0
02 Apr 2025
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He
Qihang Yu
Qihao Liu
Liang-Chieh Chen
121
1
0
13 Mar 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta
Alberto Baldrati
Lorenzo Agnolucci
Marco Bertini
Andrew D. Bagdanov
CLIP
VLM
157
4
0
06 Feb 2025
Learning Fused State Representations for Control from Multi-View Observations
Zeyu Wang
Yao Li
Xin Li
Hongyu Zang
Romain Laroche
Riashat Islam
OffRL
125
1
0
03 Feb 2025
An Ensemble Model with Attention Based Mechanism for Image Captioning
Israa Al Badarneh
Bassam Hammo
Omar Al-Kadi
174
5
0
28 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM
VLM
117
0
0
03 Jan 2025
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
228
4
0
31 Dec 2024
Locality Alignment Improves Vision-Language Models
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
236
6
0
14 Oct 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
417
1
0
04 Sep 2024
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen
Tianxiang Hao
Tao He
Sicheng Zhao
Pengzhang Liu
Yongjun Bao
Guiguang Ding
236
14
0
02 Sep 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
129
3
0
13 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
106
13
0
08 Jun 2024
RankCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zhili Feng
Zenghui Ding
Yining Sun
SSL
VLM
95
7
0
15 Apr 2024
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
Xing Han
Huy Nguyen
Carl Harris
Nhat Ho
Suchi Saria
MoE
100
21
0
05 Feb 2024
Enhancing medical vision-language contrastive learning via inter-matching relation modelling
Mingjian Li
Mingyuan Meng
M. Fulham
David Dagan Feng
Lei Bi
Jinman Kim
VLM
112
1
0
19 Jan 2024
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
127
17
0
10 Oct 2023
Linear Alignment of Vision-language Models for Image Captioning
Fabian Paischer
M. Hofmarcher
Sepp Hochreiter
Thomas Adler
CLIP
VLM
122
0
0
10 Jul 2023
Expressive Text-to-Image Generation with Rich Text
Songwei Ge
Taesung Park
Jun-Yan Zhu
Jia-Bin Huang
DiffM
123
82
0
13 Apr 2023
Vision Meets Wireless Positioning: Effective Person Re-identification with Recurrent Context Propagation
Yiheng Liu
Wen-gang Zhou
Mao Xi
Sanjing Shen
Houqiang Li
75
8
0
10 Aug 2020
Textual Description for Mathematical Equations
Ajoy Mondal
C. V. Jawahar
69
2
0
07 Aug 2020
Mitigating Gender Bias in Captioning Systems
Ruixiang Tang
Mengnan Du
Yuening Li
Zirui Liu
Na Zou
Xia Hu
FaML
44
66
0
15 Jun 2020
Attributed Sequence Embedding
Zhongfang Zhuang
Xiangnan Kong
Elke A. Rundensteiner
Jihane Zouaoui
Aditya Arora
166
12
0
03 Nov 2019
Quantum Optical Experiments Modeled by Long Short-Term Memory
Thomas Adler
Manuel Erhard
Mario Krenn
Johannes Brandstetter
Johannes Kofler
Sepp Hochreiter
59
9
0
30 Oct 2019
Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks
Matthias Plappert
Christian Mandery
Tamim Asfour
3DH
126
132
0
18 May 2017
Comprehension-guided referring expressions
Ruotian Luo
Gregory Shakhnarovich
ObjD
91
171
0
12 Jan 2017
Spatio-Temporal Attention Models for Grounded Video Captioning
M. Zanfir
Elisabeta Marinoiu
C. Sminchisescu
89
50
0
17 Oct 2016
A Survey of Multi-View Representation Learning
Yingming Li
Ming Yang
Zhongfei Zhang
AI4TS
3DV
309
513
0
03 Oct 2016
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
Oriol Vinyals
Alexander Toshev
Samy Bengio
D. Erhan
111
854
0
21 Sep 2016
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro
Sameer Singh
Carlos Guestrin
FAtt
FaML
1.2K
16,990
0
16 Feb 2016
Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
Jimmy S. J. Ren
Yongtao Hu
Yu-Wing Tai
Chuan Wang
Li Xu
Wenxiu Sun
Qiong Yan
67
108
0
13 Feb 2016
Face Attribute Prediction Using Off-the-Shelf CNN Features
Yang Zhong
Josephine Sullivan
Haibo Li
CVBM
70
102
0
12 Feb 2016
Language to Logical Form with Neural Attention
Li Dong
Mirella Lapata
AI4CE
NAI
109
729
0
06 Jan 2016
Write a Classifier: Predicting Visual Classifiers from Unstructured Text
Mohamed Elhoseiny
Ahmed Elgammal
Babak Saleh
84
41
0
31 Dec 2015
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
215
2,478
0
01 Apr 2015
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
295
4,488
0
20 Nov 2014
Learning a Recurrent Visual Representation for Image Caption Generation
Xinlei Chen
C. L. Zitnick
SSL
GAN
105
196
0
20 Nov 2014
From Captions to Visual Concepts and Back
Hao Fang
Saurabh Gupta
F. Iandola
R. Srivastava
Li Deng
...
Xiaodong He
Margaret Mitchell
John C. Platt
C. L. Zitnick
Geoffrey Zweig
VLM
110
1,311
0
18 Nov 2014
Show and Tell: A Neural Image Caption Generator
Oriol Vinyals
Alexander Toshev
Samy Bengio
D. Erhan
3DV
249
6,029
0
17 Nov 2014
Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Jeff Donahue
Lisa Anne Hendricks
Marcus Rohrbach
Subhashini Venugopalan
S. Guadarrama
Kate Saenko
Trevor Darrell
VLM
165
6,053
0
17 Nov 2014
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Ryan Kiros
Ruslan Salakhutdinov
R. Zemel
VLM
127
1,399
0
10 Nov 2014
Explain Images with Multimodal Recurrent Neural Networks
Junhua Mao
Wenyuan Xu
Yi Yang
Jiang Wang
Alan Yuille
VLM
GAN
103
385
0
04 Oct 2014
Going Deeper with Convolutions
Christian Szegedy
Wei Liu
Yangqing Jia
P. Sermanet
Scott E. Reed
Dragomir Anguelov
D. Erhan
Vincent Vanhoucke
Andrew Rabinovich
477
43,658
0
17 Sep 2014
Recurrent Neural Network Regularization
Wojciech Zaremba
Ilya Sutskever
Oriol Vinyals
ODL
146
2,776
0
08 Sep 2014
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan
Andrew Zisserman
FAtt
MDE
1.7K
100,386
0
04 Sep 2014
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky
Jia Deng
Hao Su
J. Krause
S. Satheesh
...
A. Karpathy
A. Khosla
Michael S. Bernstein
Alexander C. Berg
Li Fei-Fei
VLM
ObjD
1.7K
39,547
0
01 Sep 2014
Video In Sentences Out
Andrei Barbu
Alexander Bridge
Zachary Burchill
D. Coroian
Sven J. Dickinson
...
Jarrell W. Waggoner
Song Wang
Jinlian Wei
Yifan Yin
Zhiqi Zhang
64
156
0
09 Aug 2014
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
A. Karpathy
Armand Joulin
Li Fei-Fei
VLM
101
937
0
22 Jun 2014