ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2111.09734
  4. Cited By
ClipCap: CLIP Prefix for Image Captioning

ClipCap: CLIP Prefix for Image Captioning

18 November 2021
Ron Mokady
Amir Hertz
Amit H. Bermano
    CLIP
    VLM
ArXivPDFHTML

Papers citing "ClipCap: CLIP Prefix for Image Captioning"

50 / 144 papers shown
Title
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
Yuncheng Guo
Xiaodong Gu
OffRL
VLM
32
0
0
15 May 2025
Whitened CLIP as a Likelihood Surrogate of Images and Captions
Whitened CLIP as a Likelihood Surrogate of Images and Captions
Roy Betser
Meir Yossef Levi
Guy Gilboa
31
0
0
11 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
56
0
0
08 May 2025
Mitigating Image Captioning Hallucinations in Vision-Language Models
Mitigating Image Captioning Hallucinations in Vision-Language Models
Fei Zhao
Chenyi Zhang
Runlin Zhang
Tianyang Wang
Xi Li
VLM
44
0
0
06 May 2025
Class-Conditional Distribution Balancing for Group Robust Classification
Class-Conditional Distribution Balancing for Group Robust Classification
Miaoyun Zhao
Qiang Zhang
C. Li
70
1
0
24 Apr 2025
CAMU: Context Augmentation for Meme Understanding
CAMU: Context Augmentation for Meme Understanding
Girish A. Koushik
Diptesh Kanojia
Helen Treharne
Aditya Joshi
VLM
98
0
0
24 Apr 2025
Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Maria Tzelepi
Vasileios Mezaris
34
0
0
14 Apr 2025
Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
Zehong Ma
Hao Chen
Wei Zeng
Limin Su
Shiliang Zhang
AI4TS
35
0
0
10 Apr 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Xia Hu
Bo Yuan
VLM
64
0
0
24 Mar 2025
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
Zehua Wang
Yurui Dong
Fuwen Luo
Minyuan Ruan
Zhili Cheng
Chong Chen
Peng Li
Yang Liu
LRM
89
0
0
13 Mar 2025
MMRL: Multi-Modal Representation Learning for Vision-Language Models
MMRL: Multi-Modal Representation Learning for Vision-Language Models
Yuncheng Guo
Xiaodong Gu
VLM
OffRL
200
1
0
11 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
98
2
0
08 Mar 2025
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
S M Sarwar
82
1
0
25 Feb 2025
LaVCa: LLM-assisted Visual Cortex Captioning
LaVCa: LLM-assisted Visual Cortex Captioning
Takuya Matsuyama
Shinji Nishimoto
Yu Takagi
63
1
0
20 Feb 2025
Dynamic Scene Understanding from Vision-Language Representations
Dynamic Scene Understanding from Vision-Language Representations
Shahaf Pruss
Morris Alper
Hadar Averbuch-Elor
OCL
227
0
0
20 Jan 2025
Decoding fMRI Data into Captions using Prefix Language Modeling
Decoding fMRI Data into Captions using Prefix Language Modeling
Vyacheslav Shen
Kassymzhomart Kunanbayev
Dae-Shik Kim
38
0
0
07 Jan 2025
Altogether: Image Captioning via Re-aligning Alt-text
Altogether: Image Captioning via Re-aligning Alt-text
Hu Xu
Po-Yao (Bernie) Huang
Xiaoqing Ellen Tan
Ching-Feng Yeh
Jacob Kahn
...
Luke Zettlemoyer
Wen-tau Yih
Shang-Wen Li
Saining Xie
Christoph Feichtenhofer
DiffM
46
7
0
31 Dec 2024
Prompt-enhanced Network for Hateful Meme Classification
Prompt-enhanced Network for Hateful Meme Classification
Junxi Liu
Yanyan Feng
Jiehai Chen
Yun Xue
Fenghuan Li
VLM
63
0
0
12 Nov 2024
ViTOC: Vision Transformer and Object-aware Captioner
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
37
0
0
09 Nov 2024
A Unified Debiasing Approach for Vision-Language Models across
  Modalities and Tasks
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
Hoin Jung
T. Jang
Xiaoqian Wang
VLM
27
2
0
10 Oct 2024
Decoding the Echoes of Vision from fMRI: Memory Disentangling for Past
  Semantic Information
Decoding the Echoes of Vision from fMRI: Memory Disentangling for Past Semantic Information
Runze Xia
Congchi Yin
Piji Li
31
0
0
30 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
163
1
0
04 Sep 2024
Learning Video Context as Interleaved Multimodal Sequences
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
49
5
0
31 Jul 2024
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language
  Large Models
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju
Haicheng Wang
Haozhe Cheng
Xu Chen
Zhonghua Zhai
Weilin Huang
Jinsong Lan
Shuai Xiao
Bo Zheng
VLM
49
5
0
16 Jul 2024
ViG-Bias: Visually Grounded Bias Discovery and Mitigation
ViG-Bias: Visually Grounded Bias Discovery and Mitigation
Badr-Eddine Marani
Mohamed Hanini
Nihitha Malayarukil
Stergios Christodoulidis
Maria Vakalopoulou
Enzo Ferrante
24
0
0
02 Jul 2024
World Models with Hints of Large Language Models for Goal Achieving
World Models with Hints of Large Language Models for Goal Achieving
Zeyuan Liu
Ziyu Huan
Xiyao Wang
Jiafei Lyu
Jian Tao
Xiu Li
Furong Huang
Huazhe Xu
LM&Ro
LRM
AI4CE
46
1
0
11 Jun 2024
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for
  Multimodal Large Language Models
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang
Hehe Fan
Yi Yang
53
3
0
24 May 2024
Open-vocabulary Auditory Neural Decoding Using fMRI-prompted LLM
Open-vocabulary Auditory Neural Decoding Using fMRI-prompted LLM
Xiaoyu Chen
Changde Du
Che Liu
Yizhe Wang
Huiguang He
32
2
0
13 May 2024
Revisiting Relevance Feedback for CLIP-based Interactive Image Retrieval
Revisiting Relevance Feedback for CLIP-based Interactive Image Retrieval
Ryoya Nara
Yu-Chieh Lin
Yuji Nozawa
Youyang Ng
Goh Itoh
Osamu Torii
Yusuke Matsui
HAI
29
2
0
25 Apr 2024
GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
GazeHTA: End-to-end Gaze Target Detection with Head-Target Association
Zhi-Yi Lin
Jouh Yeong Chew
Jan van Gemert
Xucong Zhang
44
1
0
16 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image
  Captions as Prompts
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
44
10
0
12 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
91
1
0
12 Apr 2024
Segment Any 3D Object with Language
Segment Any 3D Object with Language
Seungjun Lee
Yuyang Zhao
Gim Hee Lee
44
1
0
02 Apr 2024
A Survey on Large Language Model-Based Game Agents
A Survey on Large Language Model-Based Game Agents
Sihao Hu
Tiansheng Huang
Gaowen Liu
Ramana Rao Kompella
Gaowen Liu
Selim Furkan Tekin
Yichang Xu
Zachary Yahn
Ling Liu
LLMAG
LM&Ro
AI4CE
LM&MA
71
52
0
02 Apr 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering
  Over Longer Causal Chains Grounded in Dynamic Visual Scenes
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
40
7
0
01 Apr 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
57
4
0
19 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing
  Objects in 3D Scenes
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
54
10
0
12 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
39
14
0
06 Mar 2024
Regeneration Based Training-free Attribution of Fake Images Generated by
  Text-to-Image Generative Models
Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models
Meiling Li
Zhenxing Qian
Xinpeng Zhang
39
2
0
03 Mar 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
35
47
0
20 Feb 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla
Alexander X. Oesterling
Suraj Srinivas
Flavio du Pin Calmon
Himabindu Lakkaraju
44
36
0
16 Feb 2024
Convincing Rationales for Visual Question Answering Reasoning
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
44
1
0
06 Feb 2024
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
48
29
0
19 Dec 2023
LaViP:Language-Grounded Visual Prompts
LaViP:Language-Grounded Visual Prompts
Nilakshan Kunananthaseelan
Jing Zhang
Mehrtash Harandi
VLM
25
0
0
18 Dec 2023
MATK: The Meme Analytical Tool Kit
MATK: The Meme Analytical Tool Kit
Ming Shan Hee
Aditi Kumaresan
N. Hoang
Nirmalendu Prakash
Rui Cao
Roy Ka-Wei Lee
VLM
27
2
0
11 Dec 2023
BARET : Balanced Attention based Real image Editing driven by
  Target-text Inversion
BARET : Balanced Attention based Real image Editing driven by Target-text Inversion
Yuming Qiao
Fanyi Wang
Jingwen Su
Yanhao Zhang
Yunjie Yu
Siyu Wu
Guo-Jun Qi
DiffM
30
4
0
09 Dec 2023
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning
  Distilled from Large Language Models
Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models
Hongzhan Lin
Ziyang Luo
Jing Ma
Long Chen
29
9
0
09 Dec 2023
Auto-Vocabulary Semantic Segmentation
Auto-Vocabulary Semantic Segmentation
Osman Ülger
Maksymilian Kulicki
Yuki M. Asano
Martin R. Oswald
VLM
45
2
0
07 Dec 2023
Multimodal Data and Resource Efficient Device-Directed Speech Detection
  with Large Foundation Models
Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
Dominik Wagner
Alexander W. Churchill
Siddharth Sigtia
Panayiotis Georgiou
Matt Mirsamadi
Aarshee Mishra
Erik Marchi
26
3
0
06 Dec 2023
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding,
  Reasoning, and Planning
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Sijin Chen
Xin Chen
C. Zhang
Mingsheng Li
Gang Yu
Hao Fei
Erik Cambria
Jiayuan Fan
Tao Chen
MLLM
29
82
0
30 Nov 2023
123
Next