ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.07801
  4. Cited By
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
v1v2 (latest)

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

10 July 2024
Jongsuk Kim
Jiwon Shin
Junmo Kim
ArXiv (abs)PDFHTMLGithub (9★)

Papers citing "AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning"

14 / 14 papers shown
Title
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
Yong Ren
Chenxing Li
Le Xu
Hao Gu
Duzhen Zhang
Yujie Chen
Manjie Xu
Ruibo Fu
Shan Yang
Dong Yu
LRM
64
0
0
19 May 2025
Prefix tuning for automated audio captioning
Prefix tuning for automated audio captioning
Minkyu Kim
Kim Sung-Bin
Tae-Hyun Oh
78
45
0
30 Mar 2023
MAViL: Masked Audio-Video Learners
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
75
54
0
15 Dec 2022
Audiovisual Masked Autoencoders
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu
Eduardo Fonseca
Radu Tudor Ionescu
Mario Lucic
Cordelia Schmid
Anurag Arnab
SSL
96
45
0
09 Dec 2022
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Xubo Liu
Qiushi Huang
Xinhao Mei
Haohe Liu
Qiuqiang Kong
...
Yu Zhang
Lilian H. Y. Tang
Mark D. Plumbley
Volkan Kilicc
Wenwu Wang
111
20
0
28 Oct 2022
Contrastive Audio-Visual Masked Autoencoder
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
89
128
0
02 Oct 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
137
558
0
27 May 2022
Interactive Audio-text Representation for Automated Audio Captioning
  with Contrastive Learning
Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning
Chen Chen
Nana Hou
Yuchen Hu
Heqing Zou
Xiaofeng Qi
Chng Eng Siong
VLM
56
21
0
29 Mar 2022
Automated Audio Captioning using Transfer Learning and Reconstruction
  Latent Space Similarity Regularization
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization
Andrew Koh
Fuzhao Xue
Chng Eng Siong
48
20
0
10 Aug 2021
AudioCLIP: Extending CLIP to Image, Text and Audio
AudioCLIP: Extending CLIP to Image, Text and Audio
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
CLIPVLM
125
370
0
24 Jun 2021
A Transformer-based Audio Captioning Model with Keyword Estimation
A Transformer-based Audio Captioning Model with Keyword Estimation
Yuma Koizumi
Ryo Masumura
Kyosuke Nishida
Masahiro Yasuda
Shoichiro Saito
74
54
0
01 Jul 2020
Multi-modal Dense Video Captioning
Multi-modal Dense Video Captioning
Vladimir E. Iashin
Esa Rahtu
56
171
0
17 Mar 2020
Audio Caption: Listen and Tell
Audio Caption: Listen and Tell
Mengyue Wu
Heinrich Dinkel
Kai Yu
59
61
0
25 Feb 2019
Google's Neural Machine Translation System: Bridging the Gap between
  Human and Machine Translation
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu
M. Schuster
Zhiwen Chen
Quoc V. Le
Mohammad Norouzi
...
Alex Rudnick
Oriol Vinyals
G. Corrado
Macduff Hughes
J. Dean
AIMat
914
6,797
0
26 Sep 2016
1