ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1504.00325
  4. Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
v1v2 (latest)

Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
ArXiv (abs)PDFHTML

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,421 papers shown
Title
PrivImage: Differentially Private Synthetic Image Generation using
  Diffusion Models with Semantic-Aware Pretraining
PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining
Kecen Li
Chen Gong
Zhixiang Li
Yuzhong Zhao
Xinwen Hou
Tianhao Wang
90
10
0
19 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot
  Interactions
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
57
1
0
18 Oct 2023
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph
  Generation
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
Kibum Kim
Kanghoon Yoon
Jaeyeong Jeon
Yeonjun In
Jinyoung Moon
Donghyun Kim
Chanyoung Park
149
18
0
16 Oct 2023
Bounding and Filling: A Fast and Flexible Framework for Image Captioning
Bounding and Filling: A Fast and Flexible Framework for Image Captioning
Zheng Ma
Changxin Wang
Bo Huang
Zi-Yue Zhu
Jianbing Zhang
55
1
0
15 Oct 2023
Leveraging Image Augmentation for Object Manipulation: Towards
  Interpretable Controllability in Object-Centric Learning
Leveraging Image Augmentation for Object Manipulation: Towards Interpretable Controllability in Object-Centric Learning
Jinwoo Kim
Janghyuk Choi
Jaehyun Kang
Changyeon Lee
Ho-Jin Choi
Seon Joo Kim
OCL
59
0
0
13 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
  Models
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jiné Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLMVLM
77
39
0
13 Oct 2023
Can We Edit Multimodal Large Language Models?
Can We Edit Multimodal Large Language Models?
Siyuan Cheng
Bo Tian
Qingbin Liu
Xi Chen
Yongheng Wang
Huajun Chen
Ningyu Zhang
MLLM
111
29
0
12 Oct 2023
DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing
DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing
Yueming Lyu
Kang Zhao
Bo Peng
H. Chen
Yue Jiang
Yingya Zhang
Jing Dong
Caifeng Shan
69
2
0
12 Oct 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
151
17
0
10 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
55
2
0
08 Oct 2023
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via
  Pre-trained Models
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin
Muchao Ye
Tianrong Zhang
Tianyu Du
Jinguo Zhu
Han Liu
Jinghui Chen
Ting Wang
Fenglong Ma
AAMLVLMCoGe
89
44
0
07 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
104
10
0
06 Oct 2023
Envisioning Narrative Intelligence: A Creative Visual Storytelling
  Anthology
Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology
Brett A. Halperin
S. Lukin
CoGe
92
24
0
06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
  Models
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
79
14
0
04 Oct 2023
ReForm-Eval: Evaluating Large Vision Language Models via Unified
  Re-Formulation of Task-Oriented Benchmarks
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks
Zejun Li
Ye Wang
Mengfei Du
Qingwen Liu
Binhao Wu
...
Zhihao Fan
Jie Fu
Jingjing Chen
Xuanjing Huang
Zhongyu Wei
107
15
0
04 Oct 2023
Constructing Image-Text Pair Dataset from Books
Constructing Image-Text Pair Dataset from Books
Yamato Okamoto
Haruto Toyonaga
Yoshihisa Ijiri
Hirokatsu Kataoka
79
3
0
03 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
  Prediction
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIPVLM
102
73
0
02 Oct 2023
Towards reporting bias in visual-language datasets: bimodal augmentation
  by decoupling object-attribute association
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Qiyu Wu
Mengjie Zhao
Yutong He
Lang Huang
Junya Ono
Hiromi Wakaki
Yuki Mitsufuji
99
5
0
02 Oct 2023
Making LLaMA SEE and Draw with SEED Tokenizer
Making LLaMA SEE and Draw with SEED Tokenizer
Yuying Ge
Sijie Zhao
Ziyun Zeng
Yixiao Ge
Chen Li
Xintao Wang
Ying Shan
80
137
0
02 Oct 2023
Understanding Transferable Representation Learning and Zero-shot
  Transfer in CLIP
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Zixiang Chen
Yihe Deng
Yuanzhi Li
Quanquan Gu
VLM
91
12
0
02 Oct 2023
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Dahun Kim
A. Angelova
Weicheng Kuo
ObjDVLM
65
4
0
29 Sep 2023
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
Amita Gajewar
Paul Vicol
G. Bansal
David J Fleet
105
177
0
29 Sep 2023
YOLOR-Based Multi-Task Learning
YOLOR-Based Multi-Task Learning
Hung-Shuo Chang
Chien-Yao Wang
Hang Yan
Yukun Zhu
Hongpeng Liao
MoEVLM
54
16
0
29 Sep 2023
Self-supervised Cross-view Representation Reconstruction for Change
  Captioning
Self-supervised Cross-view Representation Reconstruction for Change Captioning
Yunbin Tu
Liang Li
Filippos Christianos
Zheng-Jun Zha
Zhibin Li
Qingming Huang
SSL
73
26
0
28 Sep 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced
  Text-image Comprehension and Composition
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang
Xiaoyi Wang
Bin Wang
Yuhang Cao
Chao Xu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Da Lin
Jiaqi Wang
MLLM
179
241
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
63
9
0
26 Sep 2023
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level
  Vision
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
Haoning Wu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
Liang Liao
...
Chunyi Li
Wenxiu Sun
Qiong Yan
Guangtao Zhai
Weisi Lin
VLM
128
156
0
25 Sep 2023
Semi-Supervised Domain Generalization for Object Detection via
  Language-Guided Feature Alignment
Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment
Sina Malakouti
Adriana Kovashka
ObjD
63
2
0
24 Sep 2023
A Survey on Image-text Multimodal Models
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
125
7
0
23 Sep 2023
Detect Everything with Few Examples
Detect Everything with Few Examples
Xinyu Zhang
Yuting Wang
Abdeslam Boularias
ObjDVLM
92
14
0
22 Sep 2023
Weakly-supervised Automated Audio Captioning via text only training
Weakly-supervised Automated Audio Captioning via text only training
Theodoros Kouzelis
Vassilis Katsouros
CLIP
77
7
0
21 Sep 2023
Viewpoint Integration and Registration with Vision Language Foundation
  Model for Image Change Understanding
Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
Xiaonan Lu
Jianlong Yuan
Ruigang Niu
Yuan Hu
Fan Wang
40
2
0
15 Sep 2023
Looking at words and points with attention: a benchmark for
  text-to-shape coherence
Looking at words and points with attention: a benchmark for text-to-shape coherence
Andrea Amaduzzi
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
41
2
0
14 Sep 2023
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the
  Wild
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
Huayang Li
Siheng Li
Deng Cai
Longyue Wang
Lemao Liu
Taro Watanabe
Yujiu Yang
Shuming Shi
MLLM
122
18
0
14 Sep 2023
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
76
3
0
14 Sep 2023
ITI-GEN: Inclusive Text-to-Image Generation
ITI-GEN: Inclusive Text-to-Image Generation
Cheng Zhang
Xuanbai Chen
Siqi Chai
Chen Henry Wu
Dmitry Lagun
Thabo Beeler
Fernando de la Torre
VLM
122
58
0
11 Sep 2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
  Retrieval
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
73
9
0
11 Sep 2023
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Zhengxiang Shi
Aldo Lipani
VLM
124
34
0
11 Sep 2023
ImageBind-LLM: Multi-modality Instruction Tuning
ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han
Renrui Zhang
Wenqi Shao
Peng Gao
Peng Xu
...
Yafei Wen
Xiaoxin Chen
Xiangyu Yue
Hongsheng Li
Yu Qiao
MLLM
92
125
0
07 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex
  Visually-Grounded Referencing using Determiners
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
Clarence Lee
M Ganesh Kumar
Cheston Tan
71
3
0
07 Sep 2023
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction
  Tuning
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
...
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
MLLM
98
142
0
05 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical
  Learning
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
60
7
0
05 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
120
6
0
05 Sep 2023
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised
  Semantic Segmentation
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation
Ryota Yoshihashi
Yuya Otsuka
Kenji Doi
Tomohiro Tanaka
Hirokatsu Kataoka
116
1
0
04 Sep 2023
Contrastive Feature Masking Open-Vocabulary Vision Transformer
Contrastive Feature Masking Open-Vocabulary Vision Transformer
Dahun Kim
A. Angelova
Weicheng Kuo
ObjDVLM
115
27
0
02 Sep 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
  Understanding, Generation, and Instruction Following
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Ziyu Guo
Renrui Zhang
Xiangyang Zhu
Yiwen Tang
Xianzheng Ma
...
Ke Chen
Peng Gao
Xianzhi Li
Hongsheng Li
Pheng-Ann Heng
MLLM
98
144
0
01 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize
  Content And Behavior
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
78
8
0
01 Sep 2023
Towards Addressing the Misalignment of Object Proposal Evaluation for
  Vision-Language Tasks via Semantic Grounding
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding
Joshua Forster Feinglass
Yezhou Yang
46
2
0
01 Sep 2023
TouchStone: Evaluating Vision-Language Models by Language Models
TouchStone: Evaluating Vision-Language Models by Language Models
Shuai Bai
Shusheng Yang
Jinze Bai
Peng Wang
Xing Zhang
Junyang Lin
Xinggang Wang
Chang Zhou
Jingren Zhou
MLLM
111
48
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
  Detection
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
75
5
0
30 Aug 2023
Previous
123...91011...272829
Next