ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2004.06165
  4. Cited By
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
    VLM
ArXivPDFHTML

Papers citing "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"

50 / 490 papers shown
Title
Object-Region Video Transformers
Object-Region Video Transformers
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
30
82
0
13 Oct 2021
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
ZhuoSheng Zhang
Hanqing Zhang
Keming Chen
Yuhang Guo
Jingyun Hua
Yulong Wang
Ming Zhou
VLM
55
71
0
13 Oct 2021
CLIP4Caption: CLIP for Video Caption
CLIP4Caption: CLIP for Video Caption
Mingkang Tang
Zhanyu Wang
Zhenhua Liu
Fengyun Rao
Dian Li
Xiu Li
CLIP
VLM
35
150
0
13 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive
  Language-Image Pre-training Paradigm
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM
CLIP
50
448
0
11 Oct 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao
Shijie Geng
Renrui Zhang
Teli Ma
Rongyao Fang
Yongfeng Zhang
Hongsheng Li
Yu Qiao
VLM
CLIP
104
992
0
09 Oct 2021
Counterfactual Samples Synthesizing and Training for Robust Visual
  Question Answering
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Yulei Niu
Hanwang Zhang
Jun Xiao
AAML
OOD
21
36
0
03 Oct 2021
CIDEr-R: Robust Consensus-based Image Description Evaluation
CIDEr-R: Robust Consensus-based Image Description Evaluation
G. O. D. Santos
Esther Luna Colombini
Sandra Avila
47
30
0
28 Sep 2021
Multimodal Integration of Human-Like Attention in Visual Question
  Answering
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood
Fabian Kögel
Philippe Muller
Dominike Thomas
Mihai Bâce
Andreas Bulling
41
16
0
27 Sep 2021
An animated picture says at least a thousand words: Selecting Gif-based
  Replies in Multimodal Dialog
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
Xingyao Wang
David Jurgens
24
5
0
24 Sep 2021
MLIM: Vision-and-Language Model Pre-training with Masked Language and
  Image Modeling
MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Tarik Arici
M. S. Seyfioglu
T. Neiman
Yi Tian Xu
Son N. Tran
Trishul Chilimbi
Belinda Zeng
Ismail B. Tutar
VLM
16
15
0
24 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VPVLM
VLM
211
221
0
24 Sep 2021
Dense Contrastive Visual-Linguistic Pretraining
Dense Contrastive Visual-Linguistic Pretraining
Lei Shi
Kai Shuang
Shijie Geng
Peng Gao
Zuohui Fu
Gerard de Melo
Yunpeng Chen
Sen Su
VLM
SSL
57
10
0
24 Sep 2021
Cross-Modal Coherence for Text-to-Image Retrieval
Cross-Modal Coherence for Text-to-Image Retrieval
Malihe Alikhani
Fangda Han
Hareesh Ravi
Mubbasir Kapadia
Vladimir Pavlovic
Matthew Stone
26
8
0
22 Sep 2021
Image Captioning for Effective Use of Language Models in Knowledge-Based
  Visual Question Answering
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
Aitor Soroa Etxabe
Eneko Agirre
33
59
0
15 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
42
56
0
13 Sep 2021
Explain Me the Painting: Multi-Topic Knowledgeable Art Description
  Generation
Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation
Zechen Bai
Yuta Nakashima
Noa Garcia
68
43
0
13 Sep 2021
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Yumao Lu
Zicheng Liu
Lijuan Wang
180
403
0
10 Sep 2021
TxT: Crossmodal End-to-End Learning with Transformers
TxT: Crossmodal End-to-End Learning with Transformers
Jan-Martin O. Steitz
Jonas Pfeiffer
Iryna Gurevych
Stefan Roth
LRM
21
2
0
09 Sep 2021
Learning to Generate Scene Graph from Natural Language Supervision
Learning to Generate Scene Graph from Natural Language Supervision
Yiwu Zhong
Jing Shi
Jianwei Yang
Chenliang Xu
Yin Li
SSL
42
77
0
06 Sep 2021
Product-oriented Machine Translation with Cross-modal Cross-lingual
  Pre-training
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang
26
18
0
25 Aug 2021
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM
MLLM
51
782
0
24 Aug 2021
Auto-Parsing Network for Image Captioning and Visual Question Answering
Auto-Parsing Network for Image Captioning and Visual Question Answering
Xu Yang
Chongyang Gao
Hanwang Zhang
Jianfei Cai
24
35
0
24 Aug 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
27
137
0
23 Aug 2021
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Ming Yan
Haiyang Xu
Chenliang Li
Bin Bi
Junfeng Tian
Min Gui
Wei Wang
VLM
36
10
0
21 Aug 2021
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling
Xiaopeng Lu
Zhenhua Fan
Yansen Wang
Jean Oh
Carolyn Rose
29
27
0
20 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
  Intra-modal Knowledge Integration
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
27
53
0
16 Aug 2021
ICAF: Iterative Contrastive Alignment Framework for Multimodal
  Abstractive Summarization
ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization
Zijian Zhang
Chang Shu
Youxin Chen
Jing Xiao
Qian Zhang
Lu Zheng
23
5
0
11 Aug 2021
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language
  Models
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
Zheyuan Liu
Cristian Rodriguez-Opazo
Damien Teney
Stephen Gould
VLM
22
191
0
09 Aug 2021
A Thorough Review on Recent Deep Learning Methodologies for Image
  Captioning
A Thorough Review on Recent Deep Learning Methodologies for Image Captioning
Ahmed Elhagry
Karima Kadaoui
VLM
33
17
0
28 Jul 2021
Align before Fuse: Vision and Language Representation Learning with
  Momentum Distillation
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Chenyu You
Caiming Xiong
Guosheng Lin
FaML
83
1,893
0
16 Jul 2021
From Show to Tell: A Survey on Deep Learning-based Image Captioning
From Show to Tell: A Survey on Deep Learning-based Image Captioning
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
S. Cascianelli
G. Fiameni
Rita Cucchiara
3DV
VLM
MLLM
69
256
0
14 Jul 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Joey Tianyi Zhou
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
202
406
0
13 Jul 2021
End-to-end Multi-modal Video Temporal Grounding
End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen
Yi-Hsuan Tsai
Ming-Hsuan Yang
11
51
0
12 Jul 2021
Learning Vision-Guided Quadrupedal Locomotion End-to-End with
  Cross-Modal Transformers
Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers
Ruihan Yang
Minghao Zhang
Nicklas Hansen
Huazhe Xu
Xiaolong Wang
OffRL
20
102
0
08 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
59
95
0
01 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and
  Generation
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
39
37
0
01 Jul 2021
Multimodal Few-Shot Learning with Frozen Language Models
Multimodal Few-Shot Learning with Frozen Language Models
Maria Tsimpoukelli
Jacob Menick
Serkan Cabi
S. M. Ali Eslami
Oriol Vinyals
Felix Hill
MLLM
88
755
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training
  for VQA Challenge 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
34
2
0
24 Jun 2021
GEM: A General Evaluation Benchmark for Multimodal Tasks
GEM: A General Evaluation Benchmark for Multimodal Tasks
Lin Su
Nan Duan
Edward Cui
Lei Ji
Chenfei Wu
Huaishao Luo
Yongfei Liu
Ming Zhong
Taroon Bharti
Arun Sacheti
VLM
19
19
0
18 Jun 2021
Efficient Self-supervised Vision Transformers for Representation
  Learning
Efficient Self-supervised Vision Transformers for Representation Learning
Chunyuan Li
Jianwei Yang
Pengchuan Zhang
Mei Gao
Bin Xiao
Xiyang Dai
Lu Yuan
Jianfeng Gao
ViT
40
209
0
17 Jun 2021
Probing Image-Language Transformers for Verb Understanding
Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks
Aida Nematzadeh
36
114
0
16 Jun 2021
Understanding and Evaluating Racial Biases in Image Captioning
Understanding and Evaluating Racial Biases in Image Captioning
Dora Zhao
Angelina Wang
Olga Russakovsky
30
134
0
16 Jun 2021
Pre-Trained Models: Past, Present and Future
Pre-Trained Models: Past, Present and Future
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin
MQ
AI4MH
58
818
0
14 Jun 2021
Assessing Multilingual Fairness in Pre-trained Multimodal
  Representations
Assessing Multilingual Fairness in Pre-trained Multimodal Representations
Jialu Wang
Yang Liu
Junfeng Fang
EGVM
26
35
0
12 Jun 2021
Dependent Multi-Task Learning with Causal Intervention for Image
  Captioning
Dependent Multi-Task Learning with Causal Intervention for Image Captioning
Wenqing Chen
Jidong Tian
Caoyun Fan
Hao He
Yaohui Jin
CML
27
6
0
18 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic
  Survey
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Min Zhang
56
270
0
10 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
90
864
0
26 Apr 2021
Playing Lottery Tickets with Vision and Language
Playing Lottery Tickets with Vision and Language
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
109
54
0
23 Apr 2021
Understanding Chinese Video and Language via Contrastive Multimodal
  Pre-Training
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
Chenyi Lei
Shixian Luo
Yong-jin Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
30
41
0
19 Apr 2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich
  Document Understanding
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Yiheng Xu
Tengchao Lv
Lei Cui
Guoxin Wang
Yijuan Lu
D. Florêncio
Cha Zhang
Furu Wei
MLLM
VLM
38
128
0
18 Apr 2021
Previous
123...1089
Next