Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
    VLM

Papers citing "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"

50 / 482 papers shown
EDIS: Entity-Driven Image Search over Multimodal Web Content
Siqi Liu
Weixi Feng
Tsu-Jui Fu
Wenhu Chen
Wenjie Wang
VLM
48
9
0
23 May 2023
Generalizable Synthetic Image Detection via Language-guided Contrastive Learning
Haiwei Wu
Jiantao Zhou
Shile Zhang
118
28
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
23
17
0
22 May 2023
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding
Chenchi Zhang
Jun Xiao
Lei Chen
Jian Shao
Long Chen
VLM
LRM
34
2
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
50
116
0
18 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM
LRM
134
707
0
17 May 2023
IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images
Varuna Krishna
S. Suryavardan
Shreyash Mishra
Sathyanarayanan Ramamoorthy
Parth Patwa
Megha Chakraborty
Aman Chadha
Amitava Das
Amit P. Sheth
VLM
33
3
0
12 May 2023
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang
Wei Ye
Haiyang Xu
Miang Yan
Shikun Zhang
Jie Zhang
Fei Huang
VLM
34
15
0
08 May 2023
Multi-Modality Deep Network for JPEG Artifacts Reduction
Xuhao Jiang
Weimin Tan
Qing Lin
Chenxi Ma
Bo Yan
Liquan Shen
46
2
0
04 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
47
2
0
03 May 2023
Multimodal Graph Transformer for Multimodal Question Answering
Xuehai He
Xin Eric Wang
36
7
0
30 Apr 2023
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining
Bingqian Lin
Zicong Chen
Mingjie Li
Haokun Lin
Hang Xu
...
Ling-Hao Chen
Xiaojun Chang
Yi Yang
L. Xing
Xiaodan Liang
LM&MA
MedIm
AI4CE
40
14
0
26 Apr 2023
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment
Lewei Yao
Jianhua Han
Xiaodan Liang
Danqian Xu
Wei Zhang
Zhenguo Li
Hang Xu
VLM
ObjD
CLIP
56
74
0
10 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
24
1
0
10 Apr 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
46
51
0
06 Apr 2023
Multi-Modal Representation Learning with Text-Driven Soft Masks
Jaeyoo Park
Bohyung Han
SSL
30
4
0
03 Apr 2023
Sketch-based Video Object Localization
Sangmin Woo
So-Yeong Jeon
Jinyoung Park
Minji Son
Sumin Lee
Changick Kim
24
0
0
02 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
30
20
0
31 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
37
23
0
29 Mar 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Chunpu Xu
Jing Li
VLM
26
5
0
27 Mar 2023
LIMITR: Leveraging Local Information for Medical Image-Text Representation
Gefen Dawidowicz
Elad Hirsch
A. Tal
31
15
0
21 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
46
47
0
21 Mar 2023
Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
Shi Chen
Qi Zhao
47
5
0
18 Mar 2023
PersonalTailor: Personalizing 2D Pattern Design from 3D Garment Point Clouds
Sauradip Nag
Anran Qi
Xiatian Zhu
Ariel Shamir
3DPC
44
6
0
17 Mar 2023
Data Roaming and Quality Assessment for Composed Image Retrieval
Matan Levy
Rami Ben-Ari
N. Darshan
Dani Lischinski
48
23
0
16 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
73
14
0
14 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
22
2
0
09 Mar 2023
CLIP-guided Prototype Modulating for Few-shot Action Recognition
Xiang Wang
Shiwei Zhang
Jun Cen
Changxin Gao
Yingya Zhang
Deli Zhao
Nong Sang
VLM
27
53
0
06 Mar 2023
Prismer: A Vision-Language Model with Multi-Task Experts
Shikun Liu
Linxi Fan
Edward Johns
Zhiding Yu
Chaowei Xiao
Anima Anandkumar
VLM
MLLM
49
21
0
04 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
24
38
0
04 Mar 2023
ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing
Zequn Zeng
Hao Zhang
Zhengjue Wang
Ruiying Lu
Dongsheng Wang
Bo Chen
BDL
DiffM
24
33
0
04 Mar 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Maria Lymperaiou
Giorgos Stamou
VLM
32
4
0
04 Mar 2023
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu
Xuecheng Ouyang
Zhenwei Shao
Mei Wang
Jun Yu
MLLM
94
11
0
03 Mar 2023
Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning
Ivona Najdenkoska
Xiantong Zhen
M. Worring
VLM
26
18
0
28 Feb 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Few-shot Multimodal Multitask Multilingual Learning
Aman Chadha
Vinija Jain
53
0
0
19 Feb 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Zhihong Chen
Shizhe Diao
Benyou Wang
Guanbin Li
Xiang Wan
MedIm
27
29
0
17 Feb 2023
Retrieval-augmented Image Captioning
R. Ramos
Desmond Elliott
Bruno Martins
VLM
34
29
0
16 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
36
7
0
16 Feb 2023
Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Henrik Voigt
J. Hombeck
M. Meuschke
K. Lawonn
Sina Zarrieß
VLM
33
1
0
13 Feb 2023
Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Ryumei Nakada
Halil Ibrahim Gulluk
Zhun Deng
Wenlong Ji
James Zou
Linjun Zhang
SSL
VLM
42
37
0
13 Feb 2023
Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
Bingqian Lin
Yi Zhu
Xiaodan Liang
Liang Lin
Jian-zhuo Liu
CoGe
LM&Ro
41
3
0
13 Feb 2023
HateProof: Are Hateful Meme Detection Systems really Robust?
Piush Aggarwal
Pranit Chawla
Mithun Das
Punyajoy Saha
Binny Mathew
Torsten Zesch
Animesh Mukherjee
AAML
37
8
0
11 Feb 2023
Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval
Ben Chen
Linbo Jin
Xinxin Wang
D. Gao
Wen Jiang
Wei Ning
22
3
0
10 Feb 2023
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Kuniaki Saito
Kihyuk Sohn
Xiang Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
30
107
0
06 Feb 2023
Controlling for Stereotypes in Multimodal Language Model Evaluation
Manuj Malik
Richard Johansson
31
1
0
03 Feb 2023
Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering
Chenxi Whitehouse
Tillman Weyde
Pranava Madhyastha
LRM
44
3
0
25 Jan 2023
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval
Yan Zhang
Zhong Ji
Dingrong Wang
Yanwei Pang
Xuelong Li
VLM
24
23
0
17 Jan 2023
Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks
Xinsong Zhang
Yan Zeng
Jipeng Zhang
Hang Li
VLM
AI4CE
LRM
27
17
0
12 Jan 2023
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
Manh-Duy Nguyen
Binh T. Nguyen
C. Gurrin
VLM
28
4
0
11 Jan 2023