Effective End-to-End Vision Language Pretraining with Semantic Visual Loss

18 January 2023

Xiaofeng Yang

Fayao Liu

Guosheng Lin

VLM

ArXiv PDF HTML

Papers citing "Effective End-to-End Vision Language Pretraining with Semantic Visual Loss"

21 / 21 papers shown

Title
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment Xiaoxu Xu Yitian Yuan Qiudan Zhang Wen-Bin Wu Zequn Jie Lin Ma Xu Wang 95 4 0 15 Dec 2023
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision Wonjae Kim Bokyung Son Ildoo Kim VLM CLIP 112 1,741 0 05 Feb 2021
Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning Beliz Gunel Jingfei Du Alexis Conneau Ves Stoyanov 60 505 0 03 Nov 2020
Rethinking Pre-training and Self-training Barret Zoph Golnaz Ghiasi Nayeon Lee Huayu Chen Hanxiao Liu E. D. Cubuk Quoc V. Le SSeg 85 651 0 11 Jun 2020
Linformer: Self-Attention with Linear Complexity Sinong Wang Belinda Z. Li Madian Khabsa Han Fang Hao Ma 195 1,700 0 08 Jun 2020
End-to-End Object Detection with Transformers Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov Sergey Zagoruyko ViT 3DV PINN 372 13,025 0 26 May 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data Di Qi Lin Su Jianwei Song Edward Cui Taroon Bharti Arun Sacheti VLM 76 261 0 22 Jan 2020
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter Victor Sanh Lysandre Debut Julien Chaumond Thomas Wolf 218 7,498 0 02 Oct 2019
RandAugment: Practical automated data augmentation with a reduced search space E. D. Cubuk Barret Zoph Jonathon Shlens Quoc V. Le MQ 217 3,485 0 30 Sep 2019
UNITER: UNiversal Image-TExt Representation Learning Yen-Chun Chen Linjie Li Licheng Yu Ahmed El Kholy Faisal Ahmed Zhe Gan Yu Cheng Jingjing Liu VLM OT 101 447 0 25 Sep 2019
TinyBERT: Distilling BERT for Natural Language Understanding Xiaoqi Jiao Yichun Yin Lifeng Shang Xin Jiang Xiao Chen Linlin Li F. Wang Qun Liu VLM 94 1,860 0 23 Sep 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers Hao Hao Tan Joey Tianyi Zhou VLM MLLM 231 2,477 0 20 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks Jiasen Lu Dhruv Batra Devi Parikh Stefan Lee SSL VLM 219 3,674 0 06 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy M. Lewis Luke Zettlemoyer Veselin Stoyanov AIMat 582 24,422 0 26 Jul 2019
From Recognition to Cognition: Visual Commonsense Reasoning Rowan Zellers Yonatan Bisk Ali Farhadi Yejin Choi LRM BDL OCL ReLM 153 879 0 27 Nov 2018
Discriminability objective for training descriptive captions Ruotian Luo Brian L. Price Scott D. Cohen Gregory Shakhnarovich 100 203 0 12 Mar 2018
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould Lei Zhang AIMat 117 4,214 0 25 Jul 2017
COCO-Stuff: Thing and Stuff Classes in Context Holger Caesar J. Uijlings V. Ferrari 121 1,385 0 12 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering Yash Goyal Tejas Khot D. Summers-Stay Dhruv Batra Devi Parikh CoGe 324 3,235 0 02 Dec 2016
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Bryan A. Plummer Liwei Wang Christopher M. Cervantes Juan C. Caicedo Julia Hockenmaier Svetlana Lazebnik 193 2,053 0 19 May 2015
Microsoft COCO Captions: Data Collection and Evaluation Server Xinlei Chen Hao Fang Nayeon Lee Ramakrishna Vedantam Saurabh Gupta Piotr Dollar C. L. Zitnick 205 2,475 0 01 Apr 2015