Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

31 January 2021

Lisa Anne Hendricks

John F. J. Mellor

R. Schneider

Jean-Baptiste Alayrac

Aida Nematzadeh

ArXiv PDF HTML

Papers citing "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers"

41 / 41 papers shown

Title
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine Hanguang Xiao Feizhong Zhou Xianglong Liu Tianqi Liu Zhipeng Li Xin Liu Xiaoxuan Huang AILaw LM&MA LRM 93 25 0 31 Dec 2024
Language Models are Few-Shot Learners Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan ... Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei BDL 697 41,736 0 28 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models Jize Cao Zhe Gan Yu Cheng Licheng Yu Yen-Chun Chen Jingjing Liu VLM 61 130 0 15 May 2020
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions Arjun Reddy Akula Spandana Gella Yaser Al-Onaizan Song-Chun Zhu Siva Reddy ObjD 47 52 0 04 May 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining Amanpreet Singh Vedanuj Goswami Devi Parikh VLM 63 48 0 19 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks Xiujun Li Xi Yin Chunyuan Li Pengchuan Zhang Xiaowei Hu ... Houdong Hu Li Dong Furu Wei Yejin Choi Jianfeng Gao VLM 90 1,934 0 13 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers Zhicheng Huang Zhaoyang Zeng Bei Liu Dongmei Fu Jianlong Fu ViT 124 438 0 02 Apr 2020
Information-Theoretic Probing with Minimum Description Length Elena Voita Ivan Titov 71 274 0 27 Mar 2020
Visual Grounding in Video for Unsupervised Word Translation Gunnar Sigurdsson Jean-Baptiste Alayrac Aida Nematzadeh Lucas Smaira Mateusz Malinowski João Carreira Phil Blunsom Andrew Zisserman VGen 64 50 0 11 Mar 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech Jean-Baptiste Alayrac Lucas Smaira Ivan Laptev Josef Sivic Andrew Zisserman VGen SSL 114 712 0 13 Dec 2019
Connecting Vision and Language with Localized Narratives Jordi Pont-Tuset J. Uijlings Soravit Changpinyo Radu Soricut V. Ferrari ObjD 75 247 0 06 Dec 2019
Designing and Interpreting Probes with Control Tasks John Hewitt Percy Liang 58 536 0 08 Sep 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers Hao Hao Tan Joey Tianyi Zhou VLM MLLM 227 2,474 0 20 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training Gen Li Nan Duan Yuejian Fang Ming Gong Daxin Jiang Ming Zhou SSL VLM MLLM 200 900 0 16 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language Liunian Harold Li Mark Yatskar Da Yin Cho-Jui Hsieh Kai-Wei Chang VLM 130 1,948 0 09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks Jiasen Lu Dhruv Batra Devi Parikh Stefan Lee SSL VLM 217 3,667 0 06 Aug 2019
XLNet: Generalized Autoregressive Pretraining for Language Understanding Zhilin Yang Zihang Dai Yiming Yang J. Carbonell Ruslan Salakhutdinov Quoc V. Le AI4CE 220 8,415 0 19 Jun 2019
Contrastive Multiview Coding Yonglong Tian Dilip Krishnan Phillip Isola SSL 155 2,395 0 13 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Antoine Miech Dimitri Zhukov Jean-Baptiste Alayrac Makarand Tapaswi Ivan Laptev Josef Sivic VGen 105 1,199 0 07 Jun 2019
VideoBERT: A Joint Model for Video and Language Representation Learning Chen Sun Austin Myers Carl Vondrick Kevin Patrick Murphy Cordelia Schmid VLM SSL 75 1,243 0 03 Apr 2019
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes Yang You Jing Li Sashank J. Reddi Jonathan Hseu Sanjiv Kumar Srinadh Bhojanapalli Xiaodan Song J. Demmel Kurt Keutzer Cho-Jui Hsieh ODL 218 993 0 01 Apr 2019
Learning and Evaluating General Linguistic Intelligence Dani Yogatama Cyprien de Masson dÁutume Jerome T. Connor Tomás Kociský Mike Chrzanowski ... Angeliki Lazaridou Wang Ling Lei Yu Chris Dyer Phil Blunsom ELM AI4CE 141 210 0 31 Jan 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova VLM SSL SSeg 1.6K 94,511 0 11 Oct 2018
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval Niluthpol Chowdhury Mithun Yikang Shen Evangelos E. Papalexakis Amit K. Roy-Chowdhury 37 77 0 23 Aug 2018
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing Taku Kudo John Richardson 178 3,514 0 19 Aug 2018
Representation Learning with Contrastive Predictive Coding Aaron van den Oord Yazhe Li Oriol Vinyals DRL SSL 294 10,253 0 10 Jul 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data Antoine Miech Ivan Laptev Josef Sivic 61 234 0 07 Apr 2018
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge Damien Teney Peter Anderson Xiaodong He Anton Van Den Hengel 86 382 0 09 Aug 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould Lei Zhang AIMat 113 4,208 0 25 Jul 2017
Image Pivoting for Learning Multilingual Multimodal Representations Spandana Gella Rico Sennrich Frank Keller Mirella Lapata SSL 62 78 0 24 Jul 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 646 130,942 0 12 Jun 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering Yash Goyal Tejas Khot D. Summers-Stay Dhruv Batra Devi Parikh CoGe 322 3,224 0 02 Dec 2016
Adversarial Feature Learning Jiasen Lu Philipp Krahenbuhl Trevor Darrell GAN 107 1,608 0 31 May 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations Ranjay Krishna Yuke Zhu Oliver Groth Justin Johnson Kenji Hata ... Yannis Kalantidis Li Li David A. Shamma Michael S. Bernstein Fei-Fei Li 196 5,726 0 23 Feb 2016
Learning Deep Structure-Preserving Image-Text Embeddings Liwei Wang Yin Li Svetlana Lazebnik 79 781 0 19 Nov 2015
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren Kaiming He Ross B. Girshick Jian Sun AIMat ObjD 478 62,122 0 04 Jun 2015
VQA: Visual Question Answering Aishwarya Agrawal Jiasen Lu Stanislaw Antol Margaret Mitchell C. L. Zitnick Dhruv Batra Devi Parikh CoGe 182 5,452 0 03 May 2015
Microsoft COCO Captions: Data Collection and Evaluation Server Xinlei Chen Hao Fang Nayeon Lee Ramakrishna Vedantam Saurabh Gupta Piotr Dollar C. L. Zitnick 203 2,469 0 01 Apr 2015
Deep Visual-Semantic Alignments for Generating Image Descriptions A. Karpathy Li Fei-Fei 103 5,578 0 07 Dec 2014
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models Ryan Kiros Ruslan Salakhutdinov R. Zemel VLM 117 1,397 0 10 Nov 2014
A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics Yunchao Gong Qifa Ke Michael Isard Svetlana Lazebnik 3DV 129 584 0 18 Dec 2012