ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2208.10442
  4. Cited By
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
    MLLM
    VLM
    ViT
ArXivPDFHTML

Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"

50 / 459 papers shown
Title
A Cookbook of Self-Supervised Learning
A Cookbook of Self-Supervised Learning
Randall Balestriero
Mark Ibrahim
Vlad Sobal
Ari S. Morcos
Shashank Shekhar
...
Pierre Fernandez
Amir Bar
Hamed Pirsiavash
Yann LeCun
Micah Goldblum
SyDa
FedML
SSL
50
275
0
24 Apr 2023
CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with
  Mixture-of-Textual-Experts for Passage Retrieval
CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with Mixture-of-Textual-Experts for Passage Retrieval
Guangyuan Ma
Xing Wu
Peng Wang
Songlin Hu
MoE
RALM
26
5
0
20 Apr 2023
Enhancing Textbooks with Visuals from the Web for Improved Learning
Enhancing Textbooks with Visuals from the Web for Improved Learning
Janvijay Singh
Vilém Zouhar
Mrinmaya Sachan
27
3
0
18 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic
  Segmentation
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
37
11
0
14 Apr 2023
Modeling Dense Multimodal Interactions Between Biological Pathways and
  Histology for Survival Prediction
Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction
Guillaume Jaume
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Paul Pu Liang
Faisal Mahmood
46
44
0
13 Apr 2023
On the Opportunities and Challenges of Foundation Models for Geospatial
  Artificial Intelligence
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
35
123
0
13 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal
  representations
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
21
4
0
11 Apr 2023
Training Large Language Models Efficiently with Sparsity and Dataflow
Training Large Language Models Efficiently with Sparsity and Dataflow
V. Srinivasan
Darshan Gandhi
Urmish Thakker
R. Prabhakar
MoE
40
6
0
11 Apr 2023
A Billion-scale Foundation Model for Remote Sensing Images
A Billion-scale Foundation Model for Remote Sensing Images
Keumgang Cha
Junghoon Seo
Taekyung Lee
38
64
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM
3DV
38
25
0
11 Apr 2023
On Robustness in Multimodal Learning
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
42
2
0
10 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
36
37
0
09 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer
  Pre-training
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
51
7
0
09 Apr 2023
Attention: Marginal Probability is All You Need?
Attention: Marginal Probability is All You Need?
Ryan Singh
Christopher L. Buckley
31
2
0
07 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature
  Review
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
35
41
0
07 Apr 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior
  Refinement
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
Xiang-yu Zhu
Renrui Zhang
Bowei He
A-Long Zhou
Dong Wang
Bingyan Zhao
Peng Gao
VLM
42
80
0
03 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
24
44
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
62
8
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
37
23
0
29 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init
  Attention
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
74
747
0
28 Mar 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch
  and Token Embeddings to Finite Discrete Tokens
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen
Jianbo Yuan
Yu Tian
Shijie Geng
Xinyu Li
Ding Zhou
Dimitris N. Metaxas
Hongxia Yang
14
34
0
27 Mar 2023
Equivariant Similarity for Vision-Language Foundation Models
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
46
44
0
25 Mar 2023
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
Sukmin Yun
S. Park
Paul Hongsuck Seo
Jinwoo Shin
VLM
MLLM
57
14
0
25 Mar 2023
Accelerating Vision-Language Pretraining with Free Language Modeling
Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang
Yixiao Ge
Feng Zheng
Ran Cheng
Ying Shan
Xiaohu Qie
Ping Luo
VLM
MLLM
93
9
0
24 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
54
13
0
23 Mar 2023
MAGVLT: Masked Generative Vision-and-Language Transformer
MAGVLT: Masked Generative Vision-and-Language Transformer
Sungwoong Kim
DaeJin Jo
Donghoon Lee
Jongmin Kim
VLM
47
12
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
32
29
0
20 Mar 2023
EVA-02: A Visual Representation for Neon Genesis
EVA-02: A Visual Representation for Neon Genesis
Yuxin Fang
Quan-Sen Sun
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
ViT
CLIP
42
263
0
20 Mar 2023
MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
  Cross-Modal Representations
MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations
Ye Wang
Bo‐Shu Jiang
C. Zou
Rui Ma
32
5
0
20 Mar 2023
IRGen: Generative Modeling for Image Retrieval
IRGen: Generative Modeling for Image Retrieval
Yidan Zhang
Ting Zhang
Dong Chen
Yujing Wang
Qi Chen
...
Qi Zhang
Fan Yang
Mao Yang
Q. Liao
B. Guo
3DV
VLM
40
14
0
17 Mar 2023
Investigating the Role of Attribute Context in Vision-Language Models
  for Object Recognition and Detection
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Kyle Buettner
Adriana Kovashka
28
0
0
17 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Dual-path Adaptation from Image to Video Transformers
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
ViT
21
37
0
17 Mar 2023
Cross-Modal Causal Intervention for Medical Report Generation
Cross-Modal Causal Intervention for Medical Report Generation
Weixing Chen
Yang Liu
Ce Wang
Jiarui Zhu
Shen Zhao
Guanbin Li
Cheng-Lin Liu
Liang Lin
39
6
0
16 Mar 2023
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness
  with Dataset Reinforcement
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Fartash Faghri
Hadi Pouransari
Sachin Mehta
Mehrdad Farajtabar
Ali Farhadi
Mohammad Rastegari
Oncel Tuzel
43
9
0
15 Mar 2023
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Bo He
Jun Wang
Jielin Qiu
Trung Bui
Abhinav Shrivastava
Zhaowen Wang
22
66
0
13 Mar 2023
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical
  Documents
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Weixiong Lin
Ziheng Zhao
Xiaoman Zhang
Chaoyi Wu
Ya Zhang
Yanfeng Wang
Weidi Xie
LM&MA
VLM
MedIm
31
144
0
13 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
26
63
0
13 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
37
1
0
13 Mar 2023
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched
  Visual Descriptions
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu
Jun Chen
Kilichbek Haydarov
Xiaoqian Shen
Wenxuan Zhang
Mohamed Elhoseiny
MLLM
45
97
0
12 Mar 2023
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
Fan Bao
Shen Nie
Kaiwen Xue
Chongxuan Li
Shiliang Pu
Yaole Wang
Gang Yue
Yue Cao
Hang Su
Jun Zhu
DiffM
207
151
0
12 Mar 2023
Understanding and Constructing Latent Modality Structures in Multi-modal
  Representation Learning
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Qian Jiang
Changyou Chen
Han Zhao
Liqun Chen
Q. Ping
S. D. Tran
Yi Xu
Belinda Zeng
Trishul Chilimbi
54
40
0
10 Mar 2023
MuLTI: Efficient Video-and-Language Understanding with Text-Guided
  MultiWay-Sampler and Multiple Choice Modeling
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
51
1
0
10 Mar 2023
HumanBench: Towards General Human-centric Perception with Projector
  Assisted Pretraining
HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining
Shixiang Tang
Cheng Chen
Qingsong Xie
Meilin Chen
Yizhou Wang
...
Feng Zhu
Haiyang Yang
Li Yi
Rui Zhao
Wanli Ouyang
VLM
32
36
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
69
74
0
10 Mar 2023
Exploiting the Textual Potential from Vision-Language Pre-training for
  Text-based Person Search
Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
Guanshuo Wang
Fufu Yu
Jianing Li
Qiong Jia
Shouhong Ding
29
18
0
08 Mar 2023
Bootstrap The Original Latent: Learning a Private Model from a Black-box
  Model
Bootstrap The Original Latent: Learning a Private Model from a Black-box Model
Shuai Wang
Daoan Zhang
Jiang Zhang
Weiwei Zhang
Ruizhen Li
FedML
37
0
0
07 Mar 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware
  Attention
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
Shijie Geng
Jianbo Yuan
Yu Tian
Yuxiao Chen
Yongfeng Zhang
CLIP
VLM
49
44
0
06 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
  Tasks
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
24
38
0
04 Mar 2023
DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction
DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction
Shubhankar Borse
Debasmit Das
Hyojin Park
H. Cai
Risheek Garrepalli
Fatih Porikli
43
9
0
02 Mar 2023
Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
Sora Takashima
Ryo Hayamizu
Nakamasa Inoue
Hirokatsu Kataoka
Rio Yokota
68
19
0
02 Mar 2023
Previous
123...106789
Next