ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2208.10442
  4. Cited By
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
    MLLM
    VLM
    ViT
ArXivPDFHTML

Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"

50 / 459 papers shown
Title
Are Bigger Encoders Always Better in Vision Large Models?
Are Bigger Encoders Always Better in Vision Large Models?
Bozhou Li
Hao Liang
Zimo Meng
Wentao Zhang
VLM
40
3
0
01 Aug 2024
From Attributes to Natural Language: A Survey and Foresight on
  Text-based Person Re-identification
From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification
Fanzhi Jiang
Su Yang
Mark W. Jones
Liumei Zhang
65
1
0
31 Jul 2024
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
  Experts
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Xi Lin
Akshat Shrivastava
Liang Luo
Srinivasan Iyer
Mike Lewis
Gargi Gosh
Luke Zettlemoyer
Armen Aghajanyan
MoE
48
20
0
31 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
41
1
0
23 Jul 2024
QPT V2: Masked Image Modeling Advances Visual Scoring
QPT V2: Masked Image Modeling Advances Visual Scoring
Qizhi Xie
Kun Yuan
Yunpeng Qu
Mingda Wu
Ming Sun
Chao Zhou
Jihong Zhu
44
3
0
23 Jul 2024
SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation
SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation
Pengfei Chen
Lingxi Xie
Xinyue Huo
Xuehui Yu
Xiaopeng Zhang
Yingfei Sun
Zhenjun Han
Qi Tian
VLM
68
1
0
23 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal
  Reasoning
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLM
LRM
49
9
0
22 Jul 2024
Assessing Brittleness of Image-Text Retrieval Benchmarks from
  Vision-Language Models Perspective
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective
Mariya Hendriksen
Shuo Zhang
R. Reinanda
Mohamed Yahya
Edgar Meij
Maarten de Rijke
59
0
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
36
6
0
18 Jul 2024
Distributed Semantic Segmentation with Efficient Joint Source and Task
  Decoding
Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding
Danish Nazir
Timo Bartels
Jan Piewek
Thorsten Bagdonat
Tim Fingscheidt
35
0
0
15 Jul 2024
NODE-Adapter: Neural Ordinary Differential Equations for Better
  Vision-Language Reasoning
NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning
Yi Zhang
Chun-Wun Cheng
Ke Yu
Zhihai He
Carola-Bibiane Schonlieb
Angelica I Aviles-Rivero
VLM
55
2
0
11 Jul 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for
  Resource-Limited Transfer Learning
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Haiwen Diao
Bo Wan
Xu Jia
Yunzhi Zhuge
Ying Zhang
Huchuan Lu
Long Chen
VLM
50
4
0
10 Jul 2024
iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency
iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency
Haruna Yunusa
Qin Shiyin
Abdulrahman Hamman Adama Chukkol
Isah Bello
A. Lawan
Isah Bello
48
4
0
10 Jul 2024
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer:
  A Disentangled Approach
Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
Taolin Zhang
Jiawang Bai
Zhihe Lu
Dongze Lian
Genping Wang
Xinchao Wang
Shu-Tao Xia
43
4
0
09 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Hao Peng
Heng Ji
LRM
48
16
0
08 Jul 2024
Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners
Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners
Mushui Liu
Bozheng Li
Yunlong Yu
VLM
CLIP
34
3
0
04 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
41
10
0
03 Jul 2024
VCHAR:Variance-Driven Complex Human Activity Recognition framework with
  Generative Representation
VCHAR:Variance-Driven Complex Human Activity Recognition framework with Generative Representation
Yuan Sun
Navid Salami Pargoo
Taqiya Ehsan
Zhao Zhang
Jorge Ortiz
HAI
32
3
0
03 Jul 2024
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Longrong Yang
Dong Shen
Chaoxiang Cai
Fan Yang
Size Li
Di Zhang
Xi Li
MoE
59
2
0
28 Jun 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
61
25
0
28 Jun 2024
MammothModa: Multi-Modal Large Language Model
MammothModa: Multi-Modal Large Language Model
Qi She
Junwen Pan
Xin Wan
Rui Zhang
Dawei Lu
Kai Huang
MLLM
VLM
41
1
0
26 Jun 2024
Video Occupancy Models
Video Occupancy Models
Manan Tomar
Philippe Hansen-Estruch
Philip Bachman
Alex Lamb
John Langford
Matthew E. Taylor
Sergey Levine
63
1
0
25 Jun 2024
XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical
  Images
XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images
Elisabeta-Iulia Dima
Pablo Gómez
Sandor Kruk
Peter Kretschmar
Simon Rosen
Călin-Adrian Popa
45
0
0
25 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
49
1
0
13 Jun 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Roman Bachmann
Oğuzhan Fatih Kar
David Mizrahi
Ali Garjani
Mingfei Gao
David Griffiths
Jiaming Hu
Afshin Dehghan
Amir Zamir
MoE
VLM
MLLM
41
14
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks
  and Algorithms
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng-Wei Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
51
1
0
13 Jun 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
41
3
0
13 Jun 2024
Enhancing Domain Adaptation through Prompt Gradient Alignment
Enhancing Domain Adaptation through Prompt Gradient Alignment
Hoang Phan
Lam C. Tran
Quyen Tran
Trung Le
52
0
0
13 Jun 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
  Interleaved with Text
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLM
OffRL
56
21
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent
  Compression Learning
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
47
5
0
11 Jun 2024
MLLM-SR: Conversational Symbolic Regression base Multi-Modal Large
  Language Models
MLLM-SR: Conversational Symbolic Regression base Multi-Modal Large Language Models
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
OffRL
37
3
0
08 Jun 2024
Flexible and Adaptable Summarization via Expertise Separation
Flexible and Adaptable Summarization via Expertise Separation
Preslav Nakov
Mingzhe Li
Shen Gao
Xin Cheng
Qingqing Zhu
Rui Yan
Xin Gao
Xiangliang Zhang
MoE
44
3
0
08 Jun 2024
M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating
  Interferometric SAR and RGB Data
M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data
Matthew J Allen
Francisco Dorr
Joseph A. Gallego-Mejia
Laura Martínez-Ferrer
Anna Jungbluth
Freddie Kalaitzis
Raúl Ramos-Pollán
33
3
0
06 Jun 2024
Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Dmytro Kuzmenko
N. Shvai
LM&Ro
34
0
0
05 Jun 2024
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question
  Answering
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering
Tao Li
Linjun Shou
Xuejun Liu
49
0
0
03 Jun 2024
Unlocking the Power of Spatial and Temporal Information in Medical
  Multimodal Pre-training
Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training
Jinxia Yang
Fuchun Sun
Wayne Xin Zhao
Ji-Rong Wen
40
3
0
30 May 2024
Enhancing Vision-Language Model with Unmasked Token Alignment
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu
Jinliang Zheng
Boxiao Liu
Yu Liu
Hongsheng Li
CLIP
32
0
0
29 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to
  Multimodal Inputs
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
71
5
0
26 May 2024
More Distinctively Black and Feminine Faces Lead to Increased
  Stereotyping in Vision-Language Models
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H.J. Lee
Jacob M. Montgomery
Calvin K. Lai
VLM
45
0
0
22 May 2024
Influence of Water Droplet Contamination for Transparency Segmentation
Influence of Water Droplet Contamination for Transparency Segmentation
Volker Knauthe
Paul Weitz
Thomas Pollabauer
Tristan Wirth
Arne Rak
Arjan Kuijper
Dieter W. Fellner
51
1
0
21 May 2024
Improving Multimodal Learning with Multi-Loss Gradient Modulation
Improving Multimodal Learning with Multi-Loss Gradient Modulation
Konstantinos Kontras
Christos Chatzichristos
Matthew Blaschko
M. D. Vos
32
3
0
13 May 2024
How to Augment for Atmospheric Turbulence Effects on Thermal Adapted
  Object Detection Models?
How to Augment for Atmospheric Turbulence Effects on Thermal Adapted Object Detection Models?
Engin Uzun
Erdem Akagündüz
45
0
0
10 May 2024
You Only Cache Once: Decoder-Decoder Architectures for Language Models
You Only Cache Once: Decoder-Decoder Architectures for Language Models
Yutao Sun
Li Dong
Yi Zhu
Shaohan Huang
Wenhui Wang
Shuming Ma
Quanlu Zhang
Jianyong Wang
Furu Wei
VLM
38
56
0
08 May 2024
Auto-Encoding Morph-Tokens for Multimodal LLM
Auto-Encoding Morph-Tokens for Multimodal LLM
Kaihang Pan
Siliang Tang
Juncheng Li
Zhaoyu Fan
Wei Chow
Shuicheng Yan
Tat-Seng Chua
Yueting Zhuang
Hanwang Zhang
MLLM
35
18
0
03 May 2024
Self-supervised Pre-training of Text Recognizers
Self-supervised Pre-training of Text Recognizers
M. Kišš
Michal Hradiš
SSL
43
1
0
01 May 2024
Exploring Learngene via Stage-wise Weight Sharing for Initializing
  Variable-sized Models
Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models
Shiyu Xia
Wenxuan Zhu
Xu Yang
Xin Geng
34
1
0
25 Apr 2024
Multi-Head Mixture-of-Experts
Multi-Head Mixture-of-Experts
Xun Wu
Shaohan Huang
Wenhui Wang
Furu Wei
MoE
47
12
0
23 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection
  and Correction
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
50
9
0
23 Apr 2024
ECOR: Explainable CLIP for Object Recognition
ECOR: Explainable CLIP for Object Recognition
Ali Rasekh
Sepehr Kazemi Ranjbar
Milad Heidari
Wolfgang Nejdl
VLM
46
4
0
19 Apr 2024
Towards Multi-modal Transformers in Federated Learning
Towards Multi-modal Transformers in Federated Learning
Guangyu Sun
Matías Mendieta
Aritra Dutta
Xin Li
Chong Chen
78
3
0
18 Apr 2024
Previous
12345...8910
Next