Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2010.06775
Cited By
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
14 October 2020
Hao Tan
Joey Tianyi Zhou
CLIP
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision"
29 / 29 papers shown
Title
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?
Mohamed Gado
Towhid Taliee
Muhammad Memon
D. Ignatov
Radu Timofte
72
0
0
27 Apr 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
53
0
0
17 Jan 2025
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
203
0
0
01 Dec 2024
Grounding Spatial Relations in Text-Only Language Models
Gorka Azkune
Ander Salaberria
Eneko Agirre
42
0
0
20 Mar 2024
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen
Vipin Vijayan
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
34
2
0
05 Mar 2024
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin
Tejas Srinivasan
Jesse Thomason
Xiang Ren
33
1
0
21 Feb 2024
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
35
1
0
05 Dec 2023
VIP5: Towards Multimodal Foundation Models for Recommendation
Shijie Geng
Juntao Tan
Shuchang Liu
Zuohui Fu
Yongfeng Zhang
32
70
0
23 May 2023
Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining
Qin Chao
Eunsoo Kim
Boyang Albert Li
21
1
0
20 Apr 2023
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
24
9
0
21 Nov 2022
Detecting Euphemisms with Literal Descriptions and Visual Imagery
.Ilker Kesen
Aykut Erdem
Erkut Erdem
Iacer Calixto
29
4
0
08 Nov 2022
How direct is the link between words and images?
Hassan Shahmohammadi
Maria Heitmeier
Elnaz Shafaei-Bajestan
Hendrik P. A. Lensch
Harald Baayen
32
0
0
30 Jun 2022
VALHALLA: Visual Hallucination for Machine Translation
Yi Li
Yikang Shen
Yoon Kim
Chun-Fu Chen
Rogerio Feris
David D. Cox
Nuno Vasconcelos
MLLM
40
38
0
31 May 2022
Visually-Augmented Language Modeling
Weizhi Wang
Li Dong
Hao Cheng
Haoyu Song
Xiaodong Liu
Xifeng Yan
Jianfeng Gao
Furu Wei
VLM
36
18
0
20 May 2022
Lexical Knowledge Internalization for Neural Dialog Generation
Zhiyong Wu
Wei Bi
Xiang Li
Lingpeng Kong
B. Kao
21
2
0
04 May 2022
MCSE: Multimodal Contrastive Learning of Sentence Embeddings
Miaoran Zhang
Marius Mosbach
David Ifeoluwa Adelani
Michael A. Hedderich
Dietrich Klakow
36
35
0
22 Apr 2022
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
Chunyuan Li
Haotian Liu
Liunian Harold Li
Pengchuan Zhang
J. Aneja
...
Ping Jin
Houdong Hu
Zicheng Liu
Yong Jae Lee
Jianfeng Gao
50
145
0
19 Apr 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
19
21
0
17 Mar 2022
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
54
95
0
01 Jul 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Ishan Misra Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
30
274
0
09 Jun 2021
Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation
Zhiyong Wu
Lingpeng Kong
W. Bi
Xiang Li
B. Kao
LRM
23
77
0
30 May 2021
Explaining the Road Not Taken
Hua Shen
Ting-Hao 'Kenneth' Huang
FAtt
XAI
27
9
0
27 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
29
33
0
18 Mar 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
25
295
0
22 Feb 2021
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
227
2,434
0
04 Jan 2021
Deep Learning and the Global Workspace Theory
R. V. Rullen
Ryota Kanai
45
66
0
04 Dec 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
252
927
0
24 Sep 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
299
6,984
0
20 Apr 2018
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie
Ross B. Girshick
Piotr Dollár
Zhuowen Tu
Kaiming He
303
10,233
0
16 Nov 2016
1