2410.10798
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
14 October 2024
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
DiffM
Papers citing
"MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling"
37 / 37 papers shown
Title
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
95
502
0
06 Nov 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schmidt
VLM
79
81
0
12 Aug 2023
LightGlue: Local Feature Matching at Light Speed
Philipp Lindenberger
Paul-Edouard Sarlin
Marc Pollefeys
3DV
VLM
93
434
0
23 Jun 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
1.4K
14,631
0
15 Mar 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
201
1,634
0
15 Dec 2022
ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer
Hongkai Chen
Zixin Luo
Lei Zhou
Yurun Tian
Mingmin Zhen
Tian Fang
David McKinnon
Yanghai Tsin
Long Quan
81
171
0
30 Aug 2022
MatchFormer: Interleaving Attention in Transformers for Feature Matching
Qing Wang
Jiaming Zhang
Kailun Yang
Kunyu Peng
Rainer Stiefelhagen
ViT
76
144
0
17 Mar 2022
Learning to Match Features with Seeded Graph Matching Network
Hongkai Chen
Zixin Luo
Jiahui Zhang
Lei Zhou
Xuyang Bai
Zeyu Hu
Chiew-Lan Tai
Long Quan
58
113
0
19 Aug 2021
Are Convolutional Neural Networks or Transformers more like human vision?
Shikhar Tuli
Ishita Dasgupta
Erin Grant
Thomas Griffiths
ViT
FaML
56
185
0
15 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
174
883
0
26 Apr 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
453
21,439
0
25 Mar 2021
Learning Multi-Scene Absolute Pose Regression with Transformers
Yoli Shavit
Ron Ferens
Y. Keller
ViT
57
123
0
21 Mar 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
659
41,103
0
22 Oct 2020
Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency
Robert Geirhos
Kristof Meding
Felix Wichmann
63
123
0
30 Jun 2020
DISK: Learning local features with policy gradient
M. Tyszkiewicz
Pascal Fua
Eduard Trulls
OffRL
84
375
0
24 Jun 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
110
1,941
0
13 Apr 2020
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed
Jia Deng
MDE
244
2,625
0
26 Mar 2020
Adversarial Attacks on Monocular Depth Estimation
Ziqi Zhang
Xinge Zhu
Yingwei Li
Xiangqun Chen
Yao Guo
AAML
MDE
69
25
0
23 Mar 2020
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi
Chuang Gan
Yunzhu Li
Pushmeet Kohli
Jiajun Wu
Antonio Torralba
J. Tenenbaum
NAI
121
473
0
03 Oct 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
160
1,666
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Tan
Mohit Bansal
VLM
MLLM
247
2,483
0
20 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
144
1,955
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
231
3,684
0
06 Aug 2019
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl
Katrin Lasinger
David Hafner
Konrad Schindler
V. Koltun
MDE
204
1,793
0
02 Jul 2019
Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters
Axel Barroso Laguna
Edgar Riba
D. Ponsa
K. Mikolajczyk
3DPC
50
278
0
01 Apr 2019
From Coarse to Fine: Robust Hierarchical Localization at Large Scale
Paul-Edouard Sarlin
Cesar Cadena
Roland Siegwart
Marcin Dymczyk
3DV
45
875
0
09 Dec 2018
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li
Noah Snavely
MDE
3DV
109
1,020
0
02 Apr 2018
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson
B. Hariharan
Laurens van der Maaten
Li Fei-Fei
C. L. Zitnick
Ross B. Girshick
CoGe
307
2,378
0
20 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
345
3,246
0
02 Dec 2016
Image-based localization using LSTMs for structured feature correlation
F. Walch
C. Hazirbas
Laura Leal-Taixé
Torsten Sattler
S. Hilsenbeck
Daniel Cremers
70
496
0
23 Nov 2016
FVQA: Fact-based Visual Question Answering
Peng Wang
Qi Wu
Chunhua Shen
Anton van den Hengel
A. Dick
CoGe
77
461
0
17 Jun 2016
Single-Image Depth Perception in the Wild
Weifeng Chen
Z. Fu
Dawei Yang
Jia Deng
MDE
103
520
0
13 Apr 2016
Yin and Yang: Balancing and Answering Binary Visual Questions
Peng Zhang
Yash Goyal
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
87
352
0
16 Nov 2015
Exploring Models and Data for Image Question Answering
Mengye Ren
Ryan Kiros
R. Zemel
80
715
0
08 May 2015
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
211
5,478
0
03 May 2015
ORB-SLAM: a Versatile and Accurate Monocular SLAM System
Raul Mur-Artal
José M.M. Montiel
Juan D. Tardós
122
6,399
0
03 Feb 2015
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
413
43,667
0
01 May 2014