MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
arXiv:2410.10798, v3 (latest)

14 October 2024
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
    DiffM

Papers citing "MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling"

37 / 37 papers shown
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM MLLM
95
502
0
06 Nov 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schmidt
VLM
79
81
0
12 Aug 2023
LightGlue: Local Feature Matching at Light Speed
Philipp Lindenberger
Paul-Edouard Sarlin
Marc Pollefeys
3DV VLM
93
434
0
23 Jun 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG MLLM
1.4K
14,631
0
15 Mar 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa MoMe
201
1,634
0
15 Dec 2022
ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer
Hongkai Chen
Zixin Luo
Lei Zhou
Yurun Tian
Mingmin Zhen
Tian Fang
David McKinnon
Yanghai Tsin
Long Quan
81
171
0
30 Aug 2022
MatchFormer: Interleaving Attention in Transformers for Feature Matching
Qing Wang
Jiaming Zhang
Kailun Yang
Kunyu Peng
Rainer Stiefelhagen
ViT
76
144
0
17 Mar 2022
Learning to Match Features with Seeded Graph Matching Network
Hongkai Chen
Zixin Luo
Jiahui Zhang
Lei Zhou
Xuyang Bai
Zeyu Hu
Chiew-Lan Tai
Long Quan
58
113
0
19 Aug 2021
Are Convolutional Neural Networks or Transformers more like human vision?
Shikhar Tuli
Ishita Dasgupta
Erin Grant
Thomas Griffiths
ViT FaML
56
185
0
15 May 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD VLM
174
883
0
26 Apr 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
453
21,439
0
25 Mar 2021
Learning Multi-Scene Absolute Pose Regression with Transformers
Yoli Shavit
Ron Ferens
Y. Keller
ViT
57
123
0
21 Mar 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
659
41,103
0
22 Oct 2020
Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency
Robert Geirhos
Kristof Meding
Felix Wichmann
63
123
0
30 Jun 2020
DISK: Learning local features with policy gradient
M. Tyszkiewicz
Pascal Fua
Eduard Trulls
OffRL
84
375
0
24 Jun 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
110
1,941
0
13 Apr 2020
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed
Jia Deng
MDE
244
2,625
0
26 Mar 2020
Adversarial Attacks on Monocular Depth Estimation
Ziqi Zhang
Xinge Zhu
Yingwei Li
Xiangqun Chen
Yao Guo
AAML MDE
69
25
0
23 Mar 2020
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi
Chuang Gan
Yunzhu Li
Pushmeet Kohli
Jiajun Wu
Antonio Torralba
J. Tenenbaum
NAI
121
473
0
03 Oct 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM MLLM SSL
160
1,666
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Tan
Mohit Bansal
VLM MLLM
247
2,483
0
20 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
144
1,955
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL VLM
231
3,684
0
06 Aug 2019
Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl
Katrin Lasinger
David Hafner
Konrad Schindler
V. Koltun
MDE
204
1,793
0
02 Jul 2019
Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters
Axel Barroso Laguna
Edgar Riba
D. Ponsa
K. Mikolajczyk
3DPC
50
278
0
01 Apr 2019
From Coarse to Fine: Robust Hierarchical Localization at Large Scale
Paul-Edouard Sarlin
Cesar Cadena
Roland Siegwart
Marcin Dymczyk
3DV
45
875
0
09 Dec 2018
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li
Noah Snavely
MDE 3DV
109
1,020
0
02 Apr 2018
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson
B. Hariharan
Laurens van der Maaten
Li Fei-Fei
C. L. Zitnick
Ross B. Girshick
CoGe
307
2,378
0
20 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
345
3,246
0
02 Dec 2016
Image-based localization using LSTMs for structured feature correlation
F. Walch
C. Hazirbas
Laura Leal-Taixé
Torsten Sattler
S. Hilsenbeck
Daniel Cremers
70
496
0
23 Nov 2016
FVQA: Fact-based Visual Question Answering
Peng Wang
Qi Wu
Chunhua Shen
Anton van den Hengel
A. Dick
CoGe
77
461
0
17 Jun 2016
Single-Image Depth Perception in the Wild
Weifeng Chen
Z. Fu
Dawei Yang
Jia Deng
MDE
103
520
0
13 Apr 2016
Yin and Yang: Balancing and Answering Binary Visual Questions
Peng Zhang
Yash Goyal
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
87
352
0
16 Nov 2015
Exploring Models and Data for Image Question Answering
Mengye Ren
Ryan Kiros
R. Zemel
80
715
0
08 May 2015
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
211
5,478
0
03 May 2015
ORB-SLAM: a Versatile and Accurate Monocular SLAM System
Raul Mur-Artal
José M.M. Montiel
Juan D. Tardós
122
6,399
0
03 Feb 2015
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
413
43,667
0
01 May 2014