ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.10652
  4. Cited By
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

16 April 2024
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
    CoGe
ArXivPDFHTML

Papers citing "ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images"

50 / 59 papers shown
Title
Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory
Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory
Srishti Yadav
Lauren Tilton
Maria Antoniak
T. Arnold
Jiaang Li
...
Zhaochong An
Negar Rostamzadeh
Daniel Hershcovich
Serge J. Belongie
Ekaterina Shutova
VLM
CoGe
26
0
0
28 May 2025
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Fakhraddin Alwajih
Samar Magdy
Abdellah El Mekki
Omer Nacar
Youssef Nafea
...
Mohamedou cheikh tourad
Ismail Berrada
Mustafa Jarrar
Shady Shehata
Muhammad Abdul-Mageed
VLM
26
0
0
28 May 2025
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using
  Transformer-based Method in Vietnamese Text-based Visual Question Answering
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Nghia Hieu Nguyen
Tho Thanh Quan
Ngan Luu-Thuy Nguyen
53
0
0
18 Oct 2024
Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition
Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition
T. Do
Dinh Phu Tran
An Vo
Daeyoung Kim
64
0
0
17 Oct 2024
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Khang T. Doan
Bao G. Huynh
D. T. Hoang
Thuc D. Pham
Nhat H. Pham
Quan T.M. Nguyen
Bang Q. Vo
Suong N. Hoang
MLLM
49
6
0
22 Aug 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question
  Answering by Understanding Vietnamese Text in Images
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
72
4
0
29 Apr 2024
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model
  for Visual Question Answering in Vietnamese
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
Khiem Vinh Tran
Hao Phu Phan
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
45
6
0
27 Oct 2023
Separate and Locate: Rethink the Text in Text-based Visual Question
  Answering
Separate and Locate: Rethink the Text in Text-based Visual Question Answering
Chengyang Fang
Jiangnan Li
Liang Li
Can Ma
Dayong Hu
56
13
0
31 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding,
  Localization, Text Reading, and Beyond
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
109
896
0
24 Aug 2023
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle
Abhay Jain
Radu Timofte
Goran Glavaš
VLM
MLLM
57
30
0
13 Jul 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
424
4,539
0
30 Jan 2023
ViHOS: Hate Speech Spans Detection for Vietnamese
ViHOS: Hate Speech Spans Detection for Vietnamese
Phu Gia Hoang
Canh Duc Luu
K. Tran
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
49
22
0
24 Jan 2023
PreSTU: Pre-Training for Scene-Text Understanding
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
161
29
0
12 Sep 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset
  for Multi-Modal Understanding
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
41
3
0
05 Aug 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
129
549
0
27 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
  Skip-connections
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM
MLLM
81
220
0
24 May 2022
ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language
  Generation
ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
Long Phan
H. Tran
Hieu Duy Nguyen
Trieu H. Trinh
ViT
75
68
0
13 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
371
3,535
0
29 Apr 2022
Visual Prompt Tuning
Visual Prompt Tuning
Menglin Jia
Luming Tang
Bor-Chun Chen
Claire Cardie
Serge Belongie
Bharath Hariharan
Ser-Nam Lim
VLM
VPVLM
148
1,624
0
23 Mar 2022
SwinTextSpotter: Scene Text Spotting via Better Synergy between Text
  Detection and Text Recognition
SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition
Mingxin Huang
Yuliang Liu
Zhenghao Peng
Chongyu Liu
Dahua Lin
Shenggao Zhu
N. Yuan
Kai Ding
Lianwen Jin
ViT
39
102
0
19 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified
  Vision-Language Understanding and Generation
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
524
4,343
0
28 Jan 2022
LaTr: Layout-Aware Transformer for Scene-Text VQA
LaTr: Layout-Aware Transformer for Scene-Text VQA
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
85
101
0
23 Dec 2021
Learning to Prompt for Vision-Language Models
Learning to Prompt for Vision-Language Models
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
VPVLM
CLIP
VLM
490
2,396
0
02 Sep 2021
InfographicVQA
InfographicVQA
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
86
229
0
26 Apr 2021
VisualMRC: Machine Reading Comprehension on Document Images
VisualMRC: Machine Reading Comprehension on Document Images
Ryota Tanaka
Kyosuke Nishida
Sen Yoshida
82
145
0
27 Jan 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
632
41,003
0
22 Oct 2020
DocVQA: A Dataset for VQA on Document Images
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
134
715
0
01 Jul 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
103
1,938
0
13 Apr 2020
Iterative Answer Prediction with Pointer-Augmented Multimodal
  Transformers for TextVQA
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Ronghang Hu
Amanpreet Singh
Trevor Darrell
Marcus Rohrbach
69
197
0
14 Nov 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
419
20,127
0
23 Oct 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
153
1,663
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from
  Transformers
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
237
2,479
0
20 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal
  Pre-training
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
200
902
0
16 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
224
3,678
0
06 Aug 2019
How multilingual is Multilingual BERT?
How multilingual is Multilingual BERT?
Telmo Pires
Eva Schlinger
Dan Garrette
LRM
VLM
143
1,409
0
04 Jun 2019
OK-VQA: A Visual Question Answering Benchmark Requiring External
  Knowledge
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
82
1,072
0
31 May 2019
Scene Text Visual Question Answering
Scene Text Visual Question Answering
Ali Furkan Biten
Rubèn Pérez Tito
Andrés Mafla
Lluís Gómez
Marçal Rusiñol
Ernest Valveny
C. V. Jawahar
Dimosthenis Karatzas
90
356
0
31 May 2019
Towards VQA Models That Can Read
Towards VQA Models That Can Read
Amanpreet Singh
Vivek Natarajan
Meet Shah
Yu Jiang
Xinlei Chen
Dhruv Batra
Devi Parikh
Marcus Rohrbach
EgoV
82
1,216
0
18 Apr 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
1.7K
94,770
0
11 Oct 2018
VizWiz Grand Challenge: Answering Visual Questions from Blind People
VizWiz Grand Challenge: Answering Visual Questions from Blind People
Danna Gurari
Qing Li
Abigale Stangl
Anhong Guo
Chi Lin
Kristen Grauman
Jiebo Luo
Jeffrey P. Bigham
CoGe
90
847
0
22 Feb 2018
VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
Thanh Tien Vu
Dat Quoc Nguyen
Dai Quoc Nguyen
Mark Dras
Mark Johnson
63
148
0
04 Jan 2018
A Fast and Accurate Vietnamese Word Segmenter
A Fast and Accurate Vietnamese Word Segmenter
Dat Quoc Nguyen
Dai Quoc Nguyen
Thanh Tien Vu
Mark Dras
Mark Johnson
35
65
0
19 Sep 2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual
  Question Answering
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson
Xiaodong He
Chris Buehler
Damien Teney
Mark Johnson
Stephen Gould
Lei Zhang
AIMat
121
4,215
0
25 Jul 2017
Automatic Generation of Grounded Visual Questions
Automatic Generation of Grounded Visual Questions
Shijie Zhang
Zhuang Li
Shaodi You
Zhenglu Yang
Jiawan Zhang
OOD
51
79
0
20 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in
  Visual Question Answering
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
333
3,238
0
02 Dec 2016
SQuAD: 100,000+ Questions for Machine Comprehension of Text
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar
Jian Zhang
Konstantin Lopyrev
Percy Liang
RALM
274
8,127
0
16 Jun 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and
  Visual Grounding
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui
Dong Huk Park
Daylen Yang
Anna Rohrbach
Trevor Darrell
Marcus Rohrbach
293
1,465
0
06 Jun 2016
Multimodal Residual Learning for Visual QA
Multimodal Residual Learning for Visual QA
Jin-Hwa Kim
Sang-Woo Lee
Donghyun Kwak
Min-Oh Heo
Jeonghee Kim
Jung-Woo Ha
Byoung-Tak Zhang
51
300
0
05 Jun 2016
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense
  Image Annotations
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
...
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
215
5,743
0
23 Feb 2016
Deep Residual Learning for Image Recognition
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
2.2K
193,878
0
10 Dec 2015
12
Next