2312.10188
WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data
15 December 2023
Maurice Weber
Carlo Siebenschuh
Rory Butler
Anton Alexandrov
Valdemar Thanner
Georgios Tsolakis
Haris Jabbar
Ian T. Foster
Bo-wen Li
Rick L. Stevens
Ce Zhang
Papers citing
"WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data"
4 / 4 papers shown
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
84
3
0
26 Feb 2025
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,989
0
31 Dec 2020
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
Guillaume Jaume
H. K. Ekenel
Jean-Philippe Thiran
134
355
0
27 May 2019
You Only Look Once: Unified, Real-Time Object Detection
Joseph Redmon
S. Divvala
Ross B. Girshick
Ali Farhadi
ObjD
289
36,335
0
08 Jun 2015