2006.06202
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
11 June 2020
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
Papers citing "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages"
41 / 41 papers shown
Title
Lazy But Effective: Collaborative Personalized Federated Learning with Heterogeneous Data
Ljubomir Rokvic
Panayiotis Danassis
Boi Faltings
FedML
43
0
0
05 May 2025
TigerLLM -- A Family of Bangla Large Language Models
Nishat Raihan
Marcos Zampieri
48
0
0
14 Mar 2025
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda
M. Wijanarko
Lucky Susanto
Khumaisa Nur'aini
Derry Wijaya
Alham Fikri Aji
52
0
0
25 Feb 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
51
0
0
24 Feb 2025
Exploring Translation Mechanism of Large Language Models
Hongbin Zhang
Kehai Chen
Xuefeng Bai
Xiucheng Li
Yang Xiang
Min Zhang
67
1
0
17 Feb 2025
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
81
2
0
11 Oct 2024
An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models
Nandini Mundra
Aditya Nanda Kishore
Raj Dabre
Ratish Puduppully
Anoop Kunchukuttan
Mitesh Khapra
30
3
0
08 Jul 2024
Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding
Ahmad Idrissi-Yaghir
Amin Dada
Henning Schafer
Kamyar Arzideh
Giulia Baldini
...
Peter A. Horn
Christin Seifert
F. Nensa
Jens Kleesiek
Christoph M. Friedrich
AI4MH
39
2
0
08 Apr 2024
Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space
Aviad Rom
Kfir Bar
40
1
0
25 Feb 2024
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese
N. Corrêa
Sophia Falk
Shiza Fatimah
Aniket Sen
N. D. Oliveira
30
9
0
30 Jan 2024
RoBERTurk: Adjusting RoBERTa for Turkish
Nuri Tas
22
1
0
07 Jan 2024
Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi
Md. Nishat Raihan
Dhiman Goswami
Antara Mahmud
48
1
0
19 Sep 2023
Unsupervised Paraphrasing of Multiword Expressions
Takashi Wada
Yuji Matsumoto
Timothy Baldwin
Jey Han Lau
26
0
0
02 Jun 2023
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ariel Ekgren
Amaru Cuba Gyllensten
Felix Stollenwerk
Joey Öhman
T. Isbister
Evangelia Gogoulou
F. Carlsson
Alice Heiman
Judit Casademont
Magnus Sahlgren
29
13
0
22 May 2023
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dũng Nguyễn Mạnh
Nam Le Hai
An Dau
A. Nguyen
Khanh N. Nghiem
Jingnan Guo
Nghi D. Q. Bui
34
15
0
09 May 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
30
41
0
07 Apr 2023
FairDistillation: Mitigating Stereotyping in Language Models
Pieter Delobelle
Bettina Berendt
23
8
0
10 Jul 2022
Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna
Isaac Caswell
Julia Kreutzer
Orhan Firat
D. Esch
...
Apurva Shah
Yanping Huang
Z. Chen
Yonghui Wu
Macduff Hughes
56
98
0
09 May 2022
You Are What You Write: Preserving Privacy in the Era of Large Language Models
Richard Plant
V. Giuffrida
Dimitra Gkatzia
PILM
26
19
0
20 Apr 2022
IndicXNLI: Evaluating Multilingual Inference for Indian Languages
Divyanshu Aggarwal
V. Gupta
Anoop Kunchukuttan
28
27
0
19 Apr 2022
Breaking Character: Are Subwords Good Enough for MRLs After All?
Omri Keren
Tal Avinari
Reut Tsarfaty
Omer Levy
36
15
0
10 Apr 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
39
153
0
17 Jan 2022
IndoNLI: A Natural Language Inference Dataset for Indonesian
Rahmad Mahendra
Alham Fikri Aji
Samuel Louvan
Fahrurrozi Rahman
Clara Vania
26
29
0
27 Oct 2021
MFAQ: a Multilingual FAQ Dataset
Maxime De Bruyn
Ehsan Lotfi
Jeska Buhmann
Walter Daelemans
RALM
50
21
0
27 Sep 2021
ParaShoot: A Hebrew Question Answering Dataset
Omri Keren
Omer Levy
37
17
0
23 Sep 2021
Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models
C. Carrino
Jordi Armengol-Estapé
Ona de Gibert Bonet
Asier Gutiérrez-Fandiño
Aitor Gonzalez-Agirre
Martin Krallinger
Marta Villegas
24
20
0
16 Sep 2021
BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation
Haoran Xu
Benjamin Van Durme
Kenton W. Murray
50
57
0
09 Sep 2021
On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets
Cheng-Han Chiang
Hung-yi Lee
SyDa
32
24
0
08 Sep 2021
PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining
Machel Reid
Mikel Artetxe
VLM
50
26
0
04 Aug 2021
Machine Translation into Low-resource Language Varieties
Sachin Kumar
Antonios Anastasopoulos
S. Wintner
Yulia Tsvetkov
11
29
0
12 Jun 2021
Bertinho: Galician BERT Representations
David Vilares
Marcos Garcia
Carlos Gómez-Rodríguez
65
22
0
25 Mar 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
20
267
0
22 Mar 2021
Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models' Transferability
Wei-Tsung Kao
Hung-yi Lee
16
16
0
12 Mar 2021
The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Go Inoue
Bashar Alhafni
Nurpeiis Baimukan
Houda Bouamor
Nizar Habash
35
224
0
11 Mar 2021
Pre-Training BERT on Arabic Tweets: Practical Considerations
Ahmed Abdelali
Sabit Hassan
Hamdy Mubarak
Kareem Darwish
Younes Samih
20
96
0
21 Feb 2021
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
Wissam Antoun
Fady Baly
Hazem M. Hajj
VLM
21
103
0
31 Dec 2020
AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
Wissam Antoun
Fady Baly
Hazem M. Hajj
19
102
0
31 Dec 2020
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages
Kushal Kumar Jain
Adwait Deshpande
Kumar Shridhar
F. Laumann
Ayushman Dash
43
51
0
04 Nov 2020
Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models
Evan Williams
Paul Rodrigues
Valerie Novak
39
42
0
05 Sep 2020
KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media
Ali Safaya
Moutasem Abdullatif
Deniz Yuret
31
314
0
26 Jul 2020
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
Changhan Wang
Anne Wu
J. Pino
SLR
27
72
0
20 Jul 2020