ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.03915
  4. Cited By
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

7 March 2023
Hugo Laurenccon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
Teven Le Scao
Leandro von Werra
Chenghao Mou
E. G. Ponferrada
Huu Nguyen
Jorg Frohberg
Mario vSavsko
Quentin Lhoest
Angelina McMillan-Major
Gérard Dupont
Stella Biderman
Anna Rogers
Loubna Ben Allal
F. Toni
Giada Pistilli
Olivier Nguyen
Somaieh Nikpoor
Maraim Masoud
Pierre Colombo
Javier de la Rosa
Paulo Villegas
Tristan Thrush
Shayne Longpre
Sebastian Nagel
Leon Weber
M. Muñoz
Jian Zhu
Daniel Alexander van Strien
Zaid Alyafeai
Khalid Almubarak
Minh Chien Vu
Itziar Gonzalez-Dios
Aitor Soroa Etxabe
Kyle Lo
Manan Dey
Pedro Ortiz Suarez
Aaron Gokaslan
Shamik Bose
David Ifeoluwa Adelani
Long Phan
H. Tran
I. Yu
S. Pai
Jenny Chim
Violette Lepercq
Suzana Ilić
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
    AI4CE
    AILaw
ArXivPDFHTML

Papers citing "The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset"

39 / 39 papers shown
Title
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
58
0
0
23 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
42
0
0
01 Apr 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
153
0
0
15 Mar 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schutze
VLM
36
5
0
31 Oct 2024
Mastering the Craft of Data Synthesis for CodeLLMs
Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen
Philip Arthur
Qianyu Feng
Cong Duy Vu Hoang
Yu-Heng Hong
...
Mark Johnson
K. K.
Don Dharmasiri
Long Duong
Yuan-Fang Li
SyDa
58
1
0
16 Oct 2024
Data Processing for the OpenGPT-X Model Family
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
79
2
0
11 Oct 2024
Improving Pretraining Data Using Perplexity Correlations
Improving Pretraining Data Using Perplexity Correlations
Tristan Thrush
Christopher Potts
Tatsunori Hashimoto
32
17
0
09 Sep 2024
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Rishabh Maheshwary
Vikas Yadav
Hoang Nguyen
Khyati Mahajan
Sathwik Tejaswi Madhusudhan
42
3
0
24 Jun 2024
How Do Large Language Models Acquire Factual Knowledge During
  Pretraining?
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang
Jinho Park
Seonghyeon Ye
Sohee Yang
Youngkyung Seo
Du-Seong Chang
Minjoon Seo
KELM
37
30
0
17 Jun 2024
Datasets for Multilingual Answer Sentence Selection
Datasets for Multilingual Answer Sentence Selection
Matteo Gabburo
S. Campese
Federico Agostini
Alessandro Moschitti
38
0
0
14 Jun 2024
Decoding the Diversity: A Review of the Indic AI Research Landscape
Decoding the Diversity: A Review of the Indic AI Research Landscape
Sankalp KJ
Vinija Jain
S. Bhaduri
Tamoghna Roy
Aman Chadha
47
5
0
13 Jun 2024
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small
  Reference Models
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
35
24
0
30 May 2024
The Mosaic Memory of Large Language Models
The Mosaic Memory of Large Language Models
Igor Shilov
Matthieu Meeus
Yves-Alexandre de Montjoye
39
3
0
24 May 2024
Where does In-context Translation Happen in Large Language Models
Where does In-context Translation Happen in Large Language Models
Suzanna Sia
David Mueller
Kevin Duh
LRM
35
0
0
07 Mar 2024
PALO: A Polyglot Large Multimodal Model for 5B People
PALO: A Polyglot Large Multimodal Model for 5B People
Muhammad Maaz
H. Rasheed
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakal
Rao M. Anwer
Timothy Baldwin
M. Felsberg
Fahad S. Khan
VLM
LRM
83
13
0
22 Feb 2024
CBQ: Cross-Block Quantization for Large Language Models
CBQ: Cross-Block Quantization for Large Language Models
Xin Ding
Xiaoyu Liu
Zhijun Tu
Yun-feng Zhang
Wei Li
...
Hanting Chen
Yehui Tang
Zhiwei Xiong
Baoqun Yin
Yunhe Wang
MQ
32
13
0
13 Dec 2023
Oasis: Data Curation and Assessment System for Pretraining of Large
  Language Models
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou
Yubo Chen
Pengfei Cao
Kang Liu
Jun Zhao
Shengping Liu
24
3
0
21 Nov 2023
MELA: Multilingual Evaluation of Linguistic Acceptability
MELA: Multilingual Evaluation of Linguistic Acceptability
Ziyin Zhang
Yikang Liu
Wei Huang
Junyu Mao
Rui Wang
Hai Hu
22
3
0
15 Nov 2023
Cultural Adaptation of Recipes
Cultural Adaptation of Recipes
Yong Cao
Yova Kementchedjhieva
Ruixiang Cui
Antonia Karamolegkou
Li Zhou
Megan Dare
Lucia Donatelli
Daniel Hershcovich
18
5
0
26 Oct 2023
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining
Boxin Wang
Wei Ping
Lawrence C. McAfee
Peng-Tao Xu
Bo Li
M. Shoeybi
Bryan Catanzaro
RALM
16
46
0
11 Oct 2023
Large Language Models as Superpositions of Cultural Perspectives
Large Language Models as Superpositions of Cultural Perspectives
Grgur Kovač
Masataka Sawayama
Rémy Portelas
Cédric Colas
Peter Ford Dominey
Pierre-Yves Oudeyer
LLMAG
32
33
0
15 Jul 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial
  Language Models
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
40
80
0
23 May 2023
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ariel Ekgren
Amaru Cuba Gyllensten
Felix Stollenwerk
Joey Öhman
T. Isbister
Evangelia Gogoulou
F. Carlsson
Alice Heiman
Judit Casademont
Magnus Sahlgren
27
13
0
22 May 2023
Prompting with Pseudo-Code Instructions
Prompting with Pseudo-Code Instructions
Mayank Mishra
Prince Kumar
Riyaz Ahmad Bhat
V. Rudramurthy
Danish Contractor
Srikanth G. Tamilselvam
42
13
0
19 May 2023
How Good are Commercial Large Language Models on African Languages?
How Good are Commercial Large Language Models on African Languages?
Jessica Ojo
Kelechi Ogueji
24
5
0
11 May 2023
LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits
  Siamese-BLOOM
LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits Siamese-BLOOM
Wenhui Hua
Brian Williams
Davood Shamsi
26
3
0
10 May 2023
Emergent and Predictable Memorization in Large Language Models
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
USVSN Sai Prashanth
Lintang Sutawika
Hailey Schoelkopf
Quentin G. Anthony
Shivanshu Purohit
Edward Raf
24
116
0
21 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature
  Review
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
30
40
0
07 Apr 2023
Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face
Christopher Akiki
Odunayo Ogundepo
Aleksandra Piktus
Xinyu Crystina Zhang
Akintunde Oladipo
Jimmy J. Lin
Martin Potthast
23
5
0
28 Feb 2023
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Zheng-Xin Yong
Hailey Schoelkopf
Niklas Muennighoff
Alham Fikri Aji
David Ifeoluwa Adelani
...
Genta Indra Winata
Stella Biderman
Edward Raff
Dragomir R. Radev
Vassilina Nikoulina
CLL
VLM
AI4CE
LRM
27
81
0
19 Dec 2022
Large Language Models Struggle to Learn Long-Tail Knowledge
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal
H. Deng
Adam Roberts
Eric Wallace
Colin Raffel
RALM
KELM
36
381
0
15 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
104
2,309
0
09 Nov 2022
Can large language models reason about medical questions?
Can large language models reason about medical questions?
Valentin Liévin
C. Hother
Andreas Geert Motzfeldt
Ole Winther
ELM
LM&MA
AI4MH
LRM
19
299
0
17 Jul 2022
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Zaid Alyafeai
Maraim Masoud
Mustafa Ghaleb
Maged S. Al-Shaibani
36
25
0
13 Oct 2021
Are Multilingual Models the Best Choice for Moderately Under-resourced
  Languages? A Comprehensive Assessment for Catalan
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé
C. Carrino
Carlos Rodríguez-Penagos
Ona de Gibert Bonet
Carme Armentano-Oller
Aitor Gonzalez-Agirre
Maite Melero
Marta Villegas
60
42
0
16 Jul 2021
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
592
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,989
0
31 Dec 2020
Stanza: A Python Natural Language Processing Toolkit for Many Human
  Languages
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi
Yuhao Zhang
Yuhui Zhang
Jason Bolton
Christopher D. Manning
AI4TS
201
1,653
0
16 Mar 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
231
4,469
0
23 Jan 2020
1