ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.03919
  4. Cited By
Data Portraits: Recording Foundation Model Training Data

Data Portraits: Recording Foundation Model Training Data

6 March 2023
Marc Marone
Benjamin Van Durme
ArXivPDFHTML

Papers citing "Data Portraits: Recording Foundation Model Training Data"

13 / 13 papers shown
Title
Certified Mitigation of Worst-Case LLM Copyright Infringement
Certified Mitigation of Worst-Case LLM Copyright Infringement
Jingyu Zhang
Jiacan Yu
Marc Marone
Benjamin Van Durme
Daniel Khashabi
MoMe
141
0
0
22 Apr 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
90
12
0
31 Dec 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
59
3
0
24 Oct 2024
AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
Han Wang
Archiki Prasad
Elias Stengel-Eskin
Mohit Bansal
82
5
0
11 Sep 2024
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data
Jingyu Zhang
Marc Marone
Tianjian Li
Benjamin Van Durme
Daniel Khashabi
93
9
0
05 Apr 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu
Sewon Min
Luke Zettlemoyer
Yejin Choi
Hannaneh Hajishirzi
45
50
0
30 Jan 2024
Oasis: Data Curation and Assessment System for Pretraining of Large
  Language Models
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou
Yubo Chen
Pengfei Cao
Kang Liu
Jun Zhao
Shengping Liu
26
3
0
21 Nov 2023
Cognitive Mirage: A Review of Hallucinations in Large Language Models
Cognitive Mirage: A Review of Hallucinations in Large Language Models
Hongbin Ye
Tong Liu
Aijia Zhang
Wei Hua
Weiqiang Jia
HILM
42
76
0
13 Sep 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers
A. Luccioni
50
19
0
14 Aug 2023
Stop Uploading Test Data in Plain Text: Practical Strategies for
  Mitigating Data Contamination by Evaluation Benchmarks
Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks
Alon Jacovi
Avi Caciularu
Omer Goldman
Yoav Goldberg
17
95
0
17 May 2023
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
592
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
253
1,989
0
31 Dec 2020
Extracting Training Data from Large Language Models
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
290
1,814
0
14 Dec 2020
1