Noise-Robust De-Duplication at Scale

9 October 2022
Emily Silcock
Luca D'Amico-Wong
Jinglin Yang
Melissa Dell
    SyDa

Papers citing "Noise-Robust De-Duplication at Scale"

16 / 16 papers shown
Technical Report: Quantifying and Analyzing the Generalization Power of a DNN
Yuxuan He
Junpeng Zhang
Lei Cheng
Hongyuan Zhang
Quanshi Zhang
AI4CE
26
0
0
11 May 2025
Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page
Michael McRae
AI4CE
41
0
0
17 Feb 2025
Mitigating Memorization In Language Models
Mansi Sakarvadia
Aswathy Ajith
Arham Khan
Nathaniel Hudson
Caleb Geniesse
Kyle Chard
Yaoqing Yang
Ian Foster
Michael W. Mahoney
KELM
MU
58
1
0
03 Oct 2024
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
Iker García-Ferrero
Alon Jacovi
Jonas Hanselle
Yanai Elazar
...
Yu-Min Tseng
Vishaal Udandarao
Zengzhi Wang
Ruijie Xu
Jinglin Yang
48
5
0
31 Jul 2024
News Deja Vu: Connecting Past and Present with Semantic Search
Brevin Franklin
Emily Silcock
Abhishek Arora
Tom Bryan
Melissa Dell
34
1
0
21 Jun 2024
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Hoyeon Chang
Jinho Park
Seonghyeon Ye
Sohee Yang
Youngkyung Seo
Du-Seong Chang
Minjoon Seo
KELM
37
32
0
17 Jun 2024
Newswire: A Large-Scale Structured Database of a Century of Historical News
Emily Silcock
Abhishek Arora
Luca D'Amico-Wong
Melissa Dell
AI4TS
GNN
37
3
0
13 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
58
36
0
26 May 2024
RETSim: Resilient and Efficient Text Similarity
Marina Zhang
Owen Vallis
Aysegul Bumin
Tanay Vakharia
Elie Bursztein
33
1
0
28 Nov 2023
LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abhishek Arora
Melissa Dell
KELM
36
8
0
02 Sep 2023
American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers
Melissa Dell
Jacob Carlson
Tom Bryan
Emily Silcock
Abhishek Arora
Zejiang Shen
Luca D'Amico-Wong
Q. Le
Pablo Querubin
Leander Heldring
AI4TS
28
12
0
24 Aug 2023
A Massive Scale Semantic Similarity Dataset of Historical English
Emily Silcock
Melissa Dell
39
5
0
30 Jun 2023
A Language Model of Java Methods with Train/Test Deduplication
Chia-Yi Su
Aakash Bansal
Vijayanta Jain
S. Ghanavati
Collin McMillan
SyDa
VLM
29
10
0
15 May 2023
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas
Kushal Tirumala
Daniel Simig
Surya Ganguli
Ari S. Morcos
25
162
0
16 Mar 2023
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
593
0
14 Jul 2021
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Nandan Thakur
Nils Reimers
Andreas Rucklé
Abhishek Srivastava
Iryna Gurevych
VLM
234
971
0
17 Apr 2021