arXiv:2206.12240

PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

24 June 2022
Sirui Liu
Jun Zhang
Haotian Chu
Min Wang
Boxin Xue
Ningxi Ni
Jialiang Yu
Yuhao Xie
Zhenyu Chen
Mengyun Chen
Yuan Liu
Piya Patra
Fan Xu
Jieping Chen
Zidong Wang
Lijiang Yang
Fan Yu
Lei Chen
Y. Gao
Abstract

Proteins are essential components of human life, and their structures are important for analyzing function and mechanism. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, existing open-source datasets fall far short of the needs of modern protein sequence-structure research. To address this problem, we present PSP, the first million-level protein structure prediction dataset with high coverage and diversity. The dataset consists of 570k true-structure sequences (10 TB) and 745k complementary distillation sequences (15 TB). We additionally provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of the dataset for training by participating in the CAMEO contest, in which our model won first place. We hope that the PSP dataset, together with the training benchmark, will enable a broader community of AI and biology researchers to pursue AI-driven protein research.
