INDUS: Effective and Efficient Language Models for Scientific Applications

17 May 2024 · arXiv:2405.10725
Bishwaranjan Bhattacharjee
Aashka Trivedi
Masayasu Muraoka
Muthukumaran Ramasubramanian
Takuma Udagawa
I. Gurung
Rong Zhang
Bharath Dandala
Rahul Ramachandran
M. Maskey
Kayleen Bugbee
Mike Little
Elizabeth Fancher
Lauren M Sanders
Sylvain Costes
Sergi Blanco-Cuaresma
Kelly E. Lockhart
Thomas Allen
Felix Grezes
Megan Ansdel
Alberto Accomazzi
Yousef El-Kurdi
Davis Wertheimer
Birgit Pfitzmann
Cesar Berrospi Ramis
Michele Dolfi
Rafael Teixeira de Lima
Panos Vagenas
S. K. Mukkavilli
Peter W. J. Staar
S. Vahidinia
Ryan McGranaghan
A. Mehrabian
Tsendgar Lee
Abstract

Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained on curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained with a domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained on a diverse set of datasets drawn from multiple sources to address information retrieval tasks, and (3) smaller versions of these models created using knowledge distillation techniques to address applications with latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as on existing benchmark tasks in the domains of interest.
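As a rough illustration of the contrastive-learning objective commonly used to train bi-encoder text embedding models for retrieval (the kind of model described in point (2) above), the sketch below computes an in-batch-negatives InfoNCE loss in PyTorch. The encoder outputs, pooling choice, and temperature here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of contrastive training with in-batch negatives (InfoNCE).
# Assumes query/passage embeddings come from some encoder; the temperature
# and embedding size are arbitrary illustrative choices.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim); row i of each forms a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                       # scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)    # diagonal entries are positives
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this example.
    torch.manual_seed(0)
    queries = torch.randn(8, 768)
    passages = torch.randn(8, 768)
    print(info_nce_loss(queries, passages).item())
```

With this objective, each query is trained to score its paired passage above every other passage in the batch, which is why larger batches typically give stronger retrieval embeddings.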
