DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain

Abstract

Large Language Models (LLMs) have achieved impressive performance on diverse natural language processing tasks, but specialized domains such as Web3 present new challenges and require more tailored evaluation. Despite the significant user base and capital flows in Web3, encompassing smart contracts, decentralized finance (DeFi), non-fungible tokens (NFTs), decentralized autonomous organizations (DAOs), on-chain governance, and novel token economics, no comprehensive benchmark has systematically assessed LLM performance in this domain. To address this gap, we introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields: fundamental blockchain concepts, blockchain infrastructure, smart contracts, DeFi mechanisms, DAOs, NFTs, token economics, meme concepts, and security vulnerabilities. Beyond multiple-choice questions, DMind Benchmark features domain-specific tasks such as contract debugging and on-chain numeric reasoning, mirroring real-world scenarios. We evaluated 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen, uncovering notable performance gaps in specialized areas such as token economics and security-critical contract analysis. While some models excel at blockchain infrastructure tasks, advanced subfields remain challenging. Our benchmark dataset and evaluation pipeline are open-sourced at this https URL, and the dataset reached number one in Hugging Face's trending dataset charts within a week of release.
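
Since the dataset is hosted on Hugging Face, it can presumably be loaded with the Hugging Face datasets library. Below is a minimal sketch in Python, assuming a repository id of "DMindAI/DMind_Benchmark" (a hypothetical identifier used for illustration; substitute the id linked from the project page).

    from datasets import load_dataset

    # Hypothetical repository id for illustration; the actual location is
    # given by the paper's open-source link above.
    benchmark = load_dataset("DMindAI/DMind_Benchmark")

    # Inspect the available splits and features before running an evaluation.
    print(benchmark)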

@article{huang2025_2504.16116,
  title={DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain},
  author={Enhao Huang and Pengyu Sun and Zixin Lin and Alex Chen and Joey Ouyang and Hobert Wang and Dong Dong and Gang Zhao and James Yi and Frank Li and Ziang Ling and Lowes Yang},
  journal={arXiv preprint arXiv:2504.16116},
  year={2025}
}