ResearchTrend.AI
arXiv:2405.16241
FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference

25 May 2024
Chenqi Lin
Tianshi Xu
Zebin Yang
Runsheng Wang
Ru Huang
Meng Li
Abstract

With the fast evolution of large language models (LLMs), privacy concerns over user queries arise, as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as an HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe that the overhead mainly stems from neglecting 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low-bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than 4.3×, 2.7×, and 1.3× latency reduction, respectively, and more than 75.7×, 60.2×, and 20.2× communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.
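The first observation above can be illustrated in plaintext: an embedding table query is a matrix-vector product with a one-hot vector, so almost all of the multiplications a generic HE matvec protocol performs are wasted. The sketch below is a toy plaintext illustration only (hypothetical dimensions; the paper targets LLM-scale tables such as LLaMA-7B's 32000 × 4096), not the FastQuery protocol itself.

```python
# Toy illustration of why a one-hot query makes a generic matvec wasteful.
# Dimensions are hypothetical; FastQuery operates on encrypted data at LLM scale.
vocab_size, hidden_dim = 8, 4

# A small integer table standing in for a low-bit-width quantized embedding table.
table = [[(r * hidden_dim + c) % 7 - 3 for c in range(hidden_dim)]
         for r in range(vocab_size)]

token_id = 3
one_hot = [1 if i == token_id else 0 for i in range(vocab_size)]

# Generic formulation: the full matrix-vector product that an HE-based
# matvec protocol would evaluate over ciphertexts.
dense_result = [sum(one_hot[r] * table[r][c] for r in range(vocab_size))
                for c in range(hidden_dim)]

# One-hot-aware view: the product is simply a row selection, so nearly all
# of the multiplications (and the ciphertext traffic carrying them) are
# spent on rows that contribute nothing.
sparse_result = table[token_id]

assert dense_result == sparse_result
```

Exploiting this structure under encryption is nontrivial, since the server must not learn which row is selected; FastQuery's packing algorithm is designed to keep that obliviousness while avoiding the dense-matvec cost.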
