arXiv:2302.03505

OPORP: One Permutation + One Random Projection

7 February 2023
Ping Li
Xiaoyun Li
Abstract

Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D = 256 \sim 1024$ is common. In this paper, OPORP (one permutation + one random projection) uses a variant of the "count-sketch" type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0$, $E(r_i^2) = 1$, $E(r_i^3) = 0$, $E(r_i^4) = s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[(1-\rho^2)^2 - 2A\right]$, where $A \geq 0$ is a function of the data ($u, v$). This formula reveals several key properties: (1) We need $s = 1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is a substantial improvement compared with $\frac{1}{k}(1+\rho^2)$, which corresponds to the un-normalized estimator. We illustrate that by setting $k = 1$ in OPORP and repeating the procedure $m$ times, we exactly recover the work of "very sparse random projections" (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP.
In summary, the two key steps of OPORP, (i) the normalization and (ii) the fixed-length binning scheme, considerably improve the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
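
The procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `make_oporp` and `estimate_cosine` are hypothetical, the random vector $r$ is drawn uniformly from $\{-1, +1\}$ (one distribution that satisfies the stated moments with $s = 1$), the multiplication by $r$ is read as element-wise products that are then summed within each bin, and $k$ is assumed to divide $D$.

```python
import numpy as np


def make_oporp(D, k, seed=0):
    """Return a function that maps a D-dim vector to k normalized samples.

    Hypothetical helper for illustration. The same permutation and the
    same random vector r are shared by every vector that is sketched.
    """
    if D % k != 0:
        raise ValueError("this sketch assumes k divides D")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(D)                            # one permutation
    r = rng.choice(np.array([-1.0, 1.0]), size=D)        # assumption: +/-1 entries, so s = 1

    def sketch(x):
        x = np.asarray(x, dtype=float)
        z = x[perm] * r                                  # permute, then multiply by r
        samples = z.reshape(k, D // k).sum(axis=1)       # k equal-length bins, summed
        norm = np.linalg.norm(samples)
        # Normalize the k samples to unit l2 norm (left as-is if all zero).
        return samples / norm if norm > 0 else samples

    return sketch


def estimate_cosine(sketch_u, sketch_v):
    """Normalized estimator: inner product of two unit-norm sketches."""
    return float(np.dot(sketch_u, sketch_v))


# Small demonstration on synthetic data (values are illustrative only).
demo_rng = np.random.default_rng(1)
u = demo_rng.standard_normal(1024)
v = 0.8 * u + 0.6 * demo_rng.standard_normal(1024)

oporp = make_oporp(D=1024, k=256)
print("estimated cosine:", estimate_cosine(oporp(u), oporp(v)))
print("exact cosine:    ", float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
```

With $k = D$ every bin holds a single coordinate, so this sketch reproduces the exact cosine, which agrees with the factor $\frac{D-k}{D-1}$ vanishing in the variance expression when $s = 1$.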
