We introduce the first model-stealing attack that extracts precise,
nontrivial information from black-box production language models like OpenAI's
ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding
projection layer (up to symmetries) of a transformer model, given typical API
access. For under $20 USD, our attack extracts the entire projection matrix of
OpenAI's Ada and Babbage language models. We thereby confirm, for the first
time, that these black-box models have a hidden dimension of 1024 and 2048,
respectively. We also recover the exact hidden dimension size of the
gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to
recover the entire projection matrix. We conclude with potential defenses and
mitigations, and discuss the implications of possible future work that could
extend our attack.
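The core observation behind recovering the hidden dimension can be sketched with a toy simulation: a transformer's output logits are a linear function of the final hidden state, so a matrix of logit vectors collected over many queries has rank at most the hidden dimension, which the singular-value spectrum reveals. The sizes and threshold below are illustrative assumptions, not the paper's actual parameters, and this omits the practicalities of extracting full logit vectors from a real API.

```python
import numpy as np

# Toy simulation, assuming the API exposes full logit vectors.
# Logits are W @ h, where W is the (vocab x d) embedding projection
# and h the d-dimensional hidden state, so stacked logit vectors
# form a matrix of rank at most d.
rng = np.random.default_rng(0)
vocab, d, n_prompts = 5000, 64, 200  # hypothetical sizes

W = rng.normal(size=(vocab, d))      # secret projection matrix
H = rng.normal(size=(d, n_prompts))  # hidden states for n_prompts queries
logits = W @ H                       # what the queries would expose

# Count singular values above numerical noise: that count is the
# hidden dimension d.
s = np.linalg.svd(logits, compute_uv=False)
recovered_d = int(np.sum(s > 1e-6 * s[0]))
print(recovered_d)
```

With generic (random) hidden states, the logit matrix has rank exactly `d`, so the count of non-negligible singular values recovers the hidden dimension without any access to the model's weights.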