ResearchTrend.AI

arXiv:2410.04201
IT³: Idempotent Test-Time Training

5 October 2024
N. Durasov
Assaf Shocher
Doruk Öner
Gal Chechik
Alexei A. Efros
Pascal Fua
Topics: OOD, VLM
Abstract

This paper introduces Idempotent Test-Time Training (IT³), a novel approach to addressing the challenge of distribution shift. While supervised-learning methods assume matching train and test distributions, this is rarely the case for machine learning systems deployed in the real world. Test-Time Training (TTT) approaches address this by adapting models during inference, but they are limited by a domain-specific auxiliary task. IT³ is instead based on the universal property of idempotence. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, that is, f(f(x)) = f(x). During training, the model receives an input x along with another signal that can be either the ground-truth label y or a neutral "don't know" signal 0. At test time, the additional signal can only be 0. When sequentially applying the model, first predicting y_0 = f(x, 0) and then y_1 = f(x, y_0), the distance between y_0 and y_1 measures certainty and, when high, indicates an out-of-distribution input x. We use this distance, which can be expressed as ||f(x, f(x, 0)) - f(x, 0)||, as our TTT loss during inference. By carefully optimizing this objective, we effectively train f(x, ·) to be idempotent, projecting the internal representation of the input onto the training distribution. We demonstrate the versatility of our approach across various tasks, including corrupted image classification, aerodynamic predictions, tabular data with missing information, age prediction from faces, and large-scale aerial photo segmentation. These tasks span different architectures, including MLPs, CNNs, and GNNs.
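The test-time loop the abstract describes can be sketched in a few lines. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: it uses a toy scalar model f(x, y) = w1·x + w2·y in place of a neural network, finite-difference gradients in place of backpropagation, and made-up names (`w`, `lr`, `tt_step`). It only shows the mechanics: compute y_0 = f(x, 0), then y_1 = f(x, y_0), and take gradient steps on ||y_1 - y_0||².

```python
# Toy sketch of the IT^3 test-time loop (illustrative names, not the paper's code).

def f(x, y, w):
    """Toy model taking input x and an auxiliary signal y (0 means "don't know")."""
    w1, w2 = w
    return w1 * x + w2 * y

def idempotence_loss(x, w):
    """Squared idempotence distance ||f(x, f(x, 0)) - f(x, 0)||^2 for a scalar output."""
    y0 = f(x, 0.0, w)   # first pass, conditioned on the neutral signal
    y1 = f(x, y0, w)    # second pass, conditioned on the first prediction
    return (y1 - y0) ** 2

def tt_step(x, w, lr=0.01, eps=1e-6):
    """One test-time gradient step on the loss, via central finite differences."""
    grads = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        grads.append((idempotence_loss(x, wp) - idempotence_loss(x, wm)) / (2 * eps))
    return [wi - lr * gi for wi, gi in zip(w, grads)]

x = 2.0            # a single test input
w = [0.5, 0.8]     # toy "pretrained" parameters
before = idempotence_loss(x, w)
for _ in range(50):
    w = tt_step(x, w)
after = idempotence_loss(x, w)
# the gap between the two sequential predictions shrinks as f(x, .) becomes idempotent
```

In the paper's setting f would be a trained network (MLP, CNN, or GNN) and the step would use ordinary backpropagation; the loss value before adaptation also serves as the certainty signal, with a large gap flagging an out-of-distribution input.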
