Set2Seq Transformer: Learning Permutation Aware Set Representations of Artistic Sequences

6 August 2024

Athanasios Efthymiou

Abstract

We propose Set2Seq Transformer, a novel sequential multiple-instance architecture that learns to rank permutation-aware set representations of sequences. First, we show that learning temporal position-aware representations of discrete timesteps can greatly improve static visual multiple-instance learning methods, which disregard temporality and concentrate almost exclusively on visual content analysis. We further demonstrate the significant advantages of end-to-end sequential multiple-instance learning, which integrates visual content and temporal information in a multimodal manner. As an application, we focus on tasks related to fine art analysis. To that end, we show that Set2Seq Transformer can leverage visual set and temporal position-aware representations to model visual artists' oeuvres and predict artistic success. Finally, through extensive quantitative and qualitative evaluation on a novel dataset, WikiArt-Seq2Rank, and a visual learning-to-rank downstream task, we show that Set2Seq Transformer captures essential temporal information, improving on the performance of strong static and sequential multiple-instance learning methods for predicting artistic success.
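
The idea of a representation that ignores the order of items within a set but is aware of each set's temporal position can be illustrated with a minimal sketch. This is not the authors' architecture: mean pooling and sinusoidal position encodings are stand-ins chosen for illustration, and the function names `pool_set`, `positional_encoding`, and `encode_sequence` are hypothetical.

```python
import math

def pool_set(instances):
    """Mean-pool a set of feature vectors.

    The result does not depend on the order of the instances
    (math.fsum returns an exactly rounded, order-independent sum).
    Mean pooling is an assumption made here for illustration.
    """
    dim = len(instances[0])
    return [math.fsum(v[d] for v in instances) / len(instances) for d in range(dim)]

def positional_encoding(position, dim):
    """Sinusoidal encoding of a discrete timestep (an illustrative choice)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def encode_sequence(sets):
    """Return one position-aware vector per timestep.

    `sets` is a list of timesteps; each timestep is a set (list) of
    feature vectors, e.g. features of the artworks from one period.
    """
    encoded = []
    for t, instances in enumerate(sets):
        pooled = pool_set(instances)
        position = positional_encoding(t, len(pooled))
        encoded.append([p + e for p, e in zip(pooled, position)])
    return encoded
```

Under these assumptions, shuffling the items inside a timestep leaves the output unchanged, while swapping two timesteps changes it. A learned model would replace the fixed pooling with trainable layers and add a ranking objective on top of the sequence representation.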

@article{efthymiou2025_2408.03404,
  title={Set2Seq Transformer: Temporal and Positional-Aware Set Representations for Sequential Multiple-Instance Learning},
  author={Athanasios Efthymiou and Stevan Rudinac and Monika Kackovic and Nachoem Wijnberg and Marcel Worring},
  journal={arXiv preprint arXiv:2408.03404},
  year={2025}
}
