LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

11 July 2022
Jinbin Bai, Chunhui Liu, Feiyue Ni, Haofan Wang, Mengying Hu, Xiaofeng Guo, Lele Cheng
arXiv:2207.04858
Abstract

Video-text retrieval is a class of cross-modal representation learning problems, where the goal is to select, from a pool of candidate videos, the video that corresponds to a given text query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, demonstrating the power of a joint latent space. Despite this, the intrinsic divergence between the visual and textual domains is still far from being eliminated, and projecting different modalities into a joint latent space may distort the information within each single modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without the need for a joint latent space, which bridges the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both forward translations from $\mathcal{S}$ to the predicted target space $\mathcal{T}'$, and backward translations from $\mathcal{T}'$ back to $\mathcal{S}$. Extensive experiments conducted on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
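To make the cycle-consistency idea concrete, here is a minimal PyTorch sketch of forward and backward latent translation with a round-trip reconstruction loss. This is not the authors' implementation: the MLP translator architecture, the MSE distance, and the embedding dimension are all assumptions for illustration.

import torch
import torch.nn as nn

class LatentTranslator(nn.Module):
    """Maps embeddings from one modality space to another.
    A simple MLP stands in for whatever architecture the paper uses (an assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.net(x)

def cycle_consistency_loss(s, f, g):
    """f translates S -> T' (forward); g translates T' -> S (backward).
    Penalizes the distance between s and its round-trip reconstruction."""
    t_pred = f(s)        # forward translation into the predicted target space
    s_recon = g(t_pred)  # backward translation back into the source space
    return nn.functional.mse_loss(s_recon, s)  # MSE is an assumed choice of distance

# Usage with dummy source-modality embeddings (batch of 8, 512-dim assumed)
dim = 512
f = LatentTranslator(dim)  # S -> T'
g = LatentTranslator(dim)  # T' -> S
source_emb = torch.randn(8, dim)
loss = cycle_consistency_loss(source_emb, f, g)
loss.backward()

In practice this cycle term would be combined with a retrieval objective; the point of the sketch is only the structure of the forward/backward translation pair.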
