Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

22 June 2014

Li Fei-Fei

Papers citing "Deep Fragment Embeddings for Bidirectional Image Sentence Mapping"

50 / 153 papers shown

Title
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval Guanqi Zhan Yuanpei Liu Kai Han Weidi Xie Andrew Zisserman VLM 252 0 0 21 Feb 2025
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents Boyu Gou Ruohan Wang Boyuan Zheng Yanan Xie Cheng Chang Yiheng Shu Huan Sun Yu Su LM&Ro LLMAG 84 57 0 07 Oct 2024
Towards Deconfounded Image-Text Matching with Causal Inference Wenhui Li Xinqi Su Dan Song Lanjun Wang Kun Zhang An-An Liu BDL CML 53 10 0 22 Aug 2024
Object Recognition as Next Token Prediction Kaiyu Yue Borchun Chen Jonas Geiping Hengduo Li Tom Goldstein Ser-Nam Lim 40 9 0 04 Dec 2023
Write What You Want: Applying Text-to-video Retrieval to Audiovisual Archives Yuchen Yang VGen 27 7 0 09 Oct 2023
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features Hila Levi Guy Heller Dan Levi Ethan Fetaya OCL VLM 32 3 0 26 Sep 2023
Towards General Visual-Linguistic Face Forgery Detection Ke Sun Shen Chen Taiping Yao Haozhe Yang Xiaoshuai Sun Shouhong Ding Rongrong Ji 28 12 0 31 Jul 2023
Evaluating Pragmatic Abilities of Image Captioners on A3DS Polina Tsvilodub Michael Franke EGVM 25 3 0 22 May 2023
Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning Mozhgan Pourkeshavarz Shahabedin Nabavi Mohsen Moghaddam M. Shamsfard 31 4 0 08 Feb 2023
OPORP: One Permutation + One Random Projection Ping Li Xiaoyun Li 44 8 0 07 Feb 2023
Using Multiple Instance Learning to Build Multimodal Representations Peiqi Wang W. Wells Seth Berkowitz Steven Horng Polina Golland SSL 26 6 0 11 Dec 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding Siyi Liu Yaoyuan Liang Feng Li Shijia Huang Hao Zhang Hang Su Jun Zhu Lei Zhang ObjD 50 26 0 28 Nov 2022
Unsupervised Audio-Visual Lecture Segmentation Darshan Singh Anchit Gupta C. V. Jawahar Makarand Tapaswi VOS 24 4 0 29 Oct 2022
Weakly Supervised Face Naming with Symmetry-Enhanced Contrastive Loss Tingyu Qu Tinne Tuytelaars Marie-Francine Moens CVBM 24 4 0 17 Oct 2022
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval Xuri Ge Fuhai Chen Songpei Xu Fuxiang Tao J. Jose 30 26 0 17 Oct 2022
Content-Based Search for Deep Generative Models Daohan Lu Sheng-Yu Wang Nupur Kumari Rohan Agarwal Mia Tang David Bau Jun-Yan Zhu DiffM SyDa 40 5 0 06 Oct 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding Zihan Ding Zixiang Ding Tianrui Hui Junshi Huang Xiaoming Wei Xiaolin K. Wei Si Liu 25 12 0 11 Aug 2022
Towards Multimodal Vision-Language Models Generating Non-Generic Text Wes Robbins Zanyar Zohourianshahzadi Jugal Kalita 14 1 0 09 Jul 2022
ZoDIAC: Zoneout Dropout Injection Attention Calculation Zanyar Zohourianshahzadi Jugal Kalita 36 0 0 28 Jun 2022
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo Yiran Luo Pratyay Banerjee Tejas Gokhale Yezhou Yang Chitta Baral 32 4 0 30 Mar 2022
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization Alexander Kunitsyn M. Kalashnikov Maksim Dzabraev Andrei Ivaniuta 30 16 0 14 Mar 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey Xiangru Zhu Zhixu Li Xiaodan Wang Xueyao Jiang Penglei Sun Xuwu Wang Yanghua Xiao N. Yuan 41 154 0 11 Feb 2022
An Integrated Approach for Video Captioning and Applications Soheyla Amirian T. Taha Khaled Rasheed H. Arabnia 31 1 0 23 Jan 2022
Neural Attention for Image Captioning: Review of Outstanding Methods Zanyar Zohourianshahzadi Jugal Kalita VLM 35 45 0 29 Nov 2021
Masking Modalities for Cross-modal Video Retrieval Valentin Gabeur Arsha Nagrani Chen Sun Alahari Karteek Cordelia Schmid 19 29 0 01 Nov 2021
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation Aditya Sanghi Hang Chu Joseph G. Lambourne Ye Wang Chin-Yi Cheng Marco Fumero Kamal Rahimi Malekshan CLIP 60 289 0 06 Oct 2021
Neural Twins Talk & Alternative Calculations Zanyar Zohourianshahzadi Jugal Kalita 25 0 0 05 Aug 2021
Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval Xuri Ge Fuhai Chen J. Jose Zhilong Ji Zhongqin Wu Xiao-Chang Liu 34 55 0 05 Aug 2021
From Show to Tell: A Survey on Deep Learning-based Image Captioning Matteo Stefanini Marcella Cornia Lorenzo Baraldi S. Cascianelli G. Fiameni Rita Cucchiara 3DV VLM MLLM 69 256 0 14 Jul 2021
Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching between Parts and Words Chuan Tang Xi Yang Bojian Wu Zhizhong Han Yi Chang 3DPC 35 13 0 05 Jul 2021
Linguistic Structures as Weak Supervision for Visual Scene Graph Generation Keren Ye Adriana Kovashka 29 52 0 28 May 2021
Exploring Visual Engagement Signals for Representation Learning Menglin Jia Zuxuan Wu A. Reiter Claire Cardie Serge Belongie Ser-Nam Lim 21 13 0 15 Apr 2021
Visual Goal-Step Inference using wikiHow Yue Yang Artemis Panagopoulou Qing Lyu Li Zhang Mark Yatskar Chris Callison-Burch 39 41 0 12 Apr 2021
A Joint Network for Grasp Detection Conditioned on Natural Language Commands Yiye Chen Ruinian Xu Yunzhi Lin Patricio A. Vela 41 46 0 01 Apr 2021
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding Yongfei Liu Bo Wan Lin Ma Xuming He ObjD 24 56 0 24 Mar 2021
VLGrammar: Grounded Grammar Induction of Vision and Language Yining Hong Qing Li Song-Chun Zhu Siyuan Huang VLM 30 25 0 24 Mar 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval Maksim Dzabraev M. Kalashnikov Stepan Alekseevich Komkov Aleksandr Petiushko 24 128 0 19 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision Alec Radford Jong Wook Kim Chris Hallacy Aditya A. Ramesh Gabriel Goh ... Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger Ilya Sutskever CLIP VLM 191 27,929 0 26 Feb 2021
Probabilistic Embeddings for Cross-Modal Retrieval Sanghyuk Chun Seong Joon Oh Rafael Sampaio de Rezende Yannis Kalantidis Diane Larlus UQCV 415 203 0 13 Jan 2021
Deep Bayesian Active Learning, A Brief Survey on Recent Advances S. Mohamadi H. Amindavar BDL UQCV 35 17 0 15 Dec 2020
Image Captioning with Context-Aware Auxiliary Guidance Zeliang Song Xiaofei Zhou Zhendong Mao Jianlong Tan 36 31 0 10 Dec 2020
Robust Image Captioning Daniel Yarnell Xian Wang 21 0 0 06 Dec 2020
Improving Clinical Outcome Predictions Using Convolution over Medical Entities with Multimodal Learning Batuhan Bardak Mehmet Tan 31 35 0 24 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning Simon Ging Mohammadreza Zolfaghari Hamed Pirsiavash Thomas Brox ViT CLIP 31 169 0 01 Nov 2020
Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging G. Liang Connor Greenwell Yu Zhang Xiaoqin Wang Ramakanth Kavuluru Nathan Jacobs 42 21 0 06 Oct 2020
Commands 4 Autonomous Vehicles (C4AV) Workshop Summary Thierry Deruyttere Simon Vandenhende Dusan Grujicic Yu Liu Luc Van Gool Matthew Blaschko Tinne Tuytelaars Marie-Francine Moens 30 6 0 18 Sep 2020
Cosine meets Softmax: A tough-to-beat baseline for visual grounding N. Rufus U. R. Nair K. M. Krishna Vineet Gandhi 27 13 0 13 Sep 2020
The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval Giuseppe Amato Paolo Bolettieri F. Carrara Franca Debole Fabrizio Falchi Claudio Gennaro Lucia Vadicamo Claudio Vairo 20 17 0 06 Aug 2020
Efficient Urdu Caption Generation using Attention based LSTM Inaam Ilahi Hafiz Muhammad Abdullah Zia Ahtazaz Ehsan Rauf Tabassam Armaghan Ahmed VLM 24 2 0 02 Aug 2020
Multi-modal Transformer for Video Retrieval Valentin Gabeur Chen Sun Alahari Karteek Cordelia Schmid ViT 430 596 0 21 Jul 2020