Title
Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3 Su Zhang Ruyi An Yi Ding Cuntai Guan 19 28 0 24 Mar 2022
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization Alexander Kunitsyn M. Kalashnikov Maksim Dzabraev Andrei Ivaniuta 30 16 0 14 Mar 2022
Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios Germán Barquero Johnny Núnez Zhen Xu Sergio Escalera Wei-Wei Tu Isabelle M Guyon Cristina Palmero CVBM 45 12 0 07 Mar 2022
TRILLsson: Distilled Universal Paralinguistic Speech Representations Joel Shor Subhashini Venugopalan 25 37 0 01 Mar 2022
Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor fusion Masahiro Yasuda Yasunori Ohishi Shoichiro Saito N. Harada 38 13 0 18 Feb 2022
ADIMA: Abuse Detection In Multilingual Audio Vikram Gupta Rini A. Sharon Ramit Sawhney Debdoot Mukherjee 21 19 0 16 Feb 2022
Maximizing Audio Event Detection Model Performance on Small Datasets Through Knowledge Transfer, Data Augmentation, And Pretraining: An Ablation Study Daniel C. Tompkins Kshitiz Kumar Jian Wu 17 5 0 07 Feb 2022
Prediction of Neonatal Respiratory Distress in Term Babies at Birth from Digital Stethoscope Recorded Chest Sounds Ethan Grooby C. Sitaula K. Tan Lindsay Zhou Arrabella King Ashwin Ramanathan A. Malhotra G. Dumont F. Marzbanrad 10 4 0 25 Jan 2022
Action Keypoint Network for Efficient Video Recognition Xu Chen Yahong Han Xiaohan Wang Yifang Sun Yi Yang 3DPC 27 6 0 17 Jan 2022
Continual Transformers: Redundancy-Free Attention for Online Inference Lukas Hedegaard Arian Bakhtiarnia Alexandros Iosifidis CLL 27 11 0 17 Jan 2022
Sub-mW Keyword Spotting on an MCU: Analog Binary Feature Extraction and Binary Neural Networks G. Cerutti Lukas Cavigelli Renzo Andri Michele Magno Elisabetta Farella Luca Benini 26 14 0 10 Jan 2022
Sound and Visual Representation Learning with Multiple Pretraining Tasks A. Vasudevan Dengxin Dai Luc Van Gool SSL 33 6 0 04 Jan 2022
Towards Relatable Explainable AI with the Perceptual Process Wencan Zhang Brian Y. Lim AAML XAI 25 62 0 28 Dec 2021
Cross Modal Retrieval with Querybank Normalisation Simion-Vlad Bogolin Ioana Croitoru Hailin Jin Yang Liu Samuel Albanie 27 84 0 23 Dec 2021
Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding Tanay Agrawal Dhruv Agarwal Michal Balazia Neelabh Sinha F. Brémond ViT 17 14 0 22 Dec 2021
Tell me what you see: A zero-shot action recognition method based on natural language descriptions Valter Estevam Rayson Laroca David Menotti Hélio Pedrini 33 13 0 18 Dec 2021
Benchmarking Uncertainty Quantification on Biosignal Classification Tasks under Dataset Shift Tong Xia Jing Han Cecilia Mascolo OOD 24 10 0 16 Dec 2021
Computational bioacoustics with deep learning: a review and roadmap D. Stowell 32 235 0 13 Dec 2021
Overview of The MediaEval 2021 Predicting Media Memorability Task R. Kiziltepe M. Constantin C. Demarty Graham Healy Camilo Luciano Fosco ... S. Halder Bogdan Ionescu A. Matran-Fernandez Alan F. Smeaton Lorin Sweeney 21 13 0 11 Dec 2021
VocBench: A Neural Vocoder Benchmark for Speech Synthesis Ehab A. AlBadawy Andrew Gibiansky Qing He Jilong Wu Ming-Ching Chang Siwei Lyu 22 12 0 06 Dec 2021
Sound-Guided Semantic Image Manipulation Seung Hyun Lee Wonseok Roh Wonmin Byeon Sang Ho Yoon Chanyoung Kim Jinkyu Kim Sangpil Kim DiffM 27 43 0 30 Nov 2021
SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer Zhi-qin Ye Xiangdong Wang Hong Liu Yueliang Qian Ruijie Tao Long Yan Kazushige Ouchi ViT 24 2 0 30 Nov 2021
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter Bang-ju Yang Tong Zhang Yuexian Zou CLIP 25 20 0 30 Nov 2021
Masking Modalities for Cross-modal Video Retrieval Valentin Gabeur Arsha Nagrani Chen Sun Alahari Karteek Cordelia Schmid 19 29 0 01 Nov 2021
EfficientWord-Net: An Open Source Hotword Detection Engine based on One-shot Learning R. Chidhambararajan Aman Rangaur S. C. Sethuraman 14 4 0 31 Oct 2021
Physics-informed linear regression is competitive with two Machine Learning methods in residential building MPC Felix Bünning B. Huber Adrian Schalbetter Ahmed Aboudonia Mathias Hudoba de Badyn Philipp Heer Roy S. Smith John Lygeros AI4CE 22 65 0 29 Oct 2021
Detecting Dementia from Speech and Transcripts using Transformers Loukas Ilias D. Askounis J. Psarras 16 32 0 27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation Tanzila Rahman Mengyu Yang Leonid Sigal ViT 29 8 0 26 Oct 2021
DECAR: Deep Clustering for learning general-purpose Audio Representations Sreyan Ghosh Sandesh V Katta Ashish Seth S. Umesh SSL 36 12 0 17 Oct 2021
Taming Visually Guided Sound Generation Vladimir E. Iashin Esa Rahtu VLM 32 122 0 17 Oct 2021
Rank-based loss for learning hierarchical representations I. Nolasco D. Stowell 21 8 0 11 Oct 2021
Universal Paralinguistic Speech Representations Using Self-Supervised Conformers Joel Shor A. Jansen Wei Han Daniel S. Park Yu Zhang SSL AI4TS 43 54 0 09 Oct 2021
Aura: Privacy-preserving Augmentation to Improve Test Set Diversity in Speech Enhancement Xavier Gitiaux Aditya Khant Ebrahim Beyrami Chandan K. A. Reddy J. Gupchup Ross Cutler 22 0 0 08 Oct 2021
PHNNs: Lightweight Neural Networks via Parameterized Hypercomplex Convolutions Eleonora Grassucci Aston Zhang Danilo Comminiello 28 38 0 08 Oct 2021
SERAB: A multi-lingual benchmark for speech emotion recognition Neil Scheidwasser M. Kegler P. Beckmann Milos Cernak 32 44 0 07 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... Prateek Verma AI4TS 32 2 0 07 Oct 2021
Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection Zhi-qin Ye Xiangdong Wang Hong Liu Yueliang Qian Ruijie Tao Long Yan Kazushige Ouchi ViT 35 15 0 05 Oct 2021
Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning Jing Bi Jiebo Luo Chenliang Xu 76 48 0 05 Oct 2021
Hierarchical Multimodal Transformer to Summarize Videos Bin Zhao Maoguo Gong Xuelong Li ViT 30 55 0 22 Sep 2021
Audio Interval Retrieval using Convolutional Neural Networks I. Kuzminykh Dan Shevchuk S. Shiaeles Bogdan Ghita 23 7 0 21 Sep 2021
Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions D. Curto Albert Clapés Javier Selva Sorina Smeureanu Julio C. S. Jacques Junior ... G. Guilera D. Leiva T. Moeslund Sergio Escalera Cristina Palmero 46 29 0 20 Sep 2021
Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks Russell Sammut Bonnici C. Saitis Martin Benning GAN 30 15 0 05 Sep 2021
Audio-Visual Transformer Based Crowd Counting Usman Sajid Xiangyu Chen Hasan Sajid Taejoon Kim Guanghui Wang ViT 43 22 0 04 Sep 2021
Multi-modal Representation Learning for Video Advertisement Content Structuring Daya Guo Zhaoyang Zeng 27 4 0 04 Sep 2021
EarGate: Gait-based User Identification with In-ear Microphones Andrea Ferlini Dong Ma R. Harle Cecilia Mascolo 13 71 0 27 Aug 2021
Parsing Birdsong with Deep Audio Embeddings Irina Tolkova Brian Chu Marcel Hedman Stefan Kahl Holger Klinck 36 10 0 20 Aug 2021
Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering Donggeon Lee Seongho Choi Youwon Jang Byoung-Tak Zhang 16 2 0 11 Aug 2021
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers Chiori Hori Takaaki Hori Jonathan Le Roux 25 4 0 04 Aug 2021
Improving Music Performance Assessment with Contrastive Learning Pavan Seshadri Alexander Lerch 19 8 0 03 Aug 2021
Multimodal Feature Fusion for Video Advertisements Tagging Via Stacking Ensemble Qingsong Zhou Hai Liang Zhimin Lin Kele Xu 37 5 0 02 Aug 2021