Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

Papers citing "Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels"