Many situations require the simultaneous processing of auditory and visual information; however, stimuli presented to one sensory modality can sometimes interfere with processing in a second sensory ...
Abstract: This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video ...
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on its modality-aligned setting, i.e., the audio and visual modalities are both assumed ...
The Colavita visual dominance effect refers to the phenomenon in which participants tend to respond only or preferentially to the visual component of a bimodal audiovisual stimulus. Previous evidence has indicated that ...
This paper tackles the domain of multimodal prompting for visual recognition, specifically when dealing with missing modalities through multimodal Transformers. It presents two main contributions: (i) ...
Medical Visual Question Answering (Med-VQA) aims to combine medical image understanding with clinical language reasoning, enabling automatic answering of natural language questions grounded on medical ...
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that match a given textual query only partially. This task is inherently challenging due to the modality gap between text ...