Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding ...