Abstract: Estimating the camera’s pose given images from a single camera is a traditional task in mobile robots and autonomous vehicles. This problem is called monocular visual odometry and often ...
CLIP is one of the most important multimodal foundational models today. What powers CLIP’s capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, ...
Abstract: The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature ...
MASt3R-Fusion is a SLAM system that tightly integrates feed-forward pointmap regression with multi-sensor data (e.g., IMU, GNSS), drawing inspiration from MASt3R-SLAM. It is designed for practical, ...