Abstract: Remote-sensing (RS) images present unique challenges for computer vision (CV) due to lower resolution, smaller objects, and fewer features. Mainstream backbone networks show promising ...
This is the repo for the Video-LLaMA project, which aims to give large language models video and audio understanding capabilities. Video-LLaMA is built on top of BLIP-2 and MiniGPT-4.