Abstract: Vision-language models have the potential to enrich purely visual tasks by utilizing the combined representation of images/videos and corresponding textual descriptions. Recent advances in ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results