Abstract: Vision-language models have the potential to enrich purely visual tasks by utilizing the combined representation of images/videos and corresponding textual descriptions. Recent advances in ...