
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
