Welcome to the workshop on Computer Vision in the Wild (CVinW)

Event: ECCV 2022 Workshop on Computer Vision in the Wild · Duration: 11 min

Abstract

Jianfeng Gao delivers the opening remarks for the Computer Vision in the Wild (CVinW) workshop at ECCV 2022. He discusses the fundamental question of how to measure intelligence in AI, contrasting skill on a task with skill acquisition efficiency. Gao highlights the recent shift from task-specific computer vision models to foundation models as a step toward human-level intelligence. He introduces Florence, Microsoft’s unified foundation model for computer vision, and emphasizes the workshop’s focus on detecting diverse object categories in real-world applications. The presentation concludes with an overview of the collaborative community effort behind the workshop and its detailed agenda.

Speakers

Jianfeng Gao — Microsoft Research

Talks (1)

0:53 — Jianfeng Gao: Opening Remark: Welcome to the workshop on Computer Vision in the Wild (CVinW)
- Jianfeng Gao introduces the CVinW workshop, discusses intelligence measurement, the evolution from task-specific to foundation models, and highlights Microsoft’s Florence model, concluding with an overview of the workshop’s agenda.

Key Takeaways

An AI system’s intelligence can be measured in two ways: skill on a task, where AI often surpasses humans, and skill acquisition efficiency, where humans still excel.
The field is moving from numerous task-specific computer vision models to unified foundation models that can adapt to a wide range of downstream tasks.
Foundation models like Microsoft’s Florence leverage massive datasets and self-supervision to unify diverse computer vision tasks through language.
The Computer Vision in the Wild (CVinW) workshop aims to benchmark state-of-the-art foundation vision models for detecting a vast array of object categories in real-world scenarios.
The workshop is a collaborative community effort involving various academic and industry partners.
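
To make the takeaway about unifying vision tasks through language more concrete, here is a minimal sketch of CLIP-style zero-shot classification: an image embedding is compared against embeddings of text prompts, and the closest prompt supplies the label. This is an illustration under stated assumptions, not Florence's actual implementation. The 3-dimensional vectors are hand-written placeholders standing in for the outputs of real image and text encoders, and the names `zero_shot_classify`, `TEXT_EMBEDDINGS`, and `IMAGE_EMBEDDING` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt closest to the image embedding, plus all scores."""
    scores = {
        prompt: cosine(image_embedding, embedding)
        for prompt, embedding in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Assumption: toy vectors in place of real text-encoder outputs.
TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Assumption: toy vector in place of a real image-encoder output.
IMAGE_EMBEDDING = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(IMAGE_EMBEDDING, TEXT_EMBEDDINGS)
print(label)
```

Because the categories are expressed as text prompts, adding a new category in this sketch only means adding a new prompt embedding, with no retraining of a task-specific classifier head.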

Methods / Models / Datasets Mentioned

GPT-3
Turing
CLIP
ALIGN
Florence
Vision Transformer (ViT)
Convolutional Neural Networks (CNNs)
Fast Region-based Convolutional Neural Network (Fast R-CNN)

Topics

Artificial Intelligence (AI) · Machine Intelligence Measurement · Skill Acquisition Efficiency · Task-Specific Computer Vision Models · Foundation Models · Multimodal Intelligence · Self-Supervision · Computer Vision in the Wild (CVinW) · Object Detection · Image Classification · Vision-Language Models · AI Benchmarking

Notes

Open for commentary — connections to other work, critiques, follow-up reading.