VizWiz Grand Challenge: Opening Remarks

Event: CVPR 2024 Workshop · Duration: 240 min

Abstract

The workshop opens with remarks for the VizWiz Grand Challenge: motivating questions, workshop goals, and introductions to the organizers and invited speakers. Results follow for the Visual Question Answering, Answer Grounding, and Single Answer Grounding challenges, with the winning teams (SLCV and MGTV) presenting their methodologies and results.

Two invited talks come next. Soravit Changpinyo speaks on elevating the role of language in vision research, focusing on specificity, scene-text understanding, interrogative language, and multilinguality. Raul Puri then covers challenges in deploying omnimodels and assistive technology, including a live demo of GPT-4o’s capabilities. A Q&A session with OpenAI researchers Raul Puri and Rowan Zellers discusses challenges in multimodal AI, including balancing flexibility with safety, addressing data bias, and deployment considerations.

Following the Q&A, seven “VizWiz Poster Spotlight Talks” showcase research on Visual Question Answering (VQA) and image accessibility. Topics include robust VQA using segmentation and cross-attention, leveraging large vision-language models, multimodal learning for VQA, navigating altered visual inputs with multimodal LLMs, making comics accessible, recognizing visual questions with multiple answer groundings, and enhancing zero-shot image classification. The spotlights end with a call to attend the full poster session.

Elisa Kreiss (UCLA) then presents on human-centered AI for nonvisual accessibility, focusing on image-based text generation. She highlights the importance of considering both the communicative goal of the text and the communicative goal of the image, demonstrating how context influences the utility of descriptions and captions. The presentation also critiques current automatic evaluation metrics for their context-independence. Following the talk, an open Q&A panel with Rowan Zellers, Brian Fischler, Raul Puri, Elisa Kreiss, and Soravit Changpinyo turns to broader questions of AI accessibility, including data collection challenges, ethical considerations, personalization, and the need for interdisciplinary approaches to ensure inclusive technology development.

Speakers

Danna Gurari — University of Colorado Boulder
Chongyan Chen — University of Texas at Austin
Soravit Changpinyo — Google DeepMind
Raul Puri — OpenAI
Rowan Zellers — OpenAI
Sangbeom Lee — Kyungpook National University
Bao-Hiep Le
Heegwang Kim
Yuvanshu Agarwal — Carnegie Mellon University
Ragav Sachdeva — University of Oxford
Dai Quoc Tran — Sungkyunkwan University
Jialong Zuo — School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Elisa Kreiss — UCLA
Brian Fischler

Talks (18)

00:00:00 — Danna Gurari: VizWiz Grand Challenge: Opening Remarks
- Danna Gurari provides opening remarks for the VizWiz Grand Challenge, outlining motivating questions, workshop goals, introducing organizers and invited speakers, and presenting the dataset challenges and winners.
00:05:10 — Chongyan Chen: VizWiz Grand Challenge: Algorithms to Interpret Visual Information for People who are Blind
- Chongyan Chen presents an overview of the VizWiz Grand Challenge, detailing the Visual Question Answering (VQA), Answer Grounding, and Single Answer Grounding tasks, highlighting progress, difficult examples, and announcing the winning teams for each challenge.
00:07:04 — SLCV Team: Lora Fine-Tuning For VizWiz Grand Challenge
- The SLCV team presents their winning solution for the VizWiz-VQA challenge, detailing their methodology using LoRA+ fine-tuning with DeepSeek-VL and CogAgent models, and showcasing their results across various VQA tasks.
00:12:40 — MGTV Team: A Transformer based solution in Answer Grounding
- The MGTV team presents their transformer-based solution for the VizWiz VQA Grounding Challenge, outlining their method using EVA-CLIP-ViT and RoBERTa encoders, training details with binary cross-entropy loss, and post-processing techniques to achieve their winning score.
00:17:40 — MGTV Team: A Transformer based solution in VQA-AnswerTherapy
- The MGTV team presents their transformer-based solution for the VQA-AnswerTherapy challenge, detailing their model architecture based on BLIP2, training methods with binary cross-entropy loss, and post-processing techniques including hard ensemble and test-time augmentation.
00:21:40 — Soravit Changpinyo: Toward Vision and Richer Language(s)
- Soravit ‘Beer’ Changpinyo discusses the importance of vision and language research, particularly for visually impaired users, and explores ways to elevate the role of language in computer vision through specificity, scene-text understanding, interrogative language, and multilinguality, presenting various models and datasets developed by his team.
00:52:00 — Raul Puri: Challenges in Deploying Omnimodels and Assistive Technology (Where we are and what’s next?)
- Raul Puri discusses the challenges and future directions in deploying omnimodels and assistive technology, highlighting the iterative deployment process, the necessity of general models, the evolution of GPT-4V and GPT-4o paradigms, and the critical role of user feedback and safety mitigations.
01:11:41 — Raul Puri: Demo
- Raul Puri demonstrates the capabilities of GPT-4o in real-time, showcasing its ability to interpret visual information, engage in natural conversation, and provide context-aware responses, including identifying objects, describing scenes, and even understanding the speaker’s presentation context.
01:19:53 — Raul Puri, Rowan Zellers: Questions?
- A Q&A session discussing the balance between AI flexibility and safety, data bias, deployment challenges, and specific considerations for blind users.
01:33:31 — Sangbeom Lee: Integrating Query-aware Segmentation and Cross-Attention for Robust VQA
- Presents a method for robust Visual Question Answering (VQA) using LoRA, query-aware segmentation, and ensemble techniques, emphasizing direct integration of segment features.
01:34:07 — Bao-Hiep Le: Leveraging Large Vision-Language Models for Visual Question Answering in VizWiz Grand Challenge
- Describes an approach for Visual Question Answering (VQA) using fine-tuned large vision-language models (LVLMs) with guided prompts and an ensemble of 33 models.
01:34:35 — Heegwang Kim: Visual Question Answering with Multimodal Learning for VizWiz-VQA
- Presents a multimodal learning approach for Visual Question Answering (VQA) within the VizWiz-VQA framework.
01:35:01 — Yuvanshu Agarwal: Shifted Reality: Navigating Altered Visual Inputs with Multimodal LLMs
- Investigates the limitations of GPT-4V when presented with altered images (rotated, blurred, cropped) for blind users, highlighting hallucination issues and the need for improved robustness.
01:35:38 — Ragav Sachdeva: The Manga Whisperer: Making Comics Accessible to Everyone
- Focuses on making comic books accessible to visually impaired individuals by generating dialogue transcripts from manga pages, addressing challenges like varying character appearances and viewpoints.
01:36:05 — Dai Quoc Tran: Vision-Language Model-based PolyFormer for Recognizing Visual Questions with Multiple Answer Groundings
- Addresses the single answer grounding challenge in VQA using ViLT and PolyFormer, achieving an F1 score of 81.71 on the VizWiz Grand Challenge.
01:36:33 — Jialong Zuo: Propose, Match, then Vote: Enhancing Robustness for Zero-shot Image Classification via Cross-modal Understanding
- Proposes a pipeline for zero-shot image classification using LLMs, achieving 67.67% accuracy by combining VOLO for proposing, LLaVA and Kimi for matching, and voting by multiple experts.
02:39:51 — Elisa Kreiss: How communicative principles (should) shape human-centered AI for nonvisual accessibility
- Elisa Kreiss discusses how image-based text generation for accessibility needs to consider both the communicative goal of the text itself and the communicative goal of the image, emphasizing that descriptions and captions serve distinct purposes and that context is crucial for both generation and evaluation.
03:11:46 — Rowan Zellers, Brian Fischler, Raul Puri, Elisa Kreiss, Soravit Changpinyo: Open Q&A Panel
- The panel discusses the challenges and opportunities in developing AI for visual accessibility, emphasizing the importance of interdisciplinary collaboration, context-aware systems, robust evaluation metrics, and user-centered design to address diverse needs and ethical considerations.
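
Several of the talks above mention ensembling: MGTV's "hard ensemble", the 33-model ensemble described by Bao-Hiep Le, and voting by multiple experts in Jialong Zuo's pipeline. The summaries do not give implementation details, so the following is only a minimal sketch of plain majority voting over answer strings. The function name `majority_vote`, the lowercase/strip normalization, and the tie-breaking rule are assumptions made for illustration and are not any team's actual code.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer after light normalization.

    Ties are broken in favor of the answer that appears first in `answers`.
    The normalization (strip + lowercase) is an illustrative assumption.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk in input order so ties resolve to the earliest model's answer.
    for answer in normalized:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer must reach the best count")


if __name__ == "__main__":
    # Three hypothetical model outputs for the same question.
    print(majority_vote(["Coke", "coke ", "pepsi"]))  # coke
```

Hard voting of this kind needs only each model's final prediction, which makes it easy to combine heterogeneous models; it discards any confidence information the models might provide.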

Key Takeaways

The VizWiz Grand Challenge drives innovation in vision-based technologies for people with vision impairments, with winning teams demonstrating significant progress in VQA and grounding tasks.
Elevating the role of language in vision research is crucial for developing more capable and accessible AI systems, focusing on specificity, scene-text understanding, interrogative language, and multilinguality.
Deploying omnimodels like GPT-4o for assistive technology requires an iterative development process, close collaboration with users, and robust safety mitigations to address challenges like hallucinations and privacy concerns.
Future work in vision and language models emphasizes comprehensiveness, dynamic trade-offs between various parameters (resolution, frame rates, UX), and community-driven data commons to capture real-world usage and ensure equitable experiences for blind users.
Balancing model flexibility with safety and robustness is a critical challenge in multimodal AI development, requiring careful consideration of potential misuse and usability.
Addressing data bias and the inherent limitations of datasets in representing diverse user needs is crucial for developing equitable and effective AI systems.
The integration of advanced techniques like query-aware segmentation, cross-attention, and ensemble methods significantly improves the performance and robustness of VQA models.
Multimodal LLMs show promise in tasks like image accessibility and understanding altered visual inputs, but challenges remain in ensuring their reliability and preventing hallucinations.
Image-based text generation for accessibility must differentiate between descriptions (replacing an image) and captions (complementing an image), as they serve distinct communicative goals.
Context is a critical factor that influences what information is relevant and useful in an image description, and current AI models show improved performance when trained with rich contextual data.
State-of-the-art automatic evaluation metrics like CLIPScore are often context-independent and do not reliably correlate with human judgments, especially in context-sensitive accessibility scenarios.
Developing truly human-centered AI for nonvisual accessibility requires interdisciplinary collaboration, focusing on user intent, personalized solutions, and robust evaluation methods that account for the diverse needs and experiences of users with disabilities.
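
The takeaway about context-independence can be made concrete. As published, CLIPScore is w · max(cos(image embedding, text embedding), 0) with w = 2.5, so the score is a function of the image and the candidate text alone. The sketch below assumes the CLIP embeddings have already been computed and are passed in as plain lists of floats; `clip_score`, `cosine`, and the toy vectors are illustrative names and values, not an API from any library.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / norm


def clip_score(image_emb: Sequence[float],
               text_emb: Sequence[float],
               w: float = 2.5) -> float:
    """CLIPScore from precomputed embeddings: w * max(cos, 0).

    Note what is absent from the signature: there is no argument for the
    surrounding context (article text, user goal), so two contexts that call
    for different descriptions receive the same score for the same text.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real CLIP embeddings.
    print(clip_score([1.0, 0.0], [1.0, 0.0]))   # 2.5
    print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0
```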

Methods / Models / Datasets Mentioned

AI2D
ALIGN
BLIP-2
CC12M
CC3M
CLIP
CLIPScore
CLIPSeg
ChartQA
CogAgent
Concadia
Conceptual Captions
DOCCI
DeepSeek-VL
DenseNet + LSTM
DocVQA
EVA-CLIP-ViT
FRCNN
Flamingo
Flickr30k
Frozen
GPT-4V
GPT-4o
GQA
ImageInWords
InstructBLIP
Kimi
LLaVA
LiT
LoRA
LoRA+
MSCOCO
MSCOCO Captions
MaXM
NLVR2
OCR
OSCAR (VinVL)
PaLI
PaLI-X
PolyFormer
PreSTU
RedCaps
ResNet + LSTM
RoBERTa
SAM
SBU Captions
ST-VQA
SplitOCR
TextCaps
TextVQA
VOLO
VQAv2
VQ^2A
ViLT
VizWiz-VQA
VizWizCaptions
WebLI
nocaps

Topics

AI Safety and Robustness · Answer Grounding · Assistive Technology · Communicative goals · Context-aware AI · Data Bias · Few-shot Learning · GPT-4V · GPT-4o · Hallucinations · Human-Computer Interaction (HCI) · Human-centered AI · Image Accessibility · Image-based text generation · Interdisciplinary collaboration · Iterative Deployment · Language in Vision Research · Large Vision-Language Models (LVLMs) · Model evaluation · Multilinguality · Multimodal AI · Nonvisual accessibility · Omnimodels · Personalization · Safety Mitigations · Scene-Text Understanding · Single Answer Grounding · Visual Question Answering (VQA) · VizWiz Grand Challenge · Zero-shot Learning

Notes

Open for commentary — connections to other work, critiques, follow-up reading.