4th Workshop on CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling

Event: CVPR 2024 · Duration: 573 min

Abstract

This video segment captures the initial moments of the 4th Workshop on CV4Animals, focusing primarily on resolving technical audio issues before the presentations can properly begin. The host, Urs Waldmann, and the first speaker, Marcelo Feighelstein, attempt to troubleshoot sound problems, which leads to a brief interruption of Feighelstein’s introductory talk on “Beyond Words: Connecting with Animals Through AI Emotion Analysis.” The segment highlights the importance of robust technical setups for online and hybrid events, while also providing a glimpse into the workshop’s theme of leveraging AI for understanding animal behavior and emotions.

This segment features two talks on AI applications in animal behavior analysis. The first speaker, Marcelo Feighelstein, presents research on identifying pain in cats using AI, detailing model performance, key insights into model decision-making, and extending the approach to other species like sheep and rabbits. The second speaker, Alexander Mathis, introduces DeepLabCut and the concept of SuperAnimal foundation models for generalized animal pose estimation. He discusses challenges in unifying diverse datasets, proposes a novel multi-instance pose estimation approach (BUCTD), and highlights the integration of these tools with other foundation models and their application in neuroscience for reverse-engineering the sensorimotor system.

This video segment consists entirely of a static black screen displaying the name ‘Ganggang Huang’. There is no visual presentation, speaker video, or discernible audio content throughout the duration of this segment. It appears to be a placeholder for a scheduled talk or a portion of the video where the presentation content was not captured.

This segment introduces TAG, a foundation model for motion tracking that aims to universally track objects at any granularity, from pixels to masks and bounding boxes, across diverse categories. Inspired by large language models like GPT-4, TAG is a high-capacity model (320 million parameters) trained on a massive dataset of 75 diverse tracking datasets, including both real and synthetic videos. The model processes video volumes using a 3D CNN and a transformer, outputting multi-channel heatmaps for center, segmentation, and corners, and demonstrates state-of-the-art performance in point tracking while showing sensible uncertainty during occlusions.

This segment features five talks from the CV4Animals workshop, covering diverse applications of computer vision and AI in animal welfare and ecological monitoring. Topics include developing robust foundation models for motion tracking across various granularities and modalities, unifying AI approaches for animal well-being by addressing limitations in common sense and fundamental understanding, and leveraging external identifications for long-term multi-object tracking in livestock. Additionally, presentations showcase a novel framework for wild animal tracking using high-quality SAM and domain adaptation, and a motion-based video compression algorithm for resource-constrained camera traps. The discussions highlight the importance of data quality, domain expertise integration, and evolving benchmarks in advancing AI for animal-related research.

Camera trap wildlife recognition faces significant challenges due to varying illumination, occlusion, camouflage, and out-of-distribution generalization to novel locations. This work proposes a Multimodal Knowledge Graph (MMKG) framework that reformulates species identification as a link prediction problem, effectively integrating heterogeneous contextual information such as phylogenetic taxonomy and spatio-temporal metadata. The MMKG approach, utilizing Graph Neural Networks and a DistMult decoder, demonstrates substantial improvements in mean average precision (mAP) for out-of-distribution generalization compared to existing baselines on large-scale camera trap datasets like Snapshot Serengeti and iWildCam.

This segment features three presentations from the CV4Animals workshop. Dor Litvak introduces ‘Ponymation,’ a method for generating 4D animal animations from unlabeled videos, showcasing its ability to create diverse and realistic motions. Gyeongsu Cho presents ‘DogRecon,’ a framework for reconstructing animatable 3D dog models from a single image using a canine prior and Gaussian representations. Andrés Hernández introduces ‘Pytorch-Wildlife,’ an open-source deep learning framework for conservation, highlighting its accessibility, transparency, and scalability for wildlife monitoring. The segment concludes with a panel discussion covering various aspects of computer vision in animal behavior research, including challenges with data, model interpretability, and the role of interdisciplinary collaboration.
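
To make the link-prediction formulation of the MMKG work more concrete, here is a minimal sketch of DistMult-style scoring. It is an illustration under stated assumptions, not the authors' implementation: in the described framework the embeddings would come from a Graph Neural Network encoder, whereas here they are hand-written toy vectors, and the names `distmult_score`, `rank_species`, and the relation vector are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def distmult_score(head: Vector, relation: Vector, tail: Vector) -> float:
    """DistMult score: sum over dimensions of head * relation * tail."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_species(
    image_emb: Vector,
    relation_emb: Vector,
    species_embs: Dict[str, Vector],
) -> List[Tuple[str, float]]:
    """Rank candidate species as tails of the triple (image, relation, ?)."""
    scored = [
        (name, distmult_score(image_emb, relation_emb, emb))
        for name, emb in species_embs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, for illustration only).
image_emb = [0.9, 0.1, 0.4]
relation_emb = [1.0, 0.5, 1.0]  # hypothetical "image depicts species" relation
species_embs = {
    "zebra": [0.8, 0.2, 0.5],
    "lion": [0.1, 0.9, 0.3],
    "impala": [0.4, 0.4, 0.4],
}

ranking = rank_species(image_emb, relation_emb, species_embs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Species identification then amounts to picking the highest-scoring tail entity for the image node. Contextual nodes such as location, time, or taxonomy would influence the result only through the learned embeddings, which this sketch does not model.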

Speakers

Urs Waldmann — Centre for the Advanced Study of Collective Behaviour, University of Konstanz
Marcelo Feighelstein — Tech4Animals Research Fellow, University of Haifa
Alexander Mathis — École Polytechnique Fédérale de Lausanne, Swiss Federal Institute of Technology
Ganggang Huang
Adam Harley — Stanford University
Jennifer J. Sun
Sophie Ngo Bibinbe — Université Laval, CDPQ
Malika Nisal Ratnayake — Monash University
Vardaan Pahuja — The Ohio State University
Dor Litvak — CUHK MMLAB, Stanford University, UT Austin
Gyeongsu Cho — Artificial Intelligence Graduate School, UNIST; Department of Computer Science, DGIST
Andrés Hernández — Microsoft AI for Good Lab
Varun Jampani — Stability AI

Talks (14)

00:05:40 — Marcelo Feighelstein: Beyond Words: Connecting with Animals Through AI Emotion Analysis
- An introduction to using AI for understanding animal emotions, including pain detection and emotional states, and the challenges in animal pain research.
01:21:52 — Marcelo Feighelstein: Identifying Pain in Cats using AI
- This talk discusses AI models for identifying pain in cats, comparing landmark and deep learning approaches, highlighting the benefits of video data and diverse datasets, and analyzing model decision-making processes through attention visualization. It also covers applications to other species like sheep and horses, and compares AI performance to veterinarians.
01:40:24 — Alexander Mathis: Towards foundation models for behavioral analysis
- This talk introduces DeepLabCut as a toolbox for markerless pose estimation via transfer learning, discusses the development of SuperAnimal foundation models for generalized animal pose estimation across diverse species and contexts, and presents a novel Bottom-Up Conditioned Top-Down (BUCTD) approach for multi-instance pose estimation in crowded scenes. It also explores the integration of these tools with other foundation models and their application in reverse-engineering the sensorimotor system.
02:43:44 — Ganggang Huang: Presentation by Ganggang Huang (Content Not Visible)
- This segment displays a static screen with the speaker’s name, ‘Ganggang Huang’, but no visual presentation content is available.
04:06:52 — Adam Harley: Building a Foundation Model for Motion: Tracking Anything at Any Granularity
- This talk introduces a foundation model for motion, TAG, designed to track anything at any granularity by training a simple, large model on a massive, diverse dataset of real and synthetic videos, aiming to overcome limitations of granularity-specific and category-specific methods.
05:28:34 — Jennifer J. Sun: Unifying AI Approaches towards Animal Well-being
- Explored the limitations of current internet-scale AI models in fundamental understanding and common sense, proposing a framework for integrating domain expertise and evolving benchmarks to improve well-being for humans and animals.
05:35:49 — Sophie Ngo Bibinbe: An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identifications: use case on long-term tracking in livestock
- Presented an HMM-based framework for long-term multi-object tracking in livestock that leverages sparse and uncertain external identifications to improve tracking performance and stability over time.
05:38:24 — Ganggang Huang: Wild Animal Tracking with High Quality-SAM and Domain Adaptation
- Introduced WATS-DA, a novel framework combining HQ-SAM and Domain Adaptation for wild animal tracking, demonstrating improved performance and generalization across diverse animal species and environments.
05:39:59 — Malika Nisal Ratnayake: Motion-based video compression for resource-constrained camera traps
- Presented EcoMotionZip, a motion-based video compression algorithm for camera traps that significantly reduces file size and frames while preserving crucial animal behavior data, enhancing remote ecological monitoring.
06:49:26 — Vardaan Pahuja: Bringing Back the Context: Camera Trap Species Identification as Link Prediction on Multimodal Knowledge Graphs
- The talk introduces a Multimodal Knowledge Graph (MMKG) approach for camera trap species identification, reformulating the problem as link prediction to improve out-of-distribution generalization by leveraging heterogeneous contextual information like taxonomy and spatio-temporal metadata.
08:11:25 — Ganggang Huang: (title not visible)
- The speaker presents their work, but the content and title slide are not visible in the video segment.
08:12:22 — Dor Litvak: Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
- Ponymation is a new method that learns generative models of articulated 3D animals from unlabeled online videos, capable of generating diverse and realistic 4D animations without explicit pose annotations or parametric shape models, and generalizes to abstract representations.
08:14:31 — Gyeongsu Cho: Canine Prior-Guided Animatable 3D Gaussian Dog Reconstruction From a Single Image
- DogRecon is a framework that reconstructs animatable and controllable 3D Gaussian dog models from a single image, addressing challenges of pose prediction and limited input data by using a canine prior, canine-centric NVS, and reliable sampling weights.
08:16:48 — Andrés Hernández: Pytorch-Wildlife: A Collaborative Deep Learning Framework for Conservation
- Pytorch-Wildlife is an open-source AI framework designed for wildlife conservation, emphasizing accessibility, transparency, and scalability, offering a user interface, model zoo, and specialized utilities for camera trap data analysis.
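
The idea behind HMM-based identity tracking from sparse and uncertain identifications, as described in the livestock tracking talk above, can be sketched with a small forward-backward smoother. This is a generic textbook formulation under simplifying assumptions, not the speaker's framework: one track, a hidden state equal to the animal's identity, a symmetric identity-switch probability, and an identification reading that is either missing (`None`) or correct with a fixed accuracy. The function name and all parameters are hypothetical.

```python
from typing import List, Optional


def forward_backward(
    observations: List[Optional[int]],
    n_states: int,
    stay_prob: float,
    read_accuracy: float,
) -> List[List[float]]:
    """Return the smoothed posterior over identities for every frame.

    Assumes n_states >= 2. A reading of None means no identification
    was available for that frame.
    """
    switch = (1.0 - stay_prob) / (n_states - 1)
    trans = [
        [stay_prob if i == j else switch for j in range(n_states)]
        for i in range(n_states)
    ]

    def emission(obs: Optional[int]) -> List[float]:
        if obs is None:  # missing reading carries no information
            return [1.0] * n_states
        wrong = (1.0 - read_accuracy) / (n_states - 1)
        return [read_accuracy if s == obs else wrong for s in range(n_states)]

    def normalize(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: filtered beliefs, starting from a uniform prior.
    alpha: List[List[float]] = []
    belief = [1.0 / n_states] * n_states
    for t, obs in enumerate(observations):
        e = emission(obs)
        if t > 0:
            belief = [
                sum(belief[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        belief = normalize([belief[s] * e[s] for s in range(n_states)])
        alpha.append(belief)

    # Backward pass: evidence from future frames.
    n_frames = len(observations)
    beta = [[1.0] * n_states for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        e = emission(observations[t + 1])
        beta[t] = normalize([
            sum(trans[s][r] * e[r] * beta[t + 1][r] for r in range(n_states))
            for s in range(n_states)
        ])

    # Combine both passes into per-frame posteriors.
    return [
        normalize([a * b for a, b in zip(alpha_t, beta_t)])
        for alpha_t, beta_t in zip(alpha, beta)
    ]


# Sparse readings for one track: identity 0 is read three times, with one
# conflicting reading of identity 2 and two frames without any reading.
readings = [0, None, 2, None, 0, 0]
posteriors = forward_backward(readings, n_states=3, stay_prob=0.98, read_accuracy=0.8)
for t, p in enumerate(posteriors):
    print(t, [round(x, 3) for x in p])
```

Because the smoother pools evidence across the whole sequence, the single conflicting reading does not flip the most likely identity in this toy run, which is the kind of stability that sparse external identifications are meant to provide.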

Key Takeaways

Technical issues, particularly audio, can significantly disrupt the flow of virtual and hybrid events, requiring troubleshooting during live sessions.
The workshop aims to explore the application of computer vision and AI in understanding and tracking animal behavior and emotions.
AI tools are being developed to detect pain and emotional states in animals, which can improve animal welfare and veterinary care.
Challenges in animal pain research include small datasets, the need for robust generalization of models, and strict ethical considerations for data collection.
Deep learning models can effectively identify pain in animals, often outperforming human experts when relying solely on visual cues, but human experts using broader behavioral observations still achieve higher accuracy.
Building generalized foundation models like SuperAnimal for animal pose estimation is crucial for diverse applications, and challenges like unifying heterogeneous datasets can be addressed through techniques like gradient masking.
Novel approaches like the Bottom-Up Conditioned Top-Down (BUCTD) method can achieve state-of-the-art performance in multi-instance pose estimation, even in crowded scenes, by combining the strengths of different detection strategies.
Integrating animal behavior analysis tools with other foundation models (e.g., CLIP, LLMs) can unlock new capabilities, such as natural language interfaces for interactive analysis and efficient annotation through dense point trackers.
The segment at 02:43:44, attributed to Ganggang Huang, shows only a static placeholder screen with the speaker’s name.
No visual presentation content or speaker footage is available for that segment.
The placeholder segment’s duration suggests it was intended for a presentation that is not visually present.
A universal foundation model for motion tracking can be built by training a simple, high-capacity model on a vast and diverse collection of tracking datasets, including both real and synthetic data.
Processing video as 3D volumes with a 3D CNN and a transformer allows the model to learn space-time relationships and track objects effectively through occlusions.
The model outputs multi-channel heatmaps (center, segmentation, corners) which can be post-processed to derive tracking information at various granularities (points, masks, boxes) from a single inference.
Iterative inference and the use of synthetic data are crucial for achieving high accuracy, especially for tiny objects and for improving generalization across diverse domains.
Real-world data for animal tracking is often biased, incomplete, or incorrectly annotated, posing significant challenges for AI model training.
Integrating domain expertise and developing evolving benchmarks are crucial for building trustworthy and impactful AI systems for animal well-being.
Self-supervised learning and foundation models offer promising avenues for automated data annotation and feature extraction, especially in resource-constrained or data-scarce scenarios.
Advanced tracking frameworks can leverage external identifications and motion-based compression to improve long-term tracking accuracy and manage large video datasets efficiently.
Camera trap species identification is challenging due to various environmental factors and poor generalization to unseen locations.
Leveraging heterogeneous contextual information (taxonomy, spatio-temporal data) through a Multimodal Knowledge Graph significantly improves out-of-distribution generalization.
Reformulating species identification as a link prediction problem on a knowledge graph allows for effective integration of diverse data types.
The proposed MMKG model, using GNNs, outperforms traditional image-only and context-only baselines on large-scale camera trap datasets.
Advanced deep learning models can generate realistic and diverse 4D animal animations and 3D reconstructions from limited or unlabeled data, demonstrating strong generalization capabilities.
Open-source frameworks tailored for specific domains like wildlife conservation are crucial for democratizing AI tools, promoting collaboration, and addressing unique data challenges.
The field of computer vision for animals benefits significantly from interdisciplinary collaboration, requiring a balance between engineering and scientific mindsets to tackle complex problems like uncertainty estimation and data scarcity.
Foundation models and large pre-trained datasets offer significant potential for transfer learning in animal research, but careful consideration of data quality, domain-specific challenges, and the ethical implications of model deployment is essential.
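
The takeaway about deriving points, masks, and boxes from one set of heatmaps can be illustrated with a small decoding sketch. It is a simplified stand-in, not TAG's actual post-processing: the point is taken as the peak of a center heatmap, the mask as a thresholded segmentation heatmap, and the box as the extent of that mask (the talk describes dedicated corner heatmaps, which this sketch does not use). All function names and the 0.5 threshold are assumptions.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]
Mask = List[List[bool]]


def decode_point(center: Heatmap) -> Tuple[int, int]:
    """Return (row, col) of the highest value in the center heatmap."""
    best_row, best_col, best_val = 0, 0, float("-inf")
    for r, row in enumerate(center):
        for c, value in enumerate(row):
            if value > best_val:
                best_row, best_col, best_val = r, c, value
    return best_row, best_col


def decode_mask(segmentation: Heatmap, threshold: float = 0.5) -> Mask:
    """Binarize the segmentation heatmap at the given threshold."""
    return [[value >= threshold for value in row] for row in segmentation]


def mask_to_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the mask, or None if it is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, inside in enumerate(row) if inside]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


# Toy single-frame heatmaps (assumed values, for illustration only).
center = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.0],
]
segmentation = [
    [0.1, 0.6, 0.2, 0.0],
    [0.7, 0.9, 0.8, 0.1],
    [0.2, 0.6, 0.4, 0.0],
]

point = decode_point(center)
mask = decode_mask(segmentation)
box = mask_to_box(mask)
print(point, box)
```

Running the three decoders on the same forward-pass output is what allows one inference to serve several tracking granularities; a real system would repeat this per frame and per tracked target.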

Methods / Models / Datasets Mentioned

3D CNN
3D ResNet-50
4D-fy
AI Emotion Analysis
Alignment
Alpha-Refine
AmadeusGPT
BANMO
BITE
BKinD
Bottom-Up Conditioned Top-Down Approach (BUCTD)
ByteTrack
ByteTrack+Re-ID
CLIP
CNN
COCO format
CalMS21 dataset
Canine Prior
Canine-centric NVS
ChatGPT
ChimpACT dataset
ChimpVLM
CoAM
CoPilot
CoTracker
Conv. PoseMachines
DA
DINO features
DINOv2
Deep Learning
DeepPose
DeeperCut
DistMult decoder
DogRecon
DreamGaussian
Droissart et al. [4]
EcoMotionZip
FACS Analysis
Fewshot-GART
Fly vs. Fly dataset
Forward-Backward algorithm
GART
GPT-4
GPT-4V
GPT3.5 LLM
GRDM
Gradient Masking
GrayST (Gray Stacking)
HMM
HQ-SAM
HRNet
HSM
Heatmap Visualization
Hugging Face Transformers
Human3.6M dataset
Hungarian algorithm
ImageNet
InternVideo
KABR dataset
Landmarks
Leave One Animal Out Cross Validation
MLP
MViTv2
MegaDetector v5
MegaDetector v6
Memory-replay self-supervised fine-tuning
MoDeep
Motion VAE
Naqvi et al. [1]
One-2-3-45
Open Tree of Life (OTT)
OpenAI
OpenPose
PIPs
PIPs++
PanAf20K dataset
PanAf500 dataset
Pixel Occlusion
Point-E
Polytrack
Ponymation
Pytorch-Wildlife
RDM
Rat7M dataset
Ratnayake et al. [2]
Regional Vectorization
Relational Graph Convolutional Networks (RGCN)
Reliable Sampling Weight
ResNet-50
ResNet-like architecture
SAM
SMAL
Segment Anything
Shap-E
SiamBAN
SiamCAR
SiamGAT
SiamRBO
Sora
SuperAnimal
TimeSformer
Timelapse
TokenPose
TransPose
Transformer
VITPose
Vector Occlusion
VideoPrism
Vision Transformer
Vision Transformer-like architecture
WATS-DA
WildCLIP
van der Voort et al. [3]

Topics

3D animal reconstruction · 4D animal animation · AI for emotion analysis · Animal Behavior Analysis · Animal behavior tracking · Animal tracking · Animal welfare · Behavior recognition · Benchmarking AI models · Camera trap analysis · Camera trap species identification · Camera traps · Computer vision · Computer vision for animal welfare · Deep Learning · Deep learning frameworks · DeepLabCut · Domain adaptation · Emotion Recognition · Ethograms · Foundation models · Gaussian Splatting · Generative models · Graph Neural Networks · Hidden Markov Models (HMMs) · High-Quality SAM (HQ-SAM) · Interdisciplinary collaboration · Iterative inference · Large-scale datasets · Link prediction · Motion tracking · Multi-granularity tracking · Multi-object tracking · Multimodal Knowledge Graphs · Out-of-distribution generalization · Pain detection · Pose estimation · Self-supervised learning · Spatio-temporal metadata · SuperAnimal · Supervised learning · Synthetic data · Technical setup · Uncertainty estimation · Universal tracking · Unlabeled video data · Video compression · Wildlife conservation · Wildlife taxonomy · Workshop introduction

Notes

Open for commentary — connections to other work, critiques, follow-up reading.