Machine Learning for Geometric Shape Analysis
Event: CVPR 2024 Workshop and Challenge · Duration: 490 min
Abstract
This segment introduces the CVPR 2024 Workshop on Deep Learning for Geometric Computing, highlighting its 6th edition and focus on advancing shape understanding beyond traditional tasks. It details the various SkelNetOn challenges (Pixel, Point, Parametric, Image) and the new Breaking Best geometric fracture reassembly challenge, emphasizing the workshop’s role in fostering collaboration and innovation. The keynote presentation by Olga Sorkine-Hornung then delves into 20 years of mesh editing, covering classical variational deformation methods like Laplacian and ARAP, their evolution to handle high-resolution meshes and real-time performance, and the integration of deep learning for automatic feature extraction and shape modeling in systems like SPAGHETTI and SENS.
This segment features a presentation on a Residual-based Dense Point-wise Network (RDPN) for 6D object pose estimation from RGB-D images. The speaker, Chu-Song Chen, details the network’s architecture, which leverages dense correspondences, a residual representation for object coordinates, and 2D/3D to 3D matching to achieve state-of-the-art performance on various benchmarks including LineMOD, Occlusion-LineMOD, YCB-Video, and MP6D. The talk also covers the challenges of 6D pose estimation, limitations of previous direct and implicit prediction methods, and the benefits of their approach in handling occlusions, lighting variations, and object symmetries. The segment concludes with the introduction of the next keynote speaker, Julie Digne, though her presentation is cut short due to technical difficulties.
This segment features Julie Digne’s presentation on “Machine Learning for Geometric Shape Analysis”. She begins by outlining the challenges of geometry processing, including diverse data acquisition methods, lack of universal surface representation, and issues like noise, outliers, and missing data. Digne then introduces Implicit Neural Representations (INRs) as a powerful tool for representing shapes and extracting topological information like the medial axis, demonstrating its robustness to noise and missing data. Finally, she explores how geometric information can enhance 2D image analysis, proposing an architecture that combines 2D and 3D networks for improved segmentation results with light networks and without extensive 3D annotations.
This segment features a talk by Shalini De Mello from NVIDIA, exploring the question of whether 3D data is necessary for learning geometry in the context of AI-mediated 3D telepresence. The speaker introduces NVIDIA’s work on creating highly photorealistic 3D humans and discusses the evolution of generative adversarial networks (GANs) from 2014 to 2021, highlighting the advancements in photorealism. The presentation delves into 3D GANs for unsupervised learning of photorealistic 3D faces from 2D image collections, emphasizing the ability to generate multi-view consistent images and operate in real-time. The talk also covers the challenges and solutions for high-resolution 3D rendering, including the use of tri-plane representation and a novel approach for efficient sampling in NeRFs, culminating in a demonstration of real-time 3D lifting from a single RGB image.
This segment presents a real-time ViT-based encoder capable of lifting single 2D images into canonicalized 3D triplane representations, preserving person-specific textures without requiring camera pose. The model is fully supervised by multi-view synthetic data generated from EG3D, with camera augmentation applied to enhance real-world generalization. The speaker also introduces ‘Dream-in-4D’, a novel two-step diffusion-based approach for generating dynamic 4D scenes from text and/or image prompts, showcasing its ability to create animated 3D objects with photorealistic details. The discussion highlights the challenges of achieving physically accurate relighting and complex material rendering in AI-generated 3D content.
This segment features an open discussion session focused on the current state and future challenges of geometric deep learning. Speakers and audience members engage in a collaborative exchange, addressing which problems are considered largely ‘solved’ and identifying the most significant remaining hurdles in the field. The discussion highlights the progress made in implicit representations while emphasizing the ongoing difficulties with explicit representations and their interpretability.
Speakers
- Ilke Demir — Intel
- Dena Bazazian — University of Plymouth
- Adarsh Krishnamurthy — Iowa State University
- Géraldine Morin — University of Toulouse
- Kathryn Leonard — Occidental College
- Silvia Sellán — University of Toronto
- Aditya Babu — Iowa State University
- Sainan Liu — Intel
- Olga Sorkine-Hornung — Professor of Computer Science, ETH Zurich
- Shalini De Mello — NVIDIA Research
- Julie Digne — CNRS, Lyon, France
- Chu-Song Chen — National Taiwan University
- Elizavet
- Alec Jacobson — University of Toronto
- Animesh Garg — University of Toronto
- Yun-Chun Chen — University of Toronto
Talks (14)
- 00:00:00 — Ilke Demir: Deep Learning for Geometric Computing CVPR 2024 Workshop and Challenge
- Introduction to the 6th edition of the Deep Learning for Geometric Computing workshop at CVPR 2024, highlighting its focus on global shape understanding, parameterized representations, hierarchical decompositions, and fostering collaboration through challenges and datasets.
- 00:05:42 — Ilke Demir: SkelNetOn Tracks and Breaking Best Challenge
- Overview of the SkelNetOn challenges (Pixel, Point, Parametric, Image) and the new Breaking Best geometric fracture reassembly challenge, encouraging participation and highlighting the workshop’s role in fostering research.
- 00:17:04 — Olga Sorkine-Hornung: 20 Years of mesh editing: What we learned and what to learn
- A keynote presentation discussing the evolution of mesh editing techniques over 20 years, from classical variational approaches to modern deep learning methods, focusing on challenges and advancements in 3D shape modeling and animation.
- 01:21:56 — Chu-Song Chen: Residual-based Dense Point-wise Network for 6Dof Object Pose Estimation Based on RGB-D Images
- This talk introduces a Residual-based Dense Point-wise Network (RDPN) for precise and efficient 6D object pose estimation from RGB-D images, addressing challenges like occlusion, lighting, and object symmetry through dense correspondences, a novel residual representation, and 2D/3D to 3D matching.
- 02:45:15 — Julie Digne: Machine Learning for Geometric Shape Analysis
- Julie Digne introduces geometry processing, discusses challenges in 3D data representation and noise, and presents her work on implicit neural representations for extracting medial axes and improving 2D segmentation with geometric information.
- 04:08:43 — Shalini De Mello: Do We Need 3D Data to Learn Geometry?
- Discusses the mission of creating AI-mediated 3D telepresence and the use of generative models for high-fidelity 3D human reconstruction from 2D images.
- 04:16:41 — Julie Digne: (untitled segment)
- Julie Digne is introduced as the next keynote speaker: a Senior Researcher at CNRS, Lyon, France, with interests in geometry processing, surface analysis, denoising, compression, and segmentation. Her talk itself is not included in this segment because of audio setup issues.
- 05:26:46 — Shalini De Mello: Our Solution: ViT-based Encoder with (Only) 3D Synthetic D-Images
- This segment introduces a real-time ViT-based encoder that generates canonicalized 3D triplane representations from single 2D images, preserving person-specific textures, and is trained solely on synthetic data.
- 06:48:27 — Ilke Demir: Open Discussion / Open Collaboration Session
- Ilke Demir introduces the open discussion session and the next speaker, Sainan Liu.
- 06:48:34 — Sainan Liu: Open Discussion / Open Collaboration Session
- Sainan Liu welcomes participants to the open discussion, encouraging new ideas and collaborations, and poses the first question to the audience.
- 06:49:09 — Alec Jacobson: Open Discussion / Open Collaboration Session
- Alec Jacobson discusses the progress in 3D representation, particularly with implicit representations, in response to the first question.
- 06:49:17 — Animesh Garg: Open Discussion / Open Collaboration Session
- Animesh Garg agrees with the sentiment that implicit representations have seen significant progress.
- 06:49:22 — Yun-Chun Chen: Open Discussion / Open Collaboration Session
- Yun-Chun Chen also highlights the advancements in implicit representations.
- 08:46:46 — Shalini De Mello: Dream-in-4D: A Unified Approach to Text- and Image-guided 4D Scene Generation
- This segment introduces Dream-in-4D, a two-step diffusion-based approach for generating dynamic 4D (animated 3D) scenes from text and/or image prompts, demonstrating its ability to create animated 3D objects with photorealistic details.
Key Takeaways
- The workshop aims to push the boundaries of geometric computing beyond classic tasks, focusing on global shape understanding, parameterized representations, and hierarchical decompositions, fostering collaboration through challenges and shared datasets.
- Classical mesh editing techniques, like Laplacian and ARAP, have evolved to enable real-time, artifact-free deformations on high-resolution meshes, addressing challenges of rigidity and local rotations through iterative optimization and subspace modeling.
- Deep learning offers a powerful approach to automatically learn high-level shape representations, features, and relationships from data, moving beyond hand-tailored rules for complex shape modeling and editing, as demonstrated by systems like SPAGHETTI and SENS.
- Despite advancements, challenges remain in achieving generalizable models across diverse object classes, controlling generative AI for specific design outcomes, and ensuring the physical validity and functionality of generated shapes, suggesting a need for hybrid approaches combining classical physics laws with data-driven learning.
- The RDPN network utilizes a novel residual representation and dense correspondences to effectively predict 6D object poses from RGB-D images, outperforming existing methods on challenging datasets.
- The approach addresses key challenges in 6D pose estimation, such as occlusion, varying lighting conditions, and object symmetry, by incorporating intrinsic adjustments and a coarse-to-fine residual representation.
- By leveraging both 2D-3D and 3D-3D dense correspondences, the method achieves robust performance, particularly in scenarios with heavy occlusion and for texture-less objects.
- The use of a 6D parameterization for rotation helps handle discontinuities and improves the signal-to-noise ratio, contributing to the overall accuracy and efficiency of the pose estimation.
- Deep learning for geometric data is challenging due to diverse representations and data quality issues, but INRs offer a robust approach.
- Implicit Neural Representations (INRs) can effectively extract topological features like the medial axis, even in the presence of noise and missing data, by leveraging signed distance functions.
- Integrating geometric information (3D data) with 2D image data can significantly improve 2D segmentation tasks, even with lightweight network architectures.
- The proposed methods demonstrate improved performance and robustness compared to traditional approaches, highlighting the value of not discarding geometric information in image analysis.
- AI-mediated 3D telepresence aims to create highly photorealistic 3D humans for digital presence in remote spaces.
- Generative models, particularly 3D GANs, can learn photorealistic 3D representations from large collections of 2D images without requiring 3D ground truth or multi-view data.
- Efficient 3D rendering and training of 3D GANs can be achieved using methods like tri-plane representation and smart sampling techniques to overcome computational challenges.
- Advanced techniques allow for one-shot, real-time 3D lifting from a single RGB image, enabling high-fidelity 3D reconstruction and novel view synthesis.
- A ViT-based encoder can generate real-time (16ms) canonicalized 3D triplane representations from single 2D images, preserving person-specific textures and generalizing to out-of-domain inputs.
- Training the 3D reconstruction model solely on synthetic data from EG3D, augmented with camera variations, is effective for achieving robust performance.
- The Dream-in-4D approach leverages diffusion models in a two-step process (static 3D optimization followed by motion learning) to generate dynamic 4D scenes from text and/or image prompts.
- Current AI-based relighting techniques often ‘fake’ physical accuracy, and achieving truly physically correct material rendering remains a significant challenge, especially for complex surfaces like hair and eyes.
- Implicit representations are widely considered a largely solved problem in geometric deep learning, having seen substantial progress.
- Explicit representations, particularly achieving interpretability and robustness, remain a significant and active area of research and a major challenge.
- The community is actively seeking new ideas and fostering collaborations to address the complex challenges in geometric deep learning.
- The discussion format encourages participants to contribute their perspectives and insights on the field’s advancements and future directions.
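One takeaway above credits a 6D parameterization of rotation with avoiding discontinuities. The sketch below shows the commonly used continuous 6D representation, in which two 3-vectors are orthonormalized with Gram-Schmidt and completed with a cross product. Whether RDPN uses exactly this construction is an assumption, and the function names are illustrative rather than taken from the talk.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def rotation_from_6d(rep):
    """Map six raw numbers to a 3x3 rotation matrix (list of rows).

    rep[:3] and rep[3:] are treated as two unnormalized column vectors.
    Gram-Schmidt makes them orthonormal; the third column is their cross
    product, so the result is a proper rotation (determinant +1).
    """
    a1, a2 = list(rep[:3]), list(rep[3:])
    b1 = normalize(a1)
    proj = dot(b1, a2)
    b2 = normalize([x - proj * y for x, y in zip(a2, b1)])
    b3 = cross(b1, b2)
    # b1, b2, b3 are the columns; emit the matrix row by row.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because any pair of non-parallel vectors maps to a valid rotation, a network can regress the six numbers directly, without the wrap-around jumps that Euler angles or quaternions introduce.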
Methods / Models / Datasets Mentioned
3D-2D distill · 3DMV · ARAP (As-Rigid-As-Possible) · ArcFace loss · BPNet · Binary Cross Entropy (BCE) loss · Breaking Best Challenge · CMX · COCO · Class Specific Classifier (CSC) · Conv · DFTr · Deep SDF · DeepLabV3 · DeltaConv · DenseFusion · Dice loss · Dream-in-4D · EG3D · EG3D-PTI · ES6D · Edge Conv · EdgeConv · Explicit representations · FFB6D · FiberMesh · GANSpace · Gaussian Mixture Model (GMM) · Image SkelNetOn · ImageNet · Implicit representations · KNN · KPConv · L1 loss · LGAfford-Net · LPD3 · LPIPS loss · LPointNet · LUMOS · Laplacian mesh editing · Linear Blend Skinning (LBS) · Local Geometric Descriptor (LGD) · MAXINE 3D Eye Contact · MLP · Marching Cubes · Monster Mash · Moving Least Squares Surfaces · MvPointNet · NVIDIA Facial Reenactment (NVFAIR) Dataset · NeRF · Neural Implicit Function · Neural Rendering · Occupancy Network · Parametric SkelNetOn · Pixel SkelNetOn · Point SkelNetOn · PointNet · Poisson Surface Reconstruction · RDPN6D · ReLU activation function · ResNet18 · ResNet34 · SDF · SDFConnect · SENS · SIREN · SPAGHETTI · SSD-6D · SSMA · Scale-Invariant Translation Estimation (SITE) · SeFa · SegFormer-B2 · ShapeConv · Shared MLP · Sphere tracing · StyleCLIP · StyleGAN · StyleGAN2 · StyleGAN3 · SuperRes · Synthetic Disc · T6D-Direct · Transformer Decoder · Tri-plane Encoder · Tri-plane representation · U-Net34 · ViT · ViT (Vision Transformer) · ViT-based Encoder · VoxelCores · YOLO · iWires
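Several entries in the list above (SDF, Deep SDF, Sphere tracing) revolve around signed distance functions. As a rough illustration of how sphere tracing queries one, the sketch below marches a ray forward by the distance the SDF reports at each step. The analytic unit sphere is a stand-in assumption for a learned network, and the function names and default tolerances are illustrative, not taken from any of the talks.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point p to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def sphere_trace(sdf, origin, direction, max_steps=128, eps=1e-6, t_max=100.0):
    """Return the ray parameter t of the first surface hit, or None on a miss.

    The SDF value at a point is a safe step size: no surface can lie closer
    than that distance, so the ray never steps through the shape.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    t = 0.0
    for _ in range(max_steps):
        p = [o + t * u for o, u in zip(origin, unit)]
        dist = sdf(p)
        if dist < eps:
            return t  # close enough to the zero level set
        t += dist
        if t > t_max:
            break  # ray left the region of interest
    return None
```

For example, `sphere_trace(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 0.0, 1.0))` reaches the unit sphere after travelling a distance of 2. With a neural SDF, `sdf` would be a network evaluation, and the step is only safe to the extent that the network approximates a true distance.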
Topics
2D/3D matching · 3D Data Processing · 3D GANs · 3D Reconstruction · 3D Telepresence · 3D reconstruction · 3D representation · 4D scene generation · 6D object pose estimation · AI-mediated telepresence · ARAP (As-Rigid-As-Possible) · Collaboration · Computer vision · Deep Learning for Geometric Computing · Deep Learning for Geometry · Deep learning · Dense correspondences · Diffusion models · Explicit representations · Generative AI · Geometric Shape Analysis · Geometric deep learning · Human-Made Objects · Image Segmentation · Implicit Neural Representations (INR) · Implicit representations · Interpretability · Linear Blend Skinning (LBS) · Medial Axis Extraction · Mesh Editing · Multimodality · Neural Radiance Fields (NeRF) · Object symmetry · Occlusion handling · Open discussion · Photorealistic Faces · RGB-D images · Relighting · Research challenges · Residual representation · Robustness to Noise · SPAGHETTI (Shape Modeling) · Shape Understanding · SkelNetOn Challenges · Sketch-based Modeling · Synthetic data · Tri-plane Representation · Triplane representation · Unsupervised Learning · ViT encoder
Notes
Open for commentary — connections to other work, critiques, follow-up reading.