1st Workshop on Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics
Event: CVPR 2024 Workshop · Duration: 323 min
Abstract
The first segment introduces the ‘1st Workshop on Urban Scene Modeling’ at CVPR 2024, highlighting its focus on structured semantic 3D reconstruction and urban scene modeling. It features an opening address and a keynote speech by Iro Armeni, who discusses the critical need for a global resource cadastre to advance circular economy, climate mitigation, and urban sustainability. It also includes a keynote by Konrad Schindler, who reviews the evolution of urban modeling techniques, from traditional photogrammetry to modern deep learning methods like U-nets, implicit surfaces, and transformer-based next-token prediction, emphasizing the challenges and advancements in generating detailed 3D building models from various data sources.
The second segment features a series of presentations from the Building3D Challenge, focusing on 3D building wireframe reconstruction from various input data. The talks cover diverse approaches, including transformer-based methods utilizing 2D height maps, self-supervised pre-training strategies, geometry-based prediction techniques, and methods for reconstructing buildings from airborne LiDAR point clouds. The segment also includes a detailed overview of the competition’s evaluation metrics, prizes, and the winning solutions, highlighting the challenges and advancements in urban modeling and 3D reconstruction.
The third segment features a series of research presentations on 3D reconstruction and semantic segmentation. Topics include 3D polygonal mesh reconstruction from point clouds using autoregressive transformers, the introduction of large-scale LiDAR and scene datasets (ECLAIR, DL3DV-10K) for semantic segmentation and 3D vision, and methods for reconstructing regularized 3D building models from LiDAR data. Challenges in 3D reconstruction, such as SfM failures and visual twins, are addressed with a doppelganger classifier, leading to the MegaScenes dataset. The segment also covers a 3D point cloud dataset for underground utilities and a novel architecture for real-time RGB-D semantic segmentation on mobile platforms.
The fourth segment begins by discussing the characteristics of 3D urban and CAD models, emphasizing their structural regularities and the importance of both exterior and interior details for functionality. It then traces the evolution of neural 3D representations, highlighting the speaker’s work on structured neural representations like BAE-Net, BSP-Net, CAPRI-Net, DPA-Net, and Split-and-Fit for generating modeling primitives, contrasting them with rendering primitives from NeRF and Gaussian Splatting. The segment concludes with Yasutaka Furukawa’s talk on pushing the frontiers of 3D content generation, focusing on ‘very loose geometric modeling’ using multi-view diffusion models (MVDiffusion++) for 3D object reconstruction from sparse inputs, and ‘very complex CAD models’ by generating B-Rep models with diffusion models.
Speakers
- Iro Armeni — Gradient Δ Spaces Research Group, Assistant Professor, Civil and Env. Engineering, Stanford University
- Konrad Schindler — Photogrammetry and Remote Sensing, ETH Zurich
- Myron Brown
- Dantao Tu — Institute of Automation, Chinese Academy of Sciences
- Hongxin Yang — University of Calgary
- Kunal Chelani — Chalmers University of Technology
- Yujia Liu — ETH Zurich
- Anand Umashankar — SharperShape, Aalto University
- Lu Ling — Purdue University, Department of Computer Science
- Jean-Philippe Bauchet — LuxCarta, Inria
- Noah Snavely — Cornell Tech & Google DeepMind
- Simon B. Jensen — Aalborg University, Denmark
- Hao (Richard) Zhang — SFU and Amazon
- Siqi Du — Shenzhen University
- Yizhi Wang — GrUVI Lab, Simon Fraser University
- Wallace Lira — GrUVI Lab, Simon Fraser University
- Wenqi Wang — GrUVI Lab, Simon Fraser University
- Ali Mahdavi-Amiri — GrUVI Lab, Simon Fraser University
- Yasutaka Furukawa — Wayve - Principal Scientist, SFU - Associate Professor
Talks (22)
- 00:00:00 — Workshop Introduction
- An introduction to the 1st Workshop on Urban Scene Modeling at CVPR 2024, outlining its goals, challenges, keynote speakers, and schedule.
- 00:09:37 — Iro Armeni: Building a global resource cadastre: Advancing Circular Economy, Climate Mitigation, and Urban Sustainability
- This keynote addresses the environmental impact of the built environment and proposes a computer vision-based method using street view imagery to create a global resource cadastre for sustainable urban development.
- 00:49:21 — Konrad Schindler: Urban Modelling in the Deep Learning Age: From U-nets to Next-token Prediction
- This keynote provides a historical overview of urban modeling, detailing the evolution from traditional photogrammetry to deep learning methods like U-nets, implicit surfaces, and transformer-based next-token prediction for generating detailed 3D building models.
- 01:20:47 — Konrad Schindler: Take-Home Message and Q&A
- Discussion of the take-home message on deep networks for urban modeling, emphasizing a-priori knowledge representation and the limitations in enforcing hard constraints, followed by Q&A.
- 01:25:34 — Myron Brown: Competition
- Overview of the Building3D Challenge competition timeline, evaluation metrics (Average Corner Offset, Precision, Recall, F1, Wireframe Edit Distance), prizes, and the list of participating teams and winners.
- 01:27:47 — Dantao Tu: BWFormer: 3D Building Wireframe Reconstruction from 2D Height Maps with Transformer
- Presents the 1st place solution, BWFormer, which reconstructs 3D building wireframes from 2D height maps using a transformer-based approach, detailing the pipeline, corner detection, and edge detection.
- 01:30:47 — Hongxin Yang: Self-supervised Pre-Training Method for 3D Wireframe Reconstruction from Building3D dataset
- Presents the 2nd place solution, focusing on a self-supervised pre-training method for 3D wireframe reconstruction, utilizing edge point identification and a pretrain-finetune strategy.
- 01:33:47 — Kunal Chelani: A Geometry-Based Approach to Building Roof Wireframes
- Presents the 3rd place solution, a geometry-based approach for building roof wireframe prediction, involving vertex set prediction via triangulation and scaled monocular depth, and adjacency prediction.
- 01:35:47 — Yujia Liu: Point2Building: Reconstructing Buildings from Airborne LiDAR Point Clouds
- Presents a method for reconstructing buildings from airborne LiDAR point clouds, focusing on adaptive polygonal meshes and a hierarchical modeling approach to address challenges like diverse designs and varying point density.
- 02:41:35 — Yujia Liu: 3D Polygonal Mesh Reconstruction from Point Clouds
- This talk presents a method for 3D polygonal mesh reconstruction from point clouds, which decomposes the problem into vertex modeling and face modeling using two separate autoregressive transformer modules, enhanced by iterative processing and validated on the Zurich City dataset.
- 02:43:00 — Anand Umashankar: ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation
- This presentation introduces ECLAIR, a high-fidelity aerial LiDAR dataset for semantic segmentation, addressing the scarcity of large-scale ALS data and its advantages over photogrammetry and mobile LiDAR, with benchmarks showing improved performance using pseudo labels for long-tail classes.
- 02:43:55 — Lu Ling: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-Based 3D Vision
- This talk introduces DL3DV-10K, a large-scale and diverse scene dataset for deep learning-based 3D vision, featuring over 10,000 4K videos with camera poses from various global locations, annotated for scene diversity and complexity, addressing limitations of existing datasets in scale and real-world representation.
- 02:44:40 — Jean-Philippe Bauchet: SimpliCity: Reconstructing Buildings with Simple Regularized 3D Models
- This presentation introduces SimpliCity, a pipeline for reconstructing lightweight, regularized 3D building models from airborne LiDAR point clouds, using a 2.5D planimetric arrangement and a regularization procedure to achieve fidelity, simplicity, geometric guarantees, and efficiency, outperforming existing methods in compactness while maintaining accuracy.
- 02:45:40 — Noah Snavely: MegaScenes: Reconstructing the World’s Landmarks
- This talk introduces MegaScenes, a large-scale dataset for reconstructing the world’s landmarks, addressing challenges in 3D reconstruction like SfM failures due to symmetries and “visual twins” by developing a doppelganger classifier, and outlining future ambitions for 4D reconstruction and vision-3D-language models.
- 02:50:15 — Simon B. Jensen: OpenTrench3D: A Photogrammetric 3D Point Cloud Dataset for Semantic Segmentation of Underground Utilities
- This presentation introduces OpenTrench3D, a photogrammetric 3D point cloud dataset for semantic segmentation of underground utilities, featuring 310 fully annotated point clouds across 7 distinct areas, and demonstrating significant performance improvements through fine-tuning pre-trained models on new utility classes.
- 02:50:55 — Hao (Richard) Zhang: Learning Differentiable Primitive Representations for CAD and Urban Modeling
- This talk revisits past work on urban 3D reconstruction and modeling, including “SmartBoxes” for interactive construction and “Structure-preserving retargeting” for irregular architectures, highlighting the importance of structural and geometric regularity in urban scene models for learning differentiable primitive representations.
- 02:51:25 — Siqi Du: Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time RGB-D semantic segmentation
- This presentation introduces Asymformer, a novel asymmetrical cross-modal architecture for real-time RGB-D semantic segmentation on mobile platforms, which addresses limitations of previous methods by using an asymmetric backbone and introducing LAFS and CMA modules for efficient feature fusion, achieving high accuracy and real-time performance on the NYUv2 dataset.
- 04:02:23 — Hao (Richard) Zhang: Characteristics of 3D Urban/CAD Models and Neural Representations
- This segment discusses the characteristics of 3D urban and CAD models, highlighting their structural and geometric regularities, and the importance of both exterior and interior details due to their functional purpose.
- 04:18:53 — Hao (Richard) Zhang: Summaries and Remaining Challenges in 3D Modeling
- The speaker summarizes the benefits of parametric curves/surfaces and structured CAD representations for editing and functional reasoning, highlighting remaining challenges in capturing structural regularity and reconstructing interiors, and introduces Slice3D as a potential solution for interiors.
- 04:24:03 — Hao (Richard) Zhang: Thank you and Acknowledgment
- The speaker concludes by thanking students, postdocs, and collaborators for their contributions to the presented works.
- 04:24:13 — Yasutaka Furukawa: Pushing the Frontiers of 3D Content Generation: Tales of very complex CAD models & very loose geometric modeling
- Yasutaka Furukawa outlines the two themes of the talk, very complex CAD models and very loose geometric modeling, and starts with the latter.
- 06:24:23 — Yasutaka Furukawa: 3D Reconstruction: From Super-Human to AI
- Yasutaka Furukawa highlights that 3D reconstruction was considered ‘super-human’ 20 years ago, showcasing examples like the ‘Building Rome in a Day’ project and Apple Maps Flyover, which demonstrate impressive capabilities in reconstructing precise geometry from images.
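The corner-level metrics named in the Building3D Challenge overview (precision, recall, F1, and an average corner offset) can be illustrated with a small sketch. This is not the challenge's official scoring code: the greedy closest-pair matching, the distance threshold of 0.1, and the function names `match_corners` and `corner_scores` are all assumptions made for illustration, since this summary does not state the exact matching rule.

```python
import math


def match_corners(pred, gt, threshold):
    """Greedily pair predicted and ground-truth corners, closest pairs first.

    Each corner is used at most once, and pairs farther apart than
    `threshold` are never matched. Returns (pred_idx, gt_idx, dist) tuples.
    NOTE: greedy matching is an assumption, not the official challenge rule.
    """
    candidates = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(pred)
        for j, g in enumerate(gt)
    )
    used_pred, used_gt, matches = set(), set(), []
    for dist, i, j in candidates:
        if dist > threshold:
            break  # sorted by distance, so every remaining pair is too far
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches.append((i, j, dist))
    return matches


def corner_scores(pred, gt, threshold=0.1):
    """Compute precision, recall, F1, and mean offset of matched corners."""
    matches = match_corners(pred, gt, threshold)
    tp = len(matches)  # matched predictions count as true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    mean_offset = sum(d for _, _, d in matches) / tp if tp else float("nan")
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mean_offset": mean_offset,
    }


# Toy data (made up): three predicted corners, four ground-truth corners.
pred_corners = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
gt_corners = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
scores = corner_scores(pred_corners, gt_corners, threshold=0.1)
print(scores)
```

Edge precision and recall could be sketched the same way by treating an edge as correct when both of its endpoints are matched corners, though that rule is likewise an assumption here.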
Key Takeaways
- The built environment significantly contributes to resource depletion and climate change, necessitating sustainable construction and renovation strategies.
- Creating a global resource cadastre, a detailed database of building materials, is crucial for managing resources and promoting circularity, but current manual auditing methods are costly and inefficient.
- Computer vision and deep learning, particularly with large language models like GPT-4, offer scalable solutions for extracting building information (style, materials, condition, heritage value) from widely available street view imagery.
- Advanced deep learning techniques, including implicit surfaces, neural fields, and transformer-based models, are pushing the boundaries of 3D urban modeling, enabling the generation of detailed polygonal building models from point clouds with improved accuracy and geometric plausibility.
- Deep networks excel at learning and representing a-priori knowledge about object shape and layout in urban modeling, surpassing traditional hand-coded rules.
- While deep networks are powerful, they may still require hand-coded rules or post-processing to enforce hard constraints and ensure geometric plausibility.
- Synthetic data, especially from realistic LiDAR simulators and high-quality CAD models, holds potential for training robust urban modeling systems.
- The Building3D Challenge highlights diverse approaches to 3D wireframe reconstruction, from transformer-based methods on height maps to geometry-driven techniques, pushing the boundaries of automated urban modeling.
- Autoregressive transformer models can effectively reconstruct 3D polygonal meshes from point clouds by sequentially generating vertices and faces, with iterative processing enhancing reliability.
- High-fidelity aerial LiDAR and large-scale scene datasets are crucial for advancing semantic segmentation and 3D vision, especially when addressing data scarcity, diversity, and real-world complexity.
- Addressing challenges like SfM failures due to symmetries and ‘visual twins’ is critical for robust 3D reconstruction, and specialized classifiers can significantly improve reconstruction quality by filtering erroneous matches.
- Future directions in 3D vision include extending reconstruction to 4D (time-varying scenes), integrating 3D data with language models for semantically rich reconstructions, and developing efficient multimodal architectures for real-time performance on mobile platforms.
- 3D urban and CAD models possess inherent structural and geometric regularities, and their functionality necessitates detailed representations of both exteriors and interiors.
- Traditional neural rendering methods like NeRF and Gaussian Splatting produce rendering primitives, which are unstructured and not suitable for modeling, editing, or functional reasoning.
- Developing structured neural representations (e.g., B-Reps, CSG trees) that learn modeling primitives directly from sparse inputs is crucial for enabling editable, reusable, and functionally meaningful 3D content generation.
- Generating complex CAD models, particularly B-Rep models, is challenging due to their arbitrary topological structures, but can be addressed by converting them into fixed-dimensional tree structures amenable to diffusion models.
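To make the contrast between modeling primitives and rendering primitives more concrete, here is a minimal sketch of a CSG tree: a shape built from a few parametric primitives combined with boolean operations, which can be queried and edited by changing a handful of parameters. The class names (`Box`, `Sphere`, `Union`, `Intersection`, `Difference`) and the toy building are hypothetical and written only for illustration; this is not the representation used by BSP-Net, CAPRI-Net, or any other method listed here.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its minimum and maximum corners."""
    lo: Point
    hi: Point

    def contains(self, p: Point) -> bool:
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


@dataclass(frozen=True)
class Sphere:
    center: Point
    radius: float

    def contains(self, p: Point) -> bool:
        sq_dist = sum((x - c) ** 2 for x, c in zip(p, self.center))
        return sq_dist <= self.radius ** 2


@dataclass(frozen=True)
class Union:
    left: object
    right: object

    def contains(self, p: Point) -> bool:
        return self.left.contains(p) or self.right.contains(p)


@dataclass(frozen=True)
class Intersection:
    left: object
    right: object

    def contains(self, p: Point) -> bool:
        return self.left.contains(p) and self.right.contains(p)


@dataclass(frozen=True)
class Difference:
    left: object
    right: object

    def contains(self, p: Point) -> bool:
        return self.left.contains(p) and not self.right.contains(p)


# A made-up building: a main body with a doorway carved out, plus an annex.
body = Box((0.0, 0.0, 0.0), (10.0, 6.0, 3.0))
door = Box((4.0, -0.1, 0.0), (5.0, 0.5, 2.0))
annex = Box((10.0, 1.0, 0.0), (13.0, 5.0, 2.5))
building = Union(Difference(body, door), annex)

# Editing is a local change to one primitive: widen the doorway.
wide_door = Box((3.0, -0.1, 0.0), (6.0, 0.5, 2.0))
edited = Union(Difference(body, wide_door), annex)

print(building.contains((1.0, 1.0, 1.0)))  # point inside the main body
print(building.contains((4.5, 0.2, 1.0)))  # point inside the carved doorway
```

The point of the sketch is that each node carries meaning (a wall volume, a doorway, an annex), so a single parameter change edits the shape, whereas an unstructured set of rendering primitives offers no such handle.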
Methods / Models / Datasets Mentioned
3D-GAN · Asymformer · AtlasNet · Average Corner Offset (ACO) · BAE-Net (Branched Autoencoder) · BSP-Net (Binary Space Partitioning Network) · BWFormer · BrepGen · Building3D · CAPRI-Net · CMA (Cross-Modal Attention) · CNNs · COLMAP · City3D · Corner Precision (CP) · Corner Recall (CR) · DETR-based networks · DINO-VIT · DPA-Net (Differentiable Primitive Assembly Network) · DeepCAD · DeepSDF · DualContour · Edge Precision (EP) · Edge Recall (ER) · F1 Score · GIS · GPT-4 · Gaussian Splatting (3DGS) · Grounding Dino · Hierarchical Neural Coding for Controllable CAD Model Generation · HoHo · IM-Net · IMPLICity · K-Means clustering · LAFS (Local Attention-Guided Feature Selection) · LEAP (Liberate Sparse-view 3D Modeling from Camera Poses) · LiDAR · LoFTR · MVCNN · MVDiffusion++ · Masked Autoencoder for SSL · Minkowski Engine · NeRF (Neural Radiance Field) · Neural Scene Chronology · O-CNN · OCC-Net (Occupancy Networks) · OpenSCAD · Optimal Transport · PBWR · PartNet-Mobility · Point2Building · Point2Roof · Point2Surf · PointMetaBase · PointNet(++) · PointNext · PointVector · RANSAC · Reconstructing compact building · Res16UNet14C · ResDEPTH · SIFT · Scene Representation Transformer (SRT) · Seam carving · Segment Anything · Shuffle mechanism · SkexGen · Slice3D · SmartBoxes · Sparse 3D CNN · Split-and-Fit · Structure-preserving retargeting · S³3DR · Transformer Decoder Block · Transformer Encoder · Transformers · U-Net · U-nets · Wireframe Edit Distance (WED) · pixelNeRF
Topics
3D Building Wireframe Reconstruction · 3D Reconstruction · 3D urban models · Autoregressive Models · B-Rep · Building Information Modeling · CAD models · Circular Economy · Climate Mitigation · Competition Evaluation Metrics · Dataset Creation · Deep Learning · Deep Learning for 3D Reconstruction · Diffusion models · Doppelganger Detection · LiDAR Data · LiDAR Point Clouds · Modeling primitives · Multi-view synthesis · Multimodal Learning · Neural 3D representations · Photogrammetry · Point Clouds · Self-supervised Learning · Semantic Segmentation · Structured representations · Transformer Networks · Urban Modeling · Urban Scene Modeling · Urban Sustainability
Notes
Open for commentary — connections to other work, critiques, follow-up reading.