1st Workshop on Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics

Event: CVPR 2024 Workshop · Duration: 323 min

Abstract

This recording opens with an introduction to the ‘1st Workshop on Urban Scene Modeling’ at CVPR 2024, which focuses on structured semantic 3D reconstruction and urban scene modeling. After the opening address, a keynote by Iro Armeni discusses the need for a global resource cadastre to advance circular economy, climate mitigation, and urban sustainability. A second keynote by Konrad Schindler reviews the evolution of urban modeling techniques, from traditional photogrammetry to deep learning methods such as U-nets, implicit surfaces, and transformer-based next-token prediction, with emphasis on the challenges and advances in generating detailed 3D building models from various data sources.

The second segment features presentations from the Building3D Challenge, which focuses on 3D building wireframe reconstruction from various input data. The talks cover transformer-based methods that use 2D height maps, self-supervised pre-training strategies, geometry-based prediction techniques, and methods for reconstructing buildings from airborne LiDAR point clouds. The segment also gives an overview of the competition’s evaluation metrics, prizes, and winning solutions.

The third segment is a series of research presentations on 3D reconstruction and semantic segmentation. Topics include 3D polygonal mesh reconstruction from point clouds using autoregressive transformers, the introduction of large-scale LiDAR and scene datasets (ECLAIR, DL3DV-10K) for semantic segmentation and 3D vision, and methods for reconstructing regularized 3D building models from LiDAR data. The talk introducing the MegaScenes dataset addresses challenges in 3D reconstruction, such as SfM failures caused by symmetries and visual twins, with a doppelganger classifier. The segment also covers a 3D point cloud dataset for underground utilities and a novel architecture for real-time RGB-D semantic segmentation on mobile platforms.

The final segment begins with Hao (Richard) Zhang’s discussion of the characteristics of 3D urban and CAD models, emphasizing their structural regularities and the importance of both exterior and interior details for functionality. He then traces the evolution of neural 3D representations, highlighting his work on structured neural representations such as BAE-Net, BSP-Net, CAPRI-Net, DPA-Net, and Split-and-Fit, which generate modeling primitives, and contrasting them with the rendering primitives of NeRF and Gaussian Splatting. The segment concludes with Yasutaka Furukawa’s talk on pushing the frontiers of 3D content generation, covering ‘very loose geometric modeling’ with multi-view diffusion models (MVDiffusion++) for 3D object reconstruction from sparse inputs, and ‘very complex CAD models’ through the generation of B-Rep models with diffusion models.

Speakers

Iro Armeni — Gradient Δ Spaces Research Group, Assistant Professor, Civil and Env. Engineering, Stanford University
Konrad Schindler — Photogrammetry and Remote Sensing, ETH Zurich
Myron Brown
Dantao Tu — Institute of Automation, Chinese Academy of Sciences
Hongxin Yang — University of Calgary
Kunal Chelani — Chalmers University of Technology
Yujia Liu — ETH Zurich
Anand Umashankar — SharperShape, Aalto University
Lu Ling — Purdue University, Department of Computer Science
Jean-Philippe Bauchet — LuxCarta, Inria
Noah Snavely — Cornell Tech & Google DeepMind
Simon B. Jensen — Aalborg University, Denmark
Hao (Richard) Zhang — SFU and Amazon
Siqi Du — Shenzhen University
Yizhi Wang — GrUVI Lab, Simon Fraser University
Wallace Lira — GrUVI Lab, Simon Fraser University
Wenqi Wang — GrUVI Lab, Simon Fraser University
Ali Mahdavi-Amiri — GrUVI Lab, Simon Fraser University
Yasutaka Furukawa — Wayve - Principal Scientist, SFU - Associate Professor

Talks (22)

00:00:00 — Workshop Introduction
- An introduction to the 1st Workshop on Urban Scene Modeling at CVPR 2024, outlining its goals, challenges, keynote speakers, and schedule.
00:09:37 — Iro Armeni: Building a global resource cadastre: Advancing Circular Economy, Climate Mitigation, and Urban Sustainability
- This keynote addresses the environmental impact of the built environment and proposes a computer vision-based method using street view imagery to create a global resource cadastre for sustainable urban development.
00:49:21 — Konrad Schindler: Urban Modelling in the Deep Learning Age: From U-nets to Next-token Prediction
- This keynote provides a historical overview of urban modeling, detailing the evolution from traditional photogrammetry to deep learning methods like U-nets, implicit surfaces, and transformer-based next-token prediction for generating detailed 3D building models.
01:20:47 — Konrad Schindler: Take-home Message and Q&A
- Discusses the take-home message on deep networks for urban modeling: they represent a-priori knowledge well but are limited when hard constraints must be enforced. Followed by Q&A.
01:25:34 — Myron Brown: Competition
- Overview of the Building3D Challenge competition timeline, evaluation metrics (Average Corner Offset, Precision, Recall, F1, Wireframe Edit Distance), prizes, and the list of participating teams and winners.
01:27:47 — Dantao Tu: BWFormer: 3D Building Wireframe Reconstruction from 2D Height Maps with Transformer
- Presents the 1st place solution, BWFormer, which reconstructs 3D building wireframes from 2D height maps using a transformer-based approach, detailing the pipeline, corner detection, and edge detection.
01:30:47 — Hongxin Yang: Self-supervised Pre-Training Method for 3D Wireframe Reconstruction from Building3D dataset
- Presents the 2nd place solution, focusing on a self-supervised pre-training method for 3D wireframe reconstruction, utilizing edge point identification and a pretrain-finetune strategy.
01:33:47 — Kunal Chelani: A Geometry-Based Approach to Building Roof Wireframes
- Presents the 3rd place solution, a geometry-based approach for building roof wireframe prediction, involving vertex set prediction via triangulation and scaled monocular depth, and adjacency prediction.
01:35:47 — Yujia Liu: Point2Building: Reconstructing Buildings from Airborne LiDAR Point Clouds
- Presents a method for reconstructing buildings from airborne LiDAR point clouds, focusing on adaptive polygonal meshes and a hierarchical modeling approach to address challenges like diverse designs and varying point density.
02:41:35 — Yujia Liu: 3D Polygonal Mesh Reconstruction from Point Clouds
- This talk presents a method for 3D polygonal mesh reconstruction from point clouds, which decomposes the problem into vertex modeling and face modeling using two separate autoregressive transformer modules, enhanced by iterative processing and validated on the Zurich City dataset.
02:43:00 — Anand Umashankar: ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation
- This presentation introduces ECLAIR, a high-fidelity aerial LiDAR dataset for semantic segmentation. It addresses the scarcity of large-scale airborne laser scanning (ALS) data, discusses the advantages of ALS over photogrammetry and mobile LiDAR, and reports benchmarks in which pseudo labels improve performance on long-tail classes.
02:43:55 — Lu Ling: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-Based 3D Vision
- This talk introduces DL3DV-10K, a large-scale and diverse scene dataset for deep learning-based 3D vision, featuring over 10,000 4K videos with camera poses from various global locations, annotated for scene diversity and complexity, addressing limitations of existing datasets in scale and real-world representation.
02:44:40 — Jean-Philippe Bauchet: SimpliCity: Reconstructing Buildings with Simple Regularized 3D Models
- This presentation introduces SimpliCity, a pipeline for reconstructing lightweight, regularized 3D building models from airborne LiDAR point clouds, using a 2.5D planimetric arrangement and a regularization procedure to achieve fidelity, simplicity, geometric guarantees, and efficiency, outperforming existing methods in compactness while maintaining accuracy.
02:45:40 — Noah Snavely: MegaScenes: Reconstructing the World’s Landmarks
- This talk introduces MegaScenes, a large-scale dataset for reconstructing the world’s landmarks, addressing challenges in 3D reconstruction like SfM failures due to symmetries and “visual twins” by developing a doppelganger classifier, and outlining future ambitions for 4D reconstruction and vision-3D-language models.
02:50:15 — Simon B. Jensen: OpenTrench3D: A Photogrammetric 3D Point Cloud Dataset for Semantic Segmentation of Underground Utilities
- This presentation introduces OpenTrench3D, a photogrammetric 3D point cloud dataset for semantic segmentation of underground utilities, featuring 310 fully annotated point clouds across 7 distinct areas, and demonstrating significant performance improvements through fine-tuning pre-trained models on new utility classes.
02:50:55 — Hao (Richard) Zhang: Learning Differentiable Primitive Representations for CAD and Urban Modeling
- This talk revisits past work on urban 3D reconstruction and modeling, including “SmartBoxes” for interactive construction and “Structure-preserving retargeting” for irregular architectures, highlighting the importance of structural and geometric regularity in urban scene models for learning differentiable primitive representations.
02:51:25 — Siqi Du: AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation
- This presentation introduces AsymFormer, an asymmetrical cross-modal architecture for real-time RGB-D semantic segmentation on mobile platforms. It addresses limitations of previous methods with an asymmetric backbone and with LAFS and CMA modules for efficient feature fusion, and it reports high accuracy and real-time performance on the NYUv2 dataset.
04:02:23 — Hao (Richard) Zhang: Characteristics of 3D Urban/CAD Models and Neural Representations
- This segment discusses the characteristics of 3D urban and CAD models, highlighting their structural and geometric regularities, and the importance of both exterior and interior details due to their functional purpose.
04:18:53 — Hao (Richard) Zhang: Summaries and Remaining Challenges in 3D Modeling
- The speaker summarizes the benefits of parametric curves/surfaces and structured CAD representations for editing and functional reasoning, highlighting remaining challenges in capturing structural regularity and reconstructing interiors, and introduces Slice3D as a potential solution for interiors.
04:24:03 — Hao (Richard) Zhang: Thank you and Acknowledgment
- The speaker concludes by thanking students, postdocs, and collaborators for their contributions to the presented works.
04:24:13 — Yasutaka Furukawa: Pushing the Frontiers of 3D Content Generation: Tales of very complex CAD models & very loose geometric modeling
- Yasutaka Furukawa discusses pushing the frontiers of 3D content generation, focusing on very complex CAD models and very loose geometric modeling, starting with the latter.
06:24:23 — Yasutaka Furukawa: 3D Reconstruction: From Super-Human to AI
- Yasutaka Furukawa notes that 3D reconstruction was considered ‘super-human’ 20 years ago, citing examples such as ‘Building Rome in a Day’ and Apple Maps Flyover, which reconstruct precise geometry from images.
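
The corner-level metrics named in the Building3D Challenge overview (Corner Precision, Corner Recall, F1, Average Corner Offset) can be illustrated with a minimal Python sketch. The matching rule (greedy, closest pairs first), the `max_dist` threshold, and the function names `match_corners` and `corner_scores` are assumptions made for illustration only. The challenge's official evaluation protocol is not described in this summary and may differ.

```python
import math


def match_corners(pred, gt, max_dist):
    """Greedily pair predicted and ground-truth corners, closest pairs first.

    Assumed rule (not taken from the challenge): each corner is used at most
    once, and a pair counts only if its distance is <= max_dist.
    """
    candidates = sorted(
        (math.dist(p, g), i, j)
        for i, p in enumerate(pred)
        for j, g in enumerate(gt)
    )
    used_pred, used_gt, matches = set(), set(), []
    for d, i, j in candidates:
        if d > max_dist:
            break  # candidates are sorted, so all remaining pairs are too far
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches.append((i, j, d))
    return matches


def corner_scores(pred, gt, max_dist=0.1):
    """Return precision, recall, F1, and mean offset of matched corners."""
    matches = match_corners(pred, gt, max_dist)
    tp = len(matches)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Average offset over matched pairs only; undefined when nothing matches.
    aco = sum(d for _, _, d in matches) / tp if tp else float("nan")
    return {"precision": precision, "recall": recall, "f1": f1, "aco": aco}


# Hypothetical toy data: two of three predicted corners lie near a true corner.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.05), (5.0, 5.0, 5.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
scores = corner_scores(pred, gt, max_dist=0.1)
print(scores)
```

In this toy case two corners are matched, so precision and recall are both 2/3. Edge Precision and Edge Recall could be computed the same way once corners are matched, by counting predicted edges whose two endpoints map to a ground-truth edge; that extension is likewise an assumption rather than the challenge's stated definition.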

Key Takeaways

The built environment significantly contributes to resource depletion and climate change, necessitating sustainable construction and renovation strategies.
Creating a global resource cadastre, a detailed database of building materials, is crucial for managing resources and promoting circularity, but current manual auditing methods are costly and inefficient.
Computer vision and deep learning, particularly with large language models like GPT-4, offer scalable solutions for extracting building information (style, materials, condition, heritage value) from widely available street view imagery.
Advanced deep learning techniques, including implicit surfaces, neural fields, and transformer-based models, are pushing the boundaries of 3D urban modeling, enabling the generation of detailed polygonal building models from point clouds with improved accuracy and geometric plausibility.
Deep networks excel at learning and representing a-priori knowledge about object shape and layout in urban modeling, surpassing traditional hand-coded rules.
While deep networks are powerful, they may still require hand-coded rules or post-processing to enforce hard constraints and ensure geometric plausibility.
Synthetic data, especially from realistic LiDAR simulators and high-quality CAD models, holds potential for training robust urban modeling systems.
The Building3D Challenge highlights diverse approaches to 3D wireframe reconstruction, from transformer-based methods on height maps to geometry-driven techniques, pushing the boundaries of automated urban modeling.
Autoregressive transformer models can effectively reconstruct 3D polygonal meshes from point clouds by sequentially generating vertices and faces, with iterative processing enhancing reliability.
High-fidelity aerial LiDAR and large-scale scene datasets are crucial for advancing semantic segmentation and 3D vision, especially when addressing data scarcity, diversity, and real-world complexity.
Addressing challenges like SfM failures due to symmetries and ‘visual twins’ is critical for robust 3D reconstruction, and specialized classifiers can significantly improve reconstruction quality by filtering erroneous matches.
Future directions in 3D vision include extending reconstruction to 4D (time-varying scenes), integrating 3D data with language models for semantically rich reconstructions, and developing efficient multimodal architectures for real-time performance on mobile platforms.
3D urban and CAD models possess inherent structural and geometric regularities, and their functionality necessitates detailed representations of both exteriors and interiors.
Traditional neural rendering methods like NeRF and Gaussian Splatting produce rendering primitives, which are unstructured and not suitable for modeling, editing, or functional reasoning.
Developing structured neural representations (e.g., B-Reps, CSG trees) that learn modeling primitives directly from sparse inputs is crucial for enabling editable, reusable, and functionally meaningful 3D content generation.
Generating complex CAD models, particularly B-Rep models, is challenging due to their arbitrary topological structures, but can be addressed by converting them into fixed-dimensional tree structures amenable to diffusion models.

Methods / Models / Datasets Mentioned

3D-GAN
AsymFormer
AtlasNet
Average Corner Offset (ACO)
BAE-Net (Branched Autoencoder)
BSP-Net (Binary Space Partitioning Network)
BWFormer
BrepGen
Building3D
CAPRI-Net
CMA (Cross-Modal Attention)
CNNs
COLMAP
City3D
Corner Precision (CP)
Corner Recall (CR)
DETR-based networks
DINO-ViT
DPA-Net (Differentiable Primitive Assembly Network)
DeepCAD
DeepSDF
DualContour
Edge Precision (EP)
Edge Recall (ER)
F1 Score
GIS
GPT-4
Gaussian Splatting (3DGS)
Grounding DINO
Hierarchical Neural Coding for Controllable CAD Model Generation
HoHo
IM-Net
IMPLICity
K-Means clustering
LAFS (Local Attention-Guided Feature Selection)
LEAP (Liberate Sparse-view 3D Modeling from Camera Poses)
LiDAR
LoFTR
MVCNN
MVDiffusion++
Masked Autoencoder for SSL
Minkowski Engine
NeRF (Neural Radiance Field)
Neural Scene Chronology
O-CNN
OCC-Net (Occupancy Networks)
OpenSCAD
Optimal Transport
PBWR
PartNet-Mobility
Point2Building
Point2Roof
Point2Surf
PointMetaBase
PointNet(++)
PointNext
PointVector
RANSAC
Reconstructing compact building
Res16UNet14C
ResDEPTH
SIFT
Scene Representation Transformer (SRT)
Seam carving
Segment Anything
Shuffle mechanism
SkexGen
Slice3D
SmartBoxes
Sparse 3D CNN
Split-and-Fit
Structure-preserving retargeting
S³3DR
Transformer Decoder Block
Transformer Encoder
Transformers
U-Net
Wireframe Edit Distance (WED)
pixelNeRF

Topics

3D Building Wireframe Reconstruction · 3D Reconstruction · 3D urban models · Autoregressive Models · B-Rep · Building Information Modeling · CAD models · Circular Economy · Climate Mitigation · Competition Evaluation Metrics · Dataset Creation · Deep Learning · Deep Learning for 3D Reconstruction · Diffusion models · Doppelganger Detection · LiDAR Data · LiDAR Point Clouds · Modeling primitives · Multi-view synthesis · Multimodal Learning · Neural 3D representations · Photogrammetry · Point Clouds · Self-supervised Learning · Semantic Segmentation · Structured representations · Transformer Networks · Urban Modeling · Urban Scene Modeling · Urban Sustainability
