CVPR 2024 Workshop
Event: CVPR 2024 Workshop · Duration: 310 min
Abstract
This presentation introduces the Open-Sora project, a framework for video generation. It covers three components, labeled Framework-1 through Framework-3: Efficient 3D VideoVAE, Diffusion Transformer, and Data Preparing. The speaker emphasizes the importance of an efficient 3D VideoVAE for long video generation, highlighting its architecture and its training efficiency on Ascend NPU. The Diffusion Transformer is presented as a key component for video generation, with details on its architecture, training efficiency, and generation results. The data preparation part outlines how high-quality video data is crawled and how high-quality video text annotations are generated. The speaker then discusses a diffusion-based generalist model for dense vision tasks, comparing Pixel Diffusion with Latent Diffusion and outlining recipes for diffusion-based generalists. The presentation closes with comparisons with prior art, visualizations of results, and a discussion of future work.
Speakers
- Hao Su — Peking University
Talks (50)
- 00:00:00 — Hao Su: Open-Sora Framework-1: Efficient 3D VideoVAE
- This talk introduces Open-Sora Framework-1, focusing on efficient 3D VideoVAE for long video generation, highlighting its architecture, training efficiency on Ascend NPU, and generation results.
- 00:00:00 — Hao Su: Open-Sora Framework-2: Diffusion Transformer
- This talk introduces Open-Sora Framework-2, focusing on the Diffusion Transformer for video generation, detailing its architecture, training efficiency on Ascend NPU, and generation results.
- 00:00:00 — Hao Su: Open-Sora Framework-3: Data Preparing
- This talk introduces Open-Sora Framework-3, focusing on data preparation for video generation: crawling high-quality video data and generating high-quality video text annotations.
- 00:00:00 — Hao Su: Toward a Diffusion-Based Generalist for Dense Vision Tasks
- This talk introduces a diffusion-based generalist model for dense vision tasks.
- 00:00:00 — Hao Su: Pixel Diffusion vs. Latent Diffusion
- This talk compares Pixel Diffusion and Latent Diffusion in the context of the diffusion-based generalist.
- 00:00:00 — Hao Su: Recipes for Diffusion-Based Generalists
- This talk outlines recipes for building diffusion-based generalists.
- 00:00:00 — Hao Su: Comparisons with Prior Art
- This talk compares the diffusion-based generalist with prior art.
- 00:00:00 — Hao Su: Visualization
- This talk presents visualizations of the results.
- 00:00:00 — Hao Su: What is next
- This talk discusses directions for future work.
- 00:00:00 — Hao Su: Conclusion
- This talk summarizes the conclusions of the presentation.
Key Takeaways
- Open-world 3D ability is emerging faster than anticipated, driven by advancements in generative AI.
- The Open-Sora project leverages 2D foundation models to annotate data and build priors for 3D tasks, enabling efficient and high-quality 3D content generation.
- The framework incorporates 3D-native priors to achieve 3D-consistent and high-quality results, addressing challenges in generating realistic and coherent 3D content.
- The project explores a diffusion-based generalist model for dense vision tasks, demonstrating competitive performance across various benchmarks.
- Future work will focus on improving objective metrics, achieving more consistent multi-views, enhancing robustness, and exploring interactive editing for 3D content.
Methods / Models / Datasets Mentioned
Open-Sora Framework-1: Efficient 3D VideoVAE · Open-Sora Framework-2: Diffusion Transformer · Open-Sora Framework-3: Data Preparing · Diffusion Generalist · Pixel Diffusion · Latent Diffusion · MaskGIT · Muse · StyleDrop · MAGVIT · MAGVIT-v2 · VideoPoet · MeshLRM · MeshFormer · PartSLIP · PointSAM · Zero123++
Topics
Open-Sora · Video Generation · 3D VideoVAE · Diffusion Transformer · Data Preparation · Diffusion-Based Generalist · Dense Vision Tasks · Pixel Diffusion · Latent Diffusion · Training Efficiency · Ascend NPU · Image Generation · Text-to-Image Generation · Text-to-Video Generation · Multi-View Prediction · 3D Reconstruction · Part Segmentation · Open-World 3D · Foundation Models · 3D-Native Priors