
Coin3D Outperforms Image-Based Methods in 3D Generation Accuracy


Too Long; Didn't Read

Coin3D outperforms state-of-the-art image-based 3D generation methods by integrating shape proxies for better multiview consistency and reconstruction. Compared to Wonder3D and SyncDreamer, it reduces artifacts, enhances perceptual quality, and improves text-object alignment. User studies confirm its superiority in accuracy and control, making it a breakthrough in interactive 3D modeling.

Abstract and 1 Introduction

2 Related Works

3 Method and 3.1 Proxy-Guided 3D Conditioning for Diffusion

3.2 Interactive Generation Workflow and 3.3 Volume Conditioned Reconstruction

4 Experiment and 4.1 Comparison on Proxy-based and Image-based 3D Generation

4.2 Comparison on Controllable 3D Object Generation, 4.3 Interactive Generation with Part Editing & 4.4 Ablation Studies

5 Conclusions, Acknowledgments, and References


SUPPLEMENTARY MATERIAL

A. Implementation Details

B. More Discussions

C. More Experiments

4 EXPERIMENTS

We first compare our method with image-based 3D generation methods in Sec. 4.1, and with controllable 3D object generation methods in Sec. 4.2. Then, we demonstrate the applicability of interactive generation with designated part editing in Sec. 4.3. Finally, we perform ablation studies to analyze the design of our framework in Sec. 4.4.

4.1 Comparison on Proxy-based and Image-based 3D Generation

So far, the most stable 3D object generation pipelines are mainly image-based, i.e., they take a single image as a conditioning input and then generate multiview images for reconstruction [Liu et al. 2023a; Long et al. 2023; Shi et al. 2023a] or direct 3D representations [Hong et al. 2023]. Unlike these methods, we use a coarse shape proxy as guidance throughout the entire interactive generation pipeline. Since all these methods use image conditions to bootstrap the diffusion model, we first compare our method with SOTA image-based generation methods (i.e., Wonder3D [Long et al. 2023] and SyncDreamer [Liu et al. 2023a]) using the same image candidates, where our method additionally takes coarse shapes as conditioning.


Figure 4: We compare our proxy-based generation method with image-based methods (i.e., Wonder3D [Long et al. 2023] and SyncDreamer [Liu et al. 2023a]) on the generated multiview images and the reconstructed textured meshes.


Table 1: We perform quantitative evaluation and user studies on the 3D generation task.


Qualitative comparison. We show the multiview images and the reconstructed textured meshes in Fig. 4. The predicted views and textured meshes from Wonder3D and SyncDreamer both exhibit artifacts (e.g., the distorted green turtle and yellow swimming ring in the first and third rows of Fig. 4 (b) and (c), the missing hollowed handrail and shortened legs in the second row of Fig. 4 (b), and the missing white creamy middle layer in the fourth row of Fig. 4 (c)). Thanks to proxy-guided conditioning and volume-conditioned reconstruction, our method synthesizes multiview images free of single-view ambiguity by complementing them with 3D context from the proxy (e.g., complete chairs with correct hollowed handrails in Fig. 4 (a)), and consistently reconstructs 3D objects with intact shapes and vivid appearance.


Quantitative comparison. We use the CLIP score [Radford et al. 2021] to evaluate the degree of text-object alignment, and ImageReward [He et al. 2023; Xu et al. 2023a] and GPTEvals3D [Wu et al. 2024] to evaluate the perceptual quality of the predicted multiview images. As presented in Table 1, our method achieves the best overall metrics, demonstrating that adding proxy-based conditioning improves the quality of 3D generation. Note that Wonder3D's ImageReward score is lower than SyncDreamer's due to the evaluator's bias regarding orthogonal image views, while their Elo scores [Elo 1967] evaluated by GPTEvals3D are comparable.
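
To make these metrics concrete, here is a minimal sketch of a CLIP-based text-object score and a standard pairwise Elo update [Elo 1967]. This is an illustration, not the paper's exact evaluation setup: the model checkpoint and K-factor are assumptions, and GPTEvals3D's internals may differ.

```python
# Minimal sketch: CLIP text-image score and a standard Elo update.
# Checkpoint choice and K-factor below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better match)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One pairwise Elo update: the winner gains rating proportional to the surprise."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    s_a = 1.0 if a_wins else 0.0
    delta = k * (s_a - expected_a)
    return r_a + delta, r_b - delta
```

A rendered view of a generated object scored against its text prompt gives the per-view CLIP score, while repeated pairwise comparisons (as in GPTEvals3D) accumulate into Elo ratings.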


User study. We also conduct a user study to compare our method with others. Following TEXTure [Richardson et al. 2023], we ask 30 users to rank the results of 35 test examples (presented in random order) by perceptual quality and content matching degree (w.r.t. the given image or text prompts), and assign scores according to the ranking (i.e., a score of 3 for the best-ranked result and 1 for the last). As reported in Table 1, our method achieves the best score among all methods. More details can be found in the supplementary material.
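
The rank-to-score aggregation can be sketched as follows; the method ids and example rankings are hypothetical, and this only assumes the scoring rule stated above (best rank gets the highest score, worst gets 1, averaged over users and examples).

```python
from statistics import mean

def ranking_scores(rankings, num_methods=3):
    """Average per-user scores: the best-ranked method gets `num_methods` points, the worst gets 1."""
    scores = {m: [] for m in range(num_methods)}
    for ranking in rankings:          # each ranking lists method ids, best first
        for rank, method in enumerate(ranking):
            scores[method].append(num_methods - rank)
    return {m: mean(s) for m, s in scores.items()}

# Hypothetical example: two users ranking methods 0, 1, 2 on one test case.
print(ranking_scores([[0, 1, 2], [0, 2, 1]]))  # -> {0: 3.0, 1: 1.5, 2: 1.5}
```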


Authors:

(1) Wenqi Dong, Zhejiang University; this work was conducted during his internship at PICO, ByteDance;

(2) Bangbang Yang, ByteDance; contributed equally to this work with Wenqi Dong;

(3) Lin Ma, ByteDance;

(4) Xiao Liu, ByteDance;

(5) Liyuan Cui, Zhejiang University;

(6) Hujun Bao, Zhejiang University;

(7) Yuewen Ma, ByteDance;

(8) Zhaopeng Cui, Zhejiang University (corresponding author).


This paper is available on arXiv under the CC BY 4.0 DEED license.