Using MLLMs for Diffusion Synthesis That Synergizes Both Sides: How Is This Possible?

by @textmodels


Too Long; Didn't Read

Multimodal signals typically exhibit modality-specific information that has distinct structure but complementary semantics (Dong et al., 2023). This complementary property allows us to utilize deep language comprehension to enhance cross-modal image generation (Saharia et al., 2022). However, the potential of multimodal creation to improve comprehension remains largely unexplored.

Abstract and 1 Introduction

2 Background & Problem Statement

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

3 DreamLLM

3.1 End-to-End Interleaved Generative Pretraining (I-GPT)

3.2 Model Training

4 Experiments and 4.1 Multimodal Comprehension

4.2 Text-Conditional Image Synthesis

4.3 Multimodal Joint Creation & Comprehension

5 Discussions

5.1 Synergy Between Creation & Comprehension?

5.2 What Is Learned by DreamLLM?

6 Related Works

7 Conclusions and References


A Additional Experiments

B Additional Qualitative Examples

C Implementation Details

D Additional Related Works

E Limitations, Failure Cases & Future Works

2.1 How Can We Use MLLMs for Diffusion Synthesis That Synergizes Both Sides?

Multimodal signals typically exhibit modality-specific information that has distinct structure but complementary semantics (Dong et al., 2023). This complementary property allows us to utilize deep language comprehension to enhance cross-modal image generation (Saharia et al., 2022). However, the potential of multimodal creation to improve comprehension remains largely unexplored.



Learning Objective. Our aim is to leverage MLLMs to model image distributions via direct sampling in pixel space. Here, the pretrained SD functions as a score metric that distills the learned data distribution. This approach is similar to Score Distillation Sampling (Poole et al., 2023) (SDS, also known as Score Jacobian Chaining (Wang et al., 2023a)). In this context, the image posterior is learned in a DeepDream-like manner (Mordvintsev et al., 2015), using the MLLM's conditional parameterization.
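
As a rough illustration of this idea, the sketch below shows how a frozen, pretrained Stable Diffusion U-Net can act as such a score metric. It assumes diffusers-style `unet` and `scheduler` objects and a `cond_embeds` tensor produced by the MLLM; the function name and setup are ours, not DreamLLM's actual code. Because the U-Net is frozen, the denoising loss backpropagates only into the conditioning, so it is the MLLM side that absorbs the distilled score.

```python
import torch
import torch.nn.functional as F

def score_distillation_step(unet, scheduler, latents, cond_embeds):
    """One training step using a frozen SD U-Net as a score metric
    (illustrative sketch; assumes unet.requires_grad_(False) at setup)."""
    # Sample a random diffusion timestep per example and noise the latents.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # The frozen U-Net predicts the injected noise, conditioned on the
    # MLLM-derived embeddings (diffusers-style call signature).
    noise_pred = unet(noisy_latents, t,
                      encoder_hidden_states=cond_embeds).sample

    # Standard denoising objective. With the U-Net frozen, gradients
    # reach only the MLLM through cond_embeds, so the MLLM learns the
    # image posterior while the diffusion model stays fixed.
    return F.mse_loss(noise_pred, noise)
```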


Conditional Embeddings. Rather than converting the output space of MLLMs to align with CLIP, we propose to query MLLMs using learned embeddings. Consequently, the MLLM-enriched semantics serve as the diffusion conditioning, and the distribution is implicitly modeled through synthesis sampling.
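
A minimal sketch of what such learned query embeddings could look like, assuming a Hugging Face transformers-style `llm`; the class and parameter names (`DreamQueries`, `num_queries`, `cond_dim`, etc.) are hypothetical, not the paper's API. The learned queries ride at the end of the multimodal sequence, and their final hidden states, projected to the diffusion model's cross-attention width, replace CLIP text embeddings as the conditioning.

```python
import torch
import torch.nn as nn

class DreamQueries(nn.Module):
    """Learned queries that extract diffusion conditioning from an MLLM.
    Dimensions are assumptions: llm_dim for a LLaMA-style model,
    cond_dim for the cross-attention width of Stable Diffusion 1.x."""

    def __init__(self, num_queries=64, llm_dim=4096, cond_dim=768):
        super().__init__()
        # Learnable query embeddings appended after the multimodal context.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Maps LLM hidden states into the diffusion conditioning space.
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm, context_embeds):
        b = context_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Run context + queries through the MLLM (transformers-style call);
        # the queries causally attend to the full multimodal context.
        out = llm(inputs_embeds=torch.cat([context_embeds, q], dim=1),
                  output_hidden_states=True)
        hidden = out.hidden_states[-1]
        # Keep only the query positions as the conditioning sequence.
        return self.proj(hidden[:, -q.shape[1]:, :])
```

The output of `forward` would then play the role of `cond_embeds` in a denoising loss like the one sketched above, closing the loop between the MLLM and the diffusion sampler.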


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.


Authors:

(1) Runpei Dong, Xi’an Jiaotong University and intern at MEGVII;

(2) Chunrui Han, MEGVII Technology;

(3) Yuang Peng, Tsinghua University and intern at MEGVII;

(4) Zekun Qi, Xi’an Jiaotong University and intern at MEGVII;

(5) Zheng Ge, MEGVII Technology;

(6) Jinrong Yang, HUST and intern at MEGVII;

(7) Liang Zhao, MEGVII Technology;

(8) Jianjian Sun, MEGVII Technology;

(9) Hongyu Zhou, MEGVII Technology;

(10) Haoran Wei, MEGVII Technology;

(11) Xiangwen Kong, MEGVII Technology;

(12) Xiangyu Zhang, MEGVII Technology and project leader;

(13) Kaisheng Ma, Tsinghua University and corresponding author;

(14) Li Yi, Tsinghua University, corresponding author and project leader.