Table of Links
2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
2.5 Model Inference
The pipeline for reconstruction inference is depicted in Figure 2. First, the diffusion prior's predicted OpenCLIP ViT-bigG/14 image latents are fed through our SDXL unCLIP model to output a pixel image. We observed that these reconstructions were often distorted ("unrefined") due to an imperfect mapping to bigG space (see Figure 3). This may be explained by the greater flexibility afforded by mapping into the higher-dimensional OpenCLIP bigG latent space.

To increase image realism, we feed the unrefined reconstructions from SDXL unCLIP through base SDXL via image-to-image (Meng et al., 2022), with text conditioning guidance from MindEye2's predicted image captions (section 2.3). We skip the first 50% of denoising diffusion timesteps, starting the process from the noised image encoding of the unrefined reconstruction. We simply take the first samples output by these stochastic models, without any special second-order selection. Refinement using base SDXL subjectively improves the quality of image outputs without strongly affecting low- or high-level image metrics.
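The timestep-skipping idea behind this image-to-image refinement can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's implementation: it uses a standard linear DDPM noise schedule (hypothetical values) to forward-noise an "unrefined" latent to the halfway timestep, which is where refinement would begin its remaining denoising steps.

```python
import numpy as np

# Toy sketch of where image-to-image refinement starts (hypothetical schedule
# values; the paper skips the first 50% of denoising timesteps).
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # standard linear DDPM schedule
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal-retention factors

def noise_to_timestep(x0, t, rng):
    """Forward-noise latent x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
unrefined_latent = rng.standard_normal((4, 64, 64))  # stand-in for the unCLIP output's encoding
t_start = T // 2                                      # skip the first 50% of timesteps
x_t = noise_to_timestep(unrefined_latent, t_start, rng)
# Refinement would then run only the remaining denoising steps from x_t,
# conditioned on MindEye2's predicted caption.
```

Starting from a partially-noised encoding of the unrefined reconstruction (rather than pure noise) preserves its global layout while letting the refiner repair local distortions; in the diffusers library this corresponds to an image-to-image `strength` of roughly 0.5.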
The final "refined" reconstructions are produced by combining the base SDXL outputs with the pixel images from the low-level submodule via a simple weighted average (4:1 ratio). This weighted averaging step increases performance on low-level image metrics while minimally affecting the reconstructions' subjective appearance.
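This blending step is a plain pixel-space weighted average. A minimal sketch, with illustrative function and variable names (not from the paper's codebase):

```python
import numpy as np

def blend_reconstructions(sdxl_img, low_level_img, ratio=(4, 1)):
    """Weighted average of two same-shape images; ratio defaults to the
    paper's 4:1 weighting of base SDXL over the low-level submodule."""
    w1, w2 = ratio
    return (w1 * sdxl_img + w2 * low_level_img) / (w1 + w2)

# Stand-in images in [0, 1] with an assumed 256x256 RGB shape.
rng = np.random.default_rng(0)
sdxl_img = rng.random((256, 256, 3))
low_level_img = rng.random((256, 256, 3))
refined = blend_reconstructions(sdxl_img, low_level_img)
```

The 4:1 ratio keeps the refined SDXL output visually dominant while mixing in enough of the low-level submodule's output to improve pixel-level metrics.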
For retrieval inference, only the retrieval submodule’s outputs are necessary. Nearest neighbor retrieval can be performed via cosine similarity between the submodule’s OpenCLIP ViT-bigG/14 embeddings and all the ViT-bigG/14 embeddings corresponding to the images in the desired image pool.
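Nearest-neighbor retrieval by cosine similarity can be sketched as follows. This is a minimal illustration with assumed names and an assumed pool size; 1280 is the OpenCLIP ViT-bigG/14 image embedding dimensionality.

```python
import numpy as np

def retrieve_top_k(query_emb, pool_embs, k=5):
    """Return indices of the k pool embeddings most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity against the whole pool
    return np.argsort(-sims)[:k]      # indices sorted by descending similarity

# Toy pool of 300 candidate-image embeddings (illustrative data).
rng = np.random.default_rng(0)
pool = rng.standard_normal((300, 1280))
# A query that is a slightly perturbed copy of pool item 42, standing in for
# the retrieval submodule's predicted embedding of that image.
query = pool[42] + 0.01 * rng.standard_normal(1280)
top = retrieve_top_k(query, pool, k=5)
# image 42 should rank first, since the query is a near-duplicate of it
```

Because both sets of embeddings are L2-normalized first, the dot product equals cosine similarity, and the whole pool can be scored in a single matrix-vector product.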
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contributor;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contributor;
(4) Reese Kneeland, University of Minnesota, core contributor;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).