Table of Links
2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
2.5 Model Inference
The pipeline for reconstruction inference is depicted in Figure 2. First, the diffusion prior's predicted OpenCLIP ViT-bigG/14 image latents are fed through our SDXL unCLIP model to output a pixel image. We observed that these reconstructions were often distorted ("unrefined") due to an imperfect mapping to bigG space (see Figure 3). This may be explained by the greater flexibility afforded by mapping into the higher-dimensional OpenCLIP bigG latent space.

To increase image realism, we feed the unrefined reconstructions from SDXL unCLIP through base SDXL via image-to-image (Meng et al., 2022), with text conditioning guidance from MindEye2's predicted image captions (section 2.3). We skip the first 50% of denoising diffusion timesteps, starting the process from the noised image encoding of the unrefined reconstruction. We simply take the first samples output by these stochastic models, without any special second-order selection. Refinement using base SDXL subjectively improves the quality of image outputs without strongly affecting low- or high-level image metrics.
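The timestep-skipping idea behind this image-to-image refinement can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's implementation: it uses a standard linear DDPM noise schedule (hypothetical values) to forward-noise an "unrefined" latent to the halfway timestep, which is where refinement would begin its remaining denoising steps.

```python
import numpy as np

# Toy sketch of where image-to-image refinement starts (hypothetical schedule
# values; the paper skips the first 50% of denoising timesteps).
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # standard linear DDPM schedule
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal-retention factors

def noise_to_timestep(x0, t, rng):
    """Forward-noise latent x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
unrefined_latent = rng.standard_normal((4, 64, 64))  # stand-in for the unCLIP output's encoding
t_start = T // 2                                      # skip the first 50% of timesteps
x_t = noise_to_timestep(unrefined_latent, t_start, rng)
# Refinement would then run only the remaining denoising steps from x_t,
# conditioned on MindEye2's predicted caption.
```

Starting from a partially-noised encoding of the unrefined reconstruction (rather than pure noise) preserves its global layout while letting the refiner repair local distortions; in the diffusers library this corresponds to an image-to-image `strength` of roughly 0.5.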
The final "refined" reconstructions are produced by combining the base SDXL outputs with the pixel images from the low-level submodule via a simple weighted average (4:1 ratio). This weighted averaging step increases performance on low-level image metrics while minimally affecting the reconstructions' subjective appearance.
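This blending step is a plain pixel-space weighted average. A minimal sketch, with illustrative function and variable names (not from the paper's codebase):

```python
import numpy as np

def blend_reconstructions(sdxl_img, low_level_img, ratio=(4, 1)):
    """Weighted average of two same-shape images; ratio defaults to the
    paper's 4:1 weighting of base SDXL over the low-level submodule."""
    w1, w2 = ratio
    return (w1 * sdxl_img + w2 * low_level_img) / (w1 + w2)

# Stand-in images in [0, 1] with an assumed 256x256 RGB shape.
rng = np.random.default_rng(0)
sdxl_img = rng.random((256, 256, 3))
low_level_img = rng.random((256, 256, 3))
refined = blend_reconstructions(sdxl_img, low_level_img)
```

The 4:1 ratio keeps the refined SDXL output visually dominant while mixing in enough of the low-level submodule's output to improve pixel-level metrics.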
For retrieval inference, only the retrieval submodule’s outputs are necessary. Nearest neighbor retrieval can be performed via cosine similarity between the submodule’s OpenCLIP ViT-bigG/14 embeddings and all the ViT-bigG/14 embeddings corresponding to the images in the desired image pool.
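Nearest-neighbor retrieval by cosine similarity can be sketched as follows. This is a minimal illustration with assumed names and an assumed pool size; 1280 is the OpenCLIP ViT-bigG/14 image embedding dimensionality.

```python
import numpy as np

def retrieve_top_k(query_emb, pool_embs, k=5):
    """Return indices of the k pool embeddings most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                      # cosine similarity against the whole pool
    return np.argsort(-sims)[:k]      # indices sorted by descending similarity

# Toy pool of 300 candidate-image embeddings (illustrative data).
rng = np.random.default_rng(0)
pool = rng.standard_normal((300, 1280))
# A query that is a slightly perturbed copy of pool item 42, standing in for
# the retrieval submodule's predicted embedding of that image.
query = pool[42] + 0.01 * rng.standard_normal(1280)
top = retrieve_top_k(query, pool, k=5)
# image 42 should rank first, since the query is a near-duplicate of it
```

Because both sets of embeddings are L2-normalized first, the dot product equals cosine similarity, and the whole pool can be scored in a single matrix-vector product.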
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contributor;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contributor;
(4) Reese Kneeland, University of Minnesota, core contributor;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).