Table of Links
2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
A.13 Human Preference Experiments
We conducted two-alternative forced-choice experiments with 58 human raters online. We probed three comparisons intermixed into the same behavioral experiment, with each comparison consisting of 1200 trials sampled evenly from the 1000 NSD test samples across the 4 subjects who completed all 40 scanning sessions (subjects 1, 2, 5, 7). The total 3600 experimental trials were shuffled, and 87 trials were presented to each rater. Raters were recruited through the Prolific platform, with the experimental tasks hosted on Meadows. All other experimental details follow the protocol used in Kneeland et al. (2023c).
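To make the trial structure concrete, below is a minimal sketch of how the 3600 intermixed trials could be assembled, assuming even sampling of test images per NSD subject and per comparison; the variable names and the exact sampling scheme are our assumptions for illustration and are not the authors' released experiment code.

```python
import random

# Illustrative sketch of the 2AFC trial structure described above
# (hypothetical names; not the authors' released experiment code).
COMPARISONS = ["mindeye2_vs_random", "refined_vs_unrefined", "mindeye2_vs_brain_diffuser"]
NSD_SUBJECTS = [1, 2, 5, 7]        # subjects who completed all 40 scanning sessions
TRIALS_PER_COMPARISON = 1200       # sampled evenly across the 4 subjects
NUM_TEST_SAMPLES = 1000            # NSD test samples

trials = []
for comparison in COMPARISONS:
    per_subject = TRIALS_PER_COMPARISON // len(NSD_SUBJECTS)  # 300 trials per subject
    for subject in NSD_SUBJECTS:
        # One even-sampling choice: 300 of the 1000 test samples per subject/comparison.
        for sample in random.sample(range(NUM_TEST_SAMPLES), per_subject):
            trials.append({"comparison": comparison, "subject": subject, "sample": sample})

random.shuffle(trials)             # 3600 intermixed trials across all three comparisons
print(len(trials))                 # -> 3600
```

Each rater would then see a shuffled subset of these trials, with the three comparison types interleaved within a single session.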
A.13.1 MINDEYE2 VS. MINDEYE2 (RANDOM)
The first comparison was a two-way identification task in which participants were asked to select which of two images was more similar to a ground truth image. The two images provided for comparison were both reconstructions from MindEye2 (40-hour): one was a randomly selected reconstruction from the test set and the other was the correct, corresponding reconstruction. Raters correctly identified the corresponding reconstruction 97.82% of the time (p < 0.001). This establishes a new SOTA for human-rated image identification accuracy: the only other papers to report such an experiment are Takagi and Nishimoto (2022), whose method achieved 84.29%, and Kneeland et al. (2023c), whose MindEye1 + BOI (brain-optimized inference) enhancement of MindEye1 achieved 95.62%. The method in Takagi and Nishimoto (2022) is different from the "+Decoded Text" method we compare against in Table 1, which was released in a later technical report (Takagi and Nishimoto, 2023) and does not report human subjective evaluations.
A.13.2 MINDEYE2 (REFINED) VS. MINDEYE2 (UNREFINED)
The second comparison was the same task as the first but this time comparing refined MindEye2 (40-hour) reconstructions against unrefined MindEye2 reconstructions (both correctly corresponding to the appropriate fMRI activity). This comparison was designed to empirically confirm the subjective improvements in naturalistic quality provided by MindEye2’s refinement step. This is particularly important to confirm because the quantitative evaluation metrics displayed in Table 4 sometimes preferred the unrefined reconstructions. Refined reconstructions were rated as more similar to the ground truth images 71.94% of the time (p < 0.001), demonstrating that the final refinement step improves reconstruction quality and accuracy when assessed by humans.
A.13.3 MINDEYE2 (1-HOUR) VS. BRAIN DIFFUSER (1-HOUR)
The final comparison was likewise the same task, but this time comparing reconstructions from MindEye2 against reconstructions from the Brain Diffuser method (Ozcelik and VanRullen, 2023), where both methods were trained using only the first hour of scanning data from the 4 NSD subjects. MindEye2 reconstructions were preferred 53.01% of the time (p = 0.044), a statistically significant improvement in scaling performance over the previous state-of-the-art model when using only 1 hour of training data. This confirms the results in the main text that MindEye2 achieves SOTA in the low-sample 1-hour setting. We visualize cases where Brain Diffuser (1-hour) was preferred over MindEye2 (1-hour) in Appendix Figure 12. We observed that Brain Diffuser reconstructions were often preferred when both methods produced low-quality reconstructions, but MindEye2 was "confidently" wrong (its refinement step enforces a naturalistic prior) whereas Brain Diffuser produced distorted outputs containing subtle elements of the target image. This may indicate that human raters prefer distorted outputs that retain recognizable features of the target over outputs from a model whose naturalistic prior loses those features.
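As a point of reference for the reported preference rates and p-values, the snippet below shows one way to test a 2AFC preference rate against the 50% chance level with a binomial test; the trial count and the choice of test (one-sided, pooled across raters) are assumptions for illustration, so the output need not reproduce the exact p-values reported above.

```python
from scipy.stats import binomtest

# Illustrative check of a 2AFC preference rate against chance (assumed counts).
n_trials = 1200                          # trials in one comparison (assumed all rated)
n_preferred = round(0.5301 * n_trials)   # e.g., MindEye2 preferred on ~53.01% of trials

result = binomtest(n_preferred, n_trials, p=0.5, alternative="greater")
print(f"rate = {n_preferred / n_trials:.4f}, one-sided p = {result.pvalue:.4f}")
```

A per-rater analysis (e.g., testing each rater's accuracy and aggregating) is another common choice and could yield somewhat different significance levels.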
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC), a core contributor;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), a core contributor;
(4) Reese Kneeland, University of Minnesota, a core contributor;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).