High-Accuracy Image Retrieval From Brain Scans: MindEye2 Results

by Image Recognition, April 11th, 2025

Too Long; Didn't Read

Explore MindEye2's performance in image retrieval, reaching near-ceiling accuracy in matching brain activity to viewed images.

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References


A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

3.3 Image/Brain Retrieval

Image retrieval metrics quantify the level of fine-grained image information contained in the fMRI embeddings. Many images in the test set contain similar semantic content (e.g., 14 images of zebras), so if the model can identify the exact image corresponding to a given brain sample, this demonstrates that the fMRI embeddings contain fine-grained image content. MindEye2 improves upon MindEye1’s retrieval evaluations, reaching near-ceiling performance on the retrieval benchmarks used in previous papers (Lin et al., 2022; Scotti et al., 2023) (Table 1). Further, retrieval performance remained competitive when MindEye2 was trained with only 1 hour of data.


Computing the retrieval metrics in Table 1 involved the following steps. The goal of brain retrieval is to identify, out of a pool of brain samples, the correct sample of brain activity that gave rise to the seen image. The seen image is converted to an OpenCLIP image embedding (or CLIP image embedding, depending on the contrastive space used in the paper), and cosine similarity is computed between this embedding and its respective fMRI latent (e.g., from the retrieval submodule) as well as 299 other randomly selected fMRI latents from the test set. For each test sample, success is counted if the cosine similarity is greatest between the ground-truth OpenCLIP/CLIP image embedding and its respective fMRI embedding (i.e., top-1 retrieval performance, chance = 1/300). We specifically used 300 random samples because this was the approach used in previous work. We averaged retrieval performance across test samples and repeated the entire process 30 times to account for variability in the random sampling of batches. For image retrieval, the same procedure is used except the image and brain samples are flipped, such that the goal is to find the corresponding seen image in the image pool given the provided brain sample.
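To make the procedure concrete, here is a minimal sketch of the top-1 retrieval computation, assuming paired, row-aligned tensors `clip_embeds` and `fmri_latents` of shape (n_test, d). The tensor names, pool construction, and use of PyTorch are illustrative assumptions, not the released MindEye2 code.

```python
# Minimal sketch of the top-1 retrieval metric described above.
# Assumes `clip_embeds` and `fmri_latents` are row-aligned (n_test, d)
# tensors; names and pool construction are illustrative assumptions.
import torch
import torch.nn.functional as F

def top1_retrieval_accuracy(clip_embeds, fmri_latents,
                            pool_size=300, n_repeats=30, seed=0):
    """Average top-1 brain-retrieval accuracy over random pools (chance = 1/pool_size)."""
    clip_embeds = F.normalize(clip_embeds, dim=-1)
    fmri_latents = F.normalize(fmri_latents, dim=-1)
    g = torch.Generator().manual_seed(seed)
    n = clip_embeds.shape[0]
    accs = []
    for _ in range(n_repeats):
        hits = 0
        for i in range(n):
            # Pool = the true fMRI latent plus (pool_size - 1) random distractors.
            perm = torch.randperm(n, generator=g)
            distractors = perm[perm != i][: pool_size - 1]
            pool = torch.cat([fmri_latents[i : i + 1], fmri_latents[distractors]])
            sims = clip_embeds[i] @ pool.T  # cosine similarities (unit-norm vectors)
            hits += int(sims.argmax().item() == 0)  # index 0 holds the true latent
        accs.append(hits / n)
    return sum(accs) / len(accs)
```

Image retrieval is the mirror case: the pool is built from image embeddings and queried with a single fMRI latent, so the same function applies with the two tensors swapped.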

3.4 Brain Correlation

To measure whether a reconstruction is faithful to the original brain activity that evoked it, we examine whether it accurately predicts that brain activity when input to an encoding model pretrained to predict brain activity from images (Gaziv et al., 2022). Encoding models provide a more comprehensive analysis of the proximity between images and brain activity (Naselaris et al., 2011), offering a measure of reconstruction quality that is perhaps more informative than the image metrics traditionally used for assessment. This alignment is measured independently of the stimulus image, so it can assess reconstruction quality even when the ground-truth image is unknown, making it extendable to new data in a variety of domains, including covert visual content such as mental images. Given that human judgment is grounded in human brain activity, brain correlation metrics may also align more closely with the judgments of human observers. The brain correlation metrics in Table 3 are calculated with the GNet encoding model (St-Yves et al., 2022) using the protocol from Kneeland et al. (2023c). "Unrefined" reconstructions performed best, perhaps because refinement sacrifices brain alignment (and reconstruction performance as assessed by some metrics) for the additional boost in perceptual alignment gained from enforcing a naturalistic prior.
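As a rough illustration of this metric, the sketch below passes a reconstruction through a pretrained image-to-fMRI encoding model and correlates the predicted voxel responses with the measured ones. Here `encoding_model` stands in for GNet; its interface is an assumption for illustration, not the actual GNet API.

```python
# Rough sketch of the brain-correlation metric: run the reconstruction
# through a pretrained image-to-fMRI encoding model and correlate predicted
# with measured voxel activity. `encoding_model` stands in for GNet; its
# interface here is an assumption, not the actual GNet API.
import torch

def brain_correlation(recon_image, measured_voxels, encoding_model):
    """Pearson correlation between predicted and measured voxel responses."""
    with torch.no_grad():
        predicted = encoding_model(recon_image.unsqueeze(0)).squeeze(0)
    p = predicted - predicted.mean()
    m = measured_voxels - measured_voxels.mean()
    return ((p @ m) / (p.norm() * m.norm() + 1e-8)).item()
```

Because the metric never touches the stimulus image, the same function can score reconstructions of mental imagery or other covert content where no ground-truth image exists.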


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contributor;

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contributor;

(4) Reese Kneeland, University of Minnesota, core contributor;

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).

