Authors:
(1) Jinge Wang, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(2) Zien Cheng, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA;
(3) Qiuming Yao, School of Computing, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
(4) Li Liu, College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA and Biodesign Institute, Arizona State University, Tempe, AZ 85281, USA;
(5) Dong Xu, Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA;
(6) Gangqing Hu, Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV 26506, USA ([email protected]).
Table of Links
4. Biomedical Text Mining and 4.1. Performance Assessments across typical tasks
4.2. Biological pathway mining
5.1. Human-in-the-Loop and 5.2. In-context Learning
6. Biomedical Image Understanding
7.1. Application in Applied Bioinformatics
7.2. Biomedical Database Access
7.3. Online tools for Coding with ChatGPT
7.4. Benchmarks for Bioinformatics Coding
8. Chatbots in Bioinformatics Education
9. Discussion and Future Perspectives
2. OMICS
Omics techniques are extensively employed in biomedical research, generating vast amounts of data that necessitate careful analysis to uncover significant discoveries. A novel application of GPT models in transcriptomics has been to annotate cell types in single-cell RNA sequencing data[16], traditionally a labor-intensive and expertise-demanding task. Leveraging the wealth of online texts that offer detailed descriptions of signature genes for various cell types, ChatGPT can efficiently identify cell types based on a brief list of marker genes, as demonstrated by Hou and Ji [16]. When evaluated across datasets encompassing numerous tissues and cell types, ChatGPT demonstrates strong concordance with manual annotations and surpasses several conventional methods[16]. However, given the undisclosed nature of GPT's training data and the potential for AI-generated errors, expert validation is recommended before leveraging its annotations in further research.
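For illustration, the sketch below shows how such an annotation query might be issued programmatically. It is a minimal sketch assuming the OpenAI Python client; the prompt wording, marker-gene lists, and model name are placeholders for demonstration, not the exact setup used by Hou and Ji [16].

```python
# Minimal sketch of prompting a GPT model to annotate cell types from marker genes.
# The prompt wording and marker-gene lists are illustrative only, not those of [16].
# Assumes the openai>=1.0 client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical marker genes for two clusters from a single-cell RNA-seq analysis.
clusters = {
    "cluster_0": ["CD3D", "CD3E", "IL7R", "CCR7"],
    "cluster_1": ["MS4A1", "CD79A", "CD79B"],
}

prompt = ("Identify the most likely cell type for each cluster of a human PBMC "
          "dataset, given its marker genes. Answer with one cell type per cluster.\n")
for name, genes in clusters.items():
    prompt += f"{name}: {', '.join(genes)}\n"

response = client.chat.completions.create(
    model="gpt-4",  # model name is an assumption; substitute as appropriate
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```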
The task of identifying protein-coding regions within DNA sequences plays a crucial role in genome annotation[17]. A key step in this process is the extraction of all open reading frames (ORFs), defined as continuous stretches of DNA that begin with a start codon and end with a stop codon, without any intervening stop codons. In a test involving three partial sequences from the Vaccinia virus, ChatGPT-4 demonstrates its ability to identify potential ORFs[18]. However, accurately pinpointing the longest ORFs requires additional instructions based on the Chain-of-Thought (CoT) approach (see Table 1 for terminologies cited in this review). The capacity of ChatGPT-4 to precisely assess the coding potential of each ORF remains to be investigated. Nonetheless, the chatbot correctly emphasizes that an ORF's length does not inherently indicate its coding potential and that experimental validation is needed to confirm it[18].
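As a point of reference, the following sketch implements the conventional ORF definition given above, scanning each forward reading frame from a start codon (ATG) to the first in-frame stop codon. It is a minimal conventional implementation for comparison, not the prompt-based procedure evaluated in [18], and the example sequence is made up.

```python
# Conventional ORF extraction on the forward strand: an ORF runs from ATG to the
# first in-frame stop codon, with no intervening stop codons.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, orf_sequence) tuples for ATG-to-stop ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] != "ATG":
                continue
            # Extend codon by codon until the first in-frame stop codon.
            for j in range(i + 3, len(seq) - 2, 3):
                if seq[j:j + 3] in STOPS:
                    if (j + 3 - i) // 3 >= min_codons:
                        orfs.append((i, j + 3, seq[i:j + 3]))
                    break
    return orfs

example = "AAATGGCATTGTAATGCCCTAA"  # toy sequence for illustration
for start, end, orf in sorted(find_orfs(example), key=lambda t: len(t[2]), reverse=True):
    print(start, end, orf)  # longest ORF printed first
```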
Evaluating GPT models in genomics necessitates benchmark datasets with established ground truths. GeneTuring[19] serves this role with 600 questions related to gene nomenclature, genomic locations, functional characterization, sequence alignment, etc. When tested on this dataset, GPT-3 excels in extracting gene names and identifying protein-coding genes, while ChatGPT (GPT-3.5) and New Bing show marked improvements. Nevertheless, all models face challenges with SNP and alignment questions[19]. This limitation is effectively addressed by GeneGPT[20], which utilizes Codex to consult the National Center for Biotechnology Information (NCBI) database.
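To make this concrete, the sketch below issues the kind of NCBI E-utilities query that GeneGPT delegates to generated code. It is an illustrative example rather than GeneGPT's actual implementation; the JSON field names are assumptions that should be verified against the live API.

```python
# Illustrative NCBI E-utilities lookup (not GeneGPT's own code): find a gene by
# symbol with esearch, then retrieve its map location with esummary.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def gene_location(symbol, organism="human"):
    """Look up a gene symbol and report its chromosome and map location."""
    search = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "gene", "term": f"{symbol}[sym] AND {organism}[orgn]", "retmode": "json"},
        timeout=30,
    ).json()
    ids = search["esearchresult"]["idlist"]
    if not ids:
        return None
    summary = requests.get(
        f"{EUTILS}/esummary.fcgi",
        params={"db": "gene", "id": ids[0], "retmode": "json"},
        timeout=30,
    ).json()
    record = summary["result"][ids[0]]  # field names assumed; check live response
    return record.get("chromosome"), record.get("maplocation")

print(gene_location("BRCA1"))  # expected to report chromosome 17
```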
Bioinfo-Bench[21], currently under development, aims to assess LLMs in bioinformatics through 150 multiple-choice questions covering various topics, such as sequence analysis and phylogenetics, and 20 questions on RNA sequence mutations. ChatGPT-3.5 outperforms other LLMs, such as Llama-7B and Galactica-30B. However, its role in generating distractor options (wrong answers) for these questions might artificially boost its performance, suggesting a need for cautious interpretation when using it to self-evaluate.
3. GENETICS
In North America, 34% of genetic counselors incorporate ChatGPT into their practice, especially for administrative tasks[22]. This integration marks a significant shift toward leveraging AI in genetic counseling and underscores the importance of evaluating its reliability. Duong and Solomon[23] analyzed ChatGPT's performance on multiple-choice questions in human genetics sourced from Twitter. The chatbot achieves a 70% accuracy rate and excels in tasks requiring memorization over critical thinking. Further analysis by Alkuraya[24] revealed ChatGPT's limitations in calculating recurrence risks for genetic diseases. A notable instance involving cystic fibrosis testing showcases the chatbot deriving the correct equations yet faltering in the computation, raising concerns over its potential to mislead even professionals[24]. This tendency to produce plausible but incorrect responses is also identified as a significant risk by genetic counselors[22].
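For context, the snippet below works through a textbook-style recurrence-risk calculation of the kind at issue; the numbers are illustrative and do not reproduce the specific case examined in [24].

```python
# Illustrative recurrence-risk arithmetic for an autosomal recessive condition such
# as cystic fibrosis. The figures are a textbook-style example, not the case in [24].
known_carrier = 1.0          # parent 1: confirmed carrier
population_carrier = 1 / 25  # parent 2: assumed population carrier frequency
affected_if_both_carriers = 1 / 4  # autosomal recessive inheritance

risk = known_carrier * population_carrier * affected_if_both_carriers
print(f"Risk of an affected child: {risk:.4f} (1 in {round(1 / risk)})")  # 1 in 100
```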
These observations have profound implications for the future education of geneticists. They indicate a shift from memorization tasks toward a curriculum that emphasizes critical thinking in varied, patient-centered scenarios, scrutinizing AI-generated explanations rather than accepting them at face value[25]. Moreover, they stress the importance of understanding AI tools' operational mechanisms, limitations, and the ethical considerations essential in genetics[23]. This shift aims to better prepare geneticists for AI use, ensuring they remain informed about the benefits and risks of the technology.
This paper is available on arXiv under the CC BY 4.0 DEED license.