Table of Links
2 Data
3 Methods
3.1 Lexicon Creation and Expansion
4 Results
4.1 Demographics and 4.2 System Performance
6 Conclusion, Reproducibility, Funding, Acknowledgments, Author Contributions, and References
SUPPLEMENTARY
Guidelines for Annotating Social Support and Social Isolation in Clinical Notes
Abstract
Background: Social support (SS) and social isolation (SI) are social determinants of health (SDOH) associated with psychiatric outcomes. In electronic health records (EHRs), individual-level SS/SI is typically documented in narrative clinical notes rather than as structured coded data. Natural language processing (NLP) algorithms can automate the otherwise labor-intensive process of data extraction.
Data and Methods: Psychiatric encounter notes from Mount Sinai Health System (MSHS, n=300) and Weill Cornell Medicine (WCM, n=225) were annotated to establish a gold standard corpus. A rule-based system (RBS) built on lexicons and a large language model (LLM) system using FLAN-T5-XL were developed to identify mentions of SS and SI and their subcategories (e.g., social network, instrumental support, and loneliness).
Results: For extracting SS/SI, the RBS obtained higher macro-averaged f-scores than the LLM at both MSHS (0.89 vs. 0.65) and WCM (0.85 vs. 0.82). For extracting subcategories, the RBS also outperformed the LLM at both MSHS (0.90 vs. 0.62) and WCM (0.82 vs. 0.81).
Discussion and Conclusion: Unexpectedly, the RBS outperformed the LLM across all metrics. Intensive review demonstrated that this finding is due to the divergent approaches taken by the two systems. The RBS was designed and refined to follow the same specific rules as the gold standard annotations, whereas the LLM was more inclusive in its categorization and conformed to common English-language understanding. Both approaches offer advantages and are made available open source for future testing.
1 INTRODUCTION
Social determinants of health (SDOH) are the non-medical conditions that shape daily life and affect health outcomes [1]. Leveraging SDOH in clinical decision-making may personalize treatment planning and improve patient outcomes [2]. Social support (SS) and social isolation (SI) are two key components of SDOH that significantly impact physical and mental well-being.
SI is associated with higher health care expenditure [3], morbidity [4, 5, 6], and mortality [7, 8] and may be as harmful as smoking fifteen cigarettes a day [9]. Specific health risks linked to SI include poor physical and mental well-being [10], metabolic diseases, infectious diseases, dementia [9], suicidal thoughts [11], anxiety [12], and depression [11, 13]. We previously conducted a scoping review to evaluate the relationship between social connectedness and the risk of depression or anxiety and observed that loneliness was significantly associated with higher risks of major depressive disorder, depressive symptom severity, and generalized anxiety disorder [12]. SI comprises several interrelated psychosocial constructs including a lack of a social network, poor emotional support, and feelings of loneliness [14]. The Surgeon General’s 2023 advisory, “Our Epidemic of Loneliness and Isolation,” recommends identification of patients with SI at the health care system level to track community prevalence. Research may then study causal mechanisms, patterns across demographics, and preventive approaches [9].
In contrast, SS and related constructs including emotional support, instrumental support, and social network are associated with improved health outcomes [15, 5, 16]. Social connections facilitate health-related behaviors, including adherence to medication and treatment [17, 18]. Moreover, social relationships are indirectly associated with several aspects relevant to health, such as blood pressure, immune function, and inflammation [19, 20, 21, 22]. In regard to mental health, SS, when measured across a range of settings and populations and using a variety of measures, may be a protective factor for depressive symptoms and disorders [12].
Existing studies on SS and SI are largely based on questionnaire or survey data from small samples or specific populations (e.g., the elderly [13], adolescents during the pandemic [23, 24], and pregnant/postpartum women [12]). Research on identifying SS and SI from real-world, large-scale, routinely collected electronic health records (EHRs) is lacking, likely because SDOH information, including SS and SI, is rarely encoded in EHRs as structured data elements [2]. International Classification of Diseases (ICD)-9 V-codes and ICD-10 Z-codes have been expanded to include SI; however, studies note poor adoption rates among clinicians and health systems [25]. Instead, these concepts are often captured in EHRs as part of narrative text during a clinical encounter, yet manual abstraction of such data is time-consuming and labor intensive [26, 2].
Natural language processing (NLP) automates the extraction of information from unstructured data and has been used in previous literature to identify various SDOH constructs, including alcohol use, substance use, and homelessness [2]. However, the highly varied language used by clinicians, domain- and site-specific knowledge, and the lack of annotated data present challenges in extracting SDOH from clinical notes [2]. There are currently three main approaches to extracting SS and SI from clinical text, each with strengths and limitations. Note that existing NLP work on SDOH has yet to be optimized for capturing SS/SI.
The first approach involves creating dictionaries (“lexicons”) and a set of rules with which to search the text for matches. Lexicons may be either derived from standardized medical ontologies or developed specifically for the task by domain experts. Software may be used to implement the rules, including recognition of negative terms or contexts in which the lexicon match is a false positive (e.g., if the documented SDOH is not describing the patient, but rather the patient’s sibling). The benefit of this method is that the parameters are highly controlled; there is no “black box” (an inability to see how the model makes decisions) since the pipeline creator names exactly what is, or is not, included. However, it is exceedingly difficult to list every term in the lexicon and create a rule for every context in which the term might occur. Previous work using this approach includes studies detecting SI from clinical notes [26, 27] in specific patient populations. ClinicalRegex and Linguamatics I2E, two rule-based/lexicon software tools, were used to extract SI [28] and SS mentions, again in specific patient populations [29]. Other studies (e.g., Navathe et al. [30]) combine ICD codes with lexicon terms to detect SS/SI. Since the aim of these studies is to identify SS/SI for a clinical purpose, the focus is not on the rigor of algorithm development; therefore, they take blunter approaches, such as (a) not differentiating between types of SS/SI or considering nuances, (b) using a relatively small sample of manually validated notes, (c) using a single site, and (d) not typically making their pipelines publicly available.
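To make the lexicon-and-rules approach concrete, below is a minimal Python sketch (not the pipeline developed in this study): the lexicon terms, negation cues, and non-patient cues are illustrative placeholders, and a production system would use far richer lexicons and context rules.

```python
# Minimal sketch of a lexicon/rule-based matcher for SS/SI mentions.
# The lexicon terms, negation cues, and non-patient cues below are
# illustrative placeholders, not the lexicons developed in this study.
import re

LEXICON = {
    "social_isolation": ["lives alone", "no close friends", "socially isolated"],
    "social_support": ["supportive family", "lives with spouse", "close friends"],
}
NEGATION_CUES = ["denies", "no ", "not "]
NON_PATIENT_CUES = ["mother", "father", "sister", "brother", "son", "daughter"]


def find_mentions(sentence: str) -> list[tuple[str, str]]:
    """Return (category, matched_term) pairs for one sentence, skipping
    matches that are negated or that describe someone other than the patient."""
    text = sentence.lower()
    hits = []
    for category, terms in LEXICON.items():
        for term in terms:
            match = re.search(rf"\b{re.escape(term)}\b", text)
            if not match:
                continue
            preceding = text[: match.start()]  # context before the matched term
            if any(cue in preceding for cue in NEGATION_CUES):
                continue  # e.g., "denies feeling socially isolated"
            if any(cue in preceding for cue in NON_PATIENT_CUES):
                continue  # e.g., "her mother lives alone"
            hits.append((category, term))
    return hits


if __name__ == "__main__":
    print(find_mentions("Patient lives alone and reports no close friends."))
    print(find_mentions("Her mother lives alone nearby."))
```

Every term and context rule in such a system is explicit and inspectable, which is the “no black box” advantage noted above; the corresponding cost is that any phrasing not anticipated by the lexicon or rules becomes a false negative.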
The second NLP approach involves training or adapting traditional machine learning models (pre-packaged topic modeling, deep learning, and language models). EHR-based research has used models trained on clinical corpora, which are thus well suited to understanding clinical language. However, to perform a task such as identifying specialized concepts like SS/SI, these models still require extensive manually labeled training data for fine-tuning, which is labor-intensive and produces results that underperform the lexicon-based approach [31].
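As a hedged illustration of the labeling burden this entails, the sketch below fine-tunes a publicly available clinical encoder (emilyalsentzer/Bio_ClinicalBERT on Hugging Face, chosen here only as an example; it is not a model used in this study) on a handful of placeholder sentences. A real system would require hundreds to thousands of manually labeled examples.

```python
# Sketch of the supervised fine-tuning a traditional (non-LLM) classifier would
# require. The encoder name, example sentences, and labels are illustrative
# assumptions; this is not the model or data used in the study.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Patient lives alone with no local family.",
         "Patient has a supportive spouse and close friends."]
labels = [0, 1]  # 0 = social isolation, 1 = social support (placeholder labels)

tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)

enc = tok(texts, truncation=True, padding=True, return_tensors="pt")

class NoteDataset(torch.utils.data.Dataset):
    """Wraps the tokenized sentences and labels for the Trainer."""
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ss_si_clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=NoteDataset(),
)
trainer.train()  # in practice this needs hundreds to thousands of labeled sentences
```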
Finally, an emerging approach is to use large language models (LLMs), which have been trained on massive amounts of data and use transfer learning to perform downstream tasks with little need for fine-tuning or manual labels. The advent of LLMs is a major milestone in the field of NLP, and LLMs have been applied to several tasks in health informatics, including SDOH extraction from clinical notes [32, 33]. Preliminary work has used LLMs with minimal fine-tuning to extract SDOH, but their performance in identifying SS/SI has yet to be optimized for clinical or research applications [32].
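For illustration, the following sketch shows zero-shot prompting of FLAN-T5-XL (the instruction-tuned model named in the Abstract) through the Hugging Face transformers library; the prompt wording and answer options are illustrative assumptions, not the prompts used in this study.

```python
# Minimal sketch of zero-shot SS/SI extraction with an instruction-tuned LLM.
# FLAN-T5-XL is the model named in the Abstract; the prompt below is an
# illustrative assumption, not the study's actual prompt.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

sentence = "Patient reports feeling lonely and has no family nearby."
prompt = (
    "Does the following clinical note sentence describe social support, "
    "social isolation, or neither?\n"
    f"Sentence: {sentence}\nAnswer:"
)

inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

Because no task-specific training data is needed, such prompting can be set up quickly, but the model's notion of "social support" follows general English usage rather than a study-specific annotation guideline, which is relevant to the divergence discussed in the Conclusion.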
In summary, each of these approaches for extracting SDOH from clinical text has strengths and limitations. The rule-based system (RBS) requires domain experts and significant time to develop lexicons and rules, but yields highly predictable outputs. In contrast, machine learning and deep learning-based systems rely heavily on a large, annotated corpus for training. Lastly, LLMs need less data for fine-tuning than deep learning algorithms, but are often considered black-box models, making their decision-making processes less transparent.
This work aims to build on these previous systems by breaking down SS/SI into fine-grained categories, including presence/absence of social network, instrumental support, emotional support, and loneliness. This separation is important, as the literature has shown that these are separate, non-interchangeable concepts [34] with distinct effects on health outcomes [12]. A general label not only diminishes the signal of detectable associations between subcategories, but also limits the eventual interventions that might come from findings. A distinction is frequently drawn between subjective and objective social support, and the two do not necessarily improve together [35]. For example, loneliness is frequently found to be associated with depressive symptoms, but increasing a person’s social activity is not necessarily the way to alleviate loneliness, and other interventions might be more indicated for the individual experiencing loneliness [36]. This study aims to fill a gap in the literature by not only focusing on SS/SI extraction in clinical narratives, but also distinguishing between classes. Here, we describe the development of a rule book for manual annotations as well as the rigorous development of a rule-based system (RBS) and an LLM-based system (LLM).
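As an illustration of what such a fine-grained schema might look like in code, the sketch below defines the four subcategories and a presence/absence polarity; the exact label names and granularity in the study's annotation guidelines may differ.

```python
# Illustrative label schema for the fine-grained categories described above.
# The exact label names and granularity in the study's annotation guidelines
# may differ; the point is the subcategory plus presence/absence distinction.
from enum import Enum

class Subcategory(Enum):
    SOCIAL_NETWORK = "social_network"
    INSTRUMENTAL_SUPPORT = "instrumental_support"
    EMOTIONAL_SUPPORT = "emotional_support"
    LONELINESS = "loneliness"

class Polarity(Enum):
    PRESENT = "present"  # e.g., "supportive spouse" -> EMOTIONAL_SUPPORT, PRESENT
    ABSENT = "absent"    # e.g., "no one to help with errands" -> INSTRUMENTAL_SUPPORT, ABSENT

# A single annotated mention can then be represented as a
# (Subcategory, Polarity) pair attached to a text span.
```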
Additionally, the variability in clinical documentation, both within and across hospital systems, presents a challenge to the portability of NLP systems [2], and previously published lexicons and pipelines were created for single EHR datasets. An additional aim of this work is to create NLP pipelines that are portable across sites, here, two large academic medical centers in New York City: Mount Sinai Health System (MSHS) and Weill Cornell Medicine (WCM). By making benchmarked NLP pipelines open source, we aim to enable other healthcare systems to adopt, validate, improve, and deploy the developed SS/SI extraction tool for contextualizing both psychiatric research and clinical practice.
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Braja Gopal Patra, Weill Cornell Medicine, New York, NY, USA and co-first authors;
(2) Lauren A. Lepow, Icahn School of Medicine at Mount Sinai, New York, NY, USA and co-first authors;
(3) Praneet Kasi Reddy Jagadeesh Kumar, Weill Cornell Medicine, New York, NY, USA;
(4) Veer Vekaria, Weill Cornell Medicine, New York, NY, USA;
(5) Mohit Manoj Sharma, Weill Cornell Medicine, New York, NY, USA;
(6) Prakash Adekkanattu, Weill Cornell Medicine, New York, NY, USA;
(7) Brian Fennessy, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
(8) Gavin Hynes, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
(9) Isotta Landi, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
(10) Jorge A. Sanchez-Ruiz, Mayo Clinic, Rochester, MN, USA;
(11) Euijung Ryu, Mayo Clinic, Rochester, MN, USA;
(12) Joanna M. Biernacka, Mayo Clinic, Rochester, MN, USA;
(13) Girish N. Nadkarni, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
(14) Ardesheer Talati, Columbia University Vagelos College of Physicians and Surgeons, New York, NY, USA and New York State Psychiatric Institute, New York, NY, USA;
(15) Myrna Weissman, Columbia University Vagelos College of Physicians and Surgeons, New York, NY, USA and New York State Psychiatric Institute, New York, NY, USA;
(16) Mark Olfson, Columbia University Vagelos College of Physicians and Surgeons, New York, NY, USA, New York State Psychiatric Institute, New York, NY, USA, and Columbia University Irving Medical Center, New York, NY, USA;
(17) J. John Mann, Columbia University Irving Medical Center, New York, NY, USA;
(18) Alexander W. Charney, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
(19) Jyotishman Pathak, Weill Cornell Medicine, New York, NY, USA.