Debrief from a Vaticle Community talk — featuring Konrad Myśliwiec, Scientist, Systems Biology, at Roche. This talk was delivered virtually at Orbit 2021 in April.
Konrad, like so many TypeDB community members, comes from a diverse engineering background. Knowledge graphs have been part of his scope since working on an enterprise knowledge graph for GSK, and he has been a part of the TypeDB community for roughly 3 years. While most of his career has been spent in the biomedical industry, he has also worked on business intelligence applications and mobile app development, and he is currently a data science engineer in the RGITSC (Roche Global IT Solutions Centre) for Roche Pharmaceuticals.
Over the last year, Konrad started to notice a trend online: whether in articles and posts on LinkedIn or in conversations across his biomedical network, knowledge graphs were everywhere. Given that the world was in the midst of grappling with the COVID-19 pandemic, he wanted to know if a knowledge graph could aid the efforts of his bio-peers.
In developing BioGrakn Covid, he sought to bring these two topics together to provide a more digestible and useful way to support the research and biomedical fight during the pandemic. In this talk, he describes how he and a group of TypeDB community members approached some of the technical development areas.
BioGrakn Covid is an open-source project started by Konrad, Tomás Sabat from Vaticle, and Kim Wager from GSK. This is a database centered around COVID-19.
Modeling biomedical data as a graph becomes the obvious choice once we realize that the natural representation of biomedical data tends to be graph-like. Graph databases traditionally represent data as binary nodes and edges. Think labeled property graphs, where two nodes are connected via a directional edge. What Konrad and co. quickly realized was that it becomes much simpler and ultimately more natural to represent biomedical data as a hypergraph, such as that found in TypeDB.
Hypergraphs generalise the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a pair of vertices. Instead, a hyperedge in a hypergraph is a set of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.
In TypeDB's model, rectangles denote entities (nodes) and diamonds denote relations, including hyper-relations, which connect n role players in a single relation.
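To make the idea of a hyper-relation concrete, here is a minimal sketch of a TypeQL pattern, written as a Python string as it would later be passed to a client, in which a single relation connects three role players. The type, role, and attribute names here are hypothetical illustrations, not taken from the BioGrakn Covid schema.

# Hypothetical example of a hyper-relation: one relation with three role players,
# something a binary labeled property graph would have to reify into extra nodes.
ternary_pattern = """
match
$g isa gene, has gene-symbol "ACE2";
$d isa disease, has disease-name "COVID-19";
$p isa publication, has title "An example paper";
insert
(associated-gene: $g, associated-disease: $d, evidence: $p)
    isa gene-disease-association-with-evidence;
"""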
Currently, there are quite a few publicly available datasets in BioGrakn Covid. Briefly, here are some of the datasets and mappings included in the database:
As many of us know, it is not as simple as taking the data from one of the above sources and loading it into a database. There first needs to be some consideration of data quality. Maintaining data quality, as Konrad notes in his talk, is an important aspect of the work.
You don’t want to have any data discrepancies, you don’t want to have several nodes representing the same entities, as this will affect and bias your analysis in the future.
Their approach to addressing data quality issues was to use a data source like UMLS (Unified Medical Language System). UMLS provides some structure and context that helps to maintain data quality. In the talk, Konrad focused on two subsets of UMLS. The first, the UMLS Metathesaurus, contains biomedical entities of various types, such as proteins, genes, diseases, and drugs, and the relations between them.
As an example of what is available, we have two proteins, protein-1 and protein-2, connected via an interact-with edge.
The second subset is the Semantic Network data. This is focused on the taxonomy: each concept is placed in a taxonomy, e.g., we have a concept compound, and protein is a subtype of compound. The benefits of this subset will become more obvious later on.
Using UMLS as the initial or baseline ontology gives us confidence that we can move forward without data quality issues. Obviously, this is not the only way to approach the challenge of data quality, but UMLS is effective in the biomedical domain.
The schema in TypeDB provides the structure, the model for our data, and the safety of knowing that any data ingested into the database will adhere to this schema.
In Konrad's case, they are creating a biomedical schema with entities protein, transcript, gene, pathway, virus, tissue, drug, and disease, and the relations between them. It should be noted that many of these decisions can be derived from the data sources you use.
Here is an excerpt from this schema:
define
gene sub fully-formed-anatomical-structure,
owns gene-symbol,
owns gene-name,
plays gene-disease-association:associated-gene;
disease sub pathological-function,
owns disease-name,
owns disease-id,
owns disease-type,
plays gene-disease-association:associated-disease;
protein sub chemical,
owns uniprot-id,
owns uniprot-entry-name,
owns uniprot-symbol,
owns uniprot-name,
owns ensembl-protein-stable-id,
owns function-description,
plays protein-disease-association:associated-protein;
protein-disease-association sub relation,
relates associated-protein,
relates associated-disease;
gene-disease-association sub relation,
owns disgenet-score,
relates associated-gene,
relates associated-disease;
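As a rough sketch of how a schema like this can be loaded into the database with the Python client (the package import, server address, and database name below are typical TypeDB 2.x assumptions, not details taken from Konrad's code):

from typedb.client import TypeDB, SessionType, TransactionType

SCHEMA = open("schema.tql").read()  # assumed file containing the define statements above

with TypeDB.core_client("localhost:1729") as client:
    if not client.databases().contains("biograkn_covid"):
        client.databases().create("biograkn_covid")
    with client.session("biograkn_covid", SessionType.SCHEMA) as session:
        with session.transaction(TransactionType.WRITE) as tx:
            tx.query().define(SCHEMA)  # define queries run in a schema session
            tx.commit()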
Now that we have our schema, we can start to load our data.
In the talk, Konrad walked through loading the data from UniProt, which also contains data from Ensembl. The first thing to do is to identify the relevant columns and then, based on our schema, identify the relevant entities to populate with the data. From there, it is fairly simple to add the attributes for each concept.
Loading the data is then trivial — using a client API in Python, Java, or Node.js. Konrad built this migrator for the UniProt data — available via the BioGrakn Covid repo.
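This is not Konrad's migrator itself, but a simplified sketch of the pattern it follows: read the relevant columns from a UniProt TSV export and turn each row into a TypeQL insert query. The file name and column names below are assumptions for illustration.

import csv
from typedb.client import TypeDB, SessionType, TransactionType

def protein_insert(row):
    # Map the relevant UniProt columns onto the protein entity defined in the schema.
    return (f'insert $p isa protein, '
            f'has uniprot-id "{row["Entry"]}", '
            f'has uniprot-entry-name "{row["Entry name"]}", '
            f'has uniprot-name "{row["Protein names"]}";')

with TypeDB.core_client("localhost:1729") as client:
    with client.session("biograkn_covid", SessionType.DATA) as session:
        with session.transaction(TransactionType.WRITE) as tx:
            with open("uniprot_covid.tsv") as tsv:  # assumed export file
                for row in csv.DictReader(tsv, delimiter="\t"):
                    tx.query().insert(protein_insert(row))
            tx.commit()  # committing in batches keeps larger loads manageable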
Those actively working with these types of publicly available datasets know that they are not updated as often as we might like. Some of them are updated yearly, so we need to supplement these data sources with relevant, current data. The trouble here is that these data usually come from papers, articles, and unstructured text. To use this data, a sub-domain model is needed. This allows us to work with the text more expressively and ultimately connect this to our biomedical model. The two models are shown below:
With the schema set and publications identified, some challenges will need to be addressed. Konrad highlights two of them: extracting biomedical entities from text and linking different ontologies within a central knowledge graph.
When approaching this challenge, Konrad reminds us not to reinvent the wheel but to make use of existing named entity recognition corpora. For this project, Konrad and co. used CORD-NER and SemMed.
Named entity recognition (NER) is the NLP task of identifying named entities within text.
CORD-NER is a data source of pre-computed named entity recognition output. The nice thing about CORD-NER is that the output of the NLP work, the entities extracted from text, is mapped to concepts in UMLS. This helps to provide consistency and data quality in the knowledge graph. With the concepts and their types, we can now map to the schema in TypeQL.
SemMed contains entities derived from publications available in PubMed, and these entities are stored as semantic triples. The limitation of CORD-NER is that it doesn't contain any links between named entities, while SemMed does provide these. Each semantic triple is made up of a subject, predicate, and object, all of which are mapped against the UMLS Metathesaurus, which once again helps us to keep consistent naming conventions in the knowledge graph.
Below you can see an example of how Konrad worked with the data in SemMed. This is a simple join of two tables, and we see the subject, predicate, and object: NDUFS8 is typed as a gene or genome, NDUFS7 is also a gene or genome, and interacts-with is the predicate, or relation, between them. SemMed derives these linkages directly from text, and when it comes to mapping these entities to the defined model, this provides us with much of what is needed. In fact, Konrad notes that there are around 18 different predicates that can be mapped to relations in the schema.
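To give a feel for what that mapping can look like, here is a hedged sketch (not Konrad's migrator) that turns one subject-predicate-object row into a TypeQL match-insert. The predicate-to-relation mapping and the role names are assumptions for illustration.

# Assumed mapping from SemMed predicates to relation types and role names in the schema.
PREDICATE_MAP = {
    "INTERACTS_WITH": ("gene-gene-interaction", "stimulating", "stimulated"),
    # ...per the talk, around 18 predicates can be mapped to relations in total
}

def triple_to_query(subject_symbol, predicate, object_symbol):
    relation, subject_role, object_role = PREDICATE_MAP[predicate]
    return (f'match $s isa gene, has gene-symbol "{subject_symbol}"; '
            f'$o isa gene, has gene-symbol "{object_symbol}"; '
            f'insert ({subject_role}: $s, {object_role}: $o) isa {relation};')

print(triple_to_query("NDUFS8", "INTERACTS_WITH", "NDUFS7"))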
Many times we need to extract structured data out of unstructured text in papers on our own. The challenge here, for Konrad, was working with clinical trials data, which usually comes in the form of XML data.
The problem is that the extracted text is not always straightforward: it may mention more than one drug, compound, or chemical, along with other information. To perform NER against this data, Konrad recommended using SciSpacy. SciSpacy is a Python library built on spaCy, with models trained on publicly available biomedical publications to perform NER. Used against the example above, it identifies two named entities and, even better, provides the mapping against UMLS.
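A minimal sketch of that flow with scispaCy and its UMLS entity linker (model and pipeline names can vary between scispaCy releases; the sentence is an invented example):

import spacy
from scispacy.linking import EntityLinker  # import registers the "scispacy_linker" pipe

nlp = spacy.load("en_core_sci_sm")  # small scispaCy model trained on biomedical text
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Patients received hydroxychloroquine and azithromycin during the trial.")
linker = nlp.get_pipe("scispacy_linker")

for ent in doc.ents:
    # Each recognised entity carries candidate UMLS concepts as (CUI, score) pairs.
    for cui, score in ent._.kb_ents[:1]:
        concept = linker.kb.cui_to_entity[cui]
        print(ent.text, cui, concept.canonical_name, round(score, 2))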
Another situation that may present itself is when an ID from an ontology doesn't exist in the data you are working with. To resolve this, we can take the canonical name of an entity and use an API like RxNorm to get back an RxNorm ID, which can then be used to find the missing ID.
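As a sketch, the public RxNav REST service can resolve a drug's canonical name to an RxNorm ID. The endpoint below is the published one; the example drug name and the minimal error handling are just for illustration.

import requests

def rxnorm_id(drug_name):
    # RxNav's findRxcuiByString endpoint returns RxNorm concept IDs for a drug name.
    response = requests.get("https://rxnav.nlm.nih.gov/REST/rxcui.json",
                            params={"name": drug_name, "search": 1})
    response.raise_for_status()
    ids = response.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None

print(rxnorm_id("hydroxychloroquine"))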
Once data is loaded, how can we begin to traverse the graph to generate some insights and/or new information? In the screenshot below, Konrad walked through the visualized result of a query.
The higher-level question being asked is:
Find any gene, which plays the role of host in a relation with the virus SARs, and that is stimulated by another gene; as well as a drug which interacts with the stimulating gene; and finally, the publications that mention the relations between the genes.
In TypeQL this query looks like this:
match
$gene isa gene;
$virus isa virus, has name "SARs";
$r (host: $gene, virus: $virus) isa gene-virus-host;
$gene2 isa gene;
$r2 (stimulating: $gene2, stimulated: $gene) isa gene-gene-interaction;
$r3 (mentioned: $r2, publication: $pub) isa mentioning;
$drug isa drug;
$r4 (interacted: $gene2, interacting: $drug) isa gene-drug-interaction;
get $gene, $gene2, $drug, $pub;
This question becomes quite easy to query in TypeDB and is representative of how simple it is to traverse the graph while limiting the exploration space.
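For completeness, a sketch of running this kind of query through the Python client and iterating over the answers (the database name is again an assumption):

from typedb.client import TypeDB, SessionType, TransactionType

QUERY = """
match
$gene isa gene; $virus isa virus, has name "SARs";
(host: $gene, virus: $virus) isa gene-virus-host;
$gene2 isa gene;
$r2 (stimulating: $gene2, stimulated: $gene) isa gene-gene-interaction;
(mentioned: $r2, publication: $pub) isa mentioning;
$drug isa drug;
(interacted: $gene2, interacting: $drug) isa gene-drug-interaction;
get $gene, $gene2, $drug, $pub;
"""

with TypeDB.core_client("localhost:1729") as client:
    with client.session("biograkn_covid", SessionType.DATA) as session:
        with session.transaction(TransactionType.READ) as tx:
            for answer in tx.query().match(QUERY):
                # Each answer is a concept map from query variable to matched concept.
                print(answer.get("gene"), answer.get("gene2"), answer.get("drug"))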
Other queries can help us learn more from the clinical trials data. We are able to identify potential targets for a clinical trial, identifiers from sponsoring organizations, and link this information to what we know about drugs, drug-gene relations, and drug-disease relations.
Additional query examples can be found within the BioGrakn Covid repo on GitHub.
As BioGrakn Covid is an open-source database, available for anyone to fork and play with on their own, we would be remiss not to mention the additional ways to derive value from it.
It is commonly understood that nodes in a given network, especially in biomedical networks, tend to cluster into communities of shared function, within a specific subnetwork or across the whole system. The function of a node can often be derived from its n-hop egonet, or neighborhood. For example, in a protein-protein interaction network, proteins tend to cluster with other proteins that share similar functions, such as metabolic or immune-related processes.
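As a toy illustration of that idea (not part of BioGrakn Covid), one can extract the n-hop egonet of a protein from an interaction graph and look at the communities that emerge, for example with networkx. The node names and edges below are invented.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy protein-protein interaction edges; in practice these come from the knowledge graph.
ppi = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P1", "P3"),   # one functional cluster
                ("P4", "P5"), ("P5", "P6"), ("P4", "P6"),   # another functional cluster
                ("P3", "P4")])                              # a bridge between the two

egonet = nx.ego_graph(ppi, "P1", radius=2)                  # 2-hop neighborhood of P1
print(sorted(egonet.nodes))
for community in greedy_modularity_communities(ppi):
    print(sorted(community))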
We might have a hypothesis for a relation, but we need to confirm if this relation should exist in the database. We can do so using Graph Neural Networks.
In order to create or identify these types of clusters, there are a few tasks we can perform: node classification, link prediction, and graph classification. Konrad chose to focus on link prediction for this talk, as he felt it is the most exciting avenue for continued exploration of BioGrakn Covid.
In BioGrakn Covid, there are existing relations we have instantiated. However, Konrad wanted to identify implicit, undiscovered relations. This is motivated by the fact that Biomedical Knowledge Graphs are notoriously incomplete. It is very exciting to find methods to automatically complete them. Having done so, new insights can be drawn straight from the graph.
The approach he uses is to hypothesize that a relation exists and confirm this using a Graph Neural Network (a new hot topic in ML).
Using as features the eigenvectors of the graph adjacency matrix plus structural features from the graph, Konrad was able to train a plain Graph Convolutional Neural Network. The new relations to be predicted are hypothetical disease-gene and disease-drug relations between existing nodes; negative examples are taken via negative sampling from all possible and impossible relation types.
In machine learning, the term negative sampling describes drawing random examples that are not present in the data to act as negative examples. In an incomplete graph, some of these negative samples could in truth be positive, but empirically Konrad finds this approach works well.
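A minimal sketch of that idea, purely for illustration: sample node pairs that are not connected in the (incomplete) graph and label them as negatives.

import random

def sample_negative_edges(nodes, existing_edges, k):
    # Draw k node pairs that do not appear as edges, to act as negative examples.
    existing = {frozenset(edge) for edge in existing_edges}
    negatives = set()
    while len(negatives) < k:
        u, v = random.sample(nodes, 2)
        pair = tuple(sorted((u, v)))
        if frozenset(pair) not in existing:
            negatives.add(pair)
    return list(negatives)

print(sample_negative_edges(["gene-1", "gene-2", "disease-1", "disease-2"],
                            [("gene-1", "disease-1")], k=2))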
Konrad uses a final Softmax layer on the model to generate probabilities of relation existence.
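A heavily simplified sketch of such a model using PyTorch Geometric: a two-layer graph convolutional encoder, an edge scorer over candidate node pairs, and a final softmax over the two classes "relation exists" / "does not exist". This is a generic link-prediction pattern under assumed shapes and features, not Konrad's actual network.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.classifier = torch.nn.Linear(2 * hidden_dim, 2)  # exists / does not exist

    def forward(self, x, edge_index, candidate_pairs):
        # x: node features, e.g. adjacency eigenvectors plus structural features
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        src, dst = candidate_pairs
        pair_repr = torch.cat([h[src], h[dst]], dim=-1)
        return F.softmax(self.classifier(pair_repr), dim=-1)  # probability per pair

# Toy usage: 4 nodes with 8-dimensional features, two known edges, two pairs to score.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # undirected edges 0-1 and 1-2
candidates = torch.tensor([[0, 2], [3, 3]])              # score the pairs (0, 3) and (2, 3)
model = LinkPredictor(in_dim=8, hidden_dim=16)
print(model(x, edge_index, candidates))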
He finds that precision and recall behave as shown in the plot below as the decision threshold is varied. He highlights a recall of 85%, marked with a red dot on both curves; happily, this operating point is roughly where precision and recall balance.
Looking at the results another way, we can see the confusion matrix for a fixed recall of 85%. At this point, he reports a precision of 91.16%. We can also note that the matrix is quite well balanced: there are 50,160 false negatives and 31,007 false positives, against 303,393 true negatives and 264,240 true positives.
Doing an ad-hoc analysis of the gene relation predictions made, Konrad expressed some doubt as to whether all of them could be correctly classified. To investigate further, Konrad did a spot check of the top 5 relation predictions for genes targeting another node (either a compound or another gene). He found that 3 of those relations predicted have each been investigated in one or more papers in the literature.
Doing a spot check of the top 5 drug-disease interactions predicted by the network, he found in the literature that 2 of these have been established to exist.
Konrad is naturally very excited by what he’s been able to achieve using these methods and has plenty of thoughts on how to expand the scope and accuracy of the method, as outlined in the video of his talk.
This is an ongoing project, and we need your help! If you want to contribute, here are some of the ways you can help:
If you wish to get in touch, please reach out to us on the #biograkn channel on our Discord (link here).
A special thank you to Konrad for his hard work, enthusiasm, inspiration, and thorough explanations.
You can find the full presentation on the Vaticle YouTube channel here.