Publications

Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation

Published in Bioinfomatics, 2022

Motivation: Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies in between literature as evidence and annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This paper presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. Conclusion: This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows.

Recommended citation: Jiyu Chen, Benjamin Goudey, Justin Zobel, Nicholas Geard, Karin Verspoor, Exploring automatic inconsistency detection for literature-based gene ontology annotation, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i273–i281 https://doi.org/10.1093/bioinformatics/btac230

Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

Published in BMC Bioinformatics 22:565, 2021

Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

Download here

A bag-of-concepts model improves relation extraction in a narrow knowledge domain with limited data.

Published in In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Student Research Workshop (NAACL-HLT SRW), Minneapolis, 2019, 2019

This paper focuses on a traditional relation ex- traction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on the state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing with methods of higher complexity, and explore its effectiveness on this small dataset.

Recommended citation: Jiyu Chen, Karin Verspoor, and Zenan Zhai. "A Bag-of-concepts Model Improves Relation Extraction in a Narrow Knowledge Domain with Limited Data." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 2019. https://www.aclweb.org/anthology/N19-3007.pdf