Authors:
(1) Yanpeng Ye, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia, GreenDynamics Pty. Ltd, Kensington, NSW, Australia, and these authors contributed equally to this work;
(2) Jie Ren, GreenDynamics Pty. Ltd, Kensington, NSW, Australia, Department of Materials Science and Engineering, City University of Hong Kong, Hong Kong, China, and these authors contributed equally to this work;
(3) Shaozhou Wang, GreenDynamics Pty. Ltd, Kensington, NSW, Australia ([email protected]);
(4) Yuwei Wan, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and Department of Linguistics and Translation, City University of Hong Kong, Hong Kong, China;
(5) Imran Razzak, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia;
(6) Tong Xie, GreenDynamics Pty. Ltd, Kensington, NSW, Australia and School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]);
(7) Wenjie Zhang, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, Australia ([email protected]).
Editor’s note: This article is part of a broader study. You’re reading Part 2 of 9. Read the rest below.
Table of Links
- Abstract and Introduction
- Methods
- Data preparation and schema design
- LLMs training, evaluation and inference
- Entity resolution
- Knowledge graph construction
- Results
- Discussion
- Conclusion and References
Methods
Figure 1(a) illustrates the overall workflow of our research. Through NER and RE tasks, we extract structured information about catalysts, batteries, and solar cells. After ER and normalization, we integrate information from these three fields and construct a knowledge graph. Specifically, the pipeline shown in Figure 1(b) commences with the manual annotation and normalization of an initial training dataset, which is used to fine-tune LLMs for the NER and RE tasks. The inference dataset is then divided into ten batches, a step that enables the iterative process that follows. Next, we perform ER using NLP tools including ChemDataExtractor[22], mat2vec[23], and our expert dictionary. After ER, high-quality results are selected to augment the training set, thereby improving the model's performance in subsequent iterations. Finally, the knowledge graph is constructed from the triples derived from the normalized results of the last iteration.
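To make the batched, iterative loop concrete, the following is a minimal sketch of its control flow. The helper callables (`extract_triples`, `resolve_entity`, `is_high_quality`) are hypothetical placeholders standing in for the fine-tuned LLM inference, the ER/normalization step, and the quality filter, respectively; they are not the authors' actual implementation.

```python
from typing import Callable

def iterative_pipeline(
    documents: list[str],
    extract_triples: Callable[[str], list[tuple[str, str, str]]],
    resolve_entity: Callable[[str], str],
    is_high_quality: Callable[[tuple[str, str, str]], bool],
    n_batches: int = 10,
) -> list[tuple[str, str, str]]:
    """Run NER/RE inference batch by batch, normalize entities (ER),
    and collect high-quality triples to augment the training set."""
    training_augment: list[tuple[str, str, str]] = []
    knowledge_triples: list[tuple[str, str, str]] = []

    # Split the inference corpus into roughly n_batches batches.
    batch_size = max(1, len(documents) // n_batches)
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        for doc in batch:
            # NER + RE with the fine-tuned LLM (stubbed here).
            raw_triples = extract_triples(doc)
            # Entity resolution / normalization, e.g. backed by an
            # expert dictionary plus ChemDataExtractor and mat2vec.
            normalized = [
                (resolve_entity(h), r, resolve_entity(t))
                for h, r, t in raw_triples
            ]
            knowledge_triples.extend(normalized)
            # High-quality results feed back into the training set
            # for the next fine-tuning iteration.
            training_augment.extend(
                t for t in normalized if is_high_quality(t)
            )
        # (In the paper's loop, the model would be re-fine-tuned here
        # with training_augment before processing the next batch.)

    return knowledge_triples

# Toy usage with trivial stand-in callables:
triples = iterative_pipeline(
    ["Pt/C catalyses the ORR."],
    extract_triples=lambda doc: [("Pt/C", "catalyses", "ORR")],
    resolve_entity=lambda e: e.strip(),
    is_high_quality=lambda t: True,
)
```

The final knowledge graph can then be assembled from the accumulated triples; one common choice (ours here for illustration, not necessarily the authors') is a networkx multi-digraph with the relation stored as an edge attribute:

```python
import networkx as nx

G = nx.MultiDiGraph()
for head, rel, tail in triples:
    G.add_edge(head, tail, relation=rel)
```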
This paper is available on arxiv under CC BY 4.0 DEED license.