Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
Compared to other domains, there is a lack of work on text mining/NLP for procurement. We discuss a few studies in this section. Grandia and Kruyen (2020) conducted an exploratory study of over 140,000 procurement notices published on the Belgian e-procurement platform between 2011 and 2016, to analyse the trend towards ‘sustainable public procurement’ (SPP). The work does not extract any information from procurement documents. Instead, it uses a lightweight keyword matching process to search for and count keywords related to SPP concepts in the procurement documents. To compile the keywords, the authors manually studied a large collection of legislation, policies and white papers. The keywords were then manually grouped into themes (e.g., circular economy, social return) and matched against the text of procurement documents for counting. The authors also noted the heterogeneous nature of procurement documents: although each record has a compulsory XML document, very little useful information is encoded in it. Most of it sits instead in a vast collection of PDF, Excel, or Word documents that need to be parsed into a machine-readable format. Haddadi et al. (2021) later applied a similar process to the ICT sector in Morocco, also to gauge the trend of addressing ‘sustainability’ in public procurement. Neither study extracts structured information from procurement documents; both opt for simple keyword matching, which does not require complex preprocessing or interpretation of document structures in order to target the right content areas. Our task, in comparison, is much more challenging.
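To make the keyword matching approach concrete, the following is a minimal sketch of theme-level keyword counting. It is not the code used in either study: the theme lexicon, the function name `count_theme_mentions`, and the sample notice are all hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re
from collections import Counter

# Hypothetical theme lexicon; the studies' actual keyword lists are not reproduced here.
THEME_KEYWORDS = {
    "circular economy": ["recycled", "reuse", "circular"],
    "social return": ["social return", "apprenticeship"],
}


def count_theme_mentions(text, theme_keywords=THEME_KEYWORDS):
    """Count case-insensitive, whole-word keyword matches per theme."""
    counts = Counter()
    for theme, keywords in theme_keywords.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            counts[theme] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Illustrative notice text (invented for this sketch).
notice = (
    "Bidders must use recycled packaging and offer an apprenticeship. "
    "Reuse of pallets is encouraged."
)
theme_counts = count_theme_mentions(notice)
```

Because the approach only counts surface matches, it needs plain text but no knowledge of where in a document the relevant content sits.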
Chalkidis et al. (2017) is one of the earlier studies on extracting structured information from contract data. They developed an NER process that identifies different elements (title, parties, start and termination dates, contract period and value) in contract documents that have already been converted to free-form text. Using a well-curated dataset of just over 2,400 contracts (sources unclear), the model is trained on features such as word casing, token type (number, letter), length, and part of speech. The authors also used post-processing rules to fix boundary detection errors (e.g., ‘2013’ missing from the date ‘23 Oct 2013’). Our work also uses a mixture of supervised and rule-based methods. The fundamental differences are that we have to deal with multiple tasks beyond NER, we do not have well-curated training data for all tasks, and our datasets are much more complex.
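As an illustration of the kind of token features and boundary-fixing rules described above, here is a minimal sketch. The function names `token_features` and `extend_date_span`, the year pattern, and the half-open span convention are assumptions made for this example, not details taken from Chalkidis et al. Part-of-speech features are omitted because they would require an external tagger.

```python
import re


def token_features(token):
    """Hand-crafted features for one token: casing, token type, and length."""
    if token.isdigit():
        token_type = "number"
    elif token.isalpha():
        token_type = "letter"
    else:
        token_type = "other"
    return {
        "is_upper": token.isupper(),
        "is_title": token.istitle(),
        "token_type": token_type,
        "length": len(token),
    }


# Assumed pattern for a four-digit year between 1900 and 2099.
YEAR = re.compile(r"^(19|20)\d{2}$")


def extend_date_span(tokens, start, end):
    """Post-processing rule: if the token right after a predicted date span
    [start, end) looks like a four-digit year, include it in the span."""
    if end < len(tokens) and YEAR.match(tokens[end]):
        return start, end + 1
    return start, end


tokens = "signed on 23 Oct 2013 by both parties".split()
# Suppose the model predicted only '23 Oct' (tokens 2 and 3) as the date.
fixed_span = extend_date_span(tokens, 2, 4)
fixed_date = " ".join(tokens[fixed_span[0]:fixed_span[1]])
```

A rule like this only repairs one specific error pattern, which is why such rules are usually paired with a trained model rather than used on their own.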
Choi et al. (2021) analysed engineering and construction contracts with the goal of extracting risk phrases and entities using rule-based phrase matching. The authors acknowledged the content accessibility issue with procurement documents, which are typically PDF documents. They used an OCR process that recognises sentence boundaries, and opted for a sentence classification solution. Each ‘risk factor’ is associated with a domain lexicon, which is used to match and label sentences. Our work is similar in that we also make use of domain lexicons. However, we use both rule-based and self-supervised methods, and we generalise the method to a number of NLP tasks and different languages.
Fantoni et al. (2021), working in the railway domain, acknowledged the heterogeneity and complexity of procurement documents: some large engineering systems may list over 100,000 requirements, documented inconsistently across multiple documents that use non-standard structures and layouts. Their goal is to 1) identify, from multiple long procurement documents, the sentences describing system requirements; and 2) classify these requirements into different railway subsystems/components. The authors mainly used unsupervised methods. To build a domain ‘knowledge base’, they first used keyword/phrase extraction methods to create a lexicon specific to the domain. Rules were then developed to match low-level information comparable to named entities, such as measurement units and standard references. This extracted information is later matched against sentences in the document, and a score is computed for ranking purposes, based on the number of matched units and the subsystems/components they relate to. Again, our work also uses domain lexicons, but we apply them to both rule-based and supervised methods for multiple tasks.
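The lexicon-based scoring and ranking idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Fantoni et al.'s scoring function: the lexicon `TERM_TO_SUBSYSTEM`, the function names, and the use of a raw match count as the score are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping domain terms to railway subsystems.
TERM_TO_SUBSYSTEM = {
    "pantograph": "traction power",
    "catenary": "traction power",
    "interlocking": "signalling",
    "balise": "signalling",
}


def score_sentence(sentence, term_to_subsystem=TERM_TO_SUBSYSTEM):
    """Return (total matches, per-subsystem match counts) for one sentence."""
    lowered = sentence.lower()
    per_subsystem = Counter()
    for term, subsystem in term_to_subsystem.items():
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        per_subsystem[subsystem] += len(hits)
    return sum(per_subsystem.values()), per_subsystem


def rank_sentences(sentences):
    """Sort sentences by descending match count; ties keep their input order."""
    return sorted(sentences, key=lambda s: score_sentence(s)[0], reverse=True)


# Illustrative sentences (invented for this sketch).
sentences = [
    "Delivery is due in March.",
    "The interlocking shall fail safe.",
    "The pantograph shall maintain contact with the catenary.",
]
ranked = rank_sentences(sentences)
top_total, top_counts = score_sentence(ranked[0])
```

The per-subsystem counts could also be used to assign each sentence to the subsystem with the most matches, which corresponds to the classification goal described above.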
Rabuzin and Modrusan (2019) highlighted a lack of studies on text mining and NLP in procurement analysis. With the goal of extracting technical conditions and criteria from procurement contracts, they downloaded documents from the Croatian procurement portal and converted them into machine-readable text. The authors acknowledged the complexity and inconsistencies in document structures, and hence the challenge of identifying the relevant content areas for further analysis. They opted for a simple ‘sliding window’ solution: a 1000-word window around occurrences of ‘technical’ and ‘professional’. The task is then transformed into a text classification one, for which the authors trained three algorithms for comparison. However, it is unclear how the training data was created. The authors later updated their work (Modrusan et al., 2020) in terms of how the sliding windows are defined. Compared to our work, these studies belong to the task of ‘passage retrieval’ (PR), while we deal with not only PR, but also multiple, more fine-grained extraction tasks.
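The windowing step can be sketched as below. The source only states that a 1000-word window is taken around each keyword occurrence, so centring the window on the keyword, splitting on whitespace, and stripping trailing punctuation are assumptions, as is the function name `keyword_windows`.

```python
def keyword_windows(text, keywords=("technical", "professional"), window_size=1000):
    """Return one word window per keyword occurrence.

    The window is centred on the keyword where possible (an assumption)
    and clipped at the start and end of the document.
    """
    words = text.split()
    half = window_size // 2
    windows = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.lower().strip(".,;:()") in keywords:
            start = max(0, i - half)
            end = min(len(words), i + half)
            windows.append(" ".join(words[start:end]))
    return windows


# Small demonstration with a 10-word window on a 30-word synthetic document.
demo_words = [f"w{i}" for i in range(30)]
demo_words[15] = "Technical,"
demo_windows = keyword_windows(" ".join(demo_words), window_size=10)
```

Each resulting window can then be treated as a candidate passage and passed to a text classifier, which is how the retrieval problem becomes a classification one.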
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);
(3) Richard Freeman, Vamstar Ltd., London ([email protected]);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).
This paper is