
Workflow for Extracting Structured Data from Tender Documents to Build Supplier Risk Profiles

by Text Mining, December 23rd, 2024

Too Long; Didn't Read

A multi-step process extracts structured lot and item data from tender documents, aiding in the creation of supplier risk profiles for improved decision-making.
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

4. Proposed Methodology

Figure 5 presents an overview of our workflow. As mentioned before, this article focuses on extracting the structured lot and item information often missing in tender and award XMLs (the middle lane). This will be covered in Sections 4.1 to 4.5. In Section 4.6, we briefly cover the other parts of the workflow.


Given a collection of tender attachment documents associated with one tender notice, our first step (content extraction) uses a range of data extraction libraries to convert heterogeneous content formats into a single, universal data structure called the ‘Vamstar Universal Document (VUD)’, which represents text content in JSON format. In ‘lot zoning’, we use passage retrieval/selection techniques to identify the content areas (pages and tables) that potentially contain useful lot information. Next, ‘lot item detection’ applies text classification techniques to the extracted passages to identify content (sentences and table rows) that describes a lot and its items. Following this, we apply rule-based NER to parse the texts related to a lot and its individual items, identifying specific attributes and creating a structured representation of the lot (lot parsing). For most of these processes, we apply domain knowledge in a way that generalises across multiple languages.
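The pipeline above can be sketched in miniature. This is an illustrative sketch only: the VUD field names, the keyword lexicon, and the helper functions below are assumptions for exposition, not the authors' actual implementation (which uses proprietary retrieval, classification, and NER components).

```python
import json
import re

# A hypothetical Vamstar Universal Document (VUD): one JSON record per
# attachment, holding page-level text (tables omitted for brevity).
vud = {
    "tender_id": "T-2024-001",
    "source_file": "attachment_1.pdf",
    "pages": [
        {"number": 1, "text": "Lot 1: Surgical gloves, quantity: 10,000 units."},
        {"number": 2, "text": "General terms and conditions apply."},
    ],
}

# Toy stand-in for a multilingual domain lexicon.
LOT_KEYWORDS = {"lot", "item", "quantity", "unit"}

def lot_zoning(doc):
    """Passage selection: keep pages whose text mentions lot-related terms."""
    return [p for p in doc["pages"]
            if any(k in p["text"].lower() for k in LOT_KEYWORDS)]

def lot_item_detection(passages):
    """Toy 'classifier': flag sentences that look like lot/item descriptions."""
    sentences = []
    for p in passages:
        for s in re.split(r"(?<=[.!?])\s+", p["text"]):
            if re.search(r"\blot\b|\bquantity\b", s, re.IGNORECASE):
                sentences.append(s)
    return sentences

def lot_parsing(sentence):
    """Rule-based extraction: pull out lot number, item name, and quantity."""
    lot = re.search(r"lot\s*(\d+)", sentence, re.IGNORECASE)
    item = re.search(r"lot\s*\d+:\s*(.+?),\s*quantity", sentence, re.IGNORECASE)
    qty = re.search(r"quantity:?\s*([\d,]+)", sentence, re.IGNORECASE)
    return {
        "lot_number": int(lot.group(1)) if lot else None,
        "item": item.group(1).strip() if item else None,
        "quantity": int(qty.group(1).replace(",", "")) if qty else None,
    }

passages = lot_zoning(vud)
records = [lot_parsing(s) for s in lot_item_detection(passages)]
print(json.dumps(records, indent=2))
```

In the real system each stage is substantially more robust (retrieval models rather than keyword matching, trained classifiers rather than regexes), but the data flow — VUD in, structured lot records out — is the same.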


Meanwhile, other structured information is extracted from tender and award XMLs by simple XML parsing. This extracted information is then joined to form supplier-centric contract award records, which are used to populate our database. The database is then used to calculate supplier risk indices, which together form a supplier risk profile. In the following sections, we explain each component; however, details of certain components are redacted due to NDAs covering proprietary content.
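The joining and index-calculation step can be illustrated as follows. This is a hedged sketch: the join key (`tender_id`), record fields, and the concentration-style risk formula are illustrative assumptions; the paper does not specify its actual risk indices here.

```python
from collections import defaultdict

# Structured fields parsed from award XMLs (one record per award).
awards = [
    {"tender_id": "T-2024-001", "supplier": "Acme Medical", "value": 120000.0},
    {"tender_id": "T-2024-002", "supplier": "Acme Medical", "value": 80000.0},
    {"tender_id": "T-2024-003", "supplier": "MedSupply Co", "value": 200000.0},
]

# Lot records recovered from tender attachments (Sections 4.1-4.5).
lots = {
    "T-2024-001": [{"lot_number": 1, "item": "Surgical gloves"}],
    "T-2024-002": [{"lot_number": 1, "item": "Syringes"}],
    "T-2024-003": [{"lot_number": 1, "item": "Ventilators"}],
}

def join_awards_with_lots(awards, lots):
    """Build supplier-centric records: attach lot details to each award."""
    by_supplier = defaultdict(list)
    for a in awards:
        record = dict(a, lots=lots.get(a["tender_id"], []))
        by_supplier[a["supplier"]].append(record)
    return by_supplier

def concentration_risk(records, total_market_value):
    """Toy index: a supplier's share of total awarded value (0..1)."""
    supplier_value = sum(r["value"] for r in records)
    return supplier_value / total_market_value

profiles = join_awards_with_lots(awards, lots)
total = sum(a["value"] for a in awards)
for supplier, records in profiles.items():
    print(supplier, round(concentration_risk(records, total), 2))
```

The key design point is that the joined records are supplier-centric: once awards carry their lot-level detail, any number of indices (volume, category mix, award frequency) can be computed per supplier from the same database.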


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);

(3) Richard Freeman, Vamstar Ltd., London ([email protected]);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP ([email protected]).


This paper is available on arXiv under a CC BY 4.0 license.