Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
With the semi-structured data available from TED tender XMLs and award XMLs, Vamstar developed XML parsers that extract different fields and store their unique values in a temporary data store (referred to as the ‘tender and award fields’, or just ‘fields’ for short). As an example, given the XML in Figure 1, a record would be created with fields such as: buyer name and addresses, lots (e.g., ‘Verapamil solution for inj. Lot No: 62’, but with multiple values as the notice contains more than 90 lots), award criteria, etc. This allows useful domain knowledge to be generated in several ways.
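To make the parsing step concrete, below is a minimal Python sketch of such a parser. The function name `parse_notice` and the element tag names (e.g., `OFFICIALNAME`, `SHORT_DESCR`, `OBJECT_DESCR`) are illustrative placeholders rather than the actual TED schema paths or Vamstar's implementation, and XML namespace handling is omitted for brevity.

```python
# Illustrative sketch only: the tag names below are simplified stand-ins and do
# not reproduce the exact TED XML schema or Vamstar's parser; namespaces are
# ignored for brevity.
import xml.etree.ElementTree as ET

def parse_notice(xml_path):
    """Extract a flat record of 'tender and award fields' from one notice."""
    root = ET.parse(xml_path).getroot()

    def texts(tag):
        # Collect the text of every element with the given tag, anywhere in the tree.
        return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]

    return {
        "buyer_name_and_addresses": texts("OFFICIALNAME") + texts("ADDRESS"),
        "notice_title": texts("TITLE"),
        "short_description": texts("SHORT_DESCR"),
        # A notice may contain many lots (e.g. 90+), so this field is multi-valued.
        "lot_and_item_descriptions": texts("OBJECT_DESCR"),
        "contract_criteria": texts("AC_CRITERION"),
    }
```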
First, training data for different fields can be created by simply taking the unique values from each field. These values usually take the form of short phrases or sentences and can therefore be used to train text classifiers at phrase/sentence level. Note that this data is multilingual.
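As an illustration, the following sketch collects the unique values of each field as phrase/sentence-level training examples, with the field name acting as the class label. It reuses the hypothetical record structure from the parser sketch above; treating the field name as the label is an assumption about how such pairs might be organised, not a description of the exact pipeline.

```python
# Minimal sketch: turn the unique values of each field into (text, label) pairs
# for a phrase/sentence-level classifier. 'records' is a list of dicts as
# produced by the hypothetical parse_notice() above; values are multilingual.
from collections import defaultdict

def build_training_data(records):
    unique_values = defaultdict(set)
    for record in records:
        for field, values in record.items():
            unique_values[field].update(values)
    # Each field name doubles as the class label for its values.
    return [(text, field)
            for field, values in unique_values.items()
            for text in sorted(values)]
```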
Second, domain lexicons for different fields can be built by extracting ‘representative’ words from each field. Given a field of interest, we merge all of its values across languages and use machine translation tools to translate them into English. Rather than drawing a hard boundary between what is and is not a ‘representative’ word for that field, we calculate the ‘specificity’ of each word and refer to the result as the ‘domain specificity lexicon’, which is used in other components of our method. Given a word 𝑤, the basic idea is to measure how unique 𝑤 is to lot/item descriptions compared to other text content in a tender. The basic process is as follows:
Identify the set of ‘fields’ that are most relevant to a tender. For this we selected 5 fields (a rather subjective interpretation by the domain experts): ‘Name and addresses’ of buyers (Section I.1.1 of the notice in Figure 1), ‘notice title’ (II.1.1), ‘short description’ (II.1.4), ‘lot and item descriptions’ (a concatenation of values from fields like II.2.1), and ‘contract criteria’ (a concatenation of values from fields like II.2.5). Following the process mentioned above, all values are then translated into English;
For each field, create a bag-of-words (‘BOW’) representation by concatenating the values of that field across all records in the database, applying stop word removal and lowercasing;
For 𝑤, calculate 4 metrics as follows: 𝑛𝑡𝑓(𝑤), the normalised frequency of 𝑤 in the BOW representation of the ‘lot and item descriptions’ field, calculated as the ratio between the frequency of 𝑤 in that BOW and the total number of words in the BOW; 𝑛𝑑𝑓(𝑤), the normalised ‘document frequency’ of 𝑤, calculated as the ratio between the number of fields in which 𝑤 is found and the number of all fields (i.e., 5); 𝑛𝑡𝑓·𝑛𝑑𝑓(𝑤), inspired by the tf-idf measure used in document retrieval, this is the product of 𝑛𝑡𝑓(𝑤) and the inverse of 𝑛𝑑𝑓(𝑤); and 𝑤𝑒𝑖𝑟𝑑𝑛𝑒𝑠𝑠(𝑤), a score calculated by comparing 𝑛𝑡𝑓(𝑤) against the corresponding value in a reference, general-purpose corpus (in this case, the Brown corpus).
Thus our domain specificity lexicon contains the unique words found within the ‘lot and item descriptions’ field, each with four associated scores. In addition, we use two further dictionaries compiled by Vamstar: one is a list of words often used as measurement units (e.g., mg, ml), and the other is a list of words often used to describe the ‘form’ of the required items (e.g., pack, box, bottle). Both are very short, containing only a few dozen entries each. These lexicons and dictionaries are also translated into other languages using Google Translate.
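The sketch below illustrates how the four specificity scores described above could be computed with NLTK. It is a minimal sketch under two assumptions the text leaves open: (i) the field values have already been machine-translated into English, and (ii) 𝑤𝑒𝑖𝑟𝑑𝑛𝑒𝑠𝑠(𝑤) is taken as the classic ratio between the word's normalised frequency in the domain text and in the Brown corpus. The field names reuse the hypothetical record structure from the earlier parser sketch.

```python
# Sketch of the domain specificity lexicon; requires the NLTK corpora/models
# 'brown', 'stopwords' and 'punkt' to be downloaded. Field names follow the
# hypothetical parse_notice() records; weirdness is assumed to be the ratio of
# normalised frequencies (domain vs. Brown corpus).
from collections import Counter

import nltk
from nltk.corpus import brown, stopwords

FIELDS = ["buyer_name_and_addresses", "notice_title", "short_description",
          "lot_and_item_descriptions", "contract_criteria"]

def bag_of_words(records, field, stop):
    """Lowercased, stop-word-filtered tokens of one field across all records."""
    tokens = []
    for record in records:
        for value in record.get(field, []):
            tokens.extend(t for t in nltk.word_tokenize(value.lower())
                          if t.isalpha() and t not in stop)
    return Counter(tokens)

def specificity_lexicon(records):
    stop = set(stopwords.words("english"))
    bows = {field: bag_of_words(records, field, stop) for field in FIELDS}

    lot_bow = bows["lot_and_item_descriptions"]
    lot_total = sum(lot_bow.values())

    # Reference counts from a general-purpose corpus (Brown) for 'weirdness'.
    brown_counts = Counter(w.lower() for w in brown.words() if w.isalpha())
    brown_total = sum(brown_counts.values())

    lexicon = {}
    for w, freq in lot_bow.items():
        ntf = freq / lot_total                              # normalised term frequency
        ndf = sum(1 for bow in bows.values() if w in bow) / len(FIELDS)
        ntf_indf = ntf * (1.0 / ndf)                        # tf-idf style product
        ref_ntf = brown_counts.get(w, 0) / brown_total
        weirdness = ntf / ref_ntf if ref_ntf > 0 else float("inf")
        lexicon[w] = {"ntf": ntf, "ndf": ndf,
                      "ntf_indf": ntf_indf, "weirdness": weirdness}
    return lexicon
```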
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, S1 4DP, UK ([email protected]);
(2) Tomas Jasaitis, Vamstar Ltd., London ([email protected]);
(3) Richard Freeman, Vamstar Ltd., London ([email protected]);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, S1 4DP, UK ([email protected]);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, S1 4DP, UK ([email protected]).