With the relevant pages and tables identified, the next step is to identify the text elements in those pages and tables that actually describe lots and items. We would like to capture two kinds of information: the texts that indicate a lot reference and the texts that describe individual items in a lot. In practice, both are often contained in a single, coherent text passage, such as that shown in Figure 5. We therefore handle this as a binary sentence classification task, where a sentence is a positive example if it contains a lot reference, item information, or both, and a negative example otherwise. We defer the extraction of lot references and structured item information to the next component, ‘lot parsing’. We also need to deal with unstructured texts from a page and content from tables in slightly different ways.
4.4.1. Unstructured texts in a page
For unstructured texts in a page, we apply sentence splitting and then sentence classification. For features, we adapt the 24 page-level features explained above in the ‘page selection’ section by applying them to each sentence to be classified. As an example, in equation 1, we replace 𝑝 with the input sentence 𝑠, and by alternating the 𝑚𝑒𝑡𝑟𝑖𝑐 we obtain 4 features. The same can be applied to equations 2 and 3. Then, replacing 𝐿 with the other two domain dictionaries produces another 8 features. Further, we add three binary features that help capture lot references:
● If the sentence contains the word ‘lot’ (and its translation in other languages)
● If the sentence contains the word ‘number’ or ‘no.’ or ‘num’ (and its translations)
● If the sentence contains number-like tokens (including Roman/Arabic numerals and patterns such as ‘1.1, 1.21, X.2, II.1’)
We also compare the same set of algorithms for classification.
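The three binary features listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword sets (including the example translations), the whitespace tokenizer, and the number-like regular expression are all hypothetical, since the paper does not specify them.

```python
import re

# Hypothetical keyword sets; the actual multilingual vocabularies are not given.
LOT_WORDS = {"lot", "los", "lote"}
NUMBER_WORDS = {"number", "no.", "num", "numéro", "nummer", "número"}

# Assumed pattern: dot-separated segments of Arabic digits or Roman numerals,
# e.g. "1.1", "1.21", "X.2", "II.1", with an optional trailing dot.
SEGMENT = r"(?:\d+|[IVXLCDM]+)"
NUMBER_LIKE = re.compile(rf"{SEGMENT}(?:\.{SEGMENT})*\.?")

STRIP_CHARS = ",;:()[]\"'‘’"


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation, keeping dots."""
    tokens = (t.strip(STRIP_CHARS) for t in sentence.split())
    return [t for t in tokens if t]


def lot_reference_features(sentence):
    """Return [has_lot_word, has_number_word, has_number_like_token] as 0/1."""
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    has_lot = any(t.rstrip(".") in LOT_WORDS for t in lowered)
    has_number_word = any(
        t in NUMBER_WORDS or t.rstrip(".") in NUMBER_WORDS for t in lowered
    )
    # Roman numerals are matched in upper case only, so use the original tokens.
    has_number_like = any(NUMBER_LIKE.fullmatch(t) for t in tokens)
    return [int(has_lot), int(has_number_word), int(has_number_like)]


print(lot_reference_features("Lot No. 2: surgical gloves"))
```

Note that this simple Roman-numeral pattern also accepts upper-case words made only of the letters I, V, X, L, C, D, and M (for example the pronoun ‘I’), so a production version would likely need a stricter check.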
4.4.2. Tables
Similar to the idea explained in passage selection, we simply treat each row as a sentence, so the solution is almost the same as that for unstructured texts above. One exception is tables like that in Figure 2, where the lot references need to be interpreted by combining the table header ‘Lot Number’ with the text in the row spans below that header. Thus, for tables, we add rules for converting rows into sentences:
● We apply a rule to match the table headers against the pattern ‘lot [token]’ where [token] is an optional word (e.g., ‘number’, ‘no.’);
● For the column headers matched by this rule, we find the one where the texts of all cells in that column are ‘number-like’ tokens. Then, we insert the header text before the number-like token within each cell. As an example, the row span with a value ‘1’ in Figure 2 will be updated to ‘Lot Number 1’ and will be repeated for both rows 2 and 3 when creating a sentence.
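The two rules above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the table has already been parsed into a list of header strings and a list of rows with row spans expanded (so a spanning cell value is repeated in every row it covers), and the function name, regular expressions, and sample data are hypothetical.

```python
import re

# Rule 1 (assumed form): header is 'lot' optionally followed by one word.
LOT_HEADER = re.compile(r"lot(?:\s+\S+)?", re.IGNORECASE)

# Assumed 'number-like' pattern: Arabic or Roman segments separated by dots.
SEGMENT = r"(?:\d+|[IVXLCDM]+)"
NUMBER_LIKE = re.compile(rf"{SEGMENT}(?:\.{SEGMENT})*\.?")


def rows_to_sentences(headers, rows):
    """Turn each table row into one sentence, prefixing lot numbers
    with their column header (e.g. '1' becomes 'Lot Number 1')."""
    # Rule 1: find headers that look like 'lot [token]'.
    lot_cols = [
        i for i, h in enumerate(headers) if LOT_HEADER.fullmatch(h.strip())
    ]
    # Rule 2: among those, pick the column whose cells are all number-like.
    target = next(
        (
            i
            for i in lot_cols
            if rows and all(NUMBER_LIKE.fullmatch(r[i].strip()) for r in rows)
        ),
        None,
    )
    sentences = []
    for row in rows:
        cells = [c.strip() for c in row]
        if target is not None:
            cells[target] = f"{headers[target].strip()} {cells[target]}"
        sentences.append(" ".join(c for c in cells if c))
    return sentences


# Hypothetical table; the row span for lot '1' is already expanded.
headers = ["Lot Number", "Item", "Quantity"]
rows = [
    ["1", "Surgical gloves", "200"],
    ["1", "Face masks", "500"],
    ["2", "Syringes", "1000"],
]
for sentence in rows_to_sentences(headers, rows):
    print(sentence)
```

Each resulting sentence can then be passed to the same sentence classifier used for unstructured text. If no header satisfies both rules, the rows are simply joined into sentences without modification.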
Some may argue that lot item detection from tables should be simpler: once a table is classified as a ‘relevant’ one containing lot and item descriptions, given our ‘horizontal table’ assumption, we can simply take each row (after ‘expanding’ row/column spans) as an instance of item description. While this may be true for many cases like that shown in Figure 2, in practice there are complex table structures such as that shown in Figure 7. Here, each lot is split into a number of consecutive rows, and within each lot, the headers are repeated for the item(s) in that lot. The actual content we wish to extract from the table is that indicated in the box. For this reason, we opt for the generic approach described above, i.e., treating each row in a table as a sentence for classification. The same feature set and algorithms described above are used, and the training datasets will be explained later.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(2) Tomas Jasaitis, Vamstar Ltd., London;
(3) Richard Freeman, Vamstar Ltd., London;
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP;
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP.
This paper is