New Framework Promises to Train AI to Better Understand Hard-to-Grasp Languages Like Polish

by Morphology, December 30th, 2024

Too Long; Didn't Read

Researchers in Poland have developed an open-source tool that improves the evaluation and comparison of AI used in natural language preprocessing.

Authors:

(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;

(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;

(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;

(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.

Editor's note: This is Part 5 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.

Abstract and 1. Introduction and related works

2. NLPre benchmarking

2.1. Research concept

2.2. Online benchmarking system

2.3. Configuration

3. NLPre-PL benchmark

3.1. Datasets

3.2. Tasks

4. Evaluation

4.1. Evaluation methodology

4.2. Evaluated systems

4.3. Results

5. Conclusions
    • Appendices
    • Acknowledgements
    • Bibliographical References
    • Language Resource References

3. NLPre-PL benchmark

3.1. Datasets

Table 1: Summary of source datasets (NKJP1M and PDB-UD) and NLPre-PL Datasets (in tokens). Explanations: POS – the part-of-speech tagset; DEP – the dependency schema; Avg. t/s – the average number of tokens per sentence


NKJP1M (Przepiórkowski et al., 2018). The NKJP1M subcorpus of the Polish National Corpus (Przepiórkowski et al., 2012) is manually annotated according to the NKJP tagset (Szałkiewicz and Przepiórkowski, 2012) and afterwards modified in line with the Morfeusz tagset (Woliński, 2019). This balanced subset of thematically diverse and genre-diverse texts and transcriptions is used to train Polish POS taggers. NKJP1M is maintained in two formats: TEI[8] and DAG.[9] These two formats are accepted by older NLPre tools but not by modern ones. We thus convert NKJP1M to the CoNLL-X format (Buchholz and Marsi, 2006), preserving the original segmentation, POS tags and morphological features (i.e. the Morfeusz tagset), and to the CoNLL-U format[10] with UD tags, Morfeusz tags (XPOS) and UD morphological features.
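To make the target format concrete, here is a minimal sketch of how one CoNLL-U token line carries the three annotation layers mentioned above: the UD tag (UPOS), the Morfeusz tag (XPOS) and the UD morphological features (FEATS). The ten column names follow the CoNLL-U specification; the example line and the helper names `parse_conllu_token` and `parse_feats` are illustrative assumptions and are not taken from NKJP1M or from the authors' conversion code.

```python
from typing import Dict

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_token(line: str) -> Dict[str, str]:
    """Split one CoNLL-U token line into a field-name -> value mapping."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return dict(zip(CONLLU_FIELDS, columns))


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative line (an assumption, not copied from the corpus):
# the noun "dom" with a UD tag, a Morfeusz-style tag and UD features.
EXAMPLE_LINE = (
    "1\tdom\tdom\tNOUN\tsubst:sg:nom:m3\t"
    "Case=Nom|Gender=Masc|Number=Sing\t0\troot\t_\t_"
)

token = parse_conllu_token(EXAMPLE_LINE)
print(token["upos"], token["xpos"], parse_feats(token["feats"]))
```

In a CoNLL-X rendering of the same token, the Morfeusz tag and its features would occupy the POS and feature columns directly, with no UD layer.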


Since there is no generally accepted split of NKJP1M into training, development and testing subsets, we uniformly divide NKJP1M in all formats (i.e. DAG, TEI, CoNLL-X and CoNLL-U) according to the following splitting heuristics. Each document in the subcorpus contains multiple paragraphs of continuous textual data. To avoid possible information leakage, we treat each such paragraph as an indivisible unit. To ensure that the subsets include paragraphs of varying length, we investigate the distribution of the number of segments in each paragraph. Since it is akin to a Gaussian distribution, we decide not to exclude any data, and we divide the paragraphs into K = 10 buckets of roughly similar size and then sample from them with respective ratios of 0.8:0.1:0.1 (corresponding to the train, dev, and test subsets). This data selection technique ensures a similar distribution of the number of segments per paragraph across the three subsets, hereafter byName.


For creating our second split, hereafter byType, we consider the type of document a paragraph belongs to. We first group paragraphs into categories equal to the document types, and then we repeat the above-mentioned procedure per category (see the summary of NKJP1M and data splits in Table 1).


PDB-UD (Wróblewska, 2018). Polish Dependency Bank is the largest collection of Polish sentences manually annotated with dependency trees and afterwards converted into UD representations in line with the UD annotation schema (de Marneffe et al., 2021). PDB-UD is related to NKJP1M: a subset of the PDB-UD sentences comes from NKJP1M, and the language-specific tags (XPOS) in PDB-UD match the Morfeusz tagset. PDB-UD is typically used to train NLPre systems for Polish. In NLPre-PL, we use the original PDB-UD data without any modifications and its standard split (see the statistical summary of PDB-UD in Table 1).
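The NKJP1M splitting heuristic described above (paragraphs as indivisible units, K = 10 buckets, 0.8:0.1:0.1 sampling) can be sketched as follows. This is not the authors' implementation. It assumes that "buckets of roughly similar size" means equal-count buckets over paragraphs sorted by their number of segments, and the tuple layout `(paragraph_id, n_segments, doc_type)`, the function names and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: (paragraph_id, n_segments, doc_type).
Paragraph = Tuple[str, int, str]
Splits = Dict[str, List[Paragraph]]


def bucketed_split(
    paragraphs: Sequence[Paragraph],
    k: int = 10,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Splits:
    """Split whole paragraphs into train/dev/test, bucket by bucket.

    Assumption: buckets are equal-count slices of the paragraphs
    sorted by their number of segments.
    """
    rng = random.Random(seed)
    ordered = sorted(paragraphs, key=lambda p: p[1])
    splits: Splits = {"train": [], "dev": [], "test": []}
    n = len(ordered)
    for b in range(k):
        bucket = ordered[b * n // k:(b + 1) * n // k]
        rng.shuffle(bucket)  # sample at random within the bucket
        n_train = round(ratios[0] * len(bucket))
        n_dev = round(ratios[1] * len(bucket))
        splits["train"].extend(bucket[:n_train])
        splits["dev"].extend(bucket[n_train:n_train + n_dev])
        splits["test"].extend(bucket[n_train + n_dev:])  # remainder
    return splits


def split_by_type(paragraphs: Sequence[Paragraph], **kwargs) -> Splits:
    """Repeat the bucketed split separately for each document type."""
    by_type: Dict[str, List[Paragraph]] = defaultdict(list)
    for paragraph in paragraphs:
        by_type[paragraph[2]].append(paragraph)
    merged: Splits = {"train": [], "dev": [], "test": []}
    for doc_type in sorted(by_type):
        part = bucketed_split(by_type[doc_type], **kwargs)
        for name in merged:
            merged[name].extend(part[name])
    return merged


# Synthetic demo data (made up for illustration only).
demo = [
    (f"p{i}", 1 + i % 50, "press" if i % 2 else "fiction")
    for i in range(1000)
]
by_name = bucketed_split(demo)
by_type = split_by_type(demo)
print({name: len(items) for name, items in by_name.items()})
print({name: len(items) for name, items in by_type.items()})
```

Because every paragraph lands in exactly one subset, no paragraph is shared between train, dev and test, which is the leakage guard the text describes.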


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.


[8] http://nlp.ipipan.waw.pl/TEI4NKJP


[9] https://github.com/kawu/concraft-pl#data-format


[10] https://universaldependencies.org/format.html