
Researchers Introduce Clever Math Trick to Beef Up Tiny Datasets Without Frying Your GPU

by Procrustes Technologies, January 27th, 2025

Too Long; Didn't Read

Researchers have developed a new method to generate additional data points by utilizing cross-validation resampling and latent variable modeling to train artificial intelligence.

Authors:

(1) Sergey Kucheryavskiy, Department of Chemistry and Bioscience, Aalborg University, corresponding author ([email protected]);

(2) Sergei Zhilin, CSort LLC, Germana Titova st. 7, Barnaul, 656023, Russia, contributing author ([email protected]).

Editor's note: This is Part 1 of 4 of a study detailing a new method for the augmentation of numeric and mixed datasets. Read the rest below.

  • Abstract and 1 Introduction
  • 2 Methods
    • 2.1 Generation of PV-sets based on Singular Value Decomposition
    • 2.2 Generation of PV-sets based on PLS decomposition
  • 3 Results
    • 3.1 Datasets
    • 3.2 ANN regression of Tecator data
    • 3.3 ANN classification of Heart data
  • 4 Discussion
  • 5 Conclusions and References

Abstract

In this paper, we propose a new method for the augmentation of numeric and mixed datasets. The method generates additional data points by utilizing cross-validation resampling and latent variable modeling. It is particularly efficient for datasets with a moderate to high degree of collinearity, as it directly utilizes this property for generation. The method is simple, fast, and has very few parameters, which, as shown in the paper, do not require specific tuning. It has been tested on several real datasets; here, we report detailed results for two cases: prediction of protein in minced meat based on near infrared spectra (fully numeric data with a high degree of collinearity) and discrimination of patients referred for coronary angiography (mixed data, with both numeric and categorical variables, and moderate collinearity). In both cases, artificial neural networks were employed to develop the regression and discrimination models. The results show a clear improvement in model performance; for the prediction of meat protein, fitting the model to the augmented data reduced the root mean squared error computed for the independent test set by a factor of 1.5 to 3.


Keywords: data augmentation, artificial neural networks, Procrustes cross-validation, latent variables, collinearity

1 Introduction

Modern machine learning methods that rely on high-complexity models, such as artificial neural networks (ANN), require a large amount of data to train and optimize the models. Insufficient training data often leads to overfitting, as the number of model parameters to fit is much larger than the number of degrees of freedom in the dataset.


Another common issue in this case is the lack of reproducibility, because the ANN training procedure is not deterministic: both the initial model parameters and their subsequent optimization involve randomness. Consequently, repeated training trials never yield a model with exactly the same parameters and performance, as different trials can result in different models. This variability becomes large if the training set is too small.


This problem is particularly acute when fitting experimental data, as running many experimental trials is often expensive and time-consuming, making it simply impossible to collect the thousands of measurements needed for proper training and optimization. There can also be other obstacles, such as the paperwork related to permissions in medical research.


One way to overcome the problem of insufficient training data is to artificially augment it by either simulating new data points or making small modifications to existing ones. This technique is often referred to as “data augmentation”. Data augmentation has proved to be particularly efficient in image analysis and classification, with a large body of research reporting both versatile augmentation methods [1], [2], [3] and methods that are particularly effective for specific cases [4], [5]. Augmentation methods for time series data are also relatively well developed [6].


However, there is a lack of efficient methods that can provide decent data augmentation for numeric datasets with a moderate to high degree of collinearity. Such datasets are widespread in experimental research, including various types of spectroscopic data, results of genome sequencing (e.g., 16S rRNA), and many others. Many tabulated datasets also exhibit internal structures in which variables are mutually correlated. Currently available methods for the augmentation of such data mostly rely on adding various forms of noise [7] to the existing measurements, which is not always sufficient. There are also promising methods that utilize variational autoencoders by randomly sampling from their latent variable space [8], or methods based on generative adversarial networks [4]. The downside is that both approaches require building and tuning a dedicated neural network model for the data augmentation and hence need a thorough, resource-demanding optimization process as well as a relatively large initial training set.


In this paper, we propose a simple, fast, versatile, yet efficient method for augmenting numeric and mixed collinear datasets. The method is based on an approach that was initially developed for other purposes, specifically for generating validation sets, and hence is known as Procrustes cross-validation [9], [10]. However, as demonstrated in this paper, it effectively addresses the data augmentation problem, resulting in models with significantly improved prediction or classification performance.


Our method directly leverages collinearity in the generation procedure. It fits the training data with a set of latent variables and then employs cross-validation resampling to measure variation in the orientation of these latent variables. This variation is then introduced to the training set as sampling error, resulting in a new set of data points.


Two fitting models can be employed: singular value decomposition (SVD) and partial least squares (PLS) decomposition. The choice of the fitting model allows the user to prioritize the part of the covariance structure that will be used for generating the new data.
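To make the idea more concrete, below is a minimal NumPy sketch of the SVD-based variant as we read the description above: global loadings are computed from the whole training set, local loadings are refitted in a cross-validation loop, and the held-out rows are rebuilt from their local scores combined with the global loadings. This is an illustrative sketch under our own simplifying assumptions (the function name, sign alignment, and mean-centering choices are ours), not the authors' reference implementation; categorical variables, scaling, and the PLS-based variant are deliberately left out.

```python
import numpy as np

def generate_pv_set(X, n_comp, n_seg, seed=None):
    """Illustrative sketch of an SVD-based pseudo-validation (PV) set.

    Core idea only: combine scores obtained from local (cross-validation)
    models with the loadings of the global model, so that the sampling
    variation measured by resampling is injected into the generated rows.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mx = X.mean(axis=0)
    Xc = X - mx

    # global model: first n_comp right singular vectors (loadings)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T                                   # p x n_comp

    # random split into n_seg cross-validation segments
    idx = rng.permutation(n)
    segments = np.array_split(idx, n_seg)

    X_pv = np.empty_like(Xc)
    for seg in segments:
        mask = np.ones(n, dtype=bool)
        mask[seg] = False

        # local model fitted without the current segment
        _, _, Vt_k = np.linalg.svd(Xc[mask], full_matrices=False)
        P_k = Vt_k[:n_comp].T

        # align signs of local loadings with the global ones
        signs = np.sign(np.sum(P_k * P, axis=0))
        signs[signs == 0] = 1.0
        P_k = P_k * signs

        # scores and residuals of the held-out rows in the local model
        T_k = Xc[seg] @ P_k
        E_k = Xc[seg] - T_k @ P_k.T

        # new rows: local scores combined with global loadings + residuals
        X_pv[seg] = T_k @ P.T + E_k

    return X_pv + mx
```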


Both fitting models have two parameters: the number of latent variables and the number of segments used for cross-validation resampling. The experiments show, however, that these parameters do not require specific tuning. Any number of latent variables large enough to capture the systematic variation of the training set values serves equally well, as does any number of segments starting from three.
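As a hypothetical usage of the sketch above, one could generate several augmented copies with different random splits and stack them with the original rows; following the paper's observation, the exact number of segments (three or more) should not matter much. Names and sizes below are made up for illustration, not taken from the paper.

```python
# Hypothetical usage of the generate_pv_set() sketch above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 100))     # small placeholder training set

copies = [generate_pv_set(X_train, n_comp=10, n_seg=4, seed=s) for s in range(5)]
X_aug = np.vstack([X_train, *copies])    # 240 rows instead of 40
```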


The proposed method is versatile and can be applied both to fully numeric data and to tabulated data where one or several variables are qualitative. This opens another perspective, namely data mocking, which can be useful, e.g., for testing high-load software systems, although we do not consider this aspect here.


The paper describes the theoretical foundations of the method and illustrates its practical application and performance based on two datasets of different nature. It provides comprehensive details on how the method can be effectively applied to diverse datasets in real-world scenarios.


We have implemented the method in several programming languages, including Python, R, MATLAB, and JavaScript, and all implementations are freely available in the GitHub repository (https://github.com/svkucheryavski/pcv). Additionally, we provide an online version where one can generate new data points directly in a browser (https://mda.tools/pcv).


This paper is available on arxiv under CC BY 4.0 DEED license.