Table of Links
- Classification Target
- Masked Conditional Density Estimation (MaCoDE)
- With Missing Data
- Related Works
- Conclusions and Limitations
- References
- A2 Proof of Proposition 1
- A3 Dataset Descriptions
- A5 Experimental Settings for Reproduction
3. Experiments
3.1 Overview
Questions. We design experiments to answer the following three questions:
Q1. Does MaCoDE achieve state-of-the-art performance in synthetic data generation?
Q2. Can MaCoDE generate high-quality synthetic data even when faced with missing data scenarios?
Q3. Is MaCoDE capable of supporting multiple imputations for deriving statistically valid inferences from missing data?
Datasets. Following several recent studies [18, 34, 37, 36, 54, 20], we use 10 real tabular datasets of varying sizes from the UCI and Kaggle† repositories. Each dataset is split into training and testing sets at an 80%/20% ratio. Detailed statistics of these datasets are provided in Appendix A.3. Note that we include the covtype dataset, which comprises approximately 580K rows, to demonstrate the scalability of our proposed model. Unless otherwise stated, we set L = 50 and τ = 1 for all datasets to assess the generalizability of our proposed model.‡
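For concreteness, the sketch below shows one way the 80%/20% split and the shared hyperparameters could be set up. The function name, CSV loading, and fixed seed are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the per-dataset split described above (assumptions noted in the text).
import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_split(csv_path: str, seed: int = 0):
    """Load one tabular dataset and split it 80%/20% into train/test sets."""
    df = pd.read_csv(csv_path)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=seed)
    return train_df, test_df

# Hyperparameters used for all datasets unless otherwise stated (Section 3.1):
# L bins for discretizing continuous columns and sampling temperature tau.
L, TAU = 50, 1.0
```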
Baseline models. For Q1 and Q2, we compare MaCoDE with CTGAN [51], TVAE [51], CTAB-GAN [56], CTAB-GAN+ [57], DistVAE [1], TabDDPM [24], and TabMT [13]. For Q3, we select the following multiple imputation models that can handle heterogeneous tabular datasets: MICE [48], GAIN [52], missMDA [21], VAEAC [19], MIWAE [34], not-MIWAE [18], and EGC [55]. Detailed experimental settings for these baseline models are provided in Appendix A.5.
Additional evaluations. Due to space limitations, comprehensive experimental settings and results on privacy preservability and on sensitivity analyses varying the temperature parameter τ and the missingness rate are presented in Appendices A.6.1 and A.6.2, respectively.
3.2 Evaluation Metrics
For all metrics, we report the mean and the standard error of the mean (error bar) across 10 different random seeds and 10 datasets.
Q1. To evaluate the quality of generated synthetic data, we employ two metrics: statistical fidelity [40] and machine learning utility [15]. For statistical fidelity, we utilize the Kullback–Leibler divergence (KL) and the Goodness-of-Fit (GoF) test (continuous: the two-sample Kolmogorov–Smirnov test statistic, categorical: the Chi-squared test statistic) to assess marginal distributional similarity. Additionally, we employ the Maximum Mean Discrepancy (MMD) and 2-Wasserstein distance (WD) to measure joint distributional similarity. Regarding machine learning utility, we adopt four specific metrics outlined in [15]: regression performance (SMAPE, symmetric mean absolute percentage error), classification performance (F1), model selection performance (Model), and feature selection performance (Feature). Refer to Appendix A.5 for the detailed evaluation procedure.
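As a concrete illustration of the marginal-fidelity part of this evaluation, the sketch below computes the two-sample Kolmogorov–Smirnov statistic for continuous columns and the Chi-squared statistic for categorical columns. The function name, column-type lists, and the epsilon guard are assumptions for illustration; the joint metrics (MMD, WD) and the utility metrics are omitted for brevity.

```python
# A minimal sketch of the marginal-fidelity metrics (KS and Chi-squared statistics);
# the authors' exact implementation may differ.
import numpy as np
import pandas as pd
from scipy import stats

def marginal_fidelity(real: pd.DataFrame, synth: pd.DataFrame,
                      continuous: list[str], categorical: list[str]) -> dict:
    scores = {}
    for col in continuous:
        # Two-sample KS statistic: sup distance between the empirical CDFs.
        scores[col] = stats.ks_2samp(real[col], synth[col]).statistic
    for col in categorical:
        cats = sorted(set(real[col]) | set(synth[col]))
        # Expected counts come from the real marginal, rescaled to the synthetic
        # sample size; a small epsilon avoids division by zero for unseen categories.
        real_p = real[col].value_counts(normalize=True).reindex(cats, fill_value=0)
        obs = synth[col].value_counts().reindex(cats, fill_value=0).to_numpy(float)
        exp = real_p.to_numpy(float) * len(synth) + 1e-8
        scores[col] = float(np.sum((obs - exp) ** 2 / exp))
    return scores
```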
Q2. To assess whether our proposed model can generate high-quality synthetic data even when the training dataset contains missing values, we train it on masked (i.e., incomplete) training datasets and evaluate the resulting synthetic data on downstream regression and classification tasks on the test dataset. We use SMAPE for regression tasks and the F1 score for classification tasks. See Appendix A.5 for the detailed evaluation procedure.
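The sketch below illustrates one way such a "train on synthetic, test on real" evaluation could be run; the choice of random-forest models, the macro-averaged F1, and the SMAPE helper are assumptions for illustration rather than the paper's exact protocol.

```python
# A minimal sketch of downstream utility evaluation under stated assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric mean absolute percentage error (lower is better)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / np.maximum(denom, 1e-8)))

def regression_utility(synth_X, synth_y, test_X, test_y, seed: int = 0) -> float:
    """Train a regressor on synthetic data, report SMAPE on the real test set."""
    reg = RandomForestRegressor(random_state=seed).fit(synth_X, synth_y)
    return smape(np.asarray(test_y), reg.predict(test_X))

def classification_utility(synth_X, synth_y, test_X, test_y, seed: int = 0) -> float:
    """Train a classifier on synthetic data, report macro F1 on the real test set."""
    clf = RandomForestClassifier(random_state=seed).fit(synth_X, synth_y)
    return f1_score(test_y, clf.predict(test_X), average="macro")
```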
Q3. We assess the effectiveness of multiple imputation through interval inference for the population mean, as proposed by Rubin [41]. We report the bias, coverage, and confidence interval length; the evaluation procedure for multiple imputation is outlined in Algorithm 3 [47]. Following [36, 20, 54], we generate the missing-value mask for each dataset under three mechanisms (MCAR, MAR, MNAR) across four settings. See Appendix A.5 for the detailed missing-value generation mechanisms. For all missing-data mechanisms, the missingness rate is set to 30% unless otherwise specified.
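For reference, the following sketch applies Rubin's combining rules [41] to M completed-data means and summarizes the quantities reported above (bias, coverage, and confidence-interval length). The function name and inputs are ours for illustration, and it assumes M ≥ 2 imputations with nonzero between-imputation variance; Algorithm 3 of [47] specifies the exact procedure.

```python
# A minimal sketch of Rubin's rules for interval inference on a population mean.
import numpy as np
from scipy import stats

def rubin_interval(imputed_columns: list, true_mean: float, alpha: float = 0.05) -> dict:
    """Pool M completed-data means; return bias, coverage, and CI length."""
    M = len(imputed_columns)                       # number of imputations (M >= 2)
    n = len(imputed_columns[0])                    # sample size per completed dataset
    q = np.array([col.mean() for col in imputed_columns])           # point estimates
    u = np.array([col.var(ddof=1) / n for col in imputed_columns])  # within variances
    q_bar, u_bar = q.mean(), u.mean()
    b = q.var(ddof=1)                              # between-imputation variance (> 0 assumed)
    t_var = u_bar + (1 + 1 / M) * b                # total variance
    df = (M - 1) * (1 + u_bar / ((1 + 1 / M) * b)) ** 2
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(t_var)
    return {"bias": q_bar - true_mean,
            "coverage": float(abs(q_bar - true_mean) <= half),
            "ci_length": 2 * half}
```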
Remark 3. We acknowledge that existing imputation methods have been evaluated using RMSE (root mean square error), a metric for assessing single imputation methods [18, 34, 37, 36, 54, 20]. However, since our objective is distributional learning rather than recovering missing values, we adopt the evaluation procedure proposed in [41, 47], which is better suited for assessing multiple imputation methods.
Authors:
(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);
(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).
† https://archive.ics.uci.edu/, https://www.kaggle.com/datasets/
‡ We run experiments on an NVIDIA A10 GPU, and our experimental code is implemented in PyTorch.