Why MaCoDE Outperforms GANs in Tabular Data Generation

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

A.4 Missing Mechanism

Data missingness is a common challenge in research and practical analysis, categorized into three primary missing mechanisms: (1) Missing Completely at Random (MCAR), (2) Missing at Random (MAR), and (3) Missing Not at Random (MNAR).


Under the MCAR mechanism, the missingness is unrelated to the data, whether observed or unobserved; the likelihood of a value being missing is the same across all observations. The primary advantage of MCAR is that it does not introduce bias into the data analysis. However, despite this advantage, the missingness can still reduce the statistical power of a study through the reduced sample size.


MAR occurs when the probability of missingness depends on the observed data but not on the unobserved missing values. In other words, the missingness is explainable by other, observed variables in the dataset, so it can be modeled and imputed using the information available in the data, allowing for more accurate analyses despite the missingness.


Table 4: Description of datasets. #continuous represents the number of continuous and ordinal variables. #categorical denotes the number of categorical variables. The ‘Classification Target’ refers to the variable used as the response variable in a classification task to evaluate machine learning utility.


Lastly, missingness that is neither MCAR nor MAR is classified as MNAR. MNAR is the most challenging mechanism, as it implies that the missingness depends on the unobserved data itself. In this case, the missing data are systematically different from the observed data, which introduces bias if not properly accounted for. For example, patients with severe symptoms may be less likely to report their health status, so precisely those records are missing. MNAR requires sophisticated statistical methods, as ignoring or improperly handling it can lead to biased and unreliable results.

A.5 Experimental Settings for Reproduction


Table 5: The hyper-parameter search space for MaCoDE. We use the best-performing combination of hyper-parameters within this search space; the selected values are shown in bold.


A.5.1 Details of Implementing Baseline Models


• CTGAN and TVAE [51]: We rely on the official implementation§ of CTGAN and TVAE, adopting the default hyperparameters predefined in the module. To ensure a fair comparison, we set the latent dimension of CTGAN's generator to 100 and likewise increase TVAE's latent dimension to 100.


• CTAB-GAN and CTAB-GAN+ [56, 57]: We adhere to the official implementations¶. The latent dimension is fixed at 100, and the maximum number of clusters is set to 10. Since the specification of mixed-type and general-type features is discretionary, we use continuous, integer, and categorical variables.


• DistVAE [1]: We follow the official implementation||, but change the latent dimension to 100 for a fair comparison with the other models.


• TabDDPM [24]: We utilize the TabDDPM module in synthcity.Plugins** for synthetic data generation.


• TabMT [13]: Rather than K-means clustering, we utilize a Gaussian Mixture Model (GMM) to discretize continuous columns, with the objective of preserving the original continuous domain. We determine the optimal number of GMM components, ranging from 2 to 10, based on the Bayesian Information Criterion (BIC). When generating a continuous column, we first predict the component label and then sample from the selected Gaussian component (see the discretization sketch after this list).


• MICE [48]: We employ the IterativeImputer package from Scikit-learn for multiple imputation by chained equations. Following the authors' experiments, a max_iter range of 10 to 20 was considered sufficient for convergence, and we adopt this setting. Additionally, to introduce randomness, we set imputation_order to random, which selects a random variable for imputation in each iteration. The remaining parameters are left at their default values to maintain the integrity of the MICE implementation (a minimal usage sketch follows this list).


• GAIN [52]: We adhere to the official implementation††. As the paper does not explicitly discuss handling categorical and continuous variables separately, the code treats them identically; a rounding step is therefore applied afterward to recover categorical values.


• missMDA [21]: We utilize the missMDA package‡‡ in R for multiple imputation. However, we encountered out-of-memory issues with the covtype and letter datasets among the 10 datasets used in our experiments, so we only report results for the remaining 8 datasets when evaluating missMDA. For the results related to Q3 on these 8 datasets, please refer to Table 18.


• VAEAC [19]: We adhere to the official implementations§§. The authors provided hyperparameters that adequately address both continuous and categorical variables, so we used these without further modification during model fitting.


• MIWAE [34]: The implemented MIWAE code¶¶ was designed for continuous variables only. To accommodate heterogeneous tabular datasets, we treated the conditional distribution of categorical columns as categorical distributions and employed cross-entropy loss for reconstruction. For comparison with not-MIWAE, we set the latent dimension to p − 1.


• not-MIWAE [18]: The implemented not-MIWAE code*** also focused on continuous variables exclusively. To handle categorical variables, we made the same modifications as in MIWAE. For comparison with MIWAE, we set the latent dimension to p − 1. Training was conducted for 100K steps, consistent with the official implementations.


• EGC [55]: We utilize the gcimpute package††† to implement EGC. However, we encountered difficulties fitting EGC to the concrete, kings, and loan datasets among the 10 datasets used in our experiments; therefore, we only report results for the remaining 7 datasets when evaluating EGC. Please refer to Table 19 for the outcomes related to Q3 on these 7 datasets.
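To make the TabMT discretization step above concrete, here is a minimal sketch of BIC-based GMM discretization of a single continuous column and of sampling back into the continuous domain. This illustrates the described procedure rather than the authors' code; the column data and function names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_discretizer(x, max_components=10, seed=0):
    """Select the number of GMM components (2..10) by BIC and
    return the fitted model plus per-value component labels."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    best_gmm, best_bic = None, np.inf
    for k in range(2, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(x)
        bic = gmm.bic(x)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm, best_gmm.predict(x)

def sample_from_component(gmm, label, rng):
    """Generate a continuous value by sampling from the selected
    Gaussian component, preserving the original continuous domain."""
    mean = gmm.means_[label, 0]
    std = np.sqrt(gmm.covariances_[label].ravel()[0])
    return rng.normal(mean, std)

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=1000)        # illustrative continuous column
gmm, labels = fit_gmm_discretizer(x)
new_value = sample_from_component(gmm, labels[0], rng)
```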
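The MICE configuration described above maps directly onto scikit-learn. A minimal usage sketch, assuming a purely numerical input matrix; the data and missingness pattern are illustrative:

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan    # illustrative missing entries

# max_iter in [10, 20] was considered sufficient for convergence;
# imputation_order="random" imputes a randomly chosen variable per round.
imputer = IterativeImputer(max_iter=10, imputation_order="random",
                           random_state=0)
X_imputed = imputer.fit_transform(X)
```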


Table 6: The number of model parameters.


A.5.2 Evaluation Settings for Q1, Q2 and Q3


• Evaluation procedure for Q1: For machine learning utility, we conduct regression and classification tasks for the data-replacement evaluation.


We assess regression prediction performance by using each continuous column in turn as the target variable and averaging the SMAPE values obtained with a Random Forest regressor. We assess classification prediction performance on the classification target variable (see Table 4) using the average F1 score of five classifiers: Logistic Regression, Gaussian Naive Bayes, a K-Nearest Neighbors classifier, a Decision Tree classifier, and a Random Forest classifier (refer to Table 7 for detailed configurations).
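As an illustration of the SMAPE-based regression utility, here is a minimal sketch; the train-on-synthetic/evaluate-on-real protocol, the pandas interface, and the SMAPE normalization are our assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error."""
    denom = np.abs(y_true) + np.abs(y_pred)
    denom = np.where(denom == 0, 1e-8, denom)   # guard against 0/0
    return np.mean(2.0 * np.abs(y_true - y_pred) / denom)

def regression_utility(synthetic, test, continuous_cols, seed=0):
    """Average SMAPE over runs that use each continuous column
    in turn as the regression target."""
    scores = []
    for target in continuous_cols:
        features = [c for c in synthetic.columns if c != target]
        reg = RandomForestRegressor(random_state=seed)
        reg.fit(synthetic[features], synthetic[target])
        scores.append(smape(test[target].to_numpy(),
                            reg.predict(test[features])))
    return float(np.mean(scores))

# illustrative usage with random data
cols = ["a", "b", "c"]
synth = pd.DataFrame(np.random.default_rng(0).normal(size=(100, 3)), columns=cols)
test = pd.DataFrame(np.random.default_rng(1).normal(size=(50, 3)), columns=cols)
print(regression_utility(synth, test, continuous_cols=cols))
```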


For model-selection evaluation, we train classifiers on both the original training dataset and the synthetic dataset and evaluate their classification performance on the test dataset. We assess effectiveness by comparing the two AUROC rank orderings using Spearman's rank correlation. Similarly, for feature-selection evaluation, we train a Random Forest classifier on the original training dataset to obtain a rank ordering of feature importances, and compare it with the rank ordering obtained from the same model type trained on the synthetic data, again using Spearman's rank correlation.
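The rank-agreement comparison reduces to a few lines. A sketch, assuming the AUROC scores (or feature importances) of each candidate model have already been computed on both the real and the synthetic side; the numbers are illustrative:

```python
from scipy.stats import spearmanr

# AUROC of each candidate classifier when trained on real vs. synthetic
# data (one entry per model; illustrative values).
auroc_real  = [0.91, 0.85, 0.78, 0.88, 0.93]
auroc_synth = [0.89, 0.84, 0.80, 0.86, 0.92]

# Spearman's rank correlation compares the two rank orderings directly;
# a value near 1 means model selection on synthetic data is reliable.
rho, _ = spearmanr(auroc_real, auroc_synth)
print(f"model-selection rank correlation: {rho:.3f}")
```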


• Evaluation procedure for Q2: Our evaluation process is based on [13, 15]. The dataset is first split into training and test sets across 10 different random seeds. For each seed, we generate a mask, fit the model on the masked training dataset, and generate synthetic data. The synthetic data is then assessed by its performance on a downstream task.


For regression tasks, we evaluate the synthesizers by using each continuous column in turn as the target variable and averaging the SMAPE values obtained with a Random Forest regressor. For classification tasks, we evaluate the synthesizers using the average F1 score of five classifiers on the classification target variable (see Table 4): Logistic Regression, Gaussian Naive Bayes, a K-Nearest Neighbors classifier, a Decision Tree classifier, and a Random Forest classifier (refer to Table 7 for detailed configurations). Finally, we aggregate these metrics across all seeds and report the average with error bars.
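A minimal sketch of the five-classifier F1 average; the train-on-synthetic/evaluate-on-real protocol, the macro-averaging choice, and the (mostly default) hyperparameters are our assumptions, with the exact configurations given in Table 7:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

CLASSIFIERS = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
]

def classification_utility(X_synth, y_synth, X_test, y_test):
    """Average F1 of five classifiers, each trained on synthetic data
    and evaluated on the held-out real test set."""
    scores = []
    for clf in CLASSIFIERS:
        clf.fit(X_synth, y_synth)
        # average="macro" is an illustrative choice here
        scores.append(f1_score(y_test, clf.predict(X_test), average="macro"))
    return float(np.mean(scores))
```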


• Evaluation procedure for Q3: We assess the effectiveness of multiple imputation by employing interval inference for the population mean, as proposed by Rubin [41]. Since the population mean is not available, we use the column-wise sample mean of the complete dataset as the parameter of interest [26, 53]. The evaluation procedure for multiple imputation is outlined in Algorithm 3, and we report the mean and standard error of the bias, coverage, and confidence-interval length across 10 different random seeds, all continuous columns, and all datasets [47]. Note that we do not split the data into training and test sets [18].
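Rubin's interval inference pools the m imputed estimates with his classic combining rules; a minimal sketch for the column-wise mean (the imputed samples below are illustrative):

```python
import numpy as np
from scipy import stats

def rubin_interval(imputed_columns, alpha=0.05):
    """Pool m estimates of a column mean via Rubin's rules and
    return the point estimate and its confidence interval."""
    m = len(imputed_columns)
    n = len(imputed_columns[0])
    q = np.array([col.mean() for col in imputed_columns])          # estimates
    u = np.array([col.var(ddof=1) / n for col in imputed_columns])  # within-var
    q_bar, u_bar = q.mean(), u.mean()
    b = q.var(ddof=1)                        # between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b          # Rubin's total variance
    # Rubin's degrees of freedom (guard against b == 0)
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * max(b, 1e-12))) ** 2
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(t_var)
    return q_bar, (q_bar - half, q_bar + half)

rng = np.random.default_rng(0)
imputations = [rng.normal(5.0, 1.0, size=500) for _ in range(10)]  # m = 10
print(rubin_interval(imputations))
```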


Table 7: Regressors and classifiers used to evaluate synthetic data quality for machine learning utility. All parameter names in the descriptions match those defined in the corresponding packages.


Following [36, 20, 54], we generate the missing-value mask for each dataset under three mechanisms in four settings. (MCAR) In the MCAR setting, each value is masked according to the realization of a Bernoulli random variable with a fixed parameter. (MAR) In the MAR setting, for each experiment, a fixed subset of variables that cannot have missing values is sampled; the remaining variables then receive missing values according to a logistic model with random weights that takes the non-missing variables as inputs, and a bias term is fitted by line search to attain the desired proportion of missing values. (MNAR) Finally, two different mechanisms are implemented in the MNAR setting. The first, MNARL, is identical to the MAR mechanism above, except that the inputs of the logistic model are themselves masked by an MCAR mechanism, so the logistic model's outcome depends on missing values. The second, MNARQ, samples a subset of variables whose values in the lower and upper p-th percentiles are masked according to a Bernoulli random variable, while values in between are left observed.
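To ground the mask-generation description above, here is a minimal sketch of the MCAR and MNARQ mechanisms; the MAR and MNARL variants (the logistic model with a line-searched bias) are omitted for brevity, and the masking rates and column choices are illustrative:

```python
import numpy as np

def mcar_mask(X, rate, rng):
    """Each cell is masked independently with fixed probability `rate`."""
    return rng.random(X.shape) < rate

def mnarq_mask(X, cols, p, rate, rng):
    """Mask values of the selected columns that fall in the lower or
    upper p-th percentile, each with probability `rate`; values in
    between are left observed."""
    mask = np.zeros(X.shape, dtype=bool)
    for j in cols:
        lo, hi = np.percentile(X[:, j], [p, 100 - p])
        extreme = (X[:, j] <= lo) | (X[:, j] >= hi)
        mask[:, j] = extreme & (rng.random(X.shape[0]) < rate)
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
m1 = mcar_mask(X, rate=0.3, rng=rng)                      # MCAR at 30%
m2 = mnarq_mask(X, cols=[0, 2], p=25, rate=0.5, rng=rng)  # MNARQ example
```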


Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

§https://github.com/sdv-dev/CTGAN


¶ https://github.com/Team-TUD/CTAB-GAN, https://github.com/Team-TUD/CTAB-GAN-Plus


||https://github.com/an-seunghwan/DistVAE


**https://github.com/vanderschaarlab/synthcity


††https://github.com/jsyoon0823/GAIN/tree/master


‡‡https://cran.r-project.org/web/packages/missMDA/index.html


§§https://github.com/tigvarts/vaeac


¶¶https://github.com/pamattei/miwae


***https://github.com/nbip/notMIWAE


†††https://github.com/udellgroup/gcimpute/tree/master
