Table of Links
- Classification Target
- Masked Conditional Density Estimation (MaCoDE)
- With Missing Data
- Related Works
- Conclusions and Limitations
- References
- A2 Proof of Proposition 1
- A3 Dataset Descriptions
- A5 Experimental Settings for Reproduction
3.3 Results
Q1. As shown in Table 1, MaCoDE consistently achieves the highest scores in both joint distributional similarity and machine learning utility, while remaining competitive in marginal distributional similarity. This underscores the effectiveness of synthetic data generation based on conditional distribution estimation, both in preserving the joint statistical fidelity of the original data and in enhancing the utility of synthetic data for downstream machine learning tasks. Notably, MaCoDE performs remarkably well on the feature selection downstream task.
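To make the machine learning utility protocol concrete, below is a minimal sketch of a train-on-synthetic, test-on-real evaluation. The DataFrames `synth_df` and `real_test_df`, the `target` column name, and the choice of classifier are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of a train-on-synthetic / test-on-real utility check.
# `synth_df`, `real_test_df`, and `target` are hypothetical placeholders;
# features are assumed to be numeric (categoricals already encoded).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def ml_utility_f1(synth_df, real_test_df, target: str) -> float:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Fit only on synthetic data, never on the real training split.
    clf.fit(synth_df.drop(columns=[target]), synth_df[target])
    pred = clf.predict(real_test_df.drop(columns=[target]))
    # Macro F1 on held-out real data measures how useful the synthetic
    # data is for the downstream classification task.
    return f1_score(real_test_df[target], pred, average="macro")
```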
Remark 4 (How does MaCoDE achieve such remarkable performance in feature selection?). We attribute the strong performance on the feature selection downstream task to our emphasis on estimating 'conditional' distributions. In Random Forest [4], each node of a decision tree represents the conditional distribution of a variable given the splits made by the tree up to that node, and feature importance is determined by node purity. Therefore, accurately estimating conditional distributions translates directly into better preservation of the feature importance ranking.
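A minimal sketch of how feature-importance rank preservation can be quantified with Random Forests, assuming hypothetical `real_df`/`synth_df` DataFrames with numeric features and a shared `target` column; this illustrates the evaluation idea, not the paper's exact metric.

```python
# Compare Random Forest feature-importance rankings fitted on real vs.
# synthetic data; rank agreement is summarized by Spearman correlation.
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

def feature_importances(df: pd.DataFrame, target: str) -> pd.Series:
    X, y = df.drop(columns=[target]), df[target]
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    # Impurity-based importances: mean decrease in node purity per feature.
    return pd.Series(rf.feature_importances_, index=X.columns)

def rank_agreement(real_df: pd.DataFrame, synth_df: pd.DataFrame, target: str) -> float:
    real_imp = feature_importances(real_df, target)
    synth_imp = feature_importances(synth_df, target).reindex(real_imp.index)
    rho, _ = spearmanr(real_imp, synth_imp)  # 1.0 means identical ranking
    return rho
```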
Q2. The missing data mechanism within the parentheses refers to the mechanism applied to the training dataset on which MaCoDE was trained. Table 2 illustrates that, despite encountering missing data scenarios such as MCAR, MAR, MNARL, and MNARQ, MaCoDE generates high-quality synthetic data in terms of machine learning utility while handling incomplete training datasets without significant performance degradation. Even in the presence of missing entries, MaCoDE achieves better SMAPE scores than most baseline models, with TabDDPM being the only exception. Additionally, concerning the F1 score, MaCoDE is competitive with, or outperforms, the other baseline models.
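For reference, a minimal sketch of how MCAR and MAR missingness can be injected into a numeric data matrix for this kind of experiment. The masking rules below are illustrative assumptions, not the paper's exact protocol (MNARL and MNARQ would additionally make the missingness depend on the masked values themselves).

```python
# Inject MCAR and MAR missingness into a numeric array.
import numpy as np

def mcar_mask(X: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    # MCAR: every entry is missing independently with probability `rate`.
    return rng.random(X.shape) < rate

def mar_mask(X: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    # MAR: missingness of columns 1..d-1 depends only on the fully
    # observed column 0 (larger values of X[:, 0] -> more missingness).
    z = (X[:, :1] - X[:, :1].mean()) / (X[:, :1].std() + 1e-8)
    prob = 1.0 / (1.0 + np.exp(-z))                       # per-row propensity
    prob = np.clip(prob * rate / prob.mean(), 0.0, 1.0)   # calibrate mean rate
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, 1:] = rng.random((X.shape[0], X.shape[1] - 1)) < prob
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X_incomplete = np.where(mar_mask(X, 0.3, rng), np.nan, X)
```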
Q3. The missing data mechanism within the parentheses refers to the mechanism applied to the dataset on which MaCoDE was trained. Table 3 indicates that MaCoDE consistently exhibits competitive performance against all baseline models across the metrics assessing multiple imputation performance: bias, coverage, and confidence interval length. This suggests that our proposed approach can support multiple imputation for deriving statistically valid inferences from missing data under the MAR mechanism.
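The interval estimates behind these coverage and length metrics are conventionally obtained by pooling the completed-data analyses with Rubin's rules [41]. A minimal sketch, assuming the per-imputation point estimates and their variances are given:

```python
# Pool M imputed-data estimates with Rubin's rules to get a confidence
# interval whose bias, coverage, and length can then be evaluated.
import numpy as np
from scipy import stats

def rubin_pool(estimates, variances, alpha=0.05):
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)                         # number of imputations (m >= 2)
    q_bar = estimates.mean()                   # pooled point estimate
    u_bar = variances.mean()                   # within-imputation variance
    b = max(estimates.var(ddof=1), 1e-12)      # between-imputation variance
    t_var = u_bar + (1 + 1 / m) * b            # total variance
    # Degrees of freedom from Rubin and Schenker (1986).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(t_var)
    return q_bar, (q_bar - half, q_bar + half)
```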
Sensitivity analysis. We also conducted a sensitivity analysis by varying the missingness rate of the kings dataset. Figure 3(a) illustrates that, in terms of SMAPE, MaCoDE maintains competitive performance even as the missingness rate increases. Concerning the F1 score, MaCoDE outperforms other models at missingness rates of 0.1 and 0.3, but its performance declines once the missingness rate reaches 0.5. Additionally, regarding the multiple imputation performance shown in Figure 3(b), MaCoDE consistently demonstrates competitive performance compared to other imputation models without significant degradation, despite the increase in missingness rate.
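For completeness, the SMAPE metric referenced throughout is conventionally defined as below; the paper's exact variant (e.g., its scaling) may differ.

```python
# Symmetric mean absolute percentage error (SMAPE), in percent.
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Treat 0/0 terms as zero error; guard the division for stability.
    ratio = np.where(denom == 0, 0.0,
                     np.abs(y_pred - y_true) / np.maximum(denom, 1e-12))
    return 100.0 * float(ratio.mean())
```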
4. Related Works
Synthetic tabular data generation. Deep generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have been widely employed for synthetic tabular data generation. CTGAN [51], TVAE [51], CTAB-GAN [56], CTAB-GAN+ [57], DistVAE [1], and GOGGLE [31] employ VAEs or GANs for their ability to represent complex data distributions. Recently, diffusion-based or score-based generative models, such as TabDDPM [24], STaSy [23], and CoDi [25], have gained popularity due to their notable performance in approximating data distributions. However, these models often encounter challenges when dealing with incomplete training datasets, as they rely on complete data for training. In contrast, Transformer-based synthesizers such as TabPFGen [33], TabMT [13], and REaLTabFormer [45] can handle incomplete datasets.
Missing data imputation. Recent methods employ deep generative models to estimate a joint distribution and generate samples for imputation. GAN-based imputers such as GAIN [52] and MisGAN [30] adopt an adversarial learning approach to generate both missing entries and masking vectors. Other approaches, such as VAEAC [19], HI-VAE [37], and ReMasker [10], learn conditional distributions on arbitrary conditioning sets using a uniform masking strategy. MIWAE [34], based on the Importance Weighted Autoencoder [5], demonstrates that under mild conditions and the MAR assumption, the target likelihood can be approximated regardless of the imputation function. not-MIWAE [18] extends MIWAE to the MNAR setting by modeling missing entries as latent variables. In parallel, Optimal Transport (OT)-based methods employ distributional matching, utilizing the 2-Wasserstein distance to compare distributions in both the data and latent spaces [36, 54]. To handle mixed-type tabular datasets, [55] introduced the extended Gaussian copula (EGC), which relies on a latent Gaussian distribution to support single and multiple imputation.
5. Conclusions and Limitations
This paper introduces a novel approach to generating synthetic data for mixed-type tabular datasets. Our proposed method integrates histogram-based non-parametric conditional density estimation with the MLM-based approach. By demonstrating that our conditional density estimators for continuous columns are weakly consistent, we bridge the theoretical gap between distributional learning and the multi-class classification task of MLM. Although our primary goal is to generate synthetic data with high MLu, we empirically demonstrate that high joint statistical fidelity and high MLu are achieved simultaneously. Furthermore, empirical experiments validate that our proposed model can generate high-quality synthetic tabular datasets in terms of MLu even when given incomplete training datasets.
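To summarize the training idea in code, below is a minimal sketch of an MLM-style step on a discretized table: cells are randomly masked, and a classifier predicts each masked cell's bin index with a multi-class cross-entropy loss, so that each predicted bin distribution acts as a conditional density estimate given the observed cells. The model, shapes, and masking scheme are illustrative assumptions, not MaCoDE's exact architecture.

```python
# MLM-style training step on a table whose continuous columns have been
# discretized into histogram bins and categorical columns into codes.
import torch
import torch.nn.functional as F

def mlm_step(model, tokens: torch.Tensor, mask_token: int,
             mask_rate: float = 0.5) -> torch.Tensor:
    # tokens: (batch, n_cols) integer bin/category index for each cell.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_rate
    inputs = tokens.masked_fill(mask, mask_token)
    logits = model(inputs)                 # (batch, n_cols, n_bins)
    # Cross-entropy on masked cells only: each masked cell's predicted
    # bin distribution estimates its conditional distribution given the
    # unmasked cells. Missing entries can be treated as always masked.
    return F.cross_entropy(logits[mask], tokens[mask])
```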
However, in our context, 'arbitrary' conditional density estimation refers to estimating the conditional density for arbitrary combinations of conditioning sets and target variables, rather than to accommodating arbitrary types of continuous distributions. For example, if x has lower-bounded support and its CDF does not satisfy Assumption 1, then Q(0) = inf{x ∈ ℝ : 0 ≤ F(x)} = −∞, indicating that inverse transform sampling may generate out-of-support samples. Improving our proposed model to accommodate arbitrary types of continuous distributions is left for future work.
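To illustrate where the quantile function enters generation, here is a minimal sketch of inverse transform sampling from a histogram-based (piecewise-linear) CDF estimate. Anchoring the CDF at the first bin edge is a pragmatic way to keep samples within the observed support, not the paper's proposed remedy; the bin edges and bin probabilities below are placeholders for a model's output.

```python
# Inverse transform sampling from a histogram-based CDF estimate.
import numpy as np

def inverse_transform_sample(edges, probs, n, rng):
    # edges: bin boundaries (length B + 1); probs: bin masses (length B).
    cdf = np.concatenate([[0.0], np.cumsum(probs)])
    u = rng.random(n)                       # uniform levels in [0, 1)
    # Piecewise-linear inverse CDF: because the CDF is anchored at
    # edges[0], samples never fall below the first bin edge, avoiding
    # Q(0) = -inf for distributions with unbounded parametrizations.
    return np.interp(u, cdf, edges)

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 10.0, 11)          # 10 equal-width bins
probs = np.full(10, 0.1)                    # e.g., predicted bin masses
samples = inverse_transform_sample(edges, probs, 1000, rng)
```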
References
[1] Seunghwan An and Jong-June Jeon. Distributional learning of variational autoencoder: Application to synthetic data generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[2] Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Nicolas Chapados, and Alexandre Drouin. TACTiS-2: Better, faster, simpler attentional copulas for multivariate time series. In The Twelfth International Conference on Learning Representations, 2024.
[3] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[4] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
[5] Yuri Burda, Roger Baker Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[6] Shoja’eddin Chenouri, Majid Mojirsheibani, and Zahra Montazeri. Empirical measures for incomplete data with applications. Electronic Journal of Statistics, 3:1021–1038, 2009.
[7] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks. In Machine learning for healthcare conference, pages 286–305. PMLR, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.
[9] Alexandre Drouin, Étienne Marcotte, and Nicolas Chapados. TACTiS: Transformer-attentional copulas for time series. In International Conference on Machine Learning, 2022.
[10] Tianyu Du, Luca Melis, and Ting Wang. ReMasker: Imputing tabular data with masked autoencoding. In The Twelfth International Conference on Learning Representations, 2024.
[11] Kevin Fang, Vaikkunth Mugunthan, Vayd Ramkumar, and Lalana Kagal. Overcoming challenges of synthetic data generation. In 2022 IEEE International Conference on Big Data (Big Data), pages 262–270, 2022.
[12] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Conference on Empirical Methods in Natural Language Processing, 2019.
[13] Manbir S Gulati and Paul F Roysdon. TabMT: Generating tabular data with masked transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[14] Bruce E. Hansen. Autoregressive conditional density estimation. International Economic Review, 35:705–730, 1994.
[15] Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, and Andrija Petrovic. Reimagining synthetic tabular data generation through data-centric AI: A comprehensive benchmark. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[16] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73:120–135, 2000.
[17] Lucas Torroba Hennigen and Yoon Kim. Deriving language models from masked language models. In Annual Meeting of the Association for Computational Linguistics, 2023.
[18] Niels Bruun Ipsen, Pierre-Alexandre Mattei, and Jes Frellsen. not-MIWAE: Deep generative modelling with missing not at random data. In International Conference on Learning Representations, 2021.
[19] Oleg Ivanov, Michael Figurnov, and Dmitry Vetrov. Variational autoencoder with arbitrary conditioning. In International Conference on Learning Representations, 2019.
[20] Daniel Jarrett, Bogdan C Cebere, Tennison Liu, Alicia Curth, and Mihaela van der Schaar. HyperImpute: Generalized iterative imputation with automatic model selection. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9916–9937. PMLR, 17–23 Jul 2022.
[21] Julie Josse and François Husson. missMDA: A package for handling missing values in multivariate data analysis. Journal of Statistical Software, 70(1):1–31, 2016.
[22] Sanket Kamthe, Samuel A. Assefa, and Marc Peter Deisenroth. Copula flows for synthetic data generation. arXiv preprint arXiv:2101.00598, 2021.
[23] Jayoung Kim, Chaejeong Lee, and Noseong Park. STaSy: Score-based tabular data synthesis. In The Eleventh International Conference on Learning Representations, 2023.
[24] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.
[25] Chaejeong Lee, Jayoung Kim, and Noseong Park. CoDi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.
[26] Jin Hyuk Lee and J. Charles Huber. Evaluation of multiple imputation with large proportions of missing data: How much is too much? Iranian Journal of Public Health, 50:1372–1380, 2021.
[27] Nunzio Alexandro Letizia and Andrea M. Tonello. Copula density neural estimation. arXiv preprint arXiv:2211.15353, 2022.
[28] Ban Li, Senlin Luo, Xiaonan Qin, and Limin Pan. Improving GAN with inverse cumulative distribution function for tabular data synthesis. Neurocomputing, 456:373–383, 2021.
[29] Rui-Bing Li, Howard D. Bondell, and Brian J. Reich. Deep distribution regression. Computational Statistics & Data Analysis, 159:107203, 2021.
[30] Steven Cheng-Xian Li, Bo Jiang, and Benjamin Marlin. Learning from incomplete data with generative adversarial networks. In International Conference on Learning Representations, 2019.
[31] Tennison Liu, Zhaozhi Qian, Jeroen Berrevoets, and Mihaela van der Schaar. GOGGLE: Generative modelling for tabular data by learning relational structure. In The Eleventh International Conference on Learning Representations, 2023.
[32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[33] Junwei Ma, Apoorv Dankar, George Stein, Guangwei Yu, and Anthony Caterini. TabPFGen – tabular data generation with TabPFN. In NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
[34] Pierre-Alexandre Mattei and Jes Frellsen. MIWAE: Deep generative modelling and imputation of incomplete data sets. In International Conference on Machine Learning, 2019.
[35] Stan Matwin, Jordi Nin, Morvarid Sehatkar, and Tomasz Szapiro. A review of attribute disclosure control. In Advanced Research in Data Privacy, 2015.
[36] Boris Muzellec, Julie Josse, Claire Boyer, and Marco Cuturi. Missing data imputation using optimal transport. In International Conference on Machine Learning, 2020.
[37] Alfredo Nazábal, Pablo M. Olmos, Zoubin Ghahramani, and Isabel Valera. Handling incomplete heterogeneous data using VAEs. Pattern Recognition, 107:107501, 2020.
[38] Nathan Ng, Kyunghyun Cho, and Marzyeh Ghassemi. SSMBA: Self-supervised manifold based data augmentation for improving out-of-domain robustness. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1268–1283, Online, November 2020. Association for Computational Linguistics.
[39] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. Proceedings of the VLDB Endowment, 11:1071–1083, 2018.
[40] Zhaozhi Qian, Rob Davis, and Mihaela van der Schaar. Synthcity: a benchmark framework for diverse use cases of tabular synthetic data. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[41] Donald B. Rubin and Nathaniel Schenker. Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81:366–374, 1986.
[42] Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. Masked language model scoring. In Annual Meeting of the Association for Computational Linguistics, 2019.
[43] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
[44] M. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.
[45] Aivin V. Solatorio and Olivier Dupriez. REaLTabFormer: Generating realistic relational and tabular data using transformers. arXiv preprint arXiv:2302.02041, 2023.
[46] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10:557–570, 2002.
[47] Stef van Buuren. Flexible Imputation of Missing Data. Chapman and Hall/CRC, 2012.
[48] Stef van Buuren and Karin G. M. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45:1–67, 2011.
[49] Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Antoine Bosselut, Asli Celikyilmaz, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin, and Thomas Wolf, editors, Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[50] Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006.
[51] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[52] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: Missing data imputation using generative adversarial nets. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5689–5698. PMLR, 10–15 Jul 2018.
[53] Nanhua Zhang, Chunyan Liu, Steven J Steiner, Richard B Colletti, Robert N. Baldassano, Shiran Chen, Stanley Cohen, Michael D. Kappelman, Shehzad Ahmed Saeed, Laurie S. Conklin, Richard Strauss, Sheri Volger, Eileen C. King, and Kim Hung Lo. Using multiple imputation of real-world data to estimate clinical remission in pediatric inflammatory bowel disease. Journal of Comparative Effectiveness Research, 12, 2023.
[54] He Zhao, Ke Sun, Amir Dezfouli, and Edwin V. Bonilla. Transformed distribution matching for missing value imputation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 42159–42186. PMLR, 23–29 Jul 2023.
[55] Yuxuan Zhao, Alex Townsend, and Madeleine Udell. Probabilistic missing value imputation for mixed categorical and ordered data. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[56] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Y. Chen. CTAB-GAN: Effective table data synthesizing. In Vineeth N. Balasubramanian and Ivor Tsang, editors, Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research, pages 97–112. PMLR, 17–19 Nov 2021.
[57] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Yiyu Chen. CTAB-GAN+: Enhancing tabular data synthesis. Frontiers in Big Data, 2023.
Authors:
(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);
(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);
(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).