How Masked Language Modeling Can Be Used to Generate Synthetic Tabular Data

by Language Models (dot tech), April 8th, 2025
Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.
Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

Abstract

In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Given that the MLu performance relies on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We propose a novel synthetic data generation method, MaCoDE, by redefining the multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our proposed method enables estimating conditional densities across arbitrary combinations of target and conditional variables. Furthermore, we demonstrate that our proposed method bridges the theoretical gap between distributional learning and MLM. To validate the effectiveness of our proposed model, we conduct synthetic data generation experiments on 10 real-world datasets. Given the analogy between predicting masked input tokens in MLM and missing data imputation, we also evaluate the performance of multiple imputations on incomplete datasets with various missing data mechanisms. Moreover, our proposed model offers the advantage of enabling adjustments to data privacy levels without requiring re-training.

1. Introduction

There are two main objectives in synthetic data generation: (1) preserving the statistical characteristics of the original dataset and (2) achieving comparable machine learning utility (MLu) to the original dataset. However, achieving high statistical fidelity does not guarantee high MLu performance [15], which implies that the approach to synthetic data generation should align with the intended goal. In this paper, our focus is on generating synthetic data with high MLu performance. Given that MLu performance relies on accurately approximating the conditional distribution, we concentrate on devising a synthetic data generation method based on estimating the conditional distribution.


Thus, our primary contribution lies in introducing a novel masked learning method for the conditional density estimation of heterogeneous (mixed-type) tabular datasets. Our proposed method combines the approach of [29] with Masked Language Modeling (MLM) [8], diverging from existing methodologies, which focus on minimizing the discrepancy between the ground-truth distribution and the generative model [51, 56, 1, 24]. Note that [29] demonstrated that the conditional density estimation problem can be transformed into a multi-class classification problem.


MLM involves randomly masking a portion of input tokens during training to predict the original words based on their context. In the natural language domain, previous attempts to interpret MLM as distributional learning have been somewhat limited, relying on pseudo-likelihood or Markov random fields [12, 49, 42, 38, 17]. However, we redefine the multi-class classification task of MLM as histogram-based non-parametric conditional density estimation and show that our proposed method bridges the theoretical gap between distributional learning and the MLM approach. We term our proposed model MaCoDE (Masked Conditional Density Estimation).
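To make this reframing concrete, here is a minimal sketch (not the authors' implementation; `bin_index` and `sample_from_bins` are hypothetical names): once a continuous value has been mapped into [0, 1], predicting its histogram bin is an ordinary multi-class classification problem, and the predicted class probabilities define a piecewise-constant conditional density that can be sampled from.

```python
import numpy as np

rng = np.random.default_rng(0)

def bin_index(u, n_bins):
    """Map a CDF-transformed value u in [0, 1] to its histogram bin;
    the bin index serves as the multi-class classification target."""
    return min(int(u * n_bins), n_bins - 1)

def sample_from_bins(probs, n_bins):
    """Given predicted class probabilities over the bins, sample from
    the implied histogram density: pick a bin according to probs, then
    a point uniformly at random inside that bin."""
    b = rng.choice(n_bins, p=probs)
    return (b + rng.uniform()) / n_bins
```

In this view, the classifier's softmax output over the bins is exactly a non-parametric (histogram) estimate of the conditional density of the masked column given the unmasked ones.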


By adopting a non-parametric approach for conditional density estimation, our proposed model addresses the challenge of modeling non-uniform distributions of continuous columns [28, 11]. Since the histogram-based approach is theoretically valid only when the continuous variables have bounded supports [50, 29], we transform continuous columns using the Cumulative Distribution Function (CDF) and constrain their values to the interval [0, 1] [28, 11].
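A minimal sketch of this preprocessing step, using the rank-based empirical CDF (the function name is ours, not the paper's):

```python
import numpy as np

def empirical_cdf_transform(column):
    """Map a continuous column into (0, 1) via its empirical CDF:
    each value is replaced by (rank + 1) / (n + 1), which is monotone
    and has bounded support, as the histogram-based estimator
    requires. Ties are broken arbitrarily in this sketch."""
    ranks = np.argsort(np.argsort(column))   # 0 .. n-1
    return (ranks + 1) / (len(column) + 1)

x = np.array([3.2, -1.0, 7.5, 0.4])
u = empirical_cdf_transform(x)   # order-preserving values in (0, 1)
```

Samples drawn in the transformed space can then be mapped back to the original scale by inverting the empirical CDF, e.g. via quantiles of the training column.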


Furthermore, by employing a masking scheme in which the number of masked tokens is uniformly distributed, together with the model structure of BERT (a masked transformer encoder) [8], our proposed model enables the estimation of conditional densities across arbitrary combinations of target and conditional variables [12, 19, 37]. This allows our proposed model to generate the columns of tabular data in a random order, reflecting the fact that tabular data has no intrinsic column ordering [13]. This contrasts with existing auto-regressive density estimators, where the generation procedure proceeds in a fixed order of columns [14, 22, 27].
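This random-order generation can be sketched as follows, with `predict_masked` standing in for the trained masked model (a hypothetical interface, not the paper's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_row(n_cols, predict_masked, sample_value):
    """Generate one synthetic row by unmasking columns in a RANDOM
    order. `predict_masked(row, j)` stands in for the trained masked
    model: it returns class probabilities for column j given the
    currently unmasked entries of `row` (None = still masked).
    `sample_value` draws a value from those probabilities."""
    row = [None] * n_cols
    for j in rng.permutation(n_cols):    # random, not fixed, order
        row[j] = sample_value(predict_masked(row, j))
    return row
```

Because the model was trained with arbitrary masking patterns, any permutation of the columns yields a valid generation order, unlike a fixed-order auto-regressive factorization.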


Meanwhile, incomplete tabular data poses a widespread challenge in real-world data collection [47]. In MLM, masked input tokens can be viewed as missing data, and predicting the original words is analogous to imputation. In this context, [13] proposed TabMT, an MLM-based synthesizer capable of handling incomplete data. However, because TabMT predicts the K-means cluster index of masked entries, it faces challenges in distributional learning. In contrast, our proposed model estimates the conditional distribution while simultaneously accommodating arbitrary conditioning sets to address diverse missingness patterns, an essential capability for generating samples and performing multiple imputation [47, 19, 37]. We substantiate the effectiveness of our proposed method by evaluating its performance in both synthetic data generation and multiple imputation. This evaluation encompasses 10 real-world tabular datasets and various missing data mechanisms.
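Under the same hypothetical `predict_masked` interface as above, multiple imputation follows the same loop as generation but fills only the missing entries, conditioning on whatever is observed:

```python
import numpy as np

def impute(row, predict_masked, sample_value, rng, m=5):
    """Multiple-imputation sketch: draw m completions of an incomplete
    row (None marks a missing entry). Each completion fills the missing
    entries in a random order, conditioning on the observed values and
    on entries imputed earlier in the same pass."""
    missing = [j for j, v in enumerate(row) if v is None]
    completions = []
    for _ in range(m):
        r = list(row)
        for j in rng.permutation(missing):
            r[j] = sample_value(predict_masked(r, j))
        completions.append(r)
    return completions
```

Drawing m completions rather than one preserves the uncertainty of the missing entries, which is the point of multiple imputation as opposed to single-value imputation.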


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

