
From NLP to Data Synthesis: The Surprising Power of Masked Language Models

by Language Models (dot tech) | April 8, 2025

Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.

  1. Abstract & Introduction
  2. Proposal
    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results
    1. With Missing Data
  4. Experiments
  5. Results
    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1
    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism
    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments
  9. A7 Detailed Experimental Results

2. Proposal


Figure 1: Overall structure and training process of MaCoDE. In this case, the value of the second column is masked (replaced with ‘0’) and predicted.

2.1 Classification Target (Discretization)
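As a rough illustration of the discretization idea named in this subsection's title: a standard approach is quantile (empirical-CDF) binning, which turns each continuous column into a K-class classification target. The helper below and its bin count are illustrative assumptions, not necessarily the paper's exact procedure.

import numpy as np

# Illustrative quantile binning: map a continuous column to bin labels
# {1, ..., n_bins} so that density estimation reduces to classification.
# `n_bins` and this exact rule are assumptions for illustration.
def quantile_bin(column, n_bins=20):
    edges = np.quantile(column, np.linspace(0.0, 1.0, n_bins + 1))
    edges = np.unique(edges)                    # drop duplicate edges
    # interior edges define the bins; labels start at 1 ('0' is the mask token)
    labels = np.searchsorted(edges[1:-1], column, side="right") + 1
    return labels, edges

rng = np.random.default_rng(0)
labels, edges = quantile_bin(rng.normal(size=1000))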


2.2 Masked Conditional Density Estimation (MaCoDE)
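At a high level, training follows the masked-language-modeling recipe shown in Figure 1. A minimal sketch, assuming discretized inputs with bin labels in {1, ..., K}, '0' as the mask token, and a model that maps masked tables to per-cell logits (all illustrative assumptions, not the paper's exact implementation):

import torch
import torch.nn.functional as F

# Hedged sketch of one training step: mask some cells (set them to the
# '0' token, as in Figure 1) and minimize cross-entropy on the masked
# cells only. `model`, the tensor shapes, and K are assumptions.
def masked_ce_step(model, x, mask):
    # x: (B, p) integer bin labels in {1, ..., K}; mask: (B, p) boolean
    x_masked = x.masked_fill(mask, 0)   # replace masked cells with '0'
    logits = model(x_masked)            # (B, p, K) per-cell class logits
    # cross-entropy over masked cells; shift labels to 0-based classes
    return F.cross_entropy(logits[mask], x[mask] - 1)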



Definition 2 (Mask distribution [13, 19]). The distribution of the mask vector m is defined as:
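A plausible form of this distribution, consistent with the generation procedure described below and with the uniform any-order masking of [13, 19] (a hedged reconstruction, not quoted verbatim from the paper):

\[
d \sim \mathrm{Uniform}\{1, \dots, p\}, \qquad
\mathbf{m} \mid d \sim \mathrm{Uniform}\Big(\big\{\, m \in \{0,1\}^p : \textstyle\sum_{j=1}^{p} m_j = d \,\big\}\Big),
\]

so that a mask with d masked entries is drawn with probability \(\frac{1}{p}\binom{p}{d}^{-1}\).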



Synthetic data generation. Unlike natural language, tabular data has no inherent ordering between columns [13]. Therefore, as outlined in Algorithm 2, MaCoDE generates one column at a time in a random order, so that the generation process passes through masked subsets of size p down to 1 (p → p − 1 → · · · → 2 → 1). [13] demonstrated that, under the mask distribution of Definition 2, the distribution of the number of masked entries is matched between training and generation.
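In code, the generation loop sketched by Algorithm 2 looks roughly as follows; `model`, the shapes, and the {1, ..., K} label convention follow the training sketch above and are illustrative assumptions.

import torch

# Schematic column-by-column generation: start fully masked, then fill
# columns in a random order, so the number of masked entries decreases
# p -> p-1 -> ... -> 1, matching the mask sizes seen during training.
@torch.no_grad()
def generate(model, n_samples, p):
    x = torch.zeros(n_samples, p, dtype=torch.long)   # all cells masked ('0')
    for j in torch.randperm(p):                       # random column order
        probs = torch.softmax(model(x)[:, j, :], dim=-1)
        x[:, j] = torch.multinomial(probs, 1).squeeze(-1) + 1  # back to {1..K}
    return x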


Figure 2: Trade-off between quality and privacy. Left: feature selection performance. Right: DCR. Error bars represent standard errors. See Appendix A.6.1 for detailed results.
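For context, DCR (distance to closest record) measures, for each synthetic row, the distance to its nearest neighbor in the training data; larger values indicate synthetic rows that are not near-copies of real ones. A minimal sketch, assuming Euclidean distance on numeric features:

import numpy as np

# Hedged DCR sketch: nearest-neighbor distance from each synthetic row
# to the training set. The Euclidean metric here is an illustrative
# assumption; categorical columns would need an appropriate encoding.
def dcr(synthetic, train):
    diffs = synthetic[:, None, :] - train[None, :, :]   # (n_syn, n_train, p)
    return np.linalg.norm(diffs, axis=-1).min(axis=1)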


Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

