From Tokens to Tables: How NLP Tech is Revolutionizing Synthetic Datasets

Written by languagemodels | Published 2025/04/08
Tech Story Tags: masked-language-modeling-(mlm) | synthetic-data-generation | conditional-density-estimation | tabular-data | machine-learning-utility-(mlu) | non-parametric-estimation | histogram-based-methods | data-imputation

TL;DR: This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.
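The paper's full training procedure is not reproduced here, but the core idea, treating a masked tabular cell as a classification target over discretized bins so that the classifier's output approximates a conditional density, can be sketched in a few lines. This is a hedged toy illustration only: the quantile-based discretization and the simple counting estimator below are stand-ins for the learned model, and the column names and bin counts are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular data: two correlated continuous columns.
x = rng.normal(size=1000)
y = x + 0.1 * rng.normal(size=1000)
data = np.column_stack([x, y])

N_BINS = 4  # illustrative choice, not the paper's setting

def discretize(col, n_bins=N_BINS):
    # Histogram-style discretization: map each value to a quantile bin,
    # turning density estimation into a classification problem over bins.
    edges = np.quantile(col, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(col, edges)

bins = np.column_stack([discretize(data[:, j]) for j in range(data.shape[1])])

def conditional_density(target_col, given_col, given_bin, n_bins=N_BINS):
    # "Mask" the target column and estimate P(target bin | observed bin)
    # by counting -- a non-parametric stand-in for the trained classifier
    # that MLM-style training would produce.
    mask = bins[:, given_col] == given_bin
    counts = np.bincount(bins[mask, target_col], minlength=n_bins)
    return counts / counts.sum()

# Conditional distribution of y's bin given that x falls in its top quartile.
p = conditional_density(target_col=1, given_col=0, given_bin=3)
```

Because the two toy columns are strongly correlated, the estimated conditional distribution concentrates on the matching bin; sampling from such conditionals column by column is, loosely, how synthetic rows could be generated in this framing.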

Table of Links

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

A.7 Detailed Experimental Results

A.7.1 Q1: Synthetic Data Quality

A.7.2 Q1: Visualization of Marginal Histogram

Figure 7: Histograms of observed dataset and synthetic dataset, generated by MaCoDE.

Figure 8: Histograms of observed dataset and synthetic dataset, generated by MaCoDE.

A.7.3 Q2: Synthetic Data Quality in Scenarios with Incomplete Training Dataset

A.7.4 Q3: Multiple Imputation Performance

A.7.5 Q3: Multiple Imputation Performance of missMDA

A.7.6 Q3: Multiple Imputation Performance of EGC

Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.


Written by languagemodels | Large Language Models (LLMs) ushered in a technological revolution. We break down how the most important models work.
Published by HackerNoon on 2025/04/08