Privacy-Preserving Synthetic Data for ML: The Role of Masked Language Models

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.
  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

A.1 Proof of Theorem 1

Proof. This proof is based on Theorem 6.11 of [50] and Theorem 1 of [29].










Thus, for every ϵ > 0,



(B) Furthermore, by the continuous mapping theorem and the algebra of convergence in probability, for every ϵ > 0,


A.2 Proof of Proposition 1

A.3 Dataset Descriptions

Download links.


• abalone: https://archive.ics.uci.edu/dataset/1/abalone


• banknote: https://archive.ics.uci.edu/dataset/267/banknote+authentication


• breast: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic


• concrete: https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength


• covertype: https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset


• kings: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction


• letter: https://archive.ics.uci.edu/dataset/59/letter+recognition


• loan: https://www.kaggle.com/datasets/teertha/personal-loan-modeling


• redwine: https://archive.ics.uci.edu/dataset/186/wine+quality


• whitewine: https://archive.ics.uci.edu/dataset/186/wine+quality


Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea;

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea;

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea;

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea;

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea;

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea.


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

