
Solving the Missing Data Problem with Masked Language Models

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

A.6 Additional Experiments

A.6.1 Privacy Preservability


• Evaluation metrics: The k-anonymity property [46] measures the level of privacy protection in synthetic data. A dataset is k-anonymous when each individual’s record cannot be distinguished from those of at least k − 1 other individuals; in other words, every record shares its quasi-identifiers (attributes that could potentially identify a subject) with at least k − 1 other records. A higher k value implies stronger anonymity and better privacy preservation. A toy computation of this measure is sketched after the metric descriptions below.


DCR (Distance to Closest Record) [39, 56] is computed from the distances between the real training samples and the synthetic samples. A higher DCR value indicates more effective privacy preservation, as it reflects little overlap between the real training data and the synthetic samples. Conversely, an excessively large DCR score suggests lower quality of the generated synthetic dataset. The DCR metric therefore provides insight into both the privacy-preserving capability and the quality of the synthetic data.


Attribute disclosure [7, 35] refers to the situation where attackers uncover additional covariates of a record by combining a subset of covariates they already possess with similar records from the synthetic dataset. To quantify how accurately attackers can identify these additional covariates, we employ classification metrics. Higher attribute disclosure metrics indicate a greater risk of privacy leakage, implying that attackers can precisely infer the unknown variables. In terms of privacy concerns, attribute disclosure can be considered a more significant issue than membership inference attacks, since attackers are assumed to have access to only a subset of covariates for a given record.
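The attribute disclosure risk just described can be approximated with a simple nearest-neighbour attack: the attacker matches the covariates they already know against the synthetic data and predicts a hidden covariate by majority vote among the closest synthetic records. The sketch below is a simplified stand-in for the protocol of [7], not a reproduction of it; the arrays and the choice of k = 5 neighbours are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def attribute_disclosure_accuracy(real_known, real_secret, syn_known, syn_secret, k=5):
    """Fraction of secret (categorical) attributes an attacker recovers.

    For each real record the attacker knows `real_known`, finds the k nearest
    synthetic records in that known subspace, and predicts the secret attribute
    by majority vote over the neighbours' `syn_secret` values.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(syn_known)
    _, idx = nn.kneighbors(real_known)
    hits = 0
    for secret, neighbors in zip(real_secret, idx):
        votes = syn_secret[neighbors]
        predicted = np.bincount(votes).argmax()
        hits += int(predicted == secret)
    return hits / len(real_secret)

# Toy data: 2 known continuous covariates, 1 secret binary covariate.
rng = np.random.default_rng(0)
real_known = rng.normal(size=(200, 2))
real_secret = (real_known[:, 0] > 0).astype(int)
syn_known = rng.normal(size=(200, 2))
syn_secret = (syn_known[:, 0] > 0).astype(int)

print(attribute_disclosure_accuracy(real_known, real_secret, syn_known, syn_secret))
```

In this toy example the hidden attribute is a deterministic function of the known covariates, so the attack recovers it far more often than chance; this is exactly the kind of leakage the metric is meant to expose.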
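The k-anonymity measure from the start of this list can be illustrated in the same spirit. Separately from the synthcity-based evaluation used in the paper, the following toy sketch groups records by their quasi-identifiers and reports the smallest group size; the column names and values are purely hypothetical.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest group size after grouping records by their quasi-identifiers.

    Every record is then indistinguishable from at least (k - 1) others with
    respect to those attributes, so a larger value means stronger anonymity.
    """
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical synthetic records with two quasi-identifiers.
synthetic = pd.DataFrame({
    "age_bracket": ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip_prefix":  ["130",   "130",   "021",   "021",   "021"],
    "income":      [42_000,  39_500,  61_000,  58_200,  64_700],
})

print(k_anonymity(synthetic, ["age_bracket", "zip_prefix"]))  # prints 2
```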


• Evaluation procedure: We evaluate k-anonymity following the approach described in [40]‡‡‡. Additionally, similar to [56], we define DCR as the 5th percentile of the L2 distances between all real training samples and synthetic samples. Since DCR relies on the L2 distance, it is computed using continuous variables only. We assess attribute disclosure using the methodology outlined in [7].
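Reading "Distance to Closest Record" literally, a minimal version of this DCR computation takes, for each synthetic sample, the L2 distance to its closest real training record over the continuous columns and reports the 5th percentile across synthetic samples. The sketch below follows that interpretation; the standardization step and the toy data are assumptions, not part of the paper's protocol.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dcr_score(real_cont: np.ndarray, syn_cont: np.ndarray, q: float = 5.0) -> float:
    """5th percentile of each synthetic sample's L2 distance to its closest
    real training record, computed on continuous columns only."""
    pairwise = cdist(syn_cont, real_cont, metric="euclidean")  # (n_syn, n_real)
    closest = pairwise.min(axis=1)                             # distance to closest record
    return float(np.percentile(closest, q))

# Toy data: 3 continuous columns, standardized before computing distances.
rng = np.random.default_rng(42)
real = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))

mu, sigma = real.mean(axis=0), real.std(axis=0)
print(dcr_score((real - mu) / sigma, (synthetic - mu) / sigma))
```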


• Result: The right panel of Figure 2 (reproduced here as Figure 4) demonstrates MaCoDE’s ability to regulate the privacy level by adjusting the temperature parameter τ. Simultaneously, the left panel illustrates that the quality of the synthetic data, measured by feature selection performance, remains notable as the privacy level increases from τ = 1 to τ = 3. However, increasing τ beyond 3 leads to declining feature selection performance compared to other models, despite DCR remaining competitive. For additional results on other metrics related to the trade-off between privacy level and synthetic data quality as τ varies, please refer to Table 8 and Table 9.
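The privacy knob here is the temperature τ applied when sampling each (discretized) cell from the model's conditional distribution: dividing the logits by a larger τ flattens the distribution over bins, pulling synthetic values away from exact patterns memorized from the training data at some cost in fidelity. The sketch below shows the generic temperature-sampling step with hypothetical logits; it is not the paper's generation code.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, tau: float, rng) -> int:
    """Sample a category index from softmax(logits / tau).

    tau = 1 recovers the model's distribution; tau > 1 flattens it,
    trading synthetic-data fidelity for stronger privacy.
    """
    scaled = logits / tau
    scaled -= scaled.max()                           # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0, -1.0])  # hypothetical per-bin logits for one cell

for tau in (1.0, 3.0):
    draws = [sample_with_temperature(logits, tau, rng) for _ in range(10_000)]
    print(tau, np.bincount(draws, minlength=4) / len(draws))
```

At τ = 1 the empirical frequencies track the model's distribution, while at τ = 3 they are noticeably flatter, mirroring the fidelity-versus-privacy trade-off shown in Figure 4.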


Figure 4: Trade-off between quality and privacy. Left: feature selection performance (synthetic data quality). Right: DCR (privacy preservability). The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. Error bars represent the standard errors of the mean. The figure is identical to Figure 2.


Table 8: Trade-offs between privacy preservability and synthetic data quality (statistical fidelity and machine learning utility). The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. ‘Baseline’ refers to the result obtained using half of the real training dataset. ↑ (↓) denotes higher (lower) is better.


Table 9: Privacy preservability. The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. ‘Baseline’ refers to the result obtained using half of the real training dataset. ↑ (↓) denotes higher (lower) is better.


Table 10: Privacy preservability for each dataset. The means and standard errors of the mean across 10 repeated experiments are reported. ↑ (↓) denotes higher (lower) is better.


A.6.2 Sensitivity Analysis


• Evaluation procedure: We introduced missing values into the kings dataset at rates of 0.1, 0.3, 0.5, and 0.7 under each missing mechanism and then trained the model. Subsequently, we evaluated the machine learning utility using SMAPE and the F1-score, and assessed the model’s multiple imputation performance using the same methodology described earlier.
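As a rough sketch of this setup, the snippet below masks entries of a data matrix completely at random (MCAR) at a chosen rate and scores imputed continuous values with one common SMAPE convention. The MAR/MNAR mechanisms, the F1-score for categorical columns, and the actual model-based imputations are not reproduced here, and the column-mean imputation is only a placeholder.

```python
import numpy as np

def mcar_mask(X: np.ndarray, rate: float, rng) -> np.ndarray:
    """Boolean mask with True where a value is made missing, drawn i.i.d. (MCAR)."""
    return rng.random(X.shape) < rate

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """One common SMAPE convention (range [0, 2])."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom)))

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(1000, 4))          # stand-in for continuous columns
mask = mcar_mask(X, rate=0.3, rng=rng)

X_incomplete = X.copy()
X_incomplete[mask] = np.nan

# Naive column-mean imputation as a placeholder for the model's imputations.
col_means = np.nanmean(X_incomplete, axis=0)
X_imputed = np.where(np.isnan(X_incomplete), col_means, X_incomplete)

print(smape(X[mask], X_imputed[mask]))
```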


• Result: Figures 5 and 6 show the sensitivity analysis conducted by varying the missingness rate of the kings dataset across four missing data mechanisms (MCAR, MAR, MNARL, MNARQ). In Figure 5, it is evident that MaCoDE maintains competitive performance in terms of SMAPE even as the missingness rate increases, across all missing data mechanisms. Regarding the F1-score, MaCoDE outperforms other models at missingness rates of 0.1 and 0.3, but its performance diminishes beyond a missingness rate of 0.5 (except under the MNARQ missing data mechanism). Additionally, as shown in Figure 6, MaCoDE consistently exhibits competitive multiple imputation performance regardless of the increasing missingness rate, without significant degradation compared to other imputation models.


Hence, Figures 5 and 6 demonstrate that MaCoDE maintains performance comparable to other models even when trained on a dataset with missing values (i.e., an incomplete dataset), without compromising the quality of the synthetic data it generates. This is a notable advantage of our proposed model over the baseline models, which struggle to train on datasets containing missing values.


Figure 5: Q2. Sensitivity analysis of machine learning utility according to missingness rate. Machine learning utility is evaluated using kings dataset under four missing mechanisms. The means and standard errors of the mean across 10 repeated experiments are reported. Error bars represent standard errors.


Figure 6: Q3. Sensitivity analysis of multiple imputations according to missingness rate. Multiple imputation is evaluated using kings dataset under four missing mechanisms. The means and standard errors of the mean across 10 repeated experiments are reported. Error bars represent standard errors.


Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

‡‡‡https://github.com/vanderschaarlab/synthcity
