
Solving the Missing Data Problem with Masked Language Models

by Language Models (dot tech), April 8th, 2025

Too Long; Didn't Read

This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

A.6 Additional Experiments

A.6.1 Privacy Preservability


• Evaluation metrics: The k-anonymity property [46] measures the level of privacy protection in synthetic data. A dataset is k-anonymous when each individual’s record cannot be distinguished from those of at least k − 1 other individuals; in other words, every record shares its quasi-identifiers (attributes that could potentially identify a subject) with at least k − 1 other records. A higher k value implies stronger anonymity and better privacy preservation. A toy computation of this measure is sketched after the metric descriptions below.


DCR (Distance to Closest Record) [39, 56] is computed from the distances between the real training samples and the synthetic samples. A higher DCR value indicates more effective privacy preservation, as it reflects little overlap between the real training data and the synthetic samples. Conversely, an excessively large DCR score suggests lower quality of the generated synthetic dataset. The DCR metric therefore provides insight into both the privacy-preserving capability and the quality of the synthetic data.


Attribute disclosure [7, 35] refers to the situation where attackers uncover additional covariates of a record by combining a subset of covariates they already possess with similar records from the synthetic dataset. To quantify how accurately attackers can identify these additional covariates, we employ classification metrics. Higher attribute disclosure metrics indicate a greater risk of privacy leakage, implying that attackers can precisely infer the unknown variables. In terms of privacy concerns, attribute disclosure can be considered a more significant issue than membership inference attacks, since attackers are assumed to have access to only a subset of covariates for a given record.
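The attribute disclosure risk just described can be approximated with a simple nearest-neighbour attack: the attacker matches the covariates they already know against the synthetic data and predicts a hidden covariate by majority vote among the closest synthetic records. The sketch below is a simplified stand-in for the protocol of [7], not a reproduction of it; the arrays and the choice of k = 5 neighbours are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def attribute_disclosure_accuracy(real_known, real_secret, syn_known, syn_secret, k=5):
    """Fraction of secret (categorical) attributes an attacker recovers.

    For each real record the attacker knows `real_known`, finds the k nearest
    synthetic records in that known subspace, and predicts the secret attribute
    by majority vote over the neighbours' `syn_secret` values.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(syn_known)
    _, idx = nn.kneighbors(real_known)
    hits = 0
    for secret, neighbors in zip(real_secret, idx):
        votes = syn_secret[neighbors]
        predicted = np.bincount(votes).argmax()
        hits += int(predicted == secret)
    return hits / len(real_secret)

# Toy data: 2 known continuous covariates, 1 secret binary covariate.
rng = np.random.default_rng(0)
real_known = rng.normal(size=(200, 2))
real_secret = (real_known[:, 0] > 0).astype(int)
syn_known = rng.normal(size=(200, 2))
syn_secret = (syn_known[:, 0] > 0).astype(int)

print(attribute_disclosure_accuracy(real_known, real_secret, syn_known, syn_secret))
```

In this toy example the hidden attribute is a deterministic function of the known covariates, so the attack recovers it far more often than chance; this is exactly the kind of leakage the metric is meant to expose.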
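The k-anonymity measure from the start of this list can be illustrated in the same spirit. Separately from the synthcity-based evaluation used in the paper, the following toy sketch groups records by their quasi-identifiers and reports the smallest group size; the column names and values are purely hypothetical.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Smallest group size after grouping records by their quasi-identifiers.

    Every record is then indistinguishable from at least (k - 1) others with
    respect to those attributes, so a larger value means stronger anonymity.
    """
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical synthetic records with two quasi-identifiers.
synthetic = pd.DataFrame({
    "age_bracket": ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip_prefix":  ["130",   "130",   "021",   "021",   "021"],
    "income":      [42_000,  39_500,  61_000,  58_200,  64_700],
})

print(k_anonymity(synthetic, ["age_bracket", "zip_prefix"]))  # prints 2
```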


• Evaluation procedure: We evaluate k-anonymity following the approach described in [40]‡‡‡. Additionally, similar to [56], we define DCR as the 5th percentile of the L2 distances between all real training samples and synthetic samples. Since DCR relies on the L2 distance, it is computed using continuous variables only. We assess attribute disclosure using the methodology outlined in [7].
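Reading "Distance to Closest Record" literally, a minimal version of this DCR computation takes, for each synthetic sample, the L2 distance to its closest real training record over the continuous columns and reports the 5th percentile across synthetic samples. The sketch below follows that interpretation; the standardization step and the toy data are assumptions, not part of the paper's protocol.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dcr_score(real_cont: np.ndarray, syn_cont: np.ndarray, q: float = 5.0) -> float:
    """5th percentile of each synthetic sample's L2 distance to its closest
    real training record, computed on continuous columns only."""
    pairwise = cdist(syn_cont, real_cont, metric="euclidean")  # (n_syn, n_real)
    closest = pairwise.min(axis=1)                             # distance to closest record
    return float(np.percentile(closest, q))

# Toy data: 3 continuous columns, standardized before computing distances.
rng = np.random.default_rng(42)
real = rng.normal(size=(500, 3))
synthetic = rng.normal(size=(500, 3))

mu, sigma = real.mean(axis=0), real.std(axis=0)
print(dcr_score((real - mu) / sigma, (synthetic - mu) / sigma))
```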


• Result: The right panel of Figure 2 (reproduced here as Figure 4) demonstrates MaCoDE’s ability to regulate the privacy level by adjusting the temperature parameter τ. Simultaneously, the left panel illustrates that the quality of the synthetic data, measured by feature selection performance, remains notable as the privacy level increases from τ = 1 to τ = 3. However, increasing τ beyond 3 leads to declining feature selection performance compared to other models, despite DCR remaining competitive. For additional results on other metrics related to the trade-off between privacy level and synthetic data quality as τ varies, please refer to Table 8 and Table 9.
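The privacy knob here is the temperature τ applied when sampling each (discretized) cell from the model's conditional distribution: dividing the logits by a larger τ flattens the distribution over bins, pulling synthetic values away from exact patterns memorized from the training data at some cost in fidelity. The sketch below shows the generic temperature-sampling step with hypothetical logits; it is not the paper's generation code.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, tau: float, rng) -> int:
    """Sample a category index from softmax(logits / tau).

    tau = 1 recovers the model's distribution; tau > 1 flattens it,
    trading synthetic-data fidelity for stronger privacy.
    """
    scaled = logits / tau
    scaled -= scaled.max()                           # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0, -1.0])  # hypothetical per-bin logits for one cell

for tau in (1.0, 3.0):
    draws = [sample_with_temperature(logits, tau, rng) for _ in range(10_000)]
    print(tau, np.bincount(draws, minlength=4) / len(draws))
```

At τ = 1 the empirical frequencies track the model's distribution, while at τ = 3 they are noticeably flatter, mirroring the fidelity-versus-privacy trade-off shown in Figure 4.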


Figure 4: Trade-off between quality and privacy. Left: feature selection performance (synthetic data quality). Right: DCR (privacy preservability). The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. Error bars represent the standard errors of the mean. The figure is identical to Figure 2.


Table 8: Trade-offs between privacy preservability and synthetic data quality (statistical fidelity and machine learning utility). The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. ‘Baseline’ refers to the result obtained using half of the real training dataset. ↑ (↓) denotes higher (lower) is better.


Table 9: Privacy preservability. The means and standard errors of the mean across 10 datasets and 10 repeated experiments are reported. ‘Baseline’ refers to the result obtained using half of the real training dataset. ↑ (↓) denotes higher (lower) is better.


Table 10: Privacy preservability for each dataset. The means and standard errors of the mean across 10 repeated experiments are reported. ↑ (↓) denotes higher (lower) is better.


A.6.2 Sensitivity Analysis


• Evaluation procedure: We introduced missing values into the kings dataset at rates of 0.1, 0.3, 0.5, and 0.7 under each missing mechanism and then trained the model. Subsequently, we evaluated the machine learning utility using SMAPE and the F1-score, and assessed the model’s multiple imputation performance using the same methodology described earlier.
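As a rough sketch of this setup, the snippet below masks entries of a data matrix completely at random (MCAR) at a chosen rate and scores imputed continuous values with one common SMAPE convention. The MAR/MNAR mechanisms, the F1-score for categorical columns, and the actual model-based imputations are not reproduced here, and the column-mean imputation is only a placeholder.

```python
import numpy as np

def mcar_mask(X: np.ndarray, rate: float, rng) -> np.ndarray:
    """Boolean mask with True where a value is made missing, drawn i.i.d. (MCAR)."""
    return rng.random(X.shape) < rate

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """One common SMAPE convention (range [0, 2])."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom)))

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(1000, 4))          # stand-in for continuous columns
mask = mcar_mask(X, rate=0.3, rng=rng)

X_incomplete = X.copy()
X_incomplete[mask] = np.nan

# Naive column-mean imputation as a placeholder for the model's imputations.
col_means = np.nanmean(X_incomplete, axis=0)
X_imputed = np.where(np.isnan(X_incomplete), col_means, X_incomplete)

print(smape(X[mask], X_imputed[mask]))
```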


• Result: Figures 5 and 6 show the sensitivity analysis conducted by varying the missingness rate of the kings dataset across four missing data mechanisms (MCAR, MAR, MNARL, MNARQ). In Figure 5, it is evident that MaCoDE maintains competitive performance in terms of SMAPE even as the missingness rate increases, across all missing data mechanisms. Regarding the F1-score, MaCoDE outperforms other models at missingness rates of 0.1 and 0.3, but its performance diminishes beyond a missingness rate of 0.5 (except under the MNARQ missing data mechanism). Additionally, as shown in Figure 6, MaCoDE consistently exhibits competitive multiple imputation performance regardless of the increasing missingness rate, without significant degradation compared to other imputation models.


Hence, Figures 5 and 6 demonstrate that MaCoDE maintains performance comparable to other models even when trained on a dataset with missing values (i.e., an incomplete dataset), without compromising the quality of the synthetic data it generates. This is a notable advantage of our proposed model over the baseline models, which struggle to train on datasets containing missing values.


Figure 5: Q2. Sensitivity analysis of machine learning utility according to missingness rate. Machine learning utility is evaluated using kings dataset under four missing mechanisms. The means and standard errors of the mean across 10 repeated experiments are reported. Error bars represent standard errors.


Figure 6: Q3. Sensitivity analysis of multiple imputations according to missingness rate. Multiple imputation is evaluated using kings dataset under four missing mechanisms. The means and standard errors of the mean across 10 repeated experiments are reported. Error bars represent standard errors.


Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

‡‡‡https://github.com/vanderschaarlab/synthcity
