Synthetic Data vs. Encoded Data: Which Method to Choose in Privacy-Preserving Downstream Analysis?

Zakia Zaman, Praveen Gauravaram, Sanjay Jha, and Wen Hu

Synthetic data has been promoted as a way to overcome the limitations of state-of-the-art privacy-preserving data publishing techniques. Researchers have shown that synthetic data generated by Generative Adversarial Networks (GANs) maintains the original dataset’s statistical characteristics while offering perfect privacy and better utility in downstream tasks. Besides synthetic data, recent studies have shown that Bloom Filter (BF) Encoding [1] can also be applied to privacy-preserving data analysis. In this study, we provide the first comparative analysis of the privacy and utility benefits of synthetic data publishing against the Bloom Filter data encoding approach. We consider a few public datasets and reproduce the generation of synthetic data from well-established, published differentially private (DP) GAN models. Next, we encode the same real datasets using BF-Encoding with a DP guarantee. We then use both the synthetic data and the encoded data in downstream tasks. Our experimental results show that the DP-BF-Encoded data preserves data utility better than the synthetic data. Moreover, we observed that the privacy budget (ε) is higher for all considered DP-GAN models than for the DP-BF-Encoded data. Since a smaller ε corresponds to a stronger DP guarantee, this indicates that the BF-Encoded data ensures stricter privacy than the synthetic data.
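To make the encoding step concrete, the following is a minimal sketch of one common way to combine Bloom filter encoding with a DP-style perturbation: each value is hashed into a bit array via its character q-grams, and every bit is then flipped independently with probability 1/(1 + e^ε) (randomized response). This is an illustrative assumption, not the authors' implementation; the function names, the filter length, the number of hash functions, and the q-gram size are all hypothetical choices, and the ε here applies per bit rather than to a whole record.

```python
import hashlib
import math
import random


def bloom_encode(value, num_bits=64, num_hashes=3, q=2):
    """Encode a string as a Bloom filter over its character q-grams.

    All parameters are illustrative defaults, not values from the study.
    """
    bits = [0] * num_bits
    padded = f"_{value}_"  # pad so the first and last characters form q-grams
    qgrams = {padded[i:i + q] for i in range(len(padded) - q + 1)}
    for gram in qgrams:
        for k in range(num_hashes):
            # Derive the k-th hash by prefixing the hash index to the q-gram.
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % num_bits] = 1
    return bits


def randomized_response(bits, epsilon, rng):
    """Flip each bit independently with probability 1 / (1 + e^epsilon).

    A smaller epsilon means more flips, hence stronger per-bit privacy
    and lower utility.
    """
    flip_prob = 1.0 / (1.0 + math.exp(epsilon))
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    encoded = bloom_encode("alice")
    noisy = randomized_response(encoded, epsilon=2.0, rng=rng)
    print("".join(map(str, encoded)))
    print("".join(map(str, noisy)))
```

In this sketch, lowering `epsilon` increases the flip probability toward 0.5, which is the trade-off the abstract refers to when comparing privacy budgets against downstream utility.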

