As data engineers, we develop pipelines to process sensitive data. We must be aware of data privacy regulations and ensure our pipelines comply with them. This article discusses various techniques to ensure data privacy in data engineering.
The most relevant regulations include the GDPR (General Data Protection Regulation), the CCPA (California Consumer Privacy Act), HIPAA (Health Insurance Portability and Accountability Act), and other data privacy laws. This article will not delve into the details of these regulations but will focus on the technical aspects of ensuring data privacy in data engineering.
The following will be covered:
Pseudonymization, using hashing and tokenization
Anonymization, using suppression and generalization
Pseudonymization is a data management and de-identification procedure in which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. Using a single pseudonym for each replaced field, or for a collection of replaced fields, makes the record less identifiable while keeping it suitable for data analysis and processing. Pseudonymization is reversible: the original data can be restored, but only with access to the separately stored mapping between pseudonyms and original values.
Source Data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
Pseudonymized Data
+----------+----------+
|name      |dob       |
+----------+----------+
|NAME-001 |1980-01-01|
|NAME-002 |1985-01-01|
|NAME-003 |1990-01-01|
+----------+----------+
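One way to produce pseudonyms like these in PySpark is to build a mapping table of distinct names to sequential pseudonyms and store that mapping in a secure location, since it is what makes the process reversible. Below is a minimal sketch; the NAME-%03d format and the ordering by name are arbitrary choices for illustration, not part of any standard.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pseudonymization").getOrCreate()

columns = ["name", "dob"]
data = [("Alex Smith", "1980-01-01"), ("John Doe", "1985-01-01"), ("Mike Johnson", "1990-01-01")]
src_df = spark.createDataFrame(data, columns)

# Map each distinct name to a pseudonym such as NAME-001.
# Keep this mapping in a secure location; it is what allows the process to be reversed.
mapping_df = src_df.select("name").distinct().withColumn(
    "pseudonym", F.format_string("NAME-%03d", F.row_number().over(Window.orderBy("name")))
)

# Replace the original name with its pseudonym.
pseudonymized_df = src_df.join(mapping_df, "name", "left").select(
    F.col("pseudonym").alias("name"), "dob"
)
pseudonymized_df.show(10, False)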
Hashing is a one-way function that converts an input into a fixed-size output, typically rendered as a hexadecimal string. The output is deterministic, meaning that the same input always produces the same output. Hashing is commonly used to store passwords securely, but it can also be used to pseudonymize data. Well-known hashing algorithms include MD5, SHA-1, and SHA-256; MD5 and SHA-1 are no longer considered cryptographically secure, so SHA-256 or stronger is preferred. To make hashing harder to reverse, you can add a salt to the input: a random string appended to the value before hashing. A secret salt makes it much more difficult for attackers to recover the original values with brute force or precomputed lookup tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def get_spark_session():
    spark = SparkSession.builder.appName("test").getOrCreate()
    return spark


def get_salt():
    # The salt should be stored in, and read from, a secure location such as a secrets manager.
    return "some salty string"


def test_spark():
    spark = get_spark_session()
    columns = ["name", "dob"]
    data = [("Alex Smith", "1980-01-01"), ("John Doe", "1985-01-01"), ("Mike Johnson", "1990-01-01")]
    df = spark.createDataFrame(data, columns)
    df.show(10, False)
    # Append the salt to the name, hash the result with MD5, and overwrite the name column.
    hashed_df = df.withColumn("name", F.md5(F.concat(F.col("name"), F.lit(get_salt()))))
    hashed_df.show(10, False)
Results:
# Source Data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith |1980-01-01|
|John Doe |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
# Hashed Data
+--------------------------------+----------+
|name |dob |
+--------------------------------+----------+
|6c863d4f759030513519e9187f2cdbe6|1980-01-01|
|be8c2b1bed7ccaf2f9f8e597f390500d|1985-01-01|
|b2a1813854b9829f6d1aa302fae4493d|1990-01-01|
+--------------------------------+----------+
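MD5 is used above only to keep the example short; the same pattern works with SHA-256, which is generally preferred. A minimal variation, reusing df and get_salt from the example above:
# Same approach with SHA-256 instead of MD5 (the digest is 64 hex characters instead of 32).
sha_df = df.withColumn("name", F.sha2(F.concat(F.col("name"), F.lit(get_salt())), 256))
sha_df.show(10, False)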
Tokenization is the process of replacing sensitive data with unique identification symbols that retain all the essential information about the data without compromising its security.
Tokenization is often used to protect sensitive data at rest in a database, and it can also protect data in transit. Like pseudonymization, tokenization is reversible: the original value can be restored by looking the token up in the mapping (lookup) table, which must therefore be stored securely.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def get_spark_session():
    spark = SparkSession.builder.appName("test").getOrCreate()
    return spark


def test_tokenization():
    spark = get_spark_session()
    columns = ["name", "dob"]
    data = [("Alex Smith", "1980-01-01"), ("John Doe", "1985-01-01"), ("Mike Johnson", "1990-01-01")]
    src_df = spark.createDataFrame(data, columns)
    src_df.show(10, False)
    # Build a lookup table mapping each distinct name to a token; this table must be stored securely.
    # Note: monotonically_increasing_id() guarantees unique ids, but not consecutive ones.
    lookup_df = src_df.select("name").distinct().withColumn("id", F.monotonically_increasing_id())
    lookup_df.show(10, False)
    # Replace the name with its token and drop the helper column.
    tokenized_df = src_df.join(lookup_df, "name", "left").withColumn("name", F.col("id")).drop("id")
    tokenized_df.show(10, False)
Results:
# Source data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith |1980-01-01|
|John Doe |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
# Lookup table, should be stored in a secure location
+------------+---+
|name |id |
+------------+---+
|Alex Smith |0 |
|John Doe |1 |
|Mike Johnson|2 |
+------------+---+
# Tokenized data
+----+----------+
|name|dob |
+----+----------+
|0 |1980-01-01|
|1 |1985-01-01|
|2 |1990-01-01|
+----+----------+
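Because tokenization is reversible, the original values can be restored by joining the tokens back to the lookup table. A minimal sketch, reusing tokenized_df and lookup_df from the example above:
# Restore the original names by resolving each token against the lookup table.
detokenized_df = (
    tokenized_df.withColumnRenamed("name", "id")
    .join(lookup_df, "id", "left")
    .select("name", "dob")
)
detokenized_df.show(10, False)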
Anonymization is the process of removing or modifying personal data in such a way that it can no longer be linked back to an individual. Unlike pseudonymization and tokenization, it is a one-way process: the original data cannot be restored. Anonymization is often used for data intended for research or statistical purposes, and properly anonymized data is no longer considered personal data under the GDPR.
There are two main techniques for anonymization:
Suppression
Generalization
Suppression is the process of removing sensitive data from a record. For example, you can remove the name and address from a record, but keep the date of birth.
Possible implementations include:
# Source data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith |1980-01-01|
|John Doe |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
# Suppressed data
+------------+----------+
|name |dob |
+------------+----------+
|null |1980-01-01|
|null |1985-01-01|
|null |1990-01-01|
+------------+----------+
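A minimal PySpark sketch of suppression, assuming a src_df with name and dob columns like the source data above; the sensitive column can either be nulled out or dropped entirely:
# Suppress the name by replacing every value with null (keeps the schema intact) ...
suppressed_df = src_df.withColumn("name", F.lit(None).cast("string"))
suppressed_df.show(10, False)

# ... or remove the sensitive column altogether.
dropped_df = src_df.drop("name")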
Generalization is the process of replacing sensitive data with a more generalized value. For instance, one could replace the date of birth with the year of birth. Generalization is often employed to safeguard data used for research or statistical purposes.
Possible implementations include:
# Source data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith |1980-01-01|
|John Doe |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
# Generalized data
+--------+--------------+
|name |year_of_birth |
+--------+--------------+
|1 |1980 |
|2 |1985 |
|3 |1990 |
+--------+--------------+
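A sketch of this generalization in PySpark, again assuming a src_df with name and dob columns (the name column is assumed to have been tokenized already, as in the table above):
# Replace the full date of birth with just the year of birth.
generalized_df = src_df.withColumn("year_of_birth", F.year(F.to_date("dob"))).drop("dob")
generalized_df.show(10, False)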
# Source data
+------------+----------+
|name |dob |
+------------+----------+
|Alex Smith |1980-01-01|
|John Doe |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+
# Binned data
+--------+-----------+
|name |age_range |
+--------+-----------+
|1 |40-50 |
|2 |30-40 |
|3 |30-40 |
+--------+-----------+
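Binning into age ranges can be sketched in a similar way; note that the resulting ranges depend on the current date, and the 10-year bucket size is an arbitrary choice:
# Derive the age in whole years from the date of birth.
age = F.floor(F.months_between(F.current_date(), F.to_date("dob")) / 12)
# Bucket the age into 10-year ranges such as "30-40".
lower = (F.floor(age / 10) * 10).cast("int")
binned_df = src_df.withColumn(
    "age_range", F.concat_ws("-", lower.cast("string"), (lower + 10).cast("string"))
).drop("dob")
binned_df.show(10, False)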
# Source data
+------------+---------------------------------------------------+
|name |address |
+------------+---------------------------------------------------+
|Alex Smith |123 Fake Street, Building 1, Amsterdam, Netherlands|
|John Doe |456 Unreal Road, Building 2, Paris, France |
|Mike Johnson|789 Nonexistent Avenue, Building 3, New York, USA |
+------------+---------------------------------------------------+
# Categorical generalized data
+------------+---------------+
|name |country |
+------------+---------------+
|Alex Smith |Netherlands |
|John Doe |France |
|Mike Johnson|USA |
+------------+---------------+
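If the address is stored as a single comma-separated string, as in the table above, a rough sketch can take the last component as the country; real-world addresses usually require a proper parser or a reference dataset. Assuming an addr_df with name and address columns:
# Keep only the last comma-separated part of the address as the country.
country_df = addr_df.withColumn(
    "country", F.trim(F.element_at(F.split("address", ","), -1))
).drop("address")
country_df.show(10, False)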
# or, for IP addresses, CIDR notation can be used
+------------+-----------------+
|name |ip_address |
+------------+-----------------+
|Alex Smith  |192.168.203.157  |
|John Doe    |127.0.99.185     |
|Mike Johnson|192.168.193.61   |
+------------+-----------------+
# Converted to /24 CIDR notation
+------------+------------------+
|name |ip_address |
+------------+------------------+
|Alex Smith |192.168.203.0/24 |
|John Doe |127.0.99.0/24 |
|Mike Johnson|192.168.193.0/24 |
+------------+------------------+
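Collapsing IPv4 addresses to their /24 network can be sketched with a regular expression, assuming an ip_df with an ip_address column:
# Zero out the last octet of the IPv4 address and append the /24 prefix length.
cidr_df = ip_df.withColumn(
    "ip_address",
    F.concat(F.regexp_replace("ip_address", r"\.\d{1,3}$", ".0"), F.lit("/24")),
)
cidr_df.show(10, False)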