But privacy interacts with security in a unique way: data that includes personally identifiable information requires the highest standard of security. The meaning of privacy has forever changed, too, because big compute means combinations of fully “anonymized” datasets can be used to re-identify individuals, easily.
Compute, specifically big compute, unlocks patterns in high dimensional data: sparse informational vectors become dense in personally identifiable patterns. How many individuals, or groups sharing similar characteristics, can be uniquely picked out of a data set is quantitatively measured by unicity.
Unicity is often used in the English language to mean embodied kindness and openness.
Unicity in mathematics states the uniqueness of a mathematical object: usually that there is only one object fulfilling given properties, or that all objects of a given class are equivalent.
Unicity Distance in cryptography is not the focus of today, but it may help to elucidate the idea: it tells us how much ciphertext is required so that the encryption key can be uniquely recovered, assuming that the attacker knows the encryption algorithm and has access to both the ciphertext and some statistics about the plaintext. Basically, it lets you calculate how big the haystack needs to be to find a needle, before you go digging.
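As a hedged, back-of-the-envelope illustration (my numbers, not the focus of the argument): Shannon’s unicity distance is roughly the entropy of the key space divided by the per-character redundancy of the plaintext language. For a simple substitution cipher over English, the haystack works out to be surprisingly small.

```python
from math import log2, factorial

# Shannon's unicity distance: U ≈ H(K) / D, where H(K) is the entropy of the key
# space and D is the per-character redundancy of the plaintext language.
key_entropy = log2(factorial(26))   # simple substitution cipher: 26! possible keys ≈ 88.4 bits
redundancy = log2(26) - 1.5         # English: ~4.7 bits/char raw, only ~1.5 bits/char of real information

print(f"haystack size: ~{key_entropy / redundancy:.0f} characters of ciphertext")  # roughly 28
```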
This idea of measuring unicity in large data sets was first made famous by a study that found over 90% of people could be uniquely re-identified in a Netflix Prize data set. The authors “demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber's record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.”
In 2021, I was reminded of this again.
I had been doing signal processing studies on the human brain, seeing if we could change brain networks without conscious awareness. Spoilers: you totally can. That data may seem like pretty sensitive, highly identifiable individual data, but there are data sets much more dangerous than that. Like your known Netflix usage.
Medical research funded by the US Government requires those data sets to be openly available to the public when privacy can be reasonably preserved. But when you calculate the risk of re-identification not just of an individual within the data set, but through combination with easily available data sets in the nearby geographical area, things get tricky.
It’s worth reading the whole summary:
“Although anonymous data are not considered personal data, recent research has shown how individuals can often be re-identified. Scholars have argued that previous findings apply only to small-scale datasets and that privacy is preserved in large-scale datasets. Using 3 months of location data, we (1) show the risk of re-identification to decrease slowly with dataset size, (2) approximate this decrease with a simple model taking into account three population-wide marginal distributions, and (3) prove that unicity is convex and obtain a linear lower bound. Our estimates show that 93% of people would be uniquely identified in a dataset of 60M people using four points of auxiliary information, with a lower bound at 22%. This lower bound increases to 87% when five points are available. Taken together, our results show how the privacy of individuals is very unlikely to be preserved even in country-scale location datasets.”
This is the gold that hackers usually mine for in healthcare, finance, and government records. They need four golden auxiliary data points, and they can find the individual.
This isn't finding a needle in a haystack.
It’s finding a specific needle in a stack of needles.
All I need is three months of location data about that needle, and bingo, I got it.
Unicity in data sets is a massive blindspot for most organizations.
It should be a major compliance issue, but it’s a blindspot there too.
It’s a major security risk, until we learn to observe it.
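Observing it doesn’t have to be exotic. Here is a minimal sketch, on toy data of my own invention rather than any standard tooling, in the spirit of the location-data study above: pick a record, hand a simulated adversary four of its points, and count how often those four points match exactly one record.

```python
import random

def estimate_unicity(records, num_points=4, trials=200):
    """Monte Carlo estimate of unicity: the fraction of records that are
    uniquely pinned down by `num_points` randomly chosen observations."""
    unique = 0
    for _ in range(trials):
        target = random.choice(records)
        # the adversary's auxiliary knowledge: a handful of points from the target
        aux = set(random.sample(sorted(target), k=min(num_points, len(target))))
        matches = sum(1 for r in records if aux <= r)
        if matches == 1:
            unique += 1
    return unique / trials

# Toy data: each record is a set of (cell_tower_id, hour_of_week) observations,
# standing in for a few months of location data.
random.seed(0)
records = [
    {(random.randrange(500), random.randrange(168)) for _ in range(40)}
    for _ in range(5_000)
]
print(f"estimated unicity with 4 points: {estimate_unicity(records):.0%}")
```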
I just took the IAPP AI Governance Training. It’s the new standard for understanding global regulation around privacy concerns for Artificial Intelligence, established in April 2024. I have a technical background, and I wanted to use that training to get inside the minds of the lawyers, regulators, and compliance officers I often interact with. I’m super pleased with how it sums up the current regulatory landscape, and I like that the certification requires updating your training on the subject every year: in this regulatory landscape, things move fast.
I wish we had covered the technical advancements in Privacy Enhancing Technologies that you would need to consider if you have a data set that is at high risk of unicity. I wish we had covered any known, quantitative measurements to reduce the risk of unicity in small or large data sets. I wish we had covered unicity, period.
I wish we had covered how the use of Privacy Enhancing Technologies (PETs) is unique: all the way down to the primitives of the Linux Kernel, that technology has been specifically designed with privacy protection in mind. PETs can mitigate both compliance and security risks for high risk data sets, all at once.
Security risks are often reviewed in the form of threat modeling. It’s a speculative calculation that multiplies three factors: the type of threat (insider actor, supply chain vulnerability), the magnitude of impact (to stakeholders, to end users, to business reputation), and the likelihood.
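To make that concrete, here is one way you might put rough numbers on it; the scales and weights below are my own illustration, not a prescribed methodology.

```python
from dataclasses import dataclass

@dataclass
class Threat:
    name: str             # e.g. "insider actor", "supply chain vulnerability"
    threat_weight: float  # how capable/motivated this class of attacker is, 1-5
    impact: float         # magnitude of impact to stakeholders, users, reputation, 1-5
    likelihood: float     # 0-1, informed by the known/perceived asset value

def risk_score(t: Threat) -> float:
    # the speculative multiplication described above
    return t.threat_weight * t.impact * t.likelihood

threats = [
    Threat("insider exfiltrates training data", threat_weight=4, impact=5, likelihood=0.3),
    Threat("supply chain dependency compromise", threat_weight=3, impact=4, likelihood=0.2),
]
for t in sorted(threats, key=risk_score, reverse=True):
    print(f"{t.name}: {risk_score(t):.1f}")
```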
Let’s focus on likelihood: I tend to calculate that as the known/perceived asset value, and even put a proposed price tag on intellectual property like algorithms. This is important. You should evaluate your algorithmic IP like it is your product, because particularly in AI, it absolutely is your product.
This also focuses your attention clearly in your threat model. If your business is specifically creating intellectual property around generative algorithms, traditional methods of security won’t work.
Let me explain why:
We are really good at encrypting data now.
It is, unfortunately, effectively impossible to compute on encrypted data.
If your business relies on compute (and it probably does if you have read this far), then you are responsible for making decisions about the privacy-motivated security threats to your surface area. Privacy is the one part of technology where compliance may actually be wholly aligned with security.
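A tiny sketch of that tension, using the widely used `cryptography` package (the scenario and numbers are mine): encryption protects the value perfectly, and in doing so strips away every property you would need to compute on it.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()
vault = Fernet(key)

salary = 84_000
token = vault.encrypt(str(salary).encode())  # safe at rest and in transit

# There is no way to "add 10%" to `token`; its bytes are indistinguishable from noise.
# To compute, you must decrypt, and then you need somewhere trustworthy to do it,
# which is exactly the gap Confidential Computing is designed to fill.
with_raise = int(vault.decrypt(token)) * 1.1
print(f"{with_raise:,.0f}")
```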
Back to that pesky encrypted data: there are a few good reasons why it might be encrypted. My favorite real use case for the PET Confidential Computing is in the fight against global human trafficking.
There have always been good people in the world fighting for the rights and freedoms of the victims of this globally distributed problem. Traditionally, OSINT techniques would be used to identify the locations of databases holding information, often a corpus of photographic or video material, that you were legally NOT allowed to store and hold as evidence, because the goal is to limit any ability for those records to ever gain a new distribution vector.
This created a problem, as predators could easily move information around online, centralizing and decentralizing their architecture as needed. Those fighting the problem did not have the same flexibility.
Reasonable regulation, unfortunate secondary effects.
Now, Confidential Computing gives us a fair fight in the Hope for Justice Private Data Exchange: a demonstration of how to centralize those extremely high risk records and protect the data in use by performing computation in a hardware-based, attested Trusted Execution Environment, where this data will only ever be observed by algorithms, not human eyes.
And it gets better. Because we are so good at encryption, this could now become part of a large, federated data ecosystem. Organizations around the world are able to pool their records together and use the magic of just four golden auxiliary measures to surface potentially identifying information, not just about individuals, but about locations and, potentially, patterns of movement. A fair fight, where privacy is preserved by an isolated execution environment: only algorithmic eyes will ever see those images again.
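For a feel of what “only algorithmic eyes” looks like in code, here is a deliberately simplified, hypothetical sketch of the contributor-side gate; the function names and report fields are stand-ins, not the API of any real Confidential Computing SDK or of the Hope for Justice exchange.

```python
# Hypothetical sketch: a record is released only if the enclave's attestation
# report proves it is running the exact, approved analysis code.
APPROVED_MEASUREMENT = "digest-of-the-audited-enclave-image"  # placeholder value

def verify_attestation(report: dict) -> bool:
    # A real verifier also checks the hardware vendor's signature chain and freshness;
    # here we only compare the code measurement against the approved build.
    return report.get("measurement") == APPROVED_MEASUREMENT

def release_record(record: bytes, report: dict, encrypt_to_enclave) -> bytes:
    if not verify_attestation(report):
        raise PermissionError("enclave is not running the approved analysis code")
    # The record is encrypted to a key that exists only inside the attested enclave,
    # so no human operator along the way ever sees the plaintext.
    return encrypt_to_enclave(record, report["enclave_public_key"])
```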
Unicity is a tool, a really good tool. Unicity replaces your blindspot with a calculation. Take a look at your own organization’s first attempts at AI Conformity Assessment: risk management, data governance, and cybersecurity practices. Think beyond the current regulation and to the total risk that your system may actually represent to end users, and start threat modeling for a data dense world. Let’s get this right.
I learned so much in the days we spent covering every framework in AI regulation. Based on the Framework of Regulation provided in the AIGP training, here is my current recommendation for how to handle this in any medium to large sized organisation.
An Enriched AI Governance Framework
If we want to identify individuals, let’s make those surface areas secure.
If we don’t want to identify individuals, implement a way to monitor the ongoing risk of re-identification in your system’s outputs.
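One lightweight way to do that monitoring, sketched here with made-up quasi-identifiers and nothing beyond the standard library: before a table leaves your system, measure how many rows are unique on the fields an outsider could plausibly know.

```python
from collections import Counter

def reidentification_report(rows, quasi_identifiers):
    """Group released rows by their quasi-identifier values and report how many
    individuals sit in a group of one, plus the effective k-anonymity."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    unique_rows = sum(size for size in groups.values() if size == 1)
    return {
        "unique_fraction": unique_rows / len(rows),
        "k_anonymity": min(groups.values()),
    }

# Hypothetical export of an analytics table.
rows = [
    {"zip": "99501", "birth_year": 1984, "sex": "F"},
    {"zip": "99501", "birth_year": 1984, "sex": "F"},
    {"zip": "99524", "birth_year": 1962, "sex": "M"},
]
print(reidentification_report(rows, ["zip", "birth_year", "sex"]))
# -> roughly {'unique_fraction': 0.33, 'k_anonymity': 1}
```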
Lower levels of unicity in public and breached datasets would be great for all of us. It’s a data hygiene practice your team can adopt, one that gives a quantitative measure of the risk of convergent data usage by a privacy-motivated adversary. We absolutely can, and must, raise the bar on protecting personal data from re-identification. We can only start doing that if we measure it in our own data. If you are serious about privacy enhancing technologies and the changing tides of regulation in compute, send me an interesting question about it. If your systems necessarily engage with high risk data in training, you might also care about