Gint Woss

Data analysis, without data sprawl

Your company runs on data - but with every piece of analysis, personal information gets exposed and replicated. How can you analyze raw data without exposing it?

The Challenge

Your company’s data is a treasure trove of insights and value, but with every piece of analysis, personal information gets exposed, downloaded and replicated. That increases your exposure to employee abuse, employee account takeover, endpoint security risks and data breaches.

In this post, we'll show you how to analyze sensitive data sets without creating privacy and security risk, so you can rest easy knowing your users’ data is protected.

The Answer: Tokenization

The magic solution is an elegant combination of tokenization and data transformation.

Tokenization is the act of replacing sensitive PII with a form-matching token. It lets you unlock the power of your data without sacrificing privacy or security. For example, you might tokenize an email address, like will@userclouds.com, to a unique, random string of characters that still looks like an email address: easfhrtyu@radfshtyui.com. The token is not PII, but you can analyze, explore or use it just as easily and effectively as the PII itself.
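To make "form-matching" concrete, here is a minimal sketch in Python. It is not the UserClouds implementation: the function name tokenize_email, the fixed length of 10 random letters and the choice to keep the top-level domain are all assumptions made for illustration.

```python
import secrets
import string


def tokenize_email(email: str, part_length: int = 10) -> str:
    """Return a random token that is shaped like an email address.

    Hypothetical helper for illustration only. The token is drawn from a
    cryptographically secure random source and is not derived from the input.
    """
    local, _, domain = email.partition("@")
    name, dot, tld = domain.rpartition(".")
    if not local or not name or not dot:
        raise ValueError("expected an address like user@example.com")

    def random_letters(n: int) -> str:
        return "".join(secrets.choice(string.ascii_lowercase) for _ in range(n))

    # Fixed-length random parts avoid leaking the length of the original.
    # Keeping the top-level domain is an assumption; drop it if it is sensitive.
    return f"{random_letters(part_length)}@{random_letters(part_length)}.{tld}"


token = tokenize_email("will@userclouds.com")
print(token)  # a different random, email-shaped string on every run
```

Because the output has the same shape as an email address, it can flow through schemas and validation that expect an email column.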

Tokenization is different from hashing in a few key ways:

  1. The token is just a random string - it’s not algorithmically derived from the PII
  2. It’s single use, so it can’t be matched across other datasets
  3. It can’t be resolved back into the original PII without access to the token vault
  4. In most cases, it’s considered privacy preserving
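The first two differences are easy to see in a short Python sketch. It uses only the standard library; the helper names hash_email and random_token are ours, not part of any product API.

```python
import hashlib
import secrets

email = "will@userclouds.com"


def hash_email(value: str) -> str:
    # A hash is derived from the input: same input, same output, everywhere.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def random_token() -> str:
    # A token is random: it carries no information about the input at all.
    return secrets.token_hex(16)


hash_in_dataset_a = hash_email(email)
hash_in_dataset_b = hash_email(email)
token_in_dataset_a = random_token()
token_in_dataset_b = random_token()

# Hashes of the same email line up across data sets, so the data sets can be joined.
print(hash_in_dataset_a == hash_in_dataset_b)  # True

# Independent random tokens give an attacker nothing to join on.
print(token_in_dataset_a == token_in_dataset_b)
```

An unsalted hash has a second weakness: anyone holding a list of candidate emails can hash each one and look for matches. A random token offers no such shortcut.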

Example: Zip Code Analysis

Let’s suppose you want to analyze conversion in your ads funnel by zip code. Perhaps you want to test a thesis that higher-income zip codes convert better, but you don’t want to expose that zip code data. Here’s how you do it in three easy steps.

Step 1: Generate an analysis-specific token for the email

Firstly, use a tokenization service like UserClouds to replace each user's email with a single-use token for this analysis. Because the token is analysis-specific, it can’t be matched against other data sets. This makes the dataset far less valuable to potential attackers.

If you want to use the outputs of your analysis for some next steps or deeper evaluation, you can generate a resolvable token instead, i.e. a token that can be converted back into the original data by someone with access to the token vault.
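As a rough sketch of Step 1, the toy vault below issues analysis-specific tokens and only remembers how to reverse the ones you asked to be resolvable. The TokenVault class, its method names and the analysis IDs are hypothetical. A real tokenization service would add access control, persistence and auditing.

```python
import secrets


class TokenVault:
    """In-memory stand-in for a token vault (hypothetical, not a real service API)."""

    def __init__(self) -> None:
        self._issued = {}      # (analysis_id, value) -> token
        self._resolvable = {}  # token -> original value

    def tokenize(self, analysis_id: str, value: str, resolvable: bool = False) -> str:
        key = (analysis_id, value)
        token = self._issued.get(key)
        if token is None:
            # Within one analysis a user keeps the same token, so rows still
            # group correctly. A different analysis gets an unrelated token.
            token = secrets.token_hex(8)
            self._issued[key] = token
        if resolvable:
            self._resolvable[token] = value
        return token

    def resolve(self, token: str) -> str:
        if token not in self._resolvable:
            raise KeyError("token is not resolvable")
        return self._resolvable[token]


vault = TokenVault()
one_way = vault.tokenize("ads-funnel", "will@userclouds.com")
two_way = vault.tokenize("follow-up", "will@userclouds.com", resolvable=True)
print(vault.resolve(two_way))  # will@userclouds.com
```

Scoping tokens to an analysis ID is what stops the same person from being matched across two exported data sets.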

Step 2: Transform the other PII data

Secondly, transform your other PII data (like zip codes) into non-PII that you can still use for your analysis. The key here is to transform the data enough to preserve privacy, whilst also preserving the relevant dimensions for your analysis.

In an income-based hypothesis, you could pick a token generation policy that preserves zip code income level. For example, you might map all high-income zip codes, like 90210, to a single high-income zip code, 94305. Of course, if you had another hypothesis, like political leaning, you could define a generation policy to preserve a different dimension, like prior voting trends. Generation policies can be arbitrarily smart about preserving whatever dimension you want.
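Here is one way an income-preserving policy might look in Python. The lookup table is a stand-in: only 90210 and 94305 come from the example above. The low-income codes, the sentinel value and the function name transform_zip are placeholders for illustration, not real data or a real API.

```python
# Hypothetical lookup from zip code to income bracket. In practice this would
# come from a reference data set such as census income figures.
ZIP_INCOME_BRACKET = {
    "90210": "high",
    "94305": "high",
    "11111": "low",  # placeholder code, not real data
    "22222": "low",  # placeholder code, not real data
}

# One representative zip code per bracket. Every zip in a bracket maps to it.
REPRESENTATIVE_ZIP = {"high": "94305", "low": "11111"}

UNKNOWN_ZIP = "00000"  # sentinel so unmapped codes never pass through unchanged


def transform_zip(zip_code: str) -> str:
    """Replace a zip code with the representative zip for its income bracket."""
    bracket = ZIP_INCOME_BRACKET.get(zip_code)
    if bracket is None:
        return UNKNOWN_ZIP
    return REPRESENTATIVE_ZIP[bracket]


print(transform_zip("90210"))  # 94305
print(transform_zip("22222"))  # 11111
```

The analyst still sees a valid-looking zip code with the right income level, but can no longer tell which of the original zip codes a user lives in.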

Transformation methods that can be applied to preserve privacy include:

  1. Granularity reduction, where many zip codes are mapped to one similar zip code
  2. Categorization, where raw data is replaced with a category, like "high-income"
  3. Noise addition, where PII is obscured by adding random noise to it
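The three methods above can each be sketched in a few lines of Python. The function names, the three-digit prefix, the 100,000 threshold and the Gaussian noise are illustrative assumptions. The right parameters depend on your data and your privacy requirements.

```python
import random


def reduce_granularity(zip_code: str, keep_digits: int = 3) -> str:
    """Granularity reduction: keep a prefix and zero out the rest,
    so every zip code sharing that prefix collapses to one value."""
    return zip_code[:keep_digits] + "0" * (len(zip_code) - keep_digits)


def categorize_income(median_income: float, threshold: float = 100_000) -> str:
    """Categorization: replace a raw figure with a coarse label.
    The threshold is an arbitrary example value."""
    return "high-income" if median_income >= threshold else "other"


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Noise addition: perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


print(reduce_granularity("90210"))  # 90200
print(categorize_income(150_000))   # high-income
print(add_noise(52_000.0, scale=1_000.0, rng=random.Random()))  # varies per run
```

Each method trades detail for privacy, so pick the lightest transformation that still hides the individual while keeping the dimension your hypothesis needs.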

Step 3: Analyze the tokenized data set

Finally, analyze the tokenized data set in your data warehouse, and find that company-changing insight!
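To give a feel for Step 3, here is a small Python sketch that computes conversion rate per income category. The rows are toy values made up for illustration, and the column names user_token, income_category and converted are assumptions. In practice this would more likely be a GROUP BY query in your warehouse.

```python
from collections import defaultdict

# Toy rows standing in for the tokenized data set. No PII appears here:
# emails are tokens and zip codes have been replaced with a category.
rows = [
    {"user_token": "tok_01", "income_category": "high-income", "converted": True},
    {"user_token": "tok_02", "income_category": "high-income", "converted": False},
    {"user_token": "tok_03", "income_category": "other", "converted": True},
    {"user_token": "tok_04", "income_category": "other", "converted": False},
    {"user_token": "tok_05", "income_category": "other", "converted": False},
    {"user_token": "tok_06", "income_category": "other", "converted": False},
]


def conversion_rate_by(rows, key):
    """Return {group: share of rows in that group where converted is True}."""
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        conversions[row[key]] += int(row["converted"])
    return {group: conversions[group] / totals[group] for group in totals}


rates = conversion_rate_by(rows, "income_category")
print(rates)  # {'high-income': 0.5, 'other': 0.25}
```

The analyst gets the comparison they need without ever seeing an email address or a real zip code.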

What we've achieved

Voila! We’ve solved the problem. Moreover:

  • We haven’t exposed our PII to our analyst, so we’ve reduced our risk of employee abuse or employee account takeover.
  • We haven’t replicated or downloaded PII, which contains data sprawl and reduces our surface area for attack.
  • The data we have created is single-use and is not PII. Since it is single use, it can’t be correlated with another dataset later, so it’s far less valuable to attackers.

If you want to use tokenization to protect your raw data, send us a message at info@userclouds.com.
