Your company’s data is a treasure trove of insights and value, but with every piece of analysis, personal information gets exposed, downloaded and replicated. That increases your exposure to employee abuse, employee account takeover, endpoint security risks and data breaches.
In this post, we'll show you how to analyze sensitive data sets without creating privacy and security risk, so you can rest easy knowing your users’ data is protected.
The magic solution is an elegant combination of tokenization and data transformation.
Tokenization is the act of replacing sensitive PII with a form-matching token. It lets you unlock the power of your data without sacrificing privacy or security. For example, you might tokenize an email address, like will@userclouds.com, to a unique, random string of characters that still looks like an email address: easfhrtyu@radfshtyui.com. The token is not PII, but you can analyze, explore or use it just as easily and effectively as the PII itself.
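To make the idea of a form-matching token concrete, here is a minimal Python sketch. It is an illustration only, not the UserClouds API: the function names `random_letters` and `tokenize_email`, the fixed token lengths and the `.com` suffix are all assumptions made for this example.

```python
import secrets
import string


def random_letters(n: int) -> str:
    """Return n random lowercase letters from a cryptographically strong source."""
    return "".join(secrets.choice(string.ascii_lowercase) for _ in range(n))


def tokenize_email(email: str) -> str:
    """Return a random token that is shaped like an email address.

    Hypothetical helper: the token is not derived from the input, so it
    reveals nothing about the original address.
    """
    if "@" not in email:
        raise ValueError("expected an email address")
    return f"{random_letters(9)}@{random_letters(10)}.com"


# The output changes on every call, but always looks like an email address.
print(tokenize_email("will@userclouds.com"))
```

Because the token is random rather than computed from the email, a real service would also need to store the mapping if the same user must receive the same token twice.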
Tokenization is different from hashing in a few key ways:
Let’s suppose you want to analyze conversion in your ads funnel by zip code. Perhaps you want to test a thesis that higher income zip codes are higher converting, but you don’t want to expose that zip code data. Here’s how you do it in three easy steps.
Firstly, use a tokenization service like UserClouds to replace each user's email with a single-use token for this analysis. Because the token is analysis-specific, it can’t be matched against other data sets. This makes the dataset far less valuable to potential attackers.
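An analysis-specific token can be sketched as follows. This is a simplified, in-memory stand-in and not how any particular tokenization service works: the class `AnalysisTokenizer`, its method `token_for` and the `example.com` token domain are hypothetical names chosen for this example.

```python
import secrets


class AnalysisTokenizer:
    """Issues tokens that are only meaningful inside a single analysis."""

    def __init__(self) -> None:
        # email -> token, private to this analysis (hypothetical storage)
        self._tokens: dict[str, str] = {}

    def token_for(self, email: str) -> str:
        # Reuse the token within the analysis so rows for one user still join.
        if email not in self._tokens:
            self._tokens[email] = f"{secrets.token_hex(8)}@example.com"
        return self._tokens[email]


funnel_analysis = AnalysisTokenizer()
other_analysis = AnalysisTokenizer()

# Same user, same analysis: same token. Same user, other analysis: different token.
print(funnel_analysis.token_for("will@userclouds.com"))
print(other_analysis.token_for("will@userclouds.com"))
```

Keeping one tokenizer per analysis means that two exported data sets cannot be joined on the token column, which is the property described above.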
If you want to use the outputs of your analysis for next steps or deeper evaluation, you can generate a resolvable token, i.e. a token that you can convert back into the original data when you need to.
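A resolvable token implies that something keeps the mapping from token back to the original value. The sketch below shows that idea with a hypothetical in-memory `TokenVault`. The class and its `tokenize` and `resolve` methods are assumptions for illustration, and a production system would add access control and durable, encrypted storage.

```python
import secrets


class TokenVault:
    """Hypothetical vault that can turn a token back into the original value."""

    def __init__(self) -> None:
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        token = secrets.token_urlsafe(12)
        self._value_by_token[token] = value  # keep the mapping so we can resolve later
        return token

    def resolve(self, token: str) -> str:
        # Raises KeyError for tokens this vault never issued.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("will@userclouds.com")
print(vault.resolve(token))  # prints the original email address
```

Only code that holds the vault can resolve tokens, so the analysis data set itself can be shared without carrying the original values.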
Secondly, transform your other PII data (like zip codes) into non-PII that you can still use for your analysis. The key here is to transform the data enough to preserve privacy, whilst also preserving the relevant dimensions for your analysis.
In an income-based hypothesis, you could pick a token generation policy that preserves zip code income level. For example, you might map all high-income zip codes, like 90210, to a single high-income zip code, 94305. Of course, if you had another hypothesis here, like political leaning, you could define a generation policy to preserve a different dimension, like prior voting trends. Generation policies can be arbitrarily smart about preserving whatever dimension you want.
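The income-preserving policy described above could look something like the sketch below. The lookup tables are hypothetical: only the 90210 to 94305 mapping comes from the example in the text, while "00001" and "00002" are made-up placeholder zip codes and the function name `transform_zip` is an assumption.

```python
from typing import Optional

# Hypothetical lookup from zip code to income band. A real policy would load
# this from a data source you trust; "00001" and "00002" are placeholders.
INCOME_BAND_BY_ZIP = {
    "90210": "high",
    "94305": "high",
    "00001": "low",
    "00002": "low",
}

# One representative zip code per band, so individual zip codes are hidden
# while the income dimension survives.
REPRESENTATIVE_ZIP = {"high": "94305", "low": "00001"}


def transform_zip(zip_code: str) -> Optional[str]:
    """Map a zip code to the representative zip code of its income band."""
    band = INCOME_BAND_BY_ZIP.get(zip_code)
    if band is None:
        return None  # unknown zip codes are dropped rather than passed through
    return REPRESENTATIVE_ZIP[band]


print(transform_zip("90210"))  # 94305
```

Returning `None` for unknown zip codes is a deliberate choice in this sketch: passing an unmapped value through unchanged would leak the original data.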
Transformation methods that can be applied to preserve privacy include:
Finally, analyze the tokenized data set in your data warehouse, and find that company-changing insight!
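As a rough illustration of this last step, the sketch below computes conversion rate per transformed zip code over a handful of rows. The rows, the token values and the function name `conversion_by_zip` are invented for this example, and in practice the calculation would more likely be a query in your data warehouse.

```python
from collections import defaultdict


def conversion_by_zip(rows: list[dict]) -> dict[str, float]:
    """Return the conversion rate for each (transformed) zip code."""
    counts = defaultdict(lambda: [0, 0])  # zip -> [conversions, visitors]
    for row in rows:
        entry = counts[row["zip"]]
        entry[0] += 1 if row["converted"] else 0
        entry[1] += 1
    return {zip_code: conv / seen for zip_code, (conv, seen) in counts.items()}


# Illustrative rows only: tokens instead of emails, representative zip codes
# instead of real ones.
rows = [
    {"token": "tok_a", "zip": "94305", "converted": True},
    {"token": "tok_b", "zip": "94305", "converted": False},
    {"token": "tok_c", "zip": "00001", "converted": False},
    {"token": "tok_d", "zip": "00001", "converted": False},
]

print(conversion_by_zip(rows))  # {'94305': 0.5, '00001': 0.0}
```

Nothing in the rows is PII, yet the income dimension needed to test the hypothesis is still present.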
Voila! We’ve solved the problem. Moreover:
If you want to use tokenization to protect your raw data, send us a message at info@userclouds.com.