What Most Data Centers Don’t Understand About PII

Tags: Credit Card • Data • data center • Data Center POST • Datasets • DCP • iMiller Public Relations • iMPR • Personally Identifiable Information • PII • Social Security Number • SSN

By Tim Williams, CEO, Index Engines

It’s a very common misconception that credit card numbers are PII, but the truth is, PII is your email address, your home address, your phone number…any information that can be used to identify you. A single piece of PII is like a loose thread. Once you have it, you can use the Internet to start pulling on it, and get more and more of it. In the world we live in, unless you are prepared to move completely off the grid, you can’t protect your PII.

My credit card number is certainly very Sensitive Information (let’s call it SI), but it can’t be used to identify me. Only when combined with PII does it create the problems associated with identity theft. If you don’t know who owns a credit card number, there’s just not that much mischief you can do with it.

Why does this matter? Well, have you just convinced your company to invest in technology that scans your network for Social Security Numbers, Credit Card Numbers, Bank Account Numbers, Routing Numbers, HealthCare Identifiers, etc. because you were charged with finding and eliminating these kinds of threats? Unless you understand the differences between SI and PII, the task will be much more difficult than you imagined. That’s because there are two different strategies that vendors use to find SI, and both of them have flaws.

Let’s call the first method the Optimist method. The Optimist assumes an orderly world where Social Security Numbers are always stored in a format like NNN-NN-NNNN: nine digits separated by dashes into three groups (three digits, two digits and four digits). Maybe your Optimist has had a brush with reality, and will recognize Social Security Numbers with single spaces instead of dashes, but that is as realistic as they will get. Unfortunately, reality can be cruel, often storing Social Security Numbers as nine digits without dashes grouping them, or with lots of space between the groups. A number can even be stored in three separate fields that alone are unrecognizable as a Social Security Number, and only make sense when displayed in a companion form that supplies the dashes and makes the data readable. For these reasons, the Optimist can, and does, miss SI. (Most vendors use this method, so odds are this is what you bought.)
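To make the Optimist’s blind spots concrete, here is a minimal sketch in Python. The pattern and the name `optimist_find` are illustrative assumptions, not any vendor’s actual implementation; the sketch only shows a strict matcher that accepts dashes or single spaces and nothing else.

```python
import re

# Hypothetical "Optimist" pattern (an assumption for illustration):
# three digits, two digits, four digits, joined by dashes or by
# single spaces. The backreference \1 requires the same separator
# in both positions.
OPTIMIST_SSN = re.compile(r"(?<!\d)\d{3}([- ])\d{2}\1\d{4}(?!\d)")


def optimist_find(text):
    """Return every substring the strict pattern recognizes."""
    return [m.group(0) for m in OPTIMIST_SSN.finditer(text)]


samples = [
    "SSN: 123-45-6789",      # canonical dashed form
    "SSN: 123 45 6789",      # single spaces
    "SSN: 123456789",        # no separators at all
    "SSN: 123   45   6789",  # several spaces between the groups
]

for sample in samples:
    print(repr(sample), "->", optimist_find(sample))
```

The last two samples come back empty, which is exactly the kind of SI the Optimist leaves behind.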

Compare that with the Pessimist method. The Pessimist knows how disorderly reality is and casts as wide a net as possible when searching. Not only will they match any sequence of nine consecutive digits, they will also match any series of three, then two, then four digits separated by any number of non-alphanumeric characters. The Pessimist isn’t likely to miss any SI at all. The problem is all the false positive matches they will find. You will be surprised how many nine-digit numbers you will find that aren’t really Social Security Numbers. While both methods generate false positives, and while there are well-known practices used by both methods to minimize those false positives, you’ll get far more of them from a Pessimist than an Optimist. In some datasets, the false positives can be overwhelming.
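The Pessimist’s wide net can be sketched the same way. The pattern below is an assumption written to follow the description above (nine digits in a row, or groups of three, two and four digits with any run of non-alphanumeric characters between them); `pessimist_find` is a hypothetical name.

```python
import re

# Hypothetical "Pessimist" pattern (an assumption for illustration):
# 3 + 2 + 4 digits with zero or more non-alphanumeric characters
# between the groups. Zero separators covers the plain nine-digit case.
PESSIMIST_SSN = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)


def pessimist_find(text):
    """Return every substring that could conceivably be an SSN."""
    return [m.group(0) for m in PESSIMIST_SSN.finditer(text)]


samples = [
    "SSN: 123456789",             # no separators
    "SSN: 123   45   6789",       # lots of space between groups
    "Order 987654321 shipped",    # not an SSN, but still nine digits
    "Part no. 555.12.3456 rev B"  # not an SSN, but the right shape
]

for sample in samples:
    print(repr(sample), "->", pessimist_find(sample))
```

Every sample produces a hit, including the order number and the part number. Nothing is missed, and the false positives arrive along with the real thing.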

It’s possible to further minimize Pessimist false positives by, for example, excluding search results that aren’t near strings like “Social” or “Security” or “SSN” or “Employee” when searching for Social Security Numbers, or “Credit Card”, “Amex”, “Visa”, “MasterCard” when searching for Credit Card numbers. A search like that would hit a credit card number regardless of how the number was formatted. Using that technique on a dataset that was pronounced “Clean of SI” after it was processed by an Optimist, you will find lots of examples they missed. It’s a very effective way to quickly find the flaws in an Optimist implementation. Of course, the keyword filter is also likely to end up excluding SI that the Optimist found but that did not appear near any of those strings.
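A rough sketch of that keyword filter follows. The 40-character window, the pattern and the name `find_with_context` are all assumptions chosen for illustration; a real tool would tune what “near” means.

```python
import re

# Wide-net candidate pattern (assumption): 3 + 2 + 4 digits with any
# run of non-alphanumeric characters, or none, between the groups.
CANDIDATE = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)

# Context strings taken from the article's SSN example.
KEYWORDS = re.compile(r"social|security|ssn|employee", re.IGNORECASE)

WINDOW = 40  # characters on either side of a hit; an arbitrary choice


def find_with_context(text, window=WINDOW):
    """Keep only candidates that have a keyword within `window` characters."""
    hits = []
    for m in CANDIDATE.finditer(text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window]
        if KEYWORDS.search(context):
            hits.append(m.group(0))
    return hits


print(find_with_context("Employee SSN: 123456789"))
print(find_with_context("Tracking code 987654321 shipped today"))
print(find_with_context("id 123 45 6789"))
```

The first call keeps its hit and the second drops a false positive. The third drops a number that may well be real SI, simply because no keyword sits nearby, which is the trade-off described above.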

So if an Optimist isn’t foolproof, and a Pessimist can generate too many false positives, what’s to be done? That’s the true value of searching for PII. Since SI is only a problem when it is matched with PII, it follows that by using a tool that implements the Pessimist method to search for SI only where it is near PII (the last names of all your employees or customers, for instance), you can efficiently find all the SI that truly puts your organization at risk. That means that if your dataset is large, you will need a pretty powerful indexing engine and a well-thought-out search process, but at least you can be confident that the task can be successfully completed.
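As a small-scale illustration of the idea, the sketch below keeps a Pessimist-style hit only when a known last name appears close to it. The function name `find_si_near_pii`, the 60-character window and the sample names are assumptions; scanning text in memory like this is no substitute for the indexing engine a large dataset would need.

```python
import re

# Wide-net candidate pattern (assumption): 3 + 2 + 4 digits with any
# run of non-alphanumeric characters, or none, between the groups.
CANDIDATE = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)


def find_si_near_pii(text, last_names, window=60):
    """Return candidate numbers with a known last name within `window` characters."""
    if not last_names:
        return []  # no PII list means nothing can be tied to a person
    names = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in last_names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in CANDIDATE.finditer(text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window]
        if names.search(context):
            hits.append(m.group(0))
    return hits


# Hypothetical list of last names, e.g. exported from an HR or CRM system.
known_names = ["Williams", "Nguyen"]

print(find_si_near_pii("Williams, T.   123 45 6789", known_names))
print(find_si_near_pii("Invoice total 987654321", known_names))
```

The number next to a known name is reported even though it is oddly spaced and has no “SSN” label, while the nine-digit invoice total, which no one on the list is attached to, is ignored.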
