Data Engineering – What’s Data Classification?

Inspired by an Uber trip inbound to sunny San Francisco.

An Introduction

Hello everybody!

Inspired by an Uber trip and the many conversations with friends who ask me what I do for work, I thought I’d briefly go over my job and three big questions:

What is Data Classification?
Why is Data Classification Needed?
What makes Data Classification Challenging?

Alright, let’s begin! Tell me what you do!

The Story – What is Data Classification?

Since the dot-com era of the 1990s, as computers became widespread and made it possible to collect large volumes of data, companies of all kinds, from big tech enterprises to insurance and financial tech organizations, have often run into this problem: What type of data are we collecting? There are business problems at hand, and to serve our end customers, we need to collect data.

But before we do so, there’s a catch. We need to ask ourselves good questions. What are we collecting? Are we collecting user information – first names, dates of birth, street addresses? Are we collecting aggregate metrics – a sum or a running average of transactions executed on a given day? Or are we collecting metadata on files: the size, the file extension, and the owners? Are we collecting the right data?

Why Is Data Classification Needed?

And more importantly, can we detect sensitive data elements (which I’ll call SDEs) and make sure that none of them get leaked outside the enterprise? Think of PII (Personally Identifiable Information), such as SSNs (Social Security Numbers). Because if that data gets leaked, we run into catastrophic problems – reputational risk, a loss of long-term customer trust, financial losses, and so on.

This is where Data Classification teams come in! You can call them by multiple names – the guardians, the stewards, or the protectors – of data. They’re the team that scans the databases and the data sources for those SDEs and flags them when they’re detected. Yes, they’re an additional layer, but they’re also key to the defense in depth that organizations desire – without them, we’d run into far worse situations.
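To make “scans and flags” a bit more concrete, here is a minimal sketch of what a scan over a handful of records could look like. Everything in it is an assumption for illustration: the `SDE_PATTERNS` table, the `scan_rows` helper, and the sample rows are hypothetical, and real detectors are far stricter than these two regular expressions.

```python
import re

# Hypothetical patterns for two kinds of SDE. These are deliberately loose;
# they only illustrate the idea of matching values against known shapes.
SDE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_rows(rows):
    """Return (row_index, column, sde_type) for every value that matches a pattern."""
    findings = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            for sde_type, pattern in SDE_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.append((i, column, sde_type))
    return findings


# Made-up sample records standing in for a table in some data source.
rows = [
    {"name": "Ada", "note": "SSN on file: 123-45-6789"},
    {"name": "Bob", "note": "prefers phone contact"},
    {"name": "Cy", "note": "reach me at cy@example.com"},
]

findings = scan_rows(rows)
for row_index, column, sde_type in findings:
    print(f"row {row_index}, column '{column}': possible {sde_type}")
```

Each finding records where the match happened rather than the matched value itself, so the report of a leak does not become a second copy of the sensitive data.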

So What Makes Classification Challenging?

That’s a really good question! There are a number of reasons why data classification is harder than it looks.

Firstly, the definitions of sensitive data constantly change and evolve in lockstep with technological changes. Data that wasn’t seen as sensitive years ago is today seen as sensitive, and data previously unobtainable in massive quantities – biomarkers and fingerprints – is more mainstream.

Secondly, the classification problem itself is getting harder and harder. Traditionally, classification teams leveraged what we call rules engines to flag bad records. Those engines still operate, but only for well-defined data – structured data, like your phone numbers or your birthdays, which follow a limited set of formats. But modern data – images, videos, audio – is unstructured or semi-structured; it’s less well-defined. Consequently, we have to supplement the rules engines with ML models, which can handle classification and detection of SDEs for these other kinds of input. But unlike the deterministic rules engines, the ML models are probabilistic – their accuracy gets close to, but falls short of, 100%. This means that we still need the code, the notifications, and a human in the loop to give the final verdict on the misclassifications – the false positives and the false negatives.
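Here is a small sketch of that split between rules, models, and humans. It is an illustration under stated assumptions, not how any particular team does it: the phone-number rule, the `AUTO_FLAG` and `AUTO_CLEAR` thresholds, the `Verdict` type, and both function names are hypothetical, and the model score is simply passed in as a number between 0 and 1 rather than produced by a real model.

```python
import re
from dataclasses import dataclass

# Hypothetical rule for one structured format (US-style phone numbers).
PHONE_RULE = re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$")

# Hypothetical thresholds; in practice these would be tuned per data type.
AUTO_FLAG = 0.90   # at or above this score, flag without review
AUTO_CLEAR = 0.10  # at or below this score, clear without review


@dataclass
class Verdict:
    label: str   # "sensitive", "not_sensitive", or "needs_review"
    source: str  # "rule", "model", or "human_queue"


def classify_structured(value: str) -> Verdict:
    """Deterministic path: the same input always gets the same answer."""
    if PHONE_RULE.match(value):
        return Verdict("sensitive", "rule")
    return Verdict("not_sensitive", "rule")


def triage_model_score(score: float) -> Verdict:
    """Probabilistic path: only confident scores are decided automatically."""
    if score >= AUTO_FLAG:
        return Verdict("sensitive", "model")
    if score <= AUTO_CLEAR:
        return Verdict("not_sensitive", "model")
    # Everything in between goes to a person for the final call.
    return Verdict("needs_review", "human_queue")


print(classify_structured("415-555-0100"))
print(triage_model_score(0.97))
print(triage_model_score(0.55))
```

Widening the gap between the two thresholds sends more items to the human queue and should reduce automatic mistakes; narrowing it does the opposite. That trade-off is where much of the tuning effort tends to go.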

Finally, let’s throw in the cornucopia: the rest of the mixed bag – data volumes, execution speeds, multiple source types, the “under-the-hood” processing of catalogs or inventories, logging, telemetry, and designing the infrastructure and the stages for ETL pipelines. The work that surrounds and supports actual classification, but isn’t classification itself. The work that still has to be done.

And now you can truly see where the data engineering talent comes into a business’s grand picture.

What Teams Do Data Classification Folks Work With?

In my current role, the classification team works closely with two other system teams:

(A) The ML Model team – they operate at the layer of building new ML classifiers or extending pre-existing ones, like gradient-boosted decision trees (XGBoost, for example); they’re heavily involved in training, testing, and validating the latest ML models, as well as incorporating feedback from upstream teams like us to iteratively improve their latest model versions.
(B) The ML Model QA team – this is a middle layer in between the classification team and the ML Model team. They conduct stress tests and make sure that models work in production by using containers and other virtualization software to emulate production settings. They resolve issues with the configurations, the environments, the runtime dependencies, and the systems that surround the ML models.

harisrid Tech News
