Hi all,
I’m writing this blog post because I’ve been working extensively on a data-engineering-esque feature at work, and I think it’d be awesome to explain what I’m up to.
Most companies – from large tech firms to old-school enterprise organizations – hire data engineering talent: talent that builds out the technical solutions for a company’s data vertical. These engineers focus on problems involving data controls and standards, governance, and compliance. Data engineers are a hidden “security layer” for companies, shouldering an umbrella of corporate responsibilities. Their work involves identifying data sources, scanning those sources, and then exporting their findings to end users. Each step entails R&D-esque work or feature-ticket work, and “the cream of the crop” of engineers ask really good clarifying questions – questions that reduce the level of effort of tasks and enable a long-lasting solution in production with minimal issues.
Why Does Identifying PII/SDEs ( Personally Identifiable Information / Sensitive Data Elements ) Matter?
(a) Ensures that customer records are kept compliant and up-to-date with legal requirements – enforced across different levels of government ( federal, state, local )
(b) Pre-empts capital losses incurred from data breaches
(c) Engenders stronger customer base trust in corporate applications
Let’s dig deeper into their complexity and the types of questions good engineers should ask!
Identify the data sources
- Are the sources within organizational scope? If so, are they internal or external?
- How can I find out my sources?
- Identify from configuration files ( e.g. .yaml files )
- Look into inventories – made available on internal websites or APIs
- Do I need to meet with other teams?
- Can other teams identify the inventory of data sources?
- Can we delegate maintaining the inventory to another team?
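As a sketch of the identification step above: a minimal, hypothetical inventory loader in Python. The entry fields ( `name`, `scope`, `owner_team` ) are illustrative assumptions, not a real enterprise schema, and the raw entries stand in for what you might parse out of a configuration file or inventory API.

```python
# Hypothetical sketch: discover data sources from a config-style inventory
# ( e.g. entries parsed from a .yaml file or an internal inventory API ).
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    scope: str        # "internal" or "external" to the organization
    owner_team: str   # team we might delegate inventory upkeep to

def load_inventory(raw: list[dict]) -> list[DataSource]:
    """Parse raw inventory entries into typed DataSource records."""
    return [DataSource(e["name"], e["scope"], e["owner_team"]) for e in raw]

inventory = load_inventory([
    {"name": "sqlserver-prod", "scope": "internal", "owner_team": "data-platform"},
    {"name": "partner-feed", "scope": "external", "owner_team": "integrations"},
])
internal = [s.name for s in inventory if s.scope == "internal"]
```

Typing the inventory up front makes the later clarifying questions ( who owns a source, is it in scope ) answerable with a filter rather than a meeting.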
Scan the data sources
- Do we need to scan data sources? Can we identify the path of least resistance?
- The sources may already be encrypted
- The sources may be unencrypted BUT still meet enterprise requirements.
- If we scan the data sources, what do we look for?
- Comprehensive scanning – of each record, across all databases
- Targeted subsets of tables and columns?
- What data elements do we need to scan for?
- How often do data sources need to be scanned?
- Do I scan on a historical basis? Once every six months ( for all records )?
- Do I execute continuous scans on a periodic basis ( e.g. hourly runs or nightly runs )?
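The scanning questions above can be made concrete with a tiny sketch: a targeted scan of one column’s sampled values against PII-like patterns. The patterns and the `crm.public.customers.contact` column name are made-up assumptions for illustration, not an enterprise ruleset.

```python
# Minimal sketch of a targeted column scan for PII-like data elements.
# Patterns here are illustrative; a real ruleset would be far richer.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_column(fqcn: str, values: list[str]) -> dict:
    """Return which PII element types match any sampled value.

    fqcn is the fully qualified column name: database.schema.table.column.
    """
    hits = {label for v in values
            for label, rx in PATTERNS.items() if rx.search(v)}
    return {"column": fqcn, "elements": sorted(hits)}

finding = scan_column(
    "crm.public.customers.contact",
    ["alice@example.com", "123-45-6789", "n/a"],
)
```

Scanning sampled values per column ( rather than every record ) is one answer to the “path of least resistance” question; comprehensive scanning swaps the sample for the full record set.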
Export and Deliver the Findings
- How should findings be communicated?
- Which party needs the finding? Are they internal or external auditors?
- Is the easiest strategy – outputting to a CSV/Excel-esque dump – satisfactory?
- Do we need a ( simple or complex ) UI to enable real-time updates of identified elements?
- What is the retention policy for findings?
- Can we remove our findings after a time window ( e.g. a month or a year )?
- Can we compress and archive findings?
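The “easiest strategy” export named above – a CSV dump – can be sketched in a few lines. The field names ( `column`, `elements`, `scanned_at` ) are assumptions for illustration; an enterprise delivery layer would define its own contract.

```python
# Sketch: export scan findings as a CSV dump, the path-of-least-resistance
# delivery strategy. Field names are illustrative assumptions.
import csv
import io

def export_findings(findings: list[dict]) -> str:
    """Serialize findings to CSV text suitable for an auditor hand-off."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["column", "elements", "scanned_at"])
    writer.writeheader()
    for f in findings:
        writer.writerow(f)
    return buf.getvalue()

csv_text = export_findings([
    {"column": "crm.public.customers.contact",
     "elements": "email;ssn",
     "scanned_at": "2024-01-01"},
])
```

A CSV answers the internal-auditor case cheaply; the real-time UI question only arises when findings must be corrected or re-classified after delivery.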
What was my feature deliverable?
Figuring out the data sources was mostly “done for me”, but I had to build out the other two steps of the feature ( and its capability ) for a new data source ( most of the feature had already been developed, but only for pre-existing sources ):
a. Scan multiple data sources and databases for PII and SDEs
b. Export the findings to an Enterprise-specific layer for external personas to execute an audit.
What are the data sources you scanned?
My feature works across four major relational databases – SQL Server, Snowflake, DB2, and PostgreSQL – each with its own SQL dialect. For each data source onboarded, I operated at the granularity of database.schema.table.column, following enterprise/team conventions. The scans ( so far ) have been historical rather than continuous – once a source had been scanned, it didn’t have to be scanned again to facilitate export.
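The database.schema.table.column convention mentioned above can be captured in a small addressing type. The `ColumnRef` name and parser below are illustrative, not the team’s actual code.

```python
# Sketch of the database.schema.table.column addressing convention.
# Names are illustrative assumptions.
from typing import NamedTuple

class ColumnRef(NamedTuple):
    database: str
    schema: str
    table: str
    column: str

def parse_fqcn(fqcn: str) -> ColumnRef:
    """Split a fully qualified column name into its four parts."""
    parts = fqcn.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected database.schema.table.column, got {fqcn!r}")
    return ColumnRef(*parts)

ref = parse_fqcn("sales.dbo.orders.customer_ssn")
```

Normalizing every source onto one four-part address is what lets a single scan/export pipeline span four different database engines.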
How Do I Feature-Test Across Environments?
- Local environments
- Non-production environments
- Sandbox environments
- Other environments
- Production environments
Testing types:
1. One-off tests
2. Comprehensive tests – all data sources
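One way to reason about the environment/test-type matrix above is to gate each test type by environment. The environment names and the gating rule below are assumptions for illustration ( e.g. assuming comprehensive scans are too heavy to run against production ).

```python
# Illustrative sketch: gate test types by environment.
# The rule that comprehensive tests are barred from production is an assumption.
ENV_ALLOWED_TESTS = {
    "local": {"one-off"},
    "sandbox": {"one-off", "comprehensive"},
    "production": {"one-off"},
}

def can_run(env: str, test_type: str) -> bool:
    """Return whether the given test type may run in the given environment."""
    return test_type in ENV_ALLOWED_TESTS.get(env, set())
```

Encoding the matrix as data makes it reviewable in one place when a new environment ( or test type ) is onboarded.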
Assessing Feature Complexity
- Metrics collection – build out dashboards, reports, or visualizations showing the volume of data under the scan and export operations. Capture metrics:
- Across fuzzy/exact text search ( e.g. all databases or database.schemas with names starting with <insert_prefix_here>* )
- Across logical granularities with tag-based filters : at database, database.schema, database.schema.table, database.schema.table.column granularities
- Across time windows ( for continual scans ) – on a periodic basis fixed by hour / day / week / month.
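The fuzzy-prefix metric described above can be sketched with a glob match over fully qualified names. The scanned-volume numbers and column names are made up for illustration.

```python
# Sketch: scan-volume metrics filtered by a fuzzy ( glob-style ) name pattern.
# The data below is fabricated for illustration.
from fnmatch import fnmatch

SCANNED_ROWS = {
    "sales.dbo.orders.customer_ssn": 10_000,
    "sales.dbo.orders.order_id": 10_000,
    "hr.core.employees.email": 2_500,
}

def scanned_rows(pattern: str) -> int:
    """Total rows scanned across columns whose FQN matches the pattern."""
    return sum(n for fqn, n in SCANNED_ROWS.items() if fnmatch(fqn, pattern))

sales_total = scanned_rows("sales.*")
```

The same aggregation, keyed by a timestamp instead of a name pattern, would serve the time-window metrics for continual scans.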