Hi all,
I’m writing this blog post because I’ve been working extensively on a data-engineering-esque feature at work, and I think it’d be awesome to explain what I’m up to.
Most companies – from large tech firms to old-school enterprise organizations – hire data engineering talent: talent that builds out the technical solutions for a company’s data vertical. These engineers focus on problems involving data controls and standards, governance, and compliance. Data engineers are a hidden “security layer” for companies, shouldering an umbrella of corporate responsibilities. Their work involves identifying data sources, scanning those sources, and then exporting their findings to end users. Each step entails R&D-esque work or feature-ticket work, and “the cream of the crop” of engineers ask really good clarifying questions – questions that reduce the level of effort of tasks and enable a long-lasting solution in production with minimal issues.
Why Does Identifying PII/SDEs ( Personally Identifiable Information / Sensitive Data Elements ) Matter?
(a) Ensures that customer records are kept compliant and up-to-date with legal requirements – enforced across different levels of government ( federal, state, local )
(b) Pre-empts capital losses incurred from data breaches
(c) Engenders stronger customer base trust in corporate applications
Let’s dig deeper into their complexity and the types of questions good engineers should ask!
Identify the data sources
- Are the sources within organizational scope? If so, are they internal or external?
- How can I find out my sources?
- Identify from configuration files ( e.g. .yaml files )
- Look into inventories – made available on internal websites or APIs
- Do I need to meet with other teams?
- Can other teams identify the inventory of data sources?
- Can we delegate maintaining the inventory to another team?
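As a sketch of the identification step above: a minimal, hypothetical inventory loader in Python. The entry fields ( `name`, `scope`, `owner_team` ) are illustrative assumptions, not a real enterprise schema, and the raw entries stand in for what you might parse out of a configuration file or inventory API.

```python
# Hypothetical sketch: discover data sources from a config-style inventory
# ( e.g. entries parsed from a .yaml file or an internal inventory API ).
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    scope: str        # "internal" or "external" to the organization
    owner_team: str   # team we might delegate inventory upkeep to

def load_inventory(raw: list[dict]) -> list[DataSource]:
    """Parse raw inventory entries into typed DataSource records."""
    return [DataSource(e["name"], e["scope"], e["owner_team"]) for e in raw]

inventory = load_inventory([
    {"name": "sqlserver-prod", "scope": "internal", "owner_team": "data-platform"},
    {"name": "partner-feed", "scope": "external", "owner_team": "integrations"},
])
internal = [s.name for s in inventory if s.scope == "internal"]
```

Typing the inventory up front makes the later clarifying questions ( who owns a source, is it in scope ) answerable with a filter rather than a meeting.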
Scan the data sources
- Do we need to scan data sources? Can we identify the path of least resistance?
- The sources may already be encrypted
- The sources may be unencrypted BUT still meet enterprise requirements.
- If we scan the data sources, what do we look for?
- Comprehensive scanning – of each record, across all databases
- Targeted subsets of tables and columns?
- What data elements do we need to scan for?
- How often do data sources need to be scanned?
- Do I scan on a historical basis? Once every six months ( for all records )?
- Do I execute continuous scans on a periodic basis ( e.g. hourly runs or nightly runs )?
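The scanning questions above can be made concrete with a tiny sketch: a targeted scan of one column’s sampled values against PII-like patterns. The patterns and the `crm.public.customers.contact` column name are made-up assumptions for illustration, not an enterprise ruleset.

```python
# Minimal sketch of a targeted column scan for PII-like data elements.
# Patterns here are illustrative; a real ruleset would be far richer.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_column(fqcn: str, values: list[str]) -> dict:
    """Return which PII element types match any sampled value.

    fqcn is the fully qualified column name: database.schema.table.column.
    """
    hits = {label for v in values
            for label, rx in PATTERNS.items() if rx.search(v)}
    return {"column": fqcn, "elements": sorted(hits)}

finding = scan_column(
    "crm.public.customers.contact",
    ["alice@example.com", "123-45-6789", "n/a"],
)
```

Scanning sampled values per column ( rather than every record ) is one answer to the “path of least resistance” question; comprehensive scanning swaps the sample for the full record set.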
Export and Deliver the Findings
- How should findings be communicated?
- Which party needs the finding? Are they internal or external auditors?
- Is the easiest strategy – outputting to a CSV/Excel-esque dump – satisfactory?
- Do we need a ( simple or complex ) UI to enable real-time updates of identified elements?
- What is the retention policy for findings?
- Can we remove our findings after a time window ( e.g. a month or a year )?
- Can we compress and archive findings?
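The “easiest strategy” export named above – a CSV dump – can be sketched in a few lines. The field names ( `column`, `elements`, `scanned_at` ) are assumptions for illustration; an enterprise delivery layer would define its own contract.

```python
# Sketch: export scan findings as a CSV dump, the path-of-least-resistance
# delivery strategy. Field names are illustrative assumptions.
import csv
import io

def export_findings(findings: list[dict]) -> str:
    """Serialize findings to CSV text suitable for an auditor hand-off."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["column", "elements", "scanned_at"])
    writer.writeheader()
    for f in findings:
        writer.writerow(f)
    return buf.getvalue()

csv_text = export_findings([
    {"column": "crm.public.customers.contact",
     "elements": "email;ssn",
     "scanned_at": "2024-01-01"},
])
```

A CSV answers the internal-auditor case cheaply; the real-time UI question only arises when findings must be corrected or re-classified after delivery.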
What was my feature deliverable?
Figuring out the data sources was mostly “done for me”, but I had to build out the other two steps of the feature ( and its capability ) for a new data source ( most of the feature had already been developed, but only for pre-existing sources ):
a. Scan multiple data sources and databases for PII and SDEs
b. Export the findings to an Enterprise-specific layer for external personas to execute an audit.
What are the data sources you scanned?
My feature works across four major relational databases – SQL Server, Snowflake, DB2, and PostgreSQL – each with its own SQL dialect. For each data source onboarded, I operated at the granularity of database.schema.table.column, following enterprise/team conventions. The scans ( so far ) have been historical rather than continuous – once a source had been scanned, it didn’t have to be scanned again to facilitate export.
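The database.schema.table.column convention mentioned above can be captured in a small addressing type. The `ColumnRef` name and parser below are illustrative, not the team’s actual code.

```python
# Sketch of the database.schema.table.column addressing convention.
# Names are illustrative assumptions.
from typing import NamedTuple

class ColumnRef(NamedTuple):
    database: str
    schema: str
    table: str
    column: str

def parse_fqcn(fqcn: str) -> ColumnRef:
    """Split a fully qualified column name into its four parts."""
    parts = fqcn.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected database.schema.table.column, got {fqcn!r}")
    return ColumnRef(*parts)

ref = parse_fqcn("sales.dbo.orders.customer_ssn")
```

Normalizing every source onto one four-part address is what lets a single scan/export pipeline span four different database engines.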
How Do I Feature-Test Across Environments?
- Local environments
- Non-production environments
- Sandbox environments
- Other environments
- Production environments
Testing types:
1. One-off tests
2. Comprehensive tests – all data sources
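One way to reason about the environment/test-type matrix above is to gate each test type by environment. The environment names and the gating rule below are assumptions for illustration ( e.g. assuming comprehensive scans are too heavy to run against production ).

```python
# Illustrative sketch: gate test types by environment.
# The rule that comprehensive tests are barred from production is an assumption.
ENV_ALLOWED_TESTS = {
    "local": {"one-off"},
    "sandbox": {"one-off", "comprehensive"},
    "production": {"one-off"},
}

def can_run(env: str, test_type: str) -> bool:
    """Return whether the given test type may run in the given environment."""
    return test_type in ENV_ALLOWED_TESTS.get(env, set())
```

Encoding the matrix as data makes it reviewable in one place when a new environment ( or test type ) is onboarded.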
Assessing Feature Complexity
- Metrics collection – build out dashboards, reports, or visualizations showing the volume of data under the scan and export operations. Capture metrics:
- Across fuzzy/exact text search ( e.g. all databases or database.schemas with names starting with <insert_prefix_here>* )
- Across logical granularities with tag-based filters : at database, database.schema, database.schema.table, database.schema.table.column granularities
- Across time windows ( for continual scans ) – on a periodic basis fixed by hour / day / week / month.
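The fuzzy-prefix metric described above can be sketched with a glob match over fully qualified names. The scanned-volume numbers and column names are made up for illustration.

```python
# Sketch: scan-volume metrics filtered by a fuzzy ( glob-style ) name pattern.
# The data below is fabricated for illustration.
from fnmatch import fnmatch

SCANNED_ROWS = {
    "sales.dbo.orders.customer_ssn": 10_000,
    "sales.dbo.orders.order_id": 10_000,
    "hr.core.employees.email": 2_500,
}

def scanned_rows(pattern: str) -> int:
    """Total rows scanned across columns whose FQN matches the pattern."""
    return sum(n for fqn, n in SCANNED_ROWS.items() if fnmatch(fqn, pattern))

sales_total = scanned_rows("sales.*")
```

The same aggregation, keyed by a timestamp instead of a name pattern, would serve the time-window metrics for continual scans.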