Hello everyone!!!
Today, I want to go over the different types of database scans and exports I’ve executed – or will be executing – at my current workplace. What makes each one rigorous and unique? Why were they developed in this particular order? And do the learnings and ETL exports developed here translate to other case studies or data engineering scenarios?
The Four Horsemen of ETL-esque Exports
- Historical – scanning the set of all records accumulated so far. Analogous to reading every row of an Excel sheet.
- Automated – can we trigger an ETL/export automatically upon receipt of a new/latest scan?
- Incremental/Delta – what if a record, or a table, previously scanned changes? Say someone adds a new subset of columns or updates a subset of rows. Can we export just the delta of the table ( versus the entire table )?
    - Snowflake, DB2, Blaze
- Rescan – on a 10-Q quarterly basis ( or another cadence, yet to be determined ), we execute a mandatory export again over all records scanned so far. Analogous to executing a historical export again. Rescans are required to meet legal and compliance requirements around data retention, privacy, and security mandated by government entities.
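The incremental/delta mode is the trickiest of the four, so here is a minimal sketch of one common way to implement it: a high-watermark column. Everything below – the table name, the `last_updated` column, the in-memory SQLite database – is an illustrative assumption, not the actual production schema.

```python
import sqlite3

def export_delta(conn, last_watermark):
    """Return only the rows changed since the previous export,
    plus the new watermark to persist for the next run."""
    rows = conn.execute(
        "SELECT id, payload, last_updated FROM records "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo: a historical export picks up everything; a later incremental
# run picks up only the one row that changed in between.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, payload TEXT, last_updated INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 1)])

full, wm = export_delta(conn, 0)    # historical: watermark 0 → all rows
conn.execute("UPDATE records SET payload = 'b2', last_updated = 2 WHERE id = 2")
delta, wm = export_delta(conn, wm)  # incremental: just the changed row
print(len(full), len(delta))        # 2 1
```

With this shape, a historical export is just the degenerate case of an incremental export with the watermark set to zero – which is part of why the modes were built in this order.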
As of the start of May, I’ve completed work on historical and I’m currently executing work on automated. I’ll migrate over to incremental/delta and rescan once the first two wrap up 🙂
The Data Sources
My capabilities focus on a subset of databases – relational and non-relational, transactional and analytical.
- Relational SQL – SQLServer, PostgresDB, DB2
- Analytical – Snowflake
- Upcoming : Oracle
As fortuity had it, my first four were easy to work with – SQLServer, PostgresDB, DB2, and Snowflake. Each of these databases
(A) Already had a chunk of their development done.
(B) Came in only one flavor.
But the upcoming ones bring more flavors and more rigor. Specifically, Oracle and Atlas come in three flavors – one on-premises and two on-cloud ( IaaS and PaaS ).
The Adoption of Airflow DAGs
- A commonly used, mainstream tool
- Other teams are already using them – easy for a team internally in a company to adopt
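For readers who haven’t seen one, here is a minimal sketch of how the automated export might be expressed as an Airflow DAG ( Airflow 2.x syntax ). The `dag_id`, `task_id`, schedule, and the body of `run_export` are all illustrative assumptions, not the team’s actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_export():
    # Placeholder for the actual export logic against the source database.
    print("exporting latest scan results")

with DAG(
    dag_id="automated_scan_export",   # hypothetical name
    start_date=datetime(2024, 5, 1),
    schedule="@daily",                # a fixed cadence as a stand-in
    catchup=False,
) as dag:
    PythonOperator(
        task_id="export_latest_scan",
        python_callable=run_export,
    )
```

An event-driven trigger ( e.g. an Airflow sensor ) could replace the fixed schedule so the export fires on receipt of a new scan, which matches the “automated” mode described above.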
What are the Challenges ( Encountered and Upcoming )?
Extending the feedback loop mechanism across existing and upcoming environments posed multiple challenges. Let me go over a few :
- Tradeoffs : sparse information versus dense information.
- Sensitive Data Elements and PII are usually sparse. Can we make optimizations ahead of time?
- Onboarding an additional data source ( or data source flavor )
- Modifying the underlying rules engine and making features that work for one source work for another.
- Preventing future changes from breaking pre-existing changes.
- Waiting on dependencies and major blockers from other teams.
- Wait for their data / DDL ( data definition language ) ingestions
- Wait for their release of API endpoints
- Wait for their applications to update.
- Addressing network/connectivity/permissions/access issues ( e.g. Azure Key Vault rotations ) across devices and environments ( e.g. Windows versus MacOS )
