harisrid Tech News

Spanning many domains – data, systems, algorithms, and personal

Data Engineering – Key Considerations for Onboarding Data Sources

Hi all,

I was motivated to think about this design question from a recently-encountered workplace scenario.

Recently, my team and I have been building out capabilities to scan four different flavors of databases – SQL Server, PostgreSQL, DynamoDB, and Snowflake. For each data source, we needed to scan across granularity levels – databases, schemas, tables, and columns.

Which got me thinking – if someone needs to onboard a database for an enterprise process – ETL pipelines, storage, or querying – what factors should they consider? Do they need to impose a strict onboarding order? What should they communicate to their PMs and managers when justifying the priority of one source over another?

The Overarching Questions

  1. Capabilities check – do I already have built-out capabilities – full or partial – for onboarding? It’s faster to build on solutions for partially onboarded data sources, even if those sources are complex.
  2. Datatype – what’s the data type? Operating with structured, textual-only data ( typical of SQL databases ) is easier than semi-structured data or unstructured data ( e.g. audio and images ).
  3. Data volume – how much data do I need to work with? Am I dealing with a single 1 GB ( gigabyte ) table? Or do I need to execute scans in the PB/EB ( petabyte/exabyte ) range across multiple disks?
  4. Customer urgency – which customer needs a solution soonest? Onboarding data source one might take only two weeks, but customers may need data source two ASAP.
  5. Low-hanging fruit – on the other extreme, which sources are the quickest to onboard? Can we deliver an MVP with a single source?
  6. Access/permissions barriers – some sources are harder to onboard for bureaucratic reasons – I need to contact and coordinate with another team to retrieve the appropriate levels of access. Reasons can vary from enterprise security to legal & compliance.
  7. Preprocessing steps – some data is easier to operate on than other data. In some cases, I can execute an immediate read; in others, I need to read and apply a series of pre-processing steps – transformations, validations, and input sanitizations. The extra steps introduce upstream complexity and delay feature deliverability.
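One way to operationalize the factors above is a simple weighted scoring model. This is a minimal sketch, not a prescription – every weight and per-factor score below is a hypothetical value I’ve made up for illustration; a real team would calibrate them against their own backlog.

```python
# A toy weighted scoring model over the seven onboarding factors.
# All weights and per-source scores (1-5 scale) are hypothetical.

FACTORS = [
    "capabilities",   # existing onboarding capability (higher = more built out)
    "datatype",       # data-type simplicity (higher = simpler, e.g. structured text)
    "volume",         # volume manageability (higher = smaller / easier)
    "urgency",        # customer urgency (higher = more urgent)
    "quick_win",      # low-hanging-fruit potential
    "access",         # ease of access (higher = fewer permission barriers)
    "preprocessing",  # fewer preprocessing steps required
]

def priority_score(source: dict, weights: dict) -> float:
    """Weighted sum of per-factor scores for one data source."""
    return sum(weights[f] * source[f] for f in FACTORS)

weights = {f: 1.0 for f in FACTORS}
weights["urgency"] = 2.0  # e.g. a team that weighs customer urgency double

sources = {
    "SQL Server": dict(capabilities=4, datatype=5, volume=5, urgency=3,
                       quick_win=5, access=4, preprocessing=5),
    "MongoDB":    dict(capabilities=2, datatype=1, volume=3, urgency=4,
                       quick_win=2, access=3, preprocessing=1),
}

ranked = sorted(sources, key=lambda s: priority_score(sources[s], weights),
                reverse=True)
print(ranked)  # onboarding order, highest score first
```

The point of the sketch is the shape of the decision, not the numbers: making the weights explicit is also exactly what you’d show a PM or manager to justify one source’s priority over another.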

Give me an onboarding example.

Sure.

Let’s suppose I need to onboard three data sources – SQL Server ( SQL ), MongoDB ( NoSQL ), and Snowflake ( Analytical ). For each source, I need to scan for sensitive elements, such as credit card numbers. And let’s also suppose the following volume metrics:

Data Source | Volume ( in Terabytes ) | Data Types
SQL Server  | 10 TB                   | Structured ( Textual )
MongoDB     | 100 TB                  | Unstructured ( BLOB )
Snowflake   | 1000 TB                 | Structured ( Numerical )

Figure 1 – a very simplified “factors analysis” view
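The “scan for sensitive elements” step can be sketched with standard techniques – a regex to find candidate card numbers plus a Luhn checksum to cut false positives. The row layout and column names here are assumptions for illustration, not my team’s actual scanner.

```python
import re

# Toy sensitive-element scanner: flag candidate credit card numbers in
# text cells using a regex, then validate each candidate with the Luhn
# checksum to reduce false positives.

CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13-16 digits, optional separators

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of `number`."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def scan_row(row: dict) -> list:
    """Return (column, match) pairs that look like valid card numbers."""
    hits = []
    for col, value in row.items():
        for m in CARD_RE.finditer(str(value)):
            if luhn_valid(m.group()):
                hits.append((col, m.group()))
    return hits

row = {"name": "Jane Doe", "note": "card 4111 1111 1111 1111 on file"}
print(scan_row(row))  # flags the test card number in the "note" column
```

In practice this per-row scan is exactly where the volume column of Figure 1 bites – running it over 10 TB versus 1000 TB is a very different engineering problem.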

If we went strictly off of volume, I’d bias immediately towards { SQL Server, Snowflake }. But there’s a complexity element – onboarding MongoDB is more challenging because it stores unstructured BLOB data. The SQL Server data is structured, and the Snowflake data is all numerical ( it’s an analytical DB ). In this case, I’d justify altering the onboarding order to { SQL Server, Snowflake, MongoDB } – a new precedence hierarchy, accounting first for data type complexity and then for data volume.
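That precedence hierarchy maps naturally onto a tuple sort key. A minimal sketch, where the complexity ranks are my own assumed values mirroring the discussion (structured = simple, BLOB = hard) and smaller volumes break ties as quicker wins:

```python
# Sort sources by (data-type complexity ascending, volume ascending).
# Complexity ranks are assumptions chosen to mirror Figure 1.

TYPE_COMPLEXITY = {
    "Structured (Textual)":   1,
    "Structured (Numerical)": 1,
    "Unstructured (BLOB)":    3,
}

sources = [
    # (name, volume in TB, data type) -- from Figure 1
    ("SQL Server", 10,   "Structured (Textual)"),
    ("MongoDB",    100,  "Unstructured (BLOB)"),
    ("Snowflake",  1000, "Structured (Numerical)"),
]

order = sorted(sources, key=lambda s: (TYPE_COMPLEXITY[s[2]], s[1]))
print([name for name, _, _ in order])
```

Because Python sorts tuples lexicographically, complexity dominates and volume only decides among equally complex sources – which is precisely the { SQL Server, Snowflake, MongoDB } ordering argued for above.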

