Hi all,
A scenario I recently ran into at work motivated me to think through this design question.
Recently, my team and I have been building out capabilities to scan four different flavors of databases – SQL Server, PostgreSQL, DynamoDB, and Snowflake. For each data source, we needed to scan across granularity levels – databases, schemas, tables, and columns.
Which got me thinking – if someone needs to onboard a database for an enterprise process – ETL pipelines, storage, or querying – what factors should they consider? Do they need to impose a strict onboarding order? What should they communicate to their PMs and managers when they need to justify prioritizing one source over another?
The Overarching Questions
- Capabilities check – do I already have built-out capabilities – full or partial – for onboarding? It’s faster to build on top of partially onboarded data sources, even when those sources are complex.
- Data type – what kind of data is it? Working with structured, textual-only data ( typical of SQL databases ) is easier than working with semi-structured or unstructured data ( e.g. audio and images ).
- Data volume – how much data do I need to work with? Am I dealing with a single 1 GB ( gigabyte ) table? Or do I need to execute scans in the PB/EB ( petabyte/exabyte ) range across multiple disks?
- Customer urgency – which customer needs a solution soonest? Onboarding data source one might take only two weeks, but customers may need data source two ASAP.
- Low-hanging fruit – on the other extreme, which sources are the quickest to onboard? Can we deliver an MVP with a single source?
- Access/permissions barriers – some sources are harder to onboard for bureaucratic reasons – I need to contact and coordinate with another team to obtain the appropriate level of access. The reasons can range from enterprise security to legal & compliance.
- Preprocessing steps – some data is easier to operate on than others. In some cases I can execute an immediate read; in other cases I need to read and then apply a series of preprocessing steps – transformations, validations, and input sanitization. The extra steps introduce upstream complexity and delay feature delivery.
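To make these trade-offs concrete, here’s a minimal sketch of how the factors above could be folded into a single priority score. The factor names, the weights, and the 1-to-5 scoring scale are all illustrative assumptions on my part, not a prescribed methodology.

```python
from dataclasses import dataclass

@dataclass
class OnboardingCandidate:
    name: str
    existing_capability: int       # 5 = fully built out, 1 = nothing reusable
    datatype_simplicity: int       # 5 = structured/textual, 1 = unstructured BLOBs
    volume_manageability: int      # 5 = a few GB, 1 = PB/EB range
    customer_urgency: int          # 5 = needed ASAP, 1 = no pressure
    access_readiness: int          # 5 = permissions already in hand, 1 = months of approvals
    preprocessing_simplicity: int  # 5 = read as-is, 1 = heavy transform/validate/sanitize

# Illustrative weights – tune these to whatever your org actually values.
WEIGHTS = {
    "existing_capability": 2.0,
    "datatype_simplicity": 1.5,
    "volume_manageability": 1.0,
    "customer_urgency": 2.5,
    "access_readiness": 1.5,
    "preprocessing_simplicity": 1.0,
}

def priority_score(c: OnboardingCandidate) -> float:
    """Higher score = onboard sooner, under the illustrative weights above."""
    return sum(weight * getattr(c, factor) for factor, weight in WEIGHTS.items())

candidates = [
    OnboardingCandidate("SQL Server", 4, 5, 4, 3, 5, 4),
    OnboardingCandidate("MongoDB",    2, 1, 3, 5, 3, 2),
    OnboardingCandidate("Snowflake",  3, 4, 1, 4, 4, 4),
]

for c in sorted(candidates, key=priority_score, reverse=True):
    print(f"{c.name}: {priority_score(c):.1f}")
```

The point isn’t the specific numbers – it’s that writing the factors down as an explicit score gives you something defensible to show a PM when the onboarding order gets questioned.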
Give me an onboarding example.
Sure.
Let’s suppose I need to onboard three data sources – SQL Server ( SQL ), MongoDB ( NoSQL ), and Snowflake ( Analytical ). For each onboarded source, I need to scan for sensitive elements, such as credit card numbers. And let’s also suppose the following volume metrics:
| Data Source | Volume ( in terabytes ) | Data type |
| --- | --- | --- |
| SQL Server | 10 TB | Structured ( Textual ) |
| MongoDB | 100 TB | Unstructured ( BLOB ) |
| Snowflake | 1000 TB | Structured ( Numerical ) |
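Before getting to ordering, it’s worth showing what the scan itself might look like. Below is a minimal sketch of a column-level credit card check – a regex to find candidate digit runs plus a Luhn checksum to filter out false positives. The sample values are made up, and the shape of `scan_column` ( a plain iterable of values ) is an assumption; in practice you’d page rows out of each source however its driver allows.

```python
import re

# 13-19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum – filters out most random digit runs that match the regex."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_column(values):
    """Yield values that look like real card numbers (regex match + Luhn pass)."""
    for value in values:
        for match in CARD_PATTERN.finditer(str(value)):
            digits = re.sub(r"[ -]", "", match.group())
            if luhn_valid(digits):
                yield value
                break

# "4111 1111 1111 1111" is a well-known Luhn-valid test number.
sample = ["order #12345", "4111 1111 1111 1111", "phone: 555-0100"]
print(list(scan_column(sample)))  # ['4111 1111 1111 1111']
```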
If we went strictly off of volume – smallest first – the order would be { SQL Server, MongoDB, Snowflake }. But there’s another element – onboarding MongoDB is more challenging because it stores unstructured BLOB data, whereas the SQL Server data is structured and the Snowflake data is all numerical ( it’s an analytical DB ). In this case, I’d justify altering the onboarding order to { SQL Server, Snowflake, MongoDB } – a new precedence hierarchy that accounts first for data type complexity, and then for data volume.
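That precedence hierarchy is easy to express as a tuple sort – complexity dominates, and volume breaks ties within a complexity tier. A minimal sketch, where the numeric complexity ranks are an assumption I’m making for this example:

```python
# Lower rank = simpler to onboard; these ranks are illustrative assumptions.
COMPLEXITY = {"Structured": 0, "Semi-structured": 1, "Unstructured": 2}

sources = [
    {"name": "SQL Server", "volume_tb": 10,   "kind": "Structured"},
    {"name": "MongoDB",    "volume_tb": 100,  "kind": "Unstructured"},
    {"name": "Snowflake",  "volume_tb": 1000, "kind": "Structured"},
]

# Sort by (data type complexity, volume): complexity first, then volume.
order = sorted(sources, key=lambda s: (COMPLEXITY[s["kind"]], s["volume_tb"]))
print([s["name"] for s in order])  # ['SQL Server', 'Snowflake', 'MongoDB']
```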
