harisrid Tech News

Spanning across many domains – data, systems, algorithms, and personal

Data Engineering – Effective SQL Schema Design: Managing Enterprise Datasets

“What did I just get myself into?”

Hi all,

I want to quickly write about how to model tracking datasets within enterprise organizations. It’s a frequent ask of engineers – especially data engineers – to figure out effective schemas to model the multitude of variegated data abstractions floating – managed and unmanaged – across companies.

But what makes this challenging? Isn’t it sufficient to build out a single SQL table that captures all pieces of relevant information ( e.g. the owners, the creators, the line of business, or the datasets themselves )?

Not exactly. There’s a lot of complexity, and plenty of challenges to pre-empt : scalability, permissions, data evolvability, schema changes, or even new data sources getting introduced. There’s no one-size-fits-all solution. In 2025, datasets come in a multitude of forms : Excel spreadsheets, large BLOBs, or Kafka queues. There’s also the problem of scale – the volume of data builds up the longer an enterprise organization exists and the more capabilities a company builds out. Then throw in product evolution – how do we build an effective model today that can withstand unexpected changes in product scope and stakeholder requirements over the next 10 years? And what about testing? Can we even “battle-test” these models? What are the edge-case scenarios I need to pre-empt ( e.g. building indexes for fast lookups ahead of time, or reducing table dimensionality to fit data into fewer stores of external memory )? Do I need to also build for a future inventory, in case someone needs to identify and catalog all data assets ( existing and removed ) years down the road?

But back to the task at hand. In this story, senior developer Natalie is given ambiguous customer requirements ; she needs to quickly conjure up an MVP – minimum viable product – for a feature that needs to be shipped and delivered to an end customer within a quarter. The feature requires customers to query the underlying datasets and to answer commonly recurring questions, such as :

  • When was the last time a dataset was created? When was it last modified?
  • Who created a dataset? Who owns a dataset?
  • What team or what organizational unit ( OU ) does a dataset belong to?
  • How large are the datasets? What datasets consume the most external disk space?
  • Can I introduce filters for efficient searching? Should it be a singular filter ( e.g. a categorical type ) or multiple filters ( e.g. tags ) ?
  • If I’m executing searches, should they be exact searches ( e.g. match a tag verbatim ) or fuzzy-based searches ( e.g. prefix matching )?

What would Natalie design and why? And moreover, can she justify and explain her rationale to her team members if other senior engineers – like Emilio or Vinay – engage in collaborative back-and-forth Q&A?

What Database Solution would you use?

I’m using a SQL table for modeling since :
(A) The data is naturally relational.
(B) The schema is mostly fixed and known up front – it won’t frequently change – which suits a schema-on-write model.
(C) The information is mostly structured – there’s minimal unstructured elements ( e.g. large BLOBs, images, or videos )
(D) I expect external personas – data analysts or business analysts – to execute ad-hoc queries with SQL or other relational languages they’re already familiar with.
(E) The data’s natural relationships also engender a natural hierarchy – it’s easy to build a “visual tree” of relations.
(F) SQL is easy to understand for most developers, and old solutions continue to serve modern-day settings effectively. Shipping and iterating quickly usually takes precedence over modern solutions – even those that are “more performant” or “more efficient”.

Alright, let’s get to building – but wait, what am I building?

But let’s take a step back – before we dive into schemas and models, think about the actors. Who and what are we modeling? Are we modeling people? Places? Things? Tangible products ( e.g. e-commerce products )? Or logical abstractions ( e.g. a sense of ownership or a sense of creation )? And how are these abstractions related to each other? Does a person have only one sense of ownership, or can they have multiple ( it’s typically one )?

There are five primary entities – datasets, creators, owners, permissions, and organizational units ( OUs ) – with cardinality relationships :

Datasets : Creators – 1:1
Datasets : Owners – 1:1
Creators : Permissions – 1:1
Owners : Permissions – 1:1
Creators : OU – 1:1
Owners : OU – 1:1

Datasets – created to capture metadata about enterprise data abstractions ( e.g. last created/last modified at ). This is the “bread-and-butter” table.
Creators and owners – in any organization, somebody created a dataset and somebody owns the dataset. While most of the time the creator is the owner, there are many cases where the long-term owners are a distinct party. Think of a situation where downstream ML model developers ask upstream data engineers to create datasets for ML model testing, BUT, it’s the developers who maintain and own those datasets. There can also be multiple owners and ownership transfers taking place over the longevity of the data. Thus the usefulness of SOC – separation of concerns.
Permissions – what are creators and owners allowed to do? Can owners read, write, or delete data ( unrestricted rwx-esque UNIX permissions )? Or in most cases, are owners allowed to only READ data?
OU – the organizational unit. Creators and owners are typically employees of a company ; they belong to a team in a larger organizational structure and they report to a direct manager. What if I need to know which teams or orgs own data? Introducing this layer of information – even if it’s used in only a few query patterns – adds usefulness.

What would schemas resemble? Why are schema choices made?

That’s an excellent question!

My goal is to create tables that have the right number of columns – not too few, not too many. The Goldilocks zone. If I start noticing a surplus of columns, I’ll introduce PK-FK ( primary key-foreign key ) relationships and data normalization¹; this should effectively eliminate redundancy and help us build out solutions with a more customizable, granular view of enterprise information.
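To make that trade-off concrete, here’s a hedged sketch ( column names are my own illustrations, not final choices ) of what a single wide, denormalized table would look like before splitting – notice how creator, owner, and team details get repeated on every dataset row :

    -- One wide table : every dataset row repeats creator / owner / team details.
    -- Renaming a team or correcting a creator's name means touching many rows.
    CREATE TABLE datasets_denormalized (
        dataset_id         INT PRIMARY KEY,
        dataset_name       VARCHAR(200),
        created_time       TIMESTAMP,
        updated_time       TIMESTAMP,
        size_kb            INT,
        creator_first_name VARCHAR(100),   -- repeated per dataset
        creator_last_name  VARCHAR(100),   -- repeated per dataset
        creator_team_name  VARCHAR(100),   -- repeated per dataset
        owner_first_name   VARCHAR(100),   -- repeated per dataset
        owner_last_name    VARCHAR(100),   -- repeated per dataset
        owner_team_name    VARCHAR(100)    -- repeated per dataset
    );
    -- Normalizing pulls creators, owners, permissions, and OUs into their own
    -- tables and leaves only FK columns behind ( see the schemas below ).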


Datasets : [ PK ( Dataset ID ) ( INT ), Dataset Name ( STRING ), Created_Time ( TIMESTAMP ), Updated_Time ( TIMESTAMP ), Size ( in KB ) ( INT ), Description ( VARCHAR 200 ), Creator ( FK ) ( INT ), Owner ( FK ) ( INT ) ]
Creators : [ PK ( Creator ID ) ( INT ), First Name ( VARCHAR ), Last Name ( VARCHAR ), Created_At ( TIMESTAMP ), Permissions ( FK ) ( INT ), OU ( FK ) ( INT ) ]
Owners : [ PK ( Owner ID ) ( INT ), First Name ( VARCHAR ), Last Name ( VARCHAR ), Created_At ( TIMESTAMP ), Permissions ( FK ) ( INT ), OU ( FK ) ( INT ) ]
Permissions : [ PK ( Permission ID ) ( INT ), Can_Read ( BOOL ), Can_Write ( BOOL ), Can_Delete ( BOOL ) ]
Organizational Units ( OU ) : [ PK ( OU ID ) ( INT ), Team Name ( VARCHAR ), Org Name ( VARCHAR ), Direct Manager ( FK ) ( INT ) ]
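If I translate those listings into actual DDL, a minimal sketch in a generic ANSI-style SQL dialect might look like the following ( table names, column names, and the boolean permission flags are illustrative assumptions on my part, not a finalized design ) :

    CREATE TABLE organizational_units (
        ou_id          INT PRIMARY KEY,
        team_name      VARCHAR(100),
        org_name       VARCHAR(100),
        direct_manager INT                          -- could itself be an FK to an employee table
    );

    CREATE TABLE permissions (
        permission_id INT PRIMARY KEY,
        can_read      BOOLEAN,
        can_write     BOOLEAN,
        can_delete    BOOLEAN                       -- rwx-esque flags
    );

    CREATE TABLE creators (
        creator_id    INT PRIMARY KEY,
        first_name    VARCHAR(100),
        last_name     VARCHAR(100),
        created_at    TIMESTAMP,
        permission_id INT REFERENCES permissions(permission_id),
        ou_id         INT REFERENCES organizational_units(ou_id)
    );

    CREATE TABLE owners (
        owner_id      INT PRIMARY KEY,
        first_name    VARCHAR(100),
        last_name     VARCHAR(100),
        created_at    TIMESTAMP,
        permission_id INT REFERENCES permissions(permission_id),
        ou_id         INT REFERENCES organizational_units(ou_id)
    );

    CREATE TABLE datasets (
        dataset_id   INT PRIMARY KEY,
        dataset_name VARCHAR(200),
        created_time TIMESTAMP,
        updated_time TIMESTAMP,
        size_kb      INT,
        description  VARCHAR(200),
        creator_id   INT REFERENCES creators(creator_id),
        owner_id     INT REFERENCES owners(owner_id)
    );

The tables are created parents-first so that the FK REFERENCES clauses resolve.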

Optimizing targeted data types for columns.

Why use TIMESTAMP in place of INT, or VARCHAR(200) in place of STRING?
a. Use VARCHAR(200) or VARCHAR(<insert_fixed_limit_size>) over STRING – this programmatically enforces constraints and safety on inputs when there are known bounds on input sizes ( e.g. a standard tweet on twitter.com was historically capped at 140 characters ).
b. Leverage TIMESTAMP over STRING when dealing with Unix-stored time. A TIMESTAMP is unambiguous, unlike human-readable date strings ( e.g. DD/MM/YYYY vs MM/DD/YYYY ). TIMESTAMPs also save on disk space and can quickly be ordered – monotonically increasing or decreasing.
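A quick sketch of both points in practice ( reusing the illustrative column names from the DDL above ) :

    -- VARCHAR(200) constrains over-long descriptions at write time, instead of silently bloating rows ...
    INSERT INTO datasets (dataset_id, dataset_name, description, created_time)
    VALUES (1, 'clickstream_raw', 'Daily clickstream events from the web tier', CURRENT_TIMESTAMP);

    -- ... and TIMESTAMP columns compare and sort natively, no date-string parsing needed.
    SELECT dataset_name, created_time
    FROM   datasets
    ORDER  BY created_time DESC;                    -- newest datasets first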

Examples of what the tables look like ( I promise I’ll populate them later )

Schema 1 : Datasets

PK | Dataset_Name ( STRING ) | Created_Time ( TIMESTAMP ) | Updated_Time ( TIMESTAMP ) | Size (KB) ( INT ) | Description ( STRING ) | Creator ( FK ) ( INT ) | Owner ( FK ) ( INT )
1
2
3

Schema 2 : Creators

PK ( Creator ID ) | Creator First Name ( STRING ) | Creator Last Name ( STRING ) | Created_At ( TIMESTAMP ) | Permissions ( FK ) ( INT ) | Organizational Unit ( FK ) ( INT )
1
2
3

Schema 3 : Owners

PK ( Owner ID ) | Owner First Name | Owner Last Name | Created_At | Permissions ( FK ) | Organizational Unit ( FK )
1
2
3

Schema 4 : Permissions

PK ( Permission ID ) | Can_Read ( BOOL ) | Can_Write ( BOOL ) | Can_Delete ( BOOL )
1
2
3

Schema 5 : Organizational Unit ( OU )

PK ( OU ID ) | Team Name | Org Name | Direct Manager ( FK )
1
2
3
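With the tables above in place, the recurring questions from earlier map onto short queries. A hedged sketch, again reusing the illustrative names from the DDL sketch :

    -- When was a dataset created / last modified, who owns it, and on which team?
    SELECT d.dataset_name, d.created_time, d.updated_time,
           o.first_name, o.last_name, ou.team_name
    FROM   datasets d
    JOIN   owners o ON o.owner_id = d.owner_id
    JOIN   organizational_units ou ON ou.ou_id = o.ou_id
    WHERE  d.dataset_name = 'clickstream_raw';

    -- Which datasets consume the most external disk space?
    SELECT dataset_name, size_kb
    FROM   datasets
    ORDER  BY size_kb DESC
    LIMIT  10;                                      -- LIMIT syntax varies by dialect

    -- Fuzzy, prefix-style search versus an exact match.
    SELECT dataset_name FROM datasets WHERE dataset_name LIKE 'click%';        -- prefix match
    SELECT dataset_name FROM datasets WHERE dataset_name = 'clickstream_raw';  -- exact match

This is also where the normalization trade-off from the footnote shows up : the first query pays for its flexibility with two JOINs.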
  1. Normalization will incur performance degradations and costs with regard to complex queries involving JOIN(…) statements across multiple tables, BUT, for the purposes of building out a model that’s flexible and maintainable, I’m biasing towards the strategy. Normalization typically enables highly scalable solutions. ↩︎