harisrid Tech News

Spanning many domains – data, systems, algorithms, and personal

  • PERSONAL – On Writing : Why the Best Engineers are also the Best Writers

    The Value of Writing – a Primer

    Hi all!

    Today, I’m going to delve into another topic – writing!!! Yep, writing!!!!

    Yep, the subject you ( probably ) badly wanted to avoid back in high school and college. The dreaded dark days when you read classical literature like Moby Dick or All Quiet on the Western Front and wrote long-form essays, reflections, or short stories on the books your teacher assigned.

    Well, it turns out that writing is a crucial skill for engineers. In fact, I’d argue it’s more important than coding ( especially with the advent of ChatGPT and generative AI tools that can quickly spit out code ). Many of us share this feedback with each other – the best engineers are solid technical writers. They’re not just good coders. In fact, they may not even be your team’s best coder. Someone more junior might have shipped more LOC – lines of code – than they did during their entire tenure!

    But the best engineers sure know how to write. They’re good at convincing multiple audiences – engineers, product owners, leadership, and management – that their ideas are “worth their salt”, that their system designs are solid, and that their user stories make sense. Their documentation serves multiple roles – mentorship, conveying impact, and tracking accomplishments.

    Moreover, seniority and leveling naturally demand writing. Your role responsibilities – coupled with the folks you collaborate with ( senior talent ) – will act as a “forcing function”. You’ll either get tasked – or create the task yourself – to write up design documents and technical specifications. Writing at the higher levels is ineluctable ( and inescapable ).

    Lastly, it’s not a bad thing. In fact, I argue that writing is good. Think about it – the most famous people in your life and in your profession write. Barack Obama ( former US president ) reads and writes constantly. Roger Penrose ( renowned mathematical physicist ) wrote The Road to Reality. And Thomas Cormen co-wrote the bible of algorithms – CLRS. Emulating your profession’s titans is a good starting place!

    But what do I write about? How do I get started?

    But what if I run into writer’s block? What if I can’t find a topic to write about?
    That’s a good question. Let’s tackle that too!

    • Write about what you learned – did you create a side project in your free time? Did you contribute to a long-running open-source initiative? Did you work on a fascinating project back in your undergraduate days? Write about your learnings. What were your challenges? What were your biggest takeaways?
    • Write about what you know – finding topics is hard, but start from what you know. Words flow easily from established knowledge. Look into starting with topics from your day-to-day work – this helps you in a few ways. Writing about your work, outside of work, not only helps you better understand it, but also helps you deliver faster on your work deliverables; it’s a positive feedback loop in hiding!
    • Write opinion pieces – do you have a technical opinion? Something you want to share? Do you want to talk about what you think makes for an effective design document? Or how to write good user stories? Perhaps even a dissenting opinion ( gasp, it turns out you can disagree with engineering best practices, because a best practice isn’t a best practice in all situations ).
    • Write about engineering best practices – there are a lot of good engineering practices out there, and every engineer knows a good habit or two. I’d be hard-pressed to say I’ve met an engineer who never taught me a better technique for coding, debugging, designing, or investigating issues.

    How to Refine Your Technical Writing Skillz?

    • Practice, Practice, Practice – the best way to get better at technical writing is to practice writing. Practice makes perfect. Find opportunities to engage in technical writing, including outside your company.
    • Solicit feedback – the best writers don’t operate in a vacuum; they ask for feedback from their peers. Peers can see things that you don’t see in yourself – the good and the bad. I’ve asked for feedback in the past, and I’ve gotten suggestions such as :
      a. Be concise and clear – shorten sentences and write less verbosely.
      b. Think about your audience – write different types of documents based on the audience. You can set scopes for documents – brainstorms, condensed versions, and expanded versions.
      c. Focus on the big picture – shift your focus away from the low-level details towards the bigger picture.
      d. Write about today and tomorrow – write about problems from a timeline perspective – what does the current state of systems look like today, and what do we want those systems to resemble tomorrow?
      e. Incorporate visuals – focus on capturing reader attention to convey your points more effectively – leverage state transition diagrams, system diagrams, and tables. Color-code and label the visuals too.
    • Read! Read widely! – are there examples of engineering writing you admire? Books whose styles resonated and “clicked” with you? Blogs that captured your interest? I strongly admired these blogs1 for their specific traits :
      a. Communication styles – Gusto ( https://engineering.gusto.com/ ) and Airbnb ( https://medium.com/airbnb-engineering ).
      b. Simplicity and openness – the mathematical blog Better Explained ( https://betterexplained.com/ ).
      c. In-depth expositions – Professor Scott Aaronson’s blog ( https://scottaaronson.blog/ ).
    1. I still consult these blogs – and other online media – from time to time 🙂 ↩︎
  • Data Engineering – it’s ok to build for batch, even if everyone else is building for streams

    ( and please don’t let the fear-mongering of the world migrating to streaming get to you )

    Building Automation – a Primer

    Hi all,

    I’ve been working hard on an ETL-esque project at Geico to set up the internal plumbing, the infrastructure, and the business workflows to automate and execute the following :
    Step #1 ( SCAN ) : Scan a data source and populate a local table with records of each scan
    Step #2 ( EXPORT ) : Trigger an export of the data to its destination

    In today’s current state, we automatically scan data sources and populate our table of records, but we manually trigger endpoints to export.

    But tomorrow’s world needs to look different. The scan and the export both need to be automated – not only because end customers need to see their results as soon as possible, but also because we can’t keep asking humans – a.k.a. the engineers on our team – to go into an application and trigger the endpoints to export data. That’s a lot of manual work.

    Now on the surface, the automation looks easy to do. Hey, I’ll write a couple of shell scripts, call API endpoints in a loop, and we should be good to go, right? Not exactly. There are a couple of caveats and unknowns making this work challenging to execute. Let’s dive in 🙂 !

    The Unknowns to Navigate

    (A) When did an export need to be triggered? Was it after a scan’s completion? On each record – r1, …, rn – within a scan? Or on batches of records ( e.g. batches of 10, 20, or 50 records )?
    (B) When did data exports need to show up? Did they need to show up within five minutes? 24 hours? Or other configurable windows?
    (C) How would we handle failed events or failed processing? Did we have ways to avoid duplicate work? Or can we allow duplicate events, if the code triggered for events is deterministic ( e.g. the output is the same for a given input )?
    (D) What can we leverage to keep the design simple?
    – Can we incorporate intermediate tables, Kafka queues/event buses, or long-running background daemons?
    – Can we make changes to application-layer code or data-layer code? Can we leverage invariants of database triggers to confirm the completion of events – upstream and downstream?
    (E) What design components are owned by our team, versus owned by other teams?
    – Components owned by our team are faster to configure, customize, and develop.

    Scenario #1 : The Stream Mode

    The true stream mode resembles a more event-driven / CDC [ Change Data Capture ] approach, in which, once a scanned record is written out to a database, the update event needs to immediately trigger an export.

    Before I started gathering requirements, I made the project more complex than it needed to be. I imagined that on each scan ( s_i ), I needed to trigger the export on the immediate storage of scan records ( r_1, r_2, …, r_n ). This means that at the smallest level of granularity, I needed to immediately export results to their intended destination. That meant introducing a notion of an event bus or CDC ( Change Data Capture ) on each insert or update of a record into our local table.

    Now while CDC or event streaming is cool, it’s more complicated. The idea introduces additional steps and questions such as :
    (A) Who should fire the event to inform us that a record was stored? Which component is the authoritative source of truth? Should it be the producer ( the scanning application )? Or should we periodically poll the database?
    (B) If we leverage events, how long should events persist?
    (C) What do we do if a consuming application or daemon dies?
    (D) Do we own Kafka queues or event-driven architectural components?

    All of this introduces additional tasks and JIRA tickets, like building out the Kafka queues or conducting a research & development story to find out if our local table has CDC capabilities. This delays the date of feature delivery and adds long-term maintenance layers. Moreover, it’s easy to get caught in the woods and introduce too many optimizations ( e.g. do we have to change the database storing our scanned results? ). Optimization is good, BUT over-optimizing a design is a bad practice.
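
    For contrast, here’s roughly what that per-record, event-driven path could look like – a sketch only, assuming the kafka-python client; the topic name, connection settings, and export_record helper are all made up :

    ```python
    # Hypothetical sketch of the per-record, event-driven export path
    # ( assumes the kafka-python package; topic and helper names are invented ).
    import json

    from kafka import KafkaConsumer  # pip install kafka-python


    def export_record(record: dict) -> None:
        """Placeholder for the call that pushes one scanned record to its destination."""
        print(f"exporting record {record.get('record_id')}")


    consumer = KafkaConsumer(
        "scan-records",                       # hypothetical topic fed by CDC / the producer
        bootstrap_servers=["localhost:9092"],
        group_id="export-daemon",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        enable_auto_commit=True,
    )

    # Every insert/update event triggers an export immediately -- this is the
    # long-running daemon ( and the operational burden ) described above.
    for message in consumer:
        export_record(message.value)
    ```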

    What do we do instead?

    … hmm, let’s try the batch mode.

    Scenario #2 : The batch mode

    The alternative solution involves batch operations. Instead of triggering an export on each record scanned, what if exports could be triggered on the completion of an entire scan? It turns out that a configurable time window for when exports need to show up for our end users, combined with a local database table, let us introduce a time-bounded processing delay ( e.g. one or two hours ) before exporting results.

    But how did we determine what to export? Did we conjure up criteria? Hmm – we have a local table of scan results with a flexible schema, so we can introduce additional columns :
    a. ScanExecutionFinished – Boolean flag ( TRUE = finished, FALSE = not finished )
    b. ScanAlreadyExported – Boolean flag – ( TRUE = is exported, FALSE = not exported )
    c. ScanExecutionStartTime – UNIX timestamp ( informs when a scan started )
    d. ScanExecutionEndTime – UNIX timestamp ( informs when a scan ended )

    We can leverage the four columns to enable FIFO-esque automated processing of scans and exports; exports auto-trigger for scans in order of execution time, earliest first. A cutoff time window ( e.g. a timestamp value ) can be used – let’s denote it T_prime.

    If all three conditions are met for a given scan :
    a. ScanExecutionEndTime <= T_prime
    b. ScanExecutionFinished = TRUE
    c. ScanAlreadyExported = FALSE
    then trigger an export and set ScanAlreadyExported = TRUE.
    Otherwise, do not trigger an export.
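
    To make the criteria concrete, here’s a minimal sketch of the batch check – sqlite3 stands in for our real database, and the ScanRecords table / ScanId column are assumed names :

    ```python
    # Minimal sketch of the batch export criteria; sqlite3 is a stand-in for the
    # team's real database, and table/column names are assumptions.
    import sqlite3
    import time

    PROCESSING_DELAY_SECONDS = 2 * 60 * 60
    T_PRIME = int(time.time()) - PROCESSING_DELAY_SECONDS  # the cutoff timestamp

    conn = sqlite3.connect("scans.db")
    conn.row_factory = sqlite3.Row
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ScanRecords (
               ScanId                 INTEGER PRIMARY KEY,
               ScanExecutionFinished  INTEGER,
               ScanAlreadyExported    INTEGER,
               ScanExecutionStartTime INTEGER,
               ScanExecutionEndTime   INTEGER
           )"""
    )

    def export_scan(scan_id: int) -> None:
        """Placeholder for the real export call ( e.g. hitting the export endpoint )."""
        print(f"exporting scan {scan_id}")

    # FIFO-esque ordering: scans that finished earlier are exported first.
    eligible = conn.execute(
        """SELECT ScanId FROM ScanRecords
           WHERE ScanExecutionFinished = 1
             AND ScanAlreadyExported   = 0
             AND ScanExecutionEndTime <= ?
           ORDER BY ScanExecutionEndTime ASC""",
        (T_PRIME,),
    ).fetchall()

    for row in eligible:
        export_scan(row["ScanId"])
        conn.execute("UPDATE ScanRecords SET ScanAlreadyExported = 1 WHERE ScanId = ?",
                     (row["ScanId"],))
    conn.commit()
    ```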

    As for how to do the batch processing, that’s up to the engineering team. I won’t delve too deep here, but I’ll briefly touch upon the two big options ( a minimal Airflow sketch follows this list ) –
    (A) Utilize Airflow DAGs to trigger endpoints
    (B) Introduce periodic consumer-side polling of database tables.
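
    As mentioned, here’s a bare-bones sketch of option (A) – an hourly Airflow DAG whose single task applies the export criteria; the DAG id, schedule, and trigger_pending_exports callable are assumptions, not our production pipeline :

    ```python
    # Sketch of option (A): an hourly Airflow DAG that applies the export criteria.
    # DAG id, schedule, and the trigger_pending_exports callable are assumed names.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def trigger_pending_exports() -> None:
        """Query the scan table and call the export endpoint for eligible scans
        ( the same criteria as the snippet above )."""
        ...


    with DAG(
        dag_id="scan_export_batch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",   # the configurable processing-delay window
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="trigger_pending_exports",
            python_callable=trigger_pending_exports,
        )
    ```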

    Figure #1 – The above shows a Kafka-based streaming architecture, and the below shows a batch architecture. Batch-based ETL architectures are typically easier and faster to ship.

    My Biggest Takeaways

    I learned a lot from this system design experience – lemme share the takeaways!

    1. Requirements gathering – focus on gathering the requirements! Don’t just dive in. Set up meetings with other engineers and understand the overall ask at hand.
    2. Build simpler – streaming is cool, and we say it’s the future of the world. But if we’re operating in an unconstrained environment and need to ship faster, why not build simpler? Let’s bias towards batch processing if we can!
    3. What’s under your control – identify your locus of control, interior and exterior. Ask what’s under your control and what’s not. In our case, we had control over the following :
      (A) The time window of exports showing up
      (B) The handling of duplicate events
      (C) The components we could introduce and own ( e.g. local PostgresDB tables or Kafka queues )
    4. Conjure criteria sets and conditions – can we determine a rules engine – or set of criteria – for when to execute an operation? Imagine we built from the perspective of a grossly-simplified data structures & algorithms problem.
  • SYSTEMS – Testing Strategies Used For Large Volume Financial Networks

    A Primer

    Hi all,

    I want to briefly talk about my work back at S.W.I.F.T. SCRL – the Society for Worldwide Interbank Financial Telecommunication; the secure bank-to-bank organization that undergirds and enables large-volume, cross-border forex ( foreign exchange ) transactions and settlements. It was both my first job out of college and a place of tremendous learning and growth.

    My first set of tasks involved QA ( quality & automation ) type testing work; I had to verify that existing and upcoming financial systems, in production, met end-customer SLAs1 and SLIs2. Doing so entailed test execution for two payload types – messages ( the smaller ) and files ( the larger ).

    In its production setting, the application processes ( an estimated3 ) one million-plus messages per day – testing at this scale is difficult. And due to strict requirements – comprehensiveness, correctness, and determinism – in processing financial payloads, there’s no way a single one-off test can justify that “things work”. So what do we do instead?

    Identify what’s at our disposal

    So what existing industry practices can we leverage? What can we do? Do we have tools for testing? Can we get creative and deviate from typical testing norms? There are elements of both human intuition and machine-based verification at play.

    Luckily, we had two types of pre-production environments available:

    1. A local dev environment ( at the individual level )
    2. A shared benchmark environment ( at an organization level ) capable of executing closer-to-large-volume tests ( around 100,000 payloads – close to 10% of production volume ).

    And we had a group of at least four developers on the team. Let’s get to strategizing!

    The Testing Strategy

    a. Partition test types ( step #0 ) – execute tests for each payload type – messages and files – across financial environments ( e_1, …, e_n ); individual developers can test payload types in isolation.
    b. Local tests ( step #1 ) – individual developers execute one-off, visual tests of 5-10 payloads on their local machines.
    c. Benchmark tests ( step #2 ) – at least two4 separate developers execute benchmark-environment tests.
    d. Deploy to production ( step #3 ) – once a payload type passes the benchmark tests, obtain sign-off from the approving parties and production-ize ( a toy sketch of this staged gate follows the list ).
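
    Here’s that toy sketch of the staged gate – not S.W.I.F.T.’s actual harness; run_local_test and run_benchmark_test are hypothetical hooks into whatever test runner you have :

    ```python
    # Toy sketch of the staged gate: a payload type is production-ready only
    # after local spot checks and two benchmark runs pass. The test hooks are
    # hypothetical stand-ins, not the real harness.
    import random
    from typing import Callable, Sequence


    def release_ready(payloads: Sequence[str],
                      run_local_test: Callable[[str], bool],
                      run_benchmark_test: Callable[[str], bool]) -> bool:
        # Step #1: one-off local tests on a handful ( 5-10 ) of payloads.
        spot_check = random.sample(list(payloads), k=min(10, len(payloads)))
        if not all(run_local_test(p) for p in spot_check):
            return False
        # Step #2: two independent benchmark runs ( the "rule of two" footnote ).
        if not all(all(run_benchmark_test(p) for p in payloads) for _ in range(2)):
            return False
        # Step #3: hand off for approvals and production deployment.
        return True


    # Example usage with stubbed-out test hooks.
    print(release_ready([f"msg_{i}" for i in range(50)],
                        run_local_test=lambda p: True,
                        run_benchmark_test=lambda p: True))
    ```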

    By setting up staged testing across multiple environments, we’re able to assert with confidence that a feature works as expected.

    But there are a couple of caveats – it’s not the most “ideal” testing strategy. There are issues, and testing could have been expedited. Let’s dig in and look at what’s going on.

    My Learnings – What slows testing? How can we speed it up?

    a. Using a common shared environment – the benchmark environment is compute intensive, so it’s shared across engineers; one engineer can use it for a test from 8:00 a.m. – 10:00 a.m., but that means other engineers cannot use it. Hence, less available time for testing – code releases are set to the cadence of benchmark-environment availability.
    A solution : can we leverage pre-production, sandboxed, isolated environments? A production environment is always limited to a single, commonly shared environment, BUT non-production environments are customizable and under our control. Hence, they are faster to execute tests in.
    b. Limited testing time windows – if tests can only be executed during certain hours ( e.g. 9:00 a.m. – 5:00 p.m. ), that means only eight hours of availability. Due to legal & compliance ( or other external factors ), the windows have to stay limited.
    A Solution : Find workarounds and fixes – temporary or permanent – to extend the windows.

    Footnotes

    1. SLA – Service Level Agreement ↩︎
    2. SLI – Service Level Indicator ↩︎
    3. I never got the exact figures at the time ( most likely due to proprietary reasons OR legal & compliance reasons ). ↩︎
    4. The rule of two – a benchmark environment can experience intermittent failures – a second piece of verification lends stronger credibility ↩︎
  • Data Engineering – leverage sampling in pre-production ( there’s too much data in production anyway )

    Lies, damned lies, and statistics – Mark Twain

    Hi all,

    I ran into a scenario at work1 where we needed to deploy and productionize a feature that would exhaustively read all records for a given data source. The biggest challenge at hand was figuring out an effective test that would be quick yet comprehensive enough; what strategy gives the most confidence that the feature works?

    My feature integrates with multiple data sources ( e.g. PostgresDB, DB2, SQLServer ), each with a large number of logical abstractions/subdomains under them ( e.g. tables, columns, or schemas ). The number of subdomains can climb past 100 ( an estimated 1+ terabyte of on-disk data to process ).

    One-off testing in the local env and in pre-prod environments gives a visual level of confidence that a feature works, BUT there are cases where one-off is insufficient. One-off testing doesn’t give long-term confidence. There can be edge cases : a wrongly-typed schema name, an intermittent network failure, or an unexpected configuration change. And even outside of failure modes, we humans don’t trust one-off tests immediately – most folks are suspicious and might think an engineer was “just lucky” that day OR misinterpreted/misrepresented their tests.

    But we can’t test every single subdomain for a source – for a plenitude of reasons. Such tests consume significant blocks of time ( more than an hour ). If the test fails, we would not only (a) have to re-execute it multiple times, but also (b) waste a number of dev cycles on each re-execution and failure diagnosis.

    Hmmm … can we establish an in-between goldilocks zone? If you recall your statistics classes, a sufficiently sized random sample, drawn from a population, can lend credence that a feature works. Out of a population of 1000, if I select, say, 25 members and demonstrate feature correctness, I establish enough credibility.

    Fortuity had it that the following were available :

    a. Feature testing could be scoped at the data source level and at the subdomain level ( the table / the column )
    b. There are three testing environments : local, pre-production, and production.
    c. Engineers can execute their tests faster, and with more control, in the local and pre-prod environments.

    Alright, let me circle back to my team’s four data sources. We chose a handful – five subdomains per source – for one-off test execution. This gives us a sample size of 20 unique subdomains ( four sources * <number_domains_per_source> ). Twenty passing members give us confidence that the feature “does what it does”. Which led us to a final testing strategy, doable within an hour by a single developer – executing 4 subdomain one-off tests in a local environment and 20 subdomain one-off tests in a pre-production environment.
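
    For illustration, here’s a tiny sketch of the sampling step – the source and subdomain names are invented, but the shape of the test plan matches what we did :

    ```python
    # Minimal sketch of the sampling idea: pick a handful of subdomains per source
    # for one-off testing. Source and subdomain names here are invented.
    import random

    subdomains_by_source = {
        "postgres":  [f"pg_table_{i}" for i in range(40)],
        "db2":       [f"db2_table_{i}" for i in range(25)],
        "sqlserver": [f"mssql_table_{i}" for i in range(60)],
        "snowflake": [f"sf_table_{i}" for i in range(35)],
    }

    SAMPLE_SIZE = 5  # five subdomains per source -> 20 one-off tests in total

    random.seed(42)  # reproducible sample so test runs can be compared
    test_plan = {
        source: random.sample(subdomains, k=SAMPLE_SIZE)
        for source, subdomains in subdomains_by_source.items()
    }

    for source, picks in test_plan.items():
        print(source, picks)
    ```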

    1. Credit to discussions with other engineers for conjuring up an optimal testing strategy – some ideas were mine, others were theirs, and together they converged on a good idea at the end. Credit where credit is due ( names redacted for journalistic integrity ) ↩︎
  • Data Engineering – Key Considerations for Onboarding Data Sources

    Hi all,

    I was motivated to think about this design question from a recently-encountered workplace scenario.

    Recently, my team and I have been developing capabilities to scan four different flavors of databases – SQLServer, PostgresDB, DB2, and Snowflake. For each data source, we needed to scan across granularity levels – databases, schemas, tables, and columns.

    Which got me thinking – if someone needs to onboard a database for an enterprise process – ETL pipelines, storage, or querying – what factors should they consider? Do they need to impose a strict onboarding order? What should they communicate to their PMs and their managers when they need to justify the priority of one source over another?

    The Overarching Questions

    1. Capabilities check – do I already have built-out capabilities – full or partial – for onboarding? It’s faster to build out solutions for partially onboarded data sources, even if those sources are complex.
    2. Data type – what’s the data type? Operating with structured, textual-only data ( typical of SQL databases ) is easier than operating with semi-structured or unstructured data ( e.g. audio and images ).
    3. Data volume – how much data do I need to work with? Am I dealing with a single 1 GB ( gigabyte ) table? Or do I need to execute scans in the PB/EB ( petabyte/exabyte ) range across multiple disks?
    4. Customer urgency – which customer needs a solution met quickest? Onboarding data source one might take only two weeks, BUT, customers may need data source two ASAP.
    5. Low-hanging fruit – on the other extreme, which sources are the quickest to onboard? Can we deliver an MVP with a single source?
    6. Access/permissions barriers – some sources are harder to onboard for more bureaucratic reasons – I need to contact and communicate with another team to retrieve the appropriate levels of access. Reasons can vary from enterprise security to legal & compliance.
    7. Preprocessing steps – some data is easier to operate with than others. In some cases, I can execute an immediate read, and in other cases, I need to read and apply a series of pre-processing steps – transformations, validations, and input sanitizations. The extra steps introduce upstream complexity and delay feature delivery.

    Give me an onboarding example.

    Sure.

    Let’s suppose I need to onboard three data sources – SQL Server ( SQL ), MongoDB ( NoSQL ), and Snowflake ( analytical ). For each source onboarded, I need to scan for sensitive elements, such as credit card numbers. And let’s also suppose the following volume metrics :

    Data Source | Volume ( in Terabytes ) | Data Types
    SQL Server  | 10 TB                   | Structured ( Textual )
    MongoDB     | 100 TB                  | Unstructured ( BLOB )
    Snowflake   | 1000 TB                 | Structured ( Numerical )
    Figure 1 – a very simplified “factors analysis” view

    If we went strictly off of volume ( smallest first ), I’d onboard in the order { SQLServer, MongoDB, Snowflake }. But there’s another factor – onboarding MongoDB is more challenging because it stores unstructured BLOB data, while the SQLServer data is structured and the Snowflake data is all numerical ( it’s an analytical DB ). In this case, I’d justify altering the onboarding order to { SQLServer, Snowflake, MongoDB } – a new precedence hierarchy that accounts first for data type complexity and then for data volume.
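
    Here’s a small sketch of that precedence hierarchy as a sort key – data type complexity first, then volume; the complexity ranks are my own rough labels :

    ```python
    # Sketch of the precedence hierarchy from Figure 1: order sources by data-type
    # complexity first, then by volume. The complexity ranks are rough labels.
    sources = [
        {"name": "SQL Server", "volume_tb": 10,   "data_type": "structured"},
        {"name": "MongoDB",    "volume_tb": 100,  "data_type": "unstructured"},
        {"name": "Snowflake",  "volume_tb": 1000, "data_type": "structured"},
    ]

    COMPLEXITY_RANK = {"structured": 0, "semi-structured": 1, "unstructured": 2}

    onboarding_order = sorted(
        sources,
        key=lambda s: (COMPLEXITY_RANK[s["data_type"]], s["volume_tb"]),
    )

    print([s["name"] for s in onboarding_order])
    # -> ['SQL Server', 'Snowflake', 'MongoDB']
    ```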

  • Data Engineering – Effective SQL Schema Design : Manage Enterprise Datasets

    “What did I just get myself into”?

    Hi all,

    I quickly want to write about how to model and track datasets within enterprise organizations. It’s a frequent ask of engineers – especially data engineers – to figure out effective schemas to model the multitude of variegated data abstractions floating – managed and unmanaged – across companies.

    But what makes this challenging? Isn’t it sufficient to build out a single SQL table that captures all pieces of relevant information ( e.g. the owners, the creators, the line of business, or the datasets themselves )?

    Not exactly. There’s a lot of complexity, and challenges to pre-empt, around scalability, permissions, data evolvability, schema changes, or even new data sources getting introduced. There’s no one-size-fits-all solution. In 2025, datasets come in a multitude of forms : Excel spreadsheets, large BLOBs, or Kafka queues. There’s also the problem of scalability – the volume of data builds up the longer an enterprise organization exists and the more capabilities a company builds out. Then throw in product evolution – how do we build an effective model today that can withstand unexpected changes in product scope and stakeholder requirements over the next 10 years? And what about testing? Can we even “battle-test” these models? What are the edge-case scenarios I need to pre-empt ( e.g. building indexes for fast lookups ahead of time, or reducing table dimensionality to fit data into fewer stores of external memory )? Do I also need to build for a future inventory, if years down the road someone needs to identify and catalog all data assets ( existing and removed )?

    But back to the focus. In this story, senior developer Natalie is given ambiguous customer requirements; she needs to quickly conjure up an MVP – minimal viable product – for a feature that needs to be shipped and delivered to an end customer within a quarter-long scope. The feature requires customers to query the underlying datasets and answer commonly-recurring questions, such as :

    • When was the last time a dataset was created? When was it last modified?
    • Who created a dataset? Who owns a dataset?
    • What team or what organizational unit ( OU ) does a dataset belong to?
    • How large are the datasets? What datasets consume the most external disk space?
    • Can I introduce filters for efficient searching? Should it be a singular filter ( e.g. a categorical type ) or multiple filters ( e.g. tags ) ?
    • If I’m executing searches, should they be exact searches ( e.g. match a tag verbatim ) or fuzzy searches ( e.g. prefix matching )?

    What would Natalie design, and why? And moreover, can she justify and explain her rationale to her team members if other senior engineers – like Emilio or Vinay – engage in collaborative back-and-forth Q&A?

    What Database Solution would you use?

    I’m using SQL tables for modeling since :
    (A) The data is naturally relational.
    (B) The schema is mostly fixed – it won’t need frequent adjustments.
    (C) The information is mostly structured – there’s minimal unstructured elements ( e.g. large BLOBs, images, or videos )
    (D) I expect external personas – data analysts or business analysts – to execute ad-hoc queries with SQL or other relational languages they’re familiar with.
    (E) The data’s natural relationships also engender a natural hierarchy – it’s easy to build a “visual tree” of relations.
    (F) SQL is easy to understand for most developers, and old solutions continue to work effectively in modern-day settings. Shipping and iterating quickly usually takes precedence over modern solutions – even those that are “more performant” or “more efficient”.

    Alright, let’s get to building – but wait, what am I building?

    But let’s take a step back – before we dive into schemas and models, think about the actors. Who and what are we modeling? Are we modeling people? Places? Things? Tangible products ( e.g. e-commerce products )? Or logical abstractions ( e.g. a sense of ownership or a sense of creation )? And how are these abstractions related to each other? Does a person have only one sense of ownership, or can they have multiple ( it’s typically one )?

    There are five primary entities – datasets, creators, owners, permissions, and organizational units ( OUs ) – with the following cardinality relationships :

    Datasets : Creators – 1:1
    Datasets : Owners – 1:1
    Creators : Permissions – 1:1
    Owners : Permissions – 1:1
    Creators : OU – 1:1
    Owners : OU – 1:1

    Datasets – created to capture metadata about enterprise data abstractions ( e.g. last created/last modified at ). This is the “bread-and-butter” table.
    Creators and owners – in any organization, somebody created a dataset and somebody owns it. While most of the time the creator is the owner, there are many cases where the long-term owner is a distinct party. Think of a situation wherein downstream ML model developers ask upstream data engineers to create datasets for ML model testing, BUT it’s the developers who maintain and own those datasets. And there can be multiple owners and transfers taking place over the longevity of the data. Thus the usefulness of SOC – separation of concerns.
    Permissions – what are creators and owners allowed to do? Can owners read, write, or delete data ( unrestricted, rwx-esque UNIX permissions )? Or, as in most cases, are owners allowed to only READ data?
    OU – the organizational unit. Creators and owners are typically employees of a company; they belong to a team in a larger organizational structure, and they report to a direct manager. What if I need to know the teams or the orgs that own data? Introducing this layer of information – even if used in only a few query patterns – is useful.

    What would schemas resemble? Why are schema choices made?

    That’s an excellent question!

    My goal is to create tables that have the right number of columns – not too few, not too many. The goldilocks zone. If I start noticing a surplus of columns, I’ll introduce PK-FK ( primary key – foreign key ) relationships and data normalization1; this should effectively eliminate redundancy and help us build out solutions with a more customizable, granular view of enterprise information. ( A runnable DDL sketch follows the schema listing below. )


    Datasets : [ PK ( INT ), Dataset Name ( STRING ), Created_Time ( TIMESTAMP ), Updated_Time ( TIMESTAMP ), Size in KB ( INT ), Description ( VARCHAR(200) ), Creator ( FK ) ( INT ), Owner ( FK ) ( INT ) ]
    Creators : [ PK ( Creator ID ) ( INT ), First Name ( VARCHAR ), Last Name ( VARCHAR ), Created_At ( TIMESTAMP ), Permissions ( FK ) ( INT ), OU ( FK ) ( INT ) ]
    Owners : [ PK ( Owner ID ) ( INT ), First Name ( VARCHAR ), Last Name ( VARCHAR ), Created_At ( TIMESTAMP ), Permissions ( FK ) ( INT ), OU ( FK ) ( INT ) ]
    Permissions : [ PK ( Permission ID ) ( INT ), Can_Read ( BOOLEAN ), Can_Write ( BOOLEAN ), Can_Delete ( BOOLEAN ) ]
    Organizational Units ( OU ) : [ PK ( OU ID ) ( INT ), Team Name ( VARCHAR ), Org Name ( VARCHAR ), Direct Manager ( FK ) ( INT ) ]
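
    For concreteness, here’s that rough DDL sketch of the two central tables, run against sqlite3 so it’s standalone; exact types and column names would differ in a real SQL Server or PostgresDB deployment :

    ```python
    # Rough DDL sketch for the Datasets and Creators tables, using sqlite3 so the
    # snippet runs standalone; production types and names would differ.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Creators (
            CreatorId     INTEGER PRIMARY KEY,
            FirstName     VARCHAR(100),
            LastName      VARCHAR(100),
            CreatedAt     TIMESTAMP,
            PermissionId  INTEGER,          -- FK -> Permissions
            OuId          INTEGER           -- FK -> OrganizationalUnits
        );

        CREATE TABLE Datasets (
            DatasetId     INTEGER PRIMARY KEY,
            DatasetName   VARCHAR(200),
            CreatedTime   TIMESTAMP,
            UpdatedTime   TIMESTAMP,
            SizeKb        INTEGER,
            Description   VARCHAR(200),
            CreatorId     INTEGER REFERENCES Creators(CreatorId),
            OwnerId       INTEGER            -- FK -> Owners ( table omitted here )
        );
        """
    )
    print("tables created:", [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")])
    ```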

    Optimizing targeted data types for columns.

    Why use TIMESTAMP instead of an INT or STRING, and VARCHAR(200) in place of unbounded strings?
    a. Use VARCHAR(200) – or VARCHAR(<insert_fixed_limit_size>) – over STRING; this programmatically enforces constraints and safety on inputs when there are known bounds on input sizes ( e.g. a standard tweet on twitter.com was capped at 140 characters ).
    b. Leverage TIMESTAMP over STRING when dealing with Unix-stored time. Unix-stored time is interpretable across more formats than human-readable dates ( e.g. DD/MM/YYYY or MM/DD/YYYY ). TIMESTAMPs also save on disk space and can quickly be ordered – monotonically increasing or decreasing.

    Examples of what datasets look like ( I promise I’ll populate them later )

    Schema 1 : Datasets

    PK | Dataset_Name ( STRING ) | Created_Time ( TIMESTAMP ) | Updated_Time ( TIMESTAMP ) | Size ( KB ) ( INT ) | Description ( STRING ) | Creator ( FK ) ( INT ) | Owner ( FK ) ( INT )
    1
    2
    3

    Schema 2 : Creators

    PK ( Creator ID ) | Creator First Name ( STRING ) | Creator Last Name ( STRING ) | Created_At ( TIMESTAMP ) | Permissions ( FK ) ( INT ) | Organizational Unit ( FK ) ( INT )
    1
    2
    3

    Schema 3 : Owners

    PK ( Owner ID ) | Owner First Name | Owner Last Name | Created_At | Permissions ( FK ) | Organizational Unit ( FK )
    1
    2
    3

    Schema 4 : Permissions

    PK ( Permission ID ) | Can_Read | Can_Write | Can_Delete
    1
    2
    3

    Schema 5 : Organizational Unit ( OU )

    PK ( OU ID ) | Team Name | Org Name | Direct Manager ( FK )
    1
    2
    3
    1. Normalization will incur performance costs for complex queries involving JOIN(…) statements across multiple tables, BUT, for the purposes of building out a model that’s flexible and maintainable, I’m biasing towards the strategy. Normalization typically enables highly-scalable solutions ↩︎
  • Data Engineering – Scanning & Exporting Data – How to Onboard your sources

    Why does this work matter? How does it impact stakeholders or end users?

    Hi all,

    I’m writing this blog post since I’ve been working extensively on a data engineering-esque feature at work1, and I think it’d be awesome to explain what I’m up to.

    Most companies – from large tech to old-school enterprise organizations – hire for data engineering talent : talent that builds out the technical solutions for a company’s data vertical. These engineers focus on problems involving data controls & standards, governance, and compliance. Data engineers are a hidden “security layer” for companies, shouldering an umbrella of corporate responsibilities. Their work involves identifying data sources, scanning those sources, and then exporting their findings to end users. Each step entails R&D-esque work or feature-ticket work, and the “cream of the crop” of engineers ask really good clarifying questions that reduce the level of effort of tasks and enable a long-lasting solution in production with minimal issues.

    Why Identifying PII/SDEs Matters

    (a) Ensures that customer records are kept compliant and up-to-date with legal requirements – enforced across different levels of government ( federal, state, local )
    (b) Pre-empts capital losses incurred from data breaches
    (c) Engenders stronger customer base trust in corporate applications

    Let’s dig deeper into the complexity and the types of questions good engineers should ask!

    Identify the data sources

    1. Are the sources within organizational scope? If so, are they internal or external?
    2. How can I find out my sources?
      • Identify from configuration files ( e.g. .yaml files )
      • Look into inventories – made available on internal websites or APIs
    3. Do I need to meet with other teams?
      • Can other teams identify the inventory of data sources?
      • Can we delegate maintaining the inventory to another team?

    Scan the data sources

    1. Do we need to scan data sources? Can we identify the path of least resistance?
      • The sources may already be encrypted
      • The sources may be unencrypted BUT already meet enterprise requirements.
    2. If we scan the data sources, what do we look for?
      • Comprehensive scanning – every record, across all databases?
      • Targeted subsets of tables and columns?
    3. What data elements do we need to scan for? ( a regex-based sketch follows this list )
      • Conventional PII2 ( e.g. passwords and SSNs )?
      • Custom SDEs3 ( e.g. determined by a classification ML model )?
    4. How often do data sources need to be scanned?
      • Do I scan on a historical basis? Once every six months ( for all records )?
      • Do I execute continuous scans on a periodic basis ( e.g. hourly runs or nightly runs )?
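
    As promised above, here’s a toy sketch of the scanning step – a regex pass over column values for conventional PII; the patterns and sample rows are illustrative, not production-grade detectors :

    ```python
    # Toy sketch of scanning column values for conventional PII with regexes;
    # the patterns and sample rows are illustrative only.
    import re

    PII_PATTERNS = {
        "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    }

    def scan_rows(rows: list[dict]) -> list[tuple[int, str, str]]:
        """Return ( row index, column, element type ) for every suspected hit."""
        findings = []
        for i, row in enumerate(rows):
            for column, value in row.items():
                for element, pattern in PII_PATTERNS.items():
                    if isinstance(value, str) and pattern.search(value):
                        findings.append((i, column, element))
        return findings

    sample = [
        {"name": "Ada",  "note": "ssn on file: 123-45-6789"},
        {"name": "Bill", "note": "card 4111 1111 1111 1111"},
    ]
    print(scan_rows(sample))
    ```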

    Export and Deliver the Findings

    1. How should findings be communicated?
      • Which party needs the finding? Are they internal or external auditors?
      • Is the easiest strategy – outputting to a CSV/Excel-esque dump – satisfactory?
      • Do we need a ( simple or complex ) UI to enable real-time updates of identified elements?
    2. Retention of findings?
      • Can we remove our findings after a time window ( e.g. a month or a year )?
      • Can we compress and archive findings?

    What was my feature deliverable ?

    Figuring out the data sources was mostly “done for me”, but I had to develop the other two steps of the feature ( and its capability ) for a new data source ( most of the feature had already been developed, but for pre-existing sources ) :
    a. Scan multiple data sources and databases for PII and SDEs
    b. Export the findings to an Enterprise-specific layer for external personas to execute an audit.

    What are the data sources you scanned?

    My feature works across four major databases – SQLServer, Snowflake, DB2, and PostgresDB. For each data source onboarded, I operated at the granularity of database.schema.table.column, following enterprise/team conventions. The scans ( so far ) have been historical rather than continuous – once a source had been scanned, it didn’t have to be scanned again to facilitate export.

    How to Feature Test Across Environments?

    1. Local environments
    2. Non-production environments
      • Sandbox envs
      • Other envs
    3. Production environments

    Testing types :
    1. One-off tests
    2. Comprehensive tests – all data sources

    Assessing Feature Complexity

    1. Metrics collection – build out dashboards, reports, or visualizations showing the volume of data under operation – scan and export. Capture metrics ( a small grouping sketch follows this list ) :
      • Across fuzzy/exact text search ( e.g. all databases or database.schemas with names starting with <insert_prefix_here>* )
      • Across logical granularities with tag-based filters : at the database, database.schema, database.schema.table, and database.schema.table.column granularities
      • Across time windows ( for continual scans ) – on a periodic basis fixed by hour / day / week / month.
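
    Here’s that small grouping sketch – counting scanned assets by a prefix filter and by logical granularity; the record format is invented :

    ```python
    # Small sketch of the metrics-collection idea: count scanned assets by prefix
    # filter and by logical granularity. The record format here is invented.
    from collections import Counter

    scan_records = [
        {"path": "sales.public.orders.card_number", "granularity": "column"},
        {"path": "sales.public.orders",             "granularity": "table"},
        {"path": "hr.people.employees.ssn",         "granularity": "column"},
    ]

    PREFIX = "sales"
    matching = [r for r in scan_records if r["path"].startswith(PREFIX)]

    by_granularity = Counter(r["granularity"] for r in scan_records)

    print(f"{len(matching)} assets match prefix '{PREFIX}*'")
    print("volume by granularity:", dict(by_granularity))
    ```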


    1. and I’ll make sure to redact sensitive information along the way 🙂 ↩︎
    2. Personally Identifiable Information ↩︎
    3. Sensitive Data Elements ↩︎
  • SYSTEMS – Building Microservices for Enterprise Architectures – Why is this so hard?

    “Why is this so hard?”

    Hi all,

    I want to step back into the realm of enterprise architecture and talk about business workflows – specifically, the challenges I’ve noticed.

    (1) Pre-empting microservices from becoming a tangled cobweb

    A business workflow is more complex than a few lines of code – it’s a singular function ( or multiple functions ) within which we make API service calls. The workflows are the conductors who orchestrate a business; the microservices – the musicians – execute a plenitude of supporting business operations : REST operations, data mutation, statistical analysis, or security measures. The intrinsic complexity of workflows and service calls naturally lends itself to a collection of microservices – I once used ten microservices ( back at Capital One ) to process customer assets.

    For a singular workflow, managing the undergirding microservices is relatively straightforward. But the number and complexity of workflows expand in lockstep with overarching product complexity. Suddenly, a developer gets violently thrust into a land where each workflow requires a number of services – some similar to others, and some drastically different.

    So the question is – how do we avoid building out too many microservices? Can we make the architecture as clean as possible? OK, there are a few things to think about before we build them :

    What should I ask ( before I write )

    1. Do we need to create a microservice at all? The best microservice is no microservice.
    2. Can we create microservices that respect the SRP – the single responsibility principle?
    3. Should we create microservices for singular, standalone functions? Do we need to create a new module in case we add further functionality in the future?
    4. Is it better to create separate microservices – one for each workflow – or a shared microservice with workflow-tailored functions or conditional logic?
    5. Can I get rid of an existing microservice? Let’s remove what’s orphaned or unneeded.
    6. Can I make an existing service better and more comprehensive?

  • PRODUCT SENSE – Essential Metrics To Evaluate Enterprise Platform Success

    “When a measure becomes a target, it ceases to be a good measure” – British economist Charles Goodhart

    Correlation is not causation

    I’ve had to work with a couple of enterprise platforms in the past – as well as do some studying and reading of other blog posts on the side. Even though my day-to-day role is software engineering, I’ve encountered these topics tangentially via 1:1 chats with product managers or in my study of system design. In doing so, I’ve observed a raft of metrics to definitely “be on the lookout for“. Let me review some of them.

    These metrics matter. They matter because they help us think about how to evaluate a platform’s success. It’s easy to throw platforms and dashboards anywhere, plop in a single number, and assert that “a growth in X over multiple quarters or days clearly indicates a successful product”.

    Correlation is Not Causation

    But metrics are far more nuanced. Usually, the more you collect, the better ( until you run into the law of diminishing returns ).

    Commonly Built-Out Enterprise Platforms:

    1. Internal Data Governance, Controls, and Privacy
    2. Data Management
    3. Real-time advertising and ads analytics
    4. High-volume end-to-end financial payments systems

    A Metrics List

    1. Number of personas onboarded
      • Teams
      • Customers/Stakeholders
      • End users
    2. Number of use cases onboarded
      • Business workflows
      • Business use cases
    3. Advertising
      • Campaign effectiveness : in totality, across segment groups and business domains
      • Profit associated per ad
    4. Click rates
      • Click number reduction on webpage drill down ( e.g. 7 clicks -> 4 clicks )
      • Increased click rates on ads
      • Increased click rates on E-commerce cart buttons ( e.g. check out or add to cart )
    5. Conversion rates
      • Number of customers who sign up ( per month/per year )
      • Number of customers retained after a sign up ( after a month/year )
    6. User base
      • Number of users
        • Number of active users ( DAU )
    7. Growth in AUM ( Assets Under Management )
      • Number of databases
      • Number of data sets
      • Financial records volume
    8. Session time
      • Decreasing session length
      • Increases in lengths ( in target areas – e.g. e-Commerce catalogs )
    9. Operations time
      • Decreasing time in “annoying” operations : sign-up, log in
      • Increase in rates of preferred operations : sharing of referral links
    10. Page Views
      • Number of views
      • Number of persistent views ( e.g. full video length or 5-minutes into a video )
      • Number of unique views ( per visitor ) ( per source )
  • TOTW/15 – Leverage “Plug-and-Play” Functions before onboarding your first customers

    There’s a storm brewing!!! … some old time sailor ( he’s right )

    Hi all!

    I want to introduce another good TOTW ( tip of the week ) on codebase architecture. This scenario mirrors one I encountered at work – I engaged in multiple back-and-forth conversations with other senior engineers on my team, so I think there’s sagacity in the imparted lessons.

    Alright, well, what’s the story?

    There’s a burgeoning internal data governance and controls platform at a company – internal teams ( the customers ) are onboarding onto this new platform. The reasons are aplenty – there are too many individualized solutions across teams, the solutions are tightly coupled, and they’re inefficient. The new controls platform intends to centralize a solution whose functionality is just as good – or better – than what pre-exists.

    Initially, the central platform starts out more like one of these individualized platforms; it’s tightly coupled to a dedicated team and its single business workflow ( which I’ll call t1_workflowType1 ).

    Freshly-minted senior engineer Bradbury, faced with some of his first design challenges, gets thrown “into the deep end”. The platform and the product will mature over the next couple of months. This means onboarding additional teams, with their own workflows, across multiple quarters. This means the codebase is slowly transitioning from a monolithic application to a microservice-based architecture, and Bradbury needs to figure out how to organize the services ahead of time.

    And it’s going to be challenging. Why?

    In theory, if all information is known, all invariants are established, and the systems are well-contained and small in scope, it’s easy to quickly brainstorm and conjure up a solution.

    But there’s a lot that Bradbury doesn’t know – and can only know if he collaborates and engages in discussions with multiple parties at his company. His product owners coordinate feature planning and roadmaps with each customer; his management and leadership communicate pressing deliverables and deadlines with the highest priority; his team’s engineering talent – staff, seniors, and juniors – knows the microservices, architectural patterns, and specific codebase sections better than he does. Bradbury needs to combine and synthesize knowledge captured across working silos into a cohesive and coherent understanding.

    What’s the first thing Bradbury should do? He should avoid coding and architecture, and first gather business requirements ( what are they? what’s most important? and in which order? ). Let’s analyze a couple of them.

    Business Requirements Ordering :

    a. Add t1_workflowType2 ( quarter_1 )
    b. Add t2_workflowType1, t2_workflowType2, and t2_workflowType3 ( quarter_1 )
    c. Add t3_workflowType1, t3_workflowType2 ( quarter_2 )
    d. Add t4_workflowType1, t4_workflowType2 , t4_workflowType3 ( quarter_2 )

    And there are future quarters with more unknowns ( what’s happening in quarters 3 and 4 of the same year? Do we know how many more teams and workflows we’re adding? ). We don’t know. The engineers don’t know. Product owners don’t. Leadership & management doesn’t know. Not due to a lack of intellectual capability, but rather because the future is hard to predict. A stakeholder who promised to onboard three workflows can circle back six months later and say that they want to onboard a subset – or gasp, even none – of them.

    To further add complexity, some workflows bear striking similarities, and others, major differences. Some workflows are so similar as to differ by a single if-else conditional ( e.g. t2_workflowType1 and t1_workflowType2 are pragmatically the same ). Other workflows implement completely different functionality – we literally need to write a new function to handle their edge-case scenarios.

    There are a lot of good, clarifying questions we need to ask ourselves.

    The Clarifying Questions

    • How do we best prepare for an unknown future?
    • How do we isolate failures and make it easy to introduce a logging posture to debug issues across teams and workflows?
    • How do we build solutions with plug-and-play customizability, so that if I need to quickly modify a workflow across teams – or single-team specific – I can do so with alacrity?
    • Can I identify the path of least resistance and minimal developmental effort?
    • What coding style would be the easiest for other engineers – not just myself – to read and to maintain?

    There’s a storm brewing.

    The naive approach :

    The naive approach is to introduce all workflow and team logic into the same function. If there’s a pre-existing function ( let’s name it t1_workflowType1() ), we can build out a rules-engine-esque module and stuff in nested if-else statements for the discrepancies encountered on each workflow type and team. This solution is valid in the case of an MVP ( minimal viable product ), where the workflows are mostly similar to one another and limited in number. But there are drawbacks once we encounter scalability concerns :

    The drawbacks

    • Collision of responsibilities – the same method handles too many responsibilities across teams and workflows.
    • Hard to feature test – writing tests is going to be hard; every time I write a test, I have to make sure it passes conditions across multiple layers of business abstractions.
    • Hard to maintain and to read – such a function would be ginormous. Multiple nested if-else or switch-case statements would pepper the code.

    Enter the world of “Plug-and-Play”

    Before engineer Bradbury sets forth his long-term vision, senior engineer Josh arranges a couple of design discussions. Josh shares his technical disagreements with Bradbury, using the visual aids of the product roadmap – created by his product owners in collaboration with leadership and management – to communicate his points effectively.

    Figure #1 – Visual aid example – notice teams and their workflows onboarding within – or across – quarters. The style is akin to a Gantt chart.

    Josh avoids disparaging Bradbury’s perspective. Instead, Josh takes Bradbury’s input seriously, because he recognizes that Bradbury may have been thinking too much about the short-term deliverable due to a lack of visibility into long-term planning roadmaps.

    The design discussions help – Bradbury shifts from disagreement to shared consensus; they both research plug-and-play solutions. The solution entails functional decomposition at the granularity of teams and workflows, and it facilitates quick onboarding and offboarding of business needs ( a registry-style sketch follows the list below ). But what scenarios could occur for each team ( in this case, the stakeholder )?

    1. Onboarding isn’t needed.
    2. Onboarding needs to be delayed ( by a few weeks or a few quarters )
    3. The order of onboarding of workflows needs to change
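
    Here’s that registry-style sketch of the plug-and-play idea – one small handler per ( team, workflow ) pair, so onboarding or offboarding is a one-line registration change; the names are invented, not the real platform’s code :

    ```python
    # Sketch of the "plug-and-play" decomposition: one small handler per
    # ( team, workflow ) pair, wired together through a registry, so onboarding
    # or offboarding a workflow is a one-line registration change.
    from typing import Callable, Dict, Tuple

    Handler = Callable[[dict], None]
    REGISTRY: Dict[Tuple[str, str], Handler] = {}


    def register(team: str, workflow: str):
        def wrap(handler: Handler) -> Handler:
            REGISTRY[(team, workflow)] = handler
            return handler
        return wrap


    @register("team1", "workflowType1")
    def t1_workflow_type1(payload: dict) -> None:
        print(f"t1 workflowType1 handling {payload}")


    @register("team2", "workflowType1")
    def t2_workflow_type1(payload: dict) -> None:
        t1_workflow_type1(payload)  # reuse shared logic where workflows are pragmatically the same


    def dispatch(team: str, workflow: str, payload: dict) -> None:
        handler = REGISTRY.get((team, workflow))
        if handler is None:
            raise KeyError(f"({team}, {workflow}) is not onboarded yet")
        handler(payload)


    dispatch("team2", "workflowType1", {"record_id": 42})
    ```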

    Benefits of “Plug-and-Play” Solutions

    1. Single Responsibility Principle – by decomposing teams and their workflows into separate functions, we can quickly isolate functionality at the finest level of granularity. This makes it easy to debug and understand the code.
    2. Testing ease – it’s not just coding we need to think about; I need to build unit tests and end-to-end tests for each business use case. Introducing isolation makes it easy for me to test and validate a team’s onboarding in a sequential manner ( e.g. I can assert that I have a subset of workflows – or 100% of workflows – working for teams 1 and 2, even if teams 3 and 4 are still onboarding ).
    3. Customization ease – suppose a team needs changes to be made to a workflow ( e.g. legal and compliance comes in and mandates the masking of PII ). In this case, I can quickly dive into a piece of code and introduce masking logic or encryption logic on payloads.
    4. Business requirements flexibility – suppose months later, a team needs to offboard some or all of its workflows. Or a team suddenly needs to onboard. Without having to make possibly destructive changes to a single method, we can quickly add or remove function calls to isolated pieces of code, introducing a couple of comments along the way. Or suppose two teams are pending approval for onboarding – I can prioritize and shift developmental efforts to support the current use cases.