harisrid Tech News

Spanning across many domains – data, systems, algorithms, and personal

  • INTERVIEWING – Writing Feedback For DS&A Interviews – CASE STUDY #1

    I’ve always wondered what it really looks like from the other side too 🙂

    An Introduction

    I’ve conducted mock algorithmic interviews on interviewing.io1, and I thought it’d be cool to share a quick example of what good interviewing feedback looks like.

    Personally, it takes me 5-10 minutes to write up feedback, because I want to be thorough and comprehensive and capture the right level of detail, so that external reviewers – an interviewing panel, a hiring manager, or a bar raiser – can review it offline.

    I’m sharing example feedback of a case study where my interviewee2 performed exceptionally. I think it’d be useful for both interviewees and interviewers – interviewees to understand what interviewers write up at the end, and interviewers to learn what they should write up.

    Why Solid Feedback Matters
    • Onus – The onus is on solid interviewers to set their candidates up for success. Good interviewers must build the best case and gather the most signal to support their candidates and evaluate them holistically – regardless of a candidate’s performance.
    • Audience – External evaluators who didn’t conduct the interview read feedback and make decisions.
    • Criticality – The feedback I write can really make or break a hire/no-hire decision.
    • Lack of Formal Training – there’s a dearth of formal training for technical interviewing.

    Alrighty then, let’s begin 🙂 !

    Strengths

    • TC is a fast coder.
    • TC is a concise coder.
    • TC demonstrated how to capture, store, and evolve computational state.
    • TC has fast typing and problem-visualizing skills – I easily followed their visual examples of a grid/array/tree/<insert_other_data_structure>.
    • TC’s code reached – or nearly reached – a working state. A few minutes of tweaking and adjustments were needed to get it to pass all test cases on Leetcode ( or other equivalent websites ).
    • TC needed hints or nudges here-and-there, but with the accompaniments, they worked their way towards an effective solution.
    • TC demonstrates technical persistence; they persist through a problem’s difficulty and they don’t give up.
    • TC’s Big-O time-space complexity reasoning was correct and matched the optimal solution.
    • TC shows a solid understanding of rules engines, for loops, indexing and off-by-one adjustments, pointers, and conditional expressions.
    • TC solved ( most or all of ) the problem within a reasonable time limit.
    • TC finished ahead of time – I asked open-ended extension questions demonstrating high-level skills, such as extending a logging posture and asking how they’d refactor code in a production setting. Kudos points 🙂 !

    Areas of Refinement

    I also like the term “areas for refinement” over “weaknesses”; I think it communicates more empathetically.

    • TC can focus more on edge-case scenario thinking, and can walk their code/rules engines through different scenarios.
    • TC can focus on leveraging a TDD approach, and mention the scenarios they want to think of ahead of time before immediately diving into code.
    • TC can think more about case decomposition.
    • TC can think about leveraging nested expressions/nested conditionals to evolve a rules engine.
    • TC can consider modularizing code ahead of time.
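
    To make the last three points concrete, here is a hedged sketch of what “case decomposition” and “modularizing” can look like in a toy rules engine. The rule names and record fields are invented for illustration and aren’t taken from any real interview problem.

```python
# Hypothetical sketch: evolving a flat conditional chain into a
# decomposed, modular rules engine. Field names are invented.

def classify_flat(record: dict) -> str:
    """Before: one long conditional chain; hard to extend per-case."""
    if record.get("ssn"):
        return "sensitive"
    if record.get("email") and record.get("dob"):
        return "sensitive"
    return "non-sensitive"

# After: each case is its own small, independently testable rule.
RULES = [
    ("has_ssn", lambda r: bool(r.get("ssn"))),
    ("email_plus_dob", lambda r: bool(r.get("email")) and bool(r.get("dob"))),
]

def classify_modular(record: dict) -> str:
    """Adding a new case now means appending one (name, predicate) pair."""
    for _name, rule in RULES:
        if rule(record):
            return "sensitive"
    return "non-sensitive"
```

    The modular version makes the cases explicit up front – exactly the kind of scenario-listing a TDD-ish approach encourages.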

    Leveling Determinations

    • Junior/Entry-Level or Mid-Level/2-3 YOE : Strong Hire.
    • Senior-Level/Staff+ Leveling with 5 YOE or more : Hire.
    1. Credit is due to interviewing.io for their template structure/headings here. ↩︎
    2. Credit to my mock interviewee for taking the time from her day – 1 hour and 10 minutes – to go through rigorous practice. I believe in journalistic integrity, and I’ve redacted all names and other Personally Identifiable Information. ↩︎
  • BEHAVIORAL ( Senior-Level Example ) – Can You Describe a Time You Had to Learn New Skills?

    There’s value in writing up what good examples look like at lower levels too 🙂

    Situation

    Alright, what’s the situation?

    I’m starting work on a brand new team at Geico, and I need to quickly ramp up on our codebase’s backend. I’m well-familiarized with standard backend development, using languages such as C++ or GoLang to code microservices, but I’m brand new to Python and Flask development. Given my tenure, I have strong confidence that I can upskill ( I’ve done so in the past ), but I need to quickly land pull requests and ship out features.

    Task

    I’m also tasked with two line items. Firstly, to identify areas of the codebase I can make contributions towards. And secondly, to ramp up on developing and shipping Web Apps with Python3 and Flask.

    And unlike prior roles, I have more competing goals – I have to quickly ramp up on our team’s systems and tooling, ship out POCs supporting a migration, set up my local developer environment, and navigate up-and-coming multi-quarter ambiguous problems.

    The multiple competing goals stress test my ability to identify the path of least execution and minimal resistance, leading me to take two actions.

    Action

    Firstly, I write myself stories and tasks to bolster my team’s unit test coverage. Because it’s easy to leverage GenAI capabilities, I can quickly churn out most of the code needed for unit testing and submit PRs for approval. And unlike source code, unit testing code is sufficiently isolated and doesn’t require deep systems understanding; I can avoid thinking about network calls, database connections, and external dependencies that could require setup help or permissions from others.
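
    A minimal sketch of the kind of isolated unit test described above – the “network call” is a hand-rolled fake, so the test needs no real endpoints, databases, or permissions. The function and client names here are invented for illustration.

```python
# Isolated unit test sketch: no network, no database, no external setup.

def fetch_active_ids(client) -> list:
    """Thin wrapper whose filtering logic we want to cover."""
    response = client.get("/ids")
    return [i for i in response if i > 0]

class FakeClient:
    """Stands in for the real HTTP client; returns a canned response."""
    def get(self, path):
        return [3, -1, 7]

def test_fetch_active_ids_filters_negatives():
    assert fetch_active_ids(FakeClient()) == [3, 7]
```
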

    Secondly, I look into making a mini side-project. Working with my team’s entire codebase, in a new framework or paradigm, is difficult; I have to navigate a vast number of files, configurations, and structures. On the other hand, coding a barebones application – akin to helloWorld programs or simple CRUD-esque HTTP servers sprinkled with a few endpoints in a routes.py – takes only a few hours, amortized across a few days, to spin up locally.
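
    For flavor, here is a hedched-down, barebones sketch of the sort of CRUD-esque side project described above – a couple of endpoints in a single file. The route names and the in-memory store are invented, and this assumes Flask is installed.

```python
# Barebones Flask sketch: two endpoints, an in-memory dict as the "database".
from flask import Flask, jsonify, request

app = Flask(__name__)
items = {}  # in-memory store instead of a real database

@app.route("/items", methods=["GET"])
def list_items():
    return jsonify(items)

@app.route("/items/<int:item_id>", methods=["PUT"])
def put_item(item_id):
    items[item_id] = request.get_json().get("value", "")
    return jsonify({"id": item_id, "value": items[item_id]})

# To run locally: app.run(debug=True), then poke it with curl or a browser.
```
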

    Results and Learnings

    By taking the steps above, I landed results – reviewed and submitted pull requests – in my first few weeks. Had I gone through more grueling routes, like submitting source code deltas and getting permissions or installations for API testing clients, I would have had to shift my focus away from delivery to operational work.

    I also built thorough understandings of my team’s key microservices and the calls we make to support our internal applications. And best of all, I upskilled on a brand new framework – Flask – and programming language – Python3.

    I strongly appreciated this experience, because it highlighted my ability to adapt to new situations and learn new skills : qualities highly coveted by our profession.

  • BEHAVIORAL ( Senior Level Example ) – What Role Do You Usually Take In a Team?

    And I guess a bit about how I run my meetings 🙂

    Hmmm this is a really good question. What’s my role?

    I think I wrote about this a while ago, but as a senior engineer, I used to envision my role entailing a “calling the shots” frame of reference. That my decision-making would be top-down, and that leadership meant making the majority of decisions.

    But I’ve transitioned away from this parochial, limited mindset.

    My role is to be a coordinator or an orchestrator – one who listens closely and carefully. I always ask myself the most important question : How do I best understand the backgrounds of individuals – or, for that matter, groups of people – and collaborate with them to achieve an organizational goal?

    It’s organizational thinking at its finest.

    My job is to work at a higher level and to figure out how to leverage the domain knowledge and expertise of others to tackle a problem. And I think it’s actually really fun; I find it tremendously rewarding. I like to joke about this in my head, but oftentimes, the position entails taking a more backseat role, as a passenger, and seeing how your collaborator drives, evolves, and problem-solves1 :-).

    So what do I like to do?

    Oftentimes, I schedule a 30-minute or 1-hour meeting block with a delineated agenda stating the discussion leaders and the material they’ll cover. Doing so enables me to pick up enough signals to steer planning.

    In other meetings, I listen more actively, take meticulous notes, and ask a few guiding questions. I’ve learnt that speaking less, and listening more, matters, because there’s a lot of information that one needs to parse in order to solve problems accurately.

    A Short Example

    Ok, well this just sounds ambiguous and up-in-the-air, so I’ll provide a short example.

    I’m on a recent project entailing Airflow DAG automation work to scan and then export sensitive data elements, and doing so requires working with three major components – scheduling Airflow DAGs, evolving backend API endpoints, and writing the correct queries for edge case scenarios. Yes, I’m experienced, but I’m also unfamiliar with all of the knowledge areas. Luckily, I’ve identified individuals – inside and outside my team – who know their domains, which means that I can schedule meetings – 1:1 or group – with the subject matter experts to establish the foundational context for creating or executing tasks 🙂 !

    I don’t know things perfectly, but, I do know how to get the ball rolling in what looks to be the right direction.

    The Role’s Challenges

    But what would you say makes this tricky? Everything you wrote seems obvious, doesn’t it?

    Not exactly.

    ( What are you asking for? ) – The first challenge is to know what I’m asking for – to know what’s needed. I think it’s the best skill, and it’s three-fold : ask the right question, provide the most accurate level of supporting context, and communicate your business needs to someone. Analogize it to an LLM and prompt engineering, or to Google searches – the better your ask, the better your response.

    ( Who should you rope in? ) – The second challenge involves identifying the correct parties to “rope in” so that they’re ramped up. Long-running efforts are naturally collaborative, and members need solid understandings of what’s going on to reduce feature time-to-delivery.

    ( How do you get alignment? ) – And finally, the last challenge – group alignment : getting buy-in/consensus and healthily resolving technical disagreements. Because frequently, we operate with imperfect solutions, and dissenting opinions permeate everywhere – especially from questions like “What’s our scope?” or “What should we build out?”

    Footnotes

    1. It’s fascinating to see how another brain in real-time works, isn’t it? ↩︎
  • BEHAVIORAL ( Senior Level Example ) – Tell Me A Time You Received Feedback.

    Writing a design document is just as satisfying as writing code – sometimes, even more so : a wizened engineer

    An Introduction

    No matter an engineer’s level, there’s always somewhere we can receive constructive feedback to hone and refine skills : architecture, coding, soft skills, or writing. In my situation, I’m a senior engineer, and I have my work cut out for me when it comes to writing good design documents.

    Alright, let me be your raconteur and narrate the story.

    The Situation

    What’s the situation? It’s Q2 of 2025, and I’m fast approaching the project deadline; I’m in the final stages of delivering the automated feedback loop project. I’m putting in the hard work to effectively solve an ongoing business problem : to immediately export discrepancies after scanning data records for sensitive data elements.

    The Tasks

    However, I’m tasked with more than just coding. I’m a project lead; that means I’m tasked to write up the design document and present the purpose, the project’s background, and the justifications for critical design decisions.

    Now, I do well at detailed communication tailored to engineers – I seldom overlook details. My style stems from my engineering background. I know that engineers love to dig deep technically, ask probing questions, and critically evaluate design choices.

    But I received constructive feedback from my senior staff engineer on my first rough drafts. He always excels at highlighting my strengths – he takes positive note of my content’s comprehensiveness. Afterwards, he shifts focus to areas of improvement; he emphasizes conciseness and clarity – writing “to-the-point”. He also recommends incorporating visuals1 of system diagrams, API request-response calls, and before-and-after UI overlays of records.

    I’m too mired in the details; I need to shift the focus solely to the high-level, big-picture, “30,000-foot view” lens. I also need to communicate to non-technical audiences who have little time – 5 to 10 minutes – to quickly read my documents. Doing so enables the audience to immediately grasp the why of things, without getting bogged down in the what or the how. Basically, why we solved a business problem.

    Yep. It turns out that I needed to refine my technical writing skills and revisit my work.

    The Actions

    So I’m off to work. I take a series of actions.

    To begin, I have to follow a templated functional requirements document ( FRD ); I’m adjusting and aligning my content to match targeted headings – the purpose, the interfaces to external systems, and data requirements. These actions require me to distill my paragraphs – to identify the pertinent and to eliminate the inessentials.

    I also have to draw up minimal system diagrams, using tools like draw.io, to convey the appropriate level of context of my systems. I add accompaniments – an acronyms legend, color-coded visuals indicating service ownership, and sequence flows with brief descriptions – to quickly ramp up my target audience on structures and interactions.

    The Result

    My senior staff engineer and I engage in several collaborative sessions, and in the end, I develop a concise, easy-to-understand design document. A previously lengthy 4-5 page document distills into a 2-3 page document ( excluding images ). My design document reads more like an easy-to-understand textbook crafted by a thoughtful professor.

    My Learnings

    I really enjoyed the process of writing up the design documents – they indicate a solid sense of ownership and understanding of the how, the what, and the why of a business problem I solved. I also appreciate how the documents assist future team members – newcomers can immediately consult the document to quickly answer questions and avoid efforts delving into complex or poorly organized code.

    Personally, I value the learning experience, because of the rarity of actionable writing-skills feedback in my profession. We have robust feedback loops for clean coding and clean architecture, but a scarcity for effective writing. Moving forward, I intend to carry such practices into the rest of my technical career and any future roles.

    1. As the saying goes, “A picture is worth a thousand words”. ↩︎
  • TOTW/20 – Leverage Checklists, Isolation, and Minimal Examples Before Final Deployments – Streamline Your DevOps; Save On Cycles.
    Why Do The Best Devs Follow Such Practices?

    Alright, let me take a scenario that I encountered at work, and show some of the gotchas and checklists that I conjure up so that I avoid spending too much time investigating issues that don’t actually need to be investigated.

    Recently, I had to deploy an Airflow DAG file executing two operations :

    • Task 1 – Execute endpoint #1 : a /GET call grabbing a list of IDs to post.
    • Task 2 – Execute endpoint #2 : a /POST call updating an internal application based on the IDs list.

    And in order to do so, I had to also ensure the following :

    • The “internal plumbing” works : my DAG can pass data between the two endpoints.
    • The endpoints are callable from Airflow’s remote environment, in a way akin to calls to a remote env from one’s machine.

    Hey, this is relatively straightforward. I just have to deploy a DAG with two endpoints, pass the data between them, and call it end-of-day?

    Yes … and no.

    See, it’s not just the code, but everything else around the code, that can break. And what we should do – as rockstar engineers – is to investigate what could fail before deploying the final products.

    Because we clearly want to avoid diving in, suddenly deploying DAGs, and proceeding to repeatedly modify them “on-the-fly”. Those pesky deployment build pipelines and release pipelines can consume 10-15 minutes on each execution – and if we mess up a DAG file configuration 10 times, this can cost us three-to-four hours of effort. Three-to-four hours of whack-a-mole-esque effort which could have been an hour instead.

    Because why go through the tedium and the uncertainty of a 15-minute build-and-release pipeline to isolate issues which we could spend 2-3 minutes quickly checking?

    The Minimal Examples Skeleton & Scaffolding

    Hmmm. Ok.

    First, let’s set up a skeleton scaffolding structure.

    This will be faster to deliver. We can deliver an intermediary product first – where we can confidently assert that (A) Mock calls work and (B) Data flow works – before the final product delivery ( which works with more complex datasets and edge case scenarios ).

    This grossly simplifies the end product, since skeletons are quick to convert.

    (A) Leverage minimal working examples – can I use input values which I know work – like a mock ID on the /POST call or a known timestamp in the /GET call? My /GET call could theoretically fetch 100+ records, and processing them can ( theoretically ) consume 10+ minutes. What about using a single ID instead?
    (B) Focus on Data Flow Verification – there’s not much of a point to the endpoints themselves if I can’t even pass data. Hmm … what if I can create a mock list, pass it from endpoint #1 to endpoint #2, and print SENT:data and RECEIVED:data at each stage 🙂 !
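
    Checks (A) and (B) can be sketched in plain Python before any real DAG is deployed – mock inputs and SENT/RECEIVED prints verify the handoff. The payload and stage names below are invented for illustration.

```python
# Plain-Python sketch of the data-flow check, outside Airflow entirely.

def stage_get_ids():
    """Stand-in for the /GET task : returns one known-good mock ID."""
    ids = ["mock-id-1"]  # a single ID, not the theoretical 100+ records
    print(f"SENT:{ids}")
    return ids

def stage_post_ids(ids):
    """Stand-in for the /POST task : just confirms the data arrived."""
    print(f"RECEIVED:{ids}")
    return len(ids)

# In the real DAG, Airflow would carry the return value between tasks
# ( e.g. via XCom ); here the two stages are wired directly.
stage_post_ids(stage_get_ids())
```

    Once the wiring is proven, swapping the stand-ins for real endpoint calls is a small, low-risk change.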

    Checklist Part A : In the Pipelines

    Secondly, let’s make sure our underlying infrastructure – the DevOps, the pipelines – work. It’s the first layer that can break – agnostic of the artifacts under deployment – as it always executes first.

    1. Avoid Artifact Collision – Ask on Slack and other channels if others are executing tests.
    2. The Correct Branch Is Deployed – do our environments even have the correct branch ( with the latest changes in it )? It’s frequently not the case.
    3. Check Pipeline Operational Status – can we deploy the target branch and verify that its artifacts build correctly and release correctly?
    Checklist Part B : In The APIs

    Thirdly, our code calls APIs, and it turns out that API calls are easy-to-isolate with modern day tooling : curl, testing clients, or even a local browser. What if we can assert those endpoints work on mock data or real data?

    1. Leverage API Client Testing – execute API calls in Insomnia or equivalent testing clients. Verify that the calls work in a local setting.
    2. Call /HEALTH – hey, we have a convenient /health endpoint. Let’s get the 200:OK message verifying that our services are up-and-running.
    3. Token checks ( AuthN ) – Check the API Bearer Tokens? Have they met their TTL expiry? If so, can I create the tokens again?
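
    The /health and token checks above can be scripted with the standard library alone – a hedged sketch, where the base URL and the token-expiry handling are assumptions to adapt to your own service.

```python
# Standard-library sketch of the /health check and a token TTL check.
import time
import urllib.request

def check_health(base_url, opener=urllib.request.urlopen):
    """Check 2 : expect 200 OK from the /health endpoint."""
    with opener(f"{base_url}/health", timeout=10) as resp:
        return resp.status == 200

def token_expired(expires_at_epoch):
    """Check 3 : has the bearer token met its TTL expiry?"""
    return time.time() >= expires_at_epoch
```

    The `opener` parameter is injected so the check itself is testable without a live service – the same isolation principle the checklist is preaching.
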
    Checklist Part C : In The Code

    And fourthly – following the above principles – we can use tools like human review, GitHub Copilot, and minimal programs to verify that the wrappers around API calls work.

    It turns out that we don’t need to wait several minutes and deploy the DAGs to check that our API call wrappers work, when instead, we can utilize utilities more at our disposal.

    Both we – as humans – and modern-day AI tooling ( e.g. AI coding assistants like GitHub Copilot ) can refactor our code ahead of time to include the following – robust error handling, a good logging posture, and bolstered readability.

    1. Check the Timeouts – is my request timeout too long ( > 60 seconds ), or too short ( <= 10 seconds )?
    2. Check Payload Structures – emulate testing client payload structures in code and verify.
    3. Check API Endpoint Env-and-Code Correctness – did I copy-paste the correct endpoints? For the correct environments ( e.g. LLE endpoints or ULE endpoints )? Did I write the correct code structure around endpoints?
    4. Introduce Try-Catch Exception/Error handling – can I introduce try-catch blocks and catch exceptions in a more refined logging posture?
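
    Checks 1-4 can come together in one small wrapper – an explicit timeout, a structured payload, and a try/except with a clearer logging posture. This is a hedged, standard-library sketch; the endpoint and payload shape are invented.

```python
# Sketch of an API-call wrapper covering timeout, payload, and error handling.
import json
import logging
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dag-wrapper")

def post_ids(endpoint, ids, timeout=30.0):
    payload = json.dumps({"ids": ids}).encode()      # check 2 : payload shape
    req = urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:  # check 1
            log.info("POST %s -> %s", endpoint, resp.status)
            return resp.status
    except Exception:
        log.exception("POST %s failed", endpoint)    # check 4 : caught + logged
        return None
```

    A wrapper like this fails loudly and legibly in local runs – before any 15-minute pipeline ever gets involved.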
    The Final Checklist

    In the end, we conjured up 12 mental checks – taking about one hour max – to execute before the final product release 🙂 .

    Steps that don’t just save time on one day, but across multiple days of feature work and of feature delivery 🙂 !

    • Leverage minimal working examples
    • Focus on Data Flow Verification
    • Avoid Artifact Collision
    • Is the correct branch deployed?
    • Check Pipeline Operational Status
    • Leverage API Client Testing
    • Call /HEALTH
    • Token checks ( AuthN )
    • Check the Timeouts
    • Check the Payload Structures
    • Check the API Endpoint Correctness
    • Introduce Try-Catch Exception/Error handling
    But How Do I Improve at Mental Checklisting?

    Like all skills, mental checklisting and edge-case scenario handling take time to practice, but they can be refined. Here are some tips.

    1. Think of failure modes – what are the things that could fail? Should we check network protocols? Should we check that the latest or the most correct software version was released?
    2. Don’t just think of code failures – we as developers are used to thinking of failures in code, but not as much outside of it. But what surrounds the code – the configurations – usually matters more.
    3. Trial & Error – the best way to get better at creating mental checklists is trial-and-error. With the plethora of components in play, it’s hard to think of every scenario to capture up front. We always learn better for the future. That’s how it goes.
  • BEHAVIORAL ( Senior Level Example ) – Tell Me a Recent Accomplishment That You Are Most Proud Of?

    A solid, seasoned engineer always takes passion in their past work and in their past deliverables.

    – Said By Yours Truly 🙂
    What Makes Someone Proud of their Projects?

    Or associated synonyms – pride, satisfaction, fulfillment, happiness, contentment.

    Good question. I’d imagine a couple of factors :

    • Appropriately Challenged – they took on challenges at the right rigor level – not too easy BUT not too difficult. In the Goldilocks zone.
    • End Customer / User Satisfaction – their customers tell them that their products helped them out – in tangible ways or in intangible ways.
    • External Praise Recognition – others tell them that they delivered a meaningful feature.
    • Good Collaborators – the best work usually happens in groups, and folks will often talk about how they liked working with their collaborators. They may discuss what they learned working with someone else or “jiving” with their working style.
    • Growth and Learning – the feeling that they learned a lot; they explored domains previously unexplored and they picked up new skills.
    • Internal Sense of Accomplishment – they delivered all ( or at least the majority ) of their work. They can reflect back and say to someone “I made X happen” or “I delivered X”.
    The Situation

    Alright, where do I even start the story? Hmmm, what’s the situation1

    It’s my first three months at Geico – Q1 2024 – and there’s a major deliverable under way : the Automation Feedback Loop project. It’s a complex, quarter-long-scope task – on every scan of a data source, we need to capture and classify sensitive data element records with respect to two crucial fields of interest – business attributes ( which I’ll call BA ) and tags. Afterwards, we need to expose those results to an internal data catalog.

    But sometimes, there’s a discrepancy – a mismatch between the scanner-suggested BA and tag values and those currently in the catalog. In this business scenario, we need a way to analyze those discrepancies and make a crucial determination – can we overwrite the catalog with the latest scanner-suggested values, or do we store the discrepancies locally and then notify our stakeholders – the data stewards and the data owners – to perform a human audit?

    The Tasks

    To deliver my goals, I had to identify and work on a few tasks. I needed to get my hands dirty in a new system. I had to ship out two major pieces of code – scalable, performant code that could scan our sources and read the BAs and the tags, and a rules engine module that could identify mismatches and the specific scenarios to surface to our end customers.

    The Actions

    So I got to work on a set of actions. I set up multiple meetings with involved parties in my engineering org, such as leadership or senior staff talent, to clarify business requirements and to simplify our rules engine so that we knew exactly which cases needed human audit. I also set up meetings to ask clarifying questions about the technical capabilities or technical challenges of our team. Questions like “Do we have a local or production database we could work with to store records?” or “Do we have to build out a separate UI to action on misclassifications, or could we leverage pre-existing UIs?” These questions mattered. They mattered because they helped the team avoid scope creep – we could downscope to an MVP whilst enabling other teams or future quarters to focus on other functionalities.

    The Results

    In the end, I delivered many results and products. I shipped a complete and working feature by the end of Q1 – the data stewards and the data owners could successfully action on their misclassifications. I also drew up the system diagrams, authored a design document, and wrote a simplified rules engine module. I really admire these results, because the work delivered persists, and I’m still building off of them, even a couple of months following the deliverable.

    My Learnings and My Takeaways

    All-in-all, the feedback loop was a greenfield project – I navigated ambiguous requirements and I worked on a system from the bottom up. I feel super proud of my ability to navigate its complexity. And there are so many good takeaways from the project.

    The work was very challenging. Firstly, I had to incorporate extensive peer feedback and iteratively refine the design document and the diagrams. I must have created at least five versions of our system diagrams in draw.io and multiple rough drafts of the design documents. But the sense of ownership and the ability to meet customer expectations by end of Q1 further solidified my belief in my engineering capabilities. Moreover, I really admire my collaborators : engineers, leadership, and product owners. I learned new thinking patterns and new ways of tackling problems that I never would have considered before.

    1. Inspired by a behavioral interview prep card AND the fact that it’s a commonly-asked question in behavioral interviews 🙂 ↩︎
  • BEHAVIORAL ( Senior Level Example ) – Walk me Through A Time That You Had to Lead/Delegate?

    Hi all,

    Today, I’ll review a challenging delegation scenario, where I coordinated multiple parties to tackle concurrent upcoming features.

    Situation

    It was the end of Q1 2024; Q2 was starting up, and my team and I had finalized the delivery of the first phase of the Feedback Loop project. The end-of-Q2 deadline to meet NYDFS compliance was fast coming up, and it entailed two major deliverables: scans & exports on historical data and automation scan work on incremental data. My team and I were struggling to make a concrete plan for tackling two concurrent feature demands.

    My Task

    My tasks, as a senior engineer, were laid out, crystal clear, in front of me. I had to chart out upcoming feature deliverables for the next couple of sprints and delegate the work to my team members. The delegation challenged me, for multiple reasons.

    Firstly, I had to clean up our team’s DevOps board; I not only had to write up new user stories and tasks, but I also had to modify and clean up existing ones. I did this to avoid duplicating future work and to avoid confusing my co-workers.

    Secondly, I usually operate at the high-level design of features – unlike my other colleagues, I’m not as deep in the technical woods of the codebases. This level of operation made it harder to fully understand the scope and the complexity of the items to delegate.

    And thirdly, I had to delegate to team members with different backgrounds and experience levels. This added to the challenge, since I delegated to folks more experienced and less experienced than me. The skill of best positioning your resources and your people strategically is difficult.

    Actions

    So in order to write up the delegation, I took a series of actions.

    My first action involved setting up dedicated one-on-one meetings with each member to understand what they were currently working on and what they thought they needed to be working on.

    Next, I created a summary-style bullet-point list identifying individual tasks and responsibilities for each party involved.

    For my senior staff engineer, I recognized that his responsibility should be more advisory, so I tasked him to provide assistance with high-level design, ideation, and feedback on design documents.

    For my junior engineer, I focused us on partitioning out individualized ownership of historical work and continuous work. He already came in with extensive background on historical scanning work, so I emphasized that he should prioritize wrapping up the coding, the testing, and the deployment for the remaining databases before diving into automation scan work.

    And lastly, for me, a senior engineer, I thought focusing on design & development for the initial proof-of-concepts for automation work made the most sense. The feature was new, and we weren’t sure about what design mechanisms or technologies we’d use to handle the ambiguous requirements – requirements that I knew I’d have to clarify.

    Result

    In the end, I wrote up a clear delegation with delineated roles & responsibilities understood by all parties involved. I also coordinated with my team’s product owner by cleaning up our board – I re-wrote, deleted, and created new user stories. Doing so shifted my collaborators, my product owners, and my leadership into better roadmap alignment. We knew exactly what we needed to execute and what we needed to prioritize. Reflecting back, I can strongly see how the frontloaded delegation work positively shaped the team’s ability to meet its deliverables by the deadlines and avoid stepping on each other’s feet.

  • Data Engineering – What’s Data Classification?

    Inspired by an Uber trip inbound to sunny San Francisco.

    An Introduction

    Hello everybody!

    Inspired by an Uber trip and the many conversations with my friends who ask me what I do for work, I thought about briefly going over my job and a couple of major line items :

    • What is Data Classification?
    • Why is Data Classification Needed?
    • What makes Data Classification Challenging?

    Alright, let’s begin! Tell me what you do!

    The Story – What is Data Classification?

    Since the dot-com era of the 1990s – with the advent of computing machines and the capability of collecting large volumes of data – companies, from big tech enterprises to non-big-tech organizations, like insurance and financial tech, have often run into this problem : What type of data are we collecting? There are business problems at hand, and to serve our end customers, we need to collect data.

    But before we do so, there’s a catch. We need to ask ourselves good questions. What are we collecting? Are we collecting user information – first names, dates of birth, street addresses? Are we collecting aggregate metrics – a sum or a running average of transactions executed on a given day? Or are we collecting metadata on files : the size, the file extension format, and the owners? Are we collecting the right data?

    Why Is Data Classification Needed?

    And more importantly, can we detect sensitive data elements ( which I’ll call SDEs ) and make sure that none of them get leaked outside the enterprise? Data like PII ( Personally Identifiable Information ) and SSNs ( Social Security Numbers ). Because if the data gets leaked, then we run into catastrophic problems – reputational risk, a loss of long-term customer trust, financial losses, and so on.

    This is where Data Classification teams come in! You can call them by multiple names – the guardians, the stewards, or the protectors – of data. They’re the team that scans the databases and the data sources for those SDEs and flags them when they’re detected. Yes, they’re an additional layer, but they’re also key to the defense in depth that organizations desire – without them, we’d run into far worse situations.

    So What Makes Classification Challenging?

    That’s a really good question! There are a number of reasons why data classification is harder than it looks.

    Firstly, the definitions of sensitive data constantly change and evolve in lockstep with technological changes. Data that wasn’t seen as sensitive years ago is today seen as sensitive, and data previously unobtainable in massive quantities – biomarkers and fingerprints – is more mainstream.

    Secondly, the classification problem itself is getting harder and harder. Traditionally, classification teams leveraged what we call rules engines to flag bad records. Those engines still operate, but only for well-defined data – structured data, like your phone numbers or your birthdays, which meet a limited set of values. But modern data – media, videos, audio – is unstructured or semi-structured ; it’s less well-defined. Consequently, we have to eschew the rules engines in favor of ML models, which can handle classification and detection of SDEs for a different input set. But unlike the rules engines, the ML models are imperfect – they generate predictions close to, but never exactly at, 100% accuracy. This means that we still need the code, the notifications, and the reliance on a human-in-the-loop to deliver the final verdict on the misclassifications – the false positives and the false negatives.
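    To make the rules-engine idea concrete, here’s a minimal sketch of regex-based SDE flagging over structured fields. The patterns and field names are hypothetical illustrations – a production engine would use validated, locale-aware patterns, checksums, and far more SDE types :

    ```python
    import re

    # Hypothetical rules for two well-defined SDE types.
    # A real rules engine would be far more robust than this.
    RULES = {
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    }

    def classify_record(record: dict) -> dict:
        """Return {field_name: [matched_sde_types]} for fields that look sensitive."""
        flags = {}
        for field, value in record.items():
            hits = [name for name, pattern in RULES.items()
                    if isinstance(value, str) and pattern.search(value)]
            if hits:
                flags[field] = hits
        return flags

    record = {"note": "call me at 415-555-0123", "tax_id": "123-45-6789", "amount": "42"}
    print(classify_record(record))  # → {'note': ['us_phone'], 'tax_id': ['ssn']}
    ```

    Notice that the engine only works because these formats meet a limited set of values – exactly the property that unstructured data lacks.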

    Finally, let’s throw in the cornucopia : the rest of the mixed bag – data volumes, execution speeds, multiple source types, the “under-the-hood” processing of catalogs or inventories, logging, telemetry, and designing the infrastructure and the stages for ETL pipelines. The work that surrounds and supports, but isn’t, actual classification. The work that still has to be done.

    And now you can truly see where the data engineering talent comes into a business’s grand picture.

    What Teams do Data Classification Folks work with?

    In my current role, the classification team works closely with two other system teams:

    (A) The ML Model team – they operate at the layer of conjuring up new, or extending pre-existing, ML classifiers, like XGBoosted decision trees ; they’re heavily involved in training, testing, and validating the latest ML models, as well as incorporating feedback from upstream teams like us to iteratively improve their latest model versions.
    (B) The ML Model QA Team – this is a middle layer in between the classification team and the ML model team. They conduct stress tests and make sure that models work in production by using containers and other virtual software to emulate production settings. They resolve issues regarding the configurations, the environments, the runtime dependencies, and the systems which surround the ML models.

  • TOTW/19 ( Part Two ) – The Road to Tackling Ambiguity – Case Study #2

    “The more examples you can see, the more clearly you can understand.”

    A Primer

    Let me share a second example – sometimes, it helps to see other task breakdowns to get a more holistic perspective. This breakdown entails a single day’s – 8 hours’ – worth of effort.

    Your Task

    Periodically execute two endpoints to scan records and to send records.

    Ok, seems easy, right?

    Well, not exactly. Especially for someone new to Airflow and setting up task-based DAGs. Which, even for a senior engineer, can be difficult.

    Where do I begin? Can I start out with simple POCs [ proof-of-concepts ]? It’s hard to get an estimate for how long this will take. Will it take me a day? A week? A few hours?

    Hmmmm, before we dive into code, let’s dive into a high-level breakdown. We can gather a couple of requirements and scope out well-defined units of work.

    The Task Knowns

    • We identify the periodic execution mechanism – Airflow DAGs – whose use is up-&-coming.
    • We know the first endpoint has to query a database.
    • The second endpoint has to send payloads, based on those queries.
    • We have the endpoints working locally and remotely in our lower-level environments ; they’ve been verified from our API testing clients and programmatically from code.

    Hmm – Lemme spend 10 minutes of my day and conjure up tasks!

    The Task Breakdowns

    • Task #1 ( A single print ) : Set up a “Hello World” Airflow DAG to print a single execution of “helloWorld” in our lower-level env.
    • Task #2 ( The recurring basis ) – Modify the DAG to print “helloWorld”, on a repeated, configurable basis of 5-15 seconds. Verify log files show execution.
    • Task #3 ( Endpoint #1 ) – Modify the DAG to execute the first endpoint periodically and print inputted records to log files.
    • Task #4 ( Endpoint 1 & 2 ) – Modify the DAG to execute the second endpoint as well, setting up the pipe from endpoint #1 code to endpoint #2 code. Verify plumbing works.
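    Airflow specifics aside, the Task #3/#4 plumbing – endpoint #1 feeding endpoint #2 – can be sketched in plain Python. Here, `scan_records` and `send_records` are hypothetical stand-ins for the two real endpoints, and the in-memory list stands in for the database :

    ```python
    # Hypothetical stand-ins for the two endpoints; in the real DAG these
    # would be calls made from Airflow tasks on the configured schedule.
    def scan_records(db: list) -> list:
        """Endpoint #1: query the 'database' for records not yet sent."""
        return [row for row in db if not row.get("sent")]

    def send_records(records: list) -> int:
        """Endpoint #2: send payloads built from the scanned records."""
        for row in records:
            row["sent"] = True  # pretend we POSTed a payload downstream
        return len(records)

    def run_once(db: list) -> int:
        """One scheduled tick: pipe endpoint #1's output into endpoint #2."""
        return send_records(scan_records(db))

    db = [{"id": 1}, {"id": 2, "sent": True}, {"id": 3}]
    print(run_once(db))  # → 2 ( sends the two unsent records )
    print(run_once(db))  # → 0 ( nothing left to send )
    ```

    In the actual deliverable, each function would become an Airflow task, and the 5-15 second cadence from Task #2 would come from the DAG’s schedule rather than manual calls to `run_once`.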

    Alright, can I quickly set up a time table for focused blocks of work? YES!

    The Task Time Table

    Task                              Estimated Time To Task Completion
    Task #1 : A single print          2 hours
    Task #2 : The recurring basis     1 hour
    Task #3 : Endpoint 1              3 hours
    Task #4 : Endpoint 1 & 2          3 hours
    Total                             9 hours [ buffered to 10 hours ]
    Figure #2 : In actuality, it’s around 2-3 business days of work for the deliverable.

    Woah. A seemingly insurmountable task, hard to estimate, is suddenly … easier to estimate?

    But the best part is the communication benefit. You just made communicating progress easier. Because it’s easier to confidently communicate “We’re making positive progress” when you’ve completed 1 of 4 or 3 of 4 tasks and can say to a product owner “We verified that we can periodically print output in an Airflow DAG and that deployments to a given environment work”. The sense of completion is so much better defined, in place of saying the feature is “not done” or “partially done”.

  • TOTW/19 ( Part One ) – The Road to Tackling Ambiguity is … task decomposition and avoiding rigid all-or-nothing thinking.

    The biggest struggle for a vast number of engineers isn’t their dearth of – their lack of – intellectual ability, but rather, their executive dysfunction : their brain’s disposition toward task avoidance and its weakened task-initiation and task-prioritization faculties. A.k.a. where do I even begin?

    Primer

    Hi all,

    Today, I wanna discuss another topic – task decomposition : how to break down large, unstructured, ambiguous tasks into smaller, well-defined, more concrete tasks.

    It’s a skill that I personally struggled with – and still struggle with – to this day, but, something that I think everyone can rapidly improve on.

    Task decomposition is a vital skill for engineers. It enables engineers to ship high-urgency, pressing deliverables and also holds us accountable to delivery. It helps engineers make good estimates of tasks and identify how long they can expect units of work to take. And lastly, it empowers developers to communicate better-defined notions of progress and productivity to their team, their leadership, and their product managers.

    The way to get better at this skill is to, well, practice it. And just like any other skill, some of us are naturally better, and others will struggle more.

    Can you share real-world case studies?

    I’m using a scenario I encountered at work as a good example. Let’s suppose a developer is tasked with the following :

    Task description : Invoke endpoint #2 on completion of calls to endpoint #1. Get code to a testable, deployable state by a DEADLINE.

    Well, how should we do it? There are so many ways! The task itself has multiple moving pieces – a few database tables, trigger code, two API endpoint calls, and a remote Titan cloud logging posture. It’s not a one-off execution where I write a simple program to call endpoint #2 each time endpoint #1 finishes. There’s “internal plumbing” : components which need to be wired up and connected to facilitate communication at the right place and the right time. Moreover, we make the calls only when data tier conditions and invariants are met.

    How do we expedite our testing capabilities? Triggering the upstream code to execute our automated call is difficult and time consuming.

    Where do I even begin? What’s the right place to start? Do I need to start with actual data – which I may need to announce as a major blocker – or can I use mocks and make adjustments to test for behavior?

    There’s no roadmap.

    There are no instruction steps like in academia.

    Well,

    That’s why we should reach into our mental skills toolbox, and think about three things : a sequence of steps, the minimal components needed, and the smallest stories.

    Steps – The First Draft

    Ok, so our engineer comes up with a first draft of steps, resembling the structure beneath :

    • Step #1 : Copy an existing, or create a new, database trigger and test that it works.
    • Step #2 : Call API endpoint #1 and verify our code receives the database trigger to call API endpoint #2.
    • Step #3 : Get the code working in lower level environments.

    This is a good ( initial ) set of steps. But alas, issues arise.

    Trigger Testing : Turns out that it takes forever to test a database trigger on calls to API endpoint #1 because payloads pass through a lot of internal plumbing ; the invariant check doesn’t happen until 5-10 minutes into the call. And we might make a couple of config changes and execute the tests 5-6 times during a debug session, leading to an hour ( or two ) spent on getting a working trigger. Isn’t there a faster way?

    Server-Side Receiving : In the original writing, we have to verify that the trigger is received, but how do we do this? Do we have a concrete way? Did we identify mechanisms? Did we identify copyable, modifiable code constructs? If not, can we use tools like ChatGPT or Google Search to find minimal structures? What’s the right tool for this use case?

    Alright, back to the drawing board. What do we do next, in version two of our steps?

    Steps – The Final Draft
    • Step #1 : Copy an existing, or create a new, database trigger, leveraging the database UI or tools. Inject a single mock record and use database functionality like LISTEN <target_channel_name> to verify the trigger is received.
    • Step #2 : Set up a server-side listener. Copy and make modifications to existing triggers or look online for examples. Verify that the listener can receive a single trigger notification with success.
    • Step #3 : Extend the listener to call API endpoint #2. Insert a single record of mock ( or real ) data and verify endpoint #2 is called.
    • Step #4 : Invoke API endpoint #1 – on mock or real data – and verify the internal plumbing – the database trigger, the payload, the listener, and call to API endpoint #2.
    • Step #5: Recreate these steps, from local environments, to lower level pre-prod envs.
    • Step #6 : Deploy and verify feature correctness in production.
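    To illustrate Steps #2 and #3, here’s a sketch of the server-side listener pattern. A `queue.Queue` stands in for the database’s notification channel ( e.g. Postgres LISTEN/NOTIFY ), and `call_endpoint_2` is a hypothetical stub for the real API call :

    ```python
    import json
    import queue

    # Stand-in for the database's notification channel; in production the
    # trigger would publish payloads via NOTIFY and we'd LISTEN on a channel.
    channel = queue.Queue()

    calls = []  # records every payload forwarded to endpoint #2

    def call_endpoint_2(payload: dict) -> None:
        """Hypothetical stub for the real API endpoint #2 call."""
        calls.append(payload)

    def notify(record: dict) -> None:
        """Step #1: the database trigger fires and publishes the record."""
        channel.put(json.dumps(record))

    def drain_listener() -> int:
        """Steps #2-#3: receive trigger notifications, call endpoint #2."""
        handled = 0
        while not channel.empty():
            payload = json.loads(channel.get())
            call_endpoint_2(payload)
            handled += 1
        return handled

    notify({"id": 42, "status": "ready"})  # inject a single mock record
    print(drain_listener())                # → 1 notification handled
    print(calls[0]["id"])                  # → 42
    ```

    The point of the mock channel is exactly the point of the revised steps : you can verify the listener-to-endpoint plumbing with a single injected record, without waiting on the slow upstream trigger path.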
    Why this breakdown is better!
    • Smaller-sized steps – tasks have been broken down into a better identified sequence of execution steps.
    • Specificity – specificity destroys ambiguity :-). We know exactly what we need to do AND we have alternatives or places to investigate if we’re stuck.
    • Identifiable deliverables and time units – because we’ve broken down tasks, we can quickly make time estimates for how long each step should take. I’ve written an example time table.
    The Time Table

    In this time table, I’ve assigned units of time to upcoming tasks. Engineers can use these feature estimates to share how long they expect a deliverable to take across multiple parties – leadership, product managers, and other engineers. The sum of time needed for the case study’s feature equals an idealized seven business days. But I wanna err on the side of caution and behave conservatively – I’m creating an adjustable buffer window and I’m throwing in an extra two days to account for personal time off or unexpected issues. Thus, our actual total time to delivery sums up to nine business days – which still leaves us under the standard 2-week Agile Sprint :-).

    Step      Time To Deliver
    #1        8 hours
    #2        8 hours
    #3        8 hours
    #4        4 hours
    #5        8 hours
    Total     36 hours
    These visuals are super useful to share with others. I remember tech leads who used to work with similar task breakdowns written up in their design documents :-).
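    The buffering arithmetic above – an idealized estimate plus a conservative cushion, checked against the sprint length – generalizes into a tiny helper. The day counts are the ones from this case study ; the function itself is just an illustration :

    ```python
    def buffered_estimate(ideal_days: int, buffer_days: int = 2,
                          sprint_days: int = 10) -> tuple:
        """Return the buffered estimate and whether it fits a standard
        2-week Agile Sprint ( 10 business days )."""
        total = ideal_days + buffer_days
        return total, total <= sprint_days

    total, fits = buffered_estimate(7)  # the case study's 7 idealized days
    print(total, fits)                  # → 9 True
    ```

    The same helper flags when a buffered estimate spills past the sprint boundary, which is exactly the conversation you want to have with a product owner before the sprint starts, not after.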