harisrid Tech News

Spanning across many domains – data, systems, algorithms, and personal

TOTW/20 – Leverage Checklists, Isolation, and Minimal Examples Before Final Deployments – Streamline Your DevOps; Save On Cycles.
Why Do The Best Devs Follow Such Practices?

Alright, let me take a scenario that I encountered at work and show some of the gotchas and checklists that I conjure up so that I avoid spending too much time investigating issues that don’t actually need to be investigated.

Recently, I had to deploy an Airflow DAG file executing two operations:

  • Task 1 – Execute endpoint #1: a GET call grabbing a list of IDs to post
  • Task 2 – Execute endpoint #2: a POST call updating an internal application based on that list of IDs.

In order to do so, I also had to ensure the following:

  • The “internal plumbing” works: my DAG can pass data between the two endpoints.
  • The endpoints are callable from Airflow’s remote environment, just as they would be from my own machine.

Hey, this is relatively straightforward, right? I just have to deploy a DAG with two endpoints, pass the data between them, and call it a day?

Yes … and no.

See, it’s not just the code that can break, but everything else around the code. And what we should do – as rockstar engineers – is investigate what could fail before deploying the final product.

Because we clearly want to avoid diving in, deploying DAGs, and then repeatedly modifying them on-the-fly. Those pesky build and release pipelines can consume 10-15 minutes on each execution – and if we mess up a DAG file configuration 10 times, that can cost us three-to-four hours of effort. Three-to-four hours of whack-a-mole-esque effort which could have been an hour instead.

Because why go through the tedium and uncertainty of a 15-minute build-and-release pipeline to isolate issues which we could check in 2-3 minutes?

The Minimal Examples Skeleton & Scaffolding

Hmmm. OK.

First, let’s set up a skeleton scaffolding structure.

This will be faster to deliver. We can ship an intermediary product first – where we can confidently assert that (A) mock calls work and (B) data flow works – before the final delivery ( which handles more complex datasets and edge-case scenarios ).

This greatly simplifies building the end product, since skeletons are quick to flesh out.

(A) Leverage minimal working examples – can I use input values which I know work, like a mock ID on the POST call or a known timestamp in the GET call? My GET call could theoretically fetch 100+ records, and processing them can ( theoretically ) consume 10+ minutes. What about using a single ID instead?
(B) Focus on Data Flow Verification – there’s not much point to the endpoints themselves if I can’t even pass data between them. Hmm … what if I create a mock list, pass it from endpoint #1 to endpoint #2, and print SENT:data and RECEIVED:data at each stage 🙂 !
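The skeleton above can be sketched as two plain-Python stand-ins before any Airflow is involved. The function names and the mock payload here are illustrative assumptions, not the real code; in the actual DAG the hand-off would go through Airflow XComs rather than a return value:

```python
# Minimal skeleton: two stand-in tasks that only prove the data hand-off.
# Names and the mock payload are illustrative; in the real DAG the hand-off
# would happen via Airflow XComs instead of a plain return value.

def fetch_ids():
    """Stand-in for Task 1 (the GET call) -- returns a known-good mock list."""
    mock_ids = ["id-001", "id-002"]
    print(f"SENT:{mock_ids}")
    return mock_ids

def post_updates(ids):
    """Stand-in for Task 2 (the POST call) -- just proves the data arrived."""
    print(f"RECEIVED:{ids}")
    return len(ids)

if __name__ == "__main__":
    # Wire them together: if this prints SENT/RECEIVED, the plumbing works.
    post_updates(fetch_ids())
```

Once SENT and RECEIVED line up here, converting the skeleton into real operators is mostly a matter of swapping the bodies, not the structure.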

Checklist Part A : In the Pipelines

Secondly, let’s make sure our underlying infrastructure – the DevOps, the pipelines – works. It’s the first layer that can break – agnostic of the artifacts under deployment – since it always executes first.

  1. Avoid Artifact Collision – ask on Slack and other channels whether others are executing tests.
  2. The Correct Branch Is Deployed – do our environments even have the correct branch ( with the latest changes in )? Frequently, they don’t.
  3. Check Pipeline Operational Status – can we deploy the target branch and verify that its artifacts build and release correctly?

Checklist Part B : In The APIs

Thirdly, our code calls APIs, and it turns out that API calls are easy to isolate with modern-day tooling: curl, testing clients, or even a local browser. What if we assert those endpoints work on mock data or real data before deploying?

  1. Leverage API Client Testing – execute API calls in Insomnia or equivalent testing clients. Verify that the calls work in a local setting.
  2. Call /HEALTH – hey, do we have a convenient /health endpoint? Let’s get the 200 OK response verifying that our services are up-and-running.
  3. Token checks ( AuthN ) – check the API bearer tokens: have they hit their TTL expiry? If so, can I regenerate them?
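The token TTL check can often be done locally, with no deploy and no API call. This sketch assumes the bearer tokens are JWTs (an assumption – if yours are opaque tokens, it doesn’t apply) and reads the `exp` claim with nothing but the standard library:

```python
# Hedged sketch: decode a JWT's payload locally and compare its `exp` claim
# to the clock. Assumes the bearer token is a JWT; no signature verification
# is done here -- this is purely a "has my token expired?" sanity check.
import base64
import json
import time

def token_expired(jwt_token, now=None):
    """Return True when the JWT's `exp` claim is in the past."""
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] <= (now if now is not None else time.time())
```

If this returns True, regenerate the token first instead of burning a 15-minute pipeline run to rediscover a 401.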

Checklist Part C : In The Code

And fourthly – following the above principles – we can use tools like human review, GitHub Copilot, and minimal programs to verify that the wrappers around our API calls work.

It turns out that we don’t need to wait several minutes and deploy the DAGs to check that our API call wrappers work – not when we have quicker utilities at our disposal.

Both we – as humans – and modern-day AI tooling ( e.g. coding assistants like GitHub Copilot ) can refactor our code ahead of time to include the following: robust error handling, a good logging posture, and bolstered readability.

  1. Check the Timeouts – is my request timeout too long ( > 60 seconds ), or too short ( <= 10 seconds )?
  2. Check Payload Structures – emulate the testing client’s payload structures in code and verify them.
  3. Check API Endpoint Env-and-Code Correctness – did I copy-paste the correct endpoints? For the correct environments ( e.g. LLE endpoints or ULE endpoints )? Is the code around those endpoints correct?
  4. Introduce Try-Catch Exception/Error Handling – can I introduce try-catch blocks and catch exceptions with a more refined logging posture?
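Points 1, 2, and 4 can be folded into one small wrapper. This is a hedged sketch with illustrative names (`post_ids`, the `{"ids": [...]}` payload shape are assumptions, not the real application’s contract): a bounded timeout, the payload built the same way the testing client sends it, and a try-catch that logs before re-raising:

```python
# Hedged sketch of an API-call wrapper: bounded timeout, explicit JSON
# payload mirroring the testing client, and a try/except that logs the
# failure instead of swallowing it. Names and payload shape are illustrative.
import json
import logging
import urllib.error
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dag.api")

def post_ids(url, ids, timeout=30):
    """POST a JSON payload with a sane timeout; log and re-raise on failure."""
    payload = json.dumps({"ids": ids}).encode()  # same shape as the test client
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            log.info("POST %s -> %s", url, resp.status)
            return resp.status
    except urllib.error.URLError as exc:
        log.error("POST %s failed: %s", url, exc)  # refined logging, not a bare pass
        raise
```

Pointing it at a dead local port fails fast and loudly, which is exactly the behavior we want surfaced in the Airflow task logs rather than discovered after a deploy.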

The Final Checklist

In the end, we conjured up 12 mental checks – taking about one hour max – to execute before the final product release 🙂 .

Steps that don’t just save time on one day, but across multiple days of feature work and delivery 🙂 !

  • Leverage minimal working examples
  • Focus on Data Flow Verification
  • Avoid Artifact Collision
  • Is the correct branch deployed?
  • Check Pipeline Operational Status
  • Leverage API Client Testing
  • Call /HEALTH
  • Token checks ( AuthN )
  • Check the Timeouts
  • Check the Payload Structures
  • Check the API Endpoint Correctness
  • Introduce Try-Catch Exception/Error handling

But How Do I Improve at Mental Checklisting?

Like all skills, mental checklisting and edge-case scenario handling take time to practice, but they can be refined. Here are some tips.

  1. Think of failure modes – what are you looking at that could fail? Should we check network protocols? Should we check whether the latest, correct software version was released?
  2. Don’t just think of code failures – we as developers are used to thinking about failures in code, but less so outside of it. Yet what surrounds the code – the configurations – often matters more.
  3. Trial & Error – the best way to get better at creating mental checklists is trial-and-error. With the plethora of components in play, it’s hard to think of every scenario up front. We learn, and we do better next time. That’s how it goes.