harisrid Tech News

Spanning across many domains – data, systems, algorithms, and personal

TOTW/13 – Don’t just return error codes on API methods. Please return thoughtfully-written error messages too.

In the beginning, there are just ERROR codes

Hi all,

I want to write a post inspired by real-life developer work from my time back at Capital One.

Alright, let’s imagine that junior engineer Andrea needs to write APIs for objects held in the Enterprise data layer. She’s been tasked with writing APIs so that internal developers on other teams can execute CRUD – create, read, update, and delete – operations on Enterprise objects. The APIs not only enhance scalability, but also allow for further processing and verification steps (e.g. ACLs, rate limiting, resource sharing).

When she initially wrote the APIs, she did think about error handling, but only through the lens of HTTP status codes (see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), such as the 400 Bad Request error. This error covers cases where the backend server cannot process a request due to a possible client-side error (e.g. a badly formatted or deceptive input).
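To make this concrete, here’s a minimal sketch of what Andrea’s v1 handler might look like. I’m assuming a Flask-style endpoint purely for illustration; the endpoint path and the is_well_formed / fetch_asset helpers are hypothetical stand-ins, not anything from the real system.

from flask import Flask, abort, jsonify

app = Flask(__name__)

def is_well_formed(asset_id):
    # Hypothetical validator stub - stands in for real payload checks.
    return asset_id.isalnum()

def fetch_asset(asset_id):
    # Hypothetical lookup stub - see the SQL sketch later in the post.
    return None

@app.route("/v1/assets/<asset_id>", methods=["GET"])
def get_asset(asset_id):
    if not is_well_formed(asset_id):
        abort(400)  # bare status code, no further explanation
    asset = fetch_asset(asset_id)
    if asset is None:
        abort(400)  # also a bare 4xx - this ambiguity is what bites Javier later
    return jsonify(asset), 200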

In the initial version, everything seems dandy and works fine. And in the context of a standalone web application with a single client and a single server, returning numeric status codes is (mostly) sufficient. But problems arise down the road.

Trouble brews months later!

Alright, so senior engineer Javier has to develop his team’s application, which operates on the same set of Enterprise objects. This means that his application behaves as a client and executes Andrea’s API calls, exposed as conventional HTTP /GET, /POST, /PUT, and /DELETE methods. While developing the client side, Javier recognizes that he needs to introduce error-handling paths (e.g. halting his program, or degrading gracefully and capturing bad payloads in a dead-letter queue) in the event of 4xx errors when executing /GET calls.

But there’s a gotcha.

He runs into a 4xx error, but it turns out that the client-side payload is actually well-formed. Everything is correct there. What’s going on? A 4xx should indicate a client-side issue, right? From Javier’s point of view: I clearly shouldn’t be failing a request – or even the processing of my events – if things are all good here.

Yes … and no

So it turns out that the payload is correct, BUT the Enterprise asset the payload operates on is stale. The assets themselves are out of date by a couple of months, and underneath the hood, that /GET call executes a SQL query to filter and retrieve an asset based on its unique primary key.
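For illustration, the lookup underneath the /GET might look roughly like this – the table and column names are made up, since the real schema wasn’t part of the story:

import sqlite3

def fetch_asset(asset_id):
    # Filter and retrieve a single asset by its unique primary key.
    db = sqlite3.connect("enterprise.db")  # hypothetical database handle
    row = db.execute(
        "SELECT * FROM enterprise_assets WHERE asset_id = ?",
        (asset_id,),
    ).fetchone()
    # Returns None when the record was never loaded, or has gone stale and
    # been purged - the query itself is "technically" correct either way.
    return row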

And the problem is that both the scaffolding – everything around the query – and the SQL query itself are “technically” correct. It’s actually an issue with the underlying database not being up to date (due to a laundry list of reasons I will avoid delving into 😛). Perhaps a database admin or another user forgot to purge old, stale records.

So tell me, what’s the hot-fix?

After some back-and-forth conversations, Javier and Andrea recognize that a new version of the APIs is needed in production. The returned 4xx is correct, but we need more information. We need a short, tweet-sized (~150-character) textual error response as well (and maybe even an additional API enum code). The response and code should reveal more information, such as the two failure modes below (a sketch of the enriched v2 response follows the list):

  • “Failure code <A>: Did not locate Enterprise objects; client-side payload is valid.”
  • “Failure code <B>: Located Enterprise objects; badly-formed client-side payload.”
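Concretely – and this is just a sketch under the same Flask-style assumptions as the v1 snippet, reusing its hypothetical is_well_formed / fetch_asset helpers, with made-up enum values – the v2 /GET handler might return something like:

from flask import Flask, jsonify

app = Flask(__name__)

FAILURE_CODE_A = "A"  # hypothetical enum: asset not found, payload valid
FAILURE_CODE_B = "B"  # hypothetical enum: asset found, payload malformed

@app.route("/v2/assets/<asset_id>", methods=["GET"])
def get_asset_v2(asset_id):
    if not is_well_formed(asset_id):
        return jsonify(
            failure_code=FAILURE_CODE_B,
            message="Located Enterprise objects; badly-formed client-side payload.",
        ), 400
    asset = fetch_asset(asset_id)
    if asset is None:
        return jsonify(
            failure_code=FAILURE_CODE_A,
            message="Did not locate Enterprise objects; client-side payload is valid.",
        ), 404  # still a 4xx, but now disambiguated by the body
    return jsonify(asset), 200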

Andrea wants to change the APIs, but changing APIs requires extensive approval and testing. As such, she arranges a meeting not only with Javier, but also with his team’s product owners and other senior engineers to generate “buy-in” and get approval to release the same /GET call in a version-two form that returns more information.

After this release, Javier’s project gets unblocked. He’s able to return to his client-side code and add the error-handling paths, keyed not only off the 4xx status code, but also off the textual error response and the additional API enum failure code. His failure-handling code resembles the following structure:

# Inside the event-processing loop; FAILURE_CODE_A / FAILURE_CODE_B mirror
# the v2 API's enum codes <A> and <B> described above.
if 400 <= response.status_code < 500:
    body = response.json()
    # Code <B>: Enterprise objects located, but the client-side payload is malformed.
    if body["failure_code"] == FAILURE_CODE_B:
        logger.error("Ran into malformed payload - delay processing of events.")
        logger.critical("Shutting down event stream processing.")
        return
    # Code <A>: the payload is valid, but the Enterprise asset is stale or missing.
    elif body["failure_code"] == FAILURE_CODE_A:
        logger.error(
            f"Ran into stale asset case with assetId={asset_id}. "
            f"Logging into DLQ {dlq_id} with name={dlq_name} and topic={dlq_topic}."
        )
        dlq_client.append_to_dlq(dlq_id, dlq_name, dlq_topic, asset_id)
        continue



The real-life story

In this scenario, I’m senior engineer Javier. I spent a couple of hours deep in debugging systems, triaging the root cause of failed Kafka stream events. I had to determine the cause of failure for events that stalled event stream processing – was the failure attributable to stale downstream assets (a server-side issue) or to malformed inputs (a client-side issue)? For the server-side issues, I introduced custom CloudWatch-level logging – with 90 days of retention – which enabled human-in-the-loop intervention on stale assets. As for client-side issues, I biased towards gracefully degrading and halting the event stream; I preferred notifying upstream producers as soon as possible that they had sent malformed payloads.
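For what it’s worth, setting that retention is a one-liner. A minimal sketch assuming boto3, with a hypothetical log group name:

import boto3

logs = boto3.client("logs")
# CloudWatch Logs accepts only specific retention values; 90 days is one of them.
logs.put_retention_policy(
    logGroupName="/app/enterprise-asset-errors",  # hypothetical log group
    retentionInDays=90,
)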

This logging had repercussions – I worked with an event stream processing over 1 million events/second, and too many stalls meant that events remained unprocessed and backed up. The probability of stale-asset events exceeded the probability of malformed client payloads (I don’t have a specific multiplier for how much, but intuition intimated it). By changing the error handling, I reduced the frequency of halting event stream processing, thus enabling the processing of more customer traffic.

The challenges I ran into

There were a good number of challenges I ran into with the design – let me review them:

  • Generating buy-in and consensus – I had to get other senior engineers, a staff engineer, and my two direct managers to agree on the architecture & design approaches. I set the agendas and led the meetings.
  • Communicating unexpected delays and blockers – #TODO
  • Spending time to look into alternatives to pre-empt future issues – #TODO