“Drill down”
In the beginning, there are just ERROR codes
Hi all,
I want to write a post inspired by real-life developer work from my time back at Capital One.
Alright, let’s imagine that junior engineer Andrea needs to write APIs for objects held in the Enterprise data layer. She’s been tasked with writing APIs so that internal developers on other teams can execute CRUD (create, read, update, and delete) operations on Enterprise objects. The APIs not only enhance scalability, but also allow for further processing and verification steps (e.g., ACLs, rate limiting, resource sharing).
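To make this concrete, here’s a minimal sketch of what such an API surface could look like. This is purely my illustration (Flask, an in-memory dict standing in for the Enterprise data layer, and made-up route names), not the actual API:

from flask import Flask, jsonify, request

app = Flask(__name__)
ASSETS = {}  # stand-in for the Enterprise data layer

@app.route("/assets", methods=["POST"])  # Create
def create_asset():
    body = request.get_json()
    ASSETS[body["asset_id"]] = body
    return jsonify(body), 201

@app.route("/assets/<asset_id>", methods=["GET"])  # Read
def read_asset(asset_id):
    asset = ASSETS.get(asset_id)
    if asset is None:
        # A blanket 400 on a missing asset; the ambiguity this causes is the
        # subject of the rest of this post.
        return jsonify(error="bad request"), 400
    return jsonify(asset), 200

@app.route("/assets/<asset_id>", methods=["PUT"])  # Update
def update_asset(asset_id):
    ASSETS[asset_id] = request.get_json()
    return jsonify(ASSETS[asset_id]), 200

@app.route("/assets/<asset_id>", methods=["DELETE"])  # Delete
def delete_asset(asset_id):
    ASSETS.pop(asset_id, None)
    return "", 204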
When she initially wrote the APIs, she did think about error handling, but only through the lens of HTTP status codes (see https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), such as the 400 Bad Request error. This error exists for cases where the backend server cannot process a request due to an apparent client-side error (e.g., badly formatted or deceptive inputs).
In the initial version, everything seems dandy and works fine. In the context of a standalone web application with a single client and a single server, returning numeric status codes is (mostly) sufficient. But problems arise down the road.
Trouble brews months later!
Alright, so senior engineer Javier has to develop his team’s application, which operates on the same set of Enterprise objects. This means his application behaves as a client and executes Andrea’s API calls, exposed as conventional HTTP GET, POST, PUT, and DELETE methods. While building the client side, Javier recognizes that he needs to introduce error-handling paths (e.g., halting his program, or degrading gracefully and capturing bad payloads in a dead-letter queue) in the event of 4xx errors when executing GET calls.
But there’s a gotcha.
He runs into a 4xx error, but it turns out that the client-side payload is actually well-formed. Everything is correct there. What’s going on? A 4xx should indicate a client-side issue, right? I clearly shouldn’t be failing a request, or even the processing of my events, if things are all good here.
Yes … and no
So it turns out that the payload is correct, BUT the Enterprise asset the payload operates on is stale. The assets themselves are out of date by a couple of months, and underneath the hood, that GET call executes a SQL query to filter and retrieve an asset based on its unique primary key.
And the problem is that both the scaffolding (everything around the query) and the SQL query itself are “technically” correct. The real issue is that the underlying database is not up to date (due to a laundry list of reasons I’ll avoid delving into 😛). Perhaps a database admin or another user forgot to purge old, stale records.
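For illustration, the lookup under the hood could resemble the following sketch (sqlite3 here, with hypothetical table and column names; the real system was a production database):

import sqlite3

def fetch_asset(conn: sqlite3.Connection, asset_id: str):
    # The query is "technically" correct: it filters on the unique primary
    # key and returns at most one row. But if the table was never refreshed,
    # a perfectly valid asset_id can still come back empty or months out of date.
    return conn.execute(
        "SELECT asset_id, payload, updated_at FROM enterprise_assets "
        "WHERE asset_id = ?",
        (asset_id,),
    ).fetchone()  # None when no record matches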
So tell me, what’s the hot-fix?
After some back-and-forth conversations, Javier and Andrea recognize that a new, modified version of the APIs is needed in production. Returning the 4xx is correct, but we need more information: a short, tweet-sized (150-character) textual error response as well, and maybe even an additional API-level enum code. The response and code should reveal more information, such as the following (a sketch of the new response shape follows this list):
- “Failure code <A>: Did not locate Enterprise objects; client-side payload is valid.”
- “Failure code <B>: Located Enterprise objects; badly formed client-side payload.”
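For concreteness, here’s one way the v2 response could be shaped. The enum values, field names, route, and helper stubs below are my own assumptions for illustration, not the actual API:

from enum import Enum
from flask import Flask, jsonify, request

app = Flask(__name__)

class FailureCode(str, Enum):
    A = "OBJECT_NOT_FOUND"    # valid payload, no matching Enterprise object
    B = "MALFORMED_PAYLOAD"   # Enterprise objects located, bad payload

def payload_is_well_formed(args) -> bool:  # stand-in validator
    return all(v.isprintable() for v in args.values())

def fetch_asset(asset_id):                 # stand-in lookup (always "stale" here)
    return None

@app.route("/v2/assets/<asset_id>", methods=["GET"])
def read_asset_v2(asset_id):
    if not payload_is_well_formed(request.args):
        return jsonify(
            failure_code=FailureCode.B,
            message="Located Enterprise objects; badly formed client-side payload.",
        ), 400
    asset = fetch_asset(asset_id)
    if asset is None:
        return jsonify(
            failure_code=FailureCode.A,
            message="Did not locate Enterprise objects; client-side payload is valid.",
        ), 400
    return jsonify(asset), 200

The same 400 status code goes out in both cases, but the client can now tell the two failures apart programmatically (via the failure code) or by eyeballing the message.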
Andrea wants to change the APIs, but changing APIs requires extensive approval and testing. As such, she arranges a meeting not only with Javier, but also with his team’s product owners and other senior engineers to generate buy-in and get approval to release the same GET call in a version-two form that returns more information.
After this release, Javier’s project gets unblocked. He returns to his client-side code and adds the error-handling paths as a function of not only the 4xx status code, but also the textual error response and the additional enum failure code. His failure-handling code resembles the following structure:
# Sketch in Python; `response` holds the result of the GET call, and the
# surrounding loop consumes events from the stream.
if 400 <= response.status_code < 500:
    body = response.json()
    if body["failure_code"] == FailureCode.B or body["message"].startswith("Located Enterprise objects"):
        logger.error("Ran into malformed payload - delaying processing of events.")
        logger.critical("Shutting down event stream processing.")
        return
    elif body["failure_code"] == FailureCode.A or body["message"].startswith("Did not locate Enterprise objects"):
        logger.error(
            f"Ran into stale asset case with assetId={asset_id}. Logging into "
            f"DLQ {dlq_id} with name={dlq_name} and topic={dlq_topic}."
        )
        dlq_client.append_to_dlq(dlq_id, dlq_name, dlq_topic, asset_id)
        continue
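For reference, dlq_client above could be a thin wrapper around a Kafka producer. A minimal sketch with kafka-python (all names are illustrative):

import json
from kafka import KafkaProducer

class DLQClient:
    """Republishes bad events to a dead-letter topic for later human review."""

    def __init__(self, bootstrap_servers: str):
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def append_to_dlq(self, dlq_id, dlq_name, dlq_topic, asset_id):
        # Record enough metadata to triage the stale asset without blocking
        # the main event stream.
        self.producer.send(dlq_topic, {
            "dlq_id": dlq_id,
            "dlq_name": dlq_name,
            "asset_id": asset_id,
        })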
The real-life story
In this scenario, I’m senior engineer Javier. I spent a couple of hours deep in debugging systems, triaging the root cause of failed Kafka stream events. I had to determine the cause of failure for events that stalled event stream processing: was the failure attributable to the staleness of downstream assets (a server-side issue) or to malformed inputs (a client-side issue)? In both scenarios, I introduced custom CloudWatch logging, with 90 days of retention. To address the server-side issues, this logging enabled human-in-the-loop intervention on stale assets. As for client-side issues, I biased towards graceful degradation and halting the event stream; I preferred notifying upstream producers as soon as possible that they had sent malformed payloads.
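Setting up that retention policy is a one-time call. A minimal sketch with boto3 (the log group name is hypothetical):

import boto3

logs = boto3.client("logs")
log_group = "/enterprise/stale-asset-triage"  # illustrative name

try:
    logs.create_log_group(logGroupName=log_group)
except logs.exceptions.ResourceAlreadyExistsException:
    pass  # the group already exists; retention can still be (re)applied

logs.put_retention_policy(logGroupName=log_group, retentionInDays=90)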
These changes had repercussions: I worked with an event stream processing over 1 million events/second, and too many stalls meant that events remained unprocessed and backed up. Stale-asset events were far more likely than malformed client payloads (I don’t have a specific multiplier for how much, but intuition intimated it). By changing the error handling, I reduced the frequency of halting event stream processing, thus enabling the processing of more customer traffic.
The Challenges I ran into
There were a good number of challenges I ran into with the design. Let me review them:
- Generating buy-in and consensus – I had to get other senior engineers, a staff engineer, and my two direct managers to agree on the architecture & design approaches. I set the agendas and led the meetings.
- Communicating unexpected delays and blockers – #TODO
- Spending time to look into alternatives to pre-empt future issues – #TODO
